Case Study: Predicting a Cancer Cell Discovery

How causal composition across 50 million claims predicted that CD28 upregulates PD-L1 — before the paper was published.

The discovery

In December 2024, a team published in Cancer Cell that CD28, previously known only as a T-cell co-stimulatory receptor, is expressed inside cancer cells, where it binds and stabilizes PD-L1 mRNA, driving immune evasion. The finding was unexpected: none of the databases we ingested recorded CD28 upregulating PD-L1 in cancer cells.

The paper: "Inhibiting intracellular CD28 in cancer cells enhances antitumor immunity and overcomes anti-PD-1 resistance via targeting PD-L1" — Cancer Cell, December 2024 (PMID: 39672166)

What Attest predicted

Our reference database contains 49,926,718 claims from 30+ public databases (STRING, CTD, Reactome, DrugBank, DisGeNET, PrimeKG, and more), all ingested before this paper was published.

Running db.predict("gene_940") (CD28) with causal composition:

CD28 --[upregulates]--> CD274 (PD-L1)

  12 supporting paths via 12 independent intermediaries
  1 opposing path
  Consensus: 92%

  Evidence (co-regulation through shared chemical responses):
    CD28 -[upregulates]-> Olaparib -[upregulates]-> PD-L1
    CD28 -[downregulates]-> Nifedipine -[downregulates]-> PD-L1
    CD28 -[upregulates]-> Phorbol 12-myristate 13-acetate -[upregulates]-> PD-L1
    CD28 -[upregulates]-> Lipopolysaccharide -[upregulates]-> PD-L1
    ... 8 more independent paths

  Query time: 2,723ms on 50M claims

No LLM was involved. No text was read. The prediction emerged purely from the structure of the knowledge graph: 12 independent chemical intermediaries, against 1 opposing path, agree that agents which upregulate CD28 also upregulate PD-L1.

How causal composition works

Traditional knowledge graphs store facts: "Gene A interacts with Gene B." Attest stores claims with causal predicates: "Compound X upregulates Gene A" (from CTD), "Compound X upregulates Gene B" (from CTD).

Causal composition follows these directed edges through intermediaries and applies biochemical logic:

| Hop 1 | Hop 2 | Composed |
|---|---|---|
| upregulates | upregulates | upregulates |
| downregulates | downregulates | upregulates (double negative) |
| upregulates | downregulates | downregulates |
| inhibits | inhibits | activates (double negative) |

When multiple independent intermediaries agree on the composed direction, the prediction has convergent evidence. CD28 → PD-L1 had 12:1 consensus — 12 independent compounds agree on upregulation, only 1 suggests downregulation.
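The composition rule in the table amounts to multiplying signs along a path and then counting how many intermediaries agree. The sketch below illustrates that idea only; it is not Attest's implementation. The names `SIGN`, `compose`, and `consensus` are hypothetical, and two simplifications are assumed: activates/inhibits carry the same signs as upregulates/downregulates, and the composed result is always reported as upregulates or downregulates.

```python
from collections import Counter

# Assumed sign convention: +1 for increasing predicates, -1 for decreasing ones.
SIGN = {"upregulates": 1, "activates": 1, "downregulates": -1, "inhibits": -1}


def compose(hop1: str, hop2: str) -> str:
    """Compose two causal predicates by multiplying their signs."""
    return "upregulates" if SIGN[hop1] * SIGN[hop2] > 0 else "downregulates"


def consensus(paths: list[tuple[str, str]]) -> tuple[str, float]:
    """Return the majority composed direction and the share of paths backing it."""
    votes = Counter(compose(h1, h2) for h1, h2 in paths)
    direction, count = votes.most_common(1)[0]
    return direction, count / len(paths)


# 12 agreeing paths and 1 opposing path, mirroring the CD28 example counts.
paths = [("upregulates", "upregulates")] * 11
paths.append(("downregulates", "downregulates"))  # double negative, still agrees
paths.append(("upregulates", "downregulates"))    # the single opposing path
direction, share = consensus(paths)
print(direction, f"{share:.0%}")  # prints: upregulates 92%
```

Counting each intermediary as one vote is the simplest reading of "12:1 consensus"; a production system might weight paths by source quality instead.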

Precision: 71% on validated ground truth

We validated predictions for TP53 against published literature (7 gene targets with known ground truth):

| Prediction | Literature | Verdict |
|---|---|---|
| TP53 → CDKN1A (upregulates) | Textbook biology | Correct |
| TP53 → BAX (upregulates) | Textbook biology | Correct |
| TP53 → IL6 (upregulates) | Known p53 target | Correct |
| TP53 → CCN2 (upregulates) | JCI 2011, mechanism known | Correct |
| TP53 → DUSP10 (upregulates) | IJMS 2019, CRC genotoxic stress | Correct |
| TP53 → THBD (upregulates) | p53 actually represses THBD | Wrong direction |
| TP53 → BMP2 (upregulates) | Context-dependent, leaning wrong | Wrong direction |

5/7 correct = 71% precision. False positives occur when the same gene has opposite effects in different tissues (p53 activates some targets but represses others depending on cellular context). We filter the worst offenders using contradictory leg detection: when the source gene both upregulates and downregulates the same intermediary, that intermediary is context-dependent and excluded from predictions.
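Contradictory leg detection can be pictured as a grouping step over the source gene's first-hop edges. The following is a minimal sketch under assumed data shapes (plain `(intermediary, predicate)` tuples); the function name and the example edges are hypothetical and do not come from the Attest codebase.

```python
from collections import defaultdict


def contradictory_intermediaries(first_hops):
    """Find intermediaries the source both upregulates and downregulates.

    first_hops: iterable of (intermediary, predicate) pairs for one source gene.
    """
    directions = defaultdict(set)
    for intermediary, predicate in first_hops:
        directions[intermediary].add(predicate)
    return {
        name
        for name, preds in directions.items()
        if {"upregulates", "downregulates"} <= preds
    }


# Made-up edges for illustration only.
first_hops = [
    ("compound_A", "upregulates"),
    ("compound_A", "downregulates"),  # contradictory leg
    ("compound_B", "upregulates"),
]
excluded = contradictory_intermediaries(first_hops)
# Keep only the legs whose intermediary is not context-dependent.
usable = [(m, p) for m, p in first_hops if m not in excluded]
print(excluded, usable)
```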

Retrospective validation: 3/5 published papers predicted

We tested whether the graph could predict findings from recent high-impact papers, using only data that predates each publication:

| Paper | Finding | Attest prediction | Verdict |
|---|---|---|---|
| Cancer Cell 2024 | CD28 upregulates PD-L1 | upregulates (12:1) | Correct |
| Nat Commun 2025 | KRAS downregulates BRCA1 | downregulates (19:17) | Correct |
| Nat Commun 2024 | PRMT5 activates FUS | upregulates (25:17) | Correct |
| Cell Death Differ 2025 | DYRK2 inhibits USP28 | downregulates (3:2) | Close |
| Nature 2022 | ADAR1 inhibits ZBP1 | upregulates (9:0) | Wrong |

The ADAR1 error is instructive: ADAR1 and ZBP1 are both interferon-stimulated genes (co-upregulated by the same stimuli), but ADAR1 actually inhibits ZBP1 post-transcriptionally via RNA editing. Co-regulation evidence cannot capture post-transcriptional inhibition.

After loading SemMedDB (35.8M literature-extracted predications from PubMed), the evidence quality stack now catches this error automatically:

  • ADAR1→ZBP1: only 2 SemMedDB claims say "activates" → below minimum evidence threshold (3) → filtered as "insufficient"
  • TP53→CDKN1A: 124 SemMedDB claims say "activates", 10 say "inhibits" → 93% directional confidence → "strong"
  • TP53→THBD: only 1 SemMedDB claim → filtered as "insufficient"

The principle: NLP-extracted predications have ~30% directional error rate. With 1-2 claims, errors dominate. With 100+, the signal crushes the noise. directional_confidence() requires a minimum of 3 independent sources before trusting any directional prediction.
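The thresholding logic described above can be sketched in a few lines. This is an illustration rather than the actual `directional_confidence()` implementation: the function name, the `strong_at` cutoff, and the "weak" label are hypothetical, and applying the minimum of 3 to the total claim count is an assumption.

```python
MIN_EVIDENCE = 3  # minimum evidence count stated in the text


def directional_confidence_sketch(activates: int, inhibits: int, strong_at: float = 0.9):
    """Return (label, confidence) for a pair of opposing claim counts.

    strong_at is a hypothetical cutoff; the source does not state one.
    """
    total = activates + inhibits
    if total < MIN_EVIDENCE:
        # Too few claims: extraction errors could dominate, so refuse to call it.
        return "insufficient", None
    confidence = max(activates, inhibits) / total
    return ("strong" if confidence >= strong_at else "weak"), confidence


# Claim counts taken from the examples in the text.
label, conf = directional_confidence_sketch(124, 10)  # TP53 -> CDKN1A
print(label, f"{conf:.0%}")  # prints: strong 93%
print(directional_confidence_sketch(1, 0))  # TP53 -> THBD: ('insufficient', None)
```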

One-call API

from attestdb import AttestDB

db = AttestDB.open_read_only("reference.attest")

# Discover novel predictions for any entity
predictions = db.predict("gene_940")  # CD28
for p in predictions[:5]:
    print(f"{p.predicted_predicate} -> {p.target}")
    print(f"  {p.supporting_paths} supporting, {p.opposing_paths} opposing")
    print(f"  gap: {p.is_gap}, consensus: {p.consensus:.0%}")

# Test a specific hypothesis
verdict = db.what_if(
    ("gene_940", "gene"),         # CD28
    ("upregulates", "relation"),
    ("gene_29126", "gene"),       # CD274 (PD-L1)
)
print(verdict.verdict)  # "plausible"
print(verdict.explanation)  # "12 causal path(s) supporting"

Available as MCP tools (attest_predict, attest_what_if) for AI-native workflows. 77 MCP tools total.

Novel prediction validation: 5/8 confirmed

We ran db.predict("gene_7157") (TP53) and validated the top 8 predictions — relationships with zero direct claims in the database, predicted purely from causal composition through 12-16 independent intermediaries:

| Prediction | Paths | Literature | Verdict |
|---|---|---|---|
| TP53 → TYMS | 12 | p53 represses TYMS promoter by >95% (1997) | Textbook |
| TP53 → EIF4EBP1 | 12 | p53→AMPK→mTOR→4E-BP1 axis | Textbook |
| TP53 → VIM | 13 | p53 suppresses vimentin via miR-200c | Textbook |
| TP53 → GJA1 | 13 | Mutant p53 degrades Connexin 43 (2022) | Emerging |
| TP53 → SATB1 | 12 | p53 binds SATB1 promoter (2024) | Emerging |
| TP53 → PDHX | 16 | Indirect via PDK2, not PDHX directly | Indirect |
| TP53 → WIF1 | 12 | Plausible (p53 antagonizes Wnt), no direct evidence | Novel |
| TP53 → PFN1 | 16 | Known PFN1→p53, reverse direction not published | Novel |

5/8 confirmed, 0 contradicted. TYMS is the standout: a textbook p53 target (discovered 1997, extensively validated) that had zero directional claims in our 85M-claim database — yet 12 independent causal composition paths recovered it in 2.2 seconds.

Multi-gene validation: 8/17 confirmed across 4 genes

We ran predict() on EGFR, BRCA1, and KRAS in addition to TP53, generating 1,149 predictions across 4 genes in under 2 minutes total, and validated the top predictions from each:

| Gene | Prediction | Paths | Evidence | Verdict |
|---|---|---|---|---|
| KRAS | → SUZ12 | 16 | Cancer Cell 2016: PRC2 barrier to KRAS-driven EMT | Confirmed |
| KRAS | → GNPDA1 | 18 | KRAS drives hexosamine pathway (Cell 2012) | Confirmed |
| EGFR | → CCNB2 | 16 | EGFR signaling drives G2/M cyclins | Confirmed |
| EGFR | → MCOLN1 | 21 | EGFR→mTOR→TRPML1 lysosomal axis | Plausible |
| BRCA1 | → STUB1 | 20 | Both E3 ligases in breast cancer UPS | Plausible |
| KRAS | → CSRP1 | 20 | No direct evidence; CSRP2 has MAPK links | Novel |
| BRCA1 | → CSRP1 | 19 | No direct evidence; same target, independent gene | Novel |

8/17 confirmed, 0 contradicted (47% precision). CSRP1 is the most interesting novel finding: it was independently predicted by both KRAS (20 paths) and BRCA1 (19 paths) through different intermediaries. The BRCA1→CSRP1 prediction was computationally validated: anticorrelation in TCGA (n=1,218, ρ=−0.42, p=10^−52) and independently replicated in METABRIC (n=1,980, ρ=−0.22, p=10^−24).
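For readers who want to run a similar anticorrelation check on their own expression data, here is a small pure-Python sketch of a rank correlation. It assumes the reported ρ is Spearman's coefficient, which the text does not state, and it uses toy numbers rather than TCGA or METABRIC values; the function names are illustrative.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy expression values, not TCGA or METABRIC data.
gene_a = [1.0, 2.5, 3.1, 4.8, 5.2]
gene_b = [9.0, 7.5, 8.0, 3.2, 1.1]
rho = spearman_rho(gene_a, gene_b)
print(f"rho = {rho:.2f}")  # prints: rho = -0.90
```

The sketch does not compute a p-value and does not guard against constant inputs; a statistics library would normally handle both.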

By the numbers

85.7M claims in reference DB
13.2M entities
30+ source databases + SemMedDB
1.7s prediction query (TP53)
14.1% holdout recall (4,340× random)
0 LLM calls required