Case Study: Predicting a Cancer Cell Discovery

How causal composition across 50 million claims predicted that CD28 upregulates PD-L1 - before the paper was published.

The discovery

In December 2024, a team published in Cancer Cell that CD28 - previously known only as a T-cell co-stimulatory receptor - is expressed inside cancer cells, where it binds and stabilizes PD-L1 mRNA, driving immune evasion. The finding was unexpected: none of the databases we ingested contained a claim that CD28 upregulates PD-L1 in cancer cells.

The paper: "Inhibiting intracellular CD28 in cancer cells enhances antitumor immunity and overcomes anti-PD-1 resistance via targeting PD-L1" - Cancer Cell, December 2024 (PMID: 39672166)

What Attest predicted

Our reference database contains 49,926,718 claims from 30+ public databases (STRING, CTD, Reactome, DrugBank, DisGeNET, PrimeKG, and more), all ingested before this paper was published.

Running db.predict("gene_940") (CD28) with causal composition:

CD28 --[upregulates]--> CD274 (PD-L1)

  12 supporting paths via 12 independent intermediaries
  1 opposing path
  Consensus: 92%

  Evidence (co-regulation through shared chemical responses):
    CD28 -[upregulates]-> Olaparib -[upregulates]-> PD-L1
    CD28 -[downregulates]-> Nifedipine -[downregulates]-> PD-L1
    CD28 -[upregulates]-> Phorbol 12-myristate 13-acetate -[upregulates]-> PD-L1
    CD28 -[upregulates]-> Lipopolysaccharide -[upregulates]-> PD-L1
    ... 8 more independent paths

  Query time: 2,723ms on 50M claims

No LLM was involved. No text was read. The prediction emerged purely from the structure of the knowledge graph: across 12 independent chemical intermediaries, CD28 and PD-L1 move in the same direction (both up or both down), and only 1 intermediary disagrees.

How causal composition works

Traditional knowledge graphs store facts: "Gene A interacts with Gene B." Attest stores claims with causal predicates: "Compound X upregulates Gene A" (from CTD), "Compound X upregulates Gene B" (from CTD).

Causal composition follows these directed edges through intermediaries and applies biochemical logic:

Hop 1	Hop 2	Composed
upregulates	upregulates	upregulates
downregulates	downregulates	upregulates (double negative)
upregulates	downregulates	downregulates
inhibits	inhibits	activates (double negative)

When multiple independent intermediaries agree on the composed direction, the prediction has convergent evidence. CD28 → PD-L1 had 12:1 consensus - 12 independent compounds agree on upregulation, only 1 suggests downregulation.
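The composition table and the consensus count can be sketched in a few lines of Python. This is a minimal illustration, not Attest's implementation: the names COMPOSE, compose and consensus are hypothetical, pairs missing from the table are treated as unknown rather than guessed, and the hop predicates for the eight unlisted supporting paths and for the single opposing path are assumed for the sake of the example.

```python
from collections import Counter

# Composition rules copied from the table above. Pairs the table does not
# list are treated as unknown (None) in this sketch rather than guessed.
COMPOSE = {
    ("upregulates", "upregulates"): "upregulates",
    ("downregulates", "downregulates"): "upregulates",
    ("upregulates", "downregulates"): "downregulates",
    ("inhibits", "inhibits"): "activates",
}


def compose(hop1, hop2):
    """Return the composed predicate, or None if the pair is not in the table."""
    return COMPOSE.get((hop1, hop2))


def consensus(paths):
    """Tally composed directions over (hop1, hop2) paths.

    Returns (winning_predicate, supporting, opposing, consensus_fraction).
    """
    votes = Counter(compose(h1, h2) for h1, h2 in paths)
    votes.pop(None, None)  # drop paths whose composition is unknown
    if not votes:
        return None, 0, 0, 0.0
    winner, supporting = votes.most_common(1)[0]
    opposing = sum(votes.values()) - supporting
    return winner, supporting, opposing, supporting / (supporting + opposing)


# Illustrative input only (assumed predicates): 12 paths that compose to
# "upregulates" and 1 path that composes to "downregulates".
paths = (
    [("upregulates", "upregulates")] * 11
    + [("downregulates", "downregulates")]
    + [("upregulates", "downregulates")]
)
predicate, supporting, opposing, fraction = consensus(paths)
print(f"{predicate}: {supporting} supporting, {opposing} opposing, {fraction:.0%}")
# upregulates: 12 supporting, 1 opposing, 92%
```

With 12 supporting and 1 opposing path the fraction is 12/13, which rounds to the 92% consensus reported for CD28 → PD-L1.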

Precision: 71% on validated ground truth

We validated predictions for TP53 against published literature (7 gene targets with known ground truth):

Prediction	Literature	Verdict
TP53 → CDKN1A (upregulates)	Textbook biology	Correct
TP53 → BAX (upregulates)	Textbook biology	Correct
TP53 → IL6 (upregulates)	Known p53 target	Correct
TP53 → CCN2 (upregulates)	JCI 2011, mechanism known	Correct
TP53 → DUSP10 (upregulates)	IJMS 2019, CRC genotoxic stress	Correct
TP53 → THBD (upregulates)	p53 actually represses THBD	Wrong direction
TP53 → BMP2 (upregulates)	Context-dependent, leaning wrong	Wrong direction

5/7 correct = 71% precision. False positives occur when the same gene has opposite effects in different tissues (p53 activates some targets but represses others depending on cellular context). We filter the worst offenders using contradictory leg detection: when the source gene both upregulates and downregulates the same intermediary, that intermediary is context-dependent and excluded from predictions.
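Contradictory leg detection, as described above, amounts to grouping a gene's first-hop claims by intermediary and dropping any intermediary that carries both directions. The sketch below is one possible reading of that rule, not the production filter; the function names and the placeholder intermediaries (compound_A and so on) are hypothetical and carry no biological meaning.

```python
from collections import defaultdict

OPPOSITES = {"upregulates", "downregulates"}


def contradictory_intermediaries(first_hops):
    """Return intermediaries that one source gene both up- and downregulates.

    first_hops: iterable of (predicate, intermediary) claims for a single
    source gene.
    """
    predicates_by_intermediary = defaultdict(set)
    for predicate, intermediary in first_hops:
        predicates_by_intermediary[intermediary].add(predicate)
    return {
        name
        for name, predicates in predicates_by_intermediary.items()
        if OPPOSITES <= predicates  # both directions were claimed
    }


def usable_first_hops(first_hops, excluded):
    """Keep only first-hop claims whose intermediary was not excluded."""
    return [(p, i) for p, i in first_hops if i not in excluded]


# Placeholder claims for a single source gene (hypothetical data).
first_hops = [
    ("upregulates", "compound_A"),
    ("downregulates", "compound_A"),  # contradicts the claim above
    ("upregulates", "compound_B"),
    ("downregulates", "compound_C"),
]
excluded = contradictory_intermediaries(first_hops)
kept = usable_first_hops(first_hops, excluded)
print(excluded)  # {'compound_A'}
print(kept)      # [('upregulates', 'compound_B'), ('downregulates', 'compound_C')]
```

Excluding the whole intermediary, rather than picking the majority direction, is the conservative choice: it trades recall for fewer context-dependent false positives.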

Retrospective validation: 3/5 published papers predicted

We tested whether the graph could predict findings from recent high-impact papers, using only data that predates each publication:

Paper	Finding	Attest prediction	Verdict
Cancer Cell 2024	CD28 upregulates PD-L1	upregulates (12:1)	Correct
Nat Commun 2025	KRAS downregulates BRCA1	downregulates (19:17)	Correct
Nat Commun 2024	PRMT5 activates FUS	upregulates (25:17)	Correct
Cell Death Differ 2025	DYRK2 inhibits USP28	downregulates (3:2)	Close
Nature 2022	ADAR1 inhibits ZBP1	upregulates (9:0)	Wrong

The ADAR1 error is instructive: ADAR1 and ZBP1 are both interferon-stimulated genes (co-upregulated by the same stimuli), but ADAR1 actually inhibits ZBP1 post-transcriptionally via RNA editing. Co-regulation evidence cannot capture post-transcriptional inhibition.

After loading SemMedDB (35.8M literature-extracted predications from PubMed), the evidence quality stack now catches this error automatically:

ADAR1→ZBP1: only 2 SemMedDB claims say "activates" → below minimum evidence threshold (3) → filtered as "insufficient"
TP53→CDKN1A: 124 SemMedDB claims say "activates", 10 say "inhibits" → 93% directional confidence → "strong"
TP53→THBD: only 1 SemMedDB claim → filtered as "insufficient"
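The three cases above follow from a simple count-and-threshold rule, sketched here under stated assumptions. The minimum of 3 claims comes from the text; the 0.9 cutoff for "strong", the "weak" label, and the function name classify_direction are hypothetical, and the opposing-claim counts for the two filtered pairs are not given in the text, so they are set to 0.

```python
MIN_EVIDENCE = 3      # minimum evidence threshold stated in the text
STRONG_CUTOFF = 0.9   # assumption: the actual cutoff is not given in the source


def classify_direction(activates, inhibits):
    """Classify a gene pair from its counts of directional claims.

    Returns (label, confidence); confidence is None when there are too few
    claims to trust any direction.
    """
    total = activates + inhibits
    if total < MIN_EVIDENCE:
        return "insufficient", None
    confidence = max(activates, inhibits) / total
    label = "strong" if confidence >= STRONG_CUTOFF else "weak"
    return label, confidence


# Counts from the three examples above (opposing counts for the two
# filtered pairs are assumed to be 0).
print(classify_direction(2, 0))     # ADAR1→ZBP1:  ('insufficient', None)
print(classify_direction(1, 0))     # TP53→THBD:   ('insufficient', None)
label, confidence = classify_direction(124, 10)
print(label, f"{confidence:.0%}")   # TP53→CDKN1A: strong 93%
```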

The principle: NLP-extracted predications have a directional error rate of roughly 30%. With only 1-2 claims, a single extraction error can flip the call; with 100+ claims, the majority direction is reliable. directional_confidence() requires a minimum of 3 independent sources before trusting any directional prediction.
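A short binomial calculation shows why the claim count matters. It assumes every claim is wrong independently with probability 0.3, which is an idealization of the ~30% figure, and asks how often the claims fail to produce a strict correct majority. The function name is hypothetical.

```python
from math import comb


def no_correct_majority_probability(n, error_rate=0.3):
    """P(wrong claims >= correct claims) among n independent claims.

    Each claim is assumed to be wrong independently with `error_rate`.
    """
    return sum(
        comb(n, k) * error_rate**k * (1 - error_rate) ** (n - k)
        for k in range(n + 1)
        if 2 * k >= n  # wrong claims tie or outnumber correct ones
    )


for n in (1, 2, 3, 10, 100):
    print(n, f"{no_correct_majority_probability(n):.4f}")
```

Under these assumptions the failure probability is 0.30 for one claim, 0.51 for two (a 1:1 tie counts as a failure), and 0.216 for three, then falls rapidly as claims accumulate.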

One-call API

from attestdb import AttestDB

db = AttestDB.open_read_only("reference.attest")

# Discover novel predictions for any entity
predictions = db.predict("gene_940")  # CD28
for p in predictions[:5]:
    print(f"{p.predicted_predicate} -> {p.target}")
    print(f"  {p.supporting_paths} supporting, {p.opposing_paths} opposing")
    print(f"  gap: {p.is_gap}, consensus: {p.consensus:.0%}")

# Test a specific hypothesis
verdict = db.what_if(
    ("gene_940", "gene"),
    ("upregulates", "relation"),
    ("gene_29126", "gene"),  # CD274 (PD-L1)
)
print(verdict.verdict)  # "plausible"
print(verdict.explanation)  # "12 causal path(s) supporting"

Available as MCP tools (attest_predict, attest_what_if) for AI-native workflows. 106 MCP tools total.

Novel prediction validation: 5/8 confirmed

We ran db.predict("gene_7157") (TP53) and validated the top 8 predictions - relationships with zero direct claims in the database, predicted purely from causal composition through 12-16 independent intermediaries:

Prediction	Paths	Literature	Verdict
TP53 → TYMS	12	p53 represses TYMS promoter by >95% (1997)	Textbook
TP53 → EIF4EBP1	12	p53→AMPK→mTOR→4E-BP1 axis	Textbook
TP53 → VIM	13	p53 suppresses vimentin via miR-200c	Textbook
TP53 → GJA1	13	Mutant p53 degrades Connexin 43 (2022)	Emerging
TP53 → SATB1	12	p53 binds SATB1 promoter (2024)	Emerging
TP53 → PDHX	16	Indirect via PDK2, not PDHX directly	Indirect
TP53 → WIF1	12	Plausible (p53 antagonizes Wnt), no direct evidence	Novel
TP53 → PFN1	16	Known PFN1→p53, reverse direction not published	Novel

5/8 confirmed, 0 contradicted. TYMS is the standout: a textbook p53 target (discovered 1997, extensively validated) that had zero directional claims in our 85M-claim database - yet 12 independent causal composition paths recovered it in 2.2 seconds.

Multi-gene validation: 8/17 confirmed across 4 genes

We ran predict() on EGFR, BRCA1, and KRAS in addition to TP53, generating 1,149 predictions across 4 genes in under 2 minutes total. We then validated the top predictions for each gene:

Gene	Prediction	Paths	Evidence	Verdict
KRAS	→ SUZ12	16	Cancer Cell 2016 - PRC2 barrier to KRAS-driven EMT	Confirmed
KRAS	→ GNPDA1	18	KRAS drives hexosamine pathway (Cell 2012)	Confirmed
EGFR	→ CCNB2	16	EGFR signaling drives G2/M cyclins	Confirmed
EGFR	→ MCOLN1	21	EGFR→mTOR→TRPML1 lysosomal axis	Plausible
BRCA1	→ STUB1	20	Both E3 ligases in breast cancer UPS	Plausible
KRAS	→ CSRP1	20	No direct evidence; CSRP2 has MAPK links	Novel
BRCA1	→ CSRP1	19	No direct evidence - same target, independent gene	Novel

8/17 confirmed, 0 contradicted (47% precision). CSRP1 is the most interesting novel finding - independently predicted by both KRAS (20 paths) and BRCA1 (19 paths) through different intermediaries. The BRCA1→CSRP1 prediction was computationally validated: anticorrelation in TCGA (n=1,218, ρ=−0.42, p=10⁻⁵²) and independently replicated in METABRIC (n=1,980, ρ=−0.22, p=10⁻²⁴).
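The ρ values above usually denote Spearman's rank correlation; whether the original analysis used exactly that statistic is an assumption here. The sketch below computes it in pure Python with tie-aware ranks. The input numbers are placeholders chosen to exercise the function, not TCGA or METABRIC expression data.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values only (not real expression data).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [5.0, 4.0, 3.0, 2.0, 1.0]
rho = spearman_rho(gene_a, gene_b)
print(round(rho, 3))  # -1.0, a perfect anticorrelation
```

A negative ρ, as reported for BRCA1 and CSRP1, means samples with higher expression of one gene tend to have lower expression of the other; it supports an association but does not by itself establish the causal direction.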

By the numbers

85.7M claims in reference DB

13.2M entities

30+ source databases + SemMedDB

1.7s prediction query (TP53)

14.1% holdout recall (4,340× random)

0 LLM calls required