I. The Invisible Load-Bearing Wall
An eleven-fold increase in mortality risk. That is the headline number from a recent systematic review of COVID-19-induced acute pancreatitis: patients with an AST-to-ALT ratio of 2 or greater had an adjusted hazard ratio of 11.052 (95% CI 1.441–84.770, p = 0.021) for death. Published in a peer-reviewed journal, reported with the standard apparatus of Cox regression, confidence intervals, and p-values. The kind of finding that gets cited.
But trace its dependencies. The adjusted hazard ratio comes from a multivariate Cox model that controlled for age and gender — and nothing else. Not disease severity. Not comorbidities. Not treatment variation across the 87 case reports from which the 111-patient cohort was assembled. The model had 11 deaths to work with, giving it a confidence interval that spans a 60-fold range. The true effect could be a modest 44% increase or an implausible 8,400% increase. The statistical method is legitimate; the precision is not.
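The arithmetic behind that paragraph can be checked directly from the reported figures. The sketch below assumes the interval is a standard Wald interval, symmetric on the log scale (an assumption, not something the text states), and recovers the implied standard error and p-value from the reported bounds.

```python
import math

# Figures as reported for AST/ALT ratio >= 2.
hr, ci_low, ci_high = 11.052, 1.441, 84.770

# Width of the interval as a ratio, and the percent increase at each bound.
fold_range = ci_high / ci_low
pct_low = (ci_low - 1) * 100
pct_high = (ci_high - 1) * 100

# Assumption: Wald interval on the log scale, so
# SE(log HR) = (ln(upper) - ln(lower)) / (2 * 1.96).
se_log_hr = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
z = math.log(hr) / se_log_hr
p_two_sided = math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

print(f"interval spans a {fold_range:.1f}-fold range")
print(f"hazard increase between {pct_low:.0f}% and {pct_high:.0f}%")
print(f"implied two-sided p = {p_two_sided:.3f}")
```

Under that assumption the reported bounds, point estimate, and p-value are mutually consistent: the roughly 59-fold interval width follows from the small event count, not from any reporting inconsistency.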
Go deeper. The entire analysis rests on a definitional claim: that these 111 patients had pancreatitis caused by COVID-19, not pancreatitis that merely coincided with COVID-19 infection. The study defines this by exclusion — ruling out biliary, alcoholic, and iatrogenic etiologies — but acknowledges that “comprehensive assessment of AP etiology was constrained by the limitations of the published literature.” If the inclusion criteria misclassify even a fraction of the cohort, the patient population is contaminated and every downstream finding is unreliable.
This is not a critique of Chen et al. The study is competently executed, the limitations are disclosed, and the findings are genuinely informative for a rare clinical presentation. The point is structural. A headline finding that reads as a single dramatic claim is actually the terminal node of a dependency chain — inclusion criteria → patient population → statistical methodology → univariate screening → multivariate adjustment → synthesis. Pull the foundational nodes and the terminal claim collapses. But you cannot see this chain by reading the paper in prose. The dependencies are there, but they are invisible — load-bearing walls hidden behind drywall.
Every research paper has this architecture. The question is whether we can make it explicit, machine-readable, and maintainable as evidence evolves. That is what Scholion does.
Decomposition Pipeline · prose → typed dependency graph
Source text
"Multivariate analysis demonstrated that WBC count, AST-to-ALT ratio, and surgical intervention remained independently correlated with survival."
— Chen et al. (2025), Results Discussion
mort.6 · Elevated WBC count is an independent predictor of mortality after adjustment for age and gender
HR 1.013 · CI 1.000–1.025 · p = .042
mort.7 · AST-to-ALT ratio ≥ 2 is an independent predictor of mortality after adjustment for age and gender
HR 11.052 · CI 1.441–84.770 · p = .021
mort.8 · Surgical intervention is an independent predictor of mortality after adjustment for age and gender
HR 6.604 · CI 1.581–27.593 · p = .010
One sentence → three individually falsifiable claims, each with its own statistical evidence
mort.8 · Surgical intervention is an independent predictor of mortality
Warrant (implicit): Patients requiring surgery represent the most physiologically compromised subgroup.
Qualifier: Based on 7 surgical patients (3 deaths). Adjusted for age and gender only, not disease severity.
Rebuttal: Confounding by indication: surgery is performed on the sickest patients, so the association reflects severity, not a causal effect of surgery on death.
The paper calls surgery an "independent predictor" without discussing confounding by indication. The warrant field surfaces this gap — the implicit reasoning the author expects readers to fill in, and the point where the argument is most structurally vulnerable.
method.3 · CRUX · Cox regression, age + gender only → mort.7 · AST/ALT ≥ 2 · HR 11.05 → mortality.9 · Three independent mortality predictors
Each relationship is classified by type. The graph makes structural properties computable: if method.3 (the Cox regression) is overturned, all three predictor claims collapse simultaneously — but findings from the chronology chain, which uses a different statistical method, survive.
II. The Methodology
Scholion is a decomposition methodology. It takes the prose arguments in research documents and renders them as typed dependency graphs where every node is an individually falsifiable claim and every edge is a classified relationship. The methodology has three layers.
Atomic decomposition is the foundation. A single sentence in a research paper often contains multiple logical assertions. “WBC count, AST/ALT ratio, and surgical intervention remained independently correlated with survival” is three claims, not one. Atomic decomposition breaks compound assertions into individually falsifiable propositions, each carrying its own statistical evidence, scope conditions, and failure modes. You cannot track dependencies between claims you have not identified.
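A minimal sketch of what the decomposed sentence might look like as data, using the three claims and statistics reported above. The `AtomicClaim` class and its `ci_excludes_null` check are illustrative names, not part of the Scholion schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicClaim:
    claim_id: str
    text: str
    hazard_ratio: float
    ci_low: float
    ci_high: float
    p_value: float

    def ci_excludes_null(self) -> bool:
        """True if the reported interval lies strictly above HR = 1."""
        return self.ci_low > 1.0


# One compound sentence, three separately testable claims.
claims = [
    AtomicClaim("mort.6", "Elevated WBC count is an independent predictor of mortality",
                1.013, 1.000, 1.025, 0.042),
    AtomicClaim("mort.7", "AST-to-ALT ratio >= 2 is an independent predictor of mortality",
                11.052, 1.441, 84.770, 0.021),
    AtomicClaim("mort.8", "Surgical intervention is an independent predictor of mortality",
                6.604, 1.581, 27.593, 0.010),
]

for claim in claims:
    print(claim.claim_id, claim.ci_excludes_null())
```

Once the claims are separate records, differences between them become visible. At the three-decimal precision reported, the lower bound for mort.6 sits at 1.000, so the check cannot confirm that its interval excludes the null, while the other two intervals clearly do.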
Toulmin reveal operates on each atomic claim. Stephen Toulmin’s model of argumentation distinguishes six components: the claim itself, the grounds (data supporting it), the warrant (the reasoning connecting data to claim), the backing (what authorizes the warrant), the qualifier (scope conditions), and the rebuttal (conditions under which the claim fails). Of these, the warrant is the highest-value extraction target. Warrants are where authors expect readers to fill in unstated reasoning, and they are where arguments are most structurally vulnerable. A paper can report a statistical result accurately while leaving the inferential hazard — the gap between what the data shows and what the conclusion claims — entirely implicit. Toulmin decomposition forces the annotator to surface that gap, and once surfaced, it becomes auditable. Section III demonstrates this concretely.
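As a rough illustration, the six Toulmin components can be held in one record per claim, with a flag for whether the warrant appears in the source or had to be reconstructed. The field names and the `audit_flags` helper below are assumptions made for this sketch; the example values are taken from the surgical intervention card shown earlier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToulminRecord:
    claim: str
    grounds: str
    warrant: Optional[str]
    warrant_stated: bool  # False when the source leaves the warrant implicit
    backing: Optional[str]
    qualifier: Optional[str]
    rebuttal: Optional[str]


def audit_flags(record: ToulminRecord) -> List[str]:
    """List the structural gaps a reviewer should look at first."""
    flags = []
    if not record.warrant_stated:
        flags.append("warrant not stated in source")
    if record.backing is None:
        flags.append("no backing recorded")
    if record.rebuttal is None:
        flags.append("no rebuttal recorded")
    return flags


mort_8 = ToulminRecord(
    claim="Surgical intervention is an independent predictor of mortality",
    grounds="HR 6.604, CI 1.581-27.593, p = .010",
    warrant="Patients requiring surgery represent the most physiologically compromised subgroup",
    warrant_stated=False,
    backing=None,
    qualifier="Based on 7 surgical patients (3 deaths); adjusted for age and gender only",
    rebuttal="Confounding by indication",
)

print(audit_flags(mort_8))
```

Keeping `warrant_stated` separate from the warrant text lets an annotator record a reconstructed warrant without hiding the fact that the paper never states it.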
Dependency typing turns the flat list of annotated claims into a directed graph. Each relationship between claims is classified by type — causal, conditional, purposive, contrastive, or conjunctive — making structural properties computable. Transitive closure identifies everything that depends on a given claim. Crux identification finds load-bearing nodes whose failure would collapse significant downstream structure. Invalidation propagation traces the consequences of a claim being overturned — which conclusions survive, which fall, and which become undetermined. The output is not a summary. It is the argument’s logical architecture rendered as a navigable, queryable graph.
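A minimal sketch of the graph operations named above, on a simplified subset of the Chen graph. The edge list, the type assigned to each edge, and the node id `chron.km` (standing in for the Kaplan-Meier chronology chain) are illustrative assumptions; crux ranking here simply counts downstream dependents.

```python
from collections import defaultdict
from enum import Enum


class DepType(Enum):
    CAUSAL = "causal"
    CONDITIONAL = "conditional"
    PURPOSIVE = "purposive"
    CONTRASTIVE = "contrastive"
    CONJUNCTIVE = "conjunctive"


# (upstream claim, downstream claim, relationship type); illustrative subset.
EDGES = [
    ("method.1", "method.3", DepType.CONDITIONAL),
    ("method.1", "chron.km", DepType.CONDITIONAL),
    ("method.3", "mort.6", DepType.CONDITIONAL),
    ("method.3", "mort.7", DepType.CONDITIONAL),
    ("method.3", "mort.8", DepType.CONDITIONAL),
    ("mort.6", "mortality.9", DepType.CONJUNCTIVE),
    ("mort.7", "mortality.9", DepType.CONJUNCTIVE),
    ("mort.8", "mortality.9", DepType.CONJUNCTIVE),
]


def build_children(edges):
    children = defaultdict(set)
    for upstream, downstream, _dep_type in edges:
        children[upstream].add(downstream)
    return children


def dependents(node, children):
    """Transitive closure: every claim reachable downstream of `node`."""
    seen, stack = set(), [node]
    while stack:
        for nxt in children.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def rank_cruxes(edges):
    """Order claims by how many other claims depend on them."""
    children = build_children(edges)
    nodes = {n for up, down, _ in edges for n in (up, down)}
    scores = [(n, len(dependents(n, children))) for n in nodes]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


children = build_children(EDGES)
print(sorted(dependents("method.3", children)))
print(rank_cruxes(EDGES)[:2])
```

In this toy graph, everything sits downstream of `method.1`, while the dependents of `method.3` exclude the chronology node. That is the selective invalidation pattern in miniature.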
A critical design observation emerges from this layering: the methodology separates into two distinct tasks that require different capabilities. The first is structural decomposition — identifying claims, tracing dependencies, classifying relationship types, flagging where warrants are unstated. This task is textual, not inferential. It asks “what does the argument say and how are its parts connected?” and can be performed by someone who does not know the domain, provided they have method. The second is substantive evaluation — assessing whether a warrant is sound, whether a statistical method is appropriate, whether a rebuttal is fatal. This requires domain knowledge.
The separation matters because it implies that the annotator does not need to carry the domain expertise. The annotation environment can carry it instead — through inline contextual annotations that surface domain-specific meaning at the point of need. When a non-specialist annotator encounters “AST/ALT ratio ≥ 2” in a source text, a contextual popover can provide the clinical significance (the De Ritis ratio, hepatocellular injury patterns, what this marker means in the context of multi-organ involvement) without requiring the annotator to already know hepatology. The annotator’s job is to decompose the structure; the scaffolding bridges the vocabulary gap. Domain knowledge shifts from a prerequisite of the annotator to a property of the annotation environment.
This is not merely a workflow optimization. The absence of domain priors may be a structural advantage for the decomposition task. An expert annotating the Chen extraction fills in warrants from their own clinical knowledge — confounding by indication, the De Ritis ratio interpretation — producing richer annotations but also importing reasoning the paper does not contain. A non-expert, forced to work from the text, annotates what the argument actually says rather than what a knowledgeable reader infers. When the non-expert marks a warrant field as “unstated — the paper does not explain why surgery would predict mortality,” they have identified the structural vulnerability just as precisely as the expert who fills in the mechanism, and without contaminating the extraction with external knowledge. The gap itself is the finding.
III. The Medical Demonstration
The methodology was tested on Chen et al.’s 2025 systematic review of COVID-19-induced acute pancreatitis. The paper was chosen deliberately: a medical systematic review uses argument structures — statistical evidence, inclusion criteria, regression models, negative findings — that are remote from the AI safety domain where Scholion originated. If the schema works on clinical medicine, it is more credibly general than if tested only on the arguments it was designed for.
Twenty-five claims were extracted across two structurally distinct argument chains.
The mortality predictors chain follows a clean tree structure. The foundational crux is the inclusion criterion — the diagnosis-by-exclusion definition of “COVID-19-induced” pancreatitis. From there, the chain flows through the study population (111 patients from 87 case reports, dual-reviewer screening per PRISMA), to the statistical methodology (Cox proportional hazards, adjusted for age and gender only), through four univariate predictors, to three that survive multivariate adjustment (WBC count at HR 1.013, AST/ALT ratio ≥ 2 at HR 11.052, surgical intervention at HR 6.604), and finally to the synthesis claim that these three are independent mortality predictors.
Two nodes are load-bearing cruxes. The inclusion criteria (method.1) are foundational: if the induced/coincident distinction is invalid, the patient population is contaminated and all 25 claims collapse. The statistical methodology (method.3) is the second: if the Cox model is inappropriate or the adjustment scope is too narrow, the mortality predictor findings fall — but the chronology findings, which use a different analytical method (Kaplan-Meier), survive. This kind of selective invalidation is invisible in prose but immediate in the graph.
The warrant field did its most productive work on the surgical intervention finding. The paper reports surgery as an “independent predictor” without discussing confounding by indication. The warrant extraction surfaced the implicit reasoning: surgery does not cause death; severe disease causes both surgery and death. The De Ritis ratio interpretation of the AST/ALT finding — that a ratio ≥ 2 is classically associated with hepatocellular injury and, in this context, suggests multi-organ involvement — is another implicit warrant the annotator reconstructed. These are not criticisms of the paper. They are structural observations that Toulmin decomposition makes systematic.
The symptom chronology chain is a structurally different kind of argument: a negative finding. The paper reports that symptom timing does not predict survival — Kaplan-Meier log-rank p = 0.543 for the sequence of GI versus respiratory onset, p = 0.228 for whether pancreatitis developed before or during hospitalization. The authors interpret these nulls as evidence that COVID-19-induced pancreatitis is a distinct clinical entity. The extraction revealed three structural vulnerabilities that the prose obscures.
The absence-of-evidence problem is the most fundamental. With 11 deaths across three subgroups, the analysis had very low statistical power. The raw mortality rates — 8.1% in the before-admission group versus 16.7% during hospitalization — suggest a twofold difference the study was underpowered to detect. The non-significant p-value is an absence of evidence, not evidence of absence, and the paper’s central conclusion treats it as the latter.
The Balthazar-survival tension is the most analytically interesting. Symptom chronology is significantly associated with radiologic severity (Balthazar score ≥ D: p = 0.013 for GI-first presentation, p = 0.042 for pre-admission onset). But chronology does not predict survival. If timing predicts imaging severity but not death, either radiologic severity does not translate to mortality in this disease, or the survival analysis lacks the power to detect the downstream effect. The paper does not address this disconnect. It emerged from the extraction — a structural observation by the annotator juxtaposing two sets of findings the authors present separately.
The cross-study comparison problem is the most methodologically consequential. The paper’s thesis depends on contrasting its own null findings with positive findings from other studies using looser etiological definitions. But those studies used different populations, timeframes, and healthcare systems. The argument requires the assumption that the etiological definition is the relevant variable, rather than any of dozens of other methodological differences between the studies. This inferential pattern — one of six documented schema failure modes — does not map cleanly to the schema’s five dependency types.
Chen et al. (2025) — Dependency Structure · 25 claims, 2 chains
Selective invalidation: If method.1 falls, all 25 claims collapse — both chains depend on the induced/coincident distinction. If method.3 falls, only the mortality chain collapses; the chronology chain uses Kaplan-Meier, not Cox regression, and survives. This asymmetry is invisible in the prose but immediate in the graph.
Legend: crux (load-bearing) · synthesis / grouped · contrastive / limitation · standard claim · status IN or UNDETERMINED · reduced opacity = lower confidence
IV. The Safety Case Application
The medical demonstration establishes that Scholion’s decomposition methodology works across domains. But the domain where the methodology has the most urgent institutional demand is AI safety — specifically, safety cases.
A safety case is a structured argument that a system is safe enough to deploy. In the past year, this has moved from a conceptual proposal to an operational framework. Anthropic’s Responsible Scaling Policy (v3.0) made the most structurally significant move: it replaced pre-specified AI Safety Levels with argument-based standards, restructuring its industry-wide recommendations “around requiring analysis and arguments making a strong case for safety, rather than AI Safety Levels.” The policy then acknowledges the gap this opens (“one actor’s view of what constitutes good risk assessment and mitigation may be very different from another’s”) and responds with external review requirements that ask reviewers to assess “analytical rigor” and whether they “disagree with any of the Risk Report’s key claims.” The quality of safety arguments is now the load-bearing question, and there is no shared method for evaluating argument quality at the structural level. The UK AI Security Institute (AISI, formerly the AI Safety Institute) published safety case templates including inability arguments and end-to-end misuse safeguard cases using Goal Structuring Notation. A January 2026 paper proposed a reusable template framework with comprehensive taxonomies for claim types (assertion-based, constraint-based, capability-based), argument types (demonstrative, comparative, causal, risk-based, normative), and evidence families.
Everyone is converging on structured safety arguments. And everyone acknowledges, in varying degrees of explicitness, that the unsolved problem is maintenance.
A safety case is not a one-time artifact. It is a living argument that must be updated as model capabilities change, as new evaluation results arrive, as interpretability findings revise our understanding of what a model can do. When a new result challenges a claim in a safety case — say, a scaling monosemanticity finding that reveals features not captured by the original interpretability assessment — someone has to trace the implications. Which claims in the safety case depended on the original assessment? Which downstream conclusions are affected? Does the top-level safety claim still hold, or does it need revision?
Today, that trace is manual. A safety case author reads the new finding, mentally reconstructs the dependency structure, and updates the relevant claims. This works when the safety case is small and the author is the same person who wrote it. It does not scale to safety cases maintained by teams over months, informed by dozens of upstream research papers, and subject to regulatory review.
Scholion automates that trace. If the claims in a safety case are decomposed into a typed dependency graph with identified cruxes and classified warrants, then the question “what happens if this claim is overturned?” becomes a graph operation: propagate the status change through the dependency edges, identify which downstream conclusions are affected, and flag which load-bearing claims have lost their foundation. The structural bookkeeping, which is error-prone, tedious, and easy to get wrong at scale, becomes mechanical.
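One way to sketch that graph operation, under stated assumptions. Each claim lists alternative support sets, and every member of a set is jointly required. The update rule is an assumption made for this example, not the documented Scholion rule: a claim is OUT if it is overturned or every support set contains an OUT member, IN if every support set is fully IN, and UNDETERMINED otherwise. All claim names are hypothetical.

```python
from enum import Enum
from graphlib import TopologicalSorter  # Python 3.9+


class Status(Enum):
    IN = "IN"
    OUT = "OUT"
    UNDETERMINED = "UNDETERMINED"


# claim -> alternative support sets; each set is jointly required.
SUPPORTS = {
    "interp_assessment": [],
    "behavioral_evals": [],
    "deployment_controls": [],
    "inability_argument": [{"interp_assessment"}, {"behavioral_evals"}],
    "monitoring_claim": [{"interp_assessment"}],
    "top_level_safety": [{"inability_argument", "deployment_controls"}],
}


def propagate(supports, overturned):
    """Recompute every claim's status after the given claims are overturned."""
    predecessors = {c: set().union(*sets) for c, sets in supports.items()}
    status = {}
    # static_order() yields each claim only after all of its supports.
    for claim in TopologicalSorter(predecessors).static_order():
        sets = supports[claim]
        if claim in overturned:
            status[claim] = Status.OUT
        elif not sets:
            status[claim] = Status.IN  # premise with nothing upstream
        elif all(any(status[m] is Status.OUT for m in s) for s in sets):
            status[claim] = Status.OUT  # every justification is broken
        elif all(all(status[m] is Status.IN for m in s) for s in sets):
            status[claim] = Status.IN  # every justification is intact
        else:
            status[claim] = Status.UNDETERMINED
    return status


result = propagate(SUPPORTS, overturned={"interp_assessment"})
for claim, st in result.items():
    print(f"{claim}: {st.value}")
```

In this toy case, overturning the interpretability assessment takes out the claim that rested on it alone, while the inability argument, which has a surviving alternative path, and the top-level claim above it are flagged for re-evaluation instead of being declared failed.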
A safety case has the same structural properties the Chen extraction revealed. An interpretability claim supporting an inability argument is a crux, just as the inclusion criterion was a crux for all 25 claims in the medical extraction. When a new finding undermines it, the graph immediately identifies which safety claims depended on it, which alternative evidence paths survive, and whether the top-level safety claim still holds — the same selective invalidation the Chen extraction demonstrated when overturning the statistical methodology collapsed the mortality findings but left the chronology findings intact. The medical extraction demonstrated this on paper. Safety cases need it operationally.
The Sabotage Risk Report published alongside the RSP illustrates the point concretely. The report’s central argument rests on four claims — prior expectations about alignment, alignment assessment findings, inability to undermine the assessment, and limited opaque reasoning capabilities — functioning as evidential pillars supporting the top-level safety determination. Its Section 7 contains a table that is, in effect, a hand-built Scholion extraction: load-bearing claims in the left column, invalidation conditions in the right. “An updated alignment audit reports major new concerning findings” invalidates “we are aware of no dangerous coherent misaligned goals.” “A model-organisms study shows circumvention of the full suite of audit methods” invalidates “the methods used in the alignment audit appear sufficient.” This is manually constructed invalidation propagation — the same graph operation the Chen extraction demonstrated. The warrant problem appears here too. The report’s Claim 1 rests on continuity between successive models, but the warrant connecting similar training to similar alignment properties is stated as belief rather than mechanism — an implicit warrant that Toulmin decomposition would surface and make auditable, just as it surfaced the confounding-by-indication problem in the surgical intervention finding. The alignment science team converged on this structure independently because the structural bookkeeping is the hard part, and they are doing it by hand.
But the deeper implication is not organizational. It concerns oversight.
The central problem in AI safety is asymmetric evaluation: how does a less capable overseer verify the reasoning of a more capable system? Khan’s research programme on safe automated research (2024) reframes superalignment as a verification problem, where the challenge shifts from specifying all desired behaviors to verifying work outputs. It provides an empirical anchor for thinking about this question by investigating how multi-agent structure can compensate for capability gaps across several domains. In the programme’s study of AI debates, optimizing debaters for persuasiveness increases judge accuracy on truthfulness questions, the opposite of the naïve expectation that more persuasive models would be more misleading. The mechanism is asymmetric access to evidence: debaters arguing for the correct answer can cite verifiable facts from the source text, and this advantage grows with persuasiveness. Crucially, stronger consultants in a single-advisor format make judges less accurate. The adversarial structure is doing epistemic work that single-expert consultancy cannot, because consultancy provides no pressure to surface the strongest counter-argument. The judge does not need to match the debaters’ capability. The structure compensates for the capability gap. Khan is candid about where this breaks down: debate works well for code and factual domains but struggles with “complex theoretical work” and “abstract alignment theories,” which are precisely the kind of arguments safety cases contain. The two-layer annotation model addresses this gap: it decomposes arguments into auditable structural joints even when the substantive claims exceed the verifier’s domain expertise, complementing debate at the argument level rather than replacing it at the claim level.
Scholion extends this principle beyond the debate format. A dependency graph is immune to rhetorical fluency in a way that prose evaluation is not. A beautifully written argument with a foundational crux resting on an unstated warrant looks exactly as vulnerable in the graph as a clumsily written one. A recent survey of LLMs in argument mining finds that they “favour fluent but logically thin arguments over less polished yet better-supported ones” (arXiv 2506.16383) — precisely the failure mode that structural representation neutralizes. If the claims in a safety case are rendered as a typed dependency graph, a reviewer who is weaker than the system that produced the arguments can still audit them, because the evaluation task is structural: are the dependencies valid? Are the warrants stated? Do the cruxes survive the flip test? These are questions about the argument’s architecture, not about the domain.
Combined with the two-layer annotation model described in Section II — structural decomposition by method-equipped non-specialists, substantive evaluation by domain experts, bridged by inline contextual scaffolding — this yields a specific version of scalable oversight. The structural layer is where the tractable, automatable, capability-independent verification happens. The substantive layer is where domain expertise remains necessary but is scoped to a specific, bounded task: reviewing warrants and cruxes that the structural layer has already identified as load-bearing. The overseer does not need to reconstruct the entire argument. They need to audit the joints.
This is the project’s meta-crux: does structural verification of arguments preserve the asymmetric oversight property that Khan demonstrates in debate? The Chen extraction provides initial evidence. The confounding by indication, the absence-of-evidence fallacy, the Balthazar-survival tension — these structural vulnerabilities were visible in the dependency graph and obscured in the prose. They were identified by an annotator following method, not by a clinical epidemiologist applying domain knowledge. If this pattern holds for safety cases, then a structurally competent verifier can oversee a substantively stronger researcher — not by matching their domain expertise, but by auditing the logical scaffolding their arguments rest on.
Sabotage Risk Report — Safety Case Fragment · 4 pillars, 4 invalidation conditions
Same graph, different domain. The Section 7 table is already doing invalidation tracking — each right-column condition, if triggered, signals that its corresponding pillar needs re-evaluation. Because the top-level determination is a conjunctive dependency on all four pillars, any single invalidation is sufficient to revisit the safety case. Scholion's contribution is making this structure computationally tractable: as safety cases grow across model generations, the structural bookkeeping that the report's authors did manually in Section 7 becomes increasingly valuable to automate.
Legend: safety determination · evidential pillar (crux) · invalidation condition · implicit warrant
Twenty-five claims from one paper is a promising start, not a validated methodology. The extraction stress-tested the schema and produced an honest accounting of where it works and where it does not.
What worked. The warrant field consistently surfaced implicit reasoning the paper’s prose did not make explicit — the structural vulnerabilities described in Section III were all captured through Toulmin decomposition doing genuine analytical work, not just reformatting. The five dependency types covered most relationships without straining; the tree structure of the mortality chain and the conjunctive structure of the chronology synthesis both mapped naturally. Cross-extraction references worked: the chronology extraction shares two foundational claims with the mortality extraction without duplication, validating the decision to extract by argument block rather than by paper section. Crux identification via the flip test produced the essay’s most practically significant output: the selective invalidation pattern where different cruxes collapse different subsets of the argument.
What did not work. The three-valued IN/OUT/UNDETERMINED status cannot represent epistemic weight. The chronology negative findings are technically IN (they are published results from a peer-reviewed systematic review), but they are epistemically weak, resting on 11 events with wide confidence intervals. Treating them as straightforwardly IN misrepresents their evidential strength. The schema now includes a confidence field (high/medium/low) alongside status, but a richer representation may be needed. The same problem appears at an institutional scale, where Anthropic’s own capability threshold assessments resist binary classification because models approach thresholds without clearly passing them.
Annotator-synthesized claims posed a boundary problem. The Balthazar-survival tension — the most analytically interesting observation in the chronology extraction — is not something the authors assert. The schema now includes a claim_source field (author_explicit / author_implicit / annotator_synthesized) to maintain traceability, but the convention for when and how annotators should generate structural observations needs codification before a second annotator begins work.
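A sketch of how the two added fields might sit alongside status in a claim record. The enum values follow the ones listed for the schema. The record class, the claim ids, the `review_queue` helper, and the specific status and confidence values assigned to the second example are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    IN = "IN"
    OUT = "OUT"
    UNDETERMINED = "UNDETERMINED"


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class ClaimSource(Enum):
    AUTHOR_EXPLICIT = "author_explicit"
    AUTHOR_IMPLICIT = "author_implicit"
    ANNOTATOR_SYNTHESIZED = "annotator_synthesized"


@dataclass
class ClaimRecord:
    claim_id: str
    text: str
    status: Status
    confidence: Confidence
    claim_source: ClaimSource


def review_queue(records):
    """Claims to route to a domain expert: weak evidence or annotator-originated."""
    return [
        r.claim_id
        for r in records
        if r.confidence is Confidence.LOW
        or r.claim_source is ClaimSource.ANNOTATOR_SYNTHESIZED
    ]


records = [
    # Published null finding: IN by status, but resting on 11 events.
    ClaimRecord("chron.null", "Symptom chronology does not predict survival",
                Status.IN, Confidence.LOW, ClaimSource.AUTHOR_EXPLICIT),
    # Structural observation the authors never assert; values are illustrative.
    ClaimRecord("chron.tension", "Chronology predicts Balthazar severity but not survival",
                Status.UNDETERMINED, Confidence.MEDIUM, ClaimSource.ANNOTATOR_SYNTHESIZED),
]

print(review_queue(records))
```

Keeping status, confidence, and provenance as three independent fields means a claim can be IN and weak, or well supported and annotator-originated, without overloading any single value.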
Cross-study comparison reasoning does not fit the five dependency types cleanly. The paper’s central thesis — that COVID-19-induced AP is distinct from concurrent AP — rests on contrasting its own null results with positive results from other studies. This was coded as “purposive” and “conjunctive,” but neither captures the actual logic: “our results differ from theirs, therefore the methodological difference explains the divergence.” Whether this warrants a sixth dependency type (“comparative”) or remains a documented edge case depends on how frequently the pattern recurs across domains.
Schema Evolution · v0.1 → v0.2 after Chen extraction
warrant · Validated
Consistently surfaced implicit reasoning the paper’s prose left unstated. The confounding-by-indication problem in the surgical intervention finding, the continuity assumption in the safety case — both captured through the warrant field doing genuine analytical work.
dependency_type · Validated
Five types (causal, conditional, purposive, contrastive, conjunctive) covered most relationships without straining. The mortality tree and chronology synthesis both mapped naturally.
crux · Validated
The flip test produced the essay’s most practically significant output: the selective invalidation pattern where different cruxes collapse different subsets of the argument.
depends_on · Validated
Cross-extraction references worked: the chronology extraction shares two foundational claims with the mortality extraction without duplication, validating extraction by argument block rather than by paper section.
confidence · New
The three-valued IN/OUT/UNDETERMINED status cannot represent epistemic weight. The chronology null findings are technically IN (published, peer-reviewed results) but rest on 11 events with wide confidence intervals. Treating them as straightforwardly IN misrepresents their evidential strength.
Values: high | medium | low · Alongside status, not replacing it
claim_source · New
The Balthazar-survival tension — the most analytically interesting observation in the extraction — is not something the authors assert. The schema needs to distinguish between what the paper says and what the annotator surfaces.
Values: author_explicit | author_implicit | annotator_synthesized
SF-001 · High · Weighted support
A categorical status cannot capture that a claim is published but epistemically weak. Absence of evidence is treated identically to strong positive findings.
Resolved → confidence field added
SF-002 · Medium · Annotator-synthesized claims
No way to distinguish author assertions from structural observations the annotator surfaces by juxtaposing findings.
Resolved → claim_source field added
SF-003 · Medium · Cross-study comparison reasoning
“Our null results differ from their positive results, therefore the methodological difference explains the divergence” — doesn’t map to any of the five dependency types.
Open → may warrant a sixth type (“comparative”). Track frequency in Phase 1.
SF-004 · Low · Informal schema drift
Extraction practice surfaced useful fields (boundary conditions, flip test notes, typed anchors) not in the canonical schema. The schema evolved informally through annotation before being formally revised.
Open → reconcile practice-derived fields with schema.yaml for Phase 1.
SF-005 · Medium · Confounding by indication
Statistical association ≠ causal relationship. The warrant field captures this, but there is no structured way to flag that the observed effect may be spurious.
Open → track across domains. May be specific to empirical/statistical research.
SF-006 · Low · Multi-span claim text
Synthesis claims draw from multiple non-contiguous sections. claim_text becomes a semi-verbatim composite rather than a true verbatim extract.
Open → allow claim_text as a list of source spans with section references.
Honest accounting over premature polish. Two of six failure modes were resolved by adding fields (confidence, claim_source). Four remain open — deliberately. The schema should be tested across domains before committing to structural changes that may be artifacts of a single paper's idiosyncrasies. Phase 1's job is to determine which edge cases are universal and which are domain-specific.
VI. The Competitive Landscape
The existing research infrastructure is good at adjacent problems. Elicit extracts structured data from papers — effect sizes, sample sizes, study characteristics — across 138 million indexed papers. Semantic Scholar maps citation graphs at massive scale. scite.ai classifies citation context as supporting, contradicting, or mentioning. Safety case frameworks help authors build structured arguments. Each of these operates at a different resolution than Scholion. Elicit answers “what did this paper find?” Semantic Scholar answers “which papers cite which?” scite.ai answers “does this citation support or contradict?” None of them answers “which specific claim in this argument depends on which other claim, through what warrant, and what breaks if this claim is wrong?”
The capabilities that define Scholion’s territory — claim-level decomposition, typed dependency relationships, warrant extraction, crux identification, and invalidation propagation — are not provided by any existing tool. scite.ai comes closest with its citation context classification, but it operates on citation sentences, not on the internal argument structure of a paper. If scite tells you Paper B contradicts Paper A, Scholion tells you which claim contradicts which claim, through what reasoning, and what downstream conclusions in both papers are affected. The gap is real and has not closed since the first essay was written.
[Table: Capability Comparison. Legend: core capability · adjacent / partial · not provided]
VII. Where This Goes
Scholion is currently a schema, two worked extractions, and a documented set of limitations. Turning it into a validated methodology and eventually a usable tool requires four phases, each with explicit success criteria and kill conditions.
Phase 1: Manual annotation and the novice annotator hypothesis. Three to five papers across domains — clinical medicine (already started), interpretability research, and possibly formal verification or philosophy of science. The core validation is inter-annotator agreement: does the schema carve arguments at consistent joints across independent annotators? If agreement is too low, the schema captures annotator idiosyncrasies, not stable structural features. That would be a kill condition.
But Phase 1 also tests a more specific hypothesis: that non-specialist annotators equipped with structured method and inline contextual scaffolding can produce structural decompositions of comparable quality to domain experts. This requires building the scaffolding — point-of-need contextual annotations, authored by domain experts, that bridge vocabulary gaps at each step of the decomposition. If the hypothesis holds, it means the structural layer of argument analysis is genuinely domain-independent, which transforms the scalability story: the bottleneck is not “finding willing domain experts” but “building good enough method and scaffolding that non-experts can do reliable structural work.” The annotated corpus is also a potential contribution to the argument mining community as a novel dataset with Toulmin decomposition and typed dependencies.
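The inter-annotator agreement check in Phase 1 could be computed with a chance-corrected statistic such as Cohen's kappa. The essay does not name a statistic or a threshold, so the choice of kappa here is an assumption, as are the dependency-type labels "requires" and "supports". The sketch also assumes both annotators labeled the same items in the same order.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[lab] * count_b[lab] for lab in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical dependency-type labels from two annotators on four edges.
annotator_1 = ["requires", "requires", "supports", "supports"]
annotator_2 = ["requires", "supports", "supports", "supports"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.5
```

In this toy case the annotators agree on three of four edges (0.75), but half of that is expected by chance, so kappa is 0.5. Agreement on claim boundaries is harder to score this way because the annotators may not segment the text into the same items, and would likely need a span-overlap measure instead.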
Phase 2: LLM extraction pipeline. Benchmark automated extraction against the manual ground truth from Phase 1. The argument mining literature suggests that LLMs can perform argument component detection at levels rivaling supervised baselines, but struggle with fine-grained structural reasoning — exactly the capability Scholion requires. If extraction quality is too low for the dependency graph to be structurally valid, the methodology requires either human-in-the-loop annotation at a level that does not scale or fundamental advances in LLM reasoning. Either way, the question is empirical.
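A minimal sketch of what benchmarking against the manual ground truth might look like is precision, recall, and F1 over typed dependency edges. This is an assumed metric, not one the essay specifies. It also assumes the automated and manual extractions share claim identifiers, which in practice requires a claim-alignment step of its own. The edge types and IDs are hypothetical.

```python
from typing import Set, Tuple

# (dependent claim ID, upstream claim ID, dependency type); all names hypothetical.
Edge = Tuple[str, str, str]


def edge_prf(predicted: Set[Edge], gold: Set[Edge]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over typed dependency edges."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold_edges = {
    ("C2", "C1", "requires"),
    ("C3", "C2", "requires"),
    ("C3", "C1", "supports"),
}
predicted_edges = {
    ("C2", "C1", "requires"),  # correct edge and type
    ("C3", "C2", "supports"),  # right endpoints, wrong type: counted as a miss
}
precision, recall, f1 = edge_prf(predicted_edges, gold_edges)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Scoring the type as part of the edge is deliberately strict: an extraction that finds the right pair of claims but mislabels the relationship would propagate invalidation incorrectly, so it is treated as an error here.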
Phase 3: Graph infrastructure. Typed dependency storage, invalidation propagation as a graph operation, cross-document claim linking. The technical implementation depends on Phase 1–2 findings — whether the dependency types are stable, whether cross-document references are tractable, and what query patterns matter for actual use.
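Invalidation propagation as a graph operation can be sketched as a breadth-first traversal from the invalidated claim to everything that depends on it. The node names below follow the dependency chain described in Section I. The edge types "requires" and "supports", and the rule that only "requires" edges propagate, are assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# (dependent claim, upstream claim, dependency type)
Dependency = Tuple[str, str, str]


def propagate_invalidation(
    deps: Iterable[Dependency],
    invalidated: str,
    propagating_types: FrozenSet[str] = frozenset({"requires"}),
) -> Set[str]:
    """Return every claim that transitively depends on the invalidated claim."""
    dependents: Dict[str, List[str]] = defaultdict(list)
    for dependent, upstream, dep_type in deps:
        if dep_type in propagating_types:
            dependents[upstream].append(dependent)

    affected: Set[str] = set()
    queue = deque([invalidated])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            # The affected set doubles as a visited set, so cycles terminate.
            if dependent != invalidated and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


chain = [
    ("patient_population", "inclusion_criteria", "requires"),
    ("statistical_methodology", "patient_population", "requires"),
    ("univariate_screening", "statistical_methodology", "requires"),
    ("multivariate_adjustment", "univariate_screening", "requires"),
    ("synthesis", "multivariate_adjustment", "requires"),
]
print(sorted(propagate_invalidation(chain, "inclusion_criteria")))
print(sorted(propagate_invalidation(chain, "univariate_screening")))
```

Pulling the foundational node marks all five downstream claims as affected, while invalidating a mid-chain node leaves everything upstream of it untouched. Whether weaker edge types should propagate a downgraded status rather than nothing at all is one of the questions Phase 1–2 findings would need to settle.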
Phase 4: Reading interface and validation. Domain experts use the dependency graph alongside the original paper and report whether the structural representation adds value over reading the prose. If domain experts do not find the graph more useful than the paper, the structural advantage thesis is wrong. This is the final kill condition and the most important one: utility, not elegance.
The project is a bet that making argument structure explicit is worth the overhead of decomposition — and that the structural layer carries independent epistemic authority, accessible to overseers who do not match the domain expertise of the systems or researchers they are evaluating. The medical extraction is the first evidence that the overhead produces insights the prose does not. The safety case application is the first domain where institutional demand is acute. The novice annotator hypothesis is the scalability test. The answer is not known yet. But the question is now precise enough to answer empirically.
Roadmap
Phase 1: Manual annotation and the novice annotator hypothesis
Three to five papers across domains: clinical medicine (started), interpretability research, and formal verification or philosophy of science. This phase tests inter-annotator agreement and the novice annotator hypothesis, namely that non-specialists with structured method and inline contextual scaffolding can produce decompositions of comparable quality to those of domain experts.
Success criteria
Consistent inter-annotator agreement on claim boundaries, dependency types, and crux identification across independent annotators and domains.
Kill condition
Agreement too low — the schema captures annotator idiosyncrasies, not stable structural features of arguments.
Phase 2: LLM extraction pipeline
Benchmark automated extraction against manual ground truth from Phase 1. The argument mining literature suggests LLMs can perform argument component detection at levels rivaling supervised baselines but struggle with fine-grained structural reasoning, which is exactly the capability Scholion requires.
Success criteria
Automated extraction produces dependency graphs structurally valid enough to propagate invalidation correctly.
Kill condition
Extraction quality too low — methodology requires human-in-the-loop annotation at a level that does not scale, or fundamental advances in LLM reasoning.
← Depends on the Phase 1 ground-truth corpus
Phase 3: Graph infrastructure
Typed dependency storage, invalidation propagation as a graph operation, cross-document claim linking. Technical implementation depends on whether dependency types are stable, whether cross-document references are tractable, and what query patterns matter for actual use.
Success criteria
Graph operations correctly propagate status changes through typed dependencies and support cross-document claim resolution.
Kill condition
Dependency types too unstable across domains — no consistent graph substrate supports the structural operations the methodology requires.
← Depends on Phase 1–2 findings on type stability and extraction fidelity
Phase 4: Reading interface and validation
Domain experts use the dependency graph alongside the original paper and report whether the structural representation adds value over reading the prose. This is the final and most important test: utility, not elegance.
Success criteria
Domain experts report the graph surfaces structural insights — cruxes, hidden dependencies, invalidation paths — that the prose alone does not make visible.
Kill condition
Experts do not find the graph more useful than the paper. The structural advantage thesis is wrong.
← Depends on Phase 3 graph infrastructure and a validated extraction pipeline
Each phase gates the next. The kill conditions are not hypothetical — they are the specific empirical questions the project exists to answer. If the structure is not stable, do not automate it. If the automation is not faithful, do not build infrastructure on it. If the infrastructure does not help readers, stop.