Deterministic Facts Layer: Building a Site-Specific Evidence Pack (No AI Needed)
Before “judgment” is possible, you need a durable evidence pack: site-specific facts you can point to, reproduce, and debug. This page defines what belongs in the facts layer—and what should never be delegated to AI.
Most “AI audit” products fail in the same place: they try to sound like judgment without holding any evidence. The cure is not more prompting. The cure is a strict foundation: a deterministic facts layer that outputs a site-specific evidence pack.
Definition
A facts layer is the deterministic part of your system that answers: “What is true about this site right now?” It produces structured evidence that can be inspected under failure, reused across reports, and compared across time.
What you’ll build on this page
1) Evidence Pack Blueprint
A minimal, repeatable set of facts that makes your reports defensible and non-generic.
2) Deterministic extraction rules
What to compute with code (always) vs what to phrase with models (only later, under constraints).
Why “facts first” is non-negotiable
If your report can’t point to concrete site facts, users don’t trust it. If your system can’t reproduce those facts, you can’t debug it.
- Trust: A judgment without evidence reads like opinion—especially across multiple clients.
- Debuggability: Without facts you can’t isolate the failure mode (crawler vs parser vs renderer vs logic).
- Non-template outputs: Facts force variation. Even identical “diagnosis types” still require different evidence, examples, and scope.
- Cost control: Deterministic facts are cheap. Model tokens are expensive. Facts are your “free” leverage.
Rule: If a sentence cannot be anchored to a site fact, it does not belong in a paid report.
The Evidence Pack: what it must contain
A site-specific evidence pack is not a “data dump”. It is a curated set of facts that directly supports: (1) diagnosis, (2) prioritization, (3) “why not the other actions”.
A. Site Identity + Constraints
- Target market/language guess: heuristics based on homepage language / hreflang.
- Primary business intent: inferred from nav + hero + CTA density (not an “AI guess” but a deterministic heuristic score; see the sketch after this list).
- System constraints: known CMS patterns, rendering complexity, JS dependency signals.
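As a concrete illustration, here is a minimal Python sketch of such a heuristic score, assuming BeautifulSoup for parsing already-fetched homepage HTML. The CTA keywords, thresholds, and the “lead_generation” / “content” labels are illustrative assumptions, not a fixed taxonomy:

# Sketch only: deterministic intent score from homepage signals.
# CTA_WORDS, the thresholds, and the labels are illustrative assumptions.
from bs4 import BeautifulSoup

CTA_WORDS = {"contact", "quote", "demo", "buy", "subscribe", "book", "pricing"}

def intent_facts(homepage_html: str) -> dict:
    soup = BeautifulSoup(homepage_html, "html.parser")
    clickables = soup.find_all(["a", "button"])
    cta_hits = sum(
        1 for el in clickables
        if any(word in el.get_text(strip=True).lower() for word in CTA_WORDS)
    )
    density = cta_hits / max(len(clickables), 1)
    # Arbitrary starting thresholds; calibrate against sites you have labeled.
    guess = "lead_generation" if cta_hits >= 3 and density > 0.05 else "content"
    return {
        "nav_links": len(soup.select("nav a")),
        "cta_hits": cta_hits,
        "cta_density": round(density, 3),
        "site_intent_guess": guess,
    }

The point is not the exact weights; it is that the same input always produces the same score, so the guess can be inspected and diffed like any other fact.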
B. Indexing gates & canonical reality
- Robots signals: meta robots, X-Robots-Tag, blocked resources that affect render.
- Canonical map: canonical target for key pages (including self vs non-self).
- Redirect chain facts: 301/302 hops for tested URLs. (A minimal extraction sketch for these gates follows this list.)
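A minimal sketch of extracting these gate facts for a single URL, assuming the requests and BeautifulSoup libraries; rendering and robots.txt parsing are deliberately out of scope here:

# Sketch only: indexing-gate facts for one URL (no JS rendering here).
import requests
from bs4 import BeautifulSoup

def indexing_gate_facts(url: str) -> dict:
    resp = requests.get(url, timeout=10, allow_redirects=True)
    # Each entry in resp.history is one redirect hop before the final URL.
    redirect_chain = [(r.status_code, r.headers.get("Location")) for r in resp.history]
    soup = BeautifulSoup(resp.text, "html.parser")
    meta = soup.find("meta", attrs={"name": "robots"})
    meta_robots = meta.get("content") if meta else None
    canonical = soup.select_one('link[rel="canonical"]')
    x_robots = resp.headers.get("X-Robots-Tag")
    return {
        "final_url": resp.url,
        "redirect_chain": redirect_chain,
        "meta_robots": meta_robots,
        "x_robots_tag": x_robots,
        "has_noindex": "noindex" in f"{meta_robots or ''} {x_robots or ''}".lower(),
        "canonical": canonical.get("href") if canonical else None,
    }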
C. Page set sample (N pages) + roles
- Sample list: exact URLs you analyzed (no ambiguity).
- Page type guess: home / hub / article / product / category / service (heuristic; see the sketch after this list).
- Role signals: how central each page is inside the site (internal links, nav presence).
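One deterministic way to implement the page type guess is a URL-pattern pass. The patterns below are assumptions that vary by CMS; treat them as a starting point, not a standard:

# Sketch only: page-type guess from URL path patterns (assumed patterns).
import re
from urllib.parse import urlparse

PAGE_TYPE_RULES = [
    (r"^/$", "home"),
    (r"/(blog|news|articles?)/.", "article"),
    (r"/(products?|shop|store)/", "product"),
    (r"/(category|collections?)/", "category"),
    (r"/(services?|solutions?)/", "service"),
]

def page_type_guess(url: str) -> str:
    path = urlparse(url).path or "/"
    for pattern, label in PAGE_TYPE_RULES:
        if re.search(pattern, path):
            return label
    # Shallow paths default to "hub", deeper ones to "article"; pure assumption.
    return "hub" if path.strip("/").count("/") == 0 else "article"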
D. Text & structure facts (per page)
- Title / meta description: presence, length, duplication across sample.
- H1/H2 structure: counts + anomalies (missing, multiple, non-descriptive).
- Extractable text volume: whether the page carries “readable text” or is only a “shell UI”.
- Schema presence: types detected and basic validity checks. (A per-page extraction sketch follows this list.)
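A per-page extraction sketch that produces exactly these facts, assuming BeautifulSoup; the 500-character “shell” threshold is an assumption to tune against real rendered pages:

# Sketch only: per-page text and structure facts.
import json
from bs4 import BeautifulSoup

def page_facts(url: str, html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Read JSON-LD before stripping script tags for text measurement.
    schema_types = []
    for block in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(block.string or "")
            items = data if isinstance(data, list) else [data]
            schema_types += [i.get("@type") for i in items if isinstance(i, dict)]
        except json.JSONDecodeError:
            schema_types.append("INVALID_JSON_LD")
    title = soup.title.get_text(strip=True) if soup.title else None
    h1s = [h.get_text(strip=True) for h in soup.find_all("h1")]
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # strip non-visible code before measuring text
    text = soup.get_text(" ", strip=True)
    return {
        "url": url,
        "title": {"value": title, "len": len(title or "")},
        "h1": {"count": len(h1s), "value": h1s[0] if h1s else None},
        "text": {"visible_chars": len(text), "is_shell_like": len(text) < 500},
        "schema": {"types": schema_types, "valid": "INVALID_JSON_LD" not in schema_types},
    }

Note that the output shape mirrors the JSON evidence pack shown later on this page, so the extraction step and the artifact stay in lockstep.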
E. Site-level network facts
- Internal link graph summary: degrees, hubs, orphan-like pages (within sample).
- Theme overlap signals: similarity clusters among titles/H1/primary text.
- “Center of gravity” score: whether authority concentrates on a small set of pages or is spread thinly across all of them. (A sketch of this summary follows.)
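A sketch of the site-level summary over the sampled pages only. “Center of gravity” is reduced here to a simple concentration ratio (the share of in-links held by the top ~10% of pages), which is one of several reasonable definitions, not the canonical one:

# Sketch only: sample-scoped link-graph summary.
from statistics import median

def network_facts(outlinks: dict[str, set[str]]) -> dict:
    # outlinks: sampled URL -> internal URLs it links to (sample-scoped graph).
    indegree = {url: 0 for url in outlinks}
    for targets in outlinks.values():
        for target in targets:
            if target in indegree:
                indegree[target] += 1
    degrees = sorted(indegree.values(), reverse=True) or [0]
    top_n = max(1, len(degrees) // 10)
    top_share = sum(degrees[:top_n]) / max(sum(degrees), 1)
    return {
        "internal_link_degree": {"max": degrees[0], "median": median(degrees)},
        # Assumption: >50% of in-links on the top ~10% of pages counts as "high".
        "center_of_gravity": "high" if top_share > 0.5 else "low",
    }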
What belongs in code (deterministic) vs what belongs in language (later)
Deterministic (always code)
Anything you must reproduce, compare, debug, or defend as “facts”. If it changes, you should be able to diff it.
Controlled language (optional, later)
Only the translation layer: turning locked judgments + evidence into natural language without inventing new conclusions.
- ✅ Code: “How many pages share near-identical titles?” (see the sketch after this list)
- ✅ Code: “Which pages compete on the same intent cluster based on similarity thresholds?”
- ✅ Code: “Is there a single page that receives disproportionate internal links?”
- ❌ Not AI-first: “What’s the root problem of this website?” (too open-ended)
- ✅ Controlled language later: “Explain why continuing content expansion is wasteful under evidence X/Y/Z.”
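For instance, the first question above can be answered with the standard library alone. SequenceMatcher is a crude but deterministic similarity measure, and the 0.85 threshold is an assumption to calibrate on your own data:

# Sketch only: near-identical title pairs across the sample.
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicate_titles(titles: dict[str, str], threshold: float = 0.85) -> list[dict]:
    # O(n^2) over the sample, which is fine for N of 10-50 pages.
    pairs = []
    for (url_a, t_a), (url_b, t_b) in combinations(titles.items(), 2):
        ratio = SequenceMatcher(None, t_a.lower(), t_b.lower()).ratio()
        if ratio >= threshold:
            pairs.append({"urls": [url_a, url_b], "similarity": round(ratio, 2)})
    return pairs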
A practical output format (JSON + human-readable)
Your facts layer should output two artifacts: (1) a structured JSON file, (2) a human-readable “evidence page” in the report. The JSON is for your system. The evidence page is for trust and explainability.
1) JSON evidence pack (minimal example)
{
  "run_id": "2025-12-28T14:12:09Z__example.com",
  "target": {
    "root_url": "https://example.com",
    "locale_guess": "en",
    "site_intent_guess": "lead_generation"
  },
  "indexing_gates": {
    "meta_robots": "index,follow",
    "x_robots_tag": null,
    "has_noindex": false,
    "canonical_consistency": "mixed"
  },
  "sample": {
    "n": 10,
    "urls": [
      "https://example.com/",
      "https://example.com/services/",
      "https://example.com/blog/..."
    ]
  },
  "page_facts": [
    {
      "url": "https://example.com/",
      "title": { "value": "Home", "len": 4 },
      "h1": { "count": 0, "value": null },
      "text": { "visible_chars": 380, "is_shell_like": true },
      "schema": { "types": ["WebSite", "Organization"], "valid": true }
    }
  ],
  "network": {
    "internal_link_degree": { "max": 32, "median": 4 },
    "center_of_gravity": "low",
    "clusters": [
      { "label": "pricing-intent", "urls": ["..."], "similarity": 0.84 }
    ]
  }
}
Notice what’s missing: no judgments, no prescriptions, no “AI language”. This is evidence only.
2) Human evidence page (what the client sees)
- List the analyzed URLs (exactly).
- Show 3–6 standout facts that later justify the diagnosis (e.g., duplicated title cluster, missing H1 pattern, canonical inconsistency).
- Keep it factual: “Page A and Page B both try to answer X,” not “Your SEO is bad.” (A minimal rendering sketch follows.)
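A minimal rendering sketch, assuming the JSON pack shown above. Note that it is purely a formatting pass: it surfaces facts that are already in the pack and introduces no conclusions of its own:

# Sketch only: human evidence page rendered from the JSON pack.
def render_evidence_page(pack: dict) -> str:
    lines = ["Pages analyzed:"]
    lines += [f"- {url}" for url in pack["sample"]["urls"]]
    lines.append("")
    lines.append("Standout facts:")
    for page in pack["page_facts"]:
        if page["h1"]["count"] == 0:
            lines.append(f"- {page['url']} has no H1.")
        if page["text"]["is_shell_like"]:
            chars = page["text"]["visible_chars"]
            lines.append(f"- {page['url']} exposes only {chars} characters of readable text.")
    return "\n".join(lines)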
Failure modes: what the facts layer helps you detect
The fastest way to build a reliable system is to log evidence at each extraction stage. A strong facts layer will tell you whether the failure is one of the following (a stage-tagging sketch follows the list):
- Fetch failure: timeouts, blocks, wrong content-type.
- Render failure: JS-dependent pages returning “shell HTML”.
- Parse failure: selectors break, headings missing due to template differences.
- Normalization failure: URL canonicalization errors causing duplicates.
- Sampling failure: your sample misses category/service pages that actually drive outcomes.
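A sketch of that stage tagging, assuming each URL leaves behind a small extraction record; the field names and thresholds are illustrative assumptions:

# Sketch only: tag each per-URL record with the stage that failed, so a bad
# report traces to fetch/render/parse rather than "the AI was wrong".
def classify_failure(record: dict) -> str | None:
    status = record.get("status")
    if status is None or status >= 400:
        return "fetch_failure"
    if record.get("content_type", "").split(";")[0].strip() != "text/html":
        return "fetch_failure"  # wrong content-type also counts as a fetch issue
    if record.get("html_bytes", 0) > 50_000 and record.get("visible_chars", 0) < 200:
        return "render_failure"  # big payload, almost no readable text: shell HTML
    if record.get("h1_count") is None:
        return "parse_failure"  # the parser never produced heading facts
    # Normalization and sampling failures need cross-URL checks, omitted here.
    return None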
A simple operating rule
If you can’t explain why the system produced an output, you don’t have a product—you have a demo. The facts layer is what turns it into an operable system.
How this layer feeds the next layers (without becoming a black box)
Your judgment system (SJA, or any equivalent) should consume the evidence pack as input. The evidence pack becomes the contract: it limits what can be concluded, and makes each conclusion provable.
- Facts → Triggers: evidence thresholds activate a specific diagnosis type.
- Triggers → Actions: actions are generated from the evidence (not from “best practices”).
- Actions → Language: language is rendered at runtime from evidence variables, not from copy-paste templates.
If you later introduce models: models only write within the boundaries set by evidence + triggers. They must never invent new triggers or conclusions.
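As a sketch of that contract, a trigger can be nothing more than an explicit threshold check over the evidence pack. The trigger name and thresholds here are illustrative, not a fixed taxonomy:

# Sketch only: one trigger evaluated against the evidence pack.
def trigger_content_cannibalization(pack: dict) -> dict | None:
    clusters = pack.get("network", {}).get("clusters", [])
    hot = [
        c for c in clusters
        if c.get("similarity", 0) >= 0.8 and len(c.get("urls", [])) >= 2
    ]
    if not hot:
        return None  # no evidence, no diagnosis
    return {
        "trigger": "content_cannibalization",
        "evidence": hot,  # the conclusion ships with its own proof
    }

Because the trigger returns the evidence it fired on, any downstream language layer can only phrase what is already proven.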
Closing: the “free leverage” most builders skip
People rush to the “AI part” because it looks impressive. But deterministic evidence is the real moat: it’s cheap, reproducible, explainable, and it scales across clients without becoming generic.