Your crawlers fetch billions of pages. They download raw HTML, run it through extraction pipelines, classify content, deduplicate, and store it — then do it again next week because the page might have changed. Most of it hasn't. Most of that work is wasted. What if websites just told you what's on them?
AI Certified is a structured content manifest standard backed by a public verification API. Sites publish /.well-known/ai-content.json — a single file that describes every page: its type, sections, BLAKE3 hash, and LuperAI classification. Your crawler reads one file instead of hundreds of pages. You skip extraction entirely. You only re-fetch pages whose hashes have changed.
The Crawling Problem Is Costing You Millions
Every major AI company is running the same infrastructure against the same websites. The duplication is staggering.
- Raw volume. A site with 500 pages requires 500 HTTP requests, 500 HTML downloads, and 500 extraction passes — every crawl cycle, whether or not anything changed.
- Redundant compute. Multiple AI companies crawl the same domain on the same day. The same HTML is downloaded, parsed, and classified independently by each.
- NLP pipeline cost. Every page runs through language detection, entity extraction, content classification, and deduplication. For a typical web page, you're classifying what someone else already classified when they wrote it.
- Staleness guessing. Without per-page content hashes, you can't know what changed. Most crawlers use URL-based heuristics or crawl everything on a fixed schedule. The signal-to-noise ratio is poor.
- No authenticity layer. After all that work, you still can't verify whether the content you ingested matches what the site actually says, or whether it was manipulated between the site and your pipeline.
The result is massive crawler fleets burning compute, bandwidth, and energy to produce data whose provenance you can't fully verify.
What If Websites Just Told You What's on Them?
AI Certified sites publish a manifest at /.well-known/ai-content.json. One file. Every public page described, pre-classified, and BLAKE3-hashed. Your crawler reads it once and knows exactly what's on the site and what (if anything) has changed since your last visit.
What the manifest contains
```json
{
  "version": "1.0",
  "dict_version": "v1",
  "generator": "luperiq-ai-certified/1.0",
  "site": {
    "name": "Example Business",
    "url": "https://example.com",
    "seal": "https://api.luperiq.com/v1/seal/verify/abc123"
  },
  "pages": [
    {
      "url": "/services/",
      "title": "Our Services",
      "modified": "2026-02-20T14:30:00Z",
      "type": "luperai:service-page",
      "industry": "luperai:home-services/hvac",
      "intent": "luperai:transactional",
      "manifest": "/.well-known/ai-content/services.json",
      "checksum": "blake3:a1b2c3d4e5f6..."
    },
    {
      "url": "/about/",
      "title": "About Us",
      "modified": "2026-01-15T09:00:00Z",
      "type": "luperai:about-page",
      "checksum": "blake3:f7e8d9c0b1a2..."
    }
  ],
  "total_pages": 42,
  "generated_at": "2026-02-26T12:00:00Z",
  "certified_at": "2026-02-26T08:00:00Z",
  "next_verification": "2026-02-27T08:00:00Z"
}
```
Each page entry links to a detailed page manifest at its manifest URL. The page manifest includes structured sections, images with alt text, internal and external links, LD+JSON structured data, and classification terms from the LuperAI Dictionary v1.0. BLAKE3 checksums are computed per page at verification time by the LuperIQ crawler — not self-reported by the site owner.
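In code, consuming the index is a simple dictionary walk. The sketch below builds a url-to-checksum map for later change detection, assuming the field names shown in the sample manifest above:

```python
# Build a {url: checksum} map from a manifest index for change
# detection. Field names ("pages", "url", "checksum") follow the
# sample manifest above.
def index_checksums(manifest: dict) -> dict:
    return {page["url"]: page["checksum"] for page in manifest["pages"]}

# Trimmed version of the example manifest from this section.
manifest = {
    "version": "1.0",
    "pages": [
        {"url": "/services/", "checksum": "blake3:a1b2c3d4e5f6..."},
        {"url": "/about/", "checksum": "blake3:f7e8d9c0b1a2..."},
    ],
}

checksums = index_checksums(manifest)
print(checksums["/services/"])  # blake3:a1b2c3d4e5f6...
```

On the next crawl cycle, compare this map against the freshly fetched manifest and re-fetch only the URLs whose hashes differ.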
What classification looks like
LuperAI Dict v1.0 covers eight content categories, five section types, six industries, and additional taxonomies for intent, trust signals, and available page actions. Terms are namespaced and versioned. A classified page entry might look like:
"classification": {
"content_type": "luperai:service-page",
"section_types": ["luperai:hero", "luperai:features", "luperai:cta"],
"industry": "luperai:home-services/hvac",
"intent": "luperai:transactional",
"trust_signals": ["luperai:has-reviews", "luperai:has-certifications"],
"actions": ["luperai:contact-form", "luperai:booking"],
"confidence": 0.87
}
Your model receives structured classification instead of raw text. The extraction pipeline that would normally produce this output has already run.
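One way to use that signal directly is to filter manifest pages by classification before they enter retrieval. A sketch, assuming the field names from the block above; the 0.8 confidence threshold is an arbitrary choice, not part of the spec:

```python
# Filter pages to high-confidence transactional content before
# retrieval. Field names follow the example classification block;
# the min_confidence default is an assumption.
def transactional_pages(pages, min_confidence=0.8):
    out = []
    for page in pages:
        cls = page.get("classification", {})
        if (cls.get("intent") == "luperai:transactional"
                and cls.get("confidence", 0.0) >= min_confidence):
            out.append(page["url"])
    return out

pages = [
    {"url": "/services/", "classification": {
        "intent": "luperai:transactional", "confidence": 0.87}},
    {"url": "/about/", "classification": {
        "intent": "luperai:informational", "confidence": 0.92}},
]
print(transactional_pages(pages))  # ['/services/']
```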
The AI Content Verification API
The public API is available at https://api.luperiq.com. All /v1/* endpoints return JSON with Access-Control-Allow-Origin: *. No API key required for read operations.
REST endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| `/v1/seal/verify?domain=example.com` | GET | Check certification status for a domain |
| `/v1/manifest?domain=example.com` | GET | Redirect to the site's ai-content.json |
| `/v1/dict/{version}` | GET | Full LuperAI Dictionary schema for a given version |
| `/v1/dict/frequencies` | GET | N-gram frequency data used to build the dictionary |
| `/v1/stats` | GET | Ecosystem statistics: certified sites, active seals, dict version |
| `/v1/report` | POST | Submit a content discrepancy report |
Seal verification response
```
GET /v1/seal/verify?domain=example.com
```

```json
{
  "valid": true,
  "seal_id": "abc123",
  "domain": "example.com",
  "tier": "pro",
  "status": "active",
  "issued_at": "2026-02-01T00:00:00Z",
  "valid_until": "2026-03-01T00:00:00Z",
  "seal_hash": "blake3:9f8e7d..."
}
```
If a domain has no certification, the endpoint returns 404 with `{ "valid": false, "error": "seal_not_found" }`. Revoked seals return `valid: false` and include a `revoked_at` timestamp. The `seal_hash` field is the BLAKE3 signature of the seal record in the Apex event journal, so you can verify it independently.
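A crawler needs to branch on these outcomes. A minimal sketch of response handling, based on the status codes and body shapes described in this section:

```python
# Classify a /v1/seal/verify response into the states a crawler
# cares about. Status codes and body fields follow the examples
# in this section.
def seal_state(status_code: int, body: dict) -> str:
    if status_code == 404:
        # Body is { "valid": false, "error": "seal_not_found" }.
        return "uncertified"
    if body.get("valid"):
        return "certified"
    if "revoked_at" in body:
        return "revoked"
    return "invalid"

print(seal_state(200, {"valid": True, "tier": "pro"}))  # certified
```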
GraphQL API
The GraphQL endpoint at /graphql exposes 14 queries and 15 mutations for programmatic access to certification data, content rules, reporter management, and configuration. Useful for building integrations that need structured reads across multiple sites or batch operations. The schema is available at /graphql?introspection=1.
What AI Content Verification Means for Your AI Product
Fewer hallucinations
Hallucinations often trace back to source data quality problems: misclassified content, stale pages ingested as current, or HTML that your extractor parsed incorrectly. Certified manifests are BLAKE3-verified at the time of scan — not self-reported by the site owner. When hashes match, you know your ingested content matches what the LuperIQ crawler independently verified.
Better content understanding out of the box
LuperAI Dict v1.0 classification is already applied when you read the manifest. Your model receives luperai:service-page with a confidence score instead of a 400KB HTML file from which you have to infer the same thing. Section-level classification tells you what each part of the page is for — hero, features, pricing table, testimonials, CTA. That signal passes directly into your retrieval or training pipeline.
Certification tier as a confidence weight
The AI Certified system issues seals at three tiers. Enterprise tier sites are verified daily with webhook-triggered re-scans on content change. Pro tier sites are verified daily. Free tier sites are verified monthly. Tier is included in the seal verify response. You can weight your confidence in content freshness accordingly without building your own freshness estimation logic.
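A sketch of tier-based weighting follows; the tier names come from the seal verify response, but the numeric weights are illustrative assumptions, not part of the spec:

```python
# Map a seal tier to a freshness confidence weight for retrieval
# scoring. Tier names come from the verify response; the numeric
# weights here are illustrative assumptions.
TIER_FRESHNESS = {"enterprise": 1.0, "pro": 0.9, "free": 0.6}

def freshness_weight(verify_response: dict) -> float:
    # Unknown or missing tiers get a conservative default weight.
    return TIER_FRESHNESS.get(verify_response.get("tier"), 0.5)
```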
Reporter reputation as a trust signal
AI crawlers that submit reports via POST /v1/report accumulate a reputation score. Confirmed reports increase reputation (capped at 2.0). Dismissed reports decrease it. Reporters below 0.3 are deprioritized. This creates a credibility signal you can expose in your own product: your system participated in verifying the data it's using.
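The stated rules (scores capped at 2.0, deprioritization below 0.3) can be sketched as follows; the step sizes are assumptions, since the section does not specify them:

```python
# Reporter reputation per the rules above: confirmed reports raise
# the score (capped at 2.0), dismissed reports lower it, and scores
# below 0.3 are deprioritized. The +0.1 / -0.2 step sizes are
# assumptions; only the cap and the 0.3 floor are documented.
def update_reputation(score: float, confirmed: bool) -> float:
    score = score + 0.1 if confirmed else score - 0.2
    return min(max(score, 0.0), 2.0)

def is_deprioritized(score: float) -> bool:
    return score < 0.3
```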
Provenance chain
Every certification event — registration, scan, seal issuance, violation, revocation — is an immutable record in the Apex event journal, BLAKE3-signed and Merkle-chained. You can trace any seal back through its full history: when it was issued, what the scan found, whether any violations were detected and resolved. That chain is the provenance record for content your model ingests from certified sites.
The Economics of Verified Data vs. Scraping
The math is straightforward. A site with 500 pages currently costs you 500 HTTP requests per crawl cycle. With AI Certified, it costs you 1 manifest read plus targeted re-fetches only for pages whose BLAKE3 hashes changed.
Request reduction
Across a corpus of certified sites, crawler requests typically drop by 99%. The manifest tells you which pages changed. You fetch only those. For a site that publishes two new blog posts per week, you're fetching 2 pages instead of 500 — every cycle.
Caching becomes aggressive and correct
BLAKE3 per-page checksums mean you can cache page content indefinitely and invalidate only on hash change. You don't need to guess at cache TTLs based on content type heuristics or Last-Modified headers. The hash is the cache key. If the hash in the manifest matches your stored hash, you don't fetch the page. If it doesn't, you do.
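The invalidation logic reduces to one comparison per page. A sketch, using the manifest field names from earlier in this piece:

```python
# Hash-keyed cache invalidation: the manifest checksum is the cache
# key, so only pages whose hash changed need re-fetching.
def pages_to_fetch(manifest_pages, cache):
    """cache maps page url -> last-seen blake3 checksum."""
    return [p["url"] for p in manifest_pages
            if cache.get(p["url"]) != p["checksum"]]

manifest_pages = [
    {"url": "/services/", "checksum": "blake3:a1b2c3..."},
    {"url": "/about/", "checksum": "blake3:f7e8d9..."},
]
cache = {"/services/": "blake3:a1b2c3...", "/about/": "blake3:old..."}
print(pages_to_fetch(manifest_pages, cache))  # ['/about/']
```

Pages absent from the cache fail the comparison too, so newly published pages are fetched automatically.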
NLP pipeline simplification
Classification is the most expensive part of content ingestion at scale. LuperAI Dict v1.0 covers the majority of web content types for small business and e-commerce sites: service pages, product pages, blog articles, FAQ pages, landing pages, portfolio pages. When classification is already done and confidence-scored, that step drops out of your pipeline for certified content.
Fleet size, energy, margins
Fewer requests means smaller crawler fleets. Smaller fleets mean lower bandwidth costs, less compute provisioned, less energy consumed. The infrastructure savings scale with the number of certified sites in your corpus. As adoption grows, the savings compound.
Integration in Three Steps
No SDK required. The integration pattern is three HTTP calls.
1. Check for a manifest. Before crawling any site, check `/.well-known/ai-content.json`. If it exists, read the manifest index. If a valid seal is present, verify it against `/v1/seal/verify?domain={domain}`.

   ```
   GET https://example.com/.well-known/ai-content.json
   # or via the API:
   GET https://api.luperiq.com/v1/manifest?domain=example.com
   ```

2. Compare hashes. Fetch only what changed. For each page in the manifest, compare the `checksum` against your stored value. If they match, skip the page — your cached content is still valid. If they differ, fetch the individual page manifest and update your store.

   ```
   # Page manifest at the URL from the index "manifest" field
   GET https://example.com/.well-known/ai-content/services.json
   # Returns structured sections, images, links, classification, and checksum
   ```

3. Use trust signals as model inputs. Pull the seal tier and reporter reputation from the verify endpoint and pass them as confidence weights into your retrieval or training pipeline. If your crawler detects a discrepancy between the manifest and the live page, submit a report.

   ```
   POST https://api.luperiq.com/v1/report
   Content-Type: application/json

   {
     "seal_id": "abc123",
     "page_url": "https://example.com/services/",
     "issue": "content_mismatch",
     "details": "Manifest shows pricing at $99 but live page shows $149",
     "reporter": "YourCrawler/2.0"
   }
   ```

   The report triggers an immediate re-scan of the affected page. If the discrepancy is confirmed, the site owner is notified and the violation ladder begins. Your reporter identity accumulates a trust score over time.
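If you script report submission, the payload is a flat JSON object. A sketch of building it, using the field names from the example request above ("YourCrawler/2.0" is a placeholder reporter identity):

```python
import json

# Build a /v1/report request body. Field names follow the documented
# example; the default reporter string is a placeholder.
def discrepancy_report(seal_id, page_url, details,
                       reporter="YourCrawler/2.0"):
    return json.dumps({
        "seal_id": seal_id,
        "page_url": page_url,
        "issue": "content_mismatch",
        "details": details,
        "reporter": reporter,
    })
```

POST the returned string with `Content-Type: application/json` using whatever HTTP client your crawler already carries.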
Technical Foundation
AI Certified is backed by Apex DB — an event-sourced database written in Rust. Every certification action is an immutable event: BLAKE3-signed, Merkle-chained, and addressable by aggregate and sequence number. The current implementation has 285 tests across the workspace and covers 14 aggregate types relevant to the AI certification domain.
The verification crawler (luperiq-crawler) is a separate Rust crate that reads and writes the shared ForgeJournal. It respects robots.txt, rate-limits per domain, follows redirects, and runs content rules against extracted text before issuing seals. The content rules engine supports keyword lists, regex patterns, and structural checks — including cloaking detection and manifest inflation detection.
The LuperAI Dictionary is versioned and immutable per version. v1 never changes once published. v2 adds terms. Sites reference the dict version in their manifests so classification terms remain stable for your downstream systems even as the vocabulary evolves.
Read the Full API Documentation
The complete API reference, GraphQL schema, manifest specification, and LuperAI Dictionary v1.0 are available in the AI Certified technical documentation.
Read the AI Certified documentation — full manifest spec, endpoint reference, and integration guides.
If you are evaluating AI Certified for integration into a crawler or training pipeline, contact us directly. Enterprise API access, bulk seal verification, and custom Dict term extensions are available.
