Technical Whitepaper
This paper describes AI Certified, a structured manifest standard and cryptographic verification
service that replaces thousands of redundant page crawls with a single structured file per site.
We document the scale of the current problem, the technical specification of the
/.well-known/ai-content.json standard, the BLAKE3-backed verification
architecture, and the infrastructure built to support it at scale.
1. Executive Summary
The internet is drowning in redundant scraping. More than 50 AI companies operate independent crawler fleets. Every one of them visits the same websites, downloads the same HTML pages, strips the same navigation markup, and runs the same extraction pipelines to produce nearly identical datasets. A 100-page business website may receive 5,000 bot requests per day delivering content that is identical to what was delivered yesterday, and the day before.
Website owners absorb all of this cost: bandwidth, server load, CDN egress charges. None of it generates revenue. The only alternative is to block the bots, which means disappearing from AI-generated answers entirely. Neither option is acceptable for a small business trying to stay visible in an AI-first discovery landscape.
AI companies pay their own price: massive crawler fleets, extraction pipelines, NLP classification runs, and deduplication logic, all processing the same content that five other AI companies processed last week. At the scale of the modern web, this is not inefficiency around the margins. It is a structural failure in how AI systems and websites exchange information.
AI Certified introduces a cooperative alternative. A structured JSON manifest at
/.well-known/ai-content.json describes every public page on a site in
machine-optimized format: URLs, titles, content types, section-level text, modification
dates, and BLAKE3 cryptographic checksums computed by an independent verification crawler.
Instead of downloading and parsing a 200KB HTML page, an AI system reads a 2KB manifest
entry and knows whether the content has changed since the last read.
The math is direct. One manifest replaces thousands of page fetches. Per-page BLAKE3 checksums let AI systems skip unchanged content without fetching it at all. At scale across 1 million certified sites and 50 AI crawlers, this eliminates roughly 4.45 billion HTTP requests per crawl cycle, assuming 10% of content changes between cycles.
This paper describes how the standard works, how verification is performed, and the purpose-built infrastructure supporting it. The goal is not to restrict AI access to web content. It is to make that access structured, efficient, and mutually beneficial.
2. The Scraping Problem
Scale
As of early 2026, more than 50 companies operate AI crawler fleets, including the major foundation model providers, search-integrated AI systems, retrieval-augmented generation services, and specialized data aggregators. Each operates independently. None shares crawl results with the others. Each downloads, parses, and stores its own copy of effectively the same content.
This is not new. Web indexing has always involved redundant crawling. But the economics have changed. Search engine crawlers index for discoverability and serve billions of users across a single index. The cost-per-user-served is infinitesimal. AI training and retrieval crawlers serve a different function, and the redundancy has grown in proportion to the number of competing AI products, not in proportion to any actual increase in content diversity.
The Math
Consider a small business website with 100 public pages. Each page averages 150KB of HTML. With 50 AI crawlers each running a monthly full crawl:
| Metric | Per Crawler | 50 Crawlers |
|---|---|---|
| Pages fetched per crawl cycle | 100 | 5,000 |
| Data transferred (HTML only) | 15 MB | 750 MB |
| Requests per day (monthly crawls) | ~3 | ~167 |
| Requests per day (daily crawlers) | 100 | 5,000 |
Most of those 5,000 requests return content that has not changed. The crawler does not know that without fetching the page. The site pays the bandwidth cost for the crawler to learn that nothing is new.
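The table's figures follow from simple arithmetic. A quick sketch, using the scenario's numbers (100 pages, 150KB average, 50 crawlers):

```python
# Redundant-crawl arithmetic for the scenario above.
PAGES = 100        # public pages on the site
PAGE_KB = 150      # average HTML payload per page
CRAWLERS = 50      # independent AI crawler fleets

requests_per_crawl = PAGES                       # one fetch per page
total_requests = requests_per_crawl * CRAWLERS   # 5,000 per crawl cycle
per_crawler_mb = PAGES * PAGE_KB / 1000          # 15 MB per crawler
total_mb = per_crawler_mb * CRAWLERS             # 750 MB across all crawlers

print(total_requests, per_crawler_mb, total_mb)  # 5000 15.0 750.0
```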
The Impossible Choice for Website Owners
Website owners currently face a binary decision with no good option. Allow bots, and
absorb server load and bandwidth costs that scale with how many AI companies exist, not
with how popular the site is or how much its content changes. Block bots via
robots.txt or IP blocking, and the site disappears from AI-generated answers,
recommendation engines, and retrieval-augmented systems. Visibility in AI-mediated search
is already significant and growing. Opting out is not a neutral choice.
Some site owners have attempted selective blocking, allowing some crawlers and denying others based on published user-agent strings. This requires active maintenance, is easily circumvented, and does not reduce the underlying problem for the crawlers that are allowed.
The Cost to AI Companies
The inefficiency is not one-sided. AI companies operating crawler fleets face their own redundancy costs. Fetching raw HTML requires rendering or at minimum parsing full document structure. Navigation menus, footer boilerplate, cookie consent banners, and decorative markup must be stripped before the meaningful content can be identified. NLP classification pipelines determine what type of page this is, what industry it serves, what sections contain actionable information. These pipelines run on every page, on every crawl, for every AI company, independently.
A classified, structured manifest that an AI system can read and trust eliminates all of that downstream processing for every page it covers. The extraction work was done once, by a single verification crawler, and the result is available to anyone.
Environmental Cost
HTTP requests are not free in energy terms. Each redundant fetch burns compute at both ends: the server generating and transmitting the response, and the crawler receiving, decompressing, and parsing it. At internet scale, with billions of redundant requests per crawl cycle across all AI systems, the aggregate energy cost is material. This is not a marketing claim. It is a mathematical consequence of the current architecture, and it scales linearly with the number of AI companies in the market.
3. The Manifest Standard
The Core Idea
/.well-known/ai-content.json is a structured JSON file at a predictable URL
that describes every public page on a site. It is the crawl result, pre-computed and
served on demand. An AI system that reads this file knows what pages exist, what they
contain, when they were last changed, and what their content type is, without fetching
a single HTML page.
The precedent for this pattern is well established. robots.txt, introduced
in 1994, gave website owners a standard way to tell crawlers where not to go. Sitemaps
told crawlers what URLs existed and when they were last modified. Neither standard required
a standards body to emerge before it became useful. They spread because they were simple,
open, and immediately valuable to all parties. The ai-content.json standard
follows the same adoption pattern.
The distinction from sitemaps is structural. A sitemap tells a crawler which pages exist. The AI manifest tells a crawler what those pages contain. It shifts the classification and extraction work from the crawler side to the content owner side, where it can be done once accurately, rather than approximated imperfectly thousands of times by third parties.
Site Manifest Index
The root manifest at /.well-known/ai-content.json is the entry point for
the entire site. A requesting system fetches this one file and receives a complete index
of every page, including its modification date and a BLAKE3 checksum of the current
page content. A crawler that cached this file from last week can compare checksums and
skip every page that has not changed.
A representative site manifest:
{
"version": "1.0",
"dict_version": "v1",
"generator": "luperiq-ai-certified/1.0",
"site": {
"name": "Riverside HVAC Solutions",
"url": "https://riversidehvac.com",
"description": "Residential and commercial HVAC services, Dallas-Fort Worth",
"language": "en-US",
"seal": "https://api.luperiq.com/v1/seal/verify/7f3a9c2e"
},
"pages": [
{
"url": "/services/",
"title": "HVAC Services",
"modified": "2026-02-18T09:15:00Z",
"type": "luperai:service-page",
"industry": "luperai:home-services/hvac",
"manifest": "/.well-known/ai-content/services.json",
"checksum": "blake3:a3f92c1d8e047b6a29d1f83c44e7b501"
},
{
"url": "/blog/heat-pump-vs-central-air/",
"title": "Heat Pump vs. Central Air: Which Is Right for Your Home?",
"modified": "2026-01-30T14:00:00Z",
"type": "luperai:blog-post",
"manifest": "/.well-known/ai-content/heat-pump-vs-central-air.json",
"checksum": "blake3:b8e1a4c9f20d3e7b15a8c301d6f40298"
},
{
"url": "/contact/",
"title": "Contact Us",
"modified": "2025-11-02T11:00:00Z",
"type": "luperai:contact-page",
"manifest": "/.well-known/ai-content/contact.json",
"checksum": "blake3:c7d2b5e0a31f4c8a96e2d415b7a90163"
}
],
"total_pages": 47,
"generated_at": "2026-02-27T06:00:00Z",
"certified_at": "2026-02-27T06:00:00Z",
"next_verification": "2026-02-28T06:00:00Z"
}
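A consuming AI system uses the index exactly as described: compare each page's checksum against its cache and fetch only what changed. A minimal sketch, assuming a local cache keyed by URL (the cache layout is this sketch's assumption, not part of the standard):

```python
# Sketch: decide which pages to re-fetch by comparing a freshly
# downloaded site manifest against locally cached checksums.
import json

def pages_to_fetch(manifest_json: str, cached: dict) -> list:
    """Return URLs whose BLAKE3 checksum changed since the last read."""
    manifest = json.loads(manifest_json)
    return [page["url"] for page in manifest["pages"]
            if cached.get(page["url"]) != page["checksum"]]

manifest = json.dumps({
    "pages": [
        {"url": "/services/", "checksum": "blake3:aaa"},
        {"url": "/contact/",  "checksum": "blake3:bbb"},
    ]
})
cache = {"/services/": "blake3:aaa", "/contact/": "blake3:OLD"}
print(pages_to_fetch(manifest, cache))  # ['/contact/']
```

Only the changed page is fetched; the unchanged one costs nothing beyond the single manifest read.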
Page Manifests
Each entry in the site index links to a full page manifest at a predictable path under
/.well-known/ai-content/. The page manifest contains section-level structured
content: every heading with its associated text, images with dimensions and alt text,
internal and external links, structured data extracted from ld+json blocks,
and LuperAI Dictionary classifications for content type, section purpose, and industry.
An AI system that receives a page manifest does not need to fetch the page, parse the HTML, strip navigation markup, or run NLP classification. The structured, pre-classified data is already in the manifest. The checksum in the site index lets the AI system know whether anything has changed since it last processed this page. If the checksum matches, the cached version is still valid.
{
"url": "https://riversidehvac.com/services/",
"canonical": "https://riversidehvac.com/services/",
"title": "HVAC Services",
"description": "Full-service HVAC installation, repair, and maintenance for Dallas-Fort Worth homes and businesses.",
"content_type": "luperai:service-page",
"industry": "luperai:home-services/hvac",
"intent": "luperai:transactional",
"language": "en-US",
"modified": "2026-02-18T09:15:00Z",
"author": "Riverside HVAC Team",
"trust_signals": ["luperai:has-reviews", "luperai:has-certifications", "luperai:has-pricing"],
"actions": ["luperai:booking", "luperai:contact-form", "luperai:phone-call"],
"sections": [
{
"id": "ac-installation",
"heading": "Air Conditioning Installation",
"level": 2,
"section_type": "luperai:features",
"anchor": "/services/#ac-installation",
"content": "We install all major brands including Carrier, Trane, and Lennox. Free in-home estimates. Most installs completed same week."
},
{
"id": "pricing",
"heading": "Service Pricing",
"level": 2,
"section_type": "luperai:pricing-table",
"anchor": "/services/#pricing",
"content": "Diagnostic visit: $89. Routine tune-up: $129. Emergency service (nights and weekends): $189 call-out fee plus parts."
}
],
"images": [
{
"src": "https://riversidehvac.com/wp-content/uploads/hvac-team.jpg",
"alt": "Riverside HVAC technicians in uniform",
"media_type": "luperai:team-photo",
"width": 1200,
"height": 800
}
],
"structured_data": {
"@type": "LocalBusiness",
"name": "Riverside HVAC Solutions",
"telephone": "+1-214-555-0147",
"address": { "addressLocality": "Dallas", "addressRegion": "TX" }
},
"links": [
{ "text": "Schedule Service", "url": "/schedule/", "rel": "internal" },
{ "text": "Read Our Reviews", "url": "https://g.page/riversidehvac", "rel": "external" }
],
"checksum": "blake3:a3f92c1d8e047b6a29d1f83c44e7b501"
}
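Because sections carry Dictionary classifications, an AI system can answer a concrete question (for example, "what does a diagnostic visit cost?") directly from the cached manifest. A sketch, with the manifest abbreviated from the example above:

```python
# Sketch: answer from a cached page manifest without fetching HTML.
# Sections are located by their LuperAI Dictionary section type.
page_manifest = {
    "url": "https://riversidehvac.com/services/",
    "sections": [
        {"id": "ac-installation", "section_type": "luperai:features",
         "content": "We install all major brands..."},
        {"id": "pricing", "section_type": "luperai:pricing-table",
         "content": "Diagnostic visit: $89. Routine tune-up: $129."},
    ],
}

def find_section(manifest: dict, section_type: str):
    """Return the first section matching a Dictionary section type."""
    for section in manifest["sections"]:
        if section["section_type"] == section_type:
            return section
    return None

pricing = find_section(page_manifest, "luperai:pricing-table")
print(pricing["content"])  # Diagnostic visit: $89. Routine tune-up: $129.
```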
Discovery
AI systems discover site manifests through three mechanisms. The primary path is a
robots.txt directive:
AI-Content: /.well-known/ai-content.json
Alternatively, sites can include a <link> tag in the HTML
<head>:
<link rel="ai-content" href="/.well-known/ai-content.json">
Sites registered with AI Certified are also discoverable via the LuperIQ API at
/v1/manifest/{domain}, which redirects to the site's manifest. AI systems
that support the standard can check this endpoint as a fallback for any domain they
encounter.
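The three mechanisms form a natural fallback chain. A client-side sketch (the tuple representation of head links and the exact parsing are this sketch's assumptions, not a normative client):

```python
# Sketch of the three-step discovery order: robots.txt directive,
# then <link rel="ai-content">, then the LuperIQ registry fallback.
def discover_manifest(robots_txt: str, head_links: list, domain: str) -> str:
    # 1. robots.txt "AI-Content:" directive takes priority.
    for line in robots_txt.splitlines():
        if line.lower().startswith("ai-content:"):
            return line.split(":", 1)[1].strip()
    # 2. <link rel="ai-content"> from the page head, as (rel, href) pairs.
    for rel, href in head_links:
        if rel == "ai-content":
            return href
    # 3. Fall back to the LuperIQ registry endpoint for the domain.
    return f"https://api.luperiq.com/v1/manifest/{domain}"

robots = "User-agent: *\nAI-Content: /.well-known/ai-content.json\n"
print(discover_manifest(robots, [], "example.com"))
```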
Why a Single File Works
The manifest is the crawl result. An AI system reading ai-content.json
receives pre-structured, pre-classified content that would otherwise require fetching and
parsing every page individually. The information density of a 2KB manifest entry is
equivalent to reading a 150-200KB HTML page, after stripping markup, navigation, ads,
and boilerplate. The manifest delivers only the content. No rendering. No extraction.
No NLP guesswork about what type of page this is.
The BLAKE3 checksum embedded in each page entry makes the manifest cacheable with mathematical certainty. A cached manifest entry is valid until the checksum changes. An AI system does not need to re-fetch pages to know whether its cached data is current. The checksum comparison is a single string equality check against a 2KB manifest file.
4. Verification Architecture
A self-reported manifest solves the efficiency problem but introduces a trust problem.
If a site can write anything into its own ai-content.json, there is no
guarantee that the manifest reflects the actual page content. Prices could be misrepresented.
Content that does not appear in the manifest could be served to human visitors. The
manifest could describe a 5-section page while the actual page contains 20 sections of
content the site owner does not want indexed.
AI Certified solves this with independent third-party verification. Every checksum in a certified manifest was computed by the LuperIQ crawler from the live page, not self-reported by the site owner. AI systems that check the seal can know the manifest reflects what the crawler actually found.
BLAKE3 Cryptographic Hashing
BLAKE3 is currently one of the fastest general-purpose cryptographic hash functions available. It is faster than MD5 on modern hardware while providing security properties equivalent to SHA-3. For content verification at scale, speed matters: the LuperIQ crawler generates per-page checksums for every page it visits, across thousands of sites, continuously.
Per-page BLAKE3 checksums serve two functions. First, they let AI systems determine whether a page has changed since the last verified crawl, without fetching the page. Second, they provide the cryptographic anchor for seal verification. A seal is only valid if the stored checksums match the checksums the LuperIQ crawler computed when it last verified the site. If a site owner edits a page to say something different than what the manifest describes, the next scan detects the mismatch and the seal status changes immediately.
Apex DB stores every certification event as a BLAKE3-signed record in an append-only, Merkle-chained journal. No event can be modified retroactively. An auditor can reconstruct the complete history of any site's certification status, including when each seal was issued, what checksums were current at that moment, and whether any violations occurred.
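The tamper-evidence property comes from the chaining itself: each record's hash covers the previous record's hash. A minimal sketch of a Merkle-chained journal (Python's standard library has no BLAKE3 binding, so this sketch substitutes `hashlib.blake2b`; the chaining logic, not the hash choice, is the point):

```python
# Sketch of an append-only, hash-chained journal. Editing any past
# event changes its hash, which breaks every link after it.
import hashlib

def _hash(prev_hash: str, payload: str) -> str:
    return hashlib.blake2b((prev_hash + payload).encode(),
                           digest_size=16).hexdigest()

class Journal:
    def __init__(self):
        self.events = []          # list of (payload, hash) tuples

    def append(self, payload: str):
        prev = self.events[-1][1] if self.events else "genesis"
        self.events.append((payload, _hash(prev, payload)))

    def verify(self) -> bool:
        """Recompute the chain; any retroactive edit breaks it."""
        prev = "genesis"
        for payload, stored in self.events:
            if _hash(prev, payload) != stored:
                return False
            prev = stored
        return True

j = Journal()
j.append("SealIssued:riversidehvac.com")
j.append("SealRevoked:riversidehvac.com")
print(j.verify())                                     # True
j.events[0] = ("SealIssued:evil.com", j.events[0][1]) # tamper with history
print(j.verify())                                     # False
```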
Verification Flow
The verification process from registration to active seal:
1. Registration. Site owner registers at luperiq.com and provides the domain. Apex event: SiteRegistered.
2. Ownership verification. Site owner adds a DNS TXT record or an HTML meta tag to prove domain control. The LuperIQ system checks and confirms. Apex event: OwnershipVerified.
3. Initial crawl. The LuperIQ crawler visits every public page, respecting robots.txt and rate limits. It discovers pages via the existing ai-content.json if present, otherwise via sitemap, otherwise via internal link following from the homepage. Apex events: VerificationScheduled, then VerificationCompleted.
4. Manifest generation. For each page, the crawler extracts structured content, runs LuperAI Dictionary classification, and computes a BLAKE3 checksum. If a site-hosted manifest exists, the crawler compares its output against it. Apex event: ManifestGenerated.
5. Content rules scan. Every page passes through the content rules engine before the seal decision is made. Hidden text, cloaking, keyword stuffing, and policy violations are detected here.
6. Seal issuance. If the scan produces a clean or drift result (content changed normally, structurally sound), the seal is issued or renewed. The seal record is BLAKE3-signed by LuperIQ and appended to the Apex journal. Apex event: SealIssued.
Trust Seal Tiers
Verification frequency is the primary differentiator between tiers. Content that is verified daily is more trustworthy than content verified monthly, because the gap between what the manifest says and what the page actually contains is bounded by the scan interval. AI systems receive tier information in seal verification responses and can weight confidence accordingly.
| Feature | Free | Pro ($19/mo) | Enterprise ($49/mo) |
|---|---|---|---|
| Verification frequency | Monthly | Daily | Real-time + daily baseline |
| Pages scanned | Up to 50 | Up to 500 | Unlimited |
| Seal badge | Standard (gray) | Priority (blue) | Premium (gold) |
| Dictionary auto-classification | Yes | Yes | Yes + custom terms |
| AI report notifications | Email digest | Instant email | Instant + webhook |
| Manifest hosting | Self-hosted | CDN option | CDN included |
| API access | Read-only | Full | Full + bulk |
| Violation grace period | 7 days | 3 days | 24 hours |
Content Rules Engine
The content rules engine runs on every page during every scan. It operates on the extracted page content before any seal decision is made. Rules come in three types:
- Keyword rules. Word and phrase lists matched against extracted text. Adult content filters, spam patterns, and manipulation phrases (for example, text designed to mislead AI crawlers about page content) are detected here.
- Pattern rules. Regular expression matching against raw HTML or extracted text. Hidden text detection checks for content in display:none elements that does not appear in the manifest. Keyword stuffing detection flags phrases repeated beyond a configurable threshold. Cloaking detection compares content served to the crawler user-agent against content served to a standard browser user-agent.
- Structural rules. Checks for structural inconsistencies between the manifest and the live page. If a manifest claims five sections and the crawler finds twenty, that discrepancy triggers a structural rule match. If a page redirects to a different domain than the registered site, that is caught here.
Rules are configurable by LuperIQ administrators and scoped to all sites, specific tiers, or individual domains. Approximately ten rules ship as defaults, covering the most common manipulation patterns. Custom rules can be added via the admin dashboard or GraphQL API.
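Two of the rule types above can be illustrated in a few lines. A sketch (the regex, thresholds, and rule names here are illustrative, not the shipped defaults):

```python
# Sketch of two default-style content rules: keyword stuffing (a
# phrase repeated past a threshold) and hidden text (display:none
# content that never appears in the manifest's extracted text).
import re

def keyword_stuffing(text: str, phrase: str, threshold: int = 5) -> bool:
    return text.lower().count(phrase.lower()) > threshold

def hidden_text(raw_html: str, manifest_text: str) -> list:
    """Return display:none fragments absent from the manifest."""
    hidden = re.findall(r'<[^>]*display:\s*none[^>]*>([^<]+)<', raw_html)
    return [frag for frag in hidden if frag not in manifest_text]

html = '<div style="display:none">cheap hvac cheap hvac</div>'
print(hidden_text(html, "Our HVAC services"))   # flagged fragment
print(keyword_stuffing("hvac " * 20, "hvac"))   # True
```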
Reporter Reputation System
Any AI system can report a discrepancy between a certified manifest and the content
it actually encountered when visiting a page. Reports are submitted via
POST /v1/report with the seal ID, page URL, issue type, and details.
The system creates an AiCertReport event in Apex and triggers an
immediate re-scan of the affected page.
Reporter trust is tracked on a 0.0 to 2.0 scale. Each reporter starts at 1.0. A confirmed report (re-scan finds the discrepancy) adds 0.1 to the reporter's score. A dismissed report (re-scan finds no discrepancy) subtracts 0.2. Reporters below 0.3 have their reports queued rather than triggering immediate action. Reporters whose score falls to 0.0 are blocked; their reports are accepted with a 200 response but silently ignored. LuperIQ administrators can manually adjust or reset reporter scores.
This creates a self-correcting community oversight layer. AI systems that report accurately accumulate trust and gain faster response times. Systems that submit false reports lose standing. The system does not require LuperIQ to manually adjudicate every report.
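The score arithmetic above fits in a few lines. A sketch (scores are modeled here as clamped to the 0.0-2.0 scale, with blocking at the 0.0 floor):

```python
# Sketch of reporter-score updates: start at 1.0, +0.1 per confirmed
# report, -0.2 per dismissed report, clamped to [0.0, 2.0].
def update_score(score: float, confirmed: bool) -> float:
    score += 0.1 if confirmed else -0.2
    return max(0.0, min(2.0, round(score, 1)))

def handling(score: float) -> str:
    if score <= 0.0:
        return "blocked"    # accepted with 200, silently ignored
    if score < 0.3:
        return "queued"     # no immediate re-scan
    return "immediate"

score = 1.0
for confirmed in [False, False, False, False]:  # four dismissed reports
    score = update_score(score, confirmed)
print(score, handling(score))  # 0.2 queued
```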
Violation Penalty Ladder
Content problems escalate through a configurable penalty sequence:
- First violation detected: owner notified, grace period begins (7 days free, 3 days Pro, 24 hours Enterprise).
- Grace period expires without resolution: seal downgraded to warning status. AI systems receive warning status in seal verification responses.
- Second violation within the rolling window: seal suspended. Site does not appear in certified listings.
- Malicious content rule match (malware, phishing, manifest manipulation): immediate revocation, no grace period.
- Reinstatement: requires a clean scan, followed by a 30-day probation period with daily monitoring regardless of tier.
Every step in this ladder is an immutable Apex event. Revocations cannot be removed from the record, only succeeded by reinstatement events. The full history is auditable by any party with API access.
5. The LuperAI Dictionary
Structured manifests require a shared vocabulary. If one AI system reads a content type
as service-page and another reads it as services and a third
reads it as commercial-offering, the classification data is not interoperable.
The LuperAI Dictionary is a versioned, public vocabulary for describing web content types,
section purposes, industry categories, visitor intent signals, media types, trust indicators,
and available actions.
Current State: v0.1
LuperAI Dictionary v0.1 is a hand-curated seed vocabulary: 7 categories, 73 terms. It was built from the content patterns present across LuperIQ's 37 active WordPress modules and the industry verticals those modules serve: home services, professional services, restaurants, retail, SaaS, and agencies. It covers the most common content classification needs for small and medium business websites.
| Category | Example Terms | Purpose |
|---|---|---|
| content_type | service-page, blog-post, product, faq, landing-page, portfolio | Page-level classification |
| section_type | hero, features, pricing-table, testimonials, cta, team, gallery | Section purpose within a page |
| business_type | home-services/hvac, restaurant, retail, saas, agency | Industry and business category |
| intent | informational, transactional, navigational, support | Primary visitor intent |
| media_type | product-photo, team-headshot, hero-banner, infographic | Image classification |
| trust_signal | has-reviews, has-certifications, has-case-studies, has-pricing | Quality and credibility indicators |
| action_type | contact-form, booking, purchase, download, subscribe | Available user actions |
v1.0: Data-Driven Generation
Dictionary v1.0 will be generated from n-gram frequency analysis across crawled sites,
not from manual curation. The LuperIQ crawler's ngram_collector module
tokenizes extracted page text into 1-to-6-word phrases during every scan. Phrase frequency
counts are accumulated per site per scan. Periodically, the system writes frequency
snapshots as AiDict:Frequencies events in Apex.
When a sufficient corpus of crawl data has accumulated, the codebook generation process:
- Loads all frequency snapshots from the Apex journal.
- Computes cross-scan phrase frequency rollups, weighting by site count rather than page count, to avoid large sites dominating the vocabulary.
- Applies a frequency threshold filter to retain phrases that appear consistently across many sites in similar industries.
- Clusters semantically similar phrases under canonical terms.
- Publishes the resulting vocabulary as an immutable Dict v1.0 event in Apex.
The output is a vocabulary derived from what websites actually say, not from what someone assumed they would say. This produces classification terms that match real-world content patterns more accurately than any manually curated vocabulary can.
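The site-count weighting in step 2 is the key design choice. A sketch (the snapshot shape, a list of per-scan `(site, {phrase: count})` records, is this sketch's assumption):

```python
# Sketch of the site-weighted rollup: a phrase's weight is the number
# of distinct sites it appears on, so one large or frequently scanned
# site cannot dominate the vocabulary.
from collections import defaultdict

def rollup(snapshots: list) -> dict:
    """snapshots: list of (site, {phrase: count}) per-scan records."""
    sites_per_phrase = defaultdict(set)
    for site, counts in snapshots:
        for phrase in counts:
            sites_per_phrase[phrase].add(site)
    return {p: len(s) for p, s in sites_per_phrase.items()}

snaps = [
    ("a.com", {"free estimate": 40, "blog": 2}),
    ("a.com", {"free estimate": 45}),   # same site, later scan
    ("b.com", {"free estimate": 3}),
]
print(rollup(snaps))  # {'free estimate': 2, 'blog': 1}
```

Despite 85 raw occurrences on a.com, "free estimate" weighs 2 (two sites), not 85.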
Versioning Semantics
Every dictionary version is immutable once published. v1 never changes. v2 adds terms.
v3 may deprecate terms from earlier versions, but deprecated terms remain in the schema
with a deprecation marker. Sites that reference "dict_version": "v1" in
their manifests will have their classification interpreted under v1 semantics permanently,
regardless of how the dictionary evolves. This ensures that AI systems built against
a specific dict version do not experience silent classification drift as the vocabulary
expands.
The current dictionary is available at GET /v1/dict/{version}. The latest
version redirects at GET /v1/dict/latest.
6. Apex DB: Purpose-Built Infrastructure
Certifying that a website's content is what it claims to be requires a database that can make a credible guarantee: nobody altered the historical record. Standard relational databases do not provide this. Records can be updated, deleted, or rolled back. A certification system built on a mutable store cannot prove that a seal was valid at a specific point in time, or that a revocation was not retroactively inserted.
Apex DB was built specifically for this class of problem. It is an event-sourced, append-only database engine written in Rust, designed for high-throughput immutable record keeping with cryptographic proof of integrity.
Core Architecture
The fundamental data structure is the ForgeJournal: an append-only Write-Ahead Log where every event is BLAKE3-signed by LuperIQ's key and linked into a Merkle chain. Each event references the hash of the previous event. Modifying any past event would break the chain at that point, making tampering detectable.
Aggregates are the primary abstraction. An aggregate is a named entity (for example,
AiCert:Site with aggregate ID riversidehvac.com) whose
current state is derived by replaying all events associated with that aggregate in
sequence. This means the current state of any site's certification can be reconstructed
at any point in history by replaying events up to a given sequence number.
Periodic snapshots accelerate reads. Rather than replaying the entire event history every time someone queries a site's certification status, the system materializes snapshots of aggregate state and replays only events since the last snapshot. The underlying event chain is preserved; the snapshot is a read optimization.
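The replay-plus-snapshot pattern can be sketched in a few lines (event shapes here are illustrative, not the Apex wire format):

```python
# Sketch of aggregate state rebuilt by event replay. A snapshot is a
# pure read optimization: replay resumes from the snapshot's sequence
# number instead of from event zero, yielding the same state.
def replay(state: dict, events: list) -> dict:
    for event in events:
        state = {**state, **event["set"]}  # apply events in order
    return state

events = [
    {"seq": 1, "set": {"status": "registered"}},
    {"seq": 2, "set": {"status": "verified"}},
    {"seq": 3, "set": {"status": "sealed", "tier": "pro"}},
]

full = replay({}, events)                 # full replay from event zero

snapshot = {"status": "verified"}         # materialized at seq 2
fast = replay(snapshot, [e for e in events if e["seq"] > 2])

print(full == fast)  # True: both paths yield the same state
```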
Scale Numbers
The current Apex DB implementation for the AI Certified domain includes:
- 16 facades covering Platform, DataCore, Security, Email, Commerce, Messaging, and Identity domains
- 64+ GraphQL resolvers across the full workspace
- 14 aggregate types specific to AI certification
- 14 GraphQL queries and 15 mutations for the AI Cert domain
- 273 tests across the full workspace (170 forge + 64 apex-demo + 37 crawler + 2 integration)
The Crawler Pipeline
The luperiq-crawler Rust crate implements the verification pipeline as
a 9-module library with a thin binary entry point. It reads from and writes to the
shared ForgeJournal directly, without a network hop to the API server. This means
the crawler and API server share a single source of truth and the crawler's writes
are immediately available to API reads.
| Module | Responsibility |
|---|---|
| fetcher | Async HTTP client. robots.txt compliance. Per-domain rate limiting (default 1000ms delay). Configurable concurrency (default 4 pages per site). |
| extractor | HTML to structured manifest. Headings, body text, images, links, structured data. No CMS access required. |
| classifier | Dict-based content classification. Deterministic keyword and pattern matching. Confidence scoring (above 0.3 = match). |
| ngram_collector | 1-to-6-word phrase frequency tracking for Dict v1.0 generation. Periodic snapshots to Apex. |
| verifier | Manifest comparison. BLAKE3 checksum verification. Field-level diff. Severity classification: clean, drift, mismatch, missing. |
| content_rules | Keyword, pattern, and structural rule matching. Hidden text, cloaking, and keyword stuffing detection. |
| scheduler | Scan queue management. Tier-based frequency enforcement. On-demand dispatch. Report-triggered re-scans. |
| config | Three-layer configuration resolution: global defaults, tier overrides, site overrides. Re-reads from journal on each scheduler tick without a restart. |
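The config module's three-layer resolution is a straightforward precedence merge. A sketch (key names here are illustrative):

```python
# Sketch of three-layer config resolution: site overrides beat tier
# overrides, which beat global defaults.
def resolve(global_cfg: dict, tier_cfg: dict, site_cfg: dict) -> dict:
    merged = dict(global_cfg)
    merged.update(tier_cfg)   # tier overrides global
    merged.update(site_cfg)   # site overrides both
    return merged

cfg = resolve(
    {"rate_limit_ms": 1000, "concurrency": 4},  # global defaults
    {"rate_limit_ms": 500},                     # tier crawls faster
    {"concurrency": 2},                         # one slow site throttled
)
print(cfg)  # {'rate_limit_ms': 500, 'concurrency': 2}
```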
API Layer
Two API surfaces are exposed:
GraphQL at /graphql. The full management API. 14 queries
and 15 mutations covering site registration, seal management, scan history, content rule
configuration, reporter management, and notification settings. Full introspection available.
Used by the admin dashboard and the WordPress plugin.
REST at /v1/*. The public API for AI systems. No API key
required for read operations. Six endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| /v1/seal/verify/{seal_id} | GET | Seal validity check. Returns status, tier, issue date, and valid-until date. |
| /v1/manifest/{domain} | GET | Redirects to the domain's ai-content.json. |
| /v1/dict/{version} | GET | Full dictionary schema JSON for the specified version. |
| /v1/dict/frequencies | GET | Aggregated n-gram frequency data from the current crawl corpus. |
| /v1/report | POST | Submit a content discrepancy report. Rate-limited per reporter. |
| /v1/stats | GET | Public aggregate statistics: total sites, active seals, total scans, current dict version. |
Full Stack
The complete technical stack from storage to user interface:
Apex DB (Rust event-sourced engine)
ForgeJournal: BLAKE3-signed, Merkle-chained, append-only WAL
14 AI Cert aggregate types
GraphQL: 14 queries, 15 mutations
REST: /v1/* public API (6 endpoints, no auth required for reads)
luperiq-crawler (Rust verification pipeline)
9 modules, 37 tests, BFS page discovery
robots.txt compliance, configurable rate limiting
BLAKE3 checksum generation per page
Nexus Plugin (PHP, WordPress admin layer)
5-tab admin dashboard
26 WP-CLI commands (wp luperiq ai-certified <command>)
Headless management for CI/CD integration
Standalone WordPress Plugin
Available for any WordPress site regardless of host
Free tier manifest generation
Connects to LuperIQ Central for verification and certification
7. The Economics
For Website Owners
The cost reduction is immediate and proportional to the number of AI crawlers that adopt the standard. The mechanism is simple: a crawler that reads a manifest and compares checksums does not need to fetch any page whose checksum has not changed. For a site whose content changes infrequently, this means most page fetches are replaced by a single manifest read.
Concrete calculation for a 100-page site with 50 AI crawlers, assuming daily crawls and a 10% content change rate per day:
| Metric | Without AI Certified | With AI Certified |
|---|---|---|
| Requests per crawler per day | 100 page fetches | 1 manifest read + 10 page fetches |
| Total requests from 50 crawlers | 5,000 per day | 550 per day |
| Bandwidth (150KB avg page, 2KB manifest) | 750 MB per day | ~76 MB per day |
| Effective request reduction | — | 89% fewer requests |
For sites with lower content change rates, the savings are higher. A site that updates 5 pages per month receives 50 manifest reads per day from 50 crawlers rather than 5,000 page fetches. That is a 99% reduction in AI-generated server load. The site still serves content to human visitors normally. The bandwidth cost it was paying to educate AI systems about content that had not changed is eliminated.
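The 89% figure can be reproduced directly from the scenario's parameters:

```python
# Reproducing the table above: 50 crawlers, 100 pages, daily crawls,
# 10% of pages changed per day.
CRAWLERS, PAGES, CHANGE_RATE = 50, 100, 0.10

without = CRAWLERS * PAGES                        # 5,000 fetches/day
with_cert = CRAWLERS * (1 + PAGES * CHANGE_RATE)  # manifest + changed pages
reduction = 1 - with_cert / without

print(int(with_cert), f"{reduction:.0%}")  # 550 89%
```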
Beyond bandwidth, the seal functions as a trust signal for human visitors. A verified AI Certified seal communicates that the site's content is independently confirmed, current, and structured for AI discoverability. The SSL padlock became an expected trust signal for e-commerce transactions. The AI Certified seal represents the equivalent signal for AI-mediated discovery.
For AI Companies
The economics for AI companies are equally direct. Consider a retrieval-augmented generation system that maintains a corpus of 10 million websites:
- Manifest-first crawling. The system fetches ai-content.json for each site. For certified sites, it compares per-page checksums against its cached versions. It only fetches pages whose checksums have changed. The crawler fleet shrinks in proportion to the share of the corpus that is certified.
- No extraction pipeline for classified content. A certified manifest contains pre-structured, pre-classified content. There is no HTML to parse, no navigation to strip, no NLP classification to run. The data is ready for indexing as received.
- Reliable cache keys. BLAKE3 checksums are ideal cache keys. An AI system that caches a page manifest with its checksum does not need a time-to-live policy. The cache entry is valid until the checksum in the site's manifest changes. No re-fetch is required to validate the cache.
- Feedback loop participation. Reporting discrepancies via /v1/report builds reporter reputation and keeps the verification system accurate. AI companies that participate get faster responses to their reports and contribute to a resource they depend on.
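Manifest-first crawling reduces to a checksum diff. A minimal sketch, assuming each manifest entry carries a `url` and a BLAKE3 `checksum` field (the field names here are illustrative, not the published schema):

```python
def pages_to_fetch(manifest: dict, cache: dict) -> list:
    """Return URLs whose manifest checksum differs from the cached one.

    `cache` maps URL -> last-seen checksum; URLs never seen before
    have no cache entry and are always fetched.
    """
    return [
        page["url"]
        for page in manifest.get("pages", [])
        if cache.get(page["url"]) != page["checksum"]
    ]

# Example: one unchanged page, one changed page, one new page.
manifest = {"pages": [
    {"url": "/services", "checksum": "aa11"},
    {"url": "/about",    "checksum": "bb22"},
    {"url": "/pricing",  "checksum": "cc33"},
]}
cache = {"/services": "aa11", "/about": "old99"}

stale = pages_to_fetch(manifest, cache)   # ["/about", "/pricing"]
```

Everything the crawler skips (`/services` here) is a page fetch, an extraction run, and a classification pass that never happens.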
At Internet Scale
The aggregate savings at scale are significant. Model the scenario: 1 million certified sites, 50 AI crawlers, monthly full crawl cycles, average 100 pages per site.
| Scenario | HTTP Requests per Cycle |
|---|---|
| Current (no manifest standard) | 5,000,000,000 |
| With AI Certified (10% content change rate) | ~550,000,000 |
| With AI Certified (1% content change rate) | ~55,000,000 |
| Requests eliminated at 10% change rate | 4,450,000,000 |
4.45 billion fewer HTTP requests per crawl cycle. Each request involves a TCP connection, TLS handshake, HTTP round trip, HTML decompression, and parsing. The compute and energy savings at this scale are not marginal. They are structural, and they grow linearly with both the number of certified sites and the number of AI crawlers.
Environmental Consequence
The environmental benefit is a mathematical consequence of eliminating redundant computation. Fewer HTTP requests means fewer server CPU cycles, less network transit, less crawler compute, and less energy consumed by data centers at both ends of each request. At the scale of billions of eliminated requests per cycle, the energy savings are real, measurable, and ongoing. They compound as certification adoption grows.
8. Integration Guide
For Website Owners
Getting certified requires four steps and no technical background beyond the ability to add a DNS record or an HTML tag to a webpage:
- Register at luperiq.com. Provide your domain and select a tier (Free, Pro, or Enterprise).
- Verify ownership. Add the provided DNS TXT record or HTML meta tag to your site.
- Wait for the initial scan. The LuperIQ crawler indexes your site and generates your manifest.
- Add the seal badge. Paste the provided HTML snippet onto your site. The badge links to your live seal verification endpoint.
WordPress sites using the LuperIQ WordPress plugin get additional control via the admin dashboard: 5 tabs covering certification status, scan history, content rule matches, reporter activity, and settings. For headless or CI/CD environments, 26 WP-CLI commands provide full management without a browser:
wp luperiq ai-certified status
wp luperiq ai-certified scan trigger
wp luperiq ai-certified seal verify
wp luperiq ai-certified rules list
wp luperiq ai-certified reports list
For AI Companies
The integration path for an AI crawler is three HTTP calls per site per crawl cycle:
# 1. Verify the site is certified and check its current tier
GET https://api.luperiq.com/v1/seal/verify/7f3a9c2e
# Response
{
"valid": true,
"seal_id": "7f3a9c2e",
"domain": "riversidehvac.com",
"tier": "pro",
"status": "active",
"issued_at": "2026-02-20T06:00:00Z",
"valid_until": "2026-02-27T06:00:00Z"
}
# 2. Fetch the manifest and compare checksums against cached values
GET https://riversidehvac.com/.well-known/ai-content.json
# 3. Fetch only pages whose checksums differ from your cache
GET https://riversidehvac.com/.well-known/ai-content/services.json
No SDK is required, and no API key is needed for read operations. The entire integration is standard HTTP. CORS headers (Access-Control-Allow-Origin: *) are set on all /v1/* endpoints.
AI systems that encounter content discrepancies should submit reports via
POST /v1/report. Accurate reporting builds reporter reputation and
accelerates response times for future reports from the same system.
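A discrepancy report is an ordinary JSON POST. The sketch below shows plausible payload construction; the field names are assumptions for illustration, not the published /v1/report schema:

```python
import json

# Illustrative report body: "the page I fetched does not match the
# checksum the manifest advertised." Field names are assumed, not
# taken from the published API schema.
report = {
    "domain": "riversidehvac.com",
    "url": "/services",
    "expected_checksum": "aa11",   # from the site's manifest
    "observed_checksum": "dd44",   # computed from the fetched page
    "observed_at": "2026-02-21T09:30:00Z",
}
body = json.dumps(report)
# POST `body` to https://api.luperiq.com/v1/report
# with Content-Type: application/json.
```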
For Developers
The full GraphQL API at /graphql exposes 14 queries and 15 mutations
covering every certification management operation. Full schema introspection is available.
The API supports bulk operations for Enterprise-tier integrations.
Content rules are configurable via API. Developers building on top of the platform can add custom keyword lists, pattern rules, and structural checks for their specific use cases. The configuration hierarchy (global, tier, site) allows precise scoping of rules without affecting other sites on the platform.
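The global → tier → site resolution order amounts to a layered override merge, where later layers win. A sketch (helper name and setting keys illustrative):

```python
def resolve_config(global_cfg: dict, tier_cfg: dict, site_cfg: dict) -> dict:
    """Resolve one effective config: global defaults, overridden by
    tier settings, overridden by site settings."""
    return {**global_cfg, **tier_cfg, **site_cfg}

resolved = resolve_config(
    {"max_pages": 500, "rate_limit_ms": 1000},  # global defaults
    {"max_pages": 2000},                        # tier override
    {"rate_limit_ms": 250},                     # site override
)
# resolved == {"max_pages": 2000, "rate_limit_ms": 250}
```

Because each layer only states what it overrides, adjusting one site never touches tier or global settings, which is the scoping property the paragraph above describes.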
The public dictionary schema at /v1/dict/{version} is designed for
consumption without authentication. AI systems that want to validate their own
classification results against the LuperAI vocabulary can fetch the schema and
run local comparisons. The n-gram frequency data at /v1/dict/frequencies
is available for research use.
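Reproducing this kind of n-gram frequency data locally is straightforward. A sketch using the crawler's default 1-to-6-word range (see the configuration defaults in Appendix C):

```python
from collections import Counter

def ngram_frequencies(text: str, n_min: int = 1, n_max: int = 6) -> Counter:
    """Count word n-grams of length n_min..n_max over whitespace-split,
    lowercased text. The 1..6 range matches the crawler default."""
    words = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

freqs = ngram_frequencies("emergency hvac repair emergency hvac service")
freqs["emergency hvac"]   # appears twice in the sample text
```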
9. Roadmap
The current system is production-capable. The roadmap addresses scale, observability, and ecosystem expansion:
Near Term
- Multi-tier configuration hierarchy. Full three-layer override resolution (global defaults, tier overrides, site overrides) surfaced through admin UI. Administrators can adjust scan parameters for individual domains without touching global settings. Overrides can be time-bounded with automatic reversion.
- Scheduled scan queue with priority ordering. Enterprise sites receive priority queue placement. On-demand scan requests from site owners and AI-triggered re-scans are prioritized over routine scheduled crawls.
- Email and webhook notifications. The Apex notification event layer is in place. Delivery via email and outbound webhooks is the next integration layer, enabling sites to wire certification status changes into their own monitoring systems.
Medium Term
- Historical scan trend analysis. Compliance dashboards showing content drift over time, section-level change frequency, and classification stability. Site owners can compare any two scan results and see exactly what changed.
- LuperAI Dictionary v1.0. Data-driven vocabulary generated from n-gram frequency analysis across the crawled corpus. Replaces the current hand-curated v0.1 with a vocabulary grounded in real-world content patterns.
- Crawler-generated manifests for non-WordPress sites. Any site on any platform registers at luperiq.com and receives a hosted manifest generated entirely by the LuperIQ crawler. No plugin required.
Longer Term
- Cross-site content deduplication signals. When the crawler encounters content that is substantially identical across many sites, the manifest can include a deduplication signal. AI systems detect syndicated or duplicated content without fetching every site independently.
- Broader industry dictionary coverage. Dict v2 and beyond, expanding from small-business verticals into professional services, healthcare, finance, education, and other sectors with sufficient crawl data.
- Standardization proposal. Submit /.well-known/ai-content.json as a proposed standard to relevant working groups. Open-source the dictionary schema. Engage major AI providers to formally recognize the seal in their crawler policies.
- Multi-machine deployment. Journal replication for horizontal crawler scaling. Multiple crawler instances coordinated through a shared Apex journal with conflict resolution semantics.
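The deduplication signal falls out of data the system already has: identical content produces identical checksums. A sketch of how an AI consumer might group syndicated content across sites (data shape illustrative):

```python
from collections import defaultdict

def duplicate_groups(site_pages: dict) -> dict:
    """Map checksum -> locations sharing that content.

    `site_pages` maps domain -> {url: checksum}. A checksum seen at
    more than one location indicates syndicated or duplicated content.
    """
    groups = defaultdict(list)
    for domain, pages in site_pages.items():
        for url, checksum in pages.items():
            groups[checksum].append(f"{domain}{url}")
    return {c: locs for c, locs in groups.items() if len(locs) > 1}

dupes = duplicate_groups({
    "a.com": {"/press": "e1f2", "/about": "1234"},
    "b.com": {"/news":  "e1f2"},
})
# dupes == {"e1f2": ["a.com/press", "b.com/news"]}
```

An AI system holding this grouping can fetch syndicated content once instead of once per hosting site.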
10. Conclusion
The current scraping model is the product of an early internet where crawling was the only available mechanism for content discovery. It was never designed for a world where 50 competing AI companies each independently need structured, classified, trustworthy data from the same millions of websites. The costs of this design are real: website owners pay for bandwidth they did not invite, AI companies duplicate work they could share, and the aggregate compute and energy consumption is substantial.
AI Certified is a replacement model built on a different assumption: websites can publish structured, machine-readable content once, and AI systems can read it efficiently rather than approximating it through repeated scraping. The manifest standard is simple enough to implement in an afternoon and immediately valuable to every party that adopts it. The verification layer creates the trust foundation that makes self-reported manifests reliable rather than gameable.
The infrastructure supporting this is already in production. Apex DB provides a cryptographically auditable certification record that no party can retroactively alter. The LuperIQ crawler independently verifies every claim. The LuperAI Dictionary provides a shared vocabulary for content classification that eliminates the need for every AI system to solve that problem independently.
The technology exists. The math is sound. A 100-page site with 50 AI crawlers generates 5,000 page requests per crawl cycle today. With AI Certified, it generates 50 manifest reads, plus fetches only for the pages that actually changed. For unchanged content, that is 4,950 requests that did not need to happen. Multiply that across the web, and the case for adoption is not a prediction about future impact. It is arithmetic about the present.
The only open question is how quickly the ecosystem moves from the current model to a better one.
Appendix: Technical Reference
A. Apex DB Aggregate Types (AI Cert Domain)
| Aggregate | Purpose |
|---|---|
| AiCert:Site | Registered site record and ownership verification status |
| AiCert:Seal | Cryptographically signed certification seal |
| AiCert:Scan | Crawl run record with per-page results |
| AiCert:Manifest | Generated page manifest with Dict classifications |
| AiCert:Violation | Content discrepancy or policy breach record |
| AiCert:Report | AI-submitted discrepancy report |
| AiCert:Reporter | AI crawler reputation tracking (0.0 to 2.0 scale) |
| AiCert:Notification | Owner notification events |
| AiCert:Config | Global crawler configuration |
| AiCert:TierConfig | Per-tier setting overrides |
| AiCert:SiteConfig | Per-site setting overrides |
| AiCert:ContentRule | Content watchlist rules |
| AiCert:ContentRuleMatch | Rule match results per scan |
| AiDict:Frequencies | N-gram frequency snapshots for Dict v1.0 generation |
B. Key Apex Events
| Event | Aggregate | Trigger |
|---|---|---|
| SiteRegistered | AiCert:Site | Site owner completes registration |
| OwnershipVerified | AiCert:Site | DNS or meta tag confirmation passes |
| VerificationScheduled | AiCert:Scan | Scan added to queue |
| VerificationCompleted | AiCert:Scan | Crawler finishes all pages for a site |
| ManifestGenerated | AiCert:Manifest | Page manifest created or updated |
| SealIssued | AiCert:Seal | Clean or drift scan result |
| SealRevoked | AiCert:Seal | Malicious content or unresolved violation |
| SealSuspended | AiCert:Seal | Second violation within penalty window |
| ViolationDetected | AiCert:Violation | Content mismatch or content rule match |
| ViolationResolved | AiCert:Violation | Problem fixed within grace period |
| AiReport | AiCert:Report | POST /v1/report from an AI crawler |
| AiReporterCredited | AiCert:Report | Confirmed or dismissed report resolution |
| DictVersionPublished | AiDict | New vocabulary version released |
C. Crawler Configuration Defaults
| Setting | Default | Scope |
|---|---|---|
| Request timeout | 30 seconds | Global |
| Rate limit delay | 1000ms between requests per domain | Global |
| Page concurrency per site | 4 | Global |
| Simultaneous sites | 2 | Global |
| Max crawl depth | 3 | Global |
| Max pages per scan | 500 | Tier |
| Max redirects | 5 | Global |
| Retry backoff | 1h / 4h / 24h, then abandon | Global |
| Report rate limit | 10 per reporter per hour | Global |
| Stats cache TTL | 5 minutes | Global |
| N-gram range | 1 to 6 words | Global |
| Scheduler tick | 60 seconds | Global |
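The retry backoff default above (1h, 4h, 24h, then abandon) can be expressed as a simple schedule lookup. A sketch; the function name is illustrative, not part of the crawler's API:

```python
# Default retry schedule from the configuration table above.
RETRY_SCHEDULE_HOURS = [1, 4, 24]

def next_retry_delay(attempt: int):
    """Hours to wait before retry number `attempt` (0-based).
    Returns None once the schedule is exhausted, meaning abandon."""
    if 0 <= attempt < len(RETRY_SCHEDULE_HOURS):
        return RETRY_SCHEDULE_HOURS[attempt]
    return None

delays = [next_retry_delay(i) for i in range(4)]   # [1, 4, 24, None]
```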
D. Contact and Resources
- Platform: luperiq.com
- Seal verification: https://api.luperiq.com/v1/seal/verify/{seal_id}
- Manifest lookup: https://api.luperiq.com/v1/manifest/{domain}
- Dictionary schema: https://api.luperiq.com/v1/dict/latest
- Public statistics: https://api.luperiq.com/v1/stats
- Report a discrepancy: POST https://api.luperiq.com/v1/report
