Self-Healing Architecture
Legacy B2B portals and dynamic retail sites break simple scrapers. ShopGraph uses a three-tier fallback pipeline to guarantee extraction, automatically routing around Cloudflare blocks, missing semantic data, and JavaScript-rendered DOMs.
Request
Check
Router
The Escalation Logic
The engine makes deterministic routing decisions to balance speed, cost, and accuracy without requiring downstream intervention.
Tier 1: Semantic Extraction (schema_org)
The engine always attempts the fastest route first. If the target URL contains valid JSON-LD or Schema.org markup, ShopGraph extracts and maps the data instantly.
Tier 2: Probabilistic Fallback (llm)
If semantic data is missing, malformed, or incomplete, the engine passes the raw DOM to the LLM to structure the Universal Commerce Protocol (UCP) JSON.
Tier 3: The Infrastructure Hammer (hybrid)
If the HTTP request hits a 403 (e.g., Cloudflare, Datadome) or requires JavaScript execution to reveal variant pricing, the engine spins up a headless Chrome instance to solve the challenge and render the page before passing the resolved DOM back to the LLM.
Latency & Cost Tradeoffs
Predictable orchestration requires transparent physics. Escalation guarantees data, but it introduces latency.
| Extraction Method | Base Latency | Compute Cost | Primary Trigger |
|---|---|---|---|
| Schema.org | < 1s | $0.000 | Standard catalog ingestion |
| LLM Only | ~3s | $0.001 | Missing semantic markup |
| Hybrid (Playwright) | 5-10s | $0.005 | Bot protection or JS-rendered matrices |
Developer Control (max_cache_age)
You control the latency tradeoff. By default, ShopGraph forces a fresh extraction for accurate pricing. For bulk catalog ingestion where real-time inventory is not critical, pass max_cache_age: 86400 in your request to accept 24-hour cached data and bypass the Playwright sync block entirely.
Payload Transparency
Your downstream orchestrator must know how the data was acquired. ShopGraph attaches a _shopgraph metadata object to every response, allowing you to programmatically branch your agent's logic based on the extraction method used.
{
"product": { "..." },
"_shopgraph": {
"extraction_method": "hybrid",
"confidence_method": "llm_validated",
"field_confidence": {
"product_name": 0.75,
"price": 0.70,
"brand": 0.75
},
"latency_ms": 8450
}
}
403 and reason: captcha_required, allowing your agent to route to a human reviewer rather than hanging on a timeout.
Continuous Calibration
The self-healing system does not just react; it measures its own accuracy. Every extraction tier is continuously benchmarked via an LLM-as-validator pipeline across a static 208-URL B2B/B2C corpus.
Read the full calibration methodology →