Self-Healing Architecture

Legacy B2B portals and dynamic retail sites break simple scrapers. ShopGraph uses a three-tier fallback pipeline to keep extraction working, automatically routing around Cloudflare blocks, missing semantic data, and JavaScript-rendered DOMs.

[Interactive diagram: ShopGraph Routing Engine. Nodes: API Request, Cache Check, Tool Router, Schema.org, LLM, Playwright, Result. Readouts: Method, Latency, Cost, Confidence. The "Simulate Extraction" control visualizes the engine's decision path for the target URL.]

The Escalation Logic

The engine makes deterministic routing decisions to balance speed, cost, and accuracy without requiring downstream intervention.

Tier 1: Semantic Extraction (schema_org)

The engine always attempts the fastest route first. If the target URL contains valid Schema.org markup (for example, embedded as JSON-LD), ShopGraph extracts and maps the data directly, typically in under a second.
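
As an illustration of what a Tier 1 pass involves, the sketch below pulls JSON-LD blocks out of an HTML string and returns the first Schema.org Product object. It is a minimal example built on the Python standard library, not ShopGraph's implementation: the names JsonLdCollector and find_product are hypothetical, and it ignores cases such as @graph wrappers or list-valued @type.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_product(html):
    """Return the first Schema.org Product object found in JSON-LD, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed markup: a pipeline like this one would escalate
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return item
    return None
```

A None result here corresponds to the "missing or malformed" condition that would hand the page to the next tier.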

Tier 2: Probabilistic Fallback (llm)

If semantic data is missing, malformed, or incomplete, the engine passes the raw DOM to an LLM, which structures it into Universal Commerce Protocol (UCP) JSON.

Tier 3: The Infrastructure Hammer (hybrid)

If the HTTP request hits a 403 (e.g., Cloudflare, Datadome) or requires JavaScript execution to reveal variant pricing, the engine spins up a headless Chrome instance to solve the challenge and render the page before passing the resolved DOM back to the LLM.
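
The three tiers above can be read as a small decision function. The sketch below is one possible reading, not the engine's actual code: FetchResult and its fields are hypothetical names, and checking the Tier 3 conditions first is an assumption made because a blocked or unrendered page offers no DOM for the cheaper tiers to inspect.

```python
from dataclasses import dataclass


@dataclass
class FetchResult:
    """Hypothetical summary of a plain HTTP fetch; the field names are assumptions."""
    status_code: int
    has_valid_schema_org: bool
    needs_javascript: bool


def choose_extraction_method(fetch):
    """Map a fetch outcome to one of the three tiers described above."""
    # Tier 3: blocked requests and JS-dependent pages need a rendered DOM.
    if fetch.status_code == 403 or fetch.needs_javascript:
        return "hybrid"
    # Tier 1: complete semantic markup is the fastest path.
    if fetch.has_valid_schema_org:
        return "schema_org"
    # Tier 2: markup is missing, malformed, or incomplete.
    return "llm"
```

Because the function depends only on its inputs, the same fetch outcome always yields the same tier, which is the sense in which the routing is deterministic even though the LLM step is not.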

Latency & Cost Tradeoffs

Predictable orchestration requires transparent tradeoffs. Each escalation raises the odds of returning data, but it adds latency and cost.

Extraction Method    Base Latency  Compute Cost  Primary Trigger
Schema.org           < 1s          $0.000        Standard catalog ingestion
LLM Only             ~3s           $0.001        Missing semantic markup
Hybrid (Playwright)  5-10s         $0.005        Bot protection or JS-rendered matrices
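
To see how these figures add up across a batch, the sketch below totals cost and worst-case sequential latency for a given mix of tiers. It assumes the costs in the table are per extraction and uses the upper bound of each latency range; the TIERS mapping and estimate_batch are hypothetical helpers, and the mix in the usage note is invented for illustration.

```python
# Figures copied from the table above; latency uses the upper bound of each range.
# Treating the cost as "per extraction" is an assumption.
TIERS = {
    "schema_org": {"cost_usd": 0.000, "latency_s": 1.0},
    "llm": {"cost_usd": 0.001, "latency_s": 3.0},
    "hybrid": {"cost_usd": 0.005, "latency_s": 10.0},
}


def estimate_batch(counts):
    """Return (total cost in USD, total sequential latency in seconds) for a batch."""
    cost = sum(TIERS[method]["cost_usd"] * n for method, n in counts.items())
    latency = sum(TIERS[method]["latency_s"] * n for method, n in counts.items())
    return round(cost, 6), latency
```

For a hypothetical 1,000-URL batch split 700 / 200 / 100 across the tiers, estimate_batch({"schema_org": 700, "llm": 200, "hybrid": 100}) comes to about $0.70 and at most 2,300 seconds if run one request at a time, with the 100 hybrid requests accounting for most of both.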

Developer Control (max_cache_age)

You control the latency tradeoff. By default, ShopGraph forces a fresh extraction so that pricing is accurate. For bulk catalog ingestion where real-time inventory is not critical, pass max_cache_age: 86400 (seconds) in your request to accept data cached within the last 24 hours and skip the blocking Playwright render entirely.
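
The sketch below shows one way a client might set this parameter and how a freshness check against it could work. Only the max_cache_age name and the 86400 value come from the text above; the request body shape, the url key, and both helper functions are assumptions for illustration.

```python
import time

ONE_DAY_SECONDS = 24 * 60 * 60  # 86400, the value quoted above


def build_request(url, max_cache_age=None):
    """Build a hypothetical request body; only max_cache_age comes from the text."""
    body = {"url": url}
    if max_cache_age is not None:
        body["max_cache_age"] = max_cache_age
    return body


def cache_is_acceptable(cached_at, max_cache_age, now=None):
    """True when a cached entry is young enough for the caller's max_cache_age."""
    if max_cache_age is None:
        return False  # default described above: force a fresh extraction
    now = time.time() if now is None else now
    return (now - cached_at) <= max_cache_age
```

Leaving max_cache_age unset keeps the default fresh-extraction behavior, so a caller opts in to cached data explicitly on the requests where staleness is tolerable.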

Payload Transparency

Your downstream orchestrator must know how the data was acquired. ShopGraph attaches a _shopgraph metadata object to every response, allowing you to programmatically branch your agent's logic based on the extraction method used.

response.json
{
  "product": { "..." },
  "_shopgraph": {
    "extraction_method": "hybrid",
    "confidence_method": "llm_validated",
    "field_confidence": {
      "product_name": 0.75,
      "price": 0.70,
      "brand": 0.75
    },
    "latency_ms": 8450
  }
}
If Cloudflare presents an unsolvable CAPTCHA, the pipeline fails predictably with a 403 and reason: captcha_required, allowing your agent to route to a human reviewer rather than hanging on a timeout.
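
A downstream agent could branch on this metadata along the following lines. This is a sketch under stated assumptions rather than a recommended policy: next_action and its return labels are hypothetical, the 0.70 threshold is an arbitrary example value, and the placement of reason as a top-level key in the error body is assumed.

```python
def next_action(status_code, body, min_confidence=0.70):
    """Decide what a downstream agent does with a response.

    min_confidence is a hypothetical threshold, not a ShopGraph default.
    """
    # Predictable failure: hand off to a person instead of retrying.
    if status_code == 403 and body.get("reason") == "captcha_required":
        return "human_review"
    meta = body.get("_shopgraph", {})
    confidences = meta.get("field_confidence", {})
    # Flag any field whose confidence falls below the caller's threshold.
    low_fields = sorted(f for f, c in confidences.items() if c < min_confidence)
    if low_fields:
        return "verify:" + ",".join(low_fields)
    # Semantic extraction is accepted as-is; LLM-derived data gets an audit flag.
    if meta.get("extraction_method") == "schema_org":
        return "accept"
    return "accept_with_audit"
```

With the sample response above and the threshold left at 0.70, no field falls below the bar and the hybrid method yields "accept_with_audit"; raising the threshold to 0.75 would flag price for verification.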

Continuous Calibration

The self-healing system does not just react; it measures its own accuracy. Every extraction tier is continuously benchmarked via an LLM-as-validator pipeline across a static 208-URL B2B/B2C corpus.

Read the full calibration methodology →