How Extraction Works
B2B portals and dynamic retail sites require multiple extraction strategies. ShopGraph uses a three-tier pipeline that escalates from Schema.org markup to LLM inference to a headless browser, covering pages with missing structured data, JavaScript-rendered content, and access controls that block plain HTTP requests. Confidence scores are reported per field and reflect the tier that produced the data.
Design Principle: Full provenance. Every response shows which extraction method produced each field, so your downstream systems can make informed decisions.
Extraction Pipeline Overview
The Escalation Logic
The engine makes structured routing decisions to balance speed, cost, and accuracy without requiring downstream intervention.
Tier 1: Semantic Extraction (schema_org)
The engine always attempts the fastest route first. If the target URL contains valid JSON-LD or Schema.org markup, ShopGraph extracts and maps the data instantly.
Free tier includes 50 full-pipeline extractions per month. See Pricing for details.
Tier 2: LLM Extraction (llm)
If semantic data is missing, malformed, or incomplete, the engine passes the raw DOM to the LLM, which structures it into Universal Commerce Protocol (UCP) JSON.
Tier 3: Full Rendering (hybrid)
If the HTTP request hits a 403 (e.g., Cloudflare, DataDome), or the page requires JavaScript execution to reveal variant pricing, the engine spins up a headless Chrome instance and renders the page with full JavaScript execution. The resolved DOM is then passed back to the LLM.
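The routing described in the three tiers above can be sketched as a small decision function. This is a minimal illustration, not ShopGraph's actual implementation: the `FetchOutcome` shape, its field names, and `chooseTier` are hypothetical, and only the three method labels (`schema_org`, `llm`, `hybrid`) come from this page.

```typescript
// Method labels as they appear in this document.
type ExtractionMethod = "schema_org" | "llm" | "hybrid";

// Hypothetical summary of what a plain (non-browser) HTTP fetch revealed.
interface FetchOutcome {
  status: number;             // HTTP status of the plain request
  hasValidSchemaOrg: boolean; // valid JSON-LD / Schema.org markup was found
  schemaOrgComplete: boolean; // the markup covers the fields that are needed
  needsJavaScript: boolean;   // e.g. variant pricing only appears after JS runs
}

// Assumed ordering: try the fastest tier first, then escalate.
function chooseTier(page: FetchOutcome): ExtractionMethod {
  const blocked = page.status === 403;
  // Tier 1: usable semantic markup on a page we could actually fetch.
  if (!blocked && page.hasValidSchemaOrg && page.schemaOrgComplete) {
    return "schema_org";
  }
  // Tier 3: access controls or JS-dependent content need a rendered DOM.
  if (blocked || page.needsJavaScript) {
    return "hybrid";
  }
  // Tier 2: markup is missing, malformed, or incomplete, but the raw DOM is usable.
  return "llm";
}
```

Keeping the decision in one pure function makes the escalation order easy to test in isolation from any network or browser code.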
Latency & Cost Tradeoffs
Escalation improves coverage, but each tier adds latency and cost.
| Extraction Method | Base Latency | Compute Cost | Primary Trigger |
|---|---|---|---|
| Schema.org | < 1s | $0.000 | Standard catalog ingestion |
| LLM Only | ~3s | $0.001 | Missing semantic markup |
| Hybrid (Playwright) | 5-10s | $0.005 | CDN access controls or JS-rendered content |
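To see what these per-extraction costs mean for a batch, the figures in the table can be combined into a rough estimate. The helper name `estimateBatchCostUsd` and the tier mix below are made up for illustration; only the three cost values are taken from the table.

```typescript
type Method = "schema_org" | "llm" | "hybrid";

// Compute cost per extraction from the table, in thousandths of a dollar.
// Integers avoid floating-point drift when summing many small amounts.
const COST_MILLS: Record<Method, number> = {
  schema_org: 0, // $0.000
  llm: 1,        // $0.001
  hybrid: 5,     // $0.005
};

// Hypothetical helper: estimate batch cost in dollars from URL counts per tier.
function estimateBatchCostUsd(counts: Record<Method, number>): number {
  const mills =
    counts.schema_org * COST_MILLS.schema_org +
    counts.llm * COST_MILLS.llm +
    counts.hybrid * COST_MILLS.hybrid;
  return mills / 1000;
}

// Illustrative mix (invented numbers): 600 Schema.org, 300 LLM, 100 hybrid.
const estimate = estimateBatchCostUsd({ schema_org: 600, llm: 300, hybrid: 100 });
console.log(estimate); // 0.8
```

In this invented mix the 100 hybrid extractions account for $0.50 of the $0.80 total, which is why the share of pages that need rendering tends to dominate the bill.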
Developer Control (max_cache_age)
You control the latency tradeoff. By default, ShopGraph forces a fresh extraction for accurate pricing. For bulk catalog ingestion where real-time inventory is not critical, pass max_cache_age: 86400 in your request to accept 24-hour cached data and skip the Playwright rendering step entirely.
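As a sketch of how a client might choose this setting per workload, the helper below returns the option only for bulk jobs. The `EnrichOptions` type, the `optionsFor` function, and the workload labels are hypothetical; this page confirms only the parameter name `max_cache_age` and the value 86400, not how options are passed to the SDK.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86400

// Assumed request-option shape; only max_cache_age is named in this document.
interface EnrichOptions {
  max_cache_age?: number; // maximum acceptable cache age, in seconds
}

// Hypothetical workload labels used to pick the freshness tradeoff.
type Workload = "realtime_pricing" | "bulk_catalog";

function optionsFor(workload: Workload): EnrichOptions {
  // Bulk ingestion accepts 24-hour cached data; real-time pricing keeps the
  // default behavior, which forces a fresh extraction.
  return workload === "bulk_catalog" ? { max_cache_age: SECONDS_PER_DAY } : {};
}

console.log(optionsFor("bulk_catalog")); // { max_cache_age: 86400 }
```

How the resulting object is attached to a request (for example as a second argument or as part of a request body) is not specified here, so check the API reference before wiring it in.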
Payload Transparency
Your downstream orchestrator must know how the data was acquired. ShopGraph attaches a _shopgraph metadata object to every response, allowing you to programmatically branch your agent's logic based on the extraction method used.
{
  "product": { "..." },
  "_shopgraph": {
    "extraction_method": "hybrid",
    "confidence_method": "tier_baseline",
    "field_confidence": {
      "product_name": 0.75,
      "price": 0.70,
      "brand": 0.75
    },
    "latency_ms": 8450
  }
}
When a page cannot be extracted because it demands a CAPTCHA, the response carries a 403 and reason: captcha_required, allowing your agent to route to a human reviewer rather than hanging on a timeout.
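Branching on this metadata might look like the following sketch. The `_shopgraph` field names follow the sample payload above, but the `Outcome` envelope, the `decideNextStep` function, the step labels, and the 0.8 price threshold are assumptions made for illustration.

```typescript
// Field names follow the sample _shopgraph object shown above.
interface ShopGraphMeta {
  extraction_method: "schema_org" | "llm" | "hybrid";
  confidence_method: string;
  field_confidence: Record<string, number>;
  latency_ms: number;
}

// Assumed envelope: this page does not show the full success or error shape.
type Outcome =
  | { ok: true; meta: ShopGraphMeta }
  | { ok: false; status: number; reason: string };

// Hypothetical next steps for a downstream agent.
type NextStep = "use_data" | "verify_price" | "human_review" | "retry_later";

function decideNextStep(outcome: Outcome, minPriceConfidence = 0.8): NextStep {
  if (!outcome.ok) {
    // A CAPTCHA block goes to a person instead of waiting on a timeout.
    return outcome.status === 403 && outcome.reason === "captcha_required"
      ? "human_review"
      : "retry_later";
  }
  // A missing price confidence is treated as zero, so it always triggers a check.
  const priceConfidence = outcome.meta.field_confidence["price"] ?? 0;
  return priceConfidence >= minPriceConfidence ? "use_data" : "verify_price";
}
```

With the sample payload (price confidence 0.70) and the illustrative 0.8 threshold, this returns "verify_price"; the threshold is a policy choice for your own system, not a ShopGraph default.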
Continuous Calibration
The extraction pipeline does not just react; it measures its own accuracy. Every extraction tier is continuously benchmarked via an LLM-as-validator pipeline across a static 208-URL B2B/B2C corpus.
Read the full calibration methodology →
Example: Fire-and-Forget Extraction
Your agent sends a URL. It does not need to know whether the data came from Schema.org markup, LLM extraction, or Playwright rendering. The pipeline handles tier selection. The agent gets structured data back with metadata showing which path succeeded.
const result = await shopgraph.enrich(
  'https://www.uline.com/Product/Detail/S-19318/Kraft-Paper/30-lb-Kraft-Paper-Sheets-8-1-2-x-11'
);
// Agent uses the data without caring about extraction tier
console.log(result.product.title);
// "30 lb Kraft Paper Sheets - 8 1/2 x 11"
// But the metadata is there if you need it
console.log(result._shopgraph.extraction_method);
// "hybrid" — Playwright was needed, agent never had to know
The agent's code is the same regardless of extraction complexity. Tier selection is the pipeline's problem.