How Extraction Works

B2B portals and dynamic retail sites require multiple extraction strategies. ShopGraph uses a three-tier pipeline that escalates from Schema.org parsing to LLM inference to headless-browser rendering, so it can handle missing structured data, JavaScript-rendered content, and sites that block plain HTTP requests. Each tier reports a confidence score that reflects the method used.

Design Principle: Full provenance. Every response shows which extraction method produced each field, so your downstream systems can make informed decisions.

Extraction Pipeline Overview

See it in action: The Playground extracts real product data and shows which extraction method produced each field. Try it with any product URL, then drag the threshold slider to see which fields your automation would keep or discard.

The Escalation Logic

The engine makes structured routing decisions to balance speed, cost, and accuracy without requiring downstream intervention.

Tier 1: Semantic Extraction (schema_org)

The engine always attempts the fastest route first. If the target page contains valid Schema.org markup (for example, JSON-LD), ShopGraph extracts and maps the data directly, typically in under a second.

Free tier includes 50 full-pipeline extractions per month. See Pricing for details.

Tier 2: LLM Extraction (llm)

If semantic data is missing, malformed, or incomplete, the engine passes the raw DOM to an LLM, which structures it into Universal Commerce Protocol (UCP) JSON.

Tier 3: Full Rendering (hybrid)

If the HTTP request returns a 403 (e.g., from Cloudflare or Datadome), or the page requires JavaScript execution to reveal variant pricing, the engine launches a headless Chrome instance, renders the page, and passes the resolved DOM back to the LLM.
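
The routing across the three tiers can be sketched as a small decision function. This is an illustrative sketch, not ShopGraph's internal code: the function name selectExtractionMethod and the signal fields (httpStatus, hasCompleteSchemaOrg, needsJavaScript) are assumptions, and so is the exact order of the checks. Only the returned method names (schema_org, llm, hybrid) come from the tier descriptions.

```javascript
// Hypothetical routing sketch. The signal names are illustrative and are
// not part of ShopGraph's API.
function selectExtractionMethod(signals) {
  const { httpStatus, hasCompleteSchemaOrg, needsJavaScript } = signals;
  const blocked = httpStatus === 403;

  // Tier 1: valid, complete markup is the fastest path.
  if (!blocked && hasCompleteSchemaOrg) return 'schema_org';

  // Tier 3: a blocked request or JS-dependent content needs full rendering.
  if (blocked || needsJavaScript) return 'hybrid';

  // Tier 2: markup is missing, malformed, or incomplete.
  return 'llm';
}

console.log(selectExtractionMethod({
  httpStatus: 200,
  hasCompleteSchemaOrg: false,
  needsJavaScript: false,
})); // llm
```

In practice these signals would come from the fetch and parse steps themselves; the sketch only shows how the triggers map to a method name.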

Latency & Cost Tradeoffs

Each escalation tier adds latency and cost in exchange for broader coverage.

Extraction Method	Base Latency	Compute Cost	Primary Trigger
Schema.org	< 1s	$0.000	Standard catalog ingestion
LLM Only	~3s	$0.001	Missing semantic markup
Hybrid (Playwright)	5-10s	$0.005	CDN access controls or JS-rendered content
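
To see how the table translates into a budget, the sketch below estimates compute cost for a batch, assuming the listed costs apply per extraction. The batch size and tier mix are made-up numbers for illustration, and estimateBatchCost is a hypothetical helper, not part of any ShopGraph SDK.

```javascript
// Compute cost per extraction from the table, in thousandths of a dollar.
// Integer units avoid floating-point drift when summing.
const COST_MILLS = new Map([
  ['schema_org', 0],
  ['llm', 1],
  ['hybrid', 5],
]);

// Hypothetical helper: `counts` maps an extraction method to a URL count.
function estimateBatchCost(counts) {
  let mills = 0;
  for (const [method, n] of Object.entries(counts)) {
    if (!COST_MILLS.has(method)) {
      throw new Error(`Unknown extraction method: ${method}`);
    }
    mills += COST_MILLS.get(method) * n;
  }
  return mills / 1000; // dollars
}

// Assumed mix for 10,000 URLs: 60% Schema.org, 30% LLM, 10% hybrid.
const batchCost = estimateBatchCost({ schema_org: 6000, llm: 3000, hybrid: 1000 });
console.log(batchCost.toFixed(2)); // 8.00
```

Under this assumed mix, the hybrid tier accounts for $5 of the $8 total even though it handles only a tenth of the URLs.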

Developer Control (max_cache_age)

You control the latency tradeoff. By default, ShopGraph forces a fresh extraction so that pricing is accurate. For bulk catalog ingestion where real-time inventory is not critical, pass max_cache_age: 86400 (seconds) in your request to accept data cached for up to 24 hours and skip the Playwright rendering step entirely.
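
A minimal sketch of choosing this setting per workload follows. The max_cache_age parameter and its value come from the text above; the helper name enrichOptions, the workload labels, and the idea of passing options as a second argument to enrich() are assumptions.

```javascript
const ONE_DAY_SECONDS = 24 * 60 * 60; // 86400, the value used above

// Hypothetical helper: pick request options by workload.
function enrichOptions(workload) {
  if (workload === 'bulk') {
    // Accept cached data up to 24 hours old.
    return { max_cache_age: ONE_DAY_SECONDS };
  }
  // Default behavior: no cache option, so a fresh extraction is forced.
  return {};
}

console.log(enrichOptions('bulk')); // { max_cache_age: 86400 }

// Assumed call shape, not confirmed by this page:
// const result = await shopgraph.enrich(url, enrichOptions('bulk'));
```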

Payload Transparency

Your downstream orchestrator must know how the data was acquired. ShopGraph attaches a _shopgraph metadata object to every response, allowing you to programmatically branch your agent's logic based on the extraction method used.

response.json

{
  "product": { "..." },
  "_shopgraph": {
    "extraction_method": "hybrid",
    "confidence_method": "tier_baseline",
    "field_confidence": {
      "product_name": 0.75,
      "price": 0.70,
      "brand": 0.75
    },
    "latency_ms": 8450
  }
}
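
One way to branch on this metadata is to split fields by a confidence threshold, much like the Playground's threshold slider. The sketch below uses the sample values from the response above; partitionByConfidence and the 0.72 threshold are hypothetical choices for illustration.

```javascript
// Sample metadata copied from the response above.
const meta = {
  extraction_method: 'hybrid',
  confidence_method: 'tier_baseline',
  field_confidence: { product_name: 0.75, price: 0.70, brand: 0.75 },
  latency_ms: 8450,
};

// Hypothetical helper: split field names by a confidence threshold.
function partitionByConfidence(shopgraphMeta, threshold) {
  const keep = [];
  const discard = [];
  for (const [field, score] of Object.entries(shopgraphMeta.field_confidence)) {
    (score >= threshold ? keep : discard).push(field);
  }
  return { keep, discard };
}

const partition = partitionByConfidence(meta, 0.72);
console.log(partition.keep);    // [ 'product_name', 'brand' ]
console.log(partition.discard); // [ 'price' ]
```

A downstream agent could then act only on the kept fields and send the discarded ones for re-extraction or review.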

If Cloudflare presents an unsolvable CAPTCHA, the pipeline fails predictably with a 403 and reason: captcha_required, allowing your agent to route to a human reviewer rather than hanging on a timeout.
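
Handling that failure might look like the sketch below. The 403 status and the captcha_required reason come from the text above; how they are surfaced (here, as status and reason properties on a failure object) and the routeFailure helper with its return labels are assumptions.

```javascript
// Hypothetical helper: decide what to do with a failed extraction.
// Assumption: the failure exposes an HTTP `status` and a `reason` string.
function routeFailure(failure) {
  if (failure.status === 403 && failure.reason === 'captcha_required') {
    return 'human_review';
  }
  // Anything else is left to the caller's normal error handling.
  return 'raise';
}

console.log(routeFailure({ status: 403, reason: 'captcha_required' })); // human_review
console.log(routeFailure({ status: 500 }));                             // raise
```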

Continuous Calibration

The extraction pipeline does not just react; it measures its own accuracy. Every extraction tier is continuously benchmarked via an LLM-as-validator pipeline across a static 208-URL B2B/B2C corpus.

Read the full calibration methodology →

Example: Fire-and-Forget Extraction

Your agent sends a URL. It does not need to know whether the data came from Schema.org markup, LLM extraction, or Playwright rendering. The pipeline handles tier selection. The agent gets structured data back with metadata showing which path succeeded.

JavaScript

const result = await shopgraph.enrich(
  'https://www.uline.com/Product/Detail/S-19318/Kraft-Paper/30-lb-Kraft-Paper-Sheets-8-1-2-x-11'
);

// Agent uses the data without caring about extraction tier
console.log(result.product.title);
// "30 lb Kraft Paper Sheets - 8 1/2 x 11"

// But the metadata is there if you need it
console.log(result._shopgraph.extraction_method);
// "hybrid" — Playwright was needed, agent never had to know

The agent's code is the same regardless of extraction complexity. Tier selection is the pipeline's problem.

What this unlocks: Fire-and-forget extraction. Your agent sends a URL and gets structured data back. It does not need to know which extraction tier succeeded or how many fallbacks occurred.