Routing Engine

The routing engine automatically selects, for each URL, the cheapest extraction method that still yields quality data.

Pipeline

API REQUEST -> CACHE CHECK -> TOOL ROUTER -> EXTRACTION -> RESULT

Extraction Method Selection

The tool router evaluates each URL and selects the optimal extraction strategy:

  1. Cache check — If a recent extraction exists (within TTL), return it immediately.
  2. Schema.org probe — Fetch the page and check for JSON-LD / Microdata. If found, extract structured data (cheapest method).
  3. LLM extraction — If Schema.org is absent or incomplete, run LLM extraction on the HTML.
  4. Auto-heal merge — If Schema.org is partial (e.g. missing price), merge with LLM-extracted fields to create a hybrid result.
  5. Playwright fallback — For JS-heavy pages that return empty HTML, render with a headless browser first, then extract.
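
The selection steps above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual implementation: the `route` function, the `Extraction` type, the `REQUIRED_FIELDS` set, and the injected `fetch`, `render`, `schema_extract`, and `llm_extract` callables are all hypothetical names introduced for this example. TTL handling is omitted, and labelling any rendered page as `playwright` is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical set of fields a "complete" result must contain.
REQUIRED_FIELDS = ("title", "price", "description")


@dataclass
class Extraction:
    method: str
    fields: dict


def route(
    url: str,
    cache: Dict[str, Extraction],
    fetch: Callable[[str], str],
    render: Callable[[str], str],
    schema_extract: Callable[[str], dict],
    llm_extract: Callable[[str], dict],
) -> Extraction:
    # Step 1: cache check (assumes the cache only holds entries within TTL).
    if url in cache:
        return Extraction("cache", cache[url].fields)

    # Step 5: if the plain fetch returns empty HTML, render with a browser first.
    html = fetch(url)
    rendered = False
    if not html.strip():
        html = render(url)
        rendered = True

    # Step 2: Schema.org probe.
    structured = schema_extract(html)
    missing = [f for f in REQUIRED_FIELDS if f not in structured]

    if not missing:
        method, fields = "schema_org", structured
    else:
        # Step 3: LLM extraction when Schema.org is absent or incomplete.
        llm_fields = llm_extract(html)
        if structured:
            # Step 4: auto-heal merge, Schema.org values win over LLM values.
            method, fields = "hybrid", {**llm_fields, **structured}
        else:
            method, fields = "llm", llm_fields

    if rendered:
        method = "playwright"  # assumption: rendered pages report this method

    result = Extraction(method, fields)
    cache[url] = result
    return result
```

Passing the fetchers and extractors in as callables keeps the sketch runnable with simple stubs and makes the decision order easy to test in isolation.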

Cost per Method

Method      Latency  Cost    Confidence Baseline
cache       ~50ms    $0.000  Original score
schema_org  ~800ms   $0.001  0.93
llm         ~3s      $0.008  0.70
hybrid      ~3.5s    $0.009  0.85
playwright  ~8s      $0.015  0.75
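
As a rough worked example of how these per-method costs add up, the sketch below estimates the total for a batch of extractions. The costs are taken from the table above; the `estimate_cost` helper and the traffic mix are invented purely for illustration and do not reflect real usage data.

```python
from typing import Mapping

# Per-extraction cost in USD, copied from the table above.
COST_USD = {
    "cache": 0.000,
    "schema_org": 0.001,
    "llm": 0.008,
    "hybrid": 0.009,
    "playwright": 0.015,
}


def estimate_cost(counts: Mapping[str, int]) -> float:
    """Sum cost over a mapping of method name -> number of extractions."""
    return sum(COST_USD[method] * n for method, n in counts.items())


# Hypothetical mix of 1,000 requests; the proportions are made up.
mix = {"cache": 500, "schema_org": 300, "llm": 100, "hybrid": 80, "playwright": 20}
total = estimate_cost(mix)
print(f"${total:.2f}")  # prints "$2.12"
```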

Auto-Heal Merge

When Schema.org extraction returns partial data (e.g. title and price but no description), the router automatically:

  1. Keeps all Schema.org fields at their high-confidence baseline (0.93)
  2. Runs LLM extraction on the same HTML
  3. Fills in missing fields from the LLM result at LLM confidence levels
  4. Marks the result as hybrid with per-field provenance

The result combines high-confidence structured data where it is available with LLM-extracted values for the fields Schema.org did not provide.
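
The merge might look something like the sketch below. It assumes a simple per-field record with `value`, `source`, and `confidence` keys; that shape, the `auto_heal_merge` name, and the choice to treat `None` Schema.org values as missing are assumptions made for illustration. The confidence constants come from the baselines in the cost table.

```python
from typing import Any, Dict

SCHEMA_CONF = 0.93  # schema_org baseline from the cost table
LLM_CONF = 0.70     # llm baseline from the cost table


def auto_heal_merge(
    schema_fields: Dict[str, Any], llm_fields: Dict[str, Any]
) -> Dict[str, Any]:
    merged: Dict[str, Dict[str, Any]] = {}

    # Keep every Schema.org field that has a value, at Schema.org confidence.
    for name, value in schema_fields.items():
        if value is not None:  # assumption: None counts as missing
            merged[name] = {
                "value": value,
                "source": "schema_org",
                "confidence": SCHEMA_CONF,
            }

    # Fill only the gaps from the LLM result, at LLM confidence.
    for name, value in llm_fields.items():
        if name not in merged:
            merged[name] = {"value": value, "source": "llm", "confidence": LLM_CONF}

    return {"method": "hybrid", "fields": merged}


result = auto_heal_merge(
    {"title": "Desk Lamp", "price": "19.99"},
    {"title": "Lamp", "description": "A small LED desk lamp."},
)
# title and price keep source "schema_org"; description comes from "llm".
```

Because Schema.org fields are written first and LLM fields only fill gaps, an LLM value never overwrites a structured one in this sketch.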

The routing engine is automatic. You do not need to specify an extraction method. If you want to force a specific method, use the method parameter in the REST API.
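
To illustrate forcing a method, the sketch below builds a JSON request body that includes a `method` field. Only the existence of a `method` parameter comes from this page; the body shape, the `url` key, the `build_extract_request` helper, and the assumption that the accepted values match the method names in the cost table are hypothetical. Check the REST API reference for the actual endpoint and accepted values.

```python
import json
from typing import Optional

# Assumption: forceable methods use the names from the cost table.
KNOWN_METHODS = {"schema_org", "llm", "hybrid", "playwright"}


def build_extract_request(url: str, method: Optional[str] = None) -> str:
    """Return a JSON body; omit `method` to let the router decide."""
    body = {"url": url}
    if method is not None:
        if method not in KNOWN_METHODS:
            raise ValueError(f"unknown extraction method: {method!r}")
        body["method"] = method
    return json.dumps(body)


auto_body = build_extract_request("https://example.com/product/1")
forced_body = build_extract_request("https://example.com/product/1", method="llm")
```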