Spectrawl
The unified web layer for AI agents. Search, browse, crawl, extract, and act on platforms — one package, self-hosted.
5,000 free searches/month via Gemini Grounded Search. Full page scraping, stealth browsing, multi-page crawling, structured extraction, AI browser agent, 24 platform adapters.
What It Does
AI agents need to interact with the web — searching, browsing pages, crawling sites, logging into platforms, posting content. Today you wire together Playwright + a search API + cookie managers + platform-specific scripts. Spectrawl is one package that does all of it.
npm install spectrawl
How It Works
Spectrawl searches via Gemini Grounded Search (Google-quality results), scrapes the top pages for full content, and returns everything to your agent. Your agent's LLM reads the actual sources and forms its own answer — no pre-chewed summaries.
Quick Start
npm install spectrawl
export GEMINI_API_KEY=your-free-key # Get one at aistudio.google.com

const { Spectrawl } = require('spectrawl')
const web = new Spectrawl()
// CommonJS has no top-level await, so the calls run inside an async function
async function main() {
  // Deep search: returns sources for your agent/LLM to process
  const result = await web.deepSearch('how to build an MCP server in Node.js')
  console.log(result.sources) // [{ title, url, content, score }]
  // With AI summary (opt-in, uses an extra Gemini call)
  const withAnswer = await web.deepSearch('query', { summarize: true })
  console.log(withAnswer.answer) // AI-generated answer with [1] [2] citations
  // Fast mode: snippets only, skip scraping
  const fast = await web.deepSearch('query', { mode: 'fast' })
  // Basic search: raw results
  const basic = await web.search('query')
}
main().catch(console.error)
Why no summary by default? Your agent already has an LLM. If we summarize AND your agent summarizes, you're paying two LLMs for one answer. We return rich sources and your agent does the rest.
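Since the sources go straight to your agent's LLM, a common next step is to pack them into a numbered context block so the model can cite them as [1], [2]. The helper below is a minimal sketch and is not part of Spectrawl: `buildContext`, `maxCharsPerSource`, and the sample data are hypothetical, and it assumes a higher `score` means a more relevant source.

```javascript
// Illustrative helper, not part of Spectrawl: turn deepSearch-style sources
// into a numbered context block that an LLM can cite as [1], [2], ...
function buildContext(sources, maxCharsPerSource = 2000) {
  return sources
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => (b.score ?? 0) - (a.score ?? 0)) // assumed: higher score is better
    .map((s, i) => {
      const body = (s.content || '').slice(0, maxCharsPerSource)
      return `[${i + 1}] ${s.title} (${s.url})\n${body}`
    })
    .join('\n\n')
}

// Hypothetical data in the documented shape: { title, url, content, score }
const sampleSources = [
  { title: 'B', url: 'https://b.example', content: 'second', score: 0.4 },
  { title: 'A', url: 'https://a.example', content: 'first', score: 0.9 }
]
const context = buildContext(sampleSources)
console.log(context)
```

Truncating each source keeps the prompt within a predictable size; tune the limit to your model's context window.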
Spectrawl vs Others
| | Tavily | Crawl4AI | Firecrawl | Stagehand | Spectrawl |
|---|---|---|---|---|---|
| Speed | ~2s | ~5s | ~3s | ~3s | ~6-10s |
| Free tier | 1,000/mo | Unlimited | 500/mo | None | 5,000/mo |
| Returns | Snippets + AI | Markdown | Markdown/JSON | Structured | Full page + structured |
| Self-hosted | No | Yes | Yes | Yes | Yes |
| Anti-detect | No | No | No | No | Yes (Camoufox) |
| Block detection | No | No | No | No | 8 services |
| CAPTCHA solving | No | No | No | No | Yes (Gemini Vision) |
| Structured extraction | No | No | No | Yes | Yes |
| NL browser agent | No | No | No | Yes | Yes |
| Network capturing | No | Yes | No | No | Yes |
| Multi-page crawl | No | Yes | Yes | No | Yes (+ sitemap) |
| Platform posting | No | No | No | No | 24 adapters |
| Auth management | No | No | No | No | Cookie store + refresh |
Search
Two modes: basic search and deep search.
Basic Search
const results = await web.search('query')
Returns raw search results from the engine cascade. Fast, lightweight.
Deep Search
const results = await web.deepSearch('query', { summarize: true })
Full pipeline: query expansion → parallel search → merge/dedup → rerank → scrape top N → optional AI summary with citations.
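To make the merge/dedup and rerank steps concrete, here is a minimal sketch of how results from two engines could be combined. It is not Spectrawl's implementation: `normalizeUrl`, `mergeAndRank`, the `BOOST` table, and its weights are assumptions chosen for illustration.

```javascript
// Normalize a URL so trivially different links count as the same source.
function normalizeUrl(raw) {
  const u = new URL(raw)
  u.hash = ''                                   // drop #fragment
  u.hostname = u.hostname.replace(/^www\./, '') // treat www.host and host alike
  return u.toString().replace(/\/$/, '')        // drop trailing slash
}

// Assumed boost table; the real ranking weights are not documented here.
const BOOST = { 'github.com': 0.2, 'stackoverflow.com': 0.2, 'reddit.com': 0.1 }

function mergeAndRank(...resultLists) {
  const byUrl = new Map()
  for (const r of resultLists.flat()) {
    const key = normalizeUrl(r.url)
    const prev = byUrl.get(key)
    // Keep the higher-scored copy when two engines return the same page.
    if (!prev || r.score > prev.score) byUrl.set(key, { ...r, url: key })
  }
  return [...byUrl.values()]
    .map(r => ({ ...r, score: r.score + (BOOST[new URL(r.url).hostname] || 0) }))
    .sort((a, b) => b.score - a.score)
}

// Hypothetical results from two engines.
const geminiResults = [
  { url: 'https://www.github.com/org/repo/', title: 'Repo', score: 0.5 },
  { url: 'https://blog.example/post', title: 'Post', score: 0.6 }
]
const ddgResults = [
  { url: 'https://github.com/org/repo#readme', title: 'Repo', score: 0.4 }
]
const ranked = mergeAndRank(geminiResults, ddgResults)
console.log(ranked.map(r => r.url))
```

The two GitHub links collapse into one entry, and the boost lifts it above the blog post even though its raw score was lower.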
Search Engine Cascade
Default cascade: Gemini Grounded → Tavily → Brave
Gemini Grounded Search gives you Google-quality results through the Gemini API. Free tier: 5,000 grounded queries/month.
| Engine | Free Tier | Key Required | Default |
|---|---|---|---|
| Gemini Grounded | 5,000/month | GEMINI_API_KEY | ✅ Primary |
| Tavily | 1,000/month | TAVILY_API_KEY | ✅ 1st fallback |
| Brave | 2,000/month | BRAVE_API_KEY | ✅ 2nd fallback |
| DuckDuckGo | Unlimited | None | Available |
| Bing | Unlimited | None | Available |
| Serper | 2,500 trial | SERPER_API_KEY | Available |
| Google CSE | 100/day | GOOGLE_CSE_KEY | Available |
| Jina Reader | Unlimited | None | Available |
| SearXNG | Unlimited | Self-hosted | Available |
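The cascade idea can be sketched as a loop that tries each engine in order and moves on when one fails or returns nothing. This is an illustration under stated assumptions, not Spectrawl's internal code: `searchWithCascade` and the `{ name, search }` engine shape are hypothetical, and the engines below are stubs.

```javascript
// Try engines in order; fall through on errors or empty result sets.
async function searchWithCascade(query, engines) {
  const errors = []
  for (const engine of engines) {
    try {
      const results = await engine.search(query)
      if (Array.isArray(results) && results.length > 0) {
        return { engine: engine.name, results, errors }
      }
      errors.push({ engine: engine.name, reason: 'no results' })
    } catch (err) {
      errors.push({ engine: engine.name, reason: err.message })
    }
  }
  return { engine: null, results: [], errors }
}

// Stub engines: the primary hits a quota error, the first fallback is empty.
const stubEngines = [
  { name: 'gemini', search: async () => { throw new Error('quota exceeded') } },
  { name: 'tavily', search: async () => [] },
  { name: 'brave', search: async q => [{ title: `Result for ${q}`, url: 'https://example.com' }] }
]

const cascadeDemo = searchWithCascade('mcp server', stubEngines)
cascadeDemo.then(out => console.log(out.engine, out.errors.length))
```

Returning the collected errors alongside the results makes it easy to log which engines were skipped and why.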
Deep Search Pipeline
Query → Gemini Grounded + DDG (parallel)
→ Merge & deduplicate (12-19 results)
→ Source quality ranking (boost GitHub/SO/Reddit, penalize SEO spam)
→ Parallel scraping (Jina → readability → Playwright fallback)
→ Returns sources to your agent (AI summary opt-in with summarize: true)
What you get without any keys
Without any API keys, Spectrawl falls back to DuckDuckGo-only search: raw results, no AI answer. This works from home IPs, but DuckDuckGo rate-limits datacenter IPs, so a free Gemini key is the recommended minimum.
Browse
Stealth browsing with anti-detection. Three tiers (auto-escalation):
- Playwright + stealth plugin — default, works immediately
- Camoufox binary — engine-level anti-fingerprint (npx spectrawl install-stealth)
- Remote Camoufox — for existing deployments
If tier 1 gets blocked, Spectrawl automatically escalates to tier 2 (if installed) or tier 3 (if configured). No manual intervention needed.
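The escalation rule above can be pictured as follows. This is a rough sketch with hypothetical names (`browseWithEscalation`, `available`, `fetchPage`); the stubbed tiers stand in for real browser engines, and Spectrawl's actual escalation logic may differ.

```javascript
// Try tiers in order, skipping any that are not installed or configured,
// and stop at the first response that is not flagged as blocked.
async function browseWithEscalation(url, tiers) {
  let last = null
  for (const tier of tiers) {
    if (!tier.available) continue
    last = { ...(await tier.fetchPage(url)), engine: tier.name }
    if (!last.blocked) return last
  }
  return last // every available tier was blocked (or null if none was available)
}

// Stub tiers: tier 1 is blocked, tier 2 is not installed, tier 3 succeeds.
const demoTiers = [
  { name: 'playwright', available: true, fetchPage: async () => ({ blocked: true, content: '' }) },
  { name: 'camoufox', available: false, fetchPage: async () => ({ blocked: false, content: 'never runs' }) },
  { name: 'remote-camoufox', available: true, fetchPage: async () => ({ blocked: false, content: '# Hello' }) }
]

const escalationDemo = browseWithEscalation('https://example.com', demoTiers)
escalationDemo.then(page => console.log(page.engine, page.blocked))
```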
Browse Options
const page = await web.browse('https://example.com', {
screenshot: true, // Take a PNG screenshot
fullPage: true, // Full page screenshot (not just viewport)
html: true, // Return raw HTML alongside markdown
stealth: true, // Force stealth mode
camoufox: true, // Force Camoufox engine
noCache: true, // Bypass cache
auth: 'reddit' // Use stored auth cookies for this platform
})
Browse Response
{
content: "# Page Title\n\nExtracted markdown content...",
url: "https://example.com",
title: "Page Title",
statusCode: 200,
cached: false,
engine: "camoufox", // which engine was used
screenshot: Buffer<png>, // PNG buffer (JS) or base64 (HTTP)
html: "<html>...</html>", // raw HTML (if html: true)
blocked: false, // true if block page detected
blockInfo: null // { type: 'cloudflare', detail: '...' }
}
Block Page Detection
Spectrawl detects block/challenge pages from 8 anti-bot services, plus generic bot-detection responses, and reports them in the response instead of returning garbage HTML:
- Cloudflare (including RFC 9457 structured errors)
- Akamai
- AWS WAF
- Imperva / Incapsula
- DataDome
- PerimeterX / HUMAN
- hCaptcha challenges
- reCAPTCHA challenges
- Generic bot detection (403, "access denied", etc.)
When a block is detected, the response includes blocked: true and blockInfo: { type, detail }.
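As a rough illustration of how this kind of detection can work, the sketch below matches a page against a few vendor markers and falls back to a generic check. The `SIGNATURES` list is a small assumed subset chosen for the example, and `detectBlock` is a hypothetical helper; Spectrawl's real rules are not shown in this README.

```javascript
// Illustrative signatures only: an assumed subset, not Spectrawl's rule set.
const SIGNATURES = [
  { type: 'cloudflare', pattern: /cf-chl-|just a moment\.\.\./i },
  { type: 'datadome', pattern: /captcha-delivery\.com/i },
  { type: 'perimeterx', pattern: /px-captcha/i },
  { type: 'imperva', pattern: /_Incapsula_Resource/i },
  { type: 'hcaptcha', pattern: /hcaptcha\.com/i },
  { type: 'recaptcha', pattern: /google\.com\/recaptcha/i }
]

function detectBlock(statusCode, html) {
  for (const { type, pattern } of SIGNATURES) {
    const match = html.match(pattern)
    if (match) {
      return { blocked: true, blockInfo: { type, detail: `matched "${match[0]}"` } }
    }
  }
  // Generic fallback: a 403 or an "access denied" body with no vendor marker.
  if (statusCode === 403 || /access denied/i.test(html)) {
    return { blocked: true, blockInfo: { type: 'generic', detail: `status ${statusCode}` } }
  }
  return { blocked: false, blockInfo: null }
}

console.log(detectBlock(403, '<title>Just a moment...</title>'))
console.log(detectBlock(200, '<h1>Example Domain</h1>'))
```

String matching alone produces false positives, for example on a normal login form that embeds reCAPTCHA, so a production detector would also weigh status codes, headers, and page size.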
Site-Specific Fallbacks
Some sites block all datacenter IPs regardless of stealth. Spectrawl automatically routes these through alternative APIs:
| Site | Problem | Fallback | Cost |
|---|---|---|---|
| Reddit | Blocks all datacenter IPs | PullPush API — Reddit archive | Free |
| Amazon | CAPTCHA wall on product pages | Jina Reader — server-side rendering | Free |
| X/Twitter | Login wall on posts | xAI Responses API with x_search | ~$0.06/post |
| LinkedIn | HTTP 999, IP fingerprinting | Requires residential proxy (see below) | ~$7/GB |
These fallbacks activate automatically — just browse() the URL and Spectrawl picks the right path. No config needed for Reddit and Amazon. X requires XAI_API_KEY env var. LinkedIn requires a residential proxy.
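One way to picture the automatic routing is a hostname lookup that also reports whether the prerequisite for a route is met. This is a sketch, not Spectrawl's router: `pickRoute`, the strategy labels, and the `proxyConfigured` flag are hypothetical, and only the `.com` domains are matched for brevity.

```javascript
// Route table mirroring the fallback table; strategy names are made-up labels.
const FALLBACKS = [
  { hosts: ['reddit.com'], strategy: 'pullpush', needs: null },
  { hosts: ['amazon.com'], strategy: 'jina-reader', needs: null },
  { hosts: ['x.com', 'twitter.com'], strategy: 'xai-x-search', needs: 'XAI_API_KEY' },
  { hosts: ['linkedin.com'], strategy: 'residential-proxy', needs: 'proxy' }
]

function pickRoute(rawUrl, env = {}, proxyConfigured = false) {
  const host = new URL(rawUrl).hostname.toLowerCase()
  // Match the bare domain or any subdomain (www.reddit.com, old.reddit.com, ...).
  const rule = FALLBACKS.find(f =>
    f.hosts.some(h => host === h || host.endsWith('.' + h)))
  if (!rule) return { strategy: 'direct', ready: true }
  const ready =
    rule.needs === null ||
    (rule.needs === 'proxy' ? proxyConfigured : Boolean(env[rule.needs]))
  return { strategy: rule.strategy, ready }
}

console.log(pickRoute('https://www.reddit.com/r/node'))
console.log(pickRoute('https://x.com/user/status/1', {}))
console.log(pickRoute('https://www.linkedin.com/in/someone', {}, true))
```

Reporting `ready: false` up front lets a caller surface a clear "set XAI_API_KEY" or "configure a proxy" message instead of a failed request.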
LinkedIn: Why It's Different
LinkedIn fingerprints the IP where cookies were created. Even valid cookies get rejected from a different IP. Every free approach fails from datacenter servers:
- Direct browse: HTTP 999
- Voyager API with cookies: 401 (IP mismatch)
- Jina Reader: empty response
- Facebook/Googlebot UA: 317K of CSS, zero content
The only working solution is a residential proxy. We recommend Bright Data for best results (72M+ residential IPs, ~99.7% success rate, dedicated social media unlockers). For budget use, Smartproxy ($7/GB, 55M IPs, 3-day free trial) works well at lower cost.
Setup:
# Bright Data (recommended)
npx spectrawl config set proxy '{"host":"brd.superproxy.io","port":22225,"username":"YOUR_ZONE_USER","password":"YOUR_PASS"}'
# Smartproxy (budget alternative)
npx spectrawl config set proxy '{"host":"gate.smartproxy.com","port":10001,"username":"YOUR_USER","password":"YOUR_PASS"}'
# Store your LinkedIn cookies (export from browser)
npx spectrawl login linkedin --account yourname --cookies ./linkedin-cookies.json
# Now browse LinkedIn normally
curl localhost:3900/browse -d '{"url":"https://www.linkedin.com/in/someone"}'
Other residential proxy providers that work:
- IPRoyal — $7/GB, 32M IPs
- Bright Data — premium quality, higher cost
- Oxylabs — enterprise-grade
⚠️ Avoid WebShare — recycled datacenter IPs marketed as residential, no HTTPS support.
CAPTCHA Solving
Built-in CAPTCHA solver using Gemini Vision (free tier: 1,500 req/day):
- ✅ Image CAPTCHAs
- ✅ Text/math CAPTCHAs
- ✅ Simple visual challenges
- ❌ reCAPTCHA v2/v3 (requires token solving services)
- ❌ hCaptcha (requires token solving services)
- ❌ Cloudflare Turnstile (requires token solving services)
The solver automatically detects CAPTCHA type and attempts resolution before returning the page.
Extract — Structured Data Extraction
Pull structured data from any page using LLM + optional CSS/XPath selectors. Like Stagehand's extract() but self-hosted and integrated with Spectrawl's anti-detect browsing.
Basic Extraction
const result = await web.extract('https://news.ycombinator.com', {
instruction: 'Extract the top 3 story titles and their point counts',
schema: {
type: 'object',
properties: {
stories: {
type: 'array',
items: {
type: 'object',
properties: {
title: { type: 'string' },
points: { type: 'number' }
}
}
}
}
}
})
// result.data = { stories: [{ title: "...", points: 210 }, ...] }
HTTP API
curl -X POST http://localhost:3900/extract \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com",
"instruction": "Extract the page title and main heading",
"schema": {"type": "object", "properties": {"title": {"type": "string"}, "heading": {"type": "string"}}}
}'
Response:
{
"data": { "title": "Example Domain", "heading": "Example Domain" },
"url": "https://example.com",
"title": "Example Domain",
"contentLength": 129,
"duration": 679
}
Targeted Extraction with Selectors
Narrow extraction scope using CSS or XPath selectors — reduces tokens and improves accuracy:
const result = await web.extract('https://news.ycombinator.com', {
instruction: 'Extract all story titles',
selector: '.titleline', // CSS selector
// or: selector: 'xpath=//table[@class="itemlist"]'
schema: { type: 'object', properties: { titles: { type: 'array', items: { type: 'string' } } } }
})
Relevance Filtering (BM25)
For large pages, filter content by relevance before sending to the LLM — saves tokens:
const result = await web.extract('https://en.wikipedia.org/wiki/Node.js', {
instruction: 'Extract the creator and release date',
relevanceFilter: true // BM25 scoring keeps only relevant sections
})
// Content reduced from 50K+ chars to ~2K relevant chars
Extract from Content (No Browsing)
Already have the content? Skip the browse step:
const result = await web.extractFromContent(markdownContent, {
instruction: 'Extract all email addresses',
schema: { type: 'object', properties: { emails: { type: 'array', items: { type: 'string' } } } }
})
Uses Gemini Flash (free) by default. Falls back to OpenAI if configured.
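For readers curious what BM25 relevance filtering involves, the sketch below scores blank-line-separated sections against a query with the standard Okapi BM25 formula and keeps the best ones. It is a generic illustration: `bm25Filter`, `tokenize`, the section-splitting rule, and the `k1`/`b` defaults are assumptions, not Spectrawl's implementation.

```javascript
const tokenize = text => text.toLowerCase().match(/[a-z0-9]+/g) || []

function bm25Filter(content, query, { topK = 2, k1 = 1.5, b = 0.75 } = {}) {
  const sections = content.split(/\n\s*\n/).filter(s => s.trim())
  const docs = sections.map(tokenize)
  const avgLen = docs.reduce((n, d) => n + d.length, 0) / (docs.length || 1)
  const terms = [...new Set(tokenize(query))]

  // Document frequency: number of sections containing each query term.
  const df = Object.fromEntries(
    terms.map(t => [t, docs.filter(d => d.includes(t)).length]))

  const scored = docs.map((doc, i) => {
    let score = 0
    for (const t of terms) {
      const tf = doc.filter(w => w === t).length
      if (tf === 0) continue
      const idf = Math.log(1 + (docs.length - df[t] + 0.5) / (df[t] + 0.5))
      score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc.length / avgLen))
    }
    return { text: sections[i], score }
  })

  return scored
    .filter(s => s.score > 0)          // drop sections with no query terms
    .sort((x, y) => y.score - x.score) // most relevant first
    .slice(0, topK)
    .map(s => s.text)
}

// Hypothetical page content split into three sections.
const page = [
  'Node.js was created by Ryan Dahl and first released in 2009.',
  'The runtime uses the V8 engine and an event loop for I/O.',
  'Package management is handled by npm.'
].join('\n\n')

const kept = bm25Filter(page, 'creator release date created released', { topK: 1 })
console.log(kept)
```

This version does no stemming, so "creator" does not match "created"; the sample query lists several word forms to compensate.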
Agent — Natural Language Browser Actions
Control a browser with natural language. Navigate, click, type, scroll — the LLM interprets the page and decides what to do.
const result = await web.agent('https://exampl
…