Spectrawl

Unified web layer for AI agents. Search (8 engines), stealth browse, cookie auth, and act on 24 platforms. 5,000 free searches/month via Gemini Grounded Search.

search-data-extraction · ai · agent
By FayAndXan
256 · Updated 3 months ago · JavaScript · MIT

Installation

npx -y spectrawl

Configuration

{
  "mcpServers": {
    "spectrawl": {
      "command": "npx",
      "args": ["-y", "spectrawl"]
    }
  }
}

How to use

  1. Run the installation command above (if needed)
  2. Open your Claude Code settings file (~/.claude/settings.json)
  3. Add the configuration to the mcpServers section
  4. Restart Claude Code to apply changes
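Step 3 amounts to merging one entry into the `mcpServers` object of the settings file. A minimal sketch, assuming the file is plain JSON shaped like the Configuration block above; `addMcpServer` is a hypothetical helper, not part of Spectrawl or Claude Code:

```javascript
// Hypothetical helper: merge an MCP server entry into a parsed settings object
// without dropping servers that are already configured.
function addMcpServer(settings, name, entry) {
  return {
    ...settings,
    mcpServers: { ...(settings.mcpServers || {}), [name]: entry },
  };
}

// Example input: a settings object that already lists another (made-up) server.
const existing = { mcpServers: { other: { command: 'node', args: ['other.js'] } } };

const updated = addMcpServer(existing, 'spectrawl', {
  command: 'npx',
  args: ['-y', 'spectrawl'],
});

console.log(JSON.stringify(updated, null, 2));
```

The helper returns a new object instead of mutating its input, so the original settings can be kept as a backup before writing the file.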

Spectrawl

The unified web layer for AI agents. Search, browse, crawl, extract, and act on platforms — one package, self-hosted.

5,000 free searches/month via Gemini Grounded Search. Full page scraping, stealth browsing, multi-page crawling, structured extraction, AI browser agent, 24 platform adapters.

What It Does

AI agents need to interact with the web — searching, browsing pages, crawling sites, logging into platforms, posting content. Today you wire together Playwright + a search API + cookie managers + platform-specific scripts. Spectrawl is one package that does all of it.

npm install spectrawl

How It Works

Spectrawl searches via Gemini Grounded Search (Google-quality results), scrapes the top pages for full content, and returns everything to your agent. Your agent's LLM reads the actual sources and forms its own answer — no pre-chewed summaries.

Quick Start

npm install spectrawl
export GEMINI_API_KEY=your-free-key  # Get one at aistudio.google.com

// Run the calls below inside an async function (or an ES module) so that await works
const { Spectrawl } = require('spectrawl')
const web = new Spectrawl()

// Deep search — returns sources for your agent/LLM to process
const result = await web.deepSearch('how to build an MCP server in Node.js')
console.log(result.sources)   // [{ title, url, content, score }]

// With AI summary (opt-in — uses extra Gemini call)
const withAnswer = await web.deepSearch('query', { summarize: true })
console.log(withAnswer.answer)  // AI-generated answer with [1] [2] citations

// Fast mode — snippets only, skip scraping
const fast = await web.deepSearch('query', { mode: 'fast' })

// Basic search — raw results
const basic = await web.search('query')

Why no summary by default? Your agent already has an LLM. If we summarize AND your agent summarizes, you're paying two LLMs for one answer. We return rich sources — your agent does the rest.
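One way an agent might consume the returned sources is to fold them into a single citation-ready prompt for its own LLM. A rough sketch, assuming the `{ title, url, content, score }` shape shown in the Quick Start comment; `buildPrompt` and the `maxChars` cap are illustrative and not part of Spectrawl:

```javascript
// Hypothetical helper: turn deepSearch-style sources into one prompt.
// Each source is numbered so the model can cite it as [1], [2], ...
function buildPrompt(question, sources, maxChars = 2000) {
  const blocks = sources.map(
    (s, i) => `[${i + 1}] ${s.title} (${s.url})\n${s.content.slice(0, maxChars)}`
  );
  return [
    'Answer using only the sources below. Cite them as [n].',
    blocks.join('\n\n'),
    `Question: ${question}`,
  ].join('\n\n');
}

// Made-up sources standing in for result.sources.
const demoSources = [
  { title: 'MCP quickstart', url: 'https://example.com/mcp', content: 'Install the SDK...', score: 0.9 },
  { title: 'Node.js guide', url: 'https://example.com/node', content: 'Create a server...', score: 0.7 },
];

console.log(buildPrompt('How do I build an MCP server?', demoSources));
```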

Spectrawl vs Others

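| | Tavily | Crawl4AI | Firecrawl | Stagehand | Spectrawl |
|---|---|---|---|---|---|
| Speed | ~2s | ~5s | ~3s | ~3s | ~6-10s |
| Free tier | 1,000/mo | Unlimited | 500/mo | None | 5,000/mo |
| Returns | Snippets + AI | Markdown | Markdown/JSON | Structured | Full page + structured |
| Self-hosted | No | Yes | Yes | Yes | Yes |
| Anti-detect | No | No | No | No | Yes (Camoufox) |
| Block detection | No | No | No | No | 8 services |
| CAPTCHA solving | No | No | No | No | Yes (Gemini Vision) |
| Structured extraction | No | No | No | Yes | Yes |
| NL browser agent | No | No | No | Yes | Yes |
| Network capturing | No | Yes | No | No | Yes |
| Multi-page crawl | No | Yes | Yes | No | Yes (+ sitemap) |
| Platform posting | No | No | No | No | 24 adapters |
| Auth management | No | No | No | No | Cookie store + refresh |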

Search

Two modes: basic search and deep search.

Basic Search

const results = await web.search('query')

Returns raw search results from the engine cascade. Fast, lightweight.

Deep Search

const results = await web.deepSearch('query', { summarize: true })

Full pipeline: query expansion → parallel search → merge/dedup → rerank → scrape top N → optional AI summary with citations.

Search Engine Cascade

Default cascade: Gemini Grounded → Tavily → Brave

Gemini Grounded Search gives you Google-quality results through the Gemini API. Free tier: 5,000 grounded queries/month.

| Engine | Free Tier | Key Required | Default |
|---|---|---|---|
| Gemini Grounded | 5,000/month | GEMINI_API_KEY | ✅ Primary |
| Tavily | 1,000/month | TAVILY_API_KEY | ✅ 1st fallback |
| Brave | 2,000/month | BRAVE_API_KEY | ✅ 2nd fallback |
| DuckDuckGo | Unlimited | None | Available |
| Bing | Unlimited | None | Available |
| Serper | 2,500 trial | SERPER_API_KEY | Available |
| Google CSE | 100/day | GOOGLE_CSE_KEY | Available |
| Jina Reader | Unlimited | None | Available |
| SearXNG | Unlimited | Self-hosted | Available |
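The cascade idea can be sketched as a loop that tries each engine in order and stops at the first one that returns results. This is only an illustration of the pattern: `searchCascade` and the stub engines are hypothetical, and Spectrawl's actual fallback logic is not shown in this README.

```javascript
// Hypothetical cascade: try engines in order, return the first non-empty result.
// Errors (missing key, exhausted quota, ...) are collected and the next engine is tried.
async function searchCascade(query, engines) {
  const errors = [];
  for (const engine of engines) {
    try {
      const results = await engine.search(query);
      if (results.length > 0) return { engine: engine.name, results };
    } catch (err) {
      errors.push(`${engine.name}: ${err.message}`);
    }
  }
  return { engine: null, results: [], errors };
}

// Stub engines standing in for real API clients.
const engines = [
  { name: 'gemini', search: async () => { throw new Error('quota exceeded'); } },
  { name: 'tavily', search: async () => [] },
  { name: 'brave', search: async (q) => [{ title: `Result for ${q}`, url: 'https://example.com' }] },
];

// Logs the name of the engine that ended up answering.
searchCascade('mcp server', engines).then((out) => console.log(out.engine));
```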

Deep Search Pipeline

Query → Gemini Grounded + DDG (parallel)
  → Merge & deduplicate (12-19 results)
  → Source quality ranking (boost GitHub/SO/Reddit, penalize SEO spam)
  → Parallel scraping (Jina → readability → Playwright fallback)
  → Returns sources to your agent (AI summary opt-in with summarize: true)
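The merge, deduplicate, and rank steps could look roughly like the sketch below. The domain weights, the `mergeAndRank` name, and the URL normalization rules are assumptions made for illustration; the README does not document the real scoring.

```javascript
// Illustrative weights only, not Spectrawl's real values.
const DOMAIN_BOOST = { 'github.com': 0.3, 'stackoverflow.com': 0.3, 'reddit.com': 0.2 };

const hostOf = (raw) => new URL(raw).hostname.replace(/^www\./, '');

// Treat URLs that differ only by "www.", a trailing slash, or a #fragment as the same page.
function normalizeUrl(raw) {
  const u = new URL(raw);
  return `${hostOf(raw)}${u.pathname.replace(/\/$/, '')}${u.search}`;
}

function mergeAndRank(...resultLists) {
  const byUrl = new Map();
  for (const result of resultLists.flat()) {
    const key = normalizeUrl(result.url);
    const prev = byUrl.get(key);
    // Keep the higher-scored copy when two engines return the same page.
    if (!prev || result.score > prev.score) byUrl.set(key, result);
  }
  return [...byUrl.values()]
    .map((r) => ({ ...r, score: r.score + (DOMAIN_BOOST[hostOf(r.url)] || 0) }))
    .sort((a, b) => b.score - a.score);
}

// Made-up results from two engines.
const fromGemini = [
  { url: 'https://www.github.com/a/b/', score: 0.5 },
  { url: 'https://example.com/x', score: 0.7 },
];
const fromDdg = [{ url: 'https://github.com/a/b', score: 0.6 }];

console.log(mergeAndRank(fromGemini, fromDdg));
```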

What you get without any keys

DDG-only search, raw results, no AI answer. Works from home IPs. Datacenter IPs get rate-limited by DDG, so a free Gemini key is recommended as a minimum.

Browse

Stealth browsing with anti-detection. Three tiers (auto-escalation):

  1. Playwright + stealth plugin — default, works immediately
  2. Camoufox binary — engine-level anti-fingerprint (npx spectrawl install-stealth)
  3. Remote Camoufox — for existing deployments

If tier 1 gets blocked, Spectrawl automatically escalates to tier 2 (if installed) or tier 3 (if configured). No manual intervention needed.
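The escalation behaviour described above can be pictured as a loop over tiers that skips unavailable ones and stops at the first unblocked response. A sketch under stated assumptions: the tier objects, the `available` flag, and `browseWithEscalation` are hypothetical, and only the `blocked` and `engine` field names come from the Browse Response section.

```javascript
// Hypothetical escalation loop. Each tier is { name, available, fetch }.
async function browseWithEscalation(url, tiers) {
  let last = null;
  for (const tier of tiers) {
    if (!tier.available) continue; // e.g. Camoufox binary not installed
    last = { ...(await tier.fetch(url)), engine: tier.name };
    if (!last.blocked) return last;
  }
  return last; // every available tier was blocked, or null if none were available
}

// Stub tiers standing in for real browser engines.
const tiers = [
  { name: 'playwright', available: true, fetch: async (url) => ({ url, blocked: true }) },
  { name: 'camoufox', available: false, fetch: async (url) => ({ url, blocked: false }) },
  { name: 'remote-camoufox', available: true, fetch: async (url) => ({ url, blocked: false }) },
];

// Logs the engine that produced the final response.
browseWithEscalation('https://example.com', tiers).then((page) => console.log(page.engine));
```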

Browse Options

const page = await web.browse('https://example.com', {
  screenshot: true,    // Take a PNG screenshot
  fullPage: true,      // Full page screenshot (not just viewport)
  html: true,          // Return raw HTML alongside markdown
  stealth: true,       // Force stealth mode
  camoufox: true,      // Force Camoufox engine
  noCache: true,       // Bypass cache
  auth: 'reddit'       // Use stored auth cookies for this platform
})

Browse Response

{
  content: "# Page Title\n\nExtracted markdown content...",
  url: "https://example.com",
  title: "Page Title",
  statusCode: 200,
  cached: false,
  engine: "camoufox",            // which engine was used
  screenshot: Buffer<png>,        // PNG buffer (JS) or base64 (HTTP)
  html: "<html>...</html>",       // raw HTML (if html: true)
  blocked: false,                 // true if block page detected
  blockInfo: null                 // { type: 'cloudflare', detail: '...' }
}

Block Page Detection

Spectrawl detects block/challenge pages from 8 anti-bot services, plus generic bot-detection responses, and reports them in the response instead of returning garbage HTML:

  • Cloudflare (including RFC 9457 structured errors)
  • Akamai
  • AWS WAF
  • Imperva / Incapsula
  • DataDome
  • PerimeterX / HUMAN
  • hCaptcha challenges
  • reCAPTCHA challenges
  • Generic bot detection (403, "access denied", etc.)

When a block is detected, the response includes blocked: true and blockInfo: { type, detail }.
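To make the idea concrete, block detection can be approximated by matching well-known markers in the returned HTML and falling back to a generic status check. The patterns below are simple heuristics chosen for illustration and cover only a few services; `detectBlock` is a hypothetical function, and Spectrawl's real rules are not shown in this README.

```javascript
// Illustrative heuristics only. Each marker is a string commonly seen on that
// service's challenge pages; real detection would need many more signals.
const MARKERS = [
  { type: 'cloudflare', pattern: /just a moment\.\.\.|\/cdn-cgi\/challenge-platform\//i },
  { type: 'datadome', pattern: /captcha-delivery\.com/i },
  { type: 'hcaptcha', pattern: /hcaptcha\.com/i },
  { type: 'recaptcha', pattern: /g-recaptcha|google\.com\/recaptcha/i },
];

// Returns the same { blocked, blockInfo } pair described in the Browse Response.
function detectBlock({ statusCode, html }) {
  for (const { type, pattern } of MARKERS) {
    const match = html.match(pattern);
    if (match) return { blocked: true, blockInfo: { type, detail: `matched "${match[0]}"` } };
  }
  if (statusCode === 403 || /access denied/i.test(html)) {
    return { blocked: true, blockInfo: { type: 'generic', detail: `status ${statusCode}` } };
  }
  return { blocked: false, blockInfo: null };
}

console.log(detectBlock({ statusCode: 403, html: '<title>Just a moment...</title>' }));
console.log(detectBlock({ statusCode: 200, html: '<h1>Hello</h1>' }));
```

A caller would typically branch on `blocked` before passing `content` to an LLM, so that a challenge page is never mistaken for the real page.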

Site-Specific Fallbacks

Some sites block all datacenter IPs regardless of stealth. Spectrawl automatically routes these through alternative APIs:

| Site | Problem | Fallback | Cost |
|---|---|---|---|
| Reddit | Blocks all datacenter IPs | PullPush API (Reddit archive) | Free |
| Amazon | CAPTCHA wall on product pages | Jina Reader (server-side rendering) | Free |
| X/Twitter | Login wall on posts | xAI Responses API with x_search | ~$0.06/post |
| LinkedIn | HTTP 999, IP fingerprinting | Requires residential proxy (see below) | ~$7/GB |

These fallbacks activate automatically — just browse() the URL and Spectrawl picks the right path. No config needed for Reddit and Amazon. X requires XAI_API_KEY env var. LinkedIn requires a residential proxy.
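Routing of this kind can be sketched as a hostname lookup that also checks whether the required key is present. The route names, the host lists, and `pickRoute` are hypothetical; only the four sites and the `XAI_API_KEY` variable come from the text above.

```javascript
// Hypothetical routing table mirroring the site-specific fallbacks above.
const FALLBACKS = [
  { hosts: ['reddit.com'], route: 'pullpush' },
  { hosts: ['amazon.com'], route: 'jina-reader' },
  { hosts: ['x.com', 'twitter.com'], route: 'xai-x-search', requiresEnv: 'XAI_API_KEY' },
  { hosts: ['linkedin.com'], route: 'residential-proxy' },
];

function pickRoute(url, env = process.env) {
  const host = new URL(url).hostname.toLowerCase();
  // Match the bare domain and any subdomain (www., old., m., ...).
  const rule = FALLBACKS.find((f) => f.hosts.some((h) => host === h || host.endsWith(`.${h}`)));
  if (!rule) return { route: 'direct' };
  if (rule.requiresEnv && !env[rule.requiresEnv]) {
    return { route: 'direct', warning: `${rule.requiresEnv} is not set` };
  }
  return { route: rule.route };
}

console.log(pickRoute('https://old.reddit.com/r/node', {}));
console.log(pickRoute('https://x.com/someone/status/1', {}));
```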

LinkedIn: Why It's Different

LinkedIn fingerprints the IP where cookies were created. Even valid cookies get rejected from a different IP. Every free approach fails from datacenter servers:

  • Direct browse: HTTP 999
  • Voyager API with cookies: 401 (IP mismatch)
  • Jina Reader: empty response
  • Facebook/Googlebot UA: 317K of CSS, zero content

The only working solution is a residential proxy. We recommend Bright Data for best results (72M+ residential IPs, ~99.7% success rate, dedicated social media unlockers). For budget use, Smartproxy ($7/GB, 55M IPs, 3-day free trial) works well at lower cost.

Setup:

# Bright Data (recommended)
npx spectrawl config set proxy '{"host":"brd.superproxy.io","port":22225,"username":"YOUR_ZONE_USER","password":"YOUR_PASS"}'

# Smartproxy (budget alternative)
npx spectrawl config set proxy '{"host":"gate.smartproxy.com","port":10001,"username":"YOUR_USER","password":"YOUR_PASS"}'

# Store your LinkedIn cookies (export from browser)
npx spectrawl login linkedin --account yourname --cookies ./linkedin-cookies.json

# Now browse LinkedIn normally
curl localhost:3900/browse -d '{"url":"https://www.linkedin.com/in/someone"}'

Other residential proxy providers that work:

⚠️ Avoid WebShare — recycled datacenter IPs marketed as residential, no HTTPS support.

CAPTCHA Solving

Built-in CAPTCHA solver using Gemini Vision (free tier: 1,500 req/day):

  • ✅ Image CAPTCHAs
  • ✅ Text/math CAPTCHAs
  • ✅ Simple visual challenges
  • ❌ reCAPTCHA v2/v3 (requires token solving services)
  • ❌ hCaptcha (requires token solving services)
  • ❌ Cloudflare Turnstile (requires token solving services)

The solver automatically detects CAPTCHA type and attempts resolution before returning the page.

Extract — Structured Data Extraction

Pull structured data from any page using LLM + optional CSS/XPath selectors. Like Stagehand's extract() but self-hosted and integrated with Spectrawl's anti-detect browsing.

Basic Extraction

const result = await web.extract('https://news.ycombinator.com', {
  instruction: 'Extract the top 3 story titles and their point counts',
  schema: {
    type: 'object',
    properties: {
      stories: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            title: { type: 'string' },
            points: { type: 'number' }
          }
        }
      }
    }
  }
})
// result.data = { stories: [{ title: "...", points: 210 }, ...] }

HTTP API

curl -X POST http://localhost:3900/extract \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "instruction": "Extract the page title and main heading",
    "schema": {"type": "object", "properties": {"title": {"type": "string"}, "heading": {"type": "string"}}}
  }'

Response:

{
  "data": { "title": "Example Domain", "heading": "Example Domain" },
  "url": "https://example.com",
  "title": "Example Domain",
  "contentLength": 129,
  "duration": 679
}
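The same request can be issued from Node instead of curl. A small sketch, assuming Node 18+ (global `fetch`) and a Spectrawl server on port 3900 as in the curl example; `buildExtractRequest` and `extractOverHttp` are hypothetical helper names:

```javascript
// Hypothetical helper: build the fetch() arguments for POST /extract.
// The endpoint and body fields are taken from the curl example above.
function buildExtractRequest(url, instruction, schema, base = 'http://localhost:3900') {
  return [
    `${base}/extract`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ url, instruction, schema }),
    },
  ];
}

// Sends the request and returns the parsed JSON body (requires a running server).
async function extractOverHttp(url, instruction, schema) {
  const res = await fetch(...buildExtractRequest(url, instruction, schema));
  if (!res.ok) throw new Error(`extract failed: HTTP ${res.status}`);
  return res.json();
}

const [endpoint, options] = buildExtractRequest(
  'https://example.com',
  'Extract the page title and main heading',
  { type: 'object', properties: { title: { type: 'string' }, heading: { type: 'string' } } }
);
console.log(endpoint, options.method);
```

Separating request construction from the network call keeps the payload easy to inspect and test without a server.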

Targeted Extraction with Selectors

Narrow extraction scope using CSS or XPath selectors — reduces tokens and improves accuracy:

const result = await web.extract('https://news.ycombinator.com', {
  instruction: 'Extract all story titles',
  selector: '.titleline',  // CSS selector
  // or: selector: 'xpath=//table[@class="itemlist"]'
  schema: { type: 'object', properties: { titles: { type: 'array', items: { type: 'string' } } } }
})

Relevance Filtering (BM25)

For large pages, filter content by relevance before sending to the LLM — saves tokens:

const result = await web.extract('https://en.wikipedia.org/wiki/Node.js', {
  instruction: 'Extract the creator and release date',
  relevanceFilter: true   // BM25 scoring keeps only relevant sections
})
// Content reduced from 50K+ chars to ~2K relevant chars

Extract from Content (No Browsing)

Already have the content? Skip the browse step:

const result = await web.extractFromContent(markdownContent, {
  instruction: 'Extract all email addresses',
  schema: { type: 'object', properties: { emails: { type: 'array', items: { type: 'string' } } } }
})

Uses Gemini Flash (free) by default. Falls back to OpenAI if configured.

Agent — Natural Language Browser Actions

Control a browser with natural language. Navigate, click, type, scroll — the LLM interprets the page and decides what to do.

const result = await web.agent('https://exampl

…