workspace/skills/deep-scraper/SKILL.md
Case 8638500190 Feed Hunter: deep scraper skill, pipeline, simulator, first investigation
- Built deep-scraper skill (CDP-based X feed extraction)
- Three-stage pipeline: scrape → triage → investigate
- Paper trading simulator with position tracking
- First live investigation: verified kch123 Polymarket profile ($9.3M P&L)
- Opened first paper position: Seahawks Super Bowl @ 68c
- Telegram alerts with inline action buttons
- Portal build in progress (night shift)
2026-02-07 23:58:40 -06:00


---
name: deep-scraper
description: Deep scrape X/Twitter feeds to extract structured post data (author, text, metrics, links, media, cards) via Chrome DevTools Protocol, then classify posts by category (crypto, polymarket, arbitrage, trading) with spam detection and signal scoring. Use when scraping social feeds, analyzing X posts for money-making signals, or extracting structured data from web pages via authenticated browser sessions.
---
# Deep Scraper
Three-stage pipeline for X/Twitter feed intelligence:
1. **Scrape** — Extract structured posts from DOM via CDP
2. **Triage** — Identify posts with verifiable claims + actionable links
3. **Investigate** — Agent follows links, pulls real data, verifies claims
## Prerequisites
- Chrome with `--remote-debugging-port=9222 --remote-allow-origins=*`
- User's Chrome profile copied to debug dir (script handles this)
- `websocket-client` Python package
### Launch Chrome for scraping
```bash
bash scripts/launch-chrome-debug.sh # copies auth cookies, starts on port 9222
```
## Stage 1: Scrape
```bash
python3 -u scripts/scrape-x-feed.py --port 9222 --scroll-pages 5
```
Output: `data/x-feed/<timestamp>/posts.json` — structured per-post data.
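Before triaging, a quick sanity pass over `posts.json` confirms the scrape captured real data. A minimal sketch, assuming each post is a dict carrying the `metrics` sub-object named in the skill description (the `likes` key and the helper name are assumptions, not the script's guaranteed schema):

```python
import json
from pathlib import Path

def top_posts(posts_path: str, n: int = 5) -> list[dict]:
    """Return the n posts with the most likes (assumed metric key)."""
    posts = json.loads(Path(posts_path).read_text())
    return sorted(
        posts,
        key=lambda p: p.get("metrics", {}).get("likes", 0),
        reverse=True,
    )[:n]
```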
## Stage 2: Triage
```bash
python3 scripts/triage-posts.py data/x-feed/<timestamp>/posts.json
```
Output: `triage.json` with the investigation queue. The triage pass detects:
- **Performance claims** — "wins 97% of the time", "10x return"
- **Copy trading** — "copy this trader", "mirror bets"
- **Arbitrage** — spreads, mispricing, risk-free claims
- **Prediction markets** — Polymarket/Kalshi links, odds claims
- **Price targets** — entry points, PT claims
- **Airdrops** — free tokens, claiming steps
Each post gets an investigation priority score and a task list.
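The detection-plus-scoring idea can be sketched with regex category matching; these patterns and the one-point-per-category score are illustrative stand-ins, not the rules `triage-posts.py` actually ships:

```python
import re

# Illustrative patterns per triage category; the real script's rules may differ.
PATTERNS = {
    "performance_claim": r"\b\d{1,3}%|\b\d+x\b",
    "copy_trading": r"\bcopy\b|\bcopying\b|mirror",
    "arbitrage_opp": r"arbitrage|risk[- ]?free|mispric",
    "prediction_market": r"polymarket|kalshi",
    "price_target": r"\bPT\b|price target|entry",
    "airdrop": r"airdrop|free tokens?|claim",
}

def triage(text: str) -> tuple[list[str], int]:
    """Return matched categories and a naive priority score (one point each)."""
    hits = [cat for cat, pat in PATTERNS.items() if re.search(pat, text, re.I)]
    return hits, len(hits)

print(triage("Copying @whaletrader99 bets wins 97% of the time"))
# (['performance_claim', 'copy_trading'], 2)
```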
## Stage 3: Investigate (Agent Workflow)
Read `triage.json`, then for each post in the investigation queue:
1. **Read the claim** — What exactly is being claimed?
2. **Follow links** — Use `web_fetch` or `browser` to visit linked pages
3. **Extract evidence** — Pull actual data (P&L, odds, prices, activity)
4. **Verify** — Does the data support the claim?
5. **Assess** — Is this actionable? Time-sensitive? Still valid?
6. **Report** — Summarize findings with verdict
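The queue-reading step above can be sketched as a small loader; `queue` and `priority` are assumed field names, since the actual `triage.json` schema is whatever `triage-posts.py` emits:

```python
import json
from pathlib import Path

def investigation_queue(triage_path: str) -> list[dict]:
    """Return queued posts, highest priority first (field names are assumptions)."""
    data = json.loads(Path(triage_path).read_text())
    return sorted(
        data.get("queue", []),
        key=lambda item: item.get("priority", 0),
        reverse=True,
    )
```

Working highest-priority-first means time-sensitive claims (arbitrage, live odds) get verified before they go stale.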
### Investigation methods by claim type:
- **performance_claim / copy_trading**: Browse to user profile or linked platform. Check actual trade history, recent activity, win rate.
- **arbitrage_opp**: Check both sides of the spread. Verify it still exists. Calculate fees and slippage.
- **prediction_market**: Fetch current market page. Check odds, volume, resolution date. Assess if bet has edge.
- **price_target**: Check current price on CoinGecko/TradingView. Compare to claim.
- **airdrop**: Verify project legitimacy. Check contract on block explorer. Look for scam signals.
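For `price_target` claims, the core comparison is mechanical: how far is the current price from the claimed target? A hedged sketch (the 50% plausibility threshold is arbitrary, not part of the skill):

```python
def target_distance(current: float, target: float) -> float:
    """Signed percent move still required to reach the claimed target."""
    return (target - current) / current * 100

def plausibility(current: float, target: float, max_move: float = 50.0) -> str:
    """Naive sanity flag: targets beyond max_move% away are suspect."""
    return "plausible" if abs(target_distance(current, target)) <= max_move else "suspect"

print(round(target_distance(50_000, 60_000), 1))  # 20.0
```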
### Example agent flow:
```
Post: "Copying @whaletrader99 bets wins 97% of the time"
→ Triage: performance_claim + copy_trading
→ Tasks:
1. Browse to @whaletrader99 profile on Polymarket/X
2. Check their actual bet history and P&L
3. Verify "97%" claim against real data
4. Check recency — are they still active?
5. Assess: Is copying viable? Fees? Timing lag?
→ Verdict: VERIFIED / EXAGGERATED / STALE / SCAM
```
## Output Structure
All outputs go to `data/x-feed/<timestamp>/`:
- `posts.json` — raw structured scrape
- `summary.md` — human-readable feed
- `triage.json` — investigation queue
- `analysis.json` — category/spam analysis (optional, from analyze-posts.py)