Graphic Density Grounding
Spatial text browser execution layer for AI agents
AI agents see web pages as screenshots. We convert them to text.
Any model reads it. No vision encoder. 10-30x cheaper per step.
Quick Start · How It Works · API Reference · MCP Integration · Multi-Model CLI · Benchmarks · Architecture
Every browser agent today (OpenAI Operator, Anthropic Computer Use, Google Mariner) takes a screenshot, feeds it to a vision model, and hopes the model can figure out where the buttons are. This costs 10,000-15,000 tokens per page view and introduces OCR hallucinations, spatial reasoning errors, and a mandatory vision model dependency.
What if the page representation was text that looks like the page? A character grid where element density encodes type: buttons render as heavy block runs, inputs as bordered boxes, links as light marks, and layout is preserved spatially. Any text model reads it natively. No vision encoder needed.
```
 [1]  [2]  [3]
┌──[4]──────────────────────────┐
└───────────────────────────────┘
 [5]  [6]  [7]
 [8]  [9]  [10]
[11]┌───────────────┐
    │ + New Project │
    └───────────────┘
```
The model sees spatial layout, element types, and numbered targets in ~1,000 tokens. It says {"action": "click", "element": 11} and the extension executes it.
We tested graphic density output on 11 different products: GitHub, ChatGPT, Gmail, Google Docs, Reddit, Namecheap, Convex, Supabase, Cloudflare, and more. A text-only model with zero instructions correctly identified every product, located primary actions, and understood visual hierarchy.
Then we benchmarked against WebArena-style tasks:
| Run | Score | Tokens | Time | What Changed |
|---|---|---|---|---|
| Run 1 | 0/5 (0%) | 394K | 33 min | Baseline: broken agent loop |
| Run 2 | 2/5 (40%) | 172K | 5.8 min | Prompt + budget fixes only |
| Run 3 | 3/5 (60%) | 172K | 10 min | Read mode: agent can see text content |
0% to 60% in one day. Same token budget. No task-specific tuning.
Fourteen properties fall out of the spatial text encoding:
| Property | How |
|---|---|
| No vision model needed | Any text model reads the character grid natively |
| Spatial layout | Elements appear where they are on screen |
| Action targeting | Numbered elements resolve to coordinates |
| 10-30x cheaper | ~1,000 tokens vs ~10,000-15,000 for screenshots |
| Model-agnostic | Claude, GPT, Llama, Gemini: anything that reads text |
| Anti-bot invisible | Runs in user's real browser via extension, zero fingerprint |
| Scroll containers | Independent scrollable regions as first-class elements |
| State diffable | Two text grids diff trivially → 90% token reduction per step |
| Multi-step planning | Hidden DOM scan reveals entire SPA flows in one shot |
| Iframe pipelining | Prefetch next page state, eliminate inter-step latency |
| Action sandboxing | Fork state in iframe, preview consequences before committing |
| Injection resistant | Non-interactive text compressed away → attack payloads stripped |
| Cross-platform | Same technique works with OS accessibility trees (desktop, mobile) |
| Page type detection | Heuristic classification from element composition |
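The "state diffable" row is easy to demonstrate with nothing but the standard library: two consecutive renders of the same page differ in a handful of lines, so a unified diff carries only the change. A sketch (the element lines here are made up for illustration):

```python
import difflib

# Two consecutive renders of the same page; only element 3 changed.
before = ['[1] button "Menu"', '[2] link "Docs"', '[3] input "Search"']
after  = ['[1] button "Menu"', '[2] link "Docs"', '[3] input "Search" *focused*']

# Keep only the +/- payload lines of the unified diff.
diff = [line for line in difflib.unified_diff(before, after, lineterm="")
        if line[:1] in "+-" and line[:3] not in ("+++", "---")]

for line in diff:
    print(line)
```

A follow-up step can send the two-line diff instead of re-sending the whole grid, which is where the per-step token reduction comes from.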
```sh
git clone https://github.com/Badgerion/GDG-browser.git
```

- Open Chrome → `chrome://extensions/`
- Enable Developer Mode (top right)
- Click Load unpacked → select the cloned folder
- Navigate to any website → click the extension icon → Scan Page
```sh
cd bridge
chmod +x install.sh server.js
./install.sh <your-extension-id>   # ID from chrome://extensions
```

Reload the extension. Test the connection:
```sh
curl http://127.0.0.1:7080/health
```

Install the Python client dependencies:

```sh
pip install requests anthropic
```

```python
from gd_client import GraphicDensity

gd = GraphicDensity()

# See the page
gd.print_state(mode="read")

# Navigate and interact
gd.navigate("https://github.com")
gd.fill(3, "search query")
gd.click(4)

# Read data from the page
state = gd.read()
print(state["content"])  # visible text
print(state["tables"])   # extracted table data
```

1. `GET /state?mode=numbered` → model sees spatial map + element registry
2. Model decides → `{"action": "click", "element": 11}`
3. `POST /action` → extension executes, waits for DOM to settle
4. Returns new state → model sees what happened
5. Repeat
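Steps 2-3 of this loop are just a translation from the model's JSON reply to a client call. A minimal dispatcher sketch, assuming a client exposing the `click`/`fill`/`navigate` methods shown in the Python example above (scroll, keypress, and error recovery omitted):

```python
def dispatch(client, action):
    """Execute one model-emitted action dict against a browser client.
    Unknown action names raise, so a malformed model reply fails loudly
    instead of silently doing nothing."""
    kind = action.get("action")
    if kind == "click":
        return client.click(action["element"])
    if kind == "fill":
        return client.fill(action["element"], action["value"])
    if kind == "navigate":
        return client.navigate(action["url"])
    raise ValueError(f"unsupported action: {kind!r}")
```

A full agent loop wraps this in `for step in range(budget): dispatch(client, decide(state))`, re-fetching state after each action.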
| Mode | Purpose | Tokens | Use When |
|---|---|---|---|
| `full` | Complete page with density typing | ~3,000-5,000 | Initial page understanding |
| `actions_only` | Only interactive elements | ~500-800 | Cheapest navigation |
| `numbered` | Interactive elements with IDs | ~1,000-2,000 | Standard agent operation |
| `numbered_v2` | + action hints, forms, layers | ~1,200-2,500 | Complex UIs with modals |
| `read` | + visible text + table extraction | ~2,000-4,000 | Reading data, finding answers |
Navigate Phase (numbered mode: cheap, spatial)
→ Click menus, fill search bars, navigate to target page

Read Phase (read mode: rich, content-aware)
→ Extract text, read tables, find specific data values

The model switches modes mid-task: `{"action": "switch_mode", "mode": "read"}`
Interaction hints: every element shows what actions it supports:

```
[5]  button "Submit"         click → submit
[12] input  "Search"         fill, clear
[17] link   "Documentation"  click → nav
[22] select "Country"        select
```
Form grouping: elements inside `<form>` tags are linked:

```
[7] input  "Email"     fill            {form:login}
[8] input  "Password"  fill            {form:login}
[9] button "Sign in"   click → submit  {form:login}
```
Layer awareness: modals and overlays are detected:

```
⚠ MODAL ACTIVE: interact with elements 41, 42 first
[41] button "Cancel"   click [modal]
[42] button "Confirm"  click [modal]
[1]  button "Menu"     click [blocked]
```
Table extraction: structured data with pagination:

── Table 1 (Showing 1-20 of 2,048 results) ──
| Name | Email | Status |
|---|---|---|
| Veronica Costello | veronica@example.com | Active |
| ...
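Once a table is extracted, filtering it is plain data work. A sketch that assumes each extracted table arrives as a list of dicts keyed by header (illustrative; the actual shape of `state["tables"]` may differ):

```python
def find_rows(table, **filters):
    """Return rows whose columns match every filter. Assumes rows are
    dicts keyed by header (illustrative; the real payload may differ)."""
    return [row for row in table
            if all(row.get(col) == want for col, want in filters.items())]

rows = [
    {"Name": "Veronica Costello", "Email": "veronica@example.com", "Status": "Active"},
    {"Name": "Jane Doe", "Email": "jane@example.com", "Status": "Inactive"},
]
active = find_rows(rows, Status="Active")
print(active[0]["Email"])
```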
The bridge exposes a local HTTP API on 127.0.0.1:7080:
| Method | Endpoint | Description |
|---|---|---|
| GET | `/state?mode=read` | Page state (map + registry + content) |
| GET | `/environment` | Full state + history + page type |
| POST | `/action` | Execute action: `{"action":"click","element":5}` |
| POST | `/batch` | Execute sequence, stops on first failure |
| POST | `/navigate` | Go to URL, returns new state |
| GET | `/tabs` | List all open browser tabs |
| GET | `/history` | Action history for current tab |
| DELETE | `/history` | Clear action history |
| GET | `/health` | Connection status |
```json
{"action": "click", "element": 5}
{"action": "fill", "element": 3, "value": "hello"}
{"action": "clear", "element": 3}
{"action": "select", "element": 7, "value": "Option text"}
{"action": "hover", "element": 12}
{"action": "scroll", "direction": "down"}
{"action": "scroll", "container": 14, "direction": "down"}
{"action": "keypress", "key": "Enter"}
{"action": "keypress", "key": "a", "modifiers": {"ctrl": true}}
{"action": "back"}
{"action": "forward"}
{"action": "wait", "duration": 1000}
```

Connect GDG to Claude Desktop, Cursor, or any MCP client:
```json
{
  "mcpServers": {
    "gdg-browser": {
      "command": "node",
      "args": ["/path/to/bridge/mcp-server.js"]
    }
  }
}
```

Tools: `gdg_get_state`, `gdg_click`, `gdg_fill`, `gdg_select`, `gdg_scroll`, `gdg_navigate`, `gdg_keypress`, `gdg_back`, `gdg_hover`, `gdg_tabs`
One script, any AI provider:
```sh
python bridge/gdg-agent.py "What's trending on GitHub?"
python bridge/gdg-agent.py -m openai/gpt-4o "Find flights to Tokyo"
python bridge/gdg-agent.py -m ollama/llama3 "Check Hacker News"
python bridge/gdg-agent.py -m groq/llama-3.3-70b-versatile "Search Amazon"
python bridge/gdg-agent.py -m gemini/gemini-2.0-flash "Go to Reddit"
python bridge/gdg-agent.py -m sambanova/Meta-Llama-3.1-70B-Instruct "Check weather"
```

Supports: Anthropic, OpenAI, Groq, Ollama (local), Sambanova, Gemini
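The `-m` flag is a `provider/model` pair, so routing reduces to splitting the spec. A sketch of the convention (the bare-name fallback provider is an assumption, not necessarily what `gdg-agent.py` defaults to):

```python
def parse_model(spec, default_provider="anthropic"):
    """Split an "-m openai/gpt-4o" style spec into (provider, model).
    A bare model name falls back to default_provider (assumed default)."""
    provider, sep, model = spec.partition("/")
    if not sep:
        return default_provider, spec
    return provider, model

print(parse_model("openai/gpt-4o"))
```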
Tested against two benchmark suites.

WebArena-style: a Magento admin panel (multi-step navigation, data retrieval, form interaction):
| Run | Score | Tokens | Time | What Changed |
|---|---|---|---|---|
| Run 1 | 0/5 (0%) | 394K | 33 min | Baseline: broken agent loop |
| Run 2 | 2/5 (40%) | 172K | 5.8 min | Prompt + budget fixes only |
| Run 3 | 3/5 (60%) | 172K | 10 min | Read mode: agent can see text content |
0% to 60% in one day. Same token budget. No task-specific tuning.
WebVoyager: GitHub tasks:
| Task | Intent | Steps | Result |
|---|---|---|---|
| gh_001 | Open issues in anthropics/anthropic-sdk-python | 2 | ✅ Found count |
| gh_002 | Most-used language in microsoft/vscode | 8 | ✅ TypeScript |
| gh_003 | Latest release of openai/openai-python | 2 | ✅ v2.29.0 |
| gh_004 | Contributor count for facebook/react | 25 | ❌ Hit step limit |
| gh_005 | About section of huggingface/transformers | 2 | ✅ Found description |
Score: 80% (4/5); three tasks completed in just 2 steps.
Comparison to screenshot-based agents:
| Approach | Tokens/step | Vision required | Cost/step |
|---|---|---|---|
| Screenshot + GPT-4o | 10,000-15,000 | Yes | $0.01-0.04 |
| DOM tree (rtrvr.ai) | 2,000-3,000 | No | $0.003-0.006 |
| Graphic Density (numbered) | 800-1,500 | No | $0.001-0.003 |
| Graphic Density (actions_only) | 400-800 | No | $0.0005-0.001 |
```
External Process (Python, curl, any language)
        ↓ HTTP (localhost:7080)
bridge/server.js    Native messaging host + HTTP server
        ↓ stdin/stdout (Chrome native messaging)
background.js       Service worker, tab routing, navigation
        ↓ chrome.tabs.sendMessage
renderer.js         Scanner + classifier + renderer + executor
        ↓ DOM / accessibility tree
Web Page
```
Extension-based, not CDP. Chrome Extension APIs are first-class browser citizens: sandboxed execution, no automation fingerprint, session persistence across crashes. CDP (Puppeteer/Playwright) is a debugging backdoor with detectable fingerprints.
Text output, not images. Any model that reads text can drive the browser. No vision encoder, no image preprocessing, no OCR. Token cost scales with page complexity, not pixel count.
Semantic density, not raw DOM. Elements are classified by role and rendered with visual weight: buttons are heavy, inputs have borders, links are light. The model perceives UI hierarchy from character density without needing CSS or computed styles.
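The role classification step can be sketched in a few lines; the tag rules and glyphs below are illustrative stand-ins, not the renderer's actual tables:

```python
def classify(tag, attrs=None):
    """Map a DOM tag (plus ARIA role) to a coarse element role.
    Simplified stand-in for the extension's classifier."""
    attrs = attrs or {}
    tag = tag.lower()
    if tag == "button" or attrs.get("role") == "button":
        return "button"
    if tag in ("input", "textarea", "select"):
        return "input"
    if tag == "a":
        return "link"
    return "text"

# Illustrative density glyphs: heavy for buttons, bordered for inputs,
# light for links (the shipped glyph set may differ).
GLYPHS = {"button": "████", "input": "▒▒▒▒", "link": "·", "text": " "}

print(GLYPHS[classify("button")])  # a button renders heavy
print(GLYPHS[classify("a")])       # a link renders light
```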
```
GDG-browser/
├── manifest.json     Chrome extension manifest (MV3)
├── renderer.js       Core: scanner + classifier + 5 render modes + action executor
├── background.js     Service worker: message routing, native messaging, tab management
├── popup.html        Testing UI with State/Actions/History tabs
├── popup.js          Popup controller
├── icons/            Extension icons
├── bridge/
│   ├── server.js         Native messaging host + HTTP API server
│   ├── mcp-server.js     MCP server for Claude Desktop / Cursor integration
│   ├── install.sh        One-time setup for native messaging registration
│   ├── gd_client.py      Python client library
│   ├── gdg-agent.py      Universal multi-model browser agent CLI
│   ├── agent_example.py  Example AI agent loop
│   └── benchmark.py      WebArena benchmark harness
└── README.md
```
- v0.1 – Spatial renderer + action executor + popup testing UI
- v0.2 – Read mode, interaction hints, form grouping, layer awareness, table extraction
- API – HTTP bridge + Python client + agent example
- v0.2.1 – MCP server (Claude Desktop / Cursor), multi-model CLI (Anthropic, OpenAI, Groq, Ollama, Gemini, Sambanova)
- v0.3 – State diff (send only changes), hidden DOM flow scan, iframe pipelining
- Sessions – Checkpoint/restore across model context resets
- Desktop – macOS/Windows accessibility tree → spatial text (same technique, native apps)
- Framework integrations – Browser Use, LangChain adapters
This is early-stage infrastructure. If you're building AI agents and hit a page the renderer can't handle, open an issue with the URL and what broke. Edge cases are how this gets better.
AGPL-3.0: use it, modify it, build on it. If you run it as a service, share your changes.
For commercial licensing, open an issue or reach out.