

 ██████╗ ██████╗  ██████╗
██╔════╝ ██╔══██╗██╔════╝
██║  ███╗██║  ██║██║  ███╗
██║   ██║██║  ██║██║   ██║
╚██████╔╝██████╔╝╚██████╔╝
 ╚═════╝ ╚═════╝  ╚═════╝

Graphic Density Grounding

Spatial text browser execution layer for AI agents

License: AGPL v3 · Chrome Extension · Node.js · Python


AI agents see web pages as screenshots. We convert them to text.
Any model reads it. No vision encoder. 10-30x cheaper per step.

Quick Start · How It Works · API Reference · MCP Integration · Multi-Model CLI · Benchmarks · Architecture


The Problem

Every browser agent today — OpenAI Operator, Anthropic Computer Use, Google Mariner — takes a screenshot, feeds it to a vision model, and hopes the model can figure out where the buttons are. This costs 10,000-15,000 tokens per page view and introduces OCR hallucinations, spatial reasoning errors, and mandatory vision model dependencies.

The Idea

What if the page representation was text that looks like the page? A character grid where element density encodes type — buttons render as ████, inputs as ╔══╗, links as ▸, and layout is preserved spatially. Any text model reads it natively. No vision encoder needed.

         [1]              [2]  [3]
  
  ╔══[4]═══════════════════════════╗
  ╚════════════════════════════════╝

     [5]         [6]         [7]

     [8]         [9]         [10]

              [11]██████████████
              █ + New Project █
              ██████████████████

The model sees spatial layout, element types, and numbered targets in ~1,000 tokens. It says {"action": "click", "element": 11} and the extension executes it.
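That round trip can be sketched as plain data, assuming the `/action` endpoint from the API Reference below; `make_action` is a hypothetical helper, not part of the shipped client:

```python
import json

BRIDGE = "http://127.0.0.1:7080"  # local bridge from the Quick Start (assumed running)

def make_action(action: str, **params) -> dict:
    """Build an action payload in the shape the bridge expects."""
    return {"action": action, **params}

payload = make_action("click", element=11)
print(json.dumps(payload))

# Executing it is one HTTP call, e.g.:
#   requests.post(f"{BRIDGE}/action", json=payload)
```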

Validation

We tested graphic density output on 11 different products — GitHub, ChatGPT, Gmail, Google Docs, Reddit, Namecheap, Convex, Supabase, Cloudflare, and more. A text-only model with zero instructions correctly identified every product, located primary actions, and understood visual hierarchy.

Then we benchmarked against WebArena-style tasks:

| Run | Score | Tokens | Time | What Changed |
|---|---|---|---|---|
| Run 1 | 0/5 (0%) | 394K | 33 min | Baseline — broken agent loop |
| Run 2 | 2/5 (40%) | 172K | 5.8 min | Prompt + budget fixes only |
| Run 3 | 3/5 (60%) | 172K | 10 min | Read mode — agent can see text content |

0% to 60% in one day. Same token budget. No task-specific tuning.

Properties

Fourteen properties fall out of the spatial text encoding:

| Property | How |
|---|---|
| No vision model needed | Any text model reads the character grid natively |
| Spatial layout | Elements appear where they are on screen |
| Action targeting | Numbered elements resolve to coordinates |
| 10-30x cheaper | ~1,000 tokens vs ~10,000-15,000 for screenshots |
| Model-agnostic | Claude, GPT, Llama, Gemini — anything that reads text |
| Anti-bot invisible | Runs in the user's real browser via extension, zero fingerprint |
| Scroll containers | Independent scrollable regions as first-class elements |
| State diffable | Two text grids diff trivially — 90% token reduction per step |
| Multi-step planning | Hidden DOM scan reveals entire SPA flows in one shot |
| Iframe pipelining | Prefetch next page state, eliminate inter-step latency |
| Action sandboxing | Fork state in an iframe, preview consequences before committing |
| Injection resistant | Non-interactive text is compressed away — attack payloads stripped |
| Cross-platform | Same technique works with OS accessibility trees (desktop, mobile) |
| Page type detection | Heuristic classification from element composition |
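As an illustration of the "state diffable" row, two successive text grids can be diffed with nothing but the standard library (a sketch; the project's actual diff format may differ):

```python
import difflib

def state_diff(prev: str, curr: str) -> str:
    """Return only the changed grid lines between two page states.
    Line-level unified diff with zero context: unchanged lines
    (usually the vast majority) are never re-sent to the model."""
    diff = difflib.unified_diff(
        prev.splitlines(), curr.splitlines(), lineterm="", n=0
    )
    return "\n".join(diff)

prev = "[1] button Menu\n[2] link Docs\n[3] button Save"
curr = "[1] button Menu\n[2] link Docs\n[3] button Saved"
print(state_diff(prev, curr))  # only the [3] line appears, prefixed -/+
```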

Quick Start

1. Load the Extension

```sh
git clone https://github.com/Badgerion/GDG-browser.git
```

  1. Open Chrome → chrome://extensions/
  2. Enable Developer Mode (top right)
  3. Click Load unpacked → select the cloned folder
  4. Navigate to any website → click the extension icon → Scan Page

2. Set Up the API Bridge

```sh
cd bridge
chmod +x install.sh server.js
./install.sh <your-extension-id>  # ID from chrome://extensions
```

Reload the extension. Test the connection:

```sh
curl http://127.0.0.1:7080/health
```

3. Drive It from Python

```sh
pip install requests anthropic
```

```python
from gd_client import GraphicDensity

gd = GraphicDensity()

# See the page
gd.print_state(mode="read")

# Navigate and interact
gd.navigate("https://github.com")
gd.fill(3, "search query")
gd.click(4)

# Read data from the page
state = gd.read()
print(state["content"])   # visible text
print(state["tables"])    # extracted table data
```

How It Works

The Core Loop

1. GET /state?mode=numbered  →  model sees spatial map + element registry
2. Model decides             →  {"action": "click", "element": 11}
3. POST /action              →  extension executes, waits for DOM to settle
4. Returns new state         →  model sees what happened
5. Repeat
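The loop above can be sketched with the transport injected, so the same code runs against the live bridge (`requests.get`/`requests.post` on localhost:7080) or, as here, against stubs; `agent_step` is an illustrative helper, not part of the shipped client:

```python
import json
from typing import Callable

def agent_step(get_state: Callable[[], str],
               decide: Callable[[str], dict],
               post_action: Callable[[dict], str]) -> str:
    """One iteration of the core loop: observe → decide → act."""
    state = get_state()          # GET /state?mode=numbered
    action = decide(state)       # model emits {"action": ..., ...}
    return post_action(action)   # POST /action returns the new state

# Stubbed demo: a "model" that always clicks element 11.
new_state = agent_step(
    get_state=lambda: "[11] button '+ New Project'",
    decide=lambda s: {"action": "click", "element": 11},
    post_action=lambda a: f"executed {json.dumps(a)}",
)
print(new_state)
```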

Five Render Modes

| Mode | Purpose | Tokens | Use When |
|---|---|---|---|
| full | Complete page with density typing | ~3,000-5,000 | Initial page understanding |
| actions_only | Only interactive elements | ~500-800 | Cheapest navigation |
| numbered | Interactive elements with IDs | ~1,000-2,000 | Standard agent operation |
| numbered_v2 | + action hints, forms, layers | ~1,200-2,500 | Complex UIs with modals |
| read | + visible text + table extraction | ~2,000-4,000 | Reading data, finding answers |

Adaptive Two-Phase Strategy

Navigate Phase (numbered mode — cheap, spatial)
  → Click menus, fill search bars, navigate to the target page

Read Phase (read mode — rich, content-aware)
  → Extract text, read tables, find specific data values

The model switches modes mid-task: {"action": "switch_mode", "mode": "read"}

v0.2 Enhancements

Interaction hints — every element shows what actions it supports:

[5]  button  "Submit"           click → submit
[12] input   "Search"           fill, clear
[17] link    "Documentation"    click → nav
[22] select  "Country"          select

Form grouping — elements inside <form> tags are linked:

[7]  input   "Email"            fill    {form:login}
[8]  input   "Password"         fill    {form:login}
[9]  button  "Sign in"          click → submit  {form:login}

Layer awareness — modals and overlays are detected:

⚠ MODAL ACTIVE — interact with elements 41, 42 first

[41] button  "Cancel"     click     [modal]
[42] button  "Confirm"    click     [modal]
[1]  button  "Menu"       click     [blocked]

Table extraction — structured data with pagination:

── Table 1 (Showing 1-20 of 2,048 results) ──
| Name | Email | Status |
|---|---|---|
| Veronica Costello | veronica@example.com | Active |
| ...
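A consumer can turn that pipe-delimited block into structured rows with a few lines of Python (a sketch, assuming the header/separator/data layout shown above; the shipped client's `state["tables"]` already returns extracted data):

```python
def parse_table(block: str) -> list[dict]:
    """Parse a pipe-delimited table block into a list of row dicts."""
    lines = [l.strip() for l in block.strip().splitlines()
             if l.strip().startswith("|")]
    split = lambda l: [c.strip() for c in l.strip("|").split("|")]
    header = split(lines[0])
    # Skip the |---|---| separator: it contains only pipes, dashes, colons.
    rows = [split(l) for l in lines[1:] if set(l) - set("|- :")]
    return [dict(zip(header, r)) for r in rows]

sample = """
| Name | Email | Status |
|---|---|---|
| Veronica Costello | veronica@example.com | Active |
"""
print(parse_table(sample))
```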

API Reference

The bridge exposes a local HTTP API on 127.0.0.1:7080:

| Method | Endpoint | Description |
|---|---|---|
| GET | /state?mode=read | Page state (map + registry + content) |
| GET | /environment | Full state + history + page type |
| POST | /action | Execute action: {"action":"click","element":5} |
| POST | /batch | Execute a sequence; stops on first failure |
| POST | /navigate | Go to URL, returns new state |
| GET | /tabs | List all open browser tabs |
| GET | /history | Action history for current tab |
| DELETE | /history | Clear action history |
| GET | /health | Connection status |

Action Types

```json
{"action": "click", "element": 5}
{"action": "fill", "element": 3, "value": "hello"}
{"action": "clear", "element": 3}
{"action": "select", "element": 7, "value": "Option text"}
{"action": "hover", "element": 12}
{"action": "scroll", "direction": "down"}
{"action": "scroll", "container": 14, "direction": "down"}
{"action": "keypress", "key": "Enter"}
{"action": "keypress", "key": "a", "modifiers": {"ctrl": true}}
{"action": "back"}
{"action": "forward"}
{"action": "wait", "duration": 1000}
```
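A client can sanity-check a model's payload against this list before POSTing it (a minimal sketch derived from the actions above; the bridge does its own validation, and `validate` is a hypothetical helper):

```python
# Required fields per action type, taken from the list above.
REQUIRED = {
    "click": {"element"}, "fill": {"element", "value"}, "clear": {"element"},
    "select": {"element", "value"}, "hover": {"element"},
    "scroll": {"direction"}, "keypress": {"key"},
    "back": set(), "forward": set(), "wait": {"duration"},
}

def validate(action: dict) -> bool:
    """True if the payload names a known action and carries its required fields."""
    kind = action.get("action")
    return kind in REQUIRED and REQUIRED[kind] <= set(action) - {"action"}

print(validate({"action": "fill", "element": 3, "value": "hello"}))  # True
print(validate({"action": "fill", "element": 3}))                    # False: missing "value"
```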

MCP Integration

Connect GDG to Claude Desktop, Cursor, or any MCP client:

```json
{
  "mcpServers": {
    "gdg-browser": {
      "command": "node",
      "args": ["/path/to/bridge/mcp-server.js"]
    }
  }
}
```

Tools: gdg_get_state, gdg_click, gdg_fill, gdg_select, gdg_scroll, gdg_navigate, gdg_keypress, gdg_back, gdg_hover, gdg_tabs


Multi-Model CLI

One script, any AI provider:

```sh
python bridge/gdg-agent.py "What's trending on GitHub?"
python bridge/gdg-agent.py -m openai/gpt-4o "Find flights to Tokyo"
python bridge/gdg-agent.py -m ollama/llama3 "Check Hacker News"
python bridge/gdg-agent.py -m groq/llama-3.3-70b-versatile "Search Amazon"
python bridge/gdg-agent.py -m gemini/gemini-2.0-flash "Go to Reddit"
python bridge/gdg-agent.py -m sambanova/Meta-Llama-3.1-70B-Instruct "Check weather"
```

Supports: Anthropic, OpenAI, Groq, Ollama (local), Sambanova, Gemini


Benchmarks

WebArena — Magento admin panel (multi-step navigation, data retrieval, form interaction):

| Run | Score | Tokens | Time | What Changed |
|---|---|---|---|---|
| Run 1 | 0/5 (0%) | 394K | 33 min | Baseline — broken agent loop |
| Run 2 | 2/5 (40%) | 172K | 5.8 min | Prompt + budget fixes only |
| Run 3 | 3/5 (60%) | 172K | 10 min | Read mode — agent can see text content |

0% to 60% in one day. Same token budget. No task-specific tuning.

WebVoyager — GitHub tasks:

| Task | Intent | Steps | Result |
|---|---|---|---|
| gh_001 | Open issues in anthropics/anthropic-sdk-python | 2 | ✓ Found count |
| gh_002 | Most-used language in microsoft/vscode | 8 | ✓ TypeScript |
| gh_003 | Latest release of openai/openai-python | 2 | ✓ v2.29.0 |
| gh_004 | Contributor count for facebook/react | 25 | ✗ Hit step limit |
| gh_005 | About section of huggingface/transformers | 2 | ✓ Found description |

Score: 80% (4/5) — three tasks completed in just 2 steps.

Comparison to screenshot-based agents:

| Approach | Tokens/step | Vision required | Cost/step |
|---|---|---|---|
| Screenshot + GPT-4o | 10,000-15,000 | Yes | $0.01-0.04 |
| DOM tree (rtrvr.ai) | 2,000-3,000 | No | $0.003-0.006 |
| Graphic Density (numbered) | 800-1,500 | No | $0.001-0.003 |
| Graphic Density (actions_only) | 400-800 | No | $0.0005-0.001 |
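The cost column is simple arithmetic on token counts; a back-of-envelope check, where the $3 per million input tokens price is an assumed illustrative figure, not a quote for any specific model:

```python
PRICE_PER_M = 3.00  # USD per 1M input tokens (assumed for illustration)

def cost_per_step(tokens: int) -> float:
    """Cost of one agent step = tokens consumed × price per token."""
    return tokens / 1_000_000 * PRICE_PER_M

for name, tokens in [("screenshot", 12_000), ("numbered", 1_000), ("actions_only", 500)]:
    print(f"{name:>12}: ${cost_per_step(tokens):.4f}")
```

At these figures a screenshot step costs 12x a numbered-mode step, which sits inside the 10-30x range claimed earlier.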

Architecture

External Process (Python, curl, any language)
  ↕ HTTP (localhost:7080)
bridge/server.js         ← Native messaging host + HTTP server
  ↕ stdin/stdout (Chrome native messaging)
background.js            ← Service worker, tab routing, navigation
  ↕ chrome.tabs.sendMessage
renderer.js              ← Scanner + classifier + renderer + executor
  ↕ DOM / accessibility tree
Web Page

Key Design Decisions

Extension-based, not CDP. Chrome Extension APIs are first-class browser citizens — sandboxed execution, no automation fingerprint, session persistence across crashes. CDP (Puppeteer/Playwright) is a debugging backdoor with detectable fingerprints.

Text output, not images. Any model that reads text can drive the browser. No vision encoder, no image preprocessing, no OCR. Token cost scales with page complexity, not pixel count.

Semantic density, not raw DOM. Elements are classified by role and rendered with visual weight. Buttons are heavy (████), inputs have borders (╔══╗), links are light (▸). The model perceives UI hierarchy from character density without needing CSS or computed styles.


File Structure

GDG-browser/
├── manifest.json           Chrome extension manifest (MV3)
├── renderer.js             Core: scanner + classifier + 5 render modes + action executor
├── background.js           Service worker: message routing, native messaging, tab management
├── popup.html              Testing UI with State/Actions/History tabs
├── popup.js                Popup controller
├── icons/                  Extension icons
├── bridge/
│   ├── server.js           Native messaging host + HTTP API server
│   ├── mcp-server.js       MCP server for Claude Desktop / Cursor integration
│   ├── install.sh          One-time setup for native messaging registration
│   ├── gd_client.py        Python client library
│   ├── gdg-agent.py        Universal multi-model browser agent CLI
│   ├── agent_example.py    Example AI agent loop
│   └── benchmark.py        WebArena benchmark harness
└── README.md

Roadmap

  • v0.1 — Spatial renderer + action executor + popup testing UI
  • v0.2 — Read mode, interaction hints, form grouping, layer awareness, table extraction
  • API — HTTP bridge + Python client + agent example
  • v0.2.1 — MCP server (Claude Desktop / Cursor), multi-model CLI (Anthropic, OpenAI, Groq, Ollama, Gemini, Sambanova)
  • v0.3 — State diff (send only changes), hidden DOM flow scan, iframe pipelining
  • Sessions — Checkpoint/restore across model context resets
  • Desktop — macOS/Windows accessibility tree → spatial text (same technique, native apps)
  • Framework integrations — Browser Use, LangChain adapters

Contributing

This is early-stage infrastructure. If you're building AI agents and hit a page the renderer can't handle, open an issue with the URL and what broke. Edge cases are how this gets better.


License

AGPL-3.0 — Use it, modify it, build on it. If you run it as a service, share your changes.

For commercial licensing, open an issue or reach out.
