mcp-see

Vision for AI agents, without the context bloat.

Vision models bloat context windows. Your agent needs to understand an image, not load every pixel. A 1 MB screenshot can cost thousands of tokens, and most of those tokens are background noise.

mcp-see flips the approach: agents request descriptions, not raw data.

Structured Extraction

Instead of loading an image into context, your agent asks questions about it:

describe — "What's in this image?" Get a text summary.
detect — "Find all the buttons." Get objects with bounding boxes.
describe_region — "What's in the top-right corner?" Zoom into areas of interest.
analyze_colors — "What's the color palette?" Extract design specs.

The agent gets structured understanding without the context cost. A screenshot that would burn roughly 2,000 tokens as raw image input comes back as a description of about 50 tokens.
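As a rough sketch of what one of these questions might look like on the wire, the snippet below builds an MCP `tools/call` request for the `detect` tool. The JSON-RPC envelope follows the MCP tool-calling convention; the argument names `image` and `prompt` are assumptions for illustration and may not match the parameters mcp-see actually expects.

```typescript
// The four tool names come from the list above.
type ToolName = "describe" | "detect" | "describe_region" | "analyze_colors";

// Shape of an MCP tool invocation (JSON-RPC 2.0, method "tools/call").
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: ToolName; arguments: Record<string, unknown> };
}

let nextId = 1;

// Wrap a tool name and its arguments in a JSON-RPC request with a fresh id.
function buildToolCall(
  name: ToolName,
  args: Record<string, unknown>
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// ASSUMPTION: `image` and `prompt` are hypothetical argument names.
const request = buildToolCall("detect", {
  image: "./screenshot.png",
  prompt: "Find all the buttons.",
});

console.log(JSON.stringify(request, null, 2));
```

In practice an MCP client library builds this envelope for you; the point is that only a short text request and a short structured answer cross the context window.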

Hierarchical Analysis

The real power is in the workflow. Start broad, then zoom in:

Overview first. "Describe this dashboard mockup."
Find regions of interest. "Detect all charts and graphs."
Zoom into specifics. "Describe the bar chart in the bottom-left."

Your agent builds understanding incrementally, only loading detail where it matters. No wasted context on the parts that don't need attention.
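The "zoom in only where it matters" step can be sketched as a small planning function: given the regions a detection pass returned, it keeps the ones the agent cares about and turns each into a follow-up `describe_region` query. The `Detection` and `RegionQuery` shapes here are hypothetical, not the actual mcp-see result format.

```typescript
// ASSUMPTION: these shapes are illustrative, not mcp-see's real schema.
interface BoundingBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface Detection {
  label: string;
  box: BoundingBox;
}

interface RegionQuery {
  tool: "describe_region";
  box: BoundingBox;
  prompt: string;
}

// Keep only detections whose label is of interest and plan one zoom query each.
function planZoomQueries(
  detections: Detection[],
  labelsOfInterest: string[]
): RegionQuery[] {
  const wanted = new Set(labelsOfInterest.map((l) => l.toLowerCase()));
  return detections
    .filter((d) => wanted.has(d.label.toLowerCase()))
    .map((d): RegionQuery => ({
      tool: "describe_region",
      box: d.box,
      prompt: `Describe the ${d.label} in detail.`,
    }));
}

// Example detections, as a broad "detect all charts and graphs" pass might return.
const detections: Detection[] = [
  { label: "bar chart", box: { x: 0, y: 400, width: 300, height: 200 } },
  { label: "pie chart", box: { x: 320, y: 400, width: 200, height: 200 } },
  { label: "navbar", box: { x: 0, y: 0, width: 800, height: 60 } },
];

const queries = planZoomQueries(detections, ["bar chart"]);
console.log(queries.length); // prints: 1
```

Regions that are never selected never generate a query, so they never cost any context.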

Design Workflow

mcp-see shines for design-to-code workflows. An agent can:

Get an overview of a mockup
Detect UI components (buttons, inputs, cards)
Extract colors and spacing from specific regions
Generate code without ever loading the full image

The mockup stays outside the context window. The structured specs go in.
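To make the last step concrete, here is a minimal sketch of turning extracted color specs into CSS custom properties. The `ColorSpec` shape is an assumption about what a color-analysis result could be reduced to, not the tool's documented output.

```typescript
// ASSUMPTION: a simplified, hypothetical view of an extracted palette entry.
interface ColorSpec {
  name: string; // e.g. "primary"
  hex: string; // e.g. "#1A73E8"
}

// Emit a :root block with one CSS custom property per color.
function toCssVariables(colors: ColorSpec[]): string {
  const lines = colors.map(
    (c) => `  --color-${c.name}: ${c.hex.toLowerCase()};`
  );
  return [":root {", ...lines, "}"].join("\n");
}

const palette: ColorSpec[] = [
  { name: "primary", hex: "#1A73E8" },
  { name: "surface", hex: "#FFFFFF" },
];

console.log(toCssVariables(palette));
```

The generated stylesheet is built entirely from a few short strings; the mockup image itself is never needed at this stage.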

Multi-Provider

mcp-see works with multiple vision providers:

Gemini — Google's vision models
OpenAI — GPT-4 Vision
Claude — Anthropic's vision capabilities

Configure your preferred provider. The tools work the same way regardless.
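One way a server could choose a provider is by checking which API key is set. The sketch below is an illustration only: `GEMINI_API_KEY` is the variable used in this document's configuration, while `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` and the priority order are assumptions based on common convention, not confirmed mcp-see behavior.

```typescript
type Provider = "gemini" | "openai" | "anthropic";

// ASSUMPTION: only GEMINI_API_KEY is confirmed by this document; the other
// variable names and the lookup order are hypothetical.
const KEY_VARIABLES: Array<[Provider, string]> = [
  ["gemini", "GEMINI_API_KEY"],
  ["openai", "OPENAI_API_KEY"],
  ["anthropic", "ANTHROPIC_API_KEY"],
];

// Return the first provider whose key is present and non-empty, if any.
function pickProvider(
  env: Record<string, string | undefined>
): Provider | undefined {
  for (const [provider, variable] of KEY_VARIABLES) {
    const value = env[variable];
    if (value !== undefined && value.trim() !== "") {
      return provider;
    }
  }
  return undefined;
}

console.log(pickProvider({ OPENAI_API_KEY: "your-key" })); // prints: openai
```

Taking the environment as a parameter keeps the function easy to test; a real server would pass in its process environment.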

Try It

Add to your MCP configuration:

{
  "mcpServers": {
    "mcp-see": {
      "command": "npx",
      "args": ["github:sanity-io/mcp-see"],
      "env": {
        "GEMINI_API_KEY": "your-key"
      }
    }
  }
}

Works with Gemini (free tier available), OpenAI, or Anthropic. Set the API key for your preferred provider.

Your agent can now see without seeing.

→ GitHub
