
Claude can see — using vision and multimodal in your workflow

Drop a screenshot. Ask a question. Get a real answer.

Claude reads images, PDFs, and screenshots. Most people don't use this. Here are the patterns that turn vision into a daily-use tool, not a parlor trick.


Claude reads images. Not just describes them: actually reads them. Charts, screenshots, handwritten whiteboards, design mockups, PDFs, error screens. Most people who use Claude daily haven't adopted this; the few who have report it changes how they work.

The capability, briefly

In Claude.ai, drag any image into the chat. In the API, send it as a content block:

import anthropic
import base64

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("dashboard.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-1",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_b64
                    }
                },
                {
                    "type": "text",
                    "text": "What's wrong with this dashboard?"
                }
            ]
        }
    ]
)
print(response.content[0].text)

Claude responds as if it had eyes. The accuracy on real-world images — handwriting, screenshots with small text, chart axes — is high enough to rely on.

Five workflows where it earns its keep

1. Debugging UIs

Drop a screenshot of your broken UI. Ask: "Why is the right column wider than expected?" Claude reads the layout, identifies the likely CSS culprit, and suggests fixes. You didn't write a single selector or copy a single element.

This works for design review too: paste a Figma export and ask "what doesn't feel polished?" Claude points to inconsistent spacing, weak hierarchy, and color clashes you'd miss.

2. Reading dashboards

Drop a chart screenshot from Datadog, Grafana, or your analytics tool. Ask: "What's the anomaly here? Compared to what?" Claude reads the axes, the trend, the spike, the units. Often faster than configuring the same alert in your monitoring tool.

3. Understanding error screens

Some errors only show up as screenshots: a non-technical teammate sends a Slack screenshot of a bug you can't reproduce. Drop the screenshot into Claude with: "What's this error telling us? What would break to cause this?" Claude reads the stack trace visible in the screenshot, infers the cause, and suggests a fix.

4. Whiteboard capture

Take a phone photo of the whiteboard after a meeting. Drop it into Claude: "Transcribe and structure this." Claude reads the handwriting, organizes the boxes and arrows, and gives you a clean Markdown summary. Faster than asking the meeting's scribe.

5. PDF deep reads

Claude reads PDFs: research papers, contracts, financial statements. With Claude's long context window, you can drop a 200-page contract and ask specific questions: "Where's the termination clause? What's the notice period?" Faster and cheaper than human review for first-pass questions.
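A PDF uses the same content-block shape as an image, just with a document block. A minimal sketch, assuming the anthropic SDK; the helper name pdf_block and the file contract.pdf are illustrative:

```python
import base64

def pdf_block(path: str) -> dict:
    """Build the base64 document content block the Messages API expects for a PDF."""
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "document",
        "source": {"type": "base64", "media_type": "application/pdf", "data": data},
    }

# Then send it alongside your question (assumes an API key in the environment):
#
#   client = anthropic.Anthropic()
#   response = client.messages.create(
#       model="claude-opus-4-1",
#       max_tokens=1024,
#       messages=[{"role": "user", "content": [
#           pdf_block("contract.pdf"),
#           {"type": "text", "text": "Where's the termination clause?"},
#       ]}],
#   )
#   print(response.content[0].text)
```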

What it can't do

  • Generate images. Claude reads but doesn't draw. Pair it with a separate image-generation model when you need pictures back.
  • Click on things. Vision is a read-only sense. For "click this button," use Computer Use or the Chrome MCP.
  • Watch video natively. No video stream support yet. For video, sample frames as images.
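The frame-sampling workaround can be sketched as a helper that turns pre-extracted frames into image blocks. This assumes you've already dumped frames with a tool like ffmpeg; the helper name frames_to_content is illustrative:

```python
import base64
from pathlib import Path

def frames_to_content(frame_dir: str, question: str) -> list[dict]:
    """Build a Messages API content list from sampled video frames.

    Extract frames first, e.g.: ffmpeg -i clip.mp4 -vf fps=1 frames/%03d.png
    """
    content = []
    for frame in sorted(Path(frame_dir).glob("*.png")):
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.standard_b64encode(frame.read_bytes()).decode("utf-8"),
            },
        })
    # The question goes last, after the frames it refers to
    content.append({"type": "text", "text": question})
    return content
```

Pass the result as the "content" of a user message; keep the frame count low, since each frame costs image tokens.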

The trap

Don't use vision when text would work better. If you have the raw HTML or the actual data, send that. A screenshot of a JSON response is dramatically worse for Claude than the JSON itself: Claude has to OCR it back to text, with risk of error.

Rule: vision is for when text isn't available. Don't reach for screenshots out of habit.

In Claude Code

Claude Code accepts image inputs the same way Claude.ai does — drag a screenshot into the terminal session. Useful for:

  • Debugging your dev server output by screenshotting the browser
  • Pasting design mocks for "build this UI"
  • Investigating CI failures that only show up as a Vercel preview screenshot

Where to go next

  • Read the Anthropic vision docs for the SDK details and supported formats.
  • For Claude that can act on what it sees, look into Computer Use.
  • Pair vision with tool use — Claude can read a screenshot, then call your tools to act on what it found.
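That last pairing can be sketched as a single request that gives Claude both a screenshot and a tool it may call about what it sees. The file_alert tool and build_request helper are hypothetical; this assumes the Messages API tools parameter:

```python
import base64

# Hypothetical tool: Claude reads a dashboard screenshot, then calls this
# to file an alert for whatever anomaly it found.
file_alert_tool = {
    "name": "file_alert",
    "description": "File an alert for a metric anomaly.",
    "input_schema": {
        "type": "object",
        "properties": {
            "metric": {"type": "string"},
            "summary": {"type": "string"},
        },
        "required": ["metric", "summary"],
    },
}

def build_request(image_path: str) -> dict:
    """Assemble kwargs for client.messages.create: one image, one prompt, one tool."""
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "model": "claude-opus-4-1",
        "max_tokens": 1024,
        "tools": [file_alert_tool],
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png",
                            "data": image_b64}},
                {"type": "text",
                 "text": "If this dashboard shows an anomaly, file an alert for it."},
            ],
        }],
    }

# Send with: client.messages.create(**build_request("dashboard.png")),
# then check the response for a tool_use block and run the tool yourself.
```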
