Express course · No. 34
For years, models only handled text. Now they take images, audio, even video — a multimodal model can describe a photo, read a chart, transcribe a meeting, or answer questions about a screenshot. The engineering you learned for text still applies, plus new powers and new traps. Learn what multimodal unlocks, how a model perceives more than words, and where it goes wrong.
Essence only · One picture per idea · More than words
The first thing to grasp is simply what changed: models stopped being limited to reading and writing words. Once a model can take in images and sound, a whole new range of problems opens up.
A modality is a kind of input or output
A person who can read, but also see, listen, and speak — each sense is a different channel for taking in the world, not a different brain.
A modality is a type of data — text, images, audio, video. A model that handles more than one is multimodal: it can take an image and text together, or produce speech from a description. For years, models were text-only — read words, write words. The shift is that a single model can now perceive across modalities, the way a person uses several senses, rather than being deaf and blind to everything but text.
The same model can take mixed input
Handing someone a photo and asking "what's wrong with this?" — they look and answer in words. One question, two kinds of input, one response.
The power of multimodal isn't just handling images instead of text — it's handling them together. You can show the model a screenshot and ask a question about it, give it a chart and a written request, send a photo with instructions. The model reasons over all of it at once. This combination is what makes multimodal genuinely new: input is no longer one channel, so your prompts can mix words and pictures the way a real conversation does.
The text skills still apply
Learning to see doesn't make you forget how to read — the new sense adds to what you could already do, it doesn't replace it.
Everything you learned for text — prompting, context, structured output, grounding, evals — still applies to multimodal. The image or audio is just another part of the input you assemble into the context; the model still produces an output you constrain, validate, and measure. So multimodal isn't a separate discipline to relearn from scratch; it's the same engineering with an extra kind of input and output. Carry your text skills over and add the new modality's specifics on top.
A modality is a kind of data; a multimodal model handles several — images and audio, not just text — and can mix them in one input. All your text skills still apply, with a new sense added.
It helps to know, roughly, how a model takes in an image — because it demystifies what multimodal can and can't do, and why it sometimes misreads what's right in front of it.
An image becomes the same kind of representation as text
A translator who turns both spoken words and sign language into the same written notes — different inputs, converted to one common form the brain works with.
Under the hood, a multimodal model converts an image into the same kind of internal representation it uses for text — turning pixels into a sequence the model can reason over alongside words. The image and the text end up in a shared form, which is exactly why the model can answer a text question about a picture: both have been translated into one common language inside the model. You don't need the details, but knowing image and text become comparable explains how the model relates them.
Seeing the text in an image is not the same as understanding the scene
You can read the label on a jar and separately understand what the jar is for — two different acts, even though both involve looking at the same object.
Two distinct capabilities get blurred. Reading the text within an image — a sign, a document, a screenshot — is roughly what used to need separate "OCR" (optical character recognition); a multimodal model can do it directly. Understanding the scene — what's happening, what objects are present, what a chart means — is different. A good multimodal model does both, but it's worth distinguishing "extract the words from this image" from "interpret what this image shows," because they're different asks with different reliability.
Perception has limits
Even sharp eyes miss fine print, misjudge a blurry photo, or misread a cluttered diagram — seeing is powerful but not flawless.
A model's vision is impressive but imperfect. It can misread small or low-quality text, miss details in a busy image, miscount objects, or confidently misdescribe something subtle. The perception is genuinely useful but not a precise instrument — treat what the model "sees" as a strong interpretation, not a guaranteed reading. This matters because it's tempting to assume that because the model can see, it sees correctly; like its text answers, its visual ones can be confidently wrong.
A model converts an image into the same internal form as text, so it can reason over both together. Reading text in an image differs from understanding the scene — and the perception, though powerful, can be confidently wrong.
The reason multimodal matters is the range of real problems it makes solvable. Seeing the concrete uses shows why it's more than a novelty.
Understanding documents and screenshots
An assistant who can glance at a paper form or a screen and pull out exactly what you need, instead of you typing it all in by hand.
A huge practical use is reading documents, forms, and screenshots — extracting data from an invoice, understanding a PDF's layout, answering questions about what's on a screen. Before multimodal, this needed brittle, specialised tools; now a model can look at the document and work with it directly, structure and all. Anywhere information lives in a visual format rather than clean text, multimodal turns "a human has to read and retype this" into something a model can handle.
Analysing images and charts
Showing an expert a graph and asking what it means — they read the visual and explain the trend, no spreadsheet required.
Multimodal models can analyse visual content: describe a photo, read and interpret a chart or diagram, spot what's in an image, compare two pictures. You can hand the model a sales chart and ask for the trend, a product photo and ask what's wrong, a diagram and ask it to explain. This turns images from things only humans could interpret into inputs your software can reason about — opening up any workflow where the meaningful information is visual.
Accessibility and reach
A guide who describes the scene aloud for someone who can't see it — turning a visual world into words anyone can use.
Multimodal also widens who and what your product can serve: describing images for visually-impaired users, letting people point a camera instead of typing, working with content that was never in text form. The same capability that reads a chart can narrate a photo or caption a video. Beyond any single feature, multimodal expands the surface of what an AI product can take as input — from "type your question" to "show me, tell me, or play me what you mean."
Multimodal unlocks reading documents and screenshots, analysing images and charts, and widening reach through accessibility — turning visual and audio information into something software can reason about.
Building with multimodal is mostly the engineering you already know, with images or audio added to the input. The familiar disciplines carry straight over.
Send the image alongside the text
Including a photo with your written question in the same message — the recipient sees both and answers the whole thing at once.
In practice, you build a multimodal request much like a text one: you assemble the context, but now it can include an image (or audio) along with your text instructions. "Here's a screenshot, and here's what I want to know about it" goes to the model as one combined input. This is just context engineering with a richer input — the image is one more thing you put in the window, deliberately, alongside the words. The assembly mindset you already have applies directly.
Still ask for structured output
A form-filler who looks at a messy receipt and writes the total, date, and vendor into neat labelled boxes — chaos in, clean data out.
When you use multimodal to extract information — pull fields from a document, classify an image, read a chart — you still want structured output: ask the model to return clean, schema-shaped data your code can use, not prose. A multimodal model reading a receipt should hand you {total, date, vendor}, not a paragraph. The same structured-output discipline that turns a text model into a reliable component does the same for a vision model. The modality is new; the bridge to your code isn't.
Validate and ground, as always
You double-check what someone reports back from a quick glance, especially the important details — a second look on what matters.
Because the model's perception can be wrong, you apply the same reliability discipline: validate the extracted data, ground answers in what's actually verifiable, and keep a human on high-stakes reads. A model misreading a number off an invoice is the visual version of a hallucination, so you treat its visual output as untrusted-until-checked, exactly as you would its text. The lesson from every other course holds: the model is a fallible component, and a new modality doesn't change that.
Building multimodal is context engineering with a richer input: send the image alongside the text, still ask for structured output, and validate and ground the result — the model is a fallible component, modality aside.
So far we've talked about models taking in images and audio. The other direction — models producing them — is its own large area, worth naming so you know where it fits.
Models can create images and audio too
An artist who paints what you describe, or a voice actor who speaks your script — generation is the mirror image of perception.
Just as models can take in images and sound, other models generate them: text-to-image models paint a picture from a description, text-to-speech turns words into a spoken voice, and there are models for music, video, and more. Generation is the flip side of understanding — output as another modality rather than input. It's a vast field of its own, but the key point is that "multimodal" spans both directions: a model can perceive other modalities, and a model can produce them.
The same engineering posture applies
You direct an artist with a clear brief and then review the result before using it — the same loop, whatever they're producing.
Building with generative modalities follows the same posture as everything else: a clear instruction (the prompt), an output you treat as a draft to review, and a human in control of using it. A generated image or voice is the model's confident attempt, to be checked and edited, not blindly shipped — the product-design discipline applies whether the output is text or a picture. So you don't need an entirely new playbook for generation; the same "fallible component, you stay in control" stance carries over.
Know it exists; reach for it deliberately
You don't commission a custom painting for a job that needs a sentence — you use the medium the task actually calls for.
Generative modalities are powerful for the right job — illustrations, synthesised speech for a voice product, video — but they're a deliberate choice, not a default to sprinkle everywhere. Reach for image or audio generation when the output genuinely needs to be in that modality, and stick to text when text does the job. Knowing this whole capability exists, and where it fits, is enough for now — the point is that multimodal is a two-way street, perceiving and producing across senses.
Multimodal runs both ways: models also generate images, speech, and more. The same engineering posture applies — a clear brief, a reviewed draft, a human in control — and you reach for it only when the output truly needs that modality.
Multimodal adds new failure modes on top of the familiar ones. A few specific traps catch teams who treat images as if they were as safe and cheap as text.
Images can carry hidden attacks
A photograph with instructions written in it that the eye barely notices but the machine reads perfectly — a message smuggled past you in plain sight.
A multimodal model reads everything in an image, including text a human might miss. That makes images a channel for prompt injection: an attacker can hide instructions in a picture — faint text, embedded in the pixels — that the model dutifully follows. This is the visual version of the injection problem from the security course, and it's nastier because you can't easily see the attack. Treat any image the model ingests as untrusted input that might be carrying instructions, not just innocent pixels.
Images cost a lot more than text
Sending a photograph instead of a sentence — far more to transmit and process, and the bill reflects it.
An image is worth far more than a thousand words to the meter: processing one consumes many more tokens than a short text prompt, so multimodal calls can be substantially more expensive. A feature that sends high-resolution images on every request can run up a surprising bill. The model-economics discipline applies with extra force here — be deliberate about image size and how often you send one, because the cost of "just include the picture" is much higher than including a line of text.
Confident misreading is the quiet failure
Someone who glances at a blurry sign and confidently tells you the wrong number — fast, sure, and wrong.
The familiar hallucination problem takes a visual form: a model can confidently misread a figure, miscount items, or describe something that isn't in the image — and sound just as sure as when it's right. Because the input is visual, these errors can be harder to catch than a textual slip. So for anything where a misread matters — a number off a document, a medical image, a safety check — you verify, keep a human in the loop, and never assume "it can see it" means "it read it right." The model's confidence is no more reliable about images than about text.
Multimodal's new traps: images can smuggle prompt injections, they cost far more tokens than text, and the model can misread them as confidently as it hallucinates in text. Treat images as untrusted, pricey, and fallibly read.
Using multimodal well is mostly applying everything you already know to a richer input, while respecting the new modality's specific costs and risks.
Reach for multimodal when the information is visual
You bring in the camera when the problem is something to look at, and stick to typing when it's something to say — matching the channel to the task.
The decision is simple: use multimodal when the meaningful information genuinely lives in an image or audio — a document to read, a scene to interpret, speech to understand — rather than forcing it into text first or avoiding it when it would help. But don't reach for it reflexively where text is cleaner, faster, and cheaper. The skill is recognising when a problem is actually visual or auditory, and letting the model perceive it directly, instead of either ignoring that capability or overusing it.
Carry every discipline over, plus the modality's own
A pilot rated for a new aircraft brings all their existing skills and adds the specifics of the new plane — not starting over, but extending.
Multimodal isn't a reason to forget what you learned. Assemble the image into context deliberately, ask for structured output, validate and ground the result, watch the cost, and treat images as untrusted. Then add the modality's specifics: visual injection, higher token cost, perception limits. Everything from the text courses still governs; multimodal just extends it with a new kind of input and its own handful of traps. Build on what you know, and learn only the new edges.
- Is the information genuinely visual or audio — or am I using multimodal where text is cleaner? - Am I assembling the image into context deliberately, like any other input? - Am I asking for structured output when extracting data from an image? - Could the image carry a hidden injection — am I treating it as untrusted? - Have I accounted for the much higher token cost of images? - Am I validating the perception and keeping a human on high-stakes reads?
- modality / multimodal — a kind of data; a model that handles more than one. - mixed input — combining image (or audio) and text in one prompt. - OCR / scene understanding — reading text in an image versus interpreting what it shows. - perception limits — the model can misread or misdescribe what it sees, confidently. - generation across modalities — text-to-image, text-to-speech; producing other modalities. - visual prompt injection — instructions hidden in an image that the model follows. - image token cost — images consume far more tokens than text.
- You reach for it when the information is truly visual or audio, not by reflex. - You assemble the image into context and ask for structured output when extracting data. - You treat images as untrusted, alert to visual injection. - You account for the higher cost and the perception limits. - You validate and keep a human on reads where a misread would matter.
Multimodal extends your text engineering to images and audio: reach for it when the information is visual, carry over every discipline — context, structured output, validation, cost — and respect the new traps of injection, cost, and confident misreading.