Every time a new multimodal model ships, someone posts that alt text is solved.
It isn't. It has been publicly not-solved for five years, and the models have gotten much better at image description in that time. The gap isn't description quality. The gap is what alt text is for.
What alt text is for
Alt text is not a caption. It is not a description. It is the text equivalent of whatever job the image was doing in its page.
- Decorative flourish: correct alt is empty (`alt=""`).
- Logo used as a link home: correct alt is "Home."
- Chart making a claim the surrounding text depends on: correct alt is the claim.
- Product photo where the details matter to a purchase: correct alt enumerates the details.
- Headshot above a caption naming the person: correct alt references the person, not their features.
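In markup, the first three cases look like this. A minimal sketch: the file names and the chart's claim are invented placeholders, not from any real page.

```html
<!-- Decorative flourish: empty alt tells screen readers to skip it entirely -->
<img src="divider.svg" alt="">

<!-- Logo linking home: the alt names the destination, not the image -->
<a href="/"><img src="logo.svg" alt="Home"></a>

<!-- Chart: the alt states the claim the surrounding text depends on -->
<img src="q3-revenue.png"
     alt="Q3 revenue grew 12% year over year, reversing two quarters of decline.">
```

Note that the chart's alt contains no mention of bars, axes, or colors; it carries the claim the image was there to make.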
The right alt text depends on the author's intent for the image in the page, plus the reader's task. Neither is in the pixels.
This is the irreducible problem with AI alt text. The input a multimodal model gets is the image. The right output depends on information that is, by definition, not in the image.
Where current models consistently fail
Not as a benchmark. As failure modes that are structural.
- Length. Screen readers do not want four-hundred-word descriptions. Models, asked to "describe this image," produce them. You can instruct around this, but the default is wrong, and defaults ship.
- Decorative vs. meaningful. Every decorative flourish gets described. The correct output for a decorative image is empty alt. A model cannot tell the difference without the surrounding markup, and usually not even then.
- Text in images. Often transcribed well. Sometimes hallucinated. For a sign, a diagram label, or a UI screenshot, the transcription has to be exact, and "usually correct" is not the same as correct.
- Context collapse. A headshot above a caption should reference the person by name, not their features. A chart alt should state the claim, not list the axes. Models default to description when the context calls for interpretation.
- Intent. The author chose this image to do something in this page. A model cannot recover that intent from the pixels alone.
None of these are bugs in a specific model. All five are structural consequences of the fact that correct alt text is a function of image + page + reader, and a multimodal model sees only the image.
Scaling the model fixes image understanding. It does not add the other two inputs.
Where it does help
Not useless. Specifically:
- As a fallback where the alternative is silence. Long-form content with missing alt, handled by an assistive-tech layer that injects AI-generated descriptions, is a real win. A mediocre description beats nothing.
- As a draft step. Auto-generate, then the author edits before publishing. The model does the grunt work, the human does the intent. This is the correct pipeline.
- For private use. A user asking what's in an image sent to them, for their own understanding, is well served.
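The draft-then-review pipeline can be sketched in a few lines. Everything here is a hypothetical illustration, not an existing API: `generate_draft` stands in for a real multimodal model call, and the names are invented for the sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AltDraft:
    image_path: str
    draft: str                      # model output: what the image contains
    approved: Optional[str] = None  # author decision: what the image is for

def generate_draft(image_path: str) -> str:
    # Stand-in for a multimodal model call; a real system would send
    # the image bytes to a vision model here.
    return f"Auto description of {image_path}"

def review(item: AltDraft, author_alt: str) -> AltDraft:
    # The author supplies intent. The empty string is a valid decision:
    # it means the image is decorative and ships as alt="".
    item.approved = author_alt
    return item

# The model drafts; the human decides, including deciding on empty alt.
item = AltDraft("divider.svg", generate_draft("divider.svg"))
item = review(item, "")  # author marks the flourish decorative
```

The design point is that `approved` starts as `None`, not as the draft: nothing ships until a human has made the call, and "empty" is a call only a human can make.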
These are meaningful uses and they are growing in quality. Don't confuse them with the other claim.
The claim that is wrong
That claim: shipping AI-generated alt text on a professional, authored site makes the site accessible.
It is not. It is the appearance of accessibility, with a non-trivial error rate baked in, on behalf of the users who hit that error rate hardest. Screen reader users notice. They recognize machine-generated alt within a sentence. They notice the hallucinations when they happen, and they notice when the alt is describing a decorative image instead of being empty.
If you ship this pattern without human review, you have not shipped accessibility. You have shipped a score.
The right framing
Use AI alt text the way you use spell-check. It catches the baseline. It does not make the piece correct.
The author is not replaceable by the model. The author knows what the image is for. The model knows what the image contains. Those are different jobs, and the job that actually matters is the one the model doesn't have the inputs to do.