Multimodal AI
Also known as · multimodal · vision-language model
Models that handle more than text — images, audio, or video alongside language.
A multimodal model can accept or produce more than one type of data: for instance, reading an image and answering questions about it, transcribing audio, or generating images from a text description. Many such models learn a shared representation that connects, say, the word 'dog' with pictures of dogs.
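To make the idea of a shared representation concrete, here is a minimal sketch in Python. The three-dimensional vectors are hand-picked for illustration and do not come from any real model; the file names and the `best_caption` helper are likewise hypothetical. The point is only that once text and images live in the same vector space, matching them reduces to a similarity comparison.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-picked embeddings (assumption: a real model would
# learn these from data and use hundreds or thousands of dimensions).
text_embeddings = {
    "dog": [0.9, 0.1, 0.0],
    "car": [0.0, 0.2, 0.9],
}
image_embeddings = {
    "photo_of_dog.jpg": [0.8, 0.2, 0.1],
    "photo_of_car.jpg": [0.1, 0.1, 0.95],
}

def best_caption(image_name):
    """Return the word whose embedding is closest to the image's embedding."""
    image_vec = image_embeddings[image_name]
    return max(
        text_embeddings,
        key=lambda word: cosine_similarity(text_embeddings[word], image_vec),
    )

print(best_caption("photo_of_dog.jpg"))
print(best_caption("photo_of_car.jpg"))
```

In this toy setup the dog photo sits closest to the word 'dog' and the car photo closest to 'car', because their vectors point in nearly the same direction.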
This unlocks use cases text-only models can't touch: describing a photo for accessibility, reading a chart in a PDF, analyzing a screenshot, or holding a spoken conversation. Frontier assistants are increasingly multimodal by default.
Under the hood, most multimodal models are still transformer-based: each data type is converted into tokens (for example, an image is cut into small patches) so that the same model can process them together.
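The tokenization step can be sketched as follows. This is a simplified illustration, not any particular model's pipeline: it assumes a tiny grayscale image stored as a nested list, image dimensions that divide evenly by the patch size, and whitespace-split words standing in for a real text tokenizer. The function names `image_to_patches` and `build_sequence` are made up for this example.

```python
def image_to_patches(image, patch_size):
    """Split a 2D grid of pixel values into flattened square patches.

    Assumes the image height and width are multiples of patch_size.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

def build_sequence(text, image, patch_size):
    """Combine image patches and words into one tagged token sequence."""
    image_tokens = [("image", patch) for patch in image_to_patches(image, patch_size)]
    # Whitespace splitting is a stand-in for a real subword tokenizer.
    text_tokens = [("text", word) for word in text.split()]
    return image_tokens + text_tokens

# A 4x4 "image" whose pixel values are just 0..15 for easy inspection.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
sequence = build_sequence("what animal is this", image, patch_size=2)

for kind, value in sequence:
    print(kind, value)
```

A real model would then map every token, whether patch or word, to an embedding vector of the same width before the transformer layers see it; that projection is omitted here to keep the sketch short.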