Multimodal AI

Also known as: multimodal, vision-language model

Models that handle more than text — images, audio, or video alongside language.

A multimodal model can take in and/or produce more than one type of data — for instance, reading an image and answering questions about it, transcribing audio, or generating images from a text description. The model learns a shared representation that connects, say, the word 'dog' with pictures of dogs.
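To make the idea of a shared representation concrete, here is a toy sketch in Python. The three-dimensional vectors and the names `text_embeddings`, `image_embeddings`, and `best_caption` are invented for illustration; a real model learns much larger vectors from data. The sketch only shows how text and images can be compared once both live in the same vector space.

```python
import math

def cosine(a, b):
    """Cosine similarity: near 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made embeddings (assumption: not produced by any real model).
text_embeddings = {
    "dog": [0.9, 0.1, 0.0],
    "car": [0.0, 0.2, 0.9],
}
image_embeddings = {
    "photo_of_dog": [0.8, 0.2, 0.1],
    "photo_of_car": [0.1, 0.1, 0.95],
}

def best_caption(image_name):
    """Return the word whose embedding is closest to the image's embedding."""
    image_vec = image_embeddings[image_name]
    return max(text_embeddings, key=lambda word: cosine(text_embeddings[word], image_vec))

print(best_caption("photo_of_dog"))  # dog
print(best_caption("photo_of_car"))  # car
```

Because the word and the picture land close together in the same space, matching one to the other reduces to a nearest-neighbor lookup.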

This unlocks use cases text-only models can't touch: describing a photo for accessibility, reading a chart in a PDF, analyzing a screenshot, or holding a spoken conversation. Frontier assistants are increasingly multimodal by default.

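Under the hood it's still largely transformer-based: each data type is converted into tokens (an image, for example, is cut into small patches and each patch becomes one token), so the same model can process them together in a single sequence.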
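The conversion into tokens can be sketched in a few lines of Python. This is a simplified illustration, not any particular model's implementation: the tiny 4x4 image, the four-word vocabulary, the embedding width of 8, and the random weights are all assumptions chosen to keep the example small. It shows image patches and words ending up as vectors of the same width in one sequence.

```python
import random

def image_to_patches(image, patch):
    """Split a grayscale image (list of rows) into flattened patch vectors.

    Assumes height and width are multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def project(vec, weights):
    """Linear projection of a patch vector to the embedding width."""
    return [sum(v * w for v, w in zip(vec, row)) for row in weights]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
EMBED_DIM = 8            # hypothetical embedding width
PATCH = 2                # hypothetical patch size (2x2 pixels)

# Hypothetical vocabulary and randomly initialized text embedding table.
vocab = {"what": 0, "is": 1, "this": 2, "?": 3}
text_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]

# Random projection weights: one row of PATCH*PATCH weights per output dimension.
patch_weights = [[rng.uniform(-1, 1) for _ in range(PATCH * PATCH)]
                 for _ in range(EMBED_DIM)]

image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]  # 4x4 gradient

image_tokens = [project(p, patch_weights) for p in image_to_patches(image, PATCH)]
text_tokens = [text_table[vocab[w]] for w in ["what", "is", "this", "?"]]

# One sequence, one vector width: this is what the transformer would consume.
sequence = image_tokens + text_tokens
print(len(image_tokens), len(text_tokens), len(sequence))  # 4 4 8
```

In a trained model the projection weights and the embedding table are learned, and position information is added so the model knows where each patch and word sits.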

Learn more in Module 13 — Multimodal AI →
