📚 Glossary
Multimodal
In one line: A model that can handle multiple input types — text, images, audio, video — not just text.
Each input type is called a modality. The most common modality beyond text is vision (image input), and newer models also accept audio and video.
Multimodal capabilities on AskAI.free:
- ChatGPT 4o: text + images + voice
- Claude Sonnet 4: text + images
- Gemini 2.5 Pro: text + images + audio + video
Use cases: 'extract text from this screenshot', 'what's wrong with this diagram', 'transcribe and summarise this meeting recording'. File uploads are a Pro feature on AskAI.free.
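To pick a model for one of these use cases, the capability list above can be treated as a simple lookup table. The following is a minimal Python sketch: the `MODEL_MODALITIES` dictionary and the `models_supporting` function are illustrative names invented here, not part of any AskAI.free API, and the data is copied from the list above.

```python
# Capability table copied from the list above. The structure and names are
# assumptions made for illustration only, not an AskAI.free API.
MODEL_MODALITIES = {
    "ChatGPT 4o": {"text", "images", "voice"},
    "Claude Sonnet 4": {"text", "images"},
    "Gemini 2.5 Pro": {"text", "images", "audio", "video"},
}


def models_supporting(required):
    """Return, in alphabetical order, the models that list every required modality."""
    needed = set(required)
    return sorted(
        name for name, modes in MODEL_MODALITIES.items() if needed <= modes
    )


if __name__ == "__main__":
    # A screenshot question needs text plus images.
    print(models_supporting({"text", "images"}))
    # A meeting recording with video needs video input.
    print(models_supporting({"video"}))
```

With the table as listed, every model qualifies for a screenshot question, while only Gemini 2.5 Pro is listed with video input.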