Why Multimodal Agents Need a Different Architecture Than Voice-Only
You can't add video to a voice pipeline and call it multimodal. True multimodal agents require parallel processing, unified reasoning, and a runtime designed for mid-session modality transitions.
A voice agent has a linear pipeline: audio in → ASR → LLM → TTS → audio out. A multimodal agent runs parallel streams: audio in + video frames → ASR + vision model → unified LLM reasoning → TTS + optional visual output. The difference isn't additive — it's architectural. The system must decide, at every moment, which modality to prioritize, how to fuse inputs from different senses, and how to allocate compute across streams.
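To make the parallel-stream idea concrete, here is a minimal sketch in Python using asyncio. The transcribe_chunk and describe_frame functions are stand-ins for real ASR and vision calls, and the names are illustrative only; the point is that the audio and video streams drain concurrently and land in one shared context that feeds a single reasoning step, rather than one pipeline waiting on the other.

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical stand-ins for real ASR / vision calls; names are illustrative only.
async def transcribe_chunk(audio_chunk: bytes) -> str:
    await asyncio.sleep(0.05)           # simulate ASR latency
    return f"transcript of {len(audio_chunk)} bytes"

async def describe_frame(frame: bytes) -> str:
    await asyncio.sleep(0.10)           # simulate vision-model latency
    return f"description of frame ({len(frame)} bytes)"

@dataclass
class FusedContext:
    # Both streams write into one context that a single LLM call reasons over.
    transcripts: list[str] = field(default_factory=list)
    frame_descriptions: list[str] = field(default_factory=list)

async def audio_worker(audio_queue: asyncio.Queue, ctx: FusedContext) -> None:
    # Drain the audio stream in parallel with the video stream.
    while (chunk := await audio_queue.get()) is not None:
        ctx.transcripts.append(await transcribe_chunk(chunk))

async def video_worker(video_queue: asyncio.Queue, ctx: FusedContext) -> None:
    while (frame := await video_queue.get()) is not None:
        ctx.frame_descriptions.append(await describe_frame(frame))

async def run_turn(audio_chunks, video_frames) -> FusedContext:
    audio_q, video_q = asyncio.Queue(), asyncio.Queue()
    ctx = FusedContext()
    workers = [
        asyncio.create_task(audio_worker(audio_q, ctx)),
        asyncio.create_task(video_worker(video_q, ctx)),
    ]
    for chunk in audio_chunks:
        await audio_q.put(chunk)
    for frame in video_frames:
        await video_q.put(frame)
    await audio_q.put(None)             # sentinel: end of stream
    await video_q.put(None)
    await asyncio.gather(*workers)
    return ctx                          # fused context feeds one unified LLM call

if __name__ == "__main__":
    ctx = asyncio.run(run_turn([b"aa", b"bbb"], [b"frame1"]))
    print(ctx.transcripts, ctx.frame_descriptions)
```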
Modality fusion, not modality switching
A naive multimodal system switches between modes: it's either listening or looking. A proper multimodal system fuses inputs: the agent hears the customer say 'this one is broken' while simultaneously seeing them point at a specific item on screen. The word 'this' only resolves with visual context. This cross-modal resolution requires a reasoning model that takes both inputs simultaneously — not sequentially.
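Here is what fusion looks like at the prompt level: a sketch that packs the transcript, the current frame, and the pointer position into one message so the model reasons over them together. The fuse_turn helper and the content-part layout are assumptions that mirror common chat APIs which accept mixed text and image parts; they are not any particular vendor's schema.

```python
import base64

def fuse_turn(transcript: str, frame_jpeg: bytes, pointer_xy: tuple[int, int]) -> list[dict]:
    """Build one multimodal message so the model sees speech and vision together.

    Illustrative shape only: a deictic word like "this" can only resolve
    because the words, the frame, and the pointer arrive in the same turn.
    """
    frame_b64 = base64.b64encode(frame_jpeg).decode()
    return [
        {
            "role": "user",
            "content": [
                # Speech and pointer location share a turn with the frame itself.
                {"type": "text",
                 "text": f'Customer said: "{transcript}" while pointing at '
                         f"pixel {pointer_xy} in the attached frame."},
                {"type": "image",
                 "data": frame_b64,
                 "mime_type": "image/jpeg"},
            ],
        }
    ]

message = fuse_turn("this one is broken", b"\xff\xd8 jpeg bytes", (412, 288))
```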
Seamless modality transitions
The agent starts as a voice call. The customer needs help navigating their account, so the agent offers screen sharing. Now it's voice + screen share. The customer shows a document on camera, and now it's voice + vision. These transitions should be seamless from both the user's and the agent's perspective. The conversation state, context, and persona carry across modality changes without restart. This requires a unified session runtime, not separate voice and video services stitched together.
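One way to picture a unified session runtime is a single session object whose active modalities are mutable state rather than separate services. The sketch below assumes hypothetical Session and Modality types; what matters is that enable() adds a stream to the same session without touching history or persona.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Modality(Enum):
    VOICE = auto()
    SCREEN_SHARE = auto()
    CAMERA = auto()

@dataclass
class Session:
    """One session object that outlives modality changes (illustrative types).

    History and persona stay attached to the session, so enabling screen share
    or camera never resets the conversation.
    """
    persona: str
    history: list[dict] = field(default_factory=list)
    active: set[Modality] = field(default_factory=lambda: {Modality.VOICE})

    def enable(self, modality: Modality) -> None:
        # Adding a modality is a state change on the same session,
        # not a hand-off to a separate video service.
        self.active.add(modality)
        self.history.append({"event": "modality_enabled", "modality": modality.name})

    def disable(self, modality: Modality) -> None:
        self.active.discard(modality)
        self.history.append({"event": "modality_disabled", "modality": modality.name})

# The call starts as voice-only, then gains screen share and camera
# without losing conversation state or persona.
session = Session(persona="patient support agent")
session.history.append({"role": "user", "text": "I can't find my billing page"})
session.enable(Modality.SCREEN_SHARE)
session.enable(Modality.CAMERA)
assert session.history[0]["text"] == "I can't find my billing page"
```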
Ready to build?
See how Mazed's multimodal AI agents work for your use case.