Multimedia is multimodal by nature. We study multimodal learning: how to align representations across modalities (vision, language, audio, motion), how to fuse them effectively, and how to build models that reason in a grounded way across modalities. This theme connects with our work on generative models and on geometric structure.