Audio General Intelligence

Open models that listen, understand, and reason about all sound

GAMMA Lab at the University of Maryland builds AI systems that reason deeply about speech, music, and the acoustic world — developing open benchmarks, models, and methods that advance the science of auditory intelligence.

Audio General Intelligence — the capacity of AI agents to deeply understand and reason about all types of auditory input, including speech, environmental sounds, and music — is crucial for enabling AI to interact seamlessly and naturally with our world.

Despite this importance, audio intelligence has traditionally lagged behind advances in vision and language processing. The gap stems from significant challenges: limited datasets, the complexity of audio signals, and a shortage of neural architectures and training methodologies tailored specifically to audio. Recent breakthroughs in Large Language Models have begun to transform the landscape, offering promising pathways to improve foundational audio tasks like ASR, cross-modal retrieval, and audio captioning, while giving rise to new tasks like complex Audio Question Answering.
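
As a concrete illustration of one of these foundational tasks, the sketch below runs off-the-shelf ASR through the open-source Hugging Face transformers library. The Whisper checkpoint named here is a widely used public model chosen purely for illustration; it is not a GAMMA Lab release.

```python
# Minimal automatic speech recognition (ASR) sketch using Hugging Face
# transformers. "openai/whisper-small" is a public checkpoint chosen only
# to illustrate the task, not one of our models.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Transcribe a local audio clip; the pipeline resamples the input as needed.
result = asr("example_clip.wav")
print(result["text"])
```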

At GAMMA Lab, our mission is to accelerate progress toward Audio General Intelligence through open and accessible innovation. Our flagship Audio Flamingo series, spanning GAMA, AF2, AF3, and now Audio Flamingo Next, combines specialized architectures, optimized audio encoders, and meticulously curated alignment datasets, yielding models that excel at complex reasoning, long-form audio understanding, and hallucination robustness across speech, sound, and music.

We are actively expanding beyond audio into full omni-modal intelligence: MMOU and EgoAVU push reasoning over long, complex real-world video. Through open-source models, rigorous benchmarks like MMAU and MMOU, and synthetic data frameworks such as Synthio, GAMMA Lab fosters transparency and collaboration, ensuring that audio and multimodal intelligence remain inclusive, impactful, and accessible worldwide.
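
To make the benchmark workflow concrete, here is a minimal sketch of the evaluation loop behind audio question answering benchmarks such as MMAU. The inline examples and the placeholder model are hypothetical stand-ins; each benchmark's release documents its real data format and protocol.

```python
# Sketch of the evaluation loop shape behind audio question answering
# benchmarks. The inline examples and placeholder model below are
# hypothetical stand-ins; see each benchmark's release for the real
# data format and evaluation protocol.
examples = [
    {"audio": "clip_001.wav",
     "question": "Which instrument carries the melody?",
     "choices": ["piano", "violin", "flute"],
     "answer": "violin"},
    {"audio": "clip_002.wav",
     "question": "Is the speaker indoors or outdoors?",
     "choices": ["indoors", "outdoors"],
     "answer": "indoors"},
]

def answer(audio_path: str, question: str, choices: list[str]) -> str:
    """Placeholder policy: always pick the first choice. Swap in a real audio LLM."""
    return choices[0]

correct = sum(
    answer(e["audio"], e["question"], e["choices"]) == e["answer"]
    for e in examples
)
print(f"Accuracy: {correct / len(examples):.2f}")
```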

NVIDIA · Meta · Google · Adobe · Dolby · Scale AI · Sesame

2026 Papers

ICASSP 2026 · Oral
Exploring Audio Hallucination in Egocentric Video Understanding
Multimodal · Hallucination
Seth, Mei et al.
Preprint
Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation
Generation · Multimodal
Lokegaonkar, Bhosale et al.
Preprint
Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
Audio LLMs · Reasoning
Ghosh, Goel et al.
Preprint
Audio Hallucination Attacks: Probing the Reliability of Large Audio Language Models
Robustness · Audio LLMs
Seth, Kumar et al.
Preprint
MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos
Benchmarks · Omni-Modal
Goel, Ghosh et al.
Preprint
TAC: Timestamped Audio Captioning
Audio Captioning · Temporal
Kumar, Seetharaman et al.
CVPR 2026 · Highlight
EgoAVU: Egocentric Audio-Visual Understanding
Egocentric · Audio-Visual
Seth, Mei et al.

News

Apr 2026
Audio Flamingo Next released — next-generation open audio-language model supporting 30-minute audio, temporal reasoning, and 1M+ hours of training data.
Apr 2026
Do Audio-Visual Large Language Models Really See and Hear? accepted to CVPR 2026 — mechanistic interpretability study revealing a strong vision bias in audio-visual LLMs.
Mar 2026
MMOU released — omni-modal benchmark for reasoning over long, complex real-world videos across visual, audio, and text signals.
Mar 2026
AHA: Audio Hallucination Attacks released — probing the reliability of large audio language models with 6,500 adversarial QA pairs.
Dec 2025
MultiVox accepted as Oral at EMNLP 2025 — multimodal voice assistant benchmark over audio-visual content.
Nov 2025
Audio Flamingo 3 selected as Spotlight at NeurIPS 2025. New state-of-the-art on MMAU and AIR-Bench.
Oct 2025
EH-MAM accepted as Oral at EMNLP 2025 — easy-to-hard masked acoustic modeling for self-supervised pre-training.
Sep 2025
Three papers at NAACL 2025: ProSE (Oral), PAT (Oral), and Do Audio-Language Models Understand Linguistic Variations?
Jul 2025
Audio Flamingo 2 accepted to ICML 2025. Long-form audio understanding up to 5 minutes across speech, music, and environmental sound.
May 2025
MMAU accepted as Spotlight at ICLR 2025. Music Flamingo accepted at ICLR 2026.
Nov 2024
GAMA accepted as Oral at EMNLP 2024 — general-purpose audio understanding via instruction-tuned LLMs.

Events

Oct 2025
DCASE 2025 Audio Question Answering Challenge — Workshop, Oct 30–31, Barcelona, Spain.
Jun 2025
JSALT 2025 Summer Workshop — June 9 – August 1, Brno, Czechia.
Apr 2025
SALMA Workshop @ ICASSP 2025 — Sound and Language Multimodal Analysis. April 6–11, Hyderabad, India.