Audio General Intelligence

Open models that listen, understand, and reason about all sound

GAMMA Lab at the University of Maryland builds AI systems that reason deeply about speech, music, and the acoustic world — developing open benchmarks, models, and methods that advance the science of auditory intelligence.

Audio General Intelligence — the capacity of AI agents to deeply understand and reason about all types of auditory input, including speech, environmental sounds, and music — is crucial for enabling AI to interact seamlessly and naturally with our world.

Despite this importance, audio intelligence has traditionally lagged behind advances in vision and language processing. The gap stems from significant challenges: limited datasets, the complexity of audio signals, and a shortage of neural architectures and training methodologies tailored specifically to audio. Recent breakthroughs in Large Language Models have begun to transform the landscape, offering promising pathways to improve foundational audio tasks like ASR, cross-modal retrieval, and audio captioning, while giving rise to new tasks like complex Audio Question Answering.
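
As a concrete illustration of one of these foundational tasks, the sketch below runs off-the-shelf ASR through the open-source Hugging Face transformers library. The Whisper checkpoint named here is a widely used public model chosen purely for illustration; it is not a GAMMA Lab release.

```python
# Minimal automatic speech recognition (ASR) sketch using Hugging Face
# transformers. "openai/whisper-small" is a public checkpoint chosen only
# to illustrate the task, not one of our models.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Transcribe a local audio clip; the pipeline resamples the input as needed.
result = asr("example_clip.wav")
print(result["text"])
```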

At GAMMA Lab, our mission is to accelerate progress toward Audio General Intelligence through open and accessible innovation. Our flagship Audio Flamingo series, spanning GAMA, AF2, AF3, and now Audio Flamingo Next, combines specialized architectures, optimized audio encoders, and meticulously curated alignment datasets, yielding models that excel at complex reasoning, long-form audio understanding, and hallucination robustness across speech, sound, and music.

We are actively expanding beyond audio into full omni-modal intelligence: MMOU and EgoAVU push reasoning over long, complex real-world video. Through open-source models, rigorous benchmarks like MMAU and MMOU, and synthetic data frameworks such as Synthio, GAMMA Lab fosters transparency and collaboration, ensuring that audio and multimodal intelligence remain inclusive, impactful, and accessible worldwide.
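
To make the benchmark workflow concrete, here is a minimal sketch of the evaluation loop behind audio question answering benchmarks such as MMAU. The inline examples and the placeholder model are hypothetical stand-ins; each benchmark's release documents its real data format and protocol.

```python
# Sketch of the evaluation loop shape behind audio question answering
# benchmarks. The inline examples and placeholder model below are
# hypothetical stand-ins; see each benchmark's release for the real
# data format and evaluation protocol.
examples = [
    {"audio": "clip_001.wav",
     "question": "Which instrument carries the melody?",
     "choices": ["piano", "violin", "flute"],
     "answer": "violin"},
    {"audio": "clip_002.wav",
     "question": "Is the speaker indoors or outdoors?",
     "choices": ["indoors", "outdoors"],
     "answer": "indoors"},
]

def answer(audio_path: str, question: str, choices: list[str]) -> str:
    """Placeholder policy: always pick the first choice. Swap in a real audio LLM."""
    return choices[0]

correct = sum(
    answer(e["audio"], e["question"], e["choices"]) == e["answer"]
    for e in examples
)
print(f"Accuracy: {correct / len(examples):.2f}")
```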

NVIDIA · Meta · Google · Adobe · Dolby · Scale AI · Sesame

2026 Papers

ICASSP 2026 · Oral
Exploring Audio Hallucination in Egocentric Video Understanding
Multimodal · Hallucination
Seth, Mei et al.
Preprint
Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation
Generation · Multimodal
Lokegaonkar, Bhosale et al.
Preprint
Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
Audio LLMs · Reasoning
Ghosh, Goel et al.
Preprint
Audio Hallucination Attacks: Probing the Reliability of Large Audio Language Models
Robustness · Audio LLMs
Seth, Kumar et al.
Preprint
MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos
Benchmarks · Omni-Modal
Goel, Ghosh et al.
Preprint
TAC: Timestamped Audio Captioning
Audio Captioning · Temporal
Kumar, Seetharaman et al.
CVPR 2026 · Highlight
EgoAVU: Egocentric Audio-Visual Understanding
Egocentric · Audio-Visual
Seth, Mei et al.

News

Apr 2026
Audio Flamingo Next released — next-generation open audio-language model supporting 30-minute audio, temporal reasoning, and 1M+ hours of training data.
Apr 2026
Do Audio-Visual Large Language Models Really See and Hear? accepted to CVPR 2026 — mechanistic interpretability study revealing a strong vision bias in audio-visual LLMs.
Mar 2026
MMOU released — omni-modal benchmark for reasoning over long, complex real-world videos across visual, audio, and text signals.
Mar 2026
AHA: Audio Hallucination Attacks released — probing the reliability of large audio language models with 6,500 adversarial QA pairs.
Dec 2025
MultiVox accepted as Oral at EMNLP 2025 — multimodal voice assistant benchmark over audio-visual content.
Nov 2025
Audio Flamingo 3 selected as Spotlight at NeurIPS 2025. New state-of-the-art on MMAU and AIR-Bench.
Oct 2025
EH-MAM accepted as Oral at EMNLP 2025 — easy-to-hard masked acoustic modeling for self-supervised pre-training.
Sep 2025
Three papers at NAACL 2025: ProSE (Oral), PAT (Oral), and Do Audio-Language Models Understand Linguistic Variations?
Jul 2025
Audio Flamingo 2 accepted to ICML 2025. Long-form audio understanding up to 5 minutes across speech, music, and environmental sound.
May 2025
MMAU accepted as Spotlight at ICLR 2025. Music Flamingo accepted at ICLR 2026.
Nov 2024
GAMA accepted as Oral at EMNLP 2024 — general-purpose audio understanding via instruction-tuned LLMs.

Events

Oct 2025
DCASE 2025 Audio Question Answering Challenge — Workshop, Oct 30–31, Barcelona, Spain.
Jun 2025
JSALT 2025 Summer Workshop — June 9 – August 1, Brno, Czechia.
Apr 2025
SALMA Workshop @ ICASSP 2025 — Sound and Language Multimodal Analysis. April 6–11, Hyderabad, India.