An investigation of how audio-visual language models incorrectly infer sounds from visual context in egocentric video. An evaluation framework is constructed using 300 egocentric videos and 1,000 sound-focused questions, with a taxonomy distinguishing foreground action sounds from background ambient sounds. State-of-the-art models such as Qwen2.5-Omni achieve only 27.3% and 39.5% accuracy on foreground and background sound questions, respectively, highlighting critical gaps in reliable multimodal egocentric understanding.
A text-conditioned video-to-music generation model that enables fast, high-quality, semantically and stylistically controllable background music creation. Video-Robin uses an autoregressive module to align visual and textual inputs for high-level musical planning, followed by Diffusion Transformers that refine the plan into high-quality audio. The dual-component architecture achieves superior performance over existing approaches on standard and challenging test sets with significantly faster inference speeds.
The next-generation model in the Audio Flamingo series, advancing understanding and reasoning over speech, environmental sounds, and music. AF-Next introduces a stronger foundational audio-language model, scalable strategies for constructing large-scale audio understanding and reasoning data, support for up to 30-minute audio inputs, and temporal reasoning that grounds intermediate steps to timestamps. Trained on over 1 million hours of data with curriculum-based training, AF-Next achieves competitive performance with both open and larger proprietary models across 20 benchmarks.
Investigates how audio-visual large language models process multimodal inputs through mechanistic interpretability techniques. The study reveals a strong vision bias in audio understanding — models hallucinate sounds from what they see rather than what they hear. Attention analysis and causal interventions show that while audio representations contain meaningful semantic information internally, the visual modality dominates text generation, particularly in deeper layers where cross-modal integration occurs.
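The causal-intervention idea can be pictured with a small sketch: mean-ablate one modality's token states before a layer and compare how much the output shifts. The toy layer, token layout, and ablation choice below are assumptions for illustration only, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

# Toy stand-in for one layer of an audio-visual LLM; the real model, layer depth,
# and token layout are assumptions made purely for illustration.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True).eval()

seq = torch.randn(1, 30, 64)          # [text | audio | visual] tokens, hypothetical layout
audio_idx = slice(10, 20)             # assumed positions of audio tokens
visual_idx = slice(20, 30)            # assumed positions of visual tokens

def mean_ablate(hidden, idx):
    """Replace one modality's token states with their mean, removing token-specific content."""
    patched = hidden.clone()
    patched[:, idx] = hidden[:, idx].mean(dim=1, keepdim=True)
    return patched

with torch.no_grad():
    base = layer(seq)
    no_audio = layer(mean_ablate(seq, audio_idx))
    no_visual = layer(mean_ablate(seq, visual_idx))

# If ablating visual tokens perturbs the output far more than ablating audio tokens,
# the computation feeding text generation is dominated by the visual stream.
print("delta when audio ablated :", (base - no_audio).norm().item())
print("delta when visual ablated:", (base - no_visual).norm().item())
```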
Introduces AHA-Eval, an attack suite of 6,500 QA pairs that probes whether large audio language models genuinely ground their responses in the audio input rather than relying on spurious correlations. The suite targets two attack surfaces: query-based attacks that manipulate question structure and audio-based attacks using synthetic speech. Testing reveals alarming vulnerability rates of 95.35% and 79.65% on Audio Flamingo 3 and Gemini Pro, respectively. A mitigation dataset is also proposed, reducing attack success rates by up to 49%.
A benchmark for evaluating multimodal large language models on their capacity to jointly reason over visual, audio, and textual signals in long and complex real-world videos. MMOU comprises 15,000 questions paired with 9,038 videos across 13 skill categories. The best proprietary model achieves only 64.2% accuracy while leading open-source models reach 46.8%, revealing fundamental gaps in omni-modal reasoning over extended video content.
Addresses hallucinations and temporal inconsistencies in audio captioning by generating temporally grounded descriptions of complex acoustic scenes. TAC disentangles overlapping sound events and assigns precise timestamps to each, producing more accurate and temporally consistent captions across audio and audio-visual understanding benchmarks.
A dataset and benchmark for joint audio-visual understanding of egocentric videos. A data generation pipeline using token-based video filtering and modular graph-based curation produces 3 million audio-visual training samples. Existing MLLMs are shown to bias heavily toward visual signals, often neglecting audio cues. Fine-tuning on the resulting dataset yields up to 113% performance improvement on the EgoAVU benchmark and up to 28% relative gains on other egocentric video benchmarks.
An audio-language model designed for deep music understanding, addressing music's dynamic, layered, and information-dense nature. Introduces MF-Skills, a curated dataset spanning harmony, structure, timbre, lyrics, and cultural context, alongside chain-of-thought training (MF-Think) and reinforcement learning. Achieves state-of-the-art results on 10+ music understanding benchmarks, advancing from surface-level recognition toward layered, human-like perception of songs.
A lightweight plug-and-play framework for endowing large audio-language models with spatial audio understanding and reasoning. SPUR incorporates a First-Order Ambisonics encoder for processing directional audio channels and introduces SPUR-Set, a dataset for training spatial reasoning capabilities. The approach preserves existing model functionality while enabling detection of sound direction, elevation, distance, and speaker attribution in real-world acoustic environments.
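As a pointer to the directional information First-Order Ambisonics channels carry, the sketch below recovers a source direction from the FOA active-intensity vector. It illustrates the input representation SPUR builds on, not SPUR's encoder or training.

```python
import numpy as np

def foa_direction_of_arrival(w, x, y, z):
    """Estimate azimuth/elevation (degrees) from W/X/Y/Z first-order Ambisonics channels
    using a broadband active-intensity approximation."""
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation

# Synthetic plane wave encoded from 45 degrees azimuth, 0 degrees elevation (SN3D).
t = np.linspace(0, 1, 16000)
s = np.sin(2 * np.pi * 440 * t)
az = np.radians(45.0)
w, x, y, z = s, s * np.cos(az), s * np.sin(az), s * 0.0
print(foa_direction_of_arrival(w, x, y, z))   # approximately (45.0, 0.0)
```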
A benchmark containing 5,305 instances with audio paired to expert-generated Q&A across speech, sound, and music. Evaluates models on 49 unique skills including long-form comprehension and spatial reasoning. Tests 22 models, revealing limitations where even state-of-the-art systems achieve only 59.2% accuracy — exposing gaps that simpler benchmarks cannot surface.
EgoIllusion investigates how audio-language models can be misled by acoustic illusions within egocentric audio-visual scenarios. Through a benchmark of auditory illusion cases in first-person video, it exposes systematic failure modes in current models that rely on shallow acoustic cues rather than genuine perceptual reasoning, and it proposes mitigation strategies and training interventions for more robust perceptual audio-visual understanding.
The first omni voice assistant benchmark, featuring 1,000 annotated dialogues that evaluate the integration of paralinguistic speech features and visual cues for multimodal understanding. MultiVox exposes significant gaps in how current voice assistants handle non-verbal speech cues and visual context simultaneously.
A fully open large audio-language model featuring the AF-Whisper unified encoder, flexible thinking capabilities, multi-turn chat, and 10-minute audio understanding. AF3 supports voice-to-voice interaction across speech, sound, and music modalities, setting new state-of-the-art performance on MMAU, Air-Bench, and other major audio understanding benchmarks.
A multi-domain audio question answering benchmark designed to evaluate acoustic content reasoning across speech, music, and environmental sound. The benchmark tests models on complex reasoning over audio content, requiring understanding of acoustic properties, temporal relationships, and domain-specific knowledge.
Proposes a methodology applying diffusion models in latent space to generate priors that are integrated into transformer-based regression for speech enhancement. The diffusion prior captures the full distribution of clean speech, enabling the regression model to produce higher-fidelity enhanced speech with state-of-the-art performance on standard benchmarks.
An audio-language model with a custom CLAP encoder and synthetic QA data, extending understanding to 30-second to 5-minute audio segments. Introduces the LongAudio dataset for long-form audio training, enabling expert-level reasoning over extended audio across speech, music, and environmental sound domains.
Addresses hallucinations in large vision-language models by grounding visual descriptions to specific image regions. The proposed approach reduces object hallucinations and improves compositional reasoning by linking textual descriptions to precise visual evidence, achieving state-of-the-art results on multiple hallucination and visual reasoning benchmarks.
A training-free approach that generates context-rich prompts incorporating sound attributes and sources for improved audio-text alignment and zero-shot audio classification. TSPE constructs task-specific prompt ensembles at inference time, requiring no additional training while achieving consistent improvements over single-prompt baselines across diverse audio classification benchmarks.
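A minimal sketch of the ensemble idea, assuming a CLAP-style audio-text encoder pair; the encoders below are random stand-ins and the attribute/source templates are illustrative, not TSPE's actual prompts.

```python
import torch
import torch.nn.functional as F

# Stand-ins for a CLAP-style audio/text encoder pair; in practice these come from
# any pretrained audio-text model (these names are placeholders, not TSPE's API).
def encode_text(prompts):             # -> (num_prompts, dim) embeddings
    return F.normalize(torch.randn(len(prompts), 512), dim=-1)

def encode_audio(batch):              # -> (batch, dim) embeddings
    return F.normalize(torch.randn(batch.shape[0], 512), dim=-1)

# Hypothetical attribute- and source-enriched prompt templates in the spirit of TSPE.
TEMPLATES = [
    "the sound of a {label}",
    "a {attribute} sound of a {label}",
    "a {label} sound coming from {source}",
]

def class_embedding(label, attribute, source):
    """Average the text embeddings of several enriched prompts into one class prototype."""
    prompts = [t.format(label=label, attribute=attribute, source=source) for t in TEMPLATES]
    return encode_text(prompts).mean(dim=0)

classes = {"dog bark": ("sharp, repetitive", "a dog"),
           "rain": ("soft, continuous", "the sky")}
prototypes = F.normalize(
    torch.stack([class_embedding(l, a, s) for l, (a, s) in classes.items()]), dim=-1)

audio = torch.randn(4, 16000)                    # dummy waveforms
scores = encode_audio(audio) @ prototypes.T      # cosine similarity per class
print(scores.argmax(dim=-1))                     # predicted class index per clip
```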
HALLUCINOGEN, a VQA benchmark that uses contextual reasoning prompts as hallucination attacks to systematically evaluate large vision-language models. The framework categorizes visual entities as salient (easily recognizable objects) or latent (requiring domain knowledge, such as identifying disease in X-rays), and evaluates eleven models including LLaMA-3.2, DeepSeek-V2, and Gemini. Results show current LVLMs remain susceptible to hallucination attacks even with existing mitigation strategies.
Enables fine-grained control over acoustic characteristics in generated audio through learned disentanglement between semantic content and acoustic features. SILA augments text-to-audio generation models with signal-level conditioning, allowing precise specification of acoustic properties — timbre, reverberation, recording environment — independently of the sound event content.
A comprehensive benchmark of 10,000 audio clips paired with expert-annotated questions spanning speech, environmental sounds, and music, requiring 27 distinct skills. MMAU tests models on perception, understanding, and reasoning far beyond simple classification, providing a standardized EvalAI leaderboard for community-wide evaluation and comparison.
Enhances audio-language representations to handle linguistic diversity through a multi-view contrastive learning objective. RobustCLAP trains on paraphrase, negation, and attribute-permuted variants of audio captions, producing significantly more robust models that maintain high retrieval and classification performance across diverse natural language phrasings.
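One plausible form of such an objective is sketched below: each audio clip treats every linguistic variant of its caption as a positive in an InfoNCE-style loss. The temperature, batching, and exact loss in RobustCLAP may differ.

```python
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(audio_emb, caption_views, temperature=0.07):
    """
    audio_emb:     (B, D) audio embeddings
    caption_views: (B, V, D) embeddings of V caption variants (paraphrase, etc.) per clip
    Every view of the matching caption counts as a positive for its audio clip.
    """
    B, V, D = caption_views.shape
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(caption_views.reshape(B * V, D), dim=-1)
    logits = a @ t.T / temperature                      # (B, B*V)
    pos = torch.zeros_like(logits, dtype=torch.bool)    # audio i matches captions i*V ... i*V+V-1
    for i in range(B):
        pos[i, i * V:(i + 1) * V] = True
    log_prob = logits - logits.logsumexp(dim=-1, keepdim=True)
    return -(log_prob[pos].reshape(B, V).mean(dim=-1)).mean()

loss = multiview_contrastive_loss(torch.randn(8, 512), torch.randn(8, 3, 512))
print(loss.item())
```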
A training and parameter-free method that improves zero-shot audio classification through enhanced audio-text alignment via mutual feedback between the audio encoder and text representations. PAT requires no additional training or parameters — it is applied directly at inference to improve alignment quality, achieving strong gains on zero-shot classification benchmarks.
Addresses limitations in generative error correction for speech recognition through the DARAG approach — combining synthetic data augmentation and retrieval-based entity handling for improved domain generalization. The method significantly improves ASR post-correction accuracy in low-resource and domain-shifted settings without requiring additional audio recordings.
A self-supervised approach using an adaptive masking strategy that progressively increases difficulty — masking easy acoustic regions first, then harder ones — for improved speech representation learning. EH-MAM consistently outperforms uniform masking baselines across downstream classification, detection, and retrieval tasks.
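The curriculum can be pictured with a schematic selection rule: rank frames by a difficulty score (here assumed to come from a teacher's per-frame reconstruction loss) and slide the masked window from the easy end toward the hard end over training. This is a sketch of the easy-to-hard idea, not EH-MAM's actual selection mechanism.

```python
import torch

def select_mask(difficulty, step, total_steps, mask_ratio=0.5):
    """
    difficulty: (T,) per-frame difficulty scores, e.g. a teacher's reconstruction loss
    (an assumption here). Early in training the easiest frames are masked; later the hardest.
    """
    T = difficulty.shape[0]
    n_mask = int(mask_ratio * T)
    order = torch.argsort(difficulty)                 # easiest -> hardest
    start = int((step / total_steps) * (T - n_mask))  # window slides toward the hard end
    return order[start:start + n_mask]                # indices of frames to mask

difficulty = torch.rand(100)
print(select_mask(difficulty, step=0, total_steps=1000)[:5])    # early: easy frames
print(select_mask(difficulty, step=999, total_steps=1000)[:5])  # late: hard frames
```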
Generates synthetic audio data through preference-optimized text-to-audio models and iterative LLM prompting. Synthio produces diverse, acoustically varied training examples that achieve up to 39% performance improvement on small-scale audio classification datasets, providing a scalable solution for low-resource audio understanding without manual data collection.
Improves zero-shot audio classification by describing sounds through caption augmentation and custom prompt generation. ReCLAP generates rich, acoustic-attribute-aware descriptions of audio events during training and inference, producing more discriminative audio-text representations that significantly improve zero-shot performance on sound classification and retrieval benchmarks.
GAMA (General-purpose Audio-Language Model with Attributes) is an instruction-following audio-language model that combines audio encoders with large language models for rich audio question answering, captioning, and reasoning. GAMA introduces attribute-based audio representation that captures fine-grained acoustic properties — timbre, rhythm, pitch, and spatial characteristics — enabling more grounded and interpretable responses than prior audio LLMs.
A framework that leverages lip movement cues to correct ASR errors in noisy environments, instructing a language model to select among N-best hypotheses conditioned on visual lip cues. LipGER achieves word error rate improvements of 1.1–49.2% across four datasets and introduces LipHyp, a large-scale dataset with hypothesis-transcription pairs equipped with lip motion cues.
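A rough sketch of the error-correction prompt such a setup implies is below; the template wording is hypothetical, and LipGER's actual lip conditioning is done through visual features injected into the language model rather than through text.

```python
def build_ger_prompt(nbest):
    """Format an N-best ASR hypothesis list into a generative error-correction instruction.
    Illustrative template only; the lip-motion conditioning itself is not shown here."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "Below are the N-best transcriptions of a noisy utterance.\n"
        f"{hyps}\n"
        "Using these hypotheses and the speaker's lip movements, "
        "write the single most likely true transcription:"
    )

print(build_ger_prompt([
    "i scream for ice cream",
    "ice cream for ice cream",
    "i scream for i scream",
]))
```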
A systematic investigation of the shortcomings of instruction tuning (IT) for adapting pre-trained LLMs into conversational agents. Through rigorous experiments, the paper identifies four key limitations: IT cannot expand model knowledge (LoRA only learns style, full fine-tuning causes knowledge loss); copying response formats degrades output quality; full-parameter fine-tuning heightens hallucinations through token borrowing; and popular IT improvement techniques fail to outperform simple LoRA baselines. The central finding is that responses generated purely from pre-trained knowledge consistently outperform models that learn new knowledge from IT.
Continued pre-training approach using teacher-student models with cross-attention for domain adaptation without catastrophic forgetting. FusDom combines in-domain and out-of-domain knowledge through a dual-stream architecture, enabling consistent performance gains across diverse acoustic environments when adapting pre-trained audio models to new domains without requiring target domain labels.
A self-distillation approach for continued SSL pre-training that adapts large pre-trained models to low-resource ASR domains. By regularizing the adaptation process with a stable teacher signal, Stable Distillation achieves 0.8–7% WER improvements over naive continued pre-training while avoiding catastrophic forgetting of the original pre-trained representations.
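A schematic of the regularized continued pre-training loop, assuming a frozen copy of the pre-adaptation encoder serves as the stable teacher; the toy encoder, placeholder SSL objective, and loss weighting are illustrative, not the paper's exact formulation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))  # toy SSL encoder
teacher = copy.deepcopy(encoder).eval()       # frozen copy of the pre-adaptation weights
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
for _ in range(10):                           # continued pre-training on target-domain audio
    feats_in = torch.randn(16, 100, 80)       # dummy target-domain features
    student_out = encoder(feats_in)
    with torch.no_grad():
        teacher_out = teacher(feats_in)
    ssl_loss = student_out.var(dim=1).mean()            # placeholder for the real SSL objective
    distill = F.mse_loss(student_out, teacher_out)      # stay close to the original representations
    (ssl_loss + 1.0 * distill).backward()
    opt.step(); opt.zero_grad()
```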
A multimodal approach combining audio and visual information for Room Impulse Response (RIR) estimation via a neural codec-based architecture. Introduces Geo-Mat features that encode material properties from visual data, and CRIP, which improves late reverberation components by 86% via image-to-RIR retrieval. Achieves 36–63% improvements over audio-only and visual-only baselines across acoustic metrics.
CompA (Compositional Evaluation of Audio-Language Models) benchmarks the compositional understanding of audio-language models through attribute binding, order sensitivity, and multi-event reasoning. It reveals that current models struggle significantly with compositional audio descriptions — failing to correctly bind attributes to the right sound sources or reason about temporal ordering of events. CompA provides both a benchmark and targeted training signal to address these compositional gaps.
A retrieval-augmented audio captioning system that conditions a GPT-2 decoder on both the input audio and similar captions retrieved from a datastore via CLAP. RECAP achieves competitive in-domain performance and significant out-of-domain improvements, capable of captioning novel audio events unseen during training. Also releases 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
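The retrieval step can be sketched as a nearest-neighbor lookup in CLAP's shared embedding space, with the retrieved captions folded into the decoder's input; the prompt prefix below is illustrative, and the embeddings are random stand-ins for real CLAP outputs.

```python
import torch
import torch.nn.functional as F

def retrieve_captions(audio_emb, caption_embs, captions, k=3):
    """Return the k datastore captions whose text embeddings are closest to the
    query clip's audio embedding (all embeddings assumed pre-computed with CLAP)."""
    sims = F.normalize(audio_emb, dim=-1) @ F.normalize(caption_embs, dim=-1).T
    topk = sims.topk(k, dim=-1).indices[0].tolist()
    return [captions[i] for i in topk]

def build_decoder_prefix(retrieved):
    """Illustrative text prefix; RECAP itself conditions the GPT-2 decoder on audio
    features as well as the retrieved captions."""
    return "Similar sounds have been described as: " + "; ".join(retrieved) + ". Caption: "

captions = ["a dog barks twice", "rain falls on a roof", "a crowd cheers loudly"]
caption_embs = torch.randn(3, 512)    # stand-ins for CLAP text embeddings
audio_emb = torch.randn(1, 512)       # stand-in for the query clip's CLAP audio embedding
print(build_decoder_prefix(retrieve_captions(audio_emb, caption_embs, captions, k=2)))
```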
An audio-visual framework that combines scene images with reverberant audio to recover clean sound via a geometry-aware cross-modal transformer architecture. AdVerb captures scene geometry and audio-visual cross-modal relationships to generate complex ideal ratio masks, achieving 18–82% relative improvements on LibriSpeech for speech enhancement, recognition, and speaker verification.
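The final masking step can be illustrated on its own: a complex ratio mask multiplies the reverberant short-time spectrum and the result is resynthesized. The random mask below merely stands in for the one AdVerb's cross-modal transformer would predict.

```python
import torch

def apply_complex_ratio_mask(reverberant, mask_real, mask_imag, n_fft=512, hop=128):
    """Apply a predicted complex ratio mask to a reverberant waveform and resynthesize."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(reverberant, n_fft, hop, window=window, return_complex=True)
    enhanced = spec * torch.complex(mask_real, mask_imag)   # complex multiply in the T-F domain
    return torch.istft(enhanced, n_fft, hop, window=window, length=reverberant.shape[-1])

wave = torch.randn(16000)                                   # dummy reverberant audio
freq_bins, frames = 512 // 2 + 1, 16000 // 128 + 1          # STFT grid for these settings
out = apply_complex_ratio_mask(wave, torch.randn(freq_bins, frames), torch.randn(freq_bins, frames))
print(out.shape)                                            # torch.Size([16000])
```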
A self-supervised learning approach that reduces reliance on labeled data for audio classification by generating pseudo-labels via clustering, then using them for self-distillation on a randomly initialized model before final fine-tuning. Unlike prior work that directly fine-tunes SSL encoders, UNFUSED decouples representation learning from task adaptation, achieving state-of-the-art on the LAPE Benchmark with a 40% reduction in parameters over the previous best system.
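A compressed sketch of the two-stage recipe, assuming pooled SSL features, k-means pseudo-labels, and a small randomly initialized student; the architecture and losses here are placeholders rather than UNFUSED's exact configuration.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Step 1: pseudo-label unlabeled clips by clustering features from a pre-trained SSL encoder.
ssl_feats = torch.randn(500, 768).numpy()            # stand-ins for pooled SSL-encoder features
pseudo_labels = KMeans(n_clusters=10, n_init=10).fit_predict(ssl_feats)

# Step 2: distill the pseudo-labels into a randomly initialized student.
student = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x = torch.tensor(ssl_feats)
y = torch.tensor(pseudo_labels, dtype=torch.long)
for _ in range(5):
    loss = nn.functional.cross_entropy(student(x), y)
    loss.backward(); opt.step(); opt.zero_grad()

# Step 3 (not shown): fine-tune the distilled student on the labeled downstream task.
```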
A self-supervised audio learning approach combining clustering and contrastive learning with a symmetric loss for audio encoder pre-training. SLICER jointly optimizes instance-level discrimination and cluster-level consistency, producing representations that transfer well across environmental sound, music, and speech classification tasks with limited labeled data.
An audio encoder incorporating multiscale feature hierarchies into the Audio Spectrogram Transformer, achieving a 3.4% accuracy gain over single-scale baselines. MAST captures both fine-grained acoustic detail and broad spectral structure through parallel multi-resolution attention paths, producing richer representations for downstream audio classification and retrieval.
A multimodal network for speech emotion recognition using early-fusion and cross-modal self-attention between text and acoustic modalities, trained jointly on three auxiliary tasks for richer supervision. MMER achieves state-of-the-art results on the IEMOCAP benchmark by jointly leveraging acoustic and linguistic signals.
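A generic cross-modal attention block of the kind the description implies is sketched below, with text tokens attending over acoustic frames; the dimensions, normalization, and fusion choices are assumptions, not MMER's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Minimal cross-modal attention: text tokens query the acoustic frame sequence."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, audio_feats):
        fused, _ = self.attn(query=text_feats, key=audio_feats, value=audio_feats)
        return self.norm(text_feats + fused)       # residual fusion of the two streams

text = torch.randn(2, 20, 256)     # token-level text features, projected to a shared dim
audio = torch.randn(2, 120, 256)   # frame-level acoustic features, projected to a shared dim
print(CrossModalAttention()(text, audio).shape)    # (2, 20, 256), fed to the emotion classifier
```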