An investigation of how audio-visual language models incorrectly infer sounds from visual context in egocentric video. An evaluation framework is constructed using 300 egocentric videos and 1,000 sound-focused questions, with a taxonomy distinguishing foreground action sounds from background ambient sounds. State-of-the-art models such as Qwen2.5-Omni achieve only 27.3% and 39.5% accuracy on foreground and background sound questions, respectively, highlighting critical gaps in reliable multimodal egocentric understanding.
A text-conditioned video-to-music generation model that enables fast, high-quality, semantically and stylistically controllable background music creation. Video-Robin uses an autoregressive module to align visual and textual inputs for high-level musical planning, followed by Diffusion Transformers that refine the plan into high-quality audio. The dual-component architecture achieves superior performance over existing approaches on standard and challenging test sets with significantly faster inference speeds.
The next-generation model in the Audio Flamingo series, advancing understanding and reasoning over speech, environmental sounds, and music. AF-Next introduces a stronger foundational audio-language model, scalable strategies for constructing large-scale audio understanding and reasoning data, support for up to 30-minute audio inputs, and temporal reasoning that grounds intermediate steps to timestamps. Trained on over 1 million hours of data with curriculum-based training, AF-Next achieves competitive performance with both open and larger proprietary models across 20 benchmarks.
Investigates how audio-visual large language models process multimodal inputs through mechanistic interpretability techniques. The study reveals a strong vision bias in audio understanding — models hallucinate sounds from what they see rather than what they hear. Attention analysis and causal interventions show that while audio representations contain meaningful semantic information internally, the visual modality dominates text generation, particularly in deeper layers where cross-modal integration occurs.
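The causal-intervention idea can be pictured with a small sketch: mean-ablate one modality's token states before a layer and compare how much the output shifts. The toy layer, token layout, and ablation choice below are assumptions for illustration only, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

# Toy stand-in for one layer of an audio-visual LLM; the real model, layer depth,
# and token layout are assumptions made purely for illustration.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True).eval()

seq = torch.randn(1, 30, 64)          # [text | audio | visual] tokens, hypothetical layout
audio_idx = slice(10, 20)             # assumed positions of audio tokens
visual_idx = slice(20, 30)            # assumed positions of visual tokens

def mean_ablate(hidden, idx):
    """Replace one modality's token states with their mean, removing token-specific content."""
    patched = hidden.clone()
    patched[:, idx] = hidden[:, idx].mean(dim=1, keepdim=True)
    return patched

with torch.no_grad():
    base = layer(seq)
    no_audio = layer(mean_ablate(seq, audio_idx))
    no_visual = layer(mean_ablate(seq, visual_idx))

# If ablating visual tokens perturbs the output far more than ablating audio tokens,
# the computation feeding text generation is dominated by the visual stream.
print("delta when audio ablated :", (base - no_audio).norm().item())
print("delta when visual ablated:", (base - no_visual).norm().item())
```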
Introduces AHA-Eval, an attack suite of 6,500 QA pairs that probes whether large audio language models genuinely ground their responses in the audio input rather than relying on spurious correlations. The suite targets two attack surfaces: query-based attacks that manipulate question structure and audio-based attacks using synthetic speech. Testing reveals alarming vulnerability rates of 95.35% and 79.65% on Audio Flamingo 3 and Gemini Pro, respectively. A mitigation dataset is also proposed, reducing attack success rates by up to 49%.
A benchmark for evaluating multimodal large language models on their capacity to jointly reason over visual, audio, and textual signals in long and complex real-world videos. MMOU comprises 15,000 questions paired with 9,038 videos across 13 skill categories. The best proprietary model achieves only 64.2% accuracy while leading open-source models reach 46.8%, revealing fundamental gaps in omni-modal reasoning over extended video content.
Addresses hallucinations and temporal inconsistencies in audio captioning by generating temporally grounded descriptions of complex acoustic scenes. TAC disentangles overlapping sound events and assigns precise timestamps to each, producing more accurate and temporally consistent captions across audio and audio-visual understanding benchmarks.
A dataset and benchmark for joint audio-visual understanding of egocentric videos. A data generation pipeline using token-based video filtering and modular graph-based curation produces 3 million audio-visual training samples. Existing MLLMs are shown to bias heavily toward visual signals, often neglecting audio cues. Fine-tuning on the resulting dataset yields up to 113% performance improvement on the EgoAVU benchmark and up to 28% relative gains on other egocentric video benchmarks.
An audio-language model designed for deep music understanding, addressing music's dynamic, layered, and information-dense nature. Introduces MF-Skills, a curated dataset spanning harmony, structure, timbre, lyrics, and cultural context, alongside chain-of-thought training (MF-Think) and reinforcement learning. Achieves state-of-the-art results on 10+ music understanding benchmarks, advancing from surface-level recognition toward layered, human-like perception of songs.
A lightweight plug-and-play framework for endowing large audio-language models with spatial audio understanding and reasoning. SPUR incorporates a First-Order Ambisonics encoder for processing directional audio channels and introduces SPUR-Set, a dataset for training spatial reasoning capabilities. The approach preserves existing model functionality while enabling detection of sound direction, elevation, distance, and speaker attribution in real-world acoustic environments.
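As a pointer to the directional information First-Order Ambisonics channels carry, the sketch below recovers a source direction from the FOA active-intensity vector. It illustrates the input representation SPUR builds on, not SPUR's encoder or training.

```python
import numpy as np

def foa_direction_of_arrival(w, x, y, z):
    """Estimate azimuth/elevation (degrees) from W/X/Y/Z first-order Ambisonics channels
    using a broadband active-intensity approximation."""
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation

# Synthetic plane wave encoded from 45 degrees azimuth, 0 degrees elevation (SN3D).
t = np.linspace(0, 1, 16000)
s = np.sin(2 * np.pi * 440 * t)
az = np.radians(45.0)
w, x, y, z = s, s * np.cos(az), s * np.sin(az), s * 0.0
print(foa_direction_of_arrival(w, x, y, z))   # approximately (45.0, 0.0)
```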
A benchmark containing 5,305 instances with audio paired to expert-generated Q&A across speech, sound, and music. Evaluates models on 49 unique skills including long-form comprehension and spatial reasoning. Tests 22 models, revealing limitations where even state-of-the-art systems achieve only 59.2% accuracy — exposing gaps that simpler benchmarks cannot surface.
EgoIllusion investigates how audio-language models can be misled by acoustic illusions within egocentric audio-visual scenarios. Through a benchmark of auditory illusion cases in first-person video, it exposes systematic failure modes in current models that rely on shallow acoustic cues rather than genuine perceptual reasoning, and it proposes mitigation strategies and training interventions for more robust perceptual audio-visual understanding.
The first omni voice assistant benchmark, featuring 1,000 annotated dialogues that evaluate the integration of paralinguistic speech features and visual cues for multimodal understanding. MultiVox exposes significant gaps in how current voice assistants handle non-verbal speech cues and visual context simultaneously.
A fully open large audio-language model featuring the AF-Whisper unified encoder, flexible thinking capabilities, multi-turn chat, and 10-minute audio understanding. AF3 supports voice-to-voice interaction across speech, sound, and music modalities, setting new state-of-the-art performance on MMAU, Air-Bench, and other major audio understanding benchmarks.
A multi-domain audio question answering benchmark designed to evaluate acoustic content reasoning across speech, music, and environmental sound. The benchmark tests models on complex reasoning over audio content, requiring understanding of acoustic properties, temporal relationships, and domain-specific knowledge.
Proposes a methodology applying diffusion models in latent space to generate priors that are integrated into transformer-based regression for speech enhancement. The diffusion prior captures the full distribution of clean speech, enabling the regression model to produce higher-fidelity enhanced speech with state-of-the-art performance on standard benchmarks.
An audio-language model with a custom CLAP encoder and synthetic QA data, extending understanding to 30-second to 5-minute audio segments. Introduces the LongAudio dataset for long-form audio training, enabling expert-level reasoning over extended audio across speech, music, and environmental sound domains.
Addresses hallucinations in large vision-language models by grounding visual descriptions to specific image regions. The proposed approach reduces object hallucinations and improves compositional reasoning by linking textual descriptions to precise visual evidence, achieving state-of-the-art results on multiple hallucination and visual reasoning benchmarks.
A training-free approach that generates context-rich prompts incorporating sound attributes and sources for improved audio-text alignment and zero-shot audio classification. TSPE constructs task-specific prompt ensembles at inference time, requiring no additional training while achieving consistent improvements over single-prompt baselines across diverse audio classification benchmarks.
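A minimal sketch of the ensemble idea, assuming a CLAP-style audio-text encoder pair; the encoders below are random stand-ins and the attribute/source templates are illustrative, not TSPE's actual prompts.

```python
import torch
import torch.nn.functional as F

# Stand-ins for a CLAP-style audio/text encoder pair; in practice these come from
# any pretrained audio-text model (these names are placeholders, not TSPE's API).
def encode_text(prompts):             # -> (num_prompts, dim) embeddings
    return F.normalize(torch.randn(len(prompts), 512), dim=-1)

def encode_audio(batch):              # -> (batch, dim) embeddings
    return F.normalize(torch.randn(batch.shape[0], 512), dim=-1)

# Hypothetical attribute- and source-enriched prompt templates in the spirit of TSPE.
TEMPLATES = [
    "the sound of a {label}",
    "a {attribute} sound of a {label}",
    "a {label} sound coming from {source}",
]

def class_embedding(label, attribute, source):
    """Average the text embeddings of several enriched prompts into one class prototype."""
    prompts = [t.format(label=label, attribute=attribute, source=source) for t in TEMPLATES]
    return encode_text(prompts).mean(dim=0)

classes = {"dog bark": ("sharp, repetitive", "a dog"),
           "rain": ("soft, continuous", "the sky")}
prototypes = F.normalize(
    torch.stack([class_embedding(l, a, s) for l, (a, s) in classes.items()]), dim=-1)

audio = torch.randn(4, 16000)                    # dummy waveforms
scores = encode_audio(audio) @ prototypes.T      # cosine similarity per class
print(scores.argmax(dim=-1))                     # predicted class index per clip
```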
HALLUCINOGEN, a VQA benchmark that uses contextual reasoning prompts as hallucination attacks to systematically evaluate large vision-language models. The framework categorizes visual entities as salient (easily recognizable objects) or latent (requiring domain knowledge, such as identifying disease in X-rays), and evaluates eleven models including LLaMA-3.2, DeepSeek-V2, and Gemini. Results show current LVLMs remain susceptible to hallucination attacks even with existing mitigation strategies.
Enables fine-grained control over acoustic characteristics in generated audio through learned disentanglement between semantic content and acoustic features. SILA augments text-to-audio generation models with signal-level conditioning, allowing precise specification of acoustic properties — timbre, reverberation, recording environment — independently of the sound event content.
A comprehensive benchmark of 10,000 audio clips paired with expert-annotated questions spanning speech, environmental sounds, and music, requiring 27 distinct skills. MMAU tests models on perception, understanding, and reasoning far beyond simple classification, providing a standardized EvalAI leaderboard for community-wide evaluation and comparison.
Enhances audio-language representations to handle linguistic diversity through a multi-view contrastive learning objective. RobustCLAP trains on paraphrase, negation, and attribute-permuted variants of audio captions, producing significantly more robust models that maintain high retrieval and classification performance across diverse natural language phrasings.
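One plausible form of such an objective is sketched below: each audio clip treats every linguistic variant of its caption as a positive in an InfoNCE-style loss. The temperature, batching, and exact loss in RobustCLAP may differ.

```python
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(audio_emb, caption_views, temperature=0.07):
    """
    audio_emb:     (B, D) audio embeddings
    caption_views: (B, V, D) embeddings of V caption variants (paraphrase, etc.) per clip
    Every view of the matching caption counts as a positive for its audio clip.
    """
    B, V, D = caption_views.shape
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(caption_views.reshape(B * V, D), dim=-1)
    logits = a @ t.T / temperature                      # (B, B*V)
    pos = torch.zeros_like(logits, dtype=torch.bool)    # audio i matches captions i*V ... i*V+V-1
    for i in range(B):
        pos[i, i * V:(i + 1) * V] = True
    log_prob = logits - logits.logsumexp(dim=-1, keepdim=True)
    return -(log_prob[pos].reshape(B, V).mean(dim=-1)).mean()

loss = multiview_contrastive_loss(torch.randn(8, 512), torch.randn(8, 3, 512))
print(loss.item())
```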
A training and parameter-free method that improves zero-shot audio classification through enhanced audio-text alignment via mutual feedback between the audio encoder and text representations. PAT requires no additional training or parameters — it is applied directly at inference to improve alignment quality, achieving strong gains on zero-shot classification benchmarks.
Addresses limitations in generative error correction for speech recognition through the DARAG approach — combining synthetic data augmentation and retrieval-based entity handling for improved domain generalization. The method significantly improves ASR post-correction accuracy in low-resource and domain-shifted settings without requiring additional audio recordings.
A self-supervised approach using an adaptive masking strategy that progressively increases difficulty — masking easy acoustic regions first, then harder ones — for improved speech representation learning. EH-MAM consistently outperforms uniform masking baselines across downstream classification, detection, and retrieval tasks.
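The curriculum can be pictured with a schematic selection rule: rank frames by a difficulty score (here assumed to come from a teacher's per-frame reconstruction loss) and slide the masked window from the easy end toward the hard end over training. This is a sketch of the easy-to-hard idea, not EH-MAM's actual selection mechanism.

```python
import torch

def select_mask(difficulty, step, total_steps, mask_ratio=0.5):
    """
    difficulty: (T,) per-frame difficulty scores, e.g. a teacher's reconstruction loss
    (an assumption here). Early in training the easiest frames are masked; later the hardest.
    """
    T = difficulty.shape[0]
    n_mask = int(mask_ratio * T)
    order = torch.argsort(difficulty)                 # easiest -> hardest
    start = int((step / total_steps) * (T - n_mask))  # window slides toward the hard end
    return order[start:start + n_mask]                # indices of frames to mask

difficulty = torch.rand(100)
print(select_mask(difficulty, step=0, total_steps=1000)[:5])    # early: easy frames
print(select_mask(difficulty, step=999, total_steps=1000)[:5])  # late: hard frames
```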
Generates synthetic audio data through preference-optimized text-to-audio models and iterative LLM prompting. Synthio produces diverse, acoustically varied training examples that achieve up to 39% performance improvement on small-scale audio classification datasets, providing a scalable solution for low-resource audio understanding without manual data collection.
Improves zero-shot audio classification by describing sounds through caption augmentation and custom prompt generation. ReCLAP generates rich, acoustic-attribute-aware descriptions of audio events during training and inference, producing more discriminative audio-text representations that significantly improve zero-shot performance on sound classification and retrieval benchmarks.
GAMA (General-purpose Audio-Language Model with Attributes) is an instruction-following audio-language model that combines audio encoders with large language models for rich audio question answering, captioning, and reasoning. GAMA introduces attribute-based audio representation that captures fine-grained acoustic properties — timbre, rhythm, pitch, and spatial characteristics — enabling more grounded and interpretable responses than prior audio LLMs.
A framework that leverages lip movement cues to correct ASR errors in noisy environments, instructing a language model to select among N-best hypotheses conditioned on visual lip cues. LipGER achieves word error rate improvements of 1.1–49.2% across four datasets and introduces LipHyp, a large-scale dataset with hypothesis-transcription pairs equipped with lip motion cues.
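A rough sketch of the error-correction prompt such a setup implies is below; the template wording is hypothetical, and LipGER's actual lip conditioning is done through visual features injected into the language model rather than through text.

```python
def build_ger_prompt(nbest):
    """Format an N-best ASR hypothesis list into a generative error-correction instruction.
    Illustrative template only; the lip-motion conditioning itself is not shown here."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "Below are the N-best transcriptions of a noisy utterance.\n"
        f"{hyps}\n"
        "Using these hypotheses and the speaker's lip movements, "
        "write the single most likely true transcription:"
    )

print(build_ger_prompt([
    "i scream for ice cream",
    "ice cream for ice cream",
    "i scream for i scream",
]))
```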
A systematic investigation of the shortcomings of instruction tuning (IT) for adapting pre-trained LLMs into conversational agents. Through rigorous experiments, the paper identifies four key limitations: IT cannot expand model knowledge (LoRA only learns style, full fine-tuning causes knowledge loss); copying response formats degrades output quality; full-parameter fine-tuning heightens hallucinations through token borrowing; and popular IT improvement techniques fail to outperform simple LoRA baselines. The central finding is that responses generated purely from pre-trained knowledge consistently outperform models that learn new knowledge from IT.
Continued pre-training approach using teacher-student models with cross-attention for domain adaptation without catastrophic forgetting. FusDom combines in-domain and out-of-domain knowledge through a dual-stream architecture, enabling consistent performance gains across diverse acoustic environments when adapting pre-trained audio models to new domains without requiring target domain labels.
A self-distillation approach for continued SSL pre-training that adapts large pre-trained models to low-resource ASR domains. By regularizing the adaptation process with a stable teacher signal, Stable Distillation achieves 0.8–7% WER improvements over naive continued pre-training while avoiding catastrophic forgetting of the original pre-trained representations.
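A schematic of the regularized continued pre-training loop, assuming a frozen copy of the pre-adaptation encoder serves as the stable teacher; the toy encoder, placeholder SSL objective, and loss weighting are illustrative, not the paper's exact formulation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))  # toy SSL encoder
teacher = copy.deepcopy(encoder).eval()       # frozen copy of the pre-adaptation weights
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
for _ in range(10):                           # continued pre-training on target-domain audio
    feats_in = torch.randn(16, 100, 80)       # dummy target-domain features
    student_out = encoder(feats_in)
    with torch.no_grad():
        teacher_out = teacher(feats_in)
    ssl_loss = student_out.var(dim=1).mean()            # placeholder for the real SSL objective
    distill = F.mse_loss(student_out, teacher_out)      # stay close to the original representations
    (ssl_loss + 1.0 * distill).backward()
    opt.step(); opt.zero_grad()
```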
A multimodal approach combining audio and visual information for Room Impulse Response (RIR) estimation via a neural codec-based architecture. Introduces Geo-Mat features that encode material properties from visual data, and CRIP, which improves late reverberation components by 86% via image-to-RIR retrieval. Achieves 36–63% improvements over audio-only and visual-only baselines across acoustic metrics.
CompA (Compositional Evaluation of Audio-Language Models) benchmarks the compositional understanding of audio-language models through attribute binding, order sensitivity, and multi-event reasoning. It reveals that current models struggle significantly with compositional audio descriptions — failing to correctly bind attributes to the right sound sources or reason about temporal ordering of events. CompA provides both a benchmark and targeted training signal to address these compositional gaps.
A retrieval-augmented audio captioning system that conditions a GPT-2 decoder on both the input audio and similar captions retrieved from a datastore via CLAP. RECAP achieves competitive in-domain performance and significant out-of-domain improvements, capable of captioning novel audio events unseen during training. Also releases 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
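The retrieval step can be sketched as a nearest-neighbor lookup in CLAP's shared embedding space, with the retrieved captions folded into the decoder's input; the prompt prefix below is illustrative, and the embeddings are random stand-ins for real CLAP outputs.

```python
import torch
import torch.nn.functional as F

def retrieve_captions(audio_emb, caption_embs, captions, k=3):
    """Return the k datastore captions whose text embeddings are closest to the
    query clip's audio embedding (all embeddings assumed pre-computed with CLAP)."""
    sims = F.normalize(audio_emb, dim=-1) @ F.normalize(caption_embs, dim=-1).T
    topk = sims.topk(k, dim=-1).indices[0].tolist()
    return [captions[i] for i in topk]

def build_decoder_prefix(retrieved):
    """Illustrative text prefix; RECAP itself conditions the GPT-2 decoder on audio
    features as well as the retrieved captions."""
    return "Similar sounds have been described as: " + "; ".join(retrieved) + ". Caption: "

captions = ["a dog barks twice", "rain falls on a roof", "a crowd cheers loudly"]
caption_embs = torch.randn(3, 512)    # stand-ins for CLAP text embeddings
audio_emb = torch.randn(1, 512)       # stand-in for the query clip's CLAP audio embedding
print(build_decoder_prefix(retrieve_captions(audio_emb, caption_embs, captions, k=2)))
```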
An audio-visual framework that combines scene images with reverberant audio to recover clean sound via a geometry-aware cross-modal transformer architecture. AdVerb captures scene geometry and audio-visual cross-modal relationships to generate complex ideal ratio masks, achieving 18–82% relative improvements on LibriSpeech for speech enhancement, recognition, and speaker verification.
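The final masking step can be illustrated on its own: a complex ratio mask multiplies the reverberant short-time spectrum and the result is resynthesized. The random mask below merely stands in for the one AdVerb's cross-modal transformer would predict.

```python
import torch

def apply_complex_ratio_mask(reverberant, mask_real, mask_imag, n_fft=512, hop=128):
    """Apply a predicted complex ratio mask to a reverberant waveform and resynthesize."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(reverberant, n_fft, hop, window=window, return_complex=True)
    enhanced = spec * torch.complex(mask_real, mask_imag)   # complex multiply in the T-F domain
    return torch.istft(enhanced, n_fft, hop, window=window, length=reverberant.shape[-1])

wave = torch.randn(16000)                                   # dummy reverberant audio
freq_bins, frames = 512 // 2 + 1, 16000 // 128 + 1          # STFT grid for these settings
out = apply_complex_ratio_mask(wave, torch.randn(freq_bins, frames), torch.randn(freq_bins, frames))
print(out.shape)                                            # torch.Size([16000])
```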
A self-supervised learning approach that reduces reliance on labeled data for audio classification by generating pseudo-labels via clustering, then using them for self-distillation on a randomly initialized model before final fine-tuning. Unlike prior work that directly fine-tunes SSL encoders, UNFUSED decouples representation learning from task adaptation, achieving state-of-the-art on the LAPE Benchmark with a 40% reduction in parameters over the previous best system.
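A compressed sketch of the two-stage recipe, assuming pooled SSL features, k-means pseudo-labels, and a small randomly initialized student; the architecture and losses here are placeholders rather than UNFUSED's exact configuration.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Step 1: pseudo-label unlabeled clips by clustering features from a pre-trained SSL encoder.
ssl_feats = torch.randn(500, 768).numpy()            # stand-ins for pooled SSL-encoder features
pseudo_labels = KMeans(n_clusters=10, n_init=10).fit_predict(ssl_feats)

# Step 2: distill the pseudo-labels into a randomly initialized student.
student = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x = torch.tensor(ssl_feats)
y = torch.tensor(pseudo_labels, dtype=torch.long)
for _ in range(5):
    loss = nn.functional.cross_entropy(student(x), y)
    loss.backward(); opt.step(); opt.zero_grad()

# Step 3 (not shown): fine-tune the distilled student on the labeled downstream task.
```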
A self-supervised audio learning approach combining clustering and contrastive learning with a symmetric loss for audio encoder pre-training. SLICER jointly optimizes instance-level discrimination and cluster-level consistency, producing representations that transfer well across environmental sound, music, and speech classification tasks with limited labeled data.
An audio encoder incorporating multiscale feature hierarchies into the Audio Spectrogram Transformer, achieving a 3.4% accuracy gain over single-scale baselines. MAST captures both fine-grained acoustic detail and broad spectral structure through parallel multi-resolution attention paths, producing richer representations for downstream audio classification and retrieval.
A multimodal network for speech emotion recognition using early-fusion and cross-modal self-attention between text and acoustic modalities, trained jointly on three auxiliary tasks for richer supervision. MMER achieves state-of-the-art results on the IEMOCAP benchmark by jointly leveraging acoustic and linguistic signals.
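A generic cross-modal attention block of the kind the description implies is sketched below, with text tokens attending over acoustic frames; the dimensions, normalization, and fusion choices are assumptions, not MMER's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Minimal cross-modal attention: text tokens query the acoustic frame sequence."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, audio_feats):
        fused, _ = self.attn(query=text_feats, key=audio_feats, value=audio_feats)
        return self.norm(text_feats + fused)       # residual fusion of the two streams

text = torch.randn(2, 20, 256)     # token-level text features, projected to a shared dim
audio = torch.randn(2, 120, 256)   # frame-level acoustic features, projected to a shared dim
print(CrossModalAttention()(text, audio).shape)    # (2, 20, 256), fed to the emotion classifier
```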