Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders
Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generate
- Use case: Detection
- Difficulty: Easy
- Cost: Medium
Search results for "audio"
18 results
Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generate
Understanding what generative models retain from training data remains challenging, with implications for copy
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation
Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research move
We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that model
Confidence-based loss weighting is usually avoided in generative models because it accelerates errors when the
Harmony is a compact symbolic layer where mathematical pitch relations, acoustic consonance, and musical conve
Automatic Speech Recognition (ASR) has become a key technology for human-AI interaction. However, code-switch
Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and
Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unre
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, i
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genui
Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (
Diffusion models have emerged as the backbone of modern generative AI, powering advances in vision, language,
Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly impo
Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction hist
Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale th
Speech-based large language models are typically constrained to spoken replies, which limits their user-facing