MLinfo | Machine Learning and AI Paper Digest

MLinfo | Catch up on and search daily-updated technology

Search results for "audio"

6 results

huggingface | On Hugging Face | 2026-07-20

FlashRT: Agent Harness for Guiding Agents to Deploy Real-Time Multimodal Applications

Real-time multimodal applications, including voice agents and interactive video generation, compose heterogene

Tags: Deep learning, Model compression and quantization, Generation, Text, Audio

Use case: Generation
Difficulty: Easy
Cost: High

huggingface | On Hugging Face | 2026-07-17

Audio-Visual Flamingo: Open Audio-Visual Intelligence for Long and Complex Videos

We present Audio-Visual Flamingo (AV-Flamingo), a fully open state-of-the-art audio-visual large language mode

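Tags: Explainability, Natural language processing, Large language models, Image, Text, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High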

huggingface | On Hugging Face | 2026-07-14

ReflectWorld-MM: An Entity-Oriented Multimodal Memory System for Open-Ended Video Streams

Building assistants that can continually watch the world, remember what they see, and reason over their accumu

Tags: Computer vision, Multimodal, Image, Text, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

huggingface | On Hugging Face | 2026-07-13

Qwen-Music Technical Report

In this report, we introduce Qwen-Music, a powerful music generation model capable of producing highly musical

Tags: Sensor/time series, Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Text, Audio

Use case: Generation
Difficulty: Easy
Cost: High

huggingface | On Hugging Face | 2026-07-11

GigaChat Audio: Time-aware Large Audio Language Model

Temporal grounding in long recordings remains challenging for audio-conditioned LLMs. We present a time-aware

Tags: Natural language processing, Large language models, Text, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

huggingface | On Hugging Face | 2026-07-11

GigaAM Multilingual: Foundation Model for Underrepresented Languages

Despite recent scaling successes, multilingual ASR performance remains highly uneven, with long-tail languages

Tags: Deep learning, Transformer, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High