MLinfo | Machine Learning and AI Paper Digest

MLinfo | Keep up with and search technology that is updated daily

Search results for "audio"

39 results

github | GitHub available | 2026-06-10

screenpipe — YC (S26) | AI that knows what you've seen, said, or heard. Records everything you do, say, hear 24/7, local, private, secure

A tool that recognizes user activity and uses it to build autonomous agents.

Natural language processing, Large language models, Text, Audio, Multimodal

Use: Building autonomous agents
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09

transformers — 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

🤗 Transformers is a model-definition framework that supports text, vision, audio, and multimodal models and can be used for both inference and training.

Deep learning, Transformer, Classification, Text, Audio

Use: Machine learning model definition
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09

mediapipe — Cross-platform, customizable ML solutions for live and streaming media.

MediaPipe provides cross-platform, customizable ML solutions for live and streaming media.

MLOps, Model deployment, Audio, Video

Use: ML solutions for live and streaming media
Difficulty: Easy
Cost: High

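github | GitHub available | 2026-06-09

diffusers — 🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.

A library of diffusion models that can be used for image, video, and audio generation.

Generative AI, Diffusion models, Generation, Image, Text

Use: Image, video, and audio generation
Difficulty: Easy
Cost: High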

github | GitHub available | 2026-06-09

datasets — 🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

A platform that provides ready-to-use datasets for AI models.

Deep learning, Compression/quantization, Audio

Use: Dataset hub
Difficulty: Easy
Cost: Medium

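github | GitHub available | 2026-06-09

openvino — OpenVINO™ is an open source toolkit for optimizing and deploying AI inference

An open source toolkit for optimizing and deploying AI inference.

Deep learning, Transformer, Classification, Generation, Audio

Use: Optimizing and deploying AI inference
Difficulty: Easy
Cost: Low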

github | GitHub available | 2026-06-09

unsloth — Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.

Unsloth Studio is a web UI for training and running open models locally. It is used to test and train open models such as Gemma 4 and Qwen3.6.

Natural language processing, Large language models, Text, Audio

Use: Training and running open models
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09

pruna — Pruna is a model optimization framework built for developers, enabling you to deliver faster, more efficient models with minimal overhead.

A model optimization framework for developers that makes models faster and more efficient.

Deep learning, Transformer, Classification, Audio

Use: Model optimization
Difficulty: Easy
Cost: Low

github | GitHub available | 2026-06-09

FunASR — Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.

An industrial-grade speech recognition toolkit with streaming, speaker diarization, and emotion detection.

Deep learning, Transformer, Classification, Detection, Text

Use: Speech recognition
Difficulty: Easy
Cost: Low

github | GitHub available | 2026-06-09

compromise — modest natural-language processing

A JavaScript library for modest, lightweight natural-language processing.

Natural language processing, Classification, Audio

Use: Simplifying natural language processing
Difficulty: Easy
Cost: Low

github | GitHub available | 2026-06-09

TextBlob — Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

A library for text analysis tasks such as sentiment analysis and tokenization.

Natural language processing, Text, Audio

Use: Text analysis
Difficulty: Easy
Cost: Medium

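github | GitHub available | 2026-06-09

modelscope — ModelScope: bring the notion of Model-as-a-Service to life.

ModelScope is a platform for offering models as a service. It lets you create, host, manage, and serve models.

Natural language processing, Audio

Use: Offering models as a service
Difficulty: Easy
Cost: Medium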

arxiv | GitHub available | 2026-06-08

Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training

In the task of few-shot class-incremental audio classification, the number of classes is assumed to always inc

Few-data friendly, Natural language processing, RAG, Classification, Audio

Use: Classification
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-08

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist

Quality prediction/anomaly detection, Natural language processing, RAG, Image, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

github | GitHub available | 2026-06-08

VoxCPM — VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

A framework for tokenizer-free TTS, covering multilingual speech generation, creative voice design, and true-to-life voice cloning.

Generative AI, Speech/music generation, Generation, Text, Audio

Use: Multilingual speech generation
Difficulty: Easy
Cost: Medium

github | GitHub available | 2026-06-07

SimpleTuner — A general fine-tuning kit geared toward image/video/audio diffusion models.

A general-purpose kit for fine-tuning image, video, and audio diffusion models.

Natural language processing, Fine-tuning, Image, Audio, Video

Use: Fine-tuning diffusion models
Difficulty: Easy
Cost: High

arxiv | GitHub available | 2026-06-06

VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation

Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but

MI-oriented, Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Image, Text

Use: Generation
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-06-05

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generate

Natural language processing, Fine-tuning, Detection, Audio

Use: Detection
Difficulty: Easy
Cost: Medium

huggingface | Hugging Face available | 2026-06-05

Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path

Understanding what generative models retain from training data remains challenging, with implications for copy

Machine learning, Feature engineering, Generation, Image, Audio

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-05

MMAE: A Massive Multitask Audio Editing Benchmark

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation

MI-oriented, Natural language processing, Large language models, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-05

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research move

Deep learning, Compression/quantization, Image, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

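huggingface | Hugging Face available | 2026-06-05

dots.tts Technical Report

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that model

Sensor/time series, Quality prediction/anomaly detection, Deep learning, Compression/quantization, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High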

huggingface | Hugging Face available | 2026-06-05

Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development

Confidence-based loss weighting is usually avoided in generative models because it accelerates errors when the

Sensor/time series, Natural language processing, Fine-tuning, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-05

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

Harmony is a compact symbolic layer where mathematical pitch relations, acoustic consonance, and musical conve

Explainable, Sensor/time series, Quality prediction/anomaly detection, Deep learning, Transformer, Classification, Text, Audio

Use: Classification
Difficulty: Easy
Cost: Low

huggingface | Hugging Face available | 2026-06-04

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

Automatic Speech Recognition (ASR) has become a key technology for human-AI interaction. However, code-switch

Natural language processing, Fine-tuning, Classification, Generation, Audio

Use: Classification
Difficulty: Easy
Cost: Low

github | GitHub available | 2026-06-04

Irodori-TTS — A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control

A flow matching-based text-to-speech model that uses emoji-driven style control to turn text into expressive speech.

Generative AI, Diffusion models, Generation, Text, Audio

Use: Text-to-speech conversion
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-03

Audio Interaction Model

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and

Reinforcement learning, Multi-agent, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-03

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unre

Natural language processing, Large language models, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-01

Cosmos 3: Omnimodal World Models for Physical AI

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, i

Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-30

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genui

Deep learning, Transformer, Classification, QA, Image

Use: Classification
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-05-29

OpenSTBench: Beyond Semantic Evaluation for Speech Translation

Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (

Quality prediction/anomaly detection, Computer vision, Video recognition, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-05-29

Score-Control for Hallucination Reduction in Diffusion Models

Diffusion models have emerged as the backbone of modern generative AI, powering advances in vision, language,

Computer vision, Segmentation, Generation, Image, Audio

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-28

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly impo

Natural language processing, Large language models, Classification, Audio

Use: Classification
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-28

Multimodal Music Recommendation System using LLMs

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction hist

Sensor/time series, Quality prediction/anomaly detection, Deep learning, Transformer, Text, Audio, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-28

Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging

Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale th

Deep learning, Compression/quantization, Audio

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

github | GitHub available | 2026-05-28

openFrameworks — openFrameworks is a community-developed cross platform toolkit for creative coding in C++.

openFrameworks is a cross-platform toolkit written in C++ for creative coding. It makes it easier to run programs on a variety of devices.

Computer vision, Audio, Video

Use: Cross-platform toolkit for creative coding
Difficulty: Easy
Cost: High

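github | GitHub available | 2026-05-25

Matcha-TTS — [ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching

Matcha-TTS is a fast TTS architecture built on conditional flow matching that takes speaker characteristics into account.

Generative AI, Diffusion models, Text, Audio

Use: TTS architecture design
Difficulty: Easy
Cost: High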

github | GitHub available | 2026-05-13

maths-cs-ai-compendium — Become a cracked AI/ML Research Engineer

"Become a cracked AI/ML Research Engineer" presents methods for building the skills and knowledge of an AI/ML researcher.

Computer vision, Multimodal, Text, Audio

Use: Training AI/ML researchers
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-04

Liberating LLM Capabilities in Full-Duplex Speech Models

Speech-based large language models are typically constrained to spoken replies, which limits their user-facing

Natural language processing, Large language models, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High
