MLinfo | Machine learning and AI paper digest

MLinfo | Catch up on and search daily-updated technology

Search results for "audio"

77 results


github | GitHub available | 2026-06-10

screenpipe — YC (S26) | AI that knows what you've seen, said, or heard. Records everything you do, say, hear 24/7, local, private, secure

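A tool that recognizes user activity and uses it to build automated agents.

Natural language processing, Large language models, Text, Audio, Multimodal

Use: Building automated agents
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09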

transformers — 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

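🤗 Transformers is a framework that supports model definitions for text, vision, audio, and more, and can be used for both inference and training.

Deep learning, Transformer, Classification, Text, Audio

Use: Machine learning model definition
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09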

mediapipe — Cross-platform, customizable ML solutions for live and streaming media.

MediaPipe provides cross-platform, customizable ML solutions for live and streaming media.

MLOps, Model deployment, Audio, Video

Use: ML solutions for live and streaming media
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09

diffusers — 🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.

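A library of diffusion models, usable for image, video, and audio generation.

Generative AI, Diffusion models, Generation, Image, Text

Use: Image, video, and audio generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09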

datasets — 🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

A platform that provides datasets for AI.

Deep learning, Model compression and quantization, Audio

Use: Dataset hub
Difficulty: Easy
Cost: Medium

github | GitHub available | 2026-06-09

openvino — OpenVINO™ is an open source toolkit for optimizing and deploying AI inference

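An open-source toolkit for optimizing and deploying AI inference.

Deep learning, Transformer, Classification, Generation, Audio

Use: Optimizing and deploying AI inference
Difficulty: Easy
Cost: Low

github | GitHub available | 2026-06-09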

unsloth — Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.

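Unsloth Studio is a web UI that supports training and running open models. It is used to test and train open models such as Gemma 4 and Qwen3.6.

Natural language processing, Large language models, Text, Audio

Use: Training and running open models
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09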

pruna — Pruna is a model optimization framework built for developers, enabling you to deliver faster, more efficient models with minimal overhead.

A model optimization framework for developers that makes models faster and more efficient.

Deep learning, Transformer, Classification, Audio

Use: Model optimization
Difficulty: Easy
Cost: Low

github | GitHub available | 2026-06-09

FunASR — Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.

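A speech recognition toolkit with streaming support, speaker diarization, and emotion detection across 50+ languages.

Deep learning, Transformer, Classification, Detection, Text

Use: Speech recognition
Difficulty: Easy
Cost: Low

github | GitHub available | 2026-06-09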

compromise — modest natural-language processing

A lightweight JavaScript library for natural-language processing.

Natural language processing, Classification, Audio

Use: Simplifying natural language processing
Difficulty: Easy
Cost: Low

github | GitHub available | 2026-06-09

TextBlob — Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

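A library for text analysis, including sentiment analysis and word segmentation.

Natural language processing, Text, Audio

Use: Text analysis
Difficulty: Easy
Cost: Medium

github | GitHub available | 2026-06-09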

modelscope — ModelScope: bring the notion of Model-as-a-Service to life.

ModelScope is a platform for offering models as a service. It lets you create, host, manage, and serve models.

Natural language processing, Audio

Use: Serving models as a service
Difficulty: Easy
Cost: Medium

arxiv | Paper only | 2026-06-08

Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism

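This work uses Tensor Parallelism and Fully Sharded Data Parallelism to remove the GPU memory limits that constrain conventional verification architectures, so that neural network verification…

Deep learning, CNN, Text, Audio

Use: Neural network verification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08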

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

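This work builds an omni-modal retrieval system that unifies data from different modalities such as text, image, video, and audio.

Natural language processing, Fine-tuning, Regression, Retrieval, Image

Use: Omni-modal retrieval
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08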

Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention

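Toward early detection of Parkinson's disease (PD), this paper proposes diagnosing the disease through speech analysis, examining the speech impairments that appear before onset.

Sensor/time series, Deep learning, Transformer, Detection, Generation, Embedding

Use: Early detection of Parkinson's disease
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-08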

Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training

In the task of few-shot class-incremental audio classification, the number of classes is assumed to always inc

Suited to small data, Natural language processing, RAG, Classification, Audio

Use: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation

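This paper proposes a new framework for improving speech separation, which raises separation accuracy.

Suited to tabular data, Quality prediction/anomaly detection, Natural language processing, RAG, Generation, Audio

Use: Improving speech separation
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-06-08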

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

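A SpeechLLM is proposed to automate speech assessment, evaluating speech quality and proficiency.

Explainable, Natural language processing, Large language models, Audio

Use: L2 speech assessment
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08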

Real-time body pose non-verbal communication with a consistency-based reliability measure

Body movement communicates intent at distances and in conditions where neither the face, nor speech can be cap

Machine learning, Unsupervised learning, Classification, Prediction, Text

Use: Classification
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-06-08

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

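A new framework for the inverse design of acoustic metamaterials that accounts for variable bandwidths: Physics-Guided Sequence-Based Generative Framework for Acoustic Metama…

Sensor/time series, Deep learning, Transformer, Generation, Image, Audio

Use: Inverse design of acoustic metamaterials with variable bandwidths
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08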

End-to-End Training for Discrete Token LLM based TTS System

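Proposes a TTS system trained end to end and confirms the benefits of end-to-end training.

Natural language processing, Large language models, Classification, Generation, Text

Use: Proposing a TTS system with end-to-end training
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08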

BareWave: Waveform-Native Flow-Matching Text-to-Speech

Removing intermediate representations and separately trained decoding stages has become an important direction

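Sensor/time series, Quality prediction/anomaly detection, Deep learning, Model compression and quantization, Generation, Text, Audio

Use: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08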

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

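Codec-based autoregressive (AR) token generators that model audio tokens and text have improved text-to-speech quality. However, in this approach the audio token sequence is longer than the text sequence, so AR…

Quality prediction/anomaly detection, Deep learning, Model compression and quantization, Text, Audio

Use: More efficient speech generation through audio token compression
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08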

Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration

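This work develops an open-source framework for promoting diversity in sound generation. To support that goal, the framework combines evolutionary processes with diversity-promoting algorithms…

Suited to MI, Quality prediction/anomaly detection, Natural language processing, Fine-tuning, Classification, Generation, Text

Use: Promoting diversity in sound generation
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-06-08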

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

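This work develops a framework for silent speech synthesis. The framework can improve silent speech synthesis and its accuracy.

Sensor/time series, Natural language processing, RAG, Generation, Audio, Video

Use: Silent speech synthesis
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08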

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved

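Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Text, Audio

Use: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08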

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

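To improve the speech recognition of multilingual ASR models such as Whisper on Dravidian languages, the authors use datasets and linguistic analysis to fine-tune the model, resolve decoder inconsistencies, and reduce recognition errors.

Sensor/time series, Deep learning, Transformer, Text, Audio

Use: Improving speech recognition for Dravidian languages
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-06-08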

Toward Signing Activity Projection in Sign Language Interaction

Social robots must interact robustly not only with users assumed by speech-centered systems but also with dive

Deep learning, Transformer, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-06-08

Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs

Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating c

Sensor/time series, Deep learning, Transformer, Classification, Text, Audio

Use: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China.

Sensor/time series, Deep learning, Transformer, Classification, Image, Text

Use: Classification
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-06-08

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video g

Quality prediction/anomaly detection, Natural language processing, RAG, Generation, Text, Audio

Use: Generation
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-08

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist

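Quality prediction/anomaly detection, Natural language processing, RAG, Image, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

github | GitHub available | 2026-06-08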

VoxCPM — VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

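A framework for tokenizer-free TTS, covering multilingual speech generation, creative voice design, and true-to-life cloning.

Generative AI, Speech and music generation, Generation, Text, Audio

Use: Multilingual speech generation
Difficulty: Easy
Cost: Medium

arxiv | Paper only | 2026-06-07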

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

We present a voice conversion (VC) framework that utilizes K-Nearest Neighbors (KNN) retrieval over WavLM repr

Natural language processing, Prompt engineering, Audio, Supervised

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

TeamHerald@CHIPSAL 2026: Hate Speech Detection and Sentiment Analysis of Nepali Memes using Transformer-based Architectures and Ensemble Learning

The analysis of internet memes in the Nepali language is complicated by frequent code-mixing and a lack of est

Deep learning, Transformer, Classification, Detection, Image

Use: Classification
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-06-07

Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing

Suited to tabular data, Natural language processing, RAG, Classification, Detection, Generation

Use: Classification
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-06-07

Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition

Speech emotion recognition (SER) is commonly formulated as utterance-level classification, although conversati

Sensor/time series, Natural language processing, Large language models, Classification, Text, Audio

Use: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

RAILS: Verification-Native Clearing For Agentic Commerce

Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whethe

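Quality prediction/anomaly detection, Natural language processing, Large language models, Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07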

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task

We present HydraQE, our contribution to the IWSLT 2026 Speech Translation Metrics shared task. HydraQE is an e

Quality prediction/anomaly detection, Deep learning, Transformer, Translation, Text, Audio

Use: Translation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

TRADE: Transducer-Augmented Decoder for Speech LLM

Speech Large Language Models (Speech LLMs) lack a principled mechanism for streaming inference: their label-sy

Sensor/time series, Deep learning, Transformer, Classification, Detection, Generation

Use: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints

Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deploym

Sensor/time series, Natural language processing, Prompt engineering, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-06-07

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

Diffusion and continuous flow-based language models have emerged as the leading non-autoregressive alternative

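Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Text, Audio

Use: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07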

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing

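Quality prediction/anomaly detection, Natural language processing, Large language models, Image, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

github | GitHub available | 2026-06-07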

SimpleTuner — A general fine-tuning kit geared toward image/video/audio diffusion models.

A general-purpose fine-tuning kit for image, video, and audio diffusion models.

Natural language processing, Fine-tuning, Image, Audio, Video

Use: Fine-tuning diffusion models
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-06-06

Beyond Additivity: Causal Discovery in Location-Scale Noise Models with Hidden Variables

We study causal discovery from observational data when some variables are hidden and the data-generating proce

Deep learning, Transformer, Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-06-06

Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion

Automated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability

Explainable, Sensor/time series, Deep learning, Graph neural networks, Detection, Text, Audio

Use: Detection
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-06-06

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified fra

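Sensor/time series, Natural language processing, Large language models, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-06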

VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation

Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but

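Suited to MI, Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Image, Text

Use: Generation
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-06-05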

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generate

Natural language processing, Fine-tuning, Detection, Audio

Use: Detection
Difficulty: Easy
Cost: Medium

huggingface | Hugging Face available | 2026-06-05

Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path

Understanding what generative models retain from training data remains challenging, with implications for copy

Machine learning, Feature engineering, Generation, Image, Audio

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-05

MMAE: A Massive Multitask Audio Editing Benchmark

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation

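Suited to MI, Natural language processing, Large language models, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-05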

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research move

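Deep learning, Model compression and quantization, Image, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-05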

dots.tts Technical Report

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that model

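Sensor/time series, Quality prediction/anomaly detection, Deep learning, Model compression and quantization, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-05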

Entropy as a Structural Prior: How a Log-Barrier on DiT Belief Space Drives Musical Diversity and Development

Confidence-based loss weighting is usually avoided in generative models because it accelerates errors when the

Sensor/time series, Natural language processing, Fine-tuning, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-05

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

Harmony is a compact symbolic layer where mathematical pitch relations, acoustic consonance, and musical conve

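Explainable, Sensor/time series, Quality prediction/anomaly detection, Deep learning, Transformer, Classification, Text, Audio

Use: Classification
Difficulty: Easy
Cost: Low

arxiv | Paper only | 2026-06-04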

Towards Realistic 3D Sonar Simulation

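This work proposes a modular design that improves 3D sonar simulation by accounting for real acoustic phenomena.

Sensor/time series, Computer vision, 3D/point clouds, Audio, 3D

Use: Improving 3D sonar simulation
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-06-04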

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

Automatic Speech Recognition (ASR) has become a key technology for human--AI interaction. However, code-switch

Natural language processing, Fine-tuning, Classification, Generation, Audio

Use: Classification
Difficulty: Easy
Cost: Low

github | GitHub available | 2026-06-04

Irodori-TTS — A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control

Performs text-to-speech with Emoji-driven Style Control, turning emotionally marked text into expressive speech.

Generative AI, Diffusion models, Generation, Text, Audio

Use: Text-to-speech conversion
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-06-03

Global Sketch-Based Watermarking for Diffusion Language Models

Watermarking methods for language models have been studied extensively in the autoregressive setting, where to

Computer vision, Segmentation, Detection, Generation, Text

Use: Detection
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-06-03

Audio Interaction Model

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and

Reinforcement learning, Multi-agent, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-03

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unre

Natural language processing, Large language models, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-06-01

When Tabular Foundation Models Transfer Across Modalities: A Systematic Evaluation Across 95 Datasets, 7 Modalities, and Two Regimes

We present a single classification pipeline that combines an Equiangular Tight Frame (ETF) preprocessing stage

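Suited to tabular data, Sensor/time series, Quality prediction/anomaly detection, Deep learning, Model compression and quantization, Classification, Text, Audio

Use: Classification
Difficulty: Hard
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-01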

Cosmos 3: Omnimodal World Models for Physical AI

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, i

Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-05-31

Spiking and Event-driven Neuromorphic Mamba Models for Efficient Speech Recognition

Deep learning has greatly advanced automatic speech recognition (ASR), enabling widespread deployment on edge

Deep learning, Model compression and quantization, Classification, Audio

Use: Classification
Difficulty: Hard
Cost: Low

huggingface | Hugging Face available | 2026-05-30

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genui

Deep learning, Transformer, Classification, QA, Image

Use: Classification
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-05-29

OpenSTBench: Beyond Semantic Evaluation for Speech Translation

Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (

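Quality prediction/anomaly detection, Computer vision, Video recognition, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-05-29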

Score-Control for Hallucination Reduction in Diffusion Models

Diffusion models have emerged as the backbone of modern generative AI, powering advances in vision, language,

Computer vision, Segmentation, Generation, Image, Audio

Use: Generation
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-05-28

Deep Binarized Photonic Reservoir Computing for Ultrafast Multimedia Signal Processing

We present a deep photonic neural network architecture based on ultrafast binary optical modulation from a dig

Sensor/time series, Computer vision, Video recognition, Classification, Detection, Image

Use: Classification
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-05-28

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Automatic speech recognition (ASR) is a core component of human--computer interaction and an increasingly impo

Natural language processing, Large language models, Classification, Audio

Use: Classification
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-28

Multimodal Music Recommendation System using LLMs

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction hist

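Sensor/time series, Quality prediction/anomaly detection, Deep learning, Transformer, Text, Audio, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-28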

Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging

Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale th

Deep learning, Model compression and quantization, Audio

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

github | GitHub available | 2026-05-28

openFrameworks — openFrameworks is a community-developed cross platform toolkit for creative coding in C++.

openFrameworks is a cross-platform toolkit built in C++ and used for creative coding. It makes it easier to run programs on a variety of devices.

Computer vision, Audio, Video

Use: Cross-platform toolkit for creative coding
Difficulty: Easy
Cost: High

github | GitHub available | 2026-05-25

Matcha-TTS — [ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching

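Matcha-TTS is a fast TTS architecture based on conditional flow matching that takes speaker characteristics into account.

Generative AI, Diffusion models, Text, Audio

Use: TTS architecture design
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-05-15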

Scalable neuromorphic computing from autonomous spiking dynamics in a clockless reconfigurable chip

We propose a scalable neuromorphic architecture based on spiking dynamics emerging from the autonomous time-co

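Deep learning, Model compression and quantization, Classification, Audio

Use: Improving spiking computation accuracy
Difficulty: Hard
Cost: Low

github | GitHub available | 2026-05-13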

maths-cs-ai-compendium — Become a cracked AI/ML Research Engineer

Presents methods for building the skills and knowledge of AI/ML research engineers.

Computer vision, Multimodal, Text, Audio

Use: Training AI/ML researchers
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-05-10

Encoding and Decoding Temporal Signals with Spiking Bandpass Wavelets

Spike-based encodings are sparse and energy-efficient, but have largely been formulated probabilistically, dis

Deep learning, Model compression and quantization, Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

huggingface | Hugging Face available | 2026-05-04

Liberating LLM Capabilities in Full-Duplex Speech Models

Speech-based large language models are typically constrained to spoken replies, which limits their user-facing

Natural language processing, Large language models, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High