MLinfo | Machine Learning and AI Paper Summaries

Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism

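This study uses Tensor Parallelism and Fully Sharded Data Parallelism to remove the GPU-memory constraints of conventional verification architectures, so that verification of machine learning networks...

Deep learning, CNN, Text, Audio

Use case: Verification of prediction networks
Difficulty: Hard
Cost: High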

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

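This study builds an omni-modal retrieval system that integrates data from different modalities such as text, images, video, and audio.

Natural language processing, Fine-tuning, Regression, Retrieval, Image

Use case: Omni-modal retrieval
Difficulty: Hard
Cost: High

Sensor/time series, Deep learning, Transformer, Detection, Generation, Embedding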

Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention

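As a step toward early detection of Parkinson's disease (PD), this work proposes diagnosing PD through voice analysis, targeting the speech impairments caused by brain damage that appear before the onset of the disease.

Use case: Early detection of Parkinson's disease
Difficulty: Hard
Cost: High

arXiv | GitHub available | 2026-06-08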

Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training

In the task of few-shot class-incremental audio classification, the number of classes is assumed to always inc

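Suited to small data, Natural language processing, RAG, Classification, Audio

Use case: Classification
Difficulty: Hard
Cost: High

Suited to tabular data, Quality prediction/anomaly detection, Natural language processing, RAG, Generation, Audio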

MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation

This paper proposes a new framework for improving speaker separation, which raises separation accuracy.

Use case: Improving speaker separation
Difficulty: Hard
Cost: Low

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

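A SpeechLLM is proposed to automate speech assessment, evaluating speech quality and proficiency.

Explainable, Natural language processing, Large language models, Audio

Use case: L2 speech assessment
Difficulty: Hard
Cost: High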

Real-time body pose non-verbal communication with a consistency-based reliability measure

Body movement communicates intent at distances and in conditions where neither the face, nor speech can be cap

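Machine learning, Unsupervised learning, Classification, Prediction, Text

Use case: Classification
Difficulty: Hard
Cost: Low

Sensor/time series, Deep learning, Transformer, Generation, Image, Audio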

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

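A new framework for the inverse design of acoustic metamaterials that accounts for variable bandwidth, the Physics-Guided Sequence-Based Generative Framework for Acoustic Metama...

Use case: Inverse design of acoustic metamaterials that accounts for variable bandwidth
Difficulty: Hard
Cost: High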

End-to-End Training for Discrete Token LLM based TTS System

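Proposes a TTS system trained end to end and confirms the benefits of end-to-end training.

Natural language processing, Large language models, Classification, Generation, Text

Use case: A TTS system based on end-to-end training
Difficulty: Hard
Cost: High

Sensor/time series, Quality prediction/anomaly detection, Deep learning, Model compression and quantization, Generation, Text, Audio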

BareWave: Waveform-Native Flow-Matching Text-to-Speech

Removing intermediate representations and separately trained decoding stages has become an important direction

Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Model compression and quantization, Text, Audio

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

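Codec-based autoregressive (AR) token generators that model audio tokens together with text have greatly improved text-to-speech quality. In this approach, however, the audio token sequence is longer than the text sequence, so AR...

Use case: More efficient speech generation through audio token compression
Difficulty: Hard
Cost: High

Suited to MI, Quality prediction/anomaly detection, Natural language processing, Fine-tuning, Classification, Generation, Text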

Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration

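This study develops an open-source framework for promoting diversity in music generation. The framework combines evolutionary processes with diversity-promoting algorithms...

Use case: Promoting diversity in music generation
Difficulty: Hard
Cost: Low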

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

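This study develops a framework for silent speech synthesis. The framework can improve silent speech synthesis and its accuracy.

Sensor/time series, Natural language processing, RAG, Generation, Audio, Video

Use case: Silent speech synthesis
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Text, Audio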

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved

Use case: Generation
Difficulty: Hard
Cost: High

Sensor/time series, Deep learning, Transformer, Text, Audio

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

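To improve the recognition performance of multilingual ASR models such as Whisper on Dravidian languages, the authors use datasets and linguistic analysis to fine-tune the model, resolve decoder inconsistencies, and reduce recognition errors.

Use case: Improving speech recognition for Dravidian languages
Difficulty: Hard
Cost: Medium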

Toward Signing Activity Projection in Sign Language Interaction

Social robots must interact robustly not only with users assumed by speech-centered systems but also with dive

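Deep learning, Transformer, Text, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

Sensor/time series, Deep learning, Transformer, Classification, Text, Audio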

Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs

Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating c

Use case: Classification
Difficulty: Hard
Cost: High

Sensor/time series, Deep learning, Transformer, Classification, Image, Text

NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China.

Use case: Classification
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection, Natural language processing, RAG, Generation, Text, Audio

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video g

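Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, RAG, Image, Text, Audio

arXiv | GitHub available | 2026-06-08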

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Natural language processing, Prompt engineering, Audio, Supervised

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

We present a voice conversion (VC) framework that utilizes K-Nearest Neighbors (KNN) retrieval over WavLM repr

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

TeamHerald@CHIPSAL 2026: Hate Speech Detection and Sentiment Analysis of Nepali Memes using Transformer-based Architectures and Ensemble Learning

The analysis of internet memes in the Nepali language is complicated by frequent code-mixing and a lack of est

Deep learning, Transformer, Classification, Detection, Image

Use case: Classification
Difficulty: Hard
Cost: Low

Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing

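Suited to tabular data, Natural language processing, RAG, Classification, Detection, Generation

Use case: Classification
Difficulty: Hard
Cost: Low

Sensor/time series, Natural language processing, Large language models, Classification, Text, Audio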

Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition

Speech emotion recognition (SER) is commonly formulated as utterance-level classification, although conversati

Use case: Classification
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, Large language models, Audio

RAILS: Verification-Native Clearing For Agentic Commerce

Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whethe

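Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Translation, Text, Audio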

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task

We present HydraQE, our contribution to the IWSLT 2026 Speech Translation Metrics shared task. HydraQE is an e

Use case: Translation
Difficulty: Hard
Cost: High

Sensor/time series, Deep learning, Transformer, Classification, Detection, Generation

TRADE: Transducer-Augmented Decoder for Speech LLM

Speech Large Language Models (Speech LLMs) lack a principled mechanism for streaming inference: their label-sy

Use case: Classification
Difficulty: Hard
Cost: High

Sensor/time series, Natural language processing, Prompt engineering, Text, Audio

TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints

Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deploym

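Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Text, Audio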

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

Diffusion and continuous flow-based language models have emerged as the leading non-autoregressive alternative

Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, Large language models, Image, Text, Audio

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arXiv | Paper only | 2026-06-06

Beyond Additivity: Causal Discovery in Location-Scale Noise Models with Hidden Variables

We study causal discovery from observational data when some variables are hidden and the data-generating proce

Deep learning, Transformer, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

arXiv | Paper only | 2026-06-06

Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion

Automated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability

Explainable, Sensor/time series, Deep learning, Graph neural networks, Detection, Text, Audio

Use case: Detection
Difficulty: Hard
Cost: Medium

arXiv | Paper only | 2026-06-06

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified fra

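Sensor/time series, Natural language processing, Large language models, Text, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Suited to MI, Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Image, Text

arXiv | GitHub available | 2026-06-06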

VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation

Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but

Use case: Generation
Difficulty: Hard
Cost: High

arXiv | Paper only | 2026-06-04

Towards Realistic 3D Sonar Simulation

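This study proposes a modular architecture that improves 3D sonar simulation by accounting for real acoustic phenomena.

Sensor/time series, Computer vision, 3D and point clouds, Audio, 3D

Use case: Improving 3D sonar simulation
Difficulty: Hard
Cost: High

arXiv | Paper only | 2026-06-03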

Global Sketch-Based Watermarking for Diffusion Language Models

Watermarking methods for language models have been studied extensively in the autoregressive setting, where to

Computer vision, Segmentation, Detection, Generation, Text

Use case: Detection
Difficulty: Hard
Cost: High

arXiv | Paper only | 2026-06-01

When Tabular Foundation Models Transfer Across Modalities: A Systematic Evaluation Across 95 Datasets, 7 Modalities, and Two Regimes

We present a single classification pipeline that combines an Equiangular Tight Frame (ETF) preprocessing stage

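Suited to tabular data, Sensor/time series, Quality prediction/anomaly detection, Deep learning, Model compression and quantization, Classification, Text, Audio

Use case: Classification
Difficulty: Hard
Cost: High

arXiv | Paper only | 2026-05-31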

Spiking and Event-driven Neuromorphic Mamba Models for Efficient Speech Recognition

Deep learning has greatly advanced automatic speech recognition (ASR), enabling widespread deployment on edge

Deep learning, Model compression and quantization, Classification, Audio

Use case: Classification
Difficulty: Hard
Cost: Low

arXiv | Paper only | 2026-05-28

Deep Binarized Photonic Reservoir Computing for Ultrafast Multimedia Signal Processing

We present a deep photonic neural network architecture based on ultrafast binary optical modulation from a dig

Sensor/time series, Computer vision, Video recognition, Classification, Detection, Image

Use case: Classification
Difficulty: Hard
Cost: High

arXiv | Paper only | 2026-05-15

Scalable neuromorphic computing from autonomous spiking dynamics in a clockless reconfigurable chip

We propose a scalable neuromorphic architecture based on spiking dynamics emerging from the autonomous time-co

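Deep learning, Model compression and quantization, Classification, Audio

Use case: Improving spiking computation accuracy
Difficulty: Hard
Cost: Low

arXiv | Paper only | 2026-05-10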

Encoding and Decoding Temporal Signals with Spiking Bandpass Wavelets

Spike-based encodings are sparse and energy-efficient, but have largely been formulated probabilistically, dis

Deep learning, Model compression and quantization, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium