MLinfo | Machine Learning and AI Paper Digest

Sensor/Time series | Quality prediction/Anomaly detection | Deep learning | Compression/Quantization | Text | Audio | Multimodal

X$^3$-OPD: Distilling Reasoning into Large Audio-Language Models via On-Policy Alignment

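A new approach for reasoning techniques that use large language models, X$^3$-OPD (Distilling Reasoning into Large Audio-Language Models via On-Policy

Use: Developing reasoning techniques with large language models
Difficulty: Hard
Cost: High

Sensor/Time series | NLP | LLM | Classification | Detection | Embedding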

Toward Generalizable Cognitive Impairment Detection with Speech-Based Multimodal Large Language Models

Cognitive impairment (CI) is a growing public health concern. Early and accurate diagnosis is critical for ena

Use: Classification
Difficulty: Hard
Cost: High

Phonetic forced alignment for low-resource language varieties: Model training and evaluation on Chengdu Mandarin

Phonetic forced alignment is a key technique in phonetic research, yet existing alignment systems lack special

NLP | RAG | Classification | Text | Audio

Use: Classification
Difficulty: Hard
Cost: High

Toward cryptographically verifiable authorization for autonomous AI agents: A security hypothesis, preliminary formal model, and proof-of-concept implementation

Autonomous AI agents increasingly execute actions, invoke tools, and operate on protected resources with limit

MLOps | Model deployment | Text | Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

Differentiable Logic Programming to Mitigate Reasoning Shortcuts in Neurosymbolic Systems

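The paper points out that neurosymbolic systems are prone to reasoning shortcuts (achieving the task without satisfying the constraints) and proposes a constraint-satisfaction-based method to mitigate them. In this method, instead of satisfying the constraints, the model

Explainable | MLOps | Model deployment | Audio

Use: Mitigating reasoning shortcuts in neurosymbolic systems
Difficulty: Hard
Cost: Medium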

Safeguards for Speech2Speech LLM-Assistants: A Case Study in Automotive Applications

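S2S (speech-to-speech) LLM assistants can speak in a human-like manner, but implementing safeguards for them is difficult. This study implements safeguards for S2S LLM assistants using two approaches

NLP | LLM | Text | Audio

Use: Safeguards for S2S LLM assistants
Difficulty: Hard
Cost: High

Quality prediction/Anomaly detection | Deep learning | Transformer | Generation | Text | Audio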

Faster IndexTTS-2: Accelerating and Streaming Autoregressive Zero-Shot Text-to-Speech Synthesis on GPUs

Autoregressive text-to-speech models achieve strong naturalness but suffer from slow inference due to sequenti

Use: Generation
Difficulty: Hard
Cost: High

OPOD: On-Policy Omni Distillation

Omni-modal models can handle text, images, and audio in one system, but improving all of these abilities toget

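Deep learning | Compression/Quantization | Image | Text | Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Sensor/Time series | Deep learning | Transformer | Classification | Text | Audio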

DONDO: Open w2v-BERT Speech-Recognition Base Models for African Languages

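This paper builds DONDO, a set of automatic speech recognition (ASR) base models for African languages. The models are built on w2v-BERT 2.0, a self-supervised speech encoder. This encoder

Use: Applying speech recognition technology to African languages
Difficulty: Hard
Cost: Low

Sensor/Time series | NLP | LLM | Generation | Text | Audio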

An Evaluation Framework for Structured Audio Captions Validated by Controlled Perturbations

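This paper proposes an evaluation method for audio captions, aiming to overcome the limitations of existing evaluation methods. The proposed framework evaluates each aspect of an audio caption and, rather than using question-answering-style evaluation, assesses the neutrality of the caption

Use: Building an evaluation framework for audio captions
Difficulty: Hard
Cost: High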

Word meaning co-determines vowel-inherent spectral change. A corpus-based investigation of conversational Mandarin

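This paper investigates the relationship between word meaning and vowel characteristics in conversational Mandarin. As a result, word

NLP | Embedding/Retrieval | Text | Audio

Use: Relationship between word meaning and vowel characteristics in conversational Mandarin
Difficulty: Hard
Cost: Low

Easy to try on CPU | Sensor/Time series | Deep learning | Compression/Quantization | Classification | Text | Audio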

VibeVoice-ASR-BitNet Technical Report

We present VibeVoice-ASR-BitNet, a compressed variant of VibeVoice-ASR optimized for real-time inference on ed

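Use: Classification
Difficulty: Hard
Cost: High

Quality prediction/Anomaly detection | Deep learning | Normalization/Optimization methods | Classification | Image | Text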

Quality-Aware Multimodal Fusion Reveals Implicit Identity in Valence-Arousal Features

Conventional face recognition relies on static appearance cues and degrades in unconstrained settings with exp

Use: Classification
Difficulty: Hard
Cost: High

Out of Sight, Still in Mind: Token Compression for Omni-LLMs

The goal of this paper is to reduce the input token cost of Omni-modal large language models (Omni-LLMs) at in

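NLP | LLM | Image | Text | Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Suited to small data | Sensor/Time series | Deep learning | Compression/Quantization | Audio | Semi-supervised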

Latent Variable-Mediated Cross-Learning for Few-Shot Acoustic Impedance Imaging

Acoustic impedance imaging is a fundamental yet severely ill-posed problem in subsurface analysis: the seismic

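Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

MI-oriented | Quality prediction/Anomaly detection | Deep learning | Transformer | Classification | Image | Text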

Sidewalk Moments: Are Richer Representations Always More Human-Aligned? Evidence from City-Walk Videos

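This study analyzes city-walk videos using representations from four modalities (spatiotemporal information, temporally averaged images, audio encodings, and text-based representations).

Use: Analysis of city-walk videos
Difficulty: Hard
Cost: High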

RL-MACRO: A Cybernetic Closed-Loop Intelligence Framework for Multimodal Adaptive Robotic Craniotomy

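To automate craniotomy surgery, the paper proposes a cybernetic closed-loop framework composed of multiple modules. The framework works through the interplay between tool and tissue, and for tool-tissue interactions

Sensor/Time series | Deep learning | CNN | Audio | Multimodal

Use: Automating craniotomy surgery
Difficulty: Hard
Cost: High

Sensor/Time series | Deep learning | Transformer | Detection | Image | Audio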

Human-Inspired Framework for Robotic Craniotomy: Integrating Multimodal Fusion and Adaptive Trajectory Adjustment

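Proposes a framework for craniotomy surgery that mimics human intelligence. The framework combines forward planning with backward execution and automatically adjusts the position of the operating table during surgery, achieving a procedure as safe and efficient as a human's.

Use: Automating craniotomy surgery
Difficulty: Hard
Cost: High

Explainable | Quality prediction/Anomaly detection | Reinforcement learning | Policy gradient (PPO / A3C) | Classification | Audio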

Explanation-Based Runtime Verification for Trustworthy ML-driven Optical Networks

Machine learning (ML) models are increasingly integrated into optical network automation frameworks to support

Use: Classification
Difficulty: Hard
Cost: Low

Cumsum-Composable Phase Transport for Low-Cost Streaming Keyword Spotting

A study on streaming keyword spotting that proposes an approach based on Cumsum-Composable Phase Transport.

Sensor/Time series | Deep learning | CNN | Audio

Use: Streaming keyword spotting
Difficulty: Hard
Cost: High

The Giant Hippocampus: From Structural Monoculture to a System of Systems

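This study examines the system-level structure of the brain in order to bridge the fields of artificial intelligence research and neuroscience, and proposes a new approach derived from that work.

Deep learning | Transformer | Classification | Image | Text

Use: The brain's system structure and its applications
Difficulty: Hard
Cost: High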

Local Causal Structure Learning in the Presence of Latent Variables and Selection Bias

Discovering the direct causes and effects of a target variable from observational data is a fundamental proble

Computer vision | Segmentation | Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

Convergence-Latency-Aware Adaptive Modulation and Resource Allocation in RIS-Assisted Wireless Federated Learning

Federated learning (FL) over wireless networks suffers from significant training latency and degraded converge

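Math/Theory | Optimization | Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Explainable | Sensor/Time series | Quality prediction/Anomaly detection | NLP | RAG | Classification | Audio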

Spatially Grounded Concept Bottleneck Models for Trustworthy Breast Ultrasound Diagnosis

Concept Bottleneck Models provide interpretable-by-design predictions by mediating diagnosis through human-und

Use: Classification
Difficulty: Hard
Cost: Low

Sound Probabilistic Safety Bounds for Large Language Models

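Proposes a new framework for computing sound probabilistic safety bounds on harmful generations from large language models (LLMs). As a novel application of Clopper-Pearson confidence intervals, to obtain PAC (probably approximately correct) bounds, the

Deep learning | Compression/Quantization | Generation | Text | Audio

Use: Reducing the risk of harmful generations
Difficulty: Hard
Cost: High

Quality prediction/Anomaly detection | Deep learning | Compression/Quantization | Generation | Text | Audio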

Pushing the Frontier of Full-Song Generation: Hierarchical Autoregressive Planning Meets Flow-Matching Rendering

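Presents a song generation framework that supports three tasks. Drawing on lyrics, text descriptions, and musical attributes, these tasks include lyric generation, band music generation, and cover song generation.

Use: Music generation
Difficulty: Hard
Cost: High

Sensor/Time series | MLOps | Model deployment | Text | Audio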

Audio-Zero: Label-Free Self-Evolution for Fine-Grained Audio Reasoning

Large Audio Language models (LALMs) have made rapid progress on acoustic understanding, yet they still struggl

Use: Fine-grained
Difficulty: Hard
Cost: High

Efficient Chain-of-Modality Reasoning via Progressive Compression for Spoken Language Models

Spoken language models (SLMs) enable natural human-computer interaction, but their reasoning ability still lag

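Deep learning | Compression/Quantization | QA | Text | Audio

Use: QA
Difficulty: Hard
Cost: High

Explainable | Easy to try on CPU | Deep learning | Compression/Quantization | Classification | Text | Audio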

Lightweight Person-Place Relation Extraction from Historical Newspapers with Dependency Graphs and Proximity Features

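Extracting relations between persons and places is important for interpreting historical newspaper articles. Conventional methods required language model preprocessing, whereas the lightweight algorithm uses dependency graphs and proximity features to extract from historical newspaper articles the person

Use: Extracting person-place relations from historical newspapers
Difficulty: Hard
Cost: High

MI-oriented | Quality prediction/Anomaly detection | NLP | LLM | Generation | Image | Text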

ETPDesigner: Multi-Agent Orchestration for Interactive Multimodal Electronic Theater Program

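ETPDesigner is a framework that automates the design of multimodal electronic theater programs.

Use: Generation
Difficulty: Hard
Cost: High

Explainable | Sensor/Time series | Computer vision | Segmentation | Audio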

Domain Shift in Echocardiography: Interpretable Quantification and Prediction of Cross-Dataset Left Ventricular Segmentation

Cross-dataset generalisation remains a major barrier to clinical deployment of echocardiographic left ventricu

Use: Segmentation
Difficulty: Hard
Cost: Medium

Sensor/Time series | Reinforcement learning | Policy gradient (PPO / A3C) | Detection | Audio

Distributed Acoustic Localization Array Deployed Using a Soft Everting Vine Robot

Soft robot exteroception is increasingly being explored for a variety of field applications. In this work, we

Use: Detection
Difficulty: Hard
Cost: Medium

MeetingToM: Evaluating Multimodal LLMs on Theory-of-Mind Reasoning in Multi-Party Meetings

Theory of Mind (ToM), the ability to infer others' beliefs, intentions, and states of knowledge, is central to

NLP | LLM | QA | Text | Audio

Use: QA
Difficulty: Hard
Cost: High

Benchmarking Human and Automatic Speech Recognition of Diverse Speech: Initial Results

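Human hearing is regarded as the gold standard, and the speech recognition field is still waiting for systems that surpass human listening ability. Such systems have not yet been realized, however, and humans remain a reference for benchmarking speech recognition systems

Sensor/Time series | NLP | Classification | Audio

Use: Benchmarking speech recognition on diverse speech
Difficulty: Hard
Cost: Low

Sensor/Time series | Deep learning | Transformer | Text | Audio | Supervised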

Content is What Remains: Invariant Speech Tokenization from Parallel Utterances

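For a language model to use a word spoken by different speakers and under different environmental conditions, the word content must be extracted. In current language models, however, speaker and environment characteristics are often entangled with the word. Here,

Use: Extracting words under varying conditions
Difficulty: Hard
Cost: Medium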

Constrained CTC Decoding for Efficient Diacritic Restoration

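Arabic diacritization is an important problem, but one difficulty is the lack of data. To solve this problem, a Connectionist Temporal Classification (CTC)-based constrained

Deep learning | Compression/Quantization | Classification | Text | Audio

Use: Constrained diacritic restoration
Difficulty: Hard
Cost: Low

Quality prediction/Anomaly detection | Deep learning | Attention mechanism | Detection | Audio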

Transcription Policy as a Latent Variable: Activating Controllable Verbatim ASR with Word-Level Timing

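The type of transcription (verbatim vs. intended) affects how current speech recognition models are evaluated, yet such constraints often do not affect model training. Here, however, the constraint affects model training

Use: Controllable verbatim transcription
Difficulty: Hard
Cost: High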

From a Multilingual Streaming ASR Backbone to Kenyan-Language Systems: Data-Centric Adaptation of Nemotron 3.5 for Kikuyu, Dholuo, and Kalenjin

Automatic speech recognition (ASR) for African languages is constrained by orthographic inconsistency, annotat

Deep learning | RNN / LSTM | Classification | Generation | Text

Use: Classification
Difficulty: Hard
Cost: Low

Fusion Embedding: A Unified Embedding Space for Text, Image, Video, and Audio

A single embedding space that covers text, images, video, and audio lets one index serve every query a user ca

NLP | LLM | Generation | Image | Text

Use: Generation
Difficulty: Hard
Cost: High

MI-oriented | Quality prediction/Anomaly detection | NLP | LLM | Image | Audio | Video

arXiv | GitHub available | 2026-07-21

OmniReasoner: Thinking with Long Audio-Video via Native Tool Use

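Proposes OmniReasoner, a method that combines the original data with a Zoom-In tool. This improves the reasoning of omni-modal LLMs over long audio-video content.

Use: Improving reasoning over long audio-video content
Difficulty: Hard
Cost: High

arXiv | Paper only | 2026-07-20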

Technical Design Review of Duke Robotics Club's Oogway & Crush: AUVs for RoboSub 2026

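Develops a robust autonomy foundation and machine learning tooling to address the limitations of existing AUV development methods.

Sensor/Time series | Computer vision | Object detection | Detection | Audio

Use: Advancing AUV development for RoboSub
Difficulty: Hard
Cost: Medium

arXiv | Paper only | 2026-07-18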

A Causal Markov Condition for Value

This paper proposes a causal independence principle for value -- the value Causal Markov Condition (v-CMC) --

NLP | LLM | Text | Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arXiv | Paper only | 2026-07-17

Constrained Hebbian Learning Supports Efficient Representational Allocation under Structural Constraints

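Proposes a method for analyzing the connections between neurons in the brain. The method analyzes connections between neurons while taking the structure of neural transmission into account.

Deep learning | Transformer | Classification | Image | Audio

Use: Analysis of neural transmission
Difficulty: Hard
Cost: Low

arXiv | Paper only | 2026-07-17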

Back to the museum: Investigation of the acceptance of Android Andrea with and without emotion simulation in a museum

For a second time, the android robot Andrea was set up at a public museum in Germany for six consecutive days

Quality prediction/Anomaly detection | Deep learning | RNN / LSTM | Text | Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

arXiv | Paper only | 2026-07-15

Verifying formulas for interventional distributions

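Proposes that one can verify whether a given observational formula identifies a particular interventional distribution, and confirms that this verification is a distinct notion from identification.

Machine learning | Supervised learning | Audio

Use: Causal verification
Difficulty: Hard
Cost: Medium

arXiv | Paper only | 2026-07-14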

PolarBM: Complex-valued Boltzmann Machine for Modeling Audio Signals in Polar and Log-polar Coordinates

Presents a complex-valued Boltzmann machine developed for analyzing audio signals.

Machine learning | Supervised learning | Audio

Use: Analyzing audio signals
Difficulty: Hard
Cost: Medium

arXiv | Paper only | 2026-07-13

Difference-Driven Gating: Adaptive Feature Fusion for U-Net Decoder

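This study proposes a new feature fusion method. By taking into account the relationship between features from the upper and lower levels, it fuses features efficiently and makes it possible to represent 3D data compactly as a 2D graph.

Sensor/Time series | Computer vision | Segmentation | Image | Audio

Use: Feature fusion
Difficulty: Hard
Cost: Medium

arXiv | Paper only | 2026-06-26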

Neuromorphic Energy-Aware Learning for Adaptive Deep Brain Stimulation

Neuromorphic and edge computing research has focused on reducing the inference cost of neural network controll

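Deep learning | Compression/Quantization | Audio | Reinforcement learning

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

arXiv | Paper only | 2026-06-23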

What Does a Pathological Speech Assessment Model Know about Acoustic Features? A Case Study on Oral and Oropharyngeal Cancer Patients

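This study examines what a pathological speech assessment model knows about acoustic features, using oral and oropharyngeal cancer patients as a case study.

Explainable | Sensor/Time series | Quality prediction/Anomaly detection | Deep learning | Compression/Quantization | Audio

Use: Analyzing pathological speech assessment models
Difficulty: Hard
Cost: Low

arXiv | Paper only | 2026-06-22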

YUKTI: From Natural-Language Situations to Robust, Verifiable Decisions An Uncertainty-Typed Proposition IR, Assumption-Robust Pareto Frontiers, and a Regret Certificate

Language models turn a worded situation into a numeric plan, and the dominant pipelines (NL4Opt, OptiMUS, ORLM

Deep learning | Compression/Quantization | Text | Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arXiv | Paper only | 2026-06-17

Adaptive Speech-to-Spike Encoding for Spiking Neural Networks

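This study uses spiking neural networks to analyze pattern recognition in speech recognition. The model supports pattern recognition for speech recognition.

Explainable | Sensor/Time series | Quality prediction/Anomaly detection | Deep learning | RNN / LSTM | Audio

Use: Pattern recognition for speech recognition
Difficulty: Hard
Cost: High

arXiv | Paper only | 2026-06-16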

A Neuromorphic Trigger for Efficient Audio Event Detection

Efficient processing of continuous audio streams remains a key challenge for real-time and resource-constraine

Deep learning | Compression/Quantization | Classification | Detection | Audio

Use: Audio event detection
Difficulty: Hard
Cost: Low