MLinfo | Machine learning and AI paper digest

MLinfo | Catch up on and search daily-updated technology

Search results for "multimodal"

200 results


github | GitHub available | 2026-06-10

screenpipe — YC (S26) | AI that knows what you've seen, said, or heard. Records everything you do, say, hear 24/7, local, private, secure

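A tool for recognizing user activity and building automated agents.

Natural language processing, Large language models, Text, Audio, Multimodal

Use case: Building automated agents
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09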

transformers — 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

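🤗 Transformers is a framework that supports model definitions for text, vision, audio, and other modalities, and can be used for both inference and training.

Deep learning, Transformer, Classification, Text, Audio

Use case: Machine learning model definition
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09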

rerun — Visualize, query, and stream to train on multimodal robotics data.

An SDK for logging, storing, querying, and visualizing data.

Computer vision, Multimodal, Image

Use case: Data logging and visualization
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09

sglang — SGLang is a high-performance serving framework for large language models and multimodal models.

SGLang is a high-performance serving framework for large language models.

Deep learning, Transformer, Image, Text, Multimodal

Use case: Serving framework for large language models
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09

xtuner — A Next-Generation Training Engine Built for Ultra-Large MoE Models

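xtuner is a training engine for fast training of ultra-large MoE models.

Natural language processing, Large language models, Generation, Multimodal

Use case: Fast training of MoE models
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09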

lance — Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming..

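An open lakehouse format suited to multimodal AI. Data can be converted from Parquet in two lines of code for 100x faster random access, and the format also supports vector indexing and data versioning.

Natural language processing, Large language models, Multimodal

Use case: Open lakehouse format
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09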

runanywhere-sdks — Production ready toolkit to run AI locally

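This repository provides ONNX, an open standard for ensuring AI model compatibility.

Natural language processing, Large language models, Multimodal

Use case: Open standard for ensuring AI model compatibility
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-09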

haystack — Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems.

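An open-source AI orchestration framework. It lets you design the pipelines and agent workflows needed to build LLM applications.

Deep learning, Transformer, Generation, Summarization, Text

Use case: Building LLM applications
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-06-08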

Discovering Functionally Selective Brain Regions with a Deep Topographic Multimodal Model

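This study assumes that neurons within a brain region share the same response profile, infers the response profiles of neurons in neighboring brain regions, and identifies connections between regions.

Natural language processing, RAG, Image, Multimodal

Use case: Brain region research
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08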

Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models

Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of ro

Deep learning, Model compression/quantization, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data

Alzheimer's disease (AD) progression is highly heterogeneous and is typically observed through sparse and irre

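Explainable, Deep learning, Model compression/quantization, Classification, Generation, Prediction

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08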

FMplex: Model Virtualization for Serving Extensible Foundation Models

Foundation models (FMs) are increasingly used as backbones for downstream tasks across language, vision, time-

Sensor/time series, Computer vision, Multimodal, Time series

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain

Natural language processing, RAG, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

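Studies question answering with video large language models. It investigates the models' capabilities and limitations and proposes a method for generating answers to questions.

Deep learning, Model compression/quantization, Text, Video, Multimodal

Use case: Question answering with video large language models
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08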

LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

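In online continual learning, a model must continually accumulate knowledge from a non-stationary data stream. Model parameters must be adjusted effectively during training, but parameter-efficient prompt tuning and

Deep learning, Model compression/quantization, Detection, Text, Multimodal

Use case: Online continual learning
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08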

Zero-Shot Semantic Re-Identification for Autonomous Driving: A VLM Baseline Study

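This study establishes a baseline for zero-shot semantic re-identification and automates semantic identification in images.

Explainable, Sensor/time series, Deep learning, CNN, Image, Text, Multimodal

Use case: Semantic re-identification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08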

PRISM: Topology-Aware Cross-Modal Imputation for Modality-Deficient Federated Graph Learning

Multimodal federated graph learning (MM-FGL) aims to collaboratively learn from decentralized graphs with text

Natural language processing, RAG, Image, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention

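As an approach to early detection of Parkinson's disease (PD), the paper proposes diagnosing the disease through speech analysis, examining the speech impairments that brain damage causes before onset.

Sensor/time series, Deep learning, Transformer, Detection, Generation, Embedding

Use case: Early detection of Parkinson's disease
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08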

Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA

In this paper, VideoQA is excessively

Computer vision, Multimodal, Detection, Image, Video

Use case: Counterfactual reasoning for VideoQA
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed fo

Deep learning, Transformer, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Driving Video Retrieval for Complex Queries with Structured Grounding

Video retrieval at scale is central to data curation and safety validation in autonomous driving, where users

Computer vision, Multimodal, Text, Video

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning

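Theory of Mind is considered an essential skill for modern foundation-model systems to operate safely and effectively in the real world. However, progress on Theory of Mind faces a "shortcut" problem, in which a task reaches 99% accuracy merely

Natural language processing, RAG, Text, Multimodal, Reinforcement learning

Use case: Strengthening Theory of Mind
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-08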

Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

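On-policy distillation has recently become an important area of post-training research. It uses a strong teacher model to give dense, fine-grained guidance on training trajectories. However, at the token level

Deep learning, Model compression/quantization, Multimodal, Reinforcement learning

Use case: On-policy distillation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08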

Stage-1 Controls the Entropy Regime, Not the Outcome

Two-stage post-training -- a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD)

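Deep learning, Model compression/quantization, Text, Multimodal, Reinforcement learning

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08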

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-lev

Deep learning, Model compression/quantization, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache

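Proposes caching and propagating information to accelerate world action models.

Computer vision, Segmentation, Video, Multimodal

Use case: Caching and propagation to accelerate world action models
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08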

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

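This paper provides an evaluation benchmark for VLM game agents, enabling comparison across different types of agents.

Natural language processing, Large language models, Text, Multimodal

Use case: Evaluation benchmark for VLM game agents
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08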

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and op

Natural language processing, Large language models, Image, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

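Designs an architecture for verifying generated text to support drafting clinical research papers with LLMs. This prevents fabricated citations, inaccurately reported numbers, and guideline violations.

Quality prediction/anomaly detection, Computer vision, Video recognition, Detection, Image, Text

Use case: Support for writing medical papers
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08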

ATN3D: Density-Aware LiDAR-Radar Early 3D Object Detection Under Extreme Sparsity

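3D object detection is required for perception in automated vehicles such as self-driving cars and intelligent transportation systems. Long-range detection on roads is difficult, yet at this "long range" the time available for perception and decision-making is only about 1-2 seconds. Two main challenges

Sensor/time series, Deep learning, Transformer, Classification, Detection, Text

Use case: 3D object detection for long-range vehicle perception
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08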

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multi

Deep learning, Model compression/quantization, Image, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs

Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning t

Natural language processing, Large language models, QA, Image, Text

Use case: QA
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily

Deep learning, Model compression/quantization, Image, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Context-Aware Deep Learning for Defect Classification in Atomic-Resolution STEM

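Proposes context-aware deep learning aimed at non-destructive inspection of materials, classifying defects in atomic-resolution STEM.

MI-oriented, Quality prediction/anomaly detection, Computer vision, Multimodal, Classification, Detection, Image

Use case: Non-destructive inspection of materials
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08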

Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer

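Proposes real-time body pose non-verbal communication aimed at body pose recognition and behavior interpretation, recognizing human movements and interpreting actions.

Computer vision, Multimodal

Use case: Body pose recognition and behavior interpretation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08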

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation wi

Natural language processing, Prompt engineering, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Vision Language Model Helps Private Information De-Identification in Vision Data

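Vision-language models (VLMs) are highly effective for privacy protection. However, the privacy risks of handling visual data have so far received little attention. Using VLMs to ensure privacy protection

Computer vision, Object detection, Classification, Detection, Image

Use case: Privacy protection for visual data using vision-language models
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08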

Personalization Meets Safety:Mechanisms,Risks,and Mitigations in Personalized LLMs

Large Language Models (LLMs) have enabled increasingly personalized interactions by adapting to users' prefere

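MI-oriented, Deep learning, Model compression/quantization, Text, Multimodal, Reinforcement learning

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08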

Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

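Introduces Baichuan-M4, a clinical-grade LLM medical system suited to continuous care. Baichuan-M4 is a clinical medical agent system built on an integrated medical agent system, with coordination among medical agents

Computer vision, Multimodal, QA, Image, Text

Use case: LLM-based medical agent for integrated medical systems
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08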

An Effective Router for Vision-Language Model Selection

Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making i

Natural language processing, Large language models, Anomaly detection, Image, Text

Use case: Anomaly detection
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning ov

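Natural language processing, RAG, Image, Multimodal, Reinforcement learning

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08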

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but

Natural language processing, Large language models, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection

Semiconductor lithography inspection requires reliable detection of small pattern defects such as bridge, burr

Quality prediction/anomaly detection, Natural language processing, Fine-tuning, Detection, Image, Text

Use case: Detection
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

A multi-agent system for spine MRI report generation from multi-sequence imaging

Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluat

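Explainable, Natural language processing, Embedding/retrieval, Classification, Detection, Generation

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08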

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

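This study develops a framework for silent speech synthesis. The framework improves the accuracy of silent speech synthesis.

Sensor/time series, Natural language processing, RAG, Generation, Audio, Video

Use case: Silent speech synthesis
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08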

Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer acc

Natural language processing, Large language models, QA, Image, Text

Use case: QA
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Interpretable Crisis Behavior Analysis Using Mobility and Social Media Data

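During crises, people's mobility patterns and social media posting patterns change, which makes analysis difficult. This study integrates mobility data and social media data to analyze behavior patterns during crises and to predict behavior in crisis situations.

Explainable, Quality prediction/anomaly detection, Computer vision, Segmentation, Multimodal

Use case: Crisis behavior analysis
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08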

H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions

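Large language models have memory and reasoning capabilities, but how effective these are in interactions with users is not yet well understood. In response, this study targets human interactions, particularly conversation, to evaluate memory and reasoning abilities with a multimodal

Natural language processing, Large language models, Generation, Text, Multimodal

Use case: Evaluation of multimodal memory
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-08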

MUDIDI: A Two-Stage Framework for Multilingual Dictionary Digitization with Language Models

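Digitizing dictionaries of low-resource and extinct languages is important, but digitizing multimodal dictionaries has been difficult until now. This study uses recent vision-language models to make dictionary digitization easier, and the characters in the dictionary

Quality prediction/anomaly detection, Natural language processing, Large language models, Classification, Segmentation, Text

Use case: Multilingual dictionary digitization
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08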

Guide Me Out: A Framework to Benchmark VLM Operators Communication in Crisis Scenarios

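In crisis management, communication and geographic

Quality prediction/anomaly detection, Computer vision, Multimodal, Classification, Image, Text

Use case: Evaluating communication in crisis management
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08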

Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs

Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating c

Sensor/time series, Deep learning, Transformer, Classification, Text, Audio

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Explicit Representation Alignment for Multimodal Sentiment Analysis

Multimodal affective analysis aims to understand human sentiment and emotion by jointly modeling heterogeneous

Explainable, Natural language processing, RAG, Image, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

CRANE: Knowledge Editing for Reasoning MLLMs

The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought

Natural language processing, Large language models, Anomaly detection, Image, Text

Use case: Anomaly detection
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level

LLMs are increasingly used to simulate human survey responses, but prior work has mainly evaluated replication

Natural language processing, Large language models, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of

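Tabular-oriented, Quality prediction/anomaly detection, Natural language processing, RAG, Classification, QA, Image

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-08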

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable r

Computer vision, Multimodal, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past int

Computer vision, Video recognition, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents

Text-driven indoor scene generation and editing require an intermediate representation that language models ca

Natural language processing, Large language models, Generation, Text, 3D

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Cranio-Diff: Diffusion-based Cross-domain Craniofacial Reconstruction with 2D X-ray Skull Guidance and Structural Identity Constraints

The state-of-the-art generative models, such as CycleGAN, Pix2Pix, and diffusion models have demonstrated rema

Quality prediction/anomaly detection, Computer vision, Segmentation, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

GenEyePose: Patient-Free, Knowledge-Based Saccadic Eye Movement Modeling for Digital Neurophysiologic Biomarker Development

Eye movements, including saccades, are widely regarded as highly sensitive and objective biomarkers of neuroph

Deep learning, Transformer, Classification, Detection, Generation

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

GD-MIL: Grade-Disentangled Multiple Instance Learning for Multimodal Biochemical Recurrence Prediction in Prostate Cancer

Biochemical recurrence (BCR) after radical prostatectomy is a critical endpoint in prostate cancer, yet risk s

Deep learning, CNN, Image, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a crit

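Quality prediction/anomaly detection, Natural language processing, Large language models, Image, Text, Video

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08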

ExDet: Open-Domain Open-Vocabulary Detection with Cross-modal Extrapolation and Rectification

Open-domain open-vocabulary detection (ODOVD) requires detectors to generalize to both novel categories and un

Deep learning, Model compression/quantization, Classification, Detection, Image

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

IB-HFN: Information Bottleneck-Driven SAR-Optical Fusion Network for High-Fidelity Cloud Removal

Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by

Sensor/time series, Computer vision, Video recognition, Image, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal

Deep learning, Normalization/optimization methods, Classification, Generation, Segmentation

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Ch

Natural language processing, Fine-tuning, Classification, Embedding, Video

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

MAGIS: Evidence-Based Multi-Agent Reasoning for Interpretable Strabismus Clinical Decision-Making

Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatme

Explainable, Computer vision, Multimodal, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating

Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (

Deep learning, Model compression/quantization, Image, Text, Video

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

CAMF-Det: Closure-Aware Multimodal Fusion for LiDAR-Camera 3D Object Detection on UAV Platforms

Multimodal 3D object detection based on LiDAR and cameras has demonstrated excellent performance in ground-veh

Deep learning, Transformer, Detection, 3D, Multimodal

Use case: Detection
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging

Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them pron

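Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08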

Scaling by Diversified Experience for Vision-Language-Action Models

Vision-Language-Action models face significant challenges in real-world deployment due to the entanglement of

Computer vision, Segmentation, Anomaly detection, Text, Multimodal

Use case: Anomaly detection
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models

Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existin

Deep learning, Transformer, Detection, Image, Text

Use case: Detection
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

DifferSeg: Towards Diverse Multimodal Binary Segmentation via Differential Perception and Frequency Guidance

In many binary segmentation tasks, most multimodal methods rely on fixed feature concatenation for cross-modal

Deep learning, Model compression/quantization, Segmentation, Multimodal

Use case: Segmentation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models

Vision-Language-Action (VLA) models demonstrate strong performance on language-conditioned robotic manipula

Deep learning, Model compression/quantization, 3D, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent stud

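Deep learning, Model compression/quantization, Multimodal, Reinforcement learning

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08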

Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection

Vision-language-action (VLA) policies can deviate from nominal trajectories during manipulation, even when tas

Natural language processing, RAG, Image, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-08

MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on t

Natural language processing, RAG, Image, Video, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability

Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap ver

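Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Generation, Multimodal

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-07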

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it h

MI-oriented, Natural language processing, Large language models, Generation, Text, Multimodal

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but con

Deep learning, Transformer, Image, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

Reinforcement Learning for Flow-Matching Policies with Density Transport

We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuou

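Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Multimodal, Reinforcement learning

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07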

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet exist

Natural language processing, Fine-tuning, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

A Resilience-as-a-Service assessment framework for coordinated disruption response in interdependent urban transit systems

Urban public transport disruptions require rapid response strategies, yet existing studies rarely provide a de

Deep learning, Transformer, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task

We present HydraQE, our contribution to the IWSLT 2026 Speech Translation Metrics shared task. HydraQE is an e

Quality prediction/anomaly detection, Deep learning, Transformer, Translation, Text, Audio

Use case: Translation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

Harnessing Streaming Video in the Wild

Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications suc

Tabular-oriented, Computer vision, Video recognition, Text, Video, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

TRADE: Transducer-Augmented Decoder for Speech LLM

Speech Large Language Models (Speech LLMs) lack a principled mechanism for streaming inference: their label-sy

Sensor/time series, Deep learning, Transformer, Classification, Detection, Generation

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

When Correct Decisions Hide Internal Stress: Decision-State Probing in Multimodal Language Models

Multimodal language models are typically evaluated through external behavior: selecting the correct image--tex

Deep learning, Transformer, Image, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-07

Vision-Language Work Zone Intelligence for Safety-Critical Speed Regulation of Mixed-Autonomy Vehicles in Dynamic Environments

Temporary work-zone speed limits are communicated through visually inconsistent signage and are often missing

Computer vision, Object detection, Classification, Detection, Image

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual obs

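Sensor/time series, Deep learning, Model compression/quantization, Image, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07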

PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reaso

Natural language processing, RAG, Image, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback

Achieving fully automated, physically plausible 3D motion synthesis is a core objective in graphics and genera

MI-oriented, Deep learning, Model compression/quantization, Generation, Text, 3D

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-07

BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving

We present BLUE, a minimal method for better language use in vision-language-action (VLA) models for autonomou

Deep learning, Model compression/quantization, Generation, Multimodal

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

SSAFE: Simple and Strong AI-Generated Image Detection via Frozen Vision Encoders

The rapid advancement of generative models has blurred the boundary between synthetic and real imagery, creati

Natural language processing, Fine-tuning, Classification, Detection, Generation

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

Facial Expression Recognition in the Deep Learning Era: A Systematic Multi-Criteria Review of Methods, Models, Datasets, Performance, Challenges, and Future Research Directions

Facial Expression Recognition (FER) has advanced rapidly over the last decade, driven by the shift from handcr

Deep learning, CNN, Classification, Multimodal

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

Towards Accurate Emotion-Attributed Video Captioning via Fine-grained Emotion-Cause Pair Extraction

Emotional Video Captioning (EVC) is a challenging task that aims to generate factually accurate and emotionall

Explainable, Natural language processing, RAG, Generation, Image, Video

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For

Deep learning, Model compression/quantization, Classification, Video, Multimodal

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for auto

Tabular-oriented, Computer vision, Video recognition, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

Look Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal LLMs

Multimodal Large Language Models (MLLMs) face a significant inference bottleneck due to the quadratic computat

Small-data-oriented, Deep learning, Transformer, Image, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control

Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Mo

Computer vision, Segmentation, Generation, Prediction, Image

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation

Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due

Deep learning, Transformer, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-07

TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models.

Natural language processing, Large language models, Image, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors

Robotic grasping is a fundamental capability in robotic manipulation. Yet grasping remains challenging under p

Natural language processing, RAG, 3D, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

CheXanatomy: Anatomy-Aware Vision-Language Modeling for Chest Radiographs

Vision-language models (VLMs) pretrained on large-scale image-text pairs demonstrate strong image-level unders

Deep learning, CNN, Detection, Generation, Segmentation

Use case: Detection
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

Guided Discovery of New Behaviors using Diffusion Policies

Diffusion models have become a powerful tool for generative modeling in robotics, with diffusion policies exce

Computer vision, Segmentation, Generation, Multimodal, Reinforcement learning

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-07

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

World action models inherit the predictive capability of world models, enabling action generation to be guided

Natural language processing, RAG, Generation, Image, Multimodal

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language

Robots deployed in human-centric environments routinely receive natural-language descriptions of spatial infor

Sensor/time series, Computer vision, 3D/point cloud, Text, 3D, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-07

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world depl

Natural language processing, Prompt engineering, Image, 3D, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control p

Natural language processing, Fine-tuning, Anomaly detection, Image, Text

Use case: Anomaly detection
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-07

LUNA-AD: Lightweight Uncertainty-Aware Language Model with Lifelong Learning for Autonomous Driving

While large language models (LLMs) offer promising reasoning capabilities, their integration into safety-criti

Deep learning, Model compression/quantization, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

github | GitHub available | 2026-06-07

awesome-japanese-llm — 日本語LLMまとめ - Overview of Japanese LLMs

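Develops learning models to improve the performance of analysis systems.

Natural language processing, Large language models, Generation, Multimodal

Use case: Developing learning models to improve the performance of analysis systems
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-06-06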

How Deep Are Deep GPs, Really? A Sharp Threshold and a Non-Gaussian Limit for Compositional GPs

Compositional priors describe the generic properties of layered functions in deep Bayesian models, where deep

Suited to small data, Computer vision, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the re

NLP, LLM, Detection, Generation, Text

Use: Detection
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

Enabling robots to understand and execute tasks from natural language commands while maintaining data efficien

Suited to small data, Suited to MI, Condition optimization, NLP, Fine-tuning, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, pa

Deep learning, Model compression/quantization, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | GitHub available | 2026-06-06

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet the

Explainable, Quality prediction/anomaly detection, NLP, LLM, Image, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via Multimo

Deep learning, Model compression/quantization, Generation, Retrieval, Image

Use: Generation
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

Beyond Raw Signals: Undecoded Generative Latents as Privileged Synthetic Data

While multimodal integration significantly improves computer vision models, deploying them incurs prohibitive

Deep learning, Model compression/quantization, Classification, Generation, Image

Use: Classification
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

TIDE: Task-Isolated Diffusion for Unified Video Editing and Generation

Recent advances in Diffusion Transformers have driven rapid progress in video generation and editing, yet thes

Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning

Test-time Scaling (TTS) has emerged as a pivotal research direction for enhancing model performance by dynamic

Computer vision, Segmentation, Generation, Multimodal

Use: Generation
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

Composed Video Retrieval (CVR) is designed to retrieve a target video that matches a reference video modified

Suited to MI, Computer vision, Multimodal, Image, Text, Video

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling

Vision-language models (VLMs) enable visual recognition from semantic class descriptions, which makes them att

Sensor/time series, Computer vision, Segmentation, Classification, Image, Text

Use: Classification
Difficulty: Hard
Cost: High

→

arxiv | GitHub available | 2026-06-06

VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation

Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but

Suited to MI, Quality prediction/anomaly detection, NLP, LLM, Generation, Image, Text

Use: Generation
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model

Vision-language-action (VLA) models increasingly condition robot policies on history, depth, or 4D features to

NLP, RAG, Generation, Image, Text

Use: Generation
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation

Humanoid foundation models are advancing faster than we can evaluate them. While real-world testing is expensi

Deep learning, Model compression/quantization, Generation, Multimodal

Use: Generation
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

Learning from Human Driving: A Human-in-the-Loop Online Behavior Cloning Framework for Autonomous Driving

With the evolution of large foundation models (LFMs), data-driven autonomous driving has made significant stri

Deep learning, Model compression/quantization, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation

NLP, LLM, Video, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-06

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that tac

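Suited to small data, Deep learning, Model compression/quantization, Generation, Multimodal, Reinforcement learning

Use: Generation
Difficulty: Hard
Cost: High

→

github | GitHub available | 2026-06-06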

EEGUnity — An open source tool for large-scale EEG datasets processing

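Video diffusion transformers have inference capabilities that do not depend on video length, but this length extrapolation is difficult in practice. A method called RIFLEx is developed, and video length extrapolation

Computer vision, Multimodal

Use: Length extrapolation with video diffusion transformers
Difficulty: Easy
Cost: High

→

arxiv | Paper only | 2026-06-05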

Combinatorial Landscape Analysis for Dominating Set and Vertex Coloring

We analyze the two combinatorial problems of Dominating Set and Vertex Coloring regarding what kind of local o

Computer vision, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-05

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction ov

Computer vision, Segmentation, Generation, Regression, Text

Use: Generation
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-05

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

Open-vocabulary long-horizon manipulation requires robots to reason over flexible instructions and complex mul

Computer vision, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-05

Spline Policy: A Structured Representation for Robot Policies

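This paper proposes Spline Policy (SP), a new representation for robot control. By representing actions as splines, SP can express actions in finer detail and more flexibly.

Deep learning, Transformer, Multimodal

Use: A new representation for robot control
Difficulty: Hard
Cost: High

→

arxiv | GitHub available | 2026-06-05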

RhinoVLA Technical Report

This paper proposes a method, in the form of a framework, for deploying VLA models on edge hardware. The method uses edge hardware to V

Deep learning, Model compression/quantization, Image, Text, Multimodal

Use: A method for deploying VLA models on edge hardware
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-05

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

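This study addresses the problem of vision-language navigation (VLN) in more realistic environments for executing natural-language instructions. With the conventional three-stage approach, locations where the destination is hard to reach and inconsistencies between planning and control

Computer vision, Multimodal, Generation

Use: Vehicle trajectories
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-05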

Robotic Policy Adaptation via Weight-Space Meta-Learning

Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling ge

NLP, Fine-tuning, Video, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-05

Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models

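This paper improves vision-language-action models. Coarse-to-Control greatly reduces the planning space required for actions and, to realize action planning, a new frame

Computer vision, Multimodal, Generation

Use: Improving vision-language-action models
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-05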

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

Visual-language action (VLA) models enable robots to predict actions directly from observations and language i

Quality prediction/anomaly detection, NLP, RAG, Image, Video, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-05

Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning

World Action Models (WAMs) offer a promising approach to embodied intelligence, yet existing methods rely heav

Deep learning, Model compression/quantization, Image, Text, Video

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-05

LIMMT: Less is More for Motion Tracking

We argue that high-quality motion data can steer tracking policies toward better optimization trajectories ear

Quality prediction/anomaly detection, Computer vision, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | GitHub available | 2026-06-05

ActionMap: Robot Policy Learning via Voxel Action Heatmap

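This paper presents ActionMap, a newly proposed model for learning robot control.

Deep learning, Model compression/quantization, Regression, Multimodal

Use: Learning robot control
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-05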

Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

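VLN benchmarks use discrete or coarse actions, and UAV vision-language-action (VLA) tasks center on short maneuvers; for long-duration flight, fine-grained UAV navigation (FLIGHT)

Computer vision, Multimodal, Text, Video

Use: Long-duration drone flight
Difficulty: Hard
Cost: High

→

huggingface | GitHub available | Hugging Face available | 2026-06-05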

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research move

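Deep learning, Model compression/quantization, Image, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-05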

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, r

Deep learning, Model compression/quantization, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

→

github | GitHub available | 2026-06-05

stable-pretraining — Reliable, minimal and scalable library for pretraining foundation and world models

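A library for pretraining foundation models. Minimal, and scales seamlessly.

Deep learning, Transformer, Multimodal, Self-supervised

Use: Pretraining foundation models
Difficulty: Easy
Cost: High

→

arxiv | Paper only | 2026-06-04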

Discrete Causal Representations from Heterogeneous Domains: A Bayesian Approach with Social Survey Applications

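This study develops a tool that uses a Bayesian model to analyze causal relationships in complex data from multiple domains. It is intended mainly for social surveys.

Explainable, Computer vision, Segmentation, Generation, Embedding, Multimodal

Use: Developing a tool to analyze causal relationships across multiple domains
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-04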

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

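This study addresses variable execution speed in robot manipulation. The proposed TempoVLA is a reinforcement learning model that allows the speed to be varied.

Computer vision, Multimodal, Reinforcement learning

Use: Speed-controllable vision-language-action policies
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-04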

VOLT: Vision and Language Trajectory Segmentation for Faster-than-Demonstration Policies

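In this study, faster-than autonomous driving

Quality prediction/anomaly detection, NLP, RAG, Segmentation, Text, Video

Use: High-speed operation for faster-than autonomous driving
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-04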

MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action

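Recognizing that Vision-Language-Action (VLA) policies are brittle in long-horizon, high-uncertainty control, and that VLA policies offer only single-pass action decoding, for long-horizon prediction

Quality prediction/anomaly detection, NLP, Prompt engineering, Text, Multimodal

Use: A solution to the brittleness of VLA policies in long-horizon and high-uncertainty control
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-04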

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

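This work proposes a model that aims to give an image recognition model the ability to generate actions. Using a model pretrained for image recognition, it can generate complex actions.

Deep learning, Transformer, Detection, Generation, Prediction

Use: Image recognition and action generation
Difficulty: Hard
Cost: High

→

arxiv | GitHub available | 2026-06-04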

A Conversational Framework for Human-Robot Collaborative Manipulation with Distributed Generative AI models

This study proposes a distributed conversational framework for human-robot collaboration.

NLP, LLM, Generation, Image, Text

Use: Human-robot collaboration
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-04

L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation

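This study proposes a method called L-SDPPO, which optimizes a diffusion policy for intra-vehicular robotic manipulation.

Deep learning, Transformer, Multimodal, Reinforcement learning

Use: Intra-vehicular robotic manipulation
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-04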

Robots Need More than VLA and World Models

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations,

Computer vision, 3D/point cloud, Generation, Video, 3D

Use: Generation
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-04

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

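Proposes a unified vision-language-action model that improves performance on the tasks it is applied to.

Deep learning, Transformer, Generation, Image, Text

Use: A unified vision-language-action model
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-04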

T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation

Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D sc

NLP, RAG, Classification, Segmentation, Image

Use: Classification
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-04

Towards a Data Flywheel for Embodied Intelligence in Logistics

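In autonomous driving, a robot must decide its actions based on visually perceived information, but spatial models built from past data make it difficult to predict the robot's behavior, so by building a spatial model

Computer vision, Multimodal, Anomaly detection, Text, Video

Use: Building a space suited to predicting robot behavior
Difficulty: Hard
Cost: High

→

huggingface | Hugging Face available | 2026-06-04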

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of ineffi

Deep learning, Model compression/quantization, Anomaly detection, Image, Multimodal

Use: Anomaly detection
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-04

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vis

Quality prediction/anomaly detection, Computer vision, Multimodal, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-04

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning a

NLP, LLM, Image, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

huggingface | GitHub available | Hugging Face available | 2026-06-04

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagno

Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-04

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existin

NLP, LLM, Image, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-04

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs i

Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-04

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, r

Suited to tabular data, NLP, LLM, Text, Video, 3D

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-04

Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions

Large language models are increasingly used to simulate social media users and infer how individuals may respo

Deep learning, Transformer, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-04

Benchmark Everything Everywhere All at Once

Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit

Quality prediction/anomaly detection, NLP, LLM, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

arxiv | Paper only | 2026-06-03

Worker Utility as Hysteresis: A Preisach Model of Transaction Acceptance in Gig Labour Markets

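This study proposes an efficient analysis of individual decision-making (Worker Utility), analyzing individual decisions efficiently and putting the results to use.

Suited to tabular data, Easy to try on CPU, Computer vision, Multimodal, Classification

Use: Efficient analysis of individual decision-making
Difficulty: Hard
Cost: High

→

huggingface | Hugging Face available | 2026-06-03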

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical info

Suited to tabular data, Explainable, Computer vision, Multimodal, Image, Text

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-03

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference

NLP, Fine-tuning, Summarization, QA, Image

Use: Summarization
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-03

BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding

Learning representations of CAD models is a largely open problem. While 3D representation learning has flouris

Deep learning, Transformer, Classification, Generation, Embedding

Use: Classification
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-03

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing

Sensor/time series, Computer vision, Multimodal, Generation, Image

Use: Generation
Difficulty: Easy
Cost: High

→

github | GitHub available | 2026-06-03

BentoML — The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

Introduces a library for serving models.

NLP, LLM, Generation, Multimodal

Use: Model serving
Difficulty: Easy
Cost: High

→

arxiv | Paper only | 2026-06-02

A Quantitative Approximation Framework for Flow Distillation in Diffusion Models

We develop a quantitative approximation framework for diffusion distillation, viewing few-step sampling as err

Deep learning, Model compression/quantization, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-02

Multimodal Transformer Based Generic Mixture Density Network for Scattering Timescale Estimation of Fast Radio Bursts

The discovery rate of fast radio bursts (FRBs) continues to increase with the advent of new radio facilities a

Sensor/time series, Deep learning, Transformer, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-06-02

A Fast Screening Approach for High-dimensional Outcomes and High-dimensional Predictors

Modeling interactions among multimodal, high-dimensional data is intrinsically challenging due to ultra-high d

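Explainable, Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

huggingface | Hugging Face available | 2026-06-02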

A Cookbook of 3D Vision: Data, Learning Paradigms, and Application

3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and mo

NLP, RAG, Generation, Video, 3D

Use: Generation
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-02

MAOAM: Unified Object and Material Selection with Vision-Language Models

Selection is a core operation in interactive image editing. To be practical, a user should be able to specify

Suited to MI, NLP, RAG, Generation, Segmentation, Image

Use: Generation
Difficulty: Easy
Cost: High

→

huggingface | GitHub available | Hugging Face available | 2026-06-02

SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction

In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-po

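Sensor/time series, Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Image, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-02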

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous

Quality prediction/anomaly detection, NLP, LLM, Generation, Text, Video

Use: Generation
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-02

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained per

NLP, LLM, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-02

When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models

Graph Language Models (GLMs) have become a promising direction for adapting Large Language Models (LLMs) to gr

Deep learning, Model compression/quantization, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

arxiv | Paper only | 2026-06-01

Flow-Transformed Implicit Processes for Function-Space Variational Inference

Implicit-process priors define distributions over functions through flexible generative mechanisms, making the

Deep learning, Model compression/quantization, Generation, Multimodal

Use: Generation
Difficulty: Hard
Cost: High

→

huggingface | Hugging Face available | 2026-06-01

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map compl

Sensor/time series, Deep learning, Transformer, Detection, Generation, 3D

Use: Detection
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-01

AdaCodec: A Predictive Visual Code for Video MLLMs

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existin

NLP, LLM, Image, Text, Video

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

huggingface | GitHub available | Hugging Face available | 2026-06-01

Cosmos 3: Omnimodal World Models for Physical AI

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, i

Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-06-01

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. Ho

NLP, RAG, Regression, Text, Multimodal

Use: Regression
Difficulty: Easy
Cost: High

→

github | GitHub available | 2026-06-01

PosterGen — Official Repository for PosterGen - CVPR Findings 2026

This is the official repository for PosterGen, a poster generation tool presented at CVPR 2026.

Computer vision, Multimodal

Use: Providing a poster generation tool
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-05-29

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding

Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question

Quality prediction/anomaly detection, NLP, LLM, Classification, QA, Image

Use: Classification
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-05-29

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relatio

Computer vision, 3D/point cloud, Detection, Text, 3D

Use: Detection
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-05-29

PaintBench: Deterministic Evaluation of Precise Visual Editing

While current multimodal models are proficient at open-ended visual editing, executing precise single-answer e

Computer vision, Multimodal, Generation, Image

Use: Generation
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-05-28

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capabilit

Deep learning, Model compression/quantization, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-05-28

Multimodal Music Recommendation System using LLMs

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction hist

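Sensor/time series, Quality prediction/anomaly detection, Deep learning, Transformer, Text, Audio, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

huggingface | Hugging Face available | 2026-05-28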

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision b

NLP, Fine-tuning, Image, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

arxiv | Paper only | 2026-05-27

Evolving to the Aesthetics of a Vision-Language Model

Evolutionary systems have demonstrated remarkable results in creative domains, with recent applications in gen

Computer vision, Multimodal, Generation, Text

Use: Generation
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-05-22

Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems

Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable sp

Suited to small data, Deep learning, Transformer, Classification, Image, Text

Use: Classification
Difficulty: Hard
Cost: High

→

huggingface | Hugging Face available | 2026-05-22

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce nume

NLP, Fine-tuning, Image, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

→

github | GitHub available | 2026-05-21

deeplake — Deeplake is AI Data Runtime for Agents. It provides serverless postgres with a multimodal datalake, enabling scalable retrieval and training.

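Through automatic conversion, the model optimizes its compute at test time, spending more computation on hard steps and less on easy ones.

NLP, LLM, Multimodal

Use: Optimizing compute for language models
Difficulty: Easy
Cost: High

→

arxiv | Paper only | 2026-05-19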

Smooth Partial Lotteries for Stable Randomized Selection

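In competition across divisions, candidates often have to be selected on the basis of evaluations. However, existing randomized selection mechanisms have not accounted for data imbalance among candidates with subtle differences, which lowers stability. Therefore, this

Quality prediction/anomaly detection, Computer vision, Multimodal

Use: Realizing a method that promotes smart randomized selection
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-05-19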

A Nash Equilibrium Framework For Training-Free Multimodal Step Verification

Multimodal large language models often generate reasoning chains containing subtle errors that lead to incorre

NLP, LLM, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

arxiv | Paper only | 2026-05-18

Mapping the Fitness Landscape: A Structure-Guided Approach to Multi-Modal Optimization

Multimodal optimization requires finding many optima rather than merely keeping a diverse population. Yet most

Quality prediction/anomaly detection, NLP, RAG, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

→

github | GitHub available | 2026-05-14

VidCom2 — [EMNLP 2025 Main] Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

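VidCom2 provides plug-and-play inference acceleration for Video Large Language Models through improved video compression.

Deep learning, Model compression/quantization, Text, Video, Multimodal

Use: Improving video compression
Difficulty: Easy
Cost: High

→

github | GitHub available | 2026-05-13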

maths-cs-ai-compendium — Become a cracked AI/ML Research Engineer

Introduces methods for building the skills and knowledge of AI/ML researchers, toward becoming a cracked AI/ML Research Engineer.

Computer vision, Multimodal, Text, Audio

Use: Training AI/ML researchers
Difficulty: Easy
Cost: High

→