MLinfo | Machine Learning and AI Paper Digest

transformers — 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

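🤗 Transformers is a model-definition framework covering text, vision, audio, and other modalities; it can be used for both inference and training.

Deep learning, Transformer, Classification, Text, Audio

Use: Machine learning model definition
Difficulty: Easy
Cost: High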

Medical_Image_Analysis — Foundation models based medical image analysis

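Medical image analysis is a research field that extracts information from image data to support medical diagnosis and treatment. This work proposes a new approach to medical image analysis based on foundation models.

Use: Medical image analysis
Difficulty: Easy
Cost: High

NLP, LLM, Text, Audio, Multimodal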

screenpipe — YC (S26) | Record your screen 24/7 and plug into your agents. Local, private, secure. Connect to OpenClaw, Hermes agent and 100+ apps

A tool for recognizing user activity and building automated agents.

Use: Building automated agents
Difficulty: Easy
Cost: High

rerun — Visualize, query, and stream to train on multimodal robotics data.

An SDK for logging, storing, querying, and visualizing data.

Computer vision, Multimodal, Image

Use: Data logging and visualization
Difficulty: Easy
Cost: High

Deep learning, Transformer, Image, Text, Multimodal

sglang — SGLang is a high-performance serving framework for large language models and multimodal models.

SGLang is a high-performance serving framework for large language models.

Use: Serving large language models
Difficulty: Easy
Cost: High

NLP, LLM, Text, Multimodal

ai-agent-book — Main open-source repository for the book 《深入理解 AI Agent：设计原理与工程实践》 (by 李博杰): full text, compiled PDF, and per-chapter companion code

An open-source repository for a book on the design principles and engineering practice of AI agents, with the full text, a compiled PDF, and companion code for each chapter.

Use: Learning AI agent design and engineering
Difficulty: Easy
Cost: High

lance — Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming.

An open lakehouse format suited to multimodal AI. Data can be converted from Parquet in two lines of code for 100x faster random access. It also supports vector indexing and data versioning.

Use: Open lakehouse format
Difficulty: Easy
Cost: High

runanywhere-sdks — Production ready toolkit to run AI locally

A production-ready toolkit for running AI models locally.

Use: Running AI models locally
Difficulty: Easy
Cost: High

verl-omni — Multimodal RL training framework for diffusion & omni models

A multimodal reinforcement learning training framework for diffusion and omni models.

Use: Multimodal RL training
Difficulty: Easy
Cost: High

haystack — Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems.

An open-source AI orchestration framework for designing the pipelines and agent workflows needed to build LLM applications.

Deep learning, Transformer, Generation, Summarization, Text

Use: Building LLM applications
Difficulty: Easy
Cost: High

3D-Aware VLMs with Implicit and Explicit Geometries

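Proposes VLM-IE3D (Vision-Language Models with Implicit and Explicit 3D geometry), a new approach to 3D spatial understanding.

Computer vision, 3D / point cloud, Detection, Image, Text

Use: Developing 3D spatial understanding techniques
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, NLP, LLM, Image, Text, Multimodal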

MIRROR: Learning from the Other View for Multi-Modal Reasoning

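Proposes MIRROR (Learning from the Other View), a new approach to multimodal understanding. By providing equivalent views from text, figures, and combinations of the two, MIRROR

Use: Developing multimodal understanding techniques
Difficulty: Hard
Cost: High

Sensor / time series, Quality prediction / anomaly detection, Deep learning, Compression / quantization, Text, Audio, Multimodal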

X$^3$-OPD: Distilling Reasoning into Large Audio-Language Models via On-Policy Alignment

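Proposes X$^3$-OPD (Distilling Reasoning into Large Audio-Language Models via On-Policy Alignment), a new approach to reasoning with large language models.

Use: Developing reasoning techniques based on large language models
Difficulty: Hard
Cost: High

Sensor / time series, NLP, LLM, Classification, Detection, Embedding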

Toward Generalizable Cognitive Impairment Detection with Speech-Based Multimodal Large Language Models

Cognitive impairment (CI) is a growing public health concern. Early and accurate diagnosis is critical for ena

Use: Classification
Difficulty: Hard
Cost: High

M$^3$-Gen: Interpretable Multimodal Generation of Gene Expression Profiles Using Clinical and Imaging Data

Integrating heterogeneous biomedical data, including clinical metadata, histopathology images, and molecular p

Explainable, NLP, RAG, Generation, Image, Multimodal

Use: Generation
Difficulty: Hard
Cost: High

Multi-Task Learning for Heterogeneous Prediction from Video Game State with Transfer Learning

Multi-task learning (MTL) is a promising approach for prediction tasks derived from video game state data, as

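NLP, Fine-tuning, Image, Text, Video

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, Segmentation, Generation, Multimodal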

Best-of-Evidence: Best-of-N Selection under Partial Verification

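Applies Best-of-N (BoN) selection of model outputs to vision-language tasks in which outputs can be only partially verified. The method makes output selection more efficient.

Use: Making vision-language tasks with partial verification more efficient
Difficulty: Hard
Cost: High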

Three-Pronged Spectral Control for Federated Parameter Efficient Fine Tuning

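Proposes TRISHUL (Three-Pronged Spectral Control for Federated Parameter Efficient Fine Tuning), a tool that supports parameter-efficient fine-tuning in federated learning (FL).

Deep learning, Compression / quantization, Multimodal

Use: Supporting parameter-efficient fine-tuning in FL
Difficulty: Hard
Cost: High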

OpenForgeRL: Train Harness-native Agents in Any Environment

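OpenForgeRL provides a framework for training harness-native agents. It lets agents use complex harnesses to work with external systems and solve multiple tasks at once.

Deep learning, Compression / quantization, Multimodal

Use: Training harness-native agents
Difficulty: Hard
Cost: High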

GS-Agent: Creating 4D Physical Worlds With Generative Simulation

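GS-Agent generates physically plausible 4D worlds from natural language. To preserve physical correctness, the method applies physical reasoning during generation.

MI-oriented, NLP, RAG, Generation, Image, Text

Use: Generating 4D physical worlds
Difficulty: Hard
Cost: High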

When Are Reasoning-Based Guardrails Not Efficient? ResponseGuard: A Fast Vision-Language Guard for Real-Time Moderation

A vision-language AI assistant returns its answer as a stream of generated tokens. Therefore, a safety guard t

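Deep learning, Compression / quantization, Detection, Image, Text

Use: Detection
Difficulty: Hard
Cost: High

Explainable, Deep learning, Transformer, Detection, Embedding, Text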

Multimodal Pretraining for Generalizable EEG Representation Learning

Electroencephalography (EEG) models used for epilepsy are often limited to specific datasets and tasks. This l

Use: Detection
Difficulty: Hard
Cost: High

DINOde: Continuous Vision-Text Alignment for Open-Vocabulary Semantic Segmentation

Open-vocabulary semantic segmentation (OVSS) leverages textual semantics to segment objects beyond predefined

NLP, RAG, Segmentation, Image, Text

Use: Segmentation
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, NLP, LLM, QA, Image, Text

Unlearning Under Imbalance: Benchmarking Fairness in Multimodal LLM Unlearning

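LLM unlearning approaches use simulated human identities to remove personal data, or remove imbalanced data, but these approaches have limitations.

Use: Removing personal data from models
Difficulty: Hard
Cost: High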

CRAG-MM-Diagnostics: Enabling Stage-Wise Analysis of Knowledge-Intensive VQA

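Proposes new evaluation criteria for analyzing knowledge-intensive visual question answering (KI-VQA). The criteria allow each task a VLM performs to be evaluated individually.

NLP, LLM, Classification, QA, Image

Use: Analyzing knowledge-intensive VQA systems
Difficulty: Hard
Cost: High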

EmoAgent-R1: Towards Multimodal Emotion Understanding with Reinforcement Learning-based Dynamic Agent Specialization

Multimodal large language models (MLLMs) have achieved impressive performance in multimodal emotion recognitio

NLP, LLM, Classification, Text, Video

Use: Classification
Difficulty: Hard
Cost: High

NLP, Prompt engineering, Classification, Image, Text

Sparse Concept Channels in Frozen 3D CT Vision Encoders

Large vision-language models are becoming increasingly dominant in 3D medical image interpretation, but we rar

Use: Classification
Difficulty: Hard
Cost: High

Explainable, Deep learning, Transformer, Embedding, Image, Video

HyWorldVLA: A Vision-Language-Action Model with Hybrid World Modeling for Autonomous Driving

Vision-Language-Action (VLA) models augmented with world modeling represent a promising paradigm for end-to-en

Use: Embedding
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Transformer, Image, Text, Video

Beyond Independent Optimization: Compression, MoE Routing, and Quantization Interactions in Multimodal Edge Intelligence

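Efficient multimodal inference is limited not only by model quality and FLOP count but also by the cost of moving, caching, transforming, and storing quantized representations, and by memory and energy constraints. In this paper, recent visual token compression

Use: Making multimodal edge AI more efficient
Difficulty: Hard
Cost: High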

OPOD: On-Policy Omni Distillation

Omni-modal models can handle text, images, and audio in one system, but improving all of these abilities toget

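Deep learning, Compression / quantization, Image, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Explainable, MI-oriented, Quality prediction / anomaly detection, Deep learning, Transformer, Classification, Generation, Image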

Enhancing Explainable Cardiac Diagnosis with Guide-Grounded Multimodal LLMs

The electrocardiogram (ECG) is a cornerstone of cardiac assessment, yet clinical deployment of deep learning

Use: Classification
Difficulty: Hard
Cost: High

MedGame: Storytelling Gamification Empowered by Large Language Models for Medical Education

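Large Language Models (LLMs) hold great promise for medical education, but current systems only answer questions or give one-off feedback. Meanwhile, turning clinical cases into decision-centered learning

NLP, LLM, Generation, QA, Text

Use: Applying large language models to medical education
Difficulty: Hard
Cost: High

Sensor / time series, NLP, LLM, Generation, Text, Audio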

An Evaluation Framework for Structured Audio Captions Validated by Controlled Perturbations

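This paper proposes an evaluation method for audio captions that aims to overcome the limitations of existing methods. The proposed framework evaluates each aspect of an audio caption and, rather than relying on question-answering-style evaluation, assesses the neutrality of the caption

Use: Building an evaluation framework for audio captions
Difficulty: Hard
Cost: High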

MemTools: A Unified Research Framework for Interoperable Agent Memory

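This paper builds MemTools, a framework for memory systems that aims to make them easier to develop. It makes each component of a memory system easier for developers to build and test

NLP, RAG, Multimodal

Use: Building a framework that supports agent memory
Difficulty: Hard
Cost: High

Deep learning, Compression / quantization, Segmentation, Multimodal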

UnDA: Unpaired Domain Alignment for Cross-Modal Knowledge Transfer in Medical Imaging

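Proposes methods and tools for integrating multimodal data, facilitating knowledge transfer across modalities in medical imaging.

Use: Integrating multimodal data
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Normalization / optimization methods, Classification, Image, Text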

Quality-Aware Multimodal Fusion Reveals Implicit Identity in Valence-Arousal Features

Conventional face recognition relies on static appearance cues and degrades in unconstrained settings with exp

Use: Classification
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Generative AI, GAN, Generation, Image, Multimodal

Physics-Informed Deep Learning Model for Cross-Modality Super-Resolution in Fluorescence Microscopy

Cross-modality image translation offers a route to super-resolution fluorescence microscopy from low-resolutio

Use: Generation
Difficulty: Hard
Cost: High

Decoupling Cross-Modality Manifold Discrepancy: Leveraging Visible Diffusion Priors for Infrared Super-Resolution

Infrared image super-resolution (IISR) mitigates the limitations imposed by low spatial resolution. Existing m

NLP, RAG, Generation, Image, Multimodal

Use: Generation
Difficulty: Hard
Cost: High

HalluScope: Fine-grained Hallucination Diagnosis for Multimodal Large Language Models

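Multimodal large language models perform well at turning images into text, but diagnosing the hallucinations they produce remains an open problem. To address the shortcomings of mainstream coarse-grained detection methods, this work proposes a fine-grained diagnosis method.

Explainable, NLP, LLM, Classification, Detection, Generation

Use: Hallucination diagnosis
Difficulty: Hard
Cost: High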

Geo3R: Mitigating Spatial Reasoning Hallucination in Multimodal Large Language Models

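When multimodal large language models reason about the 3D spatial relations of objects, a lack of visual information leads to hallucinations. This work proposes an approach to mitigate them.

NLP, LLM, Image, Text, 3D

Use: Diagnosing hallucination in 3D spatial reasoning
Difficulty: Hard
Cost: High

Deep learning, Transformer, Text, Multimodal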

C-PTQ: Fisher-weighted Channel-wise Sensitivity for Post-training Quantization of MLLMs

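Compressing large language models can degrade their performance, so safeguards during quantization are important. This work proposes C-PTQ, which uses Fisher-weighted channel-wise sensitivity to stabilize the quantization of MLLMs.

Use: Compressing large language models
Difficulty: Hard
Cost: High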

Do Pathology Vision-Language Models Truly See Pathology?

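Pathology is now widely used to evaluate vision-language models on pathology recognition; this work questions whether those models' visual perception is actually at work in that recognition.

Use: Pathology recognition
Difficulty: Hard
Cost: High

Deep learning, Transformer, Image, Text, Multimodal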

MVEI & EmObserver: Empowering MLLM-Oriented Visual Emotional Intelligence via Emotion Statement Judgement

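Emotion recognition is essential to advancing modern AGI, but large-scale

Use: Emotion recognition
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Math / theory, Probability / statistics, Text, Multimodal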

Achieving Text-based Person Retrieval with Any Granularity

Text-based person retrieval faces a critical but under-explored challenge: the inherent uncertainty of query g

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Explainable, Sensor / time series, Computer vision, Multimodal, Image, Text

GeoThreat: Transferable Targeted Adversarial Attacks on Large Vision-Language Models for Remote Sensing Image Interpretation

Adversarial attacks against large vision-language models (LVLMs) serve as an effective means of assessing thei

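Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Transformer, Detection, Generation, Image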

GroupVideo: Multi-Identity Customized Text-to-Video Generation

Current identity customized video generation methodologies are predominantly limited to single-identity scenar

Use: Detection
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Compression / quantization, Text, Video, Multimodal

ProCap: Prominence-guided Object Rectification for Faithful and Comprehensive Video Captioning

Improving video captioning quality typically demands retraining large vision-language models, an expensive and

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Distribution-Alignment Bridge for Uncertainty-Aware Text-to-Video Retrieval

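This paper proposes the Distribution-Alignment Bridge (DAB) for matching text and video. DAB represents text and video entities as probability distributions and resolves the distributional gap between them.

NLP, Embedding / retrieval, Generation, Text, Video

Use: Text-to-video retrieval
Difficulty: Hard
Cost: High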

MagicMakeup: A Region-Controllable Diffusion Transformer for High-Fidelity Makeup-Transfer

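To improve makeup transfer, this work proposes MagicMakeup, a region-controllable diffusion transformer that accounts for the strongly region-specific nature of makeup.

Deep learning, Transformer, Generation, Image, Text

Use: Makeup transfer
Difficulty: Hard
Cost: High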

DINO-VPT: Hierarchical Visual Prompt Tuning for Joint Physical-Digital Face Anti-Spoofing

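This paper proposes DINO-VPT, which uses Hierarchical Visual Prompt Tuning (HVPT) to detect both physical and digital spoofing.

Deep learning, Compression / quantization, Image, Text, Multimodal

Use: Face anti-spoofing
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, NLP, LLM, Image, Text, Video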

ViSTR-Bench: Can MLLMs Reason from Continuous Visual Cues in Dynamic Scenes?

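This paper proposes ViSTR-Bench, a benchmark that evaluates whether MLLMs can extract information from dynamic scenes.

Use: Evaluating MLLM reasoning over dynamic scenes
Difficulty: Hard
Cost: High

Sensor / time series, Quality prediction / anomaly detection, Computer vision, Multimodal, Image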

AXIS: A Growable Community-Driven Data Engine for Scalable Robot Manipulation

Learning effective robot manipulation policies requires diverse, high-quality demonstrations, yet existing dat

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Easy to try on CPU, Deep learning, Compression / quantization, Detection, 3D, Multimodal

Factorized Spatio-Temporal Convolutions for Human Pose Estimation from Planar Lidar

This paper proposes a set of networks for human pose estimation and robot motion control, aimed at safe human-robot interaction.

Use: Safe human-robot interaction
Difficulty: Hard
Cost: High

RL-MACRO: A Cybernetic Closed-Loop Intelligence Framework for Multimodal Adaptive Robotic Craniotomy

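To automate craniotomy, this work proposes a cybernetic closed-loop framework composed of multiple modules. Through tool-tissue interaction, the framework

Sensor / time series, Deep learning, CNN, Audio, Multimodal

Use: Automating craniotomy surgery
Difficulty: Hard
Cost: High

Sensor / time series, Deep learning, Transformer, Detection, Image, Audio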

Human-Inspired Framework for Robotic Craniotomy: Integrating Multimodal Fusion and Adaptive Trajectory Adjustment

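Proposes a human-inspired framework for craniotomy. It combines advance planning with subsequent execution and automatically adjusts the operating table position during surgery, achieving a procedure as safe and efficient as a human's.

Use: Automating craniotomy surgery
Difficulty: Hard
Cost: High

NLP, Fine-tuning, Text, 3D, Multimodal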

ZONDA: Zero-shot Object Navigation with Dynamic Avoidance in Multi-floor Environments

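Proposes a zero-shot object navigation framework that handles dynamic obstacle avoidance and multi-floor environments. While accounting for moving people and multiple floors, the framework

Use: Object-goal navigation in multi-floor environments
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Text, Multimodal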

URF: A Unified Robot Control-Policy Framework for Stable Contact Aware Manipulation

Learning-based manipulation policies usually predict robot actions from sensory observations and leave their e

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

huggingface, Hugging Face available, 2026-07-23

K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

Large language models are increasingly used in K-12 education, but existing benchmarks mainly test exam questi

NLP, LLM, QA, Image, Text

Use: QA
Difficulty: Easy
Cost: High

github, GitHub available, 2026-07-23

xtuner — A Next-Generation Training Engine Built for Ultra-Large MoE Models

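xtuner is a training engine for fast training of ultra-large MoE models.

NLP, LLM, Generation, Multimodal

Use: Fast training of MoE models
Difficulty: Easy
Cost: High

Quality prediction / anomaly detection, Computer vision, Multimodal, Classification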

Adaptive Confidence-weighted Expansion for Trustworthy Multi-Omics Multimodal Fusion

Multimodal learning is a robust approach to improve predictive performance in applications such as medical pro

Use: Classification
Difficulty: Hard
Cost: High

Antigen-specific Antibody Multi-modal Foundation Model for Functional Antibody Design

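This work proposes an antigen-specific antibody multimodal foundation model (AAMFM) for designing antigen-specific antibodies, taking into account the need for epitope-level pairing between antigens and antibodies.

NLP, RAG, Classification, Generation, Text

Use: Antigen-specific antibody design
Difficulty: Hard
Cost: High

MI-oriented, NLP, Fine-tuning, Generation, Text, Multimodal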

Hypothesis-and-Refinement Learning of Organic Structures from Multimodal Spectroscopic Data

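Proposes a method for automated structure elucidation from spectroscopic data. It determines the molecular structure by repeatedly proposing hypotheses and refining them against the spectroscopic data, and the vast space of possible molecular structures

Use: Molecular structure elucidation
Difficulty: Hard
Cost: High

Sensor / time series, Deep learning, Compression / quantization, Image, Text, Multimodal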

Robostral Navigate

Deploying navigation systems at scale requires a recipe that minimizes sensor assumptions, generalizes across

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Closing the Lab-to-Store Gap: A Data-Efficient Post-Training and Experience-Driven Learning VLA Framework for Retail Humanoids

Closing the gap between benchmark performance and reliable real-world operation remains a central challenge fo

Deep learning, Compression / quantization, Anomaly detection, Image, Text

Use: Anomaly detection
Difficulty: Hard
Cost: High

ENTRAP-VL: A Taxonomic Probe for Dual Contextual Entrainment in Vision-Language Models

Contextual entrainment is the tendency of a model to let auxiliary context in its input pull its output, indep

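Computer vision, Multimodal, Image, Text

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Compression / quantization, Generation, Image, Text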

Learning to Detect UI Principle Violations via Reinforcement Learning

Small language models and coding agents increasingly generate web front-end code, yet their outputs are typica

Use: Generation
Difficulty: Hard
Cost: High

Deep learning, Transformer, Image, Text, Multimodal

Test-Time Training for Modality Order Consistency in Vision-Language Models

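The performance of vision-language models was found to change substantially depending on the order in which the image and the question are presented.

Use: Making model outputs consistent across modality order
Difficulty: Hard
Cost: High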

VizRAG: Enhancing Retrieval-Augmented Generation with Hypergraph Visualization

Hypergraph-based RAG systems surpass traditional graph-based approaches by organizing complex n-ary atomic fac

Use: Generation
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, Multimodal, Classification, Generation, Image

Ocular Verification for Virtual Reality

Virtual reality (VR) headsets (e.g., Meta Quest, Apple Vision Pro) provide a seamless user experience due to t

Use: Classification
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, Segmentation, Generation, Image, 3D

Axolotl3D: a Unified Framework for Faithful 3D Shape Completion

Recent 3D generative models produce high-quality geometry from a single image using large-scale priors and dif

Use: Generation
Difficulty: Hard
Cost: High

Look Less, Think Faster: Joint Token-Compute Adaptation for Multimodal LLMs

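Multimodal large language models perform well on vision-language tasks but suffer from high inference cost. By optimizing each dimension individually, the Look Less, Think Faster algorithm

Deep learning, Compression / quantization, Image, Text, Multimodal

Use: Reducing the cost of multimodal large language models on vision-language tasks
Difficulty: Hard
Cost: High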

Diverse-Intent Multi-Turn Fashion Image Retrieval

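Multi-turn fashion image retrieval is an important task in real-world fashion search. The Diverse-Intent Multi-Turn Fashion Image Retrieval algorithm handles different search intents

Use: Multi-turn fashion image retrieval
Difficulty: Hard
Cost: High

Sensor / time series, Deep learning, Compression / quantization, QA, Image, Text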

Multimodal Large Language Models for Remote Sensing Image Understanding: Domain-Specific or General-Purpose?

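Multimodal large language models for image understanding are powerful, but their capabilities and limits are still not clearly understood. This paper analyzes the capabilities and limits of multimodal large language models in image understanding

Use: Capabilities and limits of multimodal large language models in image understanding
Difficulty: Hard
Cost: High

Sensor / time series, Deep learning, Compression / quantization, Detection, Segmentation, Embedding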

Not All Patches are Equal: Sampling Matters for Visible-Infrared Pre-Training

Visible-infrared (VIS-IR) alignment is a key pre-training task for robust multi-sensor perception. Most existi

Use: Detection
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Transformer, Generation, Image, Text

SHFormer: Dynamic Spectral Filtering Convolutional Neural Network and High-pass Kernel Generation Transformer for Adaptive MRI Reconstruction

Attention Mechanism (AM) selectively focuses on essential information for imaging tasks and captures relations

Use: Generation
Difficulty: Hard
Cost: High

Development of an automated, reliable, and clinically meaningful artificial intelligence (AI) tool for diagnosing cardiac disease from conventional cardiovascular magnetic resonance (CMR) images

Aims: Cardiovascular magnetic resonance (CMR) imaging enables non-invasive assessment of myocardial structure,

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

STEREOFLOW: Progressive Stereo Matching with StereoDiT and Transition Flow Matching

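Stereo matching is a key task in 3D reconstruction. This work combines stereo matching with a probabilistic generative task and, with the aim of improving object detection, proposes a method that integrates a stereo matching framework with a latent distribution

Deep learning, Transformer, Generation, Regression, Image

Use: Improving object detection
Difficulty: Hard
Cost: High

Sensor / time series, Quality prediction / anomaly detection, Deep learning, RNN / LSTM, Forecasting, Image, Multimodal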

Forecasting the Number of Harvest-ready Fruits of Sweet Peppers Using Multimodal Time-Series Data

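This work proposes a deep learning framework that integrates multimodal time-series data to forecast the number of harvest-ready sweet pepper fruits.

Use: Forecasting harvest-ready fruit in agriculture
Difficulty: Hard
Cost: High

MI-oriented, Quality prediction / anomaly detection, NLP, LLM, Generation, Image, Text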

ETPDesigner: Multi-Agent Orchestration for Interactive Multimodal Electronic Theater Program

ETPDesigner is a framework that automates the design of multimodal electronic theater programs.

Use: Generation
Difficulty: Hard
Cost: High

MV-Bench: Benchmarking Multimodal Large Language Models for Coordinated Multi-View Interface Construction

Multimodal large language models (MLLMs) are increasingly expected to automate visualization development by ge

Use: Generation
Difficulty: Hard
Cost: High

LAVIFT: Latent-Action-Guided Vision Fine-Tuning for Surgical Interaction Recognition

Understanding instrument-tissue interactions is essential for context-aware surgical AI and autonomous robotic

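NLP, Fine-tuning, Classification, Detection, Image

Use: Classification
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Attention mechanism, Classification, Generation, Image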

MTVDiff: Multimodal Conditional Latent Diffusion for Enhanced Thermal-to-Visible Face Translation

Thermal-to-visible face translation presents fundamental challenges including geometric discontinuities, seman

Use: Classification
Difficulty: Hard
Cost: High

NLP, Fine-tuning, Image, Video, Multimodal

EA-Nav: Learning Safe Visual Navigation Policies with Embodiment Awareness

Cross-embodiment navigation is a key challenge in embodied intelligence. Due to differences in embodiment, the

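Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Compression / quantization, Detection, Segmentation, Image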

Current Injection Spiking Neural Network for Infrared and Visible Image Fusion

Infrared and visible image fusion (IVIF) integrates the complementary information of two modalities into a sin

Use: Detection
Difficulty: Hard
Cost: High

NLP, LLM, Segmentation, Image, Text

Memory-Augmented Multimodal Large Language Models for Small Object Understanding in Streaming Aerial Videos

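This work develops a memory-augmented large language model for recognizing small objects from drones. The model can identify objects according to user instructions in complex drone scenes.

Use: Object recognition from drones
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, Multimodal, QA, Image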

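Silent Failures in Multimodal Agentic Search: A Diagnostic Taxonomy and Cross-Judge Evaluation

This work proposes a new method for evaluating responses to visual questions. It assesses not only whether an answer is correct but also the patterns and characteristics of the answers.

Use: Evaluating responses to visual questions
Difficulty: Hard
Cost: High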

Trace: A Taxonomy-Guided Environment for Multidomain Visual Reasoning

Trace is a taxonomy-guided environment for multidomain visual reasoning.

NLP, RAG, Image, Text, Multimodal

Use: Multidomain visual reasoning
Difficulty: Hard
Cost: High

Sensor / time series, Quality prediction / anomaly detection, Computer vision, Object detection, Detection, Multimodal

DRGBT-1K: A Large-scale High-quality Benchmark for Dynamic RGBT Tracking

DRGBT-1K is a large-scale, high-quality benchmark for dynamic RGBT tracking.

Use: Benchmarking dynamic RGBT tracking
Difficulty: Hard
Cost: High

Explainable, Deep learning, Transformer, Detection, Multimodal

An Exploratory Analysis of Pain Localization via Explainable Computational Modeling

Automatic pain localization, which involves identifying the anatomical origin of pain from peripheral physiolo

Use: Detection
Difficulty: Hard
Cost: High

SeededGrasp: Language-Guided Grasping in Complex Scenes with Multiple Embodiments

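Language-guided grasping uses a vision-language model (VLM) to grasp objects in complex scenes. In this approach, rather than predicting grasps directly, the VLM indicates where in 3D space to grasp

Deep learning, Compression / quantization, Generation, Text, 3D

Use: Grasping objects in complex scenes
Difficulty: Hard
Cost: High

NLP, Prompt engineering, Detection, Image, Text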

ReferTrack: Referring Then Tracking for Embodied Visual Tracking

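ReferTrack is a system that uses natural-language instructions to make a car follow a target vehicle nearby. It first recognizes the vehicle and then predicts its motion.

Use: Having a car follow a target vehicle
Difficulty: Hard
Cost: High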

SOPD-SocialNav: Selective On-Policy Distillation for Vision-Language Social Navigation

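SOPD-SocialNav is a distillation technique that transfers a learned model to small robots so that they can understand their environment and human behavior and navigate.

Deep learning, Compression / quantization, Text, Multimodal

Use: Transferring learned models to small robots for social navigation
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, Video recognition, Detection, Anomaly detection, Multimodal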

Clinical Pathways as Safety Specifications for Physical AI in Hospital Wards

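Uses clinical pathways as safety specifications so that robots can operate safely in real environments: robots work safely in hospital wards while protecting medical staff and patients.

Use: Ensuring the safety of robots used in medical facilities
Difficulty: Hard
Cost: High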

LENS: LLM-guided Environment Simplification for Planning and Control in Clutter

Despite recent advances in general-purpose robotic manipulation, real-world multi-object clutter remains chall

Deep learning, Compression / quantization, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv, GitHub available, 2026-07-21

Deep Shape Regression for Planar Curves with Multimodal Covariates

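Proposes a deep learning model for regressing the shape of open planar curves.

Deep learning, CNN, Regression, Image, Multimodal

Use: Shape regression with multimodal covariates
Difficulty: Hard
Cost: High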

Two-Level Meta-Rubrics for Evaluating Open-Ended Generation: GAMUT, a Benchmark for Factual Completeness

Evaluating the factuality of long-form generations has focused predominantly on precision, measuring whether t

Use: Generation
Difficulty: Hard
Cost: High

MeetingToM: Evaluating Multimodal LLMs on Theory-of-Mind Reasoning in Multi-Party Meetings

Theory of Mind (ToM), the ability to infer other's beliefs, intentions, and states of knowledge, is central to

NLP, LLM, QA, Text, Audio

Use: QA
Difficulty: Hard
Cost: High

Computational Humor with Multimodal LLMs: Methods, Datasets, Evaluation, and Challenges

Multimodal humor in memes, cartoons, and comics remains difficult for AI systems because intended meaning depe

NLP, LLM, Classification, Generation, Image

Use: Classification
Difficulty: Hard
Cost: High

HPD-Parsing: Hierarchical Parallel Document Parsing

Efficient teamwork typically combines global coordination with parallel execution, a principle not yet fully r

Deep learning, Compression / quantization, Generation, Text, Multimodal

Use: Generation
Difficulty: Hard
Cost: High

Fusion Embedding: A Unified Embedding Space for Text, Image, Video, and Audio

A single embedding space that covers text, images, video, and audio lets one index serve every query a user ca

Use: Generation
Difficulty: Hard
Cost: High

Stochastic Meta-Unlearning: Bridging Language Backbone and Multimodal Unlearning

Machine unlearning for vision-language models (VLMs) remains underexplored. Unlike language models, VLMs combi

NLP, RAG, Image, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Deep learning, Transformer, Classification, Generation, Segmentation

Pathologist Attention-Aligned Report Generation for Prostate Histopathology

The allocation of visual attention by pathologists during cancer diagnosis is a highly selective process that

Use: Classification
Difficulty: Hard
Cost: High

MI-oriented, Computer vision, Segmentation, QA, Image, Text

ChronoStitch: Training-Free Composition of Visual KV Memories for Long-Horizon Temporal Reasoning

Long-video question answering requires a model to preserve visual evidence over time without repeatedly reproc

Use: QA
Difficulty: Hard
Cost: High

Sensor / time series, NLP, LLM, Image, Text, Video

D3VL: Understanding Driving Scenes from 3D Time Series Data and Video with Language Models

Recent advances in Multimodal Large Language Models (MLLMs) have triggered the development of end-to-end MLLMs

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Transformer, Text, Video, Multimodal

BLUE: Semantics-Preserving Video Compression for Efficient Vision-Language Surveillance Analytics

Continuous surveillance video creates a growing storage, transmission, and inference burden for enterprise vid

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv, GitHub available, 2026-07-21

Detect Early, Escalate Rarely: Anytime Detection of AI-Generated Video from the Compressed Bitstream

Detectors for AI-generated video are evaluated offline. A clip is decoded to pixels and scored once, increasin

Easy to try on CPU, Deep learning, CNN, Detection, Image, Text

Use: Detection
Difficulty: Hard
Cost: High

MI-oriented, Deep learning, Transformer, Generation, Image, Text

Appearance Pointers -- Multimodal Region Control of Diffusion Transformers

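In image generation, controlling materials, objects, and regions is difficult. Diffusion Transformers can process text and images together, but they have lacked a mechanism for deciding how much influence each should have.

Use: Multimodal image control
Difficulty: Hard
Cost: High

MI-oriented, NLP, Fine-tuning, Generation, Image, Text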

ExpertVerse: A General-Purpose Benchmark for Expert-Level Reasoning in Knowledge-Intensive Visual Synthesis

Recent advances in multimodal generative models have enabled instruction-based image generation to move beyond

Use: Generation
Difficulty: Hard
Cost: High

MI-oriented, Quality prediction / anomaly detection, NLP, LLM, Image, Audio, Video

arxiv, GitHub available, 2026-07-21

OmniReasoner: Thinking with Long Audio-Video via Native Tool Use

Proposes OmniReasoner, a method that combines the original data with a zoom-in tool, improving the reasoning of omni-modal LLMs over long audio-video inputs.

Use: Improving reasoning over long audio-video
Difficulty: Hard
Cost: High

InstructMixup: Instruction-Guided Salient Patch Editing for Robust Data Augmentation

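Proposes InstructMixup, which extends Mixup, a technique that blends image or video samples, so that the blending follows textual instructions. Data can thus be augmented while its content and labels are preserved.

Deep learning, Transformer, Classification, Detection, Generation

Use: Extending Mixup for data augmentation
Difficulty: Hard
Cost: High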

No Training, Better Flights: Test-Time Scaled VLMs for UAV Navigation

UAV route planning requires generating safe routes with vision-language models. To address this, the work proposes a method that scales the model at test time.

Computer vision, Multimodal, Text

Use: Improving UAV route planning
Difficulty: Hard
Cost: High

PathAgentBench: Benchmarking Evidence-Seeking Vision-Language Models on Whole-Slide Pathology Image

Whole-slide image (WSI) diagnosis requires identifying diagnostically relevant regions, examining them across

NLP, Fine-tuning, Detection, Generation, Image

Use: Detection
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Transformer, Segmentation, Image, Text

IGGT4D: Streaming 4D Instance-Grounded Geometry Transformer

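Real-world spatial intelligence requires understanding continuously streaming video of a space. To address this, the work proposes a model capable of 4D understanding.

Use: Understanding streaming video of a space
Difficulty: Hard
Cost: High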

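Eversion-based robots can enable safe access, steering and endoscopic imaging within the spinal subarachnoid space

This work proposes a medical robot that enables safe access, steering, and endoscopic imaging within the spinal subarachnoid space.

Computer vision, Multimodal, Image

Use: Medical robots for the subarachnoid space
Difficulty: Hard
Cost: High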

Cognitive Dual-Process Planning for Autonomous Driving with Structured Scene Knowledge and Verifiable Reasoning-Action Consistency

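Planning for autonomous driving involves scene understanding, timely reasoning, and action selection, but combining these elements is difficult. To address this, the work decouples scene understanding, making planning safe and effective

Deep learning, Compression / quantization, Image, Text, Multimodal

Use: Decoupled planning for autonomous driving
Difficulty: Hard
Cost: High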

Agentic Real2Sim: Physics-based World Modeling with Vision-Language Agents

Real-to-sim conversion for robotic interaction with objects remains labor-intensive because it requires more t

Computer vision, Multimodal, Image, Text

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

NLP, Prompt engineering, Image, Text, Video

WorldScape Policy 2.0: Empowering Steerable World Action Modeling with Reasoning-Augmented Memory

World Action Models (WAMs) are a paradigm for modeling robot manipulation: they jointly model visual state transitions and robot actions. However, existing WAMs

Use: Solving multi-purpose manipulation problems
Difficulty: Hard
Cost: High

huggingface, Hugging Face available, 2026-07-21

Mage-Flow: An Efficient Native-Resolution Foundation Model for Image Generation and Editing

Large-scale visual generators are increasingly capable but costly to train, fine-tune, and deploy. We introduc

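Quality prediction / anomaly detection, Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

Explainable, Quality prediction / anomaly detection, NLP, RAG, Generation, Image, Text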

PathReportEval: A Systematic Benchmark for Pathology Report Generation

Pathology report generation from whole-slide images (WSIs) is a rapidly growing multimodal learning problem, y

Use: Generation
Difficulty: Hard
Cost: High

Explainable, NLP, RAG, Generation, Text, Multimodal

STeP: Signal Temporal Logic for Precise Specifications for Action Generation with Vision Language Models

Vision-language-action (VLA) models have shown impressive generalization, but often lack interpretability and

Use: Generation
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, Multimodal, Image

MAGE: Human-Like Macro Placement via Agentic Multimodal Reasoning

Macro placement still requires substantial manual refinement in industrial physical design flows. We present M

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Deep learning, Transformer, Embedding, Image, Text

Patch Policy: Efficient Embodied Control via Dense Visual Representations

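To make robot control more efficient, proposes patch-based policy learning built on dense visual representations.

Use: Controlling resource-constrained robots
Difficulty: Hard
Cost: High

Sensor / time series, Deep learning, Compression / quantization, Image, Text, Multimodal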

FM-VLA: Force-based Memory for Vision-Language-Action Models in Contact-Rich Manipulation

Proposes FM-VLA, a force-based memory method that addresses the limitations of existing VLA models.

Use: Resolving the state of manipulated objects
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Generation, 3D, Multimodal

Closing the Loop in Humanoid VLA: Persistent 3D Object Tokens for Verifiable Loco-Manipulation

Proposes a persistent object token method that addresses the limitations of existing VLA methods, making robot control more practical.

Use: Humanoid robot control
Difficulty: Hard
Cost: High

A2RL V\textsubscript{max}: The A2RL autonomous racing dataset for long-range, high-speed perception and multi-vehicle interaction

In autonomous driving development, a perception dataset is crucial, as it provides fundamental data for traini

Computer vision, 3D/point cloud, Detection, Text, 3D

Use: Detection
Difficulty: Hard
Cost: High

Reasoning as a Double-Edged Sword: Architecture and Cross-Stage Robustness in Vision-Language-Action Models

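This study examines three models with the aim of adapting vision-language-action models to unperturbed targets. The three models are a direct observation-to-action mapping, textual chain-of-thought, and a latent iterative

Natural language processing, RAG, Text, Multimodal

Use: Adapting vision-language-action models to unperturbed targets
Difficulty: Hard
Cost: High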

From Sign Language Generation to Humanoid Execution: Vision-Language Guided Retargeting with Collision Mitigation

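This paper aims to achieve autonomous action generation for humanoid robots and shows that vision-language-guided commands let the robot act autonomously.

Computer vision, 3D/point cloud, Generation, Image, 3D

Use: Autonomous action generation for humanoid robots
Difficulty: Hard
Cost: High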

VLN-AVP: Zero-Shot Vision-Language Navigation with Hybrid Long-Short-Term Memory for Autonomous Valet Parking

Existing methods in Autonomous Valet Parking (AVP) typically rely on pre-built maps, which severely restricts

Natural language processing, RAG, Image, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

HCPG-Flow: Hierarchical Contact-Progress Guidance for Flow-Policy Robot Manipulation

Flow policies can represent multimodal action distributions for robot manipulation, yet a robot must execute o

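Natural language processing, Embedding/retrieval, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Sensor/time series, Natural language processing, Embedding/retrieval, Image, Multimodal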

COLIP-2: Olfaction-Vision-Language Embeddings

The Contrastive Olfaction-Language-Image Pre-training 2 (COLIP-2) model is a multimodal embeddings space that

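Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Explainable, Quality prediction/anomaly detection, Natural language processing, Large language models, Video, Multimodal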

EduPanel: A Three-Agent LLM Judge for Teaching Videos -- Reliability, Complementarity, and Human Trust Calibration

Teaching videos are becoming a major medium for education, creating a growing need for scalable evaluation of

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

ConsiSpace: Learning Geometric Consistency Matters for Video Spatial Reasoning

Video spatial reasoning is essential for navigation-oriented perception and long-video question answering, whe

Deep learning, Model compression/quantization, QA, Text, Video

Use: QA
Difficulty: Easy
Cost: High

HOMIE: Human-object Centric Video Personalization via Multimodal Intelligent Enchancement

Human-object centric video personalization (HOCVP) is a core task within subject-driven video generation. Howe

Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

FlashRT: Agent Harness for Guiding Agents to Deploy Real-Time Multimodal Applications

Real-time multimodal applications, including voice agents and interactive video generation, compose heterogene

Deep learning, Model compression/quantization, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High

ReViV: Reconstructing the Viewer and the View in 4D from Monocular Egocentric Video

Egocentric devices, such as wearable front-facing cameras, provide a unique perspective for capturing the cont

Deep learning, Transformer, Generation, Video, 3D

Use: Generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-20

BentoML — The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

Introduces a library for serving models.

Natural language processing, Large language models, Generation, Multimodal

Use: Model serving
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-19

From Perception to Assistance: Open-Vocabulary Shared Autonomy for Robotic Manipulation

Teleoperating a robotic manipulator in industrial environments demands precision that camera-based interfaces

Computer vision, Segmentation, Text, Video, Multimodal

Use: Segmentation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-07-19

Asynchronous Multimodal Diffusion Policy Composition via Latency-Aware Guidance Fusion

Diffusion policies have shown strong potential for robotic imitation learning, and recent extensions incorpora

MI-oriented, Sensor/time series, Computer vision, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-19

TimeLens2: Generalist Video Temporal Grounding with Multimodal LLMs

Video multimodal large language models (MLLMs) can describe what happens in a video, but rarely identify when

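Natural language processing, Large language models, Detection, Text, Video

Use: Detection
Difficulty: Easy
Cost: High

Explainable, Deep learning, Model compression/quantization, Generation, Text, Multimodal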

G2-Nav: Grounded and Guarded Vision-Language Costmaps for Robot Social Navigation

Social navigation requires the robot to reason and respond in complex real-world environments. While recent wo

Use: Generation
Difficulty: Hard
Cost: High

Explainable, Sensor/time series, Computer vision, Multimodal, Generation, Image

What Do They See? Interpreting Complex Road Scenarios Through the Eyes of Vision-Language-Action Models for Safe and Trustworthy Autonomous Vehicle Learning

End-to-end autonomous driving models are now able to navigate complex road scenarios, mapping raw sensor obser

Use: Generation
Difficulty: Hard
Cost: High

Token-Wise Latent Streaming from Slow Reasoners to Fast Planners for Dynamic Vision Language Navigation

Vision-Language Navigation in dynamic, human-centric environments exposes a fundamental tension: linguistic re

Computer vision, Multimodal, Generation

Use: Generation
Difficulty: Hard
Cost: High

PhyAgentOS: A Self-Evolving Operating System for Embodied Agents with Decoupled Cognitive Planning and Physical Execution

Vision-language-action models, world models, and agentic planners each advance physical intelligence, yet thei

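MI-oriented, Computer vision, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Natural language processing, Prompt engineering, Classification, Detection, Text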

Hazard or Anomaly? Evaluating VLMs for Understanding Dangers and Discrepancies

Modern safety-critical systems increasingly rely on human-robot interaction to reduce disaster risk and suppor

Use: Classification
Difficulty: Hard
Cost: High

Computer vision, Multimodal, Detection, Image, Text

Autonomous VR-Based Risk Detection for Situational Awareness in Dangerous Settings

In high-risk environments such as disaster response, situational awareness depends not only on detecting hazar

Use: Detection
Difficulty: Hard
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-07-18

Dataset Distillation by Influence Matching

We revisit dataset distillation from an outcome-centric perspective. Rather than aligning process surrogates (

Deep learning, Model compression/quantization, Classification, Image, Text

Use: Classification
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-18

Can Multimodal Large Language Models Understand OCT?

Optical coherence tomography (OCT) imaging is essential for the diagnosis and treatment of retinal diseases. A

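Quality prediction/anomaly detection, Natural language processing, Large language models, Classification, QA, Image

Use: Classification
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-18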

maths-cs-ai-compendium — Become a cracked AI/ML Research Engineer

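Presents methods for building the skills and knowledge of an AI/ML research engineer.

Computer vision, Multimodal, Text, Audio

Use: Training AI/ML researchers
Difficulty: Easy
Cost: High

Quality prediction/anomaly detection, Computer vision, Multimodal, Image, Reinforcement learning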

Foresight Residual RL for Long-Horizon Robot Manipulation with Vision-Language-Action Models

Vision-Language-Action (VLA) policies offer strong general-purpose manipulation priors, but often fail on tigh

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Vision-Language-Motion Maps: An Open-Vocabulary, Uncertainty-Aware, Queryable Motion Attribute for 3D Scene Maps

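This work attaches motion attributes to a visualized map so that dynamic scenarios can be analyzed, and supports filtering by motion attribute through language queries.

Natural language processing, Large language models, 3D, Multimodal

Use: Analyzing dynamic scenarios on a visualized map
Difficulty: Hard
Cost: High

Sensor/time series, Computer vision, Segmentation, Multimodal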

BayesContact: Uncertain Pose Estimation via Visuo-Tactile Proposals and Simulation-based Inference

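This work proposes BayesContact, a method that combines vision- and tactile-based proposals with simulation-based inference to estimate an object's position and orientation.

Use: Estimating robot motion by fusing visual and tactile information
Difficulty: Hard
Cost: High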

PIXIE: A Zero-Shot texture-invariant 6D pose estimation framework for unseen objects with assembly defects

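The PIXIE framework performs 6D object pose estimation, enabling robot-hand control and object manipulation.

Deep learning, Transformer, Image, Text, 3D

Use: 6D object pose estimation
Difficulty: Hard
Cost: High

Sensor/time series, Deep learning, Transformer, Segmentation, Image, Multimodal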

PRISM: Multimodal Terrain Mapping for Rover Navigation in Unstructured Environments

Robotic navigation in unstructured environments requires robust situational awareness to safely traverse hazar

Use: Segmentation
Difficulty: Hard
Cost: High

An Exam for Active Observers

Human vision is a closed loop: gaze is continuously redirected by intermediate hypotheses rather than a single

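Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

Easy to try on CPU, Deep learning, Model compression/quantization, Multimodal, Reinforcement learning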

JoyNexus: Service-Oriented Multi-Tenant Post-Training for VLA Models

The post-training of Vision-Language-Action (VLA) models is essential due to the diversity of simulators, robo

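Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

MI-oriented, Natural language processing, Large language models, Generation, Image, Text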

S1-Omni: A Unified Multimodal Reasoning Model for Scientific Understanding, Prediction, and Generation

We present S1-Omni, a unified multimodal reasoning model for scientific understanding, prediction, and generat

Use: Generation
Difficulty: Easy
Cost: High

Explainable, Natural language processing, Large language models, Image, Text, Audio

Audio-Visual Flamingo: Open Audio-Visual Intelligence for Long and Complex Videos

We present Audio-Visual Flamingo (AV-Flamingo), a fully open state-of-the-art audio-visual large language mode

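Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

Natural language processing, Large language models, Generation, Text, Multimodal

github | GitHub available | 2026-07-17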

generative-ai — Comprehensive resources on Generative AI, including a detailed roadmap, projects, use cases, interview preparation, and coding preparation.

A list of resources related to generative AI.

Use: Generative AI
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-16

Diffusion models recover accurate mixture weights despite score function insensitivity

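A study aimed at improving mode resolution in score-based generative models. It shows that mode resolution does not depend on the score function and that mixture weights can be inferred from generated samples.

Deep learning, Transformer, Generation, Multimodal

Use: Improving mode resolution in score-based generative models
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-16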

RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources

Skills are a useful abstraction for software agents, turning human and agent experience into reusable procedur

Natural language processing, RAG, Image, Text, Video

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-16

Xiaomi-Robotics-1: Scaling Vision-Language-Action Models with over 100K Hours of Real-World Trajectories

We present Xiaomi-Robotics-1, a foundational vision-language-action (VLA) model capable of (1) following diver

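Deep learning, Model compression/quantization, Generation, Text, Multimodal

Use: Generation
Difficulty: Easy
Cost: High

Deep learning, Transformer, Multimodal, Self-supervised

github | GitHub available | 2026-07-16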

stable-pretraining — Reliable, minimal and scalable library for pretraining foundation and world models

A library for pretraining foundation models. It is minimal and scales seamlessly.

Use: Pretraining foundation models
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-15

Multimodal Empirical Bayes Variational Autoencoders for Joint Longitudinal and Time-to-Event Modeling

Longitudinal tumor measurements, dropout information, and genetic covariates provide complementary information

Deep learning, Normalization/optimization methods, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-07-15

S-CARD-CMSA: A Score-Aware Candidate Archive with Density-Filtered Reporting for Multimodal Optimization

Multimodal optimization aims to locate multiple globally optimal or near-optimal solutions in a single run. Th

Natural language processing, RAG, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-15

Generalizable VLA Finetuning via Representation Anchoring and Language-Action Alignment

Finetuning a pretrained vision-language model (VLM) on robot demonstrations via behavior cloning (BC) has beco

Computer vision, Segmentation, Image, Text, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-14

Ensemble Controlled-Flow Filtering for Implicit Data Assimilation

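Notes that traditional ensemble filtering algorithms are ill-suited to nonlinear observation mechanisms and high-dimensional data, and proposes an approach to implicit data assimilation.

Explainable, Computer vision, Segmentation, Multimodal

Use: Implicit data assimilation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-07-14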

ANGLE: Angular Neural Generative Learning via Engression

Circular data, representing angles or directions, are frequently encountered in computer vision, biology, geol

Deep learning, Model compression/quantization, Generation, Regression, Image

Use: Generation
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-14

ReflectWorld-MM: An Entity-Oriented Multimodal Memory System for Open-Ended Video Streams

Building assistants that can continually watch the world, remember what they see, and reason over their accumu

Computer vision, Multimodal, Image, Text, Audio

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-13

Markov Chain Monte Carlo with Diffusion Paths

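This study proposes a new approach that improves Markov chain Monte Carlo so that it samples better from multimodal distributions. The approach uses diffusion noise paths to speed up the model's convergence, and multimodal

Natural language processing, Fine-tuning, Multimodal

Use: Markov chain Monte Carlo
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-13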

See like a Robot: Robot-Centric Pointmaps for Vision-Language-Action Models

Vision-language-action (VLA) models predict robot actions from visual observations and language instructions.

Computer vision, 3D/point cloud, Image, 3D, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-13

SVR-R1: Bootstrapping Multi-modal Reasoning with Self-verification in Reinforcement Learning

We introduce Self-Verified Reasoner (SVR-R1), a multi-turn RL framework that turns a model's own verification

Computer vision, Segmentation, Generation, Multimodal, Reinforcement learning

Use: Generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-13

Awesome-Mixture-of-Experts — Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)

Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts

Use: Implementation and validation platform
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-13

UniPic — Open-source SOTA multi-image editing model

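UniPic is an implementation of an open-source, state-of-the-art image editing model.

Computer vision, Multimodal, Generation, Image

Use: Implementing a multi-image editing model
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-12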

Demixing Sparse Signals from Nonlinear Observations using Generalized Non-convex Regularization

We consider the recovery of a pair of sparse vectors from a limited number of nonlinear observations of their

Explainable, Computer vision, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

github | GitHub available | 2026-07-10

multimind-sdk — Your SDK solves all of this. One interface. Unified logic. Local + hosted models. Fine-tuning. Agent tools. Enterprise-ready. Hybrid RAG.Star 🌟 if you like it!

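Develops a framework for GUI automation that addresses the problems of stop detection, recovery, and re-search that arise in automating GUI operations.

Use: GUI automation tool
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-07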

Do You Remember? Toward Memory-Centric Multimodal AI

Human memory is reconstructive, not a faithful recording. Current multimodal LLMs (MLLMs) lack this capability

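Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Image, Text, 3D

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-07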

UI2App: Benchmarking Visual Interaction Inference in Executable Web Application Generation

Large language models (LLMs) have demonstrated growing competence in web page generation. However, existing te

Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-07

VLM-R1 — Solve Visual Understanding with Reinforced VLMs

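This work proposes VLM-R1, a reinforced vision-language model that strengthens image understanding. The model is designed to make images easier to understand.

Natural language processing, Large language models, Image, Multimodal

Use: Solving image understanding problems
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-03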

EEGUnity — An open source tool for large-scale EEG datasets processing

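Video diffusion transformers can in principle run inference independent of video length, but this length extrapolation is difficult in practice. A method called RIFLEx is developed, and video length extrapolation

Computer vision, Multimodal

Use: Length extrapolation with video diffusion transformers
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-28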

awesome-japanese-llm — 日本語LLMまとめ - Overview of Japanese LLMs

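Develops learning models to improve the performance of analysis systems.

Natural language processing, Large language models, Generation, Multimodal

Use: Developing learning models to improve the performance of analysis systems
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-06-22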

Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation

Maintaining physical consistency in video generators and world models increasingly relies on vision-language m

Natural language processing, Large language models, Text, Video, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-19

Distance-based subsidy rate design to incentivize ride-hail access to advanced air mobility hubs

The success of advanced air mobility (AAM) operations is largely contingent on its effective integration with

Computer vision, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-15

Evolution & Foundation: AI Shares Creative Control

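Proposes a new method for evaluating ideas that AI creates in collaboration with humans, improving the assessment of creativity.

Natural language processing, Fine-tuning, Generation, Image, 3D

Use: A new method for evaluating AI creativity
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-14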

MSC-CMA-ES: Structure-Aware Restarts for CMA-ES via Cyclic Nearest-Better Basin Discovery

CMA-ES behaves, per restart, primarily as a local optimizer; multimodal search relies on restart strategies su

MI-oriented, Natural language processing, RAG, Multimodal

Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High