MLinfo | Machine Learning and AI Paper Summaries

MLinfo | Catch up on and search technology that is updated daily

Search results for "multimodal"

32 results


huggingface | GitHub available | Hugging Face available | 2026-06-05

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research move

Deep learning, Model compression/quantization, Image, Text, Audio

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-05

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, r

Deep learning, Model compression/quantization, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-04

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of ineffi

Deep learning, Model compression/quantization, Anomaly detection, Image, Multimodal

Use: Anomaly detection
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-04

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vis

Quality prediction/anomaly detection, Computer vision, Multimodal, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-04

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning a

Natural language processing, Large language models, Image, Text, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-04

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagno

Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-04

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existin

Natural language processing, Large language models, Image, Text, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-04

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs i

Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-04

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, r

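Suited to tabular data, Natural language processing, Large language models, Text, Video, 3D

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High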

huggingface | Hugging Face available | 2026-06-04

Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions

Large language models are increasingly used to simulate social media users and infer how individuals may respo

Deep learning, Transformer, Text, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-04

Benchmark Everything Everywhere All at Once

Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit

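Quality prediction/anomaly detection, Natural language processing, Large language models, Text, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High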

huggingface | Hugging Face available | 2026-06-03

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical info

Suited to tabular data, Explainable, Computer vision, Multimodal, Image, Text

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-03

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference

Natural language processing, Fine-tuning, Summarization, QA, Image

Use: Summarization
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-03

BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding

Learning representations of CAD models is a largely open problem. While 3D representation learning has flouris

Deep learning, Transformer, Classification, Generation, Embeddings

Use: Classification
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-03

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing

Sensor/time series, Computer vision, Multimodal, Generation, Image

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-02

A Cookbook of 3D Vision: Data, Learning Paradigms, and Application

3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and mo

Natural language processing, RAG, Generation, Video, 3D

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-02

MAOAM: Unified Object and Material Selection with Vision-Language Models

Selection is a core operation in interactive image editing. To be practical, a user should be able to specify

Suited to MI, Natural language processing, RAG, Generation, Segmentation, Image

Use: Generation
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-02

SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction

In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-po

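Sensor/time series, Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Image, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High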

huggingface | Hugging Face available | 2026-06-02

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous

Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Text, Video

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-02

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained per

Natural language processing, Large language models, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-02

When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models

Graph Language Models (GLMs) have become a promising direction for adapting Large Language Models (LLMs) to gr

Deep learning, Model compression/quantization, Text, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-01

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map compl

Sensor/time series, Deep learning, Transformer, Detection, Generation, 3D

Use: Detection
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-01

AdaCodec: A Predictive Visual Code for Video MLLMs

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existin

Natural language processing, Large language models, Image, Text, Video

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-01

Cosmos 3: Omnimodal World Models for Physical AI

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, i

Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-01

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. Ho

Natural language processing, RAG, Regression, Text, Multimodal

Use: Regression
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-29

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding

Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question

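Quality prediction/anomaly detection, Natural language processing, Large language models, Classification, QA, Image

Use: Classification
Difficulty: Easy
Cost: High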

huggingface | Hugging Face available | 2026-05-29

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relatio

Computer vision, 3D/point clouds, Detection, Text, 3D

Use: Detection
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-29

PaintBench: Deterministic Evaluation of Precise Visual Editing

While current multimodal models are proficient at open-ended visual editing, executing precise single-answer e

Computer vision, Multimodal, Generation, Image

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-28

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capabilit

Deep learning, Model compression/quantization, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-28

Multimodal Music Recommendation System using LLMs

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction hist

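Sensor/time series, Quality prediction/anomaly detection, Deep learning, Transformer, Text, Audio, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High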

huggingface | Hugging Face available | 2026-05-28

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision b

Natural language processing, Fine-tuning, Image, Text, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-22

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need produce nume

Natural language processing, Fine-tuning, Image, Text, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High