MLinfo | Machine Learning and AI Paper Summaries

Suited to MI | Natural language processing | Large language models | Generation | Image | Text

MMAE: A Massive Multitask Audio Editing Benchmark

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation

Use case: Generation
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-05

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research move

Deep learning | Lightweighting and quantization | Image | Text | Audio

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams

Scientific paper recommendation is typically evaluated as static ranking over a fixed candidate set, yet real

Computer vision | Video recognition

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, r

Deep learning | Lightweighting and quantization | Generation | Image | Text

Use case: Generation
Difficulty: Easy
Cost: High

Streaming Video Generation with Streaming Force Control

We introduce StreamForce, a streaming video generation framework that enables physically grounded control thro

Deep learning | Lightweighting and quantization | Generation | Video

Use case: Generation
Difficulty: Easy
Cost: High

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently pr

Natural language processing | RAG | Generation | Image | Video

Use case: Generation
Difficulty: Easy
Cost: High

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs i

Deep learning | Transformer | Generation | Image | Text

Use case: Generation
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-04

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Video generation models have made impressive strides in synthesizing visually compelling content, yet their ou

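Quality prediction/anomaly detection | Natural language processing | RAG | Generation | Image | Video

Use case: Generation
Difficulty: Easy
Cost: High

Quality prediction/anomaly detection | Natural language processing | Large language models | Text | Video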

Towards One-to-Many Temporal Grounding

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predo

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Ex

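Natural language processing | Large language models | Image | Text | Video

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

Suited to tabular data | Natural language processing | Large language models | Text | Video | 3D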

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, r

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and

Natural language processing | Fine-tuning | Generation | Text | Video

Use case: Generation
Difficulty: Easy
Cost: High

Flash-WAM: Modality-Aware Distillation for World Action Models

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achiev

Deep learning | Lightweighting and quantization | Video

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference

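Natural language processing | Fine-tuning | Summarization | QA | Image

Use case: Summarization
Difficulty: Easy
Cost: High

Quality prediction/anomaly detection | Deep learning | Transformer | Generation | Video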

Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

We present Echo Infinity, an autoregressive (AR) framework towards real-time infinite video generation that em

Use case: Generation
Difficulty: Easy
Cost: High

M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability.

Natural language processing | RAG | Video

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body

Computer vision | 3D and point clouds | Generation | Image | Video

Use case: Generation
Difficulty: Easy
Cost: High

A Cookbook of 3D Vision: Data, Learning Paradigms, and Application

3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and mo

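Natural language processing | RAG | Generation | Video | 3D

Use case: Generation
Difficulty: Easy
Cost: High

Quality prediction/anomaly detection | Natural language processing | Large language models | Generation | Text | Video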

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous

Use case: Generation
Difficulty: Easy
Cost: High

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained per

Natural language processing | Large language models | Generation | Image | Text

Use case: Generation
Difficulty: Easy
Cost: High

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video

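Deep learning | Lightweighting and quantization | Generation | Image | Text

Use case: Generation
Difficulty: Easy
Cost: High

Quality prediction/anomaly detection | Natural language processing | Large language models | Image | Text | Video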

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the re

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-06-01

AdaCodec: A Predictive Visual Code for Video MLLMs

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existin

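Natural language processing | Large language models | Image | Text | Video

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-06-01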

Cosmos 3: Omnimodal World Models for Physical AI

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, i

Deep learning | Transformer | Generation | Image | Text

Use case: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-30

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genui

Deep learning | Transformer | Classification | QA | Image

Use case: Classification
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-05-29

OpenSTBench: Beyond Semantic Evaluation for Speech Translation

Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (

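Quality prediction/anomaly detection | Computer vision | Video recognition | Generation | Text | Audio

Use case: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-05-24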

WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

Recent video-based world models have made pixel-space environments interactive at the camera level: users can

Natural language processing | Fine-tuning | Generation | Image | Video

Use case: Generation
Difficulty: Easy
Cost: High