MLinfo | Machine Learning and AI Paper Digest

diffusers — 🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.

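A library of diffusion models that can be used for image, video, and audio generation.

Generative AI, Diffusion models, Generation, Image, Text

Use case: Image, video, and audio generation
Difficulty: Easy
Cost: High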

Quality prediction/anomaly detection, Computer vision, Segmentation, Classification, Detection, Image

cvat — Computer Vision Annotation Tool (CVAT) is a leading platform for building high-quality visual datasets for vision AI. It offers open-source, cloud, and enterprise products, as well as labeling services, for image, video, and 3D annotation with AI-assisted labeling, quality assurance, team collaboration, analytics, and developer APIs.

CVAT is an industry-standard data engine for machine learning, used by teams of all sizes on data of all scales.

Use case: Data labeling and management
Difficulty: Easy
Cost: High

Computer vision, Segmentation, Classification, Image, Video

labelme — Image annotation with Python. Supports polygon, rectangle, circle, line, point, and AI-assisted annotation.

An image annotation tool that supports polygons, rectangles, circles, lines, points, and other shapes.

Use case: Image annotation
Difficulty: Easy
Cost: High

Sana — SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

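This work introduces SANA, a high-resolution image generation model that produces high-quality, high-resolution images at low computational cost.

Use case: High-resolution image synthesis
Difficulty: Easy
Cost: High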

Awesome-Video-Diffusion — A curated list of recent diffusion models for video generation, editing, and various other applications.

Awesome-Video-Diffusion is a curated list of recent diffusion models for video generation, editing, and other applications.

Generative AI, Diffusion models, Generation, Video

Use case: Video generation and editing
Difficulty: Easy
Cost: High

FastVideo — A unified inference and post-training framework for accelerated video generation.

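FastVideo is a unified inference and post-training framework for accelerated video generation.

Deep learning, Model compression/quantization, Generation, Video

Use case: Accelerating video generation
Difficulty: Easy
Cost: High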

LightX2V — Lightweight Image Video Action Generation Inference Framework

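LightX2V is a lightweight inference framework for image, video, and action generation.

Deep learning, Model compression/quantization, Generation, Image, Video

Use case: Lightweight AI inference infrastructure
Difficulty: Easy
Cost: High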

onnxruntime — ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

ONNX Runtime is a cross-platform, high-performance accelerator for ML inference and training.

MLOps, Model deployment

Use case: Cross-platform, high-performance ML inference engine
Difficulty: Easy
Cost: High

arxiv | GitHub available | 2026-07-23

3D-Aware VLMs with Implicit and Explicit Geometries

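This work proposes VLM-IE3D (Vision-Language Models with Implicit and Explicit 3D geometry), a new approach to 3D spatial understanding. VLM-IE3

Computer vision, 3D/point clouds, Detection, Image, Text

Use case: Developing 3D spatial understanding
Difficulty: Hard
Cost: High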

Explainable, Sensors/time series, Computer vision, Video recognition, Prediction, Text

Climate-resilient electric vehicle charging infrastructure for sustainable cities: An interpretable causal-ensemble framework for preventive maintenance and low-carbon mobility

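Urban electric vehicle charging infrastructure needs to become more resilient and lower-carbon by predicting and repairing failures as early as possible. This work uses machine learning to develop a failure prediction model.

Use case: Improving the resilience of urban EV charging infrastructure
Difficulty: Hard
Cost: High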

Multi-Task Learning for Heterogeneous Prediction from Video Game State with Transfer Learning

Multi-task learning (MTL) is a promising approach for prediction tasks derived from video game state data, as

Natural language processing, Fine-tuning, Image, Text, Video

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Video

DART: A Degradation-Aware Recurrent Transformer for Archival Film Restoration

Archival film restoration is a challenging problem because historical footage contains compound degradations s

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Generative AI, Video generation, Generation, Image, Text

GraphVid: Interactive Graph-Controllable Video Generation

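GraphVid generates video from a graph and text, giving precise control over the motion of multiple objects. The graph stores information describing object motion, while the text specifies constraints on the generation.

Use case: Controllable video generation
Difficulty: Hard
Cost: High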

ElasticTTT: Prior-Preserving Test-Time Tuning for Video Editing

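ElasticTTT lets a model adjust its behavior at test time. At test time the model combines information from earlier samples with the current input so that it behaves correctly when editing video.

Generative AI, Diffusion models, Generation, Text, Video

Use case: Test-time tuning for video editing
Difficulty: Hard
Cost: High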

Quality prediction/anomaly detection, Deep learning, Transformer, Image, Text, Video

Adaptive Identity Anchoring: Closed-Loop Keyframe Placement for Synthetic Paired Supervision in Video Face Swapping

Video face swapping has no natural paired supervision: no real footage exists of one person's face performing

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Classification, Detection, Video

BasketEvent: Understanding Who Did What and When in Basketball Videos

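This study uses a large language model to develop a model that, based on an understanding of basketball dynamics, infers which players are involved in an event and its temporal boundaries.

Use case: Understanding the dynamics of basketball games
Difficulty: Hard
Cost: High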

Logic Programming Semantics for Causal Processes

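This study uses large language models to advance the understanding of causal processes and reports that they can be used to predict causal relationships.

Use case: Understanding causal processes
Difficulty: Hard
Cost: High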

How Rules Represent Causal Knowledge: Causal Modeling with Probabilistic Logic Programming

This study uses large language models to investigate causal modeling and reports that they can be used to predict causal relationships.

Use case: Causal modeling
Difficulty: Hard
Cost: High

V-DEAL: Diagnosing Video Safety De-Calibration as an Understanding-Refusal Coupling Failure

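This work proposes a new diagnostic framework for checking the safety of video LMMs. The framework jointly considers model behavior, understanding, and semantics.

Natural language processing, Large language models, Image, Text, Video

Use case: Diagnosing video safety de-calibration
Difficulty: Hard
Cost: High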

Can Generative Recommendation Reach Cold Items? A Temporal Perspective on Semantic-ID Generation

Semantic-ID-based generative recommendation represents items as sequences of shared semantic tokens, enabling

Computer vision, Video recognition, Generation, Text

Use case: Cold items in generative recommendation
Difficulty: Hard
Cost: High

EmoAgent-R1: Towards Multimodal Emotion Understanding with Reinforcement Learning-based Dynamic Agent Specialization

Multimodal large language models (MLLMs) have achieved impressive performance in multimodal emotion recognitio

Natural language processing, Large language models, Classification, Text, Video

Use case: Classification
Difficulty: Hard
Cost: High

Explainable, Deep learning, Transformer, Embeddings, Image, Video

HyWorldVLA: A Vision-Language-Action Model with Hybrid World Modeling for Autonomous Driving

Vision-Language-Action (VLA) models augmented with world modeling represent a promising paradigm for end-to-en

Use case: Embeddings
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Image, Text, Video

Beyond Independent Optimization: Compression, MoE Routing, and Quantization Interactions in Multimodal Edge Intelligence

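Efficient multimodal inference is limited not only by model performance and FLOP count but also by the cost of moving, caching, transforming, and storing quantized representations, and by memory and energy constraints. In this paper, recent visual token compression

Use case: Improving the cost and efficiency of multimodal edge AI
Difficulty: Hard
Cost: High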

Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Image, Text

Streaming Multi-Agent Autoregressive Diffusion Model with World State Registers

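For multi-agent simulation, this work assumes that a shared world state is maintained across agents and that this world state is reflected in their observations.

Use case: Multi-agent simulation
Difficulty: Hard
Cost: High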

MI-oriented, Deep learning, Model compression/quantization, Segmentation, Anomaly detection, Image

Unified Video Dense Prediction from Disjoint Data

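By performing spatial inference over the objects in a video jointly, this work builds a unified video inference system that goes beyond existing task-specific annotations.

Use case: Video segmentation inference
Difficulty: Hard
Cost: High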

Self-Supervised Learning of Structured Dynamics from Videos

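By disentangling camera motion from object motion in video, this work improves motion representation learning.

Deep learning, Transformer, Embeddings, Image, Video

Use case: Predicting motion in video
Difficulty: Hard
Cost: High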

Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Video

SANA-Video 2.0: Hybrid Linear Attention with Attention Residuals for Efficient Video Generation

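This work proposes a new method for improving both the efficiency and the output quality of video generation models.

Use case: Video generation
Difficulty: Hard
Cost: High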

Quality prediction/anomaly detection, Deep learning, Transformer, Image, Text, Video

Texture++: Elevating 3D Asset Texture Resolution with a Region-Aware Diffusion Model

Numerous 3D assets are discarded due to low texture resolution, while current super-resolution models ignore t

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Detection, Text, Video

Incremental Optimal Assignment for Real-Time Crowd Tracking

Multi-object tracking in dense crowds requires solving a bipartite assignment problem between detections and t

Use case: Detection
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Normalization/optimization methods, Classification, Image, Text

Quality-Aware Multimodal Fusion Reveals Implicit Identity in Valence-Arousal Features

Conventional face recognition relies on static appearance cues and degrades in unconstrained settings with exp

Use case: Classification
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Generation, Text, Video

arxiv | GitHub available | 2026-07-23

T-STAR: A Large-Scale Benchmark for Spatio-Temporal Panoptic Scene Graph Generation in Satellite Video

Structured understanding of satellite video is essential for advancing dynamic geospatial scene analysis from

Use case: Generation
Difficulty: Hard
Cost: High

Out of Sight, Still in Mind: Token Compression for Omni-LLMs

The goal of this paper is to reduce the input token cost of Omni-modal large language models (Omni-LLMs) at in

Natural language processing, Large language models, Image, Text, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Detection, Generation, Image

GroupVideo: Multi-Identity Customized Text-to-Video Generation

Current identity customized video generation methodologies are predominantly limited to single-identity scenar

Use case: Detection
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Image, Video, 3D

WAT3R: Feedforward Underwater 3D Reconstruction

Reliable feedforward underwater 3D reconstruction remains challenging due to severe light attenuation and back

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Text, Video, Multimodal

ProCap: Prominence-guided Object Rectification for Faithful and Comprehensive Video Captioning

Improving video captioning quality typically demands retraining large vision-language models, an expensive and

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Distribution-Alignment Bridge for Uncertainty-Aware Text-to-Video Retrieval

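This paper proposes the Distribution-Alignment Bridge (DAB) for matching text with video. DAB represents text and video entities as probability distributions and resolves the distribution gap between them. This

Natural language processing, Embeddings/retrieval, Generation, Text, Video

Use case: Text-to-video retrieval
Difficulty: Hard
Cost: High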

Ms. Forcing: Efficient Streaming Video Generation with Multi-Scale Patchification and Attention

This paper proposes Ms. Forcing, an efficient streaming video generation method that combines multi-scale patchification with attention.

Deep learning, Transformer, Generation, Video

Use case: Streaming video generation
Difficulty: Hard
Cost: High

MI-oriented, Quality prediction/anomaly detection, Deep learning, Transformer, Classification, Image, Text

Sidewalk Moments: Are Richer Representations Always More Human-Aligned? Evidence from City-Walk Videos

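This study analyzes city-walk videos using representations from four modalities: spatio-temporal information, temporally averaged images, audio encodings, and text-based representations.

Use case: Analyzing city-walk videos
Difficulty: Hard
Cost: High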

Quality prediction/anomaly detection, Natural language processing, Large language models, Image, Text, Video

ViSTR-Bench: Can MLLMs Reason from Continuous Visual Cues in Dynamic Scenes?

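This paper proposes ViSTR-Bench, a benchmark that evaluates whether MLLMs can extract information from dynamic scenes.

Use case: Evaluating MLLM reasoning in dynamic scenes
Difficulty: Hard
Cost: High

github | GitHub available | 2026-07-23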

SimpleTuner — A general fine-tuning kit geared toward image/video/audio diffusion models.

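A general-purpose fine-tuning kit for image, video, and audio diffusion models.

Natural language processing, Fine-tuning, Image, Audio, Video

Use case: Fine-tuning diffusion models
Difficulty: Easy
Cost: High

Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Generation, Text, Video

github | GitHub available | 2026-07-23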

Causal-Forcing — [ICML 2026] Official codebase for "Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation" & Causal Forcing++

This paper, Causal-Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive

Use case: High-quality video generation
Difficulty: Easy
Cost: High

Perspective Latents as an Architectural Condition for Causal Emergence in Active Inference Agents

A recent line of work measures causal emergence in reinforcement learning agents through Integrated Informatio

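Computer vision, Video recognition, Reinforcement learning

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Computer vision, Video recognition, Text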

Attribution Markets: A Fisher-Market Formulation for Fractional Credit Assignment Between Planned Tasks and Performed Actions

Personal and organizational planning systems maintain two records that drift apart: what was planned (a task's

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Adaptive Multi-Horizon Reinforcement Learning

Effective decision-making in complex and changing environments requires balancing short-term and long-term con

Computer vision, Video recognition, Reinforcement learning

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

SalesLoop: Reinforcement Learning from Performance Feedback for Sales Lead Ranking

Lead ranking in Customer Relationship Management (CRM) systems faces a persistent challenge: models achieving

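Computer vision, Video recognition, Reinforcement learning

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Classification, Generation, Video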

HeadCast: Casting Attention Heads for Efficient Autoregressive Video Generation

This work addresses video generation and proposes HeadCast for efficient autoregressive video generation.

Use case: Video generation
Difficulty: Hard
Cost: High

Diffusion ReRoll: Revisable Denoising for Robotic Sequential Prediction

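This study proposes a diffusion-based framework that can be used for sequential prediction on real-world robots.

Natural language processing, RAG, Generation, Anomaly detection, Text

Use case: Sequential prediction for real-world robots
Difficulty: Hard
Cost: High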

Zero-Observation User Reactivation with Gap-Driven Dimensional Gating

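Sequential recommendation models are built to capture continuously observed behavior. This work proposes a way to improve recall for reactivated users who return after a long gap in activity.

Deep learning, Transformer, Video

Use case: Recommendations for reactivated users
Difficulty: Hard
Cost: High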

Domain-Adapted Power Curve for Cross-Farm Applications

The wind energy industry relies on accurate power curve models to make power forecast, evaluate turbine perfor

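Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Generation, Text, Video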

RealVDeblur: One-Step Diffusion for Generalizable Real-World Video Deblurring

Real-world video deblurring remains challenging due to diverse motion patterns, complex degradations, and the

Use case: Generation
Difficulty: Hard
Cost: High

MI-oriented, Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Image, Video

StreamHOI: Interaction-aware Temporal Memory Adaptation for Streaming HOI Video Generation

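Work on human activity has mostly relied on offline, short-duration visual generation, which is impractical when long-duration generation is needed in practice. StreamHOI uses a number of generated images to generate visual interactions between people

Use case: Visual generation of human interactions
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Computer vision, Segmentation, Detection, Anomaly detection, Text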

Rethinking Open-World Video Anomaly Detection: Diagnosing Definition Blindness

Open-world video anomaly detection (OWVAD) is expected to detect events that match a user-specified definition

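Use case: Detection
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Video, Reinforcement learning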

PercepCap: Video Captioner with Structured Spatio-Temporal Perception

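Spatial and temporal understanding is essential for video captioning. PercepCap decomposes the video input into spatio-temporal perception, which improves the understanding reflected in the generated captions and allows spatio-temporal errors to be detected more accurately,

Use case: Structured spatio-temporal understanding for video captioning
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Generation, Text, Video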

Self Gradient Forcing: Native Long Video Extrapolation

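Long video extrapolation requires advanced visual intelligence. Self Gradient Forcing trains a student model on history generated by a teacher model, so that long video extrapolation

Use case: Self Gradient Forcing for long video extrapolation
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Generation, Image, Video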

Vera: Identity-Faithful Human Subject-to-Video Generation

Subject-to-video (S2V) generation has made substantial progress in preserving reference subjects across divers

Use case: Generation
Difficulty: Hard
Cost: High

PerceptDrive: Perception Prior World-Action Modeling with Adaptive Expert Routing for End-to-End Autonomous Driving

Frozen perception foundation models encode rich geometric, semantic, and dynamic knowledge. Yet narrow conditi

Deep learning, Model compression/quantization, Generation, Video, Self-supervised

Use case: Generation
Difficulty: Hard
Cost: High

LoRFT: Benchmarking Long-Range Vehicle Trajectory Reconstruction from Fixed Highway Cameras

Long-range vehicle trajectories provide important spatio-temporal evidence for traffic safety analysis, autono

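Natural language processing, RAG, Detection, Video

Use case: Detection
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Computer vision, Segmentation, Generation, Image, Text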

OSVE: One Step Video Editing with One Step Diffusion Models

Text-guided video editing with diffusion models is impractically slow, hindered by costly multi-step sampling

Use case: Generation
Difficulty: Hard
Cost: High

LAVIFT: Latent-Action-Guided Vision Fine-Tuning for Surgical Interaction Recognition

Understanding instrument-tissue interactions is essential for context-aware surgical AI and autonomous robotic

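Natural language processing, Fine-tuning, Classification, Detection, Image

Use case: Classification
Difficulty: Hard
Cost: High

Natural language processing, Fine-tuning, Image, Video, Multimodal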

EA-Nav: Learning Safe Visual Navigation Policies with Embodiment Awareness

Cross-embodiment navigation is a key challenge in embodied intelligence. Due to differences in embodiment, the

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

KineBench: Benchmarking Embodied World Models via IDM-Free Kinematic Grounding

Evaluating the physical consistency of embodied world models(EWMs) is a critical open challenge. While closed-

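Computer vision, 3D/point clouds, Generation, Anomaly detection, Image

Use case: Generation
Difficulty: Hard
Cost: High

Natural language processing, Large language models, Segmentation, Image, Text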

Memory-Augmented Multimodal Large Language Models for Small Object Understanding in Streaming Aerial Videos

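This study develops a memory-augmented large language model for recognizing small objects from drones. In complex drone scenes, the model can identify objects according to user instructions.

Use case: Object recognition from drones
Difficulty: Hard
Cost: High

Suited to small datasets, Quality prediction/anomaly detection, Natural language processing, Prompt engineering, Video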

MoAKE: Toward Unified All-in-One Action Quality Assessment via Mixture of Action Knowledge Experts

Action Quality Assessment (AQA) aims to objectively evaluate performance quality from action videos. Most exis

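Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Detection, Segmentation, Video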

Efficient Tracking and Understanding Object Transformations

Tracking objects through state transformations is essential for understanding real-world dynamics. However, ex

Use case: Tracking objects through state transformations
Difficulty: Hard
Cost: High

ReFace: Reorganizing Facial Spatiotemporal Representations for Improved Pain Assessment

Automatic pain assessment from facial video remains challenging due to the spatial heterogeneity of pain-relat

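Computer vision, Segmentation, Image, Video

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Easy to try on CPU, Deep learning, Transformer, Classification, Video, 3D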

A Unified Tokenization Framework for Pain Recognition using Heterogeneous 3D Modalities

Pain is a complex and pervasive phenomenon affecting a large percentage of the population, and accurate assess

Use case: Classification
Difficulty: Hard
Cost: High

SafeGen: Goal-Conditioned Video Diffusion of Safety-Critical Scenarios for VLM-Based Autonomous Driving

VLMs are increasingly deployed in AD systems, creating an urgent need for rigorous safety evaluation under rar

Natural language processing, RAG, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

Robots Acquire Manipulation Skills in Seconds from a Single Human Video

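HOST is a system that lets robots quickly acquire skills from human demonstrations. A robot learns a skill from a single human video while retaining the skills it has already acquired.

Natural language processing, RAG, Video

Use case: Letting robots quickly acquire skills from human demonstrations
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Computer vision, Video recognition, Detection, Anomaly detection, Multimodal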

Clinical Pathways as Safety Specifications for Physical AI in Hospital Wards

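This work uses clinical pathways as a basis for letting robots operate safely in real environments: robots work safely in hospital wards while protecting medical staff and patients.

Use case: Ensuring the safety of robots used in medical facilities
Difficulty: Hard
Cost: High

Computer vision, Object detection, Classification, Detection, Segmentation

github | GitHub available | 2026-07-22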

supervision — We write your reusable computer vision tools. 💜

supervision lets users build their own computer vision tools using machine learning.

Use case: Custom computer vision tools
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-22

OpenWorldLib — Unified Codebase for Advanced World Models.

OpenWorldLib is a unified codebase for advanced world models.

Computer vision, 3D/point clouds, Generation, Video, 3D

Use case: Providing world models
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-22

Awesome-CVPR2026-CVPR2025-ICCV2025-CVPR2024-ECCV2026-ECCV2024-AIGC — A Collection of Papers and Codes for CVPR2026/CVPR2025/ICCV2025/CVPR2024/ECCV2026/ECCV2024 AIGC

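A collection of AIGC papers and code from CVPR 2026, CVPR 2025, ICCV 2025, CVPR 2024, ECCV 2026, and ECCV 2024.

Computer vision, 3D/point clouds, Generation, Image, Video

Use case: Finding AIGC papers and code from major vision conferences
Difficulty: Easy
Cost: High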

MeetingToM: Evaluating Multimodal LLMs on Theory-of-Mind Reasoning in Multi-Party Meetings

Theory of Mind (ToM), the ability to infer other's beliefs, intentions, and states of knowledge, is central to

Natural language processing, Large language models, QA, Text, Audio

Use case: QA
Difficulty: Hard
Cost: High

Fusion Embedding: A Unified Embedding Space for Text, Image, Video, and Audio

A single embedding space that covers text, images, video, and audio lets one index serve every query a user ca

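Natural language processing, Large language models, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

MI-oriented, Computer vision, Segmentation, QA, Image, Text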

ChronoStitch: Training-Free Composition of Visual KV Memories for Long-Horizon Temporal Reasoning

Long-video question answering requires a model to preserve visual evidence over time without repeatedly reproc

Use case: QA
Difficulty: Hard
Cost: High

Sensors/time series, Natural language processing, Large language models, Image, Text, Video

D3VL: Understanding Driving Scenes from 3D Time Series Data and Video with Language Models

Recent advances in Multimodal Large Language Models (MLLMs) have triggered the development of end-to-end MLLMs

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Geospatial Diffusion-based Evolution Synthesis (GeoDES) for Storm-Centered Weather Augmentation

While machine learning-based weather models hold significant promise, they struggle to predict the detailed st

Deep learning, Model compression/quantization, Generation, Image, Video

Use case: Generation
Difficulty: Hard
Cost: High

Crowd4D: Scene-Aware Monocular 4D Crowd Reconstruction

Recovering scene-consistent 4D crowd motion from monocular video in large-scale scenes remains challenging due

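Natural language processing, RAG, Image, Video, 3D

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Text, Video, Multimodal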

BLUE: Semantics-Preserving Video Compression for Efficient Vision-Language Surveillance Analytics

Continuous surveillance video creates a growing storage, transmission, and inference burden for enterprise vid

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-07-21

Detect Early, Escalate Rarely: Anytime Detection of AI-Generated Video from the Compressed Bitstream

Detectors for AI-generated video are evaluated offline. A clip is decoded to pixels and scored once, increasin

Easy to try on CPU, Deep learning, CNN, Detection, Image, Text

Use case: Detection
Difficulty: Hard
Cost: High

Masked Visual Actions for Unified World Modeling

Video models absorb rich priors over how the visual world moves, interacts, and responds to contact, making th

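Computer vision, Segmentation, Image, Video

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

MI-oriented, Quality prediction/anomaly detection, Natural language processing, Large language models, Image, Audio, Video

arxiv | GitHub available | 2026-07-21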

OmniReasoner: Thinking with Long Audio-Video via Native Tool Use

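This work proposes OmniReasoner, a method that combines original data with a zoom-in tool. It improves the logical reasoning of omni-modal LLMs over long audio-video inputs.

Use case: Improving logical reasoning over long audio-video inputs
Difficulty: Hard
Cost: High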

InstructMixup: Instruction-Guided Salient Patch Editing for Robust Data Augmentation

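This work proposes InstructMixup, which extends Mixup, a technique for blending image or video samples, so that the blending follows text instructions. The data are augmented while their content and labels are preserved.

Deep learning, Transformer, Classification, Detection, Generation

Use case: Extending Mixup for data augmentation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Segmentation, Image, Text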

IGGT4D: Streaming 4D Instance-Grounded Geometry Transformer

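Real-world spatial intelligence requires understanding video that streams continuously through a space. To address this, the work proposes a model capable of 4D understanding.

Use case: Understanding streaming video of a space
Difficulty: Hard
Cost: High

Sensors/time series, Computer vision, Video recognition, Detection, Generation, Image

arxiv | GitHub available | 2026-07-21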

NGPS: GPS-Denied Aerial Geo-Localization and 2.5D Reconstruction via Deep Satellite Image Matching and Multi-Rate Sensor Fusion

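This study proposes NGPS (Next-Generation Positioning System), a framework for localizing aircraft in GPS-denied flight. NGPS enables position estimation without GPS signals. N

Use case: GPS-denied aerial localization
Difficulty: Hard
Cost: High

Natural language processing, Prompt engineering, Image, Text, Video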

WorldScape Policy 2.0: Empowering Steerable World Action Modeling with Reasoning-Augmented Memory

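World Action Models (WAMs) are a paradigm for modeling robot manipulation. WAMs jointly model visual state transitions and robot actions. However, existing WAMs, for a fixed time

Use case: Solving multi-purpose manipulation problems
Difficulty: Hard
Cost: High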

Motion Primitive Discovery in a Humanoid Robot via Self-Organising Maps for Phase Recognition

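Motor features of actions are central to action recognition in areas such as social cognition and human-robot interaction. This work proposes a two-stage architecture for the humanoid robot NICO. The first stage uses one SOM that learns arm movements and another SOM that learns hand movements

Computer vision, Video recognition, Classification, Text, Video

Use case: Identifying motor features of manipulator motion
Difficulty: Hard
Cost: High

Sensors/time series, Computer vision, 3D/point clouds, Classification, Image, Video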

MVP-Tac: A Miniaturized Dual-Modal Vision and Photoelastic Tactile Sensor for Robot-Assisted Minimally Invasive Surgery

Robot-assisted minimally invasive surgery (RMIS) offers major benefits over open and conventional laparoscopic

Use case: Classification
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-21

Moving Alphabet: A Controlled Study of Training Data for Text-to-Video Generation

Text-to-video generation has advanced significantly over the past five years through scaling of model size, da

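Quality prediction/anomaly detection, Natural language processing, Fine-tuning, Classification, Generation, Text

Use case: Classification
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-21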

ABot-World-0: Infinite Interactive World Rollout on a Single Desktop GPU

We present ABot-World-0, an action-conditioned video world model for real-time, long-horizon closed-loop inter

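Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Text, Video

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High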

Organization of computation in reservoir computing

Reservoir computing exploits nonlinear dynamical systems to encode temporal inputs into high-dimensional state

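Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Explainable, Quality prediction/anomaly detection, Natural language processing, Fine-tuning, Detection, Anomaly detection, Text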

O-VAD: Industrial Video Anomaly Detection through Object-Centric Tracking and Reasoning

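This work proposes a machine learning model designed to detect anomalies in factories. Conventional methods consider everything in the video, which makes complex cases hard to handle. The proposed approach detects objects and

Use case: Anomaly detection in industrial video
Difficulty: Hard
Cost: High

Deep learning, Transformer, Embeddings, Image, Text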

Patch Policy: Efficient Embodied Control via Dense Visual Representations

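To make robot control more efficient, this work proposes patch-based policy learning built on dense visual representations.

Use case: Controlling resource-constrained robots
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Text, Video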

FARO: Feasibility-Aware Robot Motion Optimization

Fast planning of novel behaviors in unseen scenarios remains a fundamental challenge in robotics. The high-dim

Use case: Generation
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Generation, Image, Video

Does Robust VIO Need More Learning? Geometry-Verified Visual Measurements under Distribution Shift

Learning is increasingly introduced into visual-inertial odometry (VIO), ranging from learned feature front-en

Use case: Generation
Difficulty: Hard
Cost: High

Leveraging Two Robotic Arms for Tight Assembly Performance Gains

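This study presents an end-to-end framework that improves the performance of tight assembly operations by using two robot arms simultaneously. Given the CAD models and the desired assembly state, the robot arms

Quality prediction/anomaly detection, Natural language processing, RAG, Video

Use case: Assembly with two robot arms
Difficulty: Hard
Cost: High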

GeoWorldAD: Geometry World Action Model for Autonomous Driving

Autonomous driving requires both safe and efficient planning decisions in dynamic 3D environments. Although re

Deep learning, Transformer, Image, Video, 3D

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

AlayaWorld: Interactive Long-Horizon World Modeling -- Full Technical Report

Unlike conventional video game development, which relies on labor-intensive pipelines for asset production, an

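Use case: Generation
Difficulty: Easy
Cost: High

Explainable, Quality prediction/anomaly detection, Natural language processing, Large language models, Video, Multimodal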

EduPanel: A Three-Agent LLM Judge for Teaching Videos -- Reliability, Complementarity, and Human Trust Calibration

Teaching videos are becoming a major medium for education, creating a growing need for scalable evaluation of

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

ConsiSpace: Learning Geometric Consistency Matters for Video Spatial Reasoning

Video spatial reasoning is essential for navigation-oriented perception and long-video question answering, whe

Deep learning, Model compression/quantization, QA, Text, Video

Use case: QA
Difficulty: Easy
Cost: High

HOMIE: Human-object Centric Video Personalization via Multimodal Intelligent Enchancement

Human-object centric video personalization (HOCVP) is a core task within subject-driven video generation. Howe

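Use case: Generation
Difficulty: Easy
Cost: High

Quality prediction/anomaly detection, Natural language processing, Large language models, Detection, Generation, Segmentation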

FlowMimic: Mask-free Visual Editing and Generation with Pixel-pair Warped Flow Field for Online Video Editing Data Generation and Modality Mimicry

In line with the prevailing direction of vision research, we explore the integration of both generation and ed

Use case: Detection
Difficulty: Easy
Cost: High

FlashRT: Agent Harness for Guiding Agents to Deploy Real-Time Multimodal Applications

Real-time multimodal applications, including voice agents and interactive video generation, compose heterogene

Deep learning, Model compression/quantization, Generation, Text, Audio

Use case: Generation
Difficulty: Easy
Cost: High

ShotPlan: Cinematic Video Generation with Learnable Planning Token

Current video generation models achieve impressive results in single-shot generation, yet remain limited in ci

MI-oriented, Natural language processing, Embeddings/retrieval, Generation, Video

Use case: Generation
Difficulty: Easy
Cost: High

ReViV: Reconstructing the Viewer and the View in 4D from Monocular Egocentric Video

Egocentric devices, such as wearable front-facing cameras, provide a unique perspective for capturing the cont

Deep learning, Transformer, Generation, Video, 3D

Use case: Generation
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-19

From Perception to Assistance: Open-Vocabulary Shared Autonomy for Robotic Manipulation

Teleoperating a robotic manipulator in industrial environments demands precision that camera-based interfaces

Computer vision, Segmentation, Text, Video, Multimodal

Use case: Segmentation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-07-19

BoxTwin: Learning Elastoplastic Articulated Object Dynamics from Videos

Digital twins enable robots to anticipate and adapt to physical interactions, but existing models struggle wit

MLOps, Pipeline construction, Video

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-07-19

Temporal Fair Division of Indivisible Goods with Structured Constraints

This paper investigates temporal fair division, a setting where items are allocated over multiple rounds and a

Computer vision, Video recognition, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-19

TimeLens2: Generalist Video Temporal Grounding with Multimodal LLMs

Video multimodal large language models (MLLMs) can describe what happens in a video, but rarely identify when

Natural language processing, Large language models, Detection, Text, Video

Use case: Detection
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-19

HarmoHOI: Harmonizing Appearance and 3D Motion for Multi-view Hand-Object Interaction Synthesis

Hand-Object Interaction (HOI) synthesis is a cornerstone for animation production and embodied AI. Despite the

Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Image, Video

Use case: Generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-19

awesome-artificial-intelligence — A curated list of Artificial Intelligence (AI) courses, books, video lectures and papers.

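awesome-artificial-intelligence is an open-source project that collects and shares learning materials, articles, lectures, and other resources on artificial intelligence.

Machine learning, Unsupervised learning, Video, Unsupervised

Use case: Collecting and sharing AI resources
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-17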

Transient State Reorganization and Cell Differentiation in the Developmental Dynamics of Growing Neural Cellular Automata

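This work studies the process by which Neural Cellular Automata form complex shapes.

Computer vision, Video recognition, Detection

Use case: Image recognition
Difficulty: Hard
Cost: High

Deep learning, Transformer, Segmentation, Video, 3D

arxiv | GitHub available | 2026-07-17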

DPNeXt: A Lightweight Multi-Scale Feature Fusion Framework for Efficient ViT-Based Multi-Task Dense Prediction

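Multi-task learning supports the integration of semantic segmentation and depth estimation in robotic visual understanding systems. Vision foundation models (VFMs) are widely adopted as powerful feature encoders, but existing decoding strategies are a critical bottleneck

Use case: 3D spatial understanding through multi-task learning for robotics
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-17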

FVAttn: Adaptive Sparse Attention with Runtime Load Balancing for Video Generation

Video Diffusion Transformers process long spatio-temporal sequences, making self-attention the main bottleneck

Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Video

Use case: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-17

Apple-π: Benchmarking Thinking with Video Towards Law-Grounded Physical Intelligence

Modern video generation models are increasingly hailed as emerging world models with an internalized grasp of

Natural language processing, Large language models, Generation, Video

Use case: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-17

Audio-Visual Flamingo: Open Audio-Visual Intelligence for Long and Complex Videos

We present Audio-Visual Flamingo (AV-Flamingo), a fully open state-of-the-art audio-visual large language mode

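Explainable, Natural language processing, Large language models, Image, Text, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-17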

mediapipe — Cross-platform, customizable ML solutions for live and streaming media.

mediapipe provides cross-platform, customizable ML solutions for live and streaming media.

MLOps, Model deployment, Audio, Video

Use case: ML solutions for live and streaming media
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-16

Trajectory-aware Cross-view Geo-localization with Sequential Observations

Cross-view geo-localization matches ground-level observations against geo-tagged satellite imagery. Recent met

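Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Detection, Image, Text

Use case: Detection
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-16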

RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources

Skills are a useful abstraction for software agents, turning human and agent experience into reusable procedur

Natural language processing, RAG, Image, Text, Video

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-16

TurboDiffusion — TurboDiffusion: 100–200× Acceleration for Video Diffusion Models

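TurboDiffusion accelerates video diffusion models, with a stated speedup of 100–200×.

Deep learning, Model compression/quantization, Generation, Video

Use case: Accelerating video diffusion models
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-15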

Evaluating Encoding Strategies for Closed-Loop Classification in Biological Neural Networks

Interfacing with Biological Neural Networks (BNNs) requires encoding information into stimulation patterns tha

Computer vision, Video recognition, Classification, Image

Use case: Classification
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-15

Open-AoE: An Open Egocentric Manipulation Dataset and Toolchain for Embodied Learning

Egocentric videos of human manipulation provide scalable supervision for embodied intelligence, yet existing r

Computer vision, Segmentation, Image, Text, Video

Use case: Segmentation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-15

VideoRAE: Taming Video Foundation Models for Generative Modeling via Representation Autoencoders

Video generative models commonly rely on latent spaces learned by 3D Variational Autoencoders (3D-VAEs). Howev

Use case: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-14

ReflectWorld-MM: An Entity-Oriented Multimodal Memory System for Open-Ended Video Streams

Building assistants that can continually watch the world, remember what they see, and reason over their accumu

Computer vision, Multimodal, Image, Text, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-14

memvid — Memory layer for AI Agents. Replace complex RAG pipelines with a serverless, single-file memory layer. Give your agents instant retrieval and long-term memory.

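MemVid provides a serverless, single-file memory layer that gives AI agents instant retrieval and long-term memory.

Natural language processing, Large language models, Generation, Text, Video

Use case: Managing memory for AI agents
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-10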

OpenLongTail: Generative Scaling of Long-Tail Driving Data

Scaling robust driving policies is fundamentally bottlenecked by the scarcity of edge cases in curated dataset

Natural language processing, RAG, Generation, Image, Video

Use case: Generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-07

cs-video-courses — List of Computer Science courses with video lectures.

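This repository provides a list of computer science courses with video lectures.

Machine learning, Supervised learning, Video

Use case: Sharing educational resources
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-30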

ComfyUI-LTXVideo — LTX-Video Support for ComfyUI

ComfyUI-LTXVideo adds LTX-Video support to ComfyUI.

Generative AI, Diffusion models, Generation, Image, Text

Use case: Using LTX-Video in ComfyUI
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-29

HunyuanVideo — HunyuanVideo: A Systematic Framework For Large Video Generation Model

HunyuanVideo is a large video generation model capable of generating complex sequences.

Deep learning, Transformer, Generation, Video

Use case: Video generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-28

LanPaint — High quality training free inpaint for every stable diffusion model. Supports ComfyUI

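LanPaint provides high-quality, training-free inpainting for image generation. It works with Stable Diffusion models and also supports ComfyUI.

Quality prediction/anomaly detection, Generative AI, Diffusion models, Generation, Image, Video

Use case: Image generation
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-06-22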

Mass Conservation as an Inductive Bias for Self-Organized Criticality in NCA Reservoirs

Self-organized criticality (SOC), a dynamical regime associated with maximal information processing, offers a

Quality prediction/anomaly detection, Computer vision, Video recognition, Classification

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-22

Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation

Maintaining physical consistency in video generators and world models increasingly relies on vision-language m

Natural language processing, Large language models, Text, Video, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-20

Physics-Informed Eikonal Caging for Whole-Arm Manipulation Planning

Planning contact-rich whole-arm manipulation is challenging because interactions that involve extended robot g

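Quality prediction/anomaly detection, Reinforcement learning, Policy gradient (PPO / A3C), Video

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-15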

Neural dynamical systems on ferroelectric compute-in-memory for real-time forecasting

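This work accelerates time-series forecasting with neural dynamical systems, drawing on neuromorphic computing.

Computer vision, Video recognition, Prediction

Use case: Ferroelectric compute-in-memory for faster time-series forecasting
Difficulty: Hard
Cost: High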