MLinfo | Machine Learning and AI Paper Digest

diffusers — 🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.

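A library of diffusion models for image, video, and audio generation.

Generative AI | Diffusion models | Generation | Image | Text

Use: Image, video, and audio generation
Difficulty: Easy
Cost: High

Quality prediction/Anomaly detection | Computer vision | Segmentation | Classification | Detection | Image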

cvat — Computer Vision Annotation Tool (CVAT) is a leading platform for building high-quality visual datasets for vision AI. It offers open-source, cloud, and enterprise products, as well as labeling services, for image, video, and 3D annotation with AI-assisted labeling, quality assurance, team collaboration, analytics, and developer APIs.

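CVAT is an industry-standard data engine for machine learning, used by teams of all sizes on data of all scales.

Use: Data labeling and management
Difficulty: Easy
Cost: High

Computer vision | Segmentation | Classification | Image | Video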

labelme — Image annotation with Python. Supports polygon, rectangle, circle, line, point, and AI-assisted annotation.

An image annotation tool that supports polygons, rectangles, circles, lines, points, and more.

Use: Image annotation
Difficulty: Easy
Cost: High

Sana — SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

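This work introduces SANA, a high-resolution image generation model that produces high-quality, high-resolution images at low computational cost.

Use: High-resolution image synthesis
Difficulty: Easy
Cost: High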

Awesome-Video-Diffusion — A curated list of recent diffusion models for video generation, editing, and various other applications.

Awesome-Video-Diffusion is a curated list of recent diffusion models for video generation, editing, and other applications.

Generative AI | Diffusion models | Generation | Video

Use: Video generation and editing
Difficulty: Easy
Cost: High

FastVideo — A unified inference and post-training framework for accelerated video generation.

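FastVideo is a unified inference and post-training framework for accelerated video generation.

Deep learning | Compression and quantization | Generation | Video

Use: Accelerating video generation
Difficulty: Easy
Cost: High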

LightX2V — Lightweight Image Video Action Generation Inference Framework

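LightX2V is a lightweight inference framework for image, video, and action generation.

Deep learning | Compression and quantization | Generation | Image | Video

Use: Lightweight AI inference infrastructure
Difficulty: Easy
Cost: High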

onnxruntime — ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

ONNX Runtime is a cross-platform, high-performance accelerator for ML inference and training.

MLOps | Model deployment

Use: Cross-platform, high-performance ML inference engine
Difficulty: Easy
Cost: High

arxiv | GitHub available | 2026-07-23

3D-Aware VLMs with Implicit and Explicit Geometries

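Proposes VLM-IE3D (Vision-Language Models with Implicit and Explicit 3D geometry), a new approach to 3D spatial understanding. VLM-IE3

Computer vision | 3D/Point cloud | Detection | Image | Text

Use: Developing 3D spatial understanding
Difficulty: Hard
Cost: High

Computer vision | Segmentation | Generation | Text | Video

arxiv | GitHub available | 2026-07-23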

T-STAR: A Large-Scale Benchmark for Spatio-Temporal Panoptic Scene Graph Generation in Satellite Video

Structured understanding of satellite video is essential for advancing dynamic geospatial scene analysis from

Use: Generation
Difficulty: Hard
Cost: High

github | GitHub available | 2026-07-23

SimpleTuner — A general fine-tuning kit geared toward image/video/audio diffusion models.

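A general-purpose fine-tuning kit for image, video, and audio diffusion models.

NLP | Fine-tuning | Image | Audio | Video

Use: Fine-tuning diffusion models
Difficulty: Easy
Cost: High

Quality prediction/Anomaly detection | Deep learning | Compression and quantization | Generation | Text | Video

github | GitHub available | 2026-07-23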

Causal-Forcing — [ICML 2026] Official codebase for "Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation" & Causal Forcing++

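In this paper, Causal-Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive

Use: High-quality video generation
Difficulty: Easy
Cost: High

Quality prediction/Anomaly detection | Deep learning | Compression and quantization | Classification | Generation | Video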

HeadCast: Casting Attention Heads for Efficient Autoregressive Video Generation

A study on video generation that proposes HeadCast for efficient autoregressive video generation.

Use: Video generation
Difficulty: Hard
Cost: High

Zero-Observation User Reactivation with Gap-Driven Dimensional Gating

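Proposes using a sequential recommendation model, which captures continuously observed behavior, to improve recall for reactivated users after long gaps in activity.

Deep learning | Transformer | Video

Use: Recommendations for reactivated users
Difficulty: Hard
Cost: High

Quality prediction/Anomaly detection | Computer vision | Segmentation | Generation | Image | Text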

OSVE: One Step Video Editing with One Step Diffusion Models

Text-guided video editing with diffusion models is impractically slow, hindered by costly multi-step sampling

Use: Generation
Difficulty: Hard
Cost: High

Suited to small data | Quality prediction/Anomaly detection | NLP | Prompt engineering | Video

MoAKE: Toward Unified All-in-One Action Quality Assessment via Mixture of Action Knowledge Experts

Action Quality Assessment (AQA) aims to objectively evaluate performance quality from action videos. Most exis

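Use: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/Anomaly detection | Deep learning | Compression and quantization | Detection | Segmentation | Video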

Efficient Tracking and Understanding Object Transformations

Tracking objects through state transformations is essential for understanding real-world dynamics. However, ex

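Use: Tracking objects through state transformations
Difficulty: Hard
Cost: High

Computer vision | Object detection | Classification | Detection | Segmentation

github | GitHub available | 2026-07-22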

supervision — We write your reusable computer vision tools. 💜

supervision provides reusable tools for building your own computer vision applications.

Use: Building custom computer vision tools
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-22

OpenWorldLib — Unified Codebase for Advanced World Models.

OpenWorldLib is a unified codebase for advanced world models.

Computer vision | 3D/Point cloud | Generation | Video | 3D

Use: Providing world models
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-22

Awesome-CVPR2026-CVPR2025-ICCV2025-CVPR2024-ECCV2026-ECCV2024-AIGC — A Collection of Papers and Codes for CVPR2026/CVPR2025/ICCV2025/CVPR2024/ECCV2026/ECCV2024 AIGC

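A collection of AIGC papers and code from CVPR 2026, CVPR 2025, ICCV 2025, CVPR 2024, ECCV 2026, and ECCV 2024.

Computer vision | 3D/Point cloud | Generation | Image | Video

Use: Finding AIGC papers and code from CVPR, ICCV, and ECCV
Difficulty: Easy
Cost: High

arxiv | GitHub available | 2026-07-21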

Detect Early, Escalate Rarely: Anytime Detection of AI-Generated Video from the Compressed Bitstream

Detectors for AI-generated video are evaluated offline. A clip is decoded to pixels and scored once, increasin

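Easy to try on CPU | Deep learning | CNN | Detection | Image | Text

Use: Detection
Difficulty: Hard
Cost: High

MI-oriented | Quality prediction/Anomaly detection | NLP | Large language models | Image | Audio | Video

arxiv | GitHub available | 2026-07-21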

OmniReasoner: Thinking with Long Audio-Video via Native Tool Use

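Proposes OmniReasoner, a method that combines the original data with a zoom-in tool to improve omnimodal LLMs' reasoning over long audio-video content.

Use: Improving reasoning over long audio-video content
Difficulty: Hard
Cost: High

Sensor/Time series | Computer vision | Video recognition | Detection | Generation | Image

arxiv | GitHub available | 2026-07-21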

NGPS: GPS-Denied Aerial Geo-Localization and 2.5D Reconstruction via Deep Satellite Image Matching and Multi-Rate Sensor Fusion

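This work proposes NGPS (Next-Generation Positioning System), a framework for GPS-denied aerial localization. NGPS enables position estimation without using GPS signals. N

Use: GPS-denied aerial localization
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-21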

Moving Alphabet: A Controlled Study of Training Data for Text-to-Video Generation

Text-to-video generation has advanced significantly over the past five years through scaling of model size, da

Quality prediction/Anomaly detection | NLP | Fine-tuning | Classification | Generation | Text

Use: Classification
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-21

ABot-World-0: Infinite Interactive World Rollout on a Single Desktop GPU

We present ABot-World-0, an action-conditioned video world model for real-time, long-horizon closed-loop inter

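Quality prediction/Anomaly detection | Deep learning | Compression and quantization | Text | Video

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High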

AlayaWorld: Interactive Long-Horizon World Modeling -- Full Technical Report

Unlike conventional video game development, which relies on labor-intensive pipelines for asset production, an

Use: Generation
Difficulty: Easy
Cost: High

Explainable | Quality prediction/Anomaly detection | NLP | Large language models | Video | Multimodal

EduPanel: A Three-Agent LLM Judge for Teaching Videos -- Reliability, Complementarity, and Human Trust Calibration

Teaching videos are becoming a major medium for education, creating a growing need for scalable evaluation of

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

ConsiSpace: Learning Geometric Consistency Matters for Video Spatial Reasoning

Video spatial reasoning is essential for navigation-oriented perception and long-video question answering, whe

Deep learning | Compression and quantization | QA | Text | Video

Use: QA
Difficulty: Easy
Cost: High

HOMIE: Human-object Centric Video Personalization via Multimodal Intelligent Enchancement

Human-object centric video personalization (HOCVP) is a core task within subject-driven video generation. Howe

Use: Generation
Difficulty: Easy
Cost: High

Quality prediction/Anomaly detection | NLP | Large language models | Detection | Generation | Segmentation

FlowMimic: Mask-free Visual Editing and Generation with Pixel-pair Warped Flow Field for Online Video Editing Data Generation and Modality Mimicry

In line with the prevailing direction of vision research, we explore the integration of both generation and ed

Use: Detection
Difficulty: Easy
Cost: High

FlashRT: Agent Harness for Guiding Agents to Deploy Real-Time Multimodal Applications

Real-time multimodal applications, including voice agents and interactive video generation, compose heterogene

Deep learning | Compression and quantization | Generation | Text | Audio

Use: Generation
Difficulty: Easy
Cost: High

ShotPlan: Cinematic Video Generation with Learnable Planning Token

Current video generation models achieve impressive results in single-shot generation, yet remain limited in ci

MI-oriented | NLP | Embedding/Retrieval | Generation | Video

Use: Generation
Difficulty: Easy
Cost: High

ReViV: Reconstructing the Viewer and the View in 4D from Monocular Egocentric Video

Egocentric devices, such as wearable front-facing cameras, provide a unique perspective for capturing the cont

Deep learning | Transformer | Generation | Video | 3D

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-19

TimeLens2: Generalist Video Temporal Grounding with Multimodal LLMs

Video multimodal large language models (MLLMs) can describe what happens in a video, but rarely identify when

NLP | Large language models | Detection | Text | Video

Use: Detection
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-19

HarmoHOI: Harmonizing Appearance and 3D Motion for Multi-view Hand-Object Interaction Synthesis

Hand-Object Interaction (HOI) synthesis is a cornerstone for animation production and embodied AI. Despite the

Quality prediction/Anomaly detection | Deep learning | Transformer | Generation | Image | Video

Use: Generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-19

awesome-artificial-intelligence — A curated list of Artificial Intelligence (AI) courses, books, video lectures and papers.

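awesome-artificial-intelligence is an open-source project that collects and shares AI learning materials, articles, lectures, and more.

Machine learning | Unsupervised learning | Video | Unsupervised

Use: Collecting and sharing AI resources
Difficulty: Easy
Cost: High

Deep learning | Transformer | Segmentation | Video | 3D

arxiv | GitHub available | 2026-07-17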

DPNeXt: A Lightweight Multi-Scale Feature Fusion Framework for Efficient ViT-Based Multi-Task Dense Prediction

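Multi-task learning supports unified semantic segmentation and depth estimation in robotic visual understanding systems. Vision foundation models (VFMs) are widely adopted as powerful feature encoders, but existing decoding strategies are a significant bottlene

Use: 3D spatial understanding via multi-task learning in robotics
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-17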

FVAttn: Adaptive Sparse Attention with Runtime Load Balancing for Video Generation

Video Diffusion Transformers process long spatio-temporal sequences, making self-attention the main bottleneck

Quality prediction/Anomaly detection | Deep learning | Transformer | Generation | Video

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-17

Apple-π: Benchmarking Thinking with Video Towards Law-Grounded Physical Intelligence

Modern video generation models are increasingly hailed as emerging world models with an internalized grasp of

NLP | Large language models | Generation | Video

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-17

Audio-Visual Flamingo: Open Audio-Visual Intelligence for Long and Complex Videos

We present Audio-Visual Flamingo (AV-Flamingo), a fully open state-of-the-art audio-visual large language mode

Explainable | NLP | Large language models | Image | Text | Audio

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-17

mediapipe — Cross-platform, customizable ML solutions for live and streaming media.

mediapipe provides cross-platform, customizable ML solutions for live and streaming media.

MLOps | Model deployment | Audio | Video

Use: ML solutions for live and streaming media
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-16

Trajectory-aware Cross-view Geo-localization with Sequential Observations

Cross-view geo-localization matches ground-level observations against geo-tagged satellite imagery. Recent met

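Quality prediction/Anomaly detection | Deep learning | Compression and quantization | Detection | Image | Text

Use: Detection
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-16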

RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources

Skills are a useful abstraction for software agents, turning human and agent experience into reusable procedur

NLP | RAG | Image | Text | Video

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-16

TurboDiffusion — TurboDiffusion: 100–200× Acceleration for Video Diffusion Models

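TurboDiffusion accelerates video diffusion models by 100–200×.

Deep learning | Compression and quantization | Generation | Video

Use: Accelerating video diffusion models
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-15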

Open-AoE: An Open Egocentric Manipulation Dataset and Toolchain for Embodied Learning

Egocentric videos of human manipulation provide scalable supervision for embodied intelligence, yet existing r

Computer vision | Segmentation | Image | Text | Video

Use: Segmentation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-15

VideoRAE: Taming Video Foundation Models for Generative Modeling via Representation Autoencoders

Video generative models commonly rely on latent spaces learned by 3D Variational Autoencoders (3D-VAEs). Howev

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-14

ReflectWorld-MM: An Entity-Oriented Multimodal Memory System for Open-Ended Video Streams

Building assistants that can continually watch the world, remember what they see, and reason over their accumu

Computer vision | Multimodal | Image | Text | Audio

Use: Technical validation and paper-reading aid
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-14

memvid — Memory layer for AI Agents. Replace complex RAG pipelines with a serverless, single-file memory layer. Give your agents instant retrieval and long-term memory.

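MemVid is a serverless, single-file memory layer that gives AI agents instant retrieval and long-term memory.

NLP | Large language models | Generation | Text | Video

Use: Managing AI agent memory
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-10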

OpenLongTail: Generative Scaling of Long-Tail Driving Data

Scaling robust driving policies is fundamentally bottlenecked by the scarcity of edge cases in curated dataset

NLP | RAG | Generation | Image | Video

Use: Generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-07

cs-video-courses — List of Computer Science courses with video lectures.

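This repository provides a list of computer science courses with video lectures.

Machine learning | Supervised learning | Video

Use: Sharing educational resources
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-30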

ComfyUI-LTXVideo — LTX-Video Support for ComfyUI

Adds LTX-Video support to ComfyUI.

Generative AI | Diffusion models | Generation | Image | Text

Use: LTX-Video support in ComfyUI
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-29

HunyuanVideo — HunyuanVideo: A Systematic Framework For Large Video Generation Model

HunyuanVideo is a video generation model capable of generating complex sequences.

Deep learning | Transformer | Generation | Video

Use: Video generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-28

LanPaint — High quality training free inpaint for every stable diffusion model. Supports ComfyUI

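Provides high-quality, training-free inpainting for image generation. It works with Stable Diffusion models and supports ComfyUI.

Quality prediction/Anomaly detection | Generative AI | Diffusion models | Generation | Image | Video

Use: Image generation
Difficulty: Easy
Cost: High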