MLinfo | Machine Learning and AI Paper Roundup

transformers — 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

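🤗 Transformers is a model-definition framework for text, vision, audio, and multimodal models that can be used for both inference and training.

Deep learning, Transformer, Classification, Text, Audio

Use: Machine learning model definition
Difficulty: Easy
Cost: High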

Medical_Image_Analysis — Foundation models based medical image analysis

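Medical image analysis is a research field that extracts information from image data to support medical diagnosis and treatment. This work uses foundation models to propose a new approach to medical image analysis. found

Natural language processing, Large language models, Generation, Image, Text

Use: Medical image analysis
Difficulty: Easy
Cost: High

Natural language processing, Large language models, Text, Audio, Multimodal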

screenpipe — YC (S26) | Record your screen 24/7 and plug into your agents. Local, private, secure. Connect to OpenClaw, Hermes agent and 100+ apps

A tool for capturing user activity and building autonomous agents.

Use: Building autonomous agents
Difficulty: Easy
Cost: High

rerun — Visualize, query, and stream to train on multimodal robotics data.

An SDK for logging, storing, querying, and visualizing data.

Computer vision, Multimodal, Image

Use: Data logging and visualization
Difficulty: Easy
Cost: High

Deep learning, Transformer, Image, Text, Multimodal

sglang — SGLang is a high-performance serving framework for large language models and multimodal models.

SGLang is a high-performance serving framework for large language models.

Use: Serving large language models
Difficulty: Easy
Cost: High

Natural language processing, Large language models, Text, Multimodal

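ai-agent-book — Main open-source repository for the book 《深入理解 AI Agent：设计原理与工程实践》 (Understanding AI Agents in Depth: Design Principles and Engineering Practice) by 李博杰: full text, compiled PDF, and chapter-by-chapter companion code

The main open-source repository for a book on AI agent design principles and engineering practice, containing the full text, a compiled PDF, and companion code for each chapter.

Use: Learning AI agent design and engineering practice
Difficulty: Easy
Cost: High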

lance — Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming..

An open lakehouse format for multimodal AI. Data can be converted from Parquet in two lines of code, giving 100x faster random access. It also supports vector indexing and data versioning.

Use: Open lakehouse format
Difficulty: Easy
Cost: High

runanywhere-sdks — Production ready toolkit to run AI locally

A production-ready toolkit for running AI models locally.

Use: Running AI locally
Difficulty: Easy
Cost: High

verl-omni — Multimodal RL training framework for diffusion & omni models

A multimodal reinforcement learning (RL) training framework for diffusion and omni models.

Use: Multimodal RL training
Difficulty: Easy
Cost: High

haystack — Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems.

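An open-source AI orchestration framework. It supports designing the pipelines and agent workflows needed to build LLM applications.

Deep learning, Transformer, Generation, Summarization, Text

Use: Building LLM applications
Difficulty: Easy
Cost: High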

3D-Aware VLMs with Implicit and Explicit Geometries

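Proposes VLM-IE3D (Vision-Language Models with Implicit and Explicit 3D geometry), a new approach to 3D spatial understanding. VLM-IE3

Computer vision, 3D / point cloud, Detection, Image, Text

Use: Developing 3D spatial understanding techniques
Difficulty: Hard
Cost: High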

When Are Reasoning-Based Guardrails Not Efficient? ResponseGuard: A Fast Vision-Language Guard for Real-Time Moderation

A vision-language AI assistant returns its answer as a stream of generated tokens. Therefore, a safety guard t

Deep learning, Compression / quantization, Detection, Image, Text

Use: Detection
Difficulty: Hard
Cost: High

DINOde: Continuous Vision-Text Alignment for Open-Vocabulary Semantic Segmentation

Open-vocabulary semantic segmentation (OVSS) leverages textual semantics to segment objects beyond predefined

Natural language processing, RAG, Segmentation, Image, Text

Use: Segmentation
Difficulty: Hard
Cost: High

Decoupling Cross-Modality Manifold Discrepancy: Leveraging Visible Diffusion Priors for Infrared Super-Resolution

Infrared image super-resolution (IISR) mitigates the limitations imposed by low spatial resolution. Existing m

Natural language processing, RAG, Generation, Image, Multimodal

Use: Generation
Difficulty: Hard
Cost: High

Deep learning, Transformer, Image, Text, Multimodal

MVEI & EmObserver: Empowering MLLM-Oriented Visual Emotional Intelligence via Emotion Statement Judgement

Emotion recognition is essential for advancing modern AGI, but large-scale

Use: Emotion recognition
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-23

K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

Large language models are increasingly used in K-12 education, but existing benchmarks mainly test exam questi

Natural language processing, Large language models, QA, Image, Text

Use: QA
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-23

xtuner — A Next-Generation Training Engine Built for Ultra-Large MoE Models

xtuner is a training engine for fast training of ultra-large MoE models.

Natural language processing, Large language models, Generation, Multimodal

Use: Fast training of MoE models
Difficulty: Easy
Cost: High

Antigen-specific Antibody Multi-modal Foundation Model for Functional Antibody Design

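This work proposes an antigen-specific antibody multimodal foundation model (AAMFM), motivated by the need for epitope-level pairing between antigen and antibody when designing antigen-specific antibodies.

Natural language processing, RAG, Classification, Generation, Text

Use: Antigen-specific antibody design
Difficulty: Hard
Cost: High

Sensor / time series, Deep learning, Compression / quantization, Detection, Segmentation, Embedding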

Not All Patches are Equal: Sampling Matters for Visible-Infrared Pre-Training

Visible-infrared (VIS-IR) alignment is a key pre-training task for robust multi-sensor perception. Most existi

Use: Detection
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Transformer, Generation, Image, Text

SHFormer: Dynamic Spectral Filtering Convolutional Neural Network and High-pass Kernel Generation Transformer for Adaptive MRI Reconstruction

Attention Mechanism (AM) selectively focuses on essential information for imaging tasks and captures relations

Use: Generation
Difficulty: Hard
Cost: High

Natural language processing, Large language models, Image, Text, Multimodal

Development of an automated, reliable, and clinically meaningful artificial intelligence (AI) tool for diagnosing cardiac disease from conventional cardiovascular magnetic resonance (CMR) images

Aims: Cardiovascular magnetic resonance (CMR) imaging enables non-invasive assessment of myocardial structure,

Use: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, Multimodal, QA, Image

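Silent Failures in Multimodal Agentic Search: A Diagnostic Taxonomy and Cross-Judge Evaluation

This work proposes a new method for evaluating responses to visual questions. The method assesses not only the accuracy of the answers but also their patterns and characteristics.

Use: Evaluating responses to visual questions
Difficulty: Hard
Cost: High

Natural language processing, Prompt engineering, Detection, Image, Text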

ReferTrack: Referring Then Tracking for Embodied Visual Tracking

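ReferTrack is a system that has a car follow a nearby target vehicle specified in natural language. It first identifies the target vehicle and then predicts its motion.

Use: Having a car follow a target vehicle
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-07-21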

Deep Shape Regression for Planar Curves with Multimodal Covariates

Proposes a deep learning model for estimating the shape of open planar curves.

Deep learning, CNN, Regression, Image, Multimodal

Use: Multimodal shape estimation
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-07-21

Detect Early, Escalate Rarely: Anytime Detection of AI-Generated Video from the Compressed Bitstream

Detectors for AI-generated video are evaluated offline. A clip is decoded to pixels and scored once, increasin

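Easy to try on CPU, Deep learning, CNN, Detection, Image, Text

Use: Detection
Difficulty: Hard
Cost: High

Suited to MI, Quality prediction / anomaly detection, Natural language processing, Large language models, Image, Audio, Video

arxiv | GitHub available | 2026-07-21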

OmniReasoner: Thinking with Long Audio-Video via Native Tool Use

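Proposes OmniReasoner, a method that combines the original data with a Zoom-In tool. It improves reasoning by omni-modal LLMs over long audio-video inputs.

Use: Improving reasoning over long audio-video inputs
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-21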

Mage-Flow: An Efficient Native-Resolution Foundation Model for Image Generation and Editing

Large-scale visual generators are increasingly capable but costly to train, fine-tune, and deploy. We introduc

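Quality prediction / anomaly detection, Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

Explainable, Quality prediction / anomaly detection, Natural language processing, Large language models, Video, Multimodal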

EduPanel: A Three-Agent LLM Judge for Teaching Videos -- Reliability, Complementarity, and Human Trust Calibration

Teaching videos are becoming a major medium for education, creating a growing need for scalable evaluation of

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

ConsiSpace: Learning Geometric Consistency Matters for Video Spatial Reasoning

Video spatial reasoning is essential for navigation-oriented perception and long-video question answering, whe

Deep learning, Compression / quantization, QA, Text, Video

Use: QA
Difficulty: Easy
Cost: High

HOMIE: Human-object Centric Video Personalization via Multimodal Intelligent Enchancement

Human-object centric video personalization (HOCVP) is a core task within subject-driven video generation. Howe

Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

FlashRT: Agent Harness for Guiding Agents to Deploy Real-Time Multimodal Applications

Real-time multimodal applications, including voice agents and interactive video generation, compose heterogene

Deep learning, Compression / quantization, Generation, Text, Audio

Use: Generation
Difficulty: Easy
Cost: High

ReViV: Reconstructing the Viewer and the View in 4D from Monocular Egocentric Video

Egocentric devices, such as wearable front-facing cameras, provide a unique perspective for capturing the cont

Deep learning, Transformer, Generation, Video, 3D

Use: Generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-20

BentoML — The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

A library for serving models.

Natural language processing, Large language models, Generation, Multimodal

Use: Model serving
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-19

TimeLens2: Generalist Video Temporal Grounding with Multimodal LLMs

Video multimodal large language models (MLLMs) can describe what happens in a video, but rarely identify when

Natural language processing, Large language models, Detection, Text, Video

Use: Detection
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-07-18

Dataset Distillation by Influence Matching

We revisit dataset distillation from an outcome-centric perspective. Rather than aligning process surrogates (

Deep learning, Compression / quantization, Classification, Image, Text

Use: Classification
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-18

Can Multimodal Large Language Models Understand OCT?

Optical coherence tomography (OCT) imaging is essential for the diagnosis and treatment of retinal diseases. A

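Quality prediction / anomaly detection, Natural language processing, Large language models, Classification, QA, Image

Use: Classification
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-18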

maths-cs-ai-compendium — Become a cracked AI/ML Research Engineer

Presents methods for building the skills and knowledge of an AI/ML research engineer.

Computer vision, Multimodal, Text, Audio

Use: Training AI/ML researchers
Difficulty: Easy
Cost: High

Natural language processing, Large language models, Image, Text, Multimodal

An Exam for Active Observers

Human vision is a closed loop: gaze is continuously redirected by intermediate hypotheses rather than a single

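Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

Easy to try on CPU, Deep learning, Compression / quantization, Multimodal, Reinforcement learning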

JoyNexus: Service-Oriented Multi-Tenant Post-Training for VLA Models

The post-training of Vision-Language-Action (VLA) models is essential due to the diversity of simulators, robo

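Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

Suited to MI, Natural language processing, Large language models, Generation, Image, Text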

S1-Omni: A Unified Multimodal Reasoning Model for Scientific Understanding, Prediction, and Generation

We present S1-Omni, a unified multimodal reasoning model for scientific understanding, prediction, and generat

Use: Generation
Difficulty: Easy
Cost: High

Explainable, Natural language processing, Large language models, Image, Text, Audio

Audio-Visual Flamingo: Open Audio-Visual Intelligence for Long and Complex Videos

We present Audio-Visual Flamingo (AV-Flamingo), a fully open state-of-the-art audio-visual large language mode

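Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

Natural language processing, Large language models, Generation, Text, Multimodal

github | GitHub available | 2026-07-17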

generative-ai — Comprehensive resources on Generative AI, including a detailed roadmap, projects, use cases, interview preparation, and coding preparation.

A list of resources related to generative AI.

Use: Generative AI
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-16

RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources

Skills are a useful abstraction for software agents, turning human and agent experience into reusable procedur

Natural language processing, RAG, Image, Text, Video

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-16

Xiaomi-Robotics-1: Scaling Vision-Language-Action Models with over 100K Hours of Real-World Trajectories

We present Xiaomi-Robotics-1, a foundational vision-language-action (VLA) model capable of (1) following diver

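Deep learning, Compression / quantization, Generation, Text, Multimodal

Use: Generation
Difficulty: Easy
Cost: High

Deep learning, Transformer, Multimodal, Self-supervised

github | GitHub available | 2026-07-16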

stable-pretraining — Reliable, minimal and scalable library for pretraining foundation and world models

A library for pretraining foundation models. It is minimal and scales seamlessly.

Use: Pretraining foundation models
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-15

Generalizable VLA Finetuning via Representation Anchoring and Language-Action Alignment

Finetuning a pretrained vision-language model (VLM) on robot demonstrations via behavior cloning (BC) has beco

Computer vision, Segmentation, Image, Text, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-14

ReflectWorld-MM: An Entity-Oriented Multimodal Memory System for Open-Ended Video Streams

Building assistants that can continually watch the world, remember what they see, and reason over their accumu

Computer vision, Multimodal, Image, Text, Audio

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-13

See like a Robot: Robot-Centric Pointmaps for Vision-Language-Action Models

Vision-language-action (VLA) models predict robot actions from visual observations and language instructions.

Computer vision, 3D / point cloud, Image, 3D, Multimodal

Use: Technical validation and paper-reading support
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-13

SVR-R1: Bootstrapping Multi-modal Reasoning with Self-verification in Reinforcement Learning

We introduce Self-Verified Reasoner (SVR-R1), a multi-turn RL framework that turns a model's own verification

Computer vision, Segmentation, Generation, Multimodal, Reinforcement learning

Use: Generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-13

Awesome-Mixture-of-Experts — Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)

Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts

Use: Implementation and validation platform
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-13

UniPic — Open-source SOTA multi-image editing model

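UniPic is an implementation of an open-source, state-of-the-art multi-image editing model.

Computer vision, Multimodal, Generation, Image

Use: Implementing a multi-image editing model
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-10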

multimind-sdk — Your SDK solves all of this. One interface. Unified logic. Local + hosted models. Fine-tuning. Agent tools. Enterprise-ready. Hybrid RAG.Star 🌟 if you like it!

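An SDK that offers one interface and unified logic across local and hosted models, with fine-tuning, agent tools, and hybrid RAG.

Use: Unified SDK for local and hosted models
Difficulty: Easy
Cost: High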

huggingface | Hugging Face available | 2026-07-07

UI2App: Benchmarking Visual Interaction Inference in Executable Web Application Generation

Large language models (LLMs) have demonstrated growing competence in web page generation. However, existing te

Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-07

VLM-R1 — Solve Visual Understanding with Reinforced VLMs

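Proposes VLM-R1, a reinforced vision-language model (VLM) aimed at stronger visual understanding.

Natural language processing, Large language models, Image, Multimodal

Use: Solving visual understanding problems
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-03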

EEGUnity — An open source tool for large-scale EEG datasets processing

An open-source tool for processing large-scale EEG datasets.

Computer vision, Multimodal

Use: Processing large-scale EEG datasets
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-28

awesome-japanese-llm — 日本語LLMまとめ - Overview of Japanese LLMs

An overview of Japanese LLMs.

Natural language processing, Large language models, Generation, Multimodal

Use: Overview of Japanese LLMs
Difficulty: Easy
Cost: High