MLinfo | Machine Learning and AI Paper Summaries

Reinforcement learning, Policy gradient (PPO / A3C), Image, Text

PhysScene: A Scene Graph Dataset for Scientific Visual Reasoning in Physics Experiments

Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise

Use case: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Medium

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable r

Computer vision, Multimodal, Image, Text

Use case: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Sensor/Time series, Quality prediction/Anomaly detection, Computer vision, Segmentation, Generation, Image

TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution

Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR).

Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/Anomaly detection, Natural language processing, RAG, Image, Text, Audio

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist

Use case: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/Anomaly detection, Deep learning, Transformer, Detection, Image, Text

Temporal-Aware Reasoning Optimization for Video Temporal Grounding

Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding with r

Use case: Detection
Difficulty: Hard
Cost: High

Semi-supervised Source Detection in Astronomical Images: New Benchmark and Strong Baseline

Source detection in modern observational astronomy is a cornerstone for localizing and identifying stellar sou

Machine learning, Supervised learning, Detection, Generation, Image

Use case: Detection
Difficulty: Easy
Cost: Medium

CHROMA: Detecting AI-Generated Images through Inter-Channel Color-Space Correlations

The rapid adoption of diffusion and large-scale generative models has made it increasingly challenging to dist

Deep learning, CNN, Detection, Generation, Image

Use case: Detection
Difficulty: Hard
Cost: High

BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation

Despite the success of image generation from text descriptions, it still faces challenges that are difficult t

Deep learning, Transformer, Generation, Image, Text

Use case: Generation
Difficulty: Easy
Cost: Low

Sensor/Time series, Deep learning, Model compression/Quantization, Generation, Image, Text

IR-SIM: A Lightweight Skill-Native Simulator for Navigation, Learning, and Benchmarking

Simulation plays a key role in automated robotics research supported by large language models (LLMs). However,

Use case: Generation
Difficulty: Hard
Cost: High

OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework

Data pruning (DP), as an oft-stated strategy to alleviate heavy training burdens, reduces the volume of traini

Explainable, Deep learning, Model compression/Quantization, Image

Use case: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Vision-Language Work Zone Intelligence for Safety-Critical Speed Regulation of Mixed-Autonomy Vehicles in Dynamic Environments

Temporary work-zone speed limits are communicated through visually inconsistent signage and are often missing

Computer vision, Object detection, Classification, Detection, Image

Use case: Classification
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Detection, Image, Unsupervised

AUCp: Pseudo-AUC for Inference Model Selection with Unlabeled Validation Data in Abnormality Detection

Abnormality detection is a crucial yet challenging task in medical image analysis. Distinguishing abnormalitie

Use case: Detection
Difficulty: Hard
Cost: High

Quality prediction/Anomaly detection, Deep learning, Transformer, Generation, Image, Video

OmniTryOn: Video Try-On Anything at Once!

Although video virtual try-on (VVT) has achieved significant progress, existing methods still exhibit two fund

Use case: Generation
Difficulty: Hard
Cost: High

Natural language processing, Large language models, Image, Text, Multimodal

TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models.

Use case: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication

Palmprint modality offers a privacy-preserving biometric solution, yet its deployment is hindered by the domai

Natural language processing, Large language models, Image

Use case: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Suited to tabular data, Computer vision, Segmentation, Generation, Image, Text

Segmentation-Assisted Brain MRI Synthesis with Cross-Image Multi-Contrast Feature Memory Bank Retrieval Augmentation

Multi-contrast brain MRI provide complementary soft-tissue characteristics that aid in the screening and diagn

Use case: Generation
Difficulty: Easy
Cost: Low

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

World action models inherit the predictive capability of world models, enabling action generation to be guided

Natural language processing, RAG, Generation, Image, Multimodal

Use case: Generation
Difficulty: Hard
Cost: High

Natural language processing, Prompt engineering, Image, 3D, Multimodal

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world depl

Use case: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Explainable, Quality prediction/Anomaly detection, Natural language processing, Large language models, Image, Text, Multimodal

arXiv, GitHub available, 2026-06-06

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet the

Use case: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

arXiv, GitHub available, 2026-06-06

G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation

Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-

Deep learning, Transformer, Detection, Image

Use case: Detection
Difficulty: Hard
Cost: High

Suited to MI, Quality prediction/Anomaly detection, Natural language processing, Large language models, Generation, Image, Text

arXiv, GitHub available, 2026-06-06

VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation

Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but

Use case: Generation
Difficulty: Hard
Cost: High

arXiv, GitHub available, 2026-06-05

Constructing VAE Latent Spaces with Prescribed Topology

Variational autoencoders (VAEs) learn low-dimensional latent representations of high-dimensional data. When th

Quality prediction/Anomaly detection, Generative AI, VAE, Image

Use case: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Explainable, Natural language processing, Large language models, Classification, Image, Text

arXiv, GitHub available, 2026-06-05

LLM-Guided Evolution for Medical Decision Pipelines

Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt

Use case: Classification
Difficulty: Hard
Cost: High

arXiv, GitHub available, 2026-06-05

RhinoVLA Technical Report

This paper proposes a framework for deploying VLA models on edge hardware. Using edge hardware, the method V

Deep learning, Model compression/Quantization, Image, Text, Multimodal

Use case: Method for deploying VLA models on edge hardware
Difficulty: Hard
Cost: High

arXiv, GitHub available, 2026-06-04

A Conversational Framework for Human-Robot Collaborative Manipulation with Distributed Generative AI models

This work proposes a Distributed Conversational Framework for human-robot collaboration.

Natural language processing, Large language models, Generation, Image, Text

Use case: Human-robot collaboration
Difficulty: Hard
Cost: High