MLinfo | Machine Learning and AI Paper Summaries

Quality prediction/anomaly detection, Computer vision, Segmentation, Generation, Image, Text

Echo-Memory: A Controlled Study of Memory in Action World Models

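This study designs and evaluates an episodic memory model in order to control episodic memory. The model stores important information within an episode and can identify correlations across episodes.

Use case: Episodic memory
Difficulty: Hard
Cost: High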

Discovering Functionally Selective Brain Regions with a Deep Topographic Multimodal Model

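This study assumes that neurons within a brain region share the same response profile, infers the response profiles of neurons in neighboring brain regions, and identifies connections between areas.

Natural language processing, RAG, Image, Multimodal

Use case: Brain region research
Difficulty: Hard
Cost: High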

Difference-Aware Retrieval Policies for Imitation Learning

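In this study, retrieval in imitation learning

MLOps, Model deployment, Anomaly detection, Image

Use case: Retrieval policies for imitation learning
Difficulty: Hard
Cost: High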

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

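Content moderation systems powered by large language models (LLMs) play an important role in preventing harmful online content. However, the main goal of these systems is merely to operate on tokenized text

Natural language processing, Large language models, Classification, Detection, Image

Use case: Document classification
Difficulty: Hard
Cost: High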

Muon Learns More Robust and Transferable Features than Adam

Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vis

Deep learning, Transformer, Classification, Image, Text

Use case: Classification
Difficulty: Hard
Cost: High

Automating the Expert Eye: A System-Agnostic Deep Learning Framework for Rare Event Discovery in Imbalanced Force Spectroscopy

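Proposes automated analysis of SMFS data, using a model to analyze imbalanced SMFS data.

Explainable, MI-oriented, Deep learning, CNN, Generation, Image

Use case: Automated analysis of SMFS data
Difficulty: Hard
Cost: Medium

Explainable, Math and theory, Interpretability (XAI), Classification, Detection, Image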

SAILS: Surrogate-based Analysis of Interactions via Local Effect Smooths

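This study presents a method called Surrogate-based Analysis of Interactions via Local Effect Smooths (SAILS) that detects interactions between structures and estimates functional interactions

Use case: Detecting functional interactions between structures
Difficulty: Hard
Cost: Low

Explainable, Sensor/time series, Deep learning, CNN, Image, Text, Multimodal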

Zero-Shot Semantic Re-Identification for Autonomous Driving: A VLM Baseline Study

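This study establishes a baseline for zero-shot semantic re-identification and automates semantic identification in images.

Use case: Semantic re-identification
Difficulty: Hard
Cost: High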

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

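This study builds an omni-modal retrieval system that integrates data from different modalities such as text, images, video, and audio.

Natural language processing, Fine-tuning, Regression, Retrieval, Image

Use case: Omni-modal retrieval
Difficulty: Hard
Cost: High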

PRISM: Topology-Aware Cross-Modal Imputation for Modality-Deficient Federated Graph Learning

Multimodal federated graph learning (MM-FGL) aims to collaboratively learn from decentralized graphs with text

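Natural language processing, RAG, Image, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

MI-oriented, Natural language processing, Fine-tuning, Image, Text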

Orange Lab: Lowering Barriers to Data Mining through Embedded Interactive Workflows

This paper proposes Orange Lab, a visual programming framework for data mining. It provides a web-based data analysis environment and, as a user-facing analysis tool, data

Use case: Data analysis workflows
Difficulty: Hard
Cost: Medium

Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA

In this paper, VideoQA's excessive credibility

Computer vision, Multimodal, Detection, Image, Video

Use case: Counterfactual reasoning for VideoQA
Difficulty: Hard
Cost: High

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed fo

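Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Graph neural networks, Classification, Detection, Anomaly detection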

Beyond Convolution: Advancing Hypergraph Neural Networks with Hypergraph U-Nets

Convolutions have successfully transitioned from image processing to the complex realm of non-Euclidean higher

Use case: Classification
Difficulty: Hard
Cost: Low

Data augmented bootstrap: Unifying confidence interval construction by approximate invariance

We propose the data augmented bootstrap (DAB), a framework for constructing confidence intervals from approxim

Natural language processing, RAG, Image

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Low

From Hazard Functions to Language Space: Cox-Supervised Distillation of Survival Risk into a Large Language Model

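Proposes a new approach that uses the Cox proportional hazards model to apply language models to survival risk.

Deep learning, Model compression and quantization, Generation, Image, Text

Use case: Applying language models to survival risk
Difficulty: Hard
Cost: High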

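AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

This paper proposes a method that jointly models the dynamics of the robot's visual scene and its manipulation in order to improve control in robotic surgery.

Deep learning, Transformer, Image, Text, Video

Use case: Remote handling control
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, RAG, Detection, Segmentation, Anomaly detection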

Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision

Recent Anomaly Detection methods achieve perfect detection and segmentation scores on well-established dataset

Use case: ア
Difficulty: Hard
Cost: High

Natural language processing, Large language models, Image, Text, Multimodal

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and op

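Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Computer vision, Video recognition, Detection, Image, Text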

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

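Designs an architecture for verifying generated text in order to support LLM-assisted drafting of clinical research papers. This prevents false citations, inaccurately reported figures, and guideline violations.

Use case: Support for writing medical papers
Difficulty: Hard
Cost: High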

I Was Scrolling and Then I Saw a Pregnant Strawberry

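AI mini-dramas (or fruit dramas) are short, algorithmic, and distributed generative-AI video series that have recently spread across social media platforms. The visuals of these videos depict fruit that appears sexualized

Deep learning, Transformer, Generation, Image, Video

Use case: AI mini-dramas
Difficulty: Hard
Cost: High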

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multi

Deep learning, Model compression and quantization, Image, Text, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs

Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning t

Natural language processing, Large language models, QA, Image, Text

Use case: QA
Difficulty: Hard
Cost: High

CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily

Deep learning, Model compression and quantization, Image, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

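Proposes WeaveBench for evaluating multimodal agents; it assesses capabilities with hybrid interfaces.

Machine learning, Supervised learning, Image

Use case: Evaluating multimodal agents
Difficulty: Hard
Cost: Medium

MI-oriented, Quality prediction/anomaly detection, Computer vision, Multimodal, Classification, Detection, Image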

Context-Aware Deep Learning for Defect Classification in Atomic-Resolution STEM

Proposes context-aware deep learning for non-destructive inspection of materials, detecting airlock defects.

Use case: Non-destructive inspection of materials
Difficulty: Hard
Cost: High

RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour

We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a we

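MLOps, Pipeline construction, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

Reinforcement learning, Policy gradient (PPO / A3C), Image, Text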

PhysScene: A Scene Graph Dataset for Scientific Visual Reasoning in Physics Experiments

Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise

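Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

Quality prediction/anomaly detection, Deep learning, Transformer, Classification, Image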

Beyond Humans: Multispecies Animal Face Recognition Using Transfer Learning

Proposes a method for identifying animals using face images from different species.

Use case: Recognizing lost pets and individuals of protected species
Difficulty: Hard
Cost: Low

FF-JEPA: Long-Horizon Planning in World Models with Latent Planners

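Proposes FF-JEPA, which uses a world model and latent states to perform long-horizon planning.

Natural language processing, RAG, Image

Use case: Using latent states to carry out long-horizon planning
Difficulty: Hard
Cost: Low

Sensor/time series, Deep learning, Transformer, Generation, Image, Audio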

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

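A new framework for the inverse design of acoustic metamaterials that accounts for variable bandwidth, the Physics-Guided Sequence-Based Generative Framework for Acoustic Metama

Use case: Inverse design of acoustic metamaterials accounting for variable bandwidth
Difficulty: Hard
Cost: High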

EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video

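Proposes EgoTactile, a model that estimates hand pressure from egocentric video.

Sensor/time series, Natural language processing, RAG, Image, Video, 3D

Use case: Estimating hand pressure from egocentric video
Difficulty: Hard
Cost: High

Natural language processing, Prompt engineering, Generation, Image, Text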

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation wi

Use case: Generation
Difficulty: Hard
Cost: High

Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

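Uses egocentric vision to predict whether pedestrians will cross the road. By formulating the task as a closed-ended visual question answering (VQA) problem, it uses vision-language models

Deep learning, Transformer, QA, Image, Text

Use case: Predicting whether pedestrians will cross the road
Difficulty: Hard
Cost: High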

Vision Language Model Helps Private Information De-Identification in Vision Data

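Vision-language models (VLMs) are highly effective for privacy protection. However, the privacy risks of handling visual data had received little attention until now. Using VLMs to ensure privacy protection

Computer vision, Object detection, Classification, Detection, Image

Use case: Privacy protection for visual data using vision-language models
Difficulty: Hard
Cost: High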

Unveiling Privacy Risks in Multi-modal Large Language Models: Task-specific Vulnerabilities and Mitigation Challenges

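The privacy risks of large language models have already been studied, but those of multimodal large language models (MLLMs) had not yet been sufficiently investigated. In MLLMs, not only text but also image data

Natural language processing, Large language models, Image, Text

Use case: Privacy risks in multimodal large language models
Difficulty: Hard
Cost: High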

See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

Recent advances in Video Large Language Models (Video-LLMs) have enabled performance on long-video understandi

Natural language processing, Large language models, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

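Vision-and-language navigation agents can explore environments by following language instructions. For zero-shot vision-and-language navigation agents, safety in unknown environments and

Deep learning, Model compression and quantization, Detection, Image, 3D

Use case: Zero-shot vision-and-language navigation agent for bioinformatics
Difficulty: Hard
Cost: High

Computer vision, Multimodal, QA, Image, Text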

Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

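Introduces Baichuan-M4, a clinical-grade LLM medical system suited to continuous care. Baichuan-M4, a clinical medical agent system, builds on an integrated medical agent system, and between medical agents

Use case: LLM-based medical agent for integrated medical systems
Difficulty: Hard
Cost: High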

An Effective Router for Vision-Language Model Selection

Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making i

Natural language processing, Large language models, Anomaly detection, Image, Text

Use case: Anomaly detection
Difficulty: Hard
Cost: High

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning ov

Natural language processing, RAG, Image, Multimodal, Reinforcement learning

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but

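Natural language processing, Large language models, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

Sensor/time series, Quality prediction/anomaly detection, Deep learning, Transformer, Detection, Generation, Segmentation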

PolyBuild: An End-to-End Method for Polygonal Building Contour Extraction from High-Resolution Remote Sensing Images

Extracting building polygon contours from high-resolution remote sensing images is a fundamental task for vari

Use case: Detection
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection, Natural language processing, Fine-tuning, Detection, Image, Text

Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection

Semiconductor lithography inspection requires reliable detection of small pattern defects such as bridge, burr

Use case: Detection
Difficulty: Hard
Cost: High

A multi-agent system for spine MRI report generation from multi-sequence imaging

Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluat

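Explainable, Natural language processing, Embeddings and retrieval, Classification, Detection, Generation

Use case: Classification
Difficulty: Hard
Cost: High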

Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer acc

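Natural language processing, Large language models, QA, Image, Text

Use case: QA
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Computer vision, Multimodal, Classification, Image, Text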

Guide Me Out: A Framework to Benchmark VLM Operators Communication in Crisis Scenarios

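In crisis management, communication and geography

Use case: Evaluating communication in crisis management
Difficulty: Hard
Cost: High

Sensor/time series, Deep learning, Transformer, Classification, Image, Text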

NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China.

Use case: Classification
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection, Natural language processing, Large language models, Image, Text

TruthSplit: Operationalizing Conditional Validity in Arguments Through Multi-Perspective Reasoning

We present TruthSplit, an interactive system for multi-perspective argument analysis. Existing argumentation t

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Symbolic and Abstractive Reasoning with Complex Visual Queries

Understanding and reasoning over abstract visual content remains a challenge for current multi-modal large lan

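Natural language processing, Large language models, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Explainable, Natural language processing, RAG, Image, Text, Multimodal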

Explicit Representation Alignment for Multimodal Sentiment Analysis

Multimodal affective analysis aims to understand human sentiment and emotion by jointly modeling heterogeneous

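Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Small-data friendly, Explainable, Deep learning, Model compression and quantization, Detection, Image, Text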

MAAM: Anchor-Preserving Compression and Contextual Calibration for Chinese Discriminatory Language Detection

Chinese discriminatory-language detection is challenging because harmful intent is often implicit and context-

Use case: Detection
Difficulty: Hard
Cost: High

CRANE: Knowledge Editing for Reasoning MLLMs

The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought

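Natural language processing, Large language models, Anomaly detection, Image, Text

Use case: Anomaly detection
Difficulty: Hard
Cost: High

Tabular-data friendly, Quality prediction/anomaly detection, Natural language processing, RAG, Classification, QA, Image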

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of

Use case: Classification
Difficulty: Hard
Cost: High

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable r

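Computer vision, Multimodal, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, RAG, Generation, Image, Text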

Latent Spatial Memory for Video World Models

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit poi

Use case: Generation
Difficulty: Hard
Cost: High

iMaC: Translating Actions into Motion and Contact Images for Embodied World Models

Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive en

Natural language processing, Embeddings and retrieval, Image

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Low

POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and ef

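Deep learning, Transformer, Detection, Image, Text

Use case: Detection
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Computer vision, Segmentation, Generation, Image, Text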

Cranio-Diff: Diffusion-based Cross-domain Craniofacial Reconstruction with 2D X-ray Skull Guidance and Structural Identity Constraints

The state-of-the-art generative models, such as CycleGAN, Pix2Pix, and diffusion models have demonstrated rema

Use case: Generation
Difficulty: Hard
Cost: High

SoccerNet 2026 Player-Centric Ball-Action Spotting: Retraining and Post-Processing Extensions to the FOOTPASS Baselines

We describe our system for the SoccerNet 2026 Player-Centric Ball-Action Spotting Challenge, which requires pr

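Deep learning, Graph neural networks, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Sensor/time series, Quality prediction/anomaly detection, Computer vision, Segmentation, Generation, Image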

TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution

Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR).

Use case: Generation
Difficulty: Hard
Cost: High

Efficient Minimal Solvers for Relative Pose Estimation in Autonomous Driving Applications

With the advancement of visual sensing systems, computer vision is playing an increasingly important role in a

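Sensor/time series, Deep learning, Model compression and quantization, Detection, Generation, Image

Use case: Detection
Difficulty: Hard
Cost: Medium

Quality prediction/anomaly detection, Deep learning, Normalization and optimization methods, Classification, Detection, Segmentation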

Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation

Conventional one-hot encodings often yield poorly calibrated models, being overconfident under attack, and let

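Use case: Classification
Difficulty: Hard
Cost: Low

Tabular-data friendly, Easy to try on CPU, Natural language processing, RAG, Classification, Detection, Anomaly detection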

Securing Self-supervised Data Curation for Foundation Models Robustness

Self-supervised data curation provides a pathway to scaling and improving the generalization capabilities of m

Use case: Classification
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Generation, Image, Video

Prisma-World: Camera-Controllable Multi-Agent Video World Model

Video world models have made rapid progress in generating controllable visual experiences, but most of them st

Use case: Generation
Difficulty: Hard
Cost: High

ContextShift: A Controlled Benchmark for Context Dependence in Object Detection

Modern object detectors achieve strong performance on standard benchmarks, yet their robustness to contextual

Computer vision, Object detection, Detection, Image, Text

Use case: Detection
Difficulty: Hard
Cost: High

Optical Music Recognition for Real-World Manuscripts with Synthetic Data

Optical Music Recognition (OMR) has seen major progress in model design, with end-to-end methods now capable o

MLOps, Model deployment, Classification, Generation, Image

Use case: Classification
Difficulty: Hard
Cost: High

Efficient Minimal Solvers for Visual-Inertial Relative Pose Estimation in Multi-Camera Systems

Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critic

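Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Low

Small-data friendly, Natural language processing, Prompt engineering, Classification, Segmentation, Image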

Training-Free Generalized Few-Shot Segmentation through Open-Vocabulary Semantic Arbitration

Generalized Few-Shot Semantic Segmentation (GFSS) has traditionally been approached as a representation-learni

Use case: Classification
Difficulty: Hard
Cost: High

GD-MIL: Grade-Disentangled Multiple Instance Learning for Multimodal Biochemical Recurrence Prediction in Prostate Cancer

Biochemical recurrence (BCR) after radical prostatectomy is a critical endpoint in prostate cancer, yet risk s

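Deep learning, CNN, Image, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Explainable, Deep learning, Transformer, Classification, Detection, Image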

Leveraging Morphology for Historical Script Metrological Analysis

Advances in handwritten text recognition have enabled large-scale transcription of historical documents, but s

Use case: Classification
Difficulty: Hard
Cost: High

vesselFM-CT: Segmenting All Blood Vessels in CT Images for System-Level Cardiovascular Analysis

The vascular network in the human body is characterized by blood vessels exhibiting drastic structural variati

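Computer vision, 3D and point clouds, Classification, Generation, Image

Use case: Classification
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, Large language models, Image, Text, Video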

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a crit

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

An Opticalmechanics Framework for Dynamic Estimation of Multibody Systems

Conventional dynamics analysis of the human body is often constrained by the need for contact force and torque

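Sensor/time series, Math and theory, Optimization, Image

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

Quality prediction/anomaly detection, Natural language processing, RAG, Image, Text, Audio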

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

ExDet: Open-Domain Open-Vocabulary Detection with Cross-modal Extrapolation and Rectification

Open-domain open-vocabulary detection (ODOVD) requires detectors to generalize to both novel categories and un

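Deep learning, Model compression and quantization, Classification, Detection, Image

Use case: Classification
Difficulty: Hard
Cost: High

Sensor/time series, Computer vision, Video recognition, Image, Text, Multimodal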

IB-HFN: Information Bottleneck-Driven SAR-Optical Fusion Network for High-Fidelity Cloud Removal

Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by

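Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Deep learning, Normalization and optimization methods, Classification, Generation, Segmentation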

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal

Use case: Classification
Difficulty: Hard
Cost: High

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making s

Deep learning, Model compression and quantization, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

EditSSC: Toward Editable Semantic Occupancy Scenes with Unconditional Diffusion Models

3D semantic scene generation is crucial for autonomous driving applications, yet most methods rely on complex

Deep learning, Model compression and quantization, Generation, Image, 3D

Use case: Generation
Difficulty: Hard
Cost: High

See More, Match Better: Multi-Source Feature Fusion for Two-View Correspondence Learning

Two-view correspondence learning aims to distinguish true correspondences (inliers) from false ones (outliers)

Natural language processing, RAG, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Low

A practical probabilistic framework for deformable image registration uncertainty in radiotherapy dose propagation

Deformable image registration (DIR) is widely used in radiotherapy for dose propagation and accumulation, but

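Explainable, Deep learning, Model compression and quantization, Image

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

Explainable, Computer vision, Multimodal, Generation, Image, Text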

MAGIS: Evidence-Based Multi-Agent Reasoning for Interpretable Strabismus Clinical Decision-Making

Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatme

Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Detection, Image, Text

Temporal-Aware Reasoning Optimization for Video Temporal Grounding

Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding with r

Use case: Detection
Difficulty: Hard
Cost: High

Semi-supervised Source Detection in Astronomical Images: New Benchmark and Strong Baseline

Source detection in modern observational astronomy is a cornerstone for localizing and identifying stellar sou

Machine learning, Supervised learning, Detection, Generation, Image

Use case: Detection
Difficulty: Easy
Cost: Medium

Minimal Solvers for Full-DoF Motion Estimation from Asynchronous Differential SfM

As a bio-inspired intelligent sensor, event cameras have introduced a new paradigm in the intelligent percepti

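Sensor/time series, Deep learning, Model compression and quantization, Image

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Low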

Event-driven dynamic trajectories reconstruction and measurement of mechanical parameters for fragments

During warhead detonation, high-density, high-speed, and mutually occluded fragments are generated. Their mech

Sensor/time series, Natural language processing, RAG, Image, 3D

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

CP4D: Compositional Physics-aware 4D Scene Generation

4D generation (i.e., dynamic 3D generation) has recently emerged as a rapidly growing research fronti

MI-oriented, Natural language processing, RAG, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating

Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (

Deep learning, Model compression and quantization, Image, Text, Video

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Zero-Parameter Geometric Gating for Temporally Stable Low-Altitude UAV Video Semantic Segmentation

Video semantic segmentation for low-altitude UAVs requires temporal consistency, yet dense optical flow introd

Computer vision, Segmentation, Image, Video

Use case: Segmentation
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Generation, Image, Text

OmniGen-AR: AutoRegressive Any-to-Image Generation

Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performa

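Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Model compression and quantization, Generation, Image, Video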

Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined

Use case: Generation
Difficulty: Hard
Cost: High

DiffSight-Former: Modeling Structural Differences and Temporal Dynamics for Glaucoma Progression Prediction

Glaucoma is a leading cause of irreversible blindness worldwide, and early detection from fundus images is cri

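Deep learning, Transformer, Detection, Image

Use case: Detection
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection, Natural language processing, Large language models, Generation, Image, Text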

HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging

Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them pron

Use case: Generation
Difficulty: Hard
Cost: High

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Reward models are central to text-to-image post-training, but visual preference is subjective and better repre

Deep learning, Model compression and quantization, Generation, Image, Text

Use case: Generation
Difficulty: Hard
Cost: High

Leveraging NeRF-Rendered Images for 3D Gaussian Splatting

Neural radiance field (NeRF) and 3D Gaussian splatting (3DGS) are two mainstream approaches for novel view syn

Quality prediction/anomaly detection, Natural language processing, RAG, Generation, Image, 3D

Use case: Generation
Difficulty: Hard
Cost: High

Frequency Decoupled Framework for Screen Content Image Super-Resolution

Methods based on implicit neural representations have demonstrated superior performance in Screen Content Imag

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation

This paper introduces EPS3D, a new end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentat

Deep learning, Model compression and quantization, Segmentation, Image, 3D

Use case: Segmentation
Difficulty: Hard
Cost: High

When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models

Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existin

Deep learning, Transformer, Detection, Image, Text

Use case: Detection
Difficulty: Hard
Cost: High

Modeling Components and Connections in Cyber-Physical Systems

Text based configuration files for cyber-physical systems show the hierarchy of component modules well but oft

Reinforcement learning, Model-based, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

ω-EVA: Envision, Verify, and Act with Latent Interactive World Models

Embodied policies typically map current observations directly to actions, leaving candidate-action consequence

Reinforcement learning, Model-based, Generation, Image, Video

Use case: Generation
Difficulty: Hard
Cost: High

Dual Quaternion-Based Unscented Kalman Filter with Visual Inertial Odometry for Navigation in GPS-Denied Environments

Reliable navigation in GPS-denied environments remains a fundamental challenge in robotics, aerospace, and aut

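Sensor/time series, Machine learning, Time series, Generation, Image

Use case: Generation
Difficulty: Hard
Cost: Medium

Sensor/time series, Deep learning, Model compression and quantization, Detection, Image, Text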

VGP-Nav: Metric-Aware Visual Geometric Perception for Robot Navigation

Reliable robotic navigation necessitates the seamless integration of accurate global localization and dense, m

Use case: Detection
Difficulty: Hard
Cost: High

Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection

Vision-language-action (VLA) policies can deviate from nominal trajectories during manipulation, even when tas

Natural language processing, RAG, Image, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on t

Natural language processing, RAG, Image, Video, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

CHROMA: Detecting AI-Generated Images through Inter-Channel Color-Space Correlations

The rapid adoption of diffusion and large-scale generative models has made it increasingly challenging to dist

Deep learning, CNN, Detection, Generation, Image

Use case: Detection
Difficulty: Hard
Cost: High

BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation

Despite the success of image generation from text descriptions, it still faces challenges that are difficult t

Use case: Generation
Difficulty: Easy
Cost: Low

How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects

Visual Language Models (VLMs) are known to produce hallucinated predictions that are not grounded in visual ev

Natural language processing, RAG, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Low

TeamHerald@CHIPSAL 2026: Hate Speech Detection and Sentiment Analysis of Nepali Memes using Transformer-based Architectures and Ensemble Learning

The analysis of internet memes in the Nepali language is complicated by frequent code-mixing and a lack of est

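Deep learning, Transformer, Classification, Detection, Image

Use case: Classification
Difficulty: Hard
Cost: Low

Sensor/time series, Deep learning, Model compression and quantization, Generation, Image, Text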

IR-SIM: A Lightweight Skill-Native Simulator for Navigation, Learning, and Benchmarking

Simulation plays a key role in automated robotics research supported by large language models (LLMs). However,

Use case: Generation
Difficulty: Hard
Cost: High

Learning to Solve Generative ODEs Beyond the Linear Span

Diffusion and flow generative models sample by integrating a learned ODE, but high quality still requires many

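Quality prediction/anomaly detection, Deep learning, Model compression and quantization, Generation, Image

Use case: Generation
Difficulty: Hard
Cost: High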

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but con

Deep learning, Transformer, Image, Multimodal

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

How Much Capacity Does EEG Denoising Need? Ultra-Compact Networks reveal Benchmark Saturation and Metric-Utility Gap

Deep learning EEG denoising architectures have scaled from tens of thousands to tens of millions of parameters

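Use case: Classification
Difficulty: Hard
Cost: High

Explainable, Sensor/time series, Machine learning, Supervised learning, Classification, Detection, Image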

A spectral audit framework reveals task-dependent aperiodic reliance across EEG and ECG deep learning

Deep learning on physiological time series is interpreted through domain-specific features -- oscillatory rhyt

Use case: Classification
Difficulty: Hard
Cost: Low

OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework

Data pruning (DP), as an oft-stated strategy to alleviate heavy training burdens, reduces the volume of traini

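Explainable, Deep learning, Model compression and quantization, Image

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Small-data friendly, Deep learning, Transformer, Generation, Image, Text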

ZIPP: Zero-shot Image Personalization from Personas

Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs re

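Use case: Generation
Difficulty: Hard
Cost: High

Tabular-data friendly, Explainable, Natural language processing, Large language models, Classification, Detection, Generation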

Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution

In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning cann

Use case: Classification
Difficulty: Hard
Cost: High

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

Visual world models have shown great potential in learning complex system dynamics. Recent advancements levera

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

A Joint Finite-Sample Certificate for Adaptive Selective Conformal Risk Control

Selective predictors answer on confident inputs and abstain elsewhere; deploying one safely needs a single fin

Deep learning, CNN, Segmentation, Image

Use case: Segmentation
Difficulty: Hard
Cost: Medium

Deep learning, Transformer, Image, Text, Multimodal

When Correct Decisions Hide Internal Stress: Decision-State Probing in Multimodal Language Models

Multimodal language models are typically evaluated through external behavior: selecting the correct image--tex

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Vision-Language Work Zone Intelligence for Safety-Critical Speed Regulation of Mixed-Autonomy Vehicles in Dynamic Environments

Temporary work-zone speed limits are communicated through visually inconsistent signage and are often missing

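Computer vision, Object detection, Classification, Detection, Image

Use case: Classification
Difficulty: Hard
Cost: High

Sensor/time series, Quality prediction/anomaly detection, Natural language processing, RAG, Detection, Regression, Image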

Geometry-Aware Fisheye-LiDAR Fusion for Robust 3D Object Detection in Low-Overlap Setups

As autonomous systems expand from capital-intensive robotaxis to cost-sensitive logistics, sensor configuratio

Use case: Detection
Difficulty: Hard
Cost: High

CSFlow: Aligning Flow Matching with Human Contrast Sensitivity

We introduce Contrast Sensitive Flow (CSFlow), a weighting scheme that connects the human eye's Contrast Sensi

Deep learning, Transformer, Generation, Image

Use case: Generation
Difficulty: Hard
Cost: High

Classifying galaxies in the Galaxy10 DECals dataset using Inception and Residual CNNs

Image data regarding galactic morphology is expected to increase both in quantity and quality for the next for

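Quality prediction/anomaly detection, Deep learning, CNN, Classification, Image

Use case: Classification
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection, Natural language processing, RAG, Classification, Detection, Segmentation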

PairWise Image Finder: An Open-source Tool for Finding Visually Aligned Street-Level Image Pairs for Urban Perception Studies

Change detection and scene recognition techniques have been widely applied to Street View Imagery (SVI) to und

Use case: Classification
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Image, Self-supervised

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

Representation alignment with pretrained vision models has recently shown strong potential for accelerating di

Use case: Generation
Difficulty: Hard
Cost: High

DeepMine-Mamba: Mitigating Information Dilution in Mamba-Based State Space Models for Document Image Binarization

Document image binarization aims to separate foreground text from degraded backgrounds while preserving thin,

Use case: Generation
Difficulty: Hard
Cost: Low

Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing

Existing zero-shot video editing methods rely on pre-trained diffusion models, successfully achieving spatial

Natural language processing | RAG | Image | Video

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Sensor/time series | Deep learning | Compression/quantization | Image | Multimodal

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual obs

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection | Computer vision | Segmentation | Image | 3D

Less Is More: Training-Free Acceleration Framework of 3D Diffusion Models for Low-Count PET Denoising via Global-Local Trajectory Reduction

Accurate quantification and uptake measurement in PET are critical for assessing disease progression and suppo

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Stain-Aware Wavelet Regularization for Instant Adversarial Purification in Histopathology

Deep learning has become prevalent in computational pathology pipelines that support tasks such as cancer scre

Natural language processing | RAG | Image | Text

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Low

Computer vision | Segmentation | Detection | Image | Unsupervised

AUCp: Pseudo-AUC for Inference Model Selection with Unlabeled Validation Data in Abnormality Detection

Abnormality detection is a crucial yet challenging task in medical image analysis. Distinguishing abnormalitie

Use case: Detection
Difficulty: Hard
Cost: High

Thinking Without Images: Internalizing Visual Manipulation with On-Policy Self-Distillation

"Thinking with Images" has emerged as an effective paradigm for fine-grained visual reasoning: by explicitly

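Quality prediction/anomaly detection | Deep learning | Compression/quantization | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High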

PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reaso

Natural language processing | RAG | Image | Text | Multimodal

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Distortion-Aware PETR for BEV Object Detection with Mixed Pinhole-Fisheye Cameras

Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-covera

Natural language processing | RAG | Detection | Image | 3D

Use case: Detection
Difficulty: Hard
Cost: High

MI-oriented | Computer vision | Segmentation | Image | 3D

PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning

To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich

Use case: Segmentation
Difficulty: Hard
Cost: High

Reconstructing Synthetic SDO/AIA 193 Å EUV Images from He I 10830 Å Observations with Diffusion Model Translator

Routine full-disk EUV imaging has been available only since the modern era, such as SOHO and SDO. To extend EU

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Learnable Token Sparsification for Efficient Gigapixel Whole Slide Image Reasoning

The processing of gigapixel whole slide images within vision language models faces a major difficulty due to a

Deep learning | Compression/quantization | Image | Text

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

SSAFE: Simple and Strong AI-Generated Image Detection via Frozen Vision Encoders

The rapid advancement of generative models has blurred the boundary between synthetic and real imagery, creati

Natural language processing | Fine-tuning | Classification | Detection | Generation

Use case: Classification
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection | Natural language processing | Large language models | Image | Text | Audio

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Towards Accurate Emotion-Attributed Video Captioning via Fine-grained Emotion-Cause Pair Extraction

Emotional Video Captioning (EVC) is a challenging task that aims to generate factually accurate and emotionall

Explainable | Natural language processing | RAG | Generation | Image | Video

Use case: Generation
Difficulty: Hard
Cost: High

Sensor/time series | Deep learning | Transformer | Detection | Segmentation | Anomaly detection

NGram-MoSE: Efficient Remote Sensing Super-Resolution via N-Gram Context and Mixture-of-Experts

Remote sensing applications for environmental monitoring and disaster management are frequently constrained by

Use case: Detection
Difficulty: Hard
Cost: High

Tabular-oriented | Computer vision | Video recognition | Generation | Image | Text

DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for auto

Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection | Deep learning | Transformer | Generation | Image | Video

OmniTryOn: Video Try-On Anything at Once!

Although video virtual try-on (VVT) has achieved significant progress, existing methods still exhibit two fund

Use case: Generation
Difficulty: Hard
Cost: High

Small-data-oriented | Deep learning | Transformer | Image | Text | Multimodal

Look Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal LLMs

Multimodal Large Language Models (MLLMs) face a significant inference bottleneck due to the quadratic computat

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Computer vision | Segmentation | Generation | Prediction | Image

EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control

Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Mo

Use case: Generation
Difficulty: Hard
Cost: High

Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation

Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due

Use case: Generation
Difficulty: Hard
Cost: High

Natural language processing | Large language models | Image | Text | Multimodal

TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models.

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication

Palmprint modality offers a privacy-preserving biometric solution, yet its deployment is hindered by the domai

Natural language processing | Large language models | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Reinforcing Temporal Answer Grounding in Instructional Video via Candidate-Aware Causal Reasoning

The task of temporal answer grounding in instructional video (TAGV), which aims to locate precise video segmen

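Deep learning | Transformer | Image | Text | Video

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Tabular-oriented | Computer vision | Segmentation | Generation | Image | Text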

Segmentation-Assisted Brain MRI Synthesis with Cross-Image Multi-Contrast Feature Memory Bank Retrieval Augmentation

Multi-contrast brain MRI provide complementary soft-tissue characteristics that aid in the screening and diagn

Use case: Generation
Difficulty: Easy
Cost: Low

CheXanatomy: Anatomy-Aware Vision-Language Modeling for Chest Radiographs

Vision-language models (VLMs) pretrained on large-scale image-text pairs demonstrate strong image-level unders

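Deep learning | CNN | Detection | Generation | Segmentation

Use case: Detection
Difficulty: Hard
Cost: High

MI-oriented | Natural language processing | RAG | Generation | Segmentation | Image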

SceneConductor: 3D Scene Generation from Single Image with Multi-Agent Orchestration

Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object rela

Use case: Generation
Difficulty: Hard
Cost: High

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

World action models inherit the predictive capability of world models, enabling action generation to be guided

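Natural language processing | RAG | Generation | Image | Multimodal

Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection | Natural language processing | Prompt engineering | Generation | Image | 3D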

OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation

Recent progress in robot manipulation has been largely driven by learning from large-scale demonstrations. For

Use case: Generation
Difficulty: Hard
Cost: High

Natural language processing | Prompt engineering | Image | 3D | Multimodal

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world depl

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Natural language processing | Fine-tuning | Anomaly detection | Image | Text

Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control p

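Use case: Anomaly detection
Difficulty: Hard
Cost: High

Sensor/time series | Deep learning | Compression/quantization | Image | Text | Reinforcement learning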

Towards End to End Motion Planning and Execution for Autonomous Underwater Vehicles Using Reinforcement Learning

Autonomous Underwater Vehicles (AUVs) traditionally rely on complex, heavily engineered pipelines for percepti

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Medium

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task,

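Deep learning | RNN / LSTM | Detection | Generation | Image

Use case: Detection
Difficulty: Hard
Cost: Low

Explainable | Quality prediction/anomaly detection | Natural language processing | Large language models | Image | Text | Multimodal

arxiv | GitHub available | 2026-06-06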

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet the

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STE

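Quality prediction/anomaly detection | Natural language processing | RAG | Image | Text

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Low

Explainable | Quality prediction/anomaly detection | Natural language processing | Fine-tuning | Generation | Summarization | Image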

IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment

Current image editing software often hinges on fixed filters or expert tuning, leaving a gap between amateur u

Use case: Generation
Difficulty: Hard
Cost: Medium

Quality prediction/anomaly detection | Natural language processing | Embedding | Image | Text

FMRFusion: Frequency-Aware Multi-View Representation Learning for Heterogeneous Image Fusion

Infrared and visible image fusion aims to generate a composite image that retains significant target informati

Use case: Embedding
Difficulty: Hard
Cost: Low

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via Multimo

Deep learning | Compression/quantization | Generation | Retrieval | Image

Use case: Generation
Difficulty: Hard
Cost: High

Programmable Silicon Retina on Pixel Processor Array

Standard dynamic vision sensors approximate retinal processing by detecting temporal contrast changes, offerin

Deep learning | Compression/quantization | Image | Video

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Self-Supervised Vision Transformers for CBCT-Based Detection of Temporomandibular Joint Osteoarthritis

Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes ar

Deep learning | Transformer | Classification | Detection | Generation

Use case: Classification
Difficulty: Hard
Cost: High

Beyond Raw Signals: Undecoded Generative Latents as Privileged Synthetic Data

While multimodal integration significantly improves computer vision models, deploying them incurs prohibitive

Deep learning | Compression/quantization | Classification | Generation | Image

Use case: Classification
Difficulty: Hard
Cost: High

SMI: Efficient Self-Supervised Learning via Mutual-Information-Inspired Dependency Optimization

Self-supervised learning (SSL) has achieved remarkable representation learning performance, but many existing

Deep learning | CNN | Embedding | Image | Supervised

Use case: Embedding
Difficulty: Hard
Cost: High

Where the Score Lives: A Wavelet View of Diffusion

Score-based generative models have had remarkable success over the last decade in generating a diverse set of

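Explainable | Deep learning | Transformer | Generation | Image

Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection | Deep learning | Compression/quantization | Generation | Image | Text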

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality generation wi

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-06

G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation

Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-

Deep learning | Transformer | Detection | Image

Use case: Detection
Difficulty: Hard
Cost: High

TIDE: Task-Isolated Diffusion for Unified Video Editing and Generation

Recent advances in Diffusion Transformers have driven rapid progress in video generation and editing, yet thes

Use case: Generation
Difficulty: Hard
Cost: High

Computer vision | Segmentation | Classification | Image | 3D

MS-COOT: Comparing Morse-Smale Complexes with Co-Optimal Transport

Understanding and comparing structures in scalar fields is a central challenge in scientific visualization, wi

Use case: Classification
Difficulty: Hard
Cost: High

Empowering Feed-Forward Reconstruction Models with Metric Scale via Satellite Images

Feed-forward 3D reconstruction models have recently shown strong generalization across diverse scenes, yet mos

Computer vision | 3D/point cloud | Detection | Image | 3D

Use case: Detection
Difficulty: Hard
Cost: High

Computer vision | Segmentation | Generation | Embedding | Image

Neural Field Tokenizations with Hierarchy and Spatial Locality Priors

Neural fields parameterize data as functions from coordinates to values, providing a unified framework for rep

Use case: Generation
Difficulty: Hard
Cost: High

RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT

Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadrati

Use case: Classification
Difficulty: Hard
Cost: High

MI-oriented | Computer vision | Multimodal | Image | Text | Video

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

Composed Video Retrieval (CVR) is designed to retrieve a target video that matches a reference video modified

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Deep learning | Transformer | Segmentation | Image

Phase Marginalization for Patch-Grid Instability in Vision Transformers

Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability for dense pr

Use case: Segmentation
Difficulty: Hard
Cost: High

Sensor/time series | Computer vision | Segmentation | Classification | Image | Text

One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling

Vision-language models (VLMs) enable visual recognition from semantic class descriptions, which makes them att

Use case: Classification
Difficulty: Hard
Cost: High

Trustworthy Visual Predicates for Robust Manipulation Understanding under Degradation

Manipulation understanding requires reliable relational evidence, such as contact, support, containment, motio

Deep learning | Transformer | Detection | Image | Video

Use case: Detection
Difficulty: Hard
Cost: High

Revisiting Articulated Parts Perception in Robot Manipulation

We are surrounded by various objects with movable, articulated parts, e.g., box, handle, door. An accurate and

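Quality prediction/anomaly detection | Deep learning | Compression/quantization | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Medium

MI-oriented | Quality prediction/anomaly detection | Natural language processing | Large language models | Generation | Image | Text

arxiv | GitHub available | 2026-06-06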

VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation

Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but

Use case: Generation
Difficulty: Hard
Cost: High

MI-oriented | Deep learning | Transformer | Classification | Regression | Prediction

OSMGraphCLIP: Learning Global Location Representations from OpenStreetMap Graphs

We present OSMGraphCLIP, a CLIP-style geospatial representation model that learns global location embeddings f

Use case: Classification
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection | Computer vision | 3D/point cloud | Image | 3D

Wispy to Voluminous: Prior-free Multi-view Capture of Strand-level Facial Hair

Facial hair is a defining trait of personal identity, yet remains a critical bottleneck for digital avatars. R

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation

In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for enablin

Natural language processing | RAG | Classification | Detection | Segmentation

Use case: Classification
Difficulty: Hard
Cost: High

MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model

Vision-language-action (VLA) models increasingly condition robot policies on history, depth, or 4D features to

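Natural language processing | RAG | Generation | Image | Text

Use case: Generation
Difficulty: Hard
Cost: High

Sensor/time series | Deep learning | Transformer | Generation | Image | Text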

SynthICL: Scalable In-context Imitation Learning with Synthetic Data

In-context imitation learning (ICIL) enables robots to learn new tasks from a small number of demonstrations b

Use case: Generation
Difficulty: Hard
Cost: High

IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations

Object navigation requires a robot to search for an unobserved target in an unknown environment by deciding wh

Natural language processing | RAG | Image | 3D

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

PRISM: PRior-guided Imagination Sampling in world Models

A learned world model provides a powerful physical intuition for evaluating future states. But its effectivene

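Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Medium

Explainable | MI-oriented | Sensor/time series | Reinforcement learning | Policy gradient (PPO / A3C) | Image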

Instrumented data for causal scientific machine learning

Scientific machine learning is limited less by model size than by the data it is trained on. Observational dat

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-05

Constructing VAE Latent Spaces with Prescribed Topology

Variational autoencoders (VAEs) learn low-dimensional latent representations of high-dimensional data. When th

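Quality prediction/anomaly detection | Generative AI | VAE | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Explainable | Natural language processing | Large language models | Classification | Image | Text

arxiv | GitHub available | 2026-06-05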

LLM-Guided Evolution for Medical Decision Pipelines

Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt

Use case: Classification
Difficulty: Hard
Cost: High

DroneDAR: Long-Range Drone Distance Estimation Using Monocular Vision and Bounding-Box Features

Accurate distance estimation for small drones in long-range imagery is important for tracking and situational

Deep learning | Transformer | Detection | Regression | Image

Use case: Detection
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-05

RhinoVLA Technical Report

This paper proposes a framework for deploying VLA models on edge hardware. The method uses edge hardware to V

Deep learning | Compression/quantization | Image | Text | Multimodal

Use case: Deploying VLA models on edge hardware
Difficulty: Hard
Cost: High

CAPE: Contrastive Action-conditioned Parallel Encoding for Embodied Planning

This paper proposes Contrastive Action-conditioned Parallel Encoding (CAPE), a new framework with which embodied agents predict future actions. CAP

Natural language processing | Prompt engineering | Image

Use case: New framework for embodied planning
Difficulty: Hard
Cost: Low

Does Appearance Help? A Systematic Study of Image-Based Re-Identification in Online 3D Multi-Pedestrian Tracking

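In 3D Multi-Object Tracking (MOT), continuously tracking people's movements requires estimating 3D human pose from 3D point cloud data. This relies mainly on geometric information, but depending on the situation, telling people apart

Deep learning | Transformer | Detection | Image | Text

Use case: Usefulness of appearance in 3D pedestrian tracking systems
Difficulty: Hard
Cost: High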

QuadVerse: An Integrated Framework Aligning Visual-Physical Reality for Quadruped Simulation

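This paper proposes QuadVerse, a framework for quadruped robot simulation. QuadVerse uses simulation that accounts for the visual, physical, and dynamic gaps, and integrates the quadruped robot's experimental environment with the simulation.

Quality prediction/anomaly detection | Natural language processing | RAG | Image | Video | 3D

Use case: Quadruped robot simulation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection | Natural language processing | RAG | Image | Video | Multimodal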

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

Visual-language action (VLA) models enable robots to predict actions directly from observations and language i

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning

World Action Models (WAMs) offer a promising approach to embodied intelligence, yet existing methods rely heav

Deep learning | Compression/quantization | Image | Text | Video

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

What Is My Robot Thinking? Design Considerations for Transparent and Trustworthy Shared Autonomy

Assistive robots operating under shared autonomy must balance user control with autonomous assistance. Because

Machine learning | Feature engineering | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Medium

STRIPS-WM: Learning Grounded Propositional STRIPS-style World Models from Images

Robots performing long-horizon visual manipulation observe high-dimensional images, but successful plans depen

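Reinforcement learning | Model-based | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Medium

Quality prediction/anomaly detection | Image inspection | Computer vision | 3D/point cloud | Image | 3D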

Three-dimensional hydro-cluttered locomotion by an undulatory robot

Aquatic robots have expanded human access to underwater environments, yet many underwater spaces contain obsta

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updat

Deep learning | CNN | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Multi-Robot Planning and Control from CCTV Camera Networks in a Real Warehouse

Off-board control of mobile robots from cameras embedded in the environment offers a practical path to scalabl

Sensor/time series | Machine learning | Time series | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Medium

AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation

Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Medium

Meridian: Metric-Semantic Primitive Matching for Cross-View Geo-Localization Beyond Urban Environments

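To improve geo-localization, this study proposes Meridian, which integrates geo-localization with position estimation.

Natural language processing | RAG | Detection | Image

Use case: Improving geo-localization
Difficulty: Hard
Cost: High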

Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation

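We developed a learning system for cloth manipulation. With this system, humans can learn cloth manipulation.

Quality prediction/anomaly detection | Deep learning | CNN | Detection | Generation | Image

Use case: Learning cloth manipulation
Difficulty: Hard
Cost: Medium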

Multi-Resolution Tactile Imitation Learning for Contact-Rich Robotic Manipulation

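This study proposes MiTaS, which fuses tactile signals across multiple temporal resolutions to learn complex, multimodal, contact-rich manipulation.

Sensor/time series | Deep learning | Transformer | Image

Use case: Learning multimodal contact-rich manipulation
Difficulty: Hard
Cost: High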

CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving

Addresses the difficulty that end-to-end autonomous driving models have in balancing multi-modal maneuver generation with real-time inference, and di

Deep learning | Attention mechanism | Generation | Image

Use case: Latent evaluation for end-to-end autonomous driving
Difficulty: Hard
Cost: High

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

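This repository proposes a model that aims to give image recognition models the ability to generate actions. The model uses a pretrained image recognition model to generate complex actions.

Deep learning | Transformer | Detection | Generation | Prediction

Use case: Image recognition and action generation
Difficulty: Hard
Cost: High

arxiv | GitHub available | 2026-06-04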

A Conversational Framework for Human-Robot Collaborative Manipulation with Distributed Generative AI models

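This study proposes a distributed conversational framework for human-robot collaboration.

Natural language processing | Large language models | Generation | Image | Text

Use case: Human-robot collaboration
Difficulty: Hard
Cost: High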

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Proposes a unified vision-language-action model that improves performance on the tasks it is applied to.

Use case: Unified vision-language-action model
Difficulty: Hard
Cost: High

T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation

Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D sc

Natural language processing | RAG | Classification | Segmentation | Image

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-03

Identifying Gems from Roman RAPIDly

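This study proposes an automated pipeline for transient detection and transient error detection on data from the upcoming Roman observatory. Transient detection is an especially important capability for Roman data, and to detect astronomical phenomena, rapid

Machine learning | Supervised learning | Classification | Detection | Image

Use case: Automated error detection and transient detection for promising celestial objects
Difficulty: Hard
Cost: High

Sensor/time series | Natural language processing | Embedding/retrieval | Image | Unsupervised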

Central Description Length (CDL) Clustering Validation Index

Selecting a clustering algorithm and its hyperparameters without labels is a common difficulty in engineering

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Low

Optimized Labeling Resource Allocation for Prediction-Assisted Inference via OPAL

Active Statistical Inference is a new framework to make precise claims about population parameters with provab

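Quality prediction/anomaly detection | Natural language processing | RAG | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Low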

Quadratic integrate-and-fire neurons exhibit less fragmented loss landscapes and outperform leaky integrate-and-fire neurons in spike-based gradient descent

The ability to train spiking neural networks is essential for modeling biological neural networks as well as f

Deep learning | Transformer | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

Training a Predictive Coding Network on ImageNet using Equilibrium Propagation

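Equilibrium Propagation (EP) is a framework for training energy-based models, in particular Predictive Coding Networks (PCNs). During training, EP

Deep learning | CNN | Classification | Image

Use case: Training PCNs with EP for image recognition
Difficulty: Hard
Cost: High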

PSViT: A Methodology for Structurally Pruning Spiking Vision Transformers

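Discusses the development of a pruning method for compressing Spiking Vision Transformers and the experimental results obtained with it.

Deep learning | Transformer | Image

Use case: Developing a pruning method for compressing Spiking Vision Transformers
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-06-01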

Democracy on Rugged Landscapes: Phase Transitions in Optimal Voting Rules

Laws and institutions shape individual outcomes through complex interactions with citizens' diverse circumstan

Machine learning | Feature engineering | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-05-30

Active Learning with Foundation Model Priors: Efficient Learning under Class Imbalance

Real-world datasets across image and text domains are often characterized by skewed class distributions and no

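Small-data-oriented | Condition optimization | Deep learning | Compression/quantization | Image | Text

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-05-28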

Deep Binarized Photonic Reservoir Computing for Ultrafast Multimedia Signal Processing

We present a deep photonic neural network architecture based on ultrafast binary optical modulation from a dig

Sensor/time series | Computer vision | Video recognition | Classification | Detection | Image

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-05-27

CLANE: Continual Learning of Actions on Neuromorphic Hardware from Event Cameras

Recognizing and continuously learning novel human actions without forgetting prior classes is a requirement fo

Sensor/time series | Deep learning | CNN | Classification | Image | Video

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-05-23

Cloud Computing Review: A Decade of Research

The popularity and rapid development of Cloud Computing in recent years has led to a vast number of publicatio

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-05-22

Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems

Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable sp

Small-data-oriented | Deep learning | Transformer | Classification | Image | Text

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-05-22

SpikingMoE: SDPrompt-Guided Dynamic Expert Fusion in Spiking Neural Networks

Proposes SpikingMoE to speed up spiking neural networks. To reduce spike communication, the framework offers SDPrompt-Guided Dynamic Expert Fusion

Use case: Module for improving spike-based intelligence
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-05-21

Cross-Species RSA Reveals Conserved Early Visual Alignment but Divergent Higher-Area Rankings Across Human fMRI and Macaque Electrophysiology

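This study compares visual alignment between humans and macaques. The results show that CNNs can be used to predict the macaque visual cortex.

Deep learning | CNN | Image

Use case: Cross-species visual alignment
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-05-20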

E-ReCON: An Energy- and Resource-Efficient Precision-Configurable Sparse nvCIM Macro for Conventional and Spiking Neural Edge Inference

This work presents E-ReCON, a 16 Kb energy and resource-efficient digital compute-in-memory (DCIM) macro based

Sensor/time series | Deep learning | Transformer | Generation | Image

Use case: Generation
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-05-19

Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy dee

Deep learning | Transformer | Detection | Image | Video

Use case: Detection
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-05-15

XOResNet: Exclusive-OR Meta-Residuals Facilitate Deep Spiking Neural Networks Learning

Spiking neural networks (SNNs) hold promise for demonstrating superior learning and representation capabilitie

Deep learning | CNN | Image

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-05-15

Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model

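Incorporates hippocampal-entorhinal structure to learn abstract representations and a predictive world model.

Natural language processing | RAG | Image | Text | Supervised

Use case: Hippocampal-entorhinal world model
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-05-14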

On the Stability of Growth in Structural Plasticity

Standard deep-learning pipelines usually choose the network architecture before training and keep it fixed thr

Deep learning | CNN | Classification | Image | Text

Use case: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-05-12

STARS: Spike Tail-Aware Relational Synthesis for ANN-to-SNN Data-Free Knowledge Distillation

SNNs promise energy-efficient and low-latency inference, but their performance still trails that of ANNs. ANN-

Explainable | Deep learning | Compression/quantization | Generation | Image

Use case: Generation
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-05-12

Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization

The spatial and functional organization of the primate visual cortex is a fundamental problem in neuroscience.

Deep learning | Transformer | Image | Video | 3D

Use case: Technical validation and paper-reading support
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-05-12

Breaking Global Self-Attention Bottlenecks in Transformer-based Spiking Neural Networks with Local Structure-Aware Self-Attention

Transformer-based Spiking Neural Networks (SNNs) integrate SNNs with global self-attention and have demonstrat

Use case: Classification
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-05-11

Energy-Efficient Implementation of Spiking Recurrent Cells on FPGA

Proposes a method for implementing a spiking neural network model on an FPGA while reducing energy consumption.

Use case: Energy efficiency
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-05-11

Prospective Compression in Human Abstraction Learning

Proposes a new approach for estimating human-like abstraction, enabling efficient learning of unseen tasks.

Use case: Human-like abstraction
Difficulty: Hard
Cost: High