MLinfo | Machine Learning and AI Paper Roundup

netdata — The fastest path to AI-powered full stack observability, even for lean teams.

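netdata offers the fastest path to AI-powered full-stack observability, even for lean teams.

Use case: Full-stack observability
Difficulty: Easy
Cost: Medium

Computer vision, Object detection, Classification, Detection, Segmentation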

ultralytics — Ultralytics YOLO26, YOLO11, YOLOv8 — object detection, instance segmentation, semantic segmentation, image classification, pose estimation, object tracking

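ultralytics is a library built on YOLO (You Only Look Once) models for object detection and related tasks such as segmentation, classification, pose estimation, and tracking.

Use case: Object detection
Difficulty: Easy
Cost: Low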

streamlit — Streamlit — A faster way to build and share data apps.

streamlit is a library for building and sharing data apps quickly.

Use case: Building data apps
Difficulty: Easy
Cost: Medium

gradio — Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

A library for building and sharing machine learning apps in Python.

Reinforcement learning, Policy gradient (PPO / A3C), Image

Use case: Building machine learning apps
Difficulty: Easy
Cost: Medium

photoprism — AI-Powered Photos App 🌈💎✨

photoprism is an AI-powered photo management app that automatically detects features and information in photos.

Use case: Photo management
Difficulty: Easy
Cost: Medium

opencv — Open Source Computer Vision Library

opencv is the Open Source Computer Vision Library.

Deep learning, Image

Use case: Computer vision
Difficulty: Easy
Cost: High

Computer vision, Object detection, Classification, Detection, Segmentation

yolov5 — Ultralytics YOLOv5 in PyTorch for object detection, instance segmentation, classification, training, and export.

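Ultralytics YOLOv5, implemented in PyTorch, for object detection, instance segmentation, and classification, with support for training and export.

Use case: Object detection
Difficulty: Easy
Cost: High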

diffusers — 🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.

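A library of diffusion models, usable for image, video, and audio generation.

Generative AI, Diffusion models, Generation, Image, Text

Use case: Image, video, and audio generation
Difficulty: Easy
Cost: High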

netron — Visualizer for neural network, deep learning and machine learning models

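A tool for visualizing neural networks, as well as deep learning and machine learning models.

MLOps, Model deployment, Image

Use case: Neural network visualization
Difficulty: Easy
Cost: Medium

Computer vision, Object detection, Classification, Segmentation, Image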

label-studio — Label Studio is a multi-type data labeling and annotation tool with standardized output format

A tool for labeling and annotating multiple types of data.

Use case: Data labeling tool
Difficulty: Easy
Cost: Low

Medical_Image_Analysis — Foundation models based medical image analysis

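Medical image analysis is a research field that extracts information from images to support medical diagnosis and treatment. This repository covers approaches to medical image analysis that use foundation models.

Use case: Medical image analysis
Difficulty: Easy
Cost: High

Quality prediction/anomaly detection, Computer vision, Segmentation, Classification, Detection, Image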

cvat — Computer Vision Annotation Tool (CVAT) is a leading platform for building high-quality visual datasets for vision AI. It offers open-source, cloud, and enterprise products, as well as labeling services, for image, video, and 3D annotation with AI-assisted labeling, quality assurance, team collaboration, analytics, and developer APIs.

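CVAT is an annotation platform for building visual datasets for machine learning. It is used by teams of all sizes and handles data at any scale.

Use case: Data labeling and management
Difficulty: Easy
Cost: High

Computer vision, Segmentation, Classification, Image, Video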

labelme — Image annotation with Python. Supports polygon, rectangle, circle, line, point, and AI-assisted annotation.

An image annotation tool that supports polygons, rectangles, circles, lines, and points.

Use case: Image annotation
Difficulty: Easy
Cost: High

Meshroom — Node-based Visual Programming Toolbox

A node-based visual programming toolbox.

Computer vision, 3D/point clouds, Image, Text, 3D

Use case: Visual programming tool
Difficulty: Easy
Cost: High

Deep learning, Transformer, Segmentation, Image

segmentation_models.pytorch — Semantic segmentation models with 500+ pretrained convolutional and transformer-based backbones.

A library of semantic segmentation models.

Use case: Semantic segmentation models
Difficulty: Easy
Cost: High

kornia — 🐍 Geometric Computer Vision Library for Spatial AI

kornia is a geometric computer vision library for spatial AI.

Computer vision, Image

Use case: Geometric computer vision
Difficulty: Easy
Cost: High

rerun — Visualize, query, and stream to train on multimodal robotics data.

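An SDK for logging, storing, querying, and visualizing data.

Computer vision, Multimodal, Image

Use case: Data logging and visualization
Difficulty: Easy
Cost: High

Quality prediction/anomaly detection, Machine learning, Supervised learning, Classification, Detection, Image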

fiftyone — Refine high-quality datasets and visual AI models

FiftyOne is a library for refining datasets and visualizing AI models, with the aim of improving dataset quality.

Use case: Dataset refinement and AI model visualization
Difficulty: Easy
Cost: Low

Deep learning, Transformer, Image, Text, Multimodal

sglang — SGLang is a high-performance serving framework for large language models and multimodal models.

SGLang is a high-performance serving framework for large language models.

Use case: Serving framework for large language models
Difficulty: Easy
Cost: High

Sana — SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

SANA is a high-resolution image synthesis model that produces high-quality images at low computational cost.

Use case: High-resolution image synthesis
Difficulty: Easy
Cost: High

qdrant — Qdrant - High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/

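Qdrant is a high-performance, large-scale vector database and vector search engine, also available as a cloud service.

Natural language processing, Embeddings/retrieval, Generation, Image

Use case: Vector database and vector search
Difficulty: Easy
Cost: Low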

taipy — Turns Data and AI algorithms into production-ready web applications in no time.

Taipy turns data and AI algorithms into production-ready web applications.

MLOps, Pipeline construction, Image

Use case: Turning data and AI algorithms into web applications
Difficulty: Easy
Cost: Medium

weaviate — Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a cloud-native database.

A vector database that combines vector search with structured filtering.

MLOps, Generation, Image

Use case: Vector database
Difficulty: Easy
Cost: Medium
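
As a rough illustration of what "vector search combined with structured filtering" means, here is a minimal pure-Python sketch. It does not use Weaviate's client or API; the `search` function, the `where` predicate, and the toy objects are all hypothetical names chosen for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(objects, query_vec, where, top_k=2):
    """Keep objects whose properties satisfy `where`, then rank by similarity."""
    candidates = [o for o in objects if where(o["props"])]
    ranked = sorted(
        candidates, key=lambda o: cosine(o["vector"], query_vec), reverse=True
    )
    return [o["props"]["title"] for o in ranked[:top_k]]


# Hypothetical toy data: each object stores its properties and a vector together.
objects = [
    {"props": {"title": "cat photo", "year": 2021}, "vector": [0.9, 0.1]},
    {"props": {"title": "dog photo", "year": 2023}, "vector": [0.8, 0.3]},
    {"props": {"title": "tax form", "year": 2023}, "vector": [0.1, 0.9]},
]

# Structured filter (year >= 2022) plus vector ranking against the query.
print(search(objects, [1.0, 0.0], where=lambda p: p["year"] >= 2022))
```

A real vector database would use an approximate nearest-neighbor index rather than scoring every candidate, so this sketch only shows how the two kinds of query combine.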

aim — Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.

Aim is an easy-to-use open-source experiment tracker.

MLOps, Experiment tracking, Image

Use case: Experiment tracking
Difficulty: Easy
Cost: Medium

LightX2V — Lightweight Image Video Action Generation Inference Framework

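LightX2V is a lightweight inference framework for image, video, and action generation.

Deep learning, Model compression/quantization, Generation, Image, Video

Use case: Lightweight inference for generative models
Difficulty: Easy
Cost: High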

deepinv — DeepInverse: a PyTorch library for solving imaging inverse problems using deep learning

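A PyTorch library for solving imaging inverse problems with deep learning.

Generative AI, Diffusion models, Image, Self-supervised

Use case: Solving imaging inverse problems
Difficulty: Easy
Cost: High

Suited to tabular data, Deep learning, Transformer, Classification, Detection, Image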

presidio — An open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data. Supports NLP, pattern matching, and customizable pipelines.

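presidio is an open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) in text, images, and structured data. It supports NLP, pattern matching, and customizable pipelines.

Use case: Protecting data privacy
Difficulty: Easy
Cost: Low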
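
The pattern-matching side of PII detection can be illustrated with a minimal sketch that uses only Python's standard `re` module. This is not Presidio's API: the `PATTERNS` table and the `redact` function are hypothetical names, and the regular expressions are deliberately simplified.

```python
import re

# Hypothetical, simplified patterns for illustration only; real PII
# detection needs far more robust rules, usually with NLP on top.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a <LABEL> placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(redact("Contact jane@example.com or 555-123-4567."))
```

Keeping the patterns in a table makes it easy to add or swap recognizers, which is roughly the role a customizable pipeline plays in a full framework.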

3D-Aware VLMs with Implicit and Explicit Geometries

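Proposes VLM-IE3D (Vision-Language Models with Implicit and Explicit 3D geometry), a new approach to 3D spatial understanding. ...

Computer vision, 3D/point clouds, Detection, Image, Text

Use case: Developing 3D spatial understanding
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Image inspection, Deep learning, Transformer, Detection, Generation, Image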

Synthetic data generation framework for quality control automation in gravure printing

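Proposes a synthetic data generation framework as a new approach to print quality control. The framework generates synthetic data for quality control in gravure printing ...

Use case: Developing print quality control
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, Large language models, Image, Text, Multimodal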

MIRROR: Learning from the Other View for Multi-Modal Reasoning

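Proposes MIRROR (Learning from the Other View), a new approach to multimodal understanding. MIRROR provides equivalent views across text, figures, and combinations of the two ...

Use case: Developing multimodal understanding
Difficulty: Hard
Cost: High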

Zero-Flow Two-Sample Tests

We propose a new approach to two-sample testing for deciding whether two sets of samples are drawn from the sa

Computer vision, Segmentation, Regression, Image

Use case: Regression
Difficulty: Hard
Cost: Medium

Quality prediction/anomaly detection, Deep learning, Transformer, Image

KroQuant: Kronecker-Structured Block Transforms for Efficient Post-Training Quantization of Diffusion Transformers

Post-training quantization (PTQ) of diffusion transformers (DiTs) to W4A4 severely degrades output quality, be

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

M$^3$-Gen: Interpretable Multimodal Generation of Gene Expression Profiles Using Clinical and Imaging Data

Integrating heterogeneous biomedical data, including clinical metadata, histopathology images, and molecular p

Explainable, Natural language processing, RAG, Generation, Image, Multimodal

Use case: Generation
Difficulty: Hard
Cost: High

Multi-Task Learning for Heterogeneous Prediction from Video Game State with Transfer Learning

Multi-task learning (MTL) is a promising approach for prediction tasks derived from video game state data, as

Natural language processing, Fine-tuning, Image, Text, Video

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Safety-oriented sidewalk and road segmentation for smartphone-based assistive navigation

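This study proposes a segmentation approach that distinguishes sidewalks from roads for smartphone-based assistive navigation, which could help support the mobility of blind and visually impaired people.

Computer vision, Segmentation, Image

Use case: Assistive navigation (sidewalk and road segmentation)
Difficulty: Hard
Cost: Medium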

Counterfactual Explainability Framework With CycleGAN And Counterfactual-Classifier Alignnment Score for Retinal Disease Classification

Automated detection of vision impairing retina-based ocular conditions from fundus images is important for ear

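Explainable, Deep learning, CNN, Classification, Detection, Image

Use case: Classification
Difficulty: Hard
Cost: Low

Sensor/time series, Quality prediction/anomaly detection, Deep learning, CNN, Classification, Image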

Machine Learning for Charge State Characterization of Isolated Double Quantum Dots

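Proposes a machine learning method for characterizing the charge state of double quantum dots, allowing the charge state to be analyzed efficiently.

Use case: Charge-state characterization of double quantum dots
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Generative AI, Video generation, Generation, Image, Text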

GraphVid: Interactive Graph-Controllable Video Generation

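GraphVid generates video from a graph and text, allowing precise control over the motion of multiple objects. The graph stores information describing object motion, and the text specifies constraints on generation.

Use case: Controllable video generation
Difficulty: Hard
Cost: High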

Visual Contrastive Self-Distillation

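Visual Contrastive Self-Distillation proposes a method for accelerating self-distillation. Using only the input information, it removes the information imbalance between student and teacher.

Use case: Accelerating self-distillation
Difficulty: Hard
Cost: Medium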

GS-Agent: Creating 4D Physical Worlds With Generative Simulation

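GS-Agent generates physically plausible 4D worlds from natural language. To preserve physical correctness, it applies physical reasoning during generation.

Suited to MI, Natural language processing, RAG, Generation, Image, Text

Use case: Generating 4D physical worlds
Difficulty: Hard
Cost: High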

Thinkink: 2D Spatial Ink-native Interaction with LLMs

People often use handwritten notes and sketches to externalize ideas for ideation. To integrate large language

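Deep learning, Model compression/quantization, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Image, Text, Video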

Adaptive Identity Anchoring: Closed-Loop Keyframe Placement for Synthetic Paired Supervision in Video Face Swapping

Video face swapping has no natural paired supervision: no real footage exists of one person's face performing

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

When Are Reasoning-Based Guardrails Not Efficient? ResponseGuard: A Fast Vision-Language Guard for Real-Time Moderation

A vision-language AI assistant returns its answer as a stream of generated tokens. Therefore, a safety guard t

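Deep learning, Model compression/quantization, Detection, Image, Text

Use case: Detection
Difficulty: Hard
Cost: High

Sensor/time series, Natural language processing, Large language models, Image, Text, 3D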

VoLN: Vision-Only Long-Horizon Navigation---Paradigm, Benchmark, and Method

Vision-and-Language Navigation (VLN) enables embodied agents to follow natural-language instructions. However,

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

DINOde: Continuous Vision-Text Alignment for Open-Vocabulary Semantic Segmentation

Open-vocabulary semantic segmentation (OVSS) leverages textual semantics to segment objects beyond predefined

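Natural language processing, RAG, Segmentation, Image, Text

Use case: Segmentation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Computer vision, Segmentation, Detection, Image, Text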

PC-Edit: Prompt-Contrastive Region Discovery and Region-Guided Editing

Replacing an object with one that differs in category or shape requires complete source removal, natural targe

Use case: Detection
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, Large language models, QA, Image, Text

Unlearning Under Imbalance: Benchmarking Fairness in Multimodal LLM Unlearning

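Benchmarks fairness in multimodal LLM unlearning when the data to be removed is imbalanced, and points out the limitations of existing approaches to removing personal data.

Use case: Removing personal data from models
Difficulty: Hard
Cost: High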

Animation, Verification and Visualisation of Prolog Transition Systems with ProB

ProB is a Prolog-based model checker, animator and constraint solver for high-level formal specifications. One

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

Encoding Event-B Proof Rules in Prolog: An Interactive Sequent Prover for ProB

Event-B is a formal method rooted in predicate logic and set theory. We encoded over 600 proof rules in Prolog

Natural language processing, Fine-tuning, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

CRAG-MM-Diagnostics: Enabling Stage-Wise Analysis of Knowledge-Intensive VQA

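Proposes new evaluation criteria for analyzing knowledge-intensive visual question answering (KI-VQA) systems. These criteria allow each stage of a VLM's work to be evaluated separately.

Natural language processing, Large language models, Classification, QA, Image

Use case: Analysis of knowledge-intensive VQA systems
Difficulty: Hard
Cost: High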

V-DEAL: Diagnosing Video Safety De-Calibration as an Understanding-Refusal Coupling Failure

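Proposes a new diagnostic framework for checking the safety of video LMMs. The framework jointly considers model behavior, understanding, and semantics.

Natural language processing, Large language models, Image, Text, Video

Use case: Diagnosing video safety de-calibration
Difficulty: Hard
Cost: High

Natural language processing, Prompt engineering, Classification, Image, Text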

Sparse Concept Channels in Frozen 3D CT Vision Encoders

Large vision-language models are becoming increasingly dominant in 3D medical image interpretation, but we rar

Use case: Classification
Difficulty: Hard
Cost: High

Explainable, Deep learning, Transformer, Embeddings, Image, Video

HyWorldVLA: A Vision-Language-Action Model with Hybrid World Modeling for Autonomous Driving

Vision-Language-Action (VLA) models augmented with world modeling represent a promising paradigm for end-to-en

Use case: Embeddings
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Image, Text, Video

Beyond Independent Optimization: Compression, MoE Routing, and Quantization Interactions in Multimodal Edge Intelligence

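Efficient multimodal inference is constrained not only by model performance and FLOP count but also by the cost, memory, and energy of moving, caching, transforming, and storing quantized representations. This paper looks at recent visual token compression ...

Use case: Making multimodal edge AI more efficient
Difficulty: Hard
Cost: High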

OPOD: On-Policy Omni Distillation

Omni-modal models can handle text, images, and audio in one system, but improving all of these abilities toget

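Deep learning, Model compression/quantization, Image, Text, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Explainable, Suited to MI, Quality prediction/anomaly detection, Deep learning, Transformer, Classification, Generation, Image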

Enhancing Explainable Cardiac Diagnosis with Guide-Grounded Multimodal LLMs

The electrocardiogram (ECG) is a cornerstone of cardiac assessment, yet clinical deployment of deep learning

Use case: Classification
Difficulty: Hard
Cost: High

Tencent WorkBuddy Bench: A Multi-Domain Coding-Agent Benchmark with Contamination-Resistant Task Construction

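Introduces a benchmark for coding agents, with tasks built from real-world commits and pull requests.

Use case: Evaluating coding agents
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Image, Text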

Streaming Multi-Agent Autoregressive Diffusion Model with World State Registers

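For multi-agent simulation, assumes that a shared world state is maintained across agents and that this world state is reflected in their observations.

Use case: Multi-agent simulation
Difficulty: Hard
Cost: High

Suited to MI, Deep learning, Model compression/quantization, Segmentation, Anomaly detection, Image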

Unified Video Dense Prediction from Disjoint Data

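Builds a unified video prediction system that goes beyond existing task-specific annotations by performing spatial reasoning about objects in video jointly across tasks.

Use case: Dense prediction on video
Difficulty: Hard
Cost: High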

Inference-Time Scaling of Diffusion Models via Progressive Seed Pruning

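Shows that the initial noise seed strongly affects the quality of images generated by diffusion models, and proposes a method to reduce the time cost of seed search.

Use case: Inference-time scaling of diffusion models
Difficulty: Hard
Cost: High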

Self-Supervised Learning of Structured Dynamics from Videos

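Improves motion representation learning by disentangling camera motion from object motion in video.

Deep learning, Transformer, Embeddings, Image, Video

Use case: Predicting motion in video
Difficulty: Hard
Cost: High

Explainable, Quality prediction/anomaly detection, Computer vision, Segmentation, Image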

Scene Parameter Saliency via Differentiable Light Transport

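Proposes a method that uses differentiable light transport to identify the most influential scene elements.

Use case: Scene parameter saliency estimation
Difficulty: Hard
Cost: Medium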

Towards Robust Iris Recognition Through Occlusion Identification and Conditional Diffusion-Based Reconstruction

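Proposes a method to improve the accuracy of iris recognition and analyzes the challenges of the iris recognition task.

Deep learning, CNN, Classification, Image, Text

Use case: Iris recognition
Difficulty: Hard
Cost: High

Sensor/time series, Deep learning, Model compression/quantization, Image, 3D, Self-supervised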

Boosting Robustness for All-Weather Self-Supervised Depth Estimation in Autonomous Driving

Self-supervised depth estimation is challenging for safe autonomous driving under various adverse weather cond

Use case: Depth estimation for autonomous driving
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Image, Text, Video

Texture++: Elevating 3D Asset Texture Resolution with a Region-Aware Diffusion Model

Numerous 3D assets are discarded due to low texture resolution, while current super-resolution models ignore t

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Recurrent Sinusoidal INRs for Efficient High-Fidelity Representation

We study sinusoidal recurrence as an iterative mechanism for harmonic spectral enrichment in implicit neural r

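Deep learning, RNN / LSTM, Image, 3D

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, RNN / LSTM, Image, Text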

CLUIE: Clustering-Aware Recurrent Propagation with Local Structural Compensation for Underwater Image Enhancement

Underwater image enhancement remains challenging due to wavelength-dependent light absorption, scattering, and

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

DAPM: UAV Monocular Depth Estimation from Any Height, Pitch, Roll and FOV

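UAVs operate under widely varying camera poses, including changes in altitude, pitch, roll, and FOV, so monocular depth estimation on diverse aerial imagery with skewed depth distributions calls for more capable methods. Most estimation methods ...

Deep learning, Model compression/quantization, Image, 3D

Use case: Monocular depth estimation for UAVs
Difficulty: Hard
Cost: High

Natural language processing, Fine-tuning, Classification, Segmentation, Image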

ASTRA-Net: Anatomy-Specific Transfer and Representation Alignment for Drug-Induced Sleep Endoscopy Segmentation

Quantitative drug-induced sleep endoscopy (DISE) requires reliable airway boundaries at specific anatomical le

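Use case: Classification
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection, Deep learning, Normalization/optimization methods, Classification, Image, Text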

Quality-Aware Multimodal Fusion Reveals Implicit Identity in Valence-Arousal Features

Conventional face recognition relies on static appearance cues and degrades in unconstrained settings with exp

Use case: Classification
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Image

SlerpFlow: Spherical Trajectory Correction for Rectified Flow Inversion

Rectified-flow-based diffusion transformers, particularly FLUX, have demonstrated outstanding performance in h

Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, RAG, Detection, Image, Text

Detectors Learn the Wrong Thing: Shortcut-Resistant Adversarial Training Against Physically Realizable Attacks

AI-enabled visual perception systems are increasingly deployed in intelligent transportation infrastructure an

Use case: Detection
Difficulty: Hard
Cost: High

Stokes-Informed Diffusion for Robust Linear Polarization Estimation

Polarization cues benefit applications such as material detection and de-reflection, yet acquiring them typica

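Deep learning, Model compression/quantization, Detection, Image

Use case: Detection
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Generative AI, GAN, Generation, Image, Multimodal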

Physics-Informed Deep Learning Model for Cross-Modality Super-Resolution in Fluorescence Microscopy

Cross-modality image translation offers a route to super-resolution fluorescence microscopy from low-resolutio

Use case: Generation
Difficulty: Hard
Cost: High

Out of Sight, Still in Mind: Token Compression for Omni-LLMs

The goal of this paper is to reduce the input token cost of Omni-modal large language models (Omni-LLMs) at in

Natural language processing, Large language models, Image, Text, Audio

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Decoupling Cross-Modality Manifold Discrepancy: Leveraging Visible Diffusion Priors for Infrared Super-Resolution

Infrared image super-resolution (IISR) mitigates the limitations imposed by low spatial resolution. Existing m

Natural language processing, RAG, Generation, Image, Multimodal

Use case: Generation
Difficulty: Hard
Cost: High

Causal-AgentIR: Self-Evolving Causal Memory for Adaptive Image Restoration Agents

Image restoration agents have recently emerged as a flexible paradigm for handling diverse and unpredictable d

Quality prediction/anomaly detection, Generative AI, GAN, Image, Text

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Low

The Second LoViF 2026 Challenge on Real-World All-in-One Image Restoration: Methods and Results

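The second LoViF challenge addresses real-world all-in-one image restoration. It provides a comprehensive evaluation for restoring real-world images and asks researchers for solutions to degradations such as low light, haze, rain, and snow ...

Computer vision, Image

Use case: Image restoration
Difficulty: Hard
Cost: Medium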

HalluScope: Fine-grained Hallucination Diagnosis for Multimodal Large Language Models

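Multimodal large language models perform well at turning a wide range of images into text, but diagnosing the hallucinations they produce remains an open problem. This work proposes a fine-grained diagnosis method that addresses the shortcomings of mainstream coarse-grained detection.

Explainable, Natural language processing, Large language models, Classification, Detection, Generation

Use case: Fine-grained hallucination diagnosis
Difficulty: Hard
Cost: High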

Geo3R: Mitigating Spatial Reasoning Hallucination in Multimodal Large Language Models

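Multimodal large language models hallucinate when reasoning about the 3D spatial relations between objects. This work proposes an approach to mitigate these hallucinations.

Natural language processing, Large language models, Image, Text, 3D

Use case: Mitigating hallucination in 3D spatial reasoning
Difficulty: Hard
Cost: High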

The RealDefocus Benchmark for Defocus Deblurring

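Defocus deblurring is essential for image restoration, but it is hard to evaluate because datasets that meet requirements such as paired sharp and defocused images and a standardized protocol are lacking. This work builds on real-world ...

Quality prediction/anomaly detection, Computer vision, Image

Use case: Defocus deblurring
Difficulty: Hard
Cost: High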

Show, Don't Tell: Evaluating Spatial Cognition in Generative Pixels Rather Than LLM Text

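Spatial understanding is essential for connecting the physical world with static semantic understanding. In many spatial tasks, locations, regions, and paths are most naturally expressed within a continuous visual scene, for example by pointing or marking, but current spatial reasoning ...

Use case: Spatial understanding
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Computer vision, Segmentation, Image, 3D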

TransBiolab: A Real-World Multi-View Dataset of Cluttered Transparent Biomedical Objects

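Automated biomedical labs need visual perception to recognize, localize, and manipulate transparent plasticware, but high-quality real-world datasets for this are currently limited. This work handles cluttered multi-object scenes ...

Use case: Perception of transparent biomedical objects
Difficulty: Hard
Cost: High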

Do Pathology Vision-Language Models Truly See Pathology?

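Vision-language models are now widely evaluated on pathology recognition, but this study questions whether their visual perception is actually at work when they recognize pathology.

Use case: Pathology recognition
Difficulty: Hard
Cost: High

Deep learning, Transformer, Image, Text, Multimodal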

MVEI & EmObserver: Empowering MLLM-Oriented Visual Emotional Intelligence via Emotion Statement Judgement

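Emotion recognition is essential for advancing modern AGI, but large-scale ...

Use case: Emotion recognition
Difficulty: Hard
Cost: High

Sensor/time series, Computer vision, Segmentation, Classification, Image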

HyperImageNet: A Large-Scale High-Spatial Resolution Hyperspectral Imagery Classification Benchmark

We present HyperImageNet, a large-scale benchmark for fine-grained hyperspectral land-cover understanding. The

Use case: Classification
Difficulty: Hard
Cost: Low

Explainable, Sensor/time series, Computer vision, Multimodal, Image, Text

GeoThreat: Transferable Targeted Adversarial Attacks on Large Vision-Language Models for Remote Sensing Image Interpretation

Adversarial attacks against large vision-language models (LVLMs) serve as an effective means of assessing thei

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Spectral-Spatial Synergistic Guided Network for Hyperspectral Salient Object Detection

Hyperspectral salient object detection aims to identify visually salient regions from hyperspectral images. Ex

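Deep learning, Model compression/quantization, Detection, Image

Use case: Detection
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection, Deep learning, Transformer, Detection, Generation, Image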

GroupVideo: Multi-Identity Customized Text-to-Video Generation

Current identity customized video generation methodologies are predominantly limited to single-identity scenar

Use case: Detection
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Image, Video, 3D

WAT3R: Feedforward Underwater 3D Reconstruction

Reliable feedforward underwater 3D reconstruction remains challenging due to severe light attenuation and back

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Explainable Deepfake Detection Challenge

Deepfake detection is moving beyond binary classification decisions toward systems that can also explain the v

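Explainable, Computer vision, Image classification, Classification, Detection, Generation

Use case: Classification
Difficulty: Easy
Cost: Low

Suited to small data, Natural language processing, Prompt engineering, Classification, Image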

AUCH-Net: Action Unit-Based Consistency-Aware Hypergraph Network for Cross-Domain Few-Shot Facial Expression Recognition

Recently, cross-domain few-shot facial expression recognition (CF-FER) has received considerable attention. Ho

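Use case: Classification
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection, Natural language processing, RAG, Generation, Image, Unsupervised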

Unsupervised Metal Artifact Reduction in Dental CBCT using Fine-tuned Cycle-Consistent Adversarial Networks

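This study proposes using a cycle-consistent adversarial network (CycleGAN) to remove metal artifacts from dental CBCT images. With CycleGAN, once the metal artifacts are removed, the CBCT ...

Use case: Metal artifact removal
Difficulty: Hard
Cost: Low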

MagicMakeup: A Region-Controllable Diffusion Transformer for High-Fidelity Makeup-Transfer

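This study proposes MagicMakeup, a region-controllable diffusion transformer that accounts for the strongly regional nature of makeup in order to improve makeup transfer.

Use case: Makeup transfer
Difficulty: Hard
Cost: High

Suited to MI, Quality prediction/anomaly detection, Deep learning, Transformer, Classification, Image, Text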

Sidewalk Moments: Are Richer Representations Always More Human-Aligned? Evidence from City-Walk Videos

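This study analyzes city-walk videos using representations from four modalities: spatiotemporal information, time-averaged images, audio encodings, and text-based representations.

Use case: Analysis of city-walk videos
Difficulty: Hard
Cost: High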

DINO-VPT: Hierarchical Visual Prompt Tuning for Joint Physical-Digital Face Anti-Spoofing

This paper proposes DINO-VPT, which uses Hierarchical Visual Prompt Tuning (HVPT) to detect both physical and digital spoofing ...

Deep learning, Model compression/quantization, Image, Text, Multimodal

Use case: Face anti-spoofing
Difficulty: Hard
Cost: High

MAGE-Vein: Multi-Instance Age and Gender Estimation from Finger Vein Images

This paper proposes MAGE-Vein, a multi-instance age and gender estimation model that works from finger vein images.

Machine learning, Supervised learning, Classification, Image

Use case: Age and gender estimation
Difficulty: Easy
Cost: Low

Engine-Native Editable 3D World Reconstruction with Objects and Lighting

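This paper proposes Lumera, a method for engine-native, editable 3D world reconstruction that covers both objects and lighting.

Natural language processing, Large language models, Detection, Generation, Image

Use case: 3D world reconstruction
Difficulty: Hard
Cost: High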

WhereEdit: Mask-aware Local Latent Editing for One-Step Image Editing

This study proposes WhereEdit, which performs one-step image editing using mask-aware local latent editing.

Use case: Image editing
Difficulty: Hard
Cost: Medium

Computer vision, Segmentation, Classification, Image, Supervised

Webly Supervised Multi-Label Recognition: Evaluation Benchmark and Dual-Branch Multi-Label Contrastive Learning

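This paper addresses webly supervised multi-label recognition (WS-MLR), which learns multi-label recognition from web image datasets.

Use case: Webly supervised multi-label recognition
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, Large language models, Image, Text, Video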

ViSTR-Bench: Can MLLMs Reason from Continuous Visual Cues in Dynamic Scenes?

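This paper proposes ViSTR-Bench, a benchmark that evaluates whether MLLMs can reason from continuous visual cues in dynamic scenes.

Use case: Evaluating MLLM reasoning in dynamic scenes
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Generation, Image, 3D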

SubSplat: High-Resolution Pixel-aligned 3DGS via Sub-pixel Gaussian Reparameterization

Pixel-aligned Gaussian splatting enables efficient and generalizable novel-view synthesis. However, high-resol

Use case: Generation
Difficulty: Hard
Cost: High

Sensor/time series, Quality prediction/anomaly detection, Computer vision, Multimodal, Image

AXIS: A Growable Community-Driven Data Engine for Scalable Robot Manipulation

Learning effective robot manipulation policies requires diverse, high-quality demonstrations, yet existing dat

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Computer vision, Segmentation, QA, Image, Text

Beyond Episodic Evaluation: Memory Architectural Bottlenecks in Sequential Embodied Question Answering

Embodied question answering (EQA) is traditionally evaluated under an episodic formulation, where agents solve

Use case: QA
Difficulty: Hard
Cost: High

Sensor/time series, Deep learning, Transformer, Detection, Image, Audio

Human-Inspired Framework for Robotic Craniotomy: Integrating Multimodal Fusion and Adaptive Trajectory Adjustment

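Proposes a human-inspired framework for robotic craniotomy. The framework combines up-front planning with adjustment during execution, adapting automatically during surgery to achieve a procedure as safe and efficient as a human's.

Use case: Automating craniotomy
Difficulty: Hard
Cost: High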

GuidedAttention: Interpretable and Correctable Visual Attention for OOD-Robust Robot Manipulation via Imitation Learning

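Proposes a framework for learning visuomotor policies that makes visual attention explicit, so that humans can understand and correct it.

Explainable, Generative AI, Diffusion models, Anomaly detection, Image

Use case: Visual attention for robot manipulation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Natural language processing, RAG, Generation, Image, Text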

TableVerse: A Large-scale Tabletop Dataset with Real-world Grounded Layouts for Generalizable Manipulation

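Proposes TableVerse, a large-scale tabletop dataset aimed at automated manipulation. The dataset includes a practical method for generating physically feasible, real-world layouts ...

Use case: Generating tabletop environments for automated manipulation
Difficulty: Hard
Cost: Low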

huggingface: available on Hugging Face, 2026-07-23

K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

Large language models are increasingly used in K-12 education, but existing benchmarks mainly test exam questi

Natural language processing, Large language models, QA, Image, Text

Use case: QA
Difficulty: Easy
Cost: High

remove-ai-watermarks — AI watermark remover. CLI and Python library to strip visible and invisible AI watermarks (Gemini / Nano Banana sparkle, SynthID) and provenance metadata (C2PA, EXIF, IPTC) from images.

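A CLI and Python library for stripping visible and invisible AI watermarks and provenance metadata (C2PA, EXIF, IPTC) from images.

Natural language processing, Large language models, Generation, Image

Use case: Removing AI watermarks and provenance metadata from images
Difficulty: Easy
Cost: High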

SimpleTuner — A general fine-tuning kit geared toward image/video/audio diffusion models.

A general-purpose fine-tuning kit for image, video, and audio diffusion models.

Natural language processing, Fine-tuning, Image, Audio, Video

Use case: Fine-tuning diffusion models
Difficulty: Easy
Cost: High

best-of-ml-python — 🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

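A ranked list of machine learning libraries for Python.

Use case: Python ML libraries
Difficulty: Easy
Cost: Medium

Suited to tabular data, Natural language processing, Large language models, Image, Text, Tabular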

unstructured — Convert documents to structured data effortlessly. Unstructured is open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise grade Platform product for production grade workflows, partitioning, enrichments, chunking and embedding.

An open-source ETL solution for converting documents into structured data.

Use case: Structuring documents
Difficulty: Easy
Cost: High

PhysCoRe: Physics-Corrected Residual World Models for Material-Aware Deformable Dynamics

Predicting how deformable objects evolve under robotic manipulation is a longstanding challenge. Existing appr

Natural language processing, Fine-tuning, Image

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Medium

One Round Is All You Need: Analytic Federated Learning for Task-Heterogeneous Multi-Label Medical Image Classification

Federated learning (FL) enables multiple clinical institutions to collaboratively train a shared disease class

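Computer vision, Image classification, Classification, Regression, Image

Use case: Classification
Difficulty: Hard
Cost: Low

Suited to MI, Sensor/time series, Deep learning, Transformer, Classification, Image, Time series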

Multi-modal transformer for signal classification in nanopore blockade experiments

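This study proposes a multi-modal transformer for analyzing the complex signals produced by nanopore measurements, improving signal classification accuracy.

Use case: Signal classification in nanopore experiments
Difficulty: Hard
Cost: Low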

Self-supervision drives representational convergence in medical foundation models more than clinical supervision

Medical image encoders from different groups are increasingly treated as interchangeable, on the assumption th

Natural language processing, RAG, Classification, Image, Text

Use case: Classification
Difficulty: Hard
Cost: High

Non-negative matrix factorization using the R package nnmf

A study of non-negative matrix factorization (NMF) ...

Mathematics/theory, Optimization, Image, Text

Use case: Non-negative matrix factorization
Difficulty: Hard
Cost: Medium
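
For readers unfamiliar with NMF, the idea is to approximate a non-negative matrix V by a product W H of two smaller non-negative matrices. The sketch below uses the classic multiplicative update rules for the squared-error objective in pure Python. It is an illustrative assumption, not the algorithm of the nnmf package (which the listing does not describe), and every function name here is hypothetical.

```python
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor V (n x m) into W (n x rank) and H (rank x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]  # toy rank-1 matrix
w, h = nmf(v, rank=1)
print(frobenius_error(v, w, h))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative as long as they start that way, which is why the initial values are drawn from a strictly positive range.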

Bayesian uncertainty estimation improves clinical decision making in medical AI agents

Machine learning models for medical image analysis typically lack a reliable measure of confidence, limiting t

Deep learning, Normalization/optimization methods, Classification, Detection, Image

Use case: Classification
Difficulty: Hard
Cost: High

The Giant Hippocampus: From Structural Monoculture to a System of Systems

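To bridge the fields of AI research and neuroscience, this study examines the system-level structure of the brain and proposes a new approach derived from it.

Deep learning, Transformer, Classification, Image, Text

Use case: The brain's system structure and its applications
Difficulty: Hard
Cost: High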

Adversarial Frontiers: Minimum-Norm Attack Ensembles for Robustness Evaluation

Adversarial robustness is commonly evaluated with predefined attack ensembles, such as AutoAttack, at a single

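Computer vision, Image classification, Image

Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection, Computer vision, Segmentation, Classification, Generation, Image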

Analytic Distribution of Classifier-Free Guidance for Schedule Design

Classifier-free guidance (CFG) is the default mechanism for conditional generation in diffusion models, but th

Use case: Classification
Difficulty: Hard
Cost: High
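
As background for this entry, classifier-free guidance is commonly written as an extrapolation from the unconditional noise prediction toward the conditional one. The sketch below shows that standard combination only; it is not the paper's schedule design, the function name `cfg_combine` is hypothetical, and some papers use a slightly different convention for the guidance scale.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# scale = 0 ignores the condition, scale = 1 is plain conditional sampling,
# and scale > 1 pushes past the conditional prediction.
print(cfg_combine([0.0, 1.0], [1.0, 1.0], 2.0))
```

A guidance schedule, in this framing, would simply make `scale` a function of the sampling step instead of a constant.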

Can an AI System Be Creative? A Critical Perspective from Art and Engineering

This paper examines the question of whether artificial intelligence (AI) systems can be creative, approached f

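Deep learning, Transformer, Classification, Generation, Image

Use case: Classification
Difficulty: Hard
Cost: Low

Sensor/time series, Deep learning, Model compression/quantization, Image, Text, Multimodal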

Robostral Navigate

Deploying navigation systems at scale requires a recipe that minimizes sensor assumptions, generalizes across

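Use case: Technical validation and paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Deep learning, Model compression/quantization, Segmentation, Image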

U-CFR: Uncertainty-Guided Cascade Forward Refinement for Interactive Segmentation

Interactive image segmentation is critical for efficient image annotation; however, existing methods often req

Use case: Segmentation
Difficulty: Hard
Cost: Low

DS@GT ARC at ImageCLEFmed GANs 2026: Geometric Filtering for Privacy-Preserving CT Slice Generation

We present a privacy-preserving framework for synthetic lung CT slice generation developed for the Image-CLEFm

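Natural language processing, Embeddings/retrieval, Generation, Image

Use case: Generation
Difficulty: Hard
Cost: High

Quality prediction/anomaly detection, Image inspection, Deep learning, Model compression/quantization, Generation, Image, Text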

Demonstrating GenDB: Instance-Optimized and Customized Query Processing Code Generation via LLM Agents

Traditional query processing engines require continuous development and extensions to support new techniques a

Use case: Generation
Difficulty: Hard
Cost: High

SoftReason: A Fully Differentiable Neuro-Soft-Symbolic Deductive Reasoning Architecture over High-Dimensional Perceptual Data

In many reasoning problems, the premises are not observed as discrete symbols, but must be inferred from high-

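Suited to MI, Natural language processing, Embeddings/retrieval, QA, Image

Use case: QA
Difficulty: Hard
Cost: Low

Quality prediction/anomaly detection, Deep learning, Transformer, Classification, Generation, Image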

Persian Pixel: A large-scale synthetic OCR dataset for Persian language

Optical Character Recognition (OCR) for Persian remains substantially less mature than for Latin-script langua

Use case: Classification
Difficulty: Hard
Cost: High

Closing the Lab-to-Store Gap: A Data-Efficient Post-Training and Experience-Driven Learning VLA Framework for Retail Humanoids

Closing the gap between benchmark performance and reliable real-world operation remains a central challenge fo

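Deep learning, Model compression/quantization, Anomaly detection, Image, Text

Use case: Anomaly detection
Difficulty: Hard
Cost: High

Suited to MI, Quality prediction/anomaly detection, Deep learning, Transformer, Generation, Image, Video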

StreamHOI: Interaction-aware Temporal Memory Adaptation for Streaming HOI Video Generation

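HOI video generation is usually done offline for short clips, an approach that is impractical for long-duration generation in practice. StreamHOI generates interaction video using a number of previously generated frames ...

Use case: Streaming HOI video generation
Difficulty: Hard
Cost: High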

ENTRAP-VL: A Taxonomic Probe for Dual Contextual Entrainment in Vision-Language Models

Contextual entrainment is the tendency of a model to let auxiliary context in its input pull its output, indep

Computer vision, Multimodal, Image, Text

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

A Systematic Benchmark of Intensity Normalisation Methods for 3D Knee MRI Segmentation and Cross-Domain Generalisability

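Compares seven intensity normalisation methods for MRI and evaluates meniscus segmentation accuracy with a 3D U-Net model.

Computer vision, Segmentation, Image, 3D

Use: MRI intensity normalisation
Difficulty: Hard
Cost: High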

SpikingMOT: A Spike-Driven Multi-Object Tracker

Multi-object tracking (MOT) plays a fundamental role in visual perception, where accurate trajectory predictio

Deep learning, Normalization / optimization methods, Image

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Medium

Learning to Detect UI Principle Violations via Reinforcement Learning

Small language models and coding agents increasingly generate web front-end code, yet their outputs are typica

Use: Generation
Difficulty: Hard
Cost: High

Deep learning, Transformer, Image, Text, Multimodal

Test-Time Training for Modality Order Consistency in Vision-Language Models

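Vision-language models were found to be strongly affected in performance by the order in which the image and the question are presented.

Use: Addressing the sensitivity of model outputs to modality order
Difficulty: Hard
Cost: High

MI-oriented, Natural language processing, Large language models, Generation, Image, Text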

Back to Back with a Copy: A Computational Analysis of AI-Generated Visual Contemporary Art Pastiches

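AI is particularly capable of producing pastiches of contemporary artworks; this study examines how closely those outputs resemble the actual works.

Use: Measuring the similarity between AI-generated artworks and the originals
Difficulty: Hard
Cost: High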

VizRAG: Enhancing Retrieval-Augmented Generation with Hypergraph Visualization

Hypergraph-based RAG systems surpass traditional graph-based approaches by organizing complex n-ary atomic fac

Use: Generation
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, Multimodal, Classification, Generation, Image

Ocular Verification for Virtual Reality

Virtual reality (VR) headsets (e.g., Meta Quest, Apple Vision Pro) provide a seamless user experience due to t

Use: Classification
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Generation, Image, 3D

A real-time RGB-D perception pipeline for autonomous impact hammers in mining: self-filtering, rock segmentation and rock-breaking poses generation

Impact hammers, also known as rock-breakers, are essential machines in mining operations, where they perform s

Use: Generation
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, Segmentation, Generation, Image, 3D

Axolotl3D: a Unified Framework for Faithful 3D Shape Completion

Recent 3D generative models produce high-quality geometry from a single image using large-scale priors and dif

Use: Generation
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, 3D / point cloud, Generation, Image, 3D

ATSplat: Compact Feed-forward 3D Gaussian Splatting with Adaptive Token Expansion

Novel view synthesis is the task of generating images from new viewpoints given input images. ATSplat adapts 3D Gaussian splatting to a feed-forward setting. As a result, ATSp

Use: Novel View Synthesis
Difficulty: Hard
Cost: High

Look Less, Think Faster: Joint Token-Compute Adaptation for Multimodal LLMs

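Multimodal large language models are strong at vision-language tasks, but their high inference cost is a problem. Look Less, Think Faster optimizes each dimension individually, so that multimodal large language

Deep learning, Model compression / quantization, Image, Text, Multimodal

Use: Reducing the cost of vision-language tasks with multimodal large language models
Difficulty: Hard
Cost: High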

Diverse-Intent Multi-Turn Fashion Image Retrieval

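Multi-turn fashion image retrieval is an important task in real-world fashion search. Diverse-Intent Multi-Turn Fashion Image Retrieval handles differing search intents

Use: Multi-turn fashion image retrieval
Difficulty: Hard
Cost: High

Sensor / time series, Deep learning, Model compression / quantization, QA, Image, Text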

Multimodal Large Language Models for Remote Sensing Image Understanding: Domain-Specific or General-Purpose?

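Multimodal large language models for image understanding are powerful, but their capabilities and limits are not yet clearly understood. This paper analyzes how capable multimodal large language models are at image understanding and where their limits lie, and

Use: Capabilities and limits of multimodal large language models in remote sensing image understanding
Difficulty: Hard
Cost: High

Tabular-oriented, Explainable, CPU-friendly, Quality prediction / anomaly detection, Computer vision, Object detection, Classification, Detection, Image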

How Does Urban Context Relate to Residential Building Health? A Vision-POI Fusion Framework for Building-Level Housing Inspection

Housing-level urban physical examination is essential for identifying residential building problems and suppor

Use: Classification
Difficulty: Hard
Cost: Low

Computer vision, Segmentation, Generation, Image, Video

Vera: Identity-Faithful Human Subject-to-Video Generation

Subject-to-video (S2V) generation has made substantial progress in preserving reference subjects across divers

Use: Generation
Difficulty: Hard
Cost: High

Sensor / time series, Deep learning, Model compression / quantization, Detection, Segmentation, Embedding

Not All Patches are Equal: Sampling Matters for Visible-Infrared Pre-Training

Visible-infrared (VIS-IR) alignment is a key pre-training task for robust multi-sensor perception. Most existi

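Use: Detection
Difficulty: Hard
Cost: High

Sensor / time series, Quality prediction / anomaly detection, Natural language processing, Large language models, Generation, Image, Text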

RS-RIE-Bench: Benchmarking Reasoning-Guided Remote Sensing Image Editing

Remote sensing image editing aims to modify remote sensing images according to natural language instructions w

Use: Generation
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Transformer, Generation, Image, Text

SHFormer: Dynamic Spectral Filtering Convolutional Neural Network and High-pass Kernel Generation Transformer for Adaptive MRI Reconstruction

Attention Mechanism (AM) selectively focuses on essential information for imaging tasks and captures relations

Use: Generation
Difficulty: Hard
Cost: High

RIM: A Retrieval-In-Matching Framework for Cross-Domain Global Visual Localization of UAVs

Global visual localization of unmanned aerial vehicles (UAVs) using remote-sensing reference maps has attracte

Sensor / time series, Deep learning, Model compression / quantization, Detection, Image, 3D

Use: Detection
Difficulty: Hard
Cost: High

Development of an automated, reliable, and clinically meaningful artificial intelligence (AI) tool for diagnosing cardiac disease from conventional cardiovascular magnetic resonance (CMR) images

Aims: Cardiovascular magnetic resonance (CMR) imaging enables non-invasive assessment of myocardial structure,

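Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Explainable, Quality prediction / anomaly detection, Deep learning, Model compression / quantization, Regression, Image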

Factor-Informed Uncertainty Distillation for Gaze Estimation

Deep gaze estimation works well in controlled capture but degrades in unconstrained settings, where systems mu

Use: Regression
Difficulty: Hard
Cost: Medium

Importance-Aware OBS Pruning for Diffusion Models

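Aiming to improve segmentation performance and reduce computational resources, Lean-SAM2 equips the Segment Anything Model 2 (SAM2) with target-anchored memory and encoder

Deep learning, Model compression / quantization, Generation, Image

Use: Efficient image segmentation
Difficulty: Hard
Cost: High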

Toward Seasonal Guidelines for Robust Deep-Learning Sentinel-2 Building Detection in Different Area Types

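OffNadirLoc proposes a benchmark that accounts for off-nadir viewpoints in geo-localization. This allows cross-view geo-localization between drone and satellite imagery to focus on structural scene understanding and relational constraints between domains

Deep learning, CNN, Classification, Detection, Segmentation

Use: Improving drone-to-satellite geo-localization
Difficulty: Hard
Cost: High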

STEREOFLOW: Progressive Stereo Matching with StereoDiT and Transition Flow Matching

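Stereo matching is an important task in 3D reconstruction. This work treats stereo matching as a probabilistic generative task and, with the aim of improving object detection, proposes a way to integrate a stereo matching framework with latent distributions

Deep learning, Transformer, Generation, Regression, Image

Use: Improving object detection
Difficulty: Hard
Cost: High

Sensor / time series, Quality prediction / anomaly detection, Deep learning, RNN / LSTM, Forecasting, Image, Multimodal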

Forecasting the Number of Harvest-ready Fruits of Sweet Peppers Using Multimodal Time-Series Data

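This study proposes a deep learning framework that integrates multimodal time-series data to forecast the number of harvest-ready sweet pepper fruits.

Use: Forecasting harvest-ready fruit counts in agriculture
Difficulty: Hard
Cost: High

Natural language processing, Prompt engineering, Detection, Image, Text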

OffNadirLoc: Benchmark and Framework for Challenging UAV-to-Satellite Geo-Localization under Large Off-Nadir Views

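OffNadirLoc proposes a benchmark for cross-view geo-localization. It lets geo-localization between drone and satellite imagery focus on structural scene understanding and relational constraints between domains.

Use: Improving UAV-to-satellite geo-localization
Difficulty: Hard
Cost: High

MI-oriented, Quality prediction / anomaly detection, Natural language processing, Large language models, Generation, Image, Text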

ETPDesigner: Multi-Agent Orchestration for Interactive Multimodal Electronic Theater Program

ETPDesigner proposes a framework that automates the design of multimodal electronic theater programs.

Use: Generation
Difficulty: Hard
Cost: High

MV-Bench: Benchmarking Multimodal Large Language Models for Coordinated Multi-View Interface Construction

Multimodal large language models (MLLMs) are increasingly expected to automate visualization development by ge

Use: Generation
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, Segmentation, Generation, Image, Text

OSVE: One Step Video Editing with One Step Diffusion Models

Text-guided video editing with diffusion models is impractically slow, hindered by costly multi-step sampling

Use: Generation
Difficulty: Hard
Cost: High

LAVIFT: Latent-Action-Guided Vision Fine-Tuning for Surgical Interaction Recognition

Understanding instrument-tissue interactions is essential for context-aware surgical AI and autonomous robotic

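Natural language processing, Fine-tuning, Classification, Detection, Image

Use: Classification
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Attention mechanism, Classification, Generation, Image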

MTVDiff: Multimodal Conditional Latent Diffusion for Enhanced Thermal-to-Visible Face Translation

Thermal-to-visible face translation presents fundamental challenges including geometric discontinuities, seman

Use: Classification
Difficulty: Hard
Cost: High

Natural language processing, Fine-tuning, Image, Video, Multimodal

EA-Nav: Learning Safe Visual Navigation Policies with Embodiment Awareness

Cross-embodiment navigation is a key challenge in embodied intelligence. Due to differences in embodiment, the

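Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Model compression / quantization, Detection, Segmentation, Image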

Current Injection Spiking Neural Network for Infrared and Visible Image Fusion

Infrared and visible image fusion (IVIF) integrates the complementary information of two modalities into a sin

Use: Detection
Difficulty: Hard
Cost: High

KineBench: Benchmarking Embodied World Models via IDM-Free Kinematic Grounding

Evaluating the physical consistency of embodied world models(EWMs) is a critical open challenge. While closed-

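Computer vision, 3D / point cloud, Generation, Anomaly detection, Image

Use: Generation
Difficulty: Hard
Cost: High

Small-data-oriented, CPU-friendly, Condition optimization, Natural language processing, Fine-tuning, Detection, Generation, Image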

PRISM-DR: Per-lesion Retinal Inference with Specialist Models for Diabetic Retinopathy

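This study develops PRISM-DR, a system for detecting diabetic retinopathy lesions. It helps find small, low-contrast lesions that clinicians may miss.

Use: Detecting diabetic retinopathy lesions
Difficulty: Hard
Cost: High

Natural language processing, Large language models, Segmentation, Image, Text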

Memory-Augmented Multimodal Large Language Models for Small Object Understanding in Streaming Aerial Videos

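This study develops a memory-augmented large language model for recognizing small objects in drone footage. The model can identify objects according to user instructions in complex drone scenes.

Use: Object recognition from drones
Difficulty: Hard
Cost: High

Deep learning, Attention mechanism, Segmentation, Image, Text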

Lean-SAM2: Target-Anchored Memory and Encoder Acceleration for SAM2

The Segment Anything Model 2 (SAM2) has advanced temporal promptable segmentation, yet its deployment remains

Use: Segmentation
Difficulty: Easy
Cost: Medium

Quality prediction / anomaly detection, Computer vision, Multimodal, QA, Image

Silent Failures in Multimodal Agentic Search: A Diagnostic Taxonomy and Cross-Judge Evaluation

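This study proposes a new method for evaluating responses to visual questions. It assesses not only whether an answer is correct but also the patterns and characteristics of the answers.

Use: Evaluating responses to visual questions
Difficulty: Hard
Cost: High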

Trace: A Taxonomy-Guided Environment for Multidomain Visual Reasoning

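Autonomous driving systems need to understand road topology (drivable lanes and their connectivity). Recent detection models can infer the lane topology of a road by acquiring images from a 360-degree field of view

Natural language processing, RAG, Image, Text, Multimodal

Use: Improving road topology recognition
Difficulty: Hard
Cost: High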

Physics-Aware Complex-Valued State Space Model with Scattering-Prior Feature Modulation for PolSAR Image Classification

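This study proposes a new physics-aware model for classifying polarimetric synthetic aperture radar (PolSAR) images. The model relates the radar imagery to the underlying physical processes.

Deep learning, Model compression / quantization, Classification, Embedding, Image

Use: Classification of polarimetric synthetic aperture radar images
Difficulty: Hard
Cost: Low

Quality prediction / anomaly detection, Natural language processing, Fine-tuning, Generation, Segmentation, Image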

Extending a Large View Synthesis Model for Multi-view Panoptic Segmentation

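Autonomous robots need the ability to avoid obstacles and accidents. The stronger this ability, the more effective the robot's countermeasures against obstacles and accidents, so that the robot can safely

Use: Enabling autonomous robots to avoid obstacles and accidents
Difficulty: Hard
Cost: High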

ReFace: Reorganizing Facial Spatiotemporal Representations for Improved Pain Assessment

Automatic pain assessment from facial video remains challenging due to the spatial heterogeneity of pain-relat

Computer vision, Segmentation, Image, Video

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

SafeGen: Goal-Conditioned Video Diffusion of Safety-Critical Scenarios for VLM-Based Autonomous Driving

VLMs are increasingly deployed in AD systems, creating an urgent need for rigorous safety evaluation under rar

Natural language processing, RAG, Generation, Image, Text

Use: Generation
Difficulty: Hard
Cost: High

PhenSPINE: A Standardized Benchmark for Spine Pathology Diagnosis

The accurate diagnosis of spinal pathologies depends heavily on radiological interpretation, yet automated sys

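Quality prediction / anomaly detection, Deep learning, CNN, Image, Text

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Medium

Deep learning, Transformer, Segmentation, Image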

A Unified Variational Framework for Deep Weakly Supervised Image Segmentation

We propose a unified variational framework for image segmentation under sparse pixel-level supervision. Our me

Use: Segmentation
Difficulty: Hard
Cost: High

FELT: Generating Tactile Signals from Vision for Visuo-Tactile Manipulation

The sense of touch is central to manipulation, especially when vision is occluded or ambiguous. Although combi

Sensor / time series, Deep learning, Model compression / quantization, Image

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

DINS-IO: Learned Inertial Odometry via Differentiable INS Consistency

The training of learned inertial odometry depends on dense, high-precision position ground truth from motion c

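Natural language processing, Fine-tuning, Image, Self-supervised

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Natural language processing, Prompt engineering, Detection, Image, Text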

ReferTrack: Referring Then Tracking for Embodied Visual Tracking

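ReferTrack is a system that makes a vehicle follow a target vehicle specified in natural language. It first recognizes the target vehicle and then predicts its motion.

Use: Making a vehicle follow a target vehicle
Difficulty: Hard
Cost: High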

Digital Twin Modeling of a Highly Automated Agricultural Tractor

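This project builds a digital twin model of a highly automated agricultural tractor. The digital twin uses CAN communication to mimic the tractor's motion and simulate the behavior of the real tractor.

Reinforcement learning, Image

Use: Digital twin modeling of an automated agricultural tractor
Difficulty: Hard
Cost: Medium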

V2F: Vision-Informed Grasp Force Prediction for Damage-Aware Robotic Handling of Date Fruits

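V2F is a system for robotic handling of date fruits. It predicts the force the robot needs when grasping a fruit, so that the fruit is neither damaged nor dropped.

Natural language processing, RAG, Segmentation, Image

Use: Robotic handling of date fruits
Difficulty: Hard
Cost: Low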

Computer vision, Object detection, Classification, Detection, Segmentation

github | GitHub available | 2026-07-22

supervision — We write your reusable computer vision tools. 💜

supervision lets users build their own computer vision tools using machine learning techniques.

Use: Custom computer vision tools
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-22

Awesome-CVPR2026-CVPR2025-ICCV2025-CVPR2024-ECCV2026-ECCV2024-AIGC — A Collection of Papers and Codes for CVPR2026/CVPR2025/ICCV2025/CVPR2024/ECCV2026/ECCV2024 AIGC

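A collection of research papers and code on AIGC from CVPR 2026, CVPR 2025, ICCV 2025, CVPR 2024, ECCV 2026, and ECCV 2024.

Computer vision, 3D / point cloud, Generation, Image, Video

Use: Surveying AIGC papers and code from CVPR, ICCV, and ECCV
Difficulty: Easy
Cost: High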

Deep Shape Regression for Planar Curves with Multimodal Covariates

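Proposes a deep learning model for estimating the shape of open planar curves.

Deep learning, CNN, Regression, Image, Multimodal

Use: Shape estimation with multimodal covariates
Difficulty: Hard
Cost: High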

Strong Gravitational Lensing Posterior Sampling in Pixel-Space Using Diffusion Models and Recurrent Inference Machines

Modeling galaxy-galaxy strong gravitational lenses to infer the brightness of the source galaxy and the mass d

Deep learning, Transformer, Generation, Image

Use: Generation
Difficulty: Hard
Cost: High

Provable diffusion-based posterior sampling for linear inverse problems via DDIM

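Proposes a diffusion-based sampling algorithm for solving inverse problems, which is expected to improve the accuracy of the recovered solutions.

Use: Solving inverse problems
Difficulty: Hard
Cost: High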

Spiking Neural Networks for fMRI-Based Visual Semantic Decoding

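Proposes a method based on spiking neural networks for decoding visual information from fMRI data, and validates it.

Deep learning, Transformer, Regression, Image

Use: Decoding visual information from fMRI
Difficulty: Hard
Cost: Medium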

Two-Level Meta-Rubrics for Evaluating Open-Ended Generation: GAMUT, a Benchmark for Factual Completeness

Evaluating the factuality of long-form generations has focused predominantly on precision, measuring whether t

Use: Generation
Difficulty: Hard
Cost: High

Computational Humor with Multimodal LLMs: Methods, Datasets, Evaluation, and Challenges

Multimodal humor in memes, cartoons, and comics remains difficult for AI systems because intended meaning depe

Natural language processing, Large language models, Classification, Generation, Image

Use: Classification
Difficulty: Hard
Cost: High

Bounding Boxes to Improve Small Language Model Performance on Vision-Based Grading Tasks

The deployment of Small Language Models (SLMs) in educational settings offers significant advantages in terms

Computer vision, Object detection, Detection, Image, Text

Use: Detection
Difficulty: Hard
Cost: Medium

Fusion Embedding: A Unified Embedding Space for Text, Image, Video, and Audio

A single embedding space that covers text, images, video, and audio lets one index serve every query a user ca

Use: Generation
Difficulty: Hard
Cost: High

Stochastic Meta-Unlearning: Bridging Language Backbone and Multimodal Unlearning

Machine unlearning for vision-language models (VLMs) remains underexplored. Unlike language models, VLMs combi

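Natural language processing, RAG, Image, Text, Multimodal

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Deep learning, Transformer, Classification, Generation, Segmentation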

Pathologist Attention-Aligned Report Generation for Prostate Histopathology

The allocation of visual attention by pathologists during cancer diagnosis is a highly selective process that

Use: Classification
Difficulty: Hard
Cost: High

VQ-Transplant: Efficient VQ-Module Integration for Pre-trained Visual Tokenizers

Vector Quantization (VQ) underpins modern discrete visual tokenization. However, training quantization modules

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

MI-oriented, Computer vision, Segmentation, QA, Image, Text

ChronoStitch: Training-Free Composition of Visual KV Memories for Long-Horizon Temporal Reasoning

Long-video question answering requires a model to preserve visual evidence over time without repeatedly reproc

Use: QA
Difficulty: Hard
Cost: High

Synthetic and Derived Training Images for Campus Waste Detection: A Multi-Seed Evaluation with YOLOv8n

Incorrect disposal can contaminate campus recycling streams, and a bin-mounted camera could provide feedback a

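Computer vision, Object detection, Detection, Image

Use: Detection
Difficulty: Hard
Cost: High

Sensor / time series, Natural language processing, Large language models, Image, Text, Video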

D3VL: Understanding Driving Scenes from 3D Time Series Data and Video with Language Models

Recent advances in Multimodal Large Language Models (MLLMs) have triggered the development of end-to-end MLLMs

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Geospatial Diffusion-based Evolution Synthesis (GeoDES) for Storm-Centered Weather Augmentation

While machine learning-based weather models hold significant promise, they struggle to predict the detailed st

Deep learning, Model compression / quantization, Generation, Image, Video

Use: Generation
Difficulty: Hard
Cost: High

Crowd4D: Scene-Aware Monocular 4D Crowd Reconstruction

Recovering scene-consistent 4D crowd motion from monocular video in large-scale scenes remains challenging due

Natural language processing, RAG, Image, Video, 3D

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Detect Early, Escalate Rarely: Anytime Detection of AI-Generated Video from the Compressed Bitstream

Detectors for AI-generated video are evaluated offline. A clip is decoded to pixels and scored once, increasin

CPU-friendly, Deep learning, CNN, Detection, Image, Text

Use: Detection
Difficulty: Hard
Cost: High

MI-oriented, Deep learning, Transformer, Generation, Image, Text

Appearance Pointers -- Multimodal Region Control of Diffusion Transformers

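In image generation, controlling materials, objects, and regions is difficult. Diffusion Transformers can process text and images together, but they have had no mechanism for deciding how strongly each should influence the result.

Use: Multimodal image control
Difficulty: Hard
Cost: High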

Masked Visual Actions for Unified World Modeling

Video models absorb rich priors over how the visual world moves, interacts, and responds to contact, making th

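Computer vision, Segmentation, Image, Video

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

MI-oriented, Natural language processing, Fine-tuning, Generation, Image, Text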

ExpertVerse: A General-Purpose Benchmark for Expert-Level Reasoning in Knowledge-Intensive Visual Synthesis

Recent advances in multimodal generative models have enabled instruction-based image generation to move beyond

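Use: Generation
Difficulty: Hard
Cost: High

MI-oriented, Quality prediction / anomaly detection, Natural language processing, Large language models, Image, Audio, Video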

OmniReasoner: Thinking with Long Audio-Video via Native Tool Use

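Proposes OmniReasoner, a method that combines the original data with a zoom-in tool. This improves reasoning over long audio-video inputs in omni-modal LLMs.

Use: Improving reasoning over long audio-video inputs
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Transformer, Generation, Image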

ROMS-IMLE: A Minimalist Approach to Competitive Single-Step Generative Modelling

Proposes a new approach to building generative models, making their construction more efficient while retaining strong expressive power.

Use: Building generative models
Difficulty: Hard
Cost: High

InstructMixup: Instruction-Guided Salient Patch Editing for Robust Data Augmentation

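Proposes InstructMixup, an extension of mixup-style augmentation that blends image and video data according to textual instructions. The data is augmented while its content and labels are preserved.

Deep learning, Transformer, Classification, Detection, Generation

Use: Extending mixup for data augmentation
Difficulty: Hard
Cost: High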

ERank in Latent Space as an Image-Complexity and Richness Measure

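Computer vision and image recognition need a useful measure of the visual richness of an image, but existing measures have limitations. To address this, the paper proposes a measure based on variance in channel space.

Computer vision, Segmentation, Classification, Image

Use: A new measure of the visual richness of images
Difficulty: Hard
Cost: High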

PathAgentBench: Benchmarking Evidence-Seeking Vision-Language Models on Whole-Slide Pathology Image

Whole-slide image (WSI) diagnosis requires identifying diagnostically relevant regions, examining them across

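Natural language processing, Fine-tuning, Detection, Generation, Image

Use: Detection
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Deep learning, Transformer, Segmentation, Image, Text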

IGGT4D: Streaming 4D Instance-Grounded Geometry Transformer

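Real-world spatial intelligence requires understanding continuously streaming video of a space. To address this, the paper proposes a model that can understand 4D space.

Use: Understanding continuously streaming video of a space
Difficulty: Hard
Cost: High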

Anatomy-Aware 3D Mesh Refinement of Pericardium Segmentations on Computed Tomography

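Pericardium segmentation is important for downstream clinical measurements, but it is difficult to perform accurately. To address this, the paper proposes a method that uses the surrounding anatomical structures to improve the pericardium segmentation.

Natural language processing, RAG, Segmentation, Image, Text

Use: Accurate pericardium segmentation from cardiac CT images
Difficulty: Hard
Cost: High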

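Eversion-based robots can enable safe access, steering and endoscopic imaging within the spinal subarachnoid space

This study proposes a medical robot that enables safe access, steering, and endoscopic imaging within the spinal subarachnoid space.

Computer vision, Multimodal, Image

Use: Medical robot for the spinal subarachnoid space
Difficulty: Hard
Cost: High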

Cognitive Dual-Process Planning for Autonomous Driving with Structured Scene Knowledge and Verifiable Reasoning-Action Consistency

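Planning for autonomous driving involves scene understanding, timely reasoning, and action selection, but combining these elements is difficult. To address this, scene understanding is decoupled so that planning becomes safe and effective

Deep learning, Model compression / quantization, Image, Text, Multimodal

Use: A decoupled planning system for autonomous driving
Difficulty: Hard
Cost: High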

Agentic Real2Sim: Physics-based World Modeling with Vision-Language Agents

Real-to-sim conversion for robotic interaction with objects remains labor-intensive because it requires more t

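Computer vision, Multimodal, Image, Text

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Sensor / time series, Computer vision, Video recognition, Detection, Generation, Image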

NGPS: GPS-Denied Aerial Geo-Localization and 2.5D Reconstruction via Deep Satellite Image Matching and Multi-Rate Sensor Fusion

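This study proposes NGPS, a framework for aerial localization without GPS signals. NGPS enables position estimation without relying on GPS.

Use: Aerial localization without GPS signals
Difficulty: Hard
Cost: High

Natural language processing, Prompt engineering, Image, Text, Video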

WorldScape Policy 2.0: Empowering Steerable World Action Modeling with Reasoning-Augmented Memory

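World Action Models (WAMs) are a paradigm for modeling robot manipulation. WAMs jointly model visual state transitions and robot actions. However, existing WAMs

Use: Solving multi-purpose manipulation problems
Difficulty: Hard
Cost: High

Sensor / time series, Computer vision, 3D / point cloud, Classification, Image, Video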

MVP-Tac: A Miniaturized Dual-Modal Vision and Photoelastic Tactile Sensor for Robot-Assisted Minimally Invasive Surgery

Robot-assisted minimally invasive surgery (RMIS) offers major benefits over open and conventional laparoscopic

Use: Classification
Difficulty: Hard
Cost: High

End-to-end Conditional Diffusion for Realistic and Controllable Visual Traffic Scenario Generation

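This paper proposes E2E-CDiff, a conditional diffusion approach for closed-loop traffic scenario generation. It can generate realistic traffic scenarios and control them.

Generative AI, Diffusion models, Generation, Image

Use: Generating autonomous driving data
Difficulty: Hard
Cost: High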

huggingface | Hugging Face available | 2026-07-21

Text Template Tokens Are Implicit Semantic Registers in Diffusion Transformers

Text-to-image diffusion transformers (DiTs) jointly process text and image tokens, yet their internal computat

Explainable, Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-21

Mage-Flow: An Efficient Native-Resolution Foundation Model for Image Generation and Editing

Large-scale visual generators are increasingly capable but costly to train, fine-tune, and deploy. We introduc

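Quality prediction / anomaly detection, Deep learning, Transformer, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-07-21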

Delineate Anything v2: A Global Foundation Model for Field Delineation

Accurate agricultural field boundary delineation at large scale is a foundational task for food security, supp

Natural language processing, RAG, Image, Text

Use: Technical validation / paper-reading aid
Difficulty: Easy
Cost: Low

github | GitHub available | 2026-07-21

awesome-datascience — :memo: An awesome Data Science repository to learn and apply for real world problems.

A repository useful for learning data science, with material geared toward real-world problems.

Deep learning, Image

Use: Learning data science
Difficulty: Easy
Cost: Medium

Program Synthesis for Simulation-Based Inference: Joint Model Selection and Parameter Estimation

Neural simulation-based inference enables parameter estimation for complex models, but typically requires the

Use: Generation
Difficulty: Hard
Cost: High

CANDOR: Chance-Calibrated Discordance in Frozen Foundation Encoders

Frozen encoders are chosen by how well a lightweight head reads a finding from their features, not whether the

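Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Medium

Explainable, Quality prediction / anomaly detection, Natural language processing, RAG, Generation, Image, Text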

PathReportEval: A Systematic Benchmark for Pathology Report Generation

Pathology report generation from whole-slide images (WSIs) is a rapidly growing multimodal learning problem, y

Use: Generation
Difficulty: Hard
Cost: High

Relay-Bench: Evaluating LLMs on Multi-Domain Reasoning Chains

Introducing Relay-Bench, an unsaturated, holistic, text-only benchmark that measures LLMs' ability to complete

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Sensor / time series, Deep learning, Transformer, Classification, Anomaly detection, Image

Recti-Q: Feature-Space Rectification for Out-of-Distribution-Robust Quantized Perception in Edge Robotics

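Proposes a new machine learning approach that stabilizes image recognition accuracy in edge robotics by improving post-quantization performance and mitigating the effects of distribution shift.

Use: Stable image recognition in edge robotics
Difficulty: Hard
Cost: High

Quality prediction / anomaly detection, Computer vision, Multimodal, Image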

MAGE: Human-Like Macro Placement via Agentic Multimodal Reasoning

Macro placement still requires substantial manual refinement in industrial physical design flows. We present M

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Deep learning, Transformer, Embedding, Image, Text

Patch Policy: Efficient Embodied Control via Dense Visual Representations

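To make robot control more efficient, the paper proposes patch-based policy learning implemented with dense visual representations.

Use: Control of resource-constrained robots
Difficulty: Hard
Cost: High

Sensor / time series, Deep learning, Model compression / quantization, Image, Text, Multimodal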

FM-VLA: Force-based Memory for Vision-Language-Action Models in Contact-Rich Manipulation

Proposes FM-VLA, a force-based memory method that addresses the limitations of existing VLA models.

Use: Handling the state of manipulated objects
Difficulty: Hard
Cost: High

Optimization of sim-to-real transfer in the humanoid robot NICO

Proposes a sim-to-real transfer method that addresses the limitations of existing robotic grasping methods and improves the success rate.

Computer vision, Object detection, Detection, Image

Use: Robotic grasping
Difficulty: Hard
Cost: Medium

Learning Adaptive Safety Margins for Visual Navigation

Robots in cluttered indoor spaces often fail not because they cannot generate collision-free paths, but becaus

Computer vision, 3D / point cloud, Image, Text, 3D

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Remote Awareness of Seafloor Images Collected by AUVs over Low-Bandwidth Communication Links

This paper introduces a method for real-time processing and transmission of autonomous underwater vehicle (AUV

Natural language processing, RAG, Image

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Low

Distilling Global Traversability Priors for Image-based Affordance Prediction in Off-road Environments

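Proposes a method for extracting global traversability priors that addresses the limitations of existing robot navigation methods, enabling robot travel in off-road environments

Sensor / time series, Natural language processing, RAG, Image, 3D

Use: Robot travel in off-road environments
Difficulty: Hard
Cost: High

Computer vision, Segmentation, Generation, Image, Video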

Does Robust VIO Need More Learning? Geometry-Verified Visual Measurements under Distribution Shift

Learning is increasingly introduced into visual-inertial odometry (VIO), ranging from learned feature front-en

Use: Generation
Difficulty: Hard
Cost: High

UMCP: A Unified Multi-Task Collaborative Perception Network for Luggage Trolley Pose Estimation

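A robot's vision system must estimate the pose of luggage trolleys with high accuracy and in real time. In conventional approaches, multiple models run inference one after another, which increases inference latency and, for large-scale deployment

Computer vision, Object detection, Detection, Image

Use: Luggage trolley pose estimation
Difficulty: Hard
Cost: Medium

Explainable, Quality prediction / anomaly detection, Reinforcement learning, Policy gradient (PPO / A3C), Image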

ConceptTree: Bringing Semantic Transparency to Black-Box Decision Making for Robotic Manipulation

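This paper proposes ConceptTree, a framework that represents high-level skill selection in manipulation using human-understandable concepts, increasing transparency.

Use: Transparency in high-level skill selection for manipulation
Difficulty: Hard
Cost: High

Sensor / time series, Quality prediction / anomaly detection, Natural language processing, Fine-tuning, Detection, Image

arxiv | GitHub available | 2026-07-20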

Polar Coordinate-based Differential Evolution for Moving Target Search Using Vision Sensor on Unmanned Aerial Vehicles

In search and rescue operations, there is a period known as the "golden time" during which the probability of

Use: Detection
Difficulty: Easy
Cost: Medium

From Sign Language Generation to Humanoid Execution: Vision-Language Guided Retargeting with Collision Mitigation

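This paper aims at autonomous action generation for humanoid robots and shows that a robot can act autonomously from vision-language-guided instructions.

Computer vision, 3D / point cloud, Generation, Image, 3D

Use: Autonomous action generation for humanoid robots
Difficulty: Hard
Cost: High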

VLN-AVP: Zero-Shot Vision-Language Navigation with Hybrid Long-Short-Term Memory for Autonomous Valet Parking

Existing methods in Autonomous Valet Parking (AVP) typically rely on pre-built maps, which severely restricts

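Natural language processing, RAG, Image, Text, Multimodal

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

Deep learning, Transformer, Classification, Detection, Segmentation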

Seg2Grasp: A Robust Modular Suction Grasping in Bin Picking

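Seg2Grasp is built to improve the performance of bin-picking robots and consists of three modules: segmentation, grasping, and class filtering. The segmentation module uses a Transformer-based obje

Use: Improving a bin-picking robot's ability to pick up objects
Difficulty: Hard
Cost: Low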

Predictive Training with Latent Imagination for Visual Quadruped Navigation

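Proposes a predictive inference method for quadruped robot navigation. The robot selects actions from its current observation and short-term memory, but this approach is limited because it cannot anticipate how obstacles will evolve. This challenge

Use: Robot navigation
Difficulty: Hard
Cost: High

Sensor / time series, Natural language processing, Embedding / retrieval, Image, Multimodal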

COLIP-2: Olfaction-Vision-Language Embeddings

The Contrastive Olfaction-Language-Image Pre-training 2 (COLIP-2) model is a multimodal embeddings space that

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

GeoWorldAD: Geometry World Action Model for Autonomous Driving

Autonomous driving requires both safe and efficient planning decisions in dynamic 3D environments. Although re

Deep learning, Transformer, Image, Video, 3D

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

AlayaWorld: Interactive Long-Horizon World Modeling -- Full Technical Report

Unlike conventional video game development, which relies on labor-intensive pipelines for asset production, an

Use: Generation
Difficulty: Easy
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-07-20

SciForma: Structure-Faithful Generation of Scientific Diagrams

Structural fidelity is essential to scientific methodology diagrams. To communicate research logic, these diag

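Quality prediction / anomaly detection, Natural language processing, Large language models, Generation, Image, Text

Use: Generation
Difficulty: Easy
Cost: High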

HOMIE: Human-object Centric Video Personalization via Multimodal Intelligent Enchancement

Human-object centric video personalization (HOCVP) is a core task within subject-driven video generation. Howe

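Use: Generation
Difficulty: Easy
Cost: High

Quality prediction / anomaly detection, Natural language processing, Large language models, Detection, Generation, Segmentation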

FlowMimic: Mask-free Visual Editing and Generation with Pixel-pair Warped Flow Field for Online Video Editing Data Generation and Modality Mimicry

In line with the prevailing direction of vision research, we explore the integration of both generation and ed

Use: Detection
Difficulty: Easy
Cost: High

DiFA: Inference-Time Forward-Process Alignment for Diffusion Models

The prevailing inference framework for diffusion models formulates generation fundamentally as a problem of nu

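Computer vision, Image classification, Generation, Image

Use: Generation
Difficulty: Easy
Cost: High

Tabular-oriented, Quality prediction / anomaly detection, Natural language processing, RAG, Regression, Image, Text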

Econometrics with Pre-Trained Embeddings for Unstructured Data

Unstructured data, such as images and text, are increasingly used in empirical economics. Since training machi

Use: Regression
Difficulty: Hard
Cost: High

Expressivity of Shallow Neural Networks Over Finite Fields

We study the expressivity of shallow polynomial neural networks (PNNs) with monomial activation functions over

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Medium

Sensor / time series, Computer vision, Segmentation, Detection, Image, 3D

DeeperRadar: End-to-End MIMO Radar Design and Multi-Modal Fusion for Autonomous Vehicle Perception

DeeperRadar is a radar-centric, sensor-stack-conditioned framework that co-designs radar sensing and multi-mod

Use: Detection
Difficulty: Hard
Cost: High

Multi-Resolution Voxelized Map-Based Stereo Visual-Inertial Odometry

Incorporating prior maps significantly enhances the accuracy and robustness of pose estimation in visual-inert

Computer vision, 3D / point cloud, Image, 3D

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

VIDAR: Visual-Inertial Dense Alignment and Reconstruction via a Geometric Foundation Model

Monocular foundation models provide dense geometry but usually lack a stable metric scale. This paper presents

Reinforcement learning, Image

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Medium

Sensor / time series, Deep learning, RNN / LSTM, Image, 3D

DROID-ANCHOR: Odometry-Anchored Recurrent Metric Depth Estimation

Precise metric depth estimation is fundamental for autonomous robot navigation, yet monocular systems inherent

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-19

HarmoHOI: Harmonizing Appearance and 3D Motion for Multi-view Hand-Object Interaction Synthesis

Hand-Object Interaction (HOI) synthesis is a cornerstone for animation production and embodied AI. Despite the

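Quality prediction / anomaly detection, Deep learning, Transformer, Generation, Image, Video

Use: Generation
Difficulty: Easy
Cost: High

Tabular-oriented, Quality prediction / anomaly detection, Computer vision, Segmentation, Generation, Image, Tabular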

Semi-Supervised Conditional Diffusion via Label Augmentation

Conditional diffusion models have become a powerful and flexible framework for learning complex conditional di

Use: Generation
Difficulty: Hard
Cost: High

Explainable, Reinforcement learning, Policy gradient (PPO / A3C), Image

arxiv | GitHub available | 2026-07-18

SinD 2.0: A Multi-City UAV Dataset with Semantic Risk Annotations for SOTIF-Oriented Safety Validation at Signalized Intersections

Safety validation at signalized intersections remains a critical bottleneck for the deployment of autonomous d

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Medium

Explainable | Sensor/Time Series | Computer Vision | Multimodal | Generation | Image

What Do They See? Interpreting Complex Road Scenarios Through the Eyes of Vision-Language-Action Models for Safe and Trustworthy Autonomous Vehicle Learning

End-to-end autonomous driving models are now able to navigate complex road scenarios, mapping raw sensor obser

Use: Generation
Difficulty: Hard
Cost: High

GLidE-SLAM: GL-Accelerated Indirect-Direct Embedded SLAM

With the growing demand for robotics, autonomous drones, and wearable extended reality systems, the deployment

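Easy to Try on CPU | Natural Language Processing | RAG | Image

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Low

Sensor/Time Series | Computer Vision | Object Detection | Detection | Image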

Hybrid Machine Learning for Articulation Angle Estimation of Truck-Semitrailer Combinations

Accurate articulation angle estimation of trucks with trailers is critical for autonomous driving and advanced

Use: Detection
Difficulty: Hard
Cost: Medium

An Indoor Navigation System for the Visually Impaired based on UWB Positioning and D* Lite Path Planning Algorithm

This paper proposes an indoor navigation system for the visually impaired, leveraging Ultra-Wideband (UWB) pos

Natural Language Processing | RAG | Detection | Image

Use: Detection
Difficulty: Hard
Cost: Low

Computer Vision | Multimodal | Detection | Image | Text

Autonomous VR-Based Risk Detection for Situational Awareness in Dangerous Settings

In high-risk environments such as disaster response, situational awareness depends not only on detecting hazar

Use: Detection
Difficulty: Hard
Cost: High

huggingface | GitHub available | Hugging Face available | 2026-07-18

Dataset Distillation by Influence Matching

We revisit dataset distillation from an outcome-centric perspective. Rather than aligning process surrogates (

Deep Learning | Model Compression/Quantization | Classification | Image | Text

Use: Classification
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-18

DataFlow-Harness: A Grounded Code-Agent Platform for Constructing Editable LLM Data Pipelines

Large language models (LLMs) are increasingly used to automate data-processing workflows, yet coding agents ty

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-18

Can Multimodal Large Language Models Understand OCT?

Optical coherence tomography (OCT) imaging is essential for the diagnosis and treatment of retinal diseases. A

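Quality Prediction/Anomaly Detection | Natural Language Processing | Large Language Models | Classification | QA | Image

Use: Classification
Difficulty: Easy
Cost: High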

Constrained Hebbian Learning Supports Efficient Representational Allocation under Structural Constraints

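Proposes a method for analyzing the connections between neurons in the brain. The method analyzes inter-neuron connections while taking the structure of neural transmission into account.

Deep Learning | Transformer | Classification | Image | Audio

Use: Analysis of neural transmission
Difficulty: Hard
Cost: Low

Quality Prediction/Anomaly Detection | Computer Vision | Multimodal | Image | Reinforcement Learning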

Foresight Residual RL for Long-Horizon Robot Manipulation with Vision-Language-Action Models

Vision-Language-Action (VLA) policies offer strong general-purpose manipulation priors, but often fail on tigh

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

VTLoc: Learning-based Tactile Contact Localization in Visual Point Clouds

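The VTLoc framework integrates visual and tactile information to estimate the position of a robot hand, supporting robot-hand localization and manipulation.

Computer Vision | 3D/Point Cloud | Detection | Image | Text

Use: Robot hand position estimation
Difficulty: Hard
Cost: High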

PIXIE: A Zero-Shot texture-invariant 6D pose estimation framework for unseen objects with assembly defects

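The PIXIE framework performs 6D object pose estimation, supporting robot-hand control and object manipulation.

Deep Learning | Transformer | Image | Text | 3D

Use: 6D object pose estimation
Difficulty: Hard
Cost: High

Suited to Small Data | Condition Optimization | Natural Language Processing | RAG | Detection | Image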

Embodied Active Learning under Limited Annotation and Navigation Budget for Object Detection

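This study proposes an object detection framework that accounts for constraints on a robot's navigation time and annotation time.

Use: Adapting object detection
Difficulty: Hard
Cost: Low

Sensor/Time Series | Deep Learning | Transformer | Segmentation | Image | Multimodal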

PRISM: Multimodal Terrain Mapping for Rover Navigation in Unstructured Environments

Robotic navigation in unstructured environments requires robust situational awareness to safely traverse hazar

Use: Segmentation
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-17

An Exam for Active Observers

Human vision is a closed loop: gaze is continuously redirected by intermediate hypotheses rather than a single

Use: Technical validation / paper-reading aid
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-17

S1-Omni: A Unified Multimodal Reasoning Model for Scientific Understanding, Prediction, and Generation

We present S1-Omni, a unified multimodal reasoning model for scientific understanding, prediction, and generat

Suited to MI | Natural Language Processing | Large Language Models | Generation | Image | Text

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-17

Audio-Visual Flamingo: Open Audio-Visual Intelligence for Long and Complex Videos

We present Audio-Visual Flamingo (AV-Flamingo), a fully open state-of-the-art audio-visual large language mode

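Explainable | Natural Language Processing | Large Language Models | Image | Text | Audio

Use: Technical validation / paper-reading aid
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-16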

cGAP: Generalized Association Plots with HOMALS-Guided Heatmaps for Visualization of High-Dimensional Categorical Data

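To visualize high-dimensional categorical data, this work uses homogeneity analysis by means of alternating least squares (HOMALS), with association tables that aid visualization

Explainable | Natural Language Processing | Fine-tuning | Classification | Image

Use: Visualization of high-dimensional categorical data
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-07-16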

Optimal Self-Distillation for Rectified Flow via Linear Probing

Modern generative models are increasingly trained using model-generated signals, creating both opportunities f

Deep Learning | Model Compression/Quantization | Generation | Image

Use: Model improvement
Difficulty: Hard
Cost: Medium

huggingface | Hugging Face available | 2026-07-16

Trajectory-aware Cross-view Geo-localization with Sequential Observations

Cross-view geo-localization matches ground-level observations against geo-tagged satellite imagery. Recent met

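Quality Prediction/Anomaly Detection | Deep Learning | Model Compression/Quantization | Detection | Image | Text

Use: Detection
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-16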

RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources

Skills are a useful abstraction for software agents, turning human and agent experience into reusable procedur

Natural Language Processing | RAG | Image | Text | Video

Use: Technical validation / paper-reading aid
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-16

pytorch-image-models — The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

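The largest collection of image encoders and backbones for PyTorch. It includes training, evaluation, inference, and export scripts, along with pretrained weights.

Deep Learning | Transformer | Classification | Image

Use: Image encoders and backbones for PyTorch
Difficulty: Easy
Cost: High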

PiVoT: A Variational Solution for Real-time Large-scale Multi-object Detection and Tracking under Heavy Clutter

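Develops PiVoT, which detects and tracks multiple objects in difficult environments, and presents it as a practical solution.

Deep Learning | Model Compression/Quantization | Detection | Image | 3D

Use: Multi-object detection and tracking
Difficulty: Hard
Cost: High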

Heavy-Tailed Flow Matching via Random Clocks

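Proposes Random Clocks to address the shortcomings of standard diffusion and flow matching models in the heavy-tailed setting.

Quality Prediction/Anomaly Detection | Natural Language Processing | RAG | Image

Use: Heavy-tailed flow matching
Difficulty: Hard
Cost: High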

Evaluating Encoding Strategies for Closed-Loop Classification in Biological Neural Networks

Interfacing with Biological Neural Networks (BNNs) requires encoding information into stimulation patterns tha

Computer Vision | Video Recognition | Classification | Image

Use: Classification
Difficulty: Hard
Cost: High

Visual Place Recognition Using Rate-Encoded Spiking Neural Networks with Discrete STDP Learning

Spiking Neural Networks (SNNs) trained through unsupervised Spike-Timing-Dependent Plasticity (STDP) have been

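Deep Learning | Model Compression/Quantization | Classification | Image | Unsupervised

Use: Classification
Difficulty: Hard
Cost: Low

Computer Vision | Segmentation | Image | Text | Multimodal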

Generalizable VLA Finetuning via Representation Anchoring and Language-Action Alignment

Finetuning a pretrained vision-language model (VLM) on robot demonstrations via behavior cloning (BC) has beco

Use: Technical validation / paper-reading aid
Difficulty: Easy
Cost: High

Computer Vision | Segmentation | Image | Text | Video

Open-AoE: An Open Egocentric Manipulation Dataset and Toolchain for Embodied Learning

Egocentric videos of human manipulation provide scalable supervision for embodied intelligence, yet existing r

Use: Segmentation
Difficulty: Easy
Cost: High

DiffGI: Differentiable Geometry Images for High-Fidelity Thin-Shell 3D Generation

Existing 3D generative models predominantly rely on implicit volumetric representations, which enforce waterti

Deep Learning | Transformer | Generation | Image | 3D

Use: Generation
Difficulty: Easy
Cost: High

Cura 1T: Specialized Model for Agentic Healthcare

Healthcare spans high-stakes communication, expert reasoning, and workflow execution, yet specialized LLMs tha

Use: Technical validation / paper-reading aid
Difficulty: Easy
Cost: High

VideoRAE: Taming Video Foundation Models for Generative Modeling via Representation Autoencoders

Video generative models commonly rely on latent spaces learned by 3D Variational Autoencoders (3D-VAEs). Howev

Use: Generation
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-14

LatentFlow: A General Framework for Conditioning Stochastic Processes

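Presents a new framework for cases where incorporating observations into a stochastic process is difficult, extending the view that one is simply learning what can be observed.

Easy to Try on CPU | Computer Vision | Segmentation | Image

Use: Conditioning stochastic processes
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-07-14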

ANGLE: Angular Neural Generative Learning via Engression

Circular data, representing angles or directions, are frequently encountered in computer vision, biology, geol

Deep Learning | Model Compression/Quantization | Generation | Regression | Image

Use: Generation
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-14

Color Pass-Through via Camera-Display Coupling

When a real-world scene is captured by a smartphone camera and viewed on its screen, the displayed image often

Use: Technical validation / paper-reading aid
Difficulty: Easy
Cost: Low

huggingface | Hugging Face available | 2026-07-14

ReflectWorld-MM: An Entity-Oriented Multimodal Memory System for Open-Ended Video Streams

Building assistants that can continually watch the world, remember what they see, and reason over their accumu

Computer Vision | Multimodal | Image | Text | Audio

Use: Technical validation / paper-reading aid
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-14

OpenRLHF — An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Async RL)

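OpenRLHF is a reinforcement learning framework built on Ray. It supports a range of reinforcement learning algorithms, including PPO, DAPO, and REINFORCE++.

Use: Reinforcement learning framework
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-14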

LakonLab — Official implementation of AsymFlow, pi-Flow, GMFlow

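LakonLab is an open-source project providing the official implementation of AsymFlow, pi-Flow, and GMFlow.

Deep Learning | Model Compression/Quantization | Generation | Image | Text

Use: Implementation of AsymFlow, pi-Flow, and GMFlow
Difficulty: Easy
Cost: Medium

arxiv | Paper only | 2026-07-13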

Difference-Driven Gating: Adaptive Feature Fusion for U-Net Decoder

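This study proposes a new feature fusion method for the U-Net decoder. By taking into account the relationship between the features arriving from above and those arriving from below, it fuses them efficiently, and is described as allowing three-dimensional data to be represented compactly in two-dimensional form.

Sensor/Time Series | Computer Vision | Segmentation | Image | Audio

Use: Feature fusion
Difficulty: Hard
Cost: Medium

huggingface | Hugging Face available | 2026-07-13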

See like a Robot: Robot-Centric Pointmaps for Vision-Language-Action Models

Vision-language-action (VLA) models predict robot actions from visual observations and language instructions.

Computer Vision | 3D/Point Cloud | Image | 3D | Multimodal

Use: Technical validation / paper-reading aid
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-13

UniPic — Open-source SOTA multi-image editing model

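UniPic is an open-source implementation of a state-of-the-art multi-image editing model.

Computer Vision | Multimodal | Generation | Image

Use: Implementation of a multi-image editing model
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-10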

Foveation-Guided Dynamic Token Selection for Robust and Efficient Vision Transformers

The human visual system (HVS) employs foveated sampling and eye movements to achieve efficient perception, con

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Low

huggingface | Hugging Face available | 2026-07-10

OpenLongTail: Generative Scaling of Long-Tail Driving Data

Scaling robust driving policies is fundamentally bottlenecked by the scarcity of edge cases in curated dataset

Natural Language Processing | RAG | Generation | Image | Video

Use: Generation
Difficulty: Easy
Cost: High

huggingface | Hugging Face available | 2026-07-10

REBASE: Reference-Background Subspace Elimination for Training-Free In-Context Segmentation

Training-free in-context segmentation enables new object categories to be introduced at inference time from a

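Quality Prediction/Anomaly Detection | Natural Language Processing | Prompt Engineering | Detection | Segmentation | Image

Use: Detection
Difficulty: Easy
Cost: High

Deep Learning | Transformer | Classification | Detection | Segmentation

github | GitHub available | 2026-07-10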

pytorch-grad-cam — Advanced AI Explainability for computer vision. Support for CNNs, Vision Transformers, Classification, Object detection, Segmentation, Image similarity and more.

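This library is an advanced AI explainability and visualization solution for computer vision. It covers CNNs, Vision Transformers, classification, object detection, segmentation, image similarity, and a variety of other computer vision

Use: AI explainability and visualization
Difficulty: Easy
Cost: Low

arxiv | Paper only | 2026-07-08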

Social-spatial dependencies for learning visual navigation

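Proposes Social-spatial dependencies, a new framework for predicting social behavior that improves individual agents' ability to learn social signals.

Quality Prediction/Anomaly Detection | Deep Learning | Transformer | Image | Text

Use: Prediction of social behavior
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-07-08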

Size independence of consistency index for pairwise comparison matrices in analytic hierarchy process

Pairwise comparisons are fundamental in the analytic hierarchy process. Various consistency indices have been

Use: In AHP, …
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-07-07

Do You Remember? Toward Memory-Centric Multimodal AI

Human memory is reconstructive, not a faithful recording. Current multimodal LLMs (MLLMs) lack this capability

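Quality Prediction/Anomaly Detection | Deep Learning | Model Compression/Quantization | Image | Text | 3D

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

huggingface | Hugging Face available | 2026-07-07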

UI2App: Benchmarking Visual Interaction Inference in Executable Web Application Generation

Large language models (LLMs) have demonstrated growing competence in web page generation. However, existing te

Use: Generation
Difficulty: Easy
Cost: High

github | GitHub available | 2026-07-07

VLM-R1 — Solve Visual Understanding with Reinforced VLMs

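Proposes VLM-R1, a reinforced vision-language model (VLM) aimed at stronger visual understanding. The model is designed to improve how well images are understood.

Natural Language Processing | Large Language Models | Image | Multimodal

Use: Solving visual understanding problems
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-06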

An event-driven framework for fly-inspired visual motion detection

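Proposes a new fly-inspired approach to obstacle detection that combines event-based sensing with biological inspiration. This approach uses event-based sensing and biological inspiration to

Explainable | Sensor/Time Series | Deep Learning | Transformer | Detection | Image

Use: Obstacle detection in dynamic environments using event-based sensing and fly-inspired motion detection
Difficulty: Hard
Cost: High

Quality Prediction/Anomaly Detection | Computer Vision | 3D/Point Cloud | Generation | Image | 3D

github | GitHub available | 2026-07-06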

Magic123 — [ICLR'24] Official PyTorch Implementation of Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors

Magic123 takes a single input image and generates a high-quality 3D object using both 2D and 3D diffusion priors.

Use: High-quality 3D object generation
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-05

Burst Spiking Neural Networks

A central goal of current Spiking Neural Network (SNN) research is to improve their accuracy toward becoming l

Deep Learning | CNN | Image

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Medium

github | GitHub available | 2026-07-03

EEGUnity — An open source tool for large-scale EEG datasets processing

EEGUnity is an open-source tool for processing large-scale EEG datasets.

Computer Vision | Multimodal

Use: Large-scale EEG dataset processing
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-07-02

Epistemic Horizon Minority Games: When Abundance Reduces Strategic Value

Strategic value can fall when an option becomes visible. A route, signal, bet, or opportunity may be attractiv

Deep Learning | Transformer | Classification | Image

Use: Classification
Difficulty: Hard
Cost: Low

github | GitHub available | 2026-07-02

langextract — A Python library for extracting structured information from unstructured text using LLMs with precise source grounding and interactive visualization.

A Python library that uses LLMs to extract structured information from unstructured text.

Use: NLP information extraction
Difficulty: Easy
Cost: High

github | GitHub available | 2026-06-30

ComfyUI-LTXVideo — LTX-Video Support for ComfyUI

Adds LTX-Video support to ComfyUI.

Generative AI | Diffusion Models | Generation | Image | Text

Use: LTX-Video support for ComfyUI
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-06-29

Partition-Guided Distance Saliency: Bridging Decision and Objective Spaces in Many-Objective Optimization

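This paper proposes the Partition-Guided Distance Saliency (PGDS) algorithm for improving the interpretability of many-objective optimization. It is expected to help improve the interpretability of many-objective optimization

Explainable | MLOps | Pipeline Construction | Image

Use: Improving the interpretability of many-objective optimization
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-06-29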

Evolutionary Hyperparameter Optimization to Find Lightweight CNN Models for Autonomous Steering

This research investigates the optimization of Convolutional and Dense Neural Networks (CNNs and DNNs) for aut

Deep Learning | CNN | Image

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-29

Theory of Continual Learning Against Data Poisoning Attacks

Continual learning (CL), where a model is trained on a sequence of data tasks, is increasingly being adopted a

Deep Learning | Transformer | Classification | Image | Text

Use: Classification
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-28

Geometric Stability of Neural Population Codes: Regional Variation, Behavioral Relevance, and Circuit Dependence

Current models of representational reliability in neural populations focus on temporal stability: whether popu

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Medium

github | GitHub available | 2026-06-28

LanPaint — High quality training free inpaint for every stable diffusion model. Supports ComfyUI

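Provides high-quality, training-free inpainting for image generation. It works with any Stable Diffusion model and supports ComfyUI.

Quality Prediction/Anomaly Detection | Generative AI | Diffusion Models | Generation | Image | Video

Use: Image generation
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-06-26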

Heterogeneous synaptic motifs bridge microscale structure and macroscale nonlinear dynamics

Recent breakthroughs in synaptic-resolution network connectomics have revealed that brain circuits feature fin

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Medium

github | GitHub available | 2026-06-25

ml-mdm — Train high-quality text-to-image diffusion models in a data & compute efficient manner

Train high-quality text-to-image diffusion models in a data & compute efficient manner

Use: Generation
Difficulty: Easy
Cost: High

arxiv | Paper only | 2026-06-21

A Theory-grounded Hybrid Neural Network Integrating Complementary Estimation Mechanisms for Stable Visual Object Tracking

Hybrid neural networks (HNNs) that integrate artificial neural networks (ANNs) with brain-inspired neural netw

Computer Vision | Segmentation | Image

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Medium

arxiv | Paper only | 2026-06-18

Hybrid ANN-SNN Pipeline with Local Plasticity

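Aimed at neural network design, this study presents a hybrid design approach combining ANNs and SNNs

Deep Learning | CNN | Classification | Image

Use: Neural network design
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-15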

Evolution & Foundation: AI Shares Creative Control

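Proposes a new method for evaluating ideas that AI creates in collaboration with humans, improving the assessment of creativity.

Natural Language Processing | Fine-tuning | Generation | Image | 3D

Use: A new method for evaluating AI creativity
Difficulty: Hard
Cost: High

arxiv | Paper only | 2026-06-14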

AQ4SViT: An Automated Quantization Framework with Search Gating Policy for Compressing Spiking Vision Transformers

Spiking Vision Transformers (SViTs) have emerged as alternative low-power ViT models, but their large sizes hi

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-06-12

Harnessing cortical geometry, wiring, and function as inductive biases for recurrent neural networks

How the wiring and functional organization of cortex shape recurrent computation remains a central question in

Use: Technical validation / paper-reading aid
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-06-12

A Programmer's Guide to Cascaded Adaptive Combiners: Online Learning by Biologically Accurate Models of Multilayer Neuron Networks

Learning in biological multilayer neuronal networks offers insights that extend beyond the classical weighted-

Deep Learning | Model Compression/Quantization | Classification | Image

Use: Classification
Difficulty: Hard
Cost: Low

arxiv | Paper only | 2026-06-11

ReSCom: A Reconfigurable Spiking Neural Network Accelerator Using Stochastic Computing

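Spiking neural networks are energy-efficient AI models. This study implements a spiking neural network accelerator and tests its performance.

Deep Learning | RNN / LSTM | Classification | Image

Use: Implementation of a spiking neural network accelerator
Difficulty: Hard
Cost: Low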