MLinfo | 機械学習・AI論文まとめ

MLinfo|日々更新される技術をキャッチアップ/検索

「reinforcement」の検索結果

21 件

すべて arxiv github huggingface 実装あり

huggingfaceHugging Faceあり2026-06-05

On the Geometry of On-Policy Distillation

On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training

深層学習軽量化・量子化検出生成テキスト

用途: 検出
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-06-05

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this

深層学習Transformer強化学習

用途: 技術検証・論文読解補助
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-06-05

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demandi

品質予測/異常検知強化学習マルチエージェント生成

用途: 生成
難易度: Easy
コスト: Low

→

huggingfaceHugging Faceあり2026-06-04

Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation

Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by under

深層学習軽量化・量子化テキスト強化学習

用途: 技術検証・論文読解補助
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-06-03

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Large language models are increasingly evaluated by other models, raising a natural question: can a model pred

少数データ向き品質予測/異常検知深層学習軽量化・量子化テキスト強化学習

用途: 技術検証・論文読解補助
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-06-03

Reinforcement Learning from Rich Feedback with Distributional DAgger

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR)

深層学習軽量化・量子化強化学習

用途: 技術検証・論文読解補助
難易度: Easy
コスト: Medium

→

huggingfaceHugging Faceあり2026-06-03

Audio Interaction Model

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and

強化学習マルチエージェントテキスト音声

用途: 技術検証・論文読解補助
難易度: Easy
コスト: High

→

huggingfaceGitHubありHugging Faceあり2026-06-03

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rub

自然言語処理大規模言語モデル強化学習

用途: 技術検証・論文読解補助
難易度: Easy
コスト: High

→

huggingfaceGitHubありHugging Faceあり2026-06-02

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning

自然言語処理大規模言語モデル強化学習

用途: 技術検証・論文読解補助
難易度: Easy
コスト: High

→

huggingfaceGitHubありHugging Faceあり2026-06-02

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science.

深層学習軽量化・量子化テキスト強化学習

用途: 技術検証・論文読解補助
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-06-02

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiab

深層学習Transformer強化学習

用途: 技術検証・論文読解補助
難易度: Easy
コスト: Low

→

huggingfaceGitHubありHugging Faceあり2026-06-02

Self-Distilled Policy Gradient

On-policy self-distillation, where a language model conditions on privileged context to supervise its own gene

深層学習軽量化・量子化生成テキスト強化学習

用途: 生成
難易度: Easy
コスト: Medium

→

huggingfaceHugging Faceあり2026-06-02

MemTrain: Self-Supervised Context Memory Training

Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize infor

品質予測/異常検知自然言語処理大規模言語モデルテキスト自己教師強化学習

用途: 技術検証・論文読解補助
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-06-02

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained per

自然言語処理大規模言語モデル生成画像テキスト

用途: 生成
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-06-02

Large Language Models Hack Rewards, and Society

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs

自然言語処理大規模言語モデル生成テキスト強化学習

用途: 生成
難易度: Easy
コスト: High

→

huggingfaceGitHubありHugging Faceあり2026-06-02

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spe

深層学習軽量化・量子化生成テキスト強化学習

用途: 生成
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-05-30

SDR: Set-Distance Rewards for Radiology Report Generation

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision--language models. Howe

品質予測/異常検知深層学習Transformer生成テキスト強化学習

用途: 生成
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-05-29

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the

品質予測/異常検知自然言語処理大規模言語モデル生成テキスト強化学習

用途: 生成
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-05-28

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajector

品質予測/異常検知自然言語処理大規模言語モデルテキスト自己教師強化学習

用途: 技術検証・論文読解補助
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-05-28

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision b

自然言語処理ファインチューニング画像テキストマルチモーダル

用途: 技術検証・論文読解補助
難易度: Easy
コスト: High

→

huggingfaceHugging Faceあり2026-05-26

Trust Region Q Adjoint Matching

Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of op

自然言語処理RAG強化学習

用途: 技術検証・論文読解補助
難易度: Easy
コスト: High

→