On the Geometry of On-Policy Distillation
On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training
Search results for "reinforcement"
21 results
On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training
Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this
Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demandi
Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by under
Large language models are increasingly evaluated by other models, raising a natural question: can a model pred
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR)
Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and
Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rub
Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning
Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science.
Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiab
On-policy self-distillation, where a language model conditions on privileged context to supervise its own gene
Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize infor
Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained per
Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs
Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spe
Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision-language models. Howe
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the
Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajector
We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision b
Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of op