On the Geometry of On-Policy Distillation
On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training
Search results for "generation"
59 results
On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training
Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demandi
Understanding what generative models retain from training data remains challenging, with implications for copy
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation
In this work, we focus on extending SHARP, the popular photorealistic view synthesis method, for universal mon
We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that model
LLM-driven software engineering agents have become a central testbed for real-world language-model capability,
Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, r
Confidence-based loss weighting is usually avoided in generative models because it accelerates errors when the
Developers increasingly use AI tools such as ChatGPT, Copilot, and Claude in everyday software workflows, but
We introduce StreamForce, a streaming video generation framework that enables physically grounded control thro
Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vis
Object insertion aims to seamlessly composite a reference object into a specified region of a background image
Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently pr
Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagno
Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs i
Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery
Video generation models have made impressive strides in synthesizing visually compelling content, yet their ou
Standard continuous-time generative models rely on monolithic architectures that must navigate vastly differen
Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reus
Large language models can reproduce training data, but existing memorization evaluations mostly measure whethe
Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the
Automatic Speech Recognition (ASR) has become a key technology for human–AI interaction. However, code-switch
Autonomous driving requires reasoning about how ego actions shape the evolution of the surrounding world. Howe
AI research often requires decisions before future evidence exists: which bottleneck to attack, which directio
We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and
System prompt optimization improves agent behavior without modifying the underlying model, yielding human-read
Learning representations of CAD models is a largely open problem. While 3D representation learning has flouris
We present Echo Infinity, an autoregressive (AR) framework towards real-time infinite video generation that em
Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing
Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body
Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in
Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unre
Text-to-image models rely on text prompts as their primary interface to human intent. Prompts are encoded by a
3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and mo
While household robots are often evaluated based on task completion, everyday domestic environments involve va
Selection is a core operation in interactive image editing. To be practical, a user should be able to specify
Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet
Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous
On-policy self-distillation, where a language model conditions on privileged context to supervise its own gene
Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained per
We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video
Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs
Computer-use agents extend language models from text generation to sustained interaction with files, terminals
Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spe
Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map compl
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, i
Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answe
A long-standing challenge in computational chemistry and biophysics is efficiently sampling the Boltzmann dist
The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the
Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision–language models. Howe
We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predic
Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model mor
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the
Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (
While current multimodal models are proficient at open-ended visual editing, executing precise single-answer e
Diffusion models have emerged as the backbone of modern generative AI, powering advances in vision, language,
Recent video-based world models have made pixel-space environments interactive at the camera level: users can
Speech-based large language models are typically constrained to spoken replies, which limits their user-facing