Trajectory-Refined Distillation
On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providi
- Use case: technical validation and paper-reading aid
- Difficulty: Easy
- Cost: High
Search results for "text"
107 results
On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providi
Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding a
On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training
We present SigmaScale, a method for learning auxiliary scaling matrices S to aid truncated Singular Value Deco
Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. Howev
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation
Despite being a pivotal frontier, interactive world modeling remains underexplored in terms of the versatile c
Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research move
We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that model
Retrieval for search agents is still inherited from non-agentic information retrieval: a retriever ranks the c
Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, r
Confidence-based loss weighting is usually avoided in generative models because it accelerates errors when the
Developers increasingly use AI tools such as ChatGPT, Copilot, and Claude in everyday software workflows, but
Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream ev
Harmony is a compact symbolic layer where mathematical pitch relations, acoustic consonance, and musical conve
Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills i
Linear activation steering has gained popularity as a simple and empirically effective way to control language
Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reade
Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vis
Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputa
Object insertion aims to seamlessly composite a reference object into a specified region of a background image
Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning lo
Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term in
We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture
While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning a
Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language
Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagno
We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather tha
In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existin
Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing
Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story prog
Planning for real-world problems by language models often involves both world and user constraints, which may
Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs i
Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by under
Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery
Large language models can reproduce training data, but existing memorization evaluations mostly measure whethe
Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predo
Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the
Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Ex
Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, r
Large language models are increasingly used to simulate social media users and infer how individuals may respo
Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit
Muon improves training efficiency over Adam in large language-model training by about two times, but the local
Large language models are increasingly evaluated by other models, raising a natural question: can a model pred
Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical info
Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on
We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and
Experience internalization converts contextual experience from past interactions into reusable parametric capa
We study the personal camera roll visual question answering setting. In this setting, a conversational AI assi
System prompt optimization improves agent behavior without modifying the underlying model, yielding human-read
Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference
Learning representations of CAD models is a largely open problem. While 3D representation learning has flouris
Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and
Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forw
Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in
Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold stand
Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks
Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unre
Text-to-image models rely on text prompts as their primary interface to human intent. Prompts are encoded by a
Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge
Selection is a core operation in interactive image editing. To be practical, a user should be able to specify
Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet
Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science.
Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet
Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous
On-policy self-distillation, where a language model conditions on privileged context to supervise its own gene
High-quality pretraining data is a central ingredient in modern language models, but German-language resources
Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize infor
Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained per
We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video
Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs
Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the re
Structured financial audit verification is difficult for language-model agents because correctness depends on
Computer-use agents extend language models from text generation to sustained interaction with files, terminals
Large language model (LLM) agents are evolving from request-response assistants into long-running software act
Graph Language Models (GLMs) have become a promising direction for adapting Large Language Models (LLMs) to gr
Training and scaling Large Language Models demand enormous computational resources, motivating both efficient
Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spe
Agentic language model systems alternate between two structurally distinct step types: structured tool calls (
Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, off
Financial AI agents often fail for a simple reason: they make users carry the complexity. A user must repeated
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existin
Agentic LLMs with web search change the threat model for text anonymization: weak contextual cues can become c
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, i
On-policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more s
Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. Ho
Large language models are increasingly deployed as coding agents, shifting safety from individual responses to
The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the
Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision-language models. Howe
Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substanti
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genui
Large Language Models exhibit paradoxical fragility in fundamental arithmetic, implying a disconnect between i
Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the
Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (
Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relatio
Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajector
Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction hist
We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision b
Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all ex
Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both para
Efficient inference is critical for long-context language models, where attention computation and KV-cache acc
We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework tha
Customizing an LLM judge to a specific task or domain often involves optimizing its prompt across multiple eva
Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce nume
Speech-based large language models are typically constrained to spoken replies, which limits their user-facing
Diffusion-based image editing has achieved strong visual fidelity under natural language instructions, yet mos