Trajectory-Refined Distillation
On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providi
- Use case: Technical verification and paper-reading aid
- Difficulty: Easy
- Cost: High
Search results for "LLM"
81 results
On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providi
Existing scientific relation extraction benchmarks mainly target domains such as computer science, where entit
We present SigmaScale, a method for learning auxiliary scaling matrices S to aid truncated Singular Value Deco
Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. Howev
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation
Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research move
LLM-driven software engineering agents have become a central testbed for real-world language-model capability,
Retrieval for search agents is still inherited from non-agentic information retrieval: a retriever ranks the c
Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, r
Developers increasingly use AI tools such as ChatGPT, Copilot, and Claude in everyday software workflows, but
Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream ev
Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills i
Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reade
Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputa
Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning lo
Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely ove
We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture
While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning a
Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language
In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existin
Planning for real-world problems by language models often involves both world and user constraints, which may
Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs i
Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by under
Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery
Large language models can reproduce training data, but existing memorization evaluations mostly measure whethe
Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predo
Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the
Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Ex
Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, r
Large language models are increasingly used to simulate social media users and infer how individuals may respo
Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit
A situated query like "where is Lin Wei?" often encodes more than its literal content: the user may also want
AI research often requires decisions before future evidence exists: which bottleneck to attack, which directio
Large language models are increasingly evaluated by other models, raising a natural question: can a model pred
Experience internalization converts contextual experience from past interactions into reusable parametric capa
Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rub
Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scal
Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, runn
Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific
Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold stand
Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks
Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unre
LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indica
Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning
Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge
While household robots are often evaluated based on task completion, everyday domestic environments involve va
Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet
Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science.
Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous
Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize infor
Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained per
Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs
Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the re
Structured financial audit verification is difficult for language-model agents because correctness depends on
Computer-use agents extend language models from text generation to sustained interaction with files, terminals
Large language model (LLM) agents are evolving from request-response assistants into long-running software act
Graph Language Models (GLMs) have become a promising direction for adapting Large Language Models (LLMs) to gr
Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spe
LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execut
Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, off
Financial AI agents often fail for a simple reason: they make users carry the complexity. A user must repeated
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existin
Agentic LLMs with web search change the threat model for text anonymization: weak contextual cues can become c
Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answe
Large language models are increasingly deployed as coding agents, shifting safety from individual responses to
The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the
Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision-language models. Howe
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genui
We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predic
Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is alwa
Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the
AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capabilit
Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly impo
Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajector
Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction hist
Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale th
We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework tha
LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However,
Customizing an LLM judge to a specific task or domain often involves optimizing its prompt across multiple eva
Speech-based large language models are typically constrained to spoken replies, which limits their user-facing