SWE-Explore: Benchmarking How Coding Agents Explore Repositories
Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding a
Search results for "Agent"
61 results
Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this
Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demandi
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation
Despite being a pivotal frontier, interactive world modeling remains underexplored in terms of the versatile c
LLM-driven software engineering agents have become a central testbed for real-world language-model capability,
Retrieval for search agents is still inherited from non-agentic information retrieval: a retriever ranks the c
Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills i
Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of ineffi
Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputa
Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning lo
Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely ove
Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term in
While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning a
Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story prog
Planning for real-world problems by language models often involves both world and user constraints, which may
Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery
Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reus
Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit
A situated query like "where is Lin Wei?" often encodes more than its literal content: the user may also want
AI research often requires decisions before future evidence exists: which bottleneck to attack, which directio
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not transla
Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on
Experience internalization converts contextual experience from past interactions into reusable parametric capa
We study the personal camera roll visual question answering setting. In this setting, a conversational AI assi
System prompt optimization improves agent behavior without modifying the underlying model, yielding human-read
Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and
Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rub
Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scal
Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, runn
Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing
Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific
Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks
Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning
Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge
Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science.
Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous
Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize infor
Structured financial audit verification is difficult for language-model agents because correctness depends on
Computer-use agents extend language models from text generation to sustained interaction with files, terminals
Large language model (LLM) agents are evolving from request-response assistants into long-running software act
Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spe
LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execut
Agentic language model systems alternate between two structurally distinct step types: structured tool calls (
Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, off
Financial AI agents often fail for a simple reason: they make users carry the complexity. A user must repeated
Agentic LLMs with web search change the threat model for text anonymization: weak contextual cues can become c
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, i
Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answe
Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. Ho
How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without c
Large language models are increasingly deployed as coding agents, shifting safety from individual responses to
Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substanti
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genui
Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is alwa
Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relatio
AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capabilit
Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly impo
Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajector
Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are writ
LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However,