SWE-Explore: Benchmarking How Coding Agents Explore Repositories
Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding a
- Use case: Detection
- Difficulty: Easy
- Cost: Low
Search results for "detection"
11 results
Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding a
On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training
Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generate
Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of ineffi
Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge
Computer-use agents extend language models from text generation to sustained interaction with files, terminals
Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map compl
Agentic LLMs with web search change the threat model for text anonymization: weak contextual cues can become c
Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answe
Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is alwa
Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relatio