MMAE: A Massive Multitask Audio Editing Benchmark
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation
- Use: Generation
- Difficulty: Easy
- Cost: High
Search results for "video"
27 results
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation
Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research move
Scientific paper recommendation is typically evaluated as static ranking over a fixed candidate set, yet real
Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, r
We introduce StreamForce, a streaming video generation framework that enables physically grounded control thro
Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently pr
Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs i
Video generation models have made impressive strides in synthesizing visually compelling content, yet their ou
Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predo
Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Ex
Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, r
We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and
World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achiev
Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference
We present Echo Infinity, an autoregressive (AR) framework towards real-time infinite video generation that em
As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability.
Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body
3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and mo
Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous
Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained per
We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video
Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the re
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existin
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, i
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genui
Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (
Recent video-based world models have made pixel-space environments interactive at the camera level: users can