Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path
Understanding what generative models retain from training data remains challenging, with implications for copy
- Use: Generation
- Difficulty: Easy
- Cost: High
Search results for "image": 45 results
Understanding what generative models retain from training data remains challenging, with implications for copy
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation
Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research move
In this work, we focus on extending SHARP, the popular photorealistic view synthesis method, for universal mon
Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, r
Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of ineffi
Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vis
Object insertion aims to seamlessly composite a reference object into a specified region of a background image
While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning a
Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently pr
Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagno
In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existin
Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs i
Video generation models have made impressive strides in synthesizing visually compelling content, yet their ou
Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Ex
Autonomous driving requires reasoning about how ego actions shape the evolution of the surrounding world. Howe
Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical info
The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation met
We study the personal camera roll visual question answering setting. In this setting, a conversational AI assi
Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference
Learning representations of CAD models is a largely open problem. While 3D representation learning has flouris
Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forw
Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing
Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body
Text-to-image models rely on text prompts as their primary interface to human intent. Prompts are encoded by a
While household robots are often evaluated based on task completion, everyday domestic environments involve va
Selection is a core operation in interactive image editing. To be practical, a user should be able to specify
In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-po
Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet
Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained per
We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video
Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the re
Large language model (LLM) agents are evolving from request-response assistants into long-running software act
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existin
Training accurate medical image segmentation models requires large amounts of densely annotated data, which is
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, i
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genui
Transfer learning aims to facilitate the learning of a target domain by transferring knowledge from a source d
Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question
While current multimodal models are proficient at open-ended visual editing, executing precise single-answer e
Diffusion models have emerged as the backbone of modern generative AI, powering advances in vision, language,
We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision b
Recent video-based world models have made pixel-space environments interactive at the camera level: users can
Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce nume
Diffusion-based image editing has achieved strong visual fidelity under natural language instructions, yet mos