Echo-Memory: A Controlled Study of Memory in Action World Models
This study designs and evaluates an episodic memory model for controlling episodic memory. The model can retain important information within an episode and identify correlations across episodes.
- Use case: Episodic memory
- Difficulty: Hard
- Cost: High
Search results for "image": 237 results
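This study assumes that neurons within a brain region share the same response profile, infers the response profiles of neurons in neighboring brain regions, and identifies connections between areas.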
In this study, retrieval in refusal learning
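Content moderation systems that run on large language models (LLMs) play an important role in preventing harmful online content. However, the primary goal of these systems is merely to operate on tokenized text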
Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vis
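Proposes automated analysis of SMFS data, using a model to analyze imbalanced SMFS data.
This study uses Surrogate-based Analysis of Interactions via Local Effect Smooths (SAILS) to detect interactions between structures and estimate functional interactions
This study establishes a benchmark for zero-shot semantic re-identification and automates the semantic identification of images.
This study builds an omnimodal retrieval system that integrates data from different modalities such as text, images, video, and audio.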
Multimodal federated graph learning (MM-FGL) aims to collaboratively learn from decentralized graphs with text
This paper proposes Orange Lab, a visual programming framework for data mining. It provides a web-based data analysis environment and, as a user-facing analysis tool, data
In this paper, VideoQA's excessive credibility
Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed fo
Convolutions have successfully transitioned from image processing to the complex realm of non-Euclidean higher
We propose the data augmented bootstrap (DAB), a framework for constructing confidence intervals from approxim
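To enable language models to be applied to lifetime risk, this work proposes a new approach based on the Cox proportional hazards model.
This paper proposes a method that jointly models motion and manipulation in the robot's visual scene to improve control in robotic surgery.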
Recent Anomaly Detection methods achieve perfect detection and segmentation scores on well-established dataset
Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and op
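Designs an architecture for verifying generated text to support LLM-assisted drafting of clinical research papers. This prevents fabricated citations, inaccurately reported numbers, and guideline violations.
AI mini-dramas (or fruit dramas) are short, algorithmic, and distributed generative-AI video series that have recently spread on social media platforms. The visual depictions in these videos present fruit that appears sexualized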
Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multi
Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning t
Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily
WeaveBench is proposed for evaluating multimodal agents; it assesses hybrid-interface capabilities.
Context-Aware Deep Learning is proposed for non-destructive testing of materials; it detects defects in airlocks.
We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a we
Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise
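Proposes a method for identifying animals using images of different animal species taken from surrounding viewpoints.
Proposes FF-JEPA, which uses a world model and latent states to perform long-horizon planning.
A new framework for the inverse design of acoustic metamaterials that accounts for variable bandwidth, the Physics-Guided Sequence-Based Generative Framework for Acoustic Metama
Proposes EgoTactile, a model that estimates hand pressure from egocentric video.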
In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation wi
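Uses egocentric vision to predict when pedestrians will cross the road. By formulating the task as a closed-ended visual question answering (VQA) problem, vision-language models are used
Vision-language models (VLMs) have highly effective capabilities for privacy protection. However, the privacy risks of handling visual data have so far received little attention. Using VLMs to ensure privacy protection
The privacy risks of large language models have already been studied, but those of multimodal large language models (MLLMs) have not yet been sufficiently investigated. In MLLMs, not only text but also image data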
Recent advances in Video Large Language Models (Video-LLMs) have enabled performance on long-video understandi
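Vision-and-Language navigation agents can explore environments by following language instructions. For zero-shot Vision-and-Language navigation agents, safety in unknown environments and
Introduces Baichuan-M4, a clinical-grade medical LLM suited to continuous care. Baichuan-M4, a clinical medical agent system, is built on an integrated medical agent system, with medical agents and medical-agent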
Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making i
Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning ov
Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but
Extracting building polygon contours from high-resolution remote sensing images is a fundamental task for vari
Semiconductor lithography inspection requires reliable detection of small pattern defects such as bridge, burr
Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluat
Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer acc
In crisis management, communication and geography
Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China.
We present TruthSplit, an interactive system for multi-perspective argument analysis. Existing argumentation t
Understanding and reasoning over abstract visual content remains a challenge for current multi-modal large lan
Multimodal affective analysis aims to understand human sentiment and emotion by jointly modeling heterogeneous
Chinese discriminatory-language detection is challenging because harmful intent is often implicit and context-
The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought
We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of
Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable r
Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit poi
Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive en
Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and ef
The state-of-the-art generative models, such as CycleGAN, Pix2Pix, and diffusion models have demonstrated rema
We describe our system for the SoccerNet 2026 Player-Centric Ball-Action Spotting Challenge, which requires pr
Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR).
With the advancement of visual sensing systems, computer vision is playing an increasingly important role in a
Conventional one-hot encodings often yield poorly calibrated models, being overconfident under attack, and let
Self-supervised data curation provides a pathway to scaling and improving the generalization capabilities of m
Video world models have made rapid progress in generating controllable visual experiences, but most of them st
Modern object detectors achieve strong performance on standard benchmarks, yet their robustness to contextual
Optical Music Recognition (OMR) has seen major progress in model design, with end-to-end methods now capable o
Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critic
Generalized Few-Shot Semantic Segmentation (GFSS) has traditionally been approached as a representation-learni
Biochemical recurrence (BCR) after radical prostatectomy is a critical endpoint in prostate cancer, yet risk s
Advances in handwritten text recognition have enabled large-scale transcription of historical documents, but s
The vascular network in the human body is characterized by blood vessels exhibiting drastic structural variati
Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a crit
Conventional dynamics analysis of the human body is often constrained by the need for contact force and torque
Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist
Open-domain open-vocabulary detection (ODOVD) requires detectors to generalize to both novel categories and un
Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by
The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal
Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making s
3D semantic scene generation is crucial for autonomous driving applications, yet most methods rely on complex
Two-view correspondence learning aims to distinguish true correspondences (inliers) from false ones (outliers)
Deformable image registration (DIR) is widely used in radiotherapy for dose propagation and accumulation, but
Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatme
Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding with r
Source detection in modern observational astronomy is a cornerstone for localizing and identifying stellar sou
As a bio-inspired intelligent sensor, event cameras have introduced a new paradigm in the intelligent percepti
During warhead detonation, high-density, high-speed, and mutually occluded fragments are generated. Their mech
4D generation (i.e., dynamic 3D generation) has recently emerged as a rapidly growing research fronti
Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (
Video semantic segmentation for low-altitude UAVs requires temporal consistency, yet dense optical flow introd
Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performa
While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined
Glaucoma is a leading cause of irreversible blindness worldwide, and early detection from fundus images is cri
Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them pron
Reward models are central to text-to-image post-training, but visual preference is subjective and better repre
Neural radiance field (NeRF) and 3D Gaussian splatting (3DGS) are two mainstream approaches for novel view syn
Methods based on implicit neural representations have demonstrated superior performance in Screen Content Imag
This paper introduces EPS3D, a new end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentat
Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existin
Text based configuration files for cyber-physical systems show the hierarchy of component modules well but oft
Embodied policies typically map current observations directly to actions, leaving candidate-action consequence
Reliable navigation in GPS-denied environments remains a fundamental challenge in robotics, aerospace, and aut
Reliable robotic navigation necessitates the seamless integration of accurate global localization and dense, m
Vision-language-action (VLA) policies can deviate from nominal trajectories during manipulation, even when tas
World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on t
The rapid adoption of diffusion and large-scale generative models has made it increasingly challenging to dist
Despite the success of image generation from text descriptions, it still faces challenges that are difficult t
Visual Language Models (VLMs) are known to produce hallucinated predictions that are not grounded in visual ev
The analysis of internet memes in the Nepali language is complicated by frequent code-mixing and a lack of est
Simulation plays a key role in automated robotics research supported by large language models (LLMs). However,
Diffusion and flow generative models sample by integrating a learned ODE, but high quality still requires many
Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but con
Deep learning EEG denoising architectures have scaled from tens of thousands to tens of millions of parameters
Deep learning on physiological time series is interpreted through domain-specific features -- oscillatory rhyt
Data pruning (DP), as an oft-stated strategy to alleviate heavy training burdens, reduces the volume of traini
Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs re
In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning cann
Visual world models have shown great potential in learning complex system dynamics. Recent advancements levera
Selective predictors answer on confident inputs and abstain elsewhere; deploying one safely needs a single fin
Multimodal language models are typically evaluated through external behavior: selecting the correct image-text
Temporary work-zone speed limits are communicated through visually inconsistent signage and are often missing
As autonomous systems expand from capital-intensive robotaxis to cost-sensitive logistics, sensor configuratio
We introduce Contrast Sensitive Flow (CSFlow), a weighting scheme that connects the human eye's Contrast Sensi
Image data regarding galactic morphology is expected to increase both in quantity and quality for the next for
Change detection and scene recognition techniques have been widely applied to Street View Imagery (SVI) to und
Representation alignment with pretrained vision models has recently shown strong potential for accelerating di
Document image binarization aims to separate foreground text from degraded backgrounds while preserving thin,
Existing zero-shot video editing methods rely on pre-trained diffusion models, successfully achieving spatial
Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual obs
Accurate quantification and uptake measurement in PET are critical for assessing disease progression and suppo
Deep learning has become prevalent in computational pathology pipelines that support tasks such as cancer scre
Abnormality detection is a crucial yet challenging task in medical image analysis. Distinguishing abnormalitie
"Thinking with Images" has emerged as an effective paradigm for fine-grained visual reasoning: by explicitly
Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reaso
Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-covera
To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich
Routine full-disk EUV imaging has been available only since the modern era, such as SOHO and SDO. To extend EU
The processing of gigapixel whole slide images within vision language models faces a major difficulty due to a
The rapid advancement of generative models has blurred the boundary between synthetic and real imagery, creati
While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing
Emotional Video Captioning (EVC) is a challenging task that aims to generate factually accurate and emotionall
Remote sensing applications for environmental monitoring and disaster management are frequently constrained by
Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for auto
Although video virtual try-on (VVT) has achieved significant progress, existing methods still exhibit two fund
Multimodal Large Language Models (MLLMs) face a significant inference bottleneck due to the quadratic computat
Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Mo
Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due
Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models.
Palmprint modality offers a privacy-preserving biometric solution, yet its deployment is hindered by the domai
The task of temporal answer grounding in instructional video (TAGV), which aims to locate precise video segmen
Multi-contrast brain MRI provides complementary soft-tissue characteristics that aid in the screening and diagn
Vision-language models (VLMs) pretrained on large-scale image-text pairs demonstrate strong image-level unders
Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object rela
World action models inherit the predictive capability of world models, enabling action generation to be guided
Recent progress in robot manipulation has been largely driven by learning from large-scale demonstrations. For
Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world depl
Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control p
Autonomous Underwater Vehicles (AUVs) traditionally rely on complex, heavily engineered pipelines for percepti
Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task,
Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet the
Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STE
Current image editing software often hinges on fixed filters or expert tuning, leaving a gap between amateur u
Infrared and visible image fusion aims to generate a composite image that retains significant target informati
This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via Multimo
Standard dynamic vision sensors approximate retinal processing by detecting temporal contrast changes, offerin
Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes ar
While multimodal integration significantly improves computer vision models, deploying them incurs prohibitive
Self-supervised learning (SSL) has achieved remarkable representation learning performance, but many existing
Score-based generative models have had remarkable success over the last decade in generating a diverse set of
Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality generation wi
Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-
Recent advances in Diffusion Transformers have driven rapid progress in video generation and editing, yet thes
Understanding and comparing structures in scalar fields is a central challenge in scientific visualization, wi
Feed-forward 3D reconstruction models have recently shown strong generalization across diverse scenes, yet mos
Neural fields parameterize data as functions from coordinates to values, providing a unified framework for rep
Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadrati
Composed Video Retrieval (CVR) is designed to retrieve a target video that matches a reference video modified
Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability for dense pr
Vision-language models (VLMs) enable visual recognition from semantic class descriptions, which makes them att
Manipulation understanding requires reliable relational evidence, such as contact, support, containment, motio
We are surrounded by various objects with movable, articulated parts, e.g., box, handle, door. An accurate and
Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but
We present OSMGraphCLIP, a CLIP-style geospatial representation model that learns global location embeddings f
Facial hair is a defining trait of personal identity, yet remains a critical bottleneck for digital avatars. R
In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for enablin
Vision-language-action (VLA) models increasingly condition robot policies on history, depth, or 4D features to
In-context imitation learning (ICIL) enables robots to learn new tasks from a small number of demonstrations b
Object navigation requires a robot to search for an unobserved target in an unknown environment by deciding wh
A learned world model provides a powerful physical intuition for evaluating future states. But its effectivene
Scientific machine learning is limited less by model size than by the data it is trained on. Observational dat
Variational autoencoders (VAEs) learn low-dimensional latent representations of high-dimensional data. When th
Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt
Accurate distance estimation for small drones in long-range imagery is important for tracking and situational
This paper proposes a framework for deploying VLA models on edge hardware. The method uses edge hardware to
This paper proposes Contrastive Action-conditioned Parallel Encoding (CAPE), a new framework for embodied agents to predict future actions. CAP
In 3D Multi-Object Tracking (MOT), continuously tracking people's movement requires estimating 3D human pose from 3D point-cloud data. This relies mainly on geometric information, but depending on the situation, distinguishing people
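This paper proposes QuadVerse, a simulation framework for quadruped robots. QuadVerse uses simulation that accounts for visual, physical, and dynamics gaps, and integrates the quadruped robot's experimental environment with simulation.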
Visual-language action (VLA) models enable robots to predict actions directly from observations and language i
World Action Models (WAMs) offer a promising approach to embodied intelligence, yet existing methods rely heav
Assistive robots operating under shared autonomy must balance user control with autonomous assistance. Because
Robots performing long-horizon visual manipulation observe high-dimensional images, but successful plans depen
Aquatic robots have expanded human access to underwater environments, yet many underwater spaces contain obsta
Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updat
Off-board control of mobile robots from cameras embedded in the environment offers a practical path to scalabl
Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene
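To improve place recognition, this study proposes Meridian, which integrates place recognition with position estimation.
We developed a learning system for cloth manipulation. With this system, humans can learn cloth manipulation.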
This study proposes MiTaS, which fuses tactile sensing across multiple timescales to enable learning of complex multimodal contact resources.
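Addresses the difficulty that end-to-end autonomous driving models have in balancing multi-modal maneuver generation with real-time inference, di
This repository proposes a model that aims to give image recognition models the ability to generate actions. The model uses a pretrained image recognition model to generate complex actions.
This study proposes a Distributed Conversational Framework for human-robot collaboration.
Proposes a unified vision-language-action model that improves performance on the tasks it is applied to.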
Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D sc
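This study proposes an automated pipeline for transient detection and transient-error detection on data to be acquired by the future Roman observatory. Transient detection is an especially important capability for Roman data, and to detect astronomical phenomena, rapid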
Selecting a clustering algorithm and its hyperparameters without labels is a common difficulty in engineering
Active Statistical Inference is a new framework to make precise claims about population parameters with provab
The ability to train spiking neural networks is essential for modeling biological neural networks as well as f
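Equilibrium Propagation (EP) is a framework that can be used to train energy-based models, in particular Predictive Coding Networks (PCNs). EP, in the training process,
Discusses the development of a reduction method for compressing spiking vision transformers (SVM) and the experimental results obtained with it.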
Laws and institutions shape individual outcomes through complex interactions with citizens' diverse circumstan
Real-world datasets across image and text domains are often characterized by skewed class distributions and no
We present a deep photonic neural network architecture based on ultrafast binary optical modulation from a dig
Recognizing and continuously learning novel human actions without forgetting prior classes is a requirement fo
The popularity and rapid development of Cloud Computing in recent years has led to a vast number of publicatio
Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable sp
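Proposes SpikingMoE for accelerating spiking neural networks. This framework, with SDPrompt-Guided Dynamics Expert Fusion for reducing spike communication,
This study compared visual alignment in humans and macaques. The results showed that CNNs can be used to predict the macaque visual cortex.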
This work presents E-ReCON, a 16 Kb energy and resource-efficient digital compute-in-memory (DCIM) macro based
The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy dee
Spiking neural networks (SNNs) hold promise for demonstrating superior learning and representation capabilitie
Incorporates hippocampal-entorhinal structure to learn abstract representations and a predictive world model.
Standard deep-learning pipelines usually choose the network architecture before training and keep it fixed thr
SNNs promise energy-efficient and low-latency inference, but their performance still trails that of ANNs. ANN-
The spatial and functional organization of the primate visual cortex is a fundamental problem in neuroscience.
Transformer-based Spiking Neural Networks (SNNs) integrate SNNs with global self-attention and have demonstrat
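Proposes a method for implementing spiking neural network models on FPGAs to reduce energy consumption.
Proposes a new approach for inferring human-like abstractions, enabling efficient learning of unseen tasks.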