Vision Transformer

| Publish Date | Title | Authors | PDF | Code |
|---|---|---|---|---|
| 2025-07-03 | MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real | Renhao Wang et al. | 2507.02864v1 | null |
| 2025-07-03 | Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory | Yuqi Wu et al. | 2507.02863v1 | null |
| 2025-07-03 | LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans | Zhening Huang et al. | 2507.02861v1 | null |
| 2025-07-03 | RefTok: Reference-Based Tokenization for Video Generation | Xiang Fan et al. | 2507.02862v1 | null |
| 2025-07-03 | Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching | Xin Zhou et al. | 2507.02860v1 | null |
| 2025-07-03 | Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation | Jiaer Xia et al. | 2507.02859v1 | null |
| 2025-07-03 | AnyI2V: Animating Any Conditional Image with Motion Control | Ziye Li et al. | 2507.02857v1 | null |
| 2025-07-03 | Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection | Ziqi Miao et al. | 2507.02844v1 | null |
| 2025-07-03 | Neutrino mixing parameters and masses from $Δ(96)\rtimes H_{CP}$ in the tri-direct CP approach | Li-Na Yan et al. | 2507.02840v1 | null |
| 2025-07-03 | USAD: An Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network | Ying Yu et al. | 2507.02827v1 | null |
| 2025-07-03 | Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach | Panpan Ji et al. | 2507.02826v1 | null |
| 2025-07-03 | DNN-Based Precoding in RIS-Aided mmWave MIMO Systems With Practical Phase Shift | Po-Heng Chou et al. | 2507.02824v1 | null |
| 2025-07-03 | LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion | Fangfu Liu et al. | 2507.02813v1 | null |
| 2025-07-03 | HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars | Gent Serifi et al. | 2507.02803v1 | null |
| 2025-07-03 | AREE-Based Decoupled Design of Hybrid Beamformers in mmWave XL-MIMO Systems | Jiazhe Li et al. | 2507.02802v1 | null |
| 2025-07-03 | Time-Masked Transformers with Lightweight Test-Time Adaptation for Neural Speech Decoding | Ebrahim Feghhi et al. | 2507.02800v1 | null |
| 2025-07-03 | No time to train! Training-Free Reference-Based Instance Segmentation | Miguel Espinosa et al. | 2507.02798v1 | null |
| 2025-07-03 | A Highly Carbon-Rich Dayside and Disequilibrium Chemistry in the Ultra-Hot Jupiter WASP-19b | Suman Saha et al. | 2507.02797v1 | null |
| 2025-07-03 | Ultrafast optical excitation of magnons in 2D antiferromagnets via spin torque exerted by photocurrent of excitons: Signatures in charge pumping and THz emission | Jalil Varela-Manjarres et al. | 2507.02793v1 | null |
| 2025-07-03 | RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation | Liheng Zhang et al. | 2507.02792v1 | null |
| 2025-07-03 | From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding | Xiangfeng Wang et al. | 2507.02790v1 | null |
| 2025-07-03 | From Pixels to Damage Severity: Estimating Earthquake Impacts Using Semantic Segmentation of Social Media Images | Danrong Zhang et al. | 2507.02781v1 | null |
| 2025-07-03 | Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs | Ken Tsui et al. | 2507.02778v1 | null |
| 2025-07-03 | Grounding Intelligence in Movement | Melanie Segado et al. | 2507.02771v1 | null |
| 2025-07-03 | Fast and Simplex: 2-Simplicial Attention in Triton | Aurko Roy et al. | 2507.02754v1 | null |
| 2025-07-03 | Partial Weakly-Supervised Oriented Object Detection | Mingxin Liu et al. | 2507.02751v1 | null |
| 2025-07-03 | Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics | Alex Colagrande et al. | 2507.02748v1 | null |
| 2025-07-03 | DexVLG: Dexterous Vision-Language-Grasp Model at Scale | Jiawei He et al. | 2507.02747v1 | null |
| 2025-07-03 | Prompt learning with bounding box constraints for medical image segmentation | Mélanie Gaillochet et al. | 2507.02743v1 | null |
| 2025-07-03 | Leveraging Transformer Models to Capture Multi-Scale Dynamics in Biomolecules by nano-GPT | Wenqi Zeng et al. | 2507.02734v1 | null |