Vision Transformer

Publish Date	Title	Authors	PDF	Code
2025-07-03	MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real	Renhao Wang et.al.	2507.02864v1	null
2025-07-03	Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory	Yuqi Wu et.al.	2507.02863v1	null
2025-07-03	LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans	Zhening Huang et.al.	2507.02861v1	null
2025-07-03	RefTok: Reference-Based Tokenization for Video Generation	Xiang Fan et.al.	2507.02862v1	null
2025-07-03	Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching	Xin Zhou et.al.	2507.02860v1	null
2025-07-03	Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation	Jiaer Xia et.al.	2507.02859v1	null
2025-07-03	AnyI2V: Animating Any Conditional Image with Motion Control	Ziye Li et.al.	2507.02857v1	null
2025-07-03	Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection	Ziqi Miao et.al.	2507.02844v1	null
2025-07-03	Neutrino mixing parameters and masses from $\Delta(96)\rtimes H_{CP}$ in the tri-direct CP approach	Li-Na Yan et.al.	2507.02840v1	null
2025-07-03	USAD: An Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network	Ying Yu et.al.	2507.02827v1	null
2025-07-03	Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach	Panpan Ji et.al.	2507.02826v1	null
2025-07-03	DNN-Based Precoding in RIS-Aided mmWave MIMO Systems With Practical Phase Shift	Po-Heng Chou et.al.	2507.02824v1	null
2025-07-03	LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion	Fangfu Liu et.al.	2507.02813v1	null
2025-07-03	HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars	Gent Serifi et.al.	2507.02803v1	null
2025-07-03	AREE-Based Decoupled Design of Hybrid Beamformers in mmWave XL-MIMO Systems	Jiazhe Li et.al.	2507.02802v1	null
2025-07-03	Time-Masked Transformers with Lightweight Test-Time Adaptation for Neural Speech Decoding	Ebrahim Feghhi et.al.	2507.02800v1	null
2025-07-03	No time to train! Training-Free Reference-Based Instance Segmentation	Miguel Espinosa et.al.	2507.02798v1	null
2025-07-03	A Highly Carbon-Rich Dayside and Disequilibrium Chemistry in the Ultra-Hot Jupiter WASP-19b	Suman Saha et.al.	2507.02797v1	null
2025-07-03	Ultrafast optical excitation of magnons in 2D antiferromagnets via spin torque exerted by photocurrent of excitons: Signatures in charge pumping and THz emission	Jalil Varela-Manjarres et.al.	2507.02793v1	null
2025-07-03	RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation	Liheng Zhang et.al.	2507.02792v1	null
2025-07-03	From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding	Xiangfeng Wang et.al.	2507.02790v1	null
2025-07-03	From Pixels to Damage Severity: Estimating Earthquake Impacts Using Semantic Segmentation of Social Media Images	Danrong Zhang et.al.	2507.02781v1	null
2025-07-03	Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs	Ken Tsui et.al.	2507.02778v1	null
2025-07-03	Grounding Intelligence in Movement	Melanie Segado et.al.	2507.02771v1	null
2025-07-03	Fast and Simplex: 2-Simplicial Attention in Triton	Aurko Roy et.al.	2507.02754v1	null
2025-07-03	Partial Weakly-Supervised Oriented Object Detection	Mingxin Liu et.al.	2507.02751v1	null
2025-07-03	Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics	Alex Colagrande et.al.	2507.02748v1	null
2025-07-03	DexVLG: Dexterous Vision-Language-Grasp Model at Scale	Jiawei He et.al.	2507.02747v1	null
2025-07-03	Prompt learning with bounding box constraints for medical image segmentation	Mélanie Gaillochet et.al.	2507.02743v1	null
2025-07-03	Leveraging Transformer Models to Capture Multi-Scale Dynamics in Biomolecules by nano-GPT	Wenqi Zeng et.al.	2507.02734v1	null