October 16, 2025: Latest Research Papers
Latest 15 Papers per Topic - October 16, 2025
Hey everyone! Here's a rundown of some of the coolest new papers from the arXiv, covering Vision Language Action, Robots, Vision Language Models, and World Models. I've tried to make it easy to understand, so let's dive in!
Vision Language Action
This area is all about getting robots to understand what they see and then do something about it: one model takes in an image and a language instruction and outputs the robot's next action. (There's a tiny illustrative sketch of that interface right after this list.) Here's what's new:
- InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy: This paper introduces a spatially guided framework for generalist robot policies. Grounding where things are in the scene steers how the model plans and executes its actions.
- LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models: This one digs into how well VLA models hold up when conditions aren't perfect, stress-testing them with perturbed inputs and task variations to find out where they actually break.
- DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning: Adds depth-aware spatial reasoning, so the model knows not just what it's looking at but how far away it is, which makes its actions more accurate.
- EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control: Pretrains one robot policy on interleaved vision, text, and action data, so a single model picks up a broad range of control skills from that mixed curriculum.
- On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations: Studies how these models cope when their different input modalities (images, language, and so on) are perturbed, with the goal of keeping behaviour reliable on messy inputs.
- USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots: Introduces a dataset and a model for general underwater robots, so VLA methods can be trained and evaluated below the surface.
- Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models: Looks at adversarial attacks that work across VLA architectures, and at defenses against them, to make robot policies harder to trick.
- EmbodiedCoder: Parameterized Embodied Mobile Manipulation via Modern Coding Model: Uses a modern coding model to parameterize mobile manipulation, so the robot's interactions with the world are expressed through generated code.
- VLA-0: Building State-of-the-Art VLAs with Zero Modification: Makes the case that you can build state-of-the-art VLAs with zero modification, getting strong results out of the existing setup rather than new architecture.
- DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving: Looks at how world models amplify the data scaling law in autonomous driving, helping driving systems get more out of additional data and handle a wider range of scenarios.
- Reflection-Based Task Adaptation for Self-Improving VLA: Proposes VLA models that adapt to a task by reflecting on their own experience, so the robot learns from its mistakes and gets better over time.
- Image Quality Assessment for Embodied AI: Looks at how to measure the quality of the images embodied AI systems consume, since degraded inputs can quietly degrade downstream decisions.
- BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models: Aims to make 3D manipulation learning more efficient by aligning the inputs and outputs of vision-language models with the 3D manipulation task.
- NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows: Trains VLA models with normalizing flows, a different way of modelling actions that's meant to make the training process work better.
- Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model: Improves spatial understanding by implicitly aligning the model's internal representations with spatial information, with the goal of making the model better at reasoning about space.
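Before we move on, here's the tiny sketch promised above: a minimal, purely illustrative vision-language-action interface in PyTorch. Every name in it (ToyVLAPolicy, the stand-in encoders, the 7-dimensional action) is hypothetical and invented for this post; real systems like the ones above use large pretrained backbones, but the input/output contract is the same idea: image plus instruction in, action out.

```python
# Minimal, illustrative VLA policy interface (hypothetical; not from any paper above).
# Assumes only PyTorch. Real VLA models use pretrained vision and language backbones.
import torch
import torch.nn as nn


class ToyVLAPolicy(nn.Module):
    """Maps (image, instruction tokens) -> continuous robot action."""

    def __init__(self, vocab_size: int = 1000, action_dim: int = 7):
        super().__init__()
        # Stand-in vision encoder: a tiny CNN producing one feature vector per image.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Stand-in language encoder: mean-pooled token embeddings.
        self.embed = nn.Embedding(vocab_size, 32)
        # Action head: fuse the two modalities and regress an action vector.
        self.head = nn.Sequential(nn.Linear(16 + 32, 64), nn.ReLU(), nn.Linear(64, action_dim))

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        v = self.vision(image)                        # (B, 16) visual features
        l = self.embed(tokens).mean(dim=1)            # (B, 32) instruction features
        return self.head(torch.cat([v, l], dim=-1))   # (B, action_dim)


policy = ToyVLAPolicy()
action = policy(torch.rand(1, 3, 96, 96), torch.randint(0, 1000, (1, 12)))
print(action.shape)  # torch.Size([1, 7])
```

That final 7-dimensional vector is the kind of thing a downstream controller would map to, say, an end-effector pose and a gripper command.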
Robot
Alright, let's talk about robots! Here's a quick look at some cool stuff happening in robotics:
- InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy: (Same as above, but important enough to mention again!) A spatially guided framework that helps robots ground where things are, making them better at planning and executing actions.
- QuaDreamer: Controllable Panoramic Video Generation for Quadruped Robots: Generates controllable panoramic video for quadruped robots, giving them a wider view of their surroundings to support navigation and interaction.
- A Modular Object Detection System for Humanoid Robots Using YOLO: A modular object-detection system for humanoid robots built around YOLO, helping them identify and interact with objects. (There's a minimal detection sketch right after this list.)
- Efficient Force and Stiffness Prediction in Robotic Produce Handling with a Piezoresistive Pressure Sensor: Uses a piezoresistive pressure sensor to predict force and stiffness while robots handle produce, so delicate items get gripped firmly enough but not crushed.
- Development of an Intuitive GUI for Non-Expert Teleoperation of Humanoid Robots: A user-friendly interface that lets non-experts teleoperate humanoid robots effectively.
- Hoecken-D Hand: A Novel Robotic Hand for Linear Parallel Pinching and Self-Adaptive Grasping: A new robotic hand that combines linear parallel pinching with self-adaptive grasping, so one mechanism can handle a wide variety of objects.
- Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision: A survey of how multimodal fusion and vision-language models are used in robot vision, covering how robots perceive and interpret the world.
- Towards Proprioception-Aware Embodied Planning for Dual-Arm Humanoid Robots: Enables dual-arm humanoid robots to plan while staying aware of their own body configuration, which makes the resulting plans easier to actually execute.
- A Novel Robot Hand with Hoeckens Linkages and Soft Phalanges for Scooping and Self-Adaptive Grasping in Environmental Constraints: Another hand design, this one built around Hoeckens linkages and soft phalanges for scooping and self-adaptive grasping when the environment constrains how an object can be reached.
- More than A Point: Capturing Uncertainty with Adaptive Affordance Heatmaps for Spatial Grounding in Robotic Tasks: Captures uncertainty in spatial grounding with adaptive affordance heatmaps, so instead of committing to a single point the robot reasons over a whole distribution of places to act.
- MODUR: A Modular Dual-reconfigurable Robot: A modular, dual-reconfigurable robot that can be rearranged for different tasks; the point is making the hardware itself adaptable.
- EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control: (Again, mentioned because it's super relevant!) Interleaved vision-text-action pretraining for general robot control, giving one model a wide range of skills.
- Tactile-Conditioned Diffusion Policy for Force-Aware Robotic Manipulation: A diffusion policy conditioned on tactile signals, so manipulation becomes force-aware: the robot can sense how hard it is pressing and adjust accordingly.
- USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots: (Yep, still relevant!) A new dataset and model for underwater robots, helping them explore and interact with the underwater world.
- ALOHA2 Robot Kitchen Application Scenario Reproduction Report: A report on reproducing the ALOHA2 robot kitchen application scenario, i.e. what it takes to get the setup performing kitchen tasks end to end.
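As promised in the YOLO item above, here's a minimal object-detection loop using the off-the-shelf ultralytics package. This is just the generic detect-and-read-boxes pattern, not the modular humanoid system from the paper; the checkpoint name is a standard pretrained model and the image path is a placeholder.

```python
# Minimal object-detection sketch with an off-the-shelf YOLO model (ultralytics package).
# Not the paper's modular humanoid system; just the generic inference pattern.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                      # small pretrained checkpoint, downloaded on first use
results = model("frame_from_robot_camera.jpg")  # placeholder path; run inference on one image

for result in results:
    for box in result.boxes:
        cls_id = int(box.cls[0])                # predicted class index
        conf = float(box.conf[0])               # detection confidence
        x1, y1, x2, y2 = box.xyxy[0].tolist()   # pixel-space bounding box corners
        print(result.names[cls_id], round(conf, 2), (x1, y1, x2, y2))
```

On a real robot the frame would come from the camera stream, and the resulting boxes would feed whatever grasping or navigation module sits downstream.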
Vision Language Model
These models are all about understanding and generating language based on what they see. Let's see what's new:
- VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models: Uses visual probing to adapt vision-language models to new video domains, helping them handle video data better.
- Generative Universal Verifier as Multimodal Meta-Reasoner: Introduces a generative model that acts as a universal verifier, checking information across different modalities as part of multimodal meta-reasoning.
- MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering: A grand challenge on micro-expressions, with spot-then-recognize and visual question answering tracks, aimed at understanding subtle human emotions better.
- Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision: (We've seen this one before!) A comprehensive survey of how vision-language models and multimodal fusion are being used to enhance robot vision.
- ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom: Improves multimodal reasoning by decoupling visual perception ("eyesight") from reasoning ("wisdom"), so each capability can be strengthened on its own.
- Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline: A new benchmark dataset and baseline for detecting multimodal misinformation on social media, to help automatically flag fake news and other problematic content.
- Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models: A unified benchmark for evaluating how well vision-language models understand spatial relationships; think of it as a standardized spatial-reasoning exam.
- DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning: (Another familiar face!) Improves vision-language-action models with depth-aware spatial reasoning, sharpening the robot's understanding of space.
- Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity: Uses natural-language prompts as class labels for zero-shot classification of everyday postures, which helps when there isn't enough labelled data to train a dedicated classifier. (See the zero-shot sketch right after this list.)
- Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models: Uses vision-language models to improve product recommendation on e-commerce platforms, so recommendations can draw on both product images and text.
- Self-Augmented Visual Contrastive Decoding: A self-augmented variant of visual contrastive decoding, aimed at getting models to generate more accurate and better-grounded outputs.
- MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models: A benchmark for evaluating whether long-context vision-language models stay faithful to extended multimodal inputs.
- Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs: Traces how information actually flows inside VideoLLMs, revealing hidden pathways and making these models more interpretable.
- Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems: Highlights a systematic blind spot across writing systems: text can be plainly visible in an image and still go unread by the model, which points to concrete areas for improvement.
- What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging: Makes VLMs negation-aware through structured reasoning and token merging, so requests like "the object that is not red" are actually handled correctly.
World Model
World models are all about creating simulations of the world to help AI learn and plan. Here's what's up:
- PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning: Uses reinforcement learning to learn physical representations for video generation, so generated videos respect physical phenomena more faithfully.
- Generative Universal Verifier as Multimodal Meta-Reasoner: (Yes, again!) A generative model used to verify information across different modalities as part of multimodal meta-reasoning.
- MTIL: Encoding Full History with Mamba for Temporal Imitation Learning: Uses Mamba to encode the full interaction history for temporal imitation learning, so the policy can draw on everything it has seen rather than just the latest observation.
- Ctrl-World: A Controllable Generative World Model for Robot Manipulation: A controllable generative world model for robot manipulation: the robot can simulate candidate manipulations against the model and plan before touching the real world. (The toy rollout sketch after this list shows the basic loop.)
- DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving: (We've seen this!) How world models amplify the data scaling law in autonomous driving, helping driving systems make better decisions as data grows.
- CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving: Combines collaborative-competitive imitation learning and reinforcement learning inside latent world models for autonomous driving.
- Deep SPI: Safe Policy Improvement via World Models: Uses world models for safe policy improvement, so policies get better without the update making behaviour unsafe along the way.
- One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration: Infers symbolic world models of stochastic environments from unguided exploration, so the agent works out the rules of its world from undirected interaction.
- DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding: Builds a diverse semantic map for 3D visual grounding, with the goal of better understanding of 3D scenes.
- Agent Learning via Early Experience: Investigates agent learning through early experience: how what an agent encounters early on shapes what it ends up learning.
- R-WoM: Retrieval-augmented World Model For Computer-use Agents: A retrieval-augmented world model for computer-use agents, letting the agent's predictions draw on retrieved information to improve performance.
- Ego-Vision World Model for Humanoid Contact Planning: An ego-vision world model that helps humanoid robots anticipate and plan contacts from their own first-person view.
- Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI: A survey of Physical AI that ties together perception, reasoning, world modeling, and interaction: a comprehensive map of the field.
- TriVLA: A Triple-System-Based Unified Vision-Language-Action Model with Episodic World Modeling for General Robot Control: A triple-system, unified vision-language-action model with episodic world modeling, aimed at general robot control.
- Game-Theoretic Risk-Shaped Reinforcement Learning for Safe Autonomous Driving: Applies game-theoretic, risk-shaped reinforcement learning to autonomous driving, with the focus on making driving policies safer.
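To make the world-model idea concrete, here's the toy latent world model flagged in the Ctrl-World item: encode an observation once, then roll the latent state forward under a candidate action sequence and score it, all without touching the real environment. Every module and dimension here is invented for illustration and doesn't come from any of the papers above.

```python
# Toy latent world model: encode an observation, roll the latent state forward under a
# sequence of candidate actions, and sum predicted rewards. Purely illustrative shapes.
import torch
import torch.nn as nn


class ToyWorldModel(nn.Module):
    def __init__(self, obs_dim: int = 32, act_dim: int = 4, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)                 # observation -> latent state
        self.dynamics = nn.Linear(latent_dim + act_dim, latent_dim)   # (state, action) -> next state
        self.reward = nn.Linear(latent_dim, 1)                        # predicted reward per state

    def imagine(self, obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        """Roll out a plan entirely in latent space and return its predicted return."""
        z = torch.tanh(self.encoder(obs))
        total = torch.zeros(obs.shape[0], 1)
        for t in range(actions.shape[1]):
            z = torch.tanh(self.dynamics(torch.cat([z, actions[:, t]], dim=-1)))
            total = total + self.reward(z)
        return total


wm = ToyWorldModel()
obs = torch.rand(1, 32)                  # current observation
candidate_plan = torch.rand(1, 5, 4)     # 5 future actions, 4 dims each
print(wm.imagine(obs, candidate_plan))   # predicted return of that plan
```

A planner would score several candidate plans this way, execute the first action of the best one, and repeat; the papers above wrap much richer video, 3D, or retrieval components around this same basic loop.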
That's all for today, folks! I hope you found this helpful. See ya next time!