AI Papers - October 23, 2025
Hey everyone! Hope you're all doing great. I've compiled a list of the latest AI papers released on October 23, 2025. This is a quick rundown of what's fresh in the world of artificial intelligence. For a much better reading experience and more detailed info, be sure to check out the GitHub page. Let's dive in!
Video Understanding
LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
This paper, dated October 21, 2025, is a submission to ACL Rolling Review (ARR). It introduces LongInsightBench, a benchmark for assessing how well omni-modal models understand human-centric long-form videos. The study analyzes model performance in detail, with an emphasis on complex, real-world scenarios, which should be useful to researchers and developers working on long-video understanding.
Think With Videos For Agentic Long-Video Understanding
Also from October 21, 2025, this paper explores the concept of utilizing video content to enhance agentic long-video understanding. The research likely focuses on the methods by which agents can leverage video data to improve their understanding and reasoning capabilities within extended video sequences. This study could be pivotal in the evolution of AI agents capable of making informed decisions and performing tasks based on long-form video input. It's an important step in making AI more intelligent and capable of complex tasks.
StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Released on October 21, 2025, this paper introduces StreamingTOM, which focuses on streaming token compression to enhance the efficiency of video understanding. This is a critical area, especially with the growing size and complexity of video data. By compressing tokens, this method can significantly reduce computational load, which would lead to faster processing and more practical applications. Efficiency is key in enabling AI systems to handle real-time video analysis tasks effectively. This paper aims to significantly improve how AI systems process video information.
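To make the general idea of token compression concrete, here is a minimal sketch in plain Python. It is not StreamingTOM's algorithm, which this rundown does not describe. It assumes a simple, hypothetical rule: drop a frame's token when it is nearly identical (by cosine similarity) to the last kept token at the same position. The function names and the 0.95 threshold are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def compress_stream(frames, threshold=0.95):
    """Yield (frame_index, position, token) only for tokens that changed.

    frames: iterable of frames, each a list of token vectors.
    A token is kept when its similarity to the last kept token at the same
    position falls below `threshold` (a hypothetical rule, for illustration).
    """
    reference = {}  # position -> last kept token vector
    for t, frame in enumerate(frames):
        for pos, token in enumerate(frame):
            prev = reference.get(pos)
            if prev is None or cosine(prev, token) < threshold:
                reference[pos] = token
                yield t, pos, token

frames = [
    [[1.0, 0.0], [0.0, 1.0]],   # frame 0: both tokens are new
    [[1.0, 0.01], [0.0, 1.0]],  # frame 1: nearly unchanged
    [[0.0, 1.0], [0.0, 1.0]],   # frame 2: token at position 0 changed
]
kept = list(compress_stream(frames))
print(len(kept), "of", sum(len(f) for f in frames), "tokens kept")
```

Because the function is a generator, frames can be processed one at a time as they arrive, which is the property a streaming setting needs.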
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
From October 20, 2025, this paper presents MT-Video-Bench, a benchmark tailored for evaluating multimodal Large Language Models (LLMs) in multi-turn dialogues. The main goal here is to assess how well these models understand and respond to video content in a conversational setting, which can help advance AI systems that interact with users through video. More detailed information is available on the project website.
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Published on October 19, 2025, and accepted to TMLR 2025, this paper introduces ActAlign, a method that uses language-guided sequence alignment for zero-shot fine-grained video classification. Because it is zero-shot, it can classify videos without task-specific training data, which simplifies the classification process and opens up new applications. A project page is available with additional details about the work.
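As a rough illustration of what aligning a video with an ordered language description can look like, the sketch below uses dynamic time warping (DTW), a standard sequence alignment technique. Whether ActAlign uses DTW specifically is an assumption here, not something this rundown confirms. The toy 2-D embeddings, class names, and function names are all hypothetical; in practice the embeddings would come from a vision-language encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dtw_cost(frames, steps):
    """Minimal cumulative (1 - cosine) cost of a monotonic alignment between
    frame embeddings and an ordered list of text-step embeddings."""
    n, m = len(frames), len(steps)
    inf = float("inf")
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1.0 - cosine(frames[i - 1], steps[j - 1])
            dp[i][j] = d + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]

# Toy video: two frames of one sub-action, then two frames of another.
video = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Each candidate class is described by an ordered list of step embeddings.
classes = {
    "class_a": [[1.0, 0.0], [0.0, 1.0]],  # same order as the video
    "class_b": [[0.0, 1.0], [1.0, 0.0]],  # same steps, reversed order
}

# Zero-shot prediction: the class whose step sequence aligns most cheaply.
best = min(classes, key=lambda name: dtw_cost(video, classes[name]))
print(best)
```

The two toy classes contain the same steps in a different order, so only an order-aware alignment can tell them apart. That is the kind of fine-grained distinction sequence alignment is meant to capture.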
Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
From October 19, 2025, this paper looks at training-free video understanding using self-supervised spatio-temporal clustering of semantic features. The main advantage is that it eliminates the necessity for extensive training, making it easier to implement across diverse video datasets. This could lead to a broader range of applications where quick and flexible video analysis is needed. This method simplifies video analysis, making it more efficient and adaptable.
StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales
Released on October 17, 2025, this paper focuses on StretchySnake. This is a method that allows for flexible State Space Model (SSM) training to improve action recognition across various spatio-temporal scales. The ability to adapt to different scales makes it especially useful in real-world scenarios, where video content can vary greatly. The approach will allow for more robust and accurate action recognition, which is key for advanced video understanding.
Symmetric Entropy-Constrained Video Coding for Machines
Also from October 17, 2025, this paper is intended for submission to the IEEE Transactions. It looks at symmetric entropy-constrained video coding for machines. The main aim is to develop coding methods that optimize compression while maintaining high quality, which is essential for video analysis and transmission tasks.
SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
Published on October 16, 2025, this paper introduces SVAG-Bench, a large-scale benchmark designed for multi-instance spatio-temporal video action grounding. It provides the data and metrics needed to measure and compare methods, making it a useful resource for researchers working on action recognition and understanding.
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
From October 16, 2025, this paper, accepted by ICCV 2025, introduces VTimeCoT. This approach uses drawing to enhance video temporal grounding and reasoning. The novelty of drawing as a method to improve video understanding is what makes it exciting. This technique is designed to boost the ability of AI systems to interpret and reason about video content. The paper shows how new techniques can revolutionize the way AI systems understand video.
Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
Published on October 15, 2025, and selected as a NeurIPS 2025 Spotlight paper, this work introduces Vgent. This method utilizes a graph-based framework for retrieval-reasoning-augmented generation to enhance the understanding of long videos. The project has a webpage with more information. The research highlights innovative methods for handling the complexity of long videos. The approach aims to significantly improve AI's ability to process and understand extended video content. The paper represents a crucial step in the field of long video understanding.
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
From October 15, 2025, this paper presents InteractiveOmni, a unified, omni-modal model specifically designed for audio-visual multi-turn dialogues. This model is all about creating AI systems that can understand and respond effectively in audio-visual interactions. This is a very important development in the field of AI, paving the way for more engaging and immersive user experiences.
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning
This paper, published on October 15, 2025, explores explainable video misinformation detection through deep reasoning. This work provides an in-depth look at how AI can identify misinformation in videos, an important topic. The focus is on using deep reasoning to improve the transparency and trustworthiness of video analysis tools. This work is essential for building robust and reliable AI systems. The paper contributes to making AI more trustworthy.
VideoLucy: Deep Memory Backtracking for Long Video Understanding
Published on October 14, 2025, and accepted as a NeurIPS 2025 paper, this work presents VideoLucy, which uses deep memory backtracking for long video understanding. Memory backtracking enables a better grasp of temporal relationships and the context within long video sequences. This study presents a notable advancement for long video understanding.
K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding
Released on October 14, 2025, this paper presents a method called K-frames for scene-driven any-k keyframe selection in long video understanding. Selecting the most important frames, or keyframes, is crucial for efficiency and accuracy. This paper offers a new approach to enhance the effectiveness of AI systems in long video analysis. The research will improve AI's ability to efficiently and accurately analyze long-form video content.
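For a feel of what scene-driven keyframe selection with an arbitrary budget k could look like, here is a small sketch. It is not the K-frames method itself, which this rundown does not detail. It assumes a hypothetical heuristic: split the video into scenes wherever neighbouring frame features differ sharply, then hand out the k picks across scenes, starting from each scene's middle frame. The function names, the L1 distance, and the threshold are illustrative assumptions.

```python
def split_scenes(features, threshold):
    """Group consecutive frame indices into scenes; a new scene starts when the
    L1 distance between neighbouring frame features exceeds `threshold`."""
    if not features:
        return []
    scenes = [[0]]
    for i in range(1, len(features)):
        dist = sum(abs(a - b) for a, b in zip(features[i - 1], features[i]))
        if dist > threshold:
            scenes.append([])
        scenes[-1].append(i)
    return scenes

def select_keyframes(features, k, threshold=1.0):
    """Pick up to k frame indices, spread across scenes (hypothetical heuristic)."""
    ordered = []
    for scene in split_scenes(features, threshold):
        mid = scene[len(scene) // 2]
        # Candidates per scene: middle frame first, then frames nearest to it.
        ordered.append(sorted(scene, key=lambda idx: abs(idx - mid)))
    picked = []
    rank = 0
    # Round-robin over scenes so every scene is represented before any repeats.
    while len(picked) < k and any(rank < len(s) for s in ordered):
        for s in ordered:
            if rank < len(s) and len(picked) < k:
                picked.append(s[rank])
        rank += 1
    return sorted(picked)

# Toy 1-D frame features with two sharp changes, i.e. three scenes.
features = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]]
print(select_keyframes(features, 3))
print(select_keyframes(features, 4))
```

The same function works for any k, which is the "any-k" property in spirit: a small budget gives one frame per scene, and a larger budget fills in more frames per scene.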
World Model
VideoVerse: How Far is Your T2V Generator from a World Model?
Published on October 21, 2025, this paper dives into the capabilities of text-to-video (T2V) generators and evaluates how well they function as world models. The paper asks a critical question about the level of sophistication of current AI models. The study provides insights into the potential of T2V generators, and it highlights where improvements are needed. The paper encourages ongoing development in the area of AI world modeling.
Program Synthesis via Test-Time Transduction
From October 21, 2025, this paper focuses on program synthesis via test-time transduction and is a NeurIPS 2025 submission. This work delves into the automated creation of computer programs and explores what test-time transduction can contribute to that process.
Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task
Also from October 21, 2025, this paper investigates how a higher embedding dimension improves the effectiveness of world models, focusing on a straightforward sorting task. The research emphasizes the significance of the embedding dimension in the performance of AI models. The findings of this research offer valuable information about the relationship between embedding dimensions and the efficacy of AI models, which can be applied to many different AI applications.
OmniNWM: Omniscient Driving Navigation World Models
From October 22, 2025, this paper presents OmniNWM, which focuses on omniscient driving navigation world models. A project website is available. The research looks at the development of AI models for autonomous navigation, which is an important step towards self-driving cars. This work could significantly advance autonomous navigation technology.
SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world models
Published on October 21, 2025, this paper introduces SAMPO, a method that uses scale-wise autoregression with motion prompts for generative world models. The method offers a new way of constructing world models with the use of scale-wise autoregression and motion prompts, which may improve their performance. The research will enhance the development of generative models.
World-in-World: World Models in a Closed-Loop World
From October 20, 2025, this paper introduces World-in-World, which looks at world models in a closed-loop setting. The study provides a closed-loop world context for testing AI models. The code is available for this project on GitHub. This work provides insight into how AI models function in closed-loop systems.
Can Image-To-Video Models Simulate Pedestrian Dynamics?
Published on October 20, 2025, this paper asks if image-to-video models can simulate pedestrian dynamics. The research investigates how well these models can simulate pedestrian movements and interactions. This work is an important step in creating more realistic and reliable AI simulations. The paper contributes to the understanding of AI's capabilities in simulating real-world events.
Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments
Also from October 20, 2025, this paper presents Morpheus, which benchmarks the physical reasoning abilities of video generative models using real physical experiments. This will aid in the development of AI models with more advanced physical reasoning capabilities and help make real-world AI applications more practical.
From Next Token Prediction to (STRIPS) World Models -- Preliminary Results
From October 20, 2025, this paper looks at the path from next-token prediction to STRIPS world models, which are used in planning and reasoning. The research may improve AI planning and reasoning capabilities. As the title indicates, the results are preliminary: the paper offers initial findings on the intersection of prediction and planning in AI models.
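For readers unfamiliar with STRIPS, the sketch below shows the classic formalism on its own: a state is a set of facts, and an action has preconditions, an add list, and a delete list. This is the standard textbook definition, not code from the paper; the block-stacking facts and the `Action` class are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset   # facts that must hold before the action
    add_effects: frozenset     # facts made true by the action
    delete_effects: frozenset  # facts made false by the action

def applicable(state, action):
    """An action is applicable when all of its preconditions hold in the state."""
    return action.preconditions <= state

def apply(state, action):
    """Successor state: remove the delete list, then add the add list."""
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable in this state")
    return (state - action.delete_effects) | action.add_effects

# Illustrative domain: pick up block "a" from the table.
state = frozenset({"on_table(a)", "clear(a)", "hand_empty"})
pickup_a = Action(
    name="pickup(a)",
    preconditions=frozenset({"on_table(a)", "clear(a)", "hand_empty"}),
    add_effects=frozenset({"holding(a)"}),
    delete_effects=frozenset({"on_table(a)", "clear(a)", "hand_empty"}),
)

next_state = apply(state, pickup_a)
print(sorted(next_state))
```

A world model in this format is just a set of such actions, so it can be handed directly to a classical planner.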
SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries
Published on October 22, 2025, this paper introduces SparseWorld. The paper focuses on a 4D occupancy world model that uses sparse and dynamic queries. This approach is designed to improve the efficiency and adaptability of AI models, especially in 4D environments. The study provides insights into creating AI models that can better handle complex and dynamic environments.
General agents contain world models
From October 20, 2025, this paper argues that general agents contain world models. The study reinforces the importance of world models in the development of general AI. This work is important in the push for more capable and versatile AI systems, emphasizing that world models are essential for AI to be more intelligent and adaptive. This study is an important step in the development of AI.
DARIL: When Imitation Learning outperforms Reinforcement Learning in Surgical Action Planning
Also from October 20, 2025, this paper introduces DARIL and has been accepted at the MICCAI 2025 workshop on COLlaborative Intelligence and Autonomy in Image-guided Surgery (COLAS). The research shows that imitation learning can be more effective than reinforcement learning in the context of surgical action planning, which could improve the development of AI in the medical field.
VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents
Published on October 19, 2025, and accepted to NeurIPS 2025, this paper presents VAGEN, which is focused on reinforcing world model reasoning for multi-turn VLM (Vision-Language Model) agents. The focus of the research is on enhancing the reasoning abilities of AI agents. The study offers insights into advancing AI agent interactions. The paper is an important step in creating more interactive AI systems.
A Comprehensive Survey on World Models for Embodied AI
Released on October 19, 2025, this paper presents a comprehensive survey on world models for embodied AI. An accompanying GitHub repository is available. The survey reviews the field in depth and is a useful resource for researchers and developers.
Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models
From October 19, 2025, this paper explores vision-centric 4D occupancy forecasting and planning via implicit residual world models. The main goal is to improve the ability of AI to comprehend and engage with dynamic 4D environments, which should help applications that operate in complex, changing scenes.
Multimodal
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Released on October 22, 2025, this paper focuses on how to achieve precise, contextual pixel understanding in Multimodal LLMs. This study makes advancements in how AI interprets visual data. The insights from this research can improve many AI applications.
AstroMMBench: A Benchmark for Evaluating Multimodal Large Language Models Capabilities in Astronomy
From October 21, 2025, this paper introduces AstroMMBench, a benchmark tailored for assessing the capabilities of multimodal Large Language Models in the context of astronomy. The benchmark facilitates an in-depth evaluation of AI models in astronomy. The work encourages innovation and improvement in the application of AI in the field of astronomy.
Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Also from October 21, 2025, and an EMNLP 2025 Main Conference Full Paper, this paper studies modality bias in multimodal intent detection. This is an important step in making AI systems more accurate and reliable in multimodal applications. The research will improve how AI systems identify user intents across various data types.
PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Released on October 21, 2025, this paper introduces PRISMM-Bench, a benchmark centered on peer-review grounded multimodal inconsistencies. This benchmark allows researchers to test and improve AI systems' ability to find inconsistencies. This work focuses on ensuring accuracy and consistency in the AI models.
CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder
Also from October 21, 2025, this work introduces CovMatch. This method uses cross-covariance to guide multimodal dataset distillation with a trainable text encoder. The research offers a unique approach to multimodal dataset distillation. The work is designed to improve the effectiveness of AI models and is a great step toward more efficient AI models.
A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography
From October 21, 2025, this paper presents a deep learning approach for white matter shape prediction in diffusion MRI tractography. The work is accepted to Human Brain Mapping. This research uses deep learning to predict the shape of white matter in the brain, improving diagnostic capabilities. The paper contributes to the application of AI in medical imaging.
Socialized Learning and Emergent Behaviors in Multi-Agent Systems based on Multimodal Large Language Models
Also from October 21, 2025, this paper explores socialized learning and emergent behaviors in multi-agent systems, based on multimodal Large Language Models. The study offers insights into the behavior of multi-agent systems. The research contributes to the development of better interactions and cooperation among AI agents.
MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Released on October 21, 2025, and since withdrawn, this paper focused on multimodal agent tuning for robust tool-use reasoning.
StarBench: A Turn-Based RPG Benchmark for Agentic Multimodal Decision-Making and Information Seeking
Also from October 21, 2025, this paper introduces StarBench, a turn-based RPG benchmark for agentic multimodal decision-making and information seeking. The benchmark gives a platform to explore agentic multimodal decision-making. The research contributes to the advancement of AI agents capable of making complex decisions.
DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training
From October 21, 2025, this paper introduces DreamPRM-1.5, and focuses on multimodal process reward model training. The approach aims to enhance AI's capability in multimodal applications. The research is a contribution towards more sophisticated AI systems.
Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model
From October 21, 2025, this paper presents Med-2E3, which focuses on a 2D-enhanced 3D medical multimodal large language model. This paper is about using AI to improve medical applications and is a step towards a better understanding of medical data.
Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle
Also from October 21, 2025, this paper looks at Shuffle-R1, an efficient RL framework for Multimodal Large Language Models built around a data-centric dynamic shuffle. The work aims to enhance the performance of multimodal LLMs.
Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding
Released on October 21, 2025, this paper focuses on mitigating multimodal hallucinations via adaptive token ensemble decoding. The paper investigates ways to improve AI systems. This will improve the accuracy of multimodal systems.
The Impact of Image Resolution on Biomedical Multimodal Large Language Models
Also from October 21, 2025, this paper examines the impact of image resolution on biomedical multimodal Large Language Models. The research gives insights into how image resolution impacts the performance of AI models. The research offers a better understanding of the factors that affect the effectiveness of AI.
Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models
From October 21, 2025, this paper discusses a proactive reasoning-with-retrieval framework for medical multimodal Large Language Models. The work is still in progress. This framework aims to enhance medical AI applications. The study may result in more advanced medical AI systems.
Multimodal LLM
Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs
From October 22, 2025, this paper looks at Merge then Realign, a simple method for modality-incremental continual learning for multimodal LLMs. The paper is set to appear in the EMNLP 2025 Main Conference. This is a very useful study for the development of multimodal systems.
DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents
Published on October 22, 2025, this paper introduces DaMo, which is a data mixing optimizer used in fine-tuning multimodal LLMs for mobile phone agents. The research could improve the performance of AI agents on mobile devices. The study provides insights into how to make AI more effective on mobile platforms.
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Also listed under Multimodal above. From October 22, 2025, this paper focuses on how to achieve precise, contextual pixel understanding in Multimodal LLMs.
Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
Also from October 22, 2025, and accepted to EMNLP 2025 Findings, this paper examines the token efficiency of visual text inputs in multimodal LLMs, that is, feeding text to the model as images rather than as text tokens. This helps us understand how to create more efficient AI systems.
Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
Released on October 21, 2025, this paper introduces a method for robust driving QA through metadata-grounded context and task-specific prompts. This work is about creating more reliable AI for self-driving cars. The goal is to improve the accuracy of question-answering systems in autonomous driving scenarios.
See the Text: From Tokenization to Visual Reading
From October 21, 2025, this paper looks at the process from Tokenization to Visual Reading. The research is focused on improving how AI systems understand text. This work contributes to advancements in AI's ability to read and comprehend text.
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes
Published on October 21, 2025, and a NeurIPS 2025 (Datasets and Benchmarks) submission, this paper introduces a holistic benchmark for multi-level visual grounding in 3D scenes. A project page is available. The benchmark helps to assess the multi-level visual grounding capabilities of AI. The research encourages progress in AI's capacity for visual comprehension.
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
Also listed under Video Understanding above. From October 20, 2025, this paper presents MT-Video-Bench, a benchmark for evaluating how well multimodal Large Language Models (LLMs) understand and respond to video content in multi-turn dialogues. More detailed information is available on the project website.
: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
From October 20, 2025, this paper presents a new method for decoding discontinuous cross-modal dynamics for efficient multimodal LLMs. The research targets efficiency in multimodal systems and may lead to new ways of processing information. This paper is set to appear in the EMNLP 2025 Main Conference.
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
Published on October 19, 2025, and accepted to ICCV 2025, this paper tackles video temporal grounding with multimodal LLMs using what its title calls an enrich-and-detect approach. The aim is to improve AI's ability to understand video content.
EEschematic: Multimodal-LLM Based AI Agent for Schematic Generation of Analog Circuit
Also from October 19, 2025, this paper focuses on an AI agent for schematic generation of analog circuits. The focus is on using Multimodal-LLMs. The study focuses on integrating AI in the design of analog circuits. This has the potential to simplify complex processes.
Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs
Released on October 19, 2025, this paper discusses segmentation as a plug-and-play capability for frozen multimodal LLMs. This work is about adding segmentation to a multimodal LLM while keeping the model itself frozen. The results may improve how AI systems are developed.
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
From October 18, 2025, this paper introduces VisionSelector, which focuses on end-to-end learnable visual token compression for efficient multimodal LLMs. This is a novel approach to improve the efficiency of AI. The research is designed to increase the performance of multimodal LLMs.
Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
Released on October 17, 2025, this paper provides a survey on multimodal retrieval-augmented generation for document understanding. This study provides an in-depth view of current methods. This will help with the advancement of AI.
Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation
Also from October 17, 2025, this paper focuses on elevating visual perception in multimodal LLMs with visual embedding distillation. A project page is available. The goal is to enhance AI's visual perception capabilities.
Video Foundation Model
Advances in 4D Representation: Geometry, Motion, and Interaction
Published on October 22, 2025, this paper dives into the latest advancements in 4D representation, including geometry, motion, and interaction. A project page is available. The research gives insight into how to model dynamic scenes in 4D. The work opens new avenues for AI applications.
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Released on October 9, 2025, this paper introduces TTOM, which uses test-time optimization and memorization for compositional video generation. A project page is also available. The study introduces new ways of creating videos using AI. The findings offer valuable insights for video generation.
Inferring Dynamic Physical Properties from Video Foundation Models
From October 2, 2025, this paper explores the inference of dynamic physical properties from video foundation models. This study helps to understand how to infer physical properties from video data. This work is essential for the advancement of computer vision.
Can World Models Benefit VLMs for World Dynamics?
Published on October 1, 2025, this paper asks whether world models can benefit VLMs for world dynamics. A project page is available. This study seeks to determine the potential of world models. This work is essential for the evolution of AI.
FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
From September 25, 2025, this paper introduces FantasyWorld, which is about geometry-consistent world modeling using unified video and 3D prediction. The research focuses on making sure that the generated 3D scenes are geometrically sound. The approach presents new methods for 3D modeling and prediction.
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation
Published on September 20, 2025, this paper introduces Uni3C, which focuses on unifying 3D-enhanced camera and human motion controls for video generation. A project page is available. The paper shows how to improve the control of video generation.
Simplifying Traffic Anomaly Detection with Video Foundation Models
From September 1, 2025, this paper discusses simplifying traffic anomaly detection with video foundation models. The paper will be in ICCVW 2025. Code is available on GitHub. This work focuses on using video foundation models to improve traffic analysis. The research will help enhance traffic management.
Autoregressive Universal Video Segmentation Model
Released on August 26, 2025, this paper introduces an autoregressive universal video segmentation model. The research offers a new way of approaching video segmentation. This work is expected to improve performance in video analysis.
ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
Published on August 14, 2025, this paper introduces ToonComposer, a method that uses generative post-keyframing to streamline cartoon production. A project page is available. The research is centered on making cartoon production more efficient.
SAGOnline: Segment Any Gaussians Online
From August 11, 2025, this paper introduces SAGOnline, short for Segment Any Gaussians Online, a method for online segmentation of Gaussians. This approach aims to improve video analysis.
TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction
Released on July 29, 2025, this paper presents TRIBE, a TRImodal Brain Encoder designed for predicting whole-brain fMRI responses. The work is designed to improve the interpretation of fMRI data. This research is expected to contribute to advancements in neuroscience and medical imaging.
SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking
Also from July 29, 2025, this paper introduces SAMITE, which focuses on visual object tracking. The approach uses position-prompted SAM2 with calibrated memory. The main aim is to improve the accuracy of object tracking. The research will enhance the efficiency and precision of visual object tracking.
SeqTex: Generate Mesh Textures in Video Sequence
Published on July 6, 2025, this paper focuses on using SeqTex to generate mesh textures in video sequences. The study provides methods for creating mesh textures. The research will improve visual realism in video processing.
SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications
From July 4, 2025, this paper presents SciVid, a cross-domain evaluation of video models in scientific applications. The paper has been accepted to ICCV 2025. A GitHub repo is available. This study provides tools for assessing and improving video models in scientific applications.
GenLit: Reformulating Single-Image Relighting as Video Generation
From June 20, 2025, this paper introduces GenLit, which reformulates single-image relighting as a video generation problem. The work offers a new way of approaching relighting.
That's all for now, folks! I hope you found this overview useful. Keep an eye out for more updates soon! And don't forget to check out the GitHub page for all the details.