Hi, I am Yipu Wang. I am currently a Ph.D. candidate jointly supervised by the Institute of Automation, Chinese Academy of Sciences (CASIA), and the School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences. I am fortunate to be advised by Prof. Xiaolong Zheng. My research focuses on vision-language models and spatial intelligence.

📝 Publications

arXiv 2025

Towards Cross-View Point Correspondence in Vision-Language Models

Yipu Wang^*, Yuheng Ji^*, Yuyang Liu^*, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, Zhiyuan Ma, Daniel Dajun Zeng, Xiaolong Zheng

Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. We propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with a hierarchical design. Our evaluation shows that state-of-the-art models still fall far behind humans. We construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, and propose CroPond, which achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% in accuracy.

arXiv 2025

VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning

Yuheng Ji^*, Yipu Wang^*, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, Xiaolong Zheng

VisualTrans is the first real-world benchmark for Visual Transformation Reasoning (VTR), evaluating spatial, procedural and quantitative reasoning across 12 human-object interaction tasks. While current models perform well on static tasks, they show significant limitations in dynamic, multi-step reasoning, revealing critical gaps in temporal and causal understanding for intelligent systems.

NeurIPS 2025 (Spotlight)

OpenCUA: Open Foundations for Computer-Use Agents

Xinyuan Wang^*, Bowen Wang^*, Dunjie Lu^*, Junlin Yang^*, Tianbao Xie^*, Junli Wang^*, et al., Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Y.Charles, Zhilin Yang, Tao Yu

We present OpenCUA, a comprehensive open-source framework for scaling computer-use agent (CUA) data and foundation models. It includes an annotation infrastructure, the first large-scale computer-use task dataset, and a scalable pipeline that transforms demonstrations into state–action pairs with reflective long Chain-of-Thought reasoning. Our end-to-end agent model, OpenCUA-32B, achieves an average success rate of 32.5% on OSWorld-Verified, establishing a new state of the art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o).

JES

Wavelet attention-powered neural network framework with hierarchical dynamic frequency learning for lithium-ion battery state of health prediction

Yipu Wang, Huan Wang

We propose WAPHF, a wavelet attention-powered hierarchical dynamic frequency learning framework for lithium-ion battery state of health (SOH) prediction. By integrating a CNN with the wavelet transform and dynamic frequency-focused attention, our method effectively addresses frequency aliasing and outperforms state-of-the-art approaches across three datasets.

📖 Education

2025.09 - Present, University of Chinese Academy of Sciences, Computer Science and Technology.
2021.09 - 2025.06, University of Electronic Science and Technology of China, Electrical and Electronic Engineering.

💻 Internships

2025.02 - 2025.07, Moonshot AI, Multimodal Team.