Hi, I am Yipu Wang. I am currently a Ph.D. candidate jointly supervised by the Institute of Automation, Chinese Academy of Sciences (CASIA), and the School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences. I am fortunate to be advised by Prof. Xiaolong Zheng. My research focuses on vision-language models and multimodal reasoning.
Publications

VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning
Yuheng Ji*, Yipu Wang*, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, Xiaolong Zheng (*Equal contribution)
- VisualTrans is the first real-world benchmark for Visual Transformation Reasoning (VTR), evaluating spatial, procedural and quantitative reasoning across 12 human-object interaction tasks. While current models perform well on static tasks, they show significant limitations in dynamic, multi-step reasoning, revealing critical gaps in temporal and causal understanding for intelligent systems.

OpenCUA: Open Foundations for Computer-Use Agents
Xinyuan Wang*, Bowen Wang*, Dunjie Lu*, Junlin Yang*, Tianbao Xie*, Junli Wang*, et al., Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Y.Charles, Zhilin Yang, Tao Yu
- We present OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. It includes an annotation infrastructure, the first large-scale computer-use task dataset, and a scalable pipeline that transforms demonstrations into state–action pairs with reflective long Chain-of-Thought reasoning. Our end-to-end agent model, OpenCUA-32B, achieves an average success rate of 32.5% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o).

Yipu Wang, Huan Wang
- We propose WAPHF, a wavelet attention-powered hierarchical dynamic frequency learning framework for lithium battery state-of-health (SOH) prediction. By integrating a CNN with the wavelet transform and dynamic frequency-focused attention, our method effectively addresses frequency aliasing and outperforms state-of-the-art approaches across three datasets.
Education
- 2025.09 - Present, University of Chinese Academy of Sciences, Computer Science and Technology.
- 2021.09 - 2025.06, University of Electronic Science and Technology of China, Electrical and Electronic Engineering.
Internships
- 2025.02 - 2025.07, Moonshot AI, Multimodal Team.