Hi, I am Yipu Wang. I am currently a Ph.D. candidate jointly supervised by the Institute of Automation, Chinese Academy of Sciences (CASIA), and the School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences. I am fortunate to be advised by Prof. Xiaolong Zheng. My research focuses on vision-language models and multimodal reasoning.

πŸ“ Publications

arXiv 2025
sym

VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning

Yuheng Ji*, Yipu Wang*, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, Xiaolong Zheng (*Equal contribution)

  • VisualTrans is the first real-world benchmark for Visual Transformation Reasoning (VTR), evaluating spatial, procedural and quantitative reasoning across 12 human-object interaction tasks. While current models perform well on static tasks, they show significant limitations in dynamic, multi-step reasoning, revealing critical gaps in temporal and causal understanding for intelligent systems.
NeurIPS 2025 (Spotlight)
sym

OpenCUA: Open Foundations for Computer-Use Agents

Xinyuan Wang*, Bowen Wang*, Dunjie Lu*, Junlin Yang*, Tianbao Xie*, Junli Wang*, et al., Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Y.Charles, Zhilin Yang, Tao Yu

  • We present OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models which includes an annotation infrastructure, the first large-scale computer-use task dataset and a scalable pipeline that transforms demonstrations into state–action pairs with reflective long Chain-of-Thought reasoning. Our end-to-end agent model, OpenCUA-32B achieves an average success rate of 32.5% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o).
JES
sym

Wavelet attention-powered neural network framework with hierarchical dynamic frequency learning for lithium-ion battery state of health prediction

Yipu Wang, Huan Wang

  • We propose WAPHF, a wavelet attention-powered hierarchical dynamic frequency learning framework for lithium battery SOH prediction. By integrating CNN with wavelet transform and dynamic frequency-focused attention, our method effectively addresses frequency aliasing issues and outperforms state-of-the-art approaches across three datasets.

πŸ“– Educations

  • 2025.09 - Present, University of Chinese Academy of Sciences, Computer Science and Technology.
  • 2021.09 - 2025.06, University of Electronic Science and Technology of China, Electrical and Electronic Engineering.

πŸ’» Internships

  • 2025.02 - 2025.07, Moonshot AI, Multimodal Team.