CKSRI Seminar Series 2026 "World Modeling from Computer Vision"

Date: 26 May 2026
Time: 11:00 AM - 12:00 PM
Venue: Room 4580 (Lift 27-28), Academic Building, HKUST
Event Format: Seminar, Lecture, Talk
Speaker: Prof. Yinghao Xu
ABSTRACT
Large Language Models have given us an AI that reads, writes, and codes, yet they have not given us a robot that can fold laundry. Closing this gap is the next major frontier in AI, and it is precisely the problem world models exist to solve — systems that learn from raw observation how the world looks, how it evolves, and how it responds to an agent's actions.
The critical question is where such systems originate. While the popular view treats world models as a novel paradigm distinct from computer vision, I argue that world models are inherently rooted in computer vision. The large-scale representation learning, generative modeling, and spatiotemporal attention machinery driven by the past decade of vision research are exactly what we need to perceive, simulate, and act. World modeling is not what comes after computer vision; it is what computer vision becomes when we unify reconstruction, simulation, and action.
I will make this concrete through three interconnected works. First, LingBot-Map establishes the perception backbone, replacing hand-crafted SLAM with an end-to-end attention model that embeds geometric priors, recovering 3D structure from continuous video at scales legacy pipelines cannot achieve. Second, LingBot-World introduces the dynamics layer through neural world simulation, moving beyond static geometry to model the causal and physical evolution of scenes. Finally, LingBot-VA translates this world model into action by jointly modeling dynamics and control within a single autoregressive framework, turning a passive representation into an embodied controller.
Ultimately, world modeling represents the next phase of AI, and computer vision is the discipline that builds it.
About the Speaker
Yinghao Xu is an Assistant Professor in the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology (HKUST). Previously, he was a Staff Research Scientist at RobbyAnt, working on world models and embodied AI. Before that, he was a postdoctoral researcher at Stanford University. His research lies at the intersection of 3D computer vision, generative AI, and embodied AI, with a recent focus on building world models that unify 3D reconstruction, world simulation, and embodied action. He was the recipient of the Yunfan Award at WAIC 2024 and was nominated for the Snap Fellowship in 2022.



