Yifan Qiao

About Me

I am a Founding Member of Technical Staff at Inferact, building vLLM to make AI accessible to everyone with cheaper and faster inference.

I just finished my Postdoc at UC Berkeley, where I worked with Ion Stoica and Joseph E. Gonzalez in the Sky Computing Lab. Prior to that, I completed my Ph.D. in Computer Science at UCLA in 2024, where I was advised by Harry Xu and Miryung Kim.

My research lies at the intersection of systems and machine learning. I build systems to make AI faster and more efficient.

I received the Amazon & UCLA Science Hub Fellowship (2021) and UCLA's Outstanding Graduate Student Research Award (2024), and was a finalist for the Jane Street Graduate Research Fellowship (2023).

Updates

May 2026: 🚀 New blog on serving agentic workloads at scale with vLLM × Mooncake!
Jan 2026: Joined Inferact as a Founding Member of Technical Staff.
Dec 2025: Announced GVM at Sky Winter Retreat.
Jul–Nov 2025: Invited talks on kvcached and ConServe at Moonshot AI, AWS, NVIDIA, Meta, and IBM.
Oct 2025: Released kvcached with a technical deep dive blog. [X] [LinkedIn]

Open Source Projects

Other than vLLM, I am actively working on the Open Virtual GPU project (ovg-project), building open-source infrastructure for GPU virtualization and efficient GPU sharing in datacenters. Our vision is to create a "GPU OS" that makes GPU resources as manageable and shareable as CPU resources today. Read our first blog post on solving the GPU cost crisis.

kvcached 1,064

Elastic KV cache sharing across multiple co-located LLMs through GPU virtual memory. Integrates with SGLang and vLLM.

Co-leading with Jiarong Xing and Shan Yu

GVM 22 New

An OS-level GPU virtualization layer for sharing a GPU with hardware-like performance isolation and full flexibility.

Co-leading with Yicheng Liu

Publications

ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving

Yifan Qiao, Shan Yu, Shu Anzai, Haoran Ma, Shuo Yang, Yang Wang, Miryung Kim, Yongji Wu, Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, Harry Xu

ICML 2026

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yifan Qiao, Yang Zhou, Jiarong Xing, Ion Stoica

ASPLOS 2026

UniCache: Unifying Prefix Cache Eviction for Heterogeneous LLM Serving Workloads

Bingyang Ouyang, Yifan Qiao, Jiarong Xing

SIGMETRICS 2026

Lost in Translation: The Search for Meaning in Network-Attached AI Accelerator Disaggregation

Jaewan Hong, Yifan Qiao, Soujanya Ponnapalli, Shu Liu, Marcos K. Aguilera, Vincent Liu, Christopher J. Rossbach, Ion Stoica

HotNets 2025

PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications

Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, Ion Stoica, Junchen Jiang

SOSP 2025

Orthrus: Efficient and Timely Detection of Silent User Data Corruption in the Cloud with Resource-Adaptive Computation Validation

Chenxiao Liu, Zhenting Zhu, Quanxi Li, Yanwen Xia, Yifan Qiao, Xiangyun Deng, Youyou Lu, Tao Xie, Huimin Cui, Zidong Du, Harry Xu, Chenxi Wang

SOSP 2025

Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI

Tian Xia, Hanchen Li, Zhifei Li, Xiaokun Chen, Hao Kang, Yifan Qiao, Yi Xu, Ion Stoica

arXiv 2026

SkyNomad: On Using Multi-Region Spot Instances to Minimize AI Batch Job Cost

Zhifei Li, Tian Xia, Ziming Mao, Zihan Zhou, Ethan J. Jackson, Jamison Kerney, Zhanghao Wu, Pratik Mishra, Yi Xu, Yifan Qiao, Scott Shenker, Ion Stoica

arXiv 2026

Towards Efficient and Practical GPU Multitasking in the Era of LLM

Jiarong Xing, Yifan Qiao, Simon Mo, Xingqi Cui, Gur-Eyal Sela, Yang Zhou, Joseph Gonzalez, Ion Stoica

arXiv 2025

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yangmin Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, Ion Stoica, Harry Xu, Ying Sheng

arXiv 2025 code

Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models

Yongji Wu, Wenjie Qu, Xueshen Liu, Tianyang Tao, Yifan Qiao, Zhuang Wang, Wei Bai, Yuan Tian, Jiaheng Zhang, Z. Morley Mao, Matthew Lentz, Danyang Zhuo, Ion Stoica

arXiv 2024

DRust: Language-Guided Distributed Shared Memory with Fine Granularity, Full Transparency, and Ultra Efficiency

Haoran Ma, Yifan Qiao, Shi Liu, Shan Yu, Chenxi Wang, Yuanjiang Ni, Qingda Lu, Jiesheng Wu, Yiying Zhang, Miryung Kim, and Harry Xu.

OSDI 2024 full version code

A Tale of Two Paths: Toward a Hybrid Data Plane for Efficient Far-Memory Applications

Lei Chen*, Shi Liu*, Chenxi Wang, Haoran Ma, Yifan Qiao, Zhe Wang, Chenggang Wu, Youyou Lu, Xiaobing Feng, Huimin Cui, Shan Lu, and Harry Xu.

OSDI 2024 full version code

Harvesting Idle Memory for Application-managed Soft State with Midas

Yifan Qiao, Zhenyuan Ruan, Haoran Ma, Adam Belay, Miryung Kim, and Harry Xu.

NSDI 2024 code slides

Hermit: Low-Latency, High-Throughput, and Transparent Remote Memory via Feedback-Directed Asynchrony

Yifan Qiao, Chenxi Wang, Zhenyuan Ruan, Adam Belay, Qingda Lu, Yiying Zhang, Miryung Kim, and Guoqing Harry Xu.

NSDI 2023 code slides

Canvas: Isolated and Adaptive Swapping for Multi-Applications on Remote Memory

Chenxi Wang*, Yifan Qiao*, Haoran Ma, Shi Liu, Yiying Zhang, Wenguang Chen, Ravi Netravali, Miryung Kim, Guoqing Harry Xu. (*contributed equally)

NSDI 2023 code slides

Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs

John Thorpe*, Pengzhan Zhao*, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, Guoqing Harry Xu.

NSDI 2023 full version code

MemLiner: Lining up Tracing and Application for a Far-Memory-Friendly Runtime

Chenxi Wang*, Haoran Ma*, Shi Liu, Yifan Qiao, Jonathan Eyolfson, Christian Navasca, Shan Lu, Guoqing Harry Xu.

OSDI 2022 (Awarded Jay Lepreau Best Paper) code

Mako: A Low-Pause, High-Throughput Evacuating Collector for Memory-Disaggregated Datacenters

Haoran Ma, Shi Liu, Chenxi Wang, Yifan Qiao, Michael D. Bond, Stephen M. Blackburn, Miryung Kim, Guoqing Harry Xu.

PLDI 2022 code

Dorylus: Affordable, Scalable, and Accurate GNN Training over Billion-Edge Graphs

John Thorpe*, Yifan Qiao*, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, and Guoqing Harry Xu. (*contributed equally)

OSDI 2021 full version code

Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC

Shuo Yang, Kai Wu, Yifan Qiao, Dong Li, Jidong Zhai.

CLUSTER 2017

Experience

Visiting Student at MIT PDOS Group, hosted by Adam Belay.

Worked on an elastic LLM serving system.

Jun. 2023 - Sept. 2023

Visiting Student at MIT PDOS Group, hosted by Adam Belay.

Worked on Midas, a new OS memory abstraction for soft state.

Jun. 2022 - Sept. 2022

Research Intern at Alibaba Bellevue, Cloud Storage Team, hosted by Qingda Lu.

Worked on Hermit, a high-performance and transparent remote memory system.

Jun. 2021 - Sept. 2021

Service

MLSys 2026, Program Committee
ASPLOS 2026, Program Committee
ATC 2024, External Review Committee
SOSP 2023, Artifact Evaluation Committee
OSDI 2023, Artifact Evaluation Committee
ATC 2023, Artifact Evaluation Committee
WORDS 2022, Session Chair

Awards

2024 Outstanding Graduate Student Research Award, UCLA
2023 Jane Street Graduate Research Fellowship Finalist
2021 Amazon Ph.D. Fellow
2019 Magna Cum Laude in Beijing (8/140)
2019 Magna Cum Laude at Department of Computer Science and Technology, Tsinghua University
2019 Cum Laude at Tsinghua University (14/140)
2018 CNPC Scholarship for Comprehensive Excellence (8/140)
2018 Qualcomm Scholarship (Top 6%)
2017 National Scholarship (6/140)

Teaching

UC Berkeley

Guest Lecture for CS 262A Advanced Topics in Computer Systems (Spring 2025)
Guest Lecture for CS 162 Operating Systems (Fall 2025)

UCLA

Teaching Assistant for CS 130 Software Engineering (Winter 2024)
Teaching Assistant for CS 130 Software Engineering (Spring 2024)