Fairness in Serving Large Language Models

Ying Sheng; Shiyi Cao; Dacheng Li; Banghua Zhu; Zhuohan Li; Danyang Zhuo; Joseph E. Gonzalez; Ion Stoica

Authors:

Ying Sheng, UC Berkeley and Stanford University; Shiyi Cao, Dacheng Li, Banghua Zhu, and Zhuohan Li, UC Berkeley; Danyang Zhuo, Duke University; Joseph E. Gonzalez and Ion Stoica, UC Berkeley

Abstract:

High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests, from short chat conversations to long document reading. To ensure that all client requests are processed fairly, most major LLM inference services impose request rate limits so that no client can dominate the request queue. However, this rudimentary notion of fairness also results in under-utilization of resources and a poor client experience when there is spare capacity. While there is a rich literature on fair scheduling, serving LLMs presents new challenges due to their unpredictable request lengths and their unique batching characteristics on parallel accelerators. This paper introduces a definition of LLM serving fairness based on a cost function that accounts for the number of input and output tokens processed. To achieve fairness in serving, we propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler based on the continuous batching mechanism. We prove a 2× tight upper bound on the service difference between two backlogged clients while adhering to the work-conserving requirement. Through extensive experiments, we demonstrate the superior performance of VTC in ensuring fairness, especially in contrast to other baseline methods, which exhibit shortcomings under various conditions. The reproducible code is available at https://github.com/Ying1123/VTC-artifact.
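The idea of token-based fair scheduling described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked artifact for that); it is a simplified, hypothetical model that assumes a linear cost function with made-up weights `w_in` and `w_out`, charges the full cost at dispatch time (a real server does not know the output length in advance), and ignores continuous batching entirely. The class and field names are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    client: str
    input_tokens: int
    output_tokens: int  # assumed known up front, for illustration only


@dataclass
class TokenCounterScheduler:
    # Hypothetical weights of the cost function; the values are assumptions.
    w_in: float = 1.0
    w_out: float = 2.0
    counters: dict = field(default_factory=dict)  # service charged per client
    queues: dict = field(default_factory=dict)    # pending requests per client

    def cost(self, req: Request) -> float:
        # Cost accounts for both input and output tokens processed.
        return self.w_in * req.input_tokens + self.w_out * req.output_tokens

    def submit(self, req: Request) -> None:
        self.counters.setdefault(req.client, 0.0)
        self.queues.setdefault(req.client, deque()).append(req)

    def next_request(self):
        # Only clients with queued requests compete, so spare capacity is
        # never left idle while any request is waiting (work-conserving).
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        # Serve the backlogged client that has received the least service.
        client = min(backlogged, key=lambda c: self.counters[c])
        req = self.queues[client].popleft()
        self.counters[client] += self.cost(req)
        return req


# Client A floods the queue; client B sends fewer requests.
sched = TokenCounterScheduler()
for _ in range(4):
    sched.submit(Request("A", input_tokens=10, output_tokens=10))
for _ in range(2):
    sched.submit(Request("B", input_tokens=10, output_tokens=10))

order = []
while (req := sched.next_request()) is not None:
    order.append(req.client)
print(order)  # ['A', 'B', 'A', 'B', 'A', 'A']
```

In this toy run the two clients alternate while both are backlogged, and client A uses the remaining capacity once client B's queue is empty, which is the behavior a fixed request rate limit would not allow.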

BibTeX

@inproceedings {298768,
author = {Ying Sheng and Shiyi Cao and Dacheng Li and Banghua Zhu and Zhuohan Li and Danyang Zhuo and Joseph E. Gonzalez and Ion Stoica},
title = {Fairness in Serving Large Language Models},
booktitle = {18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)},
year = {2024},
isbn = {978-1-939133-40-3},
address = {Santa Clara, CA},
pages = {965--988},
url = {https://www.usenix.org/conference/osdi24/presentation/sheng},
publisher = {USENIX Association},
month = jul
}
