Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs

Haojun Xia; Zhen Zheng; Xiaoxia Wu; Shiyang Chen; Zhewei Yao; Stephen Youn; Arash Bakhtiari; Michael Wyatt; Donglin Zhuang; Zhongzhu Zhou; Olatunji Ruwase; Yuxiong He; Shuaiwen Leon Song

Authors:

Haojun Xia, University of Sydney; Zhen Zheng and Xiaoxia Wu, Microsoft; Shiyang Chen, Rutgers University; Zhewei Yao, Stephen Youn, Arash Bakhtiari, and Michael Wyatt, Microsoft; Donglin Zhuang and Zhongzhu Zhou, University of Sydney; Olatunji Ruwase, Yuxiong He, and Shuaiwen Leon Song, Microsoft

Abstract:

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) while preserving model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. Supporting FP6 quantization on GPUs is challenging because of (1) unfriendly memory access to model weights with a non-power-of-two bit-width and (2) the high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and arbitrary bit-width quantization (5-bit, etc.). We integrate the TC-FPx kernel into an existing inference system, providing new end-to-end support (called Quant-LLM) for quantized LLM inference, where 6-bit quantization achieves better trade-offs between inference cost and model quality. Experiments show that Quant-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69×-2.65× higher normalized inference throughput than the FP16 baseline. The source code is publicly available at https://github.com/usyd-fsalab/fp6_llm.
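
To make the first challenge concrete, the following is a minimal Python sketch of why a 6-bit width is awkward for byte-addressed memory. It is an illustration only, not the TC-FPx weight layout: the helper names (`pack_6bit`, `unpack_6bit`, `weight_bytes`) and the big-endian packing order are assumptions made for this example. Four 6-bit codes fill exactly three bytes, so the code at index k starts at bit 6k and every second and third code in a group straddles a byte boundary. The last helper gives a rough weight-only footprint, assuming that "70b" means about 70 billion parameters.

```python
def pack_6bit(codes):
    """Pack 6-bit integer codes (0..63) into bytes, four codes per three bytes.

    Hypothetical layout for illustration; not the layout used by TC-FPx.
    """
    if len(codes) % 4 != 0:
        raise ValueError("number of codes must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(codes), 4):
        group = codes[i:i + 4]
        if any(not 0 <= v < 64 for v in group):
            raise ValueError("each code must fit in 6 bits")
        a, b, c, d = group
        # Concatenate four 6-bit fields into one 24-bit word.
        word = (a << 18) | (b << 12) | (c << 6) | d
        out += word.to_bytes(3, "big")
    return bytes(out)


def unpack_6bit(packed):
    """Inverse of pack_6bit: recover the 6-bit codes from packed bytes."""
    if len(packed) % 3 != 0:
        raise ValueError("packed length must be a multiple of 3")
    codes = []
    for i in range(0, len(packed), 3):
        word = int.from_bytes(packed[i:i + 3], "big")
        # Shift and mask each 6-bit field back out of the 24-bit word.
        codes.extend((word >> shift) & 0x3F for shift in (18, 12, 6, 0))
    return codes


def weight_bytes(num_params, bits_per_weight):
    """Weight-only storage in bytes, ignoring scales and other metadata."""
    return num_params * bits_per_weight // 8


codes = [1, 63, 0, 42, 7, 8, 9, 10]
packed = pack_6bit(codes)
restored = unpack_6bit(packed)

# Assumption: roughly 70 billion parameters for a "70b" model.
fp16_bytes = weight_bytes(70_000_000_000, 16)
fp6_bytes = weight_bytes(70_000_000_000, 6)
```

Under these assumptions the eight codes occupy six bytes, and the weight-only footprint drops from about 140 GB at 16 bits to about 52.5 GB at 6 bits, which is consistent with the abstract's claim that the model can be served from a single GPU. Extracting a value requires a shift and a mask across byte boundaries rather than a single aligned load, which is the kind of access pattern the abstract calls unfriendly.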

BibTeX

@inproceedings {298573,
author = {Haojun Xia and Zhen Zheng and Xiaoxia Wu and Shiyang Chen and Zhewei Yao and Stephen Youn and Arash Bakhtiari and Michael Wyatt and Donglin Zhuang and Zhongzhu Zhou and Olatunji Ruwase and Yuxiong He and Shuaiwen Leon Song},
title = {{Quant-LLM}: Accelerating the Serving of Large Language Models via {FP6-Centric} {Algorithm-System} {Co-Design} on Modern {GPUs}},
booktitle = {2024 USENIX Annual Technical Conference (USENIX ATC 24)},
year = {2024},
isbn = {978-1-939133-41-0},
address = {Santa Clara, CA},
pages = {699--713},
url = {https://www.usenix.org/conference/atc24/presentation/xia},
publisher = {USENIX Association},
month = jul
}
