Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention

Title: Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
Publication Type: Conference Paper
Year of Publication: 2024
Authors: Gao B, He Z, Sharma P, Kang Q, Jevdjic D, Deng J, Yang X, Yu Z, Zuo P
Conference Name: 2024 USENIX Annual Technical Conference (USENIX ATC 24)
Date Published: 07/2024
Publisher: USENIX Association
Conference Location: Santa Clara, CA
ISBN Number: 978-1-939133-41-0