vLLM: Fast & Affordable LLM Serving with PagedAttention | UC Berkeley's Open-Source Library

Published 2023/06/21
1,829 views
The vLLM team at UC Berkeley has developed an open-source library for fast LLM inference and serving called vLLM, which uses their new attention algorithm, PagedAttention, to manage attention keys and values. vLLM delivers up to 24x higher throughput than HuggingFace Transformers without requiring any model architecture changes. PagedAttention partitions the KV cache of each sequence into blocks, allowing for efficient memory management and sharing, resulting in near-optimal memory usage and up to a 2.2x improvement in throughput. The library has been successfully deployed at Chatbot Arena and Vicuna Demo, making LLM serving affordable even for small research teams with limited compute resources. The comments discuss the library's technical details, how it compares with similar projects, and its potential applications.
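To make the block-based KV cache idea concrete, here is a minimal, illustrative sketch (not vLLM's actual implementation): a toy block manager that maps each sequence onto fixed-size physical blocks allocated on demand, and shares blocks between sequences (e.g. a common prompt in parallel sampling) via reference counting. The names `BlockManager`, `allocate`, `fork`, and `BLOCK_SIZE` are assumptions made for this example.

```python
# Toy illustration of the PagedAttention block-table idea; not vLLM's real code.
BLOCK_SIZE = 16  # assumed number of tokens stored per KV block


class BlockManager:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # pool of physical blocks
        self.ref_counts = [0] * num_blocks            # enables block sharing
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks

    def allocate(self, seq_id: int, num_tokens: int) -> None:
        """Map a sequence's tokens onto just enough physical blocks."""
        num_needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        blocks = [self.free_blocks.pop() for _ in range(num_needed)]
        for b in blocks:
            self.ref_counts[b] = 1
        self.block_tables[seq_id] = blocks

    def fork(self, parent_id: int, child_id: int) -> None:
        """Share the parent's blocks with a child sequence instead of copying."""
        blocks = self.block_tables[parent_id]
        for b in blocks:
            self.ref_counts[b] += 1
        self.block_tables[child_id] = list(blocks)

    def free(self, seq_id: int) -> None:
        """Return blocks to the pool once no sequence references them."""
        for b in self.block_tables.pop(seq_id):
            self.ref_counts[b] -= 1
            if self.ref_counts[b] == 0:
                self.free_blocks.append(b)


if __name__ == "__main__":
    mgr = BlockManager(num_blocks=8)
    mgr.allocate(seq_id=0, num_tokens=40)  # 40 tokens -> 3 blocks, no waste
    mgr.fork(parent_id=0, child_id=1)      # child reuses the same 3 blocks
    print(mgr.block_tables)
    mgr.free(1)
    mgr.free(0)
    print(sorted(mgr.free_blocks))         # all 8 blocks returned to the pool
```

Because memory is reserved one small block at a time rather than for a sequence's maximum possible length, fragmentation stays low and shared prefixes occupy physical memory only once, which is the mechanism behind the memory savings and throughput gains described above.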

🔗 https://vllm.ai/

#AI #LLM