How To Use KV Cache Quantization for Longer Generation by LLMs
The KV Cache: Memory Usage in Transformers
LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU
2bit LLM Quantization without Fine Tuning - KIVI
Deep Dive: Optimizing LLM inference
ArXiv paper ThinK: Thinner Key Cache by Query-Driven Pruning by Yuhui Xu, Zhanming Jie, Hanze Dong
CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving (SIGCOMM'24, Paper 1571)
How to Efficiently Serve an LLM?
Accelerate Big Model Inference: How Does it Work?
vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024
Quantization vs Pruning vs Distillation: Optimizing NNs for Inference
[AI Paper Explained] Fusing Diffusion Models and Autoregressive Models, Part 1
Efficient LLM Inference (vLLM KV Cache, Flash Decoding & Lookahead Decoding)
Accelerating Generative AI - Christian Puhrsch & Horace He, Meta
Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral
Efficient Inference of Vision Instruction-Following Models with Elastic Cache - ArXiv:24
Scaling Computing Performance Beyond the End of Moore’s Law: Song Han
8-bit Methods for Efficient Deep Learning with Tim Dettmers
Doing POORMAN'S LLAMA with KV Caching and Quantization and extending to Nano GPT