Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm
The KV Cache: Memory Usage in Transformers
LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU
PyTorch 2.0 Q&A: Optimizing Transformers for Inference
Accelerate Big Model Inference: How Does it Work?
How a Transformer works at inference vs training time
Accelerating Generative AI - Christian Puhrsch & Horace He, Meta
Mistral Architecture Explained From Scratch with Sliding Window Attention, KV Caching Explanation
Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation
Fast LLM Serving with vLLM and PagedAttention
Efficient LLM Inference (vLLM KV Cache, Flash Decoding & Lookahead Decoding)
Implement Llama 3 From Scratch - PyTorch
The math behind Attention: Keys, Queries, and Values matrices
Attention mechanism: Overview
Efficient Inference of Vision Instruction-Following Models with Elastic Cache (arXiv, 2024)
Coding a Transformer from scratch on PyTorch, with full explanation, training and inference.
FlashAttention - Tri Dao | Stanford MLSys #67
Let's build GPT: from scratch, in code, spelled out.
[Transformer Fundamentals] How Multi-Head Attention Works
GPT-Fast - blazingly fast inference with PyTorch (w/ Horace He)