Mistral Spelled Out: RMS Norm: Part 5
The KV Cache: Memory Usage in Transformers
Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm
LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU
Mamba - a replacement for Transformers?
LayerNorm, InstanceNorm, GroupNorm: Alternatives to Batch Normalization for Small Batch Sizes
Rotary Positional Embeddings: Combining Absolute and Relative
Transformer layer normalization
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained)
RoPE (Rotary positional embeddings) explained: The positional workhorse of modern LLMs
Structured State Space Models for Deep Sequence Modeling (Albert Gu, CMU)
Transformer Architecture: Fast Attention, Rotary Positional Embeddings, and Multi-Query Attention
Fast LLM Serving with vLLM and PagedAttention
Ronen Eldan | The TinyStories Dataset: How Small Can Language Models Be And Still Speak Coherent English
Let's Code Elon's Grok Model in Pytorch Step-by-Step, From Scratch, Spelled Out
CLIP - Paper explanation (training and inference)
Why is Chunk Size Important?
Training Loops in PyTorch - Linear regression example
Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
Relative Position Bias (+ PyTorch Implementation)