Mistral Spelled Out: RMS Norm: Part 5 (RMSNorm sketch after this list)
LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU (SwiGLU sketch after this list)
The KV Cache: Memory Usage in Transformers (cache-size sketch after this list)
Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm (GQA sketch after this list)
RoFormer: Enhanced Transformer with Rotary Position Embedding Explained
Rotary Positional Embeddings (RoPE sketch after this list)
Lecture 28: Liger Kernel - Efficient Triton Kernels for LLM Training
Structured State Space Models for Deep Sequence Modeling (Albert Gu, CMU) (SSM recurrence sketch after this list)
Mamba - a replacement for Transformers?
Relative Position Bias (+ PyTorch Implementation) (bias-table sketch after this list)
Coding a Transformer from scratch in PyTorch, with full explanation, training, and inference
Word Embeddings & Positional Encoding in NLP Transformer model explained - Part 1 (sinusoidal encoding sketch after this list)
LLaMA Pro: Progressive LLaMA with Block Expansion (Paper Explained)
Shoelace Formula: Area of any n-sided figure (shoelace sketch after this list)
LoRA: Low-Rank Adaptation of Large Language Models - Explained visually + PyTorch code from scratch (LoRA sketch after this list)
Implement Llama 3 From Scratch - PyTorch
Coding Stable Diffusion from scratch in PyTorch
RWKV from scratch in PyTorch
Llama 1 vs. Llama 2: Meta's Genius Breakthrough in AI Architecture | Research Paper Breakdown
BERT explained: Training, Inference, BERT vs GPT/LLaMA, Fine-tuning, [CLS] token
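
Minimal sketches of several techniques named in the titles above follow. Each is an illustrative reimplementation under stated assumptions, not the code from the corresponding video.

RMSNorm, as used in LLaMA and Mistral: unlike LayerNorm it skips mean-centering and the bias term, dividing by the root mean square of the activations and rescaling with a learned gain. A minimal sketch (module name and the eps default are my choices):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: x / sqrt(mean(x^2) + eps) * g; no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```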
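
SwiGLU, the gated feed-forward used in LLaMA: a SiLU-gated product of two linear projections followed by a down-projection. A sketch; the weight names w_gate, w_up, and w_down are my labels, not from any of the videos:

```python
import torch
import torch.nn.functional as F

def swiglu_ffn(x, w_gate, w_up, w_down):
    # FFN(x) = W_down( SiLU(x W_gate) * (x W_up) )
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down
```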
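
The KV cache stores one K and one V tensor per layer, so its size is 2 x layers x kv_heads x head_dim x seq_len x batch x bytes_per_element. A back-of-the-envelope helper; the LLaMA-2-7B-style numbers in the example are assumptions for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # factor of 2 = one K tensor plus one V tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# 32 layers, 32 KV heads, head_dim 128, 4096 tokens, fp16 -> exactly 2 GiB
print(kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30)  # 2.0
```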
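
Grouped Query Attention shrinks that cache by letting several query heads share one KV head; at attention time the K/V heads are repeated to line up with the query heads. A minimal sketch of the expansion step:

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    # kv: [batch, n_kv_heads, seq_len, head_dim]; each KV head serves n_rep query heads
    return kv if n_rep == 1 else kv.repeat_interleave(n_rep, dim=1)

k = torch.randn(1, 8, 16, 64)   # 8 KV heads
print(repeat_kv(k, 4).shape)    # [1, 32, 16, 64] -> lines up with 32 query heads
```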
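
Rotary position embeddings (RoFormer) encode position m by rotating each consecutive pair of query/key dimensions through the angle m * theta_i, with theta_i = 10000^(-2i/d), so dot-product attention scores depend only on relative offsets. A sketch of the interleaved-pair form, assuming an even head dimension:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: [seq_len, dim], dim even; rotate each pair (x_{2i}, x_{2i+1}) by m * theta_i
    seq_len, dim = x.shape
    theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # [dim/2]
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), theta)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```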
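
The structured state space models behind Mamba reduce, in discrete time, to a linear recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t; S4 imposes structure on A, and Mamba additionally makes B and C input-dependent. A naive scalar-input sketch of just the recurrence, nothing more:

```python
import torch

def ssm_scan(A, B, C, x):
    # A: [N, N], B: [N, 1], C: [1, N], x: [T] scalar inputs -> y: [T]
    h = torch.zeros(A.shape[0], 1)
    ys = []
    for x_t in x:
        h = A @ h + B * x_t        # state update
        ys.append((C @ h).squeeze())
    return torch.stack(ys)
```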
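
Relative position bias adds a learned scalar b[j - i] per head to the attention logits before softmax, clipping offsets beyond a maximum distance (T5 buckets the distances instead). A sketch; the class name and max_dist default are illustrative:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    def __init__(self, n_heads: int, max_dist: int = 128):
        super().__init__()
        self.max_dist = max_dist
        # one learned bias per head for each offset in [-max_dist, max_dist]
        self.table = nn.Parameter(torch.zeros(2 * max_dist + 1, n_heads))

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]                 # rel[i, j] = j - i
        idx = rel.clamp(-self.max_dist, self.max_dist) + self.max_dist
        return self.table[idx].permute(2, 0, 1)           # [n_heads, seq, seq]

# used as: scores = q @ k.transpose(-1, -2) / head_dim**0.5 + bias(seq_len)
```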
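
The original Transformer's fixed positional encoding interleaves sines and cosines at geometrically spaced frequencies: PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)). A sketch, assuming an even model dimension:

```python
import torch

def sinusoidal_pe(seq_len: int, dim: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # [seq_len, 1]
    div = torch.pow(10000.0, torch.arange(0, dim, 2).float() / dim)  # [dim/2]
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe
```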
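
The shoelace formula, the one non-LLM entry: the area of a simple polygon with vertices (x_i, y_i) listed in order is A = 1/2 |sum_i (x_i y_{i+1} - x_{i+1} y_i)|, indices taken mod n. A sketch:

```python
def shoelace_area(vertices):
    # vertices: list of (x, y) pairs in clockwise or counter-clockwise order
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

print(shoelace_area([(0, 0), (1, 0), (1, 1), (0, 1)]))  # unit square -> 1.0
```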
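
LoRA freezes the pretrained weight W and learns a low-rank update, so the effective weight is W + (alpha/r) B A, with B initialized to zero so training starts from the unmodified model. A sketch of wrapping an existing nn.Linear; the class name and r/alpha defaults are my choices:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # pretrained W stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0 -> no change at init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```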