Variants of Multi-head attention: Multi-query (MQA) and Grouped-query attention (GQA)
Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped Query Attention (GQA) Explained
LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU
Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA) #transformers
Transformer Architecture: Fast Attention, Rotary Positional Embeddings, and Multi-Query Attention
Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm
What is Grouped-Query Attention?
The KV Cache: Memory Usage in Transformers
LLM Jargons Explained: Part 2 - MQA & GQA
A Dive Into Multihead Attention, Self-Attention and Cross-Attention
Sliding Window Attention (Longformer) Explained
DeciLM 15x faster than Llama2 LLM Variable Grouped Query Attention Discussion and Demo
RoPE (Rotary positional embeddings) explained: The positional workhorse of modern LLMs
Deep dive - Better Attention layers for Transformer models
BART Explained: Denoising Sequence-to-Sequence Pre-training
FlashAttention - Tri Dao | Stanford MLSys #67
Vector Database Search - Hierarchical Navigable Small Worlds (HNSW) Explained
LLM Prompt Engineering with Random Sampling: Temperature, Top-k, Top-p
Mistral 7b - the best 7B model to date (paper explained)
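Since nearly every title above centers on grouped-query attention, here is a minimal PyTorch sketch of the core idea, independent of any one video: groups of query heads share a smaller set of K/V heads, with MHA and MQA as the two extremes. The function name and tensor shapes are illustrative, not taken from any of the listed sources.

import math
import torch

def grouped_query_attention(q, k, v):
    # q: (B, H_q, T, D); k, v: (B, H_kv, T, D), with H_q % H_kv == 0.
    # H_kv == H_q recovers multi-head attention (MHA);
    # H_kv == 1 recovers multi-query attention (MQA).
    B, Hq, T, D = q.shape
    Hkv = k.shape[1]
    group = Hq // Hkv
    # Each group of `group` query heads attends to one shared K/V head.
    k = k.repeat_interleave(group, dim=1)             # (B, H_q, T, D)
    v = v.repeat_interleave(group, dim=1)             # (B, H_q, T, D)
    scores = q @ k.transpose(-2, -1) / math.sqrt(D)   # (B, H_q, T, T)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                # (B, H_q, T, D)

# Toy usage: 8 query heads sharing 2 K/V heads (4 query heads per group).
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])

The memory saving that several of these videos discuss comes from the KV cache: only H_kv key/value heads are stored per token instead of H_q, shrinking cache size by the factor H_q / H_kv.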