How To Use KV Cache Quantization for Longer Generation by LLMs
The KV Cache: Memory Usage in Transformers
LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU
2bit LLM Quantization without Fine Tuning - KIVI
Deep Dive: Optimizing LLM inference
ArXiv paper ThinK: Thinner Key Cache by Query-Driven Pruning by Yuhui Xu, Zhanming Jie, Hanze Dong
CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving (SIGCOMM'24, Paper 1571)
How to Efficiently Serve an LLM?
Accelerate Big Model Inference: How Does it Work?
vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024
Quantization vs Pruning vs Distillation: Optimizing NNs for Inference
[AI Paper Explained] Fusing Diffusion Models and Autoregressive Models, Part 1
Efficient LLM Inference (vLLM KV Cache, Flash Decoding & Lookahead Decoding)
Accelerating Generative AI - Christian Puhrsch & Horace He, Meta
Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral
Efficient Inference of Vision Instruction-Following Models with Elastic Cache - ArXiv:24
Scaling Computing Performance Beyond the End of Moore’s Law: Song Han
8-bit Methods for Efficient Deep Learning with Tim Dettmers
Doing POORMAN'S LLAMA with KV Caching and Quantization and extending to Nano GPT