Reinforcement Learning from Human Feedback (RLHF) Explained
Reinforcement Learning with Human Feedback (RLHF), Clearly Explained!!!
The Unexpected Real-World Revelations of LLMs
Early stages of the reinforcement learning era of language models
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Functional alignment of protein language models via reinforcement learning
Reinforcement Learning is Terrible - Andrej Karpathy
Reinforcement learning (RL)-enhanced large language models (LLMs): exploring RL techniques
How to Fine-Tune a Small LM to Think for Itself and Solve Puzzles (GRPO & RL!)
Reinforcement Learning for Agents - Will Brown, ML Researcher at Morgan Stanley
Large Language Models explained briefly
Lecture 05 • Reinforcement Learning for Language Models
Proximal Policy Optimization (PPO) - How to train Large Language Models
Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 16: Alignment - RL 1
DeepSeek's GRPO (Group Relative Policy Optimization) | Reinforcement Learning for LLMs
Optimizing Large Language Models with Reinforcement Learning-Based Prompts
Richard Sutton – Father of RL thinks LLMs are a dead end
How language model post-training is done today
LLM Training & Reinforcement Learning from Google Engineer | SFT + RLHF | PPO vs GRPO vs DPO