Preparing Fineweb - A Finely Cleaned Common Crawl Dataset
RedPajama - Meet The Biggest Pre-Training Dataset!!!
Text By the Bay 2015: Stephen Merity, A Web Worth of Data: Common Crawl for NLP
Internet-Scale Analysis of AWS Cognito Security
MADLAD 400: Clean Multilingual Dataset with 400+ languages
Introducing RedPajama v2: A Massive Dataset for Training LLMs with 30T Tokens!
Do you want to better your life? #philippines #angelescity #expat #pampanga #travelvlog
Growing up Pentecostal... #short
Stephen Merity - Internet scale analytics @ Common Crawl
RefinedWeb Dataset for Falcon LLM
Using HTML for Language Modeling
E15: Unlocking the Internet's Treasure with Rich Skrenta at Common Crawl
AI Positive - Rich Skrenta from Common Crawl // AI Inside 1
Deduplicating Training Data Makes Language Models Better (Research Paper Walkthrough)
First open-source multimodal math dataset boosts MLLM performance - Podcast
Data processing for Causal Language Modeling
3 Crypto Scams YOU WILL Fall For & How To Avoid
OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text
Roll20 GM Overview: Learn the Basics!
XGen-7B: Long Sequence Modeling with (up to) 8K Tokens. Overview, Dataset & Google Colab Code.