GPU programming learning resources
- Author: Filip Reka
Books:
- "Programming Massively Parallel Processors: A Hands-on Approach" by Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj
- "AI Systems Performance Engineering: Optimizing Model Training and Inference Workloads with GPUs, CUDA, and PyTorch" by Chris Fregly (coming soon!)
YouTube videos:
- CUDA Programming Course - High-Performance Computing with GPUs
- CUDA + ThunderKittens, but increasingly drunk.
- Zen, CUDA and Tensor Cores by Casey Muratori
Blog posts:
- GPU Glossary by Modal
- Roadmap: Understanding GPU Architecture by Cornell University
- "Accelerating Generative AI with PyTorch: Segment Anything, Fast" by PyTorch
- "Tiny-TPU"
- Fast LLM Inference From Scratch
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters
- How to Think About GPUs
- LLM Inference Economics from First Principles
- Inside vLLM: Anatomy of a High-Throughput LLM Inference System
- Nvidia's H100: Funny L2, and Tons of Bandwidth
- Stephen Diehl Blog
- HuggingFace - "The Smol Training Playbook"
- NVIDIA Tensor Core Evolution: From Volta To Blackwell
- Implementing a fast Tensor Core matmul on the Ada Architecture
- Floating Point Visually Explained
- Making Deep Learning Go Brrrr From First Principles
- Understanding the CUDA Compiler & PTX with a Top-K Kernel
- Outperforming cuBLAS on H100: a Worklog
- How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Discussing Blackwell's drawbacks and dissecting its architecture
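Several of the matmul worklogs above (the cuBLAS-on-H100 and CUDA-matmul posts in particular) revolve around one core idea: tiling the computation so each block of the output is built from small sub-blocks that fit in fast memory. As a minimal, stdlib-only sketch (not a performance implementation — a real kernel maps the tile loops onto thread blocks and shared memory), the loop structure looks like this:

```python
# Toy illustration of the tiling/blocking idea behind the matmul
# worklogs: compute C = A @ B one TILE x TILE block at a time.
# Pure Python, so it demonstrates the loop decomposition only.

TILE = 2  # tile edge; real kernels size this to fit shared memory

def tiled_matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):          # tile row of C
        for j0 in range(0, m, TILE):      # tile column of C
            for p0 in range(0, k, TILE):  # walk tiles along the K dim
                # multiply-accumulate one TILE x TILE sub-block
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, m)):
                        acc = 0.0
                        for p in range(p0, min(p0 + TILE, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] += acc
    return C
```

The reordering matters because each tile of A and B is reused TILE times once loaded; on a GPU that reuse happens out of shared memory instead of global memory, which is the optimization those posts measure.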
Other:
- Hazy Research lab at Stanford
- r/LocalLLaMA Reddit community
- Fixing ALL CUDA installation errors
- torch.compile, the missing manual
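Several resources in the list (the floating-point explainer and the tensor-core evolution posts) hinge on the IEEE 754 layout of a float: one sign bit, a biased exponent, and a mantissa. A quick stdlib-only sketch of unpacking a float32 into those fields:

```python
# Unpack a Python float, rounded to float32, into its IEEE 754
# binary32 fields: 1 sign bit, 8-bit biased exponent, 23-bit mantissa.
import struct

def float32_fields(x: float):
    """Return (sign, biased_exponent, mantissa) of x as float32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased by 127
    mantissa = bits & 0x7FFFFF       # 23-bit fraction, implicit leading 1
    return sign, exponent, mantissa
```

For example, `float32_fields(1.0)` gives a biased exponent of 127 (i.e. 2^0) and a zero mantissa; the reduced-precision formats the tensor-core posts discuss (FP16, BF16, FP8) just shrink the exponent and mantissa widths.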