Institute of Poorly Optimized GPU Code


Latest

A place for me to blog and showcase some cool things that I did or found interesting

Published on
November 23, 2025
How to run custom CUDA kernels with Torch
torch cuda
A simple way to run custom CUDA kernels in PyTorch using extensions, plus a quick way to benchmark them against native functions
Published on
November 15, 2025
Faster LLM inference with KV cache, speculative decoding and torch.compile
llm torch
A brief overview of some techniques that can make a model faster without many code changes
Published on
November 6, 2025
An 'Easy' LeetGPU problem will teach you about GPU memory hierarchy
cuda gpu leetgpu kernels
Conv 1D is a simple kernel to write, but optimizing it will teach you about every layer of the GPU memory hierarchy.
Published on
November 3, 2025
Home-grown Qwen3 model with a bit of sprinkles on top
llm torch qwen3
An implementation of Qwen3 in PyTorch with some inference-oriented optimizations
Published on
October 27, 2025
How CUDA Kernels are Executed on the GPU?
cuda gpu
Beautiful graphics that explain how kernels are executed on the GPU
