Published on November 23, 2025
How to run custom CUDA kernels with Torch
Tags: torch, cuda
A simple way to run custom CUDA kernels in PyTorch using extensions, and a quick way to benchmark them against native functions.

Published on November 15, 2025
Faster LLM inference with KV cache, speculative decoding and torch.compile
Tags: llm, torch
A brief overview of techniques that can make a model faster without many code changes.

Published on November 6, 2025
An 'Easy' LeetGPU problem will teach you about GPU memory hierarchy
Tags: cuda, gpu, leetgpu, kernels
Conv 1D is a simple kernel to write; however, if you want to optimize it, you will learn about every layer of the GPU memory hierarchy.

Published on November 3, 2025
Home-grown Qwen3 model with a bit of sprinkles on top
Tags: llm, torch, qwen3
An implementation of Qwen3 in PyTorch with some inference-focused optimizations.

Published on October 27, 2025
How CUDA Kernels are Executed on the GPU?
Tags: cuda, gpu
Beautiful graphics that explain how kernels are executed on the GPU.