On bottlenecks in attention, KV caching, long-context decoding, attention variants, and how DeepSeek MLA came to be. Part 1 of the FlashMLA blog series.
A recap of my day debugging issues with nvcc and nvcc4jupyter on Google Colab's free T4 GPUs, with brief notes on CUDA backward compatibility and compute capability.