Stop using torch.cat for your KV cache implementations
Published on March 25, 2026 | 7 min read
Tags: llm, kv-cache, pytorch, inference, optimization, transformers
tl;dr: `torch.cat` is not in-place; use pre-allocated buffers instead
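The tl;dr above can be sketched quickly. This is a minimal illustration of the idea (using NumPy rather than PyTorch to keep it self-contained, and with made-up helper names): concatenation reallocates and copies the whole cache on every decode step, while a pre-allocated buffer is written in place.

```python
import numpy as np

# Growing a cache with concatenation: each step allocates a fresh
# array and copies every previous entry (quadratic total copying).
def cache_with_concat(steps, d):
    cache = np.empty((0, d))
    for t in range(steps):
        new_kv = np.full((1, d), float(t))  # stand-in for this step's key/value
        cache = np.concatenate([cache, new_kv], axis=0)  # full copy each step
    return cache

# Pre-allocated buffer: reserve max_len rows up front, write in place,
# and track how many rows are filled.
def cache_with_buffer(steps, d, max_len=16):
    buf = np.zeros((max_len, d))
    for t in range(steps):
        buf[t] = float(t)  # in-place write, no reallocation
    return buf[:steps]  # view of the filled prefix

assert np.array_equal(cache_with_concat(4, 2), cache_with_buffer(4, 2))
```

In PyTorch the same pattern would pre-allocate the cache tensor once and fill slices of it during decoding, avoiding a new allocation per token.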
Energy-Based Models: From Basics to LLMs
Published on March 17, 2026 | 25 min read
Tags: llm, energy-based-model, likelihood, score-matching, diffusion
A whirlwind tour of energy-based model (EBM) paradigms, from the basics to modern LLM applications. Based on my talk at the Toronto LLM Meetup.
Understanding DeepSeek's Multi-Head Latent Attention (MLA)
Published on February 9, 2026 | 40 min read
Tags: llm, attention, transformers, deepseek, mla, kv-cache, inference
On bottlenecks in attention, KV caching, long-context decoding, attention variants, and how DeepSeek MLA came to be. Part 1 of the FlashMLA blog series.
Data Quality Is All You Need?
Published on May 23, 2025 | 35 min read
Tags: llm, pretraining, midtraining, posttraining, data-quality, synthetic-data, dpo
Notes on Microsoft's phi-4 data pipeline for pre-training, 'mid-training', supervised fine-tuning, and preference optimization.