On bottlenecks in attention, KV caching, long-context decoding, attention variants, and how DeepSeek MLA came to be. Part 1 of the FlashMLA blog series.
A recap of my day debugging issues with nvcc and nvcc4jupyter on Google Colab's free T4 GPUs, with brief notes on CUDA backward compatibility and compute capability.