Articles
-
April 03, 2026
FlashAttention Analysis
FlashAttention Kernel Execution Flow Analysis (SM100)
-
April 01, 2026
[FlashAttention] Variable-Length Sequences with SeqlenInfo
In FlashAttention, sequences often have different lengths within a batch. This article explains how SeqlenInfo handles variable-length sequences efficiently.
-
April 01, 2026
[FlashAttention] Block-Level Masking with BlockInfo
In FlashAttention, causal and local attention masks restrict which Q tokens can attend to which K tokens. The BlockInfo class computes block-level valid ranges to avoid computing invalid Q-K pairs entirely, saving memory bandwidth and computation.
-
April 01, 2026
[FlashAttention] Blackwell MMA: PTX Inline Assembly for GEMM
In FlashAttention for Blackwell (SM100), the MMA (Matrix Multiply-Accumulate) operations use PTX inline assembly instead of CuTe’s high-level cute.gemm(). This article explains the blackwell_helpers.py functions and how they’re used in flash_fwd_sm100_simple_18.py.
-
April 01, 2026
[CuTeDSL] Pipeline Abstraction for GPU Kernel Synchronization
GPUs achieve high performance through pipelining - overlapping computation with memory access. This article explains the Pipeline abstraction in CuTe DSL, covering both CuTe native pipelines and FlashAttention’s extensions.
-
March 31, 2026
[CuTeDSL] Understanding Tile Schedulers for Blackwell
In this article, we explore Tile Schedulers - a critical abstraction for GPU kernel work distribution. We cover both CuTe native schedulers (for general GEMM) and FlashAttention’s custom schedulers (for attention-specific workloads).
-
March 31, 2026
[CuTeDSL] Atom Shapes: MMA, TMA, and TMEM
When using partition functions in CuTe DSL, the returned tensor shape contains an atom shape representing hardware instruction constraints. This article covers the three atom types: MMA, TMA, and TMEM.
-
March 25, 2026
[Triton Hacked] From Vector Add to FlashAttention-Level Optimization (Part 1)
I’ve always had a question: Why do other DSLs claim to be several times faster than Triton? Where does that speed-up come from? Is it really that hard for Triton to achieve this performance?
-
March 25, 2026
[CuTeDSL B200] Tuning GEMM from Scratch to 1840 TFLOPS on B200 (Part 1)
I plan to write a series documenting how to progressively tune a GEMM kernel from a basic version to peak performance on B200 (1840 TFLOPS). The related code is open-sourced at GitHub - deciding/cutex. You can run it directly using Modal: just install Modal and you’re ready to go, no B200 required.
-
March 22, 2026
Hello World
Welcome to my new blog! This is my first post using Jekyll on GitHub Pages.