Triton Democratizes GPU Programming
OpenAI's Triton language makes writing high-performance GPU kernels accessible to researchers who have never written CUDA, automating the hardest parts of GPU optimization while leaving algorithmic decisions to the developer.
"The bottom line here is not that Triton is inherently better, but that it simplifies the development of specialized kernels that can be much faster than those found in general-purpose libraries." (OpenAI, "Introducing Triton")
Writing efficient CUDA code requires simultaneously reasoning about three separate concerns: coalescing memory transfers from DRAM to exploit wide bus interfaces, manually managing SRAM to minimize shared memory bank conflicts, and carefully scheduling computations across Streaming Multiprocessors to leverage tensor cores. Even seasoned CUDA programmers find this challenging; for ML researchers, it is effectively impossible. As a result, the performance of most deep learning workloads is limited not by the GPU hardware but by how many operations have been hand-optimized by a small priesthood of systems engineers.
Triton automates memory coalescing, shared memory management, and within-SM scheduling, while leaving cross-SM scheduling and tiling decisions to the developer. The result: a matrix multiplication kernel in Triton takes roughly 25 lines of Python and achieves peak performance, while implementing something equivalent in CUDA requires vastly more effort and may still fall short. Triton's softmax implementation keeps each row in SRAM for the entire normalization, outperforming PyTorch's internal CUDA kernel, which spills to temporary memory for the sake of generality.
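The keep-in-SRAM softmax can be sketched on the CPU. This is a hedged NumPy emulation of Triton's programming model, not Triton code: each loop iteration plays the role of one Triton program instance that loads a row once, does the whole normalization in fast memory, and stores the result once.

```python
import numpy as np

def softmax_rows(x):
    """CPU emulation of a fused row-wise softmax in the Triton style:
    each 'program' owns one row and never writes intermediates back
    to (simulated) DRAM between the max, exp, and sum steps."""
    out = np.empty_like(x, dtype=np.float64)
    for row in range(x.shape[0]):       # one program instance per row
        r = x[row].astype(np.float64)   # single load: DRAM -> SRAM
        r = r - r.max()                 # subtract max for numerical stability
        e = np.exp(r)
        out[row] = e / e.sum()          # normalize entirely in fast memory
    return out                          # single store back to DRAM

x = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
print(softmax_rows(x).sum(axis=1))  # each row sums to 1.0
```

An unfused implementation would instead materialize the max, the exponentials, and the sum as separate full-size tensors, paying a DRAM round trip for each.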
The strategic significance extends beyond individual kernels. Operator fusion, combining multiple operations into one kernel to eliminate redundant memory transfers, is the single most impactful GPU optimization. Any two adjacent PyTorch operators present an opportunity for fusion. Automated compilers like NVFuser can handle simple fusions, but custom fused kernels written in Triton can dramatically outperform them. By lowering the barrier to writing custom kernels, Triton enables a much larger community of researchers to extract performance that was previously locked behind CUDA expertise. This is a direct challenge to Nvidia's CUDA moat: if GPU programming can be abstracted away, hardware becomes more interchangeable.
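Why fusion pays off can be shown with a toy NumPy sketch (function and parameter names are illustrative, not from any library): the unfused version of an add-then-ReLU pair makes two full passes over memory, while the fused version touches each block of data exactly once.

```python
import numpy as np

def add_relu_unfused(x, y):
    # Two separate "kernels": the intermediate (x + y) is written out
    # in full and read back before the ReLU runs - two DRAM round trips.
    t = x + y
    return np.maximum(t, 0.0)

def add_relu_fused(x, y, block=4):
    # Fused version: each block is loaded once, both operations run
    # while it sits in fast memory, and the result is stored once.
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        t = x.flat[i:i+block] + y.flat[i:i+block]  # load both inputs
        out.flat[i:i+block] = np.maximum(t, 0.0)   # compute + single store
    return out

a = np.array([-2.0, -1.0, 1.0, 2.0, 3.0])
b = np.array([1.0, 1.0, -3.0, 0.0, 1.0])
print(add_relu_fused(a, b))  # matches add_relu_unfused(a, b)
```

On a GPU, where elementwise kernels are memory-bound, halving the number of passes over the data roughly halves the runtime, which is why hand-fused Triton kernels can beat a pipeline of individually optimized library calls.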
Takeaway: Triton shifts GPU optimization from a specialized craft to an accessible engineering task, which matters not just for performance but for breaking the dependency on CUDA-specific expertise.
See also: CUDA Is a Moat Not Just a Library | Operator Fusion Is the Most Important Optimization in Deep Learning | The Memory Wall Limits Everything