
Rajib Lohia developed and integrated CDNA3 MFMA support for the flash attention MMA kernel in both the llama.cpp and ggml repositories, targeting MI300X (gfx942) GPUs. Using CUDA and C++, Rajib implemented FP16 MFMA intrinsic paths and optimized dispatch logic across head sizes, replacing macros with constexpr warp sizing for improved maintainability. The work also corrected Q loading and stride handling for non-power-of-2 heads, yielding throughput gains of 7% to 39% on large input batches. All 2480 flash attention tests passed, confirming correctness alongside the performance gains for large-context model inference workloads.
February 2026: Implemented CDNA3 MFMA support for the flash attention MMA kernel in both llama.cpp and ggml, enabling optimized FP16 MFMA paths and improved dispatch on MI300X (gfx942) across head sizes 64–128. Replaced macros with constexpr warp sizing, unified dispatch thresholds, and corrected Q loading/stride handling for non-power-of-2 heads. Benchmarks show sizable throughput gains on large inputs (pp512 to pp4096: +7% to +39%), with all 2480 flash attention tests passing. Business impact: higher inference throughput and lower latency for large-context models, enabling cost-efficient production at scale. Co-authored by Johannes Gäßler.
