
During his work on the tenstorrent/tt-metal repository, Marko Dragula developed and optimized custom compute kernels in C++ to accelerate and stabilize model training. He implemented fused backward kernels for RMSNorm and SiLU, replacing higher-level composite operations to improve numerical stability and training throughput. He also refactored program factory utility functions into shared headers, reducing code duplication and improving maintainability. To harden training, he fixed RMSNorm backward-pass issues by correcting gamma broadcasting, explicitly zeroing registers, and introducing a zero-initialized buffer for intermediate results. He additionally expanded test coverage and tightened tolerances, yielding more reliable large-model training and a leaner, more maintainable codebase.
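To make the gamma-broadcasting fix concrete, here is a minimal NumPy sketch of the RMSNorm backward mathematics the fused kernel implements. This is an illustrative reference only, not the tt-metal kernel itself; the function name and shapes are assumptions for the example. Note that gamma is broadcast across the batch dimension in the forward pass, so its gradient must be reduced over that same axis.

```python
import numpy as np

def rmsnorm_backward(x, gamma, dy, eps=1e-5):
    """Reference backward pass for RMSNorm: y = gamma * x / rms(x).

    x:     (batch, n) input activations
    gamma: (n,) learnable scale, broadcast across the batch dimension
    dy:    (batch, n) upstream gradient
    Returns (dx, dgamma).
    """
    n = x.shape[-1]
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)  # (batch, 1)
    # Gradient w.r.t. x: the shared rms term couples all elements in a row.
    dot = np.sum(dy * gamma * x, axis=-1, keepdims=True)         # (batch, 1)
    dx = dy * gamma / rms - x * dot / (n * rms**3)
    # Gradient w.r.t. gamma: reduce over the batch (the broadcast axis).
    dgamma = np.sum(dy * x / rms, axis=0)
    return dx, dgamma
```

A misaligned gamma broadcast here would reduce over the wrong axis (or fail to reduce at all), which is exactly the class of bug the correctness fix addressed.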
Month 2025-09: Focused on stabilizing the RMSNorm backward pass and improving training robustness for large models in tt-metal. Delivered correctness fixes: gamma broadcasting alignment, explicit register zeroing, and a zero-initialized circular buffer for intermediate results. Expanded test coverage and tightened tolerances to prevent training instability, notably for llama3_7B. This work reduces the risk of exploding losses and improves reliability for production-scale training.
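Why the zero-initialized buffer matters can be shown with a small sketch. The function below is a hypothetical stand-in for the kernel's accumulation loop: if the intermediate buffer is not explicitly zeroed before partial results are accumulated into it, stale values from a previous iteration leak into the output, which is the failure mode the fix targets.

```python
import numpy as np

def tile_accumulate(tiles):
    """Accumulate per-tile partial sums into a result buffer.

    Illustrative only: explicit zero-initialization before accumulation
    mirrors the kernel fix. An unzeroed buffer (like np.empty, or an
    uncleaned circular buffer / register file on device) would start
    from whatever happened to be in memory.
    """
    acc = np.zeros_like(tiles[0])   # explicit zero-init: the correctness fix
    for t in tiles:
        acc += t                    # accumulate each partial result
    return acc
```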
Summary for 2025-07 (tenstorrent/tt-metal): Delivered two performance-oriented kernels and a maintainability refactor, driving faster and more stable training while reducing the future bug surface. Key features delivered: (1) Custom RMSNorm backward kernel to accelerate training and improve numerical correctness; (2) Consolidated program factory utility functions into a shared header to eliminate duplication and speed up development; (3) Custom SiLU backward kernel replacing a high-level composite with a fused, efficient kernel for better performance and numerical stability. No major bugs were fixed this month. Overall impact: higher training throughput, more reliable convergence, and a leaner codebase with clearer shared utilities. Technologies/skills demonstrated: C++ kernel development, backward kernel fusion, numerical stability, refactoring for maintainability, and performance optimization. Business value: reduced training time, lower maintenance costs, and stronger model-training reliability.
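The fused SiLU backward replaces a composite of separate ops with a single closed-form pass. The sketch below shows the mathematics in NumPy; it is a reference for the derivative, not the tt-metal kernel, and the function name is an assumption for the example. With s = sigmoid(x), d/dx SiLU(x) = s * (1 + x * (1 - s)), so the gradient needs only one intermediate instead of chaining mul/sigmoid backward nodes.

```python
import numpy as np

def silu_backward(x, dy):
    """Reference fused backward for SiLU(x) = x * sigmoid(x).

    Uses the closed-form derivative s * (1 + x * (1 - s)) with
    s = sigmoid(x). Computing it in one pass avoids materializing
    the intermediate tensors a high-level composite would create,
    which is the motivation for fusing the kernel.
    """
    s = 1.0 / (1.0 + np.exp(-x))          # sigmoid, computed once
    return dy * s * (1.0 + x * (1.0 - s)) # fused gradient expression
```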
