Exceeds

PROFILE

mdragulaTT

During his work on the tenstorrent/tt-metal repository, Marko Dragula developed and optimized custom GPU kernels in C++ and CUDA to accelerate and stabilize model training. He implemented fused backward kernels for RMSNorm and SiLU, replacing higher-level composite operations to improve numerical stability and training throughput, and refactored utility functions into shared headers, reducing code duplication and improving maintainability. To harden training, he fixed RMSNorm backward-pass issues by correcting gamma broadcasting, explicitly zeroing registers, and introducing zero-initialized buffers. He also expanded test coverage and tightened tolerances, resulting in more reliable large-model training and a leaner, more maintainable codebase.

Overall Statistics

Feature vs Bugs

75% Features

Repository Contributions

Total: 8
Bugs: 1
Commits: 8
Features: 3
Lines of code: 7,337
Activity months: 2

Work History

September 2025

2 Commits

Sep 1, 2025

2025-09: Focused on stabilizing the RMSNorm backward pass and improving training robustness for large models in tt-metal. Delivered correctness fixes: aligned gamma broadcasting, explicit register zeroing, and a zero-initialized circular buffer for intermediate results. Expanded test coverage and tightened tolerances to prevent training instability, notably for llama3_7B. This work reduces the risk of exploding losses and improves reliability for production-scale training.

July 2025

6 Commits • 3 Features

Jul 1, 2025

Summary for 2025-07 (tenstorrent/tt-metal): Delivered three performance-oriented kernels and a maintainability refactor, driving faster and more stable training while reducing future bug surface. Key features delivered: (1) a custom RMSNorm backward kernel to accelerate training and improve numerical correctness; (2) consolidated program-factory utility functions into a shared header, eliminating duplication and speeding development; (3) a custom SiLU backward kernel replacing a high-level composite with a fused, efficient kernel for better performance and numerical stability. No major bugs were fixed this month. Overall impact: higher training throughput, more reliable convergence, and a leaner codebase with clearer shared utilities. Technologies/skills demonstrated: CUDA/C++ kernel development, backward-kernel fusion, numerical stability, refactoring for maintainability, and performance optimization. Business value: reduced training time, lower maintenance costs, and stronger model-training reliability.


Quality Metrics

Correctness: 100.0%
Maintainability: 85.0%
Architecture: 95.0%
Performance: 95.0%
AI Usage: 32.4%

Skills & Technologies

Programming Languages

C++

Technical Skills

C++, C++ development, CUDA programming, GPU programming, Kernel development, Kernel optimization, Machine learning, Numerical methods, Performance optimization, Performance tuning, Testing, Testing and validation, Code maintainability

Repositories Contributed To

1 repo

Overview of all repositories you've contributed to across your timeline

tenstorrent/tt-metal

Jul 2025 – Sep 2025
2 months active

Languages Used

C++

Technical Skills

C++, C++ development, GPU programming, Kernel development, Kernel optimization