
Nikola Drakulic developed robust deep learning infrastructure across several Tenstorrent repositories, including tt-forge-fe, tt-mlir, and tt-xla. He engineered features such as gradient verification frameworks, distributed tensor parallel training, and runtime tensor debugging, using Python, C++, and PyTorch. His work addressed challenges in model training reliability, gradient correctness, and scalable distributed execution by implementing standardized output handling, backward pass optimizations, and CI enhancements. By aligning autograd and decomposition contexts, introducing unified model loading APIs, and improving test automation, Nikola delivered solutions that improved model fidelity, accelerated validation cycles, and enabled efficient, reproducible training workflows for large-scale machine learning systems.
March 2026 monthly work summary for tenstorrent/tt-xla: Delivered foundational distributed tensor parallel training infrastructure with gradient sharding alignment, enabling scalable distributed model training and consistent gradient updates across shards. This work resolves core challenges in forward/backward graph correctness and establishes a base for end-to-end tensor parallel testing and future performance optimizations.
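The gradient sharding alignment idea can be illustrated with a minimal single-process sketch: a linear layer's weight is split across shards, each shard computes its slice of the output, and the input gradients produced by each shard must be summed (an all-reduce) to match the unsharded gradient. This is an illustrative simulation only, not the tt-xla implementation; `sum(...)` stands in for `torch.distributed.all_reduce`.

```python
import torch

# Full weight and its row shards, simulating two "devices" in one process.
torch.manual_seed(0)
x = torch.randn(2, 4, requires_grad=True)
w = torch.randn(6, 4)
shards = list(w.chunk(2, dim=0))  # two shards of shape (3, 4)

# Each shard computes its output slice; concatenated, they equal the
# unsharded output x @ w.t().
outs = [x @ wk.t() for wk in shards]
full = torch.cat(outs, dim=1)
assert torch.allclose(full, x @ w.t())

# Backward: each shard sees only its slice of the output gradient, so the
# per-shard input gradients must be all-reduced for consistency.
grad_out = torch.ones_like(full)
grad_slices = grad_out.chunk(2, dim=1)
local_input_grads = [g @ wk for g, wk in zip(grad_slices, shards)]
allreduced = sum(local_input_grads)  # stands in for dist.all_reduce

full.backward(grad_out)
assert torch.allclose(x.grad, allreduced)
```

The final assertion is the alignment property the work above targets: without the all-reduce, each shard's local input gradient would diverge from the true gradient.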
February 2026 monthly summary: Delivered a CI enhancement for tt-forge-models by introducing Priority Tagging for select Llama model variants to improve nightly build coverage and CI feedback for training pipelines (LoRA fine-tuning). This targeted prioritization yields faster issue detection and more robust model variant testing.
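The prioritization mechanism can be sketched as a tag-based filter that selects which model variants run in the nightly job. The tag names and variant names below are hypothetical, not the actual tt-forge-models markers.

```python
# Hypothetical priority tags for nightly CI selection; the real
# tt-forge-models tagging mechanism and variant names differ.
PRIORITY_TAGS = {"llama": {"llama-3b-lora", "llama-8b-lora"}}

def select_nightly_variants(all_variants, family="llama"):
    """Return only the priority-tagged variants of a family, in order."""
    tagged = PRIORITY_TAGS.get(family, set())
    return [v for v in all_variants if v in tagged]
```

Gating the nightly run on such tags keeps feedback fast for the variants that matter most (here, the LoRA fine-tuning paths) without dropping broader coverage from weekly runs.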
January 2026 (2026-01) monthly summary for tenstorrent/tt-xla. This period focused on stabilizing training runs, improving test reliability, and reducing wait times in model training workflows. Delivered two core features, fixed instability in nightly pipelines, and demonstrated strong capability in PyTorch/XLA integration, runtime optimization, and build/test automation. Business value centers on more reliable training cycles, faster iteration, and better resource utilization across CI and training environments.
November 2025: Strengthened the Torch/XLA training test framework (tenstorrent/tt-xla) to boost reliability, observability, and business value. Delivered consolidated test infrastructure improvements, refined gradient handling, standardized failure messaging, and structured nightly/weekly test categorization; expanded coverage with tt_forge_models training tests; and cleaned the test surface by removing oversized models and updating configurations for failure reporting. These changes enable faster feedback, more reproducible results on XLA, and more stable CI outcomes.
October 2025: Delivered standardized forward output extraction to enable consistent forward/backward testing across models in tt-forge-models. Introduced a new unpack_forward_output utility, training_utils.py, and ForgeModel.unpacked_output to support unified output handling. Propagated changes to all ModelLoader instances via unpack_output_training, enabling general fwd/bwd testing across architectures. Result: improved testing reliability, easier regression debugging, and stronger cross-model comparability, accelerating validation and integration cycles.
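The shape of such an output-normalization utility can be sketched as follows. This is a hypothetical reconstruction of what `unpack_forward_output` might look like, assuming it flattens tensors, tuples, dicts, and HF-style `ModelOutput` objects into a uniform list; the actual implementation in training_utils.py may differ.

```python
import torch

def unpack_forward_output(output):
    """Normalize a model's forward output to a flat list of tensors.

    Hypothetical sketch of the unpack_forward_output utility described
    above; the real tt-forge-models implementation may differ.
    """
    if isinstance(output, torch.Tensor):
        return [output]
    if isinstance(output, dict):
        # keep only tensor values (e.g. logits), drop metadata
        flat = []
        for value in output.values():
            if isinstance(value, (torch.Tensor, dict, list, tuple)):
                flat.extend(unpack_forward_output(value))
        return flat
    if isinstance(output, (list, tuple)):
        flat = []
        for item in output:
            flat.extend(unpack_forward_output(item))
        return flat
    if hasattr(output, "to_tuple"):
        # HF-style ModelOutput objects expose to_tuple()
        return unpack_forward_output(output.to_tuple())
    raise TypeError(f"unsupported forward output type: {type(output)!r}")
```

With outputs normalized this way, a single forward/backward test harness can iterate tensors and attach loss/gradient checks uniformly across architectures.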
September 2025 monthly summary focused on API standardization in model loading for tt-forge-models. Implemented a consistent return type across pretrained loaders by removing the return_dict parameter and defaulting to returning dictionaries, reducing confusion and improving testability and downstream integration.
August 2025 monthly summary for tenstorrent/tt-mlir: Delivered CLI-driven control to disable TTRT callbacks to improve benchmarking reliability and experiment reproducibility. Implemented a new command-line flag --disable-ttrt-callbacks in the Run and Perf classes, with updated callback handling that respects the flag. This change reduces interference during performance runs and simplifies testing scenarios.
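The flag wiring can be sketched with standard argparse. This is a minimal illustration of the pattern, assuming a store-true flag that gates callback registration; the actual ttrt option plumbing in the Run and Perf classes differs.

```python
import argparse

# Minimal sketch of the --disable-ttrt-callbacks wiring; the real ttrt
# CLI has many more options and a different registration path.
parser = argparse.ArgumentParser(prog="ttrt")
parser.add_argument(
    "--disable-ttrt-callbacks",
    action="store_true",
    help="skip registering runtime callbacks during run/perf",
)

def maybe_register_callbacks(args, register):
    """Register callbacks unless the flag disabled them; return whether
    registration happened."""
    if args.disable_ttrt_callbacks:
        return False
    register()
    return True
```

Keeping the check in one helper means every entry point (run, perf) respects the flag consistently, which is what makes benchmark runs free of callback interference.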
July 2025 monthly summary for tenstorrent/tt-mlir: Key features delivered include a runtime tensor debugging and manipulation API and the Chisel MLIR debugging/validation tool, with integration into the compilation pipeline. No major bugs fixed this month. Overall, these efforts improved runtime observability, enabled in-flight tensor inspection and replacement, and strengthened cross-context validation from GOLDEN to DEVICE contexts.
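In-flight inspection and replacement typically follows a hook-dispatch pattern: callbacks see each named tensor as it flows through the runtime and may return a replacement. The sketch below shows that generic pattern only; it is not the actual tt-mlir/ttrt API, and all names are illustrative.

```python
# Generic inspect-and-replace hook pattern, in the spirit of the runtime
# tensor debugging API described above. Not the actual tt-mlir interface.
class TensorDebugHooks:
    def __init__(self):
        self._hooks = []

    def register(self, fn):
        """fn(name, tensor) -> tensor or None; a returned value replaces
        the in-flight tensor before execution continues."""
        self._hooks.append(fn)

    def dispatch(self, name, tensor):
        for fn in self._hooks:
            replacement = fn(name, tensor)
            if replacement is not None:
                tensor = replacement
        return tensor

hooks = TensorDebugHooks()
# Example hook: zero out the output of a suspect op for fault isolation.
hooks.register(lambda name, t: [0.0] * len(t) if name == "bad_op" else None)
```

A hook like this is what enables swapping a DEVICE-context tensor for its GOLDEN-context counterpart mid-run to localize where the two diverge.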
In May 2025, the tt-forge-fe focus was on stabilizing and improving autograd behavior for indexing, delivering a critical bug fix and reinforcing the foundation for robust gradient propagation. The work centered on aligning autograd with the decomposition context to ensure correct gradient flow for indexing, and on introducing utilities to support autograd bindings and padding operations.
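The correctness property at stake is that gradients through indexing must scatter-add back to the source positions, with repeated indices accumulating. A minimal PyTorch check of that property (the reference behavior the autograd alignment must reproduce):

```python
import torch

# Gather via indexing, then check the scatter-add gradient on the source.
x = torch.randn(5, requires_grad=True)
idx = torch.tensor([0, 2, 2])
y = x[idx]              # picks x[0], x[2], x[2]
y.sum().backward()

# Position 0 receives grad 1; position 2 receives grad 2 (two uses);
# untouched positions receive 0.
expected = torch.tensor([1.0, 0.0, 2.0, 0.0, 0.0])
assert torch.equal(x.grad, expected)
```

When a decomposition rewrites the indexing op, the decomposed backward must produce exactly this scatter-add result, which is why autograd and decomposition contexts need to agree.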
March 2025 monthly summary for tenstorrent/tt-forge-fe: Delivered gradient verification and backward-pass enhancements to strengthen end-to-end gradient accuracy for compiled models. Implemented a Gradient Verification Framework for Backward Pass with verify_backward, and added backward support for repeat_interleave by chaining reshape, reduce_sum, and squeeze. Updated tests and added input validation, gradient saving, and detailed comparison between framework and compiled model gradients. These changes improve robustness of model compilation, testing, and deployment readiness.
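The reshape → reduce_sum → squeeze chain works because repeat_interleave lays the k copies of each element out contiguously along the repeat dimension, so reshaping the output gradient exposes a fresh axis of size k that can be summed away. Below is a hypothetical sketch of that decomposition, checked against PyTorch autograd in the spirit of verify_backward; the function name and signature are illustrative, not the tt-forge-fe implementation.

```python
import torch

def repeat_interleave_backward(grad_out, input_shape, repeats, dim=0):
    """Hypothetical sketch of the backward decomposition described above:
    reshape to expose the repeat axis, reduce_sum over it, then squeeze."""
    shape = list(input_shape)
    expanded = shape[:dim + 1] + [repeats] + shape[dim + 1:]
    g = grad_out.reshape(expanded)      # reshape: (..., n, ...) -> (..., n, k, ...)
    g = g.sum(dim=dim + 1, keepdim=True)  # reduce_sum over the repeat axis
    return g.squeeze(dim + 1)             # squeeze the kept singleton axis

# verify_backward-style check: compare against PyTorch autograd.
x = torch.arange(4.0, requires_grad=True)
y = x.repeat_interleave(3)
grad_out = torch.ones_like(y)
y.backward(grad_out)
manual = repeat_interleave_backward(grad_out, x.shape, repeats=3)
assert torch.allclose(x.grad, manual)
```

Here each of the 4 inputs appears 3 times in the output, so with an all-ones output gradient both paths produce a gradient of 3 per input element.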
October 2024: Delivered major MNIST training workflow enhancements for the tt-forge-fe repo, focusing on observability, stability, and gradient efficiency. Implemented TensorBoard integration for loss and parameter tracking, introduced early stopping to preserve the best model, and hardened the training loop with improved data handling. Added backward-pass optimizations and gradient handling improvements such as input filtering and layer freezing to boost efficiency and correctness. Fixed a bug related to passing unnecessary inputs to the backward pass, improving training stability and gradient fidelity. This work enhances model reliability, reduces iteration time, and provides clearer visibility into training dynamics for faster decision making.
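The early-stopping behavior described above (stop after a patience window with no improvement, keeping the best checkpoint) can be sketched in a few lines. Class and attribute names are illustrative; the actual training-loop wiring differs.

```python
# Minimal early-stopping sketch: track the best validation loss, snapshot
# the best state, and stop after `patience` epochs without improvement.
class EarlyStopping:
    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0
        self.best_state = None

    def step(self, val_loss, state=None):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
            self.best_state = state   # snapshot of the best model so far
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
history = [1.00, 0.80, 0.85, 0.90, 0.70]
# Stops at epoch 3: two consecutive epochs without beating 0.80.
stopped_at = next(i for i, loss in enumerate(history) if stopper.step(loss))
```

Note the 0.70 epoch is never reached; that trade-off between patience and missed late improvements is why patience is usually tunable.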
