
Nikola Cvetkovic developed and maintained core compute and data-path features for the tenstorrent/tt-metal repository, focusing on kernel development, API design, and system stability. He engineered enhancements to matrix multiplication, reduction, and packing workflows, introducing configurable APIs and robust initialization flows using C++ and Python. His work addressed low-level hardware synchronization, memory alignment, and cross-architecture compatibility, improving reliability and performance for tile-based and parallel workloads. By refactoring kernel APIs, stabilizing test suites, and optimizing performance-critical paths, Nikola enabled more deterministic execution and maintainable code, demonstrating depth in embedded systems, hardware interaction, and performance optimization throughout the compute stack.

In September 2025, two core features were delivered in tenstorrent/tt-metal focused on initialization efficiency and cross-thread data integrity for reduction workloads on Blackhole hardware. The work improves reliability and performance of reduction-related operations and positions the backend for more stable high-load scenarios.
In August 2025, tenstorrent/tt-metal focused on delivering a consolidated Matrix Multiplication API and Compute Kernel Configuration Enhancements. Key work included matmul initialization refactoring, data format reconfiguration improvements, API robustness/docs fixes, and the introduction of a first-pass global hardware configuration state for compute kernels. These changes improve performance, reliability, and developer usability of the matrix multiplication subsystem, with cleaner initialization flows and better maintainability of the hardware config.
July 2025 monthly summary for tenstorrent/tt-metal: Delivered stability fixes for tilize uninitialization, propagated disable_src_zero_flag across tilize/compute kernels, and cleaned up Compute API initialization; introduced consistent behavior across tilize and unpack paths, and updated tests (n150) for better reliability. These changes reduce runtime errors, improve stability, and simplify future feature work.
June 2025 — Tenstorrent TT-Metal: Delivered API enhancements, architecture-safe refactors, and test stabilization across the TT-Metal compute stack. Key deliveries include a new LLK API narrow_tile parameter enabling finer tile processing control; broad cleanup and consolidation of the Compute API initialization (reduce, tilize, and transpose) with API documentation and rename to transpose_*; plus several architecture-specific fixes to ensure tests pass consistently across architectures. Major bug fixes included architecture-dependent narrow_row handling corrections (transpose/tilize) and test stability work (adjusting checks to TILE_C_DIM). Overall impact: reduced API surface area and maintenance burden, more reliable cross-arch behavior, and a more robust test suite, enabling faster, safer feature development. Technologies/skills demonstrated: C/C++ API refactoring, kernel alignment, cross-architecture testing, test stabilization, and documentation.
May 2025 (tenstorrent/tt-metal): Delivered a major upgrade to the compact data packing and reduce path, focusing on performance, configurability, and reliability for tile-based workloads. Implemented a new LLK API for compact packing, added programmable tile counts and configurable block dimensions, and improved edge masking packing. Introduced the ability to pack multiple tiles into a single tile and prepared initial MOP integration for compact reduce packing. Also enhanced test coverage with dedicated tests for the compact packer and clarified API semantics. These changes set the stage for higher throughput, reduced memory footprint, and more flexible workloads in production.
April 2025 monthly summary for tenstorrent/tt-metal, focused on system stability and deterministic execution. Delivered a targeted configuration change that disables speculative and performance-related features to make execution deterministic, improving reliability and testability. Key commits disabled instruction coalescing, branch prediction, out-of-order execution, and instruction prefetching on Blackhole, and adjusted the L1 data-cache configuration, stabilizing runtime behavior and improving reproducibility across test and production workloads.
March 2025 TT-Metal: Delivered stability and benchmarking upgrades for tenstorrent/tt-metal. Fixed hangs in ResNet-50 and Stable Diffusion by disabling instruction coalescing, enabling reliable model bring-up and immediate performance gains. Enhanced Matrix Multiplication Benchmarking and Testing: refactored performance tests, added benchmarks across multiple data types and configurations, and prepared simulation-compatible test parameters. Strengthened the testing framework with robust test parameters, precision-focused test_ids, and build fixes to ensure reproducible metrics. This work improves evaluation throughput, reduces debugging time, and provides clearer traceability through explicit commits.
February 2025, tt-metal (tenstorrent): Key features delivered and bugs fixed with direct business value for core data-path reliability and performance.
- XY Plane Packing API for Max Pooling: Introduced a new API to correctly configure packing reads per XY plane, enabling precise XY position generation in max pooling operations. Commit: ed769d5d2c7cf3d179dd802c093ce2a88d284d57.
- Matrix Multiplication Stability Fix for Blackhole Architecture: Implemented a workaround to prevent multiple matrix multiplication hangs by disabling branch prediction and adjusting compiler flags for Blackhole, improving runtime stability. Commit: 399ef5299e7ecbd8b5b9671f72474656fd6a2bdc.
Overall impact: These changes improve correctness and reliability of core math and data-paths on target hardware, reducing runtime failures and debugging time, and enabling more predictable deployments.
December 2024 monthly summary focusing on FP32 precision, memory/DEST handling, and kernel alignment improvements across tt-llk-bh and tt-metal to drive improved inference reliability and throughput. Key outcomes include precision improvements in FP32 element-wise operations, expanded FP32 support for reduce scalar paths, streamlined DEST clearing and ZEROACC semantics, and memory-alignment optimizations for LayerNorm and Softmax kernels. These changes reduce functional risk, improve test reliability, and enhance data access efficiency for real-time workloads.
November 2024: Focused on stabilizing test automation, hardening inter-thread synchronization, and improving data integrity across tt-metal and the LLK/Blackhole subsystems. Delivered concrete test improvements, subproject fixes, and synchronization safeguards that reduce flakiness, prevent race conditions during CFG changes, and support Grayskull compatibility.
October 2024 monthly summary for tenstorrent/tt-metal: No new features were delivered this month; the primary activity was stabilizing the test suite and preserving CI reliability while root-cause investigation proceeds.