
Over a twelve-month period, Petar Milenkovic engineered core features and optimizations for the tenstorrent/tt-llk and tenstorrent/tt-metal repositories, focusing on low-level kernel development, hardware acceleration, and data path reliability. He implemented dynamic data packing, tiling, and unpacking algorithms in C++ and Python, enabling robust support for diverse data types and tile geometries across Wormhole and Blackhole architectures. His work included performance-driven refactors, test-driven development, and detailed documentation, improving throughput and reducing API overhead. By addressing edge-case bugs and enhancing test infrastructure, Petar delivered scalable, maintainable solutions that strengthened hardware-software integration and supported evolving machine learning workloads.
April 2026 performance summary for tenstorrent/tt-metal focused on delivering a high-impact packing optimization feature that reduces overhead and expands tile geometry support. Implemented an LLK-based multi-tile packing path that packs N tiles from sparse DEST slots to contiguous L1 memory in a single MOP call, significantly lowering per-tile reconfiguration. Extended runtime configurability to handle tiny tile geometries (1x32, 8x32, 16x32, 16x16) in addition to 32x32, with a replay-based PACR sequence and runtime num_tiles patching via Mop cfg. Added comprehensive tests for tiny tile packing and data format reconfig to bolster reliability. This work lays groundwork for higher packing throughput and more flexible tile layouts in production workloads.
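As a rough illustration of the multi-tile packing idea described above, the following Python sketch models gathering N tiles from non-adjacent DEST slots into one contiguous output buffer in a single call. It is a conceptual model only, not the LLK API: the function name `pack_tiles_contiguous`, the flat-list representation of DEST and L1, and the toy tile size are all assumptions made for this example.

```python
def pack_tiles_contiguous(dest, slot_indices, tile_elems):
    """Gather tiles from arbitrary DEST slots into one contiguous buffer.

    dest         -- flat list modelling the DEST register (hypothetical layout)
    slot_indices -- DEST slot numbers to pack, in output order
    tile_elems   -- number of elements per tile
    """
    l1 = []
    for slot in slot_indices:
        start = slot * tile_elems
        # Each tile lands directly after the previous one in the output.
        l1.extend(dest[start:start + tile_elems])
    return l1


# Toy setup: 8 DEST slots with 4 elements per tile (real tiles are far larger).
TILE_ELEMS = 4
dest = [slot * 10 + i for slot in range(8) for i in range(TILE_ELEMS)]

# Pack the tiles held in slots 1, 4 and 6 with one call.
packed = pack_tiles_contiguous(dest, [1, 4, 6], TILE_ELEMS)
```

In this model the per-call setup happens once for all N tiles rather than once per tile, which is the kind of overhead reduction the summary refers to.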
March 2026 summary focused on enabling Deepseek LLK support and enhancing performance/throughput across the core toolchain. Key LLK integrations were delivered across two repos, establishing granular control of float32 destination accumulation and improving packing/instruction paths for LLK workloads. The work includes cross-repo LLK enablement in tt-llk and tt-metal, performance-driven refactors, and robust test coverage for varying tile_dst_offsets. These changes deliver faster, more deterministic Deepseek execution with better scalability for ML workloads, demonstrating strong cross-team collaboration, LLK adoption, and performance optimization skills.
February 2026 monthly summary for tenstorrent/tt-llk: Key work centered on advancing the tilize algorithm to preserve FP32 accuracy and expand tile-size support, with a focus on business value, reliability, and enabling Deepseek experiments. The work spanned feature enhancements, API/test infra alignment, and cross-repo coordination to deliver robust tilize performance across Wormhole (WH) and Blackhole (BH) paths.
2026-01 Monthly Summary: Performance-driven feature delivery and code health improvements across LLK compute paths, with a focus on reducing API overhead and clarifying test reporting.
December 2025: Implemented cross-architecture row-major data packing from the Destination register to L1 memory, enabling higher memory bandwidth and data throughput for the LLK path. Delivered initial llk_pack_rows.h headers with dedicated tests for Wormhole (WH) and Blackhole (BH), and updated test infra to support both architectures. Achieved strong test coverage with CI-ready results (WH: 480 tests passing; BH: 320 tests passing). This positions the feature for multi-packer integration and subsequent Metal-layer support.
November 2025 monthly summary for tenstorrent/tt-llk focusing on key accomplishments and business impact. Delivered dynamic runtime-variable support for unpack logic to handle BH face dimensions by replacing hard-coded TTI instructions with TT instructions in cunpack_common.h, enabling runtime adaptability and reducing maintenance burden for edge cases.
September 2025 (tt-metal): Focused on strengthening max pooling kernel reliability through testing and debugging enhancements. Delivered a new max pooling test and a debug-environment setup to improve diagnosis, reproducibility, and iteration speed. This work establishes the groundwork for upcoming performance optimizations and regression safety in the kernel.
June 2025 monthly summary for tenstorrent/tt-llk. Delivered foundational documentation and robust input handling improvements that enhance developer onboarding, product reliability, and data processing throughput.
In May 2025, the focus was on stability and correctness of tensor tiling processing for the tt-llk repository. The primary deliverable was a targeted bug fix to pack_untilize that enables handling of input tensors of any size, along with the introduction of a new addressing mode to correctly process rows without unnecessary clearing of the y-counter. The work improves reliability for variable input shapes and lays groundwork for future performance and feature improvements.
April 2025 performance summary for tenstorrent/tt-llk focusing on feature delivery and code quality improvements. Delivered 32-bit integer support in the Low-Level Kernel (LLK) for Wormhole (WH) and Blackhole (BH) architectures, enabling Int32 and UInt32 inputs with direct unpacking into the destination register, bypassing Source A/Source B limitations and reducing data loss risk.
March 2025: Delivered BH board narrow row data support in LLK by modifying packing/unpacking to accept a narrow_row parameter, enabling a single packer interface for data arriving in narrow row format (Faces 0 and 2; skip Faces 1 and 3). No major bugs reported. This work improves data path flexibility and reduces special-case handling, paving the way for broader data-format support.
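The narrow-row face selection mentioned above can be pictured with a small Python model. It assumes a 32x32 tile split into four 16x16 faces numbered in row-major order (0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right); that layout, and the helper names `split_faces` and `select_faces`, are assumptions for illustration and are not taken from the LLK sources. Only the `narrow_row` parameter name comes from the summary.

```python
FACE = 16  # assumed face edge length; a tile is modelled as 2x2 faces


def split_faces(tile):
    """Split a 32x32 tile (list of rows) into four 16x16 faces, row-major."""
    faces = []
    for face_row in (0, 1):
        for face_col in (0, 1):
            rows = tile[face_row * FACE:(face_row + 1) * FACE]
            faces.append([r[face_col * FACE:(face_col + 1) * FACE] for r in rows])
    return faces


def select_faces(tile, narrow_row=False):
    """Return the faces to pack; narrow rows keep Faces 0 and 2 only."""
    faces = split_faces(tile)
    if narrow_row:
        # Data occupies only the left half of the tile, so skip Faces 1 and 3.
        return [faces[0], faces[2]]
    return faces


# Toy tile whose element value encodes its (row, column) position.
tile = [[r * 32 + c for c in range(32)] for r in range(32)]
narrow = select_faces(tile, narrow_row=True)
full = select_faces(tile)
```

With a single flag controlling which faces are visited, the same packer entry point can serve both full tiles and narrow-row data, which matches the single-interface goal described in the summary.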
February 2025: Delivered essential int32 subtraction support in the SFPU kernel across two repositories (tt-llk-wh-b0 and tt-llk-bh). Implementations include a new int32 subtraction header and core logic with cross-format data handling and hardware considerations, enabling broader arithmetic workloads and more consistent results across data formats.
