
Milan Vlahovic developed and optimized hardware performance counter systems across the tenstorrent/tt-llk and tenstorrent/tt-metal repositories, focusing on end-to-end observability and efficient data collection. He implemented a C++-based performance counter infrastructure with per-thread and later shared L1 buffer architectures, integrating Python tooling for configuration, data analysis, and automated metric reporting. His work enabled detailed tracking of kernel-level events, such as REQUESTS versus GRANTS, and introduced synchronization improvements for multithreaded data integrity. By reducing memory usage and simplifying metrics, Milan established a reproducible, maintainable foundation for performance analysis and capacity planning, demonstrating depth in C++, Python, and system architecture.
March 2026: Delivered a high-impact optimization to the performance counter subsystem of tt-metal, with an emphasis on memory efficiency, data integrity, and maintainability.
February 2026: Delivered the Tensix Performance Counter System for tt-llk, establishing end-to-end hardware performance observability across the UNPACK, MATH, and PACK threads and enabling data-driven optimization. Core features include C++ PerfCounters plumbing with per-thread L1 buffers, Python tooling for configuration and readout, and derived metrics with automated summaries. Added matmul kernel instrumentation and kernel-level integration to provide side-by-side REQUESTS vs GRANTS analysis, improving visibility into arbitration, stalls, and bottlenecks. This work lays the foundation for reproducible performance analysis, targeted tuning, and better capacity planning.
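As a rough illustration of the kind of derived metric a REQUESTS vs GRANTS comparison can yield, the sketch below computes a per-thread grant ratio. It is not taken from tt-llk: the `CounterSample` type, the `grant_ratio` and `summarize` helpers, and the counter values are all hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hypothetical REQUESTS/GRANTS counter pair read back for one thread."""
    thread: str
    requests: int
    grants: int


def grant_ratio(sample: CounterSample) -> float:
    """Fraction of requests that were granted; 0.0 when nothing was requested."""
    if sample.requests == 0:
        return 0.0
    return sample.grants / sample.requests


def summarize(samples):
    """Map each thread name to its grant ratio, rounded to three decimals."""
    return {s.thread: round(grant_ratio(s), 3) for s in samples}


# Made-up values for illustration only; these are not measured results.
samples = [
    CounterSample("UNPACK", requests=1000, grants=750),
    CounterSample("MATH", requests=400, grants=400),
    CounterSample("PACK", requests=0, grants=0),
]
summary = summarize(samples)
print(summary)
```

Under these assumptions, a ratio well below 1.0 for a thread would point to requests that were not granted, which is the kind of arbitration or stall signal the side-by-side analysis is meant to surface.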
