
Nikola Ostojic contributed to the tenstorrent/tt-metal and tt-inference-server repositories, focusing on model optimization, CI/CD reliability, and large-model inference stability. He enhanced continuous integration workflows by expanding test coverage and improving environment stability using Python and YAML, which reduced flakiness and accelerated feedback for Gemma3 releases. In tt-metal, he introduced tracing instrumentation for generator model prefill paths, enabling detailed execution analytics for performance profiling. For tt-inference-server, Nikola optimized trace region sizing and memory management, addressing out-of-memory issues and supporting larger models like llama-70b. His work demonstrated depth in DevOps, machine learning, and performance tuning across evolving model architectures.
January 2026 performance summary for the tt-inference-server: focused on stabilizing large-model inference and expanding model support. Implemented memory-management and trace-region optimizations to mitigate OOM during large LLM inference, improving reliability for models such as qwen2.5-vl and llama-3.2-3b, and added support for llama-70b. This work reduces outage risk, enables deployment of larger models, and enhances overall throughput under heavy load. The changes were delivered via targeted fixes and tuning associated with the commit f47ab4f8c2e601125e6bb19170273d2f6ff009f4, including trace region adjustments, fixes for qwen2.5-vl and llama-3.2-3b, and new llama-70b support.
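The OOM mitigation described above comes down to budgeting device memory (weights, KV cache, and trace region) before launching inference. A minimal illustrative sketch of that budgeting logic, with all function names and figures hypothetical rather than taken from the tt-inference-server codebase:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   max_seq_len, batch_size, dtype_bytes=2):
    # K and V tensors per layer: batch x kv_heads x seq_len x head_dim,
    # times 2 for the K and V halves of the cache.
    return 2 * num_layers * num_kv_heads * head_dim * max_seq_len * batch_size * dtype_bytes

def fits_on_device(model_cfg, device_mem_bytes, trace_region_bytes, weight_bytes):
    # A model fits only if weights + KV cache + trace region stay within device memory.
    cache = kv_cache_bytes(**model_cfg)
    return weight_bytes + cache + trace_region_bytes <= device_mem_bytes
```

For example, a llama-70b-class configuration (80 layers, 8 KV heads, head dim 128) with roughly 140 GB of bf16 weights clearly cannot fit on a single small device, which is the kind of check that prevents an OOM at runtime rather than discovering it mid-inference.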
Monthly summary for 2025-10 focused on business value and technical execution for tenstorrent/tt-inference-server. Key feature delivered: Llama-3.1-8B Trace Region Sizing Optimization across multiple device specs, enabling better performance and resource allocation. No major bugs fixed this month in this repo based on current records. Overall impact: improved inference throughput and more efficient hardware utilization for the Llama-3.1-8B model. Technologies/skills demonstrated: performance optimization, cross-device compatibility, and change-tracking via explicit commit references.
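Sizing the trace region per device spec can be pictured as a lookup from device name to a tuned byte count, with a conservative fallback for unrecognized hardware. The sketch below is purely illustrative; the device names follow Tenstorrent naming, but the byte values and the helper itself are hypothetical and do not reflect actual tt-inference-server configuration:

```python
# Hypothetical per-device trace-region sizes (bytes); real tuned values
# live in tt-inference-server configuration, not here.
TRACE_REGION_BYTES = {
    "n150": 25_000_000,
    "n300": 50_000_000,
    "t3k": 100_000_000,
}

def trace_region_for(device: str, default: int = 25_000_000) -> int:
    """Pick a trace region size for a device spec, falling back to a default."""
    return TRACE_REGION_BYTES.get(device, default)
```

Keeping the sizes in one table per device spec is what makes a change like the Llama-3.1-8B optimization reviewable as a small, explicit diff.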
In September 2025, delivered initial tracing instrumentation for the generator model’s prefill path in tenstorrent/tt-metal. This change enables detailed execution analytics by capturing inputs, outputs, and timing, supporting faster debugging and data-driven performance profiling. The work establishes a foundation for optimization and troubleshooting in the prefill flow; testing and validation are planned for the next cycle to confirm trace accuracy and impact.
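Tracing instrumentation of this kind typically wraps the instrumented path so that each call records its inputs, outputs, and wall-clock duration. A minimal self-contained sketch of the idea, assuming a decorator-based approach; the `traced` helper, `TRACE_LOG` store, and toy `prefill` function are all hypothetical, not tt-metal APIs:

```python
import time
from functools import wraps

# In-memory trace store; a real system would emit to a profiler or log sink.
TRACE_LOG = []

def traced(fn):
    """Record the op name, argument count, and wall-clock time of each call."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE_LOG.append({
            "op": fn.__name__,
            "num_args": len(args) + len(kwargs),
            "duration_s": time.perf_counter() - start,
        })
        return result
    return wrapper

@traced
def prefill(tokens):
    # Stand-in for the generator model's prefill computation.
    return [t + 1 for t in tokens]
```

Captured records can then be aggregated per op to spot slow steps in the prefill flow, which is the data-driven profiling the instrumentation is meant to enable.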
August 2025: Delivered CI-focused improvements and expanded model testing in the tt-metal repo, driving reliability, coverage, and faster feedback for Gemma3 releases.
