
Uros Males developed core machine learning compiler and backend infrastructure across the tenstorrent/tt-mlir and tenstorrent/tt-xla repositories, focusing on advanced tensor operations, robust sharding, and scalable model deployment. He implemented features such as MaxPool2dWithIndices and a dot_general decomposition in C++ and MLIR, improving operator coverage and compilation reliability. He also addressed backend integration challenges by refining StableHLO-to-TTIR conversions and correcting random number generation for JAX workflows. His work included Python-based testing infrastructure for large-scale model evaluation and a pass for tensor replication, demonstrating depth in compiler design, parallel computing, and distributed machine learning system optimization.
March 2026 (tt-mlir): Delivered a critical bug fix to sharding attribute resolution in collective operations, improving local-shape correctness and stability for sharded MoE models. Hardened out_sharding handling in getOperandShardingAttr with a fallback that computes accurate local shapes, preventing slice-index errors during UpdateGlobalToLocalShapes.
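The fallback idea can be sketched in plain Python (a minimal illustration, not tt-mlir code; `local_shape` and `shard_counts` are hypothetical names): when no out_sharding-style annotation is available, a per-device local shape can still be derived by dividing each global dimension by its shard count, defaulting to replication.

```python
def local_shape(global_shape, shard_counts=None):
    """Compute a per-device local shape from a global tensor shape.

    shard_counts maps dimension index -> number of shards along that
    dimension; unlisted dimensions are replicated (shard count 1).
    Falls back to the full global shape when no sharding info exists,
    which is the safe default that avoids out-of-range slice indices.
    """
    if shard_counts is None:
        # Fallback: no sharding annotation, treat the tensor as replicated.
        return tuple(global_shape)
    result = []
    for dim, size in enumerate(global_shape):
        n = shard_counts.get(dim, 1)
        if size % n != 0:
            raise ValueError(f"dim {dim} of size {size} not divisible by {n} shards")
        result.append(size // n)
    return tuple(result)
```

For example, an 8x16 tensor sharded 2-ways on dim 0 and 4-ways on dim 1 yields a 4x4 local shape, while the same tensor with no sharding info keeps its global shape.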
February 2026 monthly summary: Delivered scalable Llama deployment enhancements and expanded prefill/testing workflows in tt-forge-models, strengthened testing infrastructure for longer sequences and larger batch sizes in tt-xla, and introduced a tensor replication pass to improve sharding reliability in tt-mlir. These efforts increase deployment performance, model evaluation coverage, and the reliability of distributed compute paths, enabling more scalable benchmarks and faster business insights.
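At a conceptual level, the layout such a replication pass produces can be sketched in a few lines of Python (purely illustrative, not the tt-mlir implementation; `replicate` is a hypothetical helper): replicating a tensor places one identical copy on each device in the mesh, the baseline layout for operands that must not be sharded.

```python
import numpy as np

def replicate(tensor, num_devices):
    """Materialize one independent, identical copy of `tensor` per device.

    This models the replicated layout a sharding pass can insert when a
    downstream op requires the full (unsharded) tensor on every device.
    """
    return [np.array(tensor, copy=True) for _ in range(num_devices)]
```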
November 2025 monthly summary for tenstorrent/tt-mlir: Delivered foundational tensor operations and MLIR/StableHLO integration work enabling more capable neural network pipelines. Implemented MaxPool2dWithIndices (returns values and indices) to support unpooling and gradient computation, and extended MLIR/StableHLO by decomposing stablehlo.select_and_scatter into ttir.max_pool2d_with_indices and ttir.scatter_in_dim for greater flexibility. Updated verifiers, introduced a separate FlatBuffers schema entry for the new op, and expanded test coverage to validate end-to-end behavior across TTIR/TTNN and StableHLO. These changes establish groundwork for advanced pooling-based layers and improve cross-dialect interoperability, delivering tangible business and technical value.
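The values-plus-indices contract of an op like MaxPool2dWithIndices can be illustrated with a minimal NumPy sketch (a single-channel 2D case; the function name and layout are illustrative, not the ttir op's actual signature): each output position carries both the window maximum and the flat index of that maximum in the input, which is exactly what unpooling and gradient routing need.

```python
import numpy as np

def max_pool2d_with_indices(x, kernel=2, stride=2):
    """Max-pool a 2D array, returning (values, flat input indices of each max)."""
    h, w = x.shape
    oh, ow = (h - kernel) // stride + 1, (w - kernel) // stride + 1
    values = np.empty((oh, ow), dtype=x.dtype)
    indices = np.empty((oh, ow), dtype=np.int64)
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + kernel,
                       j * stride:j * stride + kernel]
            local = int(np.argmax(window))          # flat index within the window
            li, lj = divmod(local, kernel)          # window-local row/col of the max
            values[i, j] = window[li, lj]
            # Translate back to a flat index into the original input.
            indices[i, j] = (i * stride + li) * w + (j * stride + lj)
    return values, indices
```

The returned indices let an unpooling step scatter gradients (or values) back to the exact input positions that won the max, which is why the decomposition pairs the pooling op with a scatter op.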
September 2025: Delivered core backend improvements for Tenstorrent MLIR/TTIR integration and JAX compatibility. Implemented the StableHLO-to-TTIR conversion lowering tenstorrent.uniform to ttir.rand, with operand/attribute extraction and a test refactor; fixed MLIR lowering shape handling for jax.random.uniform by enforcing an int32 shape before lowering, improving stability and correctness on the Tenstorrent backend.
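The shape-dtype issue can be illustrated in Python (a hedged sketch; `normalize_shape` is a hypothetical helper, not the actual tt-xla code): coercing the shape to a validated int32 array up front prevents the lowering from receiving a 64-bit or otherwise mismatched shape representation.

```python
import numpy as np

def normalize_shape(shape):
    """Coerce a shape sequence into a validated int32 NumPy array.

    Mirrors the idea of the fix: whatever integer width the caller
    supplied (e.g. default int64 from Python ints), the lowering sees
    a consistent int32 shape, after basic sanity checks.
    """
    arr = np.asarray(shape)
    if not np.issubdtype(arr.dtype, np.integer):
        raise TypeError("shape must contain only integers")
    if np.any(arr < 0):
        raise ValueError("shape dimensions must be non-negative")
    return arr.astype(np.int32)
```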
January 2025 monthly summary: Delivered core platform capabilities and strengthened validation for numeric computations across TTIR and TT-XLA, driving broader applicability, reliability, and business value in ML workloads.
