Exceeds
xutizhou

PROFILE

Xutizhou

Xuting Zhang engineered high-performance GPU features and optimizations across kvcache-ai/sglang and flashinfer-ai/flashinfer, focusing on deep learning and distributed systems. He refactored Triton and CUDA kernels to optimize Mixture-of-Experts routing, integrated FP8-optimized DeepGEMM into EPMoE, and delivered kernel fusion for Mamba state scatter operations. His work included memory-safety fixes for expert-parallel MoE forward passes and introduced zero-copy state access for GDN decode kernels, reducing latency and improving throughput for linear-attention models. Working in C++ and Python, Xuting demonstrated strong command of low-level optimization, performance tuning, and scalable GPU programming for production AI workloads.

Overall Statistics

Feature vs Bugs

86% Features

Repository Contributions

Total: 7
Bugs: 1
Commits: 7
Features: 6
Lines of code: 2,763
Activity months: 5

Work History

March 2026

2 Commits • 2 Features

Mar 1, 2026

March 2026: Two high-impact feature deliveries across SGLang and FlashInfer that improved inference performance and memory efficiency for modern GPU workloads. Implemented K-last SSM layout support for GDN prefill/decode, and introduced pool-indexed (zero-copy) state access for the GDN decode kernel, enabling efficient integration with SGLang's state pool. These changes reduce latency, boost throughput for linear-attention models, and strengthen production readiness for SGLang+FlashInfer deployments on Hopper-era GPUs.
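The pool-indexed idea can be sketched in plain Python: rather than gathering each request's state into a contiguous batch buffer before decoding, the kernel reads the state pool directly through a per-request index table. The pool layout, index table, and decode step below are illustrative stand-ins, not the actual FlashInfer kernel API:

```python
# Hypothetical sketch of pool-indexed (zero-copy) state access for decode.
# `state_pool` holds one state vector per pool slot; `pool_indices` maps each
# request in the batch to its slot. All names here are assumptions.

def decode_with_copy(state_pool, pool_indices, inputs):
    # Baseline: gather each request's state into a contiguous batch first.
    # On a GPU this is an extra copy (and often an extra kernel launch).
    gathered = [list(state_pool[i]) for i in pool_indices]  # copy step
    return [sum(s) + x for s, x in zip(gathered, inputs)]

def decode_zero_copy(state_pool, pool_indices, inputs):
    # Zero-copy: the decode step indexes the pool directly, skipping the
    # gather round trip entirely.
    return [sum(state_pool[i]) + x for i, x in zip(pool_indices, inputs)]
```

Both paths produce identical results; the zero-copy variant simply removes the intermediate buffer, which is where the latency and memory savings come from.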

February 2026

2 Commits • 2 Features

Feb 1, 2026

February 2026 performance snapshot focused on low-level performance optimizations and kernel fusion to boost inference throughput and scalability in FlashInfer and SGLang. The work emphasizes reducing CPU-GPU overhead and consolidating kernel launches for critical paths.
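Launch consolidation can be illustrated with a toy scale-and-scatter pass: two separate loops stand in for two kernel launches, while the fused version does both in one pass. The function names and the scatter pattern are hypothetical, not the actual fused Mamba kernels:

```python
# Illustrative sketch of kernel fusion, assuming a scale step followed by a
# scatter into an output buffer. Each Python loop stands in for one GPU
# kernel launch; fusing them halves the launch count and skips the
# intermediate buffer.

def scale_then_scatter(values, scale, dest_idx, out_len):
    # Unfused: two passes over the data (two launches, one temporary).
    scaled = [v * scale for v in values]           # "launch" 1
    out = [0.0] * out_len
    for v, i in zip(scaled, dest_idx):             # "launch" 2
        out[i] = v
    return out

def fused_scale_scatter(values, scale, dest_idx, out_len):
    # Fused: each element is scaled and written to its destination in a
    # single pass, with no intermediate buffer.
    out = [0.0] * out_len
    for v, i in zip(values, dest_idx):
        out[i] = v * scale
    return out
```

The payoff on a GPU is fewer kernel launches (less CPU-side dispatch overhead) and one fewer round trip through global memory for the intermediate result.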

June 2025

1 Commit • 1 Feature

Jun 1, 2025

June 2025 monthly summary for kvcache-ai/sglang: Delivered FP8-optimized DeepGEMM integration into the EPMoE path, including new Triton kernels for data reordering and computation and a forward-pass refactor to streamline FP8 data paths. This work establishes a robust FP8 data-path foundation and sets the stage for targeted performance tuning; no major bugs fixed this period.
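An FP8 data path of this kind rests on per-tensor scaling into the E4M3 dynamic range. The sketch below shows only the range-mapping-with-rounding step; it deliberately ignores FP8's actual mantissa/exponent encoding and the DeepGEMM API, and all names are illustrative:

```python
# Minimal sketch of per-tensor FP8 (E4M3-style) scaling. A real FP8 path also
# quantizes the mantissa; here we model only the dynamic-range mapping.

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_fp8(xs):
    # Per-tensor scaling: map the tensor's absmax onto the FP8 range,
    # then round and clamp.
    amax = max(abs(x) for x in xs) or 1.0
    scale = E4M3_MAX / amax
    q = [max(-E4M3_MAX, min(E4M3_MAX, round(x * scale))) for x in xs]
    return q, scale

def dequantize_fp8(q, scale):
    # The matching GEMM consumes q together with scale (often folded into
    # the epilogue); dequantization is a single multiply.
    return [v / scale for v in q]
```

Keeping the scale alongside the quantized tensor is what lets the GEMM epilogue restore magnitudes cheaply, which is the core of an FP8 data path.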

May 2025

1 Commit

May 1, 2025

May 2025 monthly summary for kvcache-ai/sglang: Major bug fix to MoE forward pass memory safety and correctness, addressing illegal memory access and preventing potential out-of-bounds errors. The fix enhances stability for expert-parallel MoE forwards under large-scale workloads and improves reliability of production deployments.
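The class of out-of-bounds error described here is typically guarded with masked accesses, analogous to the `mask=` argument on Triton's `tl.load`/`tl.store`. A pure-Python analogue (function name and fill convention are assumptions):

```python
# Hypothetical sketch of a masked gather. In a Triton kernel the equivalent
# is tl.load(ptr + offs, mask=offs < n, other=fill): lanes whose index falls
# outside the buffer read the fill value instead of touching invalid memory.

def safe_gather(buf, indices, fill=0.0):
    n = len(buf)
    return [buf[i] if 0 <= i < n else fill for i in indices]
```

In an expert-parallel MoE forward, per-expert token counts vary at runtime, so the last block of a gather or scatter routinely runs past the valid range; masking those tail lanes is what turns an illegal memory access into a well-defined no-op.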

March 2025

1 Commit • 1 Feature

Mar 1, 2025

March 2025 monthly summary focused on performance optimization for DeepEP Mixture-of-Experts in kvcache-ai/sglang. Delivered a permute kernel optimization by refactoring Triton kernels and adjusting data flow for expert processing, optimizing permutation and un-permutation steps. This work enhances throughput and reduces latency in Mixture-of-Experts routing and data distribution.
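The permutation step groups tokens by destination expert so each expert sees a contiguous slab, and the un-permutation inverts that ordering afterward. A minimal Python sketch under assumed names (the real kernels operate on GPU tensors in Triton):

```python
# Illustrative MoE permute/un-permute. `tokens` is the batch in arrival
# order; `expert_ids` gives each token's routed expert.

def permute_by_expert(tokens, expert_ids):
    # Stable sort by expert id: tokens bound for the same expert become
    # contiguous, which is what lets each expert run one dense GEMM.
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    permuted = [tokens[i] for i in order]
    return permuted, order

def unpermute(permuted, order):
    # Invert the permutation to restore the original token order.
    out = [None] * len(permuted)
    for dst, src in enumerate(order):
        out[src] = permuted[dst]
    return out
```

On the GPU this index bookkeeping is exactly what the permute kernels compute; optimizing how the `order` table is produced and applied is where the routing-path throughput gains come from.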


Quality Metrics

Correctness: 97.2%
Maintainability: 80.0%
Architecture: 91.4%
Performance: 94.2%
AI Usage: 37.2%

Skills & Technologies

Programming Languages

C++, Python

Technical Skills

CUDA, Deep Learning, Distributed Systems, FP8 Quantization, GPU Computing, GPU Programming, Low-level Optimization, Machine Learning, Mixture of Experts (MoE), Performance Optimization, PyTorch, Triton

Repositories Contributed To

3 repos

Overview of all repositories contributed to across the timeline

kvcache-ai/sglang

Mar 2025 – Feb 2026
4 Months active

Languages Used

C++, Python

Technical Skills

Deep Learning, Distributed Systems, GPU Programming, Optimization, PyTorch, Triton

flashinfer-ai/flashinfer

Feb 2026 – Mar 2026
2 Months active

Languages Used

C++, Python

Technical Skills

CUDA, Deep Learning, GPU Programming, Machine Learning, Performance Optimization

ping1jing2/sglang

Mar 2026 – Mar 2026
1 Month active

Languages Used

Python

Technical Skills

CUDA, Deep Learning, GPU Programming, Machine Learning