
Yao developed the Fully Pipelined Distributed Transformer (FPDT) feature for the deepspeedai/DeepSpeed repository, enabling sequence parallelism for large language models through CPU-offloaded attention and feedforward computation. By partitioning attention across sequence-parallel ranks, the work improves both memory efficiency and training throughput. Yao also updated activation checkpointing to further reduce memory usage during training and inference, and implemented a new continuous-integration workflow to validate flash attention, improving reliability and feedback speed. The project used Python, CUDA, and PyTorch, demonstrating depth in distributed systems and deep learning engineering within a complex codebase.

Delivered the Fully Pipelined Distributed Transformer (FPDT) feature for deepspeedai/DeepSpeed. FPDT introduces CPU-offloaded attention/FFN that enables sequence parallelism for large language models, improving memory efficiency and performance by partitioning attention computations across sequence-parallel ranks. The work also includes updates to activation checkpointing and a new CI workflow for flash attention. Commit: 60a1b57b98c61c322cc76f1936eaec4f18a77b06.
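The core idea behind processing attention over sequence chunks (so that only one chunk's intermediates must be resident at a time, with the rest offloadable or held on other ranks) can be illustrated with a minimal sketch. This is plain NumPy, not DeepSpeed's actual FPDT implementation, and all function names here are hypothetical: it shows the online-softmax accumulation that lets chunked attention reproduce exact full attention without ever materializing the complete score matrix.

```python
import numpy as np

def chunked_attention(q, k, v, chunk=4):
    """Exact softmax attention computed one key/value chunk at a time.

    q: (Tq, d); k, v: (Tk, d). Only one (Tq, chunk) score block exists
    at any moment, mimicking how chunked/offloaded schemes bound memory.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full((q.shape[0], 1), -np.inf)     # running row-wise max
    l = np.zeros((q.shape[0], 1))             # running softmax denominator
    acc = np.zeros_like(q, dtype=np.float64)  # running weighted-value sum
    for start in range(0, k.shape[0], chunk):
        ks, vs = k[start:start + chunk], v[start:start + chunk]
        s = (q @ ks.T) * scale                          # (Tq, chunk) scores
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        p = np.exp(s - m_new)                           # chunk-local weights
        corr = np.exp(m - m_new)                        # rescale old partials
        l = l * corr + p.sum(axis=1, keepdims=True)
        acc = acc * corr + p @ vs
        m = m_new
    return acc / l

def full_attention(q, k, v):
    """Reference: materializes the full (Tq, Tk) score matrix."""
    s = (q @ k.T) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v
```

In a distributed setting each sequence-parallel rank would own one chunk of keys and values, and the same rescaling trick lets partial results be combined across ranks; the single-process version above only demonstrates the numerics.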