PROFILE

Jinghan Yao

Developed the Fully Pipelined Distributed Transformer (FPDT) feature for the deepspeedai/DeepSpeed repository, enabling sequence parallelism with CPU-offloaded attention and feedforward networks for large language models. The work partitions attention computations across sequence-parallel ranks, improving both memory efficiency and training performance. The implementation, written in Python with CUDA and PyTorch, also updates activation checkpointing to further reduce memory usage and raise throughput during training and inference. A new continuous integration workflow validates flash attention, giving faster and more reliable feedback on later changes. The contribution centers on distributed systems and deep learning optimization.
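
To illustrate the idea of partitioning attention computations, here is a minimal pure-Python sketch. It is not the DeepSpeed implementation, and the function names `full_attention` and `chunked_attention` are hypothetical. It assumes a single query vector and shows that attention can be computed one key/value chunk at a time, using a running maximum and running sum, so that only the current chunk needs to be resident while the rest could sit elsewhere (for example in CPU memory).

```python
import math


def full_attention(q, keys, values):
    """Reference: softmax(q . k / sqrt(d)) weighted sum of values for one query."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim_v)]


def chunked_attention(q, keys, values, chunk_size):
    """Same result as full_attention, visiting keys/values chunk by chunk."""
    d = len(q)
    dim_v = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim_v       # unnormalized output, relative to running_max
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (the factor is 0.0 on the first chunk, when nothing is accumulated).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim_v):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Calling `chunked_attention` with any chunk size should match `full_attention` up to floating-point rounding. A real system would additionally distribute the sequence across ranks and overlap data movement with compute, which this sketch does not attempt.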

Overall Statistics

Feature vs Bugs

100% Features

Repository Contributions

Total: 1
Bugs: 0
Commits: 1
Features: 1
Lines of code: 1,434
Activity months: 1

Work History

December 2024

1 Commit • 1 Feature

Dec 1, 2024

Delivered the Fully Pipelined Distributed Transformer (FPDT) feature for deepspeedai/DeepSpeed. FPDT introduces CPU-offloaded attention and feedforward network (FFN) computation, enabling sequence parallelism for large language models. It partitions attention computations across sequence-parallel ranks and updates activation checkpointing to improve memory efficiency and performance. The work also adds a new CI workflow for flash attention. Commit: 60a1b57b98c61c322cc76f1936eaec4f18a77b06.

Quality Metrics

Correctness: 90.0%
Maintainability: 80.0%
Architecture: 90.0%
Performance: 90.0%
AI Usage: 20.0%

Skills & Technologies

Programming Languages

Python, Shell

Technical Skills

CI/CD, CUDA, Deep Learning, Distributed Systems, PyTorch, Sequence Parallelism, Transformer Architecture

Repositories Contributed To

1 repo

deepspeedai/DeepSpeed

Dec 2024 to Dec 2024
1 month active

Languages Used

Python, Shell

Technical Skills

CI/CD, CUDA, Deep Learning, Distributed Systems, PyTorch, Sequence Parallelism