
Xiong Gao developed and optimized NPU-focused inference features across the openvino and openvino.genai repositories, delivering six features and one bug fix over three months. He implemented chunked prefill and dynamic LoRA support in C++ and Python, enabling efficient handling of long prompts and flexible fine-tuning on NPU hardware. His work included KV cache optimization, prefix cache reuse, and refined 3D position ID processing, which reduced inference time and improved accuracy for Vision-Language Models. By focusing on low-level programming, cache management, and plugin development, Xiong addressed runtime stability, memory efficiency, and production readiness for NPU-backed machine learning pipelines.

October 2025: Delivered NPUW KV Cache Optimization and Accuracy Enhancements for openvino. Implemented prefix KV cache reuse across generation calls, reducing inference time. Refined KV cache handling and 3D position ID processing to improve accuracy by avoiding KV cache restoration/storage for partial chunks and correcting chunked prefill inference. All changes align with openvino repository standards and are prepared for review.
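The prefix KV cache reuse described above can be illustrated with a minimal sketch: across generation calls, the cache compares the new prompt's tokens against the previously cached prefix and only runs prefill for the unmatched tail. The class and method names below are hypothetical, not the OpenVINO NPUW API, and the per-token `compute_kv` callback stands in for the expensive device-side prefill step.

```python
class PrefixKVCache:
    """Illustrative prefix KV cache reused across generation calls.

    Stores one opaque KV entry per cached token; names are hypothetical,
    not the OpenVINO NPUW implementation.
    """

    def __init__(self):
        self._cached_tokens = []  # tokens whose KV state is already computed
        self._cached_state = []   # one KV entry per cached token

    def match_prefix(self, prompt_tokens):
        """Return the length of the longest cached prefix of prompt_tokens."""
        n = 0
        for cached, new in zip(self._cached_tokens, prompt_tokens):
            if cached != new:
                break
            n += 1
        return n

    def prefill(self, prompt_tokens, compute_kv):
        """Run prefill only for tokens past the matched prefix.

        Returns (reused_count, newly_computed_count).
        """
        reused = self.match_prefix(prompt_tokens)
        new_tokens = prompt_tokens[reused:]
        new_state = [compute_kv(t) for t in new_tokens]  # only the tail is computed
        self._cached_tokens = list(prompt_tokens)
        self._cached_state = self._cached_state[:reused] + new_state
        return reused, len(new_tokens)
```

A second call with a prompt sharing a three-token prefix would then recompute only one token instead of four, which is where the inference-time reduction comes from.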
August 2025 performance summary: Delivered cross-repo NPU-focused enhancements to dynamic LoRA and VLM support across OpenVINO core and GenAI, enabling flexible fine-tuning on NPU hardware and reducing VLM startup latency. Implementations include dynamic LoRA loading with pre-allocated L0 tensors and VLM chunk prefill for NPUW, ensuring correct input handling for 3D VLM workloads and parity with CPU/GPU behavior. These changes improve model adaptability, throughput, and production readiness on NPU-backed inference and fine-tuning pipelines.
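The dynamic LoRA loading with pre-allocated tensors can be sketched as follows: the adapter's low-rank buffers are allocated once (a stand-in for the pre-allocated Level-Zero device tensors mentioned above) and new adapter weights are copied into them at runtime without reallocation. The class, shapes, and NumPy stand-ins are assumptions for illustration, not the OpenVINO API.

```python
import numpy as np


class LoRAAdapter:
    """Illustrative dynamic LoRA: buffers pre-allocated once, weights
    swappable at runtime. A stand-in for pre-allocated device tensors."""

    def __init__(self, in_dim, out_dim, rank, alpha=1.0):
        self.scale = alpha / rank
        # Pre-allocated low-rank buffers; reused across adapter swaps.
        self.A = np.zeros((rank, in_dim), dtype=np.float32)
        self.B = np.zeros((out_dim, rank), dtype=np.float32)

    def load(self, A, B):
        """Swap in new adapter weights in place, without reallocating."""
        self.A[...] = A
        self.B[...] = B

    def apply(self, base_out, x):
        """Standard LoRA update: y = base_out + scale * B @ (A @ x)."""
        return base_out + self.scale * (self.B @ (self.A @ x))
```

Because `load` writes into the existing buffers, switching adapters costs only a copy, which is the property that makes fine-tuned adapters "dynamic" on fixed-allocation hardware.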
In July 2025, delivered two high-impact NPU-focused enhancements across openvino.genai and openvino, significantly improving runtime stability and performance for NPU-backed inference. The work emphasizes reliability, latency, and memory efficiency for long prompts and dynamic/static shape handling on NPUs.
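The long-prompt handling rests on chunked prefill: instead of running one prefill pass over the whole prompt, the prompt is split into fixed-size spans that fit the NPU's static shapes, with a possibly partial final chunk. A minimal sketch of the splitting step (the function name and span representation are hypothetical):

```python
def chunk_prefill_spans(prompt_len, chunk_size):
    """Split a prompt of prompt_len tokens into fixed-size prefill chunks.

    Returns a list of (start, length) spans; the last span may be partial,
    which is why partial chunks need distinct KV-cache handling.
    """
    spans = []
    start = 0
    while start < prompt_len:
        length = min(chunk_size, prompt_len - start)
        spans.append((start, length))
        start += length
    return spans
```

Each span is then prefilled in sequence while the KV cache accumulates, so memory use stays bounded by the chunk size rather than the full prompt length.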