GraCE: Unlocking CUDA Graphs with Compiler Support for ML Workloads

Abhishek Ghosh; Ajay Nayak; Ashish Panwar; Arkaprava Basu

2026 Operating Systems Design and Implementation | July 2026

Published by USENIX

As the performance gap between GPUs and CPUs continues to widen, kernel launch overhead is becoming a first-order bottleneck for many ML workloads. NVIDIA introduced CUDA Graphs to mitigate this issue by capturing GPU kernels as a static DAG and launching them as a single unit, thus avoiding per-kernel launch overhead. However, CUDA Graphs are surprisingly difficult to deploy correctly and efficiently.

We present GraCE, a CUDA Graph-aware compilation framework for ML workloads. GraCE introduces three key optimizations: (1) code transformations that broaden the applicability of CUDA Graphs, (2) elimination of excessive kernel-parameter copy overheads within captured graphs, and (3) selective deployment of CUDA Graphs guided by a cost–benefit analysis. GraCE integrates with PyTorch 2’s compilation pipeline and requires no changes to user code.
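
To make the idea behind optimization (3) concrete, the following is a minimal sketch of a cost-benefit check, not GraCE's actual cost model. The `Region` type, the function names, and all timing constants are hypothetical and chosen only for illustration. It assumes eager execution pays a fixed CPU launch cost per kernel, while a CUDA Graph replay pays one graph-launch cost plus the cost of copying inputs into the graph's static buffers.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A candidate code region for CUDA Graph capture (hypothetical type)."""
    name: str
    num_kernels: int       # kernels launched by the region
    param_copy_us: float   # assumed per-replay cost of copying inputs, in microseconds


def eager_overhead_us(region: Region, per_launch_us: float) -> float:
    # Eager mode: every kernel pays its own launch overhead.
    return region.num_kernels * per_launch_us


def graph_overhead_us(region: Region, graph_launch_us: float) -> float:
    # Graph mode: one launch for the whole DAG, plus input-copy cost.
    return graph_launch_us + region.param_copy_us


def should_use_cuda_graph(
    region: Region,
    per_launch_us: float = 5.0,     # illustrative value, not a measurement
    graph_launch_us: float = 10.0,  # illustrative value, not a measurement
) -> bool:
    # Deploy a CUDA Graph only when its estimated overhead is lower.
    return graph_overhead_us(region, graph_launch_us) < eager_overhead_us(
        region, per_launch_us
    )


many_small = Region("many_small_kernels", num_kernels=200, param_copy_us=50.0)
few_large = Region("few_kernels_large_inputs", num_kernels=4, param_copy_us=400.0)

for r in (many_small, few_large):
    print(r.name, "-> use CUDA Graph:", should_use_cuda_graph(r))
```

Under these assumed constants, a region with many small kernels favors graph capture, whereas a region with few kernels and expensive input copies is better left in eager mode. A real system would derive these costs from profiling rather than fixed values.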