GraCE: Unlocking CUDA Graphs with Compiler Support for ML Workloads
- Abhishek Ghosh
- Ajay Nayak
- Ashish Panwar
- Arkaprava Basu
2026 Operating Systems Design and Implementation
Published by USENIX
As the performance gap between GPUs and CPUs continues to widen, kernel launch overhead is becoming a first-order bottleneck for many ML workloads. NVIDIA introduced CUDA Graphs to mitigate this issue by capturing GPU kernels as a static directed acyclic graph (DAG) and launching them together, thus avoiding per-kernel launch overhead. However, CUDA Graphs are surprisingly difficult to deploy correctly and efficiently.
We present GraCE, a CUDA Graph-aware compilation framework for ML workloads. GraCE introduces three key optimizations: (1) code transformations that broaden the applicability of CUDA Graphs, (2) elimination of excessive kernel-parameter copy overheads within captured graphs, and (3) selective deployment of CUDA Graphs guided by a cost–benefit analysis. GraCE integrates seamlessly with PyTorch2’s compilation pipeline and requires no changes to user code.
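To make the idea of a cost-benefit decision concrete, the following is a minimal sketch of how such a check might look. It is not GraCE's actual cost model, which the abstract does not describe. The `RegionProfile` type, its fields, the timing formulas, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionProfile:
    """Hypothetical measurements for one code region, in microseconds."""
    num_kernels: int
    kernel_time_us: float      # average GPU execution time per kernel
    launch_overhead_us: float  # CPU cost of launching one kernel eagerly
    graph_launch_us: float     # cost of launching the captured graph once
    param_copy_us: float       # copies into the graph's fixed buffers per replay


def eager_time_us(p: RegionProfile) -> float:
    # Assumption: launches are asynchronous, so each kernel costs roughly
    # the slower of its CPU-side launch and its GPU-side execution.
    return p.num_kernels * max(p.kernel_time_us, p.launch_overhead_us)


def graph_time_us(p: RegionProfile) -> float:
    # Assumption: one graph launch plus parameter copies, then the kernels
    # run back to back with no per-kernel launch cost.
    return p.graph_launch_us + p.param_copy_us + p.num_kernels * p.kernel_time_us


def should_use_cuda_graph(p: RegionProfile) -> bool:
    # Deploy a CUDA Graph only when the modeled replay time beats eager mode.
    return graph_time_us(p) < eager_time_us(p)


# Many tiny kernels: launch overhead dominates, so the graph is modeled as faster.
small = RegionProfile(num_kernels=100, kernel_time_us=2.0,
                      launch_overhead_us=5.0, graph_launch_us=10.0,
                      param_copy_us=20.0)

# A few long kernels: launches are already hidden, so copy overhead makes
# the graph a net loss under this model.
large = RegionProfile(num_kernels=10, kernel_time_us=200.0,
                      launch_overhead_us=5.0, graph_launch_us=10.0,
                      param_copy_us=50.0)

print(should_use_cuda_graph(small), should_use_cuda_graph(large))
```

Under these assumed numbers, the region of many short kernels benefits from a graph while the region of a few long kernels does not, which is the kind of case a selective-deployment policy would need to distinguish.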