Unified virtual memory for large scale ML workloads
Context:
Recsys model training & inference are memory-intensive. Generally, the servers for training & inference have the following memory hierarchy:
- GPU RAM (High bandwidth, $$$) ~ 40–80 GB depending on GPU choice.
- CPU RAM (Low bandwidth, $$) ~ 1 TB. This is also known as host memory.
- SSD (Low bandwidth, $) ~ multiple TBs.
High-bandwidth GPU memory is often one or more orders of magnitude smaller than CPU and SSD memory. For training or serving large-scale recsys models, this can mean that a model beyond a certain size cannot be loaded into GPU memory, and the process fails with the dreaded OOM (out-of-memory) error.
Unified Virtual Memory (UVM):
- Unified virtual memory is a single memory address space accessible from any processor in a system. This allows applications to allocate data that can be read or written from code running on either CPUs or GPUs.
- That means the application can address the combined GPU and host memory capacity instead of being limited to either one. This removes the capacity problem discussed in the previous section around loading large recsys models.
- Without a unified memory address space, developers have to manually copy data from CPU memory to GPU memory before a GPU kernel can access that data.
- Replacing cudaMalloc allocations (and the associated cudaMemcpy calls) with cudaMallocManaged is what lets developers opt into this technique (see the sketch after this list). UVM also enables GPU kernel execution while memory is oversubscribed (total memory used > total physical GPU memory) by automatically evicting data that is no longer needed from GPU memory to CPU memory. The data migration is handled automatically by the Memory Management Unit (MMU) and the driver.
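Below is a minimal sketch of the difference between the two approaches, assuming a trivial element-wise `scale` kernel and array sizes invented purely for illustration; only the cudaMalloc / cudaMemcpy / cudaMallocManaged calls are the CUDA runtime APIs referenced above.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel used only for illustration: scales each element in place.
__global__ void scale(float* data, size_t n, float factor) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);

    // (a) Explicit management: allocate on the GPU and copy by hand.
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, host.data(), bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_buf, n, 2.0f);
    cudaMemcpy(host.data(), d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);

    // (b) UVM: a single managed allocation is visible to both CPU and GPU;
    //     the driver migrates pages on demand, so no explicit cudaMemcpy is needed.
    float* managed = nullptr;
    cudaMallocManaged(&managed, bytes);
    for (size_t i = 0; i < n; ++i) managed[i] = 1.0f;    // CPU writes directly
    scale<<<(n + 255) / 256, 256>>>(managed, n, 2.0f);   // GPU faults pages in on demand
    cudaDeviceSynchronize();                             // wait before the CPU reads again
    float first = managed[0];                            // CPU reads directly; pages migrate back
    cudaFree(managed);
    return first == 2.0f ? 0 : 1;
}
```

With the managed allocation, the same pointer can back a tensor far larger than GPU RAM; the driver pages pieces of it in and out as kernels touch it, which is the oversubscription behaviour described above.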

No free lunch:
- The flexibility provided by UVM comes at the price of performance. To implement automatic data migration between a CPU and a GPU, the driver and MMU have to track data-access information and decide the granularity of data migration.
- UVM needs special page-table walks and page-fault handling, which introduce extra latency for GPU memory accesses. In addition, the fluctuating page-migration granularity may under-utilize PCIe bandwidth.
- To measure UVM's performance degradation and fine-tune its parameters, one needs to benchmark against the specific workloads of interest.
- The ML/deep-learning infra community is actively researching smarter prefetching & page-eviction policies to reduce this performance overhead; the CUDA runtime already exposes basic hints, sketched below.
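As a concrete example of the tuning knobs available today, the CUDA runtime lets an application advise the driver on where managed pages should preferentially live and prefetch them ahead of a kernel launch. The sketch below uses the real cudaMemAdvise / cudaMemPrefetchAsync APIs, but the allocation size, device id, and the elided kernel launch are assumptions for illustration:

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;    // 1 GiB of managed memory (illustrative size)
    int device = 0;                          // assuming a single-GPU machine
    cudaSetDevice(device);

    float* data = nullptr;
    cudaMallocManaged(&data, bytes);

    // Hint: keep this region resident on the GPU when possible, and treat it
    // as read-mostly so the driver may replicate read-only copies instead of
    // bouncing pages back and forth.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);

    // Prefetch the pages onto the GPU before launching work, so the kernel
    // does not pay page-fault latency on first touch.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemPrefetchAsync(data, bytes, device, stream);

    // ... launch kernels on `stream` that consume `data` ...

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```

Whether these hints help, and by how much, depends on the access pattern, which is why the workload-specific benchmarking mentioned above matters.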