Jaideep Ray · Published in Better ML · Dec 27, 2022

Unified virtual memory for large scale ML workloads

Context:

Recsys model training & inference are memory-intensive. Generally, the servers used for training & inference have the following memory hierarchy:

  • GPU RAM (high bandwidth, $$$) ~ 40–80 GB depending on GPU choice.
  • CPU RAM (lower bandwidth, $$) ~ 1 TB. This is also known as host memory.
  • SSD (lowest bandwidth, $) ~ multiple TBs.
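To see this hierarchy on a concrete server, here is a minimal sketch (assuming a Linux host and the CUDA runtime API; not tied to any particular serving stack) that prints the sizes of the two tiers:

    #include <cstdio>
    #include <unistd.h>
    #include <cuda_runtime.h>

    int main() {
        // GPU tier: high bandwidth, scarce, expensive.
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);
        printf("GPU RAM : %.1f GB total, %.1f GB free\n",
               total_b / 1e9, free_b / 1e9);

        // Host tier: CPU RAM, queried from the OS (Linux sysconf here).
        double host_b = (double)sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGE_SIZE);
        printf("Host RAM: %.1f GB\n", host_b / 1e9);
        return 0;
    }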

High-bandwidth GPU memory is typically an order of magnitude smaller than CPU and SSD memory. For training or serving large-scale recsys models, this means a model beyond a certain size cannot be loaded into GPU memory, or the process fails with the dreaded OOM (out-of-memory) error.


Unified Virtual Memory (UVM):

  1. Unified virtual memory is a single memory address space accessible from any processor in a system. This allows applications to allocate data that can be read or written by code running on either CPUs or GPUs.
  2. That means an application can use the entire address space instead of being limited by GPU or host memory. This largely removes the capacity problems discussed in the previous section around loading large recsys models.
  3. Without a unified memory address space, developers have to manually copy data from CPU memory to GPU memory before a GPU kernel can access it.
  4. With UVM, replacing cudaMalloc with cudaMallocManaged yields a single pointer that is valid on both the CPU and the GPU, removing the need for explicit cudaMemcpy calls (see the sketch below). UVM also enables GPU kernel execution while memory is oversubscribed (total memory used > total physical GPU memory) by automatically evicting data that is no longer needed from GPU memory to CPU memory. The data migration is handled automatically by the GPU's Memory Management Unit (MMU) and the driver.
[Figure: Shared address space between GPU and CPU.]
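To make point 4 concrete, here is a minimal sketch (assuming the standard CUDA runtime; the kernel and sizes are purely illustrative) contrasting the explicit cudaMalloc + cudaMemcpy path with a single cudaMallocManaged allocation:

    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void scale(float* x, size_t n, float a) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const size_t n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Explicit management: separate host/device buffers, manual copies.
        float* h = (float*)malloc(bytes);
        for (size_t i = 0; i < n; ++i) h[i] = 1.0f;
        float* d;
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
        scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU

        // UVM: one pointer valid on both processors, no explicit copies.
        float* u;
        cudaMallocManaged(&u, bytes);
        for (size_t i = 0; i < n; ++i) u[i] = 1.0f;       // CPU writes directly
        scale<<<(n + 255) / 256, 256>>>(u, n, 2.0f);      // GPU uses the same pointer
        cudaDeviceSynchronize();                          // pages migrate on demand
        // u[0] is now 2.0f, readable from the CPU without any cudaMemcpy.

        cudaFree(d); cudaFree(u); free(h);
        return 0;
    }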

No free lunch:

  1. The flexibility provided by UVM comes at the price of performance. To implement automatic data migration between a CPU and a GPU, the driver and MMU have to track data-access information and determine the granularity of data migration.
  2. UVM needs special page-table walks and page-fault handling. This introduces extra latency for memory accesses on GPUs. In addition, the variable page-migration granularity may under-utilize PCIe bandwidth.
  3. To measure the performance degradation under UVM and fine-tune its parameters, one needs to benchmark against the specific workload (see the micro-benchmark sketch after this list).
  4. The ML/deep learning infra community is actively researching smarter prefetching & page-eviction policies to reduce this performance overhead.
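As a starting point for such a benchmark, here is a minimal sketch (my own, assuming the standard CUDA runtime; the buffer size and kernel are illustrative) that times a kernel over a managed buffer once with on-demand page migration and once after an explicit cudaMemPrefetchAsync, the runtime's bulk-migration hint:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(float* x, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    // Times one kernel launch over the buffer using CUDA events.
    static float time_kernel_ms(float* x, size_t n) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        touch<<<(n + 255) / 256, 256>>>(x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main() {
        const size_t n = 256ull << 20;  // 1 GiB of floats; raise above GPU RAM to oversubscribe
        float* x;
        cudaMallocManaged(&x, n * sizeof(float));
        for (size_t i = 0; i < n; ++i) x[i] = 0.0f;  // first touch on CPU: pages start on the host

        // Run 1 (cold): every GPU access page-faults and migrates pages on demand.
        printf("on-demand migration: %8.2f ms\n", time_kernel_ms(x, n));

        // Evict pages back to the host, then bulk-migrate them to the GPU up front.
        cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId, 0);
        cudaDeviceSynchronize();
        int dev = 0;
        cudaGetDevice(&dev);
        cudaMemPrefetchAsync(x, n * sizeof(float), dev, 0);
        cudaDeviceSynchronize();

        // Run 2: same kernel, but pages are already resident on the GPU.
        printf("after prefetch:      %8.2f ms\n", time_kernel_ms(x, n));

        cudaFree(x);
        return 0;
    }

On most setups the prefetched run should be markedly faster, since pages move in bulk over PCIe instead of one fault at a time; the gap is a rough measure of the UVM overhead for that access pattern.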
