Background and Issues

NUMA (Non-Uniform Memory Access) machines are now common in data centers. In a NUMA architecture, each CPU socket has its own local memory, and the sockets are connected by an interconnect such as Intel's QPI (QuickPath Interconnect). A processor can access its local memory faster than memory attached to another node.

Distributing a dataset across NUMA nodes is one of the optimizations used to improve application performance. Consider an array that is shared among N NUMA nodes. One way to map this array onto the nodes is to create N pointers, each pointing to one partition of the array.

Accessing an element by its index in the original array then requires an extra address mapping, as the following code shows.

int *shards[N];     // start address of each data shard; each shard's memory is allocated on the corresponding NUMA node
size_t shard_size;  // number of elements in each data shard

/*
 * Get the element at the given index of the original array.
 */
inline int get(size_t idx) {
  size_t shard_id = idx / shard_size;
  return shards[shard_id][idx % shard_size];
}

This address mapping costs a division, a modulo, and an extra pointer load on every access, which is unnecessary overhead.

Don’t forget: the operating system already provides an address mapping mechanism!

Mapping contiguous virtual addresses to NUMA nodes

mmap + numa_tonode_memory

Remember that memory allocation is lazy. The operating system does not assign actual pages of physical memory until a page is first touched (read or written).
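The lazy behaviour can be observed directly. The following is a minimal sketch, assuming Linux and a C++ compiler; count_resident_pages and lazy_allocation_demo are hypothetical helper names, not part of any library. It maps an anonymous region, asks mincore how many pages are resident, touches one byte, and asks again.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Count how many pages of [addr, addr + n_pages * page_size) are resident.
// Returns -1 if mincore() fails.
inline long count_resident_pages(void *addr, std::size_t n_pages) {
  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  std::vector<unsigned char> vec(n_pages);
  if (mincore(addr, n_pages * page, vec.data()) != 0) return -1;
  long resident = 0;
  for (unsigned char v : vec) resident += (v & 1);  // lowest bit = resident
  return resident;
}

// Resident page counts before and after the first touch.
struct LazyDemo {
  long before;
  long after;
};

// Hypothetical demo: reserve n_pages of anonymous memory, then touch one byte.
inline LazyDemo lazy_allocation_demo(std::size_t n_pages) {
  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  void *addr = mmap(nullptr, n_pages * page, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  assert(addr != MAP_FAILED);

  long before = count_resident_pages(addr, n_pages);
  static_cast<volatile char *>(addr)[0] = 1;  // first touch of page 0
  long after = count_resident_pages(addr, n_pages);

  munmap(addr, n_pages * page);
  return {before, after};
}
```

On a typical Linux machine, lazy_allocation_demo(64) would be expected to report no resident pages before the touch and at least one afterwards. The exact counts depend on the kernel; with transparent huge pages, for example, a single touch may fault in more than one page.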

So we can reserve a contiguous range of virtual addresses by mmap-ing anonymous pages, then apply numa_tonode_memory to bind each shard of that range to a specific NUMA node.

Note that the start address of each shard must be page aligned, because memory policy is applied per page.
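One way to satisfy the alignment requirement is to pad every shard to a whole number of pages. The helper below is a sketch under that assumption; page_aligned_shard_elems is a hypothetical name, and the page size is passed in as a parameter (in practice it could come from sysconf(_SC_PAGESIZE)). It returns an element count per shard whose byte size is a multiple of both the element size and the page size, so the shards stay contiguous and a flat index still works.

```cpp
#include <cassert>
#include <cstddef>

// Greatest common divisor, used to derive the least common multiple.
inline std::size_t gcd_size(std::size_t a, std::size_t b) {
  while (b != 0) {
    std::size_t t = a % b;
    a = b;
    b = t;
  }
  return a;
}

// Hypothetical helper: number of elements per shard such that
// shard_elems * ele_bytes is a multiple of the page size.
inline std::size_t page_aligned_shard_elems(std::size_t total_elems,
                                            std::size_t n_shards,
                                            std::size_t ele_bytes,
                                            std::size_t page) {
  assert(n_shards > 0 && ele_bytes > 0 && page > 0);
  // Smallest element count whose byte size is a whole number of pages.
  const std::size_t unit = page / gcd_size(page, ele_bytes);
  // Elements each shard must hold, rounded up.
  const std::size_t per_shard = (total_elems + n_shards - 1) / n_shards;
  // Round up to a multiple of `unit`.
  return (per_shard + unit - 1) / unit * unit;
}
```

For example, splitting 10000 four-byte elements across 2 shards with 4096-byte pages gives 5120 elements (5 pages) per shard. The padding at the end of the last shard is never touched, so it costs virtual address space but no physical memory.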

  size_t shard_size[N]; // number of elements in each shard
  size_t ele_bytes; // bytes of each element
  // code to initialize shard_size and ele_bytes; make sure each shard is page aligned.
  ...

  // reserve virtual memory
  void *addr = mmap(NULL, tot_pages * PGSIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  CHECK(addr != MAP_FAILED) << "Failed to mmap pages.";  // mmap reports failure with MAP_FAILED, not NULL

  // bind each shard to its NUMA node
  size_t offset = 0;  // byte offset of the current shard from addr
  for (int node = 0; node < N; node++) {
    void *pos = (void *)((char *)addr + offset);
    numa_tonode_memory(pos, ele_bytes * shard_size[node], node);
    offset += ele_bytes * shard_size[node];
  }
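Putting the pieces together, a lookup no longer needs any index arithmetic. The following is a self-contained sketch, not a definitive implementation: alloc_sharded and get are hypothetical names, HAVE_LIBNUMA is a hypothetical build flag (define it and link with -lnuma to enable the binding), and the code assumes equal-sized shards whose byte size is a multiple of the page size and a machine with at least n_shards NUMA nodes. Without the flag the region is still usable; the pages are simply placed by the default policy.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#ifdef HAVE_LIBNUMA  // hypothetical build flag; link with -lnuma when set
#include <numa.h>
#endif

// Reserve n_shards * shard_elems ints as one contiguous virtual range and,
// when libnuma is compiled in and usable, bind shard i to NUMA node i.
// Assumes shard_elems * sizeof(int) is a multiple of the page size.
// Returns nullptr if mmap fails.
inline int *alloc_sharded(std::size_t n_shards, std::size_t shard_elems) {
  assert(n_shards > 0 && shard_elems > 0);
  const std::size_t shard_bytes = shard_elems * sizeof(int);
  void *addr = mmap(nullptr, n_shards * shard_bytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (addr == MAP_FAILED) return nullptr;
#ifdef HAVE_LIBNUMA
  if (numa_available() != -1) {
    for (std::size_t node = 0; node < n_shards; ++node) {
      // No physical page exists yet, so this only records the policy;
      // pages are placed on the node when they are first touched.
      numa_tonode_memory(static_cast<char *>(addr) + node * shard_bytes,
                         shard_bytes, static_cast<int>(node));
    }
  }
#endif
  return static_cast<int *>(addr);
}

// A lookup is a plain array access; the page tables decide which node
// actually holds the element.
inline int get(const int *data, std::size_t idx) { return data[idx]; }
```

With this layout, get(data, idx) replaces the division, modulo, and pointer load of the sharded version, and anonymous pages read as zero until they are written. The region should eventually be released with munmap using the same total size.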
