Mapping Contiguous Virtual Addresses to NUMA Nodes
Background and Issues
Nowadays, NUMA (Non-Uniform Memory Access) processors are common in data centers. In a NUMA architecture, each CPU has its own local memory, and the CPUs are connected by an interconnect such as Intel QPI (QuickPath Interconnect). A processor can access its local memory faster than non-local memory.
Distributing datasets among NUMA nodes is one of the optimizations used to improve application performance. Consider an array that is shared between N NUMA nodes. One way to map this array to the NUMA nodes is to create N pointers, each pointing to one partition of the array.
Accessing an element by its index in the original data then requires an extra address mapping, as the following code shows.
int *shards[N];     // start address of each data shard; each shard's memory is allocated on its corresponding NUMA node
size_t shard_size;  // number of elements in each data shard

/*
 * Get the element at the given global index.
 */
inline int get(size_t idx) {
    size_t shard_id = idx / shard_size;
    return shards[shard_id][idx % shard_size];
}
This address mapping adds a division, a modulo, and an extra pointer load to every access, which is unnecessary overhead.
Don't forget: **the operating system already provides an address-mapping mechanism!**
Mapping contiguous virtual addresses to NUMA nodes
mmap + numa_tonode_memory
Remember that memory allocation is LAZY. The operating system does not assign actual pages of physical memory until a page is first touched (read or written).
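Lazy allocation can be observed directly. The following is a minimal sketch, assuming Linux and the `mincore` system call; `reserve` and `resident_pages` are hypothetical helper names used only for illustration. It counts how many pages of a fresh anonymous mapping are resident in RAM, so the count can be compared before and after writing to a few pages.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Reserve `bytes` of anonymous virtual memory (hypothetical helper).
// No physical pages are assigned at this point.
inline void *reserve(size_t bytes) {
    void *addr = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(addr != MAP_FAILED);  // mmap signals failure with MAP_FAILED
    return addr;
}

// Count the pages of [addr, addr + bytes) that are currently resident in RAM.
// addr must be page-aligned (addresses returned by mmap are).
// Returns -1 if mincore fails.
inline long resident_pages(void *addr, size_t bytes) {
    const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    const size_t n_pages = (bytes + page - 1) / page;
    std::vector<unsigned char> vec(n_pages);
    if (mincore(addr, bytes, vec.data()) != 0) return -1;
    long count = 0;
    for (unsigned char v : vec) count += v & 1;  // bit 0 set = page is resident
    return count;
}
```

On a typical Linux system the count is zero right after `reserve` and grows only as pages are written, though the exact numbers can differ (for example, with transparent huge pages enabled).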
So, we can reserve a contiguous range of virtual addresses by mmap-ing ANONYMOUS pages. Then, apply numa_tonode_memory to bind each shard of that virtual address range to a specific NUMA node.
NOTICE that the start address and size of each shard must be page-aligned.
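One way to satisfy the alignment requirement is to round each shard up to a whole number of pages. This is a minimal sketch, not the only possible layout: `round_up_to_page` and `aligned_shard_elems` are hypothetical helpers, and the sketch assumes equal-sized shards, a power-of-two page size, and an element size that divides the page size.

```cpp
#include <cassert>
#include <cstddef>

// Round `bytes` up to the next multiple of `page`.
// Assumes `page` is a power of two, as page sizes are in practice.
inline size_t round_up_to_page(size_t bytes, size_t page) {
    assert(page != 0 && (page & (page - 1)) == 0);
    return (bytes + page - 1) & ~(page - 1);
}

// Number of elements per shard such that every shard covers whole pages
// and n_shards shards together hold at least total_elems elements.
// Assumes ele_bytes divides the page size.
inline size_t aligned_shard_elems(size_t total_elems, size_t n_shards,
                                  size_t ele_bytes, size_t page) {
    assert(n_shards > 0 && ele_bytes > 0 && page % ele_bytes == 0);
    size_t per_shard = (total_elems + n_shards - 1) / n_shards;  // ceiling division
    size_t shard_bytes = round_up_to_page(per_shard * ele_bytes, page);
    return shard_bytes / ele_bytes;
}
```

For example, 10000 four-byte elements split over 4 shards with 4096-byte pages gives 2500 elements (10000 bytes) per shard, which rounds up to 12288 bytes, or 3072 elements per shard. The padding at the end of the array is never touched, so it costs virtual address space but no physical memory.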
#include <sys/mman.h>  // mmap
#include <numa.h>      // numa_tonode_memory; link with -lnuma

size_t shard_size[N];  // number of elements in each shard
size_t ele_bytes;      // bytes of each element
// code to initialize shard_size and ele_bytes; make sure each shard is page-aligned.
...
// reserve virtual memory
void *addr = mmap(NULL, tot_pages * PGSIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
CHECK(addr != MAP_FAILED) << "Failed to mmap pages.";  // mmap returns MAP_FAILED, not NULL, on error
// bind each shard to its NUMA node
size_t offset = 0;  // byte offset of the current shard within the mapping
for (int node = 0; node < N; node++) {
    void *pos = (void *)((char *)addr + offset);
    numa_tonode_memory(pos, ele_bytes * shard_size[node], node);
    offset += ele_bytes * shard_size[node];
}
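The payoff is that element lookup no longer needs any shard arithmetic. The sketch below restates the steps above in a form that compiles without libnuma: the binding call is passed in as a function pointer. `BindFn`, `map_sharded_array`, and the two-argument `get` are hypothetical names introduced here; `BindFn` matches the signature of `numa_tonode_memory(void *start, size_t size, int node)`.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alias with the same signature as numa_tonode_memory.
using BindFn = void (*)(void *start, size_t bytes, int node);

// Reserve one contiguous region covering all shards, then bind shard i to node i.
// Each entry of shard_bytes is assumed to be a multiple of the page size.
// Returns nullptr if mmap fails.
inline int *map_sharded_array(const std::vector<size_t> &shard_bytes, BindFn bind) {
    size_t total = 0;
    for (size_t b : shard_bytes) total += b;
    assert(total > 0);
    void *addr = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) return nullptr;
    size_t offset = 0;  // byte offset of the current shard
    for (size_t node = 0; node < shard_bytes.size(); ++node) {
        bind(static_cast<char *>(addr) + offset, shard_bytes[node],
             static_cast<int>(node));
        offset += shard_bytes[node];
    }
    return static_cast<int *>(addr);
}

// With a contiguous mapping, lookup is a plain index: the MMU does the
// translation that the shards[] version did in software.
inline int get(const int *data, size_t idx) { return data[idx]; }
```

On a real NUMA machine one would pass `numa_tonode_memory` as `bind`, link with `-lnuma`, and check `numa_available()` first. The pages only land on their nodes once they are first touched.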