Linux Resource Control Monitoring Being Improved For Intel Sub-NUMA Cluster Configurations


For those making use of Intel’s sub-NUMA cluster (SNC) configuration option, available on its servers since Skylake, the Linux resource control "resctrl" kernel code is being improved to better handle this configuration.

Sub-NUMA clustering allows partitioning the cores, cache, and memory controllers of a CPU into multiple NUMA domains. It can be beneficial for NUMA-aware/optimized software and is similar to AMD’s NUMA nodes Per Socket (NPS) BIOS option. While SNC has been available for years and its expected functionality works fine on Linux, it turns out the current Linux kernel resource control system mishandles Resource Director Technology (RDT) monitoring when SNC is enabled.

Intel Skylake server.

Longtime Intel Linux engineer Tony Luck explained in a new patch series that improves resource control support on servers using Sub-NUMA clustering:

Intel server systems starting with Skylake support a mode that logically partitions each socket. E.g. when partitioned two ways, half the cores, L3 cache, and memory controllers are allocated to each of the partitions. This may reduce average latency to access L3 cache and memory, with the tradeoff that only half the L3 cache is available for subnode-local memory access.

The existing Linux resctrl system mishandles RDT monitoring on systems with SNC mode enabled.

But, with some simple changes, this can be fixed. When SNC mode is enabled, the RDT RMID counters are also partitioned with the low numbered counters going to the first partition, and the high numbered counters to the second partition. The key is to adjust the RMID value written to the IA32_PQR_ASSOC MSR on context switch, and the value written to the IA32_QM_EVTSEL when reading out counters, and to change the scaling factor that was read from CPUID(0xf,1).EBX.

E.g. in 2-way Sub-NUMA cluster with 200 RMID counters there are only 100 available counters to the resctrl code. When running on the first SNC node RMID values 0..99 are used as before. But when running on the second node, a task that is assigned resctrl rmid=10 must load 10+100 into IA32_PQR_ASSOC to use RMID counter 110.

There should be no changes to functionality on other architectures, or on Intel systems with SNC disabled, where snc_ways == 1.
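
To make the RMID arithmetic from the example above concrete, here is a minimal Python sketch of the mapping. It is not the kernel code: the function name `physical_rmid` and its parameters are hypothetical, and extending the offset rule beyond the 2-way case described in the quote is an assumption.

```python
def physical_rmid(logical_rmid: int, snc_node: int,
                  total_rmids: int, snc_ways: int) -> int:
    """Map the RMID seen by resctrl to the hardware RMID counter.

    Hypothetical helper for illustration only. Assumes the hardware
    RMIDs are split into equal contiguous ranges, one per SNC node,
    with the lowest range belonging to node 0.
    """
    if snc_ways < 1 or total_rmids % snc_ways != 0:
        raise ValueError("total_rmids must divide evenly by snc_ways")

    # With SNC enabled, resctrl only sees this many RMIDs.
    rmids_per_node = total_rmids // snc_ways

    if not 0 <= snc_node < snc_ways:
        raise ValueError("snc_node out of range")
    if not 0 <= logical_rmid < rmids_per_node:
        raise ValueError("logical_rmid out of range")

    # Offset into the range of counters owned by this SNC node.
    return logical_rmid + snc_node * rmids_per_node


# The 2-way example from the quote: 200 counters, rmid=10.
print(physical_rmid(10, 0, 200, 2))  # first SNC node
print(physical_rmid(10, 1, 200, 2))  # second SNC node
```

With `snc_ways` set to 1 the offset is always zero, which matches the statement that nothing changes when SNC is disabled.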

Neither the current behavior nor the proposed changes should affect SNC performance; the work is about getting resource control monitoring right on sub-NUMA cluster configurations. The seven patches correcting the x86/resctrl code are now out for review.