Google engineer Ilya Tocar has introduced the notion of "light" AVX support within the LLVM compiler infrastructure. The aim is to keep some of the benefits of Advanced Vector Extensions (AVX) while avoiding the power/frequency impact that AVX-512 use has on older generations of Intel processors.
Merged to LLVM 16 Git yesterday, just prior to the end of LLVM 16 feature development, was the introduction of this "light" AVX concept to the open-source compiler. The light AVX mode allows 256-bit loads/stores to be generated even if the preference is set (via the -mprefer-vector-width=128 compiler option) for a 128-bit vector width.
This light AVX mode can be enabled for the Clang compiler by passing +allow-light-256-bit to the -mattr= compiler option. It is wired up for Intel Icelake processors and older, where encountering 256-bit/512-bit AVX use can carry a performance (power/frequency) penalty. Newer Intel CPUs don’t have any major problems with AVX-512 use; in case you missed it, see my AVX-512 Sapphire Rapids benchmark comparison. Similarly, AMD’s AVX-512 support introduced with Zen 4 processors doesn’t have the frequency/power problems either.
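To make the target of this tuning concrete, here is a minimal C sketch of a fixed-size, aligned copy loop, the kind of code where a compiler may expand memcpy inline into vector loads/stores. The names copy_block and run_copy_loop, the 64-byte alignment, and the fill pattern are illustrative assumptions and are not taken from the LLVM commit or its benchmark. Building it once with -mprefer-vector-width=128 alone and once with the light AVX tuning enabled, then comparing the generated assembly, is one way to look for the difference; the exact flag spelling a given Clang driver accepts is not verified here.

```c
#include <assert.h>
#include <string.h>

enum { BLOCK = 256 };  /* bytes per copy, mirroring the /256 benchmark case */

/* A whole number of 32-byte (256-bit) vectors fits in one block. */
static_assert(BLOCK % 32 == 0, "BLOCK must be a multiple of 32 bytes");

/* 64-byte alignment is an illustrative choice, not taken from the commit. */
static _Alignas(64) unsigned char src_buf[BLOCK];
static _Alignas(64) unsigned char dst_buf[BLOCK];

/* Fixed-size copy: the constant length lets the compiler expand memcpy
 * inline into vector loads/stores instead of calling the library routine. */
static void copy_block(unsigned char *restrict dst,
                       const unsigned char *restrict src)
{
    memcpy(dst, src, BLOCK);
}

/* Hypothetical helper: run the copy `iters` times.
 * Returns 1 if the destination matches the source afterwards, else 0. */
int run_copy_loop(unsigned iters)
{
    for (int i = 0; i < BLOCK; i++)
        src_buf[i] = (unsigned char)(i * 31 + 7);
    memset(dst_buf, 0, BLOCK);

    for (unsigned n = 0; n < iters; n++) {
        copy_block(dst_buf, src_buf);
        /* Compiler barrier (GCC/Clang extension) so the copies are not elided. */
        __asm__ volatile("" : : "r"(dst_buf) : "memory");
    }
    return memcmp(dst_buf, src_buf, BLOCK) == 0;
}
```

Calling run_copy_loop(1000) from a small main only checks that the copy is correct; measuring throughput or frequency effects would need a proper benchmark harness and performance counters.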
Ilya Tocar summed up this light AVX work for LLVM with the commit message:
AVX/AVX512 instructions may cause frequency drop on e.g. Skylake. The magnitude of frequency/performance drop depends on instruction (multiplication vs load/store) and vector width. Currently users, that want to avoid this drop can specify -mprefer-vector-width=128. However this also prevents generations of 256-bit wide instructions, that have no associated frequency drop (mainly load/stores).
Add a tuning flag that allows generations of 256-bit AVX load/stores, even when -mprefer-vector-width=128 is set, to speed-up memcpy&co. Verified that running memcpy loop on all cores has no frequency impact and zero CORE_POWER:LVL_TURBO_LICENSE perf counters.
Makes copying memory faster e.g.:
BM_memcpy_aligned/256 80.7GB/s ± 3% 96.3GB/s ± 9% +19.33% (p=0.000 n=9+9)
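As a quick sanity check on that benchmark line, and assuming its columns read as old throughput, new throughput, and relative change, the quoted delta can be reproduced from the two rounded throughput figures. The helper name percent_change is hypothetical.

```c
/* Percentage change from `before` to `after`. */
double percent_change(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}
```

With the figures above, percent_change(80.7, 96.3) comes out at roughly 19.33, which agrees with the reported +19.33% to within rounding.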
This "light" AVX option for prior generations of Intel CPUs will be found in LLVM 16.0, which is expected to be released around 7 March.