A brief LIKWID demo: topology and microbenchmarks
LIKWID is a suite of tools for performance diagnostics.
In this short demo we will use two of its tools,
likwid-topology and likwid-bench,
to inspect the properties of the hardware.
Inspect node structure with likwid-topology
likwid-topology aggregates information coming from different sources
and reports on the topology of the CPU, its caches and NUMA domains
(NVIDIA GPU information can be obtained with -G).
Basic information can be obtained with:
likwid-topology --clock --caches
--------------------------------------------------------------------------------
CPU name: 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz
CPU type: Intel Tigerlake processor
CPU stepping: 1
CPU clock: 2.61 GHz
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets: 1
CPU dies: 1
Cores per socket: 4
Threads per core: 2
--------------------------------------------------------------------------------
HWThread Thread Core Die Socket Available
0 0 0 0 0 *
1 0 1 0 0 *
2 0 2 0 0 *
3 0 3 0 0 *
4 1 0 0 0 *
5 1 1 0 0 *
6 1 2 0 0 *
7 1 3 0 0 *
--------------------------------------------------------------------------------
Socket 0: ( 0 4 1 5 2 6 3 7 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level: 1
Size: 48 kB
Type: Data cache
Associativity: 12
Number of sets: 64
Cache line size: 64
Cache type: Non Inclusive
Shared by threads: 2
Cache groups: ( 0 4 ) ( 1 5 ) ( 2 6 ) ( 3 7 )
--------------------------------------------------------------------------------
Level: 2
Size: 1.25 MB
Type: Unified cache
Associativity: 20
Number of sets: 1024
Cache line size: 64
Cache type: Non Inclusive
Shared by threads: 2
Cache groups: ( 0 4 ) ( 1 5 ) ( 2 6 ) ( 3 7 )
--------------------------------------------------------------------------------
Level: 3
Size: 8 MB
Type: Unified cache
Associativity: 8
Number of sets: 16384
Cache line size: 64
Cache type: Non Inclusive
Shared by threads: 8
Cache groups: ( 0 4 1 5 2 6 3 7 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains: 1
--------------------------------------------------------------------------------
Domain: 0
Processors: ( 0 4 1 5 2 6 3 7 )
Distances: 10
Free memory: 1738.41 MB
Total memory: 15593.6 MB
--------------------------------------------------------------------------------
Notably, we get:
basic CPU and threading information (thread-to-core correspondence)
cache information: in this case, we see that L1 and L2 are local to a core (which corresponds to a pair of hardware threads), while the L3 cache is shared by all cores and threads. We also get the sizes of the caches (which we can later compare against some experiments)
NUMA topology information: in this case, there is only one domain to which all processors belong.
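As a quick consistency check on the cache report above, the size of a set-associative cache should equal associativity times number of sets times line size. The sketch below redoes that arithmetic with the values printed by likwid-topology; the dictionary layout and the function name are only illustrative choices.

```python
# Cache geometry as printed by likwid-topology above (per cache instance).
caches = {
    "L1": {"associativity": 12, "sets": 64, "line_bytes": 64},
    "L2": {"associativity": 20, "sets": 1024, "line_bytes": 64},
    "L3": {"associativity": 8, "sets": 16384, "line_bytes": 64},
}


def cache_size_bytes(geometry):
    """Capacity of a set-associative cache: ways x sets x line size."""
    return geometry["associativity"] * geometry["sets"] * geometry["line_bytes"]


sizes = {level: cache_size_bytes(geometry) for level, geometry in caches.items()}
for level, size in sizes.items():
    print(f"{level}: {size} bytes = {size / 1024:g} KiB")
```

The results (48 KiB, 1280 KiB and 8192 KiB) match the reported 48 kB, 1.25 MB and 8 MB, which also suggests that the sizes in this report are multiples of 1024.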
Maximum performance and bandwidth estimates: the Roofline model
The Roofline Model is a performance model that assumes that there are 2 main performance bottlenecks:
peak floating point performance
the memory bandwidth.
In this model, the relevant bottleneck for the kernel being investigated (typically, a loop) is determined by its arithmetic intensity (also called operational intensity): the ratio between the number of FP operations performed and the memory traffic in bytes, that is, how many bytes have to be moved from/to a particular level of the memory hierarchy (a cache level or RAM).
Figure: an example roofline plot (by Giu.natale, own work, CC BY-SA 4.0).
The maximum possible FP performance for a kernel is given by that “roofline” (probably so called because its shape resembles a roof):
kernels which require few operations per byte (low arithmetic intensity, on the left side of the plot above) will be ultimately memory bound
kernels which perform enough operations per byte (right end of the plot above) will be compute bound, that is able to run at the maximum FP performance of the machine.
Most codes in use today are typically memory bound.
Notice that if our code is latency limited, its performance will stay well below the roofline.
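The roofline can be written as a one-line formula: attainable performance is the minimum of the peak FP performance and the product of arithmetic intensity and bandwidth. The sketch below evaluates it for a few intensities. The machine numbers are round, made-up values chosen for illustration (they are not measurements), and the function name is hypothetical.

```python
def roofline(intensity, peak_flops, bandwidth):
    """Attainable performance: min(peak, arithmetic intensity * bandwidth)."""
    return min(peak_flops, intensity * bandwidth)


# Illustrative machine numbers (assumptions for this sketch, not measurements).
PEAK_GFLOPS = 200.0    # peak FP performance, GFlop/s
BANDWIDTH_GBS = 40.0   # memory bandwidth, GByte/s

# Intensity (Flop/Byte) at which the two limits meet: the "ridge" of the roof.
ridge = PEAK_GFLOPS / BANDWIDTH_GBS

for intensity in (0.125, 1.0, ridge, 16.0):
    perf = roofline(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    regime = "memory bound" if intensity < ridge else "compute bound"
    print(f"{intensity:6.3f} Flop/Byte -> {perf:6.1f} GFlop/s ({regime})")
```

Kernels to the left of the ridge point scale with bandwidth, while kernels to the right of it are capped by the FP peak.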
likwid-bench
LIKWID gives us a way to determine the characteristics of our machine
using likwid-bench, a suite of microkernel benchmarks.
In order to use likwid-bench we need to specify, at least:
The “thread domain”, that is, the set of hardware threads on which the benchmark is going to run.
To list the available thread domains, we can use the -p option:
likwid-bench -p
Number of Domains 5
Domain 0:
Tag N: 0 4 1 5 2 6 3 7
Domain 1:
Tag S0: 0 4 1 5 2 6 3 7
Domain 2:
Tag D0: 0 4 1 5 2 6 3 7
Domain 3:
Tag C0: 0 4 1 5 2 6 3 7
Domain 4:
Tag M0: 0 4 1 5 2 6 3 7
On a more complicated machine than my laptop we typically get an N domain for the whole node, plus one domain per socket, one per NUMA domain and one per shared cache.
The working set size, that is how big the dataset for the benchmark is going to be.
This mostly determines which cache levels will be involved: a bigger dataset needs larger and slower caches.
By tuning the working set size and comparing it with the cache sizes we get from likwid-topology, we can get an idea of the properties of the different cache levels.
The benchmark type we want to run. A (long) list of the readily available benchmarks (for the architecture of the machine we are running on) can be obtained with
likwid-bench -a. To understand the “maximum” performance and bandwidth of the machine, we are interested in the peakflops_* and load_* microkernels:
likwid-bench -a | grep 'peakflops_\|load_' # using grep to limit output for readability
load_avx - Double-precision load, optimized for AVX
load_avx512 - Double-precision load, optimized for AVX-512
load_mem - Double-precision load, using non-temporal loads
load_sse - Double-precision load, optimized for SSE
peakflops_avx - Double-precision multiplications and additions with a single load, optimized for AVX
peakflops_avx512 - Double-precision multiplications and additions with a single load, optimized for AVX-512
peakflops_avx512_fma - Double-precision multiplications and additions with a single load, optimized for AVX-512 FMAs
peakflops_avx_fma - Double-precision multiplications and additions with a single load, optimized for AVX FMAs
peakflops_sp - Single-precision multiplications and additions with a single load, only scalar operations
peakflops_sp_avx - Single-precision multiplications and additions with a single load, optimized for AVX
peakflops_sp_avx512 - Single-precision multiplications and additions with a single load, optimized for AVX-512
peakflops_sp_avx512_fma - Single-precision multiplications and additions with a single load, optimized for AVX-512 FMAs
peakflops_sp_avx_fma - Single-precision multiplications and additions with a single load, optimized for AVX FMAs
peakflops_sp_sse - Single-precision multiplications and additions with a single load, optimised for SSE
peakflops_sse - Double-precision multiplications and additions with a single load, optimised for SSE
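To make the choice of working set size less error-prone, one can compare a requested size against the aggregate capacity of each cache level. The sketch below does this for the cache sizes reported by likwid-topology above and for the working set sizes used in the runs of the following sections. Two assumptions are made here: the size suffixes are treated as decimal multipliers (which may not match what likwid-bench does internally, although the classification comes out the same with binary multipliers for these sizes), and the working set of the N domain is assumed to be spread evenly over all cores. All names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB

# Aggregate capacity per level: 4 private L1 and L2 instances, one shared L3
# (sizes taken from the likwid-topology output above).
TOTAL_CACHE_BYTES = [
    ("L1", 4 * 48 * KIB),
    ("L2", 4 * 1280 * KIB),  # 4 x 1.25 MB
    ("L3", 1 * 8 * MIB),
]

# Assumption: decimal multipliers for the size suffixes.
UNIT_BYTES = {"kB": 1e3, "KB": 1e3, "MB": 1e6, "GB": 1e9}


def parse_size(text):
    """Convert a size such as '6.4MB' into a number of bytes."""
    number, unit = text[:-2], text[-2:]
    return float(number) * UNIT_BYTES[unit]


def smallest_fitting_level(working_set):
    """Name of the smallest memory level whose total capacity holds the set."""
    size = parse_size(working_set)
    for level, capacity in TOTAL_CACHE_BYTES:
        if size <= capacity:
            return level
    return "RAM"


for working_set in ("128KB", "168KB", "4MB", "6.4MB", "2GB"):
    print(f"{working_set:>6} -> {smallest_fitting_level(working_set)}")
```

This is only a capacity check: a working set close to the limit of a level will in practice also generate some traffic to the next level.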
Determining the maximum performance
The benchmarks named peakflops_* can give us an estimate of the maximum FLOPs performance of our machine.
In order to run these benchmarks, we choose a small dataset that is going to fit in the L1 CPU cache
(otherwise the performance might be limited by memory bandwidth and not by the actual FP peak performance),
and choose the whole machine as thread domain (N).
As shown before by likwid-topology, the total L1 cache of my CPU is $4 \times 48$ kB $= 192$ kB, so I will choose 128KB as the size of the working set.
A few comments:
By not specifying a number of threads,
likwid-bench will use all the hardware threads in the workgroup by default (here, 8).
We are using all 4 cores on my machine.
maximum performance for scalar, double-precision operations:
likwid-bench -t peakflops -W N:128KB | grep 'MFlops/s' # Using grep here to focus only on the most relevant output
Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s: 30087.87
maximum performance for vector (256-bit), double-precision operations:
likwid-bench -t peakflops_avx -W N:128KB | grep 'MFlops/s' # Using grep here to focus only on the most relevant output
Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s: 117079.56
maximum performance for vector (256-bit), double-precision operations, with fused multiply-add (FMA) operations:
likwid-bench -t peakflops_avx_fma -W N:128KB | grep 'MFlops/s' # Using grep here to focus only on the most relevant output
Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s: 232018.68
Now, comparing with the corresponding single precision operations:
likwid-bench -t peakflops_sp -W N:128KB | grep 'MFlops/s' # scalar
likwid-bench -t peakflops_sp_avx -W N:128KB | grep 'MFlops/s' # vector
likwid-bench -t peakflops_sp_avx_fma -W N:128KB | grep 'MFlops/s' # vector + FMA
Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s: 30032.48
Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s: 210624.63
Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s: 373403.05
A few comments:
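Using vector operations (256-bit) yields about 4 times the scalar performance in double precision, and about 7 times in single precision (against an ideal factor of 8, since a 256-bit register holds 8 single-precision values).
The ‘peak value’ therefore depends on the kind of operation.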
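The ratios behind these comments can be recomputed directly from the MFlops/s values printed above. The sketch below only does arithmetic on those numbers; the dictionary keys are arbitrary labels.

```python
# MFlops/s values copied from the likwid-bench runs above.
mflops = {
    "double": {"scalar": 30087.87, "avx": 117079.56, "avx_fma": 232018.68},
    "single": {"scalar": 30032.48, "avx": 210624.63, "avx_fma": 373403.05},
}

gains = {}
for precision, result in mflops.items():
    gains[precision] = {
        "vector": result["avx"] / result["scalar"],   # AVX over scalar
        "fma": result["avx_fma"] / result["avx"],     # FMA over plain AVX
    }
    print(
        f"{precision} precision: AVX/scalar = {gains[precision]['vector']:.2f}x, "
        f"FMA/AVX = {gains[precision]['fma']:.2f}x"
    )
```

For comparison, a 256-bit register holds 4 double-precision or 8 single-precision values, and an FMA counts as two floating-point operations, so the ideal factors would be 4 (or 8) and 2.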
Determining the maximum bandwidth
We can now use the load_* benchmarks to estimate the memory bandwidth.
Depending on the size of the working set, we can also evaluate the bandwidth for each cache level.
With a working set of 2GB, far larger than the 8 MB L3 cache, we are sure to hit the RAM:
likwid-bench -t load -W N:2GB | grep 'MByte/s'
likwid-bench -t load_avx -W N:2GB | grep 'MByte/s'
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s: 43969.41
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s: 43688.51
With a working set smaller than the L3 cache (but larger than the total L2 cache), we should mostly hit the L3 cache, and thus our bandwidth should be considerably higher:
likwid-bench -t load -W N:6.4MB | grep 'MByte/s'
likwid-bench -t load_avx -W N:6.4MB | grep 'MByte/s'
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s: 114149.95
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s: 190416.17
If we want to measure the bandwidth of the L2 cache, we can use a working set that is smaller than the total L2 cache, which is $4 \times 1.25$ MB $= 5$ MB:
likwid-bench -t load -W N:4MB | grep 'MByte/s'
likwid-bench -t load_avx -W N:4MB | grep 'MByte/s'
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s: 125898.96
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s: 317482.83
Now with a working set that fits in the L1 cache:
likwid-bench -t load -W N:168KB | grep 'MByte/s'
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s: 178798.50
From the L1 cache we should also see a huge difference when using vector load instructions:
likwid-bench -t load_avx -W N:168KB | grep 'MByte/s'
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s: 729787.97
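When collecting many such numbers, it can be more convenient to extract the value in a script than to read it off the grep output. The sketch below parses text in the shape printed above with a regular expression and then compares the load_avx bandwidths of the different levels with the RAM figure. The function name and the sample string are illustrative; the script does not call likwid-bench itself, and the bandwidth values are copied from the runs above.

```python
import re

# Text in the shape printed by the load_avx run on 2GB above.
sample_output = """\
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s: 43688.51
"""


def parse_metric(text, name):
    """Return the first value reported as '<name>: <number>', or None."""
    pattern = rf"^{re.escape(name)}:\s*([0-9.]+)"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return float(match.group(1)) if match else None


# load_avx bandwidth in MByte/s per working set size, from the runs above.
load_avx = {
    "RAM (2GB)": parse_metric(sample_output, "MByte/s"),
    "L3 (6.4MB)": 190416.17,
    "L2 (4MB)": 317482.83,
    "L1 (168KB)": 729787.97,
}

ram = load_avx["RAM (2GB)"]
ratios = {level: value / ram for level, value in load_avx.items()}
for level, value in load_avx.items():
    print(f"{level:>11}: {value / 1000:6.1f} GByte/s ({ratios[level]:4.1f}x RAM)")
```

On this machine, each step up the cache hierarchy roughly doubles the vector load bandwidth or better, with L1 delivering about 17 times the RAM figure.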
What about stores?
likwid-bench -t store -W N:2GB | grep 'MByte/s'
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s: 15096.93
Shortcut: Bypassing the cache with non-temporal stores
If some data must be written to a memory location
that we know will not be accessed again in the near future
(that means, it does not make sense to cache it),
we can use non-temporal stores, which are typically faster.
The benchmark store_mem uses non-temporal stores:
likwid-bench -t store_mem -W N:2GB | grep 'MByte/s'
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s: 37493.32
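One plausible explanation for the gap between the two store results is write-allocate: with regular stores on a write-back cache, each cache line is typically read from memory before it is overwritten, so the real memory traffic is about twice the amount of data stored. The sketch below quantifies this under the assumption that the reported store bandwidth counts only the bytes written; that assumption is not verified here.

```python
# MByte/s values copied from the two store runs above.
store_mbytes = 15096.93      # regular stores
store_nt_mbytes = 37493.32   # non-temporal stores (store_mem)

# Speedup seen by the application when switching to non-temporal stores.
nt_speedup = store_nt_mbytes / store_mbytes

# Assumption: each regular store also triggers a read of the same cache line
# (write-allocate), so the memory bus carries about twice the stored bytes.
store_traffic_estimate = 2 * store_mbytes

print(f"non-temporal / regular store bandwidth: {nt_speedup:.2f}x")
print(f"estimated bus traffic with regular stores: {store_traffic_estimate:.0f} MByte/s")
```

Under that assumption, regular stores already move about 30 GByte/s over the memory bus, which is closer to the non-temporal figure than the raw numbers suggest.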
Appendix
More about likwid-topology
As usual, to get information about likwid-topology
we can use the -h option:
likwid-topology -h
likwid-topology -- Version 5.4.1 (commit: 0123456789)
A tool to print the thread and cache topology on CPUs and GPUs.
Options:
-h, --help Help message
-v, --version Version information
-V, --verbose <level> Set verbosity
-c, --caches List cache information
-C, --clock Measure processor clock
-G, --gpus List GPU information
-O CSV output
-o, --output <file> Store output to file. (Optional: Apply text filter)
-g Graphical output
More about likwid-bench
We can get more information with the -h option:
likwid-bench -h
Threaded Memory Hierarchy Benchmark -- Version 5.4.1
Supported Options:
-h/--help Help message
-v/--version Version
-a/--all List available benchmarks
-d/--delim Delimiter used for physical hwthread list (default ,)
-p/--printdomains List available thread domains
or the physical ids of the hwthreads selected by the -c expression
-s/--runtime <TIME> Seconds to run the test minimally (default 1)
If resulting iteration count is below 10, it is normalized to 10.
-i/--iters <ITERS> Specify the number of iterations per thread manually.
-l/--list <TEST> list properties of benchmark
-t/--test <TEST> type of test
-w/--workgroup <thread_domain>:<size>[:<num_threads>[:<chunk size>:<stride>]-<streamId>:<domain_id>[:<offset>]
-W/--Workgroup <thread_domain>:<size>[:<num_threads>[:<chunk size>:<stride>]]
<size> in kB, MB or GB (mandatory)
For dynamically loaded benchmarks
-f/--tempdir <PATH> Specify a folder for the temporary files. default: /tmp
-o/--asmout <FILE> Save generated assembly to file
Difference between -w and -W :
-w allocates the streams in the thread_domain with one thread and support placement of streams
-W allocates the streams chunk-wise by each thread in the thread_domain
Usage:
# Run the store benchmark on all CPUs of the system with a vector size of 1 GB
likwid-bench -t store -w S0:1GB
# Run the copy benchmark on one CPU at CPU socket 0 with a vector size of 100kB
likwid-bench -t copy -w S0:100kB:1
# Run the copy benchmark on one CPU at CPU socket 0 with a vector size of 100MB but place one stream on CPU socket 1
likwid-bench -t copy -w S0:100MB:1-0:S0,1:S1
See also the official wiki page for less terse explanations.
Other tools
Another “simple” way to measure latency is “pointer chasing” (see this optional episode for an example).
The Intel mlc (Memory Latency Checker) can not only measure latencies, but also determine the latency-bandwidth curve (once the bandwidth is saturated, latency tends to increase too).
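The idea behind pointer chasing can be sketched in a few lines: build an array in which every element holds the index of the next element to visit, arranged as one random cycle, and then follow the chain so that each load depends on the previous one. The sketch below uses Sattolo's algorithm to build the cycle. It is meant to show the data structure only: in Python the timing is dominated by interpreter overhead, so a real measurement would use a compiled language. All names are hypothetical.

```python
import random
import time


def single_cycle_permutation(n, seed=0):
    """Sattolo's algorithm: a random permutation forming one cycle of length n."""
    rng = random.Random(seed)
    next_index = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def chase(next_index, steps, start=0):
    """Follow the chain for a number of steps; each load depends on the last."""
    idx = start
    for _ in range(steps):
        idx = next_index[idx]
    return idx


n = 1 << 16
chain = single_cycle_permutation(n)

t0 = time.perf_counter()
final = chase(chain, n)
elapsed = time.perf_counter() - t0

# After exactly n steps on a single cycle of length n we are back at the start.
print(f"returned to start: {final == 0}, {elapsed / n * 1e9:.1f} ns per step")
```

The random single cycle matters because it defeats hardware prefetchers and guarantees that every element is visited once before any is revisited.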
Credits
Apart from personal experience, the content of this notebook is inspired by the official LIKWID wiki, and in particular by the tutorial on the empirical roofline model.