A brief LIKWID demo: topology and microbenchmarks

LIKWID is a suite of tools for performance diagnostics. In this short demo we will use two of them, likwid-topology and likwid-bench, to inspect the properties of the hardware.

Inspect node structure with `likwid-topology`

likwid-topology aggregates information from different sources and reports the topology of the CPU, its caches and its NUMA domains (NVIDIA GPU information can be obtained with -G).

Basic information can be obtained with:

likwid-topology --clock --caches

--------------------------------------------------------------------------------
CPU name:	11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz
CPU type:	Intel Tigerlake processor
CPU stepping:	1
CPU clock:	2.61 GHz
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:		1
CPU dies:		1
Cores per socket:	4
Threads per core:	2
--------------------------------------------------------------------------------
HWThread        Thread        Core        Die        Socket        Available
0               0             0           0          0             *                
1               0             1           0          0             *                
2               0             2           0          0             *                
3               0             3           0          0             *                
4               1             0           0          0             *                
5               1             1           0          0             *                
6               1             2           0          0             *                
7               1             3           0          0             *                
--------------------------------------------------------------------------------
Socket 0:		( 0 4 1 5 2 6 3 7 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level:			1
Size:			48 kB
Type:			Data cache
Associativity:		12
Number of sets:		64
Cache line size:	64
Cache type:		Non Inclusive
Shared by threads:	2
Cache groups:		( 0 4 ) ( 1 5 ) ( 2 6 ) ( 3 7 )
--------------------------------------------------------------------------------
Level:			2
Size:			1.25 MB
Type:			Unified cache
Associativity:		20
Number of sets:		1024
Cache line size:	64
Cache type:		Non Inclusive
Shared by threads:	2
Cache groups:		( 0 4 ) ( 1 5 ) ( 2 6 ) ( 3 7 )
--------------------------------------------------------------------------------
Level:			3
Size:			8 MB
Type:			Unified cache
Associativity:		8
Number of sets:		16384
Cache line size:	64
Cache type:		Non Inclusive
Shared by threads:	8
Cache groups:		( 0 4 1 5 2 6 3 7 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains:		1
--------------------------------------------------------------------------------
Domain:			0
Processors:		( 0 4 1 5 2 6 3 7 )
Distances:		10
Free memory:		1738.41 MB
Total memory:		15593.6 MB
--------------------------------------------------------------------------------

Notably, we get:

Basic CPU and threading information (thread-to-core correspondence).
Cache information: in this case, L1 and L2 are private to a core (that is, shared by its pair of hardware threads), while the L3 cache is shared by all cores and threads. We also get the sizes of the caches, which we can later compare with some experiments.
NUMA topology information: in this case, there is only one domain, to which all processors belong.
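The reported cache sizes can be cross-checked against the other fields of the same output, since the capacity of a set-associative cache is associativity times number of sets times line size. The following is a minimal sketch with the values typed in by hand from the output above. The names `CACHES`, `cache_size_bytes` and `aggregate_size_bytes` are illustrative and not part of LIKWID, and the sketch assumes that the kB and MB in the report are powers of two, which is what the arithmetic suggests.

```python
# Cache geometry as reported by likwid-topology above:
# (associativity, number of sets, line size in bytes, hardware threads sharing one instance)
CACHES = {
    "L1": (12, 64, 64, 2),
    "L2": (20, 1024, 64, 2),
    "L3": (8, 16384, 64, 8),
}
HW_THREADS = 8  # 4 cores x 2 threads per core


def cache_size_bytes(ways, sets, line_size):
    """Capacity of one cache instance: ways x sets x line size."""
    return ways * sets * line_size


def aggregate_size_bytes(level):
    """Capacity summed over all instances of a cache level on the node."""
    ways, sets, line_size, shared_by = CACHES[level]
    instances = HW_THREADS // shared_by
    return instances * cache_size_bytes(ways, sets, line_size)


for level, (ways, sets, line_size, shared_by) in CACHES.items():
    one = cache_size_bytes(ways, sets, line_size)
    total = aggregate_size_bytes(level)
    print(f"{level}: {one / 1024:g} KiB per instance, {total / 1024:g} KiB in total")
```

The per-instance values reproduce the 48 kB, 1.25 MB and 8 MB of the report, and the totals are the numbers to compare working set sizes against when all cores are used.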

Maximum performance and bandwidth estimates: the Roofline model

The Roofline Model is a performance model that assumes that there are two main performance bottlenecks:

peak floating point performance
the memory bandwidth.

In this model, the relevant bottleneck for the kernel being investigated (typically a loop) is determined by its arithmetic intensity (also called operational intensity): the ratio between the number of FP operations performed and the memory traffic in bytes, that is, the number of bytes that have to be moved from or to a particular memory level (a cache level or RAM).

Figure: roofline plot (by Giu.natale, own work, CC BY-SA 4.0)

The maximum possible FP performance of a kernel is given by the “roofline” (probably so called because its shape resembles a roof):

kernels which require few operations per byte (low arithmetic intensity, on the left side of the plot above) will ultimately be memory bound
kernels which perform enough operations per byte (on the right side of the plot above) will be compute bound, that is, able to run at the maximum FP performance of the machine.

Most codes in use today are memory bound.

Notice that if our code is latency limited, its performance will stay well below the roofline.
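
In formula form, the model can be sketched as follows. The symbols are notation introduced here for illustration: $P$ is the attainable performance, $P_{\text{peak}}$ the peak floating point performance, $b$ the bandwidth of the memory level considered and $I$ the arithmetic intensity of the kernel.

$$
P(I) = \min\left(P_{\text{peak}},\; I \cdot b\right),
\qquad
I = \frac{\text{FP operations}}{\text{bytes moved}}
$$

The two regimes meet at $I = P_{\text{peak}} / b$: kernels with a lower intensity are limited by bandwidth, kernels with a higher intensity by the FP peak.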

`likwid-bench`

LIKWID gives us a way to determine the characteristics of our machine using likwid-bench, a suite of microkernel benchmarks.

In order to use likwid-bench we need to specify, at least:

The “thread domain”, that is, the set of hardware threads on which the benchmark is going to run.
To list the available thread domains, we can use the -p option:

likwid-bench -p

Number of Domains 5
Domain 0:
	Tag N: 0 4 1 5 2 6 3 7
Domain 1:
	Tag S0: 0 4 1 5 2 6 3 7
Domain 2:
	Tag D0: 0 4 1 5 2 6 3 7
Domain 3:
	Tag C0: 0 4 1 5 2 6 3 7
Domain 4:
	Tag M0: 0 4 1 5 2 6 3 7

On a more complicated machine than my laptop we typically get an N domain for the whole node, plus one domain per socket, one per NUMA domain and one per shared cache.

The working set size, that is, how big the dataset for the benchmark is going to be.
This mostly determines which cache levels will be involved: a bigger dataset will need larger and slower caches.
By tuning the working set size and comparing it with the cache sizes reported by likwid-topology, we can get an idea of the properties of the different cache levels.

The benchmark type we want to run. A (long) list of the readily available benchmarks (for the architecture of the machine we are running on) can be obtained with likwid-bench -a. To understand the “maximum” performance and bandwidth of the machine, we are interested in the peakflops_* and load_* microkernels:

likwid-bench -a | grep 'peakflops_\|load_'   # using grep to limit output for readability

load_avx - Double-precision load, optimized for AVX
load_avx512 - Double-precision load, optimized for AVX-512
load_mem - Double-precision load, using non-temporal loads
load_sse - Double-precision load, optimized for SSE
peakflops_avx - Double-precision multiplications and additions with a single load, optimized for AVX
peakflops_avx512 - Double-precision multiplications and additions with a single load, optimized for AVX-512
peakflops_avx512_fma - Double-precision multiplications and additions with a single load, optimized for AVX-512 FMAs
peakflops_avx_fma - Double-precision multiplications and additions with a single load, optimized for AVX FMAs
peakflops_sp - Single-precision multiplications and additions with a single load, only scalar operations
peakflops_sp_avx - Single-precision multiplications and additions with a single load, optimized for AVX
peakflops_sp_avx512 - Single-precision multiplications and additions with a single load, optimized for AVX-512
peakflops_sp_avx512_fma - Single-precision multiplications and additions with a single load, optimized for AVX-512 FMAs
peakflops_sp_avx_fma - Single-precision multiplications and additions with a single load, optimized for AVX FMAs
peakflops_sp_sse - Single-precision multiplications and additions with a single load, optimised for SSE
peakflops_sse - Double-precision multiplications and additions with a single load, optimised for SSE

Determining the maximum performance

The benchmarks named peakflops_* give us an estimate of the peak floating point performance of our machine.
To run these benchmarks, we choose a small dataset that fits in the L1 cache (otherwise the performance might be limited by memory bandwidth rather than by the actual FP peak performance) and choose the whole machine as thread domain (N). As shown before by likwid-topology, the total L1 data cache of my CPU is $4 \times 48$ kB $= 192$ kB, so I will choose 128KB as the size of the working set.

A few comments:

Since we do not specify a number of threads, likwid-bench uses all the hardware threads in the workgroup by default (here, 8).
This means that all 4 cores of my machine are in use.

Maximum performance for scalar, double-precision operations:

likwid-bench -t peakflops -W N:128KB | grep 'MFlops/s'   # Using grep here to focus only on the most relevant output

Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s:		30087.87
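
Instead of filtering with grep, the value can also be extracted in Python, which is handy when the number is needed for further calculations in the notebook. The following is a minimal sketch: `extract_metric` and `run_bench` are hypothetical helper names, the parser assumes the `name:<whitespace>value` layout seen in the output above, and `run_bench` uses only the -t and -W options already shown. It is defined but not called here, since it needs LIKWID to be installed.

```python
import re
import subprocess


def extract_metric(output, name):
    """Return the value of a 'name: value' line in likwid-bench output, or None if absent."""
    pattern = re.compile(rf"^{re.escape(name)}:\s+([0-9.]+)\s*$", re.MULTILINE)
    match = pattern.search(output)
    return float(match.group(1)) if match else None


def run_bench(test, workgroup):
    """Run likwid-bench (must be installed) and return its standard output as text."""
    result = subprocess.run(
        ["likwid-bench", "-t", test, "-W", workgroup],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Sample text copied from the output above, so the parser can be tried without LIKWID.
sample = (
    "Running without Marker API. Activate Marker API with -m on commandline.\n"
    "MFlops/s:\t\t30087.87\n"
)
print(extract_metric(sample, "MFlops/s"))
```

On a machine with LIKWID installed, `extract_metric(run_bench("peakflops", "N:128KB"), "MFlops/s")` would combine the two steps.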

Maximum performance for vector (256-bit), double-precision operations:

likwid-bench -t peakflops_avx -W N:128KB | grep 'MFlops/s'   # Using grep here to focus only on the most relevant output

Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s:		117079.56

Maximum performance for vector (256-bit), double-precision operations with fused multiply-add (FMA) instructions:

likwid-bench -t peakflops_avx_fma -W N:128KB | grep 'MFlops/s'   # Using grep here to focus only on the most relevant output

Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s:		232018.68

Now, comparing with the corresponding single precision operations:

likwid-bench -t peakflops_sp -W N:128KB | grep 'MFlops/s'   # scalar
likwid-bench -t peakflops_sp_avx -W N:128KB | grep 'MFlops/s'   # vector
likwid-bench -t peakflops_sp_avx_fma -W N:128KB | grep 'MFlops/s'   # vector + FMA

Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s:		30032.48
Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s:		210624.63
Running without Marker API. Activate Marker API with -m on commandline.
MFlops/s:		373403.05

A few comments:

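Using 256-bit vector operations yields about 4 times the scalar performance in double precision (4 doubles per register) and about 7 times in single precision (8 floats per register, so the ideal factor would be 8).
The ‘peak value’ depends on the kind of operation: with FMA instructions the vector results increase by a further factor of about 1.8 to 2.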
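
The ratios behind these comments can be recomputed from the measurements. The following is a small sketch with the MFlops/s values copied by hand from the runs above; the dictionary layout and the `speedup` helper are illustrative choices, not LIKWID output.

```python
# MFlops/s values copied from the runs above.
measured = {
    ("double", "scalar"): 30087.87,
    ("double", "avx"): 117079.56,
    ("double", "avx_fma"): 232018.68,
    ("single", "scalar"): 30032.48,
    ("single", "avx"): 210624.63,
    ("single", "avx_fma"): 373403.05,
}

# Number of elements in a 256-bit vector register: 256 / 64 and 256 / 32.
lanes = {"double": 4, "single": 8}


def speedup(precision, variant, baseline):
    """Ratio between two measured results of the same precision."""
    return measured[(precision, variant)] / measured[(precision, baseline)]


for precision in ("double", "single"):
    vec = speedup(precision, "avx", "scalar")
    fma = speedup(precision, "avx_fma", "avx")
    print(f"{precision}: AVX/scalar = {vec:.2f} (ideal {lanes[precision]}), "
          f"FMA/AVX = {fma:.2f}")
```

The numbers come from single runs on a laptop, so they will fluctuate from run to run.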

Determining the maximum bandwidth

We can now use the load_* benchmarks to find out the memory bandwidth. Depending on the size of the working set, we can also evaluate the bandwidth of each cache level.
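
To decide which working set size targets which level, a rough capacity argument is enough. The sketch below compares a size with the aggregate capacity of each cache level, as reported by likwid-topology for this machine. The helper names are hypothetical, and decimal units for the sizes passed to -W are an assumption; for the sizes used in this demo the result would be the same with binary units.

```python
KIB = 1024
MIB = 1024 * KIB

# Aggregate capacity per level on this node, from the likwid-topology output:
# 4 x 48 kB L1, 4 x 1.25 MB L2, one shared 8 MB L3.
LEVELS = [
    ("L1", 4 * 48 * KIB),
    ("L2", 4 * 1280 * KIB),
    ("L3", 8 * MIB),
]

# Assumption: decimal units for the sizes passed to -W.
UNITS = {"kB": 1000, "KB": 1000, "MB": 1000**2, "GB": 1000**3}


def parse_size(text):
    """Convert a size such as '6.4MB' into bytes."""
    for unit, factor in UNITS.items():
        if text.endswith(unit):
            return float(text[: -len(unit)]) * factor
    raise ValueError(f"unknown unit in {text!r}")


def smallest_level_that_fits(size):
    """Name of the first level whose aggregate capacity can hold the working set."""
    wanted = parse_size(size)
    for name, capacity in LEVELS:
        if wanted <= capacity:
            return name
    return "RAM"


for size in ("168KB", "4MB", "6.4MB", "2GB"):
    print(size, "->", smallest_level_that_fits(size))
```

This only tells us where the data can reside when all cores are used; the measured bandwidth near a capacity limit is usually a mix of two levels.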

With a working set of 2GB, far larger than the 8 MB L3 cache, we will certainly hit the RAM:

likwid-bench -t load -W N:2GB | grep 'MByte/s'
likwid-bench -t load_avx -W N:2GB | grep 'MByte/s'

Running without Marker API. Activate Marker API with -m on commandline.
MByte/s:		43969.41
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s:		43688.51

With a working set smaller than the L3 cache (but larger than the total L2 cache), we should mostly hit the L3 cache, and thus our bandwidth should be considerably higher:

likwid-bench -t load -W N:6.4MB | grep 'MByte/s'
likwid-bench -t load_avx -W N:6.4MB | grep 'MByte/s'

Running without Marker API. Activate Marker API with -m on commandline.
MByte/s:		114149.95
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s:		190416.17

If we want to measure the bandwidth of the L2 cache, we can use a working set that is smaller than the total L2 cache, which is $4 \times 1.25$ MB $= 5$ MB:

likwid-bench -t load -W N:4MB | grep 'MByte/s'
likwid-bench -t load_avx -W N:4MB | grep 'MByte/s'

Running without Marker API. Activate Marker API with -m on commandline.
MByte/s:		125898.96
Running without Marker API. Activate Marker API with -m on commandline.
MByte/s:		317482.83

With a working set that fits in the total L1 cache ($4 \times 48$ kB):

likwid-bench -t load -W N:168KB | grep 'MByte/s'

Running without Marker API. Activate Marker API with -m on commandline.
MByte/s:		178798.50

In the L1 cache we should also see a large difference when using vector load instructions:

likwid-bench -t load_avx -W N:168KB | grep 'MByte/s'

Running without Marker API. Activate Marker API with -m on commandline.
MByte/s:		729787.97
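
With a peak performance and a bandwidth for each memory level, the roofline for this machine can be put together. The sketch below uses the values measured above (the FMA double-precision peak and, for each working set size, the better of load and load_avx). The function names and the example intensity of 0.125 flop/byte are illustrative assumptions.

```python
# Values copied from the measurements above.
PEAK_MFLOPS = 232018.68  # peakflops_avx_fma, double precision

# MByte/s, best of load / load_avx for each working set size.
BANDWIDTH_MBYTES = {
    "RAM": 43969.41,
    "L3": 190416.17,
    "L2": 317482.83,
    "L1": 729787.97,
}


def attainable_mflops(intensity, level):
    """Roofline estimate: min(peak, intensity x bandwidth); intensity in flop/byte."""
    return min(PEAK_MFLOPS, intensity * BANDWIDTH_MBYTES[level])


def ridge_point(level):
    """Arithmetic intensity (flop/byte) above which a kernel becomes compute bound."""
    return PEAK_MFLOPS / BANDWIDTH_MBYTES[level]


for level in BANDWIDTH_MBYTES:
    print(f"{level}: ridge point at {ridge_point(level):.2f} flop/byte")

# Hypothetical kernel with 0.125 flop/byte (for example, one flop per 8-byte double loaded).
print(f"{attainable_mflops(0.125, 'RAM'):.0f} MFlops/s attainable from RAM")
```

A kernel streaming from RAM needs more than about 5 flop per byte to reach the FP peak on this machine, which illustrates why low-intensity loops end up memory bound.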

What about stores?

likwid-bench -t store -W N:2GB | grep 'MByte/s'

Running without Marker API. Activate Marker API with -m on commandline.
MByte/s:		15096.93

Shortcut: Bypassing the cache with non-temporal stores

If some data must be written to a memory location that we know will not be accessed again in the near future (so that caching it would be pointless), we can use non-temporal stores, which are typically faster. The benchmark store_mem uses non-temporal stores:

likwid-bench -t store_mem -W N:2GB | grep 'MByte/s'

Running without Marker API. Activate Marker API with -m on commandline.
MByte/s:		37493.32
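
The following sketch puts the three 2 GB measurements side by side. The factor of two in the last step is an assumption rather than a measurement: it supposes a write-allocate cache, where each cache line touched by a regular store is first read from memory and later written back, so the real traffic is about twice the reported store bandwidth. Non-temporal stores skip the read.

```python
# MByte/s values copied from the 2 GB runs above.
load = 43969.41
store = 15096.93
store_nt = 37493.32  # store_mem, non-temporal stores

print(f"non-temporal / regular store: {store_nt / store:.2f}x")
print(f"regular store / load:         {store / load:.2f}x")

# Assumption: with write-allocate, every stored cache line is read once and written once,
# so the memory traffic is about twice the reported store bandwidth.
estimated_traffic = 2 * store
print(f"estimated traffic for regular stores: {estimated_traffic:.0f} MByte/s")
```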

Appendix

More about `likwid-topology`

As usual, to get information about likwid-topology we can use the -h option:

likwid-topology -h

likwid-topology -- Version 5.4.1 (commit: 0123456789)
A tool to print the thread and cache topology on CPUs and GPUs.

Options:
-h, --help		 Help message
-v, --version		 Version information
-V, --verbose <level>	 Set verbosity
-c, --caches		 List cache information
-C, --clock		 Measure processor clock
-G, --gpus		 List GPU information
-O			 CSV output
-o, --output <file>	 Store output to file. (Optional: Apply text filter)
-g			 Graphical output

More about `likwid-bench`

We can get more information with the -h option:

likwid-bench -h

Threaded Memory Hierarchy Benchmark -- Version 5.4.1 


Supported Options:
-h/--help		 Help message
-v/--version		 Version
-a/--all		 List available benchmarks 
-d/--delim		 Delimiter used for physical hwthread list (default ,) 
-p/--printdomains	 List available thread domains
				 or the physical ids of the hwthreads selected by the -c expression 
-s/--runtime <TIME>	 Seconds to run the test minimally (default 1)
				 If resulting iteration count is below 10, it is normalized to 10.
-i/--iters <ITERS>	 Specify the number of iterations per thread manually. 
-l/--list <TEST>	 list properties of benchmark 
-t/--test <TEST>	 type of test 
-w/--workgroup		 <thread_domain>:<size>[:<num_threads>[:<chunk size>:<stride>]-<streamId>:<domain_id>[:<offset>]
-W/--Workgroup		 <thread_domain>:<size>[:<num_threads>[:<chunk size>:<stride>]]
				 <size> in kB, MB or GB (mandatory)
For dynamically loaded benchmarks
-f/--tempdir <PATH>	 Specify a folder for the temporary files. default: /tmp
-o/--asmout <FILE>	 Save generated assembly to file

Difference between -w and -W :
-w allocates the streams in the thread_domain with one thread and support placement of streams
-W allocates the streams chunk-wise by each thread in the thread_domain

Usage: 
# Run the store benchmark on all CPUs of the system with a vector size of 1 GB
likwid-bench -t store -w S0:1GB
# Run the copy benchmark on one CPU at CPU socket 0 with a vector size of 100kB
likwid-bench -t copy -w S0:100kB:1
# Run the copy benchmark on one CPU at CPU socket 0 with a vector size of 100MB but place one stream on CPU socket 1
likwid-bench -t copy -w S0:100MB:1-0:S0,1:S1

See also the official wiki page for less terse explanations.
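
As an illustration of the `<thread_domain>:<size>[:<num_threads>]` syntax described in the help text, the sketch below builds the command lines for a sweep over working set sizes, which is one way to see the transitions between cache levels. The helper names `workgroup` and `sweep_commands` are hypothetical, only the -t and -W options shown above are used, and the commands are printed rather than executed.

```python
def workgroup(domain, size_kb, num_threads=None):
    """Build '<thread_domain>:<size>[:<num_threads>]' with the size given in kB."""
    text = f"{domain}:{size_kb}kB"
    if num_threads is not None:
        text += f":{num_threads}"
    return text


def sweep_commands(test, domain, start_kb, stop_kb, factor=2):
    """Argument lists for likwid-bench runs with geometrically growing working sets."""
    commands = []
    size = start_kb
    while size <= stop_kb:
        commands.append(["likwid-bench", "-t", test, "-W", workgroup(domain, size)])
        size *= factor
    return commands


for command in sweep_commands("load_avx", "N", 64, 1024):
    print(" ".join(command))
```

Each argument list could be passed to `subprocess.run` on a machine where LIKWID is installed.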

Other tools

A “simple” way to measure latency is “pointer chasing” (see this optional episode for an example).

The Intel Memory Latency Checker (mlc) can not only measure latencies, but also determine the latency-bandwidth curve (as the bandwidth approaches saturation, latency tends to increase too).

Credits

Apart from personal experience, the content of this notebook is inspired by the official LIKWID wiki, and in particular by the tutorial on the empirical roofline model.