Walkthrough: Dense Matrix-Matrix multiplications (Part 1)
The Dense Matrix-Matrix multiplication has already very well optimized solutions, but we can learn something along the way by trying to optimize it, starting from the naive version.
Our code is computing the quantity
\(D = B^T A B\)
Baseline: Naive version, naive compilation
Note
To follow along: commit 7a83cbd0a
git checkout 7a83cbd0a
HEAD is now at 7a83cbd Matmul baseline
Compilation:
gcc -o main main.c -lm
Basic timing:
time ./main
real 0m24,341s
user 0m24,321s
sys 0m0,018s
The cpu is very busy. What is happening? In our code, we have some initialization functions, and a pair of matrix multiplications.
We expect the initialization functions to be negligible. Is this the case?
Let’s have a look with perf.
First of all, we compile with debug symbols on:
gcc -g -o main main.c -lm
Then we can run our code under perf:
perf record -g ./main
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict and /proc/sys/kernel/perf_event_paranoid.
Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.
Samples in kernel modules won't be resolved at all.
If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
[ perf record: Woken up 29 times to write data ]
[ perf record: Captured and wrote 7,423 MB perf.data (96993 samples) ]
We can have a look at the recorded data:
# Using --stdio and head to stay on the notebook, you don't have to
perf report -g --stdio | head -n 40
Warning:
Kernel address maps (/proc/{kallsyms,modules}) were restricted.
Check /proc/sys/kernel/kptr_restrict before running 'perf record'.
As no suitable kallsyms nor vmlinux was found, kernel samples
can't be resolved.
Samples in kernel modules can't be resolved as well.
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 96K of event 'cycles:P'
# Event count (approx.): 103878286417
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... .................... ..........................
#
99.99% 0.00% main libc.so.6 [.] __libc_start_call_main
|
---__libc_start_call_main
main
|
|--63.72%--mat_t_mat_mul
|
--36.09%--matmat_mul
99.99% 0.00% main main [.] main
|
---main
|
|--63.72%--mat_t_mat_mul
|
--36.09%--matmat_mul
63.72% 63.58% main main [.] mat_t_mat_mul
|
--63.58%--__libc_start_call_main
main
mat_t_mat_mul
36.09% 36.01% main main [.] matmat_mul
|
--36.01%--__libc_start_call_main
main
matmat_mul
We note that:
The callgraph is very simple, so from now on we can avoid using
-gwithperfthe function
matmat_mul(the one that does \(C = AB\)) takes half of the time as themat_t_mat_mul(the one that does \(D = B^T C\)).
What is going wrong? We might have an idea about memory layouts and access patterns.
Investigation with the LIKWID Marker API
Note
To follow along: a9d5657ed
git checkout a9d5657ed
Previous HEAD position was 7a83cbd Matmul baseline
HEAD is now at a9d5657 Minimal LIKWID marker instrumentation
We will use LIKWID to carry on the investigation. For this, we need to manually instrument the code:
git diff HEAD~
diff --git a/matmul/main.c b/matmul/main.c
index 7efd5ce..48cb3eb 100644
--- a/matmul/main.c
+++ b/matmul/main.c
@@ -1,6 +1,6 @@
#include "matmul_v1.c"
#include "init.c"
-
+#include <likwid-marker.h>
int main(){
@@ -14,10 +14,20 @@ int main(){
// We want to compute D = Bt A B
// C = A B
// D = Bt C
+
+ LIKWID_MARKER_INIT;
+ LIKWID_MARKER_REGISTER("MM");
+ LIKWID_MARKER_REGISTER("MTM");
+
+ LIKWID_MARKER_START("MM");
matmat_mul(a,b,c,N);
+ LIKWID_MARKER_STOP("MM");
+ LIKWID_MARKER_START("MTM");
mat_t_mat_mul(b,c,d,N);
+ LIKWID_MARKER_STOP("MTM");
free(d);free(c);free(b);free(a);
+ LIKWID_MARKER_CLOSE;
return 0;
}
And then recompile:
gcc -DLIKWID_PERFMON -o main main.c -lm -llikwid
Once the compilation has succeeded (troubleshooting: see this tutorial if libraries and headers are not found), we can check which performance groups we can investigate with LIKWID:
likwid-perfctr -a
Group name Description
--------------------------------------------------------------------------------
ENERGY Power and Energy consumption
FLOPS_DP Double Precision MFLOP/s
DATA Load to store ratio
DIVIDE Divide unit information
FLOPS_SP Single Precision MFLOP/s
BRANCH Branch prediction miss rate/ratio
FLOPS_AVX Packed AVX MFLOP/s
Since our code uses only floats, we expect the FLOPS_SP performance group to be interesting:
likwid-perfctr -m -C S0:0 -g FLOPS_SP ./main
--------------------------------------------------------------------------------
CPU name: 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz
CPU type: Intel Tigerlake processor
CPU clock: 2.61 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region MM, Group 1: FLOPS_SP
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 9.417304 |
| call count | 1 |
+-------------------+------------+
+------------------------------------------+---------+--------------+
| Event | Counter | HWThread 0 |
+------------------------------------------+---------+--------------+
| INSTR_RETIRED_ANY | FIXC0 | 104535300000 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 37669540000 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 24342280000 |
| FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE | PMC0 | 0 |
| FP_ARITH_INST_RETIRED_SCALAR_SINGLE | PMC1 | 2985984000 |
| FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE | PMC2 | 0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE | PMC3 | 0 |
+------------------------------------------+---------+--------------+
+-------------------------+------------+
| Metric | HWThread 0 |
+-------------------------+------------+
| Runtime (RDTSC) [s] | 9.4173 |
| Runtime unhalted [s] | 14.4266 |
| Clock [MHz] | 4040.6872 |
| CPI | 0.3604 |
| SP [MFLOP/s] | 317.0742 |
| AVX SP [MFLOP/s] | 0 |
| AVX512 SP [MFLOP/s] | 0 |
| Packed [MUOPS/s] | 0 |
| Scalar [MUOPS/s] | 317.0742 |
| Vectorization ratio [%] | 0 |
+-------------------------+------------+
Region MTM, Group 1: FLOPS_SP
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 17.361890 |
| call count | 1 |
+-------------------+------------+
+------------------------------------------+---------+--------------+
| Event | Counter | HWThread 0 |
+------------------------------------------+---------+--------------+
| INSTR_RETIRED_ANY | FIXC0 | 104541300000 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 72267170000 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 44929640000 |
| FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE | PMC0 | 0 |
| FP_ARITH_INST_RETIRED_SCALAR_SINGLE | PMC1 | 2985984000 |
| FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE | PMC2 | 0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE | PMC3 | 0 |
+------------------------------------------+---------+--------------+
+-------------------------+------------+
| Metric | HWThread 0 |
+-------------------------+------------+
| Runtime (RDTSC) [s] | 17.3619 |
| Runtime unhalted [s] | 27.6767 |
| Clock [MHz] | 4199.8549 |
| CPI | 0.6913 |
| SP [MFLOP/s] | 171.9850 |
| AVX SP [MFLOP/s] | 0 |
| AVX512 SP [MFLOP/s] | 0 |
| Packed [MUOPS/s] | 0 |
| Scalar [MUOPS/s] | 171.9850 |
| Vectorization ratio [%] | 0 |
+-------------------------+------------+
An observation: the two functions do the same amount of operations, but for the matrix-time-matrix multiplication we have a ~70% better performance than for the matrix-transpose-time-matrix multiplication.
This already suggests that at least one of the two function could be written in a way that is far from optimal.
Before we go further, we should ask ourselves:
How far are we from the maximum performance?
How much more performance would we be able to reach?