Walkthrough: Dense Matrix-Matrix multiplications (Part 1)

The Dense Matrix-Matrix multiplication has already very well optimized solutions, but we can learn something along the way by trying to optimize it, starting from the naive version.

Our code is computing the quantity

\(D = B^T A B\)

Baseline: Naive version, naive compilation

Note

To follow along: commit 7a83cbd0a

git checkout 7a83cbd0a

HEAD is now at 7a83cbd Matmul baseline

Compilation:

gcc -o main main.c -lm

Basic timing:

time ./main

real	0m24,341s
user	0m24,321s
sys	0m0,018s

The cpu is very busy. What is happening? In our code, we have some initialization functions, and a pair of matrix multiplications.

We expect the initialization functions to be negligible. Is this the case?

Let’s have a look with perf. First of all, we compile with debug symbols on:

gcc -g -o main main.c -lm

Then we can run our code under perf:

perf record -g ./main

WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict and /proc/sys/kernel/perf_event_paranoid.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
[ perf record: Woken up 29 times to write data ]
[ perf record: Captured and wrote 7,423 MB perf.data (96993 samples) ]

We can have a look at the recorded data:

# Using --stdio and head to stay on the notebook, you don't have to
perf report -g --stdio | head -n 40

Warning:
Kernel address maps (/proc/{kallsyms,modules}) were restricted.

Check /proc/sys/kernel/kptr_restrict before running 'perf record'.

As no suitable kallsyms nor vmlinux was found, kernel samples
can't be resolved.

Samples in kernel modules can't be resolved as well.

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 96K of event 'cycles:P'
# Event count (approx.): 103878286417
#
# Children      Self  Command  Shared Object         Symbol                    
# ........  ........  .......  ....................  ..........................
#
    99.99%     0.00%  main     libc.so.6             [.] __libc_start_call_main
            |
            ---__libc_start_call_main
               main
               |          
               |--63.72%--mat_t_mat_mul
               |          
                --36.09%--matmat_mul

    99.99%     0.00%  main     main                  [.] main
            |
            ---main
               |          
               |--63.72%--mat_t_mat_mul
               |          
                --36.09%--matmat_mul

    63.72%    63.58%  main     main                  [.] mat_t_mat_mul
            |          
             --63.58%--__libc_start_call_main
                       main
                       mat_t_mat_mul

    36.09%    36.01%  main     main                  [.] matmat_mul
            |          
             --36.01%--__libc_start_call_main
                       main
                       matmat_mul

We note that:

The callgraph is very simple, so from now on we can avoid using -g with perf
the function matmat_mul (the one that does \(C = AB\)) takes half of the time as the mat_t_mat_mul (the one that does \(D = B^T C\)).

What is going wrong? We might have an idea about memory layouts and access patterns.

Investigation with the LIKWID Marker API

Note

To follow along: a9d5657ed

git checkout a9d5657ed

Previous HEAD position was 7a83cbd Matmul baseline
HEAD is now at a9d5657 Minimal LIKWID marker instrumentation

We will use LIKWID to carry on the investigation. For this, we need to manually instrument the code:

git diff HEAD~

diff --git a/matmul/main.c b/matmul/main.c
index 7efd5ce..48cb3eb 100644
--- a/matmul/main.c
+++ b/matmul/main.c
@@ -1,6 +1,6 @@
 #include "matmul_v1.c"
 #include "init.c"
-
+#include <likwid-marker.h>
 
 int main(){
 
@@ -14,10 +14,20 @@ int main(){
     // We want to compute D = Bt A B
     // C = A B
     // D = Bt C
+
+    LIKWID_MARKER_INIT;
+    LIKWID_MARKER_REGISTER("MM");
+    LIKWID_MARKER_REGISTER("MTM");
+
+    LIKWID_MARKER_START("MM");
     matmat_mul(a,b,c,N);
+    LIKWID_MARKER_STOP("MM");
+    LIKWID_MARKER_START("MTM");
     mat_t_mat_mul(b,c,d,N);
+    LIKWID_MARKER_STOP("MTM");
 
     free(d);free(c);free(b);free(a);
+    LIKWID_MARKER_CLOSE;
     return 0;
 
 }

And then recompile:

gcc -DLIKWID_PERFMON -o main main.c -lm -llikwid

Once the compilation has succeeded (troubleshooting: see this tutorial if libraries and headers are not found), we can check which performance groups we can investigate with LIKWID:

likwid-perfctr -a

Group name	Description
--------------------------------------------------------------------------------
   ENERGY	Power and Energy consumption
 FLOPS_DP	Double Precision MFLOP/s
     DATA	Load to store ratio
   DIVIDE	Divide unit information
 FLOPS_SP	Single Precision MFLOP/s
   BRANCH	Branch prediction miss rate/ratio
FLOPS_AVX	Packed AVX MFLOP/s

Since our code uses only floats, we expect the FLOPS_SP performance group to be interesting:

likwid-perfctr -m -C S0:0 -g FLOPS_SP ./main

--------------------------------------------------------------------------------
CPU name:	11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz
CPU type:	Intel Tigerlake processor
CPU clock:	2.61 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region MM, Group 1: FLOPS_SP
+-------------------+------------+
|    Region Info    | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] |   9.417304 |
|     call count    |          1 |
+-------------------+------------+

+------------------------------------------+---------+--------------+
|                   Event                  | Counter |  HWThread 0  |
+------------------------------------------+---------+--------------+
|             INSTR_RETIRED_ANY            |  FIXC0  | 104535300000 |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  |  37669540000 |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  |  24342280000 |
| FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE |   PMC0  |            0 |
|    FP_ARITH_INST_RETIRED_SCALAR_SINGLE   |   PMC1  |   2985984000 |
| FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE |   PMC2  |            0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE |   PMC3  |            0 |
+------------------------------------------+---------+--------------+

+-------------------------+------------+
|          Metric         | HWThread 0 |
+-------------------------+------------+
|   Runtime (RDTSC) [s]   |     9.4173 |
|   Runtime unhalted [s]  |    14.4266 |
|       Clock [MHz]       |  4040.6872 |
|           CPI           |     0.3604 |
|       SP [MFLOP/s]      |   317.0742 |
|     AVX SP [MFLOP/s]    |          0 |
|   AVX512 SP [MFLOP/s]   |          0 |
|     Packed [MUOPS/s]    |          0 |
|     Scalar [MUOPS/s]    |   317.0742 |
| Vectorization ratio [%] |          0 |
+-------------------------+------------+

Region MTM, Group 1: FLOPS_SP
+-------------------+------------+
|    Region Info    | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] |  17.361890 |
|     call count    |          1 |
+-------------------+------------+

+------------------------------------------+---------+--------------+
|                   Event                  | Counter |  HWThread 0  |
+------------------------------------------+---------+--------------+
|             INSTR_RETIRED_ANY            |  FIXC0  | 104541300000 |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  |  72267170000 |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  |  44929640000 |
| FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE |   PMC0  |            0 |
|    FP_ARITH_INST_RETIRED_SCALAR_SINGLE   |   PMC1  |   2985984000 |
| FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE |   PMC2  |            0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE |   PMC3  |            0 |
+------------------------------------------+---------+--------------+

+-------------------------+------------+
|          Metric         | HWThread 0 |
+-------------------------+------------+
|   Runtime (RDTSC) [s]   |    17.3619 |
|   Runtime unhalted [s]  |    27.6767 |
|       Clock [MHz]       |  4199.8549 |
|           CPI           |     0.6913 |
|       SP [MFLOP/s]      |   171.9850 |
|     AVX SP [MFLOP/s]    |          0 |
|   AVX512 SP [MFLOP/s]   |          0 |
|     Packed [MUOPS/s]    |          0 |
|     Scalar [MUOPS/s]    |   171.9850 |
| Vectorization ratio [%] |          0 |
+-------------------------+------------+

An observation: the two functions do the same amount of operations, but for the matrix-time-matrix multiplication we have a ~70% better performance than for the matrix-transpose-time-matrix multiplication.

This already suggests that at least one of the two function could be written in a way that is far from optimal.

Before we go further, we should ask ourselves:

How far are we from the maximum performance?
How much more performance would we be able to reach?