Optimization for non-Intel architectures

SeisSol is currently optimized only for Intel architectures. However, as the single-node performance rests mostly on efficient small matrix-matrix multiplications generated by libxsmm, we believe that high performance can be obtained on other architectures as well by supplying a generator for efficient small matrix-matrix multiplications.

In this tutorial, we show how to replace the calls to libxsmm with a custom code generator, which will simply emit calls to Intel’s Math Kernel Library (MKL). While this approach is admittedly useless on Intel architectures (libxsmm is already tuned for them, as the comparison at the end shows), it demonstrates how one may replace libxsmm with an alternative.

A simple code generator in bash

During the build process, SeisSol calls the program libxsmm_gemm_generator in order to generate optimized GEMM routines; the term GEMM is borrowed from the BLAS level-3 routine of the same name. This generator application requires up to 17 command-line arguments, which are described in libxsmm’s documentation.

As a first step, we create a bash script that mimics the interface of libxsmm’s generator and emits calls to cblas_dgemm:

#!/bin/bash

# Mimic libxsmm_gemm_generator's positional interface as used by SeisSol:
# ${2} output file, ${3} routine name, ${4} ${5} ${6} M N K,
# ${7} ${8} ${9} lda ldb ldc, ${11} beta (alpha, ${10}, is assumed to be 1.0).

# Write the MKL header once, before the first generated routine.
if ! grep -q "gemm" ${2}; then
  echo "#include <mkl_cblas.h>" >> ${2}
fi

# Emit a wrapper with the signature libxsmm generates (the prefetch pointers
# are part of the interface but unused here), forwarding to MKL's dgemm.
echo "void ${3}(const double* A, const double* B, double* C, const double* A_prefetch, const double* B_prefetch, const double* C_prefetch) {" >> ${2}
echo "  cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, ${4}, ${5}, ${6}, 1.0, A, ${7}, B, ${8}, ${11}, C, ${9});" >> ${2}
echo "}" >> ${2}

We save this file as libxsmm_gemm_generator, make it executable with chmod +x libxsmm_gemm_generator, and prepend its location to the PATH environment variable with export PATH=/path/to/scripts/:$PATH. (You may also choose another file name, in which case you have to adjust the SCons configuration.)
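To sanity-check the script before wiring it into the build, you can invoke it once by hand and inspect the output. The following Python snippet does this via the subprocess module; the matrix sizes and the routine name are purely illustrative:

import subprocess

# Call the wrapper as SeisSol's build would: positions 4-9 carry
# M, N, K, lda, ldb, ldc, and position 11 carries beta.
subprocess.run(
    ['libxsmm_gemm_generator', 'dense', 'test.cpp', 'example_dgemm',
     '56', '9', '56', '56', '56', '56', '1.0', '0.0'],
    check=True)

# test.cpp should now contain the mkl_cblas.h include and a function
# example_dgemm forwarding to cblas_dgemm with M=56, N=9, K=56.
with open('test.cpp') as f:
    print(f.read())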

Definition of your architecture

Let’s suppose you want to run SeisSol on Your Awesome Architecture (YAA). YAA features 256-bit SIMD instructions, which require 32-byte memory alignment for optimal performance (aligned loads and stores). In order to pass this information to SeisSol, we are going to create a new architecture definition.

First, open the file generated_code/gemmgen/Arch.py and add the entries 'yaa': 32 and 'yaa': False to the alignments and enablePrefetch dictionaries, respectively.
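After this edit, the two dictionaries might look as follows. This is only a sketch: the pre-existing entries shown here are illustrative and may differ in your copy of Arch.py.

# generated_code/gemmgen/Arch.py (sketch; existing entries may vary)
alignments = {
    'noarch': 16,      # illustrative pre-existing entry
    'hsw':    32,      # illustrative pre-existing entry
    'yaa':    32,      # YAA: 256-bit SIMD requires 32-byte alignment
}
enablePrefetch = {
    'hsw': False,      # illustrative pre-existing entry
    'yaa': False,      # our MKL-based generator emits no prefetch code
}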

Second, open the file site_scons/arch.py and add 'yaa' in getArchitecture and 'yaa': 32 in the getAlignment function (yes, this information is redundant). In this file, you also have the option to set architecture-specific compiler flags. Here, we use the same flags as for the Haswell architecture and replace elif cpu == 'hsw': with elif cpu in ['hsw', 'yaa']:. Inside this branch, we add

if cpu == 'yaa':
    flags.extend(['-L{0}/lib/intel64/'.format(os.getenv('MKLROOT')),
                  '-lmkl_intel_ilp64', '-lmkl_sequential', '-lmkl_core',
                  '-lm', '-ldl'])
    flags.extend(['-DMKL_ILP64', '-I{0}/include'.format(os.getenv('MKLROOT'))])

in order to link against MKL. Don’t forget to add import os at the top of the file.
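For orientation, the modified branch might then look roughly as follows. This is a condensed, runnable sketch rather than the actual contents of site_scons/arch.py; in particular, the Haswell flags shown are placeholders:

import os

cpu = 'yaa'    # in arch.py, this value comes from the SCons configuration
flags = []
if cpu in ['hsw', 'yaa']:
    flags.extend(['-mavx2', '-mfma'])   # placeholder for the actual Haswell flags
    if cpu == 'yaa':
        # link against sequential MKL and make its headers visible
        flags.extend(['-L{0}/lib/intel64/'.format(os.getenv('MKLROOT')),
                      '-lmkl_intel_ilp64', '-lmkl_sequential', '-lmkl_core',
                      '-lm', '-ldl'])
        flags.extend(['-DMKL_ILP64', '-I{0}/include'.format(os.getenv('MKLROOT'))])
print(flags)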

Third, as a final tedious step, we open the file src/Equations/elastic/generated_code/initialization/precision.h and add

#ifdef DYAA
#define DOUBLE_PRECISION
#endif

Performance evaluation

You can use our performance proxy, whose performance usually closely matches that of SeisSol, to test the single-node performance. In order to compile it, change to the directory auto_tuning/proxy/ and run

scons equations=elastic order=6 arch=dyaa memLayout=../config/elastic_dknl_O6.xml

The option order specifies the convergence order; feel free to vary it from 2 to 8. With arch we select our newly defined architecture (the leading d selects double precision). The last option, memLayout, allows us to use sparse or block-partitioned memory layouts (see ../config/elastic_dhsw_O6.xml for an example). However, as our code generator only supports dense matrix multiplication, we use a memory layout that contains dense matrix multiplications only. Note that in our experience sparse matrix multiplication becomes less useful with increasing vector width; on Knights Landing, for example, sparse matrix multiplication does not reduce the time to solution.

The file build/generated_code/gemms.cpp contains the generated kernels and allows you to check whether any errors occurred. Furthermore, it is a good idea to run the unit tests first: from auto_tuning/proxy/, run ./build/bin/generated_kernels_test_suite on your cluster. The result should be

Running cxxtest tests (96 tests)..........................................OK!

Now you may finally evaluate your performance by writing a job script for your cluster. Here, we are going to use the following LoadLeveler script for SuperMUC Phase 2:

#!/bin/bash
#@ wall_clock_limit = 00:30:00
#@ job_name = proxy_test
#@ job_type = parallel
#@ class = test
#@ node = 1
#@ tasks_per_node = 1
#@ island_count = 1
#@ node_usage = not_shared
#@ initialdir = $(home)/seissol_reference/auto_tuning/proxy/
#@ output = logs/proxy_test.$(schedd_host).$(jobid).out
#@ error = logs/proxy_test.$(schedd_host).$(jobid).err
#@ queue

. /etc/profile
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=56
export KMP_AFFINITY="compact,granularity=thread"

# proxy arguments: number of cells, number of time steps, kernel to benchmark
./build/bin/seissol_proxy 100000 100 all

The proxy application delivers the following information:

time for seissol proxy              : 17.757648
GFLOPS (non-zero) for seissol proxy : 146.865083
GFLOPS (hardware) for seissol proxy : 422.893843

Hence, we are already at 43 % of peak performance (at 2.2 GHz). However, remember that we used dense GEMM routines instead of sparse GEMM routines: the hardware GFLOPS count every operation the dense kernels execute, including multiplications with zero entries, whereas the non-zero GFLOPS count only the useful work. The actual work done is therefore better measured by the time to solution or by the non-zero GFLOPS.
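The 43 % figure can be verified with a back-of-the-envelope calculation, assuming a dual-socket Xeon E5-2697 v3 (2 x 14 Haswell cores) at 2.2 GHz AVX frequency with 16 double-precision FLOPs per core and cycle (two 4-wide FMA units):

cores = 2 * 14           # dual-socket, 14 cores per socket (assumed E5-2697 v3)
freq_ghz = 2.2           # assumed AVX clock frequency
flops_per_cycle = 16     # 2 FMA units * 4 DP lanes * 2 FLOPs per FMA
peak = cores * freq_ghz * flops_per_cycle   # theoretical peak in GFLOPS
print(peak)                      # 985.6 GFLOPS
print(422.893843 / peak)         # ~0.43, i.e. 43 % of peak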

Performance comparison

We compare the performance obtained by noarch, yaa, and hsw on a dual-socket Xeon E5-2697:

arch     time [s]   NZ-GFLOPS   HW-GFLOPS
noarch       78.7          33          92
yaa          17.8         147         423
hsw          10.8         240         554

To summarise, using MKL already gave us a substantial speed-up of 4.4 in comparison to noarch. However, using a specialized code generator like libxsmm in combination with auto-tuning yields another speed-up of 1.6.
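Both speed-ups follow directly from the timings in the table:

times = {'noarch': 78.7, 'yaa': 17.8, 'hsw': 10.8}   # seconds, from the table
print(times['noarch'] / times['yaa'])   # ~4.4: MKL-backed yaa vs. noarch
print(times['yaa'] / times['hsw'])      # ~1.6: libxsmm + auto-tuning vs. MKL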