CDCS CENTER FOR DATA AND COMPUTING IN NATURAL SCIENCES



# Parallel Computing I

#### Parallelism On CPUs

Annett Ungethüm, 20.01.2022







# **CDCS Hamburg-X Project (BWFGB)**



# **CDCS Structure**



# **CDCS and CDLs in Detail**

#### CDCS CENTER FOR DATA AND COMPUTING IN NATURAL SCIENCES



#### 18.11.2021

#### Introduction CDCS

# **The CDCS Office Space**

#### CDCS CENTER FOR DATA AND COMPUTING IN NATURAL SCIENCES



As a DASHH student you can get a transponder to the CDCS hot desk office space (room 1064) Ask our secretary Miriam Döring: <u>miriam.doering@uni-hamburg.de</u>



#### Introduction CDCS

# **Computer Systems**



# **Computer Systems**











# **CPU information**



# Parallelism On A Single Core



- Each ALU processes an execution pipeline
- Usually unnoticed by developers and users
- Ever wondered why some CPUs are faster than others despite having the same frequency and number of cores? → This is one of the reasons



- Multiple values are processed at once by one ALU using one instruction
- Data is stored in vector registers, which can hold more than one value
- Not done automatically → Action by developer or compiler needed

# Why We Have Data Parallelism On a Single Core?

CDCS CENTER FOR DATA AND COMPUTING IN NATURAL SCIENCES



Number of Transistors per Core double every two years

- → But we cannot use all of them at the same time because of thermal constraints
- → Some transistors are switched off ("Dark Silicon"<sup>1</sup>)
- → Another scalar addition block doesn't make sense
- → "Dark" parts of the chip are used for specialized instructions, e.g. for processing vectors

18.11.2021 <sup>1</sup>H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam and D. Burger, "Dark Silicon and the End of Multicore Scaling," in IEEE Micro, vol. 32, no. 3, pp. 122-134, May-June 2012, doi: 10.1109/MM.2012.17.

# **Vectorized Processing**





CDCS CENTER FOR DATA AND COMPUTING IN NATURAL SCIENCES

#### Iscpu on an Intel or AMD CPU:

| ags:     | fpu vme de pse ts <u>c msr pa</u> e mce cx8 api <u>c sep mtrr pg</u> e mca cmov pat pse36 clflush mmx fxsr <u>sse sse2</u> ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid p<br>ni pclmulqdq vmx sse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced<br>tpr_shadow vnmi_ent_vpid_ent_ad_fsgsbase tsc_adjust bmi1_avx2_smep_bmi2_erms_invpcid_avx512f avx512fdq rdseed adx smap <u>avx512ifmd</u> clflushopt_clwb_avx512cd sha_ni_avx512bw avx512v] xsaveopt xsavec xgetbv<br>1 xsaves avx512vbmi_umip_avx512_vbmi2_gfni vaes vpclmulqdq avx512_bitalg_avx512_vpopcntdq_rdpid movdiri movdir64b fsrm_avx512_vp2intersect_flush_l1d arch_capabilities |
|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| mmx:     | Extension from 1997                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|          | 64-bit registers $\rightarrow$ 2 x 32 float, 2 x 32 bit int, 4x 16 bit int, 8 x 8 bit int                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| sse*:    | Extension from 1999                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|          | 128 bit registers $\rightarrow$ 2 x 64 bit integer, 2 x double, 4 x float, 4 x 32 bit integer,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| avx(2):  | 256 bit registers $\rightarrow$ 4 x 64 bit integer, 4 x double, 32 x 8 bit integer                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| avx512*: | Different extensions with different functions                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|          | 512 bit registers $\rightarrow$ 8 x 64 bit registers64 x 8 bit integer                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|          |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |

#### Iscpu on an Arm CPU (RaspberryPi, most Android phones,...)

Flags: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idi vt vfpd32 lpae evtstrm crc32

**Neon(v2):** 128 bit registers  $\rightarrow$  2 x 64 bit integer, 2 x double,...

**SVE:** In newer Arm CPUs (Armv8), register size between 128 bit and 2048 bit

CDCS CENTER FOR DATA AND COMPUTING IN NATURAL SCIENCES

#### Julia tries to apply SIMD automatically

@code\_native test(rand(Int64, 100000));



Still not working? Consider explicit vectorization in Julia: https://docs.julialang.org/en/v1/base/simd-types/

# SIMD With C/C++

CDCS CENTER FOR DATA AND COMPUTING IN NATURAL SCIENCES

g++ tries to apply vectorization autoatically with -O2 and -O3

#### succesful compiler log

uncompr.h:99:29: note: create vector\_type-pointer variable to type: vector(4) long unsigned int vect uncompr.h:99:29: note: created iftmp.118\_12 uncompr.h:99:29: note: add new stmt: MEM[(uint64\_t \*)vectp\_iftmp.608\_96] = vect\_\_27.607\_95; uncompr.h:99:29: note: ----->vectorizing statement: i\_21 = i\_47 + 1; uncompr.h:99:29: note: ----->vectorizing statement: vectp\_SR.601\_90 = vectp\_SR.601\_89 + 32; uncompr.h:99:29: note: ----->vectorizing statement: vectp\_SR.604\_93 = vectp\_SR.604\_92 + 32; uncompr.h:99:29: note: ----->vectorizing statement: vectp\_iftmp.608\_97 = vectp\_iftmp.608\_96 + 32; uncompr.h:99:29: note: ----->vectorizing statement: if (i\_21 >= \_22) uncompr.h:99:29: note: New loop exit condition: if (ivtmp\_100 >= bnd.598\_86) uncompr.h:99:29: note: LOOP VECTORIZED

#### unsuccesful compiler log

ry.cpp:131:46: note: not vectorized: vectorization is not profitable. note: not vectorized: not suitable for scatter store note: bad data references. note: not consecutive access \*outData\_75 = \_66; note: not vectorized: no grouped stores in basic block.

Reasons can be next to unidentifyable. Personal favourite: size\_t as iterator type doesn't work, uint32\_t works



#### 18.11.2021

#### Step 1: Getting data into a SIMD register



### Explicit SIMD with C/C++ -> Step 2 & 3 \*Example is intel only

CDCS CENTER FOR DATA AND COMPUTING IN NATURAL SCIENCES



#### 18.11.2021



g++ -O3 -flax-vector-conversions main.cpp -o myapp

Support for Neon on most 64 bit Arm systems, Use -mfpu=neon on 32 bit systems

# **Ressources for advanced C++ SIMD Programming**



# **Vector Libraries**



Matthias Kretz, Volker Lindenstruth: Vc: A C++ library for explicit vectorization. Softw., Pract. Exper. 42 (2012)



abstract interface for explicit SIMD vectorization. PMAM 2017

Pierre Estérie, Joel Falcou, Mathias Gaunard, Jean-Thierry Lapresté: Boost.SIMD: generic programming for portable SIMDization. WPMVP 2014



Tasks are split into threads and distributed across the cores (=multithreading)

Core ≠ Core

A physical core can (often) run more than one thread at the same time  $\rightarrow$  logical cores through *hyperthreading* 

Iscpu:



#### Windows Task Manager

| Sockets:              | 1 |
|-----------------------|---|
| Kerne:                | 4 |
| Logische Prozessoren: | 8 |

 $\rightarrow$  4 physical cores à 2 threads = 8 logical cores

### Advantages

- Threads on the same physical core use the same ressources
   → Fast exchange of data and fast ressource switching
- Lower probability of idle CPU time
   → Stalling cycles can be filled with
   computation by another thread

### Disadvantages

- Threads on the same physical core use the same ressources
   → Threads have to share bandwidth and cache space
- Higher probability of contentions (both threads try to access the same ressource)

→ Useful when your threads exchange data frequently but do not share other ressources (e.g. they do not access the same array)
 → Not useful when all threads access the same ressources

It is possible to enforce on which cores a thread is allowed to run

From the outside: numactl (might not be available on clusters because of security reasons)

numactl --physcpubind=0 myProgram myProgram will run on core 0

cat /sys/devices/system/cpu/cpu0/topology/thread\_siblings\_list

Tells you which logical cores are on the same physical core as core 0 Can be used to disable hyperthreading

numactl --physcpubind=0,2 myProgram

myProgram can run on core 0 and 2

From the inside: Depends on how you create your threads

- →Search term: Thread affinity
- Common approach: OpenMP

```
export GOMP_CPU_AFFINITY="0-2:2" Binds the threads to core 0 and core 2
OR
```

export OMP\_PLACES="{0}:2:2" OMP\_PLACES overwrites GOMP\_CPU\_AFFINITY



Works only if your implementation uses OpenMP

# **OpenMP - Loops**



Define which pieces of code can run in parallel (in case it's not a loop)

```
#pragma omp parallel sections
```

#pragma omp section
{ //do something

#pragma omp section
{ //do something else in parallel

Creates two threads which can run in parallel (if at least two cores are assigned) → There must be no dependencies between the sections, e.g. no commonly used variables

Multi-threading in Julia is very OpenMP-like: https://docs.julialang.org/en/v1/manual/multithreading/#man-multithreading

18.11.2021



- A node can either feature one or multiple CPUs on multiple sockets
- All core son all CPUs can access all memory modules on the same node  $\rightarrow$  NUMA system (<u>Non Uniform Memory</u> Access)

NUMA node

# **NUMA Systems**



Implementation-wise → Just like on a single CPU, you just have more cores available

Thread-pinning → It often makes sense to pin threads to the same NUMA node (if that node has enough cores)

numactl --cpunodebind=0 myProgram myProgram will run on node 0

No knowledge necessary about the ID of the cores on each node

numactl --cpunodebind=0 --membind=0 myProgram



myProgram will run on node 0 AND only allocate memory on node 0

#### 18.11.2021



### Popular solution for clusters

```
include <mpi.h>
int main(){
	MPI_init (NULL,NULL);
	int id;
```

- Build with *mpicc main.c myProgram* → If building with another compiler MPI library must be linked explicitly
- Run with *mpirun –np 2 myProgram*
- $\rightarrow$  Runs myProgram in two diffeent processes

```
int err = MPI_Comm_rank(MPI_COMM_WORLD, &id);
```

```
if (id==0) then{
	//do something
}
If (id==1) then{
	//do something else
```

Communication between processes via MPI\_Send and MPI\_Recv. → MPI\_Recv blocks a process until data transfer is done

```
,
MPI_Finalize();
```

Available for many programming languages, e.g. C/C++, Fortran, Pythn, R, Haskell, Julia... Also works on a single node, but might be overkill

18.11.2021



# **Voltage Scaling**



Active transistors produce heat

- → Depends on density of active transistors and frequency
- → Physical constraints limit the ability to conduct heat away
- → Core frequency is regulated down (via voltage) to reduce heat
- → Speed-up is not proportional to #cores

More threads than cores: Instructions and data of a thread are swapped out of the CPU to load another thread  $\rightarrow$  Context Switching

- → As we know, bandwidth for this swap is limited
- → Frequent swaps block your data bus

Operating system place threads on cores according to their own heuristics, e.g. round robin after fixed amount of time  $\rightarrow$  This is done to prevent CPU from overheating in one region, but:

- → It introduces additional context switches
- → The new core might be further away from where the data is written originally

Never spawn more threads than available cores & pin your threads to the same CPU(s) or NUMA node(s) if possible On CPUs, spawning of threads comes with an overhead

- → Short running threads, e.g. only a few ms, do not profit from multithreading
- → Create a thread pool and reuse the same threads for different tasks instead of constantly spawning new threads
- → Put all tasks into a queue and let the threads pull a task when they are done with the last one



# **The Memory Hierarchy**

| Size  |                                                                         | Туре                                                | Bandwidth                   |  |  |
|-------|-------------------------------------------------------------------------|-----------------------------------------------------|-----------------------------|--|--|
|       | 16 general purpose 64 bit registers on an x86                           | Registers                                           | At CPU clock speed          |  |  |
|       | <1MB                                                                    | L1 cache                                            | A few 100 – a few 1000 GB/s |  |  |
|       | KB to MB rende                                                          | e maximum degree of (                               | A form 100 GB/s             |  |  |
|       | The maximum degree of (useful) parallelism is limited by memory access! |                                                     |                             |  |  |
|       |                                                                         | cannot keep 20 cores b<br>all rely on frequent disc | 0.5 GB/S                    |  |  |
|       | $GB \text{ to TB r} \rightarrow Rance$                                  | dom memory access inc<br>issue by an order of mag   | creases the SD Express), in |  |  |
|       |                                                                         | Disc (hdd)                                          | < 200 MB/s                  |  |  |
| 18.11 | 1.2021                                                                  | Persistent memory                                   | 40                          |  |  |

CDCS CENTER FOR DATA AND COMPUTING IN NATURAL SCIENCES

htop:

This is all your CPU can do while waiting for data from disc or network

Solution: Copy your data to main memory before accessing it

loop over lines { open file read line from file do something with line close file open file data = get file contents close file loop over lines{ read line from data do something with line The same goes for writing
→ Collect data in an array, struct, or dataframe
→ Write data to disc when everything is

finished

Not every part of a problem can be parallelized



Minimum execution time will not go below execution time of the not parallelizable part, regardless of how many cores we use. The speedup increases along with the number of cores if the workload (problem size) of the parallelizable part increases, too

#### Main memory can deliver data to more threads in parallel than disc, but it is still limited Performance decreases when bandwidth is saturated $\rightarrow$ Hyperthreads not useful



CDCS CENTER FOR DATA AND COMPUTING IN NATURAL SCIENCES



# Exercise







# **Exercise: Parallelizing Run Length Encoding**



- 1. Read data from a single file into an array
- 2. Iterate over array and count duplicates
- 3. Write all run values and run lengths into a file on disc
- a) Which of these tasks do you think are parallelizable?
- b) Which parts are parallelizable but will not profit much from parallelization?
- c) Vectorization, hyperthreading (multiple threads on a core), multithreading (multiple threads on a CPU in general), or message passing (multiple nodes)?