

 $H, A \rightarrow \tau^{\dagger} \tau \rightarrow two \tau jets + X, 60 fb^{\dagger}$ 

# Use of GPUs for trigger applications in HEP experiments

**Felice Pantaleo** (CERN PH-SFT) DESY Computing Seminar 15/10/2012



## Outline



- Introduction
- Graphic Processing Units
- ALICE HLT
- NA62
  - RICH
  - CHOD
  - Tests
- Conclusion
- Reference



 $I, A \rightarrow \forall \tau \rightarrow two \tau jets + X, 60 fb'$ 

### Introduction

## Introduction



- High Energy Physics detectors needs
  - Processing large amounts of information
  - Very short time response
  - Complex structures, many sensors with topographic information
  - Efficient processing of data
- Huge benefits expected from the use of Multicore techniques
- Where are we ?

# Generic Detector Structure



ER

# Generic Trigger Structure



ERI

### Low Level Trigger



- Time needed for decision  $\Delta t_{dec} \approx 1 \text{ ms}$
- Particle rate  $\approx 10 \text{MHz}$
- Need pipelines to hold data
- Need fast response
- Backgrounds are huge
- High rejection factor
- Algorithms run on local, coarse data
- Ultimately, determines the physics



# Parallel Opportunities



- On detector (mostly Cellular Automata)
  - Trackers
  - Calorimeters
- Trigger Systems
  - Event decision: one event per node + dispatcher
- The Grid



- Reconstruction of tomographies in the European Synchrotron Radiation Facility
- Integration of GPUs into the ITMS framework of data acquisition in fusion experiments
- Track reconstruction simulation for PANDA in FAIR
- NA62 Low-Level Trigger
- ALICE High-Level Trigger

# HEP is a long lived science...



- Many algorithms in HEP have cores dating the seventies
- Mostly thought and devised as single threaded
- Current solution: one event-one core
- This solution shows saturation

#### Change of paradigm ?



- HEP characteristics and needs seem to fit perfectly both
  - Cellular techniques
  - Multi- Many- cores architectures
- Our feeling: a large number of opportunities ahead of us
- Further study is needed to evaluate the real benefits of these solutions.



 $f A \rightarrow \forall \tau \rightarrow t \text{wo } \tau \text{ jets } + X, 60 \text{ fb}'$ 

## **Graphic Processing Units**

#### GPUs (1/2)



- Exceptional raw power wrt CPUs
- Higher energy efficiency (NVIDIA Kepler PUE 1.05)
- Plug & Accelerate
- Massively parallel architecture
- Low Memory/core



### GPUs (2/2)



GPUs were traditionally used for real-time rendering.

NVIDIA & AMD main

manufacturers.

#### **Architecture (Kepler)**

- 7.1B Transistors
- 15 SMX units
- > 1 TFLOP FP64
- 1.5 MB L2 Cache
- 384-bit GDDR5
- PCI Express Gen3



## CUDA – Kepler Architecture

- SMX executes hundreds of threads concurrently.
- **SIMT** (Single-Instruction, Multiple-Thread)
- Instructions pipelined
- Thread-level parallelism
- Instructions issued in order
- No Branch prediction
- No speculative execution
- Branch predication

| -                 |      |      |                   | _    |      |      |                | Ins               | tructi        | on Ca | che    |                  |           |      |      |      |                |        |     |
|-------------------|------|------|-------------------|------|------|------|----------------|-------------------|---------------|-------|--------|------------------|-----------|------|------|------|----------------|--------|-----|
| Warp Scheduler    |      |      |                   |      |      | Wa   | rp Schee       | duler             |               |       | War    | p Sch            | neduler   | 1    |      | Wa   | rp Sche        | duler  |     |
| Dispatch Dispatch |      |      | Dispatch Dispatch |      |      |      |                | Dispatch Dispatch |               |       |        | Dispatch Dispatc |           |      | tch  |      |                |        |     |
|                   |      |      |                   |      |      |      | Regi           | ster f            | File (        | 65,53 | 6 x 3  | 2-bit            |           |      |      |      |                |        |     |
| Core              | Core | Core | DP Unit           | Coro | Core | Core | DP Unit        | LDIST             | SFU           | Core  | Core   | Core             | DP Unit   | Core | Core | Core | DP Unit        | LDIST  | SF  |
| Core              | Core | Core | DP Unit           | Core | Core | Core | DP Unit        | LDIST             | SFU           | Core  | Core   | Core             | DP Unit   | Core | Core | Core | DP Unit        | LDIST  | SFI |
| Core              | Core | Core | DP Unit           | Core | Coro | Care | OP Unit        | LINST             | SFU           | Core  | Care   | Core             | DP Unit   | Care | Core | Core | DP Unit        | LDIST  | SF  |
| Com               | Core | Core | DP UNI            | Gara | Core | Com  | OP LINE        | LINST             | SEII          | Core  | Com    | Core             | DP Here   | Core | Core | Core | DP Hall        | LDIST  | 85  |
|                   |      |      | the link          |      |      |      | DD Hell        | LINE              | -             |       |        |                  |           |      |      |      | DD Hol         |        |     |
| oure              | Core |      |                   |      | Cone | Core |                | Constraint of     | aru           | Core  | Gore   | Core             | _         | Gore | Core | Gore |                | - Diat |     |
| Core              | Core | Core | 0P Unit           | Core | Core | Core | DP-Unit        | LDIST             | SFU           | Core  | Care   | Core             | DP Unit   | Care | Gore | Gore | DP Unit        | LDIST  | SF  |
| Sore              | Core | Core | DP Unit           | Core | Core | Core | DP Unit        | LD/ST             | SFU           | Core  | Core   | Core             | DP Unit   | Core | Core | Core | DP Unit        | LDIST  | SFI |
| Core              | Core | Core | DP Unit           | Core | Core | Core | DP Unit        | LINST             | SFU           | Core  | Core   | Core             | DP Unit   | Core | Core | Core | <b>DP Unit</b> | LDIST  | SFI |
| Core              | Core | Core | <b>BP Unit</b>    | Gore | Core | Core | <b>DP Unit</b> | LDIST             | SFU           | Core  | Core   | Core             | OP Unit   | Gore | Core | Core | <b>DP Unit</b> | LDIST  | SFL |
| Gore              | Core | Core | <b>DP Unit</b>    | Core | Core | Core | DP Unit        | LUNST             | SFU           | Core  | Core   | Core             | OP Unit   | Core | Core | Core | DPUnn          | LDIST  | SF  |
| Core              | Core | Core | OP Unit           | Core | Core | Core | DP Unit        | LINST             | SFU           | Core  | Core   | Core             | DP Unit   | Gore | Core | Core | DP Unit        | LDIST  | SF  |
| Core              | Core | Core | DP Unit           | Gore | Core | Core | DP Unit        | LDIST             | SFU           | Core  | Core   | Core             | DP Voit   | Core | Core | Core | DP Unit        | LDIST  | SF  |
| Core              | Core | Core | <b>UP Unit</b>    | Core | Core | Core | DP Unit        | LDIST             | SFU           | Core  | Core   | Core             | DP Unit   | Core | Core | Core | 0P Unit        | LDIST  | SF  |
| Core              | Core | Core | DP Unit           | Core | Core | Core | DP Unit        | LINST             | SFU           | Core  | Core   | Core             | DP Unit   | Core | Core | Core | DP Unit        | LDIST  | SF  |
| Core              | Core | Core | DP Unit           | Core | Core | Core | DP Unit        | LDIST             | SFU           | Core  | Core   | Core             | DP Unit   | Core | Core | Core | DP Unit        | LDIST  | SF  |
| Core              | Core | Core | DP Unit           | Gore | Gore | Core | DP Unit        | LDIST             | SFU           | Core  | Gare   | Care             | GP Unit   | Core | Core | Core | <b>DP Unit</b> | LDIST  | SF  |
| 6                 |      | 200  |                   |      |      |      |                | Inter             | conne         | ct Ne | twork  |                  |           |      |      |      |                |        |     |
|                   | _    |      | _                 |      |      |      | 48 KB          | Share<br>B Re:    | eu Me<br>ad-O | niv D | y / L1 | ache             | inte<br>N | _    |      |      | _              | _      |     |
| -                 | Tex  |      | Tex               | ζ    |      | Tex  |                | Tex               |               |       | Tex    |                  | Tex       |      |      | Тех  |                | Tex    |     |
|                   | Ter  |      | Тен               |      |      | Tay  |                | Toy               |               |       | Tay    |                  | Toy       |      |      | Tex  |                | Tor    |     |

## CUDA – Abstractions



- 1. Hierarchy of thread groups
- 2. Shared memory
- 3. Barrier synchronization

Fine-grained *data/thread* parallelism Coarse-grained *data/task* parallelism Instruction-level parallelism





 $H, A \rightarrow \forall \tau \rightarrow two \tau jets + X, 60 fb'$ 

## ALICE High Level Trigger

Thanks to D. Rohr (ALICE Collaboration)

## What it is happening in HPC



Use several platforms containing GPUs to solve one single problem

Programming challenges:

- Algorithm parallelization
- Perform computation in GPUs
- Execution in a distributed system where platforms have their own memory
- Network communication.







ALICE is one of the major four experiments of the Large Hadron Collider at CERN. It was specifically designed to study heavy ion collisions.







# A Time Projection Chamber (TPC) is used to measure particle trajectories.



# ALICE TPC



• TPC clusters of a heavy ion event



# ALICE TPC



• Tracks reconstructed from the clusters



## ALICE HLT



- Alice HLT tracker divides the TPC in slices and processes the slices individually.
- Track segments from all the slices are merged later



| Category of task      | Name of task               | Description on task                                     |  |  |  |  |
|-----------------------|----------------------------|---------------------------------------------------------|--|--|--|--|
|                       | (Initialization)           |                                                         |  |  |  |  |
| Combinatorial part    | I: Neighbors finding       | Construct souds                                         |  |  |  |  |
| (Cellular automation) | II: Evolution              | (Track candidates)                                      |  |  |  |  |
| Kalman filter part    | III: Tracklet construction | Fit seed,<br>extrapolate tracklet,<br>find new clusters |  |  |  |  |
|                       | IV: Tracklet selection     | Select good tracklets,<br>assign clusters to<br>tracks  |  |  |  |  |
|                       | (Tracklet output)          |                                                         |  |  |  |  |

# Neighbors Finding





# Evolution step





## Tracklet reconstruction

• The algorithm looks for clusters close to extrapolation point



# Evolution step





# Tracklet construction



# Tracklet selection



## ALICE GPU Tracker





#### Screenshot of ALICE Online-Event-Display during first physics-fill with active GPU Tracker.



 $1, A \rightarrow \forall \tau \rightarrow two \tau jets + X, 60 fb'$ 

### NA62

#### $K^+ \rightarrow \pi^+ vv$ in the Standard Model



- FCNC process forbidden at tree level
- Short distance contribution dominated by Z penguins and box<sub>u</sub>, c, t diagrams
- Negligible contribution from u quark, small contribution from c quark
- Very small BR due to the CKM top coupling  $\rightarrow \lambda^5$



15/10/2012



- Amplitude well predicted in SM (measurement of V<sub>td</sub>) [see E.Stamou]
- Residual error in the BR due to parametric uncertainties (mainly due to charm contributions): ~7%
- Alternative way to measure the Unitarity Triangle with smaller theoretical uncertainty

|                        | G <sub>SD</sub> /G | Irr. theory err. | BR x 10 <sup>-11</sup> |
|------------------------|--------------------|------------------|------------------------|
| ĸ <sub>∟</sub> →πνν    | >99%               | 1%               | 3                      |
| K⁺ <b>→</b> π⁺νν       | 88%                | 3%               | 8                      |
| K <sub>L</sub> →π⁰e⁺e⁻ | 38%                | 15%              | 3.5                    |
| κ₋→π⁰μ⁺μ⁻              | 28%                | 30%              | 1.5                    |

33

## Experimental Technique





- Kaon decay in-flight from an unseparated 75 GeV/c hadron beam, produced with 400 GeV/c protons from SPS on a fixed berilium target
- ~800 MHz hadron beam with ~6% kaons
- The pion decay products in the beam remain in the beam pipe
- <u>Goal:</u> measurement of O(100) events in two years of data taking with % level of systematics
- Present result (E787+E949): 7 events, total error of ~65%.

# NA62 Trigger





L0: Hardware synchronous level. 10 MHz to 1 MHz. Max latency 1 ms. L1: Software level. "Single detector". 1 MHz to 100 kHz L2: Software level. "Complete information level". 100 kHz to few kHz.

# GPU as a Level 0 Trigger



- The idea: exploit GPUs to perform high quality analysis at trigger level
- GPU architecture: massive parallel processor SIMD
- "Easy" at L1/2, challenging at L0
- Real benefits: increase the physics potential of the experiment at very low cost!
- Profit from continuative developments in technology for free (Video Games,...)



"quasi-triggerless" with GPUs



## Data Flow







 $A \rightarrow \forall \tau \rightarrow two \tau jets + X, 60 fb'$ 

## NA62 RICH Level0 Trigger

### RICH





- ~17 m **RICH**
- 1 atm Neon
- Light focused by two mirrors on two spots equipped with ~1000 PMs each (pixel 18 mm)
- 3s p-m separation in 15-35 GeV/c, ~18 hits per ring in average
- ~100 ps time resolution, ~10 MHz events rate
- Time reference for trigger

## **Ring Reconstruction**



- Natively built for pattern recognition problems
- First attempt: ring reconstruction in RICH detector.





# It's a **pilot project**, very promising R&D!

# Crawford Algorithm

CERN

Consider a circle of radius R, centered in  $(x_0, y_0)$  and a list of points  $(x_i, y_i)$ . The following relations exist:

$$x_0^2 + y_0^2 - R^2 = \frac{1}{N} \{ 2x_0 \sum x_i + 2y_0 \sum y_i - \sum x_i^2 - \sum y_i^2 \}.$$
 (1)

$$x_{0} \{ \sum x_{i}^{2} - \frac{(\sum x_{i})^{2}}{N} \} + y_{0} \{ \sum x_{i} y_{i} - \frac{\sum x_{i} \sum y_{i}}{N} \} = \frac{1}{2} \{ \sum x_{i}^{3} + \sum x_{i} y_{i}^{2} - \sum x_{i} \frac{\sum x_{i}^{2} + \sum y_{i}^{2}}{N} \},$$
(2)

$$x_{0} \{ \sum x_{i} y_{i}^{2} - \frac{\sum x_{i} \sum y_{i}}{N} \} + y_{0} \{ \sum y_{i}^{2} - \frac{\sum y_{i}^{2}}{N} \} = \frac{1}{2} \{ \sum x_{i}^{2} y_{i} + \sum y_{i}^{3} - \frac{\sum y_{i}^{2} + \sum y_{i}^{2}}{N} \}$$

$$-\sum_{\text{felice.pantaleo@cern.ch}} y_{i} \frac{\sum x_{i}^{2} + \sum y_{i}^{2}}{N} \}.$$
(3)

15/10/2012

# Reduction two tiets + X, 60 16





- One reduction kernel is called per block, giving an array of results (one for each event)
- Must use sequential addressing instead of interleaved addressing to avoid Shared Memory bank conflicts
- Time complexity is O(logN), cost is O(N\*logN): not cost efficient
- Brent's theorem (algorithm cascading) suggests O(N/logN) threads:
  - $\circ$  Each thread does O(logN) sequential work
  - $\circ \quad \ \ All \ O(N/logN) \ threads \ cooperate \ for \ O(logN) \ steps$
  - New cost =  $O(N/\log N * \log N) = O(N)$

# GPU grid organization





## Stream Scheduler



- Exploit the instruction-level parallelism (i.e. pipelining streams)
- This is usually done by interlacing one stream instructions with another stream ones
- This cannot be done in realtime without the introduction of other **unknown** latencies
- CPU thread-level parallelism is the solution

#### C2050 Execution Time Lines

#### Sequential Version

| H2D Engine    | Stream 0 |   |   |  |
|---------------|----------|---|---|--|
| Kernel Engine |          | 0 |   |  |
| D2H Engine    |          |   | 0 |  |

#### Asynchronous Versions I and 3

| H2D Engine    | 1 | 2 | 3 | 4 |   |   |
|---------------|---|---|---|---|---|---|
| Kernel Engine |   | 1 | 2 | 3 | 4 |   |
| D2H Engine    |   |   | 1 | 2 | 3 | 4 |

#### **Asynchronous Version 2**

| H2D Engine    | 1 | 2 | 3 | 4 |   |   |   |   |   |  |  |  |
|---------------|---|---|---|---|---|---|---|---|---|--|--|--|
| Kernel Engine |   | 1 | 2 | 3 | 4 |   |   |   |   |  |  |  |
| D2H Engine    |   |   |   |   |   | 1 | 2 | 3 | 4 |  |  |  |





 $I, A \rightarrow \forall \tau \rightarrow two \tau jets + X, 60 fb'$ 

### NA62 RICH Tests

### Hardware configuration (1/2)



#### First Machine

- GPU: NVIDIA Tesla C2050
  - o 448 CUDA cores @ 1.15GHz
  - 3GB GDDR5 ECC @ 1.5GHz



- CUDA CC 2.0 (Fermi Architecture)
- PCIe 2.0 (effective bandwidth up to ~5GB/s)
- o CUDA Runtime v4.2, driver v295.20 (Feb '12)
- CPU: Intel® Xeon® Processor E5630 (released in Q1'10)
  - 2 CPUs, 8 physical cores (16 HW-threads)
- SLC6, GNU C compiler v4.6.2

## Hardware configuration (2/2)



Second Machine

- GPU: NVIDIA GTX680
  - 1536 CUDA cores @ 1.01GHz



- 2GB GDDR5 ECC @ 1.5GHz
- CUDA CC 3.0 (Kepler Architecture)
- PCIe 3.0 (effective bandwidth up to ~11GB/s)
- o CUDA Runtime v4.2, driver v295.20 (Feb '12)
- CPU: Intel® Ivy Bridge Processor i7-3770 (released in Q2 '12)
  - o 1 CPUs, 4 physical cores (8 hw-threads) @3.4GHz
- Fedora 17, GNU C compiler v4.6.2

## Results - Throughput



The throughput behaviour for a varying number of events inside a packet is a typical many-core device behaviour:

- constant time to process a varying number of events, activating more SMs as the packet size increases
- discrete oscillations due to the discrete nature of the GPU
- saturation plateau (1.4GB/s and 2.7GB/s)

The right choice of packet dimension is not unique.

It depends on the maximum latency we don't want to exceed and on the input rate of events.

Considering that the maximum rate per subdetector (@10MHz particles rate) for NA62 experiment is ~500MB/s, I would consider the throughput test **PASSED** 



## Results - Latency



Latency pretty stable wrt event size.

- A lower number of event inside a package is better to achieve a low latency.
- A larger number of event guarantees a better performance and a lower overhead.

The choice of the packet size depends on the technical requirements.



# of events per packet

-atency(ms)

# Results - Latency Stability



#### Latency Stability





 $1, A \rightarrow \forall \tau \rightarrow two \tau jets + X, 60 fb'$ 

## NA62 CHOD Level0 Trigger

### CHOD



- Opportunity to run test on a running subdetector.
- Aim: Applying time corrections in real-time can improve time resolution.

#### **Divide et Impera:**

- Real-Time-Ready data structures receiver implemented.
- Fast communication between Network Interface buffers and Host Memory (DNA driver) implemented
- A throughput/latency efficient kernel implemented



**Integrate** the single unit tests



## CHOD – Kernel

- Assuming the conditions of the detector to be stable, the kernel is an increment by a value taken from a lookup table.
- Each element of the lookup table contains the correction specific of each rectangular zone.
- Each zone correction is the central point one.



# CUDA Kepler Architecture





Investigation on which memory to use to store this matrix:

#### **Global memory (read and write)**

- Slow, but now with cache
- L1 cache designed for spatial re-usage, not temporal (similar to coalescing)
- It benefits if compiler detects that all threads load same value (*LDU* PTX ASM instruction, load uniform)

#### **Texture memory**

• Cache optimized for 2D spatial access pattern

#### **Constant memory**

- Slow, but with cache (8 kb)
- Shared memory (48kB per SMX)
  - Fast, but slightly different rules for bank conflicts now

Registers (65536 32-bit registers per SMX)



 $I, A \rightarrow \forall \tau \rightarrow two \tau jets + X, 60 fb'$ 

### NA62 CHOD Tests

# Kernel 8x8





# Kernel 64x64





# Time dependency





#### Throughput two tjets + X, 60 fb





### Latency









- I will test a complete system in the first NA62 technical run in November.
- GPUs seem to represent a good opportunity, not only for analysis and simulation applications, but also for more "hardware" jobs.
- Replacing custom electronics with fully programmable processors to provide the maximum possible flexibility is a reality not so far in the future.





- GPU now mature to be used in HEP in particular in dedicated triggers.
- Already in use in HLT in ALICE for TPC reconstruction
- First test as Level0 trigger processor in NA62 next month



 $I, A \rightarrow \forall \tau \rightarrow two \tau jets + X, 60 fb'$ 

### Questions?



#### **Real-Time Use of GPUs in NA62 Experiment**

*F. Pantaleo et al.*, 13th International Workshop on Cellular Nanoscale Networks and their Applications (<u>CERN-PH-EP-2012-</u> <u>260</u>)

#### ALICE HLT TPC Tracking on GPUs

D. Rohr et al., CNNA2012

#### Many-core processors and GPU opportunities in Particle Detectors

X. Vilasís Cardona et al. CNNA2012