# Real-time ML on FPGAs for particle physics

Jennifer Ngadiuba (Fermilab), Sioni Summers (CERN)

2nd Terascale School of Machine Learning, 10 March 2021



## Introduction

- In this school you have learned about several complex and large DL models (likely not the largest in the world but big enough...)
- You have probably also learned that you can **relatively efficiently and quickly** train or evaluate them on GPUs





- **Relatively** means that it depends a lot on your application
- For some applications speed and efficiency are crucial, and GPUs might not be the best solution!

## The rise of specialized hardware

Recent industry trend: new devices optimized for AI that speed up both training and inference



## The rise of specialized hardware

FPGAs and ASICs making their way into data centers as co-processors

<u>Xilinx & Amazon Web Service</u> <u>Intel & Microsoft BrainWave</u> <u>Google cloud TPUs</u> <u>Xilinx & IBM cloud (CAPI)</u>



### Companies also provide toolkits to accelerate custom or standard DL models on FPGAs:

Intel OpenVino, ... Xilinx ML Suite, ...





### high level synthesis for machine learning

• Today you will learn about **hls4ml**: a package for translating neural networks to FPGA firmware for inference with ultra low latency

https://github.com/fastmachinelearning/hls4ml

https://fastmachinelearning.org/hls4ml/

pip install hls4ml



#### Objectives:

- Introduction to FPGA functionality
- Translate ML models into synthesizable FPGA code
- Make your model inference computationally efficient and fast

### Fast Machine Learning Lab

#### About the Fast ML Lab

Real-time and accelerated ML for fundamental sciences

Fast ML Lab is a research collective of physicists, engineers, and computer scientists interested in deploying machine learning algorithms for unique and challenging scientific applications. Our projects range from real-time, on-detector and low latency machine learning applications to high-throughput heterogeneous computing big data challenges. We are interested in deploying sophisticated machine learning algorithms to advance the exploration of fundamental physics from the world's biggest colliders to the most intense particle beams to the cosmos.



The project kicked off 4 years ago with ~10 people, mostly physicists (with little expertise in electronic engineering)

https://fastmachinelearning.org/

### Many more contributors and users now!

#### Caltech

Jennifer Ngadiuba (PhD, Physics);

#### CERN

<u>Thea Årrestad</u> (PhD, Physics); <u>Vladimir Loncar</u> (PhD, Computer Science); <u>Maurizio Pierini</u> (PhD, Physics); <u>Sioni Summers</u> (PhD, Physics);

#### Columbia University

Giuseppe Di Guglielmo (PhD, Computer Science);

#### Fermilab

Lindsey Gray (PhD, Physics); Christian Herwig (PhD, Physics); Duc Hoang (Undergraduate, Physics); Burt Holzman (PhD, Physics); Sergo Jindariani (PhD, Physics); Thomas Klijnsma (PhD, Physics); Ben Kreis (PhD, Physics); Kevin Pedro (PhD, Physics); Ryan Rivera (PhD, EE); Nhan Tran (PhD, Physics); Mike Wang (PhD, Physics); Tingjun Yang (PhD, Physics);

#### Hawkeye 360

EJ Kreinar (Computer Science)

#### Lehigh University

Joshua Agar (PhD, Material Science and Engineering);

#### MIT

Jack Dinsmore (Undergraduate, Physics); Song Han (PhD, EECS); Phil Harris (PhD, Physics); Jeffrey Krupa (Graduate, Physics); Sang Eon Park (Graduate, Physics); Dylan Rankin (PhD, Physics);

#### **Northwestern University**

Seda Memik-Ogrenci (Electrical and Computer Engineering); Farah Fahim (ECE, Adjunct); Nhan Tran (ECE, Adjunct)

#### **Purdue University**

Mia Liu (PhD, Physics);

#### UC San Diego

Javier Duarte (PhD, Physics); Vesal Razavimaleki (Undergraduate, Engineering Physics);

#### University of Illinois Chicago

Zhenbin Wu (PhD, Physics);

#### University of Illinois Urbana-Champaign

Markus Atkinson (PhD, Physics); Mark Neubauer (PhD, Physics);

#### **University of Washington**

Scott Hauck (PhD, EECS); Shih-Chieh Hsu (PhD, Physics);



European Research Council























## The origins: triggering @ (HL-) LHC

- Extreme collision frequency of 40 MHz → extreme data rates O(100 Tb/s)
- Most collision "events" don't produce interesting physics
- **"Triggering"** = filter events to reduce data rates to manageable levels





#### 99.75% of events rejected!









## Ultra fast ML for triggering








## The L1 trigger system



Diagram labels:

- detector front-end electronics
- detector data over optical links
- O(100) Xilinx FPGAs (AMCs w/ MicroTCA backplane)
- in: 40 MHz, out: 100 kHz
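The rates quoted for the L1 trigger (40 MHz in, 100 kHz out) can be turned into the rejection fraction and the time between collisions with a few lines of arithmetic. This is only a sketch of that calculation; the variable names are illustrative and not part of any trigger software.

```python
# Rates quoted on the slide for the L1 trigger.
input_rate_hz = 40e6     # collision rate: 40 MHz
output_rate_hz = 100e3   # accepted event rate: 100 kHz

# Fraction of events that the trigger has to reject.
rejected_fraction = 1.0 - output_rate_hz / input_rate_hz

# Time between two consecutive collisions, in nanoseconds.
bunch_spacing_ns = 1e9 / input_rate_hz

print(f"rejected: {rejected_fraction:.2%}")
print(f"time between collisions: {bunch_spacing_ns:.0f} ns")
```

A new event arrives every 25 ns, so any algorithm running here must accept new inputs at that rate.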

### Challenge: ultra low latency and scarce resources

Most of the resources are already allocated to standard algorithms:

- receive, calibrate, and sort calorimeter energy deposits over the whole detector
- aggregate them to make physics objects (jets, electrons, taus, energy sums)
- run track finders combining hits in muon stations

## The L1 trigger system

How to fit ML models here?

#### **Field Programmable Gate Arrays**

are reprogrammable integrated circuits

Contain many different building blocks ('resources') which are connected together as you desire

#### FPGA diagram





#### Field Programmable Gate Arrays are reprogrammable integrated circuits

Look Up Tables (LUTs) perform arbitrary functions on small bitwidth inputs (2-6 bits) → used for boolean operations, arithmetic, memory

Flip-flops register data in time with the clock pulse
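To make the LUT idea concrete, here is a minimal software sketch: a k-input LUT is just a table with 2^k entries, filled once with the outputs of whatever boolean function is wanted. The helper `make_lut` is a hypothetical name used for illustration only; it does not model timing, routing, or any vendor primitive.

```python
def make_lut(func, n_inputs):
    """Build a lookup table for an arbitrary boolean function of n_inputs bits."""
    # Precompute the output for every possible input combination (2**n_inputs rows).
    table = [
        func(*[(index >> bit) & 1 for bit in range(n_inputs)])
        for index in range(2 ** n_inputs)
    ]

    def lut(*bits):
        # The input bits form the address into the precomputed table.
        index = sum(bit << position for position, bit in enumerate(bits))
        return table[index]

    return lut


# Two different functions implemented by the same kind of 3-input LUT.
xor3 = make_lut(lambda a, b, c: a ^ b ^ c, 3)                          # full-adder sum bit
majority = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)    # full-adder carry bit

print(xor3(1, 0, 1), majority(1, 0, 1))
```

Together the two tables form a one-bit full adder, which is how LUTs end up being used for arithmetic.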



### FPGA diagram





**DSPs** (Digital Signal Processors) are specialized units for multiplication and arithmetic

→ faster and more efficient than LUTs for these types of operations

→ for deep learning, they are often the most precious resource

**BRAMs** are small, fast memories (e.g. 18 Kb each)

→ more efficient than LUTs when large memory is required

Modern FPGAs have ~100 Mb of BRAMs, chained together as needed

### FPGA diagram





Also contain embedded components:

**Digital Signal Processors (DSPs):** logic units used for multiplications

Random-access memories (RAMs): embedded memory elements

Contain an array of **logic cells** embedded with **DSPs**, **BRAMs**, etc.

Support highly parallel algorithm implementation

Low power per Op (relative to CPU/GPU)

## Why are FPGAs fast?

- Fine-grained / resource parallelism
  - use the many resources to work on different parts of the problem simultaneously
  - allows us to achieve low latency
- Most problems have at least some sequential aspect, which limits how low the latency can go
  - but we can still take advantage of it with...
- Pipeline parallelism
  - instruct the FPGA to work on different data simultaneously
  - allows us to achieve high throughput



Like a production line for data...
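The production-line picture can be put into numbers with a small sketch. It assumes an idealized pipeline in which every stage takes one clock cycle and a new input is accepted every `initiation_interval` cycles; the function names are illustrative.

```python
def pipelined_cycles(n_items, n_stages, initiation_interval=1):
    """Cycles until the last item leaves a pipeline of n_stages one-cycle stages."""
    # The first item needs n_stages cycles; each further item finishes
    # initiation_interval cycles after the previous one.
    return n_stages + (n_items - 1) * initiation_interval


def sequential_cycles(n_items, n_stages):
    """Cycles needed if each item must finish before the next one starts."""
    return n_items * n_stages


n_items, n_stages = 100, 5
print("pipelined: ", pipelined_cycles(n_items, n_stages))
print("sequential:", sequential_cycles(n_items, n_stages))
```

The latency of a single item is the same in both cases (5 cycles here); pipelining only improves how many items are processed per unit time.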

## How are FPGAs programmed?




# Bring DL to FPGA for L1 trigger with high level synthesis for machine learning



### Neural network inference



### How to fit ML on one FPGA?

FPGAs provide huge flexibility

Performance depends on how well you take advantage of this

**Constraints:** input bandwidth, FPGA resources, latency

<u>Today you will learn how to optimize your project through:</u>

- compression: reduce number of synapses or neurons
- quantization: reduces the precision of the calculations (inputs, weights, biases)
- parallelization: tune how much to parallelize to make the inference faster/slower versus FPGA resources


## Today's **hls4ml** hands on

• Part 1: get started with hls4ml and train a basic model and run the conversion, simulation & c-synthesis steps

notebook: part1\_getting\_started.ipynb

• Part 2: learn how to tune inference performance with quantization and reuse factor

notebook: part2\_advanced\_config.ipynb

• Part 3: perform model compression and observe its effect on the FPGA resources/ latency

notebook: part3\_compression.ipynb

• Part 4: train using QKeras "quantization aware training" and study impact on FPGA metrics

notebook: part4\_quantization.ipynb



### Part 1: model conversion

## Physics case: jet tagging

Study a <u>multi-class classification task to be implemented on an FPGA</u>: discrimination between highly energetic (boosted) **q**, **g**, **W**, **Z**, **t** initiated jets



Reconstructed as one massive jet with substructure

## Physics case: jet tagging



Input variables: several observables known to have high discrimination power from offline data analyses and published studies [\*]

[\*] D. Guest et al. <u>PhysRevD.94.112002</u>, G. Kasieczka et al. <u>JHEP05(2017)006</u>, J. M. Butterworth et al. <u>PhysRevLett.100.242001</u>, etc.

## Physics case: jet tagging

We'll train the five-class classifier on a sample of ~1M events with two boosted WW/ZZ/tt/qq/gg anti-k<sub>T</sub> jets

[doi:10.5281/zenodo.3602254, OpenML]

- Fully connected neural network with 16 expert-level inputs:
  - <u>ReLU activation function</u> for intermediate layers
  - Softmax activation function for output layer





AUC = area under ROC curve (100% is perfect, 20% is random)



### Setup

- The interactive part is served with Python notebooks
- Open <a href="https://cern.ch/ssummers/hls4ml-tutorial">https://cern.ch/ssummers/hls4ml-tutorial</a> in your web browser
- Authenticate with your GitHub account (log in if necessary)
- Open and start running through part1\_getting\_started.ipynb
- If you're new to Jupyter notebooks, select a cell and hit "shift + enter" to execute the code
- If you have Vivado installed, you might prefer to work locally, see 'conda' section at: https://github.com/fastmachinelearning/hls4ml-tutorial





### Part 2: advanced configuration

# Efficient NN design: quantization



- In the FPGA we use fixed point representation
  - operations are integer ops, but we can represent fractional values
- But we have to make sure we've used the correct data types!
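The effect of a fixed-point data type can be emulated in plain Python. The sketch below mimics a signed `ap_fixed<W,I>` value (W total bits, I integer bits including the sign), assuming the default HLS behaviour of truncation and wrap-around on overflow; `to_ap_fixed` is a hypothetical helper, not an hls4ml function.

```python
import math


def to_ap_fixed(value, width=16, integer=6):
    """Emulate a signed ap_fixed<width, integer> value (truncate, wrap on overflow)."""
    frac_bits = width - integer
    scale = 1 << frac_bits
    # Internally the value is an integer number of steps of size 2**-frac_bits.
    raw = math.floor(value * scale)      # truncation towards minus infinity
    raw &= (1 << width) - 1              # keep only `width` bits (wrap-around)
    if raw >= 1 << (width - 1):          # reinterpret as two's complement
        raw -= 1 << width
    return raw / scale


print(to_ap_fixed(0.5))      # exactly representable
print(to_ap_fixed(-0.3))     # rounded down to the nearest multiple of 2**-10
print(to_ap_fixed(40.0))     # outside [-32, 32): wraps around to a wrong value
```

The last call shows why the data types matter: with too few integer bits a large value silently wraps around instead of raising an error.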



# Efficient NN design: parallelization

- Trade-off between latency and FPGA resource usage determined by the parallelization of the calculations in each layer
- Configure the "reuse factor" = number of times a multiplier is used to do a computation



**Reuse factor**: how much to parallelize operations in a hidden layer

#### Parallelization: DSPs



#### Parallelization: Timing

Latency of layer m

$$L_m = L_{\text{mult}} + (R - 1) \times II_{\text{mult}} + L_{\text{activ}}$$
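The latency formula is easy to evaluate for different reuse factors. In this sketch R is taken to be the reuse factor, and the cycle counts passed in are made-up illustrative values, not measurements from any device.

```python
def layer_latency(l_mult, ii_mult, l_activ, reuse_factor):
    """Latency of one layer in clock cycles: L_mult + (R - 1) * II_mult + L_activ."""
    if reuse_factor < 1:
        raise ValueError("reuse factor must be at least 1")
    return l_mult + (reuse_factor - 1) * ii_mult + l_activ


# Hypothetical cycle counts, chosen only to show the trend.
for reuse_factor in (1, 2, 4, 8):
    cycles = layer_latency(l_mult=2, ii_mult=1, l_activ=3, reuse_factor=reuse_factor)
    print(f"R = {reuse_factor}: {cycles} cycles")
```

The latency grows linearly with the reuse factor, which is the price paid for using fewer multipliers.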



# Large fully-connected NN

- `Strategy: Resource` for larger networks and higher reuse factors
- Uses a slightly different HLS implementation of the dense layer to compile faster and better for large layers
- Here, the first layer gets its own reuse factor (112) for the best partitioning of arrays

```
IOType: io_parallel
HLSConfig:
  Model:
    Precision: ap_fixed<16,6>
    ReuseFactor: 128
    Strategy: Resource
  LayerName:
    dense1:
      ReuseFactor: 112
```

This config is for a model trained on the MNIST digits classification dataset.

- Architecture (fully connected): $784 \rightarrow 128 \rightarrow 128 \rightarrow 128 \rightarrow 10$
- Model accuracy: ~97%

We can work out how many DSPs this should use...

# Large fully-connected NN

- It takes a while to synthesise, so here's one I made earlier...
- The DSPs should be: (784 x 128) / 112 + (2 x 128 x 128 + 128 x 10) / 128 = 1162 🤞
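The same estimate can be written as a short calculation. It assumes one DSP per multiplier (reasonable at 16-bit precision, but an assumption) and that each dense layer needs (inputs × outputs) / reuse factor multipliers; the layer names other than `dense1` are illustrative.

```python
def multipliers(n_in, n_out, reuse_factor):
    """Number of parallel multipliers for a dense layer at a given reuse factor."""
    n_products = n_in * n_out
    # Integer ceiling division: each multiplier is reused `reuse_factor` times.
    return (n_products + reuse_factor - 1) // reuse_factor


# (name, inputs, outputs, reuse factor) for the 784 -> 128 -> 128 -> 128 -> 10 model.
layers = [
    ("dense1", 784, 128, 112),
    ("dense2", 128, 128, 128),
    ("dense3", 128, 128, 128),
    ("dense4", 128, 10, 128),
]

total_dsps = sum(multipliers(n_in, n_out, rf) for _, n_in, n_out, rf in layers)
print(total_dsps)  # 896 + 128 + 128 + 10 = 1162
```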

Timing (ns):

| Clock  | Target | Estimated | Uncertainty |
|--------|--------|-----------|-------------|
| ap_clk | 5.00   | 4.375     | 0.62        |

Utilization estimates:

| Name                | BRAM_18K    | DSP48E      | FF          | LUT         |
|---------------------|-------------|-------------|-------------|-------------|
| Total               | (illegible) | (illegible) | (illegible) | (illegible) |
| Available SLR       | 2160        | 2760        | 663360      | 331680      |
| Utilization SLR (%) | (illegible) | 42          | 25          | 67          |
| Available           | 4320        | 5520        | 1326720     | 663360      |
| Utilization (%)     | 45          | 21          | 12          | 33          |

II determined by the largest reuse factor



### Part 3: compression

# Efficient NN design: compression

- Neural Network compression is a widespread technique to reduce the size, energy consumption, and overtraining of deep neural networks
- Several approaches in literature [arxiv.1510.00149, arxiv.1712.01312, arxiv.1405.3866, arxiv.1602.07576, doi:10.1145/1150402.1150464]
- Today we will test the TensorFlow model sparsity toolkit
  - https://blog.tensorflow.org/2019/05/tf-model-optimization-toolkit-pruning-API.html

#### Main idea:

iteratively remove low magnitude weights, starting with 0 sparsity, smoothly increasing up to the set target as training proceeds
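The main idea can be sketched without any ML framework. The schedule below is a polynomial ramp from an initial to a final sparsity, similar in spirit to the schedule used by the TensorFlow pruning API; the cubic power, the 75% target, and both function names are illustrative assumptions, and this is not the toolkit's implementation.

```python
def target_sparsity(step, begin_step, end_step, initial=0.0, final=0.75, power=3):
    """Sparsity to apply at a given training step (smooth polynomial ramp)."""
    if step <= begin_step:
        return initial
    if step >= end_step:
        return final
    progress = (step - begin_step) / (end_step - begin_step)
    return final + (initial - final) * (1.0 - progress) ** power


def prune_by_magnitude(weights, sparsity):
    """Set the smallest-magnitude fraction `sparsity` of the weights to zero."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices of the weights ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.5, -0.1, 0.05, -2.0]
for step in (0, 50, 100):
    s = target_sparsity(step, begin_step=0, end_step=100)
    print(step, round(s, 3), prune_by_magnitude(weights, s))
```

In real training the surviving weights keep being updated between pruning steps, so the network can adapt to the removed connections.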



# Efficient NN design: compression





- DSPs (used for multiplication) are often the limiting resource
  - maximum use when fully parallelized
  - DSPs have a max size for input (e.g. 27x18 bits), so number of DSPs per multiplication changes with precision



### Part 4: quantization-aware training

# Efficient NN design: quantization

- hls4ml allows you to use different data types everywhere; we saw how to tune that in part 2
- In this part we will also try quantization-aware training with QKeras [arxiv.2006.10159]
- With quantization-aware training we can even go down to just 1 or 2 bits [Mach. Learn.: Sci. Technol. 2, 015001 (2020)]



# Efficient NN design with **QKeras**

- QKeras is a library developed and maintained by Google to train models with quantization in the training
- Easy to use, drop-in replacements for Keras layers
  - e.g. Dense  $\rightarrow$  QDense **OR** Conv2D  $\rightarrow$  QConv2D
  - use quantizers to specify how many bits to use where with same kind of granularity as hls4ml
- Can achieve good performance with very few bits
- We've recently added support for QKeras-trained models to hls4ml
  - the number of bits used in training is also used in inference
  - the intermediate model is adjusted to capture all optimizations possible with QKeras





# Bonus: L1 trigger applications

- hls4ml enabled large development of new trigger algorithms with large gain for physics
  - replace standard cut-based algorithms

<u>CMS Phase-2 L1 trigger</u> <u>upgrade TDR</u>





#### NN VBF H→bb

|         | Usage            | Percentage |
|---------|------------------|------------|
| Latency | 24 clk @ 200 MHz |            |
| II      | 5                |            |
| DSP48E  | 484              | 8%         |
| FF      | 32634            | 2%         |
| LUT     | 62358            | 9%         |
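The timing numbers in this table can be converted into physical units with a short calculation. The sketch assumes that the row with the value 5 is the initiation interval (II) in clock cycles; the variable names are illustrative.

```python
clock_mhz = 200        # clock frequency from the table
latency_cycles = 24    # latency from the table, in clock cycles
ii_cycles = 5          # assumed to be the initiation interval, in clock cycles

clock_period_ns = 1e3 / clock_mhz               # duration of one clock cycle
latency_ns = latency_cycles * clock_period_ns   # time from input to output
ii_ns = ii_cycles * clock_period_ns             # time between accepted inputs
max_input_rate_mhz = 1e3 / ii_ns                # highest sustainable input rate

print(f"latency: {latency_ns:.0f} ns")
print(f"new input every {ii_ns:.0f} ns, i.e. {max_input_rate_mhz:.0f} MHz")
```

Under that assumption the network accepts a new event every 25 ns, which matches the 40 MHz collision rate.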

- hls4ml enabled large development of new trigger algorithms with large gain for physics
  - replace standard cut-based algorithms
  - improve physics objects reconstruction (muons, taus, jets)



Figure: CMS Phase-2 Simulation, 14 TeV

- hls4ml enabled large development of new trigger algorithms with large gain for physics
  - replace standard cut-based algorithms
  - improve physics objects reconstruction (muons, taus, jets)
  - develop new strategies like anomaly detection with autoencoders for signal-agnostic triggering

21 inputs:  $p_T/\eta/\Phi$  of 4 e/ $\gamma$ , 4  $\mu$ , 10 jets, and MET



K. Govorkova @ Fast Machine Learning workshop 20

- hls4ml enabled large development of new trigger algorithms with large gain for physics
  - replace standard cut-based algorithms
  - improve physics objects reconstruction (muons, taus, jets)
  - develop new strategies like anomaly detection with autoencoders for signal-agnostic triggering
- Expected to see further developments thanks to the latest QKeras and AutoQ support
  - make the model small and accurate!

<u>CMS Phase-2 L1 trigger</u> <u>upgrade TDR</u>



The Phase-2 Upgrade of the CMS Level-1 Trigger Technical Design Report



#### Bonus: further developments and features

### hls4ml bonuses

- We have discussed the original motivation behind hls4ml: the extremely low latency, high-throughput domain of LHC first-stage triggers
- Since then, we have been expanding!
  - longer latency domains, larger models, resource constrained
  - different FPGA vendors (Xilinx, Intel, Mentor)
  - new applications, new architectures
- While maintaining core characteristics:
  - HLS-based fully on-chip implementation
  - extremely configurable: precision, resource vs latency/throughput tradeoff
  - research project, application- and user-driven
  - accessible, easy to use

### hls4ml bonuses


hls4ml community is very active!

- Boosted Decision Trees [JINST 15 P05026 (2020)]
- Custom graph neural networks:
  - GarNet/GravNet for calorimeter reconstruction [arXiv: 2008.03601]
  - Interaction networks for tracking [arxiv.2012.01563]

- Large convolutional neural networks [arxiv.2101.05108]



- New implementation based on streaming hls::stream<T>
  - collect data from input pixels until we can compute one output (FIFOs)
  - compute the value of output pixel with a single call to matrix-vector multiplication
  - can reuse existing matrix-vector multiplication used for fully connected layers
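The streaming idea can be illustrated in software with a one-dimensional convolution. This is a sketch of the concept only: a Python generator stands in for `hls::stream<T>`, a `deque` plays the role of the FIFO, and the function names are hypothetical rather than taken from hls4ml.

```python
from collections import deque


def matvec(matrix, vector):
    """Matrix-vector product, the same operation a fully connected layer uses."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def stream_conv1d(pixels, kernels):
    """Consume pixels one at a time and emit one output per complete window."""
    kernel_size = len(kernels[0])
    window = deque(maxlen=kernel_size)   # FIFO holding the most recent pixels
    for pixel in pixels:
        window.append(pixel)             # the oldest pixel drops out automatically
        if len(window) == kernel_size:
            # Enough data collected: a single matrix-vector call gives all filters.
            yield matvec(kernels, list(window))


# Two filters of size 3 applied to a stream of four pixels.
kernels = [[1, 0, -1], [1, 1, 1]]
outputs = list(stream_conv1d(iter([1, 2, 3, 4]), kernels))
print(outputs)
```

Only `kernel_size` pixels are ever buffered, so the memory needed does not grow with the input length.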






























### Bonus: Fast machine learning beyond L1 trigger

### The need for fast ML



### Heterogeneous computing



https://www.xilinx.com/support/documentation/white\_papers/wp504-accel-dnns.pdf

https://www.xilinx.com/publications/events/machine-learning-live/colorado/xDNN ML Suite.pdf

https://www.xilinx.com/applications/megatrends/machine-learning.html

#### Online reconstruction



30%

crucial to increase throughput

# Online reconstruction

- Large effort in the past years to rewrite parts of the reconstruction in CUDA for Nvidia GPUs



- Parallel effort to replace parts of the reconstruction with ML
  - minimize need to learn new processor-specific code → decrease effort, increase maintainability
  - must exploit co-processors to achieve highest throughput

### Heterogeneous computing @ LHC

**Option 1: direct**

**Option 2: as a service**

Diagram labels: coprocessors (GPU/FPGA/ASICs) running Models A to D at the data center / experimental site; could be ... somewhere else

# Heterogeneous computing @ LHC

#### **Option 2: as a service**

- One coprocessor can serve many CPUs → reduce cost and increase scalability
- Increase heterogeneity: choose best device for each job
- Deploy GPUs, FPGAs, ... simultaneously
- Model optimization for the processor could be obtained with available tools (e.g. Intel oneAPI [\*])



Data center/ experimental site

## MLaaS with SONIC

- Services for Optimized Network Inference on Coprocessors (SONIC) enables inference as a service in experiment software frameworks
  - experiment software (C++) only has to handle converting inputs and outputs between event data format and inference server format
- Uses industry tools such as gRPC communication and Nvidia Triton inference servers
- Interacts with cloud services: Azure, AWS, GCP



# MLaaS with SONIC

Replace hadronic calorimeter reconstruction with ML (2k parameters dense NN here) and enable the model inference in the CMS software with SONIC



#### GPU as a service [arxiv.2007.10359]

- Each client is given 7,000 events
- A single GPU can serve up to 500 HLT nodes with 10% increase in throughput

#### FPGA as a service [arxiv.2010.08556]

- A single server is capable of serving 1500 simultaneous clients while preserving throughput
- The 25 Gbps network bandwidth limit is hit above 1500 clients

# The need for fast ML

ASICs are typically used at the front end for sensor readout: embedding ML directly here allows intelligent data compression before transmission



# Example: High-granularity calorimeter @ HL-LHC

Novel technology for CMS endcap calorimeter: 52 layers with unprecedented number of readout channels!


## Example: CMS HG calorimeter



## What you have learned today

- Machine learning models are intrinsically parallelizable and can be executed efficiently on suitable hardware
- Could replace our standard physics-inspired algorithms which are instead typically sequential
- To gain from this potential down to ultra-low latency the hls4ml library was developed to translate your favourite ML model to an efficient FPGA implementation
- We hope you have gained some experience with hls4ml
  - tutorial always available at <u>https://cern.ch/ssummers/hls4ml-tutorial</u>
  - or if you want to run locally <u>https://github.com/fastmachinelearning/hls4ml-tutorial</u> (need Vivado installation)
- Stay tuned for all new features at <a href="https://github.com/fastmachinelearning/hls4ml">https://github.com/fastmachinelearning/hls4ml</a>
- And for fast machine learning updates beyond hls4ml check <u>https://fastmachinelearning.org/projects.html</u>