

### **FPGAs in Detector Instrumentation: concepts and trends**

Michele Caselle and Marc Weber

12-15 March 2019 Physics Department, TU Dresden

Helmholtz Research Field Matter





www.kit.edu

### **Challenges for the future**



The next generation of detectors is extremely challenging: HEP, astrophysics, photon science, etc.

Simulation of 10 000 tracks in a future HL-CMS Tracker detector

Low-level trigger system based on FPGA track reconstruction and fitting

High-brilliance coherent synchrotron sources for the new generation of photon science experiments: EuXFEL, FLASH, FLUTE, KARA, SLS-2, etc.

Advanced beam diagnostic tools for new synchrotron and plasma accelerator machines.



Unprecedented data rate of up to 50 Tb/s to be processed in < 4 µs with high efficiency and acceptable fake rate
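To give a feeling for these two numbers, here is a back-of-the-envelope sketch of how much data arrives while a single decision is still pending. It assumes, for illustration only, that the full 50 Tb/s stream has to be held for the whole 4 µs; the function name `bits_in_flight` is hypothetical.

```python
def bits_in_flight(rate_bits_per_s: float, latency_s: float) -> float:
    """Data volume that arrives during one processing latency window."""
    return rate_bits_per_s * latency_s

RATE = 50e12     # 50 Tb/s, from the slide
LATENCY = 4e-6   # 4 us upper bound, from the slide

bits = bits_in_flight(RATE, LATENCY)
print(f"{bits / 1e6:.0f} Mb = {bits / 8e6:.0f} MB in flight")
```

Under this assumption the pipeline has to hold on the order of 200 Mb (25 MB) at any time.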

Complex dynamics on short time scales, terahertz detector technologies, sub-ps time resolution, O(100) Gb/s continuous data acquisition over long time scales (seconds to hours)

Scaling of existing technologies is necessary to be prepared for future experiments


# Challenges for the future (II)

- Increased luminosity requires:
  - Higher segmentation → O(10 µm) on hybrid and down to O(1-3 µm) on monolithic
  - Higher hit-rate capability  $\rightarrow$  sub-nanosecond
  - Higher radiation hardness



DAQ system: interface to and control of the front-end ('*readout*'), and organization of data into a coherent structure ('*event building*') to cope with the enormous data volume

Trigger systems are essential to find rare events & new physics ('*trigger*')

Configuration and control of detector ('run control')

Provide monitoring of system and data quality ('DQM')

Next generation of data processing (?)



Next generation of FPGAs (?)



### **Next generation of FPGA**

### A New Class of Devices for Today's Challenges



**Device Category** 

### Three Ages of FPGA



# Heterogeneous: ZYNQ MPSoC technology

Karlsruher Institut für Technologie

The *heterogeneous platform* of the Zynq System-on-Chip (SoC) integrates, in a monolithic device, FPGA resources with back-end software running on a hard-core ARM-based processor.



Next talk: System-on-Chip FPGAs: Experience and Recommendations by Ralf SPIWOKS (CERN)

System-on-Chip (SoC) workshop – CERN, 12-14 June 2019 https://indico.cern.ch/event/799275/registrations/48961/

Ref. https://www.xilinx.com/support/documentation/selection-guides/zynq-ultrascale-plus-product-selection-guide.pdf


# Heterogeneous: ZYNQ *RF*SoC technology



The bottleneck of the previous ZYNQ MPSoC was the bandwidth and complexity of the JESD204B standard interface to fast ADCs/DACs



Ref. https://www.xilinx.com/support/documentation/selection-guides/zynq-usp-rfsoc-product-selection-guide.pdf

# **RF-ADC/DAC Implementation steps**



### **Versal architecture - Overview**

New Xilinx architecture **ACAP** (Adaptive Compute Acceleration Platform), developed in TSMC 7 nm FinFET technology. Key features:

- More heterogeneous
- Processor + FPGA + RF analog + HBM + AI engines (artificial intelligence)
- High bandwidth & low-latency → multi-Terabit/sec throughput





#### Ref:

https://www.xilinx.com/support/documentation/white\_papers/wp505-versal-acap.pdf







# Next generation of data processing



The new generation of particle physics experiments has already reached hundreds of millions of events per second, meaning physicists must sift through tens of *petabytes of data per year*. As the resolution of detectors improves, ever better solutions are needed for real-time pre-processing and filtering ('trigger') of the most promising events.



# What's the difference between Artificial Intelligence, Machine Learning, and Deep Learning?

AI involves machines that can perform tasks that are characteristic of human intelligence. While this is rather general, it includes things like planning, understanding language, recognizing objects and sounds, learning, and problem solving.



#### At its core, machine learning is simply a way of achieving AI

Machine learning is *"the ability to learn without being explicitly programmed."* AI can be achieved without machine learning, but this would require writing millions of lines of code with complex rules and decision trees.



### Deep learning is one of many approaches to machine learning

Deep learning was inspired by the structure and function of the brain, namely the interconnecting of many neurons. Artificial Neural Networks (ANNs) are algorithms that mimic the biological structure of the brain.
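As a minimal illustration of the "interconnected neurons" idea, the sketch below runs one forward pass through a small fully connected network in plain Python. The layer sizes, random weights and input vector are arbitrary assumptions for illustration; nothing here is trained or taken from the talk.

```python
import math
import random

def dense(x, weights, biases):
    """One fully connected layer: each output neuron is a weighted sum of all inputs plus a bias."""
    return [sum(xi * w for xi, w in zip(x, row)) + b
            for row, b in zip(weights, biases)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers):
    """ReLU after every hidden layer; the last layer feeds a softmax."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))

def random_layer(n_in, n_out):
    """Untrained layer with uniform random weights and zero biases (illustration only)."""
    weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

random.seed(0)
layers = [random_layer(4, 8), random_layer(8, 3)]
probs = forward([0.5, -1.0, 2.0, 0.1], layers)
print(probs)  # three class probabilities that sum to 1
```

Training would adjust the weights so that these output probabilities match labeled examples.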

- Data analysis
- Reconstruction chain  $\rightarrow$  jet tagging (task to find the particle ID of a jet)
- Low-level and high-level trigger systems
- Data quality monitor
- etc.



- Physics analysis
- Reconstruction chain  $\rightarrow$  jet tagging (task to find the particle ID of a jet)
- Low-level and high-level trigger systems
- Data quality monitor
- etc.

Data source

 Heterogeneous FPGA/GPU-based readout system

http://ufo.kit.edu/ufo



Fast feedback

M. Caselle et al., JINST 12 C03015 (2017)







Using heterogeneous FPGA-GPU to combine the strengths of both FPGA and GPU technologies


# **High-performance Deep Machine Learning FPGA**

High Level Synthesis for Machine Learning (hls4ml) has been developed at CERN

*Talk: Machine Learning for HL-LHC Detectors*, by Jennifer NGADIUBA (CERN), Friday room GER 038



What is hls4ml? → A framework that removes a major barrier to hardware development of ML algorithms, allowing developers with little or no FPGA expertise to program FPGAs with high-level synthesis tools

Case studies:

- HEP: Machine Learning for low-level trigger system based on FPGA → fast jet substructure classification
- Photon Science: fast feedback to RF-system of synchrotron machine for the control of the coherent THz emission in burst mode
- Under development: low-level trigger system for HL-CMS track finding

| Jet classification (Res_V) | FPGA prediction | Python Keras calculation (GPU) |
|----------------------------|-----------------|--------------------------------|
| Gluon (g)                  | 0.118164        | 0.12993355                     |
| Quark (q)                  | 0.639648        | 0.6487177                      |
| Boson (W)                  | 0.118164        | 0.10633943                     |
| Boson (Z)                  | 0.118164        | 0.10616959                     |
| Top quark (t)              | 0.015625        | 0.00883975                     |
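A short sketch for comparing the two columns of the table. It checks whether both implementations pick the same class and whether the FPGA outputs sit on a fixed-point grid. The value `FRAC_BITS = 10` is an assumption inferred from the printed numbers; the slides do not state the precision used.

```python
FPGA = {"g": 0.118164, "q": 0.639648, "W": 0.118164, "Z": 0.118164, "t": 0.015625}
KERAS = {"g": 0.12993355, "q": 0.6487177, "W": 0.10633943, "Z": 0.10616959, "t": 0.00883975}

FRAC_BITS = 10  # assumption: inferred from the table values, not stated in the slides

def to_lsb(value, frac_bits=FRAC_BITS):
    """Express a value in units of the least significant bit, 2**-frac_bits."""
    return value * (1 << frac_bits)

for label, value in FPGA.items():
    lsb = to_lsb(value)
    # The table prints six decimals, so allow a small rounding tolerance.
    on_grid = abs(lsb - round(lsb)) < 1e-3
    print(f"{label}: {round(lsb)} / {1 << FRAC_BITS}  on grid: {on_grid}")

best_fpga = max(FPGA, key=FPGA.get)
best_keras = max(KERAS, key=KERAS.get)
max_diff = max(abs(FPGA[k] - KERAS[k]) for k in FPGA)
print(best_fpga, best_keras, max_diff)
```

Both columns select the quark class, and every FPGA value is, within print rounding, a multiple of 2^-10. This is consistent with reduced-precision fixed-point arithmetic being the source of the small differences, though that remains an inference.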

**16 clock cycles** @ **200 MHz** → **latency** = **80 ns**

Implemented on ZYNQ UltraScale+ XCZU9EG
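The quoted latency follows directly from the cycle count and the clock frequency. A minimal sketch of the arithmetic (the helper name `latency_ns` is illustrative):

```python
def latency_ns(clock_cycles: int, clock_hz: float) -> float:
    """Latency in nanoseconds of a pipeline that takes `clock_cycles` at `clock_hz`."""
    return clock_cycles / clock_hz * 1e9

# Values from the slide: 16 clock cycles at 200 MHz (5 ns clock period)
print(f"{latency_ns(16, 200e6):.0f} ns")
```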

#### Low-latency & ultra-fast ML inference

Javier Duarte et al., Fast inference of deep neural networks in FPGAs for particle physics, arXiv:1804.06913v3 (2018)


Institute for Data Processing and Electronics (IPE)

## **Deep Machine Learning – photon science**

Goal is to keep the coherent THz intensity stable by fast feedback to RF System

Complex and nonlinear dynamics in longitudinal / transverse bunch profiles $\rightarrow$ described by the *Vlasov-Fokker-Planck* equation, a nonlocal nonlinear partial differential equation for the time evolution of the probability distribution of a particle in a synchrotron machine





### Impact

- Next generation of detectors will be harder than HL-LHC: not necessarily 'bigger', but generating huge and complex data volumes (petabyte/sec) and serious technological challenges → we are required to push the technology envelope (especially for trigger and data acquisition)
- Toward a common hardware infrastructure → novel programmable device families, e.g. ACAP, which combine high-performance FPGA, scalar processors + artificial intelligence + vector processors (i.e. GPU)
  - Ad-hoc logic $\rightarrow$ in the high-performance & flexible FPGA region
  - Detector/system interconnections → by flexible I/O and multi-Tb/s integrated high bandwidth memory
  - AI engine $\rightarrow$ low-latency high-performance ML inference (trigger, processing)
- Machine learning is one of the most promising techniques for data analysis, processing (trigger), intelligent detector configuration, data quality monitoring, etc.
  - Very flexible: the *ability to learn* without being explicitly programmed, simple code maintenance, easy integration in FPGA, etc.
  - ... we are just at the beginning; new exciting technologies will appear in the coming years

### Thank you for your attention







### **Backup slides**



# **Two FPGA technology solutions**






### **Charting an Aggressive Course Forward**






### New "High-Flex 2" readout card

Photon science:

- New generation of ultra-fast cameras for science (PS DTS)
- Detectors for beam diagnostics (ARD DTS)

→ Poster by M. Caselle (ARD + DTS)

- Hardware platform for implementing machine learning algorithms
- Superconducting sensors and quantum technologies:
  - Readout of Metallic Magnetic Calorimeters arrays
  - Control and readout of superconducting quantum bits
- High Energy Physics (HEP):
  - NA62 (SPS-CERN) fast "low-level" trigger system, GPU-based, ~µs latency
  - High Level Trigger (HLT) based on modern GPU and FPGA accelerators



Optical link (full-duplex): up to 190 Gb/s

PCIe Gen 4 (x8 or x16 lanes): *up to 240 Gb/s*



SoC

### **New** "High-Flex 2" readout card

### Photon science:



- New generation of detectors for beam diagnostics
- Diagnostics and stabilization of laser systems
- Superconducting sensors and quantum technologies:
  - Control and readout of superconducting quantum bits
- High Energy Physics (HEP):
  - NA62 (SPS-CERN) fast "low-level" trigger system, GPU-based
  - High Level Trigger (HLT) based on GPU-FPGA accelerators

### Hardware platform for Artificial Intelligence algorithms

Heterogeneous FPGA-GPU system based on machine learning

### Fully designed @ KIT





**PCIe** Gen 3 and 4, 16 lanes  $\rightarrow$  up to 240 Gb/s full-duplex
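As a rough cross-check of the PCIe figure, the sketch below computes the per-direction line rate after 128b/130b encoding, using the standard per-lane rates of 8 GT/s (Gen 3) and 16 GT/s (Gen 4). The function name `pcie_gbps` is illustrative, and packet-level protocol overhead, which lowers the usable rate further, is not modeled.

```python
# Per-lane raw rate in GT/s and line-encoding efficiency.
# PCIe Gen 3 and Gen 4 both use 128b/130b encoding.
PCIE = {3: (8.0, 128 / 130), 4: (16.0, 128 / 130)}

def pcie_gbps(gen: int, lanes: int) -> float:
    """Per-direction link bandwidth in Gb/s after line encoding, before packet overhead."""
    rate_gt_s, efficiency = PCIE[gen]
    return rate_gt_s * efficiency * lanes

for gen in (3, 4):
    print(f"Gen {gen} x16: {pcie_gbps(gen, 16):.1f} Gb/s per direction")
```

For Gen 4 x16 this gives about 252 Gb/s, so the quoted 240 Gb/s sits slightly below the encoded line rate.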




# Readout card based on ATCA / µTCA

- ATCA or µTCA standard
- Compatible with CMS timing layer detector readout card
- Key ideas:

### 1. I/O by interposer:

- Optical links
- FMC+ digital
- LPAF analog + digital



High-density board-to-board interconnection by interposer technology





### 2. FPGA by interposer:

- High-flexibility in FPGA selection
- Kintex/Virtex or RF/SoC Zynq

### 3. Centralized slow-control by ZYNQ processors and ATCA shelf manager, new tasks:

- Detector calibration
- On-line data quality check
- Machine learning



### **Common readout card 3 in 1 concept**



New LPAF (Low Profile Open-Pin-Field Array) connector: High-Speed High-Density analog & digital signals


Common HW for:

- Massive optical I/O $\rightarrow$ 96 + 24 optical links
- High-end ADC/DAC cards $\rightarrow$ ADCs = 80 GS/s + DACs = 320 GS/s
- Massive digital connection $\rightarrow$ 320 digital I/O + 40 high-speed serial links (each > 16 Gb/s)




## High-performance heterogeneous FPGA-GPU DAQ

- Modern photon science detectors generate huge raw data volumes (~120 Gb/s)
- Observed slow changes in synchrotron machine (e.g. current) $\rightarrow$ seconds to hours



PCIe readout card of UFO DAQ Platform

- DMA working close to theoretical limit of data link
- Data latency 5 times better than other DMA architectures
- M. Caselle et al., JINST 12 C03015 (2017)

- A heterogeneous FPGA/GPU-based readout system, the UFO DAQ platform, has been developed: http://ufo.kit.edu/ufo
- The core component is a "novel" Direct Memory Access (DMA) architecture
- Direct FPGA ↔ GPU communication enables real-time data processing







# High-performance heterogeneous FPGA-GPU

### Ultrafast X-ray Computer Tomography @ KARA (KIT)





Reconstruction and semiautomatic segmentation by GPUs



3D spatial resolution: <1 µm + time resolution $\rightarrow$ µs / ms

### Data processing up to 50 Gb/s in real-time

M. Caselle et al., IEEE-RT DOI:10.1109/TNS.2013.2252528 (2013)



CMS low-level trigger system



L1 trigger will require reconstruction of charged particles with transverse momentum $\gtrsim 2$ GeV/c



CMS low-level trigger system based on FPGA-GPU track reconstruction and fitting

Total data latency of 6.9 µs = 2 µs (data transfer) + 4.9 µs (GPU processing)

H. Mohr, M. Caselle et al., JINST 12 C04019 (2017)

### **Heterogeneous FPGA-GPU (Performance)**

### Data throughput KALYPSO $\rightarrow$ FPGA $\rightarrow$ GPU



- DMA working close to theoretical limit of data link (PCIe Gen3)
- Data throughput > 12.5 Gbyte/s with both NVIDIA and AMD GPUs
- High throughput → fundamental to sustain continuous acquisition over long times (seconds to hours)
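To illustrate why sustained throughput matters for long acquisitions, here is a minimal sketch of the data volume accumulated at the quoted 12.5 Gbyte/s. The durations are illustrative assumptions, not values from the measurement.

```python
THROUGHPUT_GBYTE_S = 12.5  # lower bound quoted on the slide

def volume_tbyte(seconds: float, gbyte_per_s: float = THROUGHPUT_GBYTE_S) -> float:
    """Data volume in Tbyte (10**12 bytes) accumulated over a continuous acquisition."""
    return gbyte_per_s * seconds / 1000.0

for label, seconds in [("1 s", 1), ("1 min", 60), ("1 h", 3600)]:
    print(f"{label:>5}: {volume_tbyte(seconds):8.2f} Tbyte")
```

One hour of continuous acquisition at this rate already corresponds to 45 Tbyte.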

### Round-trip time: FPGA →GPU →FPGA



- NVIDIA (Tesla K40): data latency < 2 μs, jitter < 30 ns
- AMD (FirePro W9100): data latency < 1.4 μs, jitter 50 ns
- Low-latency  $\rightarrow$  fundamental for fast feedback to experimental system






### **Versal – Adaptable Engines**

For traditionalists: This is the FPGA part

### Some known facts

- 6 Input LUTs
- Each CLB has 32 LUTs and 64 FF (4x density compared to US+)
- 16 LUTs in a slice can be
  - a 64 bit RAM
  - 32-bit shift registers (SRL32) or two SRL16
- Internal connection of LUTs possible
- 4x clock, 4x set/reset, 16 clock enable
- 3 step voltage-scaling supported





### Versal – AI tile architecture


### 1.3 GHz VLIW / SIMD vector processors

### Parallelism

- VLIW: 7+ operations / clock cycle
- SIMD: 512 bit vector datapath (8 / 16 / 32 bit & SPFP operands)
- Up to 128 INT8 MACs / clock cycle / core
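The clock frequency and MAC count above give a theoretical peak per core. A minimal sketch of that arithmetic; the `n_cores` parameter is a hypothetical knob, since the number of AI engine tiles per device is not given here.

```python
CLOCK_HZ = 1.3e9       # AI engine clock, from the slide
MACS_PER_CYCLE = 128   # INT8 MACs per clock cycle per core, from the slide

def peak_int8_gmacs(n_cores: int = 1) -> float:
    """Theoretical peak INT8 throughput in GMAC/s, ignoring memory and data-movement stalls."""
    return n_cores * MACS_PER_CYCLE * CLOCK_HZ / 1e9

print(f"{peak_int8_gmacs():.1f} GMAC/s per core")
```

This is an upper bound; sustained throughput depends on how well the data memory and DMA keep the vector datapath fed.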

### Memory

- 16 KB Internal program memory
- 32 KB data memory (parallel)
- Integrated DMA logic



# **Advantage of Multi-Processor SoC technology**

User-friendly Linux applications are combined with the efficiency and throughput of a system fully implemented in the FPGA fabric



- Embedded ('slow-control') server on FPGA, e.g. an EPICS server
- Detector ('calibration') routines running on FPGA
- High granularity on-line monitoring of system and data quality ('DQM') on FPGA
- etc.


# **Advantage of Multi-Processor SoC technology**

User-friendly Linux applications are combined with the efficiency and throughput of a system fully implemented in the FPGA fabric



- PYNQ, reVISION: open-source projects from Xilinx that make it easy to program the hardware without having to use an HDL (Verilog/VHDL)
- Low-level trigger based on Artificial Intelligence

