#### Xilinx Versal AI Core series




# AI Engine Tile



#### **VLIW** processing units

- 400 engines arranged in a 2D array
- running at >1 GHz
- 512-bit vector unit:
  - floating point: 8 multiply-accumulates per cycle
  - fixed point:

| X Operand (bits) | Z Operand (bits) | Output (bits) | GMACs at 1 GHz |
|------------------|------------------|---------------|----------------|
| 8 real     | 8 real     | 48 real       | 128                        |
| 16 real    | 8 real     | 48 real       | 64                         |
| 16 real    | 16 real    | 48 real       | 32                         |
| 16 real    | 16 complex | 48 complex    | 16                         |
| 16 complex | 16 real    | 48 complex    | 16                         |
| 16 complex | 16 complex | 48 complex    | 8                          |
| 16 real    | 32 real    | 48/80 real    | 16                         |
| 16 real    | 32 complex | 48/80 complex | 8                          |
| 16 complex | 32 real    | 48/80 complex | 8                          |
| 16 complex | 32 complex | 48/80 complex | 4                          |
| 32 real    | 16 real    | 48/80 real    | 16                         |
| 32 real    | 16 complex | 48/80 complex | 8                          |
| 32 complex | 16 real    | 48/80 complex | 8                          |
| 32 complex | 16 complex | 48/80 complex | 4                          |
| 32 real    | 32 real    | 80 real       | 8                          |
| 32 real    | 32 complex | 80 complex    | 4                          |
| 32 complex | 32 real    | 80 complex    | 4                          |
| 32 complex | 32 complex | 80 complex    | 2                          |
| 32 SPFP    | 32 SPFP    | 32 SPFP       | 8                          |
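
The GMAC column follows directly from the number of multiply-accumulates issued per cycle and the clock rate. As a rough sketch, the per-tile figures can be scaled to the full array of 400 engines. The result is a theoretical peak that assumes every engine issues a full vector MAC on every cycle, which a real design is unlikely to sustain. The function names are illustrative and not part of any Xilinx tool.

```python
# Peak-throughput arithmetic using only figures quoted on the slides.
CLOCK_GHZ = 1.0    # reference clock used by the table (slides quote ">1 GHz")
NUM_ENGINES = 400  # engines in the 2D array


def gmacs_per_tile(macs_per_cycle, clock_ghz=CLOCK_GHZ):
    """GMAC/s of one tile: MACs per cycle times the clock in GHz."""
    return macs_per_cycle * clock_ghz


def array_peak_tmacs(macs_per_cycle, engines=NUM_ENGINES, clock_ghz=CLOCK_GHZ):
    """Theoretical peak of the whole array in TMAC/s (assumes 100% utilization)."""
    return gmacs_per_tile(macs_per_cycle, clock_ghz) * engines / 1000.0


# 8-bit x 8-bit real operands: 128 MACs per cycle per tile (first table row).
int8_peak = array_peak_tmacs(128)
# Single-precision floating point: 8 MACs per cycle per tile (last table row).
fp32_peak = array_peak_tmacs(8)

print(f"8-bit fixed point: {int8_peak:.1f} TMAC/s peak")
print(f"SPFP:              {fp32_peak:.1f} TMAC/s peak")
```

The factor of 16 between the two results is the main reason quantized fixed-point networks are attractive on this architecture.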


## AI Engine Tile: Interfaces




## **AI Engine Array**



## AI Engine Array - PL interface




## **Practical Application**

- Neural Network for ATLAS Trigger application on FPGA (fFEX)
- Utilize AI Engines in the Design
- Basic Idea:
  - Frame particle identification as object detection
  - Treat calorimeter as an image

## **ATLAS Calorimeter Structure**

- Calorimeter has layered cell structure
- Each energy deposit is associated with a position in $\eta$ and $\Phi$
- Rough correspondence between calorimeter cells and pixels of an image
- All physics analysis is based on this information combined with tracking
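
To make the cell-to-pixel correspondence concrete, the sketch below bins a list of energy deposits into a 2D grid in $\eta$ and $\Phi$. It is a simplified illustration, not the ATLAS data format: the uniform binning, the `(eta, phi, energy)` tuple layout, and the function name `deposits_to_image` are all assumptions. The real calorimeter has several layers with different granularities, which could be represented as separate image channels.

```python
def deposits_to_image(deposits, eta_range, phi_range, n_eta, n_phi):
    """Sum (eta, phi, energy) deposits into an n_eta x n_phi grid of pixels."""
    eta_min, eta_max = eta_range
    phi_min, phi_max = phi_range
    image = [[0.0] * n_phi for _ in range(n_eta)]
    for eta, phi, energy in deposits:
        # Ignore deposits outside the window covered by the image.
        if not (eta_min <= eta < eta_max and phi_min <= phi < phi_max):
            continue
        # Map the continuous coordinate to an integer pixel index.
        i = int((eta - eta_min) / (eta_max - eta_min) * n_eta)
        j = int((phi - phi_min) / (phi_max - phi_min) * n_phi)
        # Clamp to guard against floating-point rounding at the upper edge.
        image[min(i, n_eta - 1)][min(j, n_phi - 1)] += energy
    return image


# Hypothetical deposits: two fall into the same pixel, one lies outside the window.
example_deposits = [
    (0.05, 0.05, 10.0),
    (0.05, 0.06, 5.0),
    (0.35, 0.25, 2.0),
    (1.00, 0.00, 99.0),
]
image = deposits_to_image(example_deposits, (0.0, 0.4), (0.0, 0.4), 4, 4)
for row in image:
    print(row)
```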





- Based on the YOLO architecture:
  - Divide image into grid and locate objects inside grid cells
  - Very fast algorithm
- Small region in $\eta$ and $\Phi$
- Proof of principle:
  - Predict electrons and their location in the calorimeter
  - Simple architecture
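
The grid idea can be sketched as follows: in YOLO-style detection, the grid cell that contains an object's center is responsible for predicting it, and the position is expressed as an offset within that cell. The code below applies this to an electron position in $\eta$ and $\Phi$. The grid size, coordinate window, and function name are hypothetical, and the slides do not specify the exact target encoding used by the network.

```python
def assign_to_grid(eta, phi, eta_range, phi_range, grid_eta, grid_phi):
    """Return (row, col, d_eta, d_phi) for an object centered at (eta, phi).

    row, col identify the responsible grid cell; d_eta, d_phi are the
    offsets of the center inside that cell, in units of the cell size.
    """
    eta_min, eta_max = eta_range
    phi_min, phi_max = phi_range
    if not (eta_min <= eta < eta_max and phi_min <= phi < phi_max):
        raise ValueError("object center lies outside the image window")
    # Position measured in units of grid cells.
    u = (eta - eta_min) / (eta_max - eta_min) * grid_eta
    v = (phi - phi_min) / (phi_max - phi_min) * grid_phi
    row = min(int(u), grid_eta - 1)
    col = min(int(v), grid_phi - 1)
    return row, col, u - row, v - col


# Hypothetical electron in a 0.5 x 0.5 window divided into a 4 x 4 grid.
row, col, d_eta, d_phi = assign_to_grid(0.3125, 0.15625, (0.0, 0.5), (0.0, 0.5), 4, 4)
print(row, col, d_eta, d_phi)
```

Because the network only has to output one small vector per grid cell in a single pass, this formulation avoids a separate search step, which is what makes the approach fast.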



## Architecture

- Regression
  - Conv2D $\rightarrow$ Dense $\rightarrow$ Dense
  - 60,000 parameters
- Classification
  - Conv2D $\rightarrow$ MaxPool $\rightarrow$ Dense
  - 400,000 parameters
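
For orientation, the parameter count of such a network is the sum of the weights and biases of each layer. The sketch below uses the standard formulas for convolutional and dense layers. The layer shapes are hypothetical placeholders chosen only to show the arithmetic; the slides do not give the real shapes, and the example does not reproduce the 60,000 or 400,000 figures.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters, bias=True):
    """Weights (plus optional biases) of a 2D convolution layer."""
    return (kernel_h * kernel_w * in_channels + (1 if bias else 0)) * filters


def dense_params(inputs, units, bias=True):
    """Weights (plus optional biases) of a fully connected layer."""
    return (inputs + (1 if bias else 0)) * units


# Hypothetical Conv2D -> Dense -> Dense network (shapes are NOT from the slides):
# 8x8x1 input, 3x3 convolution with 4 filters and no padding -> 6x6x4 output.
conv = conv2d_params(3, 3, in_channels=1, filters=4)
flattened = 6 * 6 * 4
dense1 = dense_params(flattened, 16)
dense2 = dense_params(16, 3)
total = conv + dense1 + dense2

print(f"Conv2D: {conv}, Dense 1: {dense1}, Dense 2: {dense2}, total: {total}")
```

In this kind of network the first dense layer usually dominates the count, because it connects every flattened convolution output to every unit. A max-pooling layer has no parameters of its own.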




#### **Offline Results**




## **DPU Implementation**

- Xilinx IP core: Deep-learning Processing Unit
- Xilinx default method for neural network implementation
- Optimizes accuracy and latency in multiple steps
- Final latency: 33 µs (for the model with 60,000 parameters)
- For comparison: a mini model with 3 parameters still takes 30 µs
- The DPU has a large bottleneck: latency is almost independent of network size
- Not intended for smaller networks at ultralow latency
- Optimized for general purpose implementation of larger networks
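
The two latency figures above allow a rough estimate of how much of the DPU latency is fixed overhead. The sketch below assumes, purely for illustration, that latency grows linearly with the parameter count. With only two measurements and different layer types in the two models, this is a back-of-the-envelope estimate, not a characterization of the DPU.

```python
def fit_linear_latency(p1, t1, p2, t2):
    """Fit latency = overhead + per_param * n_params through two measurements."""
    per_param = (t2 - t1) / (p2 - p1)
    overhead = t1 - per_param * p1
    return overhead, per_param


# Measurements quoted on the slide: (parameters, latency in microseconds).
overhead_us, per_param_us = fit_linear_latency(3, 30.0, 60_000, 33.0)
overhead_fraction = overhead_us / 33.0

print(f"fixed overhead:     {overhead_us:.2f} us")
print(f"cost per parameter: {per_param_us * 1000:.3f} ns")
print(f"overhead share of the 33 us result: {overhead_fraction:.0%}")
```

Under this assumption roughly nine tenths of the 33 µs is spent regardless of model size, which is consistent with the DPU being tuned for larger networks rather than ultra-low latency.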



- AI Engines are highly capable computation units for neural networks
- They can be utilized by:
  - mapping a trained NN to the AIEs via Vitis AI and the DPU IP core
  - or hardcoding the network in Vitis via C++

- The DPU shows comparatively poor latency for small networks
  - Can be used for larger networks with less stringent timing constraints
- Hardcoding can provide much lower latency

