



### Data-Flow: current and future data-transmission applications in the data-acquisition systems at the LHC

### INFIERI 2015, Hamburg

W.Vandelli CERN/PH-ADT Wainer.Vandelli@cern.ch



### Outline



#### ➔ Introduction

- → Basic Data-Acquisition (DAQ) principles
  - efficiency and dead-time
- ➔ Scaling it up
  - architecture and data-flow
- ➔ DAQ at the LHC
  - challenges and designs
- ➔ Event Building and Networking
  - DAQ-specific workloads and technology limits
- ➔ Coping with the LHC upgrade programme



### Introduction



#### → Data-flow is a sub-system of large data-acquisition systems

- connects the other subsystems and safely transports data between them
- → We need to understand DAQ before we can discuss data movement
- Data acquisition is not an exact science. It is an alchemy of electronics, computer science, networking and physics
  - funding and manpower matter as well
  - there is not ONE solution
    - each experiment has its own peculiarities and legacies
    - experience, risk perception, technology expectations, ...

### **General Overview**







# DAQ & Trigger



### Basic DAQ: periodic trigger









- Measure temperature at a fixed frequency
- ADC performs analog to digital conversion
  - our front-end electronics
- CPU does readout and processing



- The system is clearly limited by the time to process an "event"
- → Example: $\tau$ = 1 ms for ADC conversion + CPU processing + storage
- → Sustains a periodic trigger rate of ~1/(1 ms) = 1 kHz



### What is a "trigger"?



- →A "trigger" is a system that rapidly decides, based on "simple" criteria, if an interesting event took place, initiating the data-acquisition process
- → <u>Simple, rapid, selective</u> are the trigger keywords
- → These are <u>relative parameters</u> that depend on the operating conditions
  - in a multi-level trigger system the last level is normally much slower and more complex than the first one

An oscilloscope trigger does exactly this: it tells the instrument to start its internal signal acquisition and display





### Basic DAQ: physics trigger





- Measure β decay properties
- Events are asynchronous and unpredictable
  - need a **physics** trigger
- Delay compensates for the trigger latency
  - time needed to reach a decision



### Basic DAQ: real trigger





### Basic DAQ: real trigger & busy logic





- Busy logic avoids triggers while processing
- Which (average) DAQ rate can we achieve now?
  - reminder: τ = 1 ms was sufficient to run at 1 kHz with a clock trigger





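Define $f$ as the average rate of the physics process (the input trigger rate), $\nu$ as the average DAQ frequency and $\tau$ as the processing time per event

$\nu\tau$ → fraction of time the DAQ system is busy; $(1-\nu\tau)$ → fraction of time the DAQ system is free

$$f(1 - \nu\tau) = \nu \rightarrow \nu = \frac{f}{1 + f\tau} < f$$

$$\epsilon = \frac{N_{saved}}{N_{tot}} = \frac{\nu}{f} = \frac{1}{1 + f\tau} < 100\%$$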

Define DAQ <u>deadtime (d) as the time the system requires to process an event, without being able to handle other triggers.</u> In our example d = 0.1%/Hz

- → Due to the fluctuations introduced by the stochastic process, the efficiency will always be less than 100%
  - in our specific example, d = 0.1%/Hz, f = 1 kHz → $\nu$ = 500 Hz, $\epsilon$ = 50%
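
The busy-logic result can be cross-checked with a small Monte Carlo. The sketch below is an illustration, not part of the original slides: it assumes Poisson-distributed triggers (exponential inter-arrival times), a fixed processing time $\tau$ and no buffering. The function names are hypothetical.

```python
import random


def simulate_busy_logic(f, tau, n_triggers=200_000, seed=1):
    """Return (nu, efficiency) for random triggers of mean rate f [Hz]
    and a fixed processing time tau [s], with busy logic and no buffer."""
    rng = random.Random(seed)
    t = 0.0            # current time [s]
    busy_until = 0.0   # time at which the DAQ becomes free again
    accepted = 0
    for _ in range(n_triggers):
        t += rng.expovariate(f)   # exponential inter-arrival time (assumption)
        if t >= busy_until:       # DAQ is free: accept, then stay busy for tau
            accepted += 1
            busy_until = t + tau
        # otherwise the trigger is vetoed by the busy logic and lost
    return accepted / t, accepted / n_triggers


def analytic(f, tau):
    """nu = f / (1 + f*tau) and epsilon = nu / f, as derived above."""
    nu = f / (1.0 + f * tau)
    return nu, nu / f


if __name__ == "__main__":
    f, tau = 1000.0, 1e-3   # 1 kHz average trigger rate, 1 ms processing time
    print("simulated (nu, eps):", simulate_busy_logic(f, tau))
    print("analytic  (nu, eps):", analytic(f, tau))
```

For f = 1 kHz and $\tau$ = 1 ms the simulated values should agree with the analytic 500 Hz and 50% within statistical fluctuations.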



→ If we want to obtain $\nu$ ~ f ($\epsilon$ ~ 100%) → f$\tau$ << 1 → $\tau$ << $\lambda$, where $\lambda$ = 1/f is the average time between triggers

- f = 1 kHz, $\epsilon$ = 99% → $\tau$ < 0.01 ms → 1/$\tau$ > 100 kHz

In order to cope with the input signal fluctuations, we have to over-design our DAQ system by a factor 100. <u>This is very inconvenient!</u> Can we mitigate this effect?



### **Basic DAQ: De-randomization**





#### **De-randomization:** queuing theory

[Figure: efficiency $\epsilon$ (%) versus $\tau/\lambda$ for FIFO depths of 1, 5, 10 and 50]

→ With a FIFO we can now attain an efficiency of ~100% with $\tau \sim \lambda$ and a moderate buffer size

Analytic calculation is possible for very simple systems only. Otherwise simulations must be used.
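
Since simulations are the practical tool here, a minimal one can be sketched as follows. The model is an assumption made for illustration: Poisson triggers with mean spacing $\lambda$, a fixed processing time $\tau$, and a FIFO whose `depth` counts every event held by the system, including the one being processed. The function name and parameters are hypothetical.

```python
import random
from collections import deque


def fifo_efficiency(tau_over_lambda, depth, n_events=200_000, seed=1):
    """Fraction of triggers accepted by a FIFO of the given depth
    placed in front of a processor with fixed processing time tau."""
    rng = random.Random(seed)
    lam = 1.0                      # mean time between triggers (arbitrary unit)
    tau = tau_over_lambda * lam    # fixed processing time per event
    done = deque()                 # completion times of events still in the system
    t = 0.0
    accepted = 0
    for _ in range(n_events):
        t += rng.expovariate(1.0 / lam)    # next trigger arrival
        while done and done[0] <= t:       # drop events already processed
            done.popleft()
        if len(done) < depth:              # room left in the FIFO: accept
            start = done[-1] if done else t    # wait for the previous event
            done.append(start + tau)
            accepted += 1
        # otherwise the FIFO is full and the trigger is lost
    return accepted / n_events


if __name__ == "__main__":
    for depth in (1, 5, 10, 50):
        eff = fifo_efficiency(1.0, depth)
        print(f"depth={depth:3d}  tau/lambda=1.0  efficiency={eff:.3f}")
```

With `depth=1` the model reduces to plain busy logic; larger depths absorb the fluctuations in the trigger arrival times.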










### Basic DAQ: collider mode





- Particle collisions are synchronous
- Trigger <u>rejects</u> uninteresting events
- Even if collisions are synchronous, the triggers (i.e. good events) are unpredictable
- De-randomization is still needed



## Multi-level trigger systems



## Sometimes it is impossible to take a proper decision in a single place

- too long decision time
- too far
- too many inputs
- Distribute the decision burden in a hierarchical structure

  - usually $\tau_{N+1} \gg \tau_N$, $f_{N+1} \ll f_N$

- At the DAQ level, proper buffering must be provided for every trigger level
  - absorb latency
  - de-randomize



#### → Data must be transported from level to level





# DAQ Scaling up



### Basic DAQ: more channels












































## DAQ@LHC



→LHC experiments have O(10<sup>7</sup>) channels operating at 40 MHz (25 ns) → 40 TB/s

➔In addition, interesting phenomena are extremely rare

$$\sigma_H / \sigma_{Tot} \sim O(10^{-13})$$

#### → Events are complex

- significant number of overlapping collisions (pile-up μ)
- → Experiments are large (O(10 m))





# Trigger & DAQ Challenges at the LHC





### Challenging environment and requirements

- underground cavern
- space constraints
- power consumption constraints
- high radiation levels
- desire to limit non-active material
- magnetic field



LHC Experiment

- Different particle detection technologies organized in layers
- on-detector custom electronics form and digitize the signals
- operate at the LHC rate of 40 MHz

|                               | ATLAS               |
|-------------------------------|---------------------|
| Length (m)                    | 46                  |
| Diameter (m)                  | 25                  |
| Weight (t)                    | 7000                |
| Number of electronic channels | 100·10<sup>6</sup> |



### First Level Trigger

- → Operates at 40 MHz
- → Uses reduced granularity information from detector
- Simple, hardware-friendly algorithms
- → Small, deterministic decision time (~ $\mu$ s)
- Implemented with custom electronics: ASIC/FPGA
  - massively parallel through locality
- → Reduces the rate to 100 kHz














## LHC L1 Trigger and FE electronics



→Particle time of flight >> 25 ns

→ Cable delays >> 25 ns

Dedicated synchronization, timing and signal distribution facilities

Typical L1 decision latency is O(µs), dominated by signal propagation in cables





Digital/analog <u>custom</u> front-end pipelines store information during L1 trigger decision
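
The pipeline idea can be illustrated with a fixed-depth circular buffer. This is a software sketch of what is done in custom hardware, not the actual front-end design. The class and function names are hypothetical, and the example numbers (2.5 µs latency budget, 25 ns bunch spacing, 128 slots) are assumptions.

```python
def min_pipeline_depth(latency_ns, bunch_spacing_ns=25):
    """Number of bunch crossings that elapse during the L1 latency."""
    return -(-latency_ns // bunch_spacing_ns)   # integer ceiling division


class FrontEndPipeline:
    """Fixed-depth circular buffer with one slot per bunch crossing."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = [None] * depth
        self.bx = 0   # number of bunch crossings clocked in so far

    def clock(self, sample):
        """Store the sample of the current crossing, overwriting the oldest."""
        self.slots[self.bx % self.depth] = sample
        self.bx += 1

    def read(self, latency_bx):
        """Return the sample taken latency_bx crossings before the latest one."""
        if not 0 <= latency_bx < min(self.depth, self.bx):
            raise ValueError("sample is not in the pipeline")
        return self.slots[(self.bx - 1 - latency_bx) % self.depth]


if __name__ == "__main__":
    latency_bx = min_pipeline_depth(2500)   # 2.5 us / 25 ns = 100 crossings
    pipe = FrontEndPipeline(depth=128)      # hypothetical depth > latency
    for bx in range(300):
        pipe.clock(bx)                      # the sample is just the crossing number
    # An L1 accept arriving now refers to the crossing latency_bx ago.
    print(pipe.read(latency_bx))
```

If the pipeline is shallower than the trigger latency, the sample has been overwritten by the time the accept arrives, which is why the L1 decision time must be small and deterministic.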



- Data from events selected by the First Level Trigger are pushed by on-detector electronics over detector-dependent links
- Back-end custom electronics
  - adapt from specific link technology to common (custom) link technology
  - format data to a common standard

- Located in a service cavern adjacent to the experiment
- Readout PCs
  - buffer data
  - connect 2000 custom links to a 10 GbE network





#### Data-Flow and High-Level Trigger: COTS domain

- High-Level Trigger is a large computing farm with several tens of thousands of cores
  - fully based on COTS technologies
- Necessarily located in a surface computing room
- Primary parallelism and scaling scheme through event distribution
  - different events are independent
- Complemented by Readout and Data-Flow infrastructures
  - convey data from the detector (underground) to the computing farm
  - equalize farm usage and implement buffering
- Maximize throughput
  - no inter-node communications







Selected events are stored in a ½ PB transient storage system and asynchronously moved to the off-line mass storage, data analysis and data distribution facilities

### System distributed in space




# (Today's) Technology domains



- Custom electronics and serial links
- Best performance per cost for the different conditions
  - e.g. inner detector has tighter requirements than muon spectrometer

- Commercial Off-the-Shelf (COTS) computing and networking
- Software is more flexible than firmware
  - less steep learning curve
- Easier to maintain, replace, mix and match




Two domains are linked by common (still custom) serial link technology

Electronics could not implement high-level protocols needed by a network (e.g. TCP/IP)






## LHC DAQ Systems







### ALICE



[Figure: ALICE DAQ architecture; labels: custom hardware, network switch]

[Figure: ATLAS trigger and DAQ architecture. Calorimeter/muon and other detectors are read out at 40 MHz through front-ends (FE) and RODs. Level 1 (latency <2.5 µs) accepts 75 (100) kHz; Level 2 (~40 ms) requests Region-of-Interest data (~2%) from the ReadOut System and accepts ~3 kHz; the Event Builder feeds the Event Filter (~4 s) through the SubFarmInput, and ~200 Hz of EF-accepted events leave through the SubFarmOutput. Level 2 and Event Filter form the High Level Trigger.]







### LHCb



[Figure: LHCb DAQ architecture; labels: custom hardware, HLT, control and monitoring data]





# Networking



# Network



- → Examples: Ethernet, telephone, InfiniBand, ...
- → All devices are equal
- Devices communicate directly with each other
  - no arbitration, simultaneous communications
- → Devices communicate by sending messages
- In a switched network, switches move messages between sources and destinations
  - find the right path
  - handle "congestion" (two messages with the same destination at the same time)








Thanks to these characteristics, networks do scale well. They are the backbones of LHC DAQ systems



Event Building: collection and formatting of all the data elements of an event into a single unit

→ <u>Network-based EB is the choice of all LHC experiments</u> and a case study for networking in DAQ





### Network switch: crossbar





- Each input port can potentially be connected to each output port
- At any given time, only one input port can be connected to a given output port
- Different output ports can be reached concurrently by different input ports








→ Ideal situation → all inputs send data to different outputs

- no interference (no congestion)
- all input ports send data concurrently



## Crossbar switch: event building





- → EB workload implies converging data flow
  - <u>all inputs want to send to the same destination at the same time</u>
- → "Head of line blocking"
  - congestion
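
The difference between the ideal pattern and the event-building pattern can be made concrete with a toy model of an input-queued crossbar. The sketch assumes fixed-size packets, one packet per output per time slot and fixed-priority arbitration, none of which are stated in the slides. The function name is hypothetical.

```python
from collections import deque


def slots_to_drain(queues):
    """queues: one list of destination ports per input port, in FIFO order.
    Returns the number of time slots needed to deliver every packet when
    each output port accepts at most one packet per slot."""
    fifos = [deque(q) for q in queues]
    slots = 0
    while any(fifos):
        taken = set()              # outputs already connected in this slot
        for fifo in fifos:         # fixed-priority arbitration (assumption)
            if fifo and fifo[0] not in taken:
                taken.add(fifo.popleft())
            # else: the head packet waits and blocks everything queued behind it
        slots += 1
    return slots


if __name__ == "__main__":
    n = 4
    # Ideal pattern: in every slot the inputs address different outputs.
    ideal = [[(i + k) % n for k in range(n)] for i in range(n)]
    # Event building: every input sends its fragment of event k to builder k.
    event_building = [[k for k in range(n)] for _ in range(n)]
    # Head-of-line blocking: input 1 cannot reach the free output 2.
    hol = [[0, 1], [0, 2]]
    print("ideal:          ", slots_to_drain(ideal))
    print("event building: ", slots_to_drain(event_building))
    print("head-of-line:   ", slots_to_drain(hol))
```

In the `hol` case the packet for output 2 is stuck behind a packet for the contended output 0, so the transfer takes three slots although two would suffice without head-of-line blocking.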





# Congestion





→ Well-known phenomenon ...

- Unlike road traffic, Ethernet hardware is allowed to "drop" packets
  - higher-level protocols have to take care of resending
  - possibly significant performance impact



→ Adding input and output FIFOs dramatically improves the handling of the EB pattern

→ The EB workload remains problematic anyway

- FIFO size is limited, data size is variable
- limited internal switching speed





# ATLAS Data Network Topology







# **Event Building in ATLAS**



- The present ATLAS network topology has two possible sources of congestion
- Funnelling: multiple links may overload output link
  - Brute force → central routers are large (and expensive) Carrier-class Internet-scale devices with massive buffering and switching capabilities
- Bandwidth mismatch: a faster input link may instantaneously overload a slower output link
  - Traffic shaping → control maximum burst size with respect to the switch buffer size







# **Traffic Shaping**



### Credit mechanism at application level to control the burst size

- more credits → more concurrent responses
- →Quality metric is collection time → time to fetch data
- Assuming one credit corresponds to a 1 kB response size
  - once the outstanding credits exceed the switch buffer size, packet drops happen
- It is interesting to study what buffer size would make traffic shaping unnecessary
  - simulations calibrated to reproduce the above measurements
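
The credit idea can be sketched with a toy model. This is not the ATLAS implementation: it assumes a collector that requests fragments in rounds, with all responses of a round arriving as one burst, and only the 1 kB-per-credit convention is taken from the slide. The function name and the fragment sizes are hypothetical.

```python
import math
from collections import deque


def collect_event(fragment_sizes_kb, credits, credit_kb=1.0):
    """Request all fragments of one event without ever exceeding the
    credit budget. Returns (rounds, peak_burst_kb): the number of request
    rounds (a proxy for collection time) and the largest response burst."""
    pending = deque(fragment_sizes_kb)
    rounds, peak_kb = 0, 0.0
    while pending:
        available, burst_kb = credits, 0.0
        while pending:
            cost = math.ceil(pending[0] / credit_kb)   # credits for this response
            if cost > credits:
                raise ValueError("fragment larger than the whole credit budget")
            if cost > available:
                break                  # out of credits: wait for the next round
            available -= cost
            burst_kb += pending.popleft()
        peak_kb = max(peak_kb, burst_kb)
        rounds += 1                    # responses arrive, credits are returned
    return rounds, peak_kb


if __name__ == "__main__":
    fragments = [1.0] * 100            # hypothetical: 100 fragments of 1 kB
    for credits in (10, 50, 100):
        rounds, peak = collect_event(fragments, credits)
        print(f"credits={credits:3d}  rounds={rounds:2d}  peak burst={peak:.0f} kB")
```

More credits shorten the collection (fewer rounds) but enlarge the burst that the switch must buffer, which is the trade-off the credit mechanism tunes.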







# LHC Upgrades



## LHC Upgrade Programme
















# "Pile up": collision multiplicity





One "event" at the LHC is the superposition of many, almost concurrent, proton-proton collisions

- pile-up is the number of overlapping collisions in one event
- the LHC upgrade programme increases the "brightness" (luminosity) of the accelerator, increasing the pile-up
  - faster collection of statistics, BUT non-linear increase in event complexity



# **Upgrade: Challenges**



### ➔ Increased pile-up

- larger data size  $\rightarrow$  bandwidth and storage
- more complex events → increased computing needs, reduced trigger efficiency and rejection power and increased acceptance rates







|                                        | ALICE                        | LHCb                        | CMS                           | ATLAS                       |
|----------------------------------------|------------------------------|-----------------------------|-------------------------------|-----------------------------|
| Hardware<br>trigger                    | No                           | No                          | Yes                           | Yes                         |
| Software<br>trigger input<br>rate      | 50 kHz Pb-Pb<br>200 kHz p-Pb | 30 MHz                      | 500/750 kHz for<br>PU 140/200 | 0.4 MHz                     |
| Baseline<br>processing<br>architecture | CPU/GPU/FPGA/<br>Cloud&Grid  | CPU farm<br>(+coprocessors) | CPU farm<br>(+coprocessors)   | CPU farm<br>(+coprocessors) |
| Software<br>trigger output<br>rate     | 50 kHz Pb-Pb<br>200 kHz p-Pb | 20-100 kHz                  | 5-7.5 kHz                     | 5-10 kHz                    |



### **Common Detector Links**





- The function of the back-end electronics is to adapt between two classes of serial links
  - specialized detector links
  - common readout links
- No main DAQ functionalities, just technology proxying










### GBT & VL





#### ➔ Asymmetric technology

- radiation hard components (ASIC) on the detector, COTS (FPGA) on the receiving end
- 500 mW power consumption (on-detector)
- up to 10.24 Gb/s uplink, 2.56 Gb/s downlink
- →Logical, fixed bandwidth sharing per link allows multiplexing of different information
  - unique detector interface: data, configuration, trigger, slow control ...




GBT deployment planned for all LHC experiments



### **Detector Readout**












- Detector is completely interfaced by common technology
- → Keep the same two-link scheme
  - now with common, still custom, backend electronics










- reduces single points of failure
- simpler load balancing and maintenance
- → LHC experiments: ~10000 GBT links
  - need a high-density heterogeneous data router



# Front-end Link Exchange (FELIX)





- ATLAS project for interfacing GBT links
- →Stateless, configurable data routing device
  - route data by streams or type
  - propagate commands
  - data duplication and sampling for monitoring
- Handling of high-level switched protocols
  - InfiniBand/Ethernet/...
  - QoS for different traffic types

## **FELIX Prototype**









### → HL-LHC DAQ system will need network throughput of ~20 Tb/s

- currently ~1 Tb/s

### → Ethernet is the current champion in networking

- 100GbE available, 400GbE in preparation
- 20 Tb/s → 200 × 100GbE or 50 × 400GbE links
- → The Event Building pattern is not easy on the network

### ➔Ideally want deep buffers to avoid application level solutions







→ Currently two major classes of Ethernet network devices

#### → Carrier

- deep buffers, high density, flexible firmware (FPGA, network processors), \$\$\$/port

#### → Data-centre (Top-of-Rack, TOR)

- shallow buffers, ASIC-based, ultra-high density, focused on layer 2 and simple layer 3 features, very low latency, \$/port

| Speed | Carrier [USD/port] | TOR [USD/port] |
|-------|--------------------|----------------|
| 10GbE | 400 - 1000                | 200 - 250             |
| 40GbE | 1000 - 4000               | 500 - 900             |
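
To get a feeling for the scale, the port prices in the table above can be combined with the ~20 Tb/s throughput target. The sketch below is a back-of-the-envelope estimate under strong assumptions: it counts only the ports needed to carry the throughput once at full line rate, and ignores the extra ports of a real multi-layer topology, optics and cabling. The function names are hypothetical.

```python
import math

# (low, high) price per port in USD, copied from the table above
PRICE_USD_PER_PORT = {
    ("carrier", 10): (400, 1000),
    ("carrier", 40): (1000, 4000),
    ("tor", 10): (200, 250),
    ("tor", 40): (500, 900),
}


def ports_needed(throughput_gbps, port_speed_gbps):
    """Minimum number of ports that can carry the throughput at line rate."""
    return math.ceil(throughput_gbps / port_speed_gbps)


def port_cost_range(throughput_gbps, device, port_speed_gbps):
    """Return (ports, low_cost_usd, high_cost_usd) for one device class."""
    n = ports_needed(throughput_gbps, port_speed_gbps)
    low, high = PRICE_USD_PER_PORT[(device, port_speed_gbps)]
    return n, n * low, n * high


if __name__ == "__main__":
    target_gbps = 20_000   # ~20 Tb/s
    for (device, speed) in sorted(PRICE_USD_PER_PORT):
        n, low, high = port_cost_range(target_gbps, device, speed)
        print(f"{device:8s} {speed:3d}GbE: {n:5d} ports, {low:,} - {high:,} USD")
```

Even in this simplified count the price gap between the two device classes is large, which is why shallow-buffer data-centre switches are attractive despite the event-building pattern.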



# Other network technologies



- Loss-less network using specific hardware and software stack
  - Single-vendor
- → Data-Center Bridging (DCB)
  - aka loss-less Ethernet
  - umbrella for a zoo of IEEE standards
- Looking forward to official release of new Intel interconnect
  - Omni-Path, expected end of 2015







Ultimately network "devices" with large buffers would be the simplest solution for the DAQ problem

- → How to make these affordable?
- → Re-use other affordable elements







#### →Intel DPDK (Data-Plane Development Kit)

- fast packet-processing library → allows building PC-based network switches





#### → Main limitation of this approach is density

- limited number of ports per "switch"
- scaling to an LHC-size network requires re-thinking the topology



## Possible DAQ System









# Almost The End







- → If you are technology-oriented
- ➔ If you found these topics interesting
- → If you look forward to the challenges
- → If you like to be in the centre of the action

→ Your chance to hear much more and learn through practice ...



# ISOTDAQ



#### The sixth edition of the International School of Trigger and Data Acquisition will be held in February 2016, hosted by the Weizmann Institute



http://isotdaq.web.cern.ch/isotdaq/isotdaq/Home.html





# The End

W.Vandelli CERN/PH-ADT Wainer.Vandelli@cern.ch