

## Scaling DNN Inference for Extreme Throughput

Michaela Blott, Distinguished Engineer, Xilinx Research, Dec 2020



#### Background

#### Xilinx

- Fabless semiconductor company, founded in Silicon Valley in 1984
- Today: ~4000 employees, \$2.8B revenue
- Invented the FPGA

#### Xilinx Research - Dublin

- Established almost 15 years ago
  - ~10 researchers plus university program
  - Highly active internship program, 80+ interns over the last 10 years
- Focus: FPGAs in Machine Learning
  - Building systems, architectural exploration, algorithmic optimizations, benchmarking
  - Quantifying the value of our devices in this space
- In collaboration with partners, customers and universities





### What are FPGAs?

#### Customizable, Programmable Hardware Architectures

- The chameleon amongst the semiconductors...



- Customizes IO interfaces, compute architectures, memory subsystems to meet the application
- Use cases: nothing else works and you want to avoid an ASIC implementation, or ASIC emulation










# Challenges in Deploying DNNs in Communications



## **DNNs in Communications**

- Many emerging use cases
  - Traffic classification
  - Traffic monitoring and statistics
  - Traffic prediction
  - Network intrusion detection ★



- Physical layers
- Implementation of individual basic components
  - Hashing/ indexing
  - Sorting


- ITU has identified and classified 30 use cases
  - ITU-T Y.3170-series Supp 55



- [1] https://www.cl.cam.ac.uk/~ey204/teaching/ACS/R244\_2018\_2019/papers/Kraska\_SIGMOD\_2018.pdf
- [2] http://learningsys.org/sosp19/assets/papers/22 CameraReadySubmission Abstract SOSP 19 ML\_Sys\_workshop-4.pdf

- [3] https://aip.scitation.org/doi/full/10.1063/1.5140609
- [4] https://hal.archives-ouvertes.fr/tel-01206266/document
- [5] https://tel.archives-ouvertes.fr/tel-01876701/document
- [6] https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8054694





#### Specific Challenges for DNNs in Communication: Throughput

- Extremely high throughput requirements
- Highest reported inference performance
  - 55kfps ResNet50 (423TOP/sec) MLPerf [1]
- Even a 1MOP/inference model would require
  - 600TOP/s for 400Gbps
  - 30TOP/s for 20Gbps

Throughput requirements are extremely high

- Beyond the limit of latest AI silicon
- Limiting complexity of DNNs

|             | Throughput |
|-------------|------------|
| 5G (20Gbps) | 30MRps     |
| 100Gbps     | 150MRps    |
| 400Gbps     | 600MRps    |

MRps: million requests per second, assuming 64B / packet
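The request rates in the table can be reproduced with a short calculation. The sketch below assumes minimum-size 64B Ethernet frames plus 20B of per-frame wire overhead (preamble and inter-frame gap). That overhead is an assumption, chosen because it lands close to the figures on the slide, and the function names are illustrative.

```python
FRAME_BYTES = 64          # packet size stated on the slide
WIRE_OVERHEAD_BYTES = 20  # assumed: 8B preamble/SFD + 12B inter-frame gap


def requests_per_second(line_rate_bps, frame_bytes=FRAME_BYTES,
                        overhead_bytes=WIRE_OVERHEAD_BYTES):
    """Packets (= inference requests) per second at a given line rate."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return line_rate_bps / bits_on_wire


def required_ops_per_second(line_rate_bps, ops_per_inference):
    """Compute needed when every packet triggers one inference."""
    return requests_per_second(line_rate_bps) * ops_per_inference


for name, rate in [("5G (20Gbps)", 20e9), ("100Gbps", 100e9), ("400Gbps", 400e9)]:
    mrps = requests_per_second(rate) / 1e6
    tops = required_ops_per_second(rate, 1e6) / 1e12  # 1 MOP per inference
    print(f"{name}: {mrps:.0f} MRps, {tops:.0f} TOP/s")
```

Under these assumptions the results come out within about 1% of the rounded values on the slide (30, 150 and 600 MRps).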

Datasheet performance of state-of-the-art AI accelerators:

|                      | Performance  |
|----------------------|--------------|
| Ascend 910           | 256TOP/s *   |
| Colossus (Graphcore) | 250TOP/s *   |
| A100 (Nvidia)        | 312TOP/s *   |
|                      | 1248TOP/s ** |

\* BF/FP16, \*\* INT8



### Specific Challenges for DNNs in Communication: Latency

- Ultra-low latency requirements in any form of cognition cycle:
  - Translates to buffering requirements



|             | Buffer<br>[10ns] | Buffer<br>[1us] | Buffer<br>[1msec] |
|-------------|------------------|-----------------|-------------------|
| 5G (20Gbps) | 0.2Kb            | 20Kb            | 20Mb              |
| 100Gbps     | 1Kb              | 100Kb           | 100Mb             |
| 400Gbps     | 4Kb              | 400Kb           | 400Mb             |

- Typical latency in MLPerf
  - Closed data center, single stream latency: 2-7msec [1]
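The buffer sizes in the table above follow from multiplying the line rate by the latency of the cognition cycle, since data keeps arriving while a decision is pending. A minimal sketch, with an illustrative helper name:

```python
def buffer_bits(line_rate_bps, latency_s):
    """Bits that arrive on the link while one decision is still pending."""
    return line_rate_bps * latency_s


rates = {"5G (20Gbps)": 20e9, "100Gbps": 100e9, "400Gbps": 400e9}
latencies = {"10ns": 10e-9, "1us": 1e-6, "1msec": 1e-3}

for name, rate in rates.items():
    row = ", ".join(f"{label}: {buffer_bits(rate, t) / 1e3:g}Kb"
                    for label, t in latencies.items())
    print(f"{name} -> {row}")

# Same formula applied to the lower end of the MLPerf latency range
print(buffer_bits(100e9, 2e-3) / 1e6, "Mb at 2msec on 100Gbps")
```

By the same formula, the 2-7msec single-stream latencies quoted for MLPerf would translate to roughly 200-700Mb of buffering on a 100Gbps link.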



#### **Challenges in the Semiconductor Landscape**

- Manufacturing difficulties of shrinking transistor sizes beyond 5nm
  - FinFET doesn't scale to 3nm
- Design costs are exploding



Source: IBS

Limited performance & power benefits with smaller technology nodes

Hitting the physical limits of silicon-based computing

Moving away from standard von Neumann architectures

Architectural innovation becomes paramount





# Innovation is needed to provide the necessary performance scalability



#### **Innovative Approaches – Going Wide**

#### Cerebras: Wafer-Scale Computing

- Targeting ML training





Source: HotChips2019

#### Innovative Approaches – Going High with 3D Die Stacking





#### **Innovative Approaches – Quantum Computing**

- D-Wave: Quantum Computing
  - For HPC and ML applications





#### **Innovative Approaches – Analog Neuromorphic Computing**







# Performance Scalability through Specialization



#### **Specialization for Performance Scalability**

5 Series → 7 Series → U+ Series → Versal



More hardened specialized functionality to improve compute density and save power



#### **Innovative Approaches**



#### DPU Compute Architecture Specialization, Performance & Flexibility



### Matrix of Processing Engines Customizing for DNN in General

- Popular layer-by-layer compute
- Batching to achieve high compute efficiency
- Specialized processing engines
  - Operators
  - ALU types
    - tensor-, matrix- or vector-based
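Batching raises compute efficiency, but early requests have to wait until the batch is full. The toy model below assumes requests arrive at a fixed rate and that processing starts only once a batch is complete. The function names and all numbers are illustrative, not measurements.

```python
def batch_fill_wait_s(batch_size, arrival_rate_rps):
    """Time the first request of a batch waits for the last one to arrive."""
    return (batch_size - 1) / arrival_rate_rps


def worst_case_latency_s(batch_size, arrival_rate_rps, batch_compute_s):
    """Fill wait plus the time to process the whole batch."""
    return batch_fill_wait_s(batch_size, arrival_rate_rps) + batch_compute_s


# Illustrative numbers only: 1000 requests/s, and a simplifying assumption
# that any batch takes 2 ms to process regardless of its size.
for b in (1, 8, 64):
    ms = worst_case_latency_s(b, 1_000, 2e-3) * 1e3
    print(f"batch={b}: {ms:.0f} ms")
```

Even in this simplified model, the fill wait grows linearly with batch size and quickly dominates the compute time.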





#### **MPE: Specialization of Processing Engines**



### **MPE: Latency Implications of Batching**



Embedded measurement of system-level latency, FP16 https://rcl-lab.github.io/QutibenchWeb/




#### Spatial Processors: Customizing for Specific Topologies

- Hardware architecture mimics the topology
- Customize everything to the specifics of the DNN
- Benefits:
  - Improved efficiency
  - Low fixed latency
  - Higher throughput
- FPGAs rather than ASICs






#### **Spatial Architectures:** Scaling to Meet Performance & Resource Requirements



1. Scale performance & resources to meet the application requirements

2. If resources allow, we can completely unfold the network to create a circuit that performs one inference per clock cycle (communications!)
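The two scaling steps can be illustrated with a simple cycle-count model. The sketch assumes a fully connected layer computed by `pe` neurons in parallel, each consuming `simd` inputs per cycle. These folding parameters, the function names and the clock value are assumptions for illustration and do not come from the slides.

```python
import math


def layer_cycles(neurons, inputs, pe, simd):
    """Cycles one layer needs per inference at the given parallelism."""
    return math.ceil(neurons / pe) * math.ceil(inputs / simd)


def throughput_ips(layers, clock_hz):
    """Pipelined layers: the slowest layer sets the inference rate."""
    slowest = max(layer_cycles(*layer) for layer in layers)
    return clock_hz / slowest


CLOCK_HZ = 200e6  # illustrative clock

# (neurons, inputs, pe, simd) per layer of a hypothetical 3-layer MLP
folded = [(64, 64, 8, 8)] * 3
unfolded = [(64, 64, 64, 64)] * 3  # every multiply has its own hardware

print(throughput_ips(folded, CLOCK_HZ))    # 64 cycles per inference
print(throughput_ips(unfolded, CLOCK_HZ))  # 1 cycle per inference
```

Raising `pe` and `simd` trades fabric resources for throughput. At full unfolding the inference rate equals the clock frequency.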




# **Customizing Arithmetic**



## **Customizing Arithmetic to Minimum Precision Required**

- Shrinks hardware cost & scales performance
  - Instantiate ~100x more compute within the same fabric, thereby scaling performance ~100x
- Reduces memory footprint
  - NN model can stay on-chip => no memory bottlenecks



#### C = size of accumulator × size of weight × size of activation
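The cost relation above can be turned into a quick comparison between precisions. This is a rough sketch: the accumulator widths are assumed values, and the result is a relative measure, not a resource estimate for any particular device.

```python
def arithmetic_cost(acc_bits, weight_bits, act_bits):
    """C = size of accumulator * size of weight * size of activation."""
    return acc_bits * weight_bits * act_bits


def weight_footprint_bits(num_weights, weight_bits):
    """Storage needed for the weights of a model."""
    return num_weights * weight_bits


# Accumulator widths below are assumptions for illustration.
c_int8 = arithmetic_cost(acc_bits=32, weight_bits=8, act_bits=8)
c_2bit = arithmetic_cost(acc_bits=16, weight_bits=2, act_bits=2)
print(f"relative cost 8b vs 2b: {c_int8 / c_2bit:.0f}x")

# Footprint ratio for one million weights stored at 32b vs 2b
ratio = weight_footprint_bits(1_000_000, 32) // weight_footprint_bits(1_000_000, 2)
print(f"footprint 32b vs 2b: {ratio}x")
```

Because the cost is a product of three widths, shrinking all of them together reduces it much faster than linearly, while the memory footprint shrinks in proportion to the weight width alone.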



#### **Granularity of Customizing Arithmetic**











## **Extreme Specialization**






#### LogicNets with FPGAs







#### **LogicNets Results**

#### Jet Tagging (CERN LHC)






# Scaling to Extreme Throughput in Network Intrusion Detection



#### **Deep Network Intrusion Detection System**





#### **Results**

| DNN                         | Unrolled SP                               | LogicNet SP             |
|-----------------------------|-------------------------------------------|-------------------------|
| Topology                    | MLP                                       | Circuit is the topology |
| #layers                     | 3                                         | 4                       |
| Neurons / layer             | 64                                        | 10s - 100s              |
| #bits / weight & activation | 2b                                        | 2b                      |
| #bits / inputs & output     | binary                                    | binary                  |
| Inputs / neuron             | 64                                        | 7                       |
| Accuracy                    | 91.9%                                     | 91.3%                   |
| Optimization                | spatially unrolled, customized arithmetic | Learned circuit         |
| Throughput                  | Expected* 208MRps                         | 471MRps                 |
| Latency                     | 1.2usec                                   | 9nsec                   |
| Clock                       | 208MHz                                    | 471MHz                  |

Dataset: UNSW-NB15 Network Intrusion Detection

- Unrolled SP: 100Gbps throughput requirements are met, extreme low latency, low clock
- LogicNet SP, if we can change the topology (sparsity to suit the fabric): 400Gbps throughput requirements are met, high clock rate

Spatial processing, customized arithmetic and learned circuits can help scale to communication throughput and latency requirements
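Two figures in the table can be cross-checked with simple arithmetic: how many clock periods each latency corresponds to, and how many input patterns a single LogicNet neuron has to cover. The second helper assumes that a neuron is realized as a lookup over all combinations of its quantized inputs. Both helper names are illustrative.

```python
def latency_in_cycles(latency_s, clock_hz):
    """Express a latency as a number of clock periods."""
    return latency_s * clock_hz


def truth_table_entries(inputs_per_neuron, bits_per_input):
    """Distinct input patterns a neuron implemented as a lookup must cover."""
    return 2 ** (inputs_per_neuron * bits_per_input)


print(latency_in_cycles(1.2e-6, 208e6))  # Unrolled SP
print(latency_in_cycles(9e-9, 471e6))    # LogicNet SP
print(truth_table_entries(7, 2))         # 7 inputs at 2b each
print(truth_table_entries(64, 2))        # 64 inputs at 2b each
```

The unrolled design comes out at roughly 250 clock periods of latency and the LogicNet at roughly 4, while in both columns the throughput in MRps equals the clock in MHz. The truth-table size also shows why the fan-in matters: 7 inputs of 2b give 2^14 patterns, whereas 64 inputs of 2b would give 2^128.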



# Challenge

How can we enable a broader spectrum of end-users to be able to specialize hardware architectures and co-design solutions?





# Providing tools and platforms for exploration of DNN compute architectures

#### End-to-end flow

- ML engineers can create specialized hardware architectures on an FPGA
  - with spatial architectures and custom precision

#### Open source

- Transparency and flexibility for the fast changing landscape of algorithms
  - if not supported, you can add your own

## From DNN to FPGA Deployment





**Brevitas** (training in PyTorch, algorithmic optimizations) → **FINN compiler** (specializations of hardware architecture) → **Deployment with PYNQ**

- Train or even learn reduced precision DNNs
- Library of standard layers
- Pretrained examples
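To make "reduced precision" concrete, the sketch below maps a handful of weights onto a signed 2-bit grid in plain Python. It is not the Brevitas API: the symmetric scaling scheme, the function name and the example weights are all assumptions chosen for illustration.

```python
def quantize_symmetric(values, bit_width):
    """Map floats to signed integers of the given width plus a scale."""
    q_max = 2 ** (bit_width - 1) - 1   # e.g. 1 for 2 bits
    q_min = -(2 ** (bit_width - 1))    # e.g. -2 for 2 bits
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / q_max
    ints = [min(q_max, max(q_min, round(v / scale))) for v in values]
    return ints, scale


weights = [0.8, -0.35, 0.1, -0.9]  # hypothetical trained weights
ints, scale = quantize_symmetric(weights, bit_width=2)
dequantized = [i * scale for i in ints]
print(ints, dequantized)
```

Quantization-aware training goes further than this post-hoc rounding by letting the network adapt its weights to the coarse grid during training.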

#### ONNX Intermediate Representation

- Perform optimizations
- Map to Vivado HLS
- Create DNN hardware IP
- Embeds the DNN IP into an infrastructure design
- Generates Python run-time (based on PYNQ)
- Enables integration with your application
- Works on embedded and Alveo platforms



#### Infrastructure for Experimentation

- Xilinx academic compute clusters
  - 4 centres world-wide
  - Free to use
  - Enabling research community
- Not only for FINN






### **FINN Status**

- Ongoing development
  - Support for residual topologies, depthwise convolutions
  - LogicNet
  - Multinode deployment on XACC
- Looking to build up a community
  - Many student, hobbyist, and school projects
  - University classes with FINN @ Stanford, Charlotte, NTNU
    - Online material in preparation
  - Industrial applications
- Looking to create a differentiating application portfolio
  - Extreme throughput (100M+ fps), ultra-low latency

If you're interested, we'd love to hear from you.













# Summary



## **Summary – Future Work**

- Specialization in hardware architectures is key to scaling performance to meet requirements of DNNs in communications
- More flexibility brings more opportunity for customization
  - FPGAs allow specialization to the specifics of individual use cases without losing generality
- SPs with customized arithmetic and LogicNets are shown to meet 100Gbps and 400Gbps requirements in NIDS (as well as high energy physics)
- Tools such as FINN are needed to overcome complexity in the design entry and make technology accessible

Please get in touch if you're interested in collaborating.


# **Thank You**

More information can be found at: https://xilinx.github.io/finn



