# Xilinx Solutions for Radio Telescope Arrays

Name: Michael Reznik

Date: 15 February 2019



© Copyright 2019 Xilinx

### What does Xilinx offer?



- RFSoC and super sampling blocks
- FPGAs (with on-chip HBM available)
- Alveo acceleration cards
- SDAccel: software development environment for Alveo

Design considerations for writing high-level synthesis code for an FX correlator

Figures of merit for next-gen ACAP

Source: The Square Kilometer Array



## **RFSoC Block Diagram**

| Domain                        | Blocks shown in the diagram                                                                                          |
|-------------------------------|----------------------------------------------------------------------------------------------------------------------|
| Processing System             | Quad-core ARM® Cortex™-A53<br>Dual-core ARM® Cortex™-R5<br>Memory sub-system (DDR4)<br>Platform & power management<br>System functions<br>Security |
| Processing System peripherals | DisplayPort, USB 3.0, SATA, PCIe® Gen2, GigE, CAN, SPI, SD/eMMC, NAND                                                |
| Programmable Logic            | Logic fabric & DSP (differentiation & acceleration)<br>33Gb/s                                                        |
| RF data converters            | ADC x8: 4Gsps, 12-bit<br>DAC x8: 6.5Gsps, 14-bit                                                                     |
| IP                            | Broad IP portfolio:<br>• Radio & Remote-PHY IP<br>• Digital Pre-Distortion<br>• Full Duplex IP<br>Modulation<br>FEC |

## HBM: Terabit/s memory bandwidth by the numbers

### > HBM memory organization

- >> 64 DQ (bidirectional data) signals per channel, each running at 1800Mb/s
- >> 16 Channels per HBM stack
- >> Up to 2 HBM stacks per FPGA
- >> Up to 3.68Tbps bandwidth HBM
  - 64\*1800\*16\*2 = 3.686Tb/s
  - 64\*1800\*16\*2/8 = 460GB/s

### Xilinx used 4-high HBM 3D stacked memory

- >> Up to 64Gb of memory per FPGA
  - 4H\*8Gb\*2HBM stacks=64Gb
  - 4H\*8Gb\*2HBM stacks/8=8GB
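The bandwidth and capacity arithmetic above can be reproduced with a short Python sketch. The constant and function names are illustrative, and reading the factor 1800 as a per-signal rate in Mb/s is an assumption (it is the reading that yields the 3.686Tb/s result quoted on the slide).

```python
# Figures from the slide; names are illustrative, not from any Xilinx tool.
DQ_PER_CHANNEL = 64        # bidirectional data signals per channel
RATE_MBPS_PER_DQ = 1800    # assumed to be Mb/s per DQ signal (matches the slide's result)
CHANNELS_PER_STACK = 16
STACKS_PER_FPGA = 2
DIES_PER_STACK = 4         # "4-high" stack
GBIT_PER_DIE = 8


def hbm_bandwidth_mbps():
    """Aggregate HBM bandwidth in Mb/s: 64 * 1800 * 16 * 2."""
    return DQ_PER_CHANNEL * RATE_MBPS_PER_DQ * CHANNELS_PER_STACK * STACKS_PER_FPGA


def hbm_capacity_gbit():
    """Total HBM capacity in Gb: 4 dies * 8 Gb * 2 stacks."""
    return DIES_PER_STACK * GBIT_PER_DIE * STACKS_PER_FPGA


bw = hbm_bandwidth_mbps()
cap = hbm_capacity_gbit()
print(f"bandwidth: {bw / 1e6:.3f} Tb/s = {bw / 8 / 1e3:.1f} GB/s")
print(f"capacity:  {cap} Gb = {cap // 8} GB")
```

Dividing by 8 converts bits to bytes in both cases, which is where the 460GB/s and 8GB headline figures come from.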



## **HBM Architecture + Xilinx innovation**

- Standard HBM architecture
  - 16 pseudo channels per HBM stack, each accessing a discrete 2Gb section of memory
  - 8 memory controllers per HBM stack
  - 16 AXI RX/TX ports per HBM stack, each 512 bits wide
  - Each AXI port can address a corresponding 2Gb section of memory
- Xilinx innovations
  - Added flexible addressing that creates a unified memory map, so any port can access any memory address
  - Extend AXI ports into fabric to ease timing
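To make the idea of a unified memory map concrete, here is a minimal Python sketch that decodes a flat byte address into a stack, a pseudo channel, and an offset within that pseudo channel's 2Gb section. The linear, contiguous layout and the `decode` function are assumptions made purely for illustration; the slide does not describe how the hardware actually arranges addresses.

```python
# Hypothetical linear layout: sections are placed back to back in address order.
SECTION_BYTES = 2 * 2**30 // 8          # one 2 Gb section, in bytes (256 MiB)
PSEUDO_CHANNELS_PER_STACK = 16
STACKS = 2
TOTAL_BYTES = SECTION_BYTES * PSEUDO_CHANNELS_PER_STACK * STACKS


def decode(addr):
    """Map a flat byte address to (stack, pseudo_channel, offset)."""
    if not 0 <= addr < TOTAL_BYTES:
        raise ValueError(f"address {addr:#x} is outside the HBM range")
    section, offset = divmod(addr, SECTION_BYTES)
    stack, pseudo_channel = divmod(section, PSEUDO_CHANNELS_PER_STACK)
    return stack, pseudo_channel, offset


print(decode(0))
print(decode(SECTION_BYTES))
print(decode(16 * SECTION_BYTES + 5))
```

Without flexible addressing, a port would only reach the one section that matches its own pseudo channel; with it, any port can issue any address in the full range and the request is routed to the section that `decode` identifies.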



## FPGA with On-chip High-Bandwidth Memory (HBM)

- > 8 GB of HBM
- > Up to 460GB/s of memory bandwidth between HBM and programmable logic fabric






## **Alveo Accelerator Cards**



### Alveo U200

- 18.6 Peak INT8 TOPs
- 77GB/s DDR Memory Bandwidth
- 31TB/s Internal SRAM Bandwidth
- 892,000 LUTs



### Alveo U250

- 33.3 Peak INT8 TOPs
- 77GB/s DDR Memory Bandwidth
- 38TB/s Internal SRAM Bandwidth
- 1,341,000 LUTs



### Alveo U280

- 24.5 Peak INT8 TOPs
- 460GB/s HBM2 Memory Bandwidth
- 30TB/s Internal SRAM Bandwidth
- 1,079,000 LUTs

Common to the Alveo cards above:

- > PCIe interface: Gen3x16 (U200 & U250), Gen4x8 w/ CCIX (U280)
- > Network connectivity: 2x QSFP28
- > Power: 100W (typ)
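One way to read the card figures side by side is to divide peak INT8 throughput by off-chip memory bandwidth, giving the number of operations a kernel must perform per byte fetched to stay compute-bound. This ratio is not on the slide; it is a roofline-style figure of merit added here as an illustration, and the names in the sketch are hypothetical.

```python
# (peak INT8 TOPs, off-chip memory bandwidth in GB/s), as listed above.
CARDS = {
    "U200": (18.6, 77),    # DDR
    "U250": (33.3, 77),    # DDR
    "U280": (24.5, 460),   # HBM2
}


def ops_per_byte(peak_tops, mem_gb_per_s):
    """Peak operations available per byte of off-chip memory traffic."""
    return (peak_tops * 1e12) / (mem_gb_per_s * 1e9)


for name, (tops, bandwidth) in CARDS.items():
    print(f"{name}: {ops_per_byte(tops, bandwidth):.0f} ops/byte")
```

A lower ratio means the card is easier to keep busy on data-heavy workloads, which is why the HBM-equipped U280 stands out for bandwidth-bound processing even though its peak TOPs figure is not the highest of the three.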

## **GPU vs. Alveo Competitive Overview**



Leverage Alveo's HW adaptability to deliver highest application performance & efficiency


## Key Adaptable Advantage vs. GPGPU - Memory Hierarchy



*Figure (not reproduced): Xilinx memory hierarchy. Labels in the figure: Kernel A, Kernel B, Kernel C; LUTRAM, BRAM, UltraRAM; Global Mem (if needed); Video; 6X.*



## **SDAccel, Runtime and Platform**






### > Develop, profile and deploy OpenCL applications

- >> OpenCL uses standard APIs (code is portable)
- > F1 platform aware
- > Flexible kernel development
  - >> C / C++ / OpenCL / RTL



Programming Steps



Comprehensive debug and profiling environment

## **Xilinx SDAccel with RTL Kernels**

#### > RTL import through kernel wizard in SDAccel

- >> Top level needs to match interface requirements
- > Leveraging existing RTL IP
  - >> RTL is resource efficient and high performance

#### > Hardware emulation mode for simulation







# Versal ACAP Technology Tour



Scalar Processing Engines



Adaptable Hardware Engines



Intelligent Engines SW Programmable, HW Adaptable



Breakout Integration of Advanced Protocol Engines



## **7nm Adaptive Compute Acceleration Platform (ACAP)**

### > AI engine peak performance

- >> int8: 133 TOPS
- >> int16: 33 TOPS
- >> fp32: 8 TFLOPS

### > DSP engine peak performance

- >> int8: 13.6 TOPS
- >> int24: 4.5 TOPS
- >> fp32: 3.2 TFLOPS

### > Memory bandwidth

- >> Block RAM: 118 Tb/s
- >> Ultra RAM: 49 Tb/s
- >> DDR4: 816 Gb/s
- >> LPDDR4: 1.096 Tb/s
- >> Network-on-Chip: 2.5 Tb/s
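The memory figures above are quoted in bits per second, while the HBM and Alveo figures earlier in the deck are in bytes per second. A small Python sketch of the conversion makes the two sets comparable; the dictionary and function names are illustrative only.

```python
# Bandwidths from the list above, expressed in Gb/s (1 Tb/s = 1000 Gb/s).
VERSAL_BANDWIDTH_GBIT = {
    "Block RAM": 118_000,
    "Ultra RAM": 49_000,
    "DDR4": 816,
    "LPDDR4": 1_096,
    "Network-on-Chip": 2_500,
}


def gbit_to_gbyte(gbit_per_s):
    """Convert Gb/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8


for name, gbit in VERSAL_BANDWIDTH_GBIT.items():
    print(f"{name}: {gbit_to_gbyte(gbit):g} GB/s")
```

Expressed this way, the DDR4 figure works out to 102GB/s and the LPDDR4 figure to 137GB/s.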





## Network-on-Chip (NoC)

#### Ease of Use

- Inherently software programmable
- Available at boot, no place-and-route required

#### High Bandwidth and Low Latency

- Multi-terabit/sec throughput
- Guaranteed QoS

#### Power Efficiency

- 8X power efficiency vs. soft implementations
- Arbitration across heterogeneous engines












## **Adaptable Architecture Connected Via NoC**

#### > Scalar Engines

- >> Arm® Cortex™-A72 APU
- >> Arm Cortex-R5 RPU

#### > Adaptable Engines

- >> CLBs
- >> Internal Memory

#### > Intelligent Engines

- >> AI Engine
- >> DSP Engine

#### > Connectivity

- >> PCIe w/CCIX
- >> Ethernet
- >> DDR Memory Controllers
- >> Transceivers
- >> I/O

#### > Platform Resources

- >> Network-on-Chip
- >> Platform Management Controller




## **Versal™ Network-on-Chip**

#### > High bandwidth terabit network-on-chip

- >> Memory mapped access to all resources
- >> Built-in arbitration between engines and memory
- >> AXI4 based structure spanning full device (height and width)

#### > High bandwidth, low latency, low power

- >> Guaranteed QoS
- >> 8X power efficiency vs. FPGA implementations
- >> Support AXI4 MM and AXI4 Stream

#### > Adaptable kernel placement

- >> Every PL region has master and slave interface
- >> Easily swap kernels at NoC port boundaries
- >> Simplifies connectivity between kernels



## **Digital Signal Processing Capability**



AI Engine 2D Array

## **AI Engine Architecture**

### > AI Engine tile

- >> AI Engine, data memory, and interconnect

### > 1+ GHz VLIW/SIMD AI Engine

- > 32-bit Scalar RISC processor with fixed and floating point vector units
- > Each AI Engine can access 4 Memory Modules (N,E,S,W) as one contiguous memory
- > AXI-MM Switch for configuration, control and debugging functionality
- > AXI-Stream crossbar switch for routing N/E/S/W streams



# Versal Development Experience





## **System Design Methodology**

| Paper Algorithm | Traffic Analysis | Data Flow Modelling (IPI) | Data Flow Simulation | Power Estimation and Analysis | System Simulation | Synthesis & Implementation |
|-----------------|------------------|---------------------------|----------------------|-------------------------------|-------------------|----------------------------|
| Develop a paper mapping of algorithm/application to Versal | Capture traffic flow for NoC<br>Static analysis<br>SystemC simulation | Connect: traffic generators, memories, performance monitors<br>Configure: traffic generators, NoC connectivity, QoS requirements<br>Elaborate: design and export netlist | SystemVerilog simulation<br>Gather and analyze statistics from performance monitors | Leverage XPE with output from IP Integrator for accurate power analysis | Full system simulation; replacing traffic generators<br>Co-simulate with PS, ME and PL | Take completed design through back-end tools<br>Timing closure<br>PDI generation |

#### Leverage these steps


## **Unified Tool Chain for Device Programming**


