CSP Mid firmly in place:
A Mid Correlator FPGA Design

Dr William Kamp
High Performance Computing Lab, AUT.
Overview

• Mid Correlator Beamformer (Mid.CBF) Architecture
• Problem Description
  • Bands, Sub-arraying, Zoom-modes
• Top level – Mid Correlator
• X-FPGA – systolic array architecture
• CMAC design
• Column design
• Time-division Multiplexing
• RFI blanking
• Time-centroid index
Mid.CBF Architecture
10,000 ft view

[Figure: Mid.CBF architecture block diagram. Recoverable labels: DISH via SaDT; 8-antenna F-Part Unit (Stage 1 F-Part, Stage 2 F-Part, Data Distribution / Processing, 8 Ant at 1/8 BW and 1/32 BW); Combined X-Part and BF-Part; 8 Ant, 1/32 BW to XBF-Slices #1 to #32; 8 Ant and 2 Ant Transient Buffer Output to XBF-Part Units; PST Beams 1&2 and PSS Beams 1-64 at 1/32 BW; VLBI B1/B2 and B5 (Beams 1-4, Imaging); outputs to PSS and PST; 4 x 10G and 100G links over MTP-MTP OM4 and LC-MTP OM3 fibre; per-fibre rates from 16 to 22 Gbps.]
Problem Description – Lots-a-Loops

- For each of up to 65536 frequency channels
  - For each of up to 16 sub-arrays
    - For each antenna in the sub-array
      - For each of two polarisations
        - Get string of complex samples (A)
        - For each antenna in the sub-array
          - For each of two polarisations
            - Get string of complex samples (B)
            - For corresponding samples in A and B
              - If neither A(i) nor B(i) is flagged as RFI: multiply the complex samples and accumulate
              - Calculate the time-centroid index
              - Count number of valid (non-RFI flagged) samples in accumulation
              - Count total number of samples
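The loop nest above can be written out directly for a single frequency channel of a single sub-array. This is a minimal behavioural sketch, not the FPGA design: the function name `correlate`, the dictionary layout, and the use of the sample index as the time-centroid counter are assumptions made for illustration. Conjugating B is the usual correlator convention and is also an assumption here, since the slide only says "multiply".

```python
from itertools import product


def correlate(samples, flags):
    """Naive correlation for one channel of one sub-array.

    samples[(ant, pol)] is a list of complex samples.
    flags[(ant, pol)] is a parallel list of booleans, True where RFI-flagged.
    Returns {(input_a, input_b): (acc, tci_sum, n_valid, n_total)}.
    """
    out = {}
    inputs = sorted(samples)
    # Outer two loops of the slide (antenna, polarisation) on both sides.
    for ia, ib in product(inputs, repeat=2):
        acc, tci_sum, n_valid, n_total = 0j, 0, 0, 0
        for i, (a, b) in enumerate(zip(samples[ia], samples[ib])):
            n_total += 1                      # count every sample
            if flags[ia][i] or flags[ib][i]:
                continue                      # blank RFI-flagged samples
            acc += a * b.conjugate()          # complex multiply-accumulate
            tci_sum += i                      # time-centroid accumulator
            n_valid += 1                      # count non-flagged samples
        out[(ia, ib)] = (acc, tci_sum, n_valid, n_total)
    return out
```

Every iteration of the two outer loops is independent of the others, which is what allows the loops to be reordered and unrolled across hardware.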
By the Numbers

- 402 Tera-CMAC/s = 2.4 Peta-ops @ (4 bit)
- 9 Giga-CMAC per second per Watt @ (4 bit)
- 2 TB/s data ingress
- 0.75 TB/s data egress
- 128 Stratix10 FPGAs:
  - 80k CMAC,
  - > 625 MHz clock
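The headline figures can be cross-checked with a few lines of arithmetic. The sketch below assumes 6 operations per CMAC (the 4 multiplies and 2 additions of a rectangular complex multiply); that count is an assumption chosen because it reproduces the 2.4 Peta-ops figure. The implied power is only meaningful if the throughput and efficiency figures refer to the same workload.

```python
TERA, PETA, GIGA = 1e12, 1e15, 1e9

cmac_rate = 402 * TERA            # CMAC per second, from the slide
ops_per_cmac = 6                  # assumption: 4 multiplies + 2 additions
total_ops = cmac_rate * ops_per_cmac

efficiency = 9 * GIGA             # CMAC per second per watt, from the slide
implied_power_w = cmac_rate / efficiency

num_fpgas = 128
cmac_rate_per_fpga = cmac_rate / num_fpgas

print(f"{total_ops / PETA:.2f} Peta-ops")
print(f"{implied_power_w / 1e3:.1f} kW implied")
print(f"{cmac_rate_per_fpga / TERA:.2f} Tera-CMAC/s per FPGA")
```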
“Embarrassingly Parallel”

- Rearrange the order of some loops
- Unroll some loops
- 16 sub-arrays, within the 197 antenna array.
  - Any antenna can be assigned to any sub-array (or all to one).
  - A Sub-array may be in one of 5 bands
    - Band 5 : 4b+j4b complex samples @ 5GHz = 5 GB/s.
    - Band 3 : 8b+j8b complex samples @ 1.4GHz = 2.8 GB/s.
    - Other bands are just lower bandwidths.
  - Must support sub-arrays in different bands concurrently.
- 64k regular channels plus 64k zoom channels
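The per-band data rates follow from the sample width and the sample rate. A small sketch of that arithmetic, where the helper name `input_rate_gbytes` is hypothetical and the rate is taken per sample stream:

```python
def input_rate_gbytes(bits_per_component, sample_rate_ghz):
    """Data rate in GB/s for complex samples with equal real/imag widths."""
    bytes_per_sample = 2 * bits_per_component / 8   # real + imaginary parts
    return bytes_per_sample * sample_rate_ghz


band5 = input_rate_gbytes(4, 5.0)   # 4b+j4b at 5 GHz
band3 = input_rate_gbytes(8, 1.4)   # 8b+j8b at 1.4 GHz
print(band5, band3)
```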
Complex Multiply and Accumulate

Two modes required
- \((4b+j4b) \times (4b+j4b)\), 2 per cycle
- \((8b+j8b) \times (8b+j8b)\), 1.5 per cycle

Switch between modes instantaneously

> 625 MHz clock rate required

Use a small area of the chip – as it will be tiled out 2550 times
Complex Multiplication

- **Rectangular Multiply (4 multiplies, 2 additions):**
  
  \[(a + jb)\cdot(c + jd) = (ac - bd) + j(ad + bc)\]

- **Gauss Multiply (3 multiplies, 5 additions):**
  
  \[k_1 = c(a + b), \quad k_2 = a(d - c), \quad k_3 = b(c + d)\]
  
  \[
  \text{Real} = k_1 - k_3, \\
  \text{Imag} = k_1 + k_2
  \]

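- **Compressed Rect Multiply (1 multiply, 3 to 5 additions):**
  
  \[X = a + (b \cdot 2^{2w}), \quad Y = c + (d \cdot 2^{2w})\]
  
  \[Z = X \cdot Y = ac + (ad + bc) \cdot 2^{2w} + bd \cdot 2^{4w}\]
  
  \[
  \text{Real} = Z[0 : 2w] - (Z[4w : 6w] + Z[4w-1]) = ac - bd \\
  \text{Imag} = Z[2w : 4w] + Z[2w-1] = ad + bc
  \]
  
  (Fields are 2w bits wide, taking a, b, c, d as w-bit signed values. The single-bit terms add back the borrow that a negative lower field takes from the field above it.)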
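The three multiplication schemes can be compared in a short behavioural model. This is a sketch under stated assumptions, not the FPGA implementation: the function names are hypothetical, the product fields are taken to be 2w bits wide (so the packing shift is 2w bits), and the inputs are restricted to the symmetric range that excludes the most negative w-bit code so that every field fits. In this sketch each borrow bit is added back to the field above it, so the correction on the bd field enters the real part with a negative sign.

```python
def rect_multiply(a, b, c, d):
    """(a + jb)(c + jd) with 4 multiplies and 2 additions."""
    return a * c - b * d, a * d + b * c


def gauss_multiply(a, b, c, d):
    """Same product with 3 multiplies and 5 additions."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2


def signed_field(z, lo, width):
    """Bits [lo, lo + width) of z as a two's-complement signed integer."""
    raw = (z >> lo) & ((1 << width) - 1)
    return raw - (1 << width) if raw >> (width - 1) else raw


def compressed_multiply(a, b, c, d, w):
    """Same product from one wide multiply of packed operands."""
    f = 2 * w                               # assumed field width
    x = a + (b << f)
    y = c + (d << f)
    z = x * y                               # ac + (ad+bc)<<f + bd<<2f
    borrow_lo = (z >> (f - 1)) & 1          # sign bit of the ac field
    borrow_mid = (z >> (2 * f - 1)) & 1     # sign bit of the middle field
    ac = signed_field(z, 0, f)
    cross = signed_field(z, f, f) + borrow_lo
    bd = signed_field(z, 2 * f, f) + borrow_mid
    return ac - bd, cross


def check_compressed(w):
    """Compare against rect_multiply over the symmetric w-bit range."""
    m = (1 << (w - 1)) - 1
    rng = range(-m, m + 1)
    return all(
        compressed_multiply(a, b, c, d, w) == rect_multiply(a, b, c, d)
        for a in rng for b in rng for c in rng for d in rng
    )
```

Calling `check_compressed(4)` compares the packed result against the rectangular multiply for every input combination in the range -7 to 7. For w = 4 the packed operands have magnitude at most 7 + 7·256 = 1799, which fits comfortably in an 18-bit signed multiplier input.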
CMAC Design

• Green: Pre-calculation and multiplicand formatting
• Blue: Multiplication (18 by 18 bit signed)
• Yellow: Accumulators
• Red: Post-calculation to convert results to real and imaginary
• Tile the CMACs out in a 2D grid – up to 50 by 50 in size.
• Share the pre-computations between CMACs using data from the same antenna
• Share the post-computation between CMACs
• Add pipelining across rows and down columns so the FPGA meets timing.

• Test compile for a Stratix10,
  • 48 by 48 array
  • runs at 640 MHz
  • 116 ALM, one DSP block per CMAC
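A behavioural view of the tiled grid: on each time step, every CMAC at position (row, column) accumulates the product of its row input and its column input. The sketch below is a software model only. It ignores the pipelining delays and fixed-point formats of the real array, the function names are hypothetical, and conjugating the column input once per column is an assumed stand-in for the shared pre-computation.

```python
def make_grid(n_rows, n_cols):
    """One complex accumulator per CMAC."""
    return [[0j] * n_cols for _ in range(n_rows)]


def clock_grid(grid, row_samples, col_samples):
    """One time step: CMAC (r, c) accumulates row[r] * conj(col[c])."""
    # Pre-computation done once per column input and shared down the column.
    col_conj = [s.conjugate() for s in col_samples]
    for r, a in enumerate(row_samples):
        row = grid[r]
        for c, b in enumerate(col_conj):
            row[c] += a * b


grid = make_grid(2, 2)
clock_grid(grid, [1 + 1j, 2 + 0j], [0 + 1j, 1 + 0j])
```

With an n by n grid, each input sample is reused n times, so the work grows as n squared per time step while the input bandwidth grows only as n.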
Time Division Multiplexing (TDM)
TDM – FPGA Assignment
TDM – Output to Long Term accumulator
RFI blanking

- Radio frequency interference
  - Not yet well described in requirements
  - Both polarisations are treated identically

- Input samples marked as corrupted by RFI
  - Marked statically by Telescope Manager, or
  - Marked dynamically by the F-Part

- Exclude RFI corrupted samples from the accumulation

- Sub-accumulation exceeds a threshold
  - Exclude from the long term accumulation
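The two levels of blanking can be sketched as a short-term accumulation that skips flagged samples, followed by a long-term accumulation that drops whole sub-accumulations. The slide does not say what quantity the threshold is applied to; the sketch assumes it is the magnitude of the sub-accumulated value, and the function names and tuple layout are hypothetical.

```python
def sub_accumulate(a_samples, b_samples, a_flags, b_flags):
    """Short-term accumulation that excludes RFI-flagged samples."""
    acc, n_valid = 0j, 0
    for a, b, fa, fb in zip(a_samples, b_samples, a_flags, b_flags):
        if fa or fb:
            continue                      # sample-level blanking
        acc += a * b.conjugate()          # assumed correlator convention
        n_valid += 1
    return acc, n_valid, len(a_samples)


def long_term_accumulate(sub_results, threshold):
    """Sum sub-accumulations, dropping any whose magnitude exceeds threshold.

    The magnitude test is an assumption; the slide only says "exceeds a
    threshold".
    """
    total, n_valid, n_total, n_dropped = 0j, 0, 0, 0
    for acc, valid, count in sub_results:
        n_total += count
        if abs(acc) > threshold:
            n_dropped += 1                # whole sub-accumulation excluded
            continue
        total += acc
        n_valid += valid
    return total, n_valid, n_total, n_dropped
```

Keeping both the valid count and the total count lets a downstream stage normalise the result by the amount of data that actually contributed.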
Time-Centroid Index (TCI)

- Weighted centre in time of the accumulated samples
  - A sample removed by RFI blanking contributes zero weight at its time instant
- Calculated as the accumulation of a counter
  - Analogy: the moment of a lever carrying evenly spaced weights.
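The counter-accumulation idea can be shown in a few lines. This is a minimal sketch: the function name `time_centroid` is hypothetical, and the final division by the valid-sample count is assumed to happen downstream of the accumulator rather than in the correlator itself.

```python
def time_centroid(flags):
    """Accumulate a sample counter over the non-blanked samples.

    flags[t] is True where sample t was blanked for RFI.
    Returns (counter_sum, n_valid, centroid); centroid is None when
    every sample was blanked.
    """
    counter_sum, n_valid = 0, 0
    for t, flagged in enumerate(flags):
        if flagged:
            continue              # blanked sample carries zero weight
        counter_sum += t          # moment contribution of a unit weight at t
        n_valid += 1
    centroid = counter_sum / n_valid if n_valid else None
    return counter_sum, n_valid, centroid


print(time_centroid([False] * 8))               # no blanking
print(time_centroid([True] * 4 + [False] * 4))  # first half blanked
```

With no blanking the centroid sits at the middle of the integration; blanking the first half moves it towards the later samples, which is the shift the index is meant to record.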
Mid.CBF Team