CASPER Tutorials

Welcome to the CASPER tutorials page! Here you will find all the current tutorials for the ROACH, SNAP and SKARAB platforms.

It is recommended to start with the introduction tutorial for the platform of your liking, then do that platform’s GBE tutorial and finally move onto the spectrometer or correlator tutorial.

Currently there are four hardware platforms supported through the CASPER Community:

  1. ROACH
  2. ROACH2
  3. SKARAB
  4. SNAP

It is worth noting that even though SNAP (and SKARAB) require their firmware to be developed using Xilinx’s Vivado (as opposed to ISE), the SNAP tutorials are very similar to the ROACH/2 tutorials. In fact, the only real difference is the choice of hardware platform that is made in Simulink. This is done by selecting the SNAP Yellow Block in the Simulink library under CASPER XPS Blockset -> Hardware Platforms.

Tutorial Instructions

If you are new to astronomy signal processing, here is Tutorial 0: a basic introduction to astronomy signal processing. If you already have a lot of experience with it, you can go directly to the introduction tutorials below for CASPER FPGA design and implementation.

If you are a beginner, we recommend the Step-by-Step tutorials. However, if you get stuck, prefer a less tedious method of learning, or already have a decent feel for these tools, links to Completed tutorials are available with commented models.

Vivado

SNAP

  1. Introduction Tutorial: Step-by-Step or Completed
  2. 10GbE Tutorial: Step-by-Step or Completed
  3. Spectrometer Tutorial: Step-by-Step or Completed
  4. Correlator Tutorial: Step-by-Step or Completed
  5. Yellow Block Tutorial: Bidirectional GPIO

Tutorial 2: 10GbE Interface

Introduction

In this tutorial, you will create a simple Simulink design which uses the SNAP’s 10GbE ports to send data at high speeds to another port. This could just as easily be another FPGA board or a computer with a 10GbE network interface card. In addition, we will learn to control the design remotely, using a supplied Python library for KATCP.

In this tutorial, a counter will be transmitted through one SFP+ port and back into another. This will allow a test of the communications link. This test can be used to test the link between boards and the effect of different cable lengths on communication robustness.

Background

SNAP boards have two on-board SFP+ ports. The Ethernet interface is driven by an on-board 156.25MHz crystal oscillator. This clock is then multiplied up on the FPGA by a factor of 66. Thus, the speed on the wire is actually 156.25MHz x 66 = 10.3125 Gbps. However, 10GbE over single-lane SFP+ connectors uses 64b/66b encoding, which means that for every 64 bits sent, 66 bits are actually transmitted. This is to ensure proper clocking, since the receiver recovers and locks on to the transmitter’s clock and requires edges in the data. Imagine transmitting a string of 0xFF or 0b11111111… which would otherwise generate a DC level on the line. With the encoding, two extra bits, including a zero bit, are introduced, which the receiver can use to recover the clock and byte endings. See here for more information.
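The arithmetic behind these figures can be checked in a few lines of Python. This is only a sketch of the calculation described above; the constant names are chosen here for illustration.

```python
# Line rate and usable data rate of a 10GbE link with 64b/66b encoding.
REF_CLOCK_HZ = 156.25e6  # on-board reference oscillator
MULTIPLIER = 66          # clock multiplication performed on the FPGA

line_rate_bps = REF_CLOCK_HZ * MULTIPLIER    # bits per second on the wire
usable_rate_bps = line_rate_bps * 64 / 66    # 64 data bits per 66 bits sent

print("Line rate:   %.4f Gbps" % (line_rate_bps / 1e9))
print("Usable rate: %.4f Gbps" % (usable_rate_bps / 1e9))
```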

For this reason, we actually get 10Gbps usable data rate. CASPER’s 10GbE Simulink core sends and receives UDP over IPv4 packets. These IP packets are wrapped in Ethernet frames. Each Ethernet frame incurs 38 bytes of overhead (preamble, header, frame check sequence and inter-frame gap), IPv4 requires another 20 bytes and UDP a further 8. So, for each packet of data you send, you will incur a cost of at least 66 bytes. I say at least, because the core will zero-pad some headers to be on a 64-bit boundary. You will thus never achieve 10Gbps of usable throughput, though you can get close. It pays to send larger packets if you are trying to get higher throughputs.
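To get a feel for why larger packets pay off, here is a rough sketch that computes the fraction of each packet that is payload. The overhead is passed in as a parameter; the value used below assumes the standard Ethernet framing, IPv4 and UDP header sizes and ignores any padding added by the core, so treat the numbers as an upper bound.

```python
def payload_efficiency(payload_bytes, overhead_bytes):
    """Fraction of the bytes on the wire that carry user payload."""
    return payload_bytes / float(payload_bytes + overhead_bytes)


# Assumed per-packet cost: Ethernet framing (38) + IPv4 header (20) + UDP header (8).
# Zero-padding added by the core would increase this figure.
OVERHEAD_BYTES = 38 + 20 + 8
USABLE_RATE_GBPS = 10.0

for payload in (64, 512, 1024, 8192):
    eff = payload_efficiency(payload, OVERHEAD_BYTES)
    print("%5d-byte payload: %5.1f%% efficient, about %.2f Gbps of payload"
          % (payload, 100 * eff, USABLE_RATE_GBPS * eff))
```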

The maximum payload length of the CASPER 10GbE core is 8192 bytes (implemented in BRAM) plus another 512 (implemented in distributed RAM) which is useful for an application header. These ports (and hence part of the 10 GbE cores) run at 156.25MHz, while the interface to your design runs at the FPGA clock rate (sys_clk, adcX_clk etc). The interface is asynchronous, and buffers are required at the clock boundary. For this reason, even if you send data between two SNAP boards which are running off the same hard-wired clock, there will be jitter in the data. A second consideration is how often you clock values into the core when you try to send data. If your FPGA is running faster than the core, and you try and clock data in on every clock cycle, the buffers will eventually overflow. Likewise for receiving, if you send too much data to a board and cannot clock it out of the receive buffer fast enough, the receive buffers will overflow and you will lose data. In our design, we are clocking the FPGA at 100 MHz, with the cores running at 156.25MHz. We can thus clock data into the TX buffer continuously without worrying about overflows.

Create a new model

Start Matlab and open Simulink (either by typing ‘simulink’ on the Matlab command line, or by clicking on the Simulink icon in the taskbar). A template is provided for Tut2 with a pre-created packet generator in the tutorials_devel git repository. Get a copy of this template and save it. You will need the SNAP block in the Platforms subdirectory of the xps_library. Specify a clock frequency of 100 MHz and the clock source “sys_clock”.

Add reset logic

A very important piece of logic to consider when designing your system is how, when and what happens during reset. In this example we shall control our resets via a software register. We shall have two independent resets, one for the 10GbE cores which shall be used initially, and one to reset the user logic which may be used more often to restart the user part of the system. Construct reset circuitry as shown below.

_images/tut2_rst1.png

Add a software register

Use a software register yellow block from the CASPER XPS System Blockset for the rst block. Rename it to rst.

It used to be that every register you inserted had to be natively 32-bits, and you were responsible for slicing these 32 bits into different signals if you wanted to control multiple flags. The latest block can implicitly break the 32-bit registers out into separate named signals, so we’ll use that. The downside is that there are a bunch of settings to configure: you need to set up the names and data types of your register subfields. You can configure the register as follows:

_images/tut2_rst_mask_params.png
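On the software side, a register with subfields is still a single 32-bit word. The sketch below shows how named one-bit flags could be packed into and unpacked from such a word. The bit offsets here are hypothetical (core_rst in bit 0, cnt_rst in bit 1); the real layout is whatever you enter in the register mask.

```python
# Hypothetical field layout for the rst register: name -> (bit offset, width).
RST_FIELDS = {"core_rst": (0, 1), "cnt_rst": (1, 1)}


def pack_fields(fields, **values):
    """Combine named subfield values into a single 32-bit register word."""
    word = 0
    for name, value in values.items():
        offset, width = fields[name]
        mask = (1 << width) - 1
        if value < 0 or value > mask:
            raise ValueError("%s=%d does not fit in %d bit(s)" % (name, value, width))
        word |= value << offset
    return word & 0xFFFFFFFF


def unpack_fields(fields, word):
    """Split a 32-bit register word back into its named subfields."""
    return {name: (word >> offset) & ((1 << width) - 1)
            for name, (offset, width) in fields.items()}


word = pack_fields(RST_FIELDS, core_rst=1, cnt_rst=0)
print("register word: 0x%08x" % word)
print(unpack_fields(RST_FIELDS, word))
```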

Add Goto blocks

Add two Goto blocks from Simulink->Signal Routing. Configure them to have the tags as shown (core_rst and cnt_rst). These tags will be used by associated From (also found in Simulink->Signal Routing) blocks in other parts of the design. These help to reduce clutter in your design and are useful for control signals that are routed to many destinations. They should not be used a lot for data signals as it reduces the ease with which data flow can be seen through the system.

Add 10GbE and associated registers for data transmission

We will now add the 10GbE block to transmit a counter at a programmable rate.

Add a 10GbE block for data transmission

Add a ten_GbE yellow block from the CASPER XPS System Blockset. It will be used to transmit data and we shall add another later to receive data. Rename it gbe0. Double click on the block to configure it and set it to be associated with SFP+ port 0. If your application can guarantee that it will be able to use received data straight away (as our application can), shallow receive buffers can be used to save resources. This optimisation is not necessary in this case as we will use a small fraction of resources in the FPGA.

_images/Gbe0Blockk1.jpg

Add registers to provide the target IP address and port number

Add two yellow-block software registers to provide the destination IP address and port number for transmission with the data. Name one dest_ip and the other dest_port. The registers should be configured to receive their values from the processor. Connect them to the appropriate inputs of the gbe0 10GbE block as shown. A Slice block is required to use the lower 16 bits of data from the dest_port register. Constant blocks from Simulink->Sources with 0 values are attached to the simulation inputs of the software registers. The destination port and IP address are not important in this system as it is a loopback example. Add a From block from Simulink->Signal Routing and set its tag to core_rst; this allows the block to be reset.

_images/10ge1.jpg
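Since dest_ip is a 32-bit register, an IPv4 address has to be written to it as an integer. A minimal sketch of that conversion follows, assuming the first octet is the most significant byte. The address and port used are placeholders, not values from the tutorial.

```python
import socket
import struct


def ip_to_int(dotted):
    """Convert a dotted-quad IPv4 string to a 32-bit integer (first octet most significant)."""
    return struct.unpack(">I", socket.inet_aton(dotted))[0]


def port_to_reg(port):
    """Keep only the lower 16 bits, mirroring the Slice block on dest_port."""
    return port & 0xFFFF


dest_ip = ip_to_int("10.0.0.20")  # hypothetical destination address
dest_port = port_to_reg(10000)    # hypothetical destination port
print("dest_ip   = 0x%08x" % dest_ip)
print("dest_port = %d" % dest_port)
```

The resulting integers are the kind of values you would pass to write_int for the dest_ip and dest_port registers.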

Create a subsystem to generate a counter to transmit as data

We will now implement logic to generate a counter to transmit as data. This is already included in the Template for Tut 2. Some details are provided here for completeness.

Construct a subsystem for data generation logic

It is often useful to group related functionality and hide the details. This reduces drawing space and complexity of the logic on the screen, making it easier to understand what is happening. Simulink allows the creation of Subsystems to accomplish this.

These can be copied to places where the same functionality is required or even placed in a library for use in other projects and by other people. To create a subsystem, one can highlight the logical elements to be encapsulated, then right-click and choose Create Subsystem from the list of options. You can also simply add a Subsystem block from Simulink->Ports & Subsystems.

Subsystems inherit variables from their parent system. Simulink allows one to create a variable whose scope is only a particular subsystem. To do this, right-click on a subsystem and choose the Create Mask option. The mask created for that particular subsystem allows one to add parameters that appear when you double-click on the icon associated with the subsystem.

The mask also allows you to associate an initialisation script with a particular subsystem. This script is called every time a mask parameter is modified and the Apply button clicked. It is especially useful if the internal structure of a subsystem must change based on mask parameters. Most of the interesting blocks in the CASPER library use these initialisation scripts.

Drop a subsystem block into your design and rename it pkt_sim. Then double-click on it to add logic.

Add a counter to generate a certain amount of data

Add a Counter block from Xilinx Blockset->Basic Elements and configure it to be unsigned, free-running, 32-bits, incrementing by 1 as shown. Add a Relational block, software register and Constant block as shown. In simulation this circuit will generate a counter from 0 to 49 and then stop counting. This will allow us to generate 50 data elements before stopping.

_images/Payload_length1.png _images/CounterBlog1.jpg
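The behaviour described above can be modelled in software to check your understanding before simulating. This is a behavioural sketch only, not the exact wiring of the Counter, Relational and Constant blocks; the function name and arguments are made up for illustration.

```python
def run_counter(payload_len, n_cycles):
    """Return the counter value seen on each of n_cycles clock cycles.

    The counter increments by one per cycle and holds once it reaches
    payload_len - 1, which is the role played by the Relational block.
    """
    count = 0
    values = []
    for _ in range(n_cycles):
        values.append(count)
        if count < payload_len - 1:
            count += 1
    return values


values = run_counter(payload_len=50, n_cycles=60)
print(values[:5], "...", values[45:55])
```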

Add a counter to limit the data rate

As mentioned earlier in this tutorial, it is impossible to supply data to the 10GbE transmission block at the full clock rate. This would mean transmitting a 64-bit word at 200MHz, and the 10GbE standard only supports up to 156.25MHz data transmission. We thus want to generate data in bursts such that the transmission FIFOs do not overflow. We thus add circuitry to limit the data rate as shown below. The logic that we have added on the left generates a reset at a fixed period determined by the software register. This will trigger the generation of a new packet of data as before. In simulation this allows us to limit the data rate to 50/200 * 200MHz = 50MHz. Using these values in actual hardware would limit the data rate to (50/(8/10*156.25)) = 4Gbps.

_images/counter_jbo1.png

Finalise logic including counter to be used as data

We will now finalise the data generation logic as shown below. To save time, use the existing logic provided with the tutorial. Counter1 in the illustration generates the actual data to be transmitted and the enable register allows this data stream to the transmitting 10GbE core to be turned off and on. Logic linked to the eof output port provides an indication to the 10GbE core that the final data word for the frame is being sent. This will trigger the core to begin transmission of the frame of data using the IP address and port number specified.

_images/full_logic_jbo1.png

Receive blocks and logic

The receive logic is composed of another 10GbE yellow block with the transmission interface inputs all tied to 0. No transmission is to be done, but Simulink requires all inputs to be connected. Connecting them to 0 should ensure that during synthesis the transmission logic for this 10GbE block is removed. Double click on the block to configure it and set it to be associated with SFP+ port 1.

Buffers to capture received and transmitted data

The casperfpga Python package contains all kinds of methods to interact with your 10GbE cores. For example, grabbing packets from the TX and RX stream, or counting the number of packets sent and received are all supported, as long as you turn on the appropriate functionality in the 10GbE yellow block. The settings we’ll use are –

_images/snap_gbe_core_0_params.png

_images/snap_gbe_core_0_debug_params.png

You can see how to use these functions in the software that accompanies this tutorial.

LEDs and status registers

You can also sprinkle around other registers or LEDs to monitor status of core parameters, or give visual feedback that the design is doing something sane. Check out the reference model for some examples of potentially useful monitoring circuitry.

Compilation

Compiling this design takes approximately 20 to 30 minutes. A pre-compiled binary (.fpg file) is made available to save time.

Programming and interacting with the FPGA

A pre-written Python script, snap_tut_tge.py, is provided. This script programs the FPGA with your compiled design (.fpg file), configures the 10GbE ports and initiates data transfer. The script is run using:

 ./snap_tut_tge.py <SNAP_IP_ADDRESS>

If everything goes as expected, you should see a whole bunch of lines running across your screen as the code sets up the IP/MAC parameters of the 10GbE cores and checks their status, and that the data the cores are sending and receiving are consistent. Have a look at this code to see how one uses the more advanced (i.e. more complex than read_int and write_int) methods casperfpga makes available. Documentation for casperfpga is still a work in progress(!) but the basic idea is that when you instantiate a CasperFpga, the software intelligently builds python objects into this instance, based on what you put in your design. For example, your Ethernet cores should show up as objects CasperFpga.gbes.<simulink_block_name> (or CasperFpga.gbes[‘simulink_block_name’]) which have useful methods like “setup”, which sets the core’s IP/MAC address, or “print_10gbe_core_details” which will print out useful status information, like the current state of the core’s ARP cache. iPython and tab-complete are your friend here; there are lots of handy methods to discover. (I’m still discovering them now :) )

The control software should be(!) well-commented, to explain what’s going on behind the scene as the software interacts with your FPGA design.

Conclusion

This concludes Tutorial 2. You have learned how to utilize the 10GbE ports on a SNAP to send and receive UDP packets. You also learned how to use Python to program the FPGA and control it remotely using some of the OOP goodies available in casperfpga.

Tutorial 3: Wideband Spectrometer

Introduction

A spectrometer is something that takes a signal in the time domain and converts it to the frequency domain. In digital systems, this is generally achieved by utilising the FFT (Fast Fourier Transform) algorithm. However, with a little bit more effort, the signal to noise performance can be increased greatly by using a Polyphase Filter Bank (PFB) based approach.

When designing a spectrometer for astronomical applications, it’s important to consider the science case behind it. For example, pulsar timing searches will need a spectrometer which can dump spectra on short timescales, so the rate of change of the spectra can be observed. In contrast, a deep field HI survey will accumulate multiple spectra to increase the signal to noise ratio. It’s also important to note that “bigger isn’t always better”; the higher your spectral and time resolution are, the more data your computer (and scientist on the other end) will have to deal with. For now, let’s skip the science case and familiarize ourselves with an example spectrometer.

Setup

This tutorial comes with a completed model file, a compiled bitstream, ready for execution on SNAP, as well as a Python script to configure the SNAP and make plots. Here

Spectrometer Basics

When designing a spectrometer there are a few main parameters of note:

  • Bandwidth: The width of your frequency spectrum, in Hz. This depends on the sampling rate; for complex sampled data this is equivalent to:

_images/bandwidtheq11.png

In contrast, for real or Nyquist sampled data the bandwidth is half this:

_images/bandwidtheq21.png

as at least two samples per cycle are required to reconstruct a given waveform.

  • Frequency resolution: The frequency resolution of a spectrometer, Δf, is given by

_images/freq_eq1.png,

and is the width of each frequency bin. Correspondingly, Δf is a measure of how precisely you can measure a frequency.

  • Time resolution: Time resolution is simply the spectral dump rate of your instrument. We generally accumulate multiple spectra to average out noise; the more accumulations we do, the lower the time resolution. For looking at short timescale events, such as pulsar bursts, higher time resolution is necessary; conversely, if we want to look at a weak HI signal, a long accumulation time is required, so time resolution is less important.
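As a worked example of the three parameters above, the sketch below evaluates them for the configuration used later in this tutorial (800 MHz real sampling, 2048 channels). It assumes the usual relations: bandwidth equals the sample rate for complex sampling and half of it for real sampling, Δf is bandwidth divided by the number of channels, and a critically sampled spectrometer produces one spectrum every 1/Δf seconds. The function names are illustrative only.

```python
def bandwidth_hz(sample_rate_hz, complex_sampling=False):
    """Bandwidth: the full sample rate for complex data, half of it for real data."""
    return sample_rate_hz if complex_sampling else sample_rate_hz / 2.0


def frequency_resolution_hz(bandwidth, n_channels):
    """Width of one frequency bin."""
    return bandwidth / float(n_channels)


def dump_time_s(n_accumulations, delta_f_hz):
    """Time between spectral dumps, assuming one raw spectrum every 1/delta_f seconds."""
    return n_accumulations / delta_f_hz


bw = bandwidth_hz(800e6)                # real sampling at 800 MHz
df = frequency_resolution_hz(bw, 2048)  # 2048 frequency channels
print("Bandwidth:            %.1f MHz" % (bw / 1e6))
print("Frequency resolution: %.4f kHz" % (df / 1e3))
print("Time per raw spectrum: %.2f us" % (dump_time_s(1, df) * 1e6))
```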
Configuration and Control
Hardware Configuration

The tutorial comes with a pre-compiled fpg file, which is generated from the model you just went through. Copy this over to your SNAP fpg directory, then load it onto your SNAP. All communication and configuration will be done by the python control script called snap_tut_spec.py.

Next, you need to set up your SNAP. Switch it on, making sure that:

  • You have your clock source connected to the ADC (3rd SMA input from left). It should be generating an 800 MHz sine wave with 0 dBm power.

The snap_tut_spec.py spectrometer script

Once you’ve got that done, it’s time to run the script. First, check that the clock source is connected to clk_i of the ADC. Now, if you’re on Linux, open a terminal, browse to where the snap_tut_spec.py file is, and at the prompt type

 ./snap_tut_spec.py <SNAP IP or hostname> -b <fpg name>

replacing <SNAP IP or hostname> with the IP address of your SNAP and <fpg name> with your fpg file. You should see a spectrum like this:

_images/Spectrometer.py_4.81.png

In the plot, there should be a fixed DC offset spike; and if you’re putting in a tone, you should also see a spike at the correct input frequency. If you’d like to take a closer look, click the icon that is below your plot and third from the right, then select a section you’d like to zoom in to.

Now you’ve seen the python script running, let’s go under the hood and have a look at how the FPGA is programmed and how data is interrogated. To stop the python script running, go back to the terminal and press ctrl + c a few times.

iPython walkthrough

The snap_tut_spec.py script has quite a few lines of code, which you might find daunting at first. Fear not though, it’s all pretty easy. To whet your whistle, let’s start off by operating the spectrometer through iPython. Open up a terminal and type:

ipython

and press enter. You’ll be transported into the magical world of iPython, where we can do our scripting line by line, similar to MATLAB (you can also use jupyter if you’re familiar with that). Our first command will be to import the python packages we’re going to use:

import casperfpga,casperfpga.snapadc,time,numpy,struct,sys,logging,pylab,matplotlib

Next, we set a few variables:

snap='192.168.0.1'  # Put your SNAP IP here
katcp_port=7147     # This is default KATCP port
bitstream='snap_tut_spec.fpg'  # Path to the fpg file
sample_rate = 800.0 # Sample rate in MHz
freq_range_mhz = numpy.linspace(0., sample_rate/2, 2048)

Which we can then use to connect to the SNAP board using casperfpga:

print('Connecting to server %s on port %i... '%(snap,katcp_port))
fpga = casperfpga.CasperFpga(snap)

We now have an fpga object to play around with. To check if you managed to connect to your SNAP, type:

fpga.is_connected()

Let’s set the bitstream running using the upload_to_ram_and_program() command:

fpga.upload_to_ram_and_program(bitstream) 

Next, we need to initialize the ADC. Note in the future this will be done automatically, but for now, we need to:

# Create an ADC object
adc = casperfpga.snapadc.SNAPADC(fpga, ref=10) # reference at 10MHz
# We want a sample rate of 800 MHz, with 1 channel per ADC chip, using 8-bit ADCs
adc.init(samplingRate=sample_rate, numChannel=1, resolution=8)
# Since we're in 4-way interleaving mode (i.e., one input per ADC chip) we should configure
# the ADC inputs accordingly
adc.selectADC(0) # send commands to the first ADC chip
adc.adc.selectInput([1,1,1,1]) # Interleave four ADCs all pointing to the first input

Now we need to configure the accumulation length by writing values to the acc_len register. For two seconds of integration, the accumulation length is: 2 [seconds] * 800e6 [ADC sample rate] / 4096 [samples per spectrum], or about 390,000 spectra. Here we use the nearby power-of-two value 2*(2^30)/4096, which gives a slightly longer integration of about 2.7 seconds:

fpga.write_int('acc_len',2*(2**30)//4096)
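If you want a different integration time, the conversion between seconds and accumulation length is easy to script. The sketch below assumes 4096 samples per spectrum and an 800 MHz sample rate, as in the text; the helper names are invented for this example.

```python
SAMPLE_RATE_HZ = 800e6       # ADC sample rate
SAMPLES_PER_SPECTRUM = 4096  # time samples consumed by each spectrum


def acc_len_for(integration_s):
    """Number of spectra to accumulate for a requested integration time."""
    return int(round(integration_s * SAMPLE_RATE_HZ / SAMPLES_PER_SPECTRUM))


def integration_time_s(acc_len):
    """Integration time that results from a given accumulation length."""
    return acc_len * SAMPLES_PER_SPECTRUM / SAMPLE_RATE_HZ


exact = acc_len_for(2.0)
power_of_two = 2 * (2 ** 30) // 4096
print("acc_len for exactly 2 s: %d" % exact)
print("power-of-two acc_len:    %d (%.2f s)" % (power_of_two, integration_time_s(power_of_two)))
```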

Finally, we reset the counters:

fpga.write_int('cnt_rst',1)
fpga.write_int('cnt_rst',0)

To read out the integration number, we use fpga.read_uint():

acc_n = fpga.read_uint('acc_cnt')

Do this a few times, waiting a few seconds in between. You should be able to see this slowly rising. Now we’re ready to plot a spectrum. We want to grab the even and odd registers of our PFB:

a_0=struct.unpack('>1024l',fpga.read('even',1024*4,0))
a_1=struct.unpack('>1024l',fpga.read('odd',1024*4,0))

These need to be interleaved, so we can plot the spectrum. We can use a for loop to do this:

interleave_a=[]

for i in range(1024):
	interleave_a.append(a_0[i])
	interleave_a.append(a_1[i])

This gives us a 2048 channel spectrum. Finally, we can plot the spectrum using pylab:

pylab.figure(num=1,figsize=(10,10))
pylab.plot(interleave_a)
pylab.title('Integration number %i.'%acc_n)
pylab.ylabel('Power (arbitrary units)')
pylab.grid()
pylab.xlabel('Channel')
pylab.xlim(0,2048)
pylab.show()
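The unpacking and interleaving steps above can be rehearsed without any hardware. In the sketch below, the byte strings that fpga.read() would return are replaced by synthetic data built with struct.pack, and the for loop is replaced by an equivalent zip-based interleave. The synthetic values are chosen so that a correct interleave yields 0, 1, 2, and so on.

```python
import struct

N = 1024                 # 32-bit words per BRAM, as in the reads above
FMT = ">%dl" % N         # big-endian signed 32-bit integers

# Stand-ins for fpga.read('even', N*4, 0) and fpga.read('odd', N*4, 0).
even_raw = struct.pack(FMT, *range(0, 2 * N, 2))
odd_raw = struct.pack(FMT, *range(1, 2 * N, 2))

a_0 = struct.unpack(FMT, even_raw)
a_1 = struct.unpack(FMT, odd_raw)

# Interleave: even[0], odd[0], even[1], odd[1], ...
interleave_a = [value for pair in zip(a_0, a_1) for value in pair]

print(len(interleave_a), interleave_a[:6])
```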

Voila! You have successfully controlled the SNAP spectrometer using python, and plotted a spectrum. Bravo! You should now have enough of an idea of what’s going on to tackle the python script. Type exit() to quit ipython.

snap_tut_spec.py notes

Now you’re ready to have a closer look at the snap_tut_spec.py script. Open it with your favorite editor. Again, line by line is the only way to fully understand it, but to give you a head start, here are a few notes:

Connecting to the SNAP

To make a connection to the SNAP, we need to know what port to connect to, and the IP address or hostname of our SNAP.

Starting from line 47, you’ll see the following code:

    p = OptionParser()
    p.set_usage('spectrometer.py <SNAP_HOSTNAME_or_IP> [options]')
    p.set_description(__doc__)
    p.add_option('-l', '--acc_len', dest='acc_len', type='int',default=2*(2**28)//2048,
        help='Set the number of vectors to accumulate between dumps. default is 2*(2^28)/2048, or just under 2 seconds.')
    p.add_option('-s', '--skip', dest='skip', action='store_true',
        help='Skip reprogramming the FPGA and configuring EQ.')
    p.add_option('-b', '--fpg', dest='fpgfile',type='str', default='',
        help='Specify the fpg file to load')
    opts, args = p.parse_args(sys.argv[1:])

    if args==[]:
        print('Please specify a SNAP board. Run with the -h flag to see all options.\nExiting.')
        exit()
    else:
        snap = args[0]
    if opts.fpgfile != '':
        bitstream = opts.fpgfile

What this code does is set up some default parameters which we can pass to the script from the command line. If the flags aren’t present, it will default to the values set here.
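For comparison, the same command-line interface can be expressed with argparse instead of OptionParser. This is an alternative sketch, not what the tutorial script uses; the option names and defaults are copied from the code above, and the argument list passed to parse_args is an example only.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options of the spectrometer script."""
    p = argparse.ArgumentParser(description="SNAP spectrometer control (sketch)")
    p.add_argument("snap", help="SNAP hostname or IP address")
    p.add_argument("-l", "--acc_len", dest="acc_len", type=int,
                   default=2 * (2 ** 28) // 2048,
                   help="Number of vectors to accumulate between dumps")
    p.add_argument("-s", "--skip", dest="skip", action="store_true",
                   help="Skip reprogramming the FPGA and configuring EQ")
    p.add_argument("-b", "--fpg", dest="fpgfile", type=str, default="",
                   help="Specify the fpg file to load")
    return p


# Example invocation; a real script would call parse_args() with no arguments.
opts = build_parser().parse_args(["192.168.0.1", "-b", "snap_tut_spec.fpg"])
print(opts.snap, opts.fpgfile, opts.acc_len, opts.skip)
```

With argparse the board address becomes a required positional argument, so the explicit check for an empty argument list is no longer needed.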

Conclusion

If you have followed this tutorial faithfully, you should now know:

  • What a spectrometer is and what the important parameters for astronomy are.
  • Which CASPER blocks you might want to use to make a spectrometer, and how to connect them up in Simulink.
  • How to connect to and control a SNAP spectrometer using python scripting.

In the following tutorials, you will learn to build a correlator, and a polyphase filtering spectrometer using an FPGA in conjunction with a Graphics Processing Unit (GPU).

Tutorial 4: Wideband Pocket Correlator

Introduction

In this tutorial, you will create a simple Simulink design which uses the iADC board on ROACH and the CASPER DSP blockset to process a wideband (400MHz) signal, channelize it and output the visibilities through ROACH’s PPC.

By this stage, it is expected that you have completed tutorial 1 and tutorial 2 and are reasonably comfortable with Simulink and basic Python. We will focus here on higher-level design concepts, and will provide you with low-level detail preimplemented.

Background

Some of this design is similar to that of the previous tutorial, the Wideband Spectrometer. So completion of tutorial 3 is recommended.

Interferometry

In order to improve sensitivity and resolution, telescopes require a large collection area. Instead of using a single, large dish which is expensive to construct and complicated to maneuver, modern radio telescopes use interferometric arrays of smaller dishes (or other antennas). Interferometric arrays allow high resolution to be obtained, whilst still only requiring small individual collecting elements.

Correlation

Interferometric arrays require the relative phases of antennas’ signals to be measured. These can then be used to construct an image of the sky. This process is called correlation and involves multiplying signals from all possible antenna pairings in an array. For example, if we have 3 antennas, A, B and C, we need to perform correlation across each pair, AB, AC and BC. We also need to do auto-correlations, which will give us the power in each signal, i.e. AA, BB and CC. We will see this implemented later. The complexity of this calculation scales with the number of antennas squared. Furthermore, it is a difficult signal routing problem since every antenna must be able to exchange data with every other antenna.
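The pairings described above are easy to enumerate in Python, which also makes the quadratic growth visible. This is just an illustrative sketch; the function name is not part of any CASPER library.

```python
from itertools import combinations


def correlation_products(antennas):
    """Return (auto-correlations, cross-correlations) for a list of antennas."""
    autos = [(a, a) for a in antennas]
    crosses = list(combinations(antennas, 2))
    return autos, crosses


autos, crosses = correlation_products(["A", "B", "C"])
print("autos:  ", ["".join(p) for p in autos])
print("crosses:", ["".join(p) for p in crosses])

# The total number of products is N(N+1)/2, which grows as N squared.
for n in (3, 8, 64):
    a, c = correlation_products(list(range(n)))
    print("%3d antennas -> %5d products" % (n, len(a) + len(c)))
```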

Polarization

Dish type receivers are typically dual polarized (horizontal and vertical feeds). Each polarization is fed into a separate ADC input. When correlating these antennas, we differentiate between a full Stokes correlation and a half Stokes method. A full Stokes correlator does cross correlation between the different polarizations (i.e. for a given two antennas, A and B, it multiplies the horizontal feed from A with the vertical feed from B and vice-versa). A half Stokes correlator only correlates like polarizations with each other, thereby halving the compute requirements.
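The factor of two is easy to see by listing the polarization products for a single antenna pair. A small sketch follows, using H and V as labels for the two feeds; the function name is made up for illustration.

```python
from itertools import product

POLS = ("H", "V")  # horizontal and vertical feeds


def pol_products(full_stokes):
    """Polarization products computed for one antenna pair."""
    return [p + q for p, q in product(POLS, repeat=2) if full_stokes or p == q]


print("full Stokes:", pol_products(True))
print("half Stokes:", pol_products(False))
```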

The Correlator

The correlator we will be designing is a 3 input correlator which uses a SNAP board with each ADC operating in maximum speed mode.

Creating Your Design
Create a new model

Having started Matlab, open Simulink (either by typing simulink on the Matlab command line, or by clicking the Simulink icon in the taskbar). Create a new model and add the Xilinx System Generator and SNAP platform blocks as before in Tutorial 1.

System Generator and Platform Blocks

_images/sysgen_snap_platform.png

By now you should have used these blocks a number of times. Pull the System Generator block into your design from the Xilinx Blockset menu under Basic Elements. The settings can be left on default.

The SNAP platform block can be found under the CASPER XPS System Blockset: Platform subsystem. Set the Clock Source to adc0_clk and the rest of the configuration as the default.

Sync Generator

_images/snap_sync.png

The Sync Generator puts out a sync pulse which is used to synchronize the blocks in the design. See the CASPER memo on sync pulse generation for a detailed explanation.

This sync generator is able to synchronize with an external trigger input. Typically we connect this to a GPS’s 1pps output to allow the system to reset on a second boundary after a software arm. This enables us to know precisely the time at which an accumulation was started. It also allows multiple boards to be synchronized, which is vital if we are building a system which correlates signals from digitizers hosted on separate boards. To synchronize from an external PPS we can drive the sync generator logic with the SNAP’s sync_in GPIO input.

Logic is also provided to generate a sync manually via a software input. This allows the design to be used even in the absence of a 1 pps signal. However, in this case, the time the sync pulse occurs depends on the latency of software issuing the sync command and the FPGA signal triggering. This introduces some uncertainty in the timestamps associated with the correlator outputs.

ADCs

_images/snap_adc.png

Connection of the ADCs is as in tutorial 3 except now we are using all three available inputs.

Throughout this design, we use CASPER’s bus_create and bus_expand blocks to simplify routing and make the design easier to follow.

Control Register

_images/t4_ctrl_reg_jbo1.png

This part of the Simulink design sets up a software register which can be configured in software to control the correlator. Set the yellow software register’s IO direction to From Processor. You can find it in the CASPER XPS System Blockset. The constant block input to this register is used only for simulation.

The output of the software register goes to three slice blocks, which will pull out the individual parameters for use with configuration. The first slice block (top) is setup as follows:

_images/t4_ctrl_slice_set1.png

The slice block can be found under the Xilinx Blockset → Control Logic. The only change with the subsequent slice blocks is the Offset of the bottom bit. They are, from top to bottom, 16, 17 and 18 respectively.

After each slice block we put an edge_detect block, found under CASPER DSP Blockset → Misc. This outputs true if a boolean input signal is true this clock and was false last clock.
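The combined effect of a slice block followed by an edge_detect block can be modelled in a few lines. The sketch below takes a sequence of control register values, one per clock, extracts bit 16 and produces a one-cycle pulse on each rising edge. It assumes the detector's stored state starts out false; the register values are made up for the example.

```python
def slice_bit(word, offset):
    """Extract a single bit from a register word as a boolean."""
    return bool((word >> offset) & 1)


def edge_detect(samples):
    """True only on cycles where the input is true and was false on the previous cycle."""
    previous = False  # assumed initial state
    pulses = []
    for current in samples:
        pulses.append(current and not previous)
        previous = current
    return pulses


# Control register value on successive clock cycles: bit 16 is set, held, cleared, set again.
ctrl_values = [0, 0, 1 << 16, 1 << 16, 1 << 16, 0, 1 << 16]
bit16 = [slice_bit(v, 16) for v in ctrl_values]
pulses = edge_detect(bit16)
print(pulses)
```

This is why software has to write a 0 and then a 1 to a control bit to trigger the associated action again: holding the bit high produces only one pulse.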

Next are the delay blocks. They can be left with their default settings and can be found under Xilinx Blockset → Common. The delays used here aren’t necessary for the function of the design, but can help meet timing by giving the compiler an extra cycle of latency to use when routing control signals.

The Goto and From blocks can be found under Simulink-> Signal Routing. Label them as in the block diagram above.

Clip Detect and status reporting

To detect and report signal saturation (clipping) to software, we will create a subsystem with latching inputs.

_images/t4_status_clip_jbo1.png _images/t4_status_report1.png

The internals of this subsystem (right) consist of delay blocks, registers and cast blocks.

The delays (inputs 2 - 9) can be kept as default. Cast blocks are required as only unsigned integers can be concatenated. Set their parameters to Unsigned, 1 bit, 0 binary points, Truncated Quantization, Wrapped Overflow and 0 Latency.

The Registers (inputs 10 - 33) must be set up with an initial value of 0 and with enable and reset ports enabled. The status register on the output of the clip detect is set to processor in with unsigned data type and 0 binary point with a sample period of 1.

PFBs, FFTs and Quantisers

The PFB FIR, FFT and the Quantizer are the heart of this design; there is one set of each for the 3 ADC channels. However, in order to save resources associated with control logic and PFB and FFT coefficient storage, the independent filters are combined into a single Simulink block. This is configured to process three independent data streams by setting the “number of inputs” parameter on the PFB_FIR and FFT blocks to 3.

_images/snap_pfb.png

Configure the PFB_FIR_generic blocks as shown below:

_images/snap_fir_mask.png

There is potential to overflow the first FFT stage if the input is periodic or signal levels are high, as shifting inside the FFT is only performed after each butterfly stage calculation. For this reason, we recommend casting any inputs up to 18 bits with the binary point at position 17 (thus keeping the range of values -1 to 1), and then downshifting by 1 bit so that the signal sits one bit below the most significant bit.
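The cast and downshift can be illustrated with plain integer arithmetic. This sketch assumes 8-bit signed input samples with the binary point at position 7; that input format is an assumption made for the example, so substitute the format of your own input.

```python
def cast_to_18_17(raw, in_binary_point=7):
    """Re-express a signed fixed-point integer with 17 fractional bits.

    The represented value (raw / 2**in_binary_point) is unchanged.
    """
    return raw << (17 - in_binary_point)


def downshift(raw, bits=1):
    """Arithmetic right shift: halves the represented value for bits=1."""
    return raw >> bits


def to_float(raw, binary_point=17):
    """Convert a fixed-point integer back to the value it represents."""
    return raw / float(1 << binary_point)


full_scale = cast_to_18_17(-128)      # -1.0 in 8.7 format
after_shift = downshift(full_scale)   # now -0.5, leaving one bit of headroom
```

After the shift a full-scale input occupies half of the 18-bit range, which leaves room for growth in the first butterfly stage.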

The fft_wideband_real block should be configured as follows:

_images/snap_fft_mask.png

The Quantizer Subsystem is designed as seen below. The quantizer cuts the data signals from the FFT output width (18 bits) down to 4 bits. This means that the downstream processing can be implemented with fewer resources. In particular, less RAM is needed to store the accumulated correlations. We have to be careful when quantizing signals to make sure that we are neither saturating the quantizer nor suffering from low signal levels. Prior to quantizing we multiply our signals by a runtime programmable set of coefficients, which can be set so as to ensure the quantizer output power levels are optimal.

_images/snap_quantizer.png
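The effect of the coefficients can be seen in a small numerical sketch. It assumes a 4-bit signed output with 3 fractional bits, round-half-up rounding and symmetric saturation at ±7. Those details are assumptions for illustration and may differ from the block's actual settings.

```python
import math


def quantize_4bit(value, coeff):
    """Scale `value` by `coeff`, round to 3 fractional bits, saturate to [-7, 7]."""
    scaled = value * coeff
    level = int(math.floor(scaled * 8 + 0.5))  # 8 = 2**3 fractional steps
    return max(-7, min(7, level))


too_quiet = quantize_4bit(0.001, 1)     # signal is lost: rounds to level 0
well_scaled = quantize_4bit(0.001, 400) # uses the middle of the range
saturated = quantize_4bit(0.5, 400)     # clipped at the top level
```

A coefficient that is too small throws the signal away, and one that is too large clips it. The runtime programmable coefficients let you find the range in between.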

The top level view of the Quantizer Subsystem is as seen below. We repeat this system once for each signal path.

_images/snap_quantizer_top.png

LEDs

The following sections are more peripheral to the design and will only be touched on. By now you should be comfortable putting the blocks together and be able to figure out many of the values and parameters. The complete design is available in the tutorials repository for reference.

As a debug and monitoring output we can wire up the LEDs to certain signals. We light an LED with every sync pulse. This is a sort of heartbeat showing that the design is clocking and the FPGA is running.

We also use an LED to give a visual indication of when an accumulation is complete.

Since the signals in our design are very low duty cycle, they won’t naturally make LED flashes which are visible. We therefore use a pulse extend module to stretch pulses on these signals for 2^24 FPGA clock cycles, which is tens of milliseconds at typical FPGA clock rates (about 84 ms at 200 MHz, for example).

_images/snap_leds.png
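A pulse extender is easy to model in software. The sketch below is illustrative only (the class name and the retrigger behaviour are assumptions). The helper function computes the stretch duration for whatever clock rate you supply.

```python
class PulseExtend(object):
    """Hold the output high for 2**bits clock cycles after each input pulse."""

    def __init__(self, bits):
        self.length = 1 << bits
        self.remaining = 0

    def step(self, pulse):
        if pulse:
            self.remaining = self.length  # assumption: a new pulse restarts the count
        out = 1 if self.remaining > 0 else 0
        if self.remaining > 0:
            self.remaining -= 1
        return out


def stretch_seconds(bits, clk_hz):
    """Duration of the stretched pulse for a given FPGA clock rate."""
    return (1 << bits) / float(clk_hz)


extender = PulseExtend(bits=2)  # tiny extension so the result is easy to read
led = [extender.step(p) for p in [1, 0, 0, 0, 0, 0]]
duration = stretch_seconds(24, 200e6)  # 200 MHz is an assumed clock rate
```

With `bits=2` a one-cycle pulse becomes four cycles long. With `bits=24` and an assumed 200 MHz clock, `duration` comes out at roughly 0.084 s.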

ADC RMS

These blocks calculate the RMS values of the ADCs’ input signals. We subsample the input stream by a factor of four and do a pseudo random selection of the parallel inputs to prevent false reporting of repetitive signals. This subsampled stream is squared and accumulated for 2^16 samples.

_images/snap_rms.png
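On the software side, the accumulated sum of squares has to be turned back into an RMS value. A minimal sketch, assuming the register holds the plain sum of squared samples over the accumulation length (the subsampling and pseudo-random selection are not modelled here):

```python
import math


def accumulate_squares(samples):
    """Sum of squared samples, as the hardware accumulator would produce."""
    return sum(s * s for s in samples)


def rms_from_accumulator(acc, n_samples=2 ** 16):
    """Recover the RMS level from an accumulated sum of squares."""
    return math.sqrt(acc / float(n_samples))


samples = [3, -3] * 4                 # illustrative data only
acc = accumulate_squares(samples)
rms = rms_from_accumulator(acc, n_samples=len(samples))
```

For a real readout you would pass the register value as `acc` and leave `n_samples` at 2^16, provided your design accumulates over that many samples.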

The MAC operation

The multiply and accumulate is performed in the dir_x (direct-x) blocks, so named because different antenna signal pairs are multiplied directly, in parallel (as opposed to the packetized correlators’ X engines which process serially).

Two sets are used, one for the even channels and another for the odd channels. Accumulation for each antenna pair takes place in BRAM using the same simple vector accumulator used in tut3.

_images/snap_cmac.png
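The multiply-and-accumulate over all antenna pairs can be written out in a few lines of Python. This is a sketch of the arithmetic only: the data layout and the choice of which input is conjugated are assumptions, and the even/odd channel split is not modelled.

```python
from itertools import combinations_with_replacement


def accumulate_products(frames, n_ant=3):
    """Accumulate a_i * conj(a_j) per channel for every antenna pair (i <= j).

    `frames` is a list of spectra sets; frames[t][i][ch] is the complex value
    for antenna i, channel ch at time t.
    """
    n_chan = len(frames[0][0])
    pairs = list(combinations_with_replacement(range(n_ant), 2))
    acc = {pair: [0j] * n_chan for pair in pairs}
    for frame in frames:
        for (i, j) in pairs:
            for ch in range(n_chan):
                acc[(i, j)][ch] += frame[i][ch] * frame[j][ch].conjugate()
    return acc


# Two identical illustrative frames: 3 antennas, 2 channels each.
frame = [[1 + 1j, 2 + 0j], [1j, 1 + 0j], [1 + 0j, 1 + 0j]]
acc = accumulate_products([frame, frame])
```

Three inputs give six products: three auto-correlations, which are purely real, and three cross-correlations, which are complex.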

CONTROL:

The design starts by itself when the FPGA is programmed. The only control register inputs are for resetting counters and optionally sync’ing to an external signal.

Sync LED provides a “heartbeat” signal to instantly see if your design is clocked sensibly.

New accumulation LED gives a visual indication of data rates and dump times.

Accumulation counter provides a simple mechanism for checking if a new spectrum output is available (poll and compare to the last value).
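The poll-and-compare idea can be sketched as below. The function takes any callable that returns the current counter value, so it does not depend on a particular register name. With casperfpga you might pass something like `lambda: fpga.read_uint('acc_num')`, where `acc_num` is a hypothetical register name.

```python
import time


def wait_for_new_accumulation(read_count, last_count,
                              poll_interval=0.01, timeout=5.0):
    """Poll `read_count()` until it differs from `last_count`, then return it."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        count = read_count()
        if count != last_count:
            return count
        time.sleep(poll_interval)
    raise RuntimeError('no new accumulation within %.1f s' % timeout)


# Stand-in for a hardware register: returns 5, 5, then 6.
fake_values = iter([5, 5, 6])
new_count = wait_for_new_accumulation(lambda: next(fake_values), 5,
                                      poll_interval=0)
```

Once the function returns, the spectrum can be read out and the returned value kept as the new `last_count`.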

Software

The python scripts are located in the tut_corr tutorial directory. We first need to run poco_init.py to program the FPGA and configure the design. Then we can run either the auto or the cross correlations plotting scripts (plot_poco_auto.py and plot_poco_cross.py).

Try running these scripts with the -h option to get a description of optional and required arguments.

Yellow Block Tutorial: Bidirectional GPIO

This tutorial aims to provide a very entry-level introduction to yellow block creation using the JASPER toolflow. A number of other tutorials and guides already exist: for example, the original ROACH Yellow Block Tutorial on which I based this tutorial, the Yellow block EDK wiki page, and Dave George’s guide to yellow blocking the KATADC. This tutorial attempts to be more of a guided tour around the inner workings of the toolflow, in which you will make an extremely simple new yellow block.

In this tutorial, you will create a yellow block for a bidirectional GPIO n-bit interface for the SNAP board.

Making a Bidirectional GPIO - HDL

So we want to design a bidirectional GPIO interface. That means we need to create a bidirectional GPIO module, and convince the toolflow to instantiate it.

(In most cases when we are porting something into the toolflow, all Verilog/VHDL code is completed, tested, and working in the form of a Xilinx Vivado project.)

The simplest version of a bidirectional GPIO module that can be created is simply a wrapper around a Xilinx IOBUF instance. An IOBUF (see the 7 series user guide page 39) is a Xilinx module used to connect signals to a bi-directional external pin. It has the following ports, which are described (using slightly loose terminology) below:

I: the input (i.e., from the FPGA to the GPIO pin)

O: the output (i.e., from the GPIO pin to the FPGA)

IO: the GPIO pin (defined by the user in the Simulink mask later)

T: The control signal, which configures the interface as an input (i.e. IO —> O) when T=1, and an output (i.e. I —> IO) when T=0.

We construct a module “my_gpio_bidir” which wraps n such IOBUF instances (i.e., an n-bit wide buffer) and also registers the output signal. This simple module will form the entirety of the interface we will turn into a yellow block. Create a new folder in /mlib_devel/jasper_library/hdl_sources/ named ‘my_gpio_bidir’ and save your module description as my_gpio_bidir.v there.

NB: n-bit refers to the parameter WIDTH below.

module my_gpio_bidir #(parameter WIDTH=1) (
    input            clk,
    inout      [WIDTH-1:0] dio_buf, //inout, NOT input(!)
    input      [WIDTH-1:0] din_i,
    output reg [WIDTH-1:0] dout_o,
    input            in_not_out_i
  );
  
  // A wire for the output data stream
  wire [WIDTH-1:0] dout_wire; 

  // Buffer the in-out data line
  IOBUF iob_data[WIDTH-1:0] (
    .O (dout_wire),  //the data output
    .IO(dio_buf),    //the external in-out signal
    .I(din_i),       //the data input
    .T(in_not_out_i) //The control signal. 1 for input, 0 for output
  ); 
 
  //register the data output
  always @(posedge clk) begin
    dout_o <= dout_wire;
  end
endmodule
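
Before moving on, it can help to check your understanding of the T control signal with a small behavioural model. The Python sketch below is not part of the toolflow. It models a single bit of the buffer, and `pin_external` is a made-up name for whatever the outside world drives onto the pin.

```python
def iobuf(i, t, pin_external):
    """Return (io, o) for one IOBUF bit.

    t = 1: the buffer is high-impedance, so the pin carries the external value.
    t = 0: the FPGA drives the pin with `i`.
    In both cases `o` reads back the value on the pin.
    """
    io = pin_external if t else i
    o = io
    return io, o


as_output = iobuf(i=1, t=0, pin_external=0)  # FPGA drives the pin high
as_input = iobuf(i=1, t=1, pin_external=0)   # pin follows the external 0
```

Note that when the block is an output, `o` still reflects what is being driven. This is why reading back an output bank returns the value that was written.
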
Am I on the right track?

To ensure we are on the right track, we will run the ‘jasper’ command in the Matlab terminal and check to see if our yellow blocks ended up in jasper.per file. This file contains all the peripherals from our Simulink model.

NB: This script will fail because we have not written the proper Python yellow block code yet. This is just to double-check that we are on the right track.

When the script fails, open up the build directory (named the same as your Simulink model) and open the file ‘jasper.per’. Read through it and ensure you find your new yellow blocks in it (search for the name of your yellow block). If you don’t, then something went wrong, and you should re-read this tutorial to see where you differed. If you see the yellow block, please continue on.
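If you prefer to search from a script rather than by eye, a plain text search is enough. This sketch makes no assumption about the structure of jasper.per beyond it being a text file. The path in the commented usage line is hypothetical.

```python
def find_block_lines(per_path, block_name):
    """Return (line_number, line) pairs that mention `block_name` (case-insensitive)."""
    matches = []
    with open(per_path) as handle:
        for lineno, line in enumerate(handle, 1):
            if block_name.lower() in line.lower():
                matches.append((lineno, line.rstrip()))
    return matches


# Example usage (hypothetical build directory name):
# for lineno, line in find_block_lines('my_model/jasper.per', 'my_gpio_bidir'):
#     print('%d: %s' % (lineno, line))
```

An empty result means the toolflow did not pick up your block, which usually points at the block's tag or mask rather than at the HDL.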

Python auto-gen scripts (JASPER Toolflow)

Now we have the module (HDL you wrote first) and Simulink model finished. It is time to write some Python code so that the toolflow will see our yellow block and instantiate the module. When the toolflow runs, it will look for xps-tagged blocks in your design. For each one it will construct an instantiation, connecting your yellow block ports/parameters to the HDL code you wrote. This will all later show up in a top-level auto-generated entity, cleverly called ‘top.v’.

The toolflow is as follows, starting with the jasper command in the matlab terminal (all scripts can be found in the mlib_devel directory or yellow_block sub-dir):

jasper.m -> jasper_frontend.m -> exec_flow.py -> toolflow.py -> yellow_block.py -> my_gpio_bidir.py

The last script is named after your module, in this case my_gpio_bidir.py. Create this script in the yellow block sub-directory of mlib_devel. I recommend carefully reading yellow_block.py as the function header comments are well written and explain what you need to do. :)

NB: I figured out how to create this script by comparing my_gpio_bidir to the gpio yellow block. First I found the hdl_source file for it. Next I looked at the top.v from a different project that contained the gpio block (as top.v contains the instantiation for the gpio module). Then I compared the script that generated that instantiation (gpio.py) to the top.v and HDL source.

Start by tweaking the modify_top function to suit your needs, run the ‘jasper’ command and fix Python errors until the errors point to the gen_constraints function. Next, repeat the process for the gen_constraints function. Debug and repeat until you compile without errors, then look in the top.v file in the build directory and you should see your yellow block instantiation. Carefully add the rest of your functionality from here until your top.v instantiation matches your HDL code (module).

(Move on to the next section, once the ‘jasper’ command finishes and your yellow block instantiation matches the module)

NB: The system generated verilog/VHDL code is all lowercase, be sure that your ports and signals match accordingly.

The code for the my_gpio_bidir yellow block is below (pay particular attention to the comments):

from yellow_block import YellowBlock
from constraints import PortConstraint
from helpers import to_int_list

class my_gpio_bidir(YellowBlock):
    def initialize(self):
        # Set bitwidth of block (this is determined by the 'Data bitwidth' parameter in the Simulink mask)
        self.bitwidth = int(self.bitwidth)
        # add the source files, which have the same name as the module (this is the verilog module created above)
        self.module = 'my_gpio_bidir'
        self.add_source(self.module)

    def modify_top(self,top):
        # port name to be used for 'dio_buf'
        external_port_name = self.fullname + '_ext'
        # get this instance from 'top.v' or create if not instantiated yet
        inst = top.get_instance(entity=self.module, name=self.fullname, comment=self.fullname)
        # add ports necessary for instantiation of module
        inst.add_port('clk', signal='user_clk', parent_sig=False)
        # parent_port=True and dir='inout', so add an inout port to 'top.v'
        inst.add_port('dio_buf', signal=external_port_name, dir='inout', width=self.bitwidth, parent_port=True)
        inst.add_port('din_i', signal='%s_din_i'%self.fullname, width=self.bitwidth)
        inst.add_port('dout_o', signal='%s_dout_o'%self.fullname, width=self.bitwidth)
        inst.add_port('in_not_out_i', signal='%s_in_not_out_i'%self.fullname)
        # add width parameter from 'Data bitwidth' parameter in Simulink mask
        inst.add_parameter('WIDTH', str(self.bitwidth))

    def gen_constraints(self):
        # add port constraint to user_const.xdc for the 'inout' port
        return [PortConstraint(self.fullname+'_ext', self.io_group, port_index=range(self.bitwidth), iogroup_index=to_int_list(self.bit_index))]
Testing

Now we need to test. The Python script for this tutorial is an automated test of the Bidirectional GPIO block we just made. It sets one GPIO bank (a or b) as an output and the other as an input. It then writes to the output bank and reads the other bank’s input. After that it swaps the modes of the two banks, in order to demonstrate that each bank can be either an input or an output (bidirectional), and repeats the same write and read sequence.

Run the script included below in the terminal using the command:

./tut_gpio_bidir.py -f <Generated fpg file here> <SNAP hostname or ip addr>

NB: You may need to run chmod +x ./tut_gpio_bidir.py first.

#!/usr/bin/env python
'''
Script for testing the Bi-Directional GPIO Yellow Block created for CASPER Tutorial 7.
Author: Tyrone van Balla, January 2016
Reworked for SNAP and tested: Brian Bradford, May 2018
'''
import casperfpga
import time
import sys
import numpy as np

fpgfile = 'tut_gpio_bidir.fpg'
fpgas = []

def exit_clean():
    try:
        for f in fpgas: f.stop()
    except:
        pass
    exit()

def exit_fail():
    print 'FAILURE DETECTED. Exiting . . .'
    exit()

if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("snap", help="<SNAP_HOSTNAME or IP>")
    parser.add_argument("-f", "--fpgfile", type=str, default=fpgfile, help="Specify the fpg file to load")
    parser.add_argument("-i", "--ipython", action='store_true', help="Enable iPython control")

    args = parser.parse_args()

    if args.snap == "":
        print 'Please specify a SNAP board. \nExiting'
        exit()
    else:
        snap = args.snap

    if args.fpgfile != '':
        fpgfile = args.fpgfile

# try:

print "Connecting to server %s . . . "%(snap),
fpga = casperfpga.CasperFpga(snap)
time.sleep(1)

if fpga.is_connected():
    print 'ok\n'
else:
    print 'ERROR connecting to server %s . . .'%(snap)
    exit_fail()

# program fpga with bitstream

print '------------------------'
print 'Programming FPGA...',
sys.stdout.flush()
fpga.upload_to_ram_and_program(fpgfile)
time.sleep(1)
print 'ok'

# initialize gpio bank control registers
fpga.write_int('a_is_input', 1)
fpga.write_int('b_is_input', 1)

if args.ipython:
    # open ipython session for manual testing of yellow block
    
    # list all registers first
    print '\nAvailable Registers:'
    registers = fpga.listdev()
    for reg in registers:
        if not('sys' in reg):
            print '\t',
            print reg
        else:
            pass
    print '\n'

    # how to use
    print 'Use "fpga" as the fpga object\n'

    import IPython; IPython.embed()
    
    print 'Exiting . . .'
    exit_clean()

'''
Automated testing of Bidirectional GPIO Block.
Sets one GPIO bank as output, other as input.
Writes to output bank, reads input.

Swaps mode of banks to demonstrate either bank can be either input or output.

'''
print '#################################'
# Send from GPIO_LED (B) to GPIO_GPIO (A) 
print '\nConfiguring to send from GPIO_LED (B) to GPIO_GPIO (A)\n'
fpga.write_int('a_is_input', 1) # GPIO_GPIO as input
fpga.write_int('b_is_input', 0) # GPIO_LED as output

print 'Initial Values: A: %s, B: %s\n' % (np.binary_repr(fpga.read_int('from_gpio_a'), width=4), np.binary_repr(fpga.read_int('from_gpio_b'), width=4))
print 'Writing 0xF to B . . . \n'

fpga.write_int('to_gpio_a', 0)  # dummy data written to GPIO_GPIO
fpga.write_int('to_gpio_b', 0xFFFF) # data written to GPIO_LED
time.sleep(0.01)

print 'A: 0 <------------- B: 0xF\n'

from_a = fpga.read_int('from_gpio_a') # read GPIO_GPIO
from_b = fpga.read_int('from_gpio_b') # read GPIO_LED

print 'Readback values: A: %s, B: %s\n' % (np.binary_repr(from_a, width=4), np.binary_repr(from_b, width=4))

print 'Writing 0x0 to B . . . \n'
print 'A: 0xF <---------- B: 0x0\n'

fpga.write_int('to_gpio_a', 0xFFFF) # dummy data written to GPIO_GPIO
fpga.write_int('to_gpio_b', 0x0) # data written to GPIO_LED
time.sleep(0.01)

from_a = fpga.read_int('from_gpio_a') # read GPIO_GPIO
from_b = fpga.read_int('from_gpio_b') # read GPIO_LED

print 'Readback values: A: %s, B: %s\n' % (np.binary_repr(from_a, width=4), np.binary_repr(from_b, width=4))

print '##################################'
# Send from GPIO_GPIO  (A) to GPIO_LED (B) 
print '\nConfiguring to send from GPIO_GPIO (A) to GPIO_LED (B)\n'
fpga.write_int('a_is_input', 0) # GPIO_GPIO as output
fpga.write_int('b_is_input', 1) # GPIO_LED as input

print 'Initial Values: A: %s, B: %s\n' % (np.binary_repr(fpga.read_int('from_gpio_a'), width=4), np.binary_repr(fpga.read_int('from_gpio_b'), width=4))
print 'Writing 0x0 to A . . . \n'

fpga.write_int('to_gpio_a', 0)  # data written to GPIO_GPIO
fpga.write_int('to_gpio_b', 0xFFFF) # dummy data written to GPIO_LED
time.sleep(0.01)

print 'A: 0 -------------> B: 0xF\n'

from_a = fpga.read_int('from_gpio_a') # read GPIO_GPIO
from_b = fpga.read_int('from_gpio_b') # read GPIO_LED

print 'Readback values: A: %s, B: %s\n' % (np.binary_repr(from_a, width=4), np.binary_repr(from_b, width=4))

print 'Writing 0xF to A . . . \n'

print 'A: 0xF ----------> B: 0x0\n'

fpga.write_int('to_gpio_a', 0xFFFF) # data written to GPIO_GPIO
fpga.write_int('to_gpio_b', 0x0) # dummy data written to GPIO_LED
time.sleep(0.01)

from_a = fpga.read_int('from_gpio_a') # read GPIO_GPIO
from_b = fpga.read_int('from_gpio_b') # read GPIO_LED

print 'Readback values: A: %s, B: %s\n' % (np.binary_repr(from_a, width=4), np.binary_repr(from_b, width=4))

# except KeyboardInterrupt:
#     exit_clean()
# except Exception as inst:
#     exit_fail()

exit_clean()

Your results without wiring pins on the SNAP board should look something close to:

Connecting to server rpi2-11 . . .  ok

------------------------
Programming FPGA... ok
#################################

Configuring to send from GPIO_LED (B) to GPIO_GPIO (A)

Initial Values: A: 0000, B: 0000

Writing 0xF to B . . . 

A: 0 <------------- B: 0xF

Readback values: A: 0000, B: 1111

Writing 0x0 to B . . . 

A: 0xF <---------- B: 0x0

Readback values: A: 1111, B: 0000

##################################

Configuring to send from GPIO_GPIO (A) to GPIO_LED (B)

Initial Values: A: 1111, B: 0000

Writing 0x0 to A . . . 

A: 0 -------------> B: 0xF

Readback values: A: 0000, B: 0000

Writing 0xF to A . . . 

A: 0xF ----------> B: 0x0

Readback values: A: 1111, B: 1111

Now we will wire up the pins on the SNAP board correctly. Put the following female jumpers between pins on the J9 GPIO (refer to page 14 of SNAP Schematic if necessary):

  • TEST0 and TEST4
  • TEST1 and TEST5
  • TEST2 and TEST6
  • TEST3 and TEST7

After this, run the same script again. The expected results are:

Connecting to server rpi2-11 . . .  ok

------------------------
Programming FPGA... ok
#################################

Configuring to send from GPIO_LED (B) to GPIO_GPIO (A)

Initial Values: A: 0000, B: 0000

Writing 0xF to B . . . 

A: 0 <------------- B: 0xF

Readback values: A: 1111, B: 1111

Writing 0x0 to B . . . 

A: 0xF <---------- B: 0x0

Readback values: A: 0000, B: 0000

##################################

Configuring to send from GPIO_GPIO (A) to GPIO_LED (B)

Initial Values: A: 1111, B: 1111

Writing 0x0 to A . . . 

A: 0 -------------> B: 0xF

Readback values: A: 0000, B: 0000

Writing 0xF to A . . . 

A: 0xF ----------> B: 0x0

Readback values: A: 1111, B: 1111

If you matched the result above, then congratulations, you’ve successfully created and tested your first yellow block!

If not, start by ensuring your original HDL code was correct to begin with, then debug the yellow block Python script you wrote.

Add yellow block to XPS Library
  1. Create a new Simulink model with the name identical to your yellow block name (rename your yellow block if it is an unacceptable model name)
  2. Add your yellow block to the model. (This should be the only block in the model)
  3. Add your yellow block mask script to ‘xps_library’ folder if needed.
  4. Save your Simulink model in the ‘xps_models’ folder (please put it in the directory that makes sense, otherwise create a new directory)
  5. Launch Matlab via the ./startsg script in mlib_devel directory.
  6. Double-click on ‘xps_library’ directory from the ‘Current Folder’ pane on the left-hand side of the Matlab window.
  7. Run xps_build_new_library, click ‘Yes’ on overwrite dialog prompt and ignore any warnings.
  8. For any models you wish to link with this new library, open the model and run update_casper_blocks(bdroot) in the Matlab command window. (Preferably all your models)

Now help out CASPER by adding more yellow blocks to our library :)


Author: Brian Bradford, June 1, 2018

Credit to Jack Hickish for the original ROACH yellow block tutorial, on which I based this one.

SKARAB

  1. Introduction Tutorial Step-by-Step or Completed
  2. 40GbE Tutorial Step-by-Step or Completed
  3. HMC Tutorial Step-by-Step or Completed

Tutorial 2: 40GbE Interface

Introduction

In this tutorial, you will create a simple Simulink design which uses the SKARAB’s 40GbE ports to send data at high speeds to another port. This could just as easily be another SKARAB board or a computer with a 40GbE network interface card. In addition, you will learn to control the design remotely using a supplied Python library for KATCP. The UDP packets sent by the SKARAB will be recorded to disk.

This tutorial essentially implements the transmission of a counter through one QSFP+ port and back into another. This allows for a test of the communications link in terms of performance and reliability. This test can be used to test the link between boards and the effect of different cable lengths on communications quality.

Background

For more info on the SKARAB please follow this link to the SKARAB hardware platform. Of particular interest for this tutorial is the section on the QSFP+ Mezzanine Card.

The maximum payload length of the 40GbE core is 8192 bytes (implemented in BRAM) plus another 512 bytes (implemented in distributed RAM) - which is useful for an application header. These ports (and hence part of the 40 GbE cores) run at 156.25MHz, while the interface to your design runs at the FPGA clock rate (sys_clk, etc). The interface is asynchronous and buffers are required at the clock boundary. For this reason, even if you send data between two SKARAB boards which are running off the same hard-wired clock, there will be jitter in the data. A second consideration is how often you clock values into the core when you try to send data. If your FPGA is running faster than the core, and you attempt to clock data in on every clock cycle, the buffers will eventually overflow. Likewise for receiving, if you send too much data to a board and cannot clock it out of the receive buffer fast enough, the receive buffers will overflow and you will lose data. In our design we are clocking the FPGA at 200MHz with the cores running at 156.25MHz. We will therefore not be able to clock data into the Tx buffer continuously for very long before it overflows. If this doesn’t make much sense to you now don’t panic, it will become clear after you’ve tried it.
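The overflow argument is a simple ratio of clock rates. The sketch below assumes that the fabric side and the core side move the same number of bits per clock cycle, so that only the clock rates matter. The function name is made up for this example.

```python
def max_tx_duty_cycle(fpga_clk_hz, core_clk_hz=156.25e6):
    """Largest fraction of FPGA clock cycles on which data can be clocked in
    without the Tx buffer eventually overflowing (same bus width on both sides)."""
    return min(1.0, core_clk_hz / float(fpga_clk_hz))


duty_at_200mhz = max_tx_duty_cycle(200e6)   # fabric faster than the core
duty_at_100mhz = max_tx_duty_cycle(100e6)   # fabric slower than the core
```

At 200 MHz the result is 0.78125, so on average data can be valid on at most about 78% of clock cycles. Short bursts at full rate are absorbed by the buffer, but a sustained full-rate stream is not.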

Tutorial Outline

This tutorial will be run in an explain-and-explore kind of way. There are too many blocks for us to run through each one and its respective configuration. Hence, each section will be generally explained and it is up to you to explore the design and understand the detail. Please don’t hesitate to ask any questions during the tutorial session. If you are doing these tutorials outside of the CASPER workshop please email any questions to the CASPER email list.

This tutorial consists of 2 designs: a transmitter and a receiver. We will look at the transmitting design first.

Tx Design

As with the previous tutorial, drop down a Xilinx XSG Block and then the SKARAB platform yellow block. Configure the clock frequency to 170 MHz. Firstly, we use a software register to control our system. In this design we are using a single 32-bit register with the lower 6 bits being used for the following logic signals:

  • 40GbE core reset,
  • Debug logic reset,
  • Packet reset,
  • Transmit enable,
  • Packet enable, and
  • Snap block arming.

This software register also takes in some simulation stimuli. Try playing with these and simulate the design using the play button at the top of the window. It is advisable to use a short simulation time as more complex designs can take ages to simulate.

_images/Tx_control.png

Each packet that is sent from the FPGA fabric can be sent to a specified IP and port. These values are configurable and, in this case, are set via software registers. These could also be set dynamically from the fabric as required.

_images/tx_ip_port_registers.png

This is the start of the logic to build up our payload. The decimation register is used to control the rate at which packets are sent. Have a look at lines 307-312 of the Tx python script to see how this value is calculated and used.

_images/Tx_decimation_logic.png
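The example log output later in this tutorial reports the rate as `0.177Ghz * 256 / 23`, which suggests that the bit rate is the clock rate times the 256-bit bus width divided by the decimation factor. The sketch below works backwards from that. It is inferred from the log line, not copied from the script, and rounding the decimation up is an assumption.

```python
import math


def decimation_for_rate(target_gbps, clk_hz, bus_bits=256):
    """Smallest integer decimation that keeps the rate at or below the target."""
    full_rate_gbps = clk_hz * bus_bits / 1e9
    return int(math.ceil(full_rate_gbps / target_gbps))


def achieved_gbps(decimation, clk_hz, bus_bits=256):
    """Bit rate obtained when one word is sent every `decimation` clock cycles."""
    return clk_hz * bus_bits / 1e9 / decimation


decimation = decimation_for_rate(2.0, 177e6)
rate = achieved_gbps(decimation, 177e6)
```

With the script's default target of 2.0 Gbps and a 177 MHz clock this gives a decimation of 23 and a rate of about 1.970 Gbps, matching the numbers in the log.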

Here is the rest of the payload generation logic. We are creating 2 ramps and a walking-1 pattern. The payload is generated using a combination of counters, slice blocks, delays, adders and comparators. The ramps and walking-1 are concatenated together and put into the payload buffer by toggling the tx_valid signal on the 40GbE core. The tx_data bus is 256 bits wide so only 256 bits can be clocked in on a clock cycle. The buffer can accept a payload of up to 8192 bytes. Once all the data we require is in the payload buffer we toggle the tx_end_of_frame signal to send the packet into the ether.

_images/tx_40gbe.png
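The relationship between payload size and the number of tx_valid cycles follows from the bus width. A small sketch, using only the 256-bit bus width and the 8192-byte limit quoted above (the function names are made up, and treating the script's packet length "in words" as 256-bit words is an assumption):

```python
BUS_BITS = 256
MAX_PAYLOAD_BYTES = 8192


def words_for_payload(payload_bytes, bus_bits=BUS_BITS):
    """Number of tx_valid clock cycles needed to load a payload (rounded up)."""
    bytes_per_word = bus_bits // 8
    return -(-payload_bytes // bytes_per_word)  # ceiling division


def payload_bytes_for_words(n_words, bus_bits=BUS_BITS):
    """Payload size produced by `n_words` valid cycles."""
    return n_words * (bus_bits // 8)


max_words = words_for_payload(MAX_PAYLOAD_BYTES)
default_bytes = payload_bytes_for_words(160)  # 160 is the script's default packet length
```

The largest payload therefore takes 256 valid cycles. A 160-word packet is 5120 bytes, comfortably inside the buffer.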

As a method of debugging, the transmit side also has some data snapshot (snap, not SNAP) blocks which can capture data as it is sent to the core. The snap block is a BRAM which can be triggered to capture data on a particular signal and then read out from software. They are very useful for debugging and checking the data at particular stages through your design.

_images/Tx_snapshot_blocks.png

The design also has a counter that keeps track of each time the overflow or almost-full lines are driven high by the core. This will tell us if we have any overflow or almost-overflowing buffers.

_images/Tx_afull_overflow_regs.png

Now take a look through the Tx python script to see how the registers are being set and the debug snap blocks are used to validate the data being sent. This should be well commented, but please ask questions where things aren’t clear.

Rx Design

For the receiver design do the same as the previous design by dropping down an XSG block and the SKARAB platform block. Configure the clock rate to 230 MHz. We want this to be well above the clock rate of the transmit design so that we can handle the variable rate from the transmitter and not overflow our buffers.

Again we have a control register which manages resets, enables and snap block triggering.

_images/Rx_control_regs.png

This is the receiving 40GbE core. Its Tx side is all tied to zero (0), as the transmit interface is not used in this design. The Rx side is connected up to labels, which are used to reduce the number of wires running around the design.

_images/Rx_40gbe.png

The following blocks split out the walking 1 and the ramps from the received data.

_images/Rx_data_split.png

Here each of the split data are written into snap blocks. The snap blocks are triggered by the end-of-frame signal and the write enable is driven by the rx_valid signal.

_images/Rx_data_capture.png

Here we have counters used to count any errors on the receiver’s side. They are fed into software registers for access from software.

_images/Rx_debug_regs.png

Here are more registers used for debugging; they count any errors in the expected data, the ramps and the walking 1.

_images/Rx_error_counters.png
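The same checks can be repeated in software on data read back from the snap blocks. The sketch below shows one way to do that. The field widths are assumptions, so set `modulo` and `width` to match your design.

```python
def count_ramp_errors(values, modulo=2 ** 32):
    """Count positions where a ramp does not increase by exactly one (with wrap)."""
    errors = 0
    for prev, cur in zip(values, values[1:]):
        if cur != (prev + 1) % modulo:
            errors += 1
    return errors


def count_walking_one_errors(values, width=64):
    """Count positions where the value is not the previous one rotated left by 1."""
    mask = (1 << width) - 1
    errors = 0
    for prev, cur in zip(values, values[1:]):
        expected = ((prev << 1) | (prev >> (width - 1))) & mask
        if cur != expected:
            errors += 1
    return errors


ramp_errors = count_ramp_errors([1, 2, 3, 5, 6])               # one skipped value
walk_errors = count_walking_one_errors([32, 64, 128, 256, 512])  # clean pattern
```

A non-zero count on a link that reports no bad frames or overruns usually points at the capture logic rather than at the cable.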

This writes the packet header data into a snap block just to provide more debugging data.

_images/Rx_pkt_counter.png

Running the python script

Once you have finished examining the designs and feel that you have a good handle on them, look through the Python tx script. Try to figure out how to call the script with the correct parameters and files. You might have to scp the files to the control server, then run it and see what data you can get out. You can also start an ipython session to manually connect and run each of the commands. If you are running the tutorial during the workshop the facilitators should have the control server information up to view, or look here. If you are running the tutorials elsewhere please familiarise yourself with your local setup/server(s) in order to run the tutorial.

The python script, tut_40gbe.py, does all the heavy lifting for the communication aspect of the tutorial. The script allows you to specify a few parameters and subsequently programs the boards, transmits and receives the test data.

Script arguments

Much like most python scripts, you can query the arguments via:

$ python tut_40gbe.py --help
usage: tut_40gbe_tx.py [-h] [--txhost TXHOST] [--rxhost RXHOST]
                       [--txfpg TXFPG] [--rxfpg RXFPG] [--pktsize PKTSIZE]
                       [--rate RATE] [--decimate DECIMATE] [-p] [-i]
                       [--loglevel LOG_LEVEL]

Script and classes for SKARAB Tutorial 2

optional arguments:
  -h, --help            show this help message and exit
  --txhost TXHOST       Hostname or IP for the TX SKARAB. (default: )
  --rxhost RXHOST       Hostname or IP for the RX SKARAB. (default: )
  --txfpg TXFPG         Programming file for the TX SKARAB. (default: )
  --rxfpg RXFPG         Programming file for the RX SKARAB. (default: )
  --pktsize PKTSIZE     Packet length to send (in words). (default: 160)
  --rate RATE           TX bitrate, in Gbps. (default: 2.0)
  --decimate DECIMATE   Decimate the datarate by this much. (default: -1)
  -p, --program         Program the SKARABs (default: False)
  -i, --ipython         Start IPython at script end. (default: False)
  --loglevel LOG_LEVEL  log level to use, default None, options INFO, DEBUG,
                        ERROR (default: INFO)

As per the info above, we can see that running the 40GbE tutorial script requires the following (compulsory) parameters:

  • The IPs or Hostnames of SKARABs assigned to do the Transmitting and Receiving of data.
    • It is also worth mentioning that the SKARABs you intend on using to carry out this tutorial must have at least one QSFP+ cable plugged in to each board.
  • Programming files for the Tx and Rx SKARABs. These fpg-files will be output after the build process is executed on Tx and Rx simulink models.
    • There are already-built versions of these images available in the working directory, namely tut_40gbe_tx.fpg and tut_40gbe_rx.fpg. Feel free to use these if you don’t want to wait for the build process to complete.
  • The -p or --program flag dictates whether the programming files specified in the --txfpg and --rxfpg flags will be programmed to the board (using the method you learnt in the previous tutorial)
    • I would suggest specifying this flag purely to minimise the ‘setup’ work you need to do to get the tutorial running.
    • You could run the tutorial script without specifying this flag, however that would require programming the two SKARABs with the associated fpg-files before running the script. Nothing major.

The other flags already have default values and don’t need to be specified unless you want to, for example, test how different parameters change the behaviour of the system. An example of the script execution is shown below:

$ python tut_40gbe.py --txhost skarab020201-01 --rxhost skarab020202-01 --txfpg tut_40gbe_tx.fpg --rxfpg tut_40gbe_rx.fpg -p

Of course, please do make sure you are in the correct directory holding the tut_40gbe.py script. Equally, substitute tut_40gbe_tx.fpg and tut_40gbe_rx.fpg for the paths to your generated fpg-files in the event you ran through the build process. These files should be in tut_40gbe_tx/outputs/ and tut_40gbe_rx/outputs/. After executing the script as above you should see something resembling the following being printed to your terminal window:

INFO:root:Connecting to SKARABs
*
*
INFO:root:Programming SKARABs
*
*
INFO:root:  Done programming TXer
*
*
INFO:root:  Done programming RXer
skarab020202-01
INFO:root:Setting TX destination to 10.0.0.2.
INFO:root:Sending data at 1.970Gbps (0.177Ghz * 256 / 23)
INFO:root:Setting RX port.
INFO:root:Starting TX.
INFO:root:Some RX stats:
INFO:root:	valid: 7432640
INFO:root:	eof: 221779039
INFO:root:	badframe: 0
INFO:root:	overrun: 0
------------------------- pkt_000 -------------------------
  ctr mark              walking_one   pkt_ctr      ramp
    0 7777                       32     47491         1
    1 7777                       64     47491         2
    2 7777                      128     47491         3
    3 7777                      256     47491         4
    4 7777                      512     47491         5
*
*
*
*
*

If you see any errors do make a note of them! Regardless, please ask if you have any questions, of which I am sure there will be many.

Tutorial 3: HMC Interface

AUTHORS: A. Isaacson and A. van der Byl

EXPECTED TIME: 2 hours

Introduction

In this tutorial, you will create a simple Simulink design which writes and reads to/from the HMC Mezzanine Card that is plugged into the SKARAB Mezzanine 0 MegaArray slot - refer to SKARAB for more information. In addition, we will learn to control the design remotely, using a supplied Python library for KATCP.

In this tutorial, a counter will be used to generate HMC test write data. Another counter will be used to read the test data from the HMC. The write and read data rate can be controlled by a software register to read and write every second clock cycle or read and write every clock cycle. This test can be used to demonstrate the throughput that the HMC can handle. This tutorial will also explain why the HMC read data needs to be reordered and shows a way of doing this using BRAM with BRAM read and write control.

Background

SKARAB boards have four MegaArray mezzanine slots. Mezzanine 3 is traditionally used for the QSFP+ Mezzanine Card, which makes provision for the 40GbE Interface functionality - refer to SKARAB. The rest of the Mezzanine slots (0, 1 and 2) can be used for the HMC Mezzanine Card. This tutorial assumes that the HMC Mezzanine Card is fitted in mezzanine 0 slot. The SKARAB board can have up to three HMC Mezzanine Cards.

The HMC Mezzanine Card is fitted with a single Micron HMC MT43A4G80200 – 4GB 8H DRAM stack device. The HMC (Hybrid Memory Cube) Mezzanine Card is a 4GB, serialised memory, which uses 16 x 10Gbps SERDES lanes to the FPGA per mezzanine site. The HMC hardware works using the OpenHMC core designed by Juri Schmidt, who works for the Computer Architecture Group at the University of Heidelberg (see CAG and OpenHMC). The OpenHMC core is designed to be fully configurable, but this tutorial uses the following configuration: 256 bit data transfer and 4 FLITs per word.

The HMC yellow block makes provision for two links: Link 2 and Link 3. Each link uses 8 x TX/RX GTH full duplex lanes rated at 10Gbps each. The lanes can run faster, but we are using them at this rate. This means that theoretically each link can handle 80Gbps throughput. In reality the HMC interface uses a packet protocol built from FLITs (FLIT = Flow Unit, 128 bits each), and every packet carries a 64 bit header and a 64 bit tail, so at least half of the transferred data is overhead and the actual throughput will be in the region of 40Gbps. This has been tested successfully at 32Gbps, with both links active. This tutorial will only write and read from Link 3, so it can handle a throughput of between 32Gbps and 40Gbps. This tutorial tests the HMC with a write/read throughput of 29.44Gbps (pass) and a write/read throughput of 58.88Gbps (fail). The user will see how the HMC data and status monitoring differ between a successful and an unsuccessful HMC read and write.

The HMC memory has controller logic built in as well as stacked DRAM. This controller logic and stacked DRAM are divided into vaults. It is important to note that, due to the DRAM refresh cycles, reads requested from the HMC will not always be returned in the order in which they were requested. The HMC only returns requested data when the vault holding it is available. If the vault is not available then the HMC handles other requests that can be serviced. This out of order return allows the HMC to meet the higher throughput, but it does mean that the data read from the HMC needs to be reordered before using it. This tutorial will show you how to do this.

The HMC address scheme is user configurable, but also has set patterns and we are using one of the set patterns. The FLITs per a word (FPW) is also configurable, but the HMC yellow block is using 4 FLITS per a word. Each Flit is 128 bits, so 512 bits in total. The HMC is running at a 156.25MHz rate and the data width (after the header and tail have been removed) is 256bits. This gives a maximum data throughput of 256bits x 156.25MHz = 40Gbps.

It is important to note that each write uses 3 x FLITS and each read uses 1 x FLIT, so if you try to interleave read and writes then there will be a loss of throughput as the HMC devices are not being accessed nominally, as the FLITS per a word is set to 4. It is important to write and read simultaneously in order to achieve the correct throughput. It is not possible to read and write from the same address simultaneously and so there needs to be an offset between the read and write address. This tutorial uses an offset between the write and read addresses.

The address structure also influences the overall latency in the device. If you access a vault from a link that is not hardware linked to the vault then there will be additional latency to move from one vault to another. There are 16 vaults for this particular HMC device. All links can access all the vaults. The hardware linking looks as follows:

Link 0: vault 0, 1, 2, 3 (Link 0 is not available)

Link 1: vault 4, 5, 6, 7 (Link 1 is not available)

Link 2: vault 8, 9, 10, 11

Link 3: vault 12, 13, 14, 15

Therefore, when using Link 2 and Link 3, try to use their corresponding vaults if the goal is to minimise latency. To maximise throughput, cycle the write and read addresses through all the vaults so that there will always be data available to process. This tutorial cycles through a limited address space to demonstrate this.
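The hardware linking listed above amounts to four consecutive vaults per link. A minimal Python sketch of that table is given here; the names LINK_VAULTS, AVAILABLE_LINKS and local_link are made up for illustration and are not part of the toolflow or of casperfpga.

```python
# Vaults that are hardware linked to each HMC link, as listed above.
LINK_VAULTS = {link: list(range(4 * link, 4 * link + 4)) for link in range(4)}

# Only Link 2 and Link 3 are available through the HMC yellow block.
AVAILABLE_LINKS = (2, 3)


def local_link(vault):
    """Return the link that is hardware linked to the given vault (0-15)."""
    if not 0 <= vault <= 15:
        raise ValueError("this HMC device has vaults 0 to 15")
    return vault // 4


for link in AVAILABLE_LINKS:
    print("Link %d: vaults %s" % (link, LINK_VAULTS[link]))
```

A vault whose local_link() differs from the link being used can still be reached, but with the additional latency described above.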

The HMC yellow block is using a 4GB device with a 32 Byte Max Block Size, which is shown in Table 13 below. The HMC firmware handles request address bits 0-4 and 32-33. The HMC yellow block write and read address is 27 bits wide and is mapped to request address bits 5-31 of the 32-Byte Max Block Size address map. Bit 5 is the LSB and bit 31 is the MSB.

_images/HMC_Default_Address_Map_Mode_Table.png
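To make the address mapping concrete, the sketch below places a 27 bit yellow block address at request address bits 5-31. The function name to_request_address is hypothetical, and leaving bits 0-4 as zero is an assumption made for illustration only, since those bits are handled by the HMC firmware.

```python
ADDR_BITS = 27   # width of the yellow block write/read address
ADDR_SHIFT = 5   # request address bits 0-4 are handled by the HMC firmware


def to_request_address(block_addr):
    """Place a 27-bit yellow block address at request address bits 5-31."""
    if not 0 <= block_addr < (1 << ADDR_BITS):
        raise ValueError("address must fit in %d bits" % ADDR_BITS)
    # Assumption: bits 0-4 are shown as zero here for illustration.
    return block_addr << ADDR_SHIFT


print(hex(to_request_address(1)))                     # 0x20
print(hex(to_request_address((1 << ADDR_BITS) - 1)))  # 0xffffffe0
```

Each step of the yellow block address therefore moves 32 bytes through the memory, and 2^27 addresses x 32 bytes covers the full 4GB device.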

More information on the HMC device (Rev D) and OpenHMC controller (Rev 1.5) can be found under the following repo (in the “hmc” folder):

https://github.com/ska-sa/skarab_docs (master branch)

Create a new model

Start Matlab and open Simulink (either by typing ‘simulink’ on the Matlab command line, or by clicking on the Simulink icon in the taskbar). A template is provided for this tutorial with a pre-created HMC reordering function, SKARAB XSG core config or platform block, Xilinx System Generator block and a 40GbE yellow block. Get a copy of this template and save it. Make sure the SKARAB XSG_core_config_block or platform block is configured for:

  1. Hardware Platform: “SKARAB:xc7vx690t”
  2. User IP Clock source: “sys_clk”
  3. User IP Clock Rate (MHz): 230 (230MHz clock derived from 156.25MHz on-board clock). This clock domain is used for the Simulink design

The rest of the settings can be left as is. Click OK.

The 40GbE yellow block needs to be added as the SKARAB Board Support Package (BSP) is currently integrated in this block. If the 40GbE yellow block is not included then your Simulink design will not compile as the SKARAB BSP will be missing.

Add control and reset logic

A very important piece of logic to consider when designing your system is how, when and what happens during reset. In this example we shall control our resets via a software register. We shall have one reset to reset the HMC design counters and trigger the data capture snap blocks. We shall have one data rate select, which will control/select the throughput through the HMC and we shall have one HMC write/read enable signal, which allows the user to disable/enable the process, so that the hmc read counter, write counter and hmc out counter all represent the same instant in time. Construct reset and control circuitry as shown below.

_images/hmc_software_reg_cntrl.png

Add a software register

Use a software register yellow block from the CASPER XPS Blockset->Memory for the reg_cntrl block. Rename it to reg_cntrl. Configure the I/O direction to be From Processor. Attach two Constant blocks from the Simulink->Sources section of the Simulink Library Browser to the input of the software register and make the value 0 and 1 as shown above.

Add Goto Blocks

Add three Goto blocks from Simulink->Signal Routing. Configure them to have the tags as shown (rst_cntrl, data_rate_sel and wr_rd_en). These tags will be used by associated From (also found in Simulink->Signal Routing) blocks in other parts of the design. These help to reduce clutter in your design and are useful for control signals that are routed to many destinations. They should not be used a lot for data signals as it reduces the ease with which data flow can be seen through the system.

Add a write and read counter to generate test data for the HMC
Add Counter Blocks

Add four Counter blocks from Xilinx Blockset->Basic Elements and configure them to be unsigned, free-running, 9-bits, incrementing by 1 as shown below - the block parameters are the same for all counters. The first counter represents the write data, the second counter represents the write address, the third counter represents the read data and the fourth counter represents the read address.

Add Delay Blocks

Add the delay blocks from Xilinx Blockset->Basic Elements and configure them as shown below. The read enable is delayed by 256 clock cycles in order to prevent the read and write addresses from clashing. It also allows ample time for a write to occur before reading occurs. The HMC write enable and read enable signals are aligned in order to ensure that the HMC reading and writing happen concurrently.

Add Goto and From Blocks

Add Goto and From blocks from Simulink->Signal Routing as shown below. Configure them to have the tags as shown.

Add Gateway Out Blocks and Scopes

Add Gateway Out blocks from Xilinx Blockset->Data Types as shown below. Remember to disable the “translate into output port” check. The gateway out blocks are there because we are connecting Xilinx blocks to the Simulink scopes. You don’t need a gateway, but without one a warning will be generated when you compile the simulink block (Ctrl+D).

Add Scopes from Simulink->Sinks, in order to visually display the signals after simulation as shown below.

In simulation this circuit will generate a write & read address and data counter from 0 to 511 and the counter will wrap around after 511 as it is only 9 bits. This will allow us to generate simple test data for the HMC in order to analyse the memory writing and reading process. If the data is the same as the address then it is easier to see what is going on. Also, if the counter overflows all the time then it is easier to compare the HMC write with the HMC read and the HMC reorder process.

The dotted red-lines represent the counter enable signal path and that is generated by the data rate control function in the section below.

_images/hmc_write_read_cntrl.png

_images/hmc_write_read_counter_settings.png
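The behaviour of the address counters can be sketched in plain Python before running the Simulink simulation. This is a simplified software model, not the Simulink design itself: the function names are made up, and the divide_by argument stands in for the counter enable generated by the data rate control function in the next section (1 = every clock cycle, 2 = every second clock cycle).

```python
def simulate_counters(cycles, divide_by=1, delay=256, bits=9):
    """Return (wr_addr, rd_addr, rd_en) for every clock cycle."""
    mask = (1 << bits) - 1
    wr_addr = rd_addr = 0
    enables = []   # history of the write enable, one entry per cycle
    trace = []
    for n in range(cycles):
        wr_en = (n % divide_by == 0)
        enables.append(wr_en)
        # The read enable is the write enable delayed by `delay` cycles.
        rd_en = enables[n - delay] if n >= delay else False
        trace.append((wr_addr, rd_addr, rd_en))
        if wr_en:
            wr_addr = (wr_addr + 1) & mask   # wraps after 511
        if rd_en:
            rd_addr = (rd_addr + 1) & mask
    return trace


def address_offsets(trace, bits=9):
    """Set of (wr_addr - rd_addr) mod 2**bits seen while reads are enabled."""
    mask = (1 << bits) - 1
    return {(wr - rd) & mask for wr, rd, rd_en in trace if rd_en}


print(address_offsets(simulate_counters(2000, divide_by=1)))  # {256}
print(address_offsets(simulate_counters(2000, divide_by=2)))  # {128}
```

In this model the write address always stays a fixed distance ahead of the read address, so the two never clash. The distance is 256 addresses when the counters are enabled every cycle and 128 addresses when they are enabled every second cycle.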

Add functionality to control the write and read data rate

As mentioned earlier in this tutorial, it is impossible to perform HMC read and write at the full clock rate. This would mean writing/reading a 256 bit word at 230MHz (58.88Gbps), and the HMC firmware supports up to a maximum of 256 bit at 156.25MHz (40Gbps) data throughput. We thus want to limit the data rate, so that the HMC firmware FIFOs do not overflow. We thus add circuitry to limit the data rate as shown below. The logic that we have added below can either enable the HMC write and read counters every clock cycle or enable the HMC write and read counters every second cycle. There is a multiplexer, which is controlled via a software register that can select either rate for demonstration purposes.
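The numbers quoted above follow directly from the word width and the clock rates. The short sketch below only reproduces that arithmetic; the gbps helper is a made-up name.

```python
WORD_BITS = 256          # data width after the header and tail are removed
HMC_CLOCK_MHZ = 156.25   # rate at which the HMC firmware runs
USER_CLOCK_MHZ = 230.0   # Simulink design clock


def gbps(bits, clock_mhz, every_n_cycles=1):
    """Throughput in Gbps when one word is moved every n clock cycles."""
    return bits * clock_mhz / every_n_cycles / 1000.0


hmc_limit = gbps(WORD_BITS, HMC_CLOCK_MHZ)
for n in (1, 2):
    rate = gbps(WORD_BITS, USER_CLOCK_MHZ, n)
    verdict = "within" if rate <= hmc_limit else "exceeds"
    print("one word every %d cycle(s): %.2f Gbps, %s the %.0f Gbps limit"
          % (n, rate, verdict, hmc_limit))
```

Enabling the counters every clock cycle gives 58.88Gbps, which exceeds the 40Gbps limit, while enabling them every second cycle gives 29.44Gbps, which fits.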

Implement the function that performs the write and read data rate control as shown below:

Add a Counter Block

Add one Counter block from Xilinx Blockset->Basic Elements and configure it to be unsigned, free-running, 2-bits, incrementing by 1 as shown below. This counter is used to divide the data rate by 2 using the LSB of the counter.

Add Xilinx Constant Blocks

Add three Constant blocks from Xilinx Blockset->Basic Elements and configure each to be a 1 or a 0 as shown below. All constants must be boolean.

Add Slice Block

Add a Slice block from the Xilinx Blockset-> Basic Elements, as shown below. Configure it to select the least significant bit.

Add From Blocks

Add From blocks from Simulink->Signal Routing as shown below. Configure them to have the tags as shown below.

Add Xilinx Convert (cast) Block

Add a Xilinx Convert block from Xilinx Blockset-> Data Types. Configure it to be boolean.

Add Xilinx Bus Multiplexer (Mux) Block

Add a Xilinx Mux block from Xilinx Blockset-> Basic Elements. Configure it to have two inputs, 1 clock cycle latency and full output precision, as shown below.

Add Xilinx Logical Block

Add a Xilinx logical block from Xilinx Blockset-> Basic Elements. Configure it to have four inputs, no latency and full output precision, as shown below.

Add Gateway Out and To Workspace Block (Optional)

Add Gateway Out blocks from Xilinx Blockset->Data Types as shown below. Remember to disable the “translate into output port” check. The gateway out blocks are there because we are connecting Xilinx blocks to the Simulink workspace variables. You don’t need a gateway, but without one a warning will be generated when you compile the simulink block (Ctrl+D).

Add To Workspace blocks from Simulink->Sinks as shown below. These capture all the simulation data to Matlab workspace variables. This makes it easier to see if data is aligned than just looking at the scope display. This step is optional, but you are welcome to try it.

The dotted red lines indicate where this function interfaces with the HMC write and read control functionality in the section above.

_images/hmc_data_rate_cntrl.png

Add HMC and associated registers for error monitoring

We will now add the HMC yellow block in order to write and read to the HMC Mezzanine Card on the SKARAB.

Add the HMC yellow block for memory accessing

Add a HMC yellow block from the CASPER XPS Blockset->Memory, as shown below. It will be used to write and read data to/from the HMC memory on the Mezzanine Card. Rename it to hmc. Double click on the block to configure it and set it to be associated with Mezzanine slot 0. Make sure the simulation memory depth is set to 22 and the latency is set to 5. The randomise option should be checked, as this will ensure that the read HMC data is out of sequence, which emulates the operation of the HMC. This is explained above.

Add the Xilinx constant blocks as shown below - the tag is 9 bits, the data is 256 bits, the address is 27 bits and the rest is boolean. Add Xilinx cast blocks to write data (cast to 256 bits), write/read address (cast to 27 bits) and hmc data out (cast to 9 bits). Add the GoTo and From blocks and name them as shown below.

Link 2 is not used, so its outputs can be terminated, as shown below. Add the Terminator blocks from Simulink->Sinks.

_images/hmc_yellow_block_bp.png

Add registers to provide HMC status monitoring

Add three yellow-block software registers to provide the HMC status (2 bits), HMC receive CRC error counter (16 bits) and the HMC receive FLIT protocol error status (7 bits). Name them as shown below. The registers should be configured to send their values to the processor. Connect them to the HMC yellow block as shown below using GoTo blocks. A Convert (cast) block is required to interface with the 32 bit registers. Delay blocks are also required. To workspace blocks from Simulink->Sinks are attached to the simulation outputs of the software registers.

The HMC status is made up of the HMC initialisation done and HMC Power On Self Test (POST) OK flags. It takes a maximum of 1.2s for the HMC lanes to align and the initialisation process to complete. Once this is done then internally generated test data is written into the HMC. The data is then read out and compared with the data written in. If there is a match then POST passes and the POST OK flag is set to ‘1’. In this case, HMC initialisation done will be ‘1’ when the initialisation is successful and the POST process has finished. The POST OK flag will only be set to ‘1’ when the memory test is successful. Therefore, the user can only start writing and reading to/from the HMC when init_done and post_ok flag are both ‘1’. If any flags are ‘0’ then the HMC did not properly start up. Refer to the HMC Data Rate Control functionality above, which uses these flags to only start the write and read process when they are asserted.

The HMC receive CRC error counter will continue to increment if there are receive checksum errors encountered by the HMC firmware. This should always be 0.

The HMC receive FLIT protocol error status register is 7 bits. If any of these bits are ‘1’ then this means an error has occurred. This should always be ‘0’. To decode what an error means, refer to Table 20 on page 48 of the HMC data sheet.

_images/hmc_error_yellow_block_mon.png

Implement the HMC reordering functionality

We will now implement logic to reorder the data that is read out of sequence from the HMC. This is critical, as the data is no use to us if it is out of sequence. This is already included in the template for this tutorial, so please use this functionality as is to save time. Some details are provided here for completeness.

The logic below looks complicated, but it is not. The HMC does not read back the data in the order it was requested due to how the HMC vaults operate and the DRAM refresh cycles. This makes the HMC readback non-deterministic. The HMC reorder BRAM (512 deep) reorders all the data read back from the HMC. It synchronises the reordered read-outs by using the read out tag as the write address of the reorder BRAM. Experience shows that the maximum delay can be in the order of 256 tags after the data is requested. The function below does the following:

  1. It ensures that the HMC has written at least 256 words into the reorder BRAM before reading out of the reorder BRAM.
  2. It makes sure the read pointer does not exceed the write pointer i.e. do not read data that has not been written yet.
  3. Once the read pointer reaches count 256 then it waits until the write pointer count is at 512 and then continues to read the rest of the reorder BRAM while the write pointer starts from 0 again. This prevents the write and read pointers from clashing. This is essentially a bank swopping control mechanism.
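The idea behind the reorder BRAM can be sketched in a few lines of Python. This is a simplified software model and not the Simulink logic: it uses a single rule (the read side trails the write side by 256 words) instead of the bank swapping control described above, and the function name reorder is made up for illustration.

```python
def reorder(responses, depth=512, lag=256):
    """Reorder (tag, data) pairs using the tag as the buffer write address."""
    bram = [None] * depth
    out = []
    rd_ptr = 0
    for written, (tag, data) in enumerate(responses, start=1):
        bram[tag % depth] = data
        # Only start reading once `lag` words are in the buffer, so the
        # read pointer always trails the write side.
        if written > lag:
            out.append(bram[rd_ptr])
            rd_ptr = (rd_ptr + 1) % depth
    # Drain what is still held in the buffer once the input stops.
    while len(out) < len(responses):
        out.append(bram[rd_ptr])
        rd_ptr = (rd_ptr + 1) % depth
    return out


# Words 11 and 12 come back swapped, similar to a real HMC read-out.
order = list(range(1024))
order[11], order[12] = order[12], order[11]
responses = [(n % 512, n) for n in order]
print(reorder(responses) == list(range(1024)))  # True
```

Because the tag wraps at 512 and the read pointer stays 256 words behind, a word is always read out before its slot is reused, provided a response is never delayed by more than the lag.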

_images/hmc_reorder_bram.png

_images/hmc_reorder_logic.png

Buffers to capture HMC write, HMC read and HMC reordered read data

The HMC write data (input), HMC read data (output) and HMC reordered data need to be connected to bitfield snapshot blocks for data capture analysis (located in CASPER DSP Blockset->Scopes), as shown below. These blocks (hmc_in_snap, hmc_out_snap and hmc_reorder_snap) are identical internally. Using these blocks, we can capture data as it is written and compare it to the data we have read and finally to the data that has been reordered.

Bitfield snapshot blocks are a standard way of capturing snapshots of data in the CASPER tool-set. A bitfield snap block contains a single shared BRAM allowing capture of 128-bit words.

_images/hmc_snap_blocks_dc.png

The ctrl register in a snap block allows control of the capture. The least significant bit enables the capture. Writing a rising edge to this bit primes the snap block for capture. The second least significant bit allows the choice of a trigger source. The trigger can come from an external source or be internal and immediate. The third least significant bit allows you to choose the source of the valid signal associated with the data. This may also be supplied externally or be immediately enabled.

The basic principle of the snap block is that it is primed by the user and then waits for a trigger at which point it captures a block of data and then waits to be primed again. Once primed the addr output register returns an address of 0 and will increment as data is written into the BRAMs. Upon completion the addr register will contain the final address. Reading this value will show that the capture has completed and the results may be extracted from the shared BRAMs.
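As an illustration of the ctrl register layout described above, the sketch below packs the three bits into a word and builds the two values that create a rising edge on the enable bit. The helper names are hypothetical, and which bit value selects the external or the internal source is not stated here, so the two source arguments are left as plain select bits.

```python
ENABLE_BIT = 0        # a rising edge on this bit primes the snap block
TRIGGER_SRC_BIT = 1   # chooses the trigger source
VALID_SRC_BIT = 2     # chooses the source of the valid signal


def ctrl_word(enable, trigger_src, valid_src):
    """Pack the three control bits into a single ctrl register value."""
    return ((int(bool(enable)) << ENABLE_BIT)
            | (int(bool(trigger_src)) << TRIGGER_SRC_BIT)
            | (int(bool(valid_src)) << VALID_SRC_BIT))


def arm_sequence(trigger_src, valid_src):
    """Two ctrl values that put a rising edge on the enable bit."""
    return [ctrl_word(0, trigger_src, valid_src),
            ctrl_word(1, trigger_src, valid_src)]


print(arm_sequence(0, 0))  # [0, 1]
print(arm_sequence(1, 1))  # [6, 7]
```

Writing the two values one after the other keeps the source selection constant while the enable bit goes from 0 to 1.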

In the case of this tutorial, the arming and triggering is done via software. The trigger is the rst signal. The “we” signal on the snapshot blocks is the data valid signal. Configure and connect the snap blocks as shown above. The Convert (cast) blocks should all be to 9 bits. The delays should be as shown above, as this aligns the data correctly. The following settings should be used for the bitfield snapshot blocks: storage medium should be BRAM, number of samples (“2^?”) should be 13, Data width 64, all boxes unchecked except “use DSP48s to implement counters”, Pre-snapshot delay should be 0.

HMC status registers

We shall now look at some registers to monitor the progress of our HMC writing and reading. We shall be able to check how many HMC write and read requests were issued and compare it to actual data read out of the memory via registers. We shall be able to check if the HMC is handling the throughput for the writing and reading via registers.

Write status registers
  • hmc_wr_cnt is attached to a counter that increments when the HMC write enable signal is asserted ‘1’. It keeps a count of the number of write requests.
  • hmc_empty_wr_cnt is attached to a counter that will increment only when the HMC write enable signal and HMC write ready signal are both asserted ‘1’. This is optional.
  • hmc_wr_err is a register that allows us to check if the HMC is meeting the write throughput by incrementing a counter every time the write enable signal is asserted ‘1’ when the HMC write ready signal is deasserted ‘0’ i.e. the HMC yellow block is still busy reading from the FIFO and is not ready for more write data.
Read status registers
  • hmc_rd_cnt is attached to a counter that increments when the HMC read enable signal is asserted ‘1’. It keeps a count of the number of read requests.
  • hmc_empty_rd_cnt is attached to a counter that will increment only when the HMC read enable signal and HMC read ready signal are both asserted ‘1’. This is optional.
  • hmc_out_cnt is attached to a counter that will increment only when the HMC data valid signal is asserted ‘1’. It keeps a count of the number of valid read data words coming from the memory.
  • hmc_rd_err is a register that allows us to check if the HMC is meeting the read throughput by incrementing a counter every time the read enable signal is asserted ‘1’ when the HMC read ready signal is deasserted ‘0’ i.e. the HMC yellow block is still busy reading from the FIFO and is not ready for more read requests.
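The counting rules above are the same for the write and the read side, so they can be modelled with one small function. This is a software sketch of the counter conditions only; the function name count_status and the dictionary keys are made up for illustration.

```python
def count_status(samples):
    """Count requests per the rules above.

    samples is an iterable of (enable, ready) values, one per clock cycle.
    """
    requests = accepted = errors = 0
    for enable, ready in samples:
        if enable:
            requests += 1       # hmc_wr_cnt / hmc_rd_cnt
            if ready:
                accepted += 1   # hmc_empty_wr_cnt / hmc_empty_rd_cnt
            else:
                errors += 1     # hmc_wr_err / hmc_rd_err
    return {"cnt": requests, "empty_cnt": accepted, "err": errors}


# Four clock cycles: one request arrives while the HMC is not ready.
print(count_status([(1, 1), (1, 0), (0, 1), (1, 1)]))
# {'cnt': 3, 'empty_cnt': 2, 'err': 1}
```

A non-zero err value means that a request was issued while the ready signal was low, which is the condition used later in this tutorial to decide whether the HMC is keeping up.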

From tag rst_cntrl should go through an edge_detect block (rising and active high) to create a pulsed rst signal, which is used to trigger and reset the counters in the design. This is located in CASPER DSP Blockset -> Misc.

It should look as follows when you have added all the relevant registers:

_images/hmc_gen_debug_status_monitoring.png

You should now have a complete Simulink design. Compare it with the complete hmc tutorial *.slx model provided to you before continuing if unsure.

Compilation

It is now time to compile your design into an FPGA bitstream. This is explained below, but you can also refer to the Jasper How To document for compiling your toolflow design. This can be found in the ReadtheDocs mlib_devel documentation link:

https://casper-toolflow.readthedocs.io

In order to compile this to an FPGA bitstream, execute the following command in the MATLAB Command Line window:

jasper

This will run the process to generate the FPGA bitstream and output Vivado compile messages to the MATLAB Command Line window along the way. During the compilation and build process Vivado’s system generator will be run, and the windows below should pop up with the name of your slx file in the window instead of tut_1. The same applies below in the output file path - tut_1 will be replaced with the name of your slx file. In my case it is “tut_hmc”.

_images/Jasper_sysgen_SKARAB1.png

Execution of this command will result in an output .bof and .fpg file in the ‘outputs’ folder in the working directory of your Simulink model. Note: Compile time is approximately 45-50 minutes, so a pre-compiled binary (.fpg file) is made available to save time.

_images/Tut1_outputs_dir_files1.png

Programming the FPGA

Reconfiguration of the SKARAB’s SDRAM is done via the casperfpga python library. The casperfpga package for python, created by the SA-SKA group, wraps the Telnet commands in python and is probably the most commonly used in the CASPER community. We will focus on programming and interacting with the FPGA using this method.

Getting the required packages

These are pre-installed on the server in the workshop and you do not need to do any further configuration, but if you are not working from the lab then refer to the How To Setup CasperFpga Python Packages document for setting up the python libraries for casperfpga. This can be found in the “casperfpga” repo wiki (to be deprecated) located in GitHub and the ReadtheDocs casperfpga documentation link:

https://github.com/ska-sa/casperfpga/wiki

https://casper-toolflow.readthedocs.io

Copy your .fpg file to your NFS server

As per the previous figure, navigate to the outputs folder and (secure)copy this across to a test folder on the workshop server.

scp path_to/your/model_folder/your_model_name/outputs/your_fpgfile.fpg user@server:/path/to/test/folder/
Connecting to the board

SSH into the server that the SKARAB is connected to and navigate to the folder in which your .fpg file is stored.

Start interactive python by running:

ipython

Now import the fpga control library. This will automatically pull-in the KATCP library and any other required communications libraries.

 import casperfpga

To connect to the SKARAB we create an instance of the SKARAB board; let’s call it fpga. The CasperFpga initialiser requires just one argument: the hostname or IP address of the SKARAB board.

fpga = casperfpga.CasperFpga('skarab_name or ip_address')

The first thing we do is configure the FPGA.

fpga.upload_to_ram_and_program('your_fpgfile.fpg')

All the available/configured registers can be displayed using:

fpga.listdev()

The FPGA is now configured with your design. The registers can now be read back. For example, the HMC status register can be read back from the FPGA by using:

fpga.read_uint('hmc_status')  # or: fpga.registers.hmc_status.read_uint()

The value returned should be 3, which indicates that the HMC has successfully completed initialisation and POST OK passes.
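Since the HMC can take up to 1.2s to initialise, a script may want to wait for the status value 3 before starting any writes or reads. The sketch below is one way to do that; wait_for_hmc is a hypothetical helper that is not part of casperfpga, and the timeout and polling interval are arbitrary choices. It takes any function that returns the status value, so it can be tried without a board.

```python
import time

HMC_READY = 3   # both status flags set


def wait_for_hmc(read_status, timeout=2.0, interval=0.1):
    """Poll read_status() until it returns 3 or the timeout expires."""
    deadline = time.time() + timeout
    while True:
        status = read_status()
        if status == HMC_READY:
            return status
        if time.time() >= deadline:
            raise RuntimeError("HMC not ready, hmc_status = %d" % status)
        time.sleep(interval)


# With a connected board this would be called as:
#   wait_for_hmc(lambda: fpga.read_uint('hmc_status'))
```

If the function raises an error then one of the flags stayed at ‘0’ and the HMC did not start up properly.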

If you need to write to the reg_cntrl register then do the following:

 # data_rate_sel: False selects 29.44Gbps, True selects 58.88Gbps
 fpga.registers.reg_cntrl.write(data_rate_sel=False)
 
 # 'pulse' creates a pulse on the rst signal
 fpga.registers.reg_cntrl.write(rst='pulse')
 
 # wr_rd_en: False disables the HMC write/read, True enables it
 fpga.registers.reg_cntrl.write(wr_rd_en=True)

Typing these commands by hand is cumbersome, so it is better to create a Python script that will do all of this for you. This is described below.

Running a Python script and interacting with the FPGA

A pre-written python script, tut_hmc.py, is provided. The code within the python script is well commented and won’t be explained here. The user can read through the script in their own time. In summary, this script programs the FPGA with your compiled design (.fpg file), writes to the control registers, initiates the HMC write & read process, and reads back the HMC snapshot captured data and status registers while displaying them on the screen for analysis. In order to run this script you will need to edit the file and change the target SKARAB IP address and the *.fpg file, if they are different. The script is run using:

python tut_hmc.py

If everything goes as expected, you should see a whole bunch of text on your screen - this is the output of the snap block and status register contents.

Analysing the Display Data

You should see something like this:

 user@server:~$ python tut_hmc.py
 connecting to SKARAB...
 done
 programming the SKARAB...
 done
 arming snapshot blocks...
 done
 triggering the snapshots and reset the counters...
 done
 enabling the HMC write and read process...
 done
 reading the snapshots...
 done
 disabling the HMC write and read process...
 done
 reading back the status registers...
 hmc rd cnt: 55527004
 hmc wr cnt: 55527004
 hmc out cnt: 55527004
 hmc wr err: 0
 hmc rd err: 0
 hmc status: 3
 rx crc err cnt: 0
 hmc error status: 0
 done
 Displaying the snapshot block data...
 HMC SNAPSHOT CAPTURED INPUT
 -----------------------------
 Num wr_en wr_addr wr_data wr_rdy rd_en rd_addr rd_tag rd_rdy
 [0] 1 1 1 1 0 0 0 1
 [1] 1 2 2 1 0 0 0 1
 [2] 1 3 3 1 0 0 0 1
 [3] 1 4 4 1 0 0 0 1
 [4] 1 5 5 1 0 0 0 1
 [5] 1 6 6 1 0 0 0 1
 [6] 1 7 7 1 0 0 0 1
 [7] 1 8 8 1 0 0 0 1
 [8] 1 9 9 1 0 0 0 1
 [9] 1 10 10 1 0 0 0 1
 [10] 1 11 11 1 0 0 0 1
 ....
 [589] 1 78 78 1 1 462 462 1
 [590] 1 79 79 1 1 463 463 1
 [591] 1 80 80 1 1 464 464 1
 [592] 1 81 81 1 1 465 465 1
 [593] 1 82 82 1 1 466 466 1
 [594] 1 83 83 1 1 467 467 1
 [595] 1 84 84 1 1 468 468 1
 [596] 1 85 85 1 1 469 469 1
 [597] 1 86 86 1 1 470 470 1
 [598] 1 87 87 1 1 471 471 1
 [599] 1 88 88 1 1 472 472 1
 HMC SNAPSHOT CAPTURED OUTPUT
 -----------------------------
 Num hmc_read_tag_out hmc_data_out
 [1] 1 1
 [2] 2 2
 [3] 3 3
 [4] 4 4
 [5] 5 5
 [6] 6 6
 [7] 7 7
 [8] 8 8
 [9] 9 9
 [10] 10 10
 [11] 12 12
 [12] 11 11
 [13] 13 13
 ....
 [588] 75 75
 [589] 77 77
 [590] 78 78
 [591] 79 79
 [592] 80 80
 [593] 81 81
 [594] 82 82
 [595] 83 83
 [596] 84 84
 [597] 85 85
 [598] 86 86
 [599] 87 87
 HMC REORDER SNAPSHOT CAPTURED OUTPUT
 -------------------------------------
 Num rd_en rd_addr data_out
 [1] 1 1 1
 [2] 1 2 2
 [3] 1 3 3
 [4] 1 4 4
 [5] 1 5 5
 [6] 1 6 6
 [7] 1 7 7
 [8] 1 8 8
 [9] 1 9 9
 [10] 1 10 10
 [11] 1 11 11
 [12] 1 12 12
 [13] 1 13 13
 [14] 1 14 14
 [15] 1 15 15
 ....
 [588] 1 76 76
 [589] 1 77 77
 [590] 1 78 78
 [591] 1 79 79
 [592] 1 80 80
 [593] 1 81 81
 [594] 1 82 82
 [595] 1 83 83
 [596] 1 84 84
 [597] 1 85 85
 [598] 1 86 86
 [599] 1 87 87

The above results show that the HMC is meeting the 29.44Gbps throughput, as the HMC write error register (hmc_wr_err) and HMC read error register (hmc_rd_err) are both 0, which means the HMC is always ready for data when the HMC write/read request occurs. Note that the HMC read count (hmc_rd_cnt), HMC write count (hmc_wr_cnt) and HMC read out count (hmc_out_cnt) are all equal, which is expected. Compare the HMC snapshot output data and the HMC reorder snapshot captured output data - notice how the HMC snapshot output data is out of sequence in places and the HMC snapshot reorder data is in sequence again. There is no missing data. This is how the HMC should work.
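The pass criteria described above can also be checked in software rather than by eye. The sketch below is not taken from tut_hmc.py. The helper names are made up, and it assumes an fpga object with the read_uint call used earlier in this tutorial.

```python
COUNTERS = ("hmc_rd_cnt", "hmc_wr_cnt", "hmc_out_cnt")
ERRORS = ("hmc_wr_err", "hmc_rd_err")


def read_registers(fpga, names=COUNTERS + ERRORS):
    """Read the named status registers into a dictionary."""
    return {name: fpga.read_uint(name) for name in names}


def throughput_met(regs):
    """True if all counts are equal and no write/read errors were counted."""
    counts_match = len({regs[name] for name in COUNTERS}) == 1
    no_errors = all(regs[name] == 0 for name in ERRORS)
    return counts_match and no_errors


# Values taken from the example output above.
example = {"hmc_rd_cnt": 55527004, "hmc_wr_cnt": 55527004,
           "hmc_out_cnt": 55527004, "hmc_wr_err": 0, "hmc_rd_err": 0}
print(throughput_met(example))  # True
```

With a connected board, throughput_met(read_registers(fpga)) would apply the same check to the live register values.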

Edit the tut_hmc.py script again and change the data rate to 58.88Gbps. Rerun as above and this time notice the difference in the above registers and snapshot data. What do you see? You should see that the HMC read count, HMC write count and HMC read out count values do not match. The HMC write error register and HMC read error register should be non-zero, which indicates that write and read requests are being asserted when the HMC write and read ready signals are not asserted, which means the FIFO is not being cleared fast enough. The HMC read output data will still be out of sequence, but data will be lost. This can be clearly seen in the HMC reorder snapshot captured output.

Other notes

• iPython includes tab-completion. In case you forget which function to use, try typing library_name.<tab>

• There is also onboard help. Try typing library_name.function?

• Many libraries have onboard documentation stored in the string library_name.__doc__

• KATCP in Python supports asynchronous communications. This means you can send a command, register a callback for the response and continue to issue new commands even before you receive the response. You can achieve much greater speeds this way. The Python wrapper in the corr package does not currently make use of this and all calls are blocking. Feel free to use the lower layers to implement this yourself if you need it!

Conclusion

This concludes the HMC Interface Tutorial for SKARAB. You have learned how to utilize the HMC interface on a SKARAB to write and read data to/from the HMC Mezzanine Card. You also learned how to further use Python to program the FPGA and control it remotely using casperfpga.

ISE

ROACH1/2

  1. Introduction Tutorial
  2. 10GbE Tutorial
  3. Spectrometer Tutorial
  4. Correlator Tutorial

Tutorial 2: 10GbE Interface

Introduction

In this tutorial, you will create a simple Simulink design which uses the ROACH2’s (or ROACH1’s) 10GbE ports to send data at high speeds to another port. This could just as easily be another FPGA board or a computer with a 10GbE network interface card. In addition, we will learn to control the design remotely, using a supplied Python library for KATCP.

In this tutorial, a counter will be transmitted through one SFP+ port and back into another. This will allow a test of the communications link. This test can be used to test the link between boards and the effect of different cable lengths on communication robustness.

Background

ROACH2 boards have four SFP+ ports on a single 10GbE Mezzanine Card. The Ethernet interface is driven by an on-board 156.25MHz crystal oscillator. This clock is then multiplied up on the FPGA by a factor of 66. Thus, the speed on the wire is actually 156.25MHz x 66 = 10.3125 Gbps. However, 10GbE over single-lane SFP+ connectors uses 64b/66b encoding, which means that for every 64 bits sent, 66 bits are actually transmitted. This is to ensure proper clocking, since the receiver recovers and locks on to the transmitter’s clock and requires edges in the data. Imagine transmitting a string of 0xFF or 0b11111111…, which would otherwise generate a DC level on the line; now an extra two bits are introduced, which include a zero bit that the receiver can use to recover the clock and byte endings. See here for more information.
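
The line-rate arithmetic can be checked in a few lines of Python. This is just a worked calculation using the clock frequency and multiplier quoted above; the variable names are our own.

```python
REF_CLOCK_HZ = 156.25e6   # on-board crystal oscillator
MULTIPLIER = 66           # clock multiplication on the FPGA

# Raw rate on the wire, including the 64b/66b encoding bits.
line_rate_bps = REF_CLOCK_HZ * MULTIPLIER

# 64b/66b carries 64 data bits in every 66 transmitted bits.
usable_rate_bps = line_rate_bps * 64 / 66

print("line rate:   %.4f Gbps" % (line_rate_bps / 1e9))
print("usable rate: %.4f Gbps" % (usable_rate_bps / 1e9))
```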

For this reason, we actually get 10Gbps usable data rate. CASPER’s 10GbE Simulink core sends and receives UDP over IPv4 packets. These IP packets are wrapped in Ethernet frames. Each Ethernet frame requires a 38 byte header, IPv4 requires another 20 bytes and UDP a further 16. So, for each packet of data you send, you will incur a cost of at least 74 bytes. I say at least, because the core will zero-pad some headers to be on a 64-bit boundary. You will thus never achieve 10Gbps of usable throughput, though you can get close. It pays to send larger packets if you are trying to get higher throughputs.
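
To get a feel for why larger packets pay off, here is a rough sketch that takes the per-packet overhead figure quoted above and computes the fraction of each packet that is payload. It ignores the zero-padding mentioned above, so the real efficiency will be slightly lower. The function name and the payload sizes are only illustrative.

```python
OVERHEAD_BYTES = 74  # per-packet overhead figure quoted in the text


def payload_efficiency(payload_bytes, overhead_bytes=OVERHEAD_BYTES):
    """Fraction of the bytes sent for one packet that is user payload."""
    return payload_bytes / float(payload_bytes + overhead_bytes)


for size in (64, 1024, 8192):
    print("%5d byte payload: %.1f%% of the usable link rate"
          % (size, 100 * payload_efficiency(size)))
```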

The maximum payload length of the CASPER 10GbE core is 8192 bytes (implemented in BRAM) plus another 512 (implemented in distributed RAM) which is useful for an application header. These ports (and hence part of the 10 GbE cores) run at 156.25MHz, while the interface to your design runs at the FPGA clock rate (sys_clk, adcX_clk etc). The interface is asynchronous, and buffers are required at the clock boundary. For this reason, even if you send data between two ROACH boards which are running off the same hard-wired clock, there will be jitter in the data. A second consideration is how often you clock values into the core when you try to send data. If your FPGA is running faster than the core, and you try and clock data in on every clock cycle, the buffers will eventually overflow. Likewise for receiving, if you send too much data to a board and cannot clock it out of the receive buffer fast enough, the receive buffers will overflow and you will lose data. In our design, we are clocking the FPGA at 100 MHz, with the cores running at 156.25MHz. We can thus clock data into the TX buffer continuously without worrying about overflows.

Create a new model

Start Matlab and open Simulink (either by typing ‘simulink’ on the Matlab command line, or by clicking on the Simulink icon in the taskbar). A template is provided for Tut2 with a pre-created packet generator in the tutorials_devel git repository. Get a copy of this template and save it. Make sure the XSG_config_block is configured for ROACH2 (or ROACH1, if that’s what you’re using). Specify a clock frequency of 100 MHz and the clock source “sys_clock”.

Add reset logic

A very important piece of logic to consider when designing your system is how, when and what happens during reset. In this example we shall control our resets via a software register. We shall have two independent resets, one for the 10GbE cores which shall be used initially, and one to reset the user logic which may be used more often to restart the user part of the system. Construct reset circuitry as shown below.

_images/tut2_rst.png

Add a software register

Use a software register yellow block from the CASPER XPS System Blockset for the rst block. Rename it to rst.

It used to be that every register you inserted had to be natively 32 bits, and you were responsible for slicing these 32 bits into different signals if you wanted to control multiple flags. The latest block can implicitly break the 32-bit register out into separately named signals, so we’ll use that. The downside is that there are a bunch of settings to configure: you need to set up the names and data types of your register subfields. NOTE: due to missing MATLAB licenses on the 2017 conference servers, you can’t do this for ROACH2. For ROACH2, take the default block, configure the I/O direction to be From Processor, and then manually slice the bottom bit of the output to make the cnt_rst signal and bit 1 to make the core_rst signal. You can check the example ROACH2 design to see how to do this.
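
On the software side, the manual slicing described above corresponds to simple bit packing. The sketch below assumes the layout just described (cnt_rst in bit 0, core_rst in bit 1); the helper names are our own and are not part of any CASPER library.

```python
CNT_RST_BIT = 0   # bottom bit of the rst register
CORE_RST_BIT = 1  # bit 1 of the rst register


def pack_rst(cnt_rst=False, core_rst=False):
    """Build the 32-bit value to write to the rst register."""
    return (int(bool(cnt_rst)) << CNT_RST_BIT) | (int(bool(core_rst)) << CORE_RST_BIT)


def slice_bit(word, bit):
    """Software equivalent of a one-bit Slice block."""
    return (word >> bit) & 1


value = pack_rst(core_rst=True)
print(value, slice_bit(value, CNT_RST_BIT), slice_bit(value, CORE_RST_BIT))
```

A value built this way could then be written to the rst register with write_int.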

Add Goto blocks

Add two Goto blocks from Simulink->Signal Routing. Configure them to have the tags as shown (core_rst and cnt_rst). These tags will be used by associated From (also found in Simulink->Signal Routing) blocks in other parts of the design. These help to reduce clutter in your design and are useful for control signals that are routed to many destinations. They should not be used a lot for data signals as it reduces the ease with which data flow can be seen through the system.

Add 10GbE and associated registers for data transmission

We will now add the 10GbE block to transmit a counter at a programmable rate.

Add a 10GbE block for data transmission

Add a ten_GbE yellow block from the CASPER XPS System Blockset. It will be used to transmit data and we shall add another later to receive data. Rename it gbe0. Double click on the block to configure it and set it to be associated with SFP+ port 0. If your application can guarantee that it will be able to use received data straight away (as our application can), shallow receive buffers can be used to save resources. This optimisation is not necessary in this case as we will use a small fraction of resources in the FPGA.

_images/Gbe0Blockk.jpg

Add registers to provide the target IP address and port number

Add two yellow-block software registers to provide the destination IP address and port number for transmission with the data. Name one dest_ip and the other dest_port. The registers should be configured to receive their values from the processor. Connect them to the appropriate inputs of the gbe0 10GbE block as shown. A Slice block is required to use the lower 16 bits of data from the dest_port register. Constant blocks from Simulink->Sources with 0 values are attached to the simulation inputs of the software registers. The destination port and IP address are not important in this system as it is a loopback example. Add a From block from Simulink->Signal Routing and set the tag to use core_rst, this enables one to reset the block.

_images/10ge.jpg
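
The dest_ip register holds the IPv4 address as a single 32-bit integer, so a dotted-quad address has to be converted before it is written. A minimal sketch using only the Python standard library follows; the address used is an arbitrary example, not one required by the tutorial.

```python
import socket
import struct


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to a 32-bit integer."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]


def int_to_ip(value):
    """Convert a 32-bit integer back to a dotted-quad IPv4 string."""
    return socket.inet_ntoa(struct.pack("!I", value))


dest_ip = ip_to_int("10.0.0.1")   # example address only
dest_port = 10000 & 0xFFFF        # only the lower 16 bits are used
print(hex(dest_ip), int_to_ip(dest_ip), dest_port)
```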

Create a subsystem to generate a counter to transmit as data

We will now implement logic to generate a counter to transmit as data. This is already included in the Template for Tut 2. Some details are provided here for completeness.

Construct a subsystem for data generation logic

It is often useful to group related functionality and hide the details. This reduces drawing space and complexity of the logic on the screen, making it easier to understand what is happening. Simulink allows the creation of Subsystems to accomplish this.

These can be copied to places where the same functionality is required or even placed in a library for use in other projects and by other people. To create a subsystem, one can highlight the logical elements to be encapsulated, then right-click and choose Create Subsystem from the list of options. You can also simply add a Subsystem block from Simulink->Ports & Subsystems.

Subsystems inherit variables from their parent system. Simulink allows one to create a variable whose scope is only a particular subsystem. To do this, right-click on a subsystem and choose the Create Mask option. The mask created for that particular subsystem allows one to add parameters that appear when you double-click on the icon associated with the subsystem.

The mask also allows you to associate an initialisation script with a particular subsystem. This script is called every time a mask parameter is modified and the Apply button clicked. It is especially useful if the internal structure of a subsystem must change based on mask parameters. Most of the interesting blocks in the CASPER library use these initialisation scripts.

Drop a subsystem block into your design and rename it pkt_sim. Then double-click on it to add logic.

Add a counter to generate a certain amount of data

Add a Counter block from Xilinx Blockset->Basic Elements and configure it to be unsigned, free-running, 32-bits, incrementing by 1 as shown. Add a Relational block, software register and Constant block as shown. In simulation this circuit will generate a counter from 0 to 49 and then stop counting. This will allow us to generate 50 data elements before stopping.

_images/Payload_length.png _images/CounterBlog.jpg

Add a counter to limit the data rate

As mentioned earlier in this tutorial, it is impossible to supply data to the 10GbE transmission block at the full clock rate. This would mean transmitting a 64-bit word at 200MHz, and the 10GbE standard only supports up to 156.25MHz data transmission. We thus want to generate data in bursts such that the transmission FIFOs do not overflow. We thus add circuitry to limit the data rate as shown below. The logic that we have added on the left generates a reset at a fixed period determined by the software register. This will trigger the generation of a new packet of data as before. In simulation this allows us to limit the data rate to 50/200 * 200MHz = 50MHz. Using these values in actual hardware would limit the data rate to (50/(8/10*156.25)) = 4Gbps.

_images/counter_jbo.png

Finalise logic including counter to be used as data

We will now finalise the data generation logic as shown below. To save time, use the existing logic provided with the tutorial. Counter1 in the illustration generates the actual data to be transmitted and the enable register allows this data stream to the transmitting 10GbE core to be turned off and on. Logic linked to the eof output port provides an indication to the 10GbE core that the final data word for the frame is being sent. This will trigger the core to begin transmission of the frame of data using the IP address and port number specified.

_images/full_logic_jbo.png
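
If it helps to reason about the packet generator outside Simulink, its behaviour can be approximated in Python. This is a behavioural sketch only, assuming a period counter that restarts the payload counter and an eof that marks the last word of each burst; the exact cycle timing of the real logic may differ.

```python
def simulate_pkt_sim(payload_len, period, n_cycles):
    """Return a list of (valid, eof) tuples, one per clock cycle."""
    out = []
    for cycle in range(n_cycles):
        count = cycle % period           # the period counter restarts each burst
        valid = count < payload_len      # payload counter runs 0..payload_len-1, then stops
        eof = count == payload_len - 1   # last data word of the frame
        out.append((valid, eof))
    return out


sim = simulate_pkt_sim(payload_len=50, period=200, n_cycles=400)
n_valid = sum(1 for valid, _ in sim if valid)
n_frames = sum(1 for _, eof in sim if eof)
print("duty cycle: %.2f, frames: %d" % (n_valid / float(len(sim)), n_frames))
```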

Receive blocks and logic

The receive logic is composed of another 10GbE yellow block with the transmission interface inputs all tied to 0, as no transmission is to be done but Simulink requires all inputs to be connected. Connecting them to 0 should ensure that during synthesis the transmission logic for this 10GbE block is removed. Double click on the block to configure it and set it to be associated with SFP+ port 1.

Buffers to capture received and transmitted data

The casperfpga Python package contains all kinds of methods to interact with your 10GbE cores. For example, grabbing packets from the TX and RX stream, or counting the number of packets sent and received are all supported, as long as you turn on the appropriate functionality in the 10GbE yellow block. The settings we’ll use are:

_images/gbe_core_0_params.png

_images/gbe_core_0_debug_params.png

You can see how to use these functions in the software that accompanies this tutorial.

LEDs and status registers

You can also sprinkle around other registers or LEDs to monitor status of core parameters, or give visual feedback that the design is doing something sane. Check out the reference model for some examples of potentially useful monitoring circuitry.

Compilation

Compiling this design takes approximately 20 to 30 minutes. A pre-compiled binary (.fpg file) is made available to save time.

Programming and interacting with the FPGA

A pre-written python script, roach2_tut_tge.py, is provided. This script programs the FPGA with your compiled design (.fpg file), configures the 10GbE ports and initiates data transfer. The script is run using:

 ./roach2_tut_tge.py <ROACH_IP_ADDRESS>

If everything goes as expected, you should see a whole bunch of lines running across your screen as the code sets up the IP/MAC parameters of the 10GbE cores and checks their status, and that the data the cores are sending and receiving are consistent. Have a look at this code to see how one uses the more advanced (i.e. more complex than read_int and write_int) methods casperfpga makes available. Documentation for casperfpga is still a work in progress(!) but the basic idea is that when you instantiate a CasperFpga, the software intelligently builds python objects into this instance, based on what you put in your design. For example, your Ethernet cores should show up as objects CasperFpga.gbes.<simulink_block_name> (or CasperFpga.gbes[‘simulink_block_name’]) which have useful methods like “setup”, which sets the core’s IP/MAC address, or “print_10gbe_core_details”, which will print out useful status information, like the current state of the core’s ARP cache. iPython and tab-complete are your friends here; there are lots of handy methods to discover. (I’m still discovering them now :) )

The control software should be(!) well-commented, to explain what’s going on behind the scene as the software interacts with your FPGA design.

Conclusion

This concludes Tutorial 2. You have learned how to utilize the 10GbE ports on a ROACH to send and receive UDP packets. You also learned how to further use Python to program the FPGA and control it remotely using some of the OOP goodies available in casperfpga.

Tutorial 3: Wideband Spectrometer

Introduction

A spectrometer is something that takes a signal in the time domain and converts it to the frequency domain. In digital systems, this is generally achieved by utilising the FFT (Fast Fourier Transform) algorithm. However, with a little bit more effort, the signal to noise performance can be increased greatly by using a Polyphase Filter Bank (PFB) based approach.

When designing a spectrometer for astronomical applications, it’s important to consider the science case behind it. For example, pulsar timing searches will need a spectrometer which can dump spectra on short timescales, so the rate of change of the spectra can be observed. In contrast, a deep field HI survey will accumulate multiple spectra to increase the signal to noise ratio. It’s also important to note that “bigger isn’t always better”; the higher your spectral and time resolution are, the more data your computer (and scientist on the other end) will have to deal with. For now, let’s skip the science case and familiarize ourselves with an example spectrometer.

Setup

This tutorial comes with a completed model file, a compiled bitstream, ready for execution on ROACH, as well as a Python script to configure the ROACH and make plots. Here

Spectrometer Basics

When designing a spectrometer there are a few main parameters of note:

  • Bandwidth: The width of your frequency spectrum, in Hz. This depends on the sampling rate; for complex sampled data this is equivalent to:

_images/bandwidtheq1.png

In contrast, for real or Nyquist sampled data the rate is half this:

_images/bandwidtheq2.png

as two samples are required to reconstruct a given waveform.

  • Frequency resolution: The frequency resolution of a spectrometer, Δf, is given by

_images/freq_eq.png,

and is the width of each frequency bin. Correspondingly, Δf is a measure of how precisely you can measure a frequency.

  • Time resolution: Time resolution is simply the spectral dump rate of your instrument. We generally accumulate multiple spectra to average out noise; the more accumulations we do, the lower the time resolution. For looking at short timescale events, such as pulsar bursts, higher time resolution is necessary; conversely, if we want to look at a weak HI signal, a long accumulation time is required, so time resolution is less important.
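
As a worked example of the first two parameters, the sketch below assumes that the bandwidth of real-sampled data is half the sample rate and that Δf is the bandwidth divided by the number of channels. The numbers used (800 MHz sample clock, 2048 channels) are those of the tutorial design; the function names are our own.

```python
def bandwidth_hz(sample_rate_hz, complex_sampling=False):
    """Bandwidth is the full sample rate for complex data, half for real data."""
    return sample_rate_hz if complex_sampling else sample_rate_hz / 2.0


def frequency_resolution_hz(bandwidth, n_channels):
    """Width of one frequency bin."""
    return bandwidth / float(n_channels)


bw = bandwidth_hz(800e6)                 # real-sampled ADC clocked at 800 MHz
df = frequency_resolution_hz(bw, 2048)   # 2048-channel spectrum
print("bandwidth: %.1f MHz, resolution: %.4f kHz" % (bw / 1e6, df / 1e3))
```
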

Configuration and Control

Hardware Configuration

The tutorial comes with a pre-compiled bof file, which is generated from the model you just went through (tut3.bof). Copy this over to your ROACH boffiles directory, chmod it to a+x as in the other tutorials, then load up your ROACH. You don’t need to telnet in to the ROACH; all communication and configuration will be done by the python control script called tut3.py.

Next, you need to set up your ROACH. Switch it on, making sure that:

• You have your ADC in ZDOK0, which is the one nearest to the power supply.

• You have your clock source connected to clk_i on the ADC, which is the second on the right. It should be generating an 800MHz sine wave with 0dBm power.

The tut3.py spectrometer script

Once you’ve got that done, it’s time to run the script. First, check that you’ve connected the ADC to ZDOK0, and that the clock source is connected to clk_i of the ADC. Now, if you’re in linux, browse to where the tut3.py file is in a terminal and at the prompt type

 ./tut3.py <roach IP or hostname> -b <boffile name>

replacing <roach IP or hostname> with the IP address of your ROACH and <boffile name> with your boffile. You should see a spectrum like this:

_images/Spectrometer.py_4.8.png

In the plot, there should be a fixed DC offset spike, and if you’re putting in a tone, you should also see a spike at the correct input frequency. If you’d like to take a closer look, click the icon that is below your plot and third from the right, then select a section you’d like to zoom in to. The digital gain (-g option) is set to maximum (0xffff_ffff) by default to observe the ADC noise floor. When you are feeding the ADC with a tone, reduce the gain by decreasing this value (for example, to 0x100 for a -10dBm input) so as not to saturate the spectrum.

Now you’ve seen the python script running, let’s go under the hood and have a look at how the FPGA is programmed and how data is interrogated. To stop the python script running, go back to the terminal and press ctrl + c a few times.

iPython walkthrough

The tut3.py script has quite a few lines of code, which you might find daunting at first. Fear not though, it’s all pretty easy. To whet your whistle, let’s start off by operating the spectrometer through iPython. Open up a terminal and type:

ipython

and press enter. You’ll be transported into the magical world of iPython, where we can do our scripting line by line, similar to MATLAB. Our first command will be to import the python packages we’re going to use:

import corr,time,numpy,struct,sys,logging,pylab

Next, we set a few variables:

katcp_port = 7147
roach = 'enter IP address or hostname here'
timeout = 10

Which we can then use in FpgaClient() such that we can connect to the ROACH and issue commands to the FPGA:

fpga = corr.katcp_wrapper.FpgaClient(roach,katcp_port, timeout)

We now have an fpga object to play around with. To check if you managed to connect to your ROACH, type:

fpga.is_connected()

Let’s set the bitstream running using the progdev() command:

fpga.progdev('tut3.bof')

Now we need to configure the accumulation length and gain by writing values to their registers. We use an accumulation length of 2*(2^28)/2048 spectra, or just under 2 seconds, and maximum gain:

fpga.write_int('acc_len',2*(2**28)//2048)
fpga.write_int('gain',0xffffffff)

Finally, we reset the counters:

fpga.write_int('cnt_rst',1)
fpga.write_int('cnt_rst',0)

To read out the integration number, we use fpga.read_uint():

acc_n = fpga.read_uint('acc_cnt')

Do this a few times, waiting a few seconds in between. You should be able to see this slowly rising. Now we’re ready to plot a spectrum. We want to grab the even and odd registers of our PFB:

a_0=struct.unpack('>1024l',fpga.read('even',1024*4,0))
a_1=struct.unpack('>1024l',fpga.read('odd',1024*4,0))

These need to be interleaved, so we can plot the spectrum. We can use a for loop to do this:

interleave_a=[]

for i in range(1024):
	interleave_a.append(a_0[i])
	interleave_a.append(a_1[i])

This gives us a 2048 channel spectrum. Finally, we can plot the spectrum using pyLab:

pylab.figure(num=1,figsize=(10,10))
pylab.plot(interleave_a)
pylab.title('Integration number %i.'%acc_n)
pylab.ylabel('Power (arbitrary units)')
pylab.grid()
pylab.xlabel('Channel')
pylab.xlim(0,2048)
pylab.show()
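
As an aside, the unpack-and-interleave steps above can also be done in one go with NumPy. The sketch below uses short synthetic byte strings in place of the data returned by fpga.read(), so it runs without a ROACH; the function name is our own.

```python
import struct

import numpy


def interleave(raw_even, raw_odd):
    """Interleave two big-endian int32 buffers into a single spectrum array."""
    even = numpy.frombuffer(raw_even, dtype=">i4")
    odd = numpy.frombuffer(raw_odd, dtype=">i4")
    spectrum = numpy.empty(even.size + odd.size, dtype=numpy.int64)
    spectrum[0::2] = even   # even channels
    spectrum[1::2] = odd    # odd channels
    return spectrum


# Synthetic stand-ins for fpga.read('even', ...) and fpga.read('odd', ...).
raw_even = struct.pack(">4l", 0, 2, 4, 6)
raw_odd = struct.pack(">4l", 1, 3, 5, 7)
print(interleave(raw_even, raw_odd))
```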

Voila! You have successfully controlled the ROACH spectrometer using python, and plotted a spectrum. Bravo! You should now have enough of an idea of what’s going on to tackle the python script. Type exit() to quit ipython.

tut3.py notes

Now you’re ready to have a closer look at the tut3.py script. Open it with your favorite editor. Again, line by line is the only way to fully understand it, but to give you a head start, here’s a few notes:

Connecting to the ROACH

To make a connection to the ROACH, we need to know what port to connect to, and the IP address or hostname of our ROACH. The connection is made on line 96:

fpga = corr.katcp_wrapper.FpgaClient(...)

The katcp_port variable is set on line 16, and the roach variable is passed to the script at the terminal (remember that you typed python tut3.py roachname). We can check if the connection worked by using fpga.is_connected(), which returns True or False:

if fpga.is_connected():

The next step is to get the right bitstream programmed onto the FPGA fabric. The bitstream is set on line 15:

bitstream = 'tut3.bof'

Then the progdev command is issued on line 108:

fpga.progdev(bitstream)

Passing variables to the script

Starting from line 64, you’ll see the following code:

from optparse import OptionParser

p = OptionParser()
p.set_usage('tut3.py <ROACH_HOSTNAME_or_IP> [options]')
p.set_description(__doc__)

p.add_option('-l', '--acc_len', dest='acc_len',
type='int', default=2*(2**28)/2048,
help='Set the number of vectors to accumulate between dumps. default is 2*(2^28)/2048, or just under 2 seconds.')

p.add_option('-g', '--gain', dest='gain',
type='int',default=0xffffffff,
help='Set the digital gain (6bit quantisation scalar). Default is 0xffffffff (max), good for wideband noise. Set lower for CW tones.')

p.add_option('-s', '--skip', dest='skip', action='store_true',
help='Skip reprogramming the FPGA and configuring EQ.')

opts, args = p.parse_args(sys.argv[1:])

if args==[]:
    print 'Please specify a ROACH board. Run with the -h flag to see all options.\nExiting.'
    exit()
else:
    roach = args[0]

What this code does is set up some default parameters which we can pass to the script from the command line. If the flags aren’t present, it will default to the values set here.

Conclusion

If you have followed this tutorial faithfully, you should now know:

• What a spectrometer is and what the important parameters for astronomy are.

• Which CASPER blocks you might want to use to make a spectrometer, and how to connect them up in Simulink.

• How to connect to and control a ROACH spectrometer using python scripting.

In the following tutorials, you will learn to build a correlator, and a polyphase filtering spectrometer using an FPGA in conjunction with a Graphics Processing Unit (GPU).

Tutorial 4: Wideband Pocket Correlator

Introduction

In this tutorial, you will create a simple Simulink design which uses the iADC board on ROACH and the CASPER DSP blockset to process a wideband (400MHz) signal, channelize it and output the visibilities through ROACH’s PPC.

By this stage, it is expected that you have completed tutorial 1 and tutorial 2 and are reasonably comfortable with Simulink and basic Python. We will focus here on higher-level design concepts, and will provide you with low-level detail preimplemented.

Background

Some of this design is similar to that of the previous tutorial, the Wideband Spectrometer. So completion of tutorial 3 is recommended.

Interferometry

In order to improve sensitivity and resolution, telescopes require a large collection area. Instead of using a single, large dish which is expensive to construct and complicated to maneuver, modern radio telescopes use interferometric arrays of smaller dishes (or other antennas). Interferometric arrays allow high resolution to be obtained, whilst still only requiring small individual collecting elements.

Correlation

Interferometric arrays require the relative phases of antennas’ signals to be measured. These can then be used to construct an image of the sky. This process is called correlation and involves multiplying signals from all possible antenna pairings in an array. For example, if we have 3 antennas, A, B and C, we need to perform correlation across each pair, AB, AC and BC. We also need to do auto-correlations, which will give us the power in each signal. ie AA, BB, CC. We will see this implemented later. The complexity of this calculation scales with the number of antennas squared. Furthermore, it is a difficult signal routing problem since every antenna must be able to exchange data with every other antenna.
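
The set of products a correlator has to compute can be enumerated with a few lines of Python. This is just an illustration of the pairing described above, using the standard library; the function name is our own.

```python
from itertools import combinations_with_replacement


def baselines(antennas):
    """List every auto- and cross-correlation pair for the given antennas."""
    return ["".join(pair) for pair in combinations_with_replacement(antennas, 2)]


pairs = baselines("ABC")
print(pairs)

# The number of products grows as N(N+1)/2, i.e. with the square of N.
for n in (3, 4, 8, 16):
    print(n, n * (n + 1) // 2)
```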

Polarization

Dish type receivers are typically dual polarized (horizontal and vertical feeds). Each polarization is fed into separate ADC inputs. When correlating these antennae, we differentiate between full Stokes correlation and a half Stokes method. A full Stokes correlator does cross correlation between the different polarizations (i.e. for a given two antennas, A and B, it multiplies the horizontal feed from A with the vertical feed from B and vice-versa). A half Stokes correlator only correlates like polarizations with each other, thereby halving the compute requirements.

The Correlator

The correlator we will be designing is a 4 input correlator as shown below. It uses 2 inputs from each of two ADCs. It can be thought of as a 2-input full Stokes correlator or as a four input single polarization correlator.

_images/Roach_with_iadcs_on_bench.jpg

Setup

The lab at the workshop is preconfigured with the CASPER libraries, Matlab and Xilinx tools. Start Matlab.

Creating Your Design

Create a new model

Having started Matlab, open Simulink (either by typing simulink on the Matlab command line, or by clicking the Simulink icon in the taskbar). Create a new model and add the Xilinx System Generator and XSG core config blocks as before in Tutorial 1.

System Generator and XSG Blocks

_images/sysgen_xsg.png

By now you should have used these blocks a number of times. Pull the System Generator block into your design from the Xilinx Blockset menu under Basic Elements. The settings can be left on default.

The XSG block can be found under the CASPER XPS System Blockset. Set the Hardware platform to ROACH:sx95t, the Clock Source to adc0_clk and the rest of the configuration as the default.

Make sure you have an ADC plugged into ZDOK0 to supply the FPGA’s clock!

Sync Generator

_images/t4_sync_gen.png

The Sync Generator puts out a sync pulse which is used to synchronize the blocks in the design. See the CASPER memo on sync pulse generation for a detailed explanation.

This sync generator is able to synchronize with an external trigger input. Typically we connect this to a GPS’s 1pps output to allow the system to reset on a second boundary after a software arm and thus know precisely the time at which an accumulation was started. To do this you can input the 1pps signal into either ADC’s sync input. The sync pulse allows data to be tagged with a precise timestamp. It also allows multiple ROACH boards to be synchronized, which is useful if large numbers of antenna signals are being correlated.

ADCs

_images/t4_adcs_jbo.png

Connection of the ADCs is as in tutorial 3 except for the sync outputs. Here we OR all four outputs together to produce a sync pulse sampled at one quarter the rate of the ADC’s sample clock. This is the simplest way of dealing with the four sync samples supplied to the FPGA on every clock cycle, but means that our system can only be synchronized to within 3 ADC sample clocks.
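
A toy model makes the loss of timing information explicit. It assumes, as described above, that four sync samples arrive on each FPGA clock and are OR-ed into one bit; the function name is our own.

```python
def fpga_sync(sync_samples):
    """OR together the four sync samples delivered in one FPGA clock cycle."""
    if len(sync_samples) != 4:
        raise ValueError("expected four sync samples per FPGA clock")
    return any(sync_samples)


# A pulse arriving on any of the four sample positions gives the same output,
# so the position within the clock cycle (0 to 3 ADC samples) is lost.
for offset in range(4):
    samples = [i == offset for i in range(4)]
    print(offset, fpga_sync(samples))
```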

Logic is also provided to generate a sync manually via a software input. This allows the design to be used even in the absence of a 1 pps signal. However, in this case, the time the sync pulse occurs depends on the latency of software issuing the sync command and the FPGA signal triggering. This introduces some uncertainty in the timestamps associated with the correlator outputs. We will not use the 1pps in this tutorial although the design has the facility to do this hardware sync’ing.

Set up the ADCs as follows and change the second ADC board’s mask parameter to adc1…

_images/t4_adc_set.png

Throughout this design, we use CASPER’s bus_create and bus_expand blocks to simplify routing and make the design easier to follow.

_images/t4_concat_block.png

For the purposes of simulation, it can be useful to put simulation input signals into the ADCs. These blocks are pulse generators in the case of sync inputs and any analogue source for the RF inputs (noise, CW tones etc).

_images/t4_sin_wave_set.png _images/t4_noise_set.png

Control Register

_images/t4_ctrl_reg_jbo.png

This part of the Simulink design sets up a software register which can be configured in software to control the correlator. Set the yellow software register’s IO direction as from processor. You can find it in the CASPER_XPS System blockset. The constant block input to this register is used only for simulation.

The output of the software register goes to three slice blocks, which will pull out the individual parameters for use with configuration. The first slice block (top) is setup as follows:

_images/t4_ctrl_slice_set.png

The slice block can be found under the Xilinx Blockset → Control Logic. The only change with the subsequent slice blocks is the Offset of the bottom bit. They are, from top to bottom, 16, 17 and 18 respectively.

After each slice block we put an edge_detect block, which outputs true if a boolean input signal is true on this clock and was false on the last clock. It is found under CASPER DSP Blockset → Misc.
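
The rising-edge behaviour can be modelled in a few lines of Python. This is a behavioural sketch only: it assumes the previous value starts out false, and the function simply mirrors the description above rather than the block's actual implementation.

```python
def edge_detect(samples):
    """Return True for each clock where the input is true and was false on the previous clock."""
    out = []
    prev = False  # assumed initial state
    for sample in samples:
        current = bool(sample)
        out.append(current and not prev)
        prev = current
    return out


print(edge_detect([0, 1, 1, 0, 1]))
```

Holding a control bit high therefore produces a single one-clock pulse, which is why software can set and clear these bits at its leisure.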

Next are the delay blocks. They can be left with their default settings and can be found under Xilinx Blockset → Common.

The Goto and From blocks can be found under Simulink-> Signal Routing. Label them as in the block diagram above.

Clip Detect and status reporting

To detect and report signal saturation (clipping) to software, we will create a subsystem with latching inputs.

_images/t4_status_clip_jbo.png _images/t4_status_report.png

The internals of this subsystem (right) consist of delay blocks, registers and cast blocks.

The delays (inputs 2 - 9) can be kept as default. Cast blocks are required as only unsigned integers can be concatenated. Set their parameters to Unsigned, 1 bit, 0 binary points, Truncated Quantization, Wrapped Overflow and 0 Latency.

The Registers (inputs 10 - 33) must be set up with an initial value of 0 and with enable and reset ports enabled. The status register on the output of the clip detect is set to To Processor, with an unsigned data type, 0 binary point and a sample period of 1.

PFBs, FFTs and Quantisers

The PFB FIR, FFT and the Quantizer are the heart of this design. There is one set of each for the 4 outputs of the ADCs. However, in order to save resources associated with control logic and PFB and FFT coefficient storage, the four independent filters are combined into a single Simulink block. This is configured to process four independent data streams by setting the “number of inputs” parameter on the PFB_FIR and FFT blocks to 4.

_images/t4_pfb_fft_jbo.png

Configure the PFB_FIR_generic blocks as shown below:

_images/t4_pfb_set_jbo.png

The first FFT stage can overflow if the input is periodic or signal levels are high, because shifting inside the FFT is only performed after each butterfly stage calculation. For this reason, we recommend casting any inputs up to 18 bits with the binary point at position 17 (thus keeping the range of values -1 to 1), and then downshifting by 1 bit so that the signal sits one bit below the most significant bit, leaving one bit of headroom.
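
To make the arithmetic concrete, here is a small integer model of the cast and downshift. It assumes 8-bit signed input samples with 7 fractional bits, which is an assumption for illustration only. The helper names are likewise made up.

```python
def to_fix_18_17(sample, in_bits=8):
    """Re-express a signed sample with (in_bits - 1) fractional bits as an
    18.17 fixed-point integer. The numeric value is unchanged."""
    frac_in = in_bits - 1
    return sample << (17 - frac_in)

def downshift(value, bits=1):
    """Arithmetic right shift: halves the value for each bit shifted."""
    return value >> bits

def as_float(value, frac_bits=17):
    """Interpret a fixed-point integer as a real number."""
    return value / float(1 << frac_bits)

raw = -128                      # most negative 8-bit sample, i.e. -1.0
wide = to_fix_18_17(raw)        # still -1.0, now in 18.17 format
print("%.1f %.1f" % (as_float(wide), as_float(downshift(wide))))
```

After the downshift the largest magnitude is 0.5, so the signal can double once before the 18-bit range is exceeded.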

The fft_wideband_real block should be configured as follows:

_images/t4_fft_set_jbo.png

The Quantizer Subsystem is designed as seen below. The quantizer removes the bit growth that was introduced in the PFB and FFT. We can do this because we do not need the full dynamic range.

_images/t4_quant_jbo.png

The top level view of the Quantizer Subsystem is as seen below.

_images/t4_quant_top_lvl.png

LEDs

The following sections are more peripheral to the design and will only be touched on. By now you should be comfortable putting the blocks together and be able to figure out many of the values and parameters. Feel free to consult the reference design, which sits in the tutorial 4 project directory, or to ask the tutorial helpers any questions.

As a debug and monitoring output we can wire up the LEDs to certain signals. We light an LED with every sync pulse. This is a sort of heartbeat showing that the design is clocking and the FPGA is running.

We light an error LED in case any ADC overflows and another if the system is reset. The fourth LED gives a visual indication of when an accumulation is complete.

ROACH’s LEDs are negative logic, so when the input to the yellow block is high, the LED is off. Since this is the opposite of what you’d normally expect, we invert the logic signals with a NOT gate.

Since the signals might be too short to visibly light an LED (consider the case where a single ADC sample overflows; 1/800 MHz is 1.25 ns, much too short for the human eye to see), we add a negedge delay block, which delays the negative edge of its input signal, thereby extending the positive pulse. A length of 2^23 gives about a 10 ms pulse.

_images/t4_leds_jbo.png
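
The pulse-length figure can be checked with simple arithmetic. The sketch below assumes the delay is counted at the 800 MHz rate quoted above; if the block is clocked more slowly, the pulse is proportionally longer.

```python
def sample_period_ns(clock_hz):
    """Duration of one clock period in nanoseconds."""
    return 1e9 / clock_hz

def pulse_length_ms(length_clocks, clock_hz):
    """Duration of a pulse stretched by `length_clocks` clock cycles."""
    return 1e3 * length_clocks / clock_hz

print("%.2f ns" % sample_period_ns(800e6))
print("%.2f ms" % pulse_length_ms(2**23, 800e6))
```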

ADC RMS

These blocks calculate the RMS values of the ADCs’ input signals. We subsample the input stream by a factor of four, selecting one of the parallel inputs pseudo-randomly to prevent false reporting of repetitive signals. This subsampled stream is squared and accumulated for 2^16 samples.

_images/t4_adc_rms.png
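
The following sketch illustrates why the selection is randomised. It models the four parallel samples per clock as consecutive list entries and uses Python's random module as a stand-in for whatever pseudo-random selector the hardware uses, so treat it as an illustration rather than a model of the actual blocks.

```python
import math
import random

def subsampled_rms(samples, n_acc=2**16, parallel=4, seed=0):
    """Pick one of `parallel` samples per clock at random, square it,
    accumulate n_acc of them and return the RMS estimate."""
    rng = random.Random(seed)  # stand-in for the hardware selector
    acc = 0.0
    for clock in range(n_acc):
        group = samples[clock * parallel:(clock + 1) * parallel]
        acc += rng.choice(group) ** 2
    return math.sqrt(acc / float(n_acc))

def fixed_lane_rms(samples, lane, n_acc=2**16, parallel=4):
    """Same accumulation, but always taking the same parallel lane."""
    acc = sum(samples[clock * parallel + lane] ** 2 for clock in range(n_acc))
    return math.sqrt(acc / float(n_acc))

# A tone that repeats every four samples: true RMS is 0.5 / sqrt(2).
tone = [0.5, 0.0, -0.5, 0.0] * 2**16
print("fixed lane 1: %.3f" % fixed_lane_rms(tone, 1))
print("random lane:  %.3f" % subsampled_rms(tone))
```

With a fixed lane, a signal whose period matches the subsampling factor is always sampled at the same phase, so the estimate can be badly wrong. The random selection gives a value close to the true RMS.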

The MAC operation

The multiply and accumulate is performed in the dir_x (direct-x) blocks, so named because different antenna signal pairs are multiplied directly, in parallel (as opposed to the packetized correlators’ X engines which process serially).

Two sets are used, one for the even channels and another for the odd channels. Accumulation for each antenna pair takes place in BRAM using the same simple vector accumulator used in tut3.

_images/t4_mac_op_jbo.png
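
In software terms, the multiply and accumulate amounts to the loop below. This is a plain-Python sketch, not the dir_x implementation: it ignores the even/odd channel split, and the choice of which input in each pair is conjugated is an assumption.

```python
from itertools import combinations_with_replacement

def mac(spectra):
    """Accumulate x_i * conj(x_j) per channel for every input pair.

    spectra[t][i][ch] is the complex FFT output of input i, channel ch,
    for spectrum number t. Returns {(i, j): [accumulated value per channel]}.
    """
    n_inputs = len(spectra[0])
    n_chans = len(spectra[0][0])
    # All auto (i == j) and cross (i < j) pairs.
    pairs = list(combinations_with_replacement(range(n_inputs), 2))
    acc = {pair: [0j] * n_chans for pair in pairs}
    for spectrum in spectra:
        for (i, j) in pairs:
            for ch in range(n_chans):
                acc[(i, j)][ch] += spectrum[i][ch] * spectrum[j][ch].conjugate()
    return acc

# Two inputs, one channel, two spectra accumulated.
result = mac([[[1 + 1j], [1 - 1j]]] * 2)
print(sorted(result.items()))
```

For the four inputs in this design the loop produces ten products per channel: four auto-correlations and six cross-correlations.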

CONTROL:

The design starts by itself when the FPGA is programmed. The only control register inputs are for resetting counters and optionally syncing to an external signal.

The sync LED provides a “heartbeat” signal to instantly see if your design is clocked sensibly.

The new accumulation LED gives a visual indication of data rates and dump times.

The accumulation counter provides a simple mechanism for checking if a new spectrum output is available (poll it and compare to the last value).
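
The poll-and-compare idea can be sketched as follows. The helper takes any function that returns the current counter value, so it runs here against a fake counter. On real hardware you might pass something like lambda: fpga.read_uint('acc_num'), where the register name acc_num is an assumption and should be replaced by the name used in your design.

```python
import time

def wait_for_new_accumulation(read_count, last_count,
                              poll_interval=0.01, timeout=5.0):
    """Poll read_count() until it differs from last_count, then return
    the new value. Raises RuntimeError if nothing changes in time."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        count = read_count()
        if count != last_count:
            return count
        time.sleep(poll_interval)
    raise RuntimeError('no new accumulation within %.1f s' % timeout)

# Fake counter for demonstration: reads 7 twice, then advances to 8.
fake_counter = iter([7, 7, 8])
print(wait_for_new_accumulation(lambda: next(fake_counter),
                                last_count=7, poll_interval=0))
```

Once the counter has changed, the accumulated spectrum can be read out and the returned value stored as the new last_count.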

Software

The Python scripts are located in the tut4 tutorial directory. First run poco_init.py to program the FPGA and configure the design. Then run either the auto-correlation or the cross-correlation plotting script (poco_plot_auto.py or poco_plot_cross.py).

poco_init.py

 # Excerpt from poco_init.py (Python 2). roach, katcp_port, boffile, opts,
 # logger and exit_fail() are set up elsewhere in the script.
 import time
 import corr

 # Connect to the board over KATCP.
 print('Connecting to server %s on port %i... '%(roach,katcp_port)),
 fpga = corr.katcp_wrapper.FpgaClient(roach, katcp_port, timeout=10, logger=logger)
 time.sleep(1)
 if fpga.is_connected():
     print 'ok\n'
 else:
     print 'ERROR connecting to server %s on port %i.\n'%(roach,katcp_port)
     exit_fail()
 print '------------------------'
 # Program the FPGA unless told to skip this step.
 print 'Programming FPGA...',
 if not opts.skip:
     fpga.progdev(boffile)
     print 'done'
 else:
     print 'Skipped.'
 # Write the configuration registers (fft_shift gets all 32 bits set).
 print 'Configuring fft_shift...',
 fpga.write_int('fft_shift',(2**32)-1)
 print 'done'
 print 'Configuring accumulation period...',
 fpga.write_int('acc_len',opts.acc_len)
 print 'done'
 print 'Resetting board, software triggering and resetting error counters...',
 fpga.write_int('ctrl',1<<17) #arm
 fpga.write_int('ctrl',1<<18) #software trigger
 fpga.write_int('ctrl',0)
 fpga.write_int('ctrl',1<<18) #issue a second trigger
 print 'done'

You will probably have seen very similar code in previous tutorials. It instantiates the katcp wrapper, named fpga, which manages the interface between the software and the hardware. fpga.progdev programs the boffile onto the FPGA and fpga.write_int writes to a register.

poco_adc_amplitudes.py

This script outputs the amplitude (or power) of each signal as well as the number of bits used. It updates itself every second or so.

 ADC amplitudes
 --------------
 ADC0 input I: 0.006 (0.51 bits used)
 ADC0 input Q: 0.004 (0.19 bits used)
 ADC1 input I: 0.005 (0.45 bits used)
 ADC1 input Q: 0.004 (0.19 bits used)
 -----------------------------------
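
One plausible way to turn an RMS value into a “bits used” figure is shown below. Both the formula and the 8-bit ADC width are assumptions for illustration, not taken from poco_adc_amplitudes.py. Note also that the RMS values in the listing are rounded to three decimals, so recomputing from them gives slightly different numbers.

```python
import math

def bits_used(rms, adc_bits=8):
    """Estimate how many ADC bits a signal of the given RMS occupies.

    `rms` is the RMS amplitude as a fraction of full scale. The result is
    clamped at zero for signals below one least significant bit.
    """
    if rms <= 0:
        return 0.0
    return max(0.0, math.log(rms * 2 ** adc_bits, 2))

for rms in (0.004, 0.5):
    print("rms %.3f -> %.2f bits" % (rms, bits_used(rms)))
```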

poco_plot_auto.py

This script grabs auto-correlations from the brams and plots them. Since there are 4 inputs, 2 for each ADC, there are 4 plots. Plots for inputs with no noise source or tone connected to the ADC (here plots 3 and 4) will look random.

_images/t4_plot_auto.png

poco_plot_cross.py

This script grabs cross-correlations from the brams and plots them. This plot shows the cross-correlation of AB.

_images/t4_plot_cross.png

Environment setup

OS

It is recommended to use Ubuntu 14.04. Ubuntu 16.04 has also been known to work, although the setup process can be a bit of a headache.

Matlab and Xilinx

To use the tutorials you will need to install the versions of Matlab and the Xilinx tools particular to the hardware you plan to use. See the installation matrix below.

Hardware    Matlab Version    Xilinx Version
ROACH1/2    2013b             ISE 14.7
SKARAB      2016b             Vivado 2016.2
SNAP        2016b             Vivado 2016.4

Modifications to be run after installs

ROACH1/2

Xilinx removed support for several hardware pcores we use for ROACH1/2 from ISE 14. The current solution is to copy the following pcores from the Xilinx 11 install to your XPS_ROACH_BASE/pcores folder or to your 14 install directory at Xilinx/14.7/ISE_DS/EDK/hw/XilinxProcessorIPLib/pcore.

OPB pcores

  • bram_if_cntlr_v1_00_a
  • bram_if_cntlr_v1_00_b
  • ipif_common_v1_00_c
  • opb_arbiter_v1_02_e
  • opb_bram_if_cntlr_v1_00_a
  • opb_ipif_v3_00_a
  • opb_opb_lite_v1_00_a
  • opb_v20_v1_10_c
  • proc_common_v1_00_a

All installs

The syntax in the Xilinx Perl scripts is not supported under the Ubuntu default shell Dash. Change the symbolic link sh -> dash to sh -> bash:

cd /bin/
sudo rm sh
sudo ln -s bash sh

Point gmake to make by creating the symbolic link gmake -> make:

cd /usr/bin/
sudo ln -s make gmake

If you are not getting any blocks in Simulink (only seen on CentOS), change the permissions on /tmp/LibraryBrowser to a+rwx:

chmod a+rwx /tmp/LibraryBrowser