FFT

Block: FFT (fft)
Block Author: Aaron Parsons
Block Maintainer: Andrew Martens
Document Author: Aaron Parsons, Andrew Martens

Contents

Summary
Mask Parameters
Ports
Description

Summary

Computes the Fast Fourier Transform with 2^N channels for time samples presented 2^P at a time in parallel. Internally it uses a biplex FFT architecture that has been extended to handle time samples in parallel. For P = 0, this block accepts two independent, parallel streams (labelled as pols) and computes the FFT of each independently (the biplex architecture provides this for free). Data is output in normal frequency order: channel 0 (corresponding to DC) is output first, followed by channel 1, and so on up to channel 2^N − 1 (which can be interpreted as channel −1). When multiple time samples are presented in parallel on the input, multiple frequency samples are output in parallel.
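The channel ordering described above can be illustrated with a small software model. This is only a sketch in plain Python, not a description of the block's internals: the helper names dft and signed_channel are hypothetical, and the direct DFT merely stands in for the hardware FFT.

```python
import cmath

def dft(samples):
    """Direct DFT (software stand-in for the FFT); output is in normal frequency order."""
    m = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * n / m) for n, x in enumerate(samples))
        for k in range(m)
    ]

def signed_channel(k, log2_channels):
    """Map output channel k (0 .. 2^N - 1) to a signed frequency index.

    Channels in the upper half are read as negative frequencies, so channel
    2^N - 1 becomes -1. Mapping the Nyquist channel to -2^(N-1) is a convention.
    """
    m = 1 << log2_channels
    return k if k < m // 2 else k - m

N = 3                      # illustrative FFT size: 2^3 = 8 channels
M = 1 << N
# A complex tone at -1 cycle per frame.
tone = [cmath.exp(-2j * cmath.pi * n / M) for n in range(M)]
spectrum = dft(tone)
peak = max(range(M), key=lambda k: abs(spectrum[k]))
print(peak, signed_channel(peak, N))
```

The tone at frequency −1 lands in the last channel output, which is why channel 2^N − 1 can be read as channel −1.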

Mask Parameters

Parameter	Variable	Description	Recommended Value
Number simultaneous streams	n_streams	The number of input data streams to be processed in parallel. Each stream consists of a set of parallel inputs set by another parameter (see Number of Simultaneous Inputs)
Size of FFT: (2^?)	FFTSize	The number of channels computed in the complex FFT core. The number of channels output for each real stream is half of this.
Input Bit Width	input_bit_width	The number of bits in each real and imaginary sample as they are carried through the FFT. Each FFT stage will round numbers back down to this number of bits after performing a butterfly computation if bit growth is not enabled.	To make optimal use of BRAM => 18. For low FFT noise => 25.
Input binary point	bin_pt	The position of the binary point in the input data.
Coefficient Bit Width	coeff_bit_width	The number of bits used in the real and imaginary part of the twiddle factors at each stage.	18
Number of Simultaneous Inputs: (2^?)	n_inputs	The number of parallel time samples which are presented to the FFT core each clock. This must be at least 2^2. The number of output ports is half of this value.
Unscramble output (ie, put channels in canonical order)	unscramble	The FFT inherently produces data in an order that requires unscrambling before being used by many algorithms. This requires resources and can limit performance and so should be disabled if not necessary.
Asynchronous operation	async	Whether valid data is input on every clock cycle or is flagged via the en input port.
Quantization Behavior	quantization	Specifies the rounding behavior used at the end of each twiddle and butterfly computation to return to the number of bits specified above.	NOT Truncate.
Overflow Behavior	overflow	Indicates the behavior of the FFT core when the value of a sample exceeds what can be expressed in the specified bit width.	Wrap, as Saturate will not make overflow corruption better behaved.
Add Latency	add_latency	Latency through adders in the FFT.	1
Mult Latency	mult_latency	Latency through multipliers in the FFT.	2
BRAM Latency	bram_latency	Latency through BRAM in the FFT.	2. For designs aimed at > 200 MHz => 3.
Convert Latency	conv_latency	Latency through blocks used to reduce bit widths after twiddle and butterfly stages.	1. For designs aimed at > 180 MHz => 2.
Number bits above which to store stage’s coefficients in BRAM (2^? bits)	coeffs_bit_limit	Determines the threshold at which the twiddle coefficients in a stage are stored in BRAM. Below this threshold distributed RAM is used.	8 (ensures at least 2^8=256 bits out of 18432 bits of BRAM used)
Number bits above which to implement stage’s delays in BRAM (2^? bits)	delays_bit_limit	Determines the threshold at which data delays in a stage are stored in BRAM. Below this threshold distributed RAM is used.	8 (ensures at least 2^8=256 bits out of 18432 bits of BRAM used)
BRAM sharing in coeff storage	coeff_sharing	Real and imaginary components of twiddle factors can be generated from the same set of coefficients, reducing BRAM use at the cost of some logic.
Store a fraction of coeff factors where useful	coeff_decimation	The full set of twiddle factors can be generated from a smaller set, reducing BRAM use at the cost of some logic.
Generate coeffs with multipliers where useful	coeff_generation	Generate twiddle factors in the internal fft_direct block using an oscillator with feedback.	To reduce BRAM usage => on. To reduce multiplier usage => off
Number calibration locations when generating coeffs (2^?)	cal_bits	When generating twiddle factors with an oscillator with feedback, reference values are used to calibrate the complex exponential generated.	For low BRAM usage => 1. For high quality twiddle factors => 9.
Feedback rotation vector resolution	n_bits_rotation	When generating the twiddle factors, the resolution of the vector determines how much error accumulates.	For low error => 25. For low BRAM usage => 18.
Maximum fanout	max_fanout	The maximum fanout the twiddle factors are allowed to experience between where they are generated and where they are multiplied with the data stream. As the coefficients are shared, large fanout can occur, which can limit the timing achievable. Decreasing the maximum fanout allowed should increase possible performance at the expense of some logic.
Multiplier specification (0=core, 1=embedded, 2=behavioural) (left=1st stage)	mult_spec	Array of values allowing exact specification of how multipliers are implemented at each stage.	2 (behavioral HDL) for each stage
Bit growth instead of shifting	bit_growth	Bit growth at every stage in the FFT can result in overflows which affect data quality. This can be prevented by dividing the data by two on the output of every stage, or by increasing the number of bits in the data stream by one bit. Shifting decreases the dynamic range and possible data quality whereas bit growth increases the resource requirements.
Max bits to growth to	max_bits	The maximum number of bits to increase the data path to when the bit growth option is chosen. Shifting is used for FFT stages after this.
Hardcode shift schedule	hardcode_shifts	When shifting to prevent overflow, use a fixed shifting schedule. This uses less logic and increases performance when compared to using a dynamic shift schedule.
Shift schedule	shift_schedule	When using a fixed shift schedule, use the shift schedule specified. A ‘1’ at position M in the array indicates a shift for the M’th FFT stage, a ‘0’ indicates no shift.
DSP48 adders in butterfly	dsp48_adders	The butterfly operation at each stage consists of two adders and two subtracters that can be implemented using DSP48 units instead of logic.	on (enabled) to reduce logic used.
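
The interaction between bit growth, max_bits and the shift schedule can be sketched numerically. The helper names stage_plan and schedule_scale are hypothetical. The sketch assumes one stage per power of two in the FFT size, that bit growth adds one bit per stage until max_bits is reached, and that every later stage shifts instead. The last point is a reading of the max_bits description rather than a confirmed implementation detail.

```python
def stage_plan(n_stages, input_bit_width, max_bits):
    """Return (stage, data_width, action) per stage under bit growth.

    Assumption: each stage grows the data path by one bit until max_bits
    is reached; stages after that divide by two ("shift") instead.
    """
    plan = []
    width = input_bit_width
    for stage in range(n_stages):
        if width < max_bits:
            width += 1
            plan.append((stage, width, "grow"))
        else:
            plan.append((stage, width, "shift"))
    return plan

def schedule_scale(shift_schedule):
    """Overall scaling from a fixed shift schedule.

    A 1 at position M means stage M divides its output by two, so the
    total scaling is 2 raised to minus the number of shifting stages.
    """
    return 2.0 ** -sum(shift_schedule)

# Illustrative values only: 5 stages, 18-bit input, growth capped at 20 bits.
for stage, width, action in stage_plan(5, 18, 20):
    print(stage, width, action)

print(schedule_scale([1, 1, 0, 1]))  # three shifting stages
```

Under these assumptions the first two stages widen the data path to 20 bits and the remaining three stages shift, which trades dynamic range for resource use as described in the table.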

Ports

Port	Dir	Data Type	Description	Recommended Use
sync	in	Boolean	sync is used to indicate the last data word of a frame of input data. When the block is in asynchronous operating mode, an active signal is aligned with en being active. When the block is in synchronous operating mode, an active pulse is aligned with the clock cycle before the first valid data of a new input frame.	Ensure the sync period complies with the memo describing correct use.
shift	in	Unsigned	Sets the shifting schedule through the FFT to prevent overflow. Bit 0 specifies the behavior of stage 0, bit 1 of stage 1, and so on. If a stage is set to shift (with bit = 1), then every sample is divided by 2 at the output of that stage.
in<stream><input>	in	Signed consisting of one (Input Bit Width) width signals per input.	The time-domain stream(s) to be channelised.	Data amplitude should not exceed 0.5 (divide data by 2 pre-FFT)
en	in	Boolean	When asynchronous operation is chosen, this port indicates that valid input data is available on all input data ports.
sync_out	out	Boolean	Indicates that data out will be valid next clock cycle.
out<stream><input>	out	Inherited	The frequency channels.
of	out	Unsigned, one bit per input stream	Indication of internal arithmetic overflow. Not time aligned with data. The most significant bit is the flag for input stream 0 etc.
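
The bit layouts of the shift and of ports can be sketched as follows. The helper names pack_shift_word and decode_overflow are hypothetical and belong to no CASPER library. The sketch assumes the of word is exactly n_streams bits wide, so that its most significant bit corresponds to stream 0 as stated above.

```python
def pack_shift_word(stage_shifts):
    """Build the unsigned value for the shift port.

    stage_shifts[i] is truthy if stage i should divide its output by two;
    bit 0 controls stage 0, bit 1 controls stage 1, and so on.
    """
    word = 0
    for stage, shift in enumerate(stage_shifts):
        if shift:
            word |= 1 << stage
    return word

def decode_overflow(of_word, n_streams):
    """Split the of port value into one flag per input stream.

    Assumption: the word is n_streams bits wide, with the most
    significant bit flagging stream 0.
    """
    return [bool((of_word >> (n_streams - 1 - s)) & 1) for s in range(n_streams)]

# Shift at stages 0, 2 and 3 of a four-stage FFT.
print(bin(pack_shift_word([1, 0, 1, 1])))
# With three streams, the value 0b100 flags an overflow on stream 0 only.
print(decode_overflow(0b100, 3))
```

Because of is not time aligned with the data, such a decode only indicates that an overflow occurred on a stream, not which output sample was affected.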

Description

Computes the Fast Fourier Transform with 2^N channels for time samples presented 2^P at a time in parallel. Internally it uses a biplex FFT architecture that has been extended to handle time samples in parallel. For P = 0, this block accepts two independent, parallel streams (labelled as pols) and computes the FFT of each independently (the biplex architecture provides this for free). Data is output in normal frequency order: channel 0 (corresponding to DC) is output first, followed by channel 1, and so on up to channel 2^N − 1 (which can be interpreted as channel −1). When multiple time samples are presented in parallel on the input, multiple frequency samples are output in parallel.
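
The relationship between N, P and frame timing can be worked through with a little arithmetic. This sketch assumes that one group of 2^P time samples is consumed on every clock cycle (valid data every clock); the function names and the example clock rate are illustrative, not taken from the block.

```python
def clocks_per_frame(log2_channels, log2_parallel):
    """Clock cycles needed to present one frame of 2^N time samples
    when 2^P samples arrive in parallel on each clock."""
    if log2_parallel > log2_channels:
        raise ValueError("cannot present more samples per clock than the frame holds")
    return 1 << (log2_channels - log2_parallel)

def frames_per_second(clock_hz, log2_channels, log2_parallel):
    """Spectra computed per second, assuming valid input on every clock."""
    return clock_hz / clocks_per_frame(log2_channels, log2_parallel)

# Illustrative values: N = 10 (1024 channels), P = 2 (4 samples per clock).
print(clocks_per_frame(10, 2))
print(frames_per_second(200e6, 10, 2))
```

With these example values a frame occupies 2^(10 − 2) = 256 clock cycles.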