High Performance FPGA Design Notes


This article provides recommendations for achieving timing closure in today's large FPGAs. It is written in two sections. Section I recommends design practices for high performance design. Section II provides tactics to use when the recommended practices don't achieve timing closure, that is, for difficult timing paths or operation at a very high frequency. If the target frequency is 40 MHz, the steps in Section II won't be necessary. At frequencies over 100 MHz, it may be necessary to use some or all of them. The steps are presented in order, from easiest to hardest to implement. Apply only as many as are needed to achieve the timing goal; it isn't necessary to use every step in Section II.

I. Recommended Design Practices for High Performance FPGA Design

Use synchronous design techniques
Today's place and route tools are based upon static timing analysis, which examines all of the delays in the design from register output to register input. These tools are designed for synchronous logic; they do not work well with asynchronous state machines, self-clocking circuits, etc. All registers within a given clock domain should be clocked from a common clock.
Use clock enables
Since all registers in FPGAs have clock enables, and it's desired to use a single common clock within a clock domain, the clock enable should be used rather than a locally generated clock.
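As a minimal sketch of this practice, the register below is updated through its clock enable while staying on the common clock. The entity name ce_register and its port names are made up for illustration; they do not come from any particular design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: an 8-bit register that loads only when ce = '1'.
entity ce_register is
  port (
    clk : in  std_logic;                     -- common clock for the domain
    ce  : in  std_logic;                     -- clock enable
    d   : in  std_logic_vector(7 downto 0);
    q   : out std_logic_vector(7 downto 0)
  );
end entity ce_register;

architecture rtl of ce_register is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- The enable qualifies the update; the clock itself is never gated.
      if ce = '1' then
        q <= d;
      end if;
    end if;
  end process;
end architecture rtl;
```

Written this way, synthesis tools can normally map ce onto the dedicated clock enable pin of the flip-flops, and every register in the domain stays on the same clock net.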
Use a sufficient number of pipeline stages to achieve the target clock frequency
FPGAs are rich in registers; adding pipeline stages shortens the combinatorial delay between registers. Even if the delay through a section of the combinatorial path is small, register balancing may be used to speed up other paths nearby. Pipeline registers also help break up long routing delays across the chip.
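One possible illustration of pipelining is a four-input adder split into two registered stages, so that no register-to-register path contains more than one adder. The entity name pipelined_sum4 and the bit widths are assumptions chosen for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: sum of four 16-bit values with two pipeline stages.
entity pipelined_sum4 is
  port (
    clk        : in  std_logic;
    a, b, c, d : in  unsigned(15 downto 0);
    sum        : out unsigned(17 downto 0)   -- valid two clocks after the inputs
  );
end entity pipelined_sum4;

architecture rtl of pipelined_sum4 is
  signal ab_r, cd_r : unsigned(16 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: two partial sums, each registered.
      ab_r <= resize(a, 17) + resize(b, 17);
      cd_r <= resize(c, 17) + resize(d, 17);
      -- Stage 2: final sum of the registered partial sums.
      sum  <= resize(ab_r, 18) + resize(cd_r, 18);
    end if;
  end process;
end architecture rtl;
```

The cost is one extra clock of latency compared with a single registered three-adder expression.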
Know the logic that will be generated by the HDL code specified
Use an IF/ELSIF construct when a priority encoder is desired and a CASE statement when a decoder is desired. To illustrate the difference, consider the following code:

IF A = '0' THEN
    [assignment #1]
ELSIF B = '0' THEN
    [assignment #2]
ELSIF C = '0' THEN
    [assignment #3]
END IF;
Assignment #1 occurs when (A = '0'). The equation for assignment #2 is actually (A /= '0' AND B = '0'). The equation for assignment #3 is (A /= '0' AND B /= '0' AND C = '0'). When expanded in this fashion, it becomes clear that the delay for the last ELSIF will be longer than the delay through the top IF, because each later branch depends on every condition above it. This should be considered when organizing an IF/ELSIF construct. In summary, if a priority encoder is desired, then the IF statement is the method of achieving it. But if a decoder is needed, a CASE statement is a much better choice.
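To make the contrast concrete, here is a small sketch with both constructs side by side. The entity prio_vs_case and all of its ports are hypothetical; the point is only that the IF/ELSIF chain describes a priority structure, while the CASE statement describes a flat selection on an encoded value.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example contrasting a priority chain with a decoded selection.
entity prio_vs_case is
  port (
    a, b, c        : in  std_logic;                     -- priority conditions
    sel            : in  std_logic_vector(1 downto 0);  -- encoded select
    d0, d1, d2, d3 : in  std_logic;
    y_prio         : out std_logic;
    y_case         : out std_logic
  );
end entity prio_vs_case;

architecture rtl of prio_vs_case is
begin
  -- Priority: each branch implies that all earlier conditions were false.
  process (a, b, c, d0, d1, d2, d3)
  begin
    if a = '0' then
      y_prio <= d0;
    elsif b = '0' then
      y_prio <= d1;
    elsif c = '0' then
      y_prio <= d2;
    else
      y_prio <= d3;
    end if;
  end process;

  -- Decoder/mux: choices are mutually exclusive, so no branch depends on another.
  process (sel, d0, d1, d2, d3)
  begin
    case sel is
      when "00"   => y_case <= d0;
      when "01"   => y_case <= d1;
      when "10"   => y_case <= d2;
      when others => y_case <= d3;
    end case;
  end process;
end architecture rtl;
```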
Use registers in the I/O ring whenever possible
Registering signals on the way into and out of the chip provides guaranteed input and output timing and can save resources within the FPGA fabric. This can generally be accomplished automatically by either the synthesis or place and route tools when the appropriate option is enabled.
Think in terms of the device architecture
Keep track of the levels of combinatorial logic in the units of the target architecture. If you are targeting a 4-input lookup table based FPGA, then keep track of combinatorial delays in terms of lookup tables (i.e., three levels of logic is equal to three lookup tables). This aids in estimating performance during design, rather than waiting until the first static timing analysis report is available. Also, take advantage of the special features offered in the target architecture (hardware multipliers, PLLs, DLLs, wide muxes, etc.).

II. Techniques for Achieving Timing Closure

Whenever a timing problem is encountered, a decision must be made about how best to handle it. There are many methods, several of which are described below, but one often recommended by the tool vendors is to change the synthesis and/or place and route tool settings, for example by setting the synthesizer to optimize more for speed than area or by increasing the place and route effort. While these may work, in my opinion they generally don't fix the underlying problem; the timing problem quite often will come back to haunt you later in the design. If the problem can be fixed easily in the source code or by adding a constraint, it is almost always best to choose this method over changing the tool settings, since it permanently eliminates the problem.

The following are some techniques used to fix timing problems.

Add a pipeline stage
Depending upon the nature of the design, one of the simplest ways to break up a long combinatorial path is to add a pipeline stage. This is generally easy to do in DSP designs with a dataflow architecture, and not so easy for other types of designs such as a PCI bus interface. Keep in mind that if the synthesizer can move registers during timing optimization ("register balancing"), you can add pipeline stages back to back and let the synthesizer move the registers into the combinatorial path.
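The back-to-back approach might look like the sketch below: a multiplier followed by two plain output registers, written in the hope that the synthesizer's register balancing will push one of them into the multiplier logic. The entity name mult_retime and the 18-bit operand width are assumptions, and whether the registers are actually moved depends on the tool and on its retiming option being enabled.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: two back-to-back registers after a multiplier,
-- left for the synthesizer to rebalance into the combinatorial path.
entity mult_retime is
  port (
    clk  : in  std_logic;
    a, b : in  signed(17 downto 0);
    p    : out signed(35 downto 0)   -- valid two clocks after a and b
  );
end entity mult_retime;

architecture rtl of mult_retime is
  signal p_r1, p_r2 : signed(35 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      p_r1 <= a * b;   -- all of the combinatorial delay sits before this register
      p_r2 <= p_r1;    -- extra stage with no logic in front of it, free to be moved
    end if;
  end process;

  p <= p_r2;
end architecture rtl;
```

Reset-free registers are used here on purpose, since in many tools a reset on the pipeline registers can limit how freely they may be moved.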
Add multicycle constraints
If the offending timing path actually has multiple cycles to execute, a multicycle path constraint can be used to constrain the path to the actual required timing. For example, if the output of an accumulator is needed only every second cycle of a 100 MHz clock, the place and route tools can be instructed, through a multicycle constraint, to optimize the path for two clock cycles, or 20 ns in this example. It may also be possible to modify the logic to allow multiple cycles for a long path, and then add the constraint to eliminate the violation.
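A rough sketch of logic that genuinely allows two cycles is shown below: the accumulator register is enabled only on every second clock, so its feedback path through the adder has two clock periods to settle. The entity acc_multicycle, its ports, and the phase signal are invented for the example. The constraint in the comments uses SDC-style syntax as an assumption; the exact commands and the post-synthesis cell names are tool-specific.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: accumulator updated every second clock cycle.
entity acc_multicycle is
  port (
    clk : in  std_logic;
    rst : in  std_logic;              -- synchronous reset
    din : in  signed(15 downto 0);
    acc : out signed(31 downto 0)
  );
end entity acc_multicycle;

architecture rtl of acc_multicycle is
  signal phase : std_logic := '0';
  signal acc_r : signed(31 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        phase <= '0';
        acc_r <= (others => '0');
      else
        phase <= not phase;           -- toggles every clock
        if phase = '1' then           -- enable is active every second clock
          acc_r <= acc_r + resize(din, 32);
        end if;
      end if;
    end if;
  end process;

  acc <= acc_r;

  -- Assumed SDC-style constraint for the acc_r -> acc_r feedback path
  -- (cell name pattern is a guess, adjust to the netlist):
  --   set_multicycle_path 2 -setup -from [get_cells acc_r*] -to [get_cells acc_r*]
  --   set_multicycle_path 1 -hold  -from [get_cells acc_r*] -to [get_cells acc_r*]
  -- The din -> acc_r path is NOT covered unless din is also held for two clocks.
end architecture rtl;
```

The constraint is only safe because the enable guarantees that acc_r cannot capture on the intermediate clock edge; applying it to a path that is sampled every cycle would hide a real violation.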
Use duplication for overloaded nets
For nets that have a high fanout, duplicating the source of the net both reduces the delay and helps in routing to different areas of the chip. Duplication can be accomplished either manually or through the tools. To duplicate manually, it will be necessary to instruct the tools not to eliminate redundant logic. Another method is to hide the redundant logic in a different level of the hierarchy. Automatic duplication can be accomplished by decreasing the maximum fanout constraint on the particular net for synthesis, forcing the synthesizer to duplicate it.
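Manual duplication could be sketched as follows: the same enable is registered twice, and each copy drives its own group of loads. The entity dup_enable and its ports are hypothetical. The dont_touch attribute is used here as an assumed way of telling the tool not to merge the equivalent registers; the attribute name differs between tools (other synthesizers use names such as syn_preserve or keep), so check the documentation for the one in use.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: one enable source duplicated into two registers,
-- each serving a different group of loads.
entity dup_enable is
  port (
    clk      : in  std_logic;
    en_in    : in  std_logic;
    d_a, d_b : in  std_logic_vector(7 downto 0);
    q_a, q_b : out std_logic_vector(7 downto 0)
  );
end entity dup_enable;

architecture rtl of dup_enable is
  signal en_a, en_b : std_logic := '0';

  -- Assumed, tool-specific attribute asking synthesis to keep both copies.
  attribute dont_touch : string;
  attribute dont_touch of en_a : signal is "true";
  attribute dont_touch of en_b : signal is "true";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Two logically identical registers driven from the same source.
      en_a <= en_in;
      en_b <= en_in;

      -- Each copy drives only its own loads, halving the fanout per net.
      if en_a = '1' then
        q_a <= d_a;
      end if;
      if en_b = '1' then
        q_b <= d_b;
      end if;
    end if;
  end process;
end architecture rtl;
```

Because each copy has fewer loads, the placer is also free to put en_a near the q_a logic and en_b near the q_b logic.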
Prioritize routing
When the above methods don't work or don't apply, another method that might alleviate the problem is to prioritize the routing of the failing net or nets. Some place and route tools provide a method of accomplishing this directly, while others don't. If there isn't a direct method, it's possible to accomplish the same thing by constraining the net to a period slightly shorter than the target clock period. If the target frequency is 100 MHz (a 10 ns period), then applying a constraint of 9.8 ns to these nets should cause the tools to route them first.
Use low skew routing resources
Occasionally one has to deal with a net with a very high fanout. One way of speeding it up and helping to route it across the chip is to manually place it on a low skew routing resource. This is accomplished through a constraint or an attribute in the source code.
Consider using a tool generated core
For a particularly difficult function, such as a function that uses a long carry chain as in an accumulator, an easy fix might be to use a tool generated core. Many of the FPGA vendor core generators provide accumulators, adder/subtracters, comparators, etc. These are not only optimized in terms of construction, but sometimes contain relative placement constraints providing minimized routing delays.
Construct a custom core
With some vendors it's possible to construct an arithmetic core, manually place this core, and use this placed core in your design. For Xilinx this is known as a "Relationally Placed Macro" or RPM. To accomplish this, the function is first captured as a standalone module, synthesized, and then floorplanned to lock down the placement. In the final design, this module is instantiated as a black box, and its EDIF or NGC file is passed to the place and route tools.
Floorplan the design
Manually floorplanning a design used to be quite a burden. Newer tools such as Xilinx PlanAhead make this a much less labor intensive task. In PlanAhead, one can define groups of logic (which are generally the modules in the design hierarchy, but don't have to be) and place these within regions on the die. Since modules are being placed, as opposed to individual registers and lookup tables, performing a floorplan doesn't actually take that long. The benefits can be substantial for certain types of designs, particularly DSP designs which are organized as a dataflow. To be effective, the design must be synthesized in a way that preserves the hierarchy, and it greatly helps if the outputs from each module are registered.