Notes on High Performance FPGA Design
This article provides recommendations for achieving timing closure in
today's large FPGAs. The article is written in two sections. Section
I suggests recommended design practices for high performance design.
Section II provides tactics to use when the recommended practices don't
achieve timing closure, that is, for difficult timing paths or
operation at a very high frequency. Obviously, if the target
frequency is 40 MHz, then the steps in section II won't be necessary.
At frequencies over 100 MHz, it may be necessary to use one or all of
the steps in section II. These steps are presented in order, from the
easiest to implement to the hardest. Apply only as many steps as are
needed to meet the timing goal; it isn't necessary to use all of the
steps presented in section II.
I. Recommended Design Practices for High Performance FPGA Design
- Use synchronous design techniques
Today's place and route tools are based upon static timing analysis
which examines all of the delays in the design from register output to
register input. These tools are designed for synchronous logic; they
do not work well with asynchronous state machines, self-clocking
circuits, etc. All registers within a given clock domain should be
clocked from a common clock.
- Use clock enables
Since all registers in FPGAs have clock enables, and a single common
clock is desired within a clock domain, use the clock enable rather
than a locally generated clock.
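As a minimal sketch of the clock enable style described above, the register below updates only on clock edges where the enable is high, while still being clocked from the common clock. The entity name, port names, and 8-bit width are illustrative assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity: an 8-bit register with a clock enable.
entity ce_register is
  port (
    clk : in  std_logic;                     -- common domain clock
    ce  : in  std_logic;                     -- clock enable
    d   : in  std_logic_vector(7 downto 0);
    q   : out std_logic_vector(7 downto 0)
  );
end entity ce_register;

architecture rtl of ce_register is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- The register holds its value unless ce is asserted,
      -- so no locally generated or gated clock is needed.
      if ce = '1' then
        q <= d;
      end if;
    end if;
  end process;
end architecture rtl;
```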
- Use a sufficient number of pipeline stages to achieve the target
clock frequency
FPGAs are rich in registers; adding pipeline stages helps to reduce
the delays through the total combinatorial path. Even if the delay
through a section of the combinatorial path is small, register
balancing may be used to speed up other paths nearby. Another benefit
is in the ability to break up routing delays across the chip.
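To illustrate pipelining, the following sketch sums four inputs in two registered stages, so that each clock period contains only one adder delay instead of two. The entity name, signal names, and bit widths are assumptions chosen for the example; the result appears two clocks after the inputs.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: four-input sum with two pipeline stages.
entity pipelined_sum4 is
  port (
    clk        : in  std_logic;
    a, b, c, d : in  unsigned(15 downto 0);
    sum        : out unsigned(17 downto 0)
  );
end entity pipelined_sum4;

architecture rtl of pipelined_sum4 is
  -- Stage 1 registers, one bit wider than the inputs to hold the carry
  signal ab_r, cd_r : unsigned(16 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: two partial sums computed in parallel
      ab_r <= resize(a, 17) + resize(b, 17);
      cd_r <= resize(c, 17) + resize(d, 17);
      -- Stage 2: final sum of the registered partial sums
      sum  <= resize(ab_r, 18) + resize(cd_r, 18);
    end if;
  end process;
end architecture rtl;
```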
- Know what logic your HDL code will generate
Use an IF/ELSIF construct when a priority encoder is desired and a
CASE statement when a decoder is desired. To illustrate the
difference, consider the following code:
    IF A = '0' THEN
        [assignment #1]
    ELSIF B = '0' THEN
        [assignment #2]
    ELSIF C = '0' THEN
        [assignment #3]
    END IF;
Assignment #1 obviously occurs when (A = '0'). The equation for
assignment #2 is actually (A /= '0' AND B = '0'). The equation for
assignment #3 is (A /= '0' AND B /= '0' AND C = '0'). When expanded
in this fashion it becomes obvious that the delay for the last ELSIF
will be longer than the delay through the top IF, because each branch
must also test every condition above it. This should be considered
when organizing an IF/ELSIF construct. In
summary, if a priority encoder is desired, then the IF statement is
the method of achieving it. But if a decoder is needed, a CASE
statement is a much better choice.
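For contrast with the IF/ELSIF chain, here is a small sketch of the CASE style for a decoder. Every branch depends on the same select value and no branch has priority over another. The entity name, port names, and 2-to-4 encoding are assumptions made for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity: 2-to-4 one-hot decoder.
entity decode_example is
  port (
    sel : in  std_logic_vector(1 downto 0);
    y   : out std_logic_vector(3 downto 0)
  );
end entity decode_example;

architecture rtl of decode_example is
begin
  process (sel)
  begin
    -- The choices are mutually exclusive, so no priority logic is implied.
    case sel is
      when "00"   => y <= "0001";
      when "01"   => y <= "0010";
      when "10"   => y <= "0100";
      when "11"   => y <= "1000";
      when others => y <= "0000";  -- covers non-binary values such as 'X' and 'U'
    end case;
  end process;
end architecture rtl;
```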
- Use registers in the I/O ring whenever possible
Registering on the way into and out of the chip provides guaranteed
input and output timing and can save resources within the FPGA
fabric. This can generally be accomplished automatically by either
the synthesis or place and route tools with the right switch set.
- Think in terms of the device architecture
Keep track of the levels of combinatorial logic in the units of the
target architecture. If you are targeting a 4-input lookup table
based FPGA, then keep track of combinatorial delays in terms of lookup
tables (i.e., three levels of logic is equal to three lookup tables).
This aids in estimating performance during design, rather than waiting
until the first static timing analysis report is available. Also,
take advantage of the special features offered in the target
architecture (hardware multipliers, PLLs, DLLs, wide muxes, etc.).
II. Techniques for Achieving Timing Closure
Whenever a timing problem is encountered a decision must be made about
how best to handle it. There are many methods, many of which are
described below, but one method often recommended by the tool vendors
is to change the synthesis and/or place and route tool settings, for
example by setting the synthesizer to optimize more for speed than
area, or by increasing the place and route effort. While these may
work, in my opinion they generally don't fix the underlying problem;
the timing problem quite often will come back to haunt you later in
the design.
If the problem can be fixed easily in the source code or by adding a
constraint, it is almost always best to choose this method over
changing the tool settings since it permanently eliminates the
problem.
The following are some techniques used to fix timing problems.
- Add a pipeline stage
Depending upon the nature of the design, one of the simplest ways to
break up a long combinatorial path is to add a pipeline stage. This is
generally easy to do in DSP designs with a dataflow architecture, and
not so easy to do for other types of designs such as
a PCI bus interface. Keep in mind that if the synthesizer has the
ability to move registers during timing optimization ("register
balancing"), you can actually add pipeline stages back to back and
let the synthesizer move the registers into the combinatorial
path.
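The back-to-back register idea can be sketched as follows: a multiply-add is followed by two extra registers that do no work as written, giving the synthesizer registers it may move backward into the arithmetic. Whether they are actually moved depends on the tool and on register balancing being enabled; the entity name, signal names, and widths are assumptions for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: multiply-add with spare pipeline registers.
entity retime_example is
  port (
    clk     : in  std_logic;
    a, b, c : in  unsigned(15 downto 0);
    result  : out unsigned(31 downto 0)
  );
end entity retime_example;

architecture rtl of retime_example is
  signal stage1, stage2 : unsigned(31 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Long combinatorial path: 16x16 multiply followed by an add
      stage1 <= a * b + resize(c, 32);
      -- Back-to-back registers with no logic between them; a synthesizer
      -- with register balancing may redistribute these into the path above.
      stage2 <= stage1;
      result <= stage2;
    end if;
  end process;
end architecture rtl;
```

As written, the result is available three clocks after the inputs, regardless of where the tool finally places the registers.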
- Add multicycle constraints
If the offending timing path actually has multiple cycles to execute,
a multicycle path constraint can be used to constrain the path to the
actual required timing. For example, if the output of an accumulator
is needed only on every second cycle of a 100 MHz clock, then the
place and route tools can be instructed, through the use of a
multicycle constraint, to optimize the path for two clock cycles, or
20 ns in this example.
It may also be possible to modify the logic to allow multiple cycles
for a long path, and then add the constraint to eliminate the
violation.
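As a rough sketch of logic that genuinely allows two cycles, the accumulator below is enabled only on every second clock, so the feedback path from the accumulator register through the adder and back has two clock periods to settle. The entity and signal names are assumptions, and the multicycle constraint itself is written in tool-specific syntax that is not shown here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: accumulator updated every second clock.
entity acc_every_other is
  port (
    clk : in  std_logic;
    rst : in  std_logic;                 -- synchronous reset
    din : in  unsigned(15 downto 0);
    acc : out unsigned(31 downto 0)
  );
end entity acc_every_other;

architecture rtl of acc_every_other is
  signal ce2   : std_logic := '0';
  signal acc_r : unsigned(31 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        ce2   <= '0';
        acc_r <= (others => '0');
      else
        ce2 <= not ce2;                  -- high on every second clock
        if ce2 = '1' then
          -- acc_r changes at most once per two clocks, so the
          -- acc_r -> adder -> acc_r path is a two-cycle path.
          acc_r <= acc_r + resize(din, 32);
        end if;
      end if;
    end if;
  end process;

  acc <= acc_r;
end architecture rtl;
```

Note that only the path from acc_r back to acc_r is a two-cycle path in this sketch. The enable itself, and din unless it is also held for two clocks, still need single-cycle timing.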
- Use duplication for overloaded nets
For nets that have a high fanout, duplicating the source of the net
both reduces the delay and helps in routing to different areas of the
chip. Duplication can be accomplished either manually or through the
tools. To duplicate manually, it will be necessary to instruct the
tools not to eliminate redundant logic. Another method is to hide the
redundant logic in a different level of the hierarchy. Automatic
duplication can be accomplished by decreasing the maximum fanout
constraint on the particular net for synthesis, forcing the
synthesizer to duplicate it.
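Manual duplication might look like the sketch below, where one enable is registered twice and each copy drives half of the loads. The attribute used to keep the tools from merging the two identical registers is an assumption: the name and value ("keep" here) vary between synthesis tools, so check the tool documentation for the equivalent. The entity and signal names are also illustrative.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity: one enable duplicated to split its fanout.
entity dup_enable is
  port (
    clk      : in  std_logic;
    en_in    : in  std_logic;
    d_a, d_b : in  std_logic_vector(31 downto 0);
    q_a, q_b : out std_logic_vector(31 downto 0)
  );
end entity dup_enable;

architecture rtl of dup_enable is
  signal en_a, en_b : std_logic;
  -- ASSUMPTION: attribute name and value are tool-specific.
  attribute keep : string;
  attribute keep of en_a : signal is "true";
  attribute keep of en_b : signal is "true";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Two identical copies of the same source register
      en_a <= en_in;
      en_b <= en_in;
      -- Each copy drives only its own bank of registers
      if en_a = '1' then
        q_a <= d_a;
      end if;
      if en_b = '1' then
        q_b <= d_b;
      end if;
    end if;
  end process;
end architecture rtl;
```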
- Prioritize routing
When the above methods don't work or don't apply, another method that
might alleviate the problem is to prioritize the routing of the
failing net or nets. Some place and route tools provide a method of
accomplishing this directly, while others don't. If there isn't a
direct method, it's possible to accomplish the same thing by applying
a constraint to the net that is slightly tighter than the desired
clock period. If the target frequency is 100 MHz (a 10 ns period),
then applying a constraint of 9.8 ns should cause the tools to route
these nets first.
- Use low skew routing resources
Occasionally one has to deal with a net with a very high fanout. One
way of speeding this up and helping to route it across the chip is to
manually place it on a low skew net. This is accomplished through a
constraint or through an attribute in the source code.
- Consider using a tool generated core
For a particularly difficult function, such as a function that uses a
long carry chain as in an accumulator, an easy fix might be to use a
tool generated core. Many of the FPGA vendor core generators provide
accumulators, adder/subtracters, comparators, etc. These are not only
optimized in terms of construction, but sometimes contain relative
placement constraints providing minimized routing delays.
- Construct a custom core
With some vendors it's possible to construct an arithmetic core,
manually place this core, and use this placed core in your design.
For Xilinx this is known as a "Relationally Placed Macro" or RPM. To
accomplish this, the function is first captured as a standalone
module, synthesized, and then floorplanned to lock down the placement.
In the final design, this module is instantiated as a black box, and
its EDIF or NGC file is passed to the place and route tools.
- Floorplan the design
Manually floorplanning a design used to be quite a burden. Newer
tools such as Xilinx PlanAhead allow this to be a much less labor
intensive task. In PlanAhead, one has the ability to define groups of
logic (which are generally defined as the modules in the design
hierarchy, but don't have to be) and to place these within regions on
the die. Since modules are being placed, as opposed to individual
registers and lookup tables, performing a floorplan doesn't actually
take that long. And the benefits can be substantial for certain types
of designs - particularly DSP designs which are organized as a
dataflow. To be effective, the design must be synthesized in a way
that preserves the hierarchy, and it greatly helps if the outputs from
each module are registered.
Please contact us if you have any questions on
this, or to provide feedback. Thank you!