Whenever a timing problem is encountered, a decision must be made about
how best to handle it. There are many methods, several of which are
described below, but one method often recommended by the tool vendors
is to change the synthesis and/or place and route tool settings, for
example by setting the synthesizer to favor speed over area, or by
increasing the place and route effort level. While these changes may
work, in my opinion they generally don't fix the underlying problem; a
path that only passes because of an aggressive tool setting quite often
comes back to haunt you later in the design, when new logic is added or
the tools place things differently. If the problem can be fixed easily
in the source code or by adding a constraint, it is almost always best
to choose that approach over changing the tool settings, since it
eliminates the problem at its source.
The following are some techniques used to fix timing problems.
- Add a pipeline stage
Depending on the nature of the design, one of the simplest ways to
break up a long combinatorial path is to add a pipeline stage. Because
this adds a clock cycle of latency, it is generally easy to do in DSP
designs with a dataflow architecture, and much harder in designs whose
cycle timing is fixed by an external protocol, such as a PCI bus
interface. Keep in mind that if the synthesizer can move registers
during timing optimization ("register balancing"), you can add
pipeline stages back to back and let the synthesizer move the
registers into the combinatorial path.
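The effect of register balancing can be pictured with a toy model. Treat the long path as a single chain of logic elements with fixed delays, then search for the register positions that minimize the worst stage delay. The delay values below are hypothetical. A real synthesizer retimes a full netlist graph and also accounts for clock-to-out and setup times, so this is only a sketch of the idea, not how any particular tool implements it.

```python
from functools import lru_cache
from itertools import accumulate
from typing import List, Tuple


def stage_delays(delays: List[float], cuts: List[int]) -> List[float]:
    """Return the logic delay of each pipeline stage.

    `delays` are per-element delays (ns) along one combinatorial chain;
    a register is placed just before delays[i] for every i in `cuts`.
    """
    bounds = [0] + sorted(cuts) + [len(delays)]
    return [sum(delays[a:b]) for a, b in zip(bounds, bounds[1:])]


def balance(delays: List[float], registers: int) -> List[int]:
    """Choose register positions that minimize the worst stage delay.

    Toy stand-in for register balancing: exhaustive search over the
    contiguous splits of a single chain (assumed model, not a tool's).
    """
    if not 0 <= registers < len(delays):
        raise ValueError("need 0 <= registers < number of logic elements")
    prefix = [0.0] + list(accumulate(delays))

    @lru_cache(maxsize=None)
    def best(end: int, stages: int) -> Tuple[float, Tuple[int, ...]]:
        # Best (worst stage delay, cut positions) for delays[:end] split into `stages`.
        if stages == 1:
            return prefix[end], ()
        result: Tuple[float, Tuple[int, ...]] = (float("inf"), ())
        for cut in range(stages - 1, end):
            worst, cuts = best(cut, stages - 1)
            candidate = max(worst, prefix[end] - prefix[cut])
            if candidate < result[0]:
                result = (candidate, cuts + (cut,))
        return result

    return list(best(len(delays), registers + 1)[1])


if __name__ == "__main__":
    delays = [2.0, 1.5, 3.0, 1.0, 2.5, 2.0]  # hypothetical ns per logic level
    print(stage_delays(delays, []))  # [12.0]: extra register left at the output
    cuts = balance(delays, registers=1)
    print(cuts, stage_delays(delays, cuts))  # [3] [6.5, 5.5]
```

In this model, leaving the extra register at the output does nothing for the 12 ns path. Moving it into the chain cuts the worst stage to 6.5 ns.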
- Add multicycle constraints
If the offending timing path actually has multiple cycles in which to
complete, a multicycle path constraint can be used to constrain the
path to its actual required timing. For example, if the output of an
accumulator clocked at 100 MHz is only needed every other clock cycle,
then a multicycle constraint can instruct the place and route tools to
time the path against two clock periods, or 20 ns, instead of one. The
constraint is only safe if the logic really does wait that many cycles
before using the result; otherwise it hides a real failure. It may also
be possible to modify the logic so that a long path genuinely has
multiple cycles, for example by loading its registers only on
alternate clock cycles, and then add the constraint to eliminate the
violation.
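The arithmetic behind the example is simple: a path allowed N cycles of a clock with frequency f has a budget of N/f. The sketch below also models one common way to make a path genuinely multicycle. Both ends of the path are gated with a clock enable that is asserted once every N clocks. The 14 ns path delay is a made-up number, and the counter-based enable is an assumed design, one of several possible.

```python
from typing import List


def budget_ns(clock_mhz: float, cycles: int = 1) -> float:
    """Time available (ns) to a path that is allowed `cycles` clock periods."""
    return cycles * 1000.0 / clock_mhz


def enable_pattern(cycles: int, n_clocks: int) -> List[bool]:
    """Clock enable from a free-running mod-`cycles` counter (assumed scheme).

    If both the source and destination registers load only when the enable
    is high, data launched on one enabled edge is captured on the next one,
    `cycles` clocks later, which is what justifies the multicycle constraint.
    """
    return [t % cycles == cycles - 1 for t in range(n_clocks)]


if __name__ == "__main__":
    path_ns = 14.0  # hypothetical delay of the accumulator output path
    print(budget_ns(100) - path_ns)     # -4.0: fails as a single-cycle path
    print(budget_ns(100, 2) - path_ns)  # 6.0: meets a two-cycle constraint
    print([t for t, en in enumerate(enable_pattern(2, 8)) if en])  # [1, 3, 5, 7]
```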
- Use duplication for overloaded nets
For nets that have a high fanout, duplicating the source of the net
both reduces the delay, since each copy drives fewer loads, and helps
with routing to different areas of the chip. Duplication can be done
either manually or through the tools. To duplicate manually, it is
necessary to instruct the tools not to remove the redundant logic,
since synthesis will otherwise merge the identical copies back into
one. Another method is to hide the redundant logic in a different
level of the hierarchy (with the hierarchy preserved during
synthesis). Automatic duplication can be accomplished by lowering the
maximum fanout constraint on the particular net for synthesis, forcing
the synthesizer to duplicate its driver.
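The sketch below is a rough model of what a maximum fanout constraint asks the synthesizer to do. It computes how many copies of a driver are needed and gives each copy a group of loads that sit near one another. The load coordinates and the sort-by-position grouping are assumptions for illustration. Real tools make this decision with their own placement and timing information.

```python
import math
from typing import List, Tuple

# (load name, x, y): hypothetical placement coordinates for each load.
Load = Tuple[str, int, int]


def duplicate(loads: List[Load], max_fanout: int) -> List[List[Load]]:
    """Split a net's loads across enough driver copies to honor max_fanout.

    Loads are sorted by position (an assumed heuristic) so each copy drives
    a roughly compact group and can be placed near its own loads.
    """
    if max_fanout < 1:
        raise ValueError("max_fanout must be at least 1")
    if not loads:
        return []
    copies = math.ceil(len(loads) / max_fanout)
    ordered = sorted(loads, key=lambda load: (load[1], load[2]))
    size = math.ceil(len(ordered) / copies)  # never exceeds max_fanout
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


if __name__ == "__main__":
    loads = [(f"ff{i}", i % 5, i // 5) for i in range(10)]  # hypothetical loads
    groups = duplicate(loads, max_fanout=4)
    print([len(g) for g in groups])  # [4, 4, 2]: three copies of the driver
```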
- Prioritize routing
When the above methods don't work or don't apply, another method that
might alleviate the problem is to prioritize the routing of the
failing net or nets. Some place and route tools provide a way to do
this directly, while others don't. If there isn't a direct method, the
same effect can be achieved by applying a constraint to the net that
is slightly tighter than the clock period. If the target frequency is
100 MHz (a 10 ns period), then applying a constraint of 9.8 ns to
those nets should cause the tools to route them first.
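The 9.8 ns figure is just the 10 ns period minus a small margin. The sketch below applies that tighter budget only to the failing nets, while every other net keeps the plain clock period. Integer picoseconds avoid floating-point rounding. The net names are hypothetical, and the 200 ps default only reproduces the 9.8 ns example; it is not a recommended value.

```python
from typing import Dict, Iterable


def period_ps(clock_mhz: int) -> int:
    """Clock period in picoseconds, rounded down (i.e. slightly tighter)."""
    return 1_000_000 // clock_mhz


def net_budgets(nets: Iterable[str], failing: Iterable[str],
                clock_mhz: int, margin_ps: int = 200) -> Dict[str, int]:
    """Per-net delay budgets in ps; failing nets get a slightly tighter one.

    The tighter budget makes those nets look the most critical, so a
    timing-driven router should tend to handle them first.
    """
    period = period_ps(clock_mhz)
    if not 0 < margin_ps < period:
        raise ValueError("margin must be positive and smaller than the period")
    tight = set(failing)
    return {net: period - margin_ps if net in tight else period for net in nets}


if __name__ == "__main__":
    # Hypothetical net names; only acc_sum is failing timing.
    print(net_budgets(["acc_sum", "ctrl_en", "data_q"], ["acc_sum"], clock_mhz=100))
    # {'acc_sum': 9800, 'ctrl_en': 10000, 'data_q': 10000}
```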
- Use low skew routing resources
Occasionally one has to deal with a net with a very high fanout. One
way to speed it up and help route it across the chip is to manually
assign it to a low skew routing resource. This is done with a
constraint or with an attribute in the source code.
- Consider using a tool-generated core
For a particularly difficult function, such as one that uses a long
carry chain as in an accumulator, an easy fix might be to use a
tool-generated core. Many FPGA vendors' core generators provide
accumulators, adder/subtracters, comparators, etc. These cores are not
only optimized in structure, but sometimes also include relative
placement constraints that minimize routing delays.
- Construct a custom core
With some vendors it's possible to construct an arithmetic core,
manually place it, and use this placed core in your design. For Xilinx
this is known as a "Relationally Placed Macro" or RPM. To build one,
the function is first captured as a standalone module, synthesized,
and then floorplanned to lock down the placement. In the final design,
this module is instantiated as a black box, and its EDIF or NGC
netlist is passed to the place and route tools.
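What makes an RPM useful is that its placement is relative. The cells keep fixed offsets from one another (Xilinx expresses these with RLOC constraints), and only the macro as a whole moves. The sketch below models that idea on a made-up site grid. The coordinate scheme, grid size, and cell names are hypothetical and do not follow any vendor's site naming.

```python
from typing import Dict, Tuple

Site = Tuple[int, int]  # (column, row) on a hypothetical site grid


def place_macro(rlocs: Dict[str, Site], origin: Site, grid: Site) -> Dict[str, Site]:
    """Translate a macro's relative locations to absolute sites.

    The offsets between cells never change; only the origin moves, so the
    short connections inside the macro are preserved wherever it lands.
    """
    ox, oy = origin
    cols, rows = grid
    placed: Dict[str, Site] = {}
    for cell, (dx, dy) in rlocs.items():
        x, y = ox + dx, oy + dy
        if not (0 <= x < cols and 0 <= y < rows):
            raise ValueError(f"{cell} falls off the device at {(x, y)}")
        placed[cell] = (x, y)
    return placed


if __name__ == "__main__":
    # Hypothetical 4-bit adder, one cell per row, stacked in a single column.
    rlocs = {f"add_bit{i}": (0, i) for i in range(4)}
    print(place_macro(rlocs, origin=(10, 20), grid=(50, 100)))
    print(place_macro(rlocs, origin=(30, 5), grid=(50, 100)))  # same shape, moved
```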
- Floorplan the design
Manually floorplanning a design used to be quite a burden. Newer tools
such as Xilinx PlanAhead make it a much less labor-intensive task. In
PlanAhead, one can define groups of logic (generally the modules in
the design hierarchy, though they don't have to be) and place these
groups within regions on the die. Since whole modules are being
placed, rather than individual registers and lookup tables, a
floorplan doesn't actually take that long to create. The benefits can
be substantial for certain types of designs, particularly DSP designs
that are organized as a dataflow. To be effective, the design must be
synthesized in a way that preserves the hierarchy, and it greatly
helps if the outputs of each module are registered, so that the
routing delay between regions isn't added to a path that already
contains combinatorial logic.
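A floorplan is easier to iterate on when simple mistakes are caught before place and route. The sketch below checks a module-level plan for two of them: regions that overlap, and regions too small for the logic assigned to them. The tile grid, the LUTs-per-tile figure, and the 80% utilization ceiling are all hypothetical. This is not PlanAhead's data model or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    """Rectangle on a hypothetical tile grid; bounds are inclusive."""
    x0: int
    y0: int
    x1: int
    y1: int

    def tiles(self) -> int:
        return (self.x1 - self.x0 + 1) * (self.y1 - self.y0 + 1)

    def overlaps(self, other: "Region") -> bool:
        return not (self.x1 < other.x0 or other.x1 < self.x0 or
                    self.y1 < other.y0 or other.y1 < self.y0)


def check_floorplan(luts: Dict[str, int], plan: Dict[str, Region],
                    luts_per_tile: int, max_util: float = 0.8) -> List[str]:
    """Return a list of problems; an empty list means the plan looks sane."""
    problems: List[str] = []
    for module, need in luts.items():
        region = plan.get(module)
        if region is None:
            problems.append(f"{module}: no region assigned")
            continue
        capacity = region.tiles() * luts_per_tile
        if need > max_util * capacity:  # leave headroom for routing
            problems.append(f"{module}: {need} LUTs exceed {max_util:.0%} of {capacity}")
    names = sorted(plan)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if plan[a].overlaps(plan[b]):
                problems.append(f"{a} overlaps {b}")
    return problems


if __name__ == "__main__":
    # Hypothetical dataflow chain: fir -> fft -> out, placed left to right.
    luts = {"fir": 1500, "fft": 3000, "out": 400}
    plan = {"fir": Region(0, 0, 19, 19), "fft": Region(20, 0, 39, 19),
            "out": Region(40, 0, 44, 19)}
    print(check_floorplan(luts, plan, luts_per_tile=8))
    # ['fft: 3000 LUTs exceed 80% of 3200']
```

Laying the regions out in dataflow order, as here, mirrors the point above that dataflow designs benefit most. Each module's region sits next to the modules it talks to.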