The intent of this note is to describe some of the
more important aspects of HDL design for FPGAs. The target audience
is not the engineer beginning to design in HDL, but those with some experience
who are seeking to further their knowledge. The
examples are in VHDL, although much of the material is applicable to Verilog as well.
The first section is dedicated to Register Transfer Level (RTL)
design, that is, designing to specifically target an FPGA, and the second
section discusses behavioral (testbench) design.
The term Register Transfer Level is used to
describe the flow of signals or transfer of data between registers.
It is used to describe the style of coding that will be synthesized
into a programmable device or an ASIC. As one can use VHDL or Verilog
to describe constructs that can't be synthesized, a subset of the HDL
must be used, and this subset is referred to as RTL. For the purposes
of this note, coding for test is referred to as "behavioral" coding.
- Use of Reset
One of the first considerations in coding RTL is
the use of reset in the design. It used to be that one was best off
using the global dedicated reset line in the FPGA to reset all
registers in the device since it conserves routing resources. This is
still an option in many FPGA families, but with availability of
abundant routing resources in today's FPGAs, it isn't as necessary.
Also, it isn't necessary to reset all of the registers in a design,
for example, pipeline registers don't generally need to be reset.
Still, even if one chooses not to use the global reset line, the use
of reset for most registers is required to get the simulation up and
running. Multiple engineers working on the same device have to agree
on the approach since it affects coding style throughout the design.
One easy way to initialize a signal in an HDL is to use an initial value on the
declaration, but this method is not recommended.
With few exceptions, an initial value in the signal
declaration should not appear in RTL, since it may create a
situation where the RTL simulation does not match the back-annotated (timing) simulation. An
example of this is:
SIGNAL dont_do_this : STD_LOGIC := '1';
An exception to this rule is when there is some small circuit which
needs to be operational during reset and the FPGA guarantees the power up
state of the registers. An example of this would be a counter which is used to elongate
reset and must run when reset is active.
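As a minimal sketch of that exception, the counter below stretches an incoming reset. It assumes the target FPGA guarantees that registers power up to the initial value given in the declaration, and that RESET_IN is already synchronous to CLK. The entity name "reset_stretch", the port names, and the 16-clock length are hypothetical.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
USE IEEE.NUMERIC_STD.ALL;

ENTITY reset_stretch IS
  PORT (CLK       : IN  STD_LOGIC;
        RESET_IN  : IN  STD_LOGIC;   -- raw reset, active high (assumed)
        RESET_OUT : OUT STD_LOGIC);  -- stretched reset, active high
END reset_stretch;

ARCHITECTURE rtl OF reset_stretch IS
  -- The initial value relies on the FPGA's guaranteed power-up state.
  SIGNAL count : UNSIGNED(3 DOWNTO 0) := (OTHERS => '0');
BEGIN
  stretch: PROCESS (CLK)
  BEGIN
    IF CLK'event AND CLK = '1' THEN
      IF RESET_IN = '1' THEN
        count <= (OTHERS => '0');  -- restart the stretch interval
      ELSIF count /= 15 THEN
        count <= count + 1;        -- keeps running while RESET_OUT is active
      END IF;
    END IF;
  END PROCESS stretch;

  -- Reset stays asserted until the counter reaches its terminal count.
  RESET_OUT <= '0' WHEN count = 15 ELSE '1';
END rtl;
```

The counter itself cannot be reset by RESET_OUT, which is why it is the one place where an initial value on the declaration is tolerated.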
The following two sections continue the
discussion of reset implementation issues.
- Reset Polarity
One problem that arises with FPGA design is that some FPGAs have
an active high reset while others have an active low
reset. It would be nice to code in a portable fashion, so that
the code is independent of the FPGA vendor. A solution is to use a
generic which specifies the active level of reset and which is
passed down to all registers throughout the hierarchy. In the
following example, a generic called "RESET_LEVEL" accomplishes this:
- LIBRARY IEEE;
- USE IEEE.STD_LOGIC_1164.ALL;
- ENTITY blatz IS
- -- active level of reset for registers is defined here
- GENERIC (RESET_LEVEL : STD_LOGIC := '1');
- PORT (CLK : IN STD_LOGIC;
- RESET : IN STD_LOGIC;
- CE : IN STD_LOGIC;
- DIN : IN STD_LOGIC;
- DOUT : OUT STD_LOGIC);
- END blatz;
- ARCHITECTURE behavior OF blatz IS
- BEGIN
- a_register: PROCESS (CLK, RESET)
- BEGIN
- IF RESET = RESET_LEVEL THEN
- DOUT <= '0';
- ELSIF CLK'event AND CLK = '1' THEN
- IF CE = '1' THEN -- clock enable
- DOUT <= DIN;
- END IF;
- END IF;
- END PROCESS a_register;
- END behavior;
RESET_LEVEL must be passed down through the hierarchy through the use of
generic mapping. The following is an example of generic mapping in a component
instantiation:
- blatz_c: blatz
- GENERIC MAP (RESET_LEVEL => RESET_LEVEL)
- PORT MAP (CLK => my_clk,
- RESET => my_reset,
- CE => my_ce,
- DIN => my_din,
- DOUT => my_dout);
By implementing reset as a generic which is passed down in this fashion, the value of the RESET_LEVEL
generic can be changed at the top level which will invert the active level of reset for all
registers in the design.
- Use of Asynchronous Resets
Other than the use of asynchronous reset for globally resetting the device,
the use of asynchronous resets is best avoided when possible. Asynchronous resets
create timing paths which are hard for static timing
analyzers to deal with, and can make code less portable.
It's worth noting that some FPGA vendors have started providing a fuse selection for
synchronous vs. asynchronous clear and set inputs on registers. This has the advantage
that it can provide
other synchronous paths for the logic into the register - besides the D input.
For example, consider the case where
two terms A and B will be used to set a register. This would be equivalent to an OR gate
on the D input to the register with A and B as inputs to the OR gate.
But if the asynchronous SET input were actually a synchronous
set, then the OR gate could be eliminated with A driving the D input and B driving the
synchronous set input.
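The A and B case can be sketched as follows. The entity and port names are hypothetical, and whether B is actually mapped onto a dedicated synchronous set input depends on the device and the synthesizer; the code only describes the behavior in a form that allows it.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY sync_set_reg IS
  PORT (CLK : IN  STD_LOGIC;
        A   : IN  STD_LOGIC;
        B   : IN  STD_LOGIC;
        Q   : OUT STD_LOGIC);
END sync_set_reg;

ARCHITECTURE rtl OF sync_set_reg IS
BEGIN
  set_reg: PROCESS (CLK)
  BEGIN
    IF CLK'event AND CLK = '1' THEN
      IF B = '1' THEN
        Q <= '1';  -- candidate for the synchronous set input
      ELSE
        Q <= A;    -- A drives the D input directly, no OR gate needed
      END IF;
    END IF;
  END PROCESS set_reg;
END rtl;
```

Functionally this is the same as Q <= A OR B registered on CLK; only the mapping into the flip-flop differs.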
The FPGA vendors have suggested using this, and it makes
sense; however, it comes with a caveat: the global
chip reset will need to be routed into the synchronous logic. With global reset typically
being one of the largest-fanout nets in the design, this reset signal quickly becomes the critical timing
path in the design. The way to get around this is to break the timing path for global reset
as it enters the chip, for example by registering it. One must also ensure that this doesn't create any problems: that reset
is active for multiple clock cycles and that there isn't any logic that transitions immediately out
of reset.
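One way to break the reset timing path is to register reset as it enters the chip. The following is a sketch only; the entity name "reset_pipe", the port names, and the choice of two stages are assumptions, not a vendor recommendation.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY reset_pipe IS
  PORT (CLK        : IN  STD_LOGIC;
        RESET_PAD  : IN  STD_LOGIC;   -- reset as it enters the chip
        RESET_SYNC : OUT STD_LOGIC);  -- registered reset for internal logic
END reset_pipe;

ARCHITECTURE rtl OF reset_pipe IS
  SIGNAL reset_meta : STD_LOGIC;
BEGIN
  pipe: PROCESS (CLK)
  BEGIN
    IF CLK'event AND CLK = '1' THEN
      reset_meta <= RESET_PAD;   -- first stage samples the external reset
      RESET_SYNC <= reset_meta;  -- second stage starts a fresh timing path
    END IF;
  END PROCESS pipe;
END rtl;
```

Because the internal reset now lags the pin by two clocks, the external reset has to be held for multiple clock cycles, as noted above.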
- Resource Sharing
Resource sharing is the ability of the synthesizer to automatically
reuse a common function, for example one that is
common to two processes. Consider the following
code fragment in a process:
- IF count = 237 AND request = '1' THEN
- start_count <= '1';
- END IF;
- IF count = 237 AND latch_data = '1' THEN
- data_out <= data_in;
- END IF;
Will the synthesizer create one or two comparators
for "count = 237"? The answer depends upon the quality of the
synthesizer.
The general rule for resource sharing is to perform it
manually, when possible, eliminating the possibility that the synthesizer
will screw it up. Our example can be rewritten to use a concurrent signal assignment
for the comparator:
- count_eq237 <= '1' WHEN count = 237 ELSE '0';
and the process rewritten as follows:
- IF count_eq237 = '1' AND request = '1' THEN
- start_count <= '1';
- END IF;
- IF count_eq237 = '1' AND latch_data = '1' THEN
- data_out <= data_in;
- END IF;
It is recommended that this practice be used for all large
arithmetic and relational operations
(adders, subtractors, comparators, etc.)
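The same manual approach applies to adders. The sketch below selects the operand before the addition so that only one adder is described; the entity name, port names, and 8-bit width are hypothetical.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
USE IEEE.NUMERIC_STD.ALL;

ENTITY shared_adder IS
  PORT (SEL : IN  STD_LOGIC;
        A   : IN  UNSIGNED(7 DOWNTO 0);
        B   : IN  UNSIGNED(7 DOWNTO 0);
        C   : IN  UNSIGNED(7 DOWNTO 0);
        SUM : OUT UNSIGNED(7 DOWNTO 0));
END shared_adder;

ARCHITECTURE rtl OF shared_adder IS
  SIGNAL operand : UNSIGNED(7 DOWNTO 0);
BEGIN
  -- Select the second operand first, so only one adder is described.
  operand <= B WHEN SEL = '1' ELSE C;
  SUM     <= A + operand;
END rtl;
```

Writing "A + B" and "A + C" in separate branches and selecting between the results would leave it to the synthesizer to decide whether one or two adders are built.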
- Clocks
Although it is practical and often necessary to have multiple
clocks in a design, one should not
generate local clocks within a clock domain in an FPGA design. This practice is often
used in gate array design to write to registers in the device (such as the control register,
status register, etc.).
In FPGAs, which have flip-flops with clock enables built in,
the clock enable should be used instead. The
clock enable is always the outer IF under the clock edge test:
- a_register: PROCESS (CLK, RESET)
- BEGIN
- IF RESET = RESET_LEVEL THEN
- DOUT <= '0';
- ELSIF CLK'event AND CLK = '1' THEN
- IF CE = '1' THEN -- clock enable
- DOUT <= DIN;
- END IF;
- END IF;
- END PROCESS a_register;
The use of locally generated clocks should be avoided.
- Decoding Register Writes
Generally, one must decode lower order address lines and
a qualified strobe to write to internal
registers within the device (control register, status register, etc.). This could be accomplished in
the following fashion, but is not recommended for higher performance designs:
- ELSIF CLK'event AND CLK = '1' THEN
- IF address = CONTROL_REG_ADDR AND REGWRITE = '1' THEN
- CONTROL_REG <= DATA_IN;
- END IF;
- IF address = STATUS_REG_ADDR AND REGWRITE = '1' THEN
- STATUS_REG <= DATA_IN;
- END IF;
- END IF;
In the above example, if there are many registers, a large fanout is created
on the signal REGWRITE and the address
bus which can be difficult to deal with. A better way to accomplish this is with the following:
- ELSIF CLK'event AND CLK = '1' THEN
- -- Default assignments:
- write_controlreg <= '0';
- write_statusreg <= '0';
- IF REGWRITE = '1' THEN
- CASE address IS
- WHEN CONTROL_REG_ADDR =>
- write_controlreg <= '1';
- WHEN STATUS_REG_ADDR =>
- write_statusreg <= '1';
- WHEN OTHERS => NULL;
- END CASE;
- END IF;
- IF write_controlreg = '1' THEN
- control_reg <= DATA_IN;
- END IF;
- IF write_statusreg = '1' THEN
- status_reg <= DATA_IN;
- END IF;
- END IF;
Notice that the IF statements which test the address have been changed to the
more efficient CASE statement, since a priority encoder is not desired on the address.
The signals "write_controlreg" and "write_statusreg"
become clock enables for the data registers "control_reg" and "status_reg", and
can be physically located adjacent to the flip-flops which comprise the register. This allows for a
higher operating frequency and less routing congestion.
Note that an additional pipeline stage has been added to the write enable path, and this must be taken into account;
it may be necessary to pass the input data "DATA_IN" through a pipeline register to compensate.
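A sketch of that compensation for a single register is shown here. The entity name "reg_write", the bus widths, the address value, and the signal "data_in_d" are assumptions made for illustration.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY reg_write IS
  GENERIC (RESET_LEVEL : STD_LOGIC := '1');
  PORT (CLK         : IN  STD_LOGIC;
        RESET       : IN  STD_LOGIC;
        REGWRITE    : IN  STD_LOGIC;
        ADDRESS     : IN  STD_LOGIC_VECTOR(3 DOWNTO 0);
        DATA_IN     : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
        CONTROL_REG : OUT STD_LOGIC_VECTOR(7 DOWNTO 0));
END reg_write;

ARCHITECTURE rtl OF reg_write IS
  CONSTANT CONTROL_REG_ADDR : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0000";
  SIGNAL write_controlreg : STD_LOGIC;
  SIGNAL data_in_d        : STD_LOGIC_VECTOR(7 DOWNTO 0);
BEGIN
  -- Pipeline register for the data; it needs no reset.
  data_pipe: PROCESS (CLK)
  BEGIN
    IF CLK'event AND CLK = '1' THEN
      data_in_d <= DATA_IN;
    END IF;
  END PROCESS data_pipe;

  write_regs: PROCESS (CLK, RESET)
  BEGIN
    IF RESET = RESET_LEVEL THEN
      write_controlreg <= '0';
      CONTROL_REG      <= (OTHERS => '0');
    ELSIF CLK'event AND CLK = '1' THEN
      write_controlreg <= '0';           -- default assignment
      IF REGWRITE = '1' AND ADDRESS = CONTROL_REG_ADDR THEN
        write_controlreg <= '1';         -- decoded one clock early
      END IF;
      IF write_controlreg = '1' THEN
        CONTROL_REG <= data_in_d;        -- data delayed to match the enable
      END IF;
    END IF;
  END PROCESS write_regs;
END rtl;
```

The data register is kept in its own process so that it is not tied to the reset of the control logic.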
- State Machines
State machine design in VHDL is more a matter of style preference than rules.
What is presented here is one preferred style and some recommendations based upon personal preference.
Although many synthesizer vendors recommend the "two process" state machine,
this approach
generates non-registered outputs which can glitch. Other than that caution (which can be addressed
by registering the outputs in a clocked process) and the fact that such state machines are a bit more difficult to
read, it is an acceptable method of coding.
Another method is to create a single clocked process containing
all state assignments and
output assignments. Since all signal assignments in a clocked process become flip-flops, the
outputs from the state machine are registered and will not glitch. An example of such a state
machine is shown here:
- ARCHITECTURE behavior OF state_machine IS
- TYPE STATETYPE IS (IDLE, GETBUS, HAVEBUS);
- SIGNAL state: STATETYPE;
- BEGIN
- sm_process: PROCESS (CLK, RESET)
- BEGIN
- IF (RESET = RESET_LEVEL) THEN
- BUSREQ <= '0';
- BUSFREE <= '0';
- state <= IDLE;
- ELSIF CLK'EVENT AND CLK = '1' THEN
- -- define inactive state for all outputs
- BUSREQ <= '0';
- BUSFREE <= '0';
- CASE (state) IS
- WHEN IDLE =>
- IF MEMREQ = '1' THEN
- state <= GETBUS;
- END IF;
- WHEN GETBUS =>
- BUSREQ <= '1';
- IF BUSGNT = '1' THEN
- state <= HAVEBUS;
- END IF;
- WHEN HAVEBUS =>
- IF MEMREQ = '0' THEN
- BUSFREE <= '1';
- state <= IDLE;
- END IF;
- END CASE;
- END IF;
- END PROCESS sm_process;
- END behavior;
In this state machine, the state bits are created as an enumerated type
by the following declarations:
- TYPE STATETYPE IS (IDLE, GETBUS, HAVEBUS);
- SIGNAL state: STATETYPE;
The advantage of using an enumerated type is that the values for STATETYPE
(that is, IDLE, GETBUS, and HAVEBUS)
are displayed in the simulator waveform window. The synthesizers provide a mechanism to
control the assignment of enumerated type to actual values ("one-hot", binary, random, etc.).
One can also use constant declarations to define the state bit assignments manually,
if desired.
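A manual assignment with constants might look like the following sketch. The package name and the one-hot values are illustrative assumptions; the state names are those of the example above.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

PACKAGE state_encoding IS
  SUBTYPE STATETYPE IS STD_LOGIC_VECTOR(2 DOWNTO 0);
  -- One-hot assignments chosen by hand (values are illustrative).
  CONSTANT IDLE    : STATETYPE := "001";
  CONSTANT GETBUS  : STATETYPE := "010";
  CONSTANT HAVEBUS : STATETYPE := "100";
END state_encoding;
```

With this approach the CASE statement needs a WHEN OTHERS branch, since the vector has more possible values than there are states, and the simulator shows bit patterns rather than state names.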
As it turns out, unless all of the outputs from the state
machine (in our example, BUSREQ
and BUSFREE) are assigned to a value in every state of the state machine,
the state machine must generate
"extra" logic to maintain the outputs for those states in which the outputs aren't assigned
- the state machine will not be fully optimized. But assigning all outputs in every state
can make the state machine quite unreadable -
especially in a large state machine.
A solution for this, shown in the example above, is to have
default assignments at the start of the
clocked portion of the process and to override these in specific states. In our example above,
BUSREQ and BUSFREE are normally low and are asserted high when active. They are
assigned to '0' at the beginning of the clocked process, and assigned to '1' in the states where
they are to be active. This allows all of the outputs to be assigned a value in all states, and for
the state machine to be fully optimized, while still being quite readable.
It's worth noting the assignments for the two outputs have different dependencies and will exhibit
different timing. The output "BUSREQ" is only dependent upon being in the state GETBUS. If the
state bits are one-hot encoded, then BUSREQ will only depend upon one flip-flop. The "BUSFREE" output
is dependent upon being in the state HAVEBUS, and is also dependent upon the input MEMREQ.
- Project Packages
Although it always seems like extra overhead, software programmers
learned many years ago that it pays to use symbolic definitions for
commonly used constants. For example, addresses for registers, bit
assignments and other constants should
be defined in one project package and then included in all modules. A
separate package can be used for simulation constants. It's useful to
have the simulation package reference the project package when
necessary - all
constants should be defined in only one place. Therefore when a
constant needs to be changed, it is changed in one place and
simulation and synthesis will simply work.
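A minimal sketch of such a pair of packages follows. The package names, address values, and bit names are hypothetical.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

PACKAGE project_pkg IS
  -- Register addresses (values are illustrative)
  CONSTANT CONTROL_REG_ADDR : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0000";
  CONSTANT STATUS_REG_ADDR  : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0001";
  -- Bit assignments within the control register
  CONSTANT CTRL_ENABLE_BIT  : NATURAL := 0;
  CONSTANT CTRL_IRQ_EN_BIT  : NATURAL := 1;
END project_pkg;

LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
USE WORK.project_pkg.ALL;

PACKAGE sim_pkg IS
  -- Simulation-only constants; addresses are taken from project_pkg
  -- rather than being repeated here.
  CONSTANT CLK_PERIOD      : TIME := 10 NS;
  CONSTANT FIRST_TEST_ADDR : STD_LOGIC_VECTOR(3 DOWNTO 0) := CONTROL_REG_ADDR;
END sim_pkg;
```

Each module then needs only "USE WORK.project_pkg.ALL;" ahead of its entity to see the constants.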
- Style
It is important to adopt a style that is meaningful to the designer, as well as
to others that might have to read their code. There is no right or wrong in this category;
it is simply a matter of preference. Since VHDL is case-insensitive, one has the option
of using case to increase readability. My personal preference is to use upper case for VHDL keywords,
for constants, and for signals declared in ports. Internal signals are always lower case in
this scheme. By using upper case for ports and lower case for signals, it's always easy to
determine if a signal is an input or output to the module or is an internal signal.
Another style preference is the location for component declarations. When an entity/architecture
pair contains many declared components, they tend to make the module less readable. An option
to alleviate this is to put these declarations into a package - either the project package or
some other package, and then import the package into the entity.
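For instance, the component declaration for the "blatz" entity from the reset polarity section could be moved into a package like this. The package name "component_pkg" is hypothetical, and the port types are written as STD_LOGIC.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

PACKAGE component_pkg IS
  COMPONENT blatz
    GENERIC (RESET_LEVEL : STD_LOGIC := '1');
    PORT (CLK   : IN  STD_LOGIC;
          RESET : IN  STD_LOGIC;
          CE    : IN  STD_LOGIC;
          DIN   : IN  STD_LOGIC;
          DOUT  : OUT STD_LOGIC);
  END COMPONENT;
END component_pkg;
```

An architecture that instantiates blatz then adds "USE WORK.component_pkg.ALL;" ahead of its entity and omits the local declaration.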
The term behavioral design as used in this note refers to VHDL code
used to test the FPGA part of the design, or the RTL.
- Behavioral versus RTL Style
When writing RTL the designer is limited to using
a small subset of the HDL to be compatible with available hardware
structures. For example, it's trivial to describe registers that operate from
both edges of a clock, but these aren't available in hardware (although they would
come in handy sometimes!). In contrast to RTL, one can use the entire language when
writing a behavioral description.
While this is true, the end goal when creating a behavioral model is generally
to model a hardware device.
And since one is trying to model the
device in a fashion that exactly duplicates the behavior of the
device, it is often best to use
some of the same "clocked process" techniques that one would use in an RTL design.
This isn't to say that non-synthesizable
constructs can't be used,
but if the device being modeled
contains a synchronous state machine, one might be best off using a synchronous state machine
in the behavioral description rather than a process with a series of WAIT FOR statements.
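A sketch of this mixed style is shown below: a clocked state machine supplies the cycle-level behavior, while a non-synthesizable AFTER clause models the output delay. The device, its three states, and the generic tCLK_TO_RDY are invented for illustration.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY device_model IS
  GENERIC (tCLK_TO_RDY : TIME := 5 NS);  -- hypothetical output delay
  PORT (CLK : IN  STD_LOGIC;
        REQ : IN  STD_LOGIC;
        RDY : OUT STD_LOGIC);
END device_model;

ARCHITECTURE behavior OF device_model IS
  TYPE STATETYPE IS (IDLE, BUSY, DONE);
  SIGNAL state : STATETYPE := IDLE;  -- an initial value is fine in a model
BEGIN
  model: PROCESS (CLK)
  BEGIN
    IF CLK'event AND CLK = '1' THEN
      CASE state IS
        WHEN IDLE =>
          RDY <= '0' AFTER tCLK_TO_RDY;
          IF REQ = '1' THEN
            state <= BUSY;
          END IF;
        WHEN BUSY =>
          state <= DONE;
        WHEN DONE =>
          RDY   <= '1' AFTER tCLK_TO_RDY;  -- delay is not synthesizable
          state <= IDLE;
      END CASE;
    END IF;
  END PROCESS model;
END behavior;
```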
- Complete Testbenches
When starting to model a testbench, there is a tendency to leave out the
static signals and clocks
and to force these from the simulator. If you think about it, it isn't any more typing to include these
in the testbench and make the testbench complete. In this way, the simulation can run
standalone - the top level of the testbench is read into the simulator and simulation
time is advanced, without forcing any signals.
This has the advantage of easy re-use; if some months later the same
simulation needs to be executed, there is no need to figure out which static signals
need to be forced or how to force
the clock. Another (minor) advantage is in code portability; code written in the HDL is portable, whereas
code stimulus written in the simulator language is specific to that simulator.
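A skeleton of such a standalone testbench might look like the following. The clock period, reset duration, and the static signal "mode_sel" are assumptions; the device under test is left out.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY testbench IS
END testbench;

ARCHITECTURE behavior OF testbench IS
  CONSTANT CLK_PERIOD  : TIME      := 20 NS;  -- assumed 50 MHz clock
  CONSTANT RESET_LEVEL : STD_LOGIC := '1';
  SIGNAL clk      : STD_LOGIC := '0';
  SIGNAL reset    : STD_LOGIC := RESET_LEVEL;
  SIGNAL mode_sel : STD_LOGIC := '0';         -- example static signal
BEGIN
  -- Free-running clock
  clk <= NOT clk AFTER CLK_PERIOD / 2;

  -- Hold reset for several clocks, then release it
  reset <= RESET_LEVEL, NOT RESET_LEVEL AFTER 5 * CLK_PERIOD;

  -- The device under test would be instantiated here
END behavior;
```

Since the clock never stops, the run length is set in the simulator or by an assertion of severity FAILURE in the testbench.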
- Model Timing
An effective behavioral model will contain the
worst case timing from the device or bus specification. These should be declared
as generics for easy modification. A good practice is to use
the exact description from the datasheet as the generic name. An example from a model of
TI's HPI bus is:
- GENERIC (RESET_LEVEL : STD_LOGIC := '1'; -- reset level of regs
- MEM_LENGTH : INTEGER := 2**19; -- length of HPI mem array
- EN_RAND_READ_TIMING : BOOLEAN := TRUE; -- random read
- tREADACCESS : TIME := 20 NS;
- tREADACCESS_MAX : TIME := 180 NS; -- max for random read
- tREADACCESS_MIN : TIME := 20 NS; -- min for random read
- td_DSL_HYL : TIME := 12 NS;
- tv_HYH_HDV : TIME := 7 NS;
- td_DSL_HDD : TIME := 20 NS;
- td_DSH_HYL : TIME := 12 NS;
- th_DSH_HDVR : TIME := 1 NS;
- tHPIA_C_READ : TIME := 20 NS;
- td_DSH_HYH_MIN : TIME := 30 NS;
- td_DSH_HYH_MAX : TIME := 83 NS;
- tsu_HBV_DSL : TIME := 5 NS;
- twDSL : TIME := 30 NS;
- twDSH : TIME := 10 NS;
- tsu_HDV_DSHW : TIME := 10 NS;
- th_DSL_HBV : TIME := 5 NS
- );
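Once declared, the generics are used directly in the model's signal assignments. The fragment below is a simplified sketch of that idea and is not taken from the datasheet: it assumes td_DSL_HYL is the delay from strobe low to HRDY low and td_DSH_HYH_MAX the maximum delay from strobe high to HRDY high, and the entity and port names are hypothetical.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY hrdy_model IS
  GENERIC (td_DSL_HYL     : TIME := 12 NS;
           td_DSH_HYH_MAX : TIME := 83 NS);
  PORT (HSTROBE : IN  STD_LOGIC;
        HRDY    : OUT STD_LOGIC := '1');
END hrdy_model;

ARCHITECTURE behavior OF hrdy_model IS
BEGIN
  drive_hrdy: PROCESS (HSTROBE)
  BEGIN
    IF HSTROBE = '0' THEN
      HRDY <= '0' AFTER td_DSL_HYL;      -- assumed: strobe low to not-ready
    ELSIF HSTROBE = '1' THEN
      HRDY <= '1' AFTER td_DSH_HYH_MAX;  -- assumed: strobe high to ready
    END IF;
  END PROCESS drive_hrdy;
END behavior;
```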
- Timing Checks
An effective model provides setup, hold, pulse
width, and other
timing checks. Timing checks can be
performed with boolean tests, or with ASSERT statements. Keep in mind that the assertion
statement is "bass ackwards": the message is reported when the asserted condition is false. Placing a NOT in front of the test
changes it to a normal "test for true" condition.
Assertion statements have the advantage of providing a
mechanism for stopping the simulation
on error. The simulator can be made to stop on the assertion level - NOTE, WARNING,
ERROR, or FAILURE. The user declared string is always written to the screen when the test is
false.
Some examples are shown below. In these examples, the runtime function
"now" returns the current simulation time which is stored in the signals hstrobe_falling
and hstrobe_rising and used for the timing checks.
- timing_checks : PROCESS (hstrobe_i)
- BEGIN
- -- Falling edge checks
- IF hstrobe_i'event AND hstrobe_i = '0' THEN
- IF now - hstrobe_rising < twDSH THEN
- ASSERT FALSE REPORT "HOST STROBE PULSE WIDTH " &
- "HIGH VIOLATION" SEVERITY WARNING;
- END IF;
- ASSERT HRNW'stable(tsu_HBV_DSL)
- REPORT "HRNW SETUP TO FALLING EDGE OF THE " &
- "EARLIER OF HDS0, HDS1, OR HCS"
- SEVERITY WARNING;
- ASSERT HCNTL'stable(tsu_HBV_DSL)
- REPORT "HCNTL SETUP TO FALLING EDGE OF THE " &
- "EARLIER OF HDS0, HDS1, OR HCS"
- SEVERITY WARNING;
- hstrobe_falling <= now; -- capture time of falling edge
- END IF;
- -- Rising edge checks
- IF hstrobe_i'event AND hstrobe_i = '1' THEN
- IF now - hstrobe_falling < twDSL THEN
- ASSERT FALSE REPORT "HOST STROBE PULSE WIDTH " &
- "LOW VIOLATION" SEVERITY WARNING;
- END IF;
- IF hrnw_l = '0' THEN
- ASSERT hd_pad'STABLE(tsu_HDV_DSHW)
- REPORT "HD SETUP TO RISING EDGE OF THE EARLIER OF HDS0, HDS1, HCS"
- SEVERITY WARNING;
- END IF;
- ASSERT hrdy_i = '1' REPORT "HRDY LOW AT RISING " &
- "EDGE OF EARLIER OF HCS, HDS0, OR HDS1"
- SEVERITY WARNING;
- hstrobe_rising <= now; -- capture time of rising edge
- END IF;
- END PROCESS timing_checks;
- Protocol Checks
Another useful feature to include in a behavioral model is
protocol checking. Protocol checks are
simply conditionals that warn if non-timing related bus violations occur. An example of this would
be the premature ending of a bus cycle, such as on the TI HPI bus. A bus cycle with HCS and
HDS0/HDS1 active should not end with HRDY low. The last assertion statement in the example
above looks for this condition.
- Handling Large Counters
With some applications which process large amounts of data in a sequential fashion, or
simply counters that must wait for an exceptionally long time, the
simulation time becomes large enough to be prohibitive. Examples of this are counters
for frame-based image or video applications, or counters which wait a stabilization delay for DRAMs
at system reset.
An easy solution to this problem involves passing down generics
from the top level as described in the section on reset polarity above. This solution is to assign
the terminal counts of the counters with generics. The default values for these generics at the top
level of the chip contain the correct values for synthesis, but are overridden in the testbench
with smaller values for simulation.
For example, if a graphics engine has to process 1280 pixels X 1024 scan lines for each frame, the
time required to simulate the 1.3 million pixels is significant. If the
generics are overridden with
100 pixels X 10 scan lines in the testbench, then it's possible to simulate multiple frames. Of course,
one would want to process some number of full frames, as well as many smaller frames, for completeness.
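A sketch of a counter whose terminal counts are set by generics follows. The entity name "frame_counter", the generic names, and the ports are hypothetical; the defaults carry the full-size values intended for synthesis.

```vhdl
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY frame_counter IS
  GENERIC (RESET_LEVEL     : STD_LOGIC := '1';
           PIXELS_PER_LINE : POSITIVE  := 1280;  -- synthesis defaults
           LINES_PER_FRAME : POSITIVE  := 1024);
  PORT (CLK        : IN  STD_LOGIC;
        RESET      : IN  STD_LOGIC;
        PIXEL_EN   : IN  STD_LOGIC;
        FRAME_DONE : OUT STD_LOGIC);
END frame_counter;

ARCHITECTURE rtl OF frame_counter IS
  SIGNAL pixel_count : INTEGER RANGE 0 TO PIXELS_PER_LINE - 1;
  SIGNAL line_count  : INTEGER RANGE 0 TO LINES_PER_FRAME - 1;
BEGIN
  count: PROCESS (CLK, RESET)
  BEGIN
    IF RESET = RESET_LEVEL THEN
      pixel_count <= 0;
      line_count  <= 0;
      FRAME_DONE  <= '0';
    ELSIF CLK'event AND CLK = '1' THEN
      FRAME_DONE <= '0';                 -- default assignment
      IF PIXEL_EN = '1' THEN
        IF pixel_count = PIXELS_PER_LINE - 1 THEN
          pixel_count <= 0;
          IF line_count = LINES_PER_FRAME - 1 THEN
            line_count <= 0;
            FRAME_DONE <= '1';           -- one-clock pulse per frame
          ELSE
            line_count <= line_count + 1;
          END IF;
        ELSE
          pixel_count <= pixel_count + 1;
        END IF;
      END IF;
    END IF;
  END PROCESS count;
END rtl;
```

In the testbench the instantiation would carry "GENERIC MAP (PIXELS_PER_LINE => 100, LINES_PER_FRAME => 10)", while the synthesis top level leaves the defaults alone.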
Please contact us if you have any questions on
this, or to provide feedback. Thank you!