Notes on HDL Design for FPGAs

This note describes some of the more important aspects of HDL design for FPGAs. It is aimed not at the engineer just beginning to design in HDL, but at those who already have some experience and want to further their knowledge. The examples are in VHDL, although much of the material applies to Verilog as well. The first section covers Register Transfer Level (RTL) design, that is, code written specifically to target an FPGA; the second section discusses behavioral (testbench) design.

I. Register Transfer Level (RTL) Design

The term Register Transfer Level is used to describe the flow of signals or transfer of data between registers. It is used to describe the style of coding that will be synthesized into a programmable device or an ASIC. As one can use VHDL or Verilog to describe constructs that can't be synthesized, a subset of the HDL must be used, and this subset is referred to as RTL. For the purposes of this note, coding for test is referred to as "behavioral" coding.

  1. Use of Reset

    One of the first considerations in coding RTL is the use of reset in the design. It used to be that one was best off using the global dedicated reset line in the FPGA to reset all registers in the device, since it conserves routing resources. This is still an option in many FPGA families, but with the abundant routing resources in today's FPGAs it isn't as necessary. Nor is it necessary to reset every register in a design; pipeline registers, for example, generally don't need to be reset. Still, even if one chooses not to use the global reset line, most registers need a reset to get the simulation up and running. Multiple engineers working on the same device have to agree on the approach, since it affects coding style throughout the design.

    One easy way to initialize a signal in an HDL, and this is a method that is not recommended, is to use an initial value on the declaration. With few exceptions, a default assignment in the signal declaration should not appear in RTL, since it may create a situation where the RTL simulation does not match the back-annotated (timing) simulation. An example of this is:

           SIGNAL dont_do_this : STD_LOGIC := '1';

    An exception to this rule is when there is some small circuit which needs to be operational during reset and the FPGA guarantees the power up state of the registers. An example of this would be a counter which is used to elongate reset and must run when reset is active.
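
    As an illustration of that exception, here is a minimal sketch of a reset-stretching counter. The entity name reset_stretch, the generic STRETCH_LEN, and the port names are hypothetical. The sketch assumes that the target FPGA guarantees the power-up state of its registers and that RESET_IN is already synchronous to CLK.

    ```vhdl
    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;

    ENTITY reset_stretch IS
      GENERIC (RESET_LEVEL : STD_LOGIC := '1';   -- active level of reset
               STRETCH_LEN : POSITIVE  := 15);   -- clocks to hold reset (hypothetical value)
      PORT (CLK       : IN  STD_LOGIC;
            RESET_IN  : IN  STD_LOGIC;           -- raw reset, assumed synchronous to CLK
            RESET_OUT : OUT STD_LOGIC);          -- elongated reset
    END reset_stretch;

    ARCHITECTURE rtl OF reset_stretch IS
      -- These initial values are intentional. The counter cannot be reset by
      -- the signal it generates, so it relies on the guaranteed power-up state.
      SIGNAL stretch_count : NATURAL RANGE 0 TO STRETCH_LEN := 0;
      SIGNAL reset_q       : STD_LOGIC := RESET_LEVEL;
    BEGIN

      stretch: PROCESS (CLK)
      BEGIN
        IF CLK'event AND CLK = '1' THEN
          IF RESET_IN = RESET_LEVEL THEN
            stretch_count <= 0;                   -- restart the stretch interval
            reset_q       <= RESET_LEVEL;
          ELSIF stretch_count /= STRETCH_LEN THEN
            stretch_count <= stretch_count + 1;   -- runs while reset is still active
            reset_q       <= RESET_LEVEL;
          ELSE
            reset_q       <= NOT RESET_LEVEL;     -- interval complete, release reset
          END IF;
        END IF;
      END PROCESS stretch;

      RESET_OUT <= reset_q;

    END rtl;
    ```

    Because the output is registered and starts in the active state, RESET_OUT is asserted from power-up even if RESET_IN is never driven active.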

    The following two sections continue the discussion of reset implementation issues.

  2. Reset Polarity

    One problem that arises with FPGA design is that some FPGAs have an active high reset while others have an active low reset. It would be nice to code in a portable fashion, that is where the code is independent of the FPGA vendor. A solution is to use a generic which specifies the active level of reset and which is passed down to all registers throughout the hierarchy. In the following example, a generic called "RESET_LEVEL" accomplishes this:

    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;

    ENTITY blatz IS
      -- active level of reset for registers is defined here
      GENERIC (RESET_LEVEL : STD_LOGIC := '1');
      PORT (CLK   : IN  STD_LOGIC;
            RESET : IN  STD_LOGIC;
            CE    : IN  STD_LOGIC;
            DIN   : IN  STD_LOGIC;
            DOUT  : OUT STD_LOGIC);
    END blatz;

    ARCHITECTURE behavior OF blatz IS
    BEGIN

      a_register: PROCESS (CLK, RESET)
      BEGIN
        IF RESET = RESET_LEVEL THEN
          DOUT <= '0';
        ELSIF CLK'event AND CLK = '1' THEN
          IF CE = '1' THEN -- clock enable
            DOUT <= DIN;
          END IF;
        END IF;
      END PROCESS a_register;

    END behavior;


    RESET_LEVEL must be passed down through the hierarchy by means of generic mapping. The following is an example of a generic map in a component instantiation:

    blatz_c: blatz
      GENERIC MAP (RESET_LEVEL => RESET_LEVEL)
      PORT MAP (CLK   => my_clk,
                RESET => my_reset,
                CE    => my_ce,
                DIN   => my_din,
                DOUT  => my_dout);

    By implementing reset as a generic which is passed down in this fashion, the value of the RESET_LEVEL generic can be changed at the top level, which inverts the active level of reset for all registers in the design.

  4. Use of Asynchronous Resets

    Other than the use of asynchronous reset for globally resetting the device, the use of asynchronous resets is best avoided when possible. Asynchronous resets create timing paths which are hard for static timing analyzers to deal with, and can make code less portable.

    It's worth noting that some FPGA vendors have started providing a fuse selection for synchronous vs. asynchronous clear and set inputs on registers. This has the advantage that it can provide other synchronous paths for the logic into the register - besides the D input. For example, consider the case where two terms A and B will be used to set a register. This would be equivalent to an OR gate on the D input to the register with A and B as inputs to the OR gate. But if the asynchronous SET input were actually a synchronous set, then the OR gate could be eliminated with A driving the D input and B driving the synchronous set input.

    The FPGA vendors have suggested using this, and it makes sense, but there is a caveat: the global chip reset must then be routed into the synchronous logic. Because global reset typically has one of the largest fanouts in the design, it quickly becomes the critical timing path. The way around this is to break the timing path for global reset as it enters the chip, for example by registering it. One must also ensure that this doesn't create other problems: reset must be active for multiple clock cycles, and no logic may transition immediately out of reset.
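
    One common way to break the reset timing path is to pass the reset pin through a pair of registers before it fans out. The following is a sketch only. The entity name reset_sync and its port names are hypothetical, and the two-stage depth is an assumption rather than a vendor recommendation.

    ```vhdl
    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;

    ENTITY reset_sync IS
      PORT (CLK       : IN  STD_LOGIC;
            RESET_PIN : IN  STD_LOGIC;    -- raw reset from the device pin
            RESET_OUT : OUT STD_LOGIC);   -- registered reset for internal logic
    END reset_sync;

    ARCHITECTURE rtl OF reset_sync IS
      SIGNAL reset_meta : STD_LOGIC;      -- first stage, may go metastable
      SIGNAL reset_reg  : STD_LOGIC;      -- second stage, drives the fanout
    BEGIN

      -- No reset on these registers: they simply follow the pin.
      sync: PROCESS (CLK)
      BEGIN
        IF CLK'event AND CLK = '1' THEN
          reset_meta <= RESET_PIN;
          reset_reg  <= reset_meta;
        END IF;
      END PROCESS sync;

      RESET_OUT <= reset_reg;

    END rtl;
    ```

    The registered reset arrives two clocks late and is only seen by logic that is being clocked, which is why reset must be held active for several clock cycles. If the fanout is still too large, the final register can be duplicated per region, subject to what the synthesizer allows.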

  5. Resource Sharing

    Resource Sharing is the ability for the synthesizer to automatically utilize a common function - common to two processes for example. Consider the following code fragment in a process:

    IF count = 237 AND request = '1' THEN
      start_count <= '1';
    END IF;

    IF count = 237 AND latch_data = '1' THEN
      data_out <= data_in;
    END IF;

    Will the synthesizer create one or two comparators for "count = 237"? The answer depends upon the quality of the synthesizer.

    The general rule for resource sharing is to perform it manually, when possible, eliminating the possibility that the synthesizer will screw it up. Our example can be rewritten to use a concurrent signal assignment for the comparator:

    count_eq237 <= '1' WHEN count = 237 ELSE '0';

    and the process rewritten as follows:

    IF count_eq237 = '1' AND request = '1' THEN
      start_count <= '1';
    END IF;

    IF count_eq237 = '1' AND latch_data = '1' THEN
      data_out <= data_in;
    END IF;

    It is recommended that this practice be used for all large arithmetic and relational operations (adders, subtractors, comparators, etc.).
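
    The same idea applies to adders. The sketch below is illustrative: the entity shared_adder, its ports, and the 16-bit width are all hypothetical. It selects the operands first so that only one "+" appears in the code.

    ```vhdl
    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;
    USE IEEE.NUMERIC_STD.ALL;

    ENTITY shared_adder IS
      PORT (SEL        : IN  STD_LOGIC;
            A, B, C, D : IN  UNSIGNED(15 DOWNTO 0);
            SUM        : OUT UNSIGNED(15 DOWNTO 0));
    END shared_adder;

    ARCHITECTURE rtl OF shared_adder IS
      SIGNAL op1, op2 : UNSIGNED(15 DOWNTO 0);
    BEGIN

      -- Written as
      --   SUM <= A + B WHEN SEL = '1' ELSE C + D;
      -- the synthesizer may or may not share the adder. Selecting the
      -- operands first leaves a single adder in the description.
      op1 <= A WHEN SEL = '1' ELSE C;
      op2 <= B WHEN SEL = '1' ELSE D;

      SUM <= op1 + op2;

    END rtl;
    ```

    The trade-off is two multiplexers ahead of the adder instead of one after it, so the manual version is not always the faster one; it simply removes the dependence on the synthesizer.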

  6. Clocks

    Although it is practical and often necessary to have multiple clocks in a design, one should not generate local clocks within a clock domain in an FPGA design. Locally generated clocks are often used in gate array design to write to registers in the device (such as the control register, status register, etc.).

    In FPGAs, which have flip-flops with clock enables built in, the clock enable should be used. The clock enable is always the outer IF under the clock sensitivity expression:

    a_register: PROCESS (CLK, RESET)
    BEGIN
      IF RESET = RESET_LEVEL THEN
        DOUT <= '0';
      ELSIF CLK'event AND CLK = '1' THEN
        IF CE = '1' THEN -- clock enable
          DOUT <= DIN;
        END IF;
      END IF;
    END PROCESS a_register;

    The use of locally generated clocks should be avoided.
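
    Where a slower update rate is needed, a periodic clock enable can take the place of a divided clock. The following is a sketch under assumptions: the entity slow_register, the generic DIVIDE_BY, and the signal names are hypothetical.

    ```vhdl
    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;

    ENTITY slow_register IS
      GENERIC (RESET_LEVEL : STD_LOGIC := '1';
               DIVIDE_BY   : POSITIVE  := 8);    -- hypothetical divide ratio
      PORT (CLK   : IN  STD_LOGIC;
            RESET : IN  STD_LOGIC;
            DIN   : IN  STD_LOGIC;
            DOUT  : OUT STD_LOGIC);
    END slow_register;

    ARCHITECTURE rtl OF slow_register IS
      SIGNAL div_count : NATURAL RANGE 0 TO DIVIDE_BY - 1;
      SIGNAL slow_ce   : STD_LOGIC;
    BEGIN

      -- Generate a one-clock-wide enable every DIVIDE_BY clocks.
      ce_gen: PROCESS (CLK, RESET)
      BEGIN
        IF RESET = RESET_LEVEL THEN
          div_count <= 0;
          slow_ce   <= '0';
        ELSIF CLK'event AND CLK = '1' THEN
          slow_ce <= '0';                        -- default assignment
          IF div_count = DIVIDE_BY - 1 THEN
            div_count <= 0;
            slow_ce   <= '1';
          ELSE
            div_count <= div_count + 1;
          END IF;
        END IF;
      END PROCESS ce_gen;

      -- The register stays on CLK; only its clock enable is slow.
      a_register: PROCESS (CLK, RESET)
      BEGIN
        IF RESET = RESET_LEVEL THEN
          DOUT <= '0';
        ELSIF CLK'event AND CLK = '1' THEN
          IF slow_ce = '1' THEN
            DOUT <= DIN;
          END IF;
        END IF;
      END PROCESS a_register;

    END rtl;
    ```

    Everything remains in one clock domain, so the static timing analyzer sees ordinary register-to-register paths.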

  7. Decoding Register Writes

    Generally, one must decode lower order address lines and a qualified strobe to write to internal registers within the device (control register, status register, etc.). This could be accomplished in the following fashion, but is not recommended for higher performance designs:

    ELSIF CLK'event AND CLK = '1' THEN
      IF address = CONTROL_REG_ADDR AND REGWRITE = '1' THEN
        CONTROL_REG <= DATA_IN;
      END IF;

      IF address = STATUS_REG_ADDR AND REGWRITE = '1' THEN
        STATUS_REG <= DATA_IN;
      END IF;
    END IF;

    In the above example, if there are many registers, a large fanout is created on the signal REGWRITE and the address bus which can be difficult to deal with. A better way to accomplish this is with the following:

    ELSIF CLK'event AND CLK = '1' THEN

      -- Default assignments:
      write_controlreg <= '0';
      write_statusreg  <= '0';

      IF REGWRITE = '1' THEN
        CASE address IS
          WHEN CONTROL_REG_ADDR =>
            write_controlreg <= '1';
          WHEN STATUS_REG_ADDR =>
            write_statusreg <= '1';
          WHEN OTHERS => NULL;
        END CASE;
      END IF;

      IF write_controlreg = '1' THEN
        control_reg <= DATA_IN;
      END IF;

      IF write_statusreg = '1' THEN
        status_reg <= DATA_IN;
      END IF;

    END IF;

    Notice that the IF statements which test the address have been changed to the more efficient CASE statement, since a priority encoder is not desired on the address. The signals "write_controlreg" and "write_statusreg" become clock enables for the data registers "control_reg" and "status_reg", and can be physically located adjacent to the flip-flops which comprise the register. This allows for a higher operating frequency and less routing congestion. Note that registering the decode adds a pipeline stage, so each register is now written one clock later than in the first version, and this must be taken into account; it may be necessary to pass the input data "DATA_IN" through a pipeline register to compensate.
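
    A sketch of that compensation for a single register follows. The entity control_reg_write, the signal data_in_d1, and the bus widths are hypothetical; CONTROL_REG_ADDR is shown as a generic only to keep the example self-contained.

    ```vhdl
    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;

    ENTITY control_reg_write IS
      GENERIC (RESET_LEVEL      : STD_LOGIC := '1';
               CONTROL_REG_ADDR : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0000");
      PORT (CLK         : IN  STD_LOGIC;
            RESET       : IN  STD_LOGIC;
            REGWRITE    : IN  STD_LOGIC;
            ADDRESS     : IN  STD_LOGIC_VECTOR(3 DOWNTO 0);
            DATA_IN     : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
            CONTROL_REG : OUT STD_LOGIC_VECTOR(7 DOWNTO 0));
    END control_reg_write;

    ARCHITECTURE rtl OF control_reg_write IS
      SIGNAL write_controlreg : STD_LOGIC;
      SIGNAL data_in_d1       : STD_LOGIC_VECTOR(7 DOWNTO 0);
    BEGIN

      -- Pipeline register for the data; it needs no reset.
      data_pipe: PROCESS (CLK)
      BEGIN
        IF CLK'event AND CLK = '1' THEN
          data_in_d1 <= DATA_IN;
        END IF;
      END PROCESS data_pipe;

      reg_write: PROCESS (CLK, RESET)
      BEGIN
        IF RESET = RESET_LEVEL THEN
          write_controlreg <= '0';
          CONTROL_REG      <= (OTHERS => '0');
        ELSIF CLK'event AND CLK = '1' THEN
          -- Stage 1: registered address decode
          write_controlreg <= '0';
          IF REGWRITE = '1' AND ADDRESS = CONTROL_REG_ADDR THEN
            write_controlreg <= '1';
          END IF;

          -- Stage 2: the enable and the delayed data arrive together
          IF write_controlreg = '1' THEN
            CONTROL_REG <= data_in_d1;
          END IF;
        END IF;
      END PROCESS reg_write;

    END rtl;
    ```

    If the bus guarantees that DATA_IN is still valid one clock after REGWRITE, the extra data register is unnecessary.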

  8. State Machines

    State machine design in VHDL is more a matter of style preference than rules. What is presented here is one preferred style and some recommendations based upon personal preference.

    Although many synthesizers recommend the "two process" state machine, this approach generates non-registered outputs which can glitch. Other than that caution (which can be solved by registering the outputs in a clocked process) and the fact that they are a bit more difficult to read, they are an acceptable method of coding.

    Another method is to create a single clocked process containing all state assignments and output assignments. Since all signal assignments in a clocked process become flip-flops, the outputs from the state machine are registered and will not glitch. An example of such a state machine is shown here:

    ARCHITECTURE behavior OF state_machine IS
      TYPE STATETYPE IS (IDLE, GETBUS, HAVEBUS);
      SIGNAL state: STATETYPE;

    BEGIN

      sm_process: PROCESS (CLK, RESET)
      BEGIN
        IF (RESET = RESET_LEVEL) THEN
          BUSREQ  <= '0';
          BUSFREE <= '0';
          state   <= IDLE;

        ELSIF CLK'EVENT AND CLK = '1' THEN

          -- define inactive state for all outputs
          BUSREQ  <= '0';
          BUSFREE <= '0';

          CASE (state) IS

            WHEN IDLE =>
              IF MEMREQ = '1' THEN
                state <= GETBUS;
              END IF;

            WHEN GETBUS =>
              BUSREQ <= '1';
              IF BUSGNT = '1' THEN
                state <= HAVEBUS;
              END IF;

            WHEN HAVEBUS =>
              IF MEMREQ = '0' THEN
                BUSFREE <= '1';
                state <= IDLE;
              END IF;

          END CASE;
        END IF;
      END PROCESS sm_process;

    END behavior;

    In this state machine, the state bits are created as an enumerated type by the following declarations:

    TYPE STATETYPE IS (IDLE, GETBUS, HAVEBUS);
    SIGNAL state: STATETYPE;

    The advantage of using an enumerated type is that the values for STATETYPE (that is, IDLE, GETBUS, and HAVEBUS) are displayed in the simulator waveform window. The synthesizers provide a mechanism to control the assignment of enumerated type to actual values ("one-hot", binary, random, etc.). One can also use constant declarations to define the state bit assignments manually, if desired.
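
    For comparison, here is a sketch of the same state machine with the state bits assigned manually through constants. The entity name state_machine_onehot and the particular one-hot codes are assumptions made for illustration.

    ```vhdl
    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;

    ENTITY state_machine_onehot IS
      GENERIC (RESET_LEVEL : STD_LOGIC := '1');
      PORT (CLK, RESET      : IN  STD_LOGIC;
            MEMREQ, BUSGNT  : IN  STD_LOGIC;
            BUSREQ, BUSFREE : OUT STD_LOGIC);
    END state_machine_onehot;

    ARCHITECTURE rtl OF state_machine_onehot IS
      -- Manual one-hot assignment in place of an enumerated type
      SUBTYPE STATETYPE IS STD_LOGIC_VECTOR(2 DOWNTO 0);
      CONSTANT IDLE    : STATETYPE := "001";
      CONSTANT GETBUS  : STATETYPE := "010";
      CONSTANT HAVEBUS : STATETYPE := "100";
      SIGNAL state : STATETYPE;
    BEGIN

      sm_process: PROCESS (CLK, RESET)
      BEGIN
        IF RESET = RESET_LEVEL THEN
          BUSREQ  <= '0';
          BUSFREE <= '0';
          state   <= IDLE;
        ELSIF CLK'event AND CLK = '1' THEN
          BUSREQ  <= '0';
          BUSFREE <= '0';

          CASE state IS
            WHEN IDLE =>
              IF MEMREQ = '1' THEN
                state <= GETBUS;
              END IF;
            WHEN GETBUS =>
              BUSREQ <= '1';
              IF BUSGNT = '1' THEN
                state <= HAVEBUS;
              END IF;
            WHEN HAVEBUS =>
              IF MEMREQ = '0' THEN
                BUSFREE <= '1';
                state   <= IDLE;
              END IF;
            -- Required: a vector has many values besides the three codes
            WHEN OTHERS =>
              state <= IDLE;
          END CASE;
        END IF;
      END PROCESS sm_process;

    END rtl;
    ```

    The cost of this approach is that the waveform window shows bit patterns rather than state names, and some synthesizers may still re-encode a state machine they recognize unless told not to.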

    As it turns out, unless all of the outputs from the state machine (in our example, BUSREQ and BUSFREE) are assigned to a value in every state of the state machine, the state machine must generate "extra" logic to maintain the outputs for those states in which the outputs aren't assigned - the state machine will not be fully optimized. But assigning all outputs in every state can make the state machine quite unreadable - especially in a large state machine.

    A solution for this, shown in the example above, is to have default assignments at the start of the clocked portion of the process and to override these in specific states. In our example above, BUSREQ and BUSFREE are normally low and are asserted high when active. They are assigned to '0' at the beginning of the clocked process, and assigned to '1' in the states where they are to be active. This allows all of the outputs to be assigned a value in all states, and for the state machine to be fully optimized, while still being quite readable.

    It's worth noting the assignments for the two outputs have different dependencies and will exhibit different timing. The output "BUSREQ" is only dependent upon being in the state GETBUS. If the state bits are one-hot encoded, then BUSREQ will only depend upon one flip-flop. The "BUSFREE" output is dependent upon being in the state HAVEBUS, and is also dependent upon the input MEMREQ.

  9. Project Packages

    Although it always seems like extra overhead, software programmers learned many years ago that it pays to use symbolic definitions for commonly used constants. For example, addresses for registers, bit assignments and other constants should be defined in one project package and then included in all modules. A separate package can be used for simulation constants. It's useful to have the simulation package reference the project package when necessary, so that every constant is defined in only one place. When a constant needs to be changed, it is then changed in one place and simulation and synthesis will simply work.
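
    A minimal sketch of such a pair of packages is shown here. The package names project_pkg and sim_pkg, and all of the constant names and values, are hypothetical.

    ```vhdl
    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;

    -- Constants shared by every RTL module and by the testbench
    PACKAGE project_pkg IS
      CONSTANT RESET_LEVEL      : STD_LOGIC := '1';
      CONSTANT CONTROL_REG_ADDR : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0000";
      CONSTANT STATUS_REG_ADDR  : STD_LOGIC_VECTOR(3 DOWNTO 0) := "0001";
      CONSTANT CTRL_ENABLE_BIT  : NATURAL := 0;   -- bit position in control register
    END project_pkg;

    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;
    USE WORK.project_pkg.ALL;

    -- Simulation-only constants; anything shared comes from project_pkg
    PACKAGE sim_pkg IS
      CONSTANT CLK_PERIOD          : TIME := 10 NS;
      CONSTANT TB_FIRST_WRITE_ADDR : STD_LOGIC_VECTOR(3 DOWNTO 0) := CONTROL_REG_ADDR;
    END sim_pkg;
    ```

    An RTL module then needs only "USE WORK.project_pkg.ALL;" ahead of its entity, and a testbench adds "USE WORK.sim_pkg.ALL;".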

  10. Style

    It is important to adopt a style that is meaningful to the designer, as well as to others that might have to read their code. There is no right or wrong in this category; it is simply a matter of preference. Since VHDL is case-insensitive, one has the option of using case to increase readability. My personal preference is to use upper case for VHDL keywords, for constants, and for signals declared in ports. Internal signals are always lower case in this scheme. By using upper case for ports and lower case for signals, it's always easy to determine if a signal is an input or output to the module or is an internal signal.

    Another style preference is the location for component declarations. When an entity/architecture pair contains many declared components, they tend to make the module less readable. An option to alleviate this is to put these declarations into a package - either the project package or some other package, and then import the package into the entity.
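
    As a sketch, the declaration for the "blatz" component used earlier could live in a package like the following. The package name component_pkg is hypothetical.

    ```vhdl
    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;

    PACKAGE component_pkg IS

      COMPONENT blatz
        GENERIC (RESET_LEVEL : STD_LOGIC := '1');
        PORT (CLK   : IN  STD_LOGIC;
              RESET : IN  STD_LOGIC;
              CE    : IN  STD_LOGIC;
              DIN   : IN  STD_LOGIC;
              DOUT  : OUT STD_LOGIC);
      END COMPONENT;

      -- further component declarations go here

    END component_pkg;
    ```

    A module that instantiates blatz then places "USE WORK.component_pkg.ALL;" ahead of its entity and no longer needs the declaration in its architecture.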

II. Behavioral (Testbench) Design

    The term behavioral design as used in this note refers to VHDL code used to test the FPGA part of the design, or the RTL.

  1. Behavioral versus RTL Style

    When writing RTL the designer is limited to a small subset of the HDL in order to stay compatible with the available hardware structures. For example, it's trivial to define registers that operate from both edges of a clock, but these aren't available in hardware (although they would come in handy sometimes!). In contrast to RTL, one can use the entire HDL when writing a behavioral description. While this is true, the end goal when creating a behavioral model is generally to model a hardware device. And since one is trying to model the device in a fashion that exactly duplicates its behavior, it is often best to use some of the same "clocked process" techniques that one would use in an RTL design.

    This isn't to say that non-synthesizable constructs can't be used, but if the device being modeled contains a synchronous state machine, one might be best off using a synchronous state machine in the behavioral description rather than a process with a series of WAIT FOR statements.

  2. Complete Testbenches

    When starting to model a testbench, there is a tendency to leave out the static signals and clocks and to force these from the simulator. If you think about it, it isn't any more typing to include these in the testbench and make the testbench complete. In this way, the simulation can run standalone - the top level of the testbench is read into the simulator and simulation time is advanced, without forcing any signals.

    This has the advantage of easy re-use; if some months later the same simulation needs to be executed, there is no need to figure out which static signals need to be forced or how to force the clock. Another (minor) advantage is in code portability; code written in the HDL is portable, whereas code stimulus written in the simulator language is specific to that simulator.
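
    A sketch of a complete testbench for the "blatz" register from the reset polarity section is shown below. The testbench name blatz_tb, the clock period, and the stimulus are assumptions. It instantiates the entity directly, which requires VHDL-93 or later.

    ```vhdl
    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;

    ENTITY blatz_tb IS
    END blatz_tb;

    ARCHITECTURE behavior OF blatz_tb IS
      CONSTANT RESET_LEVEL : STD_LOGIC := '1';
      CONSTANT CLK_PERIOD  : TIME      := 10 NS;   -- hypothetical clock period

      SIGNAL clk   : STD_LOGIC := '0';
      SIGNAL reset : STD_LOGIC := RESET_LEVEL;     -- start in reset
      SIGNAL ce    : STD_LOGIC := '1';             -- static signal, driven here
      SIGNAL din   : STD_LOGIC := '0';
      SIGNAL dout  : STD_LOGIC;
    BEGIN

      -- Clock and reset are generated in the testbench, not forced
      -- from the simulator.
      clk   <= NOT clk AFTER CLK_PERIOD / 2;
      reset <= NOT RESET_LEVEL AFTER 5 * CLK_PERIOD;

      stimulus: PROCESS
      BEGIN
        WAIT UNTIL reset = NOT RESET_LEVEL;
        WAIT UNTIL clk'event AND clk = '1';
        din <= '1';
        WAIT UNTIL clk'event AND clk = '1';        -- register captures DIN here
        WAIT UNTIL clk'event AND clk = '1';
        ASSERT dout = '1'
          REPORT "DOUT DID NOT FOLLOW DIN" SEVERITY ERROR;
        WAIT;                                      -- stimulus complete
      END PROCESS stimulus;

      dut: ENTITY WORK.blatz
        GENERIC MAP (RESET_LEVEL => RESET_LEVEL)
        PORT MAP (CLK   => clk,
                  RESET => reset,
                  CE    => ce,
                  DIN   => din,
                  DOUT  => dout);

    END behavior;
    ```

    The clock in this sketch never stops, so the simulation is run for a fixed time rather than until it runs out of events.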

  3. Model Timing

    An effective behavioral model will contain the worst case timing from the device or bus specification. These should be declared as generics for easy modification. A good practice is to use the exact description from the datasheet as the generic name. An example from a model of TI's HPI bus is:

    GENERIC (RESET_LEVEL         : STD_LOGIC := '1';  -- reset level of regs
             MEM_LENGTH          : INTEGER := 2**19;  -- length of HPI mem array
             EN_RAND_READ_TIMING : BOOLEAN := TRUE;   -- random read
             tREADACCESS         : TIME := 20 NS;
             tREADACCESS_MAX     : TIME := 180 NS;    -- max for random read
             tREADACCESS_MIN     : TIME := 20 NS;     -- min for random read
             td_DSL_HYL          : TIME := 12 NS;
             tv_HYH_HDV          : TIME := 7 NS;
             td_DSL_HDD          : TIME := 20 NS;
             td_DSH_HYL          : TIME := 12 NS;
             th_DSH_HDVR         : TIME := 1 NS;
             tHPIA_C_READ        : TIME := 20 NS;
             td_DSH_HYH_MIN      : TIME := 30 NS;
             td_DSH_HYH_MAX      : TIME := 83 NS;
             tsu_HBV_DSL         : TIME := 5 NS;
             twDSL               : TIME := 30 NS;
             twDSH               : TIME := 10 NS;
             tsu_HDV_DSHW        : TIME := 10 NS;
             th_DSL_HBV          : TIME := 5 NS
            );


  4. Timing Checks

    An effective model provides setup, hold, pulse width, and other timing checks. Timing checks can be performed with boolean tests, or with ASSERTION statements. Keep in mind that the assertion statement is "bass ackwards" - it is a test for a false condition. Placing a NOT in front of the test changes it to a normal "test for true" condition.

    Assertion statements have the advantage of providing a mechanism for stopping the simulation on error. The simulator can be made to stop on the assertion level - NOTE, WARNING, ERROR, or FAILURE. The user declared string is always written to the screen when the test is false.

    Some examples are shown below. In these examples, the runtime function "now" returns the current simulation time which is stored in the signals hstrobe_falling and hstrobe_rising and used for the timing checks.

    timing_checks : PROCESS (hstrobe_i)
    BEGIN
      -- Falling edge checks
      IF hstrobe_i'event AND hstrobe_i = '0' THEN

        IF now - hstrobe_rising < twDSH THEN
          ASSERT FALSE
            REPORT "HOST STROBE PULSE WIDTH HIGH VIOLATION"
            SEVERITY WARNING;
        END IF;

        ASSERT HRNW'stable(tsu_HBV_DSL)
          REPORT "HRNW SETUP TO FALLING EDGE OF THE EARLIER OF HDS0, HDS1, OR HCS"
          SEVERITY WARNING;

        ASSERT HCNTL'stable(tsu_HBV_DSL)
          REPORT "HCNTL SETUP TO FALLING EDGE OF THE EARLIER OF HDS0, HDS1, OR HCS"
          SEVERITY WARNING;

        hstrobe_falling <= now; -- capture time of falling edge

      END IF;

      -- Rising edge checks
      IF hstrobe_i'event AND hstrobe_i = '1' THEN

        IF now - hstrobe_falling < twDSL THEN
          ASSERT FALSE
            REPORT "HOST STROBE PULSE WIDTH LOW VIOLATION"
            SEVERITY WARNING;
        END IF;

        IF hrnw_l = '0' THEN
          ASSERT hd_pad'STABLE(tsu_HDV_DSHW)
            REPORT "HD SETUP TO RISING EDGE OF THE EARLIER OF HDS0, HDS1, OR HCS"
            SEVERITY WARNING;
        END IF;

        ASSERT hrdy_i = '1'
          REPORT "HRDY LOW AT RISING EDGE OF EARLIER OF HCS, HDS0, OR HDS1"
          SEVERITY WARNING;

        hstrobe_rising <= now; -- capture time of rising edge

      END IF;
    END PROCESS timing_checks;
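
    To make the sense of the assertion statement concrete, the following standalone sketch writes one pulse width check three equivalent ways. The entity assert_demo and the measured value of 20 ns are hypothetical; twDSL reuses the 30 ns default from the generic list in the Model Timing section. The bare REPORT statement requires VHDL-93 or later.

    ```vhdl
    ENTITY assert_demo IS
    END assert_demo;

    ARCHITECTURE behavior OF assert_demo IS
      CONSTANT twDSL : TIME := 30 NS;          -- minimum strobe low width
    BEGIN

      check: PROCESS
        VARIABLE pulse_width : TIME := 20 NS;  -- hypothetical measured width (too short)
      BEGIN
        WAIT FOR 1 NS;

        -- 1. Boolean test: the IF condition describes the error.
        IF pulse_width < twDSL THEN
          REPORT "PULSE WIDTH LOW VIOLATION (IF)" SEVERITY WARNING;
        END IF;

        -- 2. ASSERT: the condition describes what must be TRUE;
        --    the message is written when it is FALSE.
        ASSERT pulse_width >= twDSL
          REPORT "PULSE WIDTH LOW VIOLATION (ASSERT)" SEVERITY WARNING;

        -- 3. ASSERT NOT: keeps the error condition from form 1 as written.
        ASSERT NOT (pulse_width < twDSL)
          REPORT "PULSE WIDTH LOW VIOLATION (ASSERT NOT)" SEVERITY WARNING;

        WAIT;
      END PROCESS check;

    END behavior;
    ```

    With the values assumed here all three forms detect the same violation, so each message is written once.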


  5. Protocol Checks

    Another useful feature to include in a behavioral model is protocol checking. Protocol checks are simply conditionals that warn if non-timing related bus violations occur. An example of this would be the premature ending of a bus cycle, such as on the TI HPI bus. A bus cycle with HCS and HDS0/HDS1 active should not end with HRDY low. The last assertion statement in the example above looks for this condition.

  6. Handling Large Counters

    With some applications which process large amounts of data in a sequential fashion, or simply counters that must wait for an exceptionally long time, the simulation time becomes large enough to be prohibitive. Examples of this are counters for frame-based image or video applications, or counters which wait a stabilization delay for DRAMs at system reset.

    An easy solution to this problem involves passing down generics from the top level as described in the section on reset polarity above. This solution is to assign the terminal counts of the counters with generics. The default values for these generics at the top level of the chip contain the correct values for synthesis, but are overridden in the testbench with smaller values for simulation.

    For example, if a graphics engine has to process 1280 pixels X 1024 scan lines for each frame, the time required to simulate the 1.3 million pixels is significant. If these values for these generics are overridden with 100 pixels X 10 scan lines in the testbench, then it's possible to simulate multiple frames. Of course, one would want to process some number of full frames, as well as many smaller frames for completeness.
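
    A sketch of a frame counter whose terminal counts are generics is shown below. The entity frame_counter and the generic and port names are hypothetical; the default values are the full-size frame from the example above.

    ```vhdl
    LIBRARY IEEE;
    USE IEEE.STD_LOGIC_1164.ALL;

    ENTITY frame_counter IS
      GENERIC (RESET_LEVEL     : STD_LOGIC := '1';
               PIXELS_PER_LINE : POSITIVE  := 1280;   -- synthesis default
               LINES_PER_FRAME : POSITIVE  := 1024);  -- synthesis default
      PORT (CLK         : IN  STD_LOGIC;
            RESET       : IN  STD_LOGIC;
            PIXEL_VALID : IN  STD_LOGIC;
            FRAME_DONE  : OUT STD_LOGIC);
    END frame_counter;

    ARCHITECTURE rtl OF frame_counter IS
      SIGNAL pixel_count : NATURAL RANGE 0 TO PIXELS_PER_LINE - 1;
      SIGNAL line_count  : NATURAL RANGE 0 TO LINES_PER_FRAME - 1;
    BEGIN

      count: PROCESS (CLK, RESET)
      BEGIN
        IF RESET = RESET_LEVEL THEN
          pixel_count <= 0;
          line_count  <= 0;
          FRAME_DONE  <= '0';
        ELSIF CLK'event AND CLK = '1' THEN
          FRAME_DONE <= '0';                          -- default assignment
          IF PIXEL_VALID = '1' THEN
            IF pixel_count = PIXELS_PER_LINE - 1 THEN
              pixel_count <= 0;
              IF line_count = LINES_PER_FRAME - 1 THEN
                line_count <= 0;
                FRAME_DONE <= '1';                    -- one clock wide
              ELSE
                line_count <= line_count + 1;
              END IF;
            ELSE
              pixel_count <= pixel_count + 1;
            END IF;
          END IF;
        END IF;
      END PROCESS count;

    END rtl;

    -- In the testbench architecture, the instantiation overrides the
    -- defaults with a small frame:
    --
    --   dut: ENTITY WORK.frame_counter
    --     GENERIC MAP (PIXELS_PER_LINE => 100,
    --                  LINES_PER_FRAME => 10)
    --     PORT MAP (CLK         => clk,
    --               RESET       => reset,
    --               PIXEL_VALID => pixel_valid,
    --               FRAME_DONE  => frame_done);
    ```

    Synthesis never sees the testbench, so the full-size defaults apply to the device while the simulation runs on 1,000-pixel frames.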



Please contact us if you have any questions on this, or to provide feedback. Thank you!





