Note that this page has been updated to reflect Xilinx's move from the TRN-based PCIe core interface to the AXI-based interface.
The PCI Express hard IP block in Xilinx FPGA families provides a Transaction Layer Packet (TLP) interface on the user (FPGA fabric) side. This interface was once based on the Transaction (TRN) interface, but Xilinx has changed to the AXI interface to be consistent with its other IP.
The PCIe core supports Gen 1 and Gen 2 with 1 to 8 lanes in 7 series devices, and Gen 3 with up to 16 lanes in the UltraScale+ family. When the core is built using the IP Integrator tool within Vivado, it creates an interface to the hard IP core that resides in the FPGA fabric. When configuring the core, the user can select the bus width of the datapath and the frequency for the user-side logic. The bus width can vary from 64 to 128 bits (256 bits for UltraScale+), with user-side frequencies from 62.5 MHz to 250 MHz.
There are separate receiver and transmitter TLP interfaces, both using the AXI interface. The user side must build transaction layer packets to send to the root complex, and parse transaction layer packets received from the host. As a bus slave, the user side receives TLPs for write and read requests. Write requests are retired to the destination. For each read request, it must complete the read, build a completion TLP, and send it on the transmitter. As a bus master, the user side can build write and read TLPs and send them on the transmitter, and it will receive read completions in response to its read requests.
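To make the TLP-building task concrete, here is a minimal sketch of packing the 3-DWORD header of a 32-bit-address Memory Write request, following the field layout in the PCIe Base Specification. It is written in Python purely for illustration - in a real design this packing is done in RTL - and fixes TC and attributes at zero:

```python
def mwr32_header(requester_id, tag, addr, dword_count,
                 first_be=0xF, last_be=0xF):
    """Pack the 3-DW header of a 32-bit Memory Write TLP
    (TC=0, no attributes; Last BE must be 0 for 1-DW requests)."""
    # DW0: Fmt=010 (3-DW header with data), Type=00000 (MWr), Length
    dw0 = ((0b010 << 5 | 0b00000) << 24) | (dword_count & 0x3FF)
    # DW1: Requester ID | Tag | Last BE | First BE
    dw1 = (requester_id << 16) | (tag << 8) | (last_be << 4) | first_be
    # DW2: DWORD-aligned target address (bits [1:0] are reserved)
    dw2 = addr & 0xFFFFFFFC
    return [dw0, dw1, dw2]

# A 4-DWORD write to 0x10008000 from requester ID 0x0100, tag 7:
hdr = mwr32_header(0x0100, 7, 0x10008000, 4)
print([hex(d) for d in hdr])  # ['0x40000004', '0x10007ff', '0x10008000']
```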
The AXI Interface
The AXI interface, now present in most Xilinx IP, is similar to the handshake in the old parallel PCI bus in that either side can generate wait states. It consists of a ready strobe driven by the receiver and a valid strobe driven by the transmitter. Data is transferred only when both ready and valid are active; either side can deassert its strobe to insert a wait state on any given clock. Additional strobes validate individual bytes on the bus and indicate that a transfer is the last in a burst. Both 3-DWORD and 4-DWORD TLP headers are supported, for 32-bit and 64-bit addressing, respectively.
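The handshake rule can be modeled in a few lines of Python (illustrative only; the real interface is RTL): a beat is accepted only on a clock where both valid and ready are high, and a wait state on either side simply holds the beat for a later cycle rather than losing it.

```python
def transfer(tx_queue, valid_pattern, ready_pattern):
    """Cycle-by-cycle model of one AXI handshake: the transmitter
    holds each beat until a cycle where both valid and ready are high."""
    received = []
    idx = 0
    for v, r in zip(valid_pattern, ready_pattern):
        v = v and idx < len(tx_queue)   # nothing left to send -> valid low
        if v and r:
            received.append(tx_queue[idx])  # beat accepted this cycle
            idx += 1
        # else: wait state - the transmitter keeps driving the same beat
    return received

# Transmitter idles on cycle 0, receiver stalls on cycle 2; all three
# beats still arrive, just later:
print(transfer([1, 2, 3], [0, 1, 1, 1, 1], [1, 1, 0, 1, 1]))  # [1, 2, 3]
```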
TLP Buffering and Maximum Payload Size
The size of individual data transfers on the PCIe bus is determined by the value of the Maximum Payload Size parameter. When building the core, the value that you would like the interface to support is specified in bytes. This value gets broadcast to the host, which then determines the actual value which will be less than or equal to the value requested. The actual value is driven on a port from the core. The user design must use the value specified by the host and not the value requested.
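The negotiated value appears on the core's configuration outputs as the standard 3-bit power-of-two code from the Device Control register (the exact port name varies by core version). Decoding it is simple; this Python sketch mirrors the encoding defined in the PCIe Base Specification:

```python
def decode_max_payload(code):
    """Decode the 3-bit Max_Payload_Size field from the Device Control
    register: 0b000 -> 128 bytes, 0b001 -> 256, ... 0b101 -> 4096."""
    if not 0 <= code <= 5:
        raise ValueError("reserved Max_Payload_Size encoding")
    return 128 << code

print(decode_max_payload(0b010))  # 512
```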
Transmit buffers are implemented in block RAM and are configurable in size when the core is generated. These buffers are shared between the user application and the core, since the core handles configuration cycles without intervention by the user logic. If the user logic fills the transmit buffers with write data destined for the host, the core will be blocked from generating completions for configuration cycles. To avoid this, the core provides the tx_buf_av signal, which indicates the number of available buffers (each buffer holds one maximum-sized TLP, as determined by the Maximum Payload Size) and can be used to throttle the user application.
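One possible throttling policy is sketched below in Python; the reserve count is an illustrative choice, not a Xilinx requirement:

```python
def can_queue_tlp(tx_buf_av, reserve=1):
    """Allow the user application to submit a new TLP only while at
    least `reserve` transmit buffers would remain free for the core's
    own configuration completions."""
    return tx_buf_av > reserve

# With 2 buffers free we may send; with 1 we hold off, keeping a
# buffer available for the core's configuration completions.
print(can_queue_tlp(2), can_queue_tlp(1))  # True False
```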
PCIe Error Reporting
There are many PCIe error conditions that can be reported to the core, and many of them will not apply to every application. One important one is completion timeout, which the user logic is responsible for implementing; it is indicated by asserting the completion timeout error flag (CFG_ERR_CPL_TIMEOUT). Some of the other errors that might be flagged are:
End-to-end CRC ECRC Error
Unsupported Request Error
Unexpected Completion Error
Completer Abort Error
When multiple errors occur in one packet, only one should be reported. The signal CFG_ERR_POSTED further qualifies these errors as belonging to a posted (write) or non-posted (read) operation.
As with all TLP headers, the user is responsible for building the completion header used to return completions for read requests. It must contain the Tag and requester ID latched from the request, as well as the ID assigned to the endpoint in configuration space; this latter item is made available to the user on the configuration interface.
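For illustration, the 3-DWORD completion-with-data header can be sketched as follows (Python, simplified per the PCIe Base Specification layout: TC, attributes, and BCM are zero and the status is Successful Completion; a real design builds this in RTL):

```python
def cpl_d_header(completer_id, requester_id, tag,
                 byte_count, dword_count, lower_addr):
    """Pack the 3-DW header of a Completion-with-Data (CplD) TLP."""
    # DW0: Fmt=010 (3-DW header with data), Type=01010 (Completion), Length
    dw0 = ((0b010 << 5 | 0b01010) << 24) | (dword_count & 0x3FF)
    # DW1: Completer ID | status (000 = Successful) | BCM=0 | Byte Count
    dw1 = (completer_id << 16) | (0b000 << 13) | (byte_count & 0xFFF)
    # DW2: Requester ID and Tag latched from the request | Lower Address
    dw2 = (requester_id << 16) | (tag << 8) | (lower_addr & 0x7F)
    return [dw0, dw1, dw2]

# One-DWORD completion from completer 0x0100 back to tag 0x2A:
hdr = cpl_d_header(0x0100, 0x0000, 0x2A, 4, 1, 0)
print([hex(d) for d in hdr])  # ['0x4a000001', '0x1000004', '0x2a00']
```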
It should be noted that the core, not the user logic, provides completions to all configuration requests.
PCIe Interrupts
The Xilinx PCIe IP core supports Legacy, MSI, and MSI-X interrupts. Interrupts on the PCIe interface are very different from those on the parallel PCI bus: essentially, a message is sent to the root complex when the interrupt is to be asserted, and another message must be sent when the interrupt is to be negated. For Legacy and MSI interrupts, the Xilinx core provides a simple interface to initiate these messages. For MSI interrupts, the user logic can return a vector with the interrupt; the number of vectors can be specified when building the core. The user logic should support both Legacy and MSI, since the selection between them is made by the host and output from the core on the CFG_INTERRUPT_MSIENABLE signal.
PCIe Byte Order
On the parallel PCI bus, if a design worked only in 32-bit quantities, byte order did not matter. This is not so with the PCIe interface: depending upon your target's endianness, a byte swap may have to be performed on the data payload as it arrives from the receiver or is sent to the transmitter. This can be confusing, since the swap must not be performed on any of the header words - only the data payload. This includes the data payload of a completion, which must also be swapped.
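The swap itself is a straightforward byte reversal within each 32-bit DWORD, sketched here in Python (in hardware this is just a rewiring of byte lanes):

```python
def swap32(dword):
    """Reverse the four bytes of one 32-bit payload DWORD."""
    return int.from_bytes(dword.to_bytes(4, "little"), "big")

# Apply to payload DWORDs only - never to the TLP header DWORDs:
payload = [0x12345678, 0xAABBCCDD]
print([hex(swap32(d)) for d in payload])  # ['0x78563412', '0xddccbbaa']
```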
Getting Started with the Xilinx PCIe Core
The first step is to build the core, but some decisions must be made first. The link speed and number of lanes should be chosen based on your bandwidth needs, and you will need an internal clock frequency and AXI datapath width that support the link bandwidth you've chosen. The AXI frequency is selectable in multiples of 62.5 MHz, and the datapath width in multiples of 64 bits. Keep in mind that higher frequencies can make timing closure difficult to achieve, and very wide datapaths can cause routing problems in smaller devices.
Other decisions are:
The vendor and device IDs, class code, etc.
The size and number of Base Address Registers (BARs)
The size of TLP buffers
Power management, extended capabilities, shared logic, and many other items
Many of these decisions are for specialized applications and can be left at the defaults.
After the core is built, it will appear in the Project Manager window in Vivado. The next step is to build an example design, which can be leveraged for the actual design. The example design shows the default signal assignments for the many unused inputs to the core that must still be driven. It also sets up the Xilinx simulation testbench, which is useful for getting started.
The easiest way to build the example design is to right-click the core name in the IP Sources tab of the Project Manager. This opens a new instance of Vivado and builds the necessary files. After compiling the files for simulation (simulator dependent), a basic simulation can be run. Note that the simulation will run for quite a long time before anything transpires, as it must first get past link initialization.
FPGA Configuration Issues with PCIe
FPGA configuration - that is, loading the FPGA bit file from the configuration PROM - can be problematic for PCIe designs. The reason is timing. Shortly after a PC boots, PCI enumeration is initiated, whereby the host polls and initializes any PCI devices present. If enumeration occurs before the FPGA configuration has been loaded (i.e. before the user's design is active), the device is not enumerated and the host does not know about it. The time available is subject to interpretation, but is somewhere in the region of 50 ms. Xilinx provides three recommendations:
1. Use parallel (BPI) PROMs for configuration. These can achieve a faster configuration time, but only if #2 is also implemented.
2. Drive the configuration clock from the board. The internally generated configuration clock, CCLK, comes from a ring oscillator in the FPGA with a +/-50% tolerance, so running the configuration interface at, or even close to, the highest rate the PROM allows is not possible without an accurate external clock.
3. Don't use the PCIe connector reset to reset the card. By generating a local board-level reset, FPGA configuration can commence while the host is still in reset, allowing for a bit more configuration time.
PCIe Simulation Test
The Xilinx tools can output a PCI Express simulation model as described above. This model is built around an instance of the hard IP, which means you will be simulating two instances of the PCIe core - one for the root port and one for the endpoint. While this works and is a good first test, the approach has a number of shortcomings. First, since it models the serial interface, it is very slow, and every simulation must run for a long time at the start to complete link training. Second, there is no way to test AXI wait states on both sides of the AXI interfaces - a major drawback, since that is the most complex part of the interface to the core. Third, other features cannot easily be tested because the core prevents it, such as the Expansion ROM Base Address Register response and FIFO flow control. And lastly, it is not a true pseudo-code model in that read data is not returned to the stimulus (our version has been modified to provide this), so the stimulus code cannot read a value and act on it.
At Verien, we designed our own pseudo-code behavioral models to test our PCI Express designs, for both 64-bit and 128-bit datapaths. These have the advantage that they generate cycles at the AXI interface - not the serial interface - and are therefore about an order of magnitude faster, with no link training. They also allow AXI wait states to be exercised on both the receiver and transmitter, either for a fixed amount or applied randomly. Since read data is returned to the model, the test vectors can be written as pseudo-code procedures, much like a diagnostic: one can initiate a DMA operation and then poll the DMA Done bit, for example. The pseudo-code model is instantiated in place of the Xilinx core; write cycles are generated from the pseudo-code procedures onto the AXI master interface, and reads are handled on the AXI receive interface. Since there is no actual link involved, the simulation speed improvement is huge.
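As an illustration only - the names below are hypothetical, not Verien's actual API - a pseudo-code test procedure against such a model might look like this, with a stand-in model object so the sketch is self-contained:

```python
def run_dma_test(bfm, ctrl_base):
    """Hypothetical test procedure: kick off a DMA, then poll the
    Done bit - possible only because the model returns read data."""
    DMA_START, DMA_STATUS, DMA_DONE_BIT = 0x00, 0x04, 0x1
    bfm.write32(ctrl_base + DMA_START, 1)        # start the DMA engine
    while not (bfm.read32(ctrl_base + DMA_STATUS) & DMA_DONE_BIT):
        pass                                     # poll until Done asserts
    return True

class FakeBfm:
    """Stand-in for the AXI-side model: Done appears after a few polls."""
    def __init__(self):
        self.polls = 0
    def write32(self, addr, value):
        pass
    def read32(self, addr):
        self.polls += 1
        return 0x1 if self.polls >= 3 else 0x0

print(run_dma_test(FakeBfm(), 0x4000))  # True
```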