Sidebar: An overview of timing issues and clock management
Pong Chu
8/13/2012 8:28 PM EDT
This excerpt from Embedded SOPC Design with Nios II Processor and Verilog Examples by Pong P. Chu appears courtesy of the editors at John Wiley & Sons Inc.
16.1 MEMORY RESOURCES OF DE1 BOARD
The Altera EP2C20 FPGA device and DE1 board provide several options for storage elements:
- EP2C20's D FFs (for registers): about 20K bits embedded in logic cells (LEs).
- EP2C20's embedded RAM: about 200K bits, configured as 52 4K-bit modules.
- off-chip SRAM device: about 4,000K bits, arranged as a 256K-by-16 cell array.
- off-chip SDRAM device: about 64,000K bits, arranged as a 4M-by-16 cell array.
These memory options exhibit a trade-off between cost and performance. A D FF is the fastest and most versatile option but requires the most silicon area and thus has the highest per-bit cost. It is only feasible for small, fast buffers. On the other hand, an SDRAM cell occupies the smallest silicon area and has the lowest per-bit cost but has the slowest access speed. Thus, it is feasible for a system that requires massive storage but can tolerate relatively slower performance.
It is a good idea to keep in mind the capacities of these options and to select the proper type that is most suitable for an application at hand.
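As a rough illustration of matching capacity to an application, the sketch below walks the four options in the order listed above and returns the first one large enough for a given requirement. The capacities are the approximate figures quoted in the list; the function name, the kilobit unit (1K taken as 1000 bits), and the frame-buffer example are assumptions made for illustration only. The sketch also ignores that D FFs and embedded RAM are shared with the rest of the design.

```python
# Approximate capacities in kilobits, copied from the list in the text.
# The order follows the text: fastest and costliest per bit first.
MEMORY_OPTIONS = [
    ("D FFs", 20),
    ("embedded RAM", 200),
    ("off-chip SRAM", 4_000),
    ("off-chip SDRAM", 64_000),
]


def smallest_fit(required_kbits):
    """Return the first listed option whose capacity covers the request.

    Returns None when even the largest option is too small.
    """
    for name, capacity_kbits in MEMORY_OPTIONS:
        if required_kbits <= capacity_kbits:
            return name
    return None


# Hypothetical example: a 640-by-480 frame buffer with 8 bits per pixel.
frame_kbits = 640 * 480 * 8 / 1000
print(frame_kbits, "kbits ->", smallest_fit(frame_kbits))
```

A real selection would also weigh access time and interface complexity, which the text discusses only qualitatively.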
16.2 BRIEF OVERVIEW OF TIMING AND CLOCK MANAGEMENT
As discussed in Section 5.1.2, the single most fundamental design principle is the synchronous methodology, in which all registers are driven by a single global clock. This methodology implicitly assumes that the rising edge of the clock signal arrives at all registers at the same time. In reality, this assumption holds only for an intermediate-sized circuit within the FPGA device. Non-ideal clocking must be taken into account in many designs, especially in a system with high-speed off-chip access. In this section, we provide a brief overview of the relevant timing issues and clock management schemes.
16.2.1 Clock distribution network
In a digital gate, the output stage "drives" the input ports of connected components. The number of input ports that can be driven is known as fan-out. A typical gate can drive around half a dozen ports (i.e., a fan-out of 6). Since all registers are connected to the same clock signal in a synchronous system, the fan-out of the clock signal is the number of FFs in the system, which can reach thousands or even tens of thousands in a large design.
To meet this requirement, an FPGA device contains special clock distribution networks to route the clock signal. A network is composed of multiple levels of buffers to increase the driving capability and is carefully placed and routed to balance and minimize the propagation delays. A conceptual three-level clock distribution network is shown in Figure 16.1, in which the fan-out of each buffer is four. To provide design flexibility, FPGA devices usually provide multiple clock distribution networks. There are 16 distribution networks in the EP2C20 device. A distribution network reaches all resources within the device and can be used for a global clock as well as for control signals, such as a clear or enable signal.
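The effect of buffering can be illustrated with a little arithmetic. The sketch below assumes an idealized balanced tree with a single root buffer and the same fan-out at every level, which is a simplification of a real clock network; the function names are hypothetical.

```python
def leaf_count(levels, fanout):
    """Number of FFs reachable through `levels` levels of buffers."""
    return fanout ** levels


def levels_needed(num_ffs, fanout):
    """Smallest number of buffer levels whose leaves cover `num_ffs` FFs."""
    levels = 0
    reach = 1  # a bare clock source drives one load in this model
    while reach < num_ffs:
        reach *= fanout
        levels += 1
    return levels


# The conceptual network in the text: three levels, fan-out of four.
print(leaf_count(3, 4))          # 64 leaf FFs

# Levels needed to reach 20,000 FFs with a fan-out of four.
print(levels_needed(20_000, 4))  # 8, since 4**7 = 16384 < 20000 <= 4**8
```

Because the reach grows exponentially with the number of levels, a handful of buffer levels is enough to drive tens of thousands of FFs.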
In a real system, the clock's sampling edge may reach FFs at different times, and the difference between the arrival times is known as clock skew. Because of the propagation delay of the buffers, the clock skew between the clock source and a leaf FF can be quite large. However, the skews between the leaf FFs are small since the FFs experience similar delays. Thus, for a synchronous system implemented completely within an FPGA chip (i.e., not considering off-chip signals), we can assume that it is driven by an ideal clock source.

Figure 16.1 Conceptual clock distribution network
16.2.2 Timing consideration of off-chip access
The timing analysis for off-chip signals is more complicated because it involves an I/O buffer delay, an I/O pad delay, and additional routing delays, and it can be affected by the external load and PCB (printed circuit board) routing.
One important timing parameter of a synchronous system is tCO, which defines the clock-to-output delay (i.e., the time required to obtain a stable output signal after the clock's sampling edge), and we use this to illustrate off-chip timing issues. The simplified timing path to determine the device-level clock-to-output delay is shown in Figure 16.2.
The system within the FPGA chip can be considered an ideal synchronous system. Its clock-to-output delay is labeled tCO in Figure 16.2 and its value is equal to tCQ plus tOUTPUT, as discussed in Section 5.5. On the other hand, the clock-to-output delay at the device level is the delay from the clock pin to the output pin. It is labeled tCO1 in Figure 16.2. tCO1 involves additional propagation delays:
- I/O input delay of the clock signal: the delays of pad, package pin routing, and I/O buffer.
- clock routing delay of the clock signal: the delay of the clock distribution network.
- logic array to I/O buffer delay of the output signal: the routing delay from a logic element to the I/O buffer.
- I/O output delay of the output signal: the delays of pad, package pin routing, and I/O buffer.
The I/O output delay is affected by the load on the pin. During timing analysis, the Quartus Timing Analyzer uses a default load value to estimate this delay. For a more accurate computation, we need to consider the actual PCB wiring and even transmission-line effects. It is labeled tCO2 in Figure 16.2.
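To make the relationship between tCO and tCO1 concrete, the sketch below treats the device-level delay as a simple sum of the on-chip tCO and the four additional delays listed above. The additive model, the field names, and all numeric values are assumptions chosen for illustration; they are not figures from a Cyclone II data sheet, and a real analysis would rely on the timing analyzer.

```python
from dataclasses import dataclass


@dataclass
class OutputPath:
    """Delays along a clock-pin-to-output-pin path, in nanoseconds."""
    t_clk_io_in: float   # I/O input delay of the clock signal
    t_clk_route: float   # clock distribution network delay
    t_cq: float          # clock-to-q delay of the register
    t_output: float      # delay of the on-chip output logic
    t_la_to_io: float    # logic array to I/O buffer routing delay
    t_io_out: float      # I/O output delay of the output signal

    def t_co(self):
        """On-chip clock-to-output delay: tCQ plus tOUTPUT."""
        return self.t_cq + self.t_output

    def t_co1(self):
        """Device-level delay, modeled as tCO plus the four extra delays."""
        return (self.t_clk_io_in + self.t_clk_route + self.t_co()
                + self.t_la_to_io + self.t_io_out)


# Hypothetical values, for illustration only.
path = OutputPath(t_clk_io_in=1.2, t_clk_route=2.0, t_cq=0.3,
                  t_output=1.5, t_la_to_io=0.8, t_io_out=2.4)
print(f"tCO  = {path.t_co():.1f} ns")
print(f"tCO1 = {path.t_co1():.1f} ns")
```

Even with these made-up numbers, the pattern is typical of the point the text makes: the delay seen at the pins can be several times the on-chip tCO.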

Figure 16.2 Conceptual diagram of off-chip delay.

Figure 16.3 Conceptual diagram of Cyclone II PLL.
16.2.3 PLL
To further facilitate clock and timing management, Cyclone II devices also contain PLL (phase-locked loop) circuits. The simplified block diagram of a Cyclone II PLL circuit is shown in Figure 16.3. It consists of a PFD (phase-frequency detector), a charge pump, a loop filter, a VCO (voltage controlled oscillator), and several frequency dividers and PS (phase selection) circuits. The key part of a PLL is the closed feedback loop. The PFD compares the phases of the reference input clock and feedback clock and outputs their difference. The charge pump and loop filter convert the difference to a voltage level. Based on the voltage level, the VCO oscillates at a higher or lower frequency, which affects the phase and frequency of the feedback clock. The negative feedback mechanism eventually forces the feedback clock and the reference input clock to have the same frequency and phase, which is said to be phase locked.
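The locking behavior can be illustrated with a heavily simplified behavioral model. The sketch below is a linearized toy loop, not a model of the Cyclone II circuit: the PFD is reduced to an accumulated phase error, the charge pump and loop filter to a proportional-plus-integral term, and the VCO to a frequency that moves linearly with that term. The feedback divider `m`, the free-running frequency, the gains, and the time step are all assumed values.

```python
def simulate_pll(f_ref, m, f_free, kp, ki, dt, steps):
    """Simulate a toy PLL and return (final VCO frequency, final phase error).

    Frequencies are in MHz, time in microseconds, phase error in cycles.
    """
    phase_err = 0.0  # reference phase minus feedback phase (PFD output)
    integ = 0.0      # integrated phase error (state of the loop filter)
    f_vco = f_free
    for _ in range(steps):
        # Loop filter output steers the VCO around its free-running frequency.
        f_vco = f_free + kp * phase_err + ki * integ
        # Feedback clock is the VCO output divided by m.
        f_fb = f_vco / m
        # The phase error grows while the reference runs faster than feedback.
        phase_err += (f_ref - f_fb) * dt
        integ += phase_err * dt
    return f_vco, phase_err


# Assumed setup: 50 MHz reference, feedback divider of 12, VCO starting
# at 500 MHz. The loop should settle with the VCO near 12 * 50 = 600 MHz.
f_vco, phase_err = simulate_pll(f_ref=50.0, m=12, f_free=500.0,
                                kp=24.0, ki=12.0, dt=0.01, steps=3000)
print(f"VCO frequency: {f_vco:.3f} MHz, phase error: {phase_err:.6f} cycles")
```

In this model the negative feedback drives both the frequency difference and the phase error toward zero, which corresponds to the phase-locked condition described above.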
There are several frequency dividers in the PLL, and we can perform frequency synthesis by adjusting the values of these dividers. Because of the PLL loop, ƒREF = ƒFB. Since ƒREF = ƒIN/N and ƒFB = ƒVCO/M, we have

ƒVCO = (M/N) × ƒIN
In a Cyclone II PLL, the VCO output is fed to three separate frequency dividers and phase selection circuits to obtain three output clocks. For example, if the divider for output clock 0 is set to C0, the frequency of output clock 0 is ƒVCO/C0.
We can also configure the PS circuit to shift the phase of the output clocks (i.e., to place the sampling edge of the output clock ahead of or behind the sampling edge of the input clock).
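The frequency synthesis relations can be checked with a short calculation. The sketch below assumes a 50 MHz input clock and hypothetical divider settings; the function name, the parameter `c` for an output divider, and the sign convention for the phase shift (positive means later) are assumptions. It does not check the legal VCO range or the phase-shift resolution of an actual device.

```python
from fractions import Fraction


def pll_output(f_in_mhz, m, n, c, phase_deg=0):
    """Return (VCO frequency, output frequency, phase shift in ns).

    f_vco = f_in * m / n follows from f_REF = f_FB in the text;
    the output clock is the VCO clock divided by c.
    """
    f_vco = Fraction(f_in_mhz) * m / n
    f_out = f_vco / c
    period_ns = 1000 / f_out               # 1000 ns per microsecond
    shift_ns = period_ns * Fraction(phase_deg, 360)
    return f_vco, f_out, shift_ns


# Assumed settings: 50 MHz input, M = 12, N = 1, output divider of 6,
# and the output edge placed a quarter period ahead of the input edge.
f_vco, f_out, shift_ns = pll_output(50, m=12, n=1, c=6, phase_deg=-90)
print(f"VCO {f_vco} MHz, output {f_out} MHz, shift {float(shift_ns)} ns")
```

Exact fractions are used so that ratios such as M/N = 4/3 do not introduce rounding error into the computed frequencies.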
The output of a Cyclone II PLL can be connected to a clock distribution network or an output pin. The PLL can be used to change the system clock rate with a fixed external oscillator and drive different subsystems with different clock rates. It can also be used to reduce clock skew and adjust the arrival time of a clock's sampling edge to meet special timing requirements. There are four PLLs in an EP2C20 device.
From Embedded SOPC Design with Nios II Processor and Verilog Examples by Pong P. Chu, copyright 2012, John Wiley & Sons, Inc. Reproduced by permission.