Design Article
Reducing energy cost of intra-chip communications
Fabien Clermidy, Ivan Miro-Panades, Yvain Thonnart and Pascal Vivet, CEA-Leti
5/15/2012 8:08 AM EDT
The advent of network-on-chip
Until the early 2000s, busses were the dominant on-chip communication infrastructure. They offered good flexibility and were widely adopted. However, they also had drawbacks, especially in scalability and power consumption: a bus crossed the whole chip to connect IP blocks, and scaling was achieved by adding wires, which resulted in high wire capacitance. This reduced performance and increased power consumption. Segmented busses were introduced later, but their irregular structure limited the appeal of the bus without really solving these issues.
In the late 1990s, the network-on-chip (NoC) concept was introduced. The keywords that define NoCs are regularity, flexibility, throughput scalability and reduced power consumption. NoCs build on the background of multiprocessor interconnects but differ in their implementation, since their latency, area cost and power consumption requirements are different. As regular structures, they bring the flexibility and scalability needed for the platform concept. In terms of power consumption, they are more efficient than busses thanks to shorter wires, and they typically halve the power consumed by communication. However, these advantages come at a cost in latency, because going from one processing element (PE) to another means crossing several switches or routers.
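The wire-length argument can be made concrete with the usual first-order CMOS switching-power model, in which dynamic power is proportional to the switched capacitance. The sketch below is only an illustration: the capacitance per millimeter, supply voltage, frequency, activity factor and wire lengths are hypothetical values, not figures from the article.

```python
def wire_dynamic_power(cap_per_mm, length_mm, vdd, freq_hz, activity=0.5):
    """First-order switching power (W) of one wire: activity * C * Vdd^2 * f."""
    capacitance = cap_per_mm * length_mm  # total wire capacitance in farads
    return activity * capacitance * vdd ** 2 * freq_hz

# Hypothetical numbers, chosen only for illustration.
CAP_PER_MM = 0.2e-12   # 0.2 pF per mm of wire
VDD = 1.0              # supply voltage in volts
FREQ = 500e6           # 500 MHz toggle frequency

bus_wire = wire_dynamic_power(CAP_PER_MM, 10.0, VDD, FREQ)  # chip-crossing bus wire
noc_link = wire_dynamic_power(CAP_PER_MM, 2.0, VDD, FREQ)   # short router-to-router link

print(f"bus wire: {bus_wire * 1e3:.3f} mW, NoC link: {noc_link * 1e3:.3f} mW")
```

In this model the power of a single wire scales linearly with its length. A real NoC transfer crosses several links and routers, so the per-wire ratio overstates the overall saving, which is consistent with the more modest factor of two quoted above.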
Limitations of classical NoC-based architectures
Even though NoC-based architectures solve many of the issues of many-core architectures, their power consumption remains high and tends to grow with the number of cores. Without innovation in this field, communication alone could have accounted for more than 50 percent of the total SoC power consumption. Many factors contribute, the first being clock distribution. A NoC is distributed over the whole SoC, and the clock tree of a fully synchronous NoC typically represents 30 percent of its power consumption. The cause is the extra buffers required to obtain a balanced clock at the high frequency that the NoC needs to sustain its communication throughput.
However, clock distribution is not the only problem. A more fundamental issue is the difficulty of predicting communication events, which often occur as data bursts and whose dynamics depend on the behavior of the different PEs. As a result, defining power modes for the interconnect is a hard task.
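As a rough illustration of what these two percentages imply together, the short calculation below multiplies them. Combining the figures this way is an assumption made here for illustration: the article gives them separately and does not state that they apply to the same design.

```python
# Figures quoted in the text, expressed as fractions.
noc_share_of_soc = 0.50    # communication share of total SoC power (projected bound)
clock_share_of_noc = 0.30  # clock tree share of a fully synchronous NoC's power

# Assumption: both figures hold for the same chip at the same time.
clock_tree_share_of_soc = noc_share_of_soc * clock_share_of_noc

print(f"NoC clock tree share of SoC power: {clock_tree_share_of_soc:.0%}")
```

Under that assumption, the NoC clock tree alone would consume about 15 percent of the chip's power, which is why clock distribution is the first target for optimization.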
Globally-asynchronous, locally synchronous (GALS) paradigm
GALS architectures are a solution for dealing with multiple clock domains. Consequently, they also address the clock tree distribution issue in NoC-based architectures and have been widely used. The main difficulty with GALS architectures is the resynchronization phase, which can imply large area and latency overheads.

The so-called mesochronous scheme is the most classical one. It considers clocks with the same frequency but different phases. Synchronization between clock domains can then be simplified thanks to these identical frequencies. One solution is to invert the clocks between two neighboring blocks (Figure 3). The tolerated phase drift is then limited to half the clock period, but this still greatly relaxes clock tree synthesis and thus reduces the corresponding power consumption. Another solution is to use a learning phase in which signal conflicts are detected, and then to avoid the conflicting cases in a second phase. This scheme requires minimal synchronization hardware and, thanks to the learning phase, has lower latency than the clock inversion scheme. This second scheme can also be extended to ratiochronous clocks, i.e. clocks related by an integer ratio. It thus allows PEs with different frequencies, all derived from a root clock, to be connected. However, the precision of frequency selection is limited when the root clock frequency and the target frequency are in the same range.
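The precision limit of ratiochronous clocking can be seen with a small calculation. The sketch below assumes that each PE clock is obtained by dividing the root clock by an integer; the 1 GHz root frequency, the divider range and the target frequencies are hypothetical, and the function names are introduced here only for illustration.

```python
def achievable_frequencies(root_hz, max_divider):
    """Frequencies reachable by dividing the root clock by an integer."""
    return [root_hz / n for n in range(1, max_divider + 1)]

def closest_frequency(root_hz, target_hz, max_divider=64):
    """Return (divider, frequency, relative_error) for the best integer divider."""
    best = min(range(1, max_divider + 1),
               key=lambda n: abs(root_hz / n - target_hz))
    freq = root_hz / best
    return best, freq, abs(freq - target_hz) / target_hz

ROOT = 1000e6  # hypothetical 1 GHz root clock

# Target close to the root: the nearest choices are 1000 MHz and 500 MHz.
near_root = closest_frequency(ROOT, 700e6)
# Target far below the root: neighboring dividers give closely spaced frequencies.
far_below = closest_frequency(ROOT, 70e6)

for label, (div, freq, err) in (("700 MHz", near_root), ("70 MHz", far_below)):
    print(f"target {label}: divider {div}, {freq / 1e6:.1f} MHz, error {err:.1%}")
```

With these numbers, the 700 MHz target can only be approximated by 500 MHz, whereas the 70 MHz target is matched within a few percent. The available frequencies are coarse near the root clock and become dense only well below it.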



Figure 3: A mesochronous implementation using inverted clocks scheme (DSPIN)
The asynchronous scheme is the most advanced paradigm. In this case, clock frequencies and phases are unrelated. Crossing between two frequency domains then requires a complex and costly synchronization scheme because of metastability. Two solutions have been studied: asynchronous FIFOs and pausable clocks (Figure 4).
The first solution is costly both in hardware, because successive data items have to be stored temporarily to sustain one data transfer per cycle, and in latency, because at least two cycles are lost each time a domain boundary is crossed. The pausable clock scheme, in turn, requires a local clock generator so that the core clock can be controlled when conflicts are detected. Moreover, the clock is paused more or less often depending on the traffic between the core and the outside, so core performance depends on the amount of communication. Because of these inherent issues, this technique has not been implemented in industrial circuits.
Asynchronous GALS allows advanced power management of the cores, as it can be combined with Dynamic Voltage and Frequency Scaling (DVFS). However, clocks are still distributed across the whole chip, and communication remains difficult to predict, which limits the impact of power management on the NoC itself. From this perspective, mesochronous schemes are intrinsically limited, but asynchronous ones can be exploited further by completely removing the clock inside the NoC.
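The two trade-offs can be compared with a deliberately simple model. It is a sketch under stated assumptions rather than a description of any real implementation: the FIFO model charges a fixed number of cycles per boundary crossing (two, the minimum cited above), and the pausable clock model assumes, pessimistically, that every transfer stalls the local clock for a fixed time. All function names and numeric values are hypothetical.

```python
def fifo_crossing_latency(crossings, sync_cycles=2):
    """Cycles lost to resynchronization over a number of domain crossings."""
    return crossings * sync_cycles

def pausable_clock_effective_freq(nominal_hz, transfers_per_sec, pause_sec):
    """Average core frequency if each transfer pauses the clock for pause_sec."""
    paused_fraction = min(1.0, transfers_per_sec * pause_sec)
    return nominal_hz * (1.0 - paused_fraction)

NOMINAL = 500e6  # hypothetical 500 MHz core clock
PAUSE = 1e-9     # hypothetical 1 ns clock stall per transfer

light_traffic = pausable_clock_effective_freq(NOMINAL, 10e6, PAUSE)
heavy_traffic = pausable_clock_effective_freq(NOMINAL, 200e6, PAUSE)

print(f"FIFO: {fifo_crossing_latency(3)} cycles lost over 3 crossings")
print(f"pausable clock: {light_traffic / 1e6:.0f} MHz under light traffic, "
      f"{heavy_traffic / 1e6:.0f} MHz under heavy traffic")
```

In this model the FIFO cost is a fixed latency that does not depend on traffic, while the pausable clock slows the core itself as traffic grows, which illustrates why core performance becomes dependent on the amount of communication.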


Figure 4: FIFO resynchronization (left) versus pausable clocks (right) for asynchronous GALS

