4. Micro-architecture: Optimizing and exploring the
micro-architecture is a powerful technique for improving an algorithm's
performance and tuning its power. For algorithms, it's not just
about operating frequency. The important measurements are latency (how
long it takes to produce the first result) and throughput (how fast data
can be fed in).
An eight-tap FIR filter with eight multipliers may
have a latency of one cycle if the clock period is long enough for the
adder tree to complete in the same cycle as the multipliers. It might
instead have a latency of two, three, or more cycles (if using pipelined
multipliers), yet the throughput can remain constant at one sample per
clock cycle.
If only one
multiplier is used and the coefficients and tap registers are restricted
to a single RAM, then the latency might be nine or ten clock cycles (or
more) and the throughput correspondingly lower, but this comes with the
benefit of considerably reduced area and power [Figure-2].
Figure 2: FIR serial versus FIR parallel implementation
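To make the trade-off concrete, here is a minimal, untimed C++ sketch of an eight-tap FIR (the class and names are illustrative, not from the article); the MAC loop is the part an HLS tool would either fully unroll onto eight multipliers (parallel, low latency) or keep rolled onto one shared multiplier (serial, smaller area):

```cpp
#include <array>
#include <cstddef>

// Illustrative FIR sketch. The MAC loop below is the candidate for an HLS
// unrolling constraint: fully unrolled -> eight multipliers, ~1-cycle
// throughput; rolled -> one shared multiplier, ~8+ cycles per sample.
template <std::size_t Taps>
class Fir {
public:
    explicit Fir(const std::array<int, Taps>& coeffs) : h_(coeffs), taps_{} {}

    int step(int sample) {
        // Shift the tap delay line.
        for (std::size_t i = Taps - 1; i > 0; --i) taps_[i] = taps_[i - 1];
        taps_[0] = sample;

        // MAC loop: same source code, very different hardware depending on
        // the unroll/pipeline constraints applied in the HLS tool.
        int acc = 0;
        for (std::size_t i = 0; i < Taps; ++i) acc += h_[i] * taps_[i];
        return acc;
    }

private:
    std::array<int, Taps> h_;  // coefficients
    std::array<int, Taps> taps_;  // delay line
};
```

Feeding an impulse through such a filter simply plays back the coefficients, which makes the functional behavior easy to check before any hardware constraints are applied.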
Designers may reduce the clock frequency to reduce power. This may then
require increased parallelism in the design (using loop unrolling in the
HLS tool) to maintain latency. Applying loop pipelining and unrolling
constraints in an HLS tool helps achieve these implementations quickly.
Designers can then compare the power, performance, and area of each
implementation. The right implementation depends on the design goals
regarding frequency, latency, and throughput, collectively.
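As a sketch of how such constraints are expressed, the commented pragmas below use Vitis HLS-style syntax (an assumption; other tools use attributes or Tcl directives instead), while the C++ source itself stays unchanged between implementations:

```cpp
#include <cstddef>

// Illustrative dot product. The same loop yields a parallel or serial
// implementation depending only on the tool constraint applied, e.g.:
//   #pragma HLS unroll          -> replicate multipliers, minimize latency
//   #pragma HLS pipeline II=1   -> share hardware, one result per cycle
// (Pragma spellings are Vitis HLS style and shown as comments here so the
// code also compiles as plain C++.)
int dot8(const int a[8], const int b[8]) {
    int acc = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

Because the constraints live outside the algorithm, each candidate implementation can be generated and its power, performance, and area compared without touching the golden source.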
The golden source code (SystemC/C++/C) is independent of technology
details. The same code can be retargeted to different target
technologies [Figure-3] because frequency is just a parameter. Through
frequency explorations, designers can set or adjust the clock frequency;
the HLS tool then determines how to fit the logic into each clock
cycle. Also, since the implementation can be controlled down to the
resources used, designers can experiment with different operators,
such as pipelined multipliers and adders.
Figure 3: Target optimized RTL code generation
For example, if the analysis tools show that a design actually has some
extra slack in one implementation, the designer can reduce voltage to
save power. Or, with a slightly faster implementation, they can share
more operators. In this way, they can balance dynamic power against
parallelism for better performance.
6. Block hierarchy
Partitioning a design into hierarchical blocks naturally lends itself to multi-clock design. More
advanced HLS tools support running the blocks at different clock speeds
and handling the data transfer between blocks through FIFOs. Designs
with decimation are well suited to multi-clock design [Figure-4].
Figure 4: Decimation
Blocks with lower data rates may run with a slower clock, reducing both
the switching power and, by decreasing block area, the static power. In
more general cases, the clock frequency can be tuned to match the best
implementation for throughput, latency, or power, based on the
technology target, with the same source code.
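A minimal decimation sketch (the function name and setup are illustrative, not from the article) shows why the downstream block can tolerate a slower clock: after decimating by N, it only sees one sample in N, so in a multi-clock HLS design it can run from a clock N times slower, with a FIFO bridging the two clock domains:

```cpp
#include <cstddef>
#include <vector>

// Illustrative decimate-by-n: keep every n-th sample. The block consuming
// the output runs at 1/n of the input data rate, so it can be clocked
// proportionally slower, cutting switching power.
std::vector<int> decimate(const std::vector<int>& in, std::size_t n) {
    std::vector<int> out;
    for (std::size_t i = 0; i < in.size(); i += n) out.push_back(in[i]);
    return out;
}
```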
7. LVFS (Low Voltage Frequency Scaling)
In low-power mode, the HLS tool can insert an idle signal (a 1-bit
output port) into the design. This signal is asserted when the block is
in an idle state: not processing any data, not reading any input, and
not writing to any output. The signal can then drive a system-level
power management strategy, such as LVFS or gating the clock to the
block.
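The idle condition described above can be sketched as a simple predicate (the struct and signal names are hypothetical, not tied to any particular HLS tool):

```cpp
// Hypothetical block status flags; in generated RTL these would be
// internal states of the block, not a user-visible struct.
struct BlockState {
    bool processing;      // datapath busy
    bool reading_input;   // input channel active
    bool writing_output;  // output channel active
};

// The 1-bit idle output: asserted only when the block is doing nothing,
// so a system power manager can safely gate its clock or scale voltage.
inline bool idle_signal(const BlockState& s) {
    return !s.processing && !s.reading_input && !s.writing_output;
}
```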