Editor's Note: In this three-part series, Dr. Aljosa Vrancic and Jeff Meisel demonstrate how a novel approach with Intel hardware and software enables real-time high-performance computing (HPC), solving problems with multi-core processors that were not possible only five years ago.
- Part 1 is a review of real-time concepts that are important for understanding this domain of engineering problems, and a comparison of traditional HPC with real-time HPC.
- Part 2 outlines software architecture approaches for utilizing multi-core processors with cache optimizations.
- Part 3 will consider industry examples of this particular methodology.
In traditional embedded systems, CPU caches are viewed as a necessary evil. The "evil" side shows up as a nondeterministic execution time that is inversely related to the amount of the time-critical task's code and/or data located inside the cache when the task execution is triggered. For demonstration purposes, we benchmarked cache performance to better understand some important characteristics. The technique applied uses a structure within LabVIEW called a timed loop, shown in Figure 12.
Figure 12: Timed loop structure (used for benchmarking)
The timed loop acts as a regular while loop, but with some special properties that lend themselves to profiling hardware. For example, the structure executes any code within the loop in a single thread. The timed loop can be scheduled with microsecond granularity, and it can be assigned a relative priority that will be handled by the RTOS. In addition, it can set processor affinity, and it can also react to hardware interrupts. Although the programming patterns shown in the following section do not utilize the timed loop, it is also quite useful for real-time HPC applications, where parallelism is harvested through the use of multiple timed loop structures and queue structures to pass data between them.
The following describes benchmarks that were performed to understand cache behavior. The execution time of a single timed loop iteration as a function of the amount of cached code/data is shown in Figure 13. The loop runs every 10 ms, and we use an indirect way to cause the loop's code/data to be flushed from the cache: a lower-priority task that runs after each iteration of the loop adds 1 to each element of an increasingly larger array of doubles, flushing more and more of the time-critical task's data from the CPU cache. In addition to the longer runtime, in the worst case the iteration time goes from 4 to 30 microseconds, an increase by a factor of 7.5. Figure 13 also shows that decaching increases jitter. The same graph can also be used to demonstrate the "necessary" part of the picture. Even though some embedded CPUs go as far as completely eliminating cache to improve determinism, it is obvious that such measures also significantly reduce performance. Besides, few people are willing to go back one or two CPU generations in performance, especially as the amounts of L1/L2/L3 cache are continuously increasing, providing enough room for most applications to run while incurring only minimal cache-related jitter.
Figure 13: Execution time of a simple time-critical task as a function of the amount of cached code/data on a 3.2-GHz Intel Core i7 CPU with 8-MB L3 cache, using LabVIEW Real-Time. The initial ramp-up is due to the 256-KB L2 cache.