Multicore microprocessors and embedded multicore SOCs have very different needs

Steve Leibson, Tensilica, Inc.

7/18/2007 12:00 AM EDT

The term "multicore" is getting a lot of use these days. For example, there's an industry association dedicated to the idea, and the IEEE Computer Society's Computer magazine recently devoted two cover stories to the concept [1,2]. Like the elephant in the poem about the blind men [3], the term appears to mean different things to different people depending on the context.

When used to describe PC-class microprocessors, the phrase nearly always refers to on-chip arrays of identical, single-ISA (instruction-set architecture) processors that handle processing loads using homogeneous or symmetric multiprocessing (SMP) and shared memory.

For SOC designs, the term may refer to shared-memory SMP architectures but it can also mean heterogeneous (single-ISA or multiple-ISA), single-chip, asymmetric multiprocessing (AMP) designs, with or without shared memory. Therefore, whenever you see a reference to a multicore chip or design, you need to dig deeper to clarify how the term is being used.

SMP and AMP approaches with and without shared memory can be used to solve processing problems that are beyond the capabilities of an individual microprocessor. Multicore PC and server microprocessors based on the x86 architecture started to appear after Intel and AMD hit the clock-rate wall and could no longer increase single-core-processor clock rates the way they did throughout the 1990s.

The maximum clock rates of these processors approached 4 GHz, at the cost of excessive power consumption, heat dissipation, and electromigration-related reliability concerns.

The path to further increases in processor performance through increased clock rates appeared to be blocked. An alternate path involved putting two and then four identical processor cores (and later eight and probably 16 processor cores) on a chip, with each core running at a lower clock rate to reduce power consumption and heat dissipation.
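The trade-off behind this move can be sketched with the textbook approximation that dynamic CMOS power scales with capacitance times voltage squared times frequency. The sketch below is illustrative only: the 0.8 scaling factors and the assumption that supply voltage can drop along with clock rate are not figures from Intel or AMD, and the function name is hypothetical.

```python
# Illustrative model only: dynamic CMOS power ~ C * V^2 * f.
# The 0.8 scaling factors below are assumed numbers, not vendor data.

def dynamic_power(capacitance, voltage, frequency):
    """Relative dynamic (switching) power of one core."""
    return capacitance * voltage ** 2 * frequency

# Baseline: one core at nominal voltage and clock, normalized to 1.0.
single_core = dynamic_power(1.0, 1.0, 1.0)

# Assumed alternative: two cores, each at 80% clock and 80% supply voltage.
dual_core = 2 * dynamic_power(1.0, 0.8, 0.8)

# Peak throughput if the workload splits perfectly across both cores.
dual_throughput = 2 * 0.8

print(f"single core: power={single_core:.3f}, throughput=1.000")
print(f"dual core:   power={dual_core:.3f}, throughput={dual_throughput:.3f}")
```

Under these assumed numbers, the two slower cores draw about the same dynamic power as the single fast core (1.024 versus 1.0) while offering up to 1.6 times the peak throughput, provided the software can keep both cores busy.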

Figure 1, below, adapted from an article in Microprocessor Report [4], shows high-level block diagrams for upcoming quad-core processors from Intel and AMD. Although there are some architectural differences, both designs show the result of the need to distribute the processing load of a large operating system (generally Microsoft Windows) and its large application programs over several processors.

A large shared DRAM memory (hundreds or thousands of megabytes) is required to hold the operating system, applications, and data. Each of the processor cores therefore has at least two levels of SRAM cache memory (three in the case of AMD's Barcelona processor) to serve as speed adapters that isolate each processor's high-speed execution engine from relatively slow shared memory.

Figure 1: AMD and Intel Quad-Core x86 Microprocessors

Like barnacles, SRAM cache hierarchies have accumulated around general-purpose processor cores as the disparity between processor clock rate and memory speed has grown. Although essential to this sort of architecture, cache hierarchies are inherently inefficient because they keep multiple copies of data and instruction blocks.

There is always an overhead cost (in time, power dissipation, and silicon area) associated with moving information among cache-hierarchy levels, although processor architects work hard to minimize these penalties.
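The speed-adapter role of a cache hierarchy can be illustrated with the standard average-memory-access-time calculation. This is a minimal sketch with assumed latencies and miss rates; none of the numbers describe the AMD or Intel parts discussed here, and the function name is hypothetical.

```python
# Textbook average memory access time (AMAT) for a cache hierarchy.
# All latencies and miss rates below are assumed, illustrative values.

def amat(levels, memory_latency):
    """Average access time in cycles.

    levels: list of (hit_time_cycles, local_miss_rate), ordered from the
            level closest to the processor outward.
    memory_latency: cycles to reach shared DRAM.
    """
    total = memory_latency
    # Work inward from DRAM: each level adds its hit time and passes
    # only its misses on to the slower level behind it.
    for hit_time, miss_rate in reversed(levels):
        total = hit_time + miss_rate * total
    return total

no_cache = amat([], 200)                         # every access goes to DRAM
two_level = amat([(2, 0.05), (12, 0.20)], 200)   # assumed L1 and L2 caches

print(f"no cache:  {no_cache:.1f} cycles per access")
print(f"two level: {two_level:.1f} cycles per access")
```

With these assumed values the average access drops from 200 cycles to 4.6, which is why the hierarchy is essential despite the duplicated storage and transfer overhead it carries.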

Other vendors of server processors have also taken the multicore path. Figure 2 below shows a high-level block diagram of Sun's Niagara II server processor. Niagara II contains eight multithreaded processor cores. Each processor core has its own level-1 instruction and data caches and the processor cores share a large level-2 cache. Four memory controllers keep the caches filled.

Figure 2: Sun Niagara II 8-Core Processor

These first three examples of multicore microprocessors illustrate the stamp-and-repeat nature of multicore design for general-purpose processors. Because each general-purpose processor core must be able to handle any system task, the processor cores tend to be identical and tend to be arranged in regular, symmetric arrays.

Multicore processor chips and SOC designs for embedded applications can resemble the general-purpose multicore arrays, as shown by the IBM Cell Broadband Engine block diagram in Figure 3 below [2].

The 9-core Cell Broadband Engine contains eight independent synergistic processor elements (SPEs). Instead of a cache hierarchy, each SPE has a 256-kbyte local memory store that it uses for holding instructions and data. An SPE cannot directly access memory outside of its local store.

Instead, it relies on the intervention of an associated memory flow controller (MFC) to transfer words between the SPE's local memory and main memory. Transfers take place across a sophisticated, high-speed (205 Gbytes/sec), 4-ring interconnect called the element interconnect bus (EIB).

The ninth on-chip processor core, which is a general-purpose processor, is also attached to the EIB network and acts as the taskmaster, scheduling and initiating processing tasks on the SPEs. Only the on-chip general-purpose processor has cache memories.

Figure 3: IBM's 9-Core Cell Broadband Engine

Although the largely symmetric configuration of the block diagram for IBM's multicore CBE superficially resembles the block diagrams of the general-purpose multicore processor arrays from AMD, Intel, and Sun shown in Figures 1 and 2, there are important differences.

First, each of the CBE's SPEs has a local memory instead of a cache. Further, the SPEs do not share memory. Although the SPEs can dip into a shared memory through requests issued to their MFCs, the shared-memory address space is not part of any SPE's directly addressable memory space.

The MFCs contain memory-management units that provide access to the separate shared-memory space using the virtual address mapping defined by the lone on-chip general-purpose processor.
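The local-store arrangement can be pictured with a toy model in which compute code touches only a private buffer and every exchange with shared memory is an explicit block copy. The class and method names below are hypothetical and are not the Cell SDK API; address translation in the MFC is omitted.

```python
# Toy model only: names are hypothetical, not the Cell SDK API.

class MemoryFlowController:
    """Copies blocks between shared main memory and one local store."""

    def __init__(self, main_memory, local_store):
        self.main_memory = main_memory
        self.local_store = local_store

    def get(self, main_addr, local_addr, size):
        # Main memory -> local store.
        block = self.main_memory[main_addr:main_addr + size]
        self.local_store[local_addr:local_addr + size] = block

    def put(self, local_addr, main_addr, size):
        # Local store -> main memory.
        block = self.local_store[local_addr:local_addr + size]
        self.main_memory[main_addr:main_addr + size] = block


class ProcessorElement:
    LOCAL_STORE_BYTES = 256 * 1024  # 256-kbyte local store, per the article

    def __init__(self, main_memory):
        self.local_store = bytearray(self.LOCAL_STORE_BYTES)
        self.mfc = MemoryFlowController(main_memory, self.local_store)

    def double_block(self, main_addr, size):
        """Fetch a block, process it locally, then write it back."""
        self.mfc.get(main_addr, 0, size)
        for i in range(size):  # computation touches only the local store
            self.local_store[i] = (self.local_store[i] * 2) % 256
        self.mfc.put(0, main_addr, size)


main_memory = bytearray(range(16))        # stand-in for shared main memory
element = ProcessorElement(main_memory)
element.double_block(main_addr=4, size=4)
print(list(main_memory[:8]))
```

The point of the sketch is the discipline it enforces: the processing loop never indexes main memory, so all shared-memory traffic is visible as explicit transfers that software must schedule.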

IBM's CBE architecture demonstrates a key difference between general-purpose computing and server applications on the one hand and embedded applications on the other. Shared memory spaces benefit general-purpose computing applications, while the dissimilar, real-time tasks executed for embedded applications (such as audio, video, image, and network processing) benefit from more separation between the multiple processor cores.

Due to the highly asymmetric nature of embedded tasks, even within the same system on the same chip, the silicon efficiency of embedded multicore SOC designs can benefit from the use of diverse processor cores to execute the diverse tasks.

The block diagram of a Super 3G cellphone handset processor, shown in Figure 4 below, illustrates this situation [5]. (Tasks amenable to processor-based execution are shown in gray.)

Figure 4: Block Diagram of a Super 3G Cellphone Processor

Some of the tasks in the Super 3G handset processor involve multimedia (audio, video, and image) processing; some tasks involve running the user interface; some are baseband-processing tasks; and some (those on the left) are associated with transmission processing.

None of these tasks resembles the others (like the parts of the elephant in Saxe's poem). Many of these tasks will run on their assigned processor without needing an RTOS (real-time operating system). Others will need only the thinnest kernel, while some may require an RTOS for task supervision.

Although general-purpose processor cores can perform all of the tasks shown in Figure 4 above, they cannot perform them efficiently and will likely need relatively high clock rates to perform the required processing in the allotted time. Processor cores more closely matched to the tasks can execute these tasks in far fewer clock cycles.

Such tailored processors will therefore be able to run at lower clock rates and will consequently consume less energy. Reducing energy consumption is absolutely critical in battery-powered applications such as a cellphone handset and is increasingly important in even line-powered embedded applications as energy costs climb.
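The link between cycle count and clock rate is simple arithmetic: a task with a fixed deadline needs a clock at least as fast as its cycles per deadline period. The cycle counts and frame rate in this sketch are assumed values chosen for illustration, not measurements of any real core, and the function name is hypothetical.

```python
# Minimum clock rate needed to finish a fixed-deadline task on time.
# Cycle counts and frame rate are assumed, illustrative values.

def required_clock_hz(cycles_per_frame, frames_per_second):
    """Lowest clock rate that completes every frame before its deadline."""
    return cycles_per_frame * frames_per_second

FRAMES_PER_SECOND = 30

general_purpose_hz = required_clock_hz(4_000_000, FRAMES_PER_SECOND)
tailored_hz = required_clock_hz(500_000, FRAMES_PER_SECOND)

print(f"general-purpose core: {general_purpose_hz / 1e6:.0f} MHz")
print(f"tailored core:        {tailored_hz / 1e6:.0f} MHz")
```

With these assumed numbers the tailored core meets the same deadline at 15 MHz instead of 120 MHz, and a lower clock rate typically also permits a lower supply voltage, which compounds the energy saving.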

Although the benefits of general-purpose, SMP multicore microprocessors are clear, the law of diminishing returns can rear its ugly head above four cores [6], except for server applications where the number of users can be large.
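One common way to picture diminishing returns is Amdahl's law, used here as a generic illustration rather than as the analysis in reference [6]. The parallel fraction of 0.8 is an assumed value, and the function name is hypothetical.

```python
# Amdahl's law: speedup on N identical cores when only part of the
# workload can run in parallel. The 0.8 fraction is an assumption.

def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup over one core for the given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

PARALLEL_FRACTION = 0.8

for cores in (1, 2, 4, 8, 16):
    speedup = amdahl_speedup(PARALLEL_FRACTION, cores)
    efficiency = speedup / cores  # speedup delivered per core
    print(f"{cores:2d} cores: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

Under this assumption four cores deliver a 2.5x speedup, while sixteen cores deliver only 4x. Server workloads with many independent users behave as if the parallel fraction were much closer to 1, which is why they keep scaling.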

The Super 3G cellphone processor shown in Figure 4 above illustrates that the large number of tasks running on complex embedded SOCs can benefit from a large number of heterogeneous processor cores. Many complex embedded applications similarly benefit.

Steven Leibson is the Technology Evangelist for Tensilica, Inc. He recently co-authored a book, Engineering the Complex SOC, with Tensilica's President and CEO Chris Rowen. Leibson formerly served as the Vice President of Content and Editor in Chief of the Microprocessor Report.

References:
[1] Nidhi Aggarwal, et al., "Isolation in Commodity Multicore Processors," Computer Magazine, June 2007, pages 49-59.
[2] Michael Gschwind, et al., "An Open Source Environment for Cell Broadband Engine System Software," Computer Magazine, June 2007, pages 37-47.
[3] John Godfrey Saxe, "The Blind Men and the Elephant."
[4] Jim McGregor, "The New x86 Landscape," Microprocessor Report, May 14, 2007.
[5] Eisuke Miki, "Cell Phone Technology for Super 3G and Beyond," Microprocessor Forum, San Jose, CA, May 22, 2007.
[6] Rakesh Kumar, et al., "Homogeneous Chip Multiprocessors," Computer Magazine, November 2005, pages 32-38.





