CPU and GPU subsystems are not just "bigger IP" because they have to be wrapped and they have to be optimized to the nth degree to be competitive on PPAR.
When is a subsystem not like an IP? I have been asked this question several times, and I didn't have a good answer until recently. A change in the adoption of one of our tools led me circuitously to a cause, and also to an interesting trend in SoC design. The change is new growth in the use of our GenSys tool for the assembly and optimization of subsystems -- particularly ARM and GPU subsystems. Tracing back from that effect to the root cause has taken some detective work. In the process, I gained a better understanding of what makes subsystems more than just "bigger IP."
A typical starting point -- an ARM Cortex-A57 subsystem. This includes a multi-core CPU and multiple caches, as well as a bus interface to external logic.
The design task in using this subsystem can be broken down into two components as follows:
The subsystem must be configured for use in SoCs, which almost certainly do not follow ARM standards. ARM provides the means to configure some number of slave interfaces with certain protocols, but -- outside the subsystem -- you will very probably use local bus protocols or your own flavor of AMBA. (Even if you use the AMBA protocol internally, unless you use the ARM version of the AMBA generator, you are using your own flavor of AMBA.) Your SoC integration teams are experienced in working with these local protocols, not the ARM protocol. Therefore, you have to wrap the subsystem in adaptation logic -- bus adaptors at minimum, possibly also configuration registers adapted to local standards and so on -- whatever is needed to provide a standard "localized" package to all integrators.
The subsystem must be tuned to give the absolute best Power, Performance, Area/Cost, and Reliability (PPAR). While common IP platforms have unquestionable upsides, one downside is that most competitors for any given application are building their products around the same components. Since these contribute significantly to your overall product PPAR, how do you differentiate (aside from whatever secret sauce you add), or -- at least -- not fall behind? The only possible way in hardware is through superior implementation -- you optimize for the very best PPAR you can get, which means you need to turn this whole thing into a hard macro. It is well-known that, where a basic full-chip ARM-based implementation may run at say 500 MHz, a finely-tuned hard macro can run at 3 GHz or higher, probably in a smaller area if carefully tiled. And, since this will be a hard macro, it has to be finished with whatever power management, test, and other logic you have planned, and it must be optimized to the floorplan of the targeted SoC.
Adapting and finishing the subsystem.
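To make the wrapping step a little more concrete, here is a minimal Python sketch of the simplest part of it: generating the named port connections that tie subsystem-side bus signals to a local naming convention. Every name in it (the signal map, the `m0_` and `soc_` prefixes, the function itself) is a hypothetical placeholder and is not taken from any ARM deliverable or from GenSys. A real bus adaptor also has to convert between protocols, which a rename like this does not attempt.

```python
# Hypothetical map from subsystem-side bus signal names to a "local" naming
# convention. Both columns are illustrative assumptions only.
LOCAL_NAMES = {
    "awaddr": "wr_addr",
    "awvalid": "wr_addr_valid",
    "awready": "wr_addr_ready",
    "wdata": "wr_data",
    "araddr": "rd_addr",
    "rdata": "rd_data",
}


def wrapper_connections(instance_prefix, local_prefix, names=LOCAL_NAMES):
    """Return Verilog-style named port connections, one per line.

    instance_prefix -- prefix of the ports on the wrapped subsystem instance
    local_prefix    -- prefix of the nets exposed to the SoC integrators
    """
    lines = [
        f".{instance_prefix}{sub_sig}({local_prefix}{local_sig})"
        for sub_sig, local_sig in names.items()
    ]
    # Join with a comma after every connection except the last one.
    return ",\n".join(lines)


# Example: text that could be pasted into a wrapper's instantiation.
print(wrapper_connections("m0_", "soc_"))
```

Keeping the map as data rather than hand-editing the wrapper means the same subsystem can be re-wrapped for a different local convention by swapping one table.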
So far, so good. But this adaptation, PPAR tuning, finishing, and implementation takes a lot of expertise in tools, in ARM IP, in protocols and subsystems, and also in your local protocols and technology needs. And, since many of your current products use ARM subsystems, it makes sense to form a central design team responsible for making that expertise available to all consumers of these hard macros. But there's a catch. The end-product needs are similar, but not identical. Each product must be as competitive as it possibly can be in its respective market. That drives a different functional and PPAR profile for each subsystem usage. A very simple and obvious example is the detailed pinout and the consequent constraints on the layout. What works well for one SoC may need to be changed significantly for another. So now your central team needs to pump out optimized hard macros at a very fast rate to keep up with your product launch rate.
How can this be achieved? To understand this, you first need to understand the transformations required in the adaptation-tuning-finishing-hardening process as follows:
Choosing subsystem vendor options (e.g., bus configuration)
Changing cache and FIFO sizes
Replacing inferred logic with faster instantiated gates in critical paths
Adding pipelining for further performance and adding custom IP/logic (e.g., for local power management)
Wrapping the subsystem with all the goodies you need to interface to local requirements
Replacing generic memories with wrapped technology-specific memories
Inserting and connecting memory BIST and repair controllers
Inserting IEEE 1500 logic and adding test-mux logic
Restructuring RTL hierarchy for power and voltage domains
Restructuring to optimize tiling
Changing direct connections to use feed-throughs
Adding yet more pipeline registers to correct timing issues and -- in some cases -- cloning logic to reduce congestion. Here you are obviously iterating with P&R trials and eventual layout, and the changes you make are highly sensitive to the target pinout if you want to optimize timing.
Optimizing layout through RTL partitioning and optimized feed-throughs.
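One of the transformations listed above, changing direct connections to use feed-throughs, is easy to illustrate in isolation. The Python sketch below assumes a deliberately simplified model that is not from the original flow: tiles sit in a single row, and a connection is just a (source, destination, net) tuple. Each connection between non-adjacent tiles is split into hops through the tiles in between, so that every hop joins abutting tiles. The tile and net names are made up for the example.

```python
def insert_feedthroughs(tile_order, connections):
    """Split connections between non-adjacent tiles into tile-to-tile hops.

    tile_order  -- tile names in physical order (assumed to form a 1-D row)
    connections -- list of (source_tile, dest_tile, net_name) tuples
    Returns a list of (from_tile, to_tile, net_name) hops, each between
    neighbouring tiles; intermediate tiles act as feed-throughs.
    """
    position = {tile: i for i, tile in enumerate(tile_order)}
    hops = []
    for src, dst, net in connections:
        a, b = position[src], position[dst]
        step = 1 if b > a else -1
        # Walk from the source towards the destination one tile at a time.
        for i in range(a, b, step):
            hops.append((tile_order[i], tile_order[i + step], net))
    return hops


# Hypothetical three-tile row: the middle tile carries the net through.
tiles = ["cpu0", "l2", "bus_if"]
print(insert_feedthroughs(tiles, [("cpu0", "bus_if", "irq")]))
```

A production flow would work on a 2-D floorplan and would also add the new ports and pipeline registers to the RTL of each intermediate tile; the sketch only shows how the routing decision follows from the tiling.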
This process requires a lot of steps, and it is a large part of why each instance would normally take several months at the very least. But if you could automate these transformations, the biggest part of the build could be collapsed and, per macro, you would be limited only by the verification and synthesis/place and route cycle.
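As a rough picture of what automating these transformations could look like, the Python sketch below treats each transformation as a reusable step and each target SoC as a recipe, that is, an ordered list of steps applied to the same base subsystem. This is only an illustration of the idea under stated assumptions: the `Design` record, the step functions, the recipe names, and the parameter values are all hypothetical, and none of it represents the GenSys interface.

```python
from dataclasses import dataclass, field


@dataclass
class Design:
    """Toy stand-in for a subsystem under transformation (hypothetical)."""
    name: str
    params: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


def set_cache_size(kb):
    """Step factory: change the L2 cache size."""
    def step(design):
        design.params["l2_cache_kb"] = kb
        design.log.append(f"set L2 cache to {kb} KB")
        return design
    return step


def swap_memories(library):
    """Step factory: replace generic memories with wrapped library memories."""
    def step(design):
        design.params["memory_library"] = library
        design.log.append(f"replaced generic memories with {library} wrappers")
        return design
    return step


def insert_mbist():
    """Step factory: insert memory BIST and repair controllers."""
    def step(design):
        design.params["mbist"] = True
        design.log.append("inserted memory BIST and repair controllers")
        return design
    return step


def build(base_name, recipe):
    """Apply every step of a recipe, in order, to a fresh base design."""
    design = Design(base_name)
    for step in recipe:
        design = step(design)
    return design


# One recipe per target SoC; the values are made up for illustration.
RECIPES = {
    "soc_a": [set_cache_size(512), swap_memories("lib_x"), insert_mbist()],
    "soc_b": [set_cache_size(1024), swap_memories("lib_y"), insert_mbist()],
}

for soc, recipe in RECIPES.items():
    result = build("cpu_subsystem", recipe)
    print(soc, result.params)
```

The point of the structure is that a new product variant costs a new recipe rather than a new round of hand edits, which is what would leave verification and synthesis/place and route as the limiting steps.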
What does this mean for the design task? Well, let's see, shall we...