Vector processing: Finally, high-performance software-defined radio
Dave Kelf, Phil Moorby, Sigmatix, Inc.
9/7/2010 1:39 PM EDT
This situation is exacerbated by the evolution in wireless standards. Today we are looking at CDMA-derivative protocols at around 2Mbps, and HSPA stretching to 10Mbps or more. The 4G standards promise even greater throughput, at approximately 40Mbps for WiMAX and 100Mbps for the LTE Category 3 specification. Despite these extraordinary throughput requirements, handset providers are demanding baseband processing devices that consume, on average, less than 1W of power for the RF circuitry, Physical Layer processing and Protocol Stack.
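To put these figures side by side, the short calculation below divides the power budget by each quoted data rate to give the energy available per delivered bit. It is only a rough sketch: it assumes the full 1W ceiling is spent at the peak rate, and the function name and dictionary are illustrative rather than taken from any tool.

```python
# Energy available per delivered bit, using the figures quoted in the text.
POWER_BUDGET_W = 1.0  # assumed: the whole 1 W ceiling for RF, PHY and stack

# Approximate data rates quoted above, in bits per second.
RATES_BPS = {
    "CDMA derivative": 2e6,
    "HSPA": 10e6,
    "WiMAX": 40e6,
    "LTE Category 3": 100e6,
}

def energy_per_bit_nj(power_w, rate_bps):
    """Return the energy available per bit, in nanojoules."""
    return power_w / rate_bps * 1e9

for name, rate in RATES_BPS.items():
    print(f"{name}: {energy_per_bit_nj(POWER_BUDGET_W, rate):.0f} nJ/bit")
```

Under these assumptions the budget shrinks from roughly 500 nJ per bit for a 2Mbps protocol to roughly 10 nJ per bit for LTE Category 3, which is the gap the rest of this article is concerned with.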
However, two forces in the wireless industry are converging to drive practical SDR implementations. Vector processing architectures are now being put forward targeted at wireless baseband usage, with specialized instruction sets, streamlined storage access mechanisms, and most importantly multiple forms of parallelism. These devices hold the promise of a throughput and energy efficiency trade-off far greater than previous architectures, bringing commercial grade handset SDR within shouting distance of custom hardware.
These processors are arriving just as the advantages of an SDR implementation are becoming compelling: the flexibility required for seamless multimode operation (multiple protocols on a single device), including inter-protocol handovers; cognitive radio techniques that cope with shifting frequencies or unusual interference scenarios; and the business opportunity of a faster, less risky implementation option that can be adapted late in the design process to changing environments and evolving wireless standards. If these requirements can be realized together with handset-level performance for new standards, SDR will achieve a firm position in the industry.
The major barrier still in the way of new SDR is the programmability of vector processors. Exploiting multiple levels of parallelism, and making every part of the parallel architecture and storage access count on every clock cycle, is a highly complex software engineering problem that does not lend itself well to automation. The hand-crafted assembly code often required to get the best from these devices simply takes too long to write. It also fails to provide the inter-processor portability needed to adopt new technologies as they become available and to ensure that the wireless system provider is not trapped with one processor architecture from a single supplier. New methods for delivering SDR implementations that circumvent this issue are critical.
Multimode vector radio software on vector processors
It is well known that the advancement in chip fabrication technology has slowed, driven mainly by the significant costs of constructing a state-of-the-art fabrication facility. This has forced processor vendors to look at other ways to preserve their performance roadmap curves, the clear favorite of which is the increase in parallel structures in processor architectures. Multicore processing has been hyped as one answer to this growing requirement and it is useful for SDR, but it is actually finer grain parallelism that holds the key for the energy efficiency required for 4G wireless.
Single Instruction, Multiple Data (SIMD) datapaths provide a flexible number of parallel lanes, nowadays 32 or 64 lanes each 16 bits wide, that manipulate large quantities of data on every clock cycle. Combine this with Very Long Instruction Word (VLIW) pipelines, where multiple operations may be executed on every clock cycle, and the engineer has at their disposal a processing matrix which can be packed onto silicon much more efficiently than a simple multiplication of processor cores.
We would propose a software implementation, known as Multimode Vector Radio, or MVR, that makes use of a four dimensional parallel programming model, as follows (See Figure 1):
1. Data Parallelism using Single Instruction Multiple Data (SIMD) pipeline architectures, where a wide pipeline is broken up into individual lanes which execute separate data words, but using the same instruction.
2. Instruction Parallelism based on Very Long Instruction Word (VLIW) multistage pipelines, where in any one clock cycle a number of instructions are executed concurrently.
3. Homogeneous Multicore, where a number of the pipeline cores are run in parallel, each one executing a different process, or part of the same process.
4. Heterogeneous Multicore, where the group of homogeneous cores are combined with parallel processors of different types, for example scalar processors and co-processing accelerators, and each element is leveraged on processing tasks that suit their architecture.
In addition, the operation of an MVR baseband is very storage-centric: memory writes in particular are minimized, and once a carefully sized data block has been loaded into memory, with subsections held in processor registers, as many operations as possible are performed on it using careful execution flow control.

Figure 1 - MVR Four Dimensional Model
Now although these architectural elements will be familiar to the embedded software engineer, it is their combined use which provides the MVR model performance potential. The basic idea of MVR is to consider data blocks as vector patterns which may be processed together using the different parallel means available.
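As a rough illustration of the data-parallel dimension, the sketch below models a SIMD multiply-accumulate in plain Python and uses it to process a data block one lane-wide vector at a time. The lane count of 32 comes from the text; the saturating 16-bit arithmetic and the names `simd_mac` and `scale_block` are assumptions made for the example, not features of any particular processor.

```python
LANES = 32  # assumed lane count; the text mentions 32 or 64 16-bit lanes

def saturate16(x):
    """Clamp to the signed 16-bit range (assumed fixed-point lane behaviour)."""
    return max(-32768, min(32767, x))

def simd_mac(acc, a, b):
    """One modelled SIMD instruction: acc[i] + a[i] * b[i] in every lane."""
    assert len(acc) == len(a) == len(b) == LANES
    return [saturate16(acc[i] + a[i] * b[i]) for i in range(LANES)]

def scale_block(samples, gain):
    """Treat a data block as a sequence of lane-wide vectors and scale each one."""
    assert len(samples) % LANES == 0
    out = []
    for start in range(0, len(samples), LANES):
        vec = samples[start:start + LANES]
        out.extend(simd_mac([0] * LANES, vec, [gain] * LANES))
    return out
```

A 64-sample block is handled here in two modelled instructions rather than 64 scalar ones, which is the sense in which a data block becomes a "vector pattern".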
SIMD instructions represent possibly the most energy efficient resource in a modern processor due to the construction of the data path elements, and it is by the prudent use of SIMD lanes that the specific DSP style operations may be regulated to maximize performance.
The VLIW pipeline appears as a horizontal parallel overlay on the vertically oriented SIMD paths. With careful instruction ordering and register allocation, the vertical pipelines are kept fully loaded, even as the operational loops are switched over, and phases such as a pipeline preamble are managed to minimize downtime.
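The instruction-ordering problem can be pictured with a toy scheduler that packs operations into VLIW bundles. This is a deliberately simplified sketch: it assumes every operation completes in one cycle and that any operation can use any issue slot, neither of which holds on real hardware, and the operation names and slot width are hypothetical.

```python
def pack_bundles(ops, deps, width):
    """Greedily pack operations into VLIW bundles.

    ops   -- operation names in program order
    deps  -- dict mapping an op to the set of ops whose results it needs
    width -- number of issue slots per bundle (hypothetical)
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        # An op is ready once everything it depends on issued in an earlier bundle.
        ready = [op for op in remaining if deps.get(op, set()) <= done]
        if not ready:
            raise ValueError("dependency cycle")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
    return bundles

# Hypothetical multiply-add kernel: store(load_a * load_b + load_c)
OPS = ["load_a", "load_b", "mul", "load_c", "add", "store"]
DEPS = {"mul": {"load_a", "load_b"}, "add": {"mul", "load_c"}, "store": {"add"}}
```

With three issue slots, `pack_bundles(OPS, DEPS, 3)` issues all three loads in the first bundle and needs four bundles in total, against six on a single-issue machine. The remaining empty slots are what software pipelining across loop iterations would try to fill.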
Homogeneous multicore processors expand the number of operations that may be executed in parallel, and by load balancing across the cores, downtime is again minimized. Much has been written about the use of homogeneous multicore architectures, most of which is very relevant to baseband processing.
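One common load-balancing heuristic, shown here only as an illustration and not as the method used in MVR, is to hand out tasks longest first, each to whichever core is currently least loaded. The task names and cycle counts below are invented for the example.

```python
import heapq

def balance(task_cycles, n_cores):
    """Assign tasks to cores, longest task first, always to the least-loaded core."""
    heap = [(0, core) for core in range(n_cores)]  # entries are (load, core index)
    heapq.heapify(heap)
    assignment = {core: [] for core in range(n_cores)}
    for name, cycles in sorted(task_cycles.items(), key=lambda kv: -kv[1]):
        load, core = heapq.heappop(heap)
        assignment[core].append(name)
        heapq.heappush(heap, (load + cycles, core))
    loads = {core: sum(task_cycles[t] for t in tasks)
             for core, tasks in assignment.items()}
    return assignment, loads

# Hypothetical per-block cycle counts for a receive chain.
TASKS = {"fft": 700, "channel_est": 400, "equalize": 300,
         "demap": 200, "descramble": 100}
```

For these figures `balance(TASKS, 2)` leaves one core with 900 cycles of work and the other with 800, close to the ideal split of 850 each.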
In heterogeneous processor platforms, typical of vector DSP processors, it is important to balance the baseband algorithms across the different types of processing capabilities. Extensive testing and measurement has shown that not all baseband algorithms in a modern wireless standard, such as LTE, are suitable for vector processors. For example, error correcting methods, such as the Turbo and Viterbi algorithms, are best suited to the vector cores, whereas demodulation and scrambling fit more effectively on scalar processors. Balancing these algorithms across heterogeneous architectures requires a high degree of optimization.
Vector processors that target baseband applications often make use of specialized instructions that enable common operations to be executed in a smaller number of clock cycles. For example, a "shuffle network", which allows the rapid movement of data characteristic of various algorithm components such as the butterfly FFT network or error correction matrices, can shave an order of magnitude off the number of clock cycles for very specific operations, at the expense of silicon real estate.
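The role of a shuffle can be sketched by modelling it as a lane permutation and using it to build butterfly stages. To keep the example short the twiddle factors of a real FFT are omitted, so what the code computes is a Walsh-Hadamard transform, which shares the FFT's butterfly data movement. The function names are illustrative.

```python
def shuffle(vec, perm):
    """Model a shuffle instruction: lane i receives vec[perm[i]]."""
    return [vec[p] for p in perm]

def butterfly_stage(vec, stride):
    """One butterfly stage: each lane pairs with the lane `stride` away."""
    n = len(vec)
    partner = shuffle(vec, [i ^ stride for i in range(n)])
    # Lower lane of each pair takes the sum, upper lane takes the difference.
    return [vec[i] + partner[i] if i & stride == 0 else partner[i] - vec[i]
            for i in range(n)]

def hadamard(vec):
    """Run log2(n) butterfly stages (len(vec) must be a power of two)."""
    stride = 1
    while stride < len(vec):
        vec = butterfly_stage(vec, stride)
        stride *= 2
    return vec
```

Each stage costs one modelled shuffle plus one lane-wise add or subtract. Without a shuffle instruction, the same partner exchange would have to go through memory or a series of lane-by-lane moves.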
Effective memory usage is a key element of energy-efficient programming, given the power usage associated with every memory access. Indeed, the entire MVR programming model revolves around a pattern of data usage in which as many operations as possible are performed on a single data element between storage accesses.
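The effect is easy to see by counting accesses. The sketch below applies the same chain of operations to a block in two ways: one pass over memory per operation, and a single fused pass that keeps each element in a register until every operation has been applied. The `CountingMemory` class is a stand-in invented for the example, not a model of any real memory system.

```python
class CountingMemory:
    """A list-backed memory that counts every load and store."""
    def __init__(self, data):
        self.data = list(data)
        self.loads = 0
        self.stores = 0

    def load(self, i):
        self.loads += 1
        return self.data[i]

    def store(self, i, value):
        self.stores += 1
        self.data[i] = value

def separate_passes(mem, ops):
    """Apply each operation in its own pass over memory."""
    for op in ops:
        for i in range(len(mem.data)):
            mem.store(i, op(mem.load(i)))

def fused_pass(mem, ops):
    """Load each element once, apply every operation, then store once."""
    for i in range(len(mem.data)):
        value = mem.load(i)
        for op in ops:
            value = op(value)
        mem.store(i, value)
```

For a 16-element block and three operations, the separate passes perform 48 loads and 48 stores while the fused pass performs 16 of each, with identical results.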
Although very sophisticated multi-level hierarchical memory architectures are prevalent on general purpose processors today, memory architectures on vector processors tend to be simple, non-coherent local memory designs. Without a global shared memory model to work with, programming becomes much more difficult and time consuming, and overcoming this constraint is one of the toughest issues.
Figure 2 shows some of the relative characteristics of the processor types and the performance levels that have been achieved on them using MVR. As can be seen, the Vector DSPs provide a very effective throughput to power ratio, not surprising given their design for this application.

Figure 2 - Comparison of Average Processor Characteristics
SDR Multimode Platform Architecture
The use of SDR leads to a significant degree of flexibility. By not being constrained into traditional baseband implementation thinking, a unique array of capabilities may be created.

Figure 3 - Example SDR 3/4G Multimode PHY Layer Platform Architecture
SDR Automation
A basic dichotomy existed, prior to MVR, in the implementation of any high performance software application. On one hand the code may be written in a fairly traditional fashion using C or C++, and a compiler used to apply it to processors of choice, leading to faster time to market, ease of coding, and processor portability at the expense of performance. Alternatively substantial sections of the code could be written using low level, processor specific assembly code, targeting efficient processor usage and leading to improved performance at the expense of portability and design ease. What is required is a method that retains the programmability and portability of a standard, abstract programming model, but still provides the performance.
For many companies already working with SDR implementations, the use of assembly code to program processors at very low levels has proven critical to achieve required performance levels. However, the use of assembly programming has two major effects:
1. Code written for one processor cannot easily be moved to another, or often future generations of the same processor, without a complete re-write. Even the use of a cross-assembler results in poor performance translation. This has a significant impact on the portability of large software components, leading to cost and competitiveness issues.
2. The creation of these assembly code programs takes a large amount of expert engineering time with a deep understanding of both the processor and the algorithms being implemented. This in turn makes the methodology both expensive and time consuming, with an impact on time-to-market.
MVR will only work commercially if the portability and programmability issues are solved. New optimization techniques are now emerging which do indeed open up the use of higher level languages for performance implementations. These optimizers leverage detailed processor information to manipulate code streams in a manner usually associated with synthesis technologies, and apply this information to parallelize and tightly map algorithms onto the processor architecture. The optimizers act as front-ends to the native processor compilers, using intrinsics to direct the compilation process to maximize efficiency. Figure 4 shows the process steps associated with this optimization flow.


Figure 4 - A Methodology for Performance and Portability
In this approach, shown in Figure 4, the protocol or baseband design engineers code their algorithm elements without consideration for processor architecture. Processor engineers drive the construction of templates which provide the raw data on the processor, and leverage parts or all of the compiler technology that comes with it as appropriate. An optimizer combines the two code bases, modifying the input code to make the best use of the target processor architecture, to produce either raw assembly code or processed C code that includes intrinsics to guide a further compilation step. The binary code is then run against a cycle-approximate model of the processor, and analysis is performed to check for common performance issues, which may result in a refinement of the code base.
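One of the simplest transformations such an optimizer might apply is strip-mining: splitting a processor-agnostic loop into lane-wide chunks plus a leftover tail, with the lane count read from the processor template. The sketch below is a loose illustration of that idea only. The template is reduced to a single hypothetical `lanes` field, and the inner generator stands in for what would be a vector intrinsic on real hardware.

```python
def strip_mine(n, lanes):
    """Split a loop of n iterations into full vector chunks and a scalar tail."""
    full = n - n % lanes
    chunks = [(start, start + lanes) for start in range(0, full, lanes)]
    tail = list(range(full, n))
    return chunks, tail

def run_elementwise(op, data, template):
    """Execute op over data using the lane count from a processor template."""
    chunks, tail = strip_mine(len(data), template["lanes"])
    out = []
    for lo, hi in chunks:
        out.extend(op(x) for x in data[lo:hi])  # stands in for one vector instruction
    out.extend(op(data[i]) for i in tail)       # leftover elements, one at a time
    return out
```

The algorithm code (`op`) never changes. Swapping `{"lanes": 32}` for `{"lanes": 64}` changes only how the loop is chunked, which is the portability property the methodology is after.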
The input language to the optimizer has a significant effect on the efficiency of the entire process. Using the C or C++ programming language with some restrictions that focus the code on vector processor architectural forms without reducing portability, together with facilities for easier baseband algorithm design, can dramatically improve programmability and performance. Similar to OpenGL for graphics design, an "Open Baseband Vector Language" would still leverage C/C++ but provide some simple restrictions as well as additional instructions enabling efficient baseband coding that runs on related processors. A significant level of business value and market size must be present to drive the instigation of such a format, and arguably this is now true of software defined radio, given the introduction of vector processor technology for this purpose from multiple companies.
Bright Future for SDR
SDR has proven valuable for military and commercial wireless baseband implementations given the increased ease of use, multimode applications, and control versatility afforded by its use. However, traditionally its low performance level versus that of custom hardware has created a barrier to its proliferation in power sensitive applications such as commercial cellular handsets. Furthermore, the lack of programmability and portability of higher performing assembly code implementations detracts from its use in more general infrastructure applications.
Multimode Vector Radio solves this problem by leveraging multiple dimensions of parallelism afforded by modern processor architectures to drive an order of magnitude performance improvement without a reduction in the positive benefits of software based devices. By leveraging a methodology that retains performance in a portable and programmable fashion, MVR could represent the future of next generation baseband design.
About the Authors
Dave Kelf is the President and Chief Executive Officer of Sigmatix, Inc. After a number of years in DSP and communications semiconductor engineering roles at Plessey and Nortel, Kelf worked in both sales and marketing at Cadence Design Systems, most recently responsible for the successful Verilog and VHDL verification product line. As vice president of Marketing at Co-Design Automation and then Synopsys, Kelf oversaw the successful introduction and growth of the SystemVerilog language, before running marketing for Novas Software (now Springsoft). Kelf holds an MSc in microelectronics and an MBA.
Phil Moorby is the Chief Technical Officer of Sigmatix, Inc. and is considered a semiconductor industry luminary. He is known as the inventor of the Verilog Hardware Description Language, the language used to implement most of the integrated circuits developed today worldwide, for which he was awarded the EDAC Kaufman Award in 2006. Moorby co-founded Gateway Design Automation, one of the companies that formed Cadence Design Systems, and was appointed Fellow at Cadence. Moorby co-founded Synapix, Inc., a company focused on advanced video stream analysis. As Chief Scientist at Co-Design Automation, Moorby was instrumental to the success of SystemVerilog. His combined background in mathematics and performance software makes him ideal to lead the Sigmatix technology vision and development programs.




Rich Krajewski
9/8/2010 2:40 AM EDT
Memristors are supposed to be able to simulate some neural processes. Hey, maybe memristors will help SDR put some cogs in cognitive radio.
jskull
9/9/2010 4:39 AM EDT
Interesting ideas; however, the perennial VLIW problems of low code density and high instruction bandwidth will need to be resolved.
Bob Lacovara
9/9/2010 8:38 AM EDT
This is a nice summary of the state of affairs in SDR. I am perfectly willing to believe that highly-optimized designs require highly-expert engineers; this is only fair, and sometimes we call this "job security". But it does tend to limit adoption of the technology to users who care far more about performance than about cost, e.g., the military. And it is for the military that the most obvious uses of an SDR's extreme versatility appear.
Other applications do exist. The other day, while looking for an ultrasonic receiver, I was thinking: I need to cover 500 kHz to 20 MHz, I need decent agility, decent out-of-band rejection, I need sensitivity, I need I and Q outputs... hey, I need a shortwave radio, no: I need an SDR. So I'm going to get one that I can play with in the lab, and there are several out there that I can get inside of. SDR will have plenty of applications that don't need bleeding edge performance; it's only a matter of time.
Rchandta1
9/10/2010 1:46 PM EDT
How does the software approach compare with FPGA in SDR? An FPGA is reconfigurable, much like software, and they are now big enough to accommodate a large amount of logic.
This article mentions 0.5W to 1W of power consumption for commercial processors. I doubt that it includes I/Os, external memory, etc.