United Business Media EE Times



Cover Story

Implementing a DSP in Programmable Logic

The numerous advantages of PLDs allowed designers at Kaytronics to create an easily modifiable DSP for system-on-a-chip designs.

by Martin Langhammer



As programmable logic devices increase in size and feature sets, engineers are assessing their capability to implement system-on-a-chip (SOC) designs. At the heart of most SOC designs is a processor, and accordingly, system designers would like to know if a useful processor can be implemented in a PLD.

Programmable logic offers several advantages over ASICs when designing with embedded processors. With ASICs, hardware and software codesign is difficult, and even for simple designs, the majority of engineering time is spent on verification. PLDs make true hardware and software codesign possible, as the hardware and software become one. The hardware architecture can be as easily modified as a line of code, even after the programmable logic device is on the board. Another advantage is the ability to add application-specific blocks to the processor, connected through the I/O ports.

A design team at Kaytronics recently developed and implemented a standard 16-bit fixed-point digital signal processor for a look-up table-based PLD. To be considered successful, the processor had to run at a reasonable speed and closely follow a well-known architecture. Perhaps most importantly, it had to occupy a minority portion of a single device's resources, since SOC designs require many more components than a single processor.

Our successful implementation, designed for Altera's Flex 10K family of PLDs, indicates that high-density programmable logic is increasingly viable for SOC designs. But because today's programmable devices have unique architectures, a PLD processor design should be built from the ground up with the target architecture in mind, to fully take advantage of all its features.

With our design, the user can specify certain processor features and modifications without requiring access to the source code. Using this method, designers can create semicustom processors quickly for any application.

Processor characteristics

We chose Texas Instruments' TMS320C25 16-bit fixed-point DSP as the architecture. We implemented almost all the features of the 'C25; however, some weren't required because of PLD features, and others were left out because they would have limited the processor's performance. In addition, we made improvements to the addressing scheme. For example, in the PLD-based DSP, auxiliary registers, used mostly for relative memory accesses, can be directly called as part of an instruction, rather than requiring an auxiliary register pointer to be set first. The modification reduces program sizes and improves execution times.

The resulting PLD-based DSP featured an instruction set that was similar to that of the TMS320C25, with almost identical mnemonics. Similar instructions operate almost identically: all instructions execute in a single cycle, save for branch instructions, which require two cycles. Each instruction also takes a single 16-bit word, although branch addresses occupy the word following the branch op code. Source code is very easy to port between the two processors, but the object code for each is completely different.

The DSP requires only 12 percent of the logic resources of the largest programmable logic device available at the time. As a result, the DSP can fit into smaller devices with enough resources left to create custom peripherals and execution units.

We built the processor design from several high-performance, area-efficient arithmetic functions, including parameterized multipliers, that were developed for the target programmable logic architecture. To create a single-cycle multiplier-accumulator, we used a multiplier corresponding to the Library of Parameterized Modules (LPM) standard. The MAC is one of the largest components of the processor, consuming over a third of the whole DSP design's resources.

Three interrupts are vectored separately to addresses at the top of program memory. Jump instructions are then used to transfer control to the interrupt service routines. Several cycles may occur from the activation of the interrupt input until interrupt processing begins, since the DSP will complete executing any current instructions, including branches. Once an interrupt process begins, the interrupt that initiated it will be disabled, until a return from interrupt (RTI) is executed. The two other interrupts, however, will remain enabled, so that multiple interrupt service routines may be nested.
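The masking behavior described above can be sketched as a small Python model. This is only an illustration, not the actual control logic: the class and method names are hypothetical, and the model assumes that a request arriving while its interrupt is disabled is simply rejected, which the article does not specify.

```python
class InterruptController:
    """Toy model of three separately maskable interrupts that may nest."""

    def __init__(self, count=3):
        self.enabled = [True] * count  # one enable flag per interrupt
        self.active = []               # stack of interrupts being serviced

    def request(self, irq):
        """Return True if the request is accepted and servicing starts."""
        if not self.enabled[irq]:
            return False               # assumption: masked requests are rejected
        self.enabled[irq] = False      # the initiating interrupt is disabled
        self.active.append(irq)        # the other interrupts stay enabled
        return True

    def rti(self):
        """Return from interrupt: re-enable the most recently serviced one."""
        irq = self.active.pop()
        self.enabled[irq] = True
        return irq
```

In this model, a second request on interrupt 0 is refused while its service routine runs, but a request on interrupt 2 is accepted and nests on top of it until the matching RTI.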

Pure Harvard

The processor uses a pure Harvard architecture rather than the modified Harvard architecture of the TMS320C25 (see Figure 1). In a pure Harvard architecture, the data and program spaces are completely separate. In an ASIC or standard product, one of the benefits of the modified Harvard architecture is that it allows data to be passed between the two spaces, such as performed by the TBLR (Table Read) and TBLW (Table Write) instructions. In PLDs, this feature is not as important, because the data space memory can be preloaded with tables during device configuration if it's implemented in the programmable device.

The data memory is the same size as on the TMS320C25: 256 × 16 bits. Because the auxiliary registers can be loaded from locations indexed by other auxiliary registers, they can contain addresses that are up to 16 bits wide, the same width as the data memory. Thus the user can easily modify the processor to have data memory spaces that are considerably larger and contained in a single contiguous block, although the direct addressing range supported by the address field in an instruction word is 12 bits. The program memory is in a 1K × 16-bit block and is addressed by the 10-bit program counter, which, again, the user can easily enlarge for access to much larger memories. Because the critical path of the processor is in the data execution space, the user can easily move the program memory (or just a part of it) off-chip without affecting performance. An eight-level hardware stack stores return addresses.
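The relationship between memory size and address width in this paragraph is simple arithmetic, worked through below in Python. The helper names are illustrative only; the sizes and field widths are the ones quoted above.

```python
def address_bits(words):
    """Number of address bits needed to index `words` locations."""
    return (words - 1).bit_length()


def reachable_words(bits):
    """Number of locations an address of `bits` bits can select."""
    return 1 << bits


print(address_bits(256))    # data memory, 256 words       -> 8
print(address_bits(1024))   # program memory, 1K words     -> 10
print(reachable_words(12))  # 12-bit direct address field  -> 4096
print(reachable_words(16))  # 16-bit auxiliary register    -> 65536
```

The default memories therefore use only a fraction of what the 12-bit direct field and the 16-bit auxiliary registers can reach, which is the headroom the paragraph refers to.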

The architectures of the programmable logic DSP and the TMS320C25 are almost identical. The register naming conventions and the datapath flow are the same, and the ALU is used in the same manner. The main difference is in the way that the auxiliary registers (AR0 through AR5) are accessed and updated. Also, the program and data buses are completely separate.

We added and mapped I/O ports to the very top of the data space. There are two word-wide ports, one all inputs and one all outputs. The user can easily add additional ports, with the number of ports controlled by an external parameter. The user can then customize the processor without changing its internal architecture. Since the ports operate at the instruction rate, the user can add additional computational units, with their input(s) connected to the output ports of the processor and their output(s) to the input ports of the DSP.
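One way to picture ports mapped into the top of the data space is the Python model below. It is a sketch under stated assumptions: the article does not give the port addresses, so placing the output port at the highest address and the input port just below it is hypothetical, as are all the names.

```python
class DataSpace:
    """Toy data space with two word-wide ports at its top addresses."""

    def __init__(self, size=256):
        self.ram = [0] * size
        self.in_addr = size - 2   # hypothetical address of the input port
        self.out_addr = size - 1  # hypothetical address of the output port
        self.port_in = 0          # driven by external logic
        self.port_out = 0         # observed by external logic

    def read(self, addr):
        # A read of the input-port address returns the pins, not RAM.
        if addr == self.in_addr:
            return self.port_in
        return self.ram[addr]

    def write(self, addr, value):
        value &= 0xFFFF           # 16-bit words
        if addr == self.out_addr:
            self.port_out = value
        else:
            self.ram[addr] = value
```

An added computational unit would then be modeled as a function that reads `port_out` and drives `port_in`, so the processor reaches it with ordinary data-space loads and stores.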

The DSP operates at 15 MIPS, which is comparable to the standard TMS320C25. The system clock is twice the pipeline clock to correctly generate memory control signals.

Instruction set

We implemented a total of 97 instructions, grouped by general function, including addressing modes. All of the computational instructions were implemented, except for a few ALU instructions.

We wrote a simple assembler (running under Windows 95) for the processor, since its object code is incompatible with that of the TMS320C25. The assembler supports labels, declarations, radixes, and data table entry. It produces two output files: one for the program space and one to configure tables in the data space. The two memory initialization files are included within the processor structure when it is compiled.
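To show how label support typically works in such a tool, here is a minimal two-pass assembler sketch in Python. It is not the Kaytronics assembler: the opcode values and the 4-bit opcode plus 12-bit address layout are hypothetical, and only the mnemonics, the 12-bit direct address field, and the rule that a branch address occupies the following word come from the article.

```python
OPCODES = {"LAC": 0x1, "ADD": 0x2, "SACL": 0x3, "B": 0xF}  # hypothetical values
BRANCHES = {"B"}  # branches take a second word holding the target address


def assemble(source):
    """Return a list of 16-bit program words for `source`."""
    labels, lines, pc = {}, [], 0
    # Pass 1: strip comments, record label addresses, size each instruction.
    for raw in source.splitlines():
        text = raw.split(";")[0].strip()
        if not text:
            continue
        if text.endswith(":"):
            labels[text[:-1]] = pc
            continue
        mnemonic, _, operand = text.partition(" ")
        mnemonic = mnemonic.upper()
        lines.append((mnemonic, operand.strip()))
        pc += 2 if mnemonic in BRANCHES else 1
    # Pass 2: emit words now that every label address is known.
    words = []
    for mnemonic, operand in lines:
        op = OPCODES[mnemonic]
        if mnemonic in BRANCHES:
            words.append(op << 12)
            words.append(labels[operand])
        else:
            # int(..., 0) accepts decimal, 0x hex, and 0b binary radixes.
            words.append((op << 12) | (int(operand, 0) & 0xFFF))
    return words


program = """
start:
    LAC 0x10      ; load accumulator
    ADD 0x11
    SACL 0x12     ; store result
    B start
"""
print([hex(w) for w in assemble(program)])
```

The first pass is what allows forward references: a branch to a label defined later in the file resolves correctly because all addresses are fixed before any word is emitted.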

Most of the design challenges stemmed from the architectural relationship of logic to routing found on today's PLD architectures. In standard-cell devices, until the advent of deep-submicron (DSM) processes, a combinational path delay was largely due to the logic; in DSM, logic and routing delays are of the same magnitude. In PLDs, however, the routing delays are an order of magnitude greater than the logic delays, and registered system performance is limited by the routing delay between logic resources. In fact, for that reason, such features as data moves and some ALU instructions would have limited the performance of the design to the point where we decided not to include them.

The processor design was targeted for the Altera Flex 10K architecture because of the characteristics of its memory structures (called embedded array blocks, or EABs) and its routing structure. The family also offered the largest programmable logic device available.

The ALU

The performance of the processor was determined by the number of levels of logic required to take advantage of the device's features. One of the critical paths was the ALU, especially in the feedback path for the accumulator, largely because of the multiple wide buses (32 bits) that feed the ALU. Feedback paths (used for accumulation and the absolute value function) and feed-forward paths (used for logic operations, accumulating the multiplier result, and loading the accumulator through a barrel shifter) all had to be combined at the ALU. The routing of a large number of wide buses resulted in a deep structure, with several layers of logic and routing immediately preceding the ALU. The number of ALU instructions was constrained when the delay from data memory to accumulator became the same as the delay from data memory to multiplier.

The three critical paths, all in the data space, are:

  • Path 1: ARx to data memory address, data memory data to barrel shifter, barrel shifter to ALU
  • Path 2: ARx to data memory address, data memory to multiplier input, multiplier combinational delay to P register
  • Path 3: accumulator feedback path

As we balanced these paths to offer roughly the same delay, we needed more levels of logic to implement the ALU functions than we did to implement the multiplier.

To minimize the levels of logic (and more importantly, the levels of routing), we used a programmable logic-specific multiplier. In the typical ASIC multiplier, partial products are summed by a Wallace tree, and the number of levels required is approximately log_1.5(multiplier bit-width / 2). In contrast, a binary-tree partial-products addition requires only log_2(multiplier bit-width) levels and was therefore faster and more resource-efficient.
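The two level counts can be compared numerically, and the binary-tree summation itself is easy to model. The Python sketch below is an unsigned, bit-level illustration only: the function names are hypothetical, and rounding the "approximately" in the Wallace-tree estimate up to a whole level is an assumption.

```python
import math


def wallace_levels(bits):
    """Level estimate quoted in the text: log base 1.5 of (bits / 2)."""
    return math.ceil(math.log(bits / 2, 1.5))  # rounding up is an assumption


def binary_tree_levels(bits):
    """Level count quoted in the text: log base 2 of bits."""
    return math.ceil(math.log2(bits))


def binary_tree_multiply(a, b, bits=16):
    """Multiply unsigned values by summing partial products pairwise."""
    partials = [(a << i) if (b >> i) & 1 else 0 for i in range(bits)]
    levels = 0
    while len(partials) > 1:
        if len(partials) % 2:
            partials.append(0)  # pad an odd row so every adder has two inputs
        partials = [partials[i] + partials[i + 1]
                    for i in range(0, len(partials), 2)]
        levels += 1
    return partials[0], levels


print(wallace_levels(16), binary_tree_levels(16))  # 6 4
print(binary_tree_multiply(1234, 5678))            # (7006652, 4)
```

For a 16-bit multiplier the binary tree needs four adder levels, each of which halves the number of partial sums.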

Memory resources

In the Flex 10K devices used for the design, the EABs are 256 × 8-bit SRAM blocks. For the processor, the EABs are configured as ROM for the program space and asynchronous SRAM, with separate data input and output buses, for the data space. For both the data and program spaces, two EABs are required for every 256 words of storage. The larger Flex 10K devices (the EPF10K50 through EPF10K250) contain 10 to 20 EABs that allow for an expansion in data memory, especially if the program memory is moved off-chip. Since the SRAM-based devices we used are reconfigured every time they are powered up, tables can be loaded into the data space RAM upon device start-up. This feature allows us to use a pure Harvard architecture, and it also makes program memory use more efficient, since data tables can be stored externally in the configuration EPROM rather than on-chip in program memory.
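The EAB budget follows directly from the block geometry. The short Python calculation below assumes every EAB is used in the 256 × 8 configuration described above; the function name is illustrative.

```python
import math


def eabs_needed(words, width=16, eab_depth=256, eab_width=8):
    """EABs required for a `words` x `width` memory built from 256 x 8 blocks."""
    return math.ceil(words / eab_depth) * math.ceil(width / eab_width)


data_eabs = eabs_needed(256)      # 256 x 16 data memory
program_eabs = eabs_needed(1024)  # 1K x 16 program memory
print(data_eabs, program_eabs, data_eabs + program_eabs)  # 2 8 10
```

With these sizes the program memory accounts for most of the blocks, which is why moving it off-chip frees so much room for data memory.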

The separate data input and output buses make it straightforward to implement DMOV (Data Move) and similar instructions to facilitate more efficient convolutions. However, the additional levels of logic required to support such features were prohibitive in terms of performance.

Virtually all the single-operation instructions were included in the PLD processor. As explained previously, some ALU instructions were left out in the interests of processor performance. Data moves, from data space to data space, were also not included, as the logic to control them would be in the critical paths in and out of data memory and would greatly affect processor performance. A trade-off was made: requiring two instructions for a few operations resulted in a 15-MIPS processor. The alternative is an 8-MIPS processor executing all operations in a single cycle.
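A rough way to check this trade-off is to compare effective operation rates. In the Python sketch below, `two_instr_fraction` is a hypothetical workload parameter (the share of operations that need two instructions on the 15-MIPS design); the article gives no figure for it, so the 10 percent value is purely illustrative.

```python
def effective_mops(mips, two_instr_fraction):
    """Millions of operations per second when some operations cost two instructions."""
    return mips / (1.0 + two_instr_fraction)


fast = effective_mops(15, 0.10)  # illustrative: 10% of operations need two instructions
slow = effective_mops(8, 0.0)    # every operation is a single instruction
print(round(fast, 2), slow)      # 13.64 8.0

# Break-even point: 15 / (1 + f) = 8 gives f = 0.875
print(effective_mops(15, 0.875))  # 8.0
```

Under this simple model the 15-MIPS design stays ahead unless nearly 90 percent of all operations are ones that need the extra instruction.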

Advantages of programmable logic implementation

Among the various programmable logic architectures, we found the Flex 10K family of devices to be well-suited to processor design, owing to the large amounts of continuous routing and memory in the form of EABs. The continuous routing resources made implementing large feed-forward buses (required for moving data to and from the memories) possible; further, their use resulted in deterministic delays. EABs, which offer contiguous memory storage blocks, allow for both sizable program and data memories to be implemented on-chip, again with deterministic performance. Both features also allow designers to modify the processor, add peripherals, or move the design to a differently sized device while still maintaining the same performance.

In addition, with programmable logic, data and program memory size, number of stack levels, and number of I/O ports can be set with parameters when instantiating the DSP into a top-level design. Thus, as noted, designers can create semicustom processors quickly for any application.

Figure 1: DSP core for programmable logic designs. The Kaytronics processor uses a pure Harvard architecture rather than the modified Harvard architecture of the TMS320C25 16-bit fixed-point DSP on which it is based.

Another advantage of programmable logic, also noted earlier, is the ability to add application-specific blocks to the processor, connected through the I/O ports. For example, in the case of Reed-Solomon coding, a single-cycle finite-field multiplier can be added to the processor in this way. In the case of an 8-bit field, this addition can save dozens of operations over software implementation of the multiplier using polynomial expansion and reduction techniques, or save substantial amounts of memory over a log/antilog multiplication method. Adding this small functional block can increase the performance of a Reed-Solomon encoder or decoder by several orders of magnitude, making this DSP equivalent in performance to current leading-edge processors for this application.
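To make the software cost concrete, the Python sketch below shows the two software methods mentioned for an 8-bit field: shift-and-reduce polynomial multiplication and log/antilog table lookup. The field polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is an assumption chosen because it is common in Reed-Solomon codes; the article does not say which polynomial was used, and the function names are illustrative.

```python
PRIM_POLY = 0x11D  # assumed field polynomial: x^8 + x^4 + x^3 + x^2 + 1


def gf_mul(a, b):
    """Multiply in GF(2^8) by polynomial expansion and reduction."""
    result = 0
    for _ in range(8):         # one iteration per bit of b
        if b & 1:
            result ^= a        # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY     # reduce back into 8 bits
    return result


def build_tables():
    """Build antilog (exp) and log tables for the generator element 2."""
    exp = [0] * 510            # doubled so lookups need no modulo
    log = [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x = gf_mul(x, 2)
    for i in range(255, 510):
        exp[i] = exp[i - 255]
    return exp, log


EXP, LOG = build_tables()


def gf_mul_table(a, b):
    """Multiply in GF(2^8) using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]
```

The loop version costs several operations per bit, and the table version costs hundreds of words of memory; a dedicated single-cycle multiplier attached through the I/O ports avoids both.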

A FIR filter, or correlator, is another function-specific block that can be added to produce a new output every cycle. This addition would offload simple, repetitive tasks from the processor, increasing overall system performance.
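The behavior of such a block can be modeled in a few lines of Python. This is a functional sketch only, with hypothetical names: it produces one output per input sample, as the hardware block would each cycle, but says nothing about the hardware structure or word widths.

```python
def fir(samples, coeffs):
    """Direct-form FIR filter: one output per input sample."""
    taps = [0] * len(coeffs)      # delay line, newest sample first
    outputs = []
    for x in samples:
        taps = [x] + taps[:-1]    # shift the new sample into the delay line
        outputs.append(sum(c * t for c, t in zip(coeffs, taps)))
    return outputs


# An impulse reproduces the coefficients at the output.
print(fir([1, 0, 0, 0], [1, 2, 3]))  # [1, 2, 3, 0]
```

In the processor-plus-peripheral arrangement, the DSP would only write samples to an output port and read filtered results from an input port, leaving the multiply-accumulate work to the block.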

Optimized DSP architecture for PLDs

Based on the experiences of implementing a standard DSP architecture in PLDs, designers can develop an optimal DSP architecture for programmable logic based on the following principles and lessons.

Decoupling the accumulator from the multiplier reduces some of the paths leading into the block, allowing more accumulator functions to be implemented.

Large numbers of wide feedback paths are inefficient for programmable logic, because of the relatively coarse grains of look-up tables. The ALU section should be implemented with feed-forward paths in mind.

Using a deeper pipeline, especially in the data space, would increase performance, possibly up to 50 MIPS.

Using a register file rather than the accumulator for working storage would increase performance for many reasons. The datapaths in and around the ALU would be reduced, because the flow through the ALU would have less feedback. Addressing registers and data could exist in the register file, allowing the ALU to modify the addressing registers. This modification would yield more addressing features, such as for matrix manipulation. Storing working values in the register file would also bypass the data memory, avoiding the introduction of a long asynchronous delay into the critical path of an accumulator-based machine.

Because the critical paths all exist in the data space, functions can be added to the program space without affecting the cycle times of the processor.

The processor should be architected in a manner that would allow virtually all processor specifications to be modified by the user, such as number of registers, number of ports, memory configuration, and computational unit complexity.

A standard DSP implemented in programmable logic may even offer higher performance than standard products when combined in a single device with application-specific peripheral designs. Furthermore, I recently built a parameterized RISC processor that has been used successfully by engineers. The success of these projects indicates that high-density programmable logic is increasingly viable for SOC designs.


Martin Langhammer is currently a field application engineer for Kaytronics, Inc. in Toronto. His background includes intellectual property development for both programmable logic and ASICs, principally DSP functions for communications and image processing. He is also involved in the development of RISC and CISC processor architectures.

To voice an opinion on this or any Integrated System Design article, please email your message to miker@isdmag.com.


integrated system design  May 1998






Copyright © 2000 Integrated System Design

All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC All rights reserved.