Design Article

Algorithmic synthesis for video post-processor design

Pradeep Thiruchelvam, Synfora, Inc.

12/9/2008 10:14 AM EST

Introduction
With growing consumer demand for faster, cheaper and more complex devices, designers face constant pressure to meet time-to-market deadlines and financial constraints. The need to integrate ever more functionality into a product leads to a growing number of algorithms, all of which must be implemented in the SoCs that drive the product. A designer's chief challenge today is to execute these highly complex algorithms in hardware as rapidly as possible, while meeting aggressive high-performance and low-power goals. As a result, a disproportionate amount of design time is spent on hardware implementation rather than algorithmic innovation.

This article describes how STMicroelectronics (ST) used algorithmic synthesis to design a video post-processor, with the goals of shortening project time and improving design flexibility without compromising performance and power targets.

Algorithmic Synthesis offers the solution
Engineers can reduce project time and costs significantly if they elevate the design abstraction to an algorithmic level. Algorithmic synthesis (AS) technology allows the creation of complete application engines directly from sequential, untimed C algorithms. This enables the designer to explore multiple algorithms and implementation alternatives with different performance, area and power (PPA) profiles quickly, to find an optimal design and build the hardware automatically.

ST used PICO Express algorithmic synthesis to design a multi-standard video post-processor (VPP), which provides the deblocking and deringing functions of a multi-standard video CODEC. The VPP was selected as a representative block to implement using AS because a hand-designed version existed, so the resulting PPA and the time taken could be compared directly.

The Critical Role of Application Engines
A typical SoC designed for a consumer device comprises, at the highest level, four different types of IP: complex application engines (e.g. video codecs, wireless modems); Star IP (CPU, DSP); Connectivity and Control IP (USB, DMA); and memory. Typically, the bulk of engineering effort is spent designing and verifying complex application engines, because these are key to defining and differentiating the end product.

Application engines are intricate pieces of IP, usually subdivided into many blocks. Depending on the application, an engine may consist of a control processor and one or more hardware accelerators that help to meet cost and PPA goals. Traditionally, a hardware accelerator is designed block-by-block, either by reusing previously designed blocks, or by designing new RTL blocks manually. The engine is then assembled together, verified, and integrated with the rest of the SoC components.

Application engines created by Algorithmic Synthesis
Algorithmic synthesis creates hardware application engines that "drop" into the rest of the SoC design. Each application engine is derived from a sequential, untimed C algorithm. The designer provides a C description of the algorithm along with a testbench, and defines the design constraints. The PICO system then automatically generates synthesizable RTL, which is designed to operate at a specific clock frequency and technology target according to the specified constraints.
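
To make the input side of this flow concrete, here is a minimal sketch of what an untimed, sequential C description and its testbench can look like. The kernel smooth3 and the testbench run_testbench are hypothetical names chosen for illustration; they are not taken from ST's VPP code, and the sketch uses no PICO-specific constructs.

```c
#include <assert.h>
#include <stdint.h>

#define N 16  /* samples per call; fixed so loop bounds are known at compile time */

/* Hypothetical untimed kernel: 3-tap [1 2 1]/4 smoothing filter.
 * Plain sequential C with no clocks or timing; edge samples are copied. */
void smooth3(const uint8_t in[N], uint8_t out[N])
{
    assert(in != out);  /* not safe in place: out[i] needs the original in[i - 1] */
    out[0] = in[0];
    out[N - 1] = in[N - 1];
    for (int i = 1; i < N - 1; i++) {
        out[i] = (uint8_t)((in[i - 1] + 2 * in[i] + in[i + 1] + 2) >> 2);
    }
}

/* Hypothetical testbench: drives the kernel with known inputs and
 * returns the number of mismatches against the expected outputs. */
int run_testbench(void)
{
    uint8_t in[N], out[N];
    int errors = 0;

    /* A constant signal must pass through a smoothing filter unchanged. */
    for (int i = 0; i < N; i++) in[i] = 100;
    smooth3(in, out);
    for (int i = 0; i < N; i++) errors += (out[i] != 100);

    /* An impulse of height 4 spreads into the pattern 1, 2, 1. */
    for (int i = 0; i < N; i++) in[i] = 0;
    in[8] = 4;
    smooth3(in, out);
    errors += (out[7] != 1) + (out[8] != 2) + (out[9] != 1);

    return errors;
}
```

A driver would call run_testbench() from main() and treat a non-zero return value as a failure. The point of the sketch is only that the same C source serves both as the functional specification and as the source of test stimulus.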


1. System design flow using algorithmic synthesis, which takes an untimed C algorithm along with design constraints and delivers RTL, SystemC and testbenches for an optimal implementation in hardware.

As the following project demonstrates, using AS to create the application engine gives the designer unparalleled opportunities for rapid design-space exploration. AS also automatically creates both a complete RTL verification environment and SystemC transaction-level models that can be used in virtual platforms for system validation. In addition, AS produces scripts to drive a variety of tools, which ensures seamless integration into existing RTL-to-GDSII flows.

ST Project goals
The key goals of this project were to prove that using an AS design methodology can significantly reduce time-to-market and improve responsiveness to change, without sacrificing quality of results (QoR).

PICO Express was chosen for its ability to handle complex designs and deliver good quality results. The VPP was chosen for implementation as a good example of a complex video-processing design supporting multiple video standards, with aggressive performance and area targets.

A design team initially created the VPP by hand. This became the reference design against which the synthesized version would be compared, both in the time (man-months) taken to create it and in its PPA profile.

VPP features and specification
The filter supports the following video standards for Deblocking and Deringing: H.264, MPEG4, VC1. The VPP also supports upsampling and format conversion.

For this project, the complete VPP was recreated in untimed C. However, the specific PPA compared with the hand design was that of the MPEG deblocking filters. From this, it was possible to extract complete VPP PPA and design effort.

Design Implementation
Implementation using PICO Express
ST took a step-by-step approach to implementation of the design:
1) Architecture definition
To meet the required performance target, the VPP needed to operate with task-level parallelism. ST chose a memory architecture with three memory banks. Each bank consisted of a set of memories for storing all of the Luma and Chroma pixels of a single macroblock (MB), plus some additional lines of pixels above and below the macroblock.


2. The first step in the design process is to create the memory architecture for the VPP block.

ST then had to identify which portions of the algorithm could run in parallel and communicate through memories or streams. The division was based on the different iterative functions that the VPP needs to perform. Each of these functions can be described within a set of nested loops, and together they break down into three stages: Fetch, Deblock, and De-Ring/output.


3. Loop sequentialization enables the designer to improve performance.

The memory banks were expressed explicitly in the C code as arrays of the form [3][][], with the leading index selecting one of the three banks. Loop sequentialization was controlled by providing additional memory dependence information to PICO Express as pragmas, which allowed two loops that accessed the same memory to run in parallel. (The general form of the C code to obtain the desired architecture is shown in Figure 4.) The outer loop in this case was handled by an external driver or hardware controller, and the user had to guarantee that memory accesses did not conflict.


4. General form of the C code.
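
As a rough illustration of the structure described above, and not a reproduction of ST's code, the three-bank arrangement can be sketched in plain C as follows. The stage functions, the sizes and the name process_mbs are assumptions made for the example. The deblock stage is a placeholder that simply inverts each pixel so that its effect is visible. The tool-specific pragmas that carry memory dependence information are not shown, because their syntax is not given in the article. The outer loop, which the article says was handled by an external driver or hardware controller, is written here as an ordinary for loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_BANKS 3
#define MB_ROWS   16
#define MB_COLS   16
#define MB_SIZE   (MB_ROWS * MB_COLS)

/* Three banks of one macroblock each: the [3][][] form (luma only here). */
static uint8_t bank[NUM_BANKS][MB_ROWS][MB_COLS];

/* Stage 1: fetch one macroblock from the input into a bank. */
static void fetch_mb(const uint8_t *src, uint8_t dst[MB_ROWS][MB_COLS])
{
    memcpy(dst, src, MB_SIZE);
}

/* Stage 2: placeholder for the deblocking filter (inverts each pixel). */
static void deblock_mb(uint8_t mb[MB_ROWS][MB_COLS])
{
    for (int r = 0; r < MB_ROWS; r++)
        for (int c = 0; c < MB_COLS; c++)
            mb[r][c] = (uint8_t)(255 - mb[r][c]);
}

/* Stage 3: placeholder for de-ring/output (copies the bank out). */
static void output_mb(uint8_t src[MB_ROWS][MB_COLS], uint8_t *dst)
{
    memcpy(dst, src, MB_SIZE);
}

/* Outer loop: at step t the stages work on macroblocks t, t-1 and t-2,
 * which live in three different banks, so no two stages touch the same
 * memory in the same step. In hardware the three calls could therefore
 * run concurrently; here they simply run one after another. */
void process_mbs(const uint8_t *in, uint8_t *out, int num_mbs)
{
    assert(num_mbs >= 0);
    for (int t = 0; t < num_mbs + 2; t++) {   /* 2 extra steps drain the pipe */
        if (t >= 2)
            output_mb(bank[(t - 2) % NUM_BANKS], out + (t - 2) * MB_SIZE);
        if (t >= 1 && t <= num_mbs)
            deblock_mb(bank[(t - 1) % NUM_BANKS]);
        if (t < num_mbs)
            fetch_mb(in + t * MB_SIZE, bank[t % NUM_BANKS]);
    }
}
```

Because t, t-1 and t-2 always map to distinct values modulo three, the bank indices never collide within a step. This is the software-visible form of the guarantee that memory accesses do not conflict.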

2) Interface definition
The VPP was controlled through the XBus interface using configuration and task control registers in the XIF block. The outer-loop control of the PPA was implemented with a hardware controller that connected to the host port of the PPA.

In addition to the control signals, the input Luma, Chroma and parameters were obtained through three separate DMA channels. The Luma and Chroma output data were sent through two separate DMA channels. The DMA channels were implemented using the streaming interface specification in PICO. The PPA also interfaced to the external memories through an SRAM interface.


5. VPP interface automatically generated for streams, memories and host port.

3) Pass-through modes
The VPP supports pass-through and format conversion modes where the filters are essentially disabled. However, the correct implementation of these modes requires a clear understanding of the configuration parameters, DMA parameters and the supported input and output formats. Once the top level architecture was created, the next logical step was to implement the pass-through modes.

The pass-through mode implementation consisted of three loops:

  • Load loop: responsible for fetching data from the input Luma and Chroma streams and writing to the memory bank, in a format that allows easy access during the deblocking phase.
  • Deblock loop: operates in pass-through mode and does not modify the data in the memory banks. It still reads the data and writes it back to the memory banks, which helps to validate the data movement, memory access and parallel execution of the loops in the design.
  • Store loop: reads data from the memory bank and writes to the Luma and Chroma output streams in the desired output format.

Because RTL can already be generated at this point, even though it supports only minimal functionality, many bugs can be caught and eliminated early in the design cycle.
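
The three pass-through loops can be sketched as below. This is a simplified model under stated assumptions: only the Luma path of a single macroblock is shown, and stream_t with stream_put and stream_get is a hypothetical array-backed FIFO that stands in for the tool's streaming interface, whose actual API is not described in the article.

```c
#include <assert.h>
#include <stdint.h>

#define MB        16
#define MB_PIXELS (MB * MB)

/* Hypothetical software model of a stream: a FIFO backed by an array. */
typedef struct {
    uint8_t data[MB_PIXELS];
    int rd, wr;
} stream_t;

static void stream_put(stream_t *s, uint8_t v)
{
    assert(s->wr < MB_PIXELS);      /* stream must not overflow */
    s->data[s->wr++] = v;
}

static uint8_t stream_get(stream_t *s)
{
    assert(s->rd < s->wr);          /* stream must not underflow */
    return s->data[s->rd++];
}

static uint8_t luma_bank[MB][MB];   /* one bank, luma only */

/* Load loop: input stream -> memory bank, stored row by row. */
void load_loop(stream_t *in)
{
    for (int r = 0; r < MB; r++)
        for (int c = 0; c < MB; c++)
            luma_bank[r][c] = stream_get(in);
}

/* Deblock loop in pass-through mode: read each pixel and write it back
 * unchanged, so the memory traffic is exercised before any filter exists. */
void deblock_loop(void)
{
    for (int r = 0; r < MB; r++)
        for (int c = 0; c < MB; c++) {
            uint8_t p = luma_bank[r][c];
            /* filter cores would be added here in later steps */
            luma_bank[r][c] = p;
        }
}

/* Store loop: memory bank -> output stream. */
void store_loop(stream_t *out)
{
    for (int r = 0; r < MB; r++)
        for (int c = 0; c < MB; c++)
            stream_put(out, luma_bank[r][c]);
}
```

With this skeleton, the first check is simply that the output stream equals the input stream. Any difference points to a data-movement or addressing bug rather than a filter bug.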


6. The sequentialization graph shows the data movement and dependencies between loops.

4) Luma and Chroma processing
The architecture specification of the VPP allows the Luma and Chroma data to be processed in parallel, as the filters operate independently in separate data paths. This makes it possible to implement, test and debug one of the data paths first, and then to use that experience to implement the second quickly. The Luma part of the pipeline was implemented and tested before Chroma, which saved considerable time during the design.

5) Deblocking filters
As the project was a multi-standard deblocking filter, each filter path of the algorithm consisted of multiple filter cores. However, using AS technology, it was easy to start with a single deblocking filter core and incrementally add support for additional standards with minimal effort. Once the MPEG4 deblocking functionality was implemented, the generated RTL could then be verified in RTL simulation for the implemented functionality and the required performance.

With MPEG4 deblocking in place and tested, ST went on to add the H.264 and VC1 deblocking filters incrementally, testing each as it was added.

6) Deringing filter
Next, the deringing algorithm was implemented as the optional MPEG4 post-filter. This required two-pass filtering, in which the filter threshold computation was done during the first pass, followed by the deringing block computation on the second pass. The deringing filter was first implemented as a separate stage of the pipeline between the deblocking and output stages, but was later combined with the output stage of the PPA to improve the performance of the design.
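
The two-pass structure can be illustrated with the simplified sketch below. It is loosely modelled on the general idea of threshold-based deringing and is not the normative MPEG4 post-filter or ST's implementation. It works on a single 8x8 block, leaves border pixels untouched, and omits the clipping and cross-block threshold adjustments that a real filter would need. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define B 8  /* block size */

/* Pass 1: threshold = rounded midpoint of the block's min and max. */
uint8_t dering_threshold(uint8_t blk[B][B])
{
    uint8_t lo = 255, hi = 0;
    for (int r = 0; r < B; r++)
        for (int c = 0; c < B; c++) {
            if (blk[r][c] < lo) lo = blk[r][c];
            if (blk[r][c] > hi) hi = blk[r][c];
        }
    return (uint8_t)((lo + hi + 1) / 2);
}

/* Pass 2: smooth a pixel only if its whole 3x3 neighbourhood lies on the
 * same side of the threshold, so that real edges are left alone. */
void dering_filter(uint8_t in[B][B], uint8_t out[B][B], uint8_t thr)
{
    static const int k[3][3] = { {1, 2, 1}, {2, 4, 2}, {1, 2, 1} };

    assert(in != out);  /* neighbours must still hold unfiltered values */
    for (int r = 0; r < B; r++)
        for (int c = 0; c < B; c++) {
            out[r][c] = in[r][c];
            if (r == 0 || r == B - 1 || c == 0 || c == B - 1)
                continue;                    /* border: copy only */
            int above = 0, sum = 0;
            for (int dr = -1; dr <= 1; dr++)
                for (int dc = -1; dc <= 1; dc++) {
                    int p = in[r + dr][c + dc];
                    above += (p >= thr);
                    sum += k[dr + 1][dc + 1] * p;
                }
            if (above == 0 || above == 9)    /* flat region: filter */
                out[r][c] = (uint8_t)((sum + 8) >> 4);
        }
}
```

The second pass cannot start until the first has seen every pixel of the block, which is why the threshold computation and the filtering form two separate passes over the same data.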

7) PPK block and final integration
ST was able to generate RTL after implementation of each of the above functionalities, making the integration not a final step but part of the design flow. This ensured fewer issues during the final phase of the design, as most of the bugs had already been caught earlier on in the design cycle.

The output loop also needed to interleave the Luma and Chroma outputs in order to produce the YCbCr 4:2:2 interleaved raster format (sometimes called "TV output format"). This required the separate Luma and Chroma paths of the design to be merged in the PPK loop as part of the output stage. The PPK loop handles both the upsampling of the Chroma data and the interleaving of the Luma and Chroma data.
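
A minimal sketch of such a merge loop is shown below. Several details are assumptions, since the article does not specify them: the Chroma planes are taken to arrive at half resolution in both directions (4:2:0), vertical upsampling is done by simple line repetition, and the output byte order is Cb, Y0, Cr, Y1. The function name ppk_interleave is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Merge planar luma (w x h) with half-resolution chroma (w/2 x h/2) into
 * one interleaved 4:2:2 raster of 2*w bytes per line.
 * Assumed byte order per pixel pair: Cb, Y0, Cr, Y1. */
void ppk_interleave(const uint8_t *y, const uint8_t *cb, const uint8_t *cr,
                    uint8_t *out, int w, int h)
{
    assert(w > 0 && h > 0 && w % 2 == 0 && h % 2 == 0);
    const int cw = w / 2;                        /* chroma line width */

    for (int r = 0; r < h; r++) {
        /* Line repetition: output rows 2k and 2k+1 share chroma row k. */
        const uint8_t *cb_row = cb + (r / 2) * cw;
        const uint8_t *cr_row = cr + (r / 2) * cw;
        uint8_t *o = out + r * 2 * w;

        for (int x = 0; x < w; x += 2) {
            *o++ = cb_row[x / 2];
            *o++ = y[r * w + x];
            *o++ = cr_row[x / 2];
            *o++ = y[r * w + x + 1];
        }
    }
}
```

For a 4x2 Luma plane, the first output line then carries two pixel pairs (Cb0, Y0, Cr0, Y1, Cb1, Y2, Cr1, Y3), and the second line reuses the same two Chroma samples. A real design might use an interpolating upsampling filter instead of line repetition.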

8) Optimization for performance and area
Once the design was completed to meet the functional specification, it was optimized to meet the performance specification and reduce the final area.


7. Performance reports help to identify bottlenecks and tune the performance of the design.

By using algorithmic synthesis, ST could easily identify opportunities in the design to improve the overall performance. Analysis of the performance reports made it clear that an engineer could implement the deringing loop and the output loop in such a way as to reduce the overall cycles necessary to filter the data. This was achieved by reading the pixel data from the bank memories in a single loop, performing the deringing operation, and then directly streaming out the data without writing it back to the memory. It was possible to make the changes quickly without losing any of the work that was done. The result was not only a design with improved performance, but also a smaller area and reduced bandwidth to the memories.
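
The effect of merging the two loops can be seen in a small model that counts memory accesses. This is only a sketch: filter_pixel is a hypothetical point-wise stand-in for the deringing operation (the real filter uses a neighbourhood), and the bank is modelled as a flat array with access counters.

```c
#include <assert.h>
#include <stdint.h>

#define N 256                      /* pixels in the modelled bank */

static uint8_t bank_mem[N];
static long mem_reads, mem_writes; /* access counters for comparison */

static uint8_t bank_read(int i)
{
    assert(i >= 0 && i < N);
    mem_reads++;
    return bank_mem[i];
}

static void bank_write(int i, uint8_t v)
{
    assert(i >= 0 && i < N);
    mem_writes++;
    bank_mem[i] = v;
}

/* Hypothetical stand-in for the deringing operation (clears the LSB). */
static uint8_t filter_pixel(uint8_t p)
{
    return (uint8_t)(p & 0xFE);
}

/* Before: filter writes back to the bank, then a second loop streams out. */
void two_loops(uint8_t *out)
{
    for (int i = 0; i < N; i++)
        bank_write(i, filter_pixel(bank_read(i)));
    for (int i = 0; i < N; i++)
        out[i] = bank_read(i);
}

/* After: read once, filter, and stream out directly with no write-back. */
void fused_loop(uint8_t *out)
{
    for (int i = 0; i < N; i++)
        out[i] = filter_pixel(bank_read(i));
}
```

In this model the two-loop version performs 2N reads and N writes on the bank, while the fused version performs N reads and no writes, and both produce the same output. Fewer loops and fewer memory ports in use are consistent with the smaller area and reduced memory bandwidth reported above.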

9) Verification
PICO Express provides two separate verification flows: standalone verification of the RTL, and validation of the design in the system context. Both are driven by the same input C code specification. For this project, ST took the following steps:

  • Engineers used the standalone verification flow initially. Datasets derived from the C input were used to drive and test the results of the RTL. The verification coverage of the RTL is directly related to the coverage of the C specification by the test data that is provided.
  • They then used the system validation flow to directly simulate the entire application and driver code as multiple, concurrently running SystemC threads. Multiple models of the application are generated automatically and can be plugged seamlessly at various levels of accuracy, including the "bit-accurate" sequential model, the "thread-accurate" parallel performance model, and the "cycle-accurate" RTL model.
  • Finally, the generated RTL was tested using a Specman testbench that had been developed independently for validating the RTL.

30% improved performance results
ST found that using algorithmic synthesis achieved all of the key goals:

  • Performance: 30% higher than a manual design with the same functionality.
  • Time: The design was completed in 6 months using significantly fewer design resources.
  • Productivity: The project demonstrated a 3x to 4x productivity gain compared to the manual design.
  • Flexibility: the ability to explore different architectural choices and implementations allowed ST to make decisions rapidly to converge on a highly optimized design.

Key benefits of using Algorithmic Synthesis

  • Increased productivity and faster time to market: The design was completed successfully by a very small team of engineers in a shorter period of time. The total effort, including the time needed to understand the specifications and to deal with any changes to the C model and specifications, was approximately 11 man-months. This compares with the 30-40 man-months required for the manual design effort.
  • Flexibility and scalability: Revisions and improvements to the design could be made very easily. During the process, there were several changes to the C model that were quickly incorporated into the final design. ST was also able to incrementally add support for new standards and additional functionality.
  • Good quality of results: Through quick exploration of various architectural choices, ST was able to achieve an optimal design that met the key metrics of performance, power and area.
  • Ease of verification and integration: The generated RTL could be rapidly verified using the integrated simulation flows. This allowed ST to run either stand-alone verification or verification as part of the complete system. The standard interfaces generated by PICO enabled swift integration with the Specman testbench for independent verification of the design.
  • Automatic generation of simulation models and synthesis scripts: In addition to the RTL, PICO Express also generates SystemC models from a single C specification, as well as the synthesis scripts to ease the backend flow. This saved a significant amount of time that would otherwise have been necessary to write the SystemC models or the scripts for gate level synthesis by hand.

Conclusion
Demand for high performance, low cost, portable electronic systems will continue to drive IC design. The ability to automatically create efficient hardware from a sequential, untimed C algorithm using AS methodology allows designers to focus on algorithms that create differentiated performance and efficient implementation.

ST used algorithmic synthesis to successfully build a complex application engine from an untimed sequential C algorithm, using significantly fewer resources in a shorter period of time, and achieving performance, power and area targets. Based on the success of the VPP trial, ST are now deploying AS across their Nomadik organization. Emmanuel Chiaruzzi, Video Hardware Development Manager, stated: "With the complexity of our designs and tight deadlines, we have to find a way to improve our productivity. Moving to an untimed C approach has shown us a way to achieve greater efficiency, and we will be rolling this methodology out this year."

About the Author:
Pradeep Thiruchelvam is Chief of Staff, Applications Engineering at Synfora.

