NAND flash controllers
Flash memory has become the primary storage for portable devices because it is more reliable and power-efficient than hard disks. NAND flash, however, requires strong error detection and correction. The error-correcting code (ECC) block can be implemented in either software or hardware. When a hardware implementation is selected, it becomes the most important factor in the quality of results (QoR) of the NAND flash controller, because ECC is an additional block and is larger than the control logic.
In this project, we implemented a NAND flash controller with an ECC block using the compiler. The design consists of four threads: one for the controller logic, one for the AHB-Lite bus interface, and two for the ECC blocks. Specifications include the following:
- Support for NAND flash devices with 8- or 16-bit I/O bus
- Programmable support for various types of Micron NAND flash (see table I)
- Programmable access timing
- Support for 4-bit error correction per 540 bytes
- Support for ONFI 1.0 and AHB-Lite interface
- Clock period constraint of 6.2 ns
Table I: Supported configuration of NAND flash controller
We spent about six months from specifications to silicon: one month for SystemC coding, two months for design tuning and simple design validation, and three months for driver development and comprehensive HW/SW co-verification. The next section describes the design process for the ECC block.
ECC implementation: BCH code
The traditional ECC schemes for NAND flash devices are Hamming code and BCH code. Hamming code corrects only one bit, which is not sufficient for multi-level-cell (MLC) NAND flash devices. For this reason, only BCH code is used in this design. The BCH implementation is divided into two parts: an encoder and a decoder. The encoder calculates the parity bits for each received message and stores them in the spare area of the NAND flash device. The decoding operation has three fundamental steps:
1) calculation of syndromes
2) coefficient calculation of the error-locator polynomial
3) root calculation of the error-locator polynomial
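The three decoding steps above can be sketched in plain C++. This is a minimal illustration under stated assumptions, not the design's actual decoder: it uses a toy BCH(15,7) code over GF(2^4) that corrects 2 bits (the controller corrects 4 bits per 540 bytes and would need a much larger field), it solves step 2 with Peterson's closed-form solution for two errors rather than a general iterative algorithm, and all names (`gf_mul`, `syndromes`, `bch_decode`) are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Toy parameters (assumptions for illustration only):
// BCH(15,7), t = 2, GF(2^4) with primitive polynomial x^4 + x + 1.
constexpr unsigned M = 4, N = 15, T = 2;
constexpr unsigned PRIM = 0x13;

// GF(2^4) multiply: shift-and-add with reduction by PRIM.
uint8_t gf_mul(uint8_t a, uint8_t b) {
    unsigned acc = 0, x = a;
    for (unsigned i = 0; i < M; ++i) {
        if ((b >> i) & 1u) acc ^= x;
        x <<= 1;
        if (x & (1u << M)) x ^= PRIM;
    }
    return static_cast<uint8_t>(acc);
}

// a^e by repeated multiplication (fine for a 16-element field).
uint8_t gf_pow(uint8_t a, unsigned e) {
    uint8_t r = 1;
    while (e--) r = gf_mul(r, a);
    return r;
}

// Inverse of a nonzero element: a^(N-1), since a^N = 1.
uint8_t gf_inv(uint8_t a) { return gf_pow(a, N - 1); }

// Step 1: syndromes S_j = r(alpha^j), j = 1..2t, by Horner's rule.
// Bit i of `received` is the coefficient of x^i; alpha is the element 2.
std::array<uint8_t, 2 * T> syndromes(uint16_t received) {
    std::array<uint8_t, 2 * T> s{};
    for (unsigned j = 1; j <= 2 * T; ++j) {
        const uint8_t aj = gf_pow(2, j);
        uint8_t acc = 0;
        for (int i = static_cast<int>(N) - 1; i >= 0; --i) {
            acc = static_cast<uint8_t>(gf_mul(acc, aj) ^ ((received >> i) & 1u));
        }
        s[j - 1] = acc;
    }
    return s;
}

// Corrects up to two bit errors in a 15-bit word.
uint16_t bch_decode(uint16_t r) {
    const auto s = syndromes(r);
    const uint8_t s1 = s[0], s3 = s[2];
    if (s1 == 0) return r;  // no error (s3 != 0 here would mean uncorrectable)

    // Step 2: error-locator polynomial 1 + sigma1*x + sigma2*x^2.
    const uint8_t sigma1 = s1;
    const uint8_t sigma2 = gf_mul(static_cast<uint8_t>(s3 ^ gf_pow(s1, 3)), gf_inv(s1));

    // Step 3: root search. Bit i is in error when alpha^(-i) is a root.
    for (unsigned i = 0; i < N; ++i) {
        const uint8_t x = gf_pow(2, (N - i) % N);  // alpha^(-i)
        const uint8_t v = static_cast<uint8_t>(
            1u ^ gf_mul(sigma1, x) ^ gf_mul(sigma2, gf_mul(x, x)));
        if (v == 0) r = static_cast<uint16_t>(r ^ (1u << i));
    }
    return r;
}
```

As a usage example, the generator polynomial x^8 + x^7 + x^6 + x^4 + 1 (`0x1D1`) is itself a valid codeword of this toy code, so `bch_decode(0x1D1 ^ (1u << 3) ^ (1u << 11))` restores `0x1D1`.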
Initial Design for HLS
Software implementations of BCH code in C/C++ are easy to find on the Internet. Designers only need to add hardware-specific SystemC constructs to the pure C/C++ code and then use HLS tools to generate an optimized RTL implementation. The translation requires only a few weeks of effort, even for designers without the related domain knowledge. During synthesis, we fully unrolled the loops in the parity and syndrome calculations so that each byte is processed within one memory access time (i.e., one clock cycle). All function calls were inlined to maximize optimization within every thread. In addition, we had to disable resource sharing to finish synthesis in a reasonable amount of time.
The synthesis results in Table II indicate that the decoder occupies 99% of the area. The reason is that the original software implementation pre-computes the logarithm and anti-logarithm values of the Galois field (GF) required for decoding and stores them in two constant arrays, which are synthesized as huge lookup tables (LUTs). In a software implementation, keeping the necessary information in memory is a common way to reduce search and computation complexity, since operating systems and software designers can fully manage memory usage. This is not a good approach for a hardware implementation, however, because the memory required is too large.
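To illustrate what processing one byte per clock cycle means for the parity calculation, here is a minimal plain-C++ sketch of a byte-wise parity update. It assumes a toy 8-bit generator polynomial, x^8 + x^7 + x^6 + x^4 + 1, rather than the design's actual generator, and the names `GEN_LOW`, `parity_update_byte`, and `parity_of` are hypothetical. The inner loop has a constant trip count of eight, which is the kind of loop an HLS tool can fully unroll into combinational logic. The tool-specific unroll directive is omitted.

```cpp
#include <cstddef>
#include <cstdint>

// Low 8 bits of the toy generator x^8 + x^7 + x^6 + x^4 + 1
// (the x^8 term is implicit in the shift). Assumption for illustration.
constexpr uint8_t GEN_LOW = 0xD1;

// Consumes one message byte, MSB first, and returns the updated parity.
// The fixed 8-iteration loop is the candidate for full unrolling, so that
// the whole byte is absorbed in a single clock cycle in hardware.
uint8_t parity_update_byte(uint8_t parity, uint8_t data) {
    for (int bit = 7; bit >= 0; --bit) {
        const bool feedback = (((parity >> 7) ^ (data >> bit)) & 1u) != 0;
        parity = static_cast<uint8_t>(parity << 1);
        if (feedback) parity = static_cast<uint8_t>(parity ^ GEN_LOW);
    }
    return parity;
}

// Parity of a whole message: the remainder of m(x) * x^8 divided by g(x).
uint8_t parity_of(const uint8_t* msg, std::size_t len) {
    uint8_t parity = 0;
    for (std::size_t i = 0; i < len; ++i) {
        parity = parity_update_byte(parity, msg[i]);
    }
    return parity;
}
```

A quick sanity check of the sketch: running the message followed by its own parity byte through `parity_of` yields zero, because the combined word is a multiple of the generator.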
Table II: Synthesis reports of initial design
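To make the lookup-table approach concrete, the following minimal sketch shows how software typically multiplies Galois field elements with pre-computed logarithm and anti-logarithm arrays. It assumes a toy field GF(2^4) with primitive polynomial x^4 + x + 1; the article does not state the field size of the actual design, and the names `GfTables`, `gf_mul_lut`, and `lut_storage_bits` are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Toy field parameters (assumptions for illustration only).
constexpr unsigned FIELD_BITS = 4;
constexpr unsigned FIELD_SIZE = 1u << FIELD_BITS;  // 16 elements
constexpr unsigned GROUP_ORDER = FIELD_SIZE - 1;   // alpha^15 = 1
constexpr unsigned PRIM_POLY = 0x13;               // x^4 + x + 1

// The two constant arrays a software implementation pre-computes.
struct GfTables {
    std::array<uint8_t, FIELD_SIZE> log_tbl{};   // element -> exponent
    std::array<uint8_t, FIELD_SIZE> alog_tbl{};  // exponent -> element

    GfTables() {
        unsigned x = 1;  // alpha^0
        for (unsigned i = 0; i < GROUP_ORDER; ++i) {
            alog_tbl[i] = static_cast<uint8_t>(x);
            log_tbl[x] = static_cast<uint8_t>(i);
            x <<= 1;                             // multiply by alpha
            if (x & FIELD_SIZE) x ^= PRIM_POLY;  // reduce modulo PRIM_POLY
        }
        alog_tbl[GROUP_ORDER] = alog_tbl[0];     // wrap-around entry
    }
};

// a * b = alog[(log a + log b) mod (2^m - 1)]; zero has no logarithm.
uint8_t gf_mul_lut(const GfTables& t, uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return t.alog_tbl[(t.log_tbl[a] + t.log_tbl[b]) % GROUP_ORDER];
}

// Storage for both tables in bits: 2 tables * 2^m entries * m bits each.
unsigned long lut_storage_bits(unsigned m) {
    return 2ul * (1ul << m) * m;
}
```

The two tables cost only 128 bits in this toy field, but the size grows exponentially with the field width. For a hypothetical 13-bit field, `lut_storage_bits(13)` gives 212,992 bits, all of which would have to be realized as constant logic once the arrays are synthesized.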