Design Article
Custom Reconfigurable Computing Machine for High Performance Cellular Automata Processing
G. Cappuccino et al
7/23/2001 12:00 AM EDT
|
ABOUT THE AUTHOR
Pasquale Corsonello received his Masters
degree in Electronics Engineering from the University of Naples,
Italy, in 1988. He joined the Institute of Research on Parallel
Computer System (IRSIP) of National Council of Research of Italy,
working on the design and modeling of electronic transducers for
high precision measurement. In 1992 he joined the Department of
Electronics, Computer Sciences and Systems of the University of
Calabria, where he was involved in application specific IC design.
Currently he is assistant professor at the Department of
Electronics Engineering and Applied Mathematics of the University
of Reggio Calabria. His main research interests are in arithmetic
circuit, reconfigurable computing, and VLSI design. Prof.
Corsonello is a member of the IEEE.
|
||
. The models are based on a simple but efficient
idea: describing the behavior of complex systems and phenomena in
terms of many interacting elementary components. In spite of their
usefulness in modeling complex systems, the performance
requirements to execute CA simulations can be largethey use a
highly parallel computational model, which makes conventional,
sequential machines poorly matched to these kinds of application

. As a consequence, researchers have two main
approaches to execute CA algorithms: using software running on
parallel general-purpose supercomputers or developing
custom-computing machines
.
Reconfigurable computing systems can provide a marked performance advantage over general-purposes computers at a lower cost. This article proposes a novel Reconfigurable computing Machine for Cellular Automata processing (CAREM). This machine uses an FPGA-based processor as a computational engine.
With FPGAs, you can quickly implement hardware functions for each different algorithm. Owing to their uniform structure composed of many parallel-processing elements, FPGAs appear very attractive for computations where a large, regular-input data stream has to be processed to produce an output data stream. CA algorithms fit this computation model well: a large data stream corresponding to the cells' states is repeatedly updated according to a uniform set of rules. Although the CAREM processor is realized on a totally reconfigurable device, it consists of a well-defined structure for implementing different CA applications with low design effort.
The automaton's evolution retains a locality principle: for each
cell (i) the transition function (f) calculates the next state by
examining only its own state and the states of the neighboring
cells (Neigh(i)) at the current time step. The transition function
uses a specific pattern to define which cells are neighboring
cells. Figure 1 shows the typical structure of a
two-dimensional (2D) automaton. The cells are arranged on an m*n
lattice. In this example, the automaton uses a Moore neighborhood
composed of eight cells.
Figure 1: A two-dimensional cellular automaton
The cells' updating is synchronous and arrives in parallel fashion into the lattice. As a consequence, at each time-step the entire cellular automaton evolves to a new global state, which arises from the state's changes in the single cells. Although the transition function does not explicitly indicate a final state for the automaton evolution, it could be necessary to define this condition for certain types of automata.
According to the previously discussed CA model, you can implement a computing system targeting CA applications using a processing engine and two memory blocks storing two consecutive automaton states, S(t) and S(t+1). This basic computing architecture is shown in Figure 2. The processing engine contains the processing elements (PEs), which implement the automaton transition function. The system uses memory blocks A and B alternatively as source and destination memory. At each step t, the processing engine reads the current state S(t) from the source memory and writes the new state S(t+1) into the destination memory. After updating the entire automaton's state, the roles of the two memory blocks switch and another iteration starts.
Figure 2: Basic CA processing
The CA model enables a totally parallel computationyou could compose the Processing Engine using a PE for each cell. However, technological and cost limitations makes this feasible only for simple computations. Usually, the number of PEs implemented in the Processing Engine is lower than the number of cells. This implies that each PE has to be reused on several cells during the same iteration.
Figure 3: The CAREM system showing how the processing engine and memories communicate with the host PC
The execution flow occurs during three main operational phases. During the first phase, the entire initial state of the automaton is transferred from main memory to local SRAM into the board. In addition, the Host PC transfers some parameters required for the computation, such as number of iterations, lattice size, and rule parameters to the CAREM processor.
In the second phase, the CAREM processor calculates the automaton evolution according to its specific transition function. At each iteration, the CAREM processor reads the current automaton state from source memory and writes the new state into destination memory. The processor computes n-bits of data at each step, concurrently updating just a limited number of cells. When updating the entire automaton state, the roles of the two memory banks are changed and the processor begins another iteration. According to the specific algorithm the processor executes, CAREM will stop the computation and set an appropriate flag bit in order to signal the end of the computation. The system can generate a stop condition in two ways: after a specified number of iterations, or using a check function applied to the automaton data. After recognizing the stop condition, the Host PC retrieves display data from local memory.
The CAREM processor uses FPGA technology for high flexibility on different automata. It's easy to scale the proposed computing architecture based on desired performance and cost. By increasing bus width and by using multiple FPGA devices and memories, you can improve the grade of parallelism of the computation. Since the PEs work in a totally parallel manner on the automaton data, processing speed scales linearly with the number of PEs.
implements the CAREM prototype. To provide a
capability to handle different algorithms, the prototype organizes
the reconfigurable processor as three different sub-modules
(Figure 4). These sub-modules implement the previously
described common CAneighborhood module, rules module, and
control unit. According to the particular CA algorithm the
processor is executing, you can define these modules to support the
requirements of a specific application.
Figure 4: The CAREM processor comprises three sub-systems: neighborhood module, rules module, and control unit
The neighborhood module constitutes an interface block between
local source memory and the processing elements inside the rules
module. This module supports fast access to a cell's state for
reading. The neighborhood module consists of parallel multiple rows
of shift registers. At each clock cycle, a new 32-bit word is read
from the source memory and written into the first register's
column. At the same time, all the columns shift to the right by one
position so that a new set of cells and corresponding neighbors are
available for processing. Although the row number is fixed and
equals to the memory word-width, the column number depends on the
neighborhood size of the particular CA algorithm the processor is
implementing. In the application examples in the following
sections, we use a Moore neighborhood
with size equal to three cells.
The rules module, the heart of the reconfigurable processor, contains the processing elements that perform the transition function of the cellular automaton. The FPGA's logic blocks define these application-specific PEs. At each step, the rules module receives as input the current 32-bit state word provided by the neighborhood module and generates an updated 32-bit word. This data word is immediately written into the destination memory. Besides the updating function, the rules module can provide other interesting features depending on the particular automaton the processor is executing.
The control unit performs two main functions. The unit implements an interface circuit for the exchange of parameters between the Host PC and the CAREM processor, and it operates as a management unit for local memory. A Finite State Machine (FSM) provides both functions. The interface circuit consists of several registers of different sizes. The execution registers (IER and OER) store some basic I/O signals that the processor needs for execution control.
The iterations register (IR) stores the number of CA iterations that the processor has to perform before stopping the computation. The count register (CR) stores the number of updated data words per iteration. Finally, a group of four rules registers (RR1 through RR4) permits the rules module to store some parameters. These registers provide greater runtime flexibility for the computation since the host system can use the registers to dynamically change the automaton rules without reconfiguring the processor.
The memory unit provides complete management of local memory for the CAREM processor. As discussed in the previous section, CA processing requires reading the current automaton state from the source memory and writing the updated state to the destination memory. This implies that the processor has to generate addresses and control signals at each step of the computation for both memory banks. The processor generates address values sequentially, with the maximum value of the address dependent on the parameter stored in the CR register.
used in image-processing tasks and a forest-fire
simulation. Both algorithms use a 2D CA model with a Moore
neighborhood, but they differ in cell size, transition function,
and execution flow. Owing to its versatile structure, we matched
the CAREM processor to these applications only by reconfiguring the
rules module. The processor exhibited a throughput of about 160
Mbit/sec. To speed-up the CA processing, these example applications
limited communication with the Host PC to the initial and final
data exchanges.
Thinning Algorithm
Binary thinning is a technique that reduces the thickness of
binary images. It is widely used in the pre-processing stages of
pattern-recognition applications such as optical pattern
recognition (OCR), fingerprint-image processing, and computer
vision. By means of an erosion process, this technique deletes
pixels from the boundary, but still maintains original connectivity
of the image. After a finite number of iterations, this
transformation produces another binary image corresponding to a
skeleton of the original image.
This example implements the thinning algorithm of Arcella et
al
. Each input pixel uses a set of eight matching
templates, applied in parallel to the neighboring pixels. We use 32
processing elements in the rules module, updating 32 1-bit cells at
each clock step. The algorithm requires a variable number of
iterations according to the complexity of the input image. The
algorithm terminates when the image does not change between two
consecutive iterations. The control unit will set the stop flag in
the execution register signaling the end of the computation.
Figure 5 illustrates the effect of thinning on a sample
image. Table 1 shows the performance results of our
computation on real working examples. The table compares results
from the CAREM processor to the PAPRICA processor
. In both cases the thinning algorithm executes on
a 256x256 lattice. The PAPRICA processor uses an extended array of
256 PEs
with an instruction cycle-time of 350ns. Those
PEs provide a set of elementary instructions, performing basic
graphic and logical operations on the input image pixels. On the
other hand, the CAREM processor uses PEs the processor specifically
identifies for the thinning algorithm. They exhibit a cycle time of
200ns. Table 1 shows that this technique provides a 65X
cell-update speed increase over the PAPRICA processor.
Figure 5: Thinning execution on a sample image
| System | Lattice Size | Execution Time (Sec) |
Cell Updates/Sec |
Speed Up |
| PAPRICA | 256*256 | 2.6 | 2.52*10 6 |
1 |
| CAREM | 256*256 | 0.041 | 1.64*10 8 |
65 |
Table 1: Thinning executions (100 iterations)
Forest-Fire Algorithm
The forest-fire algorithm
is another interesting and widely recognized
application of CA modelling for studying the behavior of complex
real-word phenomena. In the basic model, each cell in the automaton
represents a portion of land and the cell's state can assume any
one of four different values: tree, burnt tree, fire, and ground.
We thus need 2 bits to represent a cell's state. The automaton
state evolution follows a simple set of rules:
- If a tree cell catches fire from one of its neighborhood cells,
it burns and becomes a fire cell
- A fire cell becomes a burnt tree cell
- Burnt trees and ground cells do not change their own state, they only stop the spreading of the fire.
You could also extend this basic forest-fire model to consider other elements in the forest such as terrain conditions, fuel types, and weather conditions.
To execute the forest-fire algorithm on the CAREM reconfigurable machine, we reconfigured the original rules module of the thinning application to implement the new transition function. In this case we use 16 processing elements working concurrently. In order to observe the evolution of the automaton state, the forest-fire simulation also requires specifying the number of iterations between two consecutive visualizationsthis value is loaded into the iterations register IR.
As previously discussed, the rules-registers block inside the control unit provides a more flexible implementation of the CA rules. For the forest fire algorithm, the control unit permits switching from the basic model to an extended model, which introduces wind effect on fire propagation. To include wind effects, we load the rules register RR1 with two wind parameters: direction and speed. The RRI also stores a control bit w, which sets the model the processor uses. Figure 6 shows two visualization snapshots of the forest-fire simulation. For this example, the w bit was set to zero, which means there is no wind. No wind means uniform spreading of the fire. Conversely, by setting the w bit to one, the propagation of the fire takes place in a specified direction, as shown in Figure 7.
Figure 6: Uniform fire propagation
Figure 7: Wind effect on fire propagation
Table 2 shows the performance results for the forest-fire
computation. These results are compared to that obtained on a Meiko
CS2 parallel computer
. The Meiko system uses a configuration of eight
processors. Nevertheless, as can be noted, the CAREM processor
showed a cell-update speed increase of 24 over the Meiko
processor.
| System | Lattice Size | Execution Time (Sec) |
Cell Updates/Sec |
Speed Up |
| Meiko CS2-8 | 224*224 | 14.91 | 3.37*10 6 |
1 |
| CAREM | 128*256 | 0.41 | 8.19*10 7 |
24 |
Table 2: Forest-fire executions (1000 iterations)
The proposed CAREM machine provides an inexpensive platform for easily implementing several CA applications. The scalable architecture of the proposed processor is modular, which makes it very easy to reconfigure the processor for different automata. The implementation of two significantly different algorithms has demonstrated the flexibility of the CAREM processor. Owing to the efficient parallel processing provided by CAREM machine, the execution of these algorithms has shown much higher performance than that obtained on software implementations running on general-purpose supercomputers and specialized processors.


6
