Design Article

Optimizing sort algorithms on DSPs

Hazarathaiah Malepati and Yosi Stein, Analog Devices

3/13/2009 12:00 PM EDT

Making the Most of SIMD
In this article, we present the bubble sort and merge sort algorithms. We discuss the cycle counts of these algorithms on the Blackfin processor, and show how to reduce the cycle count of the bubble sort and merge sort algorithm. In addition we look at methods to accelerate the sorting of data that is wider than the register width of the processor.

Introduction
Standard algorithm cookbooks provide dozens of number-sorting algorithms [1]. Though the sorting algorithms are very simple from a mathematical point of view, they can be very time consuming to perform on DSPs due to their non-linear computational complexity and the presence of many control operations in their execution. Usually, we specify the complexity of an algorithm using the notation O(.), which indicate 'the order of'. For example, the complexity of sorting N numbers using the bubble sort method is O(N2) and using merge sort method is O(NlogN). This doesn't mean we consume N2 and NlogN processor clock cycles in executing those algorithms to perform sorting. However, we can say that we consume about xN2 and yNlogN cycles (excluding the overhead), where the efficiency factors x and y depend on the particular processor architecture, software code, algorithm flow, etc. used to execute the algorithm.

Since most of the operations in sort algorithms are of the control type, a processor compute unit with a deep pipeline will stall whenever it encounters a conditional operation such as a jump, a branch in a loop structure, etc. To avoid this situation we should restructure the algorithm to use fewer conditional operations. In addition to the processor architecture, the efficiency factors x and y depend on the kind of source code (C, assembly level, etc.) used to develop the algorithm. We usually develop software with the C language because of its ease of use. The C code is converted to native machine code using a compiler. The efficiency of compiled code depends entirely on the particular compiler used. If the C compiler is not efficient then the values x and y are much larger than can be obtained from hand-optimized assembly code.

In this article, we will estimate the values for x and y by implementing the bubble sort and merge sort algorithms on a Blackfin DSP processor [2]. We provide techniques to efficiently implement the merging of arrays on Blackfin. We also provide optimized assembly level code for the merge sort algorithm.

Cycle counts of Bubble Sort vs. Merge Sort
In Figure 1, we illustrate the sorting of eight numbers (i.e., N = 8) in ascending order using the bubble sort method and the merge sort method. As we explained above, the order of complexity of the bubble sort method to sort N = 8 numbers is 82 = 64 steps and for the merge sort method is 8log8 = 24 steps. Next, we will examine in some detail the operations that make up each method.

With the bubble sort method, as illustrated in Figure 1(a), we find the minimum of all the numbers present in the given array A and place it in the array B next to the previous minimum number and continue till we fill all the N numbers to array B from array A. As shown in Listing 1, the outer loop runs N times and each time it finds the element with minimum value present in array A and places it in array B. Then we fill the location where we found the element with the minimum value in array A with the allowed maximum value (for example, it is 0x7fffffff, if the width of the one entity of signed array is 32 bits). In the inner loop, we find the minimum between two values: reference value, c (which is initialized with the value at n = m = 0 location of array A at the beginning of inner loop) and input value from array A at the current index m. We update the c and n with minimum value and the index of minimum value depending on condition check. The number of cycles we consume to perform one iteration of the bubble sort method inner loop determines the value of the factor x.


(Click to enlarge)

Figure 1. Sorting Approaches: (a) bubble sort and (b) merge sort.


Listing 1: C-simulation code for bubble sort method.

In the merge sort method, we start by dividing the array A into one element sub-arrays as shown with the dashed line in Figure 1 (b), and initialize the length of a sub-array as L= 1 and the number of sub-arrays as M = N. We assume that the length of array A is N = 2n for simplicity. With this, the outer loop of the merge sort method runs for n times and the inner loop runs K = (M >> i) times, where i = 1, 2, … , n. The length L of a sub-array doubles with each iteration of the outer loop. At any point in time, the function of the inner loop is to merge the two consecutive sub-arrays into a new sub-array such that the elements of the new sub-array are always arranged in ascending order. This is highlighted by the bolded code in Listing 2. For example, in the first iteration of the outer loop of the merge sort method, we take two consecutive one element sub-arrays and form a new two element sub-array such that the two elements of the new array are arranged in ascending order. Similarly, in the second iteration of the outer loop, we consider two consecutive two element sub-arrays and merge them into one new four element array such that the four elements of the new sub-array are placed in ascending order, and so on.


Listing 2. C-simulation code for merge sort method.

Now we will address the question of the cycle cost of merging two sorted arrays. As highlighted in Listing 2, the part of the inner loop that merges two sorted arrays involves many condition checks, conditional data moves and conditional pointer updates. The number of cycles we consume in implementing the control flow of the inner loop of sorted array merging determines the value of the efficiency factor y.





Please sign in to post comment

Navigate to related information

Datasheets.com Parts Search

185 million searchable parts
(please enter a part number or hit search to begin)

Feedback Form