Optimizing Sort Algorithms for the PS3 Part 1 (Quicksort)

11 July 2011

Sorting is a simple but important concept for implementing games. For example, sorting transparent objects before rendering or sorting objects by state attribute (e.g. texture) to improve batching. Because of the way most algorithms work (comparing and swapping pairs of items) sorting often takes precious time. With multi-core architectures like the PS3 it makes sense to parallelize this operation to maximize the use of these cores.

This is the first of a series of blog posts explaining how to get several algorithms to run faster on the PS3. First we will take the canonical C++ implementations, apply generic optimizations, then apply PS3-specific optimizations for the PPU. Then later in the series we will show how to run the same implementations on a SPU by using Offload. The last parts will focus on parallelization, i.e. sorting on multiple SPUs at the same time.

Quick Links == Part 1 (Quicksort)

Part 2 (Merge)

Part 3 (Merge Sort)

Part 4 (Offloading on 1 SPU)

Part 5 (Parallel Sort on 2 SPUs)

Part 6 (Final Optimizations)

The entire series is also available as a single PDF document.

Quicksort == The first algorithm we will present is quicksort. This is a very popular sort because it is very fast (O(n * log n) time) in the average case and requires a constant amount of memory to run (i.e. it doesn't depend on the number of items to sort). As a downside it becomes very slow (O(n^2)) when the input is already sorted. Its divide-and-conquer approach makes it easy to parallelize.

Sorting goes like this: choose one item to be the pivot (e.g. the item in the middle of the array). Move all the items smaller than the pivot (A) before it and all larger items (B) after it. This divides the array into two parts: A and B. The same process is then recursively applied to A and B until the items are sorted (see the code below).

template<typename T>
void quickSort(T *data, size_t start, size_t end)
{
    size_t count = end - start + 1;
    if(count < 2)
        return;

    // choose the middle item as pivot
    int pivotIndex = start + ((end - start) >> 1);
    T pivot = data[pivotIndex];
    int i = start, j = end;
    T tmp;
    // move all items smaller than pivot after it and larger items before it
    while(i <= j)
    {
        // look for the next item larger than pivot (going from the start)
        while(data[i] < pivot)
            i++;
        // do the same for the next item larger than pivot (going from the end)
        while(pivot < data[j])
            j--;
        if(i <= j)
        {
            // swap the larger and smaller items
            swap(data[i], data[j]);
            i++;
            j--;
        }
    }
    // now repeat the process on all items before the pivot
    if(start < j)
        quickSort(data, start, j);
    // and another time on all items after the pivot
    if(i < end)
        quickSort(data, i, end);
}

This naive implementation takes 21.9 ms to sort 64K floats and 9.6 ms for sorting 64K integers. As a comparison std::sort() takes 12.8 ms and 6.7 ms, respectively. Making it iterative by managing a local stack gives 21.5 ms for floats and 9.6 ms for integers. Quicksort efficiently sorts large arrays but does not perform well with small arrays.

A common optimization is to use a simpler sort for smaller arrays. When sorting with insertion sort for arrays of 55 items or less, we get 17.2 ms and 7.4 ms. Another optimisation[1] is to choose the median of (first, middle, last) as the pivot which results in 16.8 ms and 7.1 ms. The final code for this quicksort implementation can be found below.

template<typename T>
void quickSortIterative(T *data, size_t start, size_t end)
{
    range stack[32];
    range *s = stack;
    s->start = start;
    s->end = end;
    s++;
    do
    {
        // dequeue the next [start, end] range of the array to sort
        s--;
        int startI = s->start, endI = s->end;
        int count = endI - startI + 1;
        // small array optimisation
        if(count <= 55)
        {
            insertionSort(data, startI, endI);
            continue;
        }
        // choose the middle item as pivot
        int pivotIndex = startI + ((endI - startI) >> 1);
        // reorder these three items such that the median is at pivotIndex
        median(data[startI], data[pivotIndex], data[endI]);

        T pivot = data[pivotIndex];
        int i = startI, j = endI;
       // move all items smaller than pivot after it and larger items before it
        while(i <= j)
        {
        // look for the next item larger than pivot (going from the start)
            while(data[i] < pivot)
                i++;
        // do the same for the next item larger than pivot (going from the end)
            while(pivot < data[j])
                j--;
        // swap the larger and smaller items
            if(i <= j)
            {
                swap(data[i], data[j]);
                i++;
                j--;
            }
        }
        // now repeat the process on all items before the pivot
        if(startI < j)
        {
            // instead of a recursive call, enqueue the new [start, end] range
            s->start = startI;
            s->end = j;
            s++;
        }
        // and another time on all items after the pivot
        if(i < endI)
        {
            s->start = i;
            s->end = endI;
            s++;
        }
    }
    while(s > stack);  // as long as we have [start, end] ranges to sort
}

struct range
{
    int start;
    int end;
};

// reorder items such that a <= b <= c
template<typename T>
void median(T& a, T& b, T &c)
{
    if(c < a)
        swap(a, c);
    if(b < a)
        swap(a, b);
    if(c < b)
        swap(b, c);
}

Testing methodology == The arrays used as test data for the performance comparisons contain random data with the current time used as the seed. For each array size the sorting function is executed five times and the median time is kept. The clock() function is used for measuring execution time. This approach will be used throughout the series.

Results ==

64K floats

64K integers

std::sort

12.8 ms

6.7 ms

Recursive quicksort

21.9 ms

9.6 ms

Iterative quicksort

21.5 ms

9.6 ms

Iterative quicksort, small array (N <= 55)

17.2 ms

7.4 ms

Iterative quicksort, small array (N <= 55) and median

16.8 ms

7.1 ms

As we can see in the table above, our simple quicksort implementation approaches the performance of the PS3's STL sort for integer arrays but it is quite slower when sorting float arrays. As we will see later in the series, this is not a problem because we will get similar performance for integers and floats when running on the SPUs.

Another optimization we could apply is vectorizing the partitioning part (i.e. moving elements on each side of the pivot) using AltiVec instructions. Unfortunately, research suggested that there was no successful vectorized implementation.

In the next part in the series we will have a look at the merge algorithm which is central to both the merge sort algorithm and implementing a parallel sort.

References == [1] Jon L. Bentley and M. Douglas McIlroy, Engineering a Sort Function, Software—Practice and Experience, Vol. 23(11), 1249–1265, 1993

Pierre-Andre Saulais's Avatar

Pierre-Andre Saulais

Principal Software Engineer, Compilers