Optimizing Sort Algorithms for the PS3 Part 6 (Final Optimizations)

11 July 2011

Sorting is a simple but important concept for implementing games. For example, sorting transparent objects before rendering or sorting objects by state attribute (e.g. texture) to improve batching. Because of the way most algorithms work (comparing and swapping pairs of items) sorting often takes precious time. With multi-core architectures like the PS3 it makes sense to parallelize this operation to maximize the use of these cores.

In the previous part we have created a parallel sort implementation that runs on 2 SPUs. In this part we will improve it so that it can run on 4 SPUs instead of 2. In addition, we will show how to parallelize merge to run on any number of SPUs which will improve performance even further.

Quick Links == Part 1 (Quicksort)

Part 2 (Merge)

Part 3 (Merge Sort)

Part 4 (Offloading on 1 SPU)

Part 5 (Parallel Sort on 2 SPUs)

Part 6 (Final Optimizations)

The entire series is also available as a single PDF document.

Parallel Sort on 4 SPUs == As we have seen, the recursive structure of our parallel sort implementation limits the number of SPUs we can use. With one level of recursion and dividing the array in half each time we are limited to using 2 SPUs in parallel. If we divided the array in four parts we could use 4 SPUs.

The algorithm would be the same, only with “unrolling” one level of recursion. At each recursion level we are left with 4 sorted array parts. We then need to merge these 4 parts into 2, then into one. Note that now we don't need to copy back temp into data:

template<typename T>
void parallelSort(T *data, size_t count, AlignedT *temp, ThreadQueue *queue)
{
    // divide the data array into (nearly) equal parts
    const size_t PARTS = 4;
    size_t partSize[PARTS], partStart[PARTS], partEnd[PARTS];
    size_t current = 0, left = count;
    for(int i = 0; i < PARTS; i++)
    {
        partStart[i] = current;
        partSize[i] = min(count / PARTS, left);
        current += partSize[i];
        left -= partSize[i];
        if(i == (PARTS - 1))
            partSize[i] += left;
        partEnd[i] = partStart[i] + partSize[i] - 1;
    }

    // sort the parts
    if(partSize[PARTS - 1] > MAX_SPU_ARRAY_COUNT)
    {
        parallelSort(data + partStart[0], partSize[0], temp, queue);
        parallelSort(data + partStart[1], partSize[1], temp, queue);
        parallelSort(data + partStart[2], partSize[2], temp, queue);
        parallelSort(data + partStart[3], partSize[3], temp, queue);
    }
    else
    {
        sortChunkQueued(data, partStart[0], partEnd[0], queue);
        sortChunkQueued(data, partStart[1], partEnd[1], queue);
        sortChunkQueued(data, partStart[2], partEnd[2], queue);
        sortChunkQueued(data, partStart[3], partEnd[3], queue);
    }
    // parts must have been sorted before merging
    queue->joinAllThreads();

    // merge four parts (data) into two (temp)
    merge(temp + partStart[0],
        data, partStart[0], partEnd[0], 
        data, partStart[1], partEnd[1]);
    merge(temp + partStart[2], 
        data, partStart[2], partEnd[2], 
        data, partStart[3], partEnd[3]);

    // merge two parts (temp) into one (data)
    merge(data, 
        temp, partStart[0], partEnd[1],
        temp, partStart[2], partEnd[3]);
}

Without going into details, when sorting >20K arrays with merge sort we see 3.4-4.8x speed-ups. These speed-ups come from “unrolling” one recursion level. Doing this again (i.e. dividing the array into 8 parts) would be possible, unfortunately only 6 out of the PS3's 8 SPUs can be used. With Offload we can enqueue more blocks than there are SPUs but some SPUs will be idle at the end: if we enqueue 8 blocks, 6 will be processed in parallel at the same time and when they have been sorted the last 2 blocks will be processed. Thus it is not any faster than sorting 4 blocks on 4 SPUs.

Parallel Merge == One major part of our sort implementation remains sequential: merging. It turns out that merge can also be parallelized[1]. The idea is to take the median of the left array (the middle element since the array is sorted) and look in the right array for an item that is smaller than the median (this is fast since the array is sorted). Then these two items divide each array into two parts that can be merged independently. And independently means we can easily parallelize it:

template<typename T>
void parallelMerge(T *to, const T *fromX, size_t lowX, size_t highX, 
    const T *fromY, size_t lowY, size_t highY, size_t lowTo, ThreadQueue *q)
{
    size_t lengthX = highX - lowX + 1;
    size_t lengthY = highY - lowY + 1;
    if((lengthX + lengthY) <= MAX_SPU_ARRAY_COUNT)
    {
        mergeQueued(to, fromX, lowX, highX, fromY, lowY, highY, lowTo, q);
        return;
    }

    if(lengthX < lengthY)
    {
        parallelMerge(to, fromY, lowY, highY, fromX, lowX, highX, lowTo, q);
        return;
    }

    // get median of the X sub-array
    size_t midX = (lowX + highX) / 2;
    // find index mixY such that temp[midY] > temp[midX]
    size_t midY = binarySearch(fromY, lowY, highY, fromX[midX]);
    // copy the median
    size_t midTo = lowTo + midX - lowX + midY - lowY;
    to[midTo] = fromX[midX];

    // merge X[lowX .. midX - 1] with Y[lowY .. midY - 1]
    parallelMerge(to, fromX, lowX, midX - 1, fromY, lowY, midY - 1, lowTo, q);
    // and X[midx + 1 .. highX] with Y[midY .. highY]
    parallelMerge(to, fromX, midX + 1, highX, fromY, midY, highY, midTo + 1,q);
}

In this code, the mergeQueued function is similar to the sortChunkQueued function we have seen earlier. It allocates SPU memory for all arrays, copy the data from PPU to SPU, merge both arrays and copies the result back to the PPU once finished.

Results ==

16K faces

64K floats

64K ints

std::sort

8.4 ms

12.0 ms

7.0 ms

Quicksort (PPU)

9.4 ms

17.7 ms

7.2 ms

Parallel quicksort (2 SPU) with sequential merge

6.5 ms

6.1 ms

6.1 ms

Parallel quicksort (4 SPU) with sequential merge

5.4 ms

4.1 ms

4.0 ms

Parallel quicksort (2 SPU) with parallel merge (6 SPU)

5.5 ms

5.4 ms

5.4 ms

Parallel quicksort (4 SPU) with parallel merge (6 SPU)

3.6 ms

2.8 ms

2.8 ms

std::stable_sort

23.3 ms

31.6 ms

19.9 ms

Merge sort (PPU)

18.3 ms

10.1 ms

9.6 ms

Parallel merge sort (2 SPU) with sequential merge

10.1 ms

3.6 ms

3.3 ms

Parallel merge sort (4 SPU) with sequential merge

6.3 ms

2.5 ms

2.4 ms

Parallel merge sort (2 SPU) with parallel merge (6 SPU)

8.7 ms

2.5 ms

2.4 ms

Parallel merge sort (4 SPU) with parallel merge (6 SPU)

4.5 ms

1.2 ms

1.2 m

Using parallelMerge instead of merge in our parallelSort function gives even greater speedups (using all 6 SPUs for merging). When sorting >20K arrays with merge sort on 4 SPUs we see 6-11x speed-ups for float arrays and 5-10x speed-ups for int arrays over the PPU implementation. When it comes to face arrays >16K we see 2-2.3x speed-ups with 2 SPUs and 3.6-4.5x speed-ups with 4 SPUs (see below for tables and plots of the results).

Interestingly, different sort algorithms perform differently with different data types. For example, our parallel merge sort (with parallel merge) implementation sees 7-11x speed-ups for float and int arrays >40K on 4 SPUs. When sorting faces the speed-ups are lower (4x). On the other hand, the parallel quicksort (with parallel merge) sees 5-6x speed-ups (over merge sort) on the same face arrays. With float and integer arrays the speed-ups are lower (4x over merge sort).

Speed-up plots == These plots compare the performance of the different sort implementations we have presented in the series as well as STL's. All speed-ups are over our optimized merge sort running on the PPU. All vectorized functions we have described are used whenever possible.

Sorting Arrays of Floats === Untitled

Sorting Arrays of Integers === Untitled

Sorting Arrays of Faces (1 float, 3 integers) === Untitled

Comparison with offloaded std::sort (Face arrays) === Untitled

Conclusion == In this blog post series we have explained the process of creating a parallel sort implementation on the PS3's SPUs. Our implementation can use any sort function (that can run inside an Offload block) to sort array chunks small enough to fit in a SPU's local memory. These chunks are then merged in parallel, also on SPUs. Up to 4 SPUs are used for sorting and up to 6 SPUs for merging. This results in 6-11x speed-ups for sorting float and integers arrays of over 20K elements and 2.5-3x speed-ups for sorting similarly-sized arrays of our user-defined Face struct.

References == [1] http://dzmitryhuba.blogspot.com/2010/10/parallel-merge-sort.html

Pierre-Andre Saulais's Avatar

Pierre-Andre Saulais

Principal Software Engineer, Compilers