Optimizing Sort Algorithms for the PS3 Part 3 (Merge Sort)

11 July 2011

Sorting is a simple but important concept for implementing games. For example, sorting transparent objects before rendering or sorting objects by state attribute (e.g. texture) to improve batching. Because of the way most algorithms work (comparing and swapping pairs of items) sorting often takes precious time. With multi-core architectures like the PS3 it makes sense to parallelize this operation to maximize the use of these cores.

In the first two parts we have implemented the quicksort and merge algorithms, optimized them and and compared their performance to the STL versions. In this part we will do the same with merge sort.

Quick Links == Part 1 (Quicksort)

Part 2 (Merge)

Part 3 (Merge Sort)

Part 4 (Offloading on 1 SPU)

Part 5 (Parallel Sort on 2 SPUs)

Part 6 (Final Optimizations)

The entire series is also available as a single PDF document.

Merge Sort == Like quicksort, this sort also takes O(n * log n) time on average but, unlike quicksort, requires a temporary array whose size is equal to the array to be sorted. On the positive side and also unlike quicksort, this sort is stable. This means that items that are equal will be kept in the same order. This can be useful when sorting values of user-defined types such as structs. In addition, its performance doesn't degrade to O(n^2) with already sorted arrays liked quicksort does.

The divide-and-conquer approach of this algorithm makes it easy to implement as a recursive function. It goes like this: the array is divided into two equal parts and the function recursively calls itself on each half of the array. When both halves of the array are sorted, they are merged together to the temporary array. After the contents of this array is copied back to the input array the latter is sorted (see code below).

template<typename T>
void mergeSort(T *data, T *temp, size_t start, size_t end)
{
    size_t count = end - start + 1;
    if(count < 2)
        return;

    // divide the array to sort into two parts and sort them
    size_t mid = (start + end) / 2;
    mergeSort(data, temp, start, mid);
    mergeSort(data, temp, mid + 1, end);

    // merge two parts (data) into one (temp)
    merge(temp + start, data, start, mid, data, mid + 1, end);

    // copy one part from temp to data
    __builtin_memcpy(data + start, temp + start, (end - start + 1) * sizeof(T));
}

This naive recursive implementation takes 29.5 ms to sort 65k floats and 19.5 ms for sorting 65K integers. As a comparison std::stable_sort takes 31.8 ms and 20 ms, respectively. The same "small array" optimisation can be used. When sorting arrays of size 8 and less with insertion sort, we get 21.2 ms for floats and 11.4 ms for integers.

The iterative merge algorithm is a bit different from the recursive one. It consists of several passes with an increasing block size m. For the first pass (m = 16), each block of m items is sorted using the insertion sort. Then pairs of adjacent blocks are merged together, i.e. blocks 0 and 1 will be merge into block 01, then blocks 2 and 3 into block 23, then 3 and 4 into block 34 and so on. With each pass the value of m doubles. This means that in the second pass block 01 is merged with block 23, block 45 with block 67 and so on:

template<typename T>
void mergeSortIterative(T *data, T *temp, size_t start, size_t end)
{
    size_t count = end - start + 1;
    size_t m = 16;
    // this algorithm requires that the source array can be divided into
    // two chunks, one being as least m in size
    if(count < (m + 1))
    {
        insertionSort(data, start, end);
        return;
    }

    // sort chunks of m = 16 items
    size_t i = start;
    for(; i <= (end - m + 1); i += m)
        sort16(data, i);
    if(i < end)
        insertionSort(data, i, end);

    Merge<T> merge;
    T *src = temp, *dst = data, *swap;
    // with each merge pass the chunks double in size
    for(; m < count; m *= 2)
    {
        // alternate between merging data to temp and temp to data
        // this is to avoid copying temp to data every time
        swap = src; src = dst; dst = swap;
        // merge 2 * N chunks of m items into N chunks of 2 * m items
        mergeIterative(src, dst, start, end, m);
       // now all chunks of size 2 * m are sorted
    }
    // make sure that the sorted data ends up in the 'data' array
    if(dst == temp)
        __builtin_memcpy(data + start, temp + start, count * sizeof(T));
}

template<typename T>
void mergeIterative(T *data, T *temp, size_t start, size_t end, size_t m)
{
    // merge 2 * N chunks of m items into N chunks of 2 * m items
    size_t i = start;
    for(; i <= (end - m); i += (2 * m))
    {
        size_t endChunk = min(2 * m - 1, end - i);
        T *src = (T *)(data + i);
        T *dst = (T *)(temp + i);
        mergeImpl(dst, src, 0, m - 1, src, m, endChunk);
    }
    // copy trailing items that make the array not divisible by 2 * m
    for(; i <= end; i++)
        temp[i] = data[i];
}

template<typename T>
void sort16(T *data, size_t start)
{
    insertionSort(data, start, start + 15);
}

With this approach it takes 22.1 ms to sort the float array and 10.7 ms for the integer array which is no improvement over the recursive version. With the iterative approach, unlike with the recursive approach, we can control the size of array chunks that will be sorted using the simpler sort. We also control the alignment of arrays to be merged. This will be useful for vectorizing parts of our implementation and mitigates the additional complexity.

Vectorizing == Our implementation so far processes items one at a time and has a lot of branching. As we have seen in the previous part, vectorizing our code using AltiVec SIMD instructions can improve on both counts. Consider the function sortVec4 below which sorts a vector containing four items without any branching:

// T is either vec_float4 (4 floats) or vec_int4 (4 ints)
template<typename T>
inline void sortVec4(T *a)
{
    // va = [x, y, x, w]
    T va = *a;
    T v1 = vec_perm(va, va, PERM(VA, VA, VC, VC));
    T v2 = vec_perm(va, va, PERM(VB, VB, VD, VD));
    // v3.x = [min(v1.x, v2.x), min(v1.y, v2.y), ..., ...]
    T v3 = vec_min(v1, v2);
    T v4 = vec_max(v1, v2);
    // X = v3.x, Y = v3.y and so on with Z, W. Similarly A = v4.x, B = v4.y ...
    // v1 = [v3.x, v4.x, v3.y, v3.w]
    v1 = vec_perm(v3, v4, PERM(VX, VA, VY, VW));
    v2 = vec_perm(v3, v4, PERM(VZ, VC, VB, VD));
    v3 = vec_min(v1, v2);
    v4 = vec_max(v1, v2);
    v1 = vec_perm(v3, v4, PERM(VX, VA, VY, VB));
    v2 = vec_perm(v3, v4, PERM(VX, VY, VA, VB));
    v3 = vec_min(v1, v2);
    v4 = vec_max(v1, v2);
    *a = vec_perm(v3, v4, PERM(VX, VY, VC, VD));
}

Using this function and the mergeVec4 function seen in the previous part we can create a sortVec8 function. sortVec8 sorts two vectors using sortVec4 and merges them using mergeVec4. Similarly with sortVec8 and mergeVec8 (adapted from [1]) we can create a sortVec16 function. This function can be used to specialize the sort16 function for float and integer items; the plain insertion sort will be used for sorting other types of arrays.

Results ==

64K floats

64K integers

std::stable_sort

31.8 ms

20 ms

Recursive merge sort

29.5 ms

19.5 ms

Recursive merge sort, small array (N <= 8)

21.2 ms

11.4 ms

Iterative merge sort, small array (N = 16)

22.1 ms

10.7 ms

Same as above, but sort16 is defined as sortVec16

16.6 ms

8.7 ms

Same as above, but merge is vectorized (mergeVec4)

10.8 ms

9.6 ms

Specializing sort16 gives us some improvements over using insertion sort for small arrays: we get 16.6 ms for float arrays and 8.7 ms for integer arrays. Finally, if we use the vectorized merge implementation described in the previous part, we get 10.8 ms for float arrays and 9.6 ms for integer arrays. This is a big speed-up for sorting floats but a regression for sorting integers.

This is not surprising since we observed the same kind of slowdown when using the vectorized merge on integer arrays. With template specialization we can easily choose to use the vectorized variant only for float arrays. Also, as we will see this won't be an issue when offloading to SPUs.

In the next part in the series we will see how to run some of the code we have presented in this series on a SPU using Offload.

References == [1] http://seven-degrees-of-freedom.blogspot.com/2010/07/question-of-sorts.html

Pierre-Andre Saulais's Avatar

Pierre-Andre Saulais

Principal Software Engineer, Compilers