Optimizing Sort Algorithms for the PS3 Part 4 (Offloading on 1 SPU)

11 July 2011

Sorting is a simple but important concept for implementing games. For example, sorting transparent objects before rendering or sorting objects by state attribute (e.g. texture) to improve batching. Because of the way most algorithms work (comparing and swapping pairs of items) sorting often takes precious time. With multi-core architectures like the PS3 it makes sense to parallelize this operation to maximize the use of these cores.

In the first three parts we have implemented the quicksort, merge and merge sort algorithms, as well as optimized them on the PS3's PPU. In this part we will start running these implementations on a SPU to improve performance even further.

Quick Links == Part 1 (Quicksort)

Part 2 (Merge)

Part 3 (Merge Sort)

Part 4 (Offloading on 1 SPU)

Part 5 (Parallel Sort on 2 SPUs)

Part 6 (Final Optimizations)

The entire series is also available as a single PDF document.

Offload The Sort Code On 1 SPU == The first step for sorting in parallel is to offload sorting an array chunk on a SPU using Offload. This can be done by using either an __offload or __blockingoffload block in the C++ code. The former spawns a thread on a SPU and runs the code inside of the block (see code below). The latter is similar except that the execution of the function containing the offload block is stopped until the block has finished executing. For now we will use the latter as we are using only one SPU and not doing any processing on the PPU.

template<typename T>
void sortOffloaded(T *data, size_t start, size_t end)
    size_t count = (end - start + 1);
    if(count < 2)
    // run the code in the following block on a SPU and wait until it finishes
    __blockingoffload(data, start, end)
        mySort(data, start, end);

Sorting 64K floats using quicksort on the SPU takes 39.7 ms while 20K takes 10.7 ms. As we have seen on the PPU it takes 16.8 ms to sort 64K floats while 20K takes 7.1 ms. Thus offloading the quicksort code as-is results in severe slowdowns: 0.4x for floats and 0.2x for integers.

This is mainly due to the fact that the array is stored in main (PPU) memory. Since the sorting code is running on the SPU, every read and write from and to the array has to go through Offload's software cache. This cache has to issue DMA requests to copy memory between the PPU and SPU. The easiest solution to this issue is to copy the array data to the SPU's local RAM, sort the local array and copy it back to the PPU memory when finished. The local array is allocated with 128-byte aligned memory which generally improves performance over non-aligned arrays.

One of the main issue with this approach is the 256 KB RAM size for storing code and data. This means that with big enough arrays the whole array does not fit in the RAM. For now we can only sort arrays that fit inside the SPU RAM, but we will solve this issue later.

template<typename T, typename S>
void sortOffloadedLocalMem(T *data, size_t start, size_t end)
    size_t count = (end - start + 1);
    if(count < 2)
    __blockingoffload(data, start, end, count)
        size_t size = count * sizeof(T);
        // enforce the maximum array size
        if(count <= MAX_SPU_ARRAY_COUNT)
            T __attribute__((aligned(128))) copy[MAX_SPU_ARRAY_COUNT];
            // copy the array from PPU to SPU
            __builtin_memcpy(copy, data + start, size);
            // sort the local array
            mySort(copy, 0, count – 1);
            // copy the array back from SPU to PPU
            __builtin_memcpy(data + start, copy, size);
            printf("Error: array size (%d) exceeds maximum size (%d)\n", 
                count, MAX_SPU_ARRAY_COUNT);

Results == With this approach it only takes 3.1 ms to sort 20K floats on the SPU using quicksort. This represents a 1.5x speed-up over execution on PPU, however sorting integers results in 0.6x slowdowns. Similarly merge sort sees 2-2.4x speed-up for floats and 1.9-2.1x speed-ups for integers. Quicksort is limited to 40K arrays in SPU's local memory while merge sort's limit is 20K.

The table below compares the performance of sorting on 1 SPU and the various implementations presented in the last post. The user-defined structure used ('face') contains one floating point (depth value) and three integer indices. This could be used to sort non-opaque polygons by depth.

4K faces

20K floats

20K ints

64K floats

64K ints

Iterative quicksort

2.6 ms

4.7 ms

2.1 ms

16.8 ms

7.1 ms

Iterative quicksort, on 1 SPU

7.5 ms

10.7 ms

10.7 ms

39.7 ms

39.4 ms

Iterative quicksort, on 1 SPU (local mem.)

2.6 ms

3.1 ms

3.1 ms

Iterative merge sort (vectorized merge)

3.7 ms

2.8 ms

2.4 ms

10.8 ms

9.6 ms

Iterative merge sort, on 1 SPU (local mem.)

3.6 ms

1.2 ms

1.2 ms

In the next part in the series we will see how to run our sort functions on multiple SPUs in parallel to build a parallel sort implementation and work around the current arrays size limitation.

Pierre-Andre Saulais's Avatar

Pierre-Andre Saulais

Principal Software Engineer, Compilers