
Automatic separate compilation - aka Call-Graph Duplication

Offload C++ is a simple multicore programming model that minimises code changes and shifts the offloading work to the compiler. When targeting heterogeneous processors such as the Cell processor, this programming model requires the Offload C++ compiler to perform call-graph duplication, which is transparent to the programmer. Code (and local data) inside an __offload block is automatically compiled separately for a different type of processor.

Why is call-graph duplication required?

Standard C++ does not provide memory semantics for different memory spaces on Non-Uniform Memory Architectures (NUMA) such as the Cell processor. Call-graph duplication in Offload C++ is necessary to enable compiling standard C++ code across different instruction types and memory spaces of those processors.

However, wrapping a section of code inside a block that runs on a different processor implicitly introduces additional memory semantics. Local (non-static) data items defined inside an __offload block are located in local memory (scratchpad memory), whereas data outside the block (outer data) is located in host memory (see code sample below). Since accessing the two types of memory from inside an __offload block requires different operations, the compiler has to be able to distinguish addresses of local data from addresses of outer data. The compiler automatically deduces the (address) types of pointers and references, and separately compiles (duplicates) the code for the appropriate address type.

At the same time, call-graph duplication makes it possible to offload a large amount of code to heterogeneous processors very quickly.

Simple programming model - a lot of work for the compiler

On homogeneous multi-core processors, the code inside __offload blocks (offloaded code) runs on the same type of processor as the code surrounding the __offload block (host code). On heterogeneous processors, however, the offloaded code may run on a different type of processor. For example, on the Cell processor, host code runs on the PPU whereas the offloaded code runs on the SPU(s). Since host code and offloaded code are in the same translation unit (single source programming model), the compiler has to be able to process host code and offloaded code separately for both types of processors.

Offloading full C++ in a way that is transparent to the programmer requires a lot of work from the compiler. Even simple language features (and combinations of them) such as overloading, pointer parameters (see example below), non-static member methods, out-of-class definitions of class members, non-POD arguments, virtual methods and function pointers pose challenges for an offloading compiler. C++ ABI integration issues such as name mangling also need to be covered.

Call-graph duplication for heterogeneous processors

For the Cell processor the Offload C++ compiler automatically compiles (duplicates) the code inside an __offload block for the SPU. For example, in the code below the function modifyglobal is called inside an __offload block.

int global;
void setglobal(int arg) {
    global = arg;
}
void modifyglobal(int arg) {
    setglobal(arg);
}
int main() {
    __offload { modifyglobal(1); }
}

The compiler automatically compiles a separate definition of that function to run on an SPU, in addition to the definition for the PPU. The function setglobal is also compiled separately because it is called from modifyglobal. Note that the function setglobal modifies the variable global, which is defined outside the __offload block and therefore resides in PPU memory. For the SPU version of that function, the compiler generates access code (DMA (Direct Memory Access) or software cache access) to perform this assignment from the SPU.
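The difference between the two copies of setglobal can be pictured with a small host-only simulation in standard C++. This is a sketch rather than actual Offload C++ output: the `_ppu` and `_spu` suffixes, `dma_put`, `dma_count` and `check_both_copies` are hypothetical names, and `memcpy` merely stands in for a real DMA or software cache access.

```cpp
#include <cassert>
#include <cstring>

// Stand-in for a variable that lives in host (PPU) memory.
int global = 0;

// Hypothetical counter so that the simulated transfers can be observed.
int dma_count = 0;

// Hypothetical stand-in for a write from SPU local store to host memory.
// Real hardware would issue a DMA transfer; memcpy only models its effect.
void dma_put(void* host_dst, const void* local_src, std::size_t size) {
    std::memcpy(host_dst, local_src, size);
    ++dma_count;
}

// Host copy: a plain store, exactly as written in the source.
void setglobal_ppu(int arg) { global = arg; }
void modifyglobal_ppu(int arg) { setglobal_ppu(arg); }

// Simulated SPU copy: same source-level meaning, but the store to the
// host variable goes through the transfer routine.
void setglobal_spu(int arg) { dma_put(&global, &arg, sizeof arg); }
void modifyglobal_spu(int arg) { setglobal_spu(arg); }

// Runs both copies and checks that each one updates `global`,
// and that only the simulated SPU copy performs a transfer.
void check_both_copies() {
    int before = dma_count;
    modifyglobal_ppu(5);
    assert(global == 5 && dma_count == before);
    modifyglobal_spu(9);
    assert(global == 9 && dma_count == before + 1);
}
```

In Offload C++ both copies come from the single source definition; the two hand-written versions here only make visible what the duplicated call graph has to do differently.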

Offloading code containing pointer parameters

Complex C/C++ programs contain a lot of pointer code. Offloading such code without further changes requires the compiler to be able to distinguish and separately process addresses on the PPU (__outer) and SPU (local). See for example the following test case:

#include <stdio.h>

int f(int* a, int* b) {
    return *a + *b;
}

int ga = 1;
int gb = 2;

int a,b,c,d;

int main () {

    __offload {
        int la = 3;
        int lb = 4;
        a = f(&la, &lb); //f(int*        ,int*)
        b = f(&ga, &gb); //f(__outer int*,__outer int*)
        c = f(&la, &gb); //f(int*        , __outer int*)
        d = f(&ga, &lb); //f(__outer int*,int*)
    }

    if ((a == 7) && (b == 3) && (c == 5) && (d == 5)) {
        // all four variants of f returned the expected sums
    } else {
        printf("FAIL: %d %d %d %d\n",a,b,c,d);
        return 1;
    }
    return 0;
}

Inside the __offload block, the function f is called with four different combinations of __outer and SPU-local addresses. The compiler generates (duplicates) an SPU definition of that function for each combination, matching the address types of the call arguments. Dereferencing a pointer inside each duplicated definition generates the appropriate access: DMA or software cache accesses for __outer pointers, and SPU-local reads and writes for local pointers. No change to the function source code is required.
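A rough analogy in standard C++ is a function template that is instantiated once per combination of pointer kinds. The sketch below is an emulation under stated assumptions, not how the Offload C++ compiler is implemented: `LocalPtr`, `OuterPtr`, `load`, the two counters and `run_four_variants` are hypothetical names, and unlike in Offload C++ the function has to be rewritten as a template to get the effect.

```cpp
#include <cassert>

// Hypothetical counters that make the two access paths observable.
int local_reads = 0;
int outer_reads = 0;

// Pointer into (simulated) SPU local store: dereferencing is a plain read.
struct LocalPtr {
    const int* p;
    int load() const { ++local_reads; return *p; }
};

// Pointer into (simulated) host memory: on real hardware the load would be
// a DMA or software cache access; here it is only counted.
struct OuterPtr {
    const int* p;
    int load() const { ++outer_reads; return *p; }
};

// One source definition; the C++ compiler instantiates a separate copy
// for each combination of pointer kinds used at a call site.
template <typename PA, typename PB>
int f(PA a, PB b) {
    return a.load() + b.load();
}

int ga = 1;
int gb = 2;

// Mirrors the four calls of the test case and returns a + b + c + d.
int run_four_variants() {
    int la = 3;
    int lb = 4;
    int a = f(LocalPtr{&la}, LocalPtr{&lb});  // f(int*, int*)
    int b = f(OuterPtr{&ga}, OuterPtr{&gb});  // f(__outer int*, __outer int*)
    int c = f(LocalPtr{&la}, OuterPtr{&gb});  // f(int*, __outer int*)
    int d = f(OuterPtr{&ga}, LocalPtr{&lb});  // f(__outer int*, int*)
    assert(a == 7 && b == 3 && c == 5 && d == 5);
    return a + b + c + d;
}
```

Each of the four calls selects a different instantiation of `f`, so every dereference already knows at compile time which access path to take, which is the property the duplicated SPU definitions provide.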

Implicit method parameters such as the this pointer are treated similarly to normal pointers: non-static member methods are also duplicated according to the address type of the this pointer.

struct myClass {
    int var;
    myClass(int init) {
        var = init;
    }
    void set(int val) {
        var = val;
    }
};

myClass m(1);//global PPU instance

void func() {
    __offload {
        myClass n(0); //local SPU instance - constructor is automatically offloaded
        m.set(8); //set is duplicated with an __outer this pointer
    }
}

In the above example a local variable n is declared inside the __offload block and initialized through a constructor call - this constructor is offloaded with a local this pointer. Then, method set is called on the global variable m - this method is offloaded with an __outer this pointer.