Implicit call-graph duplication could reduce explicit code duplication in C++ AMP

15 November 2011

Since my previous blog post about C++ AMP and how it would benefit from the call-graph duplication implemented in Offload C++, Microsoft have released another C++ AMP demo that highlights an important pattern in current multicore software: separate implementations of the same functionality for different processors. The following two functions (one for the CPU and one for the GPU), taken from that Microsoft blog post, implement virtually the same functionality, except that the GPU implementation is annotated with restrict(direct3d) for overloading and calls a GPU-specific, high-performance equivalent of exp.

//----------------------------------------------------------------------------
// GPU implementation - Call value at period t : V(t) = S(t) - X
//----------------------------------------------------------------------------
float expiry_call_value(float s, float x, float vdt, int t) restrict(direct3d)
{
	float d = s * direct3d::fast_exp(vdt * (2.0f * t - NUM_STEPS)) - x;
	return (d > 0) ? d : 0;
}

//----------------------------------------------------------------------------
// CPU implementation - Call value at period t : V(t) = S(t) - X
//----------------------------------------------------------------------------
float expiry_call_value(float s, float x, float vdt, int t)
{
	float d = s * ::exp(vdt * (2.0f * t - NUM_STEPS)) - x;
	return (d > 0) ? d : 0.0f;
}

To make the similarity between the GPU and CPU functions even more obvious, the GPU version could be refactored to have the same body as the CPU version by overloading ::exp with a GPU version:

static inline float exp(float arg) restrict(direct3d)  {
      return direct3d::fast_exp(arg); //just calls the fast GPU version
}

Now the GPU function is exactly the same as the CPU function except for the restrict(direct3d) directive:

float expiry_call_value(float s, float x, float vdt, int t)  restrict(direct3d)
{
	// ::exp resolves to the GPU version because the call is inside a "restrict(direct3d)" context.
	float d = s * ::exp(vdt * (2.0f * t - NUM_STEPS)) - x;
	return (d > 0) ? d : 0.0f;
}

The problem is that having to maintain separate libraries with the same functionality for CPU and GPU leads to the usual, well-known problems of code duplication. Call-graph duplication could address this problem in C++ AMP by not requiring library functions to be annotated for accelerators (such as the GPU) with "restrict(direct3d)". When a normal function is called from inside a "restrict(direct3d)" context, the compiler generates a "restrict(direct3d)"-annotated copy of that function and of all functions it calls directly and indirectly, hence the name call-graph duplication. This allows the CPU and GPU to share the same source base, with performance tweaked by adding fast GPU overloads where required.

Granted, the above code is very simple, and in this example the duplication could be achieved by using macros. However, macros limit the debuggability of the code they generate, and the power of call-graph duplication really shows on large, very complex call graphs involving pointer arguments to different memory types (see issues with pointer types), where the duplication is driven behind the scenes by the (pointer) types of the arguments. Macros cannot do this because they are not part of the C++ type system.

Call-graph duplication is part of the Offload C++ specification implemented by Codeplay's Offload compiler, which was used to boost the performance of AI and visual effects in the AAA PS3 title NASCAR The Game.
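
The effect of call-graph duplication can be roughly approximated in standard C++ with templates, which makes the idea easier to try out without a C++ AMP or Offload C++ compiler. The sketch below is only an analogy, not how either compiler works: the tag types cpu_target and gpu_target, the helper target_exp, the counter gpu_fast_exp_calls and the value of NUM_STEPS are all assumptions made for illustration, and the "fast" GPU overload simply calls std::exp so that the code runs on a CPU. Unlike real call-graph duplication, the target has to be threaded through the call graph explicitly here, which is exactly the bookkeeping the compiler would otherwise do behind the scenes.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical tags standing in for the CPU and restrict(direct3d) contexts.
struct cpu_target {};
struct gpu_target {};

constexpr int NUM_STEPS = 4;   // assumed value, chosen only for this sketch
int gpu_fast_exp_calls = 0;    // records how often the GPU overload ran

// Generic routine: used for any target without a specialised overload.
template <typename Target>
float target_exp(Target, float arg) { return std::exp(arg); }

// Hand-written overload for the GPU tag. A real GPU build would call a fast
// intrinsic here; this stand-in uses std::exp so the sketch runs anywhere.
inline float target_exp(gpu_target, float arg)
{
    ++gpu_fast_exp_calls;
    return std::exp(arg);
}

// Single definition shared by all targets. Each instantiation is a separate
// copy of the function, and each copy picks the exp that matches its target.
template <typename Target>
float expiry_call_value(Target target, float s, float x, float vdt, int t)
{
    float d = s * target_exp(target, vdt * (2.0f * t - NUM_STEPS)) - x;
    return (d > 0) ? d : 0.0f;
}

// Small self-check: both copies come from the one definition above, and only
// the GPU copy goes through the hand-written overload.
inline void check_shared_definition()
{
    const int calls_before = gpu_fast_exp_calls;
    float cpu = expiry_call_value(cpu_target{}, 100.0f, 90.0f, 0.01f, 3);
    float gpu = expiry_call_value(gpu_target{}, 100.0f, 90.0f, 0.01f, 3);
    assert(std::fabs(cpu - gpu) < 1e-4f);
    assert(gpu_fast_exp_calls == calls_before + 1);
}
```

Calling check_shared_definition() instantiates expiry_call_value once per tag. The library function is written once, and the only target-specific code is the optional fast overload, which mirrors the "shared source base plus fast GPU overloads" arrangement described above.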

Uwe Dolinsky

Chief Scientist