Implementing OpenCL Support for Eigen using SYCL and ComputeCpp
22 May 2017
The Eigen library is a C++ template library for linear algebra, providing matrix, vector and tensor operations together with the algorithms built on top of them.
The library benefits from acceleration using heterogeneous hardware, since matrix and vector operations involve many independent calculations and are thus well suited to parallelization on GPUs. Until now, GPU acceleration for Eigen's tensor operations has only been available through the CUDA back-end, limiting it to NVIDIA hardware.
SYCL is a royalty-free, cross-platform C++ abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL, while adding the ease of use and flexibility of modern C++11/14. For example, SYCL enables single-source development, where C++ template functions can be compiled for both host and device, so developers can construct complex algorithms that use OpenCL acceleration and then re-use them throughout their source code on different types of data.
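To give a flavour of the single-source model, here is a minimal, self-contained SYCL vector addition (our own illustrative example, not code from the Eigen back-end); the kernel is an ordinary C++ lambda that the SYCL device compiler extracts and compiles for the OpenCL device, while the host compiler sees the same source:

```cpp
#include <CL/sycl.hpp>
#include <vector>

int main() {
  const size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
  {
    cl::sycl::queue q;  // default device selection
    cl::sycl::buffer<float, 1> bufA(a.data(), cl::sycl::range<1>(n));
    cl::sycl::buffer<float, 1> bufB(b.data(), cl::sycl::range<1>(n));
    cl::sycl::buffer<float, 1> bufC(c.data(), cl::sycl::range<1>(n));

    q.submit([&](cl::sycl::handler& cgh) {
      auto A = bufA.get_access<cl::sycl::access::mode::read>(cgh);
      auto B = bufB.get_access<cl::sycl::access::mode::read>(cgh);
      auto C = bufC.get_access<cl::sycl::access::mode::write>(cgh);
      // The kernel body is plain C++, written inline in the same file.
      cgh.parallel_for<class vadd>(
          cl::sycl::range<1>(n),
          [=](cl::sycl::id<1> i) { C[i] = A[i] + B[i]; });
    });
  }  // buffers go out of scope and copy their results back to the host
  return 0;
}
```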
In this post we will talk about how we implemented the SYCL back-end for Eigen's tensor operations, and the main design decisions and challenges behind it.
Expression Trees in Eigen
Expression trees are used to represent operations in Eigen; the diagram below shows an example. Terminal (blue) nodes represent data items, whereas non-terminal (green) nodes are the operations that are performed on them.
Figure: The calculation is represented as an expression tree that combines terminal data nodes (blue) with the non-terminal operation nodes (green) applied to them
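As a rough sketch of how such a tree arises in user code (the expression below is illustrative, not the one in the diagram), Eigen's Tensor module builds the tree lazily from an ordinary C++ expression and only evaluates it on assignment:

```cpp
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  Eigen::Tensor<float, 2> a(64, 64), b(64, 64), c(64, 64);
  a.setRandom();
  b.setRandom();

  // The right-hand side constructs an expression tree: the terminal
  // nodes are the tensors a and b, and the non-terminal nodes are the
  // coefficient-wise operators (+, *, sqrt). Nothing is computed
  // until the tree is assigned to c.
  c = (a + b) * a.sqrt();
  return 0;
}
```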
Kernel Fusion and Eigen
Figure: Kernel performance using individual kernels versus kernel fusion
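Because the whole expression tree is visible when the assignment is evaluated, the tree can be executed as a single fused kernel rather than one kernel per operation node, avoiding intermediate buffers and extra kernel launches; the figure above compares the two approaches. As a rough sketch of what fusion avoids (our own example, using `.eval()` to force an intermediate):

```cpp
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  Eigen::Tensor<float, 1> a(1 << 20), b(1 << 20), c(1 << 20);
  a.setRandom();
  b.setRandom();

  // Fused: the whole expression is evaluated in a single pass over
  // the data (a single kernel on a device back-end), no temporaries.
  c = a * 0.5f + b.square();

  // Unfused, for comparison: eval() materialises an intermediate
  // tensor, costing an extra pass and extra memory traffic.
  c = (a * 0.5f).eval() + b.square();
  return 0;
}
```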
Adding the SYCL Backend to Eigen
When we began designing the integration of SYCL with Eigen, we wanted to ensure that the back-end implementation provided compatibility for existing frameworks that use the tensor operations. By following the existing device specification we have avoided changing the interface of the existing Eigen code, meaning that all applications (including TensorFlow applications) will continue to work seamlessly with the SYCL Eigen implementation.
We also wanted to minimize the runtime overhead of applications running with our implementation. By pushing as much of the expression specification as possible to compile time, we have been able to do this, resulting in a significant speed-up.
Challenges
The main challenge is a non-intrusive integration of SYCL into the standard C++-based Eigen code base, without requiring major changes to existing code and interfaces. Targeting OpenCL 1.2 through SYCL enables Eigen to run on many low-power and mobile devices that don't support features such as direct access to host memory from device code. This, however, requires the application to manage the data accessed by the kernels, so integrating this with the Eigen framework in a non-disruptive way poses interesting technical challenges.
OpenCL 1.2 vs CUDA Memory Allocation
In OpenCL 1.2, memory allocations such as buffers are represented by opaque cl_mem handles on the host side. Because these handles are opaque objects, address arithmetic cannot be performed on them. Inside an OpenCL 1.2 kernel, memory allocations are instead represented by pointers annotated with the address space of the allocation or variable.
However, the representation in CUDA is different. In CUDA, allocating a device buffer through cudaMalloc returns a void * pointer which is indistinguishable from a standard CPU pointer. This pointer cannot be de-referenced by the host processor, but the host can perform normal address arithmetic on it and pass it directly to device code, where it is used like any other pointer.
In short, while CUDA uses the same types to represent memory allocations on both the CPU and GPU, and can perform address arithmetic on them in both places, OpenCL 1.2 uses different representations on the host and the device, and only the device-side pointers support address arithmetic.
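As a short illustration of the host-side difference (illustrative code, not from the Eigen back-end; the cl_context is assumed to have been created elsewhere):

```cpp
#include <CL/cl.h>
#include <cstddef>

// OpenCL 1.2: the allocation is an opaque cl_mem handle, not an
// address. The host cannot dereference it or offset into it with
// pointer arithmetic. By contrast, cudaMalloc hands back a void* on
// which address arithmetic works, even though the host still cannot
// dereference it.
cl_mem allocate_device_buffer(cl_context context, std::size_t num_floats) {
  cl_int err = CL_SUCCESS;
  cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                              num_floats * sizeof(float), nullptr, &err);
  return buf;
}
```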
In Eigen, the CUDA backend can use the result of cudaMalloc directly as the data container in terminal nodes. This representation is automatically valid on the device, with no additional transformations required. However, for OpenCL 1.2 we need two different representations of the expression tree (one for the host and one for the device), due to the two different representations of memory. Therefore, it is necessary to convert the host representation of an expression to a valid device representation. In order to reduce the run-time overhead, we apply the expression conversion as a compile-time transformation.
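The following is a deliberately simplified, hypothetical sketch of that compile-time transformation; the type names (HostLeaf, DeviceLeaf, BinaryNode, ToDevice) are ours and do not match the actual Eigen SYCL back-end, but they show the idea of rewriting only the terminal nodes while preserving the shape of the tree:

```cpp
#include <CL/sycl.hpp>
#include <type_traits>

// Hypothetical tag for a coefficient-wise addition node.
struct AddOp;

// Host-side terminal node: stores the proxy host pointer handed out
// by the allocator described below.
template <typename T>
struct HostLeaf { T* proxy_ptr; };

// Device-side terminal node: stores a SYCL accessor instead.
template <typename T>
struct DeviceLeaf {
  cl::sycl::accessor<T, 1, cl::sycl::access::mode::read_write,
                     cl::sycl::access::target::global_buffer> acc;
};

// Non-terminal node: an operation over two sub-expressions. Its shape
// is identical on host and device; only the leaves differ.
template <typename Op, typename LHS, typename RHS>
struct BinaryNode { LHS lhs; RHS rhs; };

// Compile-time transformation of the expression type: terminal nodes
// are rewritten, interior nodes are rebuilt recursively.
template <typename Expr> struct ToDevice;

template <typename T>
struct ToDevice<HostLeaf<T>> { using type = DeviceLeaf<T>; };

template <typename Op, typename LHS, typename RHS>
struct ToDevice<BinaryNode<Op, LHS, RHS>> {
  using type = BinaryNode<Op, typename ToDevice<LHS>::type,
                          typename ToDevice<RHS>::type>;
};

// The conversion is resolved entirely by the compiler, with no
// run-time cost.
using HostExpr = BinaryNode<AddOp, HostLeaf<float>, HostLeaf<float>>;
using DeviceExpr = ToDevice<HostExpr>::type;
static_assert(
    std::is_same<DeviceExpr,
                 BinaryNode<AddOp, DeviceLeaf<float>,
                            DeviceLeaf<float>>>::value,
    "leaves are rewritten while the tree shape is preserved");
```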
Eigen provides a pluggable interface for the allocation and deallocation of memory. The allocation method accepts the required allocation size as a parameter and returns a void * pointer. In order to use the same interface here, we map each OpenCL memory object to a proxy host address. From the user's point of view, the returned pointer can be used in an identical manner to CUDA's host-side pointer. When we convert the host expression representation to the device-side representation, these proxy addresses can be replaced with the corresponding OpenCL memory objects. Similarly, for deallocating memory, we locate the corresponding OpenCL memory object based on its proxy address and release the memory.
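A hypothetical illustration of the proxy-address scheme (the class and member names here are ours, and the real back-end differs in detail):

```cpp
#include <CL/sycl.hpp>
#include <cstdint>
#include <map>

// Sketch of a proxy-pointer allocator: each SYCL buffer is registered
// against a fake, never-dereferenced host address, so the existing
// void*-based allocation interface keeps working.
class PointerMapper {
 public:
  // Allocate a SYCL buffer and hand back a proxy address identifying it.
  void* allocate(std::size_t bytes) {
    void* proxy = reinterpret_cast<void*>(next_);
    buffers_.emplace(next_, cl::sycl::buffer<uint8_t, 1>(
                                cl::sycl::range<1>(bytes)));
    next_ += bytes;
    return proxy;
  }

  // Recover the buffer backing a proxy address; used when building the
  // device-side expression representation.
  cl::sycl::buffer<uint8_t, 1>& get_buffer(const void* proxy) {
    return buffers_.at(reinterpret_cast<std::uintptr_t>(proxy));
  }

  // Release the memory object associated with a proxy address.
  void deallocate(void* proxy) {
    buffers_.erase(reinterpret_cast<std::uintptr_t>(proxy));
  }

 private:
  std::uintptr_t next_ = 0x1000;  // arbitrary non-null starting address
  std::map<std::uintptr_t, cl::sycl::buffer<uint8_t, 1>> buffers_;
};
```

A full implementation would also have to resolve addresses that point inside an allocation (base plus offset), which is why consecutive proxy addresses in this sketch are spaced by the allocation size.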
Initial Performance Results
The evaluation of the proposed approach was carried out on a single machine with an Intel® Core™ i7-6700K CPU, running at 4.00 GHz, with 32 GB of RAM. The GPU used was an AMD R9 Nano.
The following is a logarithmic-scale performance comparison of the Eigen operations registered for SYCL with those registered for the CPU, based on a 4k input data size.
Figure: Performance of Eigen operations using SYCL on a GPU and on a quad-core CPU
We can see that there are cases such as algebraic functions, convolutions, and contractions where we have achieved up to an order-of-magnitude speed-up with SYCL.
The current version of the Eigen codebase contains the initial release of the SYCL backend, and we are continuing to optimize the performance of many of the operations.
The full paper is available to download from the ACM Digital Library.
