Codeplay are attending IWOCL 2018

Posted on May 1, 2018 by Melissa Richardson.

Codeplay are excited to again be attending and sponsoring the annual OpenCL conference IWOCL on the 14th – 16th May in Oxford, UK. Attending from Codeplay will be: Andrew Richards - CEO; Michael Wong - VP Research & Development; Rod Burns - Developer Relations Manager; Ruyman Reyes - Principal Software Engineer, Programming Models; Gordon Brown - Senior Software Engineer, SYCL; Alastair Murray - Principal Software Engineer, Compilers; Ewan Crawford - Staff Software Engineer, Debuggers; Mehdi Goli - Senior Software Engineer; Callum Fare - Software Engineer; Christopher Di Bella - Staff Software Engineer, ComputeCpp Runtime; Toby St Clere Smithe - Software Engineer; and Toomas Remmelg - Intern Software Engineer.

We are looking to meet new and old faces from the OpenCL and SYCL community, so if you are attending, come and say "hello". The team will be wearing their Codeplay t-shirts and hoodies, so they will be easy to spot.

If you are interested in organising a meeting with us at the event, please use our contact form or Tweet us at @codeplaysoft and let us know!

Whilst at IWOCL Codeplay will be presenting a variety of tutorials, workshops & presentations – here is just a small snapshot of things to look out for! The full agenda is now published here.

Monday 14th May – 9.30am – 5.30pm – DHPCC++ Conference (Distributed & Heterogeneous Programming in C/C++)

In response to the demand for heterogeneous programming models for C/C++, and the interest in driving these models in ISO C++, Distributed & Heterogeneous Programming in C/C++ includes all the programming models that have been designed to support heterogeneous programming in C and C++. Many models now exist, including SYCL, HPX, Kokkos, RAJA, C++ AMP, HCC, Boost.Compute, and CUDA, to name a few.

This conference aims to address the needs of both the HPC and the consumer/embedded communities, where a number of C++ parallel programming frameworks have been developed to address the needs of multi-threaded and distributed applications. The C++11/14/17 International Standards have introduced new tools for parallel programming to the language, and the ongoing standardization effort is developing additional features which will enable support for heterogeneous and distributed parallelism in ISO C++20/23.

DHPCC++ is an ideal place to discuss research in this domain, consolidate usage experience, and share new directions to support new hardware and memory models with the aim of passing that experience to ISO C and C++.

During the DHPCC++ Conference, several members of Codeplay will be giving presentations. The full list can be found here.

Tuesday 15th May – 10.30am – 11.00am - What’s New in SYCL 1.2.1 and How to Explore the Features - Michael Wong

On the 17th of November 2017, Khronos ratified the latest SYCL 1.2.1 specification. Although it is only a minor version increase, the work on the new specification represents two and a half years of effort from the SYCL group. The group spent time receiving feedback on the public specifications and working closely with C++ developers to devise the best way to approach the challenges of heterogeneous programming with real-world applications like TensorFlow.

SYCL 1.2.1 improves on the previous SYCL 1.2 specification by adding a number of “mini-features” in the form of extensions to the C++ API that simplify programming and expose more capabilities from the underlying OpenCL 1.2 interface, such as explicit copy functionality. It also makes various improvements to the interface, including better support for standard C++ allocators and for extension capabilities.

In this presentation, we introduce the new SYCL 1.2.1 specification, explain the different updates and changes to the APIs and illustrate how to take advantage of them by showing some examples. We will also present the current status of the implementation of SYCL 1.2.1 for ComputeCpp, an implementation of the standard, and how to use the new features and API changes. To conclude we will provide some hints on the future direction of the SYCL and C++ standards by examining various proposals from Codeplay that are currently work in progress.

Wednesday 16th May – 10.00am – 10.30am - TensorFlow Acceleration on ARM Hikey Board – Mehdi Goli, John Lawson, Uwe Dolinsky, Luke Iwanski & Andrew Richards

There is huge demand for targeting complex and large-scale machine learning applications, particularly those based on popular, actively maintained frameworks such as TensorFlow and Caffe, to a variety of platforms with accelerators ranging from high-end desktop GPUs to resource-constrained embedded or mobile GPUs, FPGAs, and DSPs. However, to deliver good performance, different platforms may require different algorithms or data structures, yet code should be easily portable and reused as much as possible across different devices. The open SYCL standard addresses this by providing parallel processing through a single-source programming model, enabling the same standard C++ code to be used on the CPU and accelerator. This allows high-level C++ abstractions and templates to be used to quickly configure device and host code to cover specific features of the platform. By targeting OpenCL, SYCL enables C++ applications such as TensorFlow to run efficiently on OpenCL devices without having to write OpenCL code.

In this presentation we propose an OpenCL-enabled back-end for TensorFlow via SYCL in order to enable developers to access a wider range of processor combinations.

SYCL is a royalty-free, cross-platform C++ abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL, while adding the ease-of-use and flexibility of modern C++14. This solution also benefits from ensuring the implementation is maintainable and compliant as the standards evolve. Dispatching device kernels from C++ applications is a widely used method for dealing with heterogeneous platforms in various programming models, such as CUDA, C++AMP, HCC, OpenACC, or OpenMP.

SYCL brings this capability to a wide range of accelerators supporting OpenCL. This lets developers create powerful, more performance-portable template libraries that can take advantage of a wide range of heterogeneous hardware and software platforms.

Moreover, porting TensorFlow to OpenCL would mean handwriting the kernels in OpenCL C and having separate code-bases, which would be complicated to maintain. By using SYCL, everything is single-source C++, and therefore it is possible to use a non-intrusive approach to add the SYCL back-end to TensorFlow.

There are three steps for implementing the SYCL back-end for TensorFlow:

In the first step we introduce the SYCL device by specializing TensorFlow’s device abstraction layer. The implemented SYCL device supports any OpenCL-enabled device.

In the next step we implement a SYCL back-end for all the linear operations in the Eigen Tensor module. Each TensorFlow operator maps to an Eigen expression. As Eigen has the same expression interface regardless of the selected device, in most cases there is no need to specialize TensorFlow’s operators for SYCL.

In the last step we register the existing TensorFlow operators for the SYCL device.
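The three steps above can be illustrated with a small, self-contained sketch. It is written in plain Python purely for illustration and does not use TensorFlow, Eigen or SYCL; the names `HostDevice`, `SyclLikeDevice`, `Expr`, `REGISTRY` and `run` are all hypothetical. It only shows the shape of the idea: a device abstraction is specialized, expressions are evaluated through a device-independent interface, and existing operators are then registered for the new device without being rewritten.

```python
from typing import Callable, Dict, List, Tuple


class HostDevice:
    """Runs element-wise work in a plain Python loop."""
    name = "CPU"

    def elementwise(self, fn: Callable[..., float], *cols: List[float]) -> List[float]:
        return [fn(*vals) for vals in zip(*cols)]


class SyclLikeDevice(HostDevice):
    """Hypothetical stand-in for an accelerator device (step 1).

    A real back-end would enqueue a kernel here; this sketch only counts launches.
    """
    name = "SYCL"

    def __init__(self) -> None:
        self.launches = 0

    def elementwise(self, fn, *cols):
        self.launches += 1
        return super().elementwise(fn, *cols)


class Expr:
    """Device-agnostic expression node (step 2): the device is chosen at evaluation time."""

    def __init__(self, fn, *args):
        self.fn, self.args = fn, args

    def evaluate(self, device) -> List[float]:
        # Evaluate nested expressions first, then run this node on the same device.
        cols = [a.evaluate(device) if isinstance(a, Expr) else a for a in self.args]
        return device.elementwise(self.fn, *cols)


# Operators are written once against the expression interface...
def add_op(x, y):
    return Expr(lambda a, b: a + b, x, y)


def relu_op(x):
    return Expr(lambda a: max(a, 0.0), x)


# ...and then registered per device (step 3) without being rewritten.
REGISTRY: Dict[Tuple[str, str], Callable[..., Expr]] = {}
for device_name in ("CPU", "SYCL"):
    REGISTRY[("Add", device_name)] = add_op
    REGISTRY[("Relu", device_name)] = relu_op


def run(op_name: str, device, *inputs) -> List[float]:
    """Look up the operator registered for this device and evaluate it there."""
    return REGISTRY[(op_name, device.name)](*inputs).evaluate(device)
```

Calling `run("Add", SyclLikeDevice(), [1.0, 2.0], [3.0, 4.0])` goes through exactly the same operator code as the CPU path; only the device object that executes the element-wise work differs.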

The evaluation of the proposed approach was carried out on an ARM Hikey board with an ARM A53 CPU, an A73 CPU, and a Mali GPU.

Our results show significant improvements over those obtained on the ARM A53 CPU, the A73 CPU, and the ACL library for the Mali GPU, especially for large-scale applications. We are maintaining the TensorFlow SYCL back-end and actively optimizing TensorFlow’s operators for different DNN models across different platforms.

Wednesday 16th May – 2.30pm – 3.00pm – Building a Brain with SYCL & Modern C++ - Toby St. Clere Smithe

State-of-the-art machine learning systems typically depend on energetically costly gradient-descent learning over a curated task-specific data set. Despite their successes, these methods are not well suited to building fully autonomous systems such as may employ energy-efficient accelerators targeted by OpenCL. By contrast, the brain uses low-energy local learning rules to discover the causal structure of an environment, forming semantically rich representations without supervision, and therefore exhibiting the required combination of efficiency and flexibility. To investigate these properties, a paradigm shift to dynamic “spike-based” computation is required. Historically, investigating spiking neural models has been a task for specialists, with software that is tailored to specific scientific projects, or that trades flexibility against performance. Here, we present neurosycl, a high-performance, portable spiking network simulator based on SYCL, with a modern and extensible C++ API. Our aim is to provide the necessary components for non-specialists to build a simulated brain, and to run the constructed models as close to real-time as possible.

This bipartite aim leads to two competing considerations – a simple interface, and portable performance – which are reconciled using SYCL’s single-source programming model. We describe two principal algorithmic challenges that illustrate the different hardware demands of spiking neural networks relative to deep learning networks, and how neurosycl solves them for GPU-like parallel processors via SYCL. Firstly, although the brain is akin to a parallel processor whose cores are neurons, the connections between neurons may have differing temporal delays, which results in a message-passing problem if the neurons are simulated asynchronously. Secondly, because these messages (‘spikes’) are generated chaotically, then transmitted to arbitrary target neurons with arbitrary transmission delays, a naive implementation even of a synchronous model quickly runs into a highly suboptimal memory access regime.
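To make the delay problem concrete, here is a minimal sketch of one common way to handle per-connection delays in a synchronous, time-stepped simulation: a ring buffer with one row of pending input per future time step. This is not neurosycl's implementation, which the abstract does not describe; the `DelayBuffer` and `simulate` names, the connection tuple format, and the toy neuron model (no leak, reset on spike) are assumptions made for illustration.

```python
from typing import List, Set, Tuple

# (source, target, delay in time steps, weight): a hypothetical toy format
Connection = Tuple[int, int, int, float]


class DelayBuffer:
    """Ring buffer of pending input, one row per future time step."""

    def __init__(self, n_neurons: int, max_delay: int) -> None:
        self.rows = [[0.0] * n_neurons for _ in range(max_delay + 1)]
        self.step = 0

    def schedule(self, target: int, delay: int, weight: float) -> None:
        # Delays must be in 1..max_delay so a spike never lands in the current row.
        assert 1 <= delay < len(self.rows)
        self.rows[(self.step + delay) % len(self.rows)][target] += weight

    def take_current(self) -> List[float]:
        # Hand out the input due at this step and clear the row for reuse.
        row = self.step % len(self.rows)
        due = self.rows[row]
        self.rows[row] = [0.0] * len(due)
        return due

    def advance(self) -> None:
        self.step += 1


def simulate(connections: List[Connection], n_neurons: int, n_steps: int,
             threshold: float, initial_spikes: Set[int]) -> List[List[int]]:
    """Return the neurons that spike at each step."""
    buf = DelayBuffer(n_neurons, max(c[2] for c in connections))
    potential = [0.0] * n_neurons
    history = []
    for _ in range(n_steps):
        incoming = buf.take_current()
        spiking = []
        for n in range(n_neurons):
            potential[n] += incoming[n]
            forced = buf.step == 0 and n in initial_spikes
            if forced or potential[n] >= threshold:
                spiking.append(n)
                potential[n] = 0.0
        # Each spike is delivered to its targets after the connection's own delay.
        for source, target, delay, weight in connections:
            if source in spiking:
                buf.schedule(target, delay, weight)
        buf.advance()
        history.append(spiking)
    return history
```

Note that the writes in `schedule` land on whichever targets and rows the spiking pattern dictates, which is the scattered memory access problem the abstract's second challenge refers to.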

neurosycl’s design separates the specification of a model architecture from its simulation, so that once a model has been instantiated, its basic structure is fixed. This simplification enables us to infer the memory access pattern, and thus re-order the connection indices so that adjacent kernels access nearby memory locations. The simplification is also reflected in the API design: users can construct complex connection graphs between arbitrary neuron groups using a simple declarative interface, but runtime interactions with the model, for monitoring or I/O, are mediated by a set of simulated electrodes, combined with hooks into the simulation loop. This design mirrors that of neuroscientific experiments, and permits the user to embed the simulated brain into a virtual environment by integrating with other technologies, exposing implementation details only when necessary to allow this. We describe our API, illustrated by a number of “brain-building” examples, showing how the components compose and map via SYCL onto the hardware. We present performance comparisons across hardware platforms and alternative simulators, demonstrating portability for various network configurations and standard neuron models.
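The re-ordering idea can be sketched in a few lines. Because the connection structure is fixed once the model is instantiated, the connection indices can be sorted once, up front, so that consecutive work items touch neighbouring targets. The code below is an illustrative assumption rather than neurosycl's algorithm: `reorder_by_target` and `access_stride` are hypothetical helpers, and the stride sum is only a crude proxy for memory locality.

```python
from typing import List, Sequence, Tuple

# (source neuron, target neuron): a hypothetical toy connection format
Edge = Tuple[int, int]


def reorder_by_target(connections: Sequence[Edge]) -> List[int]:
    """Return connection indices sorted by target neuron (computed once per model)."""
    return sorted(range(len(connections)), key=lambda i: connections[i][1])


def access_stride(connections: Sequence[Edge], order: Sequence[int]) -> int:
    """Sum of distances between the targets touched by consecutive work items."""
    targets = [connections[i][1] for i in order]
    return sum(abs(b - a) for a, b in zip(targets, targets[1:]))
```

For the four connections `[(0, 9), (1, 0), (2, 8), (3, 1)]`, processing them in their original order gives a stride of 24, while the re-ordered indices `[1, 3, 2, 0]` give a stride of 9.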

Wednesday 16th May – 3.00pm – 3.30pm - Enabling Profiling for SYCL Applications - Callum Fare

Since GPGPU devices have become mainstream, more and more software is being written to target many-core devices. Developers are now required to think in parallel in order to run applications with maximum performance; however, the ability to target a wide range of devices is vital. GPUs can range from very powerful discrete cards to extremely low-power embedded chips and, to be efficient, developers must be able to reuse their code in different scenarios. OpenCL™ addresses this issue by providing a C-like programming language that can target different architectures; however, it requires a deep knowledge of the underlying hardware to be used efficiently. SYCL™ provides a C++ abstraction layer that simplifies parallel development, allowing developers to leverage the power of OpenCL™ while reducing the amount of code required.

Parallel programming offers a set of new challenges for developers, and being able to understand the behavior of their applications on target hardware is important in order to ensure they run with the best performance on all hardware platforms.

The LPGPU2 project has been working to add SYCL™ profiling capabilities to the now open-source tool suite CodeXL. Originally created to perform OpenCL™ profiling on AMD hardware, CodeXL is being extended by the project to perform SYCL™ profiling on devices including all of the desktop CPUs and GPUs supported by ComputeCpp, as well as mobile low-power devices such as ARM GPUs running under the Android operating system.

In this talk, we will explain how the LPGPU2 CodeXL codebase was modified and extended to allow developers to understand and identify bottlenecks in SYCL™ applications, and how we extended it to perform accurate power consumption measurements and to analyse and provide feedback on how the application can be improved in the context of Android development.
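As a rough illustration of what identifying a bottleneck from profiling data can involve, the sketch below aggregates per-kernel timings from event timestamps. OpenCL profiling events report submit, start and end times in nanoseconds, and the example assumes records of that shape. It is not part of CodeXL or LPGPU2: the `KernelEvent` record and the `summarise` function are hypothetical, and the timestamps in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class KernelEvent(NamedTuple):
    """One kernel launch, with timestamps in nanoseconds (hypothetical record)."""
    name: str
    submit_ns: int
    start_ns: int
    end_ns: int


def summarise(events: List[KernelEvent]) -> List[Tuple[str, int, int]]:
    """Return (kernel name, total execution ns, total wait ns), slowest first."""
    exec_ns: Dict[str, int] = defaultdict(int)
    wait_ns: Dict[str, int] = defaultdict(int)
    for ev in events:
        exec_ns[ev.name] += ev.end_ns - ev.start_ns    # time spent running
        wait_ns[ev.name] += ev.start_ns - ev.submit_ns  # time spent waiting to start
    rows = [(name, exec_ns[name], wait_ns[name]) for name in exec_ns]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Given two launches of a kernel named `matmul` and one of `relu`, the first row of the summary is the kernel with the largest total execution time, which is usually where optimisation effort should start.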

We’ll reveal the secrets of complex software stacks through the new profiling capabilities. The tool can be used at all levels and with all sorts of applications, from complex simulations to machine learning.

Interested in IWOCL? Here is a bit more information on the conference itself: The International Workshop on OpenCL (IWOCL) is an annual meeting of OpenCL users, researchers, developers and suppliers to share OpenCL best practice, and to promote the evolution and advancement of the OpenCL standard. The conference is open to anyone who is interested in contributing to, and participating in, the OpenCL community. IWOCL is the premier forum for the presentation and discussion of new designs, trends, algorithms, programming models, software, tools and ideas for OpenCL. Additionally, IWOCL provides a formal channel for community feedback to OpenCL promoters and contributors. More information on the conference can be found here: