Using the SYCL Kernel Fusion Extension - A Hands-On Introduction

21 June 2023

In a previous blog post, we introduced the SYCL extension for user-driven kernel fusion developed by Codeplay, and how it can improve SYCL application performance.

In this post, we will take a more hands-on approach to show how to set up DPC++ with kernel fusion support and compile and run a SYCL application. Using that knowledge will allow you to apply the kernel fusion extension to your own SYCL applications.

We will also get a glimpse into SYCL application performance analysis with Intel VTune, investigating the reason for performance improvement through kernel fusion.

So, let's get started!

Prerequisites

Throughout this tutorial, we will be assuming a Linux system with at least one SYCL device compatible with the kernel fusion extension. Concretely, this would need to be a device that is compatible with a SPIR-V compatible SYCL backend. This would for example be the OpenCL backend with an Intel CPU or GPU (also integrated), or the Intel LevelZero backend with an Intel GPU (also integrated).

The installation of the necessary drivers and device backends is out-of-scope for this blog post, consult the corresponding manuals for OpenCL and LevelZero for installation instructions.

For the section on VTune performance analysis, we will assume an OpenCL CPU, suited for the selection of metrics that we will investigate. Intel VTune supports performance analysis across a wide range of different platforms and devices, for example, SYCL application performance analysis via a graphical user interface or via the command-line.

The installation of VTune itself is out-of-scope for this blog post, see the installation manual for instructions.

As an example, the system used throughout this blog post uses an Intel i7 6700K CPU with the OpenCL backend, Ubuntu 20.04 LTS and Intel VTune 2023.1.0 preview.

Setup of DPC++ with kernel fusion support

If you were hoping for a long setup procedure now (who isn't ? 😉), I'll have to disappoint you: Current DPC++ daily releases already support kernel fusion by default.

So all we need to do is download a daily release, unpack it and setup up some environment variables.

First, go to some directory of your choosing (you should have write permissions) and then execute the following command to download a daily release:

wget https://github.com/intel/llvm/releases/download/sycl-nightly%2F20230408/dpcpp-compiler.tar.gz

Note that newer daily releases found here might also work.

In the next step, we'll unpack the DPC++ release and eventually remove the TAR file we downloaded:

tar xfz dpcpp-compiler.tar.gz
rm dpcpp-compiler.tar.gz
The last step of the setup is to setup some environment variables:
export PATH=$(pwd)/dpcpp_compiler/bin:$PATH
export LD_LIBRARY_PATH=$(pwd)/dpcpp_compiler/lib:$LD_LIBRARY_PATH

Note that this step needs to be repeated each time you re-open the terminal and assumes your current working directory matches the directory you choose for setup.

To verify that the setup worked, use the following commands:

which clang++
which sycl-ls

In both cases, the output should point to an executable in the dpcpp_compiler/bin subdirectory of the setup directory you chose.

Also, the following command should show at least one device for either the OpenCL or LevelZero backend:

sycl-ls

An example output would be:

[opencl:cpu:0] Intel(R) OpenCL, Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz OpenCL 3.0

In case this command does not print a device, see the OpenCL or LevelZero installation manual for instructions.

Example application

For demonstration purposes, we will be using the example application from the original blog-post.

You can download the complete source for the example application here.

After completing the setup above, we can compile the example application, still without kernel fusion:

clang++ -fsycl no-fusion.cpp -o no-fusion

After that, we can execute the application:

./no-fusion

The output should be similar to this, the exact runtime is of course dependent on your hardware configuration*:

Elapsed time in microseconds: 298080
Elapsed time in microseconds: 235596
Elapsed time in microseconds: 235771
Elapsed time in microseconds: 235865
Elapsed time in microseconds: 235926
Elapsed time in microseconds: 235780
Elapsed time in microseconds: 235566
Elapsed time in microseconds: 235890
Elapsed time in microseconds: 235718
Elapsed time in microseconds: 235786
The application performs ten timed iterations of the main workload. Due to the necessary translation from SPIR-V to device binary format and the caching of that translation, explained in the first blog-post, the first iteration takes a little longer than the remaining ones*

Fusion

In the next step, we can enable kernel fusion for the application, but still without dataflow internalization. To that end, a few modifications to the source-code are necessary; you can download the resulting source file here.

We can compare the previous source file with the new one:

git diff --no-index no-fusion.cpp fusion.cpp

From the output, we can see that only minimal code changes are necesary to enable kernel fusion in the application:

diff --git a/no-fusion.cpp b/fusion.cpp
index 08595a0..4cd0702 100644
--- a/no-fusion.cpp
+++ b/fusion.cpp
@@ -28,7 +28,7 @@ int main() {
   auto in4 = get_random_data();
   std::vector<float> out(dataSize, -1.0);
-  queue q{};
+  queue q{ext::codeplay::experimental::property::queue::enable_fusion{}};
   {
     buffer<float> bIn1{in1.data(), range{dataSize}};
@@ -40,10 +40,14 @@ int main() {
     buffer<float> bTmp2{range{dataSize}};
     buffer<float> bTmp3{range{dataSize}};
+    ext::codeplay::experimental::fusion_wrapper fw{q};
+
     for (size_t i = 0; i < 10; ++i) {
       auto start = std::chrono::high_resolution_clock::now();
+      fw.start_fusion();
+
       // tmp1 = in1 * in2
       q.submit([&](handler &cgh) {
         auto accIn1 = bIn1.get_access(cgh);
@@ -80,6 +84,8 @@ int main() {
             dataSize, [=](id<1> i) { accOut[i] = accTmp1[i] - accTmp3[i]; });
       });
+      fw.complete_fusion();
+
       q.wait();
       auto stop = std::chrono::high_resolution_clock::now();
The new source file can be compiled with this command:
clang++ -fsycl fusion.cpp -o fusion

As you can see, there are no additional compilation flags needed for kernel fusion.

There are also no additional flags needed for kernel fusion when executing the application:

./fusion

This should result in output similar to the following*:

Elapsed time in microseconds: 337596
Elapsed time in microseconds: 184143
Elapsed time in microseconds: 184681
Elapsed time in microseconds: 184205
Elapsed time in microseconds: 184065
Elapsed time in microseconds: 184574
Elapsed time in microseconds: 184253
Elapsed time in microseconds: 184524
Elapsed time in microseconds: 184426
Elapsed time in microseconds: 184191
As numbers show, the first iteration is slightly slower than in the case without fusion, due to the additional overhead for JIT compiling for fusion.

On the other hand, the remaining iterations are faster than in the non-fused case*. In the later section on VTune analysis, we will explore the reason for that more.

Dataflow Internalization

As discussed in the first blog-post, internalization of dataflow in the fused kernel can be an important optimization technique.

We can apply dataflow internalization to the buffers bTmp1, bTmp2 and bTmp3 in our application, resulting in the modified source code that you can download from here.

Using the following command, we can make the necessary code changes for internalization visible:

git diff --no-index fusion.cpp internalization.cpp

The output shows that we mainly need to pass an additional property to the buffer definition. As an additional optimization, we have disabled the insertion of extra work-group barriers by the fusion JIT compiler by passing a property to complete_fusion():

diff --git a/fusion.cpp b/internalization.cpp
index 4cd0702..430c020 100644
--- a/fusion.cpp
+++ b/internalization.cpp
@@ -36,9 +36,15 @@ int main() {
     buffer<float> bIn3{in3.data(), range{dataSize}};
     buffer<float> bIn4{in4.data(), range{dataSize}};
     buffer<float> bOut{out.data(), range{dataSize}};
-    buffer<float> bTmp1{range{dataSize}};
-    buffer<float> bTmp2{range{dataSize}};
-    buffer<float> bTmp3{range{dataSize}};
+    buffer<float> bTmp1{
+        range{dataSize},
+        {sycl::ext::codeplay::experimental::property::promote_private{}}};
+    buffer<float> bTmp2{
+        range{dataSize},
+        {sycl::ext::codeplay::experimental::property::promote_private{}}};
+    buffer<float> bTmp3{
+        range{dataSize},
+        {sycl::ext::codeplay::experimental::property::promote_private{}}};
     ext::codeplay::experimental::fusion_wrapper fw{q};
@@ -84,7 +90,7 @@ int main() {
             dataSize, [=](id<1> i) { accOut[i] = accTmp1[i] - accTmp3[i]; });
       });
-      fw.complete_fusion();
+      fw.complete_fusion(ext::codeplay::experimental::property::no_barriers{});
       q.wait();
We can compile and run the application with the following commands:
clang++ -fsycl internalization.cpp -o internalization ./internalization

Again, no additional flags are needed for compilation or execution to enable fusion and dataflow internalization.

The output shows the performance improvement we can get from fusion with dataflow internalization*:

Elapsed time in microseconds: 266811
Elapsed time in microseconds: 89876
Elapsed time in microseconds: 89861
Elapsed time in microseconds: 89939
Elapsed time in microseconds: 89904
Elapsed time in microseconds: 89844
Elapsed time in microseconds: 90018
Elapsed time in microseconds: 89887
Elapsed time in microseconds: 89891
Elapsed time in microseconds: 89831
As in both versions above, the JIT and SPIR-V compilation overhead makes the first iteration slower than the following ones. However, the performance benefit from fusion with dataflow internalization is so notable that even the first iteration is faster than the first iteration in the case without fusion. In the remaining iterations, the version with fusion and internalization outperforms the non-fused version by more than 2x.

In the next section, we will use VTune to get a glimpse into why performance improves so significantly with fusion.

VTune Performance Analysis

The Intel VTune profiler can provide great insight into the performance of applications (not only for SYCL), allowing users to identify hotspots, analyze bottlenecks and guiding optimizations.

We will be using it, or, more specifically, its command-line interface, to get some insight into performance differences between the three different versions of our application.

Fusion Performance

First, in the attempt to determine the reason fusion, even without dataflow internalization, improves application performance, we will investigate the cache performance of both version.

To analyze applications, we first need to collect some metrics with VTune. For the non-fused version, this can be achieved with this command, limiting execution to the OpenCL CPU device:

ONEAPI_DEVICE_SELECTOR=opencl:cpu vtune -collect memory-access -r report-no-fusion ./no-fusion

We can do the same for the fused version with no dataflow internalization:

ONEAPI_DEVICE_SELECTOR=opencl:cpu vtune -collect memory-access -r report-fusion ./fusion

Now that we have collected the metrics, we can generate reports for both versions, specifically focusing on the actual kernel functions. As the application performs only few arithmetic operations for each item of data loaded, i.e., it is memory-bound, we will investigate memory metrics in particular**. For this first step of fusion, the most relevant metric will be the cache metric, as fusion, even without internalization, can improve cache hit rate*.

For the version without fusion, this works with the following command:

vtune -report hw-events -r report-no-fusion --column="stalls_l1d_miss,stalls_l2_miss,stalls_l3_miss"\
  --filter function="main::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::KernelOne"\
  --filter function="main::{lambda(sycl::_V1::handler&)#2}::operator()(sycl::_V1::handler&) const::KernelTwo"\
  --filter function="main::{lambda(sycl::_V1::handler&)#3}::operator()(sycl::_V1::handler&) const::KernelThree"\
  --filter function="main::{lambda(sycl::_V1::handler&)#4}::operator()(sycl::_V1::handler&) const::KernelFour"
The output of this command should be similar to the following, while the exact numbers can deviate based on hardware configuration*:
Function                                                                               Hardware Event Count:CYCLE_ACTIVITY.STALLS_L1D_MISS (M)  Hardware Event Count:CYCLE_ACTIVITY.STALLS_L2_MISS (M)  Hardware Event Count:CYCLE_ACTIVITY.STALLS_L3_MISS (M)
-----------------------------------------------------------------------------------------  -------------------------------------------------------  ------------------------------------------------------ ---------------------------------------
main::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::KernelOne                                                     14,352                                                  13,858                                13,286
main::{lambda(sycl::_V1::handler&)#3}::operator()(sycl::_V1::handler&) const::KernelThree                                                   14,196                                                  13,598                                13,208
main::{lambda(sycl::_V1::handler&)#2}::operator()(sycl::_V1::handler&) const::KernelTwo                                                     14,716                                                  13,832                                13,494
main::{lambda(sycl::_V1::handler&)#4}::operator()(sycl::_V1::handler&) const::KernelFour                                                    16,016                                                  15,262                                15,002
For the non-fused version, we only have to specify the name of the fused kernel function. The fused functions use the fused_ prefix and are consecutively numbered. As the application only contains a single fusion, the name will be fused_0
vtune -report hw-events -r report-fusion --column="stalls_l1d_miss,stalls_l2_miss,stalls_l3_miss" --filter function=fused_0

In this case, the output should be similar to the following*:

Function  Hardware Event Count:CYCLE_ACTIVITY.STALLS_L1D_MISS (M)  Hardware Event Count:CYCLE_ACTIVITY.STALLS_L2_MISS (M)  Hardware Event Count:CYCLE_ACTIVITY.STALLS_L3_MISS (M)
--------  -------------------------------------------------------  ------------------------------------------------------  ------------------------------------------------------
fused_0                                                    40,586                                                  38,532                                                  37,232
When comparing the sum of the L1 cache misses observed by the four kernels in the case without fusion (ca. 59 billion stall cycles) with the number of L1 caches misses for the fused version (ca. 40 billion stall cycles), we can see that the fused version observes significantly fewer L1 cache misses. As the fused kernel executes all operations for each work-item in a sequence in a single kernel, the temporal locality is better, leading to a reduction in cache misses*.

Effect of dataflow internalization

When comparing the fused version with and without internalization above, we have seen that major performance improvements can result from dataflow internalization.

To analyze this improvement, we will investigate another VTune metric.

For the fused case, we can reuse the existing report. For the version with internalization, we can generate the report using the following command:

ONEAPI_DEVICE_SELECTOR=opencl:cpu vtune -collect memory-access -r report-internalization ./internalization

After that, we can compare the number of memory loads and stores performed by the two different versions of the fused kernel.

For the version without dataflow internalization:

vtune -report hw-events -r report-fusion --column=all_load,all_store --filter function=fused_0

This yields an output similar to the following*:

Function  Hardware Event Count:MEM_INST_RETIRED.ALL_LOADS_PS (M)  Hardware Event Count:MEM_INST_RETIRED.ALL_STORES_PS (M)
--------  ------------------------------------------------------  -------------------------------------------------------
fused_0                                                      995                                                      501
We can do the same for the version with dataflow internalization:
vtune -report hw-events -r report-internalization --column=all_load,all_store --filterfunction=fused_0

The resulting output will be similar to the following*:

Function  Hardware Event Count:MEM_INST_RETIRED.ALL_LOADS_PS (M)  Hardware Event Count:MEM_INST_RETIRED.ALL_STORES_PS (M)
--------  ------------------------------------------------------  -------------------------------------------------------
fused_0                                                      499                                                      124

Comparing the output, we can see that dataflow internalization reduces the number of loads by almost 500 million and the number of stores by close to 380 million, resulting in performance improvements*.

This analysis of course only scratches the surface of performance analysis and what Intel VTune supports. If you want to learn more about these topics, have a look at the Intel VTune cookbook (in particular the SYCL sections) or the getting started guide Intel VTune.

Outlook

The SYCL extension for kernel fusion is currently an experimental feature, allowing users to experiment with the feature and us to gather early user feedback on the API and functionality.

At the same time, there is a larger effort to define a graph API for SYCL. SYCL graphs allows users to define a directed acyclic graph of dependent SYCL commands ahead of execution and is meant to open new optimization opportunities for SYCL applications, for example for workloads that repeatedly submit a similar sequence of kernels.

Fusing the kernels in a graph into a single kernel is one of those potential optimizations. Codeplay have therefore started work on a SYCL extension for graph fusion, building on top of the SYCL graph API. The sequence of kernels to fuse is defined through the graph API, which offers two different mechanisms: one recording mode very similar to the existing kernel fusion extension, and one API for explicitly constructing a graph from the kernel and dependencies. Using fusion on top of graphs provides a number of advantages, for example a more fine-grained control over when the JIT compilation for fusion actually takes place.

If you want to follow the development of the new extension, you can do so here.

Disclaimer*

Experiments performed on 10/04/2023 by Codeplay, with Intel Core i7-6700K, Ubuntu 20.04.5 LTS, Linux kernel 5.15, Intel VTune Profiler 2023.1.0 pre-release (build 625246), and OpenCL driver version 2022.14.10.0.20_160000.xmain-hotfix.

DPC++ nightly version 2023-04-08 (git commit 3d6917f) was used for measurements.

Performance varies by use, configuration and other factors. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. Intel, the Intel logo, Codeplay and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Metrics**

Availability and naming of hardware events are CPU-specific. To obtain a full list of available metrics for a report, a command similar to the following can be used:

vtune -report hw-events --column="?" -r ./report-internalization
Codeplay Software Ltd has published this article only as an opinion piece. Although every effort has been made to ensure the information contained in this post is accurate and reliable, Codeplay cannot and does not guarantee the accuracy, validity or completeness of this information. The information contained within this blog is provided "as is" without any representations or warranties, expressed or implied. Codeplay Sofware Ltd makes no representations or warranties in relation to the information in this post.
Lukas Sommer's Avatar

Lukas Sommer

Research Engineer