NVIDIA® Texture Cache in SYCL™

11 September 2023

This blog post presents a performance analysis of the CUDA texture cache for generic GPGPU applications. This is important for GPGPU programmers who may want to manually use the L1/Tex cache outside the scope of image-specific applications.

What is the texture cache

The texture cache is part of the unified data cache, a dedicated block of on-chip memory present on every Compute Unit (CU). The "unified" in the name comes from the fact that its components (the texture cache, the L1 cache and SYCL local memory (CUDA shared memory)) used to be housed in separate memory modules, but were over time moved onto a single die for better performance.

The name "texture cache" comes from its original usage in the graphics pipeline to accelerate texture mapping. This also means that it is directly accessible by the TEX units used by the SYCL bindless images API; however, that is beyond the scope of this blog post. More can be read about the bindless images API in its documentation.

The texture cache is non-coherent, meaning that it has no mechanism to update data already loaded into it. As such, it can only be used with data that is read-only for at least the lifetime of the kernel.

How to access the texture cache

A global memory load can be cached in the texture cache by accessing it through the ldg function, which currently lives in the sycl::ext::oneapi::experimental::cuda namespace. The full details of the function can be found in its documentation.

An example of adapting code to use the texture cache is shown in the code snippets below. The original snippet is taken from the page-rank benchmark from HeCBench.

// Without texture cache usage
q.submit([&](sycl::handler& cgh) {
  cgh.parallel_for<class map>(
    sycl::nd_range<1>(gws, lws), [=] (sycl::nd_item<1> item) {
    int i = item.get_global_id(0);
    if (i < n) {
      float outbound_rank = d_page_ranks[i]/(float)d_noutlinks[i];
      for(int j=0; j<n; ++j)
        d_maps[i*n+j] = d_pages[i*n+j]*outbound_rank;
    }
  });
});

// With texture cache usage
#include <sycl/ext/oneapi/experimental/cuda/builtins.hpp>
using namespace sycl::ext::oneapi::experimental::cuda;
q.submit([&](sycl::handler& cgh) {
  cgh.parallel_for<class map>(
    sycl::nd_range<1>(gws, lws), [=] (sycl::nd_item<1> item) {
    int i = item.get_global_id(0);
    if (i < n) {
      float outbound_rank = ldg(&d_page_ranks[i])/(float)ldg(&d_noutlinks[i]);
      for(int j=0; j<n; ++j)
        d_maps[i*n+j] = ldg(&d_pages[i*n+j])*outbound_rank;
    }
  });
});

Under the hood this function uses the ld.global.nc PTX assembly instruction, so the texture cache can also be accessed from inline PTX assembly using that instruction.

When to use the texture cache

As mentioned previously the texture cache can only be used with data that is read only for at least the lifetime of the kernel.

NVIDIA describes two key characteristics of the cache. Firstly, it is optimized for a certain data layout, but this layout (called block linear) is opaque and only accessible through the previously mentioned bindless images API. Secondly, on some architectures the texture cache has higher throughput but longer latency, therefore needing more parallelism to hide that latency.

General usage

Benchmarking was performed to get a general overview of the impact of using the texture cache. It was done by modifying a selection of benchmarks from the HeCBench suite to create versions with and without the use of the texture cache, and comparing performance between them.

The HeCBench suite was chosen because the benchmarks contained within are based on real-world GPGPU use cases. This was useful as one of the factors in the choice of a benchmark was whether a major GPGPU package (like PyTorch, for example) implements that algorithm using the texture cache. The subset of benchmarks used was chosen by finding benchmarks that either already use the texture cache or operate on read-only data that could use it. During selection, special attention was paid to benchmarks operating on data in 2D layouts, like lattices or images, as they could benefit from the aforementioned cache characteristics.

The modifications performed on the benchmarks were minimal. If a benchmark wasn't using the texture cache, a version using it was created by wrapping all eligible loads in the ldg function described above; if a benchmark was already using it, a version without it was created.

To minimize measurement error, the execution time of each version of each benchmark was calculated as an average of many individual runs. The exact number of runs differs between benchmarks: it was determined by calculating (on the assumption of uncorrelated runs) the standard error after each series of runs and increasing the number of runs until that error fell below the desired resolution of 1%. This strategy reduced the error so much that error bars would not be clearly visible on the plot and were therefore omitted.
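As a rough sketch, the adaptive run-count strategy described above could look like the following plain C++ (the `run_benchmark` callable, the minimum run count and the 1% threshold are illustrative assumptions, not the benchmark harness actually used):

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Keep re-running a benchmark until the standard error of the mean
// drops below a fraction of the mean (assumes uncorrelated runs).
double measure(const std::function<double()>& run_benchmark,
               double target_rel_error = 0.01, std::size_t min_runs = 5) {
  std::vector<double> times;
  double mean = 0.0;
  while (true) {
    times.push_back(run_benchmark());
    double sum = 0.0;
    for (double t : times) sum += t;
    mean = sum / times.size();
    if (times.size() >= min_runs) {
      double var = 0.0;
      for (double t : times) var += (t - mean) * (t - mean);
      var /= (times.size() - 1);                  // sample variance
      double sem = std::sqrt(var / times.size()); // standard error of the mean
      if (sem <= target_rel_error * mean) break;
    }
  }
  return mean;
}
```

With perfectly repeatable timings the loop stops at the minimum run count; noisy timings keep it running until the standard error target is met.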

The performance comparison was done by running the two versions of each benchmark (as described above) on the same system equipped with an NVIDIA A100, taking the average kernel execution time from both series of runs, and calculating the texture cache speedup using the formula below:

speedup = (t_without / t_with − 1) × 100%

where t_without and t_with are the average kernel execution times without and with texture cache usage, respectively.

The data collected this way is visible in the plot in figure 1.

Figure 1: Plot showing speedups resulting from the use of the texture cache

All performance measurements visible in the plot in figure 1 were performed with an AMD EPYC 7402 CPU with 260 GB of RAM and an NVIDIA A100 PCIe GPU with 40 GB of RAM running Ubuntu 22.04.2 LTS (Linux Kernel 5.15.0), DPC++ version 2023.2, CUDA SDK 12.2. The binaries were compiled with compiler switches:

-std=c++17 -Wall -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -O3

The plot in figure 1 shows that the use of the texture cache had no effect on most benchmarks. For the benefiting benchmarks the improvements are varied but always noticeable (up to 40%). Observed performance degradation was rare and small (~2%).

From this data it can be seen that while the texture cache can bring noticeable performance increases, it is very situational. The exact circumstances under which it is most useful are hard to pinpoint because the previously mentioned cache characteristics are either vague or opaque, but the collected data highlights some of them.

Firstly, the computational patterns of an algorithm are important. The texture cache seems to favor algorithms in which different elements of an array are read multiple times by multiple threads and/or the same thread in a warp. This can be seen in the fact that algorithms based on convolutions (represented by conv1d, conv1d-tiled, conv1d-tiled-caching and Convolution Separable), LSTMs (represented by clink kernel and clink offload) and graph algorithms (represented by page-rank, all pairs distance without shared and all pairs distance with shared) received the biggest performance increases.

This is not too surprising since convolution is one of the most common image processing operations. As such the optimizations from the graphics pipeline origins of the texture cache can shine through.
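The read pattern that suits the cache can be seen in a scalar sketch of 1D convolution (a generic illustration, not code from the benchmark sources): each output element re-reads almost the entire input window of its neighbor, so adjacent threads in a warp would touch largely the same elements.

```cpp
#include <cstddef>

// Each out[i] reads in[i..i+m-1]; out[i+1] re-reads m-1 of those same
// elements, so neighboring threads share most of their input window --
// exactly the kind of repeated read-only access the texture cache favors.
void conv1d(const float* in, const float* mask, float* out,
            std::size_t n, std::size_t m) {
  for (std::size_t i = 0; i + m <= n; ++i) {
    float acc = 0.0f;
    for (std::size_t j = 0; j < m; ++j)
      acc += in[i + j] * mask[j];
    out[i] = acc;
  }
}
```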

It is worth noting that winograd double and winograd float experienced a small but significant performance degradation, despite being convolution based. This is explored further in the next section.

Secondly, repeated reuse of data appears to be important. This is suggested by benchmarks like ising and heat2D, which, while working on 2D-structured read-only data, reuse that data very lightly and received no noticeable performance benefit.

These observations, in combination with the fact that using the texture cache requires minimal changes to the kernel's code, mean that the texture cache can be a quick and easy way to improve kernel performance.

Architecture impact

As mentioned in the first section, the texture cache lives in the unified data cache. As such, it is affected by what else is unified with it.

The last change NVIDIA made to it was in 2017 with the Volta microarchitecture (compute capability 7.0), when the unified data cache was physically combined with SYCL local memory.

The possible impact of this change has been investigated by running the benchmarks from the previous subsection on 3 different hardware configurations:

  1. NVIDIA GeForce GTX 1050 Ti, representing the performance before the merge of the texture/L1 and shared memory caches.
  2. NVIDIA GeForce GTX 1650, representing performance after the merge.
  3. NVIDIA A100, representing a more current architecture, to see if there have been any further changes since.

Relevant changes to the memory hardware between these GPUs are visible in table 1.

                                     1050 Ti    1650       A100
Compute capability                   6.1        7.5        8.0
VRAM size                            4 GB       4 GB       40 GB
VRAM technology                      GDDR5      GDDR5      HBM2e
Memory bus                           128 bit    128 bit    5120 bit
Memory clock                         1752 MHz   2001 MHz   1215 MHz
Unified L1/texture cache (per CU)    48 KB      N/A        N/A
Local memory (per CU)                96 KB      N/A        N/A
Unified data cache (per CU)          N/A        96 KB      192 KB

Table 1: Relevant hardware differences between used GPUs

This hardware selection meant that the benchmarks run on the third configuration had to use slightly different parameters. The only changed parameter was an increase in the processed data size. This change was necessary because the A100 is the only data center GPU of the three, and so possesses much more memory capacity and compute resources (as visible in table 1). The change kept the resource utilization and occupancy of the cards equally stressed, allowing caching benefits to be observed.

The data collected is visible in the plots in figures 2, 3 and 4.

Figure 2: Plot showing speedups resulting from the use of the texture cache on a GTX 1050Ti

All performance measurements visible in the plot in figure 2 were performed with an Intel i7-10700K CPU with 32 GB of RAM and an NVIDIA GeForce GTX 1050 Ti running Ubuntu 18.04.6 LTS (Linux Kernel 5.4.0), DPC++ version c87e7802, CUDA SDK 11.6. The binaries were compiled with compiler switches:

-std=c++17 -Wall -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_61 -O3

Figure 3: Plot showing speedups resulting from the use of the texture cache on a GTX1650

All performance measurements visible in the plot in figure 3 were performed with a 12th Gen Intel i9-12900K CPU with 64 GB of RAM and an NVIDIA GeForce GTX 1650 running Ubuntu 22.04.2 LTS (Linux Kernel 5.19.0), DPC++ version 0a39bb88, CUDA SDK 12.1. The binaries were compiled with compiler switches:

-std=c++17 -Wall -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_75 -O3

Figure 4: Plot showing speedups resulting from the use of the texture cache on an A100

All performance measurements visible in the plot in figure 4 were performed with an AMD EPYC 7402 CPU with 260 GB of RAM and an NVIDIA A100 PCIe GPU with 40 GB of RAM running Ubuntu 22.04.2 LTS (Linux Kernel 5.15.0), DPC++ version 2023.2, CUDA SDK 12.2. The binaries were compiled with compiler switches:

-std=c++17 -Wall -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -O3

These results suggest that the GPU architecture version may have a major impact on the effectiveness of the texture cache, with some of the benchmarks experiencing drastically lower benefits from the texture cache on the more modern GPUs, from over 200% down to less than 50%. Some benchmarks (namely winograd float and winograd double) experienced not only lower performance benefits but even a small yet significant performance degradation on newer GPUs.

Data size impact

As the 1D Convolution benchmark has many easily changeable variables, it allowed easy collection of extra data. With minimal modifications it was made to run 3 versions of the convolution kernel for 3 different data types and 7 different sizes of the convolution mask. The convolution mask size here is the size in elements of the smaller convolution input sequence. This was run and processed like the previous benchmarks, just with a desired resolution of 5%. The relaxed error resolution was necessary due to a combination of time constraints and very long single-run times. This lower precision was still deemed acceptable as the observed speedups are substantially bigger. The data collected is visible in the plot in figure 5.

Figure 5: Plot showing speedups for different mask sizes for different variants of 1D convolution

All performance measurements visible in the plot in figure 5 were performed with an Intel i7-10700K CPU with 32 GB of RAM and an NVIDIA GeForce GTX 1050 Ti running Ubuntu 18.04.6 LTS (Linux Kernel 5.4.0), DPC++ version c87e7802, CUDA SDK 11.6. The binaries were compiled with compiler switches:

-std=c++17 -Wall -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_61 -O3

The first thing immediately obvious from the plot in figure 5 is that the conv1d-tiled-caching kernel, despite being convolution based, receives either no benefit or even a slight penalty, reinforcing how use-case dependent the texture cache is. It can also be seen that the texture cache benefits grow with mask size, up to a saturation point occurring at around 63 float values, or the equivalent 31 double values or 127 int16_t values, after which the benefits drop noticeably. A seemingly relevant observation made by this paper is that the texture cache has a 128-byte cache line width.
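The three quoted saturation points coincide almost exactly when expressed in bytes, each sitting just below two of those 128-byte cache lines; a quick check (the element counts are the saturation points read off figure 5):

```cpp
#include <cstddef>
#include <cstdint>

// Saturation points observed in figure 5, converted to bytes:
constexpr std::size_t kFloatMask  = 63  * sizeof(float);        // 252 bytes
constexpr std::size_t kDoubleMask = 31  * sizeof(double);       // 248 bytes
constexpr std::size_t kInt16Mask  = 127 * sizeof(std::int16_t); // 254 bytes
// All three land just under two 128-byte texture cache lines (256 bytes).
```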

This also further supports the conclusion from a previous subsection that the texture cache benefits from increased data reuse.

Optimizing for the texture cache

Data type

While the ldg function supports many data types of a variety of sizes, performance-wise it favors some over others.

It can be seen in the plot in figure 5 that, roughly speaking, the smaller the data type, the bigger the benefits of using the texture cache. This is also visible in the winograd benchmarks in the plots in figures 2, 3 and 4: that benchmark originally used double as its primary data type, and a version was made that used float instead.

Note that when changing the data type for a given benchmark, the whole algorithm was changed to use the new data type.

Data ordering

The 2D spatial locality of the block-linear layout that the texture cache is optimized for hints at a relation to z-ordering (also known as z-curve ordering or Morton code indexing), but as the layout is opaque, nothing can be officially confirmed.

However, some applications have reported benefits from using z-ordering. For example, the particle simulator in this paper from 2010 notes that "it may be beneficial to use other functions such the Z-order curve [8] to improve the coherence of memory accesses". More modern use cases are described in this and this paper, which both use z-ordering within the HOOMD-blue molecular dynamics package.
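For illustration, a common way to compute a 2D Morton (z-order) index is to interleave the bits of the x and y coordinates; the sketch below handles 16-bit coordinates. This is the generic z-ordering technique, not the texture cache's actual internal layout:

```cpp
#include <cstdint>

// Spread the lower 16 bits of v so each bit lands in an even position:
// 0b0000abcd -> 0b0a0b0c0d.
static std::uint32_t part1by1(std::uint32_t v) {
  v &= 0x0000FFFF;
  v = (v | (v << 8)) & 0x00FF00FF;
  v = (v | (v << 4)) & 0x0F0F0F0F;
  v = (v | (v << 2)) & 0x33333333;
  v = (v | (v << 1)) & 0x55555555;
  return v;
}

// Interleave x (even bits) and y (odd bits) into a z-order index, so that
// points close together in 2D tend to be close together in memory.
std::uint32_t morton2d(std::uint32_t x, std::uint32_t y) {
  return part1by1(x) | (part1by1(y) << 1);
}
```

Indexing a 2D array by `morton2d(x, y)` instead of `y * width + x` walks the z-curve, which can improve the 2D locality of neighboring accesses.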

Relation to local memory

As mentioned in the architecture impact section, ever since the Volta microarchitecture the texture cache and local memory have shared the same physical resource. This introduces a dependency between them: the sum of the local memory and texture cache used by a kernel must be less than or equal to a fixed value.

This means that the more local memory a given kernel allocates, the less texture cache is available to it. This is further complicated by the fact that while local memory allocations can be of any size within the hardware limits, the texture cache is not as flexible and can only snap to predefined per-architecture values called carveouts.

More about this topic, including how to set local memory allocation limits and how the carveout is decided, can be read here.

Related optimization tip

If one is considering the use of the texture cache to optimize some kernels, one might also consider a related optimization: register reuse.

This optimization is possible because data loaded through the texture cache must be constant for the lifetime of the kernel. This means that instead of being loaded on every use, the data can be loaded into registers once and reused from there. And while a texture cache load is faster than a device memory one, a register access is significantly faster than either of them.

Register reuse is especially important for data used within loops. Because the data is constant for the lifetime of the kernel, it is also invariant for all loops within the kernel, so the loads can be hoisted out of the loop, greatly reducing the number of memory loads.

It is important to note that this only applies to smaller amounts of data, as loading too much data into registers may cause a register spill and harm performance instead. Spilled registers go to private memory, which is located in device memory, so at that point the optimization would be replacing a cached global read with a standard one.
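The hoisting idea can be sketched in plain C++ standing in for a kernel body (in an actual kernel the plain reads of `coeff` would be ldg loads; the function names and coefficients are purely illustrative):

```cpp
#include <cstddef>

// Without register reuse: coeff[0..2] are re-read on every iteration.
void scale_naive(const float* coeff, const float* in, float* out,
                 std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    out[i] = in[i] * coeff[0] + coeff[1] * coeff[2];
}

// With register reuse: the loop-invariant loads are hoisted into locals,
// which the compiler can keep in registers for the whole loop.
void scale_hoisted(const float* coeff, const float* in, float* out,
                   std::size_t n) {
  const float c0 = coeff[0];              // loaded once, reused n times
  const float bias = coeff[1] * coeff[2]; // also folds one multiply out
  for (std::size_t i = 0; i < n; ++i)
    out[i] = in[i] * c0 + bias;
}
```

Both versions compute the same result; the second trades repeated memory loads for two registers.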

Disclaimer

Performance varies by use, configuration and other factors. Results may vary.

Codeplay Software Ltd has published this article only as an opinion piece. Although every effort has been made to ensure the information contained in this post is accurate and reliable, Codeplay cannot and does not guarantee the accuracy, validity or completeness of this information. The information contained within this blog is provided "as is" without any representations or warranties, expressed or implied. Codeplay Software Ltd makes no representations or warranties in relation to the information in this post.

Mateusz Hurnik

Utilizing Texture Cache in SYCL Applications Intern