GPU Programming with CUDA C++ (2024)

Short guidew to GPU Programming with CUDA C++ (2024): coding, performance optimization, latest features.

Paul Norvig


January 12, 2024


I’ve been working with CUDA for a while now, and it’s been quite exciting to get into the world of GPU programming. It’s a space where every millisecond of performance counts and where the architecture of your code can leverage the incredible power GPUs offer. Over the years, I’ve picked up a few tricks and techniques that have helped me unlock new levels of speed and efficiency. In this piece, I’m going to share some advanced CUDA features and tips for performance optimization that I’ve found to be game-changers.

Introduction to GPU Architecture and CUDA C++

As I with GPU programming, I realized that understanding the architecture of a graphics processing unit (GPU) is crucial before even writing a line of CUDA C++ code. Unlike a CPU with a few cores optimized for sequential serial processing, a GPU has a highly parallel structure that makes it effective for handling large blocks of data simultaneously. This is precisely why GPUs have become a powerhouse for not only rendering graphics but also accelerating a wide range of computation-heavy applications like scientific simulations and deep learning.

GPUs follow a SIMD (Single Instruction, Multiple Data) architecture which means multiple processing units execute the same instruction on different pieces of data simultaneously. CUDA, which stands for Compute Unified Device Architecture, provides a C++ friendly platform developed by NVIDIA for general-purpose processing on GPUs.

So, if you’re like me, itching to get your hands dirty with some GPU programming, let’s break down the essentials. Here’s a snippet that illustrates how CUDA C++ parallels the GPU architecture:

__global__ void add(int n, float *x, float *y) {
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride)
y[i] = x[i] + y[i];

int main(void) {
int N = 1<<20; // 1M elements
float *x, *y;

// Allocate Unified Memory – accessible from CPU or GPU
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));

// Initialize x and y arrays on the host
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;

// Run kernel on 1M elements on the GPU
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);

// Wait for GPU to finish before accessing on host

// Free memory

return 0;

In this code, the __global__ specifier indicates a function (add) that runs on the GPU but can be called from the CPU. The blockIdx, blockDim, and threadIdx variables are built-in CUDA variables that let us calculate an index for each thread so our data can be processed in parallel.

Notice the triple angle brackets in the call to add<<<numBlocks, blockSize>>>(N, x, y). This syntax is unique to CUDA and specifies the execution configuration. Here, numBlocks is the number of blocks, and blockSize is the number of threads per block. The CUDA runtime system handles the scaling across the GPU cores.

Understanding these basic blocks of CUDA helps demystify the parallel computing capabilities of GPUs. You might be wondering where the data is living in these parallel operations. Well, Unified Memory is a shared memory space accessible by both the CPU and GPU, simplifying data management. GPU memory management is a vast topic and one that merits its own focus later on.

For those looking to dive deeper, NVIDIA’s CUDA Toolkit Documentation is an invaluable resource. Additionally, GitHub repositories like cuda-samples provide a plethora of real code examples that can accelerate your learning curve.

Remember, the layout of CUDA corresponds closely with the GPU architecture. Using blocks, grids, and threads efficiently requires a solid grasp of how GPUs process data. Mastering these concepts is a gateway to harnessing the raw power of GPUs for an array of computational tasks where traditional CPU-bound methods falter. As we move past this introduction, we’ll touch upon the crucial aspects of CUDA programming, clearly separating the wheat from the chaff in terms of what skills can really amplify the performance of your applications.

Setting Up the Development Environment for CUDA Programming

Embarking on the CUDA programming journey entails setting up a robust development environment. I’ll guide you through the process which is, trust me, easier than it might seem. You’ll be writing your first CUDA program in no time.

First up, ensure that you have a CUDA-enabled NVIDIA GPU. You can check the compatibility of your GPU with CUDA on NVIDIA’s website. It’s essential since CUDA code won’t run on non-NVIDIA or older GPUs that don’t support the necessary compute capability.

Next, we’ll need to install the CUDA Toolkit provided by NVIDIA. This is your bread and butter for CUDA development. It includes the necessary compilers, libraries, and debugging tools. Head over to the CUDA Toolkit download page and select the version appropriate for your operating system. During installation, you can choose to install the drivers too if you don’t already have them installed.

On Linux, your installation command might look something like this:

sudo dpkg -i cuda-repo-<distro>_<version>_amd64.deb
sudo apt-key adv --fetch-keys<distro>/x86_64/
sudo apt-get update
sudo apt-get install cuda

Replace <distro> and <version> with the specific distribution name and CUDA version respectively.

For Windows users, once the installer is downloaded, it’s mostly a click-through process. If you prefer command-line installation on Windows, you can use something like:

setup.exe -s nvcc_<version> cuobjdump_<version> nvprune_<version> cuda-memcheck_<version>

This installs the core components, including the NVCC compiler and additional tools for debugging and binary manipulation.

With the toolkit installation out of the way, it’s time to get the Integrated Development Environment (IDE) set up. I’m a fan of Visual Studio Code (VS Code) for its simplicity and extensibility, but feel free to use your preferred IDE like Visual Studio, Eclipse, or even command line editors such as Vim or Emacs if that’s what you’re comfortable with.

For VS Code, there’s a fantastic extension called C/C++ for Visual Studio Code that you should definitely install. It provides IntelliSense, debugging, and code browsing capabilities which are invaluable.

"name": "C/C++",
"publisher": "ms-vscode.cpptools",
"version": "1.0.0"

Now comes the part where you set up your first project. Create a new directory for your CUDA projects and open it in your IDE:

mkdir cuda_projects
cd cuda_projects
code .

Within the project directory, create a simple .cu file to get started. Here’s a “Hello World” CUDA code snippet:

#include <stdio.h>

__global__ void hello_cuda() {
printf("Hello, CUDA!\n");

int main() {
hello_cuda<<<1, 1>>>();
return 0;

As you can see, the main function launches a kernel hello_cuda with a single thread using the triple angle brackets (<<< >>>).

To compile the .cu file, use the NVCC compiler that came with the CUDA Toolkit. Assuming your file is named, the command is:

nvcc -o hello

This command compiles the code and outputs an executable named hello. Run it, and if all goes well, you should see Hello, CUDA! printed out. If you’re seeing that, congratulate yourself, you’ve just written and run your first CUDA program!

Remember, this setup is just the starting point. As you grow in CUDA, you might need to use additional libraries like cuBLAS or cuFFT, but worry not, they follow a similar setup and installation paradigm. Now, forge ahead and keep experimenting with more complex kernels and operations to harness the full power of your GPU with CUDA!

Basic CUDA Kernels and Memory Management

When I first started dabbling with CUDA, kernels and memory management felt like stumbling blocks. But as soon as I got the hang of it, I began writing CUDA code with a renewed sense of confidence. Let’s talk about spinning up a basic CUDA kernel and managing memory effectively.

A CUDA kernel is essentially a function that runs on the GPU. Here’s a classic example of a kernel that adds two arrays A and B element-wise to produce an array C:

__global__ void vectorAdd(int *A, int *B, int *C, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
C[i] = A[i] + B[i];

Notice the __global__ specifier? That tells the CUDA C++ compiler this function should run on the GPU. We calculate the index i using the block and thread identifiers. These are intrinsic variables that CUDA provides, namely blockIdx, blockDim, and threadIdx.

To run this kernel, you’ll need to allocate memory on the GPU and copy the data from the CPU:

int *d_A, *d_B, *d_C; // Device pointers
int size = N * sizeof(int);

// Allocate device memory
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);

// Copy data from the host to the device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

Here, cudaMalloc allocates memory on the GPU, and cudaMemcpy copies data from the host (CPU) to the device (GPU). The parameters dictate the direction of data transfer.

Once the data is on the GPU, call the kernel like so:

// Number of blocks and threads per block
dim3 blocksPerGrid((N + threadsPerBlock - 1) / threadsPerBlock, 1, 1);
dim3 threadsPerBlock(256, 1, 1);

// Launch the kernel
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

The <<<blocksPerGrid, threadsPerBlock>>> syntax specifies how many blocks of threads to create. It’s one of those things that looked cryptic to me until I realized it’s just a function call with some extra information about the parallel execution configuration.

After the kernel executes, I always ensure to copy the results back to host memory:

// Copy the result from device to host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

Finally, cleaning up the allocated memory is crucial to avoid memory leaks:


Remember that GPU memory management is manual, unlike the automatic memory management that you might be used to in high-level programming languages.

As for error checking, I enforce it religiously. It’s simple with CUDA’s error handling API:

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));

Checking the return value of CUDA runtime functions can save hours of debugging time. When something goes wrong, cudaGetErrorString(err) can tell you what it is.

To round things up, CUDA programming might have a steep learning curve, especially when dealing with lower-level details like memory management, but it’s incredibly rewarding to see your code fly on a GPU. Tinkering with kernels and memory allows me to extract every ounce of performance from the hardware, something I can’t get enough of as I explore the possibilities of parallel computation.

Advanced CUDA Features and Techniques

As a seasoned CUDA developer, I’ve often marveled at the range of sophisticated features that NVIDIA’s platform offers to wring out every drop of performance from their GPUs. After years of exploring CUDA’s various iterations, I’ve uncovered a series of advanced techniques that prove invaluable in tackling complex computational challenges. Let’s look at some of these, and explore how you might apply them in your programming endeavors.

Dynamic Parallelism

Gone are the days of the CPU having to control all aspects of execution. With dynamic parallelism, kernels can launch other kernels directly on the GPU. This is a game-changer for recursive algorithms and adaptive workload management.

__global__ void childKernel() {
// operations by the child kernel

__global__ void parentKernel() {
// operations before launching the child kernel
if (threadIdx.x == 0) { // Only one thread launches the child kernel
childKernel<<<1, 256>>>();
// operations after the child kernel has finished

With this, you can enable waves of parallel tasks that are dynamically adjusted as the computation progresses. It simplifies the programming model for algorithms that inherently contain nested parallelism, such as quicksort or tree traversal.

Stream Multiplexing

To maximize GPU utilization, I often make two or more kernels execute concurrently via stream multiplexing. By assigning different streams to different kernels, the GPU can switch between tasks during stalls or when waiting for memory operations.

cudaStream_t stream1, stream2;

myKernel1<<<1000, 256, 0, stream1>>>(...);
myKernel2<<<1000, 256, 0, stream2>>>(...);


Here, myKernel1 and myKernel2 could potentially run in parallel, improving throughput, especially on workloads with a mix of compute and memory-bound operations.

Unified Memory

Unified memory, or managed memory, abstracts the complexities of memory allocations between the host and device. It brings in a single memory space accessible from any processor without the need for explicit data transfer.

int* devArray;
cudaMallocManaged(&devArray, 1000 * sizeof(int)); // Allocate managed memory

// Use directly in both host and device code

This feature significantly eases the mental load when orchestrating complex data structures that need to be accessed on both the CPU and GPU, delivering a more streamlined programming model.

GPU Lambda Expressions

I appreciate the power brought by lambda expressions in C++. With CUDA, you can now harness that power directly in device code. This allows capturing variables and executing inline functions within kernel calls, leading to cleaner and more modular code.

int a = 10;
auto myLambda = [=] __device__ () {
printf("Value of a in lambda: %d", a);

myLambda<<<1, 1>>>();

Lambdas in CUDA make it easier to create callback-like patterns and reduce boilerplate when dealing with small, detailed operations that don’t require a full kernel definition.

Cooperative Groups

Finally, let’s highlight cooperative groups, a feature that allows threads to cooperate and synchronize more flexibly than the traditional thread blocks. This can be incredibly powerful for workloads needing complex communication patterns or advanced reduction and scan operations.

__global__ void myKernel(cooperative_groups::grid_group g) {
// Perform operations within the group
g.sync();  // Synchronize all threads in the group

The cooperative groups API abstracts the grid and block-level synchronization and provides the means to tackle problems that were tough to parallelize before.

These advanced features can significantly elevate the performance and capability of your GPU programs. But here’s a tip: always profile your code. Understand the bottlenecks and ensure the complexity introduced by these features is worth it by measurable performance gains.

Happy CUDA coding!

Performance Optimization and Best Practices in CUDA

In wrapping up our journey through GPU programming with CUDA C++, let’s focus on what can make or break your applications: performance optimization and best practices. I’ve spent countless hours tuning CUDA code, and I assure you, the devil is in the details.

Let’s tackle the elephant in the room - memory access patterns. GPUs love data locality and coalesced accesses because it significantly cuts down memory transfer times. A common pattern I use is to ensure global memory accesses by threads in a warp are contiguous:

__global__ void optimizedAccess(float *data) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
// Access pattern that promotes coalescing
float value = data[index];
// Do something with value...

Next, shared memory is a limited resource. Use it wisely. It’s much faster than global memory, so caching data there for reuse pays dividends. Here’s a snippet that uses shared memory for a reduction:

__global__ void sharedMemReduction(float *input, float *output, int size) {
__shared__ float partialSum[256];
int tid = threadIdx.x;
int start = blockIdx.x * blockDim.x * 2;

if(start + tid < size)
partialSum[tid] = input[start + tid] + input[start + tid + blockDim.x];
partialSum[tid] = 0;

for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
partialSum[tid] += partialSum[tid + s];
if (tid == 0) output[blockIdx.x] = partialSum[0];

Dim3 is your friend for defining grid and block dimensions. Remember, the right configuration can lead to significant performance gains. Always tailor these dimensions to your data:

dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
(N + threadsPerBlock.y - 1) / threadsPerBlock.y);
optimizedAccess<<<numBlocks, threadsPerBlock>>>(data);

Synchronization is critical and often a performance bottleneck. __syncthreads() synchronizes threads within a block, ensuring all threads have completed before moving on. Use it judiciously; unnecessary sync points can throttle your warp execution:

__global__ void necessarySynchronization(/* params */) {
// Perform first part of computation...
__syncthreads(); // Sync point required before proceeding
// Perform next part of computation that relies on the first part being complete...

Optimizing kernel launches is another best practice. Always prefer to launch more work than less, saturating the GPU to mask latency. Overhead from launching a kernel is more consuming than you think, which is why batch processing is a clever approach:

int numberOfBatches = /* calculate based on problem size */;
for (int i = 0; i < numberOfBatches; i++) {
myKernel<<<numBlocks, threadsPerBlock>>>(/* args */);
// Consider using cudaDeviceSynchronize() if necessary between batches

Lastly, let’s talk tools, because even the best of us can’t optimize in the dark. The NVIDIA Visual Profiler is gold for identifying bottlenecks and streamlining performance. Comb through those timelines and metrics; they reveal more than you can imagine, often pinpointing exactly where you need to focus your optimization efforts.

As this section concludes our CUDA C++ discussion for 2024, I hope you feel equipped to tackle GPU programming head-on. Optimization is an iterative and incremental part of the development process: measure, tweak, and repeat. And always, always, play close attention to the docs and community-proven practices; they’re the bedrock of good coding. Happy coding, and may your kernels run swiftly!