Parallel Algorithms with C++ STL and TBB (Threading Building Blocks) (2024)

Some coding steps for optimizing C++ parallel algorithms using STL and TBB for robust multi-threading.
Author

Paul Norvig

Published

January 5, 2024

Introduction

I’ve been working with C++ and parallel computing for a while now, and it’s one of those areas that always seems to be growing. When I first got into parallel programming, I realized pretty quickly that understanding how to use multiple cores and processors was going to be key. So, I started looking into tools and libraries that could help. That’s how I came across the world of Threading Building Blocks (TBB) and the enhancements to the C++ Standard Template Library (STL). They’ve been a big part of my toolkit for writing efficient, parallel code - below I share some insights on how to get started with them.

Introduction to Parallel Computing in C++

Parallel computing is an essential paradigm for developers to grasp, especially when dealing with C++. In my journey, the transition from sequential to parallel problem-solving required a shift in mindset. Think of it as going from crafting a sonnet to orchestrating a symphony. Each musical part, like a task in a program, must work in harmony with the rest, yet each can be played independently.

In parallel computing, your code is split into parts that run concurrently. This way, I often see a dramatic speedup on multi-core processors, which are quite common these days. To start off, let’s first get familiar with the simplest form of parallelism using standard threads:

#include <iostream>
#include <thread>
#include <vector>

void do_work(unsigned id) {
    std::cout << "Thread " << id << " is doing work\n";
}

int main() {
    std::vector<std::thread> threads;

    for (unsigned i = 0; i < 10; ++i) {
        threads.emplace_back(do_work, i);
    }

    for (auto& thread : threads) {
        thread.join();
    }

    return 0;
}

This code spins up ten threads, each running the do_work function. It’s a hands-on way to get started, with each thread being joined at the end to ensure that the main thread waits for all threads to finish their jobs.

However, managing threads manually can be laborious and error-prone. Thankfully, the C++ Standard Template Library (STL) aids us with utilities like std::async that simplify creating asynchronous tasks:

#include <iostream>
#include <future>

int main() {
    // std::launch::async ensures the task actually runs asynchronously
    // (the default policy is allowed to defer execution until wait/get)
    auto future = std::async(std::launch::async, [](){
        std::cout << "Hello from a future!\n";
    });

    future.wait(); // Wait for the async task to complete
}

Such STL features are just scratching the surface of parallel computing in C++. What truly opens up the potential is the Threading Building Blocks (TBB) library, which helps us employ higher-level parallel patterns without drowning in the lower-level details of thread management. Consider the following TBB parallel_for example:

#include <tbb/tbb.h>
#include <cmath>
#include <vector>

int main() {
    std::vector<float> large_data(1000000);

    // Parallel fill with TBB
    tbb::parallel_for(size_t(0), large_data.size(), [&large_data](size_t i) {
        large_data[i] = std::sin(i);
    });
}

Using tbb::parallel_for, I’ve replaced a traditional loop with an elegant parallel loop that works over a vector, filling it with computed values. This is an entry point into a world where your operations can take full advantage of modern multicore processors. And this is just a drop in the ocean of what TBB offers.

Parallel computing is a vast subject, and C++ provides an evolving ecosystem with its STL and TBB libraries. While the above code snippets are introductory, in my experience, mastering these basic patterns first paves the way for tackling more complex parallel algorithms seamlessly. Whether it’s for speeding up computationally intense tasks or efficiently processing large datasets, embedding parallelism into your codebase is a game-changer.

To those just beginning with parallel programming in C++, remember that each core on your processor is an opportunity for performance gains. Utilize them with caution and care, understanding the synchronization and data sharing implications. That said, it’s a fulfilling journey, one that will transform not just how your programs run, but also how you think about solving problems.

Leveraging C++ STL for Parallel Algorithms

As I began to explore the Standard Template Library (STL) with an eye toward parallelism, it became clear that C++ provides a rich ecosystem for this purpose. The STL, traditionally used for sequential programming, has seen significant enhancements to support parallelism, which opens up new opportunities for performance optimization.

One of the most straightforward ways to leverage parallelism in C++ is by using the parallel execution policies introduced in C++17. These policies can be applied to many of the algorithms in the STL, allowing them to run in parallel without fundamentally changing the code structure. Here is a simple example:

#include <vector>
#include <algorithm>
#include <execution>

int main() {
    std::vector<int> data = {2, 3, 4, 5, 6, 7, 8, 9};

    std::for_each(std::execution::par, data.begin(), data.end(), [](int& n){
        n *= 2;  // Double the value of each element
    });

    return 0;
}

In the above code, std::execution::par tells the for_each function to execute the lambda function in parallel. Now, each multiplication can happen concurrently, potentially speeding up the overall processing on systems with multiple cores. It’s one of the simplest ways to kickstart parallel programming with the tools already provided in the STL.

Let’s consider sorting, which is often a performance bottleneck in applications. With the STL, you can easily parallelize the sorting of a large dataset:

#include <vector>
#include <algorithm>
#include <execution>
#include <random>

int main() {
    std::vector<int> data(1000000);
    std::random_device rd;
    std::mt19937_64 eng(rd());
    std::uniform_int_distribution<> distr;

    // Fill data with random integers
    std::generate(begin(data), end(data), [&]() { return distr(eng); });

    // Sort data in parallel
    std::sort(std::execution::par, data.begin(), data.end());

    return 0;
}

By simply adding std::execution::par as a parameter to std::sort, the STL takes care of distributing the work across multiple threads. This is a classic example where a few changes can make a significant impact on performance when working with large volumes of data.

It’s worth mentioning that while parallel algorithms can lead to performance gains, they aren’t always the right choice. The overhead associated with managing parallel execution can sometimes outweigh the benefits, especially for small datasets or tasks that are inherently sequential in nature.

Another aspect to consider when dealing with parallel algorithms is the potential for race conditions and the need for thread synchronization. Access to shared data must be managed carefully to avoid these issues - an area where TBB’s concurrent containers and algorithms excel. However, that’s a story for another section of the article.

In essence, the STL’s algorithms with parallel execution policies offer a relatively gentle slope for beginners to ascend the mountain of parallel programming in C++. It’s a testament to the design of the language and its libraries that a strategy as powerful as parallel processing can be attempted without deep diving into the complexities of thread management.

Remember to profile and benchmark your parallel code against the sequential version. Real-world performance can vary dramatically based on the context, and the only way to know if you’re truly gaining benefits is to measure.

For those keen on diving deeper, the ISO C++ website (https://isocpp.org/) is a treasure trove of information, and the cppreference (https://en.cppreference.com/) provides thorough documentation on execution policies. Happy coding, and may your algorithms run at lightning speed.

Understanding TBB for Task-based Parallelism

When tackling the Herculean task of implementing parallelism in applications, it’s easy to get tangled in the intricacies of thread management and synchronization. I’ve had my fair share of headaches trying to get it right. This is where Intel’s Threading Building Blocks (TBB) enters the fray, waving the flag of relief for developers like us trying to squeeze out that last drop of performance from our C++ applications.

TBB offers an abstraction layer over raw threading mechanisms, which means I spend less time wrestling with thread lifecycle or mutexes and more time crafting the actual logic of my program. To illustrate, let’s explore using TBB for a common scenario: parallel_for. This functionality is akin to a traditional for loop, except it runs iterations concurrently.

#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

int main() {
    std::vector<int> largeData(1000000);
    // Assume largeData is populated with meaningful data

    tbb::parallel_for(tbb::blocked_range<size_t>(0, largeData.size()),
                      [&](const tbb::blocked_range<size_t>& r) {
                          for (size_t i = r.begin(); i != r.end(); ++i) {
                              // Perform some computation on largeData[i]
                          }
                      });
}

Here, TBB intelligently divides the range of our large data set and assigns chunks to different threads across the available cores. This makes it quite effortless to scale up the workload without manually creating and managing threads.

Next up is controlling how many worker threads TBB uses. Older code did this with the task_scheduler_init object, but that class was deprecated and then removed in oneTBB (2021 and later); its modern replacement is tbb::global_control. Either way, TBB’s default scheduler is pretty good at detecting the optimal thread count, so I rarely have to fiddle with it.

Sometimes, managing dependencies between tasks is essential. TBB offers tbb::task_group for this situation (the older low-level tbb::task class shown in many tutorials has been removed from oneTBB). It can be somewhat tricky initially, but once you get the hang of it, things become quite manageable. Here’s a simplified example of spawning tasks and waiting for them:

#include <tbb/task_group.h>
#include <iostream>

int main() {
    tbb::task_group tasks;

    tasks.run([] { std::cout << "Task A running\n"; });
    tasks.run([] { std::cout << "Task B running\n"; });

    tasks.wait(); // Block until both tasks have completed
}

With task-based parallelism, it’s easier to represent complex task graphs where tasks are interdependent. These could be daunting to manage with raw threads, because you would be in charge of tracking the execution and dependencies.

Moreover, TBB comes bundled with concurrent containers such as concurrent_hash_map and concurrent_queue, which are thread-safe and work miracles when you want shared data structures but don’t want the headache of dealing with traditional locks.

As you journey into the realm of TBB, the extensive documentation and vibrant community at Intel’s TBB GitHub repository are treasure troves of information and assistance. Begin with the examples provided and escalate the complexity of your projects as you grow more comfortable.

While this snippet only scratches the surface of TBB, it should give you an idea of the power at your fingertips. Start small with parallel_for, appreciate the simplicity of the task-based paradigm, leverage concurrent containers, and before you know it, you’ll be parallelizing tasks like it’s second nature, with all the grunt work of thread management taken care of. It’s tools like TBB that make modern C++ such a powerhouse for high-performance computing, giving us the edge in developing software that truly harnesses the capabilities of contemporary hardware.

Integrating STL and TBB for Optimized Performance

Integrating STL and TBB for optimized performance requires a grasp of both libraries and the ability to discern when and how to leverage their strengths. I’ve often found myself choosing between one or the other, but the true power lies in knowing how to combine them effectively.

STL algorithms often provide a straightforward solution for data manipulation. However, when I need more juice for performance-critical sections, TBB comes into play. TBB is designed with parallelization in mind, offering scalable solutions to multi-threading problems.

Let’s start by enhancing the performance of a simple STL sort algorithm using TBB. Normally, you might do this:

#include <algorithm>
#include <vector>

std::vector<int> myVector = { ... }; // Imagine a large dataset here
std::sort(myVector.begin(), myVector.end());

But, if you’re aiming for better performance by adding parallelism, you might integrate TBB like so:

#include <tbb/parallel_sort.h>
#include <vector>

std::vector<int> myVector = { ... };
tbb::parallel_sort(myVector.begin(), myVector.end());

Notice how just swapping std::sort with tbb::parallel_sort can make your sorting much faster on a multi-core processor.

Moving on, let’s dig into more complex scenarios—say, transforming elements in a std::vector. With STL, that’s typically a loop or a std::transform call. But, imagine this vector is huge. You want your code to run faster by processing chunks of the vector in parallel.

Here’s how you’d do it with STL:

#include <vector>
#include <algorithm>

std::vector<int> myVector = { ... };
std::transform(myVector.begin(), myVector.end(), myVector.begin(),
               [](int x) { return x * x; });

And here’s how you’d do it with TBB:

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

std::vector<int> myVector = { ... };
tbb::parallel_for(
    tbb::blocked_range<std::vector<int>::iterator>(myVector.begin(), myVector.end()),
    [](const tbb::blocked_range<std::vector<int>::iterator>& r) {
        for (auto i = r.begin(); i != r.end(); ++i) {
            *i = (*i) * (*i);
        }
    });

You wrap the range of the vector you want to work with in tbb::blocked_range and use tbb::parallel_for to process it. What happens under the hood is TBB divides the range into chunks and processes them in parallel threads. It’s still just a few extra lines of code, but the performance gains on large datasets can be significant.

When I worked on integrating these two libraries in my projects, it wasn’t just about blind optimization. It’s crucial to profile your code and figure out which parts are creating bottlenecks. This is because sometimes, the overhead of spinning up threads can outweigh the performance gains for smaller datasets or less complex operations.

For beginners, an understanding of how to measure performance gains is essential. Tools like Valgrind’s Callgrind, Intel’s VTune, or even simple clock-based timers can help you decide if parallelism is improving performance or not.

If you’re looking for more examples and community discussions, I recommend checking sites like Stack Overflow or exploring repositories on GitHub (search for “tbb examples”). You’re not just looking for code snippets, but for real-world scenarios where developers have weighed in on similar problems.

Remember, integrating STL and TBB is about finding a balance. You don’t need to parallelize everything. Start with hotspots in your code—those critical sections that need to be faster. As you experiment and profile, you’ll get a feel for when to use STL, when to opt for TBB, and when to blend the two for the best results.