Introducing CLplusplus

In my last post, I discussed my current attempts to build an emulator suitable for OS development. Today, I will discuss another project which I’m simultaneously working on, which is more closely related to my next job than to OSdeving but could be of interest to some of you readers anyway: CLplusplus, a more modern C++ wrapper to the OpenCL API than what the OpenCL standard will provide.

General project goal

Better data types

OpenCL is, at its heart, a C API, and this shows in many aspects of its standard interface. For example, querying some property of an OpenCL object is usually done in the following manner :

size_t result_size;
clGetSomeProperty(property_name, 0, nullptr, &result_size);

... <allocate storage for the result> ...

clGetSomeProperty(property_name, result_size, result, nullptr);

For any user of a modern programming language, such verbosity is ridiculous. We have come to expect APIs to work in a much more straightforward manner, like this:

result = get_some_property(property_name);

Or even this :

result = opencl_object.property_name();

Such high-level queries can quite easily be implemented on top of the C interface provided by the OpenCL API. The main problem is that the C programming language has the worst array abstraction ever devised in a programming language, so we will sometimes need to move to higher-level containers.

This is especially true whenever an OpenCL query returns an array of results, which is best expressed in C++ by a std::vector, whereas the OpenCL API will instead resort to inferior solutions such as separate array size and pointer storage, or worse, zero-terminated lists.
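To make the size-then-data dance concrete, here is a minimal sketch of how it could be hidden behind a function returning standard containers. The names fake_get_info, query_bytes and query_string are made up for this illustration and are not part of OpenCL or CLplusplus; fake_get_info merely imitates the calling convention of queries such as clGetPlatformInfo so that the example runs without an OpenCL installation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a C-style query (NOT a real OpenCL function).
// It follows the usual convention: report the required size, and copy the
// data if a large enough buffer is provided. Returns 0 on success.
int fake_get_info(std::size_t buffer_size, void* buffer, std::size_t* size_ret) {
    static const char value[] = "Example Platform";  // includes the trailing '\0'
    assert(buffer != nullptr || size_ret != nullptr); // caller must ask for something
    if (buffer != nullptr) {
        if (buffer_size < sizeof(value)) return -30;  // buffer too small
        std::memcpy(buffer, value, sizeof(value));
    }
    if (size_ret != nullptr) *size_ret = sizeof(value);
    return 0;
}

// Runs the two-call protocol once and hands back the raw bytes.
template <typename Query>
std::vector<char> query_bytes(Query query) {
    std::size_t size = 0;
    // First call: ask how much storage the result needs
    if (query(std::size_t{0}, nullptr, &size) != 0)
        throw std::runtime_error("size query failed");
    std::vector<char> bytes(size);
    // Second call: fetch the result into storage that we own
    if (size != 0 && query(size, bytes.data(), nullptr) != 0)
        throw std::runtime_error("data query failed");
    return bytes;
}

// Convenience wrapper for string-valued properties.
template <typename Query>
std::string query_string(Query query) {
    const std::vector<char> bytes = query_bytes(query);
    // Cut at the first '\0' so that the terminator stays out of the std::string
    return std::string(bytes.begin(),
                       std::find(bytes.begin(), bytes.end(), '\0'));
}
```

With such a helper, the caller's side shrinks to a single line such as const std::string name = query_string(fake_get_info);, and a real wrapper would pass a small lambda that forwards to the actual OpenCL call instead.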

Whenever I see an API that uses zero-terminated lists in 2015, my heart goes like this:

Although zero-terminated lists were not the root cause of the Heartbleed attack, they cause countless software crashes and security vulnerabilities every day. The reason is simple: if their trailing zero is ever omitted for any reason (ranging from parser bugs to human error and malice), their parser will read past the end of the list and enter a near-infinite sequence of undefined behavior, which will usually lead the program to crash, and possibly to disclose sensitive private information before that.

Even if this were not a very serious concern, zero-terminated lists are also much less efficient for computers to parse than their equivalent length-and-data representation, which allows a parser to skip any unnecessary information in a very quick way.

Let’s face it: zero-terminated lists are a terrible idea. They were not even good when they emerged, and they have become worse ever since their problems have been widely known. Time to move on, and abstract them away when we can’t do without them altogether.
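As an illustration of what "abstracting them away" can mean, the sketch below converts a zero-terminated (name, value, ..., 0) list, the layout OpenCL uses for context properties, into a container that carries its own length. The function name parse_property_list and the capacity parameter are assumptions made for this example, not CLplusplus API; property_t stands in for cl_context_properties, which the OpenCL headers define as intptr_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Local stand-in for cl_context_properties (an intptr_t in the OpenCL headers)
using property_t = std::intptr_t;

// Hypothetical helper: turns a zero-terminated (name, value, ..., 0) list into
// a vector of pairs. `capacity` is the number of elements the caller knows to
// be readable, so a missing terminator raises an exception instead of causing
// an unbounded read.
std::vector<std::pair<property_t, property_t>>
parse_property_list(const property_t* list, std::size_t capacity) {
    std::vector<std::pair<property_t, property_t>> result;
    if (list == nullptr) return result;  // a null list means "no properties"

    std::size_t i = 0;
    while (i < capacity) {
        const property_t name = list[i];
        if (name == 0) return result;    // terminator found: we are done
        if (i + 1 >= capacity)
            throw std::length_error("property name without a value");
        result.emplace_back(name, list[i + 1]);
        i += 2;
    }
    throw std::length_error("property list is missing its terminator");
}
```

Once the data lives in a std::vector, the rest of the program can use size() and range-based for loops, and never has to scan for a terminator again.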

In general, whenever the contents of an OpenCL object would be better expressed by a higher-level construct, I tried to use this construct instead of the raw thing. But high-level here does not mean bloated. For example, there are cases where OpenCL will happily use a string as an identifier where a scoped enum would do, and in such a case I will pick the enum option and transparently translate the OpenCL string to it.
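One case where this applies is the platform profile, which OpenCL reports as either "FULL_PROFILE" or "EMBEDDED_PROFILE". A string-to-enum translation could be sketched as follows; the names Profile and parse_profile are picked for this example and may differ from what CLplusplus actually uses.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Scoped enum replacing the raw profile string (illustrative name)
enum class Profile { Full, Embedded };

// Translates the string reported by OpenCL into the enum, and refuses
// anything that is not one of the two documented values.
Profile parse_profile(const std::string& raw) {
    if (raw == "FULL_PROFILE") return Profile::Full;
    if (raw == "EMBEDDED_PROFILE") return Profile::Embedded;
    throw std::invalid_argument("unknown OpenCL profile string: " + raw);
}
```

The caller can then write if(profile == Profile::Embedded) and let the compiler catch typos, where a string comparison would fail silently at run time.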

Exceptions

In the 20th century, as software kept growing larger and more complex, it became obvious that returning error codes from functions didn’t scale. So people thought about storing the relevant information in global variables instead. But that still led to huge amounts of boilerplate code in each and every caller of a function that may fail. Also, the default behavior of naive code was not to check for errors, which was undesirable.

Then, in the early 21st century, CPU manufacturers reached the physical limit of clock rate increases, and set out to put multiple CPU cores on a single chip instead. Thus, parallel computing became mainstream. With that event, any error handling technology which relied on global variables became obsolete and dangerous to use.

It is time to face it: exceptions have won. You may not like them, but they are the only method of error handling that will scale to large systems by avoiding error handler code duplication, yet simultaneously prove compatible with modern massively multithreaded processors. It is time to translate the error codes of legacy C APIs to exceptions and move on.
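In practice, such a translation can be as small as one check performed after every C call. Here is a minimal sketch, in which StatusError, throw_if_failed and fake_c_call are hypothetical names chosen for this example. Plain int stands in for cl_int, and the values 0 and -30 mirror CL_SUCCESS and CL_INVALID_VALUE from the OpenCL headers.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Exception type carrying the original status code (illustrative name)
class StatusError : public std::runtime_error {
public:
    explicit StatusError(int code)
        : std::runtime_error("OpenCL call failed with status " +
                             std::to_string(code)),
          code_(code) {}
    int code() const noexcept { return code_; }
private:
    int code_;
};

// Single choke point: every status code returned by the C API goes through it
inline void throw_if_failed(int status) {
    if (status != 0) throw StatusError(status);  // 0 plays the role of CL_SUCCESS
}

// Hypothetical stand-in for a C function reporting errors via its return value
inline int fake_c_call(bool succeed) {
    return succeed ? 0 : -30;                    // -30 mirrors CL_INVALID_VALUE
}
```

A wrapper method then reads throw_if_failed(fake_c_call(...)); and code which does not care about the failure simply lets the exception propagate, without any global state being shared between threads.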

C++11 constructs

I don’t like C++ very much as a programming language. It seems to me that all it does is to replicate the design errors of C without learning from them, then add tons of new ones, and produce an end result which is both buggy, error-prone AND incomprehensible to the average programmer.

However, C++ is the standard language at my future workplace. And I guess many Java and VB.net programmers around the world will agree that we can’t fight the bad taste of the software industry in programming languages. So we had better make the most of what we have.

If I am going to use C++ in a project, I want to at least use a version of C++ that gives me a minimal level of comfort. Closures, type inference, truly general function pointers, that kind of thing. So I’m sad for the MSVC users around the world, but it is time for C++ libraries to move on and switch to the C++11 standard. Without pressure, compiler vendors like Microsoft and Intel won’t stop dragging their feet and start actually supporting modern C++ releases.

What this means for an OpenCL developer is that CLplusplus makes things like filtering devices in order to match some hardware requirements easier than ever. Here’s an example snippet which looks for all available OpenCL device/platform combinations which match an OpenCL version requirement, support out-of-order command execution, and can natively handle double-precision floating point numbers.

const auto filtered_platforms =
CLplusplus::get_filtered_devices(

   // Platform filter. This is a perfect job for a lambda !
   [](const CLplusplus::Platform & platform) -> bool {

      // Platform OpenCL version check
      return (platform.version() >= target_version);

   },

   // Device filter
   [](const CLplusplus::Device & device) -> bool {

      // A platform may support older-generation devices,
      // so we need to check for device version too
      if(device.version() < target_version) return false;

      // Let's get a couple of definitions out of the way
      const bool ooe_execution = device.queue_properties() &
         CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
      const auto double_config = device.double_fp_config();

      // Now we can cleanly express our device requirements
      return device.available() &&
             ooe_execution &&
             (double_config != 0) &&
             ((double_config & CL_FP_SOFT_FLOAT) == 0);

   }

);

Similarly, if you believe like me that 21st century developers should not need to concern themselves with the internal intricacies of library callback calling conventions, well, CLplusplus might also be for you:

const auto context_callback = [](const std::string & errinfo,
                                 const void * private_info,
                                 size_t cb) -> void
{
      std::cout << std::endl
                << "OPENCL CONTEXT ERROR: " << errinfo
                << std::endl;
};
CLplusplus::Context context(context_properties,
                            target_device,
                            context_callback);
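For the curious, a common way to bridge a C++ callable to a C callback is a "trampoline" function which receives the callable through the user_data pointer. The sketch below shows the general technique and is not necessarily how CLplusplus does it. The c_notify_fn signature mirrors the notification callback of clCreateContext (minus the CL_CALLBACK calling convention macro), while ContextCallback, notify_trampoline and fake_runtime_report are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Shape of the notification callback expected by the C API
using c_notify_fn = void (*)(const char* errinfo, const void* private_info,
                             std::size_t cb, void* user_data);

// What we would like users to write instead (illustrative alias)
using ContextCallback =
    std::function<void(const std::string&, const void*, std::size_t)>;

// Trampoline: a plain function with the C signature, which recovers the C++
// callable from user_data and forwards the call to it.
void notify_trampoline(const char* errinfo, const void* private_info,
                       std::size_t cb, void* user_data) {
    const auto* callback = static_cast<const ContextCallback*>(user_data);
    if (callback != nullptr && *callback) {
        (*callback)(errinfo != nullptr ? errinfo : "", private_info, cb);
    }
}

// Hypothetical stand-in for the OpenCL runtime reporting a context error
void fake_runtime_report(c_notify_fn notify, void* user_data) {
    assert(notify != nullptr);
    notify("device lost", nullptr, 0, user_data);
}
```

The one thing to be careful about with this technique is lifetime: the std::function object must outlive the context that may call it, which is why a wrapper class would typically store it as a member.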

Finally, pretty much every OpenCL-generated object which looks like an iterable is iterable using range-based for loops in CLplusplus, which makes for much cleaner code in practice:

const auto extensions = platform.extensions();
std::cout << "Platform supports "
          << extensions.size() << " extension(s):"
          << std::endl;
for(const auto & extension : extensions) {
   std::cout << " * " << extension << std::endl;
}
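Behind the scenes, OpenCL reports extensions as a single space-separated string, so a wrapper has to split it before a loop like the one above can work. A minimal sketch of that step, with split_extensions being a name made up for this example:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits the space-separated extension string into individual names
std::vector<std::string> split_extensions(const std::string& raw) {
    std::istringstream stream(raw);
    std::vector<std::string> names;
    std::string name;
    // operator>> skips any run of whitespace, so leading, trailing and
    // doubled spaces do not produce empty entries
    while (stream >> name) names.push_back(name);
    return names;
}
```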

Reference-counted objects

In its basic design, OpenCL acknowledges that manual resource management is a bad idea, and that reference counting is the only way to manage any kind of resource in a safe and deterministic way. However, because it is a C API, it cannot go the last mile and actually perform the reference counting itself. That is a mission for higher-level language bindings.
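In C++, going that last mile usually means tying the retain and release calls to copy construction and destruction. The following sketch shows the idea on a fake handle type: FakeObject, fake_retain and fake_release are stand-ins for an OpenCL handle and its clRetain*/clRelease* pair, and the Handle class is an illustration, not the actual CLplusplus code.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-ins for an OpenCL handle and its retain/release functions
struct FakeObject { int ref_count; };
using fake_handle = FakeObject*;
inline void fake_retain(fake_handle h)  { assert(h != nullptr); ++h->ref_count; }
inline void fake_release(fake_handle h) { assert(h != nullptr); --h->ref_count; }

class Handle {
public:
    // Adopts one existing reference: no retain here, as object creation
    // already returns a handle with a reference count of one
    explicit Handle(fake_handle raw) noexcept : raw_(raw) {}

    // Copying shares the object, so the reference count goes up
    Handle(const Handle& other) noexcept : raw_(other.raw_) {
        if (raw_) fake_retain(raw_);
    }

    // Moving transfers the reference, so the count is left untouched
    Handle(Handle&& other) noexcept : raw_(other.raw_) { other.raw_ = nullptr; }

    // Copy-and-swap covers both copy and move assignment
    Handle& operator=(Handle other) noexcept {
        std::swap(raw_, other.raw_);
        return *this;
    }

    // Every live Handle gives back exactly one reference
    ~Handle() { if (raw_) fake_release(raw_); }

    // Escape hatch for callers who need the raw handle
    fake_handle raw() const noexcept { return raw_; }

private:
    fake_handle raw_;
};
```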

I have also spent a lot of time ensuring that every OpenCL object is accessible in a form that actually looks and behaves like a C++ object, including when it comes to things like data accessors. Yet CLplusplus does not hide its power, and will happily give things like raw OpenCL object handles to an informed caller who needs them anyway.

Better abstractions (when all else has failed)

Sometimes, thankfully in rare cases, the abstraction provided by OpenCL to address a specific use case is simply poorly chosen. This is the case, for example, when OpenCL uses callbacks as a mechanism for asynchronous program object compilation. Because such compilation is a central part of an OpenCL application’s workflow, which the rest of the application must wait for at some point, using a callback in this situation results in spaghetti code whose tentacles are difficult to synchronize with one another. It is simply the wrong abstraction in this case.

Which is why in this scenario, CLplusplus uses a combination of asynchronous events and futures instead. Events are the standard OpenCL abstraction for asynchronous work synchronization, and future-like behavior provides additional usability for a scenario where an asynchronous program build must later be followed by synchronous kernel setup.
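To illustrate the general idea of turning a completion callback into a future, here is a sketch built on std::promise. It only shows the technique and makes no claim about the CLplusplus implementation: fake_build_async and fake_runtime_poll are a made-up, single-threaded stand-in for a callback-based build API, and their callback receives a status code for simplicity, which is not the signature that clBuildProgram uses.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <vector>

// Hypothetical callback-based API: builds are queued by fake_build_async and
// their callbacks are fired later on, when fake_runtime_poll is called.
using build_callback = void (*)(int status, void* user_data);
struct PendingBuild { build_callback callback; void* user_data; int status; };

inline std::vector<PendingBuild>& pending_builds() {
    static std::vector<PendingBuild> queue;
    return queue;
}
inline void fake_build_async(int status, build_callback callback, void* user_data) {
    assert(callback != nullptr);
    pending_builds().push_back({callback, user_data, status});
}
inline void fake_runtime_poll() {
    std::vector<PendingBuild> ready;
    ready.swap(pending_builds());
    for (const PendingBuild& build : ready)
        build.callback(build.status, build.user_data);
}

// Callback handed to the C-style API: fulfils the promise, then frees it
inline void fulfil_promise(int status, void* user_data) {
    std::unique_ptr<std::promise<int>> promise(
        static_cast<std::promise<int>*>(user_data));
    promise->set_value(status);
}

// Future-returning front end: the caller never sees the callback
inline std::future<int> build_with_future(int simulated_status) {
    std::unique_ptr<std::promise<int>> promise(new std::promise<int>());
    std::future<int> result = promise->get_future();
    // Ownership of the promise travels through user_data to fulfil_promise
    fake_build_async(simulated_status, fulfil_promise, promise.release());
    return result;
}
```

The application can then start a build, do unrelated setup work, and call get() on the future at the exact point where it needs the compiled program, instead of scattering that logic across callbacks.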

When CLplusplus has to cheat with the core OpenCL abstractions like this in order to produce a usable interface, we promise that we will do our best to get our improvements merged into the core specification. Also, no one who is proficient with OpenCL will ever be forced to use our alternative abstraction in such situations: if you want to use the raw OpenCL functionality, it will remain available to you.

What about the standard C++ wrapper?

While the OpenCL standard does try to specify a C++ wrapper for its API, suffice it to say that if I tried to express each and every one of my nitpicks about it on paper, I would need a couple of crates of blank sheets.

Basically, my top issues with that wrapper are that:

It badly abuses C++ headers, in a fashion that will cause needlessly long application compilation times
Like any standard-defined code, it is full of deprecated gibberish that makes it practically impossible to read and modify, most notable amongst these being a full redefinition of some STL containers. This is why standards should solely focus on defining APIs, and not implementations.
It does not go far enough in its quest to abstract away the ugliness of C data types. Again, the OpenCL API is free to use the data representations it likes, but my personal belief is that no application code should ever need to directly deal with zero-terminated lists in the 21st century.
It fails to leverage the full power of modern C++. I know, MSVC users. I believe I have already made abundantly clear what I think about this argument.
It uses a heavy syntax: for example, function templates are used in situations where they are not really needed, and getters are prefixed with “get”. Simple tasks should look simple (and too many template instantiations are, again, bad for compile times).

Project location and status

If you are interested and would like to take a closer look, I have taken the code to GitHub, along with a couple example applications.

The library currently provides a full abstraction of the core OpenCL 1.2 platform layer, which is to say that you can use it to manage platforms, devices, and context objects. Next, my goal is to also have it cover the core runtime layer (command queues, buffers, and all that), leading to a full core OpenCL 1.2 wrapper.

However, extensions and more recent OpenCL versions are problematic.

The main challenge when it comes to OpenCL versions is that I do not have an OpenCL 2.0 implementation handy, because NVidia do not want their OpenCL implementation to compete with their proprietary CUDA offering, and Intel only provide OpenCL implementations for their top-of-the-line Xeon chips. As I refuse to work on the support of an API which I cannot test on one of my machines, this means I am stuck for now.

Similarly, I’m not sure I could test all the graphics interoperability extensions on the hardware and OSs that I have available at home. And from what I’ve read on the web, these are broken in most implementations anyway.

That being said, I am highly interested in extensions which extend the core OpenCL feature set, such as cl_khr_initialize_memory and cl_khr_terminate_context. So it is likely that some of these are going to end up in CLplusplus at some point.

The OS|periment

Musings on personal computer operating systems