Complete Application Tutorial
This section gives a walkthrough of how a simple Celerity application can be set up from start to finish. Before you begin, make sure you have built and installed Celerity and all of its dependencies.
We are going to implement a simple image processing kernel that performs edge detection on an input image and writes the resulting image back to the filesystem. Here you can see how a result might look (white parts):
Original image by Reinhold Möller (CC-BY-SA 4.0).
Setting Up a CMake Project
The first thing you typically want to do when writing a Celerity application
is to set up a CMake project. For this, create a new folder for your project
and in it create a file CMakeLists.txt with the following contents:
cmake_minimum_required(VERSION 3.13)
project(celerity_edge_detection)
find_package(Celerity CONFIG REQUIRED)
add_executable(edge_detection edge_detection.cpp)
add_celerity_to_target(TARGET edge_detection SOURCES edge_detection.cpp)
With this simple CMake configuration file we've created a new executable
called edge_detection that links to Celerity. The important section is the
call to add_celerity_to_target, where we specify both the target that we
want to turn into a Celerity executable, as well as all source files that
should be compiled for accelerator execution.
Create an empty file edge_detection.cpp next to your CMakeLists.txt.
Then, create a new folder build inside your project directory, navigate
into it and simply run cmake .. to configure your project. Just as during
installation, you might have to provide some additional parameters to CMake
in order for it to find and/or configure Celerity and its dependencies.
Image Handling Boilerplate
We're going to start by adding the necessary code to load (and later save) an
image file. To this end, we'll use the stb single file libraries. Download
stb_image.h and stb_image_write.h from GitHub and drop them next to our
source file.
Next, add the following code to edge_detection.cpp:
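#include <cstdint> // uint8_t
#include <cstdlib> // EXIT_SUCCESS, EXIT_FAILURE
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"
int main(int argc, char* argv[]) {
    // Expect exactly one argument: the path of the image to load.
    if(argc != 2) return EXIT_FAILURE;
    int img_width, img_height;
    // Request 1 channel so that stb converts the image to grayscale.
    uint8_t* img_data = stbi_load(argv[1], &img_width, &img_height, nullptr, 1);
    stbi_image_free(img_data);
    return EXIT_SUCCESS;
}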
First we check that the user provided an image file name and if so, we load
the corresponding file using stbi_load(). The last parameter tells stb that
we want to load the image as grayscale. The result is then stored in an array
of uint8_t, which consists of img_height lines of size img_width each.
We then immediately free the image again and exit.
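To make the memory layout concrete, here is a minimal sketch of how a pixel at row y and column x can be looked up in such an array, assuming the tightly packed, single-channel, row-major layout described above. The helpers pixel_at and make_test_image are hypothetical names used for illustration only; they are not part of stb or Celerity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of stb or Celerity): returns the pixel at
// row y, column x of a tightly packed, single-channel, row-major image.
inline std::uint8_t pixel_at(const std::uint8_t* data, int width, int height, int y, int x) {
    assert(y >= 0 && y < height && x >= 0 && x < width);
    // Each row occupies `width` consecutive bytes, so row y starts at y * width.
    return data[static_cast<std::size_t>(y) * width + x];
}

// Builds a small test image with 2 rows and 3 columns, holding the values
// 0..5 in row-major order.
inline std::vector<std::uint8_t> make_test_image() {
    return {0, 1, 2,
            3, 4, 5};
}
```

With this layout, pixel_at(img.data(), 3, 2, 1, 0) on the test image reads the first pixel of the second row. The same (y, x) ordering is what the two-dimensional buffers in the next section use.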
Now might be a good time to compile and run the program to make sure everything works so far.
Celerity Queue and Buffers
With everything set up, we can now begin to implement the Celerity portion of our application. The first thing that we will require in any Celerity program is the distributed queue. Similar to how a SYCL queue allows you to submit work to a compute device, the Celerity distributed queue allows you to submit work to the distributed runtime system -- which will subsequently be split transparently across all available worker nodes.
Additionally, we will require buffers to store our input image as well as the resulting edge-detected image in a way that can be efficiently accessed by the GPU. Let's create our two buffers and the distributed queue now:
#include <celerity/celerity.h>
...
uint8_t* img_data = stbi_load(argv[1], &img_width, &img_height, nullptr, 1);
celerity::buffer<uint8_t, 2> input_buf(img_data, celerity::range<2>(img_height, img_width));
stbi_image_free(img_data);
celerity::buffer<uint8_t, 2> edge_buf(celerity::range<2>(img_height, img_width));
celerity::distr_queue queue;
...
With this we've created a couple of two-dimensional buffers, input_buf and
edge_buf, that both store values of type uint8_t and are of size
(img_height, img_width). Notice that we initialize input_buf using the
image data that we've just read. We can then immediately free the raw image,
as we no longer need it. edge_buf on the other hand is not being
initialized with any existing data, as it will be used to store the result of
our image processing kernel.
Detecting Those Edges
Now we are ready to do the actual edge detection. For this we will write a kernel function that will be executed on one or more GPUs. The way kernels are specified in Celerity is very similar to how it is done in SYCL:
queue.submit([&](celerity::handler& cgh) {
    // TODO: Buffer accessors
    cgh.parallel_for<class MyEdgeDetectionKernel>(
        celerity::range<2>(img_height - 2, img_width - 2),
        celerity::id<2>(1, 1),
        [=](celerity::item<2> item) {
            // TODO: Kernel code
        }
    );
});
We call queue.submit() to inform the Celerity runtime that we want to
execute a new kernel function. As an argument, we pass a so-called command
group, which is written as a C++ lambda function. Command groups themselves
are not executed on an accelerator. Instead, they serve as a way of tying
kernels to buffers, informing the runtime system exactly how we plan to access
different buffers from within our kernels. This is done through buffer
accessors, which we will create in a minute.
The actual kernel code that will be executed on our compute device(s) resides
within the last argument to the celerity::handler::parallel_for function,
again concisely written as a lambda expression. Let us continue by fleshing
out the kernel code. Replace the "TODO: Kernel code" comment with the
following code:
int sum = r_input[{item[0] + 1, item[1]}] + r_input[{item[0] - 1, item[1]}]
+ r_input[{item[0], item[1] + 1}] + r_input[{item[0], item[1] - 1}];
w_edge[item] = 255 - std::max(0, sum - (4 * r_input[item]));
This kernel computes a discrete Laplace filter - a simple type of edge
detection filter - by summing up the four pixel values along the main axes
surrounding the current result pixel and subtracting four times the current
pixel value from that sum. Negative differences are clamped to zero. We then
subtract the resulting value from the maximum value a uint8_t can store (255)
in order to get a white image with black edges. The current pixel position is
described by the celerity::item<2> we receive as an argument to our kernel
function. This two-dimensional item corresponds to a y/x position in our input
and output images and can be used to index into the respective buffers.
However, we're not using the buffers directly; instead we are indexing into
the aforementioned buffer accessors. Let's create these now - replace the
"TODO: Buffer accessors" comment before the kernel function with the
following:
celerity::accessor r_input{input_buf, cgh, celerity::access::neighborhood{1, 1}, celerity::read_only};
celerity::accessor w_edge{edge_buf, cgh, celerity::access::one_to_one{}, celerity::write_only, celerity::no_init};
If you have worked with SYCL before, these buffer accessors will look
familiar to you. Accessors tie kernels to the data they operate on by declaring
the type of access that we want to perform: We want to read from our
input_buf, and want to write to our edge_buf. Additionally, we do not care
at all about preserving any of the previous contents of edge_buf, which is why
we choose to discard them by also passing the celerity::no_init property.
So far everything works exactly as it would in a SYCL application. However,
there is an additional parameter passed into the accessor constructor that is
not present in its SYCL counterpart. In fact, this parameter represents one of
Celerity's most important API additions: While access modes (such as read and
write) tell the runtime system how a kernel intends to access a buffer, they
do not convey any information about where a kernel will access said buffer.
In order for Celerity to be able to split a single kernel execution across
potentially many different worker nodes, it needs to know how each of those
kernel chunks will interact with the input and output buffers of a kernel
-- i.e., which node requires which parts of the input, and produces which
parts of the output. This is where Celerity's so-called range mappers come
into play.
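To illustrate what a "kernel chunk" might look like, the following sketch splits the rows of a kernel's index space into contiguous pieces, one per node. This is only an assumed, simplified scheme for illustration: how Celerity actually splits kernels is not described in this tutorial, and row_chunk and split_rows are hypothetical names, not Celerity types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration only; Celerity's own chunk type differs.
struct row_chunk {
    std::size_t first_row; // first kernel row handled by this node
    std::size_t num_rows;  // number of kernel rows handled by this node
};

// Splits `total_rows` kernel rows into `num_nodes` contiguous chunks whose
// sizes differ by at most one row. Earlier nodes receive the extra rows.
inline std::vector<row_chunk> split_rows(std::size_t total_rows, std::size_t num_nodes) {
    assert(num_nodes > 0);
    std::vector<row_chunk> chunks;
    const std::size_t base = total_rows / num_nodes;
    const std::size_t remainder = total_rows % num_nodes;
    std::size_t next = 0;
    for(std::size_t n = 0; n < num_nodes; ++n) {
        const std::size_t rows = base + (n < remainder ? 1 : 0);
        chunks.push_back({next, rows});
        next += rows;
    }
    return chunks;
}
```

Given such a chunk, a range mapper answers the question of which part of a buffer that chunk reads or writes.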
Let us first discuss the range mapper for edge_buf, as it represents the
simpler of the two cases. Looking at the kernel function, you can see that
for each invocation of the kernel -- i.e., for each work item -- we only ever
access the output buffer once: at exactly the current location represented by
the item. This means there exists a one-to-one mapping of the kernel index
and the accessed buffer index. For this reason we pass a
celerity::access::one_to_one range mapper (which means that kernel and
buffer need to have the same dimensionality, and that every index in the
kernel's index space must be a valid index into the buffer).
The range mapper for our input_buf is a bit more complicated, but not by
much: Remember that for computing the Laplace filter, we are summing up the
pixel values of the four surrounding pixels along the main axes and
calculating the difference to the current pixel. This means that in addition
to reading the pixel value associated with each item, each kernel thread also
reads a 1-pixel neighborhood around the current item. This being another
very common pattern, it can be expressed with the
celerity::access::neighborhood range mapper. The parameters (1, 1)
signify that we want to access a 1-item boundary in each dimension
surrounding the current work item.
While we are using built-in range mappers provided by the Celerity API, they can in fact also be user-defined functions! For more information on range mappers, see Range Mappers.
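As a rough sketch of what a neighborhood-style mapping computes, the function below takes a two-dimensional kernel chunk, grows it by a border in each dimension, and clamps the result to the buffer size. It assumes the mapped region is the bounding box around the chunk; box2d and expand_by_border are hypothetical names for illustration and do not reflect Celerity's actual implementation or types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical 2D box type for illustration only; not a Celerity type.
struct box2d {
    std::size_t offset[2]; // (y, x) of the first element
    std::size_t range[2];  // extent in (y, x)
};

// Expands `chunk` by `border_y` rows and `border_x` columns on every side and
// clamps the result to a buffer of size (buffer_h, buffer_w).
inline box2d expand_by_border(const box2d& chunk, std::size_t border_y, std::size_t border_x,
                              std::size_t buffer_h, std::size_t buffer_w) {
    const std::size_t border[2] = {border_y, border_x};
    const std::size_t buffer_range[2] = {buffer_h, buffer_w};
    box2d result{};
    for(int d = 0; d < 2; ++d) {
        assert(chunk.offset[d] <= buffer_range[d]);
        // Avoid unsigned underflow when the chunk already touches the lower edge.
        const std::size_t begin = chunk.offset[d] > border[d] ? chunk.offset[d] - border[d] : 0;
        const std::size_t end = std::min(chunk.offset[d] + chunk.range[d] + border[d], buffer_range[d]);
        result.offset[d] = begin;
        result.range[d] = end - begin;
    }
    return result;
}
```

For example, a chunk covering 3 rows and 6 columns starting at (1, 1) in an 8x8 buffer would, with a border of (1, 1), require rows 0 to 4 and all 8 columns of the input.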
Lastly, there are two more things of note for the call to parallel_for: The
first is the kernel name. Just like in SYCL, each kernel function in
Celerity may have a unique name in the form of a template type parameter.
Here we chose MyEdgeDetectionKernel, but this can be anything you like.
Kernel names used to be mandatory in SYCL 1.2.1 but have since become optional.
Finally, the first two parameters to the parallel_for function tell
Celerity how many individual GPU threads (or work items) we want to execute.
In our case we want to execute one thread for each pixel of our image, except
for a 1-pixel border on the outside of the image - which is why we subtract
2 from our image size in both dimensions and additionally specify the
execution offset of (1, 1). Why this is a good idea is left as an
exercise for the reader ;-).
...and that's it, we successfully submitted a kernel to compute an edge detection filter on our input image and store the result in an output buffer. The only thing that remains to do now is to save the resulting image back to a file.
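For checking the kernel's output on small inputs, the same computation can be written as a plain host-side loop. The following is a minimal sketch, not part of the tutorial application: laplace_reference is a hypothetical helper, it assumes a row-major single-channel image, and it fills the untouched border pixels with 255, which is an assumption since the kernel never writes them. Like the kernel, it converts the result to uint8_t without clamping, so very strong edges wrap around.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference of the tutorial kernel for a row-major,
// single-channel image. Border pixels are set to 255 (an assumption; the
// kernel leaves them unwritten).
inline std::vector<std::uint8_t> laplace_reference(const std::vector<std::uint8_t>& in, int width, int height) {
    assert(in.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> out(in.size(), 255);
    // Reads the pixel at row yy, column xx as an int to avoid 8-bit overflow.
    const auto at = [&](int yy, int xx) {
        return static_cast<int>(in[static_cast<std::size_t>(yy) * width + xx]);
    };
    // Same iteration space as the kernel: offset (1, 1), range (height - 2, width - 2).
    for(int y = 1; y < height - 1; ++y) {
        for(int x = 1; x < width - 1; ++x) {
            const int sum = at(y + 1, x) + at(y - 1, x) + at(y, x + 1) + at(y, x - 1);
            const int value = 255 - std::max(0, sum - 4 * at(y, x));
            out[static_cast<std::size_t>(y) * width + x] = static_cast<std::uint8_t>(value);
        }
    }
    return out;
}
```

On a uniform 3x3 image the center pixel stays at 255 (no edge), while a dark center pixel surrounded by brighter neighbors yields a value below 255.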
Saving The Result
To write the image resulting from our kernel execution back to a file, we need
to pass the contents of edge_buf back to the host. Similar to SYCL 2020,
Celerity offers host tasks for this purpose. In the distributed memory setting,
we opt for the simple solution of transferring the entire image to one node
and writing the output file from there.
Just like the compute tasks we created above by calling
celerity::handler::parallel_for, we can instantiate a host task on the
command group handler by calling celerity::handler::host_task. Add the
following code at the end of your main() function:
queue.submit([&](celerity::handler& cgh) {
celerity::accessor out{edge_buf, cgh, celerity::access::all{}, celerity::read_only_host_task};
cgh.host_task(celerity::on_master_node, [=]() {
stbi_write_png("result.png", img_width, img_height, 1, out.get_pointer(), 0);
});
});
Just as in compute kernel command groups, we first obtain accessors for the
buffers we want to operate on within this task. Since we need access
to the entire buffer, we pass an instance of the all range mapper.
Then we supply the code to be run on the host as a lambda function to
celerity::handler::host_task. As the tag celerity::on_master_node
implies, we select the overload that calls our host task on a single node
-- the master node. Since the code is executed on the host, we are able to
use it for things such as result verification and I/O. In this case, we call
stbi_write_png to write our resulting image into a file called result.png.
Note: While master-node tasks are easy to use, they do not scale to larger problems. For real-world applications, transferring all data to a single node may be either prohibitively expensive or impossible altogether. Instead, collective host tasks can be used to perform distributed I/O with libraries like HDF5. This feature is currently experimental.
Running The Application
After you've built the executable, you can try and run it by passing an image file as a command line argument like so:
./edge_detection ./my_image.jpg
If all goes well, a result file named result.png should then be located in
your working directory.
Since Celerity applications are built on top of MPI internally, you can now also try running it on multiple nodes:
mpirun -n 4 ./edge_detection ./my_image.jpg