Introduction to OpenCL
What is OpenCL?
● "Open Computing Language"
● A framework for writing programs that run across heterogeneous platforms (CPU, GPU, DSP, FPGA, Cell, etc.)
● Includes a language (based on C99) and APIs to control the devices
● Provides both task-based and data-based parallelism
Data vs. task-based parallelism
● Data-based parallelism: as in CUDA, the same operation (or set of operations) is executed in parallel over a large set of data
● Task-based parallelism: multiple independent tasks are launched in parallel on independent sets of data
● Data-based and task-based parallelism can be combined in the same application
Structure
● One host connected to one or more compute devices
● device → collection of compute units
● compute unit → collection of processing elements
● processing element → executes code as SIMD or SPMD (single instruction/program, multiple data)
OpenCL memory model
Memory resources: buffers & images.
Terminology
● Host – the CPU; it submits work to the compute devices
● Work item – basic unit of work for an OpenCL device; work items are grouped into local workgroups
● Kernel – code for a work item (a C function)
● Program – collection of kernels and other functions
● Global dimension → the range of computation (the whole problem space)
● Local dimension → the size of a workgroup (the part of the problem handled within one workgroup)
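How the global and local dimensions relate can be sketched in plain C (this is not OpenCL API code). The helper names `global_id` and `num_groups` are hypothetical; the first mirrors what the OpenCL C built-in get_global_id() returns in one dimension, assuming a zero global offset and a global size that is a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the OpenCL API): global index of a work
   item in one dimension, assuming a zero global offset. */
static size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Hypothetical helper: number of workgroups needed to cover global_size
   work items, assuming global_size is a multiple of local_size. */
static size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, a global dimension of 1024 with a local dimension of 64 gives 16 workgroups, and work item 5 of workgroup 3 has global index 3 * 64 + 5 = 197.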
Terminology (cont.)
● Command queue – used to submit work to a device:
  ● commands are queued in-order
  ● commands are executed in-order or out-of-order
● With an out-of-order queue, commands run in an order governed by the availability of their input data rather than by the order in which they were enqueued
● Context – composed of devices, memory objects, command queues
Synchronization
● Only within a workgroup (local), via:
  ● barriers (synchronize execution)
  ● memory fences (synchronize memory access)
● No global synchronization → work items in different workgroups must be independent; they cannot synchronize outside a workgroup
● Events on the host – to run kernels in sequence, a kernel can wait for an event from a previous one
● No locks or mutexes, though...
Advantages of OpenCL
● One standard API for parallel programming (unlike CUDA or Cell, which use a specific API to access the resources)
● Speedup for computationally intensive applications when all available resources are used
● Most PCs have an OpenCL-compatible graphics card, so it is inexpensive and easy to start working with OpenCL
● Platform independent
Usage & setup
Phase            Language   Runs on
Initialization   C/C++      CPU
Execution        OpenCL     CPU, GPU
Termination      C/C++      CPU
Initialization:
1) Get the platforms and devices
   cl_int status;
   cl_uint numDev;
   cl_device_id devices[2];
   // NULL platform: which platform is used is implementation-defined
   status = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &devices[0], &numDev);
   status = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &devices[1], &numDev);
2) Create the context
   cl_context ctx = clCreateContext(NULL, 2, devices, NULL, NULL, &status);
3) Create command queues – each device must have a queue, and all work is submitted through queues
   cl_command_queue queue_gpu = clCreateCommandQueue(ctx, devices[0], 0, &status);
   cl_command_queue queue_cpu = clCreateCommandQueue(ctx, devices[1], 0, &status);
Initialization (cont.):
4) Create memory objects
   cl_mem buffIN  = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, NULL);
   cl_mem buffOUT = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, NULL);
Termination:
   clFinish(queue_cpu);  // wait until all commands in the queue have completed
   // blocking map: returns a host pointer to the contents of buffOUT
   int *result = (int *) clEnqueueMapBuffer(queue_cpu, buffOUT, CL_TRUE, CL_MAP_READ, 0, size, 0, NULL, NULL, NULL);
   // do something with the result
Compilation and execution
● Kernel functions can be kept in a separate file (kernels.cl)
● A kernel inside this file is declared as:
   __kernel void fun(__global [type] *param1, [type] param2, …)
   {
       // input & output are both given as parameters
       // … code ...
   }
● Compiling the kernel, setting the arguments and running it:
   cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
   check(clBuildProgram(prog, 1, &device, NULL, NULL, NULL));
   cl_kernel kernel = clCreateKernel(prog, "fun", &err);
   clSetKernelArg(kernel, i, sizeof(buffIN), &buffIN);  // set the i-th kernel argument
   clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);  // start the kernel
● The program is built at run time for each device; compiler flags can be passed in the options argument of clBuildProgram (NULL above)
C language features in OpenCL
Derived from ISO C99, but without:
* function pointers
* recursion, etc.
Additions:
* work items, workgroups
* vector types
* synchronization → barrier() function
* new built-in functions: integer/image/vector functions
Address space qualifiers:
* __global – memory allocated from the global address space
* __constant
* __local
* __private
Access qualifiers for images:
* __read_only
* __write_only
Data types: scalar, image, vector
   int4 v = (int4)(0, 1, 2, 3);
   v1 += v2;  // vector addition
Example – applying a Gaussian filter to an image
Gauss filter: each pixel of the output image is the weighted average of the pixel and its neighbors, with the following weights (they sum to 16):
   1 2 1
   2 4 2
   1 2 1
One kernel function is executed by all compute units (data-based parallelism).
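A sequential reference for the filter above, as a minimal sketch in plain C rather than OpenCL C. `gauss_pixel` is what a single work item would compute and `gauss_filter` stands in for the 2D range of work items. The function names, the 8-bit grayscale row-major layout and the border policy (edge pixels are repeated) are assumptions, not taken from the slides.

```c
#include <assert.h>

/* 3x3 Gaussian weights from the slide; they sum to 16. */
static const int GAUSS_W[3][3] = { {1, 2, 1}, {2, 4, 2}, {1, 2, 1} };

/* Clamp an index into [0, n-1] (assumed border policy: repeat edge pixels). */
static int clamp_index(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/* Work done by one work item: the filtered value of pixel (x, y). */
static unsigned char gauss_pixel(const unsigned char *in, int width, int height,
                                 int x, int y)
{
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int sx = clamp_index(x + dx, width);
            int sy = clamp_index(y + dy, height);
            sum += GAUSS_W[dy + 1][dx + 1] * in[sy * width + sx];
        }
    }
    return (unsigned char)(sum / 16);  /* normalize by the sum of the weights */
}

/* Sequential stand-in for the 2D range of work items: one call per pixel. */
static void gauss_filter(const unsigned char *in, unsigned char *out,
                         int width, int height)
{
    assert(in != out);  /* in-place filtering would read already-filtered pixels */
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            out[y * width + x] = gauss_pixel(in, width, height, x, y);
}
```

Because every output pixel depends only on the input image, the per-pixel calls are independent of each other, which is what makes the filter a good fit for data-based parallelism.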
Example – histogram equalisation
A transformation that produces an image with a uniformly distributed histogram; it increases the visibility of details by maximizing the contrast.
2 kernel functions:
→ one calculates n[i], the number of pixels with intensity <= i, for each i in [0...255]
→ another calculates the new values for the pixels using the formula:
   newVal = ( n[val] - n[0] ) * 255 / ( totalNumPixels - n[0] )
where
   val = previous value of the pixel
   n[i] = number of pixels with value <= i (the cumulative count from the first kernel)
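The two steps can be sketched sequentially in plain C (not OpenCL C). This sketch assumes that n[i] is the cumulative count produced by the first kernel, that the image is 8-bit grayscale, and that an image consisting only of zeros is left unchanged to avoid a division by zero; all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* First kernel's job: n[i] = number of pixels with value <= i. */
static void cumulative_histogram(const unsigned char *img, size_t count,
                                 unsigned long n[256])
{
    for (int i = 0; i < 256; ++i) n[i] = 0;
    for (size_t p = 0; p < count; ++p) n[img[p]]++;   /* plain histogram */
    for (int i = 1; i < 256; ++i) n[i] += n[i - 1];   /* running sum */
}

/* Second kernel's job, per pixel: apply the formula from the slide. */
static unsigned char equalize_pixel(unsigned char val, const unsigned long n[256],
                                    size_t total)
{
    if (total == n[0]) return val;  /* every pixel is 0: avoid dividing by zero */
    return (unsigned char)((n[val] - n[0]) * 255 / (total - n[0]));
}

/* Sequential stand-in for launching the two kernels one after the other. */
static void equalize(const unsigned char *in, unsigned char *out, size_t count)
{
    unsigned long n[256];
    assert(in != NULL && out != NULL);
    cumulative_histogram(in, count, n);
    for (size_t p = 0; p < count; ++p)
        out[p] = equalize_pixel(in[p], n, count);
}
```

For a four-pixel image with values 0, 50, 100 and 200, the cumulative counts are 1, 2, 3 and 4, so the new values are 0, 85, 170 and 255: the pixel values are spread over the full range.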
Other facts about OpenCL
* OpenCL 2.0, the current version at the time of writing, adds:
→ pipes – memory objects that store data organized as a FIFO, created by the host and read/written by kernels (enabling producer-consumer relationships between kernels)
→ dynamic parallelism – a device can enqueue kernels to itself, whereas in previous versions only the host could do so
→ shared virtual memory
→ improved atomic functions – load, store, exchange, fetch and modify, clear, test and set, etc.
→ read/write image objects – new image formats & other features
* In addition to C/C++, OpenCL can be used from Python through PyOpenCL: fewer lines of code, and errors are translated into Python exceptions
* OpenCL kernels can also be called from languages such as Java, JavaScript, Haskell, Perl, Ruby, etc.
OpenCL vs. CUDA
OpenCL                       | CUDA
Heterogeneous environments   | Only for GPU
Less famous than CUDA        | Widely used
--                           | Support for developers: books & certification exams
Supported by AMD & Intel     | Supported by Nvidia
--                           | Provides debugger, profiler
OpenCL and CUDA have similar platform, memory, execution, and programming models, so OpenCL can be a good alternative to CUDA. When running on GPUs, the same good practices apply in OpenCL as in CUDA, such as avoiding branches (if, while, for). For OpenCL, a trade-off implementation must be found that gives good results on all target devices (CPU, GPU, etc.), which requires knowing the architectures very well.
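As an illustration of avoiding branches, here is a minimal plain C sketch (not OpenCL C) that replaces an if with arithmetic. The function names are hypothetical. A compiler may already generate branch-free code for the first version, and in OpenCL C the built-ins max() or select() would be the more natural choice.

```c
#include <assert.h>

/* Branching version: work items that take different paths diverge on a GPU. */
static int clip_negative_branch(int x)
{
    if (x < 0)
        return 0;
    return x;
}

/* Branch-free version: (x > 0) evaluates to 0 or 1, so the product
   selects either 0 or x without a conditional jump in the source. */
static int clip_negative_select(int x)
{
    return x * (x > 0);
}
```

Both functions return the same value for every input; only the control flow differs.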
Questions?