Introduction to Parallel Programming Using CUDA

CUDA (Compute Unified Device Architecture) is a platform developed by NVIDIA that enables its GPUs (Graphics Processing Units) to perform not only graphics rendering but also general-purpose computation. With CUDA, the many processing cores on an NVIDIA GPU can be put to work on a wide variety of computations. Nowadays other GPUs, such as those from ATI, also carry many cores; ATI's equivalent platform is called ATI Stream. Parallel programming has become important because the demand for computing power keeps growing, driven by multitasking and heavy graphics workloads. Current approaches to performance differ from earlier ones, which relied mainly on raising the processor clock speed; clock speed is limited by physical constraints such as power consumption and heat. Around 2005, the computer industry began to offer processors with multiple cores: 2, 3, 4, 6, and so on. In the early days of many-core GPUs, they could be used only through interfaces such as OpenGL and DirectX, which are specialized for graphics processing.
Recent NVIDIA GPUs, specifically those released after 2006, support CUDA; NVIDIA publishes a list of CUDA-enabled devices on its website. When starting to learn parallel programming with CUDA, it is best to use C or C++ as the programming language. CUDA C was the first special-purpose programming language developed by a GPU company to facilitate general-purpose computing on GPUs. To build an application with CUDA C you need the following.
  • CUDA-enabled graphics processor 
  • NVIDIA device driver
  • CUDA development toolkit
  • Standard C compiler
Both the toolkit and the driver can be downloaded from NVIDIA's website, with packages available for Windows, Linux, and Mac. Once the CUDA toolkit is installed, it provides a compiler named nvcc. On Windows it is also advisable to install Visual Studio, both for ease of development and because its compiler, cl.exe, is required for compilation.
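Assuming the toolkit is installed and the source is saved as hello.cu (a hypothetical filename), compiling and running with nvcc might look like this:

```shell
# Compile a CUDA C source file with nvcc and run the result
# (assumes the CUDA toolkit is installed and nvcc is on the PATH)
nvcc hello.cu -o hello
./hello
```

On Windows, nvcc invokes cl.exe behind the scenes, which is why the Visual Studio tools must be available.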
One distinctive feature of CUDA C source code is the presence of a kernel call, as in the following example.

#include <stdio.h>

__global__ void kernel( void ) {
}

int main( void ) {
    kernel<<<1,1>>>();
    printf( "Hello, World!\n" );
    return 0;
}

The __global__ qualifier on the kernel() function tells the compiler that the function is to be compiled to run on the device (the GPU) rather than on the host (the CPU). The next example shows how values are passed to a kernel and back.

#include <stdio.h>
#include "book.h"

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

int main( void ) {
    int c;
    int *dev_c;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, sizeof(int) ) );

    add<<<1,1>>>( 2, 7, dev_c );

    HANDLE_ERROR( cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost ) );
    printf( "2 + 7 = %d\n", c );
    cudaFree( dev_c );
    return 0;
}

dev_c is a device pointer: it refers to memory on the GPU where the kernel stores its result, which is then copied back to the host. Device memory is allocated with cudaMalloc(), which plays the same role as malloc() in C, and values are retrieved from the device with cudaMemcpy(). Now, how is actual parallelism expressed on the GPU? Consider the vector addition in the following code.
#include "../common/book.h"
#include <stdio.h>
#define N 10

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) { a[i] = -i; b[i] = i * i; }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice ) );

    add<<<N,1>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost ) );

    // display the results
    for (int i=0; i<N; i++) { printf( "%d + %d = %d\n", a[i], b[i], c[i] ); }

    // free the memory allocated on the GPU
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}

A program can be parallelized when each output value is independent of the other values produced by the same computation. Vector addition is a good example: c[0] is the sum of a[0] and b[0], and is not affected by c[1], c[2], and so on.
The program launches its kernel with add<<<N,1>>>, which starts N parallel copies of the kernel. The total number of threads comes from:
N blocks x 1 thread per block
The number of threads can be configured by changing these two values. For instance, add<<<1,N>>> launches the same total number of threads as add<<<N,1>>>, only arranged as one block of N threads. Both the number of blocks and the number of threads per block are limited, and the limits differ from device to device; they can be inspected through CUDA's device properties. The program above uses only one thread per block, so each thread identifies the element it should work on simply by reading the built-in blockIdx.x variable.
Another important skill when working with CUDA is representing a 2D or 3D array as a 1D array. This simplifies both memory allocation and thread configuration. For further study of parallel programming with CUDA, you can consult several resources suggested by NVIDIA.

CUDA by Example: An Introduction to General-Purpose GPU Programming Written by two senior members of the CUDA software platform team, this book shows programmers how to employ each area of CUDA through working examples. After a concise introduction to the CUDA platform and architecture, as well as a quick-start guide to CUDA C, the book details the techniques and trade-offs associated with each key CUDA feature. You will discover when to use each CUDA C extension and how to write CUDA software that delivers truly outstanding performance.

CUDA Application Design and Development As the computer industry retools to leverage massively parallel graphics processing units (GPUs), this book is designed to meet the needs of working software developers who need to understand GPU programming with CUDA and increase efficiency in their projects. CUDA Application Design and Development starts with an introduction to parallel computing concepts for readers with no previous parallel experience, and focuses on issues of immediate importance to working software developers: achieving high performance, maintaining competitiveness, analyzing CUDA benefits versus costs, and determining application lifespan. Written by Rob Farber, author of the popular "Supercomputing for the Masses" series in Dr. Dobb's Journal.

GPU Computing Gems Jade Edition (Applications of GPU Computing Series) This is the second volume of Morgan Kaufmann's GPU Computing Gems, offering an all-new set of insights, ideas, and practical, hands-on skills from researchers and developers worldwide. Each chapter gives you a window into the work being performed across a variety of application domains, and the opportunity to witness the impact of parallel GPU computing on the efficiency of scientific research.

GPU Computing Gems Emerald Edition (Applications of GPU Computing Series) GPU Computing Gems: Emerald Edition is the first volume in Morgan Kaufmann's Applications of GPU Computing Series, offering the latest insights and research in computer vision, electronic design automation, emerging data-intensive applications, life sciences, medical imaging, ray tracing and rendering, scientific simulation, signal and audio processing, statistical modeling, and video/image processing.

Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series) Multi-core processors are no longer the future of computing; they are the present-day reality. A typical mass-produced CPU features multiple processor cores, while a GPU (Graphics Processing Unit) may have hundreds or even thousands of cores. With the rise of multi-core architectures has come the need to teach advanced programmers a new and essential skill: how to program massively parallel processors.

