2009년 6월 17일 수요일

[omp] 개념

Types of Parallel Programming
Before we begin with OpenMP, it is important to know why we need parallel processing. In a typical case, a sequential code will execute in a thread which is executed on a single processing unit. Thus, if a computer has 2 processors or more ( or 2 cores, or 1 processor with HyperThreading), only a single processor will be used for execution, thus wasting the other processing power. Rather than letting the other processor to sit idle (or process other threads from other programs) we can use it to speed up our algorithm.
Parallel processing can be divided in to two groups, task based and data based.

Task based : Divide different tasks to different CPUs to be executed in parallel. For example, a Printing thread and a Spell Checking thread running simultaneously in a word processor. Each thread is a separate task.
Data based : Execute the same task, but divide the work load on the data over several CPUs. For example, to convert a color image to grayscale. We can convert the top half of the image on the first CPU, while the lower half is converted on the second CPU (or as many CPUs you have), thus processing in half the time.
There are several methods to do parallel processing


























Use MPI : Mesage Passing Interface - MPI is most suited for a system with multiple processors and multiple memory. For example, a cluster of computers with their own local memory. You can use MPI to divide workload across this cluster, and merge the result when it is finished. Available with Microsoft Compute Cluster Pack.
Use OpenMP : OpenMP is suited for shared memory systems like we have on our desktop computers. Shared memory systems are systems with multiple processors but each are sharing a single memory subsystem. Using OpenMP is just like writing your own smaller threads but let the compiler do it. Available in Visual Studio 2005 Professional and Team Suite.
Use SIMD intrinsics : Single Instruction Multiple Data (SIMD) has been available on mainstream processors such as Intel's MMX, SSE, SSE2, SSE3, Motorola's (or IBM's) Altivec and AMD's 3DNow!. SIMD intrinsincs are primitive functions to parallelize data processing on the CPU register level. For example, the addition of two unsigned char will take the whole register size, although the size of this data type is just 8-bit, leaving 24-bit in the register to be filled with 0 and wasted. Using SIMD (such as MMX), we can load 8 unsigned chars (or 4 shorts or 2 integers) to be executed in parallel on the register level. Available in Visual Studio 2005 using SIMD intrinsics or with Visual C++ Processor Pack with Visual C++ 6.0.





Understanding the Fork-and-Join Model
OpenMP uses the fork-and-join parallelism model. In fork-and-join, parallel threads are created and branched out from a master thread to execute an operation and will only remain until the operation has finished, then all the threads are destroyed, thus leaving only one master thread.

The process of splitting and joining of threads including synchronization for end result are handled by OpenMP.




























How Many Threads Do I Need?
A typical question is that how many threads do I actually need? Are more threads better? How do I control the number of threads in my code? Are number of threads related to number of CPUs?
The number of threads required to solve a problem is generally limited to the number of CPUs you have. As you can see in the Fork-and-Join figure above, whenever threads are created, a little time is taken to create a thread and later to join the end result and destroy the threads. When the problem is small, and the number of CPUs are less than the number of threads, the total execution time will be longer (slower) because more time has been spent to create threads, and later switch between the threads (due to preemptive behaviour) then to actually solve the problem. Whenever a thread context is switched, data must be saved/loaded from the memory. This takes time.

The rule is simple, since all the threads will be executing the same operation (hence the same priority), 1 thread is sufficient per CPU (or core). The more CPUs you have, the more threads you can create.

Most compiler directives in OpenMP uses the Environment Variable OMP_NUM_THREADS to determine the number of threads to create. You can control the number of threads with the following functions;


// Get the number of processors in this system
int iCPU = omp_get_num_procs();

// Now set the number of threads
omp_set_num_threads(iCPU);



Of course, you can put any value for iCPU in the code above (if you do not want to call omp_get_num_procs), and you can call the omp_set_num_threads functions as many times as you like for different parts of your code for maximum control. If omp_set_num_threads is not called, OpenMP will use the OMP_NUM_THREADS Environment Variable.


Parallel for Loop
Let us start with a simple parallel for loop. The following is a code to convert a 32-bit Color (RGBA) image to 8-bit Grayscale image

// pDest is an unsigned char array of size width * height
// pSrc is an unsigned char array of size width * height * 4 (32-bit)
// To avoid floating point operation, all floating point weights in the original
// grayscale formula
// has been changed to integer approximation

// Use pragma for to make a parallel for loop
omp_set_num_threads(threads);

#pragma omp parallel for
for(int z = 0; z < height*width; z++)
{
pDest[z] = (pSrc[z*4+0]*3735 + pSrc[z*4 + 1]*19234+ pSrc[z*4+ 2]*9797)>>15;
}



The #pragma omp parallel for directive will parallelize the for loop according to the number of threads set. The following is the performance gained for a 3264x2488 image on a 1.66GHz Core Duo system (2 Cores).

Thread(s) : 1 Time 0.04081 sec
Thread(s) : 2 Time 0.01906 sec
Thread(s) : 4 Time 0.01940 sec
Thread(s) : 6 Time 0.02133 sec
Thread(s) : 8 Time 0.02029 sec

As you can see, by executing the problem using 2 threads on a dual-core CPU, the time has been cut by half. However, as the number of threads is increased, the performance does not due to increased time to fork and join.


Parallel double for loop
The same problem above (converting color to grayscale) can also be written in a double for loop way. This can be written like this;

Collapse Copy Code

for(int y = 0; y < height; y++)
for(int x = 0; x< width; x++)
pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 + pSrc[x*4 + y*4*width + 1]*19234+ pSrc[x*4 + y*4*width + 2]*9797)>>15;


In this case, there are two solutions;
Solution 1

We have made the inner loop parallel using the parallel for directive. As you can see, when 2 threads are being used, the execution time has actually increased! This is because, for every iteration of y, a fork-join operation is performed, and at the end contributed to the increased execution time.
Collapse Copy Code

for(int y = 0; y < height; y++)
{
#pragma omp parallel for
for(int x = 0; x< width; x++)
{
pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 + pSrc[x*4 + y*4*width + 1]*19234+ pSrc[x*4 + y*4*width + 2]*9797)>>15;
}
}


Thread(s) : 1 Time 0.04260 sec
Thread(s) : 2 Time 0.05171 sec

Solution 2

Instead of making the inner loop parallel, the outer loop is the better choice. Here, another directive is introduced - the private directive. The private clause directs the compiler to make variables private so multiple copies of a variable does not execute. Here, you can see that the execution time did indeed get reduced by half.

int x = 0;
#pragma omp parallel for private(x)
for(int y = 0; y < height; y++)
{
for(x = 0; x< width; x++)
{
pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 + pSrc[x*4 + y*4*width + 1]*19234+ pSrc[x*4 + y*4*width + 2]*9797)>>15;
}
}


Thread(s) : 1 Time 0.04039 sec
Thread(s) : 2 Time 0.02020 sec


Get to know the number of threads
At any time, we can obtain the number of OpenMP threads running by calling the function int omp_get_thread_num(); .



Summary
By using OpenMP, you can gain performance on multi-core systems for free, without much coding other than a line or too. There is no excuse not to use OpenMP. The benefits are there, and the coding is simple.

There are several other OpenMP directives such as firstprivate, critical sections and reductions which are left for another article. :)

Points of Interest
You can also perform similar operation using Windows threads, but it will be more difficult to implement, because of issues such as synchronization and thread management.

Additional Notes
The original RGB to Grayscale formula is given as Y = 0.299 R + 0.587 G + 0.114 B.

References
Parallel Programming in C with MPI and OpenMP (1st ed.) , Michael J. Quinn, McGraw Hill, 2004
Scientific Parallel Computing, L. Ridgway Scott, Terry Clark, Babak Bagheri, Princeton University Press, 2005
Introduction to Parallel Computing: A Practical Guide with Examples in An W.P. Petersen, P. Arbenz, Oxford University Press, 2004.
Digital Color Imaging Handbook, Gaurav Sharma, CRC Press, 2003

댓글 없음:

댓글 쓰기

팔로어

프로필

평범한 모습 평범함을 즐기는 평범하게 사는 사람입니다.