An Introduction to Parallel Computing With MPI

Computing Lab I
The purpose of the first programming exercise is to become familiar with the operating environment on
a parallel computer, and to create and run a simple parallel program using MPI. In code development, it
is always a good idea to start simple and develop/debug each piece of code before adding more
complexity. This first code will implement the basic MPI structure, query communicator info, and
output process rank – the classic “hello world” program. You will learn how to compile parallel
programs and submit batch jobs using the scheduler.
• Write a basic “hello world” code which creates an MPI environment, determines the number of
processes in the global communicator, and writes the rank of each process to standard output.
You will have to implement the correct MPI language binding depending on which programming
language you are using: FORTRAN, C, or C++. The general program structure for each language
is shown below. You can write your code in the editor “TextWrangler” which is installed on the
lab computers. This program allows you to edit and save source code files either locally on the
lab computers or remotely on socrates and transfer files as needed using sftp.
FORTRAN90
program MPIhelloworld
  implicit none
  include "mpif.h"                               ! Include the MPI header file
  integer :: ierr, pid, np
  call MPI_INIT(ierr)                            ! Initialize MPI environment
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)   ! Get number of processes (np)
  call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierr)  ! Get local rank (pid)
  write(*,*) "I am process: ", pid
  call MPI_FINALIZE(ierr)                        ! Terminate MPI environment
  stop
end program MPIhelloworld
C
#include <mpi.h>   /* Include the MPI header file */
#include <stdio.h>

int main(int argc, char *argv[]) {
  int ierr, pid, np;
  ierr = MPI_Init(&argc, &argv);        /* Initialize MPI environment */
  MPI_Comm_size(MPI_COMM_WORLD, &np);   /* Get number of processes (np) */
  MPI_Comm_rank(MPI_COMM_WORLD, &pid);  /* Get local rank (pid) */
  printf("I am process: %d\n", pid);
  MPI_Finalize();                       /* Terminate MPI environment */
  return 0;
}
C++
#include <mpi.h>     // Include the MPI header file
#include <iostream>

int main(int argc, char **argv) {
  int pid, np;
  MPI::Init(argc, argv);              // Initialize MPI environment
  np  = MPI::COMM_WORLD.Get_size();   // Get number of processes (np)
  pid = MPI::COMM_WORLD.Get_rank();   // Get local rank (pid)
  std::cout << "I am process: " << pid << std::endl;
  MPI::Finalize();                    // Terminate MPI environment
  return 0;
}
• Save your source code to your home directory on socrates (from the TextWrangler File menu
select “Save to FTP/SFTP Server…” and log in). Now open a terminal program (such as Terminal
or X11) and ssh to socrates. You should be able to log in with your NSID account. If you are not
familiar with the UNIX command line environment, you can consult the attached document
explaining all the basic commands you need to know. Your home directory is the location where
you will keep all your source code, the executables, and your input and output data files. Your
parallel code is submitted from the home directory and you can read and write files from there.
Most parallel computers provide a different directory with additional disk space should your
program use very large data files.
socrates
Information about socrates is available on the site
http://www.usask.ca/its/services/research_computing/socrates.php
Your account has been set up to use the OpenMPI implementation of the MPI standard.
Socrates also has MPICH and LAM MPI installed.
Socrates has the compilers gcc, g77, and gfortran available. To compile a parallel MPI program
you need to use the compiler scripts provided by OpenMPI which link the native compilers to
the proper MPI libraries. The compiler scripts are mpif77 or mpif90 for FORTRAN programs,
mpicc for C, and mpiCC for C++ programs. They can be passed any flag accepted by the
underlying compilers. To do a basic build, use one of the following commands:
[]$ mpif90 -o executable sourcecode.f90
[]$ mpicc -o executable sourcecode.c
[]$ mpiCC -o executable sourcecode.cpp
Socrates uses the TORQUE/Moab batching system to manage the load distribution on the
cluster. This load leveling program creates a queuing system to manage the cluster, and users
must submit their batch jobs to the queue. An outline of basic commands for TORQUE (which
evolved from software called PBS, the Portable Batch System) is given below.
To submit a parallel job, you will need to create a job script. Using a text editor (TextWrangler)
create a new file named myjobscript.pbs and type in all the necessary commands required
to submit your parallel job to the queue. A sample job script is shown below. Note that PBS
commands are preceded by #PBS and comment lines are inserted with a single #.
#!/bin/sh
# Sample PBS Script for use with OpenMPI on Socrates
# Jason Hlady May 2010
# Specify the number of processors to use in the form of
# nodes=X:ppn=Y, where X = number of computers (nodes),
# Y = number of processors per computer
#PBS -l nodes=1:ppn=1
# Job name which will show up in queue, job output
#PBS -N <my job name>
# Optional: join error and output into one stream
#PBS -j oe
# Show what node the app started on--useful for serial jobs
echo `hostname`
cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
echo "Starting run at: `date`"
echo "---------------------"
# Run the application
mpirun <my program name>
echo "Program finished with exit code $? at: `date`"
exit 0
When you submit the batch job, TORQUE will assign a job ID number. The standard output and
standard error of the job will be stored in the file myjobname.oJOB_ID# in your working
directory.
To submit your batch job simply enter
qsub myjobscript.pbs
The job ID number will be output to screen. To observe the status of your job in the queue, type
qstat
To kill a job enter
qdel JOB_ID#
You can view the man pages of any of these commands for more information and options.
Computing Lab II
Option 1: Jacobi Iteration on a Two-Dimensional Mesh
This is a classic problem for learning the basics of building a parallel Single Program Multiple Data
(SPMD) code with a domain decomposition approach and data dependency between processes. These
issues are common to many parallel algorithms used in scientific programs. We will keep the algorithm
as simple as possible so that you can focus on implementing the parallel communication and thinking
about program efficiency.
Consider solving the temperature distribution on a two-dimensional grid with fixed temperature values
on the boundaries.
Figure 1: Uniform grid of temperature values. Boundary values indicated by grey nodes.
The temperature values at all grid points can be stored in a two-dimensional data array, T(i,j). Starting
from an initial guess for the temperature distribution (say T = 0 at all interior nodes (white squares)), we
can calculate the final temperature distribution by repeatedly applying the calculation
Tnew(i,j) = [ Told(i+1,j) + Told(i-1,j) + Told(i,j+1) + Told(i,j-1) ] / 4
over all interior nodes until the temperature values converge to the final solution. This is not a very
efficient solver and it may take hundreds (or thousands) of sweeps of the grid before convergence, but it
is the simplest algorithm you can use. An example FORTRAN 90 sequential program is given below.
program jacobi
! A program solving the 2D heat equation using Jacobi iteration
  implicit none
  integer, parameter :: id=100, jd=100
  integer :: i, j, n, nmax
  real(kind=8), dimension(0:id+1,0:jd+1) :: Tnew, Told
  character(6) :: filename

  ! Initialize the domain
  Told = 0.0_8              ! initial condition
  Told(0,:)    = 80.0_8     ! left boundary condition
  Told(id+1,:) = 50.0_8     ! right boundary condition
  Told(:,0)    = 0.0_8      ! bottom boundary condition
  Told(:,jd+1) = 100.0_8    ! top boundary condition
  Tnew = Told

  ! Perform Jacobi iterations (nmax sweeps of the domain)
  nmax = 1000
  do n = 1, nmax
    ! Sweep interior nodes
    do i = 1, id
      do j = 1, jd
        Tnew(i,j) = (Told(i+1,j)+Told(i-1,j)+Told(i,j+1)+Told(i,j-1))/4.0_8
      end do
    end do
    ! Copy Tnew to Told and sweep again
    Told = Tnew
  end do

  ! Output field data to file
50 format(102f6.1)
  filename = "T.dat"
  open(unit=20,file=filename,status="replace")
  do j = jd+1, 0, -1
    write(20,50) (Tnew(i,j), i=0,id+1)
  end do
  close(20)
  stop
end program jacobi
Now parallelize the Jacobi solver. Use a simple one-dimensional domain decomposition as shown
below.
Figure 2: 1D decomposition of the grid among processes 0, 1, …, n.
Each process will perform iterations only on its subdomain, and will have to exchange temperature
values with neighboring processes at the subdomain boundaries. You should create a row of ghost
points to store these communicated values. The external row of boundary values around the global
domain can also be considered ghost points. If you keep things basic, you should be able to write the
parallel program in less than 70 lines of code!
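Before looking at the tips, here is a minimal sketch of how each process might set up its local subdomain. It is only one possible layout: it assumes the number of interior columns (id) divides evenly by the number of processes, and the names il, Told, Tnew, and so on are illustrative rather than prescribed.

! Minimal sketch of the local subdomain setup (assumes mod(id,np) == 0).
! Each process owns il = id/np interior columns of the global grid, plus
! one ghost layer on each side (local indices i = 0 and i = il+1).
program jacobi_setup_sketch
  implicit none
  include "mpif.h"
  integer, parameter :: id=100, jd=100
  integer :: ierr, pid, np, il
  real(kind=8), allocatable, dimension(:,:) :: Tnew, Told

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierr)

  il = id/np                                    ! local interior columns
  allocate(Told(0:il+1,0:jd+1), Tnew(0:il+1,0:jd+1))

  Told = 0.0_8                                  ! initial condition
  Told(:,0)    = 0.0_8                          ! bottom boundary (all processes)
  Told(:,jd+1) = 100.0_8                        ! top boundary (all processes)
  if (pid == 0)    Told(0,:)    = 80.0_8        ! left boundary, leftmost process only
  if (pid == np-1) Told(il+1,:) = 50.0_8        ! right boundary, rightmost process only
  Tnew = Told

  ! ... Jacobi sweeps over i = 1,il and j = 1,jd go here, with a ghost
  !     exchange between sweeps (see the sketch after the tips below) ...

  call MPI_FINALIZE(ierr)
end program jacobi_setup_sketch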
Some tips and hints:
• To keep things simple, directly program the mapping of the domain to the processes, i.e.
process 0 is on the left boundary, process ‘n’ on the right boundary, the rest in the middle. You
can also directly specify the different boundary conditions for each process.
• After every process sweeps its local nodes once, you will have to communicate the updated
temperature values at subdomain boundaries before the next sweep. This can be accomplished
in two communication shifts – first everyone sends data to the process on the right and receives
from the left, then everyone sends to the left and receives from the right. Make sure the
communication pattern does not deadlock.
• Since the data values you need to communicate may not be in contiguous memory locations in
your 2D temperature data array, you can create a 1D buffer array and explicitly copy the data
values in/out of the buffer and use the buffer array in the MPI_SEND and MPI_RECV calls.
• You may want to look at the data field when the computation is done, and the easiest way to do
this is to have every process write its local data array to a separate data file. You will have to
use a different file name for every process, and one way to automatically generate file names (in
FORTRAN 90) with the process id as the file name is with ASCII number to character conversion:
filename=achar((pid-mod(pid,10))/10+48) // achar(mod(pid,10)+48) // ".dat" which
gives the file name “12.dat” for pid = 12.
• Try using MPI_SENDRECV instead of separate blocking send and receive calls. This will allow you
to solve the case when the domain is periodic in the x-direction (roll the domain into a
cylindrical shell with the two x-faces joined together) and process 0 communicates with process
‘n’. A sketch of such an exchange is given after this list.
• You can implement a grid convergence measure such as the rms of the difference between Tnew
and Told on the global grid, and then stop the outer loop when the convergence measure is
acceptably small (say 10^-5). To do this you will need to use collective communication calls to
calculate the global convergence of the grid and to broadcast this value to all processes so that
they stop at the same time (a small MPI_ALLREDUCE sketch is also given after this list).
• If you have the 1D domain decomposition working you can try a 2D domain decomposition
which subdivides the domain into squares instead of strips. This is a more efficient
decomposition since the number of subdomain ghost points is reduced.
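Here is the halo-exchange sketch referred to in the tips above. It uses MPI_SENDRECV with explicit 1D buffers (the copies are needed because a fixed-i line of T is not contiguous in memory) and assumes the local layout from the setup sketch earlier; all of the names are illustrative, and sending to MPI_PROC_NULL simply turns the transfer into a no-op at the domain edges.

! Sketch of the ghost-layer exchange as a subroutine, using MPI_SENDRECV
! with explicit 1D buffers (names are illustrative).
subroutine exchange_ghosts(Told, il, jd, pid, np)
  implicit none
  include "mpif.h"
  integer, intent(in) :: il, jd, pid, np
  real(kind=8), intent(inout) :: Told(0:il+1,0:jd+1)
  integer :: left, right, ierr, status(MPI_STATUS_SIZE)
  real(kind=8) :: sbuf(jd), rbuf(jd)

  left  = pid - 1
  right = pid + 1
  if (pid == 0)    left  = MPI_PROC_NULL   ! no neighbour beyond the domain edge;
  if (pid == np-1) right = MPI_PROC_NULL   ! use mod(pid+1,np) etc. for a periodic domain

  ! Shift 1: send the values at i = il to the right neighbour,
  !          receive into the ghost layer at i = 0
  sbuf(1:jd) = Told(il,1:jd)
  call MPI_SENDRECV(sbuf, jd, MPI_DOUBLE_PRECISION, right, 0, &
                    rbuf, jd, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, status, ierr)
  if (left /= MPI_PROC_NULL) Told(0,1:jd) = rbuf(1:jd)

  ! Shift 2: send the values at i = 1 to the left neighbour,
  !          receive into the ghost layer at i = il+1
  sbuf(1:jd) = Told(1,1:jd)
  call MPI_SENDRECV(sbuf, jd, MPI_DOUBLE_PRECISION, left,  1, &
                    rbuf, jd, MPI_DOUBLE_PRECISION, right, 1, &
                    MPI_COMM_WORLD, status, ierr)
  if (right /= MPI_PROC_NULL) Told(il+1,1:jd) = rbuf(1:jd)
end subroutine exchange_ghosts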
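One possible shape for the global convergence measure is sketched below; MPI_ALLREDUCE sums the local contributions and returns the same result to every process, so no separate broadcast is needed. The function name and arguments are illustrative.

! Sketch of a global convergence check: returns the rms of (Tnew - Told)
! over the whole grid, with the same value on every process.
function global_rms(Tnew, Told, il, jd, id) result(rms)
  implicit none
  include "mpif.h"
  integer, intent(in) :: il, jd, id
  real(kind=8), intent(in) :: Tnew(0:il+1,0:jd+1), Told(0:il+1,0:jd+1)
  real(kind=8) :: rms, localsum, globalsum
  integer :: ierr

  localsum = sum((Tnew(1:il,1:jd) - Told(1:il,1:jd))**2)   ! interior nodes only
  call MPI_ALLREDUCE(localsum, globalsum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)
  rms = sqrt(globalsum/(real(id,8)*real(jd,8)))
end function global_rms

Inside the iteration loop each process can then test something like if (global_rms(Tnew, Told, il, jd, id) < 1.0e-5_8) exit; because every process receives the same value, they all leave the loop on the same sweep.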
Option 2: Numerical Integration of a Set of Discrete Data
This problem uses a master-worker model where the master process divides up the data and sends it to
the workers, who perform local computations on the data and communicate results back to the master.
There is no data dependency between workers (they don’t need to communicate with each other). This
is an example of what is called an “embarrassingly parallel” problem.
Consider the numerical integration of a large set of discrete data values, which could represent points
sampled from a function.
Figure 3: Discrete data values, f(x_i), where i = 1, 2, 3, …, n.
To approximate the integral, we can fit straight lines between each pair of points and then compute the
sum of the areas under each line segment. This is the trapezoid formula:
integral ≈ Σ_{i=1..n−1} (x_{i+1} − x_i) [ f(x_i) + f(x_{i+1}) ] / 2

The locations x_i may not be evenly spaced. An example FORTRAN 90 code is given below.
program integrate
! A program to numerically integrate discrete data from the file "ptrace.dat"
  implicit none
  integer, parameter :: n=960000           ! Number of points in data file
  integer :: i
  real(kind=8) :: integral
  real(kind=8), dimension(n) :: x, f

  ! Open data file and read in data
  open(unit=21,file="ptrace.dat",status="old")
  do i = 1, n
    read(21,*) x(i), f(i)
  end do
  close(21)

  ! Now compute global integral
  integral = 0.0_8
  do i = 1, n-1
    integral = integral + (x(i+1)-x(i))*(f(i)+f(i+1))/2.0_8   ! trapezoidal formula
  end do

  ! Output result
  write(*,*) "The integral of the data set is: ", integral
  stop
end program integrate
Now parallelize this program using the master-worker model. The master process (choose process 0,
which is always present) reads in data from the file, divides it up evenly and distributes it to the workers
(all other processes). The workers compute the integral of their portions of the data and return the
results to the master. The master sums the results to find the global integral and outputs the result. If
you keep things simple, you should be able to write the parallel program in less than 60 lines of code.
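One possible shape for this master-worker exchange, using plain MPI_SEND and MPI_RECV, is sketched below. It is only a sketch: the chunk sizes, tags, and names are illustrative, adjacent chunks are given a one-point overlap so that no trapezoid segment between chunks is lost, and the last worker simply absorbs any leftover intervals.

! Minimal master-worker sketch (assumes np >= 2; the master does no integration itself).
program integrate_mw_sketch
  implicit none
  include "mpif.h"
  integer, parameter :: n=960000
  integer :: ierr, pid, np, i, k, chunk, istart, npts
  integer :: status(MPI_STATUS_SIZE)
  real(kind=8) :: partial, total
  real(kind=8), allocatable :: x(:), f(:)

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierr)
  chunk = (n-1)/(np-1)                       ! intervals per worker

  if (pid == 0) then
    ! Master: read the data, then send each worker its points
    allocate(x(n), f(n))
    open(unit=21, file="ptrace.dat", status="old")
    do i = 1, n
      read(21,*) x(i), f(i)
    end do
    close(21)
    do k = 1, np-1
      istart = (k-1)*chunk + 1               ! chunks overlap by one point so that
      npts   = chunk + 1                     ! no trapezoid segment is lost
      if (k == np-1) npts = n - istart + 1   ! last worker takes any remainder
      call MPI_SEND(npts, 1, MPI_INTEGER, k, 0, MPI_COMM_WORLD, ierr)
      call MPI_SEND(x(istart), npts, MPI_DOUBLE_PRECISION, k, 1, MPI_COMM_WORLD, ierr)
      call MPI_SEND(f(istart), npts, MPI_DOUBLE_PRECISION, k, 2, MPI_COMM_WORLD, ierr)
    end do
    ! Collect and sum the partial integrals from the workers
    total = 0.0_8
    do k = 1, np-1
      call MPI_RECV(partial, 1, MPI_DOUBLE_PRECISION, k, 3, MPI_COMM_WORLD, status, ierr)
      total = total + partial
    end do
    write(*,*) "The integral of the data set is: ", total
  else
    ! Worker: receive a chunk, integrate it locally, return the partial result
    call MPI_RECV(npts, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
    allocate(x(npts), f(npts))
    call MPI_RECV(x, npts, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, status, ierr)
    call MPI_RECV(f, npts, MPI_DOUBLE_PRECISION, 0, 2, MPI_COMM_WORLD, status, ierr)
    partial = 0.0_8
    do i = 1, npts-1
      partial = partial + (x(i+1)-x(i))*(f(i)+f(i+1))/2.0_8
    end do
    call MPI_SEND(partial, 1, MPI_DOUBLE_PRECISION, 0, 3, MPI_COMM_WORLD, ierr)
  end if

  call MPI_FINALIZE(ierr)
end program integrate_mw_sketch

Following the tips below, you could instead have the master work on a chunk of its own while it waits, or replace the loops of sends with MPI_SCATTERV and the collection step with MPI_REDUCE.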
Some tips and hints:
• If the data array is very large, the master process may not have enough local memory to store
the entire array. In this case it would be better to read in only part of the data set at a time and
send it to one or more workers before reading in more data (overwriting the previous values) and
sending it to other workers, and so on.
• In order to make this algorithm efficient, we need to minimize the idle time of the workers (and
the master) and balance the computational work as evenly as possible. If the number of
processes is small, and the data set is large, we may want the master process to help compute
part of the integral while it is waiting for the workers to finish. Also, if the amount of data
communicated to each worker is large (lots of communication overhead – bandwidth related)
other workers will be idling while they wait for their data. Would it be more efficient to send
smaller parcels of data to each worker so that they all get to work quickly, and then repeatedly
send more data when they finish until all the work is done? But if the number of messages gets
too large, then we will have increased latency-related overhead.
• You can try using non-blocking communication calls on the master process so that it can do
other tasks while waiting for results from workers.
• You can also try using the scatter and reduce collective communication routines to implement
the parallel program.
Investigate Parallel Performance
Measure the parallel performance of your code and examine how the efficiency varies with process
count and problem size.
• Implement timing routines in your parallel code as well as in a sequential version, and write run
time to standard output. When submitting timed parallel jobs to the queue, you want to make
sure that resources are used exclusively for your job (i.e. other applications are not running at
the same time on the same CPU). Also, the run time of your code may be affected by the
mapping of processes to cores/sockets/nodes on the machine, so experiment with this. It might
be a good idea to launch the code several times and average the run time results (a small
MPI_WTIME timing sketch is given after this list).
• Measure the parallel efficiency and speedup of your code on different numbers of processes.
You may also want to repeat the measurements on larger/smaller domains to examine the
effects of problem size. The speedup on p processes is S_p = T_1/T_p, where T_1 is the single-process
run time (a tougher baseline is the sequential code run time T_s), and the efficiency is E_p = S_p/p.
Plot a curve of speedup versus number of processes used. Also plot efficiency versus number of processes.
• How well does your code scale? How does the problem size affect the efficiency? Are there
ways that the parallel performance of your code can be improved? You may want to consider
operation count in critical loops, memory usage, compiler optimization, communication
overhead, etc. as ways to improve the speed of your code.
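For the timing itself, MPI provides the wall-clock timer MPI_WTIME. A minimal sketch, with the section to be timed left as a placeholder, might look like this:

! Minimal timing sketch using MPI_WTIME (the work to be timed is a placeholder).
program timing_sketch
  implicit none
  include "mpif.h"
  integer :: ierr, pid
  real(kind=8) :: t1, t2

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierr)

  call MPI_BARRIER(MPI_COMM_WORLD, ierr)   ! make sure everyone starts together
  t1 = MPI_WTIME()

  ! ... the section of code to be timed (e.g. the Jacobi iteration loop) ...

  call MPI_BARRIER(MPI_COMM_WORLD, ierr)   ! wait until every process has finished
  t2 = MPI_WTIME()
  if (pid == 0) write(*,*) "Run time (s): ", t2 - t1

  call MPI_FINALIZE(ierr)
end program timing_sketch

The barriers ensure every process starts and stops the timer around the same region, so the reported time reflects the slowest process.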