Programming Concepts


Speedup (1)
A sequential algorithm can be evaluated in terms of its execution time.
The execution time of a parallel algorithm depends not only on the problem size but also on
the architecture of the parallel system and the number of available computing resources.
The speedup factor is a measure that captures the relative benefits of solving a computational
problem in parallel. The speedup factor of a parallel computation utilizing p processors is
derived as the following ratio:
Sp = Ts / Tp

Sp can be defined as the ratio of the sequential processing time Ts to the parallel processing time Tp.
Ts is the execution time taken to perform the computation on one processor using the best algorithm known for the particular computational problem.
Tp is the execution time needed to perform the same computation on a parallel system using p processors.
Amdahl’s Law:
An application program W is divided into two computational parts X and Y. These two parts take the fractions x and y of the total execution time (x + y = 1).
Assume that part Y cannot be improved regarding execution time, while part X can be improved to run n times faster. Then the speedup Sp is given by:
Sp = Ts / Tp = W / (((x/n) + y) · W) = 1 / (((1 − y)/n) + y) = n / (1 + (n − 1) · y) → 1/y   for n → ∞
This implies:
• Part X should be optimized, i.e. concentrate on the common cases.
• The upper bound for the speedup is 1/y.
• Part Y is called the bottleneck; Y should be as small as possible.
• y is known as the sequential bottleneck.
[Scalable Parallel Computing]
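To make the bound concrete, the following minimal C sketch (added here for illustration; it is not part of the original slides, and the chosen values of y and n are arbitrary) evaluates the Amdahl speedup for a sequential fraction y and an n-fold improvement of part X:

#include <stdio.h>

/* Amdahl speedup for a sequential fraction y when the remaining
 * part (1 - y) runs n times faster. */
static double amdahl_speedup(double y, double n)
{
    return 1.0 / (((1.0 - y) / n) + y);
}

int main(void)
{
    double y = 0.1;                      /* assume a 10% sequential part */
    for (double n = 2.0; n <= 1024.0; n *= 4.0)
        printf("n = %6.0f   Sp = %6.2f\n", n, amdahl_speedup(y, n));
    printf("upper bound 1/y = %.2f\n", 1.0 / y);   /* -> 10 for y = 0.1 */
    return 0;
}

Even for n → ∞ the speedup stays below 1/y = 10 in this example, which is exactly the sequential bottleneck described above.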
Speedup (2)
[Figure] Speedup Sp versus number of processors p, with 1 ≤ Sp ≤ p: the linear-speedup line, the region of superlinear speedup above it, the highest achievable speedup at an optimal number of processors, and the performance reduction caused by adding more processors beyond that point.
The speedup factor Sp is normally less than the number of processors p (the theoretical maximum) due to overhead factors:
• synchronization
• communication
• input/output operations
• architectural bottlenecks
• etc.
Superlinear Speedup
In an ideal system, the speedup Sp cannot be greater than p. However, superlinear speedup can be observed when a non-optimal sequential algorithm is used as the reference.
Speedup (3) - Problem Size
Gustafson’s Law:
The major assumption of Amdahl’s Law is that the problem size is fixed. John Gustafson (1988) proposed a concept that improves the speedup by scaling the problem size with the increase in machine size.
Thus the problem size n is another important parameter for analysing parallel computations
on parallel systems.
Example: Calculating the sum of 10 (100, 1000) numbers on a sequential computer results in an execution time nearly proportional to 10 (100, 1000) time steps.
Assume a parallel computer with p = 10. Summing up only 10 elements will take much longer than the execution on a sequential processor, because the overhead to synchronize and communicate the results is large compared to the number of useful computations. Increasing the problem size to 100 increases the speedup and the efficiency. Using 1000 numbers gives each processor 100 elements to sum up, which further increases speedup and efficiency.
S′p = Ts / Tp = (y·W + (1 − y)·n·W) / W = y + (1 − y)·n
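For comparison with the Amdahl sketch above, this small C snippet (again only an illustration, not from the slides) evaluates the scaled speedup y + (1 − y)·n, which grows linearly with n instead of saturating at 1/y:

#include <stdio.h>

/* Gustafson scaled speedup for a sequential fraction y and
 * a problem/machine scaling factor n. */
static double scaled_speedup(double y, double n)
{
    return y + (1.0 - y) * n;
}

int main(void)
{
    double y = 0.1;                      /* assume a 10% sequential part */
    for (double n = 2.0; n <= 32.0; n *= 2.0)
        printf("n = %4.0f   S'p = %6.2f\n", n, scaled_speedup(y, n));
    return 0;
}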
[Figure] Speedup versus number of processors for problem sizes n = 10, 100, 1000, 10000, with the linear-speedup line for reference: increasing the problem size n increases the achievable speedup.
Speedup (4) - Efficiency
The efficiency of a parallel computation can be defined as the ratio between the speed-up
factor and the number of processing elements in a parallel system.
In an ideal parallel system, the speed-up factor is equal to p, so the efficiency becomes equal
to one.
Ep = Sp / p = Ts / (p · Tp),   0 ≤ Ep ≤ 1
Programming models (1)
The two main characteristics of parallel programming systems are how communication between the concurrent activities is managed and how these activities are synchronized, or coordinated [Wilson95].

[Diagram] Language Taxonomy Scheme [Wilson95]. Main categories: implicit parallelism (dusty decks, applicative languages), data parallelism and control parallelism. Example entries: parallelizing compiler (FORTRAN 77 + Parafrase compiler), SISAL; parallelization directives (FORTRAN 77 + Cray directives, FORTRAN 77 + HPF directives); whole-array operations (FORTRAN 90, C*); arbitrary sharing (fork/join, futures), partial/explicit sharing (LINDA, an implementation of a VSM), no sharing (procedural message passing, CSP).
Conventional languages and implementations of a VSM:
• Fortran
• C, C++
• implementations of a VSM in software with hardware support
• adaptive consistency
Programming models (2)
Commitments for a programming model:
• Message Passing Model
• Shared Memory Model (global address space model)
A programming model has to define:
- data types
- structuring of the program in program units
- data exchange between the program units
- determination of the control flow
- exploitation of the parallelism
- parallelization of program units
- usage of communication protocols
- coordination of the parallel flow
PCAM: Partition - Communicate - Agglomerate - Map [Foster95]
Scalability requires:
• Shared memory node
- as long as the speedup inside a node justifies the costs
• Physically distributed memory and processing units
- as soon as larger systems are targeted
Problems due to physical distribution:
• latency (normally not predictable)
- of the communication
- due to access on global data
• synchronization costs
Data-parallel languages
C*
• Extension of C with data types and operations on vectors and matrices, specific
to the SIMD-processing of the Connection Machine CM2 and CM5
CC++
• Extension of C++ with functions for communication, synchronization and control of parallelism
• 6 basic abstractions: processor object, global pointer, thread, sync variable,
atomic function, transfer function
Fortran 90
• Extension of Fortran 77 with data types and operations on vectors and matrices
(SIMD)
• dynamic memory allocation
• introduction of pointers and data structures
• ’intrinsic functions’ for vectors and matrices
• ’access functions’ for vectors and matrices
HPF-Fortran
• Extension of Fortran 90 with constructs for parallel processing (FORALL, INDEPENDENT)
• Directives for defining the locality of data structures
Message-Passing Programming Model
Contra:
• The difficulty with this model is that the programmer must explicitly program the communication and synchronization between the cooperating threads of the task.
• Some communication constructs, e.g., blocking send and blocking receive, double as synchronization points. Thus, communication is intertwined with coordination.
• By constraining the totally asynchronous execution of the MIMD mode of operation, message-passing programming can be facilitated.
• Such a restriction is the single-program multiple-data (SPMD) model.
• SPMD solves data-parallel problems by applying a replicated thread to different data sets (see the sketch at the end of this slide).
• SPMD combines global homogeneity with local autonomy of execution. Parallelization essentially reduces to the task of data distribution.
• Thread coordination is performed in lock-step mode, thus replacing individual thread synchronization by a global barrier.
Pro:
• Message passing is the implementationally simplest, computationally most efficient programming model for distributed memory architectures.
• Explicit knowledge and usage of the location of data structures
[W.K.Giloi]
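As an illustration of the SPMD idea, here is a minimal C sketch (not part of the original slides; the problem size and the use of MPI, which is introduced on the following slides, are assumptions of the example): every process runs the same program and derives its block of the data from its rank, so parallelization indeed reduces to data distribution.

#include <stdio.h>
#include <mpi.h>

#define N 1000                              /* illustrative global problem size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my process identifier      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes  */

    /* SPMD: the same program runs everywhere; the data distribution
     * is derived from the rank (block distribution of the indices). */
    int chunk = (N + size - 1) / size;
    int lo = rank * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;

    double local_sum = 0.0;
    for (int i = lo; i < hi; i++)
        local_sum += (double)i;             /* work on the local block    */

    double global_sum = 0.0;                /* combine the partial results */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %.0f\n", global_sum);

    MPI_Finalize();
    return 0;
}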
Communication with message-passing
Extension of conventional languages using message-passing libraries
• Extension with functions for:
- Communication
- Synchronization
- process management
- mapping
• Features of communication
- synchronous / asynchronous
- blocking behavior
- management of communication objects
- buffering
- concurrency level
Classic programming languages and the message-passing extensions:
• Fortran
• C, C++
• Express
• Parmacs
• PVM
• MPI
- MPI Standard of the MPI Forum
(http://www.mpi-forum.org)
- Implementation: Open MPI, merger of LAM-MPI, FT-MPI, LA-MPI
(http://www.open-mpi.org)
- Implementation: MPICH-1, MPICH-2, Argonne National Laboratory
(http://www-unix.mcs.anl.gov/mpi/mpich)
MPI-Message Passing Interface (MPI 1.2)
MPI is a library for Fortran, C or C++ for message passing and the de-facto message passing
standard. MPI is a complex system. It comprises 129 functions, many of which have numerous parameters or variants.
In the MPI programming model a computation comprises one or more processes that communicate by calling library routines to send messages to and receive messages from other processes. In most MPI implementations, a fixed set of processes is created at program initialization, and one process is created per processor. MPI is a message-passing interface, not a complete programming environment. MPI does not allow dynamic process creation.
Features:
• static process model, one-to-one mapping
• fixed number of processes
• point-to-point communication, collective communication
• easy heterogeneous implementation
• virtual communication channels
• efficient implementations for multithreaded environments
The ability of MPI to probe for messages supports asynchronous communication. Nevertheless, non-blocking communication is not treated as a basic function, because blocking communication can replace non-blocking communication at any time.
The basic functions of MPI are:
• MPI_INIT: Initiate an MPI computation
• MPI_FINALIZE: Terminate a computation
• MPI_COMM_SIZE: Determine the number of processes
• MPI_COMM_RANK: Determine my process identifier
• MPI_SEND: Send a message (6 types)
• MPI_RECV: Receive a message (2 types)
MPI has four communication modes for the send function: standard, buffered, synchronous and ready. Every mode can be combined with a blocking or non-blocking behaviour of the function.
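The basic functions above are enough for a complete program. The following minimal C sketch (added for illustration; the payload and the tag are arbitrary) lets every process with rank > 0 send one integer to process 0 with a blocking standard-mode send, which process 0 receives with blocking receives:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                      /* MPI_INIT       */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* MPI_COMM_RANK  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);        /* MPI_COMM_SIZE  */

    if (rank != 0) {
        int value = rank * rank;                 /* arbitrary payload */
        /* blocking standard-mode send (tag 0) to process 0 */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 1; src < size; src++) {
            int value;
            MPI_Status status;
            /* blocking receive; it also acts as a synchronization point */
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
            printf("received %d from rank %d\n", value, src);
        }
    }

    MPI_Finalize();                              /* MPI_FINALIZE   */
    return 0;
}

The non-blocking counterparts (MPI_Isend, MPI_Irecv, completed by MPI_Wait) follow the same argument pattern.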
MPI provides support for general application topologies that are specified by graphs and it
provides explicit support for n-dimensional Cartesian grids.
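A small sketch of the Cartesian-grid support (illustrative only; the grid dimensionality and periodicity are assumptions of the example): the processes are arranged in a periodic 2-dimensional grid, and each process queries its coordinates and its neighbours along one dimension.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[2] = {0, 0};                  /* let MPI choose a 2-D factorization  */
    MPI_Dims_create(size, 2, dims);
    int periods[2] = {1, 1};               /* periodic (torus) in both dimensions */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    int cart_rank, coords[2], left, right;
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 2, coords);
    MPI_Cart_shift(cart, 0, 1, &left, &right);   /* neighbours along dimension 0 */
    printf("rank %d at (%d,%d), neighbours %d/%d\n",
           cart_rank, coords[0], coords[1], left, right);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}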
MPI-Message Passing Interface (MPI 2.0)
MPI 2.0 provides the same functionality as MPI 1.2, plus additional extended functions for:
• Dynamic process creation
• Remote-Memory-Access (RMA) operations like remote load (get), remote store
(put), ...
• Parallel I/O operations
• Thread-awareness and thread-safety (optional)
Most notable are the RMA operations. This feature enables the programmer to move from two-sided to one-sided communication, i.e. only one process (the local process) is involved in a communication, no longer two processes (local and remote process).
Note that RMA operations only improve the communication modes, not the synchronization
modes. It is obvious that for synchronization issues all synchronizing processes must participate.
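A minimal sketch of one-sided communication with the MPI 2.0 RMA operations (illustrative values; run with at least two processes): process 0 puts a value directly into a memory window exposed by process 1, and MPI_Win_fence provides the required synchronization of all participating processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = -1;                        /* memory exposed via the window */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open the access epoch */
    if (rank == 0) {
        int value = 42;
        /* one-sided: only rank 0 is actively involved in this transfer */
        MPI_Put(&value, 1, MPI_INT, 1 /*target*/, 0 /*disp*/, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);                 /* close the epoch, data now visible */

    if (rank == 1)
        printf("rank 1 received %d via MPI_Put\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}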
The thread support improves the effectiveness on symmetric multi-processing (SMP) nodes, where more than one processor works within a node. Here, the intra-node communication can be realized using shared memory and the inter-node communication using conventional message passing. Besides this, the usage of threads is a kind of dynamic process management, enabling the user to take advantage of the benefits of multi-grid algorithms.
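For the optional thread support, MPI 2.0 adds MPI_Init_thread, with which a program requests a thread level and learns which level the library actually provides; a minimal sketch (the requested level is an assumption of the example):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    /* request full thread support; the library reports what it provides */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        printf("only thread level %d available\n", provided);
    /* ... threads inside an SMP node could now communicate via shared
     *     memory, while MPI handles the inter-node message passing ... */

    MPI_Finalize();
    return 0;
}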
[Figure] Combined message-passing / shared-memory model: two SMP nodes, each running a process with two threads that communicate via shared memory inside the node, while MPI message passing over the interconnection network (via the NICs) connects the nodes.
CC++ (1)
CC++ is the abbreviation for Compositional C++. It is a:
• general-purpose programming language
• superset of C++ with six new keywords
CC++ was developed by Chandy and Kesselman; an example of an implementation is the Caltech C++ Compiler.
CC++ introduces abstractions for the representation of shared-memory structures, threads of control, mutual exclusion and synchronization principles.
With these six new abstractions it is possible to implement parallel code in a C++ fashion. Local and remote data accesses can be modelled and are therefore visible to the user. Because it implements both the shared- and the distributed-memory model, no restriction regarding the programming model is imposed. Note that there is no explicit support for message passing; abstractions like send and receive do not exist. [Foster95]
Six new basic abstractions
processor object
• mechanism for controlling locality
• computation comprises one or more processor objects
• within a processor object sequential code can be executed without modifications
• access to local data structures
• identified by keyword global
• predefined class proc_t controls processor object placement
• processor objects run in separate address spaces (represent shared data)
CC++ (2)
global pointer
• identified by the type modifier global
• mechanism for linking together processor objects
• must be used to access a data structure or to perform computation (using a remote procedure call, or RPC) in another processor object
thread
• mechanism for specifying concurrent execution
• created independently from processor objects
• more than one thread can execute in a processor object
• par, parfor, and spawn statements create threads
• represent threads of control in a processor object (common address space)
sync variable
• type modifier sync
• used to synchronize thread execution
atomic function
• specified by the keyword atomic
• mechanism to control the interleaving of threads executing in the same processor object.
transfer function
• predefined type CCVoid
• allows arbitrary data structures to be transferred between processor objects
CC++ (3)
Concurrency
The par, parfor and spawn constructs create additional threads. Forbidden in parallel blocks
are statements that result in nonlocal changes of the control flow (e.g. return).
A parallel block is distinguished from a sequential block by the keyword par or parfor; the block blocks until all statements inside it have terminated.
par and parfor can be nested as desired, e.g. for a master-worker model.
spawn
• creates an independent thread of control to specify unstructured parallelism
• the parent can wait for neither termination nor return values
Examples for concurrency:
par {
    statement1;
    statement2;
    statement3;
    ...
    statementN;
}

parfor (int i=0; i<10; i++) {
    my_process(i);
}

par {
    master();
    parfor (int i=0; i<10; i++)
        worker(i);
}
CC++ (4)
Locality
Processor object:
• represents an address space in which threads can be executed
• identified by the keyword global
• unit of locality, where data access is considered local and therefore cheap
This example creates a processor object class with public member functions:
global class MyClass : public ParentClass {
public:
    void func1();
    void func2();
};
Processor objects are linked together using global pointers
Global pointers:
• like a normal pointer but it can refer to other processor objects or data structures
within other processor objects
• represents nonlocal data which is expensive to access
• identified by the keyword global
Example:
float *global gpf;
Processor objects created with the new statement are represented by a global pointer:
MyClass *global myclass_pobj = new MyClass;
CC++ (5)
Thread placement:
• threads are executed by default in the same processor object as their parent
• can be placed in another processor object using RPC
• can be invoked using the global pointer to another processor object
Example to execute a thread in a different processor object:
myclass_pobj->func1();
A single thread in a processor object is a task.
If the return result of the RPC is not required, the spawn function is more efficient than normal RPC.
spawn myclass_pobj->func1();
Communication
• No primitives for sending and receiving data between threads
• Threads communicate by operating on shared data structures (e.g. channel communication)
• Global pointers can be used to communicate data between processor objects
• Synchronization is done using the sync keyword
• atomic functions provide functionality for mutual exclusion
• data transfer functions can be used to communicate more complex data structures
CC++ (6)
Global pointers for remote operations:
• global pointers in CC++ are used in the same manner as C++ local pointers
• they can be used to operate on data of other processor objects
• they can be used to invoke remote functions (of other processor objects)
Generic form of a RPC:
<type> *global gp;
result = gp->p(...);
Steps:
• The arguments are packed in a message, communicated and remotely unpacked. The calling thread suspends execution.
• A thread is created in the remote processor object.
• Transferring the result back to the calling processor object unblocks the calling thread.
[Figure] Remote accesses through a global pointer: proc_obj0 holds a global pointer gp (global int* gp; int len;) that refers into the remote proc_obj1. The assignment *gp=5; is translated into a write(*gp, 5) message to proc_obj1, which is acknowledged; the statement len=(*gp)*2; issues a read(*gp) request and uses the returned value (result=5) in the local computation (length=5).
CC++ (7)
The sync variable for synchronization:
• initially, it has a special value, "undefined"
• a value can be assigned to this variable only once
• a read of an undefined sync variable blocks the calling thread until the variable is assigned a value
Examples:
sync int i;          //i is a sync integer
sync int* p;         //p is a pointer to a sync integer
sync int *sync sp;   //sp is a sync pointer to a sync integer
Example: A queue using sync variables
[Figure] A linked queue of q_element nodes, each holding a sync value field and a q_element *next pointer. Three states are shown: the initial state (queue empty, the head element's value is undefined), the state after a value is stored in the queue (the head holds a value and a new empty element is appended), and the state after a value is read from the queue (the read element is removed).
• With each value stored in the queue, a new (empty) element is added.
• A read from the queue will delete the read element.
• If a thread reads from an empty queue, it is blocked until an element is available
CC++ (8)
Mutual exclusion using the atomic keyword:
If there are multiple readers/writers for the queue presented above, mutual exclusion must be guaranteed because of the multiple-writer problem.
A member function of a processor object can be declared atomic. This specifies that the execution of this function cannot be interleaved with the execution of other atomic functions of the same object.
Example:
atomic void Queue::Put(int value);
If multiple readers/writers that modify pointers and data structures occur, the multiple-writer problem can be solved by disallowing such concurrency within the object.
Data Transfer Functions:
They are only necessary for local pointers, arrays, and structures containing local pointers. The
mechanism for packaging and unpacking these complex structures is analogous to the C++
built-in stream functions:
ostream& operator<<(ostream&, const TYPE& obj_in);
istream& operator>>(istream&, TYPE& obj_out);
(class ios of the iostream library defines the infix operators ’<<’ and ’>>’)
In CC++ for data transfer:
CCVoid& operator<<(CCVoid&, const TYPE& obj_in);
CCVoid& operator>>(CCVoid&, TYPE& obj_out);
Associated with every CC++ datatype is a pair of data transfer functions that define how to
transfer that type to another processor object. Only for the types shown above is interaction by the user required; simple types are pre-defined (CCVoid is analogous to istream/ostream).
CC++ (9)
Asynchronous communication
• Use specialized data tasks to provide read/write operations on shared data structures.
• The shared data structures are distributed among the computation tasks. Then
each task must periodically poll for pending requests.
• The shared data structures are distributed among the computation tasks. Remote
tasks can access data using RPCs to appropriate member functions.
Mapping
The task is to map the processor objects and/or the computing threads to the physical processors of a multiprocessor environment.
[Figure] Two-stage mapping: threads are mapped to processor objects, and processor objects are mapped to physical processors.
CC++ (10)
Placement of processor objects
Newly created POs are placed on the same processor as their creator.
Alternative placement is possible using the placement argument of the new operator (usage in C++: position an object in processor space).
The location is specified by the implementation-dependent classes proc_t and node_t.
Examples:
MyClass *global G;
proc_t location(node_t("my_node"));     //declare a processor on node "my_node"
G = new (location) MyClass;             //create new PO on processor ’location’

proc_t location2(node_t("your_node"));  //a processor on a different node
Mapping threads to processor objects
Alternative: create a fixed number of POs and map them 1:1 to processors; the threads are dynamically assigned to these POs on creation.
This is the approach for SPMD computations.
Sequential composition of CC++ computations (SPMD model)
Two components:
• Initialization: creation of the POs for execution and communication
• Execution: actual computation using the structures created in the initialization phase
[Figure] Main control thread and suspended POs across the initialization and execution phases. Two variants: long-life threads executing both components, initialization and execution (=> simple programs), versus separate short-life threads for initialization and execution (=> more efficient).
Fortran
Data parallelism is the concurrency that arises when the same operation is applied to some elements of a data ensemble. A data-parallel program is a sequence of such operations.
Fortran 90: data-parallel language with concurrent execution but without domain decomposition
High Performance Fortran (HPF): Augments F90 with additional parallel constructs and
data placement directives
Parallelism of data can only be expressed using arrays; thus the data structures operated on are arrays. Concurrency may be implicit or may be expressed using explicit parallel constructs:
A = B * C    ! A, B, C are arrays; this is an explicit parallel construct
A do-loop is an example of an implicitly parallel construct: a compiler may or may not be able to detect the independence of the iterations and thus perform them in parallel.
F90
Array assignment statements
An array section can be specified using the range triplet.
lower-bound : upper-bound : stride
Array intrinsic functions
Intrinsic functions for operations on arrays and vectors, like multiplication, division, addition and subtraction.
[Foster95]
HPF
The PROCESSORS directive is used to specify the shape and size of an array of abstract processors.
Distribute an array X over several processors using the DISTRIBUTE directive:
!HPF$ PROCESSORS pr(16)
      real X(1024)
!HPF$ DISTRIBUTE X(BLOCK) ONTO pr
The distribution can be block or cyclic, e.g. for a two-dimensional array: (BLOCK,*), (CYCLIC,*), (CYCLIC,BLOCK).
The ALIGN directive is used to align elements of different arrays with each other:
!HPF$ ALIGN C(I) WITH B(I*2)
Mapping of abstract processors to physical processors is not defined in the language -> implementation dependent.
Concurrency in HPF is indicated using the FORALL statement and the INDEPENDENT directive.
The FORALL statement has the general form:
FORALL (triplet, ..., triplet, mask) assignment
An example for FORALL statements:
FORALL (i=1:m, j=1:n)      X(i,j) = i+j
FORALL (i=1:n, j=1:n, i<j) Y(i,j) = 0.0
FORALL (i=1:n)             Z(i,i) = 0.0
The FORALL statement synchronizes after each iteration. The INDEPENDENT directive asserts that the iterations of a do-loop can be performed independently.
!HPF$ INDEPENDENT
do i=1,n
...
enddo
[Foster95]
Design methodology - PCAM
Methodical design of parallel algorithms
• Partition: Decompose computation into small tasks
- Domain decomposition
- Functional decomposition
• Communicate: Required communication is determined and communication
structures and algorithms are defined
• Agglomerate: Combining tasks and communication structures to larger tasks
with respect to implementation costs and performance
• Map: Each larger task is assigned to a processor; the mapping can be static or dynamic
Additional dynamic working principles:
• Load-balancing
• Multi-grid
[Foster95], available online at http://www-unix.mcs.anl.gov/dbpp
[Figure] Example for multi-grid [www.iwr.uni-heidelberg.de]