Programmierkonzepte
Vorlesung Rechnerarchitektur 2, Seite 77: Speedup (1)

A sequential algorithm can be evaluated in terms of its execution time. The execution time of a parallel algorithm depends not only on the problem size but also on the architecture of the parallel system and the number of available computing resources. The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel. For a parallel computation utilizing p processors it is defined as the ratio of the sequential processing time to the parallel processing time:

    S_p = T_s / T_p

T_s is the execution time taken to perform the computation on one processor, using the best algorithm known for the particular computational problem. T_p is the execution time needed to perform the same computation on a parallel system using p processors.

Amdahl's Law: An application program W is divided into two computational parts X and Y, which take x% and y% of the total execution time (x + y = 1). Assume that part Y cannot be improved regarding execution time, while part X can be improved to run n times faster. Then the speedup S_p is

    S_p = T_s / T_p = W / (((x/n) + y) * W) = 1 / (((1 - y)/n) + y) = n / (1 + (n - 1) * y)  ->  1/y  for n -> infinity

This implies:
• Part X should be optimized, i.e. concentrate on the common cases.
• The upper bound for the speedup is 1/y.
• Part Y is called the bottleneck.
• Part Y should be as small as possible.
• y is known as the sequential bottleneck.
[Scalable Parallel Computing]

Lehrstuhl für Rechnerarchitektur - Universität Mannheim, WS05/06

Seite 78: Speedup (2)

    1 <= S_p <= p

[Figure: speedup plotted over the number of processors (0 to 32). The linear-speedup line S_p = p separates the region of superlinear speedup above it from the normal region below it. A typical speedup curve reaches its highest achievable speedup at an optimal number of processors; adding more processors beyond that point reduces performance.]

The speedup factor S_p is normally less than the number of processors p (the theoretical maximum) due to overhead factors:
• synchronization
• communication
• input/output operations
• architectural bottlenecks
• etc.

Superlinear Speedup
In an ideal system the speedup S_p cannot be greater than p. But superlinear speedup can be observed when a non-optimal sequential algorithm is used.

Seite 79: Speedup (3) - Problem Size

Gustafson's Law: The major assumption of Amdahl's Law is that the problem size is fixed. John Gustafson (1988) proposed a concept that improves speedup by scaling the problem size with the increase in machine size. Thus the problem size is another important parameter for analysing parallel computations on parallel systems.

Example: Calculating the sum of 10 (100, 1000) numbers on a sequential computer results in an execution time nearly proportional to 10 (100, 1000) time steps. Assume a parallel computer with p = 10. Summing up 10 elements will take much longer than the execution on a sequential processor, because the overhead to synchronize and communicate the results is large compared to the number of useful computations. Increasing the problem size to 100 will increase the speed-up and the efficiency. Using 1000 numbers gives each processor 100 elements to sum up, which further increases speed-up and efficiency.
The scaled speedup is

    S'_p = T_s / T_p = (y*W + (1 - y)*n*W) / W = y + (1 - y)*n

[Figure: speedup plotted over the number of processors (0 to 32) for problem sizes 10, 100, 1000 and 10000: increasing the problem size n increases the achievable speedup towards the linear-speedup line.]

Seite 80: Speedup (4) - Efficiency

The efficiency of a parallel computation can be defined as the ratio between the speed-up factor and the number of processing elements in a parallel system. In an ideal parallel system the speed-up factor equals p, so the efficiency becomes one.

    E_p = S_p / p = T_s / (p * T_p),    0 <= E_p <= 1

Seite 81: Programmiermodelle (1)

Programming models are classified along categories such as dusty decks, applicative languages, control parallelism, data parallelism and implicit parallelism. The two main characteristics of parallel programming systems are how they manage communication between the concurrent activities, and how they synchronize, or coordinate, these activities [Wilson95].
Language Taxonomy Scheme [Wilson95]:

implicit parallelism
• dusty decks -> parallelizing compiler (FORTRAN 77 + Parafrase compiler)
• applicative languages (SISAL)

data parallelism
• parallelization directives (FORTRAN 77 + Cray directives, FORTRAN 77 + HPF directives)
• whole-array operations (FORTRAN 90, C*)

control parallelism
• arbitrary sharing (fork/join, futures)
• partial/explicit sharing (LINDA, an implementation of a VSM)
• no sharing (CSP, procedural message passing)

Conventional languages and implementations of a VSM:
• Fortran
• C, C++
• implementations of VSM in software with hardware support
• adaptive consistence

Seite 82: Programming models (2)

Commitments for a programming model:
• Message Passing Model
• Shared Memory Model (global address space model)

- data types
- structuring of the program into program units
- data exchange between the program units
- determination of the control flow
- exploitation of the parallelism
- parallelization of program units
- usage of communication protocols
- coordination of the parallel flow

PCAM: Partition - Communicate - Agglomerate - Map [Foster95]

Scalability requires:
• a shared memory node - as long as the speedup inside a node justifies the costs
• physically distributed memory and processing units - as soon as larger systems are targeted

Problems due to physical distribution:
• latency (normally not predictable)
  - of the communication
  - due to accesses to global data
• synchronization costs

Seite 83: Data-parallel languages

C*
• Extension of C with data types and operations on vectors and matrices, specific to the SIMD processing of the Connection Machine CM2 and CM5

CC++
• Extension of C++ with functions for communication, synchronization and control of parallelism
• 6 basic abstractions: processor object, global pointer, thread, sync variable, atomic function, transfer function

Fortran 90
• Extension of
Fortran 77 with data types and operations on vectors and matrices (SIMD)
• dynamic memory allocation
• introduction of pointers and data structures
• 'intrinsic functions' for vectors and matrices
• 'access functions' for vectors and matrices

HPF-Fortran
• Extension of Fortran 77 with constructs for parallel processing (FORALL, INDEPENDENT)
• Directives for defining the locality of data structures

Seite 84: Message-Passing Programming Model

Contra:
• The difficulty with this model is that the programmer must explicitly program the communication and synchronization between the cooperating threads of the task.
• Some communication constructs, e.g. blocking send and blocking receive, double as synchronization points. Thus communication is intertwined with coordination.
• Message-passing programming can be made easier by constraining the totally asynchronous execution of the MIMD mode of operation.
• Such a restriction is the single program - multiple data (SPMD) model.
• SPMD solves data-parallel problems by applying a replicated thread to different data sets.
• SPMD combines global homogeneity with local autonomy of execution. Parallelization reduces to the task of data distribution.
• Thread coordination is performed in lock-step mode, replacing individual thread synchronization by a global barrier.

Pro:
• Message passing is the implementationally simplest and computationally most efficient programming model for distributed memory architectures.
• Explicit knowledge and usage of the location of data structures [W. K. Giloi]

Seite 85: Communication with message-passing

Extension of conventional languages by message-passing libraries:
• Extension with functions for:
  - communication
  - synchronization
  - process management
  - mapping
• Features of communication:
  - synchronous / asynchronous
  - blocking behavior
  - management of the communication object
  - buffering
  - concurrency level

Classic programming languages and their message-passing extensions:
• Fortran
• C, C++
• Express
• Parmacs
• PVM
• MPI
  - MPI Standard of the MPI Forum (http://www.mpi-forum.org)
  - Implementation: Open MPI, a merger of LAM-MPI, FT-MPI and LA-MPI (http://www.open-mpi.org)
  - Implementation: MPICH-1, MPICH-2, Argonne National Laboratory (http://www-unix.mcs.anl.gov/mpi/mpich)

Seite 86: MPI - Message Passing Interface (MPI 1.2)

MPI is a library for Fortran, C or C++ for message passing and the de-facto message-passing standard. MPI is a complex system: it comprises 129 functions, many of which have numerous parameters or variants. In the MPI programming model, a computation comprises one or more processes that communicate by calling library routines to send messages to and receive messages from other processes. In most MPI implementations a fixed set of processes is created at program initialization, with one process per processor. MPI is a message-passing interface, not a complete programming environment. MPI does not allow dynamic process creation.
Features:
• static process model, one-to-one mapping
• fixed number of processes
• point-to-point communication, collective communication
• easy heterogeneous implementation
• virtual communication channels
• efficient implementations for multithreaded environments

The ability of MPI to probe for messages supports asynchronous communication. Nevertheless, non-blocking communication is not treated as a basic function, because blocking communication can replace non-blocking communication at any time.

The basic functions of MPI are:
• MPI_INIT: initiate an MPI computation
• MPI_FINALIZE: terminate a computation
• MPI_COMM_SIZE: determine the number of processes
• MPI_COMM_RANK: determine the own process identifier
• MPI_SEND: send a message (6 types)
• MPI_RECV: receive a message (2 types)

MPI has three communication modes for the send function: standard, ready and synchronous. Every mode can be combined with blocking or non-blocking behaviour of the function. MPI provides support for general application topologies specified by graphs, and explicit support for n-dimensional Cartesian grids.

Seite 87: MPI - Message Passing Interface (MPI 2.0)

MPI 2.0 provides the same functionality as MPI 1.2, plus extended functions for:
• dynamic process creation
• remote memory access (RMA) operations like remote load (get), remote store (put), ...
• parallel I/O operations
• thread-awareness and thread-safety (optional)

Most notable are the RMA operations. This feature enables the programmer to move from two-sided to one-sided communication, i.e. only one process (the local process) is involved in a communication, no longer two processes (local and remote). Note that RMA operations only improve the communication modes, not the synchronization modes.
For synchronization, obviously, all synchronizing processes must participate. The thread support improves the effectiveness on symmetric multi-processing (SMP) nodes, where more than one processor works in a node: the intra-node communication can be realized using shared memory and the inter-node communication using conventional message passing. Besides this, the usage of threads is a kind of dynamic process management, enabling the user to take advantage of the benefits of multi-grid algorithms.

[Figure: combined message-passing / shared-memory model. Two SMP nodes, each running one process whose threads (thread1, thread2) communicate through shared memory within the node, while the MPI layers of the two processes communicate via their NICs over the interconnection network using message passing.]

Seite 88: CC++ (1)

CC++ is the abbreviation for Compositional C++. It is a:
• general-purpose programming language
• superset of C++ with six new keywords

CC++ was developed by Chandy and Kesselman; an example of an implementation is the Caltech C++ Compiler.

CC++ introduces abstractions for the representation of shared-memory structures, threads of control, mutual exclusion and synchronization principles. With these six new abstractions it is possible to implement parallel code in a C++ fashion. Local and remote data accesses can be modelled and are therefore visible to the user. Because CC++ implements both the shared-memory and the distributed-memory model, no restriction regarding the programming model is imposed. Note that there is no explicit support for message passing; abstractions like send and receive do not exist.
[Foster95]

Six new basic abstractions

processor object
• mechanism for controlling locality
• a computation comprises one or more processor objects
• within a processor object, sequential code can be executed without modifications
• access to local data structures
• identified by the keyword global
• the predefined class proc_t controls processor object placement
• processor objects run in separate address spaces (represent shared data)

Seite 89: CC++ (2)

global pointer
• identified by the type modifier global
• mechanism for linking together processor objects
• must be used to access a data structure or to perform computation (using a remote procedure call, or RPC) in another processor object

thread
• mechanism for specifying concurrent execution
• created independently from processor objects
• more than one thread can execute in a processor object
• par, parfor, and spawn statements create threads
• threads represent threads of control in a processor object (common address space)

sync variable
• type modifier sync
• used to synchronize thread execution

atomic function
• specified by the keyword atomic
• mechanism to control the interleaving of threads executing in the same processor object

transfer function
• predefined type CCVoid
• allows arbitrary data structures to be transferred between processor objects

Seite 90: CC++ (3)

Concurrency
The par, parfor and spawn constructs create additional threads. Statements that result in nonlocal changes of the control flow (e.g. return) are forbidden in parallel blocks. A parallel block is distinguished from a sequential block by the keyword par or parfor; the block blocks until all statements inside it have terminated. par and parfor can be nested as desired, e.g.
for a master-worker model.

spawn
• creates an independent thread of control to specify unstructured parallelism
• the parent can wait neither for its termination nor for return values

Examples for concurrency:

par {
    statement1;
    statement2;
    statement3;
    ...
    statementN;
}

parfor (int i=0; i<10; i++) {
    my_process(i);
}

par {
    master();
    parfor (int i=0; i<10; i++)
        worker(i);
}

Seite 91: CC++ (4)

Locality
Processor object:
• represents an address space in which threads can be executed
• identified by the keyword global
• unit of locality, where data access is considered local and therefore cheap

This example creates a processor object class with public member functions:

global class MyClass : public ParentClass {
public:
    void func1();
    void func2();
};

Processor objects are linked together using global pointers.

Global pointers:
• like a normal pointer, but can refer to other processor objects or to data structures within other processor objects
• represent nonlocal data, which is expensive to access
• identified by the keyword global

Example:

float *global gpf;

Processor objects created with the new statement are represented by a global pointer:

MyClass *global myclass_pobj = new MyClass;

Seite 92: CC++ (5)

Thread placement:
• threads are executed by default in the same processor object as their parent
• they can be placed in another processor object using an RPC
• an RPC can be invoked using the global pointer to another processor object

Example to execute a thread in a different processor object:

myclass_pobj->func1();

A single thread in a processor object is a task. If the return result of the RPC is not required, the spawn function is more efficient than a normal RPC:
spawn myclass_pobj->func1();

Communication
• no primitives for sending and receiving data between threads
• threads communicate by operating on shared data structures (e.g. channel communication)
• global pointers can be used to communicate data between processor objects
• synchronization is done using the sync keyword
• atomic functions provide functionality for mutual exclusion
• data transfer functions can be used to communicate more complex data structures

Seite 93: CC++ (6)

Global pointers for remote operations:
• global pointers in CC++ are used in the same manner as local C++ pointers
• they can be used to operate on data of other processor objects
• they can be used to invoke remote functions (of other processor objects)

Generic form of an RPC:

<type> *global gp;
result = gp->p(...);

Steps:
• The arguments are packed into a message, communicated and remotely unpacked. The calling thread suspends execution.
• A thread is created in the remote processor object to execute the call.
• Transferring the result back to the calling processor object unblocks the calling thread.

[Figure: message sequence between proc_obj0 and the remote proc_obj1 for "global int* gp; int len;": the assignment *gp=5 sends a write(*gp, 5) message that is acknowledged, and len=(*gp)*2 sends a read(*gp) message whose result (5) is used to compute len.]
Seite 94: CC++ (7)

The sync variable for synchronization:
• initially it has a special value, "undefined"
• a value can be assigned to this variable only once
• a read of an undefined sync variable blocks the calling thread until the variable is assigned a value

Examples:

sync int i;         // i is a sync integer
sync int* p;        // p is a pointer to a sync integer
sync int *sync sp;  // sp is a sync pointer to a sync integer

Example: a queue using sync variables

[Figure: three states of a linked list of q_elements, each holding a "sync value" and a "q_element *next" pointer. In the initial state the queue is empty and consists of a single element with undefined value; storing a value fills the last element and appends a new (empty) element; reading a value removes the element that was read.]

• With each value stored in the queue, a new (empty) element is added.
• A read from the queue deletes the element that was read.
• If a thread reads from an empty queue, it is blocked until an element is available.

Seite 95: CC++ (8)

Mutual exclusion using the atomic keyword:
If there are multiple readers/writers for the queue presented above, mutual exclusion must be guaranteed because of the multiple-writer problem. A function that is part of a processor object can be declared atomic. This specifies that the execution of this function cannot be interleaved with the execution of other atomic functions of the same object.

Example:

atomic void Queue::Put(int value);

If multiple readers/writers occur that modify pointers and data structures, the multiple-writer problem can be solved by disallowing concurrency in the object.

Data Transfer Functions:
They are necessary only for local pointers, arrays, and structures containing local pointers.
The mechanism for packing and unpacking these complex structures is analogous to the C++ built-in stream functions:

ostream& operator<<(ostream&, const TYPE& obj_in);
istream& operator>>(istream&, TYPE& obj_out);

(the class ios of the iostream library defines the infix operators '<<' and '>>')

In CC++, for data transfer:

CCVoid& operator<<(CCVoid&, const TYPE& obj_in);
CCVoid& operator>>(CCVoid&, TYPE& obj_out);

Associated with every CC++ datatype is a pair of data transfer functions that define how to transfer that type to another processor object. Only for the types shown above is interaction by the user required; simple types are pre-defined (CCVoid is analogous to istream/ostream).

Seite 96: CC++ (9)

Asynchronous communication
• Use specialized data tasks to provide read/write operations on shared data structures.
• Alternatively, the shared data structures are distributed among the computation tasks; then each task must periodically poll for pending requests.
• Or the shared data structures are distributed among the computation tasks and remote tasks access the data using RPCs to appropriate member functions.

Mapping
The task is to map the processor objects and/or the computing threads to the physical processors of a multiprocessor environment.

[Diagram: threads are mapped to processor objects, which in turn are mapped to physical processors.]

Seite 97: CC++ (10)

Placement of processor objects
Newly created POs are placed on the same processor as their creator. Alternative placement is possible using the placement argument of the new operator (usage in C++: position an object in processor space). The location is specified by the implementation-dependent classes proc_t and node_t.
Examples:

MyClass *global G;
proc_t location(node_t("my_node"));    // declare processor on node
G = new (location) MyClass;            // create new PO on processor 'location'
proc_t location(node_t("your_node"));  // processor on a different node

Mapping threads to processor objects
Alternative: create a fixed number of POs and map them 1:1 to processors. The threads are dynamically assigned to these POs on creation. This is an approach for SPMD computations.

Sequential composition of CC++ computations (SPMD model)
Two components:
• Initialization: creation of the POs for execution and communication
• Execution: actual computation using the structures created in the initialization phase

[Diagram: main control flow with suspended POs during initialization and execution. Long-lived threads executing both components (initialization and execution) give simple programs; separate short-lived threads for initialization and execution are more efficient.]

Seite 98: Fortran

Data parallelism is the concurrency that arises when the same operation is applied to some or all elements of a data ensemble. A data-parallel program is a sequence of such operations.

Fortran 90: data-parallel language with concurrent execution but without domain decomposition.
High Performance Fortran (HPF): augments F90 with additional parallel constructs and data placement directives.

Parallelism of data can only be expressed using arrays; thus the data structures operated on are arrays. Concurrency may be implicit or may be expressed using explicit parallel constructs:

A = B * C   ! A, B, C are arrays; this is an explicit parallel construct

A do-loop is an example of an implicit parallel construct: a compiler may or may not be able to detect the independence of the iterations and thus perform them in parallel.

F90 array assignment statements
An array section can be specified using the range triplet:
lower-bound : upper-bound : stride

Array intrinsic functions
Intrinsic functions provide operations on arrays and vectors, like multiplication, division, addition and subtraction. [Foster95]

Seite 99: HPF

The PROCESSORS directive is used to specify the shape and size of an array of abstract processors. An array X is distributed over several processors using the DISTRIBUTE directive:

!HPF$ PROCESSORS pr(16)
      real X(1024)
!HPF$ DISTRIBUTE X(BLOCK) ONTO pr

The distribution can be a block or a cyclic distribution, e.g. for a two-dimensional array:

(BLOCK,*)   (CYCLIC,*)   (CYCLIC,BLOCK)

The ALIGN directive is used to align elements of different arrays with each other:

!HPF$ ALIGN C(I) WITH B(I*2)

The mapping of abstract processors to physical processors is not defined in the language; it is implementation dependent.

Concurrency in HPF is indicated using the FORALL statement and the INDEPENDENT directive. The FORALL statement has the general form:

FORALL ( triplet, ..., triplet, mask ) assignment

Examples for FORALL statements:

FORALL (i=1:m, j=1:n) X(i,j) = i+j
FORALL (i=1:n, j=1:n, i<j) Y(i,j) = 0.0
FORALL (i=1:n) Z(i,i) = 0.0

The FORALL statement synchronizes after each iteration. The INDEPENDENT directive asserts that the iterations of a do-loop can be performed independently:

!HPF$ INDEPENDENT
do i=1,n
   ...
enddo

[Foster95]

Seite 100: Design methodology - PCAM

Methodical design of parallel algorithms:
• Partition: decompose the computation into small tasks
  - domain decomposition
  - functional decomposition
• Communicate: the required communication is determined, and communication structures and algorithms are defined
• Agglomerate: tasks and communication structures are combined into larger tasks with respect to implementation costs and performance
• Map: each larger task is assigned to a processor; the mapping can be static or dynamic

Additional dynamic working principles:
• load balancing
• multi-grid

[Foster95], online under http://www-unix.mcs.anl.gov/dbpp
Example for multi-grid: [www.iwr.uni-heidelberg.de]