MPSoCs
Transcription
MPSoCs
Multiprocessor SoC, Benoît Miramond, University of Cergy-Pontoise, ETIS Lab

What do you need? Use cases
• On my local desktop – I need RAM
  – Light processing
  – Performance is misused
  – Multi-task
  – Versatile, so very heavy OS
• Indirectly on a server – importance of parallelism
  – Access to distant servers (internet, local network, LDAP server…)
  – Continuous intensive processing
• In my pocket – specific architecture
  – Signal and image processing
  – Low consumption, low frequency

What are the achieved performances today?

One TeraOps yesterday… [www.top500.org], Dec. 2005, GFLOPS on LINPACK

Rk | Site | Computer | Cores | Year | Rmax | Rpeak
1 | DOE/NNSA/LLNL, United States | BlueGene/L - eServer Blue Gene Solution (IBM) | 131,072 | 2005 | 280,600 | 367,000
2 | IBM Thomas J. Watson Research Center, United States | BGW - eServer Blue Gene Solution (IBM) | 40,960 | 2005 | 91,290 | 114,688
3 | DOE/NNSA/LLNL, United States | ASC Purple - eServer pSeries p5 575 1.9 GHz (IBM) | 10,240 | 2005 | 63,390 | 77,824
4 | NASA/Ames Research Center/NAS, United States | Columbia - SGI Altix 1.5 GHz, Voltaire Infiniband (SGI) | 10,160 | 2004 | 51,870 | 60,960
5 | Sandia National Laboratories, United States | Thunderbird - PowerEdge 1850, 3.6 GHz, Infiniband (Dell) | 8,000 | 2005 | 38,270 | 64,512
…
62 | Commissariat à l'Energie Atomique (CEA), France | Tera10 beta system - NovaScale 5160, Itanium2 1.6 GHz, Quadrics (Bull SA) | 1,008 | 2005 | 5,829 | 6,451.2

A French example: Tera-10 (CEA)
• 2,000 m²
• The most powerful machine in 2005
• Up to 100 TeraOps planned for 2009
• 8,704 Intel Itanium2 processors (3 GHz)
• Interconnection network: Quadrics (100 Gbytes/s)
• Linux
• 1 petabyte of data generated for each simulation

One TeraOps nowadays [www.top500.org], Sept. 2008

Rank | Site | Computer / Year / Vendor | Cores | Rmax (TFLOPS) | Power (kW)
1 | DOE/NNSA/LANL, United States | Roadrunner - BladeCenter / 2008 / IBM | 122,400 | 1,026.00 | 2,345.50
2 | DOE/NNSA/LLNL, United States | BlueGene/L / 2007 / IBM | 212,992 | 478.20 | 2,329.60
3 | Argonne National Laboratory, United States | Blue Gene/P Solution / 2007 / IBM | 163,840 | 450.30 | 1,260.00
…
9 | IDRIS, France | Blue Gene/P Solution / 2008 / IBM | 40,960 | 112.50 | 315.00
10 | Total Exploration Production, France | SGI Altix ICE 8200EX, Xeon quad core 3.0 GHz / 2008 / SGI | 10,240 | 106.10 | 442.00
…
13 | EDF R&D, France | Frontier2 BG/L - Blue Gene/P Solution / 2008 / IBM | 32,768 | 92.96 | 252.00
…
32 | CEA, France | Tera-10 - NovaScale 5160, Itanium2 1.6 GHz, Quadrics / 2006 / Bull SA | 9,968 | 52.84 |

TeraOps, a parallel problem?

And what about embedded performance?
• Desktop MPSoC architectures
  – Intel Core 2 Kentsfield: 130 W, 4 cores at 2.4 GHz, 100 GFLOPS
  – AMD Opteron Barcelona: 65 W, 4 cores at 2.5 GHz, 30 GOPS
  – Cell processor: 80 W, 9 cores at 6 GHz, 512 GOPS
  – Nvidia G80 (GeForce): 175 W, 128 SIMD units at 574 MHz, 500 GOPS
• Embedded multicore processors
  – ARM11 MPCore: ~266 mW, 4 cores at 620 MHz, 2600 DMIPS (2.5 GOPS)
  – ADSP Blackfin: ~200 mW, 2 cores at 600 MHz, 1200 MMACs × 2 = 2.4 GMACs

On-chip parallelism

Oops?! OPS or IPS? How to compare performance?
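One way to make the desktop and embedded figures above comparable is to normalise throughput by power, which is the MOPS/mW view used in the power-efficiency chart a few slides below. A minimal sketch of that arithmetic, reusing only the numbers quoted above (note that the slides mix GFLOPS, GOPS and GMACs, so the result is an order-of-magnitude comparison, not a benchmark):

    #include <stdio.h>

    /* Throughput and power figures quoted on the slides above. */
    typedef struct { const char *name; double gops; double watts; } proc_t;

    int main(void)
    {
        const proc_t procs[] = {
            { "Intel Core 2 Kentsfield", 100.0,  130.0   },
            { "AMD Opteron Barcelona",    30.0,   65.0   },
            { "Cell processor",          512.0,   80.0   },
            { "Nvidia G80",              500.0,  175.0   },
            { "ARM11 MPCore",              2.5,    0.266 },
            { "ADSP Blackfin (dual)",      2.4,    0.200 },
        };
        int n = sizeof(procs) / sizeof(procs[0]);

        for (int i = 0; i < n; i++) {
            /* MOPS/mW is numerically equal to GOPS/W. */
            double mops_per_mw = procs[i].gops / procs[i].watts;
            printf("%-25s %8.2f MOPS/mW\n", procs[i].name, mops_per_mw);
        }
        return 0;
    }

With these numbers the embedded parts come out around 9 to 12 MOPS/mW, against roughly 0.5 to 6 MOPS/mW for the desktop parts, which is exactly the point the next slides make.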
Processor | IPS | IPS/MHz | Year
Pencil and paper (for comparison) | 0.0119 IPS | n/a | 1892
Intel 4004 | 92 kIPS at 740 kHz | 0.124 | 1971
ARM 7500FE | 35.9 MIPS at 40 MHz | 0.897 MIPS/MHz | 1996
ARM Cortex-A8 | 2,000 MIPS at 1.0 GHz | 2.0 MIPS/MHz | 2005
AMD Athlon FX-57 | 12,000 MIPS at 2.8 GHz | 4.285 MIPS/MHz | 2005
AMD Athlon 64 3800+ X2 (dual core) | 14,564 MIPS at 2.0 GHz | 7.282 MIPS/MHz | 2005
Xbox 360 IBM "Xenon" (triple core) | 19,200 MIPS at 3.2 GHz | 2.0 MIPS/MHz | 2005
PS3 Cell BE | 10,240 MIPS at 3.2 GHz | 3.2 MIPS/MHz | 2006
AMD Athlon FX-60 (dual core) | 18,938 MIPS at 2.6 GHz | 7.283 MIPS/MHz | 2006
Intel Core 2 X6800 | 27,079 MIPS at 2.93 GHz | 9.242 MIPS/MHz | 2006
Intel Core 2 Extreme QX6700 | 49,161 MIPS at 2.66 GHz | 18.481 MIPS/MHz | 2006
Intel Core 2 Extreme QX9770 | 59,455 MIPS at 3.2 GHz | 18.580 MIPS/MHz | 2008
[2]

Power efficiency
[Bar chart: energy efficiency in MOPS/mW (or MIPS/mW) and MOPS/mW/MHz for AMD Barcelona, Core 2 Duo, Cell, Nvidia G80, ARM MPCore and Blackfin; scale 0 to 200.]

Some different processing architectures
[Chart: energy efficiency versus flexibility (coverage). Dedicated hardware reaches about 1 TOPS/W; reconfigurable processors/logic about 100–200 MOPS/mW; ASIPs about 10–100 MOPS/mW; DSPs about 1–10 MOPS/mW; embedded microprocessors about 0.1–1 MIPS/mW.]

The French TeraOps project
Partners of the project
Keywords of the project:
• On-chip parallel machine
• Embedded
• Programmable
• Reaching one tera-operation per second (10^12 operations/s) in one litre

End-user applications:
– Security: smart cameras and heterogeneous sensors
– Multimedia: video compression/decompression, HDTV
– Transport: new types of equipment in automotive, etc.
– Defense: high-performance sensors
– Space: observation and telecommunication…

Ter@ops: the proposed architecture
One Ter@ops chip is composed of 48 tiles:
• 20 SIMD tiles (SIMD: MIPS 4K (CTR) + SIMD unit)
• 7 GPP tiles (GPP: MIPS 24K with a floating-point unit, possibly with a MIPS 4K (CTR); GPP typed as SCMP: MIPS 4K (CTR) + MIPS 24K)
• 8 CTR tiles
• 2 reconfigurable tiles (2 × 20 mm²)
• 11 control and communication tiles
All under 400 mm².
[Diagram legend, translated from French: SCMP: scheduler + RDN; DDR3 800/1600 32-bit controller; Flash/peripheral controller; SHMEM: 1 MB shared memory; reconfigurable network controller; inter-chip communication; chip supervision / JTAG; addition of the NoC and shared memories; addition of the SCMP and of the OSOC network; RDN + SCMP memories; OSOC control bus.]

Ter@ops: a tile-based architecture
[Diagram: the same tile legend, with tile templates TYPE I (SIMD), TYPE II (SCMP), TYPE III (GPP), TYPE IV (Reconf.) up to TYPE n, each attached to the network through a control/network interface.]
[Diagram labels, continued: control/network (Custom), control/network (OSOC), FIRE1…FIREn, SHMEM (1 MB shared memory), SCMP1/SCMP2, GPP1/GPP2, custom tiles, RDN + SCMP memories, OSOC control bus, data NoC, control NoC (or bus), Ter@ops DDR, Ter@ops host.]

Parallelism management

Classification
Basically, we find two levels of parallelism:
• ILP: Instruction-Level Parallelism
• TLP: Task-Level Parallelism
And two types of architectures:
• SMP: Symmetric Multiprocessor
• AMP: Asymmetric Multiprocessor (the application is dispatched in a non-symmetric manner onto the different nodes: one core is dedicated to one type of task)

The parallel architectures
• Homogeneous / heterogeneous
• Regular / irregular
• Symmetric / asymmetric
• Shared memory
• Distributed memory

Shared memory architectures
• In an SMP, all the processors are connected to memory and peripherals through a single bus.
• Performance grows with the number of processors until the bus saturates under the multiple requests.
• => the observed limit is around 16 processors
• Well adapted to today's embedded context

Multiprocessor architectures
[Classification diagram: multiprocessor architectures are split into mono-task/thread and multi-task/thread architectures, then symmetric (SMP) versus asymmetric (AMP), homogeneous versus heterogeneous, and shared versus distributed memory; the examples placed on the tree include UltraSPARC T2, Intel 80-core, ARM11 MPCore, Opteron, Cell, G80, Hypercore and Nomadik.]

[The "On-chip parallelism" slide comparing desktop MPSoC architectures (Intel Core 2 Kentsfield, AMD Opteron Barcelona, Cell, Nvidia G80) with embedded multicore processors (ARM11 MPCore, ADSP Blackfin) is repeated here; see the figures quoted earlier. Label: shared memory.]

[The multiprocessor architecture classification diagram is repeated, with OMAP now added next to Nomadik.]

OMAP architectures
The OMAP35x family shares a common peripheral set (USB 2.0 HS OTG controller, LCD controller, camera interface, serial interfaces, memory support, and more) and offers software compatibility across the family:
• OMAP3503: ARM Cortex-A8 (600 MHz)
• OMAP3515: ARM Cortex-A8 (600 MHz) + PowerVR SGX 2D/3D graphics
• OMAP3525: ARM Cortex-A8 (600 MHz) + C64x+ DSP and video accelerator (430 MHz)
• OMAP3530: ARM Cortex-A8 (600 MHz; 720 MHz) + PowerVR SGX 2D/3D graphics + C64x+ DSP and video accelerator (430 MHz; 520 MHz)

OMAP3530 features:
• CPU: 1 C64x+, ARM Cortex-A8
• Peak MMACS: 4160
• DSP frequency (MHz): 520
• RISC (ARM) frequency (MHz): 720
• On-chip L1/SRAM: 112 KB (DSP), 32 KB (ARM Cortex-A8)
• On-chip L2/SRAM: 96 KB (DSP), 256 KB (ARM Cortex-A8)
• RAM: 64 KB
• ROM: 16 KB (DSP), 32 KB (ARM Cortex-A8)
• EMIF: 1 32-bit SDRC, 1 16-bit GPMC
• External memory types supported: LPDDR, NOR flash, NAND flash, OneNAND, async SRAM
• DMA: 64-channel EDMA, 32-bit channel SDMA
• Video port (configurable): 1 dedicated output, 1 dedicated input
• Graphics accelerator: 1
• MMC/SD: 3
• McBSP: 5
• Pin/package: 423 FCBGA, 515 POP-FCBGA
• POP interface: yes (CBB)
• I2C: 3
• McSPI: 4
• HDQ/1-Wire: 1
• UART: 3
• USB: 2
• Timers: 12 32-bit GP, 2 32-bit WD
• Core supply: 0.8 V to 1.35 V
• I/O supply: 1.8 V, 3.0 V (MMC1 only)
• Operating temperature range (°C): 0 to 90, -40 to 105

BeagleBoard: http://beagleboard.org/
OMAP 4: PandaBoard
[The multiprocessor architecture classification diagram is shown again: mono-task/thread versus multi-task/thread, SMP versus AMP, homogeneous versus heterogeneous, shared versus distributed memory, with the same examples (UltraSPARC T2, Intel 80-core, ARM11 MPCore, Opteron, Cell, G80, Hypercore, Nomadik, OMAP).]

NUMA architectures
• One solution is to build sets of nodes, each composed of one or several (typically 4) processors together with their own memory and peripherals.
[Diagram: Node 1, Node 2, … Node n, each containing CPUs, memory and I/O, interconnected by a high-bandwidth network.]

Non-Uniform Memory Access
• Consequently, the memory access time differs between a local access and a remote one.
• These architectures are therefore called NUMA.
• Example on the AMD Opteron:
  – local memory access = 65 ns
  – access to the neighbouring node = 100 ns
• => concentrate on local accesses (OS + application)

OS for NUMA
• The binary code of the OS is duplicated onto each processing node
• The OS tries to favour local accesses
• It then uses verification mechanisms for cache coherency of the pages shared between nodes
• cc-NUMA = cache-coherent NUMA
• The OS provides:
  – process containment mechanisms
  – process migration (taking into account affinities between application, memory and input/output)

AMD Roadmap [2008]
Architecture of the Opteron
Performance metrics

The Opteron processor
• The Opteron (Shanghai) is competitive with the Intel Xeon processor (Harpertown, 3.0 GHz) in terms of raw performance
• But it is more interesting when considering the performance/power ratio

Software applications for NUMA
• The application itself can also be adapted in order to run more efficiently on NUMA
• For example, the Oracle database takes advantage of NUMA architectures

How to program parallel architectures? Creating explicit parallelism

Concurrent programming models
• Different approaches are possible:
  – Shared memory model: multithreading. In this case, it is necessary to protect shared resources such as memory and peripherals, and one must ensure cache coherency!
  – Distributed memory: MPI. Communication from one address space to another through message passing: IPC (inter-processor communication).

Example of POSIX threads (complete sketches are given at the end of this section)
• Creation of a thread:
  – pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*routine)(void *), void *arg);
• Wait for another (child) thread to terminate:
  – pthread_join(pthread_t thread, void **value_ptr);
• Synchronization between threads:
  – pthread_mutex_lock(pthread_mutex_t *m);
  – pthread_mutex_unlock(pthread_mutex_t *m);

OpenMP
What is OpenMP?
OpenMP is:
• An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared-memory parallelism
• Comprised of three primary API components:
  – compiler directives
  – runtime library routines
  – environment variables
• Portable: the API is specified for C/C++ and Fortran; most major platforms have implementations, including Unix/Linux platforms and Windows NT
• Standardized: jointly defined and endorsed by a group of major computer hardware and software vendors; expected to become an ANSI standard later???

What does OpenMP stand for?
• Short version: Open Multi-Processing
• Long version: Open specifications for Multi-Processing via collaborative work between interested parties from the hardware and software industry, government and academia.
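As announced above, here is a minimal POSIX-threads sketch built only from the calls just listed; the thread count, the work split and the shared counter are illustrative, not part of the slides:

    #include <pthread.h>
    #include <stdio.h>

    #define NB_THREADS 4

    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;
    static long total = 0;

    /* Each thread sums its own slice, then updates the shared total
     * under the mutex (the shared resource that must be protected). */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        long local = 0;
        for (long i = id * 1000; i < (id + 1) * 1000; i++)
            local += i;

        pthread_mutex_lock(&sum_lock);
        total += local;
        pthread_mutex_unlock(&sum_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NB_THREADS];

        for (long i = 0; i < NB_THREADS; i++)
            pthread_create(&th[i], NULL, worker, (void *)i);

        for (long i = 0; i < NB_THREADS; i++)
            pthread_join(th[i], NULL);   /* wait for the child threads */

        printf("total = %ld\n", total);
        return 0;
    }

Compile with gcc -pthread. The mutex protects the shared variable, exactly the kind of shared resource the slides warn about.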
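The same shared-memory computation can be expressed with OpenMP compiler directives instead of explicit thread management; a minimal sketch (loop bounds are illustrative, compile with, for example, gcc -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        long total = 0;

        /* The pragma asks the compiler to run the loop on several threads;
         * reduction(+:total) gives each thread a private copy of total
         * that is summed at the end, so no explicit mutex is needed. */
        #pragma omp parallel for reduction(+:total)
        for (long i = 0; i < 4000; i++)
            total += i;

        printf("total = %ld (up to %d threads)\n", total, omp_get_max_threads());
        return 0;
    }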
OpenMP is not:
• Meant for distributed-memory parallel systems (by itself)
• Necessarily implemented identically by all vendors
• Guaranteed to make the most efficient use of shared memory
• Required to check for data dependencies, data conflicts, race conditions, or deadlocks
• Required to check for code sequences that cause a program to be classified as non-conforming
• Meant to cover compiler-generated automatic parallelization and directives to the compiler to assist such parallelization
• Designed to guarantee that input or output to the same file is synchronous when executed in parallel; the programmer is responsible for synchronizing input and output.
References: OpenMP website: openmp.org

Problems with IPC in multicore
[Table of classical IPC methods and the environments that provide them: file, signal, socket, message queue, pipe, named pipe, semaphore, shared memory, message passing (shared nothing; used in the MPI paradigm, Java RMI, CORBA, MSMQ, MailSlots and others), memory-mapped file / peripherals. Most are provided by the operating system; some systems, such as Windows, only implement signals in the C run-time library and do not actually provide support for their use as an IPC technique.]

Problems with IPC in multicore
The Multicore Association (MCA) has produced several specifications to enhance the adoption of multicore, including a communications API (MCAPI) and a resource management API (MRAPI). They are independent of the operating system.
http://www.multicore-association.org/home.php

MCAPI: MultiCore Communication API
• The MCAPI specification is both an API and a communications-semantics specification. It does not define which link management, device model or wire protocol is used underneath it. As such, by defining a standard API, it is intended to provide source-code compatibility so that application code can be ported from one operating environment to another.
• MCAPI defines three fundamental communication types:
  1. Messages: connection-less datagrams.
  2. Packet channels: connection-oriented, unidirectional, FIFO packet streams.
  3. Scalar channels: connection-oriented, single-word, unidirectional, FIFO streams.

MCAPI
MRAPI: Multicore Resource Management API
SMP advantages
SMP RTOS
Concurrency and SMP
Browsers are good candidates for SMP
CoreMark: absolute performance

Choose the right one!
[Diagram: programming-model choices, from distributed memory with message passing (MPI) to shared memory (POSIX threads, OpenMP, an RTOS) and architecture-specific languages (OpenCL, CAL, CUDA, dataflow programming).]

MPSoC on Altera FPGAs
The MPSoC concept at Altera
• A symmetric multiprocessor system
• With that solution, it will take longer to build the software than the hardware!
• Shared-memory paradigm, so hardware protection is needed:
  – Hardware mutex: the Altera Mutex Core (see the dedicated chapter in volume 5 of the Quartus II Handbook)

2 independent Nios II processors, shared resources: there is no hardware protection against corruption, so…

Use of the mutex core
• It is a simple atomic test-and-set operation, but implemented at the hardware level
• This core is an Altera IP associated with an API for software programming (a usage sketch is given below)

Software use of the mutex core
Altera provides the following software files accompanying the mutex core:
• altera_avalon_mutex_regs.h: defines the core's register map, providing symbolic constants to access the low-level hardware.
• altera_avalon_mutex.h: defines data structures and functions to access the mutex core hardware.
• altera_avalon_mutex.c: contains the implementations of the functions to access the mutex core.
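A minimal sketch of how one of the two Nios II programs might use that API to guard a shared resource. It only uses functions declared in altera_avalon_mutex.h; the device name "/dev/mutex_0" and the lock value are assumptions that depend on how the system was named and built in SOPC Builder:

    #include <stdio.h>
    #include "altera_avalon_mutex.h"

    int main(void)
    {
        /* Open the hardware mutex; the name depends on how the core was
         * named in SOPC Builder (assumed here to be "mutex_0"). */
        alt_mutex_dev *mutex = altera_avalon_mutex_open("/dev/mutex_0");
        if (mutex == NULL) {
            printf("mutex core not found\n");
            return 1;
        }

        /* Atomically claim the mutex, tagging it with this CPU's id (1). */
        altera_avalon_mutex_lock(mutex, 1);

        /* ... access the shared SRAM or peripheral here ... */

        /* Release the mutex so the other Nios II processor can claim it. */
        altera_avalon_mutex_unlock(mutex);
        return 0;
    }

altera_avalon_mutex_lock() returns only once the hardware test-and-set has succeeded, which is why the slides insist that the mutex core, not software alone, is what protects the shared memory between the two processors.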
Register maps & mutex API

Writing to and reading from a mailbox

    #include <stdio.h>
    #include "altera_avalon_mailbox.h"

    int main()
    {
        alt_u32 message = 0;
        alt_mailbox_dev *send_dev, *recv_dev;

        /* Open the two mailboxes between this processor and another */
        send_dev = altera_avalon_mailbox_open("/dev/mailbox_0");
        recv_dev = altera_avalon_mailbox_open("/dev/mailbox_1");

        while (1)
        {
            /* Send a message to the other processor */
            altera_avalon_mailbox_post(send_dev, message);

            /* Wait for the other processor to send a message back */
            message = altera_avalon_mailbox_pend(recv_dev);
        }
        return 0;
    }

Addressing peripherals

Memory partitioning
• Here, all the processors share the same physical memory.
• The ELF sections .text, .rodata, .rwdata, .heap and .stack are allocated in the external SRAM.
• So if CPU1 uses 128 KB and CPU2 uses 64 KB, the memory is partitioned as follows:
  – from 0x0 to 0x1FFFF for CPU1
  – from 0x20000 to 0x2FFFF for CPU2
• The partitioning is defined on the basis of the exception address of each CPU in SOPC Builder.
• The Nios II IDE then takes these addresses into account (from system.h) in order to link the corresponding executable files.

Example of a memory partitioning

Memory sections
• Exception sections always begin at an offset of 0x20
• This corresponds to one line of instruction cache (0x20 bytes = 32 bytes = 8 instructions)
• The static partitioning is not verified at runtime => one memory zone can be overwritten by another
• The same applies to the bootloader in flash

And what about putting an OS on it?
• It is your turn to work…

Summary & conclusions

Controlling the evolution
• The exponential growth of on-chip integration, together with performance, is ultimately too fast to be controlled. Today's systems suffer from related problems:
  – instability of operating systems
  – incompatibility between software packages
  – high turnover in training
  – fast extinction of technologies
  – complexity poorly controlled at all levels

While this may get worse…
• There is no longer any question about the performance that is accessible
• The fundamental problem is the efficient utilization and the control of this performance

So performance will come from you!
• The technological and physical constraints slow down the classical evolution
• The performance gain remains accessible, with a significant effort from software designers
• Exploit parallelism:
  – multi-thread programming, for example
  – multiprocessor OS
  – better knowledge of low-level computer organization

Article on Linux and SMP
The kernel does its part to optimize the load across the available CPUs. All that's left is to ensure that the application can be sufficiently multi-threaded to exploit the power in SMP.
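To close on that last remark, a minimal Linux sketch (glibc assumed; the worker body is a placeholder) that asks the kernel how many CPUs are online and starts one thread per CPU, leaving the scheduler to spread them across the cores:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        printf("worker %ld running\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        /* Number of CPUs currently online, as reported by the kernel. */
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t th[64];

        if (ncpu < 1) ncpu = 1;
        if (ncpu > 64) ncpu = 64;

        /* One thread per CPU: the SMP kernel balances them over the cores. */
        for (long i = 0; i < ncpu; i++)
            pthread_create(&th[i], NULL, worker, (void *)i);
        for (long i = 0; i < ncpu; i++)
            pthread_join(th[i], NULL);

        return 0;
    }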