Les MPSoC

Transcription

Les MPSoC
Multiprocessor SoC
Benoît Miramond
University of Cergy-Pontoise
ETIS Lab
What do you need?
Use cases
• On my local desktop – I need RAM
– Light processing
– Performance is misused
– Multi-task
– Versatile, so very heavy OS
• Indirectly on a server – importance of parallelism
– Access to distant servers (internet, local network, LDAP server…)
– Continuous intensive processing
• In my pocket – specific architecture
– Signal and image processing
– Low consumption, low frequency
What are the achieved performances today?
One TeraOps yesterday…
[www.top500.org] Dec 2005 – GFlops on LINPACK

Rk | Site                                                | Computer                                                                  | Cores  | Year | Rmax    | Rpeak
1  | DOE/NNSA/LLNL, United States                        | BlueGene/L - eServer Blue Gene Solution (IBM)                             | 131072 | 2005 | 280,600 | 367,000
2  | IBM Thomas J. Watson Research Center, United States | BGW - eServer Blue Gene Solution (IBM)                                    | 40960  | 2005 | 91,290  | 114,688
3  | DOE/NNSA/LLNL, United States                        | ASC Purple - eServer pSeries p5 575 1.9 GHz (IBM)                         | 10240  | 2005 | 63,390  | 77,824
4  | NASA/Ames Research Center/NAS, United States        | Columbia - SGI Altix 1.5 GHz, Voltaire Infiniband (SGI)                   | 10160  | 2004 | 51,870  | 60,960
5  | Sandia National Laboratories, United States         | Thunderbird - PowerEdge 1850, 3.6 GHz, Infiniband (Dell)                  | 8000   | 2005 | 38,270  | 64,512
…  |                                                     |                                                                           |        |      |         |
62 | Commissariat a l'Energie Atomique (CEA), France     | Tera10 beta system - NovaScale 5160, Itanium2 1.6 GHz, Quadrics (Bull SA) | 1008   | 2005 | 5,829   | 6,451.2
A French example: Tera-10 (CEA)
• 2000 m²
• The most powerful machine in 2005
• Up to 100 TOPS in 2009
• 8704 Intel Itanium2 processors (1.6 GHz)
• Interconnection network: Quadrics (100 GB/s)
• Linux
• 1 petabyte of data generated for each simulation
One TeraOps nowadays
[www.top500.org] Sept 2008

Rank | Site                                       | Computer / Year, Vendor                                              | Cores  | Rmax (TFlops) | Power (kW)
1    | DOE/NNSA/LANL, United States               | Roadrunner - BladeCenter / 2008, IBM                                 | 122400 | 1026.00       | 2345.50
2    | DOE/NNSA/LLNL, United States               | BlueGene/L / 2007, IBM                                               | 212992 | 478.20        | 2329.60
3    | Argonne National Laboratory, United States | Blue Gene/P Solution / 2007, IBM                                     | 163840 | 450.30        | 1260.00
…    |                                            |                                                                      |        |               |
9    | IDRIS, France                              | Blue Gene/P Solution / 2008, IBM                                     | 40960  | 112.50        | 315.00
10   | Total Exploration Production, France       | SGI Altix ICE 8200EX, Xeon quad core 3.0 GHz / 2008, SGI             | 10240  | 106.10        | 442.00
…    |                                            |                                                                      |        |               |
13   | EDF R&D, France                            | Frontier2 BG/L - Blue Gene/P Solution / 2008, IBM                    | 32768  | 92.96         | 252.00
…    |                                            |                                                                      |        |               |
32   | CEA, France                                | Tera-10 - NovaScale 5160, Itanium2 1.6 GHz, Quadrics / 2006, Bull SA | 9968   | 52.84         |
TeraOps, a parallel problem?
And what about embedded performances…
• Desktop MPSoC architectures
– Intel Core 2 Duo Kentsfield
• 130 W
• 4 cores at 2.4 GHz
• 100 GFlops
– AMD Opteron Barcelona
• 65 W
• 4 cores at 2.5 GHz
• 30 GOPS
– Cell processor
• 80 W
• 9 cores at 6 GHz
• 512 GOPS
– Nvidia G80 (GeForce)
• 175 W
• 128 SIMD processors at 574 MHz
• 500 GOPS
• Embedded multicore processors
– ARM11 MPCore
• ~266 mW
• 4 cores at 620 MHz
• 2600 D-MIPS, 2.5 GOPS
– ADSP Blackfin
• ~200 mW
• 2 cores at 600 MHz
• 1200 MMACs x 2 = 2.4 GMACs
On chip parallelism
Oops?!
OPS or IPS?
How to compare performance?

Processor                           | IPS                     | IPS/MHz         | Year
Pencil and paper (for comparison)   | 0.0119 IPS              | n/a             | 1892
Intel 4004                          | 92 kIPS at 740 kHz      | 0.124           | 1971
ARM 7500FE                          | 35.9 MIPS at 40 MHz     | 0.897 MIPS/MHz  | 1996
ARM Cortex A8                       | 2,000 MIPS at 1.0 GHz   | 2.0 MIPS/MHz    | 2005
AMD Athlon FX-57                    | 12,000 MIPS at 2.8 GHz  | 4.285 MIPS/MHz  | 2005
AMD Athlon 64 3800+ X2 (Dual Core)  | 14,564 MIPS at 2.0 GHz  | 7.282 MIPS/MHz  | 2005
Xbox360 IBM "Xenon" Triple Core     | 19,200 MIPS at 3.2 GHz  | 2.0 MIPS/MHz    | 2005
PS3 Cell BE                         | 10,240 MIPS at 3.2 GHz  | 3.2 MIPS/MHz    | 2006
AMD Athlon FX-60 (Dual Core)        | 18,938 MIPS at 2.6 GHz  | 7.283 MIPS/MHz  | 2006
Intel Core 2 X6800                  | 27,079 MIPS at 2.93 GHz | 9.242 MIPS/MHz  | 2006
Intel Core 2 Extreme QX6700         | 49,161 MIPS at 2.66 GHz | 18.481 MIPS/MHz | 2006
Intel Core 2 Extreme QX9770         | 59,455 MIPS at 3.2 GHz  | 18.580 MIPS/MHz | 2008
[2]
Power efficiency
[Bar chart, scale 0–200: energy efficiency in MOPS/mW (or MIPS/mW) and MOPS/mW/MHz, compared for AMD Barcelona, Core2Duo, Cell, Nvidia G80, ARM MPCore and Blackfin.]
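As a rough cross-check (not on the original slide), the figures quoted on the earlier comparison slide already give the orders of magnitude, treating ops and flops interchangeably for this purpose:

ARM11 MPCore: 2.5 GOPS / 266 mW = 2500 MOPS / 266 mW ≈ 9.4 MOPS/mW
Intel Core 2 (Kentsfield): 100 GFlops / 130 W = 100 000 MFLOPS / 130 000 mW ≈ 0.8 MFLOPS/mW

Roughly an order of magnitude in favour of the embedded core, which is what the bars illustrate.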
Some different processing architectures
[Chart: energy efficiency (log scale, 0.1 to 1000 MOPS/mW) versus flexibility (coverage):
– Dedicated HW: 100–200 MOPS/mW (about 1 TOPS per 1 W)
– Reconfigurable processor/logic: 10–100 MOPS/mW
– ASIPs, DSPs: 1–10 MOPS/mW
– Embedded microprocessors: 0.1–1 MIPS/mW]
The French TeraOps project
Partners of the project
Keywords of the project
• On-chip parallel machine
• Embedded
• Programmable
• Reaching one tera-operation (10^12 operations/sec) in 1 litre
• End-user applications
– Security:
• Smart camera
• And heterogeneous sensors
– Multimedia:
• Video compression/decompression, HDTV
– Transport:
• New types of equipment in automotive, etc.
– Defense:
• High-performance sensors
– Space:
• Observation and telecommunication
...
Ter@ops: The proposed architecture
One Ter@ops chip is composed of 48 tiles:
• 20 SIMD
• 7 GPP
• 8 CTR
• 2 Reconf (2 x 20 mm²)
• 11 control and communication
All under 400 mm².
Tile legend:
– SIMD: MIPS 4k (CTR) + SIMD
– GPP: MIPS 24k with floating-point unit, possibly with a MIPS 4k (CTR)
– SCMP-typed GPP: MIPS 4k (CTR) + MIPS 24k
– SCMP: scheduler + RDN, with the RDN + SCMP memories and the OSOC control bus
– SHMEM: 1 Mbyte shared memory
– DDR3 800/1600 32-bit controller
– Flash/peripheral controller
– Network controller, reconfigurable tile, inter-chip communication, chip supervision/JTAG
Annotations on the diagram: "addition of the NoC and shared memories", "addition of the SCMP and the OSOC network".
Ter@ops: A tile based architecture
[Block diagram: the same tile legend as above, organised into tile types – TYPE I (SIMD), TYPE II (SCMP), TYPE III (GPP), TYPE IV (Reconf.), TYPE n (Custom), plus OSOC – each with its own control/network interface. GPP, SCMP and Custom tiles are grouped into clusters FIRE1, FIRE2, … FIREn (GPP1, GPP2, SCMP1, SCMP2, Custom) around the SHMEM 1 Mbyte shared memory, the RDN + SCMP memories and the OSOC control bus, and connected by a data NoC and a control NoC (or bus). The Ter@ops chip attaches to DDR memory and to a Ter@ops host.]
Parallelism management
Classification
Basically, we find two levels of parallelism:
• ILP: Instruction-Level Parallelism
• TLP: Task-Level Parallelism
And two types of architectures:
• SMP: Symmetric Multiprocessor
• AMP: Asymmetric Multiprocessor (the application is dispatched in a non-symmetric manner onto the different nodes: one core dedicated to one type of task)
The parallel architectures
• Homogeneous / heterogeneous
• Regular / irregular
• Symmetric / asymmetric
• Shared memory
• Distributed memory
Shared memory architectures
• In an SMP, the processors are all connected through a single bus (memory and peripherals)
• Performance grows with the number of processors until the bus saturates under the multiple requests
• => the observed limit is about 16 processors
• Well suited to today's embedded context
Multiprocessor architectures
[Classification diagram: mono-task/thread versus multi-task/thread architectures, then symmetric (SMP) versus asymmetric (AMP), homogeneous versus heterogeneous, and shared versus distributed memory. Examples placed in the tree: UltraSparc T2, Intel 80-core, ARM11 MPCore, Opteron, Cell, G80, Hypercore, Nomadik.]
Shared memory
[The same multiprocessor classification diagram as before, now also placing OMAP among the examples.]
OMAP architectures

                            | OMAP3503              | OMAP3515              | OMAP3525                              | OMAP3530
Core processor              | ARM Cortex-A8 600 MHz | ARM Cortex-A8 600 MHz | ARM Cortex-A8 600 MHz                 | ARM Cortex-A8 600 MHz; 720 MHz
2D/3D graphics              | –                     | POWERVR SGX graphics  | –                                     | POWERVR SGX graphics
DSP processing & multimedia | –                     | –                     | C64x+ DSP & video accelerator 430 MHz | C64x+ DSP & video accelerator 430 MHz; 520 MHz
Shared peripheral set       | USB 2.0 HS OTG controller, LCD controller, camera interface, serial interfaces, memory support, and more (common to the family, with 2D/3D graphics and DSP/multimedia software compatibility across devices)
OMAP 3530

Parameter                        | OMAP3530
CPU                              | 1 C64x+, ARM Cortex-A8
Peak MMACS                       | 4160
Frequency (MHz)                  | 520
RISC frequency (MHz)             | 720
On-chip L1/SRAM                  | 112 KB (DSP), 32 KB (ARM Cortex-A8)
On-chip L2/SRAM                  | 96 KB (DSP), 256 KB (ARM Cortex-A8)
RAM                              | 64 KB
ROM                              | 16 KB (DSP), 32 KB (ARM Cortex-A8)
EMIF                             | 1 32-bit SDRC, 1 16-bit GPMC
External memory types supported  | LPDDR, NOR flash, NAND flash, OneNAND, async SRAM
DMA                              | 64-ch EDMA, 32-bit channel SDMA
Video port (configurable)        | 1 dedicated output, 1 dedicated input
Graphics accelerator             | 1
MMC/SD                           | 3
McBSP                            | 5
Pin/package                      | 423 FCBGA, 515 POP-FCBGA
POP interface                    | Yes (CBB)
I2C                              | 3
McSPI                            | 4
HDQ/1-Wire                       | 1
UART                             | 3
USB                              | 2
Timers                           | 12 32-bit GP, 2 32-bit WD
Core supply (V)                  | 0.8 V to 1.35 V
IO supply (V)                    | 1.8 V, 3.0 V (MMC1 only)
Operating temperature range (°C) | 0 to 90, -40 to 105
BeagleBoard
http://beagleboard.org/
OMAP 4
PandaBoard
[The multiprocessor classification diagram once more, as a reminder before moving on to NUMA architectures.]
NUMA architectures
• One solution is to build sets of nodes, each composed of one or several (e.g. 4) processors, their memory and their peripherals
[Diagram: Node 1, Node 2, … Node n, each containing several CPUs, a local memory and I/O, interconnected by a high-bandwidth network.]
Non Uniform Memory Access
• Consequently, the memory access time differs between a local and a remote access
• These architectures are therefore called NUMA
• Example on the AMD Opteron:
– Local memory access = 65 ns
– Access to the neighbouring node = 100 ns
• => concentrate the accesses locally (OS + application)
OS for NUMA
• The binary code of the OS is duplicated onto each processing node
• The OS tries to favour local accesses
• It then uses verification mechanisms for cache coherency on the pages shared between nodes
• cc-NUMA = cache-coherent NUMA
• The OS provides
– process containment mechanisms
– process migration (taking into account affinities between application/memory/input-output), as in the sketch below
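As an illustration of these containment and affinity mechanisms (not from the original slides), here is a minimal sketch using the Linux libnuma API; the node number 0 and the buffer size are arbitrary choices:

/* Sketch: pin the current task to NUMA node 0 and allocate its working
 * buffer from that node's local memory (Linux libnuma, link with -lnuma).
 * Node 0 and the 1 MB size are arbitrary choices for illustration. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "No NUMA support on this system\n");
        return EXIT_FAILURE;
    }

    /* Containment: run this process only on the CPUs of node 0 */
    numa_run_on_node(0);

    /* Affinity between application and memory: allocate on node 0 */
    size_t size = 1 << 20;
    char *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL)
        return EXIT_FAILURE;

    buf[0] = 42;              /* touch the local memory */
    numa_free(buf, size);
    return EXIT_SUCCESS;
}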
AMD Roadmap [2008]
Architecture of the Opteron
Performance metrics
The Opteron processor
• The Opteron (Shanghai) is competitive with the Intel Xeon processor (Harpertown 3.0 GHz) in terms of raw performance
• But it is more interesting when considering the performance/power ratio
Software applications for NUMA
• The application itself can also be adapted so that it runs more efficiently on NUMA machines
• For example, the Oracle database takes advantage of NUMA architectures
How to program parallel architectures?
Creating explicit parallelism
Concurrent programming models
• Different approaches are possible
– Shared memory model: multithreading
In this case, it is necessary to protect shared resources such as memory, peripherals…
And one must ensure cache coherency!
– Distributed memory: MPI
Communication from one address space to another through message passing: IPC (inter-processor communication); a minimal MPI sketch follows.
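To make the message-passing side concrete, here is a minimal MPI sketch (not from the original slides): rank 0 sends one integer to rank 1, each rank living in its own address space.

/* Minimal MPI sketch of the message-passing model: rank 0 sends one
 * integer to rank 1 through an explicit message (no shared memory).
 * Compile with mpicc, run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}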
Example of POSIX threads
• Creation of a thread:
– pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*routine)(void *), void *arg);
• Waiting for another (child) thread:
– pthread_join(pthread_t thread, void **retval);
• Synchronization between threads:
– pthread_mutex_lock(pthread_mutex_t *m);
– pthread_mutex_unlock(pthread_mutex_t *m);
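A minimal runnable sketch combining these calls (the two worker threads and the shared counter are illustrative choices):

/* Sketch: two POSIX threads increment a shared counter protected by a
 * mutex, then the main thread joins them. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* protect the shared resource */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);

    pthread_join(t1, NULL);            /* wait for the child threads */
    pthread_join(t2, NULL);

    printf("counter = %ld\n", counter); /* expected: 200000 */
    return 0;
}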
Open MP
What is OpenMP?
OpenMP is:
• An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared-memory parallelism
• Comprised of three primary API components:
– Compiler directives
– Runtime library routines
– Environment variables
• Portable: the API is specified for C/C++ and Fortran, and has been implemented on most major platforms including Unix/Linux and Windows NT
• Standardized: jointly defined and endorsed by a group of major computer hardware and software vendors; expected to become an ANSI standard later???
What does OpenMP stand for?
• Short version: Open Multi-Processing
• Long version: Open specifications for Multi-Processing via collaborative work between interested parties from the hardware and software industry, government and academia
OpenMP is not:
• Meant for distributed-memory parallel systems (by itself)
• Necessarily implemented identically by all vendors
• Guaranteed to make the most efficient use of shared memory
• Required to check for data dependencies, data conflicts, race conditions, or deadlocks
• Required to check for code sequences that cause a program to be classified as non-conforming
• Meant to cover compiler-generated automatic parallelization and directives to the compiler to assist such parallelization
• Designed to guarantee that input or output to the same file is synchronous when executed in parallel; the programmer is responsible for synchronizing input and output
References:
OpenMP website: openmp.org
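As an illustration of the directive-based style (this example is not part of the OpenMP reference above), a loop parallelized over the available threads; compile with -fopenmp on gcc:

/* Sketch: OpenMP parallel loop summing an array; the reduction clause
 * lets each thread keep a private partial sum. Compile with -fopenmp. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (up to %d threads available)\n",
           sum, omp_get_max_threads());
    return 0;
}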
Problems with IPC in multicore

Method                                | Provided by (operating systems or other environments)
File, Signal, Socket, Message queue, Pipe, Named pipe, Semaphore, Shared memory, Memory-mapped file / peripherals | Most operating systems; some systems, such as Windows, only implement signals in the C run-time library and do not actually provide support for their use as an IPC technique
Message passing (shared nothing)      | Used in the MPI paradigm, Java RMI, CORBA, MSMQ, MailSlots and others
Problems with IPC in multicore
The MCA has produced several specifications to enhance the adoption of multicore, including a communications API (MCAPI) and a resource management API (MRAPI).
These APIs are independent of the operating system.
http://www.multicore-association.org/home.php
MCAPI: MultiCore Communication API
• The MCAPI specification is both an API and a communications semantics specification. It does not define which link management, device model or wire protocol is used underneath it. As such, by defining a standard API, it is intended to provide source-code compatibility for application code ported from one operating environment to another.
• MCAPI defines three fundamental communication types:
– 1. Messages – connection-less datagrams.
– 2. Packet channels – connection-oriented, uni-directional, FIFO packet streams.
– 3. Scalar channels – connection-oriented, single-word, uni-directional FIFO streams.
MCAPI
MRAPI: Multicore Resource Management API
SMP advantages
SMP RTOS
Concurrency and SMP
Browsers are good candidates for SMP
CoreMarks: absolute performance
Choose the right one!
– Distributed memory, message passing: MPI
– Shared memory, RTOS: OpenMP, POSIX threads
– Architecture Specific Languages: OpenCL, CAL, CUDA
– Dataflow programming
MPSoC onto Altera FPGAs
The MPSoC concept of Altera
• A symmetric multiprocessor system
• With that solution it will take longer to build the software than the hardware!
• Shared-memory paradigm, so hardware protection:
– Hardware mutex: Altera Mutex Core, chapter in volume 5 of the Quartus II Handbook.
2 independent Nios II processors
Shared resources
There is no hardware protection against corruption, so…
Use of the mutex core
• It is a simple atomic test-and-set operation, but at the hardware level
• This core is an Altera IP, associated with an API for software programming
Software use of the mutex core
Altera provides the following software files accompanying
the mutex core:
• altera_avalon_mutex_regs.h—Defines the core's
register map, providing symbolic constants to access the
low-level hardware.
• altera_avalon_mutex.h—Defines data structures and
functions to access the mutex core hardware.
• altera_avalon_mutex.c—Contains the implementations
of the functions to access the mutex core
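Following the API documented in these files, a typical critical section looks roughly like the sketch below; the mutex device name "/dev/message_buffer_mutex" and the lock value 1 are assumptions that depend on the names generated for your own SOPC Builder system:

/* Sketch of how one CPU could guard a shared buffer with the hardware
 * mutex. The component name and the lock value are assumptions; use the
 * names from your own system.h. */
#include "altera_avalon_mutex.h"

int main(void)
{
    /* Handle to the hardware mutex instantiated in SOPC Builder */
    alt_mutex_dev *mutex = altera_avalon_mutex_open("/dev/message_buffer_mutex");
    if (mutex == NULL)
        return 1;

    /* Spin (hardware test-and-set) until this CPU owns the mutex */
    altera_avalon_mutex_lock(mutex, 1);

    /* ... access the shared memory region or peripheral here ... */

    /* Release the mutex so the other CPU can take it */
    altera_avalon_mutex_unlock(mutex);
    return 0;
}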
Register maps & mutex API
Writing to and Reading from a Mailbox

#include <stdio.h>
#include "altera_avalon_mailbox.h"

int main()
{
    alt_u32 message = 0;
    alt_mailbox_dev *send_dev, *recv_dev;

    /* Open the two mailboxes between this processor and another */
    send_dev = altera_avalon_mailbox_open("/dev/mailbox_0");
    recv_dev = altera_avalon_mailbox_open("/dev/mailbox_1");

    while (1)
    {
        /* Send a message to the other processor */
        altera_avalon_mailbox_post(send_dev, message);

        /* Wait for the other processor to send a message back */
        message = altera_avalon_mailbox_pend(recv_dev);
    }

    return 0;
}
Addressing Peripherals
Memory partitioning
• Here, all the processors share the same physical memory:
• The ELF sections .text, .rodata, .rwdata, .heap and .stack are allocated in the external SRAM
• So if CPU1 uses 128 KB and CPU2 uses 64 KB, the memory is partitioned as follows:
– From 0x0 to 0x1FFFF for CPU1
– From 0x20000 to 0x2FFFF for CPU2
• The partitioning is based on the exception address of each CPU under SOPC_BUILDER
• The Nios IDE then takes these addresses into account (from system.h) in order to link the corresponding executable files
Example of a memory partitioning
Memory sections
• Exception sections always begin with an offset of 0x20
• which corresponds to one line of instruction cache (0x20 bytes = 32 bytes = 8 instructions)
• The static partitioning is not verified at run time => one memory zone may overwrite another
The same applies for the bootloader in flash
And what about putting an OS?
• It is your turn to work …
Summary & conclusions
Controlling the evolution
• The exponential growth of on-chip integration, together with performance, is ultimately too fast to be controlled. Today's systems suffer related problems:
– Instability of operating systems
– Incompatibility between software
– High turnover in training
– Fast extinction of technologies
– Complexity poorly controlled at all levels
– While this may still worsen…
• There is no question anymore about the accessible performance
• The question of the efficient utilization and the control of this performance is the fundamental problem
So performance will come from you!
• The technological and physical constraints slow down the classical evolution
• The performance gain remains accessible, with a significant effort from software designers
• Exploit parallelism:
– Multi-threaded programming, for example
– Multiprocessor OS
– Better knowledge of low-level computer organization
Article on Linux and SMP
"The kernel does its part to optimize the load across the available CPUs. All that's left is to ensure that the application can be sufficiently multi-threaded to exploit the power in SMP."
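As a small closing sketch (not from the article), one common way to give the SMP kernel something to balance is to spawn one worker thread per online CPU:

/* Sketch: let the Linux scheduler spread work across an SMP machine by
 * creating one worker thread per online CPU. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("worker %ld running\n", id);   /* real work would go here */
    return NULL;
}

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpus < 1)
        ncpus = 1;

    pthread_t *threads = malloc(ncpus * sizeof(pthread_t));
    if (threads == NULL)
        return 1;

    for (long i = 0; i < ncpus; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    for (long i = 0; i < ncpus; i++)
        pthread_join(threads[i], NULL);    /* the kernel balances the load */

    free(threads);
    return 0;
}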