Deep learning and hardware spiking neurons
Transcription
NeuroSTIC 2015, July 1st, 2015

Implementing Deep Neural Networks with Non-Volatile Memories
Olivier Bichler¹ ([email protected]), Daniele Garbin², Elisa Vianello², Luca Perniola², Barbara DeSalvo², Christian Gamrat¹
¹ CEA, LIST, Laboratory for Enhancing Reliability of Embedded Systems (www.cea.fr)
² CEA, LETI

Summary
- Context
- Opportunity – Deep Neural Networks
- Challenge – The Memory Bottleneck
- Paradigm Shift – Spiking, NVM-based Networks
- Related Developments
- Perspectives

© CEA. All rights reserved. DACLE Division, July 2015

Internet of (Smart?) Things

How Smart Can We Get?
- ImageNet classification (Hinton's team, hired by Google) [1]
  - 1.2 million high-resolution images, 1,000 different classes
  - Top-5 error rate of 17% (a huge improvement)
  - Learned features on the first layer
- Facebook's "DeepFace" program (labs head: Y. LeCun) [2]
  - 4 million images, 4,000 identities
  - 97.25% accuracy, vs. 97.53% human performance

State-of-the-art in Recognition
Databases, in order of increasing complexity:

Database    | Content                                                              | # Images        | # Classes | Best score
MNIST       | Handwritten digits                                                   | 60,000 + 10,000 | 10        | 99.79% [3]
GTSRB       | Traffic signs                                                        | ~50,000         | 43        | 99.46% [4]
CIFAR-10    | airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck | 50,000 + 10,000 | 10        | 91.2% [5]
Caltech-101 |                                                                      | ~50,000         | 101       | 86.5% [6]
ImageNet    |                                                                      | ~1,000,000      | 1,000     | Top-5 83% [1]
DeepFace    |                                                                      | ~4,000,000      | 4,000     | 97.25% [2]

The state of the art is a deep neural network every time.

Main Actors at the International Level
Academics and industrials, by topic:
- Deep learning: Andrew Ng; G. Hinton; A. Krizhevsky; Y. LeCun (Overfeat, Torch, DeepFace); J. Schmidhuber; C. Farabet (Madbits; nn-X on FPGA/GPU/cloud); R. Sarikaya and G. E. Dahl (speech recognition, DBN/RNN); Project Adam; O. Temam
- NVM-based architectures: H.-S. P. Wong; H. Hwang; Wei Lu
- RRAM-based architectures: D. Strukov; S. Park
- Memristor / RRAM: R. S. Williams
- PCM-based architectures
- Specialized architectures and chips: H.-J. Yoo; TrueNorth chip; Zeroth chip (E. M. Izhikevich, BrainOS); NeuroDSP chip; Cognimem chip
- Bio-inspired software: S. J. Thorpe

Deep Convolutional Networks
- Convolutional Neural Network (CNN) or similar topology
- Source: Rodrigo Benenson's GitHub page, http://rodrigob.github.io/are_we_there_yet/build/

Convolutional Layer
- An n × n kernel (matrix K) is slid over the input map (matrix I) to produce an output feature map (matrix O):
  O(i,j) = tanh( Σ_{k=0}^{n−1} Σ_{l=0}^{n−1} I(i+k, j+l) · K(k,l) )
- Each kernel generates a different output feature map
- Kernels are learned with gradient-descent algorithms (classical back-propagation is very efficient!)
[Figure: an example input map, a 3 × 3 kernel of −1/0/1 weights, and the resulting output feature map]
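The convolution above is easy to sketch in NumPy (a minimal illustration of the formula, not the actual Xnet implementation):

```python
import numpy as np

def conv_layer(I, K):
    """Valid 2D convolution followed by tanh:
    O[i, j] = tanh(sum_{k,l} I[i+k, j+l] * K[k, l]) for an n x n kernel K."""
    n = K.shape[0]
    H, W = I.shape
    O = np.empty((H - n + 1, W - n + 1))
    for i in range(O.shape[0]):
        for j in range(O.shape[1]):
            O[i, j] = np.tanh(np.sum(I[i:i + n, j:j + n] * K))
    return O

O = conv_layer(np.ones((4, 4)), np.ones((2, 2)))  # every window sums to 4 -> tanh(4)
```

Each kernel K produces one output feature map; a convolutional layer simply holds several such kernels.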
CNNs Organization
- Deep = number of layers >> 1

State-of-the-art CNN Example
- The German Traffic Sign Recognition Benchmark (GTSRB): 43 traffic sign types, > 50,000 images
- Neurons: 287,843
- Synapses: 1,388,800 (total memory: 1.5 MB with 8-bit synapses)
- Connections: 124,121,800
- Near-human recognition rate (> 98%) [4]

The Memory Bottleneck
- For each output value: read the input and kernel values from memory, multiply and accumulate (MULT, ADD, REG), apply the non-linearity (tanh function), and store the result back in memory
- Total cost, driven by the system clock: (n × n cycles per kernel + the non-linearity computation) × output matrix size × number of kernels per output feature map × number of output feature maps per layer × number of layers

The Memory Bottleneck: Solutions?
- Increase data-level parallelism: SIMD instructions bring a ×2 to ×32 acceleration, but are limited by the width of the memory bus
- Increase the number of processing cores: ×(number of cores) acceleration, assuming distributed memory
- A high-end GPU brings a ×100 acceleration over a CPU (at ~250 W power consumption...)
- Back to our example: ~125 million MAC operations (for 48 × 48 pixel inputs), with a 128-bit memory bus (SIMD ×16) and 16 processing cores with distributed memory, take 500K cycles @ 200 MHz = 2.5 ms per input
- With ROI extraction at 30 frames/s, that leaves time to process only 12 ROIs per frame...
- Highly specialized architectures are required to envision embeddable systems

Is a Paradigm Shift Possible?
- Fully distributed, fully parallel?
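As a quick sanity check on the memory-bottleneck example above, the cycle budget can be redone numerically (the MAC count and clock rate are the slide's figures; the assumption of perfect speedup across SIMD lanes and cores is mine):

```python
def time_per_input(mac_ops, simd_width, n_cores, clock_hz):
    """Ideal processing time per input, assuming the MAC operations are split
    evenly across all SIMD lanes and all cores (memory stalls not counted)."""
    cycles = mac_ops / (simd_width * n_cores)
    return cycles, cycles / clock_hz

# ~125 million MACs per 48x48 input, SIMD x16, 16 cores, 200 MHz clock
cycles, seconds = time_per_input(125e6, 16, 16, 200e6)
# about 500K cycles and ~2.5 ms per input, matching the slide
```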
- Compute in memory! MAC computation in memory: the read current is the signal to accumulate, I = G × U, where the NVM conductance G plays the role of the synaptic weight and the read-pulse voltage U carries the input signal
- Input signal coding: voltage level (digital-to-analog converter), pulse duration (pulse-width modulation), or spike-based coding!
- Non-linearity computation: analog computation, or a look-up table

Spike-based Neural Networks
- Input signal: rate-based coding, from 1 pulse to N pulses per input time slot
  - N sets the precision of the input signal discretization
  - Tunability: energy consumption ∝ N, applicative performance ∝ N
- Non-linearity: a refractory period T_refrac approximates tanh() with a piece-wise linear function [7]
  - Easy to implement, with no applicative performance penalty!
- Direct interface to bio-inspired sensors [8]

Spike-based Coding & Propagation
- 29 × 29 pixels = 841 addresses
- Rate-based input coding: pixel brightness is mapped to a spiking frequency between fMIN and fMAX
- Spikes propagate through layers 1 to 4; over time, the correct output emerges
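Rate-based input coding as described above can be sketched as follows; the fMIN/fMAX values, the time-slot length, and the Poisson spike generation are illustrative assumptions, not values from the talk:

```python
import numpy as np

def rate_code(pixels, f_min=10.0, f_max=100.0, t_slot=0.1, seed=0):
    """Map pixel brightness in [0, 1] to a firing rate between f_min and f_max (Hz),
    then draw the number of spikes emitted during one time slot of t_slot seconds."""
    rng = np.random.default_rng(seed)
    rates = f_min + np.asarray(pixels) * (f_max - f_min)
    return rng.poisson(rates * t_slot)

# Brighter pixels emit more spikes per time slot on average
dim = rate_code(np.full(1000, 0.0))
bright = rate_code(np.full(1000, 1.0))
```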
Our Simulation Tools: Xnet
Example on the MNIST database (60,000 images). Deep network description file, network.ini:

; Environment
[env]
SizeX=29
SizeY=29
ConfigSection=env.config

[env.config]
ImageScale=0

; First layer (convolutional)
[conv1]
Input=env
Type=Conv
KernelWidth=5
KernelHeight=5
NbChannels=6
Stride=2
ConfigSection=common.config

; Second layer (convolutional)
[conv2]
Input=conv1
Type=Conv
KernelWidth=5
KernelHeight=5
NbChannels=12
Stride=2
ConfigSection=common.config

; Third layer (fully connected)
[fc1]
Input=conv2
Type=Fc
NbOutputs=100
ConfigSection=common.config

; Output layer (fully connected)
[fc2]
Input=fc1
Type=Fc
NbOutputs=10
ConfigSection=common.config

; Common config for the static model
[common.config]
NoBias=1
WeightsLearningRate=0.0005
Threshold=1.0
NoClamping=1

Run with:
xnet_convnet network.ini mnist -learn 6000000 -log 10000

CONFIDENTIAL

Back-propagation Offline Learning
- Simulated network topology for MNIST (auto-generated)
- Learning recognition rate: 99.7%; test recognition rate: 98.7%
- Learned kernels for the conv1 layer

Spike-based Read-only Network
- Spiking propagation of one pattern
- Spike-based test performance: recognition rate of 98.7%, a 0% performance drop vs. the static network!
- Spike-based network statistics:

Layer | Synapses (shared) | Connections | Events/frame | Events/connection
conv1 | 150               | 25,350      | 36,666       | 1.45
conv2 | 1,800             | 45,000      | 173,278      | 3.85
fc1   | 30,000            | 30,000      | 226,859      | 7.56
fc2   | 1,000             | 1,000       | 8,037        | 8.04
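The last column of the statistics table is simply events per frame divided by the number of connections; a quick check of the figures:

```python
# (connections, events per frame) for each layer, from the table above
stats = {
    "conv1": (25_350, 36_666),
    "conv2": (45_000, 173_278),
    "fc1": (30_000, 226_859),
    "fc2": (1_000, 8_037),
}
events_per_connection = {
    layer: round(events / conns, 2) for layer, (conns, events) in stats.items()
}
# {'conv1': 1.45, 'conv2': 3.85, 'fc1': 7.56, 'fc2': 8.04}
```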
Spike-based Networks with NVMs
- PCM: programmed by crystallization/amorphization
  - "2-PCM synapse": spiking pre-synaptic neurons (the inputs) apply a read voltage V_RD; the equivalent synaptic current is I = I_LTP − I_LTD, integrated by the spiking post-synaptic neuron (the output)
  - Demonstrated on unsupervised extraction of car trajectories [9]
- CBRAM: programmed by forming/dissolution of a conductive filament
  - Used as binary synapses with stochastic learning, demonstrated on unsupervised MNIST handwritten-digit classification [10]

Implementation with NVM Devices
- Spike-based computing principle: an input spike from the input neurons is propagated through a convolution kernel to the output neurons
- Synaptic weighting of the spike is performed by multi-level or binary RRAM device(s)
- Other convolution kernel(s) follow the same scheme, with a CMOS dynamic interconnect routing the spikes
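The in-memory MAC of the previous slides (I = G × U, with the currents summed on a shared output line) can be sketched numerically; the conductance values here are arbitrary illustrations:

```python
import numpy as np

def crossbar_mac(G, U):
    """Analog MAC in a resistive crossbar: each device contributes a current
    I = G * U (Ohm's law) and the currents of a column sum on its output line
    (Kirchhoff's law). G: conductance matrix (n_inputs x n_outputs),
    U: read-pulse voltages (n_inputs,). Returns one accumulated current per output."""
    return np.asarray(U) @ np.asarray(G)

G = np.array([[1.0, 0.5],
              [2.0, 0.0]])          # synaptic weights stored as conductances
I = crossbar_mac(G, [1.0, 1.0])     # one read pulse on each input row
# I == [3.0, 0.5]
```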
Benchmarking Application in Xnet (1)
Simplified "HMAX"-like network: 8,560 weights to learn, 925,320 shared weights. Network description:

[env]
SizeX=48
SizeY=48
ConfigSection=env.config

[env.config]
ImageScale=1

[conv1_7x7]
Input=env
Type=Conv
KernelWidth=7
KernelHeight=7
NbChannels=4
Stride=1
Kernel=Gabor
Kernel.Sigma=2.8
Kernel.Lambda=3.5
Kernel.Psi=0.0
Kernel.Gamma=0.3
Kernel[0][0].Theta=0.0
Kernel[0][1].Theta=45.0
Kernel[0][2].Theta=90.0
Kernel[0][3].Theta=135.0
ConfigSection=common_fixed.config

[conv1_9x9]
Input=env
Type=Conv
KernelWidth=9
KernelHeight=9
NbChannels=4
Stride=1
Padding=1
Kernel=Gabor
Kernel.Sigma=3.6
Kernel.Lambda=4.6
Kernel.Psi=0.0
Kernel.Gamma=0.3
Kernel[0][0].Theta=0.0
Kernel[0][1].Theta=45.0
Kernel[0][2].Theta=90.0
Kernel[0][3].Theta=135.0
ConfigSection=common_fixed.config

[pool1]
Input=conv1_7x7,conv1_9x9
Type=Pool
PoolWidth=8
PoolHeight=8
NbChannels=8
Stride=4
Pooling=Max
Mapping.Size=1
Mapping.NbIterations=4

[fc1]
Input=pool1
Type=Fc
NbOutputs=20
ConfigSection=common.config

[fc2]
Input=fc1
Type=Fc
NbOutputs=2
ConfigSection=common.config

[common_fixed.config]
NoBias=1
WeightsLearningRate=0.0
BiasLearningRate=0.0
NoClamping=1

[common.config]
NoBias=1
NoClamping=1

pool1 mapping:
1 0 0 0  # conv1_7x7
0 1 0 0  # conv1_7x7
0 0 1 0  # conv1_7x7
0 0 0 1  # conv1_7x7
1 0 0 0  # conv1_9x9
0 1 0 0  # conv1_9x9
0 0 1 0  # conv1_9x9
0 0 0 1  # conv1_9x9

Benchmarking Application in Xnet (2)
Caltech-101 subset, 2 categories:
- Faces_easy (435 images): 200 learning / 200 testing
- BACKGROUND_Google (468 images): 200 learning / 200 testing
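The fixed first-layer kernels of this benchmark network are Gabor filters, parameterized in the configuration above by Sigma, Lambda, Psi, Gamma and Theta. A standard Gabor formulation can be sketched as follows (Xnet's exact normalization is not given on the slide, so this only approximates what it builds):

```python
import numpy as np

def gabor_kernel(size, sigma, lam, psi, gamma, theta_deg):
    """Standard Gabor filter: an oriented Gaussian envelope times a cosine carrier."""
    theta = np.deg2rad(theta_deg)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # coordinates rotated by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / lam + psi)
    return envelope * carrier

# One of the four 7x7 conv1_7x7 kernels (theta = 45 degrees)
k = gabor_kernel(7, sigma=2.8, lam=3.5, psi=0.0, gamma=0.3, theta_deg=45.0)
```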
Benchmarking Application in Xnet (3)
- 20 output neurons; fast learning (20,000 steps)
- Test score: 98.25%
- Weights discretization:

Precision (number of levels) | Ideal | 256 | 128 | 64    | 32    | 16   | 8     | 4
Score (%)                    | 98.25 | 99  | 98  | 97.75 | 97.75 | 98.5 | 89.75 | 55.5

- tanh() approximated with a simple saturation: identical performance

Towards Hardware Synthesis
1) Deep network builder: network.ini (below), run with: xnet network.ini database -learn
2) Defects learning
3) Performance analysis: learning and test recognition rates (test recognition rate: 95%); estimated defects visualization
4) C export and RTL synthesis

; Environment
[env]
SizeX=8
SizeY=8
ConfigSection=env.config

[env.config]
ImageScale=0

; First layer (convolutional)
[conv1]
Input=env
Type=Conv
KernelWidth=3
KernelHeight=3
NbChannels=32
Stride=1

; Second layer (pooling)
[pool1]
Input=conv1
Type=Pool
PoolWidth=2
PoolHeight=2
NbChannels=32
Stride=2

; Third layer (fully connected)
[fc1]
Input=pool1
Type=Fc
NbOutputs=100

; Output layer (fully connected)
[fc2]
Input=fc1
Type=Fc
NbOutputs=10

Towards Fully Convolutional CNNs
- State-of-the-art in image segmentation
- Take an arbitrary input size
- Trained end-to-end, pixels-to-pixels
- Eliminate the redundant calculations inherent to "patch"-based segmentation
- Spike-coding compatible!
[11] J. Long, E. Shelhamer, T. Darrell, "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015

Long-Term Perspectives
- Towards even more bio-inspired systems!
- Unsupervised online learning (Spike-Timing-Dependent Plasticity)
- Learning directly from bio-inspired sensors (artificial retina, cochlea, ...)
[Figure: a kernel of 15 × 15 synapses learned from the input activity (128 × 128); output feature map activity with factor-2 subsampling (57 × 57)]

Conclusion
Deep Neural Networks are...
- ... at the leading edge of today's recognition systems
- ... deployed in large-scale commercial products (Facebook, Google, ...)
- ... hard to integrate into embedded products, even with ASICs
Spiking NVM-based deep networks are promising:
- Computing capabilities identical to those of conventional networks
- Provide the high memory density required
- True computing in memory, eliminating the memory bottleneck
- Simple and efficient performance tunability
- Direct interface to bio-inspired sensors (retina, cochlea, ...)
- Large potential for advanced bio-inspired learning systems

Thank you! Questions?
Implementing Deep Neural Networks with Non-Volatile Memories
Olivier Bichler, [email protected]
NeuroSTIC 2015, July 1st, 2015
CEA Centre de Grenoble, 17 rue des Martyrs, 38054 Grenoble Cedex
CEA Centre de Saclay, Nano-Innov PC 172, 91191 Gif sur Yvette Cedex

References
[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", NIPS 2012
[2] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, "DeepFace: Closing the Gap to Human-Level Performance in Face Verification", CVPR 2014
[3] D. Ciresan, U. Meier, J. Schmidhuber, "Multi-column Deep Neural Networks for Image Classification", CVPR 2012
[4] D. Ciresan, U. Meier, J. Masci, J. Schmidhuber, "Multi-column deep neural network for traffic sign classification", Neural Networks (32), pp. 333-338, 2012
[5] M. Lin, Q. Chen, S. Yan, "Network In Network", ICLR 2014
[6] M. D. Zeiler, R. Fergus, "Visualizing and Understanding Convolutional Networks", arXiv:1311.2901
[7] J. A. Pérez-Carrasco et al., "Mapping from Frame-Driven to Frame-Free Event-Driven Vision Systems by Low-Rate Rate-Coding and Coincidence Processing. Application to Feed-Forward ConvNets", IEEE Trans. on Pattern Analysis and Machine Intelligence, 2014
[8] L. Camuñas-Mesa et al., "An Event-Driven Multi-Kernel Convolution Processor Module for Event-Driven Vision Sensors", IEEE J. of Solid-State Circuits, 2012
[9] O. Bichler, M. Suri, D. Querlioz, D. Vuillaume, B. DeSalvo, C. Gamrat, "Visual pattern extraction using energy-efficient '2-PCM synapse' neuromorphic architecture", IEEE Transactions on Electron Devices, 2012
[10] M. Suri, O. Bichler, D. Querlioz, G. Palma, E. Vianello, D. Vuillaume, C. Gamrat, B. DeSalvo, "CBRAM devices as binary synapses for low-power stochastic neuromorphic systems: Auditory (cochlea) and visual (retina) cognitive processing applications", IEDM, 2012

Unsupervised Features Extraction
- Learning rule: the relative conductance change ΔW (%) as a function of ΔT = t_post − t_pre (ms) reproduces the experimental data of [Bi & Poo], with LTP inside the T_LTP window and LTD outside it (figure: measured LTP/LTD data and simulation curves)
- Network topology: a CMOS retina of 16,384 spiking pixels (128 × 128) feeds the input stimuli to a 1st and a 2nd neuron layer, each with lateral inhibition
- Neuron model, leaky integrate-and-fire: u ← u · exp(−(t_spike − t_last_spike)/τ_leak) + w
- Synaptic model: measured conductance (nS) as a function of pulse number (figure: two conductance-vs-pulse-number plots)
- O. Bichler et al., "Extraction of temporally correlated features from dynamic vision sensors with spike-timing-dependent plasticity", Neural Networks, 2012
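The leaky integrate-and-fire update on the last slide translates almost directly into code; the threshold value and the reset-to-zero after firing are assumptions here, since the slide only gives the leak equation:

```python
import math

class LIFNeuron:
    """Leaky integrate-and-fire: on an incoming spike of weight w at time t,
    u <- u * exp(-(t - t_last_spike) / tau_leak) + w, firing when u reaches threshold."""

    def __init__(self, tau_leak=0.1, threshold=1.0):
        self.tau_leak = tau_leak
        self.threshold = threshold
        self.u = 0.0
        self.t_last = 0.0

    def integrate(self, t, w):
        # Decay the membrane potential since the last input spike, then integrate w
        self.u = self.u * math.exp(-(t - self.t_last) / self.tau_leak) + w
        self.t_last = t
        if self.u >= self.threshold:
            self.u = 0.0  # reset after firing (assumed behavior)
            return True
        return False

n = LIFNeuron(tau_leak=0.1, threshold=1.0)
fired = [n.integrate(0.000, 0.6), n.integrate(0.001, 0.6)]
# fired == [False, True]: two closely spaced sub-threshold spikes add up and fire
```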