Deep learning and hardware spiking neurons
Transcription
NeuroSTIC 2015, July 1st, 2015

Implementing Deep Neural Networks with Non-Volatile Memories
Olivier Bichler¹ ([email protected]), Daniele Garbin², Elisa Vianello², Luca Perniola², Barbara DeSalvo², Christian Gamrat¹
¹ CEA, LIST, Laboratory for Enhancing Reliability of Embedded Systems (www.cea.fr)
² CEA, LETI

Summary
- Context
- Opportunity – Deep Neural Networks
- Challenge – The Memory Bottleneck
- Paradigm Shift – Spiking, NVM-based Networks
- Related Developments
- Perspectives

© CEA. All rights reserved. DACLE Division, July 2015

Internet of (Smart?) Things

How Smart Can We Get?
- ImageNet classification (Hinton's team, hired by Google) [1]
  - 1.2 million high-resolution images, 1,000 different classes
  - Top-5 error rate of 17% (a huge improvement)
  - Learned features on the first layer
- Facebook's "DeepFace" program (labs head: Y. LeCun) [2]
  - 4 million images, 4,000 identities
  - 97.25% accuracy, vs. 97.53% human performance

State-of-the-art in Recognition
Databases, in order of increasing complexity:

Database    | Content                                                              | # Images        | # Classes | Best score
MNIST       | Handwritten digits                                                   | 60,000 + 10,000 | 10        | 99.79% [3]
GTSRB       | Traffic signs                                                        | ~50,000         | 43        | 99.46% [4]
CIFAR-10    | airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck | 50,000 + 10,000 | 10        | 91.2% [5]
Caltech-101 |                                                                      | ~50,000         | 101       | 86.5% [6]
ImageNet    |                                                                      | ~1,000,000      | 1,000     | Top-5 83% [1]
DeepFace    |                                                                      | ~4,000,000      | 4,000     | 97.25% [2]

The state of the art is a deep neural network every time.

Main Actors at the International Level
Academics and industrials, by topic:
- Deep learning: Andrew Ng; G. Hinton; A. Krizhevsky; Y. LeCun (Overfeat, Torch, DeepFace); J. Schmidhuber; C. Farabet (Madbits; nn-X on FPGA/GPU/cloud); R. Sarikaya and G. E. Dahl (speech recognition, DBN/RNN); Project Adam; O. Temam
- NVM-based architectures: H.-S. P. Wong; H. Hwang; Wei Lu
- RRAM-based architectures: D. Strukov; S. Park
- Memristor / RRAM: R. S. Williams
- PCM-based architectures
- Specialized architectures and chips: H.-J. Yoo; TrueNorth chip; Zeroth chip (E. M. Izhikevich, BrainOS); NeuroDSP chip; Cognimem chip
- Bio-inspired software: S. J. Thorpe

Deep Convolutional Networks
- Convolutional Neural Network (CNN) or similar topology
- Source: Rodrigo Benenson's GitHub page, http://rodrigob.github.io/are_we_there_yet/build/

Convolutional Layer
- An n × n kernel (matrix K) is slid over the input map (matrix I) to produce an output feature map (matrix O):
  O(i,j) = tanh( Σ_{k=0}^{n−1} Σ_{l=0}^{n−1} I(i+k, j+l) · K(k,l) )
- Each kernel generates a different output feature map
- Kernels are learned with gradient-descent algorithms (classical back-propagation is very efficient!)
[Figure: an example input map, a 3 × 3 kernel of −1/0/1 weights, and the resulting output feature map]
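The convolution above is easy to sketch in NumPy (a minimal illustration of the formula, not the actual Xnet implementation):

```python
import numpy as np

def conv_layer(I, K):
    """Valid 2D convolution followed by tanh:
    O[i, j] = tanh(sum_{k,l} I[i+k, j+l] * K[k, l]) for an n x n kernel K."""
    n = K.shape[0]
    H, W = I.shape
    O = np.empty((H - n + 1, W - n + 1))
    for i in range(O.shape[0]):
        for j in range(O.shape[1]):
            O[i, j] = np.tanh(np.sum(I[i:i + n, j:j + n] * K))
    return O

O = conv_layer(np.ones((4, 4)), np.ones((2, 2)))  # every window sums to 4 -> tanh(4)
```

Each kernel K produces one output feature map; a convolutional layer simply holds several such kernels.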
CNNs Organization
- Deep = number of layers >> 1

State-of-the-art CNN Example
- The German Traffic Sign Recognition Benchmark (GTSRB): 43 traffic sign types, > 50,000 images
- Neurons: 287,843
- Synapses: 1,388,800 (total memory: 1.5 MB with 8-bit synapses)
- Connections: 124,121,800
- Near-human recognition rate (> 98%) [4]

The Memory Bottleneck
- For each output value: read the input and kernel values from memory, multiply and accumulate (MULT, ADD, REG), apply the non-linearity (tanh function), and store the result back in memory
- Total cost, driven by the system clock: (n × n cycles per kernel + the non-linearity computation) × output matrix size × number of kernels per output feature map × number of output feature maps per layer × number of layers

The Memory Bottleneck: Solutions?
- Increase data-level parallelism: SIMD instructions bring a ×2 to ×32 acceleration, but are limited by the width of the memory bus
- Increase the number of processing cores: ×(number of cores) acceleration, assuming distributed memory
- A high-end GPU brings a ×100 acceleration over a CPU (at ~250 W power consumption...)
- Back to our example: ~125 million MAC operations (for 48 × 48 pixel inputs), with a 128-bit memory bus (SIMD ×16) and 16 processing cores with distributed memory, take 500K cycles @ 200 MHz = 2.5 ms per input
- With ROI extraction at 30 frames/s, that leaves time to process only 12 ROIs per frame...
- Highly specialized architectures are required to envision embeddable systems

Is a Paradigm Shift Possible?
- Fully distributed, fully parallel?
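As a quick sanity check on the memory-bottleneck example above, the cycle budget can be redone numerically (the MAC count and clock rate are the slide's figures; the assumption of perfect speedup across SIMD lanes and cores is mine):

```python
def time_per_input(mac_ops, simd_width, n_cores, clock_hz):
    """Ideal processing time per input, assuming the MAC operations are split
    evenly across all SIMD lanes and all cores (memory stalls not counted)."""
    cycles = mac_ops / (simd_width * n_cores)
    return cycles, cycles / clock_hz

# ~125 million MACs per 48x48 input, SIMD x16, 16 cores, 200 MHz clock
cycles, seconds = time_per_input(125e6, 16, 16, 200e6)
# about 500K cycles and ~2.5 ms per input, matching the slide
```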
- Compute in memory! MAC computation in memory: the read current is the signal to accumulate, I = G × U, where the NVM conductance G plays the role of the synaptic weight and the read-pulse voltage U carries the input signal
- Input signal coding: voltage level (digital-to-analog converter), pulse duration (pulse-width modulation), or spike-based coding!
- Non-linearity computation: analog computation, or a look-up table

Spike-based Neural Networks
- Input signal: rate-based coding, from 1 pulse to N pulses per input time slot
  - N sets the precision of the input signal discretization
  - Tunability: energy consumption ∝ N, applicative performance ∝ N
- Non-linearity: a refractory period T_refrac approximates tanh() with a piece-wise linear function [7]
  - Easy to implement, with no applicative performance penalty!
- Direct interface to bio-inspired sensors [8]

Spike-based Coding & Propagation
- 29 × 29 pixels = 841 addresses
- Rate-based input coding: pixel brightness is mapped to a spiking frequency between fMIN and fMAX
- Spikes propagate through layers 1 to 4; over time, the correct output emerges
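Rate-based input coding as described above can be sketched as follows; the fMIN/fMAX values, the time-slot length, and the Poisson spike generation are illustrative assumptions, not values from the talk:

```python
import numpy as np

def rate_code(pixels, f_min=10.0, f_max=100.0, t_slot=0.1, seed=0):
    """Map pixel brightness in [0, 1] to a firing rate between f_min and f_max (Hz),
    then draw the number of spikes emitted during one time slot of t_slot seconds."""
    rng = np.random.default_rng(seed)
    rates = f_min + np.asarray(pixels) * (f_max - f_min)
    return rng.poisson(rates * t_slot)

# Brighter pixels emit more spikes per time slot on average
dim = rate_code(np.full(1000, 0.0))
bright = rate_code(np.full(1000, 1.0))
```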
Our Simulation Tools: Xnet
Example on the MNIST database (60,000 images). Deep network description file, network.ini:

; Environment
[env]
SizeX=29
SizeY=29
ConfigSection=env.config

[env.config]
ImageScale=0

; First layer (convolutional)
[conv1]
Input=env
Type=Conv
KernelWidth=5
KernelHeight=5
NbChannels=6
Stride=2
ConfigSection=common.config

; Second layer (convolutional)
[conv2]
Input=conv1
Type=Conv
KernelWidth=5
KernelHeight=5
NbChannels=12
Stride=2
ConfigSection=common.config

; Third layer (fully connected)
[fc1]
Input=conv2
Type=Fc
NbOutputs=100
ConfigSection=common.config

; Output layer (fully connected)
[fc2]
Input=fc1
Type=Fc
NbOutputs=10
ConfigSection=common.config

; Common config for the static model
[common.config]
NoBias=1
WeightsLearningRate=0.0005
Threshold=1.0
NoClamping=1

Run with:
xnet_convnet network.ini mnist -learn 6000000 -log 10000

CONFIDENTIAL

Back-propagation Offline Learning
- Simulated network topology for MNIST (auto-generated)
- Learning recognition rate: 99.7%; test recognition rate: 98.7%
- Learned kernels for the conv1 layer

Spike-based Read-only Network
- Spiking propagation of one pattern
- Spike-based test performance: recognition rate of 98.7%, a 0% performance drop vs. the static network!
- Spike-based network statistics:

Layer | Synapses (shared) | Connections | Events/frame | Events/connection
conv1 | 150               | 25,350      | 36,666       | 1.45
conv2 | 1,800             | 45,000      | 173,278      | 3.85
fc1   | 30,000            | 30,000      | 226,859      | 7.56
fc2   | 1,000             | 1,000       | 8,037        | 8.04
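The last column of the statistics table is simply events per frame divided by the number of connections; a quick check of the figures:

```python
# (connections, events per frame) for each layer, from the table above
stats = {
    "conv1": (25_350, 36_666),
    "conv2": (45_000, 173_278),
    "fc1": (30_000, 226_859),
    "fc2": (1_000, 8_037),
}
events_per_connection = {
    layer: round(events / conns, 2) for layer, (conns, events) in stats.items()
}
# {'conv1': 1.45, 'conv2': 3.85, 'fc1': 7.56, 'fc2': 8.04}
```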
Spike-based Networks with NVMs
- PCM: programmed by crystallization/amorphization
  - "2-PCM synapse": spiking pre-synaptic neurons (the inputs) apply a read voltage V_RD; the equivalent synaptic current is I = I_LTP − I_LTD, integrated by the spiking post-synaptic neuron (the output)
  - Demonstrated on unsupervised extraction of car trajectories [9]
- CBRAM: programmed by forming/dissolution of a conductive filament
  - Used as binary synapses with stochastic learning, demonstrated on unsupervised MNIST handwritten-digit classification [10]

Implementation with NVM Devices
- Spike-based computing principle: an input spike from the input neurons is propagated through a convolution kernel to the output neurons
- Synaptic weighting of the spike is performed by multi-level or binary RRAM device(s)
- Other convolution kernel(s) follow the same scheme, with a CMOS dynamic interconnect routing the spikes
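The in-memory MAC of the previous slides (I = G × U, with the currents summed on a shared output line) can be sketched numerically; the conductance values here are arbitrary illustrations:

```python
import numpy as np

def crossbar_mac(G, U):
    """Analog MAC in a resistive crossbar: each device contributes a current
    I = G * U (Ohm's law) and the currents of a column sum on its output line
    (Kirchhoff's law). G: conductance matrix (n_inputs x n_outputs),
    U: read-pulse voltages (n_inputs,). Returns one accumulated current per output."""
    return np.asarray(U) @ np.asarray(G)

G = np.array([[1.0, 0.5],
              [2.0, 0.0]])          # synaptic weights stored as conductances
I = crossbar_mac(G, [1.0, 1.0])     # one read pulse on each input row
# I == [3.0, 0.5]
```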
Benchmarking Application in Xnet (1)
Simplified "HMAX"-like network: 8,560 weights to learn, 925,320 shared weights. Network description:

[env]
SizeX=48
SizeY=48
ConfigSection=env.config

[env.config]
ImageScale=1

[conv1_7x7]
Input=env
Type=Conv
KernelWidth=7
KernelHeight=7
NbChannels=4
Stride=1
Kernel=Gabor
Kernel.Sigma=2.8
Kernel.Lambda=3.5
Kernel.Psi=0.0
Kernel.Gamma=0.3
Kernel[0][0].Theta=0.0
Kernel[0][1].Theta=45.0
Kernel[0][2].Theta=90.0
Kernel[0][3].Theta=135.0
ConfigSection=common_fixed.config

[conv1_9x9]
Input=env
Type=Conv
KernelWidth=9
KernelHeight=9
NbChannels=4
Stride=1
Padding=1
Kernel=Gabor
Kernel.Sigma=3.6
Kernel.Lambda=4.6
Kernel.Psi=0.0
Kernel.Gamma=0.3
Kernel[0][0].Theta=0.0
Kernel[0][1].Theta=45.0
Kernel[0][2].Theta=90.0
Kernel[0][3].Theta=135.0
ConfigSection=common_fixed.config

[pool1]
Input=conv1_7x7,conv1_9x9
Type=Pool
PoolWidth=8
PoolHeight=8
NbChannels=8
Stride=4
Pooling=Max
Mapping.Size=1
Mapping.NbIterations=4

[fc1]
Input=pool1
Type=Fc
NbOutputs=20
ConfigSection=common.config

[fc2]
Input=fc1
Type=Fc
NbOutputs=2
ConfigSection=common.config

[common_fixed.config]
NoBias=1
WeightsLearningRate=0.0
BiasLearningRate=0.0
NoClamping=1

[common.config]
NoBias=1
NoClamping=1

pool1 mapping:
1 0 0 0  # conv1_7x7
0 1 0 0  # conv1_7x7
0 0 1 0  # conv1_7x7
0 0 0 1  # conv1_7x7
1 0 0 0  # conv1_9x9
0 1 0 0  # conv1_9x9
0 0 1 0  # conv1_9x9
0 0 0 1  # conv1_9x9

Benchmarking Application in Xnet (2)
Caltech-101 subset, 2 categories:
- Faces_easy (435 images): 200 learning / 200 testing
- BACKGROUND_Google (468 images): 200 learning / 200 testing
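The fixed first-layer kernels of this benchmark network are Gabor filters, parameterized in the configuration above by Sigma, Lambda, Psi, Gamma and Theta. A standard Gabor formulation can be sketched as follows (Xnet's exact normalization is not given on the slide, so this only approximates what it builds):

```python
import numpy as np

def gabor_kernel(size, sigma, lam, psi, gamma, theta_deg):
    """Standard Gabor filter: an oriented Gaussian envelope times a cosine carrier."""
    theta = np.deg2rad(theta_deg)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # coordinates rotated by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / lam + psi)
    return envelope * carrier

# One of the four 7x7 conv1_7x7 kernels (theta = 45 degrees)
k = gabor_kernel(7, sigma=2.8, lam=3.5, psi=0.0, gamma=0.3, theta_deg=45.0)
```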
Benchmarking Application in Xnet (3)
- 20 output neurons; fast learning (20,000 steps)
- Test score: 98.25%
- Weights discretization:

Precision (number of levels) | Ideal | 256 | 128 | 64    | 32    | 16   | 8     | 4
Score (%)                    | 98.25 | 99  | 98  | 97.75 | 97.75 | 98.5 | 89.75 | 55.5

- tanh() approximated with a simple saturation: identical performance

Towards Hardware Synthesis
1) Deep network builder: network.ini (below), run with: xnet network.ini database -learn
2) Defects learning
3) Performance analysis: learning and test recognition rates (test recognition rate: 95%); estimated defects visualization
4) C export and RTL synthesis

; Environment
[env]
SizeX=8
SizeY=8
ConfigSection=env.config

[env.config]
ImageScale=0

; First layer (convolutional)
[conv1]
Input=env
Type=Conv
KernelWidth=3
KernelHeight=3
NbChannels=32
Stride=1

; Second layer (pooling)
[pool1]
Input=conv1
Type=Pool
PoolWidth=2
PoolHeight=2
NbChannels=32
Stride=2

; Third layer (fully connected)
[fc1]
Input=pool1
Type=Fc
NbOutputs=100

; Output layer (fully connected)
[fc2]
Input=fc1
Type=Fc
NbOutputs=10

Towards Fully Convolutional CNNs
- State-of-the-art in image segmentation
- Take an arbitrary input size
- Trained end-to-end, pixels-to-pixels
- Eliminate the redundant calculations inherent to "patch"-based segmentation
- Spike-coding compatible!
[11] J. Long, E. Shelhamer, T. Darrell, "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015

Long-Term Perspectives
- Towards even more bio-inspired systems!
- Unsupervised online learning (Spike-Timing-Dependent Plasticity)
- Learning directly from bio-inspired sensors (artificial retina, cochlea, ...)
[Figure: a kernel of 15 × 15 synapses learned from the input activity (128 × 128); output feature map activity with factor-2 subsampling (57 × 57)]

Conclusion
Deep Neural Networks are...
- ... at the leading edge of today's recognition systems
- ... deployed in large-scale commercial products (Facebook, Google, ...)
- ... hard to integrate into embedded products, even with ASICs
Spiking NVM-based deep networks are promising:
- Computing capabilities identical to those of conventional networks
- Provide the high memory density required
- True computing in memory, eliminating the memory bottleneck
- Simple and efficient performance tunability
- Direct interface to bio-inspired sensors (retina, cochlea, ...)
- Large potential for advanced bio-inspired learning systems

Thank you! Questions?
Implementing Deep Neural Networks with Non-Volatile Memories
Olivier Bichler, [email protected]
NeuroSTIC 2015, July 1st, 2015
CEA Centre de Grenoble, 17 rue des Martyrs, 38054 Grenoble Cedex
CEA Centre de Saclay, Nano-Innov PC 172, 91191 Gif sur Yvette Cedex

References
[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", NIPS 2012
[2] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, "DeepFace: Closing the Gap to Human-Level Performance in Face Verification", CVPR 2014
[3] D. Ciresan, U. Meier, J. Schmidhuber, "Multi-column Deep Neural Networks for Image Classification", CVPR 2012
[4] D. Ciresan, U. Meier, J. Masci, J. Schmidhuber, "Multi-column deep neural network for traffic sign classification", Neural Networks (32), pp. 333-338, 2012
[5] M. Lin, Q. Chen, S. Yan, "Network In Network", ICLR 2014
[6] M. D. Zeiler, R. Fergus, "Visualizing and Understanding Convolutional Networks", arXiv:1311.2901
[7] J. A. Pérez-Carrasco et al., "Mapping from Frame-Driven to Frame-Free Event-Driven Vision Systems by Low-Rate Rate-Coding and Coincidence Processing. Application to Feed-Forward ConvNets", IEEE Trans. on Pattern Analysis and Machine Intelligence, 2014
[8] L. Camuñas-Mesa et al., "An Event-Driven Multi-Kernel Convolution Processor Module for Event-Driven Vision Sensors", IEEE J. of Solid-State Circuits, 2012
[9] O. Bichler, M. Suri, D. Querlioz, D. Vuillaume, B. DeSalvo, C. Gamrat, "Visual pattern extraction using energy-efficient '2-PCM synapse' neuromorphic architecture", IEEE Transactions on Electron Devices, 2012
[10] M. Suri, O. Bichler, D. Querlioz, G. Palma, E. Vianello, D. Vuillaume, C. Gamrat, B. DeSalvo, "CBRAM devices as binary synapses for low-power stochastic neuromorphic systems: Auditory (cochlea) and visual (retina) cognitive processing applications", IEDM, 2012

Unsupervised Features Extraction
- Learning rule: the relative conductance change ΔW (%) as a function of ΔT = t_post − t_pre (ms) reproduces the experimental data of [Bi & Poo], with LTP inside the T_LTP window and LTD outside it (figure: measured LTP/LTD data and simulation curves)
- Network topology: a CMOS retina of 16,384 spiking pixels (128 × 128) feeds the input stimuli to a 1st and a 2nd neuron layer, each with lateral inhibition
- Neuron model, leaky integrate-and-fire: u ← u · exp(−(t_spike − t_last_spike)/τ_leak) + w
- Synaptic model: measured conductance (nS) as a function of pulse number (figure: two conductance-vs-pulse-number plots)
- O. Bichler et al., "Extraction of temporally correlated features from dynamic vision sensors with spike-timing-dependent plasticity", Neural Networks, 2012
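The leaky integrate-and-fire update on the last slide translates almost directly into code; the threshold value and the reset-to-zero after firing are assumptions here, since the slide only gives the leak equation:

```python
import math

class LIFNeuron:
    """Leaky integrate-and-fire: on an incoming spike of weight w at time t,
    u <- u * exp(-(t - t_last_spike) / tau_leak) + w, firing when u reaches threshold."""

    def __init__(self, tau_leak=0.1, threshold=1.0):
        self.tau_leak = tau_leak
        self.threshold = threshold
        self.u = 0.0
        self.t_last = 0.0

    def integrate(self, t, w):
        # Decay the membrane potential since the last input spike, then integrate w
        self.u = self.u * math.exp(-(t - self.t_last) / self.tau_leak) + w
        self.t_last = t
        if self.u >= self.threshold:
            self.u = 0.0  # reset after firing (assumed behavior)
            return True
        return False

n = LIFNeuron(tau_leak=0.1, threshold=1.0)
fired = [n.integrate(0.000, 0.6), n.integrate(0.001, 0.6)]
# fired == [False, True]: two closely spaced sub-threshold spikes add up and fire
```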