Transcript
Université des Actuaires
Machine-Learning for Massive Data: Randomized, Online and Distributed Algorithms
Stéphan Clémençon, Institut Mines-Télécom / Télécom ParisTech
July 7, 2014

Agenda
- Big Data technologies: context and opportunities
- Machine-Learning: a brief overview
- The challenges of applying ML in industry and services: towards widespread adoption?

"Big Data" - The context
An accumulation of massive data in many domains:
- Biology/medicine (genomics, metabolomics, clinical trials, imaging, etc.)
- Retail, marketing (CRM), e-commerce
- Internet search engines (multimedia content)
- Social networks (Facebook, Twitter, ...)
- Banking/finance (market and liquidity risk, access to credit)
- Security (e.g., biometrics, video surveillance)
- Public administration (public health, customs)
- Operational risks

"Big Data" - The context
A data deluge that overwhelms:
- the basic tools for data storage and database management (MySQL)
- preprocessing based on human expertise: indexing, semantic analysis, modelling, decision intelligence

"Big Data" - The context
A multitude of technological building blocks and services are available for:
- massive parallelization (Velocity)
- distributed computing (Volume)
- schema-free data management (Variety)
among which:
- the MapReduce programming model: parallelized/distributed computations
- the Hadoop framework
- NoSQL: the Cassandra and MongoDB DBMSs, graph-oriented databases, the Elasticsearch search engine, etc.
- clouds: Infrastructure, Platform and Software as a Service, promoted by Google, Amazon, Facebook, etc.

"Big Data" - The opportunities
Spectacular advances in:
- (distributed) data collection and storage
- automatic search for objects and content
- sharing of weakly structured data
Big Data: a driver for technology, science and the economy
- Search engines, recommendation engines
- Predictive maintenance
- Viral marketing through social networks
- Fraud detection
- Personalized medicine
- Online advertising (retargeting)

"Big Data" - The opportunities
Ubiquity: many sectors of activity are concerned:
- (e-)commerce
- CRM
- Health
- Defense, intelligence (e.g., cybersecurity, biometrics)
- Banking/finance
- "Smart" transportation
- etc.

"Big Data" - Research
In order to exploit the data (prediction, interpretation), develop mathematical technologies that solve the associated computational problems:
- near-real-time constraints → sequential ("online") machine learning, as opposed to batch learning, and reinforcement learning
- the distributed nature of data and resources → distributed machine learning
- data volume → impact of sampling/survey techniques on the performance of algorithms

"Big Data" - Research
Techniques for visualizing and representing complex data:
- (evolving) graphs: clustering, graph mining
- image, audio, video: filtering, compression
- text data (e.g., web pages, tweets)
Domains involved:
- Probability, Statistics
- Machine-Learning
- Optimization
- Signal and image processing
- Computational harmonic analysis
- Semantic analysis
- etc.

Goals of Statistical Learning
- Statistical issues cast as M-estimation problems:
  - Classification
  - Regression
  - Density level set estimation
  - Compression, sparse representation
  - ... and their variants
- Minimal assumptions on the distribution
- Build realistic M-estimators for special criteria
Questions
- Theory: optimal elements, consistency, non-asymptotic excess risk bounds, fast rates of convergence, oracle inequalities
- Practice: numerical optimization, convexification, randomization, relaxation, constraints (distributed architectures, real-time, memory, etc.)

Main Example: Classification (Pattern Recognition)
- $(X, Y)$ random pair with unknown distribution $P$
- $X \in \mathcal{X}$ observation vector, $Y \in \{-1, +1\}$ binary label/class
- A posteriori probability (regression function): for all $x \in \mathcal{X}$, $\eta(x) = \mathbb{P}\{Y = +1 \mid X = x\}$
- $g : \mathcal{X} \to \{-1, +1\}$ classifier
- Performance measure = classification error: $L(g) = \mathbb{P}\{g(X) \neq Y\} \to \min_g$
- Solution: the Bayes rule, $g^*(x) = 2 \cdot \mathbb{1}\{\eta(x) > 1/2\} - 1$ for all $x \in \mathcal{X}$
- Bayes error: $L^* = L(g^*)$

Main Paradigm - Empirical Risk Minimization
- Sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of i.i.d. copies of $(X, Y)$, class $\mathcal{G}$ of classifiers
- Empirical Risk Minimization principle:
  $\hat{g}_n = \operatorname{Argmin}_{g \in \mathcal{G}} L_n(g) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{g(X_i) \neq Y_i\}$
- Best classifier in the class: $\bar{g} = \operatorname{Argmin}_{g \in \mathcal{G}} L(g)$
- Concentration inequality: with probability at least $1 - \delta$,
  $\sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \le C\sqrt{\frac{V}{n}} + \sqrt{\frac{2\log(1/\delta)}{n}}$
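A minimal numerical illustration of the ERM principle above, written in Python. The one-dimensional Gaussian class-conditional model and the class of threshold classifiers are assumptions made for the sake of the example; they are not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: X | Y=+1 ~ N(+1, 1), X | Y=-1 ~ N(-1, 1), P(Y=+1) = 1/2.
# The Bayes rule is then g*(x) = sign(x).
n = 5000
Y = rng.choice([-1, 1], size=n)
X = Y + rng.normal(size=n)

# Class G of threshold classifiers g_t(x) = sign(x - t): ERM over a grid of thresholds.
thresholds = np.linspace(-3, 3, 601)
emp_risk = np.array([np.mean(np.sign(X - t) != Y) for t in thresholds])
t_hat = thresholds[emp_risk.argmin()]

print("ERM threshold:", t_hat)          # close to the Bayes threshold 0
print("empirical risk:", emp_risk.min())  # close to the Bayes error, about 0.159
```

With $n = 5000$ the selected threshold and its empirical risk come close to the Bayes rule and the Bayes error, consistent with the deviation bound stated above.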
Machine-Learning - Achievements
- Numerous applications:
  - Supervised anomaly detection
  - Handwritten digit recognition
  - Face recognition
  - Medical diagnosis
  - Credit-risk screening
  - CRM
  - Speech recognition
  - Monitoring of complex systems
  - etc.
- Many "off-the-shelf" methods:
  - Neural Networks
  - Support Vector Machines
  - Boosting
  - Vector quantization
  - etc.

Machine-Learning - Challenges
- Ongoing intense research activity, motivated by:
  - the need for ever-increasing performance
  - the evolution of computing environments (data centers, clouds, HDFS)
  - new applications/problems: recommender systems, search engines, medical imagery, yield management, etc.
  - the Big Data era
- Mathematical/computational challenges:
  - Volume (data deluge): ubiquity of sensors, high dimension, distributed storage/processing systems
  - Variety (of data structures): text, graphs, images, signals
  - Velocity (real-time): online prediction, evolving environments, reinforcement learning strategies (exploration vs. exploitation)

Statistical Learning - Milestones
- The 30's - Fisher's (parametric/Gaussian) statistics:
  - Linear Discriminant Analysis
  - Linear (logistic) regression
  - PCA, ...
- The 60's and 70's - F. Rosenblatt's perceptron & VC theory:
  - first "machine-learning" algorithm (linear binary classification)
  - inspired by cognitive sciences
  - convexification, one-pass/online learning (stochastic gradient descent)
  - relaxation, large-margin linear classifiers, structural ERM
- The 80's - Neural Networks & Decision Trees:
  - Artificial Intelligence
  - "A theory of the learnable", Valiant '84
  - the backpropagation algorithm
  - the CART algorithm ('84)
- From the 90's - Kernels & Boosting:
  - the kernel trick: SVM, nonlinear PCA, ...
  - AdaBoost ('95)
  - Lasso, compressed sensing
  - a comprehensive theory beyond VC concepts
  - rebirth of Q-learning

Applications: Supervised Learning - Pattern Recognition/Regression
- Data with labels, e.g. $(X_i, Y_i) \in \mathbb{R}^d \times \{-1, +1\}$, $i = 1, \ldots, n$: learn to predict $Y$ based on $X$
- Example: in quality control, $X$ = features of the product and/or production factors, $Y = +1$ if "defect" and $Y = -1$ otherwise; build a decision rule $C$ minimizing $L(C) = \mathbb{P}\{Y \neq C(X)\}$

Applications: Supervised Learning - Scoring
- Data with labels, e.g. $(X_i, Y_i) \in \mathbb{R}^d \times \{-1, +1\}$, $i = 1, \ldots, n$: learn to rank all possible observations $X$ in the same order as that induced by $\mathbb{P}\{Y = +1 \mid X\}$, through a scoring function $s(X)$

Applications: Supervised Learning - Image Recognition
- Objects are assigned to data (pixels), e.g. biometrics
- Goal: learn to assign objects to new data

Empirical Risk Minimization and Stochastic Approximation
- Most learning problems consist of minimizing a functional $L(f) = \mathbb{E}[\psi(Z, f)]$, where $Z$ is the observation and $f$ a candidate decision rule
- In general, a stochastic approximation method must be implemented:
  $f_{t+1} = f_t - \rho_t \widehat{\nabla}_f L(f_t)$,
  where $\widehat{\nabla}_f L$ is a statistical estimate of the gradient of $L$ based on the training data $Z_1, \ldots, Z_n$
- Popular algorithms are based on these principles
- Examples: Logit, Neural Networks, linear SVM, etc.
- Computational advantages, but too rigid (underfitting)
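As an illustration of the stochastic-approximation update $f_{t+1} = f_t - \rho_t \widehat{\nabla}_f L(f_t)$ above, here is a minimal mini-batch SGD sketch for the logistic (Logit) model. The step-size schedule, batch size and synthetic data are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def sgd_logistic(X, Y, n_epochs=10, batch_size=32, rho0=1.0):
    """Minimize the empirical logistic risk by mini-batch SGD:
    w <- w - rho_t * (gradient estimated on a mini-batch)."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            t += 1
            rho_t = rho0 / np.sqrt(t)                  # decreasing step size
            margin = Y[idx] * (X[idx] @ w)
            # gradient of mean log(1 + exp(-y x.w)) over the mini-batch
            grad = -(X[idx] * (Y[idx] / (1 + np.exp(margin)))[:, None]).mean(axis=0)
            w -= rho_t * grad                          # stochastic gradient step
    return w

# Usage on synthetic data drawn from a logistic model
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = np.where(rng.random(1000) < 1 / (1 + np.exp(-X @ w_true)), 1, -1)
print(sgd_logistic(X, Y))
```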
Kernel Trick
- Apply a simple algorithm, but in a transformed feature space: $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$
- Examples: nonlinear SVM, kernel PCA, SVR
- Kernels for images, text data, biological sequences, etc.

Greedy Algorithms
- Recursive methods exhaustively exploring a structured space at each step
- Examples: CART, projection pursuit, matching pursuit, etc.
- Highly interpretable/visualizable, but poor performance

No Free Lunch

Ensemble learning
- Heuristic: combine the predictions output by weak decision rules; Amit & Geman ('97) for image recognition
- Example, committee-based binary classification: predict $Y \in \{-1, +1\}$ based on $X$ with
  $C_{\mathrm{agg}}(X) = \operatorname{sgn}\left(\sum_{m=1}^{M} \omega_m C_m(X)\right)$,
  where $\omega_m$ controls the impact of the vote of the weak rule $C_m$
- The Bootstrap Aggregating (bagging) method, Breiman ('96): the $C_m$'s are learnt from bootstrap versions of the training data and $\omega_m \equiv 1$; bagging reduces the instability of prediction rules

Ensemble learning
- The Adaptive Boosting algorithm for binary classification, Freund & Schapire ('95) - slow learning
- AdaBoost can be interpreted as a forward stagewise additive modelling strategy to minimize a convexified version of the risk,
  $\mathbb{E}\left[\exp\left(-Y \sum_{m=1}^{M} \alpha_m C_m(X)\right)\right]$
- A serious competitor: Random Forests, Breiman ('01) - bagging applied to randomized decision trees
- Boosting methods and Random Forests outperform older methods in most cases

Applications: Unsupervised Learning - Anomaly/Novelty Detection
- Data without labels, e.g. $X_i \in \mathbb{R}^d$, $i = 1, \ldots, n$
- Examples: monitoring of complex systems (e.g. aircraft systems), fraud detection, predictive maintenance, cybersecurity
- Detect abnormal observations - rarity replaces labelling
- One-class SVM: $\widehat{G}_\alpha = \{x \in \mathcal{X} : \sum_{i=1}^{n} \alpha_i K(x, X_i) \ge t_\mu\}$

Applications: Unsupervised Learning - Anomaly/Novelty Ranking
- Data without labels, e.g. $X_i \in \mathbb{R}^d$, $i = 1, \ldots, n$
- Rank data by degree of novelty/abnormality
- Distributed fleet monitoring: check the 5% most abnormal items, then the next 5%, etc.
[Figure: mass-volume plot (VOLUME vs. MASS, from modes to extremal behavior) comparing optimal, good and bad scoring rules.]

Feature Selection
A fast algorithm was proposed by Efron et al. (2002) to compute
$\hat{\beta}_n(\lambda) = \operatorname{Argmin}_{\beta} \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right]$
for all $\lambda > 0$. As $\lambda$ decreases, one obtains more and more active (i.e. nonzero) coefficients. The regularization path $\lambda \mapsto \hat{\beta}_n(\lambda)$ thus defines, as $\lambda \downarrow 0$, a sequence of models of increasing dimension.

Variable selection
Under $\ell_1$ constraints, the points that are the $\ell_2$-furthest away from the origin lie on the axes (zero coefficients):
[Figure: the $\ell_1$ unit ball, the containing $\ell_2$ ball and the contained $\ell_2$ ball.]

Industrial example (Renault Technocentre)
[Figure: regularization path of the coefficients $\hat{\beta}_{1,2}, \ldots, \hat{\beta}_{1,p}$ versus $\lambda$, for variables v1 to v15.]

Lasso - L1 penalty
- Many variants: group Lasso, elastic net, etc.
- The L1 penalty ensures sparsity
- Compressed sensing: Candès & Tao ('04), Donoho ('04)
- Numerous applications, e.g. matrix completion, recommender systems
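The regularization path described above can be computed with the least-angle-regression algorithm of Efron et al. A minimal sketch assuming scikit-learn is available; the diabetes dataset is only a stand-in for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

# LARS computes the whole Lasso regularization path efficiently.
X, y = load_diabetes(return_X_y=True)
alphas, _, coefs = lars_path(X, y, method="lasso")

# As the penalty alpha decreases, more coefficients become active (nonzero).
for alpha, coef in zip(alphas, coefs.T):
    print(f"alpha={alpha:8.2f}  active variables={np.sum(coef != 0)}")
```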
Spectral clustering - Ng et al. ('01)
- Partition the vertices of a graph (clustering)
- Graph Laplacian $L = D - W$, where $D$ is the degree matrix and $W$ the adjacency/weight matrix
- Spectral clustering using the normalized version $\tilde{L} = D^{-1/2} L D^{-1/2}$:
  (i) find the $k$ smallest eigenvectors $V_k = (v_1, \ldots, v_k)$ of $\tilde{L}$;
  (ii) normalize the rows of $V_k$: $V_k \leftarrow \operatorname{diag}(V_k V_k^t)^{-1/2} V_k$;
  (iii) cluster the rows of $V_k$ with the k-means algorithm
[Figures: spectral clustering on a toy point cloud, and community detection on a small graph with numbered vertices.]
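Steps (i)-(iii) above translate directly into a few lines of code. A minimal sketch using numpy and scikit-learn's KMeans; the random two-block graph used as input is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k, seed=0):
    """Normalized spectral clustering from a symmetric weight matrix W."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_tilde = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt     # D^{-1/2} L D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L_tilde)
    V = eigvecs[:, :k]                                          # k smallest eigenvectors
    V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)  # row normalization
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(V)

# Usage: two noisy communities in a random graph
rng = np.random.default_rng(0)
labels_true = np.repeat([0, 1], 20)
P = np.where(labels_true[:, None] == labels_true[None, :], 0.8, 0.05)
W = (rng.random((40, 40)) < P).astype(float)
W = np.triu(W, 1); W = W + W.T                                  # symmetric, no self-loops
print(spectral_clustering(W, k=2))
```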
Scaling-up Machine-Learning Algorithms
- "Smart" randomization/sampling
- Massive parallelization: break a large optimization problem into smaller ones, e.g. Cascade SVM, parallel large-scale feature selection, parallel clustering
- Distributed optimization
- Many frameworks are available: MapReduce (+ in-memory = PLANET, IBM PML, Mahout), DryadLINQ, MADlib, Storm, etc.

How to apply the ERM paradigm to Big Data?
- Suppose that $n$ is too large to evaluate the empirical risk $L_n(g)$
- Common sense: run your preferred learning algorithm on a subsample of "reasonable" size $B \ll n$, e.g. drawn with replacement from the original training data set...
- ... but of course, statistical performance is downgraded: $1/\sqrt{n} \ll 1/\sqrt{B}$

"Smart sampling"
- Use side information and implement your mini-batch SGD with a Horvitz-Thompson estimate of the local gradient
- In various situations, the performance criterion is no longer a basic sample-mean statistic but a U-statistic
- Examples:
  - Clustering: the within-cluster point scatter related to a partition $\mathcal{P}$,
    $\frac{2}{n(n-1)} \sum_{i < j} D(X_i, X_j) \sum_{\mathcal{C} \in \mathcal{P}} \mathbb{1}\{(X_i, X_j) \in \mathcal{C}^2\}$
  - Graph inference (link prediction)
  - Ranking
  - ...

Example: Ranking
- Data with ordinal labels: $(X_1, Y_1), \ldots, (X_n, Y_n) \in (\mathcal{X} \times \{1, \ldots, K\})^{\otimes n}$
- Goal: rank $X_1, \ldots, X_n$ through a scoring function $s : \mathcal{X} \to \mathbb{R}$ such that $s(X)$ and $Y$ tend to increase/decrease together with high probability
- Quantitative formulation: maximize the criterion
  $L(s) = \mathbb{P}\{s(X^{(1)}) < \ldots < s(X^{(K)}) \mid Y^{(1)} = 1, \ldots, Y^{(K)} = K\}$
- Observations: $n_k$ i.i.d. copies of $X$ given $Y = k$, namely $X_1^{(k)}, \ldots, X_{n_k}^{(k)}$, with $n = n_1 + \ldots + n_K$

Example: Ranking
- A natural empirical counterpart of $L(s)$ is
  $\widehat{L}_n(s) = \frac{\sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} \mathbb{1}\{s(X_{i_1}^{(1)}) < \ldots < s(X_{i_K}^{(K)})\}}{n_1 \times \cdots \times n_K}$
- But the number of terms to be summed, $n_1 \times \cdots \times n_K$, is prohibitive!
- Maximization of $\widehat{L}_n(s)$ is computationally unfeasible...

Generalized U-statistics
- $K \ge 1$ samples and degrees $(d_1, \ldots, d_K) \in \mathbb{N}^{*K}$
- $(X_1^{(k)}, \ldots, X_{n_k}^{(k)})$, $1 \le k \le K$: $K$ independent i.i.d. samples drawn from $F_k(dx)$ on $\mathcal{X}_k$, respectively
- Kernel $H : \mathcal{X}_1^{d_1} \times \cdots \times \mathcal{X}_K^{d_K} \to \mathbb{R}$, square integrable w.r.t. $\mu = F_1^{\otimes d_1} \otimes \cdots \otimes F_K^{\otimes d_K}$

Generalized U-statistics
Definition. The $K$-sample U-statistic of degrees $(d_1, \ldots, d_K)$ with kernel $H$ is
$U_n(H) = \frac{\sum_{I_1} \cdots \sum_{I_K} H\big(X_{I_1}^{(1)}; X_{I_2}^{(2)}; \ldots; X_{I_K}^{(K)}\big)}{\binom{n_1}{d_1} \times \cdots \times \binom{n_K}{d_K}}$,
where $\sum_{I_k}$ refers to summation over all $\binom{n_k}{d_k}$ subsets $X_{I_k}^{(k)} = (X_{i_1}^{(k)}, \ldots, X_{i_{d_k}}^{(k)})$ related to a set $I_k$ of $d_k$ indexes $1 \le i_1 < \ldots < i_{d_k} \le n_k$. It is said to be symmetric when $H$ is permutation-symmetric in each set of $d_k$ arguments $X_{I_k}^{(k)}$.
Reference: Lee (1990)

Generalized U-statistics
- Unbiased estimator of $\theta(H) = \mathbb{E}[H(X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)})]$ with minimum variance
- Asymptotically Gaussian as $n_k / n \to \lambda_k > 0$ for $k = 1, \ldots, K$
- Its computation requires the summation of $\prod_{k=1}^{K} \binom{n_k}{d_k}$ terms
- $K$-partite ranking: $d_k = 1$ for $1 \le k \le K$ and $H_s(x_1, \ldots, x_K) = \mathbb{1}\{s(x_1) < s(x_2) < \cdots < s(x_K)\}$

Incomplete U-statistics
- Replace $U_n(H)$ by an incomplete version involving far fewer terms
- Build a set $\mathcal{D}_B$ of cardinality $B$ by sampling with replacement from the set $\Lambda$ of index vectors $\big((i_1^{(1)}, \ldots, i_{d_1}^{(1)}), \ldots, (i_1^{(K)}, \ldots, i_{d_K}^{(K)})\big)$ with $1 \le i_1^{(k)} < \ldots < i_{d_k}^{(k)} \le n_k$, $1 \le k \le K$
- Compute the Monte Carlo version based on $B$ terms:
  $\widetilde{U}_B(H) = \frac{1}{B} \sum_{(I_1, \ldots, I_K) \in \mathcal{D}_B} H\big(X_{I_1}^{(1)}, \ldots, X_{I_K}^{(K)}\big)$
- An incomplete U-statistic is NOT a U-statistic

M-Estimation based on incomplete U-statistics
- Replace the criterion by a tractable incomplete version based on $B = O(n)$ terms:
  $\min_{H \in \mathcal{H}} \widetilde{U}_B(H)$
- This leads to investigating the maximal deviations $\sup_{H \in \mathcal{H}} |\widetilde{U}_B(H) - U_n(H)|$
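A minimal sketch of the incomplete U-statistic idea for the $K$-partite ranking kernel $H_s$ above: instead of summing over all $n_1 \times \cdots \times n_K$ tuples, draw $B$ index tuples with replacement and average. The scoring function and the synthetic samples are illustrative assumptions.

```python
import numpy as np

def incomplete_ranking_ustat(samples, s, B, seed=0):
    """Monte Carlo (incomplete) version of the K-partite ranking U-statistic:
    average of 1{s(x1) < ... < s(xK)} over B index tuples drawn with replacement."""
    rng = np.random.default_rng(seed)
    total = 0
    for _ in range(B):
        # one index per sample k = 1, ..., K (degrees d_k = 1)
        scores = [s(Xk[rng.integers(len(Xk))]) for Xk in samples]
        total += all(scores[k] < scores[k + 1] for k in range(len(scores) - 1))
    return total / B

# Usage: K = 3 classes whose typical scores increase with the label
rng = np.random.default_rng(1)
samples = [rng.normal(loc=k, size=200) for k in range(3)]   # X | Y=k ~ N(k, 1)
s = lambda x: x                                              # candidate scoring function
print(incomplete_ranking_ustat(samples, s, B=5000))          # approximates the complete U_n(H_s)
```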
Main Result
Theorem. Let $\mathcal{H}$ be a VC-major class of bounded symmetric kernels of finite VC dimension $V < +\infty$ and set $M_{\mathcal{H}} = \sup_{(H,x) \in \mathcal{H} \times \mathcal{X}} |H(x)|$. Then,
(i) $\mathbb{P}\Big\{ \sup_{H \in \mathcal{H}} \big|\widetilde{U}_B(H) - U_n(H)\big| > \eta \Big\} \le 2(1 + \#\Lambda)^V \times e^{-B\eta^2 / M_{\mathcal{H}}^2}$;
(ii) for all $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:
$\frac{1}{M_{\mathcal{H}}} \sup_{H \in \mathcal{H}} \Big| \widetilde{U}_B(H) - \mathbb{E}\big[\widetilde{U}_B(H)\big] \Big| \le 2\sqrt{\frac{2V \log(1 + \kappa)}{\kappa}} + \sqrt{\frac{\log(2/\delta)}{\kappa}} + \sqrt{\frac{V \log(1 + \#\Lambda) + \log(4/\delta)}{B}}$,
where $\kappa = \min\{\lfloor n_1 / d_1 \rfloor, \ldots, \lfloor n_K / d_K \rfloor\}$.

Consequences
- Empirical risk sampling with $B = O(n)$ yields a rate bound of order $O(\sqrt{\log n / n})$
- One suffers no loss in terms of learning rate, while drastically reducing the computational cost

Example: Ranking
[Figure: empirical ranking performance of SVMrank trained on 1%, 5%, 10%, 20% and 100% of the "LETOR 2007" dataset.]

Context
Consider a network composed of $N$ agents:
- agents process local data
- agents cooperate to estimate some global parameter

A regression example
Data set formed by $n$ samples $(X_i, Y_i)$, $i = 1, \ldots, n$:
- $Y_i$ = variable to be explained
- $X_i$ = explanatory features
Looking for a model. Linear regression example:
$\min_x \sum_{i=1}^{n} \|Y_i - x^T X_i\|^2$
More generally, with a loss $\ell$ and a regularizer $r$:
$\min_x \sum_{i=1}^{n} \ell(x^T X_i, Y_i) + r(x)$
Distributed processing: the problem is separable across agents $v \in V$,
$\min_x \sum_{v \in V} \sum_{i=1}^{n} \ell(x^T X_{i,v}, Y_{i,v}) + r(x)$
[Boyd '11, Agarwal '11]

Formally
$\min_x \sum_{v \in V} f_v(x)$
- $G = (V, E)$ is the graph modelling the network
- $f_v$ is the cost function of agent $v$
Difficulty: $\sum_v f_v$ is nowhere observed.
Methods: from distributed gradient algorithms to advanced proximal methods.
Common principle:
1. process local data
2. exchange information with neighbors
3. iterate.

Key issues
- Distribute cutting-edge optimization algorithms (e.g. primal-dual methods, fast ADMM, etc.)
- Include stochastic perturbations:
  - online algorithms
  - asynchronism
- Investigate specific ML applications, e.g. ranking
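A minimal sketch of the common principle just stated (process local data, exchange information with neighbors, iterate), instantiated as a synchronous gradient-plus-averaging scheme for the separable least-squares problem. The ring topology, step size and synthetic data are illustrative assumptions; this is one simple member of the family of distributed gradient methods mentioned above, not a specific algorithm from the slides.

```python
import numpy as np

def distributed_least_squares(local_data, neighbors, n_iter=200, step=0.01):
    """Each agent v holds (X_v, Y_v) and a local estimate x_v.
    At every round it (1) takes a gradient step on its local cost f_v,
    (2) averages its estimate with its neighbors', then (3) iterates."""
    d = next(iter(local_data.values()))[0].shape[1]
    x = {v: np.zeros(d) for v in local_data}                 # one estimate per agent
    for _ in range(n_iter):
        # (1) local gradient step on f_v(x) = ||Y_v - X_v x||^2
        for v, (Xv, Yv) in local_data.items():
            grad = -2.0 * Xv.T @ (Yv - Xv @ x[v])
            x[v] = x[v] - step * grad / len(Yv)
        # (2) consensus step: average with the neighbors' estimates
        x = {v: np.mean([x[v]] + [x[w] for w in neighbors[v]], axis=0) for v in x}
    return x

# Usage: 4 agents on a ring, all observing the same linear model
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
local_data = {}
for v in range(4):
    Xv = rng.normal(size=(50, 3))
    local_data[v] = (Xv, Xv @ w_true + 0.1 * rng.normal(size=50))
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(distributed_least_squares(local_data, neighbors)[0])   # close to w_true
```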