
Transcription

Université des Actuaires
Machine Learning for Massive Data:
randomized, online and distributed algorithms
Stéphan Clémençon
Institut Mines Télécom - Télécom ParisTech
July 7, 2014
Agenda
- Big Data technologies: context and opportunities
- Machine Learning: a brief overview
- The challenges of applying ML in industry and services: towards generalization?
"Big Data" - The context
An accumulation of massive data in many domains:
- Biology/medicine (genomics, metabolomics, clinical trials, imaging, etc.)
- Retail, marketing (CRM), e-commerce
- Internet search engines (multimedia content)
- Social networks (Facebook, Twitter, ...)
- Banking/finance (market/liquidity risk, access to credit)
- Security (e.g. biometrics, video surveillance)
- Government agencies (public health, customs)
- Operational risks
"Big Data" - The context
A data deluge that renders the following ineffective:
- basic tools for
  - data storage
  - database management (MySQL)
- preprocessing based on human expertise
  - indexing, semantic analysis
  - modelling
  - decision intelligence
"Big Data" - The context
A multitude of technological building blocks and services available for:
- massive parallelization (Velocity)
- distributed computing (Volume)
- schema-free data management (Variety)
among which:
- The MapReduce programming model: parallelized/distributed computations (see the sketch below)
- The Hadoop framework
- NoSQL: the Cassandra and MongoDB DBMSs, graph-oriented databases, the Elasticsearch search engine, etc.
- Clouds: Infrastructure, Platform and Software as a Service, promoted by Google, Amazon, Facebook, etc.
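The MapReduce model mentioned above can be illustrated outside Hadoop with a tiny, self-contained sketch of the map / shuffle / reduce pattern on a word count; all function names here are illustrative, not part of any framework API.

```python
from collections import defaultdict

# Minimal illustration of the MapReduce pattern (word count), independent of Hadoop.

def map_phase(document):
    # Emit (key, value) pairs for each word of one input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Aggregate all values associated with one key.
    return key, sum(values)

if __name__ == "__main__":
    splits = ["big data big models", "data deluge"]
    intermediate = [pair for doc in splits for pair in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(counts)  # {'big': 2, 'data': 2, 'models': 1, 'deluge': 1}
```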
"Big Data" - The opportunities
Spectacular advances in
- (distributed) data collection and storage
- automatic search for objects and content
- sharing of loosely structured data
Big Data: a driver for technology, science and the economy
- Search engines, recommender engines
- Predictive maintenance
- Viral marketing through social networks
- Fraud detection
- Personalized medicine
- Online advertising (retargeting)
"Big Data" - The opportunities
Ubiquity
Many sectors of activity are concerned:
- (e-)Commerce
- CRM
- Healthcare
- Defense, intelligence (e.g. cybersecurity, biometrics)
- Banking/Finance
- "Smart" transportation
- etc.
Big Data - Research
In order to exploit the data (prediction, interpretation), develop mathematical technologies that solve the related computational problems:
- near-real-time constraints
  → sequential ("online") machine learning ≠ batch, reinforcement learning
- the distributed nature of data/resources
  → distributed machine learning
- data volume
  → impact of sampling techniques on the performance of algorithms
"Big Data" - Research
Visualization and representation techniques for complex data
- (Evolving) graphs - clustering, graph mining
- Image, audio, video - filtering, compression
- Text data (e.g. web pages, tweets)
Fields
- Probability, Statistics
- Machine Learning
- Optimization
- Signal and image processing
- Computational Harmonic Analysis
- Semantic analysis
- etc.
Goals of Statistical Learning
- Statistical issues cast as M-estimation problems:
  - Classification
  - Regression
  - Density level set estimation
  - Compression, sparse representation
  - ... and their variants
- Minimal assumptions on the distribution
- Build realistic M-estimators for special criteria
Questions
- Theory: optimal elements, consistency, non-asymptotic excess risk bounds, fast rates of convergence, oracle inequalities
- Practice: numerical optimization, convexification, randomization, relaxation, constraints (distributed architectures, real-time, memory, etc.)
Main Example: Classification (Pattern Recognition)
- $(X, Y)$ random pair with unknown distribution $P$
- $X \in \mathcal{X}$ observation vector
- $Y \in \{-1, +1\}$ binary label/class
- A posteriori probability $\sim$ regression function:
  $\forall x \in \mathcal{X}, \quad \eta(x) = \mathbb{P}\{Y = +1 \mid X = x\}$
- $g : \mathcal{X} \to \{-1, +1\}$ classifier
- Performance measure = classification error:
  $L(g) = \mathbb{P}\{g(X) \neq Y\} \to \min_g$
- Solution: the Bayes rule,
  $\forall x \in \mathcal{X}, \quad g^*(x) = 2\,\mathbb{1}\{\eta(x) > 1/2\} - 1$
- Bayes error $L^* = L(g^*)$
Main Paradigm - Empirical Risk Minimization
- Sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of i.i.d. copies of $(X, Y)$, class $\mathcal{G}$ of classifiers
- Empirical Risk Minimization principle (see the sketch below):
  $\hat{g}_n = \underset{g \in \mathcal{G}}{\mathrm{Argmin}}\; L_n(g) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{g(X_i) \neq Y_i\}$
- Best classifier in the class:
  $\bar{g} = \underset{g \in \mathcal{G}}{\mathrm{Argmin}}\; L(g)$
- Concentration inequality: with probability $1 - \delta$,
  $\sup_{g \in \mathcal{G}} \left| L_n(g) - L(g) \right| \leq C \sqrt{\frac{V}{n}} + \sqrt{\frac{2 \log(1/\delta)}{n}}$
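A minimal sketch of the ERM principle, assuming synthetic one-dimensional data and a small finite class of threshold classifiers; the data-generating process and the class are illustrative choices, not taken from the slides.

```python
import numpy as np

# Minimal ERM sketch: within a finite class of 1-D threshold classifiers
# g_t(x) = sign(x - t), pick the one minimizing the empirical 0-1 risk L_n.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=n)
Y = np.where(X + 0.3 * rng.normal(size=n) > 0.2, 1, -1)   # noisy binary labels

thresholds = np.linspace(-2, 2, 81)            # the class G of candidate classifiers

def empirical_risk(t):
    predictions = np.where(X > t, 1, -1)
    return np.mean(predictions != Y)            # L_n(g_t)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[risks.argmin()]              # empirical risk minimizer
print(f"selected threshold: {t_hat:.2f}, empirical risk: {risks.min():.3f}")
```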
Machine-Learning - Achievements
- Numerous applications:
  - Supervised anomaly detection
  - Handwritten digit recognition
  - Face recognition
  - Medical diagnosis
  - Credit-risk screening
  - CRM
  - Speech recognition
  - Monitoring of complex systems
  - etc.
- Many "off-the-shelf" methods:
  - Neural Networks
  - Support Vector Machines
  - Boosting
  - Vector quantization
  - etc.
Machine-Learning - Challenges
- Ongoing intense research activity, motivated by:
  - The need for increasing performance
  - The evolution of computing environments (data centers, clouds, HDFS)
  - New applications/problems: recommender systems, search engines, medical imaging, yield management, etc.
  - The Big Data era
- Mathematical/computational challenges:
  - Volume (data deluge): ubiquity of sensors, high dimension, distributed storage/processing systems
  - Variety (of data structures): text, graphs, images, signals
  - Velocity (real-time): on-line prediction, evolving environments, reinforcement learning strategies (exploration vs. exploitation)
Statistical Learning - Milestones
- The 30's - Fisher's (parametric/Gaussian) statistics
  - Linear Discriminant Analysis
  - Linear (logistic) regression
  - PCA, ...
Statistical Learning - Milestones
- The 60's and 70's - F. Rosenblatt's perceptron & VC theory
  - First "machine-learning" algorithm (linear binary classification)
  - Inspired by cognitive sciences
  - Convexification, one-pass/on-line (stochastic gradient descent)
  - Relaxation, large-margin linear classifiers, structural ERM
Statistical Learning - Milestones
- The 80's - Neural Networks & Decision Trees
  - Artificial Intelligence: "A Theory of the Learnable", Valiant '84
  - The backpropagation algorithm
  - The CART algorithm ('84)
Statistical Learning - Milestones
- From the 90's - Kernels & Boosting
  - Kernel trick: SVM, nonlinear PCA, ...
  - AdaBoost ('95)
  - Lasso, compressed sensing
  - A comprehensive theory beyond VC concepts
  - Rebirth of Q-learning
Applications
Supervised Learning - Pattern Recognition/Regression
- Data with labels, e.g. $(X_i, Y_i) \in \mathbb{R}^d \times \{-1, +1\}$, $i = 1, \ldots, n$. Learn to predict $Y$ based on $X$
- Example: in quality control, $X$ = features of the product and/or production factors, $Y = +1$ if "defect" and $Y = -1$ otherwise. Build a decision rule $C$ minimizing $L(C) = \mathbb{P}\{Y \neq C(X)\}$
Applications
Supervised Learning - Scoring
- Data with labels, e.g. $(X_i, Y_i) \in \mathbb{R}^d \times \{-1, +1\}$, $i = 1, \ldots, n$. Learn to rank all possible observations $X$ in the same order as that induced by $\mathbb{P}\{Y = +1 \mid X\}$, through a scoring function $s(X)$
Applications
Supervised Learning - Image Recognition
- Objects are assigned to data (pixels), e.g. biometrics
- Goal: learn to assign objects to new data
Empirical Risk Minimization and Stochastic Approximation
- Most learning problems consist of minimizing a functional
  $L(f) = \mathbb{E}[\ell(Z, f)]$,
  where $Z$ is the observation and $f$ a candidate decision rule
- In general, a stochastic approximation method must be implemented (see the sketch below):
  $f_{t+1} = f_t - \rho_t \widehat{\nabla}_f L(f_t)$,
  where $\widehat{\nabla}_f L$ is a statistical estimate of the gradient of $L$ based on the training data $Z_1, \ldots, Z_n$
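A minimal sketch of the stochastic approximation update above, assuming a logistic-regression loss and synthetic data; the step-size schedule and the data-generating model are illustrative.

```python
import numpy as np

# Stochastic approximation sketch: one-pass SGD for logistic regression,
# w_{t+1} = w_t - rho_t * gradient estimated from a single observation Z_t = (X_t, Y_t).
rng = np.random.default_rng(0)
n, d = 5000, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
Y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(d)
for t in range(n):
    rho_t = 1.0 / np.sqrt(t + 1)                 # decreasing step size
    p = 1.0 / (1.0 + np.exp(-X[t] @ w))          # predicted probability for observation t
    grad_t = (p - Y[t]) * X[t]                   # unbiased estimate of the gradient
    w = w - rho_t * grad_t                       # stochastic approximation update
print("estimated weights:", np.round(w, 2))
```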
Empirical Risk Minimization and Stochastic Approximation
- Popular algorithms are based on these principles
- Examples: Logit, Neural Networks, linear SVM, etc.
- Computational advantages, but too rigid (underfitting)
Kernel Trick
- Apply a simple algorithm, but... in a transformed space (see the sketch below):
  $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$
- Examples: nonlinear SVM, kernel PCA, SVR
- Kernels for images, text data, biological sequences, etc.
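A sketch of the kernel trick with scikit-learn (assumed available): the SVM is fed only the Gaussian Gram matrix $K(x, x')$, never an explicit feature map; the dataset and kernel bandwidth are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Kernel trick sketch: the learner only sees the Gram matrix K(x, x') = <phi(x), phi(x')>
# (here a Gaussian kernel); the feature map phi is never computed explicitly.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

gamma = 2.0
K_train = rbf_kernel(X_train, X_train, gamma=gamma)      # Gram matrix on training points
clf = SVC(kernel="precomputed").fit(K_train, y_train)    # linear algorithm in feature space

K_test = rbf_kernel(X_test, X_train, gamma=gamma)        # kernel between test and train points
print("test accuracy:", clf.score(K_test, y_test))
```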
Greedy Algorithms
- Recursive methods exhaustively exploring a structured space at each step
- Examples: CART, projection pursuit, matching pursuit, etc.
- Highly interpretable/visualizable, but poor performance
No Free Lunch
Ensemble learning
- Heuristic: combine the predictions output by weak decision rules,
  Amit & Geman ('97) for image recognition
- Example, committee-based binary classification: predict $Y \in \{-1, +1\}$ based on $X$,
  $C_{\mathrm{agg}}(X) = \mathrm{sgn}\left( \sum_{m=1}^{M} \omega_m C_m(X) \right)$,
  where $\omega_m$ controls the impact of the vote of weak rule $C_m$
- The Bootstrap Aggregating (bagging) method - Breiman ('96):
  the $C_m$'s are learnt from bootstrap versions of the training data and $\omega_m \equiv 1$
  $\Rightarrow$ Bagging reduces the instability of prediction rules (see the sketch below)
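A minimal bagging sketch, assuming decision stumps as weak rules and scikit-learn for the base learner; the number of rules M, the dataset and the stump depth are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Bagging sketch: learn M weak rules C_m on bootstrap samples and aggregate
# their votes with uniform weights (omega_m = 1), C_agg(x) = sgn(sum_m C_m(x)).
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y - 1                                      # labels in {-1, +1}

M, n = 25, len(X)
weak_rules = []
for m in range(M):
    idx = rng.integers(0, n, size=n)               # bootstrap sample (drawn with replacement)
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    weak_rules.append(stump)

votes = np.sum([c.predict(X) for c in weak_rules], axis=0)
y_agg = np.sign(votes)                             # committee-based prediction
print("training accuracy of the aggregate:", np.mean(y_agg == y))
```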
Ensemble learning
- The Adaptive Boosting algorithm for binary classification,
  Freund & Schapire ('95) - slow learning
- AdaBoost can be interpreted as a forward additive stagewise modelling strategy to minimize a convexified version of the risk (see the sketch below):
  $\mathbb{E}\left[ \exp\left( -Y \sum_{m=1}^{M} \alpha_m C_m(X) \right) \right]$
- A serious competitor: Random Forest, Breiman ('01):
  bagging applied to randomized decision trees
- Boosting methods and Random Forests outperform older methods in most cases
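A sketch of AdaBoost's reweighting scheme with decision stumps, following the standard forward stagewise recipe associated with the exponential risk above; the dataset, M and the numerical safeguard are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# AdaBoost sketch (labels in {-1, +1}): forward stagewise minimization of the
# exponential risk, reweighting the observations after each weak learner.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y - 1
n, M = len(X), 50
w = np.full(n, 1.0 / n)                                # observation weights
stumps, alphas = [], []
for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)          # weighted training error
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))    # weight alpha_m of the weak rule
    w = w * np.exp(-alpha * y * pred)                  # upweight misclassified points
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

score = sum(a * c.predict(X) for a, c in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(score) == y))
```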
Applications
Unsupervised Learning - Anomaly/Novelty Detection
- Data with no labels, e.g. $X_i \in \mathbb{R}^d$, $i = 1, \ldots, n$
- Example: monitoring of complex systems, e.g. aircraft systems, fraud detection, predictive maintenance, cybersecurity
- Detect abnormal observations - rarity replaces labeling
- 1-class SVM (see the sketch below): $\widehat{G}_\alpha = \left\{ x \in \mathcal{X} : \sum_{i=1}^{n} \alpha_i K(x, X_i) \geq t_\mu \right\}$
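A sketch of anomaly detection with scikit-learn's OneClassSVM (Gaussian kernel); the training cloud, gamma and nu are illustrative, and the learned decision region plays the role of the set $\widehat{G}_\alpha$ above.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# One-class SVM sketch: learn a region containing the "normal" data; new points
# falling outside it (predicted -1) are flagged as anomalies.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))                      # unlabeled "normal" observations
X_new = np.vstack([rng.normal(size=(5, 2)),              # likely normal points
                   rng.normal(loc=6.0, size=(5, 2))])    # likely abnormal points

detector = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)
print(detector.predict(X_new))    # +1 = considered normal, -1 = flagged as abnormal
```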
Applications
Unsupervised Learning - Anomaly/Novelty Ranking
- Data with no labels, e.g. $X_i \in \mathbb{R}^d$, $i = 1, \ldots, n$
- Rank data by degree of novelty/abnormality
- Distributed fleet monitoring: check the 5% most abnormal observations, then the next 5%, etc.
[Figure: Mass-Volume curves of optimal, good and bad scoring functions; x-axis: MASS, from 0 (modes) to 1 (extremal behavior); y-axis: VOLUME]
Feature Selection
A quick algorithm has been proposed by Efron et al. (2002) to compute
$\hat{\beta}_n(\lambda) = \underset{\beta}{\mathrm{Argmin}} \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2 + \lambda \sum_{i=1}^{p} |\beta_i| \right]$
for all $\lambda > 0$.
As $\lambda$ decreases, one obtains more and more active (i.e. nonzero) coefficients. The regularization path
$\lambda \mapsto \hat{\beta}_n(\lambda)$
thus defines, as $\lambda \downarrow 0$, a sequence of models of increasing dimension (see the sketch below).
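A sketch of the regularization path computed with scikit-learn's lasso_path (coordinate descent; the LARS algorithm of Efron et al. is available as lars_path); the synthetic regression problem is illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

# Regularization path sketch: as the penalty level decreases, more and more
# coefficients of the Lasso solution become active (nonzero).
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
alphas, coefs, _ = lasso_path(X, y, n_alphas=10)   # coefs has shape (n_features, n_alphas)

# alphas are returned in decreasing order, so the active set grows down the list.
for alpha, beta in zip(alphas, coefs.T):
    print(f"lambda = {alpha:8.2f} -> {np.sum(beta != 0):2d} active coefficients")
```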
Variable selection
Under $\ell_1$ constraints, the points that are the $\ell_2$-furthest away from the origin lie on the axes (zero coefficients):
[Figure: the $\ell_1$ unit ball in the plane, together with the containing $\ell_2$ ball and the contained $\ell_2$ ball, over $[-1, 1]^2$]
Industrial example (Renault Technocentre)
Regularization path: the coefficients $\hat{\beta}_{\lambda,1}, \ldots, \hat{\beta}_{\lambda,p}$ versus $\lambda$.
[Figure: regularization paths of 15 variables (v1-v15); coefficient values range from about -1.2 to 0.4 as the penalty parameter varies from 0 to 3]
Lasso - L1 penalty
- Many variants: group Lasso, Lasso and elastic net
- The L1 penalty ensures sparsity
- Compressed sensing: Candès & Tao ('04), Donoho ('04)
- Numerous applications, e.g. matrix completion, recommender systems
Spectral clustering - Ng et al. ('01)
- Partition the vertices of a graph; clustering
- Graph Laplacian $L = D - W$, where $D$ is the degree matrix and $W$ is the adjacency/weight matrix
- Spectral clustering using the normalised version $\tilde{L} = D^{-1/2} L D^{-1/2}$ (see the sketch below):
  (i) Find the $k$ smallest eigenvectors $V_k = (v_1, \ldots, v_k)$ of $\tilde{L}$
  (ii) Normalise the rows of $V_k$: $V_k \leftarrow \mathrm{diag}(V_k V_k^T)^{-1/2} V_k$
  (iii) Cluster the rows of $V_k$ with the k-means algorithm
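A direct numpy/scikit-learn sketch of the three steps above, assuming a small symmetric adjacency matrix W built from two cliques; the graph itself is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Spectral clustering sketch (Ng et al. '01) on a small graph made of two
# communities; W is the (symmetric) adjacency matrix.
def spectral_clustering(W, k):
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - W                                    # unnormalized Laplacian
    L_norm = D_inv_sqrt @ L @ D_inv_sqrt                  # normalized version D^{-1/2} L D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L_norm)
    V = eigvecs[:, :k]                                    # (i) k smallest eigenvectors
    V = V / np.linalg.norm(V, axis=1, keepdims=True)      # (ii) normalize the rows
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(V)  # (iii) k-means

# Two 5-node cliques linked by a single edge.
W = np.zeros((10, 10))
W[:5, :5] = 1
W[5:, 5:] = 1
np.fill_diagonal(W, 0)
W[4, 5] = W[5, 4] = 1
print(spectral_clustering(W, k=2))   # the two communities are recovered
```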
Spectral clustering - Ng et al. ('01)
[Figure: community detection on a graph with 39 numbered vertices]
Scaling-up Machine-Learning Algorithms
- "Smart randomization/sampling"
- Massive parallelization: break a large optimization problem into smaller problems,
  e.g. Cascade SVM, parallel large-scale feature selection, parallel clustering
- Distributed optimization
- Many frameworks are available:
  MapReduce (+ in-memory = PLANET, IBM PML, Mahout), DryadLINQ, MADlib, Storm, etc.
How to apply the ERM paradigm to Big Data?
- Suppose that $n$ is too large to evaluate the empirical risk $L_n(g)$
- Common sense: run your preferred learning algorithm on a subsample of "reasonable" size $B \ll n$, e.g. drawn with replacement from the original training data set (see the sketch below)...
- ... but of course, statistical performance is downgraded:
  $1/\sqrt{n} \ll 1/\sqrt{B}$
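A minimal sketch of this naive subsampling strategy, assuming a synthetic classification task and logistic regression as the "preferred learning algorithm"; the subsample sizes B are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Naive subsampling sketch: train on B << n points drawn with replacement,
# trading statistical accuracy for computational cost.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
X_test, y_test = X[-20_000:], y[-20_000:]
X_train, y_train = X[:-20_000], y[:-20_000]

for B in (500, 5_000, len(X_train)):
    idx = rng.integers(0, len(X_train), size=B)          # subsample drawn with replacement
    clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    print(f"B = {B:6d}  test accuracy = {clf.score(X_test, y_test):.3f}")
```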
"Smart sampling"
- Use side information and implement your mini-batch SGD with a Horvitz-Thompson estimate of the local gradient
- In various situations, the performance criterion is no longer a basic sample mean statistic but a U-statistic
- Examples:
  - Clustering: the within-cluster point scatter related to a partition $\mathcal{P}$,
    $\frac{2}{n(n-1)} \sum_{i<j} D(X_i, X_j) \sum_{\mathcal{C} \in \mathcal{P}} \mathbb{1}\{(X_i, X_j) \in \mathcal{C}^2\}$
  - Graph inference (link prediction)
  - Ranking
  - ...
Example: Ranking
- Data with ordinal labels:
  $(X_1, Y_1), \ldots, (X_n, Y_n)$, i.i.d. pairs valued in $\mathcal{X} \times \{1, \ldots, K\}$ (joint law $P^{\otimes n}$)
- Goal: rank $X_1, \ldots, X_n$ through a scoring function $s : \mathcal{X} \to \mathbb{R}$ such that $s(X)$ and $Y$ tend to increase/decrease together with high probability
- Quantitative formulation: maximize the criterion
  $L(s) = \mathbb{P}\left\{ s(X^{(1)}) < \ldots < s(X^{(K)}) \mid Y^{(1)} = 1, \ldots, Y^{(K)} = K \right\}$
- Observations: $n_k$ i.i.d. copies of $X$ given $Y = k$,
  $X_1^{(k)}, \ldots, X_{n_k}^{(k)}$, with $n = n_1 + \ldots + n_K$
Example: Ranking
- A natural empirical counterpart of $L(s)$ is
  $\widehat{L}_n(s) = \frac{1}{n_1 \times \cdots \times n_K} \sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} \mathbb{1}\left\{ s(X_{i_1}^{(1)}) < \ldots < s(X_{i_K}^{(K)}) \right\}$
- But the number of terms to be summed, $n_1 \times \ldots \times n_K$, is prohibitive!
- Maximization of $\widehat{L}_n(s)$ is computationally unfeasible...
Generalized U-statistics
- $K \geq 1$ samples and degrees $(d_1, \ldots, d_K) \in \mathbb{N}^{*K}$
- $(X_1^{(k)}, \ldots, X_{n_k}^{(k)})$, $1 \leq k \leq K$: $K$ independent i.i.d. samples drawn from $F_k(dx)$ on $\mathcal{X}_k$, respectively
- Kernel $H : \mathcal{X}_1^{d_1} \times \cdots \times \mathcal{X}_K^{d_K} \to \mathbb{R}$, square integrable w.r.t. $\mu = F_1^{\otimes d_1} \otimes \cdots \otimes F_K^{\otimes d_K}$
Generalized U-statistics
Definition
The K-sample U-statistic of degrees $(d_1, \ldots, d_K)$ with kernel $H$ is
$U_n(H) = \frac{1}{\binom{n_1}{d_1} \times \cdots \times \binom{n_K}{d_K}} \sum_{I_1} \ldots \sum_{I_K} H(X_{I_1}^{(1)}; X_{I_2}^{(2)}; \ldots; X_{I_K}^{(K)})$,
where $\sum_{I_k}$ refers to summation over all $\binom{n_k}{d_k}$ subsets $X_{I_k}^{(k)} = (X_{i_1}^{(k)}, \ldots, X_{i_{d_k}}^{(k)})$ related to a set $I_k$ of $d_k$ indexes $1 \leq i_1 < \ldots < i_{d_k} \leq n_k$.
It is said symmetric when $H$ is permutation symmetric in each set of $d_k$ arguments $X_{I_k}^{(k)}$.
References: Lee (1990)
Generalized U-statistics
- Unbiased estimator of
  $\theta(H) = \mathbb{E}\left[ H(X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)}) \right]$
  with minimum variance
- Asymptotically Gaussian as $n_k / n \to \lambda_k > 0$ for $k = 1, \ldots, K$
- Its computation requires the summation of $\prod_{k=1}^{K} \binom{n_k}{d_k}$ terms
- K-partite ranking: $d_k = 1$ for $1 \leq k \leq K$,
  $H_s(x_1, \ldots, x_K) = \mathbb{1}\{ s(x_1) < s(x_2) < \cdots < s(x_K) \}$
Incomplete U-statistics
- Replace $U_n(H)$ by an incomplete version, involving far fewer terms
- Build a set $\mathcal{D}_B$ of cardinality $B$ by sampling with replacement from the set $\Lambda$ of index vectors
  $\left( (i_1^{(1)}, \ldots, i_{d_1}^{(1)}), \ldots, (i_1^{(K)}, \ldots, i_{d_K}^{(K)}) \right)$,
  with $1 \leq i_1^{(k)} < \ldots < i_{d_k}^{(k)} \leq n_k$, $1 \leq k \leq K$
- Compute the Monte-Carlo version based on $B$ terms (see the sketch below):
  $\widetilde{U}_B(H) = \frac{1}{B} \sum_{(I_1, \ldots, I_K) \in \mathcal{D}_B} H(X_{I_1}^{(1)}, \ldots, X_{I_K}^{(K)})$
- An incomplete U-statistic is NOT a U-statistic
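A sketch for the simplest case K = 2, d1 = d2 = 1 (the ranking kernel of the previous slide with the identity as scoring function): the complete two-sample U-statistic versus its incomplete version built from B index pairs drawn with replacement; the distributions and B are illustrative.

```python
import numpy as np

# Incomplete U-statistic sketch (K = 2 samples, degrees d1 = d2 = 1):
# complete two-sample U-statistic vs. a Monte-Carlo version built from B
# index pairs drawn with replacement.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, size=2000)          # sample from F_1
X2 = rng.normal(loc=0.5, size=1500)          # sample from F_2

def H(x, y):
    return (x < y).astype(float)             # kernel H_s(x, y) = 1{s(x) < s(y)}, s = identity

# Complete U-statistic: average over all n1 * n2 index pairs.
U_complete = H(X1[:, None], X2[None, :]).mean()

# Incomplete version: B pairs sampled with replacement from the index set.
B = 5000
i = rng.integers(0, len(X1), size=B)
j = rng.integers(0, len(X2), size=B)
U_incomplete = H(X1[i], X2[j]).mean()

print(f"complete: {U_complete:.4f}   incomplete (B={B}): {U_incomplete:.4f}")
```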
M-Estimation based on incomplete U-statistics
- Replace the criterion by a tractable incomplete version based on $B = O(n)$ terms:
  $\min_{H \in \mathcal{H}} \widetilde{U}_B(H)$
- This leads to investigating the maximal deviations
  $\sup_{H \in \mathcal{H}} \left| \widetilde{U}_B(H) - U_n(H) \right|$
Main Result
Theorem
Let $\mathcal{H}$ be a VC major class of bounded symmetric kernels of finite VC dimension $V < +\infty$. Set $M_{\mathcal{H}} = \sup_{(H,x) \in \mathcal{H} \times \mathcal{X}} |H(x)|$. Then,
(i) $\mathbb{P}\left\{ \sup_{H \in \mathcal{H}} \left| \widetilde{U}_B(H) - U_n(H) \right| > \eta \right\} \leq 2(1 + \#\Lambda)^V \times e^{-B \eta^2 / M_{\mathcal{H}}^2}$
(ii) for all $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:
$\frac{1}{M_{\mathcal{H}}} \sup_{H \in \mathcal{H}} \left| \widetilde{U}_B(H) - \mathbb{E}\left[ \widetilde{U}_B(H) \right] \right| \leq 2 \sqrt{\frac{2 V \log(1 + \kappa)}{\kappa}} + \sqrt{\frac{\log(2/\delta)}{\kappa}} + \sqrt{\frac{V \log(1 + \#\Lambda) + \log(4/\delta)}{B}}$,
where $\kappa = \min\{ \lfloor n_1/d_1 \rfloor, \ldots, \lfloor n_K/d_K \rfloor \}$
Consequences
- Empirical risk sampling with $B = O(n)$ yields a rate bound of the order $O(\sqrt{\log n / n})$
- One suffers no loss in terms of learning rate, while drastically reducing the computational cost
Example: Ranking
[Figure: empirical ranking performance of SVMrank trained on 1%, 5%, 10%, 20% and 100% of the "LETOR 2007" dataset]
Context
Consider a network composed of N agents:
- Agents process local data
- Agents cooperate to estimate some global parameter
A regression example
Data set formed by $n$ samples $(X_i, Y_i)$, $i = 1, \ldots, n$
- $Y_i$ = variable to be explained
- $X_i$ = explanatory features
Looking for a model. Linear regression example:
$\min_x \sum_{i=1}^{n} \| Y_i - x^T X_i \|^2$
More generally, with a loss $\ell$ and a regularizer $r$:
$\min_x \sum_{i=1}^{n} \ell(x^T X_i, Y_i) + r(x)$
Distributed processing: the problem is separable,
$\min_x \sum_{v \in V} \sum_{i=1}^{n} \ell(x^T X_{i,v}, Y_{i,v}) + r(x)$
[Boyd'11, Agarwal'11]
Formally
$\min_x \sum_{v \in V} f_v(x)$
- $G = (V, E)$ is the graph modelling the network
- $f_v$ is the cost function of agent $v$
Difficulty: $\sum_v f_v$ is nowhere observed.
Methods: from distributed gradient algorithms to advanced proximal methods.
Common principle (see the sketch below):
1. process local data
2. exchange information with neighbors
3. iterate.
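A toy sketch of this principle, assuming a decentralized gradient method for local least-squares costs on a ring of N agents with a doubly stochastic mixing matrix; the graph, step size and data are illustrative, and this is only one of the methods alluded to above.

```python
import numpy as np

# Decentralized gradient sketch: N agents on a ring each hold local least-squares
# data (A_v, Y_v); at every iteration they average their iterate with their two
# neighbors (exchange step) and then take a local gradient step (local processing).
rng = np.random.default_rng(0)
N, d, m = 4, 3, 50
x_true = rng.normal(size=d)
A = [rng.normal(size=(m, d)) for _ in range(N)]           # local explanatory features
Y = [A_v @ x_true + 0.1 * rng.normal(size=m) for A_v in A]

# Doubly stochastic mixing matrix of the ring graph (1/3 to self and to each neighbor).
W = np.zeros((N, N))
for v in range(N):
    W[v, v] = W[v, (v - 1) % N] = W[v, (v + 1) % N] = 1.0 / 3.0

x = np.zeros((N, d))                                      # one local estimate per agent
rho = 0.005
for t in range(300):
    x = W @ x                                             # exchange information with neighbors
    grads = np.stack([2 * A[v].T @ (A[v] @ x[v] - Y[v]) for v in range(N)])
    x = x - rho * grads                                   # process local data
print("disagreement between agents:", np.round(x.std(axis=0), 3))
print("agent 0 estimate vs truth:", np.round(x[0], 2), np.round(x_true, 2))
```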
Key issues
- Distribute cutting-edge optimization algorithms (e.g. primal-dual methods, fast ADMM, etc.)
- Include stochastic perturbations:
  - On-line algorithms
  - Asynchronism
- Investigate specific ML applications, e.g. ranking