Des arbres de décision aux forêts aléatoires, état de l'art (From decision trees to random forests: a state of the art)
Transcription
Badih Ghattas, Université d'Aix Marseille, [email protected]

Outline
◮ Classification and Regression Trees (CART)
◮ Extensions:
  ◮ Oblique trees
  ◮ Multidimensional and functional outputs
  ◮ Time series covariates
  ◮ Bayesian approach
  ◮ Decision trees for clustering
  ◮ Decision trees for density estimation
  ◮ Decision trees and other methods: logistic regression, SVM
  ◮ Decision trees, distances, consensus
◮ Aggregating classifiers:
  ◮ Stacking
  ◮ Bagging, Boosting
  ◮ Random Forests
  ◮ Cforest and other competitors

CART

In the context of supervised learning
We wish to estimate f using the dataset at hand,
D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}, \qquad (X_i, Y_i) \ \text{iid} \sim P(X, Y),
where P(X, Y) is the joint distribution of (X, Y). We must choose f within a class of functions with unknown parameters, for example
y = f(x) = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_p x_p.
The data then yield an estimate: D_n \Longrightarrow f_n(X, D_n).

The model
Search for a partition of the space X and assign a value of Y to each class of the partition.
In regression:
E[Y \mid X = x] = \sum_{j=1}^{q} c_j \, 1_{N_j}(x), \qquad \hat c_j = \frac{1}{\mathrm{Card}\{i : x_i \in N_j\}} \sum_{i : x_i \in N_j} Y_i.
In classification (Y discrete with J levels): \hat c_j = the most frequent class in N_j(x).
General framework: a linear or convex combination of nonlinear functions.

Example: predicting Ozone concentration
[Figure: a regression tree predicting Ozone from Wind and Solar.R (root split Wind < 6.6, then splits such as Solar.R < 153, Wind < 8.9, Wind < 10.6, Solar.R < 232.5, Solar.R < 79.5), shown next to the induced partition of the (Wind, Solar.R) plane with the fitted leaf means.]

Two stages: maximal tree and pruning
All the observations are in the root node. Splitting rule: one variable and a threshold. How is the split chosen? Use the deviance to measure the heterogeneity of a node:
R(t) = \sum_{x_n \in t} (y_n - \bar y(t))^2.

Optimal splits: minimize the children's deviance
Minimize the total heterogeneity of the new nodes. Let s be a split of the form x_m < a:
\Delta R(s, t) = R(t) - \big(R(t_L) + R(t_R)\big) \ge 0,
and the retained split s^* satisfies \Delta R(s^*, t) = \max_{s \in \Sigma} \Delta R(s, t).
In classification,
R(t) = - \sum_{j=1}^{J} p_j(t) \log p_j(t),
where p_j(t) is the proportion of class j in node t.
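To make the splitting criterion concrete, here is a minimal base-R sketch of the node deviance R(t) and of the gain Delta R(s, t) used to rank candidate splits. The function and object names (node_dev, delta_R, best_split) are illustrative, not part of any package, and the scan over observed values is only an approximation of what CART implementations do (they usually cut between consecutive values).

```r
# Deviance of a node t: R(t) = sum over observations in t of (y - mean(y))^2.
node_dev <- function(y) sum((y - mean(y))^2)

# Deviance reduction of the candidate split "x < a" applied to node t.
delta_R <- function(y, x, a) {
  node_dev(y) - (node_dev(y[x < a]) + node_dev(y[x >= a]))
}

# Best threshold for one variable: scan the observed values and keep the
# split maximising Delta R (which is always >= 0).
best_split <- function(y, x) {
  cand  <- sort(unique(x))[-1]      # drop the minimum so both children are non-empty
  gains <- sapply(cand, function(a) delta_R(y, x, a))
  c(threshold = cand[which.max(gains)], gain = max(gains))
}

# Illustration on the Ozone example above (rows with missing values dropped);
# the threshold found should be close to the Wind < 6.6 root split of the figure,
# up to the exact placement of the cut point.
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind")])
best_split(aq$Ozone, aq$Wind)
```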
Stopping rule
Split the root t into two children t_L and t_R, and do the same recursively.
Stop when at least one of the following conditions is satisfied:
◮ very few observations in a node (minsize);
◮ \Delta R(s, t) is lower than a fixed threshold (mindev).
The maximal tree:
◮ has a low error over the learning sample,
◮ is poor over test samples,
◮ is too big, thus unreadable.

Penalized deviance
Tree deviance:
R(T) = \frac{1}{N} \sum_{t \in \tilde T} R(t).
Penalized deviance:
R_\alpha(T) = \frac{1}{N} \sum_{t \in \tilde T} R(t) + \alpha |\tilde T|.
For the branch T_t rooted at node t,
R_\alpha(T_t) = \sum_{t' \in \tilde T_t} R(t') + \alpha |\tilde T_t|,
and, for the node alone,
R_\alpha(t) = R(t) + \alpha.

Selecting the optimal tree
Selection is based on a deviance estimate for each tree in the pruned sequence. Suppose the data set S is randomly partitioned as S = S^{train} \cup S^{test}. One may build the sequence using S^{train} and select the best tree by estimating the deviance over S^{test}:
\hat R^{test}(T) = \frac{1}{|S^{test}|} \sum_{t \in \tilde T} \hat R^{test}(t),
\qquad
\hat R^{test}(t) = \sum_{x_i \in S^{test} \cap t} (y_i - \bar y_t)^2,
where \bar y_t is the value predicted in node t, computed from the n_t training observations falling in t. An original cross-validation procedure may also be used.

Simulation, p = 1
[Figure]

Simulation, p = 2
[Figure]

Advantages and disadvantages
◮ Works in high dimension
◮ Variables of different natures
◮ Regression and classification
◮ Model easy to interpret
◮ Interactions between variables are used
◮ Deals with missing data
◮ Variable importance
◮ Many possible extensions
Disadvantage: instability.

Instability
[Figure]
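As a complement, here is a minimal sketch of the two stages with the rpart package (a CART implementation in R): cp and minsplit play roughly the role of the mindev and minsize thresholds above, and the cp table drives the cost-complexity pruning. Refitting the same call on bootstrap resamples of the data is an easy way to reproduce the instability illustrated above; the dataset and object names are illustrative choices.

```r
library(rpart)
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind")])

# Stage 1: grow a deep, near-maximal tree (small cp and minsplit act like
# loose mindev / minsize stopping thresholds).
big <- rpart(Ozone ~ Solar.R + Wind, data = aq, method = "anova",
             control = rpart.control(cp = 0.001, minsplit = 5))

# Stage 2: cost-complexity pruning.  The cp table gives, for each value of the
# penalty, the size of the corresponding subtree and its cross-validated error.
printcp(big)
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned  <- prune(big, cp = best_cp)
plot(pruned); text(pruned)
```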
Extensions

Types of extensions
Extension                         Modification                    Authors                                    Software
Specific form for the partition   Class of splits used            Breiman et al., Murthy et al.              rpart, OC1
Arbitrary form for the partition  Class of splits used            Morimoto, 2001                             NA
Regression within leaves          Output estimation + criterion   Chaudhuri et al. (1994, 1995), TSR, FACT   rpart
Bootstrap within nodes            -                               Danneger et al., 1999                      NA
X_j in R^{p_j}                    Class of splits used            Coming soon                                rpart variant
X_j in L2(R)                      Class of splits                 Yamada et al. 2003, Roche & Ghattas 2009   R+C
Y in R^q                          Criterion modification, L2      Yu 1999, Nerini & Ghattas 2007             R code, rpartmv
Y in {0,1}^q                      Criterion modification          Zhang, 1999                                NA
Y in L2(R)                        Criterion modification, KL      Segal 1992, Nerini & Ghattas 2007          R+C
Clustering                        Criterion modification          Fraiman et al., 2012                       R package
Density estimation                Specific                        J. Klemelä                                 delt package

OC1 (Murthy et al., 1994)
[Figure]

Difficulties
Find, at each level, oblique splits of the form
\sum_{m=1}^{p} a_m x_m + a_{p+1} \le s.
◮ NP-hard.
◮ Two families of solutions: deterministic (Breiman et al., 1984) and stochastic (Murthy et al., 1994).
◮ Advantages: less complex trees, sometimes with a better generalization error.
◮ Disadvantage: interpretation of the splits.

Oblique regression trees
[Figure]

Multidimensional or functional output
◮ Predict a vector and/or a function: Y \in R^d, or Y \in L^2(R).
◮ Predict the daily ozone profile.
◮ Predict the size distribution of zooplankton (an indicator of climate change).
◮ Predict sea-salinity profiles.
The regression function has the form
f(x) = E[Y \mid X = x] = \sum_{j=1}^{q} f_j \, I(X \in N_j).

Predicting salinity profiles
[Figure]

Modeling the densities of zooplankton sizes
[Figure]

How is it done?
Main difficulty: generalize the univariate criterion
R(t) = \sum_{x_n \in t} (y_n - \bar y(t))^2.
Natural idea:
R(t) = \sum_{x_n \in t} \| y_n - \bar y(t) \|^2,
where the norm must be chosen so that the property
\Delta R(s, t) = R(t) - \big(R(t_L) + R(t_R)\big) \ge 0
still holds.
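When the chosen norm is the Euclidean one (the multivariate case discussed next), the node heterogeneity and the split gain generalize directly. A small base-R sketch mirroring the univariate delta_R above; the names are illustrative and both children are assumed non-empty.

```r
# Heterogeneity of a node when Y is vector-valued: sum of squared Euclidean
# distances of the responses in the node to the node mean.
node_dev_mv <- function(Y) {                  # Y: n x d matrix, one row per observation
  sum(rowSums(sweep(Y, 2, colMeans(Y))^2))
}

# The split gain keeps the same form as in the univariate case.
delta_R_mv <- function(Y, x, a) {
  node_dev_mv(Y) - (node_dev_mv(Y[x < a, , drop = FALSE]) +
                    node_dev_mv(Y[x >= a, , drop = FALSE]))
}
```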
Multivariate and functional output
◮ Multivariate case: when Y is a vector, Y \in R^d,
  ◮ if the d components are independent, we can use the Euclidean norm;
  ◮ if not, we transform Y by projection onto an orthogonal basis, where the Euclidean norm may then be used.
◮ Functional case: we can use f-divergences,
  ◮ Kullback-Leibler,
  ◮ chi-squared distance,
  ◮ Hellinger distance,
with, for instance,
K(y_i, y_j) = \int y_j(t) \, \ln\frac{y_j(t)}{y_i(t)} \, dt,
\qquad
H^2(y_i, y_j) = \int \Big(\sqrt{y_i(t)} - \sqrt{y_j(t)}\Big)^2 dt.

Simulated example
[Figure]

Estimated tree
[Figure]

Bayesian approach
◮ Principle: look for the maximum a posteriori (MAP) tree within the space of trees.
◮ How: a stochastic search algorithm may be used, defining random moves within the tree space.
◮ See Denison, George...

Other usage
◮ Within survival models (Intrator, 1991)
◮ Within mixture models, ZIP models (Hu W., 2010)
◮ Combined with kernel models (Torgo, 2000)
◮ ...

Aggregating classifiers

Why?
◮ Instability?
◮ Multiple models?
◮ Boost?

Stacked regression, 1995
◮ Construct K different models \hat f_k, k = 1..K, over the data set at hand E = \{(y_i, x_i), i = 1..n\}.
◮ Define the stacked model as
\hat f^{(K)}(x) = \sum_{k=1}^{K} \beta_k \hat f_k(x),
where the \beta_k are the ridge regression coefficients obtained by regressing the y_i on the \hat f_k(x_i).
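A minimal R sketch of this idea, assuming two base models and MASS::lm.ridge for the stacking coefficients; in stacked regression the beta_k are normally fitted on cross-validated predictions, whereas in-sample predictions are used here only to keep the sketch short. The dataset and object names are illustrative.

```r
library(rpart)
library(MASS)                      # lm.ridge, for the stacking coefficients beta_k

aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

# Two base models f_k: a linear model and a regression tree.
f1 <- lm(Ozone ~ ., data = aq)
f2 <- rpart(Ozone ~ ., data = aq, method = "anova")

# Stacked model: ridge-regress y on the base models' predictions.
# (Proper stacking would use cross-validated predictions here.)
Z <- data.frame(Ozone = aq$Ozone,
                p1 = predict(f1, aq),
                p2 = predict(f2, aq))
stack <- lm.ridge(Ozone ~ p1 + p2, data = Z, lambda = 1)
coef(stack)                        # intercept and stacking weights beta_1, beta_2
```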
Bagging, Boosting, ...
◮ Freund: Weak Learner => Strong Learner (voting among several "learners"), "Boosting" (1995).
◮ Breiman: Unstable "Classifier" => Stable (by bootstrap aggregation), "Bagging" (1996), "Arcing" (1999).

Aggregation
In regression,
f^{(a)}(x) = \frac{1}{K} \sum_{k=1}^{K} \hat f_k(x).
In classification,
f^{(a)}(x) = \mathrm{Argmax}_j \sum_{k=1}^{K} 1_{\hat f_k(x) = j}.

Boosting (Y \in \{0, 1\})
\epsilon_k = \sum_{i=1}^{n} d_k(i) \, |\hat y_k(i) - y_i|,
\qquad
\beta_k = \frac{1 - \epsilon_k}{\epsilon_k},
\qquad
w_k = \log(\beta_k),
\qquad
d_{k+1}(i) = d_k(i) \, \beta_k^{\,|\hat y_k(i) - y_i|},
and the aggregated classifier is
\hat y^{a}(x) = 1 \ \text{ if } \ \sum_{k : \hat y_k(x) = 1} w_k \ \ge \ \sum_{k : \hat y_k(x) = 0} w_k.

Example: breast cancer
[Figure: learning and test errors of the boosted classifier.]

Datasets from the ML benchmark
            Name      Variables  Observations  Levels
Simulated   waveform  22         5000          3
            Ringnorm  21         7400          2
Real        Iono      35         351           2
            Glass     10         214           6
            Breast    10         683 (+16)     2
            DNA       61         3190          3
            Vowel     11         990           11
            Vtrl1     41         622           2
            Vtrl3     41         622           2

Some remarks from Breiman (99) and F.S. (95, 96, 99)
◮ Bagging + CART: a high gain in generalization.
◮ Boosting works because of adaptive re-sampling, not because of the particular form of the algorithm.
◮ Boosting gives better results than bagging in a majority of the tests.
◮ No tuning in these methods.
◮ The observation weights vary without converging; this is essential.
◮ Instability is also essential: boosting improves neither LDA nor kNN.
◮ The experiments of F.S. (95), Drucker and Cortes (97) and Quinlan (96) show that boosting trees yields a fast, well-performing classifier.
◮ Re-sampling and re-weighting the observations are equivalent in boosting.
◮ In the cited experiments, CART outperformed boosting in 5 cases out of 39; this never happened with bagging. (Why? Sample size, outliers?)
◮ Boosting CART gives very different trees.
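To make the reweighting scheme of the Boosting slide above concrete, here is a hand-rolled R sketch that follows those update rules with depth-one rpart trees ("stumps") as weak learners. The kyphosis data set (shipped with rpart), the 0/1 coding and all object names are illustrative choices, not those of the cited experiments.

```r
library(rpart)

data(kyphosis)
y <- as.numeric(kyphosis$Kyphosis == "present")   # y_i in {0, 1}
X <- kyphosis[, c("Age", "Number", "Start")]
n <- nrow(X); K <- 20
d <- rep(1 / n, n)                                # d_1(i): uniform weights
w <- numeric(K); fits <- vector("list", K)

for (k in 1:K) {
  dat <- data.frame(X, y = factor(y), wts = n * d)  # rpart only needs relative weights
  fits[[k]] <- rpart(y ~ Age + Number + Start, data = dat, weights = wts,
                     method = "class", control = rpart.control(maxdepth = 1))
  yhat <- as.numeric(as.character(predict(fits[[k]], X, type = "class")))
  eps  <- sum(d * abs(yhat - y))                  # epsilon_k
  if (eps <= 0 || eps >= 0.5) break               # degenerate weak learner: stop
  beta <- (1 - eps) / eps                         # beta_k
  w[k] <- log(beta)                               # vote weight w_k
  d    <- d * beta^abs(yhat - y)                  # d_{k+1}(i) = d_k(i) beta_k^{|yhat - y|}
  d    <- d / sum(d)                              # renormalise
}

# Aggregated classifier: predict 1 iff the trees voting 1 carry at least half
# of the total vote weight, as in the final rule above.
score <- rep(0, n)
for (k in which(w != 0)) {
  score <- score + w[k] *
    as.numeric(as.character(predict(fits[[k]], X, type = "class")))
}
yhat_boost <- as.numeric(score >= sum(w) / 2)
mean(yhat_boost != y)                             # training error of the aggregate
```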
Random Forests
◮ Construct bootstrap samples of the data.
◮ Leave the out-of-bag (OOB) sample aside.
◮ At each node of the tree, select the optimal split by searching over only about log(p) variables, chosen at random among the p available ones.
◮ Don't prune the trees.
◮ Aggregate the trees as in bagging.
◮ Random Features: random linear combinations of the selected variables at each node.

RF properties
◮ Each tree has a low bias (but a high variance).
◮ The trees are not correlated; the correlation is defined as the one computed between the trees' predictions over the OOB samples.
◮ Very high performance: the "best off-the-shelf" classifier.
◮ Reduced computational complexity.
◮ Possible parallelization.

Variable importance in RF
◮ Set N_i = 0, M_i = 0 and M_{ij} = 0, for i = 1..N and j = 1..p, where
  ◮ N_i = number of times observation i appears in an OOB sample,
  ◮ M_i = number of times observation i appears in an OOB sample and is misclassified,
  ◮ M_{ij} = number of times observation i appears in an OOB sample and is misclassified after permutation of the values of variable j in that OOB sample.
◮ For each variable j = 1..p and each tree k = 1..K of the forest:
  ◮ if observation i is in OOB_k, set N_i = N_i + 1;
  ◮ if observation i is in OOB_k and misclassified, set M_i = M_i + 1;
  ◮ permute at random the values of variable j in OOB_k;
  ◮ if observation i is in OOB_k and misclassified after the permutation, set M_{ij} = M_{ij} + 1.
◮ The importance of variable j is \frac{1}{n} \sum_i Z_i(j), where Z_i(j) = \frac{M_{ij} - M_i}{N_i}.
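This permutation measure is essentially what the randomForest R package reports as the mean decrease in accuracy when importance = TRUE (up to implementation details such as normalisation); note that its default mtry is sqrt(p) for classification rather than the log(p) quoted above. A minimal usage sketch, assuming the package is installed:

```r
library(randomForest)

rf <- randomForest(Species ~ ., data = iris, ntree = 500,
                   mtry = 2,            # number of variables tried at each split
                   importance = TRUE)   # compute OOB permutation importance

importance(rf, type = 1)   # type 1: mean decrease in accuracy (permutation based)
varImpPlot(rf)
print(rf)                  # OOB error estimate and confusion matrix
```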
Other approaches
◮ Cforest:
  ◮ Hothorn T., Hornik K. & Zeileis A. (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15, 651-674.
  ◮ Hothorn T., Hornik K. & Zeileis A. (2007). party: A Laboratory for Recursive Part(y)itioning. The party package.
◮ Extremely randomized trees, P. Geurts, 2003.

Justifying and extending

State of the art
◮ Bayesian justifications.
◮ Comparison with generalized additive models.
◮ Connections with game theory.
◮ Comparison with SVM.
◮ Boosting and regression (Drucker, Schapire R., Friedman's MART).
◮ Multi-class approaches.
◮ Density estimation.

Multi-class generalizations: some principles
◮ AdaBoost.M2, Schapire, 1996: divide, learn and aggregate.
◮ AdaBoost.OC: error-correcting-code subdivisions.
◮ AdaBoost.ECC, Guruswami et al., 1997: error-correcting-code subdivisions.
◮ SAMME, Hastie 2007: direct generalization of AdaBoost.M1.
Different strategies are used to select the binary subproblems: one versus one, or one versus the rest.

SAMME (Hastie, 2007) versus binary AdaBoost
Binary AdaBoost (Y \in \{0, 1\}):
\epsilon_k = \sum_{i=1}^{n} d_k(i) \, |y_i - \hat y_k(i)|,
\quad \beta_k = \frac{1 - \epsilon_k}{\epsilon_k},
\quad d_{k+1}(i) = d_k(i) \, \beta_k^{\,|y_i - \hat y_k(i)|},
\quad w_k = \log(\beta_k),
\quad \hat y^{a}(x) = 1_{\{\sum_{k : \hat y_k(x) = 1} w_k \ \ge \ \sum_{k : \hat y_k(x) = 0} w_k\}}.
SAMME (Y \in \{1, \ldots, J\}):
\epsilon_k = \sum_{i=1}^{n} d_k(i) \, 1_{\{y_i \ne \hat y_k(i)\}},
\quad \beta_k = (J - 1) \, \frac{1 - \epsilon_k}{\epsilon_k},
\quad d_{k+1}(i) = d_k(i) \, \beta_k^{\,1_{\{y_i \ne \hat y_k(i)\}}},
\quad w_k = \log(\beta_k),
\quad \hat y^{a}(x) = \mathrm{Argmax}_j \Big\{ \sum_{k : \hat y_k(x) = j} w_k \Big\}.
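A small numeric illustration of the (J - 1) factor: SAMME only requires a weak learner to beat random guessing among J classes (error below 1 - 1/J), whereas the binary rule would reject it as soon as its error exceeds 1/2. The values below are an illustrative example, not taken from the cited experiments.

```r
eps <- 0.6; J <- 5                    # weighted error of one weak learner, 5 classes
log((1 - eps) / eps)                  # binary AdaBoost weight w_k: negative (rejected)
log((J - 1) * (1 - eps) / eps)        # SAMME weight w_k: positive, the learner still votes
```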
Density estimation
Di Marzio, Rigollet & Tsybakov, Bourel & Ghattas, J. Klemelä...
◮ Ridgeway, G. (2002). Looking for lumps: boosting and bagging for density estimation. Computational Statistics & Data Analysis, 38(4), 379-392.
◮ Rigollet, P. & Tsybakov, A. B. (2007). Linear and convex aggregation of density estimators. Mathematical Methods of Statistics, 16(3), 260-280.
◮ Rosset, S. & Segal, E. (2002). Boosting density estimation. In Advances in Neural Information Processing Systems 15, 641-648. MIT Press.
◮ Smyth, P. & Wolpert, D. (1999). Linearly combining density estimators via stacking. Machine Learning, 36(1-2), 59-83.
◮ Song, X., Yang, K. & Pavel, M. (2004). Density boosting for Gaussian mixtures. Neural Information Processing, 3316, 508-515.

Call for submission: special issue of the Journal de la SFDS
This special issue concerns decision trees, their variants, their extensions, their applications and the available software:
◮ Oblique trees, trees and linear models, TSR, trees and mixture models, trees and ZIP models,
◮ Trees for multivariate and functional output (continuous/discrete), trees for time series,
◮ Other extensions, unsupervised cases: trees for clustering, trees for density estimation,
◮ Different types of trees and relations to other models: Bayesian trees, dyadic trees, trees vs. logistic models, trees vs. SVM,
◮ Aggregating trees: random forests, boosting, bagging, forest garrote.
Call for submission: January 2013. Submission: January 2013 - April 2013. Reviewing: May 2013 - July 2013. Final submission and reviewing: July 2013 - September 2013.

References 1
◮ Breiman L., Friedman J. H., Olshen R., Stone C. J. (1984). Classification And Regression Trees. Wadsworth, Belmont, CA.
◮ Chaudhuri P., Huang M. C., Loh W. Y. and Yao R. (1994). Piecewise-polynomial regression trees. Statistica Sinica, 4, 143-167.
◮ Danneger F. (1999). Tree stability diagnostics and some remedies against instability. Submitted for publication.
◮ Leblanc M. and Tibshirani R. (1996). Combining estimates in regression and classification. Journal of the American Statistical Association, 91(436), Theory and Methods, 1641-1650.
◮ Morimoto Y., Hiromu I., Morishita S. (2001). Efficient construction of regression trees with range and region splitting. Machine Learning, 45, 235-259.
◮ Segal M. R. (1992). Tree structured methods for longitudinal data. Journal of the American Statistical Association, 87(418), Theory and Methods, 407-418.
◮ Tibshirani R. (1996). Bias, variance, and prediction error for classification rules. Technical report, Statistics Department, University of Toronto.
◮ Vapnik V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
◮ Murthy S. K., Kasif S., Salzberg S. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research.
◮ Hastie T., Tibshirani R. and Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics, New York, NY: Springer.
◮ Yamada Y., Suzuki E., Yokoi H., Takabayashi K. (2003). Decision-tree induction from time-series data based on a standard-example split test. Proceedings of the Twentieth International Conference on Machine Learning (ICML), Washington DC.
◮ Yu Y., Lambert D. (1999). Fitting trees to functional data, with an application to time-of-day patterns. Journal of Computational and Graphical Statistics, 8, 749-762.
◮ Zhang H. (1998). Classification trees for multiple binary responses. Journal of the American Statistical Association, 93.

References 2
◮ Breiman L. (1996). Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6), 2350-2383.
◮ Breiman L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
◮ Breiman L. (1997). Arcing classifiers. The Annals of Statistics, 26(3), 801-849.
◮ Dietterich T. G. and Bakiri G. (1995). Solving multiclass problems via error-correcting output coding. Journal of Artificial Intelligence Research, 2, 263-286.
◮ Freund Y. and Schapire R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. To appear, Journal of Computer and System Sciences.
◮ Freund Y. and Schapire R. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, 148-156.
◮ Schapire R. E. (1997). Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference, 313-321.
◮ Schapire R., Freund Y., Bartlett P., Lee W. S. (1998). Boosting the margin: a new explanation of the effectiveness of voting methods. The Annals of Statistics, 26(5), 1651-1686.