Courte introduction à R

Transcription

Courte introduction à R
Courte introduction à R
Florent Benaych-Georges
Université Paris Descartes
Janvier 2015
Séries temporelles pratiques
Commencer
2 étapes :
1
Télécharger et installer R depuis
http://cran.r-project.org
2
Utiliser R studio, un software open-source pour R :
http://www.rstudio.com
Principaux types
Essentiellement 3 types d’objets :
char, logical, numeric
1
2
3
4
5
6
7
8
9
x=5
class (x)
y=(x <6)
y
class (y)
( z=" H e l l o ␣ w o r l d " ) #p a r e n t h e s e s : r e s u l t d i s p l a y
class (z)
(file: Types.R)
Principaux conteneurs
Essentiellement 5 types de conteneurs :
vector
list
factor
data.frame
ts
Vecteurs
Toutes les entrées ont le même type
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
( v=c ( 2 , 5 , 7 , 8 ) ) #c=c o n c a t e n a t e s e n t r i e s o r v e c t o r s
#i n a s i n g l e v e c t o r
( u=c ( v , 9 , 1 5 : 1 8 ) ) ; u [ u <10]
(w= l e t t e r s [ 1 : 7 ] ) ;
c l a s s (w) #v e c t o r s : n o t o n l y w i t h
#n u m e r i c s a s e n t r i e s
w [ 3 ] #t h i r d c o o r d i n a t e
w [ l e n g t h ( v4 ) ] #l a s t e n t r y
w [ c ( 1 , 4 , 5 ) ] #1 s t , 4 t h and 5 t h e n t r i e s
w[− c ( 1 , 4 , 5 ) ] #a l l b u t 1 s t , 4 t h and 5 t h e n t r i e s
( x=s e q ( from =1 , t o =2, by = . 2 ) )
( y=r e p ( x , 2 ) )
(file: Vectors .R)
Conteneur similaire en 2 dimensions : matrix
Listes
Les coordonnées peuvent avoir des types différents et des noms
1
2
3
4
5
L= l i s t ( name=" A lban " , age =1 , c o u n t r y=" F r a n c e " ,
s i z e_i n_cm=83)
s t r (L)
L$name
L[[2]]
(file: Lists .R)
Facteurs
Adaptés pour les données qualitatives (i.e. non quantitatives)
1
2
3
4
5
6
7
8
9
C=c ( " b l u e " , " b l u e " , " r e d " , " r e d " , " b l u e " , " g r e e n " ,
" y e l l o w " , " g r e e n " , " r e d " ) #v e c t o r o f c h a r a c t e r s
(my_f a c t o r 1=f a c t o r (C ) )
s t r (my_f a c t o r 1 )
t a b l e (my_f a c t o r 1 )
L2=c ( " b l u e " , " r e d " , " g r e e n " , " y e l l o w " , " o r a n g e " )
(my_f a c t o r 2=f a c t o r (C , l e v e l s =L2 ) )
t a b l e (my_f a c t o r 2 )
(file: Factors .R)
Data.frames
data.frame : liste de vecteurs de même taille mais
éventuellement de classes différentes
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
name=c ( " C l o " , " Tony " , " Alban " , "Tom" , " Lou " )
c i t y=c ( " P a r i s " , " P a r i s " , " Lyon " , "NYC" , " S h a n g h a i " )
s i z e=c ( 1 7 0 , 1 8 1 , 8 3 , 1 8 5 , 1 7 9 )
( d = d a t a . f r a m e ( name , c i t y , s i z e ) )
str (d)
dim ( d ) ; nrow ( d ) ; n c o l ( d )
d$name
d [ 2 , 3 ] #2 nd e n t r y o f 3 r d column
d$ y e a r = c ( 1 9 8 3 , 1 9 8 0 , 2 0 1 3 , 1 9 8 8 , 1 9 8 6 ) #add column
d [ 2 , ] #2 nd row
d [ , c ( 3 , 1 ) ] #3 r d and 1 s t c o l u m n s
d [ , − c ( 2 , 4 ) ] #a l l b u t 2 nd and 4 t h c o l u m n s
d [ d$ c i t y==" P a r i s " , ] #e x t r a c t s a s u b s e t
c o l n a m e s ( d ) [ 4 ] = " b i r t h_y e a r " #c h a n g e s column name
summary ( d )
(file: Data.frame.R)
ts (Time Series) (1)
Objet de type ts : un vecteur de valeurs, un vecteur start (année,
moment), un vecteur end (année, moment) et une fréquence
1 # s a v e a n u m e r i c v e c t o r c o n t a i n i n g 72 m o n t h l y o b s e r v a t i o n s
2 myvector = . 1 ∗ (1:72)+ s i n ( ( 1 : 7 2 ) ∗ p i / 12) +
3
rnorm ( 7 2 , mean=0 , s d =.3)
4
5 # from Jan 2009 t o Dec 2014 a s a t i m e s e r i e s o b j e c t
6 myts = t s ( m y v e c t o r , s t a r t=c ( 2 0 0 9 , 1 ) , end=c ( 2 0 1 4 , 1 2 ) ,
7
f r e q u e n c y =12)
8 c l a s s ( myts )
9 p l o t ( myts , c o l=" r e d " )
10
11 # s u b s e t t h e t i m e s e r i e s ( June 2014 t o December 2 0 1 4 )
12 myts2 <− window ( myts , s t a r t=c ( 2 0 1 0 , 6 ) , end=c ( 2 0 1 4 , 1 2 ) )
13 c l a s s ( myts2 )
14 p l o t ( myts2 , c o l=" b l u e " )
(file: Example_ts_1.R)
ts (Time Series) (2)
Objet de type ts : un vecteur de valeurs, un vecteur start (année,
moment), un vecteur end (année, moment) et une fréquence
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# s a v e a n u m e r i c v e c t o r c o n t a i n i n g 72 m o n t h l y o b s e r v a t i o n s
myvector = . 1 ∗ (1:72)+ s i n ( ( 1 : 7 2 ) ∗ p i / 12) +
rnorm ( 7 2 , mean=0 , s d =.3)
# from Jan 2009 t o Dec 2014 a s a t i m e s e r i e s o b j e c t
myts = t s ( m y v e c t o r , s t a r t=c ( 2 0 0 9 , 1 ) , end=c ( 2 0 1 4 , 1 2 ) ,
f r e q u e n c y =12)
# Seasonal decomposition
f i t = s t l ( myts , s . window=" p e r i o d " )
p l o t ( f i t , c o l=" p u r p l e " )
# additional plots
m o n t h p l o t ( myts )
library ( forecast )
s e a s o n p l o t ( myts )
(file: Example_ts_2.R)
ts (Time Series, Corrélation) (2)
Objet de type ts : un vecteur de valeurs, un vecteur start (année,
moment), un vecteur end (année, moment) et une fréquence
1
2
3
4
5
6
7
8
9
d a t a ( USAccDeaths )
USAccDeaths
b r u i t . b l a n c=t s ( rnorm ( 1 0 0 ) , f r e q u e n c y = 1 , s t a r t = c ( 1 ) , end=c
plot ( bruit . blanc )
a c f ( b r u i t . b l a n c , t y p e=" c o r r e l a t i o n " ) #ou t y p e=" c o v a r i a n c e " o
(file: Example_ts_3.R)
Autres types, conversions
Beaucoup d’autres types en R (date, matrix, etc...). Conversions
possibles grâce à as .mon_type(mon_object)
1
2
3
4
5
6
7
8
9
10
11
12
( d1=a s . Date ( "1999−09−17" ) )
c l a s s ( d1 )
( d2=a s . Date ( " 19980901 " , f o r m a t="%Y%m%d" ) )
(my_d a t e s=s e q ( d1 , d2 , by="−1␣ month " ) )
( x=a s . n u m e r i c ( d1 ) )
(my_ch=a s . c h a r a c t e r ( d2 ) )
(my_f a c t o r 1=f a c t o r ( c ( " b l u e " , " r e d " , " g r e e n " , " r e d " ) ) )
( v=a s . c h a r a c t e r (my_f a c t o r 1 ) )
(file: Dates_and_conversions.R)
Obtenir et modifier le répertoire de travail
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
getwd ( ) #g e t w o r k i n g d i r e c t o r y
s e t w d ( "~/ D e s k t o p / " ) #s e t w o r k i n g d i r e c t o r y
#s e t w d w o r k s w i t h l i n u x a b s o l u t e o r r e l a t i v e pathnames :
# ~/ r e t u r n s you t o y o u r l o g i n d i r e c t o r y
# /home/ t a k e s you t o t h e home d i r e c t o r y
#where u s e r l o g i n d i r e c t o r i e s a r e u s u a l l y s t o r e d
# ../
moves you up one d i r e c t o r y
# . /my_sub_d i r moves you down t o t h e s u b d i r e c t o r y
#named "my_sub_d i r "
getwd ( ) #g e t w o r k i n g d i r e c t o r y
d i r () #l i s t of d i r e c t o r y f i l e s
l s () #l i s t of R o b j e c t s
(file: Linux_pathnames.R)
Importer des données
Une étape clé dans l’analyse de données Souvent : compliqué ! !
1 ( x=s c a n ( "my_d a t a_ f i l e . d a t " ) )
2 c l a s s ( x ) #s c a n g i v e s a v e c t o r back
3
4 ( y=r e a d . t a b l e ( "my_t a b l e . d a t " , h e a d e r =1))
5 c l a s s ( y ) #r e a d . t a b l e g i v e s a d a t a . f r a m e back
6
7 t a=r e a d . c s v ( f i l e =" e x o 1 . c s v " , h e a d e r =1, s k i p =6)
8 #h e r e , r e a d . t a b l e a l s o w o r k s
9 #" s k i p " a l l o w s t o s k i p a c e r t a i n number o f r o w s
10
11 h=r e a d . t a b l e ( " h t t p : / /www . cmapx . p o l y t e c h n i q u e . f r /~b e n a y c h / 201
12
h e a d e r=T , f i l l =T)# f i l l : b l a n k f i e l d s
13
#added when r o w s
14
#h a v e m i s s i n g v a l u e s
15 names ( h )
16 s t r ( h )
17 summary ( h )
(file: Import_data.R)
Export de données
Beaucoup plus facile que l’importation
1
2
3
4
5
6
7
8
9
10
11
w r i t e ( r u n i f ( 5 ) , f i l e ="my_w r i t i n g_ f i l e . d a t " )
y=r e a d . t a b l e ( "my_t a b l e . d a t " , h e a d e r =1)
w r i t e . t a b l e ( y , f i l e ="my_w r i t i n g_ f i l e _f o r_t a b l e . d a t " )
w r i t e . c s v ( y , f i l e ="my_w r i t i n g_ f i l e _f o r_t a b l e . c s v " )
( z=r e a d . t a b l e ( "my_w r i t i n g_ f i l e _f o r_t a b l e . d a t " ,
h e a d e r =1))
t a=r e a d . c s v ( f i l e =" e x o 1 . c s v " , h e a d e r =1, s k i p =6)
w r i t e . c s v ( ta , f i l e = " . . /my_c s v_ f i l e . c s v " )
(file: Export_data.R)
Import/export d’objects R
1 myvector = . 1 ∗ (1:72)+ s i n ( ( 1 : 7 2 ) ∗ p i / 12) +
2
rnorm ( 7 2 , mean=0 , s d =.3)
3
4 # from Jan 2009 t o Dec 2014 a s a t i m e s e r i e s o b j e c t
5 myts = t s ( m y v e c t o r , s t a r t=c ( 2 0 0 9 , 1 ) , end=c ( 2 0 1 4 , 1 2 ) ,
6
f r e q u e n c y =12)
7
8 s a v e ( myts , f i l e ="my_r d a_ f i l e . r d a " )
9 rm ( l i s t = l s ( ) )
10 p r i n t ( myts )
11 l o a d ( "my_r d a_ f i l e . r d a " )
12 p r i n t ( myts )
13 p l o t ( myts )
(file: Import_export_R_objects.R)
Fonctions
1
2
3
4
5
6
7
8
9
10
11
12
13
14
f = f u n c ti o n ( x ) {x∗x}
f (5)
f (1:7)
PF = f u n c t i o n ( n=1 , p r o b a =.5) {
a s . i n t e g e r ( r u n i f ( n)< p r o b a ) }
PF ( 1 0 )
PF ( 1 0 , . 1 )
PF2 = f u n c t i o n ( n=1, p r o b a =.5) {
e c h=a s . i n t e g e r ( r u n i f ( n)< p r o b a )
f r e q=mean ( e c h )
r e t u r n ( l i s t ( s a m p l e=ech , f r e q u e n c y=f r e q ) ) }
PF2 ( 1 0 , . 6 )
(file: Functions .R)
Structures de contrôle
1 x=r u n i f ( 1 ) ; y=c ( ) ; z=c ( ) ; #x : random i n [ 0 , 1 ]
2 f o r ( i i n 1 : 1 0 ) y [ i ]= s i n ( p i ∗ i / 1 0 )
3 y==s i n ( p i ∗ ( 1 : 1 0 ) / 1 0 )
4
5 for ( i in 1:10) {
6
i f ( y [ i ] < . 5 ) { z [ i ]= y [ i ] } e l s e { z [ i ]= y [ i ]+1}}
7 z==((y <.5) ∗ y+(y >=.5)∗ ( y +1))
8
9 w h i l e ( x <.5) { x=r u n i f ( 1 ) }
10 x
11
12 #way q u i c k e r : m a t r i x a p p r o a c h :
13 v=r u n i f ( 1 E5 ) #10^5 random numbers i n [ 0 , 1 ]
14 t=S y s . t i m e ( )
15 f o r ( i i n 1 : l e n g t h ( v ) ) { v [ i ]= v [ i ]+5}
16 t 1=S y s . t i m e () − t
17 t=S y s . t i m e ( )
18 v=v+5
19 t 2=S y s . t i m e () − t ; a s . n u m e r i c ( t 1 ) / a s . n u m e r i c ( t 2 )
(file: Control_structures .R)
Graphiques : plot, points, lines
1 g r a p h i c s . o f f ( ) #e r a s e s p r e v i o u s g r a p h i c s
2 x=s e q ( . 4 , 4 , . 1 ) ; y=l o g ( x^2+1/ x ^2)
3 p l o t ( x , y , t y p e="p" ) #p f o r p o i n t s
4 p l o t ( x , y , t y p e=" l " ) #l f o r l i n e
5 p l o t ( x , y , t y p e=" s " ) #l f o r s t e p
6 p l o t ( x , y , t y p e="p" , pch =2) #pch : t y p e o f p o i n t s
7 ##y l i m=c ( ay , by ) , x l i m=c ( ax , bx ) : bounds o f t h e a x e s
8 ##l t y : t y p e o f l i n e , c o l : c o l o r , main : t i t l e
9 x=rnorm ( 2 0 ) ; y=r e x p ( 2 0 )
10 p l o t ( x , y , c o l=" r e d " , y l i m=c ( −0.5 , 5 ) , x l i m=c ( − 3 ,2 ))
11 p o i n t s ( x +.5 , y −1 , pch =2 , c o l=" b l u e " ) #add p o i n t s
12 l e g e n d ( " t o p l e f t " , c ( " p l o t_1 " , " p l o t_2 " ) , c o l=c ( " r e d " , " b l u e " ) ,
13
pch=c ( 1 , 2 ) )
14 l i n e s ( x , y , l t y =2 , c o l=" g r e e n " ) #add l i n e
15 a b l i n e ( h=3 , c o l=" y e l l o w " , l t y =1)#add h o r i z o n t a l l i n e
16 a b l i n e ( v = −1.3 , c o l=" a q u a m a r i n e 2 " )#add v e r t i c a l l i n e
17 a b l i n e ( a =1 ,b=1 , c o l=" c o r a l 3 " )#add o b l i q u e l i n e
18 t e x t ( −2 ,2 , " commentary " ) ; t i t l e ( " Example ␣ c u r v e s " )
(file: Plot_Intro.R)
Les nombreuses "méthodes" de plot
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#I m a g e s p r o d u c t e d by p l o t ( x ) depend on t h e c l a s s o f x :
methods ( p l o t )
#Ex a mp l e s :
c l a s s ( s i n ) ; p l o t ( s i n , x l i m=c(− p i , p i ) , x l a b=" x " , y l a b=" s i n ( x ) " )
x=r p o i s ( 1 0 0 0
, 1 ) ; y=t a b l e ( x ) ; c l a s s ( y ) ; p l o t ( y )
# s a v e a n u m e r i c v e c t o r c o n t a i n i n g 72 m o n t h l y o b s e r v a t i o n s
myvector = . 1 ∗ (1:72)+ s i n ( ( 1 : 7 2 ) ∗ p i / 12) +
rnorm ( 7 2 , mean=0, s d =.3)
# from Jan 2009 t o Dec 2014 a s a t i m e s e r i e s o b j e c t
myts = t s ( m y v e c t o r , s t a r t=c ( 2 0 0 9 , 1 ) , end=c ( 2 0 1 4 , 1 2 ) ,
f r e q u e n c y =12)
c l a s s ( myts ) ; p l o t ( myts , c o l=" r e d " )
r=rnorm ( 1 E3 , 1 ) ; w=d e n s i t y ( r ) ; c l a s s (w)
p l o t (w)
(file: Plot_Intro_2−methods.R)
Probabilités, variables éléatoires, distributions
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#D i s c r e t e l a w s : binom , h y p e r , p o i s , geom , sample , . . .
#C o n t i n u o u s : norm , u n i f , t , c h i s q , . . .
#
#
#
#
#
G e n e r a l s y n t a x , f o r "my_l a w " a p r o b a b i l i t y d i s t r i b u t i o n :
rmy_l a w ( n , . . . ) : s a m p l e
dmy_l a w ( x , . . . ) : d e n s i t y o f my_l a w a t x
pmy_l a w ( x , . . . ) : P( X =< x ) f o r X d i s t r i b u t e d a s my_l a w
qmy_l a w ( a , . . . ) : q u a n t i l e w i t h l e v e l a
x=s e q ( −5 , 5 , . 0 1 )
u=s e q ( 0 , 1 , . 0 1 )
p a r ( mfrow=c ( 2 , 2 ) )#d e f i n e s an a r r a y o f p l o t s
p l o t ( rnorm ( 5 0 0 , 0 , 1 ) , pch =20 , c o l=" r e d " ) #s a m p l e
p l o t ( x , dnorm ( x , 0 , 1 ) , t y p e=" l " , c o l=" b l u e " ) #d e n s i t y
p l o t ( x , pnorm ( x , 0 , 1 ) , t y p e=" l " , c o l=" p u r p l e " ) #P( X =< x )
p l o t ( u , qnorm ( u , 0 , 1 ) , t y p e=" l " , c o l=" g r e e n " ) #q u a n t i l e s
g r a p h i c s . o f f ( ) #e r a s e s g r a p h i c s
(file: Probability _Random_variables_Distributions.R)
Description graphique des données : hist, boxplot, pie,
barplot...
1
2
3
4
5
6
7
8
9
10
11
12
d a t a ( i r i s ) ; s t r ( i r i s ) ; summary ( i r i s ) ; head ( i r i s )
x= i r i s $ S e p a l . L e n g t h
mean ( x ) ; s d ( x ) ; v a r ( x )
quantile (x , c (.25 ,.5 ,.75))
boxplot (x)
hist (x)
h i s t ( x , b r e a k s = 12)
r e c o r d s=s a m p l e ( c ( " c l e a r " , " c l o u d y " , " r a i n y " ) , 3 6 5 ,
r e p l a c e=T , p r o b=c ( 5 , 4 , 1 ) )
pie ( table ( records ))
barplot ( table ( records ))
(file: Graphical_description_of_data−hist−boxplot−pie−barplot.R)
Description graphique des données : distributions
conditionnelles
1 x y p l o t ( Ozone~Wind | Month , d a t a=a i r q u a l i t y , l a y o u t=c ( 5 , 1 ) )
2 h i s t o g r a m ( ~ S e p a l . L e n g t h | S p e c i e s , d a t a= i r i s )
3 b w p l o t ( S e p a l . L e n g t h~ S p e c i e s , d a t a= i r i s )
4
5 r e q u i r e ( UsingR ) #o p e n s t h e f u n c t i o n s and d a t a o f
6 #t h e p a c k a g e UsingR
7 a t t a c h ( k i d . w e i g h t s ) #a l l o w s t o w r i t e h e i g h t
8 #i n s t e a d o f k i d . w e i g h t s $ h e i g h t . . .
9 h i s t o g r a m ( ~ w e i g h t | g e n d e r , d a t a=k i d . w e i g h t s , l a y o u t=c ( 1 , 2 ) )
10 h i s t o g r a m ( ~ h e i g h t | g e n d e r , d a t a=k i d . w e i g h t s , l a y o u t=c ( 1 , 2 ) )
11 mean ( w e i g h t [ g e n d e r=="M" ] )
12 mean ( w e i g h t [ g e n d e r=="F" ] )
13 mean ( h e i g h t [ g e n d e r=="M" ] )
14 mean ( h e i g h t [ g e n d e r=="F" ] )
15
16 H=r e a d . c s v ( f i l e =" male_f e m a l e_o v e r_20_h e i g h t s . c s v " )
17 s t r (H)
18 h i s t o g r a m ( ~H$ h e i g h t | H$ g e n d e r , l a y o u t=c ( 1 , 2 ) )
Graphical description of data : correlations
1 r e q u i r e ( UsingR ) #o p e n s t h e f u n c t i o n s and d a t a o f
2
#t h e p a c k a g e UsingR
3 data ( kid . weights )
4 X=k i d . w e i g h t s ; s t r (X)
5 p a i r s (X [ , 1 : 3 ] , c o l=X [ , 4 ] , pch =20)
6
7 ## BMI : body mass i n d e x
8 a t t a c h ( k i d . w e i g h t s ) #a l l o w s t o w r i t e h e i g h t
9
#i n s t e a d o f k i d . w e i g h t s $ h e i g h t . . .
10 h t_i n_m = h e i g h t ∗ 2 . 5 4 / 100
# 2 . 5 4 cm p e r i n c h
11 w e i g h t_i n_kg = w e i g h t / 2 . 2 0 4 6
# 2 . 2 0 4 6 l b s . p e r kg
12 bmi = w e i g h t_i n_kg / h t_i n_m^2
13 h i s t o g r a m ( bmi )
14 h i s t o g r a m ( ~bmi | g e n d e r )
15
16 l i b r a r y ( c a r )
17 s t a t e s=a s . d a t a . f r a m e ( s t a t e . x77 [ , c ( " Murder " , " P o p u l a t i o n " ,
18
" I l l i t e r a c y " , " Income " , " F r o s t " ) ] )
19 s c a t t e r p l o t M a t r i x ( s t a t e s , s p r e a d=T , pch =20 , l t y . smooth =2)
(file: Graphical_description_of_data−correlations.R)
Help, demos, examples
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
help (" plot ")
? plot
help . search (" plot ")
?? p l o t
demo ( ) # d e m o n t r a t i o n s l i s t
demo ( g r a p h i c s )
e x a m p l e ( p l o t ) #t h e e x a m p l e f u n c t i o n u s u a l l y r u n s
#t h e e x e m p l e s t h a t one can f i n d a t
#t h e end o f t h e h e l p f i l e s
(file: help_demo_examples.R)

Documents pareils