A short introduction to R
Florent Benaych-Georges
Université Paris Descartes
January 2015
Practical time series

Getting started

Two steps:
1. Download and install R from http://cran.r-project.org
2. Use RStudio, an open-source software for R: http://www.rstudio.com

Main types

Essentially 3 types of objects: character, logical, numeric

    x = 5
    class(x)
    y = (x < 6)
    y
    class(y)
    (z = "Hello world")   # parentheses: result display
    class(z)

(file: Types.R)

Main containers

Essentially 5 types of containers:
- vector
- list
- factor
- data.frame
- ts

Vectors

All entries have the same type.

    (v = c(2, 5, 7, 8))   # c = concatenates entries or vectors
                          # in a single vector
    (u = c(v, 9, 15:18)); u[u < 10]
    (w = letters[1:7]); class(w)   # vectors: not only with
                                   # numerics as entries
    w[3]             # third coordinate
    w[length(w)]     # last entry
    w[c(1, 4, 5)]    # 1st, 4th and 5th entries
    w[-c(1, 4, 5)]   # all but 1st, 4th and 5th entries
    (x = seq(from = 1, to = 2, by = .2))
    (y = rep(x, 2))

(file: Vectors.R)

Similar container in 2 dimensions: matrix

Lists

Entries can have different types, and can be named.

    L = list(name = "Alban", age = 1, country = "France", size_in_cm = 83)
    str(L)
    L$name
    L[[2]]

(file: Lists.R)

Factors

Suited to qualitative (i.e. non-quantitative) data.

    C = c("blue", "blue", "red", "red", "blue", "green", "yellow",
          "green", "red")   # vector of characters
    (my_factor1 = factor(C))
    str(my_factor1)
    table(my_factor1)
    L2 = c("blue", "red", "green", "yellow", "orange")
    (my_factor2 = factor(C, levels = L2))   # "orange" is a level with no occurrence
    table(my_factor2)

(file: Factors.R)

Data.frames

data.frame: a list of vectors of the same length but possibly of different classes.

    name = c("Clo", "Tony", "Alban", "Tom", "Lou")
    city = c("Paris", "Paris", "Lyon", "NYC", "Shanghai")
    size = c(170, 181, 83, 185, 179)
    (d = data.frame(name, city, size))
    str(d)
    dim(d); nrow(d); ncol(d)
    d$name
    d[2, 3]                                    # 2nd entry of 3rd column
    d$year = c(1983, 1980, 2013, 1988, 1986)   # add column
    d[2, ]                                     # 2nd row
    d[, c(3, 1)]                               # 3rd and 1st columns
    d[, -c(2, 4)]                              # all but 2nd and 4th columns
    d[d$city == "Paris", ]                     # extracts a subset
    colnames(d)[4] = "birth_year"              # changes column name
    summary(d)

(file: Data.frame.R)

ts (Time Series) (1)

An object of class ts consists of a vector of values, a start vector (year, period within the year), an end vector (year, period within the year) and a frequency.

    # save a numeric vector containing 72 monthly observations
    myvector = .1 * (1:72) + sin((1:72) * pi / 12) +
      rnorm(72, mean = 0, sd = .3)

    # from Jan 2009 to Dec 2014 as a time series object
    myts = ts(myvector, start = c(2009, 1), end = c(2014, 12),
              frequency = 12)
    class(myts)
    plot(myts, col = "red")

    # subset the time series (June 2010 to December 2014)
    myts2 <- window(myts, start = c(2010, 6), end = c(2014, 12))
    class(myts2)
    plot(myts2, col = "blue")

(file: Example_ts_1.R)

ts (Time Series) (2)

    # save a numeric vector containing 72 monthly observations
    myvector = .1 * (1:72) + sin((1:72) * pi / 12) +
      rnorm(72, mean = 0, sd = .3)

    # from Jan 2009 to Dec 2014 as a time series object
    myts = ts(myvector, start = c(2009, 1), end = c(2014, 12),
              frequency = 12)

    # Seasonal decomposition
    fit = stl(myts, s.window = "periodic")
    plot(fit, col = "purple")

    # additional plots
    monthplot(myts)
    library(forecast)
    seasonplot(myts)

(file: Example_ts_2.R)

ts (Time Series, Correlation) (3)

    data(USAccDeaths)
    USAccDeaths
    # bruit.blanc: white noise; the end= argument is cut off in the source,
    # so it is omitted here and ts() infers the end from the data
    bruit.blanc = ts(rnorm(100), frequency = 1, start = c(1))
    plot(bruit.blanc)
    acf(bruit.blanc, type = "correlation")   # or type = "covariance"
                                             # (rest of the comment is cut off in the source)

(file: Example_ts_3.R)

Other types, conversions

Many other types exist in R (Date, matrix, etc.). Conversions are done with as.my_type(my_object).

    (d1 = as.Date("1999-09-17"))
    class(d1)
    (d2 = as.Date("19980901", format = "%Y%m%d"))
    (my_dates = seq(d1, d2, by = "-1 month"))   # monthly dates going backwards from d1 to d2
    (x = as.numeric(d1))                        # number of days since 1970-01-01
    (my_ch = as.character(d2))

    (my_factor1 = factor(c("blue", "red", "green", "red")))
    (v = as.character(my_factor1))

(file: Dates_and_conversions.R)

Getting and changing the working directory

    getwd()               # get working directory
    setwd("~/Desktop/")   # set working directory

    # setwd works with linux absolute or relative pathnames:
    #   ~/             returns you to your login directory
    #   /home/         takes you to the home directory
    #                  where user login directories are usually stored
    #   ../            moves you up one directory
    #   ./my_sub_dir   moves you down to the subdirectory
    #                  named "my_sub_dir"

    getwd()   # get working directory
    dir()     # list of directory files
    ls()      # list of R objects

(file: Linux_pathnames.R)

Importing data

A key step in data analysis. Often: complicated!

    (x = scan("my_data_file.dat"))
    class(x)   # scan gives a vector back

    (y = read.table("my_table.dat", header = TRUE))
    class(y)   # read.table gives a data.frame back

    ta = read.csv(file = "exo1.csv", header = TRUE, skip = 6)
    # here, read.table also works (with sep = ",")
    # "skip" skips the given number of rows at the top of the file

    h = read.table("http://www.cmapx.polytechnique.fr/~benaych/201...",   # URL cut off in the source
                   header = TRUE, fill = TRUE)   # fill: blank fields
                                                 # added when rows
                                                 # have missing values
    names(h)
    str(h)
    summary(h)

(file: Import_data.R)

Exporting data

Much easier than importing.

    write(runif(5), file = "my_writing_file.dat")

    y = read.table("my_table.dat", header = TRUE)
    write.table(y, file = "my_writing_file_for_table.dat")
    write.csv(y, file = "my_writing_file_for_table.csv")
    (z = read.table("my_writing_file_for_table.dat", header = TRUE))

    ta = read.csv(file = "exo1.csv", header = TRUE, skip = 6)
    write.csv(ta, file = "../my_csv_file.csv")

(file: Export_data.R)

Import/export of R objects

    myvector = .1 * (1:72) + sin((1:72) * pi / 12) +
      rnorm(72, mean = 0, sd = .3)

    # from Jan 2009 to Dec 2014 as a time series object
    myts = ts(myvector, start = c(2009, 1), end = c(2014, 12),
              frequency = 12)

    save(myts, file = "my_rda_file.rda")
    rm(list = ls())           # removes every object of the workspace
    print(myts)               # error: myts was removed by rm()
    load("my_rda_file.rda")   # restores myts from the file
    print(myts)
    plot(myts)

(file: Import_export_R_objects.R)

Functions

    f = function(x) {x * x}
    f(5)
    f(1:7)

    # n draws, each equal to 1 with probability proba and 0 otherwise
    PF = function(n = 1, proba = .5) {
      as.integer(runif(n) < proba)
    }
    PF(10)
    PF(10, .1)

    # same draws, returned together with the observed frequency of 1s
    PF2 = function(n = 1, proba = .5) {
      ech = as.integer(runif(n) < proba)   # ech: the sample
      freq = mean(ech)
      return(list(sample = ech, frequency = freq))
    }
    PF2(10, .6)

(file: Functions.R)

Control structures

    x = runif(1); y = c(); z = c()   # x: random in [0, 1]
    for (i in 1:10) y[i] = sin(pi * i / 10)
    y == sin(pi * (1:10) / 10)       # same result without a loop

    for (i in 1:10) {
      if (y[i] < .5) {z[i] = y[i]} else {z[i] = y[i] + 1}
    }
    z == ((y < .5) * y + (y >= .5) * (y + 1))

    while (x < .5) {x = runif(1)}
    x

    # way quicker: vectorized approach:
    v = runif(1E5)   # 10^5 random numbers in [0, 1]
    t = Sys.time()
    for (i in 1:length(v)) {v[i] = v[i] + 5}
    t1 = Sys.time() - t
    t = Sys.time()
    v = v + 5
    t2 = Sys.time() - t
    as.numeric(t1, units = "secs") / as.numeric(t2, units = "secs")   # speed-up ratio

(file: Control_structures.R)

Graphics: plot, points, lines

    graphics.off()   # erases previous graphics
    x = seq(.4, 4, .1); y = log(x^2 + 1 / x^2)
    plot(x, y, type = "p")            # p for points
    plot(x, y, type = "l")            # l for line
    plot(x, y, type = "s")            # s for step
    plot(x, y, type = "p", pch = 2)   # pch: type of points
    ## ylim = c(ay, by), xlim = c(ax, bx): bounds of the axes
    ## lty: type of line, col: color, main: title
    x = rnorm(20); y = rexp(20)
    plot(x, y, col = "red", ylim = c(-0.5, 5), xlim = c(-3, 2))
    points(x + .5, y - 1, pch = 2, col = "blue")   # add points
    legend("topleft", c("plot_1", "plot_2"), col = c("red", "blue"),
           pch = c(1, 2))
    lines(x, y, lty = 2, col = "green")       # add line
    abline(h = 3, col = "yellow", lty = 1)    # add horizontal line
    abline(v = -1.3, col = "aquamarine2")     # add vertical line
    abline(a = 1, b = 1, col = "coral3")      # add oblique line
    text(-2, 2, "commentary"); title("Example curves")

(file: Plot_Intro.R)

The many "methods" of plot

    # Images produced by plot(x) depend on the class of x:
    methods(plot)

    # Examples:
    class(sin); plot(sin, xlim = c(-pi, pi), xlab = "x", ylab = "sin(x)")

    x = rpois(1000, 1); y = table(x); class(y); plot(y)

    # save a numeric vector containing 72 monthly observations
    myvector = .1 * (1:72) + sin((1:72) * pi / 12) +
      rnorm(72, mean = 0, sd = .3)
    # from Jan 2009 to Dec 2014 as a time series object
    myts = ts(myvector, start = c(2009, 1), end = c(2014, 12),
              frequency = 12)
    class(myts); plot(myts, col = "red")

    r = rnorm(1E3, 1); w = density(r); class(w)
    plot(w)

(file: Plot_Intro_2-methods.R)

Probability, random variables, distributions

    # Discrete distributions: binom, hyper, pois, geom, sample, ...
    # Continuous: norm, unif, t, chisq, ...

    # General syntax, for "my_law" a probability distribution:
    # rmy_law(n, ...): sample
    # dmy_law(x, ...): density of my_law at x
    # pmy_law(x, ...): P(X <= x) for X distributed as my_law
    # qmy_law(a, ...): quantile with level a

    x = seq(-5, 5, .01)
    u = seq(0, 1, .01)
    par(mfrow = c(2, 2))   # defines an array of plots
    plot(rnorm(500, 0, 1), pch = 20, col = "red")         # sample
    plot(x, dnorm(x, 0, 1), type = "l", col = "blue")     # density
    plot(x, pnorm(x, 0, 1), type = "l", col = "purple")   # P(X <= x)
    plot(u, qnorm(u, 0, 1), type = "l", col = "green")    # quantiles
    graphics.off()   # erases graphics

(file: Probability_Random_variables_Distributions.R)

Graphical description of data: hist, boxplot, pie, barplot...

    data(iris); str(iris); summary(iris); head(iris)
    x = iris$Sepal.Length
    mean(x); sd(x); var(x)
    quantile(x, c(.25, .5, .75))
    boxplot(x)
    hist(x)
    hist(x, breaks = 12)

    # one simulated weather record per day, with weights 5 : 4 : 1
    records = sample(c("clear", "cloudy", "rainy"), 365, replace = TRUE,
                     prob = c(5, 4, 1))
    pie(table(records))
    barplot(table(records))

(file: Graphical_description_of_data-hist-boxplot-pie-barplot.R)

Graphical description of data: conditional distributions

    library(lattice)   # provides xyplot, histogram and bwplot
    xyplot(Ozone ~ Wind | Month, data = airquality, layout = c(5, 1))
    histogram(~Sepal.Length | Species, data = iris)
    bwplot(Sepal.Length ~ Species, data = iris)

    require(UsingR)       # opens the functions and data of
                          # the package UsingR
    attach(kid.weights)   # allows to write height
                          # instead of kid.weights$height...
    histogram(~weight | gender, data = kid.weights, layout = c(1, 2))
    histogram(~height | gender, data = kid.weights, layout = c(1, 2))
    mean(weight[gender == "M"])
    mean(weight[gender == "F"])
    mean(height[gender == "M"])
    mean(height[gender == "F"])

    H = read.csv(file = "male_female_over_20_heights.csv")
    str(H)
    histogram(~H$height | H$gender, layout = c(1, 2))

Graphical description of data: correlations

    library(lattice)   # provides histogram
    require(UsingR)    # opens the functions and data of
                       # the package UsingR
    data(kid.weights)
    X = kid.weights; str(X)
    pairs(X[, 1:3], col = X[, 4], pch = 20)   # points colored by the 4th column

    ## BMI: body mass index
    attach(kid.weights)   # allows to write height
                          # instead of kid.weights$height...
    ht_in_m = height * 2.54 / 100    # 2.54 cm per inch
    weight_in_kg = weight / 2.2046   # 2.2046 lbs. per kg
    bmi = weight_in_kg / ht_in_m^2
    histogram(bmi)
    histogram(~bmi | gender)

    library(car)
    states = as.data.frame(state.x77[, c("Murder", "Population",
                                         "Illiteracy", "Income", "Frost")])
    # argument names as on the original slide; recent versions of car
    # may expect spread and lty.smooth inside smooth = list(...)
    scatterplotMatrix(states, spread = T, pch = 20, lty.smooth = 2)

(file: Graphical_description_of_data-correlations.R)

Help, demos, examples

    help("plot")
    ?plot
    help.search("plot")
    ??plot

    demo()   # list of demonstrations
    demo(graphics)

    example(plot)   # the example function usually runs
                    # the examples that one can find at
                    # the end of the help files

(file: help_demo_examples.R)
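Alongside the help calls listed in the last section, a few other base R utilities can help locate a function or check its arguments without opening a help page. The sketch below is an addition to the slides, not part of the original files; it assumes only a default R session (packages utils, graphics and stats attached), and the variable name plot_functions is chosen here for illustration.

```r
# Names of all visible objects whose name contains "plot"
# (searches the attached packages, e.g. graphics and stats)
plot_functions <- apropos("plot")
head(plot_functions)

# Argument list of a function, printed in the console
args(seq_len)

# Argument names of a function, as a character vector
arg_names <- names(formals(rnorm))
arg_names   # "n" "mean" "sd"
```

apropos() matches on object names only, whereas help.search() (or ??) also searches the titles and keywords of the help pages, so the two are complementary.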