slides Block 1 - trutschnig.net

Possible content
Descriptive tools
Statistics with R for Advanced Users
(Internal training course FOR SS16-08)
Block 1: Orientation
Ass.-Prof. Dr. Wolfgang Trutschnig
Stochastics/Statistics Working Group
Department of Mathematics
University of Salzburg
www.trutschnig.net
Salzburg, 2016-06-03
Possible content (basis for discussion)
1. Focus on statistics, with R as the tool
2. Focus on efficient programming in R, with data analysis as the tool
3. R-shiny: interactive apps with R (web application framework for R)
   - Very useful for teaching
   - Makes it possible to 'play around' with data
4. knitR: dynamic reporting with R
   - Combines R with LaTeX (typesetting)
   - Keyword: reproducible research
5. Requests?
(Relative) frequencies, histogram, empirical distribution function, quantiles, boxplots
- Given: numerical data (a sample) $x_1, \ldots, x_n$ in the interval $[a, b]$
- Histogram: a clear, simple representation of the data. Partition $[a, b]$ into intervals $I_1, \ldots, I_k$ and count how many values fall into each interval, i.e. $h_n(I_j) := \#\{m : x_m \in I_j\}$ (absolute frequency) and $r_n(I_j) := h_n(I_j)/n$ (relative frequency)
[Figure: two histograms of sum_out, both titled "Histogram sum_out"; x-axis: sum_out (0 to 30000), y-axis: Frequency (one panel scaled up to 80, the other up to 400)]
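The counting rule behind the histogram can be tried out directly in R. The following is only a minimal sketch: it uses the first six sum_out values from the data table shown at the end of these slides, and the choice of [a, b] = [0, 30000] with three intervals of width 10000 is an assumption made for illustration, not the binning used in the plots.

```r
# First six daily withdrawal amounts (sum_out) from the data table in these slides
x <- c(4040, 22760, 18810, 24910, 25650, 5650)
n <- length(x)

# Partition [a, b] = [0, 30000] into k = 3 intervals (illustrative choice)
breaks <- seq(0, 30000, by = 10000)

# hist() counts the values per interval; plot = FALSE only returns the result.
# By default the intervals are right-closed: [0, 10000], (10000, 20000], ...
h <- hist(x, breaks = breaks, plot = FALSE)

h_n <- h$counts   # absolute frequencies h_n(I_j)
r_n <- h_n / n    # relative frequencies r_n(I_j)

print(h_n)
print(r_n)

# Draw the histogram itself
hist(x, breaks = breaks, main = "Histogram of sum_out", xlab = "sum_out")
```

The relative frequencies always sum to 1, which is a quick sanity check on the chosen intervals: if they do not, some values lie outside [a, b].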
(Relative) frequencies, histogram, empirical distribution function, quantiles, boxplots
- Given: numerical data (a sample) $x_1, \ldots, x_n$ in the interval $[a, b]$
- Empirical distribution function: a clear, simple representation of the data. For each $x \in [a, b]$, count how many values are $\le x$ and divide by $n$, i.e. $F_n(x) := \#\{i : x_i \le x\}/n$
[Figure: left panel "Histogram sum_out" (x-axis: sum_out, y-axis: Frequency); right panel "Empirical distribution function" (x-axis: x, 0 to 30000; y-axis: Fn(x), 0 to 1)]
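As a small illustration of the definition of Fn, the sketch below computes the empirical distribution function once by hand and once with base R's `ecdf()`. The six data values are the first sum_out entries from the data table in these slides; the helper name `Fn_manual` is made up for this example.

```r
x <- c(4040, 22760, 18810, 24910, 25650, 5650)

# Direct translation of the definition: F_n(t) = #{i : x_i <= t} / n
Fn_manual <- function(t) mean(x <= t)

# Base R's ecdf() returns the same step function as an R function
Fn <- ecdf(x)

print(Fn_manual(20000))   # 3 of the 6 values are <= 20000
print(Fn(20000))

# Step plot of the empirical distribution function
plot(Fn, main = "Empirical distribution function", xlab = "x", ylab = "Fn(x)")
```

Because `ecdf()` returns a function, it can be evaluated at arbitrary points and passed on to other code, for example to the quantile definition.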
(Relative) frequencies, histogram, empirical distribution function, quantiles, boxplots
- Given: numerical data (a sample) $x_1, \ldots, x_n$ in the interval $[a, b]$; let $F_n$ denote the empirical distribution function.
- For each $p \in [0, 1]$, $F_n^{(-1)}(p) := \min\{x : F_n(x) \ge p\}$ is called the $p$-quantile of the sample.
[Figure: left panel "Histogram sum_out" (x-axis: sum_out, y-axis: Frequency); right panel "Empirical distribution function" (x-axis: x, 0 to 30000; y-axis: Fn(x), 0 to 1)]
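The quantile definition above can be checked numerically. The sketch below implements the minimum over the sorted sample by hand (the helper `quantile_manual` is hypothetical) and compares it with `quantile(..., type = 1)`, which R documents as the inverse of the empirical distribution function. R's default, `type = 7`, interpolates between data points and will in general give different values.

```r
x <- c(4040, 22760, 18810, 24910, 25650, 5650)

# Direct translation of the definition: the smallest sample value
# at which the empirical distribution function reaches p
quantile_manual <- function(p) {
  xs <- sort(x)
  Fn <- ecdf(x)
  xs[which(Fn(xs) >= p)[1]]
}

p <- c(0.25, 0.5, 0.75)
q_manual <- sapply(p, quantile_manual)

# type = 1 is the inverse of the empirical distribution function
q_type1 <- quantile(x, probs = p, type = 1, names = FALSE)

print(q_manual)
print(q_type1)
```

It is enough to search over the sample values because Fn only jumps at data points, so the minimum in the definition is always attained at one of them.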
(Relative) frequencies, histogram, empirical distribution function, quantiles, boxplots
- Given: numerical data (a sample) $x_1, \ldots, x_n$ in the interval $[a, b]$
- A boxplot is a summary display based on the 0.25-, 0.5- and 0.75-quantiles and the outliers:
[Figure: horizontal boxplot of the sample (x-axis: x, 0 to 30000) shown together with the "Empirical distribution function" (y-axis: Fn(x), 0 to 1); further panel "Boxplot per year sum_out" with groups 2007, 2008, 2009 and total]
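The numbers behind a boxplot can be inspected with `boxplot.stats()`. This is a hedged sketch on the first six sum_out values from the data table in these slides. Note that R's boxplot uses Tukey's hinges and the interpolated median, which approximate the 0.25-, 0.5- and 0.75-quantiles and need not coincide exactly with the quantile definition given earlier.

```r
x <- c(4040, 22760, 18810, 24910, 25650, 5650)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
# out:   values further than 1.5 box lengths away from the box (outliers)
s <- boxplot.stats(x)

print(s$stats)
print(s$out)

# Horizontal boxplot, as in the figure on this slide
boxplot(x, horizontal = TRUE, xlab = "x")
```

For grouped displays such as one box per year, `boxplot()` also accepts a formula of the form `value ~ group` together with a data frame.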
(Relative) frequencies, histogram, empirical distribution function, quantiles, boxplots
- Why has the mean not even been mentioned so far?
- OeNB: "In 2004 the average Austrian household held financial assets of around 55,000 euros."
- How informative is that?
- The mean is very sensitive to outliers: changing a single value can change the mean drastically, so it is not robust!
[Figure: two histograms, "Histogram of x" and "Histogram of x plus a single value of 1,000,000"; x-axis: x (0 to 1200), y-axis: Frequency (0 to 200)]
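The lack of robustness is easy to reproduce. The sketch below does not use the data behind the two histograms (which are not given in the slides); as a stand-in it takes the first six sum_out values from the data table in these slides and appends a single value of 1,000,000.

```r
x <- c(4040, 22760, 18810, 24910, 25650, 5650)

# Same sample plus one extreme value
x_out <- c(x, 1e6)

# The mean is dragged far away by the single outlier ...
print(mean(x))
print(mean(x_out))

# ... while the median barely moves
print(median(x))
print(median(x_out))
```

With the outlier, the mean is larger than every one of the original observations, so it no longer describes a typical value, whereas the median stays within the bulk of the data.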
Learning by doing
- Using the descriptive tools to analyse a first real data set:

ymd         weekday  nr_weekday  sum_out  holiday
2007-01-01  Mon      1            4040    1.00
2007-01-02  Tue      2           22760    1.50
2007-01-03  Wed      3           18810    0.00
2007-01-04  Thu      4           24910    0.00
2007-01-05  Fri      5           25650    0.50
2007-01-06  Sat      6            5650    1.00

- The data set contains the time series of the daily amount of cash withdrawn from an ATM (at one branch of a bank).
- Original problem: developing reliable forecasts of the daily amount withdrawn in order to optimize the supply system (500 different branches, time series covering 3 years).
- Complete script at www.trutschnig.net/courses
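As a starting point for working with this data set, the six rows shown above can be typed into a data frame. This is only a sketch: the column names with underscores are an assumption based on the table header, the derived column `year` is introduced here for illustration, and the real data would be read from a file rather than entered by hand.

```r
# The six rows shown in the slide, entered by hand
atm <- data.frame(
  ymd        = as.Date(c("2007-01-01", "2007-01-02", "2007-01-03",
                         "2007-01-04", "2007-01-05", "2007-01-06")),
  weekday    = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
  nr_weekday = 1:6,
  sum_out    = c(4040, 22760, 18810, 24910, 25650, 5650),
  holiday    = c(1, 1.5, 0, 0, 0.5, 1)
)

# Derive the year from the date (assumed helper column for per-year summaries)
atm$year <- format(atm$ymd, "%Y")

str(atm)
print(summary(atm$sum_out))

# Daily withdrawals as a time series, and one box per year
plot(atm$ymd, atm$sum_out, type = "b", xlab = "ymd", ylab = "sum_out")
boxplot(sum_out ~ year, data = atm, main = "Boxplot of sum_out per year")
```

With the full three-year series, the same `boxplot(sum_out ~ year, ...)` call would produce one box for each year.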