slides Block 1 - trutschnig.net
Statistik mit R für Fortgeschrittene (Statistics with R for Advanced Users)
Internal training course FOR SS16-08
Block 1: Orientation
Ass.-Prof. Dr. Wolfgang Trutschnig
Arbeitsgruppe Stochastik/Statistik, FB Mathematik, Universität Salzburg
www.trutschnig.net
Salzburg, 2016-06-03

Sections: Possible contents, Descriptive tools, Examples

Possible contents (basis for discussion)
1. Focus on statistics, with R as the tool
2. Focus on efficient programming in R, with data analysis as the tool
3. R-shiny: interactive apps with R (web application framework for R)
   - Very useful for teaching
   - Makes it possible to 'play around' with data
4. knitr: dynamic reporting with R
   - Combines R with LaTeX (text processing)
   - Keyword: reproducible research
5. Requests?

(Relative) frequencies, histogram, empirical distribution function, quantiles, boxplots
- Given: numerical data (sample) $x_1, \ldots, x_n$ in the interval $[a, b]$.
- Histogram: a clear, simple display of the data. Partition $[a, b]$ into intervals $I_1, \ldots, I_k$ and count how many values fall into each interval, i.e.
  $h_n(I_j) := \#\{m : x_m \in I_j\}, \qquad r_n(I_j) := h_n(I_j)/n$
[Figure: two panels, both titled "Histogramm sum_out"; x-axis sum_out (0 to 30000), y-axis Frequency]

(Relative) frequencies, histogram, empirical distribution function, quantiles, boxplots
- Given: numerical data (sample) $x_1, \ldots, x_n$ in the interval $[a, b]$.
- Empirical distribution function: a clear, simple display of the data. For every $x \in [a, b]$, count how many values are $\le x$, i.e.
  $F_n(x) := \#\{i : x_i \le x\}/n$
[Figure: left panel "Histogramm sum_out" (x-axis sum_out, y-axis Frequency); right panel "empirische Verteilungsfunktion" (empirical distribution function; x-axis x from 0 to 30000, y-axis Fn(x) from 0 to 1)]

(Relative) frequencies, histogram, empirical distribution function, quantiles, boxplots
- Given: numerical data (sample) $x_1, \ldots, x_n$ in the interval $[a, b]$; let $F_n$ denote the empirical distribution function.
- For every $p \in [0, 1]$, $F_n^{(-1)}(p) := \min\{x : F_n(x) \ge p\}$ is called the $p$-quantile of the sample.
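The definitions of $h_n$, $r_n$, $F_n$ and the $p$-quantile above can be checked by hand on a tiny sample. The following is a minimal sketch in R, not taken from the course script: the sample `x`, the interval boundaries `breaks` and the helper `emp_quantile` are made up for illustration. It relies only on base R functions (`cut`, `table`, `ecdf`, `quantile`).

```r
# Hypothetical toy sample in [a, b] = [0, 10]; not data from the course.
x <- c(2, 3, 3, 5, 7, 8, 9, 10)
n <- length(x)

# Partition [0, 10] into k = 4 intervals I_1, ..., I_4:
# [0, 2.5], (2.5, 5], (5, 7.5], (7.5, 10]
breaks <- c(0, 2.5, 5, 7.5, 10)
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

h <- table(bins)   # absolute frequencies h_n(I_j)
r <- h / n         # relative frequencies r_n(I_j)

# Empirical distribution function F_n(x) = #{i : x_i <= x} / n
Fn <- ecdf(x)

# p-quantile written out as min{x : F_n(x) >= p}
emp_quantile <- function(p) min(x[Fn(x) >= p])
q_hand <- sapply(c(0.25, 0.5, 0.75), emp_quantile)

# quantile() with type = 1 is based on the inverse of the empirical
# distribution function, so it should agree with the hand-written version
q_builtin <- quantile(x, probs = c(0.25, 0.5, 0.75), type = 1, names = FALSE)

print(h)
print(r)
print(q_hand)
print(q_builtin)
```

Calling `hist(x, breaks = breaks)` and `plot(Fn)` on the same objects draws the corresponding histogram and step function. Note that `quantile()` uses a different, interpolating definition by default (`type = 7`), which is why `type = 1` is set explicitly here.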
[Figure: left panel "Histogramm sum_out" (x-axis sum_out, y-axis Frequency); right panel "empirische Verteilungsfunktion" (empirical distribution function; x-axis x from 0 to 30000, y-axis Fn(x) from 0 to 1)]

(Relative) frequencies, histogram, empirical distribution function, quantiles, boxplots
- Given: numerical data (sample) $x_1, \ldots, x_n$ in the interval $[a, b]$.
- A boxplot is a summary display based on the 0.25, 0.5 and 0.75 quantiles and the outliers:
[Figure: left panel "empirische Verteilungsfunktion" (x-axis x from 0 to 30000, y-axis Fn(x) from 0 to 1); right panel "Boxplot pro Jahr" (boxplot per year) of sum_out for 2007, 2008, 2009 and total]

(Relative) frequencies, histogram, empirical distribution function, quantiles, boxplots
- Why has the mean not even been mentioned so far?
- OeNB: "In 2004 the average Austrian household held financial assets of around 55,000 euros."
- Information content?
- The mean is very sensitive to outliers: changing a single value can change the mean drastically. It is not robust!
[Figure: two histograms, "Histogramm von x" (histogram of x) and "Histogramm von x plus einmal 1.000.000" (histogram of x with the single value 1,000,000 added); x-axis x (0 to 1200), y-axis Frequency]

Learning by doing
- Using the descriptive tools to analyse a first real data set:

  ymd         weekday  nr_weekday  sum_out  holiday
  2007-01-01  Mon      1            4040    1.00
  2007-01-02  Tue      2           22760    1.50
  2007-01-03  Wed      3           18810    0.00
  2007-01-04  Thu      4           24910    0.00
  2007-01-05  Fri      5           25650    0.50
  2007-01-06  Sat      6            5650    1.00

- The data set contains the time series of the daily amount of cash withdrawn at an ATM (at one branch of a bank).
- Original problem: developing reliable forecasts of the daily amount withdrawn in order to optimise the cash supply system (500 different branches, time series covering 3 years).
- Complete script at www.trutschnig.net/courses
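The earlier remark that the mean is not robust can be made concrete with the six sum_out values from the table above. The following is a minimal sketch in R under stated assumptions, not an excerpt from the course script: the variable names are illustrative, and the single added value of 1,000,000 is artificial, mirroring the "x plus one value of 1,000,000" histogram. The median (the 0.5 quantile) serves as the robust comparison.

```r
# The six sum_out values listed in the table above
sum_out <- c(4040, 22760, 18810, 24910, 25650, 5650)

# Same data plus one artificial extreme value (assumption, illustration only)
sum_out_outlier <- c(sum_out, 1e6)

mean_before   <- mean(sum_out)
mean_after    <- mean(sum_out_outlier)
median_before <- median(sum_out)
median_after  <- median(sum_out_outlier)

cat("mean:  ", mean_before, "->", mean_after, "\n")
cat("median:", median_before, "->", median_after, "\n")
```

With these numbers the mean jumps from 16970 to roughly 157403, while the median only moves from 20785 to 22760: a single value dominates the mean but barely affects the quantile-based summary.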