CorpLing10/Material/Handout frequencies
Corpus Linguistics – Some statistics in R
Heike Zinsmeister
Konstanz, 16.3.2010

1 Preliminaries

Start R by double-clicking on the R symbol.

1.1 Packages

For some tasks we will need additional R programs, which are distributed in packages.

To load a package that is already installed:

    library(package)

Example:

    library(languageR)

This loads a package by Harald Baayen which includes the programs and data described in his book Analyzing Linguistic Data (2008).

To install a new package, type in the R window:

    install.packages(c(list of packages), repos = "url of download repository")

Example (package names must be given in quotes):

    install.packages(c("zipfR", "languageR"), repos = "http://cran.r-project.org")

This will install the packages zipfR and languageR.

    install.packages(c("rpart", "chron", "Hmisc", "Design", "Matrix", "lme4", "coda", "e1071", "zipfR", "ape", "languageR"), repos = "http://cran.r-project.org")

This will install all packages that are required to follow Baayen's book.

1.2 Reading and writing of tables

1.2.1 Create a table in R

First, we create a simple table x by hand:

    Filler <- c("aeh", "silence", "aehm")
    Frequency <- c(394, 332, 274)
    x <- data.frame(Filler, Frequency); x

Using the values of one factor as row names:

    x <- data.frame(Frequency, row.names = Filler); x

1.2.2 Saving a data frame

The name of the table is x. Save it as file /Users/cluser/Desktop/Fillers.txt:

    write.table(x, "/Users/cluser/Desktop/Fillers.txt", quote = FALSE, sep = "\t", row.names = TRUE, col.names = TRUE)

1.2.3 Loading tabular data into R

read.table() loads tabular data from a file into R. Load file /Users/cluser/Desktop/Fillers.txt; file.choose() opens a dialog in which you select the file:

    x <- read.table(file.choose(), header = TRUE, sep = "\t")   # new data frame = x

1.2.4 Objects (and functions) on the workspace

List all objects on the workspace:

    ls(all = TRUE)

Delete all objects on the workspace:

    rm(list = ls(all = TRUE))

Check the workspace (list all objects):

    ls(all = TRUE)

Task: Load file /Users/cluser/Desktop/Fillers.txt into R (see above). Then check the workspace again.
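The save-and-load steps of sections 1.2.2 and 1.2.3 can be tried out as a round trip. The sketch below is an illustration only: it writes to a temporary file created with tempfile() instead of the Desktop path and reads it back with an explicit path instead of file.choose(). The names path and y are made up for this example.

```r
# Rebuild the table from section 1.2.1 with the fillers as row names
Filler <- c("aeh", "silence", "aehm")
Frequency <- c(394, 332, 274)
x <- data.frame(Frequency, row.names = Filler)

# Assumption: a temporary file stands in for /Users/cluser/Desktop/Fillers.txt
path <- tempfile(fileext = ".txt")

# Write tab-separated text with row names and a header line
write.table(x, path, quote = FALSE, sep = "\t", row.names = TRUE, col.names = TRUE)

# Read the file back into a new data frame y
y <- read.table(path, header = TRUE, sep = "\t")

rownames(y)    # the fillers come back as row names
y$Frequency    # the frequencies come back as a numeric column
```

The header line written this way has one field fewer than the data lines, so read.table() uses the first column as row names without further options.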
1.3 Plots

First, create two simple vectors a and b:

    a <- c(1,3,5,2,4); b <- 1:5

1.3.1 Scatter plot

    plot(a)

1.3.2 Printing a graphic to a file

png() creates a PNG graphic. For other graphic formats see help(png) and help(postscript).

    png("/Users/cluser/Desktop/plot_a.png", width=300, height=300)   # saves the plot as 'plot_a.png' on the Desktop
    plot(a)
    dev.off()   # ends writing the file

1.3.3 Axes and titles

Compare plot(a) with plot(a,b):

    plot(a,b)

Attributes of plot (and other graphics):

    main                  an overall title for the plot: see help(title)
    sub                   a subtitle for the plot
    xlab                  a title for the x axis: x axis label
    ylab                  a title for the y axis: y axis label
    cex.main              relative font size of the overall title
    cex.lab               relative font size of the axis titles
    cex.axis              relative size of axis annotation
    xlim=c(xmin, xmax)    limits of the x axis annotation

Example:

    plot(b, a, xlab="x axis label", ylab="y axis label", main="My plot", cex.lab=1.5, ylim=c(1,7))

1.3.4 Different plot types

The argument type selects the plot type:

    p    for points
    l    for lines
    b    for both
    c    for the lines part alone of b
    o    for both 'overplotted'
    h    for 'histogram' like (or 'high-density') vertical lines
    s    for stair steps: moves first horizontal, then vertical
    S    for stair steps: moves first vertical, then horizontal
    n    for no plotting

Example:

    plot(b, a, type="b", xlab="x axis label", ylab="y axis label", main="My plot", cex.lab=1.5)

1.3.5 Histograms

Besides plot type "h", you can create histograms with the function hist(x), where x is a numeric vector of values to be plotted. The option freq=FALSE plots probability densities instead of frequencies. The option breaks=n controls the number of bins (see: http://www.statmethods.net/graphs/density.html).

1.3.6 Barplot

Create barplots with the barplot(height) function, where height is a vector or matrix. If height is a vector, the values determine the heights of the bars in the plot.
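A minimal sketch of the vector case, reusing the filler frequencies from section 1.2.1 as a named vector so that the names appear as bar labels. The call pdf(NULL) is an assumption made here so that the example also runs without a graphics window; in an interactive session it can be left out together with dev.off(). The names freq and mids are made up for this example.

```r
# Filler frequencies from section 1.2.1 as a named numeric vector
freq <- c(aeh = 394, silence = 332, aehm = 274)

pdf(NULL)   # assumption: draw to a null device (no file, no window)
# One bar per element; barplot() returns the x positions of the bar midpoints
mids <- barplot(freq, main = "Fillers", ylab = "Frequency")
invisible(dev.off())

mids        # one midpoint per bar, useful for adding text above the bars
```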
If height is a matrix and the option beside=FALSE, then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked 'sub-bars'. If height is a matrix and beside=TRUE, then the values in each column are juxtaposed rather than stacked. Include the option names.arg=(character vector) to label the bars. The option horiz=TRUE creates a horizontal barplot (see: http://www.statmethods.net/graphs/bar.html).

2 Data frame

2.1 Setup

Load the data set verbs of the package languageR:

    library(languageR)   # if not already loaded
    data(verbs)

Read the documentation of verbs. What kind of data is it? How many data points ('observations') does it include? How many variables? What scales do the variables belong to?

    help(verbs)

Make yourself familiar with the data itself:

    str(verbs)
    head(verbs)   # shows the n first lines (default n=6); see also tail()

2.2 Navigation in a table

Find rows and columns in the table: table_name[rows,columns].

Display the first row:

    verbs[1,]

Display the third row:

    verbs[3,]

Display the first three rows:

    verbs[1:3,]   # range operator

Display the first column:

    verbs[,1]

Display the value of the third row in the first column:

    verbs[3,1]

2.3 Restrictions on table values

Show only those rows in which AnimacyOfTheme is animate. This is a restriction on rows, hence it precedes the comma:

    verbs[verbs$AnimacyOfTheme=="animate" , ]

AnimacyOfTheme has only two levels (animate, inanimate).
The last command is therefore equivalent to:

    verbs[verbs$AnimacyOfTheme!="inanimate" , ]

Restrict the display even further to rows in which both AnimacyOfTheme and AnimacyOfRec are animate:

    verbs[verbs$AnimacyOfTheme=="animate" & verbs$AnimacyOfRec=="animate",]

If the columns are made accessible to R with attach(), the syntax becomes simpler, since you can use the column names (= variable names) directly:

    attach(verbs)
    verbs[AnimacyOfTheme=="animate" & AnimacyOfRec=="animate",]

3 Frequencies

3.1 Extracting contingency tables from data frames

Contingency tables sum up and cross-tabulate the frequencies of individual values of categorical variables.

The values of the variables RealizationOfRec and AnimacyOfRec of the data set verbs:

    levels(RealizationOfRec)   # [1] "NP" "PP"
    levels(AnimacyOfRec)       # [1] "animate" "inanimate"

The function xtabs() creates a contingency table. It has the general syntax:

    xtabs(dependent variable ~ predictor 1 + predictor 2 + ..., data = data_name)

3.2 Distribution of the values of a single variable

    xtabs( ~ RealizationOfRec, data = verbs)
    xtabs( ~ AnimacyOfRec)

After attach(verbs), the specification data = verbs is redundant.

3.3 Goodness-of-fit test: chi-square test

Is the realization of the recipient optional, i.e. randomly distributed, or is the observed difference in frequency statistically significant?

Hypotheses:
• H0: The frequencies of the values of the variable realization of the recipient are identical; variation in the sample is due to chance: $n_{NP} = n_{PP}$
• H1: The frequencies of the values of the variable realization of the recipient are not identical; variation in the sample is not due to chance: $n_{NP} \neq n_{PP}$

Goal: To determine whether the probability of wrongly rejecting H0 is lower than a recognized significance level.
    p > 0.05     not significant, H0 cannot be rejected
    p < 0.05     significant, H0 can be rejected
    p < 0.01     highly significant, H0 can be rejected
    p < 0.001    very highly significant, H0 can be rejected

Table 1: Significance levels of error probability

Conditions for the chi-square test:
• all observations are independent
• 80% of the expected frequencies are ≥ 5
• all frequencies (in the contingency table) are > 1
• (Alternative for lower frequencies: fisher.test())

    xtabs( ~ RealizationOfRec)

    RealizationOfRec
     NP  PP
    555 348

    chisq.test(xtabs( ~ RealizationOfRec))

        Chi-squared test for given probabilities
    data:  xtabs(~RealizationOfRec)
    X-squared = 47.4518, df = 1, p-value = 5.637e-12

Interpretation: The frequency difference between the realization of the recipient as NP and as PP is statistically significant (X-squared = 47.4518, df = 1, p<0.001).

3.4 Confidence interval for real distribution

To generalize from the observed sample frequencies to the population, a confidence interval at the 0.95 level is computed:

    prop.test(555, 903, conf.level = 0.95)

        1-sample proportions test with continuity correction
    data:  555 out of 903, null probability 0.5
    X-squared = 46.9945, df = 1, p-value = 7.119e-12
    alternative hypothesis: true p is not equal to 0.5
    95 percent confidence interval:
     0.5818929 0.6463551
    sample estimates:
           p
    0.614618

Interpretation: With 95% confidence, the recipient is realized as NP in 58% to 65% of all double object cases.

    prop.test(420, 903, conf.level = 0.95)

        1-sample proportions test with continuity correction
    data:  420 out of 903, null probability 0.5
    X-squared = 4.2569, df = 1, p-value = 0.03909
    alternative hypothesis: true p is not equal to 0.5
    95 percent confidence interval:
     0.4322516 0.4982816
    sample estimates:
            p
    0.4651163

Interpretation: ...

The two-sample version of prop.test (here with a two-sided alternative) tests whether two proportions (also from different samples, e.g. n=c(500,700)) are statistically the same:
    prop.test(x=c(555,420), n=c(903,903), alt="two.sided")

        2-sample test for equality of proportions with continuity correction
    data:  c(555, 420) out of c(903, 903)
    X-squared = 40.0241, df = 1, p-value = 2.508e-10
    alternative hypothesis: two.sided
    95 percent confidence interval:
     0.1029411 0.1960622
    sample estimates:
       prop 1    prop 2
    0.6146179 0.4651163

Interpretation: ...

3.5 Cross-tabulation of two (independent) variables

Example with two predictors only (= independent variables):

    xtabs( ~ RealizationOfRec + AnimacyOfRec)

Task: Try to interpret the data. Is there a preference for the realisation of the recipient dependent on its animacy?

3.5.1 Proportions

You will get a better feeling for the data if you look at the proportions instead of the absolute frequencies:

    verbs.xtabs = xtabs( ~ RealizationOfRec + AnimacyOfRec)
    prop.table(verbs.xtabs, 1)   # rows sum to 1
    prop.table(verbs.xtabs, 2)   # columns sum to 1

Sometimes the output is more readable if the proportions are rounded:

    round(prop.table(verbs.xtabs, 1), 2)         # 2 = number of digits after the decimal point
    round(prop.table(verbs.xtabs, 1), 2) * 100   # percentages

3.5.2 Simple visualisation of cross-tabled variables: barplot

    barplot(verbs.xtabs, legend.text = TRUE)                # stacked representation
    barplot(verbs.xtabs, legend.text = TRUE, beside=TRUE)   # columns are shown as juxtaposed bars

3.5.3 Mosaic plot

This type of plot conditions one variable on another and displays relative preferences:

    plot(RealizationOfRec ~ AnimacyOfRec)

3.5.4 Association plot

In the association plot, the height of each rectangle is proportional to the deviation of the observed frequency from the expected (random) frequency. The width is proportional to the square root of the expected frequency. Hence, the area of the rectangle is proportional to the difference between observed and expected frequency. Black = higher frequency than expected, red = lower frequency than expected.

    assocplot(verbs.xtabs)

3.6 Test of independence: chi-square test

Is the observed difference statistically significant?
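The test compares the observed frequencies with the frequencies expected if the two variables were independent. The following sketch shows one way to compute those expected frequencies by hand. It assumes the four counts of the cross-tabulation of RealizationOfRec and AnimacyOfRec in verbs (the table printed in section 3.6.1), typed in directly so that the example runs without languageR. The names obs and expected are made up for this example.

```r
# Observed counts, typed in by hand (rows: realization, columns: animacy)
obs <- matrix(c(521, 301, 34, 47), nrow = 2,
              dimnames = list(RealizationOfRec = c("NP", "PP"),
                              AnimacyOfRec = c("animate", "inanimate")))

# Expected count of a cell under independence: row total * column total / N
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 1)

# Check of the condition on expected frequencies before running the test
all(expected >= 5)
```

The same values are available from the test object itself via chisq.test(obs)$expected.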
Hypotheses:
• H0: The frequencies of the values of the dependent variable realization of the recipient do not co-vary with the frequencies of the values of the independent variable animacy of the recipient.
• H1: The frequencies of the values of the dependent variable realization of the recipient do co-vary with the frequencies of the values of the independent variable animacy of the recipient.

Goal: To determine whether the probability of wrongly rejecting H0 is lower than a recognized significance level.

Conditions for the chi-square test:
• all observations are independent
• 80% of the expected frequencies are ≥ 5
• all frequencies (in the contingency table) are > 1
• (Alternative for lower frequencies: fisher.test())

    chisq.test(verbs.xtabs)

        Pearson's Chi-squared test with Yates' continuity correction
    data:  verbs.xtabs
    X-squared = 13.3755, df = 1, p-value = 0.0002549

Interpretation: The difference between NP and PP realization given different levels of animacy is statistically significant (X-squared = 13.3755, df = 1, p<0.001).

3.6.1 Odds ratio

Significance doesn't tell us anything about the strength of the difference. How strong is the difference?

Compare:
• probability p: frequency of a value in relation to all events
• odds O: probability of a value to occur in relation to its non-occurrence

$\text{odds} = \frac{p_E}{1 - p_E}$

$\text{odds ratio} = \frac{O_1}{O_2}$

Example:

    verbs.xtabs
                    AnimacyOfRec
    RealizationOfRec animate inanimate
                  NP     521        34
                  PP     301        47

    O_np_animate = (521/34)/(301/47)   # = (521*47)/(34*301)
    O_np_animate
    [1] 2.392711

Interpretation: The odds that a recipient is realized as NP are about 2.4 times higher if the recipient is animate than if it is inanimate.

3.7 CART analysis: Classification tree

This section is based on Baayen (2008: 148-154), section 5.2.1. See the book for further explanation.

For the next task, we will load the 'real' data:

    data(dative)
    help(dative)
    attach(dative)
    str(dative)

Question: Can the realization of the recipient as NP or PP be predicted from the other variables?
For the CART analysis we need package rpart:

    library(rpart)

We create an initial tree for the realization of the recipient. Each node is a decision. If the answer is "yes", one proceeds along the left branch, else along the right branch, top-down from the root to the leaves.

    dative.rp = rpart(RealizationOfRecipient ~ ., data = dative[ ,-c(1, 3)])   # exclude the columns with subjects, verbs
    plot(dative.rp, compress = TRUE, branch = 1, margin = 0.1)
    text(dative.rp, use.n = TRUE, pretty = 0)

3.7.1 Pruning

The initial tree overfits the data and can be pruned according to a cost-complexity parameter:

    plotcp(dative.rp)   # ten-fold cross validation

For an explanation see Baayen (2008: 151).

Then, we create the pruned tree:

    dative.rp1 = prune(dative.rp, cp = 0.041)
    plot(dative.rp1, compress = TRUE, branch = 1, margin = 0.1)
    text(dative.rp1, use.n = TRUE, pretty = 0)

3.7.2 Quality of the classification tree

predict() extracts predictions from the model:

    head(predict(dative.rp1))

Choose the realization with the higher probability:

    choiceIsNP = predict(dative.rp1)[,1] >= 0.5
    choiceIsNP[1:6]

Combine this vector with the original observations:

    preds = data.frame(obs = RealizationOfRecipient, choiceIsNP)
    head(preds)

Cross-tabulation:

    xtabs( ~ obs + choiceIsNP, data = preds)

Interpretation: Only 269+177 = 446 data points out of 3263 are misclassified (13.7%). The baseline was 26%.

3.8 Regression analysis: Generalized linear mixed model

This section uses data and code from Baayen (2008), section 7.4. See there for further explanations and further development of the argumentation.
    library(Design)
    dative.dd = datadist(dative)
    options(datadist = "dative.dd")
    dative.lrm = lrm(RealizationOfRecipient ~ AccessOfTheme + AccessOfRec +
        LengthOfRecipient + AnimacyOfRec + AnimacyOfTheme + PronomOfTheme +
        DefinOfTheme + LengthOfTheme + SemanticClass + Modality, data = dative)
    anova(dative.lrm)

    Wald Statistics          Response: RealizationOfRecipient

    Factor              Chi-Square   d.f.   P
    AccessOfTheme            30.79      2   <.0001
    AccessOfRec             258.06      2   <.0001
    LengthOfRecipient        69.87      1   <.0001
    AnimacyOfRec             93.35      1   <.0001
    AnimacyOfTheme            3.71      1   0.0542
    PronomOfTheme            54.42      1   <.0001
    DefinOfTheme             28.72      1   <.0001
    LengthOfTheme            79.03      1   <.0001
    SemanticClass           166.55      4   <.0001
    Modality                 49.91      1   <.0001
    TOTAL                   747.64     15   <.0001

Visualization of the data:

    par(mfrow = c(4,3))
    plot(dative.lrm)
    par(mfrow = c(1,1))

4 References and further reading

• General
  – Baayen, R.H. 2008. Analyzing Linguistic Data. Cambridge University Press. (pre-print draft: www.ualberta.ca/~baayen/publications/baayenCUPstats.pdf)
  – Gries, S.T. 2008. Statistik für Sprachwissenschaftler. Vandenhoeck & Ruprecht.
• Data frame
  – Baayen. 2008. Section 1.3.
• Association plot and chi-square test
  – Gries. 2008. Sections 4.1.1.2, 4.1.2.2.
• CART analysis
  – Baayen. 2008. Section 5.2.1.
  – Heylen, Kris. 2005. A Quantitative Corpus Study of German Word Order Variation. In: Reis, M. & Kepser, S. (eds.) Evidence in Linguistics: Empirical, Theoretical, and Computational Perspectives. Mouton de Gruyter, 241-263.