CorpLing10/Material/Handout frequencies

Corpus Linguistics –
Some statistics in R
Heike Zinsmeister
Konstanz, 16.3.2010
1
Preliminaries
Start R by double-clicking the R icon.
1.1
Packages
For some tasks we will need additional functions and data sets, which are distributed as R packages. To load a package
that is already installed:
library(package)
Example:
library(languageR)
This loads a package by Harald Baayen that contains the functions and data described in his book Analyzing Linguistic Data (2008).
To install a new package, type in the R window:
install.packages(c(list of packages), repos ="url of download repository")
Example:
install.packages(c("zipfR", "languageR"), repos = "http://cran.r-project.org")
This will install the packages zipfR and languageR.
install.packages(c("rpart", "chron", "Hmisc", "Design", "Matrix", "lme4", "coda", "e1071",
"zipfR", "ape", "languageR"), repos = "http://cran.r-project.org")
This will install all packages that are required to follow Baayen’s book.
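To avoid reinstalling packages that are already present, one can first check which ones are missing. The following is a minimal sketch; the helper name missing_packages is an assumption made for illustration and is not part of R or of the handout. requireNamespace() is base R and returns FALSE for packages that are not installed (it loads the namespace of those that are).

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Base packages are always present, so nothing is reported as missing
to_install <- missing_packages(c("stats", "utils"))
to_install   # character(0)

# With the handout's packages one could then call, for example:
# install.packages(missing_packages(c("zipfR", "languageR")),
#                  repos = "http://cran.r-project.org")
```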
1.2
Reading and writing of tables
1.2.1
Create a table in R
First, we create a simple table x by hand:
Filler <- c("aeh" , "silence" , "aehm")
Frequency <- c(394,332,274)
x <- data.frame(Filler , Frequency); x
Using the values of one factor as row names:
x <- data.frame(Frequency, row.names=Filler); x
1.2.2
Saving a data frame
The name of the table is x. Save it as file /Users/cluser/Desktop/Fillers.txt:
write.table(x, "/Users/cluser/Desktop/Fillers.txt", quote=FALSE, sep="\t", row.names=TRUE, col.names=TRUE)
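To see what such a call writes, a round trip through a temporary file can help. This is a sketch under assumptions: tempfile() stands in for the Desktop path, and read.table() is used only to check that the table comes back unchanged.

```r
# Same toy table as in the handout
Filler <- c("aeh", "silence", "aehm")
Frequency <- c(394, 332, 274)
x <- data.frame(Frequency, row.names = Filler)

# tempfile() is an assumption; it replaces the Desktop path of the handout
f <- tempfile(fileext = ".txt")
write.table(x, f, quote = FALSE, sep = "\t", row.names = TRUE, col.names = TRUE)

readLines(f)   # the header line has one field fewer than the data lines

# Because of the shorter header, read.table() uses the first column as row names
y <- read.table(f, header = TRUE, sep = "\t")
y
identical(rownames(x), rownames(y))
all(x$Frequency == y$Frequency)
unlink(f)      # remove the temporary file
```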
1.2.3
Loading tabular data into R
read.table() loads tabular data from a file into R. Load the file /Users/cluser/Desktop/Fillers.txt; file.choose() opens a dialog in which you select it:
x <- read.table(file.choose(), header=TRUE, sep="\t") # new data frame = x
1.2.4
Objects (and functions) on the workspace
List all objects on the workspace:
ls(all=TRUE)
Delete all objects on the workspace:
rm(list=ls(all=T))
Check the workspace (list all objects):
ls(all=T)
Task:
Load file /Users/cluser/Desktop/Fillers.txt into R (see above). Then check the workspace again.
1.3
Plots
First, create two simple vectors a and b:
a<-c(1,3,5,2,4); b<- 1:5
1.3.1
Scatter plot
plot(a)
1.3.2
Printing a graphic to a file
png() opens a PNG file as graphics device. For other graphic formats see: help(png), help(postscript)
png("/Users/cluser/Desktop/plot_a.png", width=300, height=300)
# Saves the plot as ‘plot_a.png’ on the Desktop
plot(a)
dev.off() # ends writing the file
1.3.3
Axes and titles
Compare plot(a) with plot(a,b)
plot(a,b)
Attributes of plot (and other graphics):
main                 an overall title for the plot: see help(title)
sub                  a subtitle for the plot
xlab                 a title for the x axis: x axis label
ylab                 a title for the y axis: y axis label
cex.main             relative font size of the overall title
cex.lab              relative font size of the axis titles
cex.axis             relative size of axis annotation
xlim=c(xmin, xmax)   limits of the x axis annotation
Example:
plot(b, a, xlab="x axis label", ylab="y axis label", main="My plot", cex.lab=1.5, ylim=c(1,7))
1.3.4
Different plot types
The argument type of plot() selects the plot type:
p for points
l for lines
b for both
c for the lines part alone of b
o for both ‘overplotted’
h for ‘histogram’ like (or ‘high-density’) vertical lines
s for stair steps: moves first horizontal, then vertical
S for stair steps: moves first vertical, then horizontal
n for no plotting
Example:
plot(b, a, type="b", xlab="x axis label", ylab="y axis label", main="My plot", cex.lab=1.5)
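To compare the plot types side by side, the same data can be drawn once per type in a grid of panels. A small sketch, assuming the vectors a and b from above; the 3 x 3 layout is just one convenient choice.

```r
a <- c(1, 3, 5, 2, 4); b <- 1:5
types <- c("p", "l", "b", "c", "o", "h", "s", "S", "n")

op <- par(mfrow = c(3, 3))        # 3 x 3 grid of panels
for (ty in types) {
  plot(b, a, type = ty, main = paste0("type = \"", ty, "\""))
}
par(op)                           # restore the previous layout
```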
1.3.5
Histograms
Besides plot type "h", you can create histograms with the function hist(x) where x is a numeric vector of values
to be plotted. The option freq=FALSE plots probability densities instead of frequencies. The option breaks=n
controls the number of bins (see: http://www.statmethods.net/graphs/density.html).
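A short sketch of hist() with both options. The values in x are invented for illustration only. Note that a single number given to breaks is treated by R as a suggestion: the actual bin boundaries are set to "pretty" values, so the number of bins may differ.

```r
# Hypothetical values, invented only to have something to plot
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 7)

h <- hist(x, breaks = 4, freq = FALSE, main = "Histogram of x", xlab = "x")
h$breaks                          # bin boundaries actually used
h$counts                          # observations per bin
sum(h$density * diff(h$breaks))   # with freq=FALSE the bar areas sum to 1
```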
1.3.6
Barplot
Create barplots with the barplot(height) function, where height is a vector or matrix. If height is a vector, the
values determine the heights of the bars in the plot. If height is a matrix and the option beside=FALSE then
each bar of the plot corresponds to a column of height, with the values in the column giving the heights of
stacked ‘sub-bars’. If height is a matrix and beside=TRUE, then the values in each column are juxtaposed rather
than stacked. Include the option names.arg=(character vector) to label the bars. Use the option horiz=TRUE to create
a horizontal barplot (see: http://www.statmethods.net/graphs/bar.html).
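The options can be tried out as follows. This is a sketch: the vector repeats the filler frequencies from section 1.2.1, and the matrix m uses the counts printed for verbs.xtabs in section 3.6.1, typed in by hand so that no package is needed.

```r
# Vector input: one bar per value
Frequency <- c(394, 332, 274)
mids <- barplot(Frequency, names.arg = c("aeh", "silence", "aehm"),
                ylab = "Frequency")
mids                                           # x positions of the bar midpoints

# Matrix input: each column is one bar (stacked) or one group (juxtaposed)
m <- matrix(c(521, 301, 34, 47), nrow = 2,
            dimnames = list(c("NP", "PP"), c("animate", "inanimate")))
barplot(m, legend.text = TRUE)                 # stacked sub-bars
barplot(m, beside = TRUE, legend.text = TRUE)  # juxtaposed bars
barplot(m, beside = TRUE, horiz = TRUE)        # horizontal barplot
```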
2
Data frame
2.1
Setup
Load the data set verbs from the package languageR (the package must be loaded first with library(languageR)):
data(verbs)
Read the documentation of verbs. What kind of data is it? How many data points (‘observations’) does it
include? How many variables? What scales do the variables belong to?
help(verbs)
Make yourself familiar with the data itself.
str(verbs)
head(verbs) # shows the first n lines (default n=6); see also tail()
2.2
Navigation in a table
Find rows and columns in the table: table_name[rows,columns]. Display the first row:
verbs[1,]
Display the third row:
verbs[3,]
Display the first three rows:
verbs[1:3,] # range operator
Display the first column:
verbs[,1]
Display the value of the third row in the first column:
verbs[3,1]
2.3
Restrictions on table values
Show only those rows in which AnimacyOfTheme is animate. This is a restriction on rows, hence, it precedes
the comma:
verbs[verbs$AnimacyOfTheme=="animate" , ]
Since AnimacyOfTheme has only two levels (animate, inanimate), the last command is equivalent to:
verbs[verbs$AnimacyOfTheme!="inanimate" , ]
Restrict the display even further to rows in which both AnimacyOfTheme and AnimacyOfRec are animate:
verbs[verbs$AnimacyOfTheme=="animate" & verbs$AnimacyOfRec=="animate",]
If the data frame is attached with attach(), its columns can be accessed directly by their names
(= variable names), and the syntax becomes simpler:
attach(verbs)
verbs[AnimacyOfTheme=="animate" & AnimacyOfRec=="animate",]
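The same restriction can also be written without attach(), using subset() or with(), which look up the column names inside the data frame. A minimal sketch on a toy data frame: the values are invented for illustration and are not the verbs data, only the column names are borrowed.

```r
# Toy data frame; values invented for illustration (not the verbs data)
toy <- data.frame(
  AnimacyOfRec   = c("animate", "animate", "inanimate", "animate"),
  AnimacyOfTheme = c("animate", "inanimate", "inanimate", "animate")
)

# Logical indexing with explicit column access, as in the handout
r1 <- toy[toy$AnimacyOfTheme == "animate" & toy$AnimacyOfRec == "animate", ]

# The same restriction without attach()
r2 <- subset(toy, AnimacyOfTheme == "animate" & AnimacyOfRec == "animate")
r3 <- toy[with(toy, AnimacyOfTheme == "animate" & AnimacyOfRec == "animate"), ]

c(nrow(r1), nrow(r2), nrow(r3))   # 2 2 2: rows 1 and 4 satisfy both conditions
```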
3
Frequencies
3.1
Extracting contingency tables from data frames
Contingency tables sum up and cross-tabulate the frequencies of individual values of categorical variables. The
values of the variables RealizationOfRec and AnimacyOfRec of the data set verbs:
levels(RealizationOfRec) # [1] "NP" "PP"
levels(AnimacyOfRec) # [1] "animate" "inanimate"
The function xtabs() creates a contingency table. It has the general syntax:
xtabs( dependent variable ~ predictor 1 + predictor 2 +..., data =data_name)
3.2
Distribution of the values of a single variable
xtabs( ~ RealizationOfRec, data = verbs)
xtabs( ~ AnimacyOfRec)
After attach(verbs), the argument data = verbs can be omitted.
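The left-hand side of the xtabs() formula is only needed when the data are already tabulated: it then names the column that holds the counts. With one row per observation, as in verbs, it stays empty. A small sketch using the filler table from section 1.2.1; the data frame obs is a toy sample invented for illustration.

```r
# Already tabulated data: one row per filler, with its frequency
x <- data.frame(Filler = c("aeh", "silence", "aehm"),
                Frequency = c(394, 332, 274))
tab <- xtabs(Frequency ~ Filler, data = x)   # left-hand side = counts
tab

# One row per observation: leave the left-hand side empty
obs <- data.frame(Filler = rep(x$Filler, times = c(3, 2, 1)))  # toy sample
tab_obs <- xtabs(~ Filler, data = obs)
tab_obs
```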
3.3
Goodness-of-fit test: chi-square test
Is the realization of the recipient optional, i.e. randomly distributed, or is the observed difference in frequency
statistically significant?
Hypotheses:
• H0 : The frequencies of the values of the variable realization of the recipient are identical; variation in the
sample is due to chance: $n_{NP} = n_{PP}$
• H1 : The frequencies of the values of the variable realization of the recipient are not identical; variation in
the sample is not due to chance: $n_{NP} \neq n_{PP}$
Goal:
To determine whether the probability of wrongly rejecting H0 is lower than a recognized significance level.
p > 0.05    not significant, H0 cannot be rejected
p < 0.05    significant, H0 can be rejected
p < 0.01    highly significant, H0 can be rejected
p < 0.001   very highly significant, H0 can be rejected
Table 1: Significance levels of error probability
Conditions for the chi-square test:
• all observations are independent
• 80% of the expected frequencies are ≥ 5
• all frequencies (in the contingency table) are > 1
• (Alternative for lower frequencies: fisher.test())
xtabs( ~ RealizationOfRec)
RealizationOfRec
NP PP
555 348
chisq.test(xtabs( ~ RealizationOfRec))
Chi-squared test for given probabilities
data: xtabs(~RealizationOfRec)
X-squared = 47.4518, df = 1, p-value = 5.637e-12
Interpretation:
The frequency difference between the realization of the recipient as NP and as PP is statistically significant
(X-squared = 47.4518, df = 1, p<0.001).
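The test statistic can be reproduced by hand: under H0 both realizations are expected equally often, and X-squared is the sum of (observed - expected)^2 / expected over all cells. A sketch of this computation with the counts from above:

```r
observed <- c(NP = 555, PP = 348)
# Under H0 both values are equally frequent: 903 / 2 = 451.5 each
expected <- rep(sum(observed) / length(observed), length(observed))

x2 <- sum((observed - expected)^2 / expected)    # chi-square statistic
df <- length(observed) - 1                       # degrees of freedom
p  <- pchisq(x2, df = df, lower.tail = FALSE)    # upper tail probability

round(x2, 4)   # 47.4518, the value reported by chisq.test()
p
```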
3.4
Confidence interval for real distribution
To generalize from the observed sample frequencies to the population, a 95% confidence interval is computed:
prop.test(555, 903, conf.level = 0.95)
1-sample proportions test with continuity correction
data: 555 out of 903, null probability 0.5
X-squared = 46.9945, df = 1, p-value = 7.119e-12
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5818929 0.6463551
sample estimates:
p
0.614618
Interpretation:
With 95% confidence, the recipient is realized as NP in 58% to 65% of all double object cases.
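Instead of reading the values off the printed output, they can be extracted from the object that prop.test() returns. A brief sketch; the comparison with a 99% level is added only to illustrate how the interval depends on conf.level.

```r
res <- prop.test(555, 903, conf.level = 0.95)
res$estimate                    # sample proportion of NP realizations
res$conf.int                    # lower and upper bound of the interval
round(100 * res$conf.int, 1)    # the same bounds in percent

# A higher confidence level gives a wider interval
res99 <- prop.test(555, 903, conf.level = 0.99)
diff(res99$conf.int) > diff(res$conf.int)
```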
prop.test (420,903,conf.level = 0.95)
1-sample proportions test with continuity correction
data: 420 out of 903, null probability 0.5
X-squared = 4.2569, df = 1, p-value = 0.03909
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.4322516 0.4982816
sample estimates:
p
0.4651163
Interpretation:
...
The two-sample version of prop.test() tests whether two proportions (possibly from samples of different sizes, e.g.
n = c(500, 700)) are statistically the same.
prop.test( x=c(555,420), n=c(903,903), alt="two.sided")
2-sample test for equality of proportions with continuity correction
data: c(555, 420) out of c(903, 903)
X-squared = 40.0241, df = 1, p-value = 2.508e-10
alternative hypothesis: two.sided
95 percent confidence interval:
0.1029411 0.1960622
sample estimates:
   prop 1    prop 2
0.6146179 0.4651163
Interpretation:
...
3.5
Cross-tabulation of two (independent) variables
Example with two predictors only (= independent variables):
xtabs( ~ RealizationOfRec + AnimacyOfRec)
Task:
Try to interpret the data. Is there a preference for the realisation of the recipient dependent on its animacy?
3.5.1
Proportions
You will get a better feeling if you look at the proportions instead of the absolute frequencies:
verbs.xtabs = xtabs( ~ RealizationOfRec + AnimacyOfRec)
prop.table(verbs.xtabs, 1) # rows sum to 1
prop.table(verbs.xtabs, 2) # columns sum to 1
Sometimes it is more readable if proportions are rounded:
round (prop.table(verbs.xtabs, 1), 2) # 2 = number of digits after the decimal point
round (prop.table(verbs.xtabs, 1), 2) *100 # Percentages
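The effect of the second argument of prop.table() can be checked on the counts themselves. In this sketch the table is typed in by hand from the counts printed for verbs.xtabs in section 3.6.1, so that it runs without languageR; the object name verbs.tab is chosen here to avoid overwriting verbs.xtabs.

```r
# Counts copied from the verbs.xtabs printout in section 3.6.1
verbs.tab <- matrix(c(521, 301, 34, 47), nrow = 2,
                    dimnames = list(RealizationOfRec = c("NP", "PP"),
                                    AnimacyOfRec = c("animate", "inanimate")))

by_row <- prop.table(verbs.tab, 1)   # proportions within each row
by_col <- prop.table(verbs.tab, 2)   # proportions within each column
rowSums(by_row)                      # each row sums to 1
colSums(by_col)                      # each column sums to 1

round(by_col * 100, 1)               # percent NP vs. PP within each animacy level
```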
3.5.2
Simple visualisation of cross-tabled variables: barplot
barplot(verbs.xtabs, legend.text = TRUE) # stacked representation
barplot(verbs.xtabs, legend.text = TRUE, beside=TRUE) # columns are shown as juxtaposed bars
3.5.3
Mosaic plot
This type of plot conditions one variable on another and displays relative preferences:
plot(RealizationOfRec ~AnimacyOfRec)
3.5.4
Association plot
In the association plot, the height of each rectangle is proportional to the deviation of the observed from the
expected (random) frequency of that cell. Its width is proportional to the square root of the expected frequency.
Hence, the area of the rectangle is proportional to the difference between observed and expected frequency.
Black = higher frequency than expected, red = lower frequency than expected.
assocplot(verbs.xtabs)
3.6
Test of independence: chi-square test
Is the observed difference statistically significant?
Hypotheses:
• H0 : The frequencies of the values of the dependent variable realization of the recipient do not co-vary with
the frequencies of the values of the independent variable animacy of the recipient.
• H1 : The frequencies of the values of the dependent variable realization of the recipient do co-vary with
the frequencies of the values of the independent variable animacy of the recipient.
Goal:
To determine whether the probability of wrongly rejecting H0 is lower than a recognized significance level.
Conditions for the chi-square test:
• all observations are independent
• 80% of the expected frequencies are ≥ 5
• all frequencies (in the contingency table) are > 1 (Alternative for lower frequencies: fisher.test())
chisq.test(verbs.xtabs)
Pearson’s Chi-squared test with Yates’ continuity correction
data: verbs.xtabs
X-squared = 13.3755, df = 1, p-value = 0.0002549
Interpretation:
The difference between NP and PP realization given different levels of animacy is statistically significant (X-squared = 13.3755, df = 1, p<0.001).
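The expected frequencies that the conditions refer to are stored in the result of chisq.test(): for each cell, row total times column total divided by the grand total. A sketch with the counts of verbs.xtabs typed in by hand (see the printout in section 3.6.1); the conditions are checked here on the expected frequencies, which is one common reading of them.

```r
verbs.tab <- matrix(c(521, 301, 34, 47), nrow = 2,
                    dimnames = list(RealizationOfRec = c("NP", "PP"),
                                    AnimacyOfRec = c("animate", "inanimate")))

test <- chisq.test(verbs.tab)     # Yates' correction is the default for 2 x 2 tables
expected <- test$expected
expected

# The same values by hand: row total * column total / grand total
by_hand <- outer(rowSums(verbs.tab), colSums(verbs.tab)) / sum(verbs.tab)

mean(expected >= 5) >= 0.8        # at least 80% of the expected frequencies are >= 5
all(expected > 1)                 # no expected frequency of 1 or less
round(unname(test$statistic), 4)  # 13.3755, as in the output above
```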
3.6.1
Odds ratio
Significance says nothing about the strength of the difference. How strong is the difference? Compare:
• probability p: frequency of a value in relation to all events
• odds O: probability of a value to occur in relation to its non-occurrence
$\text{odds} = \frac{p_E}{1 - p_E}$
$\text{odds ratio} = \frac{O_1}{O_2}$
Example:
verbs.xtabs
                AnimacyOfRec
RealizationOfRec animate inanimate
              NP     521        34
              PP     301        47
O_np_animate = (521/34)/(301/47) # = (521*47)/(34*301)
O_np_animate
[1] 2.392711
Interpretation:
The odds that a recipient is realized as NP are about 2.4 times higher if the recipient is animate than if it is
inanimate.
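The computation generalizes to any 2 x 2 table. In the sketch below, odds_ratio is a hypothetical helper written for illustration (it is not a base R function); the second part derives the same value from the odds of NP within each animacy level.

```r
# Hypothetical helper (not a base R function): odds ratio of a 2 x 2 table
odds_ratio <- function(tab) {
  stopifnot(all(dim(tab) == c(2, 2)))
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
}

verbs.tab <- matrix(c(521, 301, 34, 47), nrow = 2,
                    dimnames = list(RealizationOfRec = c("NP", "PP"),
                                    AnimacyOfRec = c("animate", "inanimate")))
odds_ratio(verbs.tab)                   # 2.392711

# Odds of NP (vs. PP) within each animacy level, and their ratio
odds_np <- verbs.tab["NP", ] / verbs.tab["PP", ]
odds_np
odds_np[["animate"]] / odds_np[["inanimate"]]
```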
3.7
CART analysis: Classification tree
This section is based on Baayen (2008: 148-154), section 5.2.1. See the book for further explanation. For the next
task, we will load the ‘real’ data:
data(dative)
help(dative)
attach(dative)
str(dative)
Question:
Can the realization of the recipient as NP or PP be predicted from the other variables?
For the CART analysis we need package rpart:
library(rpart)
We create an initial tree for the realization of the recipient. Each node is a decision. If the answer is “yes”, one
proceeds along the left branch, else along the right branch, top-down from the root to the leaves.
dative.rp = rpart(RealizationOfRecipient ~ .,
data = dative[ ,-c(1, 3)]) # exclude the columns with subjects, verbs
plot(dative.rp, compress = TRUE, branch = 1, margin = 0.1)
text(dative.rp, use.n = TRUE, pretty = 0)
3.7.1
Pruning
The initial tree overfits the data and can be pruned according to a cost-complexity parameter:
plotcp(dative.rp) # ten-fold cross validation
For an explanation see Baayen (2008: 151). Then we create the pruned tree:
dative.rp1 = prune(dative.rp, cp = 0.041)
plot(dative.rp1, compress = TRUE, branch = 1, margin = 0.1)
text(dative.rp1, use.n = TRUE, pretty = 0)
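The pruning steps can be tried without languageR on the kyphosis data that ship with rpart. This is a sketch under assumptions: kyphosis only stands in for dative, and picking the cp value with the smallest cross-validated error is one common heuristic, not necessarily the criterion Baayen applies to the plotcp() graph.

```r
library(rpart)

set.seed(1)                          # the cross-validation inside rpart() is random
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit$cptable                          # columns: CP, nsplit, rel error, xerror, xstd

# Heuristic (an assumption here): cp with the smallest cross-validated error
best <- which.min(fit$cptable[, "xerror"])
cp_best <- fit$cptable[best, "CP"]

pruned <- prune(fit, cp = cp_best)
c(full = nrow(fit$frame), pruned = nrow(pruned$frame))   # number of nodes in each tree
```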
3.7.2
Quality of the classification tree
predict() extracts predictions from the model:
head(predict(dative.rp1))
Choose the realization with the higher probability.
choiceIsNP = predict (dative.rp1)[,1] >= 0.5
choiceIsNP[1:6]
Combine this vector with the original observations:
preds= data.frame(obs = RealizationOfRecipient, choiceIsNP)
head(preds)
Cross-tabulation of observed and predicted realizations:
xtabs( ~ obs + choiceIsNP, data = preds)
Interpretation:
Only 269+177 = 446 data points out of 3263 are misclassified (13.7%). The baseline error rate was 26%.
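The percentage follows directly from the counts. The first part of the sketch repeats that arithmetic; the helper misclassification_rate and the toy vectors in the second part are hypothetical and only illustrate how the rate is read off a table of observed against predicted classes.

```r
# Arithmetic from the text: misclassified observations / all observations
misclassified <- 269 + 177
total <- 3263
error_rate <- misclassified / total
round(100 * error_rate, 1)           # 13.7

# Hypothetical helper: share of cases off the diagonal of a square table
# whose rows and columns list the classes in the same order
misclassification_rate <- function(tab) 1 - sum(diag(tab)) / sum(tab)

# Toy vectors, invented for illustration
toy <- table(obs  = c("NP", "NP", "PP", "PP", "NP"),
             pred = c("NP", "PP", "PP", "PP", "NP"))
misclassification_rate(toy)          # 1 of the 5 toy cases is misclassified
```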
3.8
Regression analysis: Generalized linear mixed model
This section uses data and code from Baayen (2008): section 7.4. See there for further explanations and further
development of the argumentation.
library(Design)
dative.dd = datadist(dative)
options(datadist = 'dative.dd')
dative.lrm = lrm(RealizationOfRecipient ~
AccessOfTheme + AccessOfRec + LengthOfRecipient + AnimacyOfRec +
AnimacyOfTheme + PronomOfTheme + DefinOfTheme + LengthOfTheme+
SemanticClass + Modality, data = dative)
anova(dative.lrm)
Wald Statistics          Response: RealizationOfRecipient

Factor             Chi-Square  d.f.  P
AccessOfTheme           30.79     2  <.0001
AccessOfRec            258.06     2  <.0001
LengthOfRecipient       69.87     1  <.0001
AnimacyOfRec            93.35     1  <.0001
AnimacyOfTheme           3.71     1  0.0542
PronomOfTheme           54.42     1  <.0001
DefinOfTheme            28.72     1  <.0001
LengthOfTheme           79.03     1  <.0001
SemanticClass          166.55     4  <.0001
Modality                49.91     1  <.0001
TOTAL                  747.64    15  <.0001
Visualization of the data:
par(mfrow = c(4,3))
plot(dative.lrm)
par(mfrow = c(1,1))
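How a logistic regression coefficient relates to the odds ratio of section 3.6.1 can be seen in a model with a single predictor. This is a sketch under assumptions: glm() from base R stands in for lrm(), the data are the four verbs counts from section 3.6.1 rather than the dative data, and NP is coded as the outcome being modelled.

```r
# Counts from the verbs.xtabs printout in section 3.6.1
d <- data.frame(
  AnimacyOfRec = factor(c("inanimate", "animate"),
                        levels = c("inanimate", "animate")),  # reference level: inanimate
  NP = c(34, 521),
  PP = c(47, 301)
)

# Response: NP counted as "success", PP as "failure"
m <- glm(cbind(NP, PP) ~ AnimacyOfRec, family = binomial, data = d)
coef(m)                  # intercept: log odds of NP for inanimate recipients;
                         # slope: log odds ratio animate vs. inanimate
or <- exp(coef(m))[2]    # back-transformed slope = odds ratio
or                       # about 2.39, cf. section 3.6.1
```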
4
References and further reading
• General
– Baayen, R.H. 2008. Analyzing Linguistic Data. Cambridge University Press.
(pre-print draft: www.ualberta.ca/~baayen/publications/baayenCUPstats.pdf).
– Gries, S.T. 2008. Statistik für Sprachwissenschaftler. Vandenhoeck & Ruprecht.
• Data frame
– Baayen. 2008. Section 1.3.
• Association plot and chi-square test
– Gries. 2008. Sections 4.1.1.2, 4.1.2.2.
• CART Analysis
– Baayen. 2008. Section 5.2.1.
– Heylen, Kris. 2005. A Quantitative Corpus Study of German Word Order Variation. Reis, M. &
Kepser, S. (eds.) Evidence in Linguistics: Empirical, Theoretical, and Computational Perspectives,
Mouton de Gruyter, 241-263.