SUGI 29
Posters
Paper 167-29
Segmenting Children’s Narratives with PROC CLUSTER: An
Application of SAS® Tools to Child Language Studies
Lindsey N. Chen, University of Southern California, Los Angeles, CA
ABSTRACT
Multivariate statistical methods have been used to classify texts into genre categories and to differentiate one
literary style from another. Today’s computing power makes such tasks faster and more efficient. In this paper,
I present a similar methodology for studying primary school children’s narrative genre skills. In particular, the
Average Group Linkage Method via SAS® PROC CLUSTER is applied to 32 first-graders’ spoken narratives obtained
from the CHILDES™ corpus. Here, SAS® was extremely useful for making sense of a large collection of textual data
consisting of various linguistic features measured on a number of narrative texts. Preliminary results show that
the children’s narratives can be segmented into two sub-genre categories, supporting the hypothesis that children
at that young age already possess the ability to produce different kinds of narratives.
INTRODUCTION
Are children able to assume different narrative “voices”? Studies by Heath (1983) show that, along with the factual
news report, the two narrative genres (ongoing event and embellished story) are part of the “key” narratives because
they are found cross-culturally in children’s language learning environments. This paper aims to support previous
qualitative analyses by providing quantitative evidence for children’s narrative genre ability. To this end, cluster
analysis is used to identify two segments in terms of the following linguistic measures: independent clause,
subordinate adverbial clause, temporal sequential marker and modal verb. These four linguistic variables
were selected based on their relative frequencies in the two kinds of narratives.
Examples of temporal sequential marker
*CHI: so [then] they catch {!} the balloon around the bakery.
*CHI: and [then] the boy comes out.
(partial narrative transcript from a child of age 5;9)
Examples of subordinate adverbial clause
*CHI: and they have it
*CHI: [tied on a string # onto a boys # wrist].
*CHI: and now the boy all fell down.
*CHI: [because now he got his balloon].
(partial narrative transcript from a child of age 6;6)
THE CLUSTERING METHOD WITH SAS®
The dataset contains the normalized frequency counts of the four linguistic measures from thirty-two spoken narrative
texts obtained from the CHILDES™ corpus. Speech samples include both event and story narratives from first-graders
who were asked to report on the events of a film, acting either as a sportscaster or as a storyteller. In the
analysis, results from the cluster procedure are compared with the actual types from which the narratives are
known to have come. First, the following code creates the dataset.
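TITLE 'SEGMENTING CHILDREN NARRATIVES';
/* Event narratives; the placeholder after CARDS stands for the 16 data lines */
DATA EVENT;
INPUT ID IND SADV TSQ MODV;
TYPE='E';
CARDS;
/*observations 1-16 */
;
RUN;
/* Story narratives; the placeholder after CARDS stands for the 16 data lines */
DATA STORY;
INPUT ID IND SADV TSQ MODV;
TYPE='S';
CARDS;
/*observations 17-32*/
;
RUN;
/* Stack the event and story narratives into one dataset */
DATA NARRATIVES; SET EVENT STORY;
RUN;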
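Because the four measures were chosen for their differing relative frequencies in the two kinds of narratives, a quick descriptive comparison by narrative type may be a useful first check. The following is only a minimal sketch, assuming the NARRATIVES dataset has been created as above with the actual data lines supplied; the choice of statistics is illustrative and not part of the original analysis.

```sas
/* Sketch: summarize the four linguistic measures separately for  */
/* event (TYPE='E') and story (TYPE='S') narratives               */
PROC MEANS DATA=NARRATIVES N MEAN STD MIN MAX;
CLASS TYPE;
VAR IND SADV TSQ MODV;
RUN;
```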
The hierarchical clustering procedure used is the Average Group Linkage Method, which calculates the distance
between two clusters as the average of all pairwise distances between a point in one cluster and a point in the
other. Initially, each case is treated as its own cluster; clusters are then combined step by step based on the
measured characteristics. The following SAS® code outputs the Cluster History.
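/*The Average Linkage Procedure*/
/*PSEUDO requests the PST2 statistic and CCC the cubic clustering criterion*/
PROC CLUSTER DATA=NARRATIVES STANDARD METHOD=AVERAGE PSEUDO CCC OUTTREE=TREE;
VAR IND SADV TSQ MODV;
ID ID;
RUN;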
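As a sketch of the average linkage rule described above, the distance between two clusters can be written as follows. The symbols are notation introduced here for illustration: C_K and C_L are two clusters with N_K and N_L members, and d is whichever dissimilarity between two texts is in use. As far as the SAS documentation describes it, METHOD=AVERAGE works with squared Euclidean distances on coordinate data unless the NOSQUARE option is given.

```latex
D_{KL} = \frac{1}{N_K N_L} \sum_{i \in C_K} \sum_{j \in C_L} d(x_i, x_j)
```

At each step the two clusters with the smallest D_{KL} are merged.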
Cluster History
NCL   ------Clusters Joined------   FREQ    PST2   Norm RMS Dist
  6   CL17           CL24              5     7.4   0.6981
  5   CL6            CL7              13     8.4   0.8641
  4   CL5            CL8              28    17.1   0.9719
  3   CL4            13               29     2.8   1.151
  2   CL3            CL20             31     6.4   1.2699
  1   CL2            4                32     6.7   1.8202
The Cluster History includes useful diagnostics for estimating the number of clusters in these data. For example, the
values of the PST2 statistic produced by the PSEUDO option help to determine whether the two clusters joined at a
given step should have been combined. Essentially, if PST2 is large, the two clusters should not be combined. But if
PST2 is small, then the two clusters can safely be combined. Of course, what is “large” and what is “small” is relative
to the data being analyzed. Looking at the cluster history, one notes that when the program reduces 5 clusters to
4 clusters, the value of PST2 is 17.1, which is large compared with the values at the other steps. Thus, comparison
of the PST2 values suggests five as the estimated number of clusters in the data. As another useful diagnostic, a plot
of the cubic clustering criterion (CCC plot, not shown) also suggests five clusters.
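The diagnostics discussed above can also be read from the tree dataset rather than from the printed history. The following is a minimal sketch, assuming PROC CLUSTER was run with the PSEUDO and CCC options so that the output dataset TREE contains the variables _PST2_ and _CCC_; the cut-offs of 6 and 10 clusters are arbitrary choices for illustration.

```sas
/* Sketch: print the last few merges with their diagnostics */
PROC PRINT DATA=TREE;
WHERE _NCL_ BETWEEN 1 AND 6;
VAR _NCL_ _NAME_ _FREQ_ _PST2_ _CCC_;
RUN;
/* Sketch: plot the cubic clustering criterion against the number of clusters */
PROC PLOT DATA=TREE;
WHERE _NCL_ BETWEEN 1 AND 10;
PLOT _CCC_*_NCL_;
RUN;
```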
The clustering procedure can be represented as a hierarchical tree in which each step of the process is shown as
the joining of two branches.
/*Creating the hierarchical tree*/
PROC TREE DATA=TREE OUT=TREEOUT NCLUSTERS=5;
COPY IND SADV TSQ MODV;
ID ID;
RUN;
PROC TREE outputs the dendrogram. Inspecting the tree shows the following: texts 1, 8, 11, 16, 5, 2, 3, 20, 19,
5, 6, 9 and 12 form a cluster, texts 17, 57, 59, 60, 47, 49, 48, 52, 42, 45, 55, 53, 54 and 58 form the second cluster,
text 13 forms the third cluster, texts 43 and 44 form the fourth cluster and text 4 forms the fifth cluster.
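Cluster membership can also be listed directly instead of being read off the dendrogram. The following is a minimal sketch, assuming the TREEOUT dataset produced by PROC TREE above contains the variables ID and CLUSTER; BYCLUS is a hypothetical dataset name introduced for illustration.

```sas
/* Sketch: number of texts in each of the five clusters */
PROC FREQ DATA=TREEOUT;
TABLES CLUSTER;
RUN;
/* Sketch: list the text IDs cluster by cluster */
PROC SORT DATA=TREEOUT OUT=BYCLUS; BY CLUSTER;
RUN;
PROC PRINT DATA=BYCLUS;
BY CLUSTER;
VAR ID;
RUN;
```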
A principal component analysis is performed to help fine-tune the results of the clustering
process.
/*Eigenvalue of the correlation matrix*/
PROC PRINCOMP DATA=NARRATIVES OUT=SCORES;
VAR IND SADV TSQ MODV;
RUN;
PROC PRINCOMP outputs the eigenvalues of the correlation matrix.
Eigenvalues of the Correlation Matrix
     Eigenvalue    Difference    Proportion    Cumulative
1    2.15578234    1.14839724    0.5389        0.5389
2    1.00738510    0.54416448    0.2518        0.7908
3    0.46322063    0.08960870    0.1158        0.9066
4    0.37361193                  0.0934        1.0000
An examination of the eigenvalues shows that the first two eigenvalues account for 79.08% of the total variability
in the measured variables. This implies that the measured variables nearly fall within a two-dimensional subspace of
the four-dimensional sample space. Thus, a plot of the first two principal component scores should be useful for
clustering these data because it can reveal grouping or structure within a dataset.
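The 79.08% figure follows from the eigenvalue table: the eigenvalues of a correlation matrix of four variables sum to 4, so the share of variability carried by the first two components is their sum divided by 4. The symbols λ1 to λ4 are notation introduced here for the four eigenvalues.

```latex
\frac{\lambda_1 + \lambda_2}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4}
  = \frac{2.15578234 + 1.00738510}{4}
  = \frac{3.16316744}{4}
  \approx 0.7908
```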
/*Creating a plot of Prin1xPrin2*/
PROC SORT DATA=TREEOUT; BY ID;
PROC SORT DATA=SCORES; BY ID;
DATA COMB; MERGE SCORES TREEOUT; BY ID;
PROC PLOT DATA=COMB;
PLOT PRIN1*PRIN2=TYPE;
PLOT PRIN1*PRIN2=CLUSTER;
RUN;
PROC PLOT will output two plots of the first two PC scores. See Figure 1 and Figure 2.
[Figure 1: plot of PRIN1*PRIN2, with plotting symbols giving the value of TYPE (E or S)]
Figure 2 is a similar graph but with symbols plotted to identify the cluster to which the narrative was assigned.
[Figure 2: plot of PRIN1*PRIN2, with plotting symbols giving the value of CLUSTER]
The two plots are encouraging. Perhaps the most salient feature is the clear separation of the event narratives (E)
and the story narratives (S) along an imaginary horizontal line at ‘0’: the event narratives lie above it
while the story narratives lie below. A comparison of the two graphs shows that, except for the one isolated case
located far above the plane and the two event narratives classified with the story narratives, the majority
of the event narratives are grouped together under the cluster labeled ‘2’. A similar observation can be made about
the story narratives: except for the two cases located farther to the right and the one story narrative classified
into cluster 2, the majority of the story narratives are grouped together under the cluster labeled ‘1’. Overall,
ignoring the few isolated or “unusual” cases, the two kinds of narratives are clearly partitioned into distinct groups.
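The visual comparison of the two plots can be complemented by a cross-tabulation of the known narrative type against the assigned cluster. The following is a minimal sketch, assuming the merged dataset COMB created above contains both TYPE and CLUSTER; it is offered as an illustration and was not part of the original analysis.

```sas
/* Sketch: count event (E) and story (S) narratives within each cluster */
PROC FREQ DATA=COMB;
TABLES TYPE*CLUSTER / NOROW NOCOL NOPERCENT;
RUN;
```

Cells off the dominant type-by-cluster pairing correspond to the misclassified or isolated cases noted above.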
CONCLUSION
Given the linguistic measures used in the clustering method, preliminary results show that the children’s narratives
can be separated into two subgenres. The results thus suggest that differences in syntactic constructions
between the narrative genres are sufficient to segment the spoken narrative texts. Moreover, this
emergence of two subgenres suggests that young children do possess the ability to produce different kinds
of narratives.
REFERENCES
Heath, Shirley Brice. (1983). Ways with Words: Language, Life and Work in Communities and Classrooms.
Cambridge: Cambridge University Press.
Hicks, Deborah. (1990). Kinds of texts: Narrative genre skills among children from two communities. In A. McCabe
(Ed.), Developing Narrative Structure. Hillsdale, NJ: Erlbaum.
Johnson, Dallas. (1998). Applied Multivariate Methods for Data Analysts. Pacific Grove, CA: Brooks and Cole
Publishing Company.
MacWhinney, Brian. (2000). The CHILDES Project: Tools for Analyzing Talk, Third Edition. Mahwah, NJ: Lawrence
Erlbaum Associates.
CONTACT INFORMATION
Lindsey N. Chen
Email: [email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the
USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective
companies.