SUGI 29 Posters
Paper 167-29

Segmenting Children’s Narratives with PROC CLUSTER: An Application of SAS® Tools to Child Language Studies
Lindsey N. Chen, University of Southern California, Los Angeles, CA

ABSTRACT
Multivariate statistical methods have been used to classify texts into genre categories and to differentiate one literary style from another. Today’s computing capabilities make such tasks faster and more efficient. In this paper, I present a similar methodology for studying primary school children’s narrative genre skills. In particular, the Average Group Linkage Method via SAS® PROC CLUSTER is applied to 32 first-graders’ spoken narratives obtained from the CHILDES™ corpus. SAS® was extremely useful for making sense of a large collection of textual data consisting of various linguistic features measured on a number of narrative texts. Preliminary results show that the children’s narratives can be segmented into two sub-genre categories, supporting the hypothesis that children at that young age already possess the ability to produce different kinds of narratives.

INTRODUCTION
Are children able to assume different narrative “voices”? Heath (1983) shows that, along with the factual news report, the two narrative genres, ongoing event and embellished story, are among the “key” narratives because they are found cross-culturally in children’s language learning environments. This paper aims to support previous qualitative analyses by providing quantitative evidence for children’s narrative genre ability. To this end, cluster analysis is used to identify two segments in terms of the following linguistic measures: independent clauses, subordinate adverbial clauses, temporal sequential markers and modal verbs. These four linguistic variables were selected based on their relative frequencies in the two kinds of narratives.
Examples of temporal sequential marker
*CHI: so [then] they catch {!} the balloon around the bakery.
*CHI: and [then] the boy comes out.
(partial narrative transcript from a child of age 5;9)

Examples of subordinate adverbial clause
*CHI: and they have it
*CHI: [tied on a string # onto a boys # wrist].
*CHI: and now the boy all fell down.
*CHI: [because now he got his balloon].
(partial narrative transcript from a child of age 6;6)

THE CLUSTERING METHOD WITH SAS®
The dataset contains the normalized frequency counts of the four linguistic measures from thirty-two spoken narrative texts obtained from the CHILDES™ corpus. Speech samples include both event and story narratives from first-graders who were asked to report on the events of a film, acting either as a sportscaster or as a storyteller. In the analysis, the results of the cluster procedure are compared with the actual types from which the narratives are known to have come. First, the following code creates the dataset.

TITLE 'SEGMENTING CHILDREN NARRATIVES';
/* Event narratives: one row per text, tagged TYPE='E' */
DATA EVENT;
INPUT ID IND SADV TSQ MODV;
TYPE='E';
CARDS;
/*observations 1-16 */
;
/* Story narratives: one row per text, tagged TYPE='S' */
DATA STORY;
INPUT ID IND SADV TSQ MODV;
TYPE='S';
CARDS;
/*observations 17-32*/
;
RUN;
/* Stack the two sets into a single analysis dataset */
DATA NARRATIVES;
SET EVENT STORY;
RUN;

The hierarchical clustering procedure used is the Average Group Linkage Method, which calculates the distance between two clusters as the average of all pairwise distances between the points in one cluster and the points in the other. Initially, each case is treated as its own cluster; clusters are then combined step by step based on the measured characteristics. The following SAS® code outputs the Cluster History.
/*The Average Linkage Procedure*/
/* PSEUDO requests the PST2 statistic and CCC the cubic clustering criterion discussed below */
PROC CLUSTER DATA=NARRATIVES STANDARD METHOD=AVERAGE PSEUDO CCC OUTTREE=TREE;
VAR IND SADV TSQ MODV; /* cluster on the four linguistic measures only */
ID ID; /* carry the text identifier into the tree dataset */
RUN;

Cluster History
NCL   --Clusters Joined--   FREQ   PST2   Norm RMS Dist
  6   CL17      CL24           5    7.4   0.6981
  5   CL6       CL7           13    8.4   0.8641
  4   CL5       CL8           28   17.1   0.9719
  3   CL4       13            29    2.8   1.151
  2   CL3       CL20          31    6.4   1.2699
  1   CL2       4             32    6.7   1.8202

The Cluster History includes useful diagnostics for estimating the number of clusters in these data. For example, the values of the PST2 statistic produced by the PSEUDO option help to determine whether the two clusters joined at a given step should have been combined. Essentially, if PST2 is large, the two clusters should not be combined; if PST2 is small, they can safely be combined. Of course, what is “large” and what is “small” is relative to the data being analyzed. Looking at the cluster history, one notes that when the program reduces 5 clusters to 4 clusters, the value of PST2 is 17.1, which is large compared with the values at the other steps. Because this merge appears unwarranted, the PST2 values suggest five as the estimated number of clusters in the data. As another useful diagnostic, a plot of the cubic clustering criterion (CCC plot, not shown) also suggests five clusters. The clustering procedure can be represented as a hierarchical tree in which each step in the process is illustrated by the joining of two branches.

/*Creating the hierarchical tree*/
PROC TREE DATA=TREE OUT=TREEOUT NCLUSTERS=5;
COPY IND SADV TSQ MODV;
ID ID;
RUN;

PROC TREE outputs the dendrogram. Visualizing the tree, the following are noted: texts 1, 8, 11, 16, 5, 2, 3, 20, 19, 5, 6, 9 and 12 form a cluster; texts 17, 57, 59, 60, 47, 49, 48, 52, 42, 45, 55, 53, 54 and 58 form the second cluster; text 13 forms the third cluster; texts 43 and 44 form the fourth cluster; and text 4 forms the fifth cluster. A principal component analysis is then performed to help fine-tune the results of the clustering process.
/*Eigenvalue of the correlation matrix*/
PROC PRINCOMP DATA=NARRATIVES OUT=SCORES;
VAR IND SADV TSQ MODV;
RUN;

PROC PRINCOMP outputs the eigenvalues of the correlation matrix.

Eigenvalues of the Correlation Matrix
     Eigenvalue   Difference   Proportion   Cumulative
1    2.15578234   1.14839724   0.5389       0.5389
2    1.00738510   0.54416448   0.2518       0.7908
3    0.46322063   0.08960870   0.1158       0.9066
4    0.37361193                0.0934       1.0000

An examination of the eigenvalues shows that the first two account for 79.08% of the total variability in the measured variables. This implies that the measured variables nearly fall within a two-dimensional subspace of the four-dimensional sample space. Thus, a plot of the first two principal component scores should be useful for clustering these data because it can reveal grouping or structure within the dataset.

/*Creating a plot of Prin1xPrin2*/
PROC SORT DATA=TREEOUT;
BY ID;
PROC SORT DATA=SCORES;
BY ID;
/* Combine principal component scores with cluster assignments */
DATA COMB;
MERGE SCORES TREEOUT;
BY ID;
RUN;
PROC PLOT DATA=COMB;
PLOT PRIN1*PRIN2=TYPE;
PLOT PRIN1*PRIN2=CLUSTER;
RUN;

PROC PLOT outputs two plots of the first two principal component scores. See Figure 1 and Figure 2.

[Figure 1. Plot of PRIN1 against PRIN2, with plotting symbols identifying the narrative type (E or S).]

Figure 2 is a similar graph but with symbols plotted to identify the cluster to which the narrative was assigned.

[Figure 2. Plot of PRIN1 against PRIN2, with plotting symbols identifying the assigned cluster.]

The two plots are encouraging! Perhaps the most salient feature is the clear separation of the event narratives (E) and the story narratives (S) along an imaginary horizontal line demarcated at ‘0’; the event narratives are above it while the story narratives are below.
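The agreement between the known narrative types and the cluster assignments can also be tabulated rather than read off the plots. The step below is a minimal sketch and is not part of the original analysis; it assumes that the COMB dataset created above carries both TYPE (from NARRATIVES via SCORES) and CLUSTER (from PROC TREE), and the choice of PROC FREQ options is only one reasonable possibility.

```sas
/* Sketch only: cross-tabulate known narrative type against assigned cluster. */
/* Assumes COMB holds both TYPE and CLUSTER for every text.                   */
PROC FREQ DATA=COMB;
   TABLES TYPE*CLUSTER / NOROW NOCOL NOPERCENT;  /* print cell counts only */
RUN;
```

Each cell of the resulting table counts the narratives of one type that fall into one cluster, so narratives grouped with the other genre show up as off-pattern counts.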
In general, a comparison of the two graphs shows that, except for the one isolated case located far above the plane and the two event narratives classified with the story narratives, the majority of the event narratives are grouped together under the cluster labeled ‘2’. A similar observation can be made about the story narratives. Except for the two cases located farther to the right and the one story narrative classified into cluster 2, the majority of the story narratives are grouped together under the cluster labeled ‘1’. Overall, ignoring the few isolated or “unusual” cases, the two kinds of narratives are clearly partitioned into distinct groups.

CONCLUSION
Given the linguistic measures used in the clustering method, preliminary results show that the children’s narratives can be separated into two subgenres. The results suggest that differences in syntactic constructions between the narrative genres can determine whether the spoken narrative texts can be segmented. Moreover, the emergence of two subgenres suggests that young children do possess the ability to produce different kinds of narratives.

REFERENCES
Heath, Shirley Brice. (1983). Ways with Words: Language, Life and Work in Communities and Classrooms. Cambridge: Cambridge University Press.
Hicks, Deborah. (1990). Kinds of texts: Narrative genre skills among children from two communities. In A. McCabe (Ed.), Developing Narrative Structure. Hillsdale, NJ: Erlbaum.
Johnson, Dallas. (1998). Applied Multivariate Methods for Data Analysts. Pacific Grove, CA: Brooks/Cole Publishing Company.
MacWhinney, Brian. (2000). The CHILDES Project: Tools for Analyzing Talk, Third Edition. Mahwah, NJ: Lawrence Erlbaum Associates.

CONTACT INFORMATION
Lindsey N. Chen
Email: [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.