Experience on Automated Coding of Occupation in Population Census in Croatia

Srđan Dumičić
PULS – Market, Media and Public Opinion Research Agency
Šubićeva
10000 Zagreb, Croatia
[email protected]

Ksenija Dumičić
Faculty of Economics – Zagreb, Department of Statistics
Kennedyjev trg 6
10000 Zagreb, Croatia
[email protected]

1. Introduction

This paper presents estimates of the joint impact of optical reading and automatic coding on the final results for occupation in the population Census '91 in Croatia. The quality of OCR and coding was estimated from a systematic sample with a sampling fraction of about f = 0.006 (Dumičić and Dumičić). The estimates indicated that the substitution rate for alpha characters was less than 2.5%.

Fourteen fields contained textual answers, which were converted to the corresponding codes using special software (Dumičić and Dumičić) for automated and computer assisted coding. The efficiency of the automatic coding process was measured by the percentage of fields coded automatically and the percentage of fields coded in computer assisted mode. Quality was measured by the percentage of fields in which the code assigned automatically or in computer assisted mode was the same as the code given by a well-trained specialist.

2. Automated Coding

In order to achieve high quality results in automatic and computer assisted coding, a complex application was developed (Kalpić). It consists of two principal steps: the single word recognition step and the phrase recognition step. A prerequisite was the creation of thesauri. Each single word appearing in a thesaurus became a candidate to be matched with the input word. Relative word weights were calculated to represent the discrimination power of every word. An input word was expected either to match a word from the thesaurus exactly or to be similar to some of them.
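
The single word recognition step can be illustrated with a minimal sketch: try an exact match first, then fall back to the most similar thesaurus word. This is not the application's own algorithm. The names `THESAURUS_WORDS`, `match_word` and `min_similarity` are hypothetical, the thesaurus entries are English stand-ins, and the similarity ratio from Python's standard `difflib` module is used only as a substitute for the application's similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical miniature thesaurus of single words (English stand-ins).
THESAURUS_WORDS = {"teacher", "driver", "nurse", "engineer"}


def match_word(input_word, thesaurus=THESAURUS_WORDS, min_similarity=0.8):
    """Return (matched_word, score); matched_word is None if nothing is close enough."""
    word = input_word.strip().lower()
    # Step 1: exact match against the thesaurus.
    if word in thesaurus:
        return word, 1.0
    # Step 2: fall back to the most similar candidate word.
    # difflib's ratio is an assumed stand-in for the real similarity measure.
    best_word, best_score = None, 0.0
    for candidate in sorted(thesaurus):  # sorted for deterministic tie-breaking
        score = SequenceMatcher(None, word, candidate).ratio()
        if score > best_score:
            best_word, best_score = candidate, score
    if best_score >= min_similarity:
        return best_word, best_score
    return None, best_score


print(match_word("teacher"))  # exact match
print(match_word("teachr"))   # misspelled or misread input
print(match_word("xyz"))      # no acceptable candidate
```

The threshold `min_similarity` is an arbitrary illustrative value. In a setup like this, raising it would presumably leave more words unmatched, while lowering it would accept more doubtful matches.
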
The reason for the absence of an exact match might be that the thesaurus did not contain the word in the same case, number or gender, or that the input word was erroneously hand-written or optically read. In such a case the similarity between the input word and a candidate word from the thesauri was based upon: the difference in word lengths, matching of equal or similar characters, matching after a shift, and the length of continuous matching strings.

The application supports:
• automatic coding,
• computer assisted coding, and
• monitoring and managing the whole process through a large set of parameters.

At that time a national code book for occupation with 2,922 different codes was in use. It has a hierarchical structure with a 6-digit code at the lowest level. The thesaurus held 21,644 phrases comprising 72,148 words, of which 13,109 were unique.

For quality monitoring, two different types of results were analyzed:
• values obtained through automatic coding, and
• values imputed in computer assisted mode.

In total 2,162,219 codes were assigned. These codes were compared with those assigned in the usual way by the subject matter specialists. Results are given in the following table:

Table: Efficiency and quality of automatic coding

Coding mode             Fields in sample  Efficiency  Code level  Equal fields  % equal
automatic coding                    6058      67.2 %  1 digit             5773    95.30
                                                      2 digits            5642    93.13
                                                      3 digits            5492    90.66
computer assisted mode              3143      32.8 %  1 digit             2406    76.99
                                                      2 digits            1959    62.69
                                                      3 digits            1711    54.75

It should be pointed out that all the values reflect the joint influence of optical reading and automatic coding, measured against the results that the subject matter specialists obtained after checking and correcting the inquiry forms. Understandably, had all the material been coded manually, the quality of the coders' work would have been considerably reduced, and the expected quality would have been lower.
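
The comparison reported in the table can be sketched as a prefix comparison of hierarchical codes: two codes agree at the 1, 2 or 3 digit level when their leading digits coincide. The function name `agreement_by_level` and the sample codes are hypothetical and serve only as an illustration. The last line uses figures taken from the table.

```python
def agreement_by_level(pairs, levels=(1, 2, 3)):
    """Compare code pairs on their leading digits.

    pairs: iterable of (assigned_code, specialist_code) strings.
    Returns {level: (number_equal, percent_equal)}.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one code pair is required")
    result = {}
    for n in levels:
        # Two codes agree at level n when their first n digits are identical.
        equal = sum(1 for assigned, reference in pairs if assigned[:n] == reference[:n])
        result[n] = (equal, 100.0 * equal / len(pairs))
    return result


# Hypothetical 6-digit codes, for illustration only.
sample = [
    ("123456", "123456"),  # agrees at every level
    ("123456", "129999"),  # agrees on the first 2 digits
    ("123456", "199999"),  # agrees on the first digit only
    ("123456", "923456"),  # disagrees already at the first digit
]
print(agreement_by_level(sample))

# Figures from the table: 5773 of 6058 automatically coded fields agree at 1 digit.
print(round(100 * 5773 / 6058, 2))  # 95.3
```

Because agreement at a finer level requires agreement at every coarser level, the percentages can only stay equal or fall as more digits are compared.
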
The code list for the occupation attribute has a hierarchical structure. The results of automatic coding and the work of the specialists clearly differed more at the lower levels of the code hierarchy. This can be partly explained by the fact that the data were not unambiguous enough for assignment at a low level of the hierarchy. In several cases it turned out that more than one answer was possible.

REFERENCES

1. Dumičić, S. and Dumičić, Ksenija. Optical Reading and Automatic Coding in the Census 1991 in Croatia. Proceedings of the International Conference on Survey Measurement and Process Quality, Contributed Papers, Bristol, U.K., April 1-4, 1995. Published by American Statistical Association, Alexandria, Virginia, U.S.A., pp. 58-63. ISBN 1-883276-18-7.

2. Dumičić, S., Dumičić, Ksenija, Kalpić, D. and Mornar, V. Automated coding in the Census '91 in Croatia. Statistical Data Editing, Volume No. 2, Methods and Techniques. United Nations, Statistical Commission and ECE, Conference of European Statisticians, Statistical Standards and Studies, No. 48, pp. 209-216. ISBN 92-1-116664-0.

3. Kalpić, D. Automated Coding of Census Data. Journal of Official Statistics, No. 4, 1994, pp. 449-463, Statistics Sweden.

SUMMARY

Optical readers were used for data capture in the 1991 population census in Croatia. For the occupation attribute, the text of the answers that respondents gave to the enumerators was recorded. More than 2 million answers were collected. After optical reading and character recognition, a specially built automatic coding program was used. The program supported both automatic coding and computer assisted coding. This technology produced very good results. More than 2/3 of the answers were coded automatically with very high quality, and 1/3 were coded in computer assisted mode with somewhat lower quality.