Experience on Automated Coding of Occupation in Population
Census in Croatia
Srđan Dumičić
PULS – Market, Media and Public Opinion Research Agency
Šubićeva
10000 Zagreb, Croatia
[email protected]
Ksenija Dumičić
Faculty of Economics – Zagreb, Department of Statistics
Kennedyjev trg 6
10000 Zagreb, Croatia
[email protected]
1. Introduction
This paper presents estimates of the final results of the joint impact of optical reading and
automatic coding of occupation in the 1991 population census in Croatia. To estimate the quality
of OCR and coding, systematic sampling with a sampling fraction of about f=0.006 was
used (Dumičić and Dumičić, 1995). The estimates indicated that the substitution rate for alpha
characters was less than 2.5%.
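
As an illustration of the sampling scheme, the following is a minimal sketch of systematic sampling with a fraction of about f=0.006. The function name, the random start and the rounding of the step to 167 are assumptions made for this example; the paper does not describe how the sample was actually drawn.

```python
import random


def systematic_sample(n_units, fraction, seed=None):
    """Return the indices of a systematic sample.

    Every k-th unit is taken, where k = round(1 / fraction), starting
    from a random position inside the first interval (an assumption).
    """
    step = round(1 / fraction)          # f = 0.006 gives a step of 167
    start = random.Random(seed).randrange(step)
    return list(range(start, n_units, step))


# Hypothetical population of 100,000 fields.
sample = systematic_sample(100_000, 0.006, seed=1)
print(len(sample), "fields selected, every 167th one")
```

With a step of 167 the realized fraction is 1/167, which is close to the stated value of about 0.006.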
Fourteen fields contained textual answers, which were converted to the corresponding codes using
special software (Dumičić and Dumičić, 1995) for automated and computer assisted coding. The
efficiency of the automatic coding process was measured by the percentage of fields coded
automatically and the percentage of fields coded in computer assisted mode. Quality
was measured by the percentage of fields in which the codes assigned automatically or in computer
assisted mode were the same as those assigned by a well-trained specialist.
2. Automated Coding
In order to achieve high quality results in automatic and computer assisted coding, a complex
application was developed (Kalpić, 1994). It consists of two principal steps: the single word
recognition step and the phrase recognition step.
A prerequisite was the creation of thesauri. Each single word appearing there became a
candidate to be matched with the input word. Relative word weights were calculated to represent
the discrimination power of every word. An input word was expected either to match a thesaurus
word exactly or to be similar to one of them. An exact match might be absent because
the thesaurus did not contain the word in the same case, number or gender, or because the input
word was erroneously hand-written or optically read. In such a case the similarity between the input
word and the candidate word from the thesauri was based upon: the difference in word lengths, matching
of equal or similar characters, matching after a shift, and the length of continuous matching strings.
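
The four similarity criteria can be sketched as a single score. This is only an illustration under stated assumptions: the groups of "similar" characters, the weights 0.75 and 0.25, the length penalty and the function names are all hypothetical, and the application described by Kalpić (1994) may combine the criteria quite differently.

```python
# Hypothetical groups of characters that handwriting or OCR may confuse.
SIMILAR = [set("ao"), set("il1"), set("uv"), set("cčć"), set("sš"), set("zž")]


def char_match(a, b):
    """1.0 for equal characters, 0.5 for similar ones, otherwise 0.0."""
    if a == b:
        return 1.0
    if any(a in group and b in group for group in SIMILAR):
        return 0.5
    return 0.0


def aligned(word, cand, shift):
    """Total character match and longest exact run for a given shift."""
    total, run, best_run = 0.0, 0, 0
    for i, ch in enumerate(word):
        j = i + shift
        m = char_match(ch, cand[j]) if 0 <= j < len(cand) else 0.0
        total += m
        run = run + 1 if m == 1.0 else 0
        best_run = max(best_run, run)
    return total, best_run


def similarity(word, cand, max_shift=1):
    """Score in [0, 1]; the weights are illustrative assumptions."""
    word, cand = word.lower(), cand.lower()
    longest = max(len(word), len(cand))
    if longest == 0:
        return 1.0
    best = 0.0
    for shift in range(-max_shift, max_shift + 1):   # matching after shift
        total, run = aligned(word, cand, shift)
        best = max(best, 0.75 * total / longest + 0.25 * run / longest)
    length_penalty = abs(len(word) - len(cand)) / longest
    return best * (1 - 0.5 * length_penalty)


print(similarity("vozac", "vozač"))   # one similar character
print(similarity("ozač", "vozač"))    # first letter lost, found after a shift
```

A missing leading letter yields no match at all without the shift, which is why the shifted comparison is tried alongside the direct one.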
The application supports:
- automatic coding,
- computer assisted coding, and
- monitoring and managing the whole process through a large set of parameters.
At that time a national code book for occupation with 2,922 different codes was in use. It
has a hierarchical structure with a 6-digit code at the lowest level. The thesaurus contained 21,644
phrases comprising 72,148 words, of which 13,109 were unique.
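
The relative word weights mentioned above can be illustrated on a toy thesaurus. The sketch below assumes an inverse-frequency weight (a word occurring in few phrases discriminates better). The paper does not give the actual formula, and the English phrases and the function name are hypothetical.

```python
import math
from collections import Counter


def word_weights(phrases):
    """Weight each word by how rarely it occurs across the phrases.

    Assumed formula: log(1 + n_phrases / phrases_containing_word).
    """
    phrase_freq = Counter()
    for phrase in phrases:
        phrase_freq.update(set(phrase.lower().split()))
    n = len(phrases)
    return {w: math.log(1 + n / df) for w, df in phrase_freq.items()}


# Hypothetical miniature thesaurus.
thesaurus = ["truck driver", "bus driver", "taxi driver",
             "primary school teacher"]
weights = word_weights(thesaurus)

total_words = sum(len(p.split()) for p in thesaurus)
print(total_words, "words,", len(weights), "unique")
print(sorted(weights, key=weights.get))   # least to most discriminating
```

In this toy example "driver" appears in three of the four phrases and therefore receives the lowest weight, while "truck" identifies its phrase uniquely.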
For quality monitoring, two different types of results were analyzed:
- values obtained through automatic coding, and
- values imputed in computer assisted mode.
In total 2,162,219 codes were assigned. These codes were compared with
those assigned in the usual way by the subject matter specialists. The results are given in the following table:
Table 1. Efficiency and quality of automatic coding

Coding mode             Efficiency  Fields in sample  Code level  Equal fields  % equal
automatic coding        67.2 %      6058              1 digit     5773          95.30
                                                      2 digits    5642          93.13
                                                      3 digits    5492          90.66
computer assisted mode  32.8 %      3143              1 digit     2406          76.99
                                                      2 digits    1959          62.69
                                                      3 digits    1711          54.75
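
The quality percentages for automatic coding follow directly from the counts in the table: the number of equal fields divided by the number of fields in the sample. A short sketch of that arithmetic, with a helper name chosen for this example:

```python
def percent_equal(equal_fields, sample_fields):
    """Share of sampled fields whose code equals the specialist's code."""
    return 100 * equal_fields / sample_fields


# Counts for automatic coding, taken from the table.
sample_fields = 6058
equal_by_level = {"1 digit": 5773, "2 digits": 5642, "3 digits": 5492}

for level, equal in equal_by_level.items():
    print(f"{level}: {percent_equal(equal, sample_fields):.2f} %")
```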
It should be pointed out that all the values reflect the joint influence of optical
reading and automatic coding, measured against the results that the subject matter specialists obtained
after checking and correcting the inquiry forms. Understandably, if all the material had been
coded manually, the quality of the coders' work would have been considerably reduced, and the expected
quality would have been lower.
The code list for the occupation attribute has a hierarchical structure. The results of
automatic coding and the work of the specialists clearly differed more at the lower levels
of the code hierarchy. This can be partly explained by the fact that the data were not
unambiguous enough for assignment at a low hierarchy level. In several cases it turned out that
more than one answer was possible.
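
Because the codes are hierarchical, agreement at a given level can be measured by comparing only the leading digits. The sketch below uses made-up 6-digit codes and a hypothetical function name. It shows why agreement can only stay equal or fall as more digits are compared.

```python
def agreement_by_level(auto_codes, expert_codes, levels=(1, 2, 3)):
    """Percentage of code pairs that agree on their first d digits."""
    pairs = list(zip(auto_codes, expert_codes))
    return {
        d: 100 * sum(a[:d] == e[:d] for a, e in pairs) / len(pairs)
        for d in levels
    }


# Made-up codes, not census data.
auto = ["123456", "124000", "130000", "500000"]
expert = ["123456", "123456", "123456", "123456"]
print(agreement_by_level(auto, expert))
```

A pair that agrees on three digits necessarily agrees on two and on one, so the percentages form a non-increasing sequence, as in the table.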
REFERENCES
1. Dumičić, S. and Dumičić, Ksenija. Optical Reading and Automatic Coding in the Census
1991 in Croatia. Proceedings of the International Conference on Survey Measurement and
Process Quality, Contributed Papers, Bristol, U.K., April 1-4, 1995. American
Statistical Association, Alexandria, Virginia, U.S.A. pp. 58-63. ISBN 1-883276-18-7.
2. Dumičić, S., Dumičić, Ksenija, Kalpić, D. and Mornar, V. Automated coding in the
Census '91 in Croatia. Statistical Data Editing, Volume No. 2, Methods and Techniques.
United Nations, Statistical Commission and ECE, Conference of European Statisticians,
Statistical Standards and Studies, No. 48. pp. 209-216. ISBN 92-1-116664-0.
3. Kalpić, D. Automated Coding of Census Data. Journal of Official Statistics, Vol. 10,
No. 4, 1994, pp. 449-463. Statistics Sweden.
SUMMARY
Optical readers were used for data capture in the 1991 population census in Croatia.
For the occupation attribute, the text of the answers that the respondents gave to the
enumerators was recorded. More than 2 million answers were collected. After
optical reading and character recognition, a specially built automatic coding
program was used. The program supported automatic coding and computer assisted
coding. Very good results were obtained with this technology. More than 2/3 of
the answers were coded automatically with very high quality, and 1/3 were coded
in computer assisted mode with somewhat lower quality.