Validation report
SUBJECT: Validation Swiss French SpeechDat(M) corpus
AUTHORS: Henk van den Heuvel, Eric Sanders
VERSION: 2.0
DATE   : 26 June 1996

The speech databases made within the SpeechDat(M) project were validated by SPEX, Leidschendam, the Netherlands, to assess their compliance with the SpeechDat(M) format and content specifications, as documented in Deliverable 1.4.1 of the project. The validation results of the Swiss French SpeechDat(M) database are contained in this document.

The validation of the Swiss French SpeechDat corpus has taken into consideration (1) that not all SpeechDat specifications were known at the time of delivery of the corpus, and (2) the exceptional status of IDIAP as an external partner in the SpeechDat project. However, where appropriate, references to the most important SpeechDat validation criteria will be made.

In the validation procedure we systematically checked a list of validation criteria for a range of subjects. In the following we will evaluate these criteria one by one for the Swiss French database offered by IDIAP. Validation results that call for attention are marked by =>.

The following subjects were validated:
 1 DOCUMENTATION
 2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES
 3 ITEMS
 4 CONTENTS SAMPLED DATA FILES
 5 CONTENTS ANNOTATION
 6 LEXICON
 7 SPEAKERS
 8 RECORDING PLATFORM
 9 TRANSCRIPTION
The document is concluded by
10 SUMMARY

=====================================================================
1. DOCUMENTATION

The main documentation is provided in the file README.TXT

- Documentation file must be present
  OK, as README.TXT
- Language of doc file: preferably English
  OK
- Contact person: name, address, affiliation
  OK
- Number of CDs / Tapes
  OK
- Contents of each CD
  OK
  =>However, we find contradictory information under heading: `Directories and files'. In the second paragraph it is said that the phonetically rich sentences are on the first CD. In the fourth paragraph it is stated that this information is on the first AND SECOND CD.
- The directory structure of the CDs / tapes
  OK, is explained
  =>but not in SpeechDat format
- Speaker information
  =>In general, speaker information is poorly provided in the documentation:
  . which regions, how many of each
    =>The distribution of regions is alluded to, but not made explicit.
  . motivation for selection of regions
    =>This information is not supplied
  . which age groups, how many of each
    =>Age groups are mentioned, but not in quantitative terms.
  . sexes: males, females, also children?; how many of each.
    This information is supplied in the section on directories and files,
    =>but we recommend putting it in the section on speakers as well.
    =>There is no information about the participation of children, and there is no other way to extract this information (e.g. from the NIST file headers). Please be more explicit on this topic.
- Reference to a file where speaker characteristics are stored (speaker.tbl)
  =>speaker table is not present
- The number of items on the CD and per speaker (Contents.tbl)
  =>There is no contents table; there is only a list of missing items.
- Naming conventions for directories and files
  OK, is mentioned, but see remark above at CD-contents
- Prompting
  . linguistic specification (and motivation) for the prompting material
    OK, this information is in the section Prompt sheets generation
  . connection of sheet items to item numbers on CD / tape
    OK, has been provided
  . sheet example
    OK, provided as additional file
  . items must be spread over the sheet to prevent list effects (e.g. three yes/no questions right after another are not allowed)
    OK, as can be seen from sheet example
- Analysis of frequency of occurrence of the sub-word units represented in the phonetically rich sentences (either of phones, biphones, triphones)
  . recommended: at least 2 samples of each phone per caller (should appear from documentation)
    This has been computed, as is mentioned in section Prompt sheet generation,
    =>but it is not explained in a perspicuous way.
    Exact figures are not presented.
- Recording platform should be specified
  OK, but only briefly.
- Signal characteristics (number of bits per sample; bandwidth; coding type; compression procedures)
  OK
  =>However, linear expansion to 16 bit before application of SHORTEN was not intended for SpeechDat corpora. Second, there seems to be a typing error in the coding table used for linear expansion.
- The format and the file header structure of annotation files should be specified
  OK, is mentioned,
  =>but at an unexpected place, viz. at the end of the section about annotation and transcription.
- Annotation
  . procedure
    OK, is mentioned but only briefly.
    =>There is no information on how aspects of transliteration were carried out (upper-case, lower-case lettering; how are digits transliterated; how are abbreviations transliterated; how are truncations treated). This information is now implicit in the file GUIDE.PS.
  . quality assurance
    OK
  . character set used for annotation (transliteration)
    OK, ISO-latin 1
  . annotation symbols for non-speech acoustic events must be mentioned, at least for [Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
    =>Not mentioned
  . list of symbols used to denote word interruptions and break-offs should be provided
    =>Not supplied
- Lexicon information
  . Transcription manual: which graphemic characters and conventions are used in annotations and lexicon
    OK
  . Procedures to obtain phonemic forms from orthographic input
    OK,
    =>It is not clear from the text if transcriptions made by the automatic text-to-phoneme converter were checked by hand.
  . Overview of SAMPA symbols used (only in this manner can it be checked if the lexicon contains only legal symbols)
    =>Not provided
- Only one spelling of each word is allowed. Therefore a list of normalised spellings for words with alternative spellings should be included.
  =>Not provided
- Indication of how many of the files were double checked by the producer, together with the percentage of detected errors
  =>There is no information about this.
- Optional RECORDING table
  Not provided

=====================================================================
2. DATABASE STRUCTURE, CONTENTS AND FILE NAMES

1 Directory / subdirectory conventions

- Format of directory tree should be \<database>\<volume>\<block>\<session>
  . data base: defined as <name><#><language code>
      <name> can be FIXED, MOBIL, VERIF
      <#> is 0 for SpeechDat(M) and 1 for SpeechDat
      <language_code> is the ISO two-letter code for the language
  . volume : is a progressive number specifying the CD containing the material. Defined as CD<nn> where <nn> is the number.
  . block  : defined as BLOCK<nn> where <nn> is a progressive number from 00 to 99. Block numbers are unique over all CDs. They could typically be the first two digits of <nnnn> below.
  . session: defined as SES<nnnn> where <nnnn> is the session code also appearing in the file name
  =>This convention was not followed because it was not known at the time the corpus was put to CD. We find all person directories one level below the sex directory. These are far more than 100 subdirectories.
- A README.TXT file should be in the root describing all (documentation) files on the CD-ROM.
  OK
- A file containing a shortened version of the volume name (11 chars max.) should be in the root directory. The name of this file is DISK.ID. This file supplies the volume label to UNIX systems that cannot read the physical volume label. Example of contents: FIXED0EN_00.
  OK
- Any source code supplied should be in \<database_name>\SOURCE (SAMLIB, V4 and GNU gunzip + licence)
  =>There is no source code on the CD-ROM.
- A copyright statement should be given in COPYRIGH.TXT
  OK
- The index files (if presented) obey the nomenclature <database><language_code><item_code>.LST, e.g. A0ENN3.LST (see below for item_code)
  Not present
- Documentation should be in \<database_name>\DOC
- Tables should be in \<database_name>\TABLE
- Index files (optional) should be in \<database_name>\LST
- Prompt sheet files (optional) should be in \<database_name>\PROMPT
  =>This information was not known at the time the corpus was put to CD by IDIAP.
- File naming conventions
  All file names should obey the following pattern: DDNNNNCC.LLF
    DD   : database identification code
           For SpeechDat(M): A0 = fixed net, B0 = mobile
           For SpeechDat   : A1 = fixed net, B1 = mobile, C1 = speaker verification
    NNNN : session code 0000 to 9999
    CC   : item code; first character is item type identifier, second character is item number
    LL   : ISO-639 language code (with extensions)
    F    : speech file type
           Z is for A-law, compressed
           O is for Orthographic label (label file)
  =>This information was not known at the time the corpus was put to CD by IDIAP.
- Files in the corpus
  - Contents of lowest level subdirectories should be of one call only
    OK
  - Empty (i.e. zero-length) files are not permitted
    OK, empty files were not found
  - Counts should match information in documentation
    . count of files in each subdirectory
    . count grand total
    We have found directories for all 575 female and 425 male speakers mentioned in the documentation.
    =>The documentation does not mention the grand total of files in the corpus. Therefore this cannot be checked.
  - Missing items per speaker
    Information about missing items is absent in the documentation file itself, but there is a file with missing items in the root. This list is correct.
  - File match: for each label file there must be one speech file and vice versa.
    This issue is relevant in case label files and speech files are stored separately (SAM format), which is not the case here (NIST headers).
  - Part of the corpus may be designed for training and a (typically smaller) part for testing.
    No partitioning is supplied.
- The contents of the database as given in CONTENTS.LST should comprise
  . CD-ROM volume name (VOL:)
  . full pathname (DIR:)
  . speech file name (SRC:)
  . speaker code (SCD:)
  . speaker sex (SEX:)
  . speaker age (AGE:)
  . region of call (REG:)
  . orthographic transcription of uttered item (LBO:)
  This file must be supplied as an ASCII delimited file (either using TAB, or commas and (double) quoted strings).
  =>This file is not present. This information was not known at the time the corpus was put to CD by IDIAP.

=====================================================================
3. MISSING ITEMS, STRUCTURALLY AND INCIDENTALLY

1. Structurally missing items

1.1 The following items in the Swiss French SpeechDat corpus are obligatory and present:

- 1 isolated digit                  item25 (but contains also hash (#) and star (*) symbols)
- 3 connected digits
    - 4 digit number to identify the prompt sheet   identif
    - ~10 digit telephone number                    telefon
    - ~12 digit credit card number                  item5
- 3 natural numbers                 item20 ! natural number
                                    item4  ! quantity
  => Third natural number is missing
- 2 money amounts                   item8 item17
- 3 spelled words                   item7 item22 item26
- 1 time of day
  => missing
- 1 time phrase                     item13
- 1 date (spontaneous)              naissanc
- 2 dates (prompted)                item18
  => second date is missing
- 3 yes/no questions                sexe
  => two yes/no questions are missing
- city of call/birth                ecole
- 6 common application words        item3 item10 item16 item21 item24
  => one application word is missing
- 3 application word phrases
  => missing
- 9 phonetically rich sentences     item2 item6 item9 item11 item12 item15 item19 item23 item27

1.2. The following items are obligatory but not present
- one natural number
- time of day (spontaneous)
- one date (prompted)
- two yes/no questions
- one application word
- three application phrases

1.3. The following items are present but not obligatory
  item29  ! city name prompted
  item14  ! name for spelling table
  langue  ! mother tongue speaker
  niveau  ! education level speaker
  typetele! type of telephone used
  rensei  ! query to telephone dir
  comments! free comment on session
  item28  ! extra phon. rich sentence

1.4. Conclusion
=>This means that structurally 9 items too few were recorded (or included) in the Swiss French database, namely
- one natural number
- time of day (spontaneous)
- one date (prompted)
- two yes/no questions
- one application word
- three application phrases
We have understood from the documentation that this is due to the fact that recordings had been made before the structure of SpeechDat corpora was known.

2. Application words

In appendix A of SpeechDat deliverable 1.4-1 a list of 39 obligatory application words is provided.
=>Checking the documentation for the Swiss corpus it can be established that a number of application words (or direct equivalents) were not included in the Swiss set: Appel, Effacer, Enregistrer, Activer, Composer, Telephone, Annonce, Repondeur, Conference, Extern, Intern, Programmer, Rappel, Ecouter, Pause.
The application words that are present appear in a sufficient quantity in the prompt texts. Each word occurs at least 40 times. Most words occur between 40 and 60 times. A few words occur about 300 times (Chef-téléopératrice; Informations consommateurs; Informations touristiques; L'heure; Service des télécommunications; Service international; Services PTT). A full overview is displayed below.
  Abonnement: 63
  Adresse: 59
  Adulte: 58
  Agenda: 58
  Aide: 52
  Allemand: 59
  Anglais: 60
  Annuler: 64
  Billet: 57
  Chef-téléopératrice: 280
  Choisir: 59
  Cinéma: 50
  Concert: 58
  Continuer: 67
  Corriger: 53
  Début: 62
  Détaillé: 54
  Enfant: 47
  Espagnol: 68
  Exemple: 59
  Explications: 63
  Fin: 54
  Français: 47
  Galerie: 60
  Guide: 55
  Horaire: 61
  Informations consommateurs: 312
  Informations touristiques: 297
  Italien: 61
  L'heure: 267
  Le temps: 57
  Lire: 60
  Lister: 72
  Message: 52
  Mode d'emploi: 60
  Musée: 66
  Non: 65
  Oui: 75
  Petites annonces: 55
  Place assise: 44
  Précédent: 54
  Quitter: 67
  Rendez-vous: 56
  Romanche: 63
  Réception: 46
  Répéter: 57
  Réservation: 51
  Résumé: 43
  Service des télécommunications: 295
  Service international: 286
  Services ptt: 263
  Ski: 51
  Standardiste: 53
  Stop: 60
  Suivant: 48
  Tarif: 53
  Théâtre: 64
  Transfert: 64
  Validation: 56

3. Incidentally missing items

We examined the directories (calls) for which more than 9 obligatory items were missing. The following result was found:

  ( 9 obligatory items missing in 857 directories)
   10 obligatory items missing in 131 directories
   11 obligatory items missing in   7 directories
   12 obligatory items missing in   3 directories
   13 obligatory items missing in   1 directory
   14 obligatory items missing in   1 directory

4. Overall conclusion

SpeechDat has the following criteria for missing items:
- 85% (850) out of 1000 calls must be complete
  . A maximum of 10% (100) of the calls may miss up to 3 mandatory items
  . A maximum of 5% (50) of the calls may miss more items
  (A complete call is one with all speech files recorded for all prompt items)
=> Since 9 obligatory items are structurally missing in each call, the Swiss French corpus does not fulfil the SpeechDat specifications in this respect. If we take into account the special status of the corpus and only look at the incidentally missing items, then we observe that 141 calls (14.1%) miss up to 3 obligatory items and 2 calls (0.2%) miss more items. This does not fulfil the specifications either, because the 10% maximum for calls missing up to 3 items is exceeded.
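The three criteria can be checked mechanically. The following is a minimal sketch, assuming the per-call counts from the table of incidentally missing items and treating the 9 structurally missing items as a baseline (so a call missing 10 items counts as missing 1 incidentally). The function name, the `structural` parameter and the layout of the returned dictionary are illustrative assumptions and are not part of the SpeechDat specifications.

```python
# Number of directories (calls) per count of missing obligatory items,
# copied from the table of incidentally missing items.
MISSING_COUNTS = {9: 857, 10: 131, 11: 7, 12: 3, 13: 1, 14: 1}
STRUCTURAL = 9  # items structurally missing in every call


def check_missing_items(counts, structural=0):
    """Evaluate the three SpeechDat missing-item criteria.

    counts     -- maps "number of missing items" to "number of calls"
    structural -- baseline subtracted before a call is classified
    """
    total = sum(counts.values())
    complete = up_to_three = more = 0
    for missing, calls in counts.items():
        incidental = missing - structural
        if incidental <= 0:
            complete += calls
        elif incidental <= 3:
            up_to_three += calls
        else:
            more += calls
    return {
        "total": total,
        "complete": complete,
        "up_to_three": up_to_three,
        "more": more,
        "complete_ok": complete >= 0.85 * total,        # at least 85% complete
        "up_to_three_ok": up_to_three <= 0.10 * total,  # at most 10%
        "more_ok": more <= 0.05 * total,                # at most 5%
    }


result = check_missing_items(MISSING_COUNTS, structural=STRUCTURAL)
for key, value in result.items():
    print(key, value)
```

Calling the function with `structural=0` instead reproduces the first half of the conclusion: every call then misses more than 3 items, so none of the three criteria is met.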
=====================================================================
4 CONTENTS SAMPLED DATA FILES

1 File structure
  . NIST (header : contains file info -> ant.txt)
  . SAM
  OK, NIST headers

2 Coding
  . A-law
  . Compression by Shorten (A-law version of shorten)
  =>According to SpeechDat criteria the 8-bit A-law files should be Shortened immediately, and not via first expanding them to 13-bit linear.

3 Sample distribution
Several sample distributions were checked:

3.1 File length in seconds
We calculated the length of the files in seconds in order to trace spurious recordings, i.e. files of extraordinary length.

Distribution of file durations in all items (in seconds):
  #Seconds    Occurrences
   0 -  1 :   1056
   1 -  2 :   8980
   2 -  3 :   8411
   3 -  4 :   6969
   4 -  5 :   3956
   5 -  6 :   2386
   6 -  7 :   1711
   7 -  8 :   1232
   8 -  9 :    936
   9 - 10 :    671
  10 - 11 :    454
  11 - 12 :    249
  12 - 13 :    172
  13 - 14 :    117
  14 - 15 :     63
  15 - 16 :     48
  16 - 17 :     31
  17 - 18 :     25
  18 - 19 :     17
  19 - 20 :     36
  20 - 21 :     14

Distribution of file durations over all obligatory items:
  #Seconds    Occurrences
   0 -  1 :    909
   1 -  2 :   5947
   2 -  3 :   7430
   3 -  4 :   6735
   4 -  5 :   3814
   5 -  6 :   2262
   6 -  7 :   1555
   7 -  8 :   1077
   8 -  9 :    776
   9 - 10 :    527
  10 - 11 :    314
  11 - 12 :    177
  12 - 13 :    101
  13 - 14 :     69
  14 - 15 :     46
  15 - 16 :     26
  16 - 17 :     12
  17 - 18 :      6
  18 - 19 :     11
  19 - 20 :     11
  20 - 21 :      7

The obligatory items with a duration larger than 17 seconds have been examined in greater detail. We found that files up to 18 seconds length generally had no particular anomalies.
=>Files with a length of more than 19 or even 20 seconds tended to be cut off or to be lengthened with 10-15 seconds silence or noise (but this was not always the case). Thus we found the following very long files which had problems:

  File                    Duration  Problem
  MALES\P3038\NAISSANC    20.2      Long trailing silence of 15 s.
  MALES\P3054\ECOLE       20.2      Long trailing noise of 19 s.
  MALES\P3075\IDENTIF     20.2      File contains much more speech than transliterated and is cut off.
  MALES\P3188\ITEM20      20.5      Long trailing noise of 15 s.
  MALES\P3297\ITEM22      19.6      Cut off
  MALES\P3297\ITEM7       19.7      Cut off
  MALES\P3345\ITEM22      18.4      Cut off
  FEMALES\P0345\ITEM22    19.8      Cut off
  FEMALES\P0402\ITEM22    19.4      Cut off
  FEMALES\P0402\ITEM7     20.1      Cut off
  FEMALES\P0519\ITEM26    19.6      Cut off

=>These speech files are still usable but the waveforms should be edited more properly.

Distribution of mean file durations per call over all items:
  #Seconds    Occurrences
   2 -  3 :    109
   3 -  4 :    589
   4 -  5 :    265
   5 -  6 :     35
   6 -  7 :      2

3.2 min-max samples
We provide a histogram with clipping rates. We have counted for each file how many times the maximum (4032) and minimum (-4032) value occurred. The clipping rate is defined as the number of samples in a file that are equal to the maximum or minimum value, divided by the total number of samples in the file. The histogram, then, is an overview of how many files were found in a set of clipping rate intervals.

Clip distribution for all files:
  Clipping rate (in %)   Occurrences
  0.0 - 0.1 :   1705
  0.1 - 0.2 :    365
  0.2 - 0.3 :    187
  0.3 - 0.4 :     85
  0.4 - 0.5 :     59
  0.5 - 0.6 :     38
  0.6 - 0.7 :     25
  0.7 - 0.8 :     16
  0.8 - 0.9 :     16
  0.9 - 1.0 :     10
  1.0 - 1.1 :      6
  1.1 - 1.2 :      6
  1.2 - 1.3 :      9
  1.3 - 1.4 :      6
  1.4 - 1.5 :      5
  1.5 - 1.6 :      3
  1.6 - 1.7 :      3
  1.8 - 1.9 :      1
Number of files with absolute maximum < 4032: 34989
Total number of files: 37534

Clip distribution for obligatory items:
  Clipping rate (in %)   Occurrences
  0.0 - 0.1 :   1481
  0.1 - 0.2 :    312
  0.2 - 0.3 :    152
  0.3 - 0.4 :     64
  0.4 - 0.5 :     51
  0.5 - 0.6 :     32
  0.6 - 0.7 :     20
  0.7 - 0.8 :     13
  0.8 - 0.9 :     13
  0.9 - 1.0 :      8
  1.0 - 1.1 :      3
  1.1 - 1.2 :      6
  1.2 - 1.3 :      6
  1.3 - 1.4 :      5
  1.4 - 1.5 :      3
  1.5 - 1.6 :      3
  1.6 - 1.7 :      2
Number of files with absolute maximum < 4032: 29638
Total number of files: 34357

A clipping rate higher than 1.5% was found in (obligatory items):
  (1.61) file FEMALES\P0458\ITEM13
  (1.62) file FEMALES\P0458\ITEM7
  (1.59) file FEMALES\P0458\ITEM8
  (1.53) file MALES\P3154\TELEFON
  (1.52) file MALES\P3232\ITEM14

Mean clipping rate per call over all items.
Clip distribution per call:
  Clipping rate (in %)   Occurrences
  0.0 - 0.1 :    273
  0.1 - 0.2 :     13
  0.2 - 0.3 :      6
  0.3 - 0.4 :      3
  0.4 - 0.5 :      1
  0.5 - 0.6 :      1
  0.6 - 0.7 :      1
Number of directories with absolute maximum < 4032: 702

The calls with a mean clipping rate of more than 0.4% (all items) are:
  (0.63) dir FEMALES\P0458
  (0.54) dir MALES\P3154
  (0.41) dir MALES\P3379

There is no criterion to decide that files or directories with a high clipping rate should be rejected outright. It will depend on the application to what extent files with a high clipping rate are usable.

3.3 Mean values
We computed the mean sample value of each item in each call. We provide a histogram with mean values below. The histogram, then, is an overview of how many files were found in a set of mean sample value intervals. This overview can be used to trace files with large DC-offsets. We recall our remark above that the minimum/maximum sample values were -4032/4032.

Distribution of means over all items:
  Mean value      Occurrences
  -280 - -270 :       3
  -270 - -260 :       1
  -260 - -250 :       1
  -250 - -240 :       1
  -240 - -230 :       1
  -220 - -210 :       6
  -200 - -190 :       4
  -190 - -180 :       6
  -180 - -170 :       3
  -170 - -160 :       1
  -160 - -150 :       5
  -150 - -140 :       2
  -140 - -130 :       3
  -130 - -120 :       1
   -80 -  -70 :       1
   -70 -  -60 :       2
   -60 -  -50 :       7
   -50 -  -40 :      34
   -40 -  -30 :      77
   -30 -  -20 :     156
   -20 -  -10 :     399
   -10 -    0 :   18906
     0 -   10 :   17123
    10 -   20 :     575
    20 -   30 :      91
    30 -   40 :      49
    40 -   50 :      28
    50 -   60 :      23
    60 -   70 :      17
    70 -   80 :       1
    80 -   90 :       5
    90 -  100 :       2

Distribution of means over all obligatory items:
  Mean value      Occurrences
  -280 - -270 :       3
  -250 - -240 :       1
  -220 - -210 :       6
  -200 - -190 :       4
  -190 - -180 :       4
  -180 - -170 :       3
  -170 - -160 :       1
  -160 - -150 :       4
  -150 - -140 :       2
  -140 - -130 :       3
  -130 - -120 :       1
   -80 -  -70 :       1
   -70 -  -60 :       1
   -60 -  -50 :       6
   -50 -  -40 :      27
   -40 -  -30 :      66
   -30 -  -20 :     138
   -20 -  -10 :     357
   -10 -    0 :   15879
     0 -   10 :   14612
    10 -   20 :     516
    20 -   30 :      74
    30 -   40 :      38
    40 -   50 :      27
    50 -   60 :      20
    60 -   70 :      14
    70 -   80 :       1
    80 -   90 :       2
    90 -  100 :       1

The files with a mean lower than -100 or higher than 70 were:
  (-215.8) file FEMALES\P0226\ITEM11
  (-135.8) file FEMALES\P0226\ITEM12
  (-190.1) file FEMALES\P0226\ITEM15
  (-210.7) file FEMALES\P0226\ITEM19
  (-178.4) file FEMALES\P0226\ITEM2
  (-136.6) file FEMALES\P0226\ITEM23
  (-191.7) file FEMALES\P0226\ITEM27
  (-214.1) file FEMALES\P0226\ITEM28
  (-153.2) file FEMALES\P0226\ITEM6
  (-183.0) file FEMALES\P0226\ITEM9
  (-276.5) file FEMALES\P0226\ECOLE
  (-158.6) file FEMALES\P0226\IDENTIF
  (-172.0) file FEMALES\P0226\ITEM10
  (-200.0) file FEMALES\P0226\ITEM13
  (-146.5) file FEMALES\P0226\ITEM14
  (-165.3) file FEMALES\P0226\ITEM16
  (-211.3) file FEMALES\P0226\ITEM17
  (-211.2) file FEMALES\P0226\ITEM18
  (-242.9) file FEMALES\P0226\ITEM20
  (-155.8) file FEMALES\P0226\ITEM21
  (-139.1) file FEMALES\P0226\ITEM22
  (-181.4) file FEMALES\P0226\ITEM24
  (-140.6) file FEMALES\P0226\ITEM25
  (-174.9) file FEMALES\P0226\ITEM26
  (-120.0) file FEMALES\P0226\ITEM3
  (-198.0) file FEMALES\P0226\ITEM4
  (-181.6) file FEMALES\P0226\ITEM5
  (-150.7) file FEMALES\P0226\ITEM7
  (-189.6) file FEMALES\P0226\ITEM8
  (-277.8) file FEMALES\P0226\NAISSANC
  (-270.2) file FEMALES\P0226\SEXE
  (-217.3) file FEMALES\P0226\TELEFON
  (87.9)   file FEMALES\P0361\ITEM19
  (70.8)   file FEMALES\P0361\ECOLE
  (80.9)   file FEMALES\P0361\ITEM20
  (98.8)   file FEMALES\P0361\SEXE

Distribution of means per call over all items:
  Mean value      Occurrences
  -200 - -190 :       1
   -50 -  -40 :       1
   -40 -  -30 :       1
   -30 -  -20 :       5
   -10 -    0 :     555
     0 -   10 :     426
    10 -   20 :       6
    20 -   30 :       2
    30 -   40 :       1
    40 -   50 :       1
    60 -   70 :       1

The calls with an overall mean of less than -50 or more than 50 were:
  (-191.8) dir FEMALES\P0226
  (65.3)   dir FEMALES\P0361
=>By viewing and listening to the files in directory p0226 we noticed that all files in this directory are severely corrupted by errors in the recording platform. The files in directory p0361 are OK.

3.4 Signal to Noise Ratio (SNR)
We split each signal file into contiguous windows of 10 ms and computed the Mean Square (energy) in each window. Before computing the square value we subtracted the mean value (calculated over the total file) from the sample value. The 5% of the windows that contained the lowest energy were assumed to contain line noise.
In this way the signal to noise ratio could be calculated for each file by dividing the mean energy over all windows by the mean energy of the 5% of windows mentioned above. The result was expressed in dB by taking 10*log10 of this ratio.

Distribution of SNR over all items:
  SNR         Occurrences
   0 -  5 :       6
   5 - 10 :       3
  10 - 15 :       5
  15 - 20 :      23
  20 - 25 :     213
  25 - 30 :    1632
  30 - 35 :    6925
  35 - 40 :   13365
  40 - 45 :   10058
  45 - 50 :    4622
  50 - 55 :     652
  55 - 60 :      30

Distribution of SNR over all obligatory items:
  SNR         Occurrences
   0 -  5 :       2
   5 - 10 :       1
  10 - 15 :       3
  15 - 20 :      16
  20 - 25 :     169
  25 - 30 :    1308
  30 - 35 :    5596
  35 - 40 :   11086
  40 - 45 :    8824
  45 - 50 :    4192
  50 - 55 :     588
  55 - 60 :      27

An SNR lower than 10 was found in (obligatory items):
  (9.6) file FEMALES\P0226\ITEM21
  (4.1) file MALES\P3050\ITEM3
  (4.9) file MALES\P3134\ITEM6
=>File FEMALES\P0226\ITEM21 belongs to the corrupted call mentioned before. Files MALES\P3050\ITEM3 and MALES\P3134\ITEM6 contain nothing but noise and are therefore unusable.
=>From the non-obligatory items some other files were found that contain only noise, and are unusable for that reason. These were:
  (1.6) file FEMALES\P0108\COMMENTS
  (5.5) file FEMALES\P0173\LANGUE
  (4.3) file FEMALES\P0400\COMMENTS
  (5.3) file FEMALES\P0404\COMMENTS
  (4.7) file FEMALES\P0413\NIVEAU
  (3.0) file MALES\P3215\COMMENTS

Distribution of mean SNR per call over all items:
  SNR         Occurrences
  20 - 25 :       1
  25 - 30 :      11
  30 - 35 :     139
  35 - 40 :     456
  40 - 45 :     351
  45 - 50 :      41
  50 - 55 :       1
=>The call with the lowest mean SNR was, again, the corrupted call of FEMALES\P0226. The other directories with low mean SNRs are acceptable. E.g.: for directory FEMALES\P0367 (mean SNR=27.2) there was background music. For directory MALES\P3054 (mean SNR=25) there is a buzz in the beginning of the items with a long silence after the utterance.

5. Conclusion
=>Due to unacceptable acoustics the following files are unusable.
Of the obligatory items:
- file MALES\P3050\ITEM3
- file MALES\P3134\ITEM6
Of the optional items:
- file FEMALES\P0108\COMMENTS
- file FEMALES\P0173\LANGUE
- file FEMALES\P0400\COMMENTS
- file FEMALES\P0404\COMMENTS
- file FEMALES\P0413\NIVEAU
- file MALES\P3215\COMMENTS
These files contain only noise.
=>Further, the full call in directory FEMALES\P0226 is unusable.

=====================================================================
5 CONTENTS ANNOTATION FILE

1 Label header information
A NIST header was used for documenting label file information. All files had a header. The following information is provided in the header (taken from the example in the documentation):

  database_id -s26 Swiss French Polyphone 0.0
  recording_site -s17 Swiss Telecom PTT
  sheet_id -i 13946
  prompt -s26 Informations consommateurs
  text_transcription -s26 Informations consommateurs
  speaking_mode -s4 read
  sample_begin -r 0.200000
  sample_end -r 2.725125
  sample_count -i 23401
  sample_n_bytes -i 2
  channel_count -i 1
  sample_coding -s26 pcm,embedded-shorten-v1.09
  sample_rate -i 8000
  sample_byte_format -s2 10
  sample_checksum -i 12379

=>According to the SpeechDat specifications the following information should also have been included:
- data base volume (FIXED0SF_00)
- session number of 4 digits
- region of call
- speaker age and sex
- file directory
- signal file name
- corpus (item) code
- recording date
- recording time of first item
- number of significant bits per sample
=>The sheet id in the Swiss French corpus does not uniquely identify each call and can therefore not be used as an alternative for the session number.
We verified if all items in a directory had the same sheet id. No irregularities were found.

2. Transliteration format
- Transliterations should be only in lower case letters, also at sentence beginning.
  Only exception: proper names and spelled words, ZIP codes, acronyms and abbreviations. In the latter case blanks should be used in between the letters.
  =>Capital letters are used very inconsistently in the transliterations. In the text transliterations capitals are not clearly restricted to the aforementioned categories. Moreover, every first word of a transliteration starts with a capital letter. However, capitals occur in other positions than the first word, too. E.g. in P3002\COMMENTS.SHN in:
    text_transcription -s24 C'était Hyper méga super
- Punctuation marks should not be used in the transliterations
  =>Punctuation marks were used, and in an inconsistent manner: sometimes a blank was inserted between the punctuation mark and the word preceding it, and sometimes not.
- Digits must appear in full orthographic form
  We tested if digits in numerical form were present in the transliterated texts. Such digits were not found.
- In principle only the following symbols are allowed to indicate non-speech acoustic events: [Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
  Other symbols (and language equivalents) must be mentioned in the documentation
  =>The following 'touches' were used: [\hesitations] [\prononciation bizarre] [\inintelligible]
  The symbolic representations for non-speech acoustic events in SpeechDat corpora were not followed.
- Asterisks should be used to indicate incomplete realisations
  => Not used.
- According to a spelling check on annotated text (including bracket check) up to 1% errors may be found
  A spelling check could not be performed on the transliterations since we do not have an independent lexicon for Swiss French available. Using the lexicon delivered with the corpus itself we found no misspelled words, but this procedure is of course a rather circular one.
  According to a spelling check performed by the producer of the corpus, there should be no remaining orthographic mistakes in the corpus.
- A comparison (of some sort) of prompted with spoken text will be carried out
  We followed the following strategy. First, prompt text and transliterated text were downcased. Next, all punctuation marks were removed and the 'touches' were removed (but not the words that the 'touches' marked). Next, we checked whether every word in the prompt appeared in the transliterated text, and whether the prompt text and the transliterated text contained the same number of words. Obviously, the check was only carried out for the read items and not for the spontaneous ones. We used it to trace textual errors that were clear from the text itself and did not need to be verified by auditive inspection. In this way we could trace missing diacritics in the transliterations.
- Assessment of speech items in terms of SNR, presence of additional noise, adherence to prompting text is provided (optional)
  Not provided. See previous section about speech file contents for our findings.

=====================================================================
6 LEXICON

- Lexicon existence
  A lexicon was provided under file name PHONEMIC.TXT
- Lexicon contents
  1 SAMPA symbols only
    =>This is difficult to check since there are no blanks between the phoneme symbols, although this has been prescribed for SpeechDat corpora.
  2 Capitals only in proper names, spelled words and in single letters derived from abbreviations (exception: German)
    As far as we can see names have been used with capitals in the lexicon,
    =>but single letters derived from abbreviations and spelled words do not have capital letters. This may lead to problems. E.g.: the transcription for 'p' is pE/ and for 't' it is t. This means that the transcription for p t t would result in pE/ t t, which obviously is incorrect.
    This problem could be solved by using capitals for abbreviations and small letters for clitics (as in l'argent, a-t-il).
    We did not find any words in the lexicon that were included once with a leading capital letter and once with a leading lower case letter.
  3 Blanks should be placed between phoneme symbols
    =>Not done.
  4 [TAB] between grapheme and phoneme transcription
    OK
  5 Homographs should have separate entries
    Homographs were not found in the lexicon.
  6 Double entries should not occur
    =>Châtelard SAt@lar is included twice in the lexicon.
  7 Completeness
    We checked if all words in the transliterations in the field text_transcription in the file headers were contained in the lexicon. The following operations were carried out on the transliterations before the words were compared with the lexicon:
    - punctuation marks were removed.
      =>Here we noticed an inconsistency in the transliterated texts: sometimes a blank was inserted between the punctuation mark and the word preceding it, and sometimes not. (see previous section)
    - all words were converted to lower case.
      =>Here we noticed that the first word in all transliterations started with an uppercase letter, which should not be the case. (see previous section)
    - words between square brackets ([]) in [\hesitation x] [\pronunciation bizarre x] and [\inintelligible x] were stripped.
    - apostrophes and dashes were replaced by blanks (e.g. l'argent, a-t-il)
    - all lexicon entries were converted to lower case
    After these operations were carried out, nearly all words were found in the lexicon.
    =>A word that was not found in the lexicon was 's'. Also the word 'Saint.John' (found in MALES\P3311\ITEM22.SHN) could not be traced back in the lexicon.
    =>Since words are printed in a very inconsistent manner in the transliterated text (due to large variability in capital letters and placement of punctuation marks), it is difficult to find the matching word in the lexicon.
    =>Something strange was observed for single letter entries in the lexicon.
These entries are mostly typical of spelled words, but in other cases typical of clitics. For example, the transcription for 'p' is pE/ and for 't' it is t. This means that the spelled sequence p t t would be transcribed as pE/ t t, which is obviously incorrect. This problem could be solved by following the SpeechDat specifications and using capital letters for abbreviations.
Words that were only found between square brackets were not included in the lexicon. This is in accordance with the SpeechDat specifications.

8 Words should be ordered alphabetically
=>Words are not in alphabetical order in terms of the ISO-Latin coding table. The discrepancy comes from the characters with diacritics (such as accent grave, accent circonflexe, etc.). These are put before the other letters instead of after them.

====================================================================

7 SPEAKERS

1 Speaker database file
=>A speaker table file should be present but is not there.

2 Obligatory information:
1. unique number (speaker/caller)
2. sex
3. age
4. region of call
=>This information is not provided in the file headers either. sheet_id is not unique per speaker.

3 Balance of sexes
=>Recordings were delivered of 575 females and 425 males. The imbalance between the sexes exceeds 5%, which is in conflict with the SpeechDat specifications.

4 Balance of regions
=>Information about the region of call is not supplied and can therefore not be validated. The item that comes closest is 'ecole', where the speaker is asked in which place s/he started school.

5 Balance of ages
. A minimum of 20% of speakers must be in the following age groups: 17-30, 31-45, 46-60.
. A maximum of 40% of speakers may be younger than 17 or older than 60.
=>Information about speakers' ages is not given in the documentation. It could be derived from item 'naissanc'.
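The balance criteria for sex and age lend themselves to a mechanical check once speaker attributes are available. The Python sketch below is an illustration only. The function names are hypothetical, and two readings are assumptions rather than quotations from the SpeechDat specifications: the 5% rule is taken to mean that each sex's share may deviate from an even split by at most 5 percentage points, and the 20% minimum is taken to apply to each age group separately. Only the counts 575 and 425 and the age limits come from this report.

```python
AGE_GROUPS = ((17, 30), (31, 45), (46, 60))


def sex_balance_ok(n_female, n_male, max_deviation=0.05):
    """Return True if the female share is within max_deviation of 50%.

    Assumption: the 5% rule is read as a deviation from an even split.
    """
    total = n_female + n_male
    return abs(n_female / total - 0.5) <= max_deviation


def age_balance_ok(ages, min_share=0.20, max_outside=0.40):
    """Check the age criteria for a list of speaker ages in years.

    Assumption: every age group must hold at least min_share of the speakers.
    """
    total = len(ages)
    for low, high in AGE_GROUPS:
        share = sum(low <= age <= high for age in ages) / total
        if share < min_share:
            return False
    outside = sum(age < 17 or age > 60 for age in ages) / total
    return outside <= max_outside


# Counts reported for the Swiss French corpus: 575 females, 425 males.
print(sex_balance_ok(575, 425))
```

With the reported counts the female share is 57.5%, so the check fails under the stated reading. The age check could only be run after the ages have been derived, for instance from item 'naissanc'.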
====================================================================

8 RECORDING CONDITIONS

1 Digital telephone line
OK

2 A-law coding
OK

3 Recording information may be stored in a separate file (optional)
Not provided

====================================================================

9 VALIDATION TRANSCRIPTION

This validation is carried out by taking 5% of the short items and 5% of the long items in the corpus. The transcriptions in the label files for these samples are checked by listening to the corresponding speech files. This check is performed by native speakers of the language involved.

Short items are:
- isolated digit
- time phrases
- date phrases
- yes/no questions
- place name
- application words

Long items are:
- connected digits
- natural numbers
- money amounts
- spelled words
- application phrases
- phonetically rich sentences

Given that 19 (of the 23) long items were present in the database, and that there were 1000 speakers, a selection of 5% of the long items would comprise 950 samples. To remain on the safe side, a random selection of 999 items was made. For the short items, 11 (of the 16) items were included in the database. A selection of 5% yields a sample of 550 items. A random selection of 549 short items was used for the evaluation.

- The evaluation comprises the following criteria
  . did the speaker actually speak the transliterated words
  . did the speaker speak the prompted text
  . is the transliteration of non-speech acoustic events correct
  . speech quality, line quality
  . up to 5% transcription errors are allowed
- Abbreviations may only be used if spoken as such

RESULTS
The transcriptions of 999 randomly chosen long items and 549 randomly chosen short items were evaluated. It appeared that 33 of the long items contained an error (3.30% of the sample), and 8 of the short items (1.46% of the sample). This is below the error threshold of 5%.
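The sample sizes and error percentages quoted above follow from simple arithmetic. The Python sketch below reproduces them from the figures stated in this section (19 and 11 item types, 1000 speakers, 33 and 8 erroneous items). The helper names are hypothetical and chosen for illustration only.

```python
THRESHOLD_PERCENT = 5.0  # maximum share of transcription errors allowed


def sample_size(n_item_types, n_speakers, fraction=0.05):
    """Number of recordings in a sample covering the given fraction."""
    return round(n_item_types * n_speakers * fraction)


def error_rate_percent(n_errors, n_checked):
    """Share of checked items that contain an error, in percent."""
    return 100.0 * n_errors / n_checked


long_rate = error_rate_percent(33, 999)
short_rate = error_rate_percent(8, 549)

print(sample_size(19, 1000), sample_size(11, 1000))
print(f"{long_rate:.2f} {short_rate:.2f}")
print(long_rate < THRESHOLD_PERCENT and short_rate < THRESHOLD_PERCENT)
```

Both item-level rates stay under the 5% threshold.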
This threshold is in fact meant on word level, but it stands to reason that this criterion is met on word level, too, if it is met on item level. For the long items there were 2 errors in the cutting of the signals, 9 typing errors, and 22 transliteration errors. For the short items there were 8 transliteration errors and 1 typing error. For completeness a full list of errors is given below.

Long items:

Change in item females/p0063/item15:
Original text: Mais, pour l'heure, le tribunal conclut en jugeant que l'artiste poursuivie était en droit de réclamer une indemnité.
Modified text: Mais, pour l'heure, le tribunal conclut en jugeant que l'artiste poursuivie était en droit de réclamer un indemnité.
Kind of error: Transliteration

Change in item females/p0093/telefon:
Original text: Septante quatre quarante quatre six
Modified text: Septante quatre quarante quatre seize
Kind of error: Transliteration

Change in item females/p0119/item11:
Original text: Son père lui a fait une réprimande.
Modified text: Son père lui a fait une reprimande.
Kind of error: Transliteration

Change in item females/p0180/item07:
Original text: [\prononciation bizarre Daves] d a v e s
Modified text: Daves d a v e s
Kind of error: Transliteration

Change in item females/p0198/identif:
Original text: Mille neuf cent septante et un
Modified text: *euf mille neuf cent septante et un
Kind of error: Cutting

Change in item females/p0201/item07:
Original text: Spécialisées s p é accent aigu c i a l i s é accent aigu e s
Modified text: Spécialisées s p e accent aigu c i a l i s e accent aigu e s
Kind of error: Transliteration

Change in item females/p0219/item11:
Original text: Il y a peu de chance que l'on en discute à Rennes, où doit se réunir, en mars mille neuf cent nonante, le congrès.
Modified text: Il y a peu de chance que l'on en discute à Rennes, on doit se réunir, en mars mille neuf cent nonante, le congrès.
Kind of error: Transliteration

Change in item females/p0259/item11:
Original text: Toutes ces bêtes ont reçu les soins appropriés et ont été placés.
Modified text: Toutes ces bêtes ont reçu les soins appropriés et ont été placées.
Kind of error: Type

Change in item females/p0273/item04:
Original text: Trois mille cinq Kilos
Modified text: Trois mille cinq kilos
Kind of error: Type

Change in item females/p0285/item12:
Original text: Et elle n'a pas cherché à monnayer quelques accord avec l'impertinent reporter.
Modified text: Et elle n'a pas cherché à monnayer quelques accords avec l'impertinent reporter.
Kind of error: Type

Change in item females/p0286/telefon:
Original text: Soixante et un soixante et un soixante et un
Modified text: *ois cent soixante et un soixante et un soixante et un
Kind of error: Transliteration

Change in item females/p0293/item02:
Original text: Sinon, il nous parait exsangue, dépassé, vide de sens.
Modified text: Sinon, il nous parait [\prononciation bizarre exsangue], dépassé, vide de sens.
Kind of error: Transliteration

Change in item females/p0321/item15:
Original text: Quatre personnes ont été blessées au cours de ces interpellations.
Modified text: Quatre personnes ont été blessées au cours de cette interpellation.
Kind of error: Transliteration

Change in item females/p0338/item23:
Original text: Quel est le diagnostic du médecin ?
Modified text: Quel est le [\prononciation bizarre diagnostic] du médecin ?
Kind of error: Transliteration

Change in item females/p0373/item26:
Original text: [\inintelligible aïssaoui] a i tréma s s a o u i
Modified text: Aïssaoui a i tréma s s a o u i
Kind of error: Transliteration

Change in item females/p0389/item05:
Original text: Six cinq huit trois trois trois deux trois un quatre quatre deux trois quatre huit cinq
Modified text: Six cinq huit trois trois deux trois un quatre quatre deux trois quatre huit cinq
Kind of error: Transliteration

Change in item females/p0450/item15:
Original text: Pierre a lancé une oeillade à marie.
Modified text: Pierre a lancé une oeillade à Marie.
Kind of error: Transliteration

Change in item females/p0483/item12:
Original text: Té vingt quintaux de blé.
Modified text: *té vingt quintaux de blé.
Kind of error: Cutting

Change in item females/p0549/item12:
Original text: Toute critique en profondeur du système Dollar a été éliminée; on a, par exemple, préféré incriminer les émirs du pétrole.
Modified text: Toute critique en profondeur du système dollar a été éliminée; on a, par exemple, préféré incriminer les émirs du pétrole.
Kind of error: Type

Change in item males/p3023/item02:
Original text: Qui prendra le relais de jean ?
Modified text: Qui prendra le relais de Jean ?
Kind of error: Transliteration

Change in item males/p3170/item05:
Original text: Deux mille sept cent septante trois zéro trois cent vingt quatre trois mille quatre cent quarante neuf trois mille quatre cent soixante quinze
Modified text: Deux mille sept cent septante trois zéro trois cent vingt quatre trois mille quatre cent quarante neuf trois mille quatre cent septante cinq
Kind of error: Transliteration

Change in item males/p3182/item26:
Original text: Germanier g e r m a n i e r
Modified text: Germanier g é r m a n i e r
Kind of error: Transliteration

Change in item males/p3235/item11:
Original text: Ils ne disent seulement saturés par les campagnes d'information sur la maladie, mais également sceptiques quant à leur efficacité.
Modified text: Ils ne disent seulement saturés par les campagnes d'information sur la maladie, mais également *asceptiques quant à leur efficacité.
Kind of error: Transliteration

Change in item males/p3256/item06:
Original text: Cette journée de réflexion prendra fin par un une office divine.
Modified text: Cette journée de réflexion prendra fin par un par une office divine.
Kind of error: Transliteration

Change in item males/p3289/item20:
Original text: Cinquante deux mille zéro onze virgules sept cent cinquante neuf
Modified text: Cinquante deux mille zéro onze virgule sept cent cinquante neuf
Kind of error: Type

Change in item males/p3294/item22:
Original text: Acide a c i d e
Modified text: Acide a c i d è
Kind of error: Transliteration

Change in item males/p3295/item05:
Original text: Cent quatre zéro mille six cent nonante trois mille neuf cent quarante neuf zéro huit cent soixante sept
Modified text: Zéro cent quatre mille six cent nonante trois mille neuf cent quarante neuf zéro huit cent soixante sept
Kind of error: Transliteration

Change in item males/p3296/item20:
Original text: Trois million trente neuf mille neuf cent cinquante neuf
Modified text: Trois millions trente neuf mille neuf cent cinquante neuf
Kind of error: Type

Change in item males/p3298/item26:
Original text: Georges-Etienne g e o r g e s trait d'union e t i e n n e
Modified text: Georges-Etienne g e o r g e s trait d'union é t i e n n e
Kind of error: Transliteration

Change in item males/p3332/item12:
Original text: La conseillère en style est même intervenue dans le choix de tissu des stands.
Modified text: La conseillère en style est même intervenue dans le choix du tissu des stands.
Kind of error: Transliteration

Change in item males/p3354/item04:
Original text: Cent cinquante cinq mille sept cent Kilos
Modified text: Cent cinquante cinq mille sept cent kilos
Kind of error: Type

Change in item males/p3376/item08:
Original text: Huit cent soixante deux mille et un Franc trente francs suisses
Modified text: Huit cent soixante deux mille et un franc trente francs suisses
Kind of error: Type

Change in item males/p3394/item12:
Original text: En fait, les pays d'Europe de l'est ne sont pas les seuls visés.
Modified text: En fait, les pays d'Europe de l'Est ne sont pas les seuls visés.
Kind of error: Type

******************

Short items:

Change in item females/p0103/item25:
Original text: Dièse trois quatre cinq dièse étoile ou asterix
Modified text: Dièse trois quatre cinq dièse étoile ou astérisque
Kind of error: Transliteration

Change in item females/p0120/naissanc:
Original text: Le douze janvier mille neuf cent cinquante et un
Modified text: Un douze janvier mille neuf cent cinquante et un
Kind of error: Transliteration

Change in item females/p0125/item03:
Original text: Lister
Modified text: Liste
Kind of error: Transliteration

Change in item females/p0303/item25:
Original text: [\prononciation bizarre astérisque] six neuf sept deux [\prononciation bizarre astérisque]
Modified text: astérisque six neuf sept deux astérisque
Kind of error: Transliteration

Change in item females/p0330/item21:
Original text: Lister
Modified text: [\prononciation bizarre Lister]
Kind of error: Transliteration

Change in item females/p0361/ecole:
Original text: À Kussnacht Zürich
Modified text: A Kussnacht Zürich
Kind of error: Type

Change in item females/p0538/item25:
Original text: Huit cinq [\prononciation bizarre astérisque] cinq deux un
Modified text: Huit cinq astérisque cinq deux un
Kind of error: Transliteration

Change in item
males/p3212/item03:
Original text: Concert
Modified text: [\prononciation bizarre Concert]
Kind of error: Transliteration

Change in item males/p3226/item21:
Original text: Lister
Modified text: [\prononciation bizarre Lister]
Kind of error: Transliteration

====================================================================

10. SUMMARY

Below we give a brief overview of our findings. We repeat that it should be borne in mind that the Swiss French corpus was recorded long before the SpeechDat specifications were released.

1. Documentation
In general, formal matters are properly described (contact person, number and contents of CD-ROMs). Speaker information is very poorly provided. Naming conventions for directories and files are well described. Prompting information is well described. Annotation is well described, but information about annotation conventions (when to use upper-case and when lower-case letters) is missing. Information about the lexicon is OK, but we miss an overview of the SAMPA symbols used. We miss a list of alternative spellings of the words (or a paragraph stating that spelling has been made uniform in one way or another). We miss information about the number of annotation files that were double checked by the producer.

2. Database structure, contents and file names
Directory structure and file names are not according to SpeechDat specifications (these were not known at the time of compilation).

3. Missing items, structurally and incidentally
We miss 9 out of 39 obligatory items per call systematically; 141 calls miss up to 3 further obligatory items. We miss 15 out of 39 obligatory application words. This does not fulfil the SpeechDat specifications.

4. Contents sampled data files
Speech files are not in A-law, but in a linear conversion of A-law (13 bits). The call in FEMALES\P0226 is corrupted. Due to unacceptable acoustics the following files are unusable.
Of the obligatory items:
- file MALES\P3050\ITEM3
- file MALES\P3134\ITEM6
Of the optional items:
- file FEMALES\P0108\COMMENTS
- file FEMALES\P0173\LANGUE
- file FEMALES\P0400\COMMENTS
- file FEMALES\P0404\COMMENTS
- file FEMALES\P0413\NIVEAU
- file MALES\P3215\COMMENTS
These files contain only noise. By viewing and listening to the files in directory p0226 we noticed that all files in this directory are severely corrupted by errors in the recording platform. The call with the lowest mean SNR was, again, the corrupted call of FEMALES\P0226. The other directories with low mean SNRs are acceptable.

5. Contents annotation file
In the file headers we miss the following information:
- data base volume (FIXED0SF_00)
- session number of 4 digits
- region of call
- speaker age and sex
- file directory
- signal file name
- corpus (item) code
- recording date
- recording time of first item
- number of significant bits per sample
In the transcriptions capital letters are used rather inconsistently. We recommend that all punctuation marks be removed.

6. Lexicon
Lexicon format: <grapheme string> <TAB> <phoneme string> is OK. There are no blanks between phoneme symbols. Only two words were found not to be present in the lexicon.

7. Speakers
There is no speaker table. There is no information about sex, age and region of call in the file headers. The imbalance between the sexes exceeds 5%, which is in conflict with the SpeechDat specifications.

8. Recording conditions
Recording conditions are fine.

9. Validation transcription
The transcriptions of 999 randomly chosen long items and 549 randomly chosen short items were evaluated. Both samples stayed below the 5% error threshold of the SpeechDat specifications.

====================================================================
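The lexicon checks summarised under point 6 (tab-separated format, double entries, completeness after normalising the transliterations) can be approximated in a few lines. The Python sketch below is illustrative only and is not the script used in the validation: the function names, the regular expression and the toy lexicon are assumptions, and the phoneme strings other than SAt@lar are placeholders. The clean-up steps are applied in a slightly different order than listed in the lexicon section, so that bracketed markers and clitic separators are handled before the remaining punctuation is dropped.

```python
import re
import string

# Markers such as [\hesitation x] are stripped together with their content.
BRACKETED = re.compile(r"\[[^\]]*\]")


def normalise(text):
    """Return the lower-cased words of a transliteration after clean-up."""
    text = BRACKETED.sub(" ", text)
    # Apostrophes and dashes become blanks (l'argent -> l argent).
    text = text.replace("'", " ").replace("-", " ")
    # Drop the remaining ASCII punctuation marks.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower().split()


def load_lexicon(lines):
    """Parse '<grapheme><TAB><phonemes>' lines.

    Returns the set of lower-cased graphemes and a list of double entries.
    """
    seen, doubles, words = set(), [], set()
    for line in lines:
        grapheme, phonemes = line.rstrip("\n").split("\t", 1)
        if (grapheme, phonemes) in seen:
            doubles.append(grapheme)
        seen.add((grapheme, phonemes))
        words.add(grapheme.lower())
    return words, doubles


def missing_words(transliterations, words):
    """Return the normalised words that have no lexicon entry."""
    return {w for t in transliterations for w in normalise(t) if w not in words}


# Toy lexicon; only the Châtelard line is taken from the report.
toy_lines = ["Châtelard\tSAt@lar", "Châtelard\tSAt@lar", "l\tX", "argent\tX"]
toy_words, toy_doubles = load_lexicon(toy_lines)
print(toy_doubles)
print(missing_words(["L'argent s [\\hesitation euh] ."], toy_words))
```

On the toy data the sketch reports the double entry for Châtelard and the missing word 's', mirroring two of the findings above.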