Validation report

SUBJECT: Validation Swiss French SpeechDat(M) corpus
AUTHORS: Henk van den Heuvel, Eric Sanders
VERSION: 2.0
DATE   : 26 June 1996
The speech databases made within the SpeechDat(M) project were validated
by SPEX, Leidschendam, the Netherlands, to assess their compliance
with the SpeechDat(M) format and content specifications, as documented
in Deliverable 1.4.1 of the project.
The validation results of the Swiss French SpeechDat(M) database are
contained in this document.
The validation of the Swiss French SpeechDat corpus has
taken into consideration that (1) not all SpeechDat specifications
were known at the time of delivery of the corpus, and (2) IDIAP has an
exceptional status as an external partner in the SpeechDat project.
However, where appropriate, references to the most important
SpeechDat validation criteria will be made.
In the validation procedure we systematically checked a list
of validation criteria for a range of subjects.
In the following we will evaluate these criteria one by one
for the Swiss French data base offered by IDIAP.
Validation results that call for attention are marked by =>.
The following subjects were validated:
1 DOCUMENTATION
2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES
3 ITEMS
4 CONTENTS SAMPLED DATA FILES
5 CONTENTS ANNOTATION
6 LEXICON
7 SPEAKERS
8 RECORDING PLATFORM
9 TRANSCRIPTION
The document is concluded by
10 SUMMARY
====================================================================
1. DOCUMENTATION
The main documentation is provided in the file README.TXT
- Documentation file must be present
OK, as README.TXT
- Language of doc file: preferably English
OK
- Contact person: name, address, affiliation
OK
- Number of CDs / Tapes
OK
- Contents of each CD
OK
=>However, we find contradictory information under heading:
`Directories and files'.
In the second paragraph it is said that the phonetically rich sentences
are on the first CD. In the fourth paragraph it is stated that this
information is on the first AND SECOND CD.
- The directory structure of the CDs / tapes
OK, is explained
=>but not in SpeechDat format
- Speaker information
=>In general speaker information is poorly provided in the
documentation:
. which regions, how many of each
=>The distribution of regions is alluded to, but not made explicit.
. motivation for selection of regions
=>This information is not supplied
. which age groups, how many of each
=>Age groups are mentioned, but not in quantitative terms.
. sexes: males, females, also children?; how many of each.
This information is supplied in the section on directories and files,
=>but we recommend putting it in the section on speakers as well.
=>There is information about the participation of children, and there
is no way to extract this information in another way (e.g. from
the NIST file headers). Please be more explicit on this topic.
- Reference to a file where speaker characteristics are stored
(speaker.tbl)
=>speaker table is not present
- The number of items on the CD and per speaker (Contents.tbl)
=>There is no contents table;
there is only a list of missing items.
- Naming conventions for directories and files
OK, is mentioned, but see remark above at CD-contents
- Prompting
. linguistic specification (and motivation) for the prompting material
OK, this information is in the section Prompt sheets generation
. connection of sheet items to item numbers on CD / tape
OK, has been provided,
. sheet example
OK, provided as additional file
. items must be spread over the sheet to prevent list effects
(e.g. three yes/no questions right after another are not
allowed)
OK, as can be seen from sheet example
- analysis of frequency of occurrence of the sub-word units represented
in the phonetically rich sentences (either of phones, biphones, triphones)
. recommended: at least 2 samples of each phone per caller
(should appear from documentation)
This has been computed, as is mentioned in section Prompt sheet generation,
=>but it is not explained in a perspicuous way.
Exact figures are not presented.
- Recording platform should be specified
OK, but only very briefly.
- Signal characteristics (number of bits per sample; bandwidth; coding type;
compression procedures)
OK
=>However, linear expansion to 16 bit before application of SHORTEN
was not intended for SpeechDat corpora.
Second, there seems to be a typing error in the coding table used for
linear expansion.
- The format and the file header structure of
annotation files should be specified
OK, is mentioned,
=>but at an unexpected place, viz. at the end of the section
about annotation and transcription.
- Annotation
. procedure
OK, is mentioned, but only briefly.
=>There is no information on how aspects of transliteration were carried out
(upper-case, lower-case lettering; how are digits transliterated;
how are abbreviations transliterated; how are truncations treated).
This information is now implicit in the file GUIDE.PS.
. quality assurance
OK
. character set used for annotation (transliteration)
OK, ISO-latin 1
. annotation symbols for non-speech acoustic events must be mentioned
at least for
[Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
=>Not mentioned
. list of symbols used to denote word interruptions and break-offs
should be provided
=>Not supplied
- Lexicon information
. Transcription manual:
Which graphemic characters and conventions are used in annotations and
lexicon
OK
. Procedures to obtain phonemic forms from orthographic input
OK,
=>It is not clear from the text if transcriptions made by the automatic
text-to-phoneme converter were checked by hand.
. Overview of SAMPA symbols used (only in this manner it can be checked if
the lexicon contains only legal symbols)
=>Not provided
- Only one spelling of each word is allowed
Therefore a list of normalised spellings for words with alternative
spellings should be included.
=>Not provided
- Indication of how many of the files were double checked by the producer
together with percentage of detected errors
=>There is no information about this.
- Optional RECORDING table
Not provided
====================================================================
2. DATABASE STRUCTURE CONTENTS AND FILE NAMES
1 Directory / subdirectory conventions
- Format of directory tree should be
\<database>\<volume>\<block>\<session>
. database: defined as <name><#><language_code>
<name> can be FIXED, MOBIL, VERIF
<#> is 0 for SpeechDat(M) and 1 for SpeechDat
<language_code> is the ISO two-letter code for the language
. volume : is a progressive number specifying the CD containing the
material. Defined as CD<nn> where <nn> is the number.
. block : defined as BLOCK<nn> where <nn> is a progressive number from
00 to 99. Block numbers are unique over all CDs.
They could typically be the first two digits of <nnnn> below.
. session: defined as SES<nnnn> where <nnnn> is the session code
also appearing in file name
=>This convention was not followed because it was not known at the time
the corpus was put to CD.
We find all person directories one level below
the sex directory. These are far more than 100 subdirectories.
- A README.TXT file should be in the root describing all
(documentation) files
on the CD-ROM.
OK
- A file containing a shortened version of the volume name (11 chars max.)
should be in the root directory. The name of this file is DISK.ID.
This file supplies the volume label to UNIX systems that cannot read
the physical volume label. Example of contents: FIXED0EN_00.
OK
- Any source code supplied should be in
\<database_name>\SOURCE
(SAMLIB, V4 and GNU gunzip + licence)
=>There is no source code on the CD-ROM.
- A copyright statement should be given in COPYRIGH.TXT
OK
- The index files (if presented) obey the nomenclature
<database><language_code><item_code>.LST,
e.g. A0ENN3.LST
(see below for item_code)
Not present
- Documentation should be in \<database_name>\DOC
- Tables should be in \<database_name>\TABLE
- Index files (optional) should be in \<database_name>\LST
- Prompt sheet files (optional) should be in \<database_name>\PROMPT
=>This information was not known at the time
the corpus was put to CD by IDIAP.
- File naming conventions
All file names should obey the following pattern: DDNNNNCC.LLF
DD   : database identification code
       For SpeechDat(M): A0 = fixed net, B0 = mobile
       For SpeechDat   : A1 = fixed net, B1 = mobile, C1 = speaker verification
NNNN : session code 0000 to 9999
CC   : item code; first character is item type identifier,
       second character is item number
LL   : ISO-639 language code (with extensions)
F    : speech file type
       Z is for A-law, compressed
       O is for Orthographic label (label file)
=>This information was not known at the time
the corpus was put to CD by IDIAP.
- Files in the corpus
- Contents lowest level subdirectories should be of one call only
OK
- Empty (i.e. zero-length) files are not permitted
OK, empty files were not found
- Counts should match information in documentation
. count of files in each subdirectory
. count grand total
We have found directories for all 575 female and 425 male speakers
mentioned in the documentation.
=>The documentation does not mention the grand total of files in the
corpus. Therefore this cannot be checked.
- Missing items per speaker
Information about missing items is absent in the documentation file itself,
but there is a file with missing items in the root.
This list is correct.
- File match:
For each label file there must be one speech file and vice versa.
This issue is relevant in case label files and speech files are stored
separately (SAM format), which is not the case here (NIST headers).
- Part of the corpus may be designed for training and a (typically
smaller) part for testing.
No partitioning is supplied.
- The contents of the database as given in CONTENTS.LST should comprise
. CD-ROM volume name (VOL:)
. full pathname (DIR:)
. speech file name (SRC:)
. speaker code (SCD:)
. speaker sex (SEX:)
. speaker age (AGE:)
. region of call (REG:)
. orthographic transcription of uttered item (LBO:)
This file must be supplied as an ASCII delimited file (either using TAB,
or commas and (double) quoted strings).
=> This file is not present. This information was not known at the time
the corpus was put to CD by IDIAP.
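As an illustration of the file naming convention DDNNNNCC.LLF described
above, the following Python sketch checks a file name against the pattern
and splits it into its fields. It is not part of the validation software;
the function name parse_name and the example file name are hypothetical, and
the assumption that the item number may be a digit or a letter is ours.

```python
import re

# DDNNNNCC.LLF, following the field descriptions in this section.
FILENAME_RE = re.compile(
    r"^(?P<db>[A-C][01])"           # DD: database identification code
    r"(?P<session>\d{4})"           # NNNN: session code 0000 to 9999
    r"(?P<item>[A-Z][A-Z0-9])"      # CC: item type + item number (assumed charset)
    r"\.(?P<lang>[A-Z]{2})"         # LL: two-letter language code
    r"(?P<ftype>[A-Z])$"            # F: file type, e.g. Z or O
)

# Database codes listed in the convention.
DATABASE_CODES = {
    "A0": "SpeechDat(M), fixed net",
    "B0": "SpeechDat(M), mobile",
    "A1": "SpeechDat, fixed net",
    "B1": "SpeechDat, mobile",
    "C1": "SpeechDat, speaker verification",
}


def parse_name(name):
    """Return the fields of a conforming file name, or None if it does not conform."""
    match = FILENAME_RE.match(name.upper())
    if match is None:
        return None
    fields = match.groupdict()
    if fields["db"] not in DATABASE_CODES:
        return None
    fields["database"] = DATABASE_CODES[fields["db"]]
    return fields


print(parse_name("A00001N3.ENZ"))   # hypothetical conforming name
print(parse_name("ITEM22.SHN"))     # name as used in this corpus: does not conform
```

A name such as ITEM22.SHN, as found in the Swiss French corpus, is rejected
by this check, which reflects the remark above that the convention was not
followed.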
====================================================================
3. MISSING ITEMS, STRUCTURALLY AND INCIDENTALLY
1. Structurally missing items
1.1 The following items in the Swiss French SpeechDat corpus are obligatory
and present:
- 1 isolated digit
item25 (but contains also hash (#) and star (*) symbols)
- 3 connected digits
- 4 digit number to identify the prompt sheet
identif
- ~10 digit telephone number
telefon
- ~12 digit credit card number
item5
- 3 natural numbers
item20 ! natural number
item4 ! quantity
=> Third natural number is missing
- 2 money amounts
item8
item17
- 3 spelled words
item7
item22
item26
- 1 time of day
=> missing
- 1 time phrase
item13
- 1 date (spontaneous)
naissanc
- 2 dates (prompted)
item18
=> second date is missing
- 3 yes/no questions
sexe
=> two yes/no questions are missing
- city of call/birth
ecole
- 6 common application words
item3
item10
item16
item21
item24
=> one application word is missing
- 3 application word phrases
=> missing
- 9 phonetically rich sentences
item2
item6
item9
item11
item12
item15
item19
item23
item27
1.2. The following items are obligatory but not present
- one natural number
- time of day (spontaneous)
- one date (prompted)
- two yes/no questions
- one application word
- three application phrases
1.3. The following items are present but not obligatory
item29   ! city name prompted
item14   ! name for spelling table
langue   ! mothertongue speaker
niveau   ! education level speaker
typetele ! type of telephone used
rensei   ! query to telephone dir
comments ! free comment on session
item28   ! extra phon. rich sentence
1.4. Conclusion
=>This means that structurally 9 items too few were recorded (or included)
in the Swiss French database, namely
- one natural number
- time of day (spontaneous)
- one date (prompted)
- two yes/no questions
- one application word
- three application phrases
We have understood from the documentation that this is due to
the fact that recordings had been made before the structure
of SpeechDat corpora was known.
2. Application words
In appendix A of SpeechDat deliverable 1.4-1 a list of 39 obligatory
application words is provided.
=>Checking the documentation for the Swiss corpus it can be established
that a number of application words (or direct equivalents)
were not included in the Swiss set:
Appel, Effacer, Enregistrer, Activer, Composer, Telephone,
Annonce, Repondeur, Conference, Extern, Intern, Programmer,
Rappel, Ecouter, Pause.
The application words that are present appear in a sufficient quantity
in the prompt texts.
Each word occurs at least 40 times. Most words occur between 40 and
60 times. A few words occur about 300 times
(Chef-téléopératrice; Informations consommateurs;
Informations touristiques; L'heure; Service des télécommunications;
Service international; Services PTT).
A full overview is displayed below.
Abonnement: 63
Adresse: 59
Adulte: 58
Agenda: 58
Aide: 52
Allemand: 59
Anglais: 60
Annuler: 64
Billet: 57
Chef-téléopératrice: 280
Choisir: 59
Cinéma: 50
Concert: 58
Continuer: 67
Corriger: 53
Début: 62
Détaillé: 54
Enfant: 47
Espagnol: 68
Exemple: 59
Explications: 63
Fin: 54
Français: 47
Galerie: 60
Guide: 55
Horaire: 61
Informations consommateurs: 312
Informations touristiques: 297
Italien: 61
L'heure: 267
Le temps: 57
Lire: 60
Lister: 72
Message: 52
Mode d'emploi: 60
Musée: 66
Non: 65
Oui: 75
Petites annonces: 55
Place assise: 44
Précédent: 54
Quitter: 67
Rendez-vous: 56
Romanche: 63
Réception: 46
Répéter: 57
Réservation: 51
Résumé: 43
Service des télécommunications: 295
Service international: 286
Services ptt: 263
Ski: 51
Standardiste: 53
Stop: 60
Suivant: 48
Tarif: 53
Théâtre: 64
Transfert: 64
Validation: 56
3. Incidentally missing items
We examined the directories (calls) for which more than 9 obligatory
items were missing. The following result was found
( 9 obligatory items missing in 857 directories)
 10 obligatory items missing in 131 directories
 11 obligatory items missing in   7 directories
 12 obligatory items missing in   3 directories
 13 obligatory items missing in   1 directory
 14 obligatory items missing in   1 directory
4. Overall conclusion
SpeechDat has the following criteria for missing items:
- 85% (850) out of 1000 calls must be complete
. A maximum of 10% (100) of the calls may miss up to 3 mandatory items
. A maximum of 5% (50) of the calls may miss more items
(A complete call is one with all speech files recorded for all prompt items)
=> Since 9 obligatory items are structurally missing in each call
the Swiss French corpus does not fulfil the SpeechDat specifications in
this respect.
If we take into account the special status of the corpus and
only look at the incidentally missing items,
then we observe that 857 calls (85.7%) are complete,
141 calls (14.1%) miss up to 3 obligatory items
and 2 calls (0.2%) miss more items.
Since 14.1% exceeds the allowed maximum of 10%, this does not fulfil
the specifications either.
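The arithmetic behind this conclusion can be retraced with the short Python
sketch below. It takes the counts of subsection 3 (number of missing
obligatory items per directory) and applies the three criteria listed above.
The function name completeness_summary is hypothetical, and reading the
criteria as three separate conditions is our interpretation.

```python
# Calls per number of missing obligatory items (figures from subsection 3).
missing_histogram = {9: 857, 10: 131, 11: 7, 12: 3, 13: 1, 14: 1}


def completeness_summary(histogram, structural=0):
    """Classify calls after discounting `structural` items missing in every call."""
    total = sum(histogram.values())
    complete = up_to_3 = more = 0
    for missing, calls in histogram.items():
        incidental = missing - structural
        if incidental <= 0:
            complete += calls
        elif incidental <= 3:
            up_to_3 += calls
        else:
            more += calls
    # Integer arithmetic avoids rounding problems at the thresholds.
    ok = (
        complete * 100 >= 85 * total      # at least 85% complete
        and up_to_3 * 100 <= 10 * total   # at most 10% miss up to 3 items
        and more * 100 <= 5 * total       # at most 5% miss more items
    )
    return {"total": total, "complete": complete,
            "up_to_3": up_to_3, "more": more, "ok": ok}


print(completeness_summary(missing_histogram, structural=9))
```

With structural=9 the sketch reports 857 complete calls, 141 calls missing
up to 3 items and 2 calls missing more, and flags the corpus as not
compliant because of the second criterion.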
====================================================================
4 CONTENTS SAMPLED DATA FILES
1 File structure
. NIST (header : contains file info -> ant.txt)
. SAM
OK, NIST headers
2 Coding
. A-law
. Compression by Shorten (A-law version of shorten)
=>According to SpeechDat criteria the 8-bit A-law files should
be Shortened immediately, and not via first expanding them
to 13-bit linear.
3 Sample distribution
Several sample distributions were checked:
3.1 File length in seconds
We calculated the length of the files in seconds in order to trace
spurious recordings if files were of extraordinary length.
Distribution of file durations in all items (in seconds):
#Seconds : Occurrences
 0 -  1  : 1056
 1 -  2  : 8980
 2 -  3  : 8411
 3 -  4  : 6969
 4 -  5  : 3956
 5 -  6  : 2386
 6 -  7  : 1711
 7 -  8  : 1232
 8 -  9  : 936
 9 - 10  : 671
10 - 11  : 454
11 - 12  : 249
12 - 13  : 172
13 - 14  : 117
14 - 15  : 63
15 - 16  : 48
16 - 17  : 31
17 - 18  : 25
18 - 19  : 17
19 - 20  : 36
20 - 21  : 14
Distribution of file durations over all obligatory items:
#Seconds : Occurrences
 0 -  1  : 909
 1 -  2  : 5947
 2 -  3  : 7430
 3 -  4  : 6735
 4 -  5  : 3814
 5 -  6  : 2262
 6 -  7  : 1555
 7 -  8  : 1077
 8 -  9  : 776
 9 - 10  : 527
10 - 11  : 314
11 - 12  : 177
12 - 13  : 101
13 - 14  : 69
14 - 15  : 46
15 - 16  : 26
16 - 17  : 12
17 - 18  : 6
18 - 19  : 11
19 - 20  : 11
20 - 21  : 7
The obligatory items with a duration larger than 17 seconds have been
examined in greater detail.
We found that files up to 18 seconds length generally
had no particular anomalies.
=>Files with a length of more than 19 or even 20 seconds tended to
be cut off or to be lengthened with 10-15 seconds silence or noise
(but this was not always the case).
Thus we found the following very long files which had problems:
File                  Duration  Problem
MALES\P3038\NAISANC   20.2      Long trailing silence of 15 s.
MALES\P3054\ECOLE     20.2      Long trailing noise of 19 s.
MALES\P3075\IDENTIF   20.2      File contains much more speech than
                                transliterated and is cut off.
MALES\P3188\ITEM20    20.5      Long trailing noise of 15 s.
MALES\P3297\ITEM22    19.6      Cut off
MALES\P3297\ITEM7     19.7      Cut off
MALES\P3345\ITEM22    18.4      Cut off
FEMALES\P0345\ITEM22  19.8      Cut off
FEMALES\P0402\ITEM22  19.4      Cut off
FEMALES\P0402\ITEM7   20.1      Cut off
FEMALES\P0519\ITEM26  19.6      Cut off
=>These speech files are still usable but the waveforms should be edited
more properly.
Distribution of mean file durations per call over all items:
#Seconds : Occurrences
2 - 3    : 109
3 - 4    : 589
4 - 5    : 265
5 - 6    : 35
6 - 7    : 2
3.2 min-max samples
We provide a histogram with clipping rates.
We have counted for each file how many times the maximum (4032) and
minimum (-4032) value occurred.
The clipping rate is defined as the number of samples in a file
that are equal to the maximum or minimum value, divided by the total
number of samples in the file (expressed as a percentage).
The histogram, then, is an overview of how many files were found
in a set of clipping rate intervals.
Clip distribution for all files:
Clipping rate (in %) : Occurrences
0.0 - 0.1 : 1705
0.1 - 0.2 : 365
0.2 - 0.3 : 187
0.3 - 0.4 : 85
0.4 - 0.5 : 59
0.5 - 0.6 : 38
0.6 - 0.7 : 25
0.7 - 0.8 : 16
0.8 - 0.9 : 16
0.9 - 1.0 : 10
1.0 - 1.1 : 6
1.1 - 1.2 : 6
1.2 - 1.3 : 9
1.3 - 1.4 : 6
1.4 - 1.5 : 5
1.5 - 1.6 : 3
1.6 - 1.7 : 3
1.8 - 1.9 : 1
Number of files with absolute maximum < 4032: 34989
Total number of files: 37534
Clip distribution for obligatory items:
Clipping rate (in %) : Occurrences
0.0 - 0.1 : 1481
0.1 - 0.2 : 312
0.2 - 0.3 : 152
0.3 - 0.4 : 64
0.4 - 0.5 : 51
0.5 - 0.6 : 32
0.6 - 0.7 : 20
0.7 - 0.8 : 13
0.8 - 0.9 : 13
0.9 - 1.0 : 8
1.0 - 1.1 : 3
1.1 - 1.2 : 6
1.2 - 1.3 : 6
1.3 - 1.4 : 5
1.4 - 1.5 : 3
1.5 - 1.6 : 3
1.6 - 1.7 : 2
Number of files with absolute maximum < 4032: 29638
Total number of files: 34357
A clipping rate higher than 1.5% was found in (obligatory items):
(1.61) file FEMALES\P0458\ITEM13
(1.62) file FEMALES\P0458\ITEM7
(1.59) file FEMALES\P0458\ITEM8
(1.53) file MALES\P3154\TELEFON
(1.52) file MALES\P3232\ITEM14
Mean clipping rate per call over all items:
Clip distribution per call:
Clipping rate (in %) : Occurrences
0.0 - 0.1 : 273
0.1 - 0.2 : 13
0.2 - 0.3 : 6
0.3 - 0.4 : 3
0.4 - 0.5 : 1
0.5 - 0.6 : 1
0.6 - 0.7 : 1
Number of directories with absolute maximum < 4032: 702
The calls with a mean clipping rate of more than 0.4%
(all items) are:
(0.63) dir FEMALES\P0458
(0.54) dir MALES\P3154
(0.41) dir MALES\P3379
There is no criterion to decide that files or directories with a high
clipping rate should be rejected outright. It will depend on the application
to what extent files with a high clipping rate are usable.
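For readers who want to reproduce the clipping figures, the definition given
at the start of this subsection can be written down as the following Python
sketch. It is an illustration, not the software used for the validation; the
function name clipping_rate is hypothetical and the samples are assumed to
be already expanded to linear values in the range -4032 to 4032.

```python
def clipping_rate(samples, max_value=4032):
    """Percentage of samples equal to +max_value or -max_value."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if s == max_value or s == -max_value)
    return 100.0 * clipped / len(samples)


# Two of the four samples sit at the extremes.
print(clipping_rate([4032, -4032, 0, 100]))
```

The per-call figure reported above would then be the mean of this value over
all files in one directory.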
3.3 Mean values
We computed the mean sample value of each item in each call.
We provide a histogram with mean values below.
The histogram, then, is an overview of how many files were found
in a set of mean sample value intervals.
This overview can be used to trace files with large DC-offsets.
Recall from our remark above that the minimum/maximum sample values
were -4032/4032.
Distribution of means over all items:
Mean value   : Occurrences
-280 - -270  : 3
-270 - -260  : 1
-260 - -250  : 1
-250 - -240  : 1
-240 - -230  : 1
-220 - -210  : 6
-200 - -190  : 4
-190 - -180  : 6
-180 - -170  : 3
-170 - -160  : 1
-160 - -150  : 5
-150 - -140  : 2
-140 - -130  : 3
-130 - -120  : 1
 -80 -  -70  : 1
 -70 -  -60  : 2
 -60 -  -50  : 7
 -50 -  -40  : 34
 -40 -  -30  : 77
 -30 -  -20  : 156
 -20 -  -10  : 399
 -10 -    0  : 18906
   0 -   10  : 17123
  10 -   20  : 575
  20 -   30  : 91
  30 -   40  : 49
  40 -   50  : 28
  50 -   60  : 23
  60 -   70  : 17
  70 -   80  : 1
  80 -   90  : 5
  90 -  100  : 2
Distribution of means over all obligatory items:
Mean value   : Occurrences
-280 - -270  : 3
-250 - -240  : 1
-220 - -210  : 6
-200 - -190  : 4
-190 - -180  : 4
-180 - -170  : 3
-170 - -160  : 1
-160 - -150  : 4
-150 - -140  : 2
-140 - -130  : 3
-130 - -120  : 1
 -80 -  -70  : 1
 -70 -  -60  : 1
 -60 -  -50  : 6
 -50 -  -40  : 27
 -40 -  -30  : 66
 -30 -  -20  : 138
 -20 -  -10  : 357
 -10 -    0  : 15879
   0 -   10  : 14612
  10 -   20  : 516
  20 -   30  : 74
  30 -   40  : 38
  40 -   50  : 27
  50 -   60  : 20
  60 -   70  : 14
  70 -   80  : 1
  80 -   90  : 2
  90 -  100  : 1
The files with a mean lower than -100 or higher than 70 were:
(-215.8) file FEMALES\P0226\ITEM11
(-135.8) file FEMALES\P0226\ITEM12
(-190.1) file FEMALES\P0226\ITEM15
(-210.7) file FEMALES\P0226\ITEM19
(-178.4) file FEMALES\P0226\ITEM2
(-136.6) file FEMALES\P0226\ITEM23
(-191.7) file FEMALES\P0226\ITEM27
(-214.1) file FEMALES\P0226\ITEM28
(-153.2) file FEMALES\P0226\ITEM6
(-183.0) file FEMALES\P0226\ITEM9
(-276.5) file FEMALES\P0226\ECOLE
(-158.6) file FEMALES\P0226\IDENTIF
(-172.0) file FEMALES\P0226\ITEM10
(-200.0) file FEMALES\P0226\ITEM13
(-146.5) file FEMALES\P0226\ITEM14
(-165.3) file FEMALES\P0226\ITEM16
(-211.3) file FEMALES\P0226\ITEM17
(-211.2) file FEMALES\P0226\ITEM18
(-242.9) file FEMALES\P0226\ITEM20
(-155.8) file FEMALES\P0226\ITEM21
(-139.1) file FEMALES\P0226\ITEM22
(-181.4) file FEMALES\P0226\ITEM24
(-140.6) file FEMALES\P0226\ITEM25
(-174.9) file FEMALES\P0226\ITEM26
(-120.0) file FEMALES\P0226\ITEM3
(-198.0) file FEMALES\P0226\ITEM4
(-181.6) file FEMALES\P0226\ITEM5
(-150.7) file FEMALES\P0226\ITEM7
(-189.6) file FEMALES\P0226\ITEM8
(-277.8) file FEMALES\P0226\NAISSANC
(-270.2) file FEMALES\P0226\SEXE
(-217.3) file FEMALES\P0226\TELEFON
(87.9) file FEMALES\P0361\ITEM19
(70.8) file FEMALES\P0361\ECOLE
(80.9) file FEMALES\P0361\ITEM20
(98.8) file FEMALES\P0361\SEXE
Distribution of means per call over all items:
Mean value   : Occurrences
-200 - -190  : 1
 -50 -  -40  : 1
 -40 -  -30  : 1
 -30 -  -20  : 5
 -10 -    0  : 555
   0 -   10  : 426
  10 -   20  : 6
  20 -   30  : 2
  30 -   40  : 1
  40 -   50  : 1
  60 -   70  : 1
The calls with an overall mean of less than -50
or more than 50 were:
(-191.8) dir FEMALES\P0226
(65.3) dir FEMALES\P0361
=>By viewing and listening to the files in directory
p0226 we noticed that all files in this directory are
severely corrupted by errors in the recording platform.
The files in directory p0361 are OK.
3.4 Signal to Noise Ratio (SNR)
We split each signal file into contiguous windows of 10 ms and
computed the Mean Square (energy) in each window.
Before computing the square value we subtracted the mean
value (calculated over the total file) from the sample value.
The 5% of the windows that contained the lowest energy were assumed to
contain line noise. In this way the signal to noise ratio
could be calculated for each file by dividing the mean energy
over all windows by the mean energy of the 5% sample mentioned
above. The ratio was expressed in dB by taking 10*log10 of it.
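The procedure described above can be sketched in Python as follows. This is
an illustration of the description, not the program used for the
validation; the function name snr_db and its parameter names are
hypothetical, the sample rate of 8000 Hz is taken from the file headers, and
the handling of very short or noise-free files is our own assumption.

```python
import math


def snr_db(samples, sample_rate=8000, window_ms=10, noise_fraction=0.05):
    """Estimate the SNR in dB from the lowest-energy windows of one file."""
    window = sample_rate * window_ms // 1000      # 80 samples at 8 kHz
    n_windows = len(samples) // window
    if n_windows == 0:
        raise ValueError("file is shorter than one window")
    mean = sum(samples) / len(samples)            # mean over the total file

    # Mean square of the mean-corrected samples in each contiguous window.
    energies = []
    for i in range(n_windows):
        frame = samples[i * window:(i + 1) * window]
        energies.append(sum((s - mean) ** 2 for s in frame) / window)

    # The 5% of windows with the lowest energy are taken to be line noise.
    energies.sort()
    n_noise = max(1, int(n_windows * noise_fraction))
    noise = sum(energies[:n_noise]) / n_noise
    signal = sum(energies) / n_windows
    if noise == 0:
        return math.inf                           # assumption: no measurable noise
    return 10 * math.log10(signal / noise)


# Synthetic example: 19 loud windows and 1 quiet window of 80 samples each.
loud = [1000, -1000] * 40
quiet = [10, -10] * 40
print(round(snr_db(loud * 19 + quiet), 1))
```

Note that the numerator is the mean energy over all windows, including the
noise windows, as in the description above.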
Distribution of SNR over all items:
SNR      : Occurrences
 0 -  5  : 6
 5 - 10  : 3
10 - 15  : 5
15 - 20  : 23
20 - 25  : 213
25 - 30  : 1632
30 - 35  : 6925
35 - 40  : 13365
40 - 45  : 10058
45 - 50  : 4622
50 - 55  : 652
55 - 60  : 30
Distribution of SNR over all obligatory items:
SNR      : Occurrences
 0 -  5  : 2
 5 - 10  : 1
10 - 15  : 3
15 - 20  : 16
20 - 25  : 169
25 - 30  : 1308
30 - 35  : 5596
35 - 40  : 11086
40 - 45  : 8824
45 - 50  : 4192
50 - 55  : 588
55 - 60  : 27
An SNR lower than 10 was found in (obligatory items):
(9.6) file FEMALES\P0226\ITEM21
(4.1) file MALES\P3050\ITEM3
(4.9) file MALES\P3134\ITEM6
=>File FEMALES\P0226\ITEM21 belongs to the corrupted call mentioned before.
Files MALES\P3050\ITEM3 and MALES\P3134\ITEM6 contain nothing but noise
and are therefore unusable.
=>Among the non-obligatory items some other files were found that contain
only noise, and are unusable for that reason. These were:
(1.6) file FEMALES\P0108\COMMENTS
(5.5) file FEMALES\P0173\LANGUE
(4.3) file FEMALES\P0400\COMMENTS
(5.3) file FEMALES\P0404\COMMENTS
(4.7) file FEMALES\P0413\NIVEAU
(3.0) file MALES\P3215\COMMENTS
Distribution of mean SNR per call over all items:
SNR      : Occurrences
20 - 25  : 1
25 - 30  : 11
30 - 35  : 139
35 - 40  : 456
40 - 45  : 351
45 - 50  : 41
50 - 55  : 1
=>The call with the lowest mean SNR was, again, the corrupted call
of FEMALES\P0226.
The other directories with low mean SNRs are acceptable. E.g.:
For directory FEMALES\P0367 (mean SNR=27.2) there was background music.
For directory MALES\P3054 (mean SNR=25) there is a buzz in the beginning
of the items with a long silence after the utterance.
5. Conclusion
=>Due to unacceptable acoustics the following files are unusable.
Of the obligatory items:
- file MALES\P3050\ITEM3
- file MALES\P3134\ITEM6
Of the optional items:
- file FEMALES\P0108\COMMENTS
- file FEMALES\P0173\LANGUE
- file FEMALES\P0400\COMMENTS
- file FEMALES\P0404\COMMENTS
- file FEMALES\P0413\NIVEAU
- file MALES\P3215\COMMENTS
These files contain only noise.
=>Further the full call in directory FEMALES\P0226 is unusable.
====================================================================
5 CONTENTS ANNOTATION FILE
1 Label header information
A NIST header was used for documenting label file information.
There were not any files which did not have a header.
The following information is provided in the header (taken from
the example in the documentation.)
database_id -s26 Swiss French Polyphone 0.0
recording_site -s17 Swiss Telecom PTT
sheet_id -i 13946
prompt -s26 Informations consommateurs
text_transcription -s26 Informations consommateurs
speaking_mode -s4 read
sample_begin -r 0.200000
sample_end -r 2.725125
sample_count -i 23401
sample_n_bytes -i 2
channel_count -i 1
sample_coding -s26 pcm,embedded-shorten-v1.09
sample_rate -i 8000
sample_byte_format -s2 10
sample_checksum -i 12379
=>According to the SpeechDat specifications the following information
should also have been included:
- data base volume (FIXED0SF_00)
- session number of 4 digits
- region of call
- speaker age and sex
- file directory
- signal file name
- corpus (item) code
- recording date
- recording time of first item
- number of significant bits per sample
=>The sheet id in the Swiss French corpus does not uniquely identify each
call and can therefore not be used as an alternative for the session number.
We verified if all items in a directory had the same sheet id.
No irregularities were found.
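A check of this kind can be sketched in Python as shown below. The sketch
reads header field lines of the form shown in the example above (name, type
code -i, -r or -sN, value) and tests whether all headers of one directory
carry the same sheet_id. The function names are hypothetical, and lines that
do not have this three-part form are simply skipped, which is our own
simplification.

```python
def parse_header_line(line):
    """Return (name, value) for one 'name -type value' line, or None."""
    parts = line.strip().split(None, 2)
    if len(parts) != 3 or not parts[1].startswith("-"):
        return None                       # not a field line; skipped
    name, type_code, value = parts
    if type_code == "-i":
        return name, int(value)
    if type_code == "-r":
        return name, float(value)
    if type_code.startswith("-s"):
        # The digits after -s give the declared string length.
        return name, value[: int(type_code[2:])]
    return None


def parse_header(lines):
    """Collect all recognised fields of one header into a dictionary."""
    fields = {}
    for line in lines:
        parsed = parse_header_line(line)
        if parsed is not None:
            fields[parsed[0]] = parsed[1]
    return fields


def same_sheet_id(headers):
    """True if all headers of one directory share a single sheet_id."""
    return len({h.get("sheet_id") for h in headers}) <= 1


example = [
    "database_id -s26 Swiss French Polyphone 0.0",
    "sheet_id -i 13946",
    "prompt -s26 Informations consommateurs",
    "sample_begin -r 0.200000",
    "sample_byte_format -s2 10",
]
print(parse_header(example))
```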
2. Transliteration format
- Transliterations should be only in lower case letters,
also at sentence beginning.
Only exceptions: proper names, spelled words, ZIP codes, acronyms and
abbreviations.
In the latter case blanks should be used in between the letters.
=>Capital letters are used very inconsistently in the transliterations.
In the text transliterations capitals are not clearly restricted
to the aforementioned categories.
Moreover every first word of a transliteration starts with a capital
letter. However, capitals occur in other positions than the first
word, too.
E.g. in P3002\COMMENTS.SHN in:
text_transcription -s24 C'était Hyper méga super
- punctuation marks should not be used in the transliterations
=>punctuation marks were used, and in an inconsistent manner:
sometimes a blank was inserted between the punctuation mark and the
word preceding it, and sometimes not.
- Digits must appear in full orthographic form
We tested if digits in numerical form were present in the transliterated
texts. Such digits were not found.
- In principle only the following symbols are allowed to indicate
non-speech acoustic events:
[Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
Other symbols (and language equivalents) must be mentioned in the
documentation
=>The following 'touches' were used:
[\hesitations] [\prononciation bizarre]
[\inintelligible]
The symbolic representations for non-speech acoustic events in
SpeechDat corpora were not followed.
- Asterisks should be used to indicate incomplete realisations
=> Not used.
- According to a spelling check on annotated text
(including bracket check) up to 1% errors may be found
A spelling check could not be performed on the transliterations
since we do not have an independent lexicon for Swiss French available.
Using the lexicon delivered by the corpus itself we found no
misspelled words, but this procedure is of course a rather circular one.
According to a spelling check performed by the producer of the corpus,
there should be no remaining orthographic mistakes in the corpus.
- A comparison (of some sort) of prompted with spoken text will
be carried out
We used the following strategy.
First, prompt text and transliterated text were downcased.
Next, all punctuation marks were removed and the 'touches' were removed
(but not the words that the 'touches' marked). Next, we checked whether
every word in the prompt appeared in the transliterated text, and whether
the prompt text and the transliterated text contained the same number of
words.
As is obvious, the check was only carried out for the read items and
not for the spontaneous ones.
We used it to trace textual errors that were clear from
the text itself and did not need to be verified by auditive inspection.
In this way we could trace missing diacritics in the transliterations.
- Assessment of speech items in terms of SNR, presence of additional noise,
adherence to prompting text is provided (optional)
Not provided. See previous section about speech file contents
for our findings.
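The comparison of prompted and transliterated text described in this section
can be sketched in Python as follows. This is an illustration only: the
function names are hypothetical, the list of 'touche' names in the regular
expression is taken from the three markers quoted above, and the assumption
that a marked word is written inside the brackets after the marker name, as
well as the exact set of punctuation marks removed, are ours.

```python
import re

# [\hesitations ...], [\prononciation bizarre ...], [\inintelligible ...]:
# the marker is dropped, any words after the marker name are kept.
TOUCHE_RE = re.compile(
    r"\[\\(?:hesitations?|prononciation bizarre|inintelligible)\s*([^\]]*)\]"
)
# Assumed set of punctuation marks; apostrophes and hyphens are kept.
PUNCT_RE = re.compile(r"[.,;:!?\"()]")


def normalise(text):
    """Downcase, drop 'touche' markers and punctuation, split into words."""
    text = text.lower()
    text = TOUCHE_RE.sub(r" \1 ", text)
    text = PUNCT_RE.sub(" ", text)
    return text.split()


def compare(prompt, transliteration):
    """Report prompt words absent from the transliteration and the length match."""
    prompt_words = normalise(prompt)
    spoken_words = normalise(transliteration)
    missing = [w for w in prompt_words if w not in spoken_words]
    return {"missing_words": missing,
            "same_length": len(prompt_words) == len(spoken_words)}


# Invented example: the transliteration lacks the diacritic of 'musée'.
print(compare("Le musée est ouvert.",
              r"Le [\prononciation bizarre musee] est ouvert"))
```

A missing diacritic shows up as a prompt word that is not found although the
number of words is the same, which is the kind of error mentioned above.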
====================================================================
6 LEXICON
- Lexicon existence
A lexicon was provided under file name PHONEMIC.TXT
- Lexicon contents
1 SAMPA symbols only
=>This is difficult to check since there are no blanks between
the phoneme symbols, although this has been prescribed
for SpeechDat corpora.
2 Capitals only in proper names, spelled words and in single letters derived
from abbreviations (exception: German)
As far as we can see names have been used with capitals in the lexicon,
=>but single letters derived from abbreviations and spelled words do
not have capital letters. This may lead to problems.
E.g. : The transcription for 'p' is pE/ and for 't' it is t.
This means that the transcription for p t t
would result in pE/ t t, which obviously is incorrect.
This problem could be solved by using capitals for abbreviations
and small letters for clitics (as in l'argent, a-t-il).
We did not find any words in the lexicon that were included
once with a leading capital letter and once with a leading lower case
letter.
3 Blanks should be placed between phoneme symbols
=>Not done.
4 [TAB] between grapheme and phoneme transcription
OK
5 Homographs should have separate entries
Homographs were not found in the lexicon.
6 Double entries should not occur
=>The entry for Châtelard
is included twice in the lexicon.
7 Completeness
We checked if all words in the transliterations in the field
text_transcription in the file headers were contained in the lexicon.
The following operations were carried out on the transliterations
before the words were compared with the lexicon:
- punctuation marks were removed.
=>here we noticed an inconsistency in the transliterated texts:
sometimes a blank was inserted between the punctuation mark and the
word preceding it, and sometimes not. (see previous section)
- all words were converted to lower case.
=>Here we noticed that the first word in all transliterations started
with an uppercase letter, which shouldn't be. (see previous section)
- words between square brackets ([]) in [\hesitation x]
[\pronunciation bizarre x] and [\inintelligible x] were stripped.
- apostrophes and dashes were replaced by blanks
(e.g. l'argent, a-t-il)
- all lexicon entries were converted to lower case
After these operations were carried out, nearly all words
were found in the lexicon.
=>A word that was not found in the lexicon was 's'.
Also the word 'Saint.John' (found in MALES\P3311\ITEM22.SHN)
could not be traced back in the lexicon.
=>Since words are printed in a very inconsistent manner in the
transliterated text (due to large variability in capital letters and
placement of punctuation marks), it is difficult to find the matching
word in the lexicon.
=>Something strange was observed for single letter entries in the
lexicon. These are most often typical for spelled words, but in other
cases typical for clitics. E.g. : The transcription for 'p' is pE/
and for 't' it is t. This means that the transcription for p t t
would result in pE/ t t, which obviously is incorrect.
This problem could be solved by following the SpeechDat
specifications and using capital letters for abbreviations.
Words that were only found between square brackets were not included
in the lexicon. This is in accordance with the SpeechDat
specifications.
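The normalisation steps of the completeness check can be sketched in
Python as follows. This is only an illustration under stated
assumptions, not the script used for the validation: the function names
normalise and missing_words are hypothetical, the set of punctuation
marks is assumed, and the bracketed markers are stripped first so that
their contents never reach the lexicon lookup.

```python
import re


def normalise(transliteration: str) -> list[str]:
    """Apply the normalisation steps of the completeness check (sketch)."""
    # Strip bracketed markers such as [\hesitation x] including their words.
    text = re.sub(r"\[[^\]]*\]", " ", transliteration)
    # Replace apostrophes and dashes by blanks (l'argent -> l argent).
    text = re.sub(r"['\-]", " ", text)
    # Remove punctuation marks (assumed set), with or without a preceding blank.
    text = re.sub(r"[.,;:!?]", " ", text)
    # Convert to lower case and split on whitespace.
    return text.lower().split()


def missing_words(transliterations, lexicon_entries):
    """Return the sorted words that occur in the texts but not in the lexicon."""
    lexicon = {entry.lower() for entry in lexicon_entries}
    missing = set()
    for line in transliterations:
        missing.update(w for w in normalise(line) if w not in lexicon)
    return sorted(missing)


# Hypothetical mini lexicon, for illustration only.
lexicon_entries = ["l", "argent", "a", "t", "il"]
print(missing_words(["L'argent , a-t-il s ?"], lexicon_entries))  # prints ['s']
```

Because apostrophes and dashes become blanks, clitics such as l and t
are looked up as single-letter words, which is where the clash with
spelled letters described above shows up.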
8. Words should be ordered alphabetically
=>Words are not in alphabetical order in terms of the ISO-latin coding
table. The discrepancy comes from the characters with diacritics (such
as accent grave, accent circonflexe etc.). These are put before the
other letters instead of behind them.
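A check of this ordering criterion could look like the sketch below.
It assumes that "ISO-latin" means ISO 8859-1 (Latin-1) and that every
lexicon entry can be encoded in it; the function names and the example
words are hypothetical.

```python
def latin1_key(word: str) -> bytes:
    # ISO 8859-1 maps each character to one byte; accented letters have
    # codes above 'z', so they sort behind the unaccented letters.
    return word.encode("latin-1")


def first_order_violation(words):
    """Return the first adjacent pair that is out of order, or None."""
    words = list(words)
    for previous, current in zip(words, words[1:]):
        if latin1_key(previous) > latin1_key(current):
            return previous, current
    return None


# An accented entry placed before an unaccented one violates the order.
print(first_order_violation(["état", "zone"]))  # prints ('état', 'zone')
```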
=====================================================================
7 SPEAKERS
1 Speaker database file
=> A speaker table file should be present but is not there.
2 Obligatory information:
1. unique number (speaker/caller)
2. sex
3. age
4. region of call
=>This information is not provided in the file headers either.
sheet_id is not unique per speaker.
3 Balance of sexes
=>Recordings were delivered of 575 females and 425 males.
The imbalance between the sexes exceeds 5%, which is in conflict
with the SpeechDat specifications.
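The arithmetic behind this finding can be sketched as follows. The
reading of the 5% criterion used here (the share of each sex may
deviate from 50% by at most 5 percentage points) is an assumption, and
the function name sex_balance is hypothetical.

```python
def sex_balance(females: int, males: int, tolerance: float = 5.0):
    """Return (female share in %, deviation from 50%, within tolerance?)."""
    total = females + males
    female_share = 100.0 * females / total
    deviation = abs(female_share - 50.0)
    return female_share, deviation, deviation <= tolerance


share, deviation, ok = sex_balance(575, 425)
print(share, deviation, ok)  # prints 57.5 7.5 False
```

Under the alternative reading, in which the difference between the two
shares is compared with 5%, the difference is 15 percentage points and
the criterion is not met either.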
4 Balance of regions
=>Information about the region of call is not supplied and can
therefore not be validated.
The item that comes closest is 'ecole', where the speaker is asked in
which place he or she started school.
5 Balance of ages
. A minimum of 20% of speakers must be in the following age groups:
17-30, 31-45, 46-60.
A maximum of 40% of speakers may be younger than 17 or older than 60.
=>Information about speakers' ages is not given in the documentation.
It could be derived from item 'naissanc'.
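If ages were derived from item 'naissanc', the age criterion could be
checked along the lines of the sketch below. It assumes that the 20%
minimum applies to each of the three age groups separately; the
function name check_age_balance and the example ages are hypothetical
and do not come from the corpus.

```python
def check_age_balance(ages):
    """Return (share per age group in %, share outside 17-60 in %, ok?)."""
    groups = {"17-30": (17, 30), "31-45": (31, 45), "46-60": (46, 60)}
    total = len(ages)
    shares = {
        name: 100.0 * sum(low <= age <= high for age in ages) / total
        for name, (low, high) in groups.items()
    }
    outside = 100.0 * sum(age < 17 or age > 60 for age in ages) / total
    ok = all(share >= 20.0 for share in shares.values()) and outside <= 40.0
    return shares, outside, ok


# Invented ages for ten speakers, for illustration only.
shares, outside, ok = check_age_balance([20, 25, 28, 35, 40, 44, 50, 55, 58, 70])
print(shares, outside, ok)
```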
=====================================================================
8 RECORDING CONDITIONS
1 Digital telephone line
OK
2 A-law coding
OK
3 Recording information may be stored in a separate file (optional)
Not provided
=====================================================================
9 VALIDATION TRANSCRIPTION
This validation is carried out by taking 5% of the short items and
5% of the long items in the corpus.
The transcriptions in the label files for these samples are checked
by listening to the corresponding speech files.
This check is performed by native speakers of the language involved.
Short items are:
- isolated digit
- time phrases
- date phrases
- yes/no questions
- place name
- application words
Long items are:
- connected digits
- natural numbers
- money amounts
- spelled words
- application phrases
- phonetically rich sentences
Given that 19 (of the 23) long items were present in the database,
and that there were 1000 speakers, a selection of 5% of the long
items would comprise 950 samples. To remain on the safe side, a random
selection of 999 items was made.
For the short items, 11 (of the 16) items were included in the database.
A selection of 5% yields a sample of 550 items. A random selection of
549 short items was used for the evaluation.
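The sample sizes quoted above follow from a simple product, sketched
here in Python. The function name sample_size is hypothetical, and the
calculation assumes that every speaker recorded every item present in
the database.

```python
def sample_size(items_present: int, speakers: int, percent: int = 5) -> int:
    """Number of recordings in a sample of `percent` % of all recordings."""
    return items_present * speakers * percent // 100


print(sample_size(19, 1000))  # long items: prints 950
print(sample_size(11, 1000))  # short items: prints 550
```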
- The evaluation comprises the following criteria
. did the speaker actually speak the transliterated words
. did the speaker speak the prompted text
. is transliteration of non-speech acoustic events correct
. speech quality, line quality
. up to 5% transcription errors are allowed
- Abbreviations may only be used if spoken as such
RESULTS
The transcriptions of 999 randomly chosen long items and 549 randomly
chosen short items were evaluated.
It appeared that 33 of the long items contained an error (3.30% of
the sample), and 8 of the short items (1.46% of the sample). This is
below the error threshold of 5%. This threshold is in fact defined at
word level, but it stands to reason that a criterion met at item level
is also met at word level.
For the long items there were 2 errors in the cutting of the signals,
9 typing errors, and 22 transliteration errors.
For the short items there were 8 transliteration errors and 1 typing
error.
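The item-level error rates can be recomputed from the counts given
above, as in this sketch. The function name error_rate is hypothetical;
the counts and the 5% threshold are those quoted in the text.

```python
def error_rate(errors: int, sample: int) -> float:
    """Percentage of sampled items that contain an error."""
    return 100.0 * errors / sample


THRESHOLD = 5.0  # maximum percentage of transcription errors allowed

long_rate = error_rate(33, 999)
short_rate = error_rate(8, 549)
print(round(long_rate, 2), long_rate < THRESHOLD)    # prints 3.3 True
print(round(short_rate, 2), short_rate < THRESHOLD)  # prints 1.46 True

# The breakdown of the long-item errors adds up to the reported total.
assert 2 + 9 + 22 == 33
```

The breakdown given for the short items (8 transliteration errors and
1 typing error) adds up to 9 rather than 8; with 9 errors the rate
would be about 1.64%, which is also below the threshold.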
For completeness a full list of errors is given below.
Long items:
Change in item females/p0063/item15:
Original text:
Mais, pour l'heure, le tribunal conclut en jugeant que l'artiste
poursuivie était en droit de réclamer une indemnité.
Modified text:
Mais, pour l'heure, le tribunal conclut en jugeant que l'artiste
poursuivie était en droit de réclamer un indemnité.
Kind of error: Transliteration
Change in item females/p0093/telefon:
Original text:
Septante quatre quarante quatre six
Modified text:
Septante quatre quarante quatre seize
Kind of error: Transliteration
Change in item females/p0119/item11:
Original text:
Son père lui a fait une réprimande.
Modified text:
Son père lui a fait une reprimande.
Kind of error: Transliteration
Change in item females/p0180/item07:
Original text:
[\prononciation bizarre Daves] d a v e s
Modified text:
Daves d a v e s
Kind of error: Transliteration
Change in item females/p0198/identif:
Original text:
Mille neuf cent septante et un
Modified text:
*euf mille neuf cent septante et un
Kind of error: Cutting
Change in item females/p0201/item07:
Original text:
Spécialisées s p é accent aigu c i a l i s é accent aigu e s
Modified text:
Spécialisées s p e accent aigu c i a l i s e accent aigu e s
Kind of error: Transliteration
Change in item females/p0219/item11:
Original text:
Il y a peu de chance que l'on en discute à Rennes, où doit se réunir,
en mars mille neuf cent nonante, le congrès.
Modified text:
Il y a peu de chance que l'on en discute à Rennes, on doit se réunir,
en mars mille neuf cent nonante, le congrès.
Kind of error: Transliteration
Change in item females/p0259/item11:
Original text:
Toutes ces bêtes ont reçu les soins appropriés et ont été placés.
Modified text:
Toutes ces bêtes ont reçu les soins appropriés et ont été placées.
Kind of error: Type
Change in item females/p0273/item04:
Original text:
Trois mille cinq Kilos
Modified text:
Trois mille cinq kilos
Kind of error: Type
Change in item females/p0285/item12:
Original text:
Et elle n'a pas cherché à monnayer quelques accord avec l'impertinent
reporter.
Modified text:
Et elle n'a pas cherché à monnayer quelques accords avec
l'impertinent reporter.
Kind of error: Type
Change in item females/p0286/telefon:
Original text:
Soixante et un soixante et un soixante et un
Modified text:
*ois cent soixante et un soixante et un soixante et un
Kind of error: Transliteration
Change in item females/p0293/item02:
Original text:
Sinon, il nous parait exsangue, dépassé, vide de sens.
Modified text:
Sinon, il nous parait [\prononciation bizarre exsangue], dépassé,
vide de sens.
Kind of error: Transliteration
Change in item females/p0321/item15:
Original text:
Quatre personnes ont été blessées au cours de ces interpellations.
Modified text:
Quatre personnes ont été blessées au cours de cette interpellation.
Kind of error: Transliteration
Change in item females/p0338/item23:
Original text:
Quel est le diagnostic du médecin ?
Modified text:
Quel est le [\prononciation bizarre diagnostic] du médecin ?
Kind of error: Transliteration
Change in item females/p0373/item26:
Original text:
[\inintelligible aïssaoui] a i tréma s s a o u i
Modified text:
Aïssaoui a i tréma s s a o u i
Kind of error: Transliteration
Change in item females/p0389/item05:
Original text:
Six cinq huit trois trois trois deux trois un quatre quatre deux
trois quatre huit cinq
Modified text:
Six cinq huit trois trois deux trois un quatre quatre deux trois
quatre huit cinq
Kind of error: Transliteration
Change in item females/p0450/item15:
Original text:
Pierre a lancé une oeillade à marie.
Modified text:
Pierre a lancé une oeillade à Marie.
Kind of error: Transliteration
Change in item females/p0483/item12:
Original text:
Té vingt quintaux de blé.
Modified text:
*té vingt quintaux de blé.
Kind of error: Cutting
Change in item females/p0549/item12:
Original text:
Toute critique en profondeur du système Dollar a été, par exemple,
éliminée; on a préféré incriminer les émirs du pétrole.
Modified text:
Toute critique en profondeur du système dollar a été, par exemple,
éliminée; on a préféré incriminer les émirs du pétrole.
Kind of error: Type
Change in item males/p3023/item02:
Original text:
Qui prendra le relais de jean ?
Modified text:
Qui prendra le relais de Jean ?
Kind of error: Transliteration
Change in item males/p3170/item05:
Original text:
Deux mille sept cent septante trois zéro trois cent vingt quatre
trois mille quatre cent quarante neuf trois mille quatre cent
soixante quinze
Modified text:
Deux mille sept cent septante trois zéro trois cent vingt quatre
trois mille quatre cent quarante neuf trois mille quatre cent
septante cinq
Kind of error: Transliteration
Change in item males/p3182/item26:
Original text:
Germanier g e r m a n i e r
Modified text:
Germanier g é r m a n i e r
Kind of error: Transliteration
Change in item males/p3235/item11:
Original text:
Ils ne disent seulement saturés par les campagnes d'information sur
la maladie, mais également sceptiques quant à leur efficacité.
Modified text:
Ils ne disent seulement saturés par les campagnes d'information sur
la maladie, mais également *asceptiques quant à leur efficacité.
Kind of error: Transliteration
Change in item males/p3256/item06:
Original text:
Cette journée de réflexion prendra fin par un une office divine.
Modified text:
Cette journée de réflexion prendra fin par un par une office divine.
Kind of error: Transliteration
Change in item males/p3289/item20:
Original text:
Cinquante deux mille zéro onze virgules sept cent cinquante neuf
Modified text:
Cinquante deux mille zéro onze virgule sept cent cinquante neuf
Kind of error: Type
Change in item males/p3294/item22:
Original text:
Acide a c i d e
Modified text:
Acide a c i d è
Kind of error: Transliteration
Change in item males/p3295/item05:
Original text:
Cent quatre zéro mille six cent nonante trois mille neuf cent
quarante neuf zéro huit cent soixante sept
Modified text:
Zéro cent quatre mille six cent nonante trois mille neuf cent
quarante neuf zéro huit cent soixante sept
Kind of error: Transliteration
Change in item males/p3296/item20:
Original text:
Trois million trente neuf mille neuf cent cinquante neuf
Modified text:
Trois millions trente neuf mille neuf cent cinquante neuf
Kind of error: Type
Change in item males/p3298/item26:
Original text:
Georges-Etienne g e o r g e s trait d'union e t i e n n e
Modified text:
Georges-Etienne g e o r g e s trait d'union é t i e n n e
Kind of error: Transliteration
Change in item males/p3332/item12:
Original text:
La conseillère en style est même intervenue dans le choix de tissu
des stands.
Modified text:
La conseillère en style est même intervenue dans le choix du tissu
des stands.
Kind of error: Transliteration
Change in item males/p3354/item04:
Original text:
Cent cinquante cinq mille sept cent Kilos
Modified text:
Cent cinquante cinq mille sept cent kilos
Kind of error: Type
Change in item males/p3376/item08:
Original text:
Huit cent soixante deux mille et un Franc trente francs suisses
Modified text:
Huit cent soixante deux mille et un franc trente francs suisses
Kind of error: Type
Change in item males/p3394/item12:
Original text:
En fait, les pays d'Europe de l'est ne sont pas les seuls visés.
Modified text:
En fait, les pays d'Europe de l'Est ne sont pas les seuls visés.
Kind of error: Type
******************
Short items:
Change in item females/p0103/item25:
Original text:
Dièse trois quatre cinq dièse étoile ou asterix
Modified text:
Dièse trois quatre cinq dièse étoile ou astérisque
Kind of error: Transliteration
Change in item females/p0120/naissanc:
Original text:
Le douze janvier mille neuf cent cinquante et un
Modified text:
Un douze janvier mille neuf cent cinquante et un
Kind of error: Transliteration
Change in item females/p0125/item03:
Original text:
Lister
Modified text:
Liste
Kind of error: Transliteration
Change in item females/p0303/item25:
Original text:
[\prononciation bizarre astérisque] six neuf sept deux
[\prononciation bizarre astérisque]
Modified text:
astérisque six neuf sept deux astérisque
Kind of error: Transliteration
Change in item females/p0330/item21:
Original text:
Lister
Modified text:
[\prononciation bizarre Lister]
Kind of error: Transliteration
Change in item females/p0361/ecole:
Original text:
À Kussnacht Zürich
Modified text:
A Kussnacht Zürich
Kind of error: Type
Change in item females/p0538/item25:
Original text:
Huit cinq [\prononciation bizarre astérisque] cinq deux un
Modified text:
Huit cinq astérisque cinq deux un
Kind of error: Transliteration
Change in item males/p3212/item03:
Original text:
Concert
Modified text:
[\prononciation bizarre Concert]
Kind of error: Transliteration
Change in item males/p3226/item21:
Original text:
Lister
Modified text:
[\prononciation bizarre Lister]
Kind of error: Transliteration
=====================================================================
10. SUMMARY
Below we give a brief overview of our findings.
We repeat that it should be borne in mind that the Swiss French
corpus
was recorded long before the SpeechDat specifications were released.
1. Documentation
In general formal matters are properly described (contact person,
number and contents of CD-ROMs).
Speaker information is very poorly provided.
Naming conventions for directories and files are well described.
Prompting information is well described.
Annotation is well described, but information about annotation
conventions (when to use upper and when lower case letters) is missing.
Information about the lexicon is OK, but we miss an overview of the
SAMPA symbols used.
We miss a list of alternative spellings of the words (or a paragraph
stating that this has been made uniform in one way or another).
We miss information about the number of annotation files that were
double checked by the producer.
2. Database structure, contents and file names
Directory structure and file names are not according to SpeechDat
specifications (these were not known at the time of compilation).
3. Missing items, structurally and incidentally
We miss 9 out of 39 obligatory items per call systematically;
141 calls miss up to 3 further obligatory items.
We miss 15 out of 39 obligatory application words.
This does not fulfil the SpeechDat specifications.
4. Contents sampled data files
Speech files are not in A-law, but in a linear conversion of A-law
(13 bits).
The call in FEMALES\P0226 is corrupted.
Due to unacceptable acoustics the following files are unusable.
Of the obligatory items:
- file MALES\P3050\ITEM3
- file MALES\P3134\ITEM6
Of the optional items:
- file FEMALES\P0108\COMMENTS
- file FEMALES\P0173\LANGUE
- file FEMALES\P0400\COMMENTS
- file FEMALES\P0404\COMMENTS
- file FEMALES\P0413\NIVEAU
- file MALES\P3215\COMMENTS
These files contain only noise.
By viewing and listening to the files in directory
p0226 we noticed that all files in this directory are
severely corrupted by errors in the recording platform.
The call with the lowest mean SNR was, again, the corrupted call
of FEMALES\P0226.
The other directories with low mean SNRs are acceptable.
5. Contents annotation file
In the file headers we miss the following information:
- data base volume (FIXED0SF_00)
- session number of 4 digits
- region of call
- speaker age and sex
- file directory
- signal file name
- corpus (item) code
- recording date
- recording time of first item
- number of significant bits per sample
In the transcriptions capital letters are used rather inconsistently.
We recommend that all punctuation marks be removed.
6. Lexicon
Lexicon format: <grapheme string> <TAB> <phoneme string> is OK.
There are no blanks between phoneme symbols.
Only two words were found not to be present in the lexicon.
7. Speakers
There is no speaker table.
There is no information about sex, age and region of call in the
file headers.
The imbalance between the sexes exceeds 5%, which is in conflict
with the SpeechDat specifications.
8. Recording conditions
Recording conditions are fine.
9. Validation transcription
The transcriptions of 999 randomly chosen long items and 549 randomly
chosen short items were evaluated.
They were found to fit the SpeechDat specifications nicely.
=====================================================================
