Digital curation: the SLDR experience
Frédérique Bénard – frederique.benard@lpl-aix.fr
Bernard Bel – bernard.bel@lpl-aix.fr
Speech & Language Data Repository – http://sldr.org
TRASP 2013 – Aix-en-Provence

Menu TRASP Special
• 1st course: SLDR/ORTOLANG archive system
• 2nd course: A case study for digital curation
• 3rd course: Tools presented in TRASP

Diversity
• Capitalizing on knowledge from loosely interconnected scientific domains: formal and computational linguistics, sociolinguistics, didactics, psycholinguistics, neuroscience, literature, language translation
• Integrated management of access rights => all cases!

Convergence

Metadata
• OLAC/Dublin Core format (via its OAI-PMH server)
• RDF format embedded in HTML
• EAD (Encoded Archival Description) for structural metadata
• CMDI (Component metadata, CLARIN) for finer descriptions + interoperability => full-scale test on Swedia 2000 (swedia-000788)

Persistent identifiers (PID)
• Same scheme (Handles) applied to items and documents, e.g.:
  hdl:11041/swedia-000788
  hdl:11041/swedia-000788/kyrkslatt_om_2_prosody.imdi
• Scheme is compatible with versioning

Users' community
• SLDR keeps a trace of transactions on resources
• Users are bound by a
non-commercial licence

Other relevant features of SLDR
• The site is available in four languages: English, Chinese, Spanish and French. An additional language may be used for textual descriptive metadata.
• Institutional affiliations, roles, financial support and projects are displayed accurately.
• Long-term preservation (estimated at 5000 €/TB/year) is handed over to CINES (Centre informatique national de l'enseignement supérieur) via Huma-Num (a VLRI in charge of Digital Humanities).

Multilingual metadata

Displaying institutional roles
Authors, Sponsor, Owner

Systematic aspects of digital curation
• "Hi, I would like to put my data on the SLDR/ORTOLANG platform"
  - because I have to: my director asked me to / my project received financial support for it...
  - because I want to save it: safe storage, long-term preservation, reusability of the data, visibility of the project for possible future collaborations

Out of the cupboard
Data Reclamation/Salvage

Here are my data

Physical data formats
Different media: floppy disk, DAT, audio tape, audio cassette, MiniDisc, CD, DVD, USB drive, FTP (Internet transactions, ftp://)

A pragmatic approach to data sharing
Example of C-PROM (a French annotated corpus for the study of prominence)
First step is the 'remote access' option: http://sldr.org/c-prom-000250
The entire resource remains on the original website.
Advantages:
• It has been assigned a PID pointing directly to the original item: http://hdl.handle.net/11041/c-prom-000250
• Its descriptive metadata (http://sldr.org/c-prom-000250/metadata/olac) are shared on the OAI-PMH server.

Presentation of C-PROM on SLDR

Descriptive metadata for C-PROM

A pragmatic approach to data sharing
The entire resource remains on the original website.
We may proceed to upload all the material when the data producer feels ready to benefit from:
• long-term preservation
• dissemination via a reliable service (currently CC-IN2P3)
• integrated access-rights management

A time structure for digital curation
Present: I would like to deposit data for long-term preservation.
Past: Where do the data come from, what formats were used, who can explain these formats?
Future: Reusability of the data, format updates, new software development integrating this earlier work.

Importance of metadata: OTG's story
Past, Present, Future
OTG = Office du Tourisme de Grenoble
• Jean-Yves Antoine: depositor on SLDR

Presentation of OTG on SLDR

Discovery of an exotic format: AFO

What is this format? Who can tell us about AFO?
Private curator

Importance of metadata: OTG's story
Past, Present, Future
• Jean-Yves Antoine: depositor on SLDR
• Geneviève Caelen-Haumont: handed over this corpus to Jean-Yves
• Jean Caelen: one of the creators of the corpus
• Robert Espesser: worked on another project using the AFO format
• Google: completed the information we had on AFO
  - User guide to ETR tools. SAM: Multi-lingual speech Input/Output Assessment, Methodology and Standardisation. Ref: SAM-UCL-G007
  - Spoken Language Materials: http://www.spectrum.uni-bielefeld.de/~gibbon/gibbon_handbook_1997/node445.html

co-authored by Dafydd Gibbon, Roger Moore and Richard Winski

(AFO)

Remote access

Final version of the resource
… after obtaining permission from Walter de Gruyter Publishers. Thanks to Dafydd Gibbon!

Links to resource on OTG metadata
• Link to documentation
• Link to AFO format

Tools presented in TRASP
Usage of PRAAT: 12 out of the 16 tools presented in this workshop are based on PRAAT. Other platforms include Python (Rapid & Smooth Pitch Contour Manipulation and SPPAS), a web service (TGA) and Windows (Winpitch).
(Chart: Praat, Python, Web service, Windows)

Archiving Praat scripts
Beware:
• Some Praat scripts may no longer work in forthcoming versions of Praat. Therefore, specify which version was used to develop your script.
• Encoding: the default setting for Praat text files is UTF-16, but UTF-8 is required for long-term preservation

Availability on SLDR
• Availability: 4 tools are already distributed via SLDR. We expect all tools to have entries on SLDR, even if they are maintained and distributed on remote sites for practical reasons (updates, documentation, associated tools and download/install procedures)

To conclude
• OpenProDat (sldr000805): open-access corpora in multiple languages, along with annotations, inspired by EUROM1
• OpenProDat is a good collection for comparing tools
• This evening's discussion should focus on tool interoperability and sharing
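
Note on the encoding advice under "Archiving Praat scripts": re-encoding a script before deposit could look like the minimal Python sketch below. It assumes the script was saved either as UTF-16 with a byte-order mark or as plain ASCII/UTF-8. The function names and file names are hypothetical, and this is not an SLDR or Praat tool. Only Python standard-library calls are used.

```python
import codecs
from pathlib import Path


def read_praat_text(path: Path) -> str:
    """Decode a Praat text file, assuming UTF-16 with a BOM or ASCII/UTF-8."""
    raw = path.read_bytes()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return raw.decode("utf-16")
    # "utf-8-sig" accepts plain ASCII/UTF-8 and drops a UTF-8 BOM if present.
    return raw.decode("utf-8-sig")


def convert_to_utf8(src: Path, dst: Path) -> None:
    """Write a UTF-8 copy (no BOM) of src to dst, keeping line endings as-is."""
    dst.write_bytes(read_praat_text(src).encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    source = Path("my_script.praat")
    if source.exists():
        convert_to_utf8(source, Path("my_script_utf8.praat"))
```

Writing to a separate file keeps the original untouched, so the depositor can still check the converted copy in the Praat version used for development.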