Digital curation: the SLDR experience

Transcription

Digital curation: the SLDR experience
Frédérique Bénard, frederique.benard@lpl-aix.fr
Bernard Bel, bernard.bel@lpl-aix.fr
Speech & Language Data Repository, http://sldr.org
TRASP 2013, Aix-en-Provence

Menu: TRASP Special
• 1st course: SLDR/ORTOLANG archive system
• 2nd course: A case study for digital curation
• 3rd course: Tools presented in TRASP

Diversity
• Capitalizing on knowledge from loosely connected scientific domains: formal and computational linguistics, sociolinguistics, didactics, psycholinguistics, neuroscience, literature, language translation
• Integrated management of access rights => all cases!

Convergence

Metadata
• OLAC/Dublin Core format (via its OAI-PMH server)
• RDF format embedded in HTML
• EAD (Encoded Archival Description) for structural metadata
• CMDI (Component metadata, CLARIN) for finer descriptions + interoperability => full-scale test on Swedia 2000 (swedia-000788)

Persistent identifiers (PID)
• Same scheme (Handles) applied to items and documents, e.g.:
  hdl:11041/swedia-000788
  hdl:11041/swedia-000788/kyrkslac_om_2_prosody.imdi
• Scheme is compatible with versioning

Users' community
• SLDR keeps a trace of transactions on resources
• Users are bound by a non-commercial licence

Other relevant features of SLDR
• The site is available in four languages: English, Chinese, Spanish and French. An additional language may be used for textual descriptive metadata.
• Accurate display of institutional affiliations, roles, financial support and projects.
• Long-term preservation (estimated at 5000 €/TB/year) is handed over to CINES (Centre informatique national de l'enseignement supérieur) via Huma-Num (a VLRI in charge of Digital Humanities).

Multilingual metadata

Displaying institutional roles
Authors, Sponsor, Owner

Systematic aspects of digital curation
• "Hi, I would like to put my data on the SLDR/ORTOLANG platform"
- because I have to: my director asked me to / my project received financial support for it...
- because I want to save it: safe storage, long-term preservation, reusability of the data, visibility of the project for possible future collaborations

Out of the cupboard
Data Reclamation/Salvage
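
The Handle scheme listed under "Persistent identifiers" above can be illustrated with a small sketch. It assumes the public Handle proxy at hdl.handle.net and the `hdl:PREFIX/ITEM[/DOCUMENT]` layout seen in the two Swedia examples; the function names are hypothetical and are not part of any SLDR API.

```python
from typing import Optional, Tuple

# Public Handle proxy (assumption: PIDs are resolved through it).
HANDLE_PROXY = "http://hdl.handle.net/"


def split_handle(pid: str) -> Tuple[str, str, Optional[str]]:
    """Split 'hdl:PREFIX/ITEM[/DOCUMENT]' into (prefix, item, document)."""
    if not pid.startswith("hdl:"):
        raise ValueError(f"not a Handle identifier: {pid!r}")
    handle = pid[len("hdl:"):]
    prefix, sep, rest = handle.partition("/")
    if not sep or not prefix or not rest:
        raise ValueError(f"malformed Handle: {pid!r}")
    # Everything after the item name is treated as the document path.
    item, _, document = rest.partition("/")
    return prefix, item, document or None


def handle_to_url(pid: str) -> str:
    """Return the proxy URL under which the Handle can be resolved."""
    split_handle(pid)  # validation only
    return HANDLE_PROXY + pid[len("hdl:"):]


if __name__ == "__main__":
    for pid in (
        "hdl:11041/swedia-000788",
        "hdl:11041/swedia-000788/kyrkslac_om_2_prosody.imdi",
    ):
        print(split_handle(pid), handle_to_url(pid))
```

Because items and documents share one scheme, the same two functions cover both cases: a document identifier is simply an item identifier with a further path component.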
Here are my data

Physical data formats
Various media: floppy disk, DAT, audio tape, audio cassette, MiniDisc, CD, DVD, USB drive, FTP (Internet transactions, ftp://)

A pragmatic approach to data sharing
Example of C-PROM (a French annotated corpus for the study of prominence)
The first step is the 'remote access' option: http://sldr.org/c-prom-000250
The entire resource remains on the original website. Advantages:
• It has been assigned a PID pointing directly to the original item: http://hdl.handle.net/11041/c-prom-000250
• Its descriptive metadata (http://sldr.org/c-prom-000250/metadata/olac) are shared on the OAI-PMH server.

Presentation of C-PROM on SLDR

Descriptive metadata for C-PROM

A pragmatic approach to data sharing (continued)
We may proceed to uploading all the material when the data producer feels ready to benefit from:
• long-term preservation
• dissemination via a reliable service (currently CC-IN2P3)
• integrated access-rights management

A time structure for digital curation
Present: I would like to deposit data for long-term preservation.
Past: Where do the data come from, what formats were used, who can explain these formats?
Future: Reusability of the data, format updates, new software development integrating this earlier work.

Importance of metadata: OTG's story
OTG = Office du Tourisme de Grenoble
Timeline: Past, Present, Future
Jean-Yves Antoine: depositor on SLDR

Presentation of OTG on SLDR

Discovery of an exotic format: AFO

What is this format? Who can tell us about AFO?
Private curator

Importance of metadata: OTG's story (continued)
Timeline: Past, Present, Future
• Jean-Yves Antoine: depositor on SLDR
• Geneviève Caelen-Haumont: handed over this corpus to Jean-Yves
• Jean Caelen: one of the creators of the corpus
• Robert Espesser: worked on another project using the AFO format
• Google: completed the information we had on AFO
  - User guide to ETR tools. SAM: Multi-lingual speech Input/Output Assessment, Methodology and Standardisation. Ref: SAM-UCL-G007
  - Spoken Language Materials: http://www.spectrum.uni-bielefeld.de/~gibbon/gibbon_handbook_1997/node445.html

Co-authored by Dafydd Gibbon, Roger Moore and Richard Winski

(AFO)

Remote access

Final version of the resource
… after obtaining permission from Walter de Gruyter Publishers. Thanks to Dafydd Gibbon!

Links to resource on OTG metadata
• Link to documentation
• Link to AFO format

Tools presented in TRASP
Usage of PRAAT: 12 of the 16 tools presented in this workshop are based on PRAAT. Other platforms include Python (Rapid & Smooth Pitch Contour Manipulation and SPPAS), a web service (TGA) and Windows (Winpitch).

Archiving Praat scripts
Beware:
• Some Praat scripts may no longer work in forthcoming versions of Praat. Therefore, specify which version of Praat was used to develop your script.
• Encoding: the default setting for Praat text files is UTF-16, but UTF-8 is required for long-term preservation.

Availability on SLDR
• Availability: 4 tools are already distributed via SLDR. We expect all tools to have entries on SLDR, even if they are maintained and distributed on remote sites for practical reasons (updates, documentation, associated tools and download/install procedures).

To conclude
• OpenProDat (sldr000805): open-access corpora in multiple languages, along with annotations, inspired by EUROM1
• OpenProDat is a good collection for comparing tools
• This evening's discussion should focus on tool interoperability and sharing
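
As a practical note on the two caveats of the "Archiving Praat scripts" slide, here is a minimal sketch of how a script could be prepared before deposit. It assumes that a UTF-16 script starts with a byte-order mark and that anything else is already UTF-8; the function name and the wording of the version comment are hypothetical, and this is not a tool provided by SLDR.

```python
import codecs
from pathlib import Path
from typing import Optional


def convert_praat_script(src: Path, dst: Path,
                         praat_version: Optional[str] = None) -> str:
    """Rewrite a Praat script as UTF-8 and return the detected source encoding.

    If praat_version is given, a '#' comment line recording it is prepended
    ('#' starts a comment line in Praat scripts).
    """
    raw = src.read_bytes()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The BOM tells the decoder the byte order and is removed.
        text = raw.decode("utf-16")
        detected = "utf-16"
    else:
        # Assumption: no UTF-16 BOM means UTF-8 (with or without BOM).
        # A UnicodeDecodeError here signals some other legacy encoding.
        text = raw.decode("utf-8-sig")
        detected = "utf-8"
    if praat_version is not None:
        text = f"# Developed with Praat {praat_version}\n" + text
    # write_bytes avoids any platform-specific newline translation.
    dst.write_bytes(text.encode("utf-8"))
    return detected
```

A depositor would call it once per script, for example `convert_praat_script(Path("analysis.praat"), Path("archive/analysis.praat"), praat_version="...")`, filling in the version actually used during development.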