GL15 Grey Literature, Bratislava, 2-3 December 2013. Industrial Philology: problems and techniques of data and archives preservation for future generations.


1 Istituto di Linguistica Computazionale (ILC), CNR of Pisa. E. Sassolini, A. Cinini, S. Sbrulli, N. Cucurullo, and M. Sassi

2 Industrial Philology. ILC has a fifty-year history of ICT applied to NLP. Over that time, a wide variety of texts and corpora have been stored in various formats and record layouts.

3 Cultural Heritage domain. E.g.: texts in Latin and ancient Greek, which required a complicated system of encoding. In the 1960s-70s, punch cards with a limited set of available characters were used. Equivalence tables were created so that the texts can be fully represented using Unicode encoding.
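As a minimal sketch of what an equivalence table does, the Python fragment below maps a restricted punch-card alphabet onto Unicode Greek letters. The table contents and the names `EQUIVALENCE` and `to_unicode` are hypothetical and chosen only for illustration; they are not the tables used at ILC, and accents, breathings and final sigma are ignored.

```python
# Hypothetical equivalence table: legacy card codes -> Unicode code points.
# Illustrative only; a real table would also cover accents and breathings.
EQUIVALENCE = {
    "A": "\u03b1",  # GREEK SMALL LETTER ALPHA
    "G": "\u03b3",  # GREEK SMALL LETTER GAMMA
    "I": "\u03b9",  # GREEK SMALL LETTER IOTA
    "L": "\u03bb",  # GREEK SMALL LETTER LAMDA
    "O": "\u03bf",  # GREEK SMALL LETTER OMICRON
}


def to_unicode(legacy: str) -> str:
    """Replace each legacy code by its Unicode equivalent.

    Characters that have no entry in the table are passed through
    unchanged so that nothing is silently lost.
    """
    return "".join(EQUIVALENCE.get(ch, ch) for ch in legacy)


if __name__ == "__main__":
    print(to_unicode("LOGOI"))
```

Passing unknown characters through unchanged, rather than dropping them, keeps the conversion reversible by inspection, which matters when the original encoding is only partly documented.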

4 Text handling problems. Character encodings differ according to the operating systems used over the years (from EBCDIC, through ASCII, to Unicode). Textual materials produced in or recovered from a past project do not have a standard format; they reflect both the technology and the research needs of their time.
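The move from EBCDIC to Unicode can be sketched with Python's standard codecs. The example assumes the source bytes use EBCDIC code page 037 (`cp037`); that choice is an assumption made for illustration, since legacy tapes may use a different EBCDIC variant, and the function name `ebcdic_to_utf8` is hypothetical.

```python
def ebcdic_to_utf8(raw: bytes, codepage: str = "cp037") -> bytes:
    """Decode EBCDIC bytes to a Unicode string, then re-encode as UTF-8.

    `codepage` is an assumption: the right value depends on the system
    that produced the data.
    """
    text = raw.decode(codepage)   # EBCDIC bytes -> Unicode string
    return text.encode("utf-8")   # Unicode string -> UTF-8 bytes


if __name__ == "__main__":
    # The word "HELLO" as it would be stored in EBCDIC code page 037.
    sample = bytes.fromhex("c8c5d3d3d6")
    print(ebcdic_to_utf8(sample))
```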

5 Our experiences: Digital Text Repository schema. [Diagram; labels: Magnetic tape, TCL format, DBT format, other interchange format; Intermediate format (1), Intermediate format (2), Intermediate format (n); XML TEI]

6 Text acquisition strategy
Source text | Transition phases (TP) required | Metadata
Text on magnetic tape | Many types of TP | study and research in the ILC archives
Text divided into separate resources | TP>3 | recovered from paper-based data
Text in an obsolete file format | TP>2 | recovered from paper-based data
Digital text with obsolete character encoding | 2<TP<3 | recovered from: paper-based data; the digital format
Digital text | One TP | recovered from the digital format
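One way to picture transition phases is as a chain of small conversion steps, each producing an intermediate format. The sketch below is an illustration under assumptions, not the ILC procedure: the two phases (`unpack_records`, which splits fixed-width 80-column records, and `collapse_spaces`) and the helper `run_phases` are hypothetical names.

```python
from typing import Callable, List, Tuple

Phase = Callable[[str], str]


def unpack_records(data: str, width: int = 80) -> str:
    """Split fixed-width records into lines and drop trailing padding."""
    records = [data[i:i + width] for i in range(0, len(data), width)]
    return "\n".join(record.rstrip() for record in records)


def collapse_spaces(data: str) -> str:
    """Reduce runs of spaces inside each line to a single space."""
    return "\n".join(" ".join(line.split()) for line in data.split("\n"))


def run_phases(source: str, phases: List[Phase]) -> Tuple[str, int]:
    """Apply each phase in order; return the result and the TP count."""
    data = source
    for phase in phases:
        data = phase(data)
    return data, len(phases)


if __name__ == "__main__":
    cards = "ARMA  VIRUMQUE".ljust(80) + "CANO".ljust(80)
    text, tp = run_phases(cards, [unpack_records, collapse_spaces])
    print(tp)
    print(text)
```

Keeping each phase as a separate function makes the number of transitions explicit and lets every intermediate format be inspected or archived on its own.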

7 Annotated text acquisition strategy
Source text | Transition phases (TP) required | Specific annotation encoding type | Metadata
Text on magnetic tape | Many types of TP | ? | long and difficult work
Text divided into separate resources | TP>3 | DBT type encoding | recovered from paper-based data
Text in an obsolete file format | TP>2 | Obsolete type encoding | recovered from paper-based data
Digital text with obsolete character encoding | 2<TP<3 | Specific type encoding | recovered from: paper-based data; the digital format
Digital text | One TP | ILC text encoding | recovered from the digital format

8 Phase 1: text material collection
- Survey of all existing text materials at ILC, looking at the historical projects on which ILC has worked;
- Identification of the physical locations where these texts are held;
- Recovery and analysis of the text material;
- Quantification of the work required.

9 Phase 2: text corpus standardization. Main innovations. For each type of textual data, definition of:
- a procedure for text standardization;
- an XML TEI model of text representation;
- the costs of such work.

10 Tools. Text analysis tools:
- DBT (Data Base Testuale);
- Modules and procedures for specific text recovery;
- XML TEI converter and parser for text corpus testing.
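A conversion step into XML TEI could look roughly like the following sketch, which wraps recovered paragraphs in a minimal TEI document using Python's standard `xml.etree.ElementTree`. The function name `to_tei`, its parameters and the placeholder publication statement are assumptions for illustration; this is not the ILC converter, and a real TEI model would carry far richer metadata and annotation.

```python
import xml.etree.ElementTree as ET
from typing import List

TEI_NS = "http://www.tei-c.org/ns/1.0"


def _q(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def to_tei(title: str, source_note: str, paragraphs: List[str]) -> str:
    """Build a minimal TEI document: a header plus one <p> per paragraph."""
    ET.register_namespace("", TEI_NS)
    root = ET.Element(_q("TEI"))

    # Minimal header: title, publication statement, source description.
    file_desc = ET.SubElement(ET.SubElement(root, _q("teiHeader")), _q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, _q("titleStmt"))
    ET.SubElement(title_stmt, _q("title")).text = title
    pub_stmt = ET.SubElement(file_desc, _q("publicationStmt"))
    ET.SubElement(pub_stmt, _q("p")).text = "Unpublished working copy"  # placeholder
    source_desc = ET.SubElement(file_desc, _q("sourceDesc"))
    ET.SubElement(source_desc, _q("p")).text = source_note

    # Body: one paragraph element per recovered paragraph.
    body = ET.SubElement(ET.SubElement(root, _q("text")), _q("body"))
    for para in paragraphs:
        ET.SubElement(body, _q("p")).text = para

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_tei("Sample text", "Recovered from magnetic tape", ["Arma virumque cano"]))
```

Parsing the result back with `ET.fromstring` gives a simple well-formedness check, in the spirit of testing a converted corpus with a parser.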

11 Scientific Cooperation Agreement between the ILC (Institute of Computational Linguistics) and the “Accademia della Crusca” of Florence:
- Selection of relevant text materials at ILC, followed by their specific classification;
- Identification of the text encoding and, where present, of the associated linguistic annotations;
- Conversion of texts into a shared and standardized representation format;
- Development of a text management system with advanced search functionalities.
12 projects & applications

12 Open questions. Old procedures, written in programming languages, for which processing printouts still exist: Can they be preserved or should they be dropped? Can they be considered a form of “Industrial Philology” and maintained?

