From pixels and minds to the mathematical knowledge in digital library Petr Sojka, Masaryk University, Brno Jiří Rákosník, Institute of Mathematics AS.


1 From pixels and minds to the mathematical knowledge in digital library Petr Sojka, Masaryk University, Brno Jiří Rákosník, Institute of Mathematics AS CR, Praha

2 Motivation for DML
• the number of new papers is growing faster and faster
• Zentralblatt MATH:
  • items indexed
  • items added in 2008
• MathSciNet:
  • items indexed
  • new items yearly

3 Motivation for DML
• Maths relies more than other sciences on past literature
• 50 % of current references point to literature 15 years old
• 25 % point 25 years back
[Chart: Number of references in the Collection of Computer Science Bibliographies]

4 Publish or perish
• “If [in 2600] you stacked all the new books being published next to each other, you would have to move at ninety miles an hour just to keep up with the end of the line. Of course, by 2600 new artistic and scientific work will come in electronic forms, rather than as physical books and paper. Nevertheless, if the exponential growth continued, there would be ten papers a second in my kind of theoretical physics, and no time to read them.” (Stephen Hawking)

5 Motivation for DML-CZ
• NUMDAM: Numérisation de documents anciens mathématiques
• ERAM: The Jahrbuch Project – Electronic Research Archive for Mathematics (1868–1942): “Jahrbuch über die Fortschritte der Mathematik”
• JSTOR: archives of over one thousand academic journals across the humanities, social sciences, and sciences, as well as select monographs
• EMANI: electronic mathematical archiving network (Cornell, SUB Göttingen, MathDoc, Tsinghua University Library)
• RusDML: Russian DML (pages of papers in journals covered by Zentralblatt MATH)
• …
• DML-CZ: Digital Mathematical Library of mathematical literature published in the Czech Republic and Slovakia

6 The occasion
• R&D programme Information Society funded by the Academy of Sciences
• project DML-CZ: Czech Digital Mathematics Library, 2005–2009

7 Partners
• Institute of Mathematics AS CR, Praha (J. Rákosník) – coordinator, material selection, copyright, mathematical supervision
• Institute of Computer Science, Masaryk University, Brno (M. Bartošek, P. Kovář, M. Šárfy, V. Krejčíř) – content management system, metadata QA, long-term archiving
• Faculty of Informatics, Masaryk University, Brno (P. Sojka) – formats and tools, technical coordination, information retrieval, indexing
• Faculty of Mathematics and Physics, Charles University, Praha (O. Ulrych, J. Veselý) – harvesting and adjusting metadata
• Library AS CR, Praha (M. Lhoták, M. Duda, A. Ryšánková) – document scanning, adjustment and OCR in the Digitization Centre Jenštejn

8 The aim
• journals for mathematical research and education, including Mathematica Slovaca
• conference proceedings
• monographs, textbooks
• altogether about pages

9 Journals
Title | retro (scan) | retro-born
• Czechoslovak Mathematical Journal
• Aplikace Matematiky / Applications of Mathematics
• Archivum Mathematicum, Brno
• Commentationes Mathematicae Universitatis Carolinae
• Kybernetika
• Časopis pro pěstování matematiky a fysiky
• Časopis pro pěstování matematiky
• Mathematica Bohemica
• Acta Univ. Palackianae Olomucensis. Mathematica
• Acta Mathematica et Informatica Univ. Ostraviensis
• Acta Mathematica Univ. Ostraviensis
• Mathematica Slovaca
• Matematika-Fyzika-Informatika
• Pokroky matematiky, fyziky a astronomie

10 Journals - pilot part launched on 11th June 2008
Title | retro (scan) | retro-born
• Czechoslovak Mathematical Journal
• Aplikace Matematiky / Applications of Mathematics
• Archivum Mathematicum, Brno
• Commentationes Mathematicae Universitatis Carolinae
• Kybernetika
• Časopis pro pěstování matematiky a fysiky
• Časopis pro pěstování matematiky
• Mathematica Bohemica
• Acta Univ. Palackianae Olomucensis. Mathematica
• Acta Mathematica et Informatica Univ. Ostraviensis
• Acta Mathematica Univ. Ostraviensis
• Mathematica Slovaca
• Matematika-Fyzika-Informatika
• Pokroky matematiky, fyziky a astronomie

11 Workflow overview

12 Preparation
• selection of titles – quality of content, historical value
• preparation – acquisition of documents for scanning, content survey
• copyright – negotiation with publishers or authors

13 Scanning
• parameters – 600 dpi, 4-bit depth
• scanning facilities – Digibook RGB 10000 (A1 color book scanner) and two Zeutschel OS 7000 book scanners (A2, B/W)
• software – BookRestorer to make the scanned pages uniform (white space around text body, …)
• Sirius system for archival storage of scans (put on CDs as TIFFs)
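The scanning parameters translate directly into storage needs. The sketch below works out the raw size of one page at 600 dpi and 4-bit depth. The A4 page size is an assumption (the slides give no page dimensions), the function name is illustrative, and TIFF headers, row padding and compression are ignored.

```python
import math


def uncompressed_scan_bytes(width_in, height_in, dpi=600, bits_per_pixel=4):
    """Return the raw (uncompressed) size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return math.ceil(total_bits / 8)


# Assumption: an A4 page (210 mm x 297 mm); the slides do not state page sizes.
A4_WIDTH_IN = 210 / 25.4
A4_HEIGHT_IN = 297 / 25.4

size = uncompressed_scan_bytes(A4_WIDTH_IN, A4_HEIGHT_IN)
print(f"{size / 1e6:.1f} MB per uncompressed A4 page")
```

Under these assumptions a single page takes roughly 17 MB before compression, which helps explain why compression and archival storage get their own workflow steps.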

14 Optical Character Recognition
• text OCR by two-phase DML-OCR implemented with ABBYY FineReader SDK 8.1
• errors in maths reading → methods for separation of text OCR and mathematics OCR
• maths: Infty system (Suzuki et al., Japan)
  • layout analysis
  • character recognition
  • structure analysis of math. expressions
  • manual error correction
• multilayer PDF with several OCR layers (text, math in TeX, math in MathML or OMDoc)
• 99 %+ accuracy for text, 96 %+ for mathematics
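The slides do not say how the accuracy figures were measured. One common convention, assumed here, is character accuracy: one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. A minimal sketch with illustrative function names:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered, clamped to the range [0, 1]."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical strings: one wrong character in a four-character reference.
print(character_accuracy("abcd", "abxd"))
```

The same measure can be applied per layer, for example to the text layer and to the TeX rendering of the mathematics separately.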

15 Metadata and Image Enhancement/Processing
• metadata standards – choice of standards (DC, MODS, METS are supported by DSpace)
• metadata acquisition – Zbl/MR, OCR tagging, (retyping)
• image enhancements – TIFF, PDF, jbig2 compression as a measure of quality
• semantic processing – document markup enhancement, document classification, citation linking, document clustering, indexing
• References and fulltexts are metadata as well; English titles and MSC are mandatory. OAI-PMH export.
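As an illustration of the simplest of the listed standards, the sketch below builds a Dublin Core record with Python's standard library and checks the two mandatory items. Mapping the English title to dc:title and the MSC code to dc:subject is an assumption, the function names are illustrative, and the record values are hypothetical placeholders, not DML-CZ data.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Assumption: English title -> dc:title, MSC code -> dc:subject.
MANDATORY = ("title", "subject")


def missing_mandatory(fields):
    """Return the mandatory element names that are absent or empty."""
    return [name for name in MANDATORY if not fields.get(name)]


def build_dc_record(fields):
    """Serialize {element name: [values]} as an oai_dc XML string."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, values in fields.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record used only to exercise the functions.
record = {
    "title": ["An example article title"],
    "creator": ["Doe, Jane"],
    "subject": ["MSC 2000: 46E35"],
    "language": ["en"],
}
assert not missing_mandatory(record)
print(build_dc_record(record))
```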

16 Metadata Editor
• metadata creation & DL integration
• developed in Brno for DML-CZ
• web-based application
• web interface
• suite of scripts
• files in directories
• internal database

17 Storage, indexing
• space – multiple OCR, multiple attribute layers (lemmas, reviewer comments, semantic classifications, etc.); storing and indexing these for all mathematical literature so far poses no problem
• software
  • client/server architecture
  • Lucene indexing software (OSS)
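To make the idea of indexing several attribute layers concrete, here is a toy inverted index in plain Python. It is a stand-in for illustration only and does not use or imitate the Lucene API; the class name, layer names and sample documents are hypothetical.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Lowercase word tokens; a real system would also lemmatize."""
    return [token.lower() for token in TOKEN.findall(text)]


class LayeredIndex:
    """Maps (layer, term) to the set of document ids containing the term."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, layer, text):
        for term in tokenize(text):
            self.postings[(layer, term)].add(doc_id)

    def search(self, layer, query):
        """Return ids of documents whose layer contains every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get((layer, terms[0]), set()))
        for term in terms[1:]:
            result &= self.postings.get((layer, term), set())
        return result


index = LayeredIndex()
index.add("doc-a", "text", "Sobolev spaces and embeddings")
index.add("doc-b", "text", "Banach spaces")
index.add("doc-a", "review", "clear exposition")
print(index.search("text", "spaces"))
```

Keeping the layer in the key lets the same term be searched in the OCR text, in lemmas or in reviewer comments independently.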

18 Document Markup Enhancement Methods
• context-dependent mapping from visual to logical markup
• algorithms of language identification (bi-gram, tri-gram based, paragraph or even sentence level)
• document classification, metrics, ontology construction, comparison with AMS 2000 classification
• semiautomatic bibliography markup and metrics, global mathematics citation index, “MathRank”
• document clustering (for visualization, …), identification of near duplicates
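The n-gram language identification mentioned above can be sketched as follows: build a character trigram profile per language and pick the profile closest to the sample by cosine similarity. This is a generic textbook variant, not necessarily the algorithm used in DML-CZ, and the two one-sentence training texts are hypothetical; a usable model needs far more text per language.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Count character n-grams of the whitespace-normalized, padded text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(p, q):
    """Cosine similarity of two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = (math.sqrt(sum(c * c for c in p.values()))
            * math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0


def identify_language(text, profiles, n=3):
    """Return the language whose profile is most similar to the text."""
    sample = ngram_profile(text, n)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Hypothetical, deliberately tiny training data.
TRAINING = {
    "en": "the theorem is proved in the next section of this paper",
    "cs": "věta je dokázána v následující části tohoto článku",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(identify_language("the proof of the theorem", PROFILES))
```

Because the profile is built per string, the same function works at paragraph or sentence level, at the cost of less reliable decisions on short inputs.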

19 Presentation
• delivery – customised digital library system DSpace (open source, created at MIT) for final article delivery and search; Manakin interface
• planned visualization techniques – “lost in hyperspace fear”, visualization of document clustering, Visual Browser (different user's eyes)

20 Delivery
• web portal – unique and persistent URLs: Digital Object Identifier DOI (PURL, URN? …)
• interfaces to other services – OAI-PMH harvesting, BibTeX export, Googlebot optimization
• indexing, search relevance – Lucene, customized for maths (Experiments with Manatee and EDBM-2 (Zbl, NUMDAM))?
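From a harvester's side, OAI-PMH access reduces to HTTP requests with a small set of standard arguments. The sketch below only builds ListRecords request URLs. The base URL is a placeholder, not the real DML-CZ endpoint, and the function name is illustrative.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    A resumption token is an exclusive argument in OAI-PMH, so when one is
    given, no other argument besides the verb is sent.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, used only for illustration.
BASE_URL = "https://example.org/oai/request"
print(list_records_url(BASE_URL, from_date="2008-06-11"))
```

A harvester would fetch the first URL, read the resumption token from the response if present, and repeat with that token until the list is complete.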

21 Further problems and questions
• paper classification
• automated MSC experiment
• automated MSC learning
• metadata from born-digital documents
• search
• OCR systems
• OCR XML postprocessing
• maths OCR

22 Possibilities

