Million Book Bibliotheca Alexandrina Noha Adly 20 November 2006.

1 Million Book Project @ Bibliotheca Alexandrina Noha Adly 20 November 2006

2 Bibliotheca Alexandrina

4 BA Digitization Workflow

5 Statistics - November 2006
                    Arabic       Latin        Total
Scanned    Books    22,023       4,646        26,669
           Pages    7,003,185    1,350,688    8,353,873
Processed  Books    21,947       4,642        26,589
           Pages    6,987,392    1,348,900    8,336,292
OCRed      Books    16,652       4,600        21,252
           Pages    5,248,337    1,327,385    6,575,722
Total Archived Data: 1,500 GB
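
As a rough cross-check of the figures on this slide, the Python sketch below recomputes the totals and derives two ratios the slide does not state: average pages per book and the share of scanned pages already OCRed. The dictionary layout and function names are illustrative assumptions. Only the numbers come from the slide.

```python
# Figures copied from the November 2006 statistics slide.
# The dictionary layout and key names are illustrative, not from the talk.
STATS = {
    "scanned": {
        "books": {"arabic": 22_023, "latin": 4_646},
        "pages": {"arabic": 7_003_185, "latin": 1_350_688},
    },
    "processed": {
        "books": {"arabic": 21_947, "latin": 4_642},
        "pages": {"arabic": 6_987_392, "latin": 1_348_900},
    },
    "ocred": {
        "books": {"arabic": 16_652, "latin": 4_600},
        "pages": {"arabic": 5_248_337, "latin": 1_327_385},
    },
}


def total(stage, unit):
    """Sum Arabic and Latin counts for one stage ('scanned', ...) and unit."""
    return sum(STATS[stage][unit].values())


def pages_per_book(stage, script):
    """Average number of pages per book for one stage and script."""
    return STATS[stage]["pages"][script] / STATS[stage]["books"][script]


def ocr_coverage(script):
    """Fraction of scanned pages that have been OCRed for one script."""
    return STATS["ocred"]["pages"][script] / STATS["scanned"]["pages"][script]


if __name__ == "__main__":
    for script in ("arabic", "latin"):
        print(
            f"{script}: {pages_per_book('scanned', script):.0f} pages/book, "
            f"{ocr_coverage(script):.1%} of scanned pages OCRed"
        )
```

The recomputed totals agree with the Total column on the slide.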

6 Statistics (Cont'd)
• Daily rates
  – Scan: ≈ 1800 pages/person
  – Process: ≈ 1800 pages/person
  – Latin OCR: ≈ 4000 pages/person
  – Arabic OCR: ≈ 1500 pages/person
• Five Minolta scanners
• 2 shifts, 7 days a week
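
To give a sense of scale, the daily rates can be combined with the page counts from the November 2006 statistics slide to estimate effort in person-days. This is only a back-of-the-envelope sketch: it assumes the approximate rates held constant over the whole project, which the slides do not claim, and the names below are illustrative.

```python
# Daily rates from the slide, in pages per person per day (approximate).
DAILY_RATE = {
    "scan": 1800,
    "process": 1800,
    "ocr_latin": 4000,
    "ocr_arabic": 1500,
}


def person_days(pages, pages_per_person_day):
    """Effort needed for `pages` at a constant per-person daily rate."""
    return pages / pages_per_person_day


# Page counts taken from the November 2006 statistics slide.
scan_effort = person_days(8_353_873, DAILY_RATE["scan"])
arabic_ocr_effort = person_days(5_248_337, DAILY_RATE["ocr_arabic"])
latin_ocr_effort = person_days(1_327_385, DAILY_RATE["ocr_latin"])

if __name__ == "__main__":
    print(f"Scanning:   {scan_effort:,.0f} person-days")
    print(f"Arabic OCR: {arabic_ocr_effort:,.0f} person-days")
    print(f"Latin OCR:  {latin_ocr_effort:,.0f} person-days")
```

Under these assumptions, Arabic OCR takes roughly ten times the effort of Latin OCR, because it covers about four times as many pages at well under half the rate.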

7 OCR Image to Text

8 OCR - Arabic
• Poses unique challenges
  – Written cursively, with blocks of connected characters
  – A ‘block of characters’ can have more than one baseline
  – Uses external objects such as dots, 'Hamza' and 'Madda'
  – Diacritization
  – Characters can have more than one shape according to their position
  – Overlapping makes it difficult to determine the spacing
• Sakhr Automatic Reader is used
• Tricky with old books
• Requires learning

9 Arabic Script Is Cursive

10 Old, Smudgy, and Stuck Together

11 Use of Diacritics

12 16 Font Groups

13 Evaluation of VERUS and AR
• Research agreement with NovoDynamics
• Preliminary evaluation on two data sets is promising
  – Challenge: difficult to OCR, degraded images
  – Normal: known to return acceptable accuracy

14 Encoding Image on Text

15 Image-on-Text
• Multilayered:
  – Visible page image
  – Hidden OCR text
• View exact original layout while searching and highlighting
• Supported with some OCR suites only
• Supported formats: DjVu and PDF
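
The two-layer idea on this slide can be illustrated with a small Python sketch: a page keeps the visible image plus a hidden list of OCR words with their positions, so a text search returns rectangles to highlight on the image. This is a conceptual model only. The class names, fields, file name, and coordinates are hypothetical and do not reflect the internal structure of DjVu or PDF files or of any OCR suite.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Word:
    """One OCR word and where it sits on the page image (hypothetical)."""
    text: str
    box: tuple  # (left, top, right, bottom) in image pixels


@dataclass
class Page:
    """A page with a visible image layer and a hidden OCR text layer."""
    image_path: str                            # visible layer: the scanned image
    words: list = field(default_factory=list)  # hidden layer: OCR words

    def find(self, query):
        """Return the boxes of all words containing `query`, ignoring case."""
        q = query.casefold()
        return [w.box for w in self.words if q in w.text.casefold()]


# Made-up example data for illustration.
page = Page(
    "page_0001.tif",
    [
        Word("Bibliotheca", (120, 80, 310, 112)),
        Word("Alexandrina", (322, 80, 520, 112)),
    ],
)
hits = page.find("alexandrina")  # boxes a viewer would highlight on the image
```

The reader only ever sees the image, and the text layer exists solely to make the image searchable, which is why the original layout is preserved exactly.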

16 Quality Assurance
• No missing cover or pages
• All pages are in order
• Text quality
• Image quality
• PDF quality

17 DAR Digital Assets Repository

18 System Architecture

19 DAK Publishing Module

20 DAK Publishing Module

21 DAK Publishing Module

22 DAK Publishing Module

24 Show notes

26 Transfer of Digitized Books
• Challenges
  – Storage: CD vs Online
  – Bandwidth: 10 Mbps vs 155 Mbps
  – Copyright: not published
• Actions:
  – Transferred 8,500+ books to the Internet Archive
  – Process is still going on
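
The bandwidth challenge can be made concrete with a short calculation. The Python sketch below estimates how long it would take to move the 1,500 GB archive reported on the statistics slide over each link. It assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second), a fully utilized link, and no protocol overhead. The slides do not say that the whole archive was sent, so this is an upper-bound illustration rather than a description of the actual transfer.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_gb, link_mbps):
    """Days needed to send `size_gb` over a `link_mbps` link at full rate.

    Assumes decimal units and ignores protocol overhead and contention.
    """
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / SECONDS_PER_DAY


slow = transfer_days(1500, 10)    # 10 Mbps link
fast = transfer_days(1500, 155)   # 155 Mbps link

if __name__ == "__main__":
    print(f"10 Mbps:  {slow:.1f} days")
    print(f"155 Mbps: {fast:.1f} days")
```

Under these idealized assumptions the 10 Mbps link needs about two weeks of continuous transfer, while the 155 Mbps link needs less than a day.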

27 Books From India Towards better collaboration

28 Books From India
Language           Number of Books
Arabic             832
Arabic + French    3
Arabic + German    1
Persian            101
French             2
English            1
Spanish            1
German             1
Total              942

29 Progress
Phase Name    Done as of November 1, 2006    Expected to be finished by    Comments
Cataloging    801                            -                             35 have metadata problems
Processing    742                            November 20, 2006
OCRing        200                            March 1, 2007
Encoding      171                            -                             -
Publishing    171                            -                             -

30 Metadata Problems

31 Processing

32 OCR Using VERUS or AR?
• Calculated accuracy for a small sample
  – Images processed once with darkening effect and once without
  – VERUS likes darkening, AR does not
  – Overall, AR won 70% of cases
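
The slide does not say how accuracy was calculated. One common way to score OCR output, shown here purely as an assumed illustration, is character accuracy based on edit distance against a hand-checked transcription, followed by a per-sample count of which engine scored higher. The function names and the sample strings are hypothetical and are not taken from the evaluation itself.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(truth, ocr):
    """1 - (edit distance / length of ground truth), clipped at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


def win_rate(truths, outputs_a, outputs_b):
    """Fraction of samples where engine A is strictly more accurate than B."""
    wins = sum(
        char_accuracy(t, a) > char_accuracy(t, b)
        for t, a, b in zip(truths, outputs_a, outputs_b)
    )
    return wins / len(truths)


# Made-up sample: one word, with one engine misreading a single letter.
truth = "كتاب"
engine_a_out = "كتاب"
engine_b_out = "كناب"
demo_rate = win_rate([truth], [engine_a_out], [engine_b_out])
```

A figure such as "won 70% of cases" would correspond to `win_rate` returning 0.7 over the sample, if the comparison was made per image in this way.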

42 Thank You

