
1 Million Book Project @ Bibliotheca Alexandrina Youssef Eldakar 19 November 2006

2 Bibliotheca Alexandrina2 BA Digitization Workflow

4 Image Processing: Image to Better Image

5 Image Processing Sequence
ScanFix (automated processing)
• Deskew
• Despeckle
• Rotation
Adobe Photoshop (manual processing)
• Noise removal
• Black edge removal
• Page resize
• Center text to page
ScanFix (automated processing)
• Enhance text quality [Grow & Erode]
ACDSee (automated processing)
• Renaming files
• File compression (CCITT Group 4)

7 ScanFix
• Deskew (before / after)

8 ScanFix
• Despeckle (before / after)
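
As a rough illustration of what a despeckle pass does, the sketch below removes small isolated ink blobs from a binary page image. It is not ScanFix's algorithm, which the slides do not describe. The image representation (nested lists, 1 = ink, 0 = paper) and the `max_speck_size` threshold are assumptions made for this example.

```python
from collections import deque


def despeckle(image, max_speck_size=2):
    """Return a copy of a binary image with small ink blobs removed.

    image is a list of rows, 1 = ink and 0 = paper (an assumed layout).
    max_speck_size is a hypothetical threshold: any 8-connected blob with
    this many pixels or fewer is treated as a speck and erased.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one 8-connected blob starting at (y, x).
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the blob only if it is small enough to be a speck.
            if len(blob) <= max_speck_size:
                for by, bx in blob:
                    result[by][bx] = 0
    return result
```

A real page would need a threshold tuned to the scan resolution so that dots and diacritics, which matter in Arabic script, are not erased along with the noise.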

9 ScanFix
• Rotation

11 Photoshop
• Noise removal (before / after)

12 Photoshop
• Black edge removal (before / after)

13 Photoshop
• Page resize

14 Photoshop
• Center text to page (before / after)

16 ScanFix
• Enhance text quality: Grow, Erode (Horizontal / Vertical) (before / after)
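
Grow and Erode correspond to the standard morphological operations of dilation and erosion. The following is a minimal sketch of one-pixel versions restricted to the horizontal or vertical direction. The function names, the nested-list image layout (1 = ink, 0 = paper), and the treatment of pixels outside the page as paper are assumptions for illustration, not ScanFix's actual behaviour.

```python
def _neighbour_offsets(horizontal, vertical):
    """Offsets (dy, dx) of the pixel itself plus its chosen neighbours."""
    offsets = [(0, 0)]
    if horizontal:
        offsets += [(0, -1), (0, 1)]
    if vertical:
        offsets += [(-1, 0), (1, 0)]
    return offsets


def _ink_at(image, y, x):
    """True if (y, x) is ink; pixels outside the page count as paper."""
    return 0 <= y < len(image) and 0 <= x < len(image[0]) and image[y][x] == 1


def grow(image, horizontal=True, vertical=True):
    """Dilate: a pixel becomes ink if it or any chosen neighbour is ink."""
    offsets = _neighbour_offsets(horizontal, vertical)
    return [
        [1 if any(_ink_at(image, y + dy, x + dx) for dy, dx in offsets) else 0
         for x in range(len(row))]
        for y, row in enumerate(image)
    ]


def erode(image, horizontal=True, vertical=True):
    """Erode: a pixel stays ink only if it and all chosen neighbours are ink."""
    offsets = _neighbour_offsets(horizontal, vertical)
    return [
        [1 if all(_ink_at(image, y + dy, x + dx) for dy, dx in offsets) else 0
         for x in range(len(row))]
        for y, row in enumerate(image)
    ]
```

Applying `grow` and then `erode` in the same direction closes one-pixel gaps in broken strokes, which is one plausible way such a pair can make faint text more readable for OCR.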

18 ACDSee
• Renaming files

19 ACDSee
• Compression to TIFF (CCITT Group 4)

20 OCR: Image to Text

21 OCR - Arabic
• Poses unique challenges
  – Written cursively, with blocks of connected characters
  – A ‘block of characters’ can have more than one baseline
  – Uses external objects such as dots, 'Hamza', and 'Madda'
  – Diacritization
  – Characters can have more than one shape according to their position
  – Overlapping makes it difficult to determine the spacing
• Sakhr Automatic Reader is used
• Tricky with old books
• Requires learning

22 Arabic Script Is Cursive

23 Old, Smudgy, and Stuck Together

24 Use of Diacritics

25 Pre-OCR Text Enhancement
• Condition of Arabic printings varies
  – Old/new
  – Light/heavy
  – Solid/dot-matrix
• ScanFix’s smoothing and completion features improve recognition accuracy
• Separate from actual processing phase
  – Must be tested under OCR right away
  – OCR specialists have a better feel for “good text”

26 Text Repair in ScanFix

27 Font Libraries
• Improvement of Arabic OCR results through
  – Tweaking of OCR engine settings
  – Learning
• Libraries for different fonts have been built to achieve higher recognition rates
• Databases of character glyphs that describe a particular type of script and improve OCR accuracy
• Built on a carefully selected and classified high-variety set of scanned images belonging to a batch of about 1000 books that boiled down to 15 font groups

28 Font Classification
• Classification criteria:
  – Script type
    TA: Traditional Arabic
    AR: Arabic Transparent
    DT: DecoType Naskh and DecoType Naskh Extensions
  – Printing quality: High (H), Medium (M), and Low (L)
  – Font size: 1 (largest) to 5 (smallest)
• “Group X”: virtual font to tag unclassifiable printings and handwriting
• Minimum accuracy number assigned to each group based on testing results
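
The three classification criteria can be combined into a single group tag. The sketch below shows one way to do that. The tag layout (`TA-H-1`), the function name, and the use of a bare `X` for unclassifiable printings are assumptions for illustration. The slides give the codes but not how the project actually wrote them down.

```python
# Codes taken from the slide; the tag format built from them is hypothetical.
SCRIPT_TYPES = {
    "TA": "Traditional Arabic",
    "AR": "Arabic Transparent",
    "DT": "DecoType Naskh",
}
PRINT_QUALITIES = {"H": "High", "M": "Medium", "L": "Low"}
FONT_SIZES = range(1, 6)  # 1 (largest) to 5 (smallest)


def font_group_tag(script, quality, size):
    """Return a tag such as 'TA-H-1', or 'X' if any criterion is unknown.

    'X' stands in for the "Group X" virtual font used for unclassifiable
    printings and handwriting.
    """
    if (script not in SCRIPT_TYPES
            or quality not in PRINT_QUALITIES
            or size not in FONT_SIZES):
        return "X"
    return f"{script}-{quality}-{size}"
```

For example, `font_group_tag("TA", "H", 1)` describes large, high-quality Traditional Arabic print, while a page whose script cannot be identified falls back to `X`.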

29 16 Font Groups

30 Learning
• Train the engine on two representative pages of the book to build upon an initial font file picked from a set of pre-built font libraries
• Use a different page to manually calculate OCR accuracy before and after learning
• Batch OCR book using learned font file and save to ART
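
The slides say accuracy was calculated manually and do not give the formula. A common way to measure it, assumed here, is character accuracy: one minus the edit distance between the hand-checked reference text and the OCR output, divided by the reference length. The function names below are hypothetical.

```python
def edit_distance(reference, recognized):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from OCR output
                current[j - 1] + 1,      # extra character in OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, recognized):
    """Fraction of reference characters recognized correctly (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))
```

Running `character_accuracy` on the same held-out page before and after learning gives two comparable numbers, which is the comparison the second bullet describes.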

31 Learning in Sakhr’s Automatic Reader

32 VERUS from NovoDynamics
• Preliminary evaluation on two data sets is promising
  – Challenge: difficult to OCR, degraded images
  – Normal: known to return acceptable accuracy
• No learning capabilities; no human operators
• VERUS uses an XML format to store recognition data
• BA and NovoDynamics entered into a research agreement

33 Evaluation of VERUS and AR

34 Encoding: Image on Text

35 Challenges in Publishing
• Preservation of layout
• Searchability of content and metadata
• Efficient image compression
• Easy browsing of books
• Accommodating low-bandwidth users
• Multilingual text support
• Multipaging

36 Image-on-Text
• Multilayered:
  – Visible page image
  – Hidden OCR text
• View exact original layout while searching and highlighting
• Supported with some OCR suites only
• Supported formats: DjVu and PDF
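
The two-layer idea can be sketched as a small data structure: the page image is what the reader sees, and each hidden OCR word carries the rectangle it occupies on that image, so a search hit can be highlighted in place. The class and field names are hypothetical, and real DjVu and PDF files store this information in their own formats.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass
class Word:
    """One recognized word in the hidden text layer."""
    text: str
    box: Box


@dataclass
class Page:
    """A page as a visible image plus a hidden layer of positioned words."""
    image_path: str
    words: List[Word] = field(default_factory=list)

    def find(self, query: str) -> List[Box]:
        """Return the boxes of all hidden-layer words containing the query."""
        return [word.box for word in self.words if query in word.text]
```

A viewer would draw the image unchanged and overlay a highlight on each box returned by `find`, which is how the original layout is preserved while the text stays searchable.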

37 UDBE
• Universal Digital Book Encoder
• A framework for integrating many OCR engines and supporting many target formats into a system for encoding image-on-text documents for publishing
• Made possible through the use of a Common OCR Format (COF)

38 UDBE
• Built around a Common OCR Format (COF)
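
The benefit of a common intermediate format is that each OCR engine needs only one reader and each target format only one writer, rather than one converter per engine-format pair. The sketch below illustrates that pattern. Everything in it is hypothetical, including the in-memory stand-in for COF, the `toy-engine` input syntax, and the registry functions. The slides do not describe COF's actual structure or UDBE's API.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for COF: a list of pages, each a list of word records.
CommonOcr = List[List[dict]]

READERS: Dict[str, Callable[[str], CommonOcr]] = {}
WRITERS: Dict[str, Callable[[CommonOcr], str]] = {}


def register(registry, name):
    """Decorator that files a converter under the given name."""
    def decorator(func):
        registry[name] = func
        return func
    return decorator


@register(READERS, "toy-engine")
def read_toy_engine(raw: str) -> CommonOcr:
    """Parse 'text,left,top,right,bottom' lines; a blank line ends a page."""
    pages = []
    for block in raw.strip().split("\n\n"):
        words = []
        for line in block.splitlines():
            text, left, top, right, bottom = line.split(",")
            words.append({"text": text,
                          "box": (int(left), int(top), int(right), int(bottom))})
        pages.append(words)
    return pages


@register(WRITERS, "text")
def write_text(document: CommonOcr) -> str:
    """Emit plain text, one paragraph per page."""
    return "\n\n".join(" ".join(word["text"] for word in page)
                       for page in document)


def encode(raw: str, engine: str, target: str) -> str:
    """Convert engine output to a target format via the common format."""
    return WRITERS[target](READERS[engine](raw))
```

Supporting a new engine or a new output format then means registering one more function, without touching the existing converters.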

39 Performance – Arabic B&W

40 Performance – Latin B&W

41 Quality Assurance

42 Q/A - Common Errors
• No missing cover or pages
• All pages are in order
• Text quality
• Image quality
• Page quality
• PDF quality

45 Q/A - Common Errors
• No missing cover or pages
• All pages are in order
• Text quality
  – Pale text
  – Toothed text
  – Curved text
• Image quality
• Page quality
• PDF quality

47 Q/A - Common Errors
• No missing cover or pages
• All pages are in order
• Text quality
• Image quality
• Page quality
  – Cut pages
  – Fingers
  – Noise and page edges
  – Page size
  – Skew
• PDF quality

48 Q/A - Common Errors
• No missing cover or pages
• All pages are in order
• Text quality
• Image quality
• Page quality
• PDF quality
  – Image on text
  – Searching hits

49 DAR: Digital Assets Repository

50 System Architecture

51 DAK - Metadata
• Descriptive Metadata
• Administrative Metadata
• Technical Metadata

52 DAK Publishing Module
• Providing access to the repository content through search and browse facilities
• Multilingual full-text search

53 DAK Publishing Module
• Functionalities
  – Browse the repository contents by Collection, Subject, Creator, and Title
  – Search content by an indexed metadata field
  – Multilingual full-text search using both exact and morphological matching

54 DAK Publishing Module
• Functionalities (cont’d)
  – Display brief record information
  – Display full record information with links to digital objects
  – Display MARC and DC format

59 Show notes

61 DAR: Future Work
• Consider MODS and METS standards in the new system data model
• Enhance the functionalities of the Books Viewer with more security and copyright management
• Join the Open Source community by building DAR modules with open source technologies and languages
• Provide support for the currently available digital library interoperability protocols

62 Books from India: Towards Better Collaboration

63 Books From India
Language        | Number of Books
Arabic          | 832
Arabic + French | 3
Arabic + German | 1
Persian         | 101
French          | 2
English         | 1
Spanish         | 1
German          | 1
Total           | 942

64 Progress
Phase Name | Done as of November 1, 2006 | Expected to finish by | Comments
Cataloging | 801                         | -                     | 35 have metadata problems
Processing | 742                         | November 20, 2006     |
OCRing     | 200                         | March 1, 2007         |
Encoding   | 171                         | -                     | -
Publishing | 171                         | -                     | -
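
The per-phase counts can be expressed as a share of the batch. The sketch below assumes the 942-book total from the Books From India slide is the right denominator for every phase, which the slides do not state explicitly. The variable and function names are hypothetical.

```python
# Total taken from the "Books From India" slide; using it as the denominator
# for every phase is an assumption.
TOTAL_BOOKS = 942

# Books done per phase as of November 1, 2006, from the Progress slide.
DONE = {
    "Cataloging": 801,
    "Processing": 742,
    "OCRing": 200,
    "Encoding": 171,
    "Publishing": 171,
}


def percent_done(done, total=TOTAL_BOOKS):
    """Share of the batch completed, as a percentage with one decimal."""
    return round(100 * done / total, 1)


progress = {phase: percent_done(count) for phase, count in DONE.items()}
```

Read this way, the early phases were largely complete by the reporting date, while OCR, encoding, and publishing had covered roughly a fifth of the batch.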

65 Metadata Problems

66 Processing

67 OCR Using VERUS or AR?
• Calculated accuracy for a small sample
  – Images processed once with darkening effect and once without
  – VERUS likes darkening, AR does not
  – Overall, AR won 70% of cases
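
A figure such as "won 70% of cases" can be derived by comparing the two engines' accuracy on each sample and counting how often one comes out ahead. The sketch below shows that calculation. The function name and the rule that ties do not count as wins are assumptions, since the slides do not say how ties or the two darkening variants were handled.

```python
def win_rate(accuracy_pairs):
    """Fraction of cases in which the first engine beats the second.

    accuracy_pairs is a list of (first_engine_accuracy, second_engine_accuracy)
    tuples, one per test case. Ties are not counted as wins (an assumption).
    """
    if not accuracy_pairs:
        raise ValueError("at least one test case is required")
    wins = sum(1 for first, second in accuracy_pairs if first > second)
    return wins / len(accuracy_pairs)
```

To respect the darkening observation, each engine's score for a case could be taken from the image variant that suits it best before the pairs are compared.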

