Content Level Access to Digital Library of India Pages


1 Content Level Access to Digital Library of India Pages
10/29/11 Praveen Krishnan, Ravi Shekhar, C.V. Jawahar. CVIT, IIIT Hyderabad

2 Digital Library of India (DLI)
Vision: To enhance access to information and knowledge for the masses. Partner to the Million Book Universal Digital Library Programme. Information for people; dataset for researchers. Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi Pratha, C. V. Jawahar: The Digital Library of India Project: Process, Policies and Architecture, ICDL, 2007.

3 Digital Library of India (DLI)
Vision: To enhance access to information and knowledge for the masses. Content languages: 41 different languages, including Hindi, Telugu, Marathi, ... and English, French, Greek, ... Statistics: #Books: 4 lakh; #Pages: 134 million; #Words: 26 billion. Source:

4 Digital Library of India (DLI)
Meta data search Supports Meta data based search. No Content Level Access Indian freedom struggle and independence Search

5 Digital Library of India (DLI)
Need Content Level Access Content + Meta Data Indian freedom struggle and independence Search

6 Digital Library of India (DLI)
Reliable Text Representation ? Need Content Level Access Content + Meta Data Indian freedom struggle and independence Search

7 Digital Library of India Search
Goal Digital Library of India Search Build a search engine with support for Indian languages. Word Spotting

8 Indian Language Document Search Engine
Goal Indian Language Document Search Engine Text Query Support खोज Page 1

9 Indian Language Document Search Engine
Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Multi Keyword Support Page 1

10 Indian Language Document Search Engine
Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Ranks based on # Occurrences Page 1

11 Indian Language Document Search Engine
Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Semantically Related Words Page 1

12 Seamless scaling to billions of word images.
Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Seamless scaling to billions of word images. Sub second retrieval Page 1

13 Text from OCR Hindi Page Telugu Page
- Hindi: Title - Praachin Bhaartiy Vichaar Aur Vibhutiyaan, Published in 1624 - Telugu: Title - Andhra Vagmayaramba Dasha, Published in 1960

14 Text from OCR Hindi Page Telugu Page Cuts Cuts

15 Text from OCR Hindi Page Telugu Page Merges Cuts

16 Variations in Script, Font and Typesetting.
Text from OCR Hindi Page Telugu Page Variations in Script, Font and Typesetting. Cuts

17 Text from OCR Char % Hindi Telugu
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.

18 Text from OCR Word % Hindi Telugu
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.

19 Text from OCR Search % Hindi Telugu

20 BoVW for Image Retrieval
Text Retrieval Image Recognition Query Image Ranked Retrieved Results Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003

21 BoVW for Image Retrieval
Fixed Length Representation Invariant to popular deformation Query Image Ranked Retrieved Results Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003

22 BoVW for Document Image Retrieval
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.

23 BoVW for Document Image Retrieval
Histogram of Visual Words R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
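
The histogram-of-visual-words idea can be sketched in a few lines. This is only an illustration under stated assumptions: the toy 2-D descriptors, the three-entry `vocabulary`, and the function names are hypothetical, and cosine similarity is just one common way to compare histograms, not necessarily the one used in the cited work.

```python
from collections import Counter
import math

def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary centre closest to the descriptor (squared Euclidean distance)."""
    return min(range(len(vocabulary)),
               key=lambda i: sum((d - c) ** 2 for d, c in zip(descriptor, vocabulary[i])))

def bovw_histogram(descriptors, vocabulary):
    """L1-normalised histogram of visual-word counts for one word image."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    total = sum(counts.values())
    return [counts[i] / total if total else 0.0 for i in range(len(vocabulary))]

def cosine(h1, h2):
    """Cosine similarity between two histograms (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0

# Toy 2-D "descriptors" (assumption); real SIFT descriptors are 128-dimensional.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
word_image = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
hist = bovw_histogram(word_image, vocabulary)
```

Because every word image maps to a vector of the same length regardless of its width, the histograms can be indexed and compared like term vectors in text retrieval.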

24 BoVW for Document Image Retrieval
Cuts R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.

25 BoVW for Document Image Retrieval
Cuts Histogram of Visual Words R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.

26 BoVW for Document Image Retrieval
Merges R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.

27 BoVW for Document Image Retrieval
Merges Histogram of Visual Words R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.

28 BoVW for Document Image Retrieval
Robust against degradation. Lost geometry: use spatial verification (SIFT based; longest subsequence alignment). [Figure: visual words V1, V2, V6, V4, V8, V9 on x-y axes; panels: Merge, Clean, Cuts] R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012. I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012.
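
One way to picture the longest-subsequence alignment is to order the visual words of each word image from left to right and measure how much of the query's order survives in a candidate. The sketch below assumes keypoints are given as hypothetical `(x, word_id)` pairs and uses the standard longest common subsequence; the exact verification step in the cited papers may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def word_sequence(keypoints):
    """Visual-word ids ordered left to right; keypoints are (x, word_id) pairs."""
    return [word for _, word in sorted(keypoints)]

def spatial_score(query_kps, cand_kps):
    """Fraction of the query's visual-word order that is preserved in the candidate."""
    q, c = word_sequence(query_kps), word_sequence(cand_kps)
    return lcs_length(q, c) / max(len(q), 1)

# Hypothetical keypoints using the labels from the slide.
query = [(0.5, "V1"), (1.0, "V2"), (1.5, "V6"), (2.0, "V4"), (2.5, "V8"), (3.0, "V9")]
cut = [(0.5, "V1"), (1.0, "V2"), (2.0, "V4"), (2.5, "V8"), (3.0, "V9")]      # one word lost
shuffled = [(0.5, "V9"), (1.0, "V8"), (1.5, "V4"), (2.0, "V6"), (2.5, "V2"), (3.0, "V1")]
```

A candidate that lost one visual word to a cut still scores high, while a candidate with the same words in a different order scores low, which is the geometric information a plain histogram discards.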

29 Query Expansion Querying Database Query Image Query Image Histogram
Rank 1 Rank 2 Rank 3 Rank 4 Rank 5 Rank 6 Refined Histogram

30 Query Expansion Better Results Querying Database Query Image
Query Histogram Rank 1 Rank 2 Rank 3 Rank 4 Rank 5 Rank 6 Better Results
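
A minimal sketch of query expansion follows, assuming the refined histogram is the average of the query histogram and the histograms of the top-ranked results. The averaging rule, `top_k`, and the toy database are assumptions made for illustration; the slides do not state how the refined histogram is formed.

```python
import math

def cosine(h1, h2):
    """Cosine similarity between two histograms (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0

def rank(query_hist, database):
    """Database ids sorted by decreasing similarity; database maps id -> histogram."""
    return sorted(database, key=lambda k: cosine(query_hist, database[k]), reverse=True)

def expand_query(query_hist, database, top_k=3):
    """Average the query histogram with the histograms of its top_k results."""
    top = rank(query_hist, database)[:top_k]
    hists = [query_hist] + [database[k] for k in top]
    return [sum(column) / len(hists) for column in zip(*hists)]

# Hypothetical three-word vocabulary and database.
database = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.1, 0.0], "c": [0.0, 0.0, 1.0]}
query = [1.0, 0.0, 0.0]
refined = expand_query(query, database, top_k=2)
second_pass = rank(refined, database)  # re-query with the refined histogram
```

The second pass can pick up visual words that the original query image was missing, for example because of a cut, as long as the first few results are mostly correct.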

31 Text Query Support Originally formulated in a “query by example” setting. Input Query Image Histogram

32 Text Query Support Originally formulated in a “query by example” setting. Need Text Queries Input Text Query Text Query Histogram

33 Observations Are the results of OCR and BoVW complementary? BoVW OCR

34 Observations mAP vs. word length [Plot axes: mAP, No. of Characters]

35 Observations “OCR system has a high precision while BoVW approach has a high recall.” Example (#GT = 5): OCR output list: Precision = 1, Recall = 0.4. BoVW output list: Precision = 0.8, Recall = 1.
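
The numbers in this example can be reproduced with hypothetical result lists. The lists below are assumptions: two correct OCR hits give precision 1 and recall 0.4 exactly, and five hits in a BoVW list of six give recall 1 with precision of about 0.83, the closest whole-number list to the 0.8 shown on the slide (no integer list length gives exactly 0.8 when all five ground-truth items are retrieved).

```python
def precision_recall(retrieved, relevant):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    relevant = set(relevant)
    tp = sum(1 for item in retrieved if item in relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

ground_truth = ["w1", "w2", "w3", "w4", "w5"]      # #GT = 5
ocr_out = ["w1", "w2"]                             # hypothetical: few results, all correct
bovw_out = ["w1", "w2", "w3", "w4", "w5", "x1"]    # hypothetical: everything found, one false positive

ocr_p, ocr_r = precision_recall(ocr_out, ground_truth)
bovw_p, bovw_r = precision_recall(bovw_out, ground_truth)
```

The two lists fail in different ways, which is what makes combining them attractive.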

36 Fusion Fusion Techniques:- Naïve Fusion mAP Chart OCR

37 Fusion Fusion Techniques:- Naïve Fusion mAP Chart BoVW

38 Fusion Fusion Techniques:- Naïve Fusion
Concatenating OCR Results with BoVW mAP Chart OCR BoVW
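
Naïve fusion by concatenation might look like the sketch below. Placing the OCR results first and skipping BoVW results already present are assumptions, motivated by the observation that OCR has the higher precision; the slide only says the two lists are concatenated.

```python
def naive_fusion(ocr_results, bovw_results):
    """OCR results first, then BoVW results not already in the list, in BoVW order."""
    seen = set(ocr_results)
    fused = list(ocr_results)
    for item in bovw_results:
        if item not in seen:
            seen.add(item)
            fused.append(item)
    return fused

# Hypothetical result lists of word-image ids.
fused = naive_fusion(["w1", "w2"], ["w3", "w1", "w4", "w2", "w5"])
```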

39 Fusion Fusion Techniques:- Edit Distance Based Fusion mAP Chart OCR
BoVW

40 Fusion Fusion Techniques:- Edit Distance Based Fusion
Reordering BoVW results using the BoVW score and a modified edit distance cost. [mAP chart: BoVW]
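
A rough sketch of edit-distance-based reordering is given below. The slides do not define the modified edit distance cost or how it is combined with the BoVW score, so this uses the plain Levenshtein distance between the query text and each candidate's OCR text, a length normalisation, and a hypothetical weight `alpha`; all three are assumptions.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def rerank(query_text, candidates, alpha=0.5):
    """Reorder BoVW candidates; each candidate is (image_id, bovw_score, ocr_text)."""
    def combined(candidate):
        _, score, text = candidate
        cost = edit_distance(query_text, text) / max(len(query_text), len(text), 1)
        return score - alpha * cost  # penalise candidates whose OCR text is far from the query
    return [c[0] for c in sorted(candidates, key=combined, reverse=True)]

# Hypothetical candidates with transliterated OCR text.
candidates = [("img1", 0.80, "rana"), ("img2", 0.75, "raja"), ("img3", 0.70, "xyzw")]
order = rerank("raja", candidates)
```

Here `img2` moves ahead of `img1` because its OCR text matches the query exactly, even though its BoVW score is slightly lower.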

42 Fusion Fusion Techniques:- Edit Distance Based Fusion mAP Chart OCR
BoVW

43 Fusion Fusion Techniques:- Hybrid Fusion mAP Chart OCR BoVW

44 Fusion Fusion Techniques:- Hybrid Fusion
Re-querying BoVW using OCR retrieved results, using rank aggregation techniques. [mAP chart: BoVW]
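
The hybrid scheme can be sketched as: take the word images OCR retrieved, use each one as a BoVW query, and merge the resulting ranked lists. The slides do not name the rank aggregation technique, so Borda count is used here purely as an assumed example, and `bovw_search` is a hypothetical callable standing in for the BoVW index.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists):
    """Borda count: an item at position p in a list of length n earns n - p points."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest total first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))

def hybrid_fusion(ocr_results, bovw_search):
    """Re-query BoVW with every OCR-retrieved image and aggregate all ranked lists."""
    ranked_lists = [list(ocr_results)] + [bovw_search(image_id) for image_id in ocr_results]
    return borda_aggregate(ranked_lists)

# Hypothetical stand-in for the BoVW index: image id -> ranked neighbours.
fake_index = {"w1": ["w1", "w2", "w3"], "w2": ["w2", "w1", "w4"]}
fused = hybrid_fusion(["w1", "w2"], lambda image_id: fake_index[image_id])
```

Seeding BoVW with OCR hits starts from high-precision examples and lets the visual matching recover items the OCR missed.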

46 Fusion Fusion Techniques:- Hybrid Fusion mAP Chart OCR BoVW

47 Experimental Results

48 Experimental Details
OCR: [1]. Feature Detector: Harris interest point detection [2]. Feature Descriptor: SIFT [2]. Indexing: Lucene [3]. [1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011. [2] [3]

49 Test Bed Sample Word Images DLI Corpus
DLI Corpus
Language      #Books  #Pages  #Words     #Annotation
Hindi (HS1)   11      1000    362,593    Yes
Hindi (HS2)   52      10,196  4,290,864  No
Telugu (TS1)                  161,276
Telugu (TS2)  69      13,871  2,531,069
In addition, we used the fully annotated HP1 and TP1 datasets.

50 Evaluation Measures
Precision = TP / (TP + FP). Recall = TP / (TP + FN). TP = True Positive, FP = False Positive, FN = False Negative. mAP (Mean Average Precision): mean of the area under the precision-recall curve over all the queries. Prec@10: shows how accurate the top 10 retrieved results are. [Figure: Precision-Recall Curve]
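
These measures can be computed from ranked result lists as sketched below. Average precision is implemented in its usual discrete form (precision averaged at the rank of each relevant item), which approximates the area under the precision-recall curve; the function names and toy lists are assumptions.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values measured at the rank of each relevant item."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranked_results, relevant_items) pairs, one per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    relevant = set(relevant)
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Two hypothetical queries.
runs = [(["a", "x", "b"], ["a", "b"]), (["x", "a"], ["a"])]
score = mean_average_precision(runs)
```

mAP rewards placing every relevant item early in the list, while precision over the top 10 only looks at what a user would see on the first page of results.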

51 Comparison of naïve BoVW with BoVW + Query Expansion
Language      #Query  BoVW Search       BoVW + Query Expansion
                      mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100     62.54   81.30     66.09   83.86
Telugu (TP1)          71.13   78        73.08   79.89

52 Comparison of naïve BoVW with BoVW + Text Query Support
Language      #Query  BoVW Search       BoVW using Text Queries
                      mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100     62.54   81.30     56.32   73.89
Telugu (TP1)          71.13   78        69.06   78.83

53 Comparative performance of different fusion
Language      #Query  Naïve             Edit Distance     Hybrid
                      mAP     Prec@10   mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100     75.66   90.7      79.58   90.8      80.37   91.4
Telugu (TP1)          76.02   81.2      78.01   81.4      80.23   83.7
Comparative performance of different fusion techniques on HP1 & TP1

54 Performance statistics on DLI Annotated Corpus
Language      #Query  OCR               BoVW              Fusion
                      mAP     Prec@10   mAP     Prec@10   mAP     Prec@10
Hindi (HS1)   100     14.95   62.60     60.55   95.5      68.81   95.6
Telugu (TS1)          27.03   62.10     74.38   90.6      78.41   91.9

55 Performance statistics on DLI Un-Annotated Corpus
Language #Query N OCR BoVW Fusion Hindi (HS2) 50 82.03 96.94 97.11 75.16 94.83 95.42 71.12 92.82 93.16 Telugu (TS2) 90.85 99.14 85.42 98.00 98.85 80.76 96.38 96.57 Performance statistics on DLI Un-Annotated Corpus

56 Retrieved Results

57 Retrieved Results

58 Failure Cases The word images shown in the figure fail in both OCR and BoVW. Reasons: (a) A short word image containing a character that is no longer in use. (b) A highly degraded word image.

59 Implementation Details
Search Engine Development: An elegant web-based search and retrieval interface. Lucene. Scalability. [Plot axes: Time in milliseconds, No. of Images, No. of Visual Words] [Figure: Sample Retrieved Page]

60 Search Architecture (Ongoing)
[Diagram components: Search Query, Web Service, Index Delegator, OCR, BoVW, Partial Scores, Fusion, Query Expansion, Ranking, Ranked Results]

61 Ongoing Work Learn to improve from annotated dataset
Use of a visual confusion matrix to improve BoVW results from annotated datasets. Necessity of costly features for re-ranking: the images shown in the failure cases would require costly features to be retrieved. Use of machine learning algorithms. Exploration of features better than SIFT.

62 Thank You

