Word Recognition of Indic Scripts


Naveen TS, CVIT, IIIT Hyderabad

Introduction
- India has 22 official languages and more than 100 languages in use.
- Many scripts have their own language-specific number system.
- The languages fall into two major groups: Indo-Aryan and Dravidian.

Optical Character Recognition

OCR Challenges
- Challenges due to text editors: different editors render the same symbol in different ways; multiple fonts.
- Poor/cheap printing technology: can cause degradations such as cuts and merges.
- Scanning quality.

Indian Language (IL) Script Complexity
- Matras and similar-looking characters.
- Samyuktakshars (conjunct characters).
- Unicode re-ordering.

Unicode Re-ordering
The visual order of symbols in a rendered word can differ from the stored Unicode order (e.g. a matra drawn before a consonant is stored after it), so recognized symbols must be re-ordered to produce the final output.

OCR Development Challenges
- Word -> symbol segmentation.
- Presence of cuts/merges.
- Development of a strong classifier.
- An efficient post-processor.
- Porting the technology to build an OCR for a new language.

Motivation for this Thesis
- Avoid the difficult word -> symbol segmentation.
- Automatically learn the latent symbol -> Unicode conversion.
- A common architecture for multiple languages.
- Understand post-processor development challenges for highly inflectional languages.

OCR Development

Recognition Architecture
- Word recognizer: large number of output classes, huge training-set size, minimal impact from degradations.
- Symbol recognizer: small number of output classes, moderate training-set size, serious impact from degradations.

Limitations of a Character Recognition System
- Difficult to obtain annotated training samples: extracting symbols from words is hard.
- Inability to utilize all available training data: it is extremely difficult to extract all symbols from 5000 pages and annotate them.
- Classifier output (characters) must be converted to the required output (words).
- Issues due to degradations (cuts/merges), etc.

Word Recognition System
Holistic recognition: word images, together with their word-text annotations, are used to train the word recognition system; its final output is passed to the evaluation system.

BLSTM Workflow
The feature sequence enters the input layer; hidden layers of LSTM cells process it with forward and backward passes, and a CTC output layer produces the word output.
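As a rough illustration of this workflow, the sketch below builds a bidirectional LSTM with a CTC-style output layer. PyTorch is an assumption (the slides do not name a toolkit), and the feature dimension and class count are placeholders; the actual configuration appears on a later slide.

```python
# Minimal sketch of a BLSTM + CTC recognizer (PyTorch assumed).
# Feature dimension, hidden size, and number of classes are illustrative only.
import torch
import torch.nn as nn

class BLSTMRecognizer(nn.Module):
    def __init__(self, num_features=14, hidden_size=100, num_classes=140):
        super().__init__()
        # Bidirectional LSTM: the feature sequence is processed in forward
        # and backward passes, so every frame sees both contexts.
        self.blstm = nn.LSTM(num_features, hidden_size,
                             num_layers=1, bidirectional=True)
        # Linear output layer; class 0 is reserved as the CTC "blank" label.
        self.output = nn.Linear(2 * hidden_size, num_classes + 1)

    def forward(self, x):
        # x: (sequence_length, batch, num_features)
        hidden, _ = self.blstm(x)
        # Per-frame log-probabilities, as expected by nn.CTCLoss.
        return self.output(hidden).log_softmax(dim=-1)
```

At decoding time the per-frame CTC outputs are collapsed (repeated labels merged, blanks removed) to give the final label sequence.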

Importance of Context
For a given feature, the BLSTM takes into account forward as well as backward context (small vs. larger context).

BLSTM for Devanagari
Motivation: no zoning, word-level recognition, and the ability to handle a large number of classes.
Naveen Sankaran and C. V. Jawahar, "Recognition of Printed Devanagari Text Using BLSTM Neural Network", International Conference on Pattern Recognition (ICPR), 2012.

BLSTM for Devanagari: Pipeline
Input image (अदालत) -> feature extraction -> BLSTM network -> output class labels (e.g. 35, 64, 55, 105) -> class label to Unicode conversion -> अदालत.

BLSTM Results
Trained on 90K words and tested on 67K words. Obtained more than 20% improvement in word error rate over the character-level Devanagari OCR [1].

Quality | Char. Error Rate             | Word Error Rate
        | Devanagari OCR [1] | Ours    | Devanagari OCR [1] | Ours
Good    | 7.63               | 5.65    | 17.88              | 8.62
Poor    | 20.11              | 15.13   | 43.15              | 22.15

[1] D. Arya et al., "Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts", ICDAR MOCR Workshop, 2011.

Qualitative Results

Limitations
- Symbol-to-Unicode conversion rules are required to generate the final output.
- Huge training time of about 2 weeks.

Recognition as Transcription
- The network learns how to "transcribe" input features to output labels.
- Target labels are Unicode.
- No symbol -> Unicode output mapping.
- Easily scalable to other languages.

Recognition Vs Transcription

Challenges
- Segmentation-free training and testing.
- Unicode (akshara) training and Unicode (akshara) testing.
Practical issues:
- Learning with memory (symbol ordering in Unicode).
- Large output label space.
- Scalability to large data sets.
- Efficiency in testing.

Training Time
Training time increases when:
- the number of output classes increases,
- the number of features decreases,
- the amount of training data increases.

Training at the Unicode Level
Training on Unicode labels largely reduces the number of classes and can reduce the time taken.

Language  | # Unicode | # Symbols
Malayalam | 163       | 215
Tamil     | 143       | 212
Telugu    | 138       | 359
Kannada   | 156       | 352

Features
- Each word image is split horizontally into two halves.
- 7 features are extracted from the top half and 7 from the bottom half.
- A sliding window of width 5 pixels is used.
- Binary features, and grey features: mean, variance, standard deviation.
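A hedged sketch of this kind of sliding-window feature extraction follows. Only the 5-pixel window, the top/bottom split, and the mean/variance/standard-deviation grey features come from the slide; the window stride, the ink threshold, and the binary profile features are illustrative assumptions, and the exact seven per-half features are not specified here.

```python
# Sketch of sliding-window feature extraction over a word image (NumPy assumed).
import numpy as np

def extract_features(word_image, window=5, stride=1):
    """word_image: 2-D grey-scale array (height x width), values in [0, 255]."""
    height, width = word_image.shape
    binary = (word_image < 128).astype(np.float32)   # assumed ink threshold
    frames = []
    for x in range(0, width - window + 1, stride):   # stride is an assumption
        frame = []
        for half in (slice(0, height // 2), slice(height // 2, height)):
            grey = word_image[half, x:x + window] / 255.0
            ink = binary[half, x:x + window]
            frame.extend([
                grey.mean(), grey.var(), grey.std(),        # grey features (from the slide)
                ink.mean(),                                  # ink density (assumed)
                ink.sum(axis=1).argmax() / max(ink.shape[0], 1),  # densest row (assumed)
            ])
        frames.append(frame)
    return np.array(frames, dtype=np.float32)        # (num_windows, num_features)
```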

Network Configuration
- Learning rate: 0.0009
- Momentum: 0.9
- Number of hidden layers: 1
- Number of nodes in the hidden layer: 100
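Wired together with the BLSTMRecognizer sketch from the workflow slide, this configuration might look roughly as follows. SGD with momentum and PyTorch's CTC loss are assumptions, and the class count (163, the Malayalam Unicode count from the earlier table) is only an example.

```python
# Sketch of a training setup using the parameters on this slide
# (learning rate 0.0009, momentum 0.9, one hidden layer of 100 LSTM nodes).
# Reuses the BLSTMRecognizer class from the earlier workflow sketch.
import torch
import torch.nn as nn

model = BLSTMRecognizer(num_features=14, hidden_size=100, num_classes=163)
optimizer = torch.optim.SGD(model.parameters(), lr=0.0009, momentum=0.9)
ctc_loss = nn.CTCLoss(blank=0)

def train_step(features, targets, input_lengths, target_lengths):
    """features: (T, batch, num_features); targets: concatenated label ids."""
    optimizer.zero_grad()
    log_probs = model(features)                      # (T, batch, classes + 1)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    optimizer.step()
    return loss.item()
```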

Final Network Architecture
Input features at each time step t -> input layer -> hidden layer -> output layer -> CTC layer -> Unicode output (e.g. अदालत).

Evaluation & Results

Dataset
Two datasets are used: the Annotated Multi-lingual Dataset (AMD) and the Annotated DLI Dataset (ADD), the latter built from 1000 Hindi pages from DLI.

Language  | No. of Books | No. of Pages
Hindi     | 33           | 5000
Malayalam | 31           |
Tamil     | 23           |
Kannada   | 27           |
Telugu    | 28           |
Gurumukhi | 32           |
Bangla    | 12           | 1700

Evaluation Measures
Character Error Rate (CER) and Word Error Rate (WER), reported as percentages.
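The formulas themselves did not survive in the transcript; assuming the usual definitions (CER from character-level edit distance, WER as the fraction of words not recognized exactly), a small sketch is:

```python
# Sketch of character/word error rates (assumed standard definitions).
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    dist = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dist[0] = dist[0], i
        for j, cb in enumerate(b, 1):
            prev, dist[j] = dist[j], min(dist[j] + 1,        # deletion
                                         dist[j - 1] + 1,    # insertion
                                         prev + (ca != cb))  # substitution
    return dist[-1]

def error_rates(predicted_words, ground_truth_words):
    char_errors = sum(edit_distance(p, g)
                      for p, g in zip(predicted_words, ground_truth_words))
    total_chars = sum(len(g) for g in ground_truth_words)
    word_errors = sum(p != g for p, g in zip(predicted_words, ground_truth_words))
    cer = 100.0 * char_errors / total_chars
    wer = 100.0 * word_errors / len(ground_truth_words)
    return cer, wer
```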

Quantitative Results

          | Character Error Rate (CER)                 | Word Error Rate (WER)
Language  | Our Method | Char OCR [1] | Tesseract [2]  | Our Method | Char OCR [1] | Tesseract [2]
Hindi     | 6.38       | 12.0         | 20.52          | 25.39      | 38.61        | 34.44
Malayalam | 2.75       | 5.16         | 46.71          | 10.11      | 23.72        | 94.62
Tamil     | 6.89       | 13.38        | 41.05          | 26.49      | 42.22        | 92.37
Telugu    | 5.68       | 24.26        | 39.48          | 16.27      | 71.34        | 76.15
Kannada   | 6.41       | 16.13        | -              | 23.83      | 48.63        | -
Bangla    | 6.71       | 5.24         | 53.02          | 21.68      | 24.19        | 84.86
Gurumukhi | 5.21       | 5.58         | -              | 13.65      | 25.72        | -

[1] D. Arya et al., "Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts", ICDAR MOCR Workshop, 2011.
[2] https://code.google.com/p/tesseract-ocr/

Qualitative Results

Performance with Degradation
Added synthetic degradation to words at three levels (Level 1, 2 and 3) and evaluated them.
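The slide does not say which degradation model was used, so the sketch below is only illustrative: salt-and-pepper noise whose strength grows with the (hypothetical) level.

```python
# Illustrative sketch only: the actual degradation model is not specified here,
# so this applies simple salt-and-pepper noise at three increasing levels.
import numpy as np

def degrade(word_image, level):
    """Flip a growing fraction of pixels for degradation levels 1..3."""
    flip_probability = {1: 0.02, 2: 0.05, 3: 0.10}[level]   # assumed levels
    noisy = word_image.copy()
    mask = np.random.rand(*noisy.shape) < flip_probability
    noisy[mask] = 255 - noisy[mask]                          # invert the chosen pixels
    return noisy
```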

Qualitative Results: Unicode Rearranging

Error Detection for Indian Languages

Error Detection: Why is it hard?
- Indian languages are highly inflectional.
- Unicode vs. akshara representations differ.
- Words can be joined to form another valid new word.

Development Challenges
Availability of a large corpus; percentage of unique words.

Language  | Total Words | Unique Words       | Average Word Length
Hindi     | 4,626,594   | 296,656 (6.42%)    | 3.71
Malayalam | 3,057,972   | 912,109 (29.83%)   | 7.02
Kannada   | 2,766,191   | 654,799 (23.67%)   | 6.45
Tamil     | 3,763,587   | 775,182 (20.60%)   | 6.41
Telugu    | 4,365,122   | 1,117,972 (25.62%) | 6.36
English   | 5,031,284   | 247,873 (4.93%)    | 4.66

Development Challenges # Unique words in Indian Languages

Development Challenges: Word Coverage
Number of unique words needed to cover a given percentage of the corpus.

Corpus % | Malayalam | Tamil   | Kannada | Telugu  | Hindi | English
10       | 71        | 95      | 53      | 103     | 7     | 8
20       | 491       | 479     | 347     | 556     | 23    | 38
30       | 1969      | 1541    | 1273    | 2023    | 58    | 100
40       | 6061      | 4037    | 3593    | 5748    | 159   | 223
50       | 16,555    | 9680    | 8974    | 14,912  | 392   | 449
60       | 43,279    | 22,641  | 21,599  | 38,314  | 963   | 988
70       | 114,121   | 54,373  | 53,868  | 101,110 | 2395  | 2573
80       | 300,515   | 140,164 | 144,424 | 271,474 | 6616  | 8711
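Coverage figures of this kind can be reproduced by sorting unique words by frequency and counting how many are needed to reach each target share of the running text. A sketch, assuming a plain whitespace-tokenised corpus, follows.

```python
# Sketch: how many of the most frequent unique words are needed to cover a
# given percentage of the running words in a corpus (whitespace tokens assumed).
from collections import Counter

def words_needed_for_coverage(corpus_tokens,
                              percentages=(10, 20, 30, 40, 50, 60, 70, 80)):
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    needed, covered, result = 0, 0, {}
    targets = sorted(percentages)
    for _, freq in counts.most_common():
        covered += freq
        needed += 1
        # Record every coverage threshold crossed by this word.
        while targets and covered / total * 100 >= targets[0]:
            result[targets.pop(0)] = needed
        if not targets:
            break
    return result   # e.g. {10: 7, 20: 23, ...} for a Hindi-like corpus
```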

Error Models for IL OCR
Two types of errors are generated by the OCR.
Non-word errors: presence of impossible symbol sequences in a word, caused by recognition issues, symbol -> Unicode mapping issues, etc.

Error Models for IL OCR
Real-word errors: caused when one valid symbol is recognized as another valid symbol, mainly due to confusion among similar symbols.

Error Models for IL OCR
Percentage of words which get converted to another valid word for a given Hamming distance.
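One way to estimate this, sketched below, is a brute-force pass over a symbol-level (akshara) word list, counting words that have another valid word within the given Hamming distance; the word list itself is an assumed input.

```python
# Sketch: fraction of dictionary words that turn into another valid word when
# at most `max_distance` symbols are substituted (words given as symbol sequences).
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def real_word_error_rate(words, max_distance=1):
    words = [tuple(w) for w in words]
    by_length = {}
    for w in words:
        by_length.setdefault(len(w), []).append(w)
    convertible = 0
    for w in words:
        candidates = by_length[len(w)]            # only same-length words can match
        if any(v != w and hamming(w, v) <= max_distance for v in candidates):
            convertible += 1
    return 100.0 * convertible / len(words)
```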

Error Detection Methods
- Using a dictionary: create a dictionary of the most frequently occurring words; a word is valid if it is present. Accuracy depends on dictionary coverage.
- Using akshara n-grams: generate a symbol (akshara) n-gram dictionary; every word is converted to its n-grams, and a word is valid if all of its n-grams are present in the dictionary.
- Word and akshara dictionary combination: first check whether the word is in the word dictionary; if not, check the n-gram dictionary.
- Detection through learning: use a linear classification method to classify a word as valid or invalid, with n-gram probabilities as features; an SVM-based binary classifier is trained and used to predict whether a word is valid.
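A rough sketch of the four methods is given below. The word dictionary, the akshara bigram inventory and probabilities, and the labelled training words are assumed inputs; scikit-learn's SVC stands in for the SVM, and the min/mean/geometric-mean n-gram-probability features are a simplified illustration rather than the exact features used.

```python
# Sketch of the dictionary / akshara n-gram / SVM error-detection methods.
import numpy as np
from sklearn.svm import SVC

def is_valid_by_dictionary(word, dictionary):
    return word in dictionary

def akshara_bigrams(word_symbols):
    return list(zip(word_symbols, word_symbols[1:]))

def is_valid_by_ngrams(word_symbols, valid_bigrams):
    # A word is valid if all of its akshara bigrams occur in the bigram dictionary.
    return all(bg in valid_bigrams for bg in akshara_bigrams(word_symbols))

def is_valid_combined(word, word_symbols, dictionary, valid_bigrams):
    # First the word dictionary, then fall back to the n-gram dictionary.
    return word in dictionary or is_valid_by_ngrams(word_symbols, valid_bigrams)

def ngram_feature_vector(word_symbols, bigram_probabilities):
    # Simplified features: min, mean and geometric-mean bigram probability.
    probs = [bigram_probabilities.get(bg, 1e-6) for bg in akshara_bigrams(word_symbols)]
    probs = probs or [1e-6]
    return [min(probs), float(np.mean(probs)), float(np.exp(np.mean(np.log(probs))))]

def train_svm_detector(words_symbols, labels, bigram_probabilities):
    # labels: 1 for valid words, 0 for OCR errors (negative samples).
    features = [ngram_feature_vector(w, bigram_probabilities) for w in words_symbols]
    classifier = SVC(kernel='linear')
    classifier.fit(features, labels)
    return classifier
```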


Evaluation Metrics
- True Positive (TP): our model detects a word as invalid and the annotation seconds it.
- False Positive (FP): our model detects a word as invalid but it is actually a valid word.
- False Negative (FN): our model detects a word as valid but it is actually an invalid word.
- True Negative (TN): our model detects a word as valid and the annotation seconds it.
Precision = TP / (TP + FP), Recall = TP / (TP + FN), F-Score = 2 * Precision * Recall / (Precision + Recall).
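A minimal sketch of these measures, treating "detected as invalid" as the positive class:

```python
# Precision, recall and F-score from the confusion-matrix counts above.
def precision_recall_fscore(tp, fp, tn, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
    return precision, recall, fscore
```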

Dataset
British National Corpus for English and the CIIL corpus for Indian languages. OCR output from Arya et al. (ICDAR MOCR Workshop, 2011) was used for the experiments. 50% of the wrong OCR outputs were taken to train the SVM with negative samples. Dictionary sizes: 670K words for Malayalam and 700K words for Telugu.

Results

TP, FP, TN and FN values (%) for Malayalam and Telugu:

Method                | Malayalam                     | Telugu
                      | TP    | FP    | TN    | FN    | TP    | FP    | TN    | FN
Word Dictionary       | 72.36 | 22.88 | 77.12 | 27.63 | 94.32 | 92.13 | 7.87  | 5.67
nGram Dictionary      | 72.85 | 22.17 | 77.83 | 27.15 | 62.12 | 6.37  | 93.63 | 37.88
Word Dict. + nGram    | 67.97 | 14.95 | 85.04 | 32.02 | 65.01 | 2.2   | 97.8  | 34.99
Word Dictionary + SVM | 62.87 | 9.73  | 90.27 | 37.13 | 68.48 | 3.24  | 96.76 | 31.52

Precision, Recall and F-Score for Malayalam and Telugu:

Method                | Malayalam                  | Telugu
                      | Prec. | Recall | F-Score   | Prec. | Recall | F-Score
Word Dictionary       | 0.52  | 0.72   | 0.60      | 0.51  | 0.94   | 0.68
nGram Dictionary      | 0.53  | 0.73   | 0.61      | 0.91  | 0.62   | -
Word Dict. + nGram    | 0.74  | 0.64   | -         | 0.76  | -      | -
Word Dictionary + SVM | 0.69  | 0.63   | -         | 0.95  | 0.67   | 0.78

Conclusion
- A generic OCR framework for multiple Indic scripts.
- Recognition as transcription: holistic recognition with Unicode output.
- High accuracy without any post-processing.
- Understanding the challenges in developing post-processors for Indic scripts.
- Error detection using machine learning.

Thank You !!!!