
1 An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information Extraction Lab Language Technologies Research Center IIIT Hyderabad

2 Outline
- Introduction
- Model of our approach
  - An example
  - Different steps
  - Scoring
- Dataset and Evaluation
  - Dataset
  - Evaluation
- Results
  - Empirical results
  - Coverage
- Discussion

3 Why CLIA?
- CLIA: Cross-Lingual Information Access, e.g. a Hindi-Wide Web or a Telugu-Wide Web.
- To bridge the gap between the information available and the languages users know, CLIA systems are vital.
- Why dictionaries? Dictionaries form our first step in building such a CLIA system; they are built exclusively to translate/transliterate user queries.

4 Why Wikipedia?
- Rich multilingual content in 272 languages, and growing.
- Well structured.
- Updated regularly.
- Not all languages have the privilege of rich linguistic resources; we can harness Wikipedia's structure instead.

5 How is it done?
- Exploit different structural aspects of Wikipedia.
- Build as many resources as possible from Wikipedia itself.
- Extract parallel/comparable text from each structure using the resources built.
- Build dictionaries using previously built dictionaries and resources.
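The iterative idea on this slide can be sketched as a loop in which each structure's dictionary is built with the help of all dictionaries built before it. This is a minimal sketch, not the paper's implementation; `extract_pairs` and `score_pairs` are hypothetical placeholders for the per-structure extraction and scoring steps.

```python
# Sketch of the iterative dictionary-building loop described above.
# extract_pairs() and score_pairs() are hypothetical callables standing in
# for the per-structure extraction and scoring steps.

def build_dictionaries(structures, extract_pairs, score_pairs):
    """Iteratively build one dictionary per Wikipedia structure,
    reusing all previously built dictionaries as resources."""
    dictionaries = []
    for structure in structures:
        # Extract parallel text from this structure, aided by what has
        # been built so far (earlier dictionaries filter known words).
        pairs = extract_pairs(structure, dictionaries)
        dictionaries.append(score_pairs(pairs))
    return dictionaries
```

Each iteration thus benefits from the output of the previous ones, which is what lets later, noisier structures (categories, article text) be filtered effectively.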

6 Contd.
- Extract maximum information using the structure of articles that are linked by cross-lingual links.

7 Model of our approach

8 Example…

9 Titles (dictionary 1)
- Titles of the linked articles form the parallel text used to build the first dictionary.
- Both directions are considered when building the parallel corpus, i.e., English to Hindi and Hindi to English.

Infobox (dictionary 2)
- Infoboxes contain vital information, especially nouns.
- Two-step process: keys, then values.
- Values of the mapped keys from the infoboxes are considered as parallel text.
- The number of words in each value pair is reduced by removing highly scored word pairs from dictionary 1 and stop words.
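The two-step infobox extraction above can be illustrated as follows. This is a sketch under assumptions, not the paper's code: infoboxes are modelled as plain key-to-value dictionaries, and `key_map`, `dict1`, and the stop word sets are hypothetical inputs produced by the earlier steps.

```python
# Illustrative sketch of the two-step infobox extraction: keys are mapped
# first (key_map), then the values of mapped keys are paired as parallel
# text. Words already resolved by dictionary 1 and stop words are dropped.

def infobox_value_pairs(en_infobox, hi_infobox, key_map, dict1,
                        stop_en, stop_hi):
    pairs = []
    for en_key, hi_key in key_map.items():
        if en_key in en_infobox and hi_key in hi_infobox:
            # Keep only words not yet covered by dictionary 1 or stop lists.
            en_words = [w for w in en_infobox[en_key].split()
                        if w not in dict1 and w not in stop_en]
            hi_words = [w for w in hi_infobox[hi_key].split()
                        if w not in dict1.values() and w not in stop_hi]
            if en_words and hi_words:
                pairs.append((en_words, hi_words))
    return pairs
```

Shrinking the value pairs this way leaves shorter, more alignable word lists for the scoring step.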

10 Categories (dictionary 3)
- Categories of articles linked by the inter-language link (ILL) form the parallel text.
- Both directions are considered when building the parallel corpus, i.e., English to Hindi and Hindi to English.

Article text (dictionary 4)
- The first paragraph of the article text is generally the abstract of the article.
- Sentence pairs are filtered by the Jaccard similarity metric using the dictionaries built so far.
- Words mapped in any of the dictionaries, and stop words, are removed from each sentence pair.
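The Jaccard-based sentence-pair filtering can be sketched as below. The threshold value and the `translate` mapping are assumptions for illustration; the slide does not state the cutoff the authors used.

```python
# Jaccard similarity filter for candidate sentence pairs, as on the slide.
# `translate` maps source tokens through the dictionaries built so far;
# the threshold of 0.5 is an illustrative assumption.

def jaccard(a, b):
    """Jaccard similarity between two token sequences, as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_sentence_pairs(pairs, translate, threshold=0.5):
    """Keep sentence pairs whose translated-token overlap with the
    target sentence reaches the threshold."""
    kept = []
    for src, tgt in pairs:
        translated = [translate.get(w, w) for w in src]
        if jaccard(translated, tgt) >= threshold:
            kept.append((src, tgt))
    return kept
```

Pairs that survive the filter then have their already-mapped words and stop words removed before scoring.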

11 Scoring
The word pairs from the parallel text are scored with a co-occurrence-based formula, where:
- w_e^i and w_h^j are the i-th and j-th words in the English and Hindi wordlists respectively;
- n_i and n_j are the numbers of occurrences of w_e^i and w_h^j in the parallel text;
- n_ij is the number of occurrences of the pair (w_e^i, w_h^j) in a single parallel text instance.
(The formula image itself did not survive the transcript.)
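Since the exact formula was lost in the transcript, the following is only an illustrative stand-in: a Dice-style co-occurrence score built from the counts the slide defines (n_i, n_j, n_ij). It shows the shape of such a scorer, not the authors' actual formula.

```python
# Illustrative co-occurrence scorer over (English, Hindi) parallel text
# instances. The Dice-style normalisation is an assumption; the slide's
# exact formula is not preserved in the transcript.
from collections import Counter

def cooccurrence_scores(parallel_texts):
    """Score (English, Hindi) word pairs by how often they appear in the
    same parallel text instance, normalised by their individual counts."""
    n_e, n_h, n_eh = Counter(), Counter(), Counter()
    for en_words, hi_words in parallel_texts:
        for e in set(en_words):
            n_e[e] += 1
        for h in set(hi_words):
            n_h[h] += 1
        for e in set(en_words):
            for h in set(hi_words):
                n_eh[(e, h)] += 1
    # Dice coefficient: 2 * n_ij / (n_i + n_j); 1.0 means the pair always
    # co-occurs.
    return {(e, h): 2 * c / (n_e[e] + n_h[h]) for (e, h), c in n_eh.items()}
```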

12 Dataset
- Three sets of 300 English words each are generated from existing English-Hindi and English-Telugu dictionaries.
  - The existing dictionaries are provided by the Language Technologies Research Center (LTRC, IIIT-H).
- The sets mix most frequently used to less frequently used words; frequency is determined using a news corpus.
- Words are POS-tagged to perform tag-based analysis.

13 Evaluation
Precision and recall are calculated for the dictionaries built:
- precision = ExtractedCorrectMappings / AllExtractedMappings
- recall = ExtractedCorrectMappings / CorrectMappingsUsingAvailableDictionary
Correctness of a mapping is determined in two ways:
- Automatic: using an available dictionary.
- Manual: we manually evaluate the correctness of the word.
Two methods are required because:
- No language-dependent processing (parsing, chunking, etc.) is done.
- Different word forms (plural, tense, etc.) are returned.
- Wikipedia uses different spellings for the same word.
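The two definitions on this slide translate directly into code. The sketch below assumes mappings are represented as (source, target) pairs, with the gold standard being the correct mappings available from the existing dictionary.

```python
# Precision and recall of extracted word mappings, as defined on the slide.
# `extracted` = all mappings the system produced; `gold` = correct mappings
# available in the reference dictionary.

def precision_recall(extracted, gold):
    correct = extracted & gold  # ExtractedCorrectMappings
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall
```

Note that the automatic evaluation under-counts correct pairs (different word forms and spellings), which is why the manual precision figures on the next slide are higher.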

14 Empirical Results

English to Hindi results
        Automated Eval         Titles (manual)        Manual Eval
Set     Precision   Recall     Precision   Recall     Precision
Set1    0.464       0.554      0.570       0.434      0.777
Set2    0.497       0.537      0.584       0.417      0.783
Set3    0.503       0.557      0.633       0.427      0.743

English to Telugu results
        Automated Eval         Titles (manual)        Manual Eval
Set     Precision   Recall     Precision   Recall     Precision
Set1    0.117       0.170      0.353       0.056      0.411
Set2    0.093       0.143      0.235       0.056      0.441
Set3    0.117       0.200      0.285       0.070      0.441

15 Existing Systems

Approach                    Precision   Recall   F1 Score
Existing (high precision)   0.781       0.225    0.349
Existing (high recall)      0.333       0.613    0.431
Our approach (Hindi)        0.767       0.549    0.639

- With both precision and recall balanced at a high level, our system performs comparably with the existing systems.
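The F1 column in the table above is the standard harmonic mean of precision and recall, which can be verified directly:

```python
# F1 score: harmonic mean of precision and recall. Used to check the
# F1 column of the comparison table above.

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# f1(0.767, 0.549) recovers the table's 0.639 (to rounding).
```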

16 Coverage
- Number of unique words added to the dictionary from each structure.
- (Figure: structure on the X-axis vs. unique word count on the Y-axis; image not preserved in the transcript.)

17 Discussion
- Titles are taken as the baseline, since most existing CLIA systems over Wikipedia also use titles as their baseline.
- Precision is higher when evaluated manually because:
  - a single English word has various context-based translations;
  - the word form returned may differ from the form present in the dictionary;
  - the same word may have different spellings (different characters).
- The precision of the dictionaries created is in the order: titles > infobox > categories > text.

18 Contd.
- Query formation in wiki-CLIA does not depend completely on dictionaries and their accuracy.
- The words returned by our dictionaries, if not exact translations, are related words, since they occur in a related wiki article; they can still be used to form the query.
- Coverage of proper nouns, which are generally absent from dictionaries, is high. Their values are:

Precision   Recall   F-Measure (F1)
0.715       0.787    0.749

19 On-going Work
- Extract more parallel sentences from other structures to increase the coverage of the dictionary: image meta tags, the body of the article, and anchor text.
- Query formation from these dictionaries.

20 Questions? Thank you.

