Presentation transcript:

Stemming to Improve Translation Lexicon Creation from Bitexts
Presented By: Hassan S. Al-Ayesh

Outline Introduction. Extracting Parallel Documents from the Internet. Preprocessing. The System. The First Algorithm (Examples, Results). The Second Algorithm (Examples, Results). Effect of Stemming. Conclusions. Reference.

Introduction Stemming is the process of normalizing word variations by removing prefixes and suffixes. English-Arabic parallel documents can be used to build an Arabic-English dictionary; however, they are hard to find. The main providers of such documents are newspapers, magazines and news websites. Bitexts, or parallel corpora, are bodies of text in parallel translation. The following slides demonstrate a system for creating English-Arabic bitexts.

Extracting Parallel Documents from the Internet
Steps required to find parallel documents:
1. Locate pages that might contain parallel documents using search queries such as "Arabic Version", "English Version" and so on.
2. Download or generate the document pairs.
3. Filter out the candidate pairs that are not translations of each other.
The types of documents collected are:
a) Parent page: a page that contains links to versions in different languages.
b) Sibling page: a page in one language that contains a link to another-language version of the same page.
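
As a rough illustration of the parent-page case, the sketch below scans a downloaded parent page for anchors whose link text contains "Arabic Version" or "English Version" and pairs them up. This is only a minimal sketch under stated assumptions, not the paper's actual crawler; the regex-based HTML handling and the pair_versions helper name are illustrative.

```python
import re

# Anchor texts used as language-version cues (from the search queries
# mentioned on this slide); matched here as simple substrings.
CUES = {"english": "English Version", "arabic": "Arabic Version"}

ANCHOR_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                       re.IGNORECASE | re.DOTALL)

def pair_versions(parent_html):
    """Return (english_url, arabic_url) pairs found on a parent page."""
    links = {"english": [], "arabic": []}
    for href, text in ANCHOR_RE.findall(parent_html):
        for lang, cue in CUES.items():
            if cue.lower() in text.lower():
                links[lang].append(href)
    # Naively pair links in order of appearance; a real system would also
    # filter out candidate pairs that are not translations of each other.
    return list(zip(links["english"], links["arabic"]))

if __name__ == "__main__":
    html = ('<a href="/doc1_en.html">English Version</a> '
            '<a href="/doc1_ar.html">Arabic Version</a>')
    print(pair_versions(html))   # [('/doc1_en.html', '/doc1_ar.html')]
```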

Preprocessing Preprocessing involves: 1. Align the sentence pairs based on their length. 2. Remove the English and Arabic stop word lists from these documents. English: possessive pronouns, pronouns, prepositions and some words that has no candidates like a, an and so on. Arabic: pronouns, prepositions and some words like و، أن، إن، لقد and so on. 3. Delete some symbols and remove diacritics from Arabic texts. 4. Convert plural words to singular.
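
A minimal sketch of steps 2 and 3 for the Arabic side is given below. The tiny stop-word list is only the handful of examples from this slide, and treating the Unicode range U+064B–U+0652 as the Arabic diacritics is an assumption of this sketch; the actual stop-word lists are larger.

```python
import re

# Illustrative Arabic stop words taken from this slide; a real list is larger.
ARABIC_STOP_WORDS = {"و", "أن", "إن", "لقد"}

# Arabic diacritics (tashkeel) occupy U+064B..U+0652.
DIACRITICS_RE = re.compile("[\u064B-\u0652]")

def preprocess_arabic(sentence):
    """Remove diacritics, punctuation-like symbols and stop words."""
    text = DIACRITICS_RE.sub("", sentence)
    text = re.sub(r"[^\w\s]", " ", text)          # delete symbols
    tokens = [t for t in text.split() if t not in ARABIC_STOP_WORDS]
    return tokens

print(preprocess_arabic("السباحة رياضة محببة."))  # ['السباحة', 'رياضة', 'محببة']
```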

The System The system consists of: 1. The Searcher. 2. The Preprocessor. 3. The Stemmer, with the two algorithms discussed on the following slides.

First Algorithm The similarity metric between English and Arabic words is based on the statistical co-occurrence and the frequencies of the English and Arabic words. First, build a table that contains each word, the numbers of the sentences in which it occurs, and its frequency. Then apply the algorithm on the next slide.

First Algorithm (continued)
Set i = j = 1
Test: if (m(i,j) >= x*an(i)) && (y*en(j) <= an(i) <= z*en(j))
    Copy ArabicWord(i) & EnglishWord(j) to the final document
End if
j = j + 1
If (j <= NE) Goto Test
End if
i = i + 1 and j = 1
If (i <= NA) Goto Test
End if
Where m(i,j) is the number of times the Arabic word i and the English word j occur in the same sentence pair, an(i) is the frequency of the Arabic word, en(j) is the frequency of the English word, i is the selected Arabic word, j is the selected English word, x, y and z are system parameters, NE is the total number of English words and NA is the total number of Arabic words.
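
A minimal Python sketch of this algorithm follows. It assumes the aligned corpus is a list of (English tokens, Arabic tokens) sentence pairs and counts frequencies at the sentence-pair level; the function name extract_pairs and the default values of x, y and z are illustrative, not taken from the paper.

```python
from collections import Counter, defaultdict

def extract_pairs(sentence_pairs, x=0.5, y=0.5, z=2.0):
    """First algorithm: keep (Arabic, English) word pairs whose co-occurrence
    count and frequencies satisfy the thresholds on this slide.

    sentence_pairs: list of (english_tokens, arabic_tokens) tuples.
    """
    en_freq, ar_freq = Counter(), Counter()
    cooc = defaultdict(int)                      # m(i,j)

    for en_tokens, ar_tokens in sentence_pairs:
        # Frequencies counted per sentence pair (an assumption of this sketch).
        for e in set(en_tokens):
            en_freq[e] += 1
        for a in set(ar_tokens):
            ar_freq[a] += 1
        # Co-occurrence of an Arabic and an English word in the same pair.
        for a in set(ar_tokens):
            for e in set(en_tokens):
                cooc[(a, e)] += 1

    lexicon = []
    for (a, e), m in cooc.items():
        an, en = ar_freq[a], en_freq[e]
        if m >= x * an and y * en <= an <= z * en:
            lexicon.append((a, e))
    return lexicon
```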

First Algorithm (Example) Assume the following English sentences: (1) Swimming is a popular sport. (2) Basketball was considered as the popular game in USA. The Arabic translations are: (1) السباحة رياضة محببة. (2) كرة السلة تعتبر لعبة محببة في الولايات المتحدة.

First Algorithm (Results) To evaluate the results we calculate:
1. Precision = Correct / (Correct + Wrong)
2. Recall = Correct / (Correct + Missing)
3. F-measure = (2 * Precision * Recall) / (Precision + Recall)
The effect of the threshold on m(i,j) on the precision, recall and F-measure:

Threshold on m(i,j)   > 2      > 4      > 6      > 8      > 10     > 12
Precision             44.3%    75.7%    82.5%    86.3%    95.6%    100%
Recall                33.8%    11.8%    6.2%     5.4%     3.9%     2.3%
F-measure             38.3%    20.4%    11.5%    10.2%    7.5%     4.5%
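
For reference, the three measures can be computed directly from counts of correct, wrong and missing pairs. The counts in the usage line below are purely illustrative, chosen so that they roughly reproduce the first column of the table above.

```python
def evaluate(correct, wrong, missing):
    """Precision, recall and F-measure from pair counts."""
    precision = correct / (correct + wrong)
    recall = correct / (correct + missing)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Illustrative counts: about 44.3% precision, 33.8% recall, 38.3% F-measure.
print(evaluate(443, 557, 868))
```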

First Algorithm (Results) Cont. Raising the threshold on m(i,j) increases the precision of the resulting dictionary and decreases the recall. Advantage: the algorithm is efficient for big corpora. Disadvantage: it fails to capture dependencies between groups of words.

Second Algorithm Based on the statistical co-occurrence of pairs.
First step: Set n = 1 and m = n + 1
Test: Compare Arabic_sentence(n) with Arabic_sentence(m)
    Compare English_sentence(n) with English_sentence(m)
    If only one Arabic word is common between Arabic_sentence(n) and Arabic_sentence(m)
        Copy the Arabic word and the associated English word or phrase into a table
    End If
m = m + 1
If m <= N GOTO Test
End If
n = n + 1 and m = n + 1
If m <= N GOTO Test
End If
Then exchange Arabic and English, and repeat the previous pseudocode. Where N is the number of sentences.
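
A minimal sketch of this first stage is below. It assumes the corpus is a list of (English tokens, Arabic tokens) sentence pairs and, as one interpretation of "the associated English word or phrase", it records the English words shared by the same two sentences; the helper name collect_candidates is illustrative.

```python
from collections import defaultdict

def collect_candidates(sentence_pairs):
    """Second algorithm, first stage: for every pair of sentences that share
    exactly one Arabic word, record that word together with the English
    words shared by the same two sentences (a word or a short phrase)."""
    table = defaultdict(list)
    N = len(sentence_pairs)
    for n in range(N):
        for m in range(n + 1, N):
            en_n, ar_n = sentence_pairs[n]
            en_m, ar_m = sentence_pairs[m]
            common_ar = set(ar_n) & set(ar_m)
            if len(common_ar) == 1:
                common_en = set(en_n) & set(en_m)
                if common_en:
                    arabic_word = common_ar.pop()
                    # Keep the shared English words in the order of sentence n.
                    table[arabic_word].append(
                        " ".join(w for w in en_n if w in common_en))
    # The slide then repeats the same procedure with the roles of
    # Arabic and English exchanged.
    return table
```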

Second Algorithm Cont. The output of the previous algorithm is:
a) one English word translated to one Arabic word;
b) one English word translated to an Arabic phrase;
c) one Arabic word translated to an English phrase.
Then apply the following algorithm:
For i = 1; i <= Na_word; i++
    Get all English_words associated with Arabic_word(i)
    For all English_words e associated with Arabic_word(i)
        If R(e) > Th
            Copy Arabic_word(i) & e into a final file
        End If
    End For
End For
Where Na_word is the total number of Arabic words in the table, R(e) = f(e)/NE is the repetition percentage of the English word e, and f(e) is the frequency of the English word e.
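
A sketch of this filtering stage, continuing from the collect_candidates output of the previous sketch. Interpreting R(e) = f(e)/NE as the fraction of entries recorded for a given Arabic word that equal the English candidate e is an assumption of this sketch, as is the default threshold value.

```python
from collections import Counter

def filter_candidates(table, th=0.5):
    """Second algorithm, second stage: keep an (Arabic, English) pair when
    the English candidate accounts for more than `th` of the entries
    recorded for that Arabic word (one reading of R(e) = f(e)/NE)."""
    lexicon = []
    for arabic_word, english_candidates in table.items():
        counts = Counter(english_candidates)
        total = len(english_candidates)          # NE for this Arabic word
        for e, f_e in counts.items():
            if f_e / total > th:
                lexicon.append((arabic_word, e))
    return lexicon

# Usage together with the previous sketch:
# table = collect_candidates(sentence_pairs)
# print(filter_candidates(table, th=0.5))
```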

Second Algorithm (Example) Assume the following English sentences: 1. I can play football. 2. Football is a popular sport. 3. Basketball was considered as the popular game in USA. The Arabic translations are: 1. استطيع ان العب كرة القدم. 2. كرة القدم رياضة محببة. 3. كرة السلة تعتبر لعبة محببة في الولايات المتحدة.

Second Algorithm (Results) The precision and recall of the translation pairs obtained with the previous algorithm depend on the value of the threshold Th: the precision is directly proportional to Th, and the recall is inversely proportional to Th. The effect of Th on the precision, recall and F-measure (columns correspond to increasing values of Th):

Precision    69.2%    76.0%    83.8%    85.0%    85.3%
Recall       90.0%    84.6%    75.9%    75.0%    74.5%
F-measure    78.2%    80.1%    79.7%    79.7%    79.5%

Second Algorithm (Results) Cont. The effect of the trial number on the precision, recall and F-measure:

Trial        First    Second   Third    Fourth   Fifth    Sixth    Seventh  Eighth
Precision    86.0%    91.4%    85.3%    81.6%    76.7%    73.1%    74.3%    68.7%
Recall       8.0%     45.8%    15.4%    6.1%     2.3%     1.4%     1.0%     0.7%
F-measure    14.6%    61.0%    26.1%    11.4%    4.5%     2.7%     2.0%     1.4%

Disadvantage: the processing time required for algorithm 2 is higher than that of algorithm 1. Because of the disadvantages of algorithms 1 and 2, it is better to use a combination of the two algorithms.

Effect of Stemming Using the Arabic light stemmer of Aitao Chen and Fredric Gey. The stemmer removes prefixes and suffixes in the following sequence:
1. If the word is at least five characters long, remove the first three characters if they are one of the following: وال،لال،سال،اال،مال،ولل،كال،فال،بال.
2. If the word is at least four characters long, remove the first two characters if they are one of the following: وا،ال،فا،كا،ول،وي،وس،سي،لا،وب،وت،وم،لل،با.
3. If the word is at least four characters long and begins with و, remove the initial letter و.
4. If the word is at least four characters long and begins with either ب or ل, remove ب or ل only if, after removing the initial character, the resulting word is present in the Arabic document collection.
5. Recursively strip the following two-character suffixes in the order of presentation if the word is at least four characters long before removing a suffix: ون،ات،ان،ين،تن،تم،كن،كم،هن،يا،ني،وا،ما،نا،هم،ية،ها.
6. Recursively strip the following one-character suffixes in the order of presentation if the word is at least three characters long before removing a suffix: ت،ي،ه،ة.
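
A direct Python transcription of these six rules is sketched below, applying the steps once each in the listed order. The affix lists are copied from this slide; because step 4 consults the whole Arabic document collection, the sketch takes a vocabulary set as a parameter, which is an assumption of this transcription.

```python
THREE_CHAR_PREFIXES = ["وال", "لال", "سال", "اال", "مال", "ولل", "كال", "فال", "بال"]
TWO_CHAR_PREFIXES = ["وا", "ال", "فا", "كا", "ول", "وي", "وس", "سي", "لا",
                     "وب", "وت", "وم", "لل", "با"]
TWO_CHAR_SUFFIXES = ["ون", "ات", "ان", "ين", "تن", "تم", "كن", "كم", "هن",
                     "يا", "ني", "وا", "ما", "نا", "هم", "ية", "ها"]
ONE_CHAR_SUFFIXES = ["ت", "ي", "ه", "ة"]

def light_stem(word, vocabulary=frozenset()):
    """Arabic light stemming following the six steps on this slide."""
    # 1. Three-character prefixes (word length >= 5).
    if len(word) >= 5 and word[:3] in THREE_CHAR_PREFIXES:
        word = word[3:]
    # 2. Two-character prefixes (word length >= 4).
    if len(word) >= 4 and word[:2] in TWO_CHAR_PREFIXES:
        word = word[2:]
    # 3. Initial waw (word length >= 4).
    if len(word) >= 4 and word.startswith("و"):
        word = word[1:]
    # 4. Initial ba or lam, only if the remainder occurs in the collection.
    if len(word) >= 4 and word[0] in ("ب", "ل") and word[1:] in vocabulary:
        word = word[1:]
    # 5. Recursively strip two-character suffixes (length >= 4 before each removal).
    stripped = True
    while stripped and len(word) >= 4:
        stripped = False
        for suffix in TWO_CHAR_SUFFIXES:
            if word.endswith(suffix):
                word = word[:-2]
                stripped = True
                break
    # 6. Recursively strip one-character suffixes (length >= 3 before each removal).
    stripped = True
    while stripped and len(word) >= 3:
        stripped = False
        for suffix in ONE_CHAR_SUFFIXES:
            if word.endswith(suffix):
                word = word[:-1]
                stripped = True
                break
    return word
```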

Effect of Stemming The system accuracy increased but the total recall decreased. The accuracy increased because stemming raises the frequency of each translation pair, which reduces the system's confusion. The recall decreased because many different Arabic words are reduced to the same form after stemming. The accuracy did not increase very much after stemming because the formation of broken Arabic plurals is complex and often irregular; for example, ادوات after stemming becomes ادو, which is not a valid word.

The Output A part of the final output file (Arabic word, English translation, flag):
غفور  Oft-forgiving  1
يا  O  1
اتفاق  Agreement  1
ليل  Night  1
اجتماع  Meeting  1
إلا  Except  1
اختيار  Choice  1
أو  Little  0
استخدام  Fragmentation  0
قليلا  little  1
استعراض  Review  1
رب  Lord  1
استكشاف  Exploration  1
إله  God  1
استنادا  Based  1
يقول  Say  1
اقتصاد  Economy  1
أرض  Earth  1
اتحاد  Union  1
يوم  Day  1
صم  Deaf  1
جبال  Mountain  0
اتصال  Communication  1
فصل  Day of sorting out  1
اتفاق  Agreement  1
قارعة  Day of noise and clamour  1

Conclusion The first algorithm achieved high precision with low recall for high-frequency words, and its required processing time is small; however, it fails to handle compound nouns. Algorithm 2 can handle the translation of compound nouns with high precision and recall, but it needs a long time. Stemming as a preprocessing step increased the system accuracy but decreased the recall.

Reference Mohamed Abdel Fattah, Fuji Ren and Shingo Kuroiwa, "Stemming to Improve Translation Lexicon Creation from Bitexts".

Thank You… If you have any questions, just ask.