Language Specific Crawler for Myanmar Web Pages. Pann Yu Mon, Management and Information System Engineering Department, Nagaoka University of Technology

Slide 1: Language Specific Crawler for Myanmar Web Pages
Pann Yu Mon
Management and Information System Engineering Department, Nagaoka University of Technology, Japan
10th July 2008

Slide 2: Outline
1. Introduction
2. Design of Crawler
3. Evaluation
4. Conclusions
5. Limitations
6. Themes for Doctoral Study

Slide 3: 1. Introduction
- Internet users are about 0.1% of the population.
- Little Myanmar-language content is found on the Web.
- No search engine is available for the Myanmar language.
Country        Population   Number of Internet users   Internet users (%)
Myanmar (.mm)  52,373,958   63,700                     0.123

Slide 4: Challenges for a Language Specific Crawler (LSC) for Myanmar
- Multiple encodings are in use.
- Myanmar pages are sparsely scattered over the entire Web.
- The crawler must collect as many pages as possible with limited time and computer resources.
[Figure: Myanmar pages and non-Myanmar pages]

Slide 5: Basic Architecture of a Language Specific Search Engine
[Figure: architecture diagram. Labels: WWW, crawler, language identification, language specific crawler, page repository, parser, indexer, corpus/lexicon, ranking engine, query engine, query, results.]

Slide 6: Objectives
- To propose a Language Specific Crawler (LSC) that maximizes the collection of web pages written in a target language, independent of domain.
- To efficiently collect Myanmar web pages, which can then be indexed and sorted and finally used in a search engine.

Slide 7: 2. Design of Crawler (cont.)
Challenges:
- Multiple encodings are in use.
- Myanmar pages are sparsely scattered over the entire Web.
- The crawler must collect as many pages as possible with limited time and computer resources.
Design of crawler:
- Automatic Language Identification (LI) capable of handling multiple encodings
- Language-based tracing of links
- Choice of seed URLs
- Multi-threaded crawling
- Robots exclusion (robots.txt)
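The robots exclusion item can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt content, the user-agent name `lsc-sketch` and the example.com URLs are hypothetical, and the slides do not say how the original crawler implemented this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# "lsc-sketch" is an invented user-agent name for this example.
allowed = parser.can_fetch("lsc-sketch", "http://example.com/news/index.html")
blocked = parser.can_fetch("lsc-sketch", "http://example.com/private/page.html")
print(allowed, blocked)
```

A crawler would run such a check before queuing each URL and skip the ones that are disallowed.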

Slide 8: Crawling Process
[Figure: World Wide Web, Get URLs, Language Identifier, database]
1. Extract URLs
2. Language identification
3. Saving into database
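The three steps of the crawling process might be sketched as a breadth-first loop. Everything here is an assumption made for illustration: the `fetch` and `is_target_language` callables, the toy in-memory "web", and the simple Unicode-range check that stands in for the real language identifier. The sketch follows every link up to the depth limit, which may differ from the language-based link tracing used in the original design.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_urls(base_url, html):
    extractor = LinkExtractor()
    extractor.feed(html)
    return [urljoin(base_url, link) for link in extractor.links]


def crawl(seed_urls, fetch, is_target_language, max_depth):
    saved = {}  # stands in for the database
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        if is_target_language(html):          # step 2: language identification
            saved[url] = html                 # step 3: save
        if depth < max_depth:
            for link in extract_urls(url, html):  # step 1: extract URLs
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return saved


# Toy in-memory web (hypothetical pages, no network access).
TOY_WEB = {
    "http://a.example/": '<a href="/my">m</a><a href="/en">e</a>',
    "http://a.example/my": "\u1019\u103c\u1014\u103a\u1019\u102c",
    "http://a.example/en": "hello",
}


def looks_myanmar(text):
    # Crude stand-in: any character in the Unicode Myanmar block U+1000..U+109F.
    return any("\u1000" <= ch <= "\u109f" for ch in text)


pages = crawl(["http://a.example/"], TOY_WEB.get, looks_myanmar, max_depth=6)
print(list(pages))
```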

Slide 9: Multi-threaded Crawler
- A single crawling loop takes a large amount of time.
- Multi-threading can provide a reasonable speed-up and efficient use of the available bandwidth.
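A minimal sketch of the multi-threading idea, using Python's `concurrent.futures`. The `fake_fetch` function and the example.com URLs are placeholders; the sleep simulates network latency so that the benefit of overlapping downloads is visible without real network access. The thread count is an arbitrary choice, not a figure from the presentation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Placeholder download: sleeps briefly to imitate network latency."""
    time.sleep(0.05)
    return f"content of {url}"


def fetch_all(urls, fetch, workers=8):
    """Download several URLs concurrently and return a url -> content mapping."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))


urls = [f"http://example.com/page{i}" for i in range(16)]
start = time.perf_counter()
results = fetch_all(urls, fake_fetch)
elapsed = time.perf_counter() - start
print(len(results), f"{elapsed:.2f}s")
```

Because the threads spend most of their time waiting, the 16 simulated downloads overlap instead of running one after another.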

Slide 10: Language Identification (cont.)
G2LI is an n-gram based language identification algorithm for Web documents.
Advantages:
- Requires few computing resources.
- Needs only a small training set (5 to 20 KB of text is enough).
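To give a feel for n-gram based identification, here is a simplified byte-bigram classifier. It is not G2LI itself, whose details are not given in the slides; the similarity measure, the labels and the tiny training strings are all assumptions chosen for illustration. Working on raw bytes means each encoding can get its own profile, which is one plausible way to cope with multiple encodings.

```python
from collections import Counter


def ngram_profile(data: bytes, n: int = 2) -> Counter:
    """Count the byte n-grams occurring in the data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def similarity(profile: Counter, sample: Counter) -> float:
    """Fraction of the sample's n-grams that also occur in the profile."""
    total = sum(sample.values())
    if total == 0:
        return 0.0
    matched = sum(count for gram, count in sample.items() if gram in profile)
    return matched / total


def identify(data: bytes, profiles: dict) -> str:
    """Return the label whose training profile best matches the data."""
    sample = ngram_profile(data)
    return max(profiles, key=lambda label: similarity(profiles[label], sample))


# Tiny hypothetical training texts; the slide suggests 5 to 20 KB in practice.
TRAINING = {
    "myanmar-utf8": "\u1019\u103c\u1014\u103a\u1019\u102c\u1005\u102c".encode("utf-8"),
    "english-ascii": b"the quick brown fox jumps over the lazy dog",
}
PROFILES = {label: ngram_profile(text) for label, text in TRAINING.items()}

guess_my = identify("\u1019\u103c\u1014\u103a\u1019\u102c".encode("utf-8"), PROFILES)
guess_en = identify(b"the dog jumps", PROFILES)
print(guess_my, guess_en)
```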

Slide 11: Various Myanmar Fonts and Encodings
Font name     Encoding scheme
BIT           Partial Unicode
CE Classic    Graphic Encoding
Myanmar1      Unicode
Myanmar2      Unicode
MyaZedi       Partial Unicode
MyMyanmar     Partial Unicode
Popular       Graphic Encoding
Wininwa       Graphic Encoding
Zawgyi-One    Partial Unicode

Slide 12: Database Design (cont.)
- Save URLs in a CSV file.
- Save page content in a Derby database.
URL table:
ID   URL
1    http://www.google.com
CONTENT table:
ID   ParentURL                    URL                                  Level   Content
1    http://www.google.com        http://www.google.com                0       xxx…
1    http://www.google.com        http://www.google.com/mail           1       xxx…
1    http://www.google.com/mail   http://www.google.com/mail/signout   2       xxx…
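The CONTENT table layout can be tried out with an in-memory SQLite database. SQLite is used here only as a convenient stand-in for the database named on the slide, and the column types are assumptions; the rows are the three examples shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow the slide; the types are guesses.
conn.execute(
    """CREATE TABLE content (
           id INTEGER,
           parent_url TEXT,
           url TEXT,
           level INTEGER,
           content TEXT
       )"""
)

rows = [
    (1, "http://www.google.com", "http://www.google.com", 0, "xxx"),
    (1, "http://www.google.com", "http://www.google.com/mail", 1, "xxx"),
    (1, "http://www.google.com/mail", "http://www.google.com/mail/signout", 2, "xxx"),
]
conn.executemany("INSERT INTO content VALUES (?, ?, ?, ?, ?)", rows)

# The level column records how many links away from the seed a page was found.
deepest = conn.execute(
    "SELECT url, level FROM content ORDER BY level DESC LIMIT 1"
).fetchone()
print(deepest)
```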

Slide 13: 3. Evaluation
A) Evaluation of the language identification (G2LI)
B) Evaluation of crawling efficiency by means of precision and recall
C) Evaluation of the crawling coverage

Slide 14: A) Evaluation of Language Identifier
G2LI's guess (rows) against the verified language (columns):
                            Myanmar           Non-Myanmar   Total
Identified as Myanmar       763 (92%) [87%]   37 (8%)       800 (100%)
Identified as Non-Myanmar   106 [13%]         1094          1200
Total                       869 [100%]        1131          2000

Slide 15: Accuracy Rate and Error Rate
Accuracy rate: (763 + 1094) / 2000 = 93%
Error rate: (37 + 106) / 2000 = 7%
[Figure: Venn diagram. T = downloaded pages, X = relevant sites, Y = retrieved sites.]
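The two rates can be recomputed directly from the four counts in the evaluation table. The variable names are chosen for this sketch only.

```python
# Counts from the language identifier evaluation table.
myanmar_as_myanmar = 763    # correctly identified Myanmar pages
other_as_myanmar = 37       # non-Myanmar pages identified as Myanmar
myanmar_as_other = 106      # Myanmar pages identified as non-Myanmar
other_as_other = 1094       # correctly identified non-Myanmar pages

total = myanmar_as_myanmar + other_as_myanmar + myanmar_as_other + other_as_other
accuracy = (myanmar_as_myanmar + other_as_other) / total
error_rate = (other_as_myanmar + myanmar_as_other) / total

print(f"accuracy = {accuracy:.2%}, error rate = {error_rate:.2%}")
```

The unrounded values are 92.85% and 7.15%, which round to the 93% and 7% quoted above.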

Slide 16: Misclassified Cases
1) Relevant but not retrieved:
- Bilingual pages written in Myanmar and English.
- Web pages using numeric character references, e.g. (&#4156, &#4153).
2) Retrieved but not relevant:
- The misclassified pages are all English web pages.
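One possible remedy for the numeric character reference case, not described in the slides, is to decode the references before language identification. The sketch below uses the standard `html.unescape`; the ratio function is a crude illustrative check against the Unicode Myanmar block (U+1000 to U+109F), not the identifier used in the study.

```python
import html


def decode_ncr(text: str) -> str:
    """Turn references such as &#4156; into the characters they denote."""
    return html.unescape(text)


def myanmar_ratio(text: str) -> float:
    """Share of characters that fall in the Unicode Myanmar block."""
    if not text:
        return 0.0
    hits = sum(1 for ch in text if "\u1000" <= ch <= "\u109f")
    return hits / len(text)


raw = "&#4156;&#4153;"
decoded = decode_ncr(raw)
print(myanmar_ratio(raw), myanmar_ratio(decoded))
```

Before decoding, the page text is pure ASCII and looks nothing like Myanmar; after decoding, both characters (U+103C and U+1039) fall inside the Myanmar block.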

Slide 17: B) Precision and Recall
- Precision: the ability to retrieve top-ranked documents that are mostly relevant.
- Recall: the ability of the search to find all of the relevant items in the entire Web space.
Here X = relevant documents and Y = retrieved documents.
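The slide's own formulas did not survive in the transcript. Under the standard set-based definitions, with X the relevant documents and Y the retrieved documents, precision is |X ∩ Y| / |Y| and recall is |X ∩ Y| / |X|. The sketch below assumes those definitions; the page identifiers are invented.

```python
def precision_recall(relevant: set, retrieved: set):
    """Standard set-based precision and recall."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


X = {"p1", "p2", "p3", "p4"}   # relevant pages (hypothetical)
Y = {"p2", "p3", "p5"}         # retrieved pages (hypothetical)
precision, recall = precision_recall(X, Y)
print(precision, recall)
```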

Slide 18: How to estimate the total number of Web pages
[Figure: Venn diagram of two overlapping sets, A and B, labelled First Keyword and Second Keyword.]
X = the estimated number of total Myanmar pages on the Web
A = first keyword
B = second keyword

Slide 19: Total number of URLs returned by Google for each keyword
Keyword         Number of URLs
(Day)           68,500
(But)           41,000
(Human Being)   117,000
(Now)           31,500
(Myanmar)       56,500
(He)            46,600
Total           361,100
Experiment period: 25th June 2008 to 27th June 2008.

Slide 20: Estimated X
Day       But     68,500   45,200    13,700   205,000
Day       Human   68,500   120,000   14,200   564,401
Day       Now     68,500   35,300    11,800   182,860
:
Now       He      31,500   46,600    10,000   140,805
Myanmar   He      56,500   46,600    11,200   225,496
Total                                         4,905,169
Average of 15 pairs of keyword combinations: 327,011
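The estimation formula itself is missing from the transcript. One reading that fits the figures is an overlap (capture-recapture style) estimate, X = A × B / (pages containing both keywords): with the Google counts of 68,500 for Day and 41,000 for But and an overlap of 13,700, it gives exactly the 205,000 shown in the first row. The sketch below rests on that assumption and should not be taken as the author's confirmed method.

```python
def estimate_total(count_a: int, count_b: int, count_both: int) -> float:
    """Overlap-based estimate of the total population size (assumed formula)."""
    return count_a * count_b / count_both


# Day and But counts as returned by Google, and the number of pages with both.
x_day_but = estimate_total(68_500, 41_000, 13_700)

# Averaging the 15 pairwise estimates, using the total given in the table.
average = 4_905_169 / 15

print(x_day_but, round(average))
```

The average reproduces the 327,011 reported as the estimated number of Myanmar pages.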

Slide 21: Precision and recall of crawling, entertainment site case [Figure]

Slide 22: Precision and recall of crawling, blog site case [Figure]

Slide 23: Precision and recall of crawling, news site case [Figure]

Slide 24: C) Crawling Coverage
Crawling parameters:
- Seed URLs: 35
- Level of depth: 6
- Crawling time: 2 weeks
- CPU: 2.40 GHz
- Memory: 1 GB
- Internet connection: 100 Mbit per second
Domain        Number of pages collected
.mm           3,555 [1.1%]
.com          276,554 [83.2%]
Other gTLDs   52,245 [15.7%]
Total         332,354 [100.0%]

Slide 25: Distribution of the estimated total number of Myanmar pages
Estimated average: 327,011
Collected: 332,354

Slide 26: 4. Conclusion
- The proposed crawler design proved to work as an LSC for the Myanmar language.
- The LSC can download Myanmar pages on the Web at a satisfactory level.
- The proposed LSC can be used as part of a Myanmar search engine.

Slide 27: 5. Limitations of LSC
- How to reach isolated Myanmar pages (choice of seed URLs, etc.)
- Misidentification by the language identifier (in particular, bilingual English and Myanmar pages need to be collected)
- Improving the speed of the LSC

Slide 28: 6. Themes for doctoral study
1. Lexicon
2. Indexing
3. Code conversion (transcoding)
4. Stop words removal
5. Stemming algorithm

Slide 29: Basic Architecture of a Language Specific Search Engine
[Figure: architecture diagram. Labels: WWW, crawler, language identification, language specific crawler, page repository, parser, indexer, corpus/lexicon, ranking engine, query engine, query, results, language specific search engine.]

Slide 30: 1. Lexicon
- "Lexicon" is also a synonym for dictionary or encyclopedic dictionary.
- In linguistics, the lexicon of a language is its vocabulary, including its words and expressions.
[Figure labels: Daily Newspaper, Web pages, URLs, Dictionary, Lexicon]

Slide 31: 2. Indexing
Indexing is the process of recording which documents of a corpus each keyword is assigned to.
[Figure: indexer and database table with columns ID, Web Pages and Lexicon, alongside Page 1, Page 2, Page 3, ..., Page N.]
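A common way to realise this is an inverted index that maps each lexicon entry to the IDs of the pages containing it. The sketch below is a generic illustration with invented English pages; splitting on whitespace is an assumption that would not hold for Myanmar text, which would first need word segmentation.

```python
from collections import defaultdict


def build_index(pages: dict, lexicon: set) -> dict:
    """Map each lexicon word to the set of page IDs in which it occurs."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.split():   # whitespace tokenization is an assumption
            if word in lexicon:
                index[word].add(page_id)
    return dict(index)


# Hypothetical pages and lexicon.
pages = {
    1: "myanmar web crawler",
    2: "language crawler",
    3: "myanmar language",
}
lexicon = {"myanmar", "crawler", "language"}

index = build_index(pages, lexicon)
print(sorted(index["myanmar"]), sorted(index["crawler"]))
```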

Slide 32: 3. Code Conversion
[Figure: transcoding between client and server. Labels: Unicode, lexicon encoded in Unicode, web page (contents), Unicode, non-Unicode, transcoding.]

Slide 33: 4. Stop Words Removal
- Stop words are defined as non-information-bearing words.
- Myanmar sentences can be tokenized by eliminating stop words.
Example glosses: computer (N), students (N), useful (Adj)
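The idea of tokenizing by eliminating stop words can be sketched on an unsegmented string: wherever a stop word occurs, the text is cut, and the remaining pieces become tokens. The example uses a toy English string written without spaces as a stand-in for Myanmar text, and the stop-word set is invented. This naive version would also cut inside any content word that happens to contain a stop word.

```python
import re


def split_on_stop_words(text: str, stop_words: set) -> list:
    """Cut unsegmented text at every stop word and return the remaining pieces."""
    if not stop_words:
        return [text] if text else []
    # Try longer stop words first so they are not shadowed by shorter ones.
    ordered = sorted(stop_words, key=len, reverse=True)
    pattern = "|".join(re.escape(word) for word in ordered)
    return [piece for piece in re.split(pattern, text) if piece]


tokens = split_on_stop_words("computerisusefulforstudents", {"is", "for"})
print(tokens)
```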

Slide 34: Stop-words list, English vs Myanmar
1. Subject personal pronouns: I, you, he, she, it, we, you, they
   uRefawmf? uRefr? ig? usKyf? uREkfyf? usaemf?
2. Object personal pronouns
3. Reflexive personal pronouns
4. Relative pronouns
5. Possessive pronouns and adjectives
6. Indefinite pronouns and adjectives
7. Demonstrative pronouns and adjectives
8. Conjunctions
9. Questions
10. Other (pronouns, prepositions)

Slide 35: 5. Stemming
- A stemming algorithm is a conflation procedure: it reduces all words with the same root to a single root.
- A stem is the portion of a word that is left after the removal of its affixes (i.e., prefixes and suffixes).
- e.g., "connect" is the stem for the variants "connected", "connecting", and "connections".
- e.g., [Myanmar example not preserved in the transcript]
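The English example can be reproduced with a very small suffix-stripping function. This is a deliberately naive sketch: the suffix list and the minimum stem length are arbitrary choices for this example, it is not a full stemming algorithm, and a Myanmar stemmer would need its own affix rules.

```python
# Suffixes are checked in this order, longest first.
SUFFIXES = ("ions", "ing", "ed", "s")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


variants = ["connected", "connecting", "connections", "connect"]
stems = {word: stem(word) for word in variants}
print(stems)
```

All four variants conflate to the single stem "connect".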

Slide 36: Thank you! Any questions?

