Institute of Informatics & Telecommunications NCSR “Demokritos” Spidering Tool, Corpus collection Vangelis Karkaletsis, Kostas Stamatakis, Dimitra Farmakiotou.


1 Institute of Informatics & Telecommunications, NCSR “Demokritos”. Spidering Tool, Corpus Collection. Vangelis Karkaletsis, Kostas Stamatakis, Dimitra Farmakiotou. Paris, December 4-5, 2002

2 © IIT, NCSR “Demokritos”, Paris 4-5 December 2002 Contents
• Spidering Tool
• Page Filtering: Corpus Formation, Evaluation Results
• Link Scoring: Corpus Formation, Evaluation Results
• Corpus collection for the needs of NERC, FE

3 CROSSMARC big picture
[Architecture diagram. Components: WEB; Focused Crawling; Domain-specific Web sites; Domain-specific Spidering; Domain Ontology; Web Pages Collection (XHTML pages); NERC-FE, comprising Multilingual NERC and Name Matching (XHTML pages with NE annotations) and Multilingual and Multimedia Fact Extraction (XML pages); Insertion into the data base; Products Database; User Interface; End user]

4 Domain-specific Spidering; Domain Ontology

5 NEAC: Web Pages Collection
[Diagram: URL list (input) → NEAC → XHTML pages (output)]

6 NEAC: Web Pages Collection
[Diagram: URL list (input) → NEAC → XHTML pages (output)]

7 NEAC: Web Pages Collection
[Diagram: NEAC takes a URL list as input and outputs XHTML pages. Connection reads one URL at a time from the Queue; Error URLs return to the Queue. Page Processing consists of Content Processing and Link Processing: the Link Scoring Module sends new interesting links to the Queue, and the Page Filtering Module marks the page OK (save page: Meta TIDY, log update) or NOT OK (ignore page). Two further lists hold the IDs of pages visited during the previous visit (found in the tool’s log file) and the IDs of pages visited during the current visit.]

8 NEAC: What is new
[Same NEAC diagram as on the previous slide, with the following changes highlighted]
• Lists 2 and 3 both contain MD5 hash IDs; list 2 is the log file of the previous visit
• The Page Filtering and Link Scoring modules are now both based on ML algorithms

9 Step 1: Connection to Site
• Before the connection
  - check the URL’s protocol
  - check for common mistakes in the URL
• Open connection & get response code
  - put error-URLs back in the queue (they get a 2nd try)
  - create an error-log file listing the error-URLs
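The connection step can be sketched as follows. This is a minimal illustration, not NEAC's actual code: the function names, the set of accepted protocols, the "common mistakes" that get repaired, and the injected `fetch` callable (assumed to return a `(status, body)` pair) are all assumptions made for the example.

```python
from collections import deque
from urllib.parse import urlparse

# Assumption: the slides only say "check URL's protocol"; http/https is a guess.
ALLOWED_SCHEMES = {"http", "https"}


def normalize_url(raw):
    """Repair a few common typing mistakes; return a clean URL or None."""
    url = raw.strip().replace("\\", "/")
    if "://" not in url:
        url = "http://" + url  # assumption: a missing scheme means plain HTTP
    parts = urlparse(url)
    if parts.scheme.lower() not in ALLOWED_SCHEMES or not parts.netloc:
        return None
    return url


def crawl_connections(urls, fetch, max_tries=2):
    """Try each URL up to max_tries times; return (pages, error_log)."""
    queue = deque((u, 1) for u in urls)
    pages, error_log = {}, []
    while queue:
        raw, attempt = queue.popleft()
        url = normalize_url(raw)
        if url is None:
            error_log.append((raw, "bad URL"))
            continue
        try:
            status, body = fetch(url)
        except OSError as exc:  # network failures count as an error response
            status, body = None, str(exc)
        if status == 200:
            pages[url] = body
        elif attempt < max_tries:
            queue.append((raw, attempt + 1))  # error-URL gets its 2nd try
        else:
            error_log.append((url, f"status {status}"))
    return pages, error_log
```

Passing `fetch` in as a parameter keeps the retry logic independent of any particular HTTP client; the entries of `error_log` are what would be written to the error-log file.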

10 Step 2a: Page Content Processing
• Check for frames and split them if found
• Clear code: scripts and meta-tags, which often contain product characteristics that may mislead the scoring process, are removed from the code
• Link Scoring
• Page Filtering
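A minimal sketch of the "clear code" idea, using only Python's standard `html.parser`. The class and function names are hypothetical, and the slides do not say how NEAC performs this step; the sketch simply re-emits the page while leaving out `<script>` elements and `<meta>` tags. Comments and doctype declarations are also dropped, as a side effect of not handling them.

```python
from html.parser import HTMLParser


class CodeCleaner(HTMLParser):
    """Re-emit HTML while dropping <script> elements and <meta> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.skip_depth += 1
        elif tag != "meta" and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; script and meta are still dropped.
        if tag not in ("script", "meta") and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag != "meta" and self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def clean_code(html):
    """Return the page source without scripts and meta-tags."""
    cleaner = CodeCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Removing these elements before scoring means that product keywords hidden in meta-tags or scripts cannot inflate a page's or a link's score.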

11 Step 2b: Link Scoring
• Collect all supported link types
  - text links
  - image links
  - image maps
• Multi-check
  - check whether the link points or redirects to another site
  - check whether it points to a non-HTML file (text document, executable, image, video, sound)
  - check whether the host is relevant (‘brother URL’)
  - check whether it is already in the ‘Queue’
• Score links one by one: every link whose score exceeds a threshold is put in the ‘Queue’
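The multi-check and thresholding can be sketched as below. Everything here is illustrative: the extension list, the 0.5 threshold, the `brother_hosts` parameter and the `score` callable are assumptions, and the redirect check is left out because it needs a live connection.

```python
from urllib.parse import urljoin, urlparse

# Illustrative list only; the slides name the file kinds, not the extensions.
NON_HTML_EXTENSIONS = (".txt", ".pdf", ".doc", ".exe", ".zip", ".jpg",
                       ".gif", ".png", ".avi", ".mpg", ".mp3", ".wav")


def passes_multi_check(link, page_url, queue, brother_hosts=()):
    """Return the absolute URL if the link survives every check, else None."""
    url = urljoin(page_url, link)
    parts = urlparse(url)
    allowed_hosts = {urlparse(page_url).netloc, *brother_hosts}
    if parts.netloc not in allowed_hosts:  # points to another site
        return None
    if parts.path.lower().endswith(NON_HTML_EXTENSIONS):  # non-HTML file
        return None
    if url in queue:  # already queued
        return None
    return url


def enqueue_links(links, page_url, queue, score, threshold=0.5,
                  brother_hosts=()):
    """Score surviving links one by one; queue those above the threshold."""
    for link in links:
        url = passes_multi_check(link, page_url, queue, brother_hosts)
        if url is not None and score(url) > threshold:
            queue.append(url)
    return queue
```

In the tool described here, `score` would be the ML-based Link Scoring Module; in the sketch it is any function that maps a URL to a number.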

12 Step 2c: Page Filtering
• Double check
  - Check List 2: if a page ID is among the IDs logged during the tool’s previous visit, the page is not saved.
  - Check List 3: if a page ID is found in list 3, the page has already been visited during the current visit.
• Page filtering
  - The ‘Page Filtering’ module decides whether the page contains domain target information. If so, the page is saved.
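A sketch of the double check followed by the filtering decision. The slides state that the lists hold MD5 hash IDs but not what is hashed; hashing the page bytes is an assumption here, as are the function names and the `is_domain_page` callable standing in for the Page Filtering module.

```python
import hashlib


def page_id(content):
    """MD5 hex digest used as page ID (assumption: hash of the page bytes)."""
    return hashlib.md5(content).hexdigest()


def filter_page(content, previous_ids, current_ids, is_domain_page):
    """Return 'save' or 'ignore' for one fetched page."""
    pid = page_id(content)
    seen_this_visit = pid in current_ids
    current_ids.add(pid)  # List 3 records every page met in this visit
    if pid in previous_ids:  # List 2: logged during the previous visit
        return "ignore"
    if seen_this_visit:  # List 3: already handled in this visit
        return "ignore"
    return "save" if is_domain_page(content) else "ignore"
```

Keeping both lists as sets makes each membership check a constant-time lookup, so the double check stays cheap even for large sites.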

13 Step 3: Save Page
• Meta Tidy: transform the HTML code into XHTML
• Log file update (List 2): once a page is saved, its ID is added to the “previous-visit log file”
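The log-file update could look like the sketch below. The one-ID-per-line file format and both function names are assumptions; the slides only say that the ID of a saved page is added to the log file. The HTML-to-XHTML conversion is not shown.

```python
from pathlib import Path


def load_logged_ids(log_path):
    """Read List 2: one page ID per line; no file means no previous visit."""
    path = Path(log_path)
    if not path.exists():
        return set()
    lines = path.read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def log_saved_page(log_path, page_id, logged_ids):
    """Append the ID of a freshly saved page unless it is already logged."""
    if page_id in logged_ids:
        return False
    with open(log_path, "a") as log:
        log.write(page_id + "\n")
    logged_ids.add(page_id)
    return True
```

Appending to the file as each page is saved means that an interrupted run still leaves a usable log for the next visit.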

14 Navigation Schema
[Diagram. Elements: URL; NO FRAMES; FRAMES (split frames); LINKS; FORMS; SELECT LIST; SEARCH BOX; IMAGE MAP; JAVA SCRIPT; TEXT LINK; IMAGE LINK; TEXT CONSTANTS; OTHER. Each element is marked OK or ---.]

15 Use of the Corpus formation tool (for the needs of page filtering)
[Diagram. Elements: Positive pages; Unidentified pages; Ontology; 1 or more Lexicon(s); Corpus formation tool; Similar-to-positive pages; Manual classification; Negative pages]

16 Page Filtering Evaluation Results
Corpus sizes (one row per corpus):
Total  Positives  Negatives
950    281        669
1233   330        903
1787   447        1340
1110   361        749
Column labels as transcribed: "HMLH H H"
Precision: 0.95, 0.87, 0.95, 0.73, 0.98, 0.97
Recall: 1.00, 0.90, 0.98, 0.99, 0.92, 0.96, 0.20, 0.91
F-measure: 0.97, 0.93, 0.95, 0.97, 0.81, 0.97, 0.33, 0.94
Seven Weka algorithms used in evaluation tests: Naive Bayes, C4.5, IBk, SMO, AdaBoostM1, LogitBoost-DS. IBk performed best.
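For reference, the F-measure figures are consistent with the balanced harmonic mean of precision and recall. The sketch below assumes that this balanced form (F1) was used, and that the transcribed values 0.73, 0.92 and 0.81 belong to the same column of the table.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.73, 0.92), 2))  # 0.81
print(round(f_measure(0.98, 0.96), 2))  # 0.97
```

The harmonic mean is pulled towards the lower of the two values, which is why a recall of 0.20 yields an F-measure of only 0.33 even when precision is high.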

17 Formation of corpus of scored links (for the needs of link scoring)
[Diagram, repeated for every site. Elements: Homepage; Page-Validating & Link-Grabbing Web Spider; Page classifier A; Page classifier B; Manually classified positives; Automatically classified positives; Conflicts; Unscored links; Training Phase; Link Scorer; Scored links]

18 Link Scoring Evaluation Results: GR Web Sites
www.acom.gr, www.asmacom.gr, www.1oneway.gr, www.rainbow.gr, www.ramshop.gr
[Chart with series labelled ML and H]

19 Link Scoring Evaluation Results: FR Web Sites
www.a2mi.fr, www.adamscomputers.fr, www.e-soph.fr, www.fujitsu-siemens.fr, www.nec-online.fr
[Chart with series labelled ML and H]

20 Link Scoring Evaluation Results: UK Web Sites
www.ishop.com, www.laptopcomputers.com, www.laptopsetc.com, www.pro-star.com, www.winbook.com
[Chart with series labelled ML and H]

21 Link Scoring Evaluation Results: IT Web Sites
www.computerhouse.pisa.it, www.computershop.pisa.it, www.microdata.it, www.pcbook.it, www.pc-si.it
[Chart with series labelled ML and H]

22 Corpus Collection Methodology (for the needs of NERC + FE)
• Selection of at least 50 domain-relevant sites per language
• Investigation of domain characteristics, e.g. presentation types, in relation to the IE tasks performed by the systems
• Quantitative analysis of domain characteristics per language → Statistics
• Selection of pages for each corpus according to the Statistics for the corresponding language
• The Training corpus is equal in size to the Testing corpus; the size is the same for all languages and is agreed between the partners for each domain

23 Corpus Collection (1)
• 1st domain: completed
• 2nd domain (job announcements appearing on the sites of Computing/Telecommunications companies):
  - The investigation of domain-relevant sites and the statistical analysis for the 4 languages have been completed (at least 50 sites per language have been categorized); there were problems with the URLs sent by some partners
  - Domain characteristics: 4 presentation types, size in words, images, existence of interesting facts for extraction, rate of job categories

24 Corpus Collection Methodology (2): Categories of Web pages containing Product Descriptions (2nd Domain)

25 Corpus Collection (3): 2nd Domain Presentation Category Results
Category  English  French  Hellenic  Italian
A1        62%      47%     39%       34%
A2        0%       0%      0%        0%
A3        0%       0%      0%        0%
B1        34%      30%     55%       62%
B2        0%       0%      0%        0%
B3        0%       0%      0%        0%
B4        0%       0%      0%        0%
B5        4%       15%     6%        2%
B6        0%       0%      0%        0%
B7        0%       8%      0%        2%
TOTAL SITES (digits as transcribed): 50505250505050

26 Corpus Collection (4): 2nd Domain Other Corpus Characteristics

27 Corpus Collection (5): 2nd Domain Characteristics
Characteristic    English  French  Hellenic  Italian
f0                10%      2%      2%        24%
f1                90%      98%     98%       76%
w0                26%      23%     37%       36%
w1                74%      77%     63%       64%
i0                72%      66%     40%       90%
i1                28%      34%     60%       10%
Comp/Engineering  47%      47%     47%       53%
Finance           1%       12%     1%        2%
Management        17%      10%     17%       15%
Marketing         6%       13%     10%       8%
Sales             29%      15%     25%       20%
Other             0%       3%      0%        2%

28 Corpus Collection (6): New Training & Testing Corpora (2nd Domain)
Suggestions for the corpora of the 2nd domain:
• Size of the corpora: 200 pages (100 Training + 100 Testing); annotation time is estimated to be lower
• Maximum number of pages from one site and type, for every language: 10
• Share of the pages in the Testing corpora that must come from sites not represented in the Training corpus: 30%+
• Page naming convention: shorter names, e.g. A1-apple-1.htm
No comments have been received from partners so far. We need feedback in order to continue collecting pages and to separate the corpora into Training and Testing.
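The suggested constraints can be checked mechanically once a candidate split exists. The sketch below is one possible reading: the tuple layout, the function names, and the choice to count the per-site-and-type limit across both corpora of one language are assumptions, not part of the proposal.

```python
from collections import Counter


def page_name(category, site, number):
    """Short page name following the suggested pattern, e.g. A1-apple-1.htm."""
    return f"{category}-{site}-{number}.htm"


def check_split(training, testing, size=100, max_per_site_type=10,
                min_unseen=0.30):
    """Check one language's corpora; pages are (site, page_type, filename).

    Returns a list of violated constraints (empty if the split is fine).
    """
    problems = []
    if len(training) != size or len(testing) != size:
        problems.append("corpus size")
    # Assumption: the limit of 10 is counted over Training and Testing together.
    counts = Counter((site, ptype) for site, ptype, _ in training + testing)
    if any(n > max_per_site_type for n in counts.values()):
        problems.append("too many pages from one site and type")
    train_sites = {site for site, _, _ in training}
    unseen = sum(1 for site, _, _ in testing if site not in train_sites)
    if testing and unseen / len(testing) < min_unseen:
        problems.append("too few testing pages from unseen sites")
    return problems
```

Running such a check per language before annotation starts would catch a split that draws too heavily on a single site or shares all of its sites between Training and Testing.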

