Automating the Extraction of Domain-Specific Information from the Web: A Case Study for the Genealogical Domain. Troy Walker, Thesis Defense, November 19, 2004

Presentation transcript:

1 Automating the Extraction of Domain-Specific Information from the Web: A Case Study for the Genealogical Domain
Troy Walker
Thesis Defense, November 19, 2004
Research funded by NSF grant #IIS

2 Genealogical Information on the Web
- Hundreds of thousands of sites
  - Some professional (Ancestry.com, Familysearch.org)
  - Mostly hobbyist (240,400 indexed by Cyndislist.com)
- Search engines
  - “Walker genealogy” on Google: 523,000 results
  - 1 page/minute = 1 year to go through
- Why not enlist the help of a computer?

3 Problems
- No standard way of presenting data
- Sites have differing schemas
- Web pages change
- New pages continuously come on line

4 GeneTIQS
- Based on work done by BYU DEG
- Able to extract from:
  - Single-record documents
  - Simple multiple-record documents
  - Complex multiple-record documents
- Robust to changes in pages
- Immediately works for new pages

5 Person Ontology

6

7 Value Matchers

8 Record Separation
- Separating data related to each person
- Previous technique
  - Combines many heuristics
  - Has problems
    - Assumes multiple records
    - Must be simple separation

9 Single-Record Document

10 Simple Multiple-Record Document

11 Complex Multiple-Record Document

12 Vector Space Modeling
- Ontology Vector
- Compare to candidate records
  - Cosine measure
  - Magnitude measure

13 Ontology Vector { 0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0}

14 Vector Space Modeling
Figure: an HTML document (<!DOCTYPE…><html> … header … ) shown with vectors alongside its parts:
{0, 0, 0, 0, 0, 0, 0, 0, 0}
{0, 149, 89, 76, 0, 0, 48, 23, 23}
{0, 1, 0, 0, 0, 0, 0, 0, 0, 0}
{0, 148, 89, 76, 0, 0, 48, 23, 23}
{0, 0, 0, 0, 0, 0, 0, 0, 0}
{0, 146, 88, 76, 0, 0, 48, 23, 23}
…
{0, 1, 1, 1, 0, 0, 0, 0, 0}
Dimension labels (order as extracted): Gender, Christening, Burial, Marriage, Relation Name, Birth, Name, Death, Relationship
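The two measures named on the Vector Space Modeling slides can be sketched as follows. This is only an illustration, not the GeneTIQS implementation: the cosine measure is the standard cosine of the angle between two vectors, while the definition of the magnitude measure used here (candidate magnitude divided by ontology magnitude) is an assumption, since the slides do not define it. The function names are hypothetical; the vectors are taken from the slides.

```python
import math

# Ontology vector as shown on the "Ontology Vector" slide.
ONTOLOGY = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0]


def magnitude(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_measure(candidate, ontology=ONTOLOGY):
    """Cosine of the angle between a candidate vector and the ontology vector."""
    denom = magnitude(candidate) * magnitude(ontology)
    if denom == 0:
        return 0.0  # an all-zero candidate matches nothing
    dot = sum(c * o for c, o in zip(candidate, ontology))
    return dot / denom


def magnitude_measure(candidate, ontology=ONTOLOGY):
    """ASSUMPTION: ratio of candidate length to ontology length.

    The slides name a "magnitude measure" without defining it; this ratio
    is one plausible reading, chosen only for illustration.
    """
    return magnitude(candidate) / magnitude(ontology)


# One of the candidate vectors shown on the slide.
candidate = [0, 149, 89, 76, 0, 0, 48, 23, 23]
print(f"cosine    = {cosine_measure(candidate):.3f}")
print(f"magnitude = {magnitude_measure(candidate):.1f}")
```

Under this reading, a candidate holding many records has a magnitude measure far above 1, which is consistent with the later slide's note that a higher magnitude is a signal to split a document.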

15 Problems and Improvements
- Differing schemas
  - Low cosine measures
  - Discarded data
  - Prune dimensions
    {0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0}
    {0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0, 23.0}
- Richness of data in single-record documents
  - High magnitude measure
  - Higher magnitude to split documents
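A minimal sketch of why pruning dimensions helps with differing schemas. It assumes that pruning means dropping every dimension in which the document vector is zero before computing the cosine; the slide does not spell out the rule, so treat that as an assumption. The helper names are hypothetical, and the two vectors are the ones shown on the slide.

```python
import math


def cosine(a, b):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune(ontology, candidate):
    """ASSUMPTION: drop dimensions the document never uses (candidate == 0)."""
    kept = [(o, c) for o, c in zip(ontology, candidate) if c != 0]
    return [o for o, _ in kept], [c for _, c in kept]


ontology = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0]
candidate = [0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0, 23.0]

pruned_ontology, pruned_candidate = prune(ontology, candidate)
full_score = cosine(ontology, candidate)
pruned_score = cosine(pruned_ontology, pruned_candidate)

print(f"cosine before pruning: {full_score:.3f}")
print(f"cosine after pruning:  {pruned_score:.3f}")
```

Removing a zero-valued candidate dimension leaves the dot product and the candidate's length unchanged but shortens the ontology vector, so the cosine can only stay the same or rise. A site whose schema simply omits some fields is then no longer penalized for them.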

16 Problems and Improvements
- Missed Simple Patterns
  - More than 3 records
  - Valid Records:Total Records > 2:3
  - Keep all
  - Discard header and footer
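One possible reading of the simple-pattern rule above, sketched in Python. The slide gives only the thresholds (more than 3 records, valid-to-total ratio above 2:3), so several details here are assumptions: what makes a record "valid" is left to a caller-supplied predicate, and "discard header and footer" is interpreted as trimming invalid candidates from the start and end of the list. The function name `select_records` is hypothetical.

```python
def select_records(candidates, is_valid):
    """Apply the 'keep all' heuristic to candidate records in document order.

    candidates: list of candidate records.
    is_valid:   predicate deciding whether one candidate looks like a record
                (hypothetical; the slides do not define validity).
    """
    flags = [is_valid(c) for c in candidates]
    total, valid = len(flags), sum(flags)

    # Thresholds from the slide: more than 3 records, valid:total > 2:3.
    # Integer arithmetic avoids floating-point comparison of the ratio.
    if total > 3 and 3 * valid > 2 * total:
        # ASSUMPTION: leading and trailing invalid candidates are the
        # header and footer; everything between them is kept.
        first = flags.index(True)
        last = total - 1 - flags[::-1].index(True)
        return candidates[first:last + 1]

    # Otherwise fall back to keeping only the valid candidates.
    return [c for c, ok in zip(candidates, flags) if ok]


page = ["header", "rec1", "rec2", "odd", "rec3", "rec4",
        "rec5", "rec6", "rec7", "footer"]
kept = select_records(page, lambda c: c.startswith("rec"))
print(kept)
```

In this example 7 of 10 candidates are valid, so the ratio test passes and the odd-looking middle candidate is kept along with its neighbours, while the header and footer are dropped.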

17 Demonstration

18 Presenting Results

19 Evaluation
- Semi-structured Text
  - 21 single-record documents
  - 10 simple documents containing 130 records
  - 20 complex documents with 238 records
- Precision and recall for record separation
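For reference, precision and recall for record separation can be computed as below, using the standard definitions and the column names that appear in the result tables (records, returned, correct). The counts in the example call are made up for illustration and are not taken from the evaluation.

```python
def precision_recall(records, returned, correct):
    """Standard precision and recall for record separation.

    records:  number of records actually present in the documents
    returned: number of records the separator produced
    correct:  number of returned records that are right
    """
    precision = correct / returned if returned else 0.0
    recall = correct / records if records else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(records=20, returned=18, correct=17)
print(f"precision = {p:.2%}, recall = {r:.2%}")
```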

20 Single-Record Documents
Table (per-document values lost in extraction). Columns: records, returned, correct, precision, recall. Rows: 21 single-record documents. Total recall: 90.48%.

21 Simple Multiple-Record Documents
Two tables (most per-document values lost in extraction), each with columns records, returned, correct, precision, recall and one row per simple document.
VSM Separator: total recall 93.08%.
Highest-Fanout Separator: total recall 66.92%.

22 Complex Multiple-Record Documents
Table (most per-document values lost in extraction). Columns: records, returned, missed, extra, correct, precision, recall. Rows: 20 complex documents. Total recall: 91.60%.

23 Conclusion
- Integrate, build on previous DEG work
- Accurate record separation
  - Average recall: 92%
  - Average precision: 93%
- Ontology based
  - Robust to changes in pages
  - Immediately works with new pages

24 Future Work
- Scale
  - Distribute computation
  - Intelligent URL selector
- More Sources
  - Tables
  - Forms and dynamic pages
  - Obtain more information behind links

25 Future Work
- Improve VSM record separation
  - Weight object importance
  - Disambiguate before record separation
  - Recognize patterns
- Improve detection of single-record documents