1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Spring Research Conference.

Presentation transcript:

1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Spring Research Conference March 20, 2004 Research funded by NSF grant #IIS

2 Genealogical Information on the Web
- Hundreds of thousands of sites
  - Some professional (Ancestry.com, Familysearch.org)
  - Mostly hobbyist (203,200 indexed by Cyndislist.com)
- Search engines
  - “Walker genealogy” on Google: 199,000 results
  - 1 page/minute = 5 months to go through
- Why not enlist the help of a computer?

3 Problems
- No standard way of presenting data
- Sites have differing schemas

4 Proposed Solution
- Based on Ontos and other work done by the BYU Data Extraction Group (DEG)
- Able to extract from:
  - Single-record documents
  - Simple multiple-record documents
  - Complex multiple-record documents
- Robust to changes in pages
- Easily adaptable to other domains

5 Person Ontology

6 Record Separation
- Separating data related to each person
- Previous technique
  - Combines many heuristics
  - Has problems
    - Assumes multiple records
    - Must be simple separation

7 Single-Record Document

8 Simple Multiple-Record Document

9 Complex Multiple-Record Document

10 Vector Space Modeling
- Ontology vector: {0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0}
- Compare to candidate records
  - Cosine measure
  - Magnitude measure
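The two measures named on this slide can be sketched as follows. This is a minimal illustration, not the implementation behind the slides. The cosine measure is the standard cosine of the angle between two vectors; the magnitude measure is assumed here to be the candidate vector's length divided by the ontology vector's length, which the slides do not spell out. The function names are hypothetical, and the candidate vector is the one shown on a later slide.

```python
import math

# Ontology vector from the slide; the candidate vector appears on a later slide.
ONTOLOGY = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0]
CANDIDATE = [0, 141, 89, 76, 0, 0, 48, 23]


def length(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def cosine_measure(ontology, candidate):
    """Cosine of the angle between the vectors (0.0 if either is all zeros)."""
    dot = sum(o * c for o, c in zip(ontology, candidate))
    norm = length(ontology) * length(candidate)
    return dot / norm if norm else 0.0


def magnitude_measure(ontology, candidate):
    """Assumed definition: candidate length relative to ontology length."""
    ontology_length = length(ontology)
    return length(candidate) / ontology_length if ontology_length else 0.0


if __name__ == "__main__":
    print("cosine:   ", cosine_measure(ONTOLOGY, CANDIDATE))
    print("magnitude:", magnitude_measure(ONTOLOGY, CANDIDATE))
```

Under these assumptions, a cosine near 1 would mean the candidate has roughly the proportions of fields that the ontology expects for one person, while a magnitude far above 1 would suggest the candidate holds many records rather than one.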

11 Vector Space Modeling
[Figure: an HTML document (<!DOCTYPE…><html> … header … ) broken into nested candidate regions, each labeled with the vector computed for it:]
{0, 0, 0, 0, 0, 0, 0, 0}
{0, 141, 89, 76, 0, 0, 48, 23}
{0, 1, 0, 0, 0, 0, 0, 0}
{0, 140, 89, 76, 0, 0, 48, 23}
{0, 0, 0, 0, 0, 0, 0, 0}
{0, 138, 88, 76, 0, 0, 48, 23}
…

12 Improvements
- Differing schemas
  - Low cosine measures
  - Discarded data
  - Prune dimensions
    {0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0}
    {0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0}
- Richness of data in single-record documents
  - High magnitude measure
  - Higher magnitude to split documents
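One way to read "prune dimensions" is sketched below: drop every dimension in which the document vector is zero before taking the cosine, so that a site whose schema simply lacks some fields is not penalized for them. This interpretation is an assumption, as are the helper names `cosine` and `prune`; the two vectors are the ones shown on the slide.

```python
import math

ONTOLOGY = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0]
DOCUMENT = [0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0]


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune(ontology, document):
    """Assumed pruning rule: keep only dimensions where the document is nonzero."""
    kept = [(o, d) for o, d in zip(ontology, document) if d != 0]
    return [o for o, _ in kept], [d for _, d in kept]


if __name__ == "__main__":
    pruned_ontology, pruned_document = prune(ONTOLOGY, DOCUMENT)
    print("cosine before pruning:", cosine(ONTOLOGY, DOCUMENT))
    print("cosine after pruning: ", cosine(pruned_ontology, pruned_document))
```

Because pruning shortens the ontology vector while leaving the dot product and the document vector's length unchanged, the cosine can only stay the same or rise under this rule.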

13 Presenting Results

14 Preliminary Results
- Semi-structured text
  - 10 single-record documents
  - 3 simple documents containing 268 records
  - 3 complex documents containing 266 records
- Precision and recall calculated on record separation

15 Record Separation
          Recall   Precision
Single    100%     94.1%
Simple    94.7%    97.3%
Complex   88.3%    93.6%
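For reference, the sketch below shows the usual definitions of precision and recall and checks how the summary figures relate to this table. It assumes the standard definitions (the slides do not restate them) and that the averages quoted in the conclusion are unweighted means over the three document classes; the function names are illustrative only.

```python
# (recall %, precision %) per document class, copied from the table above.
RESULTS = {
    "Single": (100.0, 94.1),
    "Simple": (94.7, 97.3),
    "Complex": (88.3, 93.6),
}


def precision_recall(correct, found, actual):
    """Standard definitions: precision = correct / found, recall = correct / actual."""
    return correct / found, correct / actual


def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


avg_recall = mean(recall for recall, _ in RESULTS.values())
avg_precision = mean(precision for _, precision in RESULTS.values())

if __name__ == "__main__":
    print(f"average recall:    {avg_recall:.1f}%")
    print(f"average precision: {avg_precision:.1f}%")
```

The unweighted means round to 94.3% recall and 95.0% precision, which agrees with the averages reported on the conclusion slide.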

16 Conclusion
- Integrate, build on previous DEG work
- Accurate record separation
  - Average recall: 94.3%
  - Average precision: 95.0%
- Ontology based: easily adapted to other domains