ELIJAH: Extracting Genealogy from the Web By David Barney and Rachel Lee WhizBang! Labs.



Introduction
“A new era of family history work has arrived. As President Gordon B. Hinckley recently noted, ‘The Lord has inspired skilled men and women in developing new technologies which we can use to our great advantage in moving forward this sacred work.’”
Elder Russell M. Nelson, “A New Harvest Time,” Ensign, May 1998, 43

Introduction: The General Problem
There is a large amount of genealogical information already published on the web.
How do you put it into a usable format?
A search engine would be nice.

Introduction: The Specific Problem
Keyword search is not good enough.
  – Is 1897 a birth date, a death date, or something else?
Two main problems with extracting information:
  – Finding the fields (names, birth dates, …)
  – Associating the fields into records
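The difference between keyword search and the two extraction problems can be sketched in a few lines. This is a minimal Python illustration, not ELIJAH's Java code; the sample text, the `b.`/`d.` notation, and the regular expression are all assumptions made for the example.

```python
import re

# Hypothetical genealogy text, invented for this example.
SAMPLE = "John Smith b. 1897 d. 1962; Mary Smith b. 1870 d. 1897"

def keyword_hits(text, keyword):
    """Keyword search: return where the keyword occurs, with no meaning attached."""
    return [m.start() for m in re.finditer(re.escape(keyword), text)]

# Problem 1 (finding the fields): each named group labels one field.
# Problem 2 (associating fields): one match of the whole pattern is one record.
RECORD = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) b\. (?P<birth>\d{4}) d\. (?P<death>\d{4})"
)

def extract_records(text):
    """Return one dict per person, with the fields already tied together."""
    return [m.groupdict() for m in RECORD.finditer(text)]

hits = keyword_hits(SAMPLE, "1897")   # two hits, but birth or death is unknown
records = extract_records(SAMPLE)     # 1897 is a birth in one record, a death in the other
```

The keyword search finds both occurrences of 1897 but cannot say which is a birth year; the record pattern can, because it keeps each year attached to its label and its person.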

Example: a Genealogy Page
[Figure labels: HTML page; Relational/XML Database]
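As a rough illustration of the target side of this slide, the sketch below turns one extracted record into XML. The element names (`person`, `name`, `birth`, `death`) are hypothetical; the slides do not show the actual schema.

```python
import xml.etree.ElementTree as ET

def record_to_xml(record):
    """Serialize one extracted record as an XML fragment (hypothetical schema)."""
    person = ET.Element("person")
    for field in ("name", "birth", "death"):
        if field in record:
            ET.SubElement(person, field).text = record[field]
    return ET.tostring(person, encoding="unicode")

# Invented record, standing in for data pulled from an HTML page.
xml_text = record_to_xml({"name": "John Smith", "birth": "1897", "death": "1962"})
```

Once the data is in this form, each value carries its field name, so a query can ask for births in 1897 specifically.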

Related Work: Wrappers
Make a site-specific set of rules
Pro: highly accurate
Cons: not scalable, fragile
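A wrapper's trade-off can be seen in a small sketch. The markup and the rule below are invented for one imaginary site; they are assumptions for illustration, not rules from any real wrapper.

```python
import re

# A wrapper hard-codes one site's layout (hypothetical markup).
SITE_RULE = re.compile(
    r"<li><b>(?P<name>[^<]+)</b> \((?P<birth>\d{4})-(?P<death>\d{4})\)</li>"
)

def wrap(html):
    """Apply the site-specific rule and return the records it finds."""
    return [m.groupdict() for m in SITE_RULE.finditer(html)]

original = "<ul><li><b>John Smith</b> (1897-1962)</li></ul>"
redesigned = "<ul><li><strong>John Smith</strong>, 1897-1962</li></ul>"

accurate = wrap(original)    # the rule fits this layout exactly
broken = wrap(redesigned)    # a small markup change and nothing matches
```

The rule is accurate on the layout it was written for and returns nothing after a minor redesign, which is the fragility the slide refers to. Writing one such rule per site is what makes the approach hard to scale.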

Related Work: Global Models
General approach
  – Example: FlipDog.com
Pros: applies to any website, scalable
Cons: time-consuming to train and tune; accuracy can be low on specific sites

Our Approach: ELIJAH
Key: thousands of pages are produced by about 100 different software programs.
Combines the two previous methods
ELIJAH: Extracting Lineage Information with Java using Automated Heuristics
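One way to picture the combination is a two-stage pipeline: first decide which program generated a page, then apply the rule set written for that program. The sketch below is in Python rather than ELIJAH's Java and rests on assumptions: that a generator can be recognized by a signature string in the page, and that the rule shown matches the generator's output. The signatures, the `OtherTool` name, and the markup are hypothetical; the slides do not describe how the real classifier or rules work.

```python
import re

# Assumption: each generator leaves a recognizable signature in its pages.
SIGNATURES = {
    "Ged2HTML": re.compile(r"ged2html", re.IGNORECASE),
    "OtherTool": re.compile(r"othertool", re.IGNORECASE),  # hypothetical generator
}

# One rule set per generator instead of one per site (hypothetical markup).
RULES = {
    "Ged2HTML": re.compile(r"<h2>(?P<name>[^<]+)</h2>\s*Born: (?P<birth>\d{4})"),
}

def classify(html):
    """Return the name of the generator whose signature appears, or None."""
    for generator, signature in SIGNATURES.items():
        if signature.search(html):
            return generator
    return None

def extract(html):
    """Classify the page, then apply that generator's rules if any exist."""
    generator = classify(html)
    rule = RULES.get(generator)
    if rule is None:
        return generator, []
    return generator, [m.groupdict() for m in rule.finditer(html)]

page = '<meta name="generator" content="GED2HTML"><h2>John Smith</h2> Born: 1897'
result = extract(page)
```

Because one rule set covers every site built with the same program, the number of rule sets grows with the number of generators rather than the number of websites.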

ELIJAH Architecture

Example: ELIJAH in action
[Figure labels: classifier; Ged2HTML rules]

Experiment
Wrote rules for the 15 most common formats (out of about 100)
Ran ELIJAH on 51 random websites with family tree information
Counted a failure if ELIJAH
  – couldn’t identify what format it was
  – didn’t extract information
  – extracted information with errors
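The three failure criteria can be written as a small scoring function. This is a hedged sketch: the outcome labels, the comparison against a hand-checked `expected` list, and the toy inputs are assumptions for illustration and are not the experiment's data or code.

```python
def judge(generator, extracted, expected):
    """Apply the three failure criteria in order and return an outcome label."""
    if generator is None:
        return "unidentified format"
    if not extracted:
        return "nothing extracted"
    if extracted != expected:
        return "extraction errors"
    return "success"

def success_rate(outcomes):
    """Fraction of outcomes labeled success; 0.0 for an empty list."""
    if not outcomes:
        return 0.0
    return sum(o == "success" for o in outcomes) / len(outcomes)

# Toy inputs, invented for illustration only.
truth = [{"name": "John Smith", "birth": "1897"}]
outcomes = [
    judge("Ged2HTML", truth, truth),
    judge(None, [], truth),
    judge("Ged2HTML", [], truth),
    judge("Ged2HTML", [{"name": "John Smith", "birth": "1962"}], truth),
]
rate = success_rate(outcomes)
```

Checking the criteria in this order means each failure is attributed to the earliest stage that went wrong: classification first, then extraction, then correctness.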

Results
With the 15 rule sets, we extracted data from
  – 33% of all pages
  – 41% of machine-generated pages
  – 55% of machine-generated pages with sufficient HTML formatting

Conclusion
With only 15% of the work (rule sets for 15 of about 100 formats), we got 55% of the information that we targeted
We preserved the meaning of the website data and can put it in a database

More to Come?
Tools developed at WhizBang! Labs, Inc. will significantly improve Global Models, Hand Wrappers, and the ELIJAH approach.
As the “Spirit of Elijah” spreads throughout the world, technology will assist the massive work.