Information Extraction from HTML: General Machine Learning Approach Using SRV.

Abstract Information Extraction (IE) can be viewed as a machine learning problem. SRV, a relational learning algorithm, makes no assumptions about document structure or the kinds of information available; instead, it uses a token-oriented feature set. Results of a simple memorizing agent are compared against SRV on extracting data from university course and research-project web pages.

SRV A top-down relational learning algorithm. Its token-oriented features are easy to implement and add to the system. Domain-specific features are separate from the core system, making it easier to port to other domains. The basic feature set is easy to extend for differently formatted text, e.g. HTML.

Terminology "Title" is a field; an actual title is an instantiation (or instance). A page may have multiple fields of interest and multiple instantiations of each field. In traditional IE terms, a collection of tasks is called a template, each field is a slot, and each instantiation is a slot fill.
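As a concrete illustration of this terminology, the template/slot/slot-fill hierarchy can be sketched as plain data structures. This is a hypothetical sketch only; the field names and values below are invented, not taken from the paper.

```python
# A template is a collection of extraction tasks for one document.
# Each key is a field (slot); each extracted instance is a slot fill.
template = {
    "course_title": [],   # a slot may receive several instantiations
    "instructor":   [],
}

# A single page can contribute multiple slot fills per field.
template["course_title"].append("Introduction to Machine Learning")
template["instructor"].extend(["D. Freitag", "T. Mitchell"])

print(len(template["instructor"]))  # two slot fills for one slot
```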

Main Problem IE involves many sub-tasks, including syntactic and semantic pre-processing, slot filling, etc. SRV was developed to solve just the slot-filling problem: find the best unbroken fragment(s) of text to fill a given slot in the answer template.

Extraction = Text Classification Every candidate fragment is presented to a classifier, which is asked to rate how likely each one is to be a correct slot fill (e.g., a project member).
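This framing can be sketched in a few lines: enumerate every unbroken token fragment up to a length cap and score each with a classifier. The scoring function below is a toy stand-in for illustration, not SRV's learned rules.

```python
def candidate_fragments(tokens, max_len=4):
    """Yield every contiguous token span up to max_len tokens."""
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            yield tokens[i:j]

def toy_score(fragment):
    """Stand-in classifier: prefer short fragments of capitalized tokens."""
    caps = sum(1 for t in fragment if t[:1].isupper())
    return caps / len(fragment)

tokens = "the SRV system was built by Dayne Freitag".split()
best = max(candidate_fragments(tokens), key=toy_score)
print(best)  # the highest-scoring candidate fragment
```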

Relational Learning Relational learning (RL), or inductive logic programming, is suited to domains with rich relational structure: instance attributes are not isolated but logically related to each other. Predicates based on attribute values are greedily added to a rule under construction, and new attributes can be logically derived from existing ones. THE DATA USED FOR LEARNING IS IMPORTANT!

SRV Differs from previous systems by learning over an explicit set of simple and relational features. A simple feature maps a token to a discrete value; a relational feature maps a token to another token.

SRV – Predefined Predicate Types length(Relop N), where Relop is one of >, <, =. –length(= 2) accepts fragments with exactly 2 tokens. some(Var Path Feat Value) –some(?A [ ] capitalizedp true) means: "the fragment contains some capitalized token." –some(?A [prev_token prev_token] capitalizedp true) means: "the fragment contains a token such that the token two positions before it is capitalized."

SRV – More Predicate Types every(Feat Value) –every(numericp false) means: “every token in the fragment is non-numeric.” position(Var From Relop N) –position(?A fromfirst < 2) means: “the token ?A is either first or second in the fragment.” relpos(Var1 Var2 Relop N) –relpos(?A ?B > 1) means: “the token ?B is more than one token away from ?A.”
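The predicates above can be sketched as functions over a fragment of linked tokens. The feature names (capitalizedp, numericp) follow the slides; the Token class and the way the path is followed via a prev link are my own simplification of SRV's representation.

```python
class Token:
    def __init__(self, text, prev=None):
        self.text = text
        self.prev = prev  # relational feature: prev_token

def capitalizedp(tok):
    return tok.text[:1].isupper()

def numericp(tok):
    return tok.text.isdigit()

def length(fragment, relop, n):
    return {"=": len(fragment) == n,
            "<": len(fragment) < n,
            ">": len(fragment) > n}[relop]

def some(fragment, path, feat, value):
    """True if some token, after following path, has feat == value."""
    for tok in fragment:
        cur = tok
        for _step in path:       # e.g. ["prev_token", "prev_token"]
            cur = cur.prev
            if cur is None:
                break
        if cur is not None and feat(cur) == value:
            return True
    return False

def every(fragment, feat, value):
    return all(feat(tok) == value for tok in fragment)

# Build "Course : Machine Learning" as a chain of linked tokens.
prev, toks = None, []
for w in ["Course", ":", "Machine", "Learning"]:
    t = Token(w, prev)
    toks.append(t)
    prev = t
fragment = toks[2:]  # "Machine Learning"

print(length(fragment, "=", 2))                                   # True
print(some(fragment, [], capitalizedp, True))                     # True
print(some(fragment, ["prev_token", "prev_token"],
           capitalizedp, True))                                   # True
print(every(fragment, numericp, False))                           # True
```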

Validation Each third of the training data is set aside in turn, and the rules generated from the other two-thirds are validated against it. The number of total matches and correct matches is stored with each validated rule set. The three rule sets are then concatenated; all rules that match a given fragment are used to generate the confidence score for the extracted data.
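The hold-out scheme described above can be sketched as follows. The rule learner here is a hypothetical placeholder (a memorizer), not SRV's induction; the point is the three-fold validation and the per-rule-set match statistics.

```python
def train_rules(examples):
    # Placeholder for SRV rule induction: one "rule" that matches
    # exactly the fragments seen in training.
    memorized = {frag for frag, _ in examples}
    return [lambda frag, m=memorized: frag in m]

def validate(rules, held_out):
    total = correct = 0
    for frag, is_fill in held_out:
        if any(rule(frag) for rule in rules):
            total += 1
            correct += int(is_fill)
    return total, correct

data = [("CS 101", True), ("Fall 2013", False), ("CS 101", True),
        ("AI Lab", True), ("page 3", False), ("AI Lab", True)]
thirds = [data[0:2], data[2:4], data[4:6]]

rule_sets, stats = [], []
for i in range(3):
    train = [ex for j in range(3) if j != i for ex in thirds[j]]
    rules = train_rules(train)
    rule_sets.extend(rules)          # concatenate the three rule sets
    stats.append(validate(rules, thirds[i]))

print(stats)  # (total, correct) pairs, later used for confidence
```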

Generation and Validation (SRV)

Adapting SRV for HTML Easier than with previous methods, because SRV has "core" features and is extensible: simply add features to deal with the new domain.

HTML Testing (SRV) 105 course web pages and 96 research-project pages from Cornell, the University of Texas, the University of Washington, and the University of Wisconsin. Tested twice: once with random partitioning and once with LOUO (leave one university out).

Results (One-Per-Document) The system returns only one prediction even when more than one candidate is found: simply return the prediction with the highest confidence.
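The one-per-document rule reduces to an argmax over confidence scores. A minimal sketch, with invented predictions:

```python
# Hypothetical (fill, confidence) predictions for one document.
predictions = [("Intro to AI", 0.62),
               ("Course Schedule", 0.41),
               ("Intro to AI Lab", 0.88)]

# Return only the highest-confidence prediction for the document.
best_fill, best_conf = max(predictions, key=lambda p: p[1])
print(best_fill)
```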

Results (Many-Per-Document) A more difficult task: when multiple possibilities are found, all of them must be returned.

Baseline Approach Comparison A simple memorizing agent
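One plausible reading of a "simple memorizing agent" is a rote learner: remember every fragment labeled as a correct fill in training, and at test time predict exactly those fragments and nothing else. A hedged sketch of that baseline:

```python
class MemorizingAgent:
    """Rote baseline: predicts only fragments seen as fills in training."""

    def __init__(self):
        self.seen = set()

    def train(self, labeled_fragments):
        for fragment, is_fill in labeled_fragments:
            if is_fill:
                self.seen.add(fragment)

    def extract(self, fragments):
        return [f for f in fragments if f in self.seen]

agent = MemorizingAgent()
agent.train([("Dayne Freitag", True), ("home page", False)])
print(agent.extract(["Dayne Freitag", "Tom Mitchell"]))
```

Such an agent fails on any fill it has never seen verbatim, which is exactly the gap a learner like SRV is meant to close.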

Some Conclusions SRV performs better than the simple memorizing agent in all cases. With HTML features added, it performs much better still; note, however, that these features are NOT NECESSARY for SRV to function. Random partitioning works better than LOUO, probably because web pages within each university share formatting conventions that do not carry over between universities.

Some Conclusions The recall/accuracy and precision/coverage trade-off can be "tweaked" by discarding rules with confidence below some threshold x%, so the system can be tuned to a particular application. Domain-specific information is separate from the learning mechanism, which allows adaptation to other domains. SRV uses top-down (general-to-specific) induction, in contrast to previous bottom-up models that relied on heuristics and implicit features to trim constraints.
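The confidence-threshold knob described above can be sketched directly from the validation statistics. The rule records here are hypothetical (name, correct, total) tuples, invented for illustration:

```python
def confidence(correct, total):
    return correct / total if total else 0.0

# (rule name, correct matches, total matches) from validation.
rules = [("r1", 9, 10), ("r2", 3, 10), ("r3", 7, 8)]

def keep_rules(rules, threshold):
    """Discard rules whose validation confidence is below threshold."""
    return [name for name, c, t in rules if confidence(c, t) >= threshold]

# A high threshold trades coverage for precision; a low one the reverse.
print(keep_rules(rules, 0.85))  # ['r1', 'r3']
print(keep_rules(rules, 0.25))  # ['r1', 'r2', 'r3']
```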