Information extraction from text Spring 2003, Part 4 Helena Ahonen-Myka.

2 In this part
- active learning
- unsupervised learning
- multilingual IE
- closing of the course

3 Active learning
- the key bottleneck is obtaining the labeled training data
- the cost of labeling documents is usually considerably less than the cost of writing the wrapper's extraction rules by hand
- but: labeling documents can require domain expertise, and is tedious and error-prone
- active learning methods try to minimize the amount of training data required to achieve a satisfactory level of generalization

4 Active learning
- basic idea:
  - start with a small amount of training data
  - run the learning algorithm
  - use the learned wrapper to predict which of the remaining unlabeled documents is most informative
- informative = the example would help the learning algorithm generalize most
- trivial example: if the set of examples contains duplicates, the learning algorithm should not ask the user to annotate the same document twice

5 Active learning
- e.g. the STALKER wrapper learning algorithm
  - learns a sequence of landmarks for scanning from the beginning of the document to the start of the fragment to be extracted
  - an alternative way to find the same position is to scan backwards from the end of the document for a different set of landmarks
- both learning tasks are solved in parallel on the available training data
- the two resulting wrappers are then applied to all the unlabeled documents
- the system asks the user to label one of the documents for which the two wrappers give different answers
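The forward/backward disagreement idea can be sketched in a few lines. This is a toy illustration, not STALKER itself: the documents, landmarks, and helper names below are all invented for the example.

```python
def extract_forward(doc, landmark):
    """Forward scan: return the text after the first `landmark`, up to ';'."""
    i = doc.find(landmark)
    return None if i < 0 else doc[i + len(landmark):].split(";")[0].strip()

def extract_backward(doc, landmark):
    """Backward scan: return the text before the last `landmark`, after ':'."""
    i = doc.rfind(landmark)
    return None if i < 0 else doc[:i].split(":")[-1].strip()

def pick_informative(docs):
    """Return a document on which the two learned rules disagree."""
    for doc in docs:
        a = extract_forward(doc, "Price:")
        b = extract_backward(doc, ";")
        if a != b:
            return doc
    return None

docs = [
    "Item: pen Price: 2 EUR ; in stock",
    "Item: ink Price: 5 EUR ; sold out",
    "Item: pad 3 EUR ; Price unknown",   # the two rules disagree here
]
print(pick_informative(docs))
```

On the first two documents both rules extract the same price, so only the third, where they disagree, is worth sending to the user for labeling.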

6 Active learning: other strategies
- compare: present for labeling the document that is textually least similar to the documents that have already been labeled
- unusual: choose the document with the most unusual proper nouns in it
- committee: invoke two different IE learning algorithms and select the document on which the learned rule sets most disagree

7 Unsupervised learning methods
- Brin: Extracting patterns and relations from the World Wide Web (DIPRE), 1998
- Yangarber et al.: Automatic acquisition of domain knowledge for IE (ExDisco), 2000
- Crescenzi et al.: RoadRunner: Towards automatic data extraction from large web sites, VLDB 2001

8 DIPRE
- in many cases we can obtain documents from multiple information sources, which will include descriptions of the same relation in different forms
- if several descriptions mention the same names as participants, there is a good chance that they are instances of the same relation

9 DIPRE
- e.g. finding (author, booktitle) pairs
1. start with a small seed set of (author, booktitle) pairs, like (Herman Melville, Moby Dick), e.g. 5 pairs
2. find occurrences of all those books on the web
3. from these occurrences, recognise patterns for the citations of books
4. search the web for these patterns and find new books
5. go to 2
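A toy version of this loop can be run over an in-memory corpus. Everything below is invented for illustration (the real system searched the web): the corpus, the seed, the helper names, and the deliberately crude capitalized-name regex.

```python
import re

# stand-in for "the web": a handful of sentences mentioning books
corpus = [
    "Herman Melville is the author of Moby Dick.",
    "Jane Austen is the author of Emma.",
    "Emma, written by Jane Austen, sold well.",
    "Hamlet, written by William Shakespeare, sold well.",
]

NAME = r"([A-Z][A-Za-z ]+)"   # crude capitalized-phrase capture

def learn_patterns(pairs):
    """Steps 2-3: find seed occurrences, keep the text between the names."""
    patterns = set()
    for author, title in pairs:
        for doc in corpus:
            a, t = doc.find(author), doc.find(title)
            if a < 0 or t < 0:
                continue
            if a < t:                                  # author ... title
                patterns.add(("AT", doc[a + len(author):t]))
            else:                                      # title ... author
                patterns.add(("TA", doc[t + len(title):a]))
    return patterns

def match_patterns(patterns):
    """Step 4: search the 'web' for the patterns, collect new pairs."""
    pairs = set()
    for order, mid in patterns:
        regex = NAME + re.escape(mid) + NAME
        for doc in corpus:
            for x, y in re.findall(regex, doc):
                pairs.add((x, y) if order == "AT" else (y, x))
    return pairs

pairs = {("Herman Melville", "Moby Dick")}   # step 1: the seed
for _ in range(2):                           # steps 2-5, two rounds
    pairs |= match_patterns(learn_patterns(pairs))
print(sorted(pairs))
```

The first round learns the pattern "A is the author of T" and finds (Jane Austen, Emma); the second round learns "T, written by A" from the new pair and finds (William Shakespeare, Hamlet).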

10 DIPRE
- we can also start from a pattern
- suppose that we are seeking patterns corresponding to the relation HQ between a company C and the location L of its headquarters
- we are initially given one such pattern: "C, headquartered in L" => HQ(C, L)

11 DIPRE
- we can search for instances of this pattern in the corpus in order to collect pairs of individuals in the relation HQ
- for instance, "IBM, headquartered in Armonk" => HQ("IBM", "Armonk")
- if we find other examples in the text which connect these pairs, e.g. "Armonk-based IBM", we might guess that the associated pattern "L-based C" is also an indicator of HQ
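The seed pattern can be turned into a regular expression for this search. A minimal sketch, assuming a deliberately naive name pattern (real occurrence matching would use proper named-entity recognition, as discussed earlier):

```python
import re

# "C, headquartered in L" => HQ(C, L); the name patterns are crude on purpose
hq = re.compile(r"([A-Z][\w&. ]+), headquartered in ([A-Z]\w+)")

m = hq.search("IBM, headquartered in Armonk, announced its results.")
print(m.groups())   # ('IBM', 'Armonk')
```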

12 DIPRE application
- good for things that always mean the same (like authors and their books)?
- does not work, e.g., with stock prices, because they are numbers?
- the names etc. may not always have the same form (HP, Hewlett-Packard)

13 ExDisco
- look for linguistic patterns which appear with a relatively high frequency in relevant documents
- the set of relevant documents is not known; they have to be found as part of the discovery process
- one of the best indications of the relevance of a document is the presence of good patterns -> circularity -> patterns and relevant documents are acquired in tandem

14 Preprocessing
- name recognition marks all instances of names of people, companies, and locations -> replaced with the class name (C-Person, C-Company, ...)
- a parser is used to extract all the clauses from each document
- for each clause, a tuple is built, consisting of the basic syntactic constituents (subject, verb, object)
- different clause structures (passive, ...) are normalized

15 Preprocessing
- because tuples may not repeat with sufficient frequency, each tuple is reduced to a set of pairs, e.g.
  - verb-object
  - subject-object
- each pair is used as a generalized pattern
- once relevant pairs have been identified, they can be used to gather the set of words for the missing roles
- e.g. verbs that occur with a relevant subject-object pair: "company {hire/fire/expel/...} person"
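The last step is a one-line set comprehension over the clause tuples. The tuples and the relevant pair below are made up for illustration:

```python
# normalized (subject, verb, object) tuples from a parsed corpus (invented)
clauses = [
    ("C-Company", "hire", "C-Person"),
    ("C-Company", "fire", "C-Person"),
    ("C-Company", "report", "profit"),
    ("C-Person", "join", "C-Company"),
]

# a subject-object pair already judged relevant
relevant_pair = ("C-Company", "C-Person")

# gather the verbs filling the missing role of the relevant pair
verbs = sorted({v for s, v, o in clauses if (s, o) == relevant_pair})
print(verbs)   # ['fire', 'hire']
```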

16 Discovery procedure
- an unsupervised procedure: the training corpus does not need to be annotated, not even classified
- the user must provide a small set of seed patterns regarding the scenario
- starting with this seed, the system performs a repeated, automatic expansion of the pattern set

17 Discovery procedure
1. the pattern set is used to divide the corpus U into a set of relevant documents, R, and a set of non-relevant documents, U – R
  - a document is relevant if it contains at least one instance of one of the patterns
2. search for new candidate patterns:
  - automatically convert each document in the corpus into a set of candidate patterns, one for each clause
  - rank patterns by the degree to which their distribution is correlated with document relevance

18 Discovery procedure
3. add the highest-ranking pattern to the pattern set
  - optionally present the pattern to the user for review
4. use the new pattern set to induce a new split of the corpus into relevant and non-relevant documents
5. repeat the procedure (from step 1) until some iteration limit is reached
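Steps 1-5 can be sketched on a toy corpus. Everything here is invented for the example, including the scoring formula and the stopping threshold; the actual ExDisco ranking is more elaborate.

```python
import math

# each "document" is reduced to the set of clause patterns it contains
docs = {
    "d1": {("C-Company", "appoint", "C-Person"), ("C-Company", "report", "profit")},
    "d2": {("C-Company", "appoint", "C-Person"), ("C-Person", "resign", "post")},
    "d3": {("C-Company", "report", "profit")},
    "d4": {("C-Company", "report", "profit"), ("C-Company", "sue", "C-Company")},
}

def relevant_docs(patterns):
    """Step 1: a document is relevant if it matches any accepted pattern."""
    return {d for d, pats in docs.items() if pats & patterns}

def best_new_pattern(patterns):
    """Step 2: rank candidate patterns by correlation with the relevant set."""
    rel = relevant_docs(patterns)
    best, best_score = None, 0.0
    for p in set().union(*docs.values()) - patterns:
        support = {d for d, pats in docs.items() if p in pats}
        overlap = support & rel
        # precision on the current split, weighted by relevant support
        score = len(overlap) / len(support) * math.log(1 + len(overlap))
        if score > best_score:
            best, best_score = p, score
    return best, best_score

patterns = {("C-Company", "appoint", "C-Person")}   # the seed
while True:                                          # steps 3-5
    p, score = best_new_pattern(patterns)
    if p is None or score < 0.5:                     # invented iteration limit
        break
    patterns.add(p)
print(sorted(patterns))
```

Starting from the "appoint" seed, the "resign" pattern is accepted (it occurs only in relevant documents), while "report" is rejected because it is spread across relevant and non-relevant documents alike.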

19 Example
- management succession scenario
- two initial seed patterns:
  - C-Company C-Appoint C-Person
  - C-Person C-Resign C-Company
- C-Company, C-Person: semantic classes
- C-Appoint = {appoint, elect, promote, name, nominate}
- C-Resign = {resign, depart, quit}

20 ExDisco: conclusion
- resources needed:
  - an unannotated, unclassified corpus
  - a set of seed patterns
- produces complete, multi-slot patterns

21 ROADRUNNER
- the system does not rely on user-specified examples
- it also does not require any interaction with the user during the wrapper induction process
- the wrapper induction system does not know the structure (schema) of the page content; the schema will be inferred during the induction process
- the system can handle arbitrarily nested structures

22 ROADRUNNER
- basic idea: the system works with two HTML pages at a time
- discovery is based on the study of similarities and dissimilarities between these pages
- mismatches are used to identify relevant structures and content to be extracted

23 Algorithm
- HTML documents are first pre-processed by a lexical analyzer -> a list of tokens
  - each token is either an HTML tag or a string
- the algorithm works on two objects at a time:
  - a sample
  - a wrapper
- the wrapper is represented by a union-free regular expression (a regular expression without the union, or choice, operator |)
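Since the wrapper is a union-free regular expression, a toy wrapper can be written directly with Python's `re`. The page and field layout below are invented: a capture group for a field, `(...)+` for an iterator, `(...)?` for an optional, and no `|` anywhere.

```python
import re

# a toy union-free wrapper: one field, one iterator, one optional
wrapper = re.compile(
    r"<html><b>(\w+)</b><ul>(?:<li>\w+</li>)+</ul>(?:<hr>)?</html>"
)

page = "<html><b>Books</b><ul><li>Emma</li><li>Hamlet</li></ul></html>"
print(bool(wrapper.fullmatch(page)))   # True
```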

24 Algorithm
- the process starts with two documents; one is chosen to be an initial version of the wrapper
- the wrapper is progressively refined: the algorithm tries to find a common regular expression for the wrapper and the sample by solving mismatches between the wrapper and the sample

25 Algorithm
- the sample is parsed using the wrapper
- a mismatch happens when some token in the sample does not comply with the grammar specified by the wrapper
- when a mismatch is found, the algorithm tries to solve it by generalizing the wrapper
- the algorithm succeeds if a common wrapper can be generated by solving all mismatches

26 Mismatches
- two kinds of mismatches:
  - string mismatches: mismatches that happen when different strings occur in corresponding positions of the wrapper and the sample
  - tag mismatches: mismatches between different tags on the wrapper and the sample, or between one tag and one string

27 Mismatches
- string mismatches: discovering fields
- if two pages belong to the same class, string mismatches may be due only to different values of a field -> discover fields
- to solve a string mismatch (e.g. 'John Smith' vs. 'Paul Jones'), the wrapper is generalized to mark the newly discovered field (e.g. 'John Smith' -> Text)
- constant strings do not originate fields in the wrapper
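The string-mismatch rule alone can be sketched on token lists. This is a simplified sketch under strong assumptions: both token lists have equal length (so no optionals or iterators are needed), and `#TEXT` is our placeholder name for a discovered field.

```python
def generalize(wrapper, sample):
    """Solve string mismatches only; give up on tag mismatches,
    which RoadRunner handles separately via optionals and iterators."""
    if len(wrapper) != len(sample):
        return None                      # would need optionals/iterators
    out = []
    for w, s in zip(wrapper, sample):
        if w == s:
            out.append(w)                # same tag or constant string
        elif w.startswith("<") or s.startswith("<"):
            return None                  # tag mismatch: not handled here
        else:
            out.append("#TEXT")          # string mismatch -> new field
    return out

page1 = ["<html>", "<b>", "John Smith", "</b>", "</html>"]
page2 = ["<html>", "<b>", "Paul Jones", "</b>", "</html>"]
print(generalize(page1, page2))
# ['<html>', '<b>', '#TEXT', '</b>', '</html>']
```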

28 Mismatches
- tag mismatches: discovering optionals and iterators
- strategy: first look for repeated patterns; then, if this attempt fails, try to identify an optional pattern
  - otherwise iterations may be missed
  - e.g. two books vs. three books would be interpreted as "two books are required + an optional book is possible"

29 Discovering optionals
- either on the wrapper or on the sample we have a piece of HTML code that is not present on the other side; by skipping this piece of code, we should be able to resume the parsing
1. the optional pattern is located by cross-search
  - "look forward" and decide whether the optional part is in the wrapper or in the sample
2. wrapper generalization
  - if s is the optional pattern, a new pattern (s)? is added to the wrapper

30 Discovering iterators
- e.g. one author has two books (wrapper) and the other has three books (sample)
1. the repeated pattern is located by terminal-tag search
  - both the wrapper and the sample contain at least one occurrence of the repeated pattern
  - the last token of the repeated pattern can be found immediately before the mismatch position; this token is called the terminal tag
  - one of the mismatching tokens is the initial tag of the repeated pattern: the one which is followed by the terminal tag

31 Discovering iterators
2. repeated pattern matching
  - the candidate occurrence of the repeated pattern is matched against some upward portion of the sample
  - the search succeeds if we manage to find a match for the whole pattern
3. wrapper generalization
  - the contiguous repeated occurrences of the pattern s (around the mismatch region) are replaced by (s)+

32 Solving mismatches
- once a mismatch has been solved, the parsing can be resumed
- if the parsing can be completed (= it reaches the ends of the pages), we have generated a common wrapper for the two pages
- more complex cases:
  - recursion: new mismatches can be generated when trying to solve a mismatch
  - backtracking: the algorithm may choose some alternative which does not lead to a correct solution -> choices can be backtracked and the next alternative tried

33 Evaluation
- cannot extract disjunctive structures
- as the names of the fields (attributes) are not known, the fields have to be named manually
  - also automatic post-processing might be possible?

34 Multilingual IE
- assume we have documents in two languages (English/French), and the user requires templates to be filled in one of the languages (English) from documents in either language
- "Gianluigi Ferrero a assisté à la réunion annuelle de Vercom Corp à Londres."
- "Gianluigi Ferrero attended the annual meeting of Vercom Corp in London."

35 Both texts should produce the same template fill:

  :=
    organisation: 'Vercom Corp'
    location: 'London'
    type: 'annual meeting'
    present: :=
      name: 'Gianluigi Ferrero'
      organisation: UNCLEAR

36 Multilingual IE: three ways of addressing the problem
- solution 1:
  - a full French-English machine translation (MT) system translates all the French texts to English
  - an English IE system then processes both the translated and the English texts to extract English template structures
  - this solution requires a separate full IE system for each target language (here: English) and a full MT system for each language pair

37 Multilingual IE: three ways of addressing the problem
- solution 2:
  - separate IE systems process the French and English texts, producing templates in the original source language
  - a 'mini' French-English MT system then translates the lexical items occurring in the French templates
  - this solution requires a separate full IE system for each language and a mini-MT system for each language pair

38 Multilingual IE: three ways of addressing the problem
- solution 3:
  - a general IE system, with separate French and English front ends
  - the IE system uses a language-independent domain model in which 'concepts' are related via bi-directional mappings to lexical items in multiple language-specific lexicons
  - this domain model is used to produce a language-independent representation of the input text – a discourse model

39 Multilingual IE: three ways of addressing the problem
- solution 3 (continued):
  - the required information is extracted from the discourse model, and the mappings from concepts to the English lexicon are used to produce templates with English lexical items
  - this solution requires a separate syntactic/semantic analyser for each language, and the construction of mappings between the domain model and a lexicon for each language

40 Multilingual IE
- which parts of the IE process/systems are language-specific?
- which parts of the IE process are domain-specific?

41 Closing
- what did we study:
  - stages of an IE process
  - learning domain-specific knowledge (extraction rules, semantic classes)
  - IE from (semi-)structured text

42 Closing
- exam: next week on Friday at (Auditorio)
  - alternative: in May
- last exercises: deadline on Tuesday
- remember the course feedback (Kurssikysely)!