Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University

Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University 10/4/2002

Introduction
- Text IE: AutoSlog-TS
- Semi-structured IE: IEPAD

AutoSlog-TS: Automatically Generating Extraction Patterns from Untagged Text Ellen Riloff University of Utah AAAI96

AutoSlog-TS
AutoSlog-TS is an extension of AutoSlog.
- It operates exhaustively by generating an extraction pattern for every noun phrase in the training corpus.
- It then evaluates the extraction patterns by processing the corpus a second time and generating relevance statistics for each pattern.
A more significant difference is that AutoSlog-TS allows multiple rules to fire if more than one matches the context.

AutoSlog-TS Concept

Relevance Rate
Pr(relevant text | text contains pattern_i) = rel-freq_i / total-freq_i
- rel-freq_i: the number of instances of pattern_i that were activated in relevant texts.
- total-freq_i: the total number of instances of pattern_i that were activated in the training corpus.
The motivation behind the conditional probability estimate is that domain-specific expressions will appear substantially more often in relevant texts than in irrelevant texts.

Rank function
Next, we use a rank function to rank the patterns in order of importance to the domain:
relevance rate * log2(frequency)
So, a person only needs to review the most highly ranked patterns.
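The relevance rate and rank function can be sketched in a few lines of Python. This is only an illustration: the pattern strings and counts below are made up, and "frequency" is taken to mean total-freq_i, which is an assumption since the slide does not say which count is meant.

```python
import math

def relevance_rate(rel_freq, total_freq):
    # Pr(relevant text | text contains pattern) = rel-freq / total-freq
    return rel_freq / total_freq

def rank_score(rel_freq, total_freq):
    # relevance rate * log2(frequency); "frequency" is assumed to be total-freq
    return relevance_rate(rel_freq, total_freq) * math.log2(total_freq)

def rank_patterns(counts):
    # counts maps a pattern to (rel_freq, total_freq); best-ranked pattern first
    return sorted(counts, key=lambda p: rank_score(*counts[p]), reverse=True)

# Hypothetical patterns and counts, for illustration only
counts = {
    "<subj> was murdered": (9, 10),
    "exploded in <np>": (4, 4),
    "<subj> said": (20, 100),
}
ranking = rank_patterns(counts)
```

With these made-up counts, a frequent but mostly irrelevant pattern such as "<subj> said" ends up last, so a reviewer who reads only the top of the list never has to look at it.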

Experimental Results
Setup
We evaluated AutoSlog and AutoSlog-TS by manually inspecting the performance of their dictionaries in the MUC-4 terrorism domain. We used the MUC-4 texts as input and the MUC-4 answer keys as the basis for judging "correct" output (MUC-4 Proceedings 1992).
Training
- AutoSlog-TS: 1500 texts, 50% relevant
- AutoSlog: 772 relevant texts

Testing
To evaluate the two dictionaries, we chose 100 blind texts from the MUC-4 test set (50 relevant texts and 50 irrelevant texts). We scored the output by assigning each extracted item to one of five categories: correct, mislabeled, duplicate, spurious, or missing.
- Correct: the item matched against the answer keys.
- Mislabeled: the item matched against the answer keys but was extracted as the wrong type of object.
- Duplicate: the item was coreferent with an item in the answer keys.
- Spurious: the item did not refer to any object in the answer keys.
- Missing: an item in the answer keys that was not extracted.

Experimental Results We scored three items: perpetrators, victims, and targets.

Experimental Results
We calculated recall as: correct / (correct + missing)
We calculated precision as: (correct + duplicate) / (correct + duplicate + mislabeled + spurious)
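As a quick check of the two formulas, here is a minimal Python sketch. The category counts are invented for illustration and are not results from the experiment.

```python
def recall(correct, missing):
    # recall = correct / (correct + missing)
    return correct / (correct + missing)

def precision(correct, duplicate, mislabeled, spurious):
    # duplicates count toward the numerator, as in the formula above
    return (correct + duplicate) / (correct + duplicate + mislabeled + spurious)

# Hypothetical counts for one slot type (not taken from the slides)
r = recall(correct=30, missing=20)
p = precision(correct=30, duplicate=10, mislabeled=5, spurious=35)
```

Note that duplicates help precision but do not enter recall at all, so extracting the same answer twice is not penalized under this scoring.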

Behind the scenes In fact, we have reason to believe that AutoSlog-TS is ultimately capable of producing better recall than AutoSlog because it generates many good patterns that AutoSlog did not. AutoSlog-TS produced 158 patterns with a relevance rate ≧ 90% and frequency ≧ 5. Only 45 of these patterns were in the original AutoSlog dictionary. The higher precision demonstrated by AutoSlog-TS is probably a result of the relevance statistics.

Future Directions A potential problem with AutoSlog-TS is that there are undoubtedly many useful patterns buried deep in the ranked list, which cumulatively could have a substantial impact on performance. The precision of the extraction patterns could also be improved by adding semantic constraints and, in the long run, creating more complex extraction patterns.

IEPAD: Information Extraction based on Pattern Discovery C.H. Chang, National Central University WWW10

Semi-structured Information Extraction
Information Extraction (IE)
- Input: HTML pages
- Output: A set of records

Pattern Discovery based IE
- Motivation
  Display of multiple records often forms a repeated pattern.
  The occurrences of the pattern are spaced regularly and adjacently.
- Now the problem becomes...
  Find regular and adjacent repeats in a string.

IEPAD Architecture
[Figure: architecture diagram with three components (Pattern Generator, Pattern Viewer, Extractor) and the labels HTML page, patterns, users, extraction rule, HTML pages, and extraction results]

The Pattern Generator
- Translator
- PAT tree construction
- Pattern validator
- Rule composer
[Figure: HTML page, Token Translator, a token string, PAT Tree Constructor, PAT trees and maximal repeats, Validator, advanced patterns, Rule Composer, extraction rules]

1. Web Page Translation
Encoding of HTML source
- Rule 1: Each tag is encoded as a token.
- Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore).
HTML example: Congo 242 Egypt 20
Encoded token string: T( )T(_)T( )T( )T(_)T( )T( )
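The two encoding rules can be sketched as a small tokenizer. This is an assumption-laden illustration, not the IEPAD translator: the regular expressions, the choice to drop attributes and uppercase tag names, and the sample markup (`<b>`, `<i>`, `<br>`) are all hypothetical, since the tags of the slide's own example did not survive.

```python
import re

# Split the source into tags and the text runs between them
PIECE = re.compile(r"<[^>]+>|[^<]+")
TAG_NAME = re.compile(r"<(/?)\s*([A-Za-z][A-Za-z0-9]*)")

def encode(html):
    tokens = []
    for piece in PIECE.findall(html):
        if piece.startswith("<"):
            m = TAG_NAME.match(piece)
            if m:
                # Rule 1: each tag becomes one token (attributes dropped here)
                tokens.append("<" + m.group(1) + m.group(2).upper() + ">")
        elif piece.strip():
            # Rule 2: any text between two tags becomes the TEXT token "_"
            tokens.append("_")
    return tokens

# Hypothetical markup around the slide's "Congo 242" record
tokens = encode("<b>Congo</b><i>242</i><br>")
```

Here `tokens` is `['<B>', '_', '</B>', '<I>', '_', '</I>', '<BR>']`: seven tokens, two of them TEXT, which has the same shape as the encoded string on the slide.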

Various Encoding Schemes

2. PAT Tree Construction
PAT tree: a binary suffix tree, that is, a Patricia tree constructed over all possible suffix strings of a text.
Example: T( ) 000 T( ) 001 T( ) 010 T( ) 011 T( ) 100 T(_) T( )T(_)T( )T( )T(_)T( )T( )

The Constructed PAT Tree

Definition of Maximal Repeats
Let α occur in S at positions p_1, p_2, p_3, …, p_k.
- α is left maximal if there exists at least one (i, j) pair such that S[p_i - 1] ≠ S[p_j - 1].
- α is right maximal if there exists at least one (i, j) pair such that S[p_i + |α|] ≠ S[p_j + |α|].
- α is a maximal repeat if it is both left maximal and right maximal.

Finding Maximal Repeats
Definition:
- Let's call character S[p_i - 1] the left character of suffix p_i.
- A node v is left diverse if at least two leaves in v's subtree have different left characters.
Lemma:
- The path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse.
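The definition can be checked directly with a brute-force search. The sketch below deliberately swaps the PAT tree for plain enumeration of every substring, so it is far slower than the tree-based method and is only meant to make the definition concrete. Treating the string boundaries as unique left and right characters is an assumption.

```python
from collections import defaultdict

def maximal_repeats(tokens):
    """Return {repeat (as a tuple): sorted start positions}, by brute force."""
    n = len(tokens)
    found = {}
    for length in range(1, n):
        positions = defaultdict(list)
        for p in range(n - length + 1):
            positions[tuple(tokens[p:p + length])].append(p)
        for sub, occ in positions.items():
            if len(occ) < 2:
                continue  # not a repeat
            # A string boundary counts as a character unlike any other (assumption)
            left = {tokens[p - 1] if p > 0 else ("^", p) for p in occ}
            right = {tokens[p + length] if p + length < n else ("$", p) for p in occ}
            if len(left) > 1 and len(right) > 1:
                found[sub] = occ  # both left maximal and right maximal
    return found

repeats = maximal_repeats(list("adcwbdadcxbadcxbdadcb"))
```

On the token string used later in the alignment slide, "adc" is reported at positions 0, 6, 11, and 17, while "dc" is not, because every occurrence of "dc" has the same left character "a".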

3. Pattern Validator
Suppose the occurrences of a maximal repeat α are ordered by position such that p_1 < p_2 < p_3 < … < p_k, where p_i denotes the position of each suffix in the encoded token sequence.
Characteristics of a pattern
- Regularity: variance coefficient
- Adjacency: density

Pattern Validator (Cont.)
Basic screening
For each maximal repeat α, compute V(α) and D(α):
a) check the pattern's variance: V(α) < 0.5
b) check the pattern's density: 0.25 < D(α) < 1.5
A pattern that fails either check is discarded; a pattern that passes both is kept.
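A rough sketch of the basic screening follows. The slides name the two measures but do not spell out their formulas, so both definitions here are assumptions: V is taken as the coefficient of variation of the gaps between adjacent occurrences, and D as (k - 1) * |α| / (p_k - p_1). Only the thresholds come from the slide.

```python
from statistics import mean, pstdev

def variance_coefficient(positions):
    # Assumed definition: std. deviation of adjacent gaps divided by their mean
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, pattern_length):
    # Assumed definition: (k - 1) * |alpha| / (p_k - p_1)
    k = len(positions)
    return (k - 1) * pattern_length / (positions[-1] - positions[0])

def passes_basic_screening(positions, pattern_length,
                           max_v=0.5, min_d=0.25, max_d=1.5):
    positions = sorted(positions)
    if len(positions) < 2:
        return False  # a single occurrence is not a repeat
    v = variance_coefficient(positions)
    d = density(positions, pattern_length)
    return v < max_v and min_d < d < max_d

# "adc" in "adcwbdadcxbadcxbdadcb" occurs at 0, 6, 11, 17
regular = passes_basic_screening([0, 6, 11, 17], 3)
# Hypothetical irregular occurrences: two tight pairs far apart
irregular = passes_basic_screening([0, 3, 40, 43], 3)
```

Under these assumed definitions the evenly spaced pattern passes (V is about 0.08, D about 0.53), while the one split into two distant pairs fails on variance.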

4. Rule Composer
- Occurrence partition: flexible variance threshold control
- Multiple string alignment: increase density of a pattern

Occurrence Partition
Problem
- Some patterns are divided into several blocks.
- Ex: Lycos and Excite, where the variance coefficient is large.
Solution
- Cluster the occurrences of such a pattern.
- For each cluster, check V(α) < 0.1: if the check fails, discard; if it passes, go on to check density.

Multiple String Alignment
Problem
- Patterns with density less than 1 can extract only part of the information.
Solution
- Align k-1 substrings among the k occurrences.
- A natural generalization of alignment for two strings, which can be solved in O(n*m) by dynamic programming, where n and m are the string lengths.

Multiple String Alignment (Cont.)
Suppose "adc" is the discovered pattern for token string "adcwbdadcxbadcxbdadcb". If we have the following multiple alignment for strings "adcwbd", "adcxb" and "adcxbd":
a d c w b d
a d c x b -
a d c x b d
The extraction pattern can be generalized as "adc[w|x]b[d|-]".
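The two-string case and the generalization step can be sketched as follows. This is a minimal illustration under assumptions: `align` is an ordinary unit-cost dynamic-programming alignment of two strings (the O(n*m) building block, not a full multiple-alignment procedure), and `generalize` simply turns each column of an already aligned block into one symbol or a bracketed alternative. Both function names are made up for this example.

```python
def align(a, b, gap="-"):
    """Global alignment of two strings with unit costs, O(len(a) * len(b))."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the aligned rows
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1]); row_b.append(gap); i -= 1
        else:
            row_a.append(gap); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))

def generalize(rows):
    """Collapse equal-length aligned rows into a pattern with alternatives."""
    parts = []
    for column in zip(*rows):
        distinct = list(dict.fromkeys(column))  # keep first-seen order
        parts.append(distinct[0] if len(distinct) == 1
                     else "[" + "|".join(distinct) + "]")
    return "".join(parts)

pair = align("adcwbd", "adcxb")
pattern = generalize(["adcwbd", "adcxb-", "adcxbd"])
```

Aligning "adcwbd" with "adcxb" pads the shorter string to "adcxb-", and generalizing the three aligned rows from the slide reproduces "adc[w|x]b[d|-]".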

Pattern Viewer
- Java-application based GUI
- Web based GUI

The Extractor
Matching the pattern against the encoded token string
- Knuth-Morris-Pratt's algorithm
- Boyer-Moore's algorithm
Alternatives in a rule
- Matching the longest pattern
What is extracted?
- The whole record
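For the matching step, a textbook Knuth-Morris-Pratt search over a token list looks roughly like this. It is a sketch of the general algorithm for a fixed pattern only; handling alternatives in a rule and returning whole records are left out, and the function names are illustrative.

```python
def failure_table(pattern):
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_find_all(tokens, pattern):
    """Start positions of every occurrence of pattern in tokens."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern tokens matched so far
    for i, tok in enumerate(tokens):
        while k > 0 and tok != pattern[k]:
            k = fail[k - 1]
        if tok == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going, allowing overlapping matches
    return hits

hits = kmp_find_all(list("adcwbdadcxbadcxbdadcb"), list("adc"))
```

Because the search never moves backwards in the token string, each page is scanned once per pattern regardless of how many partial matches occur.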

Experiment Setup
Fourteen sources: search engines
Performance measures
- Number of patterns
- Retrieval rate and accuracy rate
Parameters
- Encoding scheme
- Threshold control

Number of Patterns Discovered Using Block-Level Encoding
An average of 117 maximal repeats in our test Web pages

Translation Average page length is 22.7KB

Accuracy and Retrieval Rate

Summary
IEPAD: Information Extraction based on Pattern Discovery
- Rule generator
- The extractor
- Pattern viewer
Performance
- 97% retrieval rate and 94% accuracy rate

Problems
- Guarantees a high retrieval rate rather than a high accuracy rate: a generalized rule can extract more than the desired data.
- Currently applicable only when a Web page contains several records.

References: Text IE
- Riloff, E. (1996). Automatically Generating Extraction Patterns from Untagged Text. AAAI-96.
- Riloff, E. (1999). Information Extraction as a Stepping Stone toward Story Understanding. In Computational Models of Reading and Understanding, Ashwin Ram and Kenneth Moorman, eds., The MIT Press.

References: Semi-structured IE
- D.W. Embley, Y.S. Jiang, and W.-K. Ng. Record-Boundary Discovery in Web Documents. SIGMOD'99 Proceedings.
- C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW10, May 2-6, 2001, Hong Kong.
- B. Chidlovskii, J. Ragetli, and M. de Rijke. Automatic Wrapper Generation for Web Search Engines. The 1st Intern. Conf. on Web-Age Information Management (WAIM'2000), Shanghai, China, June 2000.