ONDUX On-Demand Unsupervised Learning for Information Extraction Eli Cortez, Altigran da Silva and Edleno de Moura Federal University of Amazonas (UFAM)

ONDUX: On-Demand Unsupervised Learning for Information Extraction
Eli Cortez, Altigran da Silva and Edleno de Moura, Federal University of Amazonas (UFAM) - BRAZIL
Marcos Gonçalves, Federal University of Minas Gerais (UFMG) - BRAZIL

Agenda
Introduction
Information Extraction by Text Segmentation
  ◦ Challenges
Related Work
ONDUX
Experiments
Conclusions and Future Work

Introduction (I)
Abundance of on-line sources of text documents containing implicit semi-structured data records
  ▪ Addresses
  ▪ Bibliographic References
  ▪ Classified Ads
  ▪ Product Descriptions

Introduction (II)
Classified Ad: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Address: Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214
Bibliographic Reference: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

Introduction (III)
Why extract information?
  ▪ Database Storage, Query…
  ▪ Data Mining
  ▪ Record Linkage
Classified Ad: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Extracted record:
  Neighborhood: Regent Square
  Price: $228,900
  No.: 1028
  Street: Mifflin Ave.
  Bed.: 6 Bedrooms
  Bath.: 2 Bathrooms
  Phone: 412-638-7273

IETS – Challenges (I)
Information Extraction by Text Segmentation (IETS)
  ◦ Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09
Diversity of templates and styles
  ▪ Attribute Ordering
  ▪ Capitalization
  ▪ Abbreviations
Different applications share similar domains
  ▪ Ex.: Address and Ads
  ▪ Records from both domains contain address information

IETS – Challenges (II)
Diversity of templates and styles
  ▪ Attribute Ordering; Capitalization; Abbreviations
The same reference as it appears in three sources:
  ◦ HomePage: Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006
  ◦ DBLP: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221 (2006)
  ◦ ACM: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

IETS – Challenges (III)
Existing approaches to this problem use Machine Learning techniques
  ▪ Hidden Markov Models (HMM)
  ▪ Conditional Random Fields (CRF)
  ▪ Structured Support Vector Machines (SSVM)
(Semi-)supervised approaches require a hand-labeled training set created by an expert
Each generated model is particular to a given application
High computational cost

Related Work (I)
[Borkar et al. @ SIGMOD 2001]
  ◦ Supervised extraction method based on Hidden Markov Models (HMM)
[McCallum et al. @ ICML 2001]
  ◦ Proposed the use of Conditional Random Fields (CRF), a supervised model (S-CRF)
[Mansuri et al. @ ICDE 2006]
  ◦ Semi-supervised approach based on CRF models
All of these approaches require an expert to create a hand-labeled training set for each application.

Related Work (II)
[Agichtein et al. @ SIGKDD 2004]
  ◦ Uses reference tables to create an unsupervised model based on Hidden Markov Models (HMM)
[Zhao et al. @ SIAM ICDM 2008]
  ◦ Uses reference tables to create unsupervised CRF models (U-CRF)
Both of these models assume a single positioning and ordering of attributes in all test instances. (Distinct orderings?)
[Cortez et al. @ JASIST 2009]
  ◦ Unsupervised method to extract bibliographic information
  ◦ Domain-specific heuristics; not generally applicable

Contributions
Proposal of an extraction method based on information retrieval to perform IETS tasks
  ◦ Eliminates the need for user involvement in any source-specific training process
  ◦ Flexible, in the sense that it does not rely on any particular style to perform the extraction
  ◦ Unsupervised Reinforcement Phase
    ▪ Attribute ordering and positioning learned On-Demand
Experimental comparison with the state-of-the-art information extraction approach (CRF)

Basic Concepts (I)
Given an input string I representing an implicit textual record (e.g., a classified ad), the IETS task consists of:
1. Segmenting I
2. Assigning to each segment a label corresponding to an attribute

Basic Concepts (II)
Knowledge Base
  ◦ Set of pairs KB = {(attribute, set of known values of that attribute), ...}
  ◦ Easily built from pre-existing sources
  ◦ Bibliographic DBs, Freebase, Google Fusion Tables, etc.
Example:
  KB = {(Neighborhood, O_Neigh.), (Street, O_Street), (Phone, O_Phone)}
  O_Neigh. = {"Regent Square", "Milenight Park"}
  O_Street = {"Regent St.", "Morewood Ave.", "Square Ave. Park"}
  O_Phone = {"323 462-6252", "(171) 289-7527"}
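
A minimal sketch of how such a Knowledge Base could be held in memory, using the example values from this slide. The dictionary layout, the `tokenize` helper and the `build_term_index` helper are illustrative assumptions, not part of the presented method.

```python
from collections import defaultdict

# Hypothetical in-memory KB: attribute name -> set of known values.
# The values are the illustrative ones shown on the slide.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def tokenize(value):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return value.lower().split()


def build_term_index(kb):
    """Map each term to the attributes it occurs in.

    The count is the number of values of that attribute containing the term.
    """
    index = defaultdict(lambda: defaultdict(int))
    for attribute, values in kb.items():
        for value in values:
            for term in set(tokenize(value)):
                index[term][attribute] += 1
    return index


index = build_term_index(KB)
print(dict(index["square"]))
```

An inverted index of this kind makes it cheap to ask which attributes a term is typical of, which is the kind of lookup the later steps rely on.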

ONDUX (I) Three main steps ◦ Blocking ◦ Matching ◦ Reinforcement

ONDUX (II) General View [Figure: general view of ONDUX, step marker 1]

ONDUX (III)
Blocking
  ◦ Split the input text into substrings called blocks
  ◦ Consider the co-occurrence of consecutive terms based on the KB
Example: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
  ▪ Consecutive terms that co-occur in the KB (Neighborhood) are kept in the same block
  ▪ Terms with no presence in the KB are left separated
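
The blocking idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: two terms are taken to co-occur when they appear together in the same KB value, and `normalize`, `cooccurring_pairs` and `blocking` are hypothetical names.

```python
from itertools import combinations

# Toy KB using values from the Basic Concepts slide.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
}


def normalize(term):
    """Lowercase and strip surrounding punctuation (assumed normalization)."""
    return term.lower().strip(".,;:")


def cooccurring_pairs(kb):
    """Collect unordered pairs of terms that share at least one KB value."""
    pairs = set()
    for values in kb.values():
        for value in values:
            terms = sorted({normalize(t) for t in value.split()})
            pairs.update(frozenset(p) for p in combinations(terms, 2))
    return pairs


def blocking(text, kb):
    """Extend the current block while consecutive terms co-occur in the KB."""
    pairs = cooccurring_pairs(kb)
    blocks = []
    previous = None
    for raw in text.split():
        current = normalize(raw)
        if blocks and frozenset((previous, current)) in pairs:
            blocks[-1].append(raw)  # same block as the previous term
        else:
            blocks.append([raw])    # start a new block
        previous = current
    return [" ".join(block) for block in blocks]


print(blocking("Regent Square $228,900 1028 Mifflin Ave.;", KB))
```

With this toy KB, "Regent" and "Square" stay together because they share a Neighborhood value, while "Mifflin" has no presence in the KB and is left as its own block.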

ONDUX (IV) General View [Figure: general view of ONDUX, step markers 1 and 2]

ONDUX (V)
Matching
  ◦ Associate each block generated in the previous phase with an attribute according to the Knowledge Base
  ◦ Use distinct functions to compute the similarity between a block and the known values of the attributes in the KB

ONDUX (VI)
Matching
  ▪ Textual Values: FF Function (Field Frequency)
    ▪ Similarity between the terms in the block and the terms of a given attribute of the KB
  ▪ Numeric Values: NM Function (Numeric Matching) [Agrawal @ CIDR 2003]
    ▪ Similarity between the value in the block and the mean and standard deviation of a numeric attribute in the KB
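
The slide names the two matching functions without giving their formulas, so the sketch below uses assumed forms that fit the descriptions: a term-based score that rewards terms frequent in, and specific to, an attribute, and a Gaussian-shaped score for numeric values. The function names, the `term_counts` layout and both formulas are assumptions for illustration.

```python
import math
from collections import Counter


def field_frequency(block_terms, attribute, term_counts):
    """Assumed FF-style score in [0, 1] for a textual block.

    term_counts maps attribute -> Counter(term -> number of KB values of
    that attribute containing the term).
    """
    counts = term_counts[attribute]
    if not counts:
        return 0.0
    f_max = max(counts.values())
    miss = 1.0
    for term in set(block_terms):
        f = counts.get(term, 0)
        if f == 0:
            continue  # term unknown for this attribute: no evidence
        total = sum(c.get(term, 0) for c in term_counts.values())
        fitness = (f / total) * (f / f_max)  # specificity times frequency
        miss *= 1.0 - fitness
    return 1.0 - miss


def numeric_matching(value, mean, std):
    """Assumed NM-style score: closeness of value to the attribute mean."""
    if std == 0:
        return 1.0 if value == mean else 0.0
    return math.exp(-((value - mean) ** 2) / (2 * std ** 2))


# Term statistics derived by hand from the example KB on the earlier slide.
term_counts = {
    "Neighborhood": Counter(regent=1, square=1, milenight=1, park=1),
    "Street": Counter(regent=1, st=1, morewood=1, ave=2, square=1, park=1),
}
block = ["regent", "square"]
ff_neigh = field_frequency(block, "Neighborhood", term_counts)
ff_street = field_frequency(block, "Street", term_counts)
print(ff_neigh, ff_street)
```

Under these assumed formulas the block scores higher for Neighborhood than for Street; with a different KB the ranking could flip, which is the kind of mismatch the Reinforcement step is meant to repair.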

ONDUX (VI)
Matching
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Labels after matching, in block order: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone

ONDUX (VII)
How can we deal with blocks that were incorrectly labeled or were not associated with any attribute?
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Labels after matching, in block order: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone

ONDUX (VIII)
Reinforcement
  ◦ Review the labeling task performed in the Matching step
    ▪ Unmatched blocks must receive a label of a given attribute
    ▪ Mismatched blocks must be correctly labeled
  ◦ How to handle these cases?
    ▪ Using positioning and sequencing information that is obtained On-Demand

ONDUX (IX) General View [Figure: general view of ONDUX, step markers 2 and 3]

ONDUX (X)
Reinforcement
  ◦ Given the extraction output of the Matching step
    ▪ ONDUX automatically builds a graphical structure, the PSM
    ▪ PSM: Positioning and Sequencing Model

ONDUX (XI)
Reinforcement – PSM
Ordering and positioning probabilities are learned On-Demand from the test instances, through the output of the Matching Phase
In the PSM, each state represents an attribute of the KB, plus the special states start and end
Edges represent transition probabilities
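
A minimal sketch of how a PSM-like structure could be estimated from the labels produced by matching. The function name `build_psm`, the use of `None` for unmatched blocks, and the per-attribute normalization of the positioning counts are assumptions; the slides do not specify these details.

```python
from collections import Counter, defaultdict

START, END = "start", "end"


def build_psm(labeled_records):
    """Estimate transition and positioning probabilities.

    labeled_records: one label sequence per input record, as produced by
    the matching step; None marks a block that received no label.
    """
    transitions = defaultdict(Counter)
    positions = defaultdict(Counter)
    for labels in labeled_records:
        path = [START] + list(labels) + [END]
        for prev, nxt in zip(path, path[1:]):
            # Skip transitions touching an unmatched block.
            if prev is not None and nxt is not None:
                transitions[prev][nxt] += 1
        for pos, label in enumerate(labels):
            if label is not None:
                positions[label][pos] += 1
    t = {a: {b: n / sum(c.values()) for b, n in c.items()}
         for a, c in transitions.items()}
    p = {a: {k: n / sum(c.values()) for k, n in c.items()}
         for a, c in positions.items()}
    return t, p


records = [
    ["Neighborhood", "Price", "Street"],
    ["Neighborhood", "Street"],
    [None, "Price", "Street"],
]
t, p = build_psm(records)
print(t["Neighborhood"], p["Street"])
```

Because the counts come from the very records being extracted, no training set is needed; the model simply reflects whatever ordering the current input happens to follow.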

ONDUX (XII)
Reinforcement
  ◦ Remarks
    ▪ The PSM is automatically learned On-Demand from test instances
    ▪ No a priori training required
    ▪ No assumptions regarding a particular order of attribute values
    ▪ Relies on the very effective strategies deployed in the Matching Step

ONDUX (XIII)
Reinforcement
  ◦ Once the PSM is built, we combine the matching, sequencing and positioning evidence using the Bayesian OR operator
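
As an illustration of combining the three kinds of evidence, the sketch below assumes the usual noisy-OR form, 1 - (1 - m)(1 - s)(1 - p), for the "Bayesian OR" named on the slide. The function names and the example scores are hypothetical.

```python
def noisy_or(*scores):
    """Combine independent evidence scores in [0, 1] with a noisy-OR."""
    remaining = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        remaining *= 1.0 - score
    return 1.0 - remaining


def relabel(block_scores):
    """Pick the attribute with the highest combined score.

    block_scores: attribute -> (matching, sequencing, positioning) scores.
    """
    combined = {a: noisy_or(*evidence) for a, evidence in block_scores.items()}
    best = max(combined, key=combined.get)
    return best, combined


# Made-up scores for a first block that matching alone would label Street.
best, combined = relabel({
    "Street": (0.6, 0.1, 0.1),
    "Neighborhood": (0.5, 0.9, 0.9),
})
print(best, combined)
```

With a noisy-OR, strong sequencing and positioning evidence can outweigh a slightly better matching score, which is how a mismatched block can end up relabeled.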

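ONDUX (XIV)
Reinforcement
  ◦ Extraction Result
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
After reinforcement, the mismatched label "Street" is corrected to "Neighborhood" and the unmatched block "???" receives the label "Street"; the labels Price, No., Bed., Bath. and Phone are unchanged.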

ONDUX (XV) Overview [Figure: overview of ONDUX, step markers 1, 2 and 3]

Experiments (I)
Setup
  ◦ We tested our proposed approach with several sources from 3 distinct domains:
    ▪ Addresses: BigBook, Restaurants [RISE]
    ▪ Bibliographic Data: CORA [Peng@IPM'06], PersonalBib [Mansuri@ICDE'06]
    ▪ Classified Ads: 7 distinct newspaper sites [Oliveira@SBBD'06]
  ◦ We limit the presentation to one experiment per domain; more in the paper

Experiments (II)
Evaluation
  ◦ Metrics
    ▪ Precision, Recall and F-Measure
    ▪ T-Test for the statistical validation of the results
  ◦ Baselines
    ▪ Conditional Random Fields (CRF)
      ▪ U-CRF (unsupervised method) [Zhao@SICDM'08]
      ▪ S-CRF (classical supervised method) [Peng@IPM'06]
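
For reference, the three quality metrics can be computed as below. Treating each extracted item as a (record id, attribute, value) triple that must match the gold standard exactly is an assumption of this sketch; the slides do not state the evaluation unit.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F-measure over sets of extracted items.

    predicted, gold: sets of (record_id, attribute, value) triples.
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 4 extracted items, 3 of them correct, 5 gold items.
gold = {
    (1, "Neighborhood", "Regent Square"),
    (1, "Price", "$228,900"),
    (1, "No.", "1028"),
    (1, "Street", "Mifflin Ave."),
    (1, "Phone", "412-638-7273"),
}
predicted = {
    (1, "Street", "Regent Square"),  # mismatched label
    (1, "Price", "$228,900"),
    (1, "No.", "1028"),
    (1, "Phone", "412-638-7273"),
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)
```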

Experiments (III)
Extraction Quality
U-CRF results are similar to those reported in Zhao@SICDM (validation)
The dataset follows the single-order assumption
After Reinforcement, ONDUX achieved similar quality

Experiments (IV)
Extraction Quality
S-CRF achieved higher results than U-CRF due to the hand-labeled training set
CORA includes a variety of citation styles (conference, journal, books, etc.)
In general, ONDUX outperformed the CRF models

Experiments (V)
Extraction Quality
Due to the Matching Phase and the PSM that is learned On-Demand, ONDUX achieves very high quality results
U-CRF performed poorly (very heterogeneous dataset)

Experiments (VI)
Varying the number of terms common to the test instances and the KB
  ◦ Determine how dependent the quality of results is on the overlap between the previously known data and the text input
    ▪ These experiments were conducted with the BigBook dataset

Experiments (VII)
Varying the number of shared terms
Even when the Matching Phase yields poor quality, the PSM is able to increase ONDUX's quality in the Reinforcement Step
With a batch of 500 input strings, ONDUX achieved high quality results once the overlap reached 500 terms

Experiments (VIII)
Varying the number of shared terms
As the number of shared terms increases, the quality achieved by the Matching Phase improves

Conclusions and Future Work (I)
New approach for information extraction that is independent of the style of the data records
ONDUX
  ◦ Flexible: does not consider any particular style
  ◦ Unsupervised: does not require any human effort to create a training set
  ◦ On-Demand: ordering and positioning information is learned through the Matching Phase

Conclusions and Future Work (II)
The proposed strategy achieves good precision and recall
  ◦ Small size of the Knowledge Base
  ◦ Comparison with the state-of-the-art
Future Work
  ◦ Investigate different matching functions
  ◦ Nested structures?

Acknowledgements UFMG

Questions?

Experiments: Setup
Experiment | Dataset (records) | # Source (records)
BigBook X BigBook | 2000
CORA X CORA | 150 | 350
Folha X Web Ads | 500 | 125

