Distance functions and IE -2 William W. Cohen CALD.



Announcements
March 25 (Thurs): talk from Carlos Guestrin (Assistant Prof in CALD as of fall 2004) on max-margin Markov nets
– 9:30 am in NSH 1507
– open to public - tell your friends!
Datasets:
– some public extraction data is (I hope) readable on /afs/cs/project/extract-learn/repository
Writeups:
– nothing today
– “distance metrics for text” – three papers - due next Monday, 3/22

Record linkage: definition
Record linkage: determine if pairs of data records describe the same entity
– I.e., find record pairs that are co-referent
– Entities: usually people (or organizations or…)
– Data records: names, addresses, job titles, birth dates, …
Main applications:
– Joining two heterogeneous relations
– Removing duplicates from a single relation

The data integration problem
Control flow (modulo details about querying):
– Extract (author, department) pairs from DB1
– Extract (department, www server) pairs from DB2
– Execute the two-step plan to get paper: author -> department -> wwwServer
– two steps means matching (linking, integrating, deduping, ...) department names in DB1/DB2
– issues are completely different if the user is executing a one-step plan: a one-step plan is retrieval

String distance metrics: Levenshtein
Edit-distance metrics
– Distance is the cost of the cheapest sequence of edit commands that transforms s to t.
– Simplest set of operations:
    Copy character from s over to t (cost 0)
    Delete a character in s (cost 1)
    Insert a character in t (cost 1)
    Substitute one character for another (cost 1)
– This is “Levenshtein distance”

Computing Levenshtein distance – 4
\[
D(i,j) = \min
\begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\
D(i-1,j) + 1 & \text{// insert} \\
D(i,j-1) + 1 & \text{// delete}
\end{cases}
\]
[Figure: DP matrix for the strings COHEN and MCCOHN]
A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (there may be more than one).
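The recurrence above can be sketched in Python roughly as follows. The function name `levenshtein`, the unit costs, and the base cases D(i,0) = i and D(0,j) = j are assumptions of this sketch rather than something stated on the slide.

```python
def levenshtein(s: str, t: str) -> int:
    """Levenshtein distance via the D(i,j) recurrence (unit costs assumed)."""
    m, n = len(s), len(t)
    # D[i][j] = distance between the first i characters of s and the first j of t.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # assumed base case: edit away i characters of s
    for j in range(n + 1):
        D[0][j] = j  # assumed base case: produce j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1  # copy costs 0, substitute costs 1
            D[i][j] = min(
                D[i - 1][j - 1] + d,  # subst/copy
                D[i - 1][j] + 1,      # insert (labelled as on the slide)
                D[i][j - 1] + 1,      # delete (labelled as on the slide)
            )
    return D[m][n]


print(levenshtein("mccohn", "cohen"))
```

Keeping the whole table (rather than two rows) is deliberate here: a trace back through `D` is what recovers the edit operations or an alignment.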

Smith-Waterman distance - 2
\[
D(i,j) = \max
\begin{cases}
0 & \text{// start over} \\
D(i-1,j-1) - d(s_i,t_j) & \text{// subst/copy} \\
D(i-1,j) - G & \text{// insert} \\
D(i,j-1) - G & \text{// delete}
\end{cases}
\]
G = 1, d(c,c) = -2, d(c,d) = +1
[Figure: DP matrix for the strings COHEN and MCCOHN]

Smith-Waterman distance - 3
Same recurrence and parameters as on the previous slide.
[Figure: DP matrix for the strings COHEN and MCCOHN]
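A minimal sketch of this recurrence in Python, using the slide's parameters G = 1, d(c,c) = -2, d(c,d) = +1. The function name `smith_waterman`, the zero-filled border, and returning the largest cell in the table are assumptions of the sketch.

```python
def smith_waterman(s: str, t: str, G: float = 1, match: float = -2, mismatch: float = 1) -> float:
    """Best local-alignment score; d(c,c) = match and d(c,d) = mismatch are costs that get subtracted."""
    m, n = len(s), len(t)
    # Row 0 and column 0 stay at 0 (assumed border), so an alignment can start anywhere.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(
                0,                    # start over
                D[i - 1][j - 1] - d,  # subst/copy
                D[i - 1][j] - G,      # insert
                D[i][j - 1] - G,      # delete
            )
            best = max(best, D[i][j])
    return best


print(smith_waterman("cohen", "mccohn"))
```

With these parameters a match adds 2 to the score while a mismatch or a one-character gap subtracts 1, so unrelated prefixes and suffixes are simply ignored.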

Smith-Waterman distance - 5
[Figure: alignment of “cohendorf” with “mccohnski”; dist=5]

Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)
String s = A_1 A_2 ... A_K, string t = B_1 B_2 ... B_L
sim’ is editDistance scaled to [0,1]
Monge-Elkan’s “recursive matching scheme” is average maximal similarity of A_i to B_j:
\[
\mathrm{sim}(s,t) = \frac{1}{K} \sum_{i=1}^{K} \max_{1 \le j \le L} \mathrm{sim}'(A_i, B_j)
\]
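The "average maximal similarity" scheme might be sketched as below. The slide only says that sim’ is an edit distance scaled to [0,1]; this sketch assumes sim’ = 1 - Levenshtein / max length and whitespace tokenization, and the names `sim_prime` and `monge_elkan` are hypothetical.

```python
def _lev(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, kept to two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def sim_prime(a: str, b: str) -> float:
    """Assumed inner similarity: edit distance rescaled to [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - _lev(a, b) / max(len(a), len(b))


def monge_elkan(s: str, t: str) -> float:
    """Average over tokens A_i of s of the best sim' against any token B_j of t."""
    A, B = s.split(), t.split()
    if not A or not B:
        return 0.0
    return sum(max(sim_prime(a, b) for b in B) for a in A) / len(A)


print(monge_elkan("william cohen", "cohen william"))
print(monge_elkan("william w cohen", "william cohen"))
```

Note that the measure is not symmetric: every token of s must find a partner in t, but not the other way around.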

Results: S-W from Monge & Elkan

Affine gap distances
Smith-Waterman fails on some pairs that seem quite similar:
    William W. Cohen
    William W. ‘Don’t call me Dubya’ Cohen
Intuitively, a single long insertion is “cheaper” than a lot of short insertions
Intuitively, are springlest hulongru poinstertimon extisn’t “cheaper” than a lot of short insertions

Affine gap distances - 2
Idea:
– Current cost of a “gap” of n characters: nG
– Make this cost: A + (n-1)B, where A is the cost of “opening” a gap, and B is the cost of “continuing” a gap.
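A small arithmetic illustration of the two cost models. The values A = 2 and B = 0.5 are illustrative assumptions, not taken from the slides, and the function names are hypothetical.

```python
def linear_gap_cost(n: int, G: float = 1.0) -> float:
    """Cost of a gap of n characters under the current model: nG."""
    return n * G


def affine_gap_cost(n: int, A: float = 2.0, B: float = 0.5) -> float:
    """Cost of a gap of n characters under the affine model: A + (n-1)B."""
    return A + (n - 1) * B if n > 0 else 0.0


n = 24  # an inserted run of 24 characters, chosen for illustration
print(linear_gap_cost(n))      # one long gap, linear model
print(affine_gap_cost(n))      # one long gap, affine model
print(n * affine_gap_cost(1))  # the same characters as 24 separate one-character gaps
```

Under the affine model, with B smaller than A, the single long gap costs far less than the same number of characters spread over many short gaps, which is the intuition the previous slide appeals to.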

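Affine gap distances - 3
The insert and delete terms D(i-1,j) - 1 and D(i,j-1) - 1 of the simple recurrence are replaced by two extra tables, IS and IT:
\[
D(i,j) = \max
\begin{cases}
D(i-1,j-1) - d(s_i,t_j) \\
IS(i-1,j-1) - d(s_i,t_j) \\
IT(i-1,j-1) - d(s_i,t_j)
\end{cases}
\]
\[
IS(i,j) = \max
\begin{cases}
D(i-1,j) - A \\
IS(i-1,j) - B
\end{cases}
\qquad
IT(i,j) = \max
\begin{cases}
D(i,j-1) - A \\
IT(i,j-1) - B
\end{cases}
\]
IS(i,j): best score in which s_i is aligned with a ‘gap’
IT(i,j): best score in which t_j is aligned with a ‘gap’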
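The three-table recurrence can be sketched as follows. Several things here are assumptions of the sketch: d is treated as a cost that is subtracted (d(c,c) = -2, d(c,d) = +1, as on the Smith-Waterman slides), the alignment is global with D(0,0) = 0 and every other undefined cell set to minus infinity, the values of A and B are illustrative, and the name `affine_gap_score` is hypothetical.

```python
NEG_INF = float("-inf")


def affine_gap_score(s: str, t: str, A: float = 2.0, B: float = 0.5,
                     match: float = -2.0, mismatch: float = 1.0) -> float:
    """Global alignment score with gap cost A + (n-1)B, using tables D, IS, IT."""
    m, n = len(s), len(t)
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # s_i aligned with t_j
    IS = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # s_i aligned with a gap
    IT = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # t_j aligned with a gap
    D[0][0] = 0.0  # assumed base case: two empty prefixes align at no cost
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0:
                IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)  # open or continue
            if j > 0:
                IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)  # open or continue
            if i > 0 and j > 0:
                d = match if s[i - 1] == t[j - 1] else mismatch
                D[i][j] = max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) - d
    return max(D[m][n], IS[m][n], IT[m][n])


print(affine_gap_score("ab", "axxxb"))
```

In the example, the three inserted characters are charged once as a single gap (A + 2B = 3), so the two matches still leave a positive score.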

Affine gap distances - 4
[Figure: state diagram with states D, IS and IT; edges labelled -d(si,tj), -A and -B]

Affine gap distances – experiments (from McCallum, Nigam, Ungar, KDD 2000)
Goal is to match data like this:

Affine gap distances – experiments (from McCallum, Nigam, Ungar, KDD 2000)
– Hand-tuned edit distance
– Lower costs for affine gaps
– Even lower cost for affine gaps near a “.”
– HMM-based normalization to group title, author, booktitle, etc. into fields (as in Borkar et al.)

Affine gap distances – experiments
[Table: columns TFIDF and Edit Distance; rows Cora, OrgName, Orgname, Restaurant, Parks; cell values not preserved in the transcript]

TFIDF distance for data integration Experiments with WHIRL

Three ways to deal with output of IE systems
Method 1.
– Do the best you can at mapping the output into a conventional database (or KR system) with a natural schema (info about people, events, etc.)
– Answer any questions with the existing DB
Method 2.
– Given a query, try to see how much the answer can be constrained by information derived from IE (somehow or other)
– Probably requires some sort of uncertain reasoning.

Birds: r(birdName,soundDescription) and 5 short descriptions of sounds (“an owl hooting”)
Movies: r(movieName,review) and 5 long, 5 short plot descriptions (“sci-fi comedy”, “serious czech movie”, ...)

Soft joins with “incompatible schemas”

WHIRL as a classification-learner

Classification with unlabeled “Background” instances Example: instances are paper titles, background instances are paper abstracts

Very very short examples Very short examples Classifying short newswire headlines

Inference in WHIRL
“Best-first” search: pick state s that is “best” according to f(s)
Suppose the graph is a tree, and for all s, s’, if s’ is reachable from s then f(s) >= f(s’).
Then A* outputs the globally best goal state s* first, and then the next best, ...
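A minimal sketch of this kind of best-first search, written as a generic priority-queue loop rather than as WHIRL's actual implementation. The function `best_first_goals` and the toy search space (states are tuples of chosen scores, f is their product) are assumptions made for illustration.

```python
import heapq
from itertools import count


def best_first_goals(start, f, children, is_goal):
    """Yield goal states best-first; assumes a tree and f(s) >= f(s') for descendants s'."""
    tie = count()  # tie-breaker so the heap never has to compare states directly
    heap = [(-f(start), next(tie), start)]  # heapq is a min-heap, so negate f
    while heap:
        _, _, s = heapq.heappop(heap)
        if is_goal(s):
            yield s  # goals are assumed to be leaves, so they are not expanded
            continue
        for c in children(s):
            heapq.heappush(heap, (-f(c), next(tie), c))


# Toy search space (assumed): pick one score per slot; f is the product of picks.
options = [[0.9, 0.5], [0.8, 0.3]]


def f(s):
    p = 1.0
    for x in s:
        p *= x
    return p


def children(s):
    return [s + (x,) for x in options[len(s)]] if len(s) < len(options) else []


def is_goal(s):
    return len(s) == len(options)


goals = list(best_first_goals((), f, children, is_goal))
print(goals)
```

Because the generator is lazy, a caller that only wants the top k answers can stop after k goals without expanding the rest of the tree.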

Inference in WHIRL
Explode p(X1,X2,X3): find all DB tuples for p and bind Xi to ai.
Constrain X~Y: if X is bound to a and Y is unbound,
– find DB column C to which Y should be bound
– pick a term t in X, find proper inverted index for t in C, and bind Y to something in that index
Keep track of t’s used previously, and don’t allow Y to contain one.
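The "constrain" step could be sketched with a plain inverted index as below. This is an illustration of the idea, not WHIRL's code: the whitespace tokenization, the rarest-term-first ordering, and the names `build_inverted_index` and `candidates_by_term` are all assumptions.

```python
from collections import defaultdict


def build_inverted_index(column):
    """Map each term to the set of row ids in the column whose text contains it."""
    index = defaultdict(set)
    for row_id, text in enumerate(column):
        for term in set(text.lower().split()):
            index[term].add(row_id)
    return index


def candidates_by_term(x_text, index):
    """For each term t of X, yield the rows containing t but no previously used term."""
    seen = set()  # rows containing some term already used
    terms = sorted(set(x_text.lower().split()),
                   key=lambda t: (len(index.get(t, ())), t))  # assumed: rarest first
    for t in terms:
        postings = index.get(t, set())
        fresh = postings - seen  # do not allow Y to contain an earlier term
        if fresh:
            yield t, sorted(fresh)
        seen |= postings


column = ["william cohen", "w cohen cmu", "andrew mccallum"]
index = build_inverted_index(column)
result = list(candidates_by_term("William W Cohen", index))
print(result)
```

Excluding rows that contain an already used term means each candidate binding for Y is generated at most once, under exactly one term.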

Inference in WHIRL

Summary
WHIRL finds the top k answers to a query
Queries tend to be easy because they’re either
– unconstrained (e.g. 2-way similarity join) => easy to find 100 or so “good” answers
– highly constrained (e.g. restricted sim join, multi-way join, classification query, ...) => easy to present all the “reasonable” answers to a user
Data integration usually considers matching two lists of entity descriptions in the abstract
– unconstrained, sometimes underconstrained (what is a match to the end user?) – i.e., we don’t know what the final query, and hence final constraints, will turn out to be.
– this is evaluated a lot in experiments, but in an ideal world it would not be the “wrong” problem