Distance functions and IE – 4? William W. Cohen CALD.

Presentation transcript:

Distance functions and IE – 4? William W. Cohen CALD

Announcements
Current statistics:
–days with unscheduled student talks: 6
–students with unscheduled student talks: 4
–Projects are due: 4/28 (last day of class)
–Additional requirement: draft (for comments) no later than 4/21

The data integration problem

String distance metrics so far...
Term-based (e.g. TF/IDF as in WHIRL)
–Distance depends on the set of words contained in both s and t, so it is sensitive to spelling errors.
–Usually weight words to account for “importance”
–Fast comparison: O(n log n) for |s|+|t|=n
Edit-distance metrics
–Distance is the cost of the shortest sequence of edit commands that transforms s to t.
–No notion of word importance
–More expensive: O(n^2)
Other metrics
–Jaro metric & variants
–Monge-Elkan’s recursive string matching
–etc?
Which metrics work best, for which problems?
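The O(n^2) cost of edit-distance metrics comes from filling a table of prefix-to-prefix distances. A minimal sketch of the unit-cost (Levenshtein) case, assuming plain Python strings; the function name `levenshtein` is illustrative and is not SecondString's API:

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance between s and t (insert, delete, substitute)."""
    # prev[j] = distance between the first i-1 chars of s and the first j chars of t
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning the first i chars of s into "" takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("cohen", "cohn"))
```

Only two rows of the table are kept at a time, so memory is O(|t|) while time stays O(|s|·|t|).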

Jaro metric

Winkler-Jaro metric
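The formulas on the two slides above are not preserved in this transcript. The sketch below follows the commonly cited definitions of the Jaro and Jaro-Winkler similarities; the parameter names `prefix_scale` and `max_prefix` and their defaults (0.1 and 4) are assumptions, and SecondString's implementation may differ in detail:

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters count as matching only if they are within this window of each other.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == c:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order, halved.
    s_chars = [c for c, m in zip(s, s_matched) if m]
    t_chars = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s: str, t: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted for strings that share a common prefix."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    print(round(jaro("MARTHA", "MARHTA"), 4), round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The Winkler step only rescales a score that is already in [0,1], which is why it can be applied to other [0,1] similarity functions as well.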

So which metric should you use?
SecondString (Cohen, Ravikumar, Fienberg):
Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
Tools for evaluating performance on test data
Exploratory tool for adding, testing, combining string distances
–e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range of [0,1]
URL – Distribution also includes several sample matching problems.

SecondString distance functions
Edit-distance like:
–Levenshtein – unit costs
–untuned Smith-Waterman
–Monge-Elkan (tuned Smith-Waterman)
–Jaro and Jaro-Winkler

Results - Edit Distances Monge-Elkan is the best on average....

Edit distances

SecondString distance functions
Term-based, for sets of terms S and T:
–TFIDF distance
–Jaccard distance
–Language models: construct P_S and P_T and compare the two distributions
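The Jaccard formula is not preserved in this transcript. A minimal sketch using the standard Jaccard coefficient |S ∩ T| / |S ∪ T| over token sets; whitespace tokenization, lower-casing, and the choice to score two empty sets as 1.0 are assumptions:

```python
def jaccard(s_tokens, t_tokens) -> float:
    """Jaccard coefficient of two token collections (1.0 = same token set)."""
    S, T = set(s_tokens), set(t_tokens)
    if not S and not T:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(S & T) / len(S | T)


if __name__ == "__main__":
    a = "William W Cohen".lower().split()
    b = "William Cohen".lower().split()
    print(jaccard(a, b))
```

Because the score depends only on exact token overlap, a single misspelled token counts as a complete miss, which is the sensitivity to spelling errors noted for term-based metrics.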

SecondString distance functions
Term-based, for sets of terms S and T:
–TFIDF distance
–Jaccard distance
–Jensen-Shannon distance
  smoothing toward union of S,T reduces cost of disagreeing on common terms
  unsmoothed P_S, Dirichlet smoothing, Jelinek-Mercer
–“Simplified Fellegi-Sunter”
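A minimal sketch of the Jensen-Shannon distance between two token distributions, using the unsmoothed (maximum-likelihood) estimate of P_S and P_T. The Dirichlet and Jelinek-Mercer smoothing variants listed above are not reproduced here, and the helper names are illustrative:

```python
import math
from collections import Counter


def token_distribution(tokens):
    """Unsmoothed maximum-likelihood distribution over tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def jensen_shannon(p, q) -> float:
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint supports."""
    js = 0.0
    for tok in set(p) | set(q):
        pi, qi = p.get(tok, 0.0), q.get(tok, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution M = (P + Q) / 2
        if pi > 0:
            js += 0.5 * pi * math.log2(pi / mi)  # contribution to KL(P || M)
        if qi > 0:
            js += 0.5 * qi * math.log2(qi / mi)  # contribution to KL(Q || M)
    return js


if __name__ == "__main__":
    p = token_distribution("william w cohen".split())
    q = token_distribution("william cohen".split())
    print(jensen_shannon(p, q))
```

Unlike the plain Kullback-Leibler divergence, this quantity stays finite when a token appears in only one of the two strings, which matters for short records.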

Results – Token Distances

SecondString distance functions
Hybrid term-based & edit-distance based:
–Monge-Elkan’s “recursive matching scheme”, segmenting strings at token boundaries (rather than separators like commas)
–SoftTFIDF
  Like TFIDF but consider not just tokens in both S and T, but tokens in S “close to” something in T (“close to” relative to some distance metric)
  Downweight close tokens slightly
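A minimal sketch of the token-level recursive matching idea: each token of S is scored against its best-matching token of T, and the scores are averaged. The inner similarity here is Python's `difflib.SequenceMatcher` ratio, used purely as a stand-in (an assumption); the slides pair the scheme with edit-distance style metrics instead:

```python
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    # Stand-in inner similarity in [0, 1]; an assumption, not the metric from the slides.
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan(s: str, t: str, inner=token_sim) -> float:
    """Average, over tokens of s, of the best inner similarity to any token of t."""
    s_tokens, t_tokens = s.lower().split(), t.lower().split()
    if not s_tokens or not t_tokens:
        return 0.0
    best_scores = [max(inner(a, b) for b in t_tokens) for a in s_tokens]
    return sum(best_scores) / len(s_tokens)


if __name__ == "__main__":
    print(monge_elkan("Willaim Cohen", "Cohn, William"))
```

Note that the score is not symmetric in general, since it averages over the tokens of the first argument only. SoftTFIDF uses the same "closest token" idea but additionally weights tokens by TFIDF.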

Results – Hybrid distances

Results - Overall

Prospective test on two clustering tasks

An anomalous dataset

An anomalous dataset: census

Why?

Other results with SecondString
Distance functions over structured data records (first name, last name, street, house number)
Learning to combine distance functions
Unsupervised/semi-supervised training for distance functions over structured data

Combining Information Extraction and Similarity Computations
1) Bunescu et al
2) Krauthammer et al

Experiments
Hand-tagged 50 abstracts for gene/protein entities (pre-selected to be about human genes)
Collected dictionary of 40,000+ protein names from on-line sources
–not complete
–example matching is not sufficient
Approach: use hand-coded heuristics to propose likely generalizations of existing dictionary entries.
–not hand-coded or off-the-shelf similarity metrics

Example name generalizations

Basic idea behind the algorithm
original dictionary
carefully-tuned heuristics (aka hacks)
similar (but not identical) process applied to word n-grams from text
to do IE: extract if n-gram -> CD

Example: canonicalizing “short names” (different procedure for “full names” and “one-word” names)

NF-25 in OD NF Nf “... NF-kappa B...”NF NF in CD? ( ) NF => CD (from ) Recognize:

Results
Why is precision less than 100%?
When should you use “similarity by normalization”?
Could a simpler algorithm do as well?
Is there overfitting? (50 abstracts, <750 proteins)

...

Combining Information Extraction and Similarity Computations
1) Bunescu et al
2) Krauthammer et al

Background
Common task in proteomics/genomics:
–look for (soft) matches to a query sequence in a large “database” of sequences.
–want to find subsequences (genes) that are highly similar (and hence probably related)
–want to ignore “accidental” matches
–possible technique is Smith-Waterman (local alignment)
  want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance

Smith-Waterman distance
[alignment matrix for the strings “c o h e n d o r f” and “m c c o h n s k i”; dist=5]

In general “peaks” in the matrix scores indicate highly similar substrings.
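A minimal sketch of Smith-Waterman local alignment, returning the highest "peak" in the score matrix and where it ends. The scoring values (match=2, mismatch=-1, gap=-1) are assumptions chosen for illustration; they are not the costs behind the dist=5 figure on the slide, which uses a different convention:

```python
def smith_waterman(s: str, t: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (best local alignment score, (end index in s, end index in t))."""
    best, best_end = 0, (0, 0)
    prev = [0] * (len(t) + 1)
    for i, cs in enumerate(s, start=1):
        curr = [0]
        for j, ct in enumerate(t, start=1):
            score = max(
                0,                                              # start a new local alignment
                prev[j - 1] + (match if cs == ct else mismatch),  # align cs with ct
                prev[j] + gap,                                  # skip cs
                curr[j - 1] + gap,                              # skip ct
            )
            curr.append(score)
            if score > best:
                best, best_end = score, (i, j)
        prev = curr
    return best, best_end


if __name__ == "__main__":
    print(smith_waterman("cohendorf", "mccohnski"))
```

Resetting negative running scores to 0 is what makes the alignment local: unrelated prefixes and suffixes such as "mc" and "ski" cost nothing.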

Background
Common task in proteomics/genomics:
–look for (soft) matches to a query sequence in a large “database” of sequences.
–possible technique is Smith-Waterman (local alignment)
  want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance
  based on substitutability theory for amino acids
–doesn’t scale well
BLAST and FASTA: fast approximate S-W

BLAST/FASTA ideas
Find all char n-grams (“words”) in the query string.
FASTA:
–Use inverted indices to find out where these words appear in the DB sequence
–Use S-W only near DB sections that contain some of these words
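A minimal sketch of the FASTA-style filtering step: index the character n-grams of the database sequence, look up the query's n-grams, and keep only the regions around the hits for the expensive S-W comparison. The function names and the `margin` parameter are illustrative assumptions, not part of FASTA itself:

```python
from collections import defaultdict


def char_ngrams(text: str, n: int):
    """All (position, n-gram) pairs of text."""
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


def build_index(db_sequence: str, n: int):
    """Inverted index: n-gram -> list of start positions in the database sequence."""
    index = defaultdict(list)
    for pos, gram in char_ngrams(db_sequence, n):
        index[gram].append(pos)
    return index


def candidate_regions(query: str, index, n: int, margin: int):
    """Merged (start, end) windows of the database worth aligning against the query."""
    hits = sorted({pos for _, gram in char_ngrams(query, n) for pos in index.get(gram, [])})
    regions = []
    for pos in hits:
        start, end = max(0, pos - margin), pos + n + margin  # end may exceed the DB length
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))  # merge overlapping windows
        else:
            regions.append((start, end))
    return regions


if __name__ == "__main__":
    db = "x" * 10 + "cohen" + "x" * 10
    idx = build_index(db, 3)
    print(candidate_regions("cohn", idx, 3, margin=2))
```

Only the returned windows would then be passed to a local aligner, so the quadratic cost applies to a small fraction of the database.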

BLAST/FASTA ideas
Find all char n-grams (“words”) in the query string.
BLAST:
–Generate variations of these words by looking for changes that would lead to strong similarities
–Discard “low IDF” words (where accidental matches are likely)
–Use expanded set of n-grams to focus search

query string words and expansions

BLAST/FASTA ideas
Find all char n-grams (“words”) in the query string.
BLAST:
–Generate variations of these words by looking for changes that would lead to strong similarities
–Discard “low IDF” words (where accidental matches are likely)
–Use expanded set of n-grams to focus search
The BLAST program:
–Widely used,
–Fast implementation,
–Supports asking multiple queries against a database at once...
–Can one use it to find soft matches of protein names (from a dictionary) in text?
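A minimal sketch of the word-expansion step: enumerate same-length words over an alphabet and keep those that score at least a threshold against the query word. The match/mismatch scores, the threshold, and the brute-force enumeration are all assumptions made for illustration; BLAST itself scores protein words with a substitution matrix and uses a far more efficient search:

```python
from itertools import product


def word_score(a: str, b: str, match: int = 2, mismatch: int = -1) -> int:
    """Position-by-position score of two equal-length words."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighborhood(word: str, alphabet: str, threshold: int):
    """All words of the same length scoring at least `threshold` against `word`."""
    candidates = ("".join(chars) for chars in product(alphabet, repeat=len(word)))
    return [cand for cand in candidates if word_score(word, cand) >= threshold]


if __name__ == "__main__":
    # With match=2 and mismatch=-1, a threshold of 3 admits at most one substitution.
    print(neighborhood("ACG", "ACGT", threshold=3))
```

The expanded word set is what gets looked up in the database, so near-miss seeds can still trigger a local alignment.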

Basic idea:
Query algorithm: BLAST
–inputs: Query strings, Protein database
–output: Proposed alignment (query->database)
IE system: dictionaries+BLAST (optimized for this problem)
–inputs: Protein name dictionary, Biomedical paper
–output: Extracted protein name (dict. entry->text)

1) Mapping text to DNA sequences (Q: what sort of char similarity is this?)

2) Optimizing BLAST
Split protein-name database into several parts (for short, medium-length, long protein names)
Require space chars before and after “short” protein names.
Manually search (grid search?) for better settings for certain key parameters for each protein-name subdatabase
–With what data?
Evaluate on one review article, 1162 protein names
–inter-annotator agreement not great (70-85%)

2) Optimizing BLAST

Results

Overall: precision 71.1%, recall 78.8% (opt)