Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.

Hash tables
A hash table, or associative array, efficiently implements a function with a very large domain but relatively few recorded values
Example: map names to phone numbers
– Although there are many possible names, only a few will be stored in any particular phone book

Implementing hash tables
A hash table works by using a hash function to translate the input (keys) to a small range of buckets
– For example, h(n) = n mod k, where k is the size of the hash table
Collisions can occur when different keys are mapped to the same bucket, and must be resolved
Many programming languages directly support hash tables
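The bucket idea can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the class name `ChainedHashTable` and its methods are hypothetical, and collisions are resolved by chaining (a list per bucket), which is only one of several possible strategies.

```python
class ChainedHashTable:
    """Toy hash table using h(n) = n mod k, with chaining to resolve collisions."""

    def __init__(self, k=8):
        self.k = k                                # number of buckets
        self.buckets = [[] for _ in range(k)]     # one list (chain) per bucket

    def _bucket(self, key):
        # hash() maps the key to an integer n; n mod k selects the bucket.
        return self.buckets[hash(key) % self.k]

    def put(self, key, value):
        bucket = self._bucket(key)
        for idx, (existing_key, _) in enumerate(bucket):
            if existing_key == key:               # key already stored: overwrite
                bucket[idx] = (key, value)
                return
        bucket.append((key, value))               # new key (possibly a collision)

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default


table = ChainedHashTable(k=4)
table.put(1, "first")
table.put(5, "second")    # 1 mod 4 == 5 mod 4, so both keys share bucket 1
```

Python's built-in `dict` plays the same role in practice, so a phone book is simply `{"Alice": "555-0100"}` (a made-up entry).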

Example hash table

FASTA after Step 1

FASTA – Step 2
Group together hot spots on the same diagonal
This creates a partial alignment with matches and mismatches (no indels)
Keep the 10 best diagonal runs
If a hot spot matches at position i in S and position j in T, it will be on the (i-j)th diagonal
Sort hot spots by i-j to group them
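The grouping by i-j can be sketched as follows. The function names are hypothetical, hot spots are taken to be exact k-letter word matches, and diagonals are ranked simply by how many hot spots they contain, which is a simplification of the scoring FASTA actually applies to diagonal runs.

```python
from collections import defaultdict


def find_hot_spots(s, t, k=2):
    """Return all (i, j) where the k-letter word at s[i:] equals the one at t[j:]."""
    positions = defaultdict(list)
    for j in range(len(t) - k + 1):
        positions[t[j:j + k]].append(j)           # index every word of t
    return [(i, j)
            for i in range(len(s) - k + 1)
            for j in positions.get(s[i:i + k], [])]


def group_by_diagonal(hot_spots):
    """Sort hot spots by i - j so that each diagonal's hot spots end up together."""
    diagonals = defaultdict(list)
    for i, j in sorted(hot_spots, key=lambda p: (p[0] - p[1], p[0])):
        diagonals[i - j].append((i, j))
    return dict(diagonals)


def best_diagonals(diagonals, keep=10):
    """Keep the diagonals with the most hot spots (simplified ranking)."""
    ranked = sorted(diagonals.items(), key=lambda item: len(item[1]), reverse=True)
    return ranked[:keep]


diagonals = group_by_diagonal(find_hot_spots("ACGTAC", "TACGTA", k=2))
top = best_diagonals(diagonals, keep=1)
```

In this toy input the four hot spots on diagonal -1 form the best run, and two more lie on diagonal 3.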

FASTA – Step 3
Rescore exact matches using realistic substitution penalties (from a set such as PAM250 for proteins)
Trim and extend hot spots according to substitution penalties, allowing “good” mismatches

The PAM matrices
From observing closely related proteins in evolution, we can estimate the likelihood that one amino acid mutates to another
Normalize these probabilities by PAM (Percent Accepted Mutation: the number of accepted point mutations per 100 amino acids)
The PAM0 matrix is the identity matrix
The PAM1 matrix diverges slightly from the identity matrix

Calculating PAM matrices
If we have PAM1, then
– PAM_N = (PAM_1)^N
– A Markov chain of independent mutations
The PAM250 matrix has been found empirically to be the most useful
At this evolutionary distance, 80% of amino acids are changed
Change varies according to class (from only 45% to 94%)
Some amino acids are no longer good matches with themselves
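The matrix-power relationship can be illustrated with a toy example. The 3-letter alphabet and the values in `TOY_PAM1` are invented for illustration (a real PAM1 matrix is 20×20 and estimated from data), and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(m)
    # Start from the identity matrix, which plays the role of PAM0.
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Invented mutation probabilities for a 3-letter alphabet; each row sums to 1.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
```

Each row of `toy_pam250` is still a probability distribution, but the diagonal entries are much smaller than in `TOY_PAM1`, mirroring the observation that after many steps a letter is increasingly likely to have changed.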

FASTA after Steps 2 and 3

FASTA – Step 4
Starting from the best diagonal run, look at nearby diagonal runs and incorporate non-overlapping hot spots
This extends the partial alignment with some insertions and deletions
We only look a limited distance from the best diagonal run

FASTA after Step 4

FASTA – Step 5
Run the full dynamic programming alignment algorithm in a band around the extended best diagonal run
Only consider matches within w positions on either side of the extended best diagonal run
Typically w is 16, so the band covers about 32n cells, and 32n ≪ n^2
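A banded dynamic program can be sketched as below. Several things here are assumptions made for brevity: the function name, the scoring values (match +1, mismatch -1, gap -2), and the use of a global alignment score rather than the local alignment FASTA computes. The point is only to show how restricting the computation to diagonals within w of a center diagonal cuts the number of cells filled.

```python
NEG_INF = float("-inf")


def banded_alignment_score(s, t, w=16, center=0, match=1, mismatch=-1, gap=-2):
    """Global alignment score over cells (i, j) with |(i - j) - center| <= w.

    Returns (score, number_of_cells_filled). The score is -inf if the band
    does not connect the two corners of the table.
    """
    n, m = len(s), len(t)
    prev = {}                      # previous row: column j -> best score
    cells = 0
    for i in range(n + 1):
        cur = {}
        lo = max(0, i - center - w)
        hi = min(m, i - center + w)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                best = 0
            else:
                best = NEG_INF
                if i > 0 and j > 0 and (j - 1) in prev:     # substitution
                    sub = match if s[i - 1] == t[j - 1] else mismatch
                    best = max(best, prev[j - 1] + sub)
                if i > 0 and j in prev:                     # gap in t
                    best = max(best, prev[j] + gap)
                if j > 0 and (j - 1) in cur:                # gap in s
                    best = max(best, cur[j - 1] + gap)
            cur[j] = best
            cells += 1
        prev = cur
    return prev.get(m, NEG_INF), cells


score, cells = banded_alignment_score("ACGT", "AGT", w=1)
long_score, long_cells = banded_alignment_score("A" * 200, "A" * 200, w=16)
```

For the 200-letter example the band fills at most 33 cells per row instead of 201, which is the 32n versus n^2 saving in miniature.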

FASTA final step

BLAST
Basic Local Alignment Search Tool
Uses words like FASTA, but allows approximate matches of words to create high-scoring segment pairs (HSPs)
Usually longer words (k=3 for proteins, 11 for DNA)
HSPs are combined on the same diagonal and extended
Reports local alignments based on one HSP or a combination of two close HSPs
Variations allow gaps and pattern search

Alignment as classification
Alignment can be viewed as
– A function that produces similarity values between any two strings
  These similarity values can then be used to inform classifiers and clustering programs
– A binary classifier: any two strings are classified as related/similar or not
  Requires the use of a threshold
  The threshold can be fixed or depend on the context and application

Measuring performance
Done on a test set separate from the training set (the examples with known labels)
We need to know the class labels in the test set (without making them available to the classifier) in order to evaluate the classifier’s performance
Both sets must be representative of the problem instances, which is not always the case

Contingency tables
Given an n-way classifier and a set with both labels assigned by the classifier and correct, known labels, we construct an n×n contingency table counting all combinations of true/assigned classes
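Counting the combinations is straightforward in code. The sketch below uses a hypothetical function name and made-up labels; rows are indexed by the classifier-assigned class and columns by the true class, which is an arbitrary layout choice.

```python
from collections import Counter


def contingency_table(true_labels, assigned_labels, classes):
    """Return an n×n table: table[r][c] counts items assigned classes[r]
    whose true class is classes[c]."""
    if len(true_labels) != len(assigned_labels):
        raise ValueError("label lists must have the same length")
    counts = Counter(zip(assigned_labels, true_labels))
    return [[counts[(assigned_cls, true_cls)] for true_cls in classes]
            for assigned_cls in classes]


# Made-up labels for five messages.
classes = ["spam", "not spam"]
true_labels = ["spam", "spam", "not spam", "not spam", "not spam"]
assigned_labels = ["spam", "not spam", "spam", "not spam", "not spam"]

table = contingency_table(true_labels, assigned_labels, classes)
```

The entries of the table always sum to the number of scored items, and the diagonal holds the correctly classified ones.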

2×2 Contingency Table
Binary classification in this example

                             True class
Classifier-assigned class    Spam    Not spam
Spam                         a       b
Not spam                     c       d

Two types of error
Usually one class is associated with “success” or “detection”
False positives: report that the sought-after class is the correct one when it is not (b in the contingency table)
False negatives: fail to report the sought-after class even when it is the correct one (c in the contingency table)

Performance measures
Accuracy: how often is the classification correct?
A = (a+d)/N, where N is the size of the scored set (N = a+b+c+d)
Problem: if the a priori probability of one class is much higher, always predicting that class yields high accuracy, yet it is not a very meaningful classifier
– E.g., in a disease detection test
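A small numeric example makes the problem concrete. The counts below are hypothetical: a test set of 1000 people, 10 of whom have the disease, with a, b, c, d used as in the contingency table (a and d correct, b false positives, c false negatives).

```python
def accuracy(a, b, c, d):
    """A = (a + d) / N, with N = a + b + c + d."""
    return (a + d) / (a + b + c + d)


# Hypothetical "classifier" that always predicts "healthy":
# it misses all 10 sick people (c = 10) and is right about the 990 healthy ones.
always_negative = accuracy(a=0, b=0, c=10, d=990)

# Hypothetical classifier that finds 8 of the 10 cases (a = 8, c = 2)
# at the cost of 30 false alarms (b = 30, d = 960).
useful = accuracy(a=8, b=30, c=2, d=960)
```

Here `always_negative` is 0.99 while `useful` is 0.968, so accuracy alone ranks the classifier that never detects anything above the one that finds most cases.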

Accounting for rare classes
Assign a cost to each error and measure the expected cost
– Normalize for fixed N to make results comparable across experiments
Measure separate error rates
– Precision P = a/(a+b)
– Recall (or sensitivity) R = a/(a+c)
– Specificity = d/(d+b)
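These measures translate directly into code. The function names, the choice to return `None` when a denominator is zero, and the example costs are all assumptions for illustration; a, b, c, d follow the contingency table (b false positives, c false negatives).

```python
def precision(a, b):
    """P = a / (a + b); None if nothing was assigned to the sought-after class."""
    return a / (a + b) if (a + b) else None


def recall(a, c):
    """R = a / (a + c), also called sensitivity; None if the class never occurs."""
    return a / (a + c) if (a + c) else None


def specificity(b, d):
    """d / (d + b); None if there are no true negatives or false positives."""
    return d / (d + b) if (d + b) else None


def expected_cost(b, c, n, cost_fp=1.0, cost_fn=1.0):
    """Average cost per scored item, given a cost for each type of error."""
    return (cost_fp * b + cost_fn * c) / n


# Hypothetical counts: 8 true positives, 30 false positives,
# 2 false negatives, 960 true negatives.
p = precision(a=8, b=30)
r = recall(a=8, c=2)
spec = specificity(b=30, d=960)
# Assume a missed case costs ten times as much as a false alarm.
cost = expected_cost(b=30, c=2, n=1000, cost_fp=1.0, cost_fn=10.0)
```

With these counts recall is 0.8 but precision is only about 0.21, a trade-off that the single accuracy figure hides.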