
1 Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas

2 Hash tables A hash table or associative array efficiently implements a function with a very large domain but relatively few recorded values. Example: map names to phone numbers – although there are many possible names, only a few will be stored in a particular phone book.

3 Implementing hash tables A hash table works by using a hash function to translate the input (keys) to a small range of buckets – for example, h(n) = n mod k, where k is the size of the hash table. Collisions can occur when different keys are mapped to the same bucket, and must be resolved. Many programming languages directly support hash tables.
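
A minimal sketch of this idea, assuming string keys hashed with Python's built-in hash() and collisions resolved by chaining; the bucket count k = 8 and the PhoneBook name are illustrative choices, not part of the slides.

```python
class PhoneBook:
    """Minimal hash table with chaining, illustrating h(key) = hash(key) mod k."""

    def __init__(self, k=8):
        self.k = k                              # number of buckets
        self.buckets = [[] for _ in range(k)]

    def _bucket(self, name):
        return self.buckets[hash(name) % self.k]   # the h(n) = n mod k step

    def put(self, name, number):
        bucket = self._bucket(name)
        for i, (key, _) in enumerate(bucket):
            if key == name:                     # key already present: overwrite
                bucket[i] = (name, number)
                return
        bucket.append((name, number))           # colliding keys share the bucket list

    def get(self, name):
        for key, number in self._bucket(name):
            if key == name:
                return number
        raise KeyError(name)

book = PhoneBook()
book.put("Alice", "555-0100")
book.put("Bob", "555-0101")
print(book.get("Alice"))                        # 555-0100
```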

4 Example hash table

5 FASTA after step 1

6 FASTA – Step 2 Group together hot spots on the same diagonal. This creates a partial alignment with matches and mismatches (no indels). Keep the 10 best diagonal runs. If a hot spot matches at position i in S and position j in T, it will be on the (i-j)th diagonal; sort hot spots by i-j to group them.
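
A minimal sketch of the grouping step, assuming hot spots arrive from step 1 as (i, j) position pairs; the example coordinates are made up.

```python
from collections import defaultdict

def group_by_diagonal(hot_spots):
    """Group hot spots (i, j) by diagonal i - j, as in FASTA step 2."""
    diagonals = defaultdict(list)
    for i, j in hot_spots:
        diagonals[i - j].append((i, j))
    # Within each diagonal, sort by position so runs can be formed left to right
    for spots in diagonals.values():
        spots.sort()
    return diagonals

# Hypothetical hot spots from step 1
hot_spots = [(3, 1), (7, 5), (10, 8), (4, 9), (6, 11)]
for diag, spots in sorted(group_by_diagonal(hot_spots).items()):
    print(f"diagonal {diag}: {spots}")
```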

7 FASTA – Step 3 Rescore exact matches using realistic substitution penalties (from a set such as PAM250 for proteins). Trim and extend hot spots according to substitution penalties, allowing “good” mismatches.
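
A sketch of how an ungapped run could be rescored against a substitution table; the tiny DNA score table below is made up for illustration (a protein implementation would look up PAM250 entries instead).

```python
# Toy substitution scores; a real run would use PAM250 (or similar) values.
SUBST = {
    ("A", "A"): 2, ("G", "G"): 5, ("T", "T"): 3,
    ("A", "G"): 1, ("G", "A"): 1,
    ("A", "T"): -1, ("T", "A"): -1,
    ("G", "T"): -2, ("T", "G"): -2,
}

def rescore_run(s, t, i, j, length):
    """Score the ungapped run pairing s[i:i+length] with t[j:j+length]."""
    return sum(SUBST.get((a, b), -1)            # default penalty for unlisted pairs
               for a, b in zip(s[i:i+length], t[j:j+length]))

print(rescore_run("GATTACA", "GATGACA", 0, 0, 7))
```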

8 The PAM matrices From observing closely related proteins in evolution, we can estimate the likelihood that one amino acid mutates to another. Normalize these probabilities in PAM units (Percent Accepted Mutations per 100 amino acids). The PAM0 matrix is the identity matrix; the PAM1 matrix diverges slightly from the identity matrix.

9 Calculating PAM matrices If we have PAM1, then PAM_N = (PAM1)^N – a Markov chain of independent mutations. The PAM250 matrix has been found empirically to be the most useful. At this evolutionary distance, 80% of amino acids are changed; change varies according to amino acid class (from only 45% to 94%), and some amino acids are no longer good matches with themselves.
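
The relation PAM_N = (PAM1)^N is just repeated matrix multiplication. A sketch using a made-up 3×3 stand-in for the real 20×20 PAM1 mutation-probability matrix:

```python
import numpy as np

# Made-up 3x3 stand-in for the 20x20 PAM1 matrix; each row sums to 1
# (row = original residue, column = residue it mutates into).
pam1 = np.array([
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.004, 0.006, 0.990],
])

# PAM_N = (PAM1)^N: a Markov chain of N independent mutation steps.
pam250 = np.linalg.matrix_power(pam1, 250)
print(pam250.round(3))
```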

10 FASTA after Steps 2 and 3

11 FASTA – Step 4 Starting from the best diagonal run, look at nearby diagonal runs and incorporate non-overlapping hot spots. This extends the partial alignment with some insertions and deletions. We only look a limited distance from the best diagonal run.
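
One possible sketch of this combination step, assuming each diagonal run is a (diagonal, start, end, score) tuple; the greedy order and the max_dist cutoff are illustrative choices, not FASTA's exact heuristics.

```python
def combine_runs(best, others, max_dist=10):
    """Greedily add runs whose diagonals lie within max_dist of the best run
    and whose query intervals do not overlap anything already selected."""
    selected = [best]
    for run in sorted(others, key=lambda r: -r[3]):        # highest score first
        diag, start, end, _ = run
        if abs(diag - best[0]) > max_dist:
            continue                                        # too far from best diagonal
        if any(start <= e and end >= s for _, s, e, _ in selected):
            continue                                        # overlaps a chosen run
        selected.append(run)
    return selected

best = (2, 10, 20, 35)
others = [(3, 22, 30, 18), (15, 40, 50, 25), (1, 15, 25, 12)]
print(combine_runs(best, others))   # keeps the run on diagonal 3; the rest are too far or overlap
```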

12 FASTA after Step 4

13 FASTA – Step 5 Run the full dynamic programming alignment algorithm in a band around the extended best diagonal run. Only consider matches within w positions on either side of the extended best diagonal run. Typically w = 16, so only about 2w = 32 cells are filled per row, and 32n ≪ n².
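
A sketch of the banded dynamic-programming idea, written here as banded global alignment around the main diagonal with simple match/mismatch/gap scores; FASTA instead bands around the extended best diagonal run and uses its own scoring, so the values below are only illustrative.

```python
def banded_align_score(s, t, w=16, match=1, mismatch=-1, gap=-2):
    """Best global alignment score, filling only cells within w of the main diagonal."""
    NEG = float("-inf")
    score = {(0, 0): 0}
    for i in range(len(s) + 1):
        lo, hi = max(0, i - w), min(len(t), i + w)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:                              # diagonal: match or mismatch
                best = max(best, score.get((i - 1, j - 1), NEG)
                           + (match if s[i - 1] == t[j - 1] else mismatch))
            if i > 0:                                        # gap in t
                best = max(best, score.get((i - 1, j), NEG) + gap)
            if j > 0:                                        # gap in s
                best = max(best, score.get((i, j - 1), NEG) + gap)
            score[(i, j)] = best
    return score[(len(s), len(t))]

print(banded_align_score("GATTACA", "GACTATA", w=3))
```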

14 FASTA final step

15 BLAST Basic Local Alignment Search Tool. Uses words, like FASTA, but allows for approximate matches of words to create high-scoring pairs (HSPs). Usually longer words (k = 3 for proteins, k = 11 for DNA). HSPs on the same diagonal are combined and extended. Reports local alignments based on one HSP or a combination of two close HSPs. Variations allow gaps and pattern search.
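
A sketch of the word-seeding step, simplified to exact k-mer matches; real BLAST also admits approximate "neighborhood" words that score above a threshold under the substitution matrix, which is omitted here.

```python
from collections import defaultdict

def word_hits(query, target, k=11):
    """Find positions where a length-k word of the query occurs exactly in the target."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))                 # seed for a high-scoring pair (HSP)
    return hits

print(word_hits("ACGTACGTACGTAA", "TTACGTACGTACGTGG", k=11))
```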

16 Alignment as classification Alignment can be viewed as – a function that produces similarity values between any two strings; these similarity values can then be used to inform classifiers and clustering programs – a binary classifier: any two strings are classified as related/similar or not. This requires the use of a threshold; the threshold can be fixed or depend on the context and application.
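
A sketch of the thresholded-classifier view; naive_similarity below is a hypothetical stand-in for a real alignment score (such as a FASTA or BLAST score), and the threshold of 0.7 is arbitrary.

```python
def make_classifier(similarity, threshold):
    """Turn a similarity function into a binary related/unrelated classifier."""
    def classify(s, t):
        return similarity(s, t) >= threshold
    return classify

# Hypothetical similarity: fraction of matching positions (a real system
# would plug in an alignment score instead).
def naive_similarity(s, t):
    n = min(len(s), len(t))
    return sum(a == b for a, b in zip(s, t)) / n if n else 0.0

is_related = make_classifier(naive_similarity, threshold=0.7)
print(is_related("GATTACA", "GATTATA"))   # True: 6/7 positions match
print(is_related("GATTACA", "CCCCCCC"))   # False
```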

17 Measuring performance Done on a test set separate from the training set (the examples with known labels). We need to know (but not make available to the classifier) the class labels in the test set, in order to evaluate the classifier’s performance. Both sets must be representative of the problem instances – not always the case.

18 Contingency tables Given an n-way classifier and a set with both classifier-assigned labels and correct, known labels, we construct an n×n contingency table counting all combinations of true/assigned classes.
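
A sketch of building such a table from two label lists; the spam/ham labels and counts below are made up.

```python
from collections import Counter

def contingency_table(true_labels, assigned_labels):
    """Count (true class, assigned class) pairs into an n-by-n table."""
    counts = Counter(zip(true_labels, assigned_labels))
    classes = sorted(set(true_labels) | set(assigned_labels))
    return {t: {a: counts[(t, a)] for a in classes} for t in classes}

true_labels     = ["spam", "spam", "ham", "ham", "spam"]
assigned_labels = ["spam", "ham",  "ham", "spam", "spam"]
print(contingency_table(true_labels, assigned_labels))
# rows: true class, columns: classifier-assigned class
```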

19 2×2 Contingency Table Binary classification in this example:

                 Classifier-assigned class
  True class     Spam      Not spam
  Spam           a         c
  Not spam       b         d

20 Two types of error Usually one class is associated with “success” or “detection”. False positives: report that the sought-after class is the correct one when it is not (b in the contingency table). False negatives: fail to report the sought-after class even when it is the correct one (c in the contingency table).

21 Performance measures Accuracy: how often is the classification correct? A = (a+d)/N, where N is the size of the scored set (N = a+b+c+d). Problem: if the a priori probability of one class is much higher, we are usually better off just always predicting that class, which is not a very meaningful classifier (e.g., in a disease detection test).

22 Accounting for rare classes Assign a cost to each error and measure the expected error – normalize for fixed N to make results comparable across experiments. Or measure separate error rates – Precision P = a/(a+b) – Recall (or sensitivity) R = a/(a+c) – Specificity = d/(d+b).
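
The measures from the last two slides, computed directly from the 2×2 counts a, b, c, d; the sample counts below are made up.

```python
def evaluation_measures(a, b, c, d):
    """a = true positives, b = false positives, c = false negatives, d = true negatives."""
    n = a + b + c + d
    return {
        "accuracy":    (a + d) / n,
        "precision":   a / (a + b),
        "recall":      a / (a + c),     # also called sensitivity
        "specificity": d / (d + b),
    }

# Made-up counts: 90 spam caught, 10 false alarms, 30 spam missed, 870 clean messages passed.
print(evaluation_measures(a=90, b=10, c=30, d=870))
```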

