# BLAST, PSI-BLAST and position-specific scoring matrices


BLAST, PSI-BLAST and position-specific scoring matrices. Prof. William Stafford Noble, Department of Genome Sciences and Department of Computer Science and Engineering, University of Washington. thabangh@gmail.com

Outline

- Responses from last class
- Revision
- BLAST
- PSI-BLAST
- Position-specific scoring matrices (PSSMs)
- Python

One-minute responses

- Please explain the null and alternative hypothesis again.
- Liked giving examples on the statistical concepts.
- Sometimes the class is boring because you are using only the projector.
- For Python, we learn more by practicing than just looking at your code.
- Python session was good, but too fast.
- More Python examples, please.
- The Python is difficult because it is different from what we learned before.
- The problem is how to use sys in Python.
- I hope you give lots of examples for the sys command.
- Please be available for consultation over the weekend on the assignment.
- Does BLAST use p-values to decide which alignments to consider?

Revision

- What is a distribution? A mathematical function whose values sum to 1.
- If you roll a single die many times and make a histogram of the resulting values, what kind of distribution will you observe? Uniform.
- If you compare a protein sequence to many randomly shuffled protein sequences and make a histogram of the resulting scores, what kind of distribution will you observe? Extreme value distribution.
- What is the definition of "null hypothesis"? A statistical model of the situation that we are not interested in.
- What is the opposite of the null hypothesis? The alternative hypothesis.
- What is the name of the estimated probability of observing the data, assuming that it was generated according to the null hypothesis? p-value.
- How do you decide what p-value threshold to use? Consider the costs associated with making a mistake.
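One way to make the p-value definition concrete is to estimate it from scores observed under the null hypothesis, for example scores against shuffled sequences. The sketch below is only an illustration: the function name `empirical_p_value` and the add-one correction (which keeps the estimate from being exactly zero) are assumptions, not something stated on the slides.

```python
def empirical_p_value(observed_score, null_scores):
    """Estimate P(score >= observed_score) under the null hypothesis.

    null_scores: scores obtained from the null model, e.g. by comparing
    the query to shuffled sequences (hypothetical input).
    """
    # Count how many null scores are at least as large as the observed one.
    at_least_as_high = sum(1 for score in null_scores if score >= observed_score)
    # Add-one correction (an assumption of this sketch) avoids reporting 0.
    return (at_least_as_high + 1) / (len(null_scores) + 1)


if __name__ == "__main__":
    print(empirical_p_value(45, [10, 20, 30, 50]))
```

With only four null scores the estimate is very coarse; in practice many more null samples would be needed before comparing against a small threshold.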

Significance of scores

[Figure: a sequence alignment algorithm compares HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT with LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE and returns a score of 45.]

Low score = unrelated. High score = homologs. How high is high enough?

Database searching

[Figure: a sequence comparison algorithm compares the query against a sequence database and returns the targets ranked by score.]

How long does DP take?

[Figure: dynamic programming matrix, with the target sequence of length m along one axis and the query sequence of length n along the other.]

- There are nm entries in the matrix.
- Each entry requires a constant number c of operations.
- The total number of required operations is approximately nmc.
- We say that the algorithm is "order nm" or "O(nm)."
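The nm cell count can be seen directly in a small local-alignment sketch. This is a minimal Smith-Waterman-style fill with a linear gap penalty; the scoring parameters (`match=2`, `mismatch=-1`, `gap=-2`) are placeholder assumptions chosen for illustration, not values from the lecture, and a real protein search would use a substitution matrix such as BLOSUM62.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, number of DP cells filled)."""
    n, m = len(query), len(target)
    # (n + 1) x (m + 1) matrix; row 0 and column 0 stay at zero.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    cells_filled = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            # Each cell needs a constant number of operations.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align the two residues
                          H[i - 1][j] + gap,     # gap in the target
                          H[i][j - 1] + gap)     # gap in the query
            best = max(best, H[i][j])
            cells_filled += 1
    return best, cells_filled


if __name__ == "__main__":
    print(smith_waterman("HPDKK", "HPDKK"))
```

The second return value is always `len(query) * len(target)`, which is the nm term in O(nm).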

How long does DP take?

- Say that your query is 200 amino acids long.
- You are searching a database that contains a million proteins.
- If their average length is 200, then you have to fill in 200 × 200 × 1,000,000 = 4 × 10^10 DP entries.
- If it takes only 10 operations to fill in each cell, then you still have to do 4 × 10^11 operations.
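The same arithmetic can be checked in a few lines of Python. The variable names are made up for this sketch; the numbers are the ones quoted on the slide.

```python
query_length = 200          # amino acids in the query
average_target_length = 200 # average protein length in the database
number_of_targets = 1_000_000
operations_per_cell = 10

# One DP matrix per target, each with query_length * target_length cells.
cells = query_length * average_target_length * number_of_targets
operations = cells * operations_per_cell

print(f"{cells:.0e} DP entries, {operations:.0e} operations")
```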

BLAST

- DP is O(nm); BLAST is O(m).
- Fundamental innovation: employ a data structure to index the query sequence.
- The data structure allows you to look up entries in a table in O(1) time.
- Does my length-n sequence contain the subsequence "GTR"?
  - Naive method: scan the sequence, O(n).
  - Improved method: hash table or search tree lookup, O(1).
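The indexing idea can be sketched with a Python dictionary, which is a hash table. The function names `build_word_index` and `find_hits` and the word length `k=3` are assumptions made for this example; it indexes exact words only and does not include the similar words that BLAST also stores.

```python
from collections import defaultdict


def build_word_index(query, k=3):
    """Map every length-k word in the query to its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_hits(index, target, k=3):
    """Scan the target once; each word lookup is a hash-table access."""
    hits = []
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            hits.append((i, j))  # (query position, target position)
    return hits


if __name__ == "__main__":
    index = build_word_index("MGTRKGTR")
    print(index["GTR"])                 # positions of "GTR" in the query
    print(find_hits(index, "AAGTRA"))   # word hits against one target
```

Building the index costs time proportional to the query length, but it is done once; after that the work per target grows with the target length rather than with the product of the two lengths.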

BLAST

[Figure sequence: a plot with the query sequence on one axis and the target sequence on the other, next to a list of words in the query and similar words. Hits are marked with an x.]

- For each word in the target: "Does this target word appear in the query word list?"
- "Yes, at position 34 in the query sequence." The hit is marked with an x on the plot.
- After scanning the whole target, the plot contains many hits.
- These two hits are on the diagonal and close to each other, so let's try to connect them.
- Assign a score to each hit (the two remaining hits in the figure are labeled 0.005 and 0.27).
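The step of finding hits that lie on the same diagonal and close together can be sketched as follows. Two hits share a diagonal when the difference between target position and query position is the same. The function name `paired_hits` and the window `max_distance=40` are assumptions of this sketch, not parameters taken from the slides.

```python
from collections import defaultdict


def paired_hits(hits, max_distance=40):
    """Return pairs of neighboring hits on the same diagonal.

    hits: list of (query_position, target_position) tuples.
    """
    by_diagonal = defaultdict(list)
    for q, t in hits:
        by_diagonal[t - q].append(q)  # diagonal = target pos - query pos

    pairs = []
    for diagonal, q_positions in by_diagonal.items():
        q_positions.sort()
        # Compare each hit with the next one along the same diagonal.
        for a, b in zip(q_positions, q_positions[1:]):
            if 0 < b - a <= max_distance:
                pairs.append(((a, a + diagonal), (b, b + diagonal)))
    return pairs


if __name__ == "__main__":
    hits = [(3, 10), (12, 19), (30, 5), (100, 107)]
    print(paired_hits(hits))
```

In this example only the first two hits are paired: the third is on a different diagonal, and the fourth is on the same diagonal but too far away.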

BLAST

- "The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words."
- The initial word threshold T is the most important parameter.
- Low T = high sensitivity, long compute.
- High T = low sensitivity, quick compute.
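The effect of T on the word list can be illustrated by enumerating every word whose score against a query word reaches the threshold. The scoring function here is a toy stand-in: the residue groups and the scores 5, 1 and -2 are invented for this sketch so that the example stays short, whereas BLAST would score words with a real substitution matrix.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical residue groups, used only by this toy scoring function.
GROUPS = ["AGST", "ILMV", "FWY", "DENQ", "HKR", "C", "P"]
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def toy_score(a, b):
    """Toy substitution score: identical, same group, or unrelated."""
    if a == b:
        return 5
    if GROUP_OF[a] == GROUP_OF[b]:
        return 1
    return -2


def neighborhood(word, threshold):
    """All words of the same length scoring at least `threshold` vs `word`."""
    neighbors = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            neighbors.append("".join(candidate))
    return neighbors


if __name__ == "__main__":
    for T in (15, 11, 7):
        print(T, len(neighborhood("GTR", T)))
```

Lowering T makes the word list longer, so more target words produce hits: sensitivity goes up and so does the amount of work.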

When does BLAST fail?

- BLAST works by joining together short regions of high similarity.
- Therefore, BLAST will fail to detect long regions of low similarity.

Example of two sequences whose matching residues are scattered along their length:

ERDCRVSSFRVKENFDKARFAGTWYAMAKKDPEGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDT

Matching residues: E R F E K A Y K E L I F E M A V N V M F

ECEIRQFLFIQRESARKEACATGTYREKKMDPELIVLVIWICPQFEQLEMRAMWIHAKJEVIUENAQCVIYTMQEPFCII

Summary of BLAST

- Dynamic programming is O(nm), where n is the length of the query and m is the size of the database.
- BLAST is O(m).
- BLAST produces an index of the query sequence that allows fast matching to the database.
- Relative to Smith-Waterman, BLAST can produce false negatives; i.e., homologs that BLAST fails to detect.

[Figure: BLAST takes a query and a sequence database and returns homologs.]

Position-specific iterated BLAST

[Figure: BLAST searches the sequence database with the query and returns homologs. The homologs are used to build a statistical model of the protein family, a position-specific scoring matrix (PSSM).]

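Position-specific scoring matrix

A PSSM is an n by m matrix, where n is the size of the alphabet, and m is the length of the sequence. The entry at (i, j) is the score assigned by the PSSM to letter i at the jth position.

Columns are positions in the query sequence.

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| A | -1 | -2 | -1 | 0 | -1 | -2 | 0 | -2 |
| R | 5 | 0 | 5 | -2 | 1 | -3 | -2 | 0 |
| N | 0 | 6 | 0 | 0 | 0 | -3 | 0 | 1 |
| D | -2 | 1 | -2 | -1 | 0 | -3 | -1 | -1 |
| C | -3 | -3 | -3 | -3 | -3 | -2 | -3 | -3 |
| Q | 1 | 0 | 1 | -2 | 5 | -3 | -2 | 0 |
| E | 0 | 0 | 0 | -2 | 2 | -3 | -2 | 0 |
| G | -2 | 0 | -2 | 6 | -2 | -3 | 6 | -2 |
| H | 0 | 1 | 0 | -2 | 0 | -1 | -2 | 8 |
| I | -3 | -3 | -3 | -4 | -3 | 0 | -4 | -3 |
| L | -2 | -3 | -2 | -4 | -2 | 0 | -4 | -3 |
| K | 2 | 0 | 2 | -2 | 1 | -3 | -2 | -1 |
| M | -1 | -2 | -1 | -3 | 0 | 0 | -3 | -2 |
| F | -3 | -3 | -3 | -3 | -3 | 6 | -3 | -1 |
| P | -2 | -2 | -2 | -2 | -1 | -4 | -2 | -2 |
| S | -1 | 1 | -1 | 0 | 0 | -2 | 0 | -1 |
| T | -1 | 0 | -1 | -2 | -1 | -2 | -2 | -2 |
| W | -3 | -4 | -3 | -2 | -2 | 1 | -2 | -2 |
| Y | -2 | -2 | -2 | -3 | -1 | 3 | -3 | 2 |
| V | -3 | -3 | -3 | -3 | -2 | -1 | -3 | -3 |

"K" at position 3 gets a score of 2.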

Position-specific scoring matrix

This PSSM assigns the sequence NMFWAFGH a score of 0 + (-2) + (-3) + (-2) + (-1) + 6 + 6 + 8 = 12.

What score does this PSSM assign to KRPGHFLA?

2 + 0 + (-2) + 6 + 0 + 6 + (-4) + (-2) = 6
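Scoring a sequence with a PSSM is a lookup and a sum, as in the sketch below. The matrix values were transcribed for this example from the BLOSUM62 columns for the query RNRGQFGH, which is consistent with the per-position scores quoted on the slides; the names `PSSM` and `score_sequence` are assumptions.

```python
# Rows: amino acids; columns: positions 1..8 of the query (0-based in code).
PSSM = {
    "A": [-1, -2, -1,  0, -1, -2,  0, -2],
    "R": [ 5,  0,  5, -2,  1, -3, -2,  0],
    "N": [ 0,  6,  0,  0,  0, -3,  0,  1],
    "D": [-2,  1, -2, -1,  0, -3, -1, -1],
    "C": [-3, -3, -3, -3, -3, -2, -3, -3],
    "Q": [ 1,  0,  1, -2,  5, -3, -2,  0],
    "E": [ 0,  0,  0, -2,  2, -3, -2,  0],
    "G": [-2,  0, -2,  6, -2, -3,  6, -2],
    "H": [ 0,  1,  0, -2,  0, -1, -2,  8],
    "I": [-3, -3, -3, -4, -3,  0, -4, -3],
    "L": [-2, -3, -2, -4, -2,  0, -4, -3],
    "K": [ 2,  0,  2, -2,  1, -3, -2, -1],
    "M": [-1, -2, -1, -3,  0,  0, -3, -2],
    "F": [-3, -3, -3, -3, -3,  6, -3, -1],
    "P": [-2, -2, -2, -2, -1, -4, -2, -2],
    "S": [-1,  1, -1,  0,  0, -2,  0, -1],
    "T": [-1,  0, -1, -2, -1, -2, -2, -2],
    "W": [-3, -4, -3, -2, -2,  1, -2, -2],
    "Y": [-2, -2, -2, -3, -1,  3, -3,  2],
    "V": [-3, -3, -3, -3, -2, -1, -3, -3],
}


def score_sequence(pssm, sequence):
    """Sum the PSSM entries for each letter at its position (no gaps)."""
    length = len(next(iter(pssm.values())))
    if len(sequence) != length:
        raise ValueError("sequence length must match the PSSM length")
    return sum(pssm[letter][j] for j, letter in enumerate(sequence))


if __name__ == "__main__":
    print(score_sequence(PSSM, "NMFWAFGH"))
    print(score_sequence(PSSM, "KRPGHFLA"))
```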

How PSI-BLAST makes PSSMs

Position-specific iterated BLAST

[Figure: BLAST searches the sequence database with the query and produces a multiple alignment. How is the multiple alignment turned into a PSSM?]

Creating a PSSM from 1 sequence

[Figure: for each residue of the query RNRGQFGH (for example, R), the matching column of the BLOSUM62 matrix is copied into the PSSM. The 20 by 20 BLOSUM62 matrix yields a 20 by L PSSM, where L is the length of the query.]
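The single-sequence construction can be sketched by copying one BLOSUM62 column per query position. The BLOSUM62 values below were typed in from the standard published matrix for this example and should be checked against an authoritative copy before being reused; the function name `pssm_from_sequence` is an assumption.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62, rows and columns in the order given by AMINO_ACIDS.
BLOSUM62_TEXT = """
 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
-1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
-2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
-2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
 0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
-1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
-1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
-2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
 1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
 0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
-2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""

# BLOSUM62[a][b] is the substitution score for residues a and b.
BLOSUM62 = {
    row_aa: dict(zip(AMINO_ACIDS, map(int, line.split())))
    for row_aa, line in zip(AMINO_ACIDS, BLOSUM62_TEXT.strip().splitlines())
}


def pssm_from_sequence(query):
    """Build a 20 by L PSSM: column j is the BLOSUM62 column for query[j]."""
    return {aa: [BLOSUM62[aa][q] for q in query] for aa in AMINO_ACIDS}


if __name__ == "__main__":
    pssm = pssm_from_sequence("RNRGQFGH")
    print(pssm["K"])  # scores for K at each of the 8 query positions
```

A PSSM built this way scores any target exactly as an ungapped BLOSUM62 comparison with the query would, which is why the first PSI-BLAST iteration behaves like an ordinary BLAST search.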


Creating a PSSM from multiple sequences

- Discard columns that contain gaps in the query.
- For each column C:
  - Compute relative sequence weights.
  - Compute PSSM entries, taking into account the observed residues in this column, the sequence weights, and the substitution matrix.

Discard query gap columns

| Before | After |
|---|---|
| EEFG----SVDGLVNNA | EEFGSVDGLVNNA |
| QKYG----RLDVMINNA | QKYGRLDVMINNA |
| RRLG----TLNVLVNNA | RRLGTLNVLVNNA |
| GGIG----PVD-LVNNA | GGIGPVD-LVNNA |
| KALG----GFNVIVNNA | KALGGFNVIVNNA |
| ARFG----KID-LIPNA | ARFGKID-LIPNA |
| FEPEGPEKGMWGLVNNA | FEPEGMWGLVNNA |
| AQLK----TVDVLINGA | AQLKTVDVLINGA |
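Dropping the query gap columns can be sketched as a column filter over the alignment. The slide does not say which row is the query, so this example assumes it is the first one; the function name `drop_query_gap_columns` is also an assumption.

```python
def drop_query_gap_columns(alignment, query_index=0, gap="-"):
    """Remove every column in which the query row contains a gap."""
    query = alignment[query_index]
    keep = [j for j, residue in enumerate(query) if residue != gap]
    return ["".join(row[j] for j in keep) for row in alignment]


if __name__ == "__main__":
    alignment = [
        "EEFG----SVDGLVNNA",  # assumed to be the query
        "QKYG----RLDVMINNA",
        "RRLG----TLNVLVNNA",
        "GGIG----PVD-LVNNA",
        "KALG----GFNVIVNNA",
        "ARFG----KID-LIPNA",
        "FEPEGPEKGMWGLVNNA",
        "AQLK----TVDVLINGA",
    ]
    for row in drop_query_gap_columns(alignment):
        print(row)
```

Gaps in the other rows are kept, which is why two of the trimmed sequences still contain a "-".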

Compute sequence weights

Low weights are assigned to redundant sequences. High weights are assigned to unique sequences.

| Sequence | Weight |
|---|---|
| EEFGSVDGLVNNA | 1.2 |
| QKYGRLDVMINNA | 1.2 |
| RRLGTLNVLVNNA | 0.8 |
| GGIGPVDLLVNNA | 0.8 |
| KALGGFNVIVNNA | 1.1 |
| ARFGKIDTLIPNA | 0.9 |
| FEPEGMWGLVNNA | 1.1 |
| AQLKTVDVLINGA | 1.3 |
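The slide does not say how the weights are computed. One common scheme with the stated behavior is position-based weighting in the style of Henikoff and Henikoff, sketched below as an assumption: in each column, a sequence receives 1 / (k × n), where k is the number of distinct residues in the column and n is the number of sequences sharing its residue. The weights it produces will not match the illustrative numbers on the slide.

```python
from collections import Counter


def position_based_weights(alignment):
    """Position-based sequence weights, rescaled to have mean 1.

    Gap characters are treated like any other symbol, which is a
    simplification made for this sketch.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        distinct = len(counts)
        for i, residue in enumerate(column):
            # Rare residues in a column contribute more weight.
            weights[i] += 1.0 / (distinct * counts[residue])
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


if __name__ == "__main__":
    print(position_based_weights(["AAA", "AAA", "CCC"]))
```

In the toy alignment the two identical sequences share their weight, while the unique one is weighted twice as heavily as either of them.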

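Compute PSSM entries

[Figure: the eight aligned sequences (EEFGSVDGLVNNA, QKYGRLDVMINNA, RRLGTLNVLVNNA, GGIGPVDLLVNNA, KALGGFNVIVNNA, ARFGKIDTLIPNA, FEPEGMWGLVNNA, AQLKTVDVLINGA) with weights 1.2, 1.2, 0.8, 0.8, 1.1, 0.9, 1.1 and 1.3 are combined with the BLOSUM62 matrix to produce the PSSM.]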
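A simplified version of this step is sketched below. It is not PSI-BLAST's formula: instead of deriving pseudocounts from the substitution matrix, it swaps in a uniform background frequency of 1/20 and a single `pseudocount` weight, then reports log-odds scores in bits. All names and parameter values are assumptions chosen to keep the example short.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def weighted_log_odds_pssm(alignment, weights, pseudocount=1.0):
    """Weighted column frequencies plus uniform pseudocounts, as log-odds."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    length = len(alignment[0])
    pssm = {aa: [0.0] * length for aa in AMINO_ACIDS}
    for j in range(length):
        # Sum the weights of the sequences showing each residue in column j.
        observed = {aa: 0.0 for aa in AMINO_ACIDS}
        for sequence, weight in zip(alignment, weights):
            if sequence[j] in observed:  # skip gaps and unknown symbols
                observed[sequence[j]] += weight
        total = sum(observed.values())
        for aa in AMINO_ACIDS:
            probability = (observed[aa] + pseudocount * background) / (total + pseudocount)
            pssm[aa][j] = math.log2(probability / background)
    return pssm


if __name__ == "__main__":
    alignment = ["EEFGSVDGLVNNA", "QKYGRLDVMINNA", "RRLGTLNVLVNNA",
                 "GGIGPVDLLVNNA", "KALGGFNVIVNNA", "ARFGKIDTLIPNA",
                 "FEPEGMWGLVNNA", "AQLKTVDVLINGA"]
    weights = [1.2, 1.2, 0.8, 0.8, 1.1, 0.9, 1.1, 1.3]
    pssm = weighted_log_odds_pssm(alignment, weights)
    print(round(pssm["A"][12], 2), round(pssm["W"][12], 2))
```

The last column of the alignment is A in every sequence, so A gets a strongly positive score there and every other residue a negative one.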

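Position-specific iterated BLAST

[Figure: BLAST searches the sequence database with the query and produces a multiple alignment; the multiple alignment is converted into a PSSM, which is used for the next search.]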

Summary of PSI-BLAST

- PSI-BLAST builds a model of the query sequence and its close homologs.
- Instead of comparing a target sequence to the query, each target is compared to the model.
- The PSI-BLAST model is called a position-specific scoring matrix (PSSM).
- The PSSM can be constructed from a collection of targets aligned to the query sequence.
- PSI-BLAST is more accurate than BLAST: it can detect remote homologs that a single BLAST search misses.

Sample problem #1

Given:
- a file containing a sequence of amino acids

Return:
- the amino acid counts

Example: running `./compute-counts.py seq1.txt` prints "Read 68 amino acids from seq1.txt." followed by the counts:

A 5, C 2, D 3, E 1, F 6, G 0, H 0, I 2, K 2, L 8, M 1, N 5, P 7, Q 1, R 1, S 2, T 5, V 6, W 3, Y 8
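One possible solution is sketched below, with `sys.argv` used to read the file name from the command line. It assumes the file holds plain sequence letters, possibly split over several lines, since the file format is not described on the slide; the function names are made up for this example.

```python
import sys

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_amino_acids(sequence):
    """Count each of the 20 amino acids; other characters are ignored."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for letter in sequence.upper():
        if letter in counts:
            counts[letter] += 1
    return counts


def main(filename):
    # Join all lines so that a sequence wrapped over several lines works.
    with open(filename) as handle:
        sequence = "".join(handle.read().split())
    counts = count_amino_acids(sequence)
    print(f"Read {sum(counts.values())} amino acids from {filename}.")
    for aa in AMINO_ACIDS:
        print(aa, counts[aa])


if __name__ == "__main__":
    # sys.argv[0] is the script name; sys.argv[1] is the first argument.
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Usage: compute-counts.py <sequence-file>", file=sys.stderr)
```

`sys.argv` is simply a list of strings: running `./compute-counts.py seq1.txt` gives `sys.argv == ["./compute-counts.py", "seq1.txt"]`.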

Sample problem #2

Given:
- a pseudocount weight
- a file containing amino acid frequencies
- a file containing a sequence of amino acids

Return:
- the summed amino acid counts and pseudocounts

Sample problem #3

Given:
- a pseudocount weight
- a file containing amino acid frequencies
- a file containing a sequence of amino acids

Return:
- the normalized summed amino acid counts and pseudocounts
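The core of sample problems #2 and #3 can be sketched with two small functions. This sketch assumes that the pseudocount for each amino acid is the pseudocount weight multiplied by that amino acid's frequency, which the slides do not spell out. File parsing is left out because the file formats are not given, so counts and frequencies are passed in as dictionaries; the function names are made up.

```python
def add_pseudocounts(counts, frequencies, weight):
    """Return count + weight * frequency for every amino acid.

    Assumes pseudocount = weight * background frequency.
    """
    return {aa: counts.get(aa, 0) + weight * frequencies[aa]
            for aa in frequencies}


def normalize(values):
    """Scale the values so that they sum to 1."""
    total = sum(values.values())
    return {aa: value / total for aa, value in values.items()}


if __name__ == "__main__":
    # Tiny two-letter alphabet, used only to keep the example readable.
    counts = {"A": 3, "C": 1}
    frequencies = {"A": 0.5, "C": 0.5}
    summed = add_pseudocounts(counts, frequencies, weight=2)
    print(summed)
    print(normalize(summed))
```

Because the frequencies sum to 1, the summed values add up to the number of observed residues plus the pseudocount weight, and normalizing turns them into a probability distribution.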

