BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.

1 BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon

2 BLOSUM (BLOcks SUbstitution Matrices)
Publication: Henikoff and Henikoff, 1992
Motivation: PAM matrices do not capture the difference between short-term and long-term mutations
Method: For several degrees of sequence divergence, derive mutations from sets of related proteins. BLOSUM-k is based on related proteins with k% identity or less.

3 BLOSUM – Method
Use Blocks: collections of ungapped multiple alignments of similar segments
Cluster sequences together whenever they share more than k% identical residues
Count the substitutions between different clusters (within the same family)
Estimate frequencies from the counts
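The clustering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the single-linkage rule (a sequence joins a cluster if it is more than k% identical to any member), the helper names `percent_identity` and `cluster_sequences`, and the toy sequences are all assumptions made for the example.

```python
def percent_identity(s, t):
    """Percentage of aligned positions with identical residues (ungapped, equal length)."""
    matches = sum(1 for x, y in zip(s, t) if x == y)
    return 100.0 * matches / len(s)


def cluster_sequences(seqs, k):
    """Assign a cluster label to each sequence using single linkage at > k% identity."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if percent_identity(seqs[i], seqs[j]) > k:
                parent[find(i)] = find(j)

    # Relabel representatives as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(seqs))]


# Hypothetical ungapped segments, invented for illustration only.
toy_block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHIRV", "MNPQRSTVWY"]
cluster_labels = cluster_sequences(toy_block, k=80)
print(cluster_labels)
```

Note that the first and third toy sequences are exactly 80% identical, which is not "more than 80%", yet they end up in the same cluster because both are linked to the second sequence.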

4 BLOCKS
Each BLOCK represents a conserved region in a group of proteins

              1   5     n
sequence 1    ABPEDG……FGW
sequence 2    ABSEDQ……QGW
sequence 3    SBPEDQ……FGD
    :         :
sequence m    ABAEDS……QGD

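5 Obtaining Accepted Mutations from BLOCKS
For each column, count the occurrences f_ab of each unordered pair (a, b) of amino acids
E.g.: if m = 10 and column i contains 9 A's and 1 S, then f_AA = 8 + 7 + … + 1 = 36 and f_AS = 9
Total number of pairs per column: m(m-1)/2
The probability of observing a pair (a, b), with f_ab summed over the n columns of the block, is given by
$$q_{ab} = \frac{f_{ab}}{n \, m(m-1)/2}$$
For the single example column: q_AA = 36/45 = 0.8 and q_AS = 9/45 = 0.2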
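The pair counting for the 9A-1S example column can be checked with a short sketch. The function names `column_pair_counts` and `pair_probabilities` are illustrative assumptions, and only one column is processed here; a full block would sum the counts over all of its columns.

```python
from collections import Counter
from itertools import combinations


def column_pair_counts(column):
    """Count unordered residue pairs (a, b) over all pairs of sequences in one column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1
    return counts


def pair_probabilities(counts):
    """Normalize pair counts by the total number of pairs."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


column = "AAAAAAAAAS"  # m = 10: nine A's and one S
f = column_pair_counts(column)
q = pair_probabilities(f)
print(f[("A", "A")], f[("A", "S")])  # 36 9
print(sum(f.values()))               # 45, i.e. m(m-1)/2
print(q)
```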

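6 The Null Hypothesis
The background distribution of amino acid a, where q_ab is the observed probability of the pair (a, b), is given by:
$$p_a = q_{aa} + \sum_{b \neq a} \frac{q_{ab}}{2}$$
The null hypothesis (the two residues of a pair are drawn independently from the background):
$$e_{ab} = \begin{cases} p_a^2 & \text{if } a = b \\ 2\,p_a p_b & \text{if } a \neq b \end{cases}$$
E.g.: in the above example, q_AA = 36/45 = 0.8 and q_AS = 9/45 = 0.2, so p_A = 0.8 + 0.2/2 = 0.9 and p_S = 0.2/2 = 0.1. Then:
e_AS = 2 · 0.9 · 0.1 = 0.18
e_AA = 0.9 · 0.9 = 0.81
e_SS = 0.1 · 0.1 = 0.01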
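The background and expected frequencies for the example can be reproduced with a small sketch. The names `background` and `expected` are assumptions for illustration, and the input pair probabilities are the single-column values 36/45 and 9/45 from the 9A-1S example.

```python
def background(q):
    """Marginal residue frequencies p_a from unordered pair probabilities q_ab."""
    p = {}
    for (a, b), value in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + value
        else:
            # A mixed pair contributes half of its probability to each residue.
            p[a] = p.get(a, 0.0) + value / 2
            p[b] = p.get(b, 0.0) + value / 2
    return p


def expected(p):
    """Expected pair probabilities e_ab under independence."""
    residues = sorted(p)
    e = {}
    for i, a in enumerate(residues):
        for b in residues[i:]:
            e[(a, b)] = p[a] * p[b] if a == b else 2 * p[a] * p[b]
    return e


q = {("A", "A"): 36 / 45, ("A", "S"): 9 / 45}
p = background(q)
e = expected(p)
print(p)
print(e)
```

The expected probabilities sum to 1 because they expand (p_A + p_S)^2.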

7 The LOD Ratio
The LOD (log-odds) ratio is given by:
$$s_{ab} = \log_2 \frac{q_{ab}}{e_{ab}}$$
Properties:
s_ab > 0 ⇔ q_ab > e_ab: the pair is observed more often than expected
s_ab = 0 ⇔ q_ab = e_ab: the pair is observed as often as expected
s_ab < 0 ⇔ q_ab < e_ab: the pair is observed less often than expected
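A minimal sketch of the log-odds computation for the 9A-1S example follows. The base-2 logarithm and the function name `lod_score` are assumptions; the sign properties hold for any base greater than 1. The observed values are the single-column probabilities 36/45 and 9/45, and the expected values are those from the null-hypothesis example.

```python
import math


def lod_score(q_ab, e_ab):
    """Log-odds score in bits; a pair that was never observed scores minus infinity."""
    if q_ab == 0:
        return float("-inf")
    return math.log2(q_ab / e_ab)


q = {"AA": 36 / 45, "AS": 9 / 45, "SS": 0.0}   # observed pair probabilities
e = {"AA": 0.81, "AS": 0.18, "SS": 0.01}       # expected under the null hypothesis
scores = {pair: lod_score(q[pair], e[pair]) for pair in q}
for pair, score in scores.items():
    print(pair, score)
```

In this toy column, A-S pairs are slightly over-represented (positive score) and A-A pairs slightly under-represented (negative score). Published BLOSUM matrices report scaled, rounded scores rather than raw log-odds values.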

8 Constructing the Different BLOSUM-k Matrices
The idea: create substitution matrices based on different degrees of identity
How: cluster all sequences that are more than k% identical and treat each cluster as a single sequence
Example: suppose k = 80 and 8 of the 9 sequences with A in the 9A-1S column are more than 80% identical to each other. They collapse into a single sequence, leaving three effective sequences (A, A, S), so f_AA = 1, f_AS = 2, f_SS = 0
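One way to realize "treat each cluster as a single sequence" is to give every sequence a weight of 1/(cluster size) and to count only pairs that span two different clusters. The sketch below makes that weighting assumption explicit; the function name `clustered_pair_counts` and the cluster labels are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations


def clustered_pair_counts(column, cluster_ids):
    """Weighted pair counts for one column, ignoring pairs inside the same cluster."""
    sizes = Counter(cluster_ids)
    counts = defaultdict(float)
    for i, j in combinations(range(len(column)), 2):
        if cluster_ids[i] == cluster_ids[j]:
            continue  # sequences in one cluster act as a single sequence
        # Each sequence carries weight 1 / (size of its cluster).
        weight = 1.0 / (sizes[cluster_ids[i]] * sizes[cluster_ids[j]])
        pair = tuple(sorted((column[i], column[j])))
        counts[pair] += weight
    return dict(counts)


column = "AAAAAAAAAS"
# Eight of the nine A sequences share cluster 0; the remaining A and the S stand alone.
cluster_ids = [0] * 8 + [1, 2]
f = clustered_pair_counts(column, cluster_ids)
print(f)
```

With these weights the example reproduces f_AA = 1 and f_AS = 2, and no S-S pair is counted.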

9 Information Resources
NCBI
GenBank
PDB and SCOP
GO
There are many, many more…

10 NCBI
Contains several databases and tools for molecular biology research
E.g.: BLAST, PubMed, GenBank and more
URL: http://www.ncbi.nih.gov

11 GenBank
GenBank is an annotated collection of all publicly available DNA sequences
Data is partitioned into 'divisions' that roughly correspond to taxonomic groups (e.g. bacteria, viruses, primates, etc.)
Statistics:
DNA sequences for more than 165K organisms (2005)
~55M DNA sequences
60G bases
URL: http://www.ncbi.nlm.nih.gov/GenBank/

12 Protein Data Bank (PDB) and SCOP
PDB is a database of known protein structures
Currently contains ~36K known structures
SCOP is a classification of proteins from PDB
Family – clear evolutionary relationship
Superfamily – probable common evolutionary origin
Fold – major structural similarity
URLs:
PDB – http://www.rcsb.org
SCOP – http://scop.berkeley.org

13 Gene Ontology (GO)
The GO project "… is a collaborative effort to address the need for consistent descriptions of gene products in different databases"
Kept in the form of a directed graph originating from one root
Nodes are the different GO terms (more than 17K now exist)
A node may have more than one parent
Three main branches: biological process, molecular function and cellular component
URL: http://www.geneontology.org
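Because a node may have more than one parent, collecting all ancestors of a term requires a graph traversal rather than a walk up a single chain. The sketch below uses a toy parent map; the term names are hypothetical placeholders, not real GO identifiers, and `ancestors` is an illustrative helper.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical toy ontology: child -> list of parents.
toy_parents = {
    "root": [],
    "process": ["root"],
    "function": ["root"],
    "term_x": ["process", "function"],  # a node with two parents
    "term_y": ["term_x"],
}
print(sorted(ancestors("term_y", toy_parents)))
```

The `seen` set ensures that a term reachable through several paths (here `root`) is visited only once.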

