Good solutions are advantageous Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.


1 Good solutions are advantageous
Christophe Roos - MediCel ltd, christophe.roos@medicel.fi
Similarity is a tool in understanding the information in a sequence
Evolution changes sequences

2 Spring 2002 Christophe Roos - 5/6 Profiles, motifs, structures
Proteins share similar domains
By comparing several related sequences to each other, one can distinguish segments with a higher level of conservation. These segments usually play a key role in the function of the protein. Blast identifies related sequences fast, but only roughly.

3 Refine the comparison
A multiple sequence alignment of the best-scoring sequences found by Blast (or in some other way) is made with a more sensitive algorithm. Example: the eyeless gene of the fruit fly is also found in several other species: birds, mammals, reptiles, fish, invertebrates. There it is called PAX6.

4 Visualise the relationship
Once a multiple sequence alignment has been made, it can also be used to find relationships (evolutionary distances). The distance is calculated as the number of mutations needed to evolve from a putative ancestor to each of the ‘present-day’ sequences used. Then a path including all sequences is computed. Different criteria can be used (maximum parsimony, maximum likelihood, etc.).
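As a rough illustration of a distance computed from an alignment, the sketch below counts the fraction of differing columns between each pair of aligned sequences. This observed pairwise difference is a simplification chosen for illustration; it is not the ancestor-based mutation count or the parsimony and likelihood methods named on the slide. The sequence names and strings are made up.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, gap-free columns at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(aligned: dict) -> dict:
    """Pairwise distances for every pair of named, aligned sequences."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


# Hypothetical toy alignment (not real PAX6 data).
toy = {"seqA": "MKRPC-VS", "seqB": "MRRPCAVS", "seqC": "MKKPC-IS"}
distances = distance_matrix(toy)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A matrix like this is the usual input to distance-based tree building; the character-based criteria mentioned above work on the alignment columns directly instead.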

5 Visualise the output of aligned domains
First, all sequence pairs are aligned and scored; then, in a second round, a multiple sequence alignment is built up. In this case (PAX6 proteins from vertebrates and the fruit fly), two domains are more conserved than the rest of the sequence. The most conserved areas are highlighted with a black or gray background and white text. Only part of the alignment is shown.
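The highlighting of conserved areas can be approximated by scoring each alignment column. The sketch below is one simple way to do it, assuming conservation is measured as the share of sequences carrying the most common residue in a column; the shading rules of the tool used for the slide are not stated, and the toy alignment is made up.

```python
from collections import Counter


def column_conservation(alignment: list) -> list:
    """Share of sequences that carry the most common residue, per column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # column made of gaps only
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


def highlight(alignment: list, threshold: float = 1.0) -> str:
    """Mark columns at or above the threshold with '*', the rest with '.'."""
    return "".join(
        "*" if s >= threshold else "." for s in column_conservation(alignment)
    )


# Hypothetical toy alignment (not real PAX6 data).
toy = ["MKRPCVS", "MRRPCVS", "MKRPCIS"]
marks = highlight(toy)
print(toy[0])
print(marks)
```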

6 Profiles and motifs
A sequence motif is a locally conserved region of a sequence or a short sequence pattern shared by a set of sequences. The term motif refers to any sequence pattern that is predictive of a molecule’s function, a structural feature, or a family membership. Motifs can be detected in protein, DNA and RNA sequences, but the term most commonly refers to protein motifs. Motifs can be represented for computational purposes as
– Flexible patterns, e.g. [K,R]-R-P-C-x(11)-C-V-S (qualitative, unweighted; see the Prosite database at www.expasy.org)
– Position-specific scoring matrices (PSSM, see next page)
– Profile hidden Markov models (HMM). These are a rigorous probabilistic formulation of a sequence profile. They contain the same probability information as PSSMs but can also account for gaps.
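A flexible pattern of this kind maps naturally onto a regular expression. The sketch below converts the notation used on the slide (residue sets in brackets, x for any residue, a repeat count in parentheses) and searches a sequence with it. It covers only those elements, and the test sequence is made up for illustration.

```python
import re

# One pattern element: x, a single residue, or a bracketed set,
# optionally followed by a repeat count such as (11) or (2,4).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z,]+\])(?:\((\d+)(?:,(\d+))?\))?")


def pattern_to_regex(pattern: str) -> str:
    """Translate a dash-separated flexible pattern into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        core = "." if core == "x" else core.replace(",", "")
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


regex = pattern_to_regex("[K,R]-R-P-C-x(11)-C-V-S")
print(regex)

# Hypothetical sequence fragment, made up to contain one match.
sequence = "MSGKRPCDISRILQVSNGCVSKIL"
hit = re.search(regex, sequence)
print(hit.start(), hit.group())
```

Such a pattern either matches or does not; it carries no weights, which is the limitation that scoring matrices and profile HMMs address.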

7 Position specific scoring matrix
This corresponds to the flexible pattern of the paired box: [K,R]-R-P-C-x(11)-C-V-S

   A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z   *   -
 -22 -22 -35 -26 -15 -37 -30  -9 -38  35 -36 -23 -16 -34  -5  53 -23 -24 -35 -40 -19 -31  -9   0   0
 -51 -52 -62 -57 -46 -64 -59 -33 -66 -16 -63 -49 -44 -64 -34  70 -51 -53 -63 -64 -46 -57 -40   0   0
 -42 -58 -59 -55 -53 -68 -59 -54 -63 -51 -65 -57 -62  73 -54 -56 -50 -53 -59 -72 -51 -69 -54   0   0
 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
 -21 -38 -19 -41 -30 -29 -43 -36   6  32 -16 -13 -35 -44 -25 -15 -34 -22  47 -41 -18 -36 -27   0   0
 -21   6  -8 -12 -27  -7 -25 -13  26 -22  23   8  30 -39 -21 -23 -20 -13  10 -30  -9 -19 -24   0   0
 -31 -40 -21 -43 -34 -23 -48 -36  50  33  -9  -8 -37 -47 -27 -17 -39 -28   5 -46 -20 -33 -30   0   0
 -27 -36 -24 -38 -30 -12 -40 -30  -3  31  39   3 -32 -42 -20 -11 -35 -28 -10 -37 -16 -29 -24   0   0
  -5  11  -7  -8 -18 -24 -15 -11   2 -17 -17 -13  35 -32 -17 -20  20  -2  23 -33  -7 -26 -18   0   0
  24 -20   0 -22 -19 -21 -12 -20   5 -19 -12  -9 -16 -24 -19 -22  21   0  24 -29  -7 -25 -19   0   0
  21  11  -3  -6 -16 -28 -10  -9 -19 -13 -25 -17  33 -26 -13 -16   2  28 -10 -35  -8 -25 -14   0   0
  -3 -17  -4 -21 -21 -11 -19 -18  -1 -20  19   2 -12 -29 -19 -21  20  27  -3 -30  -6 -21 -20   0   0
 -18  16 -17  33  -6 -20 -26  52   2 -21 -17 -13  -5 -35 -12 -19 -21 -16  20 -30 -10  -8 -10   0   0
 -26 -41 -12 -45 -40  10 -43 -10  30 -33   5  45 -37 -44 -27 -31 -34 -21   7  -4 -17  45 -33   0   0
 -27  12 -22  33 -13  -8 -28 -21 -10 -27  -5  42 -15 -40 -20 -28 -28 -24 -14  73 -14  -5 -17   0   0
 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
 -40 -73 -33 -75 -63 -45 -72 -68  -6 -66 -29 -28 -71 -71 -65 -67 -59 -40  64 -57 -45 -56 -64   0   0
 -25 -40 -35 -44 -45 -59 -39 -45 -60 -47 -63 -56 -36 -55 -47 -48  61 -24 -52 -66 -39 -57 -46   0   0
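A matrix like this is used by sliding it along a sequence and summing, for each window, the score of the residue found at each position. The sketch below does that with only the first four rows of the matrix, which correspond to the [K,R]-R-P-C part of the pattern. The function names and the test sequence are illustrative assumptions, not part of any particular tool.

```python
# Column order as given in the matrix header on the slide.
ALPHABET = "ABCDEFGHIKLMNPQRSTVWXYZ*-"

# First four rows of the slide's matrix (pattern positions [K,R], R, P, C).
ROWS = [
    [-22, -22, -35, -26, -15, -37, -30, -9, -38, 35, -36, -23, -16,
     -34, -5, 53, -23, -24, -35, -40, -19, -31, -9, 0, 0],
    [-51, -52, -62, -57, -46, -64, -59, -33, -66, -16, -63, -49, -44,
     -64, -34, 70, -51, -53, -63, -64, -46, -57, -40, 0, 0],
    [-42, -58, -59, -55, -53, -68, -59, -54, -63, -51, -65, -57, -62,
     73, -54, -56, -50, -53, -59, -72, -51, -69, -54, 0, 0],
    [-42, -69, 99, -75, -84, -49, -66, -72, -43, -76, -54, -53, -62,
     -79, -74, -75, -51, -48, -42, -65, -58, -59, -79, 0, 0],
]

# One dictionary per pattern position: residue symbol -> score.
PSSM = [dict(zip(ALPHABET, row)) for row in ROWS]


def score_window(window: str) -> int:
    """Sum the position-specific scores of the residues in one window."""
    return sum(column[residue] for column, residue in zip(PSSM, window))


def best_hit(sequence: str) -> tuple:
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(PSSM)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    return max(
        (score_window(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical sequence fragment, made up for illustration.
print(best_hit("MSGKRPCD"))
print(score_window("RRPC"), score_window("KRPC"))
```

Unlike the flexible pattern, the matrix ranks the two allowed residues at the first position: R scores 53 and K scores 35, so RRPC outscores KRPC.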

8 Motif and databases – mode of use
Motifs can be used to search sequence databases
– take a family of related sequences
– align and define motifs
– use the motifs to search a database of sequences to find novel family members
– motifs can also be generated from unaligned sequences (e.g. MEME, see next page)
Motif databases can be searched with sequences
– take one sequence and ask what known motifs it contains
– deduce its function using knowledge about those motifs in other sequences
DBs
– Blocks, Fred Hutchinson Cancer Research Center (ungapped alignments)
– COG, clusters of orthologous groups, NCBI (21 complete genomes)
– Pfam, Sanger Center (gapped profiles, curated)
– Prints, Univ. Manchester (fingerprints, i.e. more than one pattern)
– Prosite, Univ. Geneva (consensus patterns, expert-curated)
– SMART, EMBL-Heidelberg
– InterPro, EBI (multiple, curated), includes Pfam, SMART, etc. [2 pages forward]
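The first mode of use (take a family, align it, define a motif, search with it) can be sketched as follows. This minimal version assumes an ungapped alignment, a uniform background frequency for the 20 amino acids and a pseudocount of 1; these are simplifications for illustration, and real motif databases use more careful weighting. The family sequences are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned: list, pseudocount: float = 1.0) -> list:
    """Log-odds scores per column, against a uniform background (assumption)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm: list, window: str) -> float:
    """Sum of log-odds scores for a window as long as the matrix."""
    return sum(column[residue] for column, residue in zip(pssm, window))


# Hypothetical ungapped family alignment (not real data).
family = ["KRPC", "RRPC", "KRPC"]
pssm = build_pssm(family)
print(round(score(pssm, "KRPC"), 2), round(score(pssm, "AAAA"), 2))
```

A window resembling the family scores well above zero, while an unrelated window scores below zero, which is what lets such a matrix pick out novel family members from a database.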

9 Motif discovery tools and PSSM creators
The MEME tool takes unaligned sequences as input and searches for patterns according to several parameters such as
– Min-max length
– Number of occurrences per sequence
– Number of occurrences per set
MEME also generates PSSMs for the motifs found. MAST is a tool for searching databases with PSSMs.
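To show what parameters such as a length range and a minimum number of sequences mean in practice, the sketch below enumerates exact words shared by unaligned sequences. This is a deliberately naive stand-in: it is not MEME's statistical method and finds only exact matches. The function name, parameters and sequences are hypothetical.

```python
from collections import defaultdict


def shared_words(sequences: list, min_width: int, max_width: int,
                 min_sequences: int) -> dict:
    """Exact words of the given widths found in at least min_sequences inputs.

    Returns a mapping from each word to the number of sequences containing it.
    """
    found = {}
    for width in range(min_width, max_width + 1):
        owners = defaultdict(set)  # word -> indices of sequences containing it
        for index, seq in enumerate(sequences):
            for start in range(len(seq) - width + 1):
                owners[seq[start:start + width]].add(index)
        for word, indices in owners.items():
            if len(indices) >= min_sequences:
                found[word] = len(indices)
    return found


# Hypothetical unaligned sequences (not real data).
toy = ["AAKRPCGG", "TTRRPCAA", "GKRPCTTT"]
in_all = shared_words(toy, min_width=3, max_width=4, min_sequences=3)
in_two = shared_words(toy, min_width=3, max_width=4, min_sequences=2)
print(in_all)
print(in_two)
```

Relaxing the minimum number of sequences from three to two brings in longer but less widely shared words, which is the trade-off these parameters control.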

10 The InterPro database of motifs at EBI (Nov 2001) was built from Pfam 6.6, PRINTS 31.0, PROSITE 16.37, ProDom 2001.2, SMART 3.1, TIGRFAMs 1.2, and the current SWISS-PROT + TrEMBL data. This release of InterPro contains 4691 entries, representing 1068 domains, 3532 families, 74 repeats and 15 post-translational modification sites.
[Figure: InterPro source databases: Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, SWISS-PROT + TrEMBL]

11 Scan the InterPro database - example
The InterPro database was scanned with the PAX6 sequence from the fruit fly.

12 Protein 3D structure
3D is better than linear strings of letters...
Protein folding is critical for function
Protein folding is ordered
Structures consist of folds
3D structure can be measured, but computational ab initio structure prediction is a tough task and nearly impossible above a certain protein size (CPU and rule limits)

13 Protein 3D structure building blocks
Primary structure: the linear array of amino acids
Secondary structures
– Alpha helix
– Beta-strand
Tertiary structures
[Figure: DNA-binding protein (DNA helix, white; helices, pink; sheets of beta-strands, ochre)]

