1-Month Practical Master Course Genome Analysis Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The Netherlands.


1 1-Month Practical Master Course Genome Analysis Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The Netherlands www.ibivu.cs.vu.nl heringa@cs.vu.nl

2 Mathematics Statistics Computer Science Informatics Biology Molecular biology Medicine Chemistry Physics Bioinformatics

3 Biological Sequence Analysis
- Pair-wise sequence alignment
- Residue exchange matrices
- Multiple sequence alignment
- Phylogeny

4 .....acctc ctgtgcaaga acatgaaaca nctgtggttc tcccagatgg gtcctgtccc aggtgcacct gcaggagtcg ggcccaggac tggggaagcc tccagagctc aaaaccccac ttggtgacac aactcacaca tgcccacggt gcccagagcc caaatcttgt gacacacctc ccccgtgccc acggtgccca gagcccaaat cttgtgacac acctccccca tgcccacggt gcccagagcc caaatcttgt gacacacctc ccccgtgccc ccggtgccca gcacctgaac tcttgggagg accgtcagtc ttcctcttcc ccccaaaacc caaggatacc cttatgattt cccggacccc tgaggtcacg tgcgtggtgg tggacgtgag ccacgaagac ccnnnngtcc agttcaagtg gtacgtggac ggcgtggagg tgcataatgc caagacaaag ctgcgggagg agcagtacaa cagcacgttc cgtgtggtca gcgtcctcac cgtcctgcac caggactggc tgaacggcaa ggagtacaag tgcaaggtct ccaacaaagc aaccaagtca gcctgacctg cctggtcaaa ggcttctacc ccagcgacat cgccgtggag tgggagagca atgggcagcc ggagaacaac tacaacacca cgcctcccat gctggactcc gacggctcct tcttcctcta cagcaagctc accgtggaca agagcaggtg gcagcagggg aacatcttct catgctccgt gatgcatgag gctctgcaca accgctacac gcagaagagc ctctc..... DNA sequence

5 Genome size
Organism                  Number of base pairs
φX-174 virus              5,386
Epstein-Barr virus        172,282
Mycoplasma genitalium     580,000
Haemophilus influenzae    1.8 × 10^6
Yeast (S. cerevisiae)     12.1 × 10^6
Human                     3.2 × 10^9
Wheat                     16 × 10^9
Lilium longiflorum        90 × 10^9
Salamander                100 × 10^9
Amoeba dubia              670 × 10^9

6 Three main principles
- DNA makes RNA makes Protein
- Structure more conserved than sequence
- Sequence → Structure → Function

7 TERTIARY STRUCTURE (fold) Genome Expressome Proteome Metabolome Functional Genomics Regulation, signalling cascades, chaperonins, compartmentalisation

8 How to go from DNA to protein sequence
A piece of double stranded DNA:
5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
DNA direction is from 5’ to 3’

9 How to go from DNA to protein sequence
6-frame translation using the codon table (last lecture):
5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
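The 6-frame translation named on this slide can be sketched as follows. This is a minimal illustration, assuming the standard genetic code; the function names (`reverse_complement`, `translate`, `six_frame_translation`) and the choice to mark unreadable codons as `X` are assumptions made for this example, not part of the course material.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3' (unknown bases become N)."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(dna.upper()))


def translate(dna):
    """Translate complete codons only; codons not in the table become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # The double-stranded example from the slide (top strand, 5' to 3').
    for name, protein in six_frame_translation("attcgttggcaaatcgcccctatccggc").items():
        print(name, protein)
```

Stop codons are written as `*`, so an open reading frame shows up as a stretch without `*` in one of the six frames.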

10 Evolution and three-dimensional protein structure information
Dean, A. M. and G. B. Golding: Pacific Symposium on Biocomputing 2000
Isocitrate dehydrogenase: the distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution)

11 Protein Sequence-Structure-Function Sequence Structure Function Threading Homology searching (BLAST) Ab initio prediction and folding Function prediction from structure

12 Widely used tool for homology detection: PSI-BLAST
- Heuristic tool to cut down computations required for database searching (~1M sequences in DB)
- Sensitivity gained by iteratively finding hits (local alignments) and repeating search
[Diagram labels: Q, DB, T, hits, PSSM]

13 Threading Query sequence Template sequence + Template structure Compatibility score

14 Threading Query sequence Template sequence + Template structure Compatibility score

15 Fold recognition by threading Query sequence Compatibility scores Fold 1 Fold 2 Fold 3 Fold N

16 “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975)) “Nothing in bioinformatics makes sense except in the light of Biology” Bioinformatics

17 Divergent evolution
Ancestral sequence: ABCD
ABCD → ACCD (B → C, mutation)
ABCD → ABD (C → ø, deletion)
Pairwise alignment:
ACCD      ACCD
AB─D  or  A─BD

18 Divergent evolution
Ancestral sequence: ABCD
ABCD → ACCD (B → C, mutation)
ABCD → ABD (C → ø, deletion)
Pairwise alignment:
ACCD      ACCD
AB─D  or  A─BD
True alignment: ACCD / AB─D (the gap lines up with the deleted C)

19 Mutations under divergent evolution
Ancestral sequence → Sequence 1, Sequence 2
1: ACCTGTAATC
2: ACGTGCGATC
     *  **
D = 3/10 (fraction of different sites (nucleotides))
(a) Ancestor G; descendants G and C: one substitution - one visible
(b) Ancestor G; descendants A and C: two substitutions - one visible
(c) Ancestor G; descendants A and A: two substitutions - none visible
(d) Ancestor G; one lineage G → A → G: back mutation - not visible
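The distance D on this slide is the fraction of aligned sites that differ. A minimal sketch of that calculation follows; the function name `p_distance` and the decision to skip columns containing a gap are assumptions for illustration.

```python
def p_distance(seq1, seq2):
    """Fraction of compared sites that differ between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped (an assumption
    made here; the slide's example contains no gaps).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


if __name__ == "__main__":
    # The two sequences from the slide.
    print(p_distance("ACCTGTAATC", "ACGTGCGATC"))
```

As cases (b) to (d) show, this observed fraction only counts visible differences, so it can underestimate the number of substitutions that actually occurred.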

20 Convergent evolution
- Often with shorter motifs (e.g. active sites)
- Motif (function) has evolved more than once independently, e.g. starting with two very different sequences adopting different folds
- Sequences and associated structures remain different, but (functional) motif can become identical
- Classical example: the serine proteinases subtilisin and chymotrypsin

21 Serine proteinase (subtilisin) and chymotrypsin
- Different evolutionary origins, no sequence similarity
- Similarities in the reaction mechanisms: chymotrypsin, subtilisin and carboxypeptidase C have a catalytic triad of serine, aspartate and histidine in common. Serine acts as a nucleophile, aspartate as an electrophile, and histidine as a base.
- The geometric orientations of the catalytic residues are similar between families, despite different protein folds.
- The linear arrangements of the catalytic residues reflect different family relationships. For example, the catalytic triad is ordered HDS in the chymotrypsin clan (SA), DHS in the subtilisin clan (SB) and SDH in the carboxypeptidase clan (SC).

22 A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    * ****
(* marks a column in which both sequences carry the same letter)
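The asterisk line under an alignment marks identical columns. A small sketch of how such a line and a percent-identity figure could be derived from two aligned rows is given here; the function names are hypothetical, and dividing by the number of ungapped columns is only one of several conventions in use.

```python
def identity_line(row1, row2):
    """Return a marker line with '*' under every identical, non-gap column."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    return "".join(
        "*" if a == b and a != "-" else " " for a, b in zip(row1, row2)
    )


def percent_identity(row1, row2):
    """Identical columns as a percentage of columns without a gap (assumed convention)."""
    markers = identity_line(row1, row2)
    ungapped = sum(1 for a, b in zip(row1, row2) if a != "-" and b != "-")
    return 100.0 * markers.count("*") / ungapped


if __name__ == "__main__":
    top = "MSTGAVLIY--TSILIKECHAMPAGNE-----"
    bottom = "---GGILLFHRTHELIKESHAMANDEGGSNNS"
    print(top)
    print(bottom)
    print(identity_line(top, bottom))
    print(round(percent_identity(top, bottom), 1))
```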

23 What can sequence tell us about structure (HSSP) Sander & Schneider, 1991

24 Searching for similarities
What is the function of the new gene?
The “lazy” investigation (i.e., no biological experiments, just bioinformatics techniques):
– Find a set of similar protein sequences to the unknown sequence
– Identify similarities and differences
– For long proteins: identify domains first

25 Evolutionary and functional relationships
Reconstruct evolutionary relation:
Based on sequence
- Identity (simplest method)
- Similarity
Homology (common ancestry: the ultimate goal)
Other (e.g., 3D structure)
Functional relation: Sequence → Structure → Function

26 Searching for similarities
Common ancestry is more interesting: it makes it more likely that genes share the same function
Homology: sharing a common ancestor
– a binary property (yes/no)
– it’s a nice tool: when (an unknown) gene X is homologous to (a known) gene G, we gain a lot of information on X: what we know about G can be transferred to X as a good suggestion.

27 Biological definitions for related sequences
- Homologues are similar sequences in two different organisms that have been derived from a common ancestor sequence. Homologues can be described as either orthologues or paralogues.
- Orthologues are similar sequences in two different organisms that have arisen due to a speciation event. Orthologues typically retain identical or similar functionality throughout evolution.
- Paralogues are similar sequences within a single organism that have arisen due to a gene duplication event.
- Xenologues are similar sequences that do not share the same evolutionary origin, but rather have arisen out of horizontal transfer events through symbiosis, viruses, etc.

28 How to evolve
Important distinction:
- Orthologues: homologous proteins in different species (all deriving from same ancestor)
- Paralogues: homologous proteins in same species (internal gene duplication)
In practice: to recognise orthology, the bi-directional best hit is used in conjunction with a database search program (this is called an operational definition)
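The bi-directional best hit criterion can be sketched as below. This is an illustration under stated assumptions: the input is a plain dictionary of (query, subject) pairs mapped to search scores, and the gene names and scores are hypothetical toy data, not the output of any real database search program.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores: species A genes searched against species B, and back.
    a_vs_b = {("hA", "mA"): 250, ("hA", "mA2"): 180,
              ("hB", "mB"): 300, ("hC", "mA2"): 90}
    b_vs_a = {("mA", "hA"): 245, ("mA2", "hA"): 175,
              ("mB", "hB"): 310, ("mA2", "hC"): 60}
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In the toy data, hC's best hit is mA2, but mA2's best hit is hA, so that pair is not reported as a putative orthologue pair.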

29 Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html So this means …

30 Example today: Pairwise sequence alignment needs sense of evolution
Global dynamic programming
Sequences: MDAGSTVILCFVG and MDAASTILCGS
Evolution enters through: Amino Acid Exchange Matrix; Gap penalties (open, extension)
Search matrix
Alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS
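Global dynamic programming can be sketched with the classic Needleman-Wunsch recurrence. The version below is a simplified illustration, not the course's implementation: it assumes a plain match/mismatch score instead of an amino acid exchange matrix and a linear gap penalty instead of separate open and extension penalties, and all parameter values are arbitrary.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (simplifying assumption).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the search matrix; row/column 0 represent leading gaps.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, row1, row2 = needleman_wunsch("MDAGSTVILCFVG", "MDAASTILCGS")
    print(total)
    print(row1)
    print(row2)
```

With an identity-style score the result need not match the alignment shown on the slide, because the exchange matrix and the affine gap penalties are exactly where evolutionary knowledge enters the calculation.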

31 How to determine similarity
Frequent evolutionary events at the DNA level:
1. Substitution
2. Insertion, deletion
3. Duplication
4. Inversion
We will restrict ourselves to these events

32 A DNA sequence alignment (nucleotide one-letter code)
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    * ****
A protein sequence alignment (amino acid one-letter code)
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***

33 Dynamic programming: Scoring alignments
– Substitution (or match/mismatch): DNA, proteins
– Gap penalty
  Linear: gp(k)=ak
  Affine: gp(k)=b+ak
  Concave, e.g.: gp(k)=log(k)
The score for an alignment is the sum of the scores over all alignment columns

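34 Affine gap penalties
$S_{a,b} = \sum_{l} s(a_l, b_l) - \sum_{\text{gaps}} gp(k)$, where the first sum runs over the aligned residue pairs and, for a gap of length $k$, $gp(k) = \text{gap}_{\text{init}} + k \times \text{gap}_{\text{extension}}$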
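The three gap penalty shapes (linear, affine, concave) can be compared numerically with a short sketch. The function names and the default parameter values are assumptions chosen for illustration; only the formulas themselves come from the slides.

```python
import math


def linear_gap(k, a=1.0):
    """Linear penalty gp(k) = a*k."""
    return a * k


def affine_gap(k, b=10.0, a=1.0):
    """Affine penalty gp(k) = b + a*k (b: gap initiation, a: gap extension)."""
    return b + a * k


def concave_gap(k):
    """Concave penalty gp(k) = log(k), the example form given on the slide."""
    return math.log(k)


if __name__ == "__main__":
    print("k  linear  affine  concave")
    for k in range(1, 6):
        print(k, linear_gap(k), affine_gap(k), round(concave_gap(k), 3))
```

The affine form charges a fixed cost for starting a gap, so one long gap is cheaper than several short gaps of the same total length.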

35 DNA: define a score for match/mismatch of letters
Simple (only the diagonal is shown):
     A    C    G    T
A    1
C         1
G              1
T                   1
Used in genome alignments:
       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91

36 Dynamic programming: Scoring alignments
Amino Acid Exchange Matrix (20 × 20)
Affine gap penalties (open, extension): 10, 1
Alignment:
T D W V T A L K
T D W L - - I K
Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)-P_o-2P_x+s(L,I)+s(K,K)
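The score expression on this slide can be checked with a short sketch that sums exchange scores over aligned columns and subtracts an affine penalty per gap. The exchange values in `TOY_SCORES` and the penalties P_o = 10 and P_x = 1 are illustrative assumptions, not values taken from a specific published matrix or from the slides.

```python
# Hypothetical exchange scores for the residue pairs in the slide's example.
TOY_SCORES = {
    ("T", "T"): 5, ("D", "D"): 6, ("W", "W"): 11,
    ("L", "V"): 1, ("I", "L"): 2, ("K", "K"): 5,
}


def pair_score(x, y, scores):
    """Look up a symmetric exchange score, independent of residue order."""
    return scores[tuple(sorted((x, y)))]


def score_alignment(top, bottom, scores, gap_open, gap_extend):
    """Sum of column scores minus (gap_open + k * gap_extend) per gap of length k."""
    total = 0
    open_gap_side = None  # which row currently holds a running gap
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("column with two gaps")
        if x == "-" or y == "-":
            side = "top" if x == "-" else "bottom"
            if side != open_gap_side:
                total -= gap_open  # a new gap starts here
            total -= gap_extend
            open_gap_side = side
        else:
            total += pair_score(x, y, scores)
            open_gap_side = None
    return total


if __name__ == "__main__":
    # s(T,T)+s(D,D)+s(W,W)+s(V,L)-P_o-2P_x+s(L,I)+s(K,K)
    print(score_alignment("TDWVTALK", "TDWL--IK", TOY_SCORES, 10, 1))
```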

