EVOLUTIONARY CHANGE IN DNA SEQUENCES - usually too slow to monitor directly… … so use comparative analysis of 2 sequences which share a common ancestor.

Presentation transcript:

EVOLUTIONARY CHANGE IN DNA SEQUENCES
- usually too slow to monitor directly…
- … so use comparative analysis of 2 sequences which share a common ancestor
- determine number and nature of nt substitutions that have occurred (i.e. measure degree of divergence)
Spontaneous mutation rates? p
- for mammalian nuclear DNA (regions not under functional constraint)... much higher for viruses
- ~ 4 x nt sub per site per year
- e.g. to nt sub per site per generation

Potential pitfalls
1. Are all evolutionary changes being monitored?
- if closely related, high probability of only one change at any given site…
- but if distant, there may have been multiple substitutions (“hits”) at a site
- can use algorithms to correct for this
2. If there are indels between two sequences, can they be aligned with confidence?
- algorithms with gap penalties
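The slide does not say which correction algorithm is meant. As a minimal sketch, assuming the Jukes-Cantor one-parameter model (a choice made here for illustration, not taken from the slides), the observed proportion of differing sites p can be converted into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The function names are hypothetical.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of differing sites between two aligned sequences.

    Columns containing a gap ('-') in either sequence are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def jukes_cantor_distance(p):
    """Correct an observed p-distance for multiple hits (Jukes-Cantor model).

    Assumes equal base frequencies and equal rates for all substitution types.
    """
    if p >= 0.75:
        raise ValueError("p >= 0.75: sites are saturated, distance undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
    print(f"observed p = {p:.3f}, corrected d = {jukes_cantor_distance(p):.3f}")
```

For closely related sequences the corrected value is almost the same as p; the gap between the two widens as the sequences diverge, which is the multiple-hits problem described above.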

Fig. 3.6 (figure labels: Ancestral sequence; Present-day sequences)

Homoplasy: same nt, but not directly inherited from ancestral sequence
(If comparing long stretches, highly unlikely they would have converged to the same sequence)
Page & Holmes Fig. 5.9

Nucleotide substitutions within protein-coding sequences
1. Synonymous vs. non-synonymous
Single step: Multiple steps: AATACT
Is one pathway more likely than another? p.82

2. Nomenclature related to “degeneracy”:
- Non-degenerate: all possible changes at the site are non-synonymous
- 2-fold degenerate: one of the 3 possible changes is synonymous
- 4-fold degenerate: all possible changes at the site are synonymous
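The degeneracy classes above can be checked codon by codon. The following is a small sketch, assuming the standard nuclear genetic code (mitochondrial codes differ); the function names are hypothetical. It tries the 3 possible changes at a codon position and counts how many leave the amino acid unchanged.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Number of synonymous changes (out of 3) mapped to the usual label.
# "3-fold" is not on the slide; it occurs at the third position of Ile codons.
LABELS = {
    0: "non-degenerate",
    1: "2-fold degenerate",
    2: "3-fold degenerate",
    3: "4-fold degenerate",
}


def synonymous_changes(codon, position):
    """Count the single-nt changes at `position` (0, 1 or 2) that are synonymous."""
    codon = codon.upper()
    original_aa = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        if base == codon[position]:
            continue
        mutant = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[mutant] == original_aa:
            count += 1
    return count


def degeneracy(codon, position):
    """Return the degeneracy label for one site of a codon."""
    return LABELS[synonymous_changes(codon, position)]


if __name__ == "__main__":
    for pos in range(3):
        print("GGT", pos + 1, degeneracy("GGT", pos))
```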

ALIGNMENT OF SEQUENCES FOR COMPARATIVE ANALYSIS
1. By manual inspection
- if sequences very similar and no (or few) gaps
2. By sequence distance methods (often followed by “correction by visual inspection”)
- use algorithms which minimize mismatches and gaps
- gap penalty > mismatch penalty
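The slides do not name a specific algorithm. As one illustration, here is a minimal sketch of global alignment by dynamic programming (Needleman-Wunsch style) with a linear gap penalty. The scoring values are assumptions chosen only so that the gap penalty exceeds the mismatch penalty, as stated above.

```python
def global_alignment(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (score, aligned1, aligned2).

    Linear gap penalty: every gap position costs `gap`.
    """
    n, m = len(seq1), len(seq2)
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align the two residues
                score[i - 1][j] + gap,             # gap in seq2
                score[i][j - 1] + gap,             # gap in seq1
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned1, aligned2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned1)), "".join(reversed(aligned2))


if __name__ == "__main__":
    total, top, bottom = global_alignment("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```

Setting `gap=0` shows why a gap penalty is needed: the algorithm is then free to insert as many gaps as it likes to maximize matches.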

Fig. Alignment of human and chicken pancreatic hormone proteins
(a) no gap penalty
(b) with gap penalty
(c) alignment as in (b), with biochemically similar aa

Multiple sequence alignments - CLUSTALW
ww.ebi.ac.uk/clustalw (European Bioinformatics Institute)

CLUSTAL W (1.81) Multiple Sequence Alignments
Sequence 1: ArabidopsisAAG aa
Sequence 2: ArabidopsisAAC aa
Sequence 3: yeast 664 aa
Sequences (2:3) Aligned. Score: 23
Sequences (1:2) Aligned. Score: 93
Sequences (1:3) Aligned. Score: 22

ArabAAG52143 FIVDEADLLLDLGFRRDVEKIIDCLPRQR QSLLFSATIPKEVRRVS-QLVLKR 539
ArabAAC26676 FIVDEADLLLDLGFKRDVEKIIDCLPRQR QSLLFSATIPKEVRRVS-QLVLKR 586
yeast -VLDEADRLLEIGFRDDLETISGILNEKNSKSADNIKTLLFSATLDDKVQKLANNIMNKK 323
::**** **::**: *:*.*. *.:. ::******:.:*:::: ::: *:

Symbols used? * : .

Alignment of human α-globin and β-globin proteins
Human α-globin = 141 aa
Human β-globin = 146 aa
Was D-helix loss neutral or adaptive mutation? (Nature 352: , 1991)
Avers Fig

Reminder about definition of the word “homology”
In sequence comparisons, refer to nt (or aa) sequence relatedness as “… % identity” or “… % similarity” BUT NOT “… % homology”, because “homology” means “shares a common ancestor”
“Non-evolutionary biologists” Petsko, Genome Biol. 2:1002, 2001

“Normalized alignment score”
NAS = (# identities x 10) + (# Cys identities x 20) - (# gaps x 25)
Doolittle, R. “URFs & ORFs” p.14
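The formula above can be applied directly to a pair of aligned protein sequences. This is only a sketch of the expression as written on the slide, and it rests on assumptions the slide does not settle: a Cys identity is counted in both the identity term and the Cys term, one "gap" means one run of consecutive gap columns, and any further normalization for sequence length is left out because it is not shown. The function name is hypothetical.

```python
def alignment_score_nas(aligned1, aligned2):
    """Score two aligned protein sequences with the slide's NAS expression.

    Assumptions (not confirmed by the slide):
    - identical Cys residues count as identities AND as Cys identities
    - a run of consecutive gap columns counts as a single gap
    - no length normalization is applied
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("sequences must be aligned to the same length")
    identities = 0
    cys_identities = 0
    gaps = 0
    in_gap = False
    for a, b in zip(aligned1.upper(), aligned2.upper()):
        if a == "-" or b == "-":
            if not in_gap:
                gaps += 1  # count the gap once, when the run starts
            in_gap = True
            continue
        in_gap = False
        if a == b:
            identities += 1
            if a == "C":
                cys_identities += 1
    return identities * 10 + cys_identities * 20 - gaps * 25


if __name__ == "__main__":
    print(alignment_score_nas("AC--DE", "ACGGDE"))
```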

BLAST searches - to detect similarity between “sequence of interest” & databank entries
Query = yeast mt ribosomal protein L8 gene (1275 nt)

Example of high score “hit” (red)
Score = 383 bits (193), Expect = 1e-102
Identities = 196/197 (99%), Gaps = 0/197 (0%)
Query AGCGTCAGGATAGCTCGCTCGATGTGGTCAGGCTAACACAATGAACAACGAGACTAGTG
      |||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||
Sbjct AGCGTCAGGATAGCTCGCTCGATGTGATCAGGCTAACACAATGAACAACGAGACTAGTG

Example of low score “hit” (blue or black)
Score = 40.1 bits (20), Expect = 3.6
Identities = 23/24 (95%), Gaps = 0/24 (0%)
Query GTTTTCTTAATATTTATTTAAAAA
      |||||||||||||||| |||||||
Sbjct GTTTTCTTAATATTTAATTAAAAA
“low complexity sequence”

E-values: statistical measure of likelihood that sequences with this degree of similarity occur randomly, i.e. reflects number of hits expected by chance
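The link between bit score and E-value can be sketched with the standard Karlin-Altschul relation E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the database length. This is a simplification: BLAST itself uses "effective" lengths, so the sketch will not reproduce reported E-values exactly. The database size used below is a made-up number for illustration, not the one behind the hits above.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value: expected number of chance hits scoring >= bit_score.

    Uses E = m * n * 2**(-S') with raw (not effective) lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    query_len = 1275           # nt, query length from the slide
    db_len = 1_000_000_000     # hypothetical database size in nt
    for bits in (40.1, 383):
        print(bits, expect_value(bits, query_len, db_len))
```

Each additional bit of score halves the expected number of chance hits, and the same score becomes less significant as the database grows.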

Why is “sequence complexity” important when judging whether two sequences are homologous?
(figure labels: Human DNA; Chimp DNA; Pu-rich region #1; Pu-rich region #2 (not homologous to #1); Region of unbiased base composition G=C=A=T; AAGAGGAG)
How frequently is AAGAGGAG (8-nt sequence) expected to occur by chance in a DNA sequence?
If sequence A is of low complexity (or short length), high % identity with sequence B may not reflect shared evolutionary origin
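The question about AAGAGGAG can be worked through numerically. A rough sketch, assuming each position is independent: the probability of the motif at one position is the product of its base frequencies, and the expected count is that probability times the number of possible start positions. The purine-rich frequencies below are invented for illustration.

```python
def expected_occurrences(motif, seq_len, base_freqs=None):
    """Expected number of chance occurrences of `motif` on one strand.

    Assumes independent positions; defaults to unbiased composition
    (A = C = G = T = 0.25).
    """
    if base_freqs is None:
        base_freqs = {base: 0.25 for base in "ACGT"}
    probability = 1.0
    for base in motif.upper():
        probability *= base_freqs[base]
    start_positions = max(seq_len - len(motif) + 1, 0)
    return probability * start_positions


if __name__ == "__main__":
    unbiased = expected_occurrences("AAGAGGAG", 1_000_000)
    # Hypothetical purine-rich region: A and G at 40% each.
    pu_rich = expected_occurrences(
        "AAGAGGAG", 1_000_000, {"A": 0.4, "G": 0.4, "C": 0.1, "T": 0.1}
    )
    print(f"unbiased: {unbiased:.1f}  purine-rich: {pu_rich:.1f}")
```

With unbiased composition the motif is expected about once every 4^8 = 65,536 positions; in a purine-rich region it turns up far more often, so a match there says much less about shared ancestry.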

Advantages of using aa (rather than nt) sequences for identifying homologous genes among organisms?
- 20 amino acids vs. 4 nucleotides
  - lower chance of “spurious” matches
  - unrelated nt sequences (non-homologous) expected to show 25% identity by random chance (if unbiased base composition)
- for distantly related sequences: “saturation” of synonymous sites within codons (multiple hits)
- degeneracy of genetic code & different codon usage patterns (and G+C% of genomes) among organisms
But… for certain phylogenetic analyses, number of informative characters may be higher at DNA than protein level

What if BLAST search were done at protein (instead of nt) level?
Query = yeast mitochondrial ribosomal protein L8 (238 aa)
(figure labels: Fungal; Bacterial)

Dot matrix method for aligning sequences
- 2 sequences to be compared along X and Y axes of matrix
- dots put in matrix when nts in the 2 sequences are identical
- mismatch = “gap” (or break) in line
Fig. 3.7

indel = shift in diagonal Fig. 3.7

Dot matrix method
- normally compare blocks rather than individual nts
- spurious matches (background noise) influenced by
1. window size: overlapping fixed-length windows whereby sequence 1 is compared with sequence 2
2. stringency: minimum threshold value (% identity) at each step to score as hit
- for coding regions, could use aa instead of nt sequences to reduce “noise”
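The effect of window size and stringency can be tried out with a small sketch of the dot matrix method. It assumes identity scoring only (no substitution matrix); the function names and the text rendering are hypothetical choices for illustration.

```python
def dot_matrix(seq1, seq2, window=1, stringency=1.0):
    """Return (i, j) start positions where two windows are similar enough.

    window     - length of the overlapping fixed-length windows
    stringency - minimum fraction of identical positions to score a hit
    """
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            block1 = seq1[i:i + window]
            block2 = seq2[j:j + window]
            identical = sum(a == b for a, b in zip(block1, block2))
            if identical / window >= stringency:
                hits.append((i, j))
    return hits


def render(seq1, seq2, hits):
    """Draw the matrix as text: seq1 along X (columns), seq2 along Y (rows)."""
    marked = set(hits)
    rows = []
    for j in range(len(seq2)):
        rows.append(
            "".join("*" if (i, j) in marked else "." for i in range(len(seq1)))
        )
    return "\n".join(rows)


if __name__ == "__main__":
    a = "GATTACAGATTACA"
    b = "GATTACATTACA"
    print(render(a, b, dot_matrix(a, b, window=1)))
    print()
    print(render(a, b, dot_matrix(a, b, window=4, stringency=0.75)))
```

With `window=1` the plot is crowded with chance matches; a larger window with a stringency threshold removes most of that background and leaves the diagonals.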

Comparison of human chromosome 7 “draft” sequence (2001) with “near-complete” sequence (2004)
Nature 431:935, 2004
(figure labels: 2004 sequence (fewer errors); 2001 sequence; Blowup of 500 kb region)
How do you interpret the data in this figure?