Http://creativecommons.org/licenses/by-sa/2.0/ Lecture 3.0.


1 Lecture 3.0

2 Sequencing & Sequence Alignment
David Wishart, University of Alberta, February 16, 2005. (c) 2005 CGDN

3 Objectives
Understand how DNA sequence data is collected and prepared
Be aware of the importance of sequence searching and sequence alignment in biology and medicine
Be familiar with the different algorithms and scoring schemes used in sequence searching and sequence alignment

4 High Throughput DNA Sequencing

5 30,000

6 Shotgun Sequencing
Isolate Chromosome -> Shear DNA into Fragments -> Clone into Seq. Vectors -> Sequence

7 Principles of DNA Sequencing
[Figure: pBR322 vector (Amp, Tet, Ori) with the DNA fragment and primer marked]
Denature with heat to produce ssDNA
Klenow + ddNTP + dNTP + primers

8 The Secret to Sanger Sequencing

9 Principles of DNA Sequencing
Template: 5’ G C A T G C 3’, with a 5’ primer
Four reactions, each containing dATP, dCTP, dGTP and dTTP plus one ddNTP:
ddCTP reaction: GddC, GCATGddC
ddATP reaction: GCddA
ddTTP reaction: GCAddT
ddGTP reaction: ddG, GCATddG
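The chain-termination idea on this slide can be sketched in a few lines of Python. This is only an illustration, not a model of the chemistry: the function names `sanger_fragments` and `read_gel` are made up for this example, and it assumes every possible terminated fragment is produced exactly once.

```python
def sanger_fragments(new_strand):
    """Group chain-terminated fragments by the ddNTP that ended them."""
    reactions = {"A": [], "C": [], "G": [], "T": []}
    for end in range(1, len(new_strand) + 1):
        fragment = new_strand[:end]
        # The last base of each fragment is the dideoxy base that stopped synthesis.
        reactions[fragment[-1]].append(fragment)
    return reactions


def read_gel(reactions):
    """Read the sequence by ordering all fragments from shortest to longest."""
    bands = [(len(frag), base) for base, frags in reactions.items() for frag in frags]
    return "".join(base for _, base in sorted(bands))


reactions = sanger_fragments("GCATGC")
print(reactions["C"])       # ['GC', 'GCATGC']
print(read_gel(reactions))  # GCATGC
```

Sorting the bands by length plays the role of electrophoresis: the shortest fragment identifies the first base, the next shortest the second, and so on.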

10 Principles of DNA Sequencing
[Figure: sequencing gel with fragments ordered from short to long between the − and + electrodes]

11 Capillary Electrophoresis
Separation by Electro-osmotic Flow

12 Multiplexed CE with Fluorescent detection
ABI 3700, 96 x 700 bases

13 Shotgun Sequencing
Sequence Chromatogram -> Send to Computer -> Assembled Sequence

14 Shotgun Sequencing
Very efficient process for small-scale (~10 kb) sequencing (preferred method)
First applied to whole genome sequencing in 1995 (H. influenzae)
Now standard for all prokaryotic genome sequencing projects
Successfully applied to D. melanogaster
Moderately successful for H. sapiens

15 The Finished Product GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGATTACAGATTACAGATTACA

16 Sequencing Successes
T7 bacteriophage, completed in 1983: 39,937 bp, 59 coded proteins
Escherichia coli, completed in 1997: 4,639,221 bp, 4293 ORFs
Saccharomyces cerevisiae, completed in 1996: 12,069,252 bp, 5800 genes

17 Sequencing Successes
Caenorhabditis elegans, completed in 1998: 95,078,296 bp, 19,099 genes
Drosophila melanogaster, completed in 2000: 116,117,226 bp, 13,601 genes
Homo sapiens, completed in 2003: 3,201,762,515 bp, 31,780 genes

18 Genomes to Date
8 vertebrates (human, mouse, rat, fugu, zebrafish)
3 plants (Arabidopsis, rice, poplar)
2 insects (fruit fly, mosquito)
2 nematodes (C. elegans, C. briggsae)
1 sea squirt
4 parasites (Plasmodium, Guillardia)
4 fungi (S. cerevisiae, S. pombe)
200+ bacteria and archaebacteria
2000+ viruses

19 So what do we do with all this sequence data?

20 Sequence Alignment

21 Alignments tell us about...
Function or activity of a new gene/protein
Structure or shape of a new protein
Location or preferred location of a protein
Stability of a gene or protein
Origin of a gene or protein
Origin or phylogeny of an organelle
Origin or phylogeny of an organism

22 Factoid: Sequence comparisons lie at the heart of all bioinformatics

23 Similarity versus Homology
Similarity refers to the likeness or % identity between 2 sequences
Similarity means sharing a statistically significant number of bases or amino acids
Similarity does not imply homology
Homology refers to shared ancestry
Two sequences are homologous if they are derived from a common ancestral sequence
Homology usually implies similarity

24 Similarity versus Homology
Similarity can be quantified
It is correct to say that two sequences are X% identical
It is correct to say that two sequences have a similarity score of Z
It is generally incorrect to say that two sequences are X% similar

25 Similarity versus Homology
Homology cannot be quantified
If two sequences have a high % identity it is OK to say they are homologous
It is incorrect to say two sequences have a homology score of Z
It is incorrect to say two sequences are X% homologous

26 Homologues & All That
Homologue (or Homolog): Protein/gene that shares a common ancestor and which has good sequence and/or structure similarity to another (general term)
Paralogue (or Paralog): A homologue which arose through gene duplication in the same species/chromosome
Orthologue (or Ortholog): A homologue which arose through speciation (found in different species)

27 Sequence Complexity
MCDEFGHIKLAN…. High Complexity
ACTGTCACTGAT…. Mid Complexity
NNNNTTTTTNNN…. Low Complexity
Translate those DNA sequences!!!

28 Assessing Sequence Similarity
Two Character Strings: THESTORYOFGENESIS and THISBOOKONGENETICS
Character Comparison: THESTORYOFGENESI-S against THISBOOKONGENETICS (11 matching positions, marked *)
Context Comparison: THE STORY OF GENESIS against THIS BOOK ON GENETICS

29 Assessing Sequence Similarity
Rbn KETAAAKFERQHMD
Lsz KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNT
Rbn SST SAASSSNYCNQMMKSRNLTKDRCKPMNTFVHESLA
Lsz QATNRNTDGSTDYGILQINSRWWCNDGRTP GSRN
Rbn DVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKY
Lsz LCNIPCSALLSSDITASVNC AKKIVSDGDGMNAWVAWR
Rbn PNACYKTTQANKHIIVACEGNPYVPHFDASV
Lsz NRCKGTDVQA WIRGCRL
Is this alignment significant?

30 Is This Alignment Significant?

31 Some Simple Rules
If two sequences are > 100 residues and > 25% identical, they are likely related
If two sequences are 15-25% identical they may be related, but more tests are needed
If two sequences are < 15% identical they are probably not related
If you need more than 1 gap for every 20 residues the alignment is suspicious
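These rules of thumb are easy to encode. The sketch below is one possible reading, not a standard tool: the function names are hypothetical, % identity is computed over columns where neither sequence has a gap (other definitions exist), a "gap" is counted as one run of `-` characters, and the "inconclusive" verdict for short but > 25% identical alignments is an assumption, since the slide gives no rule for that case.

```python
import re


def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free alignment columns that hold identical residues."""
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    identical = sum(a == b for a, b in columns)
    return 100.0 * identical / len(columns)


def simple_rules(aligned_a, aligned_b):
    """Apply the slide's rules; returns (verdict, gaps_are_suspicious)."""
    identity = percent_identity(aligned_a, aligned_b)
    residues = sum(1 for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-")
    gaps = len(re.findall(r"-+", aligned_a)) + len(re.findall(r"-+", aligned_b))
    if identity < 15:
        verdict = "probably not related"
    elif identity <= 25:
        verdict = "may be related, more tests needed"
    elif residues > 100:
        verdict = "likely related"
    else:
        verdict = "inconclusive (100 residues or fewer)"  # assumption: not covered by the slide
    suspicious = gaps * 20 > residues  # more than 1 gap per 20 residues
    return verdict, suspicious


a, b = "THESTORYOFGENESI-S", "THISBOOKONGENETICS"
print(round(percent_identity(a, b), 1))  # 64.7
print(simple_rules(a, b))
```

The character strings from the earlier "Assessing Sequence Similarity" slide score high on identity, but they are far too short for the first rule to apply, which is the point of requiring more than 100 residues.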

32 Doolittle’s Rules of Thumb
Twilight Zone

33 Sequence Alignment - Methods
Dot Plots
Dynamic Programming
Heuristic (Fast) Local Alignment
Multiple Sequence Alignment
Contig Assembly

34 Dot Plots

35 Dot Plots
“Invented” in 1970 by Gibbs & McIntyre
Good for quick graphical overview
Simplest method for sequence comparison
Inter-sequence comparison
Intra-sequence comparison
Identifies internal repeats
Identifies domains or “modules”

36 Dot Plots & Internal Repeats

37 Dot Plot Algorithm
Take two sequences (A & B), write sequence A out as a row (length = m) and sequence B as a column (length = n)
Create a table or “matrix” of “m” columns and “n” rows
Compare each letter of sequence A with every letter in sequence B. If there’s a match mark it with a dot, if not, leave blank
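The three steps above translate almost directly into code. The following is a minimal sketch with hypothetical function names (`dot_plot`, `render`); it compares single letters only and applies no window or threshold filtering, which real dot plot programs usually add.

```python
def dot_plot(seq_a, seq_b):
    """Return an n x m grid of booleans: one row per letter of seq_b,
    one column per letter of seq_a, True where the letters match."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the grid as text, with '*' for a dot and a blank for no match."""
    lines = ["  " + " ".join(seq_a)]
    for letter, row in zip(seq_b, dot_plot(seq_a, seq_b)):
        cells = " ".join("*" if hit else " " for hit in row)
        lines.append((letter + " " + cells).rstrip())
    return "\n".join(lines)


grid = dot_plot("ACDEFGHG", "ACDEFGH")
print(render("ACDEFGHG", "ACDEFGH"))
```

With these two example sequences the main diagonal is fully dotted, and the extra G at the end of sequence A adds one off-diagonal dot in the G row.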

38 Dot Plot Algorithm
Sequence A (row): A C D E F G H G
Sequence B (column): A C D E F G H

39 Dot Plots
Most commercial packages offer pretty good dot plot programs, including:
GCG/Omiga (Pharmacopeia)
PepTool (BioTools Inc.)
LaserGene (DNAStar)
Popular freeware packages include Dotter, Dotlet and JDotter

40 Dynamic Programming
[Figure: scoring matrix with sequence letters (G E N T I C S) along the axes and cell scores ranging from 10 to 60]

41 Dynamic Programming
Developed by Needleman & Wunsch (1970)
Refined by Smith & Waterman (1981)
Ideal for quantitative assessment
Guaranteed to be mathematically optimal
Slow N^2 algorithm
Performed in 2 stages:
Prepare a scoring matrix using recursive function
Scan matrix diagonally using traceback protocol

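42 The Recursive Function
$$
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} \\
\max\limits_{2 \le x \le i} \left( S_{i-x,\,j-1} + w_{x-1} \right) \\
\max\limits_{2 \le y \le j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
$$
W = gap penalty, S = alignment score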
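As a rough illustration of the first stage (filling the matrix), the recursion can be written as below. This is a sketch under stated assumptions rather than the lecture's own code: the name `fill_matrix` is hypothetical, indices are 0-based, the gap penalty is a single constant for any gap length, and the default score is an identity match (1 for a match, 0 otherwise). The traceback stage is not shown.

```python
def fill_matrix(seq_a, seq_b, gap_penalty=0, score=lambda a, b: int(a == b)):
    """Fill S where rows follow seq_b and columns follow seq_a."""
    rows, cols = len(seq_b), len(seq_a)
    S = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            candidates = []
            if i > 0 and j > 0:
                # No gap: extend from the diagonal neighbour.
                candidates.append(S[i - 1][j - 1])
                # Gap: jump from any earlier row in the previous column.
                candidates += [S[k][j - 1] + gap_penalty for k in range(i - 1)]
                # Gap: jump from any earlier column in the previous row.
                candidates += [S[i - 1][k] + gap_penalty for k in range(j - 1)]
            S[i][j] = score(seq_a[j], seq_b[i]) + max(candidates, default=0)
    return S


S = fill_matrix("AATVD", "AVVD")
for letter, row in zip("AVVD", S):
    print(letter, row)
```

With identity scoring and a gap penalty of 0, the first two rows come out as 1 1 0 0 0 and 0 1 1 2 1, and the bottom-right cell is 3, the number of matched positions in the best alignment of AATVD with AVVD.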

43 Identity Scoring Matrix (Sij)

44 A Simple Example...
[Figure: identity scoring matrix for A A T V D (columns), filled in cell by cell; the first row (A) begins 1 1]

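45 A Simple Example...
Scoring matrix for A A T V D (columns) against A V V D (rows), first two rows:
A: 1 1 0 0 0
V: 0 1 1 2 1
Alignments shown (three matched positions each):
A A T V D with A V - V D
A A T V D with A - V V D
A A T V D with A V V D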

46 Could We Do Better?
Key to the performance of Dynamic Programming is the scoring function
Dynamic Programming always gives the mathematically correct answer
Dynamic Programming does not always give the biologically correct answer
The weakest link -- The Scoring Matrix

47 Scoring Matrices
An empirical model of evolution, biology and chemistry all wrapped up in a 20 X 20 table of integers
Structurally or chemically similar residues should ideally have high diagonal or off-diagonal numbers
Structurally or chemically dissimilar residues should ideally have low diagonal or off-diagonal numbers

48 A Better Matrix - PAM250

49 Using PAM250...
Gap Penalty = -1
[Figure: PAM250-scored matrix for A A T V D (columns) against A V V D (rows), filled in step by step; row A: 2 1 0 -1 -1; second row (V) begins -1 2]

50 Using PAM250...
Gap Penalty = -1
Scoring matrix for A A T V D (columns) against A V V D (rows), first two rows:
A: 2 1 0 -1 -1
V: -1 2 1 5 -1
Resulting alignment (three matched positions): A A T V D with A V - V D

51 PAM Matrices
Developed by M.O. Dayhoff (1978)
PAM = Point Accepted Mutation
Matrix assembled by looking at patterns of substitutions in closely related proteins
1 PAM corresponds to 1 amino acid change per 100 residues
1 PAM = 1% divergence or 1 million years in evolutionary history

52 Dynamic Programming
Great for doing pairwise global alignments
Produces a quantitative alignment “score”
Problems if one tries to do alignments with very large sequences (memory requirement grows as N^2 or as N x M)
Serious problems if one tries to align one sequence against a database (10’s of hours)
Need an alternative…..

53 Fast Local Alignment Methods
ACDEAGHNKLM...
KKDEFGHPKLM...
SCDEFCHLKLM...
MCDEFGHNKLV...
ACDEFGHIKLM...
QCDEFGHAKLM...
AQQQFGHIKLPI...
WCDEFGHLKLM...
SMDEFAHVKLM...
ACDEFGFKKLM...

54 Fast Local Alignment Methods
Developed by Lipman & Pearson (1985/88)
Refined by Altschul et al. (1990/97)
Ideal for large database comparisons
Uses heuristics & statistical simplification
Fast N-type algorithm (similar to Dot Plot)
Cuts sequences into short words (k-tuples)
Uses “Hash Tables” to speed comparison
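The word-cutting and hash-table ideas can be shown in miniature. The sketch below is an assumption-laden toy, not FASTA or BLAST: the function names are invented for this example, k is fixed at 3, and only exact word matches are reported (no similar or "neighbour" words, no scoring, no extension of hits).

```python
from collections import defaultdict


def k_tuples(sequence, k=3):
    """Cut a sequence into overlapping words of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_hash_table(query, k=3):
    """Map each query word to the list of positions where it starts."""
    table = defaultdict(list)
    for position, word in enumerate(k_tuples(query, k)):
        table[word].append(position)
    return table


def find_hits(query, subject, k=3):
    """Return (query_pos, subject_pos, word) for every exact shared word."""
    table = build_hash_table(query, k)
    return [
        (q_pos, s_pos, word)
        for s_pos, word in enumerate(k_tuples(subject, k))
        for q_pos in table.get(word, [])
    ]


print(k_tuples("ACDEFGDEF"))                   # ['ACD', 'CDE', 'DEF', 'EFG', 'FGD', 'GDE', 'DEF']
print(find_hits("ACDEFGDEF", "LMRGCDDYGDEY"))  # [(5, 8, 'GDE')]
```

Because each database word is looked up in the table in constant time, the scan is roughly linear in the database length, which is where the speed advantage over full dynamic programming comes from.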

55 Fast Alignment Algorithm
Query: ACDEFGDEF…..
Query words: ACD CDE DEF EFG FGD GDE …
Similar words: ACE CDD NEF … … …
Similar words: GCE CEE DEY … … …
Similar words: GCD DDY … … …

56 Fast Alignment Algorithm
Query: ACDEFGDEF…..
Query words: ACD CDE DEF EFG FGD GDE
Similar words: ACE CDD NEF … … …
Similar words: GCE CEE DEY … … …
Similar words: GCD DDY … … …
Database: LMRGCDDYGDEY…

57 Fast Alignment Algorithm
[Figure: query A C D E F G D E F... plotted against database sequence L M R G C D D Y]

58 Fast Alignment Algorithm

59 FASTA
Developed in 1985 and 1988 (W. Pearson)
Looks for clusters of nearby or locally dense “identical” k-tuples
init1 score = score for first set of k-tuples
initn score = score for gapped k-tuples
opt score = optimized alignment score
Z-score = number of S.D. above random
expect = expected # of random matches

60 FASTA

61 Multiple Sequence Alignment
Multiple alignment of Calcitonins

62 Multiple Alignment Algorithm
1. Take all “n” sequences and perform all possible pairwise (n(n-1)/2) alignments
2. Identify highest scoring pair, perform an alignment & create a consensus sequence
3. Select next most similar sequence and align it to the initial consensus, regenerate a second consensus
4. Repeat step 3 until finished
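The bookkeeping behind this procedure (how many pairwise comparisons there are and the order in which sequences get added) can be sketched as follows. Several things here are assumptions for illustration only: the name `progressive_order` is hypothetical, `difflib.SequenceMatcher` is a crude stand-in for a real pairwise alignment score, and the "next most similar" sequence is chosen by its best score against any sequence already included rather than against a rebuilt consensus. No alignment or consensus is actually constructed.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity score; a real program would use an alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def progressive_order(sequences):
    """Return (number of pairwise comparisons, order in which to add sequences)."""
    pairs = list(combinations(range(len(sequences)), 2))
    scores = {(i, j): similarity(sequences[i], sequences[j]) for i, j in pairs}
    # Step 2: start from the highest scoring pair.
    order = list(max(pairs, key=scores.get))
    remaining = set(range(len(sequences))) - set(order)
    # Steps 3-4: repeatedly add the sequence most similar to those already included.
    while remaining:
        best = max(
            remaining,
            key=lambda r: max(scores[tuple(sorted((r, done)))] for done in order),
        )
        order.append(best)
        remaining.remove(best)
    return len(pairs), order


sequences = ["ACDEFGHIKLM", "ACDEFGHNKLM", "AQQQFGHIKLP", "WWWWWWWWWWW"]
count, order = progressive_order(sequences)
print(count, order)  # 6 [0, 1, 2, 3]
```

For n = 4 sequences there are 4 x 3 / 2 = 6 pairwise comparisons, and the number grows quadratically with n, which is why this first step dominates the running time for large sequence sets.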

63 Multiple Sequence Alignment
Developed and refined by many (Doolittle, Barton, Corpet) through the 1980’s
Used extensively for extracting hidden phylogenetic relationships and identifying sequence families
Powerful tool for extracting new sequence motifs and signature sequences

64 Multiple Alignment
Most commercial vendors offer good multiple alignment programs, including:
GCG (Accelrys)
PepTool/GeneTool (BioTools Inc.)
LaserGene (DNAStar)
Popular web servers include T-COFFEE, MULTALIN and CLUSTALW
Popular freeware includes PHYLIP & PAUP

65 Multi-Align Websites
Match-Box
MUSCA
T-Coffee
MULTALIN
CLUSTALW

66 Lecture 3.0

67 T-Coffee
Uses standard progressive alignment but with a “twist” to avoid local minima
Allows the combination of a collection of multiple/pairwise, global or local alignments into a single model
It also estimates how consistent each position in the new alignment is with the rest of the alignments

68 Multi-alignment & Contig Assembly
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
TAGCTACGCATCGTCTGATGGCAATGCTACGGAA..
Reads: TAGCTACGCATCGT, TAGCAGACTACCGTT, ATCGATGCGTAGC, GTTACGATGCCTT

69 Contig Assembly
Read, edit & trim DNA chromatograms
Remove overlaps & ambiguous calls
Read in all sequence files (10-10,000)
Reverse complement all sequences (doubles # of sequences to align)
Remove vector sequences (vector trim)
Remove regions of low complexity
Perform multiple sequence alignment
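The reverse-complement step is simple to illustrate. This is a minimal sketch with hypothetical helper names, assuming reads contain only the letters A, C, G, T and N (other ambiguity codes are left unchanged by `str.translate`).

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(read):
    """Complement every base, then reverse the strand."""
    return read.translate(COMPLEMENT)[::-1]


def with_reverse_complements(reads):
    """Return every read followed by its reverse complement."""
    return [strand for read in reads for strand in (read, reverse_complement(read))]


reads = ["ATCGATGCGTAGC", "TGCTACGCATCG"]
print(reverse_complement("TGCTACGCATCG"))    # CGATGCGTAGCA
print(len(with_reverse_complements(reads)))  # 4
```

Since the orientation of each shotgun read is unknown, both strands have to be offered to the aligner, which is why this step doubles the number of sequences to align.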

70 Contig Assembly = Multiple Alignment
Only accept a very high sequence identity
Accept unlimited number of “end” gaps
Very high cost for opening “internal” gaps
A short match with high score/residue is preferred over a long match with low score/residue

71 Assembly Parameters
User-selected parameters: minimum length of overlap, percent identity within overlap
Non-adjustable parameters: sequence “quality” factors

72 Chromatogram Editing

73 Sequence Loading

74 Sequence Alignment

75 Contig Alignment - Process
Reads: ATCGATGCGTAGC, TAGCAGACTACCGTT, GTTACGATGCCTT, TGCTACGCATCG
Reverse complement: TGCTACGCATCG -> CGATGCGTAGCA
Aligned reads: CGATGCGTAGCA, ATCGATGCGTAGC, TAGCAGACTACCGTT, GTTACGATGCCTT
Contig: ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
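The joining of overlapping reads into a contig can be sketched with exact suffix/prefix matching. This is a deliberately simplified illustration, not how Phrap or other assemblers work: the function names are hypothetical, the reads are assumed to be already oriented and listed in left-to-right order, overlaps must match exactly (100% identity), and `min_overlap` stands in for the user-selected minimum overlap length.

```python
def overlap(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble_in_order(reads, min_overlap=3):
    """Merge reads left to right, joining each to the growing contig."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap(contig, read, min_overlap)
        if size == 0:
            raise ValueError("no overlap of at least %d bases with %s" % (min_overlap, read))
        contig += read[size:]  # append only the part that extends the contig
    return contig


reads = ["ATCGATGCGTAGC", "TAGCAGACTACCGTT", "GTTACGATGCCTT"]
print(assemble_in_order(reads))  # ATCGATGCGTAGCAGACTACCGTTACGATGCCTT
```

Raising `min_overlap` to 4 makes the last join fail, because the final read overlaps the contig by only three bases (GTT). Short overlaps like that are easy to find by chance, which is why the minimum overlap length matters as an assembly parameter.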

76 Problems for Assembly
Repeat regions: Capture sequences from non-contiguous regions
Polymorphisms: Cause failure to join correct regions
Large data volume: Requires large numbers of pair-wise comparisons

77 Sequence Assembly Programs
Phred - base calling program that does detailed statistical analysis (UNIX)
Phrap - sequence assembly program (UNIX)
TIGR Assembler - microbial genomes (UNIX)
The Staden Package (UNIX)
GeneTool/ChromaTool/Sequencher (PC/Mac)

78 Phrap
Phrap is a program for assembling shotgun DNA sequence data
Uses a combination of user-supplied and internally computed data quality information to improve assembly accuracy in the presence of repeats
Constructs the contig sequence as a mosaic of the highest quality read segments rather than a consensus
Handles large datasets

79 Lecture 3.0

80 Conclusions
Sequence alignments and database searching are key to all of bioinformatics
There are four different methods for doing sequence comparisons: 1) Dot Plots; 2) Dynamic Programming; 3) Fast Alignment; and 4) Multiple Alignment
Understanding the significance of alignments requires an understanding of statistics and distributions

