Previous Lecture: Descriptive Statistics. Complex, Normal, Skewed, Long tails.



Introduction to Biostatistics and Bioinformatics. This Lecture: Sequence Alignment Concepts

Sequence Alignment Stuart M. Brown, Ph.D. Center for Health Informatics and Bioinformatics NYU School of Medicine Slides/images/text/examples borrowed liberally from: Torgeir R. Hvidsten, Michael Schatz, Bill Pearson, Fourie Joubert, others …

Learning Objectives
– Identity, similarity, homology
– Analyze sequence similarity by dotplots (window/stringency)
– Alignment of text strings by edit distance
– Scoring of aligned amino acids
– Gap penalties
– Global vs. local alignment
– Dynamic Programming (Smith-Waterman)
– FASTA method

Why Compare Sequences?
– Identify sequences found in lab experiments: what is this thing I just found?
– Compare new genes to known ones
– Compare genes from different species to get information about evolution
– Guess functions for entire genomes full of new gene sequences
– Map sequence reads to a Reference Genome (ChIP-seq, RNA-seq, etc.)

Are there other sequences like this one?
1) Huge public databases: GenBank, Swissprot, etc.
2) “Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes” -R. Pearson
3) Similarity searching is based on alignment
4) BLAST and FASTA provide rapid similarity searching
a. rapid = approximate (heuristic)
b. false positives and false negatives occur

Similarity ≠ Homology
1) 25% or greater similarity over 100 or more AAs is strong evidence for homology
2) Homology is an evolutionary statement which means “descent from a common ancestor”
–common 3D structure
–usually common function
–homology is all or nothing, you cannot say "50% homologous"

Similarity is Based on Dot Plots
1) two sequences on vertical and horizontal axes of graph
2) put dots wherever there is a match
3) diagonal line is region of identity (local alignment)
4) apply a window filter: look at a group of bases, which must meet a % identity threshold to get a dot
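The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the program used to draw the plots in this lecture: the function names `dotplot` and `render` are made up here, and the window is scored by simple identity counting (a real tool may score windows with a substitution matrix instead).

```python
def dotplot(seq_x, seq_y, window=1, stringency=1):
    """Return the set of (i, j) positions that get a dot.

    A dot is placed at (i, j) when the windows of length `window` starting
    at seq_x[i] and seq_y[j] agree in at least `stringency` positions.
    With window=1 and stringency=1 this is the unfiltered dot plot.
    """
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the plot as text: seq_x across the top, seq_y down the side."""
    lines = ["  " + seq_x]
    for j, ch in enumerate(seq_y):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_x)))
        lines.append(ch + " " + row)
    return "\n".join(lines)
```

Calling `print(render(a, b, dotplot(a, b, window=4, stringency=3)))` draws a plot filtered with a 4-base window and 75% identity; raising the stringency removes more of the scattered background dots and leaves the diagonals.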

Simple Dot Plot

Window / Stringency Filtering
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Window = 12, Stringency = 9
Example window scores: Score = 7; Score = 11

Dot plot filtered with 4 base window and 75% identity

Dot plot of real data

Dotplot of two hemoglobin chains (Window = 130 / Stringency = 9)

Dotplot of two hemoglobin chains (Window = 18 / Stringency = 10)

How to Align Sequences?
– Manually line them up and count? An alignment program can do it for you, or you can just use a text editor.
– A dot plot shows regions of similarity as diagonals.
GATGCCATAGAGCTGTAGTCGTACCCT
CTAGAGAGC-GTAGTCAGAGTGTCTTTGAGTTCC

Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant.
A C C T G A G – A G
A C G T G – G C A G
70% identical: 7 of the 10 aligned columns match; the other three are one mismatch and two indels.
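The count on this slide can be reproduced with a short function. This is a sketch under one assumption: the denominator is the full alignment length including gap columns, which is what gives 70% here. Other conventions (for example dividing by the shorter sequence length) exist and give different numbers. The name `percent_identity` is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Both inputs must already be aligned (equal length, gaps marked with
    `gap`). A column where both rows hold a gap is not counted as identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(a == b and a != gap for a, b in zip(aligned_a, aligned_b))
    return 100.0 * identical / len(aligned_a)


print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))  # 70.0
```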

Hamming Distance
The minimum number of base changes that will convert one (ungapped) sequence into another of equal length.
The Hamming distance is named after Richard Hamming, who introduced it in his fundamental paper on Hamming codes: “Error detecting and error correcting codes” (1950) Bell System Technical Journal 29 (2): 147–160.
Python function hamming_distance:

    def hamming_distance(s1, s2):
        """Return the Hamming distance between equal-length sequences."""
        if len(s1) != len(s2):
            raise ValueError("Undefined for sequences of unequal length")
        # Count the positions at which the two sequences differ.
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

Hamming distance can be unrealistic
v: ATATATAT
w: TATATATA
Hamming distance = 8 (no gaps, no shifts)
Levenshtein (1966) introduced edit distance, which also allows insertions and deletions:
v = _ATATATAT
w = TATATATA_
edit distance: d(v, w) = 2
Levenshtein, Vladimir I. (February 1966). "Binary codes capable of correcting deletions, insertions, and reversals". Soviet Physics Doklady 10 (8): 707–710.
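For comparison with `hamming_distance`, here is a minimal sketch of edit distance using the standard dynamic programming recurrence with unit costs for insertion, deletion and substitution. The function name `edit_distance` is illustrative, and only two rows of the table are kept in memory.

```python
def edit_distance(v, w):
    """Levenshtein distance: fewest single-character insertions, deletions
    or substitutions needed to turn v into w."""
    # previous[j] holds the distance between the current prefix of v and w[:j].
    previous = list(range(len(w) + 1))
    for i, a in enumerate(v, start=1):
        current = [i]  # turning v[:i] into the empty string takes i deletions
        for j, b in enumerate(w, start=1):
            current.append(min(
                previous[j] + 1,             # delete a from v
                current[j - 1] + 1,          # insert b into v
                previous[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        previous = current
    return previous[-1]


print(edit_distance("ATATATAT", "TATATATA"))  # 2
```

One deletion at the start and one insertion at the end are enough, which is why the edit distance is 2 while the Hamming distance is 8.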

Affine Gap Penalties

Gap Penalties
– With unlimited gaps (no penalty), unrelated sequences can align (especially DNA)
– A gap should cost much more than a mismatch
– A multi-base gap should cost only a little bit more than a single-base gap
– Adding an additional gap near another gap should cost more (not implemented in most algorithms)
– Score for a gap of length x is -(p + σx), where p is the gap open penalty and σ is the gap extend penalty
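A small sketch of the affine gap score -(p + σx) shows why one long gap is cheaper than several short ones. The default values `gap_open=10` and `gap_extend=1` are placeholders chosen for illustration, not values taken from the lecture.

```python
def affine_gap_score(length, gap_open=10, gap_extend=1):
    """Score of a single gap of `length` positions: -(p + sigma * x)."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0  # no gap, no penalty
    return -(gap_open + gap_extend * length)


one_long_gap = affine_gap_score(5)         # -(10 + 1*5) = -15
five_short_gaps = 5 * affine_gap_score(1)  # 5 * -(10 + 1*1) = -55
print(one_long_gap, five_short_gaps)
```

With these numbers a single 5-base gap costs 15 while five separate 1-base gaps cost 55, so the alignment is pushed toward a few contiguous indels.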

Global vs Local similarity
1) Global similarity uses complete aligned sequences
– total % matches
– Needleman-Wunsch algorithm
2) Local similarity looks for the best internal matching region between 2 sequences
– find a diagonal region on the dotplot
– Smith-Waterman algorithm
– BLAST and FASTA (heuristic)
3) Dynamic programming – optimal computer solution, not approximate

Global vs. Local Alignments

Smith-Waterman Method
[Essentially finding the diagonals on the dotplot]
Basic principles of dynamic programming:
– Creation of an alignment path matrix
– Stepwise calculation of score values
– Backtracking (evaluation of the optimal path)

Creation of an alignment path matrix
Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences.
– Construct matrix F indexed by i and j (one index for each sequence)
– F(i,j) is the score of the best alignment between the initial segment $x_1 \ldots x_i$ of x and the initial segment $y_1 \ldots y_j$ of y
– Build F(i,j) recursively beginning with F(0,0) = 0

Michael Schatz

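Creation of an alignment path matrix
If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known, we can calculate F(i,j). Three possibilities:
– $x_i$ and $y_j$ are aligned: $F(i,j) = F(i-1,j-1) + s(x_i, y_j)$
– $x_i$ is aligned to a gap: $F(i,j) = F(i-1,j) - d$
– $y_j$ is aligned to a gap: $F(i,j) = F(i,j-1) - d$
The best score up to (i,j) is the largest of the three options, because s rewards matches and the gap penalty d is subtracted. For local (Smith-Waterman) alignment a fourth option, 0, is included, so that a poorly scoring alignment can restart at any position.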
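The recurrence can be turned into a short score-only sketch. Several things here are assumptions for illustration: the name `alignment_score`, the simple match/mismatch values standing in for s(x_i, y_j), and the linear gap penalty `d=2`. Traceback of the alignment itself is left out. With `local=False` the matrix is filled exactly as in the three-option recurrence (global alignment); with `local=True` the extra option 0 is added and the best cell anywhere in the matrix is reported, as in Smith-Waterman.

```python
def alignment_score(x, y, match=1, mismatch=-1, d=2, local=False):
    """Fill the matrix F and return the optimal alignment score."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: a leading gap of length k costs k * d.
        for i in range(1, rows):
            F[i][0] = -d * i
        for j in range(1, cols):
            F[0][j] = -d * j
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            options = [
                F[i - 1][j - 1] + s,  # x_i aligned to y_j
                F[i - 1][j] - d,      # x_i aligned to a gap
                F[i][j - 1] - d,      # y_j aligned to a gap
            ]
            if local:
                options.append(0)     # local alignment may restart here
            F[i][j] = max(options)    # keep the largest option
            best = max(best, F[i][j])
    return best if local else F[-1][-1]


print(alignment_score("ACGT", "AGT"))                        # 1
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))   # 3
```

In the first call the best global alignment has three matches and one gap (3 - 2 = 1). In the second, the local score of 3 comes from the shared ACG segment, and the flanking mismatches are simply not included.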

Choose the best option (figure: Michael Schatz)

Michael Schatz

Smith-Waterman is OPTIMAL but computationally slow
– A SW search requires computing a matrix of scores at every possible alignment position with every possible gap.
– The compute task increases with the product of the lengths of the two sequences to be compared, $O(N^2)$ for two sequences of length N.
– Difficult for comparison of one small sequence to a much larger one, very difficult for two large sequences, and essentially impossible for searching very large databases.

Scoring Similarity
1) Can only score aligned sequences
2) DNA is usually scored as identical or not
3) Modified scoring for gaps: single vs. multiple base gaps (gap extension)
4) Protein AAs have varying degrees of similarity
–a. # of mutations to convert one to another
–b. chemical similarity
–c. observed mutation frequencies
5) PAM matrix calculated from observed mutations in protein families

The 20 amino acids used in proteins have different chemical structures

Protein Scoring Systems
Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.
[Figure: the amino acids grouped by property: hydrophobic, polar, charged, positive, aliphatic, aromatic, small, tiny]

Evolutionary Conservation: how often is one AA replaced by another in the same position in orthologous proteins?

The PAM 250 scoring matrix
Dayhoff, M., Schwartz, R.M., Orcutt, B.C. (1978) A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol 5, sup. 3, pp M. Dayhoff ed., National Biomedical Research Foundation, Silver Spring, MD.

PAM vs. BLOSUM
Comparable matrices (more distant sequences toward the bottom):
PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45
Which matrix to use:
BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations
PAM120 for general use
PAM60 for close relations
PAM250 for distant relations

Search with Protein, not DNA Sequences
1) 4 DNA bases vs. 20 amino acids: matches between proteins are less likely to occur by chance
2) Can have varying degrees of similarity between different AAs: # of mutations, chemical similarity, PAM matrix
3) Protein databanks are much smaller than DNA databanks

FASTA
1) A faster method to find similarities and make alignments, capable of searching many sequences (an entire database)
2) Only searches near the diagonal of the alignment matrix
3) Produces a statistic for each alignment (more on this in the next lecture)

FASTA
1) Derived from logic of the dot plot
–compute best diagonals from all frames of alignment
2) Word method looks for exact matches between words in query and test sequence
–hash tables (fast computer technique)
–only matches exactly identical words
–DNA words are usually 6 bases
–protein words are 1 or 2 amino acids
–only searches for diagonals in region of word matches = faster searching
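The word method can be illustrated with a hash table of query words and a count of word hits per diagonal. This is a simplified sketch of the idea only, not the FASTA program's actual implementation; `index_words` and `diagonal_hits` are names made up for this example.

```python
from collections import Counter, defaultdict


def index_words(seq, k):
    """Hash table mapping each k-letter word to its start positions in seq."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos)
    return table


def diagonal_hits(query, subject, k):
    """Count exact word matches on each diagonal (query pos - subject pos)."""
    table = index_words(query, k)
    hits = Counter()
    for j in range(len(subject) - k + 1):
        # Only exactly identical words are found by the lookup.
        for i in table.get(subject[j:j + k], ()):
            hits[i - j] += 1
    return hits


hits = diagonal_hits("ACGTACGT", "TTACGTAC", k=4)
print(hits.most_common(1))  # [(-2, 3)]
```

The diagonal with the most word hits (here the one offset by -2) marks the region where a more careful alignment is worth computing; everything far from the well-populated diagonals is skipped, which is where the speed comes from.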

FASTA Format
– simple format used by almost all programs
– >header line with a [return] at end
– Sequence (no specific requirements for line length, characters, etc)

>URO1 uro1.seq Length: 2018 November 9, :50 Type: N Check:
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
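Because the format is so simple, a reader for it fits in a dozen lines. The sketch below is a bare-bones illustration (the name `parse_fasta` is made up here); it assumes the input is an iterable of text lines and does no validation of the sequence characters.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-format lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # drop the leading ">"
        elif header is not None:
            chunks.append(line)  # sequence may span any number of lines
    if header is not None:
        yield header, "".join(chunks)


example = ">seq1 demo\nACGT\nTTGA\n>seq2\nGGCC\n"
for name, seq in parse_fasta(example.splitlines()):
    print(name, len(seq))
```

An open file can be passed directly, for example `parse_fasta(open(path))` with a path of your own, since a file object is also an iterable of lines.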

FASTA Algorithm

Makes Longest Diagonal
3) after all diagonals are found, tries to join diagonals by adding gaps
4) computes alignments in regions of best diagonals

FASTA Alignments

FASTA Results - Alignment
SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: expect(): 2.3e
% identity in 875 nt overlap (83-957: )

u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
|| ||| | ||||| | ||| |||||
DMU09374 AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC

u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
||||||||| || ||| | | || ||| | || || ||||| ||
DMU09374 GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC

u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
||| | ||||| || ||| |||| | || | |||||||| || ||| ||
DMU09374 AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC

u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
|||||||||| ||||| | |||||| |||| ||| || ||| || |
DMU09374 AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT

Similarity Search Statistics
– Searches with Needleman-Wunsch and Smith-Waterman have shown problems with simple “score” functions
– Score varies with length of sequences, gap penalties, composition, protein scoring matrix, etc.
– Need an unbiased method to compare alignments and judge if they are “significant” in a biological sense

How to score alignments?
In a large database search, most alignments are random and very few are good ones, so the scores of good alignments have to be distinguished from the scores of random alignments. The scores of random alignments follow an extreme value distribution, not a normal distribution, so we need to use an appropriate statistical test.

Normal vs. Extreme Value Distributions

Pearson: FASTA Statistics

Compute the E-value, and from it a p-value, from the extreme value distribution:
$E = K m n e^{-\lambda S}$
$P = 1 - e^{-E}$
E is the E-value (significance score), m is your query length, n is the length of a matching database sequence, and S is the score (computed from a count of matching letters with a scoring matrix and gap penalty). K is a constant that scales for the size of the search space, and $\lambda$ is a constant that models the scoring system.
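As a worked illustration, the sketch below evaluates the standard Karlin-Altschul form $E = K m n e^{-\lambda S}$ and the corresponding p-value $P = 1 - e^{-E}$. The function names and the numbers passed to them (including the values of K and lambda) are placeholders chosen for this example; real values of K and lambda depend on the scoring matrix and gap penalties in use.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`.

    m is the query length, n the database sequence length, and K and lam
    are the statistical constants of the scoring system.
    """
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance alignment this good: 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable for very small E


# Placeholder numbers for illustration only.
e = e_value(score=60, m=300, n=1_000_000, K=0.1, lam=0.3)
print(e, p_value(e))
```

For small E the two numbers are almost the same, while for large E the p-value levels off at 1 and the E-value keeps growing.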

Similarity Statistics
– The E() value plays the role of a p value, and the two are nearly equal when E() is small
– Significant if E() < 0.05 (smaller numbers are more significant)
– The E-value is the number of alignments scoring at least this well that are expected by chance alone in a search of this database. A value of 1 means that one alignment this good is expected by chance when any random sequence is searched against this database.
The official NCBI explanation of BLAST statistics is by Stephen Altschul.

Summary
– Identity, similarity, homology
– Analyze sequence similarity by dotplots (window/stringency)
– Alignment of text strings by edit distance
– Scoring of aligned amino acids
– Gap penalties
– Global vs. local alignment
– Dynamic Programming (Smith-Waterman)
– FASTA method

Next Lecture: Searching Sequence Databases