Designed by Manisha, NUS Part I : SEQUENCE COMPARISON PAIRWISE ALIGNMENT Manisha Brahmachary.

Presentation transcript:

Part I: SEQUENCE COMPARISON, PAIRWISE ALIGNMENT
Manisha Brahmachary

OUTLINE
- What is sequence comparison
- Ways to do sequence comparison
- Dot plot
- BLAST
- FASTA

What is sequence alignment or sequence comparison?
- Given two sequences of letters and a scoring scheme for evaluating matching letters, find the best pairing of letters from one sequence to letters of the other sequence.
- Example:
    THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
    THIS IS A SHORT SENTENCE
- Align:
    THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
    THIS IS A#######SHORT###SENTENCE############## (path 1)
  or
    THIS IS A SHORT#########SENTENCE############## (path 2)

Aligning biological sequences
- DNA (4-letter alphabet)
    TTGACAC
    TTTACAC
- Proteins (20-letter alphabet)
    RKVA--GMAKPNM
    RKIAVAAASKPAV

Why do Sequence Alignment?
- Finding novel genes in silico
- Phylogenetic/evolutionary analysis
- Structure: finding a template for modelling
- Functional prediction

Types of Sequence Comparison
- Pairwise alignment
  - Comparison of two sequences
- Multiple alignment
  - Comparison of more than two sequences

CONCEPTS IN SEQUENCE COMPARISON
- IDENTITY
  - Percentage identity is the share of aligned positions at which both sequences carry the same residue (nucleotide or amino acid) after the two sequences have been aligned.

Exact match (shown by |): 10 identical residues
  Query:   RCICTRGFCRCLCRR
           || | || ||| | |
  Subject: RCLCRRGVCRCICTR
Percentage identity for the example above: (10 identical matches / 15 residues in the aligned sequence) * 100 = 66.7% identity
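The identity calculation on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `percent_identity` is made up here, and the two strings are the 15-residue query and subject read off the slide.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percentage of aligned positions that hold the same residue."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    # A gap character never counts as an identical match.
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return 100.0 * identical / len(query)


query = "RCICTRGFCRCLCRR"
subject = "RCLCRRGVCRCICTR"
print(round(percent_identity(query, subject), 1))  # prints 66.7
```

The denominator here is the length of the alignment, as on the slide; other conventions (for example, dividing by the shorter sequence) give different percentages.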

Mismatches (marked by *):
  Query:   RCICTRGFCRCLCRR
             * *  *   * *
  Subject: RCLCRRGVCRCICTR

A mismatch occurs where the aligned characters differ; gaps can be inserted to bring the following residues back into register.
- Gaps have penalties:
  - Insertion of the first gap (GAP OPENING): high penalty (e.g. -2, i.e. subtract 2)
  - Insertion of consecutive gaps (GAP EXTENSION): lower penalty (e.g. -1, i.e. subtract 1 for each consecutive gap)
- The more gaps, the lower the score of the alignment.
  Query:   RCICT-RGFCRCLC---RR
  Subject: RCLCRRGVCRCICTAR
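As a rough sketch of the gap-opening and gap-extension idea, the function below totals the penalty for one row of an alignment. The values -2 and -1 are the example values from the slide; the function name `gap_penalty` and its parameters are illustrative assumptions, not part of any particular program.

```python
def gap_penalty(alignment_row: str, gap_open: int = -2, gap_extend: int = -1) -> int:
    """Sum gap penalties: gap_open for the first '-' of a run, gap_extend for each further '-'."""
    total = 0
    previous_was_gap = False
    for ch in alignment_row:
        if ch == "-":
            total += gap_extend if previous_was_gap else gap_open
            previous_was_gap = True
        else:
            previous_was_gap = False
    return total


# Query row from the slide: one single gap (-2) and one run of three gaps (-2 -1 -1).
print(gap_penalty("RCICT-RGFCRCLC---RR"))  # prints -6
```

Charging less for extension than for opening makes one long gap cheaper than several scattered gaps of the same total length.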

designed by Manisha, NUS  Substitution : Less score than identical match For eg: +1 per substitution RCICT-RGFCRCLC---RR RCLCRRGVCRCICTAR-

Substitution: replacing a residue with another of similar physicochemical properties.
Amino acid                                            Category
Ile (I), Leu (L), Met (M), Val (V)                    Hydrophobic
Ala (A), Cys (C), Gly (G), Pro (P), Ser (S), Thr (T)  Hydrophilic
Phe (F), Tyr (Y), Trp (W)                             Aromatic
His (H), Lys (K), Arg (R)                             Basic
Asp (D), Glu (E), Asn (N), Gln (Q)                    Acids and amides

Similarity
- Similarity = identical matches + substitutions
- E.g. (10 identical matches + 2 substitutions) / 15 aligned residues * 100 = 80% similarity
  Query:   RCICT-RGFCRCLC---RR
  Subject: RCLCRRGVCRCICTAR
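A minimal sketch of the similarity calculation, assuming that a "substitution" means two different residues from the same category in the amino acid table shown earlier in these slides. The names `GROUPS`, `CATEGORY`, and `percent_similarity` are hypothetical, and the test pair is the ungapped 15-residue query and subject from the identity example.

```python
# Residue categories as listed in the slides' amino acid table.
GROUPS = {
    "hydrophobic": "ILMV",
    "hydrophilic": "ACGPST",
    "aromatic": "FYW",
    "basic": "HKR",
    "acids_and_amides": "DENQ",
}
CATEGORY = {aa: name for name, members in GROUPS.items() for aa in members}


def percent_similarity(query: str, subject: str) -> float:
    """(identical matches + same-category substitutions) / aligned length * 100."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            continue  # gaps are neither matches nor substitutions
        if q == s:
            identical += 1
        elif q in CATEGORY and CATEGORY[q] == CATEGORY.get(s):
            similar += 1
    return 100.0 * (identical + similar) / len(query)


print(percent_similarity("RCICTRGFCRCLCRR", "RCLCRRGVCRCICTR"))  # prints 80.0
```

In that pair only the two I/L exchanges fall within one category, which gives the slide's 10 + 2 out of 15. Real tools decide what counts as similar from a scoring matrix rather than from fixed categories.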

For DNA: identity and gaps are applicable
  ACTCGGCCCCGCGCTCACTGC
  ACTCGGAC--GCGCTCAGTGC

Similarity vs. Homology
- Homology: two sequences are homologous when they descend from a common ancestor.
- Homology is inferred from similarity; similarity is what we measure.
- If two sequences are sufficiently similar, they are taken to be homologous.
- Rule of thumb: usually at least 30% identity over 400 bp for DNA sequences, or over 125 amino acids for proteins.

Scoring Matrices used in sequence comparison
- What is a scoring matrix?
  - Scoring matrices are used when we compare sequences with one another.
  - A scoring matrix gives a measure of which residue can be substituted by which other residue.

Scoring Matrices
- For amino acids, each amino acid is compared to every other and a score is given to each pair.
- High score if they are the same residue (e.g. cysteine compared to cysteine)
- Low score if they are very different (e.g. tryptophan compared to cysteine)

Scoring Matrices for DNA
- DNA sequence: 4 characters only (A, T, G, C)
- Unitary matrix used for scoring: a scoring system in which only identical characters receive a positive score.
[Figure: 4 x 4 scoring matrix with rows and columns labelled A, C, G, T]
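A unitary matrix for DNA can be written out directly. In this sketch the match score of 1 and mismatch score of 0 are assumptions (the slide only says that identical characters receive a positive score), and `UNITARY` and `score_dna` are illustrative names.

```python
BASES = "ACGT"

# Unitary (identity) matrix: 1 on the diagonal, 0 everywhere else.
UNITARY = {(a, b): (1 if a == b else 0) for a in BASES for b in BASES}


def score_dna(seq1: str, seq2: str, matrix=UNITARY) -> int:
    """Score two aligned DNA sequences column by column, skipping gap columns."""
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2) if a != "-" and b != "-")


# Print the matrix with A, C, G, T as row and column labels.
print("  " + " ".join(BASES))
for a in BASES:
    print(a + " " + " ".join(str(UNITARY[(a, b)]) for b in BASES))

# DNA pair from the "Aligning biological sequences" slide: 6 of 7 positions match.
print(score_dna("TTGACAC", "TTTACAC"))  # prints 6
```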

SCORING SCHEMES FOR PROTEIN SEQUENCE ALIGNMENTS
- Scoring matrices used are PAM (Point Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix).
- BLOSUM45 ---> BLOSUM90 runs from MORE DIVERGENT to LESS DIVERGENT sequences.
- PAM30 ---> PAM250 runs from LESS DIVERGENT to MORE DIVERGENT sequences.
NOTE: Many different matrices are in use, and each gives different values to pairs of amino acids. Depending on how distantly related your sequences are, you might want to choose different matrices for your comparisons.

Scoring Matrices
MORE DIVERGENT                    LESS DIVERGENT
BLOSUM45         BLOSUM62         BLOSUM90
PAM250           PAM160           PAM100

Ways to do Pairwise Alignment
- Dot plot (simplest method)
- Statistical, computation-based methods
  - Local alignment, e.g. BLAST, FASTA
  - Global alignment, e.g. CLUSTAL

What are Dot Plots?
A program for sequence comparison, used to find out:
- Are the two sequences similar?
- Are there repeat regions in your sequence?

STEPS IN DOT PLOT
- Take the two sequences to be compared:
    Sequence A: MEHRKPGTGQ
    Sequence B: MEHRKPGTGQ
- Place sequence A on the x-axis (row).
- Place sequence B on the y-axis (column).
[Figure: empty grid with M E H R K P G T G Q along the x-axis and sequence B along the y-axis]

- Plot a dot every time an element of the row sequence matches an element of the column sequence.
- Do you see a diagonal line extending across the plot? If yes, the sequences match along that stretch.
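The dot plot steps can be sketched as a small text-mode program. This is a bare-bones illustration with no window or threshold filtering; `dot_matrix` and `render` are hypothetical names, and the sequences are the Sequence A and Sequence B given in the steps (MEHRKPGTGQ for both).

```python
def dot_matrix(seq_a: str, seq_b: str):
    """Rows follow seq_b (y-axis), columns follow seq_a (x-axis); True marks a match."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a: str, seq_b: str) -> str:
    """Draw the dot matrix as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(b + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("MEHRKPGTGQ", "MEHRKPGTGQ"))
```

Because the two sequences are the same, every cell on the main diagonal is marked. The two G residues also produce a pair of off-diagonal dots, the kind of background noise that longer sequences show in quantity.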

Patterns in Dot Plot: when two sequences are "identical"
  Sequence: GGTCCTTGGCTGAAAGACCCCA
[Figure: dot plot of GGTCCTTGGCTGAAAGACCCCA against GGTCCTTGGCTGAAAGACCCCA]

Application of Dot Plot
- Using self comparison: finding repeats
- Sequence used: human ALU sequence
    CATCTCAAAAACAACAACAAAAAAAAAAAAAAAAGAAAAAAAA
- Omit the main diagonal.
- Clusters of diagonal lines show repeats in the sequence.
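One way to look for repeats by self comparison is to scan every diagonal except the main one for runs of consecutive matches. The sketch below does that; the function name `repeat_diagonals` and the `min_run` threshold are assumptions made for illustration, not something the slides specify.

```python
def repeat_diagonals(seq: str, min_run: int = 4):
    """Find diagonal runs in a self dot plot, skipping the main diagonal.

    Returns (offset, start, length) tuples: seq[start:start+length] is repeated
    `offset` positions further along the sequence.
    """
    n = len(seq)
    runs = []
    for offset in range(1, n):  # offset 0 would be the main diagonal
        run = 0
        for i in range(n - offset):
            if seq[i] == seq[i + offset]:
                run += 1
            else:
                if run >= min_run:
                    runs.append((offset, i - run, run))
                run = 0
        if run >= min_run:  # a run that reaches the end of the diagonal
            runs.append((offset, n - offset - run, run))
    return runs


alu = "CATCTCAAAAACAACAACAAAAAAAAAAAAAAAAGAAAAAAAA"
for offset, start, length in repeat_diagonals(alu, min_run=6):
    print(offset, start, alu[start:start + length])
```

Only the diagonals above the main one are scanned, because a self dot plot is symmetric and the lower half carries the same information.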

Notes: What are repeats?
- Repeats are stretches of residues that recur within a sequence.
- Importance of repeats:
  - In protein: regulatory regions, binding sites
  - In DNA: present in transposons and chromosomal mutational hotspots; many genetic diseases are related to repeats, e.g. Huntington's disease.

Patterns in Dot Plot: when two sequences are similar
- Broken diagonal; the interruptions show regions of mismatch.
    GREGYPADSKGCKITCFLTAAGYCNTECTLKKGSSGYCAWPACYCYG
    MKGMILFISCLLLIDIVVGGKEGYLMDHEGCKLSCFIRPSGYCGRECTLKKGS

Patterns in Dot Plot: two different, but related sequences
- Broken diagonal: clusters of dots parallel to the central diagonal.
- The distance between the lines shows the number of insertions needed to get the alignment.
    GREGYPADSKGCKITCFLTAAGYCNTECTLKKGSSGYCAWPA
    ARDGYPVDEKGCKLSCLINDKWCNSACHSRGGKYGYCYTGGL

Two models of alignment: Local and Global alignments
- Global alignment:
  - Looks for similarity across the full extent of the sequences
    - Site:

GLOBAL Alignment
- The two sequences are matched across their whole length.
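The slides do not name an algorithm for global alignment; the classic one is Needleman-Wunsch dynamic programming, sketched here as a score-only version. The scoring values (match +1, mismatch -1, a flat gap penalty of -2) and the function name are assumptions chosen for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best score for aligning ALL of a against ALL of b (Needleman-Wunsch, linear gap cost)."""
    # prev[j] holds the best score for the current prefix of a against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against nothing: i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# DNA pair from the "Aligning biological sequences" slide: 6 matches and 1 mismatch.
print(global_alignment_score("TTGACAC", "TTTACAC"))  # prints 5
```

Every residue of both sequences has to be accounted for, so unmatched ends are charged as gaps. That is why global alignment suits sequences that are similar along their whole length.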

Local alignment
- Looks for regions of similarity in parts of the sequences only
- Software: BLAST, FASTA

Local Alignment
- Example of local alignment between two sequences using the lalign program. ( GN_form.html)
- Notice that the alignment is shown only for those regions that have strong identity or strong similarity.
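For comparison with global alignment, here is a score-only sketch of the Smith-Waterman local alignment algorithm. The slides do not name it; it is given as the standard dynamic-programming formulation of local alignment. The scoring values and the function name are illustrative assumptions, and the test strings are made up.

```python
def local_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best score over all pairs of substrings of a and b (Smith-Waterman, linear gap cost)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere; that is what makes it local.
            cell = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The shared core GATTACA is found even though the flanking regions do not match.
print(local_alignment_score("XXXXGATTACAYYYY", "ZZGATTACAZZ"))  # prints 7
```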

Why two different models?
- Global alignment
  - High degree of homology
  - Good for modelling
- Local alignment
  - Localised similarity (conserved regions with structural or functional importance, repeats, domains)

FASTA
- FASTA stands for "FAST-All"; developed by Pearson and Lipman.
- A heuristic method: it first finds short exact word matches, then refines the best-scoring regions with dynamic programming.
- Websites available:
  - http://
  - http:// a.html

What is BLAST?
- Basic Local Alignment Search Tool (BLAST)
- Method for pairwise alignment.
- Used to search a database (of nucleotide/protein sequences) for sequences homologous to a given query sequence.
- Like FASTA, a heuristic word-based search, but faster in generating output.
- Sites for doing BLAST:
  - http://

How to go about doing BLAST
SARS virus gene (protein sequence used as the query):
  SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTAEDMLNPNYE
  DLLIRKSNHSFLVQAGNVQLRVIGHSMQNCLLRLKVDTSNPKTPKYKFVRIQPGQTF
  SVLACYNGSPSGVYQCAMRPNHTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMEL
  PTGVHAGTDLEGKFYGPFVDRQTAQAAGTDTTITLNVLAWLYAAVINGDRWFLNRF
  TTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCAALKELLQNGMNGR
  TILGSTILEDEFTPFDVVRQCSGVTFQ


BLAST output for a protein query sequence from a SARS virus
- Score (bits): the score accumulated letter by letter during alignment, based on the substitution matrices.
- High score = low E value.

E value
- The number of chance alignments that one can expect to get as hits.
- The lower the E value, the fewer the chance hits.
- An E value of zero or close to zero indicates a very good hit (highly homologous sequence).
- Some BLAST programs report a closely related value, P(N), instead.
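The inverse relation between bit score and E value can be made concrete with the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below assumes that relation and uses plain lengths; real BLAST runs use corrected "effective" lengths, and the function name and example numbers here are illustrative only.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E value from a bit score: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Illustrative numbers: a 300-residue query against a database of 10 million residues.
for score in (20, 40, 60):
    print(score, expected_hits(score, 300, 10_000_000))
```

Each extra bit of score halves the E value, and the same score becomes less significant as the database grows.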

BLAST OUTPUT
[Figure: BLAST alignment report, with labels pointing to the value that gives the identity and the value that gives the similarity]

BLAST query schemes
- Amino acid sequence: against which database?
  - Blastp (protein sequence db)
  - Tblastn (translated nucleotide sequence db)
- DNA sequence: against which database?
  - Blastn (nucleotide db)
  - Blastx (protein sequence db)
  - Tblastx (translated nucleotide sequence db)
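The query schemes amount to a lookup table from query type and database type to program name. A small sketch follows; the table layout and the helper `programs_for` are hypothetical, while the five program names are the standard BLAST programs listed on the slide.

```python
# (program, query type, database type, what gets translated before comparison)
BLAST_PROGRAMS = [
    ("blastp", "protein", "protein", "nothing"),
    ("tblastn", "protein", "nucleotide", "database"),
    ("blastn", "nucleotide", "nucleotide", "nothing"),
    ("blastx", "nucleotide", "protein", "query"),
    ("tblastx", "nucleotide", "nucleotide", "query and database"),
]


def programs_for(query_type: str, db_type: str):
    """List the BLAST programs that accept this query/database combination."""
    return [name for name, q, d, _ in BLAST_PROGRAMS if q == query_type and d == db_type]


print(programs_for("protein", "nucleotide"))     # prints ['tblastn']
print(programs_for("nucleotide", "nucleotide"))  # prints ['blastn', 'tblastx']
```

A DNA query against a nucleotide database has two options: blastn compares the nucleotides directly, while tblastx compares the translations of both.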

Flowchart: from an unknown gene to its function
1. DNA sequencing gives a gene (cDNA) of unknown function:
     CTAACATGCTTAGGATAATGGCCTCTCTTGTTCTTGCTCGCAAACATAACACTT
     GCTGTAACTTATCACA
2. Translate into 6 frames and choose the appropriate frame to get the amino acid sequence:
     NMLRIMASLVLARKHNTC
     CNLSHRFYRLANECAQVL
     SEMVMCGGSLYVKPGGT
     SSGDATTAYANSVFNIC
3. BLAST
4. BLAST RESULTS: choose the best hit using the lowest E value and the highest % identity.
5. If the % identity is high and the E value is low: function and family of the gene are found.
6. CLUSTAL, using multiple sequences: find conserved regions, domains, and phylogenetic relations (which family of genes is closest to your target gene/protein).
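The "translate into 6 frames" step can be sketched with the standard genetic code. The codon table below is built from the usual TCAG ordering. The names `translate`, `reverse_complement`, `six_frame_translations`, and `SLIDE_DNA` are illustrative, and the DNA string is the cDNA fragment shown on the slide with its two pieces joined.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str, offset: int = 0) -> str:
    """Translate complete codons starting at `offset`; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(offset, len(dna) - 2, 3))


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna: str):
    """Frames +1, +2, +3 on the given strand, then -1, -2, -3 on the reverse complement."""
    rc = reverse_complement(dna)
    return [translate(dna, o) for o in range(3)] + [translate(rc, o) for o in range(3)]


SLIDE_DNA = "CTAACATGCTTAGGATAATGGCCTCTCTTGTTCTTGCTCGCAAACATAACACTTGCTGTAACTTATCACA"
for label, protein in zip(["+1", "+2", "+3", "-1", "-2", "-3"], six_frame_translations(SLIDE_DNA)):
    print(label, protein)
```

For this fragment, frame +3 reads NMLRIMASLVLARKHNTCCNLS, which matches the start of the amino acid sequence shown on the slide, so that is the frame the slide chose.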

SUMMARY
Today we looked at methods to compare two sequences:
- Dot plots (simplest, graphical view)
- Different patterns of dot plots
- Local alignment
- Global alignment
- Difference between these two models
- FASTA
- BLAST
- Other types of BLAST