An Introduction to Bioinformatics


An Introduction to Bioinformatics Database Searching - Pairwise Alignments

AIMS: To explain the principles underlying local and global alignment programs. To explain what substitution matrices are and how they are used. To introduce the commonly used pairwise alignment programs. To explore the significance of alignment results. OBJECTIVES: To carry out FastA and BLAST searches. To select appropriate substitution matrices. To evaluate the significance of alignment/search results.

INTRODUCTION Sequence comparisons: Protein v Protein, DNA v DNA, Protein v DNA, DNA v Protein. Pair-wise comparison. Methodology.

Similarity v Homology………. “If two genes shared a common ancestor then they are homologous.” They did or they didn’t; they are or they aren’t. Homology is all-or-nothing, so “% homology” is not a meaningful quantity.

Definitions

Similarity v Homology……. But: comparison of two sequences is complex. Differences need to be quantified. Homology is inferred from the degree of similarity.

Information theory……….? Protein sequence = message. 4.19 bits per residue. bits = log2(M). bit: the amount of information required to distinguish between two equally likely choices. Ref: Molecular Information theory - http://www-lmmb.ncifcrf.gov/~toms/ http://www.lecb.ncifcrf.gov/~toms/paper/nano2/latex/index.html

Are two proteins related? Average protein size of 150 residues. Information content of about 630 bits. The probability that two random sequences specify the same message is 2^-630, or about 10^-190. Convergent evolution giving rise to two similar sequences would be very rare. If two sequences exhibit significant similarity, they most likely arose from a common ancestor and are homologous.
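The arithmetic on the two slides above can be checked in a few lines of Python. This is only a sketch: the 4.19 bits per residue and the 150-residue average are taken from the slides as given, and the variable names are made up for illustration.

```python
import math

# Upper bound on information per residue: 20 equally likely amino acids.
max_bits_per_residue = math.log2(20)

# Figures quoted on the slides.
bits_per_residue = 4.19
protein_length = 150

# 4.19 * 150 = 628.5, which the slide rounds to 630 bits.
total_bits = bits_per_residue * protein_length
slide_bits = 630

# Probability that two random sequences carry the same message is 2^-630.
# Converting to a power of ten lets us read off the exponent directly.
log10_probability = -slide_bits * math.log10(2)

print(f"maximum bits per residue: {max_bits_per_residue:.2f}")
print(f"total bits for an average protein: {total_bits:.1f}")
print(f"2^-{slide_bits} is about 10^{log10_probability:.1f}")
```

The quoted 4.19 bits is a little below log2(20) because, presumably, the twenty amino acids are not used with equal frequency.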

Basic concept. The English alphabet contains 26 letters, that of DNA 4, and that of protein 20. Measure similarity or dissimilarity.

Basic concept………. Hamming distance: a measure of the number of differences between two sequences. Example: AGATCTAG ACGA v AGGCATCATGCAGT. The answer to the above is 10. The proportional or p-distance: the Hamming distance divided by the total sequence length, so it ranges from 0 to 1. In the above example the p-distance is 10/14.
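A minimal Python sketch of the Hamming distance and p-distance described above. The two 8-base sequences are made up for illustration and are not the slide's example, and the function names are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def p_distance(a: str, b: str) -> float:
    """Hamming distance divided by sequence length, so it lies in [0, 1]."""
    return hamming_distance(a, b) / len(a)


# Hypothetical example sequences (not taken from the slides).
seq_a = "AGATCTAG"
seq_b = "AGGCATCA"

print(hamming_distance(seq_a, seq_b))
print(p_distance(seq_a, seq_b))
```

Because the Hamming distance compares position by position, it only makes sense once the two sequences have the same length, which is one reason gaps and alignment are needed for real data.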

Basic concept………. The log-odds ratio: a measure of how unlikely it is that two sequences would be so similar by chance. It is based on the observed frequencies of each of the characters (bases or amino acids) in the sequences, and on the probability of observing each homologous pair in the two sequences. A positive score measures similarity; the total is calculated by adding the scores from pre-calculated matrices (PAM and BLOSUM for protein, unitary for DNA).
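As a rough sketch of the idea, a log-odds score compares how often a pair of characters is seen aligned with how often it would be expected by chance, and an alignment score is the sum of per-position scores. The frequencies below are invented for illustration, log base 2 is an assumption, and the unitary DNA matrix is taken here to mean 1 for a match and 0 for a mismatch.

```python
import math


def log_odds(pair_freq: float, freq_a: float, freq_b: float) -> float:
    """log2 of observed pair frequency over the frequency expected by chance."""
    return math.log2(pair_freq / (freq_a * freq_b))


def unitary_score(x: str, y: str) -> int:
    """Unitary (identity) matrix for DNA: 1 for a match, 0 otherwise."""
    return 1 if x == y else 0


def alignment_score(a: str, b: str) -> int:
    """Sum per-position scores over an ungapped, equal-length alignment."""
    return sum(unitary_score(x, y) for x, y in zip(a, b))


# Hypothetical frequencies: the pair is seen twice as often as chance predicts.
favoured = log_odds(pair_freq=0.02, freq_a=0.1, freq_b=0.1)
# Hypothetical frequencies: the pair is seen half as often as chance predicts.
disfavoured = log_odds(pair_freq=0.005, freq_a=0.1, freq_b=0.1)

print(favoured, disfavoured)
print(alignment_score("AGATCTAG", "AGGCATCA"))
```

Pairs seen more often than chance get a positive score and pairs seen less often get a negative one, which is why adding scores along an alignment rewards plausible substitutions.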

Two problems to consider. GAPS: genes evolve through deletions, insertions and recombination; give penalties for gap creation and extension. Global or Local Alignments: will the sequences be similar over their whole length? Use different algorithms. Example: AGATCTAG-ACGA-TGCAGT AGGCATCATGCAGT
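One common way to penalise gap creation and extension separately is an affine gap penalty. The sketch below assumes one convention (the opening cost covers the first gap position, and each further position costs the extension penalty) and uses made-up penalty values; real programs differ in both.

```python
from itertools import groupby


def gap_lengths(aligned_row: str) -> list:
    """Return the length of every run of '-' characters in an aligned row."""
    return [len(list(run)) for char, run in groupby(aligned_row) if char == "-"]


def affine_gap_penalty(length: int, gap_open: int = 10, gap_extend: int = 1) -> int:
    """Cost of one gap: opening cost plus an extension cost per extra position.

    The default values are illustrative assumptions, not recommended settings.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# Gapped row copied from the slide; it contains two gaps of length one.
row = "AGATCTAG-ACGA-TGCAGT"
total_penalty = sum(affine_gap_penalty(n) for n in gap_lengths(row))

print(gap_lengths(row), total_penalty)
```

Charging more to open a gap than to extend it reflects the idea that a single insertion or deletion event can add or remove several residues at once.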

Global and Local Alignments. A global approach will attempt to align two sequences along their entire length. A local alignment will look for local regions of similarity, or subsequences.
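The difference between the two approaches can be sketched with one dynamic-programming routine. This is an illustrative implementation in the style of Needleman-Wunsch (global) and Smith-Waterman (local), neither of which is named on the slide; the match, mismatch and linear gap scores are arbitrary assumptions.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, local: bool = False) -> int:
    """Best alignment score of a and b using a linear gap penalty.

    local=False scores an end-to-end (global) alignment.
    local=True lets the alignment start and end anywhere (local).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # A global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diagonal, up, left)
            if local:
                # A local alignment may restart from zero at any cell.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    return best if local else score[-1][-1]


# Hypothetical sequences sharing only the subsequence ACGT.
x, y = "TTTTACGT", "ACGTGGGG"
print("global:", alignment_score(x, y))
print("local: ", alignment_score(x, y, local=True))
```

For sequences that share only a short region, the local score picks out that region while the global score is dragged down by the unrelated flanks.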

[Figure: dotplot of THECATSATONTHEMAT (horizontal) against THERATSONTC (vertical), with a mark wherever the two letters are identical.] Dotplots are the simplest form of alignment. Identical sequences, or subsequences, are identified by diagonal lines.
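A dotplot like the one on this slide can be sketched in a few lines of Python. The horizontal and vertical strings are read off the slide residue (the vertical one is a best reading), and the function names are hypothetical.

```python
def dotplot(horizontal: str, vertical: str) -> list:
    """Return a grid of booleans: True where the two letters are identical."""
    return [[h == v for h in horizontal] for v in vertical]


def render(horizontal: str, vertical: str) -> str:
    """Draw the dotplot as text, marking identities with 'l' as on the slide."""
    lines = ["  " + " ".join(horizontal)]
    for letter, row in zip(vertical, dotplot(horizontal, vertical)):
        lines.append(letter + " " + " ".join("l" if hit else "." for hit in row))
    return "\n".join(lines)


horizontal = "THECATSATONTHEMAT"
vertical = "THERATSONTC"
grid = dotplot(horizontal, vertical)

print(render(horizontal, vertical))
```

Shared subsequences such as THE show up as short diagonal runs of marks, while isolated marks are chance identities.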

The DOTTUP website does this analysis. Example: Rabbit v Emperor Penguin haemoglobin.

Matrices - PAM and BLOSUM. Certain groups of amino acids have similar physico-chemical properties, e.g. lysine and arginine: a conservative substitution. The genetic code is degenerate: silent mutations. Dayhoff: Point Accepted Mutation (PAM).

Matrices - PAM and BLOSUM. 1 PAM unit is the extent of evolutionary divergence in which 1% of amino acid residues are altered. Alignment of 15 very closely related proteins. Calculate a matrix of the probability of a mutation altering one amino acid residue to any other amino acid on the basis of 1 PAM. Extrapolate to PAM250, which is more useful for proteins that are not well conserved.
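The extrapolation from 1 PAM to PAM250 can be pictured as multiplying the 1-PAM mutation probability matrix by itself 250 times. The sketch below uses a made-up three-letter alphabet with invented probabilities, not Dayhoff's data, and plain Python lists so that it needs no extra libraries.

```python
def mat_mul(x: list, y: list) -> list:
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m: list, power: int) -> list:
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical 1-PAM matrix: each row is a letter, each entry the probability
# that it becomes the column letter. Every letter has a 1% chance of changing.
toy_pam1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(toy_pam1, 250)

for row in toy_pam250:
    print(["%.3f" % p for p in row])
```

After 250 steps the diagonal entries are well below 0.99 but not zero: a position can mutate more than once or mutate back, so 250 PAM units does not mean every residue has changed.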

Problems: derived from proteins of only slight divergence. [Figure: PAM250 matrix]

BLOSUM. Henikoff and Henikoff (1992) derived matrices based on more divergent sequences. The BLOSUM (BLOcks SUbstitution Matrix) matrices are numbered by the identity threshold used to cluster sequences when the matrix is built: 80% or more (BLOSUM 80), 62% or more (BLOSUM 62), etc. Based on local, not global, alignments.

Alignments - local. Basic principle: Choose one sequence to be searched against the other: query sequence (q) and target sequence (t). Divide the query sequence into small subsequences, called words. For each word of q, look along t to find other words in t which are similar. Matching words act as "anchors" from which a better alignment between q and t is built up. Assess how good this alignment is.
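The word-matching step above can be sketched as follows. This simplified version only finds exact word matches, whereas the slide speaks of similar words, and the function name and default word size are illustrative assumptions.

```python
from collections import defaultdict


def find_anchors(query: str, target: str, word_size: int = 3) -> list:
    """Return (query_pos, target_pos) pairs where a query word occurs in target."""
    # Index every word of the target by its start position.
    index = defaultdict(list)
    for j in range(len(target) - word_size + 1):
        index[target[j:j + word_size]].append(j)

    # Look up each word of the query in that index.
    anchors = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            anchors.append((i, j))
    return anchors


# Hypothetical sequences (not from the slides).
anchors = find_anchors("ACGTT", "TTACGA")
print(anchors)
```

Anchors that share the same value of target position minus query position lie on one diagonal of a dotplot, so a cluster of them suggests a region worth extending into a fuller alignment.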

FastA and BLAST. FastA: Pearson and Lipman method (late 80s). The query sequence is compared to each sequence in a database, looking for matching words (up to 6 nucleotides, or two amino acids, in a row). The best regions are rescored with matrices. The algorithm checks concatenation. The best sequences are displayed.

FastA and BLAST. BLAST: Basic Local Alignment Search Tool. Compares the query to a database. For each pair, finds the maximal segment pair (using BLOSUM). The algorithm calculates the probability of random occurrence. Faster than FastA but less accurate; the method of choice since the introduction of gapped BLAST.
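To make the idea of a maximal segment pair concrete, the sketch below finds the highest-scoring ungapped segment pair by brute force over every diagonal. BLAST itself reaches such segments heuristically from word hits and scores proteins with a BLOSUM matrix; the exhaustive search and the simple match/mismatch scores here are assumptions made for illustration.

```python
def best_segment_on_diagonal(q: str, t: str, offset: int,
                             match: int = 1, mismatch: int = -1) -> int:
    """Best-scoring contiguous ungapped segment where t index = q index + offset."""
    best = running = 0
    i = max(0, -offset)
    while i < len(q) and i + offset < len(t):
        running += match if q[i] == t[i + offset] else mismatch
        # A segment never benefits from a negative-scoring prefix, so restart.
        running = max(running, 0)
        best = max(best, running)
        i += 1
    return best


def maximal_segment_pair_score(q: str, t: str) -> int:
    """Highest ungapped segment score over all diagonals (non-empty inputs)."""
    return max(best_segment_on_diagonal(q, t, offset)
               for offset in range(-(len(q) - 1), len(t)))


# Hypothetical sequences sharing the segment ACGT.
msp = maximal_segment_pair_score("TTTTACGT", "ACGTGGGG")
print(msp)
```

The exhaustive search costs time proportional to the product of the two lengths, which is the kind of cost a word-based heuristic is designed to avoid on large databases.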

Significance? Only local alignments, without gaps (HSPs/MSPs). The probability of an alignment occurring by chance (p-value) is derived by comparing the observed score (S) with the expected distribution of scores. Larger databases give a larger probability of a sequence match by chance. The closer the p-value is to zero, the more confidence can be given to the alignment.
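The slide does not give a formula, but the statistic usually quoted for ungapped local alignments is the Karlin-Altschul expectation E = K·m·n·e^(-λS), with p = 1 - e^(-E). The sketch below uses it to show the database-size effect; the λ and K values, the lengths and the score are placeholders, not recommended settings.

```python
import math


def expected_hits(score: float, query_len: int, db_len: int,
                  lam: float, k: float) -> float:
    """Expected number of chance alignments scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance alignment, given the expected count."""
    return -math.expm1(-e_value)


# Placeholder statistical parameters (assumptions for illustration only).
LAMBDA, K = 0.267, 0.041

small_db = expected_hits(score=50, query_len=300, db_len=1_000_000, lam=LAMBDA, k=K)
large_db = expected_hits(score=50, query_len=300, db_len=2_000_000, lam=LAMBDA, k=K)

print(small_db, p_value(small_db))
print(large_db, p_value(large_db))
```

Doubling the database doubles the expected number of chance hits at the same score, so the same alignment is less convincing when it comes from a larger search.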

Types of BLAST. Nucleotide BLAST: standard nucleotide-nucleotide BLAST [blastn]; MEGABLAST; search for short, nearly exact matches. Protein BLAST: standard protein-protein BLAST [blastp]; PSI- and PHI-BLAST. Translated BLAST searches: nucleotide query - protein db [blastx]; protein query - translated db [tblastn]; nucleotide query - translated db [tblastx].
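The choice of program from the list above comes down to the type of query and the type of database. The lookup below is a small illustrative helper; the dictionary keys and the function name are invented for this sketch, and only the program names come from the slide.

```python
# (query type, database type) -> BLAST program, as listed on the slide.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "translated nucleotide"): "tblastn",
    ("nucleotide", "translated nucleotide"): "tblastx",
}


def choose_program(query_type: str, db_type: str) -> str:
    """Return the BLAST program for a query/database pairing."""
    try:
        return BLAST_PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"no BLAST program for {query_type} v {db_type}") from None


# A new mRNA searched against a protein database.
print(choose_program("nucleotide", "protein"))
```

For a new mRNA sequence, for example, a nucleotide query against a protein database maps to blastx, while the same query against a nucleotide database maps to blastn.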

Example. I have a new mRNA sequence: TGGCGGCGGCGGCGGCGGTTGTCCCGGCTGTGCCGGTTGGTGTGGCCCGTCAGCCCGCGTACCACAGCGCCCGGGCCGCG TCGAGCCCAGTACAGCCAAGCCGCTGCGGCCGGGTCCGGCGCGGGCGGCGCGCGCAGACGGAGGGCGGCGGCCGCGGCCA GGGCGGCCCGTGGGACCGCGGGCCCCCGGCGCAGCGCTGCCCGGCTCCCGGCCCTGCCGGCCTCCTCCCTTGGCGCCGCG GCCATGGCGGCCAGCGCGAAGCGGAAGCAGGAGGAGAAGCACCTGAAGATGCTGCGGGACATGACCGGCCTCCCGCGCAA CCGAAAGTGCTTCGACTGCGACCAGCGCGGCCCCACCTACGTTAACATGACGGTCGGCTCCTTCGTGTGTACCTCCTGCT CCGGCAGCCTGCGAGGATTAAATCCACCACACAGGGTGAAATCTATCTCCATGACAACATTCACACAACAGGAAATTGAA TTCTTACAAAAACATGGAAATGAAGTCTGTAAACAGATTTGGCTAGGATTATTTGATGATAGATCTTCAGCAATTCCAGA CTTCAGGGATCCACAAAAAGTGAAAGAGTTTCTACAAGAAAAGTATGAAAAGAAAAGATGGTATGTCCCGCCAGAACAAG CCAAAGTCGTGGCATCAGTTCATGCATCTATTTCAGGGTCCTCTGCCAGTAGCACAAGCAGCACACCTGAGGTCAAACCA CTGAAATCTCTTTTAGGGGATTCTGCACCAACACTGCACTTAAATAAGGGCACACCTAGTCAGTCCCCAGTTGTAGGTCG TTCTCAAGGGCAGCAGCAGGAGAAGAAGCAATTTGACCTTTTAAGTGATCTCGGCTCAGACATCTTTGCTGCTCCAGCTC CTCAGTCAACAGCTACAGCCAATTTTGCTAACTTTGCACATTTCAACAGTCATGCAGCTCAGAATTCTGCAAATGCAGAT TTTGCAAACTTTGATGCATTTGGACAGTCTAGTGGTTCGAGTAATTTTGGAGGTTTCCCCACAGCAAGTCACTCTCCTTT TCAGCCCCAAACTACAGGTGGAAGTGCTGCATCAGTAAATGCTAATTTTGCTCATTTTGATAACTTCCCCAAATCCTCCA GTGCTGATTTTGGAACCTTCAATACTTCCCAGAGTCATCAAACAGCATCAGCTGTTAGTAAAGTTTCAACGAACAAAGCT GGTTTACAGACTGCAGACAAATATGCAGCACTTGCTAATTTAGACAATATCTTCAGTGCCGGGCAAGGTGGTGATCAGGG AAGTGGCTTTGGGACCACAGGTAAAGCTCCTGTTGGTTCTGTGGTTTCAGTTCCCAGTCAGTCAAGTGCATCTTCAGACA AGTATGCAGCTCTGGCAGAACTAGACAGCGTTTTCAGTTCTGCAGCCACCTCCAGTAATGCGTATACTTCCACAAGTAAT GCTAGCAGCAATGTTTTTGGAACAGTGCCAGTGGTTGCTTCTGCACAGACACAGCCTGCTTCATCAAGTGTGCCTGCTCC ATTTGGACGTACGCCTTCCACAAATCCATTTGTTGCTGCTGCTGGTCCTTCTGTGGCATCTTCTACAAACCCATTTCAGA CCAATGCCAGAGGAGCAACAGCGGCAACCTTTGGCACTGCATCCATGAGCATGCCCACGGGATTCGGCACTCCTGCTCCC TACAGTCTTCCCACCAGCTTTAGTGGCAGCTTTCAGCAGCCTGCCTTTCCAGCCCAAGCAGCTTTCCCTCAACAGACAGC TTTTTCTCAACAGCCCAATGGTGCAGGTTTTGCAGCATTTGGACAAACAAAGCCAGTAGTAACCCCTTTTGGTCAAGTTG CAGCTGCTGGAGTATCTAGTAATCCTTTTATGACTGGTGCACCAACAGGACAATTTCCAACAGGAAGCTCATCAACCAAT 
CCTTTCTTATAGCCTTATATAGACAATTTACTGGAACGAACTTTTATGTGGTCACATTACATCTCTCCACCTCTTGCACT GTTGTCTTGTTTCACTGATCTTAGCTTTAAACACAAGAGAAGTCTTTAAAAAGCCTGCATTGTGTATTAAACACCAGGTA ATATGTGCAAAACCGAGGGCTCCAGTAACACCTTCTAACCTGTGAATTGGCAGAAAAGGGTAGCGGTATCATGTATATTA AAATTGGCTAATATTAAGTTATTGCAGATACCACATTCATTATGCTGCAGTACTGTACATATTTTTCTTAGAAATTAGCT ATTTGTGCATATCAGTATTTGTAACTTTAACACATTGTTATGTGAGAAATGTTACTGGGGAAATAGATCAGCCACTTTTA AGGTGCTGTCATATATCTTGGAATGAATGACCTAAAATCATTTTAACCATTGCTACTGGAAAGTAACAGAGTCAAAATTG GAAGGTTTTATTCATTCTTGAATTTTTCCTTTCTAAAGAGCTCTTCTATTTATACATGCCTAAATTCTTTTAAAATGTAG AGGGATACCTGTCTGCATAATAAAGCTGATCATGTTTTGCTACAGTTTGCAGGTGAAAAAAAATAAATATTATAAAATAA AAAAAAAAAAAAAGAAAAAAAAAA

I’ve pasted my sequence. I’ve selected the database. I hit BLAST!

Record this number. Press Format!

Setting up a BLAST search. Step 1. Plan the search. Step 2. Enter the query sequence. Step 3. Choose the appropriate search parameters. Step 4. Submit the query. Deciphering the BLAST output. Step 1. Examine the alignment scores and statistics. Step 2. Examine the alignments. Step 3. Review search details to plan the next step. Post-BLAST analysis. Perform a PSI-BLAST analysis. Create a multiple alignment. Try motif searching with PHI-BLAST.