BINF350, Tutorial 4 Karen Marshall. Aim ► Examine how blast parameters (e.g. scoring scheme, word length) affect the alignment outcome ► To optimise blast.

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.
Bioinformatics Tutorial I BLAST and Sequence Alignment.
BLAST Sequence alignment, E-value & Extreme value distribution.
Types of homology BLAST
Introduction to Bioinformatics
GNANA SUNDAR RAJENDIRAN JOYESH MISHRA RISHI MISHRA FALL 2008 BIOINFORMATICS Clustering Method for Repeat Analysis in DNA sequences.
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
Bioinformatics and Phylogenetic Analysis
BLAST Tutorial 3 What is BLAST? Basic Local Alignment Search Tool Is a set of similarity search programs designed to explore sequence databases. What are.
We continue where we stopped last week: FASTA – BLAST
Introduction to Bioinformatics - Tutorial no. 2 Global Alignment Local Alignment FASTA BLAST.
Similar Sequence Similar Function Charles Yan Spring 2006.
Heuristic Approaches for Sequence Alignments
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 Sep 16 th, 2014.
Sequence alignment, E-value & Extreme value distribution
From Pairwise Alignment to Database Similarity Search.
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
An Introduction to Bioinformatics
BLAST What it does and what it means Steven Slater Adapted from pt.
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
BLAST : Basic local alignment search tool B L A S T !
Tweaking BLAST Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
Indexing DNA sequences for local similarity search Joint work of Angela, Dr. Mamoulis and Dr. Yiu 17/5/2007.
Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.
School B&I TCD Bioinformatics Database homology searching May 2010.
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
BLAST Basic Local Alignment Search Tool (Altschul et al. 1990)
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Algorithms for Biological Sequence Analysis ─ Class Presentation Human-Mouse Alignments with BLASTZ Galaxy: A Platform for Interactive Large-scale Genome.
Whole Genome Repeat Analysis Package A Preliminary Analysis of the Caenorhabditis elegans Genome Paul Poole.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Database search. Overview : 1. FastA : is suitable for protein sequence searching 2. BLAST : is suitable for DNA, RNA, protein sequence searching.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Construction of Substitution matrices
Tweaking BLAST Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that.
David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.
Sequence Search Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
Copyright OpenHelix. No use or reproduction without express written consent1.
What is BLAST? Basic BLAST search What is BLAST?
Practice -- BLAST search in your own computer 1.Download data file from the course web page, or Ensemble. Save in the blast\dbs folder. 2.Start a CMD window,
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
What is sequencing? Video: WlxM (Illumina video) WlxM.
Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
What is BLAST? Basic BLAST search What is BLAST?
Blast Basic Local Alignment Search Tool
Basics of BLAST Basic BLAST Search - What is BLAST?
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
Basic Local Alignment Search Tool
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
Presentation transcript:

BINF350, Tutorial 4 Karen Marshall

Aim ► Examine how blast parameters (e.g. scoring scheme, word length) affect the alignment outcome ► To optimise blast parameters for alignments with different levels of sequence homology

Practical: Part 1 ► Start with an ~200 bp original DNA sequence ► Simulation mutation events over time and collect sequences ► Blast original sequence against mutated sequences ► Repeat blasts using different parameters v Mutated sequences Original sequence

Simulation of mutated sequences ► Point accepted mutation (PAM) model of molecular evolution ► 1 PAM = 1 mutation per 100 bases on average  1 PAM  99.0% sequence homology  10 PAM  90.6% sequence homology  50 PAM  63.5% sequence homology  Concept of forward and backwards mutation

for each ‘successive PAM’ for each ‘nucleotide’ if (rand > 0.01) do not mutate else if (rand <=0.01) mutate by random selection from the non-identical bases

BLAST - Heuristic Step Suffix Tree Lookup table Words/seeds Location Threshold T Larger seq file

BLAST February 10, 2004: BLAST released BLAST release notes Correction to tblastx alignment computation ia32-linux now requires glibc Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/ /ncbi.tar.gz. ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/ /ncbi.tar.gz Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.8/. ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.8/ February 2, 2004: BLAST released BLAST release notes Standalone BLAST is now available for amd64-linux. formatdb now restricts volume sizes to 1G on 32-bit platforms for performance reasons. The -A option has been removed from formatdb, that is, all databases will be created with ASN.1 deflines. tblastn query concatenation now works correctly on 64-bit platforms. The wwwblast source code has been merged into the C toolkit tree and is no longer distributed with the binaries. Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/ /ncbi.tar.gz. ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/ /ncbi.tar.gz Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.7/. ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.7/ blast_whatsnew.shtml

BLAST on your own machine ► Allows you to BLAST multiple sequences  most web versions are single sequence only ► Steps  Sequence files in FASTA format Can have multiple sequences in each file but no duplicates  Format larger sequence file into a database Formatdb –i dbfile.txt –p F –o T  Perform BLAST using appropriate switches BLASTALL –p BLASTN –d dbfile.txt –i comp.txt –o out.txt

BLAST ► Arguments see appendix of handout  –W for seed word length (default = 11)  -r reward for a match (default = 1)  -q penalty for a mismatch (default = 3)  -G cost to open a gap  -E cost to extend a gap  -F filter query sequence  -e to set threshold expectation (threshold for HSP before gaps are included)  -m to specify different output options

Score E Score E Sequences producing significant alignments: (bits) Value 1_ e-046 0_ e-046 4_ e-029 2_ e-027 5_ e-023 3_ e-023 4_ e-015 2_ e-015 5_ e-011 QUERY 1 agattcactggtgtggcaagttgtctctcagactgtacatgcattaaaattttgcttggc 60 1_ _ _ t.....c......ag a _ a..c....a a g _ c......a g c _ g t c.....a _ t.....c......ag....a.....g a _ a..c...ta aa......c..a.....g _ c..c...a....g....g a......c......c Example of BLAST output: -m3

Substitution scores ► Optimal substitution scores were derived for different PAM distances / sequence homologies (States et al., 1991) ► Of importance is the match to mismatch score ratio

Substitution scores ► ‘Better’ substitution matrices exist, but not yet implemented in most BLAST software

Practical: Part 2 ► Apply concepts from Part 1 to ‘real sequences’ ► BLAST mRNA sequence for human and cattle INFG to an ~1/2 Mb sequence of human DNA ► Use optimal blast parameters for expected homology Human DNA Human INFG mRNA Cattle INFG mRNA

Expected levels of sequence homology ► Varies for sequences being considered and genomic region Human to mouse comparison, from …

Efficiency of BLAST ► Human to cattle coding sequence ~85% homology (~PAM 15) (~PAM 15)

INFG mRNA sequences ► Extracted from NCBI website using batch entrez >gi| |ref|NM_ | Homo sapiens interferon, gamma (IFNG), mRNA TGAAGATCAGCTATTAGAAGAGAAAGATCAGTTAAGTCCTTTGGACCTGATCAGCTTGATACAAGAACTACTGATTTCAACTTCTTTGGCTTAATTCTCTCGGAAACGATGAAATATACAAGTTATATCTTGGCTTTTCAGCTCTGCATCGTTTTGGGTTCTCTTGGCTGTTACTGCCAGGACCCATATGTAAAAGAAGCAGAAAACCTTAAGAAATATTTTAATGCAGGTCATTCAGATGTAGCGGATAATGGAACTCTTTTCTTAGGCATTTTGAAGAATTGGAAAGAGGAGAGTGACAGAAAAATAATGCAGAGCCAAATTGTCTCCTTTTACTTCAAACTTTTTAAAAACTTTAAAGATGACCAGAGCATCCAAAAGAGTGTGGAGACCATCAAGGAAGACATGAATGTCAAGTTTTTCAATAGCAACAAAAAGAAACGAGATGACTTCGAAAAGCTGACTAATTATTCGGTAACTGACTTGAATGTCCAACGCAAAGCAATACATGAACTCATCCAAGTGATGGCTGAACTGTCGCCAGCAGCTAAAACAGGGAAGCGAAAAAGGAGTCAGATGCTGTTTCAAGGTCGAAGAGCATCCCAGTAATGGTTGTCCTGCCTGCAATATTTGAATTTTAAATCTAAATCTATTTATTAATATTTAACATTATTTATATGGGGAATATATTTTTAGACTCATCAATCAAATAAGTATTTATAATAGCAACTTTTGTGTAATGAAAATGAATATCTATTAATATATGTATTATTTATAATTCCTATATCCTGTGACTGTCTCACTTAATCCTTTGTTTTCTGACTAATTAGGCAAGGCTATGTGATTACAAGGCTTTATCTCAGGGGCCAACTAGGCAGCCAACCTAAGCAAGATCCCATGGGTTGTGTGTTTATTTCACTTGATGATACAATGAACACTTATAAGTGAAGTGATACTATCCAGTTACTGCCGGTTTGAAAATATGCCTGCAATCTGAGCCAGTGCTTTAATGGCATGTCAGACAGAACTTGAATGTGTCAGGTGACCCTGATGAAAACATAGCATCTCAGGAGATTTCATGCCTGGTGCTTCCAAATATTGTTGACAACTGTGACTGTACCCAAATGGAAAGTAACTCATTTGTTAAAATTATCAATATCTAATATATATGAATAAAGTGTAAGTTCACAACT >gi| |ref|NM_ | Bos taurus interferon, gamma or immune type [interferon gamma type 2] (IFNG), mRNA ATTAGAAAAGAAAGATCAGCTACCTCCTTGGGACCTGATCATAACACAGGAGCTACCGATTTCAACTACTCCGGCCTAACTCTCTCCTAAACAATGAAATATACAAGCTATTTCTTAGCTTTACTGCTCTGTGGGCTTTTGGGTTTTTCTGGTTCTTATGGCCAGGGCCAATTTTTTAGAGAAATAGAAAACTTAAAGGAGTATTTTAATGCAAGTAGCCCAGATGTAGCTAAGGGTGGGCCTCTCTTCTCAGAAATTTTGAAGAATTGGAAAGATGAAA INFG_refseq.txt

Human Chr12 sub-sequence ► Extracted from USCS ‘Golden Path’ website ► chr12:66,589,493-67,085,092 ~ ½ Mb  does contain INFG gene ► Repeats masked to lower case >hg16_dna range=chr12: 'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=lower CATTCATTACTTTTATAAGGTTTCTCTCTGGTATGCATCTGACTTACATC ATGGGAAAGCTAGTTTCATGACTCCTTTGGAATAGTTGTGGTCCTGAATA TGGAAAATCAATTAATGAATAGCTTAAAGCACAATAGTCAACAAATAGAT GTGAAAATTCTTTGTGAACTTTAAAGTCTTACTTAAACGTGAGATATTAT ATACAGTGTTTTATGTtagactgtgagcttgttaaagaaagaactatgcc ttctttttctttctaccagttccagtgcctcgtacaacatagaaaccata agtgtttttgaaagagcaaatGAATATTGGAAGGAGTAAGGTGATAGCTA AAGCTAAAACAATGTTTAGGGAGAACAACTGAAACAAAAGCAGCATTTGT GTCTTAAACTCATGGCCTCTGAAACAGCCTTGATAGATAGTAGAGAGGGT CAGATAGAGAGAGCCTGACTCAGAGATTGGGAAGCCCTATATGGTTGGAA GAGAAAGTAAGAGGAGACCCAAAGTATTAGACCACAGAAAGAAGTTCTAA TAGTCAGTGTCAAGAGATTCAGCAGGAGGTTGTGTATCAGGATTTGGGTT TGGGAGTGGTATGGAGCTTACCTATCTCTAAAACGAGCAGGAGGGCAAAA ATGAATCCCAGTCCCAAAGAATTCACTAATGGCCAGCAAACCAACACAGG AACCCCAGCACAGACACACAAGATAGGAAACCAGTTGTTGAAACTACAAT GTAACGGGGCTGATTTAATAAAAACCTGTTACATGAGTTATAGGtttttt ttttttttttttttttttAATGTATGTGCCCCACCTTAGGAAAGCCAGAA ATAATGGCAACGAAGAAATATTCATTCACAGTGAGAAAGCCATTAGAACG TTGGCTGGAACCTAGGGGCATATCGAGGGCCCACGTGGGAAGGACAATGA CAACTTGTTTAGTCCTCACTGGTTTCCCAGTCTGTGGATCTTATTTGAAT hs_chr12_subseq.txt

Human INFG gene

 From USCS ‘Golden Path website’ genome browser

INFG against ~1/2 Mb region of Chr 12

Assessment ► Submit  for either Part 1 or Part 2 the BLAST output, concatenated into one file and annotated  a short summary / discussion of the concepts covered in this practical (< 500 words)

References ► Strongly recommend BLAST tutorial on NCBI site  Altschul-1.html Altschul-1.html Altschul-1.html ► Further “Bioinformatics for quantitative geneticists course notes” J. McEwan  aabc_materials2004.htm#ModuleC aabc_materials2004.htm#ModuleC aabc_materials2004.htm#ModuleC