Muhammad Awais PhD Biochemistry 08-ARID-1103 Understanding Basic Local Alignment Search Tool.


DNA Sequencing
Bioinformatics is based on the fact that DNA sequencing is cheap, and is becoming easier and cheaper very quickly. The Human Genome Project cost roughly $3 billion and took 12 years. Sequencing James Watson’s genome in 2007 cost $2 million and took 2 months. Today, you could get your genome sequenced for about $100,000 and it would take a month.

Extracting information from the genome is difficult: it is hard to convert a string of ACGTs into knowledge of how the organism works. Most of the work is done on the computer, with key confirming experiments done in the “wet lab”. The sequence below contains a gene critical for life: the gene that initiates replication of the DNA. Can you spot it? We are now going to spend some time on what genes look like and how we can find them.
TTGGAAAACATTCATGATTTATGGGATAGAGCTTTAGATCAAATTGAAAAAAAATTAAGCAAACCTAGTTTTGAAACCTG
GCTCAAATCGACAAAAGCTCATGCTTTACAAGGAGACACGCTCATTATTACTGCACCTAATGATTTTGCACGGGACTGGT
TAGAATCTAGGTATTCTAATTTAATTGCTGAAACACTTTATGATCTTACGGGGGAAGAGTTAGATGTAAAATTTATTATT
CCTCCTAACCAGGCCGAGGAAGAATTCGATATTCAAACTCCTAAAAAGAAAGTCAATAAAGACGAAGGAGCAGAATTTCC
TCAAAGCATGCTAAATTCGAAGTATACCTTTGATACATTTGTTATCGGATCTGGAAATCGGTTTGCGCATGCAGCTTCTT
TAGCAGTAGCAGAAGCGCCGGCTAAAGCGTATAATCCGCTTTTTATTTACGGGGGAGTAGGATTAGGCAAAACACACTTA
ATGCACGCCATAGGCCACTATGTGTTAGATCATAATCCTGCCGCGAAAGTCGTGTACTTATCATCTGAAAAATTCACAAA
CGAGTTTATTAACTCTATTCGTGACAATAAAGCAGTAGAATTCCGCAACAAATACCGTAATGTAGATGTTTTACTGATTG
ATGATATTCAATTCTTAGCAGGTAAAGAGCAGACACAAGAAGAATTTTTCCATACGTTTAATACGCTTCACGAAGAAAGC
AAGCAGATTGTCATCTCAAGTGATCGACCGCCGAAAGAAATTCCTACACTTGAAGATCGACTTCGCTCTCGCTTTGAATG
GGGCCTTATTACAGACATCACACCACCAGATTTGGAAACACGAATTGCTATTTTGCGTAAAAAAGCCAAAGCGGACGGCT
TAGTTATTCCAAATGAAGTTATGCTTTATATCGCCAATCAGATTGATTCAAATATTAGAGAATTAGAAGGCGCACTTATT

Genes and Proteins
Most genes code for proteins: each gene contains the information necessary to make one protein. Proteins are the most important type of macromolecule.
– Structure: collagen in skin, keratin in hair, crystallin in the eye.
– Enzymes: all metabolic transformations (building up, rearranging, and breaking down organic compounds) are done by enzymes, which are proteins.
– Transport: oxygen in the blood is carried by hemoglobin; everything that goes into or out of a cell (except water and a few gases) is carried by proteins.

The Genetic Code
Proteins are long chains of amino acids. There are 20 different amino acids coded in DNA. There are only 4 DNA bases, so you need 3 DNA bases to code for the 20 amino acids: 2 bases would give only 4 x 4 = 16 combinations, which is too few.
– 4 x 4 x 4 = 64 possible 3-base combinations (codons)
– Each codon codes for one amino acid, except the stop codons
– Most amino acids have more than one possible codon
Genes start at a start codon and end at a stop codon. 3 codons are stop codons: all genes end at a stop codon. Start codons are a bit trickier, since they are used in the middle of genes as well as at the beginning.
– In eukaryotes, ATG is nearly always the start codon.
– In prokaryotes, ATG, GTG, or TTG can be used as a start codon.
In bioinformatics, we generally ignore the fact that RNA uses the base uracil (U) in place of T.
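The codon arithmetic above can be checked with a short Python sketch. It builds the standard genetic code table from the usual T, C, A, G ordering and translates a DNA string. The names CODON_TABLE and translate, and the 9-base test sequence, are made up for this illustration and are not from the slides.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon, starting at the first base.

    A trailing incomplete codon (1 or 2 leftover bases) is ignored.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)
    )


stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
print(len(CODON_TABLE))        # number of codons
print(stop_codons)             # the three stop codons
print(translate("ATGGCCTGA"))  # hypothetical 3-codon example
```

Note that this table always reads GTG as V and TTG as L; treating them as start codons in prokaryotes is a separate step that the sketch does not attempt.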

Reading Frames
Since codons consist of 3 bases, there are 3 “reading frames” possible on an RNA (or on one strand of DNA), depending on whether you start reading from the first base, the second base, or the third base.
– The different reading frames give entirely different proteins.
Each gene uses a single reading frame, so once the ribosome gets started, it just has to count off groups of 3 bases to produce the proper protein.
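A minimal Python sketch of the three reading frames: the same string is cut into codons starting at the first, second, or third base. The function name reading_frames and the short test sequence are hypothetical, chosen only for this example.

```python
def reading_frames(seq):
    """Return the codons of the three reading frames of one strand.

    Frame 0 starts at the first base, frame 1 at the second,
    frame 2 at the third. Incomplete trailing codons are dropped.
    """
    frames = []
    for offset in range(3):
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames


for offset, codons in enumerate(reading_frames("ATGGCCTGA")):
    print(offset, codons)
```

The three codon lists share no codons, which is why each frame would give an entirely different protein.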

Open Reading Frames
Ribosomes are very obedient to stop codons: when a stop codon is reached, the protein is finished. Thus, all genes end at the first stop codon in their reading frame. Since 3 out of the 64 codons are stop codons, random DNA has stop codons very frequently (about once every 21 codons on average, if all codons are equally likely).
– Open reading frames (ORFs) are regions with no stop codons. All genes reside in long open reading frames.
– Note that stop codons in other reading frames have no effect on the gene.
The start codon must occur “upstream” in the same reading frame as the stop codon. It is usually near the beginning of the ORF, but it is not necessarily the first possible start codon.
– Determining the exact start codon is not easy or obvious.
– But the first start codon in an open reading frame is always a reasonable guess.
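The ORF rules can be sketched in Python as a simple scan of one strand. This is an illustration only, not a real gene finder: it takes the first start codon in each open stretch as the guess, uses the prokaryotic start codons listed earlier, and ignores the reverse strand. The name find_orfs, the min_codons parameter, and the test sequence are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # prokaryotic starts, as listed above


def find_orfs(seq, min_codons=2):
    """Find ORFs on one strand as (frame, start, end) tuples.

    Each ORF runs from the first start codon after the previous in-frame
    stop up to and including the next in-frame stop, so seq[start:end]
    is the candidate gene. min_codons counts codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # the next ORF must begin after this stop
            elif start is None and codon in START_CODONS:
                start = i
    return orfs


example = "CCATGAAACCCTAGTT"  # hypothetical sequence
for frame, start, end in find_orfs(example):
    print(frame, example[start:end])
```

In the example, frame 0 contains a TGA stop at position 3, yet the ORF in frame 2 is unaffected, which matches the note that stop codons in other reading frames have no effect on the gene.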

BLAST (Basic Local Alignment Search Tool)
The BLAST programs are a set of sequence comparison algorithms, introduced in 1990, that are used to search sequence databases for optimal local alignments to a query. BLAST itself is a piece of software that can be run on almost any computer, but the database needed for a good cross-species comparison is quite large.
– The database is called “nr” for “non-redundant”, and it contains at least 20 Gb of sequence data.
Terminology: your sequence, which you paste into the box on the web site, is the query sequence. Sequences in the database that match yours are called subject sequences.

Global Alignment
– Compares the total length of two sequences
Local Alignment
– Compares segments of sequences
– Finds cases where one sequence is part of another sequence, or where they match only in parts
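The difference between the two kinds of alignment can be shown with a small Python sketch that scores both by exhaustive dynamic programming. This is not how BLAST searches a database (BLAST uses faster heuristics); it only illustrates the two definitions. The scoring values (match +1, mismatch -1, gap -1), the function name alignment_score, and the example sequences are assumptions made for this illustration.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b with a linear gap penalty.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of segments.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


short, long_ = "ACGT", "TTTTACGTTTTT"  # the short one is part of the long one
print(alignment_score(short, long_, local=True))
print(alignment_score(short, long_, local=False))
```

With these example sequences the local score is 4 (the four matching bases) while the global score is -4, because the global alignment has to pay for eight gap positions.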

BLAST terminology
– query sequence
– blast
– target database (GenBank/SwissProt)
– output sequence list: hits/subjects
– information about the input query sequence, e.g., function
The aim of a database (BLAST) search is to discover sequence homology on the basis of sequence similarity. BLAST returns similar sequences, which are not necessarily biologically related sequences.

blastx
– Search a protein database using a translated nucleotide query
– Use to find homologous proteins to a nucleotide coding region
– Translates the query sequence in all six reading frames
– Often the first analysis performed with a newly determined nucleotide sequence
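A six-frame translation can be sketched in Python as follows: three frames on the given strand plus three on its reverse complement. This only illustrates the idea and is not the code BLAST uses. The function names, the frame labels "+1" to "-3", and the test sequence are assumptions for this example.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at the given offset."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate both strands in all three frames each."""
    dna = dna.upper()
    strands = {"+": dna, "-": reverse_complement(dna)}
    return {
        f"{sign}{offset + 1}": translate(strand, offset)
        for sign, strand in strands.items()
        for offset in range(3)
    }


for frame, protein in six_frame_translation("ATGGCCTGA").items():
    print(frame, protein)
```

Each of the six protein strings would then be compared against the protein database, which is why translated searches cost more computer time than a plain protein search.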

tblastn
– Search a translated nucleotide database using a protein query
– Does six-frame translations of the nucleotide database
– Find homologous protein coding regions

tblastx
– Search a translated nucleotide database using a translated nucleotide query
– Both translations use all six frames
– Useful in identifying potential proteins
– Good tool for identifying novel genes
– Computationally intensive

BLASTing a sequence at NCBI – enter accession

BLASTing a sequence at NCBI – enter sequence

BLAST Scores
Results are arranged with the best ones on top. The most important score is the Expect value, or E-value, which can be defined as the number of hits that a random sequence (with the same length as yours) would be expected to have in the database.
– E-values for good hits are usually written something like 3e-42, which is the same as 3 x 10^-42, a very small number.
– Bad hits are very common, and they have E-values in a more familiar form: for example, 1.2.
In this case we see many hits with good E-values, and the top E-values are all quite similar. Before we can conclude that our protein is a homologue of the proteins BLAST matches it with, we would like them to have roughly the same length and a high percentage of identical amino acids.
– The lengths of the query and subject sequences should be within 20% of each other.
– There should be at least 30% identical amino acids.
– In this case we can be quite sure we have a good match.
BLAST also returns a fourth value, the bit score, which we are going to ignore.
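The rules of thumb above can be written as a small Python check. This is a sketch under stated assumptions: the slides do not give an E-value cutoff, so max_e_value=1e-5 is an illustrative choice, and "within 20% of each other" is interpreted here as a length difference of at most 20% of the longer sequence. The function name and all the numbers in the usage lines are hypothetical, not real BLAST output.

```python
def passes_rule_of_thumb(query_len, subject_len, identical, aligned_len,
                         e_value, max_e_value=1e-5):
    """Apply the three checks from the slide to one BLAST hit.

    max_e_value is an assumed cutoff, not a value from the slides.
    """
    # Length check: difference at most 20% of the longer sequence (assumption).
    length_ok = abs(query_len - subject_len) <= 0.20 * max(query_len, subject_len)
    # Identity check: at least 30% identical amino acids in the alignment.
    identity_ok = identical / aligned_len >= 0.30
    return e_value <= max_e_value and length_ok and identity_ok


# "3e-42" is ordinary scientific notation that Python reads directly.
good_e = float("3e-42")
print(good_e < 1.2)

# Hypothetical hits for a 174-residue query.
print(passes_rule_of_thumb(174, 180, 80, 170, good_e))
print(passes_rule_of_thumb(174, 400, 80, 170, good_e))
print(passes_rule_of_thumb(174, 180, 80, 170, 1.2))
```

Only the first hypothetical hit passes: the second fails the length check and the third has a poor E-value.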

A Sequence to BLAST
This is a more-or-less randomly chosen gene from Bacillus megaterium.
– It is 174 amino acids long.
It is written in “fasta” format: the first line starts with > and is immediately followed by an identifier (ORF00135), and then some miscellaneous comments. After that the sequence is written without spaces or other marks.
>ORF00135 |chromosome
MKAKLIQYVYDAECRLFKS
VNQHFDRKHLNRFLRLLTH
AGGATFTIVIACLLLFLYPSS
VAYACAFSLAVSHIPVAIAK
KLYPRKRPYIQLKHTKVLE
NPLKDHSFPSGHTTAIFSLVT
PLMIVYPAFAAVLLPLAVMV
GISRIYLGLHYPTDVMVGLI
LGIFSGAVALNIFLT
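The fasta layout described above is simple enough to read with a few lines of Python. The sketch below parses the ORF00135 record from this slide and confirms its length. The function name parse_fasta and the dictionary layout are choices made for this example; real projects often use an existing library for this instead.

```python
def parse_fasta(text):
    """Parse fasta text into {identifier: {"description", "sequence"}}.

    The identifier is everything between '>' and the first space;
    the rest of the header line is kept as the description.
    """
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            identifier, _, description = line[1:].partition(" ")
            records[identifier] = {
                "description": description.strip(),
                "sequence": "",
            }
        elif identifier is not None:
            # Sequence lines are joined; stray spaces are removed.
            records[identifier]["sequence"] += line.replace(" ", "")
    return records


FASTA_TEXT = """>ORF00135 |chromosome
MKAKLIQYVYDAECRLFKS
VNQHFDRKHLNRFLRLLTH
AGGATFTIVIACLLLFLYPSS
VAYACAFSLAVSHIPVAIAK
KLYPRKRPYIQLKHTKVLE
NPLKDHSFPSGHTTAIFSLVT
PLMIVYPAFAAVLLPLAVMV
GISRIYLGLHYPTDVMVGLI
LGIFSGAVALNIFLT
"""

record = parse_fasta(FASTA_TEXT)["ORF00135"]
print(record["description"])
print(len(record["sequence"]))  # 174 amino acids, as stated on the slide
```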

Protein Blast

Tblastn (Protein to Nucleotide)

Gene Names
Most genes are named for the function of their protein.
– At some point, some related genes had their function determined through lab work: by examining the effects of mutations in the gene, by isolating and studying the protein produced by the gene, etc.
– Functions include enzymes (names end in -ase), transport across the cell membrane, genetic information processing (DNA->RNA->protein), structural proteins, sporulation and germination, and more!
Many genes (maybe 1/4 of them in a typical genome) have no known function, although they are found in several different species: these are conserved hypothetical genes. Every new genome has some genes that are unique: no matching BLAST hits in the database.
– Are they real genes? Sometimes there is evidence in the form of messenger RNA, but usually we don’t know. Call them hypothetical genes.
“Putative” means that we think we know the gene’s function but we aren’t sure. Putative should be followed by the function name.

Summary
1. DNA can be read in 3 different reading frames on each strand, a consequence of the genetic code (3 bases = 1 amino acid).
2. Genes are found in long open reading frames, areas where there are no stop codons.
3. BLAST is the tool we use to compare sequences between species.
– BLAST scores (E-values) describe the number of hits a random sequence would be expected to have in the database.
4. Gene sequences are conserved between species by natural selection.
– DNA sequences outside of genes are much less conserved.

Internet Links
blast.ncbi.nlm.nih.gov/ (Official)