© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458 Chapter 6 The Computational Foundations of Genomics Applying.

Presentation transcript:

Chapter 6
The Computational Foundations of Genomics
Applying algorithms to analyze genomics data

Contents
- Sequence alignment
- Gene prediction
- Algorithms for analysis of phylogeny
- Analysis of microarray data

Computational Biology and Bioinformatics
- Computational biology
  - Development of computational methods to solve problems in biology
- Bioinformatics
  - Application of computational biology to analysis and management of real data
- Why do biologists need computer science?
  - Discrete nature of sequence data is ideal for analysis using digital computers
  - Size and complexity of genomics data make the data impossible to analyze without computers

Algorithmic problems
- Example: searching for a number in an unordered list
  - If the list has N numbers, the average search time is proportional to N
- A more clever approach
  - Place the numbers in order
  - Do a binary search
    - Step 1: Pick the number in the middle of the list
    - Step 2: Restrict the search to the half that contains your number
    - Return to Step 1 until you find your number
  - Search time for this approach is proportional to log2 N
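
The binary search steps on this slide can be sketched in Python as follows. This is a minimal illustration, not code from the original deck; the function name `binary_search`, the step counter, and the sample numbers are assumptions made for the example.

```python
def binary_search(sorted_numbers, target):
    """Return (index, steps) for target in a sorted list, or (None, steps)."""
    low, high = 0, len(sorted_numbers) - 1
    steps = 0
    while low <= high:
        steps += 1
        middle = (low + high) // 2          # Step 1: pick the middle element
        if sorted_numbers[middle] == target:
            return middle, steps
        if sorted_numbers[middle] < target:  # Step 2: keep the half that can hold target
            low = middle + 1
        else:
            high = middle - 1
    return None, steps


# Illustrative data: the list must be placed in order first.
numbers = sorted([42, 7, 19, 3, 88, 61, 25, 14])
index, steps = binary_search(numbers, 61)
print(index, steps)
```

With 8 numbers the loop runs at most 4 times, whereas a scan of the unordered list may have to inspect all 8.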

The digital computer
- Represents everything in a code of zeros and ones
- Computer architecture
  - CPU
  - Memory
  - Input / Output
- Advantages of digital computer
  - Deterministic
  - Minimization of noise
[Figure: block diagram of input, CPU, memory, and output]

Sequence databases
- What is a database?
  - An indexed set of records
  - Records retrieved using a query language
  - Database technology is well established
- Examples of sequence databases
  - GenBank
    - Encompasses all publicly available protein and nucleotide sequences
  - Protein Data Bank
    - Contains 3-D structures of proteins

The client-server model
- The clients and servers are software processes
- Clients request data from servers
- Servers and clients can reside on the same or different machines
- Clients can act as servers to other processes and vice versa
[Figure: web browser, web server, BLAST search engine, and database]

Sequence alignment
- Sequence alignments search for matches between sequences
- Two broad classes of sequence alignments
  - Global
  - Local
- Alignment can be performed between two or more sequences
[Figure: the sequences QKESGPSSSYC and VQQESGLVRTTC, shown as a global alignment and as a local alignment of the shared segment ESG]

The biological importance of sequence alignment
- Sequence alignments assess the degree of similarity between sequences
- Similar sequences suggest similar function
  - Proteins with similar sequences are likely to play similar biochemical roles
  - Regulatory DNA sequences that are similar will likely have similar roles in gene regulation
- Sequence similarity suggests evolutionary history
  - Fewer differences mean more recent divergence

The algorithmic problem of aligning sequences
- Comparison of similar sequences of similar length is straightforward
- How does one deal with insertions and gaps that may hide true similarity?
- How does one interpret minimal similarity?
  - Are sequences actually related?
  - Is alignment by chance?
[Figure: example sequence pairs QQESGPVRSTC / QKGSYQEKGYC, QQESGPVRSTC / RQQEPVRSTC, and QQESGPVRSTC / QKESGPSRSYC]

Methods of sequence alignment
- Graphical methods
- Dynamic-programming methods
- Heuristic methods

Dot matrix analysis
- A graphical method
- Shows all possible alignments
- Caveats
  - Some guesswork in picking parameters
    - Window size
    - Stringency
  - Not as rigorous or quantitative as other methods
[Figure: dot matrix comparing QQESGPVRSTC with RQQEPVRSTC]
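
A dot matrix with the two parameters named on this slide might be computed as in the sketch below. It assumes the usual reading of the parameters: the window size is the number of consecutive positions compared, and the stringency is the minimum number of matches inside a window needed to draw a dot. The function name `dot_matrix` and the chosen parameter values are illustrative, not taken from the deck.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) window start positions that receive a dot."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical residues inside the window starting at (i, j).
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= stringency:
                dots.add((i, j))
    return dots


seq_a = "QQESGPVRSTC"
seq_b = "RQQEPVRSTC"
window = 3
dots = dot_matrix(seq_a, seq_b, window=window, stringency=3)

# Text rendering: rows follow seq_a, columns follow seq_b.
for i in range(len(seq_a) - window + 1):
    print("".join("*" if (i, j) in dots else "."
                  for j in range(len(seq_b) - window + 1)))
```

Lowering the stringency or the window size adds more dots (more noise); raising them keeps only the longer diagonal runs.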

Dot matrix analysis: a real example
[Figure: two dot plots, one with window size 23 and stringency 15, the other with window size 1 and stringency 1]

Devising a scoring system
- Scoring matrices allow biologists to quantify the quality of sequence alignments
- Use different scoring matrices for different purposes
  - Score for similar structural domains in proteins
  - Score for evolutionary relationship
- Some popular scoring matrices
  - PAM for evolutionary studies
  - BLOSUM for finding common motifs

An example of scoring
[Figure: excerpt of the BLOSUM62 matrix for residues A, R, N, D, C, Q, E, next to a scored sequence comparison]
- Pair scores legible in the figure: A/A 4, D/Q 0, D/E 2, R/R 5, Q/Q 5, C/E -4, R/Q 1
- Total score: 18
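
Scoring an ungapped comparison with a substitution matrix amounts to summing one matrix entry per aligned pair. The sketch below uses only the BLOSUM62 entries that are legible on this slide; the two short sequences are made up for illustration and are not the comparison shown in the figure, and the names `BLOSUM62_SUBSET`, `pair_score`, and `alignment_score` are assumptions.

```python
# BLOSUM62 entries quoted on the slide (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("D", "Q"): 0, ("D", "E"): 2, ("R", "R"): 5,
    ("Q", "Q"): 5, ("C", "E"): -4, ("Q", "R"): 1,
}


def pair_score(a, b):
    """Look up a residue pair in either order; KeyError if it is not in the subset."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]


def alignment_score(seq_1, seq_2):
    """Sum of pair scores for two equal-length, ungapped sequences."""
    if len(seq_1) != len(seq_2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq_1, seq_2))


# Hypothetical sequences: A/A + D/Q + D/E + R/R + Q/Q = 4 + 0 + 2 + 5 + 5
score = alignment_score("ADDRQ", "AQERQ")
print(score)
```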

Heuristic methods with k-tuples
- Example: BLAST
  - Using query sequence, derive a list of words of length w (e.g., 3)
  - Keep high-scoring words
  - High-scoring words are compared with database sequences
  - Sequences with many matches to high-scoring words are used for final alignments
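
The word-based idea on this slide can be illustrated with a much simplified sketch. It only looks for exact word matches, whereas BLAST also keeps similar words that score above a threshold under a scoring matrix and then extends the hits; that part is omitted here. The function names and the two-entry "database" are hypothetical.

```python
def words(sequence, w=3):
    """All overlapping words (k-tuples) of length w in a sequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos, word) for every exact word match."""
    index = {}
    for q_pos, word in enumerate(words(query, w)):
        index.setdefault(word, []).append(q_pos)
    hits = []
    for s_pos, word in enumerate(words(subject, w)):
        for q_pos in index.get(word, []):
            hits.append((q_pos, s_pos, word))
    return hits


# Hypothetical database built from sequences that appear in earlier slides.
database = {"seq1": "RQQEPVRSTC", "seq2": "QKGSYQEKGYC"}
query = "QQESGPVRSTC"

# Rank database sequences by number of word hits; the best candidates
# would then go on to a full alignment.
ranked = sorted(database,
                key=lambda name: len(word_hits(query, database[name])),
                reverse=True)
print(ranked)
```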

Statistical significance
- Chance alignments have no biological significance
- Statistical significance implies low probability of generating a chance alignment
- Probability of long alignments increases with longer sequences
- The extreme-value distribution
  - Used to calculate the probability of chance alignment
  - Generated by calculating the scores resulting from repeatedly scrambling one of the sequences being compared
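
The scrambling procedure described above can be sketched as follows. As simplifying assumptions, the score is just the best count of identical residues over all ungapped offsets (not a matrix-based alignment score), and the probability is estimated directly from the shuffled scores rather than by fitting an extreme-value distribution. All names and the number of trials are illustrative.

```python
import random


def best_ungapped_score(seq_a, seq_b):
    """Highest number of identical residues over all ungapped offsets."""
    best = 0
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        matches = 0
        for j, residue in enumerate(seq_b):
            i = offset + j
            if 0 <= i < len(seq_a) and seq_a[i] == residue:
                matches += 1
        best = max(best, matches)
    return best


def shuffled_scores(seq_a, seq_b, trials=200, seed=0):
    """Scores obtained after repeatedly scrambling seq_b."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    residues = list(seq_b)
    scores = []
    for _ in range(trials):
        rng.shuffle(residues)
        scores.append(best_ungapped_score(seq_a, "".join(residues)))
    return scores


observed = best_ungapped_score("QQESGPVRSTC", "RQQEPVRSTC")
null_scores = shuffled_scores("QQESGPVRSTC", "RQQEPVRSTC")

# Empirical chance probability: fraction of scrambled scores at least as high.
p_value = (sum(s >= observed for s in null_scores) + 1) / (len(null_scores) + 1)
print(observed, p_value)
```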

A practical example of sequence alignment

BLAST results

Detailed BLAST results

A pairwise alignment with MASH-1
- HASH-2, a human homolog of MASH-1
- “+” indicates conservative amino acid substitution
- “–” indicates gap/insertion
- XXXX… shows areas of low complexity

Phylogenetic analysis
- Phylogenetic trees
  - Describe evolutionary relationships between sequences
- Three common methods
  - Maximum parsimony
  - Distance
  - Maximum likelihood

Gene prediction
- A problem of pattern recognition
- Algorithms look for features of genes:
  - E.g., splice sites, ORFs, starting methionine
- Identification of regulatory regions is difficult
- Statistical understanding of genes is ongoing
- Problems of this type require machine learning algorithms
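
One of the features listed above, an open reading frame (ORF) that begins with a start codon (the starting methionine, ATG) and ends at a stop codon, can be found with a simple scan. This is only a toy sketch: it checks the forward strand alone and ignores splice sites and all statistical modeling. The function name, the `min_codons` parameter, and the example DNA string are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    min_codons counts the codons before the stop codon, including the ATG.
    The slice dna[start:end] includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs


example = "CCATGGCTGAAAAATAGGC"  # made-up sequence
orfs = find_orfs(example)
print(orfs)
```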

Analysis of microarray data
- Microarrays can measure the expression of thousands of genes simultaneously
- Vast amounts of data require computers
- Types of analysis
  - Gene-by-gene
    - Method: Statistical techniques
  - Categorizing groups of genes
    - Method: Clustering algorithms
  - Deducing patterns of gene regulation
    - Method: Under development

Normalization of Microarray Data
- To make arrays comparable:
  - Assume the total intensity from one RNA pool is the same as that from another (e.g., growth-arrested cells vs. dividing cells).
  - Take the median value of all the spot intensities and subtract it from each spot’s own intensity.
- This is known as global normalization.
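
The median-subtraction step might look like the sketch below. It assumes that intensities are converted to log2 values before the median is subtracted, which is suggested by the later "Log Normalized Data" slide but is not stated on this one. The function name and the intensity values are invented for illustration.

```python
import math
from statistics import median


def global_normalize(intensities):
    """Log2-transform spot intensities and subtract the array's median."""
    log_values = [math.log2(x) for x in intensities]
    center = median(log_values)
    return [value - center for value in log_values]


# Two hypothetical arrays: array_b is uniformly twice as bright as array_a.
array_a = [64, 128, 256, 512, 1024]
array_b = [128, 256, 512, 1024, 2048]

normalized_a = global_normalize(array_a)
normalized_b = global_normalize(array_b)
print(normalized_a)
print(normalized_b)
```

After normalization the overall brightness difference between the two arrays disappears, so the remaining differences can be compared spot by spot.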

Example Data

Log Normalized Data to total median intensity (Log2Ratio normalized)

Differentially Expressed Genes (DEGs)
- The difference between two groups of samples (tumor arrays vs. healthy arrays, or arrays from growth-arrested cells vs. arrays from asynchronously dividing cells) can be estimated, and the genes whose mRNA expression differs significantly can be determined statistically.

Log2 Ratio
- 1/2 = 0.5
- 2/1 = 2
- log2(1/2) = -1
- log2(1) - log2(2) = -1
- log2(2/1) = 1
- log2(2) - log2(1) = 1

Average(Arrest)-Average(Control)
- Which genes are upregulated with respect to control in the arrest phenotype?
- Which genes are downregulated with respect to control in the arrest phenotype?

Are these FoldChanges Significant?
- Very basic statistics: t-test between two groups

How to calculate a log2Ratio in Excel?
- Type in =AVERAGE(I2:K2)-AVERAGE(L2:N2) for FSTL1
- Drag the cell from the bottom right corner down to fill in for the other rows.

How to calculate FoldChange in Excel?
- Raise 2 to the power of the value in the Log2Ratio column (=2^O2 for the FSTL1 gene)
- Drag the cell from the bottom right corner down to fill in for the other rows.
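
The same two spreadsheet calculations (a difference of averaged log2 values, then 2 raised to that difference) can be written in Python. The expression values below are invented, and the group names `arrest` and `control` are illustrative; the slides do not say which spreadsheet columns hold which group.

```python
from statistics import mean


def log2_ratio(group_a, group_b):
    """Difference of mean log2 expression values: mean(group_a) - mean(group_b)."""
    return mean(group_a) - mean(group_b)


def fold_change(ratio):
    """Convert a log2 ratio back to a linear fold change."""
    return 2 ** ratio


# Hypothetical log2-normalized values for one gene, three arrays per group.
arrest = [9.0, 9.5, 10.0]
control = [8.0, 8.5, 9.0]

ratio = log2_ratio(arrest, control)
print(ratio, fold_change(ratio))
```

A log2 ratio of 1 corresponds to a twofold increase and a log2 ratio of -1 to a twofold decrease, which is why up- and downregulation appear symmetric on the log scale.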

How to do a t-test in Excel?
- Use the TTEST function from the statistical function library:
- Type in =TTEST(I2:K2,L2:N2,2,2) for the following data (the third argument, 2, requests a two-tailed test; the fourth, 2, a two-sample equal-variance test):
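
For readers working outside a spreadsheet, the statistic behind a two-sample equal-variance t-test can be computed with the Python standard library, as sketched below. The function name and data are hypothetical, and the sketch stops at the t statistic and degrees of freedom; turning these into a p-value needs the t distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Return (t, degrees_of_freedom) for a two-sample equal-variance t-test."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled estimate of the common variance (sample variances, n - 1 denominator).
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


# Hypothetical log2 expression values for one gene.
arrest = [9.0, 9.5, 10.0]
control = [8.0, 8.5, 9.0]

t, df = pooled_t_statistic(arrest, control)
print(t, df)
```

If SciPy is available, `scipy.stats.ttest_ind(arrest, control)` performs the equal-variance test by default and also returns the two-sided p-value.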

Metrics for Gene Expression
- Euclidean distance

Calculation of Euclidean distance
- Calculate the Euclidean distance between FSTL1 and AACS

Calculation of Euclidean distance
- The larger the Euclidean distance between two expression profiles, the more different they are from each other
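
The Euclidean distance between two expression profiles is the square root of the summed squared differences across conditions. A minimal sketch follows; the profile values are made up and are not the FSTL1 and AACS data from the slides.

```python
from math import sqrt


def euclidean_distance(profile_p, profile_q):
    """Euclidean distance between two equal-length expression profiles."""
    if len(profile_p) != len(profile_q):
        raise ValueError("profiles must have the same number of conditions")
    return sqrt(sum((p - q) ** 2 for p, q in zip(profile_p, profile_q)))


# Hypothetical profiles measured across three conditions.
gene_1 = [1.0, 2.0, 3.0]
gene_2 = [4.0, 6.0, 3.0]

distance = euclidean_distance(gene_1, gene_2)
print(distance)
```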

Correlation Coefficient

Plot of Genes Across Conditions

Plot of Highly Significant Genes Across Conditions

Plot of Highly Significant Gene Clusters Across Conditions

Metrics for gene expression
- Need a method to measure how similar genes are based on expression
- Examples
  - Euclidean distance
  - Pearson correlation coefficient
[Figure panels: Euclidean distance; Pearson correlation coefficient]
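
The Pearson correlation coefficient compares the shape of two profiles rather than their magnitude. The sketch below computes it from its definition; the function name and the profiles are illustrative assumptions.

```python
from math import sqrt
from statistics import mean


def pearson_correlation(profile_p, profile_q):
    """Pearson correlation of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    mean_p, mean_q = mean(profile_p), mean(profile_q)
    covariance = sum((p - mean_p) * (q - mean_q)
                     for p, q in zip(profile_p, profile_q))
    spread_p = sqrt(sum((p - mean_p) ** 2 for p in profile_p))
    spread_q = sqrt(sum((q - mean_q) ** 2 for q in profile_q))
    return covariance / (spread_p * spread_q)


# Hypothetical profiles: same trend at twice the magnitude, and the reverse trend.
gene_1 = [1, 2, 3, 4]
gene_2 = [2, 4, 6, 8]
gene_3 = [8, 6, 4, 2]

same_trend = pearson_correlation(gene_1, gene_2)
opposite_trend = pearson_correlation(gene_1, gene_3)
print(same_trend, opposite_trend)
```

Here `gene_1` and `gene_2` correlate almost perfectly even though their Euclidean distance is not zero, which is why the choice of metric affects which genes end up grouped together.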

Unsupervised techniques
- Make no assumptions about how the data should behave
- Cluster genes based on similar patterns of gene expression
- Examples
  - Hierarchical clustering
  - Principal components analysis (PCA)
[Figure panels: Hierarchical clustering; PCA]
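
Hierarchical clustering can be illustrated with a small agglomerative sketch: start with one cluster per gene and repeatedly merge the two closest clusters. The choices made here (Euclidean distance, single linkage, stopping at a fixed number of clusters) are assumptions for the example, since the slide does not specify a variant, and the gene names and values are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def single_linkage(cluster_a, cluster_b, profiles):
    """Distance between the closest pair of members of two clusters."""
    return min(dist(profiles[a], profiles[b])
               for a in cluster_a for b in cluster_b)


def hierarchical_clustering(profiles, n_clusters):
    """Merge the closest clusters until n_clusters remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


# Hypothetical expression profiles across three conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [5.0, 1.0, 0.5],
    "geneD": [5.2, 0.9, 0.4],
}

clusters = hierarchical_clustering(profiles, 2)
print(clusters)
```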

Supervised techniques
- Divide groups of genes based on sample properties
- Can predict sample condition based on gene expression pattern
- Examples
  - Support vector machine
  - Nearest neighbor
[Figure panels: Nearest neighbor; Support vector machine]
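
A nearest-neighbor classifier predicts the condition of a new sample from the labeled sample whose expression profile is closest to it. The sketch below uses a single neighbor and Euclidean distance; both choices, the labels, and the data are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def nearest_neighbor(training, sample):
    """Return the label of the training profile closest to sample.

    training is a list of (profile, label) pairs.
    """
    _, label = min(training, key=lambda item: dist(item[0], sample))
    return label


# Hypothetical labeled samples: expression of three genes per array.
training = [
    ([8.0, 2.0, 1.0], "arrested"),
    ([7.5, 2.2, 1.1], "arrested"),
    ([2.0, 6.0, 5.0], "dividing"),
    ([2.3, 5.8, 5.2], "dividing"),
]

prediction_1 = nearest_neighbor(training, [7.8, 2.1, 0.9])
prediction_2 = nearest_neighbor(training, [2.1, 6.1, 4.9])
print(prediction_1, prediction_2)
```

Unlike the unsupervised methods, this approach needs samples whose condition is already known.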

Summary
- Vast amounts of data require bioinformatics
- Bioinformatics methods are limited by the following:
  - Algorithmic complexity of bioinformatics problems
  - Computer hardware performance
- Heuristic methods are used to get around these limitations
- Bioinformatics methods are used in the following areas:
  - Sequence alignment
  - Phylogenetic-tree construction
  - Gene prediction
  - Secondary-structure determination
  - Analysis of microarray data
  - Simulation of biological systems