1 Chapter 6: The Computational Foundations of Genomics
Applying algorithms to analyze genomics data
© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458

2 Contents
• Sequence alignment
• Gene prediction
• Algorithms for analysis of phylogeny
• Analysis of microarray data

3 Computational Biology and Bioinformatics
• Computational biology
  • Development of computational methods to solve problems in biology
• Bioinformatics
  • Application of computational biology to analysis and management of real data
• Why do biologists need computer science?
  • Discrete nature of sequence data is ideal for analysis using digital computers
  • Size and complexity of genomics data make the data impossible to analyze without computers

4 Algorithmic problems
• Example: searching for a number in an unordered list
  • If the list has N numbers, the average search time is proportional to N
• A more clever approach
  • Place the numbers in order
  • Do a binary search
    • Step 1: Pick the number in the middle of the list
    • Step 2: Restrict the search to the half that can contain your number
    • Return to Step 1 until you find your number
  • Time for this approach is proportional to log2(N)
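
The two search strategies on this slide can be sketched in Python as follows. This is a minimal illustration, not code from the chapter; the function names and the step counters are assumptions added to make the difference in cost visible.

```python
def linear_search(values, target):
    """Scan a list front to back; return (index, comparisons made)."""
    for comparisons, value in enumerate(values, start=1):
        if value == target:
            return comparisons - 1, comparisons
    return -1, len(values)


def binary_search(sorted_values, target):
    """Search a sorted list by halving; return (index, steps taken)."""
    lo, hi = 0, len(sorted_values) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2              # Step 1: pick the middle element
        if sorted_values[mid] == target:
            return mid, steps
        if sorted_values[mid] < target:   # Step 2: keep only the half that
            lo = mid + 1                  # can still contain the target
        else:
            hi = mid - 1
    return -1, steps
```

For a sorted list of 1000 numbers, the binary search needs at most 10 steps, while the linear scan may need up to 1000 comparisons.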

5 The digital computer
• Represents everything in a code of zeros and ones
• Computer architecture
  • CPU
  • Memory
  • Input / Output
• Advantages of digital computer
  • Deterministic
  • Minimization of noise
[Figure: block diagram of Input, CPU, Memory, and Output]

6 Sequence databases
• What is a database?
  • An indexed set of records
  • Records retrieved using a query language
  • Database technology is well established
• Examples of sequence databases
  • GenBank
    • Encompasses all publicly available protein and nucleotide sequences
  • Protein Data Bank
    • Contains 3-D structures of proteins

7 The client-server model
• The clients and servers are software processes
• Clients request data from servers
• Servers and clients can reside on the same or different machines
• Clients can act as servers to other processes and vice versa
[Figure: Web Browser, Web Server, BLAST Search Engine, and Database]

8 Sequence alignment
• Sequence alignments search for matches between sequences
• Two broad classes of sequence alignments
  • Global
  • Local
• Alignment can be performed between two or more sequences
[Figure: QKESGPSSSYC and VQQESGLVRTTC shown in a global alignment and in a local alignment of the shared segment ESG]

9 The biological importance of sequence alignment
• Sequence alignments assess the degree of similarity between sequences
• Similar sequences suggest similar function
  • Proteins with similar sequences are likely to play similar biochemical roles
  • Regulatory DNA sequences that are similar will likely have similar roles in gene regulation
• Sequence similarity suggests evolutionary history
  • Fewer differences mean more recent divergence

10 The algorithmic problem of aligning sequences
• Comparison of similar sequences of similar length is straightforward
• How does one deal with insertions and gaps that may hide true similarity?
• How does one interpret minimal similarity?
  • Are sequences actually related?
  • Is alignment by chance?
[Figure: example sequence pairs]
QQESGPVRSTC / QKGSYQEKGYC
QQESGPVRSTC / RQQEPVRSTC
QQESGPVRSTC / QKESGPSRSYC

11 Methods of sequence alignment
• Graphical methods
• Dynamic-programming methods
• Heuristic methods
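
As an illustration of the dynamic-programming approach, the sketch below computes a global alignment score in the style of the Needleman-Wunsch algorithm. The slides do not name a specific algorithm or scoring scheme, so the function name and the match, mismatch, and gap values here are assumptions chosen only for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty.

    The scoring values are illustrative assumptions, not from the slides.
    Only the previous row of the dynamic-programming table is kept.
    """
    # Row 0: aligning an empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] aligned against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Each table cell depends only on three neighbors, so the running time is proportional to the product of the two sequence lengths.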

12 Dot matrix analysis
• A graphical method
• Shows all possible alignments
• Caveats
  • Some guesswork in picking parameters
    • Window size
    • Stringency
  • Not as rigorous or quantitative as other methods
[Figure: dot matrix comparing QQESGPVRSTC with RQQEPVRSTC]
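
A dot matrix with the two parameters named on this slide can be sketched as below. This is a minimal illustration under the assumption that a dot is drawn wherever at least `stringency` of the `window` consecutive residues match; the function names are hypothetical.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return a grid of booleans: True where a window of seq_a starting at
    row i and a window of seq_b starting at column j share at least
    `stringency` identical positions."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= stringency)
        dots.append(row)
    return dots


def render(dots):
    """Draw the grid as text, '*' for a dot and '.' for no dot."""
    return "\n".join("".join("*" if d else "." for d in row) for row in dots)


if __name__ == "__main__":
    print(render(dot_matrix("QQESGPVRSTC", "RQQEPVRSTC", window=3, stringency=3)))
```

Raising the window size and stringency removes isolated chance matches and leaves the diagonal runs that mark aligned regions.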

13 Dot matrix analysis: a real example
[Figure: two dot plots of the same comparison, one with window size 23 and stringency 15, the other with window size 1 and stringency 1]

14 Devising a scoring system
• Scoring matrices allow biologists to quantify the quality of sequence alignments
• Use different scoring matrices for different purposes
  • Score for similar structural domains in proteins
  • Score for evolutionary relationship
• Some popular scoring matrices
  • PAM for evolutionary studies
  • BLOSUM for finding common motifs

15 An example of scoring
BLOSUM62 (excerpt)
      A   R   N   D   C   Q   E
  A   4  -1  -2  -2   0  -1  -1
  R  -1   5   0  -2  -3   1   0
  N  -2   0   6   1  -3   0   0
  D  -2  -2   1   6  -3   0   2
  C   0  -3  -3  -3   9  -3  -4
  Q  -1   1   0   0  -3   5   2
  E  -1   0   0   2  -4   2   5
[Figure: a sequence comparison scored column by column; legible pairs include A/A = 4, D/Q = 0, D/E = 2, R/R = 5, Q/Q = 5, C/E = -4, E/C = -4, R/Q = 1]
Total score: 18
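
Scoring an ungapped comparison with a substitution matrix amounts to summing one matrix entry per aligned column. The sketch below uses the BLOSUM62 entries for the seven residues shown on this slide; the function name and the example sequences are hypothetical and are not the comparison from the slide.

```python
RESIDUES = "ARNDCQE"

# BLOSUM62 scores for the residues A, R, N, D, C, Q, E (rows and columns
# in that order).
ROWS = [
    [4, -1, -2, -2, 0, -1, -1],
    [-1, 5, 0, -2, -3, 1, 0],
    [-2, 0, 6, 1, -3, 0, 0],
    [-2, -2, 1, 6, -3, 0, 2],
    [0, -3, -3, -3, 9, -3, -4],
    [-1, 1, 0, 0, -3, 5, 2],
    [-1, 0, 0, 2, -4, 2, 5],
]

BLOSUM62 = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(RESIDUES)
    for j, b in enumerate(RESIDUES)
}


def alignment_score(seq_a, seq_b):
    """Sum the substitution scores of an ungapped, equal-length alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(BLOSUM62[pair] for pair in zip(seq_a, seq_b))


if __name__ == "__main__":
    # Hypothetical example: A/A + D/E + R/R + Q/Q = 4 + 2 + 5 + 5
    print(alignment_score("ADRQ", "AERQ"))
```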

16 Heuristic methods with k-tuples
• Example: BLAST
  • Using the query sequence, derive a list of words of length w (e.g., 3)
  • Keep high-scoring words
  • High-scoring words are compared with database sequences
  • Sequences with many matches to high-scoring words are used for final alignments
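
The word-based idea can be sketched as follows. This is a deliberate simplification and not how BLAST is implemented: it keeps every query word and counts only exact matches, whereas BLAST scores words against a substitution matrix and then extends the hits. The function names and the tiny database are hypothetical.

```python
def words(sequence, w=3):
    """All overlapping words (k-tuples) of length w in the sequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def count_word_hits(query, database, w=3):
    """For each database sequence, count the words that also occur in the query.

    `database` maps a sequence name to its sequence. Sequences with many
    hits would be the candidates passed on to a full alignment.
    """
    query_words = set(words(query, w))
    return {
        name: sum(1 for word in words(seq, w) if word in query_words)
        for name, seq in database.items()
    }


if __name__ == "__main__":
    db = {"seq_a": "RQQEPVRSTC", "seq_b": "QKGSYQEKGYC"}  # hypothetical records
    print(count_word_hits("QQESGPVRSTC", db))
```

Because word lookup is a set membership test, the cost grows with the total database length rather than with the product of query and database lengths.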

17 Statistical significance
• Chance alignments have no biological significance
• Statistical significance implies low probability of generating a chance alignment
• Probability of long alignments increases with longer sequences
• The extreme-value distribution
  • Used to calculate the probability of chance alignment
  • Generated by calculating the scores resulting from repeatedly scrambling one of the sequences being compared
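
The scrambling procedure can be sketched as below. This minimal example only builds the empirical distribution of chance scores and reads a p-value from it; it does not fit an extreme-value distribution. The scoring function (a count of identical positions), the function names, and the number of trials are assumptions made for illustration.

```python
import random


def identity_score(a, b):
    """Toy alignment score: number of identical positions (no gaps)."""
    return sum(x == y for x, y in zip(a, b))


def shuffled_scores(seq_a, seq_b, score, trials=1000, seed=0):
    """Scores obtained by repeatedly scrambling seq_b and re-scoring it."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    letters = list(seq_b)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(score(seq_a, "".join(letters)))
    return scores


def empirical_p_value(observed, null_scores):
    """Fraction of chance scores at least as high as the observed score.

    One is added to numerator and denominator so the estimate is never zero.
    """
    hits = sum(s >= observed for s in null_scores)
    return (1 + hits) / (1 + len(null_scores))
```

A small p-value means that scrambled sequences rarely score as well as the real pair, which is the sense of "statistically significant" used on the slide.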

18 A practical example of sequence alignment

19 BLAST results

20 Detailed BLAST results

21 A pairwise alignment with MASH-1
• HASH-2, a human homolog of MASH-1
• “+” indicates conservative amino acid substitution
• “–” indicates gap/insertion
• XXXX… shows areas of low complexity

22 Phylogenetic analysis
• Phylogenetic trees
  • Describe evolutionary relationships between sequences
• Three common methods
  • Maximum parsimony
  • Distance
  • Maximum likelihood

23 Gene prediction
• A problem of pattern recognition
• Algorithms look for features of genes:
  • E.g., splice sites, ORFs, starting methionine
• Identification of regulatory regions is difficult
• Statistical understanding of genes is ongoing
• Problems of this type require machine learning algorithms
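
One of the features mentioned above, the open reading frame (ORF), can be found with a simple scan. The sketch below is a minimal illustration that assumes the standard start codon ATG and stop codons TAA, TAG, and TGA and searches only the forward strand; real gene predictors combine many more signals. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ORFs on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon;
    `end` is exclusive and includes the stop codon. `min_codons` counts
    the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs
```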

24 Analysis of microarray data
• Microarrays can measure the expression of thousands of genes simultaneously
• Vast amounts of data require computers
• Types of analysis
  • Gene-by-gene
    • Method: Statistical techniques
  • Categorizing groups of genes
    • Method: Clustering algorithms
  • Deducing patterns of gene regulation
    • Method: Under development

25 Normalization of Microarray Data
• To make arrays comparable:
  • Assume the total intensity from one RNA pool is the same as that from another (e.g., growth-arrested cells vs. dividing cells).
  • Take the median of all the spot intensities on an array and subtract it from each spot's own intensity.
• This is known as global normalization.
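
Global normalization as described here can be sketched in a few lines. The example assumes that raw spot intensities are first converted to log2 values and that the array's median log2 intensity is then subtracted from each spot; the function name and the sample intensities are hypothetical.

```python
import math
from statistics import median


def global_normalize(intensities):
    """Log2-transform raw spot intensities and center them on the array median."""
    logs = [math.log2(x) for x in intensities]
    center = median(logs)
    return [value - center for value in logs]


if __name__ == "__main__":
    # Hypothetical raw intensities from one array.
    print(global_normalize([100, 200, 400, 800, 1600]))
```

After this step every array has a median of zero, so values from different arrays can be compared directly.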

26 Example Data

27 Log-transformed data normalized to the total median intensity (Log2Ratio normalized)
10.128 - 7.7 = 2.428
6.5961 - 7.7 = -1.1039

28 Differentially Expressed Genes (DEGs)
• The difference between two groups of samples (arrays from tumor vs. healthy tissue, or arrays from growth-arrested cells vs. asynchronously dividing cells) can be estimated, and the genes whose mRNA expression differs significantly can be identified statistically.

29 Log2 Ratio
• 1/2 = 0.5
• 2/1 = 2
• log2(1/2) = -1
• log2(1) - log2(2) = -1
• log2(2/1) = 1
• log2(2) - log2(1) = 1

30 Average(Arrest)-Average(Control)
• Which genes are upregulated with respect to control in the arrest phenotype?
• Which genes are downregulated with respect to control in the arrest phenotype?

31 Are these fold changes significant?
• Very basic statistics: t-test between two groups

32 How to calculate a Log2Ratio in Excel?
• Type in =AVERAGE(I2:K2)-AVERAGE(L2:N2) for FSTL1
• Drag the fill handle at the bottom right corner of the cell down to fill in the other rows.

33 How to calculate FoldChange in Excel?
• Raise 2 to the power of each value in the Log2Ratio column (=2^O2 for the FSTL1 gene)
• Drag the fill handle at the bottom right corner of the cell down to fill in the other rows.
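
The two spreadsheet calculations, the log2 ratio and the fold change, can also be written as a short Python sketch. It assumes the inputs are already log2-scale values, as in the slides; the function names and the numbers in the example are hypothetical.

```python
from statistics import mean


def log2_ratio(arrest_logs, control_logs):
    """Average(Arrest) - Average(Control) on log2-scale values."""
    return mean(arrest_logs) - mean(control_logs)


def fold_change(log2_ratio_value):
    """Convert a log2 ratio back to a fold change: 2 raised to the ratio."""
    return 2 ** log2_ratio_value


if __name__ == "__main__":
    # Hypothetical log2 intensities for one gene in three arrays per group.
    ratio = log2_ratio([3.0, 3.2, 2.8], [2.0, 2.1, 1.9])
    print(ratio, fold_change(ratio))
```

A log2 ratio of 1 corresponds to a twofold increase and a log2 ratio of -1 to a twofold decrease, which is why the log scale treats up- and downregulation symmetrically.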

34 How to do a t-test in Excel?
• Use the TTEST function from the statistical function library
• Type in =TTEST(I2:K2,L2:N2,2,2) to compare the two groups (tails = 2 and type = 2 request a two-tailed, two-sample equal-variance test)
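
For readers who want to see what lies behind the spreadsheet function, the sketch below computes the two-sample, equal-variance t statistic and its degrees of freedom using only the standard library. It stops short of the p-value that Excel returns, because that requires the t distribution; a statistics package such as SciPy (`scipy.stats.ttest_ind`) provides it. The function name and the example values are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's t statistic for two independent groups, assuming equal variances.

    Returns (t, degrees_of_freedom).
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled sample variance: each group's variance weighted by its df.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


if __name__ == "__main__":
    # Hypothetical log2 expression values, three arrays per group.
    print(pooled_t_statistic([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))
```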

35 Metrics for Gene Expression
• Euclidean Distance

36 Calculation of Euclidean Distance
• Calculate the Euclidean distance between FSTL1 and AACS

37 Calculation of Euclidean Distance
• The larger the Euclidean distance between two expression profiles, the more different they are from each other
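
The Euclidean distance between two expression profiles is the square root of the summed squared differences across conditions. A minimal sketch follows; the function name is hypothetical and the profiles in the example are made-up values, not the FSTL1 and AACS data from the slides.

```python
import math


def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two equal-length expression profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same conditions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


if __name__ == "__main__":
    # Hypothetical log2 expression values across four conditions.
    print(euclidean_distance([1.0, 2.0, 0.5, -1.0], [0.8, 2.5, 0.0, -1.2]))
```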

38 Correlation Coefficient

39 Plot of Genes Across Conditions

40 Plot of Highly Significant Genes Across Conditions

41 Plot of Highly Significant Gene Clusters Across Conditions

42 Metrics for gene expression
• Need a method to measure how similar genes are based on expression
• Examples
  • Euclidean distance
  • Pearson correlation coefficient
[Figure: panels illustrating Euclidean distance and the Pearson correlation coefficient]
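
The Pearson correlation coefficient compares the shape of two profiles rather than their magnitude. The sketch below is a minimal standard-library implementation; the function name and the example profiles are hypothetical.

```python
import math
from statistics import mean


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Same shape, different magnitude: the correlation is 1 even though
    # the two profiles are far apart in Euclidean distance.
    print(pearson_correlation([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

This is why the choice of metric matters: two genes that rise and fall together score as similar under correlation even when one is expressed at a much higher level.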

43 Unsupervised techniques
• Make no assumptions about how the data should behave
• Cluster genes based on similar patterns of gene expression
• Examples
  • Hierarchical clustering
  • Principal components analysis (PCA)
[Figure: panels illustrating hierarchical clustering and PCA]
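
Hierarchical clustering can be sketched as repeated merging of the two closest clusters. The example below assumes Euclidean distance and average linkage, which the slides do not specify, and stops when a requested number of clusters remains; the function names and the gene profiles are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(
        euclidean(profiles[i], profiles[j]) for i in cluster_a for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clustering(profiles, n_clusters):
    """Agglomerative clustering of named profiles down to n_clusters groups."""
    clusters = [[name] for name in profiles]  # start with one gene per cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters


if __name__ == "__main__":
    genes = {"g1": [0, 0], "g2": [0, 1], "g3": [10, 10], "g4": [10, 11]}
    print(hierarchical_clustering(genes, 2))
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the tree (dendrogram) usually drawn beside expression heat maps.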

44 Supervised techniques
• Divide groups of genes based on sample properties
• Can predict sample condition based on gene expression pattern
• Examples
  • Support vector machine
  • Nearest neighbor
[Figure: panels illustrating nearest neighbor and support vector machine classification]
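
A nearest-neighbor classifier predicts the condition of a new sample from the labeled sample whose expression pattern is closest. The sketch below is a minimal one-neighbor version using Euclidean distance; the function name, labels, and expression values are hypothetical.

```python
import math


def nearest_neighbor(sample, training):
    """Return the label of the training profile closest to `sample`.

    `training` is a list of (profile, label) pairs.
    """
    best_label, best_distance = None, math.inf
    for profile, label in training:
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(sample, profile)))
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


if __name__ == "__main__":
    # Hypothetical two-gene expression profiles with known conditions.
    labeled = [([1.0, 1.0], "arrest"), ([5.0, 5.0], "control")]
    print(nearest_neighbor([1.5, 0.5], labeled))
```

Unlike clustering, this approach needs samples with known labels, which is what makes it a supervised technique.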

45 Summary
• Vast amounts of data require bioinformatics
• Bioinformatics methods are limited by the following:
  • Algorithmic complexity of bioinformatics problems
  • Computer hardware performance
• Heuristic methods are used to get around these limitations
• Bioinformatics methods are used in the following areas:
  • Sequence alignment
  • Phylogenetic-tree construction
  • Gene prediction
  • Secondary-structure determination
  • Analysis of microarray data
  • Simulation of biological systems