1/17 Identification of thermophilic species by the amino acid compositions deduced from their genomes Reporter: Yu Lun Kuo


1/17 Identification of thermophilic species by the amino acid compositions deduced from their genomes Reporter: Yu Lun Kuo Date: October 26, 2006 David P. Kreil and Christos A. Ouzounis University of Cambridge and European Bioinformatics Institute, Computational Genomics Group, Research Programme, The European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK

2/17 Outline Introduction Materials and Methods Results Discussion and Conclusion

3/17 Introduction The properties of thermophilic proteins have been examined over the past two decades. Thermophilic proteins show preferences for particular amino acids, but general rules have not yet emerged. The analysis covers not only homologous proteins but also proteins unique to particular species.

4/17 Introduction Results are presented for the genomes of six archaea, 19 bacteria, and the eukaryotic organisms. Using two different approaches, several factors that determine amino acid composition can be deduced –GC content of the coding sequences is the dominant influence on amino acid composition –It is possible to identify thermophilic species

5/17 Materials and Methods Data sources and tools Exploratory data analysis Sensitivity analysis, sampling adequacy and significance

6/17 Data Sources and Tools Obtained from public databases –EBI (European Bioinformatics Institute) –NCBI (National Center for Biotechnology Information) –SRS – Access to multiple molecular biology databases –EPCLUST (Expression Profile data CLUSTering and analysis) –Hierarchical clustering –PCA (Principal Components Analysis)

7/17 Exploratory Data Analysis For all organisms, global amino acid compositions were determined –Stored as a matrix whose rows represent the listed data sources –The columns correspond to the respective percentage amino acid content
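The composition matrix described on this slide can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `AMINO_ACIDS` column order, and the toy peptide fragments are all assumptions made for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes (assumed column order)


def composition_row(proteins):
    """Return the percentage of each standard amino acid over all given proteins."""
    counts = Counter()
    for seq in proteins:
        # Ignore ambiguous or non-standard symbols such as X or *.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]


def composition_matrix(proteomes):
    """Build one row per data source; columns follow AMINO_ACIDS."""
    names = sorted(proteomes)
    return names, [composition_row(proteomes[name]) for name in names]


# Toy input: made-up peptide fragments, not real proteome data.
toy = {
    "organism_A": ["MKEELLK", "MEEK"],
    "organism_B": ["MQQNSTA", "MQSA"],
}
names, matrix = composition_matrix(toy)
```

Each row sums to 100, so the rows are directly comparable between genomes of very different sizes.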

8/17 Exploratory Data Analysis The principal factors were supported by two variables –GC ratio (GC counts vs. AT counts) –A binary variable (therm) The binary variable, therm –0 (zero) - mesophilic –1 (one) - thermophilic
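The two supporting variables can be computed along these lines. The slide does not give the exact formula for the GC ratio, so this sketch returns both the GC/AT count ratio and the GC fraction; the function names and the lifestyle labels are assumptions for illustration.

```python
def gc_statistics(cds_sequences):
    """Return (GC count / AT count, GC fraction) over a set of coding sequences."""
    gc = at = 0
    for seq in cds_sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    if at == 0:
        raise ValueError("no A/T bases found; GC/AT ratio is undefined")
    return gc / at, gc / (gc + at)


# Encoding of the binary variable as given on the slide.
THERM_CODES = {"mesophilic": 0, "thermophilic": 1}


def encode_therm(lifestyle):
    """Map a lifestyle label to the binary variable therm (0 or 1)."""
    return THERM_CODES[lifestyle]


# Toy input: made-up sequences, not real coding regions.
ratio, fraction = gc_statistics(["ATGC", "GGCC"])
therm = [encode_therm(label) for label in ("thermophilic", "mesophilic")]
```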

9/17 Sensitivity Analysis, Sampling Adequacy and Significance Several clustering methods were tried –Average linkage (UPGMA) –Complete linkage (Maximum distance method) –Single linkage (Minimum distance method) –Weighted pair group method (WPGMA) PCA was repeated to verify that this weighting did not affect any conclusions –20 amino acids with equal weight
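The difference between the linkage rules listed above can be made concrete with a naive agglomerative clustering sketch. It is not the EPCLUST implementation; it assumes Euclidean distance, covers only single, complete, and average linkage (WPGMA is omitted), and uses made-up one-dimensional points.

```python
import math
from itertools import combinations

# How the distance between two clusters is derived from the pairwise
# distances between their members.
LINKAGE = {
    "single": min,                            # minimum distance method
    "complete": max,                          # maximum distance method
    "average": lambda ds: sum(ds) / len(ds),  # UPGMA
}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, linkage="average"):
    """Merge the two closest clusters until one remains.

    Returns the merges in order as (cluster_1, cluster_2, distance),
    where each cluster is a tuple of point indices.
    """
    combine = LINKAGE[linkage]
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
            d = combine(dists)
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
        merges.append((c1, c2, d))
    return merges


toy_points = [(0.0,), (1.0,), (5.0,)]
final_distances = {name: agglomerate(toy_points, name)[-1][2] for name in LINKAGE}
```

On the toy points all three rules merge points 0 and 1 first, but they disagree on how far the resulting cluster is from point 2, which is why trying several rules is a reasonable sensitivity check.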

10/17 Results [Figure: amino acid composition by organism. Color key: red = more than average, green = less than average. Figure labels: Thermophilic; Unusually high GC ratio 57-67%; Thermophilic; High GC ratio]

11/17 Results (PCA of Amino Acid Composition) A clear separation of thermophiles and mesophiles along the second principal axis [Figure labels: 0 - mesophile, 1 - thermophile; Thermophilic. Color key: Archaea - red, Bacteria - green, Eukaryote - purple, Outgroup - blue]

12/17 Component Loadings High Loading –Absolute component loadings > 0.6 Component loadings can be interpreted as correlation coefficients Component 1 –Correlates with GC ratio Component 2 –Correlates with therm
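The statement that component loadings can be read as correlation coefficients can be checked numerically. The sketch below assumes PCA on the correlation matrix (standardized variables) and extracts only the leading component by power iteration; the data are made up, and the function names are illustrative rather than taken from the tools named in the talk.

```python
import math


def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]


def pearson(x, y):
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def first_component(columns, iterations=500):
    """Leading eigenpair of the correlation matrix, plus component scores."""
    z = [standardize(c) for c in columns]
    p, n = len(z), len(z[0])
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(p)]
            for i in range(p)]
    # Power iteration; the uneven start vector lowers the chance of starting
    # orthogonal to the leading eigenvector (not handled in general).
    v = [1.0 + 0.1 * i for i in range(p)]
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p)) for i in range(p))
    scores = [sum(v[k] * z[k][r] for k in range(p)) for r in range(n)]
    return eigenvalue, v, scores


def component_loadings(columns):
    """Loading of each variable: eigenvector entry times sqrt(eigenvalue)."""
    eigenvalue, v, scores = first_component(columns)
    return [x * math.sqrt(eigenvalue) for x in v], scores


# Toy data: three made-up variables observed on six samples.
toy_columns = [
    [1, 2, 3, 4, 5, 6],
    [2, 4, 5, 4, 5, 7],
    [6, 5, 3, 4, 2, 1],
]
loads, scores = component_loadings(toy_columns)
high = [abs(l) > 0.6 for l in loads]  # the "high loading" threshold from the slide
```

For standardized variables, each loading equals the Pearson correlation between that variable and the component scores, which is what justifies reading a loading above 0.6 in absolute value as a strong association.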

13/17 Statistical Evidence and Specific Features of Thermophilic Species PCA –Starting from the distinct groups of thermophiles and mesophiles as obtained Gln (Q) & Glu (E) –Have very high component loadings Table 2 summarizes the results and most of the statistical evidence [Table annotations: Very high factor loadings; Raw correlations with the binary variable therm; Strong PCA factor loading for component 2; Average difference between thermo & meso; Thermo & meso more or less; Low factor loadings. Key: Less = content in thermophiles < in mesophiles; More = content in thermophiles > in mesophiles]
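Two of the quantities named in the table annotations, the raw correlation with the binary variable therm and the average difference between thermophiles and mesophiles, can be sketched for a single amino acid column as follows. The percentages are invented for illustration and do not come from Table 2; the function name is hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def therm_statistics(percentages, therm):
    """Return (correlation with therm, mean thermophile minus mean mesophile).

    The Pearson correlation with a 0/1 variable is also known as the
    point-biserial correlation.
    """
    thermo = [p for p, t in zip(percentages, therm) if t == 1]
    meso = [p for p, t in zip(percentages, therm) if t == 0]
    difference = mean(thermo) - mean(meso)

    mx, mt = mean(percentages), mean(therm)
    cov = sum((x - mx) * (t - mt) for x, t in zip(percentages, therm))
    sx = math.sqrt(sum((x - mx) ** 2 for x in percentages))
    st = math.sqrt(sum((t - mt) ** 2 for t in therm))
    return cov / (sx * st), difference


# Made-up percentages of one amino acid in four genomes, with their therm codes.
toy_percentages = [8.0, 9.0, 6.0, 5.0]
toy_therm = [1, 1, 0, 0]
r, diff = therm_statistics(toy_percentages, toy_therm)
```

A positive difference corresponds to "More" in the key (content higher in thermophiles), a negative one to "Less".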

14/17 Discussion and Conclusion The results discern several underlying factors that influence amino acid composition –Completely sequenced genomes of 27 species –Employing different methods of data analysis The two most prominent observations –Dominant effect of GC pressure –Clear identification of thermophilic species

15/17 Discussion and Conclusion PCA found GC ratio to be the most important factor. Environmental adaptations would also be expected to play a role –A. pernix is found at a little distance from the other thermophiles

16/17 Discussion and Conclusion The findings hold not only for individual proteins or groups of proteins but also for entire genomes –GC content has a stronger influence on amino acid composition than adaptation to extreme environments (e.g., thermophily) –It would be interesting to extend the analysis to different phyla

17/17 Thanks