Bioinformatics Capstone Project: The Design and Implementation of a System that Integrates Pathway Data from KEGG and Genome Sequence Data from NCBI. Xiang (Sean) Zhou. Advisor: Prof. Sun Kim. Indiana University. 9/21/2018
Outline: Background; Methods; Sample results; Online demonstration; Future direction
Why do we want to study metabolic pathways? One of the challenges in life science is to uncover the fundamental design principles that provide the common underlying structure and function in all cells and microorganisms [2]. Metabolic pathway networks serve as a tool for achieving this goal.
Metabolic Pathway: A metabolic pathway is a series of enzyme-catalyzed chemical reactions within a cell that results in the removal of a molecule from the environment to be used or stored by the cell, or in the initiation of another metabolic pathway [1]. A pathway is a linked set of biochemical reactions, linked in the sense that the product of one reaction is a reactant of, or an enzyme that catalyzes, a subsequent reaction [4].
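The "linked set of reactions" definition can be illustrated with a minimal Python sketch. The `Reaction` type and the `is_linked` helper are hypothetical names introduced here for illustration and are not part of the system described in these slides. The sketch checks only the product-to-reactant link, not the case where a product is an enzyme for a later reaction.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    """One enzyme-catalyzed step (hypothetical illustration type)."""
    enzyme: str
    substrates: frozenset
    products: frozenset


def is_linked(reactions):
    """Return True if every reaction consumes a product of the one before it."""
    return all(
        bool(prev.products & nxt.substrates)
        for prev, nxt in zip(reactions, reactions[1:])
    )


# First three steps of glycolysis (cofactors such as ATP omitted for brevity).
glycolysis_start = [
    Reaction("hexokinase",
             frozenset({"glucose"}),
             frozenset({"glucose-6-phosphate"})),
    Reaction("phosphoglucose isomerase",
             frozenset({"glucose-6-phosphate"}),
             frozenset({"fructose-6-phosphate"})),
    Reaction("phosphofructokinase",
             frozenset({"fructose-6-phosphate"}),
             frozenset({"fructose-1,6-bisphosphate"})),
]

print(is_linked(glycolysis_start))
```

Dropping the middle step breaks the chain, so `is_linked` would then report that the remaining reactions no longer form a pathway in this sense.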
Why is it so difficult to study metabolism in multiple genomes? The metabolism of a single organism is too large to be grasped by a single mind (e.g., E. coli has a metabolism involving over 850 substances and 1500 reactions). Genome projects keep generating large amounts of sequence data.
A sample metabolic pathway [3]
Pathway Database (DB): A pathway DB is a bioinformatics DB that describes biochemical pathways and their component reactions, enzymes, and substrates [4].
Current Pathway DBs: KEGG (Kyoto Encyclopedia of Genes and Genomes), the most comprehensive metabolic pathway DB; EcoCyc, the Encyclopedia of Escherichia coli K12 Genes and Metabolism; CGAP (Cancer Genome Anatomy Project), whose pathways are obtained directly from BioCarta and KEGG; WIT, which has become a commercial DB.
Disadvantages of current DBs: They are “static”: all data are pre-computed and stored in the DBs, so users have limited flexibility in choosing the genomes and pathways of interest. They support the study of only one genome at a time: users cannot compare the pathways of different genomes at the same time.
Motivation: Create a system in which users can select the genomes and pathways of interest and perform sequence analysis freely. The system enables multi-genome pathway comparison, and results are generated according to the user's needs.
Data Sources: KEGG; NCBI GenBank; PLATCOM genome comparison data
The Challenge: In KEGG and NCBI GenBank, genome names and gene names differ slightly, and the IDs used in the two DBs are entirely different. Some of the protein IDs (pid) in KEGG are outdated. Thus, integrating the two DBs is not trivial.
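One possible way to bridge the two DBs, sketched in Python under stated assumptions: try the pid recorded in KEGG first, and fall back to a normalized gene name when that pid is outdated. The record layout (`kegg_id`, `name`, `pid`, `gene`), the function names, and the sample values are all hypothetical. The slides do not describe the actual matching rules used by the system.

```python
import re


def normalize(name):
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def link_records(kegg_genes, genbank_proteins):
    """Map each KEGG gene id to a GenBank pid, or None if nothing matches.

    Assumed strategy (not taken from the slides): match on pid first,
    then on the normalized gene name.
    """
    by_pid = {p["pid"]: p for p in genbank_proteins}
    by_name = {normalize(p["gene"]): p for p in genbank_proteins}
    links = {}
    for gene in kegg_genes:
        hit = by_pid.get(gene.get("pid")) or by_name.get(normalize(gene["name"]))
        links[gene["kegg_id"]] = hit["pid"] if hit else None
    return links


# Made-up records for illustration only.
kegg_genes = [
    {"kegg_id": "k1", "name": "thrA", "pid": "P-OLD"},  # outdated pid
    {"kegg_id": "k2", "name": "thr-B", "pid": "P2"},    # pid still valid
]
genbank_proteins = [
    {"pid": "P1", "gene": "ThrA"},
    {"pid": "P2", "gene": "thrB"},
]

print(link_records(kegg_genes, genbank_proteins))
```

Name-based fallback can produce false matches when two genes normalize to the same string, so unmatched or ambiguous entries would still need manual review or a sequence-level check.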
The unique features of our system: Easy to maintain: only the latest datasets from KEGG and NCBI GenBank need to be downloaded. Flexibility: sequence analysis is based on the combination of genomes and pathways chosen by the user, and everything is computed on the fly. Integration of the KEGG and NCBI GenBank DBs in terms of sequence analysis.
Methods: FASTA; ClustalW; HMMer; a series of modules
Infrastructure: a query protein sequence; a pathway; a reference genome; genomes of interest; protein information; pathway information; search for missing genes
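The "search for missing genes" step can be pictured with a minimal Python sketch. It assumes each genome is summarized as a set of annotated enzyme (EC) numbers and that a gene is "missing" when the reference genome has a pathway enzyme that a genome of interest lacks. The function name, the genome labels, and this set-based criterion are assumptions for illustration. The system described here works at the sequence level, and a reported gap would presumably be followed up with a sequence search.

```python
def find_missing_genes(pathway_enzymes, reference, genomes):
    """For each genome, list pathway enzymes found in the reference but not in it."""
    expected = set(pathway_enzymes) & set(reference)
    return {
        name: sorted(expected - set(annotated))
        for name, annotated in genomes.items()
    }


# EC numbers of the first glycolysis enzymes: hexokinase,
# glucose-6-phosphate isomerase, 6-phosphofructokinase.
pathway = ["2.7.1.1", "5.3.1.9", "2.7.1.11"]

# Hypothetical annotation sets for illustration only.
reference_genome = {"2.7.1.1", "5.3.1.9", "2.7.1.11"}
genomes_of_interest = {
    "genome_A": {"2.7.1.1", "5.3.1.9"},
    "genome_B": {"2.7.1.1", "5.3.1.9", "2.7.1.11"},
}

print(find_missing_genes(pathway, reference_genome, genomes_of_interest))
```

A gap reported this way is only a candidate: the enzyme may be present but unannotated, which is why a follow-up search with the reference protein sequence is the natural next step.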
PLATCOM-Metabolic Pathway Division
Sample Result (1)
Sample Result (2)
Sample Result (3)
Online Demonstration: PlatCom, a Platform for Computational Comparative Genomics
Future Direction: Use conserved domains to perform HMM searches. Enable sequence alignment and pattern search. Connect to other DBs (protein-protein interaction DBs, PDB). Improve performance by using a dynamic cache.
References
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N., and Barabási, A.-L. (2000). The large-scale organization of metabolic networks. Nature, 407:651-654.
http://www.free-definition.com/
http://www.genome.ad.jp/kegg/pathway.html
Karp, P. D. (2001). Pathway Databases: A Case Study in Computational Symbolic Theories. Science, 293:2040-2044.
Acknowledgements: Professor Sun Kim; Kwangmin Choi; Arvind Gopu