Gene Clustering Haleh Ashki School of Informatics, Indiana University, Aug 2008 Advisor: Professor Sun Kim.



Goal of the project: Gene cluster prediction algorithms are useful for discovering sets of genes that are “conserved” in a pair of genomes. However, the prediction result depends heavily on the phylogenetic distance between the two genomes. In particular, when two genomes are close, the predicted gene clusters are large and contain several functional gene sets in one cluster.

[Figure panels: Ecoli - Salmonella; Ecoli - Shigella]

Thus a new computational tool is needed to predict “functionally related gene sets”. In this study, we developed a novel computational method that predicts functionally related gene sets from gene clusters, using Gene Ontology based clustering of genes and one-dimensional dynamic programming.

The input to this algorithm is the output of the EGGS clustering algorithm. EGGS: Extraction of Gene clusters by iteratively using Genome context based Sequence matching techniques. Genes are matched between two genomes using two concepts, pairs of close bidirectional best hits (PCBBHs) and pairs of close homologs (PCHs), where “close” refers to physical proximity, for example within 300 bp.
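The PCBBH idea can be sketched as follows. This is only one plausible reading of the definition, not the EGGS implementation: the gene tuple layout, the function names, the best-hit dictionaries and the restriction to adjacent genes in the first genome are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Gene = Tuple[str, int, int]  # (gene id, start, end); hypothetical layout


def bidirectional_best_hits(best_ab: Dict[str, str],
                            best_ba: Dict[str, str]) -> Dict[str, str]:
    """Keep a -> b only when b's best hit back in genome A is a."""
    return {a: b for a, b in best_ab.items() if best_ba.get(b) == a}


def gap(g1: Gene, g2: Gene) -> int:
    """Number of bases between two genes (0 if they overlap)."""
    return max(0, max(g1[1], g2[1]) - min(g1[2], g2[2]))


def close_bbh_pairs(genes_a: List[Gene], genes_b: List[Gene],
                    bbh: Dict[str, str], max_gap: int = 300
                    ) -> List[Tuple[str, str]]:
    """Pairs of neighbouring genes in A whose BBH partners are also close in B."""
    pos_b = {g[0]: g for g in genes_b}
    ordered = sorted(genes_a, key=lambda g: g[1])
    pairs = []
    # Simplification (assumption): only adjacent genes in genome A are tested.
    for x, y in zip(ordered, ordered[1:]):
        if x[0] in bbh and y[0] in bbh and gap(x, y) <= max_gap:
            if gap(pos_b[bbh[x[0]]], pos_b[bbh[y[0]]]) <= max_gap:
                pairs.append((x[0], y[0]))
    return pairs


genes_a = [("a1", 0, 900), ("a2", 1000, 1900), ("a3", 5000, 5900)]
genes_b = [("b1", 0, 900), ("b2", 1100, 2000), ("b3", 9000, 9900)]
bbh = bidirectional_best_hits({"a1": "b1", "a2": "b2", "a3": "b3"},
                              {"b1": "a1", "b2": "a2", "b3": "a3"})
print(close_bbh_pairs(genes_a, genes_b, bbh))
```

Pairs of close homologs (PCHs) would relax the best-hit requirement to any homolog; that variant is not shown here.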

Gene annotations in one EGGS cluster (KEGG pathway where assigned, then gene description):
path:eco00190  protoheme IX farnesyltransferase (haeme O biosynthesis)
path:eco00190  cytochrome o ubiquinol oxidase subunit IV
path:eco00190  cytochrome o ubiquinol oxidase subunit III
-  ATP-dependent specificity component of clpP serine protease, chaperone
-  DNA-binding, ATP-dependent protease La; heat shock K-protein
-  DNA-binding protein HU-beta, NS1 (HU-1)
-  peptidyl-prolyl cis-trans isomerase D
path:eco02010  ATP-binding component of a transport system
path:eco02010  putative ATP-binding component of a transport system
-  nitrogen regulatory protein P-II
-  probable ammonium transporter
path:eco00632  acyl-CoA thioesterase II
-  orf, hypothetical protein
-  primosomal replication protein N''
path:eco00230  DNA polymerase III, tau and gamma subunits; DNA elongation factor III
-  orf, hypothetical protein
-  recombination and repair
path:eco02010  putative ATP-binding component of a transport system
-  putative oxidoreductase
path:eco00632  acyl-CoA thioesterase I; also functions as protease I
path:eco02010  putative ATP-binding component of a transport system
This cluster contains 54 genes, which span different operons, pathways and strands.

Predicted clusters are often too long and need to be dissected, but how? Predicting biologically meaningful gene clusters from conserved gene clusters: A conserved gene cluster depends heavily on the phylogenetic distance between the two genomes, and it often contains multiple biologically meaningful clusters. Our method applies clustering based on Gene Ontology information. Results from our method are shown to be biologically meaningful in terms of operons (sets of genes transcribed as a single unit) and biological pathways.

GO: Gene Ontology. The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated (1) biological processes, (2) cellular components and (3) molecular functions, in a species-independent manner. The ontologies are structured as directed acyclic graphs. GO terms can be linked by different types of relationships, such as is_a and part_of. Each gene can have more than one GO term, across the different ontologies and at different levels of the hierarchy. Here, UniProt IDs were used as the key to retrieve the GO terms of each gene.

Semantic Similarity Value (SS): There are different methods to calculate the semantic similarity value. Resnik: based solely on the information content of the parents shared by the two terms. If there is more than one shared parent, the one with the lowest probability p(t), that is, the highest information content, is taken. The similarity score is derived from S(t1, t2), the set of parent terms shared by t1 and t2. Lin and Jiang: both methods use not only the information content of the shared parents but also that of the query terms, through p(t1), p(t2) and p(t), the probabilities of t1, t2 and their shared parent, from which the information content values are computed.
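For reference, these measures are commonly written as below, with p(t) the annotation probability of term t and -ln p(t) its information content. This is the usual textbook form and not necessarily the exact notation used on the slide. Taking the maximum of -ln p(t) is the same as taking the shared parent with the smallest p(t). The Jiang expression is a distance, and the conversion to a similarity varies between tools.

```latex
\begin{align}
\mathrm{sim}_{\mathrm{Resnik}}(t_1,t_2) &= \max_{t \in S(t_1,t_2)} \bigl(-\ln p(t)\bigr) \\
\mathrm{sim}_{\mathrm{Lin}}(t_1,t_2) &= \max_{t \in S(t_1,t_2)} \frac{2\,\ln p(t)}{\ln p(t_1) + \ln p(t_2)} \\
\mathrm{dist}_{\mathrm{Jiang}}(t_1,t_2) &= \min_{t \in S(t_1,t_2)} \bigl(2\,\ln p(t) - \ln p(t_1) - \ln p(t_2)\bigr)
\end{align}
```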

The semantics of a GO term is determined by its location in the entire GO graph and its semantic relations with all of its ancestor terms. We therefore use the subgraph that starts at the specific GO term and ends at the root (biological process, cellular component or molecular function). In this study I worked with molecular function GO terms. Our method follows James Z. Wang, Zhidian Du and co-authors: a term A is represented as DAG_A = (A, T_A, E_A), where T_A is the set of GO terms including A and all of its ancestors in the subgraph, and E_A is the set of edges connecting them. In the slide's example, the semantic value is SV(A) = 4.52.
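A minimal sketch of the S-value computation on DAG_A, following the scheme described in the Wang et al. paper: the term itself contributes 1, each ancestor contributes the best product of edge weights along any path down to the term, and SV(A) is the sum of these contributions. The toy ontology, the term names and the function names are hypothetical; the weights 0.8 for is_a and 0.6 for part_of are the defaults reported in that paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy ontology: child term -> list of (parent term, relation).
PARENTS: Dict[str, List[Tuple[str, str]]] = {
    "root": [],
    "b": [("root", "is_a")],
    "c": [("root", "is_a")],
    "d": [("b", "is_a"), ("c", "part_of")],
    "e": [("b", "is_a")],
}
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}


def s_values(term: str) -> Dict[str, float]:
    """S-value of every term in DAG_term (the term plus all its ancestors)."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, rel in PARENTS[child]:
            cand = WEIGHTS[rel] * s[child]
            if cand > s.get(parent, 0.0):  # keep the best-scoring path
                s[parent] = cand
                stack.append(parent)       # propagate the improved value upward
    return s


def semantic_value(term: str) -> float:
    """SV(A): sum of the S-values of all terms in DAG_A."""
    return sum(s_values(term).values())


def term_similarity(a: str, b: str) -> float:
    """Similarity of two terms from the S-values of their shared ancestors."""
    sa, sb = s_values(a), s_values(b)
    shared = sa.keys() & sb.keys()
    return sum(sa[t] + sb[t] for t in shared) / (sum(sa.values()) + sum(sb.values()))


print(semantic_value("d"), term_similarity("d", "e"))
```

Gene-level similarity additionally combines the term-to-term scores of all GO terms annotated to the two genes; that step is left out of this sketch.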

Here I used the online tool to measure the semantic similarity value for each pair of genes, based on their GO terms. I built a matrix of semantic similarity values for each group of genes. The value is normalized to lie between 0 and a maximum. Example from the paper: Sim(ADh4, Ldb3) = 0.693.

Building clusters from the semantic similarity matrix: This method groups the genes by their SS value, in descending order (0.9 to 0.1), so each gene is grouped according to its highest SS value. Genes with an SS value of 0 are omitted at this step. The clustering result lists each SS value with the genes assigned to it. HCluster: a feature of R that builds clusters from the dissimilarity values of a group of elements. I used it to visualize the clustering based on my semantic similarity matrix.
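The grouping step can be sketched as below. This is an interpretation, not the original script: each gene is assigned to the highest threshold (0.9 down to 0.1) reached by its best SS value against any other gene, and genes that never reach 0.1 are dropped. The matrix layout (a dictionary of dictionaries) and the function name are assumptions.

```python
from typing import Dict, List


def group_by_best_ss(sim: Dict[str, Dict[str, float]]) -> Dict[float, List[str]]:
    """Group genes by the highest threshold their best SS value reaches."""
    levels = [round(0.1 * k, 1) for k in range(9, 0, -1)]  # 0.9, 0.8, ..., 0.1
    groups: Dict[float, List[str]] = {level: [] for level in levels}
    for gene, row in sim.items():
        best = max((v for other, v in row.items() if other != gene), default=0.0)
        for level in levels:  # descending, so the first hit is the highest level
            if best >= level:
                groups[level].append(gene)
                break
        # Genes whose best value stays below 0.1 (including 0) are left out.
    return {level: genes for level, genes in groups.items() if genes}


sim = {
    "g1": {"g2": 0.93, "g3": 0.2},
    "g2": {"g1": 0.93, "g3": 0.0},
    "g3": {"g1": 0.2, "g2": 0.0},
    "g4": {"g1": 0.0},
}
print(group_by_best_ss(sim))
```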

[Figure: HCluster visualization]

Now each EGGS cluster is grouped based on the semantic similarity value. I built a key of the form FirstGenome.SecondGenome.EggsClusterNumber.SSvalue; for example, ESC12S0.8 stands for Ecoli, Salmonella, cluster 12, subcluster 0.8. In this study I used clusters from four pairs of genomes: Ecoli-Salmonella, Ecoli-Yersinia, Ecoli-Shigella and Ecoli-Shewanella. I gathered all existing keys for each gene in the Ecoli genome. More conserved genes naturally have more keys across the four groups. Example keys for consecutive genes, one gene per line; a break point occurs where the keys change:
ESGc102s0.8 ESc125s0.8 EYc25s
ESGc102s0.8 ESc125s0.8 EShc106s0.6 EYc25s0.8
ESGc102s0.8 ESc126s0.8 EShc107s0.8 EYc99s0.3
ESGc102s0.9 EYc99s0.3
ESGc102s0.9 EYc99s0.5
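A small sketch of building and parsing such keys. It follows the lowercase `c`/`s` separators seen in the key listing (the slide's own example uses capitals); the `ClusterKey` type, the function names and the regular expression are hypothetical helpers, not part of the original work.

```python
import re
from typing import NamedTuple, Optional


class ClusterKey(NamedTuple):
    pair: str      # genome-pair prefix, e.g. "ES"
    cluster: int   # EGGS cluster number
    ss: float      # semantic similarity level of the subcluster


KEY_RE = re.compile(r"^([A-Za-z]+?)c(\d+)s(\d+(?:\.\d+)?)$")


def make_key(pair: str, cluster: int, ss: float) -> str:
    """Build a key such as 'ESc12s0.8'."""
    return f"{pair}c{cluster}s{ss:.1f}"


def parse_key(key: str) -> Optional[ClusterKey]:
    """Split a key into its parts; return None for incomplete keys."""
    match = KEY_RE.match(key)
    if match is None:
        return None
    pair, cluster, ss = match.groups()
    return ClusterKey(pair, int(cluster), float(ss))


print(make_key("ES", 12, 0.8), parse_key("EShc106s0.6"))
```

Returning `None` for a key with a missing SS value keeps incomplete entries visible to the caller instead of silently guessing a value.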

Break Point and Cluster Score: Break points are defined in the target genome (Ecoli). A break point is a gene at which the keys change, in either the cluster number or the subcluster value. All break points are collected and redundancies are removed. Formula for the “gene set score”:
\[ \text{gene set score} = \frac{\left(\dfrac{\#\,\text{same keys inside the cluster}}{\#\,\text{same keys outside the cluster}}\right)^{2}}{\text{size of cluster (number of genes)}} \]
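A sketch of break point detection and of the gene set score. It assumes the keys of each gene are available as a set, in genome order, and it takes the inner and outer key counts as plain inputs because the slide does not spell out how they are tallied. Treating an outer count of 0 as 1 is an assumption added here only to avoid division by zero.

```python
from typing import List, Sequence, Set


def find_breakpoints(keys_per_gene: Sequence[Set[str]]) -> List[int]:
    """Indices of genes whose key set differs from the previous gene's."""
    return [i for i in range(1, len(keys_per_gene))
            if keys_per_gene[i] != keys_per_gene[i - 1]]


def gene_set_score(inner: int, outer: int, size: int) -> float:
    """((inner / outer) ** 2) / size, as in the formula above."""
    if size <= 0:
        raise ValueError("cluster size must be positive")
    ratio = inner / max(outer, 1)  # assumption: guard against outer == 0
    return ratio ** 2 / size


rows = [
    {"ESGc102s0.8", "ESc125s0.8"},
    {"ESGc102s0.8", "ESc125s0.8"},
    {"ESGc102s0.8", "ESc126s0.8"},
    {"ESGc102s0.9"},
    {"ESGc102s0.9"},
]
print(find_breakpoints(rows), gene_set_score(4, 2, 4))
```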

[Table: example keys EYc174s ESc3s and EYc174s ESc3s EShc3s; columns: Breakpoint1-breakpoint2, genes, # inner genes, # outer genes, size, gene set score]
Break point interval score = sum of gene set scores / number of genes. In the example, 4.36 / 5 = 0.872.
Each group is defined as the genes between a break point and the 5th, 10th or 15th break point ahead. Here: 15 break points in a group.
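The interval score and the enumeration of candidate intervals can be sketched as follows. The function names and the representation of break points as a sorted list of positions are assumptions; pairing each break point with the 5th, 10th and 15th one ahead follows the group definition given in the text.

```python
from typing import List, Sequence, Tuple


def interval_score(sum_of_gene_set_scores: float, n_genes: int) -> float:
    """Break point interval score = sum of gene set scores / number of genes."""
    return sum_of_gene_set_scores / n_genes


def candidate_intervals(breakpoints: Sequence[int],
                        steps: Tuple[int, ...] = (5, 10, 15)
                        ) -> List[Tuple[int, int]]:
    """Pair each break point with the 5th, 10th and 15th break point ahead."""
    out = []
    for i, start in enumerate(breakpoints):
        for step in steps:
            if i + step < len(breakpoints):
                out.append((start, breakpoints[i + step]))
    return out


print(round(interval_score(4.36, 5), 3))
print(candidate_intervals(list(range(0, 200, 10)))[:3])
```

Restricting candidates to a few fixed look-ahead distances keeps the number of intervals linear in the number of break points instead of quadratic.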

Problem definition: Any pair of break points can define a functionally related gene set, but there are too many candidates: O(n^2) for n break points. We formulate functional gene set prediction as generating a maximal cover of genes based on the break point interval score. This problem is similar to the exon chaining problem, which predicts exons from a number of intron-exon boundaries. We therefore used one-dimensional dynamic programming to solve the functional gene set prediction problem: select non-overlapping break point intervals that maximize the sum of break point interval scores.
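The selection problem stated here is an instance of weighted interval scheduling, which a one-dimensional dynamic program solves after sorting intervals by end position. The sketch below is a generic version under stated assumptions, not the author's modified algorithm: intervals are `(start, end, score)` tuples with `start < end`, and two intervals that only share an end point are treated as non-overlapping.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

Interval = Tuple[int, int, float]  # (start, end, break point interval score)


def best_chain(intervals: Sequence[Interval]) -> Tuple[float, List[Interval]]:
    """Pick non-overlapping intervals with the maximum total score."""
    ivs = sorted(intervals, key=lambda iv: iv[1])
    ends = [iv[1] for iv in ivs]
    best = [0.0]          # best[j]: optimum using only the first j intervals
    prev: List[int] = []  # prev[j]: how many earlier intervals end at or before start
    for j, (start, _end, score) in enumerate(ivs):
        p = bisect_right(ends, start, 0, j)
        prev.append(p)
        best.append(max(best[j], best[p] + score))
    # Trace back which intervals were taken.
    chosen: List[Interval] = []
    j = len(ivs)
    while j > 0:
        if best[j] > best[j - 1]:   # interval j-1 is part of the optimum
            chosen.append(ivs[j - 1])
            j = prev[j - 1]
        else:
            j -= 1
    chosen.reverse()
    return best[-1], chosen


total, chain = best_chain([(0, 5, 0.9), (5, 10, 0.5), (0, 10, 1.2), (10, 15, 0.7)])
print(total, chain)
```

Sorting plus binary search gives O(n log n) time for n candidate intervals.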

One-dimensional dynamic programming: In each group (each break point paired with the 5th, ... break point ahead), the four highest-scoring intervals are chosen as blocks for the dynamic programming. The dynamic programming takes each block as a potential cluster, with its start position, stop position and weight (the “break point interval score”), and finally generates the clusters with the highest score. The algorithm was modified to fit our data, for example to handle blocks that overlap at their end points.
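The block selection step might look like the sketch below, which keeps the four best-scoring candidate intervals of each group before they are passed to the dynamic programming. How groups are formed is not fully specified in the text, so the input is simply assumed to be a list of groups, each a list of `(start, end, score)` tuples; `top_blocks` is a hypothetical helper name.

```python
import heapq
from typing import List, Sequence, Tuple

Interval = Tuple[int, int, float]  # (start, end, break point interval score)


def top_blocks(groups: Sequence[Sequence[Interval]], k: int = 4) -> List[Interval]:
    """Keep the k highest-scoring candidate intervals from every group."""
    blocks: List[Interval] = []
    for candidates in groups:
        blocks.extend(heapq.nlargest(k, candidates, key=lambda iv: iv[2]))
    return blocks


groups = [
    [(0, 5, 0.1), (0, 10, 0.9), (0, 15, 0.4), (0, 20, 0.3), (0, 25, 0.8)],
    [(5, 10, 0.2)],
]
print(top_blocks(groups))
```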

One more step to refine predicted clusters: strand information. Following “Connected gene neighborhoods in prokaryotic genomes” (Nucleic Acids Research, 2002, Vol. 30), genes that have the same function are in the same direction. The strand information of the Ecoli genome, as the target, is therefore used to dissect each cluster. New clusters that contain only one gene are removed.
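The strand-based dissection can be sketched as splitting a cluster into maximal runs of consecutive genes on the same strand and discarding runs of length one. The `(gene_id, strand)` tuple layout and the function name are assumptions for illustration.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

GeneStrand = Tuple[str, str]  # (gene id, strand "+" or "-"), in genome order


def split_by_strand(cluster: Sequence[GeneStrand]) -> List[List[GeneStrand]]:
    """Split a cluster at every strand change and drop single-gene pieces."""
    runs = [list(run) for _, run in groupby(cluster, key=lambda g: g[1])]
    return [run for run in runs if len(run) > 1]


cluster = [("g1", "+"), ("g2", "+"), ("g3", "-"),
           ("g4", "+"), ("g5", "+"), ("g6", "+")]
print(split_by_strand(cluster))
```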

[Table: Gene Id, Start Position, End Position, Strand, Operon ID, Pathway (for example eco00785)]

Predicted gene clusters are verified in terms of: (1) Definition of each gene: NCBI. (2) Operon information: “Detecting uber-operons in prokaryotic genomes”, Dongsheng Che, Guojun Li, Nucleic Acids Research, 2006. This database groups genes by the operons they belong to. Each uber-operon group represents a rich set of footprints of operon evolution. (3) KEGG Pathway: a metabolic pathway is a series of chemical reactions occurring within a cell. In each pathway, a principal chemical is modified by chemical reactions, and enzymes catalyze these reactions. The absence of information for non-enzyme genes makes this database not very useful.

Summary (Ecoli-Salmonella):
EGGS: number of clusters 167; gene range 2-130 (2-50); operon ID range 0-42.
Our method: number of clusters 483; gene range 2-25 (2-10); operon ID range 0-6.

Conclusion: By dissecting large conserved clusters, we obtain functionally meaningful clusters of related genes without having to worry about the phylogenetic distance between the genomes.

Literature:
Resnik P: Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. J Artif Intell Res 1999, 11.
Lin D: An information-theoretic definition of similarity. In: International Conference on Machine Learning; 1998; San Francisco: Morgan Kaufmann; 1998.
Jiang JJ, Conrath DW: Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the 10th International Conference on Research in Computational Linguistics. Taiwan; 1997.
Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F: A new method to measure the semantic similarity of GO terms. Bioinformatics 2007, 23(10).
Kwangmin Choi, Bharath Kumar Maryada, Sun Kim: EGGS: Extraction of Gene clusters using Genome context based Sequence matching techniques.
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucl Acids Res 2004, 32(90001).
Connected gene neighborhoods in prokaryotic genomes. Nucleic Acids Research 2002, Vol. 30.
Yuri I. Wolf, Igor B. Rogozin, Alexey S. Kondrashov, Eugene V. Koonin: Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Research 2001, 11.
Dongsheng Che, Guojun Li: Detecting uber-operons in prokaryotic genomes. Nucleic Acids Research 2006.

Online resources:

Thanks: Professor Sun Kim; Professor Dalkilic; Kwangmin Choi and Youngik Yang; Professor Tang, Professor Radivojac and all other Informatics faculty; Informatics staff; Ms. Linda Hostetter; all graduate students (my friends); Professor Kehoe; School of Informatics.