SEQOPTICS: PROTEIN SEQUENCE CLUSTERING WITH OPTICS
Yonghui Chen (Advisors: Kevin Reilly and Alan Sprague)
Department of Computer and Information Sciences, The University of Alabama at Birmingham

ABSTRACT
Protein sequence clustering is widely used in the analysis of protein structure and function. We demonstrate an approach to protein clustering, SEQOPTICS, based on OPTICS (Ordering Points To Identify the Clustering Structure). The approach is attractive for its emphasis on visualization of results and its support for interactive work, e.g., in choosing parameters. SEQOPTICS results are presented for four data sets from diverse data sources, and the sequence clustering structure is visualized for all four. We evaluated the system against existing methods using the Jaccard coefficient as the criterion; SEQOPTICS performed better.

FUTURE WORK
According to this report, SEQOPTICS has proved its value for small data sets (<1000 sequences). Applying the method to larger data sets, such as an entire protein sequence database, would benefit from several improvements: 1) use another measure of protein sequence distance, e.g., one based on BLAST or FASTA; 2) apply parallel computing tools, for example the Message Passing Interface (MPI); 3) implement visualization techniques that accommodate the data set size; 4) consider incremental cluster ordering schemes, since protein databases are growing rapidly.

RESULTS EVALUATION
To judge the biological accuracy of a resulting cluster set, we need to compare it to a “true” cluster set. However, there is no generally accepted “true” cluster set. For this study, we assumed that the original database clusters are the “real” clusters, as is done when testing most automatic protein clustering methods.
For example, we assume that the sequences from the glucokinase family of Pfam belong to the same cluster. Under this assumption, several statistics appear adequate to evaluate the results. As the criterion we used the Jaccard (similarity) coefficient, S, defined as:

S = a / (a + b + c)

where a counts “true positives,” i.e., the sequence pairs clustered together in both sets; b counts “false negatives,” i.e., the sequence pairs clustered together in the true cluster set but not in the current clustering solution; and c counts “false positives,” i.e., the sequence pairs clustered together in the current solution but not in the true cluster set. The Jaccard coefficient lies between 0 and 1, with larger values signifying better clustering.

METHODS
First, data sets were extracted from protein databases. Second, the pairwise distance between every two proteins was computed from a normalized Smith-Waterman score. Then the OPTICS algorithm was applied to cluster the sequences, and the results were analyzed via the Jaccard coefficient.

Figure 1: Overview of SEQOPTICS (stages shown: extracting data sets; computing pairwise distances; clustering and visualization; results analysis).

DATA SETS
Four data sets were extracted from publicly available protein repositories, as shown in table 1: two from Pfam, one from Swiss-Prot, and one from NCBI. Each protein sequence is labeled according to the data set from which it originated; this labeling defines what we later use as the “true” cluster to which a sequence belongs.

COMPUTE DISTANCE
A pairwise Smith-Waterman local alignment score is computed first and then normalized to obtain SN(a,b):

SN(a,b) = S(a,b) / Min(S(a,a), S(b,b))

where S(a,b) is the Smith-Waterman local alignment score between two sequences a and b. The distance between two protein sequences is then computed as:

Distance(a,b) = 1 - SN(a,b)

Distances range from 0 to 1, with 0 meaning identical sequences and 1 meaning totally different.
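The distance computation can be illustrated with a minimal Python sketch. It takes the distance as 1 - SN(a,b), the form consistent with the stated range of 0 to 1. The scoring scheme (match = 2, mismatch = -1, linear gap = -1) and all function names are assumptions made for illustration only; the poster does not state its scoring parameters, and a real run would more likely use a substitution matrix such as BLOSUM with affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b.

    Illustrative scoring only (assumed, not taken from the poster):
    fixed match/mismatch scores and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]              # column 0 is always 0
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def normalized_score(a, b, score=smith_waterman_score):
    """SN(a,b) = S(a,b) / Min(S(a,a), S(b,b)); assumes non-empty sequences."""
    return score(a, b) / min(score(a, a), score(b, b))


def distance(a, b, score=smith_waterman_score):
    """Distance(a,b) = 1 - SN(a,b): 0 for identical sequences."""
    return 1.0 - normalized_score(a, b, score)


def distance_matrix(sequences, score=smith_waterman_score):
    """Symmetric matrix of pairwise distances, as a list of lists."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(sequences[i], sequences[j], score)
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

Passing the scoring function as a parameter keeps the normalization separate from the alignment, so a different alignment scorer could be substituted without touching the distance definition.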
(Figure 1 labels: protein sequences from Pfam, Swiss-Prot, and NCBI; data selection, labeling, and reformatting; labeled data sets; distance measure: Smith-Waterman score, normalized Smith-Waterman.)

OPTICS CLUSTERING
OPTICS (Ordering Points To Identify the Clustering Structure) orders the data into a density-based clustering structure that corresponds to a broad range of parameter settings (selected by the user). A reachability plot (a bar chart) shows each object’s reachability distance in the order in which the objects were processed, and thereby reveals the cluster structure of the data. Applying OPTICS to protein sequence clustering has two main advantages: 1) OPTICS can find local density regions; 2) OPTICS produces an augmented ordering of the database that represents its density-based clustering structure, and this ordering can be visualized in the reachability plot.

The final clusters can be extracted from the plot by employing either a cutoff value or a steepness criterion. In this study, density-based clusters were extracted by using a cutoff value. For example, in figure 2 the cutoff value is set to 0.860 (see the line at reachability distance 0.860). Under this cutoff, each valley in figure 2 that lies between two sequences whose reachability distance exceeds the cutoff identifies a cluster. The sequence that starts a valley, whose own reachability distance exceeds the cutoff, belongs to the same cluster as the remaining sequences in the valley. Any sequence with a reachability distance above the cutoff is noise if it does not start a new valley. Therefore, four clusters are identified in figure 2. Similarly, with the cutoff values used in the paper, there are four clusters in figure 3, six in figure 4, and four in figure 5.

EXPERIMENTAL RESULTS
SEQOPTICS was applied to cluster the data sets.

1. Visualization of the cluster structure: We made a reachability distance plot for each data set.
In each figure, the horizontal axis represents the ordering of the sequences, the vertical axis represents the reachability distance, and each valley stands for a cluster. From the figures, we can see that each valley contains sequences from only one family.

2. Extraction of the clusters: The final density-based clusters were extracted by using a cutoff value. For example, in figure 2 the cutoff value is set to 0.860; other values may be chosen, as in figure 3.

Fig. 2 (data set 1): There are five valleys. The first two are composed of sequences from cytochrome B562; the third consists of sequences from glucokinase; the fourth contains sequences from the GABAR family; the fifth consists of sequences from the bac globin family.

Fig. 3 (data set 2): There are three valleys. The first is composed of sequences from bac globin; the second is composed of sequences from the band3 family; the third contains only sequences from IGA1.

Fig. 4 (data set 3): There are six valleys. The first and the last contain only cytoC sequences; the second contains only sequences from GABAR; the third contains GAPDH sequences; the fourth contains GPCR sequences; the fifth contains only GFAT.

Fig. 5 (data set 4): There are four main valleys. The first contains only casein kappa sequences; the second and third contain only globins; the fourth is composed of GAPDHs.

We also clustered the same data sets with two existing clustering methods, blastclust and BAG, using their default parameters, and compared the results via the Jaccard coefficient. The comparison is shown in table 2. SEQOPTICS produces results that agree well with each original cluster set and, moreover, outperforms BAG and blastclust on all the data sets. blastclust outperforms BAG in two cases and does worse in the other two.
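The pair-counting Jaccard coefficient used for this comparison, S = a/(a+b+c), can be sketched in a few lines of Python. The function names, the use of None to mark noise sequences, and the convention of returning 1.0 when neither clustering contains any pair are assumptions of this sketch, not details given in the poster.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Index pairs (i, j), i < j, whose items share a cluster label.

    Items labeled None are treated as noise and never paired
    (an assumption of this sketch).
    """
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] is not None and labels[i] == labels[j]
    }


def jaccard(true_labels, found_labels):
    """Pair-counting Jaccard coefficient S = a / (a + b + c)."""
    true_pairs = co_clustered_pairs(true_labels)
    found_pairs = co_clustered_pairs(found_labels)
    a = len(true_pairs & found_pairs)   # together in both sets
    b = len(true_pairs - found_pairs)   # together only in the true set
    c = len(found_pairs - true_pairs)   # together only in the solution
    total = a + b + c
    return a / total if total else 1.0  # convention when no pairs exist


# Toy example with made-up labels: five sequences, two true families.
true_labels = ["gluco", "gluco", "gluco", "cytoC", "cytoC"]
found_labels = [0, 0, 1, 1, 1]
print(jaccard(true_labels, found_labels))  # a=2, b=2, c=2 -> 2/6
```

Because the measure counts pairs rather than matching cluster names, the labels in the two clusterings do not need to correspond to each other.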
Overall, SEQOPTICS seems a promising method in terms of both clustering quality and graphical representation.

(Figure and table labels: data sets; distance matrix; reachability plot; cluster extraction; comparison of results by Jaccard coefficient; BAG, blastclust, SEQOPTICS.)
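The cutoff rule for reading clusters off a reachability plot can also be sketched in Python. The sketch assumes the reachability distances are already listed in OPTICS processing order, uses math.inf for an undefined reachability (such as that of the first sequence processed), and marks noise with None; the function name and the example values are hypothetical and do not come from the poster's figures.

```python
import math


def extract_clusters(reachability, cutoff):
    """Assign a cluster label to each position of an OPTICS ordering.

    A sequence above the cutoff starts a new cluster if the next
    sequence is at or below the cutoff (it opens a valley); otherwise
    it is noise (None). Sequences at or below the cutoff join the
    valley that is currently open.
    """
    labels = []
    current = None      # label of the open valley, if any
    next_label = 0
    for i, r in enumerate(reachability):
        if r > cutoff:
            following = reachability[i + 1] if i + 1 < len(reachability) else math.inf
            if following <= cutoff:
                current = next_label        # this sequence starts a valley
                next_label += 1
                labels.append(current)
            else:
                current = None              # isolated peak: noise
                labels.append(None)
        else:
            if current is None:             # valley with no starting peak
                current = next_label
                next_label += 1
            labels.append(current)
    return labels


# Made-up reachability values, cut at 0.860 as in the figure 2 example.
reach = [math.inf, 0.2, 0.3, 0.95, 0.9, 0.1, 0.2, 0.99]
print(extract_clusters(reach, 0.860))
# [0, 0, 0, None, 1, 1, 1, None]
```

Raising or lowering the cutoff merges or splits valleys, which is the kind of interactive parameter choice the reachability plot is meant to support.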