SEQOPTICS: PROTEIN SEQUENCE CLUSTERING WITH OPTICS
Yonghui Chen (Advisors: Kevin Reilly and Alan Sprague)
Department of Computer and Information Sciences, The University of Alabama at Birmingham
http://www.cis.uab.edu/chenyh

ABSTRACT

Protein sequence clustering is widely used in the analysis of protein structure and function. We demonstrate an approach to protein clustering, SEQOPTICS, based on OPTICS (Ordering Points To Identify the Clustering Structure). The approach is attractive for its emphasis on visualization of results and its support for interactive work, e.g., in choosing parameters. SEQOPTICS results are presented for four data sets from diverse data sources, and visualization of the sequence clustering structure is demonstrated for all four. We evaluated our system against existing methods; judged by the Jaccard coefficient, it performed better.

FUTURE WORK

According to this report, SEQOPTICS has proved its value for small data sets (<1000 sequences). To apply the method to larger data sets, such as an entire protein sequence database, several improvements would make it more effective: 1) use another measure of protein sequence distance, e.g., one based on BLAST or FASTA; 2) apply parallel computing tools, for example the Message Passing Interface (MPI); 3) implement different visualization techniques that accommodate data set size; 4) consider incremental cluster ordering schemes, since protein databases are growing rapidly.

RESULTS EVALUATION

To judge the biological accuracy of a resulting cluster set, we need to compare it to a “true” cluster set. However, there is no generally accepted “true” cluster set. For this study, we assumed that the original database clusters are the “real” clusters, as is done when testing most automatic protein clustering methods.
For example, we assume that the sequences from the glucokinase family of Pfam belong to the same cluster. Under this assumption, several statistics appear adequate to evaluate the results. As the criterion here, we used the Jaccard (similarity) coefficient, S, defined as:

S = a/(a+b+c)

where a counts “true positives,” i.e., the number of sequence pairs clustered together in both sets; b counts “false negatives,” i.e., the number of sequence pairs clustered together in the true cluster set but not in the current clustering solution; and c counts “false positives,” i.e., the number of sequence pairs clustered together in the current solution but not in the true cluster set. The Jaccard similarity value lies between 0 and 1, with larger values signifying better clustering.

METHODS

First, data sets were extracted from protein databases. Second, pairwise distances between all pairs of proteins were computed using a score based on a normalized Smith-Waterman algorithm. Third, the OPTICS algorithm was applied to cluster the sequences, and the results were analyzed via the Jaccard coefficient.

Figure 1: Overview of SEQOPTICS (stages shown: extracting data sets; computing pairwise distances; clustering and visualization; results analysis).

DATA SETS

Four data sets were extracted from different publicly available protein repositories, as shown in table 1: two from Pfam, one from Swiss-Prot, and one from NCBI. Each protein sequence is labeled according to the data set from which it originated; this labeling defines what we later use as the “true” cluster to which a sequence belongs.

COMPUTE DISTANCE

A pairwise Smith-Waterman local alignment score S(a,b) between two sequences a and b is computed first and then normalized to obtain SN(a,b):

SN(a,b) = S(a,b)/Min(S(a,a),S(b,b))

The distance between two protein sequences is then computed as:

Distance(a,b) = 1 - SN(a,b)

Distances range from 0 to 1, with 0 meaning identical sequences and 1 meaning totally different.
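The normalized distance can be illustrated with a minimal Python sketch. It is not the poster's implementation: the scoring scheme (match, mismatch, and linear gap values) and the function names are assumptions made for illustration, since the poster does not state which substitution matrix or gap penalties were used. The sketch also assumes the distance is taken as 1 - SN(a,b), the form that agrees with the stated range of 0 to 1.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Uses a toy scoring scheme (assumed, not from the poster):
    fixed match/mismatch scores and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def normalized_score(a, b):
    """SN(a,b) = S(a,b) / min(S(a,a), S(b,b)); sequences must be non-empty."""
    return smith_waterman_score(a, b) / min(
        smith_waterman_score(a, a), smith_waterman_score(b, b)
    )


def distance(a, b):
    """Distance(a,b) = 1 - SN(a,b), assumed form with range 0 to 1."""
    return 1.0 - normalized_score(a, b)


if __name__ == "__main__":
    print(distance("MKTAYIAK", "MKTAYIAK"))  # identical sequences
    print(distance("MKTAYIAK", "MKTAYIAR"))  # one differing residue
    print(distance("AAAA", "CCCC"))          # nothing in common
```

Dividing by the smaller self-score means a short sequence fully contained in a longer one gets distance 0, which is a property of this normalization rather than of the toy scoring scheme.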
(Diagram labels, data preparation: Pfam, Swiss-Prot, NCBI; data selection, labeling, reformatting; labeled data sets.)
(Diagram labels, distance computation: protein sequences; Smith-Waterman score; normalized Smith-Waterman; distance measure.)

OPTICS CLUSTERING

OPTICS (Ordering Points To Identify the Clustering Structure) orders data into a density-based clustering structure corresponding to a broad range of parameter settings (selected by the user). A reachability plot (a bar chart) shows each object’s reachability distance in the order in which the objects were processed, and so displays the data’s cluster structure. There are two main advantages to applying OPTICS in protein sequence clustering analysis: 1) OPTICS can find local density regions; 2) OPTICS produces an augmented ordering of the database that represents its density-based clustering structure, and this ordering can be visualized in the reachability plot.

The final clusters can be extracted from the plot by employing either a cutoff value or a steepness criterion. In this study, density-based clusters were extracted by using a cutoff value. For example, in figure 2, the cutoff value is set to 0.860 (see the line at reachability distance 0.860). Under this cutoff regime, each valley in figure 2 between two sequences with reachability distance higher than the cutoff identifies a cluster. The sequence that starts a valley, although its reachability distance is higher than the cutoff, belongs to the same cluster as the remaining sequences in the valley. Any sequence with reachability distance higher than the cutoff is noise if it does not start a new valley. Therefore, in figure 2, there are four clusters identified. Similarly, using the cutoff values in the paper, there are four clusters in figure 3, six in figure 4, and four in figure 5.

EXPERIMENTAL RESULTS

SEQOPTICS was applied to cluster the data sets.

1. Visualization of the cluster structure: We made a reachability distance plot for each data set.
In each figure, the horizontal axis represents the ordering of the sequences, the vertical axis represents the reachability distance, and each valley stands for a cluster. From the figures, we can see that each valley contains sequences from exactly one family.

2. Extraction of the clusters: The final density-based clusters were extracted by using a cutoff value. For example, in figure 2, the cutoff value is set to 0.860; other values may be chosen, e.g., 0.745 in figure 3.

Fig. 2 (data set 1): There are five valleys. The first two valleys are composed of sequences from cytochrome B562; the third valley consists of sequences from glucokinase; the fourth valley contains sequences from the GABAR family; the fifth valley contains sequences from the bac globin family.

Fig. 3 (data set 2): There are three valleys. The first valley is composed of sequences from bac globin; the second valley is composed of sequences from the band3 family; the third valley contains only sequences from IGA1.

Fig. 4 (data set 3): There are six valleys. The first and last valleys contain only cytoC sequences; the second valley contains only sequences from GABAR; the third valley contains GAPDH sequences; the fourth valley contains GPCR sequences; the fifth valley contains only GFAT.

Fig. 5 (data set 4): There are four main valleys. The first valley contains only casein kappa sequences; the second and third valleys contain only globins; the fourth valley is composed of GAPDHs.

We also clustered the same data sets with two existing clustering methods, blastclust and BAG, using their default parameters, and compared our results with theirs via the Jaccard coefficient. The comparison is shown in table 2. From this table, we see that SEQOPTICS agrees well with each original cluster set and, moreover, that it outperforms BAG and blastclust on all the data sets. Meanwhile, blastclust outperforms BAG on two data sets and underperforms it on the other two.
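The Jaccard coefficient used for this comparison, S = a/(a+b+c), can be computed by counting sequence pairs. The following Python sketch is illustrative only. The function name, the use of None to mark a sequence left unclustered (noise), and the value returned when no pair is co-clustered in either set are assumptions, not details given in the poster.

```python
from itertools import combinations


def jaccard_coefficient(true_labels, pred_labels):
    """Pair-counting Jaccard coefficient S = a / (a + b + c).

    true_labels[i] and pred_labels[i] are the cluster labels of sequence i
    in the "true" cluster set and in the clustering under evaluation.
    A label of None (assumed convention) marks an unclustered sequence,
    which is never counted as clustered together with anything.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")

    a = b = c = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] is not None and true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] is not None and pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            a += 1  # together in both sets ("true positive")
        elif same_true:
            b += 1  # together only in the true set ("false negative")
        elif same_pred:
            c += 1  # together only in the current solution ("false positive")

    total = a + b + c
    # Assumed convention: with no co-clustered pair anywhere, report 1.0.
    return a / total if total else 1.0


if __name__ == "__main__":
    truth = ["glucokinase"] * 3 + ["cytoC"] * 2
    solution = [0, 0, 1, 1, 1]
    print(jaccard_coefficient(truth, solution))
```

The loop visits every pair, so the cost grows quadratically with the number of sequences; that is acceptable for data sets of the size reported here (<1000 sequences).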
Overall, SEQOPTICS seems a promising method in terms of both clustering quality and its graphical representation.

(Diagram labels, evaluation: data sets; BAG, blastclust, SEQOPTICS; compare the results by Jaccard coefficient.)
(Diagram labels, clustering: distance matrix; reachability plot; cluster extraction.)
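The cutoff-based extraction of clusters from a reachability plot, as described in the text, can be sketched in Python as follows. This is one illustrative reading of that rule, not the poster's code: the function name and the sample reachability values are hypothetical, and the input is assumed to be the list of reachability distances already arranged in OPTICS order, with the first value infinite (undefined).

```python
import math


def extract_clusters(reachability, cutoff):
    """Split an OPTICS ordering into clusters and noise using a cutoff.

    reachability[i] is the reachability distance of the i-th sequence in
    the cluster ordering. Returns (clusters, noise), both holding positions
    in the ordering.
    """
    clusters, noise = [], []
    current = None  # cluster for the valley being traversed, if any
    n = len(reachability)
    for i, r in enumerate(reachability):
        if r > cutoff:
            # A bar above the cutoff ends any valley in progress.
            current = None
            starts_valley = i + 1 < n and reachability[i + 1] <= cutoff
            if starts_valley:
                # The sequence starting a valley joins that valley's cluster.
                current = [i]
                clusters.append(current)
            else:
                noise.append(i)
        else:
            if current is None:
                # Defensive case: ordering begins below the cutoff.
                current = []
                clusters.append(current)
            current.append(i)
    return clusters, noise


if __name__ == "__main__":
    # Hypothetical reachability values, not taken from the poster's figures.
    reach = [math.inf, 0.2, 0.3, 0.9, 0.1, 0.2, 0.95, 0.99, 0.4]
    clusters, noise = extract_clusters(reach, cutoff=0.860)
    print(clusters)
    print(noise)
```

Because the ordering is computed once, different cutoff values can be tried interactively without rerunning the clustering, which matches the poster's emphasis on choosing parameters interactively.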
