Presentation on theme: "ProtoNet Automatic Hierarchical Classification of Proteins Brief Database overview December 2003."— Presentation transcript:
ProtoNet Automatic Hierarchical Classification of Proteins Brief Database overview December 2003
ProtoNet Database overview 2 Unix servers: Red Hat Linux 7.1,FreeBSD 4.7 the first serves as a web server the second as a database server Each server is dual-PIII 1000Mhz w/ 2GB RAM Database engine: PostgreSQL 7.2 Present Configuration
ProtoNet Database overview Disks situation:130GB (w/ mirror), 265GB (RAID 5) Both at critical utilization levels (93%) Present Configuration
ProtoNet Database overview Database Figures: Around 150 tables, consuming ~160GB 2 major protein (sequence) sources 3 protein motif sources 8 keyword sources Multiple versions of each source are kept and handled under release management (3 releases so far) – to be mentioned later 7 clusterings (cluster-tree setups)
Research Satellite systems Web Interface AVA Blast ProtoNet Database overview Database setup process: KeywordsMotif SystemsAnnotation SystemsProteins Clustering Annotated Clustering Clusters Tree
ProtoNet Database overview Building the PROTEINS table: Proteins Protein Sources: Swiss-Prot (~130K sequences) TrEMBL (~950K sequences) Each protein record comes with many attributes: Swiss-Prot ID + Accession numbers Description Amino-Acid sequence + Molecular Weight + Length Keywords + Motifs (features) Linkage information to external databases (PDB, Taxonomy, Publication journals, etc) and more… We also calculate additional attributes, such as the theoretical Iso-Electric point (based on AA sequence).
AVA Blast ProtoNet Database overview Proteins Protein Sources: Swiss-Prot (~130K sequences) TrEMBL (~950K sequences) Each pair of Swiss-Prot proteins is checked for its BLAST e-score, which serves as a distance measure The set of distance measures between all proteins serves as the base metric for the clustering process (to be mentioned later) TrEMBL sequences are ignored at this stage, both due to space and computation limitations and also due to potential noise they would probably inflict on the clustering process (we haven’t had the chance to actually test this hypothesis) Building the PROTEINPROTEIN table:
AVA Blast ProtoNet Database overview Proteins The clustering process: All initial proteins are defined as singleton clusters Closest pair of clusters is merged into a new cluster (distance between clusters is the average distance of all of their proteins) The distance between the merged clusters induces a time measure, representing their “death-time” which is equal to the new-born cluster’s “birth-time” Each cluster gets additional attributes, such as: size, depth, height, average length of proteins, and more… The process repeats until a distance-cutoff is met The overall result is the clusters tree Building the CLUSTERS table: Clustering Clusters Tree
AVA Blast ProtoNet Database overview Proteins The overall result is the clusters tree Building the CLUSTERS table: Clustering Clusters Tree clustid fatherid size height depth birthtime deathtime …
ProtoNet Database overview The overall result is the clusters tree Building the CLUSTERS table: Clusters Tree clustid fatherid size height depth birthtime deathtime …
ProtoNet Database overview Building the CLUSTERS table: Clusters Tree At this point TrEMBL comes into the picture: Each TrEMBL sequence is “hung” a- posterior on the tree, using our Classify-Your-Protein algorithm. This algorithm either finds a suitable cluster or declares the TrEMBL sequence as a singleton. Note: Leaves are no longer singletons! clustid fatherid size height depth birthtime deathtime … We also calculate and attach additional attributes to each cluster, such as: amountofclusters_atbirth, amountofsingletons_atbirth averlength, varlength averdepth, vardepth dfsnum, dfsnum_ofnextnondesc subtreesize, lifetime, … Several DFS-based orderings of the clusters in the tree are kept, and allow efficient (and elegant!) way to query the clusters hierarchy.
ProtoNet Database overview KeywordsMotif SystemsAnnotation Systems Extracting annotation from external sources: Notes: External databases are downloaded from the Internet in varying file formats (all sorts of strange ASCII formats, XML, …) Generally speaking, the more “Biological” the database is, the more peculiar is its format Most biological databases are indexed uniquely, but not all with an integer Being dynamic and developing, the external databases tend to change their format every now and then! PERL Parsing scripts were written for each external source…
ProtoNet Database overview Motif Sources:InterPro (PRINTS, Pfam, PROSITE, ProDom, Smart, TIGRFAMs, PIR SuperFamily, SuperFamily) Swiss-Prot (FEATUREs) TrEMBL (FEATUREs) Each motif record comes with the following attributes: ID in external system Description Starting AA position Ending AA position Motif Systems Building the MOTIFS + PROTEINMOTIF tables:
ProtoNet Database overview Motif Systems Keyword Sources:Swiss-Prot, TrEMBL, Enzyme, PDB, SCOP, GO, NCBI Taxonomy, InterPro Biological protein properties are extracted from each database, and are stored in suitable tables (proteinec, proteinpdb, pdbscop, proteinspecies,...) Keywords are generated for each protein from its associations to these biological properties These keywords are also grouped into keyword-types Statistics (quantity/frequency) are calculated for each keyword and keyword-type, according to the proteinkeyword table Annotation Systems Building the KEYWORDS + PROTEINKEYWORD tables:
AVA Blast ProtoNet Database overview KeywordsMotif SystemsAnnotation SystemsProteins Clustering Annotated Clustering Clusters Tree Combining annotations with the clustering
ProtoNet Database overview Combining annotations with the clustering In the process of combining the annotations with the clustering, we “forget” where each keyword came from. Statistics are calculated for each keyword in the clustering, and for selected cluster-keyword pairs ( TP, FN, specificity, sensitivity, significance measures). Several tables are maintained for this stage, some of which are huge: clusterings, clusteringprotein, clusteringkeyword, clusteringkeytype, clusters, clusterkeyword Annotated Clustering Space/Time limitations enforce us to be selective with regards to: 1. Which keyword-types enter this phase in the first place 2. Which cluster-keyword pairs are analyzed
ProtoNet Database overview Release management RELEASE is a set of clustering-trees and protein / motif / keyword sources, each of a certain version. Some of the releases were built explicitly for the web users. A default clustering of the most updated release is exposed to the standard user. Advanced users can switch between the various releases and clustering trees. In order to allow multiple releases, we keep more than one version of each of the external (imported) databases + clustering trees. This is also important for research work (that can also be comparative).
ProtoNet Database overview Release management In order to efficiently manage multiple versions (of external databases), each record contains a special number field which represents a bits-vector. Bit in position i is turned ON if (and only if) the record is part of version i of the corresponding source. This serves as a CONDENSATION for the data in the tables, and was introduced due to our space problems. For example, if a protein-sequence record appears (exactly the same) in versions 2 and 3 of Swiss-Prot, the bits-vector integer is equal to 6 (00000110 2 ) This bits-vector field is unnecessary for the clustering tables, since each clustering is associated to a release (+versions) and the tuples are (generally) non-repetitive. This implies having NO CONDENSATION in the cluster-related tables, and hence they consume A LOT of disk space (especially the cluster-keyword table).
ProtoNet Database overview Problems… ProtoNet’s database is a HUGE setup, which seems to be beyond our technical capabilities. We are limited in: 1.Men power: We starve for a steady, experienced database and system developers team, dedicated solely to this project (rather than preoccupied research students). This is the only way this project can fulfill its objective – to be an up-to-date homepage for Proteomics! 2.Disk space 3.CPU power: Both for serving outside users and for quicker release upgrades. 4.Robust, trustworthy and efficient database engine (PostgreSQL seems to fail on all three categories!)