Rule Visualization of Protein Motif Sequence Data for Secondary Structure Prediction - An Overview Leong Lee, Ph.D. University of Missouri (MS&T) Visiting.


1 Rule Visualization of Protein Motif Sequence Data for Secondary Structure Prediction - An Overview Leong Lee, Ph.D. University of Missouri (MS&T) Visiting Assistant Professor, Dept of Computer Science University of North Carolina at Greensboro

2 Introduction Molecular Biology: A Brief Introduction Central Dogma of Biology Protein Structure Prediction: A Brief Introduction Protein Secondary Structure Prediction Problem Related Work Rule-Based RT-RICO BLAST-RT-RICO RT-RICO Rule Generation Algorithm Rule Visualization of Protein Motif Sequence Data Conclusion References, More Related Work, Detailed RT-RICO 2

3 What is life made of ? What are living organisms made of ? 3

4 Molecular Biology: A Brief Introduction What is life made of? Organisms are made of cells A great diversity of cells exist in nature, but they have some common features (Jones and Pevzner, 2004) – Born, eat, replicate, and die – A cell would be roughly analogous to a car factory 4

5 Molecular Biology: A Brief Introduction There are two types of cells: – Eukaryotic cells (DNA in a nucleus, most multicellular organisms like flies or humans) – Prokaryotic cells (DNA not in nucleus, most unicellular organisms like bacteria) (Jones and Pevzner, 2004) Image from Science Primer (National Center for Biotechnology Information) 5

6 Molecular Biology: A Brief Introduction All life on this planet depends mainly on three types of molecules: DNA, RNA, and proteins A cell’s DNA holds a library describing how the cell works RNA acts to transfer short pieces of information to different places in the cell, smaller volumes of information are used as templates to synthesize proteins Proteins perform biochemical reactions, send signals to other cells, form body’s components, and do the actual work of the cell. (Jones and Pevzner, 2004) 6

7 Central Dogma of Biology
DNA --> transcription --> RNA --> translation --> protein
This is referred to as the central dogma of molecular biology (Jones and Pevzner, 2004)
DNA sequence determines protein sequence
Protein sequence determines protein structure
Protein structure determines protein function
Regulatory mechanisms deliver the right amount of the right function to the right place at the right time (Lesk, 2008)

8 Molecular Biology: A Brief Introduction
DNA: the structure and the four genomic letters code for all living organisms; double helix structure; can replicate
Adenine, Guanine, Thymine, and Cytosine, which pair A-T and C-G on complementary strands (chemically attached) (Jones and Pevzner, 2004)

9 Molecular Biology: A Brief Introduction
Cell Information: instruction book of life
DNA/RNA: strings written in a four-letter nucleotide alphabet (A C G T/U)
Protein: strings written in a 20-letter amino acid alphabet
Example: the transcription of DNA into RNA, and the translation of RNA into a protein (Jones and Pevzner, 2004)
DNA: TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT
RNA: AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA
Protein: Met Ala Pro Ile Met Thr Val Leu Pro Stop
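The DNA-to-RNA-to-protein example on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the DNA strand shown is read as the template strand (so each base is replaced by its RNA complement). The codon table covers only the codons that occur in the example, and the function names are illustrative.

```python
# Complement applied when the given DNA strand is treated as the template strand.
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

# Partial codon table: only the codons used in the slide's example.
# A complete genetic code table has 64 entries.
CODON_TABLE = {
    "AUG": "Met", "GCG": "Ala", "CCG": "Pro", "AUA": "Ile",
    "ACG": "Thr", "GUC": "Val", "CUU": "Leu", "CCU": "Pro", "UGA": "Stop",
}

def transcribe(dna_template):
    """Map each DNA base of the template strand to its RNA complement."""
    return "".join(TEMPLATE_TO_RNA[base] for base in dna_template)

def translate(rna):
    """Split the RNA into consecutive codons and look each one up."""
    codons = [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]
    return [CODON_TABLE[codon] for codon in codons]

dna = "TACCGCGGCTATTACTGCCAGGAAGGAACT"
rna = transcribe(dna)
protein = translate(rna)
print(rna)
print(" ".join(protein))
```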

10 Molecular Biology: A Brief Introduction Genetic code, from the perspective of mRNA. AUG also acts as a “start” codon 10

11 Protein Structure Prediction : A Brief Introduction 3D structure of pepsin (PDB ID: 1PSN) 11 >1PSN:A|PDBID|CHAIN|SEQUENCE VDEQPLENYLDMEYFGTIGIGTPAQDFTV VFDTGSSNLWVPSVYCSSLACTNHNRFN PEDSSTYQSTSETVSITYGTGSMTGILGYD TVQVGGISDTNQIFGLSETEPGSFLYYAPF DGILGLAYPSISSSGATPVFDNIWNQGLVS QDLFSVYLSADDQSGSVVIFGGIDSSYYTG SLNWVPVTVEGYWQITVDSITMNGEAIA CAEGCQAIVDTGTSLLTGPTSPIANIQSDI GASENSDGDMVVSCSAISSLPDIVFTING VQYPVPPSAYILQSEGSCISGFQGMNLPT ESGELWILGDVFIRQYFTVFDRANNQVGL APVA

12 Protein Structure Prediction : A Brief Introduction Genomic projects provide us with the linear amino acid sequence of hundreds of thousands of proteins If only we could learn how each and every one of these folds in 3D… Malfunctioning of proteins is the most common cause of endogenous diseases Most life-saving drugs act by interfering with the action of foreign protein So far, most drugs have been discovered by trial-and-error Our lack of understanding of complex interplay of proteins – drugs might not be aimed at best target, side-effects (Tramontano, 2006) 12

13 Protein Structure Prediction : A Brief Introduction Experimental methods can provide us the precise arrangement of every atom of a protein – X-ray crystallography and NMR spectroscopy X-ray crystallography requires protein or complex to form a reasonably well ordered crystal, a feature that is not universally shared by proteins NMR spectroscopy needs proteins to be soluble and there is a limit to the size of protein that can be studied Both are time consuming techniques, we cannot hope to use them to solve the structures of all proteins in the universe in the near future Problem: How to relate the amino acid sequence of a protein to its 3D structure 13

14 Background – Protein Primary Structure Protein primary structures are chains of amino acids 20 amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} – 1san:A – MTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALSLTERQIKIWFQNRRMKWKKENKTKGEPG 14 Image Author:National Human Genome Research Institute (NHGRI)

15 Background - Protein Secondary Structure
Secondary structure is normally defined by hydrogen bonding patterns
Amino acids vary in ability to form various secondary structure elements
8 types of secondary structure defined: {G, H, I, T, E, B, S, -}
>1SAN:A:sequence
MTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALSLTERQIKIWFQNRRMKWKKENKTKGEPG
>1SAN:A:secstr
----HHHHHHHHHHHHH-SS--HHHHHHHHHHHT--SHHHHHHHHHHHHTTTTTS-TT-S--
Image Author: Carl Fürstenberg. Alpha helices are shown in colour and random coil in white; no beta sheets are shown.

16 Protein Secondary Structure Prediction - Motivation Important research problem in bioinformatics / biochemistry Of high importance for design of drugs and novel enzymes Determination of protein structures by experimental methods is lagging far behind discovery of protein sequences Predicting protein tertiary structure is an even more challenging problem, but more tractable if using simpler secondary structure definitions; focus for current research (tertiary structure of a protein is its three-dimensional structure, as defined by the atomic coordinates) 16

17 Protein Secondary Structure Prediction Problem Description
Input (Baldi et al., 2000)
– Amino acid sequence, $A = a_1, a_2, \ldots, a_N$
– Data for comparison, $D = d_1, d_2, \ldots, d_N$
– $a_i$ is an element of the set of 20 amino acids, {A,R,N…V}
– $d_i$ is an element of the set of secondary structures, {H,E,C}, representing helix H, sheet E, and coil C
Output
– Prediction result: $M = m_1, m_2, \ldots, m_N$
– $m_i$ is an element of the set of secondary structures, {H,E,C}
3-Class Prediction (Zhang and Zhang, 2003)
– Multi-class prediction problem with 3 classes {H,E,C}, in which one obtains a 3 x 3 confusion matrix $Z = (z_{ij})$

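18 Protein Secondary Structure Prediction Problem Description
3 x 3 matrix (3 classes); rows are reality (H, E, C), columns are prediction (H, E, C):
$$Z = \begin{pmatrix} z_{11} & z_{12} & z_{13} \\ z_{21} & z_{22} & z_{23} \\ z_{31} & z_{32} & z_{33} \end{pmatrix}$$
$z_{ij}$: number of inputs predicted to be in class j while in reality belonging to class i
$Q_{total} = 100 \sum_i z_{ii} / N$ (percentage)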
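A minimal sketch of how the confusion matrix and the overall accuracy described on this slide could be computed from two aligned strings over {H, E, C}. The function names and the six-residue sample strings are made up for illustration. Rows index the real class and columns the predicted class, following the definition above.

```python
CLASSES = "HEC"  # helix, sheet, coil

def confusion_matrix(actual, predicted):
    """Return z where z[i][j] counts residues of real class i predicted as class j."""
    index = {c: i for i, c in enumerate(CLASSES)}
    z = [[0] * len(CLASSES) for _ in CLASSES]
    for real, pred in zip(actual, predicted):
        z[index[real]][index[pred]] += 1
    return z

def q_total(z):
    """Percentage of residues on the diagonal (correctly predicted)."""
    n = sum(sum(row) for row in z)
    return 100.0 * sum(z[i][i] for i in range(len(z))) / n

# Hypothetical six-residue example, not taken from the slides.
z = confusion_matrix("HHEECC", "HEEECC")
print(z, q_total(z))
```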

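19 $Q_3$ Score
$Q_3 = W_{\alpha\alpha} + W_{\beta\beta} + W_{cc}$
$W_{\alpha\alpha}$ = % of all residues that are helix and are correctly predicted as helix
$W_{\beta\beta}$ = % of all residues that are sheet and are correctly predicted as sheet
$W_{cc}$ = % of all residues that are coil and are correctly predicted as coil
Example of $Q_3$ calculation
Protein: 10% helices, 10% sheets, 80% coils
Prediction: 100% coils
$Q_3$ = 0% + 0% + 80% = 80%, so a predictor that always answers "coil" still scores 80% on this protein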

20 $Q_3$ Score
$Q_3 = W_{\alpha\alpha} + W_{\beta\beta} + W_{cc}$
$W_{\alpha\alpha}$ = % of helices correctly predicted
$W_{\beta\beta}$ = % of sheets correctly predicted
$W_{cc}$ = % of coils correctly predicted
Example of $Q_3$ calculation, length 10
Amino acid (primary structure) sequence (A): MTYTRYQTLE
(Secondary structure) data for comparison (D): HHHEEECCCC
(Secondary structure) prediction (M): HHEEECCCCC
$Q_3$ = 2/10 + 2/10 + 4/10 = 0.80
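The two worked examples can be checked position by position. The sketch below computes each term as the fraction of all residues that belong to a class and are predicted as that class, which is the reading under which both examples give 0.80. The helper name `q3_terms` is illustrative.

```python
def q3_terms(actual, predicted):
    """Per-class contribution to Q3: correctly predicted residues of a class / total length."""
    n = len(actual)
    return {
        c: sum(a == p == c for a, p in zip(actual, predicted)) / n
        for c in "HEC"
    }

# Length-10 example: D = HHHEEECCCC, M = HHEEECCCCC.
terms = q3_terms("HHHEEECCCC", "HHEEECCCCC")
q3 = sum(terms.values())

# All-coil prediction for a protein with 10% helix, 10% sheet, 80% coil.
baseline = sum(q3_terms("HECCCCCCCC", "CCCCCCCCCC").values())
print(terms, q3, baseline)
```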

21 Related Work Rost (2003) classifies protein secondary structure prediction methods into 3 generations First generation methods depend on single residue statistics to perform prediction Second generation methods depend on segment statistics Third generation methods use evolutionary information to predict secondary structure; e.g., PHD (Rost and Sander, 1993a) One of the best secondary structure predictors is the PSIPRED Protein Structure Prediction Server (Jones, 1999) ; uses a two-stage neural network, based on position-specific scoring matrices. Recently, trend to use support vector machine (SVM) to predict protein secondary structures 21

22 Related Work
Levitt and Chothia (1976) proposed to classify proteins into 4 basic types according to their α-helix and β-sheet content
– “All-α” class proteins consist almost entirely (at least 90%) of α-helices
– “All-β” class proteins are composed mostly of β-sheets (at least 90%)
– “α/β” class proteins have alternating, mainly parallel segments of α-helices and β-sheets
– “α+β” class proteins have a mixture of all-α and all-β regions, mostly in sequential order

23 Related Work
Fadime, Özlem, and Metin (2008) used a different 2-stage method; $Q_3$ 74.1% (different test dataset)
First stage determines class of unknown proteins with 100% accuracy
Second stage uses probabilistic approach
Simplifies problem: given a protein amino acid sequence, if it can be determined which one of the 4 classes the protein belongs to, other approaches can be applied to predict the secondary structure elements within the 4 classes
Shows there are statistical relationships between a secondary structure element and its neighboring amino acid residues

24 Related Work
Not easy to evaluate performance of a protein secondary structure prediction method (e.g., different datasets used for training and testing)
Rost and Sander (1993a) selected a list of 126 protein domains (RS126); now constitutes comparative standard
Cuff and Barton (1999) described development of non-redundant test set of 396 protein domains (CB396)
PHD, one of the first methods surpassing the 70% accuracy threshold, uses multiple sequence alignments as input to a neural network (Rost and Sander, 1993b)

25 Related Work PHD effectively utilizes evolutionary information by exploiting the well-known fact that homologous proteins have similar 3D structures Random mutations in DNA sequence can lead to different amino acids in the protein sequences Mutations resulting in a structural change are not likely to retain protein function; thus, structure more conserved than sequence (Rost, 2003) Rost (2003) also has stated that a value of around 88% likely will be the operational upper limit for prediction accuracy 25 In evolutionary biology, homology refers to any similarity between characteristics of organisms that is due to their shared ancestry. Homology among proteins and DNA is often concluded on the basis of sequence similarity, especially in bioinformatics. For example, in general, if two or more genes have highly similar DNA sequences, it is likely that they are homologous. But sequence similarity may also arise without common ancestry:

26 $Q_3$ Scores of Secondary Structure Prediction Methods
Columns on the slide: Methods; RS126 Test Dataset; CB396 Test Dataset; Other Test Datasets
PHD: 73.5%, 71.9%
DSC: 71.1%, 68.4%
PREDATOR: 70.3%, 68.6%
NNSSP: 72.7%, 71.4%
CONSENSUS: 74.8%, 72.9%
Fadime, 2-stage: 74.1%
PSIPRED: 78.3%
Hu, SVM: 78.8%
Kim, SVMpsi: 76.1%, 78.5%
Nguyen, 2-stage SVM: 78.0%, 76.3%

27 Q 3 Scores of Secondary Structure Prediction Methods Due to differences in approaches, data availability, and test design strategies, difficult to directly compare different methods’ prediction results Q 3 scores comparison should be used as general guide, not strict percentile comparison Q 3 scores under “Other Test Datasets” column should NOT be directly compared (uses different test datasets) 27

28 Background - RT-RICO We developed a rule-based secondary structure prediction method called RT-RICO Paper 1: Rule-based RT-RICO: improvements to the prediction algorithm; RS126 Q 3 score 81.75%, CB396 Q 3 score 79.19% (Lee, Leopold, Kandoth and Frank, 2010b) Paper 2: BLAST-RT-RICO: modified method BLAST-RT-RICO; RS126 Q 3 score 89.93%, CB396 Q 3 score 87.71% (Lee, Leopold and Frank, 2011) Paper 3: Rule Visualization: modifications to an existing visualization technique are proposed in order to visualize and analyze the RT-RICO and BLAST-RT-RICO association rules (Lee, Leopold, Edgett and Frank, 2010d) 28

29 Rule-Based RT-RICO (Paper 1) RT-RICO Step 1 All protein names and corresponding folding types of each protein retrieved from the SCOP database (Andreeva et al., 2008) All available corresponding protein sequences and secondary structure sequences obtained from PDB database (Berman et al., 2000) 5 databases of protein domains (with their amino acid sequences and secondary structure sequences) of different protein domain types (e.g., “all-α”, “all-β”, “α/β”, “α+β” and “others”) built Proteins from test datasets (RS126 or CB396) first removed; Protein domains from different protein families selected to form training datasets 29

30 Rule-Based RT-RICO (Paper 1), Step 1 30

31 Rule-Based RT-RICO (Paper 1) Step 1 Data Preparation
RT-RICO Step 1
Protein secondary structure sequences from PDB are formed from 8 states of secondary structure, {H, G, I, E, B, T, S, -}
The 8 states are converted to 4 states to facilitate rule generation (the final $Q_3$ calculation uses 3 states):
– (G, H, I) => Helix H
– (E, B) => Sheet E
– (T, S) => Coil C
– (-) => “-”
Klepeis and Floudas (2002): use of overlapping segments of 5 residues is effective in predicting the helical segments of proteins
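The 8-state to 4-state reduction is a plain character mapping, sketched below. The further reduction to 3 states for the final score is not spelled out on the slide; treating “-” as coil C is an assumption made here for illustration only.

```python
# Mapping stated on the slide: (G, H, I) -> H, (E, B) -> E, (T, S) -> C, "-" stays "-".
EIGHT_TO_FOUR = {
    "G": "H", "H": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C",
    "-": "-",
}

def to_four_state(secstr):
    """Convert an 8-state secondary structure string to the 4-state alphabet."""
    return "".join(EIGHT_TO_FOUR[s] for s in secstr)

def to_three_state(four_state):
    """ASSUMPTION: '-' is counted as coil when reducing to {H, E, C}."""
    return four_state.replace("-", "C")

# 1SAN chain A secondary structure string from the earlier slide.
san = "----HHHHHHHHHHHHH-SS--HHHHHHHHHHHT--SHHHHHHHHHHHHTTTTTS-TT-S--"
san4 = to_four_state(san)
san3 = to_three_state(san4)
print(san4)
print(san3)
```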

32 Rule-Based RT-RICO (Paper 1) Step 1 Data Preparation 32

33 Rule-Based RT-RICO (Paper 1) Step 1 Data Preparation 33

34 Rule-Based RT-RICO (Paper 1) Step 2 Rule Generation 34 RT-RICO generate rules

35 Rule-Based RT-RICO (Paper 1) Step 2 Rule Generation 35

36 Rule-Based RT-RICO (Paper 1) Step 3 Prediction 36 Loads protein primary structures from test dataset Predicts secondary structure elements

37 Rule-Based RT-RICO (Paper 1) Step 3 Prediction Each of these segments compared with generated rules; first searched for matching rules with 100% confidence value If no matching rule existed among 100% confidence value rules, searched for other matching rules (with confidence values ≥ 90%, but < 100%) Secondary structure element with highest total support value selected as predicted secondary structure element for the specific position If no matching rule found for the segment at all, secondary structure of the previous position used as predicted secondary structure 37
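The lookup order described above can be sketched as follows. The rule representation is hypothetical: each rule is a tuple `(pattern, decision, confidence, support)`, where `pattern` has one entry per segment position and `None` means the position is not used by the rule. This is an illustration of the described steps, not the authors' implementation.

```python
from collections import defaultdict

def matches(pattern, segment):
    """A rule matches when every position it constrains agrees with the segment."""
    return all(p is None or p == s for p, s in zip(pattern, segment))

def predict_position(segment, rules, previous, low=0.9):
    """Predict one secondary structure element for a 5-residue segment."""
    hits = [r for r in rules if matches(r[0], segment)]
    # Prefer rules with 100% confidence; otherwise fall back to [low, 100%).
    tier = [r for r in hits if r[2] == 1.0] or [r for r in hits if low <= r[2] < 1.0]
    if not tier:
        return previous  # no matching rule: reuse the previous position's prediction
    support = defaultdict(int)
    for _pattern, decision, _confidence, sup in tier:
        support[decision] += sup
    return max(support, key=support.get)  # element with the highest total support

# Hypothetical rules for illustration.
rules = [
    (("A", "L", None, None, None), "H", 1.0, 5),
    ((None, None, "E", None, None), "E", 1.0, 3),
    ((None, None, "E", "K", None), "E", 1.0, 4),
    ((None, None, None, None, "G"), "C", 0.95, 50),
]
print(predict_position("ALEKA", rules, previous="C"))
```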

38 Rule-Based RT-RICO (Paper 1) Step 3 Prediction 38

39 RT-RICO Rule Generation Algorithm (4 new definitions and 2 new algorithms)
Algorithm RT-RICO (Relaxed Threshold Rule Induction From Coverings) finds the set C of all relaxed coverings of R in S (and the related rules), with threshold probability t ($0 < t \le 1$), where S is the set of all attributes, and R is the set of all decisions.
The set of all subsets of the same cardinality k of the set S is denoted $P_k = \{\{x_{i1}, x_{i2}, \ldots, x_{ik}\} \mid x_{i1}, x_{i2}, \ldots, x_{ik} \in S\}$
Algorithm 2: RT-RICO
begin
  for each attribute x in S do compute [x]*;
  compute partition R*;
  k := 1;
  while $k \le |S|$ do
    for each set P in $P_k$ do
      if ($\prod_{x \in P} [x]^* \le_{r,t} R^*$) then
      begin
        find values of attributes from the entities that are in the region ($B \cap B'$) such that $(|B \cap B'| / |B|) \ge t$;
        add rule to output file;
      end
    k := k+1;
  end-while;
end-algorithm.
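To make the threshold test concrete, here is a deliberately simplified sketch of relaxed rule induction. It enumerates every attribute subset, groups entities into blocks B by their values on that subset, and emits a rule whenever the share of a block that falls into one decision block B' is at least t. It does not reproduce the relaxed-covering check of the algorithm above or any pruning; the function name and data layout are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def relaxed_rules(rows, t):
    """rows: tuples (attr_0, ..., attr_{n-1}, decision); t: threshold in (0, 1].

    Returns tuples (positions, values, decision, confidence, support).
    """
    n = len(rows[0]) - 1
    rules = []
    for k in range(1, n + 1):
        for positions in combinations(range(n), k):
            # Block B for each value combination, split by decision (B intersect B').
            blocks = defaultdict(Counter)
            for row in rows:
                blocks[tuple(row[p] for p in positions)][row[-1]] += 1
            for values, decisions in blocks.items():
                size = sum(decisions.values())           # |B|
                for decision, count in decisions.items():  # |B intersect B'|
                    if count / size >= t:
                        rules.append((positions, values, decision, count / size, count))
    return rules

# Tiny hypothetical dataset with two attributes and one decision.
rows = [("A", "L", "H"), ("A", "L", "H"), ("A", "G", "E"), ("K", "G", "E")]
strict = relaxed_rules(rows, 1.0)
relaxed = relaxed_rules(rows, 0.6)
print(len(strict), len(relaxed))
```

Lowering t from 1.0 to 0.6 admits one extra rule here (first attribute "A" implies "H" with confidence 2/3), which is the effect the relaxed threshold is meant to have.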

40 BLAST-RT-RICO (Paper 2) After Rule-Based RT-RICO (Paper 1), can we do better? Given input protein A (amino acid sequence, A = a 1, a 2, … a N ), protein BLAST search (Web-based) performed using A as query sequence BLAST returns list of proteins with significant sequence alignments Suitable proteins chosen to form training dataset for A RT-RICO algorithm generates rules from the training dataset; rules used to predict the secondary structure for protein A Output is predicted secondary structure sequence M BLAST-RT-RICO is accepted for publication 40

41 BLAST-RT-RICO (Paper 2) 41

42 BLAST-RT-RICO (Paper 2) Step 1 Online BLAST and PDB Data Match BLAST search (Web crawler program) performed using A as query sequence Returns list of proteins with significant sequence alignments and corresponding BLAST scores; proteins with score ≤ 30 removed from list (test protein A also removed) Some of these proteins may have corresponding secondary structure records in PDB (Berman et al., 2000) Those records retrieved, become inputs to next step, data preparation If a protein from the list does not have known secondary structure record in PDB, will require data from offline preprocessing 42
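The filtering step can be illustrated without calling BLAST itself. In this sketch the hit list is hypothetical data of the form `(protein_id, score)`, and the function name is an assumption. Only the two rules from the slide are applied: drop hits with score ≤ 30 and drop the test protein.

```python
def select_training_hits(hits, query_id, min_score=30):
    """Keep hits scoring above min_score, excluding the query protein itself."""
    return [
        (protein_id, score)
        for protein_id, score in hits
        if score > min_score and protein_id != query_id
    ]

# Hypothetical BLAST result list for a query labelled "QUERY".
hits = [("1ABC", 120), ("2DEF", 30), ("3GHI", 31), ("QUERY", 500)]
kept = select_training_hits(hits, query_id="QUERY")
print(kept)
```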

43 BLAST-RT-RICO (Paper 2)Step 1 Online BLAST and PDB Data Match 43

44 BLAST-RT-RICO (Paper 2) Step 2 Data Preparation (Maths content, may skip)
For test protein A, there is a set of protein primary structure sequences $B_i$ and a set of corresponding secondary structure sequences $C_i$, where $B_i \in \{B_1, B_2, B_3, B_4, \ldots, B_m\}$ and $C_i \in \{C_1, C_2, C_3, C_4, \ldots, C_m\}$
Primary structure sequence is $B_i = b_{i,1}, b_{i,2}, \ldots, b_{i,n_i}$
Corresponding secondary structure sequence is $C_i = c_{i,1}, c_{i,2}, \ldots, c_{i,n_i}$
$B_1$ to $B_m$ are not necessarily of the same length, because they represent different proteins
Each $b_{i,j}$ is an element of the set of 20 amino acids, {A,R,N…V}
$c_{i,j}$ is an element of the set of 8-state secondary structures, {H, G, I, E, B, T, S, -} (PDB); converted to an element of the set of 4-state secondary structures, {H, E, C, -}

45 BLAST-RT-RICO (Paper 2) Step 2 Data Preparation
For each secondary structure element, five “neighboring” amino acid residues are extracted to form a segment of 5 amino acid residues, plus 1 secondary structure element
The first and second positions at the beginning of the sequence are represented by 3 residues + 1 and 4 residues + 1 segments, respectively
These segments are used as input to RT-RICO to generate rules (as 6-tuples)

46 BLAST-RT-RICO (Paper 2) Step 2 Data Preparation (Maths content, may skip)
If $B_i$ is the primary structure sequence, $C_i$ is the secondary structure sequence shown in Fig. 2, and the length of the sequence(s) is $n_i$, then each 5-residue segment is of the form $b_{i,j-2}, b_{i,j-1}, b_{i,j}, b_{i,j+1}, b_{i,j+2}, c_{i,j}$, where j takes values from 3 to $(n_i - 2)$
This data preparation step is performed for all $B_i$ and $C_i$ pairs, where i is from 1 to m
These 5-residue segments are the main inputs to the RT-RICO rule generation algorithm
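A minimal sketch of the segment extraction, assuming a sliding window centred on each position and clipped at the sequence ends. The slides state the shortened 3- and 4-residue segments only for the first two positions; handling the last two positions symmetrically is an assumption, as is the function name.

```python
def extract_segments(primary, secondary, half_window=2):
    """Return (residue_window, secondary_structure_element) for every position."""
    if len(primary) != len(secondary):
        raise ValueError("primary and secondary sequences must have equal length")
    segments = []
    for j in range(len(primary)):
        lo = max(0, j - half_window)
        hi = min(len(primary), j + half_window + 1)
        segments.append((primary[lo:hi], secondary[j]))
    return segments

# First seven residues of 1SAN chain A with a made-up 4-state annotation.
segments = extract_segments("MTYTRYQ", "---HHHH")
for window, element in segments:
    print(window, element)
```

Interior positions give the full 6-tuple (five residues plus one element); the clipped windows at the ends carry fewer residues.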

47 BLAST-RT-RICO (Paper 2)Step 2 Data Preparation 47

48 BLAST-RT-RICO (Paper 2) Step 3 Rule Generation 48 Sample rules generated are as shown.

49 BLAST-RT-RICO (Paper 2)Step 3 Rule Generation 49

50 BLAST-RT-RICO (Paper 2) Step 4 Prediction 50 Finally RT-RICO loads protein primary structures from test data set (a single protein A for this case), and predicts secondary structure elements For each secondary structure element prediction position, 5 “neighboring” amino acid residues extracted to form segment of 5 amino acid residues

51 BLAST-RT-RICO (Paper 2) Step 4 Prediction same as Rule-Based RT-RICO (Paper 1) Step 3 Each of these segments compared with generated rules; first searched for matching rules with 100% confidence value If no matching rule existed among 100% confidence value rules, searched for other matching rules (with confidence values ≥ 90%, but < 100%) Secondary structure element with highest total support value selected as predicted secondary structure element for the specific position If no matching rule found for the segment at all, secondary structure of the previous position used as predicted secondary structure 51

52 BLAST-RT-RICO (Paper 2) Step 4 Prediction (Maths, may skip)
Output of prediction is a sequence of secondary structure elements $M = m_1, m_2, \ldots, m_N$, where each $m_i$ is an element of the set of 4-state secondary structures, {H,E,C,-}
$Q_3$ score calculation uses a 3-state decision attribute; $m_i$ is first converted to an element of {H,E,C} before the final $Q_3$ score calculation

53 BLAST-RT-RICO (Paper 2)Step 4 Prediction 53

54 BLAST-RT-RICO, Offline Preprocessing (future work needed here) If no protein with significant sequence alignments has corresponding known secondary structure sequence from PDB (answer is “no” in Fig. 1.), prediction for test protein needs to be handled slightly differently All proteins and corresponding secondary structure sequences from PDB downloaded to form initial dataset; test datasets (RS126 or CB396) removed; protein domains from different protein families selected to form training datasets Now have set of protein primary structure sequence B i and corresponding secondary structure sequence C i ; same data preparation, rule generation, and prediction steps applied 54

55 BLAST-RT-RICOOffline Preprocessing 55

56 RT-RICO Rule Generation Algorithm Note: most computationally intensive is rule generation, performed both in 3 rd step and during offline preprocessing 56

57 RT-RICO Rule Generation Algorithm (4 new definitions and 2 new algorithms)
Algorithm RT-RICO (Relaxed Threshold Rule Induction From Coverings) finds the set C of all relaxed coverings of R in S (and the related rules), with threshold probability t ($0 < t \le 1$), where S is the set of all attributes, and R is the set of all decisions.
The set of all subsets of the same cardinality k of the set S is denoted $P_k = \{\{x_{i1}, x_{i2}, \ldots, x_{ik}\} \mid x_{i1}, x_{i2}, \ldots, x_{ik} \in S\}$
Algorithm 2: RT-RICO
begin
  for each attribute x in S do compute [x]*;
  compute partition R*;
  k := 1;
  while $k \le |S|$ do
    for each set P in $P_k$ do
      if ($\prod_{x \in P} [x]^* \le_{r,t} R^*$) then
      begin
        find values of attributes from the entities that are in the region ($B \cap B'$) such that $(|B \cap B'| / |B|) \ge t$;
        add rule to output file;
      end
    k := k+1;
  end-while;
end-algorithm.

58 RT-RICO Rule Generation Algorithm (Maths, may skip)
Input is an m×(n+1) matrix, where m is the number of all entities (number of 5-residue plus 1 secondary structure element segments), and n = |S| (number of attributes, where n = 5)
Time complexity is exponential in |S|: $O(m^2 2^n)$
For the training datasets used, n = |S| = 5 and m is sufficiently large; hence, $m^2$ dominates the time complexity
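The exponential factor comes from enumerating every non-empty subset of the attribute set. For n = 5 that count is small and fixed, which is why the number of entities m dominates in practice. A quick check with the standard library:

```python
from itertools import combinations

n = 5  # number of attributes (segment positions)
attributes = range(n)

# All subsets of each cardinality k = 1..n, as visited by the outer loops.
subsets_by_size = {k: list(combinations(attributes, k)) for k in range(1, n + 1)}
total = sum(len(subsets) for subsets in subsets_by_size.values())

print({k: len(v) for k, v in subsets_by_size.items()}, total)
```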

59 BLAST-RT-RICO (Paper 2)Results (more tests needed) 59

60 BLAST-RT-RICO (Paper 2)Results 60

61 Rule Visualization (Paper 3) Association rule is implication of the form X → Y where X is set of antecedent items, and Y is consequent item (Wong et al., 1999) Wong’s technique designed to handle only Boolean association rules (Han and Kamber, 2001), rules concerning only the presence or absence of attributes Our rules for secondary structure are multi-valued (considered quantitative) We generate numerous rules (e.g., 572,531 from “all-α” class training set) 61

62 Rule Visualization (Paper 3) 62 Rules sorted by confidence value, then by support value Sorted this way due to prediction steps
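A small sketch of the two-key ordering. The slide does not state the sort direction; descending order on both keys is assumed here because the prediction step consults the highest-confidence rules first and then favours higher support. The dictionary layout of a rule is hypothetical.

```python
# Hypothetical rules: antecedent pattern, decision, confidence, support.
rules = [
    {"pattern": "A?E??", "decision": "H", "confidence": 0.95, "support": 40},
    {"pattern": "?L?K?", "decision": "H", "confidence": 1.00, "support": 12},
    {"pattern": "??E?A", "decision": "E", "confidence": 1.00, "support": 30},
    {"pattern": "G????", "decision": "C", "confidence": 0.95, "support": 75},
]

# Sort by confidence first, then by support, both descending (assumed direction).
ordered = sorted(rules, key=lambda r: (-r["confidence"], -r["support"]))
for rule in ordered:
    print(rule["pattern"], rule["decision"], rule["confidence"], rule["support"])
```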

63 Rule Visualization (Paper 3) 63 Can be visualized by modified version of Wong’s technique Different colors will represent different amino acids and different secondary structure elements

64 Rule Visualization (Paper 3) 64 Interesting observations Only 15 different amino acids (instead of 20) appear All decision attribute values at position 5 are “H/Helix” Motivated to compare color patterns!

65 Rule Visualization (Paper 3) Positions 0 to 4 are antecedent items and position 5 is only consequent item Can change amino acids’ colors (or any attribute’s color) in 3D diagrams to represent different properties In Fig. 5 amino acid colors chosen according to different amino acid types (e.g., acidic, basic, nonpolar, and polar uncharged) Colors can be changed to distinguish amino acids of different sizes, or other relevant chemical properties 65

66 Rule Visualization (Paper 3) 66 As shown in Table V, amino acids belonging to same type use similar color shades (acidic: orange; basic: teal; nonpolar: green; polar uncharged: pink)
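The colour assignment can be sketched as two lookups: amino acid to type, then type to colour. Table V is not reproduced in this transcript, so the grouping of the 20 amino acids below follows one common textbook classification and is an assumption. The four colour names are the ones given on the slide and are also valid matplotlib colour names.

```python
# ASSUMED grouping (one common textbook classification, not Table V itself).
GROUPS = {
    "acidic": "DE",
    "basic": "KRH",
    "nonpolar": "AVLIMFWPG",
    "polar uncharged": "STCYNQ",
}
TYPE_OF = {aa: group for group, letters in GROUPS.items() for aa in letters}

# Base colour per type, as listed on the slide.
TYPE_COLOR = {
    "acidic": "orange",
    "basic": "teal",
    "nonpolar": "green",
    "polar uncharged": "pink",
}

def rule_colors(antecedent):
    """Colour for each amino acid at positions 0 to 4 of a rule's antecedent."""
    return [TYPE_COLOR[TYPE_OF[aa]] for aa in antecedent]

print(rule_colors("ALEKA"))
```

Swapping `GROUPS` for a size-based grouping would recolour the same rules by molecular weight class without touching the plotting code.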

67 Rule Visualization (Paper 3) 67 Colors can be changed to distinguish amino acids of different sizes (Fig. 10) Python programming language, matplotlib plotting library: zooming, rotating about any axis, and saving as image file

68 Rule Visualization (Paper 3) Different Classes 68 Rule sequences between Fig. 5 and Fig. 7 are clearly different

69 Rule Visualization (Paper 3) Different Classes Surprisingly, top 30 “all-β” class rules do not produce all “E/Sheet” values at decision attribute (position 5) Top “all-β” class rules have similar support value as top “all-α” class rules Fig. 7 makes use of all 20 amino acids, compared to the 15 amino acids displayed in Fig. 5 Obvious different color distribution between the two diagrams indicates different rule value compositions Visualization allows patterns to emerge that would otherwise not be apparent! 69

70 Rule Visualization (Paper 3) Different Classes In the graph for "all-α" by amino acid type (Fig. 5), acidic and basic amino acids occur at frequency expected for number of amino acids in those groups Conversely, significant preponderance of nonpolar amino acids and a paucity of polar uncharged Although basic amino acids occur with expected frequency, overall concentrated in middle position, 2, with fewer at edge positions, 0 and 4 Nonpolar amino acids not equally distributed by position; inverse of trend for basic amino acids (i.e. concentrated at edge positions, 0 and 4, fewer in middle position, 2) 70

71 Rule Visualization (Paper 3) Different Classes (observations, may skip)
Comparison of graphs between protein classes also reveals patterns not apparent without visualization!
Basic amino acids are more abundant than expected in the “all-α" group, compared to that expected in the “all-β” group
Polar amino acids are more abundant than expected in the “all-β” group, compared to that expected in the “all-α" group
It becomes apparent that among the nonpolar type, different amino acids predominate in the “all-α" group versus the “all-β” group

72 Rule Visualization (Paper 3) Different Classes (observations, may skip) Similar patterns emerge from graph for " all-α" by amino acid size, where amino acids sorted by molecular weight into four groups (as shown in Fig.10, small: orange; medium small: green; medium large: pink; large: teal) Significantly fewer amino acids of the large class, roughly expected number of medium large and medium small, but significantly more than expected of small class Among medium large, amino acids in this class concentrated in middle position, 2, and less abundant in the edge positions, 0 and 4 72

73 Rule Visualization (Paper 3) Different Classes 73

74 Rule Visualization (Paper 3) Different Test Proteins BLAST-RT-RICO uses BLAST search to find list of proteins with significant sequence alignments (for each test protein) Rules are generated from these proteins Using visualization technique, can more readily get sense of information that rules convey, and can compare rule sets for test proteins Proteins with significant sequence alignments may carry important evolutionary information! 74

75 Rule Visualization (Paper 3) Different Test Proteins 75 Fig. 12 and Fig. 13 help us visualize the concept that different sets of amino acids are responsible for the two rule sets.

76 Rule Visualization (Paper 3) Different Test Proteins May lead to other future research topics related to protein secondary structure; e.g., encourages researcher to ask questions such as: (1) how different rules (or groups of rules) affect the functions of an individual protein or a protein family, (2) why certain rules only exist in one protein class, but not in another, and (3) why some test proteins produce common rules although the proteins have different structure 76

77 Rule Visualization (Paper 3)
Will help researchers discern patterns of residue association in protein structure as other, more complex properties of those amino acids are applied to the visualization
For brevity, the figures each show only about 30 rules; on a 21” monitor, thousands of rules can be displayed and analyzed
Implementation supports zooming, rotating, etc., allowing users to have the “big picture” of a particular set of rules

78 Conclusion Novel rule-based method that generates rules for predicting protein secondary structure Rule-based RT-RICO (paper 1): Q 3 accuracy scores of 81.75% for RS126 and 79.19% for CB396 BLAST-RT-RICO approach (paper 2): Q 3 scores of 89.93% for RS126 and 87.71% for CB396 – promising, but more tests needed for test proteins with “no known homologous template structures in the PDB database”. 78

79 Conclusion Rule Visualization (paper 3): technique to visualize those rules, compare rule sets between different protein classes, and compare rule sets of different test proteins In future, useful to construct BLAST-RT-RICO prediction server with functions to analyze training datasets and prediction results Also consider other properties of proteins and sequences of length > 5 Conduct more tests 79

80 Questions? Robbins, R.J. (1992). Challenges in human genome project. IEEE Engineering in Medicine and Biology, 11, 25-34. “… Consider the 3.2 gigabytes of human genome as equivalent to 3.2 GB of files on the mass-storage device of some computer system of unknown design. … Reverse engineering that unknown computer system (both the hardware and the 3.2 GB of software) all the way back to the full set of design and maintain specifications. …. resulting image of the mass-storage device will not be a file-by- file copy, but rather a streaming dump of bytes… files are known to be fragmented… erased files… garbage… only a partial, and sometimes incorrect understanding of the CPU… 3.2 GB are the binary specifications… millions of maintenance revisions… spaghetti-coding… hackers… self-modifying code… and relying upon undocumented system quirks.” 80

81 Teaching Interests: Web Application Development (AmphibAnat.org) NSF funded ($1,116K) Web interface design: (different design templates) Client-side programming: JavaScript, CSS, html Server-side programming: C#.net Relational database design/admin: Microsoft SQL Server Server setup/admin: Microsoft IIS web server and Microsoft Windows server 81

82 Teaching Interests: Web Application Development (RDBOM Ontology Sys.) NSF funded Ontology theory / Automata / Algorithm Design Web interface design: (different design templates) Client-side programming: JavaScript, CSS, html Server-side programming: C#.net Relational database design/admin: Microsoft SQL Server Server setup/admin: Microsoft IIS web server and Microsoft Windows server 82

83 Teaching Interests: Web Application Development (leeleong.com) Web interface design: (different design templates) Client-side programming: JavaScript, CSS, html Server-side programming: PHP Relational database design/admin: MySQL Server Server setup/admin: Apache web server Web graphics / photography Personal hobby 83

84 Teaching Interests: Web Design (web building projects) 84 Common Call Campus MinistryRollaShootingClub.org

85 Teaching Interests: Skills Programming: MS ASP.NET, C#, PHP, MATLAB, Perl, C, C++, Java, JavaScript, Pascal, Flash ActionScript, Director Lingo, HTML, SMIL, XML Database: MySQL Database, MS SQL Server Server Administration: MS Win Server, MS IIS, Apache Web Server, Real/Helix Streaming Server Web/Multimedia: Adobe Dreamweaver, Fireworks, Flash, Director, Freehand, Photoshop, Premiere Streaming System: RealPlayer, Helix Producer, Helix Server, SMIL 85

86 Teaching Interests: New Course Development Qualified to teach any core computer science course at the undergraduate level as well as specialized graduate courses I would be most interested in developing (new courses) – Advanced Bioinformatics – Bioinformatics – Data Mining – Neural Networks & Applications – Theory of Computation Courses – Web Multimedia Development Courses (web application development, web game programming) – Basic Web Design (basic design theories, web aesthetics, web interface design) 86

87 ::: Thank You ::: Leong Lee, Ph.D. University of Missouri (MS&T) Visiting Assistant Professor, Dept of Computer Science University of North Carolina at Greensboro

88 References Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) ‘Gapped BLAST and PSI-BLAST: a new generation of protein database search programs’, Nucleic Acids Res., Vol. 25, No. 17, pp.3389-402. Andreeva, A., Howorth, D., Chandonia, J. M., Brenner, S. E., Hubbard, T. J., Chothia, C. and Murzin, A. G. (2008) ‘Data growth and its impact on the SCOP database: new developments’, Nucleic Acids Res., Vol. 36 (Database issue), D419-25. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A. F. and Nielsen, H. (2000) ‘Assessing the accuracy of prediction algorithms for classification: an overview’, Bioinformatics, Vol. 16, No. 5, pp.412-24. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. and Bourne, P. E. (2000) ‘The Protein Data Bank’, Nucleic Acids Res., Vol. 28, No. 1, pp.235-42. BLAST (2009). BLAST: Basic Local Alignment Search Tool. Obtained through the Internet: http://blast.ncbi.nlm.nih.gov/, [accessed 30/11/2009] Bryson, K., McGuffin, L. J., Marsden, R. L., Ward, J. J., Sodhi, J. S. and Jones, D. T. (2005) ‘Protein structure prediction servers at University College London’, Nucleic Acids Res., Vol. 33 (Web Server issue), W36-8. Cuff, J. A. and Barton, G. (1999) ‘Evaluation and improvement of multiple sequence methods for protein secondary structure prediction’, Proteins, Vol. 34, pp.508–519. Cuff, J. A. and Barton, G. (2000) ‘Application of multiple sequence alignment profiles to improve protein secondary structure prediction’, Proteins, Vol. 40, No. 3, pp.502-11. Fadime, U. Y., Özlem, Y. and Metin, T. (2008) ‘Prediction of secondary structures of proteins using a two-stage method’, Computers & Chemical Engineering, Vol. 32, No. 1-2, pp.78-88. 88

89 References Frishman, D. and Argos, P. (1997) ‘Seventy-five percent accuracy in protein secondary structure prediction’, Proteins, Vol. 27, pp.329–335. Grzymala-Busse, J. W. (1991) ‘Ch.3. Knowledge Acquisition’, Managing Uncertainty in Expert Systems, (pp.43-76), Boston: Kluwer Academic. Han, J. and Kamber, M. (2001) Data Mining: Concepts and Techniques, (pp.155-157) Morgan Kaufmann. Hu, H., Pan, Y., Harrison, R. and Tai, P. (2004) ‘Improved protein secondary structure prediction using support vector machine and a new encoding scheme and an advanced tertiary classifier’, IEEE Trans. NanoBiosci., Vol. 3, pp.265–271. Jones, D. T. (1999) ‘Protein secondary structure prediction based on position-specific scoring matrices’, J. Mol. Biol., Vol. 292, No. 2, pp.195-202. Jones, N. C. and Pevzner, P. A. (2004) An Introduction to Bioinformatics Algorithms, MIT Press. Kabsch, W. and Sander, C. (1983) ‘How good are predictions of protein secondary structure?’, FEBS Letters, Vol. 155, pp.179-182. Kim, H. and Park, H. (2003) ‘Protein secondary structure prediction based on an improved support vector machines approach’, Protein Eng., Vol. 16, pp.553-60. King, R. D. and Sternberg, M. J. E. (1996) ‘Identification and application of the concepts important for accurate and reliable protein secondary structure prediction’, Protein Sci., Vol. 5, pp.2298–2310. Klepeis, J. L. and Floudas, C. A. (2002) ‘Ab initio prediction of helical segments in polypeptides’, J. Comput. Chem., Vol. 23, No. 2, pp.245-66. 89

90 References Leopold, J. L., Maglia, A. M., Thakur, M., Patel, B. and Ercal, F. (2007) ‘Identifying Character Non-Independence in Phylogenetic Data Using Parallelized Rule Induction From Coverings’, Data Mining VIII: Data, Text, and Web Mining and Their Business Applications, WIT Transactions on Information and Communication Technologies, Vol. 38, pp.45-54. Levitt, M. and Chothia, C. (1976) ‘Structural patterns in globular proteins’, Nature, Vol. 261, No. 5561, pp.552-8. Lee, L., Leopold, J. L., Frank, R. L. and Maglia, A. M. (2009) ‘Protein Secondary Structure Prediction Using Rule Induction from Coverings’, Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2009, Nashville, Tennessee, USA, pp.79-86. Lee, L., Kandoth, C., Leopold, J. L. and Frank, R. L. (2010a) ‘Protein Secondary Structure Prediction Using Parallelized Rule Induction from Coverings’, International Journal of Medicine and Medical Sciences, Vol. 1, No. 2, pp.99-105. Lee, L., Leopold, J. L., Kandoth, C. and Frank, R. L. (2010b) ‘Protein secondary structure prediction using RT-RICO: a rule-based approach’, The Open Bioinformatics Journal, Vol. 4, pp.17-30. Lee, L., Leopold, J. L., Edgett, P. G. and Frank, R. L. (2010c) ‘Rule Visualization of Protein Motif Sequence Data for Secondary Structure Prediction’, Proceedings of ANNIE 2010 conference, St. Louis, Missouri, USA. Lee, L., Leopold, J. L. and Frank, R. L. (2011) ‘Protein secondary structure prediction using BLAST and Relaxed Threshold Rule Induction from Coverings’, Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2011, Paris, France, accepted for publication. Lesk, A. M. (2008) Introduction to Bioinformatics, 3rd Edition, Oxford. Maglia, A. M., Leopold, J. L. and Ghatti, V. R. (2004) ‘Identifying Character Non-Independence in Phylogenetic Data Using Data Mining Techniques’, Proc. Second Asia-Pacific Bioinformatics Conference, Dunedin, New Zealand. 90

91 References Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995) ‘SCOP: a structural classification of proteins database for the investigation of sequences and structures’, J. Mol. Biol., Vol. 247, No. 4, pp.536-40. Nguyen, N. and Rajapakse, J. C. (2007) ‘Two stage support vector machines for protein secondary structure prediction’, Intl. J. Data Mining & Bioinformatics, Vol. 1, pp.248-269. Pawlak, Z. (1984) ‘Rough Classification’, Int. J. Man-Machine Studies, Vol. 20, pp.469-483. Rost, B. and Sander, C. (1993a) ‘Prediction of protein secondary structure at better than 70% accuracy’, J. Mol. Biol., Vol. 232, pp.584-599. Rost, B. and Sander, C. (1993b) ‘Improved prediction of protein secondary structure by use of sequence profiles and neural networks’, Proc. Natl. Acad. Sci. USA, Vol. 90, pp.7558–7562. Rost, B. (2003) ‘Rising accuracy of protein secondary structure prediction’, In: Chasman, D. (Ed.), Protein structure determination, analysis, and modeling for drug discovery, (pp.207–249), New York: Dekker. Salamov, A. A. and Solovyev, V. V. (1995) ‘Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments’, J. Mol. Biol., Vol. 247, pp.11–15. Tramontano, A. (2006) Protein Structure Prediction, Wiley-VCH. Wong, P. C., Whitney, P. and Thomas, J. (1999) ‘Visualizing Association Rules for Text Mining’, Proceedings of the 1999 IEEE Symposium on Information Visualization, pp.120-123, 152. Zhang, C. T. and Zhang, R. (2003) ‘Q9, a content-balancing accuracy index to evaluate algorithms of protein secondary structure prediction’, Int. J. Biochem. Cell Biol., Vol. 35, No. 8, pp.1256-62. 91

