Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins Shaobing Su Supervisor: Dr. Lawrence B. Holder.


1 Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins
Shaobing Su
Supervisor: Dr. Lawrence B. Holder
Committee: Dr. Diane J. Cook, Dr. Edward Bellion

2 Outline
• Motivation and goal of the research
• SUBDUE knowledge discovery system
• Proteins and PDB
• Methods and results
• Discussion and conclusion
• Future research

3 Motivation and Goal
• An explosively growing amount of molecular biology information needs to be analyzed to help understand the underlying structure-function relationships in proteins and other macromolecules.
• Apply SUBDUE to the Brookhaven Protein Data Bank (PDB) to identify biologically meaningful patterns

4 SUBDUE knowledge discovery system
• SUBDUE discovers patterns (substructures) in structural data sets
• SUBDUE represents data as a labeled graph
• Inputs: vertices and edges
• Outputs: discovered patterns and instances

5 Example
[Figure: labeled example graph with vertex labels "object", "triangle", "square" and edge labels "on", "shape"; caption: 4 instances of the pattern]
• Vertices: objects or attributes
• Edges: relationships

6 SUBDUE’s search algorithm
• Minimum Description Length (MDL) principle: the best theory to describe a set of data is the one that minimizes the description length (DL) of the entire data set
• DL of the graph: the number of bits necessary to completely describe the graph
• Search for the substructure that results in the maximum compression
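As a rough illustration of the compression idea, the sketch below scores a substructure S in a graph G by the ratio DL(G) / (DL(S) + DL(G|S)), where G|S is the graph after each instance of S is collapsed into a single vertex. The bit-counting function is a deliberately crude stand-in and is not SUBDUE's actual encoding; the function names and the example numbers are hypothetical.

```python
import math

def description_length(n_vertices, n_edges, n_vertex_labels, n_edge_labels):
    """Crude bit count (an assumption, not SUBDUE's real encoding):
    one label per vertex, two endpoints plus one label per edge."""
    label_bits = n_vertices * math.log2(max(n_vertex_labels, 2))
    endpoint_bits = 2 * math.log2(max(n_vertices, 2))
    edge_bits = n_edges * (endpoint_bits + math.log2(max(n_edge_labels, 2)))
    return label_bits + edge_bits

def compression_value(graph, sub, n_instances):
    """graph and sub are (vertices, edges, vertex_labels, edge_labels) tuples."""
    g_v, g_e, g_vl, g_el = graph
    s_v, s_e, _, _ = sub
    # Each instance collapses to one new vertex that carries one new label.
    c_v = g_v - n_instances * (s_v - 1)
    c_e = g_e - n_instances * s_e
    dl_graph = description_length(g_v, g_e, g_vl, g_el)
    dl_sub = description_length(*sub)
    dl_compressed = description_length(c_v, c_e, g_vl + 1, g_el)
    return dl_graph / (dl_sub + dl_compressed)

graph = (20, 19, 4, 2)   # hypothetical graph: 20 vertices, 19 edges
sub = (3, 2, 4, 2)       # hypothetical substructure: 3 vertices, 2 edges
one = compression_value(graph, sub, 1)
four = compression_value(graph, sub, 4)
print(round(one, 2), round(four, 2))
```

Under this toy measure, a substructure with more instances yields a higher value, which is the sense in which the search prefers the substructure giving the maximum compression.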

7 Inexact graph match approach
• Find instances with slight distortions: insertion, deletion, and substitution of edges/vertices.
• Threshold parameter: specifies the amount of distortion allowed.
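A minimal sketch of the thresholded match, assuming chain-shaped patterns so that graph distortion reduces to edit distance between vertex-label sequences. This swaps in plain Levenshtein distance for SUBDUE's inexact graph match, and the rule "distortion at most threshold times the size of the larger pattern" is an assumption made for illustration. The two label sequences are the hemoglobin patterns 1 and 7 reported later in the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences, unit cost per edit."""
    prev = list(range(len(b) + 1))
    for i, label_a in enumerate(a, start=1):
        curr = [i]
        for j, label_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (label_a != label_b)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitute))
        prev = curr
    return prev[-1]

def matches(a, b, threshold):
    """Accept when the distortion is at most threshold * size of the larger pattern."""
    return edit_distance(a, b) <= threshold * max(len(a), len(b))

hemo_1 = ["h_1_14", "h_1_15", "h_1_6", "h_1_6", "h_1_19", "h_1_8", "h_1_18", "h_1_20"]
hemo_7 = ["h_1_15", "h_1_15", "h_1_6", "h_1_1", "h_1_19", "h_1_8", "h_1_18", "h_1_20"]
print(edit_distance(hemo_1, hemo_7), matches(hemo_1, hemo_7, 0.3))
```

The two patterns differ by two substitutions, so a loose threshold treats them as instances of one pattern while a strict one keeps them apart.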

8 Overview of proteins
• most important biomolecules
• composed of 20 amino acids
• structural hierarchy
• very diverse structures and functions

9 Structural hierarchy in proteins
• Primary structure (sequence of protein)
• Secondary structure (helix, sheet, random coil)
• Tertiary structure (3-D)

10 Primary Structure of proteins
• Average 100-150 residues (a.a.) linked head to tail
• N-terminus and C-terminus
• Peptide bond, alpha-carbon
[Figure: dipeptide backbone H3N(+)-Cα1(R1)-C(=O)-N(H)-Cα2(R2)-C(=O)-O(-), with the peptide bond between the first a.a. and the second a.a.; N-terminus on the left, C-terminus on the right]

11 Secondary structure elements
• Ordered backbone arrangement: helix and sheet
• Helix (0% to 90%; average 11 a.a.; several types)
• Sheet (2 to 15 strands per sheet; parallel and anti-parallel; average 6 a.a. per strand)

12 Tertiary Structure of protein
• Highly complicated 3-D arrangement
• Folding of its secondary structure elements

13 Brookhaven Protein Data Bank (PDB)
• Brookhaven National Laboratory
• Over 6000 experimentally determined 3-D structures of biomolecules
• Majority: protein structures

14 Contents of PDB
• SEQRES: sequence of a.a. (three-letter code)
• HELIX: start, end, and type
• SHEET: start, end, and sense
• ATOM: (x, y, z) coordinates for each atom in the protein

15 Applications of SUBDUE to PDB - Methods and Results
• July 1997 PDB TM release (6000 PDB)
• Global data set (4000 PDB)
• Category data sets
  • Hemoglobin
  • Myoglobin
  • Ribonuclease A

16 Flowchart of Research
[Flowchart; box labels: Preprocessing, Application, Brookhaven PDB, Graphic representation, Inputs to SUBDUE, Patterns in Category, Patterns in Global, others, Instance mapping]

17 Preprocessing
• compile PDB list for each category
• model.c: extract first model
• seq.c: extract sequence info and convert to graphic format
• secondary.c: extract secondary structure info and convert to graphic format
• coor.c: extract 3D coordinates and convert to graphic format

18 Primary structure and its representation
• Sample PDB lines:
    SEQRES 1 150 ALA ASN LYS THR 1ASH 139
    SEQRES 2 150 LYS SER LEU GLU 1ASH 140
• Sequence (N-terminus to C-terminus): ALA ASN LYS THR LYS SER LEU GLU
• SUBDUE graphic input (ALA ASN):
    v 1 ALA       - - - ALA residue
    v 2 ASN       - - - ASN residue
    e 1 2 bond    - - - a peptide bond between ALA and ASN
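The conversion above can be sketched as follows. This is a minimal illustration, not the original seq.c: it assumes the residue names have already been extracted from the SEQRES lines, and the function name `sequence_to_subdue` is hypothetical.

```python
def sequence_to_subdue(residues):
    """Emit one vertex per residue and one 'bond' edge per adjacent pair."""
    lines = [f"v {i} {name}" for i, name in enumerate(residues, start=1)]
    lines += [f"e {i} {i + 1} bond" for i in range(1, len(residues))]
    return lines

# Sequence from the sample lines, N-terminus to C-terminus.
sequence = ["ALA", "ASN", "LYS", "THR", "LYS", "SER", "LEU", "GLU"]
for line in sequence_to_subdue(sequence):
    print(line)
```

A chain of n residues yields n vertices and n - 1 edges, so the graph grows linearly with sequence length.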

19 Secondary structure and its representation - HELIX
• Sample PDB lines (starting, ending, type):
    HELIX 1 ASN 1 HIS 13 1
    HELIX 2 ASN 20 ASN 36 1
• vertex: h_type_length
• Helix length: Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
• SUBDUE graphic input:
    v 1 h_1_12    - - - helix 1, type 1, length 12
    v 2 h_1_16    - - - helix 2, type 1, length 16
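A small sketch of the helix labeling rule, using the two sample helices above. The function name `helix_label` and the tuple layout are assumptions for illustration; the length follows the Hlength formula on the slide.

```python
def helix_label(helix_type, first_seq_num, last_seq_num):
    """Build an h_type_length label; length = SeqNum(last) - SeqNum(first)."""
    length = last_seq_num - first_seq_num
    return f"h_{helix_type}_{length}"

# (type, first sequence number, last sequence number) for the sample HELIX lines
helices = [(1, 1, 13), (1, 20, 36)]
lines = [f"v {i} {helix_label(*h)}" for i, h in enumerate(helices, start=1)]
for line in lines:
    print(line)
```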

20 Secondary structure and its representation - SHEET
• Sample PDB lines (sense, length):
    SHEET 1 TYR 284 ILE 286 0
    SHEET 2 HIS 292 THR 294 -1
• vertex: s_sense_length
• SUBDUE graphic input:
    v 1 s_0_2     - - - strand 1, sense 0, length 2
    v 2 s_-1_2    - - - strand 2, sense -1, length 2

21 Overall secondary structure representation
• PDB lines:
    HELIX 1 THR 3 MET 13 1
    HELIX 2 ASN 24 ASN 34 1
    HELIX 3 SER 50 GLN 60 1
    SHEET 1 LYS 41 HIS 48 0
    SHEET 2 MET 79 THR 87 -1
• SUBDUE graphic input:
    v 1 h_1_10
    v 2 h_1_10    e 1 2 sh
    v 3 s_0_7     e 2 3 sh
    v 4 h_1_10    e 3 4 sh
    v 5 s_-1_8    e 4 5 sh
• sequential relationship is represented as edge “sh”
[Figure: visualization of the elements in order from N-terminus to C-terminus]
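The ordering step can be sketched as below: helices and strands are sorted by their starting sequence number and then chained with "sh" edges. This is an illustrative reconstruction, not the original secondary.c; the function name and the tuple layout (kind, type or sense, first, last) are assumptions.

```python
def secondary_structure_graph(elements):
    """elements: (kind, code, first_seq_num, last_seq_num), kind is 'h' or 's'.
    code is the helix type for 'h' and the strand sense for 's'."""
    # Order elements along the chain, N-terminus to C-terminus.
    ordered = sorted(elements, key=lambda e: e[2])
    lines = []
    for i, (kind, code, first, last) in enumerate(ordered, start=1):
        lines.append(f"v {i} {kind}_{code}_{last - first}")
    # Link each element to the next one in sequence order.
    for i in range(1, len(ordered)):
        lines.append(f"e {i} {i + 1} sh")
    return lines

elements = [
    ("h", 1, 3, 13), ("h", 1, 24, 34), ("h", 1, 50, 60),   # HELIX 1-3
    ("s", 0, 41, 48), ("s", -1, 79, 87),                   # SHEET strands 1-2
]
for line in secondary_structure_graph(elements):
    print(line)
```

Note that the first strand (residues 41 to 48) lands between the second and third helices once the elements are sorted, which is why it becomes vertex 3 in the slide's graph.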

22 Tertiary structure and its representation
• Sample PDB lines:
                       X        Y        Z
    ATOM CA ALA 1    10.369    0.997   10.519
    ATOM CA ASN 2     6.691    0.239    9.830
• vertex: backbone carbon; edge: distance (vs, s)
• Distance (Å): $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$
• SUBDUE graphic input:
    v 1 CA_ALA
    v 2 CA_ASN
    e 1 2 vs    - - - very short distance
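A short sketch of the distance-to-label step for the two sample atoms. The cutoffs (vs up to 4 Å, s above 4 Å and up to 6 Å) are taken from the later slide on tertiary structure, where the comparison symbols are garbled, so reading them as "less than or equal" is an assumption. Returning no label beyond 6 Å is also an assumption, and the function names are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math

def distance(p, q):
    """Euclidean distance in angstroms between two (x, y, z) points."""
    return math.dist(p, q)

def distance_label(d):
    """Map a distance to an edge label; None means no edge (assumed)."""
    if d <= 4.0:
        return "vs"   # very short
    if d <= 6.0:
        return "s"    # short
    return None

ca_ala = (10.369, 0.997, 10.519)
ca_asn = (6.691, 0.239, 9.830)
d = distance(ca_ala, ca_asn)
print(f"e 1 2 {distance_label(d)}")
```

Replacing raw coordinates with a distance label makes the representation independent of the coordinate origin chosen in each PDB entry.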

23 Rationale for representation choice - Criteria
• Patterns identified by SUBDUE must be representative of each category
• Patterns discovered by SUBDUE should discriminate one category from others

24 Primary sequence
• vertex - a.a. residue name
• edge - peptide bond
    v 1 ARG
    v 2 GLU
    v 3 ALA
    e 1 2 bond
    e 2 3 bond
[Figure: chain ARG, GLU, ALA with a "bond" edge between each adjacent pair]

25 Secondary structure elements
• Type of the helix
• starting and ending points (a.a. name and seq number)
[Figure: Helix 1 drawn from N-terminus to C-terminus, annotated with type 1, length 12, starts at ASN, ends at HIS]

26 Other ways of representing helix
• Separate type and length
• Combine type and length
[Figure: a Helix vertex with separate type (1) and length (12) attributes, compared with a single combined vertex Helix_1_12]

27 Tertiary structure
• (x, y, z) coordinates vary with different origin choice
• avoid numeric values; use the labels vs (≤ 4 Å) and s (4 Å < dist ≤ 6 Å)
[Figure: two carbons, C1 and C2, plotted on x, y, z axes with their coordinates and joined by a vs edge]

28 Results: Primary structure patterns
• Ribonuclease_A_sequence, Ribo_A (59/68): GLY GLN THR ASN CYS TYR GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP ALA SER VAL
• Hemo_sequence, Hemo_seq (63/65): THR LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SER ALA LEU SER THR LEU ALA ALA HIS LEU PRO LAL GLU PHE THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU ALA SER VAL SER THR VAL LEU THR SER LYS TYR
• Myoglo_sequence, Myo_seq (67/103): VAL LEU SER GLU GLY GLU TRP GLN LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP ARG

29 Primary structure patterns
• Unique to each sample category
• Hemoglobin and myoglobin proteins share little sequence similarity

30 Results: Hemo secondary structure patterns
1 : h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
7 : h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

31 Results: Myo secondary structure patterns
1 : h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25

32 Results: Ribo_A secondary structure patterns
1 : h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10 -> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8 -> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 -> s_-1_8 -> s_-1_5 -> s_-1_3
10 : h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3 -> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6

33 Results: Tertiary structural patterns
• SUBDUE finds small patterns (2 or 3 a.a.)
• not unique for each category of proteins
• not biologically meaningful

34 Visualization of secondary structure patterns - hemoglobin
[Figure: complete hemoglobin; 2 instances of the pattern structure; N-terminus and C-terminus labeled]

35 Visualization of secondary structure patterns - myoglobin
[Figure: complete myoglobin; 1 instance of the pattern structure; N-terminus and C-terminus labeled]

36 Visualization of secondary structure patterns - ribonuclease_A
[Figure: complete ribonuclease_A; 1 instance of the pattern structure; N-terminus and C-terminus labeled]

37 Discussion - Hemoglobin
• Hemoglobin: A, B, C, D chains
• Two types of patterns identified by SUBDUE: one for the A, C chains, the other for the B, D chains
• Patterns exist in a majority of hemoglobin proteins
• No instances of the best hemoglobin pattern found in other proteins in the global data set

38 Occurrence of hemo patterns

39 Occurrence of hemo patterns - continued

40 Discussion - Myoglobin
• Myoglobin: one chain
• One dominant pattern identified by SUBDUE
• Patterns exist in most myoglobin proteins
• No instances of the best myoglobin pattern found in other proteins in the global data set

41 Discussion: - Hemoglobin and Myoglobin
• Similar secondary structure patterns
Hemoglobin B, D chains (from N- to C-terminus):
h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
Myoglobin chain (from N- to C-terminus):
h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
Hemoglobin A, C chains (from N- to C-terminus):
h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

42 Discussion: - Hemoglobin and Myoglobin
• Consistent with the genetic studies
• Hemoglobin and myoglobin share one ancestral gene
• Divergence occurred in the course of evolution: one copy of the gene for myoglobin, four copies for hemoglobin.
• The last helix of hemoglobin is shorter; one of the helices in the hemoglobin A, C chains almost disappears, allowing conformational change

43 Discussion: - ribonuclease A proteins
• All patterns have three helices of the same size
• Several strands appear twice, indicating participation in the formation of two sheets.
• Ribonuclease S protein (S-protein fragment) also has the pattern.

44 Conclusion of the results
• Secondary structure patterns discovered by SUBDUE are representative of each category
• Secondary structure patterns discovered by SUBDUE are distinct for each category
• SUBDUE has the ability to discover biologically interesting patterns from PDB and other similar molecular biology (MB) databases

45 Comparison with other related studies
• Different graphic representation
• Predefined patterns with exact or inexact graph match
• Not applied systematically to PDB or other DB
• SUBDUE could perform a similar task if the inexact graph match routine were incorporated

46 Conclusions of the study
• Abstraction over 3D structure to its secondary structural elements is suitable for discovery
• The secondary structure patterns SUBDUE discovered for each category can be used as a signature for its class
• Inexact graph match is useful for finding similar patterns
• SUBDUE is suitable for knowledge discovery in MB structural DB

47 Future Research
• More consistent and detailed description of secondary structure
• Add relative positions of the secondary structural elements to represent spatial relationships
• Investigate alternative representations: a more suitable 3D coordinate representation; weighting on different edges
• Inexact graph match in predefined substructure
• More collaboration with domain scientists

