Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins Shaobing Su Supervisor: Dr. Lawrence B. Holder.

Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins
Shaobing Su
Supervisor: Dr. Lawrence B. Holder
Committee: Dr. Diane J. Cook, Dr. Edward Bellion

Outline
- Motivation and goal of the research
- SUBDUE knowledge discovery system
- Proteins and PDB
- Methods and results
- Discussion and conclusion
- Future research

Motivation and Goal
- An explosive amount of molecular biology information needs to be analyzed to help understand the underlying structure-function relationships in proteins and other macromolecules.
- Apply SUBDUE to the Brookhaven Protein Data Bank (PDB) to identify biologically meaningful patterns.

SUBDUE knowledge discovery system
- SUBDUE discovers patterns (substructures) in structural data sets
- SUBDUE represents data as a labeled graph
- Inputs: vertices and edges
- Outputs: discovered patterns and instances

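Example
[Figure: example graph with vertices labeled object, triangle, and square, and edges labeled on and shape]
- Vertices: objects or attributes
- Edges: relationships
- 4 instances of [substructure shown in the figure]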
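A minimal sketch of how such a labeled graph could be held in memory and written out in the "v"/"e" line style used later in these slides. The class name `LabeledGraph` and its methods are hypothetical, and the exact wiring of the example (a triangle object "on" a square object) is an assumption read from the figure labels, not a confirmed detail of the original figure.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Hypothetical container: numbered, labeled vertices and labeled edges."""
    vertices: dict[int, str] = field(default_factory=dict)
    edges: list[tuple[int, int, str]] = field(default_factory=list)

    def add_vertex(self, label: str) -> int:
        vid = len(self.vertices) + 1  # 1-based ids, as in the slides
        self.vertices[vid] = label
        return vid

    def add_edge(self, src: int, dst: int, label: str) -> None:
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be existing vertices")
        self.edges.append((src, dst, label))

    def to_lines(self) -> list[str]:
        """Emit all vertices first, then all edges, one per line."""
        lines = [f"v {vid} {label}" for vid, label in self.vertices.items()]
        lines += [f"e {s} {d} {label}" for s, d, label in self.edges]
        return lines


# Assumed reading of the figure: a triangle-shaped object on a square-shaped object.
g = LabeledGraph()
top = g.add_vertex("object")
tri = g.add_vertex("triangle")
bottom = g.add_vertex("object")
sq = g.add_vertex("square")
g.add_edge(top, tri, "shape")
g.add_edge(bottom, sq, "shape")
g.add_edge(top, bottom, "on")
print("\n".join(g.to_lines()))
```

Objects and their attributes both become vertices here, and every relationship (including "has shape") becomes an edge, which matches the vertex/edge roles listed on the slide.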

SUBDUE’s search algorithm
- Minimum Description Length (MDL) principle: the best theory to describe a set of data is the one that minimizes the description length (DL) of the entire data set
- DL of the graph: the number of bits necessary to completely describe the graph
- Search for the substructure that results in the maximum compression
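The compression idea can be illustrated with a deliberately simplified sketch. It does not compute a bit-level description length; as a stand-in, it measures a graph by its vertex count plus edge count, and it assumes that instances do not overlap and that each instance is replaced by a single vertex. All function names and the example numbers are hypothetical.

```python
def graph_size(num_vertices: int, num_edges: int) -> int:
    """Simplified stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges


def compressed_size(graph_v: int, graph_e: int,
                    sub_v: int, sub_e: int, instances: int) -> int:
    """Size of the substructure plus the graph rewritten with that substructure.

    Assumes non-overlapping instances, each collapsed into one vertex.
    """
    if instances * sub_v > graph_v or instances * sub_e > graph_e:
        raise ValueError("instances do not fit in the graph")
    remaining_v = graph_v - instances * sub_v + instances
    remaining_e = graph_e - instances * sub_e
    return graph_size(sub_v, sub_e) + graph_size(remaining_v, remaining_e)


# Hypothetical graph: 16 vertices, 15 edges.
original = graph_size(16, 15)
# Candidate A: 4-vertex, 3-edge substructure with 4 instances.
size_a = compressed_size(16, 15, sub_v=4, sub_e=3, instances=4)
# Candidate B: 2-vertex, 1-edge substructure with 4 instances.
size_b = compressed_size(16, 15, sub_v=2, sub_e=1, instances=4)
print(original, size_a, size_b)
```

Under this measure the larger substructure (candidate A) yields the smaller total, so a search that minimizes the total would prefer it.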

Inexact graph match approach
- Find instances with a slight distortion: insertion, deletion, and substitution of edges/vertices.
- Threshold parameter: specifies the amount of distortion allowed.

Overview of proteins
- Most important biomolecule
- Composed of 20 amino acids
- Structural hierarchy
- Very diverse structure and function

Structural hierarchy in proteins
- Primary structure (sequence of the protein)
- Secondary structure (helix, sheet, random)
- Tertiary structure (3-D)

Primary Structure of proteins
- Average residues (a.a.) linked head to tail
- N-terminus and C-terminus
- Peptide bond, alpha-carbon
[Figure: dipeptide backbone, +H3N-Cα1(R1)-C(=O)-N(H)-Cα2(R2)-C(=O)-O−, with the N-terminus at the first a.a., the C-terminus at the second a.a., and the peptide bond between them]

Secondary structure elements
- Ordered backbone arrangement: helix and sheet
- Helix (0% to 90%; average 11 a.a.; several types)
- Sheet (2 to 15 strands per sheet; parallel and anti-parallel; average 6 a.a. per strand)

Tertiary Structure of protein
- Highly complicated 3-D arrangement
- Folding of its secondary structure elements

Brookhaven Protein Data Bank (PDB)
- Brookhaven National Laboratory
- Over 6000 experimentally determined 3-D structures of biomolecules
- Majority: protein structures

Contents of PDB
- SEQRES: sequence of a.a. (three-letter code)
- HELIX: start, end, and type
- SHEET: start, end, and sense
- ATOM: (x, y, z) coordinates for each atom in the protein

Applications of SUBDUE to PDB - Methods and Results
- July 1997 PDB release (6000 PDB entries)
- Global data set (4000 PDB entries)
- Category data sets: hemoglobin, myoglobin, ribonuclease A

Flowchart of Research
[Flowchart, box labels: Brookhaven PDB; Preprocessing; Graphic representation; Inputs to SUBDUE; Application; Patterns in Category; Patterns in Global; others; Instance mapping]

Preprocessing
- Compile the PDB list for each category
- model.c: extract the first model
- seq.c: extract sequence info and convert to graphic format
- secondary.c: extract secondary structure info and convert to graphic format
- coor.c: extract 3D coordinates and convert to graphic format

Primary structure and its representation
- Sample PDB lines:
    SEQRES ALA ASN LYS THR 1ASH 139
    SEQRES LYS SER LEU GLU 1ASH 140
- Sequence (N-terminus to C-terminus): ALA ASN LYS THR LYS SER LEU GLU
- SUBDUE graphic input (ALA ASN):
    v 1 ALA        (ALA residue)
    v 2 ASN        (ASN residue)
    e 1 2 bond     (a peptide bond between ALA and ASN)
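The conversion from SEQRES lines to a chain of residue vertices joined by "bond" edges could be sketched as follows. This is not the original seq.c; it is a hypothetical Python illustration that treats the sample lines as whitespace-separated tokens and keeps only tokens that are one of the 20 standard three-letter residue codes (an assumption, since real PDB files use fixed columns and may contain non-standard residues).

```python
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def residues_from_seqres(lines: list[str]) -> list[str]:
    """Collect residue names, in order, from SEQRES lines."""
    residues: list[str] = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] != "SEQRES":
            continue
        # Skip identifiers and counters such as "1ASH" or "139".
        residues.extend(t for t in tokens[1:] if t in STANDARD_RESIDUES)
    return residues


def sequence_to_graph_lines(residues: list[str]) -> list[str]:
    """One vertex per residue, one 'bond' edge per consecutive pair."""
    lines = [f"v {i} {name}" for i, name in enumerate(residues, start=1)]
    lines += [f"e {i} {i + 1} bond" for i in range(1, len(residues))]
    return lines


sample = [
    "SEQRES ALA ASN LYS THR 1ASH 139",
    "SEQRES LYS SER LEU GLU 1ASH 140",
]
sequence = residues_from_seqres(sample)
print(" ".join(sequence))
print("\n".join(sequence_to_graph_lines(sequence[:2])))
```

Restricting the last call to the first two residues reproduces the ALA/ASN fragment shown on the slide.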

Secondary structure and its representation - HELIX
- Sample PDB lines (starting, ending, type):
    HELIX 1 ASN 1 HIS 13 1
    HELIX 2 ASN 20 ASN 36 1
- Vertex: h_type_length
- Helix length: Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
- SUBDUE graphic input:
    v 1 h_1_12     (helix 1, type 1, length 12)
    v 2 h_1_16     (helix 2, type 1, length 16)
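As a small worked check of the helix length formula, the sketch below builds the h_type_length label from the first and last sequence numbers. The function name is hypothetical.

```python
def helix_vertex_label(helix_type: int, first_seq: int, last_seq: int) -> str:
    """Build an h_type_length label with length = SeqNum(last) - SeqNum(first)."""
    if last_seq < first_seq:
        raise ValueError("last_seq must not precede first_seq")
    return f"h_{helix_type}_{last_seq - first_seq}"


# The two sample helices: ASN 1 to HIS 13, and ASN 20 to ASN 36, both type 1.
print(helix_vertex_label(1, 1, 13))
print(helix_vertex_label(1, 20, 36))
```

Note that the formula counts the span between the end points, so a helix covering residues 1 through 13 gets length 12 rather than 13.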

Secondary structure and its representation - SHEET
- Sample PDB lines (sense, length):
    SHEET 1 TYR 284 ILE
    SHEET 2 HIS 292 THR
- Vertex: s_sense_length
- SUBDUE graphic input:
    v 1 s_0_2      (strand 1, sense 0, length 2)
    v 2 s_-1_2     (strand 2, sense -1, length 2)

Overall secondary structure representation
- PDB lines:
    HELIX 1 THR 3 MET 13 1
    HELIX 2 ASN 24 ASN 34 1
    HELIX 3 SER 50 GLN 60 1
    SHEET 1 LYS 41 HIS 48 0
    SHEET 2 MET 79 THR 87 -1
- SUBDUE graphic input (elements ordered from N-terminus to C-terminus):
    v 1 h_1_10
    v 2 h_1_10
    e 1 2 sh
    v 3 s_0_7
    e 2 3 sh
    v 4 h_1_10
    e 3 4 sh
    v 5 s_-1_8
    e 4 5 sh
- The sequential relationship is represented as the edge "sh"
- Visualization: [Figure: the elements drawn in order from N-terminus to C-terminus]
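The mapping from HELIX and SHEET records to a chain of vertices joined by "sh" edges could be sketched like this. It is not the original secondary.c: the function names are hypothetical, the records are parsed in the simplified whitespace-separated form shown on the slide (real PDB records are fixed-column), and strand length is assumed to use the same end-minus-start rule as helix length, which agrees with the slide's numbers.

```python
def parse_element(line: str) -> tuple[int, str]:
    """Return (start sequence number, vertex label) for one simplified record."""
    record, _serial, _res1, start, _res2, end, code = line.split()
    prefix = {"HELIX": "h", "SHEET": "s"}[record]
    first, last = int(start), int(end)
    # code is the helix type for HELIX and the sense for SHEET.
    return first, f"{prefix}_{int(code)}_{last - first}"


def secondary_structure_graph(lines: list[str]) -> list[str]:
    """Order elements along the chain and link neighbours with 'sh' edges."""
    elements = sorted(parse_element(line) for line in lines)
    out: list[str] = []
    for i, (_start, label) in enumerate(elements, start=1):
        out.append(f"v {i} {label}")
        if i > 1:
            out.append(f"e {i - 1} {i} sh")
    return out


records = [
    "HELIX 1 THR 3 MET 13 1",
    "HELIX 2 ASN 24 ASN 34 1",
    "HELIX 3 SER 50 GLN 60 1",
    "SHEET 1 LYS 41 HIS 48 0",
    "SHEET 2 MET 79 THR 87 -1",
]
graph_lines = secondary_structure_graph(records)
print("\n".join(graph_lines))
```

Sorting by start position is what places the first strand (residues 41 to 48) between the second and third helices, even though the file lists all HELIX records before the SHEET records.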

Tertiary structure and its representation
- Sample PDB lines (each record carries X, Y, Z coordinate columns):
    ATOM CA ALA
    ATOM CA ASN
- Vertex: backbone carbon; edge: distance (vs, s)
- Distance (Å): $distance = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$
- SUBDUE graphic input:
    v 1 CA_ALA
    v 2 CA_ASN
    e 1 2 vs       (very short distance)
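The distance formula can be written out directly. In this sketch the function name and the two coordinate triples are hypothetical; they are not taken from any PDB entry.

```python
import math


def ca_distance(p: tuple[float, float, float],
                q: tuple[float, float, float]) -> float:
    """Euclidean distance between two atoms given as (x, y, z) in angstroms."""
    (x1, y1, z1), (x2, y2, z2) = p, q
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)


# Made-up coordinates chosen so the result is easy to verify by hand.
ca_ala = (0.0, 0.0, 0.0)
ca_asn = (1.0, 2.0, 2.0)
print(ca_distance(ca_ala, ca_asn))
```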

Rationale for representation choice - Criteria
- Patterns identified by SUBDUE must be representative of each category
- Patterns discovered by SUBDUE should discriminate one category from others

Primary sequence
- Vertex: a.a. residue name
- Edge: peptide bond
- Example (ARG - GLU - ALA):
    v 1 ARG
    v 2 GLU
    v 3 ALA
    e 1 2 bond
    e 2 3 bond

Secondary structure elements
- Type of the helix
- Starting and ending points (a.a. name and sequence number)
[Figure: a helix from ASN … HIS, drawn from N-terminus to C-terminus and annotated with type, length, starts, and ends]

Other ways of representing helix
- Separate type and length
- Combine type and length
[Figure: a Helix vertex with separate type and length attributes, compared with a single combined vertex Helix_1_12]

Tertiary structure
- (x, y, z) coordinates vary with the choice of origin
- Avoid numeric values; use distance labels instead: vs (≤ 4 Å), s (4 Å < dist ≤ 6 Å)
[Figure: two backbone carbons drawn in x, y, z axes and joined by a vs edge]
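A short sketch of the two points on this slide: raw coordinates change when the origin moves, while inter-atom distances (and therefore the vs/s labels) do not. The function names and coordinates are hypothetical, and returning `None` for distances above 6 Å is an assumption, since the slide defines no label for that range.

```python
import math
from typing import Optional


def distance_label(distance: float) -> Optional[str]:
    """Map a distance in angstroms to the slide's labels."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance <= 4.0:
        return "vs"  # very short
    if distance <= 6.0:
        return "s"   # short
    return None      # assumption: no edge label beyond 6 angstroms


def shift(point, offset):
    """Translate a point, i.e. re-express it relative to a different origin."""
    return tuple(c + o for c, o in zip(point, offset))


a = (1.0, 2.0, 3.0)
b = (4.0, 2.0, 3.0)          # 3 angstroms from a along x
offset = (10.0, -5.0, 2.5)   # a different choice of origin

before = math.dist(a, b)
after = math.dist(shift(a, offset), shift(b, offset))
print(before, after, distance_label(before))
```

Because the labels depend only on distances, two structures deposited with different coordinate origins still produce the same graph.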

Results: Primary structure patterns
- Hemo_seq (63/65), Hemo_sequence: THR LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SER ALA LEU SER THR LEU ALA ALA HIS LEU PRO ALA GLU PHE THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU ALA SER VAL SER THR VAL LEU THR SER LYS TYR
- Myo_seq (67/103), Myoglo_sequence: VAL LEU SER GLU GLY GLU TRP GLN LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP ARG
- Ribo_A (59/68), Ribonuclease_A_sequence: GLY GLN THR ASN CYS TYR GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP ALA SER VAL

Primary structure patterns
- Unique to each sample category
- Hemoglobin and myoglobin proteins share little sequence similarity

Results: Hemo secondary structure patterns
1: h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
7: h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

Results: Myo secondary structure patterns
1: h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25

Results: Ribo_A secondary structure patterns
1: h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10 -> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8 -> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 -> s_-1_8 -> s_-1_5 -> s_-1_3
10: h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3 -> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6

Results: Tertiary structural patterns
- SUBDUE finds small patterns (2 or 3 a.a.)
- Not unique to each category of proteins
- Not biologically meaningful

Visualization of secondary structure patterns - hemoglobin
[Figure: complete hemoglobin structure; 2 instances of the pattern structure; N-terminus and C-terminus labeled]

Visualization of secondary structure patterns - myoglobin
[Figure: complete myoglobin structure; 1 instance of the pattern structure; N-terminus and C-terminus labeled]

Visualization of secondary structure patterns - ribonuclease_A
[Figure: complete ribonuclease_A structure; 1 instance of the pattern structure; N-terminus and C-terminus labeled]

Discussion - Hemoglobin
- Hemoglobin: A, B, C, D chains
- Two types of patterns identified by SUBDUE: one for the A, C chains, the other for the B, D chains
- Patterns exist in a majority of hemoglobin proteins
- No instances of the best hemoglobin pattern were found in other proteins in the global data set

Occurrence of hemo patterns

Occurrence of hemo patterns - continued

Discussion - Myoglobin
- Myoglobin: one chain
- One dominant pattern identified by SUBDUE
- Patterns exist in most myoglobin proteins
- No instances of the best myoglobin pattern were found in other proteins in the global data set

Discussion - Hemoglobin and Myoglobin
- Similar secondary structure patterns
Hemoglobin B, D chains (from N- to C-terminus):
    h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
Myoglobin chain (from N- to C-terminus):
    h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
Hemoglobin A, C chains (from N- to C-terminus):
    h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
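How close these three patterns are can be quantified with a simple position-by-position comparison. This sketch counts only vertex-label substitutions between chains of equal length; it is a much simpler measure than a full inexact graph match (it ignores insertions and deletions), and the function names are hypothetical.

```python
def parse_pattern(text: str) -> list[str]:
    """Split 'a -> b -> c' into a list of vertex labels."""
    return [label.strip() for label in text.split("->")]


def substitutions(a: list[str], b: list[str]) -> int:
    """Count positions where two equal-length chains carry different labels."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares chains of equal length")
    return sum(x != y for x, y in zip(a, b))


hemo_bd = parse_pattern(
    "h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20")
myo = parse_pattern(
    "h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25")
hemo_ac = parse_pattern(
    "h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20")

print(substitutions(hemo_bd, myo))      # 3
print(substitutions(hemo_bd, hemo_ac))  # 2
print(substitutions(myo, hemo_ac))      # 3
```

All three patterns have eight helices and differ in at most three positions, and each differing helix changes only in length, not in type.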

Discussion - Hemoglobin and Myoglobin
- Consistent with the genetic studies
- Hemoglobin and myoglobin share one ancestral gene
- Divergence occurred in the course of evolution: one copy of the gene for myoglobin, four copies for hemoglobin
- The last helix of hemoglobin is shorter; one of the helices in the hemoglobin A, C chains almost disappears, allowing conformational change

Discussion - ribonuclease A proteins
- All patterns have three helices of the same size
- Several strands appear twice, indicating participation in the formation of two sheets
- Ribonuclease S protein (S-protein fragment) also has the pattern

Conclusion of the results
- Secondary structure patterns discovered by SUBDUE are representative of each category
- Secondary structure patterns discovered by SUBDUE are distinct for each category
- SUBDUE has the ability to discover biologically interesting patterns from PDB and other similar molecular biology (MB) databases

Comparison with other related studies
- Different graphic representation
- Predefined patterns with exact or inexact graph match
- Not applied systematically to PDB or other databases
- SUBDUE would perform a similar task if the inexact graph match routine is incorporated

Conclusions of the study
- Abstraction of 3D structure to its secondary structural elements is suitable for discovery
- The secondary structure patterns that SUBDUE discovered for each category can be used as a signature for its class
- Inexact graph match is useful for finding similar patterns
- SUBDUE is suitable for knowledge discovery in MB structural databases

Future Research
- More consistent and detailed description of secondary structure
- Add relative positions of the secondary structural elements to represent spatial relationships
- Investigate alternative representations: a more suitable 3D coordinate representation; weighting of different edges
- Inexact graph match in predefined substructure
- More collaboration with domain scientists