Positional Association Rules Dr. Bernard Chen Ph.D. University of Central Arkansas.



Central Dogma of Molecular Biology

Amino Acids, the subunit of proteins

Protein Primary, Secondary, and Tertiary Structure

Protein 3D Structure

Protein Sequence Motif
Although there are 20 amino acids, a protein's primary structure is not a random selection among them.
Sequence motif: one of a relatively small number of functionally or structurally conserved sequence patterns that occur repeatedly in a group of related proteins.

Protein Sequence Motif
These biologically significant regions or residues are usually:
- Enzyme catalytic sites
- Prosthetic group attachment sites (heme, pyridoxal-phosphate, biotin…)
- Amino acids involved in binding a metal ion
- Cysteines involved in disulfide bonds
- Regions involved in binding a molecule (ATP/ADP, GDP/GTP, Ca, DNA…)

HSSP-BLOSUM62 Measure

Future Works

Motivation
To obtain DNA or protein sequence motif information, it is usually necessary to fix the length of the sequence segments. Because of this fixed size, motif discovery methods may deliver a number of similar motifs that are simply shifted by several positions or that include mismatches.

Example
If a biological sequence motif has length 12 and we set the window size to 9, it is highly likely that we will discover two similar sequence motifs: one covering the front part of the biological motif and the other covering the rear part.

Positional Association Rules
The basic association rule gives information of the form A => B. However, when the order in which the items appear matters, the basic association rule is not powerful enough. We therefore introduce another parameter, called "distance assurance", to help identify frequent itemsets that also occur at a frequent distance.

Positional Association Rules

Pseudocode of Positional Association Rule with the Apriori concept

Algorithm: Positional Association Rule with the Apriori Concept
Input: Database D (protein sequences as transactions and sequence motifs as items), min_support, min_confidence, and min_distance_assurance
Output: P, the positional association rules in D
Method:
(1) L = find_frequent_itemsets(D, min_support)
(2) S = find_strong_association_rules(L, min_confidence)
(3) for (k = 2; S_k ≠ Ø; k++)
(4)     for each strong association rule r ∈ S_k
(5)         antecedent_motif = Apriori_Motif_Construct(r_ant)
(6)         consequent_motif = Apriori_Motif_Construct(r_con)
(7)         if antecedent_motif == NULL or consequent_motif == NULL:
                goto Step (4)    (continue with the next rule)
(8)         for each protein sequence ps ∈ D
(9)             for (ant_position = 1; ant_position ≤ |ps|; ant_position++)
(10)                if antecedent_motif starts at ps[ant_position]:
                        r_ant_count++
(11)                    for (con_position = 1; con_position ≤ |ps|; con_position++)
(12)                        if consequent_motif starts at ps[con_position]:
                                distance = ant_position - con_position
                                r_distance++
(13)    P_k = { r_distance | r ∈ S_k and r_distance > min_distance_assurance * r_ant_count }

Apriori_Motif_Construct(itemset)
    if |itemset| == 1:
        return itemset
    else:
        for each positional association rule in P_|itemset|
            if all items in the itemset appear in the positional association rule:
                return the new motif constructed by the positional association rule
        return NULL
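The counting loop of the pseudocode can be sketched in Python for the simplest case, a rule with one motif on each side. This is a minimal sketch under stated assumptions, not the author's implementation: motifs are treated as exact substrings (real sequence motifs are usually profiles), and the names find_starts and positional_rules are hypothetical. The distance is taken as ant_position - con_position, as in the pseudocode, and a distance is kept when its count exceeds min_distance_assurance times the number of antecedent occurrences.

```python
from collections import Counter


def find_starts(sequence, motif):
    """Return every 0-based index at which motif starts in sequence."""
    return [i for i in range(len(sequence) - len(motif) + 1)
            if sequence.startswith(motif, i)]


def positional_rules(sequences, antecedent, consequent, min_distance_assurance):
    """Map each distance that passes the threshold to its distance assurance.

    Assumption: motifs are plain strings matched exactly.
    """
    ant_count = 0                 # r_ant_count in the pseudocode
    distance_counts = Counter()   # r_distance in the pseudocode, one counter per distance
    for ps in sequences:
        con_starts = find_starts(ps, consequent)
        for ant_position in find_starts(ps, antecedent):
            ant_count += 1
            for con_position in con_starts:
                distance_counts[ant_position - con_position] += 1
    if ant_count == 0:
        return {}
    return {distance: count / ant_count
            for distance, count in distance_counts.items()
            if count > min_distance_assurance * ant_count}
```

For example, calling positional_rules(["XXABCXXDEFXX", "ABCXXDEF", "ABCDEF", "QQABC"], "ABC", "DEF", 0.4) counts four occurrences of the antecedent, two of which have the consequent starting five positions later, so only the distance -5 is kept.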

Positional Association Rules Example

minimum support = 60%, minimum confidence = 80%, minimum distance assurance = 60%

Scan for C1
A: 3/5
B: 5/5
C: 2/5
D: 4/5
E: 1/5
=> L1 = {A, B, D} => C2 = {AB, AD, BD}

Scan for C2
AB: 3/5
AD: 3/5
BD: 4/5
=> L2 = {AB, AD, BD} => C3 = {ABD}

Scan for C3
ABD: 3/5
=> L3 = {ABD} => no C4

Therefore, the itemsets that pass the minimum support are {AB, AD, BD, ABD}. Next, we need to compute their confidence.
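The level-by-level scan can be reproduced with a short Apriori sketch. The transaction table of the example is not part of this transcript, so the five transactions below are hypothetical: they were chosen only so that their support counts match the ones reported in the scans (A 3/5, B 5/5, C 2/5, D 4/5, E 1/5, AB 3/5, AD 3/5, BD 4/5, ABD 3/5). The function names are also assumptions.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical transactions, chosen to reproduce the support counts in the example.
TRANSACTIONS = [
    {"A", "B", "D"},
    {"A", "B", "C", "D"},
    {"A", "B", "D"},
    {"B", "C", "D"},
    {"B", "E"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_support):
    """Return {frequent itemset: support count}, built level by level."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]   # C1
    frequent = {}
    k = 1
    while candidates:
        # Scan: keep the candidates whose support reaches the threshold (L_k).
        level = {}
        for c in candidates:
            count = support_count(c, transactions)
            if Fraction(count, n) >= min_support:
                level[c] = count
        frequent.update(level)
        k += 1
        # Join L_(k-1) with itself, then prune candidates that have an
        # infrequent (k-1)-subset (the Apriori property).
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))]
    return frequent
```

Calling apriori(TRANSACTIONS, Fraction(3, 5)) keeps A, B and D at the first level and AB, AD, BD and ABD at the later levels, matching the scans above. Fraction is used so that the 60% comparison is exact.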

First, we work on the 2-itemsets: {AB, AD, BD}
A=>B: 3/3
B=>A: 3/5
A=>D: 3/3
D=>A: 3/4
B=>D: 4/5
D=>B: 4/4

Then, we work on the 3-itemset: {ABD}
A=>BD: 3/3
B=>AD: 3/5
D=>AB: 3/4
AB=>D: 3/3
AD=>B: 3/3
BD=>A: 3/4

Thus, the strong association rules we have:
2-itemset: A=>B, A=>D, B=>D, D=>B
3-itemset: A=>BD, AB=>D, AD=>B
Next, we work on Positional Association rules…
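The confidence step can be checked with a small sketch that uses only the support counts reported in the example (A 3, B 5, D 4, AB 3, AD 3, BD 4, ABD 3). The confidence of X=>Y is support(X ∪ Y) / support(X), and a rule is treated as strong when its confidence is at least the minimum confidence, which is what the example implies by keeping B=>D at exactly 4/5. The function name strong_rules is an assumption.

```python
from fractions import Fraction
from itertools import combinations

# Support counts (out of 5 transactions) as reported in the example.
SUPPORT = {
    frozenset("A"): 3, frozenset("B"): 5, frozenset("D"): 4,
    frozenset("AB"): 3, frozenset("AD"): 3, frozenset("BD"): 4,
    frozenset("ABD"): 3,
}


def strong_rules(support, min_confidence):
    """Return {"X=>Y": confidence} for every rule that meets min_confidence."""
    rules = {}
    for itemset, count in support.items():
        if len(itemset) < 2:
            continue
        # Every non-empty proper subset of the itemset can be an antecedent.
        for size in range(1, len(itemset)):
            for ant in combinations(sorted(itemset), size):
                antecedent = frozenset(ant)
                consequent = itemset - antecedent
                confidence = Fraction(count, support[antecedent])
                if confidence >= min_confidence:
                    name = "".join(sorted(antecedent)) + "=>" + "".join(sorted(consequent))
                    rules[name] = confidence
    return rules
```

With min_confidence = Fraction(4, 5), the result contains exactly the seven rules listed above; B=>A (3/5), D=>A (3/4), B=>AD (3/5), D=>AB (3/4) and BD=>A (3/4) are dropped.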

Positional Association Rules: D=>B (minimum distance assurance = 60%)
Distance candidate 1: 3/4
Distance candidate 2: 1/4
Distance candidate 3: 1/4

Positional Association Rules: B=>D (minimum distance assurance = 60%)
Distance candidate 1: 3/6
Distance candidate 2: 1/6
Distance candidate 3: 1/6

Positional Association Rules: A=>B (minimum distance assurance = 60%)
Distance candidate 1: 2/4
Distance candidate 2: 1/4
Distance candidate 3: 1/4
Distance candidate 4: 1/4

Positional Association Rules: A=>D (minimum distance assurance = 60%)
Distance candidate 1: 3/4
Distance candidate 2: 1/4

Positional Association Rules: AD=>B (minimum distance assurance = 60%)
Distance candidate 1: 2/3
Distance candidate 2: 1/3

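Positional Association Rules: AB=>D (minimum distance assurance = 60%)
There is no positional association rule on AB: the best distance for A=>B reaches only 2/4 = 50%, which is below the minimum distance assurance, so the antecedent motif AB cannot be constructed and the rule is skipped.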

Positional Association Rules: A=>BD (minimum distance assurance = 60%)
Distance candidate 1: 2/4
Distance candidate 2: 1/4
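As a quick arithmetic check, the distance-assurance fractions reported in the example can be compared against the 60% threshold. This sketch assumes the strict comparison from the pseudocode (count > min_distance_assurance * r_ant_count); the dictionary simply restates the fractions from the example, and the function name passing is hypothetical.

```python
from fractions import Fraction

# Distance-assurance fractions as reported in the example, per rule.
REPORTED = {
    "D=>B": [Fraction(3, 4), Fraction(1, 4), Fraction(1, 4)],
    "B=>D": [Fraction(3, 6), Fraction(1, 6), Fraction(1, 6)],
    "A=>B": [Fraction(2, 4), Fraction(1, 4), Fraction(1, 4), Fraction(1, 4)],
    "A=>D": [Fraction(3, 4), Fraction(1, 4)],
    "AD=>B": [Fraction(2, 3), Fraction(1, 3)],
    "A=>BD": [Fraction(2, 4), Fraction(1, 4)],
}


def passing(reported, min_distance_assurance):
    """Keep, for each rule, the fractions strictly above the threshold."""
    result = {}
    for rule, fractions in reported.items():
        kept = [f for f in fractions if f > min_distance_assurance]
        if kept:
            result[rule] = kept
    return result
```

With a threshold of Fraction(3, 5), only D=>B (3/4), A=>D (3/4) and AD=>B (2/3) survive. A=>B stops at 1/2, which is consistent with AB=>D having no positional association rule for its antecedent.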