Data Mining: Algorithms and Principles — Mining Biological Data —

Data Mining: Algorithms and Principles — Mining Biological Data —
©Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign Additional contributors: Mayssam Sayyadian, Xiao Hu and Zheng Shao (Spring 2004) 9/20/2018 Data Mining: Principles and Algorithms

Biological Data Mining
Biology Fundamentals Gene Expression and Microarray Bioinformatics Data Mining in Bioinformatics Data Mining in Biological Sequences Structure Mining and Drug Design Data Mining in Gene Expression and Micro-array Data Cluster Analysis Classification Analysis Variable Selection Summary 9/20/2018 Data Mining: Principles and Algorithms

Data Mining: Principles and Algorithms
Biology Fundamentals Genome = all the chromosomes inside the nucleus Gene = a segment on the DNA molecule Image Source: Genetics/basic.htm 9/20/2018 Data Mining: Principles and Algorithms

Biology Fundamentals (2)
Human Genome 23 pairs of chromosomes Image Source: 9/20/2018 Data Mining: Principles and Algorithms

Biology Fundamentals (3)
DNA: carry genetic information Protein: realize cell’s functionalities mRNA: serve as a template for protein production Information Flow: DNA -> mRNA -> Protein Image Source: magnet.fsu.edu/Education/2010/Lectures/26_DNA_Transcription.htm 9/20/2018 Data Mining: Principles and Algorithms

Bioinformatics Interdisciplinary Field (Molecular Biology, Statistics, Computer Science, Genomics, Genetics, Databases, Chemistry, Radiology …) Computational management and analysis of biological information 9/20/2018 Data Mining: Principles and Algorithms

Biological Vocabulary
Cell Store information in DNA Functionality is determined with Proteins Gene: Section of DNA producing a functional product Chromosome: Physical linear sequence of DNA Genome: entire collection of DNA for organism DNA (nucleotides)  RNA  Proteins (amino-acids) Premise of bioinformatics: gene sequences determine biological functions 9/20/2018 Data Mining: Principles and Algorithms

Eukaryotic Genome DNA Structure Nucleotides (bases) Adenine (A) Cytosine (C) Guanine (G) Thymine (T) Sequence data = Strings of letters triplet codons genetic code 20 amino acids (A, L, V, S etc.) 9/20/2018 Data Mining: Principles and Algorithms

Proteins: Prediction of Biochemical Function
CGCCAGCTGGACGGGCACACCATGAGGCTGCTGACCCTCCTGGGCCTTCTG… TDQAAFDTNIVTLTRFVMEQGRKARGTGEMTQLLNSLCTAVKAISTAVRKAGIAHLYGIAGSTNVTGDQVKKLDVLSNDLVINVLKSSFATCVLVTEEDKNAIIVEPEKRGKYVVCFDPLDGSSNIDCLVSIGTIFGIYRKNSTDEPSEKDALQPGRNLVAAGYALYGSATML DNA / amino acid sequence D structure protein functions Use of this knowledge for prediction of function, molecular modelling, and design (e.g., new therapies) 9/20/2018 Data Mining: Principles and Algorithms

Biological Information: From Genes to Proteins
DNA RNA Transcription Translation Protein Protein folding genomics molecular biology structural biology biophysics 9/20/2018 Data Mining: Principles and Algorithms

Data Mining & Bioinformatics : Motivations
Many biological processes are not well-understood Biological knowledge is highly complex and descriptive and experimental Data mining methods let us gain biological insight (data/information  knowledge) Biological data is abundant and information-rich Gene chips, bio-testing data, genomics & proteomics data, huge data banks and rich literature Largest and richest scientific data sets in the world Find correlations among linkages and information in literature and distributed databases Mining for correlations, linkages between disease and gene sequences, interactions among proteins, classification, clustering, outliers, etc. 9/20/2018 Data Mining: Principles and Algorithms

Data Mining & Bioinformatics – How ?
Handling heterogeneous, distributed bio-data Build Web-based, interchangeable, integrated, multi-dimensional genome databases Data cleaning and data integration methods becomes crucial Mining correlated information across multiple databases itself becomes a data mining task Typical studies: potter’s wheel, mining database structures, information extraction from data, document classification, association and correlation discovery algorithms... Exploration of existing data mining tools Towards functional genomics (functional networks of genes and proteins) 9/20/2018 Data Mining: Principles and Algorithms

Similarity search and comparison in bio-data Identify critical differences between classes of genes (e.g., diseased and healthy) By finding and comparing frequent patterns Identify gene sequence patterns that play roles in various diseases Those only in diseased samples may indicate the genetic factors of disease Those only in healthy ones may indicate mechanisms that protect the body from disease Similarity analysis in micro-array data and protein data Non-perfect matches (noisy environment) and mining for sequential and structured patterns 9/20/2018 Data Mining: Principles and Algorithms

Association analysis: identification of co-occurring bio-sequences Most diseases are not triggered by a single gene but by a combination of genes/proteins acting together Association analysis may help determine the kinds of genes/proteins that are likely to co-occur together in target samples Path analysis: linking genes/proteins to different disease development stages Different genes/proteins may become active at different stages of the disease Develop pharmaceutical interventions that target the different stages separately Visualization tools and genetic/proteomic data analysis 9/20/2018 Data Mining: Principles and Algorithms

Sequence Mining in Bioinformatics
Sequence alignment and homology search in bio-data Mining frequent sequential patterns in bio-data Classifying biological sequences Clustering biological sequences 9/20/2018 Data Mining: Principles and Algorithms

Sequence Alignment Goal: Given two or more input sequences Identify similar sequences with long conserved subsequences Method: Use substitution matrices (probabilities of substitutions of amino-acids/bases and probabilities of insertions and deletions) Optimal alignment problem: NP-hard Heuristic method to find good alignments 9/20/2018 Data Mining: Principles and Algorithms

Pair-Wise Sequence Alignment
Example Which one is better?  Scoring alignments HEAGAWGHEE PAWHEAE HEAGAWGHE-E HEAGAWGHE-E P-A--W-HEAE --P-AW-HEAE 9/20/2018 Data Mining: Principles and Algorithms

Pair-Wise Sequence Alignment ─ Scoring
To compare two sequence alignments, calculate a score PAM or BLOSUM matrices Matches and mismatches Gap penalty Initiating a gap Gap extension penalty Extending a gap 9/20/2018 Data Mining: Principles and Algorithms

Pair-wise Sequence Alignment
H W 5 -1 -2 -3 6 10 P -4 15 Gap penalty: -8 Gap extension: -8 HEAGAWGHE-E --P-AW-HEAE (-8) + (-8) + (-1) (-8) (-8) + 6 = 9 HEAGAWGHE-E Exercise: Calculate for P-A--W-HEAE 9/20/2018 Data Mining: Principles and Algorithms

Formal Description Problem: PairSeqAlign Input: Two sequences x, y Scoring matrix s Gap penalty d Gap extension penalty e Output: The optimal sequence alignment Difficulty: If x, y are of size n then number of possible global  alignments is 9/20/2018 Data Mining: Principles and Algorithms

Global Alignment Needleman-Wunsch 1970 Idea: Build up optimal alignment from optimal alignments of subsequences HEAG --P- -25 Add score from table HEAG- --P-A -33 HEAGA --P-A -20 HEAGA --P— -33 Gap with bottom Top and bottom Gap with top 9/20/2018 Data Mining: Principles and Algorithms

Global Alignment Uses recursion to fill in intermediate results table Uses O(nm) space and time O(n2) algorithm Feasible for moderate sized sequences, but not for aligning whole genomes. yj aligned to gap F(i-1,j-1) F(i,j-1) s(xi,yj) d F(i-1,j) F(i,j) d xi aligned to gap While building the table, keep track of where optimal score came from, reverse arrows 9/20/2018 Data Mining: Principles and Algorithms

Pair-Wise Sequence Alignment
Alignment: F(0,0)-F(n,m) Alignment: 0-F(i,j) We can vary both the model and the alignment strategies 9/20/2018 Data Mining: Principles and Algorithms

Heuristic alignment algorithms
Motivation: Complexity of alignment algorithms O(nm) Current protein DB: 100 million base pairs Imagine matching each sequence with a 1,000 base pair query Takes about 3 hours! Heuristic algorithms aim at speeding up at the price of possibly missing the best scoring alignment Two well known programs BLAST: Basic Local Alignment Search Tool FASTA: Fast Alignment Tool Both find high scoring local alignments between a query sequence and a target database Basic idea: first locate high-scoring short stretches and the extend them 9/20/2018 Data Mining: Principles and Algorithms

Other Alignment Algorithms
Dot Matrix Plot Extremely simple but O(n2) Visual 9/20/2018 Data Mining: Principles and Algorithms

FASTA (Fast Alignment)
Approach [Pearson & Lipman 1988] Derived from logic of the dot matrix method View sequences as sequences of short words (k-tuple) DNA - 6 bases, protein - 1 or 2 amino acids Start from nearby sequences of exact matching words Motivation Good alignments should contain many exact matches Hashing can find exact matches in O(n) time Diagonals can be formed from exact matches quickly Sort matches by position (i – j) Look only at matches near longest diagonals Apply more precise alignment to small search space at end 9/20/2018 Data Mining: Principles and Algorithms

FASTA (Fast Alignment)
9/20/2018 Data Mining: Principles and Algorithms

BLAST (Basic Local Alignment Search Tool)
Approach (BLAST) [Altschul+ 1990] View sequences as sequences of short words (k-tuple) DNA - 11 bases, protein - 3 amino acids Create hash table of neighborhood (closely-matching) words Use statistics to set threshold for “closeness” Start from exact matches to neighborhood words Motivation Good alignments should contain many close matches Statistics can determine which matches are significant Much more sensitive than % identity Hashing can find matches in O(n) time Extending matches in both directions finds alignment Yields high-scoring /maximum segment pairs (HSP/MSP) 9/20/2018 Data Mining: Principles and Algorithms

BLAST (Basic Local Alignment Search Tool)

Multiple Sequence Alignment
Alignment containing multiple DNA / protein sequences Look for conserved regions → similar function Example: #Rat ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT #Mouse ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT #Rabbit ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC #Human ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC #Oppossum ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG #Chicken ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT #Frog ATGGGTTTGACAGCACATGATCGT---CAGCT 9/20/2018 Data Mining: Principles and Algorithms

Multiple Sequence Alignment: Motivations
Identify highly conserved residues Likely to be essential sites for structure/function More precision from multiple sequences Better structure/function prediction, pairwise alignments Building gene/protein families Use conserved regions to guide search Basis for phylogenetic analysis Infer evolutionary relationships between genes Develop primers & probes Use conserved region to develop Primers for PCR Probes for DNA micro-arrays 9/20/2018 Data Mining: Principles and Algorithms

Multiple Alignment Model
Q1: How should we define s? Q2: How should we define A? Model: scoring function s: A X1=x11,…,x1m1 X1=x11,…,x1m1 Possible alignments of all Xi’s: A ={a1,…,ak} Find the best alignment(s) X2=x21,…,x2m2 X2=x21,…,x2m2 … … S(a*)= 21 XN=xN1,…,xNmN XN=xN1,…,xNmN Q4: Is the alignment biologically Meaningful? Q3: How can we find a* quickly? 9/20/2018 Data Mining: Principles and Algorithms

Minimum Entropy Scoring
Intuition: A perfectly aligned column has one single symbol (least uncertainty) A poorly aligned column has many distinct symbols (high uncertainty) Count of symbol a in column i 9/20/2018 Data Mining: Principles and Algorithms

Multidimensional Dynamic Programming
Assumptions: (1) columns are independent (2) linear gap cost Alignment: 0,0,0…,0---|x1| , …, |xN| We can vary both the model and the alignment strategies =Maximum score of an alignment up to the subsequences ending with 9/20/2018 Data Mining: Principles and Algorithms

Complexity of Dynamic Programming
Comlexity: Space: O(LN); Time: O(2NLN) One idea for improving the efficiency Define the score as the sum of pairwise alignment scores Derive a lower bound for S(akl), only consider a pairwise alignment scoring better than the bound Pairwise alignment between sequence k and l 9/20/2018 Data Mining: Principles and Algorithms

Approximate Algorithms for Multiple Alignment
Feng-Doolittle alignment Step 1: Compute all possible pairwise alignments Step 2: Convert alignment scores to distances Step 3: Construct a “guide tree” by clustering Step 4: Progressive alignment based on the guide tree (bottom up) Problems: All alignments are completely determined by pairwise alignment (restricted search space) No backtracking (subalignment is “frozen”) No way to correct an early mistake Non-optimality: Mismatches and gaps at highly conserved region should be penalized more, but we can’t tell where is a highly conserved region early in the process 9/20/2018 Data Mining: Principles and Algorithms

Iterative Refinement Re-assigning a sequence to a different cluster/profile Repeatedly do this for a fixed number of times or until the score converges Essentially to enlarge the search space 9/20/2018 Data Mining: Principles and Algorithms

ClustalW: A Multiple Alignment Tool
Essentially following Feng-Doolittle Do pairwise alignment (dynamic programming) Do score conversion/normalization (Kimura’s model) Construct a guide tree (neighbour-journing clustering) Progressively align all sequences using profile alignment Many Heuristics 9/20/2018 Data Mining: Principles and Algorithms

Gene Expression and Microarray
Gene Expression: The process of transcribing the gene’s DNA sequence into mRNA The functionality of the cell is solely determined by DNA. Right? Wrong! The functionality is also affected by environment. Environment (e.g., enzymes) will affect gene expression. 9/20/2018 Data Mining: Principles and Algorithms

Gene Expression and Microarray (2)
A row contains the same sample A column tests the same gene Image Source: magnet.fsu.edu/Education/2010/Lectures/26_DNA_Transcription.htm Microarray: A technique for experiment Usage: Measure the amount of each gene (carried by mRNA) Advantage: Do thousands of experiments at the same time 9/20/2018 Data Mining: Principles and Algorithms

DNA Microarray – flash animation PCR - shockwave animation RT-PCR - flash animation 9/20/2018 Data Mining: Principles and Algorithms

Result of a Microarray experiment: An N x M matrix N is the number of samples M is the number of genes Typical Example: 50 x 5000 Value in the matrix (Usually logarithm of) amount of genes This is the input to Microarray Mining 9/20/2018 Data Mining: Principles and Algorithms

How to find similar samples according to genes (and similar genes according to samples)? Cluster Analysis How to classify new samples based on existing ones? Classification Analysis Which genes are most useful in determining the sample’s class? Variable Selection (Feature Selection) 9/20/2018 Data Mining: Principles and Algorithms

Agenda Mining Frequent Pattern in Biological Sequences Deterministic patterns Probabilistic patterns Classifying Biological Sequences KNN Markov Chain based Feature based Clustering Biological Sequences Projecting to Vector Space CLUSEQ A lot of works. In each topic, pick up two or three representative methods. 9/20/2018 Data Mining: Principles and Algorithms

Properties of Biological Sequences
Sequences of single of symbols Nucleotides: A,G,T,C alphabet size: 4 Amino-acids: A,V,F,M … alphabet size : 20 Long length Hundreds, thousands bp (base pair) Calls for methods with high efficiency Noise Large parts of sequences seem to carry no information Some mutations of amino acids occur Can be looked as strings over alphabets of 4 or 20 symbols. Long which calls for methods with high efficiency 9/20/2018 Data Mining: Principles and Algorithms

Mining Frequent Pattern in Biosequences
Motivation: Patterns appearing often enough in a set of biological sequences is expected to play a role in defining the respective sequences’ functional behavior and/or evolutionary relationships Find maximal patterns, as specific as possible Deterministic model Support: number of sequences in which the pattern occurs Either match sequence or not (zero or one) Probabilistic model Match: approximate matching [J.Yang et al 2002] Give a probability between 0 and 1. Why are frequent sequential patterns interesting 9/20/2018 Data Mining: Principles and Algorithms

Types of Biosequnces Patterns
Deterministic patterns PROSITE notation: C-x(3,5)-[NHG]-x-{FY}-x(2)-Q-C non-x letter: one particular type of aa (amino acid) X(3,5): at least 3 and at most 5 aa of any type may occur at this position – flexible gap [NHG]: any of the aa in [] may be here {FY}: all aa except F and Y are allowed in this position. 9/20/2018 Data Mining: Principles and Algorithms

Types of Biosequnces Patterns
Probabilistic Patterns Position Weight Matrices (PWM) Positions Compatibility Matrix [J.Yang et al 2002] Stochastic Models HMM 9/20/2018 Data Mining: Principles and Algorithms

Mining Deterministic Patterns
Alignment algorithms: Computationally expensive (NP-hard) Reveal only global similarities Enumerating algorithms: Expensive, only good for short patterns Creating long patterns from short patterns TEIRESIAS: find maximal patterns Iterative heuristic methods Gibbs Sampling: faster, but only local maximum …… 9/20/2018 Data Mining: Principles and Algorithms

TEIRESIAS: Rigoutsos et al (IBM)
Pattern mined: with rigid gaps, e.g. A..H.C “.” matches one arbitrary symbol in alphabet Finds all maximal patterns with minimal support Algorithm: 2 Phases Scanning Scan the sequences, locate all elementary patterns with minimum support Convolution Build up maximal patterns from elementary patterns Convolution (join) operation : efficient Mining Frequent sequential patterns has been widely studied. In biological sphere, TERESIAS is a representative example. It mines patterns with rigid gaps. 9/20/2018 Data Mining: Principles and Algorithms

TEIRESIAS: example F.AS F.AST AST F.ASTS STS A user provides 3 parameters , W: maximum number of don’t care characters between two successive residues in a pattern Elementary patterns Guarantee to find maximal patterns 9/20/2018 Data Mining: Principles and Algorithms

Mining Probabilistic Patterns
Support Metric Require exact match and fail to recognize possible substitution among symbols Protein may mutate without change of its functionality Compatibility Matrix Accommodate mutation circumstances Match metric allow flexibility in pattern matching Including TEIRESIAS, most frequent sequential pattern mining methods apply support metric. But in some cases, the exact match assumption is not good. protein may … , those mutations should not be looked as independent patterns. we need to accommodate these scenarios , so a new measure Match is proposed: 9/20/2018 Data Mining: Principles and Algorithms

Compatibility Matrix Symbols observed may be different from the real ones Compatibility matrix provides a probabilistic connection from the observation to the underlying substances. C(di,dj)=Prob(true_value = di | observed_value = dj) Prob(d1|d1)=0.9 Prob(d2|d1)=0.05 Prob(d3|d1)=0.05 Prob(d4|d1)=0 Prob(d5|d1)=0 Observed value d1 d2 d3 d4 d5 true value 0.9 0.1 0.05 0.8 0.7 0.15 0.75 0.85 Instead of assigning 0 or 1 to indicate occurrence of a symbol, Each symbol occurrence can be considered as occurrences of multiple symbols with various degrees. For example, if d1 is observed, then with 0.9 probability, it is an occurrence of d1, with 0.05 probability, it is an occurrence of d2, and so on. The compatibility matrix is assumed to be provided by an expert in the area 9/20/2018 Data Mining: Principles and Algorithms

Match Given a pattern P= d1d2…dl, a sequence s= d1’d2’…dl’ the match of P in s M(P,s):=Prob(P|s) =  1ilC(di,di’) E.g: if P=d1d2, s=d1d3, then M(P,s)=C(d1,d1)C(d2,d3)=0.9 0.05=0.045. Here P is not a subpattern of s, but M(P,s) >0. Suppose the symbols in the sequence are independent, product of the occurrence of each symbol (correspondent entries in CM) Provides flexibility in approximate match 9/20/2018 Data Mining: Principles and Algorithms

Match The match of a pattern P in a subsequence s (with the same length) is Prob(P| s). The match of a pattern P in a sequence S is defined as M(P,S)=Maxssubsequences of S M(P,s) The match of pattern P in a database D of N sequence: M(P,D)=SDM(P,S) / N All patterns that meet a specified threshold min_match are called frequent patterns. Maximum match of P in all subseq, of S Average match of all sequences 9/20/2018 Data Mining: Principles and Algorithms

Match Model The match model can accommodate misrepresentation of symbols due to noise The Apriori property also holds on the match metric. The match of a pattern P in a database D  the match of any subpattern of P. In a noise-free environment, match model can represent support model. Let the compatibility matrix be an identity matrix (C(di,dj)=1 if i=j and is 0 otherwise). The match of a pattern is equal to the support of a pattern. So it is an extension of support measure 9/20/2018 Data Mining: Principles and Algorithms

Agenda Mining Frequent Pattern in Biological Sequences Classifying Biological Sequences KNN Markov Chain based Feature based Clustering Biological Sequences 9/20/2018 Data Mining: Principles and Algorithms

K-nearest Neighbor Classifier(KNN)
Locate K training sequences: most similar to the test sequence Class label of the test sequence := Majority Function (class labels of K seq.s) KEY- Similarity function : sim(Si,Sj) Simplest Score= aligned symbols identical 0 otherwise Advanced: position dependent scores score matrices of similarity between different amino-acids Well used in bioinformatics tools. How to measure the similarity of sequences In addition, there are various score matrices 9/20/2018 Data Mining: Principles and Algorithms

Markov Chain based Classifier
Build simple Markov chain using training datasets for training seq. with class + for training seq. with class - Compute for test seq. + if Class label of = - if 9/20/2018 Data Mining: Principles and Algorithms

IMM & SMM Problem in Higher Order Models many high-order states cannot frequently observed in the training dataset Interpolated Markov Models (IMM) the weight of transition probability of the j th-order model Selective Markov Models (SMM) Build varying order Markov chains Prune the non-discriminatory states from the higher order Markov chains The motivation is that higher order Markov states capture longer context, but they tend to have lower support. Hence, using probabilities from lower order states which have higher support helps in correcting some of the errors in estimating the transition probabilities for higher order models. Furthermore, by using the various γparameters, IMMs can control how the various states are used while computing the transition probabilities. (A number of methods for computing the various γ factors where presented in [SDKW98] that are based on the distribution of the different states in the various order models. However, the right method appears to be dataset dependent. 9/20/2018 Data Mining: Principles and Algorithms

Feature Based Sequence Classifier
Project each sequence into vector-space Use sophisticated linear classifiers E.g: Support Vector Machine(SVM) KEY-- How to project? Based on Markov chain 9/20/2018 Data Mining: Principles and Algorithms

Vector-space view of sequence in Markov chain
Classifier: 1st order Markov Chain Vector u,w length: N2 (N: size of alphabet) Each dimension: a unique pair of symbols If j th dimension of the vectors = AG u (j) = number of occurrences of AG in S w(j) = w: from Transition probability Matrices Higher order Markov Chain Dimensionality of the new space is N k+1 u: Sequence vector w: Weight vector If N is the size of alphabet, the dimension of both vectors are N2. Log-odds ratio of P+. P- are the transition probabilities of the positive and negative classes, Allow us to view each sequence as a vector in a new space whose dimensions are made of all possible pairs, triplets, quintuplets, etc, of symbols, and second it shows that each one of the Markov chain based techniques is nothing more than a linear classifier [Mit97], in which the decision hyperplane is given by the vector w that corresponds to the log-odds ratios of the maximum likelihood estimates for the various transition probabilities. 9/20/2018 Data Mining: Principles and Algorithms

Example: Sequence classification in vector space based on 1st order Markov Chain Based on first order Markov Chain S is positive, since the value >0 9/20/2018 Data Mining: Principles and Algorithms

Agenda Mining Frequent Pattern in Biological Sequences Classifying Biological Sequences Clustering Biological Sequences Projecting to Vector Space CLUSEQ 9/20/2018 Data Mining: Principles and Algorithms

Clustering Biological Sequences
Motivation: Detecting protein families Revealing unknown object groups Challenges: Similarity measurement Alignment based Similarity measure based on pairwise alignment Frequent Sequence based No alignment, use frequent subsequences Efficiency 9/20/2018 Data Mining: Principles and Algorithms

Frequent Sequence based Method
Projecting to vector space [Guralnik & Karypis ‘01] Determine all frequent subsequences Select independent subset of these subsequences (features) Count occurrences of features (Vector Space Model) Applying any clustering algorithm to the vectors 9/20/2018 Data Mining: Principles and Algorithms

Example: feature selection
No Markov Chain here with 50% support Select “independent” subsequences Globally selected: consider all sequences together, locally selected: consider sequence one by one 9/20/2018 Data Mining: Principles and Algorithms

CLUSEQ [J.Yang et al 2003] Efficient and Effective Sequence Clustering Probability approach Similar to higher order Markov Model Probabilistic Suffix Tree Use structural characteristics of sequences Structure determines function Sequences: sequential structure A Novel Algorithm High accuracy Automatically adjust parameters (e.g. # of clusters) The rational is: …The order of the symbols in sequences Using the conditional probability distribution of the next symbol given a preceding segment to characterize sequence behavior and to support the similarity measure. Overcome the typical problem of clustering method: manually setting parameters 9/20/2018 Data Mining: Principles and Algorithms

Similarity Measure A cluster S is modeled by conditional probability distribution Ps represented by a Probabilistic Suffix Tree The probability of an arbitrary sequence : (s1 s2… sl) subsuming to Ps : Significant/Insignificant subsequences robust to noise If s1 s2… si-1 is not significant (i.e < threshold), approximate it by its longest significant suffix Different clusters subsume following different P Where is the conditional probability that the symbol si is the next symbol right after the segment s1…si-1 To make the measure robust to noise, a significant threshold is given. If the occurrence of a subsequence is less than the threshold, it is not significant. In this case, .., try to capture the context information as much as possible 9/20/2018 Data Mining: Principles and Algorithms

Similarity Measure Similarity between sequence  and cluster S: Accommodate subsequences matching Ratio of probability of subsuming to distribution of S, and of generating the sequence randomly. This similarity measure requires the entire sequence follows a single CPD, too strict, different portions of a sequence may subsume to different CPD. Modified to capture the maximum similarity between any continuous segment of and S >= t:  belongs to S < t:  not belong to S 9/20/2018 Data Mining: Principles and Algorithms

Probabilistic Suffix Tree (PST)
Improve efficiency of computation Suffix tree: good model for symbol sequences Each cluster is represented by a PST tree Rooted directed tree Each edge is labeled with a symbol in the sequence set Concatenation of edge-labels on the path between the node and the root spells out a subsequence Each note has probabilities of the next symbol 9/20/2018 Data Mining: Principles and Algorithms

Example: A Probabilistic Suffix Tree P(b|bba)≈ P(b|ba)=0.594 In this example there are only two symbols a and b, number in each node is the occurrences of its label in the sequence cluster The red line is the cut off of significant nodes, whose occurrences are larger than a preset threshold. 9/20/2018 Data Mining: Principles and Algorithms C=65

CLUSEQ: algorithm Update cluster membership and Probabilistic suffix tree Merge heavily overlapped clusters Novel part is that, upon iteration, it automatically updates the clusters, and adjust the parameters. Robust w.r.t initial parameter settings. 9/20/2018 Data Mining: Principles and Algorithms

Summary Mining Biosequence Frequent Pattern Deterministic patterns Probabilistic patterns Classifying Biological Sequences KNN: simple, using alignment information Markov Chain based :IMM, SMM Feature based : vector space Clustering Biological Sequences Frequent sequence based: vector space CLUSEQ Probabilistic based similarity measure Exploring sequential characteristics 9/20/2018 Data Mining: Principles and Algorithms

References Rigoutsos, I. & Floratos, A. Combinatorial pattern discovery in biological sequences: The Teiresias algorithm, Bioinformatics,1998, 14(1), pp Brejova, B., DiMarco, C., Vinar, T., Hidalgo, S. R., Holguin, G. and Patten, C. Finding Patterns in Biological Sequences. Unpublished project report for CS798G, University of Waterloo, Fall 2000 J. Yang, P. Yu, W. Wang, and J. Han 'Mining Long Sequential Patterns in a Noisy Environment ', SIGMOD2002 Mukund Deshpande, George Karypis, Evaluation of Techniques for Classifying Biological Sequences , PAKDD2002 Valerie Guralnik, George Karypis, A Scalable Algorithm for Clustering Sequential Data, ICDM2001 Jiong Yang, Wei Wang, CLUSEQ: Efficient and Effective Sequence Clustering , ICDE 2003 9/20/2018 Data Mining: Principles and Algorithms

Cluster Analysis Hierarchical clustering K-means algorithm Self-Organizing Map (SOM) SVD-based clustering Block clustering Gene shaving Significance of clusters 9/20/2018 Data Mining: Principles and Algorithms

Hierarchical clustering
The most common approach for grouping gene expression data Bottom-Up Approach Input: Distance-Matrix (N x N) Algorithm: Find the two nearest groups, combine them together Diversity Single Linkage: Group distance = minimum distance between points of the two groups Average Linkage: Group distance = average distance between points of the two groups 9/20/2018 Data Mining: Principles and Algorithms

K-means Algorithm Top-down approach Pre-specified number of clusters Algorithm: Select initial cluster centers Classify all the samples Recalculate the centers of each cluster Problem The result is highly dependent on the initial set of centers 9/20/2018 Data Mining: Principles and Algorithms

Self-Organizing Map (SOM)
2-layer NN Input layer: vector Output layer: matrix Each node in Output Layer has a weight vector w. Image Source: 9/20/2018 Data Mining: Principles and Algorithms

Self-Organizing Map (SOM) (2)
Given an input vector x (a sample), the output of the node is Gaussian(w - x). Image Source: Image Source: 9/20/2018 Data Mining: Principles and Algorithms

Initialization Random Initialization Sample Initialization Linear Initialization Where the weight vectors are initialized in an orderly fashion along the linear subspace spanned by the two principal eigenvectors of the input data set. 9/20/2018 Data Mining: Principles and Algorithms

Training For each sample y Select the output node with the greatest output value. Assume it’s output node j. Update the weight vector ci of every output node I 9/20/2018 Data Mining: Principles and Algorithms

Advantages: Partial structure on the clusters (neighbor relation between nearby clusters) Easy to implement Reasonably fast, scalable to large data sets Easy visualization and interpretation 9/20/2018 Data Mining: Principles and Algorithms

SVD-based Clustering Singular Value Decomposition (Also named KL-Decomposition / PCA) Sort on singular values Keep the first K singular values Effect: Filter out noise or experimental artifacts, enable meaningful comparison of the expression of different genes across different samples 9/20/2018 Data Mining: Principles and Algorithms

Block Clustering Two-way clustering technique: Simultaneously cluster both genes and samples Algorithm: Begin with the entire matrix in one block At each stage find the best row and column split Split continued until a large number of blocks are obtained Some blocks are recombined until the optimal number of blocks are obtained 9/20/2018 Data Mining: Principles and Algorithms

Block Clustering (2) The best row and column split Largest reduction in the total within block variance The optimal number of blocks Randomly permute the elements within each row of the original matrix A. The result is B. Find the k that maximize the difference between the within cluster variance of A with k clusters the within cluster variance of B with k clusters 9/20/2018 Data Mining: Principles and Algorithms

Gene Shaving Goal find several small, possibly overlapping blocks of genes The blocks: small between-genes variance Algorithm Start with the whole gene expression matrix Y Find the first principal component of the genes For each gene, compute the absolute value of its correlation with the first principal component Remove the 10% of the genes having the smallest absolute correlation 9/20/2018 Data Mining: Principles and Algorithms

Gene Shaving (2) The algorithm produces a set of nested gene groups: The first cluster is selected as the group that maximizes the gap statistic 9/20/2018 Data Mining: Principles and Algorithms

Significance of Clusters
How do we show that the clusters discovered are real and biologically interesting rather than coincidental artifacts of the data? Bootstrap approach. Try different subsets of the data and check if the results are consistent. 9/20/2018 Data Mining: Principles and Algorithms

Classification Analysis
Support Vector Machines (SVM) Nearest Neighbor Decision Trees Voted classification Weighted Gene Voting Bayesian Classification 9/20/2018 Data Mining: Principles and Algorithms

Support Vector Machines (SVM)
2-Class classification Map data from low-dimensional space to high-dimensional space. So that originally linear unseparatable data might be linear separatable in high-dimensional space. Advantage: Does not tend to overfit 9/20/2018 Data Mining: Principles and Algorithms

Nearest Neighbor All training data are saved. Get the k nearest neighbors (in the training data) of the new sample Determine the class by majority of the neighbors. K = 3,5,7… (For 2-class classification) 9/20/2018 Data Mining: Principles and Algorithms

Voted Classification Takes a classifier and a training set as input Trains the classifier multiple times on different versions of the training set Two types: Bagging Algorithm Boosting Algorithm 9/20/2018 Data Mining: Principles and Algorithms

Voted Classification (2)
Bagging Algorithm Bootstrap sample set: randomly select T samples from the original training set. Train classifiers on bootstrap sample sets Usage: Apply these classifiers to the test data, choose the most votes. Useful for unstable classifiers (like decision tree) 9/20/2018 Data Mining: Principles and Algorithms

Voted Classification (3)
Boosting Algorithm Train a classifier on the training set The trained classifier is assigned a voting weight based on its accuracy Calculate the classification result for each sample, based on the weighted voting of all trained classifiers Modify the probability of each sample appearing in the training set based on the classification correctness so far. 9/20/2018 Data Mining: Principles and Algorithms

Bayesian Classification
Bayes’ theorem Assumption: Each element of y is conditionally independent 9/20/2018 Data Mining: Principles and Algorithms

Variable Selection Problem: 50 samples, 5000 genes Which genes are useful? 9/20/2018 Data Mining: Principles and Algorithms

Variable Selection (2) Unsupervised (No class information) cut-off on the SD of the gene over all the samples (The bigger, the better) Supervised (With class information) cut-off on Ratio of between-class to within-class variance (The bigger, the better) 9/20/2018 Data Mining: Principles and Algorithms

Summary Clustering Hierarchical clustering K-means algorithm Self-Organizing Map (SOM) SVD-based clustering Block clustering Gene shaving Significance of clusters Classification Support Vector Machines (SVM) Nearest Neighbor Decision Trees Voted classification Weighted Gene Voting Bayesian Classification Variable Selection 9/20/2018 Data Mining: Principles and Algorithms

Reference K.Aas, Microarray Data Mining: A Survey, NR Note SAMBA/02/01 David Page, PPT of Supervised Learning for Gene Expression Microarray Data Images are obtained from Google Images (Image on page 8 from the paper Aas01) 9/20/2018 Data Mining: Principles and Algorithms

Thank you !!! 9/20/2018 Data Mining: Principles and Algorithms

Clustering Measures (Distance)
Lp Norm Distances Manhattan (p = 1) Euclidean Distance (p = 2) Cosine Distance Robust Kernel Distances (?) Distance functions are not adequate in capturing all kinds of correlated behavior. 9/20/2018 Data Mining: Principles and Algorithms

Shifting and Scaling Patterns

Shifting and Scaling Patterns in Subspaces

DNA micro-array analysis
Applications DNA micro-array analysis 9/20/2018 Data Mining: Principles and Algorithms

Why not use correlation ?
Prior Work Why not use correlation ? Cannot capture non global characteristics over attributes. 9/20/2018 Data Mining: Principles and Algorithms

Bicluster Method ? (Y.Cheng & G. Church)
Prior Work Bicluster Method ? (Y.Cheng & G. Church) Submatrix of a l-bicluster not necessarily a l-bicluster Greedy approach can miss overlapping clusters Difficult to remove outliers from clusters. 9/20/2018 Data Mining: Principles and Algorithms

Notations 9/20/2018 Data Mining: Principles and Algorithms

pScore pScore & pCluster 9/20/2018 Data Mining: Principles and Algorithms

pScore (cont..) Good similarity for Shifting Patterns Ordinary Patterns are the degenerate case where shift is zero Scaling Patterns can be converted to Shifting Patterns by taking logarithm on original attributes and hence pScore is a good metric to tackle scaling patterns too. (?) The pClusters defined by using the pScore are compact. Intuitively, pScore(X) < c means that the change of values between two attributes of two objects in X is confined by a user-specified threshold c. 9/20/2018 Data Mining: Principles and Algorithms

Problem Statement Given : cluster threshold nc: minimal number of columns of a cluster nr: minimal number of rows of a cluster Find all (O,T) such that (O, T) is a -pCluster, where |O| ≥ nr & |T| ≥ nc. 9/20/2018 Data Mining: Principles and Algorithms

Maximum Dimension Set (MDS)
Search over subspaces for An object set O c-pClusterable on a column set T only when there doesn’t exist T’  T, such that O c-pClusters on T’. That T which c-pClusters O and there doesn’t exist any T’  T that c-pClusters O is called Maximum Dimension Set (MDS) of O. Since finding a MDS isn’t trivial, a two object special case, called pairwise clustering, is considered. 9/20/2018 Data Mining: Principles and Algorithms

Pairwise Clustering To find the Maximum Dimension Sets for O where O contains only two objects x and y. Given objects x and y, and a column set T we can define :- S(x,y,T) = {dxa-dya|a T} S1(x,y,T) = s1, …, sk. si  S(x,y,T) and si ≤ sj for i<j The Objects x and y form a c-pCluster on T iff the difference between the largest and the smallest value in S(x,y,T) is below c. Or in other words in S1(x,y,T) we have (sk-s1) ≤ c. 9/20/2018 Data Mining: Principles and Algorithms

Pairwise Clustering (cont..)
pairCluster(x,y,A,nc) pairCluster(a,b,D,nr) 9/20/2018 Data Mining: Principles and Algorithms

Pairwise Clustering (example)
nr = 3, nc = 3, c = 1 9/20/2018 Data Mining: Principles and Algorithms

MDS Pruning 9/20/2018 Data Mining: Principles and Algorithms

MDS Pruning (example) 9/20/2018 Data Mining: Principles and Algorithms

Prefix Tree Summary The resulting object-pair MDSs are inserted into a prefix tree. Each node of the prefix tree uniquely represents a maximum dimension set, and can be regarded as a candidate cluster (O,T). A final step of pruning objects in O so that (O,T), becomes a pCluster. The pruning step after prefix tree construction is done in post-order traversal. 9/20/2018 Data Mining: Principles and Algorithms

Prefix Tree Summary 9/20/2018 Data Mining: Principles and Algorithms

Experiments 9/20/2018 Data Mining: Principles and Algorithms

Data Mining: Algorithms and Principles — Mining Biological Data —

Similar presentations

Presentation on theme: "Data Mining: Algorithms and Principles — Mining Biological Data —"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Mining: Algorithms and Principles — Mining Biological Data —

Similar presentations

Presentation on theme: "Data Mining: Algorithms and Principles — Mining Biological Data —"— Presentation transcript:

Similar presentations

About project

Feedback