Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Mining: Algorithms and Principles — Mining Biological Data —

Similar presentations


Presentation on theme: "Data Mining: Algorithms and Principles — Mining Biological Data —"— Presentation transcript:

1 Data Mining: Algorithms and Principles — Mining Biological Data —
©Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign Additional contributors: Mayssam Sayyadian, Xiao Hu and Zheng Shao (Spring 2004) 9/20/2018 Data Mining: Principles and Algorithms

2 Biological Data Mining
Biology Fundamentals Gene Expression and Microarray Bioinformatics Data Mining in Bioinformatics Data Mining in Biological Sequences Structure Mining and Drug Design Data Mining in Gene Expression and Micro-array Data Cluster Analysis Classification Analysis Variable Selection Summary 9/20/2018 Data Mining: Principles and Algorithms

3 Data Mining: Principles and Algorithms
Biology Fundamentals Genome = all the chromosomes inside the nucleus Gene = a segment on the DNA molecule Image Source: Genetics/basic.htm 9/20/2018 Data Mining: Principles and Algorithms

4 Biology Fundamentals (2)
Human Genome 23 pairs of chromosomes Image Source: 9/20/2018 Data Mining: Principles and Algorithms

5 Biology Fundamentals (3)
DNA: carry genetic information Protein: realize cell’s functionalities mRNA: serve as a template for protein production Information Flow: DNA -> mRNA -> Protein Image Source: magnet.fsu.edu/Education/2010/Lectures/26_DNA_Transcription.htm 9/20/2018 Data Mining: Principles and Algorithms

6 Data Mining: Principles and Algorithms
Bioinformatics Interdisciplinary Field (Molecular Biology, Statistics, Computer Science, Genomics, Genetics, Databases, Chemistry, Radiology …) Computational management and analysis of biological information 9/20/2018 Data Mining: Principles and Algorithms

7 Biological Vocabulary
Cell Store information in DNA Functionality is determined with Proteins Gene: Section of DNA producing a functional product Chromosome: Physical linear sequence of DNA Genome: entire collection of DNA for organism DNA (nucleotides)  RNA  Proteins (amino-acids) Premise of bioinformatics: gene sequences determine biological functions 9/20/2018 Data Mining: Principles and Algorithms

8 Data Mining: Principles and Algorithms
Eukaryotic Genome DNA Structure Nucleotides (bases) Adenine (A) Cytosine (C) Guanine (G) Thymine (T) Sequence data = Strings of letters triplet codons genetic code 20 amino acids (A, L, V, S etc.) 9/20/2018 Data Mining: Principles and Algorithms

9 Proteins: Prediction of Biochemical Function
CGCCAGCTGGACGGGCACACCATGAGGCTGCTGACCCTCCTGGGCCTTCTG… TDQAAFDTNIVTLTRFVMEQGRKARGTGEMTQLLNSLCTAVKAISTAVRKAGIAHLYGIAGSTNVTGDQVKKLDVLSNDLVINVLKSSFATCVLVTEEDKNAIIVEPEKRGKYVVCFDPLDGSSNIDCLVSIGTIFGIYRKNSTDEPSEKDALQPGRNLVAAGYALYGSATML DNA / amino acid sequence D structure protein functions Use of this knowledge for prediction of function, molecular modelling, and design (e.g., new therapies) 9/20/2018 Data Mining: Principles and Algorithms

10 Biological Information: From Genes to Proteins
DNA RNA Transcription Translation Protein Protein folding genomics molecular biology structural biology biophysics 9/20/2018 Data Mining: Principles and Algorithms

11 Data Mining & Bioinformatics : Motivations
Many biological processes are not well-understood Biological knowledge is highly complex and descriptive and experimental Data mining methods let us gain biological insight (data/information  knowledge) Biological data is abundant and information-rich Gene chips, bio-testing data, genomics & proteomics data, huge data banks and rich literature Largest and richest scientific data sets in the world Find correlations among linkages and information in literature and distributed databases Mining for correlations, linkages between disease and gene sequences, interactions among proteins, classification, clustering, outliers, etc. 9/20/2018 Data Mining: Principles and Algorithms

12 Data Mining & Bioinformatics – How ?
Handling heterogeneous, distributed bio-data Build Web-based, interchangeable, integrated, multi-dimensional genome databases Data cleaning and data integration methods becomes crucial Mining correlated information across multiple databases itself becomes a data mining task Typical studies: potter’s wheel, mining database structures, information extraction from data, document classification, association and correlation discovery algorithms... Exploration of existing data mining tools Towards functional genomics (functional networks of genes and proteins) 9/20/2018 Data Mining: Principles and Algorithms

13 Data Mining & Bioinformatics – How ?
Similarity search and comparison in bio-data Identify critical differences between classes of genes (e.g., diseased and healthy) By finding and comparing frequent patterns Identify gene sequence patterns that play roles in various diseases Those only in diseased samples may indicate the genetic factors of disease Those only in healthy ones may indicate mechanisms that protect the body from disease Similarity analysis in micro-array data and protein data Non-perfect matches (noisy environment) and mining for sequential and structured patterns 9/20/2018 Data Mining: Principles and Algorithms

14 Data Mining & Bioinformatics – How ?
Association analysis: identification of co-occurring bio-sequences Most diseases are not triggered by a single gene but by a combination of genes/proteins acting together Association analysis may help determine the kinds of genes/proteins that are likely to co-occur together in target samples Path analysis: linking genes/proteins to different disease development stages Different genes/proteins may become active at different stages of the disease Develop pharmaceutical interventions that target the different stages separately Visualization tools and genetic/proteomic data analysis 9/20/2018 Data Mining: Principles and Algorithms

15 Sequence Mining in Bioinformatics
Sequence alignment and homology search in bio-data Mining frequent sequential patterns in bio-data Classifying biological sequences Clustering biological sequences 9/20/2018 Data Mining: Principles and Algorithms

16 Data Mining: Principles and Algorithms
Sequence Alignment Goal: Given two or more input sequences Identify similar sequences with long conserved subsequences Method: Use substitution matrices (probabilities of substitutions of amino-acids/bases and probabilities of insertions and deletions) Optimal alignment problem: NP-hard Heuristic method to find good alignments 9/20/2018 Data Mining: Principles and Algorithms

17 Pair-Wise Sequence Alignment
Example Which one is better?  Scoring alignments HEAGAWGHEE PAWHEAE HEAGAWGHE-E HEAGAWGHE-E P-A--W-HEAE --P-AW-HEAE 9/20/2018 Data Mining: Principles and Algorithms

18 Pair-Wise Sequence Alignment ─ Scoring
To compare two sequence alignments, calculate a score PAM or BLOSUM matrices Matches and mismatches Gap penalty Initiating a gap Gap extension penalty Extending a gap 9/20/2018 Data Mining: Principles and Algorithms

19 Pair-wise Sequence Alignment
H W 5 -1 -2 -3 6 10 P -4 15 Gap penalty: -8 Gap extension: -8 HEAGAWGHE-E --P-AW-HEAE (-8) + (-8) + (-1) (-8) (-8) + 6 = 9 HEAGAWGHE-E Exercise: Calculate for P-A--W-HEAE 9/20/2018 Data Mining: Principles and Algorithms

20 Data Mining: Principles and Algorithms
Formal Description Problem: PairSeqAlign Input: Two sequences x, y Scoring matrix s Gap penalty d Gap extension penalty e Output: The optimal sequence alignment Difficulty: If x, y are of size n then number of possible global  alignments is 9/20/2018 Data Mining: Principles and Algorithms

21 Data Mining: Principles and Algorithms
Global Alignment Needleman-Wunsch 1970 Idea: Build up optimal alignment from optimal alignments of subsequences HEAG --P- -25 Add score from table HEAG- --P-A -33 HEAGA --P-A -20 HEAGA --P— -33 Gap with bottom Top and bottom Gap with top 9/20/2018 Data Mining: Principles and Algorithms

22 Data Mining: Principles and Algorithms
Global Alignment Uses recursion to fill in intermediate results table Uses O(nm) space and time O(n2) algorithm Feasible for moderate sized sequences, but not for aligning whole genomes. yj aligned to gap F(i-1,j-1) F(i,j-1) s(xi,yj) d F(i-1,j) F(i,j) d xi aligned to gap While building the table, keep track of where optimal score came from, reverse arrows 9/20/2018 Data Mining: Principles and Algorithms

23 Pair-Wise Sequence Alignment
Alignment: F(0,0)-F(n,m) Alignment: 0-F(i,j) We can vary both the model and the alignment strategies 9/20/2018 Data Mining: Principles and Algorithms

24 Heuristic alignment algorithms
Motivation: Complexity of alignment algorithms O(nm) Current protein DB: 100 million base pairs Imagine matching each sequence with a 1,000 base pair query Takes about 3 hours! Heuristic algorithms aim at speeding up at the price of possibly missing the best scoring alignment Two well known programs BLAST: Basic Local Alignment Search Tool FASTA: Fast Alignment Tool Both find high scoring local alignments between a query sequence and a target database Basic idea: first locate high-scoring short stretches and the extend them 9/20/2018 Data Mining: Principles and Algorithms

25 Other Alignment Algorithms
Dot Matrix Plot Extremely simple but O(n2) Visual 9/20/2018 Data Mining: Principles and Algorithms

26 FASTA (Fast Alignment)
Approach [Pearson & Lipman 1988] Derived from logic of the dot matrix method View sequences as sequences of short words (k-tuple) DNA - 6 bases, protein - 1 or 2 amino acids Start from nearby sequences of exact matching words Motivation Good alignments should contain many exact matches Hashing can find exact matches in O(n) time Diagonals can be formed from exact matches quickly Sort matches by position (i – j) Look only at matches near longest diagonals Apply more precise alignment to small search space at end 9/20/2018 Data Mining: Principles and Algorithms

27 FASTA (Fast Alignment)
9/20/2018 Data Mining: Principles and Algorithms

28 BLAST (Basic Local Alignment Search Tool)
Approach (BLAST) [Altschul+ 1990] View sequences as sequences of short words (k-tuple) DNA - 11 bases, protein - 3 amino acids Create hash table of neighborhood (closely-matching) words Use statistics to set threshold for “closeness” Start from exact matches to neighborhood words Motivation Good alignments should contain many close matches Statistics can determine which matches are significant Much more sensitive than % identity Hashing can find matches in O(n) time Extending matches in both directions finds alignment Yields high-scoring /maximum segment pairs (HSP/MSP) 9/20/2018 Data Mining: Principles and Algorithms

29 BLAST (Basic Local Alignment Search Tool)
9/20/2018 Data Mining: Principles and Algorithms

30 Multiple Sequence Alignment
Alignment containing multiple DNA / protein sequences Look for conserved regions → similar function Example: #Rat ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT #Mouse ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT #Rabbit ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC #Human ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC #Oppossum ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG #Chicken ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT #Frog ATGGGTTTGACAGCACATGATCGT---CAGCT 9/20/2018 Data Mining: Principles and Algorithms

31 Multiple Sequence Alignment: Motivations
Identify highly conserved residues Likely to be essential sites for structure/function More precision from multiple sequences Better structure/function prediction, pairwise alignments Building gene/protein families Use conserved regions to guide search Basis for phylogenetic analysis Infer evolutionary relationships between genes Develop primers & probes Use conserved region to develop Primers for PCR Probes for DNA micro-arrays 9/20/2018 Data Mining: Principles and Algorithms

32 Multiple Alignment Model
Q1: How should we define s? Q2: How should we define A? Model: scoring function s: A X1=x11,…,x1m1 X1=x11,…,x1m1 Possible alignments of all Xi’s: A ={a1,…,ak} Find the best alignment(s) X2=x21,…,x2m2 X2=x21,…,x2m2 S(a*)= 21 XN=xN1,…,xNmN XN=xN1,…,xNmN Q4: Is the alignment biologically Meaningful? Q3: How can we find a* quickly? 9/20/2018 Data Mining: Principles and Algorithms

33 Minimum Entropy Scoring
Intuition: A perfectly aligned column has one single symbol (least uncertainty) A poorly aligned column has many distinct symbols (high uncertainty) Count of symbol a in column i 9/20/2018 Data Mining: Principles and Algorithms

34 Multidimensional Dynamic Programming
Assumptions: (1) columns are independent (2) linear gap cost Alignment: 0,0,0…,0---|x1| , …, |xN| We can vary both the model and the alignment strategies =Maximum score of an alignment up to the subsequences ending with 9/20/2018 Data Mining: Principles and Algorithms

35 Complexity of Dynamic Programming
Comlexity: Space: O(LN); Time: O(2NLN) One idea for improving the efficiency Define the score as the sum of pairwise alignment scores Derive a lower bound for S(akl), only consider a pairwise alignment scoring better than the bound Pairwise alignment between sequence k and l 9/20/2018 Data Mining: Principles and Algorithms

36 Approximate Algorithms for Multiple Alignment
Feng-Doolittle alignment Step 1: Compute all possible pairwise alignments Step 2: Convert alignment scores to distances Step 3: Construct a “guide tree” by clustering Step 4: Progressive alignment based on the guide tree (bottom up) Problems: All alignments are completely determined by pairwise alignment (restricted search space) No backtracking (subalignment is “frozen”) No way to correct an early mistake Non-optimality: Mismatches and gaps at highly conserved region should be penalized more, but we can’t tell where is a highly conserved region early in the process 9/20/2018 Data Mining: Principles and Algorithms

37 Data Mining: Principles and Algorithms
Iterative Refinement Re-assigning a sequence to a different cluster/profile Repeatedly do this for a fixed number of times or until the score converges Essentially to enlarge the search space 9/20/2018 Data Mining: Principles and Algorithms

38 ClustalW: A Multiple Alignment Tool
Essentially following Feng-Doolittle Do pairwise alignment (dynamic programming) Do score conversion/normalization (Kimura’s model) Construct a guide tree (neighbour-journing clustering) Progressively align all sequences using profile alignment Many Heuristics 9/20/2018 Data Mining: Principles and Algorithms

39 Gene Expression and Microarray
Gene Expression: The process of transcribing the gene’s DNA sequence into mRNA The functionality of the cell is solely determined by DNA. Right? Wrong! The functionality is also affected by environment. Environment (e.g., enzymes) will affect gene expression. 9/20/2018 Data Mining: Principles and Algorithms

40 Gene Expression and Microarray (2)
A row contains the same sample A column tests the same gene Image Source: magnet.fsu.edu/Education/2010/Lectures/26_DNA_Transcription.htm Microarray: A technique for experiment Usage: Measure the amount of each gene (carried by mRNA) Advantage: Do thousands of experiments at the same time 9/20/2018 Data Mining: Principles and Algorithms

41 Gene Expression and Microarray (3)
9/20/2018 Data Mining: Principles and Algorithms

42 Gene Expression and Microarray (4)
DNA Microarray – flash animation PCR - shockwave animation RT-PCR - flash animation 9/20/2018 Data Mining: Principles and Algorithms

43 Gene Expression and Microarray (5)
Result of a Microarray experiment: An N x M matrix N is the number of samples M is the number of genes Typical Example: 50 x 5000 Value in the matrix (Usually logarithm of) amount of genes This is the input to Microarray Mining 9/20/2018 Data Mining: Principles and Algorithms

44 Gene Expression and Microarray (6)
How to find similar samples according to genes (and similar genes according to samples)? Cluster Analysis How to classify new samples based on existing ones? Classification Analysis Which genes are most useful in determining the sample’s class? Variable Selection (Feature Selection) 9/20/2018 Data Mining: Principles and Algorithms

45 Data Mining: Principles and Algorithms
Agenda Mining Frequent Pattern in Biological Sequences Deterministic patterns Probabilistic patterns Classifying Biological Sequences KNN Markov Chain based Feature based Clustering Biological Sequences Projecting to Vector Space CLUSEQ A lot of works. In each topic, pick up two or three representative methods. 9/20/2018 Data Mining: Principles and Algorithms

46 Properties of Biological Sequences
Sequences of single of symbols Nucleotides: A,G,T,C alphabet size: 4 Amino-acids: A,V,F,M … alphabet size : 20 Long length Hundreds, thousands bp (base pair) Calls for methods with high efficiency Noise Large parts of sequences seem to carry no information Some mutations of amino acids occur Can be looked as strings over alphabets of 4 or 20 symbols. Long which calls for methods with high efficiency 9/20/2018 Data Mining: Principles and Algorithms

47 Mining Frequent Pattern in Biosequences
Motivation: Patterns appearing often enough in a set of biological sequences is expected to play a role in defining the respective sequences’ functional behavior and/or evolutionary relationships Find maximal patterns, as specific as possible Deterministic model Support: number of sequences in which the pattern occurs Either match sequence or not (zero or one) Probabilistic model Match: approximate matching [J.Yang et al 2002] Give a probability between 0 and 1. Why are frequent sequential patterns interesting 9/20/2018 Data Mining: Principles and Algorithms

48 Types of Biosequnces Patterns
Deterministic patterns PROSITE notation: C-x(3,5)-[NHG]-x-{FY}-x(2)-Q-C non-x letter: one particular type of aa (amino acid) X(3,5): at least 3 and at most 5 aa of any type may occur at this position – flexible gap [NHG]: any of the aa in [] may be here {FY}: all aa except F and Y are allowed in this position. 9/20/2018 Data Mining: Principles and Algorithms

49 Types of Biosequnces Patterns
Probabilistic Patterns Position Weight Matrices (PWM) Positions Compatibility Matrix [J.Yang et al 2002] Stochastic Models HMM 9/20/2018 Data Mining: Principles and Algorithms

50 Mining Deterministic Patterns
Alignment algorithms: Computationally expensive (NP-hard) Reveal only global similarities Enumerating algorithms: Expensive, only good for short patterns Creating long patterns from short patterns TEIRESIAS: find maximal patterns Iterative heuristic methods Gibbs Sampling: faster, but only local maximum …… 9/20/2018 Data Mining: Principles and Algorithms

51 TEIRESIAS: Rigoutsos et al (IBM)
Pattern mined: with rigid gaps, e.g. A..H.C “.” matches one arbitrary symbol in alphabet Finds all maximal patterns with minimal support Algorithm: 2 Phases Scanning Scan the sequences, locate all elementary patterns with minimum support Convolution Build up maximal patterns from elementary patterns Convolution (join) operation : efficient Mining Frequent sequential patterns has been widely studied. In biological sphere, TERESIAS is a representative example. It mines patterns with rigid gaps. 9/20/2018 Data Mining: Principles and Algorithms

52 Data Mining: Principles and Algorithms
TEIRESIAS: example F.AS F.AST AST F.ASTS STS A user provides 3 parameters , W: maximum number of don’t care characters between two successive residues in a pattern Elementary patterns Guarantee to find maximal patterns 9/20/2018 Data Mining: Principles and Algorithms

53 Mining Probabilistic Patterns
Support Metric Require exact match and fail to recognize possible substitution among symbols Protein may mutate without change of its functionality Compatibility Matrix Accommodate mutation circumstances Match metric allow flexibility in pattern matching Including TEIRESIAS, most frequent sequential pattern mining methods apply support metric. But in some cases, the exact match assumption is not good. protein may … , those mutations should not be looked as independent patterns. we need to accommodate these scenarios , so a new measure Match is proposed: 9/20/2018 Data Mining: Principles and Algorithms

54 Data Mining: Principles and Algorithms
Compatibility Matrix Symbols observed may be different from the real ones Compatibility matrix provides a probabilistic connection from the observation to the underlying substances. C(di,dj)=Prob(true_value = di | observed_value = dj) Prob(d1|d1)=0.9 Prob(d2|d1)=0.05 Prob(d3|d1)=0.05 Prob(d4|d1)=0 Prob(d5|d1)=0 Observed value d1 d2 d3 d4 d5 true value 0.9 0.1 0.05 0.8 0.7 0.15 0.75 0.85 Instead of assigning 0 or 1 to indicate occurrence of a symbol, Each symbol occurrence can be considered as occurrences of multiple symbols with various degrees. For example, if d1 is observed, then with 0.9 probability, it is an occurrence of d1, with 0.05 probability, it is an occurrence of d2, and so on. The compatibility matrix is assumed to be provided by an expert in the area 9/20/2018 Data Mining: Principles and Algorithms

55 Data Mining: Principles and Algorithms
Match Given a pattern P= d1d2…dl, a sequence s= d1’d2’…dl’ the match of P in s M(P,s):=Prob(P|s) =  1ilC(di,di’) E.g: if P=d1d2, s=d1d3, then M(P,s)=C(d1,d1)C(d2,d3)=0.9 0.05=0.045. Here P is not a subpattern of s, but M(P,s) >0. Suppose the symbols in the sequence are independent, product of the occurrence of each symbol (correspondent entries in CM) Provides flexibility in approximate match 9/20/2018 Data Mining: Principles and Algorithms

56 Data Mining: Principles and Algorithms
Match The match of a pattern P in a subsequence s (with the same length) is Prob(P| s). The match of a pattern P in a sequence S is defined as M(P,S)=Maxssubsequences of S M(P,s) The match of pattern P in a database D of N sequence: M(P,D)=SDM(P,S) / N All patterns that meet a specified threshold min_match are called frequent patterns. Maximum match of P in all subseq, of S Average match of all sequences 9/20/2018 Data Mining: Principles and Algorithms

57 Data Mining: Principles and Algorithms
Match Model The match model can accommodate misrepresentation of symbols due to noise The Apriori property also holds on the match metric. The match of a pattern P in a database D  the match of any subpattern of P. In a noise-free environment, match model can represent support model. Let the compatibility matrix be an identity matrix (C(di,dj)=1 if i=j and is 0 otherwise). The match of a pattern is equal to the support of a pattern. So it is an extension of support measure 9/20/2018 Data Mining: Principles and Algorithms

58 Data Mining: Principles and Algorithms
Agenda Mining Frequent Pattern in Biological Sequences Classifying Biological Sequences KNN Markov Chain based Feature based Clustering Biological Sequences 9/20/2018 Data Mining: Principles and Algorithms

59 K-nearest Neighbor Classifier(KNN)
Locate K training sequences: most similar to the test sequence Class label of the test sequence := Majority Function (class labels of K seq.s) KEY- Similarity function : sim(Si,Sj) Simplest Score= aligned symbols identical 0 otherwise Advanced: position dependent scores score matrices of similarity between different amino-acids Well used in bioinformatics tools. How to measure the similarity of sequences In addition, there are various score matrices 9/20/2018 Data Mining: Principles and Algorithms

60 Markov Chain based Classifier
Build simple Markov chain using training datasets for training seq. with class + for training seq. with class - Compute for test seq. + if Class label of = - if 9/20/2018 Data Mining: Principles and Algorithms

61 Data Mining: Principles and Algorithms
IMM & SMM Problem in Higher Order Models many high-order states cannot frequently observed in the training dataset Interpolated Markov Models (IMM) the weight of transition probability of the j th-order model Selective Markov Models (SMM) Build varying order Markov chains Prune the non-discriminatory states from the higher order Markov chains The motivation is that higher order Markov states capture longer context, but they tend to have lower support. Hence, using probabilities from lower order states which have higher support helps in correcting some of the errors in estimating the transition probabilities for higher order models. Furthermore, by using the various γparameters, IMMs can control how the various states are used while computing the transition probabilities. (A number of methods for computing the various γ factors where presented in [SDKW98] that are based on the distribution of the different states in the various order models. However, the right method appears to be dataset dependent. 9/20/2018 Data Mining: Principles and Algorithms

62 Feature Based Sequence Classifier
Project each sequence into vector-space Use sophisticated linear classifiers E.g: Support Vector Machine(SVM) KEY-- How to project? Based on Markov chain 9/20/2018 Data Mining: Principles and Algorithms

63 Vector-space view of sequence in Markov chain
Classifier: 1st order Markov Chain Vector u,w length: N2 (N: size of alphabet) Each dimension: a unique pair of symbols If j th dimension of the vectors = AG u (j) = number of occurrences of AG in S w(j) = w: from Transition probability Matrices Higher order Markov Chain Dimensionality of the new space is N k+1 u: Sequence vector w: Weight vector If N is the size of alphabet, the dimension of both vectors are N2. Log-odds ratio of P+. P- are the transition probabilities of the positive and negative classes, Allow us to view each sequence as a vector in a new space whose dimensions are made of all possible pairs, triplets, quintuplets, etc, of symbols, and second it shows that each one of the Markov chain based techniques is nothing more than a linear classifier [Mit97], in which the decision hyperplane is given by the vector w that corresponds to the log-odds ratios of the maximum likelihood estimates for the various transition probabilities. 9/20/2018 Data Mining: Principles and Algorithms

64 Data Mining: Principles and Algorithms
Example: Sequence classification in vector space based on 1st order Markov Chain Based on first order Markov Chain S is positive, since the value >0 9/20/2018 Data Mining: Principles and Algorithms

65 Data Mining: Principles and Algorithms
Agenda Mining Frequent Pattern in Biological Sequences Classifying Biological Sequences Clustering Biological Sequences Projecting to Vector Space CLUSEQ 9/20/2018 Data Mining: Principles and Algorithms

66 Clustering Biological Sequences
Motivation: Detecting protein families Revealing unknown object groups Challenges: Similarity measurement Alignment based Similarity measure based on pairwise alignment Frequent Sequence based No alignment, use frequent subsequences Efficiency 9/20/2018 Data Mining: Principles and Algorithms

67 Frequent Sequence based Method
Projecting to vector space [Guralnik & Karypis ‘01] Determine all frequent subsequences Select independent subset of these subsequences (features) Count occurrences of features (Vector Space Model) Applying any clustering algorithm to the vectors 9/20/2018 Data Mining: Principles and Algorithms

68 Example: feature selection
No Markov Chain here with 50% support Select “independent” subsequences Globally selected: consider all sequences together, locally selected: consider sequence one by one 9/20/2018 Data Mining: Principles and Algorithms

69 Data Mining: Principles and Algorithms
CLUSEQ [J.Yang et al 2003] Efficient and Effective Sequence Clustering Probability approach Similar to higher order Markov Model Probabilistic Suffix Tree Use structural characteristics of sequences Structure determines function Sequences: sequential structure A Novel Algorithm High accuracy Automatically adjust parameters (e.g. # of clusters) The rational is: …The order of the symbols in sequences Using the conditional probability distribution of the next symbol given a preceding segment to characterize sequence behavior and to support the similarity measure. Overcome the typical problem of clustering method: manually setting parameters 9/20/2018 Data Mining: Principles and Algorithms

70 Data Mining: Principles and Algorithms
Similarity Measure A cluster S is modeled by conditional probability distribution Ps represented by a Probabilistic Suffix Tree The probability of an arbitrary sequence : (s1 s2… sl) subsuming to Ps : Significant/Insignificant subsequences robust to noise If s1 s2… si-1 is not significant (i.e < threshold), approximate it by its longest significant suffix Different clusters subsume following different P Where is the conditional probability that the symbol si is the next symbol right after the segment s1…si-1 To make the measure robust to noise, a significant threshold is given. If the occurrence of a subsequence is less than the threshold, it is not significant. In this case, .., try to capture the context information as much as possible 9/20/2018 Data Mining: Principles and Algorithms

71 Data Mining: Principles and Algorithms
Similarity Measure Similarity between sequence  and cluster S: Accommodate subsequences matching Ratio of probability of subsuming to distribution of S, and of generating the sequence randomly. This similarity measure requires the entire sequence follows a single CPD, too strict, different portions of a sequence may subsume to different CPD. Modified to capture the maximum similarity between any continuous segment of and S >= t:  belongs to S < t:  not belong to S 9/20/2018 Data Mining: Principles and Algorithms

72 Probabilistic Suffix Tree (PST)
Improve efficiency of computation Suffix tree: good model for symbol sequences Each cluster is represented by a PST tree Rooted directed tree Each edge is labeled with a symbol in the sequence set Concatenation of edge-labels on the path between the node and the root spells out a subsequence Each note has probabilities of the next symbol 9/20/2018 Data Mining: Principles and Algorithms

73 Data Mining: Principles and Algorithms
Example: A Probabilistic Suffix Tree P(b|bba)≈ P(b|ba)=0.594 In this example there are only two symbols a and b, number in each node is the occurrences of its label in the sequence cluster The red line is the cut off of significant nodes, whose occurrences are larger than a preset threshold. 9/20/2018 Data Mining: Principles and Algorithms C=65

74 CLUSEQ: algorithm Update cluster membership and Probabilistic suffix tree Merge heavily overlapped clusters Novel part is that, upon iteration, it automatically updates the clusters, and adjust the parameters. Robust w.r.t initial parameter settings. 9/20/2018 Data Mining: Principles and Algorithms

75 Data Mining: Principles and Algorithms
Summary Mining Biosequence Frequent Pattern Deterministic patterns Probabilistic patterns Classifying Biological Sequences KNN: simple, using alignment information Markov Chain based :IMM, SMM Feature based : vector space Clustering Biological Sequences Frequent sequence based: vector space CLUSEQ Probabilistic based similarity measure Exploring sequential characteristics 9/20/2018 Data Mining: Principles and Algorithms

76 Data Mining: Principles and Algorithms
References Rigoutsos, I. & Floratos, A. Combinatorial pattern discovery in biological sequences: The Teiresias algorithm, Bioinformatics,1998, 14(1), pp Brejova, B., DiMarco, C., Vinar, T., Hidalgo, S. R., Holguin, G. and Patten, C. Finding Patterns in Biological Sequences. Unpublished project report for CS798G, University of Waterloo, Fall 2000 J. Yang, P. Yu, W. Wang, and J. Han 'Mining Long Sequential Patterns in a Noisy Environment ', SIGMOD2002 Mukund Deshpande, George Karypis, Evaluation of Techniques for Classifying Biological Sequences , PAKDD2002  Valerie Guralnik, George Karypis, A Scalable Algorithm for Clustering Sequential Data, ICDM2001 Jiong Yang, Wei Wang, CLUSEQ: Efficient and Effective Sequence Clustering , ICDE 2003 9/20/2018 Data Mining: Principles and Algorithms

77 Data Mining: Principles and Algorithms
Cluster Analysis Hierarchical clustering K-means algorithm Self-Organizing Map (SOM) SVD-based clustering Block clustering Gene shaving Significance of clusters 9/20/2018 Data Mining: Principles and Algorithms

78 Hierarchical clustering
The most common approach for grouping gene expression data Bottom-Up Approach Input: Distance-Matrix (N x N) Algorithm: Find the two nearest groups, combine them together Diversity Single Linkage: Group distance = minimum distance between points of the two groups Average Linkage: Group distance = average distance between points of the two groups 9/20/2018 Data Mining: Principles and Algorithms

79 Data Mining: Principles and Algorithms
K-means Algorithm Top-down approach Pre-specified number of clusters Algorithm: Select initial cluster centers Classify all the samples Recalculate the centers of each cluster Problem The result is highly dependent on the initial set of centers 9/20/2018 Data Mining: Principles and Algorithms

80 Self-Organizing Map (SOM)
2-layer NN Input layer: vector Output layer: matrix Each node in Output Layer has a weight vector w. Image Source: 9/20/2018 Data Mining: Principles and Algorithms

81 Self-Organizing Map (SOM) (2)
Given an input vector x (a sample), the output of the node is Gaussian(w - x). Image Source: Image Source: 9/20/2018 Data Mining: Principles and Algorithms

82 Self-Organizing Map (SOM) (3)
Initialization Random Initialization Sample Initialization Linear Initialization Where the weight vectors are initialized in an orderly fashion along the linear subspace spanned by the two principal eigenvectors of the input data set. 9/20/2018 Data Mining: Principles and Algorithms

83 Self-Organizing Map (SOM) (4)
Training For each sample y Select the output node with the greatest output value. Assume it’s output node j. Update the weight vector ci of every output node I 9/20/2018 Data Mining: Principles and Algorithms

84 Self-Organizing Map (SOM) (5)
Advantages: Partial structure on the clusters (neighbor relation between nearby clusters) Easy to implement Reasonably fast, scalable to large data sets Easy visualization and interpretation 9/20/2018 Data Mining: Principles and Algorithms

85 Data Mining: Principles and Algorithms
SVD-based Clustering Singular Value Decomposition (Also named KL-Decomposition / PCA) Sort on singular values Keep the first K singular values Effect: Filter out noise or experimental artifacts, enable meaningful comparison of the expression of different genes across different samples 9/20/2018 Data Mining: Principles and Algorithms

86 Data Mining: Principles and Algorithms
Block Clustering Two-way clustering technique: Simultaneously cluster both genes and samples Algorithm: Begin with the entire matrix in one block At each stage find the best row and column split Split continued until a large number of blocks are obtained Some blocks are recombined until the optimal number of blocks are obtained 9/20/2018 Data Mining: Principles and Algorithms

87 Data Mining: Principles and Algorithms
Block Clustering (2) The best row and column split Largest reduction in the total within block variance The optimal number of blocks Randomly permute the elements within each row of the original matrix A. The result is B. Find the k that maximize the difference between the within cluster variance of A with k clusters the within cluster variance of B with k clusters 9/20/2018 Data Mining: Principles and Algorithms

88 Data Mining: Principles and Algorithms
Gene Shaving Goal find several small, possibly overlapping blocks of genes The blocks: small between-genes variance Algorithm Start with the whole gene expression matrix Y Find the first principal component of the genes For each gene, compute the absolute value of its correlation with the first principal component Remove the 10% of the genes having the smallest absolute correlation 9/20/2018 Data Mining: Principles and Algorithms

89 Data Mining: Principles and Algorithms
Gene Shaving (2) The algorithm produces a set of nested gene groups: The first cluster is selected as the group that maximizes the gap statistic 9/20/2018 Data Mining: Principles and Algorithms

90 Significance of Clusters
How do we show that the clusters discovered are real and biologically interesting rather than coincidental artifacts of the data? Bootstrap approach. Try different subsets of the data and check if the results are consistent. 9/20/2018 Data Mining: Principles and Algorithms

91 Classification Analysis
Support Vector Machines (SVM) Nearest Neighbor Decision Trees Voted classification Weighted Gene Voting Bayesian Classification 9/20/2018 Data Mining: Principles and Algorithms

92 Support Vector Machines (SVM)
2-Class classification Map data from low-dimensional space to high-dimensional space. So that originally linear unseparatable data might be linear separatable in high-dimensional space. Advantage: Does not tend to overfit 9/20/2018 Data Mining: Principles and Algorithms

93 Data Mining: Principles and Algorithms
Nearest Neighbor All training data are saved. Get the k nearest neighbors (in the training data) of the new sample Determine the class by majority of the neighbors. K = 3,5,7… (For 2-class classification) 9/20/2018 Data Mining: Principles and Algorithms

94 Data Mining: Principles and Algorithms
Voted Classification Takes a classifier and a training set as input Trains the classifier multiple times on different versions of the training set Two types: Bagging Algorithm Boosting Algorithm 9/20/2018 Data Mining: Principles and Algorithms

95 Voted Classification (2)
Bagging Algorithm Bootstrap sample set: randomly select T samples from the original training set. Train classifiers on bootstrap sample sets Usage: Apply these classifiers to the test data, choose the most votes. Useful for unstable classifiers (like decision tree) 9/20/2018 Data Mining: Principles and Algorithms

96 Voted Classification (3)
Boosting Algorithm Train a classifier on the training set The trained classifier is assigned a voting weight based on its accuracy Calculate the classification result for each sample, based on the weighted voting of all trained classifiers Modify the probability of each sample appearing in the training set based on the classification correctness so far. 9/20/2018 Data Mining: Principles and Algorithms

97 Data Mining: Principles and Algorithms
Weighted Gene Voting 2-class Classification Value of Gene j Class 1, mean = m1j SD = s1j Class 2, mean = m2j SD = s2j Gene j will vote for Class 1, if | yj – m1j | < | yj – m2j | Weight for voting | yj – (m1j + m2j) / 2 | * | m1j – m2j | / (s1j+ s2j) 9/20/2018 Data Mining: Principles and Algorithms

98 Bayesian Classification
Bayes’ theorem Assumption: Each element of y is conditionally independent 9/20/2018 Data Mining: Principles and Algorithms

99 Data Mining: Principles and Algorithms
Variable Selection Problem: 50 samples, 5000 genes Which genes are useful? 9/20/2018 Data Mining: Principles and Algorithms

100 Data Mining: Principles and Algorithms
Variable Selection (2) Unsupervised (No class information) cut-off on the SD of the gene over all the samples (The bigger, the better) Supervised (With class information) cut-off on Ratio of between-class to within-class variance (The bigger, the better) 9/20/2018 Data Mining: Principles and Algorithms

101 Data Mining: Principles and Algorithms
Summary Clustering Hierarchical clustering K-means algorithm Self-Organizing Map (SOM) SVD-based clustering Block clustering Gene shaving Significance of clusters Classification Support Vector Machines (SVM) Nearest Neighbor Decision Trees Voted classification Weighted Gene Voting Bayesian Classification Variable Selection 9/20/2018 Data Mining: Principles and Algorithms

102 Data Mining: Principles and Algorithms
Reference K.Aas, Microarray Data Mining: A Survey, NR Note SAMBA/02/01 David Page, PPT of Supervised Learning for Gene Expression Microarray Data Images are obtained from Google Images (Image on page 8 from the paper Aas01) 9/20/2018 Data Mining: Principles and Algorithms

103 Data Mining: Principles and Algorithms
Thank you !!! 9/20/2018 Data Mining: Principles and Algorithms

104 Clustering Measures (Distance)
Lp Norm Distances Manhattan (p = 1) Euclidean Distance (p = 2) Cosine Distance Robust Kernel Distances (?) Distance functions are not adequate in capturing all kinds of correlated behavior. 9/20/2018 Data Mining: Principles and Algorithms

105 Shifting and Scaling Patterns
9/20/2018 Data Mining: Principles and Algorithms

106 Shifting and Scaling Patterns in Subspaces
9/20/2018 Data Mining: Principles and Algorithms

107 DNA micro-array analysis
Applications DNA micro-array analysis 9/20/2018 Data Mining: Principles and Algorithms

108 Why not use correlation ?
Prior Work Why not use correlation ? Cannot capture non global characteristics over attributes. 9/20/2018 Data Mining: Principles and Algorithms

109 Bicluster Method ? (Y.Cheng & G. Church)
Prior Work Bicluster Method ? (Y.Cheng & G. Church) Submatrix of a l-bicluster not necessarily a l-bicluster Greedy approach can miss overlapping clusters Difficult to remove outliers from clusters. 9/20/2018 Data Mining: Principles and Algorithms

110 Data Mining: Principles and Algorithms
Notations 9/20/2018 Data Mining: Principles and Algorithms

111 Data Mining: Principles and Algorithms
pScore pScore & pCluster 9/20/2018 Data Mining: Principles and Algorithms

112 Data Mining: Principles and Algorithms
pScore (cont..) Good similarity for Shifting Patterns Ordinary Patterns are the degenerate case where shift is zero Scaling Patterns can be converted to Shifting Patterns by taking logarithm on original attributes and hence pScore is a good metric to tackle scaling patterns too. (?) The pClusters defined by using the pScore are compact. Intuitively, pScore(X) < c means that the change of values between two attributes of two objects in X is confined by a user-specified threshold c. 9/20/2018 Data Mining: Principles and Algorithms

113 Data Mining: Principles and Algorithms
Problem Statement Given : cluster threshold nc: minimal number of columns of a cluster nr: minimal number of rows of a cluster Find all (O,T) such that (O, T) is a -pCluster, where |O| ≥ nr & |T| ≥ nc. 9/20/2018 Data Mining: Principles and Algorithms

114 Maximum Dimension Set (MDS)
Search over subspaces for An object set O c-pClusterable on a column set T only when there doesn’t exist T’  T, such that O c-pClusters on T’. That T which c-pClusters O and there doesn’t exist any T’  T that c-pClusters O is called Maximum Dimension Set (MDS) of O. Since finding a MDS isn’t trivial, a two object special case, called pairwise clustering, is considered. 9/20/2018 Data Mining: Principles and Algorithms

115 Pairwise Clustering To find the Maximum Dimension Sets for O where O contains only two objects x and y. Given objects x and y, and a column set T we can define :- S(x,y,T) = {dxa-dya|a T} S1(x,y,T) = s1, …, sk. si  S(x,y,T) and si ≤ sj for i<j The Objects x and y form a c-pCluster on T iff the difference between the largest and the smallest value in S(x,y,T) is below c. Or in other words in S1(x,y,T) we have (sk-s1) ≤ c. 9/20/2018 Data Mining: Principles and Algorithms

116 Pairwise Clustering (cont..)
pairCluster(x,y,A,nc) pairCluster(a,b,D,nr) 9/20/2018 Data Mining: Principles and Algorithms

117 Pairwise Clustering (example)
nr = 3, nc = 3, c = 1 9/20/2018 Data Mining: Principles and Algorithms

118 Data Mining: Principles and Algorithms
MDS Pruning 9/20/2018 Data Mining: Principles and Algorithms

119 Data Mining: Principles and Algorithms
MDS Pruning (example) 9/20/2018 Data Mining: Principles and Algorithms

120 Data Mining: Principles and Algorithms
MDS Pruning (example) 9/20/2018 Data Mining: Principles and Algorithms

121 Data Mining: Principles and Algorithms
MDS Pruning (example) 9/20/2018 Data Mining: Principles and Algorithms

122 Data Mining: Principles and Algorithms
Prefix Tree Summary The resulting object-pair MDSs are inserted into a prefix tree. Each node of the prefix tree uniquely represents a maximum dimension set, and can be regarded as a candidate cluster (O,T). A final step of pruning objects in O so that (O,T), becomes a pCluster. The pruning step after prefix tree construction is done in post-order traversal. 9/20/2018 Data Mining: Principles and Algorithms

123 Data Mining: Principles and Algorithms
Prefix Tree Summary 9/20/2018 Data Mining: Principles and Algorithms

124 Data Mining: Principles and Algorithms
Experiments 9/20/2018 Data Mining: Principles and Algorithms

125 Data Mining: Principles and Algorithms
Experiments 9/20/2018 Data Mining: Principles and Algorithms

126 Data Mining: Principles and Algorithms
Experiments 9/20/2018 Data Mining: Principles and Algorithms

127 Data Mining: Principles and Algorithms
Experiments 9/20/2018 Data Mining: Principles and Algorithms


Download ppt "Data Mining: Algorithms and Principles — Mining Biological Data —"

Similar presentations


Ads by Google