Mining Biological Data

Presentation on theme: "Mining Biological Data"— Presentation transcript:

1 Mining Biological Data
Jiong Yang, Ph. D. Visiting Assistant Professor UIUC

2 Data is Everywhere My favorite part is the biological data: the microarray data, which I will discuss in more detail shortly, the proteins, and the DNA. Besides these, huge amounts of data are continuously generated by e-commerce applications, the stock market, and sensor networks.

3 Data Mining is a Powerful Tool
Knowledge Computational Biology E-Commerce Intrusion Detection Multimedia Processing Unstructured Data . . . To benefit from the massive amounts of data collected, data mining techniques come into play. Generally speaking, data mining digests the data and produces knowledge.

4 Biological Data Bioinformatics has become one of the most important applications of data mining. DNA sequences Protein sequences Protein folding Microarray data …… Classification Pattern Matching Inference Clustering is a well-established technique with a long history.

5 Outline Approximate sequential pattern mining
Coherent cluster: clustering by pattern similarity in a large data set

6 Frequent Patterns Model Widely studied A set of sequences of symbols.
a1,a2,a4 a2,a3,a5 a1,a4,a5,a6,a7 If a pattern occurs more than a certain number of times, then this pattern is considered important. a1,a4 Widely studied Frequent itemset mining: Agrawal and Srikant (IBM Almaden) FP growth: Han (UIUC) Stream data: Motwani (Stanford) Frequent patterns have been widely studied over the past decade or so. The main idea is that if a pattern occurs a sufficient number of times, it is deemed important; otherwise, it is unimportant. All previous work assumed that a pattern has to occur exactly in order to be counted. That is, for the pattern (a, b, c) to count as occurring once, a has to occur, followed by b, and then by c. Otherwise, the pattern does not occur at all. In some cases, this rigid requirement is necessary. In many other cases, however, it may cause important patterns to be missed because symbols can mutate. There are several reasons for mutation. For example, a sensor may make a mistake: an event a may be reported as b with some small error probability. Other mutations may be intentional. Some popular topics, such as sports or financial information, can be found on many web pages, and an interested web surfer may access any of these pages at random. In reality, there may exist a similarity measure among different symbols that defines the semantic difference and/or the likelihood of substitution. The end result is that the mutation of symbols may severely reduce the number of occurrences, or support, of a pattern, especially when the pattern is very long, consisting of hundreds of symbols. How can we model the substitutions?
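As a minimal sketch of exact-match counting, the following Python snippet computes the support of a pattern over the three example sequences on the slide. It assumes that a pattern occurs in a sequence when its symbols appear in order, possibly with gaps (the slide's pattern a1,a4 matches a1,a2,a4 this way), and that support counts sequences; the function names are illustrative and do not come from the talk.

```python
def occurs(pattern, sequence):
    """Return True if `pattern` appears in `sequence` as an exact, ordered subsequence."""
    it = iter(sequence)
    # `sym in it` consumes the iterator, so the symbols must be found in order.
    return all(sym in it for sym in pattern)


def support(pattern, sequences):
    """Number of sequences that contain `pattern` exactly."""
    return sum(occurs(pattern, seq) for seq in sequences)


sequences = [
    ["a1", "a2", "a4"],
    ["a2", "a3", "a5"],
    ["a1", "a4", "a5", "a6", "a7"],
]

print(support(["a1", "a4"], sequences))  # 2

# If a4 is reported as a3 (a single substituted symbol), the exact count misses it.
print(support(["a1", "a4"], [["a1", "a2", "a3"]]))  # 0
```

The second call shows the shortcoming the notes describe: one mutated symbol removes the occurrence entirely under exact matching.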

7 Apriori Property Widely used in data mining field
It holds for the support metric. All patterns form a lattice. (a, b, d) is a super-pattern of (a, d) and it is a sub-pattern of (a, b, c, d). The support metric defines a partial order on the lattice. Support(a, b, d) <= min{Support(b, d), Support(a, d), Support(a, b)} A level-wise search algorithm can be used. Now we have a problem at hand; how can we solve it? The approach we take depends on the properties of the match model. First, let’s take a look at a very important property of the support model, called the Apriori property. It is a very well-studied property in the data mining field. Here is some terminology. For two patterns, if one can be obtained by replacing a don’t-care position in the other with a symbol, then the former is a super-pattern of the latter. This super-pattern and sub-pattern relationship forms a lattice. By definition, an occurrence of a super-pattern is also an occurrence of its sub-pattern; thus, the support of a super-pattern is at most that of its sub-pattern. As a result, the support metric defines a partial order on this lattice. For a chain of patterns in which each is a super-pattern of the next, we only need to find the point at which a super-pattern fails the support threshold while its sub-pattern satisfies it. This means that we only need to find a continuous envelope on the lattice. Since the support metric possesses this property, a level-wise search algorithm can be used to find the envelope. First, the patterns with one don’t-care symbol are searched. Based on those, the patterns with two don’t-care positions are searched, and so on. Much work has been done on variations of this level-wise search approach. Since the match metric also preserves this property, the level-wise search algorithm can also be applied in this model.
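The level-wise idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the talk: patterns are plain ordered subsequences (gaps allowed) rather than patterns with explicit don't-care positions, support counts sequences, a pattern is kept when its support is at least `min_support`, and `min_support` is at least 1. All function names are hypothetical.

```python
def occurs(pattern, sequence):
    """True if `pattern` is an ordered subsequence of `sequence`."""
    it = iter(sequence)
    return all(sym in it for sym in pattern)


def support(pattern, sequences):
    return sum(occurs(pattern, seq) for seq in sequences)


def level_wise(sequences, min_support):
    """Return {pattern: support} for every pattern with support >= min_support."""
    symbols = sorted({s for seq in sequences for s in seq})
    frequent = {}
    level = [(s,) for s in symbols]  # level 1: single symbols
    while level:
        counted = {p: support(p, sequences) for p in level}
        current = {p: c for p, c in counted.items() if c >= min_support}
        frequent.update(current)
        # Build the next level: extend each surviving pattern by one symbol and
        # keep a candidate only if every sub-pattern obtained by deleting one
        # symbol is frequent (the Apriori property).
        level = []
        for p in current:
            for s in symbols:
                cand = p + (s,)
                subs = (cand[:i] + cand[i + 1:] for i in range(len(cand)))
                if all(sub in current for sub in subs):
                    level.append(cand)
    return frequent


sequences = [
    ["a1", "a2", "a4"],
    ["a2", "a3", "a5"],
    ["a1", "a4", "a5", "a6", "a7"],
]
result = level_wise(sequences, min_support=2)
for pattern, count in sorted(result.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(pattern, count)
```

With a threshold of 2 on the slide's three sequences, the only pattern of length two that survives is (a1, a4); no length-three candidate passes the pruning step, so the search stops after two levels.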

8 Shortcomings Require exact match and fail to recognize possible substitution among symbols. A protein may mutate without changing its functionality. A sensor may make some mistakes. Different web pages may have similar contents. A word may have many synonyms. How can symbol substitution be modeled?

9 Compatibility Matrix [Figure: compatibility matrix of 5 symbols, d1 to d5, with observed symbols on one axis and true symbols on the other; visible entries include 0.9, 0.05, 0.1, 0.75, 0.8, 0.7, 0.15, and 0.85.]
Here is our answer to the question. We use a compatibility table to model the mutation among symbols. For each pair of symbols di and dj, there is a real number between 0 and 1 that indicates the probability that the observed di comes from the true underlying symbol dj. Since the compatibility table specifies a probability distribution, each row and column sums to 1.

10 Compatibility Matrix The compatibility matrix serves as a bridge between the observation and the underlying substance. Each observed symbol is interpreted as an occurrence of a set of symbols with various probabilities. An observed symbol combination is treated as an occurrence of a set of patterns with various degrees. The compatibility matrix can be obtained through empirical study or from a domain expert. The matrix is a bridge between the observed sequence of symbols and the true underlying sequence. Each symbol occurrence can be considered an occurrence of multiple symbols to various degrees. For example, if d2 is observed, then with probability 0.1 it is an occurrence of d1, with probability 0.8 it is an occurrence of d2, and with probability 0.1 it is an occurrence of d4. An occurrence of a symbol combination is likewise considered an occurrence of a set of patterns to various degrees. Assuming that events are independent, the degree of occurrence of a pattern is the product of the degrees of occurrence of its symbols. For example, if d2d2 is observed, then with probability 0.64 the true underlying pattern is d2d2, while with probability 0.08 the pattern is d1d2. I just want to mention that the compatibility matrix can be obtained through empirical study or with the help of a domain expert.
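The d2d2 example can be reproduced with a few lines of Python. The sketch below uses only the probabilities stated in the notes for an observed d2 (0.1 for d1, 0.8 for d2, 0.1 for d4) and assumes independence between positions, as the notes do; the dictionary layout and function name are illustrative choices.

```python
from itertools import product

# Probability of each true symbol given that d2 was observed, as stated in the notes.
# Only this column is given in the text; a full matrix has one such entry per observed symbol.
TRUE_GIVEN_OBSERVED = {
    "d2": {"d1": 0.1, "d2": 0.8, "d4": 0.1},
}


def pattern_degrees(observed):
    """Map each candidate true pattern to its probability given the observed symbols."""
    columns = [TRUE_GIVEN_OBSERVED[o].items() for o in observed]
    degrees = {}
    for combo in product(*columns):
        pattern = tuple(sym for sym, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p  # independence: multiply the per-symbol probabilities
        degrees[pattern] = prob
    return degrees


degrees = pattern_degrees(["d2", "d2"])
print(round(degrees[("d2", "d2")], 2))  # 0.64
print(round(degrees[("d1", "d2")], 2))  # 0.08
```

The nine candidate patterns for the observed d2d2 together carry a total probability of 1, which is what lets a single observation be spread across several patterns.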

11 Match A new metric, match, is then proposed to quantify the importance of a pattern. The match of a pattern P in a subsequence s (with the same length) is defined as the conditional probability Prob(P | s). The match of a pattern P in a sequence S is defined as the maximal match of P over all distinct subsequences of S. A dynamic programming technique is used to compute the match of P in a sequence S. We propose a new metric called match to capture the mutated occurrences of patterns. The match is defined as the accumulated amount of occurrences. As in the previous example, an occurrence of d2d2 contributes 0.64 occurrences to the pattern d2d2 and 0.08 to d1d2. If the aggregated match of a pattern exceeds a user-specified threshold, then this pattern is considered important.

12 Match M(d1d2…di, S1S2…Sj) is the maximum of M(d1d2…di, S1S2…Sj-1) and M(d1d2…di-1, S1S2…Sj-1) x C(di, Sj). The match of a pattern P in a set of sequences is defined as the sum of the matches of P in each sequence. A pattern is called a frequent pattern if its match exceeds a user-specified threshold min_match. Example for pattern p = d1 d2 and sequence S = d1 d3 d4 d1 (each entry is the maximum match of the pattern prefix in the sequence prefix):
        d1     d3     d4     d1
d1      0.9    0.9    0.9    0.9
d2             0.045  0.09   0.09
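The recurrence M(d1…di, S1…Sj) = max(M(d1…di, S1…Sj-1), M(d1…di-1, S1…Sj-1) x C(di, Sj)) can be sketched as a small dynamic program. The compatibility values below are assumptions chosen to be consistent with the numbers visible on the slide (0.9, 0.045, 0.09); they are not a transcription of the full matrix, and pairs that are not listed are treated as 0.

```python
def match(pattern, sequence, compat):
    """Maximum match of `pattern` over all subsequences of `sequence`.

    compat[(true, observed)] is the compatibility C(true, observed);
    pairs that are not listed count as 0.
    """
    # prev[j] = M(pattern[:i-1], sequence[:j]); the empty pattern has match 1.
    prev = [1.0] * (len(sequence) + 1)
    for d in pattern:
        row = [0.0] * (len(sequence) + 1)  # a non-empty pattern cannot match an empty prefix
        for j, s in enumerate(sequence, start=1):
            skip = row[j - 1]                            # S_j is not used
            use = prev[j - 1] * compat.get((d, s), 0.0)  # S_j is matched to d
            row[j] = max(skip, use)
        prev = row
    return prev[-1]


def total_match(pattern, sequences, compat):
    """Match of a pattern in a set of sequences: the sum over the sequences."""
    return sum(match(pattern, seq, compat) for seq in sequences)


# Assumed compatibility entries C(true, observed), for illustration only.
compat = {
    ("d1", "d1"): 0.9,
    ("d2", "d1"): 0.05,
    ("d2", "d3"): 0.05,
    ("d2", "d4"): 0.1,
}

print(round(match(["d1", "d2"], ["d1", "d3", "d4", "d1"], compat), 3))  # 0.09
```

Only two rows of the table are kept in memory at a time, so the cost is proportional to the pattern length times the sequence length.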

13 Challenges Previous work focuses on short patterns.
Long patterns require a large number of scans through the input sequence. Expensive I/O cost Performance vs. Accuracy Probabilistic Approach However, there is a catch. Previous work has focused on relatively short patterns, such as 20 or 50 symbols. This may be justified if we only allow exact matches. In our model, however, patterns may consist of hundreds of symbols, e.g., gene patterns. To mine these long patterns, a level-wise search algorithm requires many scans through the data, which can be very costly. We want to find an algorithm that requires fewer I/Os. What can we do? There is an old trade-off in computer science, time vs. space: if you want a faster algorithm, you need more space, that is, more memory. More recently, I believe there is another trade-off, performance vs. accuracy. If we are willing to allow some small probability of error, we may obtain a much faster algorithm. In some applications, we do not want any error. For example, no one wants an error in their bank account. On the other hand, many applications, such as recommendation systems, can tolerate some small degree of error. If some pattern is not found, it is no big deal. In these applications, we can employ a probabilistic approach to find most of the patterns.

14 Chernoff Bound Let X be a random variable whose range is R. Suppose that we have n independent observations of X and the observed mean is μ. The Chernoff bound states that, with probability (1 - δ), the true mean of X is at least μ - ε, where ε = sqrt(R^2 ln(1/δ) / (2n)). With probability (1 - δ), the true mean of X is at most μ + ε. The core of our approach is the Chernoff bound. The Chernoff bound has many forms. This particular form was derived by Hoeffding. This is the formula. Let X be a random variable whose domain is R, where R is expressed as the length of the domain. For example, if the domain is between 1 and 2, R is 2 - 1 = 1. Let mu be the estimated mean of X after n samples. The Chernoff bound states that the true mean of X is at least mu - epsilon with confidence 1 - delta. The error range epsilon is a function of R, n, and delta. From this function, we can see that the error bound is independent of the distribution of X and depends only on the number of samples of X. This is a very attractive property. As a result, we can use this bound on any random variable without knowledge of its distribution.
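To make the bound concrete, here is a small sketch that computes the error range. It assumes the standard Hoeffding form ε = sqrt(R^2 ln(1/δ) / (2n)), which is a function of R, n, and δ only, as the notes describe; the function name and the example numbers are illustrative.

```python
import math


def hoeffding_epsilon(value_range, n, delta):
    """Error range for the mean of n independent samples of a variable with range
    `value_range`, holding with probability 1 - delta (assumed Hoeffding form)."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


# Example: match values lie in [0, 1], so R = 1; one million samples; delta = 0.001.
eps = hoeffding_epsilon(1.0, 1_000_000, 0.001)
print(f"{eps:.4f}")  # 0.0019

# Quadrupling the number of samples halves the error range.
print(hoeffding_epsilon(1.0, 4_000_000, 0.001) / eps)
```

Because the sample size appears under a square root, halving the error range requires four times as many samples, which is why the memory available for samples directly limits how many patterns remain ambiguous.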

15 Approach Three-stage approach to mine patterns with length l:
Finding Match of Individual Symbols and Take a Sample set of sequences Pattern Discovery on Samples Ambiguous Patterns Determination Sample size: depending on memory size Based on the samples, three types of patterns are determined. We employ a two-phase approach to find patterns of length l. In the first phase, we put patterns into three categories based on the samples. The first category consists of those that satisfy the match threshold with high confidence. The second category consists of those that fail the match threshold with high confidence. The third category consists of those for which we cannot tell with sufficient confidence. We call the patterns in the third category ambiguous patterns. In the second phase, we further analyze these patterns by verifying them against the entire sequence. Now we go into the details of the two phases. In the first phase, what we do is find the significant patterns in the samples. Each sample consists of a contiguous portion of the sequence of length l. The number of samples is determined by the available memory. For example, if 10 MB of memory is allocated to store the samples, l = 10, and each symbol needs 1 byte for encoding, then we can have 1M samples. Next, we use the traditional level-wise search algorithm to find significant patterns of length 10 in the sample data set. This avoids expensive I/Os on the input sequence, since the samples are all in memory. We can categorize the patterns as follows. Based on the range R, the sample size n, and the confidence threshold delta, we can compute the error bound epsilon. If the match of a pattern in the samples is greater than the match threshold plus epsilon, then the true match of the pattern is at least the match threshold with high confidence, and we can label it as significant. For the same reason, we label the patterns whose match in the samples is less than the match threshold minus epsilon as insignificant. The patterns whose match is within the bounds are labeled as ambiguous, because we cannot tell with sufficient confidence whether they are significant or not. Among these three types of patterns, we only need to investigate the ambiguous patterns further.

16 Approach Frequent pattern if match is greater than (min_match + ε);
Ambiguous pattern if match is between (min_match - ε) and (min_match + ε); Infrequent pattern otherwise.
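The three-way labeling rule on this slide can be sketched directly. The snippet assumes that the error range (epsilon) has already been computed from the sample size, and it treats both ends of the ambiguous interval as inclusive, which the slide does not specify; the names and example numbers are hypothetical.

```python
def classify(sample_match, min_match, epsilon):
    """Label a pattern from its match on the samples.

    frequent:   sample_match >  min_match + epsilon
    ambiguous:  min_match - epsilon <= sample_match <= min_match + epsilon
    infrequent: sample_match <  min_match - epsilon
    """
    if sample_match > min_match + epsilon:
        return "frequent"
    if sample_match >= min_match - epsilon:
        return "ambiguous"
    return "infrequent"


# Hypothetical sample matches for three patterns, with min_match = 0.10 and epsilon = 0.02.
for name, m in [("p1", 0.15), ("p2", 0.11), ("p3", 0.05)]:
    print(name, classify(m, min_match=0.10, epsilon=0.02))
# p1 frequent
# p2 ambiguous
# p3 infrequent
```

Only patterns labeled ambiguous need to be verified against the full data in the later phase.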

17 Ambiguous Patterns Ambiguous Patterns Too many Border collapse
We have the negative and positive borders of significant patterns. Our goal is to collapse the border as fast as possible. If the ambiguous patterns cannot all fit into memory at once, then they need to be loaded into memory in several batches. To minimize the number of patterns to examine, we want to examine the patterns with the most pruning power first. Which patterns have the most pruning power? If, after examining a pattern, we can skip a large number of other patterns, then this pattern has a great amount of pruning power. The following is an example of the examination order.

18 Ambiguous Patterns (d1,d2,d3,d4,d5) (d1,d2,d3,d4) (d1,d2,d3,d5)
Let’s assume that the patterns shown in this slide are ambiguous. If we examine a pattern in the middle layer, such as (d1, d2, *, *, d5), and it is significant, then all its descendants, the patterns in red boxes, are significant. On the other hand, if we find that (d1, d2, *, *, d5) is insignificant, then all patterns in green boxes are insignificant. Thus the patterns in the middle layer have the most pruning power in the worst case. (d1,d2) (d1,d3) (d1,d4) (d1,d5) (d1)

19 Ambiguous Patterns infrequent frequent (d1) (d1,d2,) (d1,d3) (d1,d4)
If we choose to examine the middle layer first, and we find that (d1, d2, d3, *, *) and (d1, d2, *, *, d3) are significant and the rest are insignificant, then we are able to categorize all patterns except the shaded one. In the second pass, we only need to examine this pattern further. Therefore, the order in which we examine the patterns is the middle layer first, then the quarter-way layers, and so on.
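The middle-first examination order is essentially a bisection. The sketch below illustrates it on a single chain of ambiguous patterns, each a sub-pattern of the next; this is a deliberate simplification, since the talk's border collapse works on whole layers of the lattice. The `is_significant` callback is a hypothetical stand-in for verifying a pattern against the full sequence.

```python
def collapse_chain(chain, is_significant):
    """Locate the border on a chain ordered from shortest to longest pattern.

    Returns (border, probed): chain[:border] are significant, chain[border:]
    are not, and `probed` lists the patterns that were actually examined.
    """
    lo, hi = 0, len(chain)
    probed = []
    while lo < hi:
        mid = (lo + hi) // 2  # examine the middle of the undecided range first
        probed.append(chain[mid])
        if is_significant(chain[mid]):
            lo = mid + 1      # chain[mid] and all its sub-patterns are significant
        else:
            hi = mid          # chain[mid] and all its super-patterns are insignificant
    return lo, probed


chain = [
    ("d1",),
    ("d1", "d2"),
    ("d1", "d2", "d3"),
    ("d1", "d2", "d3", "d4"),
    ("d1", "d2", "d3", "d4", "d5"),
]

# Hypothetical oracle: pretend patterns with at most two symbols are significant.
border, probed = collapse_chain(chain, lambda p: len(p) <= 2)
print(border)       # 2
print(len(probed))  # 2
```

In this example two examinations settle all five patterns, because each result prunes either everything below or everything above the examined pattern.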

20 Effects of 1-δ Without Border Collapse With Border Collapse
We apply our algorithm to a real trace. I will talk about the trace in detail in about 5 minutes. Here is the performance of our algorithm. If the confidence parameter is increased, then the number of ambiguous patterns increases dramatically. This means that if we can allow up to a 0.1% error rate, the number of ambiguous patterns is around 50,000, which is not too large. The effectiveness of the second phase of our algorithm, the ambiguous pattern determination phase, is shown in the right-hand figure. To avoid the second phase of our algorithm, we could mark the ambiguous patterns as either all significant or all insignificant. We can see that the second phase improves the accuracy by a factor of 5 to 10.

21 Approximate Pattern Mining
Reference: Mining long sequential patterns in a noisy environment, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp , 2002. Other Work Periodic Patterns (KDD 2000, ICDM 2001) Statistically significant Patterns (KDD 2001, ICDM 2002)

22 Outline Approximate sequential pattern mining
Coherent cluster: clustering by pattern similarity in a large data set with application in microarray analysis With application in protein categorization

23 Coherent Cluster In many applications, data can be of very high dimensionality. Gene expression data: dozens to hundreds of conditions/samples. Customer evaluation: thousands or more merchants. Objective: discover peer groups. [Figure: data matrix with objects o1 … oi as rows, attributes a1 … aj as columns, and entries dij.]

24 17 conditions 40 genes X  100 log(105x)
We can imagine the expression data represented as a matrix with rows representing genes, columns representing samples or conditions, and each entry containing a number that characterizes the expression level of a particular gene in a particular sample. Here is a yeast gene expression matrix with 17 conditions and 40 genes. It is part of a bigger matrix. I am not going to bore you with the details of how this matrix was generated at this moment.

25 Coherent Cluster Some are very active and some are not.
Gene expression matrix analysis If two genes have similar expression profiles, we can hypothesize that they are co-regulated and possibly functionally related. Reverse engineering of gene regulation networks By comparing samples/conditions, we can find genes that are differentially expressed which can be useful in Studying effects of various compounds Exploring tumor subclasses

26 40 genes

27 Coherent Cluster Co-regulated genes Several observations can be made.
These genes may be co-regulated and are probably controlled by the same set of transcription factors. If mapped to points in high dimensional space, they may not be close to each other. Not every condition participates Not every gene participates The corresponding submatrix may not occupy a continuous area. Co-regulated genes

28 Coherent Cluster Observations:
If mapped to points in high dimensional space, they may not be close to each other. Bias exists universally. Only a subset of objects and a subset of attributes may participate. Need to accommodate some degree of noise. Solution: subspace cluster, bicluster, coherent cluster

29 Subspace cluster CLIQUE: Agrawal et al. IBM Almaden
Find a subset of dimensions and a subset of objects such that the objects are close to each other on the subset of dimensions. The clusters may overlap PROCLUS: Aggarwal et al. IBM T. J. Watson Do not allow overlap

30 Bicluster Developed in 2000 by Cheng and Church
Using mean squared error residual After discovering one cluster, replace the cluster with random data and find another Not efficient and not accurate

31 Coherent Cluster Coherent cluster pair-wise disparity
Subspace clustering Measure distance on mutual bias pair-wise disparity For a 2×2 (sub)matrix consisting of objects {x, y} and attributes {a, b}, with entries dxa, dxb, dya, dyb, the mutual bias of attribute a is (dxa - dya) and the mutual bias of attribute b is (dxb - dyb); the pair-wise disparity D is the absolute difference between the two mutual biases. The similarity is defined on the slopes rather than the absolute values. The more parallel they are, the lower the disparity, and the higher the similarity. The parallelism is measured in terms of the mutual bias of each column.

32 Coherent Cluster A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ. An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster. A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster. Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.
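The definition can be checked by brute force on a small matrix. The sketch below assumes that the D value of a 2×2 submatrix is the absolute difference of the two mutual biases, |(dxa - dya) - (dxb - dyb)|, and that the threshold is δ; the function names and the example matrix are illustrative, not data from the talk.

```python
from itertools import combinations


def disparity(dxa, dxb, dya, dyb):
    """D value of a 2x2 submatrix: difference between the two mutual biases (assumed form)."""
    return abs((dxa - dya) - (dxb - dyb))


def is_coherent(matrix, delta):
    """True if every 2x2 submatrix has a D value of at most delta."""
    rows = range(len(matrix))
    cols = range(len(matrix[0]))
    return all(
        disparity(matrix[x][a], matrix[x][b], matrix[y][a], matrix[y][b]) <= delta
        for x, y in combinations(rows, 2)
        for a, b in combinations(cols, 2)
    )


# The rows are far apart in absolute value but follow almost the same shape.
matrix = [
    [1.0, 5.0, 3.0],
    [11.0, 15.0, 13.0],
    [4.0, 8.0, 6.5],
]
print(is_coherent(matrix, delta=0.5))   # True
print(is_coherent(matrix, delta=0.25))  # False
```

The second row is the first row shifted by 10, so a distance-based method would separate them, while the disparity between them is 0. This brute-force check is only for illustration; it enumerates every pair of rows and every pair of columns.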

33 Coherent Cluster Challenges:
Finding subspace clusters based on distance alone is already a difficult task due to the curse of dimensionality. The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix. The actual values of the objects in a coherent cluster may be far apart from each other. Each object or attribute in a coherent cluster may bear some relative bias (that is unknown in advance), and such bias may be local to the coherent cluster.

34 Coherent Cluster Compute the maximum coherent
attribute sets for each pair of objects Compute the maximum coherent object sets for each pair of attributes Two way pruning How do we solve this problem? Here is the general framework. Starting from the maximum condition sets, we first generate coherent clusters with many conditions but only two genes, and then try to add more genes. One consequence of including additional genes is that some conditions must be dropped to keep the sub-matrix a coherent cluster. A lexicographical tree is used to organize gene pairs by their maximum condition sets. Construct the lexicographical tree Post-order traverse the tree to find maximum coherent clusters

35 Coherent Cluster Observation: Given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, …, ak}, the 2×k submatrix is a δ-coherent cluster iff the mutual biases (do1ai – do2ai), taken over all attributes ai, do not differ from each other by more than δ. a1 a2 a3 a4 a5 1 3 5 7 2 3.5 2.5 o1 o2 [2, 3.5] If δ = 1.5, then {a1,a2,a3,a4,a5} is a coherent attribute set (CAS) of (o1,o2). The difference between any two mutual biases is less than or equal to δ.

36 Coherent Cluster Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ. a1 a2 a3 a4 a5 1 3 5 7 2 3.5 2.5 r1 r2 3 5 7 r1 r2 a2 2 a3 3.5 a4 a5 2.5 a1 1 A set of conditions is called a maximum condition set if adding any additional condition to the set would cause the difference in mutual biases to exceed the threshold δ. δ = 1 The maximum coherent attribute sets define the search space for maximum coherent clusters.
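One way to find the maximum coherent attribute sets for a single pair of objects is to sort the attributes by mutual bias and slide a window whose bias spread stays within the threshold δ. The sketch below is an illustration of that idea, not necessarily the talk's implementation. The bias values are illustrative, loosely following the numbers visible on the slide, and sets with fewer than two attributes are skipped by assumption.

```python
def maximal_coherent_attribute_sets(biases, delta):
    """biases: {attribute: mutual bias} for one pair of objects.

    Returns the maximal attribute sets whose mutual biases differ by at most delta.
    """
    items = sorted(biases.items(), key=lambda kv: kv[1])
    result = []
    end = 0       # exclusive right edge of the current window
    prev_end = 0  # right edge of the previous window
    for start in range(len(items)):
        # Extend the window while its bias spread stays within delta.
        while end < len(items) and items[end][1] - items[start][1] <= delta:
            end += 1
        # The window is maximal only if it reaches further right than the previous one.
        if end > prev_end and end - start >= 2:
            result.append([name for name, _ in items[start:end]])
        prev_end = end
    return result


# Illustrative mutual biases for one pair of objects.
biases = {"a1": 1.0, "a2": 2.0, "a3": 3.5, "a4": 2.0, "a5": 2.5}
print(maximal_coherent_attribute_sets(biases, delta=1.0))
# [['a1', 'a2', 'a4'], ['a2', 'a4', 'a5'], ['a5', 'a3']]
```

After sorting, each attribute enters and leaves the window once, so the scan itself is linear in the number of attributes; the sort dominates the cost.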

37 Two Way Pruning a0 a1 a2 o0 1 4 2 o1 5 o2 3 6 o3 200 7 o4 300
(o0,o2) →(a0,a1,a2) (o1,o2) →(a0,a1,a2) (a0,a1) →(o0,o1,o2) (a0,a2) →(o1,o2,o3) (a1,a2) →(o1,o2,o4) (a1,a2) →(o0,o2,o4) (o0,o2) →(a0,a1,a2) (o1,o2) →(a0,a1,a2) (a0,a1) →(o0,o1,o2) (a0,a2) →(o1,o2,o3) (a1,a2) →(o1,o2,o4) (a1,a2) →(o0,o2,o4) delta=1 nc =3 nr = 3 MCAS MCOS

38 Coherent Cluster High expressive power Efficient and highly scalable
The coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods. Efficient and highly scalable Wide applications Gene expression analysis Collaborative filtering traditional clustering coherent clustering The scalability is measured on synthetic data.

39 Coherent Cluster References: Other Work
Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp , 2002. Clustering by pattern similarity in large data sets, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp , 2002. Enhanced biclustering on expression data, Proceedings of the IEEE bio-informatics and bioengineering (BIBE), 2003. Other Work STING (VLDB1997) STING+ (ICDE1999, TKDE 2000) CLUSEQ (CSB2002, ICDE2003) Cluster Streams (ICDE2003)

40 Remarks Similarity measure Clustering algorithm
Powerful in capturing high order statistics and dependencies Efficient in computation Robust to noise Clustering algorithm High accuracy High adaptability High scalability High reliability
