MS Sequence Clustering

1 MS Sequence Clustering

2 What is it?
We know clustering, especially EM (Expectation Maximization). Now, what is a sequence? A series of discrete events (states), usually finite. Examples:
Education path: high school, work, college, professional school, graduate school, community college
Set of URLs, or parameters, visited at Amazon
DNA (A, G, C, and T)

3 What does the algorithm do?
It is a hybrid of sequence analysis and clustering. It is used to analyze a population of cases that contain sequence data and to group those cases into clusters. For example, at Amazon, if we care only about what is ordered, that could be an ordinary clustering problem. If we care about which pages customers visit before they purchase (or leave without purchasing), that is a sequence clustering problem.

4 Amazon Example
The company has click information for each customer profile. By using the Microsoft Sequence Clustering algorithm on this data, the company can find groups, or clusters, of customers who have similar patterns or sequences of clicks. The company can then use these clusters to analyze how users move through the Web site, to identify which pages are most closely related to the sale of a particular product, and to predict which pages are most likely to be visited next.

5 How the Algorithm Works
One of the input columns that the Microsoft Sequence Clustering algorithm uses is a nested table that contains sequence data. This data is a series of state transitions of individual cases in a dataset, such as product purchases or Web clicks. To determine which sequence columns to treat as input columns for clustering, the algorithm measures the differences, or distances, between all the possible sequences in the dataset. After the algorithm measures these distances, it can use the sequence column as an input for the EM method of clustering.

6 Markov Chain
Having the Markov property means that, given the present state, future states are independent of the past states. Future states are reached through a probabilistic process instead of a deterministic one. For example, P(x_{i+1} = G | x_i = A) = 0.15 says that, given the current state A, the probability of the next state being G is 0.15.
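A minimal sketch of what "probabilistic instead of deterministic" means in practice: the next state is drawn from a probability row that depends only on the current state. Only P(G | A) = 0.15 comes from the slide; the other probabilities and the helper name `next_state` are assumptions made up for illustration.

```python
import random

# Hypothetical transition probabilities out of state A. Only P(G | A) = 0.15
# is taken from the slide; the other values are invented so the row sums to 1.
TRANSITIONS_FROM_A = {"A": 0.40, "C": 0.25, "G": 0.15, "T": 0.20}


def next_state(row, rng):
    """Draw the next state from one row of transition probabilities."""
    states = list(row)
    weights = [row[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


# Simulate many steps that all start from A and measure how often G follows.
rng = random.Random(0)
draws = [next_state(TRANSITIONS_FROM_A, rng) for _ in range(10_000)]
freq_g = draws.count("G") / len(draws)
```

Over many draws, `freq_g` should settle close to 0.15, even though any single step is random.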

7 The order of the chain
An nth-order Markov chain over k states is equivalent to a first-order (1st-order) Markov chain over k^n states. For example, a 2nd-order chain over A, C, G, T is the same as a 1st-order chain over the 16 pairs AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT.
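The equivalence can be sketched in a few lines: enumerate the k^n composite states, then rewrite a sequence as overlapping n-grams so that each n-gram depends only on the one before it. The function names here are illustrative, not part of any Microsoft API.

```python
from itertools import product


def composite_states(alphabet, n):
    """All k**n composite states of an nth-order chain over the alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=n)]


def to_first_order(sequence, n):
    """Rewrite a sequence as overlapping n-grams (the composite states)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


states = composite_states("ACGT", 2)   # the 16 pairs AA, AC, ..., TT
pairs = to_first_order("ACGGT", 2)     # AC, CG, GG, GT
```

Consecutive composite states overlap in n - 1 symbols, which is why a transition such as AC to CG carries the same information as "G follows A, C" in the 2nd-order chain.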

8 State Transition Matrix
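States are finite, not too large in number, and non-redundant. If M is the number of states, the state transition matrix is an M x M matrix whose entry in row i, column j is the probability of moving from state i to state j.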
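As a rough illustration, an M x M transition matrix can be estimated from observed sequences by counting transitions and normalizing each row. This is a plain counting sketch under that assumption; the slides do not say how the Microsoft algorithm estimates its matrices, and the sample sequences are invented.

```python
def transition_matrix(sequences, states):
    """Estimate an M x M matrix of transition probabilities by counting."""
    index = {s: i for i, s in enumerate(states)}
    m = len(states)
    counts = [[0] * m for _ in range(m)]
    for seq in sequences:
        # Each adjacent pair (current, next) is one observed transition.
        for cur, nxt in zip(seq, seq[1:]):
            counts[index[cur]][index[nxt]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state with no outgoing transitions keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


states = ["A", "C", "G", "T"]
matrix = transition_matrix(["ACGT", "AACG"], states)
```

Row 0 of `matrix` holds the estimated probabilities of leaving A; in this toy data A is followed once by A and twice by C.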

9 Clustering with Markov Chain
1. Create clusters at random.
2. Map each cluster to a Markov chain.
3. Assign each case to a few clusters, based on how well it fits and on cut-off values.
4. Calibrate the clusters.
5. Repeat steps 3 and 4 until convergence.
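The loop above can be sketched as EM over a mixture of first-order Markov chains. This is a simplified illustration, not Microsoft's implementation: it uses soft memberships with no cut-off, ignores the distribution of starting states, runs a fixed number of iterations instead of testing convergence, and the names `fit_markov_mixture` and `alpha` (a smoothing constant) are assumptions.

```python
import math
import random


def count_transitions(seq, index):
    """Count state-to-state transitions within one sequence."""
    m = len(index)
    counts = [[0] * m for _ in range(m)]
    for cur, nxt in zip(seq, seq[1:]):
        counts[index[cur]][index[nxt]] += 1
    return counts


def fit_markov_mixture(sequences, states, k, iterations=25, seed=0, alpha=0.1):
    rng = random.Random(seed)
    index = {s: i for i, s in enumerate(states)}
    m = len(states)
    counts = [count_transitions(seq, index) for seq in sequences]

    # Step 1: create clusters at random (random soft memberships per case).
    resp = []
    for _ in sequences:
        raw = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(raw)
        resp.append([x / total for x in raw])

    chains = []
    for _ in range(iterations):
        # Steps 2 and 4: fit one chain per cluster from membership-weighted
        # transition counts; alpha keeps every probability above zero.
        weights = [sum(r[j] for r in resp) / len(sequences) for j in range(k)]
        chains = []
        for j in range(k):
            chain = []
            for a in range(m):
                row = [alpha + sum(r[j] * c[a][b] for r, c in zip(resp, counts))
                       for b in range(m)]
                total = sum(row)
                chain.append([x / total for x in row])
            chains.append(chain)

        # Step 3: reassign each case according to how well each chain fits it.
        resp = []
        for c in counts:
            logs = []
            for j in range(k):
                ll = math.log(max(weights[j], 1e-12))
                for a in range(m):
                    for b in range(m):
                        ll += c[a][b] * math.log(chains[j][a][b])
                logs.append(ll)
            top = max(logs)  # subtract the max before exponentiating
            raw = [math.exp(x - top) for x in logs]
            total = sum(raw)
            resp.append([x / total for x in raw])
    return resp, chains


sequences = ["ABABABABAB", "BABABABABA", "AAAAABBBBB", "BBBBBAAAAA"]
resp, chains = fit_markov_mixture(sequences, states=["A", "B"], k=2)
```

Each row of `resp` gives one sequence's membership in each cluster, and each entry of `chains` is that cluster's transition matrix. On data with clearly different transition patterns, sequences that alternate and sequences that stay in one state would typically end up with their highest membership in different clusters, though EM can settle in a local optimum depending on the random start.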

10 Number of Clusters
Sequence clustering may use more clusters than non-sequence clustering, because the meaning of each cluster is more easily understood.

12 Sequence Clustering Viewer
The viewer has five tabs: Cluster Diagram, Cluster Profiles, Cluster Characteristics, Cluster Discrimination, and State Transitions.

13 Cluster Diagram Tab
The layout in the diagram represents the relationships of the clusters, where similar clusters are grouped close together. By default, the shade of the node color represents the density of all cases in the cluster: the darker the node, the more cases it contains.

15 Cluster Profiles Tab
The Cluster Profiles tab displays the sequences that exist in each cluster. The clusters are listed in individual columns to the right of the States column.

17 Cluster Characteristics Tab
The Cluster Characteristics tab summarizes the transitions between states in a cluster, with bars describing the importance of the attribute value for the selected cluster.

19 Cluster Discrimination Tab
With the Cluster Discrimination tab, you can compare two clusters, to determine which models favor which clusters. The tab contains four columns: Variables, Values, Cluster 1, and Cluster 2. If the cluster favors a specific model, a blue bar appears in the Cluster 1 or Cluster 2 column in the row of the corresponding model in the Variables column. The longer the blue bar, the more the model favors the cluster.

21 State Transitions Tab
On the State Transitions tab, you can select a cluster and browse through its state transitions. Each node represents a state of the model. A line represents the transition between states, and each node is based on the probability of a transition. The background color represents the frequency of the node in the cluster.

