Applying the ROCAT algorithm to find subspace clusters in categorical data Presented by George Hodulik.


1 Applying the ROCAT algorithm to find subspace clusters in categorical data Presented by George Hodulik

2 Project Description
Goals:
1. Develop an application that finds relevant overlapping subspace clusters in categorical data from an input SQL query, using ROCAT.
2. Apply the application to the public dataset “Social Justice Sexuality Project: 2010 National Survey, including Puerto Rico (ICPSR 34363).”
3. Potentially apply the application to other public datasets.
4. Find optimizations to the algorithm that reduce runtime or improve results.
5. If I have time, run other subspace clustering algorithms, such as DHCC, on the dataset and compare results.

3 The Dataset “Social Justice Sexuality Project: 2010 National Survey, including Puerto Rico (ICPSR 34363).” It covers 5 factors: racial and sexual identity, spirituality and religion, mental and physical health, family formations and dynamics, and civic and community engagement. It contains about 5000 data rows (each row is one person's survey responses) and over 100 attributes (the questions). It is ideal for ROCAT: the data is almost entirely categorical, and the amount of data is large enough to be non-trivial but small enough that ROCAT should run smoothly.

4 Implementation Primarily Java (+ MySQL). The ROCAT algorithm translates very well into an object-oriented environment, and Java is a language I am comfortable with. I also used Excel and a Python script to import the CSV file into a MySQL database table.

5 Data preprocessing (done so far) The data needed to be preprocessed for the following reasons. Some questions were blanked for confidentiality, so there is no point in keeping those columns. Ex. Exact age was blanked; instead, age was broken into three subgroups. Some questions were answered by only 600 of the 5000 people because there were two versions of the survey. These columns should be considered only for the 600 people who answered them and ignored for everyone else. Some questions are irrelevant. Ex. Whether someone took the paper or electronic version of the survey is probably not important.

6 More on Data preprocessing (not implemented yet) Some of the attributes are redundant and should be condensed if possible. For example, there is a question about the race of the subject where the response options are “Only white,” “Only black,” “Only Asian,” etc., but there is also a separate boolean column in the data for “Subject answered ‘only white’ in earlier question.” Once the application is completely functional, I may preprocess the data further to look for trends relating to specific questions. Ex. To find trends between race and religion, I may run the application on only the attributes that relate to race and religion.
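
One way to flag redundant attributes like the “only white” column is to test whether one column is fully determined by another. The following is a minimal sketch, not part of the presented application; the class name `RedundancyCheck`, the method `isDeterminedBy`, and the sample values are hypothetical, and columns are assumed to be loaded as lists of strings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: checks whether a "derived" column adds no information
// beyond a "source" column (i.e., each source value maps to one derived value).
final class RedundancyCheck {

    /** Returns true if every value in source always co-occurs with the same value in derived. */
    static boolean isDeterminedBy(List<String> source, List<String> derived) {
        if (source.size() != derived.size()) {
            throw new IllegalArgumentException("columns must have the same number of rows");
        }
        Map<String, String> seen = new HashMap<>();
        for (int i = 0; i < source.size(); i++) {
            // putIfAbsent returns the earlier mapping, or null if this source value is new.
            String previous = seen.putIfAbsent(source.get(i), derived.get(i));
            if (previous != null && !previous.equals(derived.get(i))) {
                return false; // same source value, two different derived values
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from the survey data.
        List<String> race = List.of("Only white", "Only black", "Only white", "Only Asian");
        List<String> onlyWhite = List.of("yes", "no", "yes", "no");
        System.out.println(isDeterminedBy(race, onlyWhite)); // flag column follows from race
        System.out.println(isDeterminedBy(onlyWhite, race)); // race does not follow from the flag
    }
}
```

A column that is determined by another could then be dropped before clustering, which should keep the same subspace cluster from being reported twice under different attributes.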

7 Recall: MDL principle to measure relevance Goal: Find the compression model that results in the minimum number of bits needed to represent the data. The model tells us the relevant subspace clusters. So the algorithm frequently checks whether a subspace cluster is relevant by checking whether adding it to the model reduces the coding cost. I am having trouble properly calculating the coding cost, and consequently my application cannot tell which subspace clusters are the most relevant.
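
To make the "does adding this cluster reduce the coding cost?" test concrete, here is a deliberately simplified sketch. It is not ROCAT's actual coding scheme: it assumes each column is encoded with an ideal entropy code (n times the Shannon entropy, in bits) and charges only the cost of marking which rows belong to the cluster, ignoring all other model parameters. The names `CodingCostSketch`, `columnCostBits`, and `savingBits` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified MDL-style relevance check (assumption: ideal entropy coding per column).
final class CodingCostSketch {

    /** Bits needed to encode a column with an ideal code: n * H(column). */
    static double columnCostBits(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        double n = values.size();
        double bits = 0.0;
        for (int c : counts.values()) {
            bits -= c * (Math.log(c / n) / Math.log(2)); // -c * log2(c / n)
        }
        return bits;
    }

    /**
     * Bits saved by encoding the cluster rows and the remaining rows separately,
     * after paying for the membership flags. A positive value suggests the
     * cluster is worth adding under this simplified cost model.
     */
    static double savingBits(List<List<String>> columns, boolean[] inCluster) {
        List<String> membership = new ArrayList<>();
        for (boolean b : inCluster) {
            membership.add(b ? "in" : "out");
        }
        double saving = -columnCostBits(membership); // cost of saying who is in the cluster
        for (List<String> column : columns) {
            List<String> inside = new ArrayList<>();
            List<String> outside = new ArrayList<>();
            for (int i = 0; i < column.size(); i++) {
                (inCluster[i] ? inside : outside).add(column.get(i));
            }
            saving += columnCostBits(column) - columnCostBits(inside) - columnCostBits(outside);
        }
        return saving;
    }

    public static void main(String[] args) {
        List<String> col = List.of("a", "a", "b", "b");
        boolean[] firstTwo = {true, true, false, false};
        System.out.println(savingBits(List.of(col), firstTwo));      // one attribute
        System.out.println(savingBits(List.of(col, col), firstTwo)); // two attributes
    }
}
```

Under this model a cluster that is pure on only one attribute barely pays for its own membership flags, while the saving grows with every additional attribute on which the cluster is pure. That matches the intuition that many people agreeing on many questions is more interesting than almost everyone agreeing on one.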

8 Recall: ROCAT Algorithm Input: data set D. Output: list of subspace clusters in D. 3 phases: Searching (bulk of the algorithm and runtime; implemented), Combining (not yet implemented), Reassigning (not yet implemented). As said earlier, my implementation currently cannot decide on its own which subspace clusters are most relevant. Fortunately, I can judge that by hand, so I can share some results.

9 Potentially interesting current results 466 people answered 25 questions the same, from which I could conclude: Not white, nor person of color, nor Asian/Pacific Islander, nor Native American, nor Hispanic. They identified as LGBTQ and cisgender. Not foreign born, nor parents, not third generation or more. These people said that their medical professionals neither ignored nor seemed uncomfortable with their sexual identity. Because of the redundant data mentioned earlier, this specific subspace cluster was basically found twice, but with different questions: saying “Not x” for all races x and saying “only ‘other race’” mean the same thing, but there are attributes for both, and the algorithm cannot recognize that they are the same.

10 Uninteresting Results 4571 people said they were neither Asian/Pacific Islander nor Native American. 2 people answered 47 questions the same. Because I am having trouble properly calculating the coding cost, the application currently cannot recognize that these results are less interesting than the previous ones.

11 Potential optimizations so far (1) Recall: Search phase, Find Best Pure subspace cluster. In a situation like this, it makes sense to calculate the coding cost of each candidate cluster, as each candidate is very different. However, I have found that calculating the coding cost is a slow process. It is currently my bottleneck, but I could be implementing it inefficiently. Note: attributes are added in order from least entropy to greatest.
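
The ordering mentioned in the note (attributes added from least entropy to greatest) can be sketched as follows. This is an illustration rather than the presented implementation; `EntropyOrder`, `entropyBits`, `attributesByEntropy`, and the attribute names in `main` are hypothetical, and ties are broken alphabetically only to make the order deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: orders attribute names by the Shannon entropy of their values.
final class EntropyOrder {

    /** Shannon entropy of a categorical column, in bits. */
    static double entropyBits(List<String> values) {
        if (values.isEmpty()) {
            return 0.0;
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        double n = values.size();
        double h = 0.0;
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** Attribute names sorted from least entropy to greatest. */
    static List<String> attributesByEntropy(Map<String, List<String>> columns) {
        List<String> names = new ArrayList<>(columns.keySet());
        names.sort(Comparator
                .comparingDouble((String name) -> entropyBits(columns.get(name)))
                .thenComparing(Comparator.naturalOrder()));
        return names;
    }

    public static void main(String[] args) {
        // Illustrative columns only, not taken from the survey data.
        Map<String, List<String>> columns = Map.of(
                "race", List.of("a", "b", "c", "d"),
                "lgbtq", List.of("y", "y", "y", "y"),
                "foreignBorn", List.of("y", "y", "n", "n"));
        System.out.println(attributesByEntropy(columns));
    }
}
```

Low-entropy attributes come first because many rows share the same value on them, so a candidate cluster built on those attributes keeps more rows.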

12 Potential optimizations so far (1, continued) Consider this case in the Find Best Pure algorithm: [Figure: candidate clusters labeled C_1, C_2, C_n, C_k] C_1, C_2, …, C_k all have the same number of rows. C_n has fewer rows. Obviously, of C_1, C_2, …, C_k, C_k is the best subspace cluster. For any C_i, C_{i+1} that have the same number of rows, we should be able to skip the calculation of the coding cost of C_i, because CC(C_i) >= CC(C_{i+1}). Return C =
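
The skipping rule on this slide can be sketched as a small filter over the candidates' row counts. The sketch assumes the slide's claim that CC(C_i) >= CC(C_{i+1}) whenever two consecutive candidates have the same number of rows; the class `CandidatePruning` and the method `candidatesToEvaluate` are hypothetical names, and the row counts in `main` are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decides which candidates still need a coding-cost calculation.
final class CandidatePruning {

    /**
     * rowCounts[i] is the number of rows in candidate i, in the order the
     * candidates were generated. A candidate is skipped when the next candidate
     * has the same number of rows, since (by the slide's claim) the next one
     * cannot have a higher coding cost.
     */
    static List<Integer> candidatesToEvaluate(int[] rowCounts) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < rowCounts.length; i++) {
            boolean last = i == rowCounts.length - 1;
            if (last || rowCounts[i] != rowCounts[i + 1]) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // Three candidates with equal row counts, then one with fewer rows.
        int[] rowCounts = {466, 466, 466, 120};
        System.out.println(candidatesToEvaluate(rowCounts)); // indices that still need a cost
    }
}
```

With the coding cost as the bottleneck, this reduces the number of cost calculations from one per candidate to one per distinct run of equal row counts.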

13 Potential optimizations so far (2) Consider this case in the Find Best Pure algorithm: [Figure: candidate clusters labeled C_1, C_2, C_n; the algorithm returns C_n] A subsection of C_1, C_2, etc. is likely to be found in future iterations. Perhaps we can remember these so we do not have to find them again? Overall, I do not think you can assume that subsections of C_1, C_2, etc. will be found to be relevant in the future, because returning C_n changes the value distributions of the attributes. However, it may be a good estimate if the application were time sensitive.

14 What to do next Properly calculate the coding cost to better discriminate relevance among subspace clusters. Implement the rest of the algorithm and look for optimizations. Preprocess data to reduce redundancy. Preprocess data to find specific trends. Try other dataset(s). Try running a different algorithm. I do not think I will have time to implement another algorithm myself, but I may be able to find someone else’s application and compare results.

15 Thank you! Questions?

