
Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions Aparna Varde, Elke Rundensteiner, Carolina Ruiz, David Brown, Mohammed.


1 Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions
Aparna Varde, Elke Rundensteiner, Carolina Ruiz, David Brown, Mohammed Maniruzzaman and Richard Sisson Jr.
Worcester Polytechnic Institute, Worcester, MA, USA
ACM CIKM 2006, Arlington, VA, USA

2 Introduction
- Clustering often groups data with mixed attributes
  - Numeric
  - Categorical
  - Ordinal
  - Examples: PDAs, Web Pages, Scientific Experiments
- Cluster Representatives: depictions of each cluster
- Randomly selected representatives not enough in
  - Capturing cluster information
  - Providing ease of interpretation
  - Incorporating different user interests
- Need for Designing Cluster Representatives

3 Motivating Example
- Scientific experiments clustered based on results
- Clustering criteria learned based on input conditions
- Representative of conditions used to characterize a cluster
- Problem with randomly selected representative
  - Distinct combinations of conditions could lead to a given cluster
[Figure: Decision tree learning the clustering criteria (Heat Treating of Materials)]

4 Goals
- Need to Design Semantics-Preserving Cluster Representatives that
  - Capture relevant information in cluster
  - Avoid visual clutter and are easy to interpret
  - Take into account various user interests in targeted applications

5 Proposed Approach: DesCond
- Given: Clusters of experiments, conditions leading to clusters
- Define notion of distance for conditions incorporating domain semantics
- Build candidate representatives with increasing levels of detail
- Compare candidates using MDL-based encoding capturing user interests
- Return candidate with lowest encoding as best for each cluster

6 Main Tasks in DesCond
- Defining a notion of distance for the input conditions
- Obtaining suitable candidate representatives for each cluster
- Proposing an encoding to compare candidates and find a winner

7 Notion of Distance
- Example: Heat Treating of Materials
  - Quenchant: Cooling medium
  - Part: The material being treated
  - Probe: Characterizes shape, dimension
  - Oxide: Thickness of oxide on surface
  - Agitation: Extent of agitation of cooling medium
  - Quenchant Temperature: Starting temperature of cooling medium
- Define domain-specific distance metric for conditions incorporating
  - Data types of attributes
  - Distance between attribute values
  - Weights of the attributes

8 Data Types of the Attributes
- Categorical
  - Characters or strings with descriptive information
  - E.g., Quenchant Name, Part Material, Probe Type
- Numerical
  - Integers or real numbers
  - E.g., Quenchant Temperature
- Ordinal
  - Where order matters
  - E.g., Oxide Layer, Agitation Level

9 Distance Between the Attribute Values
- Categorical
  - Different = 1
  - Same = 0
- Numerical
  - Absolute difference between
    - Values or
    - Mean values of ranges
- Ordinal
  - Map values to integer
    - E.g., Oxide Layer: none = 0, thin = 1, thick = 2
  - Absolute difference between mapped values
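The three per-attribute rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the `OXIDE_LEVELS` table, and the choice to represent a numeric range as a `(low, high)` tuple are all assumptions made for the example.

```python
# Hypothetical integer mapping for one ordinal attribute (values from the slide).
OXIDE_LEVELS = {"none": 0, "thin": 1, "thick": 2}


def categorical_distance(a: str, b: str) -> int:
    """Same value -> 0, different value -> 1."""
    return 0 if a == b else 1


def _as_value(x) -> float:
    """Reduce a (low, high) range to its mean; pass single values through."""
    if isinstance(x, tuple):
        low, high = x
        return (low + high) / 2
    return float(x)


def numerical_distance(a, b) -> float:
    """Absolute difference between values, or between mean values of ranges."""
    return abs(_as_value(a) - _as_value(b))


def ordinal_distance(a: str, b: str, levels: dict) -> int:
    """Map ordinal values to integers, then take the absolute difference."""
    return abs(levels[a] - levels[b])
```

For instance, `ordinal_distance("none", "thick", OXIDE_LEVELS)` gives 2, and a quenchant temperature range of `(20, 40)` compared with a single value of `50` gives a distance of 20.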

10 Weights of the Attributes
- Attribute has higher weight if it
  - Is at higher level in tree
  - Belongs to a shorter path
  - Has more experiments in its corresponding cluster
- Decision Tree Weight Heuristic
  $$W_i = \frac{1}{P} \sum_{j=1}^{P} \frac{H_{i,j}}{H_j} \cdot G_j$$
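As a purely arithmetic sketch, the heuristic W_i = (1/P) Σ_j (H_{i,j} / H_j) · G_j can be evaluated as below. The slide does not define H_{i,j}, H_j, or G_j, so this example treats them as already-computed numbers, one per path j, and makes no claim about how they are derived from the decision tree. The function and parameter names are hypothetical.

```python
from typing import Sequence


def attribute_weight(
    h_attr: Sequence[float],  # H_{i,j} for attribute i, one entry per path j (assumed given)
    h_path: Sequence[float],  # H_j, one entry per path j (assumed given, non-zero)
    g: Sequence[float],       # G_j, one entry per path j (assumed given)
) -> float:
    """Evaluate W_i = (1/P) * sum over j of (H_{i,j} / H_j) * G_j."""
    if not (len(h_attr) == len(h_path) == len(g)):
        raise ValueError("all inputs must have one entry per path")
    p = len(h_path)
    if p == 0:
        raise ValueError("at least one path is required")
    return sum(hij / hj * gj for hij, hj, gj in zip(h_attr, h_path, g)) / p
```

With two paths, `attribute_weight([2, 1], [2, 4], [0.5, 1.0])` evaluates (1.0 · 0.5 + 0.25 · 1.0) / 2 = 0.375.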

11 Candidate Representatives in Levels of Detail
- Level 1: Single Conditions Representative (SCR)
  - One set of conditions preserving cluster information
- Level 2: Multiple Conditions Representative (MCR)
  - Summary of information in cluster
- Level 3: All Conditions Representative (ACR)
  - All information in cluster abstracted suitably

12 Single Conditions Representative
- Return set of conditions closest to all others in cluster
- Notion of distance: Domain-specific distance metric for conditions
[Figure: Input conditions in Cluster A; SCR for Cluster A]
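One way to read "closest to all others" is as a medoid: the set of conditions whose total distance to every item in the cluster is smallest. The sketch below assumes that reading. The function name, the dictionary layout of a condition set, and `toy_distance` (an unweighted stand-in for the domain-specific metric) are hypothetical and only serve the example.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def single_conditions_representative(
    items: Sequence[T], distance: Callable[[T, T], float]
) -> T:
    """Return the item with the smallest total distance to all items (a medoid)."""
    if not items:
        raise ValueError("cluster must contain at least one set of conditions")
    return min(items, key=lambda cand: sum(distance(cand, other) for other in items))


def toy_distance(a: dict, b: dict) -> float:
    """Hypothetical stand-in metric: categorical mismatch plus scaled temperature gap."""
    mismatch = 0 if a["quenchant"] == b["quenchant"] else 1
    return mismatch + abs(a["temperature"] - b["temperature"]) / 100


cluster_a = [
    {"quenchant": "oil", "temperature": 40},
    {"quenchant": "oil", "temperature": 50},
    {"quenchant": "water", "temperature": 90},
]
scr = single_conditions_representative(cluster_a, toy_distance)
```

Here the second set of conditions (oil at 50) is selected, because its summed distance to the other two sets is the lowest of the three.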

13 Multiple Conditions Representative
- Build sub-clusters of conditions using domain knowledge
- Return nearest sub-cluster representatives
- Sort them
[Figure: Cluster A; Sub-clusters within Cluster A; MCR for Cluster A]

14 All Conditions Representative
- Return all sets of conditions
- Sort them in ascending order
[Figure: Cluster A; ACR for Cluster A]

15 DesCond Encoding to Compare Candidates
- Analogous to Minimum Description Length (MDL)
  - Theory: representative, Examples: Sets of conditions in cluster
- Complexity of representative (ease of interpretation)
  $$\text{Complexity} = \log_2 (AV)$$
  - A = number of attributes, V = number of values for each attribute
- Distance of all items from representative (information loss)
  $$\text{Distance} = \log_2 \left( \frac{1}{s} \sum_{i=1}^{s} D(R, S_i) \right)$$
  - D: domain-specific distance metric for conditions
  - s: total number of items (sets of conditions) in cluster
  - S_i: each individual item
  - R: representative set of conditions
- DesCond Encoding
  $$\text{Effectiveness} = UBC \cdot \text{Complexity} + UBD \cdot \text{Distance}$$
  - UBC, UBD: User bias % weights for complexity and distance
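The encoding can be sketched numerically as follows, reading the formulas as Complexity = log2(A·V), Distance = log2 of the mean of D(R, S_i), and Effectiveness = UBC·Complexity + UBD·Distance. Several things here are assumptions for illustration: the function names, passing the distances D(R, S_i) in as a precomputed list, expressing the user bias weights as fractions rather than percentages, and raising an error when the mean distance is zero (the slide does not say how that case is handled).

```python
import math
from typing import Dict, Sequence


def descond_encoding(
    num_attributes: int,         # A
    num_values: int,             # V
    distances: Sequence[float],  # D(R, S_i) for every item S_i in the cluster
    ubc: float,                  # user bias weight for complexity
    ubd: float,                  # user bias weight for distance
) -> float:
    """Effectiveness = UBC * log2(A * V) + UBD * log2(mean of D(R, S_i))."""
    if not distances:
        raise ValueError("cluster must contain at least one item")
    complexity = math.log2(num_attributes * num_values)
    mean_distance = sum(distances) / len(distances)
    if mean_distance <= 0:
        # Assumption: the zero-distance case is left undefined in this sketch.
        raise ValueError("mean distance must be positive to take its logarithm")
    return ubc * complexity + ubd * math.log2(mean_distance)


def best_candidate(encodings: Dict[str, float]) -> str:
    """The candidate with the lowest encoding wins."""
    return min(encodings, key=encodings.get)


# Made-up numbers: SCR is simpler (V = 1) but further from the items than MCR.
balanced = {
    "SCR": descond_encoding(4, 1, [4.0, 4.0], ubc=0.5, ubd=0.5),
    "MCR": descond_encoding(4, 2, [1.0, 1.0], ubc=0.5, ubd=0.5),
}
complexity_heavy = {
    "SCR": descond_encoding(4, 1, [4.0, 4.0], ubc=0.9, ubd=0.1),
    "MCR": descond_encoding(4, 2, [1.0, 1.0], ubc=0.9, ubd=0.1),
}
```

With these made-up inputs, `best_candidate(balanced)` picks MCR, while `best_candidate(complexity_heavy)` picks SCR: raising the weight on complexity favors the simpler representative.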

16 Evaluation of DesCond with Domain Expert Interviews
- Evaluated with real data in Heat Treating
- User Bias weights in Encoding reflect interests in targeted applications
- Different data sets and number of clusters
- For each data set score calculated as follows
  - Consider winning candidate for each cluster
    - Based on DesCond Encoding
  - Score: Number of clusters in which candidate is winner
- Example: Dataset of size 25 with 5 clusters
  - If SCR wins for 2 clusters, ACR for 3
  - Score: SCR=2, ACR=3
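The scoring rule is a simple tally of per-cluster winners. A minimal sketch that reproduces the slide's example (5 clusters, SCR winning 2 and ACR winning 3); the function name and the list-of-winners input are assumptions for the illustration:

```python
from collections import Counter
from typing import Dict, Sequence


def score_candidates(winners: Sequence[str]) -> Dict[str, int]:
    """Count, for each candidate type, the number of clusters it wins."""
    return dict(Counter(winners))


# One entry per cluster: the candidate with the lowest encoding for that cluster.
winners_per_cluster = ["SCR", "ACR", "ACR", "SCR", "ACR"]
scores = score_candidates(winners_per_cluster)
```

For this input, `scores` holds SCR with 2 and ACR with 3, matching the example on the slide.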

17 Evaluation Results
- Details
  - Data Set Size = 400, Number of Clusters = 20
  - Experts provide UBC / UBD values in Encoding
- Observations
  - Overall winner is MCR
  - As weight for complexity increases, SCR wins
  - Designed better than Random

18 Evaluation with Formal User Surveys
- DesCond used to design representatives for a trademarked estimation tool [ref CHTE: Center for Heat Treating Excellence]
- Formal user surveys conducted in different applications of the system
- Evaluation Process
  - Compare estimation with real data in test set
  - If they match estimation is accurate

19 Evaluation Results
- Different winners in different applications
- Results of surveys tally with those of Encoding-based evaluation
- Estimation Accuracy: 90 to 94% (better than earlier versions of tool)
[Figure panels: Parameter Selection Applications; Simulation Tool Applications; Decision Support Applications; Intelligent Tutoring Applications]

20 Related Work
- Image Rating: [HH-01]
  - User intervention involved in manual rating
- Semantic Fish Eye Views: [JP-04]
  - Display multiple objects in small space, no representatives
- PDA Displays in Levels of Detail: [BGMP-01]
  - Do not evaluate different types of representatives

21 Conclusions
- Contributions of this work
  - Designing cluster representatives for scientific input conditions in levels of detail
  - Defining a domain-specific distance metric for conditions
  - Proposing an encoding to compare representatives
  - Conducting evaluation using encoding with real data from Heat Treating
  - Assessing use of representatives in applications of a CHTE trademarked estimation tool
- Results
  - Designed Representatives better than random
  - Different designed representatives suit different applications
  - DesCond enhances accuracy of estimation tool
