Analysing Large Data Sets using Formal Concept Lattices Simon Andrews and Constantinos Orphanides {s.andrews, Conceptual Structures.


1 Analysing Large Data Sets using Formal Concept Lattices Simon Andrews and Constantinos Orphanides {s.andrews, Conceptual Structures Research Group Communication and Computing Research Centre

2 Acknowledgement This work is part of the CUBIST project ("Combining and Uniting Business Intelligence with Semantic Technologies"), funded by the European Commission's 7th Framework Programme of ICT, under topic 4.3: Intelligent Information Management.

3 A variety of data sets can be converted into formal contexts: –Data Discretization –Data Booleanization However, issues arise: –Data of modest size can contain hundreds (of thousands) of formal concepts, resulting in unmanageable and unreadable concept lattices. –Density of, and noise in a context: factors that increase the number of formal concepts. –Computation of formal concepts cannot be carried out, by much of the existing software, on a large scale. Data Sets

4 FcaBedrock (Formal Context Creator) –Creating sub-contexts by restricting the conversion of the data to information of interest. In-Close (Fast Concept Miner) –By removing relatively small concepts from a context to reduce "noise". ⇨ Production of readable, yet still meaningful, concept lattices. Tools By The Authors

5 A Formal Context Creator for Formal Concept Analysis, developed by the authors. Free and open-source at Sourceforge. Input files supported: Flat-file CSV and Three-column CSV (triples). Output files supported: Burmeister (.cxt) and FIMI (.dat). User-guided automation - the user has the final say on how to interpret a data set. Attributes supported: Categorical (aka many-valued, nominal/ordinal), Boolean and Continuous. FcaBedrock - Overview
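
As an illustration of the Burmeister (.cxt) output mentioned above, the following is a minimal sketch of a writer for that format. It assumes the layout as it is commonly described (a "B" line, a blank line, the object and attribute counts, a blank line, the object names, the attribute names, then one row of "X"/"." per object). The function name `to_burmeister` and the toy context are hypothetical and are not taken from FcaBedrock.

```python
def to_burmeister(context, attributes):
    """Serialise a formal context as Burmeister-style text.

    context    -- dict mapping object name -> set of attribute names (assumed shape)
    attributes -- ordered list of attribute names (fixes the column order)
    """
    objects = list(context)
    lines = ["B", "", str(len(objects)), str(len(attributes)), ""]
    lines += [str(o) for o in objects]   # object names, one per line
    lines += list(attributes)            # attribute names, one per line
    for o in objects:
        # One row per object: 'X' where the object has the attribute, '.' otherwise.
        lines.append("".join("X" if a in context[o] else "." for a in attributes))
    return "\n".join(lines) + "\n"


# Hypothetical toy context, for illustration only.
toy = {"o1": {"a", "b"}, "o2": {"b", "c"}, "o3": {"b"}}
cxt_text = to_burmeister(toy, ["a", "b", "c"])
```

Passing the attribute list explicitly keeps the column order stable, which matters because the cross table has no per-cell labels.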

6 Auto-detection of metadata, directly from the data set, if desired. Support of both discrete (0-10, 10-20, …) and progressive (>10, >20, …) scaling for continuous attributes. Ability to exclude attributes from the analysis. Ability to restrict the analysis to user-specified attribute values. Metadata of each conversion/analysis saved & stored for subsequent conversions. Repetition of metadata for similar attributes. FcaBedrock - Overview
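
The difference between discrete and progressive scaling of a continuous attribute can be sketched as follows. This is an illustrative sketch only, not FcaBedrock's code: the function names are hypothetical, and the half-open interval convention (low <= value < high) is an assumption.

```python
def discrete_scale(value, boundaries):
    """Discrete scaling: the value falls into exactly one interval, e.g. '10-20'.

    Assumes half-open intervals [low, high); values outside all intervals
    produce no attribute.
    """
    for low, high in zip(boundaries, boundaries[1:]):
        if low <= value < high:
            return {f"{low}-{high}"}
    return set()


def progressive_scale(value, thresholds):
    """Progressive scaling: the value gets every '>t' attribute it exceeds."""
    return {f">{t}" for t in thresholds if value > t}


# A value of 25 yields one discrete attribute but two progressive ones.
d = discrete_scale(25, [0, 10, 20, 30])
p = progressive_scale(25, [10, 20, 30])
```

With discrete scaling each object receives at most one formal attribute per continuous attribute, whereas progressive scaling produces nested attributes, so an object above a high threshold also carries every lower one.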

7 A fast Concept Miner for Formal Concept Analysis, developed by one of the authors. Free and open-source at Sourceforge. Input files supported: Burmeister (.cxt). Minimum support for intent and extent. Output of analysis data and concepts. Output of sub-context ("noise" reduction). Fast computation of formal concepts: –Mining 1 million concepts per second. In-Close - Overview
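
To make concrete what a concept miner computes, here is a deliberately naive sketch that enumerates every formal concept (extent, intent) of a small context by closing the set of object rows under intersection. It is not the In-Close algorithm and makes no attempt at speed; the function name and the toy context are hypothetical.

```python
def mine_concepts(context):
    """Return all formal concepts of a context as (extent, intent) pairs.

    context -- dict mapping object name -> set of attribute names (assumed shape)
    """
    all_attrs = frozenset().union(*context.values())
    # Every intent is an intersection of object rows; the full attribute
    # set is the intent of the (possibly empty) bottom concept.
    intents = {all_attrs}
    for row in context.values():
        intents |= {intent & row for intent in intents}
    concepts = []
    for intent in intents:
        # The extent is every object that has all attributes of the intent.
        extent = frozenset(o for o, row in context.items() if intent <= row)
        concepts.append((extent, intent))
    return concepts


# Hypothetical toy context, for illustration only.
toy = {"o1": {"a", "b"}, "o2": {"b", "c"}, "o3": {"b"}}
found = mine_concepts(toy)
```

Even this three-object context has four concepts, which hints at why the concept counts quoted in the later slides grow so quickly with context size and density.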

8 Analysis of Sub-Contexts: Agaricus-Lepiota Data Set: Agaricus Lepiota (aka Mushroom) –From UCI Machine Learning Repository –8124 objects (mushrooms) –23 attributes (mushroom properties) e.g. stalk shape, cap color, edible/poisonous… –Attribute types: Categorical, Boolean –Processed by In-Close: 220,000+ concepts

9 Analysis of Sub-Contexts: Agaricus-Lepiota Let us say we are interested in the relationship between mushroom habitat and population type. Using FcaBedrock: –Create a sub-context by only converting the habitat and population type attributes. ⇨ Down to 33 Formal Concepts (from 220,000+) and 13 Formal Attributes (from 125)

10 Visualisation of the sub-context in ConExp

11 Analysis of Sub-Contexts: Census Income Data Set: Census Income (aka Adult) –From UCI Machine Learning Repository –32561 objects (adults) –14 attributes (census data) e.g. age, sex, education, employment type… –Attribute types: Categorical, Boolean, Continuous –Processed by In-Close: 100,000+ concepts

12 Analysis of Sub-Contexts: Census Income Let us say we are interested in comparing how pay is affected by gender in adults who have had a higher education. Using FcaBedrock: –Create a sub-context by only converting the sex, class and education attributes. –Convert only those objects (adults) with the education attribute value Bachelors, Masters or Doctorate. ⇨ Down to 7941 objects and 37 Formal Concepts
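
The two restrictions described in this slide (converting only some attributes, and converting only objects with particular attribute values) can be sketched as a small filter. This is an assumption-laden illustration, not FcaBedrock's implementation: the function name, the record layout and the three sample records are hypothetical.

```python
def make_sub_context(records, keep_attributes, restrict=None):
    """Build a Boolean sub-context from many-valued records.

    records         -- list of dicts, one per object (assumed shape)
    keep_attributes -- attributes to convert; all others are excluded
    restrict        -- optional dict: attribute -> set of allowed values;
                       objects with any other value are skipped
    """
    restrict = restrict or {}
    context = {}
    for index, record in enumerate(records):
        if any(record[a] not in allowed for a, allowed in restrict.items()):
            continue  # object exclusion
        # Each kept attribute/value pair becomes one formal attribute.
        context[index] = {f"{a}={record[a]}" for a in keep_attributes}
    return context


# Hypothetical records in the style of the Census Income data set.
people = [
    {"sex": "Female", "class": ">50K", "education": "Masters", "age": 41},
    {"sex": "Male", "class": "<=50K", "education": "HS-grad", "age": 23},
    {"sex": "Male", "class": ">50K", "education": "Bachelors", "age": 37},
]
sub = make_sub_context(
    people,
    keep_attributes=["sex", "class", "education"],
    restrict={"education": {"Bachelors", "Masters", "Doctorate"}},
)
```

Here the second record is dropped by the education restriction and the age attribute never reaches the context, so both the object count and the formal attribute count shrink.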

13 Visualisation of the sub-context in ConExp

14 In-Close: Concept Reduction Using FcaBedrock's context reduction: –Attributes of no particular interest can be excluded from the analysis (attribute exclusion). –We can convert only those objects with specific attribute values (object exclusion). Introducing In-Close's concept reduction: –Using the well-known idea of minimum support Specifying a minimum number of objects and/or attributes for a concept. ⇨ Reduction of 'noise' in a context.

15 In-Close: Concept Reduction 'Noise': Concepts containing a number of attributes or objects smaller than the user-defined minimums. Reduction of 'noise' achieved by: –Semi-automated form of lattice 'iceberging'. Complete hierarchy maintained in the lattice. –Mining a context for concepts that satisfy a minimum support and then re-writing the context using only those concepts.
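
The minimum-support step can be sketched as a filter over already-mined concepts followed by a rebuild of the context. The rebuild rule used here (an object keeps an attribute only if some surviving concept contains both) is one plausible reading of "re-writing the context using only those concepts" and is an assumption, as are the function name and the toy concept list; it is not In-Close's code.

```python
def reduce_context(concepts, min_extent=0, min_intent=0):
    """Keep concepts meeting the minimum sizes and rebuild a context from them.

    concepts -- iterable of (extent, intent) pairs of sets (assumed shape)
    Returns (kept_concepts, reduced_context).
    """
    kept = [
        (extent, intent)
        for extent, intent in concepts
        if len(extent) >= min_extent and len(intent) >= min_intent
    ]
    reduced = {}
    for extent, intent in kept:
        for obj in extent:
            # A cross survives only if a kept concept covers it.
            reduced.setdefault(obj, set()).update(intent)
    return kept, reduced


# Hypothetical concepts of a three-object toy context.
toy_concepts = [
    (set(), {"a", "b", "c"}),
    ({"o1"}, {"a", "b"}),
    ({"o2"}, {"b", "c"}),
    ({"o1", "o2", "o3"}, {"b"}),
]
kept, quiet = reduce_context(toy_concepts, min_extent=2, min_intent=1)
```

In this toy case the attributes a and c are each supported by a single object, so they disappear from the rewritten context and only the widely shared attribute b remains.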

16 A Student Survey Example Student survey data –Demographic and 'problem' data from 587 university undergraduates. –Yes/No responses to 36 problems that a student may have experienced during their studies: missing lectures, low performance, etc. Noisy data set: –145 Formal Attributes –Processed by In-Close: 22,760,243 concepts!

17 A Student Survey Example Let us say we are only interested in analysing the 'problem' data. Using FcaBedrock: –Convert only these attributes, exclude demographics. –Remaining concepts: 339,672 Significant reduction, but still too many! Adding In-Close to the equation: –Set minimum size of intent to 4 and minimum size of extent to 80. ⇨ Remaining concepts: 32!

18 Visualisation of the sub-context in ConExp

19 Comparing Quiet Sub-Contexts Data Set: Agaricus-Lepiota (aka Mushroom) Using FcaBedrock: –Create two sub-contexts: one for edible mushrooms and one for poisonous mushrooms. Using In-Close (for each sub-context): –Set minimum size of intent to 10. ⇨ 2,848 objects + 17 concepts for the edible sub-context, 3,344 objects + 14 concepts for the poisonous sub-context.

20 Comparing Quiet Sub-Contexts Similarities between the two lattices: –Attributes expressed in both lattices were moved to the right of each lattice. Differences between the two lattices: –Attributes expressed in only one lattice were moved to the left. ⇨ Clear visualisation for comparison.

21 Edible mushroom lattice in ConExp

22 Poisonous mushroom lattice in ConExp

23 Large data sets may be difficult to deal with computationally, but: –It is the number of formal concepts derived from a data set that is the key factor in determining if a concept lattice will be useful as a visualisation. Readable lattices can be produced with a straightforward process of: –creating sub-contexts –reducing noise Freely available software. Burmeister (.cxt) format used to successfully interoperate between the three FCA tools. Conclusion

24 Thank you very much. Questions?

