The Challenge of Heterogeneity. Prof. Dr. Katharina Morik, Technische Universität Dortmund, Faculty of Computer Science, LS 8.


1 The Challenge of Heterogeneity
Prof. Dr. Katharina Morik, technische universität dortmund, Faculty of Computer Science, LS 8

2 Overview
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008 | Faculty Computer Science LS 8 | technische universität dortmund
- Heterogeneity in Data
- Distributed Data
- Web 2.0
- Heterogeneity of Users
- Structuring music collections
- Structuring tag collections

3 Heterogeneity in Data
- Databases
- Fixed set of attributes
- Declared data types
- Multi-relational
- Very large number of records
- Preparation for mining
- Extract, Transform, Load
- Select attributes
- Declare label for learning
- Handle missing values
- Compose new attributes
- Schema-mapping for re-use of DM
MiningMart application to customer churn (Telecom Italia)

4 Heterogeneity in Data
- Time series data
- Measurements over time
- Business
- Medicine
- Production
- Handwriting
- Pictures
- Music
- Prediction
- Classification
- Clustering
- Signal to Symbol

5 Heterogeneity in Data
- Texts
- High dimensional vectors
- Sparse word vectors
- Texts of the same class need not share a word!
- Syntactic, semantic structures
- Classification
- Clustering
- Named Entity Recognition, Information Extraction
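The point that texts of the same class need not share a word can be made concrete with a small bag-of-words sketch. The two example sentences and the helper names `bag_of_words` and `cosine` are made up for illustration; they are not taken from the talk.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Sparse word vector: only words that actually occur are stored."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Two hypothetical texts that a reader would both file under "football"
doc_a = bag_of_words("the striker scored a late goal")
doc_b = bag_of_words("bayern won their match yesterday")

# No word is shared, so the dot product (and the similarity) is zero
similarity = cosine(doc_a, doc_b)
```

A purely word-based similarity sees these two texts as unrelated, which is one reason the slide lists syntactic and semantic structures next to plain word vectors.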

6 Distributed Data
- Distributed databases of the same schema
- Distributed databases of different schemas
- Low-level, low-capacity sensors
- Peer-to-peer networks

7 Heterogeneity of Users
- The same label name does not necessarily mean the same concept.
- Different names may refer to the same set of items.
- Users apply diverse aspects, e.g., genre, time of day, episodes (summer 99), ...
- Users share some set of items (possibly under different names).
[Figure: example tag hierarchies of different users, with labels including hip hop, pop, metal, alternative, death metal, true metal, piano, classic, guitar, jazz, favourites, blues, modern, work, home, office, plane.]

8 Web 2.0
- Organizing large data collections requires semantic annotations.
- Users annotate items with arbitrary tags.
- No common ontology is required (“folksonomies”).
- Users want to keep their tags, but like to benefit from efforts of others.

9 Structuring Music Collections
- A concept’s meaning is its extension, e.g., some music.
- A concept’s meaning can be expressed by a classifier.
- A concept hierarchy for each aspect --> hierarchical classification.
- Acquiring the hierarchy by clustering under the assumption that user-given taggings are kept.
[Figure: items a, b, d, e, f arranged under tags including pop, rock, metal, blues, good, bad, aggressive.]

10 Localized Alternative Cluster Ensembles (ECML 2006)
- Acquiring hierarchical clusterings from
  - Own partial clusterings
  - Clusterings of other peers
- Preserve taggings of users
- Produce several alternatives
- Exploit input clusterings
- Consider locality instead of global consensus
[Figure: example tag hierarchies of different users, with labels including hip hop, pop, metal, alternative, death metal, true metal, piano, classic, guitar, jazz, favourites, blues, modern, work, home, office, plane.]

11 LACE Algorithm
[Figure: tag hierarchies with nodes alternative, metal, true metal, death metal, hip hop, pop over items a to g.]
Items are represented by Ids.

12 LACE Algorithm
[Figure: tag hierarchies with nodes alternative, metal, true metal, death metal, hip hop, pop over items a to g.]
Best matching cluster node is selected by f-measure.
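Selecting a best matching node by f-measure could look like the following minimal sketch. It assumes that the f-measure compares the set of query items with the set of items under a cluster node; the item ids, the node names, and the functions `f_measure` and `best_matching_node` are hypothetical.

```python
def f_measure(query, extension):
    """Harmonic mean of precision and recall of a node's items w.r.t. the query items."""
    overlap = len(query & extension)
    if overlap == 0:
        return 0.0
    precision = overlap / len(extension)
    recall = overlap / len(query)
    return 2 * precision * recall / (precision + recall)

def best_matching_node(query, nodes):
    """Return (tag, score) of the node whose item set matches the query best."""
    tag = max(nodes, key=lambda t: f_measure(query, nodes[t]))
    return tag, f_measure(query, nodes[tag])

# Hypothetical query items and cluster nodes offered by another peer
query = {"a", "b", "c", "d", "e", "f", "g"}
nodes = {
    "metal": {"a", "c"},
    "pop": {"d", "f"},
    "favourites": {"a", "c", "d", "f", "x", "y"},
}
best_tag, best_score = best_matching_node(query, nodes)
```

In this toy data the broader node wins: it contains two foreign items, but it recalls four of the seven query items instead of two.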

13 LACE Algorithm
[Figure: tag hierarchies with nodes alternative, metal, true metal, death metal, hip hop, pop over items a to g.]
Items that are sufficiently similar to items in the best matching clustering are deleted from the query set.

14 LACE Algorithm
[Figure: tag hierarchies with nodes alternative, metal, true metal, death metal, hip hop, pop over items a to g.]
A new query is posed containing the remaining items. Only tags not used yet are considered.

15 LACE Algorithm
[Figure: tag hierarchies with nodes alternative, metal, true metal, death metal, hip hop, pop over items a to g.]
The process continues until all items are covered, no additional match is possible, or a maximal number of rounds is reached.
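The query loop described on the preceding LACE slides can be sketched as a greedy covering procedure. This is a strong simplification under stated assumptions, not the published algorithm: it matches single nodes rather than whole clusterings, and it treats an item as covered only if it is itself contained in the selected node (no similarity threshold). All names and data are hypothetical.

```python
def f_measure(query, extension):
    """F-measure of two item sets; equals 2*|A & B| / (|A| + |B|)."""
    overlap = len(query & extension)
    if overlap == 0:
        return 0.0
    return 2 * overlap / (len(query) + len(extension))

def cover_query(query, nodes, max_rounds=10):
    """Greedily cover the query items with cluster nodes of other peers."""
    remaining = set(query)
    unused = dict(nodes)              # only tags not used yet are considered
    selected = []
    for _ in range(max_rounds):       # maximal number of rounds
        if not remaining or not unused:
            break                     # all items covered or nothing left to try
        tag = max(unused, key=lambda t: f_measure(remaining, unused[t]))
        if f_measure(remaining, unused[tag]) == 0.0:
            break                     # no additional match is possible
        selected.append(tag)
        remaining -= unused.pop(tag)  # pose a new query with the remaining items
    return selected, remaining

# Hypothetical item ids and peer nodes
query = {"a", "b", "c", "d", "e", "f", "g"}
nodes = {
    "metal": {"a", "b", "c"},
    "pop": {"d", "f"},
    "hip hop": {"e", "h"},
    "jazz": {"x", "y"},
}
selected, remaining = cover_query(query, nodes)
```

With this toy data one item stays uncovered after the loop stops, which is the situation the later classification step is meant to handle.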

16 LACE Algorithm
[Figure: tag hierarchies with nodes alternative, metal, true metal, death metal, hip hop, pop over items a to g.]
Remaining items are added by classification (kNN).
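Adding a remaining item by k-nearest-neighbour classification might look like the sketch below. It assumes that every item has a numeric feature vector and that Euclidean distance is used; the two-dimensional features, the tags, and the function `knn_assign` are invented for illustration.

```python
from collections import Counter
import math

def knn_assign(item_features, labelled, k=3):
    """Assign an uncovered item to the majority tag among its k nearest labelled items."""
    nearest = sorted(labelled, key=lambda pair: math.dist(item_features, pair[0]))[:k]
    votes = Counter(tag for _, tag in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-d audio features of items that already have a tag
labelled = [
    ((0.9, 0.8), "metal"),
    ((0.8, 0.9), "metal"),
    ((0.2, 0.1), "pop"),
    ((0.1, 0.3), "pop"),
    ((0.3, 0.2), "pop"),
]

# The item left over after matching is placed next to its nearest neighbours
tag_for_g = knn_assign((0.25, 0.2), labelled, k=3)
```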

17 LACE Algorithm
[Figure: tag hierarchies with nodes alternative, metal, true metal, death metal, hip hop, pop over items a to g.]
Process starts anew until no more matches are possible or the maximal number of results is reached.

18 LACE Algorithm
[Figure: several alternative result hierarchies, with nodes alternative, metal, true metal, death metal, hip hop, pop and work, home, office, plane.]
Process starts anew until no more matches are possible or the maximal number of results is reached.

19 LACE Algorithm
[Figure: the alternative result hierarchies (alternative, metal, true metal, death metal, hip hop, pop; work, home, office, plane) connected through a P2P network.]
Ad hoc peer-to-peer network.

20 Structuring Music Collections
Challenge of music data:
- There is no perfect feature set for all mining tasks.
- Learning feature extraction for a classification task (Mierswa/Morik, MLJ 2005)
- Structuring music collections (Wurst/Morik/Mierswa, ECML 2006)
- User views are local models: no global consensus wanted! (Mierswa/Morik/Wurst, in: Masseglia, Poncelet, and Teisseire (editors), Successes and New Directions in Data Mining, 2007)

21 Structuring Tag Collections
- Users annotate resources with arbitrary tags.
- Frequency of tags is shown by the tag cloud.
- Tags structure the collection.

22 Navigation
- A user may select a tag and see its resources.
- A user may follow related tags.
- Problem:
  - No hierarchical structure.
  - Navigation is restricted to the given tags.
  - No navigation according to subsets.
  - Photography and art cannot be found!

23 Given: Folksonomy
- A folksonomy $(U, T, R, Y)$, with
  - $U$: users
  - $T$: tags
  - $R$: resources
  - $Y \subseteq U \times T \times R$; a record $(u, t, r) \in Y$ means that user $u$ has annotated resource $r$ with tag $t$.
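As a minimal sketch, a folksonomy can be held as a set of (user, tag, resource) records. The record type `Annotation`, the sample users, tags, and resource ids are hypothetical. Deriving U, T, and R from Y is a simplification, since the definition also allows users, tags, or resources that occur in no record.

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    """One record (u, t, r): user u annotated resource r with tag t."""
    user: str
    tag: str
    resource: str

# Hypothetical annotation records Y
Y = {
    Annotation("anna", "photography", "r1"),
    Annotation("anna", "art", "r1"),
    Annotation("ben", "art", "r2"),
    Annotation("ben", "photography", "r3"),
    Annotation("cleo", "art", "r1"),
}

# Simplification: take U, T, R to be exactly what occurs in Y
U = {y.user for y in Y}
T = {y.tag for y in Y}
R = {y.resource for y in Y}

def resources_for(tag):
    """Resources that at least one user annotated with `tag`."""
    return {y.resource for y in Y if y.tag == tag}
```

Selecting a tag in a tag cloud then corresponds to calling `resources_for` with that tag.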

24 Wanted: Tagset clustering
- Hierarchical clustering of tags for navigation,
- based on frequency: how many users used tag $t$?
  $\mathrm{supp}: \mathcal{P}(T) \to \mathbb{N}$, with $\mathrm{supp}_U(T') = |\{u \in U \mid \forall t \in T' : \exists r \in R : (u, t, r) \in Y\}|$ for a tag set $T' \subseteq T$
- Subset of the lattice of frequent tag sets that optimizes clustering criteria.
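The user-based support of a tag set can be computed directly from the annotation records. The sketch below reads the definition as: a user supports a tag set if, for every tag in it, the user has annotated at least one resource with that tag. The function name `user_support` and the sample records are hypothetical.

```python
def user_support(tagset, records):
    """Number of users who used every tag in `tagset` on at least one resource."""
    tags_by_user = {}
    for user, tag, _resource in records:
        tags_by_user.setdefault(user, set()).add(tag)
    wanted = set(tagset)
    return sum(1 for tags in tags_by_user.values() if wanted <= tags)

# Hypothetical (user, tag, resource) records
records = {
    ("anna", "photography", "r1"),
    ("anna", "art", "r1"),
    ("ben", "art", "r2"),
    ("ben", "photography", "r3"),
    ("cleo", "art", "r1"),
}

supp_art = user_support({"art"}, records)
supp_both = user_support({"art", "photography"}, records)
```

Under this reading, a user who applied the two tags to different resources still supports the combined tag set, because the resource is chosen separately for each tag.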

25 Starting Point: Termset Clustering
- Termset clustering: how many resources support a term?
- Given frequent term sets, form a clustering with small overlap and large coverage.
  - Beil, Ester, Xu (2002): Frequent Term-Based Text Clustering, in KDD 2002
  - Fung, Wang, Ester (2003): Hierarchical Document Clustering Using Frequent Itemsets, in SDM 2003
- Heuristics for minimizing overlap, maximizing coverage.
[Figure: lattice of frequent term sets over documents D1, ..., D16, from { } through {sun}, {beach}, {sun, fun}, {fun, beach}, {sun, beach} to {sun, fun, beach}, each node annotated with its supporting documents.]
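To make the two criteria tangible, here is a small sketch with deliberately simple definitions. These definitions are assumptions for illustration, not the measures of the cited papers: coverage is taken as the fraction of documents that fall into at least one cluster, and overlap as the average number of additional clusters a covered document appears in. The document ids and clusters are invented.

```python
def coverage(clusters, documents):
    """Fraction of all documents that belong to at least one cluster."""
    covered = set().union(*clusters.values())
    return len(covered & documents) / len(documents)

def average_overlap(clusters):
    """Average number of additional clusters each covered document appears in."""
    counts = {}
    for docs in clusters.values():
        for doc in docs:
            counts[doc] = counts.get(doc, 0) + 1
    if not counts:
        return 0.0
    return sum(c - 1 for c in counts.values()) / len(counts)

# Hypothetical documents and clusters labelled by frequent terms
documents = {f"D{i}" for i in range(1, 9)}
clusters = {
    "sun": {"D1", "D2", "D3", "D4"},
    "beach": {"D3", "D4", "D5"},
    "fun": {"D6"},
}

cov = coverage(clusters, documents)
ovl = average_overlap(clusters)
```

Adding more term sets to the clustering tends to raise both numbers at once, which is why the two goals pull in different directions.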

26 Heterogeneous Preferences
[Figure: child-count vs. completeness (left); coverage vs. overlap (right).]

27 Multi-objective Optimization
- Given frequent tag sets
- Find all optimal clusterings according to two orthogonal criteria.
- Orthogonal criteria can only be determined empirically.
- Childcount: number of successors of a cluster
- Overlap: average overlap of clusters at each level.
- Completeness: how much of the lattice is retained?
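"All optimal clusterings" under two criteria means the Pareto-optimal ones: those not dominated by any other candidate. A minimal sketch follows, assuming both objectives are expressed as scores to be maximized; the candidate score pairs are hypothetical and stand in for evaluated clusterings.

```python
def dominates(p, q):
    """p dominates q if p is at least as good in every objective and strictly better in one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (objective 1, objective 2) scores of five candidate clusterings
candidates = [(0.9, 0.2), (0.7, 0.6), (0.6, 0.4), (0.4, 0.9), (0.4, 0.7)]
front = pareto_front(candidates)
```

Each point on the front is a different trade-off between the two criteria, so a user can pick the one that suits their preferences. This quadratic filter is only meant for small examples.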

28 GA for Optimization
- NSGA II algorithm: Deb, Agrawal, Pratap, Meyarivan (2000), in Procs. Parallel Problem Solving from Nature
- Delivers all Pareto-optimal clusterings to a partial lattice of frequent tag sets.
[Figure: GA loop with the steps initial population, fitness, stop?, selection, crossover, mutation, output.]

29 Encoding Frequent Tag Sets
- Given the lattice of possibly frequent tag sets,
- a binary vector indicates the inclusion of a tag set into the clustering.
- A vector can be mutated by flipping bits.
- Two vectors can be combined to a new one by crossover.
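The encoding and the two genetic operators can be sketched as follows. The slide does not say which crossover variant is used, so one-point crossover is an assumption here; the tag sets, the mutation rate, and the helper names are hypothetical.

```python
import random

def mutate(vector, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in vector]

def crossover(parent_a, parent_b, rng):
    """One-point crossover (assumed variant): prefix of one parent, suffix of the other."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def decode(vector, tagsets):
    """Tag sets whose bit is set, i.e. the tag sets included in the clustering."""
    return [ts for bit, ts in zip(vector, tagsets) if bit]

# Hypothetical frequent tag sets; position i of a vector refers to tagsets[i]
tagsets = [
    frozenset(),
    frozenset({"sun"}),
    frozenset({"beach"}),
    frozenset({"sun", "beach"}),
]

rng = random.Random(0)          # seeded only to make the sketch repeatable
parent_a = [1, 1, 0, 0]
parent_b = [1, 0, 1, 1]
child = crossover(parent_a, parent_b, rng)
mutant = mutate(child, rate=0.25, rng=rng)
clustering = decode(mutant, tagsets)
```

Every bit vector decodes to some subset of the lattice, so mutation and crossover never produce an invalid individual; whether that subset is a good clustering is left to the fitness criteria.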

30 Result: Points of Pareto-front
- Childcount vs. Completeness
- Pareto-front for different minimal support
- Instances

31 Application
- BibSonomy social bookmark system: Hotho, Jäschke, Schmitz, Stumme 2006
- 780 users, 59,000 resources, 25,000 tags
- 4,000 frequent tag sets
- Optimization according to Childcount vs. Completeness and Overlap vs. Coverage

32 Multi-objective Tagset Clustering
- Multi-objective optimization allows the user to select among equally good clusterings --> heterogeneity of users is respected
- High scalability, high dimensionality
- Understandable labels (tags)
- Hierarchical structure for navigation.

33 Challenges for Data Mining
- High dimensional data
- High throughput data
- Distributed Data
- P2P networks
- Web 2.0
- Diverse user preferences
- Service for end-user systems, e.g. mobile “phones”

