1 Personalization in Folksonomies Based on Tag Clustering. Jonathan Gemmell, Andriy Shepitsen, Bamshad Mobasher, Robin Burke. Center for Web Intelligence, School of Computing, DePaul University, Chicago, Illinois, USA. AAAI 2008
2 Abstract. Collaborative tagging systems enable Internet users to annotate or search for resources using custom labels. Tagging systems contain large numbers of redundant, ambiguous, and idiosyncratic tags. Data mining techniques can ameliorate this problem by reducing noise in the data and identifying trends. The authors use discovered clusters as intermediaries between a user’s profile and resources. This can tailor search results to the user’s interests.
3 Introduction (1/7) Collaborative tagging is an emerging trend, and collaborative tagging systems have recently proliferated. At the foundation of collaborative tagging is the annotation: a user describes a resource with a tag. A collection of annotations results in a complex network of interrelated users, resources and tags, commonly referred to as a folksonomy (Mathes 2004). Users are free to navigate through the folksonomy without being tied to a pre-defined hierarchy.
4 Introduction (2/7) Why is tagging so popular? Tags make it easy and intuitive to retrieve previously viewed resources (Hammond et al. 2005). Further, tagging allows users to categorize resources by several terms (Millen, Feinberg, and Kerr 2006). Collaborative tagging systems have a low entry cost: they do not require users to conform to a rigid hierarchy. Users may enjoy the social aspects of collaborative tagging (Choy and Lui 2006). Users may share or discover resources and connect to people with similar interests.
5 Introduction (3/7) Collaborative tagging is different from traditional web search tools: Collaborative tagging applications reap the insights of many users rather than a few “experts” (Wu, Zhang, and Yu 2006). They are more dynamic and able to incorporate a changing vocabulary. They absorb new trends quickly. These applications can identify groups of like-minded users, catering not only to mainstream but also to non-conventional users. Unlike a search engine, which is a pull-model system, a tagging system is a push-model system. Pull model: the application pulls resources from the information space (Yan, Natsev, and Campbell 2007). Push model: users identify which resources are relevant and, through the annotation process, promote the resource. The collaborative tagging system may be populated with resources a pull-model system may not be able to locate.
6 Introduction (4/7) Other advantages of collaborative tagging: Collaborative tagging lets users categorize resources that lack metadata, such as videos or pictures. As users annotate resources, the system is able to track their interests. Data mining tools such as clustering can identify important trends and characteristics of the users. These profiles are a powerful tool for personalization algorithms (Yan, Natsev, and Campbell 2007).
7 Introduction (5/7) Challenges for search and navigation by using tags: Most collaborative tagging applications permit unsupervised tagging. Folksonomies contain a wide variety of tags: from the factual (e.g., “Mt Rushmore”) to the subjective (e.g., “boring”), and from the semantically obvious (e.g., “Chicago”) to the utterly opaque (e.g., “jfgwh”). Tag redundancy or tag ambiguity can confound users searching for resources. Tag redundancy: several tags have the same meaning. Tag ambiguity: a single tag has many different meanings.
8 Introduction (6/7) Addressing these problems: Through clustering, redundant tags can be aggregated. The combined trend of a cluster can be more easily detected than the effect of a single tag. The uncertainty of a single tag in a cluster can be overwhelmed by the additive effects of the rest of the tags. Personalization can also be used to overcome noise in folksonomies. Given a particular user profile, the user’s interests can be clarified and navigation within the folksonomy can be tailored to suit the user’s preferences.
9 Introduction (7/7) This paper proposes an algorithm to personalize search and navigation based on tags in folksonomies. The core of our algorithm is a set of tag clusters. The personalization algorithm models users as vectors over the set of tags. By measuring the importance of a tag cluster to a user, the user’s interests can be better understood. Likewise, each resource is also modeled as a vector over the set of tags. By associating resources with tag clusters, resources relevant to the topics captured by those clusters can be identified. By using the tag clusters as intermediaries between a user and a resource, we infer the relevance of the resource to the user.
10 Related Work (1/3) A main assumption is that clustering algorithms can form coherent clusters of related tags. This assumption is supported by: Begelman, Keller, and Smadja (2006), who suggest tag clustering to improve search in folksonomies; and Heymann and Garcia-Molina (April 2006), who propose hierarchical clustering to generate a taxonomy from a folksonomy. A similar notion was previously described in (Niwa, Doi, and Honiden 2006), in which an affinity level was calculated between a user and a set of tag clusters.
11 Related Work (2/3) Our algorithm relies heavily on tag clusters and the utility they offer. Clusters have many other potential functions worth noting. Tag clusters could serve as intermediaries between two users in order to identify like-minded individuals. Tag clustering can support tag recommendation, reducing annotation to a mouse click. Well-chosen tags make the recovery process simple and offer some control over the tag space. In (Xu et al. 2006) a group of tags is offered to the user based on several criteria (coverage, popularity, effort, uniformity), resulting in a cluster of relevant tags.
12 Related Work (3/3) Clustering is an important step in many attempts to improve search and navigation. In (Wu, Zhang, and Yu 2006), tag clusters are presumed to be representative of the resource content. In (Choy and Lui 2006) a two-dimensional tag map is constructed. Tag clusters can be used as waypoints in the tag space and facilitate navigation through the folksonomy. In (Hayes and Avesani 2007), topic-relevant partitions are generated by clustering resources rather than tags. Users interested in the topic represented by a cluster may be particularly interested in the characteristic resources.
13 Search and Navigation in Folksonomies (1/7) In traditional Internet applications the search and navigation process serves two vital functions. Retrieval incorporates the notion of navigating to a particular resource. Discovery incorporates the notion of finding resources or content interesting but theretofore unknown to the user. The success of collaborative tagging is due in part to its ability to facilitate both these functions.
14 Search and Navigation in Folksonomies (2/7) A folksonomy can be described as a four-tuple $D = \langle U, R, T, A \rangle$, where U is a set of users, R is a set of resources, T is a set of tags, and A is a set of annotations, $A \subseteq U \times R \times T$. The folksonomy can be viewed as a tripartite hypergraph (Mika 2007). Nodes: users, tags, and resources. Hyper-edges: annotations connecting a user, a tag and a resource.
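As a rough illustration of the four-tuple, the sketch below derives U, R and T from a set of annotations. The `Annotation` type, the `folksonomy` helper and the toy data are assumptions made for this example and do not come from the paper.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One hyper-edge of the folksonomy: a user tags a resource."""
    user: str
    resource: str
    tag: str


# Hypothetical toy data, invented for illustration only.
ANNOTATIONS = {
    Annotation("alice", "r1", "java"),
    Annotation("alice", "r2", "coffee"),
    Annotation("bob", "r1", "programming"),
}


def folksonomy(annotations):
    """Return the four-tuple (U, R, T, A) induced by a set of annotations."""
    A = set(annotations)
    U = {a.user for a in A}
    R = {a.resource for a in A}
    T = {a.tag for a in A}
    return U, R, T, A
```

Deriving U, R and T from A keeps the four sets consistent by construction; a real system would probably also keep users or resources that have no annotations yet.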
15 Search and Navigation in Folksonomies (3/7) Standard Search in Folksonomies. In this work we focus on the vector space model adapted from the information retrieval discipline to work with folksonomies. Each user, u, is modeled as a vector over the set of tags, where each weight, $w(t_i)$, in each dimension corresponds to the importance of a particular tag, $t_i$. Resources can also be modeled as a vector over the set of tags.
16 Search and Navigation in Folksonomies (4/7) In calculating the vector weights, a variety of measures can be used. The tag frequency, tf, for a tag, t, and a resource, r, is the number of times the resource has been annotated with the tag: $tf(t, r) = |\{a = \langle u, r, t \rangle \in A : u \in U\}|$. Likewise, the well-known tf*idf can be used: $tf{*}idf(t, r) = tf(t, r) \cdot \log(N / n_t)$, where N is the total number of resources and $n_t$ is the number of resources to which the tag was applied.
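The two weighting schemes can be sketched as follows. This is a minimal example that assumes annotations are `(user, resource, tag)` tuples and that the logarithm is natural, since the slide does not state a base. The function names are hypothetical.

```python
import math
from collections import Counter


def tag_frequency(annotations):
    """tf[(t, r)]: number of annotations of resource r with tag t, over all users."""
    return Counter((t, r) for (_u, r, t) in annotations)


def tf_idf(annotations):
    """tf*idf[(t, r)] = tf(t, r) * log(N / n_t), natural log assumed."""
    tf = tag_frequency(annotations)
    n_resources = len({r for (_u, r, _t) in annotations})  # N
    # n_t: number of distinct resources each tag was applied to.
    n_t = Counter(t for (t, _r) in tf)
    return {(t, r): f * math.log(n_resources / n_t[t]) for (t, r), f in tf.items()}
```

A tag applied to every resource gets weight zero under tf*idf, which matches the intuition that such a tag does not discriminate between resources.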
17 Search and Navigation in Folksonomies (5/7) With either term weighting approach, a similarity measure between a query, q, and a resource, r, can be calculated. q and r are both represented as vectors over the set of tags. Assume the query is a vector with only one tag, since search is often initiated by selecting a single tag. Cosine similarity is a popular measure; with w(t, r) denoting the weight (tf or tf*idf) of tag t for resource r, the single-tag case reduces to: $\cos(q, r) = \frac{w(q, r)}{\sqrt{\sum_{t \in T} w(t, r)^2}}$
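For a single-tag query, the cosine similarity is the resource's weight for that tag divided by the length of the resource vector. The sketch below assumes a resource is stored as a dictionary from tag to weight; `cosine_single_tag` is a hypothetical name.

```python
import math


def cosine_single_tag(query_tag, resource_vector):
    """Cosine similarity between a one-tag query and a resource.

    resource_vector maps tag -> weight (tf or tf*idf). The query vector has
    weight 1 on query_tag and 0 elsewhere, so its norm is 1.
    """
    norm = math.sqrt(sum(w * w for w in resource_vector.values()))
    if norm == 0.0:
        return 0.0  # resource with no tags: treat as unrelated
    return resource_vector.get(query_tag, 0.0) / norm
```

For example, a resource tagged "java" three times and "coffee" four times has vector length 5, so a query for "java" scores 3/5.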
18 Search and Navigation in Folksonomies (6/7) Need for Personalization. A standard search does not take into account the user profile and returns identical results regardless of the user. Noise in the folksonomy, such as tag redundancy and tag ambiguity, obfuscates patterns and reduces the effectiveness of data mining techniques. Redundant tags can hinder algorithms that depend on calculating similarity between resources (e.g. “java” and “Java”). Ambiguous tags can result in the overestimation of the similarity of resources that are in fact unrelated. (e.g. “java” applied to a and “java” applied to
19 Search and Navigation in Folksonomies (7/7) Tag clustering provides a means to combat noise in the data and facilitate personalization. Tag redundancy can be mitigated since the trend for a cluster can be more easily identified than the effect of a single tag. A cluster of tags will assume the aggregate meaning and overshadow any ambiguous meaning a single tag may have. If tag clusters are used as a nexus between users and resources, the user’s interest in resources can be calculated. Consequently, results from a basic search can be re-ranked to reflect the user profile.
20 Personalized Search Based on Tag Clustering (1/12) Overview of the Proposed Approach. In our approach, tag clusters serve as intermediaries between a user and the resources. A strong interest in a cluster indicates the user has frequently used the tags in the cluster. A strong relationship between a tag cluster and a resource means many of the cluster’s tags were used to describe the resource. By using clusters to connect the user to the resources, the relevance of the resource to the user can be inferred. Each user, therefore, receives a personalized view of the information space.
21 Personalized Search Based on Tag Clustering (2/12) [Diagram. Labels: User profile (tagging history); Tag clusters (topics).]
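22 Personalized Search Based on Tag Clustering (3/12) A critical element of our algorithm is a set of tag clusters that connects a user with the resources. For many clustering techniques, the similarity between tags must first be calculated. Each tag is treated as a vector over the set of resources, and the cosine similarity between two tags, t and s, is: $\cos(t, s) = \frac{\sum_{r \in R} w(t, r)\, w(s, r)}{\sqrt{\sum_{r \in R} w(t, r)^2}\, \sqrt{\sum_{r \in R} w(s, r)^2}}$. Either tf or tf*idf can be used as the weights w(t, r) in the vectors.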
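The tag-to-tag similarity can be sketched with sparse vectors. This example assumes each tag is stored as a dictionary from resource to weight; the function name `cosine` and this representation are choices made for illustration.

```python
import math


def cosine(v, w):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(x * w.get(key, 0.0) for key, x in v.items())
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0.0 or norm_w == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_v * norm_w)
```

Two tags that are never applied to the same resource have similarity 0, while tags applied to the same resources in the same proportions have similarity 1.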
23 Personalized Search Based on Tag Clustering (4/12) Three clustering techniques can be used: hierarchical agglomerative clustering, maximal complete link clustering, and k-means clustering.
24 Personalized Search Based on Tag Clustering (5/12) Hierarchical Clustering Algorithm: When the hierarchical clustering algorithm begins, each tag forms a singleton cluster. During each stage of the procedure, clusters of tags are joined together depending on the level of similarity between the clusters. This is repeated for many iterations until all tags have been aggregated into one cluster.
25 Personalized Search Based on Tag Clustering (6/12)
26 Personalized Search Based on Tag Clustering (7/12) Several techniques exist to calculate the similarity between tag clusters and to merge smaller clusters: single link, maximum link, and centroid. To compute the similarity between clusters, a centroid for each cluster is calculated. Each tag is treated as a vector over the set of resources. Vector weights are calculated using either tf or tf*idf. The similarity between two clusters is then calculated using the centroids as though they were single tags.
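A plain centroid-based agglomeration might look like the sketch below. It is a textbook variant, not the modified algorithm tuned in the paper: it merges the most similar pair at each stage and stops when the best similarity drops below `min_similarity`, a hypothetical parameter introduced here. Tags are assumed to be dictionaries from resource to weight.

```python
import math
from itertools import combinations


def cosine(v, w):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(x * w.get(k, 0.0) for k, x in v.items())
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0


def centroid(cluster, tag_vectors):
    """Average the resource vectors of the tags in a cluster."""
    total = {}
    for tag in cluster:
        for resource, weight in tag_vectors[tag].items():
            total[resource] = total.get(resource, 0.0) + weight
    return {r: weight / len(cluster) for r, weight in total.items()}


def agglomerate(tag_vectors, min_similarity=0.0):
    """Merge the most similar pair of clusters until one cluster remains
    or the best centroid similarity falls below min_similarity."""
    clusters = [frozenset([t]) for t in sorted(tag_vectors)]
    merges = []  # (merged cluster, similarity at which it was formed)
    while len(clusters) > 1:
        best_sim, a, b = max(
            ((cosine(centroid(x, tag_vectors), centroid(y, tag_vectors)), x, y)
             for x, y in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        if best_sim < min_similarity:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a | b, best_sim))
    return clusters, merges
```

With the default `min_similarity=0.0` the loop runs until all tags sit in one cluster, and `merges` records the hierarchy. Recomputing every centroid at each stage keeps the sketch short but is too slow for a dataset of the size reported later in the talk.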
27 Personalized Search Based on Tag Clustering (8/12) Hierarchical clustering has several parameters that require tuning. Step: the decrement by which the similarity threshold is lowered at each iteration. By modifying this parameter, the granularity of the hierarchy can be controlled. Division coefficient: any cluster below this similarity threshold is considered an independent cluster. When the value is near 1, it will result in many small clusters with high internal similarity, or possibly even singletons.
28 Personalized Search Based on Tag Clustering (9/12) Generalization level: allows the algorithm to return more general tag clusters for the hierarchy. It will behave as a traditional agglomerative clustering algorithm when the value is set very high.
29 Personalized Search Based on Tag Clustering (10/12) Maximal Complete Link Clustering. Maximal complete link clustering identifies every maximal clique in a graph (Augustson and Minker 1970). A maximal clique is a clique that is not contained in a larger clique. Maximal complete link clustering permits clusters to overlap. This may be particularly advantageous when dealing with ambiguous tags. “java”, for example, could be a member of a coffee cluster as well as a programming cluster.
30 Personalized Search Based on Tag Clustering (11/12) Maximal complete link clustering is a well-known NP-hard problem. Fortunately, the extreme sparsity of the data permits the application of this method. Approximation techniques could be used to save computational time at the expense of missing some clusters (Johnson 1973). The minimum similarity threshold needs to be tuned. If the similarity between two tags meets this threshold, they are considered to be connected.
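The slides do not say how the maximal cliques are enumerated. One common choice is the Bron-Kerbosch algorithm, which is swapped in here as an assumption. In the sketch, tags are connected when their similarity meets the threshold, and every maximal clique becomes a cluster, so clusters may overlap. The helper names and the toy similarities in the usage note are hypothetical.

```python
from itertools import combinations


def similarity_graph(tags, similarity, threshold):
    """Connect two tags when similarity(a, b) meets the threshold."""
    adjacency = {t: set() for t in tags}
    for a, b in combinations(tags, 2):
        if similarity(a, b) >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))  # cannot be extended further
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques
```

With hypothetical similarities in which “java” is linked to “programming”, “coffee” and “espresso”, and “coffee” is linked to “espresso”, the function returns the two overlapping clusters {java, programming} and {java, coffee, espresso}.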
31 Personalized Search Based on Tag Clustering (12/12) K-means Clustering. A predetermined number of clusters, k, are randomly populated with tags. Centroids are calculated for each cluster. Each tag is reassigned to a cluster based on a similarity measure between itself and the cluster centroid. Several iterations are completed until tags are no longer reassigned. This clustering method has only one parameter to tune, k.
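The k-means procedure above can be sketched as follows, assuming cosine similarity to the centroid as the similarity measure and tags stored as dictionaries from resource to weight. The `seed` and `max_iter` parameters are additions for reproducibility and safety and are not described in the slides.

```python
import math
import random


def cosine(v, w):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(x * w.get(k, 0.0) for k, x in v.items())
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0


def centroid(members, tag_vectors):
    """Average vector of the member tags; empty clusters get an empty centroid."""
    total = {}
    for tag in members:
        for resource, weight in tag_vectors[tag].items():
            total[resource] = total.get(resource, 0.0) + weight
    return {r: weight / len(members) for r, weight in total.items()}


def k_means(tag_vectors, k, seed=0, max_iter=100):
    """Return a dict tag -> cluster index in range(k)."""
    rng = random.Random(seed)
    tags = sorted(tag_vectors)
    # Randomly populate the k clusters with tags.
    assignment = {t: rng.randrange(k) for t in tags}
    for _ in range(max_iter):
        centroids = [
            centroid([t for t in tags if assignment[t] == i], tag_vectors)
            for i in range(k)
        ]
        # Reassign each tag to the cluster with the most similar centroid.
        updated = {
            t: max(range(k), key=lambda i: cosine(tag_vectors[t], centroids[i]))
            for t in tags
        }
        if updated == assignment:
            break  # no tag was reassigned
        assignment = updated
    return assignment
```

The result depends on the random initial assignment, so a poor start can leave some clusters empty. Practical implementations often restart several times and keep the best run.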
32 Personalized Algorithm Based on Tag Clustering (1/3) There are three inputs to a personalized search: the selected tag, the user profile and the discovered clusters. The output of the algorithm is an ordered set of resources. The user’s interest in a cluster, c, is calculated as the ratio of the number of the user’s annotations that use tags from c to the total number of annotations the user has: $uc\_w(u, c) = \frac{|\{a = \langle u, r, t \rangle \in A : r \in R,\ t \in c\}|}{|\{a = \langle u, r, t \rangle \in A : r \in R,\ t \in T\}|}$
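33 Personalized Algorithm Based on Tag Clustering (2/3) The relation of a resource, r, to a cluster, c, is the ratio of the number of the resource’s annotations that use tags from c to the total number of annotations the resource has: $rc\_w(r, c) = \frac{|\{a = \langle u, r, t \rangle \in A : u \in U,\ t \in c\}|}{|\{a = \langle u, r, t \rangle \in A : u \in U,\ t \in T\}|}$. The relevance of the resource to the user, relevance(u, r), is calculated from the sum of the product of the user-cluster weight, uc_w(u, c), and the resource-cluster weight, rc_w(r, c), over the set of all clusters, C: $relevance(u, r) = \sum_{c \in C} uc\_w(u, c) \cdot rc\_w(r, c)$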
34 Personalized Algorithm Based on Tag Clustering (3/3) A personalized similarity is calculated for each resource by multiplying the cosine similarity by the relevance of the resource to the user: $rankscore(u, q, r) = \cos(q, r) \cdot relevance(u, r)$. Consequently, the resulting personalized rankscore(u, q, r) will depend on the user and the results will be personalized.
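The three steps of the personalization algorithm can be sketched together. The example assumes annotations are `(user, resource, tag)` tuples and clusters are sets of tags. The function names are hypothetical, and the basic cosine similarity is passed in as a precomputed number.

```python
def user_cluster_weight(annotations, user, cluster):
    """Share of the user's annotations whose tag belongs to the cluster."""
    mine = [a for a in annotations if a[0] == user]
    if not mine:
        return 0.0
    return sum(1 for a in mine if a[2] in cluster) / len(mine)


def resource_cluster_weight(annotations, resource, cluster):
    """Share of the resource's annotations whose tag belongs to the cluster."""
    about = [a for a in annotations if a[1] == resource]
    if not about:
        return 0.0
    return sum(1 for a in about if a[2] in cluster) / len(about)


def relevance(annotations, user, resource, clusters):
    """Sum over clusters of user-cluster weight times resource-cluster weight."""
    return sum(
        user_cluster_weight(annotations, user, c)
        * resource_cluster_weight(annotations, resource, c)
        for c in clusters
    )


def rank_score(basic_similarity, annotations, user, resource, clusters):
    """Personalized score: basic cosine similarity scaled by relevance."""
    return basic_similarity * relevance(annotations, user, resource, clusters)
```

Sorting the basic search results by `rank_score` in descending order gives the personalized ranking. A resource tied to clusters the user never tagged in gets relevance 0 and drops to the bottom.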
35 Experimental Evaluation (1/3) Data: A Web crawler was used to extract data from del.icio.us from 5/26/2007 to 06/15/2007. The dataset contains 29,918 users, 6,403,442 resources and 1,035,177 tags. There are 47,184,492 annotations, each consisting of one user, one resource and one tag. Test: Each test case consisted of a user, a tag and a resource. First, the system performs a basic search. Second, it performs the personalized search. Finally, the rank of the target resource is compared between the basic and personalized searches.
36 Experimental Evaluation (2/3) Experiment Methodology. Two random samples of 5,000 users were taken from the dataset. Five-fold cross validation was performed on each sample. For each fold, 20% of the users were partitioned from the rest as test users. Clustering was completed using the data from the remaining 80% of the users. Clusters were generated using hierarchical, maximal complete link and k-means clustering. Relevant parameters for each method were tuned. From each user in the test set, 10% of the user’s annotations were randomly selected as test cases.
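The user-level split can be sketched as below. This is only one plausible reading of the methodology: users are shuffled and dealt into five folds, and about 10% of a test user's annotations, with at least one, become test cases. The function names, the seeding and the rounding rule are assumptions.

```python
import random


def five_fold_user_splits(users, folds=5, seed=0):
    """Yield (train_users, test_users) pairs; each user is a test user once."""
    rng = random.Random(seed)
    shuffled = sorted(users)
    rng.shuffle(shuffled)
    for i in range(folds):
        test = set(shuffled[i::folds])  # every folds-th user, starting at i
        train = set(shuffled) - test
        yield train, test


def sample_test_cases(user_annotations, fraction=0.1, seed=0):
    """Randomly pick about `fraction` of a user's annotations as test cases."""
    pool = list(user_annotations)
    if not pool:
        return []
    rng = random.Random(seed)
    count = min(len(pool), max(1, round(len(pool) * fraction)))
    return rng.sample(pool, count)
```

Clusters would then be built from the annotations of `train` users only, so that test users do not influence the clusters used to personalize their own searches.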
37 Experimental Evaluation (3/3) In order to judge the improvement provided by the personalized search, imp (Voorhees 1999) can be used: $imp = \frac{1}{r_p} - \frac{1}{r_b}$, where $r_b$ is the rank of the target resource in the basic search and $r_p$ is its rank in the personalized search. The value for imp can never be greater than one. If the basic search ranks the resource very low and the personalization algorithm improves its rank to the first position, then imp will approach one.
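The formula image for imp did not survive in the transcript. The sketch below assumes imp is the difference of reciprocal ranks, which is consistent with the stated properties (never greater than one, approaching one when a low-ranked resource moves to first place). Treat the definition and the function name `imp` as assumptions.

```python
def imp(rank_basic, rank_personalized):
    """Assumed improvement measure: 1/r_p - 1/r_b, with ranks starting at 1."""
    if rank_basic < 1 or rank_personalized < 1:
        raise ValueError("ranks start at 1")
    return 1.0 / rank_personalized - 1.0 / rank_basic
```

Under this reading, moving a resource from rank 100 to rank 1 gives 0.99, an unchanged rank gives 0, and a resource that personalization pushes down gets a negative value.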
38 Experimental Results. In general, the proposed personalization technique results in improved performance. The choice of tf or tf*idf also played an important role. In all cases tf*idf is superior. Hierarchical clustering produced superior performance, perhaps due to its inherent flexibility. Three parameters need to be tuned: step, division coefficient, and generalization level.
39 Experimental Results. An ideal value for step would aggregate tags slowly enough to capture the conceptual hierarchy between individual clusters. Overspecialization; tags are aggregated too quickly. The best
40 Experimental Results. The division coefficient defines the level where the hierarchy is dissected into individual clusters. If the division coefficient is set too high, the result will be many small clusters, which makes it hard to define a topic. If the division coefficient is set too low, the result is a few large clusters. Such clusters may include many relatively unrelated tags and span several topic areas. The optimum value for this dataset is approximately 0.1.
42 Experimental Results. Every cluster below the division coefficient would be returned by the clustering algorithm. If the generalization level is set too low, it is possible to overlook a cluster representing relevant resources. If the generalization level is set too high, irrelevant factors can be introduced. For these experiments the optimum value for the generalization level is 8.
44 Experimental Results. The similarity threshold for complete link has a strong impact on the effectiveness of the personalization routine. If the threshold is set too low, links are generated between tags based upon a very weak relationship, resulting in large clusters. Setting the threshold too high can remove links between tags that are in fact quite similar, resulting in a loss of valuable information. Through empirical evaluation we found the ideal value for the similarity threshold to be 0.2.
46 Experimental Results. K-means has only one parameter to tune, k, the predetermined number of clusters to be generated. If too many clusters are generated, a topic may be separated into many clusters. Too few clusters can result in clusters covering multiple topics. In these experiments the optimum value for k was found to be approximately 350.
48 Experimental Results. The personalization technique using maximal complete link clusters demonstrated promising results. Hierarchical clustering improved personalization even further. The worst of the three methods was k-means clustering. Improvement: Hierarchical: 0.137; Complete link: 0.112; K-means: 0.042.
49 Experimental Results. The poor results of k-means clustering can be attributed to its inability to identify innocuous tags. Another drawback of k-means clustering is that ambiguous tags can pull unrelated tags together. The strength of maximal complete link clustering lies in its ability to generate overlapping clusters. The modified hierarchical clustering offers a level of customization not offered by the other two techniques.