
1 Finding similar items by leveraging social tag clouds Speaker: Po-Hsien Shih Advisor: Jia-Ling Koh Source: SAC 2012 Date: October 4, 2012

2 Outline Introduction Challenges Missing Tag Effect Popularity Bias Approaches Balanced Voting Model One-Class Probabilistic Model Experiment Conclusion

3 Introduction Suppose I search for "outstanding universities in California". What results do I get? Hmm... it's strange... where's Stanford University?

4 Cont. How do we solve this? A potential solution is to provide a query-by-example interface: users supply some examples to help improve the quality of the results. Ex. Issue the query "UC Berkeley"; result: UCLA, Stanford University, etc.

5 Cont. What's the major challenge? Identifying and ranking entities that are similar to the user-provided examples based on tag information. The uncontrolled nature of user-generated metadata often causes problems of imprecision and ambiguity, so we face two challenges: the missing tag effect and popularity bias.

6 Cont. Goal: create a function R to measure the similarity between an entity x_i in the dataset and a query X_Q. Tag generation / data model: Identifier: the title. Entity: one Wiki page or entry. Tag: a label a user attaches to the page, or a category name.
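The slides describe the data model only in prose, so here is a minimal Python sketch of one way to represent it; the Entity class and the toy values are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """One Wiki page or entry, identified by its title."""
    title: str                                    # identifier
    tags: set[str] = field(default_factory=set)   # user labels / category names

# Hypothetical toy entities in the spirit of the later slides
beijing = Entity("Beijing", {"City", "Capital", "Asia", "Summer Olympic", "China"})
lyon = Entity("Lyon", {"City", "Europe", "Object"})
```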

7 Intersection-Driven Approach T_Q∩ = T_1 ∩ T_2 ∩ T_3 ∩ ... For X_Q = {x_1, x_2}: R(x_4, X_Q) = 1, R(x_3, X_Q) = 2. What's the problem? We don't know whether the user wants to search for cities, capitals, or both.
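A minimal sketch of the intersection-driven baseline, assuming (as the slide suggests) that a candidate's score is the number of its tags that fall in T_Q∩; the function and variable names are mine, not the paper's.

```python
def intersection_scores(entity_tags, query):
    """Score candidates by overlap with T_Q∩, the intersection of the
    query entities' tag sets.

    entity_tags: dict mapping entity name -> set of tags
    query: set of query entity names (X_Q)
    """
    # T_Q∩ = T_1 ∩ T_2 ∩ ...
    t_q = set.intersection(*(entity_tags[q] for q in query))
    return {e: len(tags & t_q)
            for e, tags in entity_tags.items() if e not in query}
```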

8 Missing tag effect When does it happen? A newly created entity might not be well tagged until its editors finish revising all of its content, which can cause the system to misinterpret user intent. How does it happen, and how do we solve it?

9 X_Q = {x_2: Washington D.C., x_3: London} With the intersection-driven approach we get T_Q∩ = {t_8: Object}, so the system considers the entity Beijing irrelevant and returns other entities that contain the tag Object. That is how the missing tag effect happens.

10 Solving the Missing Tag Effect One remedy is called Partial Weighting Generalization. With X_Q = {x_2, x_3} and T_Q∩ = {Object}, we can assign real-valued scores to tags instead of either 0 or 1. For example, assigning 0.5 points to the query tags not in T_Q∩: R(x_4, X_Q) = 1 + 0.5 = 1.5, R(x_5, X_Q) = 1 + 0.5 = 1.5, R(x_6, X_Q) = 1 + 0.5 + 0.5 = 2. If the system already returns satisfactory results, we tend not to apply generalization unless the user asks for more results.
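A minimal sketch of partial weighting generalization, assuming tags shared by every query entity score 1 and tags carried by only some query entities score 0.5 (the value used on the slide); the names are illustrative.

```python
def partial_weighting_scores(entity_tags, query, partial=0.5):
    """Score candidates with real-valued tag weights instead of 0/1."""
    tag_sets = [entity_tags[q] for q in query]
    all_q = set.intersection(*tag_sets)        # tags in T_Q∩ count 1 point
    some_q = set.union(*tag_sets) - all_q      # other query tags count 0.5
    return {e: len(tags & all_q) + partial * len(tags & some_q)
            for e, tags in entity_tags.items() if e not in query}
```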

11 Popularity Bias The number of tags associated with an entity follows a power-law distribution. |T_i| measures the popularity of an entity x_i by the number of tags associated with it. If X_Q = {x_1: Beijing, x_6: Lyon}, the entity Beijing may contribute more score, so the results are dominated by entities like Beijing. A popular tag (like the tag Object) is probably not the concept the user intends to search for.

12 What do we want? We have to refine the intersection-driven approach to meet the two challenges above: A popular entity in a query shouldn't dominate the results. Even if a few tags are missing from the input examples, the system has to identify relevant entities based on tags associated with a subset of the input examples.

13 Balanced Voting Model

14 X_Q = {x_1: Beijing, x_6: Lyon}, candidate entity x_3: London. x_1: Beijing → T_1: {City, Capital, Asia, Summer Olympic, China}; x_3: London → T_3: {City, Europe, Summer Olympic, Object}; x_6: Lyon → T_6: {City, Europe, Object}. Each query entity casts one vote split evenly over its tags, so Beijing contributes 1/5 = 0.2 per tag and Lyon 1/3 ≈ 0.33 per tag. R(City, X_Q) = 0.2 + 0.33 = 0.53; R(Capital, X_Q) = 0.2; R(Summer Olympic, X_Q) = 0.2; R(Europe, X_Q) = R(Object, X_Q) = 0.33. Then R(x_3, X_Q) = 0.53 + 0.2 + 0.33 + 0.33 = 1.39. This compensates for the bias caused by a popular entity in a query, and the non-zero assignment alleviates the missing tag effect.
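A minimal sketch of the balanced voting model; the equal-split voting rule follows the slide's arithmetic, while the function and variable names are mine.

```python
def balanced_voting_scores(entity_tags, query):
    """Each query entity casts one vote, split evenly across its tags,
    so a heavily tagged query entity cannot dominate; candidates then
    sum the scores of their own tags."""
    tag_scores = {}
    for q in query:
        vote = 1 / len(entity_tags[q])       # Beijing: 1/5 = 0.2, Lyon: 1/3
        for t in entity_tags[q]:
            tag_scores[t] = tag_scores.get(t, 0.0) + vote
    return {e: sum(tag_scores.get(t, 0.0) for t in tags)
            for e, tags in entity_tags.items() if e not in query}
```

On the slide's example this reproduces London's score: 0.53 + 0.2 + 0.33 + 0.33 ≈ 1.39 (exactly 1.4; the slide rounds 1/3 to 0.33).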

15 One-Class Probabilistic Model Now let us consider how people create a query for finding similar items. First, the user has some desired property in mind, then tries to recall examples with that property based on their knowledge.

16 Cont. We assume a user's intent corresponds to one tag t_k in the dataset. Since the intent is not directly observable, we assume the user selects the |X_Q| query entities from ε(t_k), where ε(t_k) is the set of all entities associated with t_k. The tag t_k is therefore expected to be associated with all entities in the query.

17 Cont. Assumptions: There are no missing tags in the dataset; all tags are independent of each other; a single tag t_k stands for the user's intent.

18 Cont. The probability of X_Q being the query and t_k being the desired tag follows from the uniform-selection assumption above. We then sum the probability values over all tags to get the probability of entity x_i being a similar entity. This alleviates the popularity bias, because the system assigns a low value to P(X_Q | t_k) for a popular tag.
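The transcript drops the slide's formula, so the sketch below reconstructs it from the stated assumption that the user picks the |X_Q| query entities uniformly from ε(t_k); under that assumption P(X_Q | t_k) = 1 / C(|ε(t_k)|, |X_Q|) when X_Q ⊆ ε(t_k) and 0 otherwise. The summation over a candidate's tags is a hedged reading of the slides, not the paper's exact code.

```python
from math import comb

def one_class_scores(entity_tags, query):
    """Score candidates under the one-class probabilistic model sketch."""
    # ε(t_k): all entities carrying tag t_k
    eps = {}
    for e, tags in entity_tags.items():
        for t in tags:
            eps.setdefault(t, set()).add(e)

    # P(X_Q | t_k) = 1 / C(|ε(t_k)|, |X_Q|) if X_Q ⊆ ε(t_k), else 0.
    # Popular tags (large ε) get low values, countering popularity bias.
    p_q = {t: 1 / comb(len(ents), len(query))
           for t, ents in eps.items() if query <= ents}

    # Sum over every tag of a candidate that could have produced the query.
    return {e: sum(p_q.get(t, 0.0) for t in tags)
            for e, tags in entity_tags.items() if e not in query}
```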

19 Cont. Now we deal with the missing tag effect. ε^c(t_k): the entities that are relevant to the tag t_k but whose tag-entity relation is missing (e.g. ε^c(t_8: Object) = {x_1: Beijing}). m_k: the number of query entities missing the tag t_k. u: the number of all entities in the dataset. Since the true ratio of missing tags is unknown, the paper assumes that 50% of the tag-entity relations are missing.
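The slide defines ε^c(t_k), m_k, and u but its formula is not in the transcript, so the following is only a hypothetical sketch of how those quantities could combine: a tag observed on n entities is assumed (per the 50% missing-rate assumption) to truly cover about 2n entities, and each query entity lacking t_k is treated as a uniform draw from the unobserved remainder.

```python
from math import comb

def p_query_with_missing(tag, entity_tags, query, u, miss_ratio=0.5):
    """Hypothetical P(X_Q | t_k) that tolerates missing tag-entity
    relations; a sketch under the 50%-missing assumption, not the
    paper's exact formula.  u: number of all entities in the dataset."""
    n = sum(1 for tags in entity_tags.values() if tag in tags)   # |ε(t_k)|
    m_k = sum(1 for q in query if tag not in entity_tags[q])

    est_true = round(n / (1 - miss_ratio))   # estimated true popularity (2n at 50%)
    hidden = est_true - n                    # estimated size of ε^c(t_k)
    untagged = u - n                         # entities not observed with t_k

    if est_true < len(query) or hidden < m_k or (m_k > 0 and untagged == 0):
        return 0.0

    # Chance each untagged query entity secretly belongs to ε^c(t_k),
    # times the uniform-selection probability over the estimated true set.
    p_hidden = (hidden / untagged) ** m_k if m_k else 1.0
    return p_hidden / comb(est_true, len(query))
```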

20 Experiment To evaluate the ranking algorithms, we built a search engine and measured how well users perceive the new ranking results. We downloaded a Wikipedia dataset and created a search interface on top of it to collect user surveys.

21 Effectiveness evaluation This paper collected 600 valid questionnaires from 69 students at UCLA to create a benchmark for evaluating user satisfaction, and computed a satisfaction score from the responses.


23 Comparison with Google Sets Query: {Beijing, Atlanta} → Olympics!

24 Conclusion This paper introduced three approaches, built a search engine on top of them, and created a benchmark for evaluation. It explains two important challenges of utilizing tag information, popularity bias and the missing tag effect, and shows how to overcome them. The framework not only finds similar items but also demonstrates the potential of social tag information. It shows that the task can be completed by providing a query consisting of entities and using only tag information, even though that information is uncontrolled and noisy.

