Improved search for Socially Annotated Data Authors: Nikos Sarkas, Gautam Das, Nick Koudas Presented by: Amanda Cohen Mostafavi.

Introduction
Social annotation: a process in which users collaboratively assign short sequences of keywords (tags) to resources
▫Each tag sequence is a concise and accurate summary of the resource’s content
▫Meant to aid navigation through a collection
Leads to searching via tags
▫Enables relevant text retrieval
▫Allows accurate retrieval of non-textual objects
▫Presents a need for an efficient retrieval and ranking method based on user tags

RadING
Ranking annotated data using Interpolated N-Grams
Searching and ranking method based exclusively on user tags
Uses interpolated n-grams to model the tag sequences associated with every resource
How does it rank?

Probabilistic Foundations
Goal: to rank resources by the probability that they are relevant to the query
Given a keyword query Q and a resource R from the collection, we apply Bayes' theorem to get:
\[ p(R \text{ is relevant} \mid Q) = \frac{p(Q \mid R \text{ is relevant})\; p(R \text{ is relevant})}{p(Q)} \]
where p(R is relevant) is the probability that R is relevant, independent of the query posed, and p(Q) is the probability of the query being issued

Probabilistic Foundations
p(R is relevant) is constant throughout the resource collection, and p(Q) is the same for every resource
▫Meaning: ranking resources by p(R is relevant | Q) is equivalent to ranking by p(Q | R is relevant)
To estimate the probability of the query being “generated” by each resource, resources need to be modeled based on knowledge of the social annotation process
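The equivalence can be written out in one line. Taking the slide's statement that p(R is relevant) and p(Q) are the same for every resource, both posteriors are the likelihoods scaled by the same positive constant, so for any two resources \(R_i\) and \(R_j\) (the indices are introduced here only for illustration):

```latex
\[
p(R_i \text{ is relevant} \mid Q) \ge p(R_j \text{ is relevant} \mid Q)
\iff
p(Q \mid R_i \text{ is relevant}) \ge p(Q \mid R_j \text{ is relevant})
\]
```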

Dynamics and Properties of the Social Annotation Process
The goal of the tagging process is to describe the resource’s content
User opinions crystallize quickly: annotation trends can be found after witnessing a small number of assignments
Therefore we assume the following:
▫p(Q | R is relevant) = p(Q is used to tag R)
▫In plain English: users draw keyword sequences from the same distribution both to tag and to search for a resource

Social Annotation Process: Things to consider…
Resources are rarely given assignments with only one tag
Tag positions are not random: they progress from left to right, from more general to more specific
Tags representing different perspectives on a resource are less likely to occur together in the same assignment
N-gram models are used to model these co-occurrence patterns

N-gram Models
Given an assignment made up of a sequence s of l tags \( t_1, \ldots, t_l \), the probability of this sequence being assigned to a resource is:
▫\( p(t_1, \ldots, t_l) = p(t_1)\, p(t_2 \mid t_1) \cdots p(t_l \mid t_1, \ldots, t_{l-1}) \)
The purpose of using n-gram models is to approximate the probability of each tag given its predecessors using only the last n-1 tags
▫In the case of a bi-gram model, \( p(t_k \mid t_1, \ldots, t_{k-1}) \) is approximated by \( p(t_k \mid t_{k-1}) \)
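The bi-gram approximation of the chain rule can be sketched in a few lines of Python. The function name `sequence_probability` and the dictionary-based probability tables are assumptions made for illustration; they are not taken from the presentation.

```python
from typing import Dict, Sequence, Tuple


def sequence_probability(
    tags: Sequence[str],
    unigram: Dict[str, float],
    bigram: Dict[Tuple[str, str], float],
) -> float:
    """Probability of a tag sequence under a bi-gram model.

    p(t_1, ..., t_l) is approximated by p(t_1) * prod_k p(t_k | t_{k-1}).
    Tags or pairs missing from the tables are treated as probability 0.
    """
    if not tags:
        return 1.0  # the empty sequence is the empty product
    prob = unigram.get(tags[0], 0.0)
    for prev, cur in zip(tags, tags[1:]):
        prob *= bigram.get((prev, cur), 0.0)
    return prob


# Hypothetical probability tables for a single resource.
unigram = {"python": 0.5, "tutorial": 0.25}
bigram = {("python", "tutorial"): 0.5}
print(sequence_probability(["python", "tutorial"], unigram, bigram))  # 0.25
```

Without smoothing, any unseen pair drives the whole product to zero, which is the sparsity problem that interpolation is meant to address.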

N-gram Models
Calculate the bi-gram probability using the maximum likelihood estimate:
\[ p(t_2 \mid t_1) = \frac{c(t_1, t_2)}{\sum_{t} c(t_1, t)} \]
▫\( c(t_1, t_2) \) = the number of occurrences of the bi-gram \( t_1 t_2 \)
▫The summation is over the occurrences of all bi-grams with \( t_1 \) as the first tag
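As a rough illustration of the maximum likelihood estimate, the sketch below counts bi-grams in a list of tag assignments and divides each count by the total count of bi-grams sharing the same first tag. The function name `train_bigram_mle` and the sample assignments are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def train_bigram_mle(
    assignments: List[Sequence[str]],
) -> Dict[Tuple[str, str], float]:
    """Maximum likelihood bi-gram probabilities for one resource.

    p(t2 | t1) = c(t1, t2) / sum over t of c(t1, t)
    """
    pair_counts: Counter = Counter()
    first_counts: Counter = Counter()
    for tags in assignments:
        for prev, cur in zip(tags, tags[1:]):
            pair_counts[(prev, cur)] += 1
            first_counts[prev] += 1  # all bi-grams that start with prev
    return {
        pair: count / first_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Hypothetical assignments attached to a single resource.
assignments = [
    ["python", "tutorial"],
    ["python", "programming"],
    ["python", "tutorial", "beginner"],
]
model = train_bigram_mle(assignments)
print(model[("python", "tutorial")])  # 2 of the 3 bi-grams starting with "python"
```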

Interpolation
Interpolation is used to compensate for sparse data: it redistributes probability mass from high counts to low counts
The Jelinek-Mercer interpolation technique is used. Applied to a bi-gram, it yields:
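A minimal sketch of Jelinek-Mercer interpolation for a bi-gram, assuming the common form that mixes the unigram and bi-gram maximum likelihood estimates with weights λ1 and λ2, where both are non-negative and λ1 + λ2 ≤ 1. The exact formulation used by RadING may differ (for example, it may include a further background term); the function name and tables are hypothetical.

```python
from typing import Dict, Tuple


def interpolated_bigram(
    prev: str,
    cur: str,
    unigram: Dict[str, float],
    bigram: Dict[Tuple[str, str], float],
    lam1: float,
    lam2: float,
) -> float:
    """Assumed form: lam1 * p_ML(cur) + lam2 * p_ML(cur | prev)."""
    if lam1 < 0 or lam2 < 0 or lam1 + lam2 > 1 + 1e-12:
        raise ValueError("weights must be non-negative and sum to at most 1")
    p_uni = unigram.get(cur, 0.0)
    p_bi = bigram.get((prev, cur), 0.0)
    return lam1 * p_uni + lam2 * p_bi


# The pair ("python", "tutorial") was never observed, yet it still receives
# probability mass through the unigram term.
unigram_ml = {"tutorial": 0.2}
bigram_ml: Dict[Tuple[str, str], float] = {}
smoothed = interpolated_bigram("python", "tutorial", unigram_ml, bigram_ml, 0.3, 0.7)
print(smoothed > 0.0)  # True
```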

Parameter Optimization
Goal: to maximize the likelihood function L(λ1, λ2) in order to find the ideal interpolation parameters
Definitions:
▫D*: the constrained domain of λ1 and λ2
▫λ*: the point at which L(λ1, λ2) attains its global maximum
▫λc: the point at which L(λ1, λ2) attains its maximum value within D*; this is the point that must be found to optimize the parameters

RadING Optimization Framework
Step 1: If L(λ1, λ2) is unbounded, perform 1D optimization to locate λc
Step 2: If L(λ1, λ2) is bounded, apply 2D optimization to find λ*
Step 3: If λ* is not in D*, locate λc
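To give a feel for what a 1D optimization of the interpolation weights can look like, here is a sketch that is not the RadING procedure itself. It assumes the search has been restricted to the line λ1 + λ2 = 1, so a single weight `lam` (the bi-gram weight) remains, and it maximizes a held-out log-likelihood with a golden-section search. The data, function names, and the choice of golden-section search are all assumptions.

```python
import math
from typing import Callable, List, Tuple


def log_likelihood(lam: float, events: List[Tuple[float, float]]) -> float:
    """Log-likelihood of held-out events under (1 - lam) * p_uni + lam * p_bi.

    Each event is a pair (p_uni, p_bi) of maximum likelihood estimates.
    """
    return sum(math.log((1.0 - lam) * pu + lam * pb) for pu, pb in events)


def golden_section_max(
    f: Callable[[float], float], lo: float, hi: float, tol: float = 1e-6
) -> float:
    """Locate the maximizer of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:  # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
        else:  # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
    return (a + b) / 2.0


# Hypothetical held-out events: (unigram estimate, bi-gram estimate).
events = [(0.2, 0.8), (0.5, 0.1)]
best_lam = golden_section_max(lambda lam: log_likelihood(lam, events), 0.0, 1.0)
print(round(best_lam, 3))
```

The log-likelihood of a mixture is concave in the mixing weight, which is why a search that assumes a single peak is adequate in this restricted setting.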

Searching Process
Step 1: Train a bi-gram model for each resource
▫Compute the bi-gram and unigram probabilities and optimize the interpolation parameters
Step 2: At query time, compute the probability of the query keyword sequence being generated by each resource’s bi-gram model
Use the Threshold Algorithm to compute the top-k results
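The top-k step can be illustrated with a compact sketch of the Threshold Algorithm. It assumes that each query bi-gram contributes one score list mapping every resource to a log-probability, and that a resource's overall score is the sum of its scores (a monotone aggregate, as the algorithm requires). How RadING builds and stores its lists is not covered by the slide, so the data layout and the name `threshold_top_k` are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def threshold_top_k(
    score_lists: List[Dict[str, float]], k: int
) -> List[Tuple[str, float]]:
    """Top-k resources by summed score, stopping early when possible.

    Assumes every list contains a score for every resource.
    """
    if not score_lists:
        raise ValueError("at least one score list is required")
    # Sorted access: each list ordered from highest to lowest score.
    sorted_lists = [
        sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        for scores in score_lists
    ]
    seen: Dict[str, float] = {}
    for depth in range(len(sorted_lists[0])):
        last_seen = []
        for ordered in sorted_lists:
            resource, score = ordered[depth]
            last_seen.append(score)
            if resource not in seen:
                # Random access: fetch this resource's score in every list.
                seen[resource] = sum(s[resource] for s in score_lists)
        # No unseen resource can score higher than this threshold.
        threshold = sum(last_seen)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Hypothetical log-probabilities of two query bi-grams under three resources.
list_a = {"r1": -0.1, "r2": -0.5, "r3": -2.0}
list_b = {"r1": -0.3, "r2": -0.2, "r3": -3.0}
result = threshold_top_k([list_a, list_b], k=1)
print(result[0][0])  # r1
```

In this example the loop stops after two rounds of sorted access, so r3 is never scored. Recomputing the top k in every round keeps the sketch short; a real implementation would maintain a bounded heap instead.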

Searching Example

Experimental Evaluation
Test data: web crawl of del.icio.us
▫70,658,851 assignments
▫Posted by 567,539 users
▫Attached to 24,245,248 unique URLs
▫Average length of assignment: 2.77
▫Standard deviation: 2.70
▫Median: 2

Optimization Efficiency

Ranking Effectiveness
Compares the RadING ranking method to adaptations of tf/idf ranking
▫Tf/Idf: concatenates a resource’s assignments into a document and ranks resources based on the tf/idf similarity to each document
▫Tf/Idf+: computes the tf/idf similarity of each individual assignment and ranks resources based on average similarity
10 judges contacted through Amazon Mechanical Turk to measure precision

Ranking Effectiveness
