
Vanity Fair: Privacy in Querylog Bundles Rosie Jones, Ravi Kumar, Bo Pang, Andrew Tomkins.


1 Vanity Fair: Privacy in Querylog Bundles Rosie Jones, Ravi Kumar, Bo Pang, Andrew Tomkins

2 Outline
-- Introduction
-- Data
-- Vanity Queries
-- Structural Vulnerabilities
-- Analytical Vulnerabilities
-- Conclusions

3 Introduction
Background: A proposed approach to addressing privacy concerns in storing and publishing web search query logs is to bundle the logs of multiple users together.
Contribution: Investigating possible privacy leaks even when query logs are bundled.
Implications for privacy:
-- If a name/UID is known, the actual bundle can be found via vanity queries. (Vanity queries: a web phenomenon in which a user issues a query for his or her own name.)
-- If the bundle is known, the specific user session might be teased out because of structural and analytical weaknesses.

4 Data
Querylog Q1:
➔ Queries from users issuing >= 100 queries over a 70-day period in 2006
➔ Each query carries the cookie information of the browser issuing it, from which a UID can be reliably extracted
➔ These logs, together with user profiles, are used for analyzing vanity queries
Querylog Q2 and bundles of users:
➔ Queries chosen from April 9-15, 2007; each query has the terms, the browser cookie, the originating IP address, and the time stamp
➔ Query logs (a subset of Q2 containing 100M queries) are bundled by masking the low-order bits of the IP address

5 Vanity Queries
Vanity queries. Why?
-- To check the ranking of the user's own home page, blog, or other content
-- To discover whether other parties have mentioned the user
Geographic information.
-- Users submit queries that contain geographic information
-- Such queries often reference geographic locations that are proximate to the user (e.g., "pizza 95014")

6 Vanity Queries Cont.
Extracting names from profile
-- On many websites, each userid has a corresponding public profile page accessible to everyone. K userids were crawled; the real name, if present, was extracted and parsed to obtain a firstname-lastname pair. First and last names are not distinguished. This method can yield anywhere from zero to many names from a single userid.

7 Vanity Queries Cont.
Extracting names from userid
-- A list L of 91,909 names was obtained from US Census Bureau data.
-- Characters are converted to lower case, and non-alphabetic characters are removed.
-- Given a userid x1, ..., xn (each xi an alphanumeric character), we construct an (n+1)-node graph G whose nodes correspond to the characters x1, ..., xn plus a final state xn+1. We place a directed edge (xi, xj+1) in the graph if one of the following conditions holds: (1) the substring xi ... xj is a valid name according to the list L; or (2) j = i, i.e., we allow single-letter names, to account for potential initials.
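The segmentation above can be sketched as a walk over the graph's edges: from each position we follow every edge whose substring is either a valid name or a single letter. This is a minimal sketch, not the paper's implementation; the tiny name set passed in below stands in for the Census list L.

```python
def segment_userid(userid, names):
    """Segment a userid into candidate name pieces using the (n+1)-node
    graph from the slide: an edge (i, j+1) exists when userid[i:j+1] is
    a valid name, or when it is a single letter (a potential initial).
    Returns all full segmentations as tuples of pieces."""
    # preprocessing step from the slide: lower-case, alphabetic only
    s = "".join(c for c in userid.lower() if c.isalpha())
    n = len(s)
    results = []

    def walk(i, path):
        if i == n:                      # reached the final state x_{n+1}
            results.append(tuple(path))
            return
        for j in range(i + 1, n + 1):
            piece = s[i:j]
            if piece in names or len(piece) == 1:
                walk(j, path + [piece])

    walk(0, [])
    return results
```

For example, `segment_userid("jsmith", {"smith", "john"})` includes the segmentation `("j", "smith")`, alongside the all-initials segmentation that the single-letter rule permits.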

8 Vanity Queries Cont.
Coverage performance of both schemes. For a userid u:
-- P(u): the set of names obtained using the profile page
-- S(u): the set of names obtained using the userid segmentation
-- P_k: returns the set P(u) of names, as long as there are at least k elements in P(u)
-- S_k: returns the set S(u) of names, as long as there are at least k elements in S(u)
-- (P ∨ S)_k: returns the set P(u) ∪ S(u) of names, as long as either P(u) or S(u) has at least k elements
-- (P ∧ S)_k: returns the set P(u) ∪ S(u) of names, as long as both P(u) and S(u) have at least k elements
-- (P ∩ S)_k: returns the set P(u) ∩ S(u) of names, as long as the intersection P(u) ∩ S(u) has at least k elements
The userid contains at least one name for almost 90% of users, and two names for over 40%. If we combine information, we find that two names can be extracted from the combination of profile page and userid in almost 50% of cases.
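The five combination schemes above are simple set operations gated by a size threshold; a minimal sketch (names and return convention are my own, returning None when a scheme's size condition fails):

```python
def coverage(P, S, k):
    """Apply the five name-combination schemes from the slide to the
    profile-derived set P and segmentation-derived set S for one userid.
    Returns a dict mapping scheme name to the extracted name set, or
    None when the scheme's size condition fails."""
    return {
        "P_k":   P if len(P) >= k else None,
        "S_k":   S if len(S) >= k else None,
        "or_k":  (P | S) if (len(P) >= k or len(S) >= k) else None,
        "and_k": (P | S) if (len(P) >= k and len(S) >= k) else None,
        "cap_k": (P & S) if len(P & S) >= k else None,
    }
```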

9 Vanity Queries Cont.
Name Vanity Queries
-- Use the name-extraction methods to get potential names for each userid.
-- Use a name recognizer built from US Census Bureau data to identify query terms that are people's names.
-- Count the number of name queries in which any one extracted name is present (the any-match condition), and the number of name queries in which every extracted name is present (the all-match condition).
5% of users for whom we extracted two names from the profile page issued a query containing those two names. Over 10% of users for whom we extracted the same two names, using both the profile and a parse of the userid, issued a query containing those two names.
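The any-match and all-match counts can be sketched in a few lines; this is an illustrative version that assumes substring matching against whitespace-delimited query strings, which the paper does not specify:

```python
def match_counts(name_queries, extracted):
    """Count name queries hitting the extracted names for one user:
    any-match counts queries containing at least one extracted name,
    all-match counts queries containing every extracted name."""
    any_m = sum(any(name in q for name in extracted) for q in name_queries)
    all_m = sum(all(name in q for name in extracted) for q in name_queries)
    return any_m, all_m
```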

10 Vanity Queries Cont.
Name Vanity Queries -- Recovering the name from a session
-- Given a particular query log session, the attacker attempts to discover which user generated the session.
-- Assuming users search for their own names, what is the rank of a user's name when placed among all name queries the user issues?
-- Use tf*idf to down-weight popular and celebrity names.
90% of the users query their own name within the top ten of all name queries they issue.
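The tf*idf ranking idea can be sketched as follows: treat each session as a document of name queries, weight each name by its frequency in the session times an inverse session frequency, and rank the user's own name. This is a hedged sketch under my own weighting convention (log of sessions over session frequency), not the paper's exact formula:

```python
import math
from collections import Counter

def rank_own_name(sessions, own_names):
    """Rank each user's extracted name among the name queries in their
    session, weighting by tf*idf so that names common across many
    sessions (celebrities) are down-weighted. `sessions` maps user ->
    list of name queries; `own_names` maps user -> that user's name.
    Returns user -> rank (1 = top) or None if the name never appears."""
    n_users = len(sessions)
    df = Counter()                       # in how many sessions each name appears
    for queries in sessions.values():
        for name in set(queries):
            df[name] += 1
    ranks = {}
    for user, queries in sessions.items():
        tf = Counter(queries)
        scored = sorted(tf, reverse=True,
                        key=lambda nm: tf[nm] * math.log(n_users / df[nm]))
        name = own_names[user]
        ranks[user] = scored.index(name) + 1 if name in scored else None
    return ranks
```

A name queried in every session gets idf = 0, so even a rarely issued self-query outranks a frequently issued celebrity query.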

11 Vanity Queries Cont.
Name Vanity Queries -- Recovering sessions for a user
-- We assume the attacker would like to discover the session of a particular known person, perhaps a neighbor, co-worker, or even a spouse.
-- Assuming users search for their own names: given a name, what is the rank of the true user among all users who queried for that name?
Thus, given a large collection of query bundles, for those users who have issued name vanity queries, it is possible to narrow down to a few candidate bundles by examining the name vanity queries.

12 Vanity Queries Cont.
Geographic vanity queries
-- Some users issue queries that reference their own geographic location.
-- Use the profile data to obtain the geographic location of each user. A set of 744K users was chosen (every profile contains a US postal code).
-- For each query, we apply a black-box geographic classifier based on the internet locality product WhereOnEarth (WOE). Given a query, WOE determines if the query has a locational component and, if so, outputs a list of locations at the best-guessed granularity (e.g., city, county, state, country) along with a confidence. We run the WOE classifier on each query and store the top postal code, along with the confidence, for each query.

13 Vanity Queries Cont.
Geographic vanity queries
-- A query is a γ-geovanity query if the confidence output by WOE is >= γ and the location output by WOE matches the user's location in the profile.
-- We consider three levels of postal code matching: the first three digits, the first four digits, and all five digits.
-- We consider a user to have queried for a location only if the user has issued >= k queries for that location at the given postal code matching granularity.
-- We compute the fraction of users who issue geovanity queries: with γ = 0.5, k = 10, and a 5-digit match, at least 1% of users issue queries that reveal their location.
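The γ-geovanity test and the fraction computed above can be sketched directly from the definitions. WOE is a black box here, so the sketch assumes its output has already been reduced to a (confidence, postal code) pair per query:

```python
def is_geovanity(conf, query_zip, profile_zip, gamma, digits):
    """A query is a γ-geovanity query if the classifier confidence is at
    least γ and its top postal code matches the user's profile postal
    code on the first `digits` digits (3, 4, or 5)."""
    return conf >= gamma and query_zip[:digits] == profile_zip[:digits]

def geovanity_fraction(users, gamma=0.5, k=10, digits=5):
    """Fraction of users who issued at least k γ-geovanity queries at
    the given matching granularity. `users` maps user -> (profile_zip,
    [(conf, zip), ...]) pairs from the geo classifier."""
    hits = 0
    for profile_zip, queries in users.values():
        n = sum(is_geovanity(c, z, profile_zip, gamma, digits)
                for c, z in queries)
        if n >= k:
            hits += 1
    return hits / len(users)
```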

14 Structural vulnerabilities
-- There is a single dominant user who issues a very large number of searches relative to an average user.
-- There are users who query for geographical locations, especially locations that are unique.

15 Structural vulnerabilities Cont.
Users issuing large numbers of queries.
-- The bundles are almost ineffective in this case.
-- We compute the number of users per bundle who issue at least an f-fraction of all queries within the bundle, for f = 0.1, ....
-- Some bundles have a user that issued at least half of the queries in the bundle, and 0.6% of the bundles have a user who issued at least 90% of the queries in the bundle.
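The per-bundle computation above is a one-pass count; a minimal sketch, with the bundle represented as a list of (user, query) pairs (my own representation):

```python
from collections import Counter

def dominant_users(bundle_queries, f):
    """Return the users who issued at least an f-fraction of all
    queries in a bundle; `bundle_queries` is a list of (user, query)
    pairs. A non-empty result for large f marks a structurally
    vulnerable bundle."""
    counts = Counter(user for user, _ in bundle_queries)
    total = len(bundle_queries)
    return {user for user, c in counts.items() if c / total >= f}
```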

16 Structural vulnerabilities Cont.
Users issuing geographic queries.
-- Applying the geographic-information recognition techniques described before.
-- We now characterize the extent to which a particular user in a bundle might be the unique user in the bundle querying for a specific geographic location.
-- We define a heavy user to be one who issues at least the average number of queries in a bundle.
75% of the postal code (zip) locations in a bundle are associated with only one user. A significant portion of the heavy users (9.9%) are associated with only one postal code. 0.3% of heavy users are associated with only one zip code, and this postal code has exactly one user associated with it. It is easy to see that if a user has a unique postal code and also queries for sensitive material such as HIV clinics or adult services in their area, privacy will be easily compromised.
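Finding the uniquely identifying locations in a bundle is a grouping exercise; a minimal sketch under the same (user, postal code) pair representation assumed earlier:

```python
from collections import defaultdict

def unique_location_users(bundle):
    """Given a bundle as a list of (user, postal_code) geo queries,
    return the users who are the only user in the bundle querying for
    some postal code -- the vulnerable case from the slide."""
    by_zip = defaultdict(set)
    for user, zipcode in bundle:
        by_zip[zipcode].add(user)
    return {next(iter(users)) for users in by_zip.values()
            if len(users) == 1}
```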

17 Analytical vulnerabilities
-- Given a bundle of fibers corresponding to different users, how do we extract the individual fibers from this bundle? This is a clustering problem.
-- We describe various measures used to evaluate the performance of algorithms that find fibers in bundles.
-- We describe our main method and analyze its performance on querylog Q2.
-- We then describe two augmentations to the basic method for which we have preliminary results: modeling topic switches, and modeling using a decision-tree classifier.

18 Analytical vulnerabilities Cont.
Evaluation measures. X = {x1, ..., xn} is a ground set, and C = {C1, ..., Ck} and D = {D1, ..., Dl} are two clusterings of X. Let C(x) denote the cluster in C to which x belongs, and let D(x) denote the cluster in D to which x belongs.
-- F-measure: prec(D|C) = E_{x ∈ X}[prec(x, D|C)]. The f-measure is the harmonic mean of prec(D|C) and prec(C|D). The closer the f-measure is to 1, the better the agreement between C and D.
-- Meila measure: H(C) + H(D) − 2I(C, D), where H(·) is the entropy and I(·, ·) is the mutual information. The closer the Meila measure is to zero, the better the agreement between C and D.
-- Rand index: captures the number of pairwise disagreements between the clusterings C and D. The closer this measure is to 0, the better the agreement between C and D.
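The Meila measure and the pairwise-disagreement count can both be computed directly from cluster labels; a minimal sketch, representing each clustering as a list of labels over the same ground set:

```python
import math
from collections import Counter
from itertools import combinations

def variation_of_information(C, D):
    """Meila's measure H(C) + H(D) - 2 I(C, D) between two clusterings,
    each given as a list of cluster labels over the same ground set.
    0 means identical clusterings; larger means more disagreement."""
    n = len(C)
    pc, pd = Counter(C), Counter(D)
    joint = Counter(zip(C, D))
    def h(p):                         # entropy of a cluster-size distribution
        return -sum((v / n) * math.log(v / n) for v in p.values())
    i = sum((v / n) * math.log((v / n) / ((pc[c] / n) * (pd[d] / n)))
            for (c, d), v in joint.items())
    return h(pc) + h(pd) - 2 * i

def pairwise_disagreements(C, D):
    """Rand-style disagreement count: pairs placed together in one
    clustering but apart in the other. 0 means perfect agreement."""
    return sum((C[i] == C[j]) != (D[i] == D[j])
               for i, j in combinations(range(len(C)), 2))
```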

19 Analytical vulnerabilities Cont.
Main method: construct a graph based on similarity between queries, and then cluster this graph to obtain the fibers.
Given the canonicalized queries q1, ..., qn in a bundle, let G be an n-node graph where each node corresponds to a query. We add an edge between queries qi and qj in G if one of the following two conditions holds:
(1) The geo location of qi is the same as that of qj. These edges are called g-edges or geo edges.
(2) qi and qj share at least one term. These edges are called w-edges or word-overlap edges.
Note that both these operations can be performed efficiently. For instance, to do (2), we create an inverted index of all the terms in all the queries and add an edge between qi and qj if they appear together in a posting list of the index.
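The graph construction, including the inverted-index trick for w-edges, can be sketched as follows (input representation is my own: a list of term-lists plus a query-index-to-location map):

```python
from collections import defaultdict
from itertools import combinations

def build_query_graph(queries, geo):
    """Build the similarity graph from the slide: nodes are query ids;
    a w-edge joins two queries sharing a term (found via an inverted
    index over terms), and a g-edge joins two queries with the same
    geo location. `queries` is a list of term-lists; `geo` maps query
    id -> location or None. Returns a set of undirected (i, j) edges."""
    index = defaultdict(list)            # term -> posting list of query ids
    for qid, terms in enumerate(queries):
        for term in set(terms):
            index[term].append(qid)
    edges = set()
    for posting in index.values():       # w-edges: co-occurrence in a posting list
        for i, j in combinations(posting, 2):
            edges.add((i, j))
    by_loc = defaultdict(list)
    for qid, loc in geo.items():
        if loc is not None:
            by_loc[loc].append(qid)
    for posting in by_loc.values():      # g-edges: shared geo location
        for i, j in combinations(posting, 2):
            edges.add((i, j))
    return edges
```

A standard graph-clustering step would then run over these edges to propose fibers; that step is omitted here.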

20 Analytical vulnerabilities Cont.
Main method -- Performance results. We consider two simple baselines:
1. BL1: each query in a bundle is assigned its own fiber (this baseline should perform well when there are many small fibers in a bundle and no dominant users).
2. BL2: all the queries in a bundle are placed in a single fiber (this baseline should perform well when there is a dominant user in the bundle).
The addition of g-edges yields a small improvement over BL1; adding both g- and w-edges yields much more dramatic improvements. In other words, word overlap in query terms is a very strong signal that two queries are from the same fiber.

21 Analytical vulnerabilities Cont.
Modeling topic switches
-- Model the queries via topics.
-- Use word co-occurrence patterns, adjusting for the probability that two words occurred together by chance.
-- We use the χ²-measure to discover highly related term pairs.
-- Add topic edges to the original graph G when two queries share common topics (term pairs with a high χ² value).
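The χ² statistic for a term pair comes from the standard 2x2 contingency table over queries; a minimal sketch (the slide does not give the exact formulation, so this uses the usual independence test):

```python
def chi_square_pair(n11, n1_, n_1, n):
    """2x2 chi-square statistic for a term pair: n11 = queries containing
    both terms, n1_ = queries with term A, n_1 = queries with term B,
    n = total queries. High values flag pairs that co-occur far more
    often than chance, which the slide uses to add topic edges."""
    n10 = n1_ - n11                 # A without B
    n01 = n_1 - n11                 # B without A
    n00 = n - n11 - n10 - n01       # neither term
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = n1_ * n_1 * (n - n1_) * (n - n_1)
    return num / den if den else 0.0
```

Two terms that always appear together (e.g., both in 10 of 100 queries and never apart) score n = 100, while independent terms score near 0.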

22 Analytical vulnerabilities Cont.
Classifier-based approach: J48, the Weka implementation of C4.5 [13].
For each bundle in the training data, the training labels consist of all pairs of queries in that bundle. The label is "same" if the two queries are from the same user, and "different" if they are from different users. The features used for prediction fall into several categories: (1) time, (2) word and character similarity, and (3) session rewrite probabilities.
We used 40 bundles for training, giving us 109,038 pairs of queries. We evaluated the accuracy of this classifier on 460 bundles of various sizes, containing 8,235,739 pairs of queries. Accuracy was 97.5%; precision (71.2%) is better than recall (29.9%). This may be because we are able to link within-session queries together reliably, but are not able to link queries from the same user across sessions with different topics.
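Feature extraction for a query pair might look like the sketch below. The paper used J48 in Weka with its own feature set; the specific features here (time gap, word Jaccard, character overlap) are illustrative stand-ins for the first two feature categories, and session rewrite probabilities are omitted:

```python
def pair_features(q1, q2, t1, t2):
    """Illustrative features for a query pair, loosely following the
    slide's categories: time, and word/character similarity. The
    feature dict would be fed to a decision-tree learner (J48/C4.5
    in the paper) labeled 'same' or 'different'."""
    w1, w2 = set(q1.split()), set(q2.split())
    word_jaccard = len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0
    char_overlap = len(set(q1) & set(q2)) / max(len(set(q1) | set(q2)), 1)
    return {"time_gap": abs(t1 - t2),
            "word_jaccard": word_jaccard,
            "char_overlap": char_overlap}
```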

23 Conclusions
We examined the privacy implications of bundling when it is used as a tool to enhance the privacy of users in querylogs:
-- Names or locations of users in a bundle are identifiable.
-- The bundle is identifiable given a user.
Structural vulnerabilities of bundling:
-- Volume of queries
-- Sensitive queries
Analytical vulnerabilities:
-- Decomposing a bundle into individual fibers
In summary, bundling is certainly a viable technique in the realm of user privacy; however, there are significant challenges to actually using it in practice.

