1 SPACE, WEB MINING, and PRIVACY: foes or friends? – with special attention to location privacy. Bettina Berendt, Dept. Computer Science, K.U. Leuven

2 SPACE WEBMINING PRIVACY

3 BASICS

4 SPACE WEBMINING PRIVACY

5 What is Web Mining? And who am I? Knowledge discovery (aka data mining): "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." Web mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources. Web mining areas: Web content mining, Web structure mining, Web usage mining (navigation, queries, content access & creation)

6 Why Web / data mining? "The database of intentions" (J. Battelle)

7 SPACE WEBMINING PRIVACY

8 Location-based services and augmented reality

9 Semiotically augmented reality: semapedia and related ideas

10 Mobile Social Web

11 SPACE WEBMINING PRIVACY

12 What's special about spatial information? 1. Interpreting. Rich inferences from spatial position to personal properties and/or identity are possible: Pos(A, 9-17) = P1 → workplace(A, P1); Pos(A, 20-6) = P2 → home(A, P2). An even richer "database of intentions"?! Pos(A, now) = P3 & temp(P3, now, hot) → wants(A, ice-cream) → (location-based services). Pos(A, t in 13-18) = Pos(Demonstration, 13-18) → suspicious(A) → (ex. Dresden phone surveillance case 2011)
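The two home/workplace inference rules above can be sketched in a few lines; this is a minimal illustration, with hour ranges taken from the slide and all data and helper names (`infer_places`, the observation pairs) invented:

```python
from collections import Counter

def infer_places(observations):
    """Apply the slide's rules: the most frequent position seen during
    working hours (9-17) is guessed to be the workplace, the most
    frequent position at night (20-6) the home."""
    work = Counter(p for h, p in observations if 9 <= h < 17)
    home = Counter(p for h, p in observations if h >= 20 or h < 6)
    return {
        "workplace": work.most_common(1)[0][0] if work else None,
        "home": home.most_common(1)[0][0] if home else None,
    }

# Hypothetical (hour_of_day, place) sightings of person A
observations = [
    (10, "P1"), (14, "P1"), (16, "P1"),   # daytime sightings
    (22, "P2"), (23, "P2"), (2, "P2"),    # night-time sightings
]
print(infer_places(observations))  # {'workplace': 'P1', 'home': 'P2'}
```

Even this trivial rule base shows why raw position traces are privacy-sensitive: two labelled places already sketch a daily routine.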

13 What's special about spatial information? 2. Sending, or: Opt-out impossible?! Physically: You cannot be nowhere Corollary: You cannot be in two places at once → limits on identity-building Contractually: Rental car with tracking,... Culturally I: Opt-out may preclude basics of identity construction No mobile phone/internet communication Culturally II: Opt-out considered suspicious in itself (ex. A. Holm surveillance case 2007)

14 FOES ?

15 SPACE WEBMINING PRIVACY

16 Behaviour on the Web (and elsewhere) → Data

17 (Web) data analysis and mining: Data → Privacy problems!

18 Technical background of the problem: the dataset allows for Web mining (e.g., which search queries lead to which site choices), yet it violates k-anonymity (e.g. "Lilburn" → a likely k = #inhabitants of Lilburn)
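A k-anonymity check of the kind implied here is easy to sketch. The records and column names below are invented for illustration (this is not the actual query-log dataset):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns.
    A dataset is k-anonymous iff every combination of quasi-identifier
    values is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical records: a rare town name drags k down to 1,
# i.e. that user is uniquely re-identifiable.
records = [
    {"town": "Atlanta", "age_band": "30-40"},
    {"town": "Atlanta", "age_band": "30-40"},
    {"town": "Lilburn", "age_band": "30-40"},
]
print(k_anonymity(records, ["town", "age_band"]))  # 1
```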

19 SPACE WEBMINING PRIVACY

20 Inferences Data mining / machine learning: inductive learning of models („knowledge“) from data Privacy-relevant (Re-)identification: inferences towards identity Profiling: inferences towards properties Application of the inferred knowledge

21 What is identity merging? Or: Is this the same person?

22 Data integration: an example. Paper published by the MovieLens team (collaborative-filtering movie ratings), who were considering publishing a ratings dataset (see the Frankowski et al. reference below). Public dataset: users mention films in forum posts. Private dataset (may be released e.g. for research purposes): users' ratings. Film IDs can easily be extracted from the posts. Observation: every user will talk about items from a sparse relation space (those – generally few – films s/he has seen). [Frankowski, D., Cosley, D., Sen, S., Terveen, L., & Riedl, J. (2006). You are what you say: Privacy risks of public mentions. In Proc. SIGIR'06] Generalisation with more robust de-anonymization attacks and different data: [Narayanan, A., & Shmatikov, V. (2009). De-anonymizing social networks. In Proc. 30th IEEE Symposium on Security and Privacy]

23 Merging identities – the computational problem. Given a target user t from the forum users, find similar users (in terms of which items they related to) in the ratings dataset. Rank these users u by their likelihood of being t. Evaluate: → if t is in the top k of this list, then t is k-identified; → count the percentage of users who are k-identified. E.g. measure likelihood by TF.IDF (m: item)
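A rough sketch of this k-identification procedure, using summed IDF of shared items as a stand-in for the slide's TF.IDF likelihood (the user IDs, film titles, and function name are made up):

```python
import math

def k_identification_rank(target_items, ratings_users):
    """Rank candidate users in the ratings dataset by the summed IDF of
    the items they share with the target's forum mentions; rare items
    are highly identifying. Returns user ids, best match first."""
    n = len(ratings_users)
    df = {}  # document frequency: in how many profiles each item occurs
    for items in ratings_users.values():
        for m in items:
            df[m] = df.get(m, 0) + 1
    def score(items):
        return sum(math.log(n / df[m]) for m in target_items & items)
    return sorted(ratings_users,
                  key=lambda u: score(ratings_users[u]), reverse=True)

ratings_users = {
    "u1": {"Titanic", "ObscureFilm"},
    "u2": {"Titanic", "Matrix"},
    "u3": {"Matrix"},
}
# The target mentioned a rarely-rated film in the forum,
# so u1 lands at the top of the list (1-identified).
print(k_identification_rank({"ObscureFilm", "Titanic"}, ratings_users)[0])
```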

24 Results

25 What do you think helps?

26 What is classification (and prediction)?

27 Predicting political affiliation from Facebook profile and link data (1): most conservative traits. All are group memberships (trait name = Group; weights omitted here). Trait values: george w bush is my homeboy; college republicans; texas conservatives; bears for bush; kerry is a fairy; aggie republicans; keep facebook clean; i voted for bush; protect marriage one man one woman. [Lindamood et al. 09 & Heatherly et al. 09]

28 Predicting political affiliation from Facebook profile and link data (2): most liberal traits per trait name (weights omitted here). activities: amnesty international; employer: hot topic; favorite tv shows: queer as folk; grad school: computer science; hometown: mumbai; relationship status: in an open relationship; religious views: agnostic; looking for: whatever i can get. [Lindamood et al. 09 & Heatherly et al. 09]

29 What is collaborative filtering? "People like what people like them like"

30 User-based Collaborative Filtering. Idea: people who agreed in the past are likely to agree again. To predict a user's opinion for an item, use the opinions of similar users. Similarity between users is decided by looking at their overlap in opinions for other items.

31 Example: User-based Collaborative Filtering

          Item 1   Item 2   Item 3   Item 4   Item 5
User 1      8        1        ?        2        7
User 2      2        ?        5        7        5
User 3-6    …        …        …        …        …

32 Similarity between users

          Item 1   Item 2   Item 3   Item 4   Item 5
User 1      8        1        ?        2        7
User 2      2        ?        5        7        5
…

How similar are users 1 and 2? How similar are users 1 and 4? How do you calculate similarity?

33 Popular similarity measures: cosine-based similarity; adjusted cosine-based similarity; correlation-based similarity
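Two of these measures can be sketched in plain Python, with ratings represented as item → rating dicts (the example users are invented, loosely echoing the table on the previous slides):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity computed over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(a[i] ** 2 for i in common)) *
           math.sqrt(sum(b[i] ** 2 for i in common)))
    return num / den if den else 0.0

def pearson_sim(a, b):
    """Correlation-based similarity: mean-centre each user's ratings
    first, which compensates for users who rate high or low overall."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = (math.sqrt(sum((a[i] - ma) ** 2 for i in common)) *
           math.sqrt(sum((b[i] - mb) ** 2 for i in common)))
    return num / den if den else 0.0

u1 = {"Item 1": 8, "Item 2": 1, "Item 4": 2, "Item 5": 7}
u2 = {"Item 1": 2, "Item 3": 5, "Item 4": 7, "Item 5": 5}
print(round(cosine_sim(u1, u2), 3))
```

Adjusted cosine (mean-centring per *item* rather than per user) follows the same pattern and is omitted for brevity.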

34 Algorithm 1: using the entire matrix. Aggregation function: often a weighted sum; the weight depends on similarity.

35 Algorithm 2: k-Nearest-Neighbour. Aggregation function: often a weighted sum; the weight depends on similarity. Neighbours are people who have historically had the same taste as our user.
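A minimal k-nearest-neighbour prediction in this spirit: the similarity-weighted sum described above, restricted to the k most similar users who rated the item. The cosine measure and the toy ratings are assumptions for illustration, not the slide's data:

```python
import math

def cosine(a, b):
    # similarity over co-rated items (ratings as item -> rating dicts)
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(a[i] ** 2 for i in common)) *
           math.sqrt(sum(b[i] ** 2 for i in common)))
    return num / den if den else 0.0

def predict(user, item, ratings, k=2):
    """Weighted-sum prediction from the k most similar neighbours
    who have rated `item`."""
    neigh = [(cosine(ratings[user], ratings[v]), ratings[v][item])
             for v in ratings if v != user and item in ratings[v]]
    top = sorted(neigh, reverse=True)[:k]
    wsum = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / wsum if wsum else None

ratings = {
    "User 1": {"I1": 8, "I2": 1, "I4": 2, "I5": 7},
    "User 2": {"I1": 2, "I3": 5, "I4": 7, "I5": 5},
    "User 3": {"I1": 7, "I2": 2, "I3": 6, "I5": 8},
}
# Predict User 1's missing rating for item I3 from Users 2 and 3
print(round(predict("User 1", "I3", ratings), 2))
```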

36 SPACE WEBMINING PRIVACY

37 Summary: lots of data → lots of privacy threats (and opportunities). The Web incites one of the semiotically richest (and often machine-processable) types of interaction; space incites data-rich types of interaction → two rich sources of "the database of intentions"

38 SPACE WEBMINING PRIVACY

39 How many people see an ad? Television: sample viewers, extrapolate to population Web: count viewers/clickers through clickstream City streets: count pedestrians / motorists? Too many streets! → Solution intuition: sample streets, predict

40 Fraunhofer IAIS (2007): predict frequencies based on similar streets Street segments modelled as vectors Spatial / geometric information Type of street, direction, speed class, … Demographic, socio-economic data about vicinity Nearby points of interest (buffer around segment, count #POI) KNN algorithm Frequency of a street segment = weighted sum of frequencies from most similar k segments in sample Dynamic + selective calculation of distance to counter the huge numbers of segments and measurements
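The core intuition of the approach above can be sketched as follows; this is an illustration only, not the Fraunhofer IAIS system, and the segment features, distance weighting, and frequency values are all invented:

```python
import math

def knn_frequency(segment, measured, k=2):
    """Predict the frequency of an unmeasured street segment as the
    similarity-weighted mean of the k most similar measured segments.
    `segment` is a feature vector; `measured` is a list of
    (feature_vector, observed_frequency) pairs."""
    neigh = sorted(measured, key=lambda m: math.dist(segment, m[0]))[:k]
    # closer segments (smaller distance) get larger weights
    weights = [1.0 / (1.0 + math.dist(segment, f)) for f, _ in neigh]
    return sum(w * freq for w, (_, freq) in zip(weights, neigh)) / sum(weights)

# toy feature vector: (speed class, #POIs nearby, population density index)
measured = [
    ((2, 10, 5), 1200.0),   # busy shopping street
    ((2, 12, 5), 1300.0),   # similar busy street
    ((5, 0, 1), 80.0),      # rural through road
]
print(round(knn_frequency((2, 11, 5), measured), 1))  # averages the two similar streets
```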

41 SPACE WEBMINING PRIVACY

42 IP filtering: a deterministic classification model IP → country

43 Where do people live who will buy the Koran soon? Technical background of the problem: a mashup of different data sources – Amazon wishlists, Yahoo! People (addresses), Google Maps – each with insufficient k-anonymity, allows for attribute matching and thereby inferences

44 Traffic prediction: space data + Web data + ... Multiple views on traffic: incident reports, weather, major events. [Figure: event store with learning and reasoning components; sample incident report: Operator ID: Nick, Heading: INCIDENT, Message: INCIDENT INFORMATION Cleared 1637: I-405 SB JS I-90 ACC BLK RL CCTV 1623 – WSP, FIR ON SCENE] E.g. LarKC project: I. Celino, D. Dell'Aglio, E. Della Valle, R. Grothmann, F. Steinke, & V. Tresp: Integrating Machine Learning in a Semantic Web Platform for Traffic Forecasting and Routing. IRMLeS 2011 Workshop at ESWC 2011.

45 PEACEFUL COEXISTENCE ?

46 Recall (a simple view): Cryptographic privacy solutions – Data... not all!

47 "Privacy-preserving data mining" – Data... not all!

48 Privacy-preserving data mining (PPDM). Database inference problem: "the problem that arises when confidential information can be derived from released data by unauthorized users". Objective of PPDM: "develop algorithms for modifying the original data in some way, so that the private data and private knowledge remain private even after the mining process." Approaches:
- Data distribution: decentralized holding of data
- Data modification: aggregation/merging into coarser categories; perturbation; blocking of attribute values; swapping values of individual records; sampling
- Data or rule hiding: push the support of sensitive patterns below a threshold
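The perturbation approach among these can be illustrated in a few lines: add zero-mean noise to each value before release, so that aggregates stay approximately correct while individual records no longer reveal their true values. The data (ages) and noise scale below are invented:

```python
import random

def perturb(values, noise_scale=1.0, rng=None):
    """Data modification by perturbation: add zero-mean Gaussian noise
    to each value before release."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    return [v + rng.gauss(0, noise_scale) for v in values]

original = [30, 45, 27, 52, 38] * 200   # hypothetical ages, n = 1000
released = perturb(original, noise_scale=5.0)

true_mean = sum(original) / len(original)
released_mean = sum(released) / len(released)
# the mean survives perturbation, individual values do not
print(abs(true_mean - released_mean) < 1.0)
```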

49 Example 1: Collaborative filtering

50 Collaborative filtering: idea and architecture. Basic idea of collaborative filtering: "Users who liked this also liked..." → generalize from "similar profiles". Standard solution: at the community site / centralized, compute from all users and their ratings/purchases etc. a global model; to derive a recommendation for a given user, find "similar profiles" in this model and derive a prediction. Mathematically: depends on simple vector computations in the user-item space.

51 Distributed data mining / secure multi-party computation: the principle explained by secure sum. Given a number of values x1, ..., xn belonging to n entities, compute Σ xi such that each entity knows ONLY its own input and the result of the computation (the aggregate sum of the data).
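A single-process sketch of the masking idea behind secure sum: party 1 adds a random mask to the running total, each party adds its own value modulo a public modulus, and party 1 finally subtracts the mask. In a real protocol the running total is passed around a ring of parties; here a loop simulates that ring:

```python
import random

def secure_sum(private_values, modulus=1_000_003, rng=None):
    """Ring-based secure sum sketch. Each party only ever sees a
    masked running total, never another party's raw input."""
    rng = rng or random.Random()
    mask = rng.randrange(modulus)   # known only to party 1
    running = mask
    for x in private_values:        # each party adds its private value
        running = (running + x) % modulus
    return (running - mask) % modulus   # party 1 removes the mask

print(secure_sum([17, 4, 29]))  # 50
```

Note the assumption of semi-honest parties: colluding neighbours in the ring can still bracket a party's input, which is why real protocols use multiple paths or secret sharing.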

52 Canny: Collaborative filtering with privacy. Each user starts with their own preference data, and knowledge of who their peers are in their community. By running the protocol, users exchange various encrypted messages. At the end of the protocol, every user has an unencrypted copy of the linear model Λ, ψ of the community's preferences. They can then use this to extrapolate their own ratings. At no stage does unencrypted information about a user's preferences leave their own machine. Users outside the community can request a copy of the model Λ, ψ from any community member and derive recommendations for themselves. Canny (2002), Proc. IEEE Symp. Security and Privacy; Proc. SIGIR

53 Ex. 2: Frequent itemset mining

54 Generating large k-itemsets with Apriori. Min. support = 40%. Step 1: candidate 1-itemsets – spaghetti: support = 3 (60%); tomato sauce: support = 3 (60%); bread: support = 4 (80%); butter: support = 1 (20%).

Transaction ID | Attributes (basket items)
1 | spaghetti, tomato sauce
2 | spaghetti, bread
3 | spaghetti, tomato sauce, bread
4 | bread, butter
5 | bread, tomato sauce
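The level-wise generation this slide begins can be sketched as a small Apriori implementation over the slide's five transactions (the function and variable names are mine):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: only itemsets all of whose (k-1)-subsets are
    frequent become candidates for level k (the pruning step)."""
    n = len(transactions)
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {s for s in items if support(s) >= min_support}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune: every (k-1)-subset must itself be frequent
        level = {c for c in candidates
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, k - 1))
                 and support(c) >= min_support}
    return frequent

transactions = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]
result = apriori(transactions, min_support=0.4)
print(result[frozenset({"spaghetti", "tomato sauce"})])  # 0.4
```

With min. support 40%, butter (support 20%) is pruned at level 1, so no 2-itemset containing butter is ever counted, exactly as the lattice slides that follow illustrate.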

55 [Itemset lattice:] 1-itemsets: spaghetti, tomato sauce, bread, butter. 2-itemsets: {spaghetti, tomato sauce}, {spaghetti, bread}, {spaghetti, butter}, {tomato sauce, bread}, {tomato sauce, butter}, {bread, butter}. 3-itemsets: {spaghetti, tomato sauce, bread}, {spaghetti, tomato sauce, butter}, {spaghetti, bread, butter}, {tomato sauce, bread, butter}. 4-itemset: {spaghetti, tomato sauce, bread, butter}.

56-58 [Animation steps over the same itemset lattice, progressively marking pruned itemsets]

59 How many people see an ad? Next steps... Not only ads, but personalized ads Ad sequences? → need to know trajectories Single trajectories: highly privacy-sensitive data Aggregate (e.g. frequent) trajectories also interesting for other applications – e.g., traffic planning

60 Privacy-preserving frequent-route mining by data coarsening: intuition

61 Ex.: Gidófalvi et al. (2007): Privacy-preserving data mining on moving object trajectories. Basic strategy: aggregation/merging into coarser categories, performed by the client. Anonymization rectangles R satisfying the (areasize, maxLocProb) constraints → bound the location probability inferable from R
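The client-side coarsening step can be sketched as mapping exact positions to grid-cell IDs before anything leaves the device. This is only the intuition: the cell size and trajectory below are invented, and the paper's anonymization rectangles (chosen to satisfy areasize / maxLocProb) are more elaborate than a fixed grid:

```python
def coarsen(trajectory, cell_size):
    """Replace each exact (x, y, t) point by the ID of the grid cell
    containing it, collapsing consecutive points in the same cell.
    The server then only learns cell-level positions; a larger
    cell_size lowers the inferable location probability."""
    cells = []
    for x, y, t in trajectory:
        cell = (int(x // cell_size), int(y // cell_size))
        if not cells or cells[-1][0] != cell:
            cells.append((cell, t))
    return cells

trajectory = [(12.0, 7.0, 0), (13.5, 7.2, 1), (21.0, 7.5, 2)]
print(coarsen(trajectory, cell_size=10.0))  # [((1, 0), 0), ((2, 0), 2)]
```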

62 Coarsened trajectories

63 Time-interval probabilistically frequent route queries. Split trajectories inside the query time interval into m sub-trajectories of equal time length → a trajectory = a set/sequence of spatio-temporal grid cell IDs, each associated with a location probability = a transaction of items (X, P). A transaction p-satisfies an itemset Y if Y ⊆ X and for every item i in Y: i.prob >= min_prob. p-support of an item(set): the number of transactions that p-satisfy the item(set). Frequent routes := maximal p-frequent itemsets, found with a frequent-itemset miner (routes can be discontinuous). Extension to frequent sequence mining?!

64 Outlook: Privacy-preserving data publishing (PPDP). In contrast to the general assumptions of PPDM, arbitrary mining methods may be performed after publishing → need adversary models. Objective: "access to published data should not enable the attacker to learn anything extra about any target victim compared to no access to the database, even with the presence of any attacker's background knowledge obtained from other sources" (this needs to be relaxed by assumptions about the background knowledge). A comprehensive current survey: Fung et al., ACM Computing Surveys 2010

65 Problem solved?

66 No... How do people like/buy books? What do our Web server logs tell us about viewing behaviour? How can we combine Web server and transaction logs? Which data noise do we have to remove from our logs? Which of these association rules are frequent/confident enough? Should we show the recommendations at the top or bottom of the page? Only to registered customers? What if someone bought a book as a present for their father?

67 FRIENDS ?

68 From: SPACE & WEB MINING against PRIVACY ...

69 ... to: SPACE & WEB MINING for PRIVACY

70 Why Web / data mining? Who is doing the learning?

71 Privacy as practice: identity construction ← Data

72 Example: Privacy Wizards for Social Networking Sites [Fang & LeFevre 2010]. Interface: the user specifies what they want to share with whom – not in an abstract way ("group X" or "friends of friends" etc.), not for every friend separately, but for a subset of friends, and the system learns the implicit rules behind that. Data mining: active learning (the system asks only about the most informative friend instances). Results: good accuracy, better for "friends by communities" (linkage information) than for "friends by profile" (their profile data)

73 Privacy Wizards... – more feedback: the "expert interface" shows the learned classifier

74 [Concept map around profiling:] encrypted content, unobservable communication; selectivity by access control; identification of information flows; feedback & awareness tools; educational materials and communication design; cognitive biases and nudging interventions; legal aspects; offline communities: social identities, social requirements

75 Summary and conclusions: The Web and space are rich sources of behavioural and other data. Data mining is learning (inductively) from these data – a process of knowledge discovery (KD). "Privacy-preserving data mining" modifies data and/or algorithms to preserve utility & privacy. Privacy threats arise in all phases of KD, but KD can also offer privacy opportunities.

76 Outlook: from macro-space to micro-space / social signal processing

77 THANK YOU – QUESTIONS PLEASE

