Emiran Curtmola @ UC San Diego, Alin Deutsch @ UC San Diego, K.K. Ramakrishnan @ AT&T, Divesh Srivastava @ AT&T

SIGMOD, June 2010

[Figure labels: DATA, ONLINE COMMUNITIES]
• Global community data (all published data)
• Democratization of data creation on the web: easy to create and publish data
• Self-organization in online communities of interest in ad-hoc fashion

• Typical such applications are centralized
  ▪ Hosted online communities
  ▪ Search engines
• Limitations
  ▪ Disintermediation of publishers from queriers
  ▪ Publishers need to give up their data
  ▪ Central site controls visibility of publishers to queriers
  ▪ Publishers lose their right to privacy

• Free data exchange within the community
• Some users want to remain autonomous
• User privacy (i.e., not all users may want to reveal their true identity)
  ▪ Publishers express their opinions anonymously to avoid association with sensitive or controversial issues (e.g., politics, race, religion)
• User autonomy + privacy suggest a decentralized infrastructure

• Make it safer for publishers to join and post data
• Prevent association of sensitive topics with the publishers that contribute to them, even if nodes are compromised
• Publisher k-anonymity: for every publisher p and data item d, hide p in a k-protected crowd of publishers, i.e., there are at least k-1 other potential publishers of the same d
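The k-anonymity condition can be made concrete with a small sketch. It assumes a hypothetical model of what an observer (for example a compromised node) sees: a mapping from a group of publishers, whose updates arrive merged, to the set of items seen for that group. The function names and the input format are illustrative, not taken from the slides.

```python
def potential_publishers(item, observed):
    """Publishers an observer cannot rule out as the source of `item`.

    `observed` maps a frozenset of publisher names (a group whose updates
    arrive merged) to the set of items seen for that group. This input
    format is a hypothetical modelling choice.
    """
    candidates = set()
    for group, items in observed.items():
        if item in items:
            candidates |= group
    return candidates


def is_k_anonymous(item, observed, k):
    """True if `item` has at least k potential publishers (p plus k-1 others)."""
    return len(potential_publishers(item, observed)) >= k


# Advertised terms of three publishers, taken from the running example.
ADVERTISED = {
    "P1": {"Beijing", "Tibet", "stocks", "poverty", "money"},
    "P2": {"Beijing", "yak tea", "Hong Kong", "poverty"},
    "P3": {"Beijing", "Tibet", "yak tea", "Hong Kong", "money"},
}

# Case 1: the observer sees each publisher's update individually.
individual = {frozenset({p}): terms for p, terms in ADVERTISED.items()}

# Case 2: the observer sees only one merged update for the whole crowd.
merged = {frozenset(ADVERTISED): set().union(*ADVERTISED.values())}

print(is_k_anonymous("stocks", individual, k=3))  # only P1 is a candidate
print(is_k_anonymous("stocks", merged, k=3))      # all three are candidates
```

In the first case "stocks" points at P1 alone, while in the merged case every crowd member remains a potential publisher, which is the situation the definition asks for.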

News & Blogs: advertised data items about each publisher's articles

Publisher  Advertised data items
P1         Beijing, Tibet, stocks, poverty, money
P2         Beijing, yak tea, Hong Kong, poverty
P3         Beijing, Tibet, yak tea, Hong Kong, money
P4         Beijing, Olympics, yak tea, stocks, money
P5
P6         Olympics, Tibet, stocks, money
P7         Olympics, yak tea, stocks, money
P8

• Query Q1: find the articles mentioning the Olympics in Beijing
• Query Q2: find the articles about Tibet
• Query Q3: find the articles mentioning poverty
• Query Q4: find the articles about money in Hong Kong

[Figure: the community data collection, made up of the local XML data of publishers P1 to P8.]

How to query ad-hoc distributed data sources while preserving user privacy?
• Allow publishers to keep complete control over their data
• Disseminate queries in the network, not data
• Publishers answer queries at their own discretion
• Published data is not traceable back to publishers, even if nodes are compromised
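As a reference point for the example queries, the sketch below computes which publishers' advertised terms cover each query. It is a centralized check, used here only to show the answers a decentralized scheme should reach. It assumes conjunctive keyword queries and omits P5 and P8, whose terms are not listed. The names `ADVERTISED`, `QUERIES`, and `matching_publishers` are illustrative.

```python
# Advertised terms per publisher, as listed in the example
# (P5 and P8 are left out because their terms are not listed).
ADVERTISED = {
    "P1": {"Beijing", "Tibet", "stocks", "poverty", "money"},
    "P2": {"Beijing", "yak tea", "Hong Kong", "poverty"},
    "P3": {"Beijing", "Tibet", "yak tea", "Hong Kong", "money"},
    "P4": {"Beijing", "Olympics", "yak tea", "stocks", "money"},
    "P6": {"Olympics", "Tibet", "stocks", "money"},
    "P7": {"Olympics", "yak tea", "stocks", "money"},
}

# The four example queries, read as conjunctions of keywords (an assumption).
QUERIES = {
    "Q1": {"Olympics", "Beijing"},
    "Q2": {"Tibet"},
    "Q3": {"poverty"},
    "Q4": {"Hong Kong", "money"},
}


def matching_publishers(query_terms, advertised):
    """Publishers whose advertised terms include every query term."""
    return sorted(p for p, terms in advertised.items() if query_terms <= terms)


for name, terms in QUERIES.items():
    print(name, matching_publishers(terms, ADVERTISED))
```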

• Infrastructure setup such that
  ▪ Distribution of data
  ▪ Large number of decentralized publishers and consumers
  ▪ User privacy
  ▪ Efficient query routing (to avoid flooding the network)

• Build an overlay network to act as a distributed index
• Peers are organized into logical query dissemination trees (QDTs)
• Use QDTs to disseminate queries using node summaries
  ▪ P1's advertised set of terms: Beijing, Tibet, stocks, poverty, money
  ▪ P2's advertised set of terms: Beijing, yak tea, Hong Kong, poverty
  ▪ Node 3's summary (set of terms): Beijing, Tibet, stocks, poverty, money, yak tea, Hong Kong
  ▪ A node's summary is the union of its subtrees' summaries

[Figure: a QDT whose internal nodes are numbered routers and whose leaves are the publishers P1 to P8.]
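The rule that a node's summary is the union of its subtrees' summaries can be sketched with plain Python sets. The `Node` class is hypothetical, and the sketch assumes that router 3 is the direct parent of P1 and P2, which is consistent with the summary shown for node 3 but not stated explicitly.

```python
class Node:
    """A QDT node: a publisher (leaf with terms) or a router (has children)."""

    def __init__(self, name, terms=(), children=()):
        self.name = name
        self.terms = set(terms)          # non-empty only for publishers
        self.children = list(children)   # non-empty only for routers

    def summary(self):
        """Union of this node's own terms and its subtrees' summaries."""
        result = set(self.terms)
        for child in self.children:
            result |= child.summary()
        return result


p1 = Node("P1", ["Beijing", "Tibet", "stocks", "poverty", "money"])
p2 = Node("P2", ["Beijing", "yak tea", "Hong Kong", "poverty"])
# Assumption: router 3 is the parent of P1 and P2.
node3 = Node("3", children=[p1, p2])

print(sorted(node3.summary()))
```

The result contains the same seven terms as the summary given for node 3.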

• Q3 = "poverty"
• Only P1 and P2 publish articles about poverty
• Pruning: check set inclusion of the query in the node's summary (Bloom filter)

[Figure: Q3 is forwarded down the QDT only along branches whose summaries contain "poverty", and reaches P1 and P2.]

• Minimum information at each node
  ▪ No node has global information
  ▪ Node summaries are vectors of counters (Bloom filters) representing hash values of advertised data items
• Queries reach publishers in such a manner that users cannot tell whether a publisher chose not to respond or has no matching documents

[Figure: Q3 = "poverty" disseminated down the QDT.]
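A summary kept as a vector of counters can be sketched as a counting Bloom filter. The sizes, the number of hash functions, and the use of SHA-256 to derive positions are assumptions chosen for illustration, and the class and method names are hypothetical.

```python
import hashlib


class CountingBloomFilter:
    """Vector of counters; each item increments `num_hashes` positions."""

    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Counters (rather than bits) are what make removal possible.
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def merge(self, other):
        """Summary of a parent: element-wise sum of the children's counters."""
        merged = CountingBloomFilter(self.size, self.num_hashes)
        merged.counters = [a + b for a, b in zip(self.counters, other.counters)]
        return merged


f1 = CountingBloomFilter()
for term in ["Beijing", "Tibet", "stocks", "poverty", "money"]:
    f1.add(term)
f2 = CountingBloomFilter()
for term in ["Beijing", "yak tea", "Hong Kong", "poverty"]:
    f2.add(term)

parent = f1.merge(f2)
print(parent.might_contain("poverty"))
```

Because only hashed counters are stored, a node holding the summary sees which positions are set and not the list of terms itself.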

▪ If an edge node is compromised
  ▪ Risk: individual updates of node summaries (from publishers to edge routers) may expose the publishers
  ▪ Solution: publisher k-anonymity. Hide users in protected crowds of at least k publishers

[Figure: QDT with a protected crowd of publishers highlighted.]

▪ Solution: publisher k-anonymity. Hide users in protected crowds of at least k publishers and use secure multi-party (SMP) computation inside crowds to advertise updates of published terms to the edge routers

[Figure: a 3-anonymous protected crowd of publishers P1, P2, P3 below edge router 4. The updates Upd1, Upd2, Upd3 are added together with a random value (+R, later -R), and edge router 4 receives Upd1 + Upd2 + Upd3.]
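The +R / -R labels suggest a masked sum. The sketch below shows a classic ring-style secure sum under that assumption; the slides do not spell out the actual SMP protocol, so this illustrates the idea and is not the authors' method. Updates are modelled as counter-increment vectors, and `MODULUS` and `masked_sum` are hypothetical names. This simple variant does not protect against colluding crowd members.

```python
import secrets

MODULUS = 2**32  # assumed: all arithmetic is done modulo a fixed value


def masked_sum(updates):
    """Sum the crowd's update vectors without revealing any single one.

    The initiator starts from a random mask R (+R), each member adds its
    own update to the running vector, and the initiator removes the mask
    at the end (-R). Returns the total and the intermediate vectors that
    members pass along.
    """
    length = len(updates[0])
    mask = [secrets.randbelow(MODULUS) for _ in range(length)]
    running = list(mask)
    transcript = []
    for update in updates:
        running = [(r + u) % MODULUS for r, u in zip(running, update)]
        transcript.append(list(running))   # what the next member sees
    total = [(r - m) % MODULUS for r, m in zip(running, mask)]
    return total, transcript


# Hypothetical counter increments from P1, P2 and P3.
upd1 = [1, 0, 2, 0]
upd2 = [0, 1, 1, 0]
upd3 = [1, 1, 0, 3]

total, transcript = masked_sum([upd1, upd2, upd3])
print(total)  # what the edge router receives: Upd1 + Upd2 + Upd3
```

Every intermediate vector is offset by the random mask, so a member that sees one of them learns nothing about the updates added before it, and the edge router sees only the aggregate.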

▪ If an internal node is compromised
  ▪ Risk: the node summary of advertised terms is exposed → the downstream subtree may contain sensitive content, but the crowd of publishers is now even bigger

[Figure: QDT with the larger protected crowd below an internal node highlighted.]

• The tree topology introduces congestion at upper QDT levels during query dissemination
• How to relieve the congestion?

• Overlay multiple logical QDTs over the same underlay network
• A physical node belongs to multiple logical QDTs, but at different levels
• Goal: organize the nodes into QDTs such that the distribution of tree levels for a node is uniform across the QDTs

[Figure: four trees, QDT 1 to QDT 4, built over the same nodes, with node 1 marked in each tree.]

• Partition the community data collection into disjoint blocks
• Build one QDT per block B
  ▪ QDT i groups all publishers with terms in B_i
• Routing a query
  ▪ Terms in the query determine the relevant blocks
  ▪ Send the query to the corresponding QDT
  ▪ Check the full query with publishers

Block  Terms
B1     Beijing, Olympics
B2     Tibet, yak tea
B3     Hong Kong, stocks
B4     poverty, money

• Example: Q3 = "poverty" falls in B4 → use QDT 4
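The routing steps above can be sketched as a lookup from query terms to blocks. The block contents come from the table; `BLOCKS`, `TERM_TO_BLOCK`, and `relevant_blocks` are illustrative names, and block B_i is taken to correspond to QDT i.

```python
# Disjoint blocks of terms, as in the table; block Bi is served by QDT i.
BLOCKS = {
    "B1": {"Beijing", "Olympics"},
    "B2": {"Tibet", "yak tea"},
    "B3": {"Hong Kong", "stocks"},
    "B4": {"poverty", "money"},
}

# Because the blocks are disjoint, every term maps to exactly one block.
TERM_TO_BLOCK = {term: block for block, terms in BLOCKS.items() for term in terms}


def relevant_blocks(query_terms):
    """Blocks (and hence QDTs) that contain at least one query term."""
    return sorted({TERM_TO_BLOCK[t] for t in query_terms if t in TERM_TO_BLOCK})


print(relevant_blocks({"poverty"}))              # a single block
print(relevant_blocks({"Olympics", "Beijing"}))  # both terms in the same block
print(relevant_blocks({"Hong Kong", "money"}))   # terms spread over two blocks
```

A query whose terms span several blocks can be sent on any one of the corresponding QDTs, since the full query is checked at the publishers anyway. Which tree to pick is a separate optimization question.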

[Figure: QDT 1 to QDT 4 side by side, with the queries Q3 = "poverty" and Q1 = "Olympics", "Beijing" each routed on its own tree.]

• Q4 = "Hong Kong", "money"
  ▪ "Hong Kong" falls in B3 (QDT 3) and "money" falls in B4 (QDT 4)
• Route Q4 on both trees?
• Query selectivity optimization techniques: choose the selective QDT to route on by maintaining only 1-3% of popular data items (see paper)

Our solution

• Empirical fact: the upper two levels in a QDT are the most congested
• Model: cyclical permutation of nodes on the tree levels
  ▪ Number of QDTs for load balance = number of legal permutations (i.e., without breaking the fairness property)
  ▪ Fairness property: all routers appear precisely once in the top two levels of any QDT
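One possible reading of the cyclical-permutation model is sketched below. The assumptions are mine: routers are laid out in level order with a fixed fanout, so the top two levels hold 1 + fanout routers, and the fairness property is read as "across all QDTs, each router is in the top two levels exactly once". The function names are hypothetical and the analytical model in the paper may differ in detail.

```python
def cyclic_qdt_orders(routers, fanout):
    """Level-order router layouts obtained by cyclic shifts.

    Each shift moves a fresh group of 1 + fanout routers into the top two
    levels, so the number of layouts is len(routers) // (1 + fanout).
    """
    top = 1 + fanout  # root plus its children
    if len(routers) % top != 0:
        raise ValueError("this sketch needs len(routers) divisible by 1 + fanout")
    return [routers[shift:] + routers[:shift]
            for shift in range(0, len(routers), top)]


def top_two_levels(order, fanout):
    """Routers placed in the root level and the level below it."""
    return order[:1 + fanout]


routers = list(range(1, 13))          # 12 hypothetical routers
orders = cyclic_qdt_orders(routers, fanout=2)

print(len(orders))                    # number of QDTs under these assumptions
for order in orders:
    print(top_two_levels(order, fanout=2))
```

Under these assumptions the congested top-level positions rotate through all routers, so no router carries the upper-level load in more than one tree.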

• Overall throughput depends heavily on the most congested node
• Look at node stress in terms of number of messages
  ▪ going into a node: Processing Load at a node (PLoad)
  ▪ going out of a node: Forwarding Load at a node (FLoad)
• Throughput indicator: compare how far the peak load (k QDTs) is from the ideal load (average load for 1 QDT = nr. msgs / nr. nodes)
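The two load measures and the comparison against the ideal load can be sketched as follows. Messages are modelled as hypothetical (sender, receiver) pairs, and "how far" is computed as the relative excess of the peak load over the ideal load. That formula is an assumption, since the slide does not give one.

```python
from collections import Counter


def node_loads(messages):
    """Return (PLoad, FLoad): messages received and sent per node."""
    pload = Counter(receiver for _, receiver in messages)
    fload = Counter(sender for sender, _ in messages)
    return pload, fload


def excess_over_ideal(loads, num_messages, num_nodes):
    """Relative distance of the peak load from the ideal (average) load."""
    ideal = num_messages / num_nodes
    peak = max(loads.values())
    return (peak - ideal) / ideal


# Hypothetical dissemination of one query down a small tree.
messages = [("R1", "R2"), ("R1", "R3"),
            ("R2", "P1"), ("R2", "P2"),
            ("R3", "P3"), ("R3", "P4")]
nodes = {n for pair in messages for n in pair}

pload, fload = node_loads(messages)
print(excess_over_ideal(pload, len(messages), len(nodes)))
print(excess_over_ideal(fload, len(messages), len(nodes)))
```

In a tree with uniform fanout every receiver gets one message per query while every router sends `fanout` messages, which is why the forwarding load ends up further from the ideal than the processing load in this toy example.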

• Experiment 1: PLoad for Scribe QDT topology
• Result: the number of QDTs for load balance found experimentally coincides with that given by our analytical model
• Load balance with
  ▪ How close: 32% closest to ideal PLoad
  ▪ How close: 923% closest to ideal FLoad
• To balance FLoad, node fanouts need to be the same
• Experiment 2: FLoad for fanout-balanced QDT topologies
  ▪ How close: 18% closest to ideal PLoad
  ▪ How close: 130% closest to ideal FLoad

• Propose a novel publishing infrastructure
• Empowers publishers to join and post without being associated with (sensitive) content
• Generic solution: it extracts the maximum load balance supported by the QDT topology

