A Sampling-Based Approach to Optimizing Top-k Queries in Sensor Networks
Adam Silberstein, Rebecca Braynard, Carla Ellis, Kamesh Munagala, Jun Yang
Duke University


Querying Sensor Networks
Sensors provide a huge amount of data
Nodes are energy-constrained
–Radio is used to transmit data
–Radio is the primary energy consumer
Challenge: use the data to answer queries while limiting the size and number of messages
Allowing approximation has great potential
[Figure: network diagram with base station]

Outline
1. Top-k Query Challenges
2. Sampling-Based Optimization
3. Prospector Algorithms
4. Experimental Evaluation

Top-k Query
Top-k query definition
–Return the nodes with the k highest sensor readings
–Approximate version: given an energy budget, gather node values such that the number of top-k values is maximized
Example
–Get a list of the most popular bird feeding sites so ornithologists can visit those areas
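The two definitions above can be made concrete with a minimal sketch. The function names, the node ids, and the readings are illustrative assumptions, not taken from the slides; accuracy is taken here to mean the fraction of the true top-k nodes that a plan actually gathered.

```python
def top_k_nodes(readings, k):
    """Return the ids of the k nodes with the highest readings."""
    ranked = sorted(readings, key=lambda node: readings[node], reverse=True)
    return set(ranked[:k])


def plan_accuracy(readings, visited, k):
    """Fraction of the true top-k nodes that the visited set captured."""
    truth = top_k_nodes(readings, k)
    return len(truth & set(visited)) / k


# Hypothetical readings for five nodes (made-up values).
readings = {"u1": 10, "u2": 16, "u3": 19, "u4": 22, "u5": 14}

exact = top_k_nodes(readings, 2)                     # the two highest: u3 and u4
partial = plan_accuracy(readings, ["u3", "u5"], 2)   # only u3 is a true top-2 node
```

An approximate plan is then judged by how large this fraction is for a given energy budget.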

Top-k Challenges
Key point: unlike selection queries, whether a node is in the solution depends on the values of other nodes
Cannot simply rank nodes by model means
Calculating the likelihood that a node is top-k requires joint comparison of all node models
–Not easy even when models are simple
–Exponential number of cases in which each node is in the top-k

Top-k Challenges
Assume we know the probability that a node is in the top-k
–Other issues still influence an acquisition plan
–Local filtering
  Greedily choosing the highest-probability nodes is not necessarily the best strategy
  If no more than 2 top-k nodes ever come from the subtree rooted at u, do not assign more than bandwidth 2 out of that subtree
–Constructing a proof
  An additional goal may be to prove as many top-k values as possible
  Must return non-solution values as proof

Terminology
Network of n nodes, u_1, …, u_n
Nodes arranged in a routing spanning tree, T
–Parent of u_i: parent(u_i)
–Child nodes of u_i: children(u_i)
Cost model
–Overhead cost per edge: σ
–Marginal data cost per edge: δ
A top-k plan is an assignment of bandwidths to edges dictating data collection
Total bandwidth cost must be less than the user-specified energy budget
[Figure: two example plans rooted at the base station (BS), with costs 3σ + 4δ and 4σ + 4δ]
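As a rough illustration of the cost model, the sketch below prices a plan given as a bandwidth per edge. It assumes that an edge with zero bandwidth sends no message and so pays nothing, and that each edge is named by its child endpoint; the node names and the values of σ and δ are made up for the example.

```python
def plan_cost(bandwidth, sigma, delta):
    """Cost of a plan: each used edge pays sigma once plus delta per value carried."""
    return sum(sigma + delta * b for b in bandwidth.values() if b > 0)


# Hypothetical plan: bandwidth assigned to the edge from each node to its parent.
bandwidth = {"A": 2, "B": 1, "C": 1, "D": 0}

# Three edges are used and four values are carried in total: 3*sigma + 4*delta.
cost = plan_cost(bandwidth, sigma=40, delta=18)
```

A plan is feasible when this cost stays under the user-specified energy budget.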

Sampling-Based Optimization
Base optimization on a set of samples drawn from the joint distribution of nodes
Mark which nodes are top-k in each sample
Acquire as many "1"s as possible
[Figure: routing tree over BS and nodes 1 to 5; sample-by-node table of readings and the matching 0/1 table of top-k membership]
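The marking step can be sketched as follows. Each row is one sample and each column one node; the sample values are invented for illustration and are not the ones in the original figure, and `mark_top_k` is a hypothetical helper name.

```python
def mark_top_k(samples, k):
    """For each sample (row), mark with 1 the k nodes (columns) holding the highest values."""
    marks = []
    for row in samples:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        winners = set(order[:k])
        marks.append([1 if j in winners else 0 for j in range(len(row))])
    return marks


# Three made-up samples over five nodes.
samples = [
    [11, 15, 21, 25, 13],
    [8, 6, 2, 14, 18],
    [5, 1, 7, 9, 23],
]

marks = mark_top_k(samples, k=2)
# marks == [[0, 0, 1, 1, 0], [0, 0, 0, 1, 1], [0, 0, 0, 1, 1]]
```

A plan is then scored by how many of these 1s it would have collected across the samples.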

Theoretical Basis for Sampling
Optimize over the samples
A polynomial number of samples is sufficient for prediction
Approach proven for a class of 2-stage stochastic optimization problems [Shmoys and Swamy FOCS04]
Proof for approximate top-k is a reduction from stochastic Steiner tree

Sampling Advantages
Work directly with sample values
–Avoid having to learn and fit models to data
Any correlation is implicit in the samples
–If a model is available, can draw values from it; sample maintenance is then no more expensive than explicit model maintenance
–In top-k, avoid calculating probabilities of nodes being in the result

Prospectors
Prospecting: dig for answers in the most promising places, given an energy budget
ProspectorGreedy
–Sum sample columns
–Visit most frequent contributors
[Figure: 0/1 top-k table with per-node column sums]
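A minimal sketch of the two greedy steps, under a simplifying assumption: the energy budget is represented by `max_nodes`, a hypothetical cap on how many nodes can be visited, rather than by the edge cost model. The 0/1 table is made up for the example.

```python
def column_sums(marks):
    """Count, per node (column), how many samples place it in the top-k."""
    return [sum(col) for col in zip(*marks)]


def prospector_greedy(marks, max_nodes):
    """Visit the nodes that appear in the top-k most often, up to max_nodes."""
    counts = column_sums(marks)
    order = sorted(range(len(counts)), key=lambda j: counts[j], reverse=True)
    return order[:max_nodes]


# Hypothetical 0/1 table: 3 samples (rows) by 5 nodes (columns).
marks = [
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

counts = column_sums(marks)            # [0, 0, 1, 3, 2]
chosen = prospector_greedy(marks, 2)   # column indices 3 and 4
```

This version ignores where the chosen nodes sit in the routing tree, which is the gap the LP-based prospectors address.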

Prospectors
ProspectorLP-LF
–Consider the bandwidth cost model
–Encode with linear programming
–Each visited node requires every one of its ancestor edges to be used, each carrying the additional marginal cost of a single (node, value) pair
Now we favor visiting nodes clustered in the same area of the network!
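The clustering effect can be seen without solving a linear program. The sketch below only prices a given set of visited nodes under the no-local-filtering rule, where every visited node adds one (node, value) pair to each edge on its route to the base station. The tree, the node names, and the values of σ and δ are illustrative assumptions, and this is not the linear program itself.

```python
def path_to_root(node, parent):
    """Edges on the route to the base station, each named by its child endpoint."""
    edges = []
    while node in parent:
        edges.append(node)
        node = parent[node]
    return edges


def no_filter_cost(visited, parent, sigma, delta):
    """Without local filtering, each visited node adds one value to every ancestor edge."""
    load = {}
    for node in visited:
        for edge in path_to_root(node, parent):
            load[edge] = load.get(edge, 0) + 1
    return sum(sigma + delta * b for b in load.values())


# Hypothetical routing tree: BS is the root; B and C share parent A; E hangs off D.
parent = {"A": "BS", "B": "A", "C": "A", "D": "BS", "E": "D"}

clustered = no_filter_cost({"B", "C"}, parent, sigma=40, delta=18)  # 3 edges, 4 values
dispersed = no_filter_cost({"B", "E"}, parent, sigma=40, delta=18)  # 4 edges, 4 values
```

Both plans visit two nodes and carry four values in total, but the clustered plan shares the overhead of edge A and is cheaper by one σ.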

Prospectors
ProspectorLP+LF
–Add in local filtering
–Each visited node doesn't necessarily contribute a (node, value) pair to each ancestor edge message, i.e., b(u_i) ≠ Σ b(child(u_i)) + {0, 1}, where b is bandwidth
–More flexibility
  Acquire all the subtree's top-k values, but with lower outgoing bandwidth on ancestor edges
  If a subtree never contributes more than x, even if particular nodes vary, assign no more than x bandwidth
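The last point suggests a simple per-subtree bound that can be read off the samples: the largest number of top-k nodes the subtree holds in any single sample. The sketch below computes that bound; the tree, the node names, and the 0/1 marks are made up, and `bandwidth_cap` is a hypothetical helper rather than part of the actual linear program.

```python
def subtree_nodes(root, children):
    """All nodes in the subtree rooted at root, including root itself."""
    stack, found = [root], []
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return found


def bandwidth_cap(root, children, marks):
    """Largest number of top-k nodes the subtree contributes in any one sample."""
    members = subtree_nodes(root, children)
    return max((sum(sample.get(n, 0) for n in members) for sample in marks), default=0)


# Hypothetical subtree: u has three children.
children = {"u": ["a", "b", "c"]}

# Hypothetical 0/1 top-k marks, one dict per sample.
marks = [
    {"a": 1, "b": 1, "c": 0},
    {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 0, "c": 1},
]

cap = bandwidth_cap("u", children, marks)  # 2, although three distinct nodes appear
```

Here all three children are worth visiting, yet the edge out of u never needs to carry more than two values, since u can filter locally and forward only its best two.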

Prospectors
[Figure: example plans for Greedy, LP-LF, and LP+LF]

Prospectors
Previous Prospectors cannot guarantee anything about their result
ProspectorProof
–Pass up additional values to prove other values are the top-l in a subtree
–Sort all values passed to the subtree root
–v_i is top-l for the subtree if:
  1. It is within the top-l of the sorted list
  2. A proven value from each child subtree besides v_i's is in the list below the top l, or all values from the subtree are in the list
Augment with a 2nd querying stage to get the remaining top-k values: ProspectorExact

Experimental Evaluation
Comparison of algorithms
Results of energy cost vs. accuracy
–Oracle: knows the top-k nodes in advance
–Naïve: exact solution for different k values
–Greedy, LP-LF, LP+LF: Prospectors

Experimental Evaluation
Contention zones
–Demonstrate local filtering
–Network areas with negative correlation
–Each zone has k low-mean, high-variance nodes such that each provides 1/6 of the top-k
–LP+LF visits more nodes for the same energy budget

Experimental Evaluation
Temperature readings from the Intel Lab Berkeley data set
Local filtering doesn't help because there is no negative correlation

Experimental Evaluation
Given enough energy for ProspectorProof (phase 1), the total cost of ProspectorExact is less than that of naïve-k

Conclusions
Approximation is an important technique for limiting network transmissions and saving energy
Model-driven acquisition is good for building query plans, but not straightforward for complex queries
Top-k is a fundamental query and illustrates the complications of using models
We use sampling-based models and associated linear programs to build query plans, incorporating sophisticated features
Experimentally
–Approximate solutions are much less expensive than exact ones
–The more sophisticated the planning program, the more the plan can leverage the network and data characteristics to visit more nodes for the same amount of energy

Experimental Details
200 nodes, k=40, 200x400 area, 50 m range
Berkeley: k=12, of 54 nodes
Naïve top-k on Berkeley: 907 mJ for 100%, 623 mJ for 20%, so 3x more expensive than approximate at near 100%
Fixed cost: 40, marginal cost: 18
