Efficient Gathering of Correlated Data in Sensor Networks

Efficient Gathering of Correlated Data in Sensor Networks
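Himanshu Gupta, Vishnu Navda, Samir R. Das, Vishal Chowdhary
Department of CS, State University of New York, Stony Brook
MobiHoc 2005
Hello, professor and fellow students. Today I am presenting "Efficient Gathering of Correlated Data in Sensor Networks" by Himanshu Gupta and his co-authors at Stony Brook University, published at MobiHoc 2005. The paper exploits the correlation in the data that sensors collect: only a subset of the nodes in the network is selected to report data, and the data of the unselected nodes is inferred from the selected ones, which saves energy.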

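Outline
Introduction
Problem Formulation
Energy-Efficient Distributed Algorithm
Centralized Approximation Algorithm
Performance Results
Conclusion
Today's outline starts with the introduction. The problem formulation gives a formal definition of the problem the paper poses. After that come the distributed and centralized algorithms the authors propose, and finally the experimental results and the conclusion.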

Introduction (1) Data gathering in sensor networks
Collect periodic snapshots of distributed sensor data at a sink node.
Environmental applications: temperature, humidity, pressure data.
Sensor networks are usually redundant. They exhibit a high degree of spatial correlation in the data collected (colored sub-regions in the figure).
Data gathering in a sensor network means collecting the readings that each node takes in every period at the sink node. Typical applications collect temperature, humidity, and pressure data. Such networks are usually deployed with many redundant nodes, and the readings of different nodes tend to be strongly correlated.

Introduction (2) Data Gathering Approach
Naïve method: collect data from all the nodes by forming a gathering tree with the sink node at the root.
Energy-efficient method: given a sensor network, select a subset of sensors M, called a connected correlation-dominating set, such that
(a) each sensor not in M is correlated to a subset of the sensors in the selected set M, and
(b) the selected set M forms a connected communication graph.
The simple way to design the gathering scheme is to have all nodes form a gathering tree rooted at the sink and send all data along this tree to the sink. Every node then has to transmit, which costs a lot of energy. The more efficient method, the one this paper uses, selects only some of the nodes in the network to collect and forward data. This set is M, the connected correlation-dominating set, and it must have two properties. First, every node outside M must be correlated to a subset of M, meaning its data can be inferred from the data of that subset. Second, the nodes in M must form a connected communication graph.

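Example
For a given region, the data of any two sensors is sufficient to infer the data of all other sensors in the region.
(Figure legend: selected node, deleted node.)
Here is an example of a connected correlation-dominating set. It assumes that within a region of the same shade, the data of any two nodes can be used to infer the data of the other nodes in that region. As in the figure, it is enough to pick two nodes in each region. If the communication graph of the picked nodes is connected, they form a connected correlation-dominating set.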

Formal Problem Definition (1)
Definition 1. (Communication Graph) Given a sensor network consisting of a set of sensors I, the communication graph for the sensor network is the undirected graph CG with I as the set of vertices and an edge between any two sensors if they can communicate directly with each other.
(Figure (a): Communication Graph over the nodes t, u, v, w, x, y, z.)
Next the paper formalizes the connected correlation-dominating set problem, starting with a few definitions it will need. The first is the communication graph. For a sensor network, the communication graph is a graph like the one in the figure: it has all the sensor nodes as vertices, and there is an edge between two sensor nodes if they are within each other's communication range.

Definition 2. (Correlation Graph; Correlation Neighbors) Given a sensor network consisting of a set of sensors I, the correlation graph over the sensor nodes is a directed hypergraph with I as the set of vertices, and a subset of (P(I) × I) as the set of directed hyperedges, where P(I) is the power set of I. In other words, the correlation graph is a hypergraph G(V = I, E ⊆ (P(I) × I)).
(Figure (a): Correlation Edge ((u, v, w), x). Figure (b): Correlation Graph.)
The second definition covers the correlation graph and correlation neighbors. The correlation graph of a sensor network is a directed hypergraph. It also has all the sensor nodes as vertices, but its correlation edges are directed hyperedges: a hyperedge points from a set of nodes to a single node. In figure (a), the subset formed by u, v, and w points to node x. This correlation edge means that x is correlated to the subset (u, v, w), that is, the data of x can be inferred from the data of u, v, and w. Node x and the nodes u, v, w are then called correlation neighbors. All the nodes together with all existing hyperedges form the correlation graph of the network; figure (b) shows an example.

Definition 3. (Connected Correlation-Dominating Set) Consider a sensor network consisting of n sensors. Let C be the correlation graph over the sensor nodes in the network. A set of sensors M is called a connected correlation-dominating set if:
1. The communication subgraph induced by M is connected.
2. For each sensor node s ∉ M, there is a set of sensors S ⊆ M such that (S, s) is a correlation edge in C.
(Figure (a): Correlation Graph C. Figure (b): Connected Correlation-Dominating Set M = {t, u, v, w}.)
Definition 3 defines the connected correlation-dominating set. A node set M is a connected correlation-dominating set of the sensor network if it satisfies two conditions. First, the communication graph formed by the nodes in M must be connected. Second, for every node s that does not belong to M, the correlation graph C must contain a correlation edge pointing from a subset of M to s. In figure (b), the node set M formed by t, u, v, and w is a connected correlation-dominating set: the communication graph of t, u, v, w is connected, and each node outside M (x, y, and z) has a correlation edge in the correlation graph pointing to it from within M.
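The two conditions of Definition 3 can be checked mechanically. The following is a minimal sketch, not the authors' code: the function names, the edge-list representation, and the example graph (whose edges are invented, since the figure is not available here) are all assumptions made for illustration.

```python
from collections import deque

def is_connected(nodes, comm_edges):
    """Condition 1: the communication subgraph induced by `nodes` is connected."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {n: set() for n in nodes}
    for u, v in comm_edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == nodes

def is_ccds(M, all_nodes, comm_edges, corr_edges):
    """Check both conditions of Definition 3 for the candidate set M.

    corr_edges is a list of hyperedges (source_tuple, target).
    """
    M = set(M)
    if not is_connected(M, comm_edges):
        return False
    # Condition 2: every node outside M has a hyperedge whose sources lie in M.
    for s in set(all_nodes) - M:
        if not any(t == s and set(src) <= M for src, t in corr_edges):
            return False
    return True

# Hypothetical instance (edges are assumptions, not taken from the slides).
nodes = ["t", "u", "v", "w", "x", "y", "z"]
comm_edges = [("t", "u"), ("u", "v"), ("v", "w"), ("w", "x"), ("x", "y"), ("y", "z")]
corr_edges = [(("u", "v", "w"), "x"), (("t", "u"), "y"), (("v", "w"), "z")]

print(is_ccds({"t", "u", "v", "w"}, nodes, comm_edges, corr_edges))
print(is_ccds({"t", "u", "w"}, nodes, comm_edges, corr_edges))
```

In this made-up instance the first set passes both conditions, while the second fails because w is no longer connected to t and u once v is removed.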

Connected Correlation-Dominating Set Problem: Given a sensor network and a correlation graph over the sensors, the connected correlation-dominating set problem is to find the smallest connected correlation-dominating set.
The connected correlation-dominating set problem is NP-hard, as the less general minimum dominating set problem is well known to be NP-hard.
The formal statement of the connected correlation-dominating set problem is: given a sensor network and its correlation graph, find a connected correlation-dominating set with the fewest nodes. Since the simpler dominating set problem is already known to be NP-hard, this problem is NP-hard as well. The distributed and centralized algorithms presented later are therefore both heuristics.

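Computing Correlation Hyperedge Parameters
A hyperedge (S, s) exists if the data values of s can be inferred from the values of S within a certain error bound.
Linear prediction model: s'[k] = \sum_{l=1}^{L} \alpha_l s_l[k]
Least-squares approach: choose the \alpha_l that minimize E(\alpha) = \sum_{k=1}^{K} (s[k] - s'[k])^2
S = {s_1, ..., s_L}: source nodes
s'[k]: predicted value of node s at the kth time
s_l[k]: actual value of source node l at the kth time
s[k]: actual value of node s at the kth time
This slide explains how the correlation edges are found. If the data of a node s can be inferred from a source set S within a given error bound, there is a correlation edge from S to s. The paper uses a linear data prediction model. As in the formula above, the predicted data of node s at time k, s'[k], is expressed as a linear combination of the data of source nodes 1 to L at time k, weighted by the α parameters. The values of the α parameters are then obtained by least squares.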

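Next, the least-squares error is written in matrix form. Following the standard approach, E(α) is differentiated with respect to α and set to zero, and solving the resulting equations gives the values of the α parameters. With these values we can go back and check whether the difference between the value predicted from the source set and the value that node s actually measured at time k falls within the preset error bound.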
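The fitting step can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's implementation: it solves the standard normal equations (AᵀA)α = Aᵀs with plain Gaussian elimination, the function names are invented, and the error bound is treated as a relative bound per sample, which the slides do not specify.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_alphas(source_readings, target_readings):
    """Least-squares alphas; source_readings[k][l] is source node l at time k."""
    K, L = len(source_readings), len(source_readings[0])
    # Setting dE/d(alpha) = 0 gives the normal equations (A^T A) alpha = A^T s.
    AtA = [[sum(source_readings[k][i] * source_readings[k][j] for k in range(K))
            for j in range(L)] for i in range(L)]
    Ats = [sum(source_readings[k][i] * target_readings[k] for k in range(K))
           for i in range(L)]
    return solve_linear_system(AtA, Ats)

def hyperedge_exists(source_readings, target_readings, rel_error_bound):
    """True if every prediction is within the (assumed relative) error bound."""
    alphas = fit_alphas(source_readings, target_readings)
    for row, actual in zip(source_readings, target_readings):
        predicted = sum(a * x for a, x in zip(alphas, row))
        if abs(predicted - actual) > rel_error_bound * abs(actual):
            return False
    return True

# Made-up readings: K = 3 time steps, L = 2 source nodes.
sources = [[20.0, 22.0], [21.0, 24.0], [19.0, 25.0]]
target = [21.0, 22.5, 22.0]            # exactly the average of the two sources
alphas = fit_alphas(sources, target)
print(alphas)
print(hyperedge_exists(sources, target, 0.05))
```

Because the made-up target is exactly the mean of the two sources, the fitted α values come out close to 0.5 each and the hyperedge test passes. A target unrelated to the sources would leave a large residual and fail the test.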

Energy-Efficient Distributed Algorithm (1)
Basic Distributed Algorithm
1. Initially, each node assigns itself a priority. Data-gathering nodes mark themselves selected.
2. Next, each node collects d-hop neighborhood information.
3. Each remaining node periodically tests the following conditions. When both hold, it marks itself deleted and instructs the correlation neighbors it relies on to mark themselves selected.
(i) It can be inferred (using a correlation edge) from a set of non-deleted nodes.
(ii) Its deletion preserves the connectivity of the communication subgraph induced over the non-deleted nodes.
The non-deleted nodes form a connected correlation-dominating set.
This is the basic distributed algorithm. 1. At the start, every node has its own priority. The nodes responsible for gathering data (such as the sink) first mark themselves selected. A selected node collects data from the environment and is used to infer the data of other nodes. 2. Each node then collects information about its neighbors within d hops, to find out which correlation edges point to it and to learn the connectivity. 3. Each node then checks periodically: if its data can be inferred from other non-deleted nodes, and marking it deleted would not break the connectivity of the non-deleted nodes, it marks itself deleted and notifies its correlation neighbors so that they become selected. At every stage of the algorithm, the non-deleted nodes form a connected correlation-dominating set.

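Conditions for Marking Deleted (for node s)
This slide shows the four conditions that are actually used to decide whether a node can be marked deleted. A node s can become deleted when all four hold:
C1: s itself is not marked selected.
C2: every pair of neighbors (u, v) of s is connected by a path whose nodes all have a priority lower than that of s. This condition protects the connectivity of the non-deleted nodes.
C3: the correlation graph contains a correlation edge that s can use, and each of its source nodes is either marked selected or has a priority lower than that of s.
C4: every node r that might use s as a source node is marked selected, is marked deleted, or has a priority lower than that of s.
When s satisfies all four conditions, it marks itself deleted and notifies the correlation neighbors it uses of its deleted state.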
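The four conditions can be sketched as a single predicate. This is only an illustration of the description above, with several assumptions: it uses a global view of the graph instead of d-hop information, it applies the priority requirement in C2 to the intermediate nodes of the path only, it requires source nodes in C3 to be non-deleted, and all names and the three-node example are invented.

```python
from collections import deque
from itertools import combinations

def c2_holds(s, adj, priority):
    """C2: each neighbor pair of s is joined by a path that avoids s and whose
    intermediate nodes all have lower priority than s (assumed reading)."""
    def allowed(n):
        return n != s and priority[n] < priority[s]
    for u, v in combinations(sorted(adj[s]), 2):
        seen, queue, found = {u}, deque([u]), False
        while queue and not found:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt == v:
                    found = True
                    break
                if nxt not in seen and allowed(nxt):
                    seen.add(nxt)
                    queue.append(nxt)
        if not found:
            return False
    return True

def can_delete(s, adj, corr_edges, priority, status):
    """Return True if node s satisfies C1 to C4 as described on the slide."""
    if status[s] == "selected":                      # C1
        return False
    if not c2_holds(s, adj, priority):               # C2
        return False

    def usable_source(n):
        if status[n] == "selected":
            return True
        return status[n] != "deleted" and priority[n] < priority[s]

    # C3: some hyperedge (S, s) has only usable source nodes.
    if not any(t == s and all(usable_source(n) for n in src)
               for src, t in corr_edges):
        return False
    # C4: every node r that may use s as a source is safe.
    for src, r in corr_edges:
        if s in src and not (status[r] in ("selected", "deleted")
                             or priority[r] < priority[s]):
            return False
    return True

# Hypothetical three-node network.
adj = {"a": {"b", "s"}, "b": {"a", "s"}, "s": {"a", "b"}}
priority = {"a": 1, "b": 2, "s": 3}
status = {"a": "selected", "b": "undecided", "s": "undecided"}
corr_edges = [(("a", "b"), "s")]

print(can_delete("s", adj, corr_edges, priority, status))
print(can_delete("b", adj, corr_edges, priority, status))
```

Here s can be deleted because its two neighbors are directly connected and its sources are either selected or of lower priority, while b cannot because no correlation edge points to it.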

2-Round Distributed Algorithm
Based on the basic distributed algorithm.
Replace C3 and C4 with C33 and C44 in the initial round.
C33: There is a correlation edge (S, s) in the correlation graph, such that no node in the set S is marked deleted. In addition, each node in S is either marked selected or doesn't satisfy the C2 condition or has a priority less than p(s).
C44: If there is a correlation edge (R, r) where s ∈ R, then either r is marked deleted or marked selected or doesn't satisfy the C2 condition or has a priority less than p(s).
To increase the number of nodes that get marked deleted, the paper proposes two slightly modified algorithms. The first is this 2-round distributed algorithm. It is essentially the same as the basic algorithm, except that in the first testing round it replaces conditions C3 and C4 with the modified conditions C33 and C44, which relax C3 and C4 a little. C33 says that the source nodes a node s may use can be, in addition to the selected or lower-priority nodes allowed by C3, nodes that do not satisfy C2. C44 says that a node r that might use s as a source node can be, in addition to the deleted, selected, or lower-priority nodes allowed by C4, a node that does not satisfy C2. With C33 and C44, node s has to determine whether its correlation neighbors satisfy C2, so besides the information about its d-hop neighbors it must also collect information from the neighborhood of its correlation neighbors. That is extra overhead, which is why C33 and C44 are used only in the first testing round.

Handshake Algorithm
Based on the basic distributed algorithm.
Uses C33 and C44 in all testing rounds.
Additional C2-satisfied messages: whenever a node's C2 condition is satisfied, it transmits a C2-satisfied message to its correlation neighbors.
Before node s marks itself deleted, it performs a "handshake" with the source nodes it uses.
The second distributed algorithm is the handshake algorithm. It is again the same as the basic algorithm, except that it uses C33 and C44 in place of C3 and C4 in all testing rounds. To determine whether the correlation neighbors of node s satisfy C2, it uses C2-satisfied messages: when a node satisfies C2, it sends a C2-satisfied message to all of its correlation neighbors. To guard against message loss, node s completes a handshake with the source nodes it relies on before marking itself deleted, and marks itself deleted only after confirming that everything is consistent. This approach also increases the number of messages that have to be sent.

Centralized Approximation Algorithm (1)
Definition 4. (Intersection Graph of Source Sets) Let I be the set of nodes in the network, and I′ = { {s} | s ∈ I }. Let S be the set of source sets in the correlation graph of the network. The intersection graph of source sets is the simple graph G(V = S ∪ I′, E = { (v1, v2) | (v1 ∩ v2) ≠ ∅ }).
(Figure: source sets S1, S2, S3, S4.)
Next comes the centralized algorithm, again preceded by definitions of the concepts it uses. Definition 4 introduces the intersection graph of source sets. A source set is the set of source nodes of one correlation edge. The intersection graph of source sets is a graph with all the source sets as vertices, and two source sets are joined by an edge if their intersection is not empty. In the figure, S1 is a source set containing {b1, b2} and S2 is another source set containing {b2, b3}. Because S1 and S2 share b2, the vertices representing S1 and S2 are joined by an edge.

Definition 5. (Connected Subgraph of Sources; Connected Source Set) A connected subgraph in the intersection graph of source sets is called a connected subgraph of sources. A connected source set is a set of nodes corresponding to some connected subgraph of sources, i.e., the union of the sets corresponding to the vertices of a connected subgraph of sources.
(Figure (a): Source Sets S1, S2, S3, S4. Figure (b): Intersection Graph of Source Sets. Figure (c): Connected Subgraph of Sources S1, S2, S3. Figure (d): Connected Source Set {b1, b2, b3, b4}.)
Definition 5 covers the connected subgraph of sources and the connected source set. A connected subgraph of sources is a connected subgraph of the intersection graph described above. In the intersection graph formed by S1 to S4, for example, S1 to S3 form a connected subgraph of sources. A connected source set is the set of source nodes corresponding to a connected subgraph of sources. For the connected subgraph of sources formed by S1 to S3, the corresponding connected source set is {b1, b2, b3, b4}.
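Definitions 4 and 5 translate directly into set operations. The sketch below is illustrative only: the function names are invented, the singleton vertices of Definition 4 are left for the caller to add as extra one-element source sets, and the contents of S3 and S4 are assumptions (the slides only give S1 = {b1, b2} and S2 = {b2, b3}).

```python
from collections import deque
from itertools import combinations

def intersection_graph(source_sets):
    """Adjacency of the intersection graph: an edge joins two source sets
    whenever they share at least one node."""
    adj = {name: set() for name in source_sets}
    for a, b in combinations(source_sets, 2):
        if source_sets[a] & source_sets[b]:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def connected_source_set(names, source_sets):
    """Union of the named source sets if they form a connected subgraph of
    sources, otherwise None."""
    names = set(names)
    if not names:
        return None
    adj = intersection_graph(source_sets)
    start = next(iter(names))
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in (adj[cur] & names) - seen:
            seen.add(nxt)
            queue.append(nxt)
    if seen != names:
        return None
    return set().union(*(source_sets[n] for n in names))

source_sets = {
    "S1": {"b1", "b2"},
    "S2": {"b2", "b3"},
    "S3": {"b3", "b4"},   # assumed contents
    "S4": {"b4", "b5"},   # assumed contents
}
print(sorted(connected_source_set(["S1", "S2", "S3"], source_sets)))
print(connected_source_set(["S1", "S3"], source_sets))
```

With these assumed sets, S1 to S3 yield the connected source set {b1, b2, b3, b4}, while S1 and S3 alone do not intersect and therefore do not form a connected subgraph of sources.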

Definition 6. (Inferred Nodes) Given a set of nodes S, the set of inferred nodes for S is denoted by I(S) and is defined as I(S) = S ∪ { x | (Y, x) is a correlation edge and Y ⊆ S }.
Definition 7. (Benefit of a Set of Nodes) The benefit of a set S with respect to a set of nodes M is denoted by B(S, M) and is defined as B(S, M) = |I(S) − I(M)| / |S − M|, where I(S) and I(M) are the sets of inferred nodes for S and M respectively.
(Figure: sets M and S with their inferred sets I(M) and I(S); B(S, M) = 3 / 1 = 3.)
Definition 6 defines inferred nodes. For a node set S, the inferred nodes of S include the nodes in S itself plus the nodes pointed to by correlation edges whose sources lie in S. In the figure, the yellow box marks the nodes in M, and the inferred nodes I(M) consist of the nodes belonging to M together with the nodes that correlation edges from M point to. Definition 7 defines the benefit of a set of nodes. The benefit of node set S with respect to node set M is |I(S) − I(M)| divided by |S − M|, that is, the number of additional inferred nodes divided by the number of additional nodes in the set. In the figure, the gain in inferred nodes is 3 and the number of nodes added to the set is 1, so the benefit of S with respect to M is 3.
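A small sketch of Definitions 6 and 7 follows. The function names and the example hyperedges are invented for illustration; the instance is built so that it reproduces the 3 / 1 = 3 value from the slide, but it is not the slide's actual figure.

```python
def inferred(S, corr_edges):
    """I(S): the nodes of S plus every node x with a hyperedge (Y, x), Y subset of S."""
    S = set(S)
    return S | {x for src, x in corr_edges if set(src) <= S}

def benefit(S, M, corr_edges):
    """B(S, M) = |I(S) - I(M)| / |S - M|; defined here as 0 when S adds no node."""
    S, M = set(S), set(M)
    added = S - M
    if not added:
        return 0.0
    gain = inferred(S, corr_edges) - inferred(M, corr_edges)
    return len(gain) / len(added)

# Hypothetical hyperedges: (sources, inferred node).
corr_edges = [
    (("m1", "m2"), "x1"),
    (("m2", "n"), "x2"),
    (("n",), "x3"),
]
M = {"m1", "m2"}
S = {"m1", "m2", "n"}

print(sorted(inferred(M, corr_edges)))
print(benefit(S, M, corr_edges))
```

Adding the single node n to M brings in n itself plus the two nodes x2 and x3 that become inferable, so the benefit is 3 new inferred nodes per added node.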

1st Phase: Constructing a near-optimal Correlation-Dominating Set
Initially, the set M contains the data-gathering node.
The algorithm iteratively adds to M the connected source set that has the maximum benefit with respect to M.
The phase terminates when the set M becomes a correlation-dominating set.
2nd Phase: Connecting the Correlation-Dominating Set
The algorithm iteratively connects the closest pair of connected components.
The time complexity of the algorithm is exponential in n (the number of nodes), since the number of connected source sets in the first phase can be exponential.
This is the centralized algorithm the paper proposes. It has two phases. The first phase builds a correlation-dominating set, which is not necessarily connected. Initially the set M contains only the data-gathering nodes, such as the sink. The algorithm then greedily and repeatedly adds to M the connected source set with the largest benefit, until M becomes a correlation-dominating set, which ends the first phase. The second phase makes this correlation-dominating set connected. It is also greedy: it repeatedly connects the two closest components of the correlation-dominating set until everything is connected. Because the number of connected source sets to try in the first phase can be exponential in n, the algorithm as a whole is exponential in n.
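The greedy loop of the first phase can be sketched as follows. This is a simplified illustration, not the paper's algorithm in full: the candidate connected source sets are assumed to be supplied by the caller instead of being enumerated, singleton sets are added so the loop can always make progress, the second (connecting) phase is omitted, and all names and the example data are invented.

```python
def inferred(S, corr_edges):
    """I(S): nodes of S plus nodes inferable from subsets of S."""
    S = set(S)
    return S | {x for src, x in corr_edges if set(src) <= S}

def benefit(S, M, corr_edges):
    """B(S, M) = |I(S) - I(M)| / |S - M| (0 when S adds no node)."""
    S, M = set(S), set(M)
    added = S - M
    if not added:
        return 0.0
    return len(inferred(S, corr_edges) - inferred(M, corr_edges)) / len(added)

def greedy_phase1(all_nodes, corr_edges, gathering_nodes, candidates):
    """Grow M greedily until I(M) covers every node in the network."""
    all_nodes = set(all_nodes)
    M = set(gathering_nodes)
    # Candidate pool: caller-supplied connected source sets plus singletons.
    pool = [set(c) for c in candidates] + [{n} for n in sorted(all_nodes)]
    while inferred(M, corr_edges) != all_nodes:
        best = max(pool, key=lambda c: benefit(c, M, corr_edges))
        if benefit(best, M, corr_edges) == 0:
            raise RuntimeError("no candidate improves coverage")
        M |= best
    return M

# Hypothetical network.
all_nodes = ["sink", "a", "b", "c", "d", "e"]
corr_edges = [(("a", "b"), "c"), (("a", "b"), "d"), (("a",), "e")]
candidates = [{"a", "b"}]

M = greedy_phase1(all_nodes, corr_edges, ["sink"], candidates)
print(sorted(M))
```

In this made-up instance the set {a, b} has benefit 5 / 2 = 2.5, higher than any singleton, so it is added first and immediately makes M a correlation-dominating set.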

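Polynomial-time Heuristics (l-hop Heuristic)
Based on the above algorithm.
At each stage, the algorithm constructs the connected source set fl(S) for each source set S, picks the fl(S) having the maximum benefit, and adds it to the selected set M.
The fl(S) is constructed in a greedy manner by merging S with the best source set that is at most l away from S in the intersection graph.
Example: 1-hop heuristic at the 1st stage. (Figure: source sets S1, S2, S3; f1(S2) has the maximum benefit.)
To reduce the complexity of the algorithm, the paper proposes a polynomial-time heuristic called the l-hop heuristic. It is basically the same as the centralized algorithm above. The difference is that the first phase no longer searches all possible connected source sets. Instead, at each stage it builds just one connected source set fl(S) for each source set S, and then picks the fl(S) with the largest benefit and adds it to the selected set M. The fl(S) of S is also built greedily: the source set with the best benefit within l hops of S is repeatedly merged into S until the benefit of fl(S) can no longer be increased. The example shows the 1-hop heuristic at the first stage. The connected source set f1(S1) of S1 is formed by merging S1 with S2. The f1(S2) of S2 is formed by merging S2 with S1 and then with S3, after which it stops. Because f1(S2) has the largest benefit with respect to M at this first stage, stage 1 adds f1(S2) to M. The remaining stages proceed in the same way until a correlation-dominating set is formed, and phase 2 then connects it to complete the algorithm.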

Performance Results (1)
Random Sensor Networks with Synthetic Correlation
1000 nodes
Area: 40 x 40 units
Transmission radius: 3 units
For each node s and a set of nodes S (1 to 3 nodes within at most d = 2 hops), the hyperedge (S, s) is added with probability P/100.
Simulation Environments
Correlation computation: K = 3, L = 5
Small network: 100 nodes, 7 x 7 area
Large network: 1000 nodes, 40 x 40 area
The experiments use both real and synthetic data. The synthetic data is built by placing 1000 nodes in an area of 40 x 40 units with a communication range of 3 units per node. For each node set of one to three nodes and each other node, a probability P decides whether a correlation edge exists between them. In the environment used to run the experiments on this synthetic data, K is set to 3, meaning a correlation edge is computed from only 3 time steps, and L is set to 5, meaning 5 source nodes are used to infer the data of a node. The small network has 100 nodes in a 7 x 7 area and the large network has 1000 nodes in a 40 x 40 area. The experiments are run on networks of these two sizes.

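Centralized Algorithm
(Figure: 100 nodes, 7 x 7 area with synthetically generated correlation.)
The first experiment compares the performance of the centralized algorithm. The figure shows that 1-hop and 2-hop perform almost the same, and when the communication range is large both are already very close to optimal. The authors therefore consider the 1-hop centralized algorithm sufficient for typical sensor networks.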

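Distributed Algorithm
(Figure: 1000 nodes, 40 x 40 area with synthetically generated correlation.)
The second experiment compares the distributed algorithms. In all three plots the horizontal axis is the probability parameter P that decides whether a correlation edge exists. In plot A the vertical axis is the size of the correlation-dominating set: handshake does slightly better than 2-round, but the difference is small. In plot B the vertical axis is the number of messages the algorithm has to transmit, and here handshake needs far more messages than 2-round. In plot C, the value q-theta is the energy the algorithm consumes divided by the number of nodes marked deleted, that is, the average share of the algorithm's energy cost borne by each deleted node, measured in number of messages. Only when the number of subsequent queries exceeds q-theta does the algorithm save energy for the sensor network as a whole. Plot C also shows that the number of queries in a typical network is much higher than the q-theta of either algorithm, so both distributed algorithms do save energy. However, the q-theta of handshake is higher than that of 2-round while their results differ little, so the authors still consider the 2-round algorithm sufficient.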
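The break-even reasoning behind q-theta is simple arithmetic and can be written out as follows. This sketch follows the description above (setup cost in messages divided by the number of deleted nodes); the function names and the numbers are invented and do not come from the paper's plots.

```python
def q_theta(setup_messages, deleted_nodes):
    """Average setup cost, in messages, carried by each deleted node."""
    if deleted_nodes <= 0:
        raise ValueError("at least one node must be deleted")
    return setup_messages / deleted_nodes

def saves_energy(num_queries, setup_messages, deleted_nodes):
    """The algorithm pays off once the number of queries exceeds q-theta."""
    return num_queries > q_theta(setup_messages, deleted_nodes)

# Made-up numbers: 12000 setup messages, 400 nodes marked deleted.
threshold = q_theta(12000, 400)
print(threshold)
print(saves_energy(25, 12000, 400))
print(saves_energy(500, 12000, 400))
```

With these made-up numbers the threshold is 30 queries: a deployment that answers only 25 queries does not recover the setup cost, while one that answers 500 does.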

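Simulation on Real Temperature Data
Average temperature of over 600 US cities.
Source set S consists of 1 to 3 nodes within at most distance d = 2 hops.
Error threshold: 5%
The third experiment runs the simulation on real data, using the average temperatures of about 600 US cities. A source set is restricted to one to three nodes within 2 hops, and the error threshold is set to 5%. The table shows that the resulting correlation-dominating sets are only about 2/3 of the original size, and the q-theta values of the distributed algorithms are also very small compared with typical numbers of queries.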

Conclusion
The paper considered the connected correlation-dominating set, which helps to minimize energy costs in data-gathering sensor networks.
The correlation structure (hypergraph) can capture general data correlation.
Finally, the conclusion. The paper proposes the connected correlation-dominating set to save energy in data-gathering networks, and the correlation structure it proposes can capture general data correlation. The experimental results also show that its centralized and distributed algorithms are effective.
