CDNs Content Outsourcing via Generalized Communities. Dimitrios Katsaros, Ph.D. Heraklion, March 20th, 2008. Dept. of Computer & Communication Engineering, University of Thessaly, and Dept. of Informatics, Aristotle University

2 Outline of the talk. A summary of my research. Latest results: “CDNs Content Outsourcing via Generalized Communities” (IEEE Transactions on Knowledge & Data Engineering). PRIMITIVE: Community Identification. METHOD: Content Outsourcing for CDNs. GOAL: Access Latency Reduction & Robustness

3 My Research Areas (chronological info)
WIRELESS NETWORKS
Mobile & Pervasive Computing. Data Management: Caching (’04), Air-Indexing (’07). Data Dissemination: Broadcast Scheduling (’04). Prediction: Mobility Prediction (’03+’08), Prefetching (’03)
Mobile Ad Hoc Networks: Content-based Multimedia Retrieval (’05+’08), Broadcasting (’06+’08)
Wireless Sensor Networks: Sensor Network Clustering (’07), (Distr+Local) Data Indexing (’06+’08), Cooperative Caching (’07+’08), Data Dissemination (’08)
WIRED NETWORKS
Conventional and Streaming Media Distribution in the Web: Replication (’03), Prefetching (’01+’02+’03), Caching (’04)
Overlay and P2P Networks: Content Distribution Networks (’05+’06), Content Placement in CDNs (’07+’08), Indexing & Query Routing in P2P (in progress), Distributed Structures over P2P (in progress)
Web Information Retrieval and Data Mining: Web Link Mining (’05), Web Ranking (’07+’08), Rank Aggregation (’07+’08), Social Network Analysis (’07+’08), Bibliometrics (’06+’07+’08)

4 Research areas: Ultimately ??? [Diagram labels: Overlay Nets; Mobile/Pervasive Computing; Sensors; Ad Hoc; Information Retrieval; Web; Location Tracking; Caching & Air-Indexing; Peer-to-Peer Networks; Content Distribution Networks; Caching & Prefetching & Replication & Semistructured Data & Web views; Web Ranking & Search Engines; Social Network Analysis; Cooperative Caching & Sensor Node Clustering & Distributed Indexing & Coverage/Connectivity & Flash storage & Content-Based MIR; Broadcasting & Data Dissemination; Webcasting; INTELLIGENCE; Pervasive Web]

5 Content Outsourcing. The problem: flash crowds. The solution: CDNs. Reactive vs. proactive solutions. Community identification. The CiBC algorithm. Evaluation

6 A problem… Feb 3, 2004: Google linked its banner to “julia fractals”. Users who clicked were directed to an Australian university web site. …The university’s network link was overloaded and the web server was taken down temporarily…

7 The problem strikes again! Feb 4, 2004: Slashdot ran the story about Google …Site taken down temporarily…again

8 The response from down under… later…Paul Bourke asks: “They have hundreds (thousands?) of servers worldwide that distribute their traffic load. If even a small percentage of that traffic is directed to a single server … what chance does it have?” → Help him ←

9 Existing approaches. Client-side proxying: Squid, Summary Cache, hierarchical cache, CoDeeN, Squirrel, Backslash, PROOFS, … Problem: not 100% coverage. Throw money at the problem: load-balanced servers, fast network connections. Problem: can’t afford it or don’t anticipate the need. Content Distribution Networks (CDNs): Akamai, Digital Island, Mirror Image, …

10 From Internet Mazes to … [Figure labels: End User, Origin Server]

11 Content distribution [Map labels: Sydney, Seattle, San Jose, Denver, Tokyo, Los Angeles, Hong Kong, Dallas, Miami, Atlanta, New York, Chicago, Paris, Stockholm, Zurich, Amsterdam, Toronto, Boston, Washington D.C., London, Singapore, Frankfurt]

12 Content Distribution Networks (CDNs)

13 Types of CDNs. Uncooperative pull: Akamai. Cooperative pull: Coral. Uncooperative push: X. Cooperative push: first proposed @ IEEE JSAC’03, and what is described here today

14 Comparison of outsourcing policies (columns: replication redundancy, communication cost, update cost, temporal coherency)
Uncooperative pull: High, Low
Cooperative pull: Low, High, Medium
Uncooperative push: High, Low, Medium
Cooperative push: Low, Medium, Low, High

15 Cooperative push. What to push? Frequently accessed content (IEEE JSAC’03). Hard to predict what will be popular! Popularity changes rapidly, too! Request statistics? That is a reactive approach. Can we devise a proactive solution? Where to store the pushed content? Easy; there are a lot of replica placement algorithms

16 Communities as “attractors”

17 Web-site communities DO exist [Figure: hollins.edu] (Antonis Sidiropoulos et al., WWW Journal, 11(1), 2008)

18 “Hard” (max-flow) communities COMMUNITY: a subset of the nodes of a graph, with the property that: (for each node of the community) The number of links to other nodes belonging to the community is larger than the number of links to nodes NOT belonging to the community

19 “Hard”, but inefficient

20 Generalized communities … COMMUNITY: a subset of the nodes of a graph, with the property that (summing over the nodes of the community) the sum of all degrees within the community is larger than the sum of all degrees toward the rest of the graph
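The two definitions (slides 18 and 20) differ only in where the comparison is made: per node for the “hard” community, over the whole subset for the generalized one. A minimal Python sketch of both checks follows; the adjacency-dictionary representation, the function names, and the toy graph are assumptions made for illustration and are not taken from the talk.

```python
def degree_split(adj, community, v):
    """Return (links of v into the community, links of v to the rest of the graph)."""
    inside = sum(1 for w in adj[v] if w in community)
    return inside, len(adj[v]) - inside


def is_hard_community(adj, community):
    """Hard definition: EVERY member has more links inside than outside."""
    return all(i > o for i, o in (degree_split(adj, community, v) for v in community))


def is_generalized_community(adj, community):
    """Generalized definition: the SUM of inside degrees exceeds the SUM of outside degrees."""
    pairs = [degree_split(adj, community, v) for v in community]
    return sum(i for i, _ in pairs) > sum(o for _, o in pairs)


# Hypothetical toy graph: a triangle a-b-c where c also links to x, y, z.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "x", "y", "z"},
    "x": {"c"}, "y": {"c"}, "z": {"c"},
}
triangle = {"a", "b", "c"}
print(is_hard_community(adj, triangle))         # c has 2 links inside, 3 outside
print(is_generalized_community(adj, triangle))  # 6 inside in total vs. 3 outside
```

In this toy graph the triangle fails the hard test because of node c alone, yet passes the generalized one, which is the kind of relaxation the generalized definition is meant to allow.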

21 Social Network Analysis. A social network is a social structure to describe social relations (Wikipedia). The history of social network analysis is older than everybody here (more than 100 years: Cooley 1909, Durkheim 1893) [book: Stanley Wasserman & Katherine Faust]. 1. Mathematical Representation. 2. Structural & Locational Properties (2.1 Centrality: Betweenness centrality). 3. Roles & Positions. 4. Dyadic & Triadic Methods

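22 Betweenness Centrality. $\sigma_{uw} = \sigma_{wu}$: the number of shortest paths from $u \in V$ to $w \in V$ ($\sigma_{uu} = 0$). $\sigma_{uw}(v)$: the number of shortest paths from $u$ to $w$ that some vertex $v \in V$ lies on. The Betweenness Centrality $NI(v)$ of a vertex $v$ is: $NI(v) = \sum_{u \neq v \neq w \in V} \frac{\sigma_{uw}(v)}{\sigma_{uw}}$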
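For reference, the quantity defined on this slide can be computed with Brandes' algorithm, which runs in O(nm) on unweighted graphs (the bound quoted later for Phase 1). The Python sketch below is a generic textbook version, not the authors' code; the adjacency-dictionary input and the function name are assumptions. It sums over ordered pairs (u, w), so on an undirected graph each unordered pair contributes twice.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted graph given as node -> iterable of neighbours."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:              # w discovered for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Hypothetical example: the path a - b - c.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(path))
```

On the three-node path only b lies between other nodes, once for the ordered pair (a, c) and once for (c, a).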

23 Betweenness Centrality in sample graphs [Figure: two sample graphs, one with nodes 1–20 and one with nodes A, B, C, P, Q, R, T, U, V, W, X, Y]

24 Betweenness Centrality in sample graphs. NI per node, first graph: 13 (0), 15 (0), 20 (0), 19 (0), 17 (1), 1 (0), 2 (0), 3 (68), 6 (0), 5 (0), 4 (96), 7 (156), 14 (233), 12 (0), 8 (26), 18 (97), 16 (131), 11 (0), 10 (0), 9 (0). NI per node, second graph: W (3.33), R (9.33), U (54), P (41), A (6.67), C (0), X (0), Y (0), T (1.33), V (1.33), Q (8), B (13). Nodes with large NI: articulation nodes (in bridges), e.g., 3, 4, 7, 16, 18; nodes with large fanout, e.g., 14, 8, U

25 Betweenness centrality in … [ WEB ] Performing graph clustering and recognizing communities in Web site graphs

26 CiBC Method. Target: … is true. CiBC method: building “cliques” and clusters around representative (pole) nodes (with low CB)

27 CiBC Method. Phase 1: NI computation, O(nm). Phase 2: Initialization of cliques, O(n). [Figure: sample graph with nodes 0–11]
NI index by node ID: 10: 20.68; 2: 19.61; 6: 11.38; 1: 10.28; 7: 2.06; 0: 1.73; 9: 0.99; 8: …; 4: 0.75; 5: 0.00; 11: 0.00

28 CiBC Method. Phase 2: Initialization of cliques, O(n). [Same sample graph and NI index table as slide 27]

32 CiBC Method. Phase 3: Clique Merging & Creation of Communities. Complexity: O(l^2), where l is the number of cliques. [Figure: sample graph with nodes 0–11, grouped into cliques A, B, C, D]
Clique matrix:
   A  B  C  D
A  3  3  0  0
B  3  3  1  1
C  0  1  3  4
D  0  1  4  3

33 CiBC Method. Phase 3: Clique Merging & Creation of Communities. [Same graph and matrix as slide 32, with the entries 4, 3, 4, 3 of cliques C and D highlighted]

34 CiBC Method. Phase 3: Clique Merging & Creation of Communities. [Figure: sample graph, cliques A, B, C]
Clique matrix:
   A  B  C
A  3  3  0
B  3  3  2
C  0  2  10

35 CiBC Method. Phase 3: Clique Merging & Creation of Communities. [Same graph and matrix as slide 34, with the entries 3, 3 in row A highlighted]

36 CiBC Method. Phase 3: Clique Merging & Creation of Communities. Phase 4: Check constraints. [Figure: sample graph, cliques A, C]
Clique matrix:
   A  C
A  9  2
C  2  10
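The three matrices on slides 32 to 36 are consistent with a simple update rule, sketched here in Python under that assumption: diagonal entries count links inside a clique, off-diagonal entries count links between two cliques, and merging two cliques adds their rows and columns together. The function name and the list-of-lists representation are hypothetical, and the rule that decides which pair to merge is not reproduced because the slides do not state it.

```python
def merge_cliques(matrix, i, j):
    """Merge clique j into clique i and return the new, smaller matrix.

    Assumed semantics: matrix[a][a] = links inside clique a,
    matrix[a][b] = links between cliques a and b (symmetric).
    """
    keep = [k for k in range(len(matrix)) if k != j]
    merged = []
    for a in keep:
        row = []
        for b in keep:
            if a == i and b == i:
                # Links inside the merged clique: both interiors plus the links between them.
                row.append(matrix[i][i] + matrix[j][j] + matrix[i][j])
            elif a == i:
                row.append(matrix[i][b] + matrix[j][b])
            elif b == i:
                row.append(matrix[a][i] + matrix[a][j])
            else:
                row.append(matrix[a][b])
        merged.append(row)
    return merged


# Matrix from slide 32, rows and columns ordered A, B, C, D.
slide_32 = [
    [3, 3, 0, 0],
    [3, 3, 1, 1],
    [0, 1, 3, 4],
    [0, 1, 4, 3],
]
after_cd = merge_cliques(slide_32, 2, 3)   # merge D into C
after_ab = merge_cliques(after_cd, 0, 1)   # merge B into A
print(after_cd)
print(after_ab)
```

Merging D into C and then B into A reproduces the matrices shown on slides 34 and 36 respectively, which is the only evidence offered here that the assumed rule matches the talk.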

37 Evaluation … Need for: Web site graphs; CDN topology; networking issues; request streams (roaming over the site graph). It is impossible to find real data for all of these … so we use simulators for each of them, to compensate for the lack of any of the above

38 Simulators. Web site graphs: simulating the growth process of the Web. Request streams: random surfer (following links + teleportation). CDN: CDNSim (http://oswinds.csd.auth.gr/~cdnsim/)
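A request stream of the “random surfer” kind can be sketched in a few lines of Python. This is only an illustration of the idea of following links with occasional teleportation; the function name, the 0.15 teleportation probability, and the dictionary-of-lists site graph are assumptions and do not describe the generator actually used in the experiments.

```python
import random


def random_surfer(links, steps, teleport=0.15, seed=0):
    """Generate a request stream by walking the site graph.

    links: page -> list of pages it links to (hypothetical representation).
    teleport: probability of jumping to a uniformly random page (assumed value).
    """
    rng = random.Random(seed)          # seeded so the stream is reproducible
    pages = sorted(links)
    current = rng.choice(pages)
    stream = []
    for _ in range(steps):
        stream.append(current)
        out = links[current]
        if not out or rng.random() < teleport:
            current = rng.choice(pages)        # teleportation (also used at dead ends)
        else:
            current = rng.choice(sorted(out))  # follow one outgoing hyperlink
    return stream


# Hypothetical three-page site.
site = {"index": ["about", "news"], "about": ["index"], "news": ["index", "about"]}
print(random_surfer(site, 10))
```

Setting `teleport=0.0` yields a pure link-following walk, which is convenient for checking that consecutive requests respect the hyperlink structure.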

39 Competing methods. Communities-based methods: Clique Percolation Method (CPM); Correlation Clustering Communities identification method (C3i). Simple Web Caching (LRU). No CDN (only the origin server). Full Replication

40 Metrics. Mean Response Time (MRT): the expected time for a request to be satisfied. Response time CDF: the Cumulative Distribution Function (CDF) gives the probability of a response time lower than or equal to a given response time. Replica Factor (RF): the percentage of replica objects stored across the whole CDN infrastructure with respect to the total outsourced objects. Byte Hit Ratio (BHR). Independent parameters: a) surrogates’ cache size; b) graph assortativity
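Three of these metrics can be computed directly from a simulation log, as in the Python sketch below. The slide does not define BHR, so the usual definition (bytes served from the surrogates' caches divided by all bytes requested) is assumed; the function names and the log format are likewise hypothetical.

```python
def mean_response_time(times):
    """MRT: arithmetic mean of the observed response times."""
    return sum(times) / len(times)


def response_time_cdf(times, t):
    """Empirical CDF: fraction of requests answered within time t."""
    return sum(1 for x in times if x <= t) / len(times)


def byte_hit_ratio(requests):
    """BHR under the assumed standard definition.

    requests: list of (size_in_bytes, served_from_cache) pairs.
    """
    total = sum(size for size, _ in requests)
    hit = sum(size for size, from_cache in requests if from_cache)
    return hit / total


# Hypothetical log: four response times (seconds) and two requests.
times = [1.0, 2.0, 3.0, 4.0]
print(mean_response_time(times))
print(response_time_cdf(times, 2.0))
print(byte_hit_ratio([(100, True), (300, False)]))
```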

41 Situations examined. Regular traffic: network delay dominates the other components. Flash crowd event: TCP setup delay + network delay dominate the other components

42 Regular traffic: MRT vs. comm. strength

43 Regular traffic: BHR vs. comm. strength

44 Regular traffic: MRT vs. cache size

45 Surge of requests: CiBC

46 Surge of requests: CPM

47 Surge of requests: C3i

48 Surge of requests: LRU

49 Discussion. CDNs: there is industrial interest in them. Content outsourcing: a significant issue. Proactive content outsourcing: discovery of communities; placement to surrogate servers. CiBC prevails

50 References. Our work: D. Katsaros, G. Pallis, K. Stamos, A. Sidiropoulos, A. Vakali, Y. Manolopoulos. “CDNs Content Outsourcing via Generalized Communities”. IEEE Transactions on Knowledge and Data Engineering, 2008. State-of-the-art competing method [CPM community identification method]: G. Palla, I. Derenyi, I. Farkas, and T. Vicsek. “Uncovering the overlapping community structure of complex networks in nature and society”. Nature, 435(7043):814–818, 2005.

Thanks to my collaborators at A.U.Th. Thank you for your attention! Questions?

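52 Generalized Web Page Community. A subgraph $U(V_u, E_u)$ of a Web site graph $G$ constitutes a Web page community if it satisfies $\sum_{v \in V_u} d^{in}(v, U) > \sum_{v \in V_u} d^{out}(v, U)$, where $d^{in}(v, U)$ is the number of links between $v$ and the other nodes of $U$ and $d^{out}(v, U)$ is the number of links between $v$ and the rest of $G$; this means that the sum of all degrees within the community $U$ is larger than the sum of all degrees toward the rest of the graph $G$. The hard Web page community (Flake et al.): a subgraph $U(V_u, E_u)$ of a Web site graph $G$ constitutes a Web page community if every node $v \in V_u$ satisfies the following criterion: $d^{in}(v, U) > d^{out}(v, U)$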

53 The CiBC Algorithm (1/4). The Web site is represented by a Web graph G = (V, E), whose nodes are the Web pages and whose edges depict the hyperlinks among Web pages. Input: the Web site graph. Output: a set of Web page communities; these communities constitute the set of objects that are outsourced to the surrogate servers. Phase I: Computation of Betweenness Centrality. The concept of Betweenness Centrality (BC) is used to select the pole nodes, i.e., the nodes with the lowest Betweenness Centrality. Phase II: Nodes Accumulation around Pole Nodes. Nodes are accumulated around the identified pole nodes by making use of Web graph properties; a set of Web page communities is created

54 The CiBC Algorithm (2/4) Betweenness Centrality (BC) reflects the amount of control exerted by a given Web page over the interactions between the other Web pages in the Web server content structure.

55 The CiBC Algorithm (3/4). Phase I: Computation of BC. Compute the Betweenness Centrality (BC) of the Web graph’s nodes. Nodes with high (low) BC reside at the center (borders) of the clusters. Sort the nodes in ascending order of their BC values

56 The CiBC Algorithm (4/4). Phase II: Accumulation around Pole Nodes. The pole node with the lowest BC value is selected first. It is checked whether it already belongs to a group. If not, it forms a distinct group and accumulates the nodes that are directly connected to it. The group is then expanded with the nodes traversed by the BFS algorithm; BFS expands the groups uniformly. The resulting kernel groups are processed (merged/deleted) in order to create generalized Web page communities
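The seeding step of Phase II can be sketched as follows, assuming the BC values from Phase I are already available as a dictionary. This Python fragment is a loose reading of the slide, not the published algorithm: it visits nodes in ascending BC order and lets each still-ungrouped pole absorb only its direct, still-ungrouped neighbours (a one-level BFS), and it omits the later merging and deletion of kernel groups. All names are hypothetical.

```python
def kernel_groups(adj, bc):
    """Seed groups around pole nodes, lowest BC first (simplified sketch).

    adj: node -> iterable of neighbours; bc: node -> betweenness value.
    """
    group_of = {}
    groups = []
    # Ascending BC order; the string of the node id breaks ties deterministically.
    for pole in sorted(adj, key=lambda v: (bc[v], str(v))):
        if pole in group_of:
            continue                      # already accumulated by an earlier pole
        members = [pole] + [w for w in sorted(adj[pole], key=str) if w not in group_of]
        for m in members:
            group_of[m] = len(groups)
        groups.append(members)
    return groups


# Hypothetical example: the path a - b - c - d - e with hand-computed BC values
# (ordered-pair counting, so the middle node c scores highest).
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
bc = {"a": 0.0, "b": 6.0, "c": 8.0, "d": 6.0, "e": 0.0}
print(kernel_groups(path, bc))
```

In this example the two border nodes a and e act as poles and pull in b and d, leaving the central node c as a singleton group for the subsequent merging phase to deal with.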

57 Performance Evaluation: Examined Methods
Clique Percolation Method (CPM): The outsourced objects obtained by the CPM correspond to k-clique percolation clusters in the network. A k-clique percolation cluster is a sub-graph containing k-cliques (complete sub-graphs of size k) that can all reach each other through chains of k-clique adjacency, where two k-cliques are said to be adjacent if they share k - 1 nodes. Experiments have shown that this method is quite efficient when it is applied to large graphs.
Web caching scheme (LRU): The objects are stored reactively in proxy cache servers. We consider that each proxy cache server follows the LRU (Least Recently Used) replacement policy, since it is the typical policy of popular proxy servers (e.g., Squid).
No Replication (W/O CDN): All the objects are placed on the origin server and there are no CDN or proxy servers. This policy represents the “worst-case” scenario.
Full Replication (FR): All the objects have been outsourced to all the CDN’s surrogate servers. This (unrealistic) policy represents the “optimal-case” scenario.
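The k-clique percolation definition above translates almost literally into code. The Python sketch below is a brute-force illustration for small graphs only: it enumerates every k-subset of nodes, so it is nowhere near the efficiency claimed for the real method. The function name and graph representation are assumptions.

```python
from itertools import combinations


def k_clique_communities(adj, k):
    """Brute-force k-clique percolation on a small undirected graph.

    adj: node -> set of neighbours (symmetric). Returns sorted lists of nodes.
    """
    # 1. Enumerate all k-cliques (complete sub-graphs of size k).
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(v in adj[u] for u, v in combinations(c, 2))]

    # 2. Union-find over cliques; adjacent cliques share exactly k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # 3. Each connected chain of adjacent cliques is one community.
    merged = {}
    for i, c in enumerate(cliques):
        merged.setdefault(find(i), set()).update(c)
    return sorted(sorted(m) for m in merged.values())


# Hypothetical graph: triangles {1,2,3} and {2,3,4} share an edge;
# triangle {5,6,7} hangs off node 4 by a single link.
graph = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5},
         5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6}}
print(k_clique_communities(graph, 3))
```

With k = 3 the two triangles that share the edge 2-3 percolate into one community, while the third triangle stays separate because the single link 4-5 is not part of any 3-clique.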

58 Performance evaluation parameters Simulation Testbed

59 Content Replication Problem. Lat-cdn: the outsourced objects are placed on surrogate servers with respect to the total network latency, without taking into account the objects’ popularity (La-Web 2005). il2p: the outsourced objects are placed on surrogate servers by integrating both the network latency and the objects’ load (ICDE workshop 2006)

60 Problem Formulation: Content Replication Problem. The content replication problem is to select the optimal placement $x$ such that it minimizes $\sum_{i=1}^{N} \lambda_i \sum_{k=1}^{K} p_k D_{ik}(x)$. $D_{ik}(x)$ is the “distance” to a replica of object $k$ from surrogate server $i$ under the placement $x$; the distance reflects the latency (the elapsed time between when a user issues a request and when it receives the response). $N$ is the number of surrogate servers, $K$ is the number of outsourced objects, $\lambda_i$ is the request rate for surrogate server $i$, and $p_k$ is the probability that a client will request the object $k$
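To make the roles of the symbols concrete, the Python sketch below evaluates a placement cost, assuming the objective is the request-rate-weighted and popularity-weighted sum of the distances, i.e. the sum over servers i and objects k of λ_i · p_k · D_ik(x). That form is an assumption based on the symbol definitions on this slide, and the function name and input layout are hypothetical.

```python
def placement_cost(request_rate, popularity, distance):
    """Weighted latency of one fixed placement x (assumed objective).

    request_rate[i]: lambda_i, request rate at surrogate server i
    popularity[k]:   p_k, probability that object k is requested
    distance[i][k]:  D_ik(x), latency from server i to the nearest replica of k
    """
    return sum(
        lam * sum(p * distance[i][k] for k, p in enumerate(popularity))
        for i, lam in enumerate(request_rate)
    )


# Hypothetical numbers: 2 surrogate servers, 2 equally popular objects.
cost = placement_cost([1, 2], [0.5, 0.5], [[2, 4], [6, 8]])
print(cost)  # 1*(0.5*2 + 0.5*4) + 2*(0.5*6 + 0.5*8) = 17.0
```

Comparing this value across candidate placements is what a placement heuristic is implicitly trying to reduce.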

61 The Lat-cdn Algorithm: The Flowchart. Start: all the “outsourced objects” are stored in the origin server and all the CDN’s surrogate servers are empty. Step 1: for each outsourced object, find the best surrogate server in which to place it (the one that produces the minimum network latency). Step 2: from all the outsourced object / surrogate server pairs produced by the previous step, select the one that produces the largest network latency, and place this object on that surrogate server. Surrogate servers become full? No: repeat from Step 1. Yes: the final placement. [Flowchart labels: CDN Infrastructure, outsourced objects]
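The loop in the flowchart can be sketched in Python as below. This is a simplified reading, not the published Lat-cdn implementation: it assumes a static, precomputed table `latency[k][s]` for placing object k on server s (the real algorithm would presumably re-evaluate latencies after each placement), and the names, capacities, and sizes are hypothetical.

```python
def lat_cdn_sketch(latency, sizes, capacity):
    """Greedy placement following the flowchart (simplified, static latencies).

    latency[k][s]: latency produced if object k is placed on server s (assumed given)
    sizes[k]:      size of object k
    capacity[s]:   storage capacity of surrogate server s
    """
    free = dict(capacity)                       # surrogate servers start empty
    placement = {s: [] for s in capacity}
    while True:
        pairs = []
        for k in sorted(sizes):
            fits = [s for s in sorted(free)
                    if sizes[k] <= free[s] and k not in placement[s]]
            if fits:
                # Step 1: best server for k = minimum network latency.
                pairs.append((k, min(fits, key=lambda s: latency[k][s])))
        if not pairs:
            return placement                    # servers are full: final placement
        # Step 2: among those pairs, place the one with the largest latency.
        k, s = max(pairs, key=lambda pair: latency[pair[0]][pair[1]])
        placement[s].append(k)
        free[s] -= sizes[k]


# Hypothetical input: two unit-size objects, two servers with room for one object each.
latency = {"a": {"s1": 5, "s2": 7}, "b": {"s1": 3, "s2": 9}}
print(lat_cdn_sketch(latency, {"a": 1, "b": 1}, {"s1": 1, "s2": 1}))
```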

62 The il2p (integration of latency and load object placement) Algorithm. Main idea: considering that all the outsourced objects are initially placed on an origin server, the content replication problem is separated into two sub-problems: (1) choice of the best surrogate server to replicate an outsourced object (based on the network’s latency); (2) arrangement of priorities for outsourced objects replication (based on the objects’ load)

63 The il2p Algorithm. Arrangement of priorities for outsourced objects replication: from the objects assigned to a single server, we replicate the object $k$ which has the maximum utility value, $\text{Utility\_Value}_k = \text{load}_k \cdot \text{latency}_k$, where $\text{load}_k = \text{access\_rate}_k \cdot s_k$. $\text{latency}_k$ is the latency that the object $k$ produces if it is replicated to the surrogate server which has been determined by the previous step, $\text{load}_k$ is the total load due to object $k$, $\text{access\_rate}_k$ is defined as the number of accesses of object $k$ per unit time, and $s_k$ is the size of object $k$

64 The il2p Algorithm: The Flowchart. Start: all the “outsourced objects” are stored in the origin server and all the CDN’s surrogate servers are empty. Step 1: for each outsourced object, find the best surrogate server in which to place it (the one that produces the minimum network latency). Step 2: from all the outsourced object / surrogate server pairs produced by the previous step, select the one with the maximum utility value, and place this object on that surrogate server. Surrogate servers become full? No: repeat from Step 1. Yes: the final placement. [Flowchart labels: CDN Infrastructure, outsourced objects]
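A Python sketch of this flowchart follows, using the utility value from slide 63 (access rate times size times latency) to rank the candidate pairs. As a simplifying assumption, latencies come from a static table `latency[k][s]` rather than being recomputed after each placement; the function name and the sample numbers are hypothetical and this is not the authors' implementation.

```python
def il2p_sketch(latency, sizes, access_rate, capacity):
    """Greedy placement ranked by utility = load * latency (simplified sketch).

    latency[k][s]:  latency produced if object k is placed on server s (assumed given)
    sizes[k]:       s_k, size of object k
    access_rate[k]: accesses of object k per unit time
    capacity[s]:    storage capacity of surrogate server s
    """
    free = dict(capacity)                       # surrogate servers start empty
    placement = {s: [] for s in capacity}

    def utility(k, s):
        load = access_rate[k] * sizes[k]        # load_k = access_rate_k * s_k
        return load * latency[k][s]             # Utility_Value_k = load_k * latency_k

    while True:
        pairs = []
        for k in sorted(sizes):
            fits = [s for s in sorted(free)
                    if sizes[k] <= free[s] and k not in placement[s]]
            if fits:
                # Step 1: best server for k = minimum network latency.
                pairs.append((k, min(fits, key=lambda s: latency[k][s])))
        if not pairs:
            return placement                    # servers are full: final placement
        # Step 2: place the pair with the maximum utility value.
        k, s = max(pairs, key=lambda pair: utility(pair[0], pair[1]))
        placement[s].append(k)
        free[s] -= sizes[k]


# Hypothetical input: object b is larger and requested twice as often as a.
latency = {"a": {"s1": 5, "s2": 7}, "b": {"s1": 3, "s2": 4}}
print(il2p_sketch(latency, {"a": 1, "b": 2}, {"a": 1, "b": 2}, {"s1": 2, "s2": 1}))
```

In this example the heavily loaded object b is placed first even though its latency is lower than that of a, which is the behavioural difference from a latency-only ranking.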
