1 Clustering Web Content for Efficient Replication Yan Chen, Lili Qiu*, Weiyu Chen, Luan Nguyen, Randy H. Katz EECS Department UC Berkeley *Microsoft Research.

2 Motivation
Amazing growth in WWW traffic
–Daily growth of roughly 7M Web pages
–Annual growth of 200% predicted for next 4 years
Content Distribution Network (CDN) commercialized to improve Web performance
–Un-cooperative pull-based replication
Paradigm shift: cooperative push more cost-effective
–Push replicas with greedy algorithms can achieve close to optimal performance [JJKRS01, QPV01]
–Improving availability during flash crowds and disasters
Orthogonal issue: scalability
–Per Website? Per URL? -> Clustering!
–Clustering based on aggregated clients’ access patterns
Adapt to users’ dynamic access patterns
–Incremental clustering (online and offline)

3 Outlines
Motivation
Architecture
Related Work
Problem Formulation
Simulation methodology
Granularity of replication
Dynamic clustering and replication
Conclusions

4 Conventional CDN: Un-cooperative Pull
[Figure: Client 1 in ISP 1 and Client 2 in ISP 2, each with a local DNS server and a local CDN server; a CDN name server; a Web content server]
1. GET request
2. Request for hostname resolution
3. Reply: local CDN server IP address
4. Local CDN server IP address
5. GET request
6. GET request if cache miss
7. Response
8. Response
Big waste of replication!

5 Cooperative Push-based CDN
[Figure: Client 1 in ISP 1 and Client 2 in ISP 2, each with a local DNS server and a local CDN server; a CDN name server; a Web content server]
0. Push replicas
1. GET request
2. Request for hostname resolution
3. Reply: nearby replica server or Web server IP address
4. Redirected server IP address
5. GET request
5. GET request if no replica yet
6. Response
Significantly reduces # of replicas and, consequently, the update cost (only 4% of un-coop pull)

7 Related Work
Much existing work models replica placement as an NP-hard problem and proposes greedy algorithms
Ignores the scalability problem
Clustering of Web contents based on individuals’ access patterns for
–Pre-fetching, Web organization, etc.
Little on the dynamics of replica placement / clustering

8 Problem Formulation
Subject to the total replication cost (e.g., # of URL replicas)
Find a scalable, adaptive replication strategy to reduce avg access cost
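
The formulation above can be illustrated with a greedy placement loop of the kind the Motivation slide refers to. This is only a minimal sketch under stated assumptions, not the authors' algorithm: the names `total_cost` and `greedy_placement`, the per-client `demand` weights, the `dist` lookup table, and the rule that the origin server always holds a copy are all hypothetical choices made for illustration.

```python
def total_cost(demand, dist, sites):
    """Sum over clients of demand times distance to the closest site holding a copy."""
    return sum(d * min(dist[c][s] for s in sites) for c, d in demand.items())


def greedy_placement(demand, dist, candidates, origin, budget):
    """Place up to `budget` replicas, one per round, each time choosing the
    candidate server that lowers the total access cost the most.
    Assumption: the origin server always holds a copy and is not counted in the budget."""
    placed = [origin]
    for _ in range(budget):
        remaining = [s for s in candidates if s not in placed]
        if not remaining:
            break
        best = min(remaining, key=lambda s: total_cost(demand, dist, placed + [s]))
        # Stop early when another replica no longer reduces the cost.
        if total_cost(demand, dist, placed + [best]) >= total_cost(demand, dist, placed):
            break
        placed.append(best)
    return placed


# Hypothetical toy input: two client groups, an origin "o", two candidate servers.
demand = {"a": 10, "b": 1}
dist = {"a": {"o": 5, "s1": 1, "s2": 4}, "b": {"o": 1, "s1": 5, "s2": 2}}
placement = greedy_placement(demand, dist, ["s1", "s2"], "o", budget=1)
```

The replication cost constraint appears here as `budget`; the access cost being reduced is `total_cost`.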

10 Simulation Methodology
Network Topology
–Pure-random, Waxman & transit-stub models from GT-ITM
–A real AS-level topology from 7 widely-dispersed BGP peers
Web Workload

Web Site   Period         Duration   # Requests (avg–min–max)   # Clients (avg–min–max)   # Client groups (avg–min–max)
MSNBC      Aug-Oct/1999   10–11am    1.5M–642K–1.7M             129K–69K–150K             15.6K–10K–17K
NASA       Jul-Aug/1995   All day    79K–61K–101K

–Aggregate MSNBC Web clients with BGP prefix
»BGP tables from a BBNPlanet router
»10K groups left, chooses top 10% covering >70% of requests
–Aggregate NASA Web clients with domain names
–Map the client groups onto the topology
Performance Metric: average retrieval cost
–Sum of edge costs from client to its closest replica
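
The performance metric can be sketched as a multi-source shortest-path computation. The code below is an illustrative assumption rather than the simulator used in the study: the graph encoding (a dict of neighbor-to-edge-cost dicts), the function names, and the weighting of each client group by its request count are hypothetical.

```python
import heapq


def dist_to_closest_replica(graph, replicas):
    """Dijkstra started from all replicas at once: for every node, the sum of
    edge costs along the cheapest path to its closest replica."""
    dist = {r: 0 for r in replicas}
    heap = [(0, r) for r in replicas]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def average_retrieval_cost(graph, replicas, requests):
    """Request-weighted mean of the client-to-closest-replica cost.
    Assumption: every client node is reachable from some replica."""
    dist = dist_to_closest_replica(graph, replicas)
    total = sum(requests.values())
    return sum(n * dist[c] for c, n in requests.items()) / total


# Hypothetical three-node line topology a - b - c with edge costs 1 and 2.
graph = {"a": {"b": 1}, "b": {"a": 1, "c": 2}, "c": {"b": 2}}
requests = {"b": 3, "c": 1}
```

Adding a replica at node `c` in this toy topology lowers the average because the clients at `c` no longer cross any edge.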

12 [Figure panels: Per Web site; Per URL]

13 Replica Placement: Per Website vs. Per URL
60 – 70% average retrieval cost reduction for Per URL scheme
Per URL is too expensive for management!

Replication Scheme   States to Maintain   Computation Cost
Per Website          O(R)                 f × O(R × S × C)
Per Cluster          O(R × K + M)         f × O(K × R × (K + S × C))
Per URL              O(R × M)             f × O(M × R × (M + S × C))

Where R: # of replicas/URL, K: # of clusters, M: # of URLs (M >> K), C: # of clients, S: # of CDN servers, f: placement adaptation frequency
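
To make the "States to Maintain" column concrete, the sketch below plugs numbers into the three expressions. The values R = 5, K = 100 and M = 10,000 are hypothetical and chosen only for illustration, and the big-O expressions are treated as exact counts, which they are not.

```python
def states_to_maintain(scheme, R, K, M):
    """Evaluate the state-size expressions from the table, ignoring constant factors."""
    if scheme == "per_website":
        return R
    if scheme == "per_cluster":
        return R * K + M
    if scheme == "per_url":
        return R * M
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical sizes: 5 replicas per URL, 100 clusters, 10,000 URLs.
sizes = {s: states_to_maintain(s, R=5, K=100, M=10_000)
         for s in ("per_website", "per_cluster", "per_url")}
```

With these assumed sizes the per-cluster scheme keeps roughly a fifth of the state of the per-URL scheme, and the gap widens as R grows.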

14 Clustering Web Content
General clustering framework
–Define the correlation distance between URLs
–Cluster diameter: the max distance between any two members
»Worst correlation in a cluster
–Generic clustering: minimize the max diameter of all clusters
Correlation distance definition based on
–Spatial locality
–Temporal locality
–Popularity
–Semantics (e.g., directory)
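
The framework above fixes the objective (small maximum diameter) but the transcript does not say which algorithm is used to reach it. As one possible illustration, the sketch below uses farthest-first traversal, a standard heuristic for this kind of objective; it is a swapped-in technique, not necessarily the one used in the study, and the function names are hypothetical.

```python
def diameter(cluster, dist):
    """Max correlation distance between any two members (0 for a singleton)."""
    return max((dist(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]),
               default=0.0)


def farthest_first_clusters(items, k, dist):
    """Pick k centers, each one as far as possible from the centers chosen so far,
    then assign every item to its closest center."""
    centers = [items[0]]
    while len(centers) < min(k, len(items)):
        centers.append(max(items, key=lambda x: min(dist(x, c) for c in centers)))
    clusters = [[] for _ in centers]
    for x in items:
        idx = min(range(len(centers)), key=lambda i: dist(x, centers[i]))
        clusters[idx].append(x)
    return clusters


# Toy example: "URLs" are points on a line, distance is the absolute difference.
toy = farthest_first_clusters([0, 1, 10, 11], 2, lambda a, b: abs(a - b))
```

Any correlation distance can be passed as `dist`, so the same routine works for the spatial, temporal, or popularity-based definitions.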

15 Spatial Clustering
Correlation distance between two URLs defined as
–Euclidean distance
–Vector similarity
URL spatial access vector
–Blue URL
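
The formulas on this slide are not preserved in the transcript, so the sketch below shows only the textbook versions of the two measures. It assumes that entry i of a URL's spatial access vector counts the accesses from client group i, and that vector similarity means cosine similarity turned into a distance as 1 minus the similarity; both are assumptions, as are the function names.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two spatial access vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; 0 when two URLs are accessed in the same proportions
    by every client group. Returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

The two measures behave differently: Euclidean distance separates URLs with the same access mix but different volumes, while the cosine form ignores volume and compares only the mix.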

16 Clustering Web Content (cont’d)
Popularity-based clustering
–OR even simpler, sort them and put the first N/K elements into the first cluster, etc. - binary correlation
Temporal clustering
–Divide traces into multiple individuals’ access sessions [ABQ01]
–In each session,
–Average over multiple sessions in one day
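
The "even simpler" popularity variant can be sketched directly: rank URLs by access count and cut the ranking into K consecutive groups of N/K URLs. The function name and the rounding of N/K up to a whole number are assumptions.

```python
def popularity_clusters(access_counts, k):
    """Sort URLs by descending access count and put the first ceil(N/K) into the
    first cluster, the next ceil(N/K) into the second, and so on."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    if not ranked:
        return []
    size = -(-len(ranked) // k)  # ceiling of N / K
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


# Hypothetical access counts for five URLs.
counts = {"/a": 100, "/b": 90, "/c": 10, "/d": 5, "/e": 1}
groups = popularity_clusters(counts, 2)
```

Because the groups are cut from a sorted list, two URLs are either in the same popularity band or not, which matches the "binary correlation" remark on the slide.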

17 Performance of Cluster-based Replication
Tested over various topologies and traces
Spatial clustering with Euclidean distance and popularity-based clustering perform the best
–Even small # of clusters (with only 1-2% of # of URLs) can achieve close to per-URL performance, with much less overhead
MSNBC, 8/2/1999, 5 replicas/URL

18 Effects of the Non-Uniform Size of URLs
Replication cost constraint: bytes
Similar trends exist
–Per URL based replication outperforms per Website dramatically
–Spatial clustering with Euclidean distance and popularity-based clustering are very cost-effective

19 Outlines
Motivation
Architecture
Related Work
Problem Formulation
Simulation methodology
Granularity of replication
Dynamic clustering and replication
–Static clustering
–Incremental clustering
Conclusions

20 Static clustering and replication
Two daily traces: old trace and new trace
Static clustering performs poorly beyond a week
–Average retrieval cost almost doubles

Methods                       Static 1   Static 2   Optimal
Traces used for clustering    Old        Old        New
Traces used for replication   Old        New        New
Traces used for evaluation    New        New        New

21 Incremental Clustering
Generic framework
1. If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c
2. Else create new clusters and replicate them
Online incremental clustering
–Push before accessed -> high availability
–Predict access patterns based on semantics
–Simplify to popularity prediction
–Groups of URLs with similar popularity? Use hyperlink structures!
»Groups of siblings
»Groups of the same hyperlink depth: smallest # of links from root
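
The two-step generic framework can be sketched as a loop with a pluggable matching rule. Everything named here is hypothetical: `match` stands for whichever rule decides cluster membership, `place_new_cluster` stands for whichever placement routine is used, replica sets are stored per cluster (so adding a URL to a cluster implicitly replicates it to that cluster's servers), and all unmatched URLs are simplified into a single new cluster.

```python
def incremental_update(new_urls, clusters, replicas, match, place_new_cluster):
    """clusters: dict cluster_id -> list of URLs
    replicas: dict cluster_id -> set of servers holding that cluster
    match(url, clusters) -> cluster_id or None"""
    orphans = []
    for u in new_urls:
        c = match(u, clusters)
        if c is not None:
            # Step 1: join the existing cluster; u inherits replicas[c].
            clusters[c].append(u)
        else:
            orphans.append(u)
    if orphans:
        # Step 2 (simplified): one new cluster for all unmatched URLs.
        new_id = max(clusters, default=-1) + 1
        clusters[new_id] = orphans
        replicas[new_id] = place_new_cluster(orphans)
    return clusters, replicas


# Hypothetical toy rule: URLs under /news/ belong to cluster 0.
clusters = {0: ["/news/1"], 1: ["/sports/1"]}
replicas = {0: {"s1"}, 1: {"s2"}}
clusters, replicas = incremental_update(
    ["/news/2", "/weather/1"], clusters, replicas,
    match=lambda u, cl: 0 if u.startswith("/news/") else None,
    place_new_cluster=lambda urls: {"s3"},
)
```

Only the orphan URLs trigger a new placement decision, which is what keeps the incremental step cheaper than re-clustering everything.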

22 Online Popularity Prediction
Experiments
–Use WebReaper to crawl on 5/3/2002 with hyperlink depth 4, then group the URLs
–Use corresponding access logs to analyze the correlation
–Groups of siblings have the best correlation
Measure the divergence of URL popularity within a group: access freq span =

23 Online Incremental Clustering
Semantics-based incremental clustering
–Put new URL into the existing cluster with the largest # of siblings
–When there is a tie, choose the cluster with more replicas
Simulation on 5/3/2002 MSNBC
–8-10am trace: static popularity clustering + replication
–At 10am: 16 new URLs emerged - online incremental clustering + replication
–Evaluation with 10am-12pm trace: the 16 URLs have 33,262 requests
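
The sibling rule and its tie-break can be sketched as a single ranking over clusters. This assumes that siblings are URLs sharing the same parent page in the crawled hyperlink structure; the `parent_of` map, the `replica_count` map, and the function name are hypothetical.

```python
def assign_by_siblings(new_url, parent_of, clusters, replica_count):
    """Return the id of the cluster holding the most siblings of new_url;
    ties are broken in favor of the cluster with more replicas."""
    parent = parent_of[new_url]

    def score(cid):
        siblings = sum(1 for u in clusters[cid] if parent_of.get(u) == parent)
        return (siblings, replica_count[cid])

    return max(clusters, key=score)


# Hypothetical hyperlink parents and clusters.
parent_of = {"/new": "/p1", "/a": "/p1", "/b": "/p1", "/c": "/p2", "/d": "/p1", "/e": "/p1"}
clusters = {0: ["/a", "/b"], 1: ["/c"], 2: ["/d", "/e"]}
replica_count = {0: 3, 1: 9, 2: 5}
chosen = assign_by_siblings("/new", parent_of, clusters, replica_count)
```

In this sketch a URL with no siblings in any cluster falls through to the cluster with the most replicas, which is a choice made here and not stated on the slide.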

24 Online Incremental Clustering & Replication Results

25 Offline Incremental Clustering
Assume access history as input
Study spatial clustering and popularity-based clustering
For instance, spatial clustering with Euclidean distance
[Figure: cluster with center c and radius r; new URL u at distance s from c]
Find the closest c for new URL u
Match if (s < r)
More than 98% of new URLs match with old clusters
Cluster orphan URLs with diameter of d max
Replicate them with the average replicas/URL
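
The matching test s < r can be sketched for spatial clustering with Euclidean distance. The transcript does not define the cluster center or radius, so this sketch assumes the center is the mean of the members' access vectors and the radius is the largest member-to-center distance; the function names are hypothetical as well.

```python
import math


def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def center_and_radius(vectors):
    """Assumed definitions: center = mean vector, radius = max distance to the center."""
    n = len(vectors)
    center = [sum(col) / n for col in zip(*vectors)]
    radius = max(euclid(center, v) for v in vectors)
    return center, radius


def match_cluster(new_vec, clusters):
    """Find the closest cluster center c for the new URL's vector; return its id
    if the distance s is below that cluster's radius r, else None (orphan)."""
    best = None
    for cid, vectors in clusters.items():
        center, r = center_and_radius(vectors)
        s = euclid(center, new_vec)
        if best is None or s < best[1]:
            best = (cid, s, r)
    if best is not None and best[1] < best[2]:
        return best[0]
    return None


# Hypothetical access vectors over two client groups.
old_clusters = {0: [[0, 0], [2, 0]], 1: [[10, 10], [10, 12]]}
```

Under these assumed definitions a single-member cluster has radius 0 and never matches anything, so orphan handling matters for sparse clusters.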

26 Offline Incremental Clustering Results
Performance close to optimal
With only 25-45% replication cost

27 Conclusions
CDN operators: cooperative, clustering-based replication
–Cooperative: big savings on replica management and update cost
–Per URL replication outperforms per Website scheme by 60-70%
–Clustering solves the scalability issues, and gives the full spectrum of flexibility
»Spatial clustering and popularity-based clustering recommended
To adapt to users’ access patterns: incremental clustering
–Hyperlink-based online incremental clustering for
»High availability
»Performance improvement
–Offline incremental clustering performs close to optimal

28 Performance of Cluster-based Replication
Tested over various topologies and traces
Spatial clustering with Euclidean distance and popularity-based clustering perform the best
–Even small # of clusters (with only 1-2% of # of URLs) can achieve close to per-URL performance
MSNBC, 8/2/1999, 5 replicas/URL
NASA, 7/1/1995, 3 replicas/URL