Junjie Xie 1, Yuhui Deng 1, Ke Zhou 2
1 Department of Computer Science, Jinan University
2 School of Computer Science & Technology, Huazhong University of Science & Technology
NPC 2013: The 10th IFIP International Conference on Network and Parallel Computing. October 2013, Guiyang, China.

Outline: Motivation, Challenges, Related work, Our idea, System architecture, Evaluation, Conclusion.

The Explosive Growth of Data ⇒ Large Data Centers
- Industrial manufacturing, e-commerce, social networks...
- IDC: 1,800 EB of data in 2011, growing 40-60% annually
- YouTube: 72 hours of video are uploaded per minute
- Facebook: 1 billion active users upload 250 million photos per day

Feb. 2011, Science: "On the Future of Genomic Data". Feb. 2011, Science: "Climate Data Challenges in the 21st Century". Jim Gray: the global amount of information would double every 18 months (1998).

IDC report: most of the data will be stored in data centers.
Large Data Center ⇒ Scalability
- Google: 19 data centers, >1 million servers
- Facebook, Microsoft, Amazon...: >100k servers
Large Data Center ⇒ Fault Tolerance
- Google MapReduce: 5 nodes fail during a job; 1 disk fails every 6 hours
(Image: a Google data center.)
Therefore, the data center network has to be very scalable and fault-tolerant.

Tree-based structure: bandwidth bottleneck, single points of failure, expensive.
Fat-tree: high capacity, but limited scalability.
(Figures: a tree-based structure and a fat-tree.)

DCell: scalable, fault-tolerant, high capacity, but complex and expensive.
DCell is a level-based, recursively defined interconnection structure. It requires multiport (e.g., 3, 4 or 5) servers. DCell scales doubly exponentially with the server node degree. It is also fault-tolerant and supports high network capacity.
Downside: it trades the expensive core switches/routers for multiport NICs and a higher wiring cost.
C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang and S. Lu. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers. In: Proc. of ACM SIGCOMM'08, Aug. 2008.

FiConn: scalable, fault-tolerant, but low capacity.
FiConn utilizes servers with two built-in ports and low-end commodity switches to form the structure. It has a lower wiring cost than DCell. Routing in FiConn also makes balanced use of links at different levels and is traffic-aware to better utilize the link capacities.
Downside: it has lower aggregate network capacity.
Other architectures: PortLand, VL2, CamCube...
D. Li, C. Guo, H. Wu, K. Tan, and S. Lu. FiConn: Using Backup Port for Server Interconnection in Data Centers. In: Proc. of IEEE INFOCOM.

What we achieve with Totoro:
- Scalability: millions of servers
- Fault tolerance: structure & routing
- Low cost: commodity devices
- High capacity: multi-redundant links
(Figure: the Totoro structure of one level.)

(Figure: a Totoro structure with N = 4, n = 4, K = 2.)

Architecture:
- Two-port servers (two-port NICs)
- Low-end switches
- Recursively defined by a building algorithm
(Figure: a k-level Totoro built from two-port servers.)

Connect N servers to an N-port switch (here, N = 4). This switch is called an intra-switch, and the resulting basic partition is a Totoro_0.
(Figure: a Totoro_0 structure.)

A Totoro_0 has c available (unused) server ports; here, c = 4. Connect n Totoro_0s to n-port switches, called inter-switches, using c/2 of those ports. A Totoro_1 structure consists of n Totoro_0s.

Connect n Totoro_{i-1}s to n-port switches to build a Totoro_i; the structure is recursively defined. Only half of the available backup ports are used at each level ⇒ open & scalable. The number of paths among Totoro_i s is n/2 times the number of paths among Totoro_{i-1}s ⇒ multi-redundant links ⇒ high network capacity.
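As a concrete check of this rule (my own arithmetic, consistent with the N = 4, n = 4, K = 2 figure earlier): a Totoro_0 has 4 servers, so c = 4 free backup ports. Building a Totoro_1 uses c/2 = 2 ports from each of the n = 4 Totoro_0s, i.e. 8 level-1 links on two 4-port inter-switches, leaving 8 backup ports free. Building a Totoro_2 then uses half of those again (4 per Totoro_1) across four level-2 inter-switches, giving 16 level-2 links against 8 level-1 links per Totoro_1 - exactly the n/2 = 2 growth factor stated above.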

Building Algorithm

TotoroBuild(N, n, K) {
    t_K = N * n^K                                   /* total number of servers */
    /* each server is labeled [a_K, a_{K-1}, ..., a_1, a_0] */
    for tid = 0 to t_K - 1 {
        for i = 0 to K - 1
            a_{i+1} = floor(tid / (N * n^i)) mod n
        a_0 = tid mod N
        intra-switch = (0 - a_K, a_{K-1}, ..., a_1, a_0)
        Connect(server, intra-switch)
        for i = 1 to K {
            if ((tid - 2^{i-1} + 1) mod 2^i == 0) { /* server links upward at level i */
                u = i
                for j = i to K - 1
                    b_j = floor(tid / (N * n^{j-1})) mod n
                b_0 = floor(tid / 2^u) mod ((N/n) * (n/2)^u)
                inter-switch = (u - b_{K-u}, ..., b_1, b_0)
                Connect(server, inter-switch)
            }
        }
    }
}

The key step is working out the level of the outgoing (inter-switch) link of each server.
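To make the per-server computation concrete, here is a minimal runnable sketch in Python of the two pieces the pseudocode relies on: the address digits a_i and the level of a server's outgoing link. The function names are mine, not the paper's; only the arithmetic is taken from the pseudocode above.

    def totoro_address(tid, N, n, K):
        # Digits a_0..a_K of server tid (a[i] is a_i), as in the building algorithm.
        a = [0] * (K + 1)
        a[0] = tid % N                        # position inside its basic partition (Totoro_0)
        for i in range(K):                    # a_{i+1}: which Totoro_i the server falls into
            a[i + 1] = (tid // (N * n**i)) % n
        return a

    def outgoing_link_level(tid, K):
        # Level u of the inter-switch reached by this server's backup port, or 0 if
        # that port stays free. The condition below holds for at most one u, so a
        # two-port server joins at most one inter-switch.
        for u in range(1, K + 1):
            if (tid - 2**(u - 1) + 1) % 2**u == 0:
                return u
        return 0

    # Example for the Totoro_2 with N = n = 4 shown earlier:
    print(totoro_address(6, 4, 4, 2))         # [2, 1, 0]  (a_0=2, a_1=1, a_2=0)
    print(outgoing_link_level(6, 2))          # 1: server 6 links upward at level 1
    print(outgoing_link_level(1, 2))          # 2: server 1 links upward at level 2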

Building Algorithm (scale)
(Table: total number of servers t_u for different values of N, n and u; Totoro reaches millions of servers.)
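Plugging numbers into t_K = N * n^K from the building algorithm (the evaluation platforms later use the first two): t_1 = 48 * 48 = 2,304 and t_2 = 16 * 16^2 = 4,096. Keeping 48-port switches throughout (my own extrapolation, not a configuration given on the slides) yields t_2 = 48 * 48^2 = 110,592 and t_3 = 48 * 48^3 ≈ 5.3 million, which is how Totoro reaches millions of servers with a small number of levels.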

Totoro Routing Algorithm (TRA)
- Basic routing; not fault-tolerant by itself
Totoro Broadcast Domain (TBD)
- Detects & shares link states
Totoro Fault-tolerant Routing (TFR)
- TRA + Dijkstra's algorithm (based on TBD)

Totoro Routing Algorithm (TRA): a divide-and-conquer algorithm. How do we find a path from src to dst?

Step 1: src and dst belong to two different partitions.

Step 2: take a link between these two partitions.

m and n are the intermediate servers at the two ends of that link; the intermediate path runs from m to n.

Step 3: if src (or dst) and m (or n) are in the same basic partition, just return the direct path.

Step 3 (otherwise): return to Step 1 to work out the path from src (or dst) to m (or n).

Step 4: join P(src, m), P(m, n), and P(n, dst) into the full path.
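Putting the four steps together, a minimal divide-and-conquer sketch in Python. Server ids are the sequential tids from the building algorithm; link_between is an assumed callback standing in for the paper's rule for picking the inter-switch link between two partitions, so this illustrates the recursion rather than the exact published routine.

    def tra(src, dst, N, n, link_between):
        # Totoro Routing Algorithm (sketch): returns a list of servers from src to dst.
        part = lambda tid, level: tid // (N * n**level)   # level-l partition of a server
        if src == dst:
            return [src]
        if part(src, 0) == part(dst, 0):
            return [src, dst]                 # Step 3 base case: one intra-switch hop
        # Step 1: lowest level l at which src and dst sit inside the same Totoro_l,
        # i.e. they belong to two different Totoro_{l-1} partitions.
        l = 1
        while part(src, l) != part(dst, l):
            l += 1
        # Step 2: take one link between those two partitions; m and mid are the
        # intermediate servers at its ends (the intermediate path runs m -> mid).
        m, mid = link_between(part(src, l - 1), part(dst, l - 1), l)
        # Steps 3-4: solve the two sub-problems and join P(src, m), (m, mid), P(mid, dst).
        return tra(src, m, N, n, link_between) + tra(mid, dst, N, n, link_between)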

The performance of TRA is close to that of the shortest-path (SP) algorithm for different network sizes: simple & efficient.
(Table: mean value and standard deviation of path length under TRA and the SP algorithm in Totoro_u of different sizes; M_u is the maximum distance between any two servers in Totoro_u, and t_u is the total number of servers.)

Totoro Broadcast Domain (TBD)
Fault tolerance requires detecting and sharing link states, but the time cost and CPU load make a global strategy impossible. Totoro is therefore divided into several TBDs.
(Figure: green = inner-server, yellow = outer-server.)

Totoro Fault-tolerant Routing (TFR)
Two strategies:
- Dijkstra's algorithm within a TBD
- TRA between TBDs
Proxy: a temporary destination. Next hop: the next server on P(src, proxy/dst).

If the proxy is unreachable:

Reroute the packet to another proxy by using local redundant links (sketched below).
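A runnable sketch of the per-hop TFR decision in Python, under stated assumptions: the TBD's shared link states are modeled as an adjacency dict, and the candidate proxies (TRA-preferred outer servers toward the destination) are passed in as a list. The helper names and interfaces are mine; only the two-strategy logic - Dijkstra inside the TBD, proxy fallback over redundant links otherwise - comes from the slides.

    import heapq

    def dijkstra_first_hop(graph, src, dst):
        # First hop on a shortest path from src to dst in graph
        # (node -> list of (neighbor, cost)); None if dst is unreachable.
        dist, first = {src: 0}, {src: None}
        heap = [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return first[u]
            if d > dist[u]:
                continue
            for v, w in graph.get(u, ()):
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    first[v] = v if first[u] is None else first[u]
                    heapq.heappush(heap, (d + w, v))
        return None

    def tfr_next_hop(tbd_graph, current, dst, proxies):
        # TFR decision at server `current`: Dijkstra within the TBD if dst is local;
        # otherwise aim at a proxy (temporary destination), falling back to the next
        # proxy over local redundant links whenever the preferred one is unreachable.
        if dst in tbd_graph:
            return dijkstra_first_hop(tbd_graph, current, dst)
        for proxy in proxies:
            hop = dijkstra_first_hop(tbd_graph, current, proxy)
            if hop is not None:
                return hop
        return None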

Evaluating path failure: TFR vs. the shortest-path algorithm (Floyd-Warshall).
Evaluating the network structure: Totoro vs. the tree-based structure, fat-tree, DCell & FiConn.

Evaluating Path Failure
- Types of failures: link, node, switch & rack failures
- Comparison: TFR vs. SP
- Platform: Totoro_1 (N=48, n=48, K=1, t_K = 2,304 servers) and Totoro_2 (N=16, n=16, K=2, t_K = 4,096 servers)
- Failure ratios: 2%-20%
- Communication mode: all-to-all
- Simulation runs: 20

(Figure: path failure ratio vs. node failure ratio.)
- The performance of TFR is almost identical to that of SP.
- TFR maximizes the use of redundant links when a node failure occurs.

(Figure: path failure ratio vs. link failure ratio.)
- TFR performs well when the link failure ratio is small (i.e., <4%).
- The performance gap between TFR and SP grows as the link failure ratio increases: TFR is not globally optimal and is not guaranteed to find an existing path, so there is substantial room for improvement.

(Figure: path failure ratio vs. switch failure ratio.)
- TFR performs almost as well as SP in Totoro_1.
- The performance gap between TFR and SP grows with the failure ratio in Totoro_2.

(Figure: path failure ratio vs. switch failure ratio.)
- The path failure ratio of SP is lower in a higher-level Totoro.
- More redundant high-level switches help bypass the failures.

(Figure: path failure ratio vs. rack failure ratio.)
- In a low-level Totoro, TFR achieves results very close to SP.
- The capacity of TFR in a relatively high-level Totoro can still be improved.

Evaluating Network Structure
Low degree: the server degree approaches but never reaches 2; a lower degree means lower deployment and maintenance overhead.

Structure   Degree       Diameter           Bisection Width
Tree        --           2 log_{d-1} T      1
Fat-Tree    --           2 log_2 T          T/2
DCell       k + 1        < 2 log_n T - 1    T / (4 log_n T)
FiConn      2 - 1/2^k    O(log T)           O(T / log T)
Totoro      2 - 1/2^k    O(T)               T / 2^(k+1)

N: the number of ports on an intra-switch; n: the number of ports on an inter-switch; T: the total number of servers. For Totoro, T = N * n^k.
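The 2 - 1/2^k degree entry can be read off the building rule (my own derivation, consistent with the identical FiConn expression): every server uses its intra-switch port, and since each level consumes half of the backup ports still available, the fraction of servers whose backup port is in use after k levels is 1/2 + 1/4 + ... + 1/2^k = 1 - 1/2^k, so the average server degree is 1 + (1 - 1/2^k) = 2 - 1/2^k.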

Relatively large diameter: a smaller diameter means a more efficient routing mechanism. In practice, the diameter of a Totoro_3 with 1M servers is only 18, and this can be improved. (Same comparison table as above.)

Large bisection width: a large bisection width makes the structure fault-tolerant and resilient. For a small k the bisection width is large: BiW = T/4, T/8, T/16 when k = 1, 2, 3. (Same comparison table as above.)

Conclusion:
- Scalability: millions of servers & an open structure
- Fault tolerance: structure & routing mechanism
- Low cost: two-port servers & commodity switches
- High capacity: multi-redundant links
Totoro is a viable interconnection solution for data centers!

Future work:
- Fault tolerance: how can the structure be made more resilient? Routing under complex failures: more robust rerouting techniques?
- Network capacity and data locality: mapping between servers and switches? Data storage allocation policies?

NPC 2013: The 10th IFIP International Conference on Network and Parallel Computing. October 2013, Guiyang, China.