Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure. Rim Moussa

Presentation transcript:

Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure. Rim Moussa, Thomas J. E. Schwarz. Workshop in Distributed Data & Structures, July 2004

2 Objective
LH*RS: design, implementation, performance measurements.
Factors of interest: parity overhead, recovery performance.

3 Overview
1. Motivation
2. Highly-available schemes
3. LH*RS
4. Architectural design
5. Hardware testbed
6. File creation
7. High availability
8. Recovery
9. Conclusion
10. Future work
For each mechanism: scenario description, performance results.

4 Motivation
Information volume grows about 30% per year.
Disk access and CPUs become bottlenecks.
Failures are frequent & costly:

Business Operation                | Industry       | Average Hourly Financial Impact
Brokerage (retail) operations     | Financial      | $6.45 million
Credit card sales authorization   | Financial      | $2.6 million
Airline reservation centers       | Transportation | $89,500
Cellular (new) service activation | Communication  | $41,000
Source: Contingency Planning Research, 1996

5 Requirements
Need highly available networked data storage systems, providing scalability, high throughput and high availability.

6 Scalable & Distributed Data Structure
Dynamic file growth: clients insert records, over the network, into data buckets (DBs). When a bucket overflows it tells the coordinator "I'm overloaded!", the coordinator answers "You split!", and part of the records are transferred to a new bucket while inserts keep flowing in.

7 SDDS (Ctnd.)
No centralized directory access: a client sends its query directly to the data bucket computed from its local image of the file; if that image is outdated, the bucket forwards the query to the correct DB and the client receives an Image Adjustment Message, as sketched below.
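The addressing calculus behind these two slides can be made concrete. Below is a minimal Python sketch of classic LH* client addressing and image adjustment, assuming one initial bucket and h_i(c) = c mod 2^i; LH*LH adds a second level of linear hashing inside each bucket, which is omitted here, and all names are illustrative rather than the prototype's API.

def lh_address(key: int, level: int, split_ptr: int) -> int:
    """Linear-hashing address with one initial bucket: h_i(key) = key mod 2^i,
    refined with h_{i+1} for buckets already split at this level."""
    addr = key % (2 ** level)
    if addr < split_ptr:
        addr = key % (2 ** (level + 1))
    return addr

class ClientImage:
    """A client's private, possibly outdated view (i', n') of the file state."""
    def __init__(self):
        self.level = 0       # i'
        self.split_ptr = 0   # n'

    def address(self, key: int) -> int:
        return lh_address(key, self.level, self.split_ptr)

    def apply_iam(self, bucket_addr: int, bucket_level: int):
        """Adjust the image from an IAM carrying the address and level of the
        bucket that had to forward the client's query."""
        if bucket_level > self.level:
            self.level = bucket_level - 1
            self.split_ptr = bucket_addr + 1
            if self.split_ptr >= 2 ** self.level:
                self.split_ptr = 0
                self.level += 1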

8 Solutions towards High Availability
Data replication: (+) good response time, since mirrors can be queried; (-) high storage cost (x n for n replicas).
Parity calculus: erasure-resilient codes, evaluated with respect to:
 Coding rate (parity volume / data volume)
 Update penalty
 Group size used for data reconstruction
 Complexity of coding & decoding

9 Fault-Tolerant Schemes
1 server failure: simple XOR parity calculus: RAID systems [Patterson et al., 88], the SDDS LH*g [Litwin et al., 96].
More than 1 server failure:
 Binary linear codes [Hellerstein et al., 94].
 Array codes: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04] → tolerate just 2 failures.
 Reed-Solomon codes: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00] → tolerate a large number of failures.

A Highly Available & Distributed Data Structure: LH* RS [Litwin & Schwarz, 00] [Litwin, Moussa & Schwarz, sub.]

11 LH*RS = SDDS + high availability
Data distribution scheme based on linear hashing, LH*LH [Karlsson et al., 96], applied to the key field → scalability, high throughput.
Parity calculus with Reed-Solomon codes [Reed & Solomon, 60] → high availability.

12 LH*RS File Structure
Data buckets store data records: (key, data field); each inserted record receives a rank r within its bucket group.
Parity buckets store parity records: (rank r, [key list], parity field), one per rank of the group.
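As a rough illustration of this structure (the field names and types are assumptions, not the implementation's), the two kinds of records could be modelled as:

from dataclasses import dataclass, field
from typing import List

@dataclass
class DataRecord:
    key: int               # primary key, hashed by LH*
    data: bytes            # non-key data field
    rank: int = -1         # rank r assigned when the record is inserted

@dataclass
class ParityRecord:
    rank: int                                       # rank r shared by the group's data records
    keys: List[int] = field(default_factory=list)   # key list: keys of the data records at rank r
    parity: bytes = b""                             # parity field (XOR or Reed-Solomon symbols)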

Architectural Design of LH* RS

14 Communication
Use of TCP/IP (better performance & reliability than UDP for bulk transfers): new PB creation, large update transfer (DB split), bucket recovery.
Use of UDP (speed): individual insert/update/delete/search queries, record recovery, service and control messages.
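A minimal socket-level sketch of this division of labour, using Python's standard sockets; the addresses, ports and message formats are placeholders, not the LH*RS wire protocol:

import socket

# Individual key search over UDP: one small datagram, no connection setup (speed).
def udp_search(key: int, bucket_addr=("127.0.0.1", 40000)) -> bytes:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(1.0)
    s.sendto(f"SEARCH {key}".encode(), bucket_addr)
    reply, _ = s.recvfrom(64 * 1024)
    return reply

# Bulk transfer (e.g. a split buffer or a recovery buffer) over TCP: reliable byte stream.
def tcp_send_buffer(buffer: bytes, bucket_addr=("127.0.0.1", 40001)) -> None:
    with socket.create_connection(bucket_addr) as s:
        s.sendall(len(buffer).to_bytes(4, "big") + buffer)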

15 Bucket Architecture
Components of a bucket (from the architecture diagram): a multicast listening port; UDP send and receive ports; a TCP/IP port with per-connection process buffers; message queues with message processing; a flow-control window of free zones (the sending credit), messages waiting for acknowledgement, and not-yet-acknowledged messages kept until delivery.
Threads: multicast listening thread, multicast working thread, ack. management thread, UDP listening thread, TCP listening thread, working threads 1..n.

16 Architectural Design: enhancements to SDDS 2000 [Bennour, 00] [Diène, 01]
TCP/IP connection handler: TCP/IP connections use passive OPEN (RFC 793 [ISI, 81]); implementation under the Windows 2000 Server OS follows [MacDonald & Barkley, 00].
Flow control and acknowledgement management: principle of "sending credit + message conservation until delivery" [Jacobson, 88] [Diène, 01].
Example, recovery of 1 DB: SDDS 2000 architecture 6.7 s, new architecture 2.6 s → improvement of about 60% (hardware config.: 733 MHz machines, 100 Mbps network).
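A simplified sketch of the "sending credit + message conservation until delivery" principle: the sender keeps at most `credit` unacknowledged messages in flight and retains each one (re-sending it on timeout) until its acknowledgement arrives. The class name, timeouts and the transport callback are assumptions, not the SDDS 2000 code.

import time

class CreditSender:
    """Window-based flow control: at most `credit` unacknowledged messages in flight;
    each message is kept (and re-sent on timeout) until its ack arrives."""
    def __init__(self, send_fn, credit=5, timeout=0.2):
        self.send_fn, self.credit, self.timeout = send_fn, credit, timeout
        self.unacked = {}          # msg_id -> (message, last_send_time)
        self.next_id = 0

    def send(self, message: bytes) -> int:
        while len(self.unacked) >= self.credit:    # credit exhausted: wait and retransmit
            self.retransmit_expired()
            time.sleep(0.01)
        msg_id, self.next_id = self.next_id, self.next_id + 1
        self.send_fn(msg_id, message)
        self.unacked[msg_id] = (message, time.time())
        return msg_id

    def on_ack(self, msg_id: int):
        self.unacked.pop(msg_id, None)             # delivery confirmed: release the slot

    def retransmit_expired(self):
        now = time.time()
        for msg_id, (msg, sent) in list(self.unacked.items()):
            if now - sent > self.timeout:
                self.send_fn(msg_id, msg)
                self.unacked[msg_id] = (msg, now)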

17 Architectural Design (Ctnd.): multicast component
Before: a pre-defined & static table of buckets.
Now: a dynamic structure, updated when new/spare buckets (PBs/DBs) are added, through a multicast probe; blank DBs and blank PBs each listen on their own multicast group, which the coordinator probes.

18 Hardware Testbed
5 machines (Pentium IV: 1.8 GHz, RAM: 512 MB)
Ethernet network: max bandwidth of 1 Gbps
Operating system: Windows 2K Server
Tested configuration: 1 client, a group of 4 data buckets, k parity buckets, k ∈ {0, 1, 2}

LH* RS File Creation

20 File Creation
Client operation: each insert/update/delete of a data record is propagated to the parity buckets.
Data bucket split:
 Splitting DB → its PBs: for the records that remain, N deletes (from the old rank) & N inserts (at the new rank); for the records that move, N deletes.
 New DB → its PBs: N inserts (moved records).
All updates are gathered in the same buffer and transferred (TCP/IP) simultaneously to the respective parity buckets of the splitting DB & the new DB, as sketched below.
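A sketch of how the split's parity updates could be batched, following the slide: remaining records yield a delete at the old rank plus an insert at the new rank, moved records yield a delete for the old group and an insert for the new DB's group. The tuple layout and rank bookkeeping are assumptions.

def build_split_parity_updates(remaining, moved):
    """remaining/moved: lists of (old_rank, new_rank, record) produced by the split.
    Returns one update buffer per destination: the splitting DB's parity group and
    the new DB's parity group, each sent over TCP in a single transfer."""
    splitting_db_buffer, new_db_buffer = [], []
    for old_rank, new_rank, rec in remaining:          # records that stay
        splitting_db_buffer.append(("DELETE", old_rank, rec.key))
        splitting_db_buffer.append(("INSERT", new_rank, rec.key, rec.data))
    for old_rank, new_rank, rec in moved:              # records that move to the new DB
        splitting_db_buffer.append(("DELETE", old_rank, rec.key))
        new_db_buffer.append(("INSERT", new_rank, rec.key, rec.data))
    return splitting_db_buffer, new_db_buffer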

21 File Creation Perf.
Experimental set-up: file of data records, 1 data record = 104 B; client sending credit = 1 and client sending credit = 5.
PB overhead: k = 0 to k = 1 → perf. degradation of 20%; k = 1 to k = 2 → perf. degradation of 8%.

22 File Creation Perf.
Experimental set-up: file of data records, 1 data record = 104 B; client sending credit = 1 and client sending credit = 5.
PB overhead: k = 0 to k = 1 → perf. degradation of 37%; k = 1 to k = 2 → perf. degradation of 10%.

LH* RS Parity Bucket Creation

24 PB Creation Scenario: searching for a new PB. The coordinator multicasts to the PBs connected to the blank-PBs multicast group: "Wanna join group g? [Send your entity #]".

25 PB Creation Scenario: waiting for replies. Each willing blank PB answers "I would", starts UDP listening, TCP listening and its working threads, then waits for confirmation; if the time-out elapses, it cancels everything.

26 PB Creation Scenario: PB selection. The coordinator replies "You are selected" to one candidate, which disconnects from the blank-PBs multicast group; the other candidates receive a cancellation.
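The recruitment dialogue of the last three slides, seen from the coordinator, could look roughly as follows; the four callables stand in for the real multicast/UDP messaging layer and are assumptions, not the prototype's interfaces.

import time

def recruit_blank_bucket(multicast_send, replies, confirm, cancel, group, timeout=2.0):
    """Probe the blank-buckets multicast group, take the first volunteer,
    cancel the other volunteers; late repliers time out on their own side."""
    multicast_send(f"Wanna join group {group}? [Send your entity #]")
    deadline = time.time() + timeout
    selected = None
    while time.time() < deadline:
        for candidate in replies():        # entity #s of buckets that answered "I would"
            if selected is None:
                selected = candidate
                confirm(candidate)         # "You are selected"
            else:
                cancel(candidate)          # extra volunteers get a cancellation
        if selected is not None:
            break
        time.sleep(0.05)
    return selected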

27 PB Creation Scenario: auto-creation, query phase. The new PB asks each data bucket of the group: "Send me your contents!".

28 PB Creation Scenario: auto-creation, encoding phase. Each data bucket of the group sends the requested buffer to the new PB, which processes the buffers and encodes its parity records.

29 PB Creation Perf.
Experimental set-up: bucket size in records; bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records.
[Table: total time, processing time and communication time (sec), and encoding rate (MB/sec), per bucket size, for XOR encoding vs. RS encoding.]
As the bucket size grows, processing time reaches about 74% of the total time.

30 PB Creation Perf.
Experimental set-up: bucket size in records; bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records.
[Table: total time, processing time and communication time (sec), and encoding rate (MB/sec), per bucket size, for XOR encoding vs. RS encoding.]
As the bucket size grows, processing time reaches about 74% of the total time.

31 PB Creation Perf.: XOR encoding vs. RS encoding
XOR encoding rate: 0.66 MB/sec; RS encoding rate: … MB/sec.
For bucket size = 50000, XOR provides a performance gain of 5% in processing time (≈ 0.02% in total time).
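The two encodings being compared can be sketched as follows; `gf_mul` stands for a Galois-field multiply (for example the log-table version sketched on the parity-calculus slide at the end of this transcript), and the one-buffer-per-data-bucket layout is an assumption.

def xor_encode(record_buffers):
    """Parity for the first parity bucket: byte-wise XOR of the group's buffers."""
    parity = bytearray(len(record_buffers[0]))
    for buf in record_buffers:
        for i, byte in enumerate(buf):
            parity[i] ^= byte
    return bytes(parity)

def rs_encode(record_buffers, coefficients, gf_mul):
    """Parity for a later parity bucket: a linear combination of the buffers,
    one generator-matrix coefficient per data bucket, over the Galois field."""
    parity = bytearray(len(record_buffers[0]))
    for coeff, buf in zip(coefficients, record_buffers):
        for i, byte in enumerate(buf):
            parity[i] ^= gf_mul(coeff, byte)   # addition in GF(2^w) is XOR
    return bytes(parity)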

LH* RS Bucket Recovery

33 Buckets' Recovery: failure detection. The coordinator probes the data buckets and parity buckets of the group: "Are you alive?".

34 Buckets' Recovery: waiting for replies. Surviving data buckets and parity buckets answer "I am alive"; the buckets that do not reply are the ones to recover.

35 Buckets' Recovery: searching for 2 spare DBs. The coordinator multicasts to the DBs connected to the blank-DBs multicast group: "Wanna be a spare DB? [Send your entity #]".

36 Buckets' Recovery: waiting for replies. Each willing blank DB answers "I would", starts UDP listening, TCP listening and its working threads, then waits for confirmation; if the time-out elapses, it cancels everything.

37 Buckets' Recovery: spare DBs selection. The coordinator replies "You are selected" to the chosen candidates, which disconnect from the blank-DBs multicast group; the other candidates receive a cancellation.

38 Buckets' Recovery: recovery manager determination. The coordinator designates one of the parity buckets as recovery manager: "Recover buckets [spares …]".

39 Buckets' Recovery: query phase. The recovery manager asks the alive buckets (data and parity) participating in the recovery: "Send me records of rank in [r, r+slice-1]".

40 Buckets' Recovery: reconstruction phase. The alive buckets send the requested buffers; the recovery manager runs the decoding process and forwards the recovered records to the spare DBs.
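Putting the query and reconstruction phases together, the recovery manager's loop could be sketched as below; `decode` hides the RS (or XOR) matrix inversion, and the bucket interfaces are assumptions rather than the prototype's API.

def recover_buckets(alive_buckets, spares, bucket_size, slice_size, decode):
    """Request records slice by slice from k alive buckets of the group,
    decode each slice, and ship the recovered records to the spare buckets."""
    for r in range(0, bucket_size, slice_size):
        buffers = [b.request_records(r, r + slice_size - 1) for b in alive_buckets]
        recovered = decode(buffers)                 # one record list per failed bucket
        for spare, records in zip(spares, recovered):
            spare.insert_records(records)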

41 DBs Recovery Perf.
Experimental set-up: file of … records; bucket of … records (… MB). XOR decoding vs. RS decoding.
[Table: total time, processing time and communication time (sec) per slice size.]
Varying the slice from 4% to 100% of the bucket contents, the total time hardly varies (≈ 0.72 sec).

42 DBs Recovery Perf.
Experimental set-up: file of … records; bucket of … records (… MB). XOR decoding vs. RS decoding.
[Table: total time, processing time and communication time (sec) per slice size.]
Varying the slice from 4% to 100% of the bucket contents, the total time hardly varies (≈ 0.85 sec).

43 DBs Recovery Perf.: XOR decoding vs. RS decoding
Experimental set-up: file of … records; bucket of … records (… MB).
1 DB recovery time, XOR: … sec; 1 DB recovery time, RS: … sec.
XOR provides a performance gain of 15% in total time.

44 DBs Recovery Perf.: recovering 2 DBs vs. 3 DBs
Experimental set-up: file of … records; bucket of … records (… MB).
[Table: total time, processing time and communication time (sec) per slice size.]
Varying the slice from 4% to 100% of the bucket contents, the total time hardly varies (≈ 1.2 sec).

45 DBs Recovery Perf.: recovering 2 DBs vs. 3 DBs
Experimental set-up: file of … records; bucket of … records (… MB).
[Table: total time, processing time and communication time (sec) per slice size.]
Varying the slice from 4% to 100% of the bucket contents, the total time hardly varies (≈ 1.6 sec).

46 Perf. Summary of Bucket Recovery
1 DB (3.125 MB) in 0.7 sec (XOR) → 4.46 MB/sec
1 DB (3.125 MB) in 0.85 sec (RS) → 3.65 MB/sec
2 DBs (6.250 MB) in 1.2 sec (RS) → 5.21 MB/sec
3 DBs (9.375 MB) in 1.6 sec (RS) → 5.86 MB/sec
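A quick back-of-envelope check of these rates (recovered volume divided by total time); the small deviation on the second line (3.68 vs. 3.65 MB/sec) only reflects rounding of the measured time.

# Recovery throughput implied by the summary: size recovered / total time.
for size_mb, secs in [(3.125, 0.7), (3.125, 0.85), (6.250, 1.2), (9.375, 1.6)]:
    print(f"{size_mb} MB / {secs} s = {size_mb / secs:.2f} MB/s")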

47 Conclusion
The conducted experiments show that the encoding/decoding optimizations and the enhanced bucket architecture have a clear impact on performance, and that recovery performance is good: 1 DB is recovered in about half a second. Finally, we improved the processing time of the RS decoding process by 4% to 8%.

48 Conclusion
LH*RS is a mature implementation, the result of many optimization iterations, and the only SDDS with scalable availability.

49 Future Work
A better strategy for propagating parity updates to the PBs.
Investigation of faster encoding/decoding processes.

50 References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of the ACM SIGMOD Conf., June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) Specification, Sept. 1981.
[MacDonald & Barkley, 00] D. MacDonald & W. Barkley, MS Windows 2000 TCP/IP Implementation Details, 2000.
[Jacobson, 88] V. Jacobson & M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, 1988.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong & S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989.
[White, 91] P. E. White, RAID X Tackles Design Problems with Existing Design RAID Schemes, ECC Technologies, 1991, ftp://members.aol.com.mnecctek.ctr1991.pdf
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep., 1995.

51 References (Ctnd.)
[Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed-Solomon Codes, Proceedings of the ACM SIGMOD 2000.
[Karlsson et al., 96] J. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.
[Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
[Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software: Practice & Experience, 27(9), Sept. 1997.
[Diène, 01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Université Paris Dauphine, Nov. 2001.
[Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Université Paris Dauphine, June 2000.
[Moussa] More references:

End

53 Parity Calculus
Galois field: GF(2^8) → 1 symbol is 1 byte; GF(2^16) → 1 symbol is 2 bytes.
(+) GF(2^16) vs. GF(2^8) halves the number of symbols, and consequently the number of operations in the field. (-) Larger multiplication table sizes.
New generator matrix:
 1st column of '1's → the 1st parity bucket executes XOR calculus instead of RS calculus → encoding performance gain of 20%.
 1st line of '1's → each PB executes XOR calculus for any update coming from the 1st DB of any group → performance gain of 4%, measured for PB creation.
Encoding & decoding hints:
 Encoding: log pre-calculus of the P matrix coefficients → improvement of 3.5%.
 Decoding: log pre-calculus of the H^-1 matrix coefficients and of the b vector for multiple-bucket recovery → improvement from 4% to 8%.
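A minimal GF(2^8) illustration of the "log pre-calculus" hint and of why an all-ones row or column of the generator matrix degenerates into plain XOR. The generator polynomial 0x11D is a common choice and an assumption here; LH*RS also uses GF(2^16) with 2-byte symbols, built the same way.

# GF(2^8) arithmetic with precomputed log/antilog tables.
GF_EXP, GF_LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D            # reduce modulo the generator polynomial (assumed)
for i in range(255, 512):
    GF_EXP[i] = GF_EXP[i - 255]   # duplicate so log sums need no modulo

def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) with two table lookups and one addition of logs."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]

# A generator-matrix coefficient of 1 needs no multiplication at all:
# gf_mul(1, b) == b, so the all-ones first column (1st parity bucket) and the
# all-ones first line (updates coming from the 1st DB) reduce to pure XOR.
assert all(gf_mul(1, b) == b for b in range(256))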