Presentation on theme: "Workshop in Distributed Data & Structures"— Presentation transcript:

1 Workshop in Distributed Data & Structures
July 2004. Design & Implementation of LH*RS: a Highly Available Distributed Data Structure. Thomas J.E. Schwarz, Rim Moussa

2 Objective
Design, implementation and performance measurements of LH*RS. The factors of interest are: parity overhead and recovery performance.

3 Overview
Motivation
Highly-available schemes
LH*RS: Architectural Design, Hardware Testbed, File Creation, High Availability, Recovery (scenario description and performance results)
Conclusion
Future Work

4 Motivation
Information volume grows by about 30% per year.
Disk access and CPUs are bottlenecks.
Failures are frequent and costly:

Business Operation                | Industry       | Average Hourly Financial Impact
Brokerage (retail) operations     | Financial      | $6.45 million
Credit card sales authorization   | Financial      | $2.6 million
Airline reservation centers       | Transportation | $89,500
Cellular (new) service activation | Communication  | $41,000

Source: Contingency Planning Research, 1996

5 Highly Available Networked Data Storage Systems
Requirements: the need for highly available networked data storage systems translates into scalability, high throughput and high availability.

6 Scalable & Distributed Data Structure
Dynamic file growth: a client's inserts overload a data bucket; the coordinator orders it to split ("I'm overloaded!" / "You split!"), and roughly half of its records are transferred over the network to a new data bucket (DB).
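The split behaviour above follows the linear-hashing addressing rule. Below is a minimal, illustrative Python sketch of that rule; the simple modulo hash on integer keys, the in-memory list of buckets and the function names are assumptions made for exposition, not the LH*LH key-field hashing or the networked buckets of the actual system.

```python
# Minimal linear-hashing sketch (illustrative only).

N0 = 1  # assumed number of initial buckets

def h(key, level):
    """Family of hash functions h_i(key) = key mod (N0 * 2^i)."""
    return key % (N0 * 2 ** level)

def address(key, level, split_ptr):
    """LH addressing: buckets below the split pointer already use h_{i+1}."""
    a = h(key, level)
    if a < split_ptr:
        a = h(key, level + 1)
    return a

def split(buckets, level, split_ptr):
    """Split the bucket under the split pointer: roughly half of its records
    move to a new bucket appended at address split_ptr + N0 * 2^level."""
    keep, move = [], []
    for key in buckets[split_ptr]:
        (keep if h(key, level + 1) == split_ptr else move).append(key)
    buckets[split_ptr] = keep
    buckets.append(move)               # records transferred to the new data bucket
    split_ptr += 1
    if split_ptr == N0 * 2 ** level:   # a full round of splits has completed
        level, split_ptr = level + 1, 0
    return level, split_ptr
```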

7 SDDS (Cont'd): No Centralized Directory Access
A client sends its query directly, using its own image of the file; if the image is outdated, the contacted data bucket forwards the query over the network to the correct bucket, and the client receives an Image Adjustment Message.
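A rough sketch of the client-image mechanism described above, assuming the Image Adjustment Message simply carries the file level and split pointer known by the answering server; the actual LH* adjustment rule is more subtle, so this is only an illustration of the idea that clients correct their view lazily, without a centralized directory.

```python
class ClientImage:
    """Illustrative client image of an LH* file (one initial bucket assumed)."""

    def __init__(self):
        self.level = 0       # client's possibly outdated view of the file level
        self.split_ptr = 0   # client's possibly outdated view of the split pointer

    def address(self, key):
        """Compute the target bucket from the client's image; the contacted
        bucket forwards the query if the image turns out to be outdated."""
        a = key % (2 ** self.level)
        if a < self.split_ptr:
            a = key % (2 ** (self.level + 1))
        return a

    def adjust(self, iam_level, iam_split_ptr):
        """Apply an Image Adjustment Message; never move the image backwards."""
        if (iam_level, iam_split_ptr) > (self.level, self.split_ptr):
            self.level, self.split_ptr = iam_level, iam_split_ptr
```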

8 Solutions towards High Availability
Data replication: (+) good response time, since mirrors can be queried; (-) high storage cost (a factor of n for n replicas).
Parity calculus: erasure-resilient codes are evaluated with respect to coding rate (parity volume / data volume), update penalty, the group size used for data reconstruction, and the complexity of coding and decoding.

9 Fault-Tolerant Schemes
Tolerating 1 server failure, with simple XOR parity calculus: RAID systems [Patterson et al., 88], the SDDS LH*g [Litwin et al., 96].
Tolerating more than 1 server failure:
Binary linear codes [Hellerstein et al., 94];
Array codes: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04]; these tolerate just 2 failures.
Reed-Solomon codes: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00]; these tolerate a large number of failures.

10 A Highly Available & Distributed Data Structure: LH*RS
[Litwin & Schwarz, 00] [Litwin, Moussa & Schwarz, sub.]

11 LH*RS SDDS
Data distribution scheme based on Linear Hashing, LH*LH [Karlsson et al., 96], applied to the key field: scalability and high throughput.
Parity calculus with Reed-Solomon codes [Reed & Solomon, 60]: high availability.

12 LH*RS File Structure
Data buckets store data records: [Key | Data Field]. Parity buckets store parity records: [Rank | Key List | Parity Field]. An inserted key is assigned a rank r within its bucket; the parity record of rank r covers the data records of rank r across the bucket group.
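Purely as an illustration, the two record layouts named above could be modelled as follows; the Python field names and types are assumptions, not the actual storage format of the implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataRecord:
    key: int       # primary key, also used by LH*LH for addressing
    data: bytes    # non-key data field

@dataclass
class ParityRecord:
    rank: int                                       # position of the record group in the bucket group
    keys: List[int] = field(default_factory=list)   # keys of the data records sharing this rank
    parity: bytes = b""                             # XOR / Reed-Solomon parity over their data fields
```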

13 Architectural Design of LH*RS

14 Communication
Use of UDP (for speed): individual insert/update/delete/search queries, record recovery, service and control messages.
Use of TCP/IP (better performance and reliability than UDP for large transfers): new PB creation, large update transfers (DB split), bucket recovery.
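A minimal sketch of this protocol split, with an illustrative peer address: short queries and control messages leave as single UDP datagrams, while bulk data (split buffers, bucket recovery) travels over a TCP connection.

```python
import socket

PEER = ("10.0.0.2", 5000)   # illustrative address of a bucket server

def send_query_udp(payload: bytes):
    """Small individual queries and control messages: one UDP datagram (speed)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, PEER)

def send_bulk_tcp(buffer: bytes):
    """Large transfers (DB split buffers, bucket recovery): a TCP connection
    for reliability and sustained throughput."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect(PEER)
        s.sendall(buffer)
```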

15 Bucket Architecture
Network interfaces: a multicast listening port, send and receive UDP ports, and a TCP/IP port (TCP connections).
Threads: a TCP listening thread, a UDP listening thread and a multicast listening thread feed message queues; a pool of working threads (1..n) and a multicast working thread process the messages from a buffer.
Flow control: an acknowledgement-management thread manages the sending credit and the window of messages waiting for acknowledgement (not-acked messages, free zones).
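A minimal sketch of the listening-thread / message-queue / working-thread pattern listed above, assuming a single shared queue and an empty processing placeholder; the multicast listener and the acknowledgement-management thread are omitted here.

```python
import queue
import threading

messages = queue.Queue()   # shared message queue fed by the listening threads

def process(data, addr):
    """Placeholder for message-specific processing (insert, search, recovery...)."""
    pass

def udp_listener(sock):
    """UDP listening thread: push every received datagram into the queue."""
    while True:
        data, addr = sock.recvfrom(64 * 1024)
        messages.put((data, addr))

def worker():
    """Working thread: pop messages from the queue and process them."""
    while True:
        data, addr = messages.get()
        process(data, addr)
        messages.task_done()

def start_workers(n):
    """Start the pool of n working threads."""
    for _ in range(n):
        threading.Thread(target=worker, daemon=True).start()
```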

16 Architectural Design: Enhancements to SDDS-2000 [Bennour, 00] [Diène, 01]
Bucket architecture, TCP/IP connection handler: TCP/IP connections use a passive OPEN (RFC 793 [ISI, 81]; TCP/IP implementation under the Windows 2000 Server OS [McDonal & Barkley, 00]). Example, recovery of 1 DB: 6.7 s with the SDDS-2000 architecture vs. 2.6 s with the new architecture, an improvement of about 60% (hardware configuration: 733 MHz machines, 100 Mbps network).
Flow control and acknowledgement management: principle of "sending credit + message conservation until delivery" [Jacobson, 88] [Diène, 01].
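A sketch of the "sending credit + message conservation until delivery" principle, assuming messages carry a sequence number and that acknowledgements and timeouts are signalled by other threads; the real SDDS-2000 acknowledgement management is more elaborate than this.

```python
class CreditSender:
    """At most `credit` unacknowledged messages in flight; every sent message
    is conserved in the window until its acknowledgement arrives."""

    def __init__(self, send_fn, credit=5):
        self.send_fn = send_fn   # e.g. a UDP send that prepends the sequence number
        self.credit = credit
        self.window = {}         # seq -> message kept until acknowledged
        self.next_seq = 0

    def can_send(self):
        return len(self.window) < self.credit

    def send(self, msg):
        if not self.can_send():
            return False         # caller must wait for acknowledgements first
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.window[seq] = msg   # message conservation until delivery
        self.send_fn(seq, msg)
        return True

    def on_ack(self, seq):
        self.window.pop(seq, None)   # delivered: release the conserved copy

    def on_timeout(self):
        for seq, msg in self.window.items():
            self.send_fn(seq, msg)   # retransmit everything still unacknowledged
```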

17 Architectural Design (Cont'd)
Multicast component: before, a pre-defined and static table; now, a dynamic structure, updated when new or spare buckets (PBs/DBs) are added, by probing through multicast. The coordinator probes DBs and PBs via the blank-DBs and blank-PBs multicast groups.

18 Hardware Testbed
5 machines (Pentium IV, 1.8 GHz, 512 MB RAM); Ethernet network with a maximum bandwidth of 1 Gbps; operating system: Windows 2000 Server.
Tested configuration: 1 client, a group of 4 data buckets, and k parity buckets, k in {0, 1, 2}.

19 LH*RS File Creation

20 File Creation
Client operation: each insert/update/delete of a data record is propagated to the parity buckets.
Data bucket split: the splitting data bucket sends to its parity buckets N deletes (from the old ranks) and N inserts (at the new ranks) for the records that remain, plus N deletes for the records that move; the new data bucket sends N inserts for the moved records to its parity buckets. All updates are gathered in one buffer per group and transferred simultaneously (TCP/IP) to the respective parity buckets of the splitting DB and the new DB.
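A sketch of how the parity updates of a split could be batched before the TCP transfer, assuming an (operation, rank, key, data) tuple encoding for the updates; the actual buffer format of the implementation is not shown on the slide.

```python
def build_split_updates(remaining, moving, old_ranks, new_ranks):
    """remaining/moving: records staying in or leaving the splitting bucket;
    old_ranks/new_ranks: rank of each key before and after the split.
    Returns one update buffer for the splitting bucket's parity group and
    one for the new bucket's parity group."""
    updates_old_group, updates_new_group = [], []
    for rec in remaining:          # records that stay change rank
        updates_old_group.append(("delete", old_ranks[rec.key], rec.key, None))
        updates_old_group.append(("insert", new_ranks[rec.key], rec.key, rec.data))
    for rec in moving:             # records that move to the new bucket
        updates_old_group.append(("delete", old_ranks[rec.key], rec.key, None))
        updates_new_group.append(("insert", new_ranks[rec.key], rec.key, rec.data))
    return updates_old_group, updates_new_group
```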

21 File Creation Perf.
Experimental set-up: file of data records, 1 data record = 104 B; measurements with a client sending credit of 1 and of 5.
Parity bucket overhead: from k = 0 to k = 1, performance degrades by 20%; from k = 1 to k = 2, by a further 8%.

22 File Creation Perf. (Cont'd)
Experimental set-up: file of data records, 1 data record = 104 B; measurements with a client sending credit of 1 and of 5.
Parity bucket overhead: from k = 0 to k = 1, performance degrades by 37%; from k = 1 to k = 2, by a further 10%.

23 LH*RS Parity Bucket Creation

24 PB Creation Scenario: Searching for a New PB
The coordinator multicasts "Wanna join group g?" [Sender IP@ + Entity#, Your Entity#] to the PBs connected to the blank-PBs multicast group.

25 PB Creation Scenario: Waiting for Replies
A candidate PB answers "I would", starts its UDP listening, TCP listening and working threads, and waits for confirmation; if the time-out elapses, it cancels everything. The coordinator collects the replies from the PBs connected to the blank-PBs multicast group.

26 PB Creation Scenario: PB Selection
The coordinator sends "You are selected" <UDP> to the chosen PB, which disconnects from the blank-PBs multicast group, and sends a cancellation to the other candidates.

27 PB Creation Scenario: Auto-Creation, Query Phase
The new PB sends "Send me your contents!" <UDP> to the data buckets of its group.

28 PB Creation Scenario: Auto-Creation, Encoding Phase
Each data bucket of the group sends the requested buffer <TCP> to the new PB, which processes the buffers (encoding).
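In the XOR case, this encoding step amounts to XOR-ing, rank by rank, the data received from the group's data buckets; the sketch below assumes each data bucket sends a list of equal-length byte strings ordered by rank, and only notes where the Reed-Solomon case would differ.

```python
def xor_encode(buffers_by_db):
    """buffers_by_db: one list per data bucket, each a list of equal-length byte
    strings ordered by rank. Returns the parity field for each rank.
    (Reed-Solomon coding would first multiply each bucket's bytes by that
    bucket's column coefficient in the Galois field, then combine the same way.)"""
    n_ranks = max(len(buf) for buf in buffers_by_db)
    parities = []
    for r in range(n_ranks):
        parity = None
        for buf in buffers_by_db:
            if r >= len(buf):
                continue          # this bucket has no record at rank r
            chunk = buf[r]
            parity = chunk if parity is None else bytes(a ^ b for a, b in zip(parity, chunk))
        parities.append(parity)
    return parities
```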

29 PB Creation Perf.: XOR Encoding
Experimental set-up: bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records.

Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec) | Encoding Rate (MB/sec)
5000        | 0.190            | 0.140                 | 0.029                    | 0.659
10000       | 0.429            | 0.304                 | 0.066                    | 0.640
25000       | 1.007            | 0.738                 | 0.144                    | 0.686
50000       | 2.062            | 1.484                 | 0.322                    | 0.608

For every bucket size, processing time is about 74% of total time.

30 PB Creation Perf.: RS Encoding
Experimental set-up: bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records.

Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec) | Encoding Rate (MB/sec)
5000        | 0.193            | 0.149                 | 0.035                    | 0.673
10000       | 0.446            | 0.328                 | 0.059                    | 0.674
25000       | 1.053            | 0.766                 | 0.153                    | 0.713
50000       | 2.103            | 1.531                 | 0.322                    | 0.618

For every bucket size, processing time is about 74% of total time.

31 PB Creation Perf.: Comparison for Bucket Size = 50000
Comparing the XOR and RS encoding rates (MB/sec): XOR provides a performance gain of 5% in processing time (0.02% in the total time).

32 LH*RS Bucket Recovery

33 Buckets' Recovery: Failure Detection
The coordinator sends "Are you alive?" <UDP> to the parity buckets and data buckets.

34 Buckets' Recovery: Waiting for Replies
The surviving parity and data buckets answer "I am alive!" <UDP> to the coordinator.

35 Buckets' Recovery: Searching for 2 Spare DBs
The coordinator multicasts "Wanna be a spare DB?" [Sender IP@, Your Entity#] to the DBs connected to the blank-DBs multicast group.

36 Buckets' Recovery: Waiting for Replies
A candidate DB answers "I would", starts its UDP listening, TCP listening and working threads, and waits for confirmation; if the time-out elapses, it cancels everything. The coordinator collects the replies from the DBs connected to the blank-DBs multicast group.

37 Buckets' Recovery: Spare DBs Selection
The coordinator sends "You are selected" <UDP> to the chosen DBs, which disconnect from the blank-DBs multicast group, and sends a cancellation to the other candidates.

38 Buckets' Recovery: Recovery Manager Determination
The coordinator designates one of the parity buckets as recovery manager and sends it a "Recover Buckets" message carrying the list of spare DBs.

39 Buckets' Recovery: Query Phase
The recovery manager asks the alive buckets participating in the recovery (parity and data buckets): "Send me the records of rank in [r, r+slice-1]" <UDP>.

40 Buckets' Recovery: Reconstruction Phase
The alive buckets participating in the recovery send the requested buffers <TCP> to the recovery manager, which runs the decoding process and sends the recovered records <TCP> to the spare DBs.
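A sketch of the recovery manager's slice loop implied by the query and reconstruction phases; request_slice, decode_slice and push_to_spare are placeholders for the actual UDP/TCP exchanges and for the XOR or Reed-Solomon decoding.

```python
def recover_buckets(alive_buckets, spare_buckets, bucket_size, slice_size,
                    request_slice, decode_slice, push_to_spare):
    """Recover the failed buckets slice by slice (placeholders injected as arguments)."""
    r = 0
    while r < bucket_size:
        hi = min(r + slice_size, bucket_size)
        # Query phase: ask every surviving bucket for the records of rank [r, hi-1].
        slices = {b: request_slice(b, r, hi - 1) for b in alive_buckets}
        # Reconstruction phase: decode the missing buckets' records for those ranks.
        recovered = decode_slice(slices)            # {failed_bucket_id: [records]}
        # Ship the recovered records to the corresponding spare data buckets.
        for bucket_id, records in recovered.items():
            push_to_spare(spare_buckets[bucket_id], records)
        r = hi
```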

41 DBs Recovery Perf.: XOR Decoding, Recovery of 1 DB
Experimental set-up: bucket contents of 31,250 records, about 3.125 MB per bucket; the slice varies from 4% to 100% of the bucket contents.

Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250  | 0.750            | 0.291                 | 0.433
3125  | 0.693            | 0.249                 | 0.372
6250  | 0.667            | 0.260                 | 0.360
15625 | 0.755            | 0.255                 | 0.458
31250 | 0.734            | 0.271                 | 0.448

The total time varies little with the slice: about 0.72 sec.

42 DBs Recovery Perf.: RS Decoding, Recovery of 1 DB
Experimental set-up: bucket contents of 31,250 records, about 3.125 MB per bucket; the slice varies from 4% to 100% of the bucket contents.

Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250  | 0.870            | 0.390                 | 0.443
3125  | 0.867            | 0.375                 |
6250  | 0.828            | 0.385                 | 0.303
15625 | 0.854            | 0.433                 |
31250 |                  |                       | 0.448

The total time varies little with the slice: about 0.85 sec.

43 DBs Recovery Perf.: XOR vs. RS Decoding
Recovery time of 1 DB with XOR: about 0.72 sec; with RS: about 0.85 sec. XOR provides a performance gain of 15% in total time.

44 DBs Recovery Perf.: Recovery of 2 DBs (RS Decoding)
Experimental set-up: bucket contents of 31,250 records, about 3.125 MB per bucket; the slice varies from 4% to 100% of the bucket contents.

Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250  | 1.234            | 0.590                 | 0.519
3125  | 1.172            | 0.599                 | 0.400
6250  |                  | 0.598                 | 0.365
15625 | 1.146            | 0.609                 | 0.443
31250 | 1.088            |                       | 0.442

The total time varies little with the slice: about 1.2 sec.

45 DBs Recovery Perf.: Recovery of 3 DBs (RS Decoding)
Experimental set-up: bucket contents of 31,250 records, about 3.125 MB per bucket; the slice varies from 4% to 100% of the bucket contents.

Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250  | 1.589            | 0.922                 | 0.522
3125  | 1.599            | 0.928                 | 0.383
6250  | 1.541            | 0.907                 | 0.401
15625 | 1.578            | 0.891                 | 0.520
31250 | 1.468            | 0.906                 | 0.495

The total time varies little with the slice: about 1.6 sec.

46 Perf. Summary of Bucket Recovery
1 DB (3.125 MB) recovered in about 0.7 sec with XOR, i.e. 4.46 MB/sec.
1 DB (3.125 MB) recovered in about 0.85 sec with RS, i.e. 3.65 MB/sec.
2 DBs (6.250 MB) recovered in about 1.2 sec with RS, i.e. 5.21 MB/sec.
3 DBs (9.375 MB) recovered in 1.6 sec with RS, i.e. 5.86 MB/sec.
(Recovery throughput = recovered volume / total time; e.g. 9.375 MB / 1.6 sec = 5.86 MB/sec.)

47 Conclusion
The conducted experiments show: good recovery performance (1 DB is recovered in about half a second); the encoding/decoding optimizations and the enhanced bucket architecture have a clear impact on performance. In particular, the processing time of the RS decoding process was improved by 4% to 8%.

48 Conclusion (Cont'd)
LH*RS is a mature implementation, refined through many optimization iterations, and the only SDDS with scalable availability.

49 Future Work
A better strategy for propagating parity updates to the PBs, and investigation of faster encoding/decoding processes.

50 References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of the ACM SIGMOD Conf., June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) Specification, Sept. 1981.
[McDonal & Barkley, 00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details.
[Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, 1988.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989.
[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf, 1991.
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep., 1995.

51 References (Cont'd)
[Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, Proceedings of the ACM SIGMOD 2000.
[Karlsson et al., 96] J. S. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.
[Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
[Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software: Practice & Experience, 27(9), Sept. 1997.
[Diène, 01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Nov. 2001, Université Paris Dauphine.
[Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, June 2000, Université Paris Dauphine.
[Moussa] More references:

52 End
Thank you for your attention.

53 Parity Calculus
Galois field: GF(2^8), where 1 symbol is 1 byte, vs. GF(2^16), where 1 symbol is 2 bytes. (+) GF(2^16) halves the number of symbols, and consequently the number of operations in the field; (-) larger multiplication tables.
New generator matrix: a first row of 1s lets each PB use XOR calculus for any update coming from the first DB of a group (a measured performance gain of 4% for PB creation); a first column of 1s lets the first parity bucket use XOR calculus instead of RS calculus (a performance gain of 20% in encoding).
Encoding & decoding hints: for encoding, pre-computing the logarithms of the P-matrix coefficients improves performance by 3.5%; for decoding, pre-computing the logarithms of the H^-1 matrix coefficients and of the b vector for multiple-bucket recovery improves performance by 4% to 8%.
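The log pre-calculation and the role of the all-1s row and column can be illustrated with log/antilog multiplication tables; the sketch below uses GF(2^8) with one standard primitive polynomial for brevity, whereas the implementation discussed here works in GF(2^16) with larger tables.

```python
# Log/antilog tables for GF(2^8); 0x11D is a common primitive polynomial.
PRIM = 0x11D
EXP = [0] * 512   # antilog table, doubled so gf_mul needs no modulo
LOG = [0] * 256

_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]

def gf_mul(a, b):
    """Multiply in GF(2^8) with two table lookups instead of a polynomial product.
    A coefficient of 1 is the identity, which is why a generator matrix with a
    first row and first column of 1s lets those parity terms be computed by
    plain XOR, as described above."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]
```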

