
1 Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure. Rim Moussa (Rim.Moussa@dauphine.fr, http://ceria.dauphine.fr/rim/rim.html) and Thomas J.E. Schwarz (TjSchwarz@scu.edu, http://www.cse.scu.edu/~tschwarz/homepage/thomas_schwarz.html). Workshop in Distributed Data & Structures, July 2004.

2 Objective: the design, implementation and performance measurements of LH*RS. Factors of interest: parity overhead and recovery performance.

3 Overview: 1. Motivation; 2. Highly-available schemes; 3. LH*RS; 4. Architectural design; 5. Hardware testbed; 6. File creation; 7. High availability; 8. Recovery (scenario description and performance results); 9. Conclusion; 10. Future work.

4 Motivation: information volume grows by about 30% per year; disk access and CPUs become the bottleneck; failures are frequent and costly. Average hourly financial impact of downtime (source: Contingency Planning Research, 1996): brokerage (retail) operations (Financial): $6.45 million; credit card sales authorization (Financial): $2.6 million; airline reservation centers (Transportation): $89,500; cellular (new) service activation (Communication): $41,000.

5 Requirements: we need highly available networked data storage systems that offer scalability, high throughput and high availability.

6 Scalable & Distributed Data Structure: dynamic file growth. [Figure: clients send inserts over the network to data buckets (DBs); an overloaded bucket tells the coordinator "I'm overloaded!", the coordinator replies "You split!", and records are transferred to a new bucket.]

7 SDDS (ctnd.): no centralized directory access. [Figure: a client sends a query directly to a data bucket; if the client's image is outdated, the query is forwarded to the correct bucket and the client receives an Image Adjustment Message.]

8 Solutions towards high availability. Data replication: (+) good response time, since mirrors are queried; (-) high storage cost (× n for n replicas). Parity calculus: erasure-resilient codes, evaluated with respect to coding rate (parity volume / data volume), update penalty, the group size used for data reconstruction, and the complexity of coding & decoding.

9 Fault-tolerant schemes. Tolerating 1 server failure: simple XOR parity calculus, as in RAID systems [Patterson et al., 88] and the SDDS LH*g [Litwin et al., 96]. Tolerating more than 1 server failure: binary linear codes [Hellerstein et al., 94]; array codes (EVENODD [Blaum et al., 94], X-code [Xu et al., 99], the RDP scheme [Corbett et al., 04]), which tolerate just 2 failures; Reed-Solomon codes (IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], the tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00]), which tolerate a large number of failures.

10 A Highly Available & Distributed Data Structure: LH*RS [Litwin & Schwarz, 00], [Litwin, Moussa & Schwarz, sub.]

11 LH*RS is an SDDS. Data distribution scheme based on Linear Hashing, LH*LH [Karlsson et al., 96], applied to the key field. Parity calculus based on Reed-Solomon codes [Reed & Solomon, 60]. → Scalability, high throughput, high availability.
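For readers unfamiliar with the addressing side, here is a minimal sketch of the classic Linear Hashing addressing rule that LH*-style schemes apply to the key field. It is illustrative only: LH*LH adds client images and bucket-level hashing that are not shown, and all names are invented.

```python
def lh_address(key: int, level: int, split_ptr: int, n0: int = 1) -> int:
    """Classic Linear Hashing addressing rule that LH*-style schemes build on.

    level     -- current file level i, so h_i(key) = key mod (n0 * 2**i)
    split_ptr -- split pointer n: buckets below n have already split to level i+1
    n0        -- initial number of buckets
    """
    a = key % (n0 * 2 ** level)            # try h_i first
    if a < split_ptr:                      # bucket a has already split...
        a = key % (n0 * 2 ** (level + 1))  # ...so re-address with h_{i+1}
    return a

# Example: level 2, split pointer 1, one initial bucket.
# Keys 5 and 13 map to bucket 1; key 4 falls below the split pointer
# and is re-addressed with h_3, landing in bucket 4.
print(lh_address(5, 2, 1), lh_address(13, 2, 1), lh_address(4, 2, 1))
```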

12 LH*RS file structure. [Figure: data buckets store data records (key + data field), each inserted at a rank r within its bucket group; parity buckets store parity records (rank r, the list of keys of the data records at that rank, and the parity field computed over them).]
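A minimal sketch of the two record layouts described by the figure, with illustrative field names (not the actual implementation's types):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataRecord:
    key: int            # primary key, used by the LH addressing scheme
    data: bytes         # non-key data field
    rank: int = -1      # insert rank r inside the bucket group

@dataclass
class ParityRecord:
    rank: int                                      # rank r shared by the protected data records
    keys: List[int] = field(default_factory=list)  # keys of the data records at rank r
    parity: bytes = b""                            # parity field computed over those records
```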

13 Architectural Design of LH*RS

14 Communication. Use of TCP/IP (better performance and reliability than UDP for bulk transfers): new PB creation, large update transfers (DB split), bucket recovery. Use of UDP (speed): individual insert/update/delete/search queries, record recovery, service and control messages.
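An illustrative sketch of this split between short UDP messages and bulk TCP transfers; the host name and port are assumptions made up for the example, not the system's actual endpoints.

```python
import socket

BUCKET_ADDR = ("bucket-host", 5000)   # hypothetical data-bucket endpoint

def send_control(message: bytes) -> None:
    """Individual insert/search/service messages: a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(message, BUCKET_ADDR)

def send_bulk(buffer: bytes) -> None:
    """Large transfers (split buffers, bucket recovery): one TCP connection."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect(BUCKET_ADDR)
        s.sendall(buffer)
```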

15 Bucket architecture. [Figure: each bucket runs a multicast listening thread and a multicast working thread, a UDP listening thread, a TCP listening thread, n working threads and an acknowledgement-management thread; incoming messages pass through message queues and a processing buffer; a sending-credit window tracks messages waiting for acknowledgement (not yet ack'ed) and the free zones of the window.]

16 Architectural design: enhancements to SDDS-2000 [Bennour, 00], [Diène, 01]. TCP/IP connection handler: TCP/IP connections use passive OPEN (RFC 793 [ISI, 81]; TCP/IP implementation under the Windows 2000 Server OS [MacDonald & Barkley, 00]). Bucket architecture. Flow control and acknowledgement management: principle of "sending credit + message conservation until delivery" [Jacobson, 88], [Diène, 01]. Example, recovery of 1 DB: 6.7 s with the SDDS-2000 architecture vs. 2.6 s with the new architecture → improvement of about 60% (hardware configuration: 733 MHz machines, 100 Mbps network).
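The sketch below illustrates the "sending credit + message conservation until delivery" idea: the sender may have at most `credit` unacknowledged messages outstanding, and keeps each one until its acknowledgement arrives so it can be repeated. Class and method names are invented for the example.

```python
import threading

class CreditWindow:
    def __init__(self, credit: int = 5):
        self.credit = threading.Semaphore(credit)  # free slots in the sending window
        self.unacked: dict = {}                    # message id -> payload, kept until ack'ed
        self.lock = threading.Lock()

    def send(self, msg_id: int, payload: bytes, transmit) -> None:
        self.credit.acquire()                      # block until a credit slot is free
        with self.lock:
            self.unacked[msg_id] = payload         # conserve the message until delivery
        transmit(msg_id, payload)

    def ack(self, msg_id: int) -> None:
        with self.lock:
            self.unacked.pop(msg_id, None)         # delivered: forget the copy
        self.credit.release()                      # free one slot of sending credit

    def resend_pending(self, transmit) -> None:    # e.g. after an acknowledgement timeout
        with self.lock:
            pending = list(self.unacked.items())
        for msg_id, payload in pending:
            transmit(msg_id, payload)
```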

17 Architectural design (ctnd.): multicast component. Before: a pre-defined, static table of IP addresses. Now: a dynamic IP-address structure, updated when new or spare buckets (PBs/DBs) are added, through a multicast probe. The coordinator, the DBs and the PBs interact with a blank-DBs multicast group and a blank-PBs multicast group.
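A sketch of such a multicast probe with standard IP multicast sockets; the group address, port and message text are assumptions, not the actual protocol constants.

```python
import socket
import struct

BLANK_PB_GROUP = ("239.1.2.3", 6000)   # hypothetical blank-PBs multicast group

def probe_blank_buckets(message: bytes = b"WANNA_JOIN?") -> None:
    """Coordinator side: send one probe datagram to every blank bucket in the group."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
        s.sendto(message, BLANK_PB_GROUP)

def join_blank_group() -> socket.socket:
    """Blank bucket side: join the multicast group and wait for probes."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", BLANK_PB_GROUP[1]))
    mreq = struct.pack("4sl", socket.inet_aton(BLANK_PB_GROUP[0]), socket.INADDR_ANY)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return s
```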

18 Hardware testbed: 5 machines (Pentium IV, 1.8 GHz, 512 MB RAM); Ethernet network with a maximum bandwidth of 1 Gbps; operating system: Windows 2000 Server. Tested configuration: 1 client, a group of 4 data buckets, and k parity buckets, k ∈ {0, 1, 2}.

19 LH*RS File Creation

20 File creation. Client operation: each insert/update/delete of a data record is propagated to the parity buckets. Data bucket split: the splitting DB sends to its PBs, for each record that remains, one delete from its old rank and one insert at its new rank, and for each record that moves, one delete; the new data bucket sends to its PBs one insert per moved record. All updates are gathered in the same buffer and transferred (over TCP/IP) simultaneously to the respective parity buckets of the splitting DB and of the new DB.
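An illustrative sketch of how the parity-update buffers of one split could be gathered, reusing the DataRecord sketch from the file-structure slide; the update encoding and all names are invented, not the actual implementation's.

```python
def split_parity_updates(records, stays_after_split):
    """Return (updates for the splitting DB's PBs, updates for the new DB's PBs).

    records           -- list of (record, old_rank) pairs held by the splitting DB
    stays_after_split -- predicate: True if the record remains in the splitting DB
    """
    old_group, new_group = [], []
    next_rank_old, next_rank_new = 0, 0
    for record, old_rank in records:
        if stays_after_split(record):
            # remaining record: delete at its old rank, re-insert at its new rank
            old_group.append(("delete", old_rank, record.key))
            old_group.append(("insert", next_rank_old, record))
            next_rank_old += 1
        else:
            # moved record: delete from the old group, insert into the new DB's group
            old_group.append(("delete", old_rank, record.key))
            new_group.append(("insert", next_rank_new, record))
            next_rank_new += 1
    # Both buffers are then shipped over TCP to the parity buckets of the
    # splitting DB's group and of the new DB's group, respectively.
    return old_group, new_group
```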

21 File creation performance. Experimental set-up: file of 25,000 data records; 1 data record = 104 B; client sending credit = 1 and 5. PB overhead: from k = 0 to k = 1, performance degradation of 20%; from k = 1 to k = 2, degradation of 8%.

22 File creation performance (ctnd.). Experimental set-up: file of 25,000 data records; 1 data record = 104 B; client sending credit = 1 and 5. PB overhead: from k = 0 to k = 1, performance degradation of 37%; from k = 1 to k = 2, degradation of 10%.

23 LH*RS Parity Bucket Creation

24 PB creation scenario: searching for a new PB. The coordinator sends to the PBs connected to the blank-PBs multicast group: "Wanna join group g?" [sender IP address + entity #, your entity #].

25 PB creation scenario: waiting for replies. Each candidate PB answers "I would", starts UDP listening, TCP listening and its working threads, then waits for confirmation; if the time-out elapses, it cancels everything.

26 PB creation scenario: PB selection. The coordinator sends "You are selected" to one candidate, which disconnects from the blank-PBs multicast group, and a cancellation to the others.

27 PB creation scenario: auto-creation, query phase. The new PB asks every data bucket of its group: "Send me your contents!"

28 PB creation scenario: auto-creation, encoding phase. Each data bucket of the group sends the requested buffer to the new PB, which processes the buffers (encoding).
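For the simplest case, the sketch below shows XOR encoding of the received buffers, which (per the parity-calculus slide in the appendix) is what the first parity bucket of a group can use instead of full Reed-Solomon arithmetic; the buffer framing is illustrative.

```python
def xor_encode(buffers) -> bytes:
    """XOR the record buffers received from the data buckets of the group."""
    length = max(len(b) for b in buffers)
    parity = bytearray(length)                 # short buffers are implicitly zero-padded
    for buf in buffers:
        for i, byte in enumerate(buf):
            parity[i] ^= byte
    return bytes(parity)

# Example: a group of 4 data buckets, each contributing one record buffer.
buffers = [b"alpha", b"bravo", b"char", b"delta"]
p = xor_encode(buffers)
# Any single missing buffer can be rebuilt by XOR-ing the parity with the survivors.
recovered = xor_encode([p, buffers[1], buffers[2], buffers[3]])
assert recovered[:len(buffers[0])] == buffers[0]
```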

29 PB creation performance (XOR encoding). Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 × bucket size; file size = 2.5 × bucket size records.

Bucket Size | Total Time (s) | Processing Time (s) | Communication Time (s) | Encoding Rate (MB/s)
5000        | 0.190          | 0.140               | 0.029                  | 0.659
10000       | 0.429          | 0.304               | 0.066                  | 0.640
25000       | 1.007          | 0.738               | 0.144                  | 0.686
50000       | 2.062          | 1.484               | 0.322                  | 0.608

As the bucket size grows, the processing time stays at about 74% of the total time.

30 PB creation performance (RS encoding). Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 × bucket size; file size = 2.5 × bucket size records.

Bucket Size | Total Time (s) | Processing Time (s) | Communication Time (s) | Encoding Rate (MB/s)
5000        | 0.193          | 0.149               | 0.035                  | 0.673
10000       | 0.446          | 0.328               | 0.059                  | 0.674
25000       | 1.053          | 0.766               | 0.153                  | 0.713
50000       | 2.103          | 1.531               | 0.322                  | 0.618

As the bucket size grows, the processing time stays at about 74% of the total time.

31 PB creation performance: XOR encoding vs. RS encoding. XOR encoding rate: 0.66 MB/s; RS encoding rate: 0.673 MB/s. For a bucket size of 50,000, XOR provides a performance gain of 5% in processing time (≈ 0.02% in total time).

32 LH*RS Bucket Recovery

33 Buckets' recovery: failure detection. The coordinator asks the data buckets and parity buckets: "Are you alive?"

34 Buckets' recovery: the coordinator waits for replies; the surviving buckets answer "I am alive!"

35 Buckets' recovery: searching for 2 spare DBs. The coordinator sends to the DBs connected to the blank-DBs multicast group: "Wanna be a spare DB?" [sender IP address, your entity #].

36 Buckets' recovery: waiting for replies. Each candidate DB answers "I would", starts UDP listening, TCP listening and its working threads, then waits for confirmation; if the time-out elapses, it cancels everything.

37 Buckets' recovery: spare DBs selection. The coordinator sends "You are selected" to the chosen candidates, which disconnect from the blank-DBs multicast group, and a cancellation to the others.

38 Buckets' recovery: recovery manager determination. The coordinator sends to one of the parity buckets: "Recover buckets" [spares' IP addresses + entity #s; ...].

39 Buckets' recovery: query phase. The recovery manager asks the alive buckets participating in the recovery (data buckets and parity buckets): "Send me records of rank in [r, r+slice-1]".

40 Buckets' recovery: reconstruction phase. The alive buckets send the requested buffers to the recovery manager, which runs the decoding process and forwards the recovered records to the spare DBs.
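A toy sketch of this slice-by-slice reconstruction loop for the single-failure / XOR case. Records are modelled as integers and the bucket interfaces are invented; the general multi-failure case decodes with Reed-Solomon over the group instead of XOR.

```python
def recover_bucket(alive_buckets, spare, bucket_size: int, slice_size: int) -> None:
    """alive_buckets -- objects with .read(r, n) returning their records of rank r .. r+n-1
    spare            -- spare DB with .write(rank, record)
    """
    r = 0
    while r < bucket_size:
        n = min(slice_size, bucket_size - r)
        # Query phase: ask every alive bucket of the group for ranks [r, r+n-1].
        slices = [b.read(r, n) for b in alive_buckets]
        # Reconstruction phase: XOR the surviving data slices with the parity slice.
        for i in range(n):
            rebuilt = 0
            for s in slices:
                rebuilt ^= s[i]
            spare.write(r + i, rebuilt)
        r += n
```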

41 DB recovery performance (XOR decoding). Experimental set-up: file of 125,000 records; bucket of 31,250 records ≈ 3.125 MB.

Slice | Total Time (s) | Processing Time (s) | Communication Time (s)
1250  | 0.750          | 0.291               | 0.433
3125  | 0.693          | 0.249               | 0.372
6250  | 0.667          | 0.260               | 0.360
15625 | 0.755          | 0.255               | 0.458
31250 | 0.734          | 0.271               | 0.448

Varying the slice from 4% to 100% of the bucket contents, the total time varies little: ≈ 0.72 s.

42 DB recovery performance (RS decoding). Experimental set-up: file of 125,000 records; bucket of 31,250 records ≈ 3.125 MB.

Slice | Total Time (s) | Processing Time (s) | Communication Time (s)
1250  | 0.870          | 0.390               | 0.443
3125  | 0.867          | 0.375               | -
6250  | 0.828          | 0.385               | 0.303
15625 | 0.854          | 0.375               | 0.433
31250 | 0.854          | 0.375               | 0.448

Varying the slice from 4% to 100% of the bucket contents, the total time varies little: ≈ 0.85 s.

43 DB recovery performance: XOR decoding vs. RS decoding. Recovery time of 1 DB with XOR: 0.720 s; with RS: 0.855 s. XOR provides a performance gain of 15% in total time.

44 DB recovery performance: recovering 2 DBs. Experimental set-up: file of 125,000 records; bucket of 31,250 records ≈ 3.125 MB.

Slice | Total Time (s) | Processing Time (s) | Communication Time (s)
1250  | 1.234          | 0.590               | 0.519
3125  | 1.172          | 0.599               | 0.400
6250  | 1.172          | 0.598               | 0.365
15625 | 1.146          | 0.609               | 0.443
31250 | 1.088          | 0.599               | 0.442

Varying the slice from 4% to 100% of the bucket contents, the total time varies little: ≈ 1.2 s.

45 DB recovery performance: recovering 3 DBs. Experimental set-up: file of 125,000 records; bucket of 31,250 records ≈ 3.125 MB.

Slice | Total Time (s) | Processing Time (s) | Communication Time (s)
1250  | 1.589          | 0.922               | 0.522
3125  | 1.599          | 0.928               | 0.383
6250  | 1.541          | 0.907               | 0.401
15625 | 1.578          | 0.891               | 0.520
31250 | 1.468          | 0.906               | 0.495

Varying the slice from 4% to 100% of the bucket contents, the total time varies little: ≈ 1.6 s.

46 Performance summary of bucket recovery: 1 DB (3.125 MB) in 0.7 s with XOR → 4.46 MB/s; 1 DB (3.125 MB) in 0.85 s with RS → 3.65 MB/s; 2 DBs (6.250 MB) in 1.2 s with RS → 5.21 MB/s; 3 DBs (9.375 MB) in 1.6 s with RS → 5.86 MB/s.

47 Conclusion. The conducted experiments show that the encoding/decoding optimization and the enhanced bucket architecture have a clear impact on performance, and that recovery performance is good: 1 DB is recovered in about half a second. Finally, we improved the processing time of the RS decoding process by 4% to 8%.

48 Conclusion. LH*RS is a mature implementation, the result of many optimization iterations, and the only SDDS with scalable availability.

49 Future work: a better strategy for propagating parity updates to the PBs; investigation of faster encoding/decoding processes.

50 References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of the ACM SIGMOD Conf., pp. 109-116, June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) - Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html
[MacDonald & Barkley, 00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html
[Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 26, No. 2, April 1989, pp. 335-348.
[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.

51 References (Ctnd.)
[Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, pp. 237-248, Proceedings of ACM SIGMOD 2000.
[Karlsson et al., 96] J. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.
[Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
[Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software - Practice & Experience, 27(9), Sept. 1997, pp. 995-1012.
[Diène, 01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Nov. 2001, Université Paris Dauphine.
[Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, June 2000, Université Paris Dauphine.
[Moussa] http://ceria.dauphine.fr/rim/rim.html
More references: http://ceria.dauphine.fr/rim/biblio.pdf

52 End

53 Parity calculus. Galois field: in GF(2^8) a symbol is 1 byte; in GF(2^16) a symbol is 2 bytes. (+) GF(2^16) vs. GF(2^8) halves the number of symbols, and consequently the number of operations in the field; (-) larger multiplication tables. New generator matrix: a first column of 1s means the 1st parity bucket executes XOR calculus instead of RS calculus → performance gain in encoding of 20%; a first row of 1s means each PB executes XOR calculus for any update coming from the 1st DB of any group → performance gain of 4%, measured for PB creation. Encoding & decoding hints: for encoding, log pre-calculus of the P matrix coefficients → improvement of 3.5%; for decoding, log pre-calculus of the H^-1 matrix coefficients and of the b vector for multiple-bucket recovery → improvement from 4% to 8%.
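The log/antilog pre-computation mentioned here can be sketched as follows. For brevity the example uses GF(2^8) with the common primitive polynomial 0x11D; this polynomial is an assumption, not necessarily the one used by LH*RS, and the same construction applies to GF(2^16), only with 2^16-entry tables.

```python
FIELD = 256
PRIM_POLY = 0x11D                      # x^8 + x^4 + x^3 + x^2 + 1 (assumed, common choice)

exp = [0] * (2 * FIELD)                # antilog table, doubled to avoid a modulo in gf_mul
log = [0] * FIELD
x = 1
for i in range(FIELD - 1):
    exp[i] = x
    log[x] = i
    x <<= 1                            # multiply by the generator alpha = x
    if x & FIELD:
        x ^= PRIM_POLY                 # reduce modulo the primitive polynomial
for i in range(FIELD - 1, 2 * FIELD):
    exp[i] = exp[i - (FIELD - 1)]

def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8): one table lookup and an addition of logs."""
    if a == 0 or b == 0:
        return 0
    return exp[log[a] + log[b]]

# Pre-computing log(coefficient) for every generator-matrix entry, as the slide
# suggests, turns each encoding multiplication into a single exp[] lookup.
assert gf_mul(2, 128) == 0x1D          # 2 * 128 = x^8, reduced mod 0x11D
```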

