Presentation on theme: "Workshop in Distributed Data & Structures"— Presentation transcript:

1 Workshop in Distributed Data & Structures
July 2004. Design & Implementation of LH*RS: a Highly Available Distributed Data Structure. Thomas J.E. Schwarz, Rim Moussa

2 Objective
Design, implementation and performance measurements of LH*RS. The factors of interest are: parity overhead and recovery performance.

3 Overview
Motivation
Highly-available schemes
LH*RS: Architectural Design, Hardware Testbed, File Creation, High Availability, Recovery (scenario description and performance results)
Conclusion
Future Work

4 Motivation
Information volume grows by about 30% per year.
Disk access and CPUs are bottlenecks.
Failures are frequent and costly:

Business Operation                | Industry       | Average Hourly Financial Impact
Brokerage (retail) operations     | Financial      | $6.45 million
Credit card sales authorization   | Financial      | $2.6 million
Airline reservation centers       | Transportation | $89,500
Cellular (new) service activation | Communication  | $41,000

Source: Contingency Planning Research, 1996

5 Highly Available Networked Data Storage Systems
Requirements: the need for highly available networked data storage systems translates into scalability, high throughput and high availability.

6 Scalable & Distributed Data Structure
Dynamic file growth: a client's inserts overload a data bucket; the coordinator orders it to split ("I'm overloaded!" / "You split!"), and roughly half of its records are transferred over the network to a new data bucket (DB).
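The split behaviour above follows the linear-hashing addressing rule. Below is a minimal, illustrative Python sketch of that rule; the simple modulo hash on integer keys, the in-memory list of buckets and the function names are assumptions made for exposition, not the LH*LH key-field hashing or the networked buckets of the actual system.

```python
# Minimal linear-hashing sketch (illustrative only).

N0 = 1  # assumed number of initial buckets

def h(key, level):
    """Family of hash functions h_i(key) = key mod (N0 * 2^i)."""
    return key % (N0 * 2 ** level)

def address(key, level, split_ptr):
    """LH addressing: buckets below the split pointer already use h_{i+1}."""
    a = h(key, level)
    if a < split_ptr:
        a = h(key, level + 1)
    return a

def split(buckets, level, split_ptr):
    """Split the bucket under the split pointer: roughly half of its records
    move to a new bucket appended at address split_ptr + N0 * 2^level."""
    keep, move = [], []
    for key in buckets[split_ptr]:
        (keep if h(key, level + 1) == split_ptr else move).append(key)
    buckets[split_ptr] = keep
    buckets.append(move)               # records transferred to the new data bucket
    split_ptr += 1
    if split_ptr == N0 * 2 ** level:   # a full round of splits has completed
        level, split_ptr = level + 1, 0
    return level, split_ptr
```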

7 SDDS (Cont'd): No Centralized Directory Access
A client sends its query directly, using its own image of the file; if the image is outdated, the contacted data bucket forwards the query over the network to the correct bucket, and the client receives an Image Adjustment Message.
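A rough sketch of the client-image mechanism described above, assuming the Image Adjustment Message simply carries the file level and split pointer known by the answering server; the actual LH* adjustment rule is more subtle, so this is only an illustration of the idea that clients correct their view lazily, without a centralized directory.

```python
class ClientImage:
    """Illustrative client image of an LH* file (one initial bucket assumed)."""

    def __init__(self):
        self.level = 0       # client's possibly outdated view of the file level
        self.split_ptr = 0   # client's possibly outdated view of the split pointer

    def address(self, key):
        """Compute the target bucket from the client's image; the contacted
        bucket forwards the query if the image turns out to be outdated."""
        a = key % (2 ** self.level)
        if a < self.split_ptr:
            a = key % (2 ** (self.level + 1))
        return a

    def adjust(self, iam_level, iam_split_ptr):
        """Apply an Image Adjustment Message; never move the image backwards."""
        if (iam_level, iam_split_ptr) > (self.level, self.split_ptr):
            self.level, self.split_ptr = iam_level, iam_split_ptr
```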

8 Solutions towards High Availability
Data replication: (+) good response time, since mirrors can be queried; (-) high storage cost (a factor of n for n replicas).
Parity calculus: erasure-resilient codes are evaluated with respect to coding rate (parity volume / data volume), update penalty, the group size used for data reconstruction, and the complexity of coding and decoding.

9 Fault-Tolerant Schemes
Tolerating 1 server failure, with simple XOR parity calculus: RAID systems [Patterson et al., 88], the SDDS LH*g [Litwin et al., 96].
Tolerating more than 1 server failure:
Binary linear codes [Hellerstein et al., 94];
Array codes: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04]; these tolerate just 2 failures.
Reed-Solomon codes: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00]; these tolerate a large number of failures.

10 A Highly Available & Distributed Data Structure: LH*RS
[Litwin & Schwarz, 00] [Litwin, Moussa & Schwarz, sub.]

11 LH*RS SDDS
Data distribution scheme based on Linear Hashing, LH*LH [Karlsson et al., 96], applied to the key field: scalability and high throughput.
Parity calculus with Reed-Solomon codes [Reed & Solomon, 60]: high availability.

12 LH*RS File Structure
Data buckets store data records: [Key | Data Field]. Parity buckets store parity records: [Rank | Key List | Parity Field]. An inserted key is assigned a rank r within its bucket; the parity record of rank r covers the data records of rank r across the bucket group.
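Purely as an illustration, the two record layouts named above could be modelled as follows; the Python field names and types are assumptions, not the actual storage format of the implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataRecord:
    key: int       # primary key, also used by LH*LH for addressing
    data: bytes    # non-key data field

@dataclass
class ParityRecord:
    rank: int                                       # position of the record group in the bucket group
    keys: List[int] = field(default_factory=list)   # keys of the data records sharing this rank
    parity: bytes = b""                             # XOR / Reed-Solomon parity over their data fields
```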

13 Architectural Design of LH*RS

14 Communication
Use of UDP (for speed): individual insert/update/delete/search queries, record recovery, service and control messages.
Use of TCP/IP (better performance and reliability than UDP for large transfers): new PB creation, large update transfers (DB split), bucket recovery.
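A minimal sketch of this protocol split, with an illustrative peer address: short queries and control messages leave as single UDP datagrams, while bulk data (split buffers, bucket recovery) travels over a TCP connection.

```python
import socket

PEER = ("10.0.0.2", 5000)   # illustrative address of a bucket server

def send_query_udp(payload: bytes):
    """Small individual queries and control messages: one UDP datagram (speed)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, PEER)

def send_bulk_tcp(buffer: bytes):
    """Large transfers (DB split buffers, bucket recovery): a TCP connection
    for reliability and sustained throughput."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect(PEER)
        s.sendall(buffer)
```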

15 Bucket Architecture
Network interfaces: a multicast listening port, send and receive UDP ports, and a TCP/IP port (TCP connections).
Threads: a TCP listening thread, a UDP listening thread and a multicast listening thread feed message queues; a pool of working threads (1..n) and a multicast working thread process the messages from a buffer.
Flow control: an acknowledgement-management thread manages the sending credit and the window of messages waiting for acknowledgement (not-acked messages, free zones).
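A minimal sketch of the listening-thread / message-queue / working-thread pattern listed above, assuming a single shared queue and an empty processing placeholder; the multicast listener and the acknowledgement-management thread are omitted here.

```python
import queue
import threading

messages = queue.Queue()   # shared message queue fed by the listening threads

def process(data, addr):
    """Placeholder for message-specific processing (insert, search, recovery...)."""
    pass

def udp_listener(sock):
    """UDP listening thread: push every received datagram into the queue."""
    while True:
        data, addr = sock.recvfrom(64 * 1024)
        messages.put((data, addr))

def worker():
    """Working thread: pop messages from the queue and process them."""
    while True:
        data, addr = messages.get()
        process(data, addr)
        messages.task_done()

def start_workers(n):
    """Start the pool of n working threads."""
    for _ in range(n):
        threading.Thread(target=worker, daemon=True).start()
```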

16 Architectural Design: Enhancements to SDDS-2000 [Bennour, 00] [Diène, 01]
Bucket architecture, TCP/IP connection handler: TCP/IP connections use a passive OPEN (RFC 793 [ISI, 81]; TCP/IP implementation under the Windows 2000 Server OS [McDonal & Barkley, 00]). Example, recovery of 1 DB: 6.7 s with the SDDS-2000 architecture vs. 2.6 s with the new architecture, an improvement of about 60% (hardware configuration: 733 MHz machines, 100 Mbps network).
Flow control and acknowledgement management: principle of "sending credit + message conservation until delivery" [Jacobson, 88] [Diène, 01].
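A sketch of the "sending credit + message conservation until delivery" principle, assuming messages carry a sequence number and that acknowledgements and timeouts are signalled by other threads; the real SDDS-2000 acknowledgement management is more elaborate than this.

```python
class CreditSender:
    """At most `credit` unacknowledged messages in flight; every sent message
    is conserved in the window until its acknowledgement arrives."""

    def __init__(self, send_fn, credit=5):
        self.send_fn = send_fn   # e.g. a UDP send that prepends the sequence number
        self.credit = credit
        self.window = {}         # seq -> message kept until acknowledged
        self.next_seq = 0

    def can_send(self):
        return len(self.window) < self.credit

    def send(self, msg):
        if not self.can_send():
            return False         # caller must wait for acknowledgements first
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.window[seq] = msg   # message conservation until delivery
        self.send_fn(seq, msg)
        return True

    def on_ack(self, seq):
        self.window.pop(seq, None)   # delivered: release the conserved copy

    def on_timeout(self):
        for seq, msg in self.window.items():
            self.send_fn(seq, msg)   # retransmit everything still unacknowledged
```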

17 Architectural Design (Cont'd)
Multicast component: before, a pre-defined and static table; now, a dynamic structure, updated when new or spare buckets (PBs/DBs) are added, by probing through multicast. The coordinator probes DBs and PBs via the blank-DBs and blank-PBs multicast groups.

18 Hardware Testbed
5 machines (Pentium IV, 1.8 GHz, 512 MB RAM); Ethernet network with a maximum bandwidth of 1 Gbps; operating system: Windows 2000 Server.
Tested configuration: 1 client, a group of 4 data buckets, and k parity buckets, k in {0, 1, 2}.

19 LH*RS File Creation

20 File Creation
Client operation: each insert/update/delete of a data record is propagated to the parity buckets.
Data bucket split: the splitting data bucket sends to its parity buckets N deletes (from the old ranks) and N inserts (at the new ranks) for the records that remain, plus N deletes for the records that move; the new data bucket sends N inserts for the moved records to its parity buckets. All updates are gathered in one buffer per group and transferred simultaneously (TCP/IP) to the respective parity buckets of the splitting DB and the new DB.
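A sketch of how the parity updates of a split could be batched before the TCP transfer, assuming an (operation, rank, key, data) tuple encoding for the updates; the actual buffer format of the implementation is not shown on the slide.

```python
def build_split_updates(remaining, moving, old_ranks, new_ranks):
    """remaining/moving: records staying in or leaving the splitting bucket;
    old_ranks/new_ranks: rank of each key before and after the split.
    Returns one update buffer for the splitting bucket's parity group and
    one for the new bucket's parity group."""
    updates_old_group, updates_new_group = [], []
    for rec in remaining:          # records that stay change rank
        updates_old_group.append(("delete", old_ranks[rec.key], rec.key, None))
        updates_old_group.append(("insert", new_ranks[rec.key], rec.key, rec.data))
    for rec in moving:             # records that move to the new bucket
        updates_old_group.append(("delete", old_ranks[rec.key], rec.key, None))
        updates_new_group.append(("insert", new_ranks[rec.key], rec.key, rec.data))
    return updates_old_group, updates_new_group
```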

21 File Creation Perf.
Experimental set-up: file of data records, 1 data record = 104 B; measurements with a client sending credit of 1 and of 5.
Parity bucket overhead: from k = 0 to k = 1, performance degrades by 20%; from k = 1 to k = 2, by a further 8%.

22 File Creation Perf. (Cont'd)
Experimental set-up: file of data records, 1 data record = 104 B; measurements with a client sending credit of 1 and of 5.
Parity bucket overhead: from k = 0 to k = 1, performance degrades by 37%; from k = 1 to k = 2, by a further 10%.

23 LH*RS Parity Bucket Creation

24 PB Creation Scenario: Searching for a New PB
The coordinator multicasts "Wanna join group g?" [Sender IP@ + Entity#, Your Entity#] to the PBs connected to the blank-PBs multicast group.

25 PB Creation Scenario: Waiting for Replies
A candidate PB answers "I would", starts its UDP listening, TCP listening and working threads, and waits for confirmation; if the time-out elapses, it cancels everything. The coordinator collects the replies from the PBs connected to the blank-PBs multicast group.

26 PB Creation Scenario: PB Selection
The coordinator sends "You are selected" <UDP> to the chosen PB, which disconnects from the blank-PBs multicast group, and sends a cancellation to the other candidates.

27 PB Creation Scenario: Auto-Creation, Query Phase
The new PB sends "Send me your contents!" <UDP> to the data buckets of its group.

28 PB Creation Scenario: Auto-Creation, Encoding Phase
Each data bucket of the group sends the requested buffer <TCP> to the new PB, which processes the buffers (encoding).
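In the XOR case, this encoding step amounts to XOR-ing, rank by rank, the data received from the group's data buckets; the sketch below assumes each data bucket sends a list of equal-length byte strings ordered by rank, and only notes where the Reed-Solomon case would differ.

```python
def xor_encode(buffers_by_db):
    """buffers_by_db: one list per data bucket, each a list of equal-length byte
    strings ordered by rank. Returns the parity field for each rank.
    (Reed-Solomon coding would first multiply each bucket's bytes by that
    bucket's column coefficient in the Galois field, then combine the same way.)"""
    n_ranks = max(len(buf) for buf in buffers_by_db)
    parities = []
    for r in range(n_ranks):
        parity = None
        for buf in buffers_by_db:
            if r >= len(buf):
                continue          # this bucket has no record at rank r
            chunk = buf[r]
            parity = chunk if parity is None else bytes(a ^ b for a, b in zip(parity, chunk))
        parities.append(parity)
    return parities
```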

29 PB Creation Perf.: XOR Encoding
Experimental set-up: bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records.

Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec) | Encoding Rate (MB/sec)
5000        | 0.190            | 0.140                 | 0.029                    | 0.659
10000       | 0.429            | 0.304                 | 0.066                    | 0.640
25000       | 1.007            | 0.738                 | 0.144                    | 0.686
50000       | 2.062            | 1.484                 | 0.322                    | 0.608

For every bucket size, processing time is about 74% of total time.

30 PB Creation Perf.: RS Encoding
Experimental set-up: bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records.

Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec) | Encoding Rate (MB/sec)
5000        | 0.193            | 0.149                 | 0.035                    | 0.673
10000       | 0.446            | 0.328                 | 0.059                    | 0.674
25000       | 1.053            | 0.766                 | 0.153                    | 0.713
50000       | 2.103            | 1.531                 | 0.322                    | 0.618

For every bucket size, processing time is about 74% of total time.

31 PB Creation Perf.: Comparison for Bucket Size = 50000
Comparing the XOR and RS encoding rates (MB/sec): XOR provides a performance gain of 5% in processing time (0.02% in the total time).

32 LH*RS Bucket Recovery

33 Buckets' Recovery: Failure Detection
The coordinator sends "Are you alive?" <UDP> to the parity buckets and data buckets.

34 Buckets' Recovery: Waiting for Replies
The surviving parity and data buckets answer "I am alive!" <UDP> to the coordinator.

35 Buckets' Recovery: Searching for 2 Spare DBs
The coordinator multicasts "Wanna be a spare DB?" [Sender IP@, Your Entity#] to the DBs connected to the blank-DBs multicast group.

36 Buckets' Recovery: Waiting for Replies
A candidate DB answers "I would", starts its UDP listening, TCP listening and working threads, and waits for confirmation; if the time-out elapses, it cancels everything. The coordinator collects the replies from the DBs connected to the blank-DBs multicast group.

37 Buckets' Recovery: Spare DBs Selection
The coordinator sends "You are selected" <UDP> to the chosen DBs, which disconnect from the blank-DBs multicast group, and sends a cancellation to the other candidates.

38 Buckets' Recovery: Recovery Manager Determination
The coordinator designates one of the parity buckets as recovery manager and sends it a "Recover Buckets" message carrying the list of spare DBs.

39 Buckets' Recovery: Query Phase
The recovery manager asks the alive buckets participating in the recovery (parity and data buckets): "Send me the records of rank in [r, r+slice-1]" <UDP>.

40 Buckets' Recovery: Reconstruction Phase
The alive buckets participating in the recovery send the requested buffers <TCP> to the recovery manager, which runs the decoding process and sends the recovered records <TCP> to the spare DBs.
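A sketch of the recovery manager's slice loop implied by the query and reconstruction phases; request_slice, decode_slice and push_to_spare are placeholders for the actual UDP/TCP exchanges and for the XOR or Reed-Solomon decoding.

```python
def recover_buckets(alive_buckets, spare_buckets, bucket_size, slice_size,
                    request_slice, decode_slice, push_to_spare):
    """Recover the failed buckets slice by slice (placeholders injected as arguments)."""
    r = 0
    while r < bucket_size:
        hi = min(r + slice_size, bucket_size)
        # Query phase: ask every surviving bucket for the records of rank [r, hi-1].
        slices = {b: request_slice(b, r, hi - 1) for b in alive_buckets}
        # Reconstruction phase: decode the missing buckets' records for those ranks.
        recovered = decode_slice(slices)            # {failed_bucket_id: [records]}
        # Ship the recovered records to the corresponding spare data buckets.
        for bucket_id, records in recovered.items():
            push_to_spare(spare_buckets[bucket_id], records)
        r = hi
```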

41 DBs Recovery Perf.: XOR Decoding, Recovery of 1 DB
Experimental set-up: bucket contents of 31,250 records, about 3.125 MB per bucket; the slice varies from 4% to 100% of the bucket contents.

Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250  | 0.750            | 0.291                 | 0.433
3125  | 0.693            | 0.249                 | 0.372
6250  | 0.667            | 0.260                 | 0.360
15625 | 0.755            | 0.255                 | 0.458
31250 | 0.734            | 0.271                 | 0.448

The total time varies little with the slice: about 0.72 sec.

42 DBs Recovery Perf.: RS Decoding, Recovery of 1 DB
Experimental set-up: bucket contents of 31,250 records, about 3.125 MB per bucket; the slice varies from 4% to 100% of the bucket contents.

Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250  | 0.870            | 0.390                 | 0.443
3125  | 0.867            | 0.375                 |
6250  | 0.828            | 0.385                 | 0.303
15625 | 0.854            | 0.433                 |
31250 |                  |                       | 0.448

The total time varies little with the slice: about 0.85 sec.

43 DBs Recovery Perf.: XOR vs. RS Decoding
Recovery time of 1 DB with XOR: about 0.72 sec; with RS: about 0.85 sec. XOR provides a performance gain of 15% in total time.

44 DBs Recovery Perf.: Recovery of 2 DBs (RS Decoding)
Experimental set-up: bucket contents of 31,250 records, about 3.125 MB per bucket; the slice varies from 4% to 100% of the bucket contents.

Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250  | 1.234            | 0.590                 | 0.519
3125  | 1.172            | 0.599                 | 0.400
6250  |                  | 0.598                 | 0.365
15625 | 1.146            | 0.609                 | 0.443
31250 | 1.088            |                       | 0.442

The total time varies little with the slice: about 1.2 sec.

45 DBs Recovery Perf.: Recovery of 3 DBs (RS Decoding)
Experimental set-up: bucket contents of 31,250 records, about 3.125 MB per bucket; the slice varies from 4% to 100% of the bucket contents.

Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250  | 1.589            | 0.922                 | 0.522
3125  | 1.599            | 0.928                 | 0.383
6250  | 1.541            | 0.907                 | 0.401
15625 | 1.578            | 0.891                 | 0.520
31250 | 1.468            | 0.906                 | 0.495

The total time varies little with the slice: about 1.6 sec.

46 Perf. Summary of Bucket Recovery
1 DB (3.125 MB) recovered in about 0.7 sec with XOR, i.e. 4.46 MB/sec.
1 DB (3.125 MB) recovered in about 0.85 sec with RS, i.e. 3.65 MB/sec.
2 DBs (6.250 MB) recovered in about 1.2 sec with RS, i.e. 5.21 MB/sec.
3 DBs (9.375 MB) recovered in 1.6 sec with RS, i.e. 5.86 MB/sec.
(Recovery throughput = recovered volume / total time; e.g. 9.375 MB / 1.6 sec = 5.86 MB/sec.)

47 Conclusion
The conducted experiments show: good recovery performance (1 DB is recovered in about half a second); the encoding/decoding optimizations and the enhanced bucket architecture have a clear impact on performance. In particular, the processing time of the RS decoding process was improved by 4% to 8%.

48 Conclusion (Cont'd)
LH*RS is a mature implementation, refined through many optimization iterations, and the only SDDS with scalable availability.

49 Future Work
A better strategy for propagating parity updates to the PBs, and investigation of faster encoding/decoding processes.

50 References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of the ACM SIGMOD Conf., June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) Specification, Sept. 1981.
[McDonal & Barkley, 00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details.
[Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, 1988.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989.
[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf, 1991.
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep., 1995.

51 References (Cont'd)
[Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, Proceedings of the ACM SIGMOD 2000.
[Karlsson et al., 96] J. S. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.
[Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
[Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software: Practice & Experience, 27(9), Sept. 1997.
[Diène, 01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Nov. 2001, Université Paris Dauphine.
[Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, June 2000, Université Paris Dauphine.
[Moussa] More references:

52 End
Thank you for your attention.

53 Parity Calculus
Galois field: GF(2^8), where 1 symbol is 1 byte, vs. GF(2^16), where 1 symbol is 2 bytes. (+) GF(2^16) halves the number of symbols, and consequently the number of operations in the field; (-) larger multiplication tables.
New generator matrix: a first row of 1s lets each PB use XOR calculus for any update coming from the first DB of a group (a measured performance gain of 4% for PB creation); a first column of 1s lets the first parity bucket use XOR calculus instead of RS calculus (a performance gain of 20% in encoding).
Encoding & decoding hints: for encoding, pre-computing the logarithms of the P-matrix coefficients improves performance by 3.5%; for decoding, pre-computing the logarithms of the H^-1 matrix coefficients and of the b vector for multiple-bucket recovery improves performance by 4% to 8%.
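The log pre-calculation and the role of the all-1s row and column can be illustrated with log/antilog multiplication tables; the sketch below uses GF(2^8) with one standard primitive polynomial for brevity, whereas the implementation discussed here works in GF(2^16) with larger tables.

```python
# Log/antilog tables for GF(2^8); 0x11D is a common primitive polynomial.
PRIM = 0x11D
EXP = [0] * 512   # antilog table, doubled so gf_mul needs no modulo
LOG = [0] * 256

_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]

def gf_mul(a, b):
    """Multiply in GF(2^8) with two table lookups instead of a polynomial product.
    A coefficient of 1 is the identity, which is why a generator matrix with a
    first row and first column of 1s lets those parity terms be computed by
    plain XOR, as described above."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]
```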

