Contribution to the Design & Implementation of the Highly Available, Scalable and Distributed Data Structure LH*RS, by Rim Moussa

Contribution to the Design & Implementation of the Highly Available, Scalable and Distributed Data Structure LH*RS. Rim Moussa. Thesis presentation in Computer Science (Distributed Databases). Thesis supervisor: Pr. Witold Litwin. Examiners: Pr. Thomas J. E. Schwarz, Pr. Toré Risch. Jury president: Pr. Gérard Lévy. Paris Dauphine University, CERIA Lab., 4 October 2004.

Outline: 1. Issue, 2. State of the Art, 3. LH*RS Scheme, 4. LH*RS Manager, 5. Experiments, 6. LH*RS File Creation, 7. Bucket Recovery, 8. Parity Bucket Creation, 9. Conclusion & Future Work.

Facts: the volume of information grows by about 30% per year, while technology improves on two fronts: network infrastructure (Gilder's Law: bandwidth triples every year) and PC storage & computing capacities (Moore's Law: they double every 18 months). Disk accesses and CPUs thus become the bottleneck, hence the need for distributed data storage systems with high throughput: SDDSs such as LH*, RP*, ...

Facts (continued): failures are frequent and costly. Statistics published by Contingency Planning Research in 1996 put the cost of one hour of service interruption for a brokerage application at $6.45 million, hence the need for distributed and highly available data storage systems. Multicomputers offer a modular architecture and a good price/performance tradeoff.

State of the Art. Data replication: (+) good response time, since mirrors are functional; (-) high storage overhead (× n for n replicas). Parity calculus: erasure-resilient codes, evaluated by the following criteria: encoding rate (parity volume / data volume), update penalty (parity volume), group size used for data reconstruction, encoding & decoding complexity, recovery capabilities.

Parity Schemes. 1-available schemes: XOR parity calculus, as in RAID technology (levels 3, 4, 5, ...) [PGK88] and the SDDS LH*g [L96]. k-available schemes: binary linear codes [H94] tolerate at most 3 failures; array codes (EVENODD [B94], X-code [XB99], RDP [C+04]) tolerate at most 2 failures; Reed-Solomon codes (IDA [R89], RAID X [W91], FEC [B95], the tutorial [P97], LH*RS [LS00, ML02, MS04, LMS04]) tolerate k failures, k > 3.

Outline: 1. Issue, 2. State of the Art, 3. LH*RS Scheme (what is LH*RS? SDDSs, Reed-Solomon codes, encoding/decoding optimizations), 4. LH*RS Manager, 5. Experiments, ...

What is LH*RS? LH* is a scalable & distributed data structure: scalability and high throughput come from distribution using linear hashing (LH*LH [KLR96], LH*LH manager [B00]). High availability comes from parity calculus using Reed-Solomon codes [RS60]. Together they give LH*RS [LS00].

SDDS Principles (1): dynamic file growth. Clients insert records into the data buckets over the network; when a bucket is overloaded, the coordinator tells it to split, and part of its records are transferred to a new bucket.

SDDS Principles (2): no centralized directory access. Each client keeps its own image of the file; a query addressed to the wrong data bucket is forwarded to the correct one, and the client receives an image adjustment message that updates its file image.

Reed-Solomon Codes. Encoding: from m data symbols, compute the n - m parity symbols. Data representation: symbols are elements of a Galois field, a field of finite size q closed under addition, subtraction, multiplication and division. In GF(2^w): (1) addition is XOR; (2) multiplication uses log tables (gflog and antigflog): e1 * e2 = antigflog[gflog[e1] + gflog[e2]].
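Below is a minimal sketch of this table-driven arithmetic in GF(2^8). The primitive polynomial 0x11D and the example values are assumptions chosen for illustration, not necessarily the thesis's choices; note that the two tables occupy 256 + 512 = 768 bytes, matching the 0.768 KB figure quoted two slides later.

```python
# GF(2^8) arithmetic with gflog/antigflog tables, as named on the slide.
# Assumption: primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D).

GF_SIZE = 256
PRIM_POLY = 0x11D

gflog = [0] * GF_SIZE
antigflog = [0] * (2 * GF_SIZE)   # doubled so index sums never need a modulo

def build_tables():
    x = 1
    for i in range(GF_SIZE - 1):      # enumerate powers of the generator element
        antigflog[i] = x
        gflog[x] = i
        x <<= 1
        if x & 0x100:                 # reduce modulo the primitive polynomial
            x ^= PRIM_POLY
    for i in range(GF_SIZE - 1, 2 * GF_SIZE):
        antigflog[i] = antigflog[i - (GF_SIZE - 1)]

def gf_add(e1, e2):
    return e1 ^ e2                    # addition (and subtraction) is XOR

def gf_mul(e1, e2):
    if e1 == 0 or e2 == 0:
        return 0
    return antigflog[gflog[e1] + gflog[e2]]

build_tables()
print(hex(gf_mul(0x80, 0x02)))        # x^7 * x = x^8, reduced to 0x1d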

RS Encoding. The data symbols S1 ... Sm are multiplied by the m × n generator matrix (Im | P), where Im is the identity and P is the m × (n-m) parity matrix with coefficients C_i,j. (1) The encoding is systematic: the first m codeword symbols are the data symbols themselves. (2) Any m columns of (Im | P) are linearly independent. Each parity symbol is P_j = (S1 * C1,j) XOR (S2 * C2,j) XOR ... XOR (Sm * Cm,j), i.e. m GF multiplications and m-1 XORs.
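The parity computation above fits in a few lines. A minimal sketch, reusing gf_mul from the previous sketch; the coefficient matrix C below is an illustrative placeholder, not the thesis's actual parity matrix:

```python
# Systematic RS encoding as on the slide: for parity column j,
#   P_j = (S_1 * C_1,j) XOR (S_2 * C_2,j) XOR ... XOR (S_m * C_m,j)
# i.e. m GF multiplications and m - 1 XORs per parity symbol.

def encode_parity(data_symbols, parity_matrix):
    m = len(data_symbols)
    k = len(parity_matrix[0])            # number of parity symbols (n - m)
    parity = []
    for j in range(k):
        p = 0
        for i in range(m):
            p ^= gf_mul(data_symbols[i], parity_matrix[i][j])
        parity.append(p)
    return parity

# Example with m = 4 data symbols and k = 2 parity symbols; the all-'1' first
# column means the first parity symbol is a plain XOR of the data symbols.
C = [[1, 1],
     [1, 2],
     [1, 3],
     [1, 4]]
print(encode_parity([0x0A, 0x0B, 0x0C, 0x0D], C))
```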

RS Decoding (optimized). Collect any m surviving symbols among S1 ... Sm, P1 ... Pn-m. Take the m columns of the generator matrix corresponding to these symbols to form the m × m matrix H, invert it by Gaussian elimination, and multiply the vector of the m surviving ("OK") symbols by the columns of H^-1 corresponding to the lost symbols.
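A sketch of this decoding step, layered on the GF helpers above. The Gauss-Jordan inversion and the toy generator matrix are illustrative assumptions; the thesis's matrices differ.

```python
# Recover lost data symbols from any m surviving codeword symbols: build H
# from the corresponding columns of the generator, invert it over GF(2^8),
# and apply H^-1 to the vector of surviving ("OK") symbols.

def gf_inv(e):
    return antigflog[(GF_SIZE - 1) - gflog[e]]      # e != 0 assumed

def invert_matrix(A):
    m = len(A)
    # Augment with the identity, then Gauss-Jordan elimination in GF(2^8).
    M = [row[:] + [1 if i == j else 0 for j in range(m)]
         for i, row in enumerate(A)]
    for col in range(m):
        pivot = next(r for r in range(col, m) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        inv_p = gf_inv(M[col][col])
        M[col] = [gf_mul(x, inv_p) for x in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x ^ gf_mul(f, y) for x, y in zip(M[r], M[col])]
    return [row[m:] for row in M]

def recover(generator, surviving_positions, surviving_symbols, lost_data_indices):
    m = len(generator)
    # H[r][i] = coefficient of data symbol i in the r-th surviving symbol.
    H = [[generator[i][p] for i in range(m)] for p in surviving_positions]
    H_inv = invert_matrix(H)
    recovered = {}
    for i in lost_data_indices:
        s = 0
        for r in range(m):
            s ^= gf_mul(H_inv[i][r], surviving_symbols[r])
        recovered[i] = s
    return recovered

# Toy check: 3 data symbols plus one XOR parity symbol; data symbol 0 is lost.
G = [[1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 0, 1, 1]]
print(recover(G, [1, 2, 3], [7, 9, 5 ^ 7 ^ 9], [0]))   # {0: 5}
```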

Optimizations (1): choice of the Galois field. (+) GF(2^16) vs. GF(2^8) halves the number of symbols (1 symbol = 2 bytes instead of 1 byte) and hence the number of GF operations. (-) Larger multiplication tables: 0.768 KB for GF(2^8) vs. 393.216 KB for GF(2^16) (512 × 0.768).

Optimizations (2): structure of the parity matrix. A first column of '1's lets the first parity bucket be encoded with plain XOR calculus (gain in encoding & decoding). A first row of '1's lets any update coming from the first data bucket be processed with XOR calculus (performance gain of 4% for parity bucket creation, m = 4). [The slide shows the parity matrix with hexadecimal coefficients such as 0001, eb9b, 2284, ...]

Optimizations (3): log precomputation. Goal: reduce the GF multiplication complexity of e1 * e2 = antigflog[gflog[e1] + gflog[e2]]. Encoding: precompute the logs of the P-matrix coefficients (improvement of 3.5%). Decoding: precompute the logs of the H^-1 coefficients and of the OK-symbols vector (improvement of 4% to 8%, depending on the number of buckets to recover).
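A sketch of this optimization: the logs of the parity-matrix coefficients are stored once, so each multiplication inside the encoding loop costs one table lookup and one addition, and a coefficient equal to 1 (the row or column of '1's of the previous slide) degenerates to a plain XOR. Names are illustrative and the helpers come from the GF sketch above.

```python
def precompute_logs(parity_matrix):
    # gflog[C_i,j] computed once per coefficient, outside the encoding loop.
    return [[None if c == 0 else gflog[c] for c in row] for row in parity_matrix]

def encode_parity_fast(data_symbols, log_matrix, column_j):
    p = 0
    for i, s in enumerate(data_symbols):
        c_log = log_matrix[i][column_j]
        if s == 0 or c_log is None:
            continue
        if c_log == 0:                 # coefficient 1: plain XOR, no lookup
            p ^= s
        else:
            p ^= antigflog[gflog[s] + c_log]
    return p

logC = precompute_logs([[1, 1], [1, 2], [1, 3], [1, 4]])
print(encode_parity_fast([0x0A, 0x0B, 0x0C, 0x0D], logC, 0))   # XOR-only column
```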

LH*RS Parity Groups. Grouping concept: each group has m data buckets and k parity buckets. A data record (key; data) is inserted at some rank r in its data bucket; the corresponding parity record is (rank; [key list]; parity data). A k-available group survives the failure of any k of its buckets.

Outline: 1. Issue, 2. State of the Art, 3. LH*RS Scheme, 4. LH*RS Manager (communication, gross architecture), 5. Experiments, 6. File Creation, 7. Bucket Recovery, ...

Communication (1): UDP. Used for individual operations (insert, update, delete, search), record recovery and control messages; chosen for performance.

Communication (2): TCP/IP. Used for large buffer transfers: creation of new parity buckets, transfer of parity updates and records during a bucket split, and bucket recovery; chosen for performance and reliability.

Communication (3): multicast. Used to look for new data or parity buckets (multipoint communication).

Architecture: enhancements to the SDDS2000 architecture. (1) TCP/IP connection handler: connections are passive OPEN (RFC 793 [ISI81], TCP/IP under the Win2K Server OS [MB00]). Recovery of 1 bucket (3.125 MB): 6.7 s with SDDS2000 vs. 2.6 s with SDDS2000-TCP, an improvement of 60% (hardware: 733 MHz CPUs, 100 Mbps network). (2) Flow control & message acknowledgement (FCMA), based on the principle of sending credit & message conservation until delivery [J88, GRS97, D01].
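A minimal sketch of the sending-credit idea behind FCMA: at most `credit` messages are in flight, every message is kept until its acknowledgement arrives, and unacknowledged messages are retransmitted after a timeout. The class, its parameters and the transport callback are illustrative assumptions, not the SDDS2000 code.

```python
import time

class CreditSender:
    def __init__(self, send_fn, credit=8, timeout=0.5):
        self.send_fn = send_fn          # e.g. wraps a UDP or TCP send
        self.credit = credit            # sending credit: max unacknowledged msgs
        self.timeout = timeout
        self.pending = {}               # msg_id -> (payload, last send time)
        self.next_id = 0

    def send(self, payload):
        while len(self.pending) >= self.credit:    # no credit left
            self.check_timeouts()
            time.sleep(0.01)
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = (payload, time.time())   # conserve until delivery
        self.send_fn(msg_id, payload)
        return msg_id

    def on_ack(self, msg_id):
        self.pending.pop(msg_id, None)  # delivery confirmed, credit freed

    def check_timeouts(self):
        now = time.time()
        for msg_id, (payload, sent) in list(self.pending.items()):
            if now - sent > self.timeout:
                self.send_fn(msg_id, payload)            # retransmit
                self.pending[msg_id] = (payload, now)

sender = CreditSender(lambda mid, p: print("send", mid, p), credit=2)
sender.send(b"parity update")
sender.on_ack(0)
```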

Architecture (2): dynamic IP addressing structure. Before, new servers came from a pre-defined, static table; now the coordinator tags new data or parity servers through a multicast group of blank data buckets and a multicast group of blank parity buckets.

Architecture (3): bucket internals. Each server exposes a multicast listening port, a UDP listening port, a UDP sending port and a TCP/IP port. A UDP listening thread, a TCP listening thread and a multicast listening/working thread push incoming messages into message queues served by a pool of working threads. ACK management threads maintain the ACK structure: free zones, messages waiting for ACK, and not-yet-acknowledged messages.
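A minimal sketch of this listener/worker layout: one listening thread per transport feeds a message queue that a pool of working threads drains. Only the UDP listener is shown; the port number, thread count and the handle stub are illustrative assumptions.

```python
import queue, socket, threading

messages = queue.Queue()                       # the messages queue of the slide

def udp_listener(port, stop):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    sock.settimeout(0.5)
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(64 * 1024)
        except socket.timeout:
            continue
        messages.put((addr, data))             # hand the message to the workers

def worker(stop):
    while not stop.is_set():
        try:
            addr, data = messages.get(timeout=0.5)
        except queue.Empty:
            continue
        handle(addr, data)                     # insert/search/parity update, ...

def handle(addr, data):
    pass                                       # placeholder for bucket operations

stop = threading.Event()
threading.Thread(target=udp_listener, args=(4000, stop), daemon=True).start()
for _ in range(4):                             # pool of working threads
    threading.Thread(target=worker, args=(stop,), daemon=True).start()
```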

Experiments. Performance evaluation: CPU time and communication time. Experimental environment: 5 machines (Pentium IV 1.8 GHz, 512 MB RAM), 1 Gbps Ethernet network, Win2K Server OS. Tested configuration: 1 client, a group of 4 data buckets, k parity buckets (k = 0, 1, 2, 3).

Outline: 1. Issue, 2. State of the Art, 3. LH*RS Scheme, 4. LH*RS Manager, 5. Experiments, 6. File Creation (parity updates, performance), 7. Bucket Recovery, 8. Parity Bucket Creation.

File Creation. Client operations: data record inserts, updates and deletes are propagated to the parity buckets. Update: send only the Δ-record. Delete: management of free ranks within data buckets. Data bucket split, with N1 the number of remaining records and N2 the number of leaving records: the parity group of the splitting data bucket receives N1+N2 deletes plus N1 inserts; the parity group of the new data bucket receives N2 inserts.
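A small sketch of the parity traffic generated by a split, under the record-count bookkeeping described above; the message shapes and names are illustrative.

```python
def split_parity_messages(remaining, leaving):
    """remaining / leaving: lists of (key, record) kept by / moved out of the bucket."""
    old_group = []
    # The splitting bucket's group: delete all N1+N2 old ranks...
    for rank in range(len(remaining) + len(leaving)):
        old_group.append(("delete", rank))
    # ...then re-insert the N1 remaining records with fresh ranks.
    for rank, (key, rec) in enumerate(remaining):
        old_group.append(("insert", rank, key, rec))
    # The new bucket's group: N2 inserts.
    new_group = [("insert", rank, key, rec)
                 for rank, (key, rec) in enumerate(leaving)]
    return old_group, new_group

old_msgs, new_msgs = split_parity_messages([("k1", "a"), ("k2", "b")], [("k3", "c")])
print(len(old_msgs), len(new_msgs))            # 5 (3 deletes + 2 inserts) and 1
```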

File creation performance: configuration. Client window = 1 and client window = 5; max bucket size = … records; file of … records; 1 record = 104 bytes. No difference between GF(2^8) and GF(2^16), since we do not wait for ACKs between data buckets and parity buckets.

File creation performance (continued): from k = 0 to k = 1, a performance degradation of 20%; from k = 1 to k = 2, a degradation of 8%.

File creation performance (continued): from k = 0 to k = 1, a performance degradation of 37%; from k = 1 to k = 2, a degradation of 10%.

Outline: 1. Issue, 2. State of the Art, 3. LH*RS Scheme, 4. LH*RS Manager, 5. Experiments, 6. File Creation, 7. Bucket Recovery (scenario, performance), 8. Parity Bucket Creation.

Bucket recovery scenario (1): failure detection. The coordinator probes the data and parity buckets of the group: "Are you alive?"

Scenario (2): waiting for responses. The surviving data and parity buckets answer "OK"; the failed ones do not.

Scenario (3): searching for spare buckets. The coordinator asks the multicast group of blank data buckets: "Wanna be spare?"

Scenario (4): waiting for replies. Each candidate answers "I would", launches its UDP listening, TCP listening and working threads, and waits for confirmation; if the time-out elapses, it cancels everything.

Scenario (5): spare selection. The coordinator hires the required spares ("You are hired", acknowledged by "Confirmed") and sends a cancellation to the other candidates.
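The exchange in scenarios (3) to (5) can be summarized with a small sketch; the in-memory objects and the probe/confirm/cancel calls below stand in for the real multicast protocol and its timeouts, and only show the hire/cancel outcome.

```python
class BlankBucket:
    def __init__(self, available=True):
        self.available = available
        self.state = "blank"
    def reply_to_probe(self):                  # "Wanna be spare?" -> "I would"
        if self.available:
            self.state = "volunteer"           # real bucket: start listeners, wait
        return self.available                  # for confirmation, cancel on timeout
    def confirm(self):
        self.state = "hired"                   # "You are hired" + "Confirmed"
    def cancel(self):
        self.state = "blank"                   # "Cancellation"

def recruit_spare(blank_buckets, needed=1):
    volunteers = [b for b in blank_buckets if b.reply_to_probe()]
    hired, rest = volunteers[:needed], volunteers[needed:]
    for b in hired:
        b.confirm()
    for b in rest:
        b.cancel()
    return hired

print([b.state for b in recruit_spare([BlankBucket(), BlankBucket(False), BlankBucket()])])
```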

Scenario (6): recovery manager selection. The coordinator selects one of the parity buckets as recovery manager; the parity buckets then recover the failed buckets.

Scenario (7): query phase. The recovery manager asks the data and parity buckets participating in the recovery: "Send me the records of rank in [r, r+slice-1]".

Scenario (8): reconstruction phase. In parallel with the query phase, the recovery manager decodes the requested buffers and sends the recovered slices to the spare buckets.
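A toy sketch of this slice-by-slice recovery for the 1-unavailability case, where decoding is a plain XOR of the surviving columns; the real system issues network requests and uses the RS decoder for multiple failures. All names here are illustrative.

```python
from functools import reduce
from operator import xor

def request_slice(bucket, r, count):
    return bucket[r:r + count]                 # stands in for the query phase

def decode_slice_xor(buffers):
    # With one lost bucket, each lost value is the XOR of the surviving ones.
    return [reduce(xor, row, 0) for row in zip(*buffers)]

def recover_bucket_xor(surviving, bucket_size, slice_size):
    recovered = []
    r = 0
    while r < bucket_size:
        count = min(slice_size, bucket_size - r)
        buffers = [request_slice(b, r, count) for b in surviving]   # query phase
        recovered.extend(decode_slice_xor(buffers))                 # reconstruction
        r += count
    return recovered

# Toy example: 3 data "buckets" plus one XOR parity bucket; recover the lost one.
d0, d1, d2 = [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]
parity = [a ^ b ^ c for a, b, c in zip(d0, d1, d2)]
print(recover_bucket_xor([d0, d1, parity], bucket_size=4, slice_size=2) == d2)
```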

XOR vs. RS recovery performance: configuration. File of … records; record size = 100 bytes; bucket size = … records (3.125 MB); group of 4 data buckets (m = 4), k-available with k = 1, 2, 3. Decoding in GF(2^16), using RS+ decoding (RS with log precomputation of H^-1 and of the OK-symbols vector). Recovery proceeds per slice, adaptive to the PCs' storage and computing capacities.

Recovery of 1 data bucket with XOR decoding. Varying the slice from 4% to 100% of the bucket content, the total time stays almost constant, around 0.58 s:
Slice | Total Time (s) | CPU Time (s) | Comm. Time (s)
1250 | 0.625 | 0.266 | …
… | 0.588 | 0.255 | …
… | 0.552 | 0.240 | …
… | 0.562 | 0.255 | …
… | 0.578 | 0.250 | 0.328

Recovery of 1 data bucket with RS decoding. Again the total time is almost constant, around 0.67 s, as the slice goes from 4% to 100% of the bucket content:
Slice | Total Time (s) | CPU Time (s) | Comm. Time (s)
1250 | 0.734 | 0.349 | …
… | 0.688 | 0.359 | …
… | 0.656 | 0.354 | …
… | 0.667 | 0.360 | …
… | 0.688 | 0.360 | 0.328

XOR vs. RS: time to recover 1 data bucket is 0.58 s with XOR and 0.67 s with RS; XOR in GF(2^16) brings a gain of 13% in total time (and 30% in CPU time).

Recovery of 2 data buckets. Total time almost constant, around 0.9 s, as the slice goes from 4% to 100% of the bucket content:
Slice | Total Time (s) | CPU Time (s) | Comm. Time (s)
1250 | 0.976 | 0.577 | …
… | 0.932 | 0.589 | …
… | 0.883 | 0.562 | …
… | 0.875 | 0.562 | …
… | 0.875 | 0.562 | 0.313

Recovery of 3 data buckets. Total time almost constant, around 1.23 s, as the slice goes from 4% to 100% of the bucket content:
Slice | Total Time (s) | CPU Time (s) | Comm. Time (s)
1250 | 1.281 | 0.828 | …
… | 1.250 | 0.828 | …
… | 1.211 | 0.852 | …
… | 1.188 | 0.823 | …
… | 1.203 | 0.828 | 0.375

Recovery summary. The time to recover f buckets stays well below f times the time to recover 1 bucket: the query phase is factorized, and the additional cost is only the decoding time and the time to send the recovered buffers.
f | Volume recovered (MB) | Total Time (s) | Recovery Speed (MB/s)
1 (XOR) | 3.125 | 0.58 | …
1 (RS) | 3.125 | 0.67 | 4.66
2 | 6.250 | 0.90 | 6.94
3 | 9.375 | 1.23 | 7.62
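As a quick check on the table (a derivation from its own columns, not a new measurement): the recovery speed is the recovered volume divided by the total time, e.g. for f = 3, 3 × 3.125 MB / 1.23 s = 9.375 / 1.23 ≈ 7.62 MB/s.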

Results in GF(2^8): XOR decoding in GF(2^8) improves decoding performance by 60% compared to RS decoding in GF(2^8); RS/RS+ decoding in GF(2^16) brings a gain of 50% over decoding in GF(2^8).

Outline: 1. Issue, 2. State of the Art, 3. LH*RS Scheme, 4. LH*RS Manager, 5. Experiments, 6. File Creation, 7. Bucket Recovery, 8. Parity Bucket Creation (scenario, performance).

Parity bucket creation scenario (1): searching for a new parity bucket. The coordinator asks the multicast group of blank parity buckets: "Wanna join group g?"

Scenario (2): waiting for replies. Each candidate answers "I would", launches its UDP listening, TCP listening and working threads, and waits for confirmation; if the time-out elapses, it cancels everything.

Scenario (3): new parity bucket selection. The coordinator hires one candidate ("You are hired", acknowledged by "Confirmed") and sends a cancellation to the others.

Scenario (4): auto-creation, query phase. The new parity bucket asks the data buckets of its group: "Send me your contents!"

Scenario (5): auto-creation, encoding phase. The new parity bucket processes the requested buffers received from the data buckets and encodes its parity records.

Parity bucket creation performance: configuration. Max bucket size: … records; bucket load factor: 62.5%; record size: 100 bytes; group of 4 data buckets. Encoding in GF(2^16) with RS++ (log precomputation, plus the row of '1's so that the first data bucket's buffer is processed with XOR encoding).

Encoding performance vs. bucket size (total, CPU and communication times): the encoding rate is the same whatever the bucket size, and CPU time accounts for about 74% of the total time.

XOR vs. RS encoding: XOR encoding speed … s, RS encoding speed … s; for a bucket size of … records, XOR brings a gain in CPU time of 5% (only 0.02% on total time).

Results in GF(2^8): as in GF(2^16), CPU time is about 3/4 of the total time, and XOR in GF(2^8) improves CPU time by 22%.

Performance summary (Wintel P4, 1.8 GHz, 1 Gbps). File creation rate: 0.33 MB/s for k = 0, … MB/s for k = 1, … MB/s for k = 2. Record insert time: 0.29 ms for k = 0, … ms for k = 1, … ms for k = 2. Bucket recovery rate: 4.66 MB/s from 1-unavailability, 6.94 MB/s from 2-unavailability, 7.62 MB/s from 3-unavailability. Record recovery time: about 1.3 ms. Key search time: individual about 0.24 ms, bulk about 0.056 ms.

Conclusion. The experiments show that the encoding/decoding optimizations and the architectural enhancements have a clear impact on performance, and that recovery performance is good.

Future Work: reliability and performance of update propagation to the parity buckets; reducing the coordinator's tasks; parity declustering; investigation of new erasure-resilient codes.

References
[PGK88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of ACM SIGMOD Conf., June 1988.
[ISI81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) Specification, Sept. 1981.
[MB00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details.
[J88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, 1988.
[XB99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), 1999.
[CEG+04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[R89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 26, No. 2, April 1989.
[W91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
[GRS97] J. C. Gomez, V. Redo, V. S. Sunderam, Efficient Multithreaded User-Space Transport for Network Computing: Design & Test of the TRAP Protocol, Journal of Parallel & Distributed Computing, 40(1), 1997.

References (2)
[BK+95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep., 1995.
[LS00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed-Solomon Codes, Proceedings of ACM SIGMOD 2000.
[KLR96] J. Karlson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.
[RS60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
[P97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software: Practice & Experience, 27(9), Sept. 1997.
[D01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Nov. 2001, Université Paris Dauphine.
[B00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, June 2000, Université Paris Dauphine.

Publications
[ML02] R. Moussa, W. Litwin, Experimental Performance Analysis of LH*RS Parity Management, Carleton Scientific Records of the 4th International Workshop on Distributed Data & Structures (WDAS 2002).
[MS04] R. Moussa, T. Schwarz, Design and Implementation of LH*RS: A Highly Available Scalable Distributed Data Structure, Carleton Scientific Records of the 6th International Workshop on Distributed Data & Structures (WDAS 2004).
[LMS04] W. Litwin, R. Moussa, T. Schwarz, Prototype Demonstration of LH*RS: A Highly Available Distributed Storage System, Proc. of VLDB 2004 (Demo Session).
[LMS04-a] W. Litwin, R. Moussa, T. Schwarz, LH*RS: A Highly Available Distributed Storage System, journal version submitted, under revision.

Thank you for your attention. Questions?