Scalable Distributed Data Structures & High-Performance Computing
Witold Litwin, Fethi Bennour
CERIA, University Paris 9 Dauphine
http://ceria.dauphine.fr/

Plan
– Multicomputers for HPC
– What are SDDSs?
– Overview of LH*
– Implementation under SDDS-2000
– Conclusion

Multicomputers
A collection of loosely coupled computers:
– mass-produced and/or preexisting hardware
– shared-nothing architecture
Best for HPC because of scalability:
– message passing through a high-speed network (Mb/s rates)
Network multicomputers:
– use general-purpose networks & PCs
– LANs: Fast Ethernet, Token Ring, SCI, FDDI, Myrinet, ATM…
– NCSA cluster: 1024 NTs on Myrinet by the end of 1999
Switched multicomputers:
– use a bus or a switch
– IBM SP2, Parsytec…

Why Multicomputers?
Unbeatable price/performance ratio for HPC:
– cheaper and more powerful than supercomputers
– especially the network multicomputers
Available everywhere.
Computing power:
– file size, access and processing times, throughput…
For more pros & cons:
– IBM SP2 and GPFS literature
– Tanenbaum: "Distributed Operating Systems", Prentice Hall, 1995
– NOW project (UC Berkeley)
– Bill Gates at Microsoft Scalability Day, May 1997
– www.microsoft.com White Papers from the Business Systems Division
– Report to the President, President's Information Technology Advisory Committee, Aug. 1998

[Figure: a typical network multicomputer – client and server machines on a common network]

Why SDDSs?
Multicomputers need data structures and file systems.
Trivial extensions of traditional structures are not the best:
– hot spots
– scalability
– parallel queries
– distributed and autonomous clients
– distributed RAM & distance to data
For a CPU, data on a disk are as far away as the Moon is for a human (J. Gray, 1998 ACM Turing Award).

What is an SDDS?
Data are structured:
– records with keys / objects with OIDs
– more semantics than in the Unix flat-file model
– the abstraction most popular with applications
– parallel scans & function shipping
Data are on servers:
– waiting for access
Overflowing servers split into new servers:
– appended to the file without informing the clients
Queries come from multiple autonomous clients:
– access initiators
– not supporting synchronous updates
– not using any centralized directory for access computations

What is an SDDS? (cont.)
Clients can make addressing errors:
– clients have a more or less adequate image of the actual file structure
Servers are able to forward the queries to the correct address:
– perhaps in several messages
Servers may send Image Adjustment Messages (IAMs):
– a client never makes the same addressing error twice
Servers support parallel scans:
– sent out by multicast or unicast
– with deterministic or probabilistic termination
See the SDDS talk & papers for more:
– ceria.dauphine.fr/witold.html
– or the LH* ACM TODS paper (Dec. 1996)

High-Availability SDDS
A server can be unavailable for access without service interruption:
– data are reconstructed from other servers
– data and parity servers
Up to k servers can fail:
– at a parity overhead cost of about 1/k
– the factor k can itself scale with the file (scalable-availability SDDSs)

[Figure: an SDDS – client machines and server machines; the file grows through bucket splits under inserts]

[Figure: an SDDS – client access; a query sent to a wrong server is forwarded to the correct one and the client receives an Image Adjustment Message (IAM)]

Known SDDSs (see http://192.134.119.81/SDDS-bibliograhie.html)
DS classics generalized into SDDSs since 1993:
– hash-based: LH*, DDH, Breitbart & al.
– 1-d trees: RP*, Kroll & Widmayer, Breitbart & Vingralek
– m-d trees: k-RP*, dPi-tree, Nardelli-tree
– high availability: LH*m, LH*g; s-availability: LH*SA, LH*RS
– security: LH*s
– disk-oriented: SDLSA

LH* (a classic)
Scalable distributed hash partitioning:
– generalizes the LH addressing schema
– variants used in Netscape products, LH-Server, Unify, Frontpage, IIS, MS Exchange…
Typical load factor: 70–90 %.
In practice, at most 2 forwarding messages, regardless of the size of the file.
In general, 1 message per insert and 2 messages per search on average; 4 messages in the worst case.

LH* bucket servers
For every record c, its correct address a results from the LH addressing rule:
  a := h_i(c);
  if n = 0 then exit
  else if a < n then a := h_{i+1}(c);
  end
(i, n) is the file state, known only to the LH* coordinator.
Each server a keeps track only of the function h_j used to access it: j = i or j = i + 1.
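As a concrete reading of this rule, here is a minimal Python sketch. It assumes the classic LH choice h_i(c) = c mod 2^i, which the slides do not spell out, so treat the hash family as an illustrative placeholder.

```python
def h(i, key):
    # Assumed classic LH hash family: h_i(c) = c mod 2^i.
    return key % (2 ** i)

def lh_address(key, i, n):
    """Correct bucket address for `key` under the file state (i, n)."""
    a = h(i, key)
    if a < n:              # bucket a has already split in the current round,
        a = h(i + 1, key)  # so the next-level hash function applies
    return a
```

With the file state shown in the figures further on (i = 3, n = 2), lh_address(15, 3, 2) returns bucket 7, which matches the addressing example there.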

LH* clients
Each client uses the LH rule for address computation, but with its own client image (i', n') of the file state.
Initially, for a new client, (i', n') = (0, 0).
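A sketch of the client-side computation, reusing h() and lh_address() from the sketch above; the point is that a stale image may point at the wrong bucket, which the servers then correct.

```python
def client_address(key, image):
    """Address a client computes from its own image (i', n') of the file state."""
    i_prime, n_prime = image
    return lh_address(key, i_prime, n_prime)   # may be wrong if the image is stale

# Example with the file state (i = 3, n = 2) of the figures further on:
# a brand-new client, image (0, 0), computes client_address(15, (0, 0)) == 0,
# whereas the correct bucket is lh_address(15, 3, 2) == 7.
```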

LH* Server: Address Verification and Forwarding
Server a getting key c (a = m in particular) computes:
  a' := h_j(c);
  if a' = a then accept c;
  else a'' := h_{j-1}(c);
    if a'' > a and a'' < a' then a' := a'';
  send c to bucket a';
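The same verification-and-forwarding step as a Python sketch, continuing the assumption h_i(c) = c mod 2^i. It returns None when the record is accepted locally, otherwise the bucket to forward to.

```python
def server_receive(a, j, key):
    """Address verification at bucket `a`, which was last built with hash level `j`."""
    a1 = h(j, key)
    if a1 == a:
        return None          # correct bucket: accept the record
    a2 = h(j - 1, key)
    if a < a2 < a1:
        a1 = a2              # the lower-level guess lands closer: forward there instead
    return a1                # forward the key to bucket a1

# Example matching the figures: bucket 0 (j = 4) receiving key 15 forwards it to
# bucket 7, and bucket 7 (j = 3) then accepts it:
# server_receive(0, 4, 15) == 7 and server_receive(7, 3, 15) is None.
```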

Client Image Adjustment
The IAM consists of the address a where the client sent c, and of j(a):
  if j > i' then i' := j - 1, n' := a + 1;
  if n' ≥ 2^i' then n' := 0, i' := i' + 1;
The rule guarantees that the client image stays within the file, provided there are no file contractions (merges).
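The adjustment rule as a Python sketch, in the same illustrative setting as the earlier snippets:

```python
def adjust_image(image, a, j):
    """Update a client image (i', n') from an IAM carrying (a, j)."""
    i_prime, n_prime = image
    if j > i_prime:
        i_prime = j - 1
        n_prime = a + 1
    if n_prime >= 2 ** i_prime:
        n_prime = 0
        i_prime += 1
    return (i_prime, n_prime)

# Example from the figures: the new client that sent key 15 to bucket 0 gets the
# IAM (a = 7, j = 3), and adjust_image((0, 0), 7, 3) == (3, 0).
```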

[Figure: LH* file structure – coordinator with file state n = 2, i = 3; servers hold buckets 0 and 1 with j = 4, buckets 2 to 7 with j = 3, and buckets 8 and 9 with j = 4; two clients hold the images (i' = 0, n' = 0) and (i' = 2, n' = 3)]

[Figure: LH* split – the coordinator makes bucket 2 (j = 3) split; a new bucket 10 (j = 4) is appended, bucket 2 becomes j = 4, and the file state changes from n = 2, i = 3 to n = 3, i = 3; the client images are unchanged]

[Figure: LH* addressing – the client with image (i' = 0, n' = 0) sends key 15 to bucket 0; bucket 0 (j = 4) forwards it to bucket 7 (j = 3), which accepts it and returns the IAM (a = 7, j = 3); the client image becomes (i' = 3, n' = 0)]

[Figure: LH* addressing – the client with image (i' = 2, n' = 3) sends key 9 to bucket 1; the key is forwarded to bucket 9 (j = 4), which accepts it and returns the IAM (a = 9, j = 4); the client image becomes (i' = 3, n' = 1)]

Result
The distributed file can grow to even the whole Internet, so that:
– every insert and search is done in at most four messages (IAM included)
– in general, an insert is done in one message and a search in two messages

SDDS-2000 Prototype
Implementation of LH* and of RP* on a Wintel multicomputer:
– client/server architecture
– TCP/IP communication (UDP and TCP) with Windows Sockets
– multiple threads of control
– process synchronization (mutexes, critical sections, events, time-outs, etc.)
– queuing system
– optional flow control for UDP messaging
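For illustration only, a minimal Python sketch of the UDP request/response pattern this architecture relies on; the actual prototype is C++ on Windows Sockets, and the address and time-out below are hypothetical.

```python
import socket

SERVER_ADDR = ("127.0.0.1", 5000)        # hypothetical bucket-server address

def send_request(payload: bytes, timeout: float = 1.0) -> bytes:
    """Send one datagram to the server and wait for its response (with a time-out)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, SERVER_ADDR)
        response, _ = s.recvfrom(64 * 1024)
        return response
```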

[Figure: SDDS-2000 client architecture – the application interface passes requests to a send-request module and gets results back from a receive-response module; a queuing system keyed by request id and application id matches responses to requests; the client-image process keeps the file image (i', n') and the server addresses up to date; sockets connect the client to the network]

[Figure: SDDS-2000 server architecture – a listen thread reads client requests from the socket into a queuing system; several work threads analyse each request and perform insertion, search, update or delete on the SDDS bucket, processing it locally or forwarding it, and then return the response to the client]

[Figure: LH*LH RAM buckets – an LH* bucket is itself organized as a Linear Hashing (LH) structure in RAM: a dynamic array of entries whose slots chain the records, with -1 marking the end of a chain]
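A toy sketch of this LH*LH idea: the LH* bucket is a small Linear Hashing table in RAM that splits one of its own slots whenever a chain overflows. Everything here (the slot capacity, the mod-2^i hash) is an illustrative assumption, not the SDDS-2000 data structure.

```python
class LHBucket:
    """Toy in-RAM Linear Hashing table standing in for one LH* bucket."""

    def __init__(self, slot_capacity=4):
        self.i = 0                 # local hash level
        self.n = 0                 # next slot to split
        self.slots = [[]]          # dynamic array of record chains
        self.slot_capacity = slot_capacity

    def _addr(self, key):
        a = key % (2 ** self.i)
        return a if a >= self.n else key % (2 ** (self.i + 1))

    def insert(self, key, data):
        chain = self.slots[self._addr(key)]
        chain.append((key, data))
        if len(chain) > self.slot_capacity:
            self._split()

    def _split(self):
        old_chain = self.slots[self.n]     # slot to split in this step
        self.slots.append([])              # its new sibling slot
        self.n += 1
        if self.n == 2 ** self.i:          # round finished: advance the level
            self.i, self.n = self.i + 1, 0
        records = old_chain[:]
        old_chain.clear()
        for key, data in records:          # rehash between the old and new slot
            self.slots[self._addr(key)].append((key, data))

    def search(self, key):
        for k, d in self.slots[self._addr(key)]:
            if k == key:
                return d
        return None
```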

Measuring conditions
– LAN of 4 computers interconnected by 100 Mb/s Ethernet
– F.S (fast server): Pentium II 350 MHz, 128 MB RAM
– F.C (fast client): Pentium II 350 MHz, 128 MB RAM
– S.S (slow server): Pentium 90 MHz, 48 MB RAM
– S.C (slow client): Pentium 90 MHz, 48 MB RAM
– the measurements result from 10,000 records and more
– UDP protocol for insertions and searches
– TCP protocol for splitting

[Figure: best performance of a fast server – one F.S holding bucket 0 (j = 0), accessed over 100 Mb/s UDP by up to three slow clients S.C(1), S.C(2), S.C(3)]

Fast server: average insert time
Inserts without acknowledgement; with 3 clients, messages start to get lost.
Best time: 0.44 ms.

Fast server: average search time
The measured time includes the search processing plus the return of the response.
With more than 3 clients, many messages are lost.
Whatever the bucket capacity (1,000, 5,000, …, 20,000 records), 0.66 ms is the best time.

[Figure: performance of a slow server – one S.S holding bucket 0 (j = 0) and one S.C, 100 Mb/s UDP communication with a wait between requests]

Slow server: average insert time
Measurements on the server, without acknowledgement, S.C to S.S (with wait).
A second client is not needed; 2.3 ms is the best and constant time.

Slow server: average search time
Measurements on the server, S.C to S.S (with wait).
A second client is not needed; 3.3 ms is the best time.

[Figure: insert time into up to 3 buckets – one S.C sends a batch of inserts (1, 2, 3, …) over 100 Mb/s UDP; bucket 0 is on a F.S (j = 2), bucket 1 on a S.S (j = 1), bucket 2 on a S.S (j = 2)]

Average insert time, no acknowledgement
File creation includes 2 splits, forwarding, and updates of the IAMs; "buckets already exist" means without splits.
Conditions: S.C + F.S + 2 S.S.
Time measured on the server of bucket 0, which is informed of the end of the insertions by each server.
The split is not penalizing: about 0.8 ms/insert in both cases.

[Figure: average search time over 3 slow servers – one F.C sends a batch of key searches (1, 2, 3, …) over 100 Mb/s UDP to bucket 0 (S.S, j = 2), bucket 1 (S.S, j = 1) and bucket 2 (S.S, j = 2)]

Average key search time: fast client & slow servers
Keys are sent in batches: 1, 2, 3, … up to 10,000.
Balanced load: the 3 buckets receive the same number of records.
Unbalanced load: bucket 1 receives more than the others.
Conclusion: the curve is linear, i.e., good parallelism.

Extrapolation: single 700 MHz P3 server

Processor      | Pentium II 350 MHz | Pentium 90 MHz (/4) | Pentium III 700 MHz (/2)
Search time    | F.S = 0.66 ms      | S.S = 3.3 ms (*5)   | <= 0.33 ms (*2)
Insertion time | F.S = 0.44 ms      | S.S = 2.37 ms (*5)  | <= 0.22 ms (*2)

Extrapolation: search time on fast P3 servers
The client is a F.C.
With 3 servers (350 MHz P3), the search time is 0.216 ms/key; with 3 servers (700 MHz), it is 0.106 ms/key.

Extrapolation: search time in a file scaling to 100 servers [figure]

RP* schemes
Produce 1-d ordered files:
– for range search
Use m-ary trees:
– like a B-tree
Efficiently support range queries:
– LH* also supports range queries, but less efficiently
Consist of a family of three schemes:
– RP*N, RP*C and RP*S
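To make the contrast with hashing concrete, here is a minimal range-partitioning sketch in Python: each bucket covers a contiguous key interval defined by separator keys, so a range query visits only the relevant buckets. The separators and bucket addresses are made-up examples; the real RP* schemes maintain (or, in RP*N, avoid) such an image without any centralized directory.

```python
import bisect

separators = ["f", "m", "t"]     # hypothetical split keys: they define 4 buckets
buckets = [0, 1, 2, 3]           # bucket addresses, in key order

def rp_address(key: str) -> int:
    """Bucket whose key range covers `key`."""
    return buckets[bisect.bisect_right(separators, key)]

def range_query(low: str, high: str) -> list[int]:
    """Buckets that a range query [low, high] must visit."""
    first = bisect.bisect_right(separators, low)
    last = bisect.bisect_right(separators, high)
    return buckets[first:last + 1]

# rp_address("h") == 1, and range_query("e", "p") == [0, 1, 2].
```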

[Three figure-only slides illustrating the RP* schemes]

Scalable Distributed Log Structured Array (SDLSA)
Intended for high-capacity SANs of IBM RAMAC Virtual Arrays (RVAs) or Enterprise Storage Servers (ESSs):
– one RVA contains up to 0.8 TB of data
– one ESS contains up to 13 TB of data
Reuse of current capabilities:
– transparent access to the entire SAN, as if it were one RVA or ESS
– preservation of current functions: Log Structured Arrays (for high availability without the small-write RAID penalty), snapshots
New capabilities:
– scalable TB databases, PB databases for an ESS SAN
– parallel / distributed processing
– high availability, tolerating the unavailability of an entire server node

Gross architecture [figure: RVAs in the SAN]

Scalable-Availability SDDS
Supports the unavailability of up to k server sites.
The factor k increases automatically with the file:
– necessary to prevent the reliability from decreasing
Moderate overhead for parity data:
– storage overhead of O(1/k)
– access overhead of k messages per data-record insert or update
Does not impair searches and parallel scans:
– unlike trivial adaptations of RAID-like schemes
Several schemes were proposed around LH*:
– different properties to best suit various applications
– see http://ceria.dauphine.fr/witold.html
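A deliberately simplified illustration of the parity idea: one XOR parity record per group of k data records gives 1-availability at a storage overhead of 1/k. The actual scalable-availability schemes (e.g. LH*RS) use more general erasure codes to survive several unavailable servers; this sketch only covers the single-failure case.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def parity(group: list[bytes]) -> bytes:
    """Parity record over a group of k equal-length data records (overhead 1/k)."""
    return reduce(xor_bytes, group)

def reconstruct(surviving: list[bytes], parity_record: bytes) -> bytes:
    """Recover the one missing record of a group from the k-1 others and the parity."""
    return reduce(xor_bytes, surviving, parity_record)

# Example: with records a, b, c and p = parity([a, b, c]),
# reconstruct([a, c], p) == b when record b's server is unavailable.
```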

SDLSA: main features
LH* is used as the global addressing schema.
RAM buckets split atomically.
Disk buckets split in a lazy way:
– a record (logical track) moves only when the client accesses it (update or read), or when it is garbage-collected
– an atomic split of a TB disk bucket would take hours
The LH*RS schema is used for high availability.
Litwin, W., Menon, J.: Scalable Distributed Log Structured Arrays. CERIA Res. Rep. 12, 1999. http://ceria.dauphine.fr/witold.html

Conclusion
SDDSs should be highly useful for HPC:
– scalability
– fast access performance
– parallel scans & function shipping
– high availability
SDDSs are available on network multicomputers:
– SDDS-2000
Access performance proves at least an order of magnitude faster than for traditional files:
– should reach two orders of magnitude (a 100-fold improvement) on a 700 MHz P3
– through the combination of a fast network & distributed RAM

Future work
Experiments:
– a faster network (we do not have one: any volunteer to help?)
– more Wintel computers (we are adding two 700 MHz P3; volunteers with funding for more, or with their own configurations?)
– experiments on switched multicomputers: LH*LH runs on Parsytec (J. Karlson) & SGs (Math. Cntr. of U. Amsterdam); volunteers with an SP2?
Generally, we welcome every cooperation.

Thank You for Your Attention
Sponsored by HP Laboratories, IBM Almaden Research, and Microsoft Research.
