1 CSCI5570 Large Scale Data Processing Systems – NoSQL. Slide Ack.: modified based on the slides from Adam Silberstein. James Cheng, CSE, CUHK

2 PNUTS: Yahoo!'s Hosted Data Serving Platform (Platform for Nimble Universal Table Storage). Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni. VLDB 2008

3 How do I build a cool new web app?
Option 1: Code it up! Make it live!
– Scale it later
– It gets popular
– Scale it now!
– Flickr, Twitter, MySpace, Facebook, …

4 How do I build a cool new web app?
Option 2: Make it industrial strength!
– Evaluate scalable database backends
– Evaluate scalable indexing systems
– Evaluate scalable caching systems
– Architect data partitioning schemes
– Architect data replication schemes
– Architect monitoring and reporting infrastructure
– Write application
– Go live
– Realize it doesn't scale as well as you hoped
– Re-architect around bottlenecks
– 1 year later – ready to go!

5 PNUTS
A massively parallel and geographically distributed database system for Yahoo!'s web applications
Data storage organized as hashed or ordered tables; low latency for large numbers of concurrent requests (updates and queries)
Record-level, asynchronous geographic replication => per-record consistency guarantee
Automated load-balancing and failover

6 Example: social network updates (figure: a table of keyed user records – 6 Jimi, 8 Mary, 12 Sonja, 15 Brandon, 16 Mike, 17 Bob – each holding a small update payload, illustrating a photo posted to www.flickr.com)

7 What is PNUTS' overall goal?
Serializability of general transactions over a distributed, replicated system is expensive, and often unnecessary
Web applications need
– Scalability, and the ability to scale linearly
– Geographic scope: tens of data centers ("sites") all over the world, e.g. mail (each app probably runs at all sites)
Web applications typically have
– Simple query needs: no joins, aggregations
– Relaxed consistency needs: applications can tolerate stale data

8 What is PNUTS' overall goal?
PNUTS keeps state for apps
– per-user: profile, shopping cart, friend list
– per-item: popularity, selling price
App might need
– any piece of data at any data center
– to handle lots of concurrent updates to different data, e.g. lots of users must be able to update their profiles at the same time
Thus 1000s of PNUTS servers => crashes can be frequent
– Must cope well with partial failures

9 Performance Requirements
High scalability
Short response time (latency)
High availability and fault tolerance
Relaxed consistency guarantees

10 Data Model
Tables of records [ key | attrs ]
– Primary key + attributes (columns)
– Blobs, application-specific data types (parsed JSON)
– Flexible schema: new attributes can be added at any time; attributes are not required to have a value
– No referential integrity constraints

11 Query Model
Queries only by primary key
– Also supports range scans based on primary keys
Selection and projection from a single table
No complex ad hoc queries (e.g., joins, group-by, etc.)
Typical query workloads: read and write single records or small sets of records

12 Query Model
Per-record operations
– Get
– Set
– Delete
Multi-record operations
– Multiget
– Scan
– Getrange
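As a rough illustration of this query model, here is a minimal sketch of what a PNUTS-style table client might expose; the class and method names are invented for this example, not the actual Yahoo! client library.

```python
# Hypothetical sketch of a PNUTS-style table API (names are illustrative).
class Table:
    def __init__(self):
        self.records = {}          # primary key -> dict of attributes

    # --- per-record operations ---
    def get(self, key):
        return self.records.get(key)

    def set(self, key, attrs):
        self.records[key] = attrs  # flexible schema: any attributes allowed

    def delete(self, key):
        self.records.pop(key, None)

    # --- multi-record operations ---
    def multiget(self, keys):
        return {k: self.records.get(k) for k in keys}

    def scan(self):
        for key in sorted(self.records):   # ordered table: iterate in key order
            yield key, self.records[key]

    def getrange(self, lo, hi):
        for key, rec in self.scan():
            if lo <= key <= hi:
                yield key, rec

users = Table()
users.set("alice", {"profile": "...", "friends": ["bob"]})
print(users.get("alice"))
```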

13 System Architecture (diagram): clients issue requests through a REST API to routers; routers, the tablet controller, and storage units make up the local region; the message broker (YMB) links the local region to remote regions

14 System Architecture
Each region contains a full complement of system components and a complete copy of each table
(Diagram: data-path components within a region – clients, REST API, routers, tablet controller, storage units, and the message broker)

15 Data Storage
Data tables are horizontally partitioned into groups of records called tablets
Tablets are scattered across many servers; each server has many tablets
Each tablet is stored on a single server within a region (and replicated in many regions)
Regions are typically geographically distributed
A message broker (a pub/sub mechanism) is used for reliability and replication (details later)

16 System Components
Storage Unit (SU)
– Stores tablets
– Responds to get() and scan() requests
– Responds to set() requests for updating records
Router
– Determines which tablet contains a record (to be read from / written to an SU)
– Determines which SU holds that tablet
– The primary-key space of a table is divided into intervals; the router stores the interval mapping (it fits in memory for fast search), which defines the boundaries of each tablet and maps each tablet to a storage unit
– If a router fails, simply start a new one; no recovery process is needed
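To make the router's job concrete, here is a small sketch of how an in-memory interval mapping could resolve a key to the storage unit serving its tablet; the boundary keys, SU names, and the use of binary search are illustrative assumptions, not PNUTS internals.

```python
import bisect

# Hypothetical router state: sorted interior tablet boundaries, and one
# storage unit per resulting interval.
boundaries = ["Canteloupe", "Lime", "Strawberry"]   # interior split points
storage_units = ["SU1", "SU3", "SU2", "SU1"]        # SU serving each interval

def route(key):
    # Find which interval the key falls into, then return the SU for that tablet.
    idx = bisect.bisect_right(boundaries, key)
    return storage_units[idx]

print(route("Grape"))   # falls in (Canteloupe, Lime] -> SU3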

17 Distributed Hash Table (records are stored in order of the hash of their key; boundaries in the hash space, e.g. 0x0000, 0x2AF3, 0x911F, delimit tablets)
Primary Key -> Record
Grape -> {"liquid" : "wine"}
Lime -> {"color" : "green"}
Apple -> {"quote" : "Apple a day keeps the …"}
Strawberry -> {"spread" : "jam"}
Orange -> {"color" : "orange"}
Avocado -> {"spread" : "guacamole"}
Lemon -> {"expression" : "expensive crap"}
Tomato -> {"classification" : "yes… fruit"}
Banana -> {"expression" : "goes bananas"}
Kiwi -> {"expression" : "New Zealand"}

18 Distributed Ordered Table (tablets are clustered by key range)
Primary Key -> Record
Apple -> {"quote" : "Apple a day keeps the …"}
Avocado -> {"spread" : "guacamole"}
Banana -> {"expression" : "goes bananas"}
Grape -> {"liquid" : "wine"}
Kiwi -> {"expression" : "New Zealand"}
Lemon -> {"expression" : "expensive crap"}
Lime -> {"color" : "green"}
Orange -> {"color" : "orange"}
Strawberry -> {"spread" : "jam"}
Tomato -> {"classification" : "yes… fruit"}

19 Accessing data (diagram): (1) the client sends Get(key k) to a router, (2) the router forwards the request to the SU holding k's tablet, (3) the SU returns the record for key k, (4) the record is returned to the client

20 Range queries (diagram): the router holds the interval mapping
MIN–Canteloupe -> SU1
Canteloupe–Lime -> SU3
Lime–Strawberry -> SU2
Strawberry–MAX -> SU1
Storage unit 1 holds Apple, Avocado, Banana, Blueberry and Strawberry, Tomato, Watermelon; storage unit 2 holds Lime, Mango, Orange; storage unit 3 holds Canteloupe, Grape, Kiwi, Lemon
A query for the range Grapefruit…Pear is split into Grapefruit…Lime (sent to SU3) and Lime…Pear (sent to SU2)
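A hedged sketch of how a router could fan a range scan out over that interval mapping, extending the point-lookup sketch above; the data structure and function are assumptions made for illustration only.

```python
import bisect

# Interval mapping from the diagram above (illustrative representation).
boundaries = ["Canteloupe", "Lime", "Strawberry"]   # tablet split points
storage_units = ["SU1", "SU3", "SU2", "SU1"]        # SU serving each interval

def plan_range_scan(lo, hi):
    """Split the range [lo, hi] into per-tablet sub-ranges, one per SU."""
    start = bisect.bisect_right(boundaries, lo)
    end = bisect.bisect_right(boundaries, hi)
    plan = []
    sub_lo = lo
    for i in range(start, end + 1):
        sub_hi = boundaries[i] if i < end else hi
        plan.append((sub_lo, sub_hi, storage_units[i]))
        sub_lo = sub_hi
    return plan

# "Grapefruit…Pear" splits into (Grapefruit..Lime) on SU3 and (Lime..Pear) on SU2.
print(plan_range_scan("Grapefruit", "Pear"))
```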

21 System Components
Tablet Controller
– Owns the interval mapping (the router only stores a cached copy)
– Determines when to move a tablet between SUs for load balancing or recovery, and updates the interval mapping
– When the mapping is updated, the router's cached copy becomes outdated and requests are misdirected => the SU returns an error response => the router fetches the updated copy from the controller
– A single pair of active/standby servers, but not a bottleneck since it does not sit on the data path

22 Tablet Splitting & Balancing
Each storage unit has many tablets
Tablets may grow over time
Overfull tablets split
A storage unit may become a hotspot
Shed load by moving tablets to other servers
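A minimal sketch of what a tablet split could look like at the level of the interval mapping, assuming the controller simply inserts a new boundary key and hands the upper half of the tablet to another SU; the representation and function are invented for illustration.

```python
# Hypothetical sketch of splitting an overfull tablet: pick a middle key as a
# new boundary and assign the upper half to a less-loaded storage unit.
def split_tablet(boundaries, storage_units, idx, split_key, new_su):
    """Split the tablet at position idx of the interval mapping at split_key."""
    boundaries.insert(idx, split_key)          # new interior boundary
    storage_units.insert(idx + 1, new_su)      # upper half moves to new_su
    return boundaries, storage_units

boundaries = ["Lime"]
storage_units = ["SU1", "SU2"]                 # MIN..Lime -> SU1, Lime..MAX -> SU2
split_tablet(boundaries, storage_units, 0, "Grape", "SU3")
print(boundaries, storage_units)               # MIN..Grape -> SU1, Grape..Lime -> SU3, Lime..MAX -> SU2
```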

23 Yahoo! Message Broker (YMB)
A publish/subscribe system used for asynchronous replication in PNUTS
A data update is "committed" when it is published to YMB
At some point after the "commit", the update is asynchronously propagated to the other regions and applied to their replicas
Thus the propagation delay => consistency issues!

24 Yahoo! Message Broker (YMB)
Why can YMB be used for replication & logging in PNUTS?
YMB guarantees delivery of published messages to all subscribers
– Messages are logged to multiple disks on different servers to survive single broker machine failures
– A message is not purged from the YMB log until PNUTS verifies that the update has been applied to all replicas
YMB clusters reside in different geographic datacenters, and messages published to one YMB cluster are relayed to the other YMB clusters for delivery to their local subscribers

25 How to keep copies of data synchronized?
Replication => replicas of data everywhere
Messages published to a single YMB cluster are delivered to all subscribers in their publishing order
Messages published to different YMB clusters may be delivered in any order => some sites may see stale data (see the example in the following slides)
How to keep all replicas of a record synchronized?
– Distributed sites need to apply the same updates in the same order

26 Consider a photo-sharing application that allows users to post photos and control access. A user wishes to perform a sequence of 2 updates to her record:
U1: Remove her mother from the access control list
U2: Post spring-break photos

27 Eventual Consistency
Popularly adopted in NoSQL systems
Allows stale reads, but ensures that reads will eventually reflect previously written values
– Even if only after a very long time
Doesn't order concurrent writes as they are executed, which might create conflicts later: which write was first?

28 Eventual Consistency
Update U1 can go to replica R1 of the record, while U2 might go to replica R2
Even though the final states of replicas R1 and R2 are guaranteed to be the same:
– for some time a user is able to read a state of the record that never should have existed
– the photos have been posted but the change in access control has not yet taken place => her mother sees all the spring-break photos

29 Consistency Model
PNUTS supports per-record timeline consistency
– All replicas of a given record apply all updates to the record in the same order
– A consistency model between serializability and eventual consistency

30 Per-Record Timeline Consistency
Version # stored in each record
– [ key | attrs | version ]

31 Per-Record Timeline Consistency
Transactions: Alice removes her mother from the ACL (access control list), then Alice posts spring-break photos
(Diagram: Region 1 applies the two updates in order, moving the record from (Alice, Mom ok, photos_v1) to (Alice, Mom banned, photos_v1) and then (Alice, Mom banned, photos_spring); the updates are propagated to Region 2, which applies them in the same order and ends in the same state)
No replica should ever see the record as (Alice, Mom ok, photos_spring)
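A small sketch of how a replica might enforce this per-record timeline, assuming each update carries the record's next version number; the exact next-version check and the class are illustrative assumptions, not the PNUTS implementation.

```python
# Hypothetical sketch of per-record timeline consistency at a replica: every
# update carries the record's next version number, and a replica applies it
# only if it directly follows the version it currently holds.
class Replica:
    def __init__(self):
        self.store = {}          # key -> (version, value)

    def apply(self, key, version, value):
        current_version, _ = self.store.get(key, (0, None))
        if version == current_version + 1:
            self.store[key] = (version, value)   # next update in the timeline
            return True
        return False                              # stale or duplicate: ignore

r = Replica()
r.apply("alice", 1, {"acl": "mom banned", "photos": "v1"})
r.apply("alice", 2, {"acl": "mom banned", "photos": "spring"})
r.apply("alice", 1, {"acl": "mom ok", "photos": "v1"})   # old/duplicate -> ignored
print(r.store["alice"])
```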

32 Is this feasible for applications?
Contrast with serializability:
– Certainly, while I'm updating and reading my profile, it's OK if I see a stale version of yours
– OK for Mom to see old data; she just shouldn't see pictures without permission

33 How to do this?
App server gets a web request and needs to write data in PNUTS
– Need to update every site in the same order!
Why not just have the app logic send the update to every site?
– What if the app crashes after updating only some sites?
– What if there are concurrent updates to the same record?

34 Consistency via YMB & Mastership
Per-record mastership
– Designate one copy of a record as the master
– Different records in the same table can have masters in different clusters
– Direct all updates to the master; the master publishes the updates to YMB to be delivered to all replicas in "commit" order
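A hedged sketch of this write path: the record's master assigns the next version number and publishes the update to YMB, which defines the commit order; the class names and broker interface are invented for illustration.

```python
# Hypothetical sketch of the write path with per-record mastership: all writes
# for a record go to its master replica, which assigns the next version number
# and publishes the update to YMB in commit order.
class RecordMaster:
    def __init__(self, broker):
        self.versions = {}       # key -> latest committed version
        self.broker = broker     # YMB-like pub/sub handle

    def write(self, key, value):
        version = self.versions.get(key, 0) + 1
        self.versions[key] = version
        # "Committed" once the broker has durably accepted the message;
        # the broker later delivers it to every replica asynchronously.
        self.broker.publish(topic=key, message=(version, value))
        return version

class FakeBroker:
    def publish(self, topic, message):
        print("YMB accepted", topic, message)

master = RecordMaster(FakeBroker())
master.write("alice", {"acl": "mom banned"})
master.write("alice", {"photos": "spring"})
```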

35 Consistency via YMB & Mastership
Rationale for record-level mastering
– 85% of writes to a given record originate in the same cluster
– Different records have update affinity for different clusters, so per-tablet or per-table mastering would force many writes to pay costly cross-region latency to reach their masters
– The mastership of a record can migrate between replicas when affinity changes: the replica initiating the majority of writes is designated as the master

36 More consistency semantics
Consistency is ONLY per record
– Alice's stuff must all be kept in the same record
Nothing is guaranteed across records
– If different records were being updated, Mom may see just U2
Reads can return stale data

37 API and Specified Consistency
The application controls the consistency semantics
Multiple kinds of reads/writes
– read-any
– read-critical(required_version)
– read-latest
– write
– test-and-set-write(required_version)
How does each work? Why is each needed?

38 read-any
Reads from the local SU; might return stale data
– even if you just wrote!
Why:
– fast! local!
– performance over consistency!

39 read-critical(required_version)
Maybe read from the local SU
– if it has version >= required_version
– otherwise read from the master SU
Why: the result reflects what the app has already seen or written; maybe fast

40 read-latest
Probably always reads from the master SU
– if the local copy is too stale
Slow if the master is remote!
Why: the app needs fresh data
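Putting the three read variants side by side, here is a minimal sketch of how they might differ in where they read from; the data layout, version check, and the decision to always send read-latest to the master are illustrative assumptions.

```python
# Hypothetical sketch of the three read variants: read-any serves from the
# local replica, read-critical checks the local version first, and
# read-latest goes to the record's master copy.
def read_any(local, key):
    return local.get(key)                       # fast, possibly stale

def read_critical(local, master, key, required_version):
    version, value = local.get(key, (0, None))
    if version >= required_version:             # local copy is fresh enough
        return version, value
    return master.get(key)                      # otherwise pay the trip to the master

def read_latest(master, key):
    return master.get(key)                      # fresh, but slow if the master is remote

local = {"alice": (3, {"photos": "v1"})}
master = {"alice": (5, {"photos": "spring"})}
print(read_any(local, "alice"))                   # may be stale
print(read_critical(local, master, "alice", 4))   # local too stale -> read from master
```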

41 write
Gives the same ACID guarantees as a transaction with a single write operation in it
Why: useful for blind writes (a transaction writes a value without reading it)

42 test-and-set-write(version#, new value)
Performs the requested write to the record iff the present version of the record is the same as version#
e.g.: the app reads a record and then writes to the record based on what it read
– a client in SF tries to increment a counter; one in LA does too
– what if the local read produced stale data?
– what if the read was OK, but there were concurrent updates?

43 test-and-set-write (cont'd)
Gives you an atomic update to one record
– the master rejects the write if the current version # != version#
– so if there are concurrent updates, one will fail and retry:

    while True:
        value, ver = read_latest(x)
        if test_and_set_write(x, ver, value + 1):
            break

44 What about new inserts?
Can we insert at the local SU?
Problem: need to ensure we don't end up with two records with the same key
– Designate one copy of each tablet as the tablet master
– Send all inserts to the tablet master
– Keep tablet boundaries synchronized across replicas in different regions

45 What about tolerating failures?
The main players are
– app servers
– storage units (many, many cheap servers)
– YMB
– a record's master (owning site, SU)
– routers
– tablet controller

46 App server
If it crashes midway through a set of updates
– not a transaction, so only some of the writes will happen
– but the master SU and YMB either did or didn't get each write
– so each write happens at all sites, or at none

47 SU crashes
The paper indicates that a crashed SU doesn't come back
– Did the write make it to YMB? If so, the write went through. If not, the write didn't happen.
– The tablet controller re-assigns its tablets to a new SU

48 How does the new SU recover state?
For each tablet:
– Subscribe to YMB for that tablet
– Find the SU responsible for this tablet in another data center
– Ask it to send a copy of the state
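A small sketch of that recovery sequence, assuming updates that arrive during the copy are buffered and replayed afterwards; the buffering detail and all names here are assumptions made for illustration, not the actual recovery protocol.

```python
# Hypothetical sketch of tablet recovery on a new storage unit: subscribe to the
# tablet's YMB stream first so no update is missed, bulk-copy the tablet from a
# replica in another region, then replay updates that arrived during the copy.
class FakeBroker:
    def __init__(self):
        self.callbacks = {}
    def subscribe(self, topic, on_message):
        self.callbacks[topic] = on_message

class RemoteSU:
    def copy_tablet(self, tablet_id):
        return {"alice": (2, {"photos": "spring"})}

def recover_tablet(tablet_id, broker, remote_su, local_store):
    pending = []
    broker.subscribe(tablet_id, on_message=pending.append)      # listen before copying
    local_store[tablet_id] = dict(remote_su.copy_tablet(tablet_id))
    for key, version, value in pending:                          # replay missed updates
        local_store[tablet_id][key] = (version, value)

store = {}
recover_tablet("tablet-7", FakeBroker(), RemoteSU(), store)
print(store)
```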

49 YMB crashes
After accepting an update, YMB logs it to disks on at least two YMB servers before ACKing
Recovery looks at the log and (re)sends the logged messages
– the record master may re-send an update if YMB crashed before the ACK
– record version #s allow SUs to ignore duplicates

50 YMB is a neat idea
Atomic – updates all replicas, or none – so failure of app servers isn't a problem
Reliable – keeps trying, to cope with temporary SU/site failures
Async – apps don't have to wait for the write to complete; good for WANs
Ordered – keeps replicas identical even with multiple writers

51 Router & Tablet controller failure
Router
– Stateless
– Just start a new one; it caches mappings from the tablet controller after recovery
Tablet controller
– Active/standby server pair

52 Why this design?
High scalability
– tens of regions worldwide
– 1000+ servers in each region
Short response time (latency)
– Simply add routers to handle queries and add YMB servers to handle updates
High availability and fault tolerance
– Multiple replicas, many routers
– All components can recover from faults quickly
Relaxed consistency guarantees
– Per-record level consistency

