© 2016 A. Haeberlen, Z. Ives
CIS 455/555: Internet and Web Systems
University of Pennsylvania
Bigtable and Percolator
April 25, 2016

Announcements
Second midterm: Wednesday, 10:30am to noon, in three rooms: TOWN 319, BENN 231, and Berger Auditorium.
Please complete the course evaluations! Please let me know how you liked the class (topics covered, structure, projects, assignments, ...) and especially which aspects could be improved. I already know the workload is very high. Your feedback will benefit future instances of CIS 455!
Project demo slots will be available later today. One member of each team should sign up for one slot.
Reading: Peng & Dabek: "Large-scale Incremental Processing Using Distributed Transactions and Notifications", OSDI 2010, http://www.google.com/research/pubs/archive/36726.pdf

Reminder: Google award
The team with the best search engine will receive an award (sponsored by Google).
Criteria: architecture/design, speed, quality of search results, reliability, user interface, and the written final report.
The winning team gets four Nexus 7 tablets. Winners will be announced on the course web page.

Bigtable
Bigtable implements a multidimensional sorted map. Keys are (row, column, timestamp) triples, which provides versioning. Data is maintained in lexicographic order by row key.
Lookup and update operations are atomic on each row, but there are no atomic cross-row operations.
Bigtable is used by many Google projects, including Google Earth, the web index, and others.
[Figure: an example Bigtable row, showing a "column family" and different timestamped versions of a cell. Source: Bigtable paper (OSDI 2006)]
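To make the data model concrete, here is a minimal in-memory sketch of a (row, column, timestamp) -> value map. The class `SortedMap` and its methods are hypothetical, not Bigtable's client API; the sketch sorts rows at scan time for brevity (real Bigtable keeps data sorted by row key), and it does not model per-row atomicity. The row key `com.cnn.www` and the `contents:`/`anchor:` column families follow the example in the Bigtable paper.

```python
from collections import defaultdict


class SortedMap:
    """Toy model of Bigtable's (row, column, timestamp) -> value map.

    Hypothetical illustration only: everything lives in one process's memory.
    """

    def __init__(self):
        # row -> column -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (newest overall if None)."""
        if row not in self._rows:
            return None
        versions = self._rows[row].get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row, end_row):
        """Yield (row, column, newest value) for start_row <= row < end_row,
        in lexicographic order of row key (then column)."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                for column in sorted(self._rows[row]):
                    yield row, column, self.get(row, column)
```

Reading with an explicit timestamp returns an older version of the cell, which is how the versioning in the map's key shows up to a client.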

Bigtable implementation
Bigtable is a single-master system, similar to GFS. A table is broken into tablets, each of which contains a contiguous region of the key space. Tablets are stored by tablet servers, and a master assigns tablets to tablet servers.
Persistent state is stored in GFS files; recently committed data is kept in memory, in a memtable.
Bigtable is designed to be scalable: it handles petabytes of data and runs reliably on large numbers of unreliable machines.
[Figure: tablet representation. Write operations go to the tablet log and the memtable; read operations consult the memtable and the SSTable files. Source: Bigtable paper (OSDI 2006)]
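The write and read paths in the figure can be sketched as follows. This is a simplified model under stated assumptions: the class `Tablet` and the parameter `memtable_limit` are hypothetical, a Python list stands in for the tablet log in GFS, and each SSTable is an immutable sorted list. Writes go to the log and then the memtable; when the memtable fills up it is frozen into a new SSTable (the Bigtable paper calls this a minor compaction). Reads check the memtable first, then SSTables from newest to oldest. Deletions and log truncation are not modeled.

```python
import bisect


class Tablet:
    """Toy tablet: commit log + in-memory memtable + immutable SSTables.

    Hypothetical sketch; the log and SSTables would live in GFS in real Bigtable.
    """

    def __init__(self, memtable_limit=2):
        self.log = []            # append-only redo log (stand-in for a GFS file)
        self.memtable = {}       # recent writes: key -> value
        self.sstables = []       # sorted [(key, value), ...] lists, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))   # 1. make the write durable
        self.memtable[key] = value      # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Freeze the memtable into a new immutable, sorted SSTable."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        """Check the memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None
```

Because SSTables are never modified in place, a newer value for a key simply shadows older ones, which is why the read path searches newest first.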

Some services that use Bigtable
In 2006, there were 388 non-test Bigtable clusters at Google, with a combined total of 24,500 tablet servers.
Example: Google Analytics
Raw click table (~200 TB): one row for each end-user session.
Summary table (~20 TB): predefined summaries for each website.
Source: Bigtable paper (OSDI 2006)

Flashback
Bigtable uses many of the technologies we have been looking at in this course:
The lock service is made fault-tolerant with Paxos.
The tablet location hierarchy is basically a B+ tree.
Clients can run per-row transactions.
Data is persisted in a scalable file system, GFS.
Bigtable can be used as a source or target for MapReduce jobs.
More details are in the paper: F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber: "Bigtable: A Distributed Storage System for Structured Data", OSDI 2006, http://research.google.com/archive/bigtable-osdi06.pdf
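Why the tablet location hierarchy resembles a B+ tree: each level maps sorted key ranges to children, and a lookup descends one level at a time. In the Bigtable paper the levels are a Chubby file pointing to the root tablet, the METADATA tablets, and the user tablets. The sketch below collapses this into nested sorted levels; the names `Level` and `locate` and the server names are hypothetical.

```python
import bisect


class Level:
    """One level of a toy location hierarchy: sorted tablet end keys -> child.

    Hypothetical sketch; in Bigtable the levels are the root tablet and the
    METADATA tablets, and the leaves are tablet server locations.
    """

    def __init__(self, entries):
        # entries: list of (end_row_key, child); each tablet covers the rows
        # after the previous end key, up to and including its own end key.
        entries = sorted(entries, key=lambda e: e[0])
        self.end_keys = [end for end, _ in entries]
        self.children = [child for _, child in entries]

    def find(self, row):
        # First tablet whose end key is >= row is the one covering row.
        i = bisect.bisect_left(self.end_keys, row)
        if i == len(self.end_keys):
            raise KeyError(f"no tablet covers row {row!r}")
        return self.children[i]


def locate(root, row):
    """Descend the hierarchy from the root until a leaf (a server name) is reached."""
    node = root
    while isinstance(node, Level):
        node = node.find(row)
    return node
```

As with a B+ tree, the depth stays small even for huge tables, and clients can cache the upper levels.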

Why Percolator?
Scenario: a web crawler. We have a huge index (Google: tens of petabytes), and we need to run some computation on it (e.g., PageRank updates, clustering, ...). Google's indexing system is a chain of many MapReduces.
Every day we recrawl a small part of the web. How do we update the index?
Alternative #1: Run MapReduce on the changed pages only. Problem: not accurate; for example, there may be links between the new pages and the rest of the web.
Alternative #2: Re-run MapReduce on the entire data set. Problem: wasteful, since it discards the work done in earlier runs. This is what Google actually did before Percolator.
Alternative #3: Update incrementally.

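Example
[Figure: a chain of MapReduce stages (map, invert links, ..., reduce) that produces a document table and a checksum table.]

Before nytimes.com is crawled:
URL          | Checksum   | PageRank | IsCanonical
nyt.com      | 0xabcdef01 | 6        | yes

Checksum     | Canonical
0xabcdef01   | nyt.com

After nytimes.com (same checksum, PageRank 9) is processed:
URL          | Checksum   | PageRank | IsCanonical
nyt.com      | 0xabcdef01 | 6        | no
nytimes.com  | 0xabcdef01 | 9        | yes

Checksum     | Canonical
0xabcdef01   | nytimes.com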
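The example is a duplicate-elimination step: pages with the same checksum are duplicates, and exactly one URL per checksum is canonical. An incremental system only needs to revisit the duplicate group of the newly crawled page, rather than the whole repository. The sketch below assumes (consistent with the example, but not stated on the slide) that the duplicate with the highest PageRank becomes canonical, and that each URL is crawled once with a fixed checksum. The functions `add_document` and `is_canonical` are hypothetical.

```python
def add_document(docs, groups, canonical, url, checksum, pagerank):
    """Incrementally process one newly crawled document.

    docs:      url -> (checksum, pagerank)
    groups:    checksum -> set of URLs with that checksum (duplicates)
    canonical: checksum -> canonical URL
    Only the new document's duplicate group is examined, not the whole index.
    """
    docs[url] = (checksum, pagerank)
    group = groups.setdefault(checksum, set())
    group.add(url)
    # Assumption: the duplicate with the highest PageRank is canonical.
    canonical[checksum] = max(group, key=lambda u: docs[u][1])
    return canonical[checksum]


def is_canonical(docs, canonical, url):
    checksum, _ = docs[url]
    return canonical[checksum] == url
```

Processing nyt.com (PageRank 6) and then nytimes.com (PageRank 9) with the same checksum flips the canonical URL to nytimes.com, as in the tables above, without touching any other document.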

What is Percolator?
A system for incrementally processing updates to a large data set. The Percolator-based indexing system is known as 'Caffeine'. It reduced the average age of documents in Google search results by 50%, and documents move through Caffeine about 100x faster than through the previous system.
Published at OSDI 2010: Peng & Dabek: "Large-scale Incremental Processing Using Distributed Transactions and Notifications", OSDI 2010, http://www.google.com/research/pubs/archive/36726.pdf

What Percolator provides
Percolator builds on Bigtable, but additionally provides the following two abstractions:
ACID transactions (as seen earlier) with snapshot isolation.
Observers: a way to organize incremental computation.
What is an observer? Essentially, a small piece of code that is invoked whenever a specific column changes.
Percolator applications are structured as a series of observers: an external process (e.g., the crawler) triggers updates in the table; each update is handled by an observer, which then produces more updates, and thus more work for other observers, and so on.

Why ACID semantics?
Couldn't they have built this system without transactions? Transactions are not 'free'; they have some overhead.
Yes, but transactions...
... make it easier to reason about the state of the system, especially when many updates are performed concurrently;
... avoid introducing errors into a long-lived repository (such errors could be introduced by bugs, crashes, ...);
... allow easy construction of consistent, up-to-date indexes.
An interesting change of perspective, given the earlier debates (e.g., Stonebraker/DeWitt).

Snapshot isolation
What is snapshot isolation? Conceptually, each transaction performs all its reads at a start timestamp and all its writes at a commit timestamp.
[Figure: three transactions (1, 2, 3) on a time axis, each with a read timestamp and a write timestamp.]
In this example, transaction 2 does not see the writes from transaction 1, but transaction 3 sees the writes from both 1 and 2.
Implemented using versioning.
Protects against write-write conflicts: if two concurrent transactions write to the same cell, at least one aborts.
Comparison to serializability?
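A minimal single-process sketch of these semantics, assuming hypothetical classes `MVStore` and `Txn` with a counter standing in for the timestamp oracle. Reads see the newest version committed at or before the transaction's start timestamp; at commit time, a transaction aborts if any key it wrote was committed by someone else after its start timestamp ("first committer wins"). This shows the semantics only: Percolator enforces the same rule with locks stored in Bigtable columns, and nothing here is thread-safe.

```python
import itertools


class MVStore:
    """Multi-version key-value store illustrating snapshot isolation (toy sketch)."""

    def __init__(self):
        self.versions = {}                 # key -> list of (commit_ts, value)
        self._clock = itertools.count(1)   # stand-in for a timestamp oracle

    def next_ts(self):
        return next(self._clock)

    def begin(self):
        return Txn(self, self.next_ts())


class Txn:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}                   # buffered until commit

    def read(self, key):
        """Read the newest version committed at or before our start timestamp."""
        if key in self.writes:             # a transaction sees its own writes
            return self.writes[key]
        visible = [(ts, v) for ts, v in self.store.versions.get(key, [])
                   if ts <= self.start_ts]
        return max(visible, key=lambda p: p[0])[1] if visible else None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        """First committer wins: abort if a written key was committed by
        another transaction after our start timestamp."""
        for key in self.writes:
            if any(ts > self.start_ts for ts, _ in self.store.versions.get(key, [])):
                return False
        commit_ts = self.store.next_ts()
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((commit_ts, value))
        return True
```

Two overlapping transactions that write the same key cannot both commit, while a transaction that starts later sees the winner's value.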

Locking in Percolator
Quite different from DBMS locking: locks are kept in special Bigtable columns. This ensures persistence and provides high throughput.
Remember: accesses to individual rows are already atomic in Bigtable!

key  | bal:data | bal:lock | bal:write
Bob  | 6:       | 6:       | 6: data @ 5
     | 5: $10   | 5:       | 5:
Joe  | 6:       | 6:       | 6: data @ 5
     | 5: $2    | 5:       | 5:

Transactions in Percolator
1. At the beginning, obtain a start timestamp from the timestamp oracle.
2. Buffer all writes until commit time.
3. At commit time, try to lock all the cells being written ('prewrite'). If existing locks are found (or another transaction has committed a write to one of the cells after our start timestamp), the transaction aborts. A random cell is designated as the 'primary'; the locks on the other cells contain a reference to the primary.
4. Obtain a commit timestamp from the oracle.
5. Release the locks and make the writes visible, starting with the primary (this ensures that roll-forward is possible).
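The five steps can be sketched over a toy in-memory table that keeps Percolator's three columns per cell: `data` (values keyed by start timestamp), `lock` (pending locks pointing to the primary), and `write` (commit timestamp -> start timestamp of the data it makes visible). This is a single-process sketch under stated assumptions: `Cell`, `Transaction`, `LockedError`, and `get_timestamp` are hypothetical names, a counter stands in for the timestamp oracle, Bigtable's per-row atomicity is assumed rather than modeled, the first buffered write is used as the primary (an arbitrary choice), and a failed prewrite undoes its own locks instead of leaving them for lazy cleanup by other transactions.

```python
import itertools

_oracle = itertools.count(1)          # stand-in for Percolator's timestamp oracle


def get_timestamp():
    return next(_oracle)


class Cell:
    """The data/lock/write columns of one (row, column)."""

    def __init__(self):
        self.data = {}    # start_ts -> value
        self.lock = {}    # start_ts -> (primary_row, primary_col)
        self.write = {}   # commit_ts -> start_ts of the data it points to


class LockedError(Exception):
    """A pending lock blocks a read; a real client would back off and clean up."""


class Transaction:
    def __init__(self, table):
        self.table = table                 # (row, col) -> Cell: a toy "Bigtable"
        self.start_ts = get_timestamp()    # step 1
        self.writes = []                   # step 2: buffered (row, col, value)

    def _cell(self, row, col):
        return self.table.setdefault((row, col), Cell())

    def set(self, row, col, value):
        self.writes.append((row, col, value))

    def get(self, row, col):
        cell = self._cell(row, col)
        if any(ts <= self.start_ts for ts in cell.lock):
            raise LockedError((row, col))
        visible = [c for c in cell.write if c <= self.start_ts]
        if not visible:
            return None
        return cell.data[cell.write[max(visible)]]

    def _prewrite(self, row, col, value, primary):
        cell = self._cell(row, col)
        if any(c >= self.start_ts for c in cell.write):   # newer committed write
            return False
        if cell.lock:                                       # lock at any timestamp
            return False
        cell.data[self.start_ts] = value
        cell.lock[self.start_ts] = primary
        return True

    def commit(self):
        if not self.writes:
            return True
        primary = self.writes[0][:2]       # first write chosen as primary
        locked = []
        for row, col, value in self.writes:            # step 3: prewrite
            if not self._prewrite(row, col, value, primary):
                for r, c in locked:                    # simplification: undo own locks
                    cell = self._cell(r, c)
                    cell.lock.pop(self.start_ts, None)
                    cell.data.pop(self.start_ts, None)
                return False
            locked.append((row, col))
        commit_ts = get_timestamp()                     # step 4
        p = self._cell(*primary)
        if self.start_ts not in p.lock:    # our lock was cleaned up meanwhile
            return False
        p.write[commit_ts] = self.start_ts              # step 5: commit point
        del p.lock[self.start_ts]
        for row, col in locked[1:]:        # secondaries; could be rolled forward
            cell = self._cell(row, col)
            cell.write[commit_ts] = self.start_ts
            cell.lock.pop(self.start_ts, None)
        return True
```

For instance, committing Bob = $10 and Joe = $2, then running a transaction that moves $7 from Bob to Joe, leaves a later reader seeing $3 and $9; a concurrent transaction that writes the same cell after another has committed it aborts at prewrite.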

Handling faulty nodes in Percolator
What if a node fails during a transaction? What would happen in a DBMS? What is the effect in Percolator? What needs to happen?
What should a transaction do if it finds locks left behind by another transaction?
Option #1: Rollback, if the primary lock still exists. No changes have been made visible yet, since the primary lock is always removed first.
Option #2: Roll-forward, if the primary lock no longer exists. All the writes of the original transaction need to be made visible.
What if two transactions 'collide'?
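The rollback/roll-forward decision can be sketched as a function that cleans up one stale lock. It assumes a hypothetical table layout of `(row, col) -> {"data": {}, "lock": {}, "write": {}}`, where each lock maps the dead transaction's start timestamp to its primary cell and `write` maps commit timestamp to start timestamp. It also assumes the lock holder is already known to have failed; deciding that is a separate problem not modeled here. The function name `resolve_lock` is hypothetical.

```python
def resolve_lock(table, row, col, start_ts):
    """Clean up a lock left at (row, col) by a failed transaction that started
    at start_ts. Returns "rolled back" or "rolled forward"."""
    cell = table[(row, col)]
    primary = table[cell["lock"][start_ts]]   # every lock points at the primary
    if start_ts in primary["lock"]:
        # Primary lock still held: nothing is visible yet, so roll back.
        # Erasing the primary lock also stops the old transaction from committing.
        for c in (primary, cell):
            c["lock"].pop(start_ts, None)
            c["data"].pop(start_ts, None)
        return "rolled back"
    commits = [cts for cts, sts in primary["write"].items() if sts == start_ts]
    if not commits:
        # Primary was already rolled back by someone else: drop our lock too.
        cell["lock"].pop(start_ts)
        cell["data"].pop(start_ts, None)
        return "rolled back"
    # Primary committed: roll forward by copying its write record.
    cell["write"][commits[0]] = start_ts
    cell["lock"].pop(start_ts)
    return "rolled forward"
```

The primary cell is the single point of truth: whether its lock or its write record exists decides the fate of every secondary lock.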

An example
Transfer $7 from Bob to Joe. The lock on Bob's 'bal' column is chosen as the primary.
Until version 8 of Bob's row is written, the transaction would be rolled back if the lock holder crashed. The visible data at this point is still $10 for Bob and $2 for Joe (see bal:write)! The values $3/$9 written at version 7 are not yet visible.
After that point, it would be rolled forward.

key  | bal:data | bal:lock             | bal:write
Bob  | 8:       | 8:                   | 8: data @ 7
     | 7: $3    | 7: PRIMARY           | 7:
     | 6:       | 6:                   | 6: data @ 5
     | 5: $10   | 5:                   | 5:
Joe  | 8:       | 8:                   | 8: data @ 7
     | 7: $9    | 7: PRIMARY @ Bob.bal | 7:
     | 6:       | 6:                   | 6: data @ 5
     | 5: $2    | 5:                   | 5:
(The version-8 write records are added at commit time; each version-7 lock is erased when the corresponding write record is added.)
Taken from the Percolator paper (OSDI 2010)

Observers
Users write code ('observers') that is triggered by changes to the table: register a function and a set of columns to be observed, and Percolator invokes the function when data in those columns is modified.
Similar to database triggers. But unlike triggers, observers are not atomic with the transaction that triggered them, so observers cannot be used to maintain data integrity! The authors claim that this makes them 'easier to understand'.
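A toy dispatcher illustrates the programming model: functions are registered for columns, and writes to those columns later trigger them, possibly producing further writes and thus more work. The class `ObserverRegistry` and its methods are hypothetical, not Percolator's API. As on the slide, the observer runs separately from (and after) the write that triggered it, so the two are not atomic; transactions, distribution, and failures are not modeled.

```python
from collections import defaultdict, deque


class ObserverRegistry:
    """Toy dispatcher: run registered functions when an observed column changes."""

    def __init__(self):
        self.observers = defaultdict(list)   # column -> [function]
        self.table = {}                       # (row, column) -> value
        self.pending = deque()                # changes not yet processed

    def register(self, columns, fn):
        for col in columns:
            self.observers[col].append(fn)

    def write(self, row, column, value):
        # The write completes immediately; observers run later, separately.
        self.table[(row, column)] = value
        self.pending.append((row, column))

    def run(self):
        """Process changes until no more work is produced."""
        while self.pending:
            row, column = self.pending.popleft()
            for fn in self.observers[column]:
                fn(self, row)   # may call self.write(), creating more work
```

For example, a crawler writing a page's raw contents can trigger a parsing observer, whose output in turn triggers an observer that computes statistics over the parsed text.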

Observer execution
Guarantees: multiple observers are allowed to observe the same column (consequence: it is possible to shoot yourself in the foot!), but at most one observer's transaction will commit for each observed change.
How is this implemented? With special 'notify' and 'acknowledgment' columns. When a transaction writes to an observed cell, it also sets the 'notify' column. Percolator workers continuously perform a distributed scan of the table, looking for dirty 'notify' cells. When one is found, the observer runs and then updates the 'acknowledgment' column.
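The notify/acknowledgment mechanism for a single observer can be sketched as follows, under stated assumptions: the class `ObservedTable` is hypothetical, one counter provides all timestamps, and the acknowledgment stores the start timestamp of the observer's last run. A worker runs the observer only if the observed column was written after the last acknowledgment. Concurrency is not modeled; in Percolator the acknowledgment is written inside the observer's transaction, so two workers racing on the same change conflict on that cell and only one commits. In this sketch, two writes before a scan trigger only one run, and a stale notification triggers none.

```python
class ObservedTable:
    """Toy model of notify/acknowledgment columns for one observed column."""

    def __init__(self):
        self.clock = 0
        self.write_ts = {}   # row -> timestamp of latest write to observed column
        self.notify = set()  # rows with a dirty notify cell
        self.ack = {}        # row -> start timestamp of last observer run

    def tick(self):
        self.clock += 1
        return self.clock

    def write(self, row):
        self.write_ts[row] = self.tick()
        self.notify.add(row)          # set the notify cell

    def scan_once(self, observer):
        """One pass of a worker: handle each row with a dirty notify cell."""
        for row in sorted(self.notify):
            start_ts = self.tick()
            # Run only if the observed column changed after the last ack.
            if self.write_ts[row] > self.ack.get(row, 0):
                observer(row)
                self.ack[row] = start_ts
            self.notify.discard(row)  # clear the notify cell
```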

Theory and practice
Observation: worker threads tend to cluster in the same region of the table! Why might this be happening? When a worker is busy, other workers queue up behind it, similar to 'bus clumping' in public transport.
Solution? When a worker finds that it is scanning the same row as another worker, it jumps to a random location in the table. (Not applicable to public transport.)
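The anti-clumping rule can be sketched as a single scanning step. This is a simplified model under stated assumptions: the function `next_row` is hypothetical, rows are numbered `0 .. num_rows - 1`, and workers share a `positions` map that stands in for whatever mechanism lets Percolator workers see each other's scan positions. A worker that would land on a row another worker is already scanning jumps to a random row instead of queueing behind it.

```python
import random


def next_row(worker, positions, num_rows, rng):
    """Advance one scanning worker, jumping away if it catches up with another.

    positions: worker id -> current row index (hypothetical shared view).
    """
    nxt = (positions[worker] + 1) % num_rows
    others = {pos for w, pos in positions.items() if w != worker}
    if nxt in others:
        # Anti-clumping rule: don't queue behind a busy worker.
        nxt = rng.randrange(num_rows)
    positions[worker] = nxt
    return nxt
```

Passing an explicit `random.Random` instance keeps the jumps reproducible in tests.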

"It depends"
What is faster: MapReduce or Percolator? Example: clustering on 240 machines with a continuous crawl. The answer depends on the crawl rate!
[Figure: MapReduce vs. Percolator performance as the crawl rate varies; annotation: "Percolator saturates resources".]

Alternatives to Percolator
Traditional DBMS: better for smaller computations (Percolator is designed for multi-petabyte data sets!).
MapReduce: better if computations can't be broken down into small updates.
Bigtable alone: better if the computation does not have strong consistency requirements.

Recap: Percolator
A more recent technology used in Google's index. It allows incremental updates, with no need to re-run large MapReduce jobs over the entire index. Result: data in the index is 'fresher'.
Main components: transactions and observers. Provides snapshot isolation semantics (which allow, e.g., cheaper reads). Runs over Bigtable, which in turn runs over GFS.
"Existence proof for distributed transactions at Web scale"

You should be able to...
Identify security problems in web systems and apply suitable countermeasures. Example: devise attacks on a poorly secured servlet.
Write XQueries (FLWOR etc.).
Compare various consistency models. Example: eventual/sequential/snapshot consistency.
Understand the fundamentals of Information Retrieval. Example: compare the Boolean model and the Vector model.
Understand techniques for achieving robustness to various types of faults, and their costs. Example: How would you build a storage system that handles a) crash faults, b) rational behavior, c) Byzantine faults?

Review questions
Compare the architectures of Google and Mercator.
What is a Sybil attack, and how can you defend a system against it?
How would you implement a DHT in Pastry? Be able to provide pseudocode and discuss failure cases.
Explain similarities and differences between the semantics of RPCs and local function calls. Can you pass values by reference in an RPC? How can you achieve exactly-once semantics?
Compare SOAP and REST.
Explain PageRank: Intuition? How to compute it?

Review questions
Compare XQuery and XSLT.
Web-specific challenges for Information Retrieval?
Compare the Boolean model and the Vector model.
Compare HITS and PageRank.
Write a simple MapReduce program.
Possible defenses against various SEO techniques?
Explain the utility computing model; compare it to the classical model.
Compare different consistency models.
Be able to do a simple ARIES example.
Which faults can you (not) recover from in 2PC?
Example of a fault that is rational but not Byzantine?

Review questions
Design or debug a simple incentive scheme.
How would you exploit BitTorrent for your own profit?
Explain why we need fault models.
How would you implement search suggestions? How would you implement phrase search?
How can you 'optimize' the PageRank of your site?
Explain what the utility computing model is.
2PC: Explain how to recover from a given fault.
2PL: Explain why it works and how it could go wrong.
For each component of ACID, name one technique that can be used to implement it, and provide an example where it goes wrong.

Review questions
Explain TF-IDF ranking.
Explain the idea behind stemming.
Write an XQuery (with FLWOR). Example: use of a correspondence table.
Discuss the importance of replication for a new service.
