Presentation is loading. Please wait.

Presentation is loading. Please wait.

Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook.

Similar presentations


Presentation on theme: "Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook."— Presentation transcript:

1 Big Data and NoSQL What and Why?

2 Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook / 1.1 billion (sep 2012) [Yahoo finance] –Twitter / 500 mio (jul 2012) [techcrunch.com] –Netflix / 30 mio (okt 2012) [Yahoo finance] –Google / index ~45 billion pages (nov 2012) [http://www.worldwidewebsize.com/] So – how do we store and query this type of data set? 2

3 Motivation: Scalability Another characteristica is the potential rapidly expanding market –Facebook (end of year data): 2004: 1 mio 2006: 12 mio 2008: 150 mio 2010: 600 mio 2012: 1.100 mio How to scale fast enough? 3

4 Motivation: Availability Bad user experience –Down time –Long waits... Equals lost users How to ensure good user experience? 4

5 From Alvin Richard’s Goto presentation 5

6 Requirements: So the QA requirements are –Performance: Time-dimension:fast storage and retrival of data Space-dimension:large data sets –Availability:of data service 6

7 The players The old workhorse –RDBM: Relational DataBase Management The hyped and forgotten (?) –OODB –XMLDB BaseX The hyped and uncertain –NoSQL 7

8 RDBM Performance: –Time Queries do not scale linearly in the size of the data set  –10.000 rows~1 second –1 billion rows~24 hours 8 [Jacobs (2009)]

9 RDBM Performance: –Space Generally, RDBM are rather monolith One RDBM = One machine Limitations on disk size, RAM, etc. –(There are ‘tricks’ to distribute data over many machines, but it is generally an ‘after-the-fact’ design, not something considered a premise...) 9

10 RDBM Availability: –OK I think... Redundancy, redundant disk arrays, the usual bag of tricks 10

11 RDBM Performance Starting with Time behaviour 11

12 RDBM Space tricks Sharding Exercise: Give examples from WoW … 12 [Wikipedia]

13 Why do RDBMs become slow? Jacobs’ paper explains the details. It boils down to: –The memory hierarchy Cache, RAM, Disk, (Tape) Each step has a big performance penalty –The reading pattern Sequential read faster than Random read Tape: obviousbut even for RAM this is true! 13

14 Issue 1 14

15 Thus Issue 1: Large data sets cannot reside in RAM Thus read/write time is slow 15

16 Issue 2 Issue –Often one domain type is fixed at ”large” Number of customers/users ( < 7 billion ) Number of sensors –While other domains increase indefintely Observations over time –each page visit, transaction, sensor reading Data production –tweets, facebook update, log entry 16

17 But... The relational database model explicitly ignores the ordering of rows in tables. 17

18 Example Transactions recorded by a web server –Transaction log is inherently time ordered –Queries are usually about time as well How may last day? Most visiting user last week? However results in random visits to the disk! 18

19 Normalization It becomes even worse due to the classic RDBM virtue of normalization: A user and a transaction table 19

20 Jacobs’ statement 20

21 Denormalization technique Denormalization: –Make aggregate objects –Allows much faster access –Liability: Much more space intensive And! –Store data in the sequences they are to be queried… 21

22 NoSQL Databases A new take on storing and quering big data 22

23 Key features NoSQL: ”Not Only SQL”/ Not Relational –Horizontal scaling of simple operations over many servers –Repliation and partitioning data over many servers –Simple call level interface –Weaker concurrency model than ACID –Efficient use of RAM and dist. Indexes for storage –Ability to dynamically add new attributes to records Architectural Drivers: –Performance –Scalabilty 23 [Cattel, 2010]

24 Clarifications NoSQL focus on –Simple operations Key lookup, read/write of one or a few records Opposite:Complex joins over many tables (SQL) (NoSQL generally denormalize and thus do not require joins) –Horizontal scaling Many servers with no RAM nor disk sharing Commodity hardware –Cheap but more prone to failures 24

25 NoSQL DB Types 25

26 Basic types Four types –Key-value stores Memcached, Riak, Regis –Document stores MongoDB –Extensible record (Fowler: Column-family) Cassandra, Google BigTable –Graph stores Neo4J 26 Fowler: Aggregate-oriented

27 Key-value Basically the Java Map datastructure: –Map.put(”pid01-2012-12-03-11-45”, measurement); –m = Map.get(”pid01-2012-12-03-11-45”); Schema:Any value under any key… Supports: –Automatic replication –Automatic sharding –Both using hashing on the key Only lookup using the (primary) key 27

28 Document Stores ”documents” –MongoDB:JSON objects. –Stronger queries, also in document contents –Schema: Any JSON object may be stored! –Atomic updates, otherwise no concurrency control Supports –Master-slave replication, automatic failover and recovery –Automatic sharding Range-based, on shard key (like zip-code, CPR, etc.) 28

29 Extensible Record/Column First kid on the block: Google BigTable Datamodel –Table of rows and columns Scalability model: splitting both on rows and columns Row: (Range) Sharding on primary key Column: Column groups – domain defined clustering –No Schema on the columns, change as you go –Generally memory-based with periodic flushes to disk 29

30 (Graph) Neo4J 30

31 CAP Theorem 31

32 Reviewing ACID Basic RDBM teaching talks on ACID Atomicity –Transaction: All or none succeed Concistency –DB is in valid state before and after transaction Isolation –N transactions executed concurrently = N executed in sequence Durability –Once a transaction has been committed, it will remain so (even in the event of power loss, crash) 32

33 CAP Eric Brewer: only get two of the three: Consistency –Set of operations has occurred at once (Client view) Availability –Operations will terminate in intended reponse Partition tolerence –Operation will complete, even if components are unavailable 33

34 Horizontal Scaling: P taken We have already taken P, so we have to relax either –Consistency –Availability RDBM prefer consistency over availability NoSQL prefer availability over consistency –replacing it with eventual consistency 34

35 Achieving ACID Two-phase Commit 1.All partitions pre-commit, report result to master 2.If success, master tells each to commit; else roll-back Guaranty consistency, but availability suffer Example –Two partitions, 99.9% availability => 99.9 2 = 99.8% (+43 min down every month) –Five partitions: 99,5% (36 hours down time in all) 35

36 BASE Replace ACID with BASE BA:Basically Available S:Soft state E:Eventual consistent Availability achieved by partial failures not leading to system failures –In two-phase commit, what would master do if one partition does not repond? 36

37 Eventual Consistency So what does this mean? –Upon write, an immediate read may retrieve the old value – or not see the newest added item! Why? Gets data from a replica that is not yet updated… –Eventual consistency: Given sufficiently long period of time which no updates, all replicas are consistent, and thus all reads consistently return the same data… System always available, but state is ‘soft’ / cached 37

38 Discussion Web applications –Is it a problem that a facebook update takes some minutes to appear at my friends? 38

39 Summary RDBMs have issues with ‘big data’ –Ignores the physical laws of hardware (random read) –RDB model requires joins (random read) NoSQL –Class of alternative DB models and impl. –Two main classes Key-value stores Document stores CAP –You can only get two of three –NoSQL choose A, and sacrifice C for Eventual Consistency CS@AUHenrik Bærbak Christensen39


Download ppt "Big Data and NoSQL What and Why?. Motivation: Size WWW has spawned a new era of applications that need to store and query very large data sets –Facebook."

Similar presentations


Ads by Google