Big Data and NoSQL BUS 782.


1 Big Data and NoSQL BUS 782

2 What is Big Data? https://www.youtube.com/watch?v=c4BwefH5Ve8
Employee-generated data
User-generated data
Machine-generated data
Big Data Analytics: 11 Case Histories and Success Stories eature=iv&src_vid=c4BwefH5Ve8&v=t4wtzIuoY0w

3 Big Data
Data Size:
Gigabyte
Terabyte: e.g., a terabyte USB drive
Petabyte: Wal-Mart handles more than 1 million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes
Exabyte: traffic flowing over the internet amounts to about 700 exabytes annually
Zettabyte
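
To relate the units on this slide to one another, the following is a minimal sketch assuming decimal (SI) units, where each step is a factor of 1000. The helper names `unit_size` and `convert` are illustrative and not part of the original slides.

```python
# Decimal (SI) storage units: each unit is 1000 times the previous one.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]


def unit_size(name: str) -> int:
    """Return the number of bytes in one unit of the given name."""
    return 1000 ** UNITS.index(name)


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a quantity between two of the units listed above."""
    return value * unit_size(from_unit) / unit_size(to_unit)


# 2.5 petabytes expressed in terabytes, and 700 exabytes in zettabytes.
print(convert(2.5, "petabyte", "terabyte"))
print(convert(700, "exabyte", "zettabyte"))
```

Binary units (powers of 1024) are also in common use; the sketch deliberately picks the decimal convention.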

4 Big Data: Some Facts
The world's information is doubling every two years
The world generated 1.8 ZB of information in 2011
Cisco predicts that by 2016 global IP traffic will reach 1.3 zettabytes per year
There will be 19 billion networked devices by 2016
70% of this data is being generated by individuals as opposed to enterprises and organizations

5 Big Data Sources
Web sites
Social media
Machine generated
RFID
Image, video, and audio
Etc.

6 Big Data Challenges
Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.
The "3Vs":
Volume: size >= TBs
Velocity: processing speed
Variety: structured data (able to fit in a database table) and unstructured data

7 Do Companies Care about Data?
Not really. What they care about are Key Performance Indicators (KPIs).
Some examples of KPIs are:
Revenue
Profit
Revenue per customer/employee
Customer attrition: the loss of clients or customers
Big Data is only useful if it helps drive KPIs.

8 Big Data to KPIs

9 Applications
Text mining: deriving high-quality information from text.
  Text categorization, text clustering, concept/entity extraction, sentiment analysis, etc.
Web mining:
  Web usage mining
  Web content mining
Social media mining:
  Salesforce Radian6 Social Marketing Cloud

10 Advantages of Relational Databases
Well-defined database schema
Flexible query language
Maintain database consistency in business transactions:
  Concurrent database processing with multiple users
  Reading/updating
  Locking

11 Transaction ACID Properties
Atomic
  A transaction cannot be subdivided
  All or nothing
Consistent
  Constraints that hold before the transaction still hold after it
  A transaction transforms a database from one consistent state to another consistent state.
Isolated
  Transactions execute independently of one another.
  Database changes are not revealed to other users until the transaction has completed
Durable
  Database changes are permanent and must not be lost.
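
The atomic and consistent properties can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `account` table, the names in it, and the `transfer` helper are assumptions made up for this example; they do not come from the slides.

```python
import sqlite3

# In-memory database with a constraint: balances may never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was rolled back too.
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))
```

Calling `transfer(conn, "alice", "bob", 500)` fails the constraint and leaves both balances untouched, while a transfer of 30 succeeds and moves the database from one consistent state to another.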

12 Problems with Relational Databases in Managing Big Data
High overhead in maintaining database consistency
Do not support unstructured data search very well (e.g., Google-style searching)
Do not handle data in unexpected formats well
Don't scale well to very large databases:
  Expensive "scale up": adding processors, storage
  Slow query response time
  Data must move to the server
  Server failure
Organizations such as Facebook, Yahoo, Google, and Amazon were among the first to decide that relational databases were not good solutions for the volumes and types of data that they were dealing with.

13 What Is Needed in a New Approach
Deal with data sizes never imagined before.
Hardware failure should be expected.
Data has gravity: compute has to move to the data.

14 What is Hadoop?
Open source project by the Apache Software Foundation
Based on papers published by Google:
  Google File System (Oct 2003)
  MapReduce (Dec 2004)
Consists of two core components:
  Hadoop Distributed File System (Storage)
  MapReduce (Compute)

15 How Hadoop Fits in the New Approach
Runs on a cluster of low-cost commodity servers, so it can accommodate petabytes of data cost effectively.
Embraces partial failures
Data locality (computation on the local node where the data resides)
Scales horizontally (scale out)
A Hadoop file is:
  Distributed: a file is stored across many servers
  Replicated: a file is replicated with many copies

16 Hadoop HDFS: Hadoop Distributed File System
Based on GFS
Designed to store very large amounts of data (TBs and PBs) and much larger file sizes
Write-once, read-many-times access pattern
Designed to run on clusters of commodity hardware and does replication for reliability
Allows data to be read and processed locally
Supports limited operations on files: write, delete, append, and read, but no updates
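
The idea of a distributed, replicated file can be sketched as follows. This is a toy model only: the round-robin placement, the replication factor of 3, and the function names `place_blocks` and `readable_blocks` are assumptions for illustration and do not reproduce the actual HDFS placement policy.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into blocks and assign each block to several nodes.

    Toy round-robin placement; real HDFS placement is more elaborate.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Choose `replication` distinct nodes, starting at a rotating offset.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return [block for block, replicas in placement.items()
            if any(node not in failed_nodes for node in replicas)]


placement = place_blocks(file_size=500, block_size=128,
                         nodes=["n1", "n2", "n3", "n4"])
print(placement)
print(readable_blocks(placement, failed_nodes={"n1", "n2"}))
```

With three replicas per block, losing any two nodes in this toy cluster still leaves every block readable, which is the point of replicating for reliability.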

17 MapReduce: a programming model for distributed processing of data
Rather than take the conventional step of moving data over a network to be processed by software, MapReduce moves the processing software to the data.
Each node both stores data and computes, and does its best to process local data.
MapReduce has two main phases:
  Map
  Reduce

18 Example: Word Count
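
The word count example can be sketched in plain Python, run in a single process. The function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative assumptions and this is not the Hadoop API; in a real cluster the map and reduce calls would run on many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return (key, sum(values))


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big ideas", "data moves to compute"])
print(counts)
```

Each document could be mapped on the node that stores it, and only the small (word, count) pairs need to cross the network.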

19 Hadoop Ecosystem
HBase: a column-oriented data store
Hive: provides a SQL-like query capability
Pig: a high-level language for creating MapReduce jobs
HCatalog: takes Hive's metadata and makes it available across the Hadoop ecosystem
Mahout: a library of algorithms for clustering, classification, and filtering
Sqoop: accelerates bulk loads of data between Hadoop and RDBMS
Flume: streams large volumes of log data from multiple sources into Hadoop

20 NoSQL Database
NoSQL (Not Only SQL) is a broad class of database management systems identified by non-adherence to the widely used relational database management system model. They are useful when working with a huge quantity of data whose nature does not require a relational model.

21 Types of NoSQL Databases
Column-oriented database
  Example: Cassandra
Document-oriented database
  Example: MongoDB, CouchDB
  Data stored in JSON (JavaScript Object Notation) format

22 JSON, JavaScript Object Notation http://www. w3schools
JSON Example
{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
]}
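
As a short illustration of how an application consumes such a document, the sketch below parses the same JSON with Python's standard `json` module. The variable names are assumptions for this example.

```python
import json

# The JSON example from the slide, as a string an application might receive.
text = """
{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
]}
"""

# json.loads turns JSON objects into dicts and JSON arrays into lists.
data = json.loads(text)
full_names = [f'{e["firstName"]} {e["lastName"]}' for e in data["employees"]]
print(full_names)

# json.dumps serializes the structure back to a JSON string.
round_trip = json.loads(json.dumps(data))
```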

23 Cassandra is essentially a key-value store
This means that all data is stored in a single 'table', each row of which is uniquely identified by a key. In JSON representation:
{
  "user1": {
    "Bio": {
      "name": "Shaneeb Kamran",
      "age": 23
    }
  },
  "user2": {
    "name": "Salman ul Haq",
    "profession": "Developer",
    "Education": {
      "bachelors": "NUST"

24 Column Data Model http://www. sinbadsoft
A column is a key-value pair consisting of three elements:
  1: Unique name: used to reference the column.
  2: Value: the content of the column.
  3: Timestamp: used to determine the valid content.
Column Family: a container for columns sorted by their names. Column Families are referenced and sorted by row keys.
Super Column: a sorted associative array of columns.
  Example: multi-value attribute
Super Column Family: a container for super columns sorted by their names. Super Column Families are referenced and sorted by row keys.
Keyspace: top-level element. Container for column families.
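
The column and column family definitions above can be sketched as a small in-memory structure. This is an assumption-laden toy, not Cassandra's implementation: the class names and the rule that the column with the latest timestamp wins are chosen here to illustrate how a timestamp can determine the valid content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str        # unique name used to reference the column
    value: str       # content of the column
    timestamp: int   # used to decide which write is the valid one


class ColumnFamily:
    """Toy container: row key -> {column name -> Column}."""

    def __init__(self):
        self.rows = {}

    def insert(self, row_key, column):
        """Store a column, keeping only the write with the latest timestamp."""
        row = self.rows.setdefault(row_key, {})
        current = row.get(column.name)
        if current is None or column.timestamp >= current.timestamp:
            row[column.name] = column

    def get_row(self, row_key):
        """Return the row's columns sorted by column name."""
        row = self.rows.get(row_key, {})
        return [row[name] for name in sorted(row)]


users = ColumnFamily()
users.insert("user1", Column("name", "Anna", timestamp=1))
users.insert("user1", Column("email", "anna@example.com", timestamp=2))
users.insert("user1", Column("name", "Anna Smith", timestamp=5))
users.insert("user1", Column("name", "stale write", timestamp=3))
print(users.get_row("user1"))
```

Rows in the same family need not share the same set of columns, which is what distinguishes this model from a fixed relational schema.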

25 Column Family Super column family

26

27 Migrate a Relational Database Structure into a NoSQL Cassandra Structure
{
  "biologicalfeatures": {
    "forests" : {
      "forest003" : { "name" : "Black Forest", "trees" : "two million", "bushes" : "three million" },
      "forest045" : { "name" : "100 Acre Woods", "trees" : "four thousand", "bushes" : "five thousand" },
      "forest127" : { "name" : "Lonely Grove", "trees" : "none", "bushes" : "one hundred" }
    },
    "famoustrees" : {
      "tree12345" : { "forestID" : "forest003", "name" : "Der Tree", "species" : "Red Oak" },
      "tree12399" : { "forestID" : "forest045", "name" : "Happy Hunny Tree", "species" : "Willow" },
      "tree32345" : { "forestID" : "forest003", "name" : "Das Ubertree", "species" : "Blue Spruce" }
    }
  }
}
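
In the relational version, finding the famous trees of a forest would be a join on forestID. In the migrated structure, the application follows the key itself. The sketch below assumes the structure is held as nested Python dictionaries (with some attributes omitted for brevity); `trees_in_forest` is a hypothetical helper, not a Cassandra call.

```python
# The migrated structure from the slide, abbreviated, as nested dictionaries.
biologicalfeatures = {
    "forests": {
        "forest003": {"name": "Black Forest", "trees": "two million"},
        "forest045": {"name": "100 Acre Woods", "trees": "four thousand"},
        "forest127": {"name": "Lonely Grove", "trees": "none"},
    },
    "famoustrees": {
        "tree12345": {"forestID": "forest003", "name": "Der Tree"},
        "tree12399": {"forestID": "forest045", "name": "Happy Hunny Tree"},
        "tree32345": {"forestID": "forest003", "name": "Das Ubertree"},
    },
}


def trees_in_forest(data, forest_id):
    """Application-side 'join': collect tree names whose forestID matches."""
    return sorted(tree["name"]
                  for tree in data["famoustrees"].values()
                  if tree["forestID"] == forest_id)


def forest_of(data, tree_id):
    """Follow a tree's forestID key back to the forest row."""
    forest_id = data["famoustrees"][tree_id]["forestID"]
    return data["forests"][forest_id]["name"]


print(trees_in_forest(biologicalfeatures, "forest003"))
print(forest_of(biologicalfeatures, "tree12399"))
```

The lookup by row key is cheap, while the scan over all trees is the part a relational database would have handled with an indexed join.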

28 Document database: MongoDB http://docs. mongodb
MongoDB stores business subjects in documents. A document is the basic unit of data in MongoDB. Documents are analogous to JSON objects but exist in the database in a more type-rich format known as BSON (Binary JSON), a binary-encoded serialization of JSON-like documents.
Data modeling revolves around the structure of MongoDB documents and how the application represents relationships between data: with references or with embedded documents.

29 Example using reference

30 Embedded Data Models
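
The two ways of representing a relationship, references and embedded documents, can be sketched with plain Python dictionaries standing in for documents. The collections, field names, and the `contacts_for` helper below are hypothetical and do not use the MongoDB driver.

```python
# Reference model: two hypothetical collections, linked by an id field.
users = {
    "u1": {"_id": "u1", "name": "Anna Smith"},
}
contacts = {
    "c1": {"_id": "c1", "user_id": "u1", "phone": "555-0100"},
}


def contacts_for(user_id):
    """Resolve the reference: a second lookup is needed to find contacts."""
    return [c for c in contacts.values() if c["user_id"] == user_id]


# Embedded model: the related data lives inside the parent document,
# so a single read returns everything.
embedded_user = {
    "_id": "u1",
    "name": "Anna Smith",
    "contact": {"phone": "555-0100"},
}

print(contacts_for("u1")[0]["phone"])
print(embedded_user["contact"]["phone"])
```

References avoid duplicating data shared by many documents, while embedding keeps data that is read together in one place; which trade-off fits depends on the application's access patterns.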

31 CouchDB
A CouchDB document is a JSON object that consists of named fields. Field values may be strings, numbers, dates, or even ordered lists and associative maps. An example of a document would be a blog post:
{
  "Subject": "I like Plankton",
  "Author": "Rusty",
  "PostedDate": "5/23/2006",
  "Tags": ["plankton", "baseball", "decisions"],
  "Body": "I decided today that I don't like baseball. I like plankton."
}

32 Problems with NoSQL Databases
They generally do not support transaction consistency the way relational database systems do.
There is no standard query language for NoSQL databases.

33 NewSQL Databases http://en.wikipedia.org/wiki/NewSQL
NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.

34 Approaches of NewSQL Systems
1. Distributed cluster of shared-nothing nodes: each node owns a subset of the data. These databases include components such as distributed concurrency control and distributed query processing.
2. Transparent sharding: these systems provide a sharding middleware layer to automatically split databases across multiple nodes.
3. Highly optimized SQL engines
4. In-memory database
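
The idea that each node owns a subset of the data, with a layer that routes each key to the right node, can be sketched as below. Hash-based routing is one possible scheme chosen here as an assumption; the names `shard_for` and `ShardedStore` are illustrative and do not correspond to any particular NewSQL product.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy routing layer: callers see one store, data lives in many shards."""

    def __init__(self, num_shards: int):
        # Each dict stands in for a separate shared-nothing node.
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for customer_id in ["c001", "c002", "c003", "c004", "c005"]:
    store.put(customer_id, {"id": customer_id})

print(store.get("c003"))
print([len(shard) for shard in store.shards])
```

A simple modulo scheme like this one forces most keys to move when the number of shards changes, which is one reason production systems use more elaborate partitioning.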

35 In-Memory Database
An in-memory database is a database management system that primarily relies on main memory for computer data storage. It is contrasted with database management systems that employ a disk storage mechanism.
Main memory databases are faster than disk-optimized databases.
Good for Big Data analytics.
May use non-volatile main memory modules that retain data even when electrical power is removed.

36 SAP HANA, High-Performance Analytic Appliance
SAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP.
HANA's architecture is designed to handle both high transaction rates and complex query processing on the same platform.
HANA's performance is claimed to be up to 10,000 times faster than that of standard disk-based systems, which allows companies to analyze data in a matter of seconds instead of long hours.

