
1 NETE4631 Network Information Systems (NISs): Big Data and Scaling in the Cloud. Suronapee, PhD (suronape@mut.ac.th)

2 Huge Amount of Data • (figure: statistics on the amount of data generated today)

3 Statistical Facts on the Number of Devices • (figure: device-count statistics)

4 Big Data • A collection of data sets so large and complex that it is impossible to process on a single computer with the usual databases and tools • Big Data represents information assets characterized by high Volume, Velocity, and Variety • Because of its size and complexity, Big Data is hard to capture, store, copy, delete (for privacy), search, share, analyze, and visualize

5 Big Data Processing • Processing Big Data becomes feasible when it is combined with the cloud • Specific technologies and analytical methods are required to transform Big Data into Value • Flow: input → process → derived data

6 MapReduce • What is MapReduce? • A programming model inspired by LISP • Built on scatter and gather principles • Many problems can be phrased this way • Large input data makes even simple computations impractical on a single machine • Advantages • Easy to process and generate large data sets • Hides the difficulty of writing parallel code • The system takes care of scheduling, load balancing, handling machine failures, etc.

7 MapReduce Programming Model • The computation takes a set of input key/value pairs and produces a set of output key/value pairs • Users express the computation as two functions: Map and Reduce • Map • Takes an input pair and produces a set of intermediate key/value pairs • Reduce • Accepts an intermediate key I and a set of values for that key, and merges these values to form a possibly smaller set of values (typically zero or one output value per reduce invocation)

8 Word Count • Count the number of times each distinct word appears in the file • MAP(key = line, value = contents): for each word w in value, emit (w, 1) • REDUCE(key = word, values): emit (word, sum of values) • A runnable sketch follows below
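A minimal single-process Python sketch of the same logic (the function names and the in-memory stand-in for the shuffle are illustrative; a real MapReduce system distributes these steps across machines):

from collections import defaultdict

def map_fn(line):
    # MAP: emit (word, 1) for every word on the line
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # REDUCE: sum all partial counts for one word
    return (word, sum(counts))

def word_count(lines):
    # In-memory stand-in for the shuffle phase: group values by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return [reduce_fn(word, counts) for word, counts in sorted(groups.items())]

print(word_count(["the quick brown fox", "the lazy dog"]))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]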

9 Word Count Illustrated • (figure: word-count data flowing through map, shuffle, and reduce)

10 Observation • Conceptually, the map and reduce functions supplied by the user have associated types • The input keys and values are drawn from a different domain than the output keys and values • Furthermore, the intermediate keys and values are from the same domain as the output keys and values
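In the notation of the original MapReduce paper (Dean and Ghemawat, 2004), these types are:

map:    (k1, v1)       → list(k2, v2)
reduce: (k2, list(v2)) → list(v2)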

11 PageRank Algorithm • Phase 1: Propagation • Phase 2: Aggregation • Input: a pool of objects, including both vertices (pages) and edges (links)

12 PageRank: Propagation • Map: for each object • If the object is a vertex, emit key = URL, value = object • If the object is an edge, emit key = source URL, value = object • Reduce (input is a web page and all of its outgoing links): • Count the edge objects → the number of outgoing links • Read the PageRank value from the vertex object • Assign PR(edge) = PR(vertex) / num_outgoing

13 PageRank: Aggregation • Map: for each object • If the object is a vertex, emit key = URL, value = object • If the object is an edge, emit key = destination URL, value = object • Reduce (input is a web page and all of its incoming links): • Add the PR values of all incoming links • Assign PR(vertex) = Σ PR(incoming links) • A single-process sketch of one round follows below
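A compact single-process Python sketch of one propagation-plus-aggregation round (the object layout, function name, and in-memory grouping are illustrative assumptions; damping is omitted, as on the slides):

from collections import defaultdict

# Illustrative object pool: vertices carry a PageRank value, edges carry endpoints
vertices = {"A": 1.0, "B": 1.0, "C": 1.0}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]

def pagerank_round(vertices, edges):
    # Phase 1 (propagation): group edges by source URL,
    # split each vertex's PR evenly over its outgoing links
    out_edges = defaultdict(list)
    for src, dst in edges:
        out_edges[src].append(dst)
    edge_pr = {}
    for src, targets in out_edges.items():
        share = vertices[src] / len(targets)   # PR(edge) = PR(vertex) / num_outgoing
        for dst in targets:
            edge_pr[(src, dst)] = share
    # Phase 2 (aggregation): group edges by destination URL,
    # sum the incoming shares into each vertex's new PR
    new_pr = {v: 0.0 for v in vertices}
    for (src, dst), pr in edge_pr.items():
        new_pr[dst] += pr
    return new_pr

print(pagerank_round(vertices, edges))
# {'A': 1.0, 'B': 0.5, 'C': 1.5}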

14 More Examples • Distributed Grep: • Map: emits a line if it matches a supplied pattern • Reduce: copies the supplied intermediate data to the output • Count of URL Access Frequency: • Map: processes logs of web page requests, outputs (URL, 1) • Reduce: adds together all values for the same URL and emits (URL, total count) pairs • Reverse Web-Link Graph: • Map: outputs a (target, source) pair for each link to a target URL found in a page named source • Reduce: concatenates the list of all source URLs associated with a given target URL and emits (target, list(source)) • A sketch of Distributed Grep follows below
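A sketch of Distributed Grep in the same single-process style (the pattern, input lines, and driver loop are illustrative):

import re

PATTERN = re.compile(r"error")  # illustrative pattern supplied by the user

def map_fn(line):
    # MAP: emit the line itself if it matches the pattern
    if PATTERN.search(line):
        yield (line, None)

def reduce_fn(key, values):
    # REDUCE: identity -- copy each matching line to the output
    return key

for line in ["disk error on node 3", "all good", "timeout error"]:
    for key, _ in map_fn(line):
        print(reduce_fn(key, [None]))  # prints the two matching lines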

15 Implementation • (figure: overall flow of a MapReduce operation)

16 Execution Overview • When the user calls MapReduce, the following sequence of actions occurs: • The MapReduce library first splits the input files into M pieces (typically 16-64 MB each) and starts up many copies of the program on a cluster of machines • The master, one of the program copies, assigns work to the workers • A worker assigned a map task does the following: • Reads the contents of the corresponding input split • Parses key/value pairs from the input data and passes each pair to the Map function • Buffers the produced intermediate key/value pairs in memory • Buffered pairs are periodically written to local disk, partitioned into R regions by the partitioning function (see the sketch below); their locations are passed back to the master • The master forwards these locations to the reduce workers • A reduce worker reads the intermediate data and sorts it by key • The reduce worker applies the Reduce function and appends the results to the output
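A minimal sketch of the partitioning step (hash(key) mod R is the default scheme described in the MapReduce paper; the function name is illustrative, and a real system would use a stable hash rather than Python's per-process salted hash):

R = 4  # number of reduce tasks, i.e., output regions

def partition(key, r=R):
    # All pairs with the same key land in the same region,
    # so one reduce worker sees every value for that key
    return hash(key) % r

for key in ["apple", "banana", "apple"]:
    print(key, "-> region", partition(key))  # both "apple" pairs map to one region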

17 Parallelism • map() functions run in parallel, creating different intermediate values from different input data sets • reduce() functions also run in parallel, each working on a different output key • All values are processed independently • Bottleneck: the reduce phase can't start until the map phase is completely finished

18 Combiners • Often a map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k • E.g., popular words in Word Count • Network time can be saved by pre-aggregating at the mapper • combine: (k2, list(v2)) → (k2, v2) • Usually the same as the reduce function • Works only if the reduce function is commutative and associative • A sketch follows below
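For Word Count the combiner can simply reuse the reduce logic, applied to each mapper's local output before anything crosses the network (a sketch; the names are illustrative):

from collections import defaultdict

def combine(pairs):
    # Pre-aggregate (word, 1) pairs locally on the map machine
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

mapper_output = [("the", 1), ("quick", 1), ("the", 1), ("the", 1)]
print(combine(mapper_output))  # [('the', 3), ('quick', 1)] -- 2 pairs shipped instead of 4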

19 Hadoop • (figure: Hadoop architecture overview)

20 Hadoop Execution • 1. The client submits a "wordcount" job, indicating the code and input files • 2. The JobTracker breaks the input file into k chunks (64 MB each) and assigns work to TaskTrackers • 3. After map(), TaskTrackers exchange map output to group it by key (the shuffle) • 4. The JobTracker breaks the reduce() keyspace into m chunks and assigns work • 5. reduce() output may go to HDFS
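With Hadoop Streaming, map and reduce can be plain programs that read stdin and write tab-separated key/value lines to stdout. A minimal illustrative streaming-style reducer for word count (it assumes Hadoop has already sorted the mapper output by key, so equal words arrive on consecutive lines):

import sys

# Input lines look like "word\t1", sorted by word
current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{total}")  # flush the previous word's total
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")  # flush the last word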

21 Map-Machine • Reads the contents of its assigned portion of the input file • Parses and prepares the data for input to the map function • Passes the data into the map function and saves the result in memory • Periodically writes completed work to local disk • Notifies the Master of this partially completed work (intermediate data)

22 Reduce-Machine • Receives notification from the Master of partially completed work • Retrieves the intermediate data from the Map-Machine via remote read • Sorts the intermediate data by key • Iterates over the intermediate data • For each unique key, sends the corresponding set of values through the reduce function (see the sketch below) • Appends the result of the reduce function to the final output file (HDFS)
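The "sort, then iterate over unique keys" step maps naturally onto itertools.groupby in Python (a sketch; the intermediate pairs are illustrative):

from itertools import groupby
from operator import itemgetter

# Intermediate (key, value) pairs fetched from map machines, unsorted
intermediate = [("dog", 1), ("cat", 1), ("dog", 1), ("cat", 1), ("ant", 1)]

# Sort by key, then feed each unique key and its values through reduce
intermediate.sort(key=itemgetter(0))
for key, group in groupby(intermediate, key=itemgetter(0)):
    values = [v for _, v in group]
    print(key, sum(values))  # stand-in for appending to the final output file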

23 Data Flow • Input and final output are stored on a distributed file system • The scheduler tries to schedule map tasks "close" to the physical storage location of the input data • Intermediate results are stored on the local file systems of the map and reduce workers • The output is often the input to another MapReduce task

24 Capacity Planning • Cloud providers have to scale on demand • Capacity planning: match demand to available resources

25 Scaling in the Cloud • Scale vertically (scale up) • Add resources to a node (or a server) to make it more powerful • Scale horizontally (scale out) • Add more nodes (or commodity servers)

26 Building Blocks in the Cloud • Data center • Servers: what we want to connect • Switches: control who is connected right now (enabling data to flow) • Switch • A layer-2 device that handles local networking • Switching a connection is based on its own internal hardware

27 Scaling the Servers • Add more ports to the switch • Would have to support hundreds of thousands of gigabits per second • Hundreds of thousands of servers in a data center, each of which requires up to 1 Gbps • Infeasible • Add more switches • Imagine a tree-like structure

28 What Happens as We Keep Going Up the Tree? • It is technologically impossible to build the enormous root switch • Adding more ports is expensive • What happens if the root fails? • Switches can't handle that much load • E.g., with a maximum of 2 Gbps per switch, the other two connections go unused

29 From Tree to Fat Tree • A 4x4 switch represented as two sets of 2x2 switches • Enforce the "criss-cross" wiring pattern

30 A Large Fat Tree: an 8x8 Switch Built from 2x2 Switches (4x(2x2)) • A scalable tree, using only 2x2 switches (smaller switches)

31 The Clos Network • Non-blocking property: "Any unused server can connect to any other unused server at any time, no matter what the other connections are." • Achieved by adding another set of switches in the middle
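As background (a standard result, not derived on these slides): for a three-stage Clos network with n inputs per ingress switch and m middle-stage switches, Clos's 1953 theorem gives

    m ≥ 2n − 1   (strictly non-blocking)
    m ≥ n        (rearrangeably non-blocking: new connections may require re-routing existing ones)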

32 Scale Out Is Better than Scale Up • Scale out: having a lot of smaller switches • Scale up: having a few big switches

33 Scaling Comparison • Cost • Scale up normally costs more than scale out • Scale out lets you try smaller, specialized configurations • Maintenance • Scale out increases the number of systems you must manage • Communication • Scale out increases the amount of communication between systems • Scale out introduces additional latency into your system • Scale out increases the availability of your system

34 References • Campbell, R. and Farivar, R. Cloud Computing Applications. • A Survey of Mobile Cloud Computing: Architecture, Applications, and Approaches. • Brinton, Christopher and Chiang, Mung (2013). Networks Illustrated: 8 Principles Without Calculus. Edwiser Scholastic Press, Kindle Edition (locations 1119-1123).

