Cloud Strategy II B. Ramamurthy 7/11/2014

1 Cloud Strategy II B. Ramamurthy 7/11/2014
Rich's Big Data Analytics Training 7/11/2014

2 Tableau vs Qlikview vs Spotfire
Source not available: this is from a cached page. R integration is currently the direction taken by all three.

3 The Context: Big-data
Mining the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
We are in a knowledge economy:
Data is an important asset to any organization.
Discovery of knowledge; enabling discovery; annotation of data.
Complex computational models.
No single environment is good enough: we need elastic, on-demand capacities.
We are also looking at newer programming models, and at the supporting algorithms and data structures.
We need a rapid prototyping environment for learning these.

4 The Context: Big-data Strategies
What are your strategies for tapping into emerging technologies?
Exploring emerging technologies before investing in them: make informed decisions.
Training your workforce.
The cloud infrastructure: including the cloud, and computing on the cloud, in your environment has become indispensable for keeping up with emerging technologies.
Newer methods and algorithms: maintain your competitive edge by adopting newer approaches to data analytics and visualization, such as R, JS libraries, and the APIs provided by social media.

5 Outline for Today
Amazon cloud: hands-on exercises (take 2)
Google App Engine: concept
Google App Engine: hands-on exercises
Map-reduce (MR) algorithm
Hadoop infrastructure
MR on Amazon Web Services
Moving forward
Summary

6 Cloud Model & Enabling Technologies
[Diagram labels:]
Cloud applications: data-intensive, compute-intensive, storage-intensive
Services interface: web services, SOA, WS standards
Virtual machines: VM0, VM1, ..., VMn
Storage models: S3, BigTable, BlobStore, ...
Virtualization: bare metal, hypervisor, ...
Multi-core architectures; 64-bit processor
Bandwidth

7 Common Features of Cloud Providers
Development environment: IDE, SDK, plugins.
Production environment.
Simple storage.
Table store <key, value>.
Drives.
Accessible through web services.
Management console, monitoring tools, and multi-level security.

8 Public Cloud vs. Private Cloud
Rationale for a private cloud:
Security and privacy of business data was a big concern.
Potential for vendor lock-in.
Service Level Agreements (SLAs) are required for real-time performance and reliability.
The cost savings of the shared model are achieved because the company is actively developing multiple projects.

9 Cloud Computing for the Enterprise: What Should IT Do?
Revise the cost model to utility-based computing: CPU/hour, GB/day, etc.
Include hidden costs for management and training.
Evaluate different cloud models for different applications.
Use the cloud for prototyping applications, and learn from it.
Link it to current strategic plans for Services-Oriented Architecture, Disaster Recovery, etc.
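A utility-based cost model of the kind described above can be sketched in a few lines. This is only an illustration: the unit prices, the flat overhead for management and training, and the names `UtilityRates` and `monthly_cost` are all hypothetical and do not reflect any provider's actual pricing.

```python
from dataclasses import dataclass


@dataclass
class UtilityRates:
    """Hypothetical unit prices; real provider rates will differ."""
    cpu_hour: float        # cost per CPU-hour
    gb_day: float          # cost per GB stored per day
    hidden_monthly: float  # assumed flat overhead for management and training


def monthly_cost(rates, cpu_hours, stored_gb, days=30):
    """Estimate one month's bill from metered usage plus hidden costs."""
    compute = rates.cpu_hour * cpu_hours
    storage = rates.gb_day * stored_gb * days
    return compute + storage + rates.hidden_monthly


# Example: one CPU running all month (720 hours) and 500 GB kept for 30 days.
rates = UtilityRates(cpu_hour=0.10, gb_day=0.005, hidden_monthly=200.0)
estimate = monthly_cost(rates, cpu_hours=720, stored_gb=500)
print(f"Estimated monthly cost: {estimate:.2f}")
```

Keeping the hidden costs as an explicit term makes it harder to compare cloud and in-house options on metered usage alone.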

10 Cloud Infrastructure: Essential component of today’s IT
Running legacy applications: 32-bit, obsolete languages, older versions of software.
Launching emerging applications when you don’t have the infrastructure for a cluster or large machines: a map-reduce cluster, a social media data collection cluster.
Prototyping a setup before investing in it.
Load balancing: address a sudden surge in traffic. Use it when there is wide variability in traffic.
Temporary setups, such as for conferences, meetings, and tournaments (examples: US Open, FIFA in South Africa, Ebola camp).
Establishing IT in places where there is no infrastructure (South Pole, Amazon jungle).
Rapid prototyping: quickly get something going to address an emergency or for disaster mitigation.
Plain and simple: run your business on the cloud. Netflix is a success story.

11 Amazon Web Services (AWS)
Review material from Session 4: Cloud Strategy I

12 Getting Started with AWS
Starting point is the excellent documentation in:
Go through this document before you launch anything on AWS.
We will use an Amazon Machine Image (AMI) to launch and connect to a Windows machine/instance.
We will use a Linux AMI to launch a Linux machine and work with it.
There are many other applications, such as data workflows, Data Pipeline, Elastic MapReduce, etc.
We will also deploy a map-reduce application (wordcount) on AWS.

13 Google App Engine

14 What is Google App Engine?
Google App Engine is a Platform as a Service (PaaS). It lets you build and run applications on Google’s infrastructure.
App Engine applications are easy to build, easy to maintain, and easy to scale as your traffic and data storage needs change.
With App Engine, there are no servers for you to maintain. You simply upload your application and it is ready to go.

15 GAE Features
App Engine makes it easy to build and deploy an application that runs reliably even under heavy load and with large amounts of data. It includes the following features:
Persistent storage with queries, sorting, and transactions.
Automatic scaling and load balancing.
Asynchronous task queues for performing work outside the scope of a request.
Scheduled tasks for triggering events at specified times or regular intervals.
Integration with other Google cloud services and APIs.
Applications run in a secure, sandboxed environment, allowing App Engine to distribute requests across multiple servers and to scale servers to meet traffic demands.

16 How to deploy an application?
We will work from the Eclipse environment we have already downloaded.
You need to download the Google App Engine plugin and use that to deploy the application.
You need a Google App Engine account before you can deploy anything. Please set one up before we proceed.
We will deploy two applications we developed earlier: Hangman, and the three.js application that shows a cube and a sphere.

17 GAE capacities
All applications can use up to 1 GB of storage and enough CPU and bandwidth to support an efficient app serving around 5 million page views a month, absolutely free.
It runs your web applications on Google's infrastructure.
Google App Engine is a fully integrated development environment.
You can serve your applications from your own domain, or you can use the App Engine domain for free.
You can use server-side languages such as PHP, Java, Python, and Go.
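To put the "5 million page views a month" figure in perspective, the short calculation below converts it into an average request rate. It assumes a 30-day month and perfectly even traffic, which real sites do not have, so treat the result as a rough lower bound on the peak load.

```python
# Assumption: a 30-day month with traffic spread evenly over every second.
PAGE_VIEWS_PER_MONTH = 5_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600

average_per_second = PAGE_VIEWS_PER_MONTH / SECONDS_PER_MONTH
print(f"Average load: {average_per_second:.2f} page views per second")
```

The average comes out at roughly two page views per second.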

18 Monitor Your App

19 Summarizing Google App Engine
Among the cloud offerings, Google App Engine has the least expensive model for cloud deployment.
The learning curve for the deployment process is not that steep. Moreover, Eclipse has a plugin to simplify the process.
But all the infrastructure is hidden from you: you access it only through the API and the services offered by GAE.

20 Hadoop-MapReduce

21 Google File System
The Internet introduced a new challenge in the form of web logs and web crawler data: large scale, “peta scale”.
But observe that this type of data has a characteristic uniquely different from your transactional or “customer order” data: “write once read many (WORM)”.
Privacy-protected healthcare and patient information.
Historical financial data.
Other historical data.
Transactional data from your sales.
Manufacturing data for quality control.
Google exploited this WORM characteristic in its Google File System (GFS) for running massively parallel processes/programs.

22 What is Hadoop?
At Google, MapReduce operations are run on a special file system called Google File System (GFS) that is highly optimized for this purpose. However, GFS is not open source.
Doug Cutting and others, with much of the work done at Yahoo!, built an open-source counterpart modeled on Google's published description of GFS and called it the Hadoop Distributed File System (HDFS).
The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
It is now open source and distributed by Apache.

23 Basic Features: HDFS
Highly fault-tolerant.
High throughput.
Suitable for applications with large data sets.
Streaming access to file system data.
Can be built out of commodity hardware.
HDFS provides a Java API for applications to use. It also provides a streaming API for other languages.
An HTTP browser can be used to browse the files of an HDFS instance.

24 Fault tolerance
Failure is the norm rather than the exception in a large network.
An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data.
With a huge number of components, each with a non-trivial probability of failure, some component is almost always non-functional.
Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
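The claim that some component is nearly always down can be checked with a short calculation. If each of n components fails independently with probability p during some time window, the chance that at least one has failed is 1 - (1 - p)^n. The independence assumption, the value of p, and the function name below are illustrative choices, not figures from the slides.

```python
def prob_any_failure(n_components, p_fail):
    """Probability that at least one of n independent components has failed."""
    return 1.0 - (1.0 - p_fail) ** n_components


# Assumed per-component failure probability of 0.1% in a given time window.
for n in (10, 100, 1000, 10000):
    print(f"{n:>6} components: {prob_any_failure(n, 0.001):.3f}")
```

Even with a small per-component failure probability, the result approaches 1 as the cluster grows to thousands of machines, which is why HDFS treats recovery as a design goal and not as an afterthought.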

25 HDFS Architecture
[Diagram: clients send metadata ops to the Namenode, which holds the metadata (name, replicas, ...; e.g. /home/foo/data, 6, ...), and read and write blocks on the Datanodes. The Namenode sends block ops to the Datanodes, which are spread over Rack1 and Rack2, and blocks are replicated between them.]

26 Hadoop Distributed File System
[Diagram labels: Application; HDFS Client; local file system (block size: 2K); HDFS Server; master node; Name Nodes; HDFS blocks (block size: 128M, replicated).]
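The effect of the 128M block size and of replication can be illustrated with a small calculation. The sketch below assumes a replication factor of 3, which is the usual HDFS default but is not stated on the slide; the function name `block_layout` is hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the block size shown on the slide


def block_layout(file_size_bytes, replication=3):
    """Return (number of blocks, raw bytes stored across all replicas).

    Assumes the last block occupies only the bytes it actually holds,
    so the raw storage is simply file size times the replication factor.
    """
    n_blocks = -(-file_size_bytes // BLOCK_SIZE)  # integer ceiling division
    raw_bytes = file_size_bytes * replication
    return n_blocks, raw_bytes


one_gib = 1024 ** 3
blocks, raw = block_layout(one_gib)
print(f"1 GiB file: {blocks} blocks, {raw / one_gib:.0f} GiB of raw storage")
```

Large blocks keep the amount of metadata the name node has to track small, while replication trades extra disk space for fault tolerance.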

27 What is MapReduce?
MapReduce is a programming model that Google has used successfully in processing its “big-data” sets (~ petabytes per day).
A map function extracts some intelligence from raw data.
A reduce function aggregates, according to some guides, the data output by the map.
Users specify the computation in terms of a map and a reduce function.
The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
The underlying system also handles machine failures, efficient communications, and performance issues.
Reference: Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008).

28 Classes of problems “mapreducable”
Benchmark for comparing: Jim Gray’s challenge on data-intensive computing. Ex: “Sort”.
Google uses it for wordcount, AdWords, PageRank, and indexing data.
Simple algorithms such as grep, text-indexing, reverse indexing.
Bayesian classification: data mining domain.
Any number of applications involving data mining and machine learning.
Facebook uses it for various operations: demographics.
Financial services use it for analytics.
Astronomy: Gaussian analysis for locating extra-terrestrial objects.
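As an illustration of one of the simpler problems in this list, reverse (inverted) indexing can be phrased as a map step and a reduce step. The sketch below runs in a single Python process, with a dictionary standing in for the shuffle between the two phases; the function names and the two sample documents are made up for the example and are not Hadoop API calls.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit a <word, doc_id> pair for each distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_postings(word, doc_ids):
    """Reduce: turn all doc ids seen for a word into a sorted posting list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(docs):
    """Run map, group by key (the shuffle), then reduce, all in memory."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_doc(doc_id, text):
            grouped[word].append(d)
    return dict(reduce_postings(w, ids) for w, ids in grouped.items())


docs = {"d1": "this is a cat", "d2": "cat sits on a roof"}
index = build_inverted_index(docs)
print(index["cat"])   # documents that contain "cat"
print(index["roof"])  # documents that contain "roof"
```

The same shape (emit keyed pairs, group by key, aggregate per key) is what makes grep, word count, and text indexing fit the model.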

29 MapReduce Example: Mapper
Input: This is a cat. Cat sits on a roof.
Mapper output: <this 1> <is 1> <a <1,1>> <cat <1,1>> <sits 1> <on 1> <roof 1>
Input: The roof is a tin roof. There is a tin can on the roof.
Mapper output: <the <1,1>> <roof <1,1,1>> <is <1,1>> <a <1,1>> <tin <1,1>> <there 1> <can 1> <on 1>
Input: Cat kicks the can. It rolls on the roof and falls on the next roof.
Mapper output: <cat 1> <kicks 1> <the <1,1,1>> <can 1> <it 1> <rolls 1> <on <1,1>> <roof <1,1>> <and 1> <falls 1> <next 1>
Input: The cat rolls too. It sits on the can.
Mapper output: <the <1,1>> <cat 1> <rolls 1> <too 1> <it 1> <sits 1> <on 1> <can 1>

30 MapReduce Example: Combiner, Reducer
Mapper outputs:
<this 1> <is 1> <a <1,1>> <cat <1,1>> <sits 1> <on 1> <roof 1>
<the <1,1>> <roof <1,1,1>> <is <1,1>> <a <1,1>> <tin <1,1>> <there 1> <can 1> <on 1>
<cat 1> <kicks 1> <the <1,1,1>> <can 1> <it 1> <rolls 1> <on <1,1>> <roof <1,1>> <and 1> <falls 1> <next 1>
<the <1,1>> <cat 1> <rolls 1> <too 1> <it 1> <sits 1> <on 1> <can 1>
Combine the counts of all the same words: <cat <1,1,1,1>> <roof <1,1,1,1,1,1>> <can <1,1,1>>
Reduce (sum in this case) the counts: <cat 4> <can 3> <roof 6>
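The map, combine, and reduce steps of this example can be reproduced in a few lines of plain Python. This is a single-process sketch of the idea, not Hadoop code: the function names are illustrative, and the combine step here simply groups the 1s per word as the slide does.

```python
from collections import defaultdict

# The four input splits from the cat-and-roof example.
SPLITS = [
    "This is a cat Cat sits on a roof",
    "The roof is a tin roof There is a tin can on the roof",
    "Cat kicks the can It rolls on the roof and falls on the next roof",
    "The cat rolls too It sits on the can",
]


def map_split(split):
    """Mapper: emit a <word, 1> pair for every word in one split."""
    return [(word.lower(), 1) for word in split.split()]


def combine(all_pairs):
    """Combine: collect the 1s emitted for the same word into one list."""
    grouped = defaultdict(list)
    for word, one in all_pairs:
        grouped[word].append(one)
    return grouped


def reduce_counts(grouped):
    """Reducer: sum the list of counts for each word."""
    return {word: sum(ones) for word, ones in grouped.items()}


pairs = [pair for split in SPLITS for pair in map_split(split)]
counts = reduce_counts(combine(pairs))
print(counts["cat"], counts["can"], counts["roof"])  # prints: 4 3 6
```

On a real cluster each split would be mapped on a different machine, and the grouping would happen during the shuffle between the map and reduce phases.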

31 Large scale data splits
[Diagram: large-scale data splits feed map tasks (parse-hash) that emit <key, 1> <key, value> pairs; reducers (say, Count) produce the output partitions P-0000 (count1), P-0001 (count2), and P-0002 (count3).]
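The "parse-hash" step in this picture decides which reducer receives each key. A minimal sketch of that idea follows, assuming three reducers and a CRC32-based hash; the hash choice and the function names are illustrative assumptions, and Hadoop's own default partitioner works on Java hash codes instead.

```python
import zlib
from collections import defaultdict


def partition(key, n_reducers):
    """Pick a reducer for a key with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers


def run_reducers(pairs, n_reducers=3):
    """Route each <key, value> pair to one reducer and count per key there."""
    buckets = [defaultdict(int) for _ in range(n_reducers)]
    for key, value in pairs:
        buckets[partition(key, n_reducers)][key] += value
    # Name the outputs in the style of the diagram: P-0000, P-0001, ...
    return {f"P-{i:04d}": dict(bucket) for i, bucket in enumerate(buckets)}


pairs = [("cat", 1), ("roof", 1), ("cat", 1), ("can", 1)]
for name, part in run_reducers(pairs).items():
    print(name, part)
```

Because the partition depends only on the key, every occurrence of a word reaches the same reducer, so each reducer can produce final counts for its keys without talking to the others.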

32 Putting it all together
Starting point: a question. What do you want to know? Don’t try to fit a question into a technology.
Exploratory data analysis: use R to do EDA.
Explore on the cloud: for emerging technologies and for legacy technologies.
Visualize: JS and JS libraries (three.js, d3.js).
Analyze, present & make decisions: Gephi, Tableau/Qlikview.
Question answered?

33 The Data Science Process (Review)
[Diagram: the data science process with stages numbered 1 to 7. Labels: Raw data collected; Data is processed; Data is cleaned; Exploratory data analysis; Machine learning algorithms, statistical models; Build data products; Communication, visualization, report findings; Make decisions; Micro-level data strategy.]

34 Big Data Training

35 Summary
We illustrated cloud concepts and demonstrated cloud capabilities through simple applications.
We discussed the features of the Hadoop file system and MapReduce for handling big-data sets.
We also explored some real business issues in the adoption of the cloud.
Cloud is indeed an impactful technology that is sure to transform computing in business.

36 References & useful links
Amazon AWS:
Google App Engine (GAE):

