Big Data Tools: Hadoop. S. S. Mulay, Sr. V.P. Engineering. February 1, 2013.


1 Big Data Tools: Hadoop. S. S. Mulay, Sr. V.P. Engineering. February 1, 2013

2 Hadoop - A Prelude

3 Apache Projects and Animal-Friendly Names. Some of the projects under the Apache Foundation: Apache ZooKeeper, Apache Tomcat, Apache Pig.

4 And now, Hadoop

5 Hadoop – The Name

6 Hadoop – The Relevance
Two important things to know when discussing Big Data:
● MapReduce
● Hadoop

7 Hadoop – How Was It Born?
● To process huge volumes of data (Big Data), as the amount of generated data continued to increase rapidly.
● The Web was also generating more and more information, and indexing that content was becoming quite a challenge.

8 Hadoop – Reality vs. Myth
Hadoop is not a direct replacement for enterprise data warehouses, data marts and other data stores that are commonly used to manage structured or transactional data. It is used to augment enterprise data architectures by providing an efficient and cost-effective means for storing, processing, managing and analyzing the ever-increasing volumes of semi-structured or unstructured data. Hadoop is useful across virtually every vertical industry.

9 Hadoop – Some Use Cases
● Digital marketing automation
● Log analysis and event correlation
● Fraud detection and prevention
● Predictive modeling for new drugs
● Social network and relationship analysis
● ETL (Extract, Transform, Load) on unstructured data
● Image correlation and analysis
● Collaborative filtering

10 Hadoop – What Do We Expect from It?
If we analyze the use cases above, we realize that:
● The data arrives in varied formats and from varied sources.
● There is a need to handle an incoming stream of data and process it, sometimes in real time.
● Need for a connector to the existing RDBMS.
● Need for a distributed file system.
● Need for data-warehousing capability over and above the processed data.
● Need for a "map only" capability to perform image matching and correlation.
● Need for a scalable database.
● Growing need for a GUI to operate and develop applications for Hadoop.
● Need for a framework for parallel computation.
● Need for a distributed computing environment.
● Need for machine learning and data mining.
● Almost all of the workloads need a way to manage data-processing jobs.

11 Hadoop – Components Which Come to the Rescue
● HDFS – distributed file system
● MapReduce – distributed processing of large data sets
● ZooKeeper – coordination service for distributed applications
● HBase – scalable distributed database; supports structured data
● Avro – data serialization system
● Sqoop – connector to structured databases
● Chukwa – monitoring of large distributed systems
● Flume – moves large volumes of data efficiently
● Hue – GUI to operate and develop Hadoop applications
● Hive – data warehousing framework
● Pig – framework for parallel computation
● Oozie – workflow service to manage data-processing jobs
● Mahout – machine learning and data mining
● And many more…
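As a small, hedged illustration of the HDFS component listed above, here is a minimal Java sketch (not from the slides) that writes a file into the distributed file system through Hadoop's FileSystem API; the NameNode address and the output path are hypothetical placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Point the client at the cluster; "namenode-host" is a placeholder.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
            FileSystem fs = FileSystem.get(conf);

            // Create (or overwrite) a small file in HDFS; the path is hypothetical.
            Path out = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream stream = fs.create(out, true)) {
                stream.writeUTF("Hello from the HDFS Java API");
            }
            System.out.println("File exists in HDFS: " + fs.exists(out));
        }
    }

The same FileSystem layer is what higher-level components such as MapReduce, Hive and Pig use underneath to read and write their data.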

12 Hadoop – Who's Using It?
● Uses Hadoop and HBase for: social services, structured data storage, and processing for internal use.
● Uses Hadoop for: Amazon's product search indices; millions of sessions are processed daily for analytics.
● Uses Hadoop for: search optimization and research.
● Uses Hadoop for: databasing and analyzing Next Generation Sequencing (NGS) data produced for The Cancer Genome Atlas (TCGA) project and other groups.
● Uses Hadoop for: internal log reporting/parsing systems designed to scale to infinity and beyond; a web-wide analytics platform.
● Uses Hadoop as a source for reporting/analytics and machine learning.
● And many more…

13 Hadoop – The Various Forms Today
● Apache Hadoop – native Hadoop distribution from the Apache Foundation
● Yahoo! Hadoop – Hadoop distribution from Yahoo
● CDH – Hadoop distribution from Cloudera
● Greenplum Hadoop – Hadoop distribution from EMC
● HDP – Hadoop platform from Hortonworks
● M3 / M5 / M7 – Hadoop distributions from MapR
● Project Serengeti – VMware's implementation of Hadoop on vCenter
● And more…

14 Hadoop – Use Case Example – Log Processing
● Some practical use cases for log processing in general use today. Assume we have huge logs, generated over a period of time and running into terabytes, and we want to know:
● Analytics – application / web site performance
● Reporting – page views, user sessions
● Event detection and correlation
● Page views / user sessions, weekly / monthly
● Users and their behavioral patterns
● Investigate an IP and its behavioral pattern

15 Hadoop – Use Case Example – Log Processing
In the conventional method, parallelism is on a per-file basis, not within a single file (a plain Java sketch of this approach follows below):
● Log file 1 → Task 1: grep [pattern] | awk
● Log file 2 → Task 2: grep [pattern] | awk
● Log file n → Task n: grep [pattern] | awk
● Task new → concatenate the data sets → final data set
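A minimal sketch of this conventional, per-file approach in plain Java: one task per whole log file, with the per-file results concatenated at the end. The file names and the search pattern are assumptions for illustration.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class PerFileGrep {
        public static void main(String[] args) throws Exception {
            // Hypothetical inputs: parallelism is per file, never within a single file.
            List<Path> logs = List.of(Path.of("log-1.txt"), Path.of("log-2.txt"), Path.of("log-n.txt"));
            Pattern pattern = Pattern.compile("ERROR");   // the grep [pattern]

            ExecutorService pool = Executors.newFixedThreadPool(logs.size());
            List<Future<List<String>>> tasks = new ArrayList<>();
            for (Path log : logs) {
                tasks.add(pool.submit(() ->
                        Files.lines(log)                               // one task reads one whole file
                             .filter(line -> pattern.matcher(line).find())
                             .collect(Collectors.toList())));
            }

            // "Task new": concatenate the per-file data sets into the final data set.
            List<String> finalDataSet = new ArrayList<>();
            for (Future<List<String>> task : tasks) {
                finalDataSet.addAll(task.get());
            }
            pool.shutdown();
            System.out.println("Matching lines: " + finalDataSet.size());
        }
    }

The largest single file still limits the overall run time here, which is exactly the bottleneck the MapReduce version on the next slide removes.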

16 Hadoop – Use Case Example – Log Processing
With MapReduce, a single log file is split into chunks and the chunks are processed in parallel (a sketch of such a job follows below):
● Log file 1, chunk 1 → Task 1: grep [pattern] | awk
● Log file 1, chunk 2 → Task 2: grep [pattern] | awk
● Log file 1, chunk n → Task n: grep [pattern] | awk
● The outputs are combined into the resultant data set
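As a hedged sketch of how the same grep could look as an actual Hadoop job, the map-only program below filters matching lines out of every chunk (input split) of the log in parallel; the pattern and the input/output paths are assumptions, not values from the slide.

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DistributedGrep {

        // Each map task receives one chunk (input split) of the log and
        // emits only the lines that match the configured pattern.
        public static class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            private Pattern pattern;

            @Override
            protected void setup(Context context) {
                pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", "ERROR"));
            }

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                if (pattern.matcher(line.toString()).find()) {
                    context.write(line, NullWritable.get());
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("grep.pattern", "ERROR");              // search pattern (assumption)

            Job job = Job.getInstance(conf, "distributed grep");
            job.setJarByClass(DistributedGrep.class);
            job.setMapperClass(GrepMapper.class);
            job.setNumReduceTasks(0);                       // map-only job, like the slide's grep step
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. the huge log in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework splits the input into chunks automatically, so adding nodes simply means more chunks are filtered at the same time.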

17 Hadoop – Use Case Example – Log Processing
● Infrastructure realities in the conventional method:
● 1 server with a 1 Gbps NIC can copy a 100 GB file in about 14 minutes.
● 1 server with 1 disk can typically copy a 100 GB file in about 20 to 25 minutes.
● How things change with MapReduce:
● The network bottleneck is eliminated, since multiple servers with 1 Gbps NICs each read the same 100 GB of data in smaller chunks.
● The disk bottleneck is eliminated, since each individual server has multiple disks and underlying RAID to improve disk performance.
● Assuming a single disk can transfer data at 75 MB/s, and a Hadoop cluster of 4,000 nodes with 6 disks per server, the overall throughput of the setup would be 6 * 75 MB/s * 4,000 = approximately 1.8 TB/s. As a result, reading 1 PB of data would take approximately 10 minutes.
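The slide's back-of-envelope arithmetic can be reproduced in a few lines of Java; the figures below are the slide's assumptions (75 MB/s per disk, 6 disks per node, 4,000 nodes), not measurements.

    public class ThroughputEstimate {
        public static void main(String[] args) {
            double diskMBps = 75;          // single-disk transfer rate, MB/s (assumption from the slide)
            int disksPerNode = 6;
            int nodes = 4000;

            double clusterMBps = diskMBps * disksPerNode * nodes;   // 1,800,000 MB/s
            double clusterTBps = clusterMBps / 1_000_000;           // ~1.8 TB/s aggregate

            double petabyteInMB = 1_000_000_000.0;                  // 1 PB in MB (decimal units)
            double minutes = petabyteInMB / clusterMBps / 60;       // roughly 9-10 minutes to read 1 PB
            System.out.printf("Aggregate throughput: %.1f TB/s; 1 PB read in ~%.0f minutes%n",
                    clusterTBps, minutes);
        }
    }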

18 Hadoop – Big Data Integration Challenges
● Technology / tools. A successful big data initiative requires acquiring, integrating and managing several big data technologies such as Hadoop, MapReduce, NoSQL databases, Pig, Sqoop, Hive, Oozie and others. Conventional data management tools fail when trying to integrate, search and analyze big data sets, which range from terabytes to multiple petabytes of information.
● People. As with any new technology, staff need to be trained in big data technologies to learn the proper skills and best practices. The two biggest challenges are finding in-house expertise and allocating sufficient budget, time and resources.
● Processes. Being a niche area, few documented procedures and processes are available, and requirements change depending on the application use case.

19 Hadoop – Native Solutions & Challenges
● In-depth knowledge of the various components and their dependencies is required.
● Configuration and implementation need specific skills, not only to implement but also to manage.
● Data scientists depend on the backend programming team.
● Any version upgrade needs to be tested thoroughly before the current setup is upgraded.
● Support is available only through the community, which can cause issues for an enterprise implementing Hadoop.
● Any integration, and problems arising out of it, can become a show stopper.

20 Hadoop – Advantages of Commercial Solutions
● Comes fully integrated as a package, and documented.
● Implementation is a straightforward activity.
● Comes with a configuration manager that helps set up the infrastructure quickly.
● Gives a great, easy connection to enterprise applications / architecture.
● Some come with GUI capabilities that eliminate most of the programming requirements, putting control in the hands of the data scientists themselves.
● Comes with many add-on capabilities, including a GUI for management.
● Most of these commercial editions work closely with the Apache Foundation and hence remain compatible.
● Pre-tested, so package dependencies, version changes and the like are assured with the distribution.

21 Hadoop – Commercial Solutions for Hadoop
The solutions fit into two categories:
● Infrastructure automation – e.g. Cloudera, Hortonworks
● Application automation – e.g. Karmasphere Studio, Talend, Pentaho
These are just some of them.

22 Gartner Report – Magic Quadrant for Data Integration Tools

23 Hadoop & Cloud – Hand in Hand?
What advantages does the cloud bring in?
● Reduced physical infrastructure
● Quick deployment using cloud cloning / templates
● Elasticity – auto-scaling capabilities of the cloud to spawn / de-spawn instances as and when required
Hadoop on the cloud therefore brings these advantages to enterprises. All the commercial distributions available today offer a virtual-image option for deployment on a cloud / virtualization platform. Virtualization solution providers like VMware have come up with Project "Serengeti" to support quick deployment and management of Hadoop on the cloud, and cloud service providers like Amazon, Netmagic and others offer a Hadoop-infrastructure deployment option on the cloud.

24 Contact Details. For related queries / feedback, mail to ssmulay@netmagicsolutions.com, +91-9820453568

25 Thank You

26 http://www.linkedin.com/companies/netmagic http://twitter.com/netmagic http://www.facebook.com/NetmagicSolutions http://www.youtube.com/user/netmagicsolutions

