1 Cloudera Certified Developer for Apache Hadoop (CCDH)

Presentation on theme: "1 Cloudera Certified Developer for Apache Hadoop (CCDH)"— Presentation transcript:

1 Cloudera Certified Developer for Apache Hadoop (CCDH)

2 Who We Are
Mission: To help organizations profit from their data
How We Do It: We deliver relevant products and services.
- A distribution of Apache Hadoop that is tested, certified and supported
- Comprehensive support and professional service offerings
- A suite of management software for Hadoop operations
- Training and certification programs for developers, administrators, managers and data scientists
Technical Team: Unmatched knowledge and experience.
- Founders, committers and contributors to Hadoop
- A wealth of experience in the design and delivery of production software
Credentials: The Apache Hadoop experts.
- Number 1 distribution of Apache Hadoop in the world
- Largest contributor to the open source Hadoop ecosystem
- More committers on staff than any other company
- More than 100 customers across a wide variety of industries
- Strong growth in revenue and new accounts
Leadership: Strong executive team with proven abilities.
- Mike Olson, CEO
- Kirk Dunn, COO
- Charles Zedlewski, VP, Product
- Mary Rorabaugh, CFO
- Jeff Hammerbacher, Chief Scientist
- Amr Awadallah, VP Engineering
- Doug Cutting, Chief Architect
- Omer Trajman, VP, Customer Solutions

3 Users of Cloudera
Financial, Web, Retail & Consumer, Media, Telecom
https://www.pass4sureexam.com/ccD-410.html

4 What is Apache Hadoop?
Hadoop is a platform for data storage and processing that is scalable, fault tolerant and open source.
Core Hadoop components:
- Hadoop Distributed File System (HDFS): file sharing and data protection across physical servers
- MapReduce: distributed computing across physical servers
Flexibility
- A single repository for storing, processing and analyzing any type of data
- Not bound by a single schema
Scalability
- Scale-out architecture divides workloads across multiple nodes
- Flexible file system eliminates ETL bottlenecks
Low Cost
- Can be deployed on commodity hardware
- Open source platform guards against vendor lock-in

5 What Makes Hadoop Different?
- Ability to scale out to petabytes in size using commodity hardware
- Processing (MapReduce) jobs are sent to the data, rather than shipping the data to be processed
- Hadoop doesn't impose a single data format, so it can easily handle structured, semi-structured and unstructured data
- Manages fault tolerance and data replication automatically

6 Why the Need for Hadoop?
1.8 trillion gigabytes of data were created in 2011:
- More than 90% is unstructured data
- Approx. 500 quadrillion files
- Quantity doubles every 2 years
[Chart: gigabytes of data created (in billions), 2005 to 2015, structured vs. unstructured data. Source: IDC 2011]

7 Hadoop Use Cases
Industries: Web, Media, Telco, Retail, Financial, Federal, Bioinformatics
Advanced analytics use cases: Social Network Analysis, Content Optimization, Network Analytics, Loyalty & Promotions Analysis, Fraud Analysis, Entity Analysis
Data processing use cases: Clickstream Sessionization, Mediation, Data Factory, Trade Reconciliation, SIGINT
Bioinformatics use cases: Genome Mapping, Sequencing Analysis

8 Hadoop in the Enterprise
[Diagram]
- Data sources: logs, files, web data, relational databases
- Connected systems: IDEs, BI / analytics, enterprise reporting, enterprise data warehouse, web application, management tools
- Users: operators, engineers, analysts, business users, customers

9 What is CDH?
Cloudera's Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is:
- 100% Apache open source
- Complete, with all components needed for deployment
- Fully documented and supported
- Released on a reliable schedule
Fastest Path to Success
- No need to write your own scripts or do integration testing on different components
- Works with a wide range of operating systems, hardware, databases and data warehouses
Stable and Reliable
- Extensive Cloudera QA systems, software & processes
- Tested & run in production at scale
- Proven at scale in dozens of enterprise environments
Community Driven
- Incorporates only main-line components from the Apache Hadoop ecosystem, with no forks or proprietary underpinnings
- Free

10 Cloudera's Commitment to the Open Source Community
Component   Cloudera Committers   Cloudera Founder   2011 Commits (rank)
Common      6                     Yes                #1
HDFS        6                     Yes                #2
MapReduce   5                     Yes                #1
HBase       2                     No                 #2
Zookeeper   1                     Yes                #2
Oozie       1                     Yes                #1
Pig         0                     No                 #3
Hive        1                     No                 #2
Sqoop       2                     Yes                #1
Flume       3                     Yes                #1
Hue         3                     Yes                #1
Snappy      2                     No                 #1
Bigtop      8                     Yes                #1
Avro        4                     Yes                #1
Whirr       2                     Yes                #1

11 Components of CDH
- Coordination: Apache ZooKeeper
- Data Integration: Apache Flume, Apache Sqoop
- Fast Read/Write Access: Apache HBase
- Languages / Compilers: Apache Pig, Apache Hive
- Workflow Scheduling: Apache Oozie
- File System Mount: FUSE-DFS
- User Interface: Hue
Also shown: Cloudera Enterprise

12 Hadoop Distributed File System
- Block Size = 64MB
- Replication Factor = 3
- Cost is $400-$500/TB
[Diagram: a file split into numbered blocks 1 to 5, with copies of each block spread across several HDFS nodes]
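
The block size and replication factor above determine how many blocks a file occupies and how much raw disk it consumes. The following is a minimal sketch of that arithmetic, not Hadoop code; the function name `hdfs_footprint` is hypothetical, and it assumes that a short final block only takes up its actual length on disk.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size=64 * MB, replication=3):
    """Return (blocks, block_replicas, raw_bytes) for a single file.

    Hypothetical helper for illustration; defaults mirror the slide
    (64MB blocks, replication factor 3).
    """
    # The file is cut into fixed-size blocks; the last one may be shorter.
    blocks = (file_size_bytes + block_size - 1) // block_size
    # Each block is stored `replication` times across the cluster.
    block_replicas = blocks * replication
    # Assumption: a short final block is not padded, so raw usage is
    # simply the file size multiplied by the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, block_replicas, raw_bytes


blocks, replicas, raw = hdfs_footprint(1024 * MB)
print(blocks, replicas, raw // MB)  # 16 48 3072
```

Under these assumptions a 1GB file becomes 16 blocks and 48 stored block copies, and consumes 3GB of raw capacity, which is why per-terabyte cost figures need to be read with the replication factor in mind.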

13 Components of Hadoop
NameNode: holds all metadata for HDFS
- Needs to be a highly reliable machine
  - RAID drives, typically RAID 10
  - Dual power supplies
  - Dual network cards, bonded
- The more memory the better, typically 36GB to 64GB
Secondary NameNode: provides checkpointing for the NameNode
- The same hardware as the NameNode should be used

14 Components of Hadoop
DataNodes: hardware will depend on the specific needs of the cluster
- No RAID needed; JBOD (just a bunch of disks) is used
- Typical ratio is 1 hard drive : 2 cores : 4GB of RAM

15 Networking
- One of the most important things to consider when setting up a Hadoop cluster
- Typically, top-of-rack switches are used with a core switch
- Be careful not to oversubscribe the backplane of the switch!

16 Map
- Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g., (filename, line).
- map() produces one or more intermediate values along with an output key from the input.
[Diagram: Map Tasks emit (key 1, values), (key 2, values), (key 3, values); the Shuffle Phase groups them, e.g. (key 1, int. values); the Reduce Task emits the final (key, values)]

17 Reduce
- After the map phase is over, all the intermediate values for a given output key are combined together into a list.
- reduce() combines those intermediate values into one or more final values for that same output key.
[Diagram: Map Tasks emit (key 1, values), (key 2, values), (key 3, values); the Shuffle Phase groups them, e.g. (key 1, int. values); the Reduce Task emits the final (key, values)]
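
The map, shuffle and reduce steps described on these two slides can be imitated in a few lines of ordinary Python. This is a single-process sketch for intuition only, not the Hadoop API; `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and word counting is chosen here simply as a familiar example.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: key is a file name (unused), value is one line of text."""
    for word in value.split():
        yield word.lower(), 1  # emit (output key, intermediate value)


def reduce_fn(key, values):
    """Reduce: fold all intermediate values for one key into a final value."""
    yield key, sum(values)


def run_job(records):
    # Map phase: every input record may emit any number of pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: collect all values that share an output key into a list.
    groups = defaultdict(list)
    for out_key, out_value in intermediate:
        groups[out_key].append(out_value)

    # Reduce phase: one reduce call per distinct key.
    result = {}
    for out_key in sorted(groups):
        for final_key, final_value in reduce_fn(out_key, groups[out_key]):
            result[final_key] = final_value
    return result


records = [
    ("a.txt", "the quick brown fox"),
    ("a.txt", "the lazy dog"),
    ("b.txt", "The end"),
]
print(run_job(records)["the"])  # 3
```

In a real cluster the map calls run on many machines next to the data and the shuffle moves pairs over the network; the sketch keeps only the data flow.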

18 MapReduce Execution

19 Sqoop
SQL to Hadoop
- Tool to import/export any JDBC-supported database into Hadoop
- Transfers data between Hadoop and external databases or EDW
- High-performance connectors for some RDBMSs
- Developed at Cloudera

20 Flume
Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
- Suited for gathering logs from multiple systems
- Inserts them into HDFS as they are generated
Design goals
- Reliability, Scalability, Manageability, Extensibility
Developed at Cloudera

21 Flume: high-level architecture
[Diagram components: Agents, Processor, Collector(s), Master]
- The Master sends configuration to all Agents
- Deployable, centrally administered
- Configurable levels of reliability; guarantees delivery in the event of failure
- Flexibly deploy decorators (compress, batch, encrypt) at any step to improve performance, reliability or security
- Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
- Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others)
- Parallelized writes across many collectors: as much write throughput as

22 HBase
Column-family store, based on the design of Google BigTable
- Provides interactive access to information
- Holds extremely large datasets (multi-TB)
- Constrained access model
  - (key, value) lookup
  - Limited transactions (only one row)

23 HBase

24 Hive
SQL-based data warehousing application
- Language is SQL-like
  - Supports SELECT, JOIN, GROUP BY, etc.
- Features for analyzing very large data sets
  - Partition columns, sampling, buckets
- Example:
    SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN ON (s.word = k.word) WHERE s.freq >= 5;
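
To see what a join of this shape returns, the same kind of query can be tried against an in-memory SQLite database from Python. This is only a sketch: SQLite stands in for Hive, the rows are toy values invented for the illustration, and because the example query on the slide does not name the table behind the `k` alias, the table name `other_corpus` is a hypothetical placeholder.

```python
import sqlite3

# SQLite stands in for Hive here; only the shape of the query carries over.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE shakespeare (word TEXT, freq INTEGER);
    -- Hypothetical second table; its real name is not given on the slide.
    CREATE TABLE other_corpus (word TEXT, freq INTEGER);
    -- Toy rows, invented for this illustration.
    INSERT INTO shakespeare VALUES ('the', 10), ('thou', 7), ('zounds', 3);
    INSERT INTO other_corpus VALUES ('the', 12), ('thou', 4), ('selah', 9);
    """
)

rows = conn.execute(
    """
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s JOIN other_corpus k ON (s.word = k.word)
    WHERE s.freq >= 5
    ORDER BY s.word
    """
).fetchall()

print(rows)  # [('the', 10, 12), ('thou', 7, 4)]
```

Only words present in both tables survive the join, and the WHERE clause then drops words whose frequency in the first table is below 5.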

25 Pig
Data-flow oriented language: "Pig Latin"
- Datatypes include sets, associative arrays, tuples
- High-level language for routing data; allows easy integration of Java for complex tasks
- Example:
    emps = LOAD 'people.txt' AS (id, name, salary);
    rich = FILTER emps BY salary > 100000;
    srtd = ORDER rich BY salary DESC;
    STORE srtd INTO 'rich_people.txt';
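
Each statement in the Pig example names one step of a data flow: load, filter, order, store. As a rough single-machine analogue, the same steps might look like this in plain Python. The helper names and sample rows are hypothetical, and the sketch assumes tab-separated input lines with an integer salary; the final STORE step is left out so the example has no file-system side effects.

```python
def load(lines):
    """LOAD: parse tab-separated (id, name, salary) records."""
    emps = []
    for line in lines:
        emp_id, name, salary = line.rstrip("\n").split("\t")
        emps.append((emp_id, name, int(salary)))
    return emps


def rich_sorted(emps, threshold=100000):
    """FILTER by salary, then ORDER by salary descending."""
    rich = [emp for emp in emps if emp[2] > threshold]
    return sorted(rich, key=lambda emp: emp[2], reverse=True)


# Toy input, invented for this illustration.
sample = ["1\tAda\t150000\n", "2\tBob\t90000\n", "3\tCy\t120000\n"]
srtd = rich_sorted(load(sample))
print(srtd)  # [('1', 'Ada', 150000), ('3', 'Cy', 120000)]
```

The difference in practice is that Pig compiles such a flow into MapReduce jobs that run across the cluster, while this version holds everything in one process's memory.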

26 Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Hadoop.

27 Zookeeper
Zookeeper is a distributed consensus engine
- Provides well-defined concurrent access semantics:
  - Leader election
  - Service discovery
  - Distributed locking / mutual exclusion
  - Message board / mailboxes

28 Pipes and Streaming
Multi-language connector libraries for MapReduce
- Write native-code MapReduce in C++ (Pipes)
- Write MapReduce passes in any scripting language (Streaming), including
  - Perl
  - Python
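
With Streaming, a map or reduce pass is just a program that reads lines on standard input and writes tab-separated key/value lines on standard output, with the reducer receiving its input sorted by key. A minimal word-count sketch in Python could look like the following; the function names, the `map`/`reduce` command-line argument and the `local_run` helper are assumptions made for this illustration rather than part of Hadoop.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word of input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word; input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def local_run(lines):
    """Imitate map -> sort -> reduce in one process, for quick local checks."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical convention: run as `script.py map` or `script.py reduce`.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

For example, `local_run(["a b a\n", "b a\n"])` returns `["a\t3", "b\t2"]`. On a cluster the same script would be supplied to the Hadoop Streaming jar through its `-mapper` and `-reducer` options, with the framework performing the sort between the two passes.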

29 FUSE-DFS
Allows mounting of HDFS volumes via the Linux FUSE file system
- Does allow easy integration with other systems for data import/export
- Does not imply that HDFS can be used as a general-purpose file system

30 Hadoop Security
- Authentication is secured by Kerberos v5 and integrated with LDAP
- Hadoop server can ensure that users and groups are who they say they are
- Job control includes Access Control Lists, which means jobs can specify who can view logs, counters and configurations, and who can modify a job
- Tasks now run as the user who launched the job

31 Cloudera Enterprise
Cloudera Enterprise makes open source Hadoop enterprise-easy
- Simplify and Accelerate Hadoop Deployment
- Reduce Adoption Costs and Risks
- Lower the Cost of Administration
- Increase the Transparency and Control of Hadoop
- Leverage the Experience of Our Experts
Effectiveness: Ensuring You Get Value From Your Hadoop Deployment
Efficiency: Enabling You to Affordably Run Hadoop in Production
Cloudera Enterprise components:
- Cloudera Manager: End-to-End Management Application for Apache Hadoop
- Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs

32 Cloudera Manager
Discover, Diagnose, Optimize, Act
Services shown: HDFS, MapReduce, HBase, Zookeeper, Oozie, Hue

33 Cloudera Enterprise Demo

34 Cloudera Enterprise
Including Cloudera Support
Feature: Benefit
- Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
- Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
- Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
- Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
- Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
- Notification of New Developments and Events: Stay up to speed with what's going on in the Apache Hadoop community

35 Cloudera University
Public and Private Training to Enable Your Success
Class: Description
- Developer Training & Certification (4 Days): Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
- System Administrator Training & Certification (3 Days): Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
- HBase Training (2 Days): Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
- Analyzing Data with Hive and Pig (2 Days): Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to utilize these languages for analysis of their data
- Essentials for Managers (1 Day): Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as "when is Hadoop appropriate?", "what are people using Hadoop for?" and "what do I need to know about choosing Hadoop?"

36 Cloudera Consulting Services
Put Our Expertise To Work For You.
Cloudera's team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges.
Service: Description
- Use Case Discovery: Assess the appropriateness and value of Hadoop for your organization
- New Hadoop Deployment: Set up and configure high performance, production-ready Hadoop clusters
- Proof of Concept: Verify the prototype functionality and project feasibility for a new Hadoop cluster
- Production Pilot: Deploy your first production-level project using Hadoop
- Process and Team Development: Define the requirements and processes for creating a new Hadoop team
- Hadoop Deployment Certification: Perform periodic health checks to certify and tune up existing Hadoop clusters

37 Journey of the Cloudera Customer
- Discover the Benefits of Apache Hadoop: Flexibility to store and mine all types of data
- Cloudera's Distribution: The fastest, surest path to success with Apache Hadoop
- Subscribe to Cloudera Enterprise: Simplify and accelerate Apache Hadoop deployment

38 Cloudera in Production
[Diagram]
- Data sources: logs, files, web data, relational databases
- Connected systems: IDEs, BI / analytics, enterprise reporting, enterprise data warehouse, operational rules engines, management tools, web application
- Users: operators, engineers, analysts, business users, customers
- Cloudera's Distribution Including Apache Hadoop (CDH) & SCM Express
- Cloudera Enterprise: Cloudera Management Suite, Cloudera Support
- Cloudera Services: Consulting Services, Cloudera University

39 Cloudera helps you profit from all your data.
Get Hadoop
- cloudera.com
- +1 (888) 789-1488
- sales@cloudera.com
- twitter.com/cloudera
- facebook.com/cloudera

40 Cloudera Manager

41 Cloudera Manager
- Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps.
- Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface
- Set server roles, configure services and manage security across the cluster
- Gracefully start, stop and restart services as needed
- Maintains a complete record of configuration changes for SOX compliance
- Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
- Gather, view and search Hadoop logs collected from across the cluster
- Scans Hadoop logs for irregularities and warns you before they impact the cluster
ONLY CLOUDERA

42 Cloudera Manager
- Establishes the time context globally for almost all views
- Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
- Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
- Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities, and makes them available for alerting and searching
- Generates email alerts when certain events occur
- Visualize current and historical disk usage by user, group and directory
- Track MapReduce activity on the cluster by job or user
- View information pertaining to hosts in your cluster, including status, resident memory, virtual memory and roles
ONLY CLOUDERA

43 Two Editions: Free Edition and Enterprise Edition**
- Max Number of Nodes Supported: 50 (Free Edition), Unlimited (Enterprise Edition)
Features compared:
- Automated Deployment
- Host-Level Monitoring
- Secure Communication Between Server & Agents
- Configuration Management
- Manage HDFS, MapReduce, HBase, Hue, Oozie & Zookeeper
- Audit Trails
- Start/Stop/Restart Services
- Add/Restart/Decommission Role Instances
- Configuration Versioning & History
- Support for Kerberos
- Service Monitoring
- Proactive Health Checks
- Status & Health Summary
- Intelligent Log Management
- Events Management & Alerts
- Activity Monitoring
- Operational Reporting
- Global Time Control
- Support Integration
** Part of the Cloudera Enterprise subscription

44 View Service Health and Performance

45 Get Host-Level Snapshots

46 Monitor and Diagnose Cluster Workloads

47 Gather, View and Search Hadoop Logs

48 Track Events From Across the Cluster

49 Run Reports on System Performance & Usage

50 New in Cloudera Manager 3.7
- Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
- Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
- Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
- Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
- Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities, and makes them available for alerting and searching
- Generates email alerts when certain events occur
- Maintains a complete record of configuration changes for SOX compliance
- Visualize current and historical disk usage by user, group and directory, and track MapReduce activity on the cluster by job or user
ONLY CLOUDERA

51 Cloudera Support
Feature: Benefit
- Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
- Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
- Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
- Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
- Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy
- Proactive Notification of New Developments and Events: Stay up to speed with what's going on in the Apache Hadoop community

52 Cloudera Enterprise
The Fastest Path to Success Running Apache Hadoop in Production.
Why Cloudera Enterprise?
- Apache Hadoop is a distributed system that presents unique operational challenges
- The fixed cost of managing an internal patch and release infrastructure is prohibitive
- Apache Hadoop skills and expertise are scarce
- It's challenging to track consistently to community development efforts
Only Cloudera Enterprise:
- Management application: has a management application that supports the full lifecycle of operationalizing Apache Hadoop
- Production support: has production support backed by the Apache committers
- Depth of experience: has the depth of experience supporting hundreds of production Apache Hadoop clusters

53 Hadoop Distributed File System
- Block Size = 64MB
- Replication Factor = 3
- Cost is $400-$500/TB

54 MapReduce: Distributed Processing

55 Thank you.

