
1 Hadoop Clusters Tess Fulkerson

2 What is Hadoop? Hadoop is an open-source Apache project focused on big-data management. It is named after the toy elephant of creator Doug Cutting's son. It is a Java-based framework that processes data in parallel rather than serially. Commonly supported operating systems:
- Red Hat Enterprise Linux
- CentOS
- Oracle Linux
- Ubuntu
- SUSE Linux Enterprise Server

3 Where did it come from? 90% of the world's data was generated in the last two years, coming from smartphones, social networks, and trading platforms. When a file is bigger than one machine can hold, Hadoop still allows it to be stored: HDFS (the Hadoop Distributed File System) handles large files, and many files, by distributing the data over multiple nodes.

4 HDFS (Hadoop Distributed File System)
Provides scalable and reliable data storage. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
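As a concrete illustration, here is a minimal sketch of writing a file to HDFS through Hadoop's standard Java filesystem API. The NameNode address (namenode-host) and the target path are placeholders, not values from the slides; in a real cluster, fs.defaultFS would come from core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; normally read from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

            FileSystem fs = FileSystem.get(conf);

            // HDFS splits large files into blocks and replicates each block
            // across DataNodes (three copies by default) for fault tolerance.
            Path path = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("Hello, HDFS!");
            }
            fs.close();
        }
    }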

6 MapReduce Big data grew too large to compute on efficiently, so a team at Google developed an algorithm that chops a large calculation into smaller chunks and maps them onto many computers (nodes); when the pieces finish, the results are brought back together (reduced) into the final data set. This algorithm was given the name MapReduce. Because MapReduce works in parallel rather than serially, it increases efficiency and saves time. The canonical example, counting words, is sketched below.
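The sketch below is the classic word-count job against Hadoop's Java MapReduce API: mappers emit a (word, 1) pair for every word in their chunk of the input, and reducers sum the pairs for each word. Input and output paths are taken from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: each node scans its local chunk of the input
        // and emits (word, 1) for every token it sees.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: the framework groups the pairs by word and
        // brings them back together; the reducer sums each group.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }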

7 Programming in Parallel vs. Serial
Parallel: the processing is broken up into parts, and the instructions for each part run simultaneously on different CPUs. These CPUs can exist on a single machine, or they can be CPUs in a set of computers connected via a network. Serial: the processing executes one step after another, with the instructions running from start to finish on a single processor.
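A tiny Java illustration of the same distinction on a single machine; parallel streams are just one convenient way to fan work out across CPU cores, and the variable names are illustrative.

    import java.util.stream.LongStream;

    public class ParallelVsSerial {
        public static void main(String[] args) {
            long n = 100_000_000L;

            // Serial: one thread walks the whole range from start to finish.
            long serialSum = LongStream.rangeClosed(1, n).sum();

            // Parallel: the range is split into parts that run simultaneously
            // on the available cores, and the partial sums are then combined.
            long parallelSum = LongStream.rangeClosed(1, n).parallel().sum();

            System.out.println(serialSum == parallelSum); // prints: true
        }
    }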

8 How can we use it? MapReduce has several functionalities that we can easily utilize:
- Statistical analysis
- Sorting
- Counting

9 How can we use it? (cont.) Beyond the popular MapReduce, Hadoop supports other services:
- Apache Spark
- Mahout
- HBase
- ZooKeeper
- Oozie
- Pig
- Hive

10 Recommender Systems Many high-level languages, predictive-analysis algorithms, and other tools, including those associated with recommender systems, can be integrated with the Hadoop framework. We can use this framework to analyze the large amounts of data we find on our topic in order to form more informed theses.

11 Recommender Systems (cont.)
Recommender systems let us model a person's taste and find new, desirable content for them based on past search behavior. Recommender algorithms fall into two categories, user-based and item-based, built on two filtering techniques: collaborative filtering and content-based filtering. A small sketch of the user-based idea follows.
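The sketch below is a minimal, self-contained illustration of user-based collaborative filtering; the user names, item names, and toy ratings are all invented for the example. It predicts a rating for an unseen item as the similarity-weighted average of other users' ratings of that item.

    import java.util.HashMap;
    import java.util.Map;

    public class UserBasedCF {
        // Cosine similarity between two users' rating vectors
        // (dot product taken over the items both users rated).
        static double similarity(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double rb = b.get(e.getKey());
                if (rb != null) dot += e.getValue() * rb;
                na += e.getValue() * e.getValue();
            }
            for (double v : b.values()) nb += v * v;
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            // Toy ratings: user -> (item -> rating).
            Map<String, Map<String, Double>> ratings = new HashMap<>();
            ratings.put("alice", Map.of("hadoop-book", 5.0, "spark-book", 4.0));
            ratings.put("bob", Map.of("hadoop-book", 5.0, "mahout-book", 4.5));
            ratings.put("carol", Map.of("spark-book", 2.0, "mahout-book", 1.0));

            // Predict alice's rating for "mahout-book": a similarity-weighted
            // average over the users who have actually rated it.
            Map<String, Double> alice = ratings.get("alice");
            double num = 0, den = 0;
            for (Map.Entry<String, Map<String, Double>> e : ratings.entrySet()) {
                if (e.getKey().equals("alice")) continue;
                Double r = e.getValue().get("mahout-book");
                if (r == null) continue;
                double sim = similarity(alice, e.getValue());
                num += sim * r;
                den += Math.abs(sim);
            }
            System.out.println("Predicted rating: " + (den == 0 ? 0 : num / den));
        }
    }

An item-based recommender inverts the idea: it computes similarity between items' rating columns rather than users' rating rows, which tends to scale better when there are far more users than items.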

12 Apache Spark A fast and general engine for large-scale data processing. Spark can run programs up to 100x faster than Hadoop MapReduce when working in memory, or 10x faster on disk. It runs on Hadoop, on Mesos, standalone, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, and S3. Learning Spark is easy whether you come from a Java or Python background. *Using Spark with Hadoop could be a very powerful combination for our project.*
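For comparison with the MapReduce job above, here is a hedged sketch of the same word count in Spark's Java API. It assumes Spark 2.x (where flatMap takes an iterator-returning function), and the HDFS paths are placeholders.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("SparkWordCount");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Spark keeps intermediate results in memory, which is where
            // its speedup over disk-based MapReduce comes from.
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input");

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.saveAsTextFile("hdfs:///user/demo/output");
            sc.stop();
        }
    }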

13 Mahout Architecture and Algorithms
Apache Mahout is a project of the Apache Software Foundation that produces free implementations of distributed, scalable machine-learning algorithms, focused primarily on collaborative filtering, clustering, and classification. Machine-learning algorithms with different performance-measurement methods are implemented in a distributed computing environment (HDFS). The Mahout architecture provides a good standard interface for evaluating a recommender system.
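A minimal sketch of a user-based recommender built on Mahout's Taste API; the ratings.csv file, the neighborhood size of 10, and the choice of Pearson correlation are assumptions for illustration.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class TasteRecommenderExample {
        public static void main(String[] args) throws Exception {
            // Assumed input file: one "userID,itemID,rating" triple per line.
            DataModel model = new FileDataModel(new File("ratings.csv"));

            // Collaborative filtering: compare users by their rating vectors.
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood =
                    new NearestNUserNeighborhood(10, similarity, model);

            Recommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 recommendations for user 1.
            List<RecommendedItem> recs = recommender.recommend(1, 3);
            for (RecommendedItem item : recs) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }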

14 How will we implement it?
Hardware is needed to implement a Hadoop framework, but another good thing about Hadoop is that it runs on relatively simple hardware. There are four types of roles in a basic Hadoop cluster (a sketch of querying the NameNode about them follows this list):
- NameNode (plus a Standby NameNode)
- JobTracker
- TaskTracker
- DataNode
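To make the NameNode/DataNode split concrete: clients ask the NameNode for cluster metadata, such as the list of live DataNodes and their capacity. Below is a hedged sketch using the HDFS Java API; the NameNode host is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class ClusterReport {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder

            FileSystem fs = FileSystem.get(conf);
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;
                // The NameNode tracks every DataNode; this asks it for the
                // current list and each node's used and remaining capacity.
                for (DatanodeInfo node : dfs.getDataNodeStats()) {
                    System.out.printf("%s used=%d remaining=%d%n",
                            node.getHostName(), node.getDfsUsed(), node.getRemaining());
                }
            }
            fs.close();
        }
    }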

15 DataNode/TaskTrackers
Some recommended specifications for DataNode/TaskTracker nodes in a balanced Hadoop cluster:
- TB hard disks in a JBOD configuration
- 2 quad-/hex-/octo-core CPUs, running at least 2-2.5 GHz
- 64-512 GB of RAM
- Bonded Gigabit Ethernet or 10 Gigabit Ethernet

16 NameNode/JobTracker/Standby NameNode
Some recommended specifications for NameNode/JobTracker/Standby NameNode nodes:
- 4-6 1TB hard disks in a JBOD configuration
- 2 quad-/hex-/octo-core CPUs, running at least 2-2.5 GHz
- 64-128 GB of RAM
- Bonded Gigabit Ethernet or 10 Gigabit Ethernet

17 References
- "Apache Hadoop HDFS - Hortonworks." Hortonworks.com. Hortonworks Inc. Web. 20 Oct.
- Fiori, Andrea. "Hadoop Overview – Andrea Fiori." Andreafiori.net. N.p., 26 Apr. Web. 20 Oct.
- "HDFS Architecture Guide." Hadoop.apache.org. The Apache Software Foundation. Web. 20 Oct.
- "Introduction to Parallel Programming and MapReduce." N.p. Web. 20 Oct.
- O'Dell, Kevin. "How-to: Select the Right Hardware for Your New Hadoop Cluster." Cloudera.com. N.p., 28 Aug. Web. 20 Oct.
- Verma, Jai Prakash, Bankim Patel, and Atul Patel. "Big Data Analysis: Recommendation System with Hadoop ..." ResearchGate. N.p., 20 Apr. Web. 20 Oct.

