1
SUBMITTED TO – SANJAY SIR (ASSISTANT PROFESSOR) SUBMITTED BY – ROHIT PANDEY 17MCA04
2
What is Hadoop technology?
Why is Hadoop technology used? (Diagram)
Reasons for using Hadoop technology
What is Big Data?
Types of Big Data
Difference between Hadoop and Big Data
Developers of Hadoop technology
Hadoop users
3
Features of Hadoop
Hadoop architecture
Hadoop installation
Hadoop ecosystem
Disadvantages of Hadoop
4
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It is the most well-known technology used for Big Data, and is essentially a large-scale batch data processing system.
5
Why Hadoop technology is used:
Storage and processing speed
Low cost
Fault tolerance
Computing power
Scalability
Flexibility
6
1. Storage and processing speed: with data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), fast storage and processing are a key consideration.
2. Computing power: Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
3. Fault tolerance: data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail. Multiple copies of all data are stored automatically.
4. Flexibility: unlike traditional relational databases, you do not have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data such as text, images and videos.
5. Low cost: the open-source framework is free and uses commodity hardware to store large quantities of data.
6. Scalability: you can easily grow the system to handle more data simply by adding nodes. Little administration is required.
7
Big Data is also data, but of a huge size. Big Data is a term used to describe a collection of data that is huge in volume and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently. Example of Big Data: the New York Stock Exchange generates about one terabyte of new trade data per day.
8
Three types of Big Data:
1. Structured: any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.
2. Unstructured: any data whose form or structure is unknown is classified as unstructured data.
3. Semi-structured: semi-structured data can contain both forms. It looks structured, but it is not defined by a fixed schema such as a table definition in a relational DBMS; JSON and XML documents are typical examples.
9
Hadoop vs. Big Data
Hadoop: an open-source software framework for storing data and running applications; the most well-known technology used for Big Data; essentially a large-scale batch data processing system.
Big Data: a term used to describe a collection of data that is huge in size and growing exponentially with time; such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
10
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution for the Nutch search engine project.
12
License free: anyone can go to the Apache Hadoop website, download Hadoop, install it and work with it.
Open source: its source code is available, and you can modify and change it as per your requirements.
Meant for Big Data analytics: it can handle Volume, Variety, Velocity and Value. Hadoop is an approach to handling Big Data, and it handles it with the help of its ecosystem.
Faster: Hadoop is extremely good at high-volume batch processing because of its ability to do parallel processing. Hadoop can perform batch processes many times faster than a single-threaded server or a mainframe.
Fault tolerance: data sent to an individual node is also replicated on other nodes in the same cluster. If that node fails to process the data, the other nodes in the cluster are available to process it.
High availability: data is highly available and accessible despite hardware failure, because multiple copies of the data are kept. If a machine or piece of hardware crashes, the data is accessed from another path.
14
Cluster: - A Hadoop cluster is designed specifically for storing and analysing huge amounts of unstructured data in a distributed computing environment. A cluster is the set of nodes, also known as host machines; it is the hardware part of the infrastructure. These clusters run Hadoop's open-source distributed processing software on low-cost commodity computers.
15
YARN Infrastructure: - YARN is short for Yet Another Resource Negotiator. Apache YARN is a part of Hadoop, but it can also act as a standalone resource manager. YARN is the framework responsible for providing the computational resources needed for application execution. YARN consists of two important elements: the ResourceManager and the NodeManager.
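As an illustration, here is a minimal Java sketch that uses the YARN client API to ask the ResourceManager which NodeManagers are running and what resources they offer. It assumes a reachable cluster whose addresses come from yarn-site.xml on the classpath; it is a sketch, not a complete application.

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager which NodeManagers are alive and what they offer
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + " containers=" + node.getNumContainers()
                    + " capability=" + node.getCapability());
        }

        yarnClient.stop();
    }
}
```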
16
HDFS Federation: - HDFS is the framework responsible for providing permanent, reliable and distributed storage. It is typically used for storing job inputs and outputs (but not intermediate data). Federation adds support for multiple namespaces in the cluster to improve scalability and isolation: in order to scale the name service horizontally, it uses multiple independent NameNodes/namespaces.
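Whether or not federation is enabled, applications reach HDFS through the FileSystem API. Below is a minimal Java sketch that writes a small file into HDFS and reads it back; the NameNode URI and the path /user/demo/hello.txt are placeholders for this example.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; hard-coded here for illustration
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write a small file into HDFS (overwrite if it already exists)
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the same file back
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }

        fs.close();
    }
}
```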
17
MapReduce Framework: - A MapReduce computation is usually composed of three steps. Map: each node applies the map function to its local data and writes the output to temporary storage; a master node ensures that only one copy of redundant input data is processed. Shuffle: nodes redistribute the map output based on the output keys, so that all data belonging to one key ends up on the same node. Reduce: each node processes its group of output data, per key, in parallel.
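The classic word-count job illustrates these three steps: the map function emits (word, 1) pairs, the shuffle groups the pairs by word, and the reduce function sums the counts. A minimal sketch using the Hadoop MapReduce Java API follows; input and output paths are taken from the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every token in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts that the shuffle grouped under each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```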
19
1. Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL. 2. It is a platform for structuring data flows and for processing and analysing huge data sets. 3. Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS. 4. The Pig Latin language is specially designed for this framework and runs on the Pig runtime, just the way Java runs on the JVM. 5. Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.
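As a sketch of how Pig Latin statements can be driven from Java, the embedded PigServer API below registers a few statements that load a log file, group records by URL and count them; Pig turns these statements into MapReduce jobs behind the scenes. The file name access_log.txt and the field names are made up for the example.

```java
import java.util.Iterator;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // "mapreduce" would run on the cluster; "local" runs against the local filesystem
        PigServer pig = new PigServer("local");

        // Pig Latin statements registered from Java
        pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage(' ') AS (ip:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH grouped GENERATE group AS url, COUNT(logs) AS n;");

        // Iterate over the result of the compiled job
        Iterator<Tuple> it = pig.openIterator("hits");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}
```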
20
1. With the help of an SQL-like methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language). 2. It is highly scalable, as it allows both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier. 3. Like other query-processing frameworks, Hive comes with two components: JDBC drivers and the Hive command line. 4. The JDBC and ODBC drivers establish data-storage permissions and connections, whereas the Hive command line helps in processing queries.
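A minimal Java sketch of querying Hive through its JDBC driver is shown below; it assumes a HiveServer2 instance on localhost:10000 and a hypothetical sales table, and it needs the hive-jdbc driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder host/port and an unauthenticated connection for illustration
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            // HQL reads like ordinary SQL; the sales table is hypothetical
            try (ResultSet rs = stmt.executeQuery("SELECT category, COUNT(*) FROM sales GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```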
21
1. Mahout adds machine-learning capability to a system or application. Machine learning, as the name suggests, helps a system develop itself based on patterns, user/environment interaction or algorithms. 2. It provides libraries for collaborative filtering, clustering and classification, which are core machine-learning concepts, and it allows these algorithms to be invoked as needed through its own libraries.
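As an illustration of the collaborative-filtering libraries mentioned above, the sketch below uses Mahout's classic Taste recommender API to suggest items for a user from a ratings file. The file ratings.csv (lines of userID,itemID,rating), the neighbourhood size and the user ID are assumptions for the example.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
    public static void main(String[] args) throws Exception {
        // CSV of userID,itemID,rating lines; a stand-in data set for the example
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based collaborative filtering: find similar users, then their liked items
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " score=" + item.getValue());
        }
    }
}
```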
22
1. It is a platform that handles all the process-intensive tasks such as batch processing, interactive or iterative real-time processing, graph conversion and visualization. 2. It works with in-memory resources and is therefore faster than the previous approach in terms of optimization.
23
1. It is a NoSQL database that supports all kinds of data and is thus capable of handling anything in a Hadoop database. It provides capabilities similar to Google's BigTable and can therefore work on Big Data sets effectively. 2. When we need to search for or retrieve a few small records in a huge database, the request must be processed within a very short span of time. At such times HBase comes in handy, because it gives us a tolerant way of storing and looking up this limited data.
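A minimal Java sketch of the kind of point write and read HBase is designed for is shown below; it assumes an existing table named users with a column family info, both of which are placeholders, and an hbase-site.xml on the classpath pointing at the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate ZooKeeper and the cluster
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Rohit"));
            table.put(put);

            // Point lookup by row key: the small, fast read HBase is built for
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```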
24
There was a huge issue of coordination and synchronization among the resources and components of Hadoop, which often resulted in inconsistency. ZooKeeper overcame these problems by performing synchronization, inter-component communication, grouping and maintenance.
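The sketch below shows the coordination idea in miniature with the ZooKeeper Java client: one component publishes a small configuration value as a znode, and any other component in the cluster can read the same value (and could watch it for changes). The ensemble address, the znode path and the stored value are placeholders.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder ensemble address; 3000 ms session timeout
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a small piece of shared configuration as a znode
        String path = "/demo/config";
        if (zk.exists("/demo", false) == null) {
            zk.create("/demo", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=64".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any component in the cluster can read the same value
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```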
25
Oozie simply performs the task of a scheduler: it schedules jobs and binds them together as a single unit. There are two kinds of jobs, Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are executed in a sequentially ordered manner, whereas Oozie coordinator jobs are triggered when some data or an external stimulus is given to them.
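A minimal sketch of submitting a workflow from Java with the Oozie client API is shown below; the Oozie server URL and the HDFS application path are placeholders, and the workflow definition (workflow.xml) is assumed to already exist at that path.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL of the Oozie server
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Job properties; the application path must point at a directory containing workflow.xml
        Properties props = oozie.createConfiguration();
        props.setProperty(OozieClient.APP_PATH, "hdfs://namenode-host:9000/user/demo/app");
        props.setProperty("queueName", "default");

        // Submit and start the workflow, then check its status once
        String jobId = oozie.run(props);
        System.out.println("Submitted workflow " + jobId);

        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Status: " + job.getStatus());
    }
}
```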
26
1. Managing traffic on streets.
2. Streaming processing.
3. Content management and archiving emails.
4. Processing rat brain neuronal signals using a Hadoop computing cluster.
5. Fraud detection and prevention.
6. Managing content, posts, images and videos on social media platforms.
7. Analyzing customer data in real time to improve business performance.
8. Public sector fields such as intelligence, defense, cyber security and scientific research.
28
1. Issue with small files: Hadoop does not suit small data. HDFS (the Hadoop Distributed File System) lacks the ability to support random reading of small files efficiently because of its high-capacity design. 2. Slow processing speed: in Hadoop, MapReduce processes large data sets with a parallel and distributed algorithm. There are two tasks to perform, Map and Reduce, and MapReduce requires a lot of time to perform them, thereby increasing latency. 3. Support for batch processing only: Hadoop supports batch processing only; it does not process streamed data, and hence overall performance is slower. The MapReduce framework does not leverage the memory of the Hadoop cluster to the maximum. 4. No real-time data processing: Apache Hadoop is designed for batch processing, which means it takes a huge amount of data as input, processes it and produces the result.
29
5. Not easy to use: in Hadoop, MapReduce developers need to hand-code each and every operation, which makes it very difficult to work with. 6. Security: Hadoop is challenging to manage for complex applications. If the person managing the platform does not know how to enable its security features, the data can be at huge risk. 7. No abstraction: Hadoop does not have any type of abstraction, so MapReduce developers need to hand-code each and every operation, which makes development difficult. 8. Vulnerable by nature: Hadoop is written entirely in Java, one of the most widely used languages; Java has been heavily exploited by cyber criminals and has, as a result, been implicated in numerous security breaches.
30
Hadoop has been a very effective solution for companies dealing with data in petabytes. It has solved many problems in industry related to huge data management and distributed systems. As it is open source, it is widely adopted by companies. Hadoop is thus one of the main tools that helps in analyzing and processing Big Data.
31
studymafia.org www.123seminarsonly.com www.mindsmapped.com www.edureka.co www.tutorialscampus.com www.google.com