
1 Big Data Technology

2 Objective: Design and configuration of the architecture used to analyze Big Data

3 Introduction Because Big Data deals with huge volumes of varied data, it requires the best available technologies at every stage: collecting data, cleaning it, sorting and organizing it, integrating it, analyzing it, and deriving final conclusions. One way to achieve this combination of efficiency and economy is virtualization.

4 Virtualization Virtualization is the creation of a virtual version of something, such as an operating system, a network resource, or a server. It keeps the virtual version of resources and services separate from the underlying physical implementation. Applying virtualization across the Big Data environment helps fulfil its resource requirements to a great extent.

5 Exploring the Big Data Stack
The first step in designing any data architecture is to create a model that gives a complete view of all the required elements. Data analysis likewise needs such a model, known as the Big Data architecture. The configuration of the model may vary depending on the specific needs of the organization; however, the basic layers and components remain more or less the same. While creating a Big Data environment, we must also take hardware, infrastructure software, operational software, management software, APIs, and software developer tools into consideration.

6 Foundational Requirements
The architecture must fulfil all the foundational requirements and must be able to perform the following functions:
Capturing data from different sources
Cleaning and integrating data of different types and formats
Storing and organizing data
Analyzing data
Identifying relationships and patterns
Deriving conclusions based on the data analysis results

7 Stack of Layers in Big Data Architecture

8 Layers of Big Data architecture
Data Source Layer: Organizations generate huge amounts of data almost daily, and this volume is growing exponentially. The data source layer assimilates the data coming in from various sources, at varying velocities, and in different formats. For example, a telecom company obtains its data from many different sources.

9 Ingestion Layer: The role of the ingestion layer is to absorb the huge inflow of data and sort it into different categories. This layer separates noise from the relevant information and can handle the huge volume, high velocity, and variety of the data. The ingestion layer validates, cleanses, transforms, reduces, and integrates the unstructured data into the Big Data stack for further processing.

10 In the ingestion layer the data passes through the following stages:
Identification: Data is categorized into various known data formats; in other words, unstructured data is assigned default formats.
Filtration: The information relevant to the enterprise is filtered on the basis of the enterprise Master Data Management (MDM) repository.
Validation: The filtered data is analyzed against the MDM metadata.
Noise Reduction: Data is cleaned by removing noise and minimizing related disturbances.

Transformation: Data is split or combined on the basis of its type, its contents, and the requirements of the organization.
Compression: The size of the data is reduced without affecting its relevance for the required process; note that compression does not affect the analysis results.
Integration: The refined dataset is integrated with the Hadoop storage layer, which consists of the Hadoop Distributed File System (HDFS) and NoSQL databases.
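To make these stages concrete, here is a toy, self-contained Python sketch of an ingestion pipeline; the record layout and the MDM key set are invented for illustration and do not come from any real MDM product:

```python
def ingest(record, mdm_keys=frozenset({"customer_id", "region"})):
    # Identification: unstructured input gets a default format.
    if not isinstance(record, dict):
        record = {"raw_text": str(record)}
    # Filtration: keep only fields known to the (hypothetical) MDM repository.
    record = {k: v for k, v in record.items() if k in mdm_keys or k == "raw_text"}
    # Validation: reject records missing mandatory MDM metadata.
    if "customer_id" not in record:
        return None
    # Noise reduction: strip whitespace noise from text fields.
    record = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    # Transformation: normalize fields as the organization requires.
    record["region"] = str(record.get("region", "unknown")).lower()
    return record  # ready for integration into the storage layer

print(ingest({"customer_id": 7, "region": " North ", "junk": 1}))
print(ingest("free-form log line"))  # fails validation -> None
```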

12 Storage Layer Hadoop is an open-source framework (storage architecture) used to store large volumes of data in a distributed manner across multiple machines. The Hadoop storage layer supports fault tolerance and parallelization, which enables high-speed distributed processing algorithms to execute over large-scale data. Hadoop has two major components: a scalable Hadoop Distributed File System (HDFS) that can support petabytes of data, and a MapReduce engine that computes results in batches.
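As a hedged illustration of working with HDFS from code, the sketch below uses the third-party HdfsCLI Python package over WebHDFS; the namenode address, user, and file path are assumptions, not values from the text:

```python
from hdfs import InsecureClient  # third-party HdfsCLI package

# WebHDFS endpoint, user, and paths are assumptions for illustration.
client = InsecureClient("http://namenode.example:9870", user="hadoop")

# Write a small file; HDFS itself splits the data into blocks and
# replicates them across data nodes for fault tolerance.
client.write("/data/events.csv", data="id,event\n1,login\n", overwrite=True)

with client.read("/data/events.csv") as reader:
    print(reader.read())
```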

13 Earlier, different types of databases, such as relational and non-relational ones, were used for storing different types of data. All of these storage requirements can now be addressed by a single concept known as Not Only SQL (NoSQL) databases. Some examples of NoSQL databases include HBase, MongoDB, AllegroGraph, and InfiniteGraph.

14 Different NoSQL databases used for different types of business applications are shown below:
Key-Value Pair: Shopping carts, web user data analysis (Amazon, LinkedIn)
Column-Oriented: Analyzing huge volumes of web user actions (Facebook, Twitter)
Document-Based: Real-time analytics, logging, document archive management
Graph-Based: Network modelling, locality, recommendation
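As one concrete example from the list, a shopping cart fits a document store naturally, since the whole cart is one self-contained document. Below is a minimal sketch using the pymongo driver for MongoDB; the connection string, database, and collection names are assumptions for illustration:

```python
from pymongo import MongoClient

# Connection string and collection names are assumptions for illustration.
client = MongoClient("mongodb://localhost:27017")
carts = client["shop"]["carts"]

# The entire cart is stored as a single document, so reading a
# user's cart is a single key lookup rather than a relational join.
carts.insert_one({"user_id": 42, "items": [{"sku": "A1", "qty": 2}]})
print(carts.find_one({"user_id": 42}))
```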

15

16 Physical Infrastructure Layer
Big Data implementations are based on the following specific principles:
Performance: High-end infrastructure is required to deliver high performance with low latency. Performance is measured end to end, on the basis of a single transaction or query request, and is rated high if the total time taken to traverse a query request is low. The total time taken by a data packet to travel from one node to another is described as latency. (A small timing sketch follows this list.)

17 Setups providing high performance and low latency are generally far more expensive than normal infrastructure setups.
Availability: The infrastructure setup must be available at all times to ensure a nearly 100 percent uptime guarantee.
Scalability: The Big Data infrastructure should be scalable enough to accommodate varying storage and computing requirements, and ready to deal with any unexpected challenges.
Flexibility: A flexible infrastructure makes it easier to add resources to the setup and promotes failure recovery. Such infrastructures are costly, but the costs can be controlled with cloud services, where you pay only for what you actually use.
Cost: The hardware, networking, and storage components of the infrastructure must be selected so that the overall setup is affordable.
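Relating to the Performance principle above, here is a minimal, hedged sketch of measuring end-to-end latency in Python; the query executor is a stand-in lambda, not a real Big Data engine:

```python
import time

def measure_latency(run_query, request):
    # End-to-end latency: wall-clock time from dispatching the request
    # until the complete response is back.
    start = time.perf_counter()
    response = run_query(request)
    return response, time.perf_counter() - start

# Stand-in for a real query executor.
response, seconds = measure_latency(lambda q: q.upper(), "select *")
print(f"end-to-end latency: {seconds * 1000:.3f} ms")
```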

18

19 Platform Management Layer
The role of the platform management layer is to provide tools and query languages for accessing NoSQL databases. This layer uses the HDFS storage file system, which lies on top of the Hadoop physical infrastructure layer. Two technologies, Hadoop and MapReduce, allow enterprises to store, access, and analyze large amounts of data, including through real-time analysis. Both technologies address the fundamental problem of processing huge amounts of data in a timely, efficient, and cost-effective manner.

20 Building Blocks of Hadoop Platform Management Layer
MapReduce: A combination of the map and reduce features. Map is the component that distributes a problem (as multiple tasks) across a large number of systems and also handles distribution of the load for recovery management against failures. When the distributed computation is complete, the reduce function combines all the elements back together to provide an aggregate result. A minimal simulation of these two phases appears below.
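The following self-contained Python sketch simulates the two phases on a single machine using the classic word-count pattern; it is a conceptual illustration, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word; on a real cluster each
    # node runs this function over its own split of the input.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    # Shuffle + Reduce: group intermediate pairs by key and aggregate
    # the values for each key into a single result.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data needs big infrastructure", "data drives decisions"]
intermediate = (pair for doc in docs for pair in map_phase(doc))
print(reduce_phase(intermediate))
# {'big': 2, 'data': 2, 'needs': 1, 'infrastructure': 1, ...}
```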

21 Stack of Layers in Big Data Architecture

22 Hive: A data warehouse system for Hadoop that provides the capability of aggregating large volumes of data.
Pig: A scripting language used for batch processing of huge amounts of data. Pig is not suitable for queries over a small portion of a dataset because it scans the entire dataset in one go.
HBase: A column-oriented database that provides fast access for handling Big Data. It is Hadoop-compliant and suitable for batch processing.
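As a hedged sketch of querying Hive programmatically, the snippet below uses the third-party PyHive package; the host, port, username, and table name are assumptions for illustration:

```python
from pyhive import hive  # third-party PyHive package

# Host, port, username, and table are assumptions for illustration.
conn = hive.connect(host="hive-server.example", port=10000, username="analyst")
cursor = conn.cursor()

# Hive translates this aggregation into batch jobs over data in HDFS.
cursor.execute("SELECT region, COUNT(*) FROM call_records GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)
```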

23 Sqoop: A command-line tool that can import individual tables, specific columns, or entire database files directly into the distributed file system or data warehouse.
ZooKeeper: A coordinator that keeps multiple Hadoop instances and nodes in synchronization and protects every node from failing because of data overload.
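Here is a small coordination sketch using the third-party kazoo client for ZooKeeper; the ensemble addresses and znode paths are assumptions for illustration:

```python
from kazoo.client import KazooClient  # third-party kazoo package

# Ensemble address and paths are assumptions for illustration.
zk = KazooClient(hosts="zk1.example:2181,zk2.example:2181")
zk.start()

# Register this worker under an ephemeral znode: ZooKeeper removes it
# automatically if the worker dies, so the cluster notices the failure.
zk.create("/workers/worker-1", b"alive", ephemeral=True, makepath=True)
print(zk.get_children("/workers"))
zk.stop()
```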

24 Security Layer The security layer handles the basic security tenets that a Big Data architecture should follow. Big Data projects are prone to security issues because they use a distributed architecture, a simple programming model, and an open framework of services. Therefore, the following security checks must be considered while designing a Big Data stack:
It must authenticate nodes by using protocols such as Kerberos.
It must enable file-layer encryption.
It must subscribe to a key management service for trusted keys and certificates.
It must maintain logs of the communication that occurs between nodes and trace any anomalies.
It must ensure secure communication between nodes by using Secure Sockets Layer (SSL), as sketched below.
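As a hedged illustration of the SSL check above, this minimal Python sketch opens a TLS-protected connection between two hypothetical nodes; the host names, port, and CA bundle path are assumptions:

```python
import socket
import ssl

# Node address, port, and CA bundle path are assumptions for illustration.
context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH,
                                     cafile="cluster-ca.pem")
with socket.create_connection(("data-node-2.example", 9443)) as sock:
    with context.wrap_socket(sock, server_hostname="data-node-2.example") as tls:
        tls.sendall(b"heartbeat")  # traffic between nodes is encrypted in transit
```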

25 Monitoring Layer The monitoring layer consists of a number of monitoring systems.
These systems remain automatically aware of all the configurations and functions of the different operating systems and hardware. They also provide machine-to-machine communication with the help of monitoring tools, using high-level protocols such as XML. Monitoring systems additionally provide tools for data storage and visualization. Examples of open-source tools for monitoring Big Data stacks are Ganglia and Nagios.

26 Analytics Engine The role of the analytics engine is to analyze huge amounts of unstructured data. This type of analysis involves text analytics and statistical analytics. Examples of unstructured data that are available as large datasets include the following:
Documents containing textual patterns
Text and symbols generated by customers or users on social media forums such as Yammer
Machine-generated data, such as weather data
Data generated from application logs, for example uptime and downtime details or maintenance and upgrade details

27 Statistical and Numerical Methods for Analyzing Unstructured Data
Natural Language Processing (NLP)
Text mining
Linguistic computation
Machine learning
Search and sort algorithms
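A toy text-mining sketch in plain Python is shown below; the stopword list and sample posts are invented for illustration:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "was", "since"}

def top_terms(texts, k=5):
    # Tokenize, drop stopwords, and count term frequencies: the
    # simplest form of text mining over unstructured documents.
    tokens = [t for text in texts
              for t in re.findall(r"[a-z']+", text.lower())
              if t not in STOPWORDS]
    return Counter(tokens).most_common(k)

posts = ["Service was down again", "Weather data feed down since noon"]
print(top_terms(posts))  # e.g. [('down', 2), ('service', 1), ...]
```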

28 The following types of engines are used for analyzing Big Data:
Search engines: Big Data analysis requires extremely fast search engines with iterative and cognitive data-discovery mechanisms, because the data loaded from various sources has to be indexed and searched for Big Data analytics processing.
Real-time analytics: Real-time applications generate data at a very high speed, and even data a few hours old becomes obsolete as new data continues to flow in. Real-time analysis is required in the Big Data environment to analyze this type of data; for this purpose, real-time engines and NoSQL data stores are used, as sketched below.
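A common real-time pattern is to aggregate only over a recent window of events and discard older, obsolete ones. The following self-contained Python sketch illustrates the idea; it is a conceptual sliding-window aggregator, not a real streaming engine:

```python
import time
from collections import deque

class SlidingWindow:
    """Keep only events from the last `horizon` seconds; older data
    is treated as obsolete, as described above."""

    def __init__(self, horizon=60.0):
        self.horizon = horizon
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def add(self, value, now=None):
        now = time.time() if now is None else now
        self.events.append((now, value))
        # Evict anything that has aged out of the window.
        while self.events and self.events[0][0] < now - self.horizon:
            self.events.popleft()

    def average(self):
        if not self.events:
            return 0.0
        return sum(v for _, v in self.events) / len(self.events)

window = SlidingWindow(horizon=5.0)
window.add(10.0)
window.add(20.0)
print(window.average())  # 15.0
```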

29 Visualization Layer The visualization layer handles the task of interpreting and visualizing Big Data. Visualization is performed by data analysts to examine different aspects of the data in various visual modes. It can be described as viewing a piece of information from different perspectives, interpreting it in different ways, trying to fit it into different situations, and deriving different conclusions from it.

30 Process Flow of Visualization Layer:
The visualization layer works on top of the aggregated data stored in traditional Operational Data Stores (ODS), data warehouses, and data marts, which receive the aggregated data through ingestion tools such as Sqoop. Examples of visualization tools are Tableau, QlikView, Spotfire, MapR, and Revolution R. These tools work on top of traditional components such as reports, dashboards, and queries.
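As a minimal sketch of this layer in code, the snippet below plots invented aggregates with matplotlib; the figures stand in for data that would come from an ODS or data mart:

```python
import matplotlib.pyplot as plt

# Invented aggregates standing in for data pulled from an ODS or data mart.
regions = ["North", "South", "East", "West"]
revenue = [120, 95, 143, 88]

plt.bar(regions, revenue)
plt.title("Revenue by region (aggregated data)")
plt.ylabel("Revenue (thousands)")
plt.savefig("revenue_by_region.png")  # a simple dashboard-style chart
```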

31 Virtualization and Big Data
Virtualization is a process that allows you to run images of multiple operating systems on a single physical computer. These operating system images are called virtual machines. A virtual machine is basically a software representation of a physical machine that can execute or perform the same functions as the physical machine. Each virtual machine contains a separate copy of the operating system with its own virtual hardware resources, device drivers, services, and applications. Although virtualization is not a requirement for Big Data analysis, software frameworks such as MapReduce work very efficiently in a virtualized environment.

32 Virtualization Environment
The operating system that runs inside a virtual machine is known as the guest, while the operating system of the physical machine that runs the virtual machine is known as the host. A guest operating system runs on a hardware virtualization layer, which sits on top of the hardware of the physical machine.

33 The following are the basic features of virtualization:
Partitioning: Multiple applications and operating systems are supported on a single physical system by partitioning (separating) the available resources.
Isolation: Each virtual machine runs in isolation from its host physical system and from other virtual machines. The benefit of such isolation is that if one virtual instance crashes, the other virtual machines and the host system are not affected.
Encapsulation: Each virtual machine encapsulates its state as a set of files. Like a simple file on a computer system, a virtual machine can be moved or copied, and it works like an independent guest software configuration.
Interposition: In a virtual machine, all guest actions generally pass through the monitor, which can inspect, modify, or deny operations such as compression, encryption, profiling, and translation.

34

35 Virtualization Approaches
In a Big Data environment, almost everything can be virtualized: servers, storage, applications, data, networks, processors, memory, and services.
Server Virtualization: A single physical server is partitioned into multiple virtual servers. Each virtual server appears to have its own hardware and related resources, such as Random Access Memory (RAM), CPU, hard drive, and network controller. A thin layer of software containing a virtual machine monitor, also called a hypervisor, is inserted over the hardware; it manages the traffic between the virtual machines and the physical machine.

36 Application Virtualization
Encapsulating applications so that they do not depend on the underlying physical computer system, which improves the manageability and portability of applications.
Network Virtualization: Using virtual networking as a pool of connection resources.

