
1 The Big Data Network (phase 2) Cloud Hadoop system
Presentation to the Manchester Data Science club, 14 July 2016. Peter Smyth, UK Data Service.

2 The BDN2 Cloud Hadoop system
The challenges of building and populating a secure but accessible big data environment for researchers in the Social Sciences and related disciplines.

3 Overview of this presentation
- Aims
- A simple Hadoop system
- Dealing with the data
- Processing
- Users
- Safeguarding the data and its usage
- Appeal for data and use cases

4 AIMS

5 Aims
- Provide a processing environment for big data
- Targeted at the Social Sciences, but not exclusively so
- Provide easy ingest of datasets
- Provide comprehensive search facilities
- Provide easy access for users, for processing or download

6 Cloud Hadoop System

7 Cloud Hadoop System
- Start with a minimal configuration
- Cloud based, so we can grow it as needed; adding nodes is what Hadoop is good at
- Need to provide HA (high availability) from the outset; resilience and user access are important
- Search facilities will be expected 24/7

8 Software installed, and how we will use it
- Standard HDP (Hortonworks Data Platform): Spark, Hive, Pig, Ambari, Zeppelin, etc.
- Other Apache software:
  - Ranger: monitors and manages comprehensive data security across the Hadoop platform
  - Knox: a REST API gateway providing a single point of entry to the cluster (see the sketch below)
- Other software: Kerberos with AD integration
- Our own processes for workflows and ingest / metadata production
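Knox shapes how everything else is reached, so it is worth seeing what that single point of entry looks like from a client. A minimal sketch, assuming Knox's usual WebHDFS proxying, HTTP Basic authentication against AD/LDAP, and a topology named "default"; the gateway host, user, credentials and path are placeholders, not real BDN2 details.

```python
# List a user's HDFS home directory through the Knox gateway rather
# than by talking to the cluster nodes directly.
import requests

GATEWAY = "https://knox.example.ac.uk:8443/gateway/default"  # placeholder host/topology

resp = requests.get(
    f"{GATEWAY}/webhdfs/v1/user/alice",      # WebHDFS, proxied by Knox
    params={"op": "LISTSTATUS"},
    auth=("alice", "password"),              # Knox authenticates against AD/LDAP
)
resp.raise_for_status()
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])
```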

9 Fitting the bits together
[Diagram: the Hadoop system at the centre, connected to job scheduling, user access control & quotas, data access control & SDC, data, users, performance monitoring, and auditing and logging]

10 Data

11 Getting the data in
- Large datasets from 3rd parties
- Existing UKDS datasets: not necessarily big data, but likely to be used in conjunction with other data
- BYOD (bring your own data): negotiation, contracts, conditions

12 How not to do it!

13 HDF (Hortonworks DataFlow)
- Built on Apache NiFi (monitoring sketch below)
- Allows workflows to be built for collecting data from external sources:
  - Single-shot datasets
  - Regular updates (monthly, daily)
  - Possibility of streaming data
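Flows themselves are assembled in the NiFi UI, but NiFi also exposes a REST API, which is how an ingest team might watch a flow from scripts. A hedged sketch against an unsecured NiFi instance; the host is a placeholder, and "root" stands in for a specific ingest flow's process-group id.

```python
# Poll the aggregate status of a NiFi process group: how many
# flowfiles are queued and how much data has come in.
import requests

NIFI = "http://nifi.example.ac.uk:8080/nifi-api"  # placeholder, unsecured instance
PG_ID = "root"                                    # or a specific flow's process-group id

status = requests.get(f"{NIFI}/flow/process-groups/{PG_ID}/status").json()
snapshot = status["processGroupStatus"]["aggregateSnapshot"]
print(snapshot["queuedCount"], "flowfiles queued;", snapshot["bytesIn"], "bytes in")
```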

14 NiFi workflow

15 Data storage
- Raw data: the files exactly as they come in
- Metadata: not only metadata but the semantic data lake contents
- Dashboards, summaries and samples
- User data: own datasets, work in progress, results
(One possible HDFS layout is sketched below.)
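Purely illustrative, a sketch of how those areas could map onto HDFS paths; none of these conventions are confirmed for BDN2.

```python
# Hypothetical zone layout for the storage areas above.
ZONES = {
    "raw":       "/data/raw",        # files exactly as they arrive
    "metadata":  "/data/metadata",   # extracted metadata feeding the semantic data lake
    "published": "/data/published",  # dashboards, summaries and samples
}

def user_area(username: str) -> dict:
    """Per-user areas: own datasets, work in progress, results."""
    home = f"/user/{username}"
    return {"datasets": f"{home}/datasets",
            "work":     f"{home}/work",
            "results":  f"{home}/results"}

print(ZONES["raw"], user_area("alice"))
```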

16 Semantic data lake
- Must contain everything: there will be only one search engine, whether the data is in the cloud or on-prem (secure data)
- The metadata isn't just what is extracted from the datasets and associated documentation
- Appropriate ontologies need to be used: not only terms but the relationships between them
- Resource Description Framework (RDF); illustrated below
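For a flavour of RDF, a toy example with the rdflib Python library: a dataset description whose licence and collection relationships sit in the same graph the search engine would query. The namespace, URIs and the Dataset class are invented for the illustration; only the Dublin Core terms are standard.

```python
# Build a tiny RDF graph describing one dataset, then print it as Turtle.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

BDN = Namespace("http://example.org/bdn2/")   # invented namespace
g = Graph()

ds = URIRef(BDN["dataset/energy-consumption-2015"])
g.add((ds, RDF.type, BDN.Dataset))            # invented class for the example
g.add((ds, DCTERMS.title, Literal("Household energy consumption 2015")))
g.add((ds, DCTERMS.license, Literal("Safeguarded: research use only")))
g.add((ds, DCTERMS.isPartOf, BDN["collection/energy"]))  # a relationship, not just a term

print(g.serialize(format="turtle"))
```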

17 Processing

18 Processing
- Ingest and curation processing: extracting and creating metadata
- Processing for dashboards, summaries and samples (samples: in advance or as requested?); a sketch follows below
- User searches
- User jobs
- Processing systems: Spark, Hive / Pig
- Effect of interactive environments: Zeppelin
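A hedged PySpark sketch of the dashboards/summaries/samples step: pre-computing a per-group summary table and a small sample from a raw file. The input path, column names and output locations are all invented.

```python
# Derive a dashboard summary and a 1% sample from a raw dataset.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bdn2-summaries").getOrCreate()

df = spark.read.csv("/data/raw/energy-2015.csv", header=True, inferSchema=True)

# One row per region, feeding a dashboard.
summary = df.groupBy("region").agg(
    F.count("*").alias("records"),
    F.avg("kwh").alias("mean_kwh"),
)
summary.write.mode("overwrite").parquet("/data/published/energy-2015/summary")

# A 1% sample produced in advance rather than on request.
df.sample(fraction=0.01, seed=42) \
  .write.mode("overwrite").parquet("/data/published/energy-2015/sample")
```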

19 Job scheduling
- Ingest-related jobs
- Metadata-maintenance-related jobs
- User jobs:
  - Batch? Hive, Pig
  - (Near) real time: Spark Streaming (see the sketch below)
- What kind of delay is acceptable, for users and for operations?
- Do we need to prioritise?
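On the near-real-time side, Spark Structured Streaming makes the acceptable-delay question concrete: the trigger interval is the knob. A sketch with placeholder paths and an invented schema.

```python
# Micro-batch streaming job: the trigger interval states, in effect,
# how much delay we have decided is acceptable.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType, StringType, DoubleType

spark = SparkSession.builder.appName("bdn2-stream").getOrCreate()

schema = StructType([
    StructField("ts", TimestampType()),
    StructField("sensor", StringType()),
    StructField("value", DoubleType()),
])

stream = spark.readStream.schema(schema).json("/data/raw/incoming")

query = (stream.writeStream
         .format("parquet")
         .option("path", "/data/published/stream")
         .option("checkpointLocation", "/data/checkpoints/stream")
         .trigger(processingTime="1 minute")   # up to a minute of delay is acceptable
         .start())
query.awaitTermination()
```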

20 Users

21 User types
- Short term (try before you 'buy')
- Long term (researchers, 3-5 years)
- Commercial users? (in exchange for data)
- Everyone is a search user

22 Safeguarding Data

23 Security and audit
- Who can access what data: making data available, disc quotas, private areas (a Ranger sketch follows below)
- Who has access to resources and can run jobs: sandbox area for authenticated users, providing tools, levels of support
- What audit trails are maintained: what is recorded, how long we keep the logs, and whether they will be reviewed
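Ranger is the natural home for "who can access what", and it has a public REST API for policies. A minimal sketch granting two users a private HDFS area; the Ranger host, credentials and the service name bdn2_hadoop are placeholders, and a real policy would carry more fields.

```python
# Create a Ranger policy giving two users read/write on a private path.
import requests

RANGER = "https://ranger.example.ac.uk:6182"   # placeholder host

policy = {
    "service": "bdn2_hadoop",                  # assumed Ranger HDFS service name
    "name": "project-x-private-area",
    "resources": {"path": {"values": ["/user/projectx"], "isRecursive": True}},
    "policyItems": [{
        "users": ["alice", "bob"],
        "accesses": [{"type": "read",  "isAllowed": True},
                     {"type": "write", "isAllowed": True}],
    }],
}

r = requests.post(f"{RANGER}/service/public/v2/api/policy",
                  json=policy, auth=("admin", "password"))
r.raise_for_status()
```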

24 Data ownership and provenance
- Restrictions on the use of a dataset: licence agreements, types of research permitted
- Complications due to combining: permissions needed
- Carrying the provenance/licence with the data in the semantic data lake

25 SDC – Statistical Disclosure Control
- Currently a manual process
- Likely to become more complex as more datasets are combined
- Could just be checked on output
- Automated tools are becoming available, but how good are they? Are they good enough? (A toy suppression rule is sketched below.)
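For a sense of what an automated tool checks, a much-simplified version of one standard rule, threshold (primary) suppression, in pandas. Real SDC also needs secondary suppression, dominance rules and so on; the threshold and figures here are made up.

```python
# Blank out any published cell whose underlying respondent count is
# below a minimum threshold.
import pandas as pd

THRESHOLD = 10  # assumed minimum cell count before release

def primary_suppression(values: pd.DataFrame, counts: pd.DataFrame) -> pd.DataFrame:
    """Replace cells with NaN wherever the count falls below the threshold."""
    return values.mask(counts < THRESHOLD)

counts = pd.DataFrame({"urban": [120, 7], "rural": [45, 3]}, index=["owned", "rented"])
means  = pd.DataFrame({"urban": [412.0, 380.5], "rural": [298.2, 301.9]}, index=counts.index)
print(primary_suppression(means, counts))  # the low-count cells come out blank
```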

26 Hadoop in the Cloud

27 Performance monitoring
- Need to understand usage patterns, or try to anticipate them
- Need to be able to detect when the system is under stress (CPU, RAM, HDFS) and react in a timely manner (an Ambari sketch follows below)
- Need to provide proper job scheduling for true batch jobs; cannot allow the use of Spark to result in a free-for-all
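On the detection side, Ambari (part of HDP) already exposes host metrics over REST, so a watchdog can poll CPU and memory. A sketch; the Ambari host, cluster name, worker host and credentials are placeholders.

```python
# Poll one worker's CPU and memory metrics from Ambari.
import requests

AMBARI = "http://ambari.example.ac.uk:8080/api/v1"   # placeholder host
CLUSTER = "bdn2"                                     # placeholder cluster name

r = requests.get(
    f"{AMBARI}/clusters/{CLUSTER}/hosts/worker01.example.ac.uk",
    params={"fields": "metrics/cpu,metrics/memory"},
    auth=("admin", "password"),
)
r.raise_for_status()
m = r.json()["metrics"]
print("cpu idle %:", m["cpu"]["cpu_idle"], "| free mem kB:", m["memory"]["mem_free"])
```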

28 Pros and cons of the cloud for Hadoop
Pros:
- Elasticity: add or remove nodes as required
- Only pay for what you use
Cons:
- Hadoop is designed as a share-nothing system
- Adding, and particularly removing, nodes is not as straightforward as in other types of cloud system
- Continuously paying for storage of big datasets
The pros are the standard ones offered for cloud computing in general; the cons explain why they are not necessarily applicable to a Hadoop system.

29 Appeal for use cases

30 Why we need data and use cases
- We are building a generalised system
- Many of the processes and procedures have not been tried before
- Need an understanding of 'typical' use needs
- Need to ensure we cater for end-to-end processing of user needs

31 What is in it for you
- Safe 24/7 repository for your data
- Access to big data processing
- Support and training

32 Questions, and offers of data
Peter Smyth
Peter.smyth@manchester.ac.uk
ukdataservice.ac.uk/help/
Subscribe to the UK Data Service news list
Follow us on Twitter or Facebook

