The Big Data Network (phase 2) Cloud Hadoop system


Presentation to the Manchester Data Science club, 14 July 2016
Peter Smyth, UK Data Service

The BDN2 Cloud Hadoop system
  The challenges of building and populating a secure but accessible big data environment for researchers in the Social Sciences and related disciplines.

Overview of this presentation
  Aims
  A simple Hadoop System
  Dealing with the data
  Processing
  Users
  Safeguarding the data and its usage
  Appeal for data and use cases
  (Leave this for Peter)

AIMS

Aims
  Provide a processing environment for big data
  Targeted at the Social Sciences - but not exclusively so
  Provide easy ingest of datasets
  Provide comprehensive search facilities
  Provide easy access to users for processing or download

Cloud Hadoop System

Cloud Hadoop System
  Start with minimal configuration
  Cloud based, so we can grow it as needed
  Adding nodes is what Hadoop is good at
  Need to provide HA (high availability) from the outset
  Resilience and user access are important
  Search facilities will be expected 24/7

Software installed - and how we will use it
  Standard HDP (Hortonworks Data Platform)
    Spark, Hive, Pig, Ambari, Zeppelin etc.
  Other Apache software
    Ranger - monitor and manage comprehensive data security across the Hadoop platform
    Knox - REST API gateway providing a single point of entry to the cluster
  Other software
    Kerberos AD integration
    Our own processes for workflows and ingest / metadata production
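As a rough illustration of what "single point of entry" means in practice, the sketch below builds a WebHDFS URL that goes through a Knox gateway rather than directly to a NameNode. The host name, the topology name "default" and the port 8443 are assumptions chosen for the example and are not taken from the BDN2 deployment; the helper function itself is hypothetical.

```python
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway_host, topology, hdfs_path, op="LISTSTATUS", port=8443):
    """Build a WebHDFS URL routed through a Knox gateway (hypothetical helper)."""
    # Knox publishes cluster services under /gateway/<topology>/<service>.
    base = f"https://{gateway_host}:{port}/gateway/{topology}/webhdfs/v1"
    # Make the HDFS path absolute and percent-encode it, keeping the slashes.
    absolute = hdfs_path if hdfs_path.startswith("/") else "/" + hdfs_path
    return f"{base}{quote(absolute, safe='/')}?{urlencode({'op': op})}"


if __name__ == "__main__":
    # Host and topology below are placeholders, not real BDN2 values.
    print(knox_webhdfs_url("knox.example.org", "default", "/data/raw"))
```

Every client request then targets one host and port, which is where authentication and auditing can be applied.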

Fitting the bits together
  [Diagram components: Hadoop System; Job Scheduling; User Access control & quotas; Data Access control & SDC; Data; Users; Performance Monitoring; Auditing and Logging]

Data

Getting the data in
  Large datasets from 3rd parties
  Existing UKDS datasets
    Not necessarily big data
    But likely to be used in conjunction with other data
  BYOD - Bring your own data
  Negotiation, contracts, conditions

How not to do it!

HDF - Hortonworks Data Flow
  Built on Apache NiFi
  Allows workflows to be built for collecting data from external sources
    Single shot datasets
    Regular updates (monthly, daily)
    Possibility of streaming data

NiFi workflow

Data storage
  Raw Data
  Metadata (Semantic Data Lake)
  Dashboards, summaries and samples
  User data
    Own datasets
    Work in progress
    Results
  Notes:
    Raw data - as the files come in
    Metadata - not only metadata but the semantic data lake contents

Semantic data lake
  Must contain everything
  There will be only one search engine
    Whether in the cloud or on-prem (secure data)
  The metadata isn't just what is extracted from the datasets and associated documentation
  Appropriate ontologies need to be used
    Not only terms but relationships between them
  Resource Description Framework or RDF
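To make "not only terms but relationships between them" concrete, here is a minimal sketch of dataset metadata expressed as RDF triples and serialised as N-Triples. The predicates come from the Dublin Core terms vocabulary; the example.org namespace, the dataset names and the licence identifier are invented for illustration and do not describe real BDN2 holdings.

```python
DCT = "http://purl.org/dc/terms/"   # Dublin Core terms vocabulary
EX = "http://example.org/bdn2/"     # hypothetical namespace for this sketch


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical datasets: a title, a licence, and a link to a related dataset.
energy = iri(EX + "dataset/energy-readings")
triples = [
    (energy, iri(DCT + "title"), literal("Household energy readings")),
    (energy, iri(DCT + "license"), iri(EX + "licence/research-only")),
    (energy, iri(DCT + "relation"), iri(EX + "dataset/household-survey")),
]

if __name__ == "__main__":
    print(to_ntriples(triples))
```

The third triple is the kind of relationship a keyword index alone would not capture: it links one dataset to another so that a search can follow the connection.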

Processing

Processing
  Ingest and curation processing
    Extracting and creating metadata
    Processing for dashboards, summaries and samples
      Samples - in advance or as requested?
  User searches
  User jobs
  Processing systems
    Spark
    Hive / Pig
  Effect of interactive environments
    Zeppelin

Job Scheduling
  Ingest related jobs
  Metadata maintenance related jobs
  User jobs
  Batch?
    Hive
    Pig
  (Near) Real time
    Spark Streaming
  What kind of delay is acceptable?
    For users
    For Operations
  Do we need to prioritise?
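If the answer to "Do we need to prioritise?" turns out to be yes, one simple option is a priority queue keyed on job class. The sketch below is only an illustration: the ordering (ingest before metadata maintenance before user jobs) is an assumption, not a decision stated in the slides, and a real cluster would normally configure this in its resource scheduler rather than in application code.

```python
import heapq
import itertools

# Hypothetical priority classes: a lower number is served first.
PRIORITY = {"ingest": 0, "metadata": 1, "user": 2}


class JobQueue:
    """Priority queue that keeps submission order within each job class."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker so equal priorities stay FIFO.
        self._counter = itertools.count()

    def submit(self, name, kind):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), name))

    def next_job(self):
        """Return the name of the next job to run, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    queue = JobQueue()
    queue.submit("user-report", "user")
    queue.submit("monthly-load", "ingest")
    print(queue.next_job())
```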

Users

User types
  Short term (try before you ‘buy’)
  Long term (researchers, 3-5 years)
  Commercial users? (in exchange for data)
  Everyone is a search user

Safeguarding Data

Security and Audit
  Who can access what data
    Making data available
    Disc quotas
    Private areas
  Who has access to resources and can run jobs
    Sandbox area for authenticated users
    Providing tools
    Levels of support
  What audit trails are maintained
    What is recorded
    How long do we keep the logs
    Will they be reviewed?

Data Ownership and Provenance
  Restrictions on use of a dataset
    Licence agreements
    Types of research permitted
  Complications due to combining
    Permissions needed
  Carrying the provenance/licence with the data in the semantic data lake
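One way to picture the "complications due to combining" is to treat each dataset's permitted research types as a set: a combined dataset can then only be used for purposes every source allows, and the provenance of each source travels with the result. The sketch below assumes that intersection rule, which the slides do not state; the record fields and dataset names are hypothetical.

```python
def combined_permissions(*datasets):
    """Research types allowed for a combination: those permitted by every source."""
    permitted = None
    for ds in datasets:
        allowed = set(ds["permitted_research"])
        permitted = allowed if permitted is None else permitted & allowed
    return permitted or set()


def combined_provenance(*datasets):
    """Carry each source's name and licence along with the combined data."""
    return [{"source": ds["name"], "licence": ds["licence"]} for ds in datasets]


# Hypothetical metadata records for two datasets.
survey = {
    "name": "household-survey",
    "licence": "research-only",
    "permitted_research": ["academic", "policy"],
}
meters = {
    "name": "energy-readings",
    "licence": "supplier-agreement",
    "permitted_research": ["academic", "commercial"],
}

if __name__ == "__main__":
    print(combined_permissions(survey, meters))
    print(combined_provenance(survey, meters))
```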

SDC - Statistical Disclosure Controls
  Currently a manual process
  Likely to be more complex as more datasets are combined
  Could just be checked on output
  Automated tools are becoming available
    But how good are they?
    Or, are they good enough?
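As an example of what an automated check on output might look like, the sketch below applies a minimum cell count rule to a frequency table: any cell with fewer than a threshold number of respondents is flagged and suppressed. The threshold of 10 and the table contents are assumptions for illustration; this covers primary suppression only and is far from a complete disclosure control process.

```python
def disclosive_cells(table, threshold=10):
    """Return the cells whose counts fall below the minimum cell size."""
    return sorted(cell for cell, count in table.items() if count < threshold)


def suppress_small_cells(table, threshold=10):
    """Return a copy of the table with small counts replaced by None."""
    return {
        cell: (count if count >= threshold else None)
        for cell, count in table.items()
    }


# Hypothetical output table: (region, age band) -> number of respondents.
output_table = {
    ("North", "18-34"): 120,
    ("North", "65+"): 4,
    ("South", "18-34"): 87,
    ("South", "65+"): 10,
}

if __name__ == "__main__":
    print(disclosive_cells(output_table))
    print(suppress_small_cells(output_table))
```

A check like this does not address secondary disclosure, where a suppressed value can be recovered from row or column totals, which is part of why the "good enough" question remains open.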

Hadoop in the Cloud

Performance monitoring
  Need to understand usage patterns
    Or try to anticipate them
  Need to be able to detect when the system is under stress - and be able to react in a timely manner
    CPU
    RAM
    HDFS
  Need to provide proper job scheduling for true batch jobs
    Cannot allow the use of Spark to result in a free-for-all
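Detecting "under stress" could be as simple as comparing recent utilisation against limits for the three resources named above. The sketch below averages a window of samples and reports which resources exceed their limit. The threshold values and the sample format are assumptions, and in practice the figures would come from the cluster's monitoring tooling rather than hand-built objects.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One utilisation reading, each value a fraction between 0 and 1."""
    cpu: float
    ram: float
    hdfs: float


# Hypothetical limits: mean utilisation above these counts as stress.
THRESHOLDS = {"cpu": 0.90, "ram": 0.85, "hdfs": 0.80}


def stressed_resources(samples, thresholds=THRESHOLDS):
    """Return the resources whose mean utilisation over the window exceeds its limit."""
    if not samples:
        return []
    stressed = []
    for name, limit in thresholds.items():
        mean = sum(getattr(s, name) for s in samples) / len(samples)
        if mean > limit:
            stressed.append(name)
    return stressed


if __name__ == "__main__":
    window = [Sample(cpu=0.95, ram=0.50, hdfs=0.70), Sample(cpu=0.97, ram=0.60, hdfs=0.70)]
    print(stressed_resources(window))
```

Averaging over a window rather than reacting to a single reading avoids alerting on short spikes, at the cost of a slower reaction.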

Pros and Cons of the Cloud for Hadoop
  Pros
    Elasticity
    Add or remove nodes as required
    Only pay for what you use
  Cons
    Hadoop is designed as a shared-nothing system
    Adding, and particularly removing, nodes is not as straightforward as in other types of cloud system
    Continuously paying for storage of big datasets
  Notes:
    The pros are the standard ones offered for cloud computing in general.
    The cons explain why they are not necessarily applicable to a Hadoop system.

Appeal for use cases

Why we need data and use cases
  Building a generalised system
  Many of the processes and procedures have not been tried before
  Need an understanding of ‘typical’ user needs
  Need to ensure we cater for end-to-end processing of the user needs

What is in it for you
  Safe 24/7 repository for your data
  Access to Big Data processing
  Support & Training

and offers of data
  Peter Smyth
  Peter.smyth@manchester.ac.uk
  ukdataservice.ac.uk/help/
  Subscribe to the UK Data Service news list at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKDATASERVICE
  Follow us on Twitter https://twitter.com/UKDataService or Facebook https://www.facebook.com/UKDataService