Presentation transcript:

DNS Big Data Analytics
Maarten Wullink, SIDN
DNS-OARC Fall 2015 Workshop, October 4th 2015

SIDN
- Domain name registry for the .nl ccTLD
- > 5.6 million domain names
- 2.45 million domain names secured with DNSSEC
- SIDN Labs is the R&D team of SIDN

DNS
- > 3.1 million distinct resolvers
- > 1.3 billion queries daily
- > 300 GB of PCAP data daily

ENTRADA: ENhanced Top-Level Domain Resilience through Advanced Data Analysis
- Goal: data-driven improved security & stability of .nl
- Problem: existing solutions for analyzing network data do not work well with large datasets and have limited analytical capabilities
- Main requirement: high-performance, near real-time data warehouse
- Approach: avoid expensive pcap analysis
  - Convert pcap data to a performance-optimized format (key)
  - Perform analysis with tools/engines that leverage that format

Requirements
- SQL support
- Scalability
- High performance
- Capacity for >1 year of DNS data
- Extensibility
- Stability
- Don’t spend too much money!

Query Engine Options: Engines galore!
Evaluated SQL and NoSQL solutions:
- Relational SQL (PostgreSQL)
- MongoDB
- Cassandra
- Elasticsearch
- Hadoop (HBase + Apache Phoenix or Hive)
- SQL on Hadoop (HDFS + Impala + Parquet)

SQL on Hadoop
- Best fit for our requirements
- [Diagram labels: HDFS; Hadoop nodes N, N+1 and N+2, each with Parquet and Impala]

HDFS
- Distributed file system for storing large volumes of data
- High availability through replication of data blocks
- Scalable to hundreds of PBs and thousands of servers

Impala
- Query engine
- MPP (massively parallel processing)
- Inspired by the Google Dremel paper
- Provides low latency and high concurrency for BI/analytic queries on Hadoop
- Excellent performance when compared to other Hadoop-based query engines

Impala (2)
- Data formats
  - Text
  - Hadoop formats
  - Apache Avro
  - Apache Parquet
- Interfaces
  - Web-based GUI
  - Command line (impala-shell)
  - Python (Impyla)
  - JDBC

Apache Parquet
- Columnar storage format
- Why not just use the PCAP files?
  - Reading (compressed) PCAP data is just too slow
  - Analytical engines cannot read PCAP files
- [Diagram: the same data shown row oriented and column oriented]
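To illustrate why a columnar layout suits analytical queries, here is a minimal Python sketch. It does not use Parquet itself; the records, field names and the simple run-length encoding are assumptions chosen only to show the idea that a query reads one column and that similar adjacent values compress well.

```python
from itertools import groupby

# Hypothetical DNS query records; field names and values are illustrative only.
rows = [
    {"qname": "example.nl", "qtype": "A", "ipv": 4},
    {"qname": "example.nl", "qtype": "A", "ipv": 4},
    {"qname": "sidn.nl", "qtype": "AAAA", "ipv": 4},
    {"qname": "sidn.nl", "qtype": "A", "ipv": 6},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregate over one field only has to scan that field's column.
ipv4_count = sum(1 for v in columns["ipv"] if v == 4)

# Values within a column tend to repeat, so simple encodings shrink them.
encoded_ipv = run_length_encode(columns["ipv"])
```

In the row-oriented form, counting IPv4 queries would mean reading every field of every record; in the column-oriented form only the `ipv` list is touched.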

Apache Parquet (2)
- Columnar storage allows for efficient encoding/compression
  - Multiple encoding schemes
  - Support for Snappy compression
- Partition data (e.g. by year, month, day and server)
- Partition pruning allows Impala to skip data we are not interested in
- Other analytical engines such as Apache Spark can use the same Parquet data
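Partition pruning can be pictured as filtering directories before any file is opened. The sketch below is plain Python, not Impala; the `key=value` directory layout and the partition names are assumptions based on the partitioning keys listed on the slide.

```python
from pathlib import PurePosixPath

# Hypothetical partition directories; the layout and names are assumptions.
partitions = [
    "queries/year=2015/month=9/day=30/server=ns1",
    "queries/year=2015/month=10/day=1/server=ns1",
    "queries/year=2015/month=10/day=1/server=ns2",
    "queries/year=2015/month=10/day=2/server=ns1",
]


def partition_keys(path):
    """Parse the key=value path segments into a dict of partition values."""
    parts = PurePosixPath(path).parts
    return dict(p.split("=", 1) for p in parts if "=" in p)


def prune(paths, **wanted):
    """Keep only the partitions whose key values match every requested filter."""
    return [
        p for p in paths
        if all(partition_keys(p).get(k) == str(v) for k, v in wanted.items())
    ]


# Only two of the four directories need to be read for this day.
selected = prune(partitions, year=2015, month=10, day=1)
```

A query restricted to one day then reads only the matching directories, which is the effect the slide attributes to partition pruning.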

ENTRADA Architecture
- ‘DNS big data’ system
- Goal: develop applications and services that further enhance the security and stability of .nl, the DNS, and the Internet at large
- ENTRADA main components:
  - Applications and services
  - Platform
  - Data sources
  - Privacy framework

ENTRADA Privacy Framework
- Policy elements:
  - Purpose
  - Data that is used
  - Filters on the data
  - Retention period
  - Access to the data
  - Type of application (Research vs. Production)
- Download paper:

Cluster Design
[Diagram labels: management node (nano sized); data nodes at location I, location II and location III; 2 Gb/s network]

Hardware
- Management node: HP ProLiant DL380, 64 GB RAM, Xeon 1.9 GHz 12-core CPU, 3 TB storage
- Data node: HP ProLiant DL380, 64 GB RAM, Xeon 1.9 GHz 12-core CPU, 6 TB storage
- Scaling:
  - Vertical, by adding more resources
  - Horizontal, by adding more data nodes

Workflow
- Query data available for analysis within 10 minutes
- [Diagram: name server -> PCAP staging -> PCAP decode -> Import -> Hadoop (Parquet, Impala) -> Analyst. Other labels: Enrich, Join, Filter, Monitoring, Metrics]
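The slide names the Join, Filter and Enrich steps without describing their logic. The Python sketch below is one possible reading, under stated assumptions: "join" pairs a query with its response on a (client, id) key, "filter" drops rows without a query name, and "enrich" derives the IP version from the client address. The packet fields and all three functions are hypothetical and are not taken from ENTRADA.

```python
# Hypothetical decoded packets; field names and values are illustrative only.
packets = [
    {"id": 1, "client": "192.0.2.10", "kind": "query", "qname": "example.nl"},
    {"id": 1, "client": "192.0.2.10", "kind": "response", "rcode": 0},
    {"id": 2, "client": "2001:db8::1", "kind": "query", "qname": "sidn.nl"},
    {"id": 3, "client": "192.0.2.10", "kind": "response", "rcode": 3},
]


def join(decoded):
    """Pair each query with the response that shares its (client, id) key."""
    pending, joined = {}, []
    for pkt in decoded:
        key = (pkt["client"], pkt["id"])
        if pkt["kind"] == "query":
            pending[key] = pkt
        elif key in pending:
            query = pending.pop(key)
            joined.append({**query, "rcode": pkt["rcode"]})
    # Queries that never received a response are kept with rcode None.
    joined.extend({**q, "rcode": None} for q in pending.values())
    return joined


def keep(row):
    """Filter step: drop rows that carry no query name."""
    return bool(row.get("qname"))


def enrich(row):
    """Enrich step: derive the IP version from the client address."""
    return {**row, "ipv": 6 if ":" in row["client"] else 4}


rows = [enrich(r) for r in join(packets) if keep(r)]
```

In this sketch the response with id 3 has no matching query and is discarded during the join, so two rows reach the import stage.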

Performance
- 1 year of data is 2.2 TB of Parquet, ~ 52 TB of PCAP
- Example query, counting the number of IPv4 queries per day:

    select concat_ws('-', day, month, year), count(1)
    from dns.queries
    where ipv=4
    group by concat_ws('-', day, month, year)

- [Chart: query response times]
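To try the shape of this query without a cluster, the following sketch runs an equivalent statement against an in-memory SQLite database. SQLite is only a stand-in for Impala here: the table contents are made-up sample rows, the `queries` table has no `dns.` schema prefix, and the `||` operator replaces `concat_ws`.

```python
import sqlite3

# In-memory SQLite stands in for Impala; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queries (year INT, month INT, day INT, ipv INT)")
conn.executemany(
    "INSERT INTO queries VALUES (?, ?, ?, ?)",
    [
        (2015, 10, 1, 4),
        (2015, 10, 1, 4),
        (2015, 10, 1, 6),
        (2015, 10, 2, 4),
    ],
)

# Same shape as the slide's query: count IPv4 queries per day.
per_day = conn.execute(
    """
    SELECT day || '-' || month || '-' || year AS day_label, count(1)
    FROM queries
    WHERE ipv = 4
    GROUP BY day || '-' || month || '-' || year
    ORDER BY day_label
    """
).fetchall()
conn.close()
```

The IPv6 row is excluded by the `WHERE` clause, and the remaining rows are grouped into one count per day label.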

ENTRADA Status
Name server feeds               1
Queries per day                 ~150M
Daily PCAP volume (gzipped)     ~33 GB
Daily Parquet volume            ~6 GB
Months operational              18
Total # queries stored          > 71B
Total Parquet volume            > 3 TB
HDFS (3x replication)           > 9 TB
Cluster capacity                ~150B-200B tuples
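The status figures relate to each other through simple arithmetic, sketched below in Python. The inputs are the approximate values from the slide; the 30-day month used for the query estimate is an assumption.

```python
# Figures copied from the status slide; all are approximate ("~" or ">").
daily_pcap_gb = 33          # gzipped PCAP per day
daily_parquet_gb = 6        # Parquet per day
total_parquet_tb = 3        # Parquet stored so far
replication_factor = 3      # HDFS replication
queries_per_day = 150e6
months_operational = 18

# Parquet is about 5.5 times smaller than the gzipped PCAP it was built from.
size_ratio = daily_pcap_gb / daily_parquet_gb

# Every HDFS block is stored three times, so raw disk usage triples.
hdfs_usage_tb = total_parquet_tb * replication_factor

# Rough estimate of stored queries at today's rate, assuming 30-day months.
estimated_queries = queries_per_day * months_operational * 30
```

The estimate of about 81 billion queries is of the same order as the more than 71 billion reported, and the replicated volume matches the slide's figure of more than 9 TB on HDFS.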

Use Cases
Focused on increasing the security and stability of .nl
- Visualize DNS patterns (visualize traffic patterns for phishing domain names)
- Detect botnet infections
- Real-time phishing detection
- Statistics (stats.sidnlabs.nl)
- Scientific research (collaboration with Dutch universities)
- Operational support for DNS operators

Example Applications
- DNS security scoreboard
- Resolver reputation

DNS Security Scoreboard
- Goal: visualize DNS patterns for malicious activity
- How: combine external phishing feeds with DNS data
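Combining a phishing feed with DNS data amounts to matching feed entries against observed query names. The Python sketch below shows that idea only; the feed contents, the query records and the two-label `registered_domain` helper are hypothetical, and the helper assumes names registered directly under .nl.

```python
from collections import Counter

# Hypothetical phishing feed and DNS query log; all names are illustrative.
phishing_feed = {"bad-shop.nl", "login-bank.nl"}
dns_queries = [
    {"qname": "www.bad-shop.nl.", "resolver": "192.0.2.1"},
    {"qname": "example.nl.", "resolver": "192.0.2.1"},
    {"qname": "bad-shop.nl.", "resolver": "198.51.100.7"},
    {"qname": "login-bank.nl.", "resolver": "192.0.2.1"},
]


def registered_domain(qname):
    """Reduce a query name to its last two labels, e.g. www.a.nl. -> a.nl."""
    labels = qname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


# Count queries per flagged domain by matching the feed against the query log.
hits = Counter(
    registered_domain(q["qname"])
    for q in dns_queries
    if registered_domain(q["qname"]) in phishing_feed
)
```

Per-domain counts like these could then feed a visualization of traffic toward reported phishing domains.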

Architecture
[Diagram labels: Security feed I, Security feed II, Event Analyzer, Hadoop, PostgreSQL, REST API, Web UI. Flow labels: new event, enrich with DNS data, save enriched event, retrieve event data]

Traffic Visualization

Resolver Reputation (RESREP)
- Goal: try to detect malicious activity by assigning reputation scores to resolvers
- How: “fingerprinting” resolver behaviour

RESREP Concept
- Malicious activity:
  - Spam runs
  - Botnets like Cutwail
  - DNS amplification attacks
- [Diagram labels: ISP, resolvers, DNS questions and responses, authoritative DNS .nl, .nl registry]

RESREP Architecture
[Diagram labels: user, ISP network, resolvers, root operator, child operator (example.nl), HTTP, ENTRADA platform, RESREP service, RESREP privacy policy, .nl Privacy Board, AbuseHUB, abuse desk]

Conclusions
- Technical: Hadoop HDFS + Parquet + Impala is a winning combination!
- Contributions:
  - Research by SIDN Labs and universities
  - Identified malicious domain names and botnets
  - External data feed to the Abuse Information Exchange
  - Insight into DNS query data

Future Work
- Combine data from the .nl authoritative name server with scans of the complete .nl zone and ISP data
- Get data from more name servers and resolvers
- Expand the Open Data program

Questions and Feedback
Maarten Wullink, Senior Research
stats.sidnlabs.nl