Streaming Analytics with Spark
Magnoni Luca, IT-CM-MM
09/02/16, EBI - CERN meeting

CERN
Provide monitoring facilities for CERN Data Centers and WLCG Infrastructure
Example raw monitoring record:
{"unique_id":"30fbed9e-975b-11e b82e4a9beef-4 ef6f2e", "file_lfn":"/store/mc/Fall-foo/QCD_Pt-80to1220_Tune4C_13TeV_pythia8/GEN-SIM-RAW/castor_tsg_40bx25_POSTLS32162_V2-v1/20000/6C4FDD E3311-9FC2-3":"0", "read_max":"0", "read_average":" ", "read_sigma":" ", "read_single_bytes":"0", "read_single_operations":"0", "read_single_min":"0", "read_single_max":"0", "read_single_average":" ", "read_single_sigma":" ", "read_vector_bytes":"0", "read_vector_operations":"0", "read_vector_min":"0", "read_vector_max":"0", "read_vector_average":" ", "read_vector_sigma":" ", "read_vector_count_min":"0", "read_vector_count_max":"0", "read_vector_count_average":" ", "read_vector_count_sigma":" ", "write_bytes":"0", "write_operations":"0", "write_min":"0", "write_max":"0", "write_average":" ", "write_sigma":" ", "read_bytes_at_close":"0", "write_bytes_at_close":"0", "user_dn":"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=foo/CN= /CN=HomerSimpson", "user_vo":"none", "user_role":"NULL", "user_fqan":"/none", "client_domain":"springfield.us", "client_host":"", "server_username":"", "user_protocol":"xroot", "app_info":"39_https://glidein.cern.ch/39/150318:193421:capalmer:crab:QCD80:March18_039_https://glidein.cern.ch/39/1318:19342:crab:QCD80:March18_0", "server_domain":"brunel.ac.uk", "server_host":"dc-grid-pool-a4-03", "server_site":"UKI-LT2-Brunel"}
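Records like the one above carry every value as a string, and some numeric fields are blank. As a minimal sketch of the kind of cleanup a processing job might need (not the code used at CERN), the following plain-Python function decodes one record and converts a few numeric fields. The field names are taken from the sample; the shortened record `RAW`, its values, and the helper `parse_record` are assumptions made for illustration.

```python
import json

# Hypothetical, shortened record: field names come from the sample above,
# the values are invented for illustration.
RAW = (
    '{"read_bytes_at_close": "1024", "write_bytes_at_close": "0", '
    '"read_average": " ", "user_protocol": "xroot", '
    '"server_site": "UKI-LT2-Brunel"}'
)

# Fields expected to hold numbers (an assumption based on their names).
NUMERIC_FIELDS = ("read_bytes_at_close", "write_bytes_at_close", "read_average")


def parse_record(raw):
    """Decode one JSON record and convert numeric strings to floats.

    Blank or malformed numeric values become None instead of raising.
    """
    record = json.loads(raw)
    for field in NUMERIC_FIELDS:
        value = str(record.get(field, "")).strip()
        try:
            record[field] = float(value) if value else None
        except ValueError:
            record[field] = None
    return record


parsed = parse_record(RAW)
print(parsed["server_site"], parsed["read_bytes_at_close"], parsed["read_average"])
```

Mapping blank values to `None` rather than zero keeps "no measurement" distinguishable from a measured zero in later aggregation.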

Monitoring Architecture & Tools
Diagram stages: Data Sources, Storage, Processing, Transport, View
Data sources: Data Centers (~15K nodes), WLCG Sites & Services
Tools shown: Kafka, Elasticsearch

Apache Spark
Distributed large-scale data processing engine
Most active Apache project in 2014
Simpler and faster than Hadoop/MapReduce
One framework for batch, streaming, SQL, iterative, ...
Timeline (diagram):
~2004: MapReduce (general batch processing)
Specialized systems (iterative, interactive, ML, streaming, ...): Storm, Giraph, Dremel, Pig, Tez, Mahout, ...
~2014: Spark (general unified engine)

Spark Streaming
Distributed analysis of data streams
Micro-batch computation
Scalable, high-throughput, fault-tolerant
Processing latency "as low as" a few seconds
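To make "micro-batch computation" concrete, here is a plain-Python sketch of the idea only; it does not use the Spark Streaming API. Incoming events are cut into consecutive fixed-length time intervals, and each interval is then processed as a small batch. The function name `micro_batches`, the event tuples, and the 2-second interval are assumptions for illustration.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Returns a list of (batch_start, [values]) sorted by batch start time.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event falls into the interval that contains its timestamp.
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        batches[batch_start].append(value)
    return sorted(batches.items())


# Hypothetical events: (seconds since start, payload).
events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (4.9, "d"), (5.0, "e")]
result = micro_batches(events, batch_seconds=2)
for batch_start, values in result:
    print(batch_start, values)
```

The batch interval bounds the latency from below: a result for an interval can only be produced once that interval has closed, which is consistent with latencies of a few seconds rather than milliseconds.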

Data Transport for Streaming
Streaming relies on transport to serve data as it is produced, at scale and speed
At CERN:
Collection/aggregation of metrics and logs: Flume, one agent on each data center node
Reliable and scalable data consumption: Kafka, as a distributed high-volume publish/subscribe messaging system
External producers: ActiveMQ (AMQ) over STOMP
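The publish/subscribe model mentioned above can be illustrated with a toy in-memory sketch. This is not the Kafka client API; the class `Topic` and its methods are hypothetical. It shows only one property of a log-based messaging system: producers append to a shared log, and each consumer keeps its own read position, so independent consumers can read the same data at their own pace.

```python
class Topic:
    """Toy append-only message log with one read offset per consumer."""

    def __init__(self):
        self._log = []       # messages in arrival order
        self._offsets = {}   # consumer name -> index of next unread message

    def publish(self, message):
        self._log.append(message)

    def poll(self, consumer, max_messages=10):
        """Return up to max_messages unread messages for this consumer."""
        start = self._offsets.get(consumer, 0)
        messages = self._log[start:start + max_messages]
        self._offsets[consumer] = start + len(messages)
        return messages


topic = Topic()
for metric in ("cpu=0.9", "cpu=0.4", "disk=0.7"):
    topic.publish(metric)

first_a = topic.poll("dashboard", max_messages=2)  # reads two messages
all_b = topic.poll("alarms")                       # independent consumer, reads all three
second_a = topic.poll("dashboard")                 # resumes where it stopped
print(first_a, all_b, second_a)
```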

Spark at CERN: WLCG Data Transfers dashboards

Spark at CERN: WLCG Data Transfers dashboards
~100 GB/day of monitoring data (data transfers and access logs) to be gathered and processed
Compute and aggregate over time the data transfer activities across WLCG sites from raw operation logs
Data flow (diagram):
Sources: ATLAS DDM, FTS, XRootD, HTTP
Transport: AMQ / Flume / Kafka (raw JSON)
Processing: Spark batch & streaming jobs on CERN IT Hadoop
Job steps: parse & transform, compute statistics, merge with previous results
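The three job steps named above (parse and transform, compute statistics, merge with previous results) can be sketched in plain Python. This illustrates the pattern only, not the actual Spark jobs: the field names `src_site`, `dst_site`, `timestamp` and `bytes`, the one-hour time bin, and the sample values are all assumptions.

```python
from collections import Counter


def compute_stats(transfers, bin_seconds=3600):
    """Sum transferred bytes per (source site, destination site, time bin)."""
    stats = Counter()
    for t in transfers:
        # Round the timestamp down to the start of its time bin.
        time_bin = t["timestamp"] - t["timestamp"] % bin_seconds
        stats[(t["src_site"], t["dst_site"], time_bin)] += t["bytes"]
    return stats


def merge(previous, new):
    """Add the statistics of a new batch to previously stored results."""
    merged = Counter(previous)
    merged.update(new)  # Counter.update adds values for matching keys
    return merged


# Hypothetical earlier result and a new batch of already parsed records.
previous = Counter({("SITE_A", "SITE_B", 0): 100})
batch = [
    {"src_site": "SITE_A", "dst_site": "SITE_B", "timestamp": 120, "bytes": 50},
    {"src_site": "SITE_A", "dst_site": "SITE_C", "timestamp": 3700, "bytes": 10},
]
merged = merge(previous, compute_stats(batch))
print(dict(merged))
```

Because summing is associative, the same merge step works whether the new statistics come from a batch job over a day of logs or from a short streaming interval.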

Spark at CERN: Live analytics
Explore metrics and logs from data centers as they flow
Process the raw information to correlate and make sense of the overall behavior (e.g. multi-node alarms)
Notebooks as a service
Data flow (diagram):
Sources: sensors, metrics
Transport: Flume / Kafka
Processing: Spark Streaming jobs (filter, aggregate, detect patterns, generate alarms)
Zeppelin notebook: easy and interactive analytics
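A multi-node alarm of the kind mentioned above can be sketched as filter, aggregate, detect: keep only samples above a threshold, collect the affected nodes per time window, and raise an alarm when enough distinct nodes are affected in the same window. This is a plain-Python illustration under assumed names and thresholds (`multi_node_alarms`, `value`, `node`, 60-second windows), not the logic deployed at CERN.

```python
def multi_node_alarms(samples, threshold, min_nodes, window_seconds):
    """Return (window_start, nodes) for windows where at least min_nodes
    distinct nodes reported a value above threshold."""
    nodes_per_window = {}
    for s in samples:
        if s["value"] <= threshold:
            continue  # filter: ignore healthy samples
        window = s["timestamp"] - s["timestamp"] % window_seconds
        nodes_per_window.setdefault(window, set()).add(s["node"])  # aggregate
    return [
        (window, sorted(nodes))
        for window, nodes in sorted(nodes_per_window.items())
        if len(nodes) >= min_nodes  # detect: several nodes in one window
    ]


# Hypothetical load samples from three nodes.
samples = [
    {"node": "n1", "timestamp": 5, "value": 0.95},
    {"node": "n2", "timestamp": 20, "value": 0.97},
    {"node": "n1", "timestamp": 30, "value": 0.99},  # same node again
    {"node": "n3", "timestamp": 70, "value": 0.96},  # alone in its window
    {"node": "n2", "timestamp": 80, "value": 0.10},  # below threshold
]
alarms = multi_node_alarms(samples, threshold=0.9, min_nodes=2, window_seconds=60)
print(alarms)
```

Counting distinct nodes rather than samples keeps one noisy machine from triggering an alarm that is meant to signal a problem shared across nodes.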

More
CERN IT Analytics Working Group
Offline analysis and data exploration
LHC Experiments activities
ATLAS study on data scrutiny and popularity

Thank you!