Our experience with NoSQL and MapReduce technologies. Fabio Souto, Grid Technology, CERN IT Department, CH-1211 Geneva 23, Switzerland (www.cern.ch/it).

Presentation transcript:

Our experience with NoSQL and MapReduce technologies
Fabio Souto
IT Monitoring Working Group, 19th September 2011

Outline
- Objective
- Big data technologies
- Technologies reviewed
- Deployed infrastructure
- Current status
- Lessons learned

Problem and goal
- The SAM infrastructure for WLCG
  – monitors 400 sites and ~2,000 services daily
  – receives and stores ~600,000 metric results daily
  – computes statuses and hourly availabilities for services and sites
- SWAT is a system to gather information about the configuration of worker nodes (WNs)
- Massive data generation, making storage, search, sharing, analytics and visualization difficult
- Objective: proof of concept using big data technologies
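To make "hourly availabilities" concrete, here is a minimal sketch in plain Python. It assumes, purely for illustration, that availability is the fraction of metric results with status OK per service and hour. The slides do not give the actual SAM algorithm, and the tuple layout, the function name `hourly_availability` and the host name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def hourly_availability(results):
    """Return {(service, hour_start): fraction of results with status 'OK'}.

    `results` is an iterable of (service, iso_timestamp, status) tuples.
    This definition of availability is an assumption, not the SAM algorithm.
    """
    ok = defaultdict(int)
    total = defaultdict(int)
    for service, timestamp, status in results:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        key = (service, hour)
        total[key] += 1
        if status == "OK":
            ok[key] += 1
    return {key: ok[key] / total[key] for key in total}


# Hypothetical sample data: three metric results for one service.
sample = [
    ("srm.example.org", "2011-09-19T10:05:00", "OK"),
    ("srm.example.org", "2011-09-19T10:35:00", "CRITICAL"),
    ("srm.example.org", "2011-09-19T11:10:00", "OK"),
]
availability = hourly_availability(sample)
```

At ~600,000 results per day a single process like this would still cope. The proof of concept is about whether the same aggregation can be expressed on a distributed platform.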

Big Data Technologies
- NoSQL databases
  – Not relational. Schema free.
  – Distributed
  – High availability
- MapReduce
  – Framework for processing huge datasets on clusters of computers
  – Takes advantage of data locality: moving computation is more efficient than moving data
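The MapReduce model can be sketched in a few lines of single-process Python. This is not Hadoop code: it only shows the map, shuffle and reduce phases that a framework would distribute across a cluster. The record layout (service, status) and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one ((service, status), 1) pair per input record."""
    service, status = record
    yield (service, status), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory dataset."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input: (service, status) metric results.
records = [("ce01", "OK"), ("ce01", "OK"), ("ce01", "CRITICAL"), ("se02", "OK")]
counts = run_job(records)
```

In a real cluster the map tasks run on the nodes that already hold the input blocks, which is the data locality point made on the slide.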

Technologies reviewed
- NoSQL databases: ~140 different solutions; we focused on:
  – MongoDB
      - No durability (at the time of the study)
  – Cassandra
      - No single point of failure
      - Big and responsive community
- Apache Hadoop
  – Big data de facto standard
  – Framework for data-intensive applications
  – Used to write MapReduce jobs for Cassandra

Technologies reviewed II
- Hive and Pig
  – Ease the complexity of writing MapReduce
  – Initially not considered
      - Less efficient than pure Hadoop
  – Independent of the data source
      - We can change to HBase easily
  – Hive: SQL-like syntax
  – Pig: data flow language
      - Is not Turing complete (no loops, if-else…)
      - But can be embedded into Python code
      - It's possible to write custom functions in Python/Java

Technologies reviewed III
- Hue
  – Set of Django apps to interact with Hadoop
- OpenTSDB
  – Open source time series database
  – Lack of flexibility
- Oozie
  – Job scheduler and workflow engine for Hadoop

Other Tools
- Msg-consume2db inserter:
  – WLCG Messaging infrastructure -> NoSQL
- sql2nosql-sync
  – SAM Oracle DB -> NoSQL
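As a rough illustration of what an inserter between a messaging system and a column-oriented NoSQL store has to do, the sketch below maps one message to a row key, a column name and a value. Everything specific here is an assumption: the JSON message format, the field names, the `service:metric` row key and the in-memory dictionary that stands in for the database. The slides do not describe the real tool at this level.

```python
import json
from collections import defaultdict


def to_row(message_body):
    """Map one (hypothetical) JSON message to (row_key, column, value)."""
    msg = json.loads(message_body)
    # One row per service and metric, one column per timestamp.
    row_key = f"{msg['service']}:{msg['metric']}"
    return row_key, msg["timestamp"], msg["status"]


def insert(store, message_body):
    """Write the message into `store`, a dict-of-dicts stand-in for a column family."""
    row_key, column, value = to_row(message_body)
    store[row_key][column] = value


# In-memory stand-in for the NoSQL store; a real inserter would use a database client.
store = defaultdict(dict)
insert(store, json.dumps({
    "service": "ce01.example.org",
    "metric": "job-submit",
    "timestamp": "2011-09-19T10:05:00Z",
    "status": "OK",
}))
```

Keying rows by service and metric keeps all results for one time series together, which suits range reads over time. Other layouts are equally plausible.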

Actual infrastructure
- Deployed infrastructure

Actual infrastructure

Current status
- SAM
  – DONE: running infrastructure that reads messaging and SAM data and launches Pig jobs to calculate availability.
  – TODO:
      - Results tuning
      - Web interface to visualize the results
      - JSON/XML API to extract results
      - Unit testing
- SWAT
  – Early stage of development (~6 days)
  – Data collection

Lessons learned
- Use an abstraction layer on top of Hadoop
  – Writing pure MapReduce Hadoop apps is difficult and time-consuming
- Choose a solution with a responsive community:
  – The technology is at an early stage (unresolved bugs, undocumented functions); you will need to get in touch with developers/users
- Big data needs a big platform

Lessons learned
- Must keep up to date. New companies, technologies and tools are emerging
  – Twitter real-time Hadoop about to be released
  – Cascalog, Hadoop data mining language
  – Big data distributions: Cloudera, DataStax, MapR…

Questions?