Elasticsearch – An Open Source Log Analysis Tool
Rob Appleyard and James Adams, STFC
Application-Level Logging for a Large Tier 1 Storage System

Introduction
First, a little about what we do:
–RAL = the UK's LHC Tier 1 site
–CASTOR for LHC storage
 –CERN Advanced Storage manager
 –Disk & tape
 –My responsibility
 –A domain-specific solution developed at CERN for WLCG

CASTOR Logs
CASTOR is a complex system…
…and produces a lot of logging information
–From the application daemons, not the system daemons.

CASTOR Logs

–~2 GB/day from the node I showed (highest volume)
–~30 GB/day collected overall
–~200 source nodes
–~70,000,000 log events/day

Where does it all come from?
CASTOR logs each interaction between the various components
–…in great detail.
The window to the right shows 10 lines of logging from one daemon.
The time period covered is ~0.07 s.

One Log Message…
T11:02: :00 lcgcstg01 stagerd[22773]: LVL=Info TID=22822 MSG="Request moved to Wait" REQID=45bea7cd-acb1-4d1f-a66f-45aa41663c3a NSHOSTNAME=cexperimentlsf.ads.rl.ac.uk NSFILEID= SUBREQID=0fd d07-ff31-e053-05b6f6821b16 Type="StagePutDoneRequest" Filename="/castor/ads.rl.ac.uk/prod/experiment/prodInput/proddata/data/datastore/ff/ff/datafile.data" Username="experiment001" Groupname="experiment" SvcClass="experimentInput"

One Log Message…
T11:02: :00 lcgcstg01 stagerd[22773]: LVL=Info TID=22822 MSG="Request moved to Wait" REQID=45bea7cd-acb1-4d1f-a66f-45aa41663c3a NSHOSTNAME=cexperimentlsf.ads.rl.ac.uk NSFILEID= SUBREQID=0fd d07-ff31-e053-05b6f6821b16 Type="StagePutDoneRequest" Filename="/castor/ads.rl.ac.uk/prod/experiment/prodInput/proddata/data/datastore/ff/ff/datafile.data" Username="experiment001" Groupname="experiment" SvcClass="experimentProdInput"

What's wrong with the old way?
Most CASTOR logs are like this.
The files are big, but they're easy to parse…
–So why not just use normal UNIX commands? grep, awk, sed, etc.
–With modern hardware, a 5-million-line logfile can be grepped in a reasonable time (<1 minute).
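As a sketch of that traditional approach (the file contents, request IDs, and messages below are invented for illustration, not real CASTOR data):

```shell
# Build a tiny sample log in the CASTOR KEY=value style, then trace
# one request with plain UNIX tools. All IDs here are made up.
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
2014-09-01T11:02:01+01:00 lcgcstg01 stagerd[22773]: LVL=Info REQID=aaaa-1111 MSG="Request moved to Wait"
2014-09-01T11:02:02+01:00 lcgcstg01 stagerd[22773]: LVL=Info REQID=bbbb-2222 MSG="Unrelated request"
2014-09-01T11:02:03+01:00 lcgcstg01 stagerd[22773]: LVL=Info REQID=aaaa-1111 MSG="Request scheduled"
EOF

# Every event for one request, in time order:
grep 'REQID=aaaa-1111' "$logfile"

# Just the human-readable messages, extracted with grep -o:
grep 'REQID=aaaa-1111' "$logfile" | grep -o 'MSG="[^"]*"'

rm -f "$logfile"
```

On a single node this is perfectly workable, which is exactly the point of the slide above.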

What's wrong with the old way?
Our system is distributed!
–Multiple management nodes
–Multiple storage nodes
Grepping on one node? OK (more or less).
Grepping for the same string across a 200+ node system? No.

The First Solution – DLF
DLF = 'Distributed Logging Facility'
A CERN-developed monitoring system for CASTOR
Stores all the log information in a big Oracle DB
Source: 'CASTOR end-to-end monitoring', T. Rekatsinas et al.

Searching DLF
DLF offers a CASTOR-customised search function
–Which is pretty neat!
–The problem is…
–…that searches take…
–…a very…
–…very…
–…long…
–…time.

Running DLF
Scalability was a killer.
–By 2013, simple queries were taking >1 hour.
–The fundamental architecture couldn't cope.

The Hunt for Better
CERN's solution used the Hadoop stack and an Apollo message broker…
…and a lot of bespoke Python.

The Hunt for Better
This didn't work for our use case.
–We tried adapting it…
–…but we just ended up spending ages hacking the Python.

Plan B
Our problems are not unique.
–There are some really nifty off-the-shelf solutions to these issues…
–Let's see if they scale!
Spoiler: they do.

The ELK Stack
ELK stack =
–Elasticsearch
–Logstash
–Kibana
3 separate pieces of software
–But they are designed to fit together.
URL for (recently renamed) developer:

Logstash
Sequence:
–Data arrives in format A…
–…process B occurs…
–…data goes out in format C.
In our case this is:
–CASTOR nodes send log messages in
–JSON-ise them
–Send to Elasticsearch
(Screengrab from Logstash documentation)
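A minimal Logstash pipeline in this spirit might look like the sketch below. This is illustrative, not our production configuration: the port and Elasticsearch host are invented, and option names vary between Logstash versions (later releases use `hosts` rather than `host` in the elasticsearch output).

```conf
input {
  # Receive log lines forwarded from the CASTOR nodes
  syslog { port => 5514 }
}
filter {
  # CASTOR messages are KEY=value / KEY="quoted value" pairs;
  # the kv filter splits them into structured fields
  kv { }
}
output {
  # Ship the resulting JSON documents to Elasticsearch
  elasticsearch { host => "es.example.org" }
}
```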

Elasticsearch
A distributed RESTful search and analytics engine
Built on Apache Lucene
–Apache 2 licence
Behind the scenes: shard-based storage
–The admin defines the number of primary/replica shards
–Users don't need to think about this.
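As a flavour of that RESTful interface, a search is an HTTP request with a JSON query body. The sketch below is hypothetical: the index name, field name, and request ID are invented for illustration.

```
GET /castor-2014.09.01/_search
{
  "query": {
    "query_string": { "query": "REQID:aaaa-1111" }
  },
  "size": 50
}
```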

Kibana
A web-based data visualisation system
–Lucene query syntax
–Lots of pretty graphs
–Heavily integrated with Elasticsearch
ES indices are per-day.

Cool Kibana Plots (1)

Cool Kibana Plots (2)

Sysadmin Use Cases
Common questions:
–"What happened to this user's request?"
–"When did we first see an ORA-5555 error message?"
–"Tell me everything you have from the past 5 days about the file with ID= "
We get the answers fast and in a useful format.
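As rough sketches, those questions might translate into Kibana/Lucene queries like the ones below. The field names assume a kv-style Logstash parse, and all values are illustrative, not real identifiers.

```
REQID:"45bea7cd-acb1-4d1f-a66f-45aa41663c3a"   (trace one user's request)
castor_MSG:"ORA-5555"                          (then sort ascending by time for the first sighting)
NSFILEID:12345678                              (plus a 5-day time filter set in Kibana)
```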

Sysadmin Use Cases

Challenges Encountered (1)
CASTOR's logging conventions are sometimes messy…
168 lines of code in Logstash to sort out changing field names, case variations, typos, etc.
–3 different field names used for a file's ID
–'CASTOR_Pivilege' [sic]
Lucene query syntax needs to be learned
–Handling of quotation marks is odd
–Simple sample query: castor_MSG: "Marking transfer as scheduled"

Challenges Encountered (2)
Load on hardware is non-trivial
–Currently running on 10 (obsolete) batch nodes…
Raw messages are not stored.
Tuning proved difficult
–None of the published HOWTOs deal with working at this scale.
–We are very happy to discuss our experiences and offer advice.

Other Uses
Application log search is just our use case.
Others:
–Syslog search
–Logging from the Condor batch farm
–An open big-data analysis service for other uses

Conclusion
Elasticsearch fits our requirements very well.
–Powerful
–Cheap to run
–Quick querying, even at high scale
If you need to manage logs from distributed sources, you should try this!

Any Questions?