1
TI-1640: Achieving High Availability with IBM Connections - Insights from the Experts
2
Who we are: Jay Boyd & Miki Banatwala
3
What we hope you take away
Insight into how we run our deployments
Introduction to recommended Connections deployment options and how to grow a deployment. It's more than just how you deploy!
Introduction to monitoring areas and actions
Tips on key things to monitor in Connections
4
What is High Availability?
Minimizing downtime as much as possible
No single point of failure
Scalability
Geo-redundancy
Minimizing data loss

HA in 9's – downtime per year (*unplanned outages):
99%       3.65 days
99.9%     8.76 hours
99.99%    52.56 minutes
99.999%   5.26 minutes
99.9999%  31.5 seconds
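The figures in the table are just arithmetic on the availability percentage. A minimal sketch (not part of the original deck) that reproduces the table, assuming a 365-day year and unplanned outages only:

```python
# Illustration only: downtime per year = (1 - availability) * seconds in a year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def downtime_per_year(availability_pct):
    """Return unplanned downtime per year, in seconds, for a given availability %."""
    return (1.0 - availability_pct / 100.0) * SECONDS_PER_YEAR

for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    secs = downtime_per_year(pct)
    if secs >= 86400:
        print("%-9s %.2f days" % (str(pct) + "%", secs / 86400))
    elif secs >= 3600:
        print("%-9s %.2f hours" % (str(pct) + "%", secs / 3600))
    elif secs >= 60:
        print("%-9s %.2f minutes" % (str(pct) + "%", secs / 60))
    else:
        print("%-9s %.1f seconds" % (str(pct) + "%", secs))
```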
5
Win the Day!
6
Connections Deployments
Internal (w3-connections):
Serves 600,000+ users
From a central deployment
Single tenant

Cloud:
Serves a lot more users
From 3 data centers
Multi-tenant
Full collaboration solution
7
W3 – by the numbers
8
W3 by the numbers – HTTP Transaction Volume
54 million HTTP requests per day

Requests per day by feature:
Activities    2,200,000
Blogs         2,000,000
Common        4,700,000
Communities   4,800,000
Files         1,900,000
Forums          660,000
News            580,000
Profiles      5,800,000
Wikis           200,000

Unique users daily
Bursts of over 1,000 requests/sec
830 requests/sec peak
9
W3 by the numbers – HTTP Transaction Volume
10
W3 by the numbers – Content Volume
Dedicated DB2 instances total just over 2 TB
Growth is just under 2% in 6 months
DBs in the "others" category range from GB in size
11
W3 by the numbers – Content Volume
External content is made up of individual files stored on a network file share
Growth is just under 3% per month
Total content weighs in just under 65 TB
12
High Availability (Deployment Perspective)
Growing the Connections deployment from a dev-owned deployment to a larger, company-critical service:
Start small (let the team use it) and grow to 500k users using the product on a daily basis
Best practices come from keeping the site open for the past 8 years
Did not start with everything HA – learned a lot as we went, got to critical mass on HA before making it available to the full user base
Base deployment topology characteristics of Connections in HA are known and published – follow them!
13
Logical Deployment Architecture – IBM W3
14
High Availability – what’s your DR story?
Recovery Time Objective (RTO) – how long does it take to recover?
Recovery Point Objective (RPO) – how much data can we lose?
16
Running Connections at IBM W3
Hosted in an IBM Level 1 enterprise data center (AHE Boulder)
Entire environment is 64-bit, running on SUSE (SLES) v11.2 virtual machines on IBM mainframes
84 App Servers: 6 nodes, each with approx GHZ with 64.5 GB RAM
1 Connections application per JVM on each node (14 JVMs)
11 DB2 Servers: SUSE 11.2 running DB2 v10.1
Each application has a dedicated DB2 server
Range from equivalence of 2 to GHZ
DB storage is located on solid state drives
LDAP is shared/common infrastructure
Shared storage for content stores is NFS based using the IBM GSA cell (GPFS based)
17
Key Decisions – Virtualization & Separation
Do not over-allocate resources; virtual memory requirements should not over-commit physical resources
Allows dynamically or easily adding additional CPU, memory, disk and network resources, but can be harder to diagnose performance issues (especially disk issues)
App Server Layer:
Separate JVM/cluster per application (1 application per JVM, 1 application per cluster)
Separate logs for easier problem determination & monitoring
Easier maintenance: rolling restart of just one application after applying code updates
Allows tuning specific to the application
Allows adding more cluster members for just one app
Isolate HTTP onto separate servers, away from the JVM applications
18
Key Decisions – Database & Pre-production
DB Layer:
Separate DB instance & server per application DB
Easier to detect and pinpoint DB-layer bottlenecks
Easier to perform application-DB-specific maintenance and performance tuning
Side A (N) & Side B (N+1) within production provides the least-risk solution and the shortest service outage when upgrading Connections to new versions
19
And it’s up and running – now what?
High availability is not just about deployment
If something can go wrong it will … except of course in Connections
Monitor everything
DevOps
20
Monitoring – Not an afterthought
Holistic approach (not just server stats)
Are your users the first to tell you there is a problem?
Hard to figure out where the problem is?
Finding cause and resolution? – $$ & time!
Playbook/Runbook/RBA
Anomaly detection
21
Monitoring Tools – Use what you have
22
W3 – Vitals: Vitals is key to monitoring and troubleshooting
+ Dashboard
+ Automated URL probing
+ Persists results
+ Graphs, Alerts
+ Generates Core
+ Rewrites Plugin Config
+ Calculates SLA
+ Probes to each DB
+ Probes to File Storage
+ Probes to LDAP
23
W3 – Vitals: In addition to probing, Vitals collects stats
+ HTTP connections in use at the Proxy, ELB & HTTP servers
+ SIB queue sizes
+ JVM memory use (verbose GC)
+ Aggregation of all HTTP logs
+ WebContainer thread pool usage
24
W3 – Vitals Health Timeline
25
W3 – Vitals Service Level Agreement
26
W3 – Vitals CPU
27
W3 – Vitals Thread Pools
28
W3 – Vitals MPM HTTP Connections
29
W3 – Vitals SIB Queue
30
Blogs Outage – Thursday Jan 7
31
Blogs Outage – spike in WebContainer threads
32
Blogs Outage – Blogs issue led to the problem
33
Blogs Outage – high volume with sudden spike
34
Blogs Outage – historical stats are key to PD
35
Blogs Outage – App Server CPU higher than norm
36
Blogs Outage – CPU the day before
37
Blogs Outage – DB CPU
38
Blogs Outage – why the drop?
39
Blogs Outage – log analysis
Review of Blogs JVM logs shows numerous non-"normal" SQL errors:
--- Cause: com.ibm.db2.jcc.am.SqlException: Not enough storage is available in the application heap to process the statement.. SQLCODE=-954, SQLSTATE=57011, DRIVER=
--- Cause: com.ibm.db2.jcc.am.SqlException: The statement was not processed because a limit such as a memory limit, an SQL limit, or a database limit was reached.. SQLCODE=-101, SQLSTATE=54001, DRIVER=
Javacores look OK, however some show lots of threads creating DB connections
Blogs' db2diag reveals the true cause:
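The JVM-log review above (spotting "non-normal" SQL errors) is easy to automate. A minimal sketch, not the team's actual tooling, that tallies the DB2 SQLCODEs found in a SystemOut.log; the log path is a placeholder:

```python
#!/usr/bin/env python
# Minimal sketch: tally DB2 SQLCODEs appearing in a WebSphere SystemOut.log so
# that unusual codes (e.g. -954, -101 above) stand out against the normal noise.
import re
from collections import Counter

SYSTEMOUT = '/logs/blogs_node1/SystemOut.log'   # hypothetical path
SQLCODE = re.compile(r'SQLCODE=(-?\d+), SQLSTATE=(\w+)')

counts = Counter()
with open(SYSTEMOUT) as log:
    for line in log:
        match = SQLCODE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1

for (sqlcode, sqlstate), count in counts.most_common():
    print('SQLCODE=%s SQLSTATE=%s occurred %d times' % (sqlcode, sqlstate, count))
```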
40
Blogs Outage – Bottom Line
DB2 snapshots show an elevated number of locks and, in general, a lot of time waiting to execute
Identified problem area around updating the referrer column; investigation and trace show too much SQL within a single transaction
Development is creating a fix to break the work into multiple transactions
41
LDAP mayhem – OK, so we know it's LDAP
Alerted via logs showing:
Identifier: LCC8BA217863B5495C92DCF3B30ED76E19
com.ibm.connections.directory.services.exception.DSException: javax.naming.ServiceUnavailableException: :389; socket closed
Site is functional with intermittent failures & long response times
OK, so we know it's LDAP
At 2 minutes – synthetic monitoring starts reporting errors (multiple places)
LDAP server shows: GLPSRV153W The server was not able to accept a requested client connection. The maximum capacity of ##### has been reached. This error has occurred times.
42
LDAP mayhem – Response times are, well, really bad
NEVER rely on users to tell you the state of your site! (of course the issue is intermittent)
All nodes in the LDAP cluster are affected
Recycle the servers – JVM time in request looks better, BUT we are not out of the woods
Quickly compare config changes between deployments – no change found; we still don't know what caused the spike
So…
(chart: browser page-load time vs. time spent in the JVM)
43
LDAP mayhem – It just starts happening again! We are now at 4 minutes
(chart: LDAP connections)
Need to find the servers creating the traffic
Found the server and the root cause
Turned off the new 'feature'
We are at 5 minutes
To do:
Monitor incoming traffic addresses
Anomaly detection
(chart: browser page-load time)
44
Top things to start with (aside from monitoring network/servers/JVMs)
Interservice dependencies
Service Integration Bus – be sure to obtain the recommended WebSphere ifixes
Search indexes – issues with slow crawling & index replication due to large data volume; fixes and improvements have been pushed back into the product
45
Example of Interservice Issue
Usage: Connections applications providing services to other Connections applications (e.g., visiting the Files application makes a request to Communities to get the list of communities a user is a member of)
What to monitor? Multiple vectors – monitor all applications
What happens when it fails? Seemingly unrelated failures cascade
46
Vitals Health Timeline – example for intersvc dep.
47
Vitals Health Timeline – example for intersvc dep.
48
SIB
Usage (not a complete list):
Delivery of activity stream events
Cross-service actions
Search indexing
What happens when it fails?
Events will not be delivered to the target components for processing
Eventually services will not be able to post events (they will be lost)
What to monitor: messaging queue in WAS
49
Monitor SIB Queues – wsadmin script: printSIBusSummary.py
50
Monitor SIB Queues
Increasing queue-depth trends require review of SystemOut (grep for SibMessage) and may require an ME or JVM restart
Automate alerting on high thresholds (see the sketch below)
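The deck references printSIBusSummary.py, which is not reproduced here. As an illustration of the same idea, a minimal wsadmin (Jython) sketch that walks the SIBQueuePoint MBeans and prints their depths; the attribute names and the threshold are assumptions to verify against your WebSphere version:

```python
# Minimal wsadmin (Jython) sketch for listing SIB queue point depths.
# NOT the printSIBusSummary.py script mentioned above -- just the same idea.
# Run via: wsadmin.sh -lang jython -f sibQueueDepths.py   (script name hypothetical)

THRESHOLD = 1000  # alert threshold; pick a value that matches your normal depth

# Query every queue point MBean in the cell and read its current depth.
queuePoints = AdminControl.queryNames('WebSphere:type=SIBQueuePoint,*').splitlines()
for qp in queuePoints:
    depth = int(AdminControl.getAttribute(qp, 'depth'))
    name = AdminControl.getAttribute(qp, 'identifier')
    flag = ''
    if depth > THRESHOLD:
        flag = '  <-- review SystemOut, grep for SibMessage'
    print '%-60s depth=%d%s' % (name, depth, flag)
```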
51
Search
Usage (not a complete list):
End user search queries
Community catalog
Files 'community files' view
What happens when it fails?
End user search queries may not return the right results (or anything)
Indexing may fall behind
What to monitor: crawl and indexing progression
52
Monitor crawling and indexing
Crawling – need to know which node is crawling, then watch for specific messages to know when a crawl starts (CLFRW0297I) & stops (CLFRW0294I)
Each app should be crawled every 15 minutes; verify by the presence of those messages
Indexing – the crawling node writes deltas to the shared file system, and each node merges the changes into its index
Each index contains a file called "segments.gen"; this file's timestamp is updated each time the index is updated
Alert on CLFRW[0-9]*E messages (see the sketch below)
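A minimal sketch of these two checks (crawl messages in SystemOut and the segments.gen freshness test). The index and log paths are placeholders, and the 30-minute staleness window is an assumption derived from the 15-minute crawl interval:

```python
#!/usr/bin/env python
# Minimal sketch (not the team's actual tooling) of the crawl/index checks above.
import os
import re
import time

INDEX_DIR = '/shared/connections/search/index'   # hypothetical shared index path
SYSTEMOUT = '/logs/search_node1/SystemOut.log'   # hypothetical Search JVM log
STALE_AFTER_SECONDS = 30 * 60                    # crawls run every 15 min; 30 min = stale

def index_is_stale():
    """segments.gen is touched on every index update; an old timestamp means indexing stalled."""
    marker = os.path.join(INDEX_DIR, 'segments.gen')
    age = time.time() - os.path.getmtime(marker)
    return age > STALE_AFTER_SECONDS

def crawl_messages(log_path):
    """Count crawl start (CLFRW0297I), stop (CLFRW0294I) and error (CLFRW*E) messages."""
    starts = stops = errors = 0
    with open(log_path) as log:
        for line in log:
            if 'CLFRW0297I' in line:
                starts += 1
            elif 'CLFRW0294I' in line:
                stops += 1
            elif re.search(r'CLFRW[0-9]*E', line):
                errors += 1
    return starts, stops, errors

if __name__ == '__main__':
    starts, stops, errors = crawl_messages(SYSTEMOUT)
    print('crawl starts=%d stops=%d errors=%d index_stale=%s'
          % (starts, stops, errors, index_is_stale()))
```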
53
Monitoring – don’t forget…
Synthetic monitoring – availability
RUM (real user monitoring) – performance
54
Playbook/Runbook – Monitor everything you can
Alerts for what you can take action on
Manual to start; invest in run book automation
Do you have a separate ops team that runs your deployment? DevOps anyone?
Create a partnership with your various IT shops/providers
55
Log Analysis
In general, log files become noisier when there are problems in the deployment
Log files are HUGE! Great for detailed investigation, NOT capable of deriving dashboard health
Use automation to monitor your logs and send appropriate alerts (see the sketch below)
Not just for Connections servers
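As one illustration of such automation (not the tooling used for W3), a sketch that buckets WebSphere error-level log entries per minute and flags abnormal bursts. The path, threshold, timestamp format and alert hook are placeholders to adapt to your environment:

```python
#!/usr/bin/env python
# Minimal sketch: count error-level entries per minute in a SystemOut.log and
# flag minutes that exceed a threshold. Wire alert() into your own paging system.
import re
from collections import Counter

SYSTEMOUT = '/logs/blogs_node1/SystemOut.log'   # hypothetical path
ERRORS_PER_MINUTE_THRESHOLD = 50                # tune to your deployment's normal noise

# Assumes the default WebSphere log format, e.g. [1/7/16 14:03:22:123 EST] ... E ...
TIMESTAMP = re.compile(r'^\[([^\]]+):\d{2}:\d{3} [A-Z]+\]')

def alert(minute, count):
    print('ALERT: %d error entries in minute %s' % (count, minute))

def scan(log_path):
    per_minute = Counter()
    with open(log_path) as log:
        for line in log:
            if ' E ' not in line:          # assumes error entries carry the "E" event type
                continue
            match = TIMESTAMP.match(line)
            if match:
                per_minute[match.group(1)] += 1   # bucket = date + hour + minute
    for minute, count in sorted(per_minute.items()):
        if count > ERRORS_PER_MINUTE_THRESHOLD:
            alert(minute, count)

if __name__ == '__main__':
    scan(SYSTEMOUT)
```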
56
Log Management
Ensure sufficient disk space to retain full logs (consider retaining 30 days locally); retain longer via an offline backup mechanism
Configure WebSphere JVM logs with daily rollover and a large max size so a single day is generally contained within one SystemOut
HTTP access logs should include key fields beyond the basics: real client IP, user agent, request method, URL, query string, JVM that processed the request, HTTP status code, response time, response size, referer, remote user (if possible from your SSO)
Consider a monitoring node to host monitoring tooling & provide a single place that support staff can log in to review logs and configuration
NFS mounts for all logs (each node's JVM logs & Java cores, DB2 diag logs, HTTP access & error logs, etc.), or use a centralized consolidated logging facility
NFS mount for the DMGR configuration directory
57
Start with the basics – Start with a solid deployment base
OS & network tuning, establishment of resource thresholds (HTTP traffic control, shedding load, DoS detection & prevention)
Monitoring of basic resources
Alert the Ops team prior to crisis for exceeded thresholds & abnormal behavior
Retention of run-time information for troubleshooting and growth planning (CPU & memory use, disk & network IO, cache hit rates, etc.)
Capture OS-level stats for historical review, capacity planning, and real-time alerting (see the sketch below)
Monitor and alert on low and high thresholds for local and SAN disk access & throughput, CPU, swap, IO wait, etc.
Monitor logs for abnormal growth/size or error rates and alert when appropriate
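A minimal sketch of the OS-level stats capture described above, assuming the third-party psutil package is available; thresholds are placeholders, and in practice the samples would be shipped to whatever time-series store your monitoring tooling already uses:

```python
#!/usr/bin/env python
# Minimal sketch: sample basic OS resources once a minute, keep them for
# historical review, and flag obvious threshold breaches.
import json
import time

import psutil   # assumption: psutil is installed on the node

CPU_ALERT_PCT = 90
SWAP_ALERT_PCT = 25

def sample():
    """Collect one sample of CPU, memory, swap, disk IO and network IO counters."""
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    return {
        'ts': int(time.time()),
        'cpu_pct': psutil.cpu_percent(interval=1),
        'mem_pct': psutil.virtual_memory().percent,
        'swap_pct': psutil.swap_memory().percent,
        'disk_read_bytes': disk.read_bytes,
        'disk_write_bytes': disk.write_bytes,
        'net_sent_bytes': net.bytes_sent,
        'net_recv_bytes': net.bytes_recv,
    }

if __name__ == '__main__':
    while True:
        stats = sample()
        print(json.dumps(stats))   # retain for historical review / capacity planning
        if stats['cpu_pct'] > CPU_ALERT_PCT or stats['swap_pct'] > SWAP_ALERT_PCT:
            print('ALERT: threshold exceeded: %s' % stats)
        time.sleep(60)
```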
58
Don’t put your team in the dark
Ensure staff has read access to:
All WebSphere nodes (DMGR & App Servers), DB nodes, HTTP nodes
Include read access to application-specific logs and configuration, and also system-level logs (e.g., /var/log/messages)
WebSphere Admin Console (at least the monitor role)
DB snapshots/performance reports
Provide access to tooling that gives insight into the health & throughput of supporting applications/hardware (caching proxies, load balancers, LDAP, SAN)
59
URL Probing
Monitor one URL per application on every JVM (i.e., hit the WebSphere JVM port, bypassing HTTP and any other front-end devices); ensure a response is received within a short interval (e.g., < 20 seconds)
Expose with a dashboard; graph availability & slowdowns
Alert and automate collection of javacores when the application doesn't respond (e.g., trigger 3 core dumps, each a minute apart, as long as the JVM is still hung; after the 3rd detection cycle consider killing the JVM)
Javacores may not be useful to your team; however, in the event you need to call in IBM support you already have the cores in hand and don't have to wait for a recurrence
Similar URL monitoring can be routed through the front end and request dynamic data to ensure end-to-end functionality; however, be careful not to introduce too much real load on your system through monitoring
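A minimal probing sketch along these lines; the node names, JVM ports and probe paths are placeholders rather than the real W3 probe list, and the javacore collection step is only indicated as a comment:

```python
#!/usr/bin/env python
# Minimal sketch: hit each JVM's port directly (bypassing the front end),
# time the response, and flag anything slow or failing. Python 2 syntax.
import time
import urllib2   # use urllib.request on Python 3

PROBE_TIMEOUT_SECONDS = 20

# Hypothetical node/port matrix -- repeat each application URL per node & JVM port.
PROBES = [
    ('appnode1', 9081, '/blogs'),
    ('appnode1', 9082, '/files'),
    ('appnode2', 9081, '/blogs'),
    ('appnode2', 9082, '/files'),
]

def probe(host, port, path):
    url = 'http://%s:%d%s' % (host, port, path)
    start = time.time()
    try:
        response = urllib2.urlopen(url, timeout=PROBE_TIMEOUT_SECONDS)
        status = response.getcode()
    except Exception as exc:               # timeout, connection refused, HTTP error, ...
        return url, None, time.time() - start, str(exc)
    return url, status, time.time() - start, None

if __name__ == '__main__':
    for host, port, path in PROBES:
        url, status, elapsed, error = probe(host, port, path)
        if error or status != 200 or elapsed >= PROBE_TIMEOUT_SECONDS:
            # Here the real tooling would alert and trigger javacore collection
            # (e.g. three dumps a minute apart while the JVM stays hung).
            print('PROBE FAILED %s status=%s elapsed=%.1fs error=%s'
                  % (url, status, elapsed, error))
        else:
            print('OK %s %.2fs' % (url, elapsed))
```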
60
URL Patterns used to monitor Connections
activities - blogs - bookmarks - communities - common - blank.gif files - forums - homepage - mobile - news - profiles - push - search - wikis
Repeat each URL using node & JVM port specific variants to probe each JVM instance
61
WebSphere Infrastructure
Monitor WebSphere Service Integration Bus queues (), graph & alert
Monitor, tune & alert:
WebContainer and EJB thread pool use
Datasource connection pool use
Java heap (verbose garbage collection)
Dynacache sizes
62
Ad hoc analysis
Establish tooling that enables review of "heavy hitters" or top consumers. For a given period of time, staff should be able to determine what IP address/user/application is:
Executing the most requests
Executing the most time-consuming requests
Causing the largest responses
Causing abnormal HTTP status codes
Historical review is important; you need to be able to look back in time and determine what is different about today
Understand normal throughput patterns (graph HTTP requests per application) & normal response times. Review to plan for growth. Tooling should alert when appropriate. (See the sketch below.)
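A minimal "heavy hitters" sketch along these lines. It assumes an extended IHS/Apache access log with the client IP first, the quoted request line, the status code and size, and the response time in microseconds appended as the last field; adjust the pattern to match your actual LogFormat:

```python
#!/usr/bin/env python
# Minimal sketch: aggregate an access log into top requesters, top time
# consumers, and URLs producing abnormal status codes.
import re
from collections import Counter

ACCESS_LOG = '/logs/http1/access_log'    # hypothetical path
TOP_N = 10

LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+).*?(?P<micros>\d+)$'
)

requests_by_ip = Counter()
time_by_ip = Counter()
errors_by_url = Counter()

with open(ACCESS_LOG) as log:
    for line in log:
        match = LINE.match(line)
        if not match:
            continue
        ip = match.group('ip')
        requests_by_ip[ip] += 1
        time_by_ip[ip] += int(match.group('micros'))
        if match.group('status').startswith(('4', '5')):
            errors_by_url[match.group('url')] += 1

print('Top requesters:          %s' % requests_by_ip.most_common(TOP_N))
print('Most time consumed (us): %s' % time_by_ip.most_common(TOP_N))
print('Abnormal status by URL:  %s' % errors_by_url.most_common(TOP_N))
```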
63
May your dashboard be green!
64
Q&A
65
Thank you
66
Connections Talks
JMP-1660A IBM Connections integration with Microsoft, Sunday 2:15
AD-1503 Extending IBM Connections Cloud and Verse, Monday 11:30
TI-1642 Connections Communities – The New Stuff, Monday 11:30
SI-1135 Super Session: IBM's Mobile Strategy and a Social Way to Work, Monday 3:30
SI-1264 It's All About People – A Holistic Approach, Monday 4:45
TI-1601 Take IBM Connections Across Your Enterprise – Through Plugins and Integration Points, Tuesday 8:00
AD-1511 Top 10 things to know for creating successful solutions based on IBM Connections Cloud, Tuesday 9:15
AD-1507 Building applications using the IBM Connections Cloud Developer Experience, Tuesday 10:45
SI-1137 What's new in IBM Connections, Tuesday 1:15
AD-1656 Transforming Social Data into Business Insights, Tuesday 2:30
SI-1139 IBM Connections Files – The New Way to Work, Sync and Share, Tuesday 4:00
TI-1640 Achieving High Availability with IBM Connections – Insight from the Experts, Tuesday 4:00
TI-1641 Delivering Enterprise Software at the Speed of Cloud, Wednesday 10:45
67
Notices and Disclaimers
Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided. Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary. References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation. It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law
68
Notices and Disclaimers cont.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right. IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: