“For me, it is much more important to know how faults impact my service delivery than to just get alerts and fix them, because it saves me a lot of time and lets me prioritize my efforts better to achieve SLA goals.”
The future of IT/OT monitoring and performance analytics is heading toward Business Service Management (BSM). We see the trend already taking place as companies increasingly adopt the BSM approach and those technologies supporting it.
With BSM monitoring, companies can capture a comprehensive set of IT/OT performance metrics (from the full stack) and access them from a common repository. Furthermore, this performance data can be normalized, weighted, and analyzed according to key business processes and correlated against required service levels (SLAs). A BSM solution gives IT organizations a distinct competitive advantage by reducing costs (through proactive troubleshooting and predictive maintenance), improving uptime (through shorter MTTR and faster break-fix response), and increasing customer satisfaction (CSAT) through better performance and reliability.
For companies with complex IT/OT environments, adopting a BSM approach is inevitable. The current batch of specialized reporting tools, event consoles, network and server notifications, and other monitoring and application management (APM) tools is simply not enough to meet the expectations for lower costs, higher performance, and increased availability of critical business services and processes. Traditional monitoring tools can send warnings when various thresholds have been violated, but understanding the correlation between an alert and its root cause takes time, often a lot of time, especially when there is a chain reaction of events. Specialized monitoring tools were not built to accommodate system-wide analytics, which results in too many irrelevant alarms (alarms with no business impact) and significant delays in fault isolation and business impact analysis. System administrators may know which service is currently suffering delays or what is down, but they cannot accurately answer: Why is it down? What is the SLA impact (time/money) on the business? Do I have redundancy? Without BSM, accurately answering those questions takes too much time, and meanwhile users get annoyed and management gets upset.
Take, for example, a simple alert scenario: the system administrator gets complaints that the ERP system is not generating reports. The sys admin checks the screen and sees that the SAP server is down. They cannot remember everything else that this server supports, and they need fast and accurate answers about the situation:
A/ Is this server defective or is the origin of the problem somewhere else?
B/ What other IT assets are dependent on this SAP server?
C/ What other business services are affected?
D/ What is the current service level, and how does it rank compared to other problems currently being investigated?
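Questions B and C come down to walking a dependency map from the failed asset outward. The following is a minimal sketch of that idea, not Centerity's implementation: the asset names, the DEPENDENTS map, and the impacted_by helper are all hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical dependency map (illustrative names only): each key is an
# asset, each value lists what depends directly on that asset.
DEPENDENTS = {
    "storage-array": ["sap-server", "hana-db"],
    "sap-server": ["erp-app"],
    "hana-db": ["erp-app"],
    "erp-app": ["erp-reporting", "invoicing"],
}

# Hypothetical set of nodes that represent business services.
BUSINESS_SERVICES = {"erp-reporting", "invoicing"}


def impacted_by(asset, dependents=DEPENDENTS):
    """Return everything that depends on `asset`, directly or indirectly."""
    seen = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


affected = impacted_by("sap-server")                     # question B
affected_services = sorted(affected & BUSINESS_SERVICES)  # question C
print(sorted(affected))
print(affected_services)
```

The breadth-first walk visits each dependent once, so shared dependencies (here, erp-app sits on both the SAP server and the database) are not counted twice.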
Beyond the failed SAP server, the sys admin must also manage other business services, technology layers, and applications to ensure they are not affecting other customers, users, or business processes. So, while traditional monitoring tools indicate there is an IT issue, the sys admin does not necessarily understand the details or the magnitude of the problem. Traditionally, the sys admin would then contact the network manager and ask them to check the network. If the answer comes back that the network is fine, the sys admin will then proceed to the storage team, the server team, the applications team, etc. (choosing whichever sequence seems most logical). This fragmented troubleshooting process results in long delays in recognizing and resolving root causes and in analyzing trends.
Now, imagine a different scenario. A sys admin receives a warning that ERP performance is deteriorating and the service level has dropped to 86%. They quickly see that many SQL queries are not running to completion and verify that the HANA database is in a critical state. (The service level is not zero because some queries are still getting responses.) Looking at all of the business service layers, they see that virtualization is also in a critical state. The sys admin sees that virtualization memory is nearly maxed out at 95% and realizes this is the source of the ERP degradation. It is now a relatively easy problem to solve, and ERP performance is quickly restored to normal before user complaints start to come in.
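The reasoning in this scenario, looking down the service layers until the lowest one in a critical state is found, can be sketched in a few lines. This is a simplified heuristic under stated assumptions, not a description of how any product performs root cause analysis: the layer names, their ordering, and the probable_root_cause helper are hypothetical, and only the states and the 95% memory figure come from the scenario.

```python
# Layers ordered from the business-facing top to the infrastructure bottom.
# The ordering and the layer names are assumptions made for illustration.
LAYERS = [
    {"name": "ERP application", "state": "warning"},
    {"name": "HANA database", "state": "critical"},
    {"name": "Virtualization", "state": "critical", "detail": "memory at 95%"},
]


def probable_root_cause(layers):
    """Return the lowest layer in a critical state, or None if there is none.

    Assumes problems propagate upward, so the deepest critical layer is the
    most likely origin of the trouble seen in the layers above it.
    """
    critical = [layer for layer in layers if layer["state"] == "critical"]
    return critical[-1] if critical else None


cause = probable_root_cause(LAYERS)
print(cause["name"], "-", cause.get("detail", "no detail"))
```

A real environment has a dependency graph rather than a single ordered stack, so this linear version only conveys the idea of reading the layers together instead of team by team.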
This scenario is not a dream; it is the reality for many enterprises and managed service providers that use Centerity’s unified IT monitoring and performance analytics platform. With its single-pane-of-glass dashboard, users have instant access to root cause analysis and failure trends. Furthermore, to ensure business-critical applications achieve the highest levels of availability and uptime, these troubleshooting capabilities work for on-premises, Cloud, virtual, and hybrid deployments. Finally, management and administrators have real-time reports and actionable insights regarding key business processes and how they track against established SLAs and business objectives.
Centerity’s unified infrastructure optimizes the management of business services for complex IT/OT environments. Once the software analyzes the business service levels, IT pros see at once which technology layers are causing issues. With straightforward configuration tools, Centerity’s customers choose their critical performance metrics and assign a different weight (score) to each of them.
Each monitored metric has a different effect on various business services. A metric may have a significant effect on one business service and a relatively minor effect on another. By allowing each service to have a different score and adjusting thresholds for warning and critical states, it becomes easy to recognize problems and impending problems in real time and to understand how to properly resolve the issue on the first try.
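To make the weighting idea concrete, here is a minimal sketch of a per-service weighted score with warning and critical thresholds. It is an illustration under assumptions, not Centerity's scoring formula: the metric names, health scores, weights, threshold values, and both helper functions are hypothetical.

```python
def service_level(metric_scores, weights):
    """Weighted average of metric health scores (0-100) for one service."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one metric needs a non-zero weight")
    weighted_sum = sum(metric_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


def service_state(level, warning=90.0, critical=70.0):
    """Map a service level to a state using assumed thresholds."""
    if level < critical:
        return "critical"
    if level < warning:
        return "warning"
    return "ok"


# Hypothetical current health of three monitored metrics (0-100).
scores = {"query_success": 80, "response_time": 90, "availability": 95}

# The same metrics carry different weights for different business services.
erp_weights = {"query_success": 5, "response_time": 3, "availability": 2}
portal_weights = {"query_success": 1, "response_time": 1, "availability": 8}

erp_level = service_level(scores, erp_weights)        # 86.0
portal_level = service_level(scores, portal_weights)  # 93.0
print(erp_level, service_state(erp_level))
print(portal_level, service_state(portal_level))
```

With these assumed numbers, the same drop in query success pushes the ERP service into a warning state while the portal service stays healthy, which is the point of weighting metrics per service.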
Author
Maxim Reizelman, Director of SI & Innovation at Centerity
Maxim leads Centerity’s support and integration team and is responsible for the smiles of hundreds of customers all over the world. He also leads the innovation practice, keeping Centerity’s unique technology at the cutting edge. Maxim brings extensive experience in the information technology space. Prior to Centerity, Maxim served as the automation and process control manager at a leading Israeli credit card issuer and as a NOC manager in the IDF.
Please contact me with any questions you have, by e-mail or on LinkedIn.
About Centerity
Centerity is a chosen vendor for leading complex hybrid IT platforms such as VCE Vblock, VCE VxRail, Smartstack, Nutanix, and Flexpod. Centerity’s award-winning software provides a unified, enterprise-class IT performance analytics platform that improves the performance and reliability of business services to ensure availability of critical systems. By delivering a consolidated view across all layers of the technology stack, including applications, Big Data, operating systems, database, storage, compute, security, networking, Cloud, Edge, and IoT/IIoT devices, Centerity provides an early warning of performance issues along with corrective action tools to quickly isolate faults and identify root causes.