Creating a Troubleshooting and Alert Framework for Enterprise Cloud Solutions
The objective was to study the existing system and design a better troubleshooting and alerting platform for Nutanix products. The idea was to improve the troubleshooting experience for all stakeholders, including users and service engineers, and to give them a more robust platform for resolving cases easily.
The project is in the deployment pipeline and cannot be shown in detail, but please read on to see the process and approach.
Reduces issue resolution time for service engineers, from weeks to a matter of days.
Gives users a clear picture of what is happening in the cluster. Being aware of what is happening, they would not raise unnecessary cases.
Reduces the number of cases that reach the service engineers.
Since the diagnostic tool will solve the small, simple cases, service engineers can spend their time on the important ones.
This feature is going to take away a lot of jobs, as it is going to make troubleshooting a very efficient process.
The concept and visualisation are very neat and straight to the point. This is going to give anyone a much better picture of what is happening in the clusters.
A three-pronged solution was devised to aid troubleshooting. It consists of a visualization, a diagnostic tool, and a machine-learning-supported part for automating tasks.
1. Visualization: Gives a clear understanding of what is happening around an alert in the cluster. It helps correlate the various events that occur alongside the alerts and other metrics, which gives a good understanding of the possible causes.
2. Diagnostic Tool: Works as an automatic diagnostic tool that parses the required information and runs various checks and tests to solve most of the common problems. Even if the problem is not solved, the tool can attach all the required logs and information to the service ticket being created. This saves considerable time in resolving the ticket.
3. Machine Learning: The smart automation center, with machine learning capability, can automate the handling of repeated problems. In the future it would also be able to form a network across all users and help them avoid hitting a bug that has already been found. As of now, each release or update has bugs that every user ends up hitting. A smart automation center can push out scripts for known bugs to prevent other users from hitting them. This would reduce the number of repeated cases that reach the service engineers.
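
To make the diagnostic tool flow described in point 2 more concrete, here is a minimal Python sketch. It is not the actual Nutanix implementation, which cannot be shown. Every name in it (`CheckResult`, `Ticket`, `run_diagnostics`, and the two example checks) is hypothetical and only illustrates the idea: run a set of checks, and if anything remains unresolved, create a ticket with the collected logs already attached.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class CheckResult:
    """Outcome of one diagnostic check (hypothetical structure)."""
    name: str
    passed: bool          # True if the check found no problem
    fixed: bool = False   # True if the check found a problem and repaired it
    log: str = ""         # Log excerpt gathered while running the check


@dataclass
class Ticket:
    """Service ticket with diagnostic output attached (hypothetical structure)."""
    alert_id: str
    attachments: List[str] = field(default_factory=list)


def run_diagnostics(
    alert_id: str, checks: List[Callable[[], CheckResult]]
) -> Tuple[bool, Optional[Ticket]]:
    """Run every check and return (resolved, ticket).

    If all detected problems were fixed automatically, no ticket is needed.
    Otherwise a ticket is created with the logs from every check attached,
    so the service engineer does not have to collect them by hand.
    """
    results = [check() for check in checks]
    unresolved = [r for r in results if not r.passed and not r.fixed]
    if not unresolved:
        return True, None
    ticket = Ticket(alert_id=alert_id)
    ticket.attachments = [f"{r.name}: {r.log}" for r in results if r.log]
    return False, ticket


# Two made-up checks, standing in for real cluster tests.
def disk_space_check() -> CheckResult:
    # Assumed scenario: a full disk was found and temporary files were cleared.
    return CheckResult("disk_space", passed=False, fixed=True,
                       log="cleared temporary files")


def service_health_check() -> CheckResult:
    # Assumed scenario: a service is down and cannot be repaired automatically.
    return CheckResult("service_health", passed=False, fixed=False,
                       log="service not responding")


if __name__ == "__main__":
    resolved, ticket = run_diagnostics(
        "alert-001", [disk_space_check, service_health_check]
    )
    print("resolved:", resolved)
    if ticket is not None:
        for line in ticket.attachments:
            print("attached:", line)
```

In this sketch the ticket carries the logs from all checks, including the ones that were fixed automatically, since that context may still help the engineer who picks up the case.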