Creating a Troubleshooting and Alert Framework for Enterprise Cloud Solutions

 

The objective was to study the existing system and design a better troubleshooting and alerting platform for Nutanix products. The idea was to improve the troubleshooting experience for all stakeholders, including users and service engineers, by giving them a more robust platform for resolving cases easily.

 

The project is in the deployment pipeline and cannot be shown in detail, but this page outlines the process and approach.

  • Reduces issue resolution time for service engineers, from weeks to a matter of days.

  • Gives users a clear picture of what is happening in the cluster. Being aware of what is happening, they are less likely to raise unnecessary cases.

  • Reduces the number of cases that reach the service engineers.

  • Because the diagnostic tool resolves the minor, routine cases, service engineers can spend their time on the important ones.

This feature is going to take away a lot of jobs, as it is going to make troubleshooting a very efficient process.

Staff Service Reliability Engineer

The concept and visualisation are very neat and straight to the point. This is going to give anyone a much better picture of what is happening in the clusters.

Staff Service Reliability Engineer

A three-pronged solution was devised to aid troubleshooting. It consists of a visualisation, a diagnostic tool, and a machine-learning-supported component for automating tasks.

1. Visualization: Gives a clear understanding of what is happening around an alert in the cluster. It helps correlate the various events that occur alongside the alert with other metrics, which gives a good understanding of the possible causes.

2. Diagnostic Tool: Works automatically: it parses the required information and runs various checks and tests to solve most of the common problems. Even if the problem is not solved, the tool can attach all the required logs and information to the service ticket being created, which saves considerable time in resolving the ticket.

3. Machine Learning: The smart automation center, with machine learning capability, can automate the handling of repeated problems. In the future it could also form a network across all users and help them avoid hitting a bug that has already been found. At present, every release or update contains bugs that every user is liable to hit. A smart automation center can push out scripts for known bugs to prevent other users from hitting them, which would reduce the number of repeated cases that reach the service engineers.
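
The flow described in the first two parts can be illustrated with a minimal Python sketch. It is not the actual implementation, which cannot be shown. Every name here (`Event`, `CheckResult`, `Ticket`, `correlate`, `diagnose`) and the ten-minute window are assumptions made for illustration: events near an alert are gathered, automatic checks are run, and a ticket with the collected context is created only if no check resolves the problem.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class Event:
    """A cluster event (hypothetical shape)."""
    timestamp: datetime
    source: str
    message: str


@dataclass
class CheckResult:
    """Outcome of one automatic check (hypothetical shape)."""
    name: str
    resolved: bool  # True if the check also fixed the problem
    detail: str


@dataclass
class Ticket:
    """Service ticket carrying the context gathered by the tool."""
    alert: str
    check_results: list[CheckResult] = field(default_factory=list)
    related_events: list[Event] = field(default_factory=list)


def correlate(alert_time: datetime, events: list[Event],
              window: timedelta = timedelta(minutes=10)) -> list[Event]:
    """Return events within `window` of the alert, oldest first."""
    nearby = [e for e in events if abs(e.timestamp - alert_time) <= window]
    return sorted(nearby, key=lambda e: e.timestamp)


def diagnose(alert: str, alert_time: datetime, events: list[Event],
             checks: list[Callable[[], CheckResult]]) -> Optional[Ticket]:
    """Run every check; return None if one resolved the problem,
    otherwise a ticket with all check results and nearby events attached."""
    results = [check() for check in checks]
    if any(r.resolved for r in results):
        return None
    return Ticket(alert, results, correlate(alert_time, events))
```

In this sketch, a ticket is opened only as a fallback, and it already carries the check results and the events surrounding the alert, so the engineer who picks it up does not have to collect them by hand.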
