Training Example 3
(Video Quality Problems Across Site)
The client had been complaining about poor video quality across the site, particularly on Recorder 3.
Poor video quality (jumpiness or pixelation) implies dropped frames. It can be caused by the following:
- Network congestion, visible as windows of higher latency and dropped-packet events.
- Storage performance problems, visible in volume write latency and volume write queue depth.
- Congestion on the storage network, visible in volume write and read latency as well as NIC dropped uploads/downloads on the storage network port.
- CPU load preventing the system from keeping up.
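The list above can be read as a lookup from symptoms to candidate causes. The following is a minimal sketch of that idea, not part of the monitoring product: the metric names and the `candidate_causes` helper are hypothetical, modeled on the plot labels used in this example.

```python
# Hypothetical metric names, loosely based on the plot labels in this example.
CAUSE_SIGNALS = {
    "network congestion": {"camera_latency", "nic_dropped_downloads"},
    "storage performance": {"volume_write_latency", "volume_write_queue_depth"},
    "storage network congestion": {
        "volume_write_latency",
        "volume_read_latency",
        "storage_nic_dropped",
    },
    "cpu load": {"cpu_load"},
}


def candidate_causes(abnormal_metrics):
    """Rank causes by the fraction of their signals that look abnormal."""
    abnormal = set(abnormal_metrics)
    scored = []
    for cause, signals in CAUSE_SIGNALS.items():
        hits = signals & abnormal
        if hits:
            scored.append((len(hits) / len(signals), cause))
    # Highest fraction first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [cause for _, cause in scored]


ranked = candidate_causes(
    ["camera_latency", "nic_dropped_downloads", "volume_write_latency"]
)
print(ranked)
```

A ranking like this only orders the hypotheses to check first. It does not replace looking at the plots, since several causes can be present at once.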
Triage and Analysis
The most likely problem is network congestion, so start there.
Go to Recorder 3 and plot NIC: Download to see that the system is absorbing a daily peak of 26 Mbps.
Add another plot for NIC: Dropped downloads. Here we can see that the number of dropped packets is steadily climbing, which supports our suspicion.
Add another chart, choose “Cameras” and Select All, then choose Camera: Latency. This plot shows that a set of cameras periodically spikes to over 100 ms of latency, which is not good.
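If the latency samples can be exported, the same check can be done in a few lines. This is a sketch only: the export format (a dict of camera name to latency samples in milliseconds), the camera names, and the sample values are all assumptions made for illustration. Only the 100 ms threshold comes from the plot described above.

```python
def cameras_over_threshold(latency_ms_by_camera, threshold_ms=100.0):
    """Return {camera: number of samples above threshold_ms} for cameras that spike."""
    spikes = {}
    for camera, samples in latency_ms_by_camera.items():
        count = sum(1 for value in samples if value > threshold_ms)
        if count:
            spikes[camera] = count
    return spikes


# Made-up sample data, not taken from the customer site.
sample = {
    "cam-01": [12.0, 15.0, 140.0, 11.0, 130.0],
    "cam-02": [9.0, 10.0, 11.0, 10.0, 9.0],
    "cam-03": [20.0, 101.0, 18.0, 19.0, 22.0],
}
spiking = cameras_over_threshold(sample)
print(spiking)
```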
Duplicate this camera plot and then choose Camera: Tcp port sum. This shows a steady flux, indicating that sockets to many cameras are constantly being re-established.
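One way to quantify that flux is to count how often the value changes between consecutive samples. The sketch below assumes the Tcp port sum changes whenever a socket is re-established on a new local port. That interpretation is an assumption, as are the helper name and the sample values.

```python
def count_changes(samples):
    """Count how many consecutive sample pairs differ."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


# Made-up values: a stable connection versus one that keeps reconnecting.
stable = [50123, 50123, 50123, 50123, 50123]
flapping = [50123, 50310, 50310, 51002, 51477]

print(count_changes(stable), count_changes(flapping))
```

A camera whose count stays near zero is holding its connection. A camera whose count grows with every polling interval is reconnecting repeatedly.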
This adds weight to the assessment that the problem is network related. However, to be thorough, let's check storage.
Add a plot for Volume: Write throughput, which tells us that a respectable amount of data is being written to disk.
Add a plot for Volume: Write latency and another for Volume: Write queue depth. Both of these charts show spikes, which can also lead to dropped packets.
So we seem to have both storage issues AND network issues. Let's look at them in combination by plotting CPU: Load, NIC: Dropped downloads, and Volume: Write queue depth together. CPU load is high but not in the danger zone. If we look back a month or a quarter, we see that this problem has been going on for quite a while. If we go all the way back to December 1st, we see that prior to December 19th the storage spikes and dropped downloads were not an issue. Furthermore, something happened at that point to cause the CPU load to jump up 20%.
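Finding the onset date by eye works here because the step is obvious. When it is less clear, a simple mean-shift search can suggest where a series changed level. The following is a minimal sketch of that technique, not a feature of the monitoring tool: the function name is hypothetical, and the daily CPU load values are synthetic, chosen only so that the step falls on the 19th day.

```python
def find_step_change(values, min_segment=3):
    """Return the index k that best splits values into two flat segments.

    The best split minimizes the total squared error around the mean of
    values[:k] plus the squared error around the mean of values[k:].
    """
    n = len(values)
    if n < 2 * min_segment:
        raise ValueError("series too short for the requested segment size")
    best_index, best_cost = None, float("inf")
    for k in range(min_segment, n - min_segment + 1):
        before, after = values[:k], values[k:]
        mean_before = sum(before) / len(before)
        mean_after = sum(after) / len(after)
        cost = sum((v - mean_before) ** 2 for v in before)
        cost += sum((v - mean_after) ** 2 for v in after)
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index


# Synthetic daily CPU load: index 0 is day 1, and the level shifts on day 19.
daily_cpu_load = [40.0] * 18 + [60.0] * 13
onset_index = find_step_change(daily_cpu_load)
print(onset_index)  # 18, i.e. day 19 when index 0 is day 1
```

Running the same search on the dropped-download and queue-depth series and getting a nearby index for each would support the conclusion that a single change affected all three.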
Something was added or changed on the servers that affects the performance of ingesting data from the network and writing it to storage. Since the CPU load rose but was not completely saturated, this suggests something that slows down the processing of I/O, causing the system to miss real-time deadlines. This raises the suspicion that virus scanning is being done on the video data.
The customer confirmed that all servers were updated with IT-mandated virus scanning software at that time.