Training Example 1
(Flaky Server and Poor Video Quality)
The customer has been complaining about a flaky server as well as windows of poor video quality.
The first question is whether this is one problem or several.
Unexpected reboots can be caused by the following:
- Running out of swap space on the system partition, visible in the free space on the system partition (C:)
- Running out of usable memory, visible in free memory
- Partially updated software triggering a system crash (visible as a “pending reboot” flag)
- An administrator (IT) forcing shutdowns, usually through the VM platform
- Power supply flakiness (hard to see, but it may show up in bugcheck events)
Poor video quality (jumpiness or pixelation) implies dropped frames. It can be caused by the following:
- Network congestion, visible as windows of higher latency and dropped-packet events
- Storage performance, visible in volume write latency and volume write queue depth
- Congestion on the storage network, visible in volume write and read latency as well as NIC dropped uploads/downloads on the storage network port
- CPU load preventing the system from keeping up
Triage and Analysis
There doesn’t seem to be any overlap between the likely causes of the two complaints. Start by looking for reboots in System: Uptime, then check whether any other data correlates with them.
Going to the server, first look at System: Uptime. System Uptime resets after any reboot of the system, so look for a sudden drop in this chart.
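The same "sudden drop" rule can be applied to exported uptime data. The following is a minimal sketch in Python, assuming the uptime chart can be exported as (timestamp, uptime) pairs sorted by time; the function name and the sample values are hypothetical and are not part of the monitoring product.

```python
from typing import List, Tuple

def find_reboots(samples: List[Tuple[float, float]]) -> List[float]:
    """Return the timestamps at which uptime is lower than in the previous sample.

    Uptime only grows while the system is running, so any decrease
    between two consecutive samples means a reboot happened in between.
    """
    reboots = []
    for (_, prev_uptime), (ts, uptime) in zip(samples, samples[1:]):
        if uptime < prev_uptime:
            reboots.append(ts)
    return reboots

# Hypothetical samples: (timestamp in seconds, uptime in seconds)
samples = [(0, 5000), (300, 5300), (600, 120), (900, 420)]
print(find_reboots(samples))
```

The timestamps returned here are the points around which to narrow the time window in the next step.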
Extend the time frame back until you find some reboot events. Then, to improve responsiveness, narrow the time window to a range of a few days around a reboot.
Add a graph for Volume: Free space. This will plot free space on all partitions. You can filter out the video storage partitions D: & E: by clicking on the colored circles associated with those plot lines. Though you can see a sawtooth pattern, there is still well over 100GB of free space. Moreover, the sawtooth is happening independently of the reboots. This means that free disk space is not the cause.
Add a graph for Memory: Free bytes to plot the available physical and virtual memory. Here we see a stronger correlation: something is consuming virtual memory around the times the system reboots.
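To put a number on what the chart shows, one option is to measure how much free memory was lost in the window leading up to each reboot. This is only a sketch, assuming free-memory samples are available as (timestamp, free GB) pairs sorted by time; the function name, window length, and sample values are hypothetical.

```python
from typing import List, Optional, Tuple

def memory_drop_before(reboot_ts: float,
                       memory: List[Tuple[float, float]],
                       window: float) -> Optional[float]:
    """Fraction of free memory lost during `window` seconds before a reboot.

    Compares the first and last samples inside the window. Returns None
    when there are fewer than two samples to compare.
    """
    in_window = [free for ts, free in memory
                 if reboot_ts - window <= ts < reboot_ts]
    if len(in_window) < 2 or in_window[0] <= 0:
        return None
    return (in_window[0] - in_window[-1]) / in_window[0]

# Hypothetical samples: (timestamp in seconds, free memory in GB)
memory = [(0, 8.0), (300, 6.0), (600, 2.0), (900, 7.5)]
print(memory_drop_before(900, memory, window=900))
```

A large fraction before most reboots, and a small one at other times, would support the pattern seen in the graph.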
Add a graph for System: Memory usage, which plots memory usage by the top 10 processes. Since the top 10 is variable, you will see plot lines appearing and disappearing. What this graph shows is that there are two processes that seem to surge in memory usage: monitorstation.exe, which is the recording process for Video Insight, and svchost.exe, which is a Windows process that hosts system services running from dynamic-link libraries so that several services can share one process.
The assessment is that the Video Insight process is having trouble, and it may also be causing the heavy memory use by svchost.exe.
Add a graph and plot Camera: Latency for all the video streams recording to this server. Here we see generally okay activity, with periodic spikes of up to 1 second. However, we also see some downticks to “Unpingable”. These downticks correlate with times when svchost.exe is a high memory user.
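The visual correlation can be cross-checked by counting how many “Unpingable” samples fall at times when svchost.exe memory is high. The sketch below assumes both series were sampled at the same timestamps; the function name, the threshold, and the values are hypothetical.

```python
from typing import List, Optional, Tuple

def share_during_high_memory(unpingable_ts: List[float],
                             svchost_mem: List[Tuple[float, float]],
                             threshold_mb: float) -> Optional[float]:
    """Share of unpingable samples that coincide with high svchost.exe memory.

    Assumes both series use the same sampling timestamps.
    Returns None when there are no unpingable samples.
    """
    if not unpingable_ts:
        return None
    high = {ts for ts, mem in svchost_mem if mem >= threshold_mb}
    hits = sum(1 for ts in unpingable_ts if ts in high)
    return hits / len(unpingable_ts)

# Hypothetical samples: (timestamp in seconds, svchost.exe memory in MB)
svchost_mem = [(0, 200), (300, 900), (600, 950), (900, 250)]
unpingable = [300, 600, 900]
print(share_during_high_memory(unpingable, svchost_mem, threshold_mb=800))
```

A share close to 1.0 supports the correlation. A share close to the overall fraction of high-memory samples would suggest coincidence.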
Next, use “Duplicate” to add another camera graph and plot Integrity: vpu to illustrate video path uptime. Here, we see a disturbing pattern of camera streams struggling to stay online during these windows.
It seems that the reboots are associated with the VMS application consuming the available free memory. Recommend checking with the VMS vendor for updates.
There are also symptoms of link-line saturation (i.e., too much traffic for the negotiated link speeds on various paths). Moreover, these windows of significant latency only affect about half the cameras. Suspect link-speed or switch issues for that batch of cameras. Without switch port information, we have to work with the customer on the assessment.
There was a NIC port in one of the camera switches that would reset itself to 10Mbps from its intended setting of 100Mbps.
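A quick calculation shows why a port falling back to 10Mbps produces the saturation symptoms. This is an illustrative sketch only: the number of cameras behind the port and their bitrates are hypothetical values, not measurements from this case.

```python
from typing import List

def link_utilization(stream_mbps: List[float], link_mbps: float) -> float:
    """Total stream bitrate as a fraction of the link speed (1.0 = fully used)."""
    return sum(stream_mbps) / link_mbps

# Hypothetical: four cameras at 4 Mbps each sharing one switch uplink
streams = [4.0, 4.0, 4.0, 4.0]
for link in (100.0, 10.0):
    used = link_utilization(streams, link)
    print(f"{link:.0f} Mbps link: {used:.0%} utilized, saturated={used > 1.0}")
```

Under these assumed numbers the same traffic fits easily at 100Mbps but exceeds the link at 10Mbps, which is consistent with latency spikes and dropped streams for only the cameras behind that port.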