When we write a monitor for something like “Processor\% Processor Time\_Total” and target “Windows Server Operating System”…. everything is very simple. “Windows Server Operating System” is a single instance target…. meaning there is only ONE “Operating System” instance per agent. “Processor\% Processor Time\_Total” is also a single instance counter…. using ONLY the “_Total” instance for our measurement. Therefore – your performance unit monitors for this example work just like you’d think.
However – Logical Disk is very different. On a given agent – there will often be MULTIPLE instances of “Logical Disk”, such as C:, D:, E:, F:, etc… We must write our monitors to take this into account.
For this reason – we cannot monitor a Logical Disk perf counter, and use “Windows Server Operating System” as the target. The only way this would work, is if we SPECIFICALLY chose the instance in perfmon. I will explain:
Bad example #1:
I want to monitor for the perf counter Logical Disk\% Free Space\
I create a new monitor > unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.
I target a generic class, such as “Windows Server Operating System”.
I choose the perf counter I want – and select all instances:
And save my monitor.
The problem with this workflow – is that we targeted a multi-instance perf counter, at a single instance target. This workflow will load on all Windows Server Operating Systems, and parse through all discovered instances. If an agent only has ONE instance of “Logical Disk” (C:) then this monitor will work perfectly…. whether or not the C: drive is low on free space – no issues. HOWEVER… if an agent has MULTIPLE instances of logical disks, C:, D:, E:, AND those disks have different threshold results… the monitor will “flip-flop” as it examines each instance of the counter. For example, if C: is running out of space, but D: is not… the workflow will examine C:, turn red, generate an alert, then immediately examine D:, and turn back to green, closing the alert.
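To make the flip-flop concrete, here is a minimal Python sketch. This is NOT SCOM code, just a toy model of one monitor that receives samples from every instance of the counter. The function name, the 10% threshold, and the sample values are all made up for illustration.

```python
# Toy model of ONE monitor fed by ALL instances of a multi-instance counter.
# Not SCOM code; names, threshold, and sample values are illustrative assumptions.

def count_state_changes(samples, threshold_pct, initial_state="healthy"):
    """Return (number_of_state_changes, final_state) for a single monitor."""
    state = initial_state
    changes = 0
    for _instance, free_pct in samples:
        # The monitor does not care which instance the sample came from.
        new_state = "unhealthy" if free_pct < threshold_pct else "healthy"
        if new_state != state:
            changes += 1
            state = new_state
    return changes, state


# One sampling cycle: C: is nearly full, D: has plenty of space.
one_cycle = [("C:", 5.0), ("D:", 60.0)]

# Three sampling cycles, all arriving at the same single monitor.
changes, final_state = count_state_changes(one_cycle * 3, threshold_pct=10.0)
print(changes, final_state)
```

Every cycle flips the monitor red on C: and green again on D:, so the state change count grows with every sample even though nothing about the disks has changed.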
This is SERIOUS. This will FLOOD your environment with statechanges, and alerts, every minute, from EVERY Operating System.
A quick review of Health Explorer will show what is happening:
This monitor went “unhealthy” and issued an alert at 10:20:58AM for the C: instance:
Then went “healthy” in the same SECOND from the _Total Instance:
Then flipped back to unhealthy, at the same time – for the D: instance.
I think you can see how bad this is. I find this condition all the time, even in “mature” SCOM implementations… it just happens when someone creates a simple perf threshold monitor but doesn't understand the class model, or multi-instance perf counters. In an environment with only 500 monitored agents – I can generate over 100,000 state changes – and 50,000 alerts, in an HOUR!!!!
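A rough back-of-the-envelope sketch shows how quickly numbers of that size add up. The 500 agents come from the text above; the one-minute sample interval and the four flips per sampling cycle are assumptions picked for illustration, not measurements.

```python
# Back-of-the-envelope estimate; every input except `agents` is an assumption.
agents = 500                 # monitored agents (figure from the text)
samples_per_hour = 60        # assumes a one-minute sample interval
flips_per_sample_cycle = 4   # assumed: red/green/red/green across the instances

state_changes_per_hour = agents * samples_per_hour * flips_per_sample_cycle

# Only the healthy -> unhealthy half of the flips raises a new alert.
alerts_per_hour = state_changes_per_hour // 2

print(state_changes_per_hour, alerts_per_hour)
```

With more disks per agent, or a shorter sample interval, the totals climb even faster.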
Ok – lesson learned – DON'T target a single-instance class, using a multi-instance perf counter. So – what should I have used? Well, in this case – I should use something like “Windows Server 2008 Logical Disk”. But we can still screw that up! :-)
Bad example #2:
I want to monitor for the perf counter Logical Disk\% Free Space\
I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.
I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.
I choose the perf counter I want – and select all instances:
And save my monitor.
Ack! The SAME problem! Why????
The problem is – now, instead of each Operating System instance loading this monitor, and then parsing and measuring each instance, now EACH INSTANCE of logical disk is doing the SAME THING. This is actually WORSE than before…. because the number of monitors loaded is MUCH higher, and will flood me with even more state changes and alerts than before.
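Here is a small Python sketch of why this gets worse (again a toy model, not SCOM code). It assumes every loaded monitor evaluates every sample, regardless of which instance produced it. The monitor names, the 10% threshold, and the sample values are hypothetical.

```python
# Toy model: every loaded monitor evaluates samples from ALL counter instances.
# Not SCOM code; names, threshold, and sample values are illustrative assumptions.

def evaluate(free_pct, threshold_pct):
    return "unhealthy" if free_pct < threshold_pct else "healthy"


def total_state_changes(monitor_names, samples, threshold_pct):
    """Count state changes summed across all monitors on one agent."""
    states = {name: "healthy" for name in monitor_names}
    changes = 0
    for _instance, free_pct in samples:
        new_state = evaluate(free_pct, threshold_pct)
        for name in monitor_names:  # no instance filter: everyone sees everything
            if new_state != states[name]:
                states[name] = new_state
                changes += 1
    return changes


# Two sampling cycles on an agent where only D: is low on space.
samples = [("C:", 60.0), ("D:", 5.0), ("F:", 70.0)] * 2

targeted_at_os = total_state_changes(["OperatingSystem"], samples, 10.0)
targeted_at_disks = total_state_changes(["C:", "D:", "F:"], samples, 10.0)
print(targeted_at_os, targeted_at_disks)
```

In this model the per-disk targeting multiplies the state changes by the number of disks, because each disk's monitor repeats the same flip-flop.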
Now if I look at Health Explorer – I will likely see MULTIPLE disks have gone red, and are “flip-flopping” and throwing alerts like never before.
When you dig into Health Explorer – you will see – that they are being turned Unhealthy – and it isn't even by their own drive letter! I will examine the F: drive monitor:
I can see it was turned unhealthy because of the free space threshold hit on the D: drive!
and then flipped back to healthy due to the available space on the C: instance:
This is very, very bad. So – what are we supposed to do???
We need to target the specific class (Windows Server 2008 Logical Disk) AND then use a Wildcard parameter, to match the INSTANCE name of the perf counter to the INSTANCE name of the “Logical Disk” object. Make sense? For example – match up the “C:” perf counter instance – to the “C:” Device ID of the Logical Disk discovered in SCOM. This is actually easier than it sounds:
Good example:
I want to monitor for the perf counter Logical Disk\% Free Space\
I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.
I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.
I choose the perf counter I want – but this time, having learned from my mistake in Bad Example #2, I do NOT select all instances. Instead – I UNCHECK the “All Instances” box, and use the “fly-out” on the right of the “Instance:” box:
This fly-out will present wildcard options, which are discovered properties of the Windows Server 2008 Logical Disk class. You can see all of these if you view that class in Discovered Inventory. What we need to do now – is use Discovered Inventory to find a property that matches the perfmon instance name. In perfmon – we see the instance names are “C:” or “D:”
In Discovered Inventory – looking at the Windows Server 2008 Logical Disk, I can see that “Device ID” is probably a good property to match on:
So – I choose “Device ID” from the fly-out, which inserts this parameter wildcard, so that the monitor on EACH DISK will ONLY examine the perf data from the INSTANCE in perfmon that matches the disk drive letter.
The wildcard parameter is actually something like this:
$Target/Property[Type="MicrosoftWindowsLibrary6172210!Microsoft.Windows.LogicalDevice"]/DeviceID$
This simply is a reference to the MP that defined the “Device ID” property on the class.
Now – no more flip-flopping, no more statechangeevent floods, no more alert storms opening and closing several times per second.
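To see why the instance match fixes it, here is a minimal Python sketch (a toy model, not SCOM code). Each disk's monitor only evaluates the sample whose counter instance name equals that disk's Device ID. The function name, the 10% threshold, and the sample values are hypothetical.

```python
# Toy model: each disk monitor only accepts samples whose counter instance
# name matches its own Device ID. Not SCOM code; values are assumptions.

def evaluate_monitors(device_ids, samples, threshold_pct):
    """Return (final_states, state_change_counts), both keyed by Device ID."""
    states = {dev: "healthy" for dev in device_ids}
    changes = {dev: 0 for dev in device_ids}
    for instance, free_pct in samples:
        # Instance filter: samples with no matching Device ID (e.g. _Total)
        # are ignored, and no monitor ever sees another disk's data.
        if instance not in states:
            continue
        new_state = "unhealthy" if free_pct < threshold_pct else "healthy"
        if new_state != states[instance]:
            states[instance] = new_state
            changes[instance] += 1
    return states, changes


# Three sampling cycles; only D: is low on space.
samples = [("C:", 60.0), ("D:", 5.0), ("_Total", 40.0), ("F:", 70.0)] * 3

states, changes = evaluate_monitors(["C:", "D:", "F:"], samples, 10.0)
print(states)
print(changes)
```

The D: monitor changes state once and stays unhealthy, while C: and F: never change at all, no matter how many sampling cycles go by.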
You can use this same process for any multi-instance perf object. I have a (slightly less verbose) example using SQL server HERE.
To determine if you have already messed up…. you can look at “Top 20 Alerts in an Operational Database, by Alert Count” and “Historical list of state changes by Monitor, by Day:” which are available on my SQL Query List. These should indicate lots of alerts, and monitor flip-flop, and should be investigated.