Oct 15, 2010

Understanding Monitors in OpsMgr 2007 Part II – Aggregate Monitors

This post discusses aggregate monitors.  As mentioned in the first post, unit monitors can be thought of as the workhorse of monitoring.  Unit monitors are just that – a unit of monitoring: a self-contained engine that monitors a specific item and reflects the result in terms of health state, alerting and diagnostics/recovery.  Aggregate monitors act as collectors and consolidators of that information and ultimately reflect the collective result of their unit monitors.  For any defined class in OpsMgr there are 5 defined aggregate monitors – Entity Health, Availability, Configuration, Performance and Security.  This is shown below for the Windows Computer object.
image
Until now you might have just thought of these as categories you can use for grouping similar unit monitors together – and they are useful for that – but they are much more than categories.  If we look at the properties of the Availability aggregate as an example, we quickly see that this monitor is itself an engine for alerting and is configurable to reflect the health of its contained unit monitors.  We even have diagnostic and recovery options available for an aggregate monitor.
image
image
So when we configure a unit monitor we now understand that the setting to specify a parent monitor isn’t just cosmetic – it’s important.  This setting directly dictates where an unhealthy unit monitor will have an impact.  If we choose Availability, the health state of our unit monitor (and all others under the Availability aggregate) will be ‘watched’ and their collective health ultimately ‘rolled up’ to and reflected on the aggregate itself.  This offers some interesting possibilities.  If, for example, you don’t want to generate an alert based on a single unit monitor but would prefer to alert only when all of the unit monitors are unhealthy, the aggregate allows you to do just that!
So we have the Availability, Configuration, Performance and Security aggregates; we understand that these are default aggregates present on every monitored object, and that unit monitors configured under each directly ‘roll up’ their collective health to those aggregates.  In addition, we can create our own aggregates and plug them into the monitoring structure.  So, for example, if we wanted to subdivide our unit monitors under the Availability aggregate we could create another aggregate inline as shown.
image
Here we have 6 unit monitors that operate and generate state that is collected by the FileSystem Monitors aggregate.  The lower six unit monitors therefore do not, by their own operation, have a direct impact on availability health.  In this example, if any of the 6 unit monitors under FileSystem Monitors goes unhealthy, that state will be reflected on the FileSystem Monitors aggregate itself, and the health of the FileSystem Monitors aggregate will in turn be rolled forward to the general Availability aggregate.
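As an aside, you can see this whole structure from the OpsMgr Command Shell.  The sketch below is only illustrative: it assumes the shell's Get-MonitoringClass and Get-Monitor cmdlets plus the SDK's XmlTag and ParentMonitorID properties (which, as far as I know, distinguish unit, aggregate and dependency monitors and record the parent aggregate), so adjust for your environment:

# list every monitor targeted at Windows Computer and show whether it is a
# unit, aggregate or dependency monitor, and which parent aggregate it rolls up to
$class = Get-MonitoringClass -Name "Microsoft.Windows.Computer"

Get-Monitor | Where-Object { $_.Target.Id -eq $class.Id } |
    Sort-Object XmlTag, DisplayName |
    Select-Object DisplayName, XmlTag, ParentMonitorID |
    Format-Table -AutoSize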
OK, so we have the four categories of unit monitors and we know we can add our own aggregate monitors and specify where they should plug into the category model.
image
image
But even these four top-level category aggregate monitors themselves combine and are rolled up to the topmost aggregate for a class – the Entity aggregate.
image
So, ultimately, the overall health of the Availability, Configuration, Performance, Security and whatever other custom aggregates are in place on an object is taken into account, based on their ‘watched’ unit monitors, and rolled forward to the Entity aggregate.  The Entity aggregate reflects the overall health of the object in Health Explorer, so at a glance we can tell whether the monitored object, in this case Windows Computer, has any issue causing it to be unhealthy.  I’ve put Health Explorer for a Windows Computer object next to the same object in the Authoring > Monitors view to show how they map together.
image
So we have the health of all unit monitors targeted to a particular class rolling their state forward to finally be shown in the health of the class itself.  So once we have the health reflected on the class, where do we go from there?  A common thought is that the health just continues to roll up along the relationship chain between objects.  That is not correct.  In terms of health state, the buck stops at the class itself.  But, wait – you are about to scream that you have seen an unhealthy class roll up to and impact the health of a class higher up in the relationship chain….yes you have – but that isn’t done automatically.  It requires dependency monitors and they will be the focus of part III of our series.

Understanding Monitors in OpsMgr 2007 Part I – Unit Monitors

There are two key ways of delivering monitoring in OpsMgr 2007 – rules and monitors.  At first glance, rules appear to deliver much the same monitoring as monitors. 


image image
There are some similarities for sure, but rules and monitors are actually very different things.  There are two major things to understand about rules compared to monitors: rules have zero impact on measuring the health of the object being monitored, and rules can collect and store data while monitors don’t.  An example is instructive.  If you need to both collect performance data and also have the measurement of that same performance data impact the overall health of the monitored object, you need both a rule and a monitor.  Why?  Again, monitors don’t collect anything – they evaluate the data live and reflect back what is found in terms of health state changes.  Rules collect data but have no impact on health state.  So, in this example scenario, you need both.
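If you want to see this split for yourself, here is a minimal sketch from the OpsMgr Command Shell (the class name is just an example, and I'm assuming the Target property exposes the targeted class Id) that counts the rules and monitors targeted at a class; only the monitors will ever change its health state:

$class = Get-MonitoringClass -Name "Microsoft.Windows.Computer"

# monitors drive health state; rules only collect data and/or generate alerts
$monitors = @(Get-Monitor | Where-Object { $_.Target.Id -eq $class.Id })
$rules    = @(Get-Rule    | Where-Object { $_.Target.Id -eq $class.Id })

"{0}: {1} monitors (drive health state), {2} rules (collect/alert only)" -f `
    $class.DisplayName, $monitors.Count, $rules.Count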
Notice that in the preceding paragraph I made reference to monitors and rules being associated with an object.  What I really am talking about here is the class (aka, object) against which a rule or monitor is targeted.  Understanding targeting is pivotal to understanding OpsMgr 2007.  If targeting is a confusing topic for you or you want to refresh yourself on proper targeting techniques, take a look at the article I published in Technet magazine discussing this topic in detail - http://technet.microsoft.com/en-us/magazine/2008.11.targeting.aspx?pr=blog
Enough about rules – the topic at hand is monitors!  Monitors are where you see the power of OpsMgr 2007 and, from the list above, you can see that there is substantially more flexibility when using monitors vs. rules.  Remember, monitors are all about health, and that is the goal of OpsMgr 2007.  To restate, monitors watch whatever they are monitoring – performance data, WMI, event logs, whatever – and tell administrators about the results by changing the health state of the object being monitored.  One point here – on the rules node you see a category for alert-generating rules.  Monitors can definitely generate an alert as well, so don’t think you are missing out on that ability by choosing a monitor! 
There are three categories of monitors – unit monitors, aggregate monitors and dependency monitors. 
Unit Monitors
Unit monitors can be thought of as the workhorse of monitoring, and unit monitors drive health detection in OpsMgr.  Without unit monitors you would never know a problem exists!  The best way to get to know unit monitors is to work with them.  A caution here – make sure you do your testing in a lab environment, because any changes made in a production environment take effect right away.  Building multiple monitors for testing could cause notable churn in production – unexpected churn is far easier to absorb in a test lab!
There are lots of options for unit monitors – ranging from very straightforward to very complex.  Discussing each and every unit monitor is beyond the scope of this blog entry, but there are a couple that are particularly interesting.
Simple Event Detection – Detecting a simple event is easy with most monitoring solutions – including OpsMgr.  The Simple Event Detection monitor is, well, simple.  I describe it here as a starting point and because it will provide some good discussion on building monitors in general that is applicable to any monitor.
From the create monitor wizard, select a Simple Event Detection monitor.  For our example we will use Windows Event Reset as the type – more about that in a minute.  Make sure you choose a management pack other than the Default Management Pack to store this monitor!
image
On the general properties screen, choose a target and parent monitor.  For our example let’s assume we will be delivering additional monitoring to SQL 2005 servers.  If the SQL 2005 management pack is installed we will have a target called SQL 2005 DB Engine.  Select that target.  The next choice is which parent monitor should ‘contain’ our unit monitor.  For monitors, there are 4 general parent monitors that may be an option for you – Availability, Configuration, Performance and Security.  You may also see one for Backwards Compatibility, but that isn’t a category you should use when authoring a monitor yourself.  These categories allow grouping of unit monitors according to their general intended purpose.  If, for example, our monitor will be looking for events that could impact the general availability of the monitored object, it should be placed under Availability.  If our monitor will be looking for events that could impact the general performance of the monitored object, it should be placed under Performance.  We will discuss these categories in greater detail in Part II of this post because each category is actually itself a monitor – an aggregate monitor.
image
We are building a simple event detection monitor, so the next screen asks for the event log OpsMgr should look in for the event.  We will leave the default of Application, but note that this could be any event log present on the system.
image
So we’ve chosen the Application log; now we have to specify the event our monitor should look for.
image
Click next and – we are being asked again for the event log we want to use?  What’s going on here?  This is an excellent opportunity to discuss the option we first selected when we started building our monitor.  We chose a Simple Event Detection monitor and we chose that it should be a Windows event reset monitor.  Remember that monitors are all about health.  Monitors detect when a healthy condition goes unhealthy and can ALSO detect when an unhealthy condition returns to healthy.  That is, in fact, the holy grail of monitors – to detect when an unhealthy situation takes place and then automatically detect when the unhealthy condition goes away!  And that is exactly what is being done here.  One event, the first one with Event ID 1234, will indicate an unhealthy condition has taken place, and this second event – in the same or a different event log – will indicate that the unhealthy condition has been resolved – completely automatic!  Of course, not all monitoring scenarios lend themselves to that, which is why, in addition to Windows event reset, we also have options for timer reset and manual reset.  Timer reset means we detect the unhealthy condition and immediately start a timer.  If the unhealthy condition has not been detected again during our timing period (defined on the monitor) then we revert the health back to a healthy state.  If we detect the unhealthy condition again during the timing period, the timing period starts over.  The manual reset monitor means that once the unhealthy condition is noted it will not be reset until either manually touched or reset by some scripting method.  The manual reset monitor should not be in wide use and, when used, should be reserved for very specific scenarios.
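As a side note on manual reset: the ‘scripting method’ mentioned above usually means the OpsMgr SDK.  The sketch below is only an illustration and makes a few assumptions (that the MonitoringObject class exposes a ResetMonitoringState method, and that the monitor and class names used here exist in your environment):

# find the manual reset monitor and the class it targets (names are examples)
$monitor = Get-Monitor | Where-Object { $_.DisplayName -eq "My Manual Reset Monitor" }
$class   = Get-MonitoringClass -Name "Microsoft.SQLServer.2005.DBEngine"

# reset health on every instance that is not currently healthy
Get-MonitoringObject -MonitoringClass $class |
    Where-Object { $_.HealthState -ne "Success" } |
    ForEach-Object { $_.ResetMonitoringState($monitor) }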
We will again select the Application event log.
image
And select the event criteria that will indicate the monitor is again healthy.
image
Next we pull our two event IDs together and select whether the first event will raise a warning or a critical health state (there is a drop-down that lets you select warning or critical, but you can’t see it until you click on the warning value to change it).  The second event will return us to a healthy status.
image
The final screen of the wizard allows selection of whether or not this monitor will generate an alert.
image
Select Create and the monitor is saved.  If we go find our new monitor and open its properties, we see that there are actually additional items we can configure – such as product knowledge, diagnostic and recovery tasks, and overrides – that are not part of the initial wizard.  Product knowledge allows information about the monitor, and how to resolve detected problems, to be recorded.  Diagnostic and recovery allows specific steps to be configured as a response to the monitor changing state that may aid in diagnosing or fixing the problem, and overrides are where you specify any conditions other than the defaults that should be in place for all monitored objects or a subset thereof.  It is only possible to override values that have been authored to allow overriding.
image
image
image
image
We’ve gone screen by screen for our first example to illustrate a few important key concepts that will generally apply to all monitors.  For our next examples of interesting unit monitors we won’t go screen by screen but only show the relevant screens.
WMI Performance Monitors
In some cases it would be helpful to have a performance-monitor-style method of collecting information about an object, but there is no performance counter available.  An example of this might be file size.  Suppose you have a particular log file that needs to be monitored for an increase in size.  You look in the performance monitor counters and there is no counter available.  What option do you have?  Certainly a script would be a workable choice, but you may not be comfortable scripting.  Is there a way to take file size information and convert it to performance data that can be used in OpsMgr?  Absolutely – and there are lots of other examples beyond file size too.  I have previously described exactly how to set up such a scenario here, so I will avoid repeating it, but do take a look at that example – having this ability really is powerful.
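To get a feel for the underlying data, here is a small sketch (the path is a made-up example) of the kind of WMI query such a monitor could be built on; the CIM_DataFile class exposes a FileSize property, so no performance counter or custom script of your own is required:

# WQL requires doubled backslashes in the file path
Get-WmiObject -Query "SELECT Name, FileSize FROM CIM_DataFile WHERE Name = 'C:\\Logs\\MyApp.log'" |
    Select-Object Name, FileSize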
Log File Monitoring
Another monitor type that is little used but very useful is the log file monitor.  Many applications have log files they write to indicate application processing, error conditions, etc.  SQL has its error log, SCCM/SMS is chock full of logged information, and there are lots of other examples.  Over time there will be conditions in your environment that you know to be problems and that arise when specific log entries show up.  In many cases the provided SQL and/or SCCM management packs will handle the errors from those systems for you, but in cases where they don’t, being able to craft your own log monitor is useful.  And it’s not difficult.  Here are the relevant configuration screens.

On the Application Log Data Source screen we need to configure the directory where our log(s) are stored and then a pattern that will specify how to search.  Note that wildcards are supported so it would be possible to search multiple log files of similar name.
image
Next, we need to configure what information in the log file we want to detect.  In this case we are looking for the text ‘bummer’ – but it could also be a string of text rather than a single word.  Also, the parameter name will be the same regardless – this is the syntax that specifies we are defining the first parameter of interest, which is all I’ve ever needed. 
image
The next step is to configure how long we will let the alert remain before automatically resetting it.  This configuration is a timer-based monitor – I chose this option just to illustrate another example.  Note that you could just as easily have set up an event reset monitor here – all that would be needed is to define another parameter to flag the healthy condition.  Also note that my timer reset is set to an hour.  This means that if the problem condition is not detected again for an hour the monitor will reset, but if the problem condition is detected again, the timer gets reset and another full hour will be required before considering the problem cleared up.
image
The rest of the screens are similar to what has already been seen.
Repeated Event Detection
I’ve already described how to configure detection of a simple event.  With such a configuration, when the configured event comes in it is detected and the configured action, such as raising an alert, takes place.  There could be situations, though, where a single event may not indicate a problem.  However, if 3 of the same events happen within a 15-minute window, that may indicate a problem we need to investigate.  Taking what we already know from the simple event monitor, it’s easy to configure the repeat monitor.  Just choose the repeated event detection monitor, configure what event we care about, and then you will get to the screen shown below to configure our repeat settings.
There are a few choices for counting mode, but the one I find simple and useful is Trigger on count, sliding.  Basically this means that when the first event shows up we start a timer for, in this case, 15 minutes.  If another of the events shows up within that 15 minutes then we consider this a problem and move on to take action.  You can configure the repeat count however you like by adjusting the compare count settings.  There are other options here you can explore as you like.
image
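If the ‘trigger on count, sliding’ wording is still fuzzy, here is a standalone PowerShell illustration of the idea (purely conceptual, not OpsMgr code): keep the timestamps of recent occurrences, ignore anything older than the window, and trigger once the count inside the window reaches the threshold.

$windowMinutes = 15
$threshold     = 3
$occurrences   = New-Object System.Collections.ArrayList

function Add-Occurrence {
    $now = Get-Date
    [void]$occurrences.Add($now)

    # only occurrences inside the sliding window count toward the threshold
    $cutoff = $now.AddMinutes(-$windowMinutes)
    $recent = @($occurrences | Where-Object { $_ -ge $cutoff })

    if ($recent.Count -ge $threshold) {
        Write-Host "$($recent.Count) events in the last $windowMinutes minutes - would change health state now"
    }
}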
Correlated Event Monitor
A more complex but very useful monitor is the correlated event detection monitor.  Using this monitor it is possible to configure OpsMgr to watch for complex event patterns, whether in the same event log or different event logs, and alert only when the specified pattern is seen.  For our example I’ve chosen a Windows event reset monitor, which means that OpsMgr will watch for a specific Windows event to trigger the reset of health after a problem occurs.  I won’t go through configuring every event screen because it’s all similar to what we’ve already seen in the simple event discussion.  Note on the screenshot below, however, that there were 3 events to configure.  The first event is the reset event, while events A and B are the correlated events.  The screen below also shows options for how events A and B correlate to one another.
On the correlation screen we have a few options to discuss.  First is the correlation interval.  Like the repeated event detection, this interval specifies how long to watch for the event pattern after receiving the first event.  If the pattern doesn’t manifest in this configured time then there will be no change to health.  Also, there are multiple options for correlation settings, as shown below.  The wording here may be confusing at first, but the graphic shown on the configuration page changes as you move from option to option to illustrate what each option does.  Finally, note the occurrence and expression options.  With these you can increase the complexity of the filter by configuring how often the pattern should occur, as well as specific expression information that is beyond what we are discussing here.
image
image
Finally, we have the health screen that pulls this all together.  Note that the ‘event raised’ entry refers to the first event configured, which will trigger a healthy condition, whereas a match to the correlated event logic we just configured will result in a warning state (you can change it to critical if you like).
image
Part II of our discussion will turn to the aggregate rollup monitor and how it is used in OpsMgr.

MONITORING DMZ AND WORKGROUP COMPUTERS WITH SCOM 2007 R2 USING CERTIFICATES (ERRORS 21007 AND 21016 AFTER APPROVING THE AGENT IN PENDING MANAGEMENT)

A new guide to help you monitor servers in your DMZ or a workgroup with System Center Operations Manager.
There might be a few guides like this around the web, and I have used most of them.
But for the past 3 months I have been battling with a scenario where the agent would stay in a "not monitored" state after being approved in the pending management pane, and the agent had 21007 and 21016 events in the Operations Manager event log on the workgroup / DMZ server I wanted to monitor.

If you have a working gateway, and after you approve the agents in pending management and run momcertimport with successful results you still receive event IDs like 21007 and 21016 on the workgroup / DMZ agent, this guide is for you.
Well, my solution is available for you here.
First of all, and very basic (but not for me): I have a Windows Server 2003 Enterprise CA, so I used this guide to create my certificate template.
I followed that guide to the letter and still got those event IDs and no communication to my gateway.
Something was missing.
The first change I noticed was that I now had no option to save the certificate to the local computer certificate store.  This, of course, is because the Server 2008 enrollment pages would need administrator rights, which Internet Explorer does not run with.
So in order to export the certificate to a file I had to use Internet Explorer.
There, under Tools -> Internet Options -> Content, there is a Certificates section.
Click the Certificates button and you can export your certificate from there.
Remember to export the private key; after clicking the Next button leave the default selections marked,
but don’t mark 'Include all certificates in the certification path if possible' –
otherwise the momcertimport tool will not be able to import the certificate.

We will deal with the root CA certificate needed on the workgroup / DMZ server in a minute.


Then you can save your certificate to a .pfx file and copy it to the server you want to monitor.
Keep it in a shared folder for the duration of the install process because you will need it for the gateway server as well as the workgroup / DMZ server.

One more certificate is needed before we can continue, and again I used this guide:
http://technet.microsoft.com/en-us/library/bb735413.aspx – I used the section called "To download the Trusted Root (CA) certificate".

Notice that you might not be able to reach your CA server's web enrollment site from the workgroup computer, so you can do that from your root management server and just save the file in the folder where you saved the certificate for the workgroup / DMZ server you want to monitor.

And one last note before we begin: while most guides say the certificate subject (a.k.a. the name field) is the FQDN, don’t just tack your domain name onto the computer name.
Check before logging on to the workgroup / DMZ server: go to Start -> Computer -> Properties, check the full computer name, and copy that exact name to your gateway's hosts file if no DNS resolution is available.

NOW FOR THE STEP BY STEP GUIDE

1.    PREPARING TO INSTALL THE AGENT ON THE WORKGROUP MACHINE

I recommend you copy these folders from your SCOM CD to one folder you can move around in your environment; let's call it our "scomdmz" folder.  Inside you will need these folders:
* SupportTools
* agent
I recommend you also copy these files to that same folder:
* server_cert.pfx (the certificate you created using a template for your workgroup / DMZ server)
* CA_certificate_chain.p7b (the trusted Root (CA) certificate)
Move this folder to your workgroup machine (keep a copy of your server_cert.pfx to copy to your gateway server later).

2.     INSTALLING THE AGENT ON THE WORKGROUP MACHINE

Run the MSI installation on your server.  If there is no DNS resolution for your gateway server, ping -a its IP address to see if you get the name of your gateway server; if not, you will need to add your gateway server's FQDN to your hosts file – it's in c:\windows\system32\drivers\etc.


(we use the example in our org…)

I KNOW THIS IS VERY BASIC STUFF RIGHT HERE – I want this guide to apply even to those who don’t deal with this on a daily basis.

Now, to prevent any human typing mistakes:
write the gateway server's FQDN in the hosts file, copy & paste it into the management computer name during the agent install, and I recommend you also copy & paste it to a command line and telnet to that name on port 5723 to check connectivity to your gateway.
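If the telnet client isn't installed (it isn't by default on Server 2008), a small PowerShell sketch like the one below can do the same port check; the gateway name is just an example:

$gateway = "gateway01.yourdomain.com"
$client  = New-Object System.Net.Sockets.TcpClient
try {
    $client.Connect($gateway, 5723)
    Write-Host "TCP 5723 to $gateway is reachable"
}
catch {
    Write-Host "Cannot reach $gateway on TCP 5723: $_"
}
finally {
    $client.Close()
}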
Click next – you're almost home free…

3.     IMPORTING THE CERTIFICATES TO YOUR GATEWAY AND SERVER

THIS WILL BE SPLIT INTO TWO PARTS

A.  IMPORTING THE CERTIFICATES ON THE DMZ SERVER YOU WANT TO MONITOR -

Using the momcertimport tool:
- On the workgroup / DMZ server, go to Start and type cmd (on 2008), or go to Run and type cmd (on 2003).
- One very important check: if you're on Server 2008, make sure your command prompt is running with administrator rights (if not, right-click the icon and run it as administrator).



The tool is in the SupportTools folder (the one we copied earlier if you followed step one).
So! The way to run this tool is simple: get to it in the command prompt and give it the server certificate file, like so:
c:\dmzfolder\SupportTools\i386\momcertimport server_cert.pfx
Type the password for the key and you should receive a successfully imported message.

YOU GOT THIS FAR – you stopped and started the health service as asked by the momcertimport tool after importing the certificate, and you still receive those 21007 and 21016 events?  Then you will need to follow these few steps.

What you need now is another certificate to be imported.
    1.     Go to Start, run mmc -> File -> Add/Remove Snap-in…
    2.     Add Certificates, select Computer account, click Next, choose Local computer, click OK and exit the dialog – that's all you need for the console.
    3.     Under the Certificates node, go to the Trusted Root Certification Authorities folder, right-click it, choose All Tasks -> Import… and import the CA_certificate_chain.p7b we prepared in step 1 of this guide into the Trusted Root Certification Authorities folder.  Most of the time the folder will already contain certificates, but don't skip this stage.



B. IMPORTING THE CERTIFICATES ONTO YOUR GATEWAY SERVER – again, this is for all of you battling with error ID 21037 on your gateway (and of course any kind of lack of communication between the agent and your gateway server).

    1.     Go to Start, run mmc -> File -> Add/Remove Snap-in…
    2.     Add Certificates, select Computer account, click Next, choose Local computer, click OK and exit the dialog – that's all you need for the console.
    3.     Go to the Trusted Root Certification Authorities folder and import the server_cert.pfx we talked about in step one into that folder.
    4.     Go to the Personal folder and import it into that folder as well.



Note: we are importing the certificate of the server that we want to monitor into our gateway's Trusted Root Certification Authorities folder and into its Personal folder.



4.     CHECKING THE COMMUNICATION -

After all the certificates have been imported to the gateway server and to our soon-to-be-monitored server, in order for these changes to take effect we'll have to do the following steps:

Restart the health service (known as System Center Management) on your gateway.
Restart the health service (known as System Center Management) on your root management server.
Restart the health service (known as System Center Management) on your DMZ server.
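If you prefer to do those restarts from PowerShell rather than the Services console, a one-liner like this sketch, run on each of the three servers, should do it (the Windows service name behind the "System Center Management" display name is HealthService):

# restart the OpsMgr health service on the local machine
Restart-Service -Name HealthService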


Check your DMZ server's event viewer to see if the error IDs repeat.
Some changes take time, so you might want to wait 5-10 minutes.  After 10 minutes, restart the health service again on your DMZ server and check your event viewer for the IDs.  If you still receive them, restart the health service again on your root management server and your gateway server.

OpsMgr 2007 Overrides Report (PowerShell)

This PowerShell script retrieves all the overrides from your unsealed management packs, including all the details of each override – target, value, and the source rule or monitor.  Results are output to a single CSV file that you can open in Excel.

# ==============================================================================================
#
# Microsoft PowerShell Source File -- Created with SAPIEN Technologies PrimalScript 2009
#
# NAME: OpsMgr Overrides Report
#
# AUTHOR: Daniele Muscetta and Pete Zerger
# DATE  : 8/24/2010
#
# COMMENT: This report will output the overrides in your OpsMgr environment including
#          override settings, target, source rule/monitor and source management pack.
# ==============================================================================================

#---Save the following text as script "Export-Overrides.ps1"

#define the path you want to export the CSV files to
$exportpath = "c:\scripts\export\"

#gets all UNSEALED management packs
$mps = get-managementpack | where {$_.Sealed -eq $false}

#array to hold the cumulative overrides from all management packs
$MPRpt = @()

#loops thru the management packs
foreach ($mp in $mps)
{
    $mpname = $mp.name
    Write-Host "Exporting Overrides info for Management Pack: $mpname"
   
    #array to hold all overrides for this MP
    $MPRows = @()

    #Gets the actual override objects
    $overrides = $mp | get-override

    #loops thru those overrides in order to extract information from them
    foreach ($override in $overrides)
    {

        #Prepares an object to hold the result
        $obj = new-object System.Management.Automation.PSObject
       
        #clear up variables from previous cycles.
        $overrideName = $null
        $overrideProperty = $null
        $overrideValue = $null
        $overrideContext = $null
        $overrideContextInstance = $null
        $overrideRuleMonitor = $null

        # give proper values to variables for this cycle. this is what we can then output.
        $name = $mp.name
        $overrideName = $override.Name
        $overrideProperty = $override.Property
        $overrideValue = $override.Value
        # the trap blocks swallow errors when Context / ContextInstance are not set on an override
        trap { $overrideContext = ""; continue } $overrideContext = $override.Context.GetElement().DisplayName
        trap { $overrideContextInstance = ""; continue } $overrideContextInstance = (Get-MonitoringObject -Id $override.ContextInstance).DisplayName
           
        if ($override.Monitor -ne $null){
            $overrideRuleMonitor = $override.Monitor.GetElement().DisplayName
        } elseif ($override.Discovery -ne $null){
            $overrideRuleMonitor = $override.Discovery.GetElement().DisplayName
        } else {
            $overrideRuleMonitor = $override.Rule.GetElement().DisplayName
        }
       
        #fills the current object with those properties
        #$obj = $obj | add-member -membertype NoteProperty -name overrideName -value $overrideName -passthru
        $obj = $obj | add-member -membertype NoteProperty -name overrideProperty -value $overrideProperty -passthru
        $obj = $obj | add-member -membertype NoteProperty -name overrideValue -value $overrideValue -passthru
        $obj = $obj | add-member -membertype NoteProperty -name overrideContext -value $overrideContext -passthru
        $obj = $obj | add-member -membertype NoteProperty -name overrideContextInstance -value $overrideContextInstance -passthru
        $obj = $obj | add-member -membertype NoteProperty -name overrideRuleMonitor -value $overrideRuleMonitor -passthru
        $obj = $obj | add-member -membertype NoteProperty -name MPName -value $name -passthru
        $obj = $obj | add-member -membertype NoteProperty -name overrideName -value $overrideName -passthru
       

        #adds this current override to the array
        $MPRows = $MPRows + $obj
    }
   
  #Store up the overrides for all packs to a single variable
  $MPRpt = $MPRpt + $MPRows

}
    #exports cumulative list of overrides to a single CSV

    $filename = $exportpath + "overrides.csv"
    $MPRpt | Export-CSV -path $filename -notypeinfo
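
A quick note on running it: the script assumes the OpsMgr cmdlets are loaded and a management group connection already exists, which is easiest from the OpsMgr Command Shell.  If you want to run it from a plain PowerShell prompt instead, a start-up sequence like the following sketch (the RMS name is an example) should get you there first:

add-pssnapin "Microsoft.EnterpriseManagement.OperationsManager.Client"
set-location "OperationsManagerMonitoring::"
new-managementGroupConnection -ConnectionString "rms01.yourdomain.com"
set-location "rms01.yourdomain.com"

C:\scripts\Export-Overrides.ps1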

Oct 10, 2010

Error when installing OpsMgr Reporting: ‘Could not verify if current user is in sysadmin Role’

Are you getting this error when trying to install reporting? 
 error
Here are the steps to resolve it.


1.  Check user permissions.
-  Verify the user you are running the installer as is a member of the Operations Manager Administrators role.
-  Verify the user has sysadmin access to the database in SQL.
2.  Check the SPN of the SDK Service.
- http://wchomak.spaces.live.com/blog/cns!F56EFE25599555EC!824.entry?sa=646856610
- http://blogs.technet.com/jonathanalmquist/archive/2008/08/14/operations-manager-2007-spn-s.aspx
- http://blogs.technet.com/kevinholman/archive/2007/12/13/system-center-operations-manager-sdk-service-failed-to-register-an-spn.aspx
3.  Check the Operations Manager database.
- Go into SQL Server Management Studio.
- Expand Databases, OperationsManager, and Tables.
- Right-click on the MT_ManagementGroup table.
- Click Open Table if you are using SQL Server 2005, or click Edit Top 200 Rows if you are using SQL Server 2008.
- Look at the value in the column SQLServerName_6B1D1BE8_EBB4_B425_08DC_2385C5930B04.
- This should be the name of your Operations Manager database server.  (If you ever moved your Operations Manager database to a new SQL server, there is a chance this step got missed.)
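If you'd rather not browse the table by hand, here is a quick sketch that reads the same value with plain ADO.NET from PowerShell (the server and database names are examples; adjust them to your environment):

$conn = New-Object System.Data.SqlClient.SqlConnection("Server=OPSDB01;Database=OperationsManager;Integrated Security=SSPI")
$conn.Open()
$cmd  = $conn.CreateCommand()
$cmd.CommandText = "SELECT SQLServerName_6B1D1BE8_EBB4_B425_08DC_2385C5930B04 FROM dbo.MT_ManagementGroup"
$cmd.ExecuteScalar()   # should return your Operations Manager database server name
$conn.Close()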

Moving the Data Warehouse Database and Reporting server to new hardware

The time has come to move my Warehouse Database and OpsMgr Reporting Server role to a new server in my lab.  Today both roles are installed on a single server (named OMDW).  This server is running Windows Server 2008 SP2 x86, and SQL 2008 SP1 DB engine and SQL Reporting (32-bit to match the OS).  This machine is OLD, and only has 2GB of memory, so it is time to move it to a 64-bit capable machine with 8GB of RAM.  The old server was really limited by the available memory, even for testing in a small lab.  As I do a lot of demos in this lab, I need reports to be a bit snappier.
The server it will be moving to is running Server 2008 R2 (64-bit only) and SQL 2008 SP1 (x64).  Since Operations Manager 2007 R2 does not yet support SQL 2008 R2 at the time of this writing, we will stick with the same SQL version.


We will be using the OpsMgr doco – from the Administrators Guide:
http://technet.microsoft.com/en-us/library/cc540402.aspx

So – I map out my plan. 
  1. I will move the warehouse database.
  2. I will test everything to ensure it is functional and working as hoped.
  3. I will move the OpsMgr Reporting role.
  4. I will test everything to ensure it is functional and working as hoped.

Move the Data Warehouse DB:
Using the TechNet documentation, I look at the high level plan:
  1. Stop Microsoft System Center Operations Manager 2007 services to prevent updates to the OperationsManagerDW database during the move.
  2. Back up the OperationsManagerDW database to preserve the data that Operations Manager has already collected from the management group.
  3. Uninstall the current Data Warehouse component, and delete the OperationsManagerDW database.
  4. Install the Reporting Data Warehouse component on the new Data Warehouse server.
  5. Restore the original OperationsManagerDW database.
  6. Configure Operations Manager to use the OperationsManagerDW database on the new Data Warehouse server.
  7. Restart Operations Manager services.

Sounds easy enough.  (gulp)

  • I start with step 1 – stopping all RMS and MS core services.
  • I then take a fresh backup of the DW DB and master.  This is probably one of the most painful steps – as on a large warehouse – this can be a LONG time to wait while my whole management group is down.
  • I then uninstall the DW component from the old server (OMDW) per the guide.
  • I then (gasp) delete the existing OperationsManagerDW database.
  • I install the DW component on the new server (SQLDW1).
  • I delete the newly created and empty OperationsManagerDW database from SQLDW1.
  • I then need to restore the backup I just recently took of the warehouse DB to my new server.  The guide doesn’t give any guidance on these procedures – this is a SQL operation and you would use standard SQL backup/restore procedures here – nothing OpsMgr specific.  I am not a SQL guy – but I figure this out fairly easily.
  • Next up is step 8 in the online guide – “On the new Data Warehouse server, use SQL Management Studio to create a login for the System Center Data Access Service account, the Data Warehouse Action Account, and the Data Reader Account.”  Now – that’s a little bogus documentation.  The first one is simple enough – that is the “SDK” account that we used when we installed OpsMgr.  The second one, though – that isn’t a real account.  When we installed Reporting we were asked for two accounts – the “reader” and “write” accounts.  The above-referenced Data Warehouse Action Account is really your “write” account.  If you aren't sure, there is a Run As profile for this where you can see what credentials you used.
  • I then map my logins I created to the appropriate rights they should have per the guide.  Actually – since I created the logins with the same names – mine were already mapped!
  • I start the Data Access (SDK) service ONLY on the RMS
  • I modify the reporting server data warehouse main datasource in reporting.
  • I edit the registry on the current Reporting server (OMDW) and have to create a new registry value for DWDBInstance per the guide – since it did not exist on my server yet.  I fill it in with “SQLDW1\I01” since that is my servername\instancename
  • I edit my table in the OpsDB to point to the new Warehouse DB servername\instance
  • I edit my table in the DWDB to point to the new Warehouse DB servername\instance
  • I start up all my services.
Now – I follow the guidance in the guide to check to make sure the move is a success.  Lots of issues can break this – missing a step, misconfiguring SQL rights, firewalls, etc.  When I checked mine – it was actually failing.  Reports would run – but lots of failed events on the RMS and management servers.  Turns out I accidentally missed a step – editing the DW DB table for the new name.  Once I put that in and bounced all the services again – all was well and working fine.

Now – on to moving the OpsMgr Reporting role!

Using the TechNet documentation, I look at the high level plan:
  1. Back up the OperationsManagerDW database.
  2. Note the accounts that are being used for the Data Warehouse Action Account and for the Data Warehouse Report Deployment Account. You will need to use the same accounts later, when you reinstall the Operations Manager reporting server.
  3. Uninstall the current Operations Manager reporting server component.
  4. Restore the original OperationsManagerDW database.
  5. If you are reinstalling the Operations Manager reporting server component on the original server, run the ResetSRS.exe tool to clean up and prepare the reporting server for the reinstallation.
  6. Reinstall the Operations Manager reporting server component.

Hey – even fewer steps than moving the database! 
***A special note – if you have authored/uploaded CUSTOM REPORTS that are not deployed/included within a management pack – these will be LOST when you follow these steps.  Make sure you export any custom reports to RDL file format FIRST, so you can bring those back into your new reporting server.

  • I back up my DataWarehouse database.  This step isn't just precautionary – it is REQUIRED.  When we uninstall the reporting server from the old server, it modifies the Warehouse DB in such a way that we cannot use it – we must return it to the state it was in before anything was modified, in preparation for the new installation of OpsMgr Reporting on the new server.
  • Once I confirm a successful backup, I uninstall OpsMgr R2 Reporting from my old reporting server.
  • Now I restore my backup of the OperationsManagerDW database I just took prior to the uninstall of OpsMgr reporting.  My initial attempts at a restore failed – because the database was in use.  I needed to kill the connections to this database which were stuck from the RMS and MS servers.
  • I am installing OpsMgr reporting on a new server, so I can skip step 4.
  • In steps 5-10, I confirm that my SQL reporting server is configured and ready to roll.  Ideally, this should have already been done BEFORE we took down reporting in the environment.  This really is a bug in the guide – you should do this FIRST, BEFORE even starting down this road.  If something was broken, we don’t want to be fixing it while reporting is down for all our users.
  • In step 11, I kick off the Reporting server role install.  Another bug in the guide found:  they tell us to configure the DataWarehouse component to “this component will not be available.”  That is incorrect.  That would ONLY be the case if we were moving the OpsMgr reporting server to a standalone SRS/Reporting-only server.  In my case, I am moving reporting to a server that contains the DataWarehouse component – so this should be left alone.  I then chose my SQL server name\instance, and typed in the DataWarehouse write and reader accounts.  SUCCESS!!!!
Now – I follow the guide and verify that reporting is working as designed.
Mine (of course) was failing – I got the following error when trying to run a report:

Date: 8/24/2010 5:49:27 PM
Application: System Center Operations Manager 2007 R2
Application Version: 6.1.7221.0
Severity: Error
Message: Loading reporting hierarchy failed.
System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.10.10.12:80
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Int32 timeout, Exception& exception)
   --- End of inner exception stack trace ---
   at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context)
   at System.Net.HttpWebRequest.GetRequestStream()
   at System.Web.Services.Protocols.SoapHttpClientProtocol.Invoke(String methodName, Object[] parameters)
   at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.ReportingService.ReportingService2005.ListChildren(String Item, Boolean Recursive)
   at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.ManagementGroupReportFolder.GetSubfolders(Boolean includeHidden)
   at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.WunderBar.ReportingPage.LoadReportingSubtree(TreeNode node, ManagementGroupReportFolder folder)
   at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.WunderBar.ReportingPage.LoadReportingTree(ManagementGroupReportFolder folder)
   at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.WunderBar.ReportingPage.LoadReportingTreeJob(Object sender, ConsoleJobEventArgs args)
System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.10.10.12:80
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Int32 timeout, Exception& exception)

The key area of this is the failed connection to 10.10.10.12:80 in the stack trace above.  I forgot to open a rule in my Windows Firewall on the reporting server to allow access to port 80 for web reporting.  DOH!
Now – over the next hour – I should see all my reports from all my MP’s trickle back into the reporting server and console.

Operations Database is full!

image


Let’s say you find yourself in a pickle. 
Perhaps you ignored your Operations Database size, perhaps grooming was failing and you didn’t notice, perhaps you wrote a BAD rule, and FLOODED the database with events, or performance data?
Now, your database is full, and there is no more free space on the disk?

What if you want to get rid of the data RIGHT NOW?

We can run grooming manually.  I discuss a bit about the inner-workings of the grooming process HERE.  We can execute grooming by opening SQL Management Studio, and opening a query window against the OpsDB – and running the grooming procedure “EXEC p_PartitioningAndGrooming”.


You will either get a success – or a failure.  If this fails, it is typically because the transaction log fills up before the job can complete.  If you need more transaction log space, it usually means you have a LARGE amount of non-partitioned objects to groom.

Data types:  The most common data types we insert (and have to groom) in the OpsDB are:
  • Alerts
  • Events
  • Performance
  • Performance Signature
  • Discovery data
Let’s talk about partitioned and non-partitioned data types.  Events and Performance data in the OpsDB are partitioned.  All the others are not partitioned.  There are 61 tables to store Events and 61 tables to store Performance data in the operations DB.  Each table represents one day's worth of storage.  This is done to assist in grooming.  Since there can be a HUGE amount of event and performance data, we groom these by truncating a daily table, which is FAR more efficient than using a “delete from tablename where date > xx/yy/zz”.  Truncating a table uses almost no transaction log space or time, while “delete from” uses a bunch.
When we groom partitioned data, the first thing we do is truncate the next table in the list, then change the “IsCurrent” marker to the newly empty table.  You can look at this “map” in the PartitionTables table in the database.
To see which tables we are currently writing to – check out:
select * from PartitionTables
where IsCurrent = 1
So – IF our OpsDB is flooded with data and we just need to clear up some space to work with…. a way to cheat is to run the standard grooming stored procedure 62 times.  This will force a truncate of all partitioned data in the database.
So we would run:  EXEC p_PartitioningAndGrooming in the SQL query window, 62 times.  You can track the progress by running the “IsCurrent” query check above.  This will wipe out all the partitioned data, and free up a ton of space in your DB really quickly.
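If you don't feel like pressing F5 62 times in Management Studio, here is a small PowerShell sketch (plain ADO.NET; the server name is an example) to do the repetition for you:

$conn = New-Object System.Data.SqlClient.SqlConnection("Server=OPSDB01;Database=OperationsManager;Integrated Security=SSPI")
$conn.Open()
$cmd  = $conn.CreateCommand()
$cmd.CommandText    = "EXEC p_PartitioningAndGrooming"
$cmd.CommandTimeout = 0   # grooming can take a while per pass
for ($i = 1; $i -le 62; $i++)
{
    Write-Host "Grooming pass $i of 62"
    [void]$cmd.ExecuteNonQuery()
}
$conn.Close()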
For non-partitioned data – there are no shortcuts… you have to groom this the old fashioned way, and wait for it to complete.  Once your DB is healthy again – this will go back to being a quick and painless process.