Oct 6, 2010

Generic Trouble Shooting Guide for SCOM

Installing SCOM R2 can be a challenge. However, Microsoft has provided many good guides on how to go about it and, even more important, on what to reckon with when designing a SCOM R2 environment. When your design is right and the preparations are done properly, the installation should be straightforward and free of surprises.
On top of that, the installation of a SCOM R2 environment happens only once or twice (when you need a test environment as well, for instance). After that it is time to start using the SCOM R2 environment, which starts with these steps:


  1. Configure the Core MP of SCOM R2 (people often forget this, but it is really important, so RTFM is the magic word here);
  2. Configure SCOM R2 (Resolution States, DB Grooming and the lot);
  3. Deploy the SCOM R2 Agents to the servers which need to be monitored;
  4. Import the MPs as needed (RTFM before, during and after!), starting with the Server OS MP;
  5. Tune the MP as required by the business, based on the guide that comes with the Server OS MP (RTFM again);
  6. When all is well and the Alerts coming in are relevant (no noise), import the next MP;
  7. Repeat Steps 4 to 6 for each MP to be imported.
At a certain point in time your IT organization has a fully operational SCOM R2 environment. All goes well. Tuning and tweaking take place while SCOM R2 is in use, and the connection to the organization is being tuned as well. The latter is always a work in progress: every organization is dynamic, so changes are more than likely to occur and SCOM R2 needs to adapt in order to reflect the current situation.
But then something happens and the SCOM R2 environment turns sour. A set of disks stops functioning, a bad MP is imported, someone deletes the SCOM R2 service accounts (had this issue once!), the RMS stops running, or the related SQL server suffers a hardware failure. Or everything seems to be just fine but ‘only’ the HealthService on the RMS stalls every hour or so…
So now the SCOM R2 environment becomes very silent, and instead of being the IT shop’s looking glass on all servers and services, SCOM R2 itself needs some serious attention. But where to start? What to do and, more importantly, what NOT to do?

With this posting I hope to help you troubleshoot a (partially) failed SCOM R2 environment in order to get things working again or, when you think it is way over your head, to log a call with Microsoft CSS and provide them with some good information.
But before I start I want to emphasize two very important things:
  • Know what you know, know what you don’t know, and NEVER mix the two
    Whenever you bump into something which you do NOT totally understand, leave it. Do not alter anything without a full understanding of the consequences. And even when you do understand it, back up the OpsMgr R2 DBs first in order to have a way back, and check the validity of those backups. Otherwise SCOM R2 could end up in an unsupported state, or Microsoft CSS has to troubleshoot an extra complex issue: the original error in SCOM R2 plus your ‘repairs’ afterwards. Microsoft knows much and has a lot of experience, but they can’t perform magic…
  • Backup, backup and backup AND VALIDATE
    Be sure to have a valid backup mechanism in place which runs on a regular schedule. Validation is required as well, so that you know for sure that the disks/tapes containing the backup are really valid and do not contain some blob of data without any real value. Check this when all is well and not after SCOM R2 has become (partially) dysfunctional. It will save you and your colleagues a lot of frustration and perhaps even your job…

    A backup of only the SCOM R2 servers will NOT suffice. Back up the DBs as well (use a ‘connector’ for it), along with the unsealed MPs. A backup of the EncryptionKey (with a VALID password) is also a requirement. This way you have covered the SCOM R2 environment from end to end.
Having said that, it’s time to move on. This is what I do when SCOM R2 is experiencing issues which need attention:
  1. What exactly is happening and since when?
    Find out what exactly is happening and since when. Try to describe it as briefly as possible and attach a date and time to it. This is important not only when you want to call in Microsoft CSS but also for yourself, so that you do not start a wild goose chase. Also be aware that there is a huge difference between when the problem was detected and when the problem started. Try to get to the bottom of it all.
  2. Can it be reproduced?
    Sometimes network errors occur which can have an impact on SCOM R2. When the network is fine again, SCOM R2 should be fine as well. So try to see whether you can reproduce the error. If not, it is back to business. When it comes back, it is time to take a deeper dive.
  3. Ask questions
    Did anything take place before the SCOM R2 environment started to fail? Importing a MP, for instance (poorly written MPs can often wreak havoc…), AD changes, migrations, failovers or network changes? Did anyone perform any task on the SQL server(s) hosting the SCOM R2 DBs and the SSRS instance? So inform yourself thoroughly and communicate with your colleagues and team members.
  4. Differentiate between main issues and secondary ones
    When a SCOM R2 environment is experiencing issues, many things can happen. Try to differentiate between the main issue(s) and the less important ones, and target your troubleshooting efforts at the main issues. The less important ones are mostly caused by the main issue(s).
  5. Check out the SCOM R2 services on the RMS
    Are the three SCOM R2 services still operational on the SCOM R2 RMS? Nothing stopped? Nothing stalled?
  6. Check out the SCOM R2 DBs
    Are the DBs still OK? Can you access them from SQL Server Management Studio? Are the DBs still healthy? Can you query them? Can you connect to the SQL server from all SCOM R2 Management Servers? (A Telnet client is required for this test.)
  7. Is the SCOM R2 Console still operational?
    When you’re able to open the SCOM R2 Console and navigate through it, and the Views are refreshed, you know the SDK service is still running and the OpsMgr DB is still accessible and operational. So a simple check tells you a lot.
  8. Check out the OpsMgr event logs on the RMS and Management Servers
    These logs are really great and tell you so much. They are the first place to go in order to get a better understanding of what is happening and why. Of course, the SCOM R2 Core MP picks up a lot of these events and raises one or more Alerts in the SCOM R2 Console, but it is still wise to check the logs as well, since not all Events are covered by the Core MP.
  9. OpsMgr event log on the RMS
    Look on the RMS first, since that server is the top level server of the SCOM R2 hierarchy. Stop the HealthService (System Center Management) on the RMS and start it again. This forces the RMS to reprocess its configuration, including the SCOM R2 service accounts. When anything is wrong there, many errors will be shown in the event log. Keep a watchful eye on it and refresh it many times. When something serious is at hand, it will mostly be displayed in the event log within ten minutes.
  10. OpsMgr event log on the MS servers
    These are the servers the monitored servers (aka Managed servers) report to. The SCOM R2 Management Servers write directly to the SCOM R2 DBs, so when anything goes wrong, these servers should report on it in the OpsMgr event log. Same procedure here: restart the HealthService and check the logs. Keep a watchful eye on the log for the first ten minutes after the HealthService has been restarted. When something goes wrong, it should show up in that timeframe.
  11. Bounce the RMS related SCOM R2 Services
    When nothing strange comes out of the event log checks above, it is time to restart the SCOM R2 services on the RMS (NOT ON THE SCOM R2 MANAGEMENT SERVERS!): restart the Config Service (System Center Management Configuration) first and check the OpsMgr event log in order to see what comes out. Do the same for the SDK Service (System Center Data Access). Keep a watchful eye on the OpsMgr event log.
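The connectivity test from Step 6 (can every Management Server reach the SQL server?) can also be scripted instead of using Telnet. Below is a minimal Python sketch of such a check. It assumes the SQL instance listens on TCP port 1433, the default for a default instance; named instances and custom configurations may use another port. The host name in the usage note is hypothetical.

```python
import socket


def can_reach(host: str, port: int = 1433, timeout: float = 3.0) -> bool:
    """Return True when a TCP connection to host:port can be opened.

    This only proves that the port is reachable from this machine, which is
    the same thing a Telnet test shows. It says nothing about SQL logins or
    about the health of the databases.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run on each Management Server, a call like `can_reach("sql01.contoso.local")` gives a quick yes/no answer per server.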
Wow! Stop! I have found some or many serious errors! What’s next? Good question.
Start at the bottom of it
Even when many errors/warnings are shown in the OpsMgr event log, the first one, or the first series of three to five events, is mostly the real cause. The other ones are often failing workflows BECAUSE some required basic processes are failing. So take a good look at the first errors and warnings.
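When the event log has been exported, picking out those first few events can be automated. The sketch below is only an illustration of the idea: the `Event` record and its fields are hypothetical and do not correspond to any SCOM or Windows API, so the exported data would have to be mapped onto it first.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    """Hypothetical, simplified view of one OpsMgr event log entry."""
    time: datetime
    level: str       # "Error", "Warning" or "Information"
    event_id: int
    message: str


def first_problems(events, restarted_at, limit=5):
    """Return the earliest errors/warnings logged at or after restarted_at.

    The earliest problems after a service restart are the most likely root
    causes; later ones are often just workflows failing as a consequence.
    """
    problems = [
        e for e in events
        if e.time >= restarted_at and e.level in ("Error", "Warning")
    ]
    return sorted(problems, key=lambda e: e.time)[:limit]
```

Passing the time at which the HealthService was restarted as `restarted_at` keeps older, unrelated events out of the result.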
A good internet connection is important now. Use your favorite search engine and query the internet using this format: SCOM <eventid> followed by the most important piece of information displayed in the general part of the Event, like ‘Failed to store data in the Data Warehouse. Exception 'SqlException': Timeout expired.’ for instance. Leave out specific details like server names, GUIDs and SCOM R2 Management Group names.
Details displayed in things like Workflow names can also give a good clue as to what is causing the issue. So always read the full event and not only the headers.
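The search format described above (SCOM, the EventID, and the general message without environment specific details) can be produced by a small helper. This is a sketch under assumptions: GUIDs are recognized by their usual 8-4-4-4-12 hexadecimal shape, while server and Management Group names cannot be guessed and have to be passed in by hand. The names used in the usage note are made up.

```python
import re

# Matches GUIDs such as {0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0}, braces optional.
GUID = re.compile(
    r"\{?[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\}?"
)


def build_query(event_id, message, specifics=()):
    """Build a search string of the form 'SCOM <eventid> <general message>'.

    specifics lists the server names, Management Group names and similar
    environment specific strings that must be stripped from the message.
    """
    text = GUID.sub(" ", message)
    for item in specifics:
        text = re.sub(re.escape(item), " ", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    text = " ".join(text.split())
    return f"SCOM {event_id} {text}"
```

For example, `build_query(31552, "Failed to store data in the Data Warehouse. Exception 'SqlException': Timeout expired. Management group: MG01", ["MG01"])` keeps the general message and drops the Management Group name.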
Some tricks to get things going again:
  • Remove the latest imported MP
    Only when it’s relevant, of course. When your SCOM R2 environment starts having issues on the 1st of October 2010 and the last MP you imported or changed was two months ago, chances are that the cause is to be found somewhere else.
  • Clear the HealthService State on the (R)MS server experiencing the issues
    On the SCOM R2 (R)MS server which is experiencing the issues, stop the HealthService, rename the folder ‘~:\Program Files\System Center Operations Manager 2007\Health Service State’ to ‘~:\Program Files\System Center Operations Manager 2007\Health Service State_OLD’ and start the HealthService again.
  • Clear the HealthService State on the Agent(s) causing the issues
    When you have pinpointed the issues and suspect one or more SCOM R2 Agents to be the culprit(s), stop the HealthService on each of them, rename the folder ‘~:\Program Files\System Center Operations Manager 2007\Health Service State’ to ‘~:\Program Files\System Center Operations Manager 2007\Health Service State_OLD’ and start the HealthService again.
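The stop, rename, start sequence for clearing the HealthService State can be sketched in Python as shown below. Treat it as an illustration rather than a supported procedure: it assumes the short service name is `HealthService`, that it runs in an elevated session on the affected server, and that the caller supplies the real path of the Health Service State folder (the drive letter differs per installation).

```python
import subprocess
from pathlib import Path

SERVICE = "HealthService"  # assumption: short name of the SCOM R2 HealthService


def park_state_folder(state_dir: Path) -> Path:
    """Rename '<folder>' to '<folder>_OLD' and return the new path."""
    target = state_dir.with_name(state_dir.name + "_OLD")
    if target.exists():
        # Refuse to touch an earlier '_OLD' copy; clean it up by hand first.
        raise FileExistsError(f"{target} already exists")
    state_dir.rename(target)
    return target


def reset_health_service_state(state_dir: Path) -> Path:
    """Stop the service, park the state folder, then start the service."""
    subprocess.run(["net", "stop", SERVICE], check=True)
    try:
        return park_state_folder(state_dir)
    finally:
        # Start the service again even when the rename failed.
        subprocess.run(["net", "start", SERVICE], check=True)
```

Keeping the old folder as `_OLD`, as described above, leaves a way back; the HealthService builds a fresh state folder when it starts.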
Here is a list of EventIDs which I have seen sometimes and need some attention. Some are very serious and some are easily fixed:
  • EventID 33333
    Data Access Layer rejected retry on SqlError. This is a serious one and needs real attention, although sometimes it is an easy fix: an Agent has been partially reinstalled and its ID ('BaseManagedEntityId') no longer matches the one present in the SCOM R2 DB. Clearing the HealthService State of that Agent makes all well again.
  • EventID 33333
    Health service should not generate data about this managed object. Easy one: proxying needs to be enabled on the SCOM R2 Agent generating this event.
  • EventID 10850
    A performance signature couldn't be inserted to the database. A tricky one. It often happens when a MP has recently been deleted. More serious is the case where the OpsMgr DB is running out of space. But when this event also contains the message ‘The INSERT statement conflicted with the FOREIGN KEY constraint’, there is a real challenge to be met. When you are lucky, it is caused by a corrupt Agent. If so, a HealthServiceId is displayed in the same Event. Run this PS script in order to obtain its friendly name (Get-MonitoringObject -id: 'HealthServiceId' | ft DisplayName). Clear the HealthService State of that Agent and most of the time all is well again.

    Otherwise check this out. If that isn’t the case either, contact Microsoft PS.
  • EventID 5300 and/or 5304
    On a RMS it means the Health Service is stalled. Serious attention required. Check this out.
Of course there are many more EventIDs which need attention. A good resource here is the Excel sheet made by Daniele Muscetta containing all SCOM R2 EventIDs. I hope this posting helps with some targeted troubleshooting.
Of course I know about the tracing tools which are available by default in SCOM R2 (~:\Program Files\System Center Operations Manager 2007\Tools). However, be careful when using them since you really must have a thorough understanding of what you are doing. Taken directly from the file ‘TracingReadMe.txt’ residing in the same folder: ‘…The files in this folder are for use in diagnostic tracing in conjunction with Microsoft Customer Support Services (CSS) only. Do not enable Operations Manager tracing without prior consultation with CSS through a support engagement. Doing so could have an adverse effect on system performance. Operations Manager diagnostic tracing is not customer consumable…’.
When you’re not sure, note down all your findings and contact Microsoft CSS. Do not do anything which you might regret afterwards.

Free eBook: Mastering Powershell

Dr. Tobias Weltner, a PowerShell MVP, has written an excellent eBook about PowerShell. He has written many books for Microsoft Press on Windows and scripting techniques, so he is a real professional and knows his stuff inside and out. Big surprise: the eBook is FREE!
For everyone who wants to know more about PowerShell, this is a great opportunity. Go get it NOW from here.

Free Workshops and eBooks on PowerShell

Besides the free eBook on PowerShell mentioned earlier, there is another FREE source for mastering PowerShell. It consists of two workshops and two eBooks containing the instructions for these workshops.
Want to know more? Go get them here:
  1. English version: http://bit.ly/cV9FYc
  2. German version: http://bit.ly/d5A2Jf
  3. Italian version: http://bit.ly/9hAyaY

System Center Operations Manager 2007 R2 Authoring Resource Kit

http://www.microsoft.com/downloads/en/details.aspx?FamilyID=9104af8b-ff87-45a1-81cd-b73e6f6b51f0

System Center Operations Manager 2007 R2 Cumulative Update 3

http://support.microsoft.com/kb/2251525
Includes Cumulative Update 3 for Cross-Platform Components (KB2222955).

Sep 29, 2010

Installing Opalis

In the following install series I’m going to show how to install and configure Opalis!

Opalis Integration Server includes the following components:
  • Action Server - The engine that runs Policies. Action Servers communicate with the Datastore. Action Servers do not require a Management Server to be online to be able to run Policies. You can deploy a single Action Server or multiple ones.
  • Management Server - The central manager of Clients, Action Servers, Policies, the Policy Testing Console, and the Self Monitoring functionality. The Management Server deploys Integration Packs to Action Servers and Clients, deploys Policies to Action Servers, and acts as a communication link between the Clients, the Action Servers, and the Datastore. 
  • Client - The tool used by designers to create, modify, and deploy Policies.
  • Operator Console - The Operator Console enables you to see which Policies are currently running, view their real-time status, and start or stop them from a browser console interface.
  • Policy Testing Console - The tool used by designers to test Policies that are developed in the Client before they are deployed.
  • Datastore - The Datastore is the Oracle or SQL Server database where configuration information, Policies, and logs are stored.


System requirements for Opalis
You need a Windows 2003 machine with SQL 2005 or 2008. Sadly, Windows 2008 is not yet supported.

Let’s start with the FUN part!
The Install Opalis Integration Server splash screen guides you through the four steps of installing Opalis Integration Server. Complete the following steps in the same order as they are listed on the splash screen:
Click Install the Management Server in step 1 of the Install Opalis Integration Server splash screen. The Management Service Setup screen appears.
Type your name and your organization’s name in the respective fields, then select which users on your computer will have access to the application.
Click Next. The Destination Folder page appears.
Type the user name (DOMAIN\Username format) and password that the Management Service and Operator Console Service will use.
Opalis Integration Server automates tasks across the entire server architecture. This type of automation requires high levels of access permissions, so it is imperative to restrict access to the Action Server and Management Server computers such that only authorized administrators may alter the settings on these computers. So don’t use a domain admin account here.
That’s why Microsoft recommends the following:
  • Restrict interactive login access to the local Administrators group only
  • Add only the minimum necessary users to the local Administrators group

Click Next. The Ready to Install the Application page appears.
Click Next. The Management Service is installed. When the wizard has finished the installation, click Finish. The Install Opalis Integration Server splash screen appears.
The Management Server installation program installs the following items:
  • The Management Server service and all related library files.
  • The Database Configuration utility.
  • The License Manager utility.
  • The Deployment Manager.
  • The Self Monitoring components.
  • The Audit Trail components.
     
    Let’s install the Datastore now. The Database Configuration utility prepares the database where Opalis Integration Server stores configuration information, Policies, and logs. This database is called the Datastore.
    When setting up the Datastore you can choose between an Oracle server and a SQL Server. In my lab environment I only have SQL Server, so I’m going to create the Datastore there.

    Click Configure the Datastore in step 2 of the Install Opalis Integration Server splash screen. The Database Configuration dialog appears.
    On the Database Type page, select the Microsoft SQL Server option and click Next. The Server Details page appears.
    In the Server field, type the name of the server that is running the SQL Server database that will host the Datastore. Do not use 'localhost' or '127.0.0.1'.
    Create a new database - Select this option to create a new Management Database and type a name for this database.
    Licenses provide information to the Management Server about the number of users, Action Servers, and satellites that you can run. A satellite license is a license that enables Policies to interact with remote machines.
    Click Import a License in step 3 of the Install Opalis Integration Server splash screen. The License Manager dialog appears.
    When I wanted to import a license using the License Manager I got an error saying:

    cannot connect to Management Server

    So I checked the Opalis services, and they were all running fine.

    I started debugging this issue by looking in the Opalis log files, which you can find in “%programfiles%\Opalis Software\Opalis Integration Server\Management Service\Logs”.
    In  the log folder you will find two different log files. One called ActionServerWatchdog.log and the other one called OpalisManagementService.log.
    I will go in more detail about the logging and auditing functionalities of Opalis in a later blogpost.

    Open the OpalisManagementService.log file and there you will see why I was not able to connect to the Management Server. When the License Manager tool is launched, it tries to connect to the Management Server, which in turn is unable to connect to the database. In the log file I found an error saying that the DB connection could not be opened. You can find the DB connection string in there as well.
    So the thing is to make sure that the account you used as the Opalis service account has the correct permissions on the database. In my case the account “contoso\svc_opalis” didn’t have the correct permissions to connect to the database.
    Create a new login for the Opalis service account and give it permission to connect to the database. Once done you will be able to launch the License Manager tool and import the Opalis Licenses.
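The login can be created through SQL Server Management Studio, but it can also be scripted. The Python sketch below only generates the T-SQL statements for a Windows account; it does not run them. The database name `Opalis` in the usage note is an assumption (use the name you typed in the Database Configuration utility), and the statements cover nothing beyond the right to connect. Any further rights the Opalis services may need are not covered here.

```python
def quote_name(name: str) -> str:
    """Wrap a SQL Server identifier in brackets, doubling any ']' inside it."""
    return "[" + name.replace("]", "]]") + "]"


def login_statements(account: str, database: str) -> list[str]:
    """Return T-SQL that creates a Windows login and a matching database user.

    Creating the user in the database is what normally allows the account
    to connect to that database.
    """
    acct, db = quote_name(account), quote_name(database)
    return [
        f"CREATE LOGIN {acct} FROM WINDOWS;",
        f"USE {db};",
        f"CREATE USER {acct} FOR LOGIN {acct};",
    ]


if __name__ == "__main__":
    # 'Opalis' is a hypothetical database name used for illustration only.
    for statement in login_statements(r"contoso\svc_opalis", "Opalis"):
        print(statement)
```

Review the printed statements and run them in a query window on the SQL server that hosts the Datastore.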
  • Click Import. The Import License dialog appears.
  • In the Key field, type or paste the key that was sent to you by Opalis Software Inc.
  • In the License file field, type the path of the license (.lic) file that was sent to you by Opalis Software Inc., or click the ellipsis (...) button to browse for it.
  • Click OK. The License Manager prompt appears.
  • Click OK. The License Manager dialog appears.
  • Click Close. The Install Opalis Integration Server splash screen appears.

In the last step we will install the Opalis Client.

Sep 21, 2010

TechNet Virtual Lab: Introduction to Opalis – Video Tutorials

Exercise 1: Building Policies – Part 1 

 
Exercise 1: Building Policies – Part 2 

Exercise 1: Building Policies – Part 3 

Exercise 2: System Center Integrations (System Center Operations Manager)