SCCM SCOM WSUS: OpsMgr - Cross Platform Discovery Errors

The key to being able to monitor a server is being able to discover that server :). Until you get the server into Operations Manager, you aren't going to be able to do much with it. While the discovery process for Unix and Linux servers seems simple enough, there is a lot going on behind the scenes that is hidden by the wizard. In a previous entry I went over a successful discovery path (OpsMg and Cross Plat-Getting Started); for this post I'm going to go over some of the errors that can occur and how to resolve them.
The first one I'll talk about is Not Enough Entropy; this one required a little digging to figure out what was wrong. The exact error is: Failed to allocate resource of type random data: Failed to get random data - not enough entropy.

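I've had this issue when discovering both RHEL and SLES servers, and it is related to certificate generation: generating the certificate needs random data, and /dev/random cannot supply it when the kernel's entropy pool runs low.
There are two ways to solve this problem: temporarily recreate /dev/random as a non-blocking device, or do a manual agent install.
For both fixes, first clean off the partially installed agent using the commands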

rpm -e scx
rm -rf /etc/opt/microsoft/scx

Then, if you want discovery to work from the wizard, recreate /dev/random with minor number 9 (the non-blocking urandom device) using the commands

rm /dev/random
mknod -m 644 /dev/random c 1 9
chown root:root /dev/random

A manual install requires copying the appropriate package from %ProgramFiles%\System Center Operations Manager 2007\AgentManagement\UnixAgents to the Unix/Linux machine and installing it directly.
After fixing the install issue, switch /dev/random back to the standard blocking random device (minor number 8) using the commands:

rm /dev/random
mknod -m 644 /dev/random c 1 8
chown root:root /dev/random
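
Before and after swapping the device node, it can help to confirm what /dev/random actually is. The following is only a rough sketch, not part of the agent tooling: the function names random_device_kind and check_dev_random are made up for illustration, and it assumes GNU stat (as shipped with RHEL and SLES), which reports device major and minor numbers in hex.

```shell
# Hypothetical helper: name the Linux random device for a major/minor pair.
# On Linux, 1:8 is /dev/random (blocking) and 1:9 is /dev/urandom (non-blocking).
random_device_kind() {
    case "$1:$2" in
        1:8) echo "random (blocking)" ;;
        1:9) echo "urandom (non-blocking)" ;;
        *)   echo "unknown" ;;
    esac
}

# Hypothetical helper: report what a device node (default /dev/random) really is.
# GNU stat prints %t (major) and %T (minor) in hex, so convert to decimal first.
check_dev_random() {
    dev="${1:-/dev/random}"
    major=$((0x$(stat -c '%t' "$dev")))
    minor=$((0x$(stat -c '%T' "$dev")))
    echo "$dev is $(random_device_kind "$major" "$minor")"
}
```

Running check_dev_random after the first mknod should report the urandom device, and after the second mknod the blocking random device. On Linux, cat /proc/sys/kernel/random/entropy_avail shows how much entropy the kernel currently has available.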

Next, let's look at Unspecified Problem. I am sure there is a whole gamut of reasons why this one occurs. The text is: Starting Microsoft SCX CIM Server: Unspecified Problem.

The key here is the statement "Generating certificate with hostname..." in the output: it shows that the certificate was generated, so we need to look at what happens after certificate creation. The only cause I have found for this error is the firewall. After installation and certificate generation there is a validation step, and if you watch the steps in the wizard, the error pops up almost immediately. That means the wizard is unable to verify the agent, which suggests a communication issue. Ensure that port 1270 has been opened on the firewall and try to discover again.
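
To check whether port 1270 is reachable before retrying discovery, something like the sketch below can be run from a Linux host on the management server's side of the firewall. It is an illustration under assumptions: the function name check_agent_port is hypothetical, and it assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available.

```shell
# Hypothetical helper: print "open" or "closed" for a TCP port on a host.
# Assumes bash (for /dev/tcp) and coreutils timeout are installed.
check_agent_port() {
    host="$1"
    port="${2:-1270}"   # 1270 is the agent port named in the text
    # Try to open a TCP connection; give up after 3 seconds.
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}
```

For example, check_agent_port myserver would print "closed" if the firewall is still blocking the port (myserver is a placeholder host name). Note that "closed" can also mean the agent service is not running, so it narrows the problem down rather than proving the firewall is at fault.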

Some of the other errors I've run into over time are:

Access is Denied: this one pops up from time to time when an agent installation failed for some reason, you fixed the underlying cause, and then tried again. The partially installed agent is blocking the re-install; the fix is to clean off the agent and do a fresh install, the same way we did for Not Enough Entropy.

Cannot connect to port 1270: this one typically occurs when there is a library path issue on the monitored server. If you go to the server, you'll likely see that the service failed to start. Trying to restart the service will give you the name of the library that cannot be found.

The typical resolution path for Linux is:

1. scxadmin -restart all
2. See what library is missing
3. find / -name (with the name of the missing library) to locate it
4. vi /etc/ld.so.conf
5. Add the path to the missing library
6. ldconfig to reload the dynamic loader cache
7. scxadmin -restart all
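
The "see what library is missing" and find steps can be partly scripted. The sketch below is only an illustration: missing_libs and find_lib_dirs are hypothetical helper names, and it assumes the usual Linux ldd output format, in which an unresolved library appears as "name => not found".

```shell
# Hypothetical helper: read ldd output on stdin and print the libraries
# that the dynamic loader could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Hypothetical helper: print the distinct directories that contain a given
# library, searching under a root directory (default /).
find_lib_dirs() {
    lib="$1"
    root="${2:-/}"
    find "$root" -name "$lib" 2>/dev/null | while read -r path; do
        dirname "$path"
    done | sort -u
}
```

Piping ldd output for the binary that fails to start into missing_libs lists the unresolved names, and find_lib_dirs then shows candidate directories to add to /etc/ld.so.conf before running ldconfig.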

The path for Solaris is the same for steps 1 - 3 but differs when it comes to setting the library path:

4. crle to see the current path
5. crle -l to update the path (include the old path plus the new path because the command is a replacement, not an append)
6. scxadmin -restart all
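
Because crle -l replaces the path rather than appending to it, it is easy to drop the existing directories by accident. A small sketch of building the combined value first is shown here; append_lib_path is a hypothetical helper name, and the directories in the usage note are examples only.

```shell
# Hypothetical helper: append a directory to a colon-separated library path,
# leaving the path unchanged if the directory is already present.
append_lib_path() {
    current="$1"
    new_dir="$2"
    if [ -z "$current" ]; then
        echo "$new_dir"
        return 0
    fi
    case ":$current:" in
        *":$new_dir:"*) echo "$current" ;;
        *)              echo "$current:$new_dir" ;;
    esac
}
```

With the current path copied from the crle output, the update could then look like crle -l "$(append_lib_path /lib:/usr/lib /usr/local/ssl/lib)".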

Can not resign certificate, /etc/opt/microsoft/ssl/scx-host-.pem already exists: in this situation, the re-creation of a certificate was attempted but failed because there was a previously generated certificate on the target host. If you want to generate a new certificate, simply delete the contents of the /etc/opt/microsoft/ssl directory. Alternatively, you can export the certificate and trust it on the management server.

winrm failed to connect in a timely manner: this can happen if the target server is overloaded. OpenPegasus will time out after 20 seconds or so, and this can result in a failure to validate that the agent was properly installed. The fix here is to confirm that the agent was in fact installed by running scxcimcli ei -n root/scx CIM_ManagedElement on the target server, and then to retry the discovery.
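
On a heavily loaded server the verification command itself may need more than one attempt. The wrapper below is a generic sketch, not something the agent provides: retry is a hypothetical helper name, and the attempt count and delay are arbitrary values to adjust.

```shell
# Hypothetical helper: run a command up to N times, sleeping between attempts.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        # Succeed as soon as the wrapped command succeeds.
        if "$@"; then
            return 0
        fi
        # Pause before the next attempt, but not after the last one.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}
```

For example, retry 5 10 followed by the scxcimcli command above would try the check up to five times, ten seconds apart, before giving up.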

There are many other things that could go wrong during discovery, but in most cases the error message you receive should help you determine how to fix the problem. One thing to watch is the phase in which the error occurred: Initial discovery (name resolution issues), Installation (user account issues), Signing (certificate issues), or Validation (configuration issues). Knowing where to start looking is half the battle in getting our servers successfully discovered.

Jun 29, 2010
