The first one I'll talk about is Not Enough Entropy, which required a little digging to figure out what was wrong. The exact error is: Failed to allocate resource of type random data: Failed to get random data - not enough entropy.
I've had this issue when discovering both RHEL and SLES servers and it is related to certificate generation.
There are two ways to solve this problem: you can recreate the /dev/random file or do a manual agent install.
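For both fixes, first clean off the partially installed agent using the commands:
- rpm -e scx
- rm -rf /etc/opt/microsoft/scx
To recreate /dev/random, replace it with a non-blocking device node (major 1, minor 9, the same device as /dev/urandom) using the commands:
- rm /dev/random
- mknod -m 644 /dev/random c 1 9
- chown root:root /dev/random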
After fixing the install issue, switch /dev/random back to the standard blocking random device (major 1, minor 8) using the commands:
- rm /dev/random
- mknod -m 644 /dev/random c 1 8
- chown root:root /dev/random
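As a quick sanity check on the device swap, here is a minimal sketch that reports which kernel device a node currently points at. The function names are my own and purely illustrative, and check_random_node assumes GNU stat (which prints major and minor numbers in hex).

```shell
#!/bin/sh
# Map a character device's major/minor numbers to the Linux random device
# it represents: 1,8 is the blocking random device, 1,9 is urandom.
describe_random_node() {
    # $1 = major number, $2 = minor number (both decimal)
    case "$1:$2" in
        1:8) echo "random (blocking)" ;;
        1:9) echo "urandom (non-blocking)" ;;
        *)   echo "unknown" ;;
    esac
}

# Inspect a device node (defaults to /dev/random).
# Assumes GNU stat: %t and %T print the major and minor numbers in hex.
check_random_node() {
    node=${1:-/dev/random}
    major=$(printf '%d' "0x$(stat -c '%t' "$node")")
    minor=$(printf '%d' "0x$(stat -c '%T' "$node")")
    echo "$node is $(describe_random_node "$major" "$minor")"
}
```

Running check_random_node before and after the mknod commands should show whether the swap, and the later switch back, actually took effect.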
Next, let's look at Unspecified Problem. I am sure there is a whole gamut of reasons why this one occurs. The text is: Starting Microsoft SCX CIM Server: Unspecified Problem.
The key here is the statement "Generating certificate with hostname...", which shows that the certificate was generated, so we know we need to look at what happens after certificate creation. The only cause I have found for this error is the firewall. After installation and certificate generation there is a validation step. If you watch the steps in the wizard, the error pops up almost immediately, which means the wizard is unable to verify the agent and suggests a communication issue. Ensure that port 1270 has been opened on the firewall and try to discover again.
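To confirm the firewall theory before re-running discovery, a rough reachability test can help. This is only a sketch: it assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available on whichever Linux host you test from, and port_open and the hostname in the usage note are names I made up.

```shell
#!/bin/sh
# Return 0 if a TCP connection to HOST:PORT succeeds within 5 seconds.
# Returns 2 on missing arguments, non-zero on any connection failure.
port_open() {
    host=$1
    port=$2
    if [ -z "$host" ] || [ -z "$port" ]; then
        echo "usage: port_open HOST PORT" >&2
        return 2
    fi
    # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT
    timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}
```

For example, port_open agent.example.com 1270 && echo reachable (with your own server name in place of the hypothetical one). A failure here only says the port could not be reached from that host, so it is worth testing from the same network segment as the management server.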
Some of the other errors I've run into over time are:
Access is Denied: this one pops up from time to time when an agent installation failed for some reason, you fixed the underlying cause, and then tried again. The problem is that the partially installed agent is blocking the re-install. The fix is to clean off the agent and do a fresh install, the same way we did for Not Enough Entropy.
Cannot connect to port 1270: this one typically occurs when there is a library path issue on the monitored server. If you go to the server, you'll likely see that the service failed to start. Trying to restart the service will give you the name of the library that cannot be found.
The typical resolution path for Linux is:
- scxadmin -restart all
- See what library is missing
- find / -name <missing library> to locate it
- vi /etc/ld.so.conf
- add path to missing library
- ldconfig to reload dynamic loader
- scxadmin -restart all
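If the restart message is not clear about which library is missing, ldd can list unresolved libraries for a binary. The helper below is a small sketch of that idea. The function names are hypothetical, and you would need to point it at whichever agent binary fails to start on your system.

```shell
#!/bin/sh
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd-style output on stdin so it can sit at the end of a pipeline.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Convenience wrapper: list missing libraries for a single binary.
missing_libs_for() {
    ldd "$1" 2>/dev/null | missing_libs
}
```

After adding the library's directory to /etc/ld.so.conf and running ldconfig, calling missing_libs_for on the same binary should print nothing.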
The path for Solaris is the same for steps 1 - 3 but differs when it comes to setting the library path:
- crle to see the current path
- crle -l to update the path (include the old path plus the new path because the command is a replacement, not an append)
- scxadmin -restart all
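Because crle -l replaces the path rather than appending to it, it is easy to drop an existing directory by accident. Here is a small sketch of a helper that builds the combined value. append_lib_path is a name I invented, and the directories in the usage note are only examples.

```shell
#!/bin/sh
# Build a colon-separated library path containing the current path plus one
# new directory, without adding the directory twice.
append_lib_path() {
    current=$1
    new=$2
    case ":$current:" in
        *":$new:"*) echo "$current" ;;                 # already present
        *)          echo "${current:+$current:}$new" ;; # append, no leading colon
    esac
}
```

With the default path that crle printed as the first argument, the update becomes something like crle -l "$(append_lib_path /lib:/usr/lib /opt/extra/lib)", where /opt/extra/lib stands in for the directory that holds the missing library.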
Can not resign certificate, /etc/opt/microsoft/ssl/scx-host-
winrm failed to connect in a timely manner: this can happen if the target server is overloaded. OpenPegasus will time out after 20 seconds or so, which can result in a failure to validate that the agent was properly installed. The fix is to confirm that the agent was in fact installed by running scxcimcli ei -n root/scx CIM_ManagedElement on the target server, and then retry the discovery.
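On a busy server the local check itself may need more than one attempt. A generic retry wrapper, sketched below, is one way to handle that. The retry function is my own helper and not part of the agent tooling.

```shell
#!/bin/sh
# Run a command up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Usage: retry ATTEMPTS DELAY command [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0
        # Only sleep if another attempt is coming
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}
```

For instance, retry 3 30 followed by the scxcimcli command above would try the agent check three times, 30 seconds apart, before giving up. If it succeeds, the agent is installed and it is reasonable to retry the discovery.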
There are many other things that could go wrong during discovery, but in most cases the error message you receive should help you determine how to fix the problem. One thing to watch is the phase in which the error occurred: Initial discovery (name resolution issues), Installation (user account issues), Signing (certificate issues), or Validation (configuration issues). Knowing where to start looking is half the battle in getting our servers successfully discovered.