Troubleshooting NIM LED hangs

Troubleshooting

Problem

This document covers bootp/tftp problems, LED 605, 607, 608, 611, and 613 hangs, and LED 888 crashes encountered when attempting to initiate a NIM install.

Resolving The Problem

This document is a reference guide for troubleshooting common NIM LED hangs. It is not a fail-safe resolution guide; however, the steps below represent the most likely causes of, and resolutions to, various NIM hangs.

What this document will cover :

 bootp problems

LED 605

LED 607

LED 608

LED 611

LED 613

LED 888-102-700-0c5 (crash)

What this document will not cover :
NIM setup. That information can be found in the document entitled, “NIM Setup Guide”.

In all examples covered in this document, device references will always come in the form of ent0, cd0, hdisk0, and so on. Your system configuration may be different, so substitute your devices as necessary in the command syntax listed. Additionally, all commands to be executed from the command line are shown as they would look at a root command line prompt. Wherever a command needs your NIM client's name, a placeholder such as client_name is shown. For example, you might see this command in the document :


# lsnim -l client_name

You would put your client's actual hostname in the client_name portion of that command.

`Bootp problems:`

Bootp is the first service you visually encounter during a NIM client BOS installation that you may run into problems with. Though many people run the RIPL (remote initial program load) “Ping Test” option in SMS first, this document will not cover troubleshooting errors reported from attempting that, as quite frankly, there are no troubleshooting steps to take in SMS if your SMS ping test fails. The SMS ping test is executed by firmware, not software, so there are no diagnostic commands we can execute to determine why the ping test is failing. We can however troubleshoot bootp failures from both the master and client side.

If your bootp request is failing check the following from the client side :

Make sure your network adapter cable is connected (if not virtual) and that you are selecting the correct network adapter from which you are intending to execute the bootp request.

This may seem like a basic check, but it comes first for a reason. Nothing is worse than spending hours diagnosing a problem without checking the basics. A good indication of a client side adapter issue is that, when you kick off a bootp request, you do not see an actual count of bootp attempts. When you initiate the bootp request you normally see a “Sent” and “Received” indicator that may look like this :
S=1 R=0
S=2 R=0
S=3 R=0
....etc
This at least indicates the client's adapter is actively trying to send a request. If you do not see this, or if you immediately receive an error similar to “No OS found”, then you likely have a client side adapter, cabling, or networking issue. You should contact your network admin to have this checked.

Check your IP Parameters settings in SMS.

This is one of the most common sources of error with bootp/tftp requests. In SMS your IP Parameters screen (under the RemoteIPL option from the main menu) will look similar to :


Client IP Address :	[9.3.58.216]
Server IP Address :	[9.3.58.215]
Gateway IP Address : 	[9.3.58.1]
Subnet Mask : 		[255.255.255.0]

Using the above scenario you will likely fail to get past bootp. When the NIM master and target NIM client are on the same network you should always use the master's IP address in the “Gateway IP Address” field. The client will not be communicating through the gateway in this case, so its address should not be used. The correct screen entries in the above example would look like this :


Client IP Address :	[9.3.58.216]
Server IP Address :	[9.3.58.215]
Gateway IP Address : 	[9.3.58.215]
Subnet Mask : 		[255.255.255.0]

*there are cases where some levels of firmware will allow you to use either the actual gateway address or just 0.0.0.0 – but those are rare and unpredictable cases controlled by firmware and are not considered to be reliable.
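
The same-network rule above can be checked before typing values into SMS. The following is only a sketch: the function names ip_to_int, same_subnet, and sms_gateway are made up for this example, and the "different network" branch assumes that a routed install uses the client's own gateway address.

```shell
# Convert a dotted-quad IPv4 address (e.g. 9.3.58.216) to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeeds when both addresses fall in the same subnet for the given mask.
same_subnet() {        # usage: same_subnet IP_A IP_B MASK
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}

# Print the value to enter in the SMS "Gateway IP Address" field.
sms_gateway() {        # usage: sms_gateway CLIENT_IP MASTER_IP CLIENT_GATEWAY MASK
    if same_subnet "$1" "$2" "$4"; then
        echo "$2"      # same network: use the master's address
    else
        echo "$3"      # assumption: a routed install uses the client's gateway
    fi
}
```

With the addresses from the example screens, sms_gateway 9.3.58.216 9.3.58.215 9.3.58.1 255.255.255.0 prints 9.3.58.215, which matches the corrected screen.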

The following checks are to be made from the master's side.

Before running any check it is always good to run a reset/deallocate operation so we know that we're starting fresh.


# nim -Fo reset client_name
# nim -o deallocate -a subclass=all client_name

The first check from your master's side would be to verify that your master/client definitions and network definitions are correct.

Start off with full hostname resolution. For these examples my master's hostname will be 'shadoe' and my client's hostname will be 'kintaro'. You will use your own appropriate hostnames.


# host shadoe
# host shadoe.austin.ibm.com
# host 9.3.58.215

Your output should read similar to the following for all 3 commands. Make sure all 3 commands output the exact same output (verbatim) :


shadoe is 9.3.58.215,  Aliases:   shadoe.austin.ibm.com

# host kintaro
# host kintaro.austin.ibm.com
# host 9.3.58.216

kintaro is 9.3.58.216,  Aliases:   kintaro.austin.ibm.com

If there is any discrepancy or unexpected output, this should be fixed before proceeding. If hostname resolution looks good, then we'll want to make sure NIM has the correct information stored in its database as well.
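
Comparing three lines of output by eye is error prone, so the "exact same output" check can be scripted. This is a minimal sketch; all_lines_identical is a helper name invented for this example and is not part of NIM or AIX.

```shell
# Succeeds only when every non-blank line on standard input is identical.
# Example use on the master (hostnames are the ones from this document):
#   { host shadoe; host shadoe.austin.ibm.com; host 9.3.58.215; } | all_lines_identical
all_lines_identical() {
    awk 'NF { seen[$0] = 1 }
         END { n = 0; for (k in seen) n++; exit (n == 1 ? 0 : 1) }'
}
```

A non-zero exit status means at least one of the lookups returned something different and hostname resolution should be fixed first.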


# lsnim -l master

This command will typically have a lot of output. What we're really interested in is the master's network definition, which is held in the “if1” attribute. You may have more adapters defined within NIM but we'll keep it simple for the purpose of this example. To get only the network information run :
*note – the word 'master' is literal here, not a placeholder. You will actually type the word 'master', not the hostname of the NIM master.


# lsnim -l master |grep if1
 if1                 = master_net shadoebso.austin.ibm.com 00145EB7F3F5

The master's network object name is “master_net”. You also will want to verify the master's hostname is correct in the output of that command as well.
Next we'll look at the master's network definition :


# lsnim -l master_net
master_net:
   class      = networks
   type       = ent
   Nstate     = ready for use
   prev_state = information is missing from this object's definition
   net_addr   = 9.3.58.0
   snm        = 255.255.255.0
   routing1   = default 9.3.58.1

Given the IP addresses I know to be correct this network definition looks right. Next you'll check your client.


# lsnim -l client_name

kintaro :
   class           = machines
   type            = standalone
   connect         = shell
   platform        = chrp
   netboot_kernel  = mp
   if1             = master_net kintaro 0
   cable_type1     = tp
   Cstate          = ready for a NIM operation
   prev_state      = ready for a NIM operation
   Mstate          = not running
   cpuid           = 00012D4AD200
   Cstate_result   = reset

We can see from the output that this client is also defined on “master_net”, which I know is correct. If this client was defined on any other NIM network, that might be the source of my bootp (or possibly a different NIM boot) failure.
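
If you have many clients to check, the if1 attribute can be pulled out of the 'lsnim -l' output with a small filter. The helper names below (nim_if1_network and nim_if1_hostname) are hypothetical and only parse the output layout shown above.

```shell
# Read `lsnim -l OBJECT` output on standard input and print the NIM network
# name held in the if1 attribute (third field of the "if1 = ..." line).
nim_if1_network() {
    awk '$1 == "if1" && $2 == "=" { print $3; exit }'
}

# Same input, but print the hostname recorded in the if1 attribute.
nim_if1_hostname() {
    awk '$1 == "if1" && $2 == "=" { print $4; exit }'
}
```

For example, lsnim -l client_name | nim_if1_network would print master_net for the client shown above.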

Check the bootp service and /etc/bootptab file.

To verify bootp is running execute the following :


# lssrc -t bootps
Service       Command                  Description              Status
 bootps       /usr/sbin/bootpd         bootpd /etc/bootptab     active

Make sure the command is listed as “/usr/sbin/bootpd” and make sure the status is active. If the command is wrong or the status is “inoperative”, check your /etc/inetd.conf file.
Make sure the command string is correct for the bootps line :


bootps  dgram   udp     wait    root    /usr/sbin/bootpd       bootpd /etc/bootptab

Many times network admins will comment out bootp for security reasons. If this is the case simply uncomment the 'bootps' line. While you're in this file also make sure the 'tftp' line is uncommented.
Run a refresh of inetd from command line to restart these services.
*note : make sure you clear this with your network admin first. They may not allow you to do this without their knowledge.


# refresh -s inetd

Your bootp and tftp should both show active now.


# lssrc -t bootps
Service       Command                  Description              Status
 bootps       /usr/sbin/bootpd         bootpd /etc/bootptab     active

# lssrc -s tftpd
Subsystem         Group            PID          Status
 tftpd            tcpip            3735960      active
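
Whether the bootps and tftp lines are commented out in /etc/inetd.conf can also be checked without opening the file. This is a sketch only; inetd_service_enabled is a name made up for this example, and it simply looks for a line whose first field is the service name.

```shell
# Succeeds when SERVICE has an active (uncommented) line in an
# inetd.conf-style file. FILE defaults to /etc/inetd.conf.
inetd_service_enabled() {   # usage: inetd_service_enabled SERVICE [FILE]
    awk -v svc="$1" '$1 == svc { found = 1 }
                     END { exit (found ? 0 : 1) }' "${2:-/etc/inetd.conf}"
}
```

A line that begins with # has a different first field, so a commented-out service makes the function return a non-zero status. For example: inetd_service_enabled bootps || echo "bootps is commented out".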

Next you'll want to set up for your install again. Your /etc/bootptab file may be populated with incorrect information about your master, client, or both.
Most commonly you'll set up for a NIM install via the 'smit' tool.


# smitty nim_bosinst

Once you've setup for your NIM install cat your /etc/bootptab file. The last entry in the file should be for your target client. There should also only be one entry for any one client.


# cat /etc/bootptab
kintaro:bf=/tftpboot/kintaro:ip=9.3.58.216:ht=ethernet:sa=9.3.58.215:sm=255.255.255.0:

All of these addresses look valid. If you are still receiving bootp errors you can put bootp into debug. This will tell us whether or not the client's bootp request is making it to the master, whether or not the master is sending a reply, and whether or not the master recognizes the client it's trying to reply to.
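
The two bootptab checks described above (exactly one entry per client, and correct addresses in that entry) can be sketched as follows. The function names are hypothetical; the tag names bf, ip, ht, sa, and sm are the ones visible in the sample entry.

```shell
# Count the entries for CLIENT in a bootptab-style file (expected: exactly 1).
bootptab_count() {     # usage: bootptab_count CLIENT [FILE]
    awk -F: -v c="$1" '$1 == c { n++ } END { print n + 0 }' "${2:-/etc/bootptab}"
}

# Print the value of one tag (bf, ip, ht, sa, sm) from CLIENT's last entry.
bootptab_field() {     # usage: bootptab_field CLIENT TAG [FILE]
    awk -F: -v c="$1" -v tag="$2" '
        $1 == c {
            value = ""
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")          # each field looks like tag=value
                if (kv[1] == tag) value = kv[2]
            }
        }
        END { print value }' "${3:-/etc/bootptab}"
}
```

For the entry shown above, bootptab_field kintaro sa should print the master's address and bootptab_field kintaro ip the client's address. A count other than 1 suggests a stale or duplicate entry.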

Putting bootp into debug

This process is most helpful when running in a NIM environment where the master and client are on different networks. Often times network admins will block bootp requests by use of the routers. Running this test will show us whether or not that is a likely possibility.

The following commands will be executed on the NIM master.


# nim -Fo reset client_name
# nim -o deallocate -a subclass=all client_name
# stopsrc -t bootps
# ps -ef |grep bootp

Make sure there are no processes still running. If so, kill them off (as long as they don't have a parent id of 1).


# vi /etc/inetd.conf

Comment out the bootps line with a # .
Now that bootp has stopped go ahead and bring up another window on your master. Putting bootp into debug is going to lock the window.


# bootpd -d -d -d -d -s

This will display any bootp requests coming into the master. The master may also detect other bootp requests going across the network – those can be ignored. Go ahead and setup for your installation operation again from smit.


# smitty nim_bosinst

Once that is done initiate the bootp request from the client side. If you do not see a request arriving from the client, and you're certain the IP addressing is correct, then your router or firewall is most likely the source of your bootp failure. You should contact your network admins to have them resolve that problem.
If you do see a bootp request coming in but the client still does not receive a reply, you should contact the support center for further diagnostics.
Once you are finished testing bootp make sure you remember to remove the comment from the bootps line in the /etc/inetd.conf file, ctrl-c out of the bootp debug, stop bootp, and refresh inetd.

`NIM LED 605 – missing device support`

A NIM LED 605 hang indicates that you're missing device driver support for your client's adapter in the SPOT being used for the install. This is a relatively rare hang now that the default installation options are to auto-install all device support. If you do end up with an LED 605 hang you can check the SPOT by running the following command.


# nim -o lslpp spot_name |grep device_driver

If, for example, my client had a 10Gb PCI Express card I would run the following against the SPOT :


# nim -o lslpp AIX_6100-06_SPOT |grep 2514310025140100

This would either show me that the filesets are installed or return with no output. If the drivers are missing the SPOT needs to be updated with the appropriate software.
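
When a client has several adapters, it can be easier to save the SPOT listing once and test each device ID against it. The sketch below assumes the listing was saved with something like nim -o lslpp spot_name > /tmp/spot_lslpp.out (the path is only an example); missing_device_support is a made-up helper name.

```shell
# Print each device ID that does not appear anywhere in a saved SPOT listing.
missing_device_support() {   # usage: missing_device_support LISTING_FILE ID...
    listing=$1
    shift
    for id in "$@"; do
        # grep -q is silent; a non-zero status means the ID was not found
        grep -q "$id" "$listing" || echo "$id"
    done
}
```

No output means every ID was found in the listing. Any ID that is printed points to device support that may need to be added to the SPOT.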

`NIM LED 607`

This is another relatively rare LED hang, but it does come up occasionally. It indicates one of the following :
o The device driver level installed into the SPOT is too low
o The target adapter is not supported
o The target adapter is bad

Often there is an open or resolved defect for which an APAR must be applied to the SPOT, or a firmware upgrade is necessary.

`NIM LED 608`

This is one of the most common NIM LED hangs. In short it simply means that something is wrong with your network setup. The problem is most likely in your NIM master's database information. You'll want to reset and deallocate resources from your client first.


# nim -Fo reset client_name
# nim -o deallocate -a subclass=all client_name

Next we'll want to check your master's and client's hostname resolution.
For these examples my master's hostname will be 'shadoe' and my client's hostname will be 'kintaro'. You will use your own appropriate hostnames.


# host shadoe
# host shadoe.austin.ibm.com
# host 9.3.58.215

 Your output should read similar to the following for all 3 commands. Make sure all 3 commands output the exact same output (verbatim) :


shadoe is 9.3.58.215,  Aliases:   shadoe.austin.ibm.com

# host kintaro
# host kintaro.austin.ibm.com
# host 192.168.50.25
kintaro is 192.168.50.25,  Aliases:   kintaro.austin.ibm.com


# lsnim -l master

This command will typically have a lot of output. What we're really interested in is the master's network definition. That is held with the “if1” attribute. You may have more adapters defined within NIM but we'll keep it simple for this example. To get only the network information I'll run :
*note – the word 'master' is literal here, not a placeholder. You will actually type the word 'master', not the hostname of the NIM master.


# lsnim -l master |grep if1
 if1                 = master_net shadoebso.austin.ibm.com 00145EB7F3F5


# lsnim -l master_net
master_net:
   class      = networks
   type       = ent
   Nstate     = ready for use
   prev_state = information is missing from this object's definition
   net_addr   = 9.3.58.0
   snm        = 255.255.255.0
   routing1   = default 9.3.58.1

Given the IP addresses I know to be correct this network definition looks right. Next you'll check your client.


# lsnim -l client_name

kintaro :
   class           = machines
   type            = standalone
   connect         = shell
   platform        = chrp
   netboot_kernel  = mp
   if1             = master_net kintaro 0
   cable_type1     = tp
   Cstate          = ready for a NIM operation
   prev_state      = ready for a NIM operation
   Mstate          = not running
   cpuid           = 00012D4AD200
   Cstate_result   = reset

Given what we found out above, this client with an IP address of 192.168.50.25 cannot be on the “master_net” NIM network, which covers 9.3.58.0 with a subnet mask of 255.255.255.0. This incorrect NIM network definition will cause an LED 608 hang. This commonly happens when a NIM client's IP address information has changed. It does not necessarily have to be the client's definition that is in error. The NIM master's network information may be wrong also, so make sure you check both machine definitions and both NIM network definitions for errors.
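
The mismatch between a client's address and its NIM network can be confirmed with a little arithmetic on the net_addr and snm attributes. This is a sketch under the assumption that the 'lsnim -l' network output has the layout shown above; ip_to_int and ip_in_nim_network are names invented for this example.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Read `lsnim -l NETWORK` output on standard input and succeed when IP lies
# inside the network described by its net_addr and snm attributes.
# Returns 2 when either attribute is missing from the input.
ip_in_nim_network() {   # usage: lsnim -l master_net | ip_in_nim_network IP
    set -- "$1" $(awk '$1 == "net_addr" { n = $3 }
                       $1 == "snm"      { s = $3 }
                       END { print n, s }')
    [ $# -eq 3 ] || return 2
    addr=$(ip_to_int "$1")
    base=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( addr & mask )) -eq $(( base & mask )) ]
}
```

Run against the master_net definition above, the check succeeds for 9.3.58.216 and fails for 192.168.50.25, which is the situation that leads to the LED 608 hang.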

If there is no obvious discrepancy with your NIM network definitions or hostname resolution, a SPOT debug operation may be necessary. To do this first make sure you've executed the reset and deallocate operation against the client.


# nim -Fo reset client_name
# nim -o deallocate -a subclass=all client_name

Next, set the SPOT into debug :


# nim -Fo check -a debug=yes spot_name

Now you'll setup for the installation again :


# smitty nim_bosinst

 After you boot the client to SMS and initiate the bootp/tftp process the system will eventually drop you down to a debug prompt :



KDB(0)>

At this prompt you'll enter the following sequence, pressing 'Enter' after each of these lines.


KDB(0)> mw enter_dbg
42
.
KDB(0)> g

This will initiate the debug boot output. Make sure you're using a screen or text capturing tool (like PuTTY) to log all of the output that comes to the screen. Once the LED 608 hang hits you'll see something similar to the following displayed several times as the client keeps retrying to get past the LED 608.


RETRY CLIENTFILE 608
tftp: sendto: Network is unreachable

 At this point you can stop the install attempt. Feel free to call in to open a PMR to have one of our technicians review the debug output if you require assistance in doing so. To take the SPOT resource out of debug mode :


# nim -Fo reset client_name
# nim -o deallocate -a subclass=all client_name
# nim -Fo check spot_name

`NIM LED 611 - NFS`

This LED hang indicates there is most likely a problem with NFS. Hostname resolution and the NFS portcheck option can also be a problem, so we'll go through all three.

Firstly however, it is always a good idea to run a reset/deallocate operation and fully check hostname resolution.
For these examples my master's hostname will be 'shadoe' and my client's hostname will be 'kintaro'. You will use your own appropriate hostnames.


# nim -Fo reset client_name
# nim -o deallocate -a subclass=all client_name

# host shadoe
# host shadoe.austin.ibm.com
# host 9.3.58.215

 Your output should read similar to the following for all 3 commands. Make sure all 3 commands output the exact same output (verbatim) :


shadoe is 9.3.58.215,  Aliases:   shadoe.austin.ibm.com

# host kintaro
# host kintaro.austin.ibm.com
# host 9.3.58.216
kintaro is 9.3.58.216,  Aliases:   kintaro.austin.ibm.com

If there is any discrepancy or unexpected output, this should be fixed before proceeding. Next we can check to make sure the NFS option for portchecking is set appropriately. Execute the following on the NIM master.


# nfso -a |grep portcheck

This should return :


portcheck = 0
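
If you want to test the value in a script rather than read it off the screen, the "name = value" lines can be parsed as below. This is a minimal sketch; nfso_value is a hypothetical helper that only assumes the output format shown above.

```shell
# Read "name = value" lines (as printed by `nfso -a`) on standard input
# and print the value of the named option.
nfso_value() {          # usage: nfso -a | nfso_value portcheck
    awk -v name="$1" '$1 == name && $2 == "=" { print $3; exit }'
}
```

For example: [ "$(nfso -a | nfso_value portcheck)" = "0" ] || echo "portcheck is not 0".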

Finally, a full NFS reset usually clears this hang. NIM is extremely sensitive to NFS and any old cache information can cause an LED 611 hang. In the following process we will be shutting down NFS, clearing out the cache files, and restarting it. Though this process should take less than a minute to complete, you will want to verify with your network admins that you are safe in shutting down NFS at this time.
Execute the following on the NIM master.


# cd /etc
# stopsrc -g nfs
# mv exports exports.
# rm rmtab xtab
# cd /var/statmon
# rm -rf ./sm ./sm.bak ./state 
# startsrc -g nfs

Go ahead and setup for your NIM installation again.


# smitty nim_bosinst

This should in most cases resolve your 611 hang.

`LED 613 – NIM routing`

This LED indicates your NIM routing is set up incorrectly, which is typically easy to diagnose. Your NIM routing entry represents the gateway system connecting your master and client. If the master and client are on the same network you will not be using the gateway information in the SMS panels; however, your NIM setup information may still be incorrect, so it should be checked.

Once again we'll look at the 'lsnim -l' output of the NIM master and client.
For these examples my master's hostname will be 'shadoe' and my client's hostname will be 'kintaro'. You will use your own appropriate hostnames.


# host shadoe
# host shadoe.austin.ibm.com
# host 9.3.58.215

 Your output should read similar to the following for all 3 commands. Make sure all 3 commands output the exact same output (verbatim) :


shadoe is 9.3.58.215,  Aliases:   shadoe.austin.ibm.com

# host kintaro
# host kintaro.austin.ibm.com
# host 192.168.50.25
kintaro is 192.168.50.25,  Aliases:   kintaro.austin.ibm.com


# lsnim -l master

This command will typically have a lot of output. What we're really interested in is the master's network definition. That is held with the “if1” attribute. You may have more adapters defined within NIM but we'll keep it simple. To get only the network information I'll run :
*note – the word 'master' is literal here, not a placeholder. You will actually type the word 'master', not the hostname of the NIM master.


# lsnim -l master |grep if1
 if1                 = master_net shadoebso.austin.ibm.com 00145EB7F3F5


# lsnim -l master_net
master_net:
   class      = networks
   type       = ent
   Nstate     = ready for use
   prev_state = information is missing from this object's definition
   net_addr   = 9.3.58.0
   snm        = 255.255.255.0
   routing1   = default 9.3.58.1

Given the IP addresses I know to be correct this network definition looks right. Next you'll check your client.


# lsnim -l client_name

kintaro :
   class           = machines
   type            = standalone
   connect         = shell
   platform        = chrp
   netboot_kernel  = mp
   if1             = 192.168_network kintaro 0
   cable_type1     = tp
   Cstate          = ready for a NIM operation
   prev_state      = ready for a NIM operation
   Mstate          = not running
   cpuid           = 00012D4AD200
   Cstate_result   = reset

This shows my client, kintaro, to be on the NIM network called 192.168_network. Next, you'll check the definition of that network by running the following :


# lsnim -l 192.168_network

192.168_network :
   class      = networks
   type       = ent
   Nstate     = ready for use
   prev_state = information is missing from this object's definition
   net_addr   = 192.168.50.0
   snm        = 255.255.255.0
   routing1   = default 9.3.58.1

The routing1 entry for the client's network is incorrect: 9.3.58.1 is not an address on the 192.168.50.0 network. During the definition of this NIM network the master's gateway IP address was entered instead of the client's gateway IP address. The easiest way to change this is through SMIT. You will need to ensure the client has been reset and had the resources deallocated.


# smitty nim_rmroute

 Select the name of the network and the route to remove. Once it has been removed you can correct it by running the following command :


# smitty nim_mkdroute

Here you'll select the network name and enter the correct default gateway address, which in this case would be 192.168.50.1.
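
A wrong default route like the one in the 192.168_network definition above can be spotted mechanically, because a usable default gateway has to sit inside the network it serves. The sketch below relies on that assumption and only looks at "routingN = default ..." lines; ip_to_int and bad_nim_routes are names made up for this example.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Read `lsnim -l NETWORK` output on standard input and print every default
# route whose gateway lies outside the network's net_addr/snm range.
bad_nim_routes() {      # usage: lsnim -l 192.168_network | bad_nim_routes
    awk '$1 == "net_addr" { n = $3 }
         $1 == "snm"      { s = $3 }
         $1 ~ /^routing[0-9]+$/ && $3 == "default" { gw[$1] = $4 }
         END { if (n != "" && s != "") for (r in gw) print n, s, r, gw[r] }' |
    while read -r net mask name gateway; do
        n=$(ip_to_int "$net")
        m=$(ip_to_int "$mask")
        g=$(ip_to_int "$gateway")
        [ $(( g & m )) -eq $(( n & m )) ] || echo "$name $gateway"
    done
}
```

For the 192.168_network definition shown above this prints "routing1 9.3.58.1", while the master_net definition produces no output.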

`LED 888-102-700-0c5 (crash)`

This LED hang usually indicates the SPOT filesystem does not have appropriate export permissions, or that the client machine was not given root access to the SPOT resource.

This can be easily determined by running the 'exportfs' command.
This is one example of how a properly exported set of RTE resources should look :


# exportfs
/export/tl5spot/usr         -ro,root=kintaro,access=kintaro
/export/tl5lpp/6105         -ro,root=kintaro,access=kintaro
/export/nim/scripts/kintaro.script -ro,root=kintaro,access=kintaro

NIM tends to be extremely NFS sensitive. If you've manually exported any of your NIM resources or are doing something to break NFS rules you will find that your exports for one reason or another may not even show up in the output of this command. It is usually a good idea to run a full NFS cleanup if your exports are not being created correctly. If even after a full NFS cleanup your exports still are not being populated properly, check the pathnames of your resources. You may find that you're breaking NFS rules.
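
To check the export options for a particular client without reading every line, the 'exportfs' output can be filtered as sketched below. exports_missing_client is a hypothetical helper; it assumes the "path -options" layout shown above and colon-separated host lists, and it compares hostnames literally, so a short name will not match a fully qualified one.

```shell
# Read `exportfs` output on standard input and print each exported path
# whose options do not grant both root= and access= to CLIENT.
exports_missing_client() {   # usage: exportfs | exports_missing_client kintaro
    awk -v c="$1" '
        # True when host appears in a colon-separated host list.
        function has_host(list, host,    k, m, h) {
            m = split(list, h, ":")
            for (k = 1; k <= m; k++) if (h[k] == host) return 1
            return 0
        }
        NF {
            opts = (NF >= 2) ? substr($2, 2) : ""    # drop the leading "-"
            ok_root = 0; ok_access = 0
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) {
                if (o[i] ~ /^root=/   && has_host(substr(o[i], 6), c)) ok_root = 1
                if (o[i] ~ /^access=/ && has_host(substr(o[i], 8), c)) ok_access = 1
            }
            if (!(ok_root && ok_access)) print $1
        }'
}
```

No output means every listed export grants the client both root and access. Any path that is printed is a candidate for the cleanup that follows.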

To run a full NFS cleanup, run the following, including a reset and deallocate operation, before setting up for your next install attempt :



# nim -Fo reset client_name
# nim -o deallocate -a subclass=all client_name

# stopsrc -g nfs
# cd /etc
# mv exports exports.
# rm rmtab xtab
# cd /var/statmon
# rm -rf ./state ./sm ./sm.bak
# startsrc -g nfs

SUPPORT
If you require more assistance, use the following step-by-step instructions to contact IBM to open a case for software with an active and valid support contract.
1. Document (or collect screen captures of) all symptoms, errors, and messages related to your issue.
2. Capture any logs or data relevant to the situation.
3. Contact IBM to open a case:
- For electronic support, see the IBM Support Community: https://www.ibm.com/mysupport
- If you require telephone support, see the web page: https://www.ibm.com/planetwide/
4. Provide a clear, concise description of the issue.
- For guidance, see: Working with IBM AIX Support: Describing the problem.
5. If the system is accessible, collect a system snap, and upload all of the details and data for your case.
- For guidance, see: Working with IBM AIX Support: Collecting snap data
