
NO Netboot / reboot not working ------- CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

 
NicolasR
Occasional Advisor


Hi There,

 

We have two problems at the moment and lack the information to solve them.

The specs:

CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4 -

 

Case:

 

1) no reboot into PXE

 

We've just installed CMU and we are unable to reboot the nodes from the management console.

If we right-click a node and select Backup (Capture image), the node shuts down and we have to power it on manually.

Of course, when we power it on, it does not boot from PXE but goes straight to GRUB.

2) From the OS prompt, running tcpdump and dhclient, I can see the DHCPDISCOVER and DHCPOFFER exchange working.

However, if we manually force the node to boot from PXE, it does not get a DHCP offer from the CMU server (media cable error / timeout).

 

We suspect a switch problem here. We went through the following checks:

1. Verify that broadcast is enabled and is redirected to the switch. -----> DONE

2. Verify that spanning tree is disabled on all ports connected to a node. ------> Due to business requirements we cannot disable it. Is there another option?

3. Verify that « multicast IGMP snoop loop » is disabled on the switch. ------> DONE

Point 2 is still unclear to us; we are looking for an answer or an alternative.

Can you recommend an alternative?

Thanks.-

 

 

NicolasR
Occasional Advisor

Re: NO Netboot / reboot not working ------- CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

OK, problem 2 is partially solved.

There was an issue with our Mellanox InfiniBand fibre cards; replacing them solved it.

Now back to problem 1:

Why does selecting BACKUP in CMU only shut the server down instead of rebooting it?

 

Chintala
Advisor

Re: NO Netboot / reboot not working ------- CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

Hello,

 

It looks like the node is taking more than 30 seconds to shut down.

Please increase the CMU_ILO_OS_SHUTDOWN_TIMEOUT value to 60 seconds in the /opt/cmu/etc/cmuserver.conf file on the management node.

 

CMU_ILO_OS_SHUTDOWN_TIMEOUT=60

 

Then try a backup again and let us know how it goes.
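
A minimal sketch of making that change from the shell, assuming the setting is stored as a plain KEY=VALUE line in /opt/cmu/etc/cmuserver.conf. The helper name set_conf_value is hypothetical, and whether CMU has to be restarted to pick up the new value is not covered here.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=VALUE in a plain KEY=VALUE config file.
# Replaces an existing "KEY=..." line, or appends one if the key is absent.
set_conf_value() {
    file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file"; then
        # Edit in place and keep a .bak copy of the original file.
        sed -i.bak "s|^${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage on the management node (path and key taken from the post above):
# set_conf_value /opt/cmu/etc/cmuserver.conf CMU_ILO_OS_SHUTDOWN_TIMEOUT 60
```

The .bak copy makes it easy to go back to the previous value if the longer timeout does not help.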

NicolasR
Occasional Advisor

Re: NO Netboot / reboot not working ------- CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

Hi Chintala,

 

Thanks for the answer.

Raising the timeout value hasn't helped.

We still have the "no rebooting" problem.

If we select "BACKUP", the node does not reboot; it shuts down directly.

Setting the node from CMU to netboot directly does launch it into PXE/TFTP mode.

However, the backup then fails to complete.

The error logs I can see at the moment are:

 

/opt/cmu/tmp/cmu_backup_err_1904676650556995234.tmp
error: CMU_NETBOOT_TIMEOUT(480 seconds) reached while waiting for hadwrk03p to network boot: check 'odin'. Debug information : NEW=1400684378 - ORIG=1400683897

 

/opt/cmu/tmp/power_osoff_hadwrk03p.err

Wed May 21 16:51:08 CEST 2014 ssh succeeded, waiting 60 seconds to let system shut down...
Enter the username: Enter the password: <?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='The RIBCL version is incorrect. The correct version is 2.0 or later.'
/>
</RIBCL>

 

 

 

Why is it complaining about the RIBCL version if the installed one is higher?
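
To get an overview of what iLO answered, the STATUS and MESSAGE attributes can be pulled out of such an output file, one pair per line. The following is only a sketch: it assumes each attribute sits on its own line exactly as in the paste above, and the function name ribcl_summary is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "STATUS: MESSAGE" for every RIBCL response
# found in the given files (or on stdin when no file is given).
# Assumes STATUS= and MESSAGE= each start their own line, as in the paste.
ribcl_summary() {
    awk '
        /^[ \t]*STATUS=/ {
            status = $0
            sub(/^[ \t]*STATUS=/, "", status)
            gsub(/"/, "", status)                  # drop the double quotes
        }
        /^[ \t]*MESSAGE=/ {
            msg = $0
            sub(/^[ \t]*MESSAGE=/, "", msg)
            msg = substr(msg, 2, length(msg) - 2)  # drop the outer quotes
            print status ": " msg
        }
    ' "$@"
}

# Example: ribcl_summary /opt/cmu/tmp/power_osoff_hadwrk03p.err
```

A non-zero STATUS would be the first thing to look at. In the output pasted above, the response reports 0x0000.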

 

Chintala
Advisor

Re: NO Netboot / reboot not working ------- CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

Hello,

 

Is it a customer cluster or an internal cluster?

If it is a customer cluster, please raise a support call with your local HP support center and give us the case ID.

Our team will help you.

If it is an internal cluster, please let us know your HP email ID so that we can get in touch with you.

Meanwhile, please provide the following details:

What is the OS on the head node?

 

From the log

 

>>>error: CMU_NETBOOT_TIMEOUT(480 seconds) reached while waiting for hadwrk03p to network boot: check 'odin'. Debug 

 

Why are these two hostnames different, hadwrk03p and odin? Is it a typo? Ideally both should be the same.

Is it trying to power on the proper node?

How much time does the node take to complete shutdown when you manually run /sbin/shutdown -h now?

Is it more or less than 60 seconds? If it is more than 60 seconds, increase the timeout to that value and try again.

On the SL4540 node, are you using hpvsa, i.e. is Dynamic RAID mode (B120i) enabled on the node?

Also, is it possible to get access to the cluster?

Chintala
Advisor

Re: NO Netboot / reboot not working ------- CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

Also, please get the complete output of the following files:

/opt/cmu/tmp/ILO_power_osoff_<nodename>.output

/opt/cmu/tmp/power_osoff_<nodename>.err

/opt/cmu/tmp/ILO_power_on_<nodename>.output

/opt/cmu/tmp/power_on_<nodename>.err

NicolasR
Occasional Advisor

Re: NO Netboot / reboot not working ------- CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

Thanks again Chintala!

 

- We've just purchased CMU for a customer cluster, so we are opening a case (not possible yet, since I'm still waiting for the serial/SAID).

- The hostname mismatch in the previous error was a typo; please disregard it.

- Shutdown takes no longer than 20 seconds.

- On the SL4540 we are using hpvsa (the original installation was done with "blacklist=ahci dd").

I still can't figure out why the node is not rebooted (only powered off) and cannot complete a TFTP boot.

 

The logs:

/opt/cmu/tmp/ILO_power_osoff_<nodename>.output --------> OK

/opt/cmu/tmp/power_osoff_<nodename>.err --------> does not exist

/opt/cmu/tmp/ILO_power_on_<nodename>.output --------> OK

/opt/cmu/tmp/power_on_<nodename>.err --------> does not exist

Added:

/opt/cmu/tmp> cat cmu_backup_err_1181849120419151806.tmp

------------------------------------------------------------------------------------------------------------------------

 

Thu May 22 10:49:34 CEST 2014 /opt/cmu/hardware/ILO/cmu_ILO_power_osoff called with -n hadwrk03p -i 172.22.20.34 -e /opt/cmu/tmp/power_osoff_hadwrk03p.err
Enter the username: Enter the password: <?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='The RIBCL version is incorrect. The correct version is 2.0 or later.'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

 


----------------


14:12:25 hadmae1p(root):/opt/cmu/tmp> cat ILO_power_on_hadwrk03p.output
Thu May 22 10:49:48 CEST 2014 /opt/cmu/hardware/ILO/cmu_ILO_power_on called with -n hadwrk03p -i 172.22.20.34 -e /opt/cmu/tmp/power_on_hadwrk03p.err
Enter the username: Enter the password: <?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='The RIBCL version is incorrect. The correct version is 2.0 or later.'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>


-------------


14:13:56 hadmae1p(root):/opt/cmu/tmp> cat cmu_backup_err_1181849120419151806.tmp
error: CMU_NETBOOT_TIMEOUT(480 seconds) reached while waiting for hadwrk03p to network boot: check 'hadwrk03p'. Debug information : NEW=1400749085 - ORIG=1400748604
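
The two numbers in the debug information look like Unix epoch timestamps (an assumption, not something the CMU message states), in which case their difference is how long CMU waited. A small sketch to extract them from a single error line and subtract; the function name netboot_wait is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read one CMU error line on stdin and print NEW - ORIG.
# Assumes NEW= and ORIG= are both Unix epoch seconds.
netboot_wait() {
    line=$(cat)
    new=$(printf '%s\n' "$line" | sed -n 's/.*NEW=\([0-9][0-9]*\).*/\1/p')
    orig=$(printf '%s\n' "$line" | sed -n 's/.*ORIG=\([0-9][0-9]*\).*/\1/p')
    # Fall back to 0 if a value is missing, so the arithmetic never fails.
    echo $(( ${new:-0} - ${orig:-0} ))
}

# Example: netboot_wait < /opt/cmu/tmp/cmu_backup_err_1181849120419151806.tmp
```

For the line above this gives 481 seconds, one second past the 480-second CMU_NETBOOT_TIMEOUT, which would fit CMU giving up as soon as the timeout expired.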

 

 

Thanks for your help!

 

 

 

 

Chintala
Advisor

Re: NO Netboot / reboot not working ------- CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

Thank you for the information.

Is it happening only with this node, or with other nodes in the cluster as well?

Have you tried resetting the iLO of the backup node?

Can you try the following from the management node (make sure the backup node is powered on):

 

# /opt/cmu/bin/cmu_power -p BOOT -n <backup_nodename>

 

Then see whether the node comes up or not.

Try the same steps on another node in the cluster.

Get the power logs (mentioned in the above post) for those nodes.
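
To repeat the same test on a few nodes in one go, the cmu_power command above can be wrapped in a loop. This is a sketch only: the node names in the usage line are placeholders, boot_nodes is a hypothetical helper, and the CMU_POWER variable exists just so the path can be overridden.

```shell
#!/bin/sh
# Path of the CMU power command as used earlier in this thread.
CMU_POWER=${CMU_POWER:-/opt/cmu/bin/cmu_power}

# Hypothetical helper: run "cmu_power -p BOOT -n NODE" for each node given.
boot_nodes() {
    for node in "$@"; do
        echo "== $node =="
        "$CMU_POWER" -p BOOT -n "$node" || echo "cmu_power failed for $node" >&2
    done
}

# Usage (placeholder node names):
# boot_nodes node01 node02
```

Running the nodes one after another keeps the output of each attempt separate, which makes it easier to match against the per-node power logs.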

 

Next:

-------

Can you manually shut down the backup node and then start the backup process?

Let us know what is happening on the console of the backup node.

Also, is it possible to get access to the cluster?

 

 

NicolasR
Occasional Advisor

Re: NO Netboot / reboot not working ------- CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

Morning Chintala (at least here in Spain XD),

Thanks for the recommendations; they are helping me get to know the software better.

The 1st action was successful: I can manually reboot the node.

 

09:55:47 hadmae1p(root):/opt/cmu/bin> ./cmu_power -p BOOT -n hadwrk03p

powering off hadwrk03p ...

spawning 1 task(s) ................
waiting for 1 task(s) ................ { last:hadwrk03p }
powering on hadwrk03p ...

spawning 1 task(s) ................
waiting for 1 task(s) ................ { last:hadwrk03p }
./cmu_power finished

 

2nd action: manually shutting the node down, then running the backup from CMU.

Manual shutdown: OK.

Launching the backup from CMU: partially OK.

It will power on, but will go directly into GRUB (it's supposed to boot into PXE/TFTP mode, correct?)

ILO_power_osoff_hadwrk03p.output

ILO_power_on_hadwrk03p.output

Both seem OK; no errors or discrepancies found.

cmu_backup_err from /opt/cmu/tmp:

10:22:58 hadmae1p(root):/opt/cmu/tmp> cat cmu_backup_err_7263375511447792300.tmp
<hadwrk03p> : error netbooting node: check CMU configuration, boot order, and BMC access. Debug info : NEW=1400832591 - ORIG=1400832373 : TIMEOUT=480

 

Sorry, we are not able to provide access to our customer network.

 

- How can I run a command to make the node boot into PXE/TFTP and check what's happening on the server side?

 

Thanks for your support,

Nicolas.-

 

Chintala
Advisor

Re: NO Netboot / reboot not working ------- CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

Morning Nicolas,

Good to know that the node powers on when you power it on manually.

We still need to figure out why the node does not power on when we start a backup process, given that it powers on successfully with /opt/cmu/bin/cmu_power. (In fact, the CMU backup process uses the same command you tried.)

 

>>>It will boot on, but will go directly into grub (its supposed to boot into PXE/TFTP mode, correct?) 

 

Is the admin NIC (the one in the CMU database) PXE-enabled?

Is PXE boot set before HDD in the node's boot order? If not, you can set it through the iLO -> Virtual Power -> Bootorder.

When you start the backup process, does the backup node send any DHCP requests? If so, do the requests (packets) reach the head node? You can check this by looking for the node's MAC address in /var/log/messages on the head node.
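
A quick way to do that lookup, assuming the DHCP server on the head node logs to /var/log/messages with the usual DHCPDISCOVER/DHCPOFFER/DHCPREQUEST/DHCPACK keywords (as ISC dhcpd does). The MAC address in the usage line is a placeholder and dhcp_lines_for_mac is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: show DHCP log lines that mention a given MAC address.
# $1 = MAC address (matched case-insensitively), $2 = log file (optional).
dhcp_lines_for_mac() {
    mac=$1 log=${2:-/var/log/messages}
    grep -i -F -- "$mac" "$log" | grep -E 'DHCP(DISCOVER|OFFER|REQUEST|ACK)'
}

# Usage on the head node (placeholder MAC):
# dhcp_lines_for_mac 00:11:22:aa:bb:cc
```

No output at all while the node sits in PXE would suggest the requests never reach the head node; a DHCPDISCOVER without a matching DHCPOFFER would point at the DHCP configuration instead.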

 

Also, is spanning tree enabled on the switch side (I saw it in your earlier posts)?

Is it possible to disable it? We have seen difficulties with PXE booting in the past with spanning tree enabled.

If it is not possible to disable spanning tree, you need to set the switch ports connected to the nodes to PortFast/edge-port mode to work around STP.

Set the ports connected to the nodes to PortFast/edge-port mode at the switch level, manually shut down the backup node, then start the backup process again.

  

>>>How can I launch a command make the node launch into PXE/TFTP and check whats happening on the server side?

 

You can view the console of the backup node from the management node by running

 

# /opt/cmu/bin/cmu_console <nodename> vsp

 

Before that, make sure the node's virtual serial port (VSP) is set to COM1 (ttyS0).

If it is set to COM2, change it to COM1 in the BIOS.

 

Let us know how it goes.