Re: CMU v.8.0 clone fail on Apollo r2200 with XL170r nodes

Thomas Kappfjell · ‎08-17-2016

Hi. We have problems cloning from a golden image on node1 in an Apollo r2200 LFF chassis with XL170r nodes to the rest of the nodes in the chassis. All the nodes have the same hardware config, with a Smart Array H240 and two 300GB SAS LFF drives configured in RAID 1. The cloning fails at disk setup/provisioning. We use "/opt/cmu/bin/cmu_backup -l <logical_group> -n <node>" to back up node1, and "/opt/cmu/bin/cmu_clone -i <imagename> -n <nodeslist>" to clone onto the other nodes.

The backup from node1 is set to disk "sda", but when I run hpssa on the other nodes in the chassis, their disks come up as sdb, sdc and sdd. We suspect this is because of the shared disk shelf in the front of the chassis. If we clone to node1 in other chassis, it works fine. We have patched CMU to the latest release and tried both the UEFI and Legacy BIOS options.

Any ideas?

Could not attach log files, since only jpg, gif and png are valid extensions.

CMU log:

node list found in backup group : node02 { 1 node }

using single-stage cloning

***
*** the clone disk selection mode is set to STRICT in cmuserver.conf
*** in STRICT mode, clone operation expects that the nodes-to-be-cloned
*** and the backup node are exactly similar in their hardware and
*** disk controller configuration, otherwise the clone operation
*** may fail to select a disk
*** this is a new setting introduced in v8.0 to avoid inadvertent data
*** loss by cloning a wrong disk
***
*** for cloning nodes with different h/w config to that of backup node,
*** set CMU_CLONE_DISK_SELECTION_MODE=FLEXIBLE in cmuserver.conf, which
*** directs the cloning engine to heuristically select the most suitable disk
*** however, there is a risk that a wrong disk is selected by the
*** FLEXIBLE mode, especially while cloning nodes with multiple disks/luns,
*** resulting in data loss
***

making node(s) reservation(s) for cloning ( id: 13590 )
cleaning /etc/dhcpd.conf
cleaning boot directory
configuring the system
copying ssh settings
sending power off to selected nodes
rebuilding network-boot image
starting cloning # 13590

cloning started on [17-Aug-2016_11:18:16]

+-------------------------------------+--------+
| 1 x PREPARING_DISK ==> ERROR | node02 |
+-------------------------------------+--------+

cloning process finished on 2016-08-17 at [17-Aug-2016_11:23:21]

[CerbereDB] Database report:
[CerbereDB] | cloned | error | unknown
[CerbereDB] ComputeNodes_p0 | 0 | 1 | 0
[CerbereDB] Total | 0 | 1 | 0
[CerbereDB] List of nodes in error: node02
[CerbereServer] Delete "/opt/cmu/ntbt/rp/x86_64/etc/rc.d/auto/cmucerbere.sh-135
[CerbereTypes] Delete Cerbere Data
[CerbereTypes] Delete Cerbere Data - Stage 0
[CerbereTypes] Delete Cerbere Data - Stage 1
[CerbereTypes] Delete Cerbere Data - Stage 2
[CerbereTypes] Delete Cerbere Data - Stage 3
Cerbere is terminating with status 0

detailed logs are in /opt/cmu/log/cmucerbere-13590.log and
/opt/cmu/log/cmucerbere-*.log
releasing node(s) reservation(s) for cloning ( id: 13590 )
logout
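
The banner in the log refers to CMU_CLONE_DISK_SELECTION_MODE in cmuserver.conf. Below is a minimal sketch for checking which mode is currently set before retrying a clone. The file location /opt/cmu/etc/cmuserver.conf, the CMU_SERVER_CONF override variable and the helper name clone_disk_mode are assumptions, not taken from the log, and the KEY=value layout is inferred from the banner text.

```shell
#!/bin/bash
# Hypothetical helper: print the value of CMU_CLONE_DISK_SELECTION_MODE
# from a cmuserver.conf-style file, or "unset" if no uncommented line sets it.
clone_disk_mode() {
    local conf="$1" value
    # Keep the last uncommented assignment; drop quotes and trailing spaces.
    value=$(sed -n 's/^[[:space:]]*CMU_CLONE_DISK_SELECTION_MODE[[:space:]]*=[[:space:]]*//p' "$conf" 2>/dev/null \
            | tail -n 1 | tr -d '"' | tr -d "'" | sed 's/[[:space:]]*$//')
    echo "${value:-unset}"
}

# Assumed location on the management node; adjust if yours differs.
CONF="${CMU_SERVER_CONF:-/opt/cmu/etc/cmuserver.conf}"
echo "clone disk selection mode: $(clone_disk_mode "$CONF")"
```

As the banner itself warns, switching to FLEXIBLE carries a risk of selecting the wrong disk on nodes with multiple disks/LUNs, so this only reports the setting and does not change it.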

Armugam_Pradeep · ‎08-17-2016

Hi,

Is it a customer cluster?

Please provide us the following logs from management node:

#cat /opt/cmu/image/<image name>/header.txt

# /opt/cmu/log/cmucerbere-node02-13590.log

Please try changing the file extension to jpg etc., or copy/paste the entire log here.

Regards,

Pradeep

Thomas Kappfjell · ‎08-17-2016

It's a customer cluster, but we run it. CMU is under support.

#cat /opt/cmu/image/XLnodesGen9/header.txt
date:00h54m06s 17-Aug-2016
ostype:CentOS Linux release 7.2.1511 (Core)
imagename:XLnodesGen9
hostname:node01
root:disk/by-path/cmu_pci0000:00_0000:00:03.0_0000:03:00.0_host0_target0:0:0_0:0:0:0-part4
rootctrlvendorid:103c:3239
rootctrlbusid:0000:03:00.0
disk:disk/by-path/cmu_pci0000:00_0000:00:03.0_0000:03:00.0_host0_target0:0:0_0:0:0:0
partition:disk/by-path/cmu_pci0000:00_0000:00:03.0_0000:03:00.0_host0_target0:0:0_0:0:0:0-part4
partition:disk/by-path/cmu_pci0000:00_0000:00:03.0_0000:03:00.0_host0_target0:0:0_0:0:0:0-part1
partition:disk/by-path/cmu_pci0000:00_0000:00:03.0_0000:03:00.0_host0_target0:0:0_0:0:0:0-part3
terminated:noerror
timespent:77sec
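
The header shows that the image was recorded against by-path names rather than "sda". Below is a small sketch for pulling those fields out of header.txt so they can be compared with what a netbooted node actually exposes. The function name header_field and the HEADER variable are hypothetical; the only assumption about the file is the key:value layout shown above.

```shell
#!/bin/bash
# Hypothetical helper: print every value recorded for a given key
# (e.g. disk, root, partition) in a CMU image header.txt.
header_field() {
    local key="$1" file="$2"
    # Values contain colons themselves, so strip only the leading "key:" prefix.
    sed -n "s/^${key}://p" "$file"
}

HEADER="${HEADER:-/opt/cmu/image/XLnodesGen9/header.txt}"
if [ -r "$HEADER" ]; then
    echo "backup disk:    $(header_field disk "$HEADER")"
    echo "root partition: $(header_field root "$HEADER")"
    echo "controller:     $(header_field rootctrlvendorid "$HEADER") at $(header_field rootctrlbusid "$HEADER")"
fi
```

Because the pattern is anchored on "key:", asking for root does not also return the rootctrlvendorid or rootctrlbusid lines.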

Thomas Kappfjell · ‎08-17-2016

renamed logfile cmucerbere-node02-13590.txt

Armugam_Pradeep · ‎08-17-2016

Hi

cmucerbere-node002-13590.log

......

+ grep -qE 'disk:.*by-path.*|root:.*by-path.*' /opt/cmu/image/XLnodesGen9/header.txt
+ rc=0
+ '[' 0 -ne 0 ']'
+ /opt/cmu/tools/cmu_wait_dev -d /dev/disk/by-path
+ echo 'error: cannot find /dev/disk/by-path, exiting...'
error: cannot find /dev/disk/by-path, exiting...
+ exit 1

From the cloning logs, we can see that the disk's by-path entries are not visible during cloning, which is strange.

How many times did you try cloning on this problematic node?

Can you please try again with the same image? If it fails, log in to node02 from the management node and run the command below to see whether the by-path entries are visible on the CMU netbooted node:

#ls -l /dev/disk/by-path

As a last try, please copy the code below into the "custom code" section of /opt/cmu/image/XLnodesGen9/pre_reconf.sh and retry cloning. The code adds a delay (up to about 120 sec) to give udev time to create the /dev/disk/by-path entries in the CMU netboot environment.

Example:

[root@mn-head1 ~]# cat /opt/cmu/image/lg_rh6u6_13_7/pre_reconf.sh
#!/bin/bash

#cmu_begin_interface

#do not change anything in this section
#add custom code after this section

CMU_PRE_RECONF_VERSION=1

#starting from cmu version 4.2 this script is dedicated to custom code
#it is running at cloning time after netboot is done and before the
#filesystems or even the partitioning is created.

# this script is invoked by cmu_pre_cloning stored on the management node
# into /opt/cmu/ntbt/rp/<arch>/opt/cmu/tools/

#cmu_end_interface

# - custom code starts here -
echo "running pre_reconf script ....loading by-path entries"
for((i=1;i<=25;i++)); do
    if [ -d /dev/disk/by-path ]; then
        echo "found dev/disk/by-path dir"
        break;
    fi;
    sleep 5;
done

exit 0
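
The loop above gives up silently after 25 tries and the script still exits 0, so a timeout is easy to miss in the cloning log. Below is a variant sketch that reports whether the directory showed up. wait_for_dir and its arguments are hypothetical names, not part of CMU. Whether pre_reconf.sh should then return non-zero is a judgment call, since the pre_reconf.sh template posted later in this thread notes that cloning fails if the script returns a non-zero exit code.

```shell
#!/bin/bash
# Hypothetical helper: wait until a directory exists, polling every
# $3 seconds for at most $2 seconds. Returns 0 if found, 1 on timeout.
wait_for_dir() {
    local dir="$1" timeout="${2:-120}" interval="${3:-5}" waited=0
    while [ ! -d "$dir" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timeout: $dir not present after ${waited}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "found $dir after ${waited}s"
}

# Possible use in the custom code section (warn but let cloning continue):
# wait_for_dir /dev/disk/by-path 120 5 || echo "warning: by-path entries still missing"
```

With this form the log shows either how long the wait took or an explicit timeout message, which makes it easier to tell a slow udev from entries that never appear.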

Note: Please call the HPE Support Center for official CMU support to debug the issue. The HPE forum is not meant for customer issues.

Pradeep

Thomas Kappfjell · ‎08-18-2016

Hi, and thanks for the reply. We will continue this with HPE support. The script change did not work. CMU is configured by the HPE Core HPC stack using the Cluster Setup Tool. The pre_reconf.sh file looks like this:

#!/bin/bash

#cmu_begin_interface

#do not change anything in this section
#add custom code after this section

CMU_PRE_RECONF_VERSION=1

#starting from cmu version 4.2 this script is dedicated to custom code
#it is running at cloning time after netboot is done and before the
#filesystems or even the partitioning is created.

# this script is invoked by cmu_pre_cloning stored on the management node
# into /opt/cmu/ntbt/rp/<arch>/opt/cmu/tools/

#cloning will fail if this script returns non-zero exit code

#cmu_end_interface

# - custom code starts here -
#Added by CST start

CN_BASE_DIR=/tmp
HEAD_NODE=skadi.ngu.no
SHARED_DIR=/share/apps
DISK_ARRAY_CONFIGURATION_FILE=/share/apps/diskarray/XLnodesGen9_disk_array_configuration

declare -r mounted_dir="$CN_BASE_DIR$SHARED_DIR"
declare -r hpssascripting=$mounted_dir/diskarray/hpssascripting

umount -f $mounted_dir >/dev/null 2>&1
grep -q $HEAD_NODE:$SHARED_DIR /etc/fstab
if [ $? -eq 0 ]; then
    grep -v $HEAD_NODE:$SHARED_DIR /etc/fstab > /etc/fstab.new
    mv -f /etc/fstab /etc/fstab.orig
    mv -f /etc/fstab.new /etc/fstab
fi
cat >> /etc/fstab <<-CST_MOUNT
$HEAD_NODE:$SHARED_DIR $mounted_dir nfs defaults 0 0
CST_MOUNT
mkdir -p $mounted_dir
mount $mounted_dir

$hpssascripting -reset -i "$CN_BASE_DIR$DISK_ARRAY_CONFIGURATION_FILE"

#Added by CST end

exit 0

and

cat /share/apps/diskarray/XLnodesGen9_disk_array_configuration
; Date captured: Tue Aug 16 22:06:57 2016

; Version: 2.50.1.0

Action= Configure
Method= Custom

; __________________________ Controller Specifications SLOT 1 ________________________________
;
; Controller HPE Smart HBA H240, FirmwareVersion 3.56, License Keys Supported
; SerialNumber PDNNK0BRH240VH
; DriverName hpsa
; DriverVersion 3.4.10
; SSDSmartPath Supported
Controller= SLOT 1
; PowerMode= MaxPerformance
RebuildPriority= High
ExpandPriority= Medium
ParallelSurfaceScanCount= 1
SurfaceScanMode= Idle
SurfaceScanDelay= 3
Latency= Disable
DriveWriteCache= Disabled
MNPDelay= 60
IRPEnable= Disabled
DPOEnable= Disabled
ElevatorSortEnable= Enabled
QueueDepth= Automatic
PredictiveSpareActivation= Disable

; Array Specifications
Array= A
; Array Drive Type is SAS
; Array Free Space 0 GBytes
; 1I:1:1 (300 GB, SAS), 1I:1:2 (300 GB, SAS)
Drive= 1I:1:1, 1I:1:2
OnlineSpare= No

; Logical Drive Specifications
LogicalDrive= 1
RAID= 1
Size= 286070
; SizeBlocks= 585871964
Sectors= 32
StripSize= 256
Caching= Disabled
; VolumeUniqueID= 600508B1001C55A3FF6AE21FB36E5B50

Armugam_Pradeep · ‎08-18-2016

Hi,

Please raise an issue with the HPE Core HPC stack team to debug this.

You may want to try adding our code at the end of the CST customization in the pre_reconf.sh script, as shown below:

#!/bin/bash
#cmu_begin_interface
#do not change anything in this section
#add custom code after this section
CMU_PRE_RECONF_VERSION=1
#starting from cmu version 4.2 this script is dedicated to custom code
#it is running at cloning time after netboot is done and before the
#filesystems or even the partitioning is created.
# this script is invoked by cmu_pre_cloning stored on the management node
# into /opt/cmu/ntbt/rp/<arch>/opt/cmu/tools/
#cloning will fail if this script returns non-zero exit code
#cmu_end_interface
# - custom code starts here -
#Added by CST start
CN_BASE_DIR=/tmp
HEAD_NODE=skadi.ngu.no
SHARED_DIR=/share/apps
DISK_ARRAY_CONFIGURATION_FILE=/share/apps/diskarray/XLnodesGen9_disk_array_configuration
declare -r mounted_dir="$CN_BASE_DIR$SHARED_DIR"
declare -r hpssascripting=$mounted_dir/diskarray/hpssascripting
umount -f $mounted_dir >/dev/null 2>&1
grep -q $HEAD_NODE:$SHARED_DIR /etc/fstab
if [ $? -eq 0 ]; then
    grep -v $HEAD_NODE:$SHARED_DIR /etc/fstab > /etc/fstab.new
    mv -f /etc/fstab /etc/fstab.orig
    mv -f /etc/fstab.new /etc/fstab
fi
cat >> /etc/fstab <<-CST_MOUNT
$HEAD_NODE:$SHARED_DIR $mounted_dir nfs defaults 0 0
CST_MOUNT
mkdir -p $mounted_dir
mount $mounted_dir
$hpssascripting -reset -i "$CN_BASE_DIR$DISK_ARRAY_CONFIGURATION_FILE"
#Added by CST end

echo "running pre_reconf script ....loading by-path entries"
for((i=1;i<=25;i++)); do
    if [ -d /dev/disk/by-path ]; then
        echo "found dev/disk/by-path dir"
        break;
    fi;
    sleep 5;
done

exit 0

Pradeep
