Re: cfg_ksm_memreq returned null name

 
Christof Schoeman
Frequent Advisor

cfg_ksm_memreq returned null name

Hi

I get the following warning every time I do a vold -k

lsm:vold: WARNING: cfg_ksm_memreq returned null name

The system belongs to a 2 node cluster, running Tru64 V5.1B PK5. vold -k also takes about 30 minutes to complete.

Has anyone seen this warning?

Hein van den Heuvel
Honored Contributor
Solution

Re: cfg_ksm_memreq returned null name

The cfg_ksm_memreq() routine is used to access a particular member of a kernel set.
Currently, there are two kernel sets: the Hardware Set and the Process Set. The Hardware Set is a specific kernel set used to describe all the hardware that is part of a system.
The cfg_ksm_memreq routine is typically used to obtain attribute information associated with an individual set member. It can also be used to set a member's attribute value, if setting the attribute value is allowed.

Parameters for the cfg_ksm_memreq Routine
The cfg_ksm_memreq() routine has the following parameters:
cfg_status_t cfg_ksm_memreq(
cfg_handle_t *handle,
struct ksm_mem_req_buf *reqbuf,
int bufsiz,
cfg_attr_t **attributes,
int *nattributes);

All this mumbo jumbo is to say that it is a normal routine, used by lsm, hwmgr, collect and the like, but it is returning an unexpected result.
This suggests some inconsistency in the device database.
I would look around with dn_setup, hwmgr, dsfmgr and the like to see if you can identify, and hopefully correct, that inconsistency.

fwiw,
Hein.
Han Pilmeyer
Esteemed Contributor

Re: cfg_ksm_memreq returned null name

I'll try a different approach to this topic.

There were changes in BL26 to decrease LSM boot times in clusters. So it should be faster rather than slower. Are you saying the vold startup (boot) time is longer than before?

I would only expect 30 minute start up times on very large/complex LSM/cluster configurations unless there are problems.
- Can you describe your LSM configuration, i.e. number of disks, number of volumes, number of disk groups?
- Are you using LSM auto configuration?
- Have you checked for any errors (missing disks, paths, broken mirrors, etc)?

There also were changes to the way the "kernel sets" are used internally (as Hein described). This might not at all be related to the warning message you got. Also I haven't found any evidence that this is linked to the vold start up time (which doesn't imply that that couldn't be the case however).

Could you do a "voldctl enable" to see if you get the same error and whether that also takes 30 minutes? This rescans the LSM configuration.
Christof Schoeman
Frequent Advisor

Re: cfg_ksm_memreq returned null name

I wish there was a simple answer to all your questions, but here goes...

This system is our Legato server. The production systems all use EMC storage, with BCVs (Business Continuance Volumes). What we do is split the BCVs (to get a snapshot copy of the database), then clone-deport-import-recover the LSM volumes on this server and do our backups from there, thus not impacting production.

So, the amount of storage can be anything from 5TB to 40TB, with a number of configurations as far as volumes and groups are concerned.

Yesterday, LSM would not even start, so we re-initialized LSM and sync'ed the other node. Now at least it starts, slowly.

We are waiting for a downtime slot, so that we can re-create the whole device database. I have a strange feeling that an inconsistency might have slipped in along the way. I've done this on single nodes, so any pointers on how to tackle this in a cluster would be appreciated.

About the voldctl - I'll have to wait my turn now, probably during the downtime slot.


Han Pilmeyer
Esteemed Contributor

Re: cfg_ksm_memreq returned null name

It sounds likely that the stale devices from the BCVs are playing a role here. Got to think about that for a while...

Are you using volclonedg in the cycle?
Christof Schoeman
Frequent Advisor

Re: cfg_ksm_memreq returned null name

volclonedg...
voldg deport...
voldg -n ... import... (with new name)
volrecover...
Christof Schoeman
Frequent Advisor

Re: cfg_ksm_memreq returned null name

Still no downtime, but here is what I plan to do:
- Remove all the disks that do not belong to this particular system.
- Stop LSM
- Clean up the device database with hwmgr
- Reboot the cluster.

Hopefully, this will sort out whatever inconsistency is causing the warning.

If this does not work, then I will rebuild the entire device database. I found the following procedure for doing this in another forum. I didn't see any copyright stuff, but thanks to Debra Alpert.

Any comments on the procedure?


------- snip -------

Hello,


Since corrupted device databases have reared their ugly heads again
today, I thought I should send out this summary describing how to
correct these types of situations. The steps that follow assume a Tru64
5.X cluster, where cluster LSM is not utilized. All of our clusters
fall into this category. The steps under the header "Cleanup the
Installation Disk" may be used to repair device databases on a
standalone node that is not using LSM. On the other hand, we don't have
any of those...


On standalone nodes where LSM is used (these we have), physically remove
one of the rootdg disks, and boot the remaining disk to single-user
mode. You can't actually mount anything except the root partition in
this situation. Once you run "mountroot", you can apply the steps in
the "Cleanup the Installation Disk" section. When the node is rebooted
after these steps are completed, you'll have to deactivate LSM to
proceed further. Assume that the disk you've booted from is
/dev/disk/dskB.


# mountroot
# cd /etc/vol
# mv volboot volboot.sav
# ed /etc/sysconfigtab
/swapdevice/p
swapdevice = /dev/vol/rootdg/swapvol
/swapdevice/s/vol\/rootdg\/swapvol/disk\/dskBb/p
swapdevice = /dev/disk/dskBb
/rootdev/p
lsm_rootdev_is_volume = 1
/rootdev/s/1/0/p
lsm_rootdev_is_volume = 0
w
q
# disklabel -e dskB
(using ed, you'll have to replace all LSM references with AdvFS, swap,
or unused filesystem types, as appropriate)
# ed /etc/inittab
(using ed, you'll have to comment out the two lsm entries and the single
vol entry)
# cd /etc/fdmns/root_domain
# rm *
# ln -s /dev/disk/dskBa
# cd ../usr_domain
# rm *
# ln -s /dev/disk/dskBg
# shutdown -r now
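
For anyone who would rather not drive ed by hand, the two /etc/sysconfigtab substitutions in the session above could be sketched with sed instead. This is only a sketch under assumptions: the helper name fix_sysconfigtab and the output file are made up for illustration, dskB is the same placeholder disk name used in the procedure, and it assumes sed is available in the environment you have booted into. It writes to a copy so the original file can be compared before anything is replaced.

```shell
#!/bin/sh
# Sketch only: mirrors the two ed substitutions on a COPY of sysconfigtab.
# fix_sysconfigtab is a hypothetical helper, not part of Tru64.
fix_sysconfigtab() {
    src=$1      # input file, e.g. /etc/sysconfigtab
    out=$2      # output copy (hypothetical name, e.g. /etc/sysconfigtab.new)
    disk=$3     # boot disk name without partition letter, e.g. dskB

    # 1) point swapdevice at the raw b partition instead of the LSM volume
    # 2) tell the kernel that the root device is no longer an LSM volume
    sed -e "s|/dev/vol/rootdg/swapvol|/dev/disk/${disk}b|" \
        -e 's|lsm_rootdev_is_volume = 1|lsm_rootdev_is_volume = 0|' \
        "$src" > "$out"
}
```

A possible use would be `fix_sysconfigtab /etc/sysconfigtab /etc/sysconfigtab.new dskB`, followed by a diff of the two files before moving the new one into place.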


You should now come up in multi-user mode, with no user filesystems
accessible, as LSM is disabled. Once you encapsulate the root disk, the
LSM metadata on the other disks will allow automatic import of the user
diskgroups.


# volencap dskB
# volreconfig


After the system reboots, all filesystems should be mounted and
accessible. Insert the system mirror disk you removed earlier back into
its former slot, run "scu scan edt" so the system recognizes the disk,
run "hwmgr -vi dev" to obtain its new name, and use "dsfmgr" to rename
the disk to its original OS designation. You can now modify the
disklabel on this device, say dskM, and remirror the boot device:


# disklabel -z dskM
# disklabel -r dskB >/tmp/dl
# vi /tmp/dl (mark all partitions unused, and if you enjoyed ed, go for
it again instead of vi!)
# disklabel -rR dskM /tmp/dl
# volrootmir -a dskM


The cluster repair steps follow. For LSM protected standalone nodes,
follow the "Cleanup the Installation Disk" section at the point
indicated.


Have fun!


--Deb


####################################################################


Note:
-the terms "install disk" and "ER disk" and "standalone
disk" are interchangeable
-the procedure is normally done using member1. This is due to the
fact that the local device info propagated to the cluster initially
is from member1. If you use a different member, the local info
restored to the "down" cluster may not match exactly. This can
usually be cleaned up later with various hwmgr/dsfmgr commands.



####################################################################


How to rebuild HW-DB for 1st Member in Cluster
--------------------------------------
- Have a 'hwmgr show scsi -full' from all systems saved
- Take note of the tape, mc and cdrom devices manually
- shutdown the whole cluster
- boot the install disk
- Cleanup Install disks' HWDB here if necessary


#####################################################################.


Cleanup the Installation Disk
-----------------------------
boot -fl s
rm /etc/dec*
rm /etc/dfsc*
rm /etc/dc*
rm -rf /cluster/members/member/dev/[a-z]*
cd /cluster/members/member/dev/; ./MAKEDEV std
rm /cluster/members/member/etc/dfsl*
rm /cluster/members/member/.Booted
rm -rf /devices/*
halt
boot -fl s
mountroot
dn_setup -init
dsfmgr -K
dsfmgr -v # optionally -vF
hwmgr show scsi


# fix if necessary links in /etc/fdmns
mount /usr
mount /var
cdslinvchk # Fix problems NOW


dsfmgr -m/-e # to make old/new device names match


##################################################################


Notes:
-here we're assuming you're fixing member1. Change
accordingly if necessary.
-you may need to manually create the mount directories and links in
/etc/fdmns for cluster_root and root?_domain.
-you should backup all of these files before removing them - best
just to 'mv' them to a new "backup" subdirectory.
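
The last note (move instead of remove) could be sketched as a small helper like the one below. The function name stash_files and the directory name "backup" are illustrative assumptions, not part of the original procedure; the prefixes you pass in would be the same ones used in the rm commands (dec, dfsc, dc and so on).

```shell
#!/bin/sh
# Sketch: move files whose names start with the given prefixes into a
# "backup" subdirectory instead of deleting them.
# stash_files and the "backup" directory name are hypothetical.
stash_files() {
    dir=$1; shift                      # directory holding the files, e.g. /etc
    mkdir -p "$dir/backup" || return 1
    for prefix in "$@"; do             # prefixes such as: dec dfsc dc
        for f in "$dir/$prefix"*; do
            [ -f "$f" ] || continue    # skip unmatched globs and directories
            mv "$f" "$dir/backup/" || return 1
        done
    done
}
```

For example, `stash_files /etc dec dfsc dc` would stand in for the three `rm /etc/...` lines of the cleanup, leaving the old databases in /etc/backup in case they are needed again.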


Cleanup Clustermember 1 and Cluster Root
----------------------------------------
boot -fl s
mount root1_domain#root /mnt # Do the fdmns links match?
mount cluster_root#root /mnt1 # Do the fdmns links match?
rm /mnt/etc/dec*
rm /mnt1/etc/dfsc*
rm /mnt1/etc/dec_unid_db*
rm /mnt1/etc/dec_hwc_cdb*
rm /mnt1/etc/dccd*
rm /mnt1/etc/dcdd*
rm -rf /mnt1/devices/*
rm /mnt1/cluster/members/member1/.Booted
rm /mnt1/cluster/members/member1/etc/dfsl*
rm /mnt1/cluster/members/member1/etc/cfginfo
rm -rf /mnt1/cluster/members/member1/dev/[a-z]*
cd /mnt1/cluster/members/member1/dev/; ./MAKEDEV std


###################################################################


Setup first Member & Cluster
------------------------------
To member boot:
cp /etc/dec_devsw* /mnt/etc/
cp /etc/dec_hw_db* /mnt/etc/
cp /etc/dec_hwc_ldb* /mnt/etc/
cp /etc/dec_scsi* /mnt/etc/


To cluster root:
cp /etc/dfsc* /mnt1/etc/
cp /etc/dec_unid_db* /mnt1/etc/
cp /etc/dec_hwc_cdb* /mnt1/etc/
cp /etc/dccd* /mnt1/etc/
cp /etc/dcdd* /mnt1/etc/
cp /etc/dfsl* /mnt1/cluster/members/member1/etc/
cp /etc/cfginfo /mnt1/cluster/members/member1/etc/


###################################################################


For ALL Members do :
--------------------
file /dev/disk/dsk?h # Quorum Disk
- Take note of the Major/Minor Number


file /dev/disk/dsk?a # Member Boot Disk
- Take note of the Major/Minor Number


- edit /mnt/etc/sysconfigtab
clubase:
cluster_qdisk_major=19 # From above Quorum Disk
cluster_qdisk_minor=32 # From above Quorum Disk
cluster_seqdisk_major=19 # From above Boot Disk
cluster_seqdisk_minor=64 # From above Boot Disk


vm:
swapdevice=/dev/disk/dsk?b # Use correct swapdevice here !
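
To avoid typing mistakes when transferring the numbers, the clubase entries above could be generated from the four values noted from the `file` output. This is a minimal sketch: clubase_stanza is a hypothetical helper, the attribute names are the ones shown above, and the numbers you pass in must be the ones from your own disks. It only prints text; merging the result into /mnt/etc/sysconfigtab is still a manual step.

```shell
#!/bin/sh
# Sketch: print a clubase stanza from the major/minor numbers noted above.
# Arguments: quorum-disk major, quorum-disk minor, boot-disk major, boot-disk minor.
# clubase_stanza is a hypothetical helper name.
clubase_stanza() {
    [ $# -eq 4 ] || { echo "usage: clubase_stanza qmaj qmin bmaj bmin" >&2; return 1; }
    for n in "$@"; do
        case $n in
            ''|*[!0-9]*) echo "not a number: $n" >&2; return 1 ;;
        esac
    done
    printf 'clubase:\n'
    printf '\tcluster_qdisk_major=%s\n' "$1"
    printf '\tcluster_qdisk_minor=%s\n' "$2"
    printf '\tcluster_seqdisk_major=%s\n' "$3"
    printf '\tcluster_seqdisk_minor=%s\n' "$4"
}
```

With the example values from above, `clubase_stanza 19 32 19 64` prints the same four attributes shown in the stanza; a non-numeric argument makes the helper refuse rather than emit a broken entry.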


###################################################################


# reboot first member into 'new' cluster
#boot -fl s


Note: you'll likely need to specify the cluster root maj/min
numbers here. This should automatically update the cnx partitions
everywhere (if you have the qdisk configured).


boot -fl is
vmunix cfs:cluster_root_dev1_maj=19 cfs:cluster_root_dev1_min=XXXX


mountroot
dn_setup -init
dsfmgr -K
dsfmgr -v # optionally -vF
hwmgr show scsi


# fix if necessary links in /etc/fdmns
mount /usr
mount /var
cdslinvchk


###############################################################


Note: You may not need to do this. If you decide to copy the
device DBs to all member boot disks (and if you can easily fix
the local device name/ID issues) this is moot.


# For all remaining members
mount root2_domain#root /mnt # Do the fdmns links match?
rm /cluster/members/member2/.Booted # For all members
rm /cluster/members/member2/etc/dfsl*
rm /cluster/members/member2/etc/cfginfo
rm -rf /cluster/members/member2/dev/[a-z]*
cd /cluster/members/member2/dev/; ./MAKEDEV std


##################################################################


Note: You may not need to do this. If you decide to copy the
device DBs to all member boot disks (and if you can easily fix
the local device name/ID issues) this is moot.


Create genesis databases
------------------------
# dsk0 should be a boot disk with a valid cnx partition (example: member1 boot disk)
clu_bdmgr -d dsk0 >/tmp/dsk0.bd
# dsk5 = member boot disk which should be created
/usr/sbin/cluster/clu_partmgr -mg /tmp/dsk0.bd dsk5


# clu_partmgr initializes the cnx partition and creates a valid genesis
# hwdb in /etc for this member
mv /etc/dec_hwc_genesis* /mnt/etc/


#at this point you have to check all relevant files, sysconfigtab,
#cnx partition etc, to be sure that all looks correct


#################################################################


# Boot second member into cluster
cd /
umount /mnt
boot -fl s # Will fail if you forgot umount
mountroot
dn_setup -init
dsfmgr -K
dsfmgr -v # optionally -vF
hwmgr show scsi


# Very important!
# Finally, you have to take down the cluster and boot it once again,
# to be sure that the newly created HW-DB is really loaded into the
# kernel.


#################################################################


#sometimes additional
---------------------
# If you have problems with cnx partition, shutdown this member to
# create a proper cnx partition


init 0


################################################################


Note: this shouldn't be necessary if booting with the
interactive boot flags specifying the cluster_root maj/min
values. Confirm with a 'clu_bdmgr -d dsk??' on all CNX
partitions to ensure they're pointing to the correct disk.


# Create proper CNX partitions
clu_bdmgr -d dsk1 >/tmp/dsk1.bd # As a backup
mount root2_domain#root /mnt
vdump 0f /tmp/root.vdmp /mnt
umount /mnt
rm -rf /etc/fdmns/root2_domain
clu_bdmgr -c dsk1 2
mount root2_domain#root /mnt
cd /mnt
vrestore xf /tmp/root.vdmp
cd /
umount /mnt


# boot this member now into cluster
###############################################################

Christof Schoeman
Frequent Advisor

Re: cfg_ksm_memreq returned null name

...and the solution.

Cleaned up all the devices that did not belong to the system (like other systems' BCVs) using hwmgr -delete scsi -did xxx, and voila.

The warning is gone and vold -k takes only a few seconds to complete. A reboot was not even required.

Thanks for all your advice.
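
For anyone repeating this cleanup with many stale devices, the delete step could be sketched as a dry-run loop like the one below. It only prints the commands in the form used above (hwmgr -delete scsi -did) so they can be reviewed first. The helper name print_delete_cmds is made up, the IDs are placeholders, and treating the device ID as a plain number is an assumption; take the real IDs from 'hwmgr show scsi' on your own system.

```shell
#!/bin/sh
# Sketch: print (not run) one hwmgr delete command per device ID.
# print_delete_cmds is a hypothetical helper; IDs are supplied by the caller.
print_delete_cmds() {
    for did in "$@"; do
        case $did in
            # assumption: device IDs are plain numbers; anything else is skipped
            ''|*[!0-9]*) echo "skipping non-numeric id: $did" >&2; continue ;;
        esac
        echo "hwmgr -delete scsi -did $did"
    done
}
```

Calling it with a list of IDs prints one command line per ID; once the list has been checked against the devices that really do not belong to the system, the output can be run by hand or piped to sh.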