HSG80 Error Code 4160 -- HELP

SOLVED
Frequent Advisor

Has anybody come across a 4160 error when trying to restart either of the controllers?

I am unable to boot from my cluster (Tru64 environment), errors indicate that the device or unit are unavailable. From the console port I try to reset the controller, but am greeted with the response:

Error 4160: Unable to rundown the following units (then a list of all my units is displayed). I have tried restarting both controllers, with the same results.

When I do a SHOW UNITS ALL, each unit's state is INOPERATIVE, with its size and geometry UNKNOWN.

I am kind of new at this and would appreciate anybody's assistance.

Thanks
Allan
34 REPLIES
Honored Contributor

Re: HSG80 Error Code 4160 -- HELP

Can you please post the output of your commands?
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

Output of the commands looks like this:

FROM THE CONTROLLER:
HSG-TOP> RESTART THIS_CONTROLLER
ERROR 4160: unable to rundown the following units on this controller:
D11 -- Cannot rundown unit
D12 -- Cannot rundown unit
and so on for all units

HSG-TOP> SHOW UNITS FULL
D11
LUN_ID...
blah, blah, blah

State:
INOPERATIVE
Unit has lost data
NOPREFERRED_PATH
WRITE_PROTECT - DATA SAFETY
Size: NOT YET KNOWN
Geometry: NOT YET KNOWN


When trying to boot my Alpha...
dga11.1002.0.9.0 has no media present or is disabled via the RUN/STOP switch.


Let me know if you need more information, and thanks for your help.

Allan
Honored Contributor

Re: HSG80 Error Code 4160 -- HELP

Hi Allan,

Can you post the output of "SHOW THIS" and "SHOW OTHER" (if you have dual controllers).

I'll take a guess that your cache batteries have died and things have got into an ambiguous state...

Cheers,

Rob
Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

Actually... almost by accident we came across a piece of documentation that suggests we have to issue the command "CLEAR {unit id} LOST_DATA" for all units that have lost data. After doing this, we are able to access the units again from the Alpha and are in the process of booting up now. This is taking a long time, so we may have other issues. I will let you guys know if this solves the problem.

I do appreciate the help, and am glad I found this forum for assistance.

Thanks
Allan
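
For anyone who has to do this across many units, here is a rough Python sketch that pulls the unit names flagged with lost data out of a saved SHOW UNITS FULL listing and prints one clear command per unit. The listing layout is assumed from the abbreviated output earlier in this thread, the sample text is made up for illustration, and `units_with_lost_data` and `clear_commands` are hypothetical helper names. The post above abbreviates the command as CLEAR; the sketch uses CLEAR_ERRORS, which is how I believe the ACS CLI reference spells it, so treat the template as an assumption and confirm it against your firmware's documentation before typing anything into the controller.

```python
import re

# Assumption: each unit starts with its name (D11, D12, ...) alone on a line.
UNIT_RE = re.compile(r"^(D\d+)\s*$")

# Assumption: command spelling; the thread quotes it as "CLEAR {unit id} LOST_DATA".
COMMAND_TEMPLATE = "CLEAR_ERRORS {unit} LOST_DATA"


def units_with_lost_data(listing: str) -> list[str]:
    """Return unit names whose block mentions lost data, in listing order."""
    flagged = []
    current = None
    for raw in listing.splitlines():
        line = raw.strip()
        match = UNIT_RE.match(line)
        if match:
            current = match.group(1)
        elif current and "lost data" in line.lower():
            if current not in flagged:
                flagged.append(current)
    return flagged


def clear_commands(units: list[str]) -> list[str]:
    """Build one command string per unit; nothing is sent anywhere."""
    return [COMMAND_TEMPLATE.format(unit=unit) for unit in units]


# Made-up sample, shaped like the abbreviated output posted above.
SAMPLE = """\
D11
  State:
    INOPERATIVE
    Unit has lost data
D12
  State:
    ONLINE
D13
  State:
    INOPERATIVE
    Unit has lost data
"""

if __name__ == "__main__":
    for command in clear_commands(units_with_lost_data(SAMPLE)):
        print(command)
```

The script only prints text, so the commands can be reviewed by eye and pasted into the console one at a time.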
Honored Contributor
Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

I saw that... looks the same (exactly).

When I boot from my seed/installation disk I can get the system up. The funny thing here is that in the sysconfigtab file, the swap device is identified as /dev/disk/dsk0b (which does not exist), but it gets by it anyway. The err 19 does not appear however.

Once I log on, I can manually mount the boot device (/dev/disk/dsk11a) and I see stuff in there. Now the etc directory is a little lean, but I may not know enough about Tru64 to know what is supposed to be there (I am a developer by trade, SA by necessity). What I see on this partition is:
.tags
etc
genvmunix
mdec
osf_boot
quota.group
quota.user
vmunix
vnunix.PrePatch

etc..
clu_bdmgr.conf
clu_recover.dat
ddr.db
ddr.dbase
dec_devsw_db
dec_devsw_db.bak
dec_hw_db
dec_hw_db.bak
dec_hwc_ldb
dec_hwc_ldb.bak
dec_scsi_db
dec_scsi_db.bak
dvrdevtab
fwevdb
gen_databases
sysconfigtab


Can anybody tell me if I am missing anything? If I have to rebuild cluster member, I think I can do that...

Thanks
Allan


Honored Contributor

Re: HSG80 Error Code 4160 -- HELP

Let's start at the beginning.

You've got a Tru64 cluster.
What version of the OS, and what version of the cluster software?
How many nodes?
Memory Channel or LAN cluster interconnect?

You've got an HSG80.
Single or redundant?
Please post the output of "show this" and "show other".
Are there any SAN switches involved?

You've got an installation disk.
Is this a local disk, not on the HSG80?
Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP


I have access to a working system (LAB) and the non-working system (FIELD). Each has 2 members. Both are on private networks, so I will have to get you the SHOW THIS and SHOW OTHER information in a few minutes (no copy/paste).

I have done some additional checking here. Remember, I am not an SA by trade, but I am the only SA we have... so I am learning as I go along. I have brought all members on both systems down to the boot prompt. I ran wwidmgr -show ev on all 4 workstations. I can confirm that on all workstations, the WWID number matched the LUN number of the corresponding unit on the HSG80 (I think this is good); however, the N1-N4 numbers do not quite behave properly. On the LAB system, N1-N4 on member1 read something like 32, 31, 34, 33 and on member2 read something like 33, 34, 31, 32 (inverse ordering). On the FIELD system, the N1-N4 numbers are in the identical order (not sure if this is bad or not).

When booting the systems, before the err 19 message, on the FIELD system I get the following (twice):

scsi0: SCSI bus as reset
cam_logger: SCSI event packet
cam_logger: bus 0
itpsa SSI HBA
SCSI bus was reset

This does not happen on LAB. I think this is not good either, but I really do not know the importance of this message.

The err 19 message is referring to domain root1_domain.

I can boot into SINGLE USER MODE and manually mount all of the filesystems, and to me they look ok.

I will get the rest of the information you asked for in a few minutes.

Thanks for your help
Allan

Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

HSG80 OUTPUT

SHOW THIS:
HSG80 {DISK SERIAL NUMBER} SOFTWARE V86G-4 HARDWARE E12
NODE_ID = 5000-1FE1-0010-4E73
ALLOCATION_CLASS = 0
SCSI_VERSION = SCSI-2
CONFIGURED FOR MULTIBUS-FAILOVER WITH {DISK} IN DUAL-REDUNDANT CONFIGURATION
DEVICE PORT SCSI ADDRESS 7
TIME...
COMMAND CONSOLE LUN IS DISABLED

HOST_PORT_1
REPORTED PORT_ID: 5000:1FE1:0010:4E73
PORT_1 TOPOLOGY - FABRIC (fabric up)
ADDRESS = 011000

HOST_PORT2
REPORTED PORT_ID: 5000:1FE1:0010:4E74
PORT_2 TOPOLOGY - FABRIC (fabric up)
ADDRESS = 011000
NOREMOTE_COPY

CACHE:
256 MB WRITE CACHE, VERSION 0022
CACHE IS GOOD
NO UNFLUSHED DATA IN CACHE
CACHE_FLUSH_TIMER = DEFAULT (10 SECONDS)

MIRRORED CACHE
NOT ENABLED

BATTERY
UPS = DATACENTER_WIDE



SHOW OTHER:
HSG80 {DISK SERIAL NUMBER} SOFTWARE V86G-4 HARDWARE E12
NODE_ID = 5000-1FE1-0010-4E70
ALLOCATION_CLASS = 0
SCSI_VERSION = SCSI-2
CONFIGURED FOR MULTIBUS-FAILOVER WITH {DISK} IN DUAL-REDUNDANT CONFIGURATION
DEVICE PORT SCSI ADDRESS 6
TIME...
COMMAND CONSOLE LUN IS DISABLED

HOST_PORT_1
REPORTED PORT_ID: 5000:1FE1:0010:4E71
PORT_1 TOPOLOGY - FABRIC (fabric up)
ADDRESS = 011100

HOST_PORT2
REPORTED PORT_ID: 5000:1FE1:0010:4E72
PORT_2 TOPOLOGY - FABRIC (fabric up)
ADDRESS = 011100
NOREMOTE_COPY

CACHE:
256 MB WRITE CACHE, VERSION 0022
CACHE IS GOOD
NO UNFLUSHED DATA IN CACHE
CACHE_FLUSH_TIMER = DEFAULT (10 SECONDS)

MIRRORED CACHE
NOT ENABLED

BATTERY
UPS = DATACENTER_WIDE




My environment:
Platform: DS20
OS: Tru64 V5.1A
TruCluster that came with OS.


Please let me know if you would like additional information.

Thanks
Allan
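
Since these listings had to be retyped by hand, a quick way to see which settings differ between the two controllers is to diff the "KEY = VALUE" lines. The Python sketch below is only an illustration: it assumes both listings have been saved as plain text, it looks only at lines containing "=", and `parse_settings` and `differing_keys` are hypothetical helper names. The sample values are a subset of what is posted above.

```python
def parse_settings(text: str) -> dict[str, str]:
    """Collect 'KEY = VALUE' lines into a dict (a repeated key keeps its last value)."""
    settings = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings


def differing_keys(this: dict[str, str], other: dict[str, str]) -> dict[str, tuple]:
    """Return keys whose values differ, or that appear on only one side."""
    keys = sorted(set(this) | set(other))
    return {
        key: (this.get(key), other.get(key))
        for key in keys
        if this.get(key) != other.get(key)
    }


# Subset of the SHOW THIS / SHOW OTHER values posted above.
THIS = """\
NODE_ID = 5000-1FE1-0010-4E73
ALLOCATION_CLASS = 0
SCSI_VERSION = SCSI-2
ADDRESS = 011000
CACHE_FLUSH_TIMER = DEFAULT (10 SECONDS)
UPS = DATACENTER_WIDE
"""

OTHER = """\
NODE_ID = 5000-1FE1-0010-4E70
ALLOCATION_CLASS = 0
SCSI_VERSION = SCSI-2
ADDRESS = 011100
CACHE_FLUSH_TIMER = DEFAULT (10 SECONDS)
UPS = DATACENTER_WIDE
"""

if __name__ == "__main__":
    diffs = differing_keys(parse_settings(THIS), parse_settings(OTHER))
    for key, (this_value, other_value) in diffs.items():
        print(f"{key}: this={this_value} other={other_value}")
```

Whether a given difference matters is a separate question; the sketch only points at the lines worth a second look.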

Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

forgot to mention
installation disk is NOT on the cluster.
2 members per system (LAB and FIELD)
same OS on each


I think that's all...
Honored Contributor

Re: HSG80 Error Code 4160 -- HELP

Both controllers report "CONFIGURED FOR MULTIBUS-FAILOVER WITH {DISK} IN DUAL-REDUNDANT CONFIGURATION".
That's good; it means they are communicating with each other.
Both report "CACHE IS GOOD" and "NO UNFLUSHED DATA IN CACHE".
Normally that's good too, but you issued the "CLEAR {unit id} LOST_DATA" command.

The order of the WWID numbers is not very important.

The SCSI bus reset is suspicious.
It looks like something is blocking one of the buses.
I would take these steps:
- power down both hosts
- shut down both HSGs
--- shutdown other
--- shutdown this
- power-cycle the disk shelves one at a time
- start one controller (push the square button/LED)
- try to boot one of the hosts
- start the second controller
- try to boot the second host

But before doing this, save the output from
"show disks", "show storagesets" and "show units", and check whether there are any failed drives.
Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

Pieter...

I am trying to get in touch with someone with physical access to the FIELD system.

I had to issue the CLEAR xx LOST_DATA only once (very early on when the disks were INOPERATIVE). I have not had to do that since.

Once I get in touch with someone on the other end, I will have them try your suggestions. I am also trying to get information on what led up to the failure and to have someone look at the rack to see if any cables look like they are disconnected. One odd thing about the HSG80: on the chassis with the batteries, on the right-hand side, there are some indicator lights and 2 push buttons. On the LAB system, the bottom pushbutton is illuminated ORANGE; in the FIELD it is not illuminated at all. I do not know what the button does or whether this is significant, but thought I would let you know.

Tks/Allan
Honored Contributor

Re: HSG80 Error Code 4160 -- HELP

Are you really talking about switches on the battery? I'll have to look that up.
The HSG itself has a similar set of switches.
Switch 1 is the reset/restart button
and has a (white?) LED.
Switches 1-6 are used in the disk "hot-swap" procedure and have amber LEDs.

I think that when SW1 is lit constantly the HSG is shut down; otherwise it should blink, indicating a heartbeat.

Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

The button I am referring to (on the working system):

As I look at my rack, there is a battery chassis. All the way to the right is a module that has what I would consider the power symbol beneath it (looks kind of like ~), actually stenciled on the rack. At the top of this module are 3 LEDs:
1. GREEN and blinking (power symbol)
2. GREEN and steady (symbol looks like the sun)
3. OFF - Triangle with ! in it (fault indicator)

Beneath these indicators is a digital readout (2 characters) that reads "--- ---".

Beneath the readout are 2 push buttons (that can be illuminated) with a line connecting them to the digital display. The bottom button is illuminated AMBER; the other is off.


In the FIELD... the amber button is NOT illuminated.


BTW: I do not see power buttons for the disk trays. In the back, I see a power cable. I also see a red switch (not a typical power switch, more like an electrical breaker, but it isn't). How would you recommend powering off the disk tray?

Tks
Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

While I was doing what I do best (nothing...), this appeared in the window I am using to monitor the HSG80:

{TIMESTAMP}--Instance Code: 02892301
Template: 18.(12)
Occurred on {TIMESTAMP}
Power On Time: 6 Years, 305 days.....
Controller Model: HSG80
Serial Number {DISK SRN} Hardware Version E12(2A)
Software Version: V86G-4(BA)
Memory Address: 20000000
Instance Code: 02892301


Does this mean anything....?

Allan
Honored Contributor

Re: HSG80 Error Code 4160 -- HELP

Hi Allan,

> Does this mean anything....?

The logging system throws out a timestamp "error" periodically, just so you know that the logging system is actually working. It's nothing to worry about...

Cheers,

Rob
Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

Thanks, I was hoping it was telling us something more than we already know.

Allan
Honored Contributor

Re: HSG80 Error Code 4160 -- HELP

Hi Allan,

> I do not see power buttons for the disk
> trays. In the back, I see a power cable. I
> also see a red switch (not a typical power
> switch, more like an electrical breaker
> (but it isn't). How would you recommend
> powering off the disk tray?

The disk trays don't have power switches. To power them off, you have to remove the power lead from each of the two Fan/PSU assemblies.

The red "switch" that you refer to sounds like it's probably the "catch" that releases the Fan/PSU from the chassis. It should slide up, so that the PSU can be pulled out.

Cheers,

Rob
Honored Contributor

Re: HSG80 Error Code 4160 -- HELP

Hi again,

I was wrong about the error message you posted previously.

http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c00309745/c00309745.pdf

According to the service guide, as above, a 02892301 is: "The cache backup battery is near end of life. The Memory Address field contains the starting physical address of the Cache A0 memory."

As I suggested at the start of the thread, the batteries are probably dead, which can cause HSG80s to do strange things. You should get them replaced asap.

Cheers,

Rob
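
If more of these events turn up in the monitoring window, a small lookup can save some page-flipping. The Python sketch below is only an illustration: it assumes the events have been captured as text in the layout posted earlier, the table holds just the one instance code quoted from the service guide in this thread, and `explain_instance_codes` is a hypothetical helper name. Any other code is reported as unknown rather than guessed at.

```python
import re

# Only the code discussed in this thread; text quoted from the service guide excerpt above.
KNOWN_INSTANCE_CODES = {
    "02892301": "The cache backup battery is near end of life.",
}

UNKNOWN = "unknown: look it up in the service guide"

# Assumption: codes are 8 hex digits following the label "Instance Code:".
INSTANCE_RE = re.compile(r"Instance Code:\s*([0-9A-Fa-f]{8})")


def explain_instance_codes(log_text: str) -> dict[str, str]:
    """Map each distinct instance code found in the text to a description."""
    explained = {}
    for code in INSTANCE_RE.findall(log_text):
        code = code.upper()
        explained.setdefault(code, KNOWN_INSTANCE_CODES.get(code, UNKNOWN))
    return explained


# Event text as posted earlier in the thread (placeholders left as posted).
SAMPLE_EVENT = """\
{TIMESTAMP}--Instance Code: 02892301
Template: 18.(12)
Occurred on {TIMESTAMP}
Controller Model: HSG80
Software Version: V86G-4(BA)
Memory Address: 20000000
Instance Code: 02892301
"""

if __name__ == "__main__":
    for code, meaning in explain_instance_codes(SAMPLE_EVENT).items():
        print(f"{code}: {meaning}")
```

The table is deliberately tiny; add entries only after checking them against the service guide.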
Honored Contributor

Re: HSG80 Error Code 4160 -- HELP

Hi Allan,
In the shelf with the three LEDs:
1) blinking is heartbeat
2) steady is power present
3) off means no fault
In that shelf you will find the fan modules for the HSG,
and you may find modules with pictograms for battery full, battery empty and a triangle.
The amber LED could mean one of the two power supplies has failed.
The HSG is back-to-back with this module shelf, so look at the other side of the cabinet where the power cords come in.

If there are no batteries in this shelf, then on the cache module of the HSG there will be a cable connected to an external battery unit.
I think you will find the latter situation, as the HSG reports "UPS = DATACENTER_WIDE".
Please check.
Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

Morning Rob/Pieter't,

Can I 'fake out' the system by running FRUTIL by responding as if I did change the battery, but not really change the battery?

As soon as the FIELD guys call me back I will have them try the power-cycle idea you guys gave me (assuming I either cannot do the FRUTIL thing, or it has no effect). On my LAB system I issued a SHUTDOWN THIS and SHUTDOWN OTHER thinking that might power off the disks (like the documentation suggested it would); it didn't. I had to physically unplug both plugs on each tray to power off the disks (hate doing this on a Friday, but what are my choices, right?). I was happy to see everything power back up properly. Should I be powering off the controllers as well? What do you think about reseating the controllers?

Gosh I hope this gets resolved this morning.

Thanks for your continued help.
Allan
Honored Contributor

Re: HSG80 Error Code 4160 -- HELP

Yes, as far as I know you can fake the battery replacement. This was done at one point by an HP technician while waiting a long time (weeks) for delivery of a new battery (the wrong battery was delivered the first time).
Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

This is getting distressing... :(

I have had the guys at FIELD do the following:

1. Power-cycle the disk trays (I shutdown both controllers first)
2. Re-seat both controllers
3. Inspect the cabling - nothing obviously out of whack.
4. Identify the SAN switch and verify that the GREEN LEDs for each connection are on.
5. Verify the lights on the power module

Booting from the cluster continues to return the same scsi bus reset messages, and the err 19 on the root1_domain (or root2_domain, depending on which member I tried to boot).

Since I have a working system here in the LAB my thoughts are to try to break it (loosen cables) to see if I can replicate the situation I have at FIELD. You guys have any other ideas I should maybe try first?

Tks
Allan
Frequent Advisor

Re: HSG80 Error Code 4160 -- HELP

oh yea... I also faked the system into thinking the battery was changed --- received a message that the battery is fully charged...
