
system crash

 
hy_3
Frequent Advisor

system crash

The OS is V4.0D and the machine is an Alpha 1200. On May 17 the machine crashed suddenly. Please help me analyze the cause. Thank you. I have attached the messages file.
14 REPLIES
Mobeen_1
Esteemed Contributor

Re: system crash

hy,
Do you have the crash dump? Some of the things you can do are:

1. From the console prompt, run commands like
show ????
Use help if you need to see your options.
Look out for any environment errors etc.

2. After the system is up, you can always
look at your errorlog and dump
(if present) using the stack dump anal
ANAL/SYS

regards
Mobeen
Nicolas Dumeige
Esteemed Contributor

Re: system crash

Hello,

It looks like there's a problem with the metadata for /oracle.

Cheers

Nicolas
All different, all Unix
Mobeen_1
Esteemed Contributor

Re: system crash

Hy,
I am sorry, I did not realise that there was an attachment. As our friend has highlighted, it looks like the following are the problem areas:

bs_osf_complete: metadata write failed
Oct 22 13:48:42 sjbdaa vmunix: AdvFS Domain Panic; Domain oracle_domain Id 0x36f305f6.0004e6f2
Oct 22 13:48:42 sjbdaa vmunix: An AdvFS domain panic has occurred due to either a metadata write error or an internal inconsistency. This domain is being rendered inaccessible.
Oct 22 13:48:42 sjbdaa vmunix: Please refer to guidelines in AdvFS Guide to File System Administration regarding what steps to take to recover this domain.
Oct 22 13:50:42 sjbdaa vmunix: AdvFS I/O error:
Oct 22 13:50:42 sjbdaa vmunix: Volume: /dev/rzb128d
Oct 22 13:50:42 sjbdaa vmunix: Tag: 0xfffffff7.0000
Oct 22 13:50:42 sjbdaa vmunix: Page: 175
Oct 22 13:50:42 sjbdaa vmunix: Block: 3952
Oct 22 13:50:42 sjbdaa vmunix: Block count: 16
Oct 22 13:50:42 sjbdaa vmunix: Type of operation: Write
Oct 22 13:50:42 sjbdaa vmunix: Error: 5
Oct 22 13:50:42 sjbdaa vmunix:
Oct 22 13:50:42 sjbdaa vmunix: bs_osf_complete: metadata write failed
Oct 22 13:50:42 sjbdaa vmunix: AdvFS Domain Panic; Domain dbf_domain Id 0x36f30612.000011f7
Oct 22 13:50:42 sjbdaa vmunix: An AdvFS domain panic has occurred due to either a metadata write error or an internal inconsistency. This domain is being rendered inaccessible.
Oct 22 13:50:42 sjbdaa vmunix: Please refer to guidelines in AdvFS Guide to File System Administration regarding what steps to take to recover this domain.
Oct 24 22:51:06 sjbdaa vmunix:

regards
Mobeen
Mobeen_1
Esteemed Contributor

Re: system crash

hy,
Looking at the timestamps of the errors I posted in my previous message, they look pretty old, dating back as far as October.

Are we missing something?

regards
Mobeen
Michael Schulte zur Sur
Honored Contributor

Re: system crash

May 17 11:50:35 sjbdaa vmunix: AdvFS I/O error:
May 17 11:50:35 sjbdaa vmunix: Volume: /dev/rzb128f
May 17 11:50:35 sjbdaa vmunix: Tag: 0xfffffff7.0000
May 17 11:50:35 sjbdaa vmunix: Page: 510
May 17 11:50:35 sjbdaa vmunix: Block: 8784
May 17 11:50:35 sjbdaa vmunix: Block count: 32
May 17 11:50:35 sjbdaa vmunix: Type of operation: Write
May 17 11:50:35 sjbdaa vmunix: Error: 5

Hi,

It seems you have a problem with your HSZ50.
Can you post the output of
show this full
show other (if redundant)
show disk full
show mirror
show fail

thanks,

Michael
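
For anyone sorting through a long messages file, the AdvFS I/O error records quoted in this thread can be grouped by date and volume, which makes it easier to separate the old incident (Oct 22) from the one that matches the crash (May 17). Below is a minimal Python sketch, assuming the syslog layout shown in the excerpts above. The function names are hypothetical, and the sample lines are copied from the posts in this thread.

```python
import re
from collections import Counter

# Sample lines copied from the messages excerpts quoted in this thread.
SAMPLE = """\
Oct 22 13:50:42 sjbdaa vmunix: AdvFS I/O error:
Oct 22 13:50:42 sjbdaa vmunix: Volume: /dev/rzb128d
Oct 22 13:50:42 sjbdaa vmunix: Tag: 0xfffffff7.0000
Oct 22 13:50:42 sjbdaa vmunix: Page: 175
Oct 22 13:50:42 sjbdaa vmunix: Block: 3952
Oct 22 13:50:42 sjbdaa vmunix: Block count: 16
Oct 22 13:50:42 sjbdaa vmunix: Type of operation: Write
Oct 22 13:50:42 sjbdaa vmunix: Error: 5
Oct 22 13:50:42 sjbdaa vmunix:
Oct 22 13:50:42 sjbdaa vmunix: bs_osf_complete: metadata write failed
May 17 11:50:35 sjbdaa vmunix: AdvFS I/O error:
May 17 11:50:35 sjbdaa vmunix: Volume: /dev/rzb128f
May 17 11:50:35 sjbdaa vmunix: Tag: 0xfffffff7.0000
May 17 11:50:35 sjbdaa vmunix: Page: 510
May 17 11:50:35 sjbdaa vmunix: Block: 8784
May 17 11:50:35 sjbdaa vmunix: Block count: 32
May 17 11:50:35 sjbdaa vmunix: Type of operation: Write
May 17 11:50:35 sjbdaa vmunix: Error: 5
"""

# "<Mon> <day> <hh:mm:ss> <host> vmunix: <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+) (\d\d:\d\d:\d\d) (\S+) vmunix: ?(.*)$")


def parse_advfs_io_errors(text):
    """Return one dict per 'AdvFS I/O error:' record found in text."""
    records = []
    current = None
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        date, time, host, msg = m.groups()
        msg = msg.strip()
        if msg == "AdvFS I/O error:":
            # Header line: start a new record.
            current = {"date": date, "time": time, "host": host}
            records.append(current)
        elif current is not None and ": " in msg:
            # Detail line such as "Volume: /dev/rzb128d".
            key, value = msg.split(": ", 1)
            current[key.strip()] = value.strip()
            if key.strip() == "Error":
                # "Error:" is the last field of a record in these excerpts.
                current = None
    return records


def count_by_date_and_volume(records):
    """Count records per (date, volume) pair."""
    return Counter((r["date"], r.get("Volume", "?")) for r in records)


records = parse_advfs_io_errors(SAMPLE)
for (date, volume), n in count_by_date_and_volume(records).items():
    print(date, volume, n)
```

To run it against the real file, replace SAMPLE with the contents of the messages file. Syslog lines carry no year, so records from different years on the same calendar date would be counted together.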
hy_3
Frequent Advisor

Re: system crash

Thank you Mobeen, Nicolas and Michael. The system is now running and the hsz50 is in use. Can "show this full", "show other (if redundant)", "show disk full", "show mirror" and "show fail" be run while the hsz50 is in use? And do you need me to provide any other messages? Thank you.

Mobeen_1
Esteemed Contributor

Re: system crash

Hy,
No problem. You can use all the show commands that were requested without any issues on a running system. Trust me on that :-). Go ahead and give us the details.

regards
Mobeen
Michael Schulte zur Sur
Honored Contributor

Re: system crash

Hi,

You can use the controller port, SWCC, or hszterm.

Michael
hy_3
Frequent Advisor

Re: system crash

Thank you. I will try.
Yong_7
Frequent Advisor

Re: system crash

Hi HY,

I can't access your attachment at this time (due to the company firewall masterwork).

From the other gurus' posts, I think you have got the point that the problem is with an AdvFS domain.

The reason to look at the controller configuration is to find out whether you have a RAID set under that oracle_domain, so that you can pinpoint the specific physical disk or disks.

You may have a look at the /var/adm/binary.errlog file with
# uerf -R -o full | more

Any findings about bad spots on the hard drive? Or is it just a simple AdvFS domain panic? Or something else?

In any case, have a look at the AdvFS admin manual. "# verify" is the thing we should run, plus have a look at salvage with
"# man salvage" (you may need to fix the hard drive if that's the root cause).

4.0D is not supported by the vendor. Just make sure to keep the patch level up to date.

http://www1.itrc.hp.com/service/patch/search.do?pageContextName=tru%3A%3A&admit=-682735245+1084892897444+28353475

regards !

YJ
hy_3
Frequent Advisor

Re: system crash

#uerf -R -o full | more
uerf version 4.2-011 (122)


********************************* ENTRY 1. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 199. CAM SCSI
SEQUENCE NUMBER 5149.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Wed May 19 15:22:49 2004
OCCURRED ON SYSTEM sjbdaa
SYSTEM ID x00070016
SYSTYPE x00000000
PROCESSOR COUNT 2.
PROCESSOR WHO LOGGED x00000000

----- UNIT INFORMATION -----

CLASS x001F UNKNOWN
SUBSYSTEM x0000 DISK
BUS # x0010
x043F LUN x7

TARGET x7

----- CAM STRING -----

ROUTINE NAME targ_send_comp

----- CAM STRING -----

Target SEND failed

----- CAM STRING -----

ERROR TYPE Soft Error Detected (recovered)

----- CAM STRING -----

Active CCB at time of error
ERROR - os_std, os_type = 11, std_type = 10


----- ENT_CCB_SCSIIO -----
*MY ADDR xEFE37580
CCB LENGTH x00C0
FUNC CODE x01
CAM_STATUS x0013 CAM_UNEXP_BUSFREE
PATH ID 16.
TARGET ID 7.
TARGET LUN 7.
CAM FLAGS x00001480
CAM_DIR_OUT
CAM_SIM_QFRZDIS
CAM_SIM_QHEAD
*PDRV_PTR xEFE37228
*NEXT_CCB x00000000
*REQ_MAP x00000000
VOID (*CAM_CBFCNP)() x00674B58
*DATA_PTR x07A5FD60
DXFER_LEN x0000008C
*SENSE_PTR xEFE37250
SENSE_LEN xA4
CDB_LEN x06
SGLIST_CNT x0000
CAM_SCSI_STATUS x0000 SCSI_STAT_GOOD
SENSE_RESID x00
RESID x00000000
CAM_CDB_IO x000000000000018C0000E00A
CAM_TIMEOUT x00000005
MSGB_LEN x0000
VU_FLAGS x0000
TAG_ACTION x00

----- ENT_SENSE_DATA -----

ERROR CODE x0000 CODE x0
SEGMENT x00
SENSE KEY x0000 NO SENSE
INFO BYTE 3 x00
INFO BYTE 2 x00
INFO BYTE 1 x00
INFO BYTE 0 x00
ADDITION LEN x00
CMD SPECIFIC 3 x00
CMD SPECIFIC 2 x00
CMD SPECIFIC 1 x00
CMD SPECIFIC 0 x00
ASC x00
ASQ x00
FRU x00
SENSE SPECIFIC x000000
ADDITIONAL SENSE
0000: 00000000 00000000 00000000 00000000 *................*
0010: 00000000 00000000 00000000 00000000 *................*
0020: 00000000 00000000 00000000 00000000 *................*
0030: 00000000 00000000 00000000 00000000 *................*
0040: 00000000 00000000 00000000 00000000 *................*
0050: 00000000 00000000 00000000 00000000 *................*
0060: 00000000 00000000 00000000 00000000 *................*
0070: 00000000 00000000 00000000 00000000 *................*
0080: 00000000 00000000 00000000 00000000 *................*
0090: 00000000 00000000 7E250000 00005E3C *..........%~<^..*
00A0: 00000000 *.... *
The messages above repeat every 10 seconds. Please help me analyze them. Thank you.
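
When uerf prints the same entry over and over, it can help to reduce each entry to a few fields before reading the full dumps. Below is a minimal Python sketch, assuming the uerf text layout shown in this post with whitespace collapsed as it appears here. The function name and the choice of fields are hypothetical, and the sample lines are copied from the entry above.

```python
import re
from datetime import datetime

# Selected lines copied from the uerf entry quoted in this post.
SAMPLE = """\
********************************* ENTRY 1. *********************************
OCCURRED/LOGGED ON Wed May 19 15:22:49 2004
ROUTINE NAME targ_send_comp
ERROR TYPE Soft Error Detected (recovered)
CAM_STATUS x0013 CAM_UNEXP_BUSFREE
CAM_SCSI_STATUS x0000 SCSI_STAT_GOOD
"""

ENTRY_RE = re.compile(r"^\*+\s*ENTRY\s+(\d+)\.\s*\*+$")
WHEN_RE = re.compile(r"^OCCURRED/LOGGED ON\s+(.+)$")
TYPE_RE = re.compile(r"^ERROR TYPE\s+(.+)$")
STATUS_RE = re.compile(r"^CAM_STATUS\s+(x[0-9A-Fa-f]+)\s+(\S+)$")


def summarize_uerf(text):
    """Return one dict per uerf entry with timestamp, error type and CAM status."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = ENTRY_RE.match(line)
        if m:
            # Entry banner: start a new summary record.
            current = {"entry": int(m.group(1))}
            entries.append(current)
            continue
        if current is None:
            continue
        m = WHEN_RE.match(line)
        if m:
            # Timestamp format as printed by uerf in this post.
            current["when"] = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
            continue
        m = TYPE_RE.match(line)
        if m:
            current["error_type"] = m.group(1)
            continue
        m = STATUS_RE.match(line)
        if m:
            current["cam_status"] = m.group(1)
            current["cam_status_name"] = m.group(2)
    return entries


for e in summarize_uerf(SAMPLE):
    print(e["entry"], e.get("when"), e.get("cam_status_name"), e.get("error_type"))
```

With the full uerf output as input, the differences between consecutive "when" values would show whether the entries really arrive every 10 seconds.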


Michael Schulte zur Sur
Honored Contributor

Re: system crash

Hi,

do you have decevent on your machine?
If so, post the relevant part from
dia -R | more
It is more precise than uerf.

thanks,

Michael
Yong_7
Frequent Advisor

Re: system crash

Hi hy,

The message in your last post indicates a bad block replacement sequence, and the system was able to heal itself. If you have many of them, this could be a sign that a hard drive needs replacing. I also noticed the Alpha 1200 has had a loooong life, and so has your storage, I guess.

DECevent may help more if you paid for it;
uerf is universal. Now CA is in charge for all.

Anyway, you still need to address that AdvFS domain first. That's the key.

Good Luck !

YJ
Ralf Puchner
Honored Contributor

Re: system crash

The errors in the messages file indicate an AdvFS write problem. If a check of the disks involved indicates a hardware problem, please replace the disks.

If you need the data in the domains, use verify or salvage to try to repair the domains. Please read the admin guide first; it explains how to use the tools. Be sure a backup exists in case all else fails....

Help() { FirstReadManual(urgently); Go_to_it;; }