Re: IBM SAN disk generate an error every hour to the second

Clark Powell · ‎01-27-2010

I get five errors recorded from this IBM SAN disk every hour, at exactly 21 minutes and 9 seconds past the hour. I'm running OpenVMS 8.3 with update patch 12. Can anybody guess what happens at exactly x:21:09 every hour that makes the disk generate an error? This particular disk was unpresented last night, but it is not being accessed by anything. Not surprisingly, when I go to HP they point at IBM, and when I go to IBM they point at HP. Below is a sample error collected with WSEA.

1490. Source: HostEventLog@alphaz.internal.vmmc.org:7920
Event: VMS Device Error Event occurred at Wed 27 Jan 2010 08:21:09 GMT-08:00

COMMON EVENT HEADER (CEH) V2.0
Event_Leader xFFFF FFFE
Header_Length 284
Event_Length 720
Header_Rev_Major 2
Header_Rev_Minor 1
OS_Type 2 -- OpenVMS
Hardware_Arch 4 -- Alpha
CEH_Vendor_ID 3,564 -- Hewlett-Packard Company
Hdwr_Sys_Type 38 -- Titan Corelogic
Logging_CPU 2 -- CPU Logging this Event
CPUs_In_Active_Set 4
Major_Class 1
Minor_Class 1
Entry_Type 1,001 -- VMS Device Error Event
DSR_Msg_Num 1,978 -- AlphaServer ES45
.... Model 2/2B
.... CPU Slots: 4 (1000 Mhz)
.... PCI Slots: 10
.... MMB Slots: 8 (DIMMs)
Chip_Type 12 -- EV68CB - 21264C
CEH_Device 54
CEH_Device_ID_0 x0000 0000
CEH_Device_ID_1 x0000 0000
CEH_Device_ID_2 x0000 0000
Unique_ID_Count 2,502
Unique_ID_Prefix 0
Exact_Length 422
Num_Strings 6

TLV Section of CEH
TLV_DSR_String AlphaServer ES45 Model 2
TLV_DDR_String IBM 2145
TLV_Sys_Serial_Num 4150JSPZA261
TLV_Time_as_Local Wed 27 Jan 2010 08:21:09 GMT-08:00
TLV_OS_Version V8.3
TLV_Computer_Name ALPHAZ

Entry_Type 1,001

EMB_Block_Disk
emb_d_ertcnt 16
emb_d_ertmax 16
emb_d_iosb 0
emb_d_sts x1800 0010
emb_d_class 1 Disk Class
emb_d_type 54
emb_d_rqpid 2,178,478,024
emb_d_boff 0
emb_d_bcnt 0
emb_d_media 0
emb_d_unit 7,498
emb_d_errcnt 86 Error Count
emb_d_opcnt 45
emb_d_ownuic x0001 0004
emb_d_char x1C45 5808
emb_d_Device_Number 0
emb_d_func 4
emb_d_name_len 6
emb_d_name $1$DGA
emb_d_dtname_len 8
emb_d_dtname IBM 2145

Generic DK Driver Header Revision 3 HW_Revision 0000
DK_Error_Type x05 Extended Sense Data from Device
DK3_SCSI_ID x0000 0000 0000 0004
DK3_SCSI_LUN x0000 0000 0000 2500
DK_Port_Status x0000 0001 Normal Successful Completion
DK3_SCSI_CMD_Len 6
DK3_SCSI_CMD Dump starting at offset: x1b5
[x0] x0
[x1] x0
[x2] x0
[x3] x0
[x4] x0
[x5] x0

SCSI_Status 2 Check Condition
DK3_Additional_Data_Len 24

DK3_Additional_Data DK3_AdditionalData Dump starting at offset: x1bd
[x0] x10000025000500f0
[x10] x000000000000000
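
Before chasing the cause, it can help to confirm that every logged event really falls on the same minute and second. Below is a minimal Python sketch, assuming the WSEA report has been saved as text and that each event line looks like the "Event: ... occurred at ..." line above. The function name `offsets_within_hour` and the second and third sample lines are made up for illustration; only the first sample line comes from the post.

```python
import re
from collections import Counter

# Matches the time of day in a WSEA line such as
# "Event: VMS Device Error Event occurred at Wed 27 Jan 2010 08:21:09 GMT-08:00"
TIME_RE = re.compile(r"occurred at .*?(\d{2}):(\d{2}):(\d{2}) GMT")


def offsets_within_hour(lines):
    """Count events by their (minute, second) offset within the hour."""
    counts = Counter()
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            _hour, minute, second = (int(g) for g in match.groups())
            counts[(minute, second)] += 1
    return counts


# Hypothetical input: the first line is from the post, the rest are invented.
sample = [
    "Event: VMS Device Error Event occurred at Wed 27 Jan 2010 08:21:09 GMT-08:00",
    "Event: VMS Device Error Event occurred at Wed 27 Jan 2010 09:21:09 GMT-08:00",
    "Event: VMS Device Error Event occurred at Wed 27 Jan 2010 10:21:09 GMT-08:00",
]
counts = offsets_within_hour(sample)
print(counts.most_common(1))
```

If a single (minute, second) pair accounts for all the events, the trigger is most likely something on a fixed one-hour timer rather than ordinary I/O.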

Volker Halle · ‎01-27-2010

Clark,

can you try to diagnose ERRLOG.SYS with DECevent ?

I assume the key information is the SCSI Check Condition data, which should contain the ASC/ASCQ information. You may be able to take that back to IBM to find out why their disk is reporting it.

Volker.
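
For reference, in fixed-format SCSI sense data the sense key is the low nibble of byte 2, and the ASC and ASCQ are bytes 12 and 13. The following is a rough Python sketch of pulling those fields out of a raw buffer. The names `Sense`, `decode_fixed_sense` and `describe` are hypothetical, the lookup tables are tiny excerpts rather than the full code lists, and the example buffer is invented for illustration; it is not the data from the error log entry above.

```python
from typing import NamedTuple


class Sense(NamedTuple):
    valid: bool
    response_code: int
    sense_key: int
    asc: int
    ascq: int


# Small excerpts only; the full lists are in the SCSI standards.
SENSE_KEYS = {0x5: "ILLEGAL REQUEST", 0x6: "UNIT ATTENTION"}
ASC_ASCQ = {(0x3F, 0x0E): "REPORTED LUNS DATA HAS CHANGED"}


def decode_fixed_sense(data: bytes) -> Sense:
    """Decode the key fields of fixed-format sense data (response code 70h/71h)."""
    if len(data) < 14:
        raise ValueError("need at least 14 bytes to reach ASC/ASCQ")
    response_code = data[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError(f"not fixed-format sense data: 0x{response_code:02x}")
    return Sense(
        valid=bool(data[0] & 0x80),   # bit 7 of byte 0
        response_code=response_code,
        sense_key=data[2] & 0x0F,     # low nibble of byte 2
        asc=data[12],
        ascq=data[13],
    )


def describe(s: Sense) -> str:
    key = SENSE_KEYS.get(s.sense_key, "unknown")
    text = ASC_ASCQ.get((s.asc, s.ascq), "not in this excerpt")
    return f"sense key 0x{s.sense_key:X} ({key}), ASC/ASCQ {s.asc:02X}/{s.ascq:02X}: {text}"


# Invented 18-byte buffer, for illustration only.
example = bytes.fromhex("70 00 06 00 00 00 00 0a 00 00 00 00 3f 0e 00 00 00 00")
decoded = decode_fixed_sense(example)
print(describe(decoded))
```

The byte order in an error log dump may differ from the on-the-wire order (quadwords are often printed most significant byte first), so check that before feeding a dump into something like this.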

Robert Brooks_1 · ‎01-27-2010

Hmmm. This might be the "error flush" stuff that was added a few years ago.

Post the output from SDA DKLOG and we'll see what we find.

To use the DKLOG facility in SDA, use these commands:

(Do this some time close to when you expect the error to happen)
SDA> DKLOG START $1$DGAn:

[the error is logged]

SDA> DKLOG STOP $1$DGAn:
SDA> DKLOG SHOW $1$DGAn:

Prior to using the DKLOG SHOW command,
you can redirect the output to a file
using the SDA> SET OUTPUT command

-- Rob

Bob Blunt · ‎01-27-2010

Clark, does your IBM controller report any specific error occurring at that moment? WSEA and DECevent will TRY to convert the error bits to text but you're dealing with what probably is NOT a simple one-to-one physical device like a direct connected SCSI disk. SAN disks are typically striped, RAIDed, scattered, smothered, chunked and peppered across multiple spindles. The controller itself better know more about specific errors than the host in a SAN environment.

Rob's recommendations will likely allow a more detailed analysis, and what WSEA shows (extended sense data from device), from my perspective, agrees with Rob's suspicion. DECevent might decode this particular error better than WSEA; it might be somewhat more readable anyway. I'd still watch the controller to make sure it's not whining about a problem, just to be sure.

Michael Lothrop · ‎01-29-2010

If "unpresented" means what I think it does, then maybe SYSMAN IO SCSI_PATH_VERIFY will remove it from being online. That worked for ours that were no longer zoned to the servers.

http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1390959

mike

Clark Powell · ‎02-03-2010

Here's an update on this problem. An HP engineer analyzed the data for us and came up with this:
On this web site it lists the 3F0E ASC/ASCQ code as:

3F/0E DTLPWROMAE REPORTED LUNS DATA HAS CHANGED

Basically this means that the IBM SAN-based storage controller is informing OpenVMS that it has new LUNs (disk devices) to report (present to OpenVMS). Normally you would only see this message once when querying the controller about new LUNs. Apparently in the IBM implementation, this information is presented once per LUN rather than once per controller (target)...

The moral of this story is: if you can afford HP storage for your OpenVMS system (and we can't right now), then be sure to use HP storage. Other storage will probably work in a production environment, but you will always be fighting things like these 80 informational error messages whenever we create a new disk, or the inability to expand disks on the fly, not to mention the third-party vendor not taking responsibility for its claims of compatibility.

thanks
Clark Powell
