Operating System - OpenVMS

Trying to simulate a hardware error on OpenVMS server

Tom Wolf_3
Valued Contributor

Hello VMS experts.
We are using the HP Operations Manager for Windows VMS Smart Plug In (SPI) to monitor hardware on our VMS servers.
I'm trying to confirm the SPI works by increasing the error count on one of the devices - a tape drive.
Is anyone aware of a command I can run on a VMS server to increase the error count for a device?

I know "set device/reset=(error,operation)" can be used to clear the error count, so I'm hoping there's something similar to increase it so I can simulate a hardware issue on our VMS server's tape drive.

Any assistance would be greatly appreciated.

Thanks in advance.

Tom Wolf
7 REPLIES
Hein van den Heuvel
Honored Contributor

Re: Trying to simulate a hardware error on OpenVMS server

Use the 'Logical Disk' (LD) driver, which also has 'Logical Magtape' (LM) support:

http://www.digiater.nl/lddriver.html#LD%20V9.4

I know it can inject errors for the LDdriver, and I assume it can do so for the LMdriver.

hth,
Hein

Hoff
Honored Contributor

Re: Trying to simulate a hardware error on OpenVMS server

Reposting. ITRC is acting like, well, ITRC. Apologies on any duplicate postings.

You could use zdec:

http://www.decuslib.com/decus/vmslt02a/vu/zdec-src.txt

or likely better, clear_errors:

http://www.decuslib.com/decus/freewarev80/clear_errors

Or use SDA on the console to locate the error count for a device in virtual address space, then halt the box and "bomb core" (err, deposit the new value) and continue from the SRM console.

Or use the SDA data to generate a targeted version of this brute-force tool:

http://labs.hoffmanlabs.com/node/815

Or briefly pull the Ethernet connection and plug it back in.

Or load an older magtape and perform a BACKUP.

All of these assume you have a testing server. While unlikely to crash, I would not suggest any of these on a production server.

Some of these (such as the halt-continue) are specific to (most) Alpha boxes and will not operate on Integrity.
Volker Halle
Honored Contributor

Re: Trying to simulate a hardware error on OpenVMS server

Tom,

do you know how this piece of software monitors 'hardware errors' on OpenVMS?

Just by looking at the device error counts? Or maybe by watching the ERRLOG.SYS file or declaring an error log mailbox?

There is no OpenVMS command to increase the error count on a device.

Using LD (or LM), you can induce QIO errors, but I doubt that you can increase the error count of LD (or LM) devices.

Would this software monitor the link state change of a LAN interface? Maybe briefly unplug one of the LAN cables?

Volker.
Hoff
Honored Contributor

Re: Trying to simulate a hardware error on OpenVMS server

Clarification: using zdec or clear_errors or the $cmkrnl stuff as the basis for increasing the displayed error count.

I'd hope that this SPI widget tapped into the OpenVMS error reporting mechanisms and the system service API for that, but I'd tend to assume not. That the tool polls the error count displays is more likely.
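
If the tool does poll the displayed error counts, the check it performs probably amounts to something like the sketch below. This is only an illustration of the polling idea, not how the SPI is actually implemented: the function names `parse_error_counts` and `increased` are hypothetical, and the sample lines simply follow the SHOW DEVICE display format seen in this thread. It assumes the error count is the first all-digit field after the device name and status words.

```python
# Hypothetical sketch: compare error counts from two SHOW DEVICE displays.
# Not the SPI's real logic; names and parsing rules are assumptions.

BEFORE = """\
$12$DKA0: (THEALP) Mounted 0 THEALP_V83 1773616 368 1
$12$DKA1: (THEALP) Online 0
$12$DKA400: (THEALP) Online wrtlck 0
"""

AFTER = """\
$12$DKA0: (THEALP) Mounted 16 THEALP_V83 1773616 368 1
$12$DKA1: (THEALP) Online 0
$12$DKA400: (THEALP) Online wrtlck 0
"""


def parse_error_counts(show_device_text):
    """Return {device_name: error_count} from SHOW DEVICE style text."""
    counts = {}
    for line in show_device_text.splitlines():
        fields = line.split()
        # Device lines start with a name ending in ':'; skip headers and blanks.
        if not fields or not fields[0].endswith(":"):
            continue
        # Assumption: the first purely numeric field is the error count.
        for token in fields[1:]:
            if token.isdigit():
                counts[fields[0].rstrip(":")] = int(token)
                break
    return counts


def increased(before, after):
    """Return {device: (old, new)} for devices whose count went up."""
    return {
        dev: (before.get(dev, 0), new)
        for dev, new in after.items()
        if new > before.get(dev, 0)
    }


if __name__ == "__main__":
    changes = increased(parse_error_counts(BEFORE), parse_error_counts(AFTER))
    for dev, (old, new) in changes.items():
        print(f"{dev}: error count rose from {old} to {new}")
```

A monitor built this way would only ever see the counter itself, which is why bumping the displayed count should be enough to trigger it.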
Jur van der Burg
Respected Contributor
Solution

Re: Trying to simulate a hardware error on OpenVMS server

Well, LD can induce errors, but it does not change the error count on the device. I may add that as a feature though.

LM does not allow one to induce an error (yet).

If you want to set an arbitrary count you can do this:

thealp> sh dev dk

Device                  Device           Error    Volume         Free  Trans Mnt
 Name                   Status           Count     Label        Blocks Count Cnt
$12$DKA0:     (THEALP)  Mounted              0  THEALP_V83     1773616   368   1
$12$DKA1:     (THEALP)  Online               0
$12$DKA3:     (THEALP)  Online               0
$12$DKA400:   (THEALP)  Online wrtlck        0
thealp> ana/sys

OpenVMS system analyzer

SDA> sh dev dka0

$12$DKA0 [THEALP$DKA0] RZ28 UCB: 81C7A780

Device status: 18021810 online,valid,unload,lcl_valid,exfunc_supp,fast_path
Characteristics: 1C4D4008 dir,fod,shr,avl,mnt,elg,idv,odv,rnd
01010201 clu,nnm,nlt,scsi
SUD Status 00000001 path_available
DK Flags 1430401A first_attn_seen,disconnect,synchronous,hbs_check,port_cmdq,cmdq,port_autosense,clusq
DK Flags 2 00000030 sectors_via_ms,trk_cyl_via_ms

Owner UIC [000001,000004] Operation count 6010 ORB address 81C7ACC0
PID 00000000 Error count 0 DDB address 81C7A580
Alloc. lock ID 0100007D Reference count 129 DDT address 81992DA0
Alloc. class 12 Online count 1 SUD address 81C7ABC0
Class/Type 01/80 Retry cnt/max 16/16 VCB address 81CB9740
Def. buf. size 512 BOFF 00000A00 CRB address 81C7A600
DEVDEPEND 0A231063 Byte count 00000200 I/O wait queue 81C7A838
DEVDEPND2 00000000 SVAPTE FFDFC148
DEVDEPND3 01000001 DEVSTS 00000004
FLCK index 3A
DLCK address 81C7A680
Preferred CPUDB 81C6B680
Preferred CPUID 001

-- Device Path Information --

UCB: 81C7A780 Path: PKA0.0

*** PORT I/O queue is empty ***

*** DEVICE I/O queue is empty ***


*** I/O request queue is empty ***
Press RETURN for more.
SDA> ev ucb+ucb$l_errcnt
Hex = FFFFFFFF.81C7A898 Decimal = -2117621608 UCB+00118
SDA> Exit
thealp> r sys$share:delta
OpenVMS Alpha DELTA Debugger

Exit 00000001

80088F18! LDQ R28,#X0008(SP) 1;m
00000001
10001:FFFFFFFF81C7A898/00000000 10

exit
thealp> sh dev dk

Device                  Device           Error    Volume         Free  Trans Mnt
 Name                   Status           Count     Label        Blocks Count Cnt
$12$DKA0:     (THEALP)  Mounted             16  THEALP_V83     1773616   368   1
$12$DKA1:     (THEALP)  Online               0
$12$DKA3:     (THEALP)  Online               0
$12$DKA400:   (THEALP)  Online wrtlck        0

So, at the delta prompt enter this:

1;m
10001:ffffffff81c7a898/

then enter the new value followed by a return.
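
For anyone adapting this to another device, the arithmetic behind the address can be checked with a small sketch. It only reproduces the numbers from the session above: the UCB address 81C7A780 and the offset 00118 are what SDA reported on this particular system, so treat them as examples and take both from SDA on your own box. It also shows why depositing 10 produced an error count of 16: the value is entered in hexadecimal. The helper name `sign_extend_32` is made up for illustration.

```python
# Sketch of the address arithmetic from the SDA/DELTA session above.
# Values are copied from that session and will differ on other systems.

UCB_ADDRESS = 0x81C7A780   # "UCB: 81C7A780" from SDA> sh dev dka0
ERRCNT_OFFSET = 0x118      # "UCB+00118" from SDA> ev ucb+ucb$l_errcnt


def sign_extend_32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


# 32-bit system-space address of the error count field.
field_32 = (UCB_ADDRESS + ERRCNT_OFFSET) & 0xFFFFFFFF

# Sign-extending to 64 bits is why SDA shows the FFFFFFFF prefix.
signed_value = sign_extend_32(field_32)
field_64 = signed_value & 0xFFFFFFFFFFFFFFFF

print(f"{field_64:016X}")   # FFFFFFFF81C7A898, the address typed into DELTA
print(signed_value)         # -2117621608, the decimal value SDA printed
print(int("10", 16))        # 16, the count SHOW DEVICE displayed afterwards
```

So to get a displayed count of, say, 100, you would deposit 64 rather than 100.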

And don't try it on a production system unless you really know what you are doing.

Jur.

MichelleP_1
Advisor

Re: Trying to simulate a hardware error on OpenVMS server

>>>>>>>>>>
Volker Halle:
Or maybe by watching the ERRLOG.SYS file or declaring an error log mailbox ?
<<<<<<<<<<

If it watches ERRLOG.SYS, as does WEBES (SEA) via ELMC, install ELMC and use the ELMC test. Both can be downloaded from http://www.compaq.com/support/svctools/webes/webesdownloads.html

HTH,
Michelle
Art Wiens
Respected Contributor

Re: Trying to simulate a hardware error on OpenVMS server

There are many things the SPI can monitor. If you're interested in seeing if "anything" is being reported and a little less invasive than causing a hardware "problem", have a look at SYS$COMMON:[SYSMGR]VMSSPI$CONFIGURATION.DAT and see what it found during installation. The easiest would be to stop one of the queues it's monitoring. Changing the thresholds for disk space is also easy to do and doesn't actually cause any harm. Invoking Intrusion Detection (hammer a bad password 3 or 4 times) should generate a "Critical" message.

I can confirm that the VMS SPI works very well (in our environment ... VMS v8.3 on ES47s) and has notified on hardware events such as a network switch rebooting (the Ethernet device error count went up). If you use Volume Shadowing (and the devices are set up in the config file), it "notices" when shadow membership is reduced ... I use it to tell that the daily backups are proceeding, because the notifications of missing shadow disks get automatically acknowledged and are removed from the display.

Cheers,
Art