
Re: CPU error

 
Sk Noorul  Hassan
Regular Advisor

CPU error

Hi,

On one of my VAX servers, the "$ SHOW ERROR" command shows an error count of 1 against device "CPU".

Do we need to replace the CPU board? Please suggest.
8 REPLIES
Ian Miller.
Honored Contributor

Re: CPU error

It may be a corrected error in the on-board cache, or something else. Have a look at the error log:

ANAL/ERROR/INCLUDE=CPU/SINCE=XX-XXX-XXXX
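
The XX-XXX-XXXX placeholder above is a VMS absolute date (DD-MMM-YYYY). As a rough sketch only, here is one way to build the command string off-box, for example on a workstation that prepares commands to paste into a terminal session. The function names are hypothetical, the command is spelled out as ANALYZE/ERROR_LOG, and zero-padded days are assumed to be accepted.

```python
from datetime import date

# Month abbreviations are hard-coded so the result does not depend on locale.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def vms_date(d: date) -> str:
    """Format a date as DD-MMM-YYYY, e.g. 17-NOV-2005."""
    return f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year:04d}"


def error_log_command(since: date, include=("CPU",)) -> str:
    """Build the error log command line (hypothetical helper).

    `include` holds the /INCLUDE keywords exactly as you would type them;
    this sketch does not check them against what your VMS version accepts.
    """
    classes = ",".join(include)
    if len(include) > 1:
        classes = f"({classes})"  # DCL lists take parentheses
    return f"ANALYZE/ERROR_LOG/INCLUDE={classes}/SINCE={vms_date(since)}"


print(error_log_command(date(2005, 11, 17)))
# prints: ANALYZE/ERROR_LOG/INCLUDE=CPU/SINCE=17-NOV-2005
```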

____________________
Purely Personal Opinion
Arch_Muthiah
Honored Contributor

Re: CPU error

Noorul,

"The system reboot is the only supported approach", but a reboot is obviously undesirable in various situations. There is presently no supported mechanism to reset error counts once the error(s) have been logged.

As for an unsupported approach (and be aware of the potential for causing a system crash):

To reset the error count, one needs to determine the system address of the error count field. For a device, this is at an offset within the device's UCB structure. On VAX, the offset is symbolically defined as UCB$W_ERRCNT, and on Alpha as UCB$L_ERRCNT.

You now need to locate the system address of the UCB$%_ERRCNT field of the CPU.

Now invoke SDA
$ ANALYZE/SYSTEM
SDA> READ SYS$SYSTEM:SYSDEF.STB
SDA> SHOW DEVICE
SDA> EVALUATE UCB+UCB$W_ERRCNT
Hex = xxx   Decimal = -xxx   UCB+offset

Take note of the hex value, then:
SDA> EXIT
$ RUN SYS$SHARE:DELTA

This will ask a series of questions, including the error count address noted above as a hex value. Respond carefully to reset the error count; any mistake can again lead to a system crash.

If you are unable to run this successfully, a reboot may solve your problem.
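
To illustrate the address arithmetic behind the EVALUATE step, and why the W (word, 16-bit) versus L (longword, 32-bit) distinction matters when the field is cleared, here is a small sketch that works on an ordinary byte buffer, not on a live system. The base address, offset and function names are made up for illustration; real values come from SDA on the system in question.

```python
import struct


def field_address(ucb_base: int, errcnt_offset: int) -> int:
    """Address of the error count field: UCB base plus symbolic offset."""
    return ucb_base + errcnt_offset


def clear_error_count(memory: bytearray, offset: int, fmt: str) -> int:
    """Zero a little-endian counter in `memory` and return its old value.

    fmt is '<H' for a 16-bit word (the VAX UCB$W_ERRCNT case) or
    '<I' for a 32-bit longword (the Alpha UCB$L_ERRCNT case).
    """
    (old,) = struct.unpack_from(fmt, memory, offset)
    struct.pack_into(fmt, memory, offset, 0)
    return old


# Hypothetical numbers, for illustration only.
print(hex(field_address(0x80001000, 0x50)))

# Fake "UCB": a word-sized counter of 1 at offset 4, with a neighbouring byte.
ucb = bytearray(16)
struct.pack_into("<H", ucb, 4, 1)
ucb[6] = 0xAA
old = clear_error_count(ucb, 4, "<H")  # clears exactly two bytes
```

Using the longword format on a word-sized field would also overwrite the two bytes that follow it, which is the kind of mistake that makes the live procedure risky.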


Archunan

Regards
Archie
Robert Brooks_1
Honored Contributor

Re: CPU error

Archunan wrote . . .

To reset the error count, one needs to determine the system address of the error count field. For a device, this is at an offset within the device's UCB structure. On VAX, the offset is symbolically defined as UCB$W_ERRCNT, and on Alpha as UCB$L_ERRCNT.

You now need to locate the system address of the UCB$%_ERRCNT field of the CPU.

Now invoke SDA
$ ANALYZE/SYSTEM
SDA> READ SYS$SYSTEM:SYSDEF.STB
SDA> SHOW DEVICE
SDA> EVALUATE UCB+UCB$W_ERRCNT
Hex = xxx   Decimal = -xxx   UCB+offset

--

This may come as a surprise, but the CPU has no UCB. It is not a device, and it is not associated with a device driver.

-- Rob
Arch_Muthiah
Honored Contributor

Re: CPU error

Bob,

I am sorry, I totally misunderstood the question. When I read "error count", I immediately thought of device error counts only.

Yes Noorul, my procedure resets the error count of a device only; it does not apply to the CPU.


Archunan
Regards
Archie
John Gillings
Honored Contributor

Re: CPU error

Noorul,

If this system is life critical, I'd recommend you log a hardware service case immediately and have an engineer analyze your error log.

If the system isn't that critical, I'd tend to keep an eye on the error count. If it increases within (say) one week, then log a case.
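
The "watch it for a week" rule can be sketched as a small check over error counts that you have noted down by hand or captured from SHOW ERROR. This is only an illustration: the function name, the reading format and the seven-day default are assumptions, not part of any VMS tool.

```python
from datetime import date, timedelta


def should_log_case(readings, window_days=7):
    """Return True if the count rose between two readings taken
    no more than `window_days` apart.

    readings: iterable of (date, error_count) pairs, in any order.
    A count that drops (for example after a reboot) is not a rise.
    """
    ordered = sorted(readings)
    window = timedelta(days=window_days)
    for i, (first_day, first_count) in enumerate(ordered):
        for later_day, later_count in ordered[i + 1:]:
            if later_day - first_day > window:
                break  # remaining readings are even further away
            if later_count > first_count:
                return True
    return False


# Example: the count stays at 1 for three days, so no case is needed yet.
stable = [(date(2005, 11, 17), 1), (date(2005, 11, 20), 1)]
print(should_log_case(stable))  # prints: False
```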

You could also install WEBES and/or ISEE. One of their purposes is to gather statistics on hardware errors to determine what needs immediate attention. If configured for it, ISEE can log a service case automatically when a serious enough hardware error is detected.

Talk to your local customer support centre for more details.
A crucible of informative mistakes
Richard White_5
Advisor

Re: CPU error

Good Morning Noorul...
I don't know if you have had a chance yet to run the $anal/err/includ=cpu command that Ian mentioned, but you might want to modify it to $anal/err/incl=(cpu,mach)/sin=xx. VAX systems report the majority of their machine checks as CPU errors.

If the machine check occurred in user or supervisor mode, the system typically will NOT crash with a fatal bugcheck; it just logs the error to the event logger. If the machine check is due to a cache parity error in user mode and the threshold limit is exceeded, the system will disable the cache, which will affect performance.

As Mr. Gillings pointed out, if you have the opportunity, you may wish to install WEBES or ISEE, which may supply more information. It appears that the system has NOT so far suffered a serious error in exec or kernel mode, but you may be on borrowed time, so you may wish to schedule a service call.
Thanx,
r_white
Lawrence Czlapinski
Trusted Contributor

Re: CPU error

OT - Archunan: For devices, starting with VMS Alpha 7.3-2, you can reset device counts with SET DEVICE/RESET commands. This can be useful if a device has been replaced.
Lawrence
Sk Noorul  Hassan
Regular Advisor

Re: CPU error

Hi all,

The error count has not increased since the server went on duty on 17/11/2005. It is still under observation. Thanks for your suggestions.