Operating System - OpenVMS

SOLVED
Chris Smith_23
Frequent Advisor

Alpha 4100 5/466 1GB memory

I have a customer who has two identical Alpha 4100 systems running the same version of OpenVMS (V7.2-1) and the same home-brewed applications. One runs happily while the other logs the following error in sys$errorlog:errlog.sys and exits its application:

NON-FATAL BUGCHECK

SSRVEXCEPT, Unexpected system service exception


I don't have all the details that followed but I noted that:

previous & current Mode = EXECUTIVE
VMM=00 IPL=0 SP Alignment=0


Some of the applications will run, notably those which don't require a lot of memory. Those which do require large amounts of memory will fail. This may be significant as I was called to investigate a memory failure on this system several months ago. When I opened the CPU/Memory cage I discovered a large amount of dust had accumulated. I completely dismantled the cage and vacuumed out all the dust before rebuilding and reseating all the CPU & memory cards. The console memory tests passed following this operation and no hardware errors are currently being logged in errlog.sys.

Anybody got any ideas what may be the cause of this strange behaviour? There are other symptoms which I will post if anyone thinks they may be significant. They relate to disk files failing to completely copy with 'read failures'. Could be due to corrupted directory entries.

Unfortunately I didn't have my laptop with me so haven't any dumps or full errlog listings.

Cheers

Chris

19 REPLIES
Jeroen Hartgers_3
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

I would swap the memory boards between slots. Because there was a lot of dust, it is also possible that the contacts are a little dirty, so the boards pass the test but fail when the memory is actually in use.

If your customer has a hardware support contract, use it to log a call.
Volker Halle
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

Chris,

non-fatal bugchecks in EXECUTIVE mode are mostly due to internal problems detected by RMS. The current process will be deleted.

Memory errors should log HW errlog entries and not non-fatal bugchecks.

Problems with 'files' may indicate disk errors or internal file structure errors (e.g. in the case of indexed files).

Consider running ANAL/DISK and ANAL/DISK/READ against the suspect disk drive. This will check the file system structures and read all blocks that belong to any file on that disk.

If you suspect specific application data files to cause problems, run ANAL/RMS on them.
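
Spelled out in full, the checks above might look like the following sketch. The device name DKA100: and the file specification are placeholders only, not taken from the system in question. Without /REPAIR, ANALYZE/DISK_STRUCTURE only reports problems and changes nothing on the disk.

```dcl
$ ! Report file-structure inconsistencies (nothing is repaired without /REPAIR)
$ ANALYZE/DISK_STRUCTURE DKA100:
$ ! Same check, plus a read of every allocated block on the volume
$ ANALYZE/DISK_STRUCTURE/READ_CHECK DKA100:
$ ! Check the internal structure of a single RMS file (placeholder file name)
$ ANALYZE/RMS_FILE/CHECK DKA100:[APPDATA]ORDERS.IDX
```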

To diagnose the reason for the non-fatal SSRVEXCEPT bugchecks, one has to force a system crash by setting BUGCHECKFATAL = 1 (it's best set dynamically with SYSGEN> WRITE ACTIVE). Then the system will crash on the next error (BUGCHECKFATAL makes non-fatal bugchecks fatal and will cause a system crash).
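
As a rough sketch, assuming a suitably privileged account (CMKRNL is needed for WRITE ACTIVE), the dynamic change could be made like this:

```dcl
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE            ! start from the values currently in memory
SYSGEN> SHOW BUGCHECKFATAL    ! note the present setting (default is 0)
SYSGEN> SET BUGCHECKFATAL 1   ! treat non-fatal bugchecks as fatal
SYSGEN> WRITE ACTIVE          ! apply to the running system only
SYSGEN> EXIT
```

Because only the active (in-memory) value is written, the setting does not survive a reboot. Setting the parameter back to 0 in the same way restores the normal behaviour.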

The crash can then be evaluated (e.g. failing code, file involved etc.).

Volker.
Wim Van den Wyngaert
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

If shadowing is used with a disk of each system, remove the bad disk and add it again. The disks may be different due to a bug (had that in 7.3 with an interbuilding cluster).

Wim
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Hi again

Sorry I haven't updated this thread for a while. I've been suffering from a bad back and haven't been very mobile. I have been back to the customer and have found the following:

I'm reasonably certain that this is NOT a memory problem. I ran the SRM console 'test mem*' for 30 minutes without a single hard or soft error being detected.

I ran ANA/DISK/READ on the suspect drive. This showed up a number of file inconsistencies. I dismounted and re-initialised the drive, recreated the directory structure and copied the files back to the drive. I ran ANA again and found only one file with a header inconsistency. Repeating the exercise produced a 'clean' drive! This is one of the two drives on the Mylex 960 controller.

The two drives on the Mylex 960 controller are JBOD (I don't know why it was configured this way, it's a waste of a Mylex controller!). The other drive does not exhibit any problems. In order to eliminate the Mylex controller and the StorageWorks shelf, I moved the drives to another shelf served by a KZPBA-?? adapter. The application failed in exactly the same way.

I then ran ANA/DISK on the system drive (DKB0:). This threw up a large number of errors including 'probable bad blocks'.

So, I have decided to replace both the system drive and the suspect JBOD drive (both RZ1CB-VW), restore the system from a back-up and go round the loop once more.

On one of my recent visits I did try making all BUGCHECKS fatal and took a copy of the crash dump to analyse on my OpenVMS system here, only to find I'm having difficulties booting my system into X Window System. It never rains but it pours ...

If any of the above throws further light on the subject I'd be very glad to hear your views.

Many thanks for your assistance so far.

Cheers

Chris
Volker Halle
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

Chris,

despite your attempts to diagnose and fix all hardware level problems, the bugchecks did not disappear. This seems to indicate that the bugchecks are not due to HW problems.

Only a crashdump analysis can shed more light on the problem. You don't need DECwindows to analyse a crash ;-)

Once you can read the system dump file on your OpenVMS system, please issue the following commands and post the output as a text attachment:

$ ANAL/CRASH dumpfilename
SDA> READ/EXEC/NOLOG
SDA> SET OUT/NOINDEX clue.txt
SDA> CLUE CRASH
SDA> CLUE REGISTER
SDA> CLUE STACK
SDA> CLUE CALL
SDA> CLUE ERRLOG
SDA> SHOW PROC/CHANNELS
SDA> SHOW PROC/IMAGE
SDA> EXIT

Volker.
Volker Halle
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

Chris,

after rebooting from the crash, there will also be a CLUE$COLLECT:CLUE$node_ddmmyy_hhmm.LIS file at the customer's system, describing the most important footprint information of the crash. Maybe you can have the customer mail you this text file and make it available.

Volker.
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Hi Volker

I will be going to the customer's site this afternoon and will collect the CLUE$... list file. If I get some time this morning I will boot my OVMS system into non-X mode and run SDA with the params you suggest. All being well I will post everything ASAP.

Cheers

Chris
Jan van den Ende
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

Chris,

from your Forum Profile:


I have assigned points to 1 of 24 responses to my questions.


Maybe you can find some time to do some assigning?

http://forums1.itrc.hp.com/service/forums/helptips.do?#33

Mind, I do NOT say you necessarily need to give lots of points. It is fully up to _YOU_ to decide how many. If you consider an answer is not deserving any points, you can also assign 0 ( = zero ) points, and then that answer will no longer be counted as unassigned.
Consider, that every poster took at least the trouble of posting for you!

To easily find your streams with unassigned points, click your own name somewhere.
This will bring up your profile.
Near the bottom of that page, under the caption "My Question(s)" you will find "questions or topics with unassigned points". Clicking that will give all, and only, your questions that still have unassigned postings.

Thanks on behalf of your Forum colleagues.

PS. - nothing personal in this. I try to post it to everyone with this kind of assignment ratio in this forum. If you have received a posting like this before - please do not take offence - none is intended!

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Jan

OK I will do as soon as I've sorted out this problem. 24 postings? If you say so but that seems a lot for just two initial postings. Maybe there have been further postings after I've assumed the thread ended.

Chris