Operating System - OpenVMS

Alpha 4100 5/466 1GB memory

SOLVED
Chris Smith_23
Frequent Advisor

Alpha 4100 5/466 1GB memory

I have a customer who has two identical Alpha 4100 systems running the same version of OpenVMS (V7.2-1) and the same home-brewed applications. One runs happily while the other logs the following error in sys$errorlog:errlog.sys and exits its application:

NON-FATAL BUGCHECK

SSRVEXCEPT, Unexpected system service exception


I don't have all the details that followed but I noted that:

previous & current Mode = EXECUTIVE
VMM=00 IPL=0 SP Alignment=0


Some of the applications will run, notably those which don't require a lot of memory. Those which do require large amounts of memory will fail. This may be significant as I was called to investigate a memory failure on this system several months ago. When I opened the CPU/Memory cage I discovered a large amount of dust had accumulated. I completely dismantled the cage and vacuumed out all the dust before rebuilding and reseating all the CPU & memory cards. The console memory tests passed following this operation and no hardware errors are currently being logged in errlog.sys.

Anybody got any ideas what may be the cause of this strange behaviour? There are other symptoms which I will post if anyone thinks they may be significant. They relate to disk files failing to completely copy with 'read failures'. Could be due to corrupted directory entries.

Unfortunately I didn't have my laptop with me so haven't any dumps or full errlog listings.

Cheers

Chris

19 REPLIES
Jeroen Hartgers_3
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

I would swap the memory boards around. Because there was a lot of dust, it is also possible that the contacts are a little dirty, so the memory passes the test but fails in use.

If your customer has a hardware support contract, use it to log a call.
Volker Halle
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

Chris,

non-fatal bugchecks in EXECUTIVE mode are mostly due to internal problems detected by RMS. The current process will be deleted.

Memory errors should log HW errlog entries and not non-fatal bugchecks.

Problems with 'files' may indicate disk errors or internal file structure errors (e.g. in the case of indexed files).

Consider running ANAL/DISK and ANAL/DISK/READ against the suspect disk drive. This will check the file system structures and read all blocks which belong to any files on that disk.

If you suspect specific application data files to cause problems, run ANAL/RMS on them.

To diagnose the reason for the non-fatal SSRVEXCEPT bugchecks, one has to force a system crash by setting BUGCHECKFATAL = 1 (it's best set dynamically with SYSGEN> WRITE ACTIVE). Then the system will crash on the next error (BUGCHECKFATAL makes non-fatal bugchecks fatal and will cause a system crash).

The crash can then be evaluated (e.g. failing code, file involved etc.).

Volker.
Wim Van den Wyngaert
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

If shadowing is used with a disk of each system, remove the bad disk and add it again. The disks may be different due to a bug (had that in 7.3 with an interbuilding cluster).

Wim
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Hi again

Sorry I haven't updated this thread for a while. I've been suffering from a bad back and haven't been very mobile. I have been back to the customer and have found the following:

I'm reasonably certain that this is NOT a memory problem. I ran the SRM console 'test mem*' for 30 minutes without a single hard or soft error being detected.

I ran ANA/DISK/READ on the suspect drive. This showed up a number of file inconsistencies. I dismounted and re-initialised the drive, recreated the directory structure and copied the files back to the drive. I ran ANA again and found only one file with a header inconsistency. Repeating the exercise produced a 'clean' drive! This is one of the two drives on the Mylex 960 controller.

The two drives on the Mylex 960 controller are JBOD (I don't know why it was configured this way, it's a waste of a Mylex controller!). The other drive does not exhibit any problems. In order to eliminate the Mylex controller and the Storageworks shelf, I moved the drives to another shelf served by a KZPBA-?? adapter. The application failed in exactly the same way.

I then ran ANA/DISK on the system drive (DKB0:). This threw up a large number of errors including 'probable bad blocks'.

So, I have decided to replace both the system drive and the suspect JBOD drive (both RZ1CB-VW), restore the system from a back-up and go round the loop once more.

On one of my recent visits I did try making all BUGCHECKS fatal and took a copy of the crash dump to analyse on my OpenVMS system here, only to find I'm having difficulties booting my system into X Window System. It never rains but it pours ...

If any of the above throws further light on the subject I'd be very glad to hear your views.

Many thanks for your assistance so far.

Cheers

Chris
Volker Halle
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

Chris,

despite your attempts to diagnose and fix all hardware level problems, the bugchecks did not disappear. This seems to indicate that the bugchecks are not due to HW problems.

Only a crashdump analysis can shed more light on the problem. You don't need DECwindows to analyse a crash ;-)

Once you can read the system dump file on your OpenVMS system, please issue the following commands and post the output as a text attachment:

$ ANAL/CRASH dumpfilename
SDA> READ/EXEC/NOLOG
SDA> SET OUT/NOINDEX clue.txt
SDA> CLUE CRASH
SDA> CLUE REGISTER
SDA> CLUE STACK
SDA> CLUE CALL
SDA> CLUE ERRLOG
SDA> SHOW PROC/CHANNELS
SDA> SHOW PROC/IMAGE
SDA> EXIT

Volker.
Volker Halle
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

Chris,

after rebooting from the crash, there will also be a CLUE$COLLECT:CLUE$node_ddmmyy_hhmm.LIS file at the customer's system, describing the most important footprint information of the crash. Maybe you can have the customer mail you this text file and make it available.

Volker.
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Hi Volker

I will be going to the customer's site this afternoon and will collect the CLUE$... list file. If I get some time this morning I will boot my OVMS system into non-X mode and run SDA with the params you suggest. All being well I will post everything ASAP.

Cheers

Chris
Jan van den Ende
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

Chris,

from your Forum Profile:


I have assigned points to 1 of 24 responses to my questions.


Maybe you can find some time to do some assigning?

http://forums1.itrc.hp.com/service/forums/helptips.do?#33

Mind, I do NOT say you necessarily need to give lots of points. It is fully up to _YOU_ to decide how many. If you consider an answer is not deserving any points, you can also assign 0 ( = zero ) points, and then that answer will no longer be counted as unassigned.
Consider, that every poster took at least the trouble of posting for you!

To easily find your streams with unassigned points, click your own name somewhere.
This will bring up your profile.
Near the bottom of that page, under the caption "My Question(s)" you will find "questions or topics with unassigned points " Clicking that will give all, and only, your questions that still have unassigned postings.

Thanks on behalf of your Forum colleagues.

PS. - nothing personal in this. I try to post it to everyone with this kind of assignment ratio in this forum. If you have received a posting like this before - please do not take offence - none is intended!

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Jan

OK I will do as soon as I've sorted out this problem. 24 postings? If you say so but that seems a lot for just two initial postings. Maybe there have been further postings after I've assumed the thread ended.

Chris
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Volker

OK. I visited the site yesterday afternoon and took copies of the CLUE$VMP4_...LIS file and the clue.txt produced by the SDA sequence you suggested. These are in the form of session logs from a SecureCRT serial connection to the Alpha. Both are zipped into one file called VMP4.zip.

I also managed to ftp the sysdump.dmp file to my OVMS Alpha system here, so I can gather any further info without a trip to site.

Many thanks

Chris
Volker Halle
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

Chris,

here is a summary of the crash:

Bugcheck Type: SSRVEXCEPT, Unexpected system service exception
VMS Version: V7.2-1
Current Process: _FTA8:
Current Image:
Failing PC: FFFFFFFF.92CA27C4 IMG$ADD_PRIVILEGED_VECTOR_ENTRY+00544
Failing PS: 00000000.0000000A
Module: IMAGE_MANAGEMENT (Link Date/Time: 28-MAY-1999 23:35:15.10)
Offset: 0000C7C4

failing instruction stream:

IMG$ADD_PRIVILEGED_VECTOR_ENTRY+00524: LDL R17,#X0040(R23)
IMG$ADD_PRIVILEGED_VECTOR_ENTRY+00528: LDL R18,#X003C(R23)
IMG$ADD_PRIVILEGED_VECTOR_ENTRY+0052C: BIS R31,R31,R2
IMG$ADD_PRIVILEGED_VECTOR_ENTRY+00530: SUBL R17,#X01,R3
IMG$ADD_PRIVILEGED_VECTOR_ENTRY+00534: ADDL R23,R18,R23
IMG$ADD_PRIVILEGED_VECTOR_ENTRY+00538: BLT R3,#X00002D
IMG$ADD_PRIVILEGED_VECTOR_ENTRY+0053C: LDQ_U R31,(SP)
IMG$ADD_PRIVILEGED_VECTOR_ENTRY+00540: LDL R19,#X0010(R24)
IMG$ADD_PRIVILEGED_VECTOR_ENTRY+00544: LDL R5,#X0018(R23)

R3 = 00000000.6C6C6C6B
...
R17 = 00000000.6C6C6C6C
R18 = 00000000.6C6C6C6C
...
R23 = 00000000.6F10EC6C

The system crashes with ACCVIO in EXEC mode when executing instruction LDL R5,#X0018(R23) due to an invalid address in R23

R23 had a value of 02A48000 at the entry into the above instruction stream and should have pointed to some data structure. R17 and R18 (queue header?) have been loaded from this data structure, but the data structure contains invalid data: 6C6C6C6C = 'llll' in ASCII. When these values are used as addresses, the system crashes (now that BUGCHECKFATAL is 1).
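The register values quoted above can be cross-checked with a few lines of arithmetic. The following is only an illustrative Python sketch, not something run on the crashed system: the function name `replay` is hypothetical, and it assumes both longword loads returned the fill pattern 6C6C6C6C, as the register dump suggests.

```python
MASK32 = 0xFFFFFFFF  # LDL/SUBL/ADDL operate on 32-bit longwords


def replay(r23_entry: int, field_value: int) -> dict:
    """Replay the integer arithmetic of the quoted instruction stream."""
    r17 = field_value                    # LDL  R17,#X0040(R23)
    r18 = field_value                    # LDL  R18,#X003C(R23)
    r3 = (r17 - 1) & MASK32              # SUBL R17,#X01,R3
    r23 = (r23_entry + r18) & MASK32     # ADDL R23,R18,R23
    return {"R3": r3, "R17": r17, "R18": r18, "R23": r23}


regs = replay(0x02A48000, 0x6C6C6C6C)
for name, value in regs.items():
    print(f"{name} = {value:08X}")

# The same four bytes read as ASCII text (0x6C is lowercase 'l')
pattern_text = bytes.fromhex("6C6C6C6C").decode("ascii")
print(repr(pattern_text))
```

With an entry value of 02A48000, the computed R3 and R23 match the 6C6C6C6B and 6F10EC6C seen in the dump, which supports the reading that R23 was pushed out of range by adding a corrupted offset.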

This address space is occupied by the global image DXML$FGS_BLAS1E, so the problem could be in that code. It could also be some run-away data copy/move, which has overwritten the address space.

Once you can read the dump on your OpenVMS system, you need to try to find out which part of memory has been overwritten by 6C6C6C6C. Start with

SDA> EXA 02A48000;50 ! should show corruption

then work backwards by 100 or 1000, e.g.

SDA> EXA 02A48000-100;50 etc.

to find where the corruption starts.
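The same "where does the fill pattern start" question can also be asked of any raw copy of the suspect data away from SDA. The sketch below is an assumption-laden illustration in Python, not an SDA feature: `find_fill_runs` is a hypothetical helper, and the sample buffer is synthetic.

```python
def find_fill_runs(data: bytes, fill: int = 0x6C, min_len: int = 16):
    """Return (start_offset, length) for every run of `fill` bytes
    that is at least `min_len` bytes long."""
    runs = []
    i, n = 0, len(data)
    while i < n:
        if data[i] == fill:
            j = i
            while j < n and data[j] == fill:
                j += 1
            if j - i >= min_len:
                runs.append((i, j - i))
            i = j  # continue after the run
        else:
            i += 1
    return runs


# Synthetic buffer: 32 ordinary bytes, 48 bytes of 0x6C, then 16 zero bytes
sample = bytes(range(32)) + b"\x6c" * 48 + b"\x00" * 16
runs = find_fill_runs(sample)
for start, length in runs:
    print(f"fill run at offset {start:#x}, {length} bytes")
```

The minimum run length keeps isolated 'l' characters in ordinary text from being reported as corruption.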

Volker.
Volker Halle
Honored Contributor
Solution

Re: Alpha 4100 5/466 1GB memory

Chris,

all the files in this process are from DKB0:. You can find the file-ids in the SDA> SHOW PROC/CHAN output.

You could use the following command, to dump the file headers to determine the file names:

$ dump/head/block=count=0 DKB0:/ident=n

where n is the file-id - first number from VMP4$DKB0:(xxx,xxx,0)

If any of those files are .EXE files, run ANAL/IMAGE on them.

It may be that one of the image files is corrupted (by that 6C6C6C6C pattern!).

SDA> SHOW PROC/IMAGE shows bad start/end address values for DXML$FGS_BLAS1

Volker.
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Volker

Well, you don't hang about do you? I'd barely had time to shut my laptop down when your reply came in!

The bad news is that, having ftp'ed the sysdump.dmp file from my laptop to my OVMS Alpha system, when I ran SDA I got a message about wrong header type for this version of SDA!

As I posted earlier, I did find a lot of files on DKB0: with inconsistent headers and possible bad blocks. So the chances are that one of the processing applications has been corrupted, hence causing the violation. I'm due back there tomorrow and will let you know what I find.

Many, many thanks for your help.

Cheers

Chris
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Volker

You are a genius! I did a dump/head... on all the fids for this process then ana/ima on the executables without finding any problems. Then I did ANA/IMA on SYS$LIBRARY:DXML$FGS_BLAS1.EXE. That threw up a very interesting error which showed the string of 6Cs we were looking for. I reinstalled the DXML run-time library from the layered products CD and the process now runs to a point beyond the original failure. We are not completely out of the woods yet but I think the rest of the problems are disk header corruption related. I've attached the console session log of the ana/ima so you can see the error. I guess the module image may have got corrupted when the original memory problem occurred - who knows.

Anyway the customer is very happy now.

Thank you very much for your hard work in analysing this problem and pointing me in the right direction.

Cheers

Chris
Volker Halle
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

Chris,

the ANAL/IMA output shows a corruption in the IMAGE ACTIVATOR FIXUP SECTION:

EIAF$L_QRELFIXOFF : 6C6C6C6C
EIAF$L_LRELFIXOFF : 6C6C6C6C
EIAF$L_QDOTADROFF : 6C6C6C6C
EIAF$L_LDOTADROFF : 6C6C6C6C
EIAF$L_CODEADROFF : 6C6C6C6C
EIAF$L_LPFIXOFF : 6C6C6C6C
EIAF$L_CHGPRTOFF : 6C6C6C6C
EIAF$L_SHLSTOFF : 6C6C6C6C
EIAF$L_SHRIMGCNT : 6C6C6C6C
EIAF$L_SHLEXTRA : 6C6C6C6C
EIAF$L_PERMCTX : 6C6C6C6C
EIAF$L_LPPSBFIXOFF : 6C6C6C6C

The crash happened in IMAGE_MANAGEMENT code, which is processing this kind of information, bingo !
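A saved ANAL/IMAGE session log can be screened for this kind of repeated-byte value mechanically. The following Python sketch is only an illustration: it assumes the log contains `NAME : VALUE` lines shaped like the ones quoted above, `suspicious_fields` is a hypothetical helper, and the `EXAMPLE$L_OK` line is invented to show a value that is not flagged.

```python
import re

# Matches lines such as "EIAF$L_QRELFIXOFF : 6C6C6C6C"
FIELD_RE = re.compile(r"^\s*([A-Z0-9$_]+)\s*:\s*([0-9A-Fa-f]{8})\s*$")


def suspicious_fields(listing: str):
    """Return (name, value) pairs whose longword is one byte repeated
    four times. All-zero and all-ones values are skipped because they
    are common legitimate contents."""
    hits = []
    for line in listing.splitlines():
        m = FIELD_RE.match(line)
        if not m:
            continue
        name, value = m.group(1), m.group(2).upper()
        raw = bytes.fromhex(value)
        if len(set(raw)) == 1 and raw[0] not in (0x00, 0xFF):
            hits.append((name, value))
    return hits


sample_log = """\
IMAGE ACTIVATOR FIXUP SECTION
EIAF$L_QRELFIXOFF : 6C6C6C6C
EXAMPLE$L_OK : 00000003
EIAF$L_SHLSTOFF : 6C6C6C6C
"""
hits = suspicious_fields(sample_log)
for name, value in hits:
    print(f"{name} looks overwritten: {value}")
```

A hit is only a hint that a field deserves a closer look; a single repeated-byte longword can occur legitimately, whereas a whole section filled with the same pattern, as here, is far harder to explain away.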

Volker.

PS: Again, this example shows that there is always a way in OpenVMS to DIAGNOSE a problem and come up with a solution (sometimes just a workaround) for the underlying problem. Diagnosis works much better than speculation. RE-INSTALL VMS would also have worked, but that's not the way we solve problems in OpenVMS ;-)
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Volker

As soon as I saw the strings of 6Cs in the fix-up area I knew we'd found the culprit.

Back in the 1980s & 90s I spent many hours staring at crash dumps from VAX-VMS but haven't done much of that recently. We used to use VAX-VMS as a vehicle for hosting real-time aircraft & systems models. If I phoned support at DEC at the Viables in Basingstoke (but a memory now) I used to be told that I probably knew more about VMS than they did!!!

If I may, a supplementary question regarding my OVMS Alpha system here?

As I mentioned long ago in the thread, I am unable to log in using CDE or DECWindows. I can log in in Failsafe Mode. In the other modes the system seems to hang for many minutes with the blue screen of CDE or the black screen of DECWindows before eventually displaying a grey pattern screen with an up/left (diagonal) pointer arrow which I can move about - but that is all. I can log in via the network using telnet but the console is useless. Any idea what may be causing this?

Cheers

Chris
Volker Halle
Honored Contributor

Re: Alpha 4100 5/466 1GB memory

Chris,

check the SYS$MANAGER:DECW$*.LOG files for any errors. Consider opening another thread for troubleshooting your DECwindows problem.

Volker.
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Volker

OK. Thanks very much. I'll close this thread.

All the best.

Chris
Chris Smith_23
Frequent Advisor

Re: Alpha 4100 5/466 1GB memory

Thanks to Volker's analysis of the crash dump data, a solution was found which, luckily, only involved reinstalling the DXML runtime libraries.