Operating System - OpenVMS

AlphaServer Memexer

Occasional Advisor

I know this isn't an OpenVMS question per se, but there wasn't a category for Alphas in the Server section of the forums, so I figure this is the next best place. :)

There was an issue with one of our DS20e servers which is being described as "something got corrupted in memory". My manager wants me to perform a diagnostic scan so we can rule out the hardware as the root cause.

I found the "memexer" SRM console command, which sounds like it will discover any memory issues. Since the system will need to be offline, I'd like to determine approximately how long the scan will take so I can schedule the outage accordingly. I'm running a 2-pass scan ("memexer 2") on my test AlphaServer 800 and it's been running for over an hour now. The AlphaServer 800 only has 128MB, whereas the DS20e has 1.5GB.

Does this mean the DS20e will take 12 times as long, or does the scan usually take the same amount of time regardless of the amount of memory?
Steven Schweda
Honored Contributor

Re: AlphaServer Memexer

> [..] described as "something got corrupted
> in memory". [...]

Do you always believe what you're told? (I
have a bridge for sale, ...)


Look for "Bad Page List".

You do have ECC memory, right? Memory
hardware errors don't normally go unnoticed
by the OS.

As usual, showing actual commands with actual
error messages can be more helpful than vague
descriptions or interpretations.

And running physical memory tests before
having any real evidence of a physical memory
problem can waste considerable time.

> Does this mean the DS20e will take 12 times
> as long, [...]

Perhaps, if the memory speed of a DS20e is
the same as that of the other system, and if
"memexer" does exactly the same things on
both systems. Most of which seems unlikely.
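
To make the "12 times" arithmetic concrete, here is a minimal sketch of the naive linear estimate. It assumes exactly the two conditions named above (same memory speed, same memexer behavior on both machines), which may well not hold. The function name and the one-hour input are hypothetical placeholders, not measured values.

```python
def scaled_duration_hours(test_hours, test_mem_mb, target_mem_mb):
    """Naive linear extrapolation of scan time by memory size.

    Assumes identical memory speed and identical exerciser behavior
    on both systems, which is an assumption, not a known fact.
    """
    return test_hours * (target_mem_mb / test_mem_mb)


# Memory sizes from the thread: 128MB test box, 1.5GB DS20e.
test_mem_mb = 128
target_mem_mb = 1.5 * 1024

ratio = target_mem_mb / test_mem_mb
print(ratio)

# Hypothetical: if a full run took 1 hour on the test box.
print(scaled_duration_hours(1.0, test_mem_mb, target_mem_mb))
```

The ratio of memory sizes is where the figure of 12 comes from; any real estimate would still need an actual completed run to scale from.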
Honored Contributor

Re: AlphaServer Memexer

Given typical AlphaServer memory is ECC/EDC, you should have received an error reported for actual failed memory, if the error was a hardware-level error.

The usual trigger for the "something got corrupted in memory" reports is an application bug, or (more rarely, but there are some cases known) a kernel bug. These won't show via ECC/EDC, as they're not hardware errors.

Check the error logs.

This is old gear, and you're best headed for a hardware upgrade regardless of this particular case. If not, look at getting yourself a used DS20e as a source for spare parts and upgrades.

As for the direct answer to your "how long?", I don't know. The last round of (gonzo) memory tests I was running were on a GS1280-class box, and those took a couple of hours. But that's not going to be particularly comparable to your box.

Put another way, run the diagnostic. It takes as long as it takes. Your boss said to do it, so... (And if you can't afford the downtime, start looking at bringing a spare server online for your environment. There's another issue here for your boss to consider.)

The memexer tool runs in the background and completes silently on success, so it may well have already finished. (If you haven't already found it, the show_status command shows you progress.)

If your version of SRM has the command available, then memexer_mp can use both processors here for testing.

Don't run parallel sets of memexer or memexer_mp as they can get tangled and report errors.

kill_diags ends the testing on command, and can be useful if you approach the end of your "guestimated" maintenance window.
Occasional Advisor

Re: AlphaServer Memexer

Thanks for the responses! Yes, I agree that it is likely an application bug, but the scan may be the only thing that completely convinces them.

I've been running the "while true; show_status; sleep 10; done" on my test machine so I can monitor it. It's coming up on 3 hours. Under the Pass column of the show_status it's at 1280 and the Bytes Written/Read is at 135450853376 which seems high since it only has 128MB.
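
As a rough sanity check on that counter, the sketch below divides the reported byte count by the installed memory. It assumes the figure counts every byte written or read and that "128MB" means 128 * 1024 * 1024 bytes; how show_status actually accumulates the number is not confirmed here.

```python
MIB = 1024 * 1024

installed_bytes = 128 * MIB      # memory in the test AlphaServer 800
bytes_moved = 135450853376       # Bytes Written/Read from show_status

# How many times the whole of memory has been covered, if every
# counted byte corresponds to one touch of one memory location.
sweeps = bytes_moved / installed_bytes
print(round(sweeps, 1))
```

Under those assumptions the exerciser has gone over the equivalent of all installed memory roughly a thousand times, so a byte count far larger than 128MB is expected from a tool that keeps rewriting and rereading the same locations.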

The AlphaServer 800 doesn't list memexer_mp when I run HELP, but I'll have to check the DS20e.

From what you described it sounds like the scan will end eventually (I was getting worried that it will run forever) so I'll let it continue... more to feed my curiosity than anything. :)
Respected Contributor

Re: AlphaServer Memexer

Memory exercisers are just that, exercisers. This may continue for quite a while. As a diagnostic tool, these programs are ridiculously simple. The only thing that changes is the pattern used to write and read from each memory location.

As stated before, with no indication from VMS that there was an error, it is likely that the problem is a programming one, not hardware. At this point, you are wasting time. You are best spending time trying to locate the portion of the application that encountered the problem and checking that code for errors.

There are many of us here that provide consulting services to assist with these types of problems. I am one.
Honored Contributor

Re: AlphaServer Memexer

You're probably looking under the wrong rock.

But it'll probably give your boss some ammo for having a subsequent chat with whoever tossed out that "something got corrupted in memory" statement.

A memory error is either silently corrected, or it tosses a honking obvious parity error and an associated run-time snit underneath whatever was using the page. If the upstairs software using the bad page happened to be some core part of VMS, well, bye-bye VMS.

View the error logs, and see if there are any CPU, cache, memory, disk or other core hardware errors. (Don't depend on SHOW ERROR here, either, as memory errors don't get logged there until things get, um, nasty. View the error logs directly.) Any error details from the log will be more reliable than the memory exercisers; those are only particularly useful once you know you have a hardware error.

(A transient memory error won't repeat, so the memory exerciser won't find it. A failing memory component will repeat as the memory is hit and either corrected or logged as a hard error, so those errors will show up in the error logs.)

Also start instrumenting and bench-checking the code, as that's the most likely culprit for a corruption.

It's commonplace for a multiprocessor box to reveal all manner of weird and latent errors in existing application code, too, particularly when there is shared memory or any asynchronous code involved.

Given the statements so far, my bet is on an application error.
Steven Schweda
Honored Contributor

Re: AlphaServer Memexer

> [...] which seems high since it only has
> 128MB.

Who said that it looks at any byte only once?
It's called "memexer", not "memhardlytest".

> Given the statements so far, my bet is on
> an application error.

Given the complete lack of useful evidence,
I'd tend to wait for some useful evidence.
But if I had to bet blind, my money would be
on the software. Show me an actual error
report, and I'll think harder.

David B Sneddon
Honored Contributor

Re: AlphaServer Memexer

May be worth considering or not... A long time ago I had an issue with seemingly random process crashes which turned out to be a bad block in a pagefile.
Andy Bustamante
Honored Contributor

Re: AlphaServer Memexer

>>> "something got corrupted in memory".

By the application, the operating system, hardware error or cosmic rays.

Where did this diagnostic come from? And what was the behavior of the system when corruption was present?

If you have 2 CPUs installed, use memexer_mp. Besides testing memory, you'll also exercise the CPUs. I once saw a new ES-45 crash with random memory errors logged. We changed memory options without resolving the error. After running memexer_mp for a weekend, we captured a console error reporting CPU cache problems. Use a device that captures console output.

Did you review error logs? Do you have hardware support on this system?
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Occasional Advisor

Re: AlphaServer Memexer

Thanks for all of your responses. The application team is investigating the software side of things and I've opened a ticket with HP to analyze the error log but I'm definitely leaning towards it being an application issue. :)