
crash alpha 4100

SOLVED
Michael Schulte zur Sur
Honored Contributor

crash alpha 4100

Hi,

Has anyone seen this error?
Feb 2 17:42:20 fra10d vmunix: pmap_update_send: missing ack from cpu 3
Feb 2 17:42:20 fra10d vmunix: panic (cpu 0): tb_shoot ack timeout

I had just run
consvar -s sys_serial_num xxxxxxxx

thanks for all input,

Michael
11 REPLIES
Michael Schulte zur Sur
Honored Contributor

Re: crash alpha 4100

Oh,

the machine is running Tru64 5.1A PK6

Michael
Ann Majeske
Honored Contributor

Re: crash alpha 4100

Hi Michael,

I asked the kernel developers if they'd heard about this. The summary is, if doing that consvar command causes your system to panic, don't do that consvar command.

They're going to look into this further and talk to the firmware engineers to see if they can narrow it down a little. It may just be that setting sys_serial_num using consvar is not allowed. It may be that Tru64 should be funnelling this particular request to the master cpu (since the missing ack was from cpu 3, it looks like it isn't).

There have been previous problems with the 4100 when trying to set console variables using consvar that are not allowed to be set. Sounds like the bottom line is that you should avoid using consvar on a 4100 if possible.

Ann
Ann Majeske
Honored Contributor
Solution

Re: crash alpha 4100

I've got the "official" answer, they're pretty quick!

Official answer is: "the sys_serial_num should only be set from the console command, it is not allowed to set the sys_serial_num using consvar".

Apparently there's a long, highly technical, explanation why this is true, but I'll only push them for it if you're interested in all the gory details.

Ann
Michael Schulte zur Sur
Honored Contributor

Re: crash alpha 4100

Hi,

thanks for your answers. I found it out by myself using Google. Well, this does not make much sense to me. I am using consvar to set other parameters.
I used it on another machine and yes, with the same result. I am interested in the details because I am certainly going to be asked why this happened. I can't help but think it must be a bad implementation.

thanks,

Michael
Michael Schulte zur Sur
Honored Contributor

Re: crash alpha 4100

Ann,

can you imagine how embarrassing it is to shoot down a production machine with a simple command? I have set the serial number on other machines without a problem. How could I have anticipated that? Thinking it was a hardware problem, I ran the command on another 4100 last night, with the same result.

Michael
Johan Brusche
Honored Contributor

Re: crash alpha 4100


Michael,

"consvar -s" is only SUPPORTED for parameters you can show with "consvar -l" (look for the word "supported" in the consvar manpage).

About the gory details: setting some parameters forces the CPU to execute a lot of instructions in console firmware context, leaving the kernel no chance to do its job on that CPU. To guarantee CPU cache coherency, the kernel has a maximum time interval within which the translation lookaside buffer has to be updated on all CPUs. If it is not updated in time ==> panic.

You can probably get away with these kinds of consvar commands on systems with only one CPU, but on multi-CPU systems the risk of a panic is high, especially if the consvar is executed by CPU0.
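
The timeout described above can be illustrated with a small simulation. This is only a rough sketch, not Tru64 kernel code: the names `tb_shootdown`, `ShootdownTimeout`, `busy_until` and `ack_timeout` are invented for the example, and the time values are arbitrary. Each CPU is modelled by the time at which it can next respond (for instance, when it returns from console firmware context).

```python
class ShootdownTimeout(Exception):
    """A CPU failed to acknowledge in time (a real kernel would panic here)."""


def tb_shootdown(initiator, busy_until, now, ack_timeout):
    """Simulate one translation buffer shootdown round.

    busy_until maps a CPU id to the time at which that CPU can next
    respond to the initiator. Returns the list of CPUs that acknowledged.
    All names and values are hypothetical.
    """
    deadline = now + ack_timeout
    acked = []
    for cpu in sorted(busy_until):
        if cpu == initiator:
            continue  # the initiator updates its own translation buffer directly
        if busy_until[cpu] > deadline:
            raise ShootdownTimeout(f"missing ack from cpu {cpu}")
        acked.append(cpu)
    return acked


if __name__ == "__main__":
    # CPU 3 is busy in firmware until t=5.0; the timeout is 1.0.
    cpus = {0: 0.0, 1: 0.0, 2: 0.0, 3: 5.0}
    try:
        tb_shootdown(initiator=0, busy_until=cpus, now=0.0, ack_timeout=1.0)
    except ShootdownTimeout as exc:
        print(f"panic (cpu 0): {exc}")
```

With a single CPU there is nobody else to wait for, so the loop never raises, which fits the observation that single-CPU systems tend to get away with it.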

So... the manpage told you so, and no, NOT an implementation issue.

__ Johan ;-)

_JB_
Michael Schulte zur Sur
Honored Contributor

Re: crash alpha 4100

Johan,

thank you for your explanation. It explains why it does not happen on a single cpu machine. However I do not see where man consvar tells me of any danger.

greetings,

Michael
Johan Brusche
Honored Contributor

Re: crash alpha 4100


The manpage does not explicitly tell you there is a danger, but the text for the "-l" switch tells you how to find out which variables are supported. If setting one of the parameters in that output causes a panic, then there is reason to log a case.

__ Johan.

_JB_
Michael Schulte zur Sur
Honored Contributor

Re: crash alpha 4100

Johan,

if something is disabled, then it should be rejected and not crash the machine. For example, the first 16 blocks of a disk are read-only and any attempt to write to them results in an error. So this is a bad implementation after all.

thanks for your time,

Michael
Aaron Biver_2
Frequent Advisor

Re: crash alpha 4100

Michael,

Let me first of all say that I am truly sorry that you ran into this on a production machine. You have my sympathy. You also have my full attention.

I agree that we should prevent this particular crash from occurring. The manpage explicitly states that an exception database is used for purposes such as this - when some fw versions don't behave. The problem here might be that the exception database might not be up to date. I promise to look into it (I will file an internal development problem report).

As for the fix:
If we know ahead of time that setting some particular variable, on some particular platform, with some particular version of fw (or any version of fw, in this case), will cause a panic, it should be an easy change to make. I think we can say that these conditions are true, and it is now a matter of implementing a fix. I think I can just update the existing exception database.

This solution adopts a reactive approach to the generic problem (i.e. that some variables might be unsafe to set). It involves taking known panic cases (such as yours) and entering them into the database so they won't happen with future versions of the OS.


So far, the only two instances I've ever heard of that can crash a system are both on the 4100 family, and they are the variables:
-> sys_serial_num
-> ewa0_mode (or any other ew*_mode)
Don't try these at home, kids.

However, a proactive solution, in which no variable is allowed to be set unless it is on a "good" list, would be unrealistic. There are too many platform/fw-version/variable combinations to test every one.
Also, we risk breaking binary compatibility for unsupported uses of consvar, in case someone is using consvar to store some info in console variables.
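
To make the reactive approach concrete, here is a minimal sketch of what an exception-list check could look like. It is an illustration only and does not reflect how the real exception database is stored or consulted; `UNSAFE_TO_SET`, `check_set_allowed` and the platform string `"4100"` are assumptions made for the example. Anything not on the list is still allowed, so user-defined variables keep working.

```python
from fnmatch import fnmatchcase

# Hypothetical exception list: (platform pattern, variable pattern).
# The two entries mirror the known panic cases mentioned above.
UNSAFE_TO_SET = [
    ("4100", "sys_serial_num"),
    ("4100", "ew*_mode"),
]


def check_set_allowed(platform, variable):
    """Return None if the set may proceed, otherwise an error message."""
    for platform_pattern, variable_pattern in UNSAFE_TO_SET:
        if fnmatchcase(platform, platform_pattern) and fnmatchcase(
            variable, variable_pattern
        ):
            return f"refusing to set {variable} on {platform}: known to panic"
    return None  # not on the exception list, so the request goes ahead


if __name__ == "__main__":
    for name in ("sys_serial_num", "ewa0_mode", "my_own_var"):
        print(name, "->", check_set_allowed("4100", name) or "allowed")
```

The point of the design is that a listed request fails with an error message instead of reaching the firmware, while unlisted (including unsupported) uses behave as before.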

Best regards, and be as careful with consvar as possible.

Aaron Biver
Tru64 Kernel
Michael Schulte zur Sur
Honored Contributor

Re: crash alpha 4100

Aaron,

thanks for your sympathy. So far I tend to believe Johan that it is more likely to happen on multiprocessor machines. I have tried it on a single-processor 4100 and it worked. To me, sys_serial_num was just a text string stored in one place in the NVRAM, so why it could cause a panic was a mystery. I thought all these parameters only mattered at startup.
The idea of a positive list may also be a bad one because you can create your own parameters.
Who has a 4100 at home? Not me! LOL!!
Although I wonder if I could get our 4-CPU, 4 GB 4100 once it is retired. It would be a nice machine to have. ;-)
Can you shed some light on what is done with sys_serial_num such that changing it could cause a panic?

thanks,

Michael