Re: How to debug system crashes

Jure Pecar_1 · ‎04-22-2009

Hello,

We have a batch of DL385 machines with slightly different hardware configurations. One of them started causing problems at the end of last year. The problems show up as sporadic application segfaults and random kernel oopses (these always happen at exactly the same function in the kernel). We switched kernels, upgraded from RHEL 4 to RHEL 5, replaced all memory, and applied the upgrades from firmware CD 8.20, but the problem persists. It shows up on only one of many machines with identically configured software.

This machine survives two weeks of memtest and a week of cpuburn without a single error.

Most recently we started getting ASR reboots in the IML logs, so one of the HP agents on the machine must be failing and the system then reboots itself.

My question is: how can I debug this further? I'd like to understand what the problem is. At this point I can only replace the motherboard or the CPUs, so I'd like to know which spare parts to order to solve the problem.

Matti_Kurkela · ‎04-22-2009

The first question that comes to my mind is: if it always oopses at the same kernel function, what is that function trying to do?

If it is related to some specific hardware driver, it might be smart to try swapping that particular hardware component first.
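One way to confirm that the oopses really land in the same function is to tally the faulting symbols from the saved oops text. The following is only a minimal sketch: it assumes the oops text contains a line with `RIP:` followed by a `symbol+0xoffset/0xlength` reference, which is how x86_64 oopses are usually printed, but the exact format varies between kernel versions. The function name `faulting_functions` is made up for this illustration.

```python
import re
from collections import Counter

# Matches kernel symbol references of the form "name+0xoffset/0xlength".
SYMBOL_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def faulting_functions(log_text):
    """Count the first symbol found on each line that contains 'RIP:'.

    Assumption: the faulting instruction pointer is reported on a line
    containing "RIP:"; adjust the marker for other architectures.
    """
    counts = Counter()
    for line in log_text.splitlines():
        if "RIP:" not in line:
            continue
        match = SYMBOL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the contents of the system log (on RHEL typically /var/log/messages) and looking at `faulting_functions(text).most_common()` would show whether one function dominates.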

If you would like to capture more information about your crashes, in RHEL 5 you might try kdump:
http://kbase.redhat.com/faq/docs/DOC-6039

MK

Jure Pecar_1 · ‎04-22-2009

No, the kernel crash happens in a memory allocation function. We use Virtuozzo, and this is their reply after investigating:

-----

For this node we found 4 OOPSes in the logs. We investigated these issues and believe that they are most likely caused by a CPU fault (probably overheating or an old firmware version loaded from the motherboard BIOS).

3 of them are in the same place, in __alloc_pages() (one after the node reboot); the last one is in csum_partial_copy_from_user().

The first 3 cases look very similar,
inside zone_statistics():
pg_data_t *pg = z->zone_pgdat;
pg_data_t *orig = zonelist->zones[0]->zone_pgdat; <<< HERE
failed because zonelist->zones[0] is NULL.

However, we have not found how this is possible. Theoretically, I am willing to concede that it may be caused by some incorrect mempolicy operations. But in this case the process is known to us: it is the vzaserv process executed inside the Service CT.
Therefore I doubt that it is a software issue, because it was never triggered on the other nodes.

The last OOPS is more interesting:
A general protection fault was generated in the csum_partial_copy_to_user() function on a RETQ instruction, but the call trace looks correct.

In general, these errors look like processor faults to us. A CPUBurn test can probably clarify the situation. I would recommend re-checking the CPU/node coolers and power, and updating the motherboard BIOS.

-------

We did that and came up with no errors; the problem persists.

Can you suggest any method that can help me pinpoint the problem to either CPUs or to motherboard?

Matti_Kurkela · ‎04-23-2009

Okay... that definitely looks like a problem somewhere within the core hardware (CPU, memory, system board).

If the IML log cannot provide any useful information, the only way I know for debugging this further would be to connect a high-speed logic analyzer to various points in the system bus and start looking for signaling errors :-)

But unless you are working in a place that designs and debugs high-speed electronics for a living, getting the necessary equipment and knowledge to do it would be *extremely* expensive and time-consuming. Much more so than simply buying some spare parts and swapping them one at a time until the problem is gone.

This is one of the situations where it is nice to have a hardware support agreement: with it, you can just call your HW support provider and say "This machine keeps crashing, fix it please!".

If the server is still under warranty, you might get something similar from HP, although the response time won't be as good as with an appropriate support agreement.

MK

mohamed.bouraoui · ‎04-24-2009

Hi friend,
Another thing that might help with your problem:
use a kernel without SMP (a kernel for a single CPU) to check whether the problem is in the CPUs.
Good luck
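
In the same spirit as the single-CPU kernel suggestion, one could pin an identical, deterministic workload to each CPU in turn and compare the results. This is only a rough sketch under stated assumptions: it needs Linux and Python 3.3 or later (for `os.sched_setaffinity`), the helper names `checksum` and `check_each_cpu` are made up for this illustration, and a fault intermittent enough to survive a week of cpuburn may well not show up in a test like this.

```python
import hashlib
import os


def checksum(rounds):
    """Deterministic CPU-bound work: repeatedly hash a fixed 4 KiB buffer."""
    data = b"\x5a" * 4096
    for _ in range(rounds):
        # A SHA-256 digest is 32 bytes; repeat it to refill the 4 KiB buffer.
        data = hashlib.sha256(data).digest() * 128
    return hashlib.sha256(data).hexdigest()


def check_each_cpu(rounds=1000):
    """Run the same work pinned to each allowed CPU.

    Returns a dict mapping CPU number to the resulting hex digest.
    On healthy hardware all digests are identical. Returns an empty
    dict where CPU affinity calls are unavailable (non-Linux systems).
    """
    if not hasattr(os, "sched_setaffinity"):
        return {}
    allowed = os.sched_getaffinity(0)
    results = {}
    try:
        for cpu in sorted(allowed):
            os.sched_setaffinity(0, {cpu})  # pin this process to one CPU
            results[cpu] = checksum(rounds)
    finally:
        os.sched_setaffinity(0, allowed)  # restore the original CPU mask
    return results
```

If `len(set(check_each_cpu().values()))` is greater than 1, the CPU whose digest differs from the rest would be the first candidate for swapping.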
