04-22-2009 09:50 PM
How to debug system crashes
We have a batch of DL385 machines with slightly different hardware configurations. One of them started having problems at the end of last year. The problems show up as sporadic application segfaults and random kernel oopses (these always happen in exactly the same kernel function). We switched kernels, upgraded from RHEL 4 to RHEL 5, replaced all the memory, and applied the upgrades from firmware CD 8.20, but the problem persists. It shows up on only one of many machines with identically configured software.
This machine survives two weeks of memtest and a week of cpuburn without a single error.
Most recently we started getting ASR reboots in the IML logs, so one of the HP agents on the machine must be failing, and the system then reboots itself.
My question is: how can I debug this further? I'd like to understand what the problem is. At this point I can only replace the motherboard or the CPUs, so I'd like to know which spare parts to order to solve the problem.
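One way to back up the observation that the oopses always land in the same kernel function is to tally the faulting symbol across the saved logs. The following is a minimal Python sketch, not a definitive tool. It assumes the oops text contains `RIP` lines that end in a `symbol+0xoffset/0xlength` pair, as x86_64 kernels usually print them; the function name `faulting_functions` and the regular expression are hypothetical helpers introduced here for illustration.

```python
import re
from collections import Counter

# Assumption: the faulting location appears on a line containing "RIP",
# followed somewhere by "symbol+0xOFFSET/0xLENGTH".
_RIP_SYMBOL = re.compile(
    r"\bRIP\b.*?([A-Za-z_][\w.]*)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+"
)


def faulting_functions(log_text):
    """Count how often each kernel symbol appears on a RIP line."""
    counts = Counter()
    for line in log_text.splitlines():
        match = _RIP_SYMBOL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the concatenated text of the saved oops messages and printing `faulting_functions(text).most_common()` gives a quick per-function count. Some kernels print the RIP line twice per oops, so the counts are best read as relative rather than exact.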
04-22-2009 10:27 PM
Re: How to debug system crashes
If it is related to some specific hardware driver, it might be smart to try swapping that particular hardware component first.
If you would like to capture more information about your crashes, in RHEL 5 you might try kdump:
http://kbase.redhat.com/faq/docs/DOC-6039
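kdump can only capture a dump if memory was reserved for the crash kernel at boot with the `crashkernel=` kernel parameter. As a rough sketch of a pre-flight check, the Python snippet below looks for that parameter in a kernel command line string. The helper name `crashkernel_setting` is hypothetical, and the sketch only checks that the parameter is present; it does not verify that the kdump service is actually running.

```python
def crashkernel_setting(cmdline):
    """Return the value of the crashkernel= boot parameter, or None if absent."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On the affected machine, `crashkernel_setting(read_cmdline())` returning `None` would suggest that no memory is reserved, so a crash would not leave a vmcore behind.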
MK
04-22-2009 10:36 PM
Re: How to debug system crashes
-----
For this node we found 4 oopses in the logs. We investigated these issues and believe that they were most likely caused by a CPU fault (probably overheating, or an old firmware version loaded from the motherboard BIOS).
3 of them occurred in the same place, in __alloc_pages() (one after the node reboot); the last one was in csum_partial_copy_from_user().
The first 3 cases look very similar:
inside zone_statistics()
pg_data_t *pg = z->zone_pgdat;
pg_data_t *orig = zonelist->zones[0]->zone_pgdat; <<< HERE
failed because zonelist->zones[0] is NULL.
However, we have not found how this is possible. Theoretically, I am willing to concede that it may be caused by some incorrect mempolicy operations. But in this case the process is known to us: it is the vzaserv process executed inside the Service CT.
Therefore I doubt that it is a software issue, because it was never triggered on the other nodes.
The last oops is more interesting:
A general protection fault was generated in the csum_partial_copy_to_user() function on a RETQ instruction, but the call trace looks correct.
In general, these errors look like processor faults to us. A CPUBurn test could probably clarify the situation. I would recommend re-checking the CPU/node coolers and power, and updating the motherboard BIOS.
-------
We did that and came up with no errors; the problem persists.
Can you suggest any method that can help me pinpoint the problem to either the CPUs or the motherboard?
04-23-2009 03:35 AM
Re: How to debug system crashes
If the IML log cannot provide any useful information, the only way I know for debugging this further would be to connect a high-speed logic analyzer to various points in the system bus and start looking for signaling errors :-)
But unless you are working in a place that designs and debugs high-speed electronics for a living, getting the necessary equipment and knowledge to do it would be *extremely* expensive and time-consuming. Much more so than simply buying some spare parts and swapping them one at a time until the problem is gone.
This is one of the situations where it is nice to have a hardware support agreement: with it, you can just call your HW support provider and say "This machine keeps crashing, fix it please!".
If the server is still under warranty, you might get something similar from HP, although the response time won't be as good as with an appropriate support agreement.
MK
04-24-2009 02:35 AM
Re: How to debug system crashes
Another thought that might solve your problem:
Use a kernel without SMP (a single-CPU kernel) to make sure the problem is not in the CPUs.
Good luck.