HPE 9000 and HPE e3000 Servers

SuperDome Hardware

 
SOLVED
Wanderers
Occasional Contributor

SuperDome Hardware

Does the system continue to run in case of a memory fault on any one DIMM on a cell board in a Superdome?

I suppose the memory on cell boards is globally available. In case of a memory fault, even though the remaining memory may still be available to the system, does the system continue to run or does it reboot?
8 REPLIES
Sunil Sharma_1
Honored Contributor

Re: SuperDome Hardware

Hi,

It depends on what is executing from that DIMM. If a kernel thread is running from the faulty memory, the system will panic and reboot.

Sunil
*** Dream as if you'll live forever. Live as if you'll die today ***
Michael Duthie
Trusted Contributor

Re: SuperDome Hardware

Depends on the severity of the fault. Single-bit errors are no problem. Double-bit errors may need a reboot to deallocate the faulty page.

Is that still the case?
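
To illustrate the single-bit versus double-bit distinction, here is a minimal Python sketch of a SECDED (single-error-correct, double-error-detect) code, using an extended Hamming(8,4) code on a 4-bit value. It is only a toy: the actual ECC layout on Superdome cell boards is not described in this thread, and the names `encode` and `decode` are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd total parity: treat it as a single flipped bit and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even total parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of a codeword makes `decode` return 'corrected' along with the original value. Flipping any two bits makes it return 'uncorrectable', which is the situation where the data in that memory location can no longer be trusted.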
Michael Steele_2
Honored Contributor
Solution

Re: SuperDome Hardware

First, do you mean continue to run in the cell and the one partition affected or the entire SD?

Second, most hardware on SDs is hot-swappable.

Third, what does "parstatus" indicate?

parstatus -A -C (* all avail cells *)

parstatus -P (* partition *)

Fourth, what does dmesg from the partition indicate your physical memory is?

dmesg | grep -i phy (* cross reference with what should be in there *)

Fifth, what does your virtual front panel say? Any faults indicated? Especially upon boot up?
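
The cross-reference in the fourth point can be scripted. Below is a rough Python sketch that pulls the physical memory figure out of saved dmesg output and compares it with what should be installed. It assumes the dmesg line looks like "Physical: N Kbytes, lockable: ..., available: ..."; the function names and the sample values are hypothetical.

```python
import re


def physical_kbytes(dmesg_text):
    """Return the physical memory in Kbytes reported in dmesg output, or None."""
    # Assumed line format: "Physical: 2097152 Kbytes, lockable: ..., available: ..."
    match = re.search(r"physical:\s*(\d+)\s*kbytes", dmesg_text, re.IGNORECASE)
    return int(match.group(1)) if match else None


def missing_kbytes(dmesg_text, expected_kbytes):
    """Return how many Kbytes short of expected_kbytes the partition is (0 if none).

    Raises ValueError if no physical-memory line is found.
    """
    found = physical_kbytes(dmesg_text)
    if found is None:
        raise ValueError("no 'Physical: N Kbytes' line found")
    return max(expected_kbytes - found, 0)


# Hypothetical sample line; the numbers are made up for illustration.
sample = "    Physical: 2097152 Kbytes, lockable: 1573604 Kbytes, available: 1810320 Kbytes"
print(missing_kbytes(sample, 4 * 1024 * 1024))
```

With the made-up sample above and an expected 4 GB, this prints 2097152, i.e. 2 GB is unaccounted for.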

Support Fatherhood - Stop Family Law
Michael Duthie
Trusted Contributor

Re: SuperDome Hardware

Sorry.

I meant reboot the cell, not the Superdome :-)
Wanderers
Occasional Contributor

Re: SuperDome Hardware

Hi, thanks guys, your replies have cleared up most of my doubts, but I suppose I am still not 100% clear.
I would like to rephrase my question.

In case of a cell board failure, the system reboots and comes up with the other cell board within the partition or system. What I am trying to figure out is this: if an individual CPU or any of the memory DIMMs within a cell board fails, does the system isolate these components and continue to run, or does it reboot with the remaining working cell board?
It's clear that the Superdome does not function like a non-stop mainframe but more like an HA system. But I am trying to understand how much redundancy can be built into it.
Your answers have given me a clearer idea of the architecture, but I would request you to provide any links to articles that discuss the Superdome architecture in detail.
Sunil Sharma_1
Honored Contributor

Re: SuperDome Hardware

Hi,

As I told you in my earlier posting,
whether the server reboots or recovers from the error depends entirely on the instruction or process thread running on the defective part at that point in time. In HP terms, if the system receives an LPMC (Low Priority Machine Check) it can recover, but if it is an HPMC (High Priority Machine Check) it cannot. Suppose some real-time process is running on cpu1 at the time that CPU fails: the system will panic. But if some regular process is running, such as an application process, only that process will crash and dump core.

Sunil
*** Dream as if you'll live forever. Live as if you'll die today ***
Michael Duthie
Trusted Contributor

Re: SuperDome Hardware

Try this.

http://www.hp.com/products1/servers/scalableservers/superdome/infolibrary/index.html

Look for examples using Superdomes and Serviceguard. If one cell does receive an HPMC and panics, Serviceguard can be used to switch the application to another cell or even another Superdome.


Michael Steele_2
Honored Contributor

Re: SuperDome Hardware

4 GB is the minimum amount of memory in an SD: 2 x 2 GB DIMMs. If one DIMM fails, the cell continues to operate on the other 2 GB DIMM.

Can you run parmgr and get the partition configuration where the total memory is listed?

You have a Genesis cell which is the largest on the SD and located in partition 0, cell 0 in cabinet 0. User 1 has access to this cell. What is your GSP user id?

Also, SDs will HPMC, so check the /var/tombstones/ts99 file as well. Let me know if it's there.

Learn these GSP commands for SD and check the GSP logs:

control b
login
co - console (* list of partitions *)
cm - command menu
cl - console logs (* GSP>cl *)
sl - chassis logs (* GSP>sl>e>return *)
vfp - virtual front panel (* GSP>vfp>partition #>E>errors since boot *)
who - list of logged in users
x - exit
Support Fatherhood - Stop Family Law