HPE 3PAR StoreServ Storage

3PAR 7200c Node Offline Due to Failure {0xd}, Node IDE Drive Failure {0x27}, Fatal Boot Error {0x29}

 
Blago
Advisor


Hey, boys and girls,

One of my nodes, node 1, has failed with the following error:

Node 1, SubSys Device Unknown, SubSys Instance 0 Failed (Node Offline Due to Failure {0xd} , Node IDE Drive Failure {0x27} , Fatal Boot Error {0x29} )
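If you want to pull the individual failure codes out of an event message like the one above (for example, to search the forum or the service documentation by code), here is a minimal sketch. It assumes every entry follows the "Description {0xNN}" pattern seen in this message; the function name is hypothetical. Note that 0x27 is 39 in decimal, the same number as the "Code 39" in the serial output below, although nothing in this thread confirms that the two refer to the same condition.

```python
import re

# Each failure entry in the message looks like "Some Description {0xNN}".
FAILURE_RE = re.compile(r"([A-Za-z][A-Za-z ]*?)\s*\{(0x[0-9a-fA-F]+)\}")


def parse_failure_codes(message: str) -> dict[str, int]:
    """Map each failure description in the event message to its numeric code."""
    return {desc.strip(): int(code, 16) for desc, code in FAILURE_RE.findall(message)}


if __name__ == "__main__":
    msg = ("Node 1, SubSys Device Unknown, SubSys Instance 0 Failed "
           "(Node Offline Due to Failure {0xd} , Node IDE Drive Failure {0x27} , "
           "Fatal Boot Error {0x29} )")
    for desc, code in parse_failure_codes(msg).items():
        print(f"{desc}: 0x{code:x} ({code})")
```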

After connecting to the serial port and rebooting the node, I found the following output:

Booting from device 0, AHCI0Invalid boot sector.

Use "boot net install" to correct this.


*** Fatal error: Code 39, sub-code 0x10 (0).

AHCI: identify failed on detected port 0

Neither AHCI drive is online

Disabled FSBC boot watchdog.


*** Fatal error: Code 39, sub-code 0x10 (0).


*** Fatal error: Code 39, sub-code 0x10 (0).

This looks like a failure of the node's boot disk.

My question is: what is the best path?

Purchase a spare SSD for the node? If so, what procedure needs to be followed for a rescue?

or

Purchase a spare SSD for the node and then copy the data from the healthy node's SSD to the new one with dd (under Linux)?

There could be a better option, but these are the two that seem straightforward from all the forum discussions I went through.

Cheers

Torsten.
Acclaimed Contributor
Solution


Install a new SSD and the node will recover automatically (it will perform a node rescue).

Hope this helps!
Regards
Torsten.

Blago
Advisor


Update:

I replaced the SSD and put node 1 back into the 3PAR array; everything synced, and both nodes are now working as expected.

For anyone who needs to do this, here are some practical steps:

After replacing the SSD and reinserting the node, you will have to connect it to the network. If it is connected to a managed switch that supports spanning tree, configure the switch port the node is connected to with spanning-tree portfast, so the port starts forwarding immediately and the network boot is not held up by the spanning-tree listening and learning states.

Also, if for some reason the SSD is not empty, format it.
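If you prepare the replacement SSD on a separate Linux machine, the following is a minimal sketch for checking whether its first sector is blank and, if not, zeroing it. This is an assumption on my part: the boot log below prints "First sector is blank." when the disk is accepted, so I am guessing that an all-zero first sector is what the node checks for. The 512-byte sector size, the function names and the device path are all hypothetical; wiping the whole disk (or formatting it, as described above) is the safer choice if you are unsure. Zeroing a real device is destructive, so double-check the path before running it.

```python
import os
import tempfile

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def first_sector_is_blank(path: str, sector_size: int = SECTOR_SIZE) -> bool:
    """Return True if the first sector of `path` contains only zero bytes."""
    with open(path, "rb") as dev:
        sector = dev.read(sector_size)
    # A short read (smaller than one sector) is treated as "not blank".
    return len(sector) == sector_size and not any(sector)


def wipe_first_sector(path: str, sector_size: int = SECTOR_SIZE) -> None:
    """Overwrite the first sector of `path` with zeros. DESTRUCTIVE on real disks."""
    with open(path, "r+b") as dev:
        dev.seek(0)
        dev.write(bytes(sector_size))
        dev.flush()
        os.fsync(dev.fileno())  # make sure the zeros actually reach the device


if __name__ == "__main__":
    # Demo on a throwaway image file instead of a real device such as /dev/sdX.
    with tempfile.NamedTemporaryFile(delete=False) as img:
        img.write(b"\xeb\x63\x90" + bytes(1021))  # fake boot-sector bytes + padding
        path = img.name
    print("blank before wipe:", first_sector_is_blank(path))
    wipe_first_sector(path)
    print("blank after wipe:", first_sector_is_blank(path))
    os.remove(path)
```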

Once the SSD is detected as empty, you should get output like this:

....
--> Starting: [22] Mass Storage Initialization
--> Starting: [23] Mass Storage Drive Test
AHCI: skipping register test
AHCI: skipping controller test
AHCI0 seek test: PASS
AHCI0 read test: PASS
AHCI: skipping DMA read test
First sector is blank.
--> Starting: [24] BIOS Automatic Update
AHCI0 MBR does not have a valid partition table
AHCI0 MBR does not have a valid partition table
AHCI0 MBR does not have a valid partition table
AHCI0 MBR does not have a valid partition table
--> Starting: [25] Fibre HBA Automatic Update
AHCI0 MBR does not have a valid partition table
AHCI0 MBR does not have a valid partition table
AHCI0 MBR does not have a valid partition table
AHCI0 MBR does not have a valid partition table
AHCI0 MBR does not have a valid partition table
AHCI0 MBR does not have a valid partition table
--> Starting: [26] Cluster Manager Reset
--> Starting: [27] Cluster Memory Initialization
Harrier ASIC found at 5.0.0. CSR at 0xc0000000
Set0 DC DIMM 0.0.0 (J0300): Part #18HVF25672PZ-80EH1 SN:hidden
Set0 DC DIMM 0.1.0 (J0301): Part #18HVF25672PZ-80EH1 SN:hidden SPD Passed
Initializing Harrier at 5.0.0...
CSR=00:c0000000 SMW=00:a0000000 LMW=10:00000000 MMW=0c:00000000
Running BIST Test: [GOOD]
CMA1: LPC1 is not Up, Trying LPC0
DMA PM using LPC2 CMA0 using LPC0 CMA1 using LPC0
RPC PM using LPC2 CMA0 using LPC0 CMA1 using LPC0
Set0 DC DIMM 0.0.0 (J0300): 2048 MB CL4/6
Set0 DC DIMM 0.1.0 (J0301): 2048 MB CL4/6
Present 1 Set 0, Side 0 Base = 00:00000000 Size = 00:ffffffff
Testing CMA interrupts: [............................]
Zero 32M cluster memory window: [................]
Zero 4096M cluster memory window: [...............................]
Scan cluster memory for errors: [................................]
CMA0 Cluster Memory is configured: 4096 MB CL4
Total Cluster Memory configured: 4096 MB
Harrier LPC Port Test: [............]
--> Starting: [28] Cluster Memory Diagnostic
Testing CMA0 SMW data lines with walking 1
Testing CMA0 SMW data lines with walking 0
Testing CMA0 MMW data lines with walking 1
Disabled FSBC boot watchdog.

Whack>
Whack>set perm cnt_no_os_boot=0
Whack>set perm cnt_no_cluster=0
Whack>set perm cnt_no_shutdown=0
Whack>set perm cnt_os_panic=0
Whack>set perm cnt_same_fatal=0
Whack>set perm cnt_log_error=0
Whack>set perm sys_serial=1623169
Whack>net addr 169.254.251.110
My address is 169.254.251.110
Whack>net netmask 0.0.0.16
Network mask 0.0.0.16
Whack>net gateway 10.102.7.246
Gateway address 10.102.7.246
Whack>net server 10.102.7.246
Server address 10.102.7.246
Whack>boot net install cr=2 ipaddr=169.254.251.110 nm=255.255.0.0 rp=10.102.7.246::rescue hn=1623169_1
Testing CMA0 MMW data lines with walking 0
Testing CMA0 SMW ECC lines with walking 1, channel 0
Testing CMA0 SMW ECC lines with walking 1, channel 1
Testing CMA0 SMW ECC lines with walking 0, channel 0
Testing CMA0 SMW ECC lines with walking 0, channel 1
Testing CMA0 Opcode: Zero ECC XOR PAR OR CRC
Testing CMA0 SRC Interrupts CM->CM PM->CM CM->PM PM->PM
Testing CMA0 SMW address lines with walking 1
Testing CMA0 SMW address lines with walking 0
Testing CMA0 MMW address lines with walking 1 (first 4 GB only)
Testing CMA0 MMW address lines with walking 0 (first 4 GB only)
Testing CMA0 with random XOR (all Cluster Memory)
4095 MB [..................................................]
--> Skipping: [29] Cluster PCI Diagnostic
--> Starting: [30] DIMM Information Collection
--> Skipping: [31] Manufacturing Centerpanel GPIO Test
--> Skipping: [32] Serial Port Loopback Test
--> Skipping: [33] Manufacturing Serial Port Test
--> Starting: [34] Environmental Initialization
--> Skipping: [35] Environmental Temperature and Voltage Test
--> Skipping: [36] Test CPU Operation
--> Skipping: [37] Manufacturing Cluster Link Initialization
--> Skipping: [38] Manufacturing Cluster Link Test
--> Skipping: [39] PCI Fibre Channel/SAS Adapter Test
--> Skipping: [40] PCI iSCSI Adapter Test
--> Starting: [41] System Error Reporting Init
--> Skipping: [43] Manufacturing CP EEPROM Test
--> Skipping: [44] Certify OS Startup
--> Skipping: [45] Table Execution Summary
+-----------------------------------------------------------------------------+
HP SPI Image 3.1.05. Release version. 12:21:30 Aug 19 2014.
+-----------------------------------------------------------------------------+
| CPU 1 x 1.80 GHz Sandy Bridge hexa core HT dual
| Control Cache Size 15.100 GB (cpu mem type 11) CL9
| Pair0 DIMM0 (J0155): 16384 MB CL6/11
| Data Cache Size 4 GB CL4/6
| Set0 DC DIMM 0.0.0 (J0300): 2048 MB CL4/6
| Set0 DC DIMM 0.1.0 (J0301): 2048 MB CL4/6
| Slot ID 1 [2 Node HP 3PAR 7200c Centerplane]
| FPGA Eos v0.f
| SATA0 Disk SanDisk DX110128A5xnNMRI hidden 128 GB
| PCI Slot 0 LSI-SAS 9205-8e
| PCI Slot 1 Emulex LPe12002 VM8017hidden 2-port
| Board 920-200040.B6 FXN 2015/16/Tue 03081379 hidden
| Cluster serial 1623169 hidden
| Board reset reason PCI_RESET
| Current Time 2020-11-26 18:47:52 (UTC+2)
+-----------------------------------------------------------------------------+
Booting from net...
TFTP "install" from 10.102.7.246....................................................................................................complete

Setting FSB WDT Boot Complete State.
Devs: 263
Linux version 2.6.18-prep-3pardata (ddejong@moes) (gcc version 4.4.5 (Debian 4.4.5-8) ) #8 SMP Tue Jul 10 15:21:46 PDT 2012
Command line: BOOT_IMAGE=install console=ttyS0,57600 acpi_hest=off acpi=ht genapic=phys cr=2 ipaddr=169.254.251.110 nm=255.255.0.0 rp=10.102.7.2461
BIOS-provided physical RAM map:
BIOS-e820: 0000000000010000 - 000000000009ec00 (usable)
BIOS-e820: 000000000009ec00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007b675000 (usable)
BIOS-e820: 000000007b675000 - 000000007d676000 (reserved)
BIOS-e820: 000000007d676000 - 000000007f0ff000 (usable)
BIOS-e820: 000000007f0ff000 - 000000007f1ff000 (reserved)
BIOS-e820: 000000007f1ff000 - 000000007f6ff000 (ACPI NVS)
BIOS-e820: 000000007f6ff000 - 000000007f7f3000 (ACPI data)
BIOS-e820: 000000007f7f3000 - 000000007f800000 (usable)
BIOS-e820: 000000007f800000 - 000000007fc00000 (reserved)
BIOS-e820: 0000000080000000 - 0000000090000000 (reserved)
BIOS-e820: 00000000fed1c000 - 00000000fed20000 (reserved)
BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000480000000 (usable)
DMI present.
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
ACPI: PM-Timer IO Port: 0x408
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
...
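The Whack line that starts the rescue (boot net install cr=2 ipaddr=... nm=... rp=...::rescue hn=...) is a list of key=value parameters. Below is a minimal sketch for pulling them apart to double-check the values before or after the rescue. The meanings are my guesses from this log, not from HPE documentation: ipaddr/nm look like the node's address and netmask, rp looks like "server::image" (10.102.7.246 and "rescue" here), and hn looks like the system serial (1623169, matching sys_serial) plus "_" and the node number. The meaning of cr is unknown and is left uninterpreted. The function names are hypothetical.

```python
import ipaddress


def parse_boot_args(command: str) -> dict[str, str]:
    """Collect the key=value tokens from a Whack `boot net install ...` line."""
    # Words without '=' ("boot", "net", "install") are the command itself and are skipped.
    return dict(tok.split("=", 1) for tok in command.split() if "=" in tok)


def describe_rescue(args: dict[str, str]) -> dict[str, object]:
    """Interpret a few fields. The meanings are guesses from the log, not HPE docs."""
    server, _, image = args["rp"].partition("::")  # e.g. "10.102.7.246::rescue"
    serial, _, node = args["hn"].partition("_")    # e.g. "1623169_1"
    return {
        "node_ip": ipaddress.ip_address(args["ipaddr"]),
        # strict=False lets us pass a host address with a dotted netmask.
        "node_net": ipaddress.ip_network(f"{args['ipaddr']}/{args['nm']}", strict=False),
        "server": ipaddress.ip_address(server),
        "image": image,
        "serial": serial,
        "node": int(node),
    }


if __name__ == "__main__":
    cmd = ("boot net install cr=2 ipaddr=169.254.251.110 nm=255.255.0.0 "
           "rp=10.102.7.246::rescue hn=1623169_1")
    for key, value in describe_rescue(parse_boot_args(cmd)).items():
        print(f"{key}: {value}")
```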

 

 
