ProLiant Servers (ML,DL,SL)

HpCISSs3 Error 5008, 129, 153 + Cissesrv 24607 = freezing VMs on node

 
Occasional Advisor


Hello,

One of two ProLiant DL360 Gen9 servers (exactly the same HW setup) is spamming the event log with errors 129, 153 and 5008. The main problem is that VMs freeze on this cluster node and it is not possible to work with them. The MPIO storage source is an MSA 2040.

NODE A is the problematic one. Let's assume there are no VMs running on it. When I move 1 VM to NODE A, errors start showing up (129, 153, 5008). When I move that VM to the other node (the good one), no more errors show up on NODE A.

Then Cissesrv 24607 appeared:

"The event information received from array controller H241 located in server slot 2 was of an unknown or unrecognized class.

An excerpt of the controller message is as follows: Phy Err Thresh Exceeded, DeviceType=254 Port=2E Box=0 Bay=0."

So I started updating everything on NODE A, leaving NODE B as it is, since it is working well.

NODE A running Win2012R2,  System Rom: P89 v2.76 (10/21/2019).
HBA H241 driver: 106.26.0.64
Smart Array P440ar Controller: 106.26.0.64

I found this solution for errors 129 and 5008: https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=emr_na-c04584498 but why uninstall WBEM when it is working well on the other server?

 

I would appreciate any help. Thank you! Jaromir

 

EDIT 18.2.2020

Updating the FW of the MSA 2040 and the H241 HBA to the latest version got rid of the Cissesrv 24607 event.
I am still seeing disk 153 and HpCISSs3 129 and 5008 events, which cause VMs to freeze under heavy load.

12 REPLIES
HPE Pro

Re: HpCISSs3 Error 5008, 129, 153 + Cissesrv 24607 = freezing VMs on node

Where is the VM located? Is it on the MSA2040?
What is the version of Windows running on the server?
How many VMs are running on Node A?
Are the driver and firmware versions of the network adapter and storage controller the same on both nodes?
If the issue is observed only when a VM is running on Node A, did it ever work fine on Node A? If it worked fine before, were any changes made on Node A after which the mentioned events started to appear?
Has the VM configuration verified on Node A? Any difference when compared to Node B?

Please provide the source of the events that are observed on Node A. If possible share the System Event Log file. There are multiple reasons for the said events to be observed in the Event logs. Hence, we need to know the source of the events.

If there are any differences found in the driver, firmware or any other configuration between Node A & B, bring both Nodes to the same version or configuration.
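A comparison like this can be done by hand, but a small script makes mismatches harder to overlook. The following is only a minimal sketch: it assumes the versions from each node have been typed into a dictionary manually, and the component names and the Node B values are hypothetical placeholders, not data from this thread.

```python
from typing import Dict, List, Tuple


def diff_inventories(
    node_a: Dict[str, str], node_b: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (component, version_on_a, version_on_b) for every mismatch."""
    mismatches = []
    for component in sorted(set(node_a) | set(node_b)):
        a = node_a.get(component, "<missing>")
        b = node_b.get(component, "<missing>")
        if a != b:
            mismatches.append((component, a, b))
    return mismatches


# Node A values are the ones quoted in the first post.
node_a = {"storage_driver": "106.26.0.64", "system_rom": "P89 v2.76"}
# Node B values are placeholders (assumption), to be replaced with real ones.
node_b = {"storage_driver": "106.26.0.64", "system_rom": "<value from Node B>"}

for component, a, b in diff_inventories(node_a, node_b):
    print(f"{component}: Node A = {a} | Node B = {b}")
```

An empty result means the two inventories match for every component that was entered; it says nothing about components that were left out of both dictionaries.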

For Event IDs 129 & 5008, please follow the solution in the link below and change the Power Management Settings as it describes:

https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=kc0135886en_us

After making the above changes, move a VM & monitor the server

Provide an update on the status

Thank you


I am an HPE employee
Occasional Advisor

Re: HpCISSs3 Error 5008, 129, 153 + Cissesrv 24607 = freezing VMs on node

Thank you for the reply.

All VMs are on MSA2040
FW on controllers A/B, drivers (HBA) and SPP on nodes A/B are the same.
Win 2012 R2 Datacenter
Node A = 5VMs, Node B = 5 VMs
The issue appeared after updating the SPP on Node A (to version 2017 from 2015)

What do you mean by "Has the VM configuration verified on Node A" ?

Event log: https://instruments.cz/wp-content/uploads/2020/02/event_log_node_a.zip
My own website link is secure.

I experimented with the power options a long time ago. It made no difference whether I ran in balanced or high performance mode: events 129 & 5008 still keep popping up. Just to clarify, Node B is running in balanced mode with no problems at all.

Thank you! Feel free to ask anything.
Jaromir

HPE Pro

Re: HpCISSs3 Error 5008, 129, 153 + Cissesrv 24607 = freezing VMs on node

Thank you for sharing the event log.

As per the event log, Events 129, 153 & 5008 are observed since 14th December 2019.

Please confirm whether this is when the SPP was updated on Node A.

Was the SPP updated only on Node A or was it also updated on Node B & Node B works fine even after SPP upgrade?

What was the exact version of SPP used, what was the exact version of SPP installed previously?

Below is the information we found in the event logs:

Event ID 129: Source: HpCISSs3

The IO operation at logical block address 0x1d3edee00 for Disk 1 (PDO name: \Device\MPIODisk1) was retried.

Event ID 5008: Source: HpCISSs3

The description for Event ID 5008 from source HpCISSs3 cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

Event ID 153: Source: disk

The IO operation at logical block address 0x1d3edee00 for Disk 1 (PDO name: \Device\MPIODisk1) was retried.

The Source indicates an issue with the Disk Drive & the Storage Driver.
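Both quoted messages name the same logical block address, so it can be useful to see roughly where on Disk 1 that block lies. The sketch below converts the hex LBA from the event text into a byte offset. It assumes 512-byte logical sectors, which the thread does not state, so check the real sector size of the volume before relying on the result.

```python
SECTOR_SIZE = 512  # assumption: bytes per logical sector, not confirmed in the thread


def lba_to_byte_offset(lba_hex: str, sector_size: int = SECTOR_SIZE) -> int:
    """Convert a hex LBA string such as '0x1d3edee00' to a byte offset."""
    return int(lba_hex, 16) * sector_size


lba = "0x1d3edee00"  # taken from the event text above
offset = lba_to_byte_offset(lba)
print(f"LBA {int(lba, 16)} -> byte offset {offset} ({offset / 2**40:.2f} TiB)")
```

If retries keep hitting the same region, that may point at a specific area of the volume; addresses scattered across the disk would fit a path or link problem better. Either reading is only a hint, not a diagnosis.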

You have mentioned the current storage Driver version on Node A: 106.26.0.64

Please verify the storage driver version on Node B. If the two servers have different versions, we recommend downgrading the driver on Node A to match the version on Node B. Once this is done, monitor Node A to verify that the events are no longer reported.

The download link for the drivers is below. Under the Revision History tab you will find all the driver versions that are required.

https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_90b6d3b309174d6f86065034ff#tab-history

Update us with the outcome of the above action plan

Thank you


I am an HPE employee
Occasional Advisor

Re: HpCISSs3 Error 5008, 129, 153 + Cissesrv 24607 = freezing VMs on node

Thank you for the reply.

The first time I observed this problem (VMs freezing) was after the Node A SPP update (version 2017 - 2.4_02_17_2017) a few years ago. But nothing appeared in the event log, so I moved on. Then, after the Node A update to SPP 2019 (2.76), it started spamming the event log (14/Dec/2019).

I updated Node A to the latest SPP, then MSA controller A. A few days later, after some tests, I updated Node B (which works fine after the SPP update) and MSA controller B.

Nodes A and B are now ALL THE SAME: SPP (2019, 2.76), MSA controllers A+B (GL225P002-02), HBA FW 7.0 and drivers 106.26.0.64.

So I think there is no need to downgrade.

Thank you for more help
Jaromir

HPE Pro

Re: HpCISSs3 Error 5008, 129, 153 + Cissesrv 24607 = freezing VMs on node

The reported event IDs come from the disks & the storage driver. Please validate all the hard drives & the storage controller in the node. Run the Smart Storage Administrator or the Array Diagnostics Utility to identify any faulty drive or controller that could be the reason for the reported issue.

Please share the report for analysis

Thank you


I am an HPE employee
Occasional Advisor

Re: HpCISSs3 Error 5008, 129, 153 + Cissesrv 24607 = freezing VMs on node

HPE Pro

Re: HpCISSs3 Error 5008, 129, 153 + Cissesrv 24607 = freezing VMs on node

Thank you for sharing the reports of both controllers. The H241 controller does not have any drives connected internally, so we assume it is used to connect the MSA. The P440ar controller has two drives connected internally. We have reviewed both logs & did not find any issues with the controllers or the internally attached disks.

However, the event descriptions point to a disk issue & a controller driver issue. We see that the driver & firmware versions are up to date. Possibly this is due to a problem that occurred while updating a driver or firmware, or it could be faulty hardware (either the controller or a disk in the MSA).

Please refer to the following link for more information: https://docs.microsoft.com/en-us/archive/blogs/ntdebugging/interpreting-event-153-errors

Below are the events from the event log shared:

Warning 2/22/2020 6:06:27 PM disk 153 None

The IO operation at logical block address 0x9c7fb200 for Disk 1 (PDO name: \Device\MPIODisk1) was retried.

Warning 2/22/2020 6:06:27 PM HpCISSs3 5008 None

The description for Event ID 5008 from source HpCISSs3 cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

\Device\00000059
5
0
0
1
2
1

Warning 2/22/2020 6:06:27 PM HpCISSs3 129 None

The description for Event ID 129 from source HpCISSs3 cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

\Device\RaidPort2

**********************************************
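The event header lines quoted above follow the order level, date and time, source, event ID, task category. When there are many such lines, tallying them per source and ID gives a quicker overview than reading them one by one. This is a minimal sketch that assumes every header line has exactly that layout and a US-style 12-hour timestamp; the names `EventLine` and `parse_event_line` are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime
from typing import NamedTuple, Optional


class EventLine(NamedTuple):
    level: str
    timestamp: datetime
    source: str
    event_id: int
    task: str


# Matches lines such as: Warning 2/22/2020 6:06:27 PM disk 153 None
LINE_RE = re.compile(
    r"^(?P<level>\w+)\s+"
    r"(?P<stamp>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\s+"
    r"(?P<source>\S+)\s+(?P<event_id>\d+)\s+(?P<task>.+)$"
)


def parse_event_line(line: str) -> Optional[EventLine]:
    """Parse one header line; return None for description or blank lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return EventLine(
        level=m["level"],
        timestamp=datetime.strptime(m["stamp"], "%m/%d/%Y %I:%M:%S %p"),
        source=m["source"],
        event_id=int(m["event_id"]),
        task=m["task"],
    )


# The three header lines quoted in this post.
sample = """\
Warning 2/22/2020 6:06:27 PM disk 153 None
Warning 2/22/2020 6:06:27 PM HpCISSs3 5008 None
Warning 2/22/2020 6:06:27 PM HpCISSs3 129 None
"""

events = [e for e in map(parse_event_line, sample.splitlines()) if e]
counts = Counter((e.source, e.event_id) for e in events)
for (source, event_id), n in sorted(counts.items()):
    print(f"{source} {event_id}: {n}")
```

Grouping the parsed events by `timestamp` instead would show whether 129, 5008 and 153 always arrive together, as they do in the excerpt.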

We recommend checking all the hard drives in the MSA for faults and re-installing (force install) the controller drivers on the server. If any drive is found faulty, replace it.

Thank you


I am an HPE employee
HPE Pro

Re: HpCISSs3 Error 5008, 129, 153 + Cissesrv 24607 = freezing VMs on node

After reviewing the reports that were shared through the FTP, we confirm that the issue is not within the server. It could be on the MSA port or the path connected to Node A.

If this is an FC connectivity, schedule a downtime & reseat the SFP module & the cable that is connected to Node A & the respective controller on the MSA.

If this is a SAS-based MSA, reseat the cables between Node A and the MSA.

Monitor to check whether the same events are logged again. If they are, the hard drives installed in the MSA need to be validated for faults.

Thank you.


I am an HPE employee
Occasional Advisor

Re: HpCISSs3 Error 5008, 129, 153 + Cissesrv 24607 = freezing VMs on node

Thank you for the reply.

Yes, the H241 is connected to the MSA via SAS. I did cable maintenance while upgrading the MSA controllers (disconnect, swap for new cables, connect). I also forced a reinstall of the controller driver at the latest version once the MSA FW upgrade was finished. Result: fewer and fewer events, but still some.

The last 153 events are from 29/FEB/2020 and the last 5008 + 129 + 153 events are from 22/FEB/2020. Since then, NONE.

I was doing Windows updates yesterday (all 10 VMs at once) and NO event appeared in the log; the VMs did NOT freeze once.
I think the hard drives in the MSA are fine, because Node B shows no events at all while working with the same pool on the MSA. One more thing just crossed my mind: what about updating the MSA disk drive firmware? I have revision HPD1.

So far it looks good. I have no idea what is causing those events or whether they will appear again.
I am a little confused, because now it looks good. Is it possible that the MSA controllers + Node A needed some time to "settle down" and will work fine from now on?

Thank you

 

EDIT:

So now I was doing some work on a VM when it suddenly froze, and 129 + 5008 + 153 events appeared in the Node A log.