BSOD on BL460c

Terai
Occasional Advisor

BSOD on BL460c

Hello,

I've run into BSODs many times since March. With support from Microsoft and hp, I've been working on this problem for more than 6 months, but the BSODs still happen.

The BSODs happen at irregular intervals, anywhere from 3 times in a week to none in a month. I turned Driver Verifier on and kept monitoring for a while, but for some reason no BSOD happened during that period.

The stop codes vary among 0x0a, 0x8e and 0xD0. 0xD0 happened most often (5 times across 3 servers in one month!). Microsoft and hp have checked the memory dumps for all the BSODs, and almost all of them were caused by memory corruption (overwritten by nil), but so far they have not been able to pin down which process corrupted the memory.

The servers details are as follows.

Hardware
- ProLiant BL460c's x 3 in a C7000 enclosure
- connected to EVA 4100 by Qlogic's QMH2462 4Gb FC HBAs
- NC325m 4-ports Gb network adapter

Software
- Windows 2003 Server Enterprise R2 SP1(these servers are used as just Windows file servers.)
- SystemRom(BIOS)  2008/2/29
- SmartArrayE200i 1.72
- SAS HDD HPD9
- QMH2462 1.26
- NC373i 1.9.6
- NC325m 3.28
- HP PSP 8.00
- Qlogic 9.1.3.18 for HP 
- StorPort.sys 5.2.3790.3030 
- Clusdisk.sys 5.2.3790.2938

Configuration
- MSCS is running as a 3-node failover ring cluster.

Usage
- 300 client connections per server
- 1.5% average CPU usage

If anyone has had similar problems and successfully fixed them, please give me some advice.
Thanks in advance.
64 REPLIES
Steven Clementi
Honored Contributor

Re: BSOD on BL460c

Have you checked the integrated Management Log? The iLo logs?

What version of MPIO are you using?

What happened in March, when this first started? Any upgrades? Drivers? firmware?

What version of XCS are you running on the EVA? What about the firmware on your SAN switches?


Steven
Steven Clementi
HP Master ASE, Storage and Clustering
MCSE (NT 4.0, W2K, W2K3)
VCP (ESX2, Vi3, vSphere4, vSphere5)
RHCE
NPP3 (Nutanix Platform Professional)
Terai
Occasional Advisor

Re: BSOD on BL460c

Hello, Steven,

Thanks for your response. Pls find my answers below.

>>Have you checked the integrated Management Log? The iLo logs?

When the BSOD happened, only a Blue Screen Trap was recorded, like this:
"Blue Screen Trap (BugCheck,STOP: 0x000000D0 (0x00000008,0x00000002,0x00000000,0xE089AF77))"

There was no iLo log.

>>What version of MPIO are you using?
MPIO : 3.00.00

>>What version of XCS are you running on the EVA? What about the firmware on your SAN switches?

XCS : 6.110
SANswitch : 5.3.0.d

Thanks.
Terai
Occasional Advisor

Re: BSOD on BL460c

Steven,

I forgot to answer one thing.

>>What happened in March, when this first started? Any upgrades? Drivers? firmware?

March is when these servers were initially installed. Since then BSOD happened as follows.

March: Once (0xD0)
Apr & May: none
June: 5 (all 0xD0)
July: twice (0x0a & 0x8e)
Aug: none
Sep: upgraded to Win2K3 SP2 but BSOD happened (0x0a)

thanks.
Angelina
Occasional Visitor

Re: BSOD on BL460c

Hi,

0xD0 is the typical stop code for pool corruption, which is mostly caused by a bad driver.

Please let me know a couple of things.

Have you ever done memory dump analysis?
Post the analysis report if so.

Did you install any 3rd-party software, especially software that contains filter drivers?
Symantec or Trend Micro products are examples.

What was the system's load when you got that problem?

Did you see the msdn article and try ProtectNonPagedPool registry?
http://msdn.microsoft.com/en-us/library/ms796128.aspx
Please be cautious when setting this registry value: it might increase the likelihood of a BSOD, but it might also help you identify the root cause.
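
For reference, here is a minimal sketch of how that value could be set from a script. The key path and the value type (REG_DWORD, 1 = enabled) are written from memory of the Microsoft documentation, not taken from this thread, so please verify them against the article above before use. The helper name build_reg_command is made up for illustration. The script only builds and prints the reg.exe command line; it does not change anything. As far as I know the setting is only read at boot, so a restart would be needed.

```python
# Sketch only: build (but do not run) a reg.exe command that enables
# ProtectNonPagedPool. The key path and value type are assumptions to verify
# against the Microsoft article; build_reg_command is a hypothetical helper.
MEMORY_MANAGEMENT_KEY = (
    r"HKLM\SYSTEM\CurrentControlSet\Control"
    r"\Session Manager\Memory Management"
)


def build_reg_command(enabled=True):
    """Return the argument list for 'reg add' that sets the value to 1 or 0."""
    return [
        "reg", "add", MEMORY_MANAGEMENT_KEY,
        "/v", "ProtectNonPagedPool",
        "/t", "REG_DWORD",
        "/d", "1" if enabled else "0",
        "/f",  # overwrite an existing value without prompting
    ]


def as_command_line(args):
    """Quote arguments that contain spaces so the line can be pasted into cmd."""
    return " ".join('"%s"' % a if " " in a else a for a in args)


if __name__ == "__main__":
    # On the server itself, the list could instead be passed to
    # subprocess.run(build_reg_command(), check=True).
    print(as_command_line(build_reg_command()))
```

Passing enabled=False produces the command that sets the value back to 0, which is useful if the extra checking makes the server less stable.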

Terai
Occasional Advisor

Re: BSOD on BL460c

Hello, Angelina,

Thank you for your comment.
As to your questions, please find the below.

Q1. Have you ever done memory dump analysis?
Post the analysis report if so.
-> Find the attached file. Pls note the stop error code of this case was 0x8E, not 0xD0.

Q2. Did you install any 3rd-party software, especially software that contains filter drivers?
Symantec or Trend Micro products are examples.
-> Symantec's Backup Exec agent is installed on each server.

Q3. What was the system's load when you got that problem?
-> The system's load was not particularly high.

Q4. Did you see the msdn article and try ProtectNonPagedPool registry?
-> Discussing this point with our Microsoft and hp consultants.

Regards,
O. Terai
Blade user
Occasional Advisor

Re: BSOD on BL460c

Hi

We've also just moved across to the C7000 enclosure with Bl460c's and are experiencing the same issue.
We are using a Hitachi SAN Array instead of an EVA, but apart from that everything else is pretty much the same.
Looking at applying the ProtectNonPagedPool key , but sounds like it may be something common to the BL460c Blade?

Any help would be greatly appreciated.

Blazhev_1
Honored Contributor

Re: BSOD on BL460c

Check the memory dump; most probably it is the multifunction NIC driver...
Eric Gazrighian
Occasional Visitor

Re: BSOD on BL460c

I have also encountered BSODs and ASRs on numerous ProLiant BL460c G1 servers since October 2008, such as Blue Screen Trap (BugCheck, STOP: 0x000000D0 (0x00000008, 0xD0000002, 0x00000000, 0xE0899F77)), in different racks using the latest rack firmware, blade firmware and the 8.10 ProLiant Support Pack. I cannot find any valid cause for the moment.

Management Processor Firmware (Active) 1.60
Server Blade Enclosure Firmware 2.25
System ROM Firmware-I15 (Active) 2008.06.25
System ROM Firmware-I15 (Redundant) 2008.01.24
HP NC373i Multifunction Gigabit Server Adapter 1.9.6
HP NC373i Multifunction Gigabit Server Adapter #2 1.9.6
HP NC373i Multifunction Gigabit Server Adapter 1.1.3
HP NC373i Multifunction Gigabit Server Adapter #2 1.1.3
Disk Drive Firmware HPDA
Disk Drive Firmware HPDA
Storage Enclosure Processor Firmware N/A
Storage Enclosure Processor Firmware N/A
Array Controller Firmware 1.72


karim h
Valued Contributor

Re: BSOD on BL460c

Check this bulletin on smartarray drivers and storport.sys


http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c01068337&dimid=1489071792&dicid=alr_may07&jumpid=em_alerts/us/may07/all/xbu/emailsubid/mrm/mcc/loc/rbu_category/alerts

***PROBLEM***
Affected Software Configuration:

- Any Edition of Microsoft Windows Server 2003 (x86 or x64).
AND

- HP ProLiant Smart Array 5x/6x Controller Driver (HPCISSS.SYS) Version 5.18.0.64 (or earlier) OR HP ProLiant Smart Array SAS/SATA Controller Driver (HPCISSS2.SYS) Version 5.10.0.32 or 5.10.0.64 (or earlier).
AND

- Microsoft Storport Driver for Windows Server 2003 Version 5.2.3790.2880 (for SP1) OR 5.2.3790.4021 (for SP2) from Microsoft KB932755.
AND

- HP Insight Management Storage Agents (any version).

***RESOLUTION***
The blue screen event has been corrected in the following updates:

For ProLiant Servers Running 64-bit Versions of Windows Server 2003:

(HPCISSS.SYS) HP ProLiant Smart Array 5x and 6x Controller Driver for Windows Server 2003 x64 Editions Version 6.4.0.64 (or later)

(HPCISSS2.SYS) HP ProLiant Smart Array SAS/SATA Controller Driver for Windows Server 2003 x64 Editions Version 6.2.0.64 (or later)

For ProLiant Servers Running 32-bit Versions of Windows Server 2003:

(HPCISSS2.SYS) HP ProLiant Smart Array SAS/SATA Controller Driver for Windows Server 2003 Version 6.2.0.32 (or later)

fmags24
Advisor

Re: BSOD on BL460c

I too have experienced these BSOD's in my server environment. We have 48 BL460's in three C7000 enclosures. All servers are Win2K3 SP2
- SystemRom (BIOS) 2008.09.29
- SmartArrayE200i 1.80
- NC373i 4.6.16.0
- HP PSP 8.1
- Quad-Core Intel Xeon, 3000 MHz

All servers have experienced a blue screen at one point in time and they happen at different times over the last couple of months. They will work fine for a month or so and then BSOD with a STOP: 0X0000000A, STOP: 0X000000D0, or a STOP: 0X0000008E for no reason at all. There was not a high load during the time of the BSOD. Also, all of these servers were migrated to blades using the HP SMP-P2P software from DL385 G1's.

I have tried the protectnonpagedpool registry key and have updated all drivers to the latest and greatest, but two servers blue screened over the weekend.

Has anyone had any luck with the storport driver fix?
fmags24
Advisor

Re: BSOD on BL460c

I'm sorry, my storport driver is HpCISSs2.sys version 6.13.0.32.

According to the article, this problem should have been corrected with Version 6.2.0.32 (or later).
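
As a sanity check on the "or later" comparison, here is a small sketch that compares dotted driver versions field by field. The function names are made up for illustration, and it assumes every field is a plain integer. The point is that 6.13.0.32 is later than 6.2.0.32 when compared numerically, even though a plain string comparison says the opposite.

```python
def parse_version(text):
    """Turn '6.13.0.32' into (6, 13, 0, 32) so the fields compare as numbers."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum):
    """True if the installed version is the same as or later than the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    installed = "6.13.0.32"   # HpCISSs2.sys version reported above
    minimum = "6.2.0.32"      # fixed version named in the HP bulletin (32-bit)
    print("numeric comparison:", meets_minimum(installed, minimum))
    # Comparing the raw strings goes wrong because '1' sorts before '2'.
    print("string comparison: ", installed >= minimum)
```

So by the bulletin's own criteria this driver version is already past the fix, which suggests the crashes here have a different cause.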
karim h
Valued Contributor

Re: BSOD on BL460c

I have opened a case with HP - hopefully everyone else has done so, so that this problem can get some attention...

HP has suggested that I update my STORPORT driver to the following-

Windows 2003 SP1 Storport.sys 5.2.3790.3148
Windows 2003 SP2 Storport.sys 5.2.3790.4303

http://support.microsoft.com/kb/950448/en-us

Blade user
Occasional Advisor

Re: BSOD on BL460c

Please let me know if you get an answer.
We are using storport.sys version 5.2.3790.3959 and experiencing the issue.
Updating the storport.sys driver feels a bit hit and miss, as the article doesn't specifically mention the problems we are having. The new storport driver resolves a clustering error?
karim h
Valued Contributor

Re: BSOD on BL460c

Has anyone run the kernel dumps/minidumps through Windbg and got any answers? Please post your results if you have them.
karim h
Valued Contributor

Re: BSOD on BL460c

Attached analysis of minidump file below (again if anyone else can post their dump files that would be great..) -


Blue Screen Trap (BugCheck, STOP: 0x000000D1 (0x0000000100060049, 0x0000000000000002, 0x0000000000000000, 0xFFFFFADF900AC743))


>>>>>>


2: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: 0000000100060049, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffffadf900ac743, address which referenced memory

Debugging Details:
------------------


READ_ADDRESS: 0000000100060049

CURRENT_IRQL: 2

FAULTING_IP:
storport!RaidXrbSetDataBufferAddress+2b
fffffadf`900ac743 488b4848 mov rcx,qword ptr [rax+48h]

CUSTOMER_CRASH_COUNT: 1

DEFAULT_BUCKET_ID: DRIVER_FAULT_SERVER_MINIDUMP

BUGCHECK_STR: 0xD1

PROCESS_NAME: cqmgstor.exe

LAST_CONTROL_TRANSFER: from fffff8000102e5b4 to fffff8000102e890

STACK_TEXT:
fffffadf`8cecc258 fffff800`0102e5b4 : 00000000`0000000a 00000001`00060049 00000000`00000002 00000000`00000000 : nt!CmpDelayCloseWorker+0xa7
fffffadf`8cecc260 00000000`0000000a : 00000001`00060049 00000000`00000002 00000000`00000000 fffffadf`900ac743 : nt!CmpDelayDerefKCBDpcRoutine+0x50
fffffadf`8cecc268 00000001`00060049 : 00000000`00000002 00000000`00000000 fffffadf`900ac743 fffffadf`9cd27b10 : 0xa
fffffadf`8cecc270 00000000`00000002 : 00000000`00000000 fffffadf`900ac743 fffffadf`9cd27b10 00000000`00000000 : 0x1`00060049
fffffadf`8cecc278 00000000`00000000 : fffffadf`900ac743 fffffadf`9cd27b10 00000000`00000000 00000000`00000000 : 0x2
fffffadf`8cecc280 fffffadf`900ac743 : fffffadf`9cd27b10 00000000`00000000 00000000`00000000 00000000`00000000 : 0x0
fffffadf`8cecc288 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : storport!RaidXrbSetDataBufferAddress+0x2b


STACK_COMMAND: kb

FOLLOWUP_IP:
storport!RaidXrbSetDataBufferAddress+2b
fffffadf`900ac743 488b4848 mov rcx,qword ptr [rax+48h]

SYMBOL_STACK_INDEX: 6

SYMBOL_NAME: storport!RaidXrbSetDataBufferAddress+2b

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: storport

IMAGE_NAME: storport.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 471c86a8

FAILURE_BUCKET_ID: X64_0xD1_storport!RaidXrbSetDataBufferAddress+2b

BUCKET_ID: X64_0xD1_storport!RaidXrbSetDataBufferAddress+2b

Followup: MachineOwner
---------
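
If others post dumps, it may help to compare the same few fields from each one. Here is a rough sketch that pulls the KEY: value pairs out of saved `!analyze -v` text. It assumes the layout shown above (upper-case key, then the value on the same line or on the next line); the function name parse_analyze and the sample excerpt are only for illustration.

```python
import re

# Keys in the output above are upper-case words with underscores, e.g. PROCESS_NAME.
FIELD_RE = re.compile(r"^([A-Z_]+):[ \t]*(.*)$")


def parse_analyze(text):
    """Collect KEY: value pairs; if the value is empty, take the next line."""
    fields = {}
    pending = None  # key whose value is expected on the following line
    for raw in text.splitlines():
        line = raw.strip()
        if pending is not None:
            key, pending = pending, None
            if line:
                fields[key] = line
                continue
        match = FIELD_RE.match(line)
        if match:
            key, value = match.groups()
            if value:
                fields[key] = value
            else:
                pending = key
    return fields


SAMPLE = """\
FAULTING_IP:
storport!RaidXrbSetDataBufferAddress+2b
fffffadf`900ac743 488b4848 mov rcx,qword ptr [rax+48h]

BUGCHECK_STR: 0xD1

PROCESS_NAME: cqmgstor.exe

MODULE_NAME: storport

IMAGE_NAME: storport.sys
"""

if __name__ == "__main__":
    fields = parse_analyze(SAMPLE)
    for key in ("BUGCHECK_STR", "PROCESS_NAME", "MODULE_NAME", "FAULTING_IP"):
        print(key, "=", fields.get(key))
```

Lining up PROCESS_NAME and MODULE_NAME across several dumps would show whether the same process and driver pair keeps turning up.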


karim h
Valued Contributor

Re: BSOD on BL460c

At a guess, this problem may be related to the HP Insight Storage Agents making an unsupported call to storport.sys.

Unfortunately you can't disable the Insight Storage Agents without shutting down the underlying Foundation Agents required for monitoring.

You could potentially set all the storage-related Management agents to Inactive through the System Control Panel, but I'm not sure that this would rectify the issue.


E.g. "Fibre Array, IDE, iSCSI, Drive Array, SCSI, SAS, Storage information and Remote Alerter" etc.



karim h
Valued Contributor

Re: BSOD on BL460c

On the affected servers do the following:

1. Update to PSP v8.15a
2. Update all firmware to FW Maintenance CD v8.30 baseline
3. Update Storport drivers from Microsoft - http://support.microsoft.com/kb/950448/en-us

Blade user
Occasional Advisor

Re: BSOD on BL460c

Thanks for that.

I've decided to do exactly the same thing during our next outage, so fingers crossed the issue will go away.

I've also been having a look at http://support.microsoft.com/kb/244617

Blade user
Terai
Occasional Advisor

Re: BSOD on BL460c

I've summarised the driver version information for our systems and marked which systems have had a BSOD and which have not, to help find the root cause (see attached).

From our trial-and-error exercise, what I've noticed is that if you enable Driver Verifier for certain specific drivers, the BSODs seem to stop. Those drivers are hpeaadsm.sys, ql2300.sys, storport.sys and hpcisss2.sys. From this, I'm guessing that the root cause is something wrong with the combination of these 4 drivers (or some of them) under certain conditions.

I will keep testing to reproduce the BSOD on several blades, with help from Microsoft and hp, to clarify the real cause. If anyone can provide similar data or any related information, it would be very helpful.

O. Terai
karim h
Valued Contributor

Re: BSOD on BL460c

Terai,
Thanks for the heads-up. Have you tried running Driver Verifier against just storport.sys and ql2300.sys?

I think that the HP MPIO driver (hpeaadsm.sys) can be ruled out, as both you and I are running EMC back end arrays and are having issues, but then again, who knows?

I don't believe that hpcisss2.sys is causing the issue, as some of my servers do not even have this driver and they still crash with the same symptoms.

I'm currently running 2 baseline configs (each baseline is applied to 17 blades - BL460c's and BL680c's):

BASELINE 1.) (W2003 X64 R2 SP2) PSP8.10 - FW Maintenance 8.20 with storport.sys 5.2.3790.4173 and ql2300.sys v9.1.7.16
-- This is unstable

BASELINE 2.) (W2003 X64 R2 SP2) PSP8.15 - FW Maintenance 8.30 with storport.sys 5.2.3790.4303 and ql2300.sys v9.1.7.17
-- Rolled out 6 days ago, no BSODs as yet but too early to tell.

As I mentioned a few emails ago, my dumps indicate that cqmgstor.exe (Insight Storage Agent) may be causing the issue...
ACHCHGUY
Frequent Advisor

Re: BSOD on BL460c

I have applied the registry setting below to some of our BL460c servers, as I was having BSODs that could not be pinned down.
Since applying it, the issue has not happened again on any of these servers.

These servers and SANs were all at the latest revisions (as you do) but would still BSOD.
Cheers
Steve

Did you see the msdn article and try ProtectNonPagedPool registry?
http://msdn.microsoft.com/en-us/library/ms796128.aspx
Please be cautious to set the registry, it might increase the possibility of bsod but might help you to identify the root cause.

karim h
Valued Contributor

Re: BSOD on BL460c

Update -

I have not had any problems since applying PSP 8.15, FW 8.30 and the updated storport driver I mentioned above. If you haven't already, I suggest that you try this as a course of action.

- 17 blades stable for 3 weeks (previously seeing incidents every few days)


Karim
Melissa O'Brien
Frequent Advisor

Re: BSOD on BL460c

Hi Karim,
Just a warning - our blue screens went away after updating all firmware and drivers to version 8.1 in August. We then upgraded our storport.sys as a prereq for the newer version of Veritas, and the blue screens came back. I've tried newer storports.

I deployed 8.15 and some are still blue screening. It's almost as if it's the order of installs that matters.

Very frustrating.
Robert Walker_8
Valued Contributor

Re: BSOD on BL460c

Well, I'd like to report that I have been having a frustrating time since March '08, when we converted from DL360's to BL460c's.

My Citrix farm has 16 of these, and all but one fail; the only exception is a 2003 x64 server! The 2003 32-bit servers die like flies with a veritable smorgasbord of codes! I count 108 BSODs and 29 ASRs since March 08. The BSODs are nasty: with ASR switched off they just stick and require manual intervention to restart the server.

23/01/2009 5:23 Blue Screen Trap (BugCheck, STOP: 0x0000007E (0xC0000005, 0x808402FE, 0xF707AB28, 0xF707A824))
22/01/2009 9:18 Blue Screen Trap (BugCheck, STOP: 0x00000019 (0x00000020, 0x86546000, 0x86546818, 0x1B030000))
1/01/2009 1:01 Blue Screen Trap (BugCheck, STOP: 0x000000C5 (0x00000000, 0xD0000002, 0x00000001, 0x8089C4BB))
30/11/2008 1:13 Blue Screen Trap (BugCheck, STOP: 0x00000024 (0x0019033D, 0x8DD44514, 0x8DD44210, 0x808402FE))
25/11/2008 4:42 Blue Screen Trap (BugCheck, STOP: 0x000000E1 (0x8082BF6F, 0x0000001F, 0x8A36BD50, 0x8A36BD50))
25/11/2008 1:51 Blue Screen Trap (BugCheck, STOP: 0x000000D0 (0x00000000, 0xD0000002, 0x00000001, 0x8089968D))
22/11/2008 17:43 Blue Screen Trap (BugCheck, STOP: 0x000000D1 (0x00000004, 0xD0000002, 0x00000001, 0xBA2022DF))
17/09/2008 8:01 Blue Screen Trap (BugCheck, STOP: 0x0000008E (0xC0000005, 0x8092AE38, 0x9A140C94, 0x00000000))
7/09/2008 2:01 Blue Screen Trap (BugCheck, STOP: 0x000000C2 (0x00000007, 0x0000121A, 0x00000000, 0x83000168))
19/08/2008 6:05 Blue Screen Trap (BugCheck, STOP: 0x0000001A (0x00000401, 0xC01D33AC, 0x8002B867, 0x00000000))
29/07/2008 6:34 Blue Screen Trap (BugCheck, STOP: 0x000000BE (0xD22C4000, 0x8002B121, 0xF792E5B8, 0x0000000B))
1/07/2008 9:33 Blue Screen Trap (BugCheck, STOP: 0x0000004E (0x00000007, 0x00080010, 0xFFFFFFFF, 0x00000000))
22/05/2008 15:43 Blue Screen Trap (BugCheck, STOP: 0x0000000A (0xFFFFFFB8, 0xD000001B, 0x00000000, 0x808541D3))
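
To see at a glance which stop codes dominate a log like the one above, here is a minimal sketch that parses the Blue Screen Trap lines and counts them per code. The symbolic names are the standard Windows bug check names for the codes in this list; the function names and the three sample lines (copied from the list) are only for illustration.

```python
import re
from collections import Counter

# Standard Windows bug check names for the stop codes seen in the list above.
STOP_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x19: "BAD_POOL_HEADER",
    0x1A: "MEMORY_MANAGEMENT",
    0x24: "NTFS_FILE_SYSTEM",
    0x4E: "PFN_LIST_CORRUPT",
    0x7E: "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED",
    0x8E: "KERNEL_MODE_EXCEPTION_NOT_HANDLED",
    0xBE: "ATTEMPTED_WRITE_TO_READONLY_MEMORY",
    0xC2: "BAD_POOL_CALLER",
    0xC5: "DRIVER_CORRUPTED_EXPOOL",
    0xD0: "DRIVER_CORRUPTED_MMPOOL",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
    0xE1: "WORKER_THREAD_RETURNED_AT_BAD_IRQL",
}

# Matches "STOP: 0x000000D0 (p1, p2, p3, p4)" and captures the code and the parameters.
TRAP_RE = re.compile(r"STOP:\s*0[xX]([0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_trap(line):
    """Return (stop_code, [parameters]) for one log line, or None if no match."""
    match = TRAP_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def tally(lines):
    """Count how many times each stop code appears."""
    counts = Counter()
    for line in lines:
        parsed = parse_trap(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


SAMPLE_LINES = [
    "23/01/2009 5:23 Blue Screen Trap (BugCheck, STOP: 0x0000007E (0xC0000005, 0x808402FE, 0xF707AB28, 0xF707A824))",
    "22/01/2009 9:18 Blue Screen Trap (BugCheck, STOP: 0x00000019 (0x00000020, 0x86546000, 0x86546818, 0x1B030000))",
    "25/11/2008 1:51 Blue Screen Trap (BugCheck, STOP: 0x000000D0 (0x00000000, 0xD0000002, 0x00000001, 0x8089968D))",
]

if __name__ == "__main__":
    for code, count in tally(SAMPLE_LINES).most_common():
        print("0x%02X %-40s %d" % (code, STOP_NAMES.get(code, "unknown"), count))
```

Several of these names (BAD_POOL_HEADER, BAD_POOL_CALLER, DRIVER_CORRUPTED_EXPOOL, DRIVER_CORRUPTED_MMPOOL) point at pool corruption, which fits what others in this thread have reported.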

I have updated to PSP 8.15a, FW 8.3 and STORPORT.SYS 4368. Have also got driver verifier running:

C:\> verifier /querysettings
Special pool: Enabled
Force IRQL checking: Disabled
Low resources simulation: Disabled
Pool tracking: Enabled
I/O verification: Disabled
Deadlock detection: Disabled
Enhanced I/O verification: Disabled
DMA checking: Disabled
Disk integrity checking: Disabled

Verified drivers:

hpcisss2.sys
storport.sys
videoprt.sys
cpqcidrv.sys
hpqilo2.sys
cpqteam.sys
ptilink.sys
icacdd.sys
tmtdi.sys
cdfdrv.sys
dump_diskdump.sys
dump_hpcisss2.sys
ctxpidmn.sys
ctxsbx.sys
tmpreflt.sys
vsapint.sys
tmxpflt.sys
ctxaltstr.sys
cdm.sys
ctxsmcdrv.sys
tmcomm.sys
wdica.sys
icareduc.sys
pdrframe.sys
pdcrypt1.sys
vdtw30.dll
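
In case someone wants to reproduce the same Verifier configuration on another blade, here is a sketch that turns the /querysettings text into a flags value and a command line. The bit values are the Driver Verifier flag bits as I remember them from the Microsoft documentation, so treat them as an assumption and check `verifier /?` on the server first. The function names are made up, and the script only prints the command; it does not run it.

```python
# Assumed Driver Verifier flag bits (verify against Microsoft's documentation).
VERIFIER_BITS = {
    "Special pool": 0x01,
    "Force IRQL checking": 0x02,
    "Low resources simulation": 0x04,
    "Pool tracking": 0x08,
    "I/O verification": 0x10,
    "Deadlock detection": 0x20,
    "Enhanced I/O verification": 0x40,
    "DMA checking": 0x80,
}


def flags_from_settings(text):
    """OR together the bits of every option reported as Enabled."""
    flags = 0
    for line in text.splitlines():
        name, sep, state = line.partition(":")
        if sep and state.strip() == "Enabled":
            # Options without a known bit (e.g. disk integrity) contribute 0.
            flags |= VERIFIER_BITS.get(name.strip(), 0)
    return flags


def build_verifier_command(flags, drivers):
    """Return the argument list for 'verifier /flags N /driver a.sys b.sys'."""
    return ["verifier", "/flags", str(flags), "/driver"] + list(drivers)


SETTINGS = """\
Special pool: Enabled
Force IRQL checking: Disabled
Low resources simulation: Disabled
Pool tracking: Enabled
I/O verification: Disabled
Deadlock detection: Disabled
Enhanced I/O verification: Disabled
DMA checking: Disabled
Disk integrity checking: Disabled
"""

if __name__ == "__main__":
    flags = flags_from_settings(SETTINGS)
    print(" ".join(build_verifier_command(flags, ["hpcisss2.sys", "storport.sys"])))
```

Under these assumptions, special pool plus pool tracking works out to a flags value of 9. Verifier changes made this way normally take effect after the next reboot.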

HP, Microsoft and Citrix have all been looking at it with much finger pointing and shrugged shoulders.

Robert.