HPE EVA Storage

MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

I have a major problem with a Win2000 SP4 + Exchange 2000 SP3 system. Far too often I get this in the log: "The device, \Device\Scsi\CPQKGPSA1, did not respond within the timeout period." This error can occur up to 10-15 times per day.
Sometimes I also get a blue screen with different bugchecks; in the last week it has dumped 4 times. The driver version of the 2 CPQKGPSA HBAs is 5.4.82.16, SecurePath is 4.0b, and the array is an EVA3000. What can I do?

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

It sounds like a driver issue.

Try upgrading your Secure Path agent to 4.0c

Regards
Christian
Antonio Gonzalez_4
Regular Advisor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

There are many more things to look at:
- look at the switches (firmware version?) and read any error messages
- did you apply the Fibre Channel setup software?
- look at the EVA logs; are there any messages there?
- is the VCS version up to date?
- what are the HBAs: model and firmware version?
- there are some issues with the driver parameters; check them with regedit and read the driver/HBA release notes

good luck
antonio
andyh310
Occasional Contributor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

We got this error too. We tried upgrading drivers & firmware on the HBAs, but in the end it turned out to be firmware on the EVA controllers.
Derek_31
Valued Contributor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

I get the same problem with my:

MSA-1000 4.24
Secure Path 4.0c
FCA2101 with latest firmware/drivers
SAN 2/16 switch with latest firmware
Windows Server 2003 Enterprise

The problems seem to have started with MSA-1000 v4.24 and secure path 4.0c. But I wasn't running backups at that time, and now I am.
John Silk
Advisor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

I suggest that you try deactivating the HP Insight Management "Fibre Array Information" Agent. This problem will apparently be fixed in the V7.0 Support Paq.
Striving for perfection is a worthwhile goal
Carl Mcnulty
Advisor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

We have the same problem here:

HSG80 controllers with latest firmware.
Emulex LP8000 driver revision 5.4.82.16
SecurePath 4.0b
W2K server SP3.

Only one server is receiving this error.

Identical set-up has been installed on another site and we don't see this error.

Any help is appreciated
Kevin Harlan
Valued Contributor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

Something else I would note is that this could also be indicative of a failure in the fabric (or anything along the path -- HBA, fiber cable, GBIC, switch, etc. etc. etc.). It is worth pointing out that if a server with Secure Path installed were to lose all (note: ALL, not just a single HBA) connectivity with the SAN, the typical outcome of that is a Blue Screen. (And if I remember correctly, there are several different ones, and not just one specific one, when this happens.)

Does this error only occur on KGPSA1 and never on KGPSA2? Or does it occur on both?

If it occurs just on KGPSA1, have you tried using Secure Path to "verify" the path, and failover from one path to the other? (For example, KGPSA1 might be okay but plugged into a switch that is acting up, but KGPSA2 might have a bad fiber cable or GBIC. When the switch acts up and cuts KGPSA1's connection, when KGPSA2 tries to connect it fails -- resulting in full disconnect from the SAN, and a Secure Path Blue Screen.)
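The failover scenario above can be written down as a tiny availability model. This is a hypothetical sketch, not how Secure Path works internally: it only assumes that a path is usable when every component on it (HBA, cable, GBIC, switch port) is healthy, and that the host keeps its SAN connection while at least one path is usable. The class and function names are made up for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Path:
    """One host-to-array path, modelled as named components (hypothetical)."""
    hba: str
    # Health of each component along the path, e.g. cable, GBIC, switch port.
    components_ok: dict[str, bool] = field(default_factory=dict)

    @property
    def usable(self) -> bool:
        # A path can carry I/O only if every component on it is healthy.
        return all(self.components_ok.values())


def san_reachable(paths: list[Path]) -> bool:
    # Multipathing masks a failure as long as at least one path is still usable.
    return any(p.usable for p in paths)


# The example above: KGPSA1 sits behind a misbehaving switch, KGPSA2 has a bad GBIC.
kgpsa1 = Path("KGPSA1", {"cable": True, "gbic": True, "switch": False})
kgpsa2 = Path("KGPSA2", {"cable": True, "gbic": False, "switch": True})
print(san_reachable([kgpsa1, kgpsa2]))  # False: no usable path is left
```

The point of the model is that a single bad component on the "standby" path stays invisible until the active path fails, which is why verifying each path individually is worthwhile.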

Do you have any other servers plugged into this same switch, EVA, etc.? If so, are they experiencing any problems?

Also, on the Blue Screen, did it indicate a driver or module? Sometimes you will see a module listed (like ntoskrnl.exe, srv.sys, atcp.sys, etc.), and in a few rare cases you will even get a stack dump.

What is your firmware revision on the EVA3000? I will also point out that if you were to call into HP support with this problem, the first thing they would probably tell you is to update all your components (i.e. EVA firmware version 3.010, SecurePath 4.0c).
Julian Perez_1
Valued Contributor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

Hello,

Before upgrading anything, a first and easier step could be to modify the card parameters using the LPUTILNT.exe utility, which is located in the system32 directory.

With this utility, you can modify Emulex card parameters without going to the windows registry.

Also, there is one very important driver issue that is documented in the readme.txt file. The default EmulexOption setting for both the Emulex (5-5.00a10-1) and the CPQKGPSA (5-4.82.a16) drivers is 0xDA00, which is a "must" for W2K SP3. All previous versions of these drivers have a default EmulexOption setting of 0xAA00, which is a "must" for W2K SP2. Not setting this parameter correctly might result in event ID 50 with "SRB_STATUS_BUSY" and possibly even in event ID 9 timeouts logged to the system event log.
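A quick way to check this without LPUTILNT is to compare the EmulexOption value against the readme defaults quoted above. The sketch below assumes the driver parameters are held as a single semicolon-separated "Key=Value" string; that format, the sample string, and the function names are assumptions, not taken from the driver documentation. Reading the string out of the registry (for example with Python's winreg module) is left out, because the exact key path is not given in this thread. Note that a later post in this thread reports running 0xBA00 on SP3/SP4 instead.

```python
from __future__ import annotations


def parse_driver_parameters(raw: str) -> dict[str, str]:
    """Split an assumed 'Key=Value;Key=Value' string into a dict."""
    params: dict[str, str] = {}
    for item in raw.split(";"):
        item = item.strip()
        if "=" not in item:
            continue  # skip empty or malformed entries
        key, value = item.split("=", 1)
        params[key.strip()] = value.strip()
    return params


# Default EmulexOption values as quoted from the driver readme in the post above.
README_EMULEX_OPTION = {"SP2": 0xAA00, "SP3": 0xDA00}


def emulex_option_matches(raw: str, service_pack: str) -> bool:
    value = parse_driver_parameters(raw).get("EmulexOption")
    if value is None:
        return False  # parameter not present at all
    # int(..., 16) accepts both "0xDA00" and "DA00".
    return int(value, 16) == README_EMULEX_OPTION[service_pack]


# Hypothetical sample string, for illustration only.
print(emulex_option_matches("EmulexOption=0xDA00;Topology=1", "SP3"))  # True
```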

Best regards,

Julian
Keep the faith
Daniel Kobus_1
Occasional Advisor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

We had the same issues you are referring to. First off, do not use the LPUTILNT utility to update the registry settings; this caused a bigger problem than the one we started with! Editing the registry directly is the best way to change this. I have never seen the DA00 setting; BA00 is the setting we have been using for SP3 and SP4. You should also check your timeout settings. We also had to make some changes to Exchange to solve this problem. MS has a fix and a white paper on it, but they seem to forget about it until you raise the support issue!
Steven White_2
Occasional Advisor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

Hi,
We have a similar problem, but we lose the connection to the SAN from two servers simultaneously. We get a disk error on both servers, but the connection comes back within a few minutes.
Any help would be much appreciated.
Regards,
Steve

Info:

Environment:
DL380 G3 with v6.4
MSA1000 with v4.24 B272
Integrated san switch 2/8
Windows Server 2003, one F&P, One Exchange 2003
FCA2101 HBAs with v5-4.82a16 driver
We have been following a number of threads in the forum relating to this issue, but there is no definite fix mentioned.

The common thread through the forums is to disable the HP Insight Management "Fibre Array Information" Agent.
We have done this as an interim measure.
Some people suggest upgrading to the v7 agents, but that release contains no update for the HBA drivers, so we are reluctant to do this; others have mentioned that they tried it and it did not fix the issue.

We have checked the registry settings as per HP advisory:
Parameters for the KGPSA and FCA-2101 driver in a Windows SAN environment may require editing to prevent disruptive effects on the SAN. The LPUTILNT utility can introduce changes and additional spaces in the Windows registry. The resulting registry entry data exceeds the Microsoft limit of 229 characters.

http://www3.compaq.com/support/reference_library/viewdocument.asp?source=OS030221_CW01.xml&dt=3
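The advisory's two symptoms (extra spaces and an over-long value) are easy to check mechanically. A minimal sketch, assuming again that the parameters are stored as one semicolon-separated "Key=Value" string; the function names and sample string are hypothetical, and whether the driver expects a trailing ";" is not stated in this thread, so inspect the output before writing anything back.

```python
from __future__ import annotations

MAX_DRIVER_PARAMETER_LEN = 229  # limit quoted in the HP advisory above


def normalize_driver_parameter(raw: str) -> str:
    """Remove stray whitespace around 'Key=Value' entries (assumed ';'-separated)."""
    entries = []
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue  # drop empty entries left by doubled or trailing ';'
        if "=" in item:
            key, value = item.split("=", 1)
            item = f"{key.strip()}={value.strip()}"
        entries.append(item)
    return ";".join(entries)


def too_long(raw: str) -> bool:
    return len(raw) > MAX_DRIVER_PARAMETER_LEN


raw = "EmulexOption = 0xDA00 ; Topology = 1 ;"
clean = normalize_driver_parameter(raw)
print(clean)  # EmulexOption=0xDA00;Topology=1
print(too_long(raw), too_long(clean))  # False False
```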

I have posted this info into a similar thread: \device\scsi\cpqkgpsa error
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=578729

Events Log Info:

Time: 11:33:34
Source: Disk
ID: 51
An error was detected on device \Device\Harddisk5 during a paging operation.

Source: CPQKGPSA
ID: 11
The driver detected a controller error on \Device\Scsi\CPQKGPSA1.

Source: Disk
ID: 11
The driver detected a controller error on \Device\Harddisk2.

Source: Storage Agents
ID: 1185
Fibre Channel Controller Status Change. The Fibre Channel Controller in slot 255 has a new status of 6.
(Host controller status values: 1=other, 2=ok, 3=failed, 4=shutdown, 5=connectionDegraded, 6=connectionFailed)
[SNMP TRAP: 16021 in CPQFCA.MIB]

Source: Storage Agents
ID: 1185
External Array Controller Status Change. The external controller in I/O slot 1 of array "MELSAN01" has a new status of 4.
(Controller status values: 1=other, 2=ok, 3=failed, 4=offline, 5=redundantPathOffline)
[SNMP TRAP: 16020 in CPQFCA.MIB]

Time: 11:37:29
Source: Storage Agents
ID: 1185
Fibre Channel Controller Status Change. The Fibre Channel Controller in slot 255 has a new status of 2.
(Host controller status values: 1=other, 2=ok, 3=failed, 4=shutdown, 5=connectionDegraded, 6=connectionFailed)
[SNMP TRAP: 16021 in CPQF
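Read together, these entries give a timeline: at 11:33 the host Fibre Channel controller went to status 6 (connectionFailed) and the external array controller to status 4 (offline), and by 11:37 the host controller was back at 2 (ok). The sketch below decodes such messages using only the status tables printed in the events themselves; the regex and function names are illustrative assumptions, not part of the Storage Agents.

```python
from __future__ import annotations
import re

# Status tables copied from the Storage Agents event text above.
HOST_CONTROLLER_STATUS = {1: "other", 2: "ok", 3: "failed", 4: "shutdown",
                          5: "connectionDegraded", 6: "connectionFailed"}
ARRAY_CONTROLLER_STATUS = {1: "other", 2: "ok", 3: "failed", 4: "offline",
                           5: "redundantPathOffline"}
# SNMP trap number (CPQFCA.MIB) -> which table applies, as shown in the log.
TABLE_BY_TRAP = {16021: HOST_CONTROLLER_STATUS, 16020: ARRAY_CONTROLLER_STATUS}

STATUS_RE = re.compile(r"has a new status of (\d+)")


def decode_status(message: str, trap: int) -> str | None:
    match = STATUS_RE.search(message)
    if match is None:
        return None  # not a status-change message
    code = int(match.group(1))
    return TABLE_BY_TRAP[trap].get(code, f"unknown ({code})")


print(decode_status("The Fibre Channel Controller in slot 255 has a new status of 6.", 16021))
# connectionFailed
```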
Ben Trappe
New Member

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

I have had a similar problem, shortly after updating the firmware on the two MSL5052 tape libraries and the E1200 Storage Router.

I don't suppose you did these updates before getting these messages, did you?

Regards
Ben
Patrick Kane_3
Occasional Advisor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

This response is to Ben's message posting in the MSL5000 series tape library firmware upgrade and the possibility this is causing these KGPSA card timeouts:

Dude, I have (1) MSL5026 tape library at our "problem" facility. Everything seemed to be working fine. I discovered that our library was connected in an "unsupported" configuration per HP. Previously, we had our tape library connected via SCSI (VHDCI) to a Modular Data Router. The fiber side of the MDR was connected directly into our backup server. This was not cool, according to HP. HP advised us to create a zone on our switch, create aliases for the MDR and the server, and connect the two devices through the switch. We did this. This is when our "blue screens" started. When we attempted to mount volumes to the cluster servers, they crashed. This happened for approximately three days. The blue screens "mysteriously" disappeared, although I am still unable to mount two disk volumes to my cluster servers. The fiber zone creation was the only configuration change made. Is following HP's best practice causing my problems?
I'm Here Because I Have Questions
Steven White_2
Occasional Advisor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

Hi, Interesting that zoning caused the issue for you.

Our config is basic, in that there are only two W2K3 DL380s attached to the internal fibre switch. We are using SSP to assign the logical drive to the servers.
It is as basic as you could get! However, the issue is now occurring on two identically configured systems in separate offices.
This really points to a driver or config issue. We have swapped out the switch and controller at one site on HP's recommendation.

They have also requested zoning on the switch like you; this is being done in the second office.

We have disabled the HP Insight Event Notifier (C:\WINDOWS\system32\CIMntfy\cimntfy.exe) on the system where this originally occurred, and it has not happened again after 1 1/2 weeks.
Will keep posting as we get updates. Steve
Andy McCreath
Frequent Advisor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

How far have you escalated this issue?
We have the same problem on three servers out of 47 connected to an EVA5000.
HP have suggested upgrading our VCS and 2/64 OS to versions 3.0.14 and 4.2.0B respectively
(prerequisites being SecPath 4.0C and drivers from EVA Platform Kit 3C; 3B will suffice).


http://h20000.www2.hp.com/bizsupport/TechSupport/DriverDownload.jsp?pnameOID=83235&locale=en_US&taskId=135&prodTypeId=12169&prodSeriesId=83233&cc=us&swEnvOID=54


http://h20000.www2.hp.com/bizsupport/TechSupport/DriverDownload.jsp?pnameOID=321354&locale=en_US&taskId=135&prodTypeId=12169&prodSeriesId=321347
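When comparing many servers against these target levels, a small helper can flag which components are still below them. This is a hypothetical sketch: the ordering rule (numeric parts compared as numbers, letter suffixes compared case-insensitively, a suffix sorting after the bare number) is an assumption, not HP's versioning scheme. In particular, mixed notations such as "3.010" versus "3.0.14" for the same VCS line would be misordered, so normalize the notation first.

```python
from __future__ import annotations
import re


def version_key(version: str) -> tuple:
    """Order strings like '4.0b' or '4.2.0B': numbers numerically, letters case-insensitively."""
    parts = re.findall(r"\d+|[A-Za-z]+", version)
    # Tag numbers with 0 and letters with 1 so an int is never compared with a str.
    return tuple((0, int(p)) if p.isdigit() else (1, p.lower()) for p in parts)


# Target levels HP suggested in the post above.
RECOMMENDED = {"VCS": "3.0.14", "2/64 switch OS": "4.2.0B", "Secure Path": "4.0C"}


def below_recommended(installed: dict[str, str]) -> list[str]:
    return [name for name, current in installed.items()
            if name in RECOMMENDED
            and version_key(current) < version_key(RECOMMENDED[name])]


print(below_recommended({"Secure Path": "4.0b", "VCS": "3.0.14"}))  # ['Secure Path']
```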
www.kimberly-clark.com
Ben Trappe
New Member

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

Hi all!

I have just had an HP engineer onsite looking into the problem with me, and we have found the answer (for my problem at least!).

The conflict appears to be if you are running the drivers from SmartStart 6.40 (although it could well be the case for other versions). If this is the case, do the following:

On ALL Fiber attached servers connected to your SAN:

1. Go to Start\Settings\Control Panel and open the HP Management Agents (or Compaq Management Agents).
2. Click the 'Services' tab and remove 'Fibre Array Information' from the Active agents list.
3. Close the Management Agents window and allow it to restart the management agents.
4. Repeat on all fibre-attached servers, even if they are not running 6.40.

This fixed my problem quite dramatically, and switching that setting back on brings the problem back! Hopefully it will work for you as well!

Let me know if it helps.
Patrick Kane_3
Occasional Advisor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

Andy: We have (3) fiber switches in our topology. The (2) switches are for redundancy. They are both Compaq StorageWorks SAN Switch 16 units running firmware version 2.6.0.h. If I am correct, I'm only one firmware version out of spec. The other switch serves as our disk presentation and replication fiber device to our remote facility. That switch is an HP SAN Switch 2/16 (I believe cross-referenced as the Brocade SilkWorm) running firmware V3.1.0. This firmware is also pretty recent. I believe the firmware you were referring to was for 32-port switches? I think I'm within an acceptable range on the firmware.

Ben: I disabled the Fibre Array Information snap-in on our cluster server last week. There was one additional server (the backup server) connected to our fiber switch on which it was still running. I just disabled the snap-in there and will keep my fingers crossed. I'm in the process of installing Terminal Services on these servers so I can administer them remotely. Do you guys know of any compatibility issues? I am still getting "blue-screened" intermittently. I'm going to keep cracking at this system. Luckily, we're not in production for another couple of weeks! Thanks again for all your input and keep the good advice coming!
I'm Here Because I Have Questions
Uwe Zessin
Honored Contributor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

Terminal Services should be OK for looking after things, but I have been told that one should not use such a session to install software. I have understood that the Windows registry might not be properly set up in such a case.
Patrick Kane_3
Occasional Advisor

Re: MAJOR PROBLEM...The device, \Device\Scsi\CPQKGPSA1, did not respond...

Our cluster administration software vendor, Veritas, indicates I am experiencing a networking problem. I sent them log dumps of the configuration and event information. Here's what they sent me:

LanMan is failing due to networking issues:
4307(0xc00010d3) NetBT FS2A Initialization failed because the transport refused to open initial Addresses.
And
64(0x80000040) w32time FS2A Because of repeated network problems, the time service has not been able to find a domain controller to synchronize with for a long time. To reduce network traffic, the time service will wait 960 minutes before trying again. No synchronization will take place during this interval, even if network connectivity is restored. Accumulated time errors may cause certain network operations to fail. To tell the time service that network connectivity has been restored and that it should resynchronize, execute "w32tm /s" from the command

Examples of these are showing in the System log.

You are having communication issues with your Disks.

15(0xc004000f) Disk FS2A The device, \Device\Harddisk3\DR29, is not ready for access yet.

41(0xc0050029) vxio FS2A vxio: cluster or private disk group ba025e2b-ef69-4939-81fa-54e0f4af592d has lost access to a majority of its disks. Its reservation thread has been stopped.

7011(0xc0001b63) Service Control Manager FS2A Timeout (30000 milliseconds) waiting for a transaction response from the vxpnpsvc service.
What these messages are saying is that the vxpnpsvc did not get a response back from the Microsoft PnP service within 30 seconds. It's not an error so much as a report of a temporary lack of communication between the 2 services.
The last one should be a call to Microsoft.

We need to sort out the issues with MS and HP.

The LanMan issues are due to networking.
The VMDG is not onlining due to the inability of the OS to see all of the disks; when it tries, it gives up after 30 seconds. I noted that the VCS has all of the force options set to true:
ForceDeport
ForceImport
ForceUnmount
AutoFSClean
This is usually the case if one is having issues. It is dangerous and can cause data corruption.

I would get Microsoft involved for the Network issues and HP for the SAN issues.

I don't believe the W32Time issue (although I need to address it) would cause the problems I'm having. Also, Veritas says I'm getting SAN-to-server communication problems, but everything on the SAN looks OK. Wouldn't the SAN console show errors if there were communication problems between the SAN disks and the servers?
I'm Here Because I Have Questions