Automated Backup
cancel
Showing results for 
Search instead for 
Did you mean: 

VBDA inactivity timeouts & IPC EVENT TIMEOUTS

CA30934
Occasional Visitor

VBDA inactivity timeouts & IPC EVENT TIMEOUTS

I am currently looking into an issue whereby we occasionally get (VBDA reached its inactivity timeout) error during backups.

We have a LTO Tape Library attached to a (fully patched) DP5.0 cell server running under Windows, and run two datalists (one for the Linux client, the other for the Windows clients). We are seeing this happen on both datalists.

I have had a pretty good look at cases of this on the Linux systems and have found that at a particular time all of the running disk agents (vbda's) running on the clients go idle... they still exist, but are no longer using any CPU or network traffic. An lsof shows that these agents still have a connection to the cell server, and a parallel running datalist continues to run. This implies that it is the media agent on the cell server which is no longer responding, or the network is causing an issue to upset its connections. It should be noted that other netwok traffic (such as continuous pings, established SSH connections, and other DP datalist backups) are not being upset, so it does not look network related (all systems are attached to 1GB switches).

Another interesting point is that when debugging is enabled on the Linux client VBDA's the following "IPC EVENT ==> TIMEOUT" messages occur 50-60 times a second (per vbda). While these do not seem to be affecting the backup in any way I would like to know what the cause of them is, as it could be related to the connections going idle. I get these even when running very simple, small backups at low network load.

This is snip of a VBDA debug log (anyone have access to bdabuf.c?):

03/23/05 03:13:37 ("da/dacmn/bdabuf.c /main/r50_fix/1":934) A.05.00 bPHSS_32331/DPSOL_00132
IPC EVENT ==> TIMEOUT

03/23/05 03:13:37 ("da/dacmn/bdabuf.c /main/r50_fix/1":934) A.05.00 bPHSS_32331/DPSOL_00132
IPC EVENT ==> TIMEOUT

03/23/05 03:13:37 ("da/dacmn/bdabuf.c /main/r50_fix/1":934) A.05.00 bPHSS_32331/DPSOL_00132
IPC EVENT ==> TIMEOUT

03/23/05 03:13:37 ("da/dacmn/bdabuf.c /main/r50_fix/1":934) A.05.00 bPHSS_32331/DPSOL_00132
IPC EVENT ==> TIMEOUT
4 REPLIES
CA30934
Occasional Visitor

Re: VBDA inactivity timeouts & IPC EVENT TIMEOUTS

ADDITIONAL INFORMATION:
The following is what is displayed in the log of the client (VBDA) when the connection times out. The same message is displayed for multiple disk agents, on multiple clients that were active at the time. However another datalist running on a different BMA continues just fine. This suggests that it is the specific BMA (on the cell master) for this datalist which is failing.

04/09/05 01:57:36 ("da/dacmn/breconn.c /main/16":269) A.05.00 bPHSS_32331/DPSOL_00132
RcIpcReceiveData failed!

04/09/05 01:57:36 ("da/dacmn/bdabuf.c /main/r50_fix/1":979) A.05.00 bPHSS_32331/DPSOL_00132
IPC EVENT (BSM) ==> SOCKET CLOSED
04/09/05 01:57:36 ("da/bda/tw_unix.c /main/r50_fix/1":Line:423 in 'TwDumpFile' )
Marked: ("":Line:0 in '')
Code is:0
From: VBDA@machine.domain.co.nz "machine " Time: 04/09/05 01:57:36
Connection to Media Agent broken => aborting.

04/09/05 01:57:36 ("da/bda/vbda.c /main/r50_fix/2":1145) A.05.00 bPHSS_32331/DPSOL_00132
===>> (0) VbdaExit {
Args: cc=4

IpcReleaseIpcHandle: 1 (1, 1)
04/09/05 01:57:36 ("da/bda/vbda.c /main/r50_fix/2":Line:1289 in 'VbdaExit' )
Marked: ("":Line:0 in '')
Code is:0
From: VBDA@machine.domain.co.nz "machine " Time: 04/09/05 01:57:36
ABORTED Disk Agent for machine.domain.co.nz:/data1 "machine ".

04/09/05 01:57:36 ("da/bda/vbda.c /main/r50_fix/2":1312) A.05.00 bPHSS_32331/DPSOL_00132
VbdaExit: OB2VBDASLEEPBEFORECLOSE = 0

04/09/05 01:57:36 ("lib/ipc/ipc.c /main/r50_fix/10":766) A.05.00 bPHSS_32331/DPSOL_00132
IpcExit()

04/09/05 01:57:36 ("lib/ipc/ipc.c /main/r50_fix/10":2749) A.05.00 bPHSS_32331/DPSOL_00132
IpcCloseConnection(0)
IpcReleaseIpcHandle: 0 (0, 0)

04/09/05 01:57:36 ("da/bda/vbda.c /main/r50_fix/2":1341) A.05.00 bPHSS_32331/DPSOL_00132
VbdaExit: will exit with exit code 4

04/09/05 01:57:36 ("lib/cmn/common.c /main/r50_fix/1":393) A.05.00 bPHSS_32331/DPSOL_00132
CmnExit()

04/09/05 01:57:36 ("lib/cmn/trouble.c /main/5":121) A.05.00 bPHSS_32331/DPSOL_00132
===>> (1) TrbExit {


04/09/05 01:57:36 ("lib/cmn/trouble.c /main/5":126) A.05.00 bPHSS_32331/DPSOL_00132
<<=== (1) } /* TrbExit */


04/09/05 01:57:36 ("lib/cmn/debug.c /main/r50_fix/7":433) A.05.00 bPHSS_32331/DPSOL_00132
==================================================================
DEBUGGING ENDED PROPERLY
==================================================================
CA30934
Occasional Visitor

Re: VBDA inactivity timeouts & IPC EVENT TIMEOUTS

This is still occasionally happening with the latest fully patched version of Data Protector (v5.5).

You can actually predict when the failure is going to occur as the VBDA process on the client will sit there using zero CPU time... whereas a working VBDA process will start chewing up CPU time within a minute or two.

The issue seems to be that the disk agent goes to backup a file and then times out with a "RcIpcReceiveData failed" (the timeouts have been reduced to 30 minutes as it makes no difference what this setting is, and 30 minutes allow us to know about the problem sooner in the evening).

from NET: MSG_LISTEN
shmemKey = 1783634327
shmemSize = 2097168
shmemSection = 0
shmemBasePtr = 0x(nil)
bufNum =32
bufSize =65536
bufOffset =160
Attached shmemBasePtr = 0x40230000
boName (addr) = 0xbfffea60, boNameSize = 67
boDesc (addr) = 0x40431008, boDescSize = 1
appName (addr) = 0x8074ce0, appNameSize = 5
appDesc (addr) = 0xbffff004, appDescSize = 32

2005-12-03 23:17:13 ("da/dacmn/bdarec.c /main/32":167) A.05.50 bPHSS_33629_33630_DPSOL_00168
TwBackupObj: Dir /data3
TwBackupObj: objProcessed=
TwBackupObj: Dir /data3/lost+found
TwBackupObj: objProcessed=
TwBackupObj: Dir /data3/dbs
TwBackupObj: objProcessed=
TwBackupObj: Dir /data3/dbs/ren_nth
TwBackupObj: objProcessed=
TwTraverse: Processing file object /data3/dbs/ren_nth/ren_nth_rollback1.dbf

2005-12-03 23:54:25 ("da/dacmn/breconn.c /main/dp55/2":275) A.05.50 bPHSS_33629_33630_DPSOL_00168
RcIpcReceiveData failed!


2005-12-03 23:54:25 ("da/dacmn/bdabuf.c /main/dp55/dp55_fix/4":1537) A.05.50 bPHSS_33629_33630_DPSOL_00168
IPC EVENT (BSM) ==> SOCKET CLOSED

2005-12-03 23:54:25 ("da/bda/tw_unix.c /main/dp55/dp55_fix/4":506) A.05.50 bPHSS_33629_33630_DPSOL_00168
in TwDumpFile
Marked: ("":Line:0 in '')
Code is:0
From: VBDA@husky3.hnz.co.nz "husky3 " Time: 12/03/2005 11:54:25 PM
Connection to Media Agent broken => aborting.


BR931836
Occasional Visitor

Re: VBDA inactivity timeouts & IPC EVENT TIMEOUTS

Kevin,

What did you do to solve this issue? I'am seeing the same stuff on some client machines...
CA30934
Occasional Visitor

Re: VBDA inactivity timeouts & IPC EVENT TIMEOUTS

The issue seems to be with the Medium Agent which is running on the Cell Server (installed under Windows as Linux was not supported under v5). The medium agent occasionally gets upset and stops responding.

By spacing out the Linux system backups and manually regiging the order of the objects within the datalists to spread the load it is possible to drastically reduce the occurrence of this.

It's also possible to try and reduce the load by reducing the level of logging (eg just directories rather than file versions) to see if that helps.