1754414 Members
2606 Online
108813 Solutions
New Discussion юеВ

diald taking 100% CPU

 
Michele (Mike) Alberton
Regular Advisor

diald taking 100% CPU

Hi !

I've seen quite a few of this occurrences in some previous threads, and got some useful hints on the subject but would like some support about the information I retrieved while debugging the issue.

I have Itanium based (HP-UX 11.23) systems with an EMC Clariion connected.

diald on one of the units (we have 6 of them exactly the same in terms of SW/HW) that after being quiet for a few hours starts ramping up the CPU utilization.

I installed tusc 7.9 and restarted diald. Initially everything was ok but this morning the trace file was humongous. It was stuffed with:

[./diald ][8185] open("/dev/fcd1", O_RDONLY, 0) .............................................. = 9
[./diald ][8185] ioctl(9, 0xc00c4601, 0x7fff9550) ............................................ = 0
[./diald ][8185] ioctl(9, 0xc00c4604, 0x7fff9560) ............................................ = 0
[./diald ][8185] close(9) .................................................................... = 0
[./diald ][8185] ioctl(9, SIOC_GET_PLUN, 0x7fff9540) ......................................... ERR#9 EBADF
[./diald ][8185] close(9) .................................................................... ERR#9 EBADF
[./diald ][8185] open("/dev/fcd1", O_RDONLY, 0) .............................................. = 9
[./diald ][8185] ioctl(9, 0xc00c4601, 0x7fff9550) ............................................ = 0
[./diald ][8185] ioctl(9, 0xc00c4604, 0x7fff9560) ............................................ = 0
[./diald ][8185] close(9) .................................................................... = 0
[./diald ][8185] ioctl(9, SIOC_GET_PLUN, 0x7fff9540) ......................................... ERR#9 EBADF
[./diald ][8185] close(9) .................................................................... ERR#9 EBADF
[./diald ][8185] open("/dev/fcd1", O_RDONLY, 0) .............................................. = 9
[./diald ][8185] ioctl(9, 0xc00c4601, 0x7fff9550) ............................................ = 0
[./diald ][8185] ioctl(9, 0xc00c4604, 0x7fff9560) ............................................ = 0
[./diald ][8185] close(9) .................................................................... = 0
[./diald ][8185] ioctl(9, SIOC_GET_PLUN, 0x7fff9540) ......................................... ERR#9 EBADF
[./diald ][8185] close(9) .................................................................... ERR#9 EBADF

We have 4 FC HBA (AB465) running B11.23.07 FibrChanl-01 and this specific one (/dev/fcd1) is not the one connected to the EMC DIsk Array but to a MSL6030 LTO library via a FC switch.

There's a chance that this behavior could be triggered by a backup running overnight over that channel, I could verify disabling the backup for a few days.

Still I would like some support from you about the meaning of the tusc traces above.

Thanks !

Mike

7 REPLIES 7
Steven E. Protter
Exalted Contributor

Re: diald taking 100% CPU

Shalom Mike,

I'd check the EMC logs for possible hardware problems.

Also /var/adm/syslog/syslog.log for clues to this issue.

I suspect the application itself that is using the CPU cycles, perhaps remove it and reinstall it, or copy the binary from another similar system that does not exhibit this behavior.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Michele (Mike) Alberton
Regular Advisor

Re: diald taking 100% CPU

Hi Steven, thanks for replying.

I checked EMC and sys logs but nothing obvious there.

The /dev/fcd1 device refers to the HBA connected to the Backup LTO library, so that I suspect the issue could be triggered by the nightly backup running over that connection. I've disable the backup just to monitor the situation and restarted diald and its traces.

Do you know the meaning of:

[./diald ][8185] close(9) .................................................................... ERR#9 EBADF
[./diald ][8185] open("/dev/fcd1", O_RDONLY, 0) .............................................. = 9

I understand the EBADF refers to a file which diald tries to close but descriptor refers to no open file, not sure about the open row.

Thanks !

Mike
Dennis Handly
Acclaimed Contributor

Re: diald taking 100% CPU

>Do you know the meaning of:
close(9) ...... = 0
ioctl(9, SIOC_GET_PLUN, 0x7fff9540) .. ERR#9 EBADF
close(9) ...... ERR#9 EBADF

>I understand the EBADF refers to a file which diald tries to close but descriptor refers to no open file

Basically there is some sloppy programming. The file is closed then ioctl and close is done on the unopened file.

Then this whole thing is repeated from the open.
Michele (Mike) Alberton
Regular Advisor

Re: diald taking 100% CPU

Hi Dennis, thanks !

I looked at the tusc taken when the process started behaving "wildly":

[./diald ][8185] ioctl(9, 0xc00c4604, 0x7fff9560) .... = 0
[./diald ][8185] close(9) ............................ = 0
[./diald ][8185] ioctl(8, SIOC_GET_PLUN, 0x7fff9540) . = 0
[./diald ][8185] stat("/dev/rmt/1mnb", 0x7fff9310) ... = 0
[./diald ][8185] stat("/tmp/devfcdtape0601c0", 0x7fff9380) = 0
[./diald ][8185] open("/tmp/devfcdtape0601c0", O_RDONLY|O_NDELAY, 01210) = 9
[./diald ][8185] close(10) ........................... = 0
[./diald ][8185] open("/dev/fcd1", O_RDONLY, 0) ...... = 10
[./diald ][8185] ioctl(10, 0xc00c4601, 0x7fff9550) ... = 0
[./diald ][8185] ioctl(10, 0xc00c4604, 0x7fff9560) ... = 0
[./diald ][8185] close(10) ........................... = 0
[./diald ][8185] ioctl(9, SIOC_GET_PLUN, 0x7fff9540) . ERR#16 EBUSY
[./diald ][8185] close(9) ............................ = 0
[./diald ][8185] open("/dev/fcd1", O_RDONLY, 0) ...... = 9
[./diald ][8185] ioctl(9, 0xc00c4601, 0x7fff9550) .... = 0
[./diald ][8185] ioctl(9, 0xc00c4604, 0x7fff9560) .... = 0
[./diald ][8185] close(9) ............................ = 0
[./diald ][8185] ioctl(9, SIOC_GET_PLUN, 0x7fff9540) . ERR#9 EBADF
[./diald ][8185] close(9) ............................ ERR#9 EBADF
[./diald ][8185] open("/dev/fcd1", O_RDONLY, 0) ...... = 9
[./diald ][8185] ioctl(9, 0xc00c4601, 0x7fff9550) .... = 0
[./diald ][8185] ioctl(9, 0xc00c4604, 0x7fff9560) .... = 0
[./diald ][8185] close(9) ............................ = 0
[./diald ][8185] ioctl(9, SIOC_GET_PLUN, 0x7fff9540) . ERR#9 EBADF
[./diald ][8185] close(9) ............................ ERR#9 EBADF
[./diald ][8185] open("/dev/fcd1", O_RDONLY, 0) ...... = 9
[./diald ][8185] ioctl(9, 0xc00c4601, 0x7fff9550) .... = 0
[./diald ][8185] ioctl(9, 0xc00c4604, 0x7fff9560) .... = 0
[./diald ][8185] close(9) ............................ = 0
[./diald ][8185] ioctl(9, SIOC_GET_PLUN, 0x7fff9540) . ERR#9 EBADF
[./diald ][8185] close(9) ............................ ERR#9 EBADF
[./diald ][8185] open("/dev/fcd1", O_RDONLY, 0) ...... = 9
[./diald ][8185] ioctl(9, 0xc00c4601, 0x7fff9550) .... = 0

I can see that an ioctl(9, SIOC_GET_PLUN, 0x7fff9540) gets a BUSY response, then it close the (9).

Next open on /dev/fcd1 gets assigned (9) but at this point diald starts with the ioctl on (9) after the descriptor is already closed. I think there could be a bug on the code maybe triggered only after that BUSY ocndition occurs, does it make sense ?

Thanks !

Mike
Dennis Handly
Acclaimed Contributor

Re: diald taking 100% CPU

>I think there could be a bug on the code maybe triggered only after that BUSY condition occurs, does it make sense?

Yes. But FD 9 was on /tmp/devfcdtape0601c0 where it got that EBUSY.

Unless this was suppose to be on FD 8 where it also did that ioctl(8, SIOC_GET_PLUN, 0x7fff9540).
Michele (Mike) Alberton
Regular Advisor

Re: diald taking 100% CPU

Hi Dennis,

we suspended the backup on the device connected to /dev/fcd1 and the issue has not appeared again which so far makes me believe something could be related to that device.

About the /tmp/devfcdtape0601c0 it's true that it was the one replying BUSY but don't understand its meaning, could it be created by diald itself ?

Sorry for the naive question but what does that device mean in /tmp directory ? I checked it and it was really a character dev ice, maybe needed by diald for some temporary repository ?
I'll wait a few more days and then start the backup to monitor a bit closer that device.

Thanks !

Mike
Dennis Handly
Acclaimed Contributor

Re: diald taking 100% CPU

>About the /tmp/devfcdtape0601c0 it's true that it was the one replying BUSY but don't understand its meaning, could it be created by diald itself?

Most likely, check the date.

>Sorry for the naive question but what does that device mean in /tmp directory?

I don't know either. I've seen /var/tmp/rdsk* files.