12-30-2011 02:14 AM - edited 10-16-2012 12:41 PM
iSCSI problem
Problem:
With the iSCSI-00 B.11.31.03b or B.11.31.03c Software Initiator, when memory usage is high (>95%), all lunpaths to the filer LUNs go offline (NO_HW):
vmunix: All available lunpaths of a LUN have gone offline. The LUN has entered a transient condition. The transient time threshold is 120 seconds.
2 lunpaths are currently in a failed state.
The lunpaths do not recover without a reboot, even though the iswd daemons are still running.
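To check whether a host has hit this condition, a small sketch like the one below counts how often the kernel logged the offline message. The function name count_lunpath_offline is hypothetical, and it assumes syslog writes to /var/adm/syslog/syslog.log (the HP-UX default); pass another path if your syslog is configured differently.

```shell
#!/bin/sh
# Hypothetical helper (not an HP tool): count the "all lunpaths offline"
# transient-condition messages in a syslog file.
# Assumes the default HP-UX syslog location unless a path is given.
count_lunpath_offline() {
    log=${1:-/var/adm/syslog/syslog.log}
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'All available lunpaths of a LUN have gone offline' "$log"
}
```

For example, count_lunpath_offline on its own checks the default log, and count_lunpath_offline /path/to/old/syslog.log checks a rotated copy.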
HP has no patch or new driver, only a workaround:
The iswd daemon is the iSCSI TCP connection management daemon.
It has a parent process, which receives TCP open/close requests from the kernel, and it forks child processes based on those requests.
The child iswd process actually manages (opens and closes) the connections. It is a multi-threaded process. One of its threads sleeps non-interruptibly so that it is not terminated at system shutdown time, keeping the TCP connection open as long as possible for iSCSI disk access. The other thread sleeps in select() waiting on the connections to the targets.
This works in normal situations, but once the process is targeted for deactivation (because of memory pressure), the following problem occurs.
When a process is deactivated, all of its threads must change to the STOP state. The second thread (waiting in select()) can be changed to STOP, but the first thread cannot become STOP until it is woken up, and it can only be woken up by the second thread, which is already stopped. So the first thread can be neither woken up nor stopped, and the process hangs in the middle of deactivation (it can no longer be deactivated or reactivated).
I think the iswd daemon should be fixed to address this problem, and the iSCSI lab has just started trying to reproduce it under memory pressure. (They could already reproduce the issue with SIGSTOP.)
In the meantime, I think the workaround is to prevent the iswd daemon from being deactivated.
We have some good news from the L3/Expert Center:
From the TOC dumps and other logs and statistics, they have found a condition that appears to match the one you have run into.
In a nutshell, during times of transient memory pressure, user processes, including the iswd, may be temporarily deactivated and later reactivated by the kernel. In certain situations, the iswd may not be able to recover.
Attached are a few more details for you.
A workaround is to run the iswd at a "real time" priority, so that it does not get deactivated. This may be accomplished by this ACTION PLAN:
To run iswd at "real time" priority on any HP-UX B.11.31 system, whatever the iSCSI version, do the following tasks.
Do this during a period of low system activity. No reboot is required.
1. List the iswd processes with ps -leaf | grep iswd:
netuxbl4 # ps -leaf |grep -i iswd
1401 S root 6707 1 0 153 20 e0000001e9798700 33 e0000001010cdff0 05:57:44 ? 0:00 /opt/iscsi/bin/iswd
1401 R root 13218 6707 0 152 20 e0000001e57cd400 76 - 04:43:38 ? 0:00 /opt/iscsi/bin/iswd
2. Set the PARENT iswd to real time priority 127. In the output above, PID 6707 is the parent and is running at UNIX timeshare priority 153; the child is PID 13218, running at priority 152.
netuxbl4 # rtprio 127 -6707
3. Send a QUIT signal to the CHILD iswd, in this case PID 13218. The parent will respawn a new child at real time priority:
netuxbl4 # kill -QUIT 13218
4. Verify that both iswd processes are running at real time priority. Here they are, and the new child iswd process is PID 13788:
netuxbl4 # ps -leaf|grep iswd
1401 S root 6707 1 0 127 20 e0000001e9798700 33 e0000001010cdff0 05:57:44 ? 0:00 /opt/iscsi/bin/iswd
1401 R root 13788 6707 0 127 20 e0000001afbefc80 76 - 04:56:01 ? 0:00 /opt/iscsi/bin/iswd
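Steps 1 to 3 can be scripted. The sketch below is hypothetical (not supplied by HP): it reads ps -leaf output, takes the iswd whose PPID is 1 as the parent, and prints the rtprio and kill -QUIT commands instead of running them, so they can be reviewed first. It assumes the column layout shown in the sample output above (PID in field 4, PPID in field 5, PRI in field 7, command path last); the function names iswd_procs and iswd_plan are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers (not HP tools). Assumes the `ps -leaf` layout shown
# above: PID = field 4, PPID = field 5, PRI = field 7, command = last field.

# Read `ps -leaf` output on stdin; print "PID PPID PRI" for each iswd,
# skipping the `grep iswd` process if grep output is piped in.
iswd_procs() {
    awk '($NF == "iswd" || $NF ~ /\/iswd$/) && $(NF - 1) != "grep" {
        print $4, $5, $7
    }'
}

# Read `ps -leaf` output on stdin; print an rtprio command for the parent
# iswd (PPID 1), then a kill -QUIT for each of its children.
# $1 is the rtprio argument: 127 applies the workaround, -t backs it out.
iswd_plan() {
    iswd_procs | awk -v prio="${1:-127}" '
        { pid[NR] = $1; ppid[NR] = $2 }
        END {
            for (i = 1; i <= NR; i++)
                if (ppid[i] == 1) parent = pid[i]
            if (parent == "") {
                print "iswd parent process not found" | "cat 1>&2"
                exit 1
            }
            print "/usr/bin/rtprio " prio " -" parent
            for (i = 1; i <= NR; i++)
                if (ppid[i] == parent) print "kill -QUIT " pid[i]
        }'
}
```

Review the output of ps -leaf | iswd_plan 127, then run it with ps -leaf | iswd_plan 127 | sh and verify as in step 4. Passing -t instead of 127 prints the back-out commands. Printing rather than executing is deliberate: a wrong PID here would signal the wrong process.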
To back out the workaround:
On any HP-UX B.11.31 system, whatever the iSCSI version, do the following tasks.
Do this during a period of low system activity. No reboot is required.
1. Set the PARENT iswd back to UNIX timeshare priority. PID 6707 is the parent:
netuxbl4 # rtprio -t -6707
2. Send a QUIT signal to the CHILD iswd, in this case PID 13788. The parent will respawn a new child at timeshare priority:
netuxbl4 # kill -QUIT 13788
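3. Verify that both iswd processes are running at UNIX timeshare priority. In the output captured here, only the parent iswd (priority 153) is listed, besides the grep command itself: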
netuxbl4 # ps -leaf|grep iswd
1401 S root 6707 1 0 153 20 e0000001e9798700 33 e0000001010cdff0 05:57:44 ? 0:00 /opt/iscsi/bin/iswd
401 S root 13915 12983 1 154 20 e0000001e2b61c80 29 e0000001e1fef7f2 04:59:13 pts/1 0:00 grep iswd
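The verification step of each procedure can also be checked mechanically. This hypothetical sketch reads ps -leaf output, prints PID, PRI, and class for every iswd, and exits non-zero unless at least one iswd was found and all of them are in the expected class. It assumes ps shows rtprio real-time priorities as 0 to 127 and timeshare priorities as 128 and above, which matches the sample output in this thread (127 versus 152/153); the function name iswd_check_class is made up.

```shell
#!/bin/sh
# Hypothetical verification helper (not an HP tool). Reads `ps -leaf` or
# `ps -eal` output on stdin; PID = field 4, PRI = field 7 (assumption based
# on the sample output). Assumes PRI 0-127 = rtprio, 128+ = timeshare.
# $1 is "rt" (workaround applied) or "ts" (workaround backed out).
# Exit status 0: at least one iswd found and all are in the wanted class.
iswd_check_class() {
    awk -v want="$1" '
        ($NF == "iswd" || $NF ~ /\/iswd$/) && $(NF - 1) != "grep" {
            found++
            class = ($7 + 0 <= 127) ? "rt" : "ts"
            print $4, $7, class
            if (class != want) bad++
        }
        END { exit (found == 0 || bad > 0) ? 1 : 0 }'
}
```

For example, ps -leaf | iswd_check_class rt after step 4 of the workaround, or ps -leaf | iswd_check_class ts after backing it out. The same field positions appear in the ps -eal output shown further down, so it should also work on that format.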
NOTE: The workaround will need to be done after every reboot.
Additional notes:
1. If the syslogging from the debug kernel causes any problems and you want to go back to the non-debug iSCSI B.11.31.03c, the L3/Expert Center advises scheduling an outage to do so.
Use swinstall with options to "reinstall files if already there" and to "reinstall same version if already present".
2. The iSCSI lab is reviewing the details over the weekend and should confirm this action plan on Monday. We are highly confident the workaround will provide a stable iSCSI Initiator environment until the HP-UX iSCSI Lab has crafted a stable binary solution.
3. A minor iSCSI update, which does not include a fix for this issue, will be released in September. We will ask the Lab to confirm that the workaround remains valid with that version.
The official fix for the iswd iSCSI hang is tracked as QXCR1001231967; the current target date for its release is March 2013.
A workaround to prevent deactivation of the iswd is to run it at real time priority 127 instead of timeshare priority (typically 153 or 152).
Edit line 347 of /sbin/init.d/iscsi, which reads:
/opt/iscsi/bin/iswd
to:
/usr/bin/rtprio 127 /opt/iscsi/bin/iswd
The iswd will run at real time priority 127 starting with the next reboot.
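The edit can also be scripted. This hypothetical sketch matches the iswd start line by its content rather than by line number, on the assumption that line 347 may differ between iSCSI-00 versions. It keeps a backup as /sbin/init.d/iscsi.orig (a name chosen for this example) and refuses to change anything unless exactly one matching line is found; the function name patch_iscsi_rc is made up.

```shell
#!/bin/sh
# Hypothetical helper (not an HP tool): prefix the iswd start line in the
# iSCSI rc script with "/usr/bin/rtprio 127". Matches by content, not by
# line number (assumption: line 347 may move between versions).
patch_iscsi_rc() {
    rc=${1:-/sbin/init.d/iscsi}
    # Count lines consisting only of the iswd path (leading blanks allowed).
    n=$(grep -c '^[[:space:]]*/opt/iscsi/bin/iswd[[:space:]]*$' "$rc")
    if [ "$n" != "1" ]; then
        echo "expected exactly one iswd start line in $rc, found ${n:-none}" >&2
        return 1
    fi
    # Keep a backup copy with the original permissions.
    cp -p "$rc" "$rc.orig" || return 1
    # HP-UX sed has no -i option: write to a temp file, then copy the content
    # back with cat so the rc script keeps its owner and execute bits.
    sed 's|^\([[:space:]]*\)/opt/iscsi/bin/iswd[[:space:]]*$|\1/usr/bin/rtprio 127 /opt/iscsi/bin/iswd|' \
        "$rc" > "$rc.tmp.$$" && cat "$rc.tmp.$$" > "$rc" && rm -f "$rc.tmp.$$"
}
```

A second run finds no plain iswd line any more and stops without touching the file, so the backup from the first run is preserved.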
To verify the iswd is in real time priority after the reboot:
# ps -eal | grep iswd
1401 R 0 24013 24006 0 127 20 e00000030f28c400 76 - ? 0:00 iswd
1401 S 0 24006 1 0 127 20 e0000001af3bb680 44 e0000001010cdff0 ? 0:00 iswd
The parent process 24006, and its child 24013 are both running at priority 127.