How to identify wait reason of stalled fsck?
10-10-2007 12:21 AM
This is a kind of follow-on to my last thread, where I was struggling with a stalled SAP process.
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1165689&admit=-682735245+1191917384394+28353475
This was resolved by creating a new file system and doing a DBMS restore.
As we were still left with the corrupt file system's VG, I activated it exclusively yesterday and ran an fsck on the corrupt volume as a batch job.
This process must have hung irrecoverably almost at once, while a parallel fsck of another RLV from the same VG finished OK within 12 seconds.
Anyway, I left it running till today.
When I now look up the process in the process table, it is reported as sleeping and waiting for some event.
Unfortunately, I cannot attach tusc to its PID to see if it really is sleeping or what is going on.
How can I use the wait channel ID/address that ps displays to find out what is keeping this process waiting?
[root@alster(Z01):/root]
# UNIX95= ps -C fsck -o pid,ppid,uid,tty,cpu,pcpu,pri,nice,time,stime,state,wchan,args
PID PPID UID TT C %CPU PRI NI TIME STIME S WCHAN COMMAND
24644 24643 0 ? 0 0.02 148 20 00:00 15:48:56 S 7750f040 fsck -p -y -ofull /dev/vgZ01/rlvol18
The process is completely unresponsive and no longer reacts to signals.
# for s in TERM QUIT INT HUP KILL;do echo "Sending SIG$s:";kill -$s 24644;sleep 5;kill -0 24644;sleep 5;kill -0 24644 && echo "still living :-(";done
Sending SIGTERM:
still living :-(
Sending SIGQUIT:
still living :-(
Sending SIGINT:
still living :-(
Sending SIGHUP:
still living :-(
Sending SIGKILL:
still living :-(
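The one-liner above can be sketched as a small reusable function. This is only an illustration of the same escalation idea: `try_signals` is a hypothetical helper name, and the default wait of 5 seconds is an assumption taken from the loop above. A process blocked in an uninterruptible kernel sleep will generally survive every signal, SIGKILL included, until the kernel call returns, which is consistent with the output shown.

```shell
#!/bin/sh
# try_signals PID [WAIT_SECS]
# Hypothetical helper: send progressively harsher signals to PID and
# report after each one whether the process is still present.
try_signals() {
    pid=$1
    wait_secs=${2:-5}
    for sig in TERM QUIT INT HUP KILL; do
        # kill -0 delivers no signal; it only tests whether the PID exists.
        if ! kill -0 "$pid" 2>/dev/null; then
            echo "process $pid already gone before SIG$sig"
            return 0
        fi
        echo "Sending SIG$sig:"
        kill -s "$sig" "$pid" 2>/dev/null
        sleep "$wait_secs"
        if kill -0 "$pid" 2>/dev/null; then
            echo "still living"
        else
            echo "terminated by SIG$sig"
            return 0
        fi
    done
    # Every signal was sent and the process is still in the process table.
    return 1
}
```

Called as `try_signals 24644`, it returns non-zero if the process outlives all five signals.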
10-10-2007 12:28 AM
Re: How to identify wait reason of stalled fsck?
You need to terminate this and try it again in single user mode.
It is not going to finish. I suspect a SAN disk problem based on my memory of your earlier thread.
If you can get an fsck to finish that might resolve the issue.
A backup, newfs, and restore might also help if it's not coming from the SAN.
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
10-10-2007 12:49 AM
Re: How to identify wait reason of stalled fsck?
I cannot transition into single user mode right now.
But I think for my sort of "post mortem examination" this isn't really necessary.
As I wrote, we already rebuilt the corrupted file system and did the recovery.
So the app is now happily running using different storage.
My objective was only to find out a possible reason for the corruption because this is what the SAP DBAs surely will expect from me.
So I ran the fsck on the old RLV which isn't used by anything other than my fsck.
I was simply wondering if fsck could somehow magically mend the corruption, which obviously was too high an expectation.
The SAN admin insists that there is no SAN I/O problem according to his findings with his toolset.
And I am convinced that, although I may not be able to write to this file system, I can read all of its disks raw and in full (I've done this already).
# lvdisplay -v /dev/vgZ01/lvol18|awk '$1~/^\/dev\/dsk/{sub("dsk","rdsk",$1);print$1}'|xargs -n1 diskinfo
SCSI describe of /dev/rdsk/c30t1d5:
vendor: HITACHI
product id: OPEN-9*8
type: direct access
size: 57692160 Kbytes
bytes per sector: 512
SCSI describe of /dev/rdsk/c30t1d3:
vendor: HITACHI
product id: OPEN-9*7
type: direct access
size: 50480640 Kbytes
bytes per sector: 512
SCSI describe of /dev/rdsk/c30t1d6:
vendor: HITACHI
product id: OPEN-9*15
type: direct access
size: 108172800 Kbytes
bytes per sector: 512
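The awk step in the pipeline above (pick the PV lines, rewrite the block device path to its raw counterpart) can be tried in isolation. The sketch below wraps it in a hypothetical function, `to_raw_devices`, and feeds it canned text; real `lvdisplay -v` output has more columns and sections, so treat the sample input as a placeholder that only mimics the one thing the filter looks at, a first field starting with `/dev/dsk/`.

```shell
#!/bin/sh
# to_raw_devices: read lvdisplay-like text on stdin and print the raw
# (character) device path for every line whose first field is a
# /dev/dsk/... block device. Hypothetical helper name.
to_raw_devices() {
    awk '$1 ~ /^\/dev\/dsk\// {
        # Rewrite only the directory component, not a "dsk" that might
        # occur elsewhere in the name.
        sub(/\/dsk\//, "/rdsk/", $1)
        print $1
    }'
}

# Canned input: the trailing columns are placeholders, not real values.
printf '%s\n' \
    '   /dev/dsk/c30t1d5   n n' \
    '   /dev/dsk/c30t1d3   n n' \
    '   some other line' |
    to_raw_devices
# prints:
# /dev/rdsk/c30t1d5
# /dev/rdsk/c30t1d3
```

Anchoring the substitution on `/dsk/` rather than the bare string `dsk` is a small safety margin over the original `sub("dsk","rdsk",$1)`; for the device names in this thread both give the same result.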
10-10-2007 01:15 AM
Re: How to identify wait reason of stalled fsck?
This has to be a SAN / SCSI-driver problem.
Clearly SAP / Oracle is no longer suspect now that fsck hangs.
One would have hoped for, and expected, an I/O error report with a time-out of some sort.
Stating the obvious here perhaps, but would this not be the perfect time to call on your HP-UX support services?
fwiw,
Hein.
10-10-2007 01:33 AM
Re: How to identify wait reason of stalled fsck?
Let's test the hardware.
xstm
Exercise the disk.
You need to get an fsck to finish on the filesystem, even if it means downtime.
SEP
10-14-2007 08:55 PM
Re: How to identify wait reason of stalled fsck?
Sorry for the interruption, but I was at the Nagios-Konferenz on Thursday and Friday and only returned to work today.
The unsignalable fsck started on the 9th of Oct. is still dangling in the process table.
SEP,
I would very much like to test or exercise the disks.
The only problem I face is that I cannot identify the disks concerned because of a bug in STM.
No matter which flavour of STM I call (be it mstm, xstm, or cstm), the displayed map seems to reserve at most 20 characters for the path. As a result, the paths of all our SAN disks, which are longer than 20 characters, are truncated to the point of being indistinguishable (see below).
Can you tell me how I can mark the exact disk in STM's map so that I can run the exercise?
Also, is exercising SAN disks supported by STM, and does it require an SE daytime password?
# UNIX95= ps -C fsck -o pid,state,wchan,stime,args
PID S WCHAN STIME COMMAND
24644 S 7750f040 Oct 9 fsck -p -y -ofull /dev/vgZ01/rlvol18
# lssf $(lvdisplay -v /dev/vgZ01/lvol18|awk '$1~/dsk/{print$1}')|awk '{print$(NF-1)}'
1/0/2/0/0.117.6.19.1.1.5
1/0/2/0/0.117.6.19.1.1.3
1/0/2/0/0.117.6.19.1.1.6
# echo map|cstm|grep '1/0/2/0/0.117.6.19.1'|tail -3
399 1/0/2/0/0.117.6.19.1 SCSI Disk (HITACHIDISK-SU
400 1/0/2/0/0.117.6.19.1 SCSI Disk (HITACHIDISK-SU
401 1/0/2/0/0.117.6.19.1 SCSI Disk (HITACHIDISK-SU
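The effect is easy to reproduce outside STM. The sketch below is only a demonstration with standard tools, not anything STM does internally: it cuts the three hardware paths from the `lssf` output to a given column width and counts how many distinct strings remain. `distinct_after_truncation` is a hypothetical helper name.

```shell
#!/bin/sh
# The three hardware paths reported by lssf above.
paths='1/0/2/0/0.117.6.19.1.1.5
1/0/2/0/0.117.6.19.1.1.3
1/0/2/0/0.117.6.19.1.1.6'

# distinct_after_truncation WIDTH
# Read paths on stdin, keep only the first WIDTH characters of each,
# and print how many different strings are left.
distinct_after_truncation() {
    width=$1
    cut -c "1-$width" | sort -u | wc -l | tr -d ' '
}

printf '%s\n' "$paths" | distinct_after_truncation 20   # prints 1
printf '%s\n' "$paths" | distinct_after_truncation 50   # prints 3
```

At 20 characters all three paths collapse to `1/0/2/0/0.117.6.19.1`, which is exactly what the map listing shows.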
10-14-2007 09:15 PM
10-14-2007 09:32 PM
Re: How to identify wait reason of stalled fsck?
Kudos to the STM guru.
This is an invaluable piece of information.
Now I can identify the map instance numbers.
# echo "mop pathwidth 50;map"|cstm|grep 1/0/2/0/0.117.6.19.1.1.[536]
371 1/0/2/0/0.117.6.19.1.1.3 SCSI Disk (HITACHIOPEN-9*
373 1/0/2/0/0.117.6.19.1.1.5 SCSI Disk (HITACHIOPEN-9*
374 1/0/2/0/0.117.6.19.1.1.6 SCSI Disk (HITACHIOPEN-9*
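One caveat about the grep above: the pattern is unquoted and its dots are regex wildcards, so on a larger map it could match more than intended. A stricter variant is sketched below. `match_paths` is a hypothetical helper, and the last two lines of the canned input are made-up decoys added only to show the difference; they are not from the real map.

```shell
#!/bin/sh
# match_paths PREFIX SET
# Print stdin lines containing PREFIX.<one char from SET> as a whole
# whitespace-delimited field, with the dots in PREFIX taken literally.
match_paths() {
    prefix=$(printf '%s' "$1" | sed 's/\./\\./g')   # escape the dots
    grep -E "(^|[[:space:]])${prefix}\\.[$2]([[:space:]]|\$)"
}

# First three lines are from the map above; the last two are invented decoys.
map='371 1/0/2/0/0.117.6.19.1.1.3 SCSI Disk (HITACHIOPEN-9*
373 1/0/2/0/0.117.6.19.1.1.5 SCSI Disk (HITACHIOPEN-9*
374 1/0/2/0/0.117.6.19.1.1.6 SCSI Disk (HITACHIOPEN-9*
998 1/0/2/0/0.117.6.19.101.3 decoy
999 1/0/2/0/0.117.6.19.1.1.35 decoy'

# Loose pattern as used in the thread (quoted here): matches all 5 lines.
printf '%s\n' "$map" | grep -c '1/0/2/0/0.117.6.19.1.1.[536]'
# Strict pattern: matches only the 3 real entries.
printf '%s\n' "$map" | match_paths '1/0/2/0/0.117.6.19.1.1' 536 | wc -l
```

For the map in this thread both forms return the same three instances, so this is a robustness point rather than a correction.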
10-14-2007 11:05 PM
Re: How to identify wait reason of stalled fsck?
All of the disk exercises terminated successfully.
# echo "mop pathwidth 25;map"|cstm|grep 1/0/2/0/0.117.6.19.1.1.[536]
371 1/0/2/0/0.117.6.19.1.1.3 SCSI Disk (HITACHIOPEN-9* Exercise Successful
373 1/0/2/0/0.117.6.19.1.1.5 SCSI Disk (HITACHIOPEN-9* Exercise Successful
374 1/0/2/0/0.117.6.19.1.1.6 SCSI Disk (HITACHIOPEN-9* Exercise Successful
I also attached to an HP Support Call my colleague had filed during my absence,
and asked for further assistance.
Apart from this, does anyone have another idea about what else could be done?
10-14-2007 11:15 PM
Re: How to identify wait reason of stalled fsck?
I don't know whether the stm disk exerciser touches every block on the disk (probably not), so you *might* want to try a dd of the disks to /dev/null to confirm there are no problems reading every block...
dd if=/dev/dsk/cXtYdZ of=/dev/null bs=2k
HTH
Duncan
I am an HPE Employee

10-14-2007 11:39 PM
Re: How to identify wait reason of stalled fsck?
Rest assured that I already did full rdsk dd reads (albeit with a bigger bs of 1024k to speed things up a little) of all involved disks last week.
For all disks, dd reported the same number of blocks in and out.
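A check of this kind could be scripted roughly as follows. This is a sketch, not the exact commands that were run: `read_check` is a hypothetical helper, and it assumes dd prints the usual `N+M records in` / `N+M records out` summary on stderr. It only proves that every block could be read, not that the contents are correct.

```shell
#!/bin/sh
# read_check DEVICE
# Read DEVICE to the end with dd and compare the "records in" and
# "records out" counters that dd prints on stderr.
read_check() {
    dev=$1
    if ! report=$(LC_ALL=C dd if="$dev" of=/dev/null bs=1024k 2>&1); then
        echo "$dev: dd failed"
        return 1
    fi
    rec_in=$(printf '%s\n' "$report" | awk '/records in/ {print $1}')
    rec_out=$(printf '%s\n' "$report" | awk '/records out/ {print $1}')
    if [ -n "$rec_in" ] && [ "$rec_in" = "$rec_out" ]; then
        echo "$dev: OK ($rec_in records)"
    else
        echo "$dev: MISMATCH (in=$rec_in out=$rec_out)"
        return 1
    fi
}
```

It would be called once per raw device, for example `for d in /dev/rdsk/c30t1d5 /dev/rdsk/c30t1d3 /dev/rdsk/c30t1d6; do read_check "$d"; done`.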
I think, however, that the STM exercisers allow the stress level to be raised, although some of them may then become data destructive (which isn't an issue anymore at this stage).
At least the cstm online help seems to imply this.
e.g.
cstm>he eop
ExerOptions
Syntax: exeroptions | eop [execctrl] [time {
[behavior] [errorexit|errorcount {
mincoverage] [gentactlog {yes|no}] [reporterrors | reportwarnings |
reportinfo] [queries] [queryallow | querynondes | querydes]
Use this command to configure the options which will be used for
subsequently executing exerciser tools, including:
* Controlling execution time/loop limits
* Defining behavior on error detection
* Defining stress level
* Defining the contents of the InfoActLog
* Allowing/disallowing user queries