Operating System - HP-UX

no response from "ll" in a directory

 
Matt Harrell
Advisor


One of the systems I administer is an HP-UX workstation (9000/785 running 11.00). It's a TDM host on the Ford ANX network, and it runs IDEAS.

In the last few weeks, IDEAS has begun to lock up occasionally. When it happened this morning, we noticed that we could not get a directory listing (e.g., "ll") of a directory that contains IDEAS work files. The directory is its own file system. It is not the install directory for IDEAS, just the work directory.

Clearly, when IDEAS hangs, something happens to the directory, but I'm curious as to what it might be. What would cause a directory to stop responding to "ll"?

To fix the root problem (IDEAS locking up), we have a support contract with SDRC which I will be utilizing.
13 REPLIES
A. Clay Stephenson
Acclaimed Contributor

Re: no response from "ll" in a directory

One very common cause is that your local directory has a subdirectory or link that is a mountpoint for a remote filesystem. If the remote filesystem is failing, that will hang your ls command.

You may also have a corrupt filesystem. How long has it been since you've done an fsck?
If it ain't broke, I can fix that.
steven Burgess_2
Honored Contributor

Re: no response from "ll" in a directory

Hi Matt

Do you have any errors in your syslog or from dmesg regarding inode errors or pv timeouts?

Regards

Steve
take your time and think things through
Anil C. Sedha
Trusted Contributor

Re: no response from "ll" in a directory

Matt,

Stupid as it may sound:

Did you try creating an alias for "ll"?

Did you try using ls -ail, or plain ls?

I believe it's just that problem.

Regards,
Anil
If you need to learn, now is the best opportunity
Govind
Frequent Advisor

Re: no response from "ll" in a directory

Hey Matt
There are a couple of issues I can think of that might cause this problem.
1) If this directory is a local directory (on the server) and you are not able to do an ls -l on the server, then the file system that this directory is on is definitely corrupted. You could try running fsck -y on it.
2) If the server that shares this directory is itself having a problem doing ls -l, that would also cause the hang. I would run "showmount -e SERVERNAME" from the client to see whether the exported fs from the server is visible on the client. Then try to mount it on /mnt (or any directory of your choice) and try doing an ls -l.
3) Another likely issue would be related to the automounter. Since I have been confronting NFS and automounter issues forever, I didn't want to rule it out.
Hope this helps
Govind
Dont try to fix something till it Aint Broke...Honesty is not always the best policy.....
Matt Harrell
Advisor

Re: no response from "ll" in a directory

I should have also mentioned that the problem is fixed by a reboot.

I did find a TON of errors in the syslog about "giving up on pv 0x00...", among other ugly messages. I'll have to research these.
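
For anyone researching the same kind of messages, counting them and looking at the most recent ones is a reasonable first step. This is only a minimal sketch: it assumes the log is in the usual HP-UX location (/var/adm/syslog/syslog.log) and that the phrase quoted above appears literally in each message. The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the function names are illustrative, not HP-UX tools.

# Count the lines in a log file that contain a pattern
# (default: the phrase quoted in the post above).
count_pv_errors() {
    logfile=$1
    pattern=${2:-"giving up on pv"}
    grep -c "$pattern" "$logfile"
}

# Print the most recent matching lines (default: the last 5).
last_pv_errors() {
    logfile=$1
    pattern=${2:-"giving up on pv"}
    grep "$pattern" "$logfile" | tail -n "${3:-5}"
}

# Usage (assumed default log path; adjust if yours differs):
#   count_pv_errors /var/adm/syslog/syslog.log
#   last_pv_errors  /var/adm/syslog/syslog.log
```

A count that keeps growing between checks would suggest the errors are ongoing rather than left over from a single event.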
A. Clay Stephenson
Acclaimed Contributor

Re: no response from "ll" in a directory

You've found your problem. You had better plan on a drive replacement very, very soon.
If it ain't broke, I can fix that.
Don Yeske
Occasional Advisor

Re: no response from "ll" in a directory

I may be a little late on this, but...

Note that when "ls" works and "ll" doesn't, the difference is that "ll" must stat() every file (reading each inode to obtain the extra information), whereas plain "ls" gets everything it needs, the file names, from the directory itself (a single file, effectively). If you can't stat the files in a given directory, either you have built up far too many files in a single directory (I have seen this where more than 100,000 small files exist in one directory, and ll slows way down), or the files themselves are damaged on disk.

Given what you're seeing (the "giving up on pv" messages), and the fact that this is a mount point, the disk is hosed, as someone else has already pointed out. I would still make sure the physical volume named in syslog is the same volume on which these files reside, or is one of a mirrored set where they reside if you're using MirrorDisk/UX. You definitely have a disk problem. Now, is the problem with *this* disk?
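
One way to check this on a suspect directory is to compare a names-only listing with a long listing. The following is a rough sketch, not a diagnostic tool: the function names are hypothetical, and it assumes that plain ls does not stat each entry (typically true when no options such as -l or -F are given).

```shell
#!/bin/sh
# Hypothetical helpers for comparing the two kinds of listing.

# Count entries by name only; this reads the directory itself.
# -A includes dot files but skips "." and "..".
count_names() {
    ls -A "$1" | wc -l | tr -d ' '
}

# Count entries via a long listing, which has to stat every file.
# The first line of "ls -l" on a directory is the "total" line, so drop it.
count_stat() {
    ls -lA "$1" | sed 1d | wc -l | tr -d ' '
}

# Usage (the path is a placeholder for the affected directory):
#   count_names /path/to/workdir
#   count_stat  /path/to/workdir
```

If the first call returns promptly and the second hangs, that is consistent with the directory being readable while the inodes (or the disk blocks under them) are not.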

- Don
Matt Harrell
Advisor

Re: no response from "ll" in a directory

The disk is having problems today. There are three file systems on that disk, /IMI, /team, and /opt2 (no longer used). /team will not mount ("read of super-block on /dev/vg02/lvol3 failed: I/O error"). However, /IMI is mounted, but there are some errors in syslog.log about it. Both file systems are mirrored with a strict mirroring policy. I'm not sure I understand why /team isn't working with the mirrored copies. Is this due to the fact that the disk has not totally failed? SAM still sees it just fine, and /IMI is mounted.

I'm not sure what the best way to handle this is. I've got the "HP Procedure for Replacing an LVM Disk" in front of me, but because the disk does not seem to be totally dead, I'm not sure how best to proceed (other than calling HP and getting the ball rolling on a new disk).
A. Clay Stephenson
Acclaimed Contributor

Re: no response from "ll" in a directory

Since you are mirrored (or at least you think you are), the first thing to do is verify that the mirrors are working. I would do an lvdisplay -v on each of these LVOLs and look for stale extents. You may need to do an lvsync. Remember, LVM and MirrorDisk/UX know absolutely nothing about filesystems, only about extents. It is quite possible to have absolutely consistent LVOL blocks and at the same time a corrupt filesystem. To be on the safe side, while the filesystem is still mounted I would run a backup, then unmount the filesystem and do an fsck -o full (assuming this is a vxfs filesystem). For your filesystem that won't mount, you definitely need to do an fsck. Now, to really make your day, it is possible (though not likely) to have two disks that are somewhat bad.
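
The stale-extent check can be sketched as a small filter over the lvdisplay output. This assumes that stale extents are flagged with the word "stale" in the lvdisplay -v listing; count_stale_lines is a hypothetical helper name, and apart from lvol3 the logical volume names matched by the loop are whatever exists on your system.

```shell
#!/bin/sh
# Hypothetical helper: reads "lvdisplay -v" output on stdin and prints how
# many lines mention a stale extent. Assumes the word "stale" marks them.
count_stale_lines() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that.
    grep -c 'stale' || true
}

# Usage on the HP-UX host:
#   for lv in /dev/vg02/lvol*; do
#       printf '%s: ' "$lv"
#       lvdisplay -v "$lv" | count_stale_lines
#   done
```

A non-zero count only says that a mirror copy is out of date; it says nothing about whether the filesystem on top is consistent, which is why the fsck is still needed.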
If it ain't broke, I can fix that.
David Hixson
Advisor

Re: no response from "ll" in a directory

Up until about 6 months ago, 'working' disks and 'not working' disks that were mirrored would still do reads on a round-robin scheme. The current patch rev gives preference to reading from disks that are working....

This caused a lot of problems with performance after the failure of one member of a mirror set.
LVM is a powerful tool in the hands of the devious.
Matt Harrell
Advisor

Re: no response from "ll" in a directory

From what I can tell from fsck and lvdisplay, there is a physically bad disk at c3t15d0. fsck completely fails to run on file systems that use this disk (if I'm reading this right). I'm not on site, but the IT guy there says that one of the disks in the external array is solidly lit. The others show flickering activity, so I'm assuming that this disk must be c3t15d0.

If all this is true, then what good did my mirrors do me? There are 4 file systems on vg02 (external array) that had extents on c3t15d0, and 3 of those were mirrored. In every case, the extents that reside on the failed disk are mirrored on c3t6d0. Is this an indication that c3t6d0 might be having issues? If so, how do I check that disk for errors?
David Hixson
Advisor

Re: no response from "ll" in a directory

The best way to get back into a working state is probably to split all of the lvols that are on the disk, fsck the filesystems and just mount up the ones that are working. This will also help you clearly identify which disk is the bad one. (Probably the solidly lit drive)

Depending on how broken the disk is, you may have to reduce it out with the pvkey rather than the normal LV tools...

The first priority should be to make sure you have the current patches installed for LVM and for Veritas (VxFS). Both have fixed some critical bugs in the past year that cause these kinds of problems. If you get lucky, you can probably get a solid system hang out of Veritas while you are in this condition as well, so it is definitely something that should be addressed as soon as possible. After the patch install, it might be worth pulling the solidly lit disk before the reboot and then bringing the vg up manually (ignoring quorum) to see if everything works as expected. Then the disk replacement procedure is very easy (vgcfgrestore).
LVM is a powerful tool in the hands of the devious.
Don Yeske
Occasional Advisor

Re: no response from "ll" in a directory

Matt,

In general, I agree with everything David Hixson has said. I tend toward a more time-consuming approach, though...

- You will need to update your current patches for LVM and VxFS before doing anything else.

- Next, you'll need to be sure you know which disk is having the errors. Start with the messages from syslog.log and look at the pv name. 'vgdisplay -v vg02' will tell you what disks are in that volume group, and 'lvdisplay -v /dev/vg02/lvol3 | more' will tell you what disks are part of that logical volume and, at the bottom of the first page of the output, how the extents are distributed among those disks. Repeat this process for all affected logical volumes, and make sure the extents marked as 'stale' are all on the same disk, and it's the disk you expect.

- Next, you'll need the pv key of the failed disk. Repeat the lvdisplay command, but this time, add the -k switch. In the column where the pv name was displayed before (e.g. -- c3t15d0) you will see a pv key number (usually 0 or 1) in its place.

- Of course, you have to physically locate the failed disk on the system. The disk name helps you do that. You know, for example, that the failed disk (if it's the one you mentioned) is on controller 3, target 15. The one with that SCSI target ID on that bus is the bad disk. How you find out where that bus is, though, and which one has that target, varies by hardware platform.

- The most reliable way to get the disk out of there, in my experience, is to pull it and boot without it as was suggested. When you come up, in order to activate the volume group, you'll have to ignore quorum with the -q n argument to vgchange (e.g. -- vgchange -a y -q n vg02). Then, you will want to reduce the failed mirrors by pv key. Read the man page for lvreduce carefully, as this is a delicate procedure -- you could specify the wrong key, or the wrong mirror to reduce, and blow away all data in that volume. So, BE CAREFUL. You will need to repeat this for each volume that lived on the failed disk, so that no logical volumes have mirrors on that disk in LVM's configuration.

- When there are no more mirrors on the failed disk, reduce the disk from the volume group (vgreduce). Since the disk isn't available to have its header written to, you may need to force it out (vgreduce -f). See the vgreduce man page.

- When this is done, you will need to reactivate the volume group (do another vgchange -a y) to see the updated information in commands like vgdisplay and lvdisplay. The data these commands use is cached in the kernel. Once you have the mirrors reduced, and the disk taken out of the volume group, run vgdisplay -v and lvdisplay -v again for each affected logical volume to ensure that the changes did what you expected. At this point, no LVM commands should fail, and you will have no record of that disk in that volume group. A 'strings /etc/lvmtab' will also verify this.

- With the LVM configuration working, but without the file systems mounted, run a full fsck on the affected file systems. Hopefully, the remaining extents on the mirror copies are good, and you can recover the file systems to a consistent state. In all likelihood, some things will be damaged, but hopefully the damage is not too extensive. A full fsck takes time...

- Recovery, at this point, is more time-consuming than a simple vgcfgrestore, but straightforward. You will need to insert the new disk, add it to the volume group (vgextend), and extend mirrors onto it (lvextend) as you would if it were a brand-new disk -- which it is.

Again, this is dangerous, but I've found it to be the most reliable way to fix a 'partially broken' disk in LVM. Best of luck!
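
To make the removal half of the steps above easier to review before touching anything, here is a dry-run sketch that only prints the commands in order. It executes nothing. plan_disk_removal is a hypothetical helper, and the lvreduce form shown (-m 0 -k with the pv key last) is an assumption for a logical volume with a single mirror copy; check it against lvreduce(1M) on your patch level before running any of the printed lines.

```shell
#!/bin/sh
# Dry run only: prints the commands for review and executes nothing.
# plan_disk_removal is a hypothetical helper, not an HP-UX tool.
# Arguments: volume group name, pv key of the failed disk, then each
# mirrored logical volume that had a copy on that disk.
plan_disk_removal() {
    vg=$1
    pvkey=$2
    shift 2
    # Activate the volume group without quorum (failed disk pulled).
    echo "vgchange -a y -q n $vg"
    for lv in "$@"; do
        # Confirm the pv key, then drop the mirror copy on the failed disk.
        # Assumption: one mirror copy per volume, hence "-m 0".
        echo "lvdisplay -v -k /dev/$vg/$lv"
        echo "lvreduce -m 0 -k /dev/$vg/$lv $pvkey"
    done
    # Force the missing disk out of the volume group, then reactivate
    # and verify that no record of the disk remains.
    echo "vgreduce -f $vg"
    echo "vgchange -a y $vg"
    echo "vgdisplay -v $vg"
    echo "strings /etc/lvmtab"
}

# Usage (pv key and lvol names are example values):
#   plan_disk_removal vg02 1 lvol1 lvol3
```

Printing rather than executing is deliberate: the pv key and mirror count have to be checked by hand for each volume, and a wrong value here destroys data.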

- Don