<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: SLES 10 SP1, DRBD, and rsync in Operating System - Linux</title>
    <link>https://community.hpe.com/t5/operating-system-linux/sles-10-sp1-drbd-and-rsync/m-p/4098205#M30663</link>
    <description>We used to have a problem with our attached storage going read-only quite often. I found an old bug report at the time that pointed to SCSI timeouts. It might not be the same problem, but it might help: &lt;A href="http://www.redhatmagazine.com/2006/12/15/tips_tricks/" target="_blank"&gt;http://www.redhatmagazine.com/2006/12/15/tips_tricks/&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;It is about halfway down the page, under the heading&lt;BR /&gt;Why does the ext3 filesystem on my Storage Area Network (SAN) repeatedly become read-only?</description>
    <pubDate>Tue, 06 Nov 2007 18:28:14 GMT</pubDate>
    <dc:creator>Justin_99</dc:creator>
    <dc:date>2007-11-06T18:28:14Z</dc:date>
    <item>
      <title>SLES 10 SP1, DRBD, and rsync</title>
      <link>https://community.hpe.com/t5/operating-system-linux/sles-10-sp1-drbd-and-rsync/m-p/4098203#M30661</link>
      <description>Weird situation that I haven't been able to figure out yet. Thought someone on here might have run into it.&lt;BR /&gt;&lt;BR /&gt;I have two Heartbeat/DRBD clusters configured identically (other than names, of course). I was trying to sync the data between the current non-clustered nodes and the new clusters. On one cluster, this appears to work perfectly using rsync. On the other, the file system becomes read-only at seemingly random times. I've been running a tar piped over ssh to do a blind copy of the data to this cluster, and that has been running smoothly for a while now, but any attempt to use rsync results in the file system going read-only. Not ideal for the final cutover.&lt;BR /&gt;&lt;BR /&gt;A Google search turned up nothing useful. Any ideas?</description>
      <pubDate>Tue, 06 Nov 2007 15:49:18 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/sles-10-sp1-drbd-and-rsync/m-p/4098203#M30661</guid>
      <dc:creator>Jeff_Traigle</dc:creator>
      <dc:date>2007-11-06T15:49:18Z</dc:date>
    </item>
    <item>
      <title>Re: SLES 10 SP1, DRBD, and rsync</title>
      <link>https://community.hpe.com/t5/operating-system-linux/sles-10-sp1-drbd-and-rsync/m-p/4098204#M30662</link>
      <description>Shalom,&lt;BR /&gt;&lt;BR /&gt;Looks to me like a disk or filesystem inconsistency. A Linux filesystem is remounted read-only when the kernel detects a problem with it.&lt;BR /&gt;&lt;BR /&gt;Here is what I'd check:&lt;BR /&gt;1) dmesg: look for a problem related to the disk that underlies the filesystem.&lt;BR /&gt;2) fsck the filesystem itself after unmounting it. Same as HP-UX: no fsck with the filesystem hot.&lt;BR /&gt;3) Consider building a new RAM disk (mkinitrd) on the affected system.&lt;BR /&gt;&lt;BR /&gt;There will be evidence; I'd like to see it to help guide the diagnosis.&lt;BR /&gt;&lt;BR /&gt;SEP</description>
      <pubDate>Tue, 06 Nov 2007 16:22:01 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/sles-10-sp1-drbd-and-rsync/m-p/4098204#M30662</guid>
      <dc:creator>Steven E. Protter</dc:creator>
      <dc:date>2007-11-06T16:22:01Z</dc:date>
    </item>
    <item>
      <title>Re: SLES 10 SP1, DRBD, and rsync</title>
      <link>https://community.hpe.com/t5/operating-system-linux/sles-10-sp1-drbd-and-rsync/m-p/4098205#M30663</link>
      <description>We used to have a problem with our attached storage going read-only quite often. I found an old bug report at the time that pointed to SCSI timeouts. It might not be the same problem, but it might help: &lt;A href="http://www.redhatmagazine.com/2006/12/15/tips_tricks/" target="_blank"&gt;http://www.redhatmagazine.com/2006/12/15/tips_tricks/&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;It is about halfway down the page, under the heading&lt;BR /&gt;Why does the ext3 filesystem on my Storage Area Network (SAN) repeatedly become read-only?</description>
      <pubDate>Tue, 06 Nov 2007 18:28:14 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/sles-10-sp1-drbd-and-rsync/m-p/4098205#M30663</guid>
      <dc:creator>Justin_99</dc:creator>
      <dc:date>2007-11-06T18:28:14Z</dc:date>
    </item>
    <item>
      <title>Re: SLES 10 SP1, DRBD, and rsync</title>
      <link>https://community.hpe.com/t5/operating-system-linux/sles-10-sp1-drbd-and-rsync/m-p/4098206#M30664</link>
      <description>Nothing in dmesg. While no fsck is required to remount it read-write after this happens, fsck did detect a problem:&lt;BR /&gt;&lt;BR /&gt;host1:~ # fsck /dev/drbd0&lt;BR /&gt;fsck 1.38 (30-Jun-2005)&lt;BR /&gt;e2fsck 1.38 (30-Jun-2005)&lt;BR /&gt;/dev/drbd0: recovering journal&lt;BR /&gt;The filesystem size (according to the superblock) is 53739520 blocks&lt;BR /&gt;The physical size of the device is 53287936 blocks&lt;BR /&gt;Either the superblock or the partition table is likely to be corrupt!&lt;BR /&gt;&lt;BR /&gt;This seems to correlate with some syslog messages I found this morning:&lt;BR /&gt;&lt;BR /&gt;Nov  6 17:09:46 fshare1 kernel: attempt to access beyond end of device&lt;BR /&gt;Nov  6 17:09:46 fshare1 kernel: drbd0: rw=0, want=426770448, limit=426303488&lt;BR /&gt;Nov  6 17:09:46 fshare1 kernel: EXT3-fs error (device drbd0): read_inode_bitmap: Cannot read inode bitmap - block_group = 1628, inode_bitmap = 53346305&lt;BR /&gt;Nov  6 17:09:46 fshare1 kernel: Aborting journal on device drbd0.&lt;BR /&gt;Nov  6 17:09:46 fshare1 kernel: EXT3-fs error (device drbd0) in ext3_ordered_writepage: IO failure&lt;BR /&gt;Nov  6 17:09:47 fshare1 kernel: ext3_abort called.&lt;BR /&gt;Nov  6 17:09:47 fshare1 kernel: EXT3-fs error (device drbd0): ext3_journal_start_sb: Detected aborted journal&lt;BR /&gt;Nov  6 17:09:47 fshare1 kernel: Remounting filesystem read-only&lt;BR /&gt;Nov  6 17:09:50 fshare1 kernel: EXT3-fs error (device drbd0) in ext3_new_inode: IO failure&lt;BR /&gt;Nov  6 17:09:50 fshare1 kernel: EXT3-fs error (device drbd0) in ext3_create: IO failure&lt;BR /&gt;Nov  6 17:10:09 fshare1 kernel: __journal_remove_journal_head: freeing b_committed_data&lt;BR /&gt;&lt;BR /&gt;All of which led me to the discovery this morning that the available space for the RAID-5 LUNs on the internal arrays doesn't match (203.4GB on one and 205.0GB on the other). The first rule of clusters is to make everything identical. :) So I'm reformatting the LVs to the lower capacity so they match and will test that. I'm pretty sure that will fix the problem.</description>
      <pubDate>Wed, 07 Nov 2007 10:29:13 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/sles-10-sp1-drbd-and-rsync/m-p/4098206#M30664</guid>
      <dc:creator>Jeff_Traigle</dc:creator>
      <dc:date>2007-11-07T10:29:13Z</dc:date>
    </item>
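    <!-- The size mismatch reported by fsck above can be tied back to the two LUN capacities with simple arithmetic. A quick sketch (figures copied from the fsck output; 4096-byte ext3 blocks and 1 GiB = 1073741824 bytes are assumed):

```shell
SB_BLOCKS=53739520     # filesystem size according to the superblock
DEV_BLOCKS=53287936    # physical size of /dev/drbd0

# The superblock describes a filesystem sized for the larger LUN:
echo $(( SB_BLOCKS * 4096 / 1073741824 ))    # 205 (GiB)
# The device itself only has the capacity of the smaller LUN:
echo $(( DEV_BLOCKS * 4096 / 1073741824 ))   # 203 (GiB)
```

This is consistent with the discovery described above: the filesystem appears to have been laid out on the 205.0GB side and replicated onto the 203.4GB side, so any access into the missing tail fails and the filesystem goes read-only.
    -->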
    <item>
      <title>Re: SLES 10 SP1, DRBD, and rsync</title>
      <link>https://community.hpe.com/t5/operating-system-linux/sles-10-sp1-drbd-and-rsync/m-p/4098207#M30665</link>
      <description>Well, it looked promising, but last night I continued having the problem (although it doesn't seem to happen as often now... I was able to copy all the data after 3 attempts). So I have the following setup:&lt;BR /&gt;&lt;BR /&gt;1. The mismatched physical sizes of the LUNs (203.4GB on one and 205.0GB on the other).&lt;BR /&gt;2. The matching LVs defined on these LUNs (both 203.4GB).&lt;BR /&gt;3. DRBD configured to use these matching LVs.&lt;BR /&gt;&lt;BR /&gt;I got the same "attempt to access beyond end of device" error in syslog a few hours ago. This time, however, fsck found no problems:&lt;BR /&gt;&lt;BR /&gt;hostname1:~ # fsck /dev/mapper/vg01-data&lt;BR /&gt;fsck 1.38 (30-Jun-2005)&lt;BR /&gt;e2fsck 1.38 (30-Jun-2005)&lt;BR /&gt;/dev/mapper/vg01-data: recovering journal&lt;BR /&gt;/dev/mapper/vg01-data contains a file system with errors, check forced.&lt;BR /&gt;Pass 1: Checking inodes, blocks, and sizes&lt;BR /&gt;Pass 2: Checking directory structure&lt;BR /&gt;Pass 3: Checking directory connectivity&lt;BR /&gt;Pass 4: Checking reference counts&lt;BR /&gt;Pass 5: Checking group summary information&lt;BR /&gt;/dev/mapper/vg01-data: 175608/26673152 files (1.7% non-contiguous), 36824434/53320704 blocks&lt;BR /&gt;&lt;BR /&gt;So I'm confused. Based on the message, it seems like DRBD is trying to write to space on the primary system's LUN that isn't part of the underlying LV?</description>
      <pubDate>Fri, 09 Nov 2007 08:55:02 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-linux/sles-10-sp1-drbd-and-rsync/m-p/4098207#M30665</guid>
      <dc:creator>Jeff_Traigle</dc:creator>
      <dc:date>2007-11-09T08:55:02Z</dc:date>
    </item>
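    <!-- The kernel message quoted in the earlier post supports this reading: the failed access lies beyond the device limit but still inside the filesystem the superblock describes. A quick check (figures from the syslog and fsck output quoted earlier; 512-byte sectors and 4096-byte ext3 blocks are assumed):

```shell
WANT=426770448        # "want=" from the kernel message, in 512-byte sectors
LIMIT=426303488       # "limit=" from the same message
SB_BLOCKS=53739520    # filesystem size per the earlier superblock report
DEV_BLOCKS=53287936   # physical device size per the earlier fsck output

# The access is past the end of the device...
echo $(( WANT > LIMIT ))                      # 1
# ...but still within the filesystem size the superblock claims,
# so ext3 is still operating on the larger geometry:
echo $(( WANT * 512 < SB_BLOCKS * 4096 ))     # 1
# The kernel's "limit" is exactly the physical device size fsck reported:
echo $(( LIMIT * 512 == DEV_BLOCKS * 4096 ))  # 1
```
    -->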
  </channel>
</rss>

