BladeSystem - General

BL685c file system corruption of in-memory data detected

 
Allen Huang_1

We have a sticky problem that keeps coming up on some of our BL685c blades. We have six of these blades, all configured for the same purpose and running SLES 10 SP2. At seemingly random times, the following messages appear in /var/log/messages:

May 4 12:38:21 hpbp04 kernel: Filesystem "cciss/c0d0p4": XFS internal error xfs_trans_cancel at line 1175 of file fs/xfs/xfs_trans.c. Caller 0xffffffff880c0a0c
May 4 12:38:21 hpbp04 kernel:
May 4 12:38:21 hpbp04 kernel: Call Trace: {:xfs:xfs_trans_cancel+91}
May 4 12:38:21 hpbp04 kernel: {:xfs:xfs_create+1395} {:xfs:xfs_vn_mknod+429}
May 4 12:38:21 hpbp04 kernel: {vfs_create+390} {open_namei+421}
May 4 12:38:21 hpbp04 kernel: {do_filp_open+28} {do_sys_open+69}
May 4 12:38:21 hpbp04 kernel: {cstar_do_call+27}
May 4 12:38:21 hpbp04 kernel: xfs_force_shutdown(cciss/c0d0p4,0x8) called from line 1176 of file fs/xfs/xfs_trans.c. Return address = 0xffffffff880b9475
May 4 12:38:21 hpbp04 kernel: Filesystem "cciss/c0d0p4": Corruption of in-memory data detected. Shutting down filesystem: cciss/c0d0p4
May 4 12:38:21 hpbp04 kernel: Please umount the filesystem, and rectify the problem(s)
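Since six blades are configured identically, it may help to check whether the shutdowns cluster on particular hosts or partitions. The following is a minimal sketch, not a tested tool: it assumes the syslog line format shown in the excerpt above, and the names `find_xfs_shutdowns` and `count_by_host` are illustrative only.

```python
import re
from collections import Counter

# Matches the XFS shutdown line as it appears in the excerpt, e.g.
# May 4 12:38:21 hpbp04 kernel: Filesystem "cciss/c0d0p4": Corruption of in-memory data detected. ...
# Assumption: standard syslog prefix of "<month> <day> <time> <host> kernel:".
SHUTDOWN_RE = re.compile(
    r'^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+'
    r'Filesystem "(?P<fs>[^"]+)":\s+Corruption of in-memory data detected'
)


def find_xfs_shutdowns(lines):
    """Return a (timestamp, host, filesystem) tuple for each shutdown line."""
    events = []
    for line in lines:
        match = SHUTDOWN_RE.match(line)
        if match:
            events.append((match["stamp"], match["host"], match["fs"]))
    return events


def count_by_host(events):
    """Count shutdown events per host, to see whether one blade stands out."""
    return Counter(host for _, host, _ in events)
```

To use it, pass an open log file (for example `open("/var/log/messages", errors="replace")`) to `find_xfs_shutdowns`, and combine the results from all six blades before calling `count_by_host`.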

The partition is unusable until it is fixed. To remedy the problem, I have to unmount the partition and mount it again. Running xfs_check and xfs_repair does not report any problems. This is what appears in the log after the umount/mount:

May 4 16:02:07 hpbp04 kernel: XFS mounting filesystem cciss/c0d0p4
May 4 16:02:07 hpbp04 kernel: Starting XFS recovery on filesystem: cciss/c0d0p4 (logdev: internal)
May 4 16:02:09 hpbp04 kernel: Ending XFS recovery on filesystem: cciss/c0d0p4 (logdev: internal)
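The unmount, check, and remount steps described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the device node `/dev/cciss/c0d0p4` is inferred from the log, the mount point `/data` is hypothetical, the order of the steps is assumed, and `xfs_repair -n` (no-modify mode) is used so that nothing is changed on disk. If the log is dirty after the forced shutdown, it may need to be replayed by mounting first, as the recovery messages above show.

```python
import subprocess


def recovery_commands(device, mount_point):
    """Build the unmount / check / remount sequence as argv lists."""
    return [
        ["umount", mount_point],
        ["xfs_repair", "-n", device],  # -n: no-modify mode, report only
        ["mount", "-t", "xfs", device, mount_point],
    ]


def run_recovery(device, mount_point, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for argv in recovery_commands(device, mount_point):
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(argv, check=True)
```

Calling `run_recovery("/dev/cciss/c0d0p4", "/data")` with the default `dry_run=True` only prints the commands, which is a safe way to review them before running anything as root.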

The red flag is the in-memory corruption error. To me, that points to one of three things: the RAID controller cache, the hard drive caches, or the RAM used for the file system cache.

A hot spot was detected in the data center, with intake temperatures of ~90 degrees. That is still within HP's tolerance level, but it makes me suspicious of the hardware.

All firmware has been verified against the latest Firmware DVD 9.0 and is up to date. All diagnostics passed.

Any ideas? Could this be an issue with a bad memory chip? What hardware part(s) should we replace?