How to locate a sector into the file system

Geert Van Pamel — Wed, 30 Jan 2008 20:16:59 GMT

I have a disk sector error.

* How can I find out which file contains this sector?

Then I would know which application is impacted, and which file should be restored or recreated.

I am using ext3 file systems.

dmesg |tail

end_request: I/O error, dev hda, sector 29012529
hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=29012535, high=1, low=12235319, sector=29012529
ide: failed opcode was: unknown
end_request: I/O error, dev hda, sector 29012529

If possible, I would prefer not to halt the system.

Re: How to locate a sector into the file system

Ivan Ferreira — Wed, 30 Jan 2008 21:16:26 GMT

Please see:

http://www.gra2.com/article.php/20041015232512624

Re: How to locate a sector into the file system

Geert Van Pamel — Wed, 30 Jan 2008 23:11:22 GMT

Well, I suspected that it was boinc that was suffering from the bad sector, but now I know for sure.

boinc did not use any CPU time...

ps -fu boinc
UID PID PPID C STIME TTY TIME CMD
boinc 1864 1 0 05:27 ? 00:00:06 ./boinc
boinc 9655 1864 1 16:39 ? 00:00:52 einstein_S5R3_4.20_i686-pc-linux

And the system had huge I/O wait times

I/O wait times > 80 %

vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 2 40720 3396 8468 20700 1 2 38 37 39 66 6 6 8 80
0 3 40720 3336 8496 20700 0 0 3 17 1018 46 0 6 0 93

The system had already gone through an unplanned crash and reboot:

uptime
22:40:04 up 1 day, 17:14, 2 users, load average: 0.03, 0.04, 0.00

Normally the system would never be idle, nor reboot...

smartctl -A /dev/hda
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 051 Pre-fail Always - 41
3 Spin_Up_Time 0x0007 067 042 000 Pre-fail Always - 5760
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 174
5 Reallocated_Sector_Ct 0x0033 253 253 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 253 253 051 Pre-fail Always - 0
8 Seek_Time_Performance 0x0024 253 253 000 Old_age Offline - 0
9 Power_On_Half_Minutes 0x0032 097 097 000 Old_age Always - 19359h+40m
10 Spin_Retry_Count 0x0013 253 253 049 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 102
194 Temperature_Celsius 0x0022 151 094 000 Old_age Always - 29
195 Hardware_ECC_Recovered 0x000a 100 100 000 Old_age Always - 432100053
196 Reallocated_Event_Count 0x0012 099 099 000 Old_age Always - 3
197 Current_Pending_Sector 0x0033 253 253 010 Pre-fail Always - 0
198 Offline_Uncorrectable 0x0031 099 099 010 Pre-fail Offline - 3
199 UDMA_CRC_Error_Count 0x000b 100 100 051 Pre-fail Always - 0
200 Multi_Zone_Error_Rate 0x000b 100 100 051 Pre-fail Always - 0
201 Soft_Read_Error_Rate 0x000b 100 100 051 Pre-fail Always - 0

fdisk -lu /dev/hda

Disk /dev/hda: 120.0 GB, 120060444672 bytes
255 heads, 63 sectors/track, 14596 cylinders, total 234493056 sectors
Units = sectors of 1 * 512 = 512 bytes

Device Boot Start End Blocks Id System
...
/dev/hda6 21109473 31358879 5124703+ 83 Linux

df -kl |sort
...
/dev/hda6 5044156 365184 4422740 8% /home

grep /home /etc/fstab
LABEL=/home /home ext3 defaults 1 2

tune2fs -l /dev/hda6 |grep Block
Block count: 1281175
Block size: 4096
Blocks per group: 32768

A block holds 8 sectors: 4096 / 512 = 8

Thus the block number of the bad sector, counted from the start of the partition, is 987882 (bc does integer division, so the remainder is dropped):

bc
( 29012535 - 21109473 ) / 8
987882
quit

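29012535 = bad sector (the LBAsect value from dmesg)
21109473 = first sector of the partition (from fdisk -lu)

The end_request lines report sector 29012529 instead; it falls in the same block, since (29012529 - 21109473) / 8 is also 987882.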
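
The same arithmetic can be wrapped in a small shell function, so it is easy to repeat for another sector. This is only a sketch: the function name sector_to_block is made up for illustration, and it assumes 512-byte sectors (as fdisk reports above) unless a fourth argument says otherwise.

```shell
#!/bin/sh
# Convert an absolute disk sector (as reported by the kernel) into a
# file system block number relative to the start of the partition.
# Arguments: bad_sector partition_start_sector fs_block_size [sector_size]
sector_to_block() {
    bad_sector=$1
    part_start=$2
    block_size=$3
    sector_size=${4:-512}      # assumption: 512-byte sectors by default
    sectors_per_block=$(( block_size / sector_size ))
    # Integer division drops the sector's offset inside the block.
    echo $(( (bad_sector - part_start) / sectors_per_block ))
}

sector_to_block 29012535 21109473 4096    # prints 987882
```

The block size comes from tune2fs -l and the partition start from fdisk -lu, exactly as in the transcript above.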

which debugfs
/sbin/debugfs
[root@hp-interex ~]# debugfs
debugfs 1.37 (21-Mar-2005)
debugfs: open /dev/hda6
debugfs: icheck 987882
Block Inode number
987882 481248
debugfs: ncheck 481248
Inode Pathname
481248 /boinc/BOINC/slots/0/skygrid_0750Hz_S5R3.dat
debugfs: quit
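
The interactive session above could probably be scripted with the -R option of debugfs, which runs a single request and exits. The helper below is a sketch only: inode_from_icheck is a hypothetical name, and it assumes the two-column icheck output shown in the transcript. The debugfs calls themselves are left as comments because they need root and the real device.

```shell
#!/bin/sh
# Extract the inode number from "debugfs icheck" output read on stdin.
# Skips the header line and rows where no inode owns the block.
inode_from_icheck() {
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Intended use (needs root and the real device, so left commented out):
#   inode=$(debugfs -R "icheck 987882" /dev/hda6 2>/dev/null | inode_from_icheck)
#   debugfs -R "ncheck $inode" /dev/hda6
```

If icheck finds no owner for the block, the function prints nothing, which would mean the bad block sits in free space or in file system metadata rather than in a file.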

So I stopped boinc, and moved the impacted file.

ls -l /home/boinc/BOINC/slots/0/skygrid_0750Hz_S5R3.dat
-rw-rw-r-- 1 boinc boinc 4552614 Sep 2 20:45 /home/boinc/BOINC/slots/0/skygrid_0750Hz_S5R3.dat

We could move this 4 MB file to e.g. /home/badblocks/ and ask boinc to abort processing of this file.

Or it could be that boinc will skip this file automatically, when we start boinc up again, after moving the file containing the bad block.

We can give it a try.

mkdir /home/badblocks

mv /home/boinc/BOINC/slots/0/skygrid_0750Hz_S5R3.dat /home/badblocks/

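Note that the file must be moved within the same partition: there, mv only renames the directory entry and never reads the data, so the bad block is not touched. That is why it went to a badblocks subdirectory under /home.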
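
One way to check the same-partition condition before moving anything is to compare the device IDs of the source and target directories. A minimal sketch, assuming GNU stat from coreutils; the function name same_filesystem is made up for illustration.

```shell
#!/bin/sh
# Succeed only if both paths live on the same file system (same device ID).
# Returns 2 if either path cannot be examined.
same_filesystem() {
    dev_a=$(stat -c %d "$1") || return 2
    dev_b=$(stat -c %d "$2") || return 2
    [ "$dev_a" = "$dev_b" ]
}

# Hypothetical usage with the paths from this thread:
#   same_filesystem /home/boinc/BOINC/slots/0 /home/badblocks && echo "safe to mv"
```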

Better double check

debugfs
debugfs 1.37 (21-Mar-2005)
debugfs: open /dev/hda6
debugfs: icheck 987882
Block Inode number
987882 481248
debugfs: ncheck 481248
Inode Pathname
481248 /badblocks/skygrid_0750Hz_S5R3.dat
debugfs: quit

Yes, we did it! If we never delete this file or move it off the partition, the bad block stays locked inside it and does no more harm.

We can reproduce the sector read error by reading the file:

dd if=/home/badblocks/skygrid_0750Hz_S5R3.dat of=/dev/null

dd: reading `/home/badblocks/skygrid_0750Hz_S5R3.dat': Input/output error

Now we can restart boinc

And yes, it skips the file it can no longer find and automatically recovers by downloading a new file to process. Notice the 99 % CPU now, which is what we want.

ps -fu boinc
UID PID PPID C STIME TTY TIME CMD
boinc 31263 1 3 23:30 pts/2 00:00:00 ./boinc

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31277 boinc 39 19 25624 6620 12 R 99.2 3.5 0:06.92 setiathome-5.27

We no longer have I/O wait times...

vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 92352 12496 5064 29672 1 1 56 33 181 64 7 6 10 77
1 0 92352 12468 5080 29672 0 0 0 14 1014 41 100 0 0 0
1 0 92352 12484 5096 29672 0 0 0 18 1020 47 100 0 0 0

The Hardware_ECC_Recovered value keeps increasing rapidly ... that does not seem OK ??

195 Hardware_ECC_Recovered 0x000a 100 100 000 Old_age Always - 434019663
195 Hardware_ECC_Recovered 0x000a 100 100 000 Old_age Always - 434019677

And a serious indication of the near death of the disk ??

Disk should be replaced immediately? Serious HW problem ??

* I have now found how to measure the disk temperature and raise an alarm on temperature problems:

smartctl -A /dev/hda
...
194 Temperature_Celsius 0x0022 145 094 000 Old_age Always - 31

disktemp=$(smartctl -A /dev/hda |awk '/Temperature_Celsius/ {print $10}')

echo $disktemp
31

if [ "$disktemp" -gt 31 ]; then echo "Disk temperature alarm"; fi
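
For use from cron, the check could be split into two small functions. This is a sketch under assumptions: the names disk_temp and temp_alarm are invented here, the 45 degree limit is an arbitrary example value, and the parser relies on the attribute layout shown in the smartctl output above (raw value in column 10).

```shell
#!/bin/sh
# Print the raw temperature from "smartctl -A" output read on stdin.
disk_temp() {
    awk '$2 == "Temperature_Celsius" { print $10 }'
}

# Succeed (exit 0) when the temperature exceeds the limit.
# An empty reading never raises the alarm.
temp_alarm() {
    temp=$1
    limit=$2
    [ -n "$temp" ] && [ "$temp" -gt "$limit" ]
}

# Hypothetical usage; 45 is only an example threshold:
#   temp=$(smartctl -A /dev/hda | disk_temp)
#   temp_alarm "$temp" 45 && echo "Disk temperature alarm: ${temp} C"
```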

Re: How to locate a sector into the file system

Ivan Ferreira — Thu, 31 Jan 2008 12:37:23 GMT

>>> Disk should be replaced immediately? Serious HW problem ??

Don't know, but as it is an IDE disk (according to /dev/hd*), and IDE disks are not "smart" enough to avoid using bad blocks, and it is also true that bad blocks happen, your options are:

- Run fsck with the -c argument (badblocks verification; must be done offline). This will keep the file system from using the bad blocks on your disk.
- Replace the disk
