How to locate a sector into the file system
01-30-2008 12:16 PM
I have a disk sector error.
* How can I find out which file contains this sector?
Then I would know which application is impacted, and which file should be restored or recreated.
I am using ext3 file systems.
dmesg |tail
end_request: I/O error, dev hda, sector 29012529
hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=29012535, high=1, low=12235319, sector=29012529
ide: failed opcode was: unknown
end_request: I/O error, dev hda, sector 29012529
If possible, I would prefer not to halt the system.
01-30-2008 01:16 PM
Solution
Please see:
http://www.gra2.com/article.php/20041015232512624
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
01-30-2008 03:11 PM
Re: How to locate a sector into the file system
Well, I suspected that it was boinc that was suffering from the bad sector, but now I know for sure.
boinc did not use any CPU time...
ps -fu boinc
UID PID PPID C STIME TTY TIME CMD
boinc 1864 1 0 05:27 ? 00:00:06 ./boinc
boinc 9655 1864 1 16:39 ? 00:00:52 einstein_S5R3_4.20_i686-pc-linux
And the system had huge I/O wait times, above 80 %:
vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 2 40720 3396 8468 20700 1 2 38 37 39 66 6 6 8 80
0 3 40720 3336 8496 20700 0 0 3 17 1018 46 0 6 0 93
The system had already gone through an unplanned crash and reboot:
uptime
22:40:04 up 1 day, 17:14, 2 users, load average: 0.03, 0.04, 0.00
Normally this system is never idle and never reboots on its own...
smartctl -A /dev/hda
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 051 Pre-fail Always - 41
3 Spin_Up_Time 0x0007 067 042 000 Pre-fail Always - 5760
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 174
5 Reallocated_Sector_Ct 0x0033 253 253 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 253 253 051 Pre-fail Always - 0
8 Seek_Time_Performance 0x0024 253 253 000 Old_age Offline - 0
9 Power_On_Half_Minutes 0x0032 097 097 000 Old_age Always - 19359h+40m
10 Spin_Retry_Count 0x0013 253 253 049 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 102
194 Temperature_Celsius 0x0022 151 094 000 Old_age Always - 29
195 Hardware_ECC_Recovered 0x000a 100 100 000 Old_age Always - 432100053
196 Reallocated_Event_Count 0x0012 099 099 000 Old_age Always - 3
197 Current_Pending_Sector 0x0033 253 253 010 Pre-fail Always - 0
198 Offline_Uncorrectable 0x0031 099 099 010 Pre-fail Offline - 3
199 UDMA_CRC_Error_Count 0x000b 100 100 051 Pre-fail Always - 0
200 Multi_Zone_Error_Rate 0x000b 100 100 051 Pre-fail Always - 0
201 Soft_Read_Error_Rate 0x000b 100 100 051 Pre-fail Always - 0
fdisk -lu /dev/hda
Disk /dev/hda: 120.0 GB, 120060444672 bytes
255 heads, 63 sectors/track, 14596 cylinders, total 234493056 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
...
/dev/hda6 21109473 31358879 5124703+ 83 Linux
df -kl |sort
...
/dev/hda6 5044156 365184 4422740 8% /home
grep /home /etc/fstab
LABEL=/home /home ext3 defaults 1 2
tune2fs -l /dev/hda6 |grep Block
Block count: 1281175
Block size: 4096
Blocks per group: 32768
One file system block holds 4096 / 512 = 8 sectors.
So the bad sector lies in block 987882, counted from the start of the partition:
bc
( 29012535 - 21109473 ) / 8
987882
quit
29012535 = bad sector (LBAsect in the dmesg output)
21109473 = first sector of the partition /dev/hda6 (Start column of fdisk -lu)
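The same arithmetic can be done directly in the shell instead of bc. This is only a sketch: the function name sector_to_fs_block is made up for illustration, and it assumes 512-byte sectors (as fdisk reports above) and a sector number counted from the start of the whole disk.

```shell
#!/bin/sh
# sector_to_fs_block BAD_SECTOR PART_START BLOCK_SIZE
# Hypothetical helper: prints the file system block, relative to the
# partition, that contains an absolute disk sector.
# Assumption: 512-byte sectors, as reported by fdisk -lu.
sector_to_fs_block() {
    bad_sector=$1
    part_start=$2
    block_size=$3
    sectors_per_block=$(( block_size / 512 ))
    # Integer division drops the remainder, which is only the
    # position of the sector inside its block.
    echo $(( (bad_sector - part_start) / sectors_per_block ))
}

# Values from this thread: LBAsect, start of /dev/hda6, ext3 block size
sector_to_fs_block 29012535 21109473 4096
```

Both sector numbers in the dmesg output (29012529 and 29012535) end up in the same block, 987882, so it does not matter which of the two is used here.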
which debugfs
/sbin/debugfs
[root@hp-interex ~]# debugfs
debugfs 1.37 (21-Mar-2005)
debugfs: open /dev/hda6
debugfs: icheck 987882
Block Inode number
987882 481248
debugfs: ncheck 481248
Inode Pathname
481248 /boinc/BOINC/slots/0/skygrid_0750Hz_S5R3.dat
debugfs: quit
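The two debugfs lookups can also be scripted. The sketch below assumes that the installed debugfs accepts the -R option (run a single request and exit) and that its output looks like the tables above; the helper names inode_from_icheck and path_from_ncheck are invented for this example.

```shell
#!/bin/sh
# Hypothetical helpers that pick the answer out of debugfs output
# read from standard input. Header lines are skipped because their
# first column is not a number.

# Prints the inode number from "icheck" output.
# Prints nothing if the block is not in use (no numeric inode column).
inode_from_icheck() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Prints the path name from "ncheck" output (everything after the inode).
path_from_ncheck() {
    awk '$1 ~ /^[0-9]+$/ { sub(/^[0-9]+[ \t]+/, ""); print }'
}

# On a real system (as root; device and block number from this thread):
#   inode=$(debugfs -R "icheck 987882" /dev/hda6 2>/dev/null | inode_from_icheck)
#   debugfs -R "ncheck $inode" /dev/hda6 2>/dev/null | path_from_ncheck

# Demonstration on the output shown above
printf 'Block\tInode number\n987882\t481248\n' | inode_from_icheck
```

Note that ncheck reports the path relative to the root of the file system, so the mount point (/home here) still has to be put in front of it.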
So I stopped boinc, and moved the impacted file.
ls -l /home/boinc/BOINC/slots/0/skygrid_0750Hz_S5R3.dat
-rw-rw-r-- 1 boinc boinc 4552614 Sep 2 20:45 /home/boinc/BOINC/slots/0/skygrid_0750Hz_S5R3.dat
We could move this 4 MB file to e.g. /home/badblocks/ and ask boinc to abort processing of this file.
Or it could be that boinc will skip this file automatically, when we start boinc up again, after moving the file containing the bad block.
We can give it a try.
mkdir /home/badblocks
mv /home/boinc/BOINC/slots/0/skygrid_0750Hz_S5R3.dat /home/badblocks/
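Note that we must move the file within the same partition: there a move is only a rename, so the file keeps its inode and its data blocks, including the bad one. A move to another file system would have to copy the data, which fails on the read error. That is why we moved it to a badblocks subdirectory under /home.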
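Whether a move stays on one file system can be checked before doing it. The following sketch assumes GNU stat (the -c option with %d for the device number and %i for the inode number); the helper name same_filesystem and the scratch paths are made up, and the demonstration deliberately works on temporary files rather than on the damaged one.

```shell
#!/bin/sh
# same_filesystem PATH1 PATH2
# Hypothetical helper: succeeds if both paths are on the same file system.
# Assumption: GNU stat, where -c %d prints the device number.
same_filesystem() {
    [ "$(stat -c %d "$1")" = "$(stat -c %d "$2")" ]
}

# Demonstration on scratch files
workdir=$(mktemp -d)
mkdir "$workdir/slots" "$workdir/badblocks"
echo "sample" > "$workdir/slots/data.dat"

if same_filesystem "$workdir/slots/data.dat" "$workdir/badblocks"; then
    inode_before=$(stat -c %i "$workdir/slots/data.dat")
    mv "$workdir/slots/data.dat" "$workdir/badblocks/"
    inode_after=$(stat -c %i "$workdir/badblocks/data.dat")
    # After a rename on one file system both numbers are the same
    echo "inode before: $inode_before, after: $inode_after"
fi
```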
Better double check
debugfs
debugfs 1.37 (21-Mar-2005)
debugfs: open /dev/hda6
debugfs: icheck 987882
Block Inode number
987882 481248
debugfs: ncheck 481248
Inode Pathname
481248 /badblocks/skygrid_0750Hz_S5R3.dat
debugfs: quit
Yes, we did it! As long as we never delete this file or move it to another file system, the bad block stays allocated to it and does no more harm...
We can reproduce the sector read error by reading the file:
dd if=/home/badblocks/skygrid_0750Hz_S5R3.dat of=/dev/null
dd: reading `/home/badblocks/skygrid_0750Hz_S5R3.dat': Input/output error
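The same dd read can be wrapped in a small test for scripts. This is a sketch only; the function name file_is_readable_to_end is hypothetical, and it relies on dd returning a non-zero exit status when a read fails.

```shell
#!/bin/sh
# file_is_readable_to_end FILE
# Hypothetical helper: reads the whole file and discards the data.
# Succeeds only if dd could read the file to the end.
file_is_readable_to_end() {
    dd if="$1" of=/dev/null bs=4096 2>/dev/null
}

# Demonstration on a healthy scratch file
scratch=$(mktemp)
echo "healthy data" > "$scratch"
if file_is_readable_to_end "$scratch"; then
    echo "no read error"
else
    echo "read error"
fi
rm -f "$scratch"
```

Run against /home/badblocks/skygrid_0750Hz_S5R3.dat, the else branch would be expected, since dd reports the Input/output error shown above.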
Now we can restart boinc
And yes, it skips the file that it can no longer find, and automatically recovers by downloading a new file to process... Notice the 99 % CPU now, which is what we want.
ps -fu boinc
UID PID PPID C STIME TTY TIME CMD
boinc 31263 1 3 23:30 pts/2 00:00:00 ./boinc
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31277 boinc 39 19 25624 6620 12 R 99.2 3.5 0:06.92 setiathome-5.27
We no longer have I/O wait times...
vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 92352 12496 5064 29672 1 1 56 33 181 64 7 6 10 77
1 0 92352 12468 5080 29672 0 0 0 14 1014 41 100 0 0 0
1 0 92352 12484 5096 29672 0 0 0 18 1020 47 100 0 0 0
The Hardware_ECC_Recovered raw value keeps increasing rapidly ... that does not look OK?
195 Hardware_ECC_Recovered 0x000a 100 100 000 Old_age Always - 434019663
195 Hardware_ECC_Recovered 0x000a 100 100 000 Old_age Always - 434019677
Is this a serious indication that the disk is near death?
Should the disk be replaced immediately? Is this a serious hardware problem?
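To put a number on "rapidly", the raw value can be extracted from two smartctl -A readings and subtracted. The sketch below only does that arithmetic on the two lines shown above; the helper name raw_value is made up, and it assumes the column layout of this smartctl version, where RAW_VALUE is the tenth field. It says nothing about what growth rate is normal for a given drive.

```shell
#!/bin/sh
# raw_value ID
# Hypothetical helper: reads smartctl -A output on standard input and
# prints the RAW_VALUE column (10th field) of the attribute with that ID.
raw_value() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# The two readings quoted above
first=$(printf '%s\n' \
    '195 Hardware_ECC_Recovered 0x000a 100 100 000 Old_age Always - 434019663' \
    | raw_value 195)
second=$(printf '%s\n' \
    '195 Hardware_ECC_Recovered 0x000a 100 100 000 Old_age Always - 434019677' \
    | raw_value 195)

echo "Hardware_ECC_Recovered grew by $(( second - first ))"
```

On a live system the input would come from smartctl -A /dev/hda, taken twice with a known pause in between.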
* I have now found a command to read the disk temperature, and to raise an alarm on temperature problems:
smartctl -A /dev/hda
...
194 Temperature_Celsius 0x0022 145 094 000 Old_age Always - 31
disktemp=$(smartctl -A /dev/hda |awk '/Temperature_Celsius/ {print $10}')
echo $disktemp
31
if [ "$disktemp" -gt 31 ]; then echo "Disk temperature alarm"; fi
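For use from cron or a monitoring script, the same check can be made a little more defensive. This is a sketch under assumptions: the names disk_temp, check_temp and TEMP_LIMIT are invented here, the limit of 45 is an arbitrary placeholder rather than a recommendation, and the parsing again assumes that the raw value is the tenth field.

```shell
#!/bin/sh
# Hypothetical limit in degrees Celsius; choose one that suits the drive.
TEMP_LIMIT=${TEMP_LIMIT:-45}

# disk_temp: reads smartctl -A output on standard input and prints the
# raw value of the Temperature_Celsius attribute.
disk_temp() {
    awk '$2 == "Temperature_Celsius" { print $10; exit }'
}

# check_temp TEMP LIMIT
# Returns 0 if TEMP is within LIMIT, 1 if it is above, and 2 if TEMP
# is empty or not a number (for example when smartctl failed).
check_temp() {
    case $1 in
        ''|*[!0-9]*) echo "No temperature reading"; return 2 ;;
    esac
    if [ "$1" -gt "$2" ]; then
        echo "Disk temperature alarm: $1 C (limit $2 C)"
        return 1
    fi
    return 0
}

# Real use (as root):
#   check_temp "$(smartctl -A /dev/hda | disk_temp)" "$TEMP_LIMIT"

# Demonstration on the line shown above, with a limit of 30
sample='194 Temperature_Celsius 0x0022 145 094 000 Old_age Always - 31'
check_temp "$(printf '%s\n' "$sample" | disk_temp)" 30 || true
```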
01-31-2008 04:37 AM
Re: How to locate a sector into the file system
>>> Disk should be replaced immediately? Serious HW problem ??
I don't know, but since it is an IDE disk (judging by /dev/hd*): IDE disks are not "smart" enough to avoid using bad blocks on their own, and bad blocks do happen. Your options are:
- Run fsck with the -c argument (bad block check with badblocks; must be done offline). This keeps the file system from using the bad blocks on your disk.
- Replace the disk
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?