EDF 7.3: cldb CID1 container disk marked as failed IO Timeout
12-31-2024 03:43 AM - last edited on 01-01-2025 01:09 AM by support_s
Hi!
We have a problem with the disks used for MapR-FS. After a mistake with Ceph key creation for the virtual disks in Proxmox, some disks became read-only for a while but then returned to normal read-write status. However, two disks on the single control node that hosts CLDB and the name container (CID 1) were marked as failed by the MapR handle_disk_failure.sh script, so CLDB cannot start because CID 1 is unavailable.
############################ Disk Failure Report ###########################
Disk : sdc
Failure Reason : I/O time out
Time of Failure : Thu 26 Dec 2024 01:27:49 AM EET
Resolution :
Please refer to Data Fabric online documentation at https://docs.datafabric.hpe.com/home/AdministratorGuide/Managing-Disks.html
on how to handle disk failures. If you have further questions, please either post on https://community.datafabric.hpe.com/s/
or contact Data Fabric technical support.
############################ Disk Failure Report ###########################
Disk : sdb
Failure Reason : I/O time out
Time of Failure : Thu 26 Dec 2024 01:33:55 AM EET
Resolution :
Please refer to Data Fabric online documentation at https://docs.datafabric.hpe.com/home/AdministratorGuide/Managing-Disks.html
on how to handle disk failures. If you have further questions, please either post on https://community.datafabric.hpe.com/s/
or contact Data Fabric technical support.
The documentation suggests using /opt/mapr/server/fsck to check the disks, but as I understand it we would first have to remove them with mrconfig disk remove, which could cause the loss of CID 1.
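For reference, the check-only run we had in mind (SP name taken from the mrconfig sp list output below; no -r, so nothing would be repaired) is roughly:
/opt/mapr/server/mrconfig sp list
/opt/mapr/server/fsck -n SP1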
When trying to bring the SP back online, we get this in mfs.log-3:
2024-12-26 18:55:32,2630 INFO IOMgr iomgr.cc:1764 SP1:/dev/sdc on DG Concat1-3 consists of 2 disks:
2024-12-26 18:55:32,2630 INFO IOMgr iomgr.cc:1767 SP1:0 /dev/sdc
2024-12-26 18:55:32,2630 INFO IOMgr iomgr.cc:1767 SP1:1 /dev/sdd
2024-12-26 18:55:32,2630 INFO IOMgr spinit.cc:251 Read SP /dev/sdc Superblock
2024-12-26 18:55:32,2639 ERROR IOMgr spinit.cc:310 SP SP1:/dev/sdc online failed, it was previously marked with disk ERROR: I/O time out error, 110. To bring it online first repair the SP using fsck utility.
2024-12-26 18:55:32,2639 INFO IOMgr spinit.cc:51 Storage Pool DeInit()
2024-12-26 18:55:32,2639 INFO IOMgr spserver.cc:1001 < SPOnline ctx 0x558b4f3b4000 err 110
mrconfig sp list
ListSPs resp: status 0:3
No. of SPs (3), totalsize 3234281 MiB, totalfree 1122840 MiB
SP 0: name SP1, Offline, size 3686398 MiB, free 0 MiB, path /dev/sdc
SP 1: name SP3, Online, size 1667316 MiB, free 595549 MiB, path /dev/sdb
SP 2: name SP4, Online, size 1566964 MiB, free 527290 MiB, path /dev/sde
mrconfig disk list
ListDisks resp: status 0 count=4
ListDisks /dev/sdc
size 1843200MB
DG 0: Single SingleDisk1 Online
DG 1: Raid0 Stripe1-2 Online
DG 2: Concat Concat1-3 Online
SP 0: name SP1, Offline, size 3686398 MiB, free 0 MiB, path /dev/sdc
ListDisks /dev/sdb
size 1740800MB
DG 0: Single SingleDisk5 Online
DG 1: Concat Concat5-2 Online
SP 0: name SP3, Online, size 1667316 MiB, free 595549 MiB, path /dev/sdb
ListDisks /dev/sde
size 1638400MB
DG 0: Single SingleDisk6 Online
DG 1: Concat Concat6-2 Online
SP 0: name SP4, Online, size 1566964 MiB, free 527290 MiB, path /dev/sde
ListDisks /dev/sdd
size 1843200MB
DG 0: Single SingleDisk3 Online
DG 1: Raid0 Stripe1-2 Online
SP 0: name SP1, Offline, size 3686398 MiB, free 0 MiB, path /dev/sdc
We believe those disks are healthy now and that there is no actual failure on them (though we cannot rule it out).
We are using Ezmeral Data Fabric v7.3 with a single CLDB node.
How can we unmark those disks and try to restart the cluster? Is there any risk of losing data by doing this?
12-31-2024 07:21 AM - edited 12-31-2024 07:21 AM
Solution
Hi, there's no need to remove the disks.
If fsck fails with 'Device or resource busy', the SP needs to be either offlined or unloaded first:
mrconfig sp offline /dev/sdc
or
mrconfig sp unload SP1
One of these should allow fsck -r to complete, then use:
mrconfig sp refresh
to bring it back online.
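Putting it together, the sequence would look roughly like this (full paths assumed from a default /opt/mapr install; substitute your own device and SP names from mrconfig sp list):
/opt/mapr/server/mrconfig sp offline /dev/sdc    # or: /opt/mapr/server/mrconfig sp unload SP1
/opt/mapr/server/fsck -n SP1 -r                  # repair the SP
/opt/mapr/server/mrconfig sp refresh             # reload the disktab and bring the SP back online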
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]

01-07-2025 06:12 AM - edited 01-07-2025 06:29 AM
Re: EDF 7.3: cldb CID1 container disk marked as failed IO Timeout
@ldarby Thank you for the advice!
We ran
/opt/mapr/server/fsck -n SP1
without the -r option first, to see the problem:
2025-01-06 12:03:06,5367 INFO CacheMgr cachemgr.cc:3517 cachePercentagesIn: inode:0:log:0:meta:2:dir:0:small:0:db:0:valc:0
2025-01-06 12:03:06,5367 INFO CacheMgr cachemgr.cc:3533 CacheSize 74126 MB, inode:0:meta:2:dir:0:small:0:large:98:db:0:valc:0:spillc:0:segmc:0
2025-01-06 12:03:07,1887 INFO CacheMgr cachemgr.cc:3243 lru meta 0: start 1 end 189762 blocks 189762 [1482M], dirtyquota 75904 [ 593M]
2025-01-06 12:03:07,2472 INFO CacheMgr cachemgr.cc:3243 lru large 2: start 189763 end 9488128 blocks 9298366 [72643M], dirtyquota 8368529 [65379M]
2025-01-06 12:03:07,2524 INFO CacheMgr cachemgr.cc:3610 BlockCacheCount 9488128
2025-01-06 12:03:42,4551 INFO IO iodispatch.cc:110 using IO maxEvents: 5000
2025-01-06 12:03:42,4553 INFO IOMgr iomgr.cc:363 maxSlowIOs 30, slowDiskTimeOut 240 s, maxOutstandingIOsPerDisk 50000, MaxStoragePools 129, port 0, isDARE 0
fsck.cc:656 Repair flag: 0
iomgr.cc:3069 found 4 disks in disktab
lun.cc:1127 Loading disk:/dev/sdc
lun.cc:1139 /dev/sdc LoadDisk 0x55a97ecfb200 retry 0
lun.cc:838 disk /dev/sdc numaid -1
lun.cc:775 Disk Open /dev/sdc isSSD_ initialized to 0
lun.cc:1127 Loading disk:/dev/sdd
lun.cc:1139 /dev/sdd LoadDisk 0x55a97ecfb608 retry 0
lun.cc:838 disk /dev/sdd numaid -1
lun.cc:775 Disk Open /dev/sdd isSSD_ initialized to 0
lun.cc:1127 Loading disk:/dev/sdb
lun.cc:1139 /dev/sdb LoadDisk 0x55a97ecfba10 retry 0
lun.cc:735 target device open /dev/sdb failed: Device or resource busy, errno 16
lun.cc:1143 OnlineDisk /dev/sdb failed Device or resource busy, errno 16
lun.cc:1127 Loading disk:/dev/sde
lun.cc:1139 /dev/sde LoadDisk 0x55a97ecfbe18 retry 0
lun.cc:735 target device open /dev/sde failed: Device or resource busy, errno 16
lun.cc:1143 OnlineDisk /dev/sde failed Device or resource busy, errno 16
lun.cc:1318 Disk /dev/sdc, Loading concat DG Concat1-3 readystate(0)
iomgr.cc:1807 SP SP1 found on disk /dev/sdc
lun.cc:1435 /dev/sdc Disk Loaded
lun.cc:1436 Disk /dev/sdc loaded numRecords 3
lun.cc:1318 Disk /dev/sdd, Loading concat DG Concat1-3 readystate(1)
lun.cc:1374 DG already added to sptable
lun.cc:1435 /dev/sdd Disk Loaded
lun.cc:1436 Disk /dev/sdd loaded numRecords 2
12:03:50 phase1.cc:39 ERROR FSERR Superblock is marked with error 110
phase2.cc:725 start orphanage container processing
phase2.cc:2367 WalkContainer 64: rw 64 inodes 333824 clus 1304 rblock 0x1013419 size 85458944: con 1 of 276
phase2.cc:746 done orphanage container processing
phase2.cc:929 runningSnapChainWalks 7 maxSnapChains 0 maxInodeScans 35, numInodeScansPerContainer 5
phase2.cc:2367 WalkContainer 2052: rw 2052 inodes 256 clus 1 rblock 0x0 size 65536: con 8 of 276
phase2.cc:2367 WalkContainer 2067: rw 2067 inodes 256 clus 1 rblock 0x0 size 65536: con 8 of 276
...
phase2.cc:2367 WalkContainer 3624: rw 3624 inodes 4096 clus 16 rblock 0x5460048 size 1048576: con 275 of 276
phase2.cc:2367 WalkContainer 3634: rw 3634 inodes 4096 clus 16 rblock 0x10cc0380 size 1048576: con 276 of 276
12:04:24 fsck.cc:542 FSCK start time(1736157830 | 348397)
fsck.cc:544 FSCK end time(1736157864 | 639374)
fsck.cc:545 FSCK time taken: 34 sec
fsck.cc:551 FSCK read-ahead stats: t-was: 1155, i-was: 0, y-was: 128, n-was: 0, btd: 18, btr: 18, dd: 28528, dr: 16027
fsck.cc:554 FSCK cache stats: lu: 2802772, mi: 1453250
fsck.cc:561 FSCK IO stats: reads: 1410027, readBlocks: 1500527 writes: 2331, writeBlocks: 27106
alloc.cc:297 Number of Data blocks 310818696 shared 0
alloc.cc:297 Number of Inode blocks 48689 shared 0
alloc.cc:297 Number of Orphanage blocks 0 shared 0
alloc.cc:297 Number of BTreeIntr blocks 52371 shared 0
alloc.cc:297 Number of BTreeLeaf blocks 1317911 shared 0
alloc.cc:297 Number of Log blocks 51200 shared 0
alloc.cc:297 Number of BlockBitmap blocks 14400 shared 0
alloc.cc:297 Number of SPMetaBlock blocks 67 shared 0
alloc.cc:297 Number of DGPrivate blocks 0 shared 0
alloc.cc:297 Number of Fidmap blocks 0 shared 0
alloc.cc:297 Number of Misc blocks 260 shared 0
alloc.cc:297 Number of SymLink blocks 0 shared 0
alloc.cc:297 Number of Unknown blocks 0 shared 0
alloc.cc:303 Total Number of blocks 312303594 shared 0 crc checked 722
fsck.cc:570 errorsInFsck = 1
fsck.cc:576 ERROR
FSCK completed with errors.
So the superblock is marked with error 110 (errno 110 is ETIMEDOUT on Linux, which matches the original I/O timeout):
12:03:50 phase1.cc:39 ERROR FSERR Superblock is marked with error 110
Is it safe to run fsck -r now?
Also, there is no faileddisk.log in /opt/mapr/logs; does fsck delete it?
I also checked mfs.conf and at the bottom I saw:
mfs.on.virtual.machine=0
But we are running the nodes on Proxmox VMs; should I change it to 1 on all nodes?
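For anyone checking the same thing, the setting can be inspected on each node with something like this (assuming the default /opt/mapr/conf install path):
grep mfs.on.virtual.machine /opt/mapr/conf/mfs.conf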
01-07-2025 09:42 AM - edited 01-07-2025 12:59 PM
Re: EDF 7.3: cldb CID1 container disk marked as failed IO Timeout
@filip_novak Since the virtual disk issue is resolved now, try rebooting the node and observe whether you still get "I/O time out" error messages. If you do, verify that there is no issue at the disk or OS level. If your team has confirmed that the disks and OS are healthy but you are still seeing the same error, then run fsck with the "-r" option for SP1. Before running the fsck command, please make sure that the CID 1 replicas are available and fully resynced.
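One way to check the CID 1 replica state, if the CLDB is reachable (standard maprcli command; output details may vary by release), is:
maprcli dump containerinfo -ids 1 -json
This lists the container's master and replica servers so you can confirm valid copies exist before repairing SP1.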
Thanks,
Vineet
I work at HPE
HPE Support Center offers support for your HPE services and products when and how you need it. Get started with HPE Support Center today.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
