logical drive unavailable after a disk was replaced

drookie
Occasional Contributor


Hi.

 

I have a server with a long story, and to understand the problem you first need to hear it. The server originally had a P212 controller and twelve 1.5 TB drives. I used the first two drives as a system mirror, but since 1.5 TB was a bit too much for the system, I split that mirror into two slices. Then I created two RAID 5 arrays, of five and four drives each, left one drive as a spare, and built a ZFS pool from the two arrays/LDs plus one of the slices I mentioned. Since the server was intended for storing slow static objects, that was fine.
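
For reference, the pool was a plain stripe of those three devices - something like this (a sketch from memory; the device names are as they appear now, under the P410):

# redundancy lives entirely in the hardware RAID underneath - ZFS itself has none,
# so if any single vdev goes UNAVAIL the whole pool becomes unavailable
zpool create datatank c4t0d0p2 c4t1d0 c4t2d0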

 

It worked for me for several years running Solaris 10. Then a drive in the system mirror failed, and I replaced it. Then the controller died. This is where the actual story starts: I replaced it with a P410. Everything was still fine. Some months later a drive died in the second R5 array (which was still running fine!). Everything stayed fine until my technician replaced that drive. One thing: since we were unable to quickly find a 1.5 TB drive, we decided to replace the faulty one with a 2 TB drive. Also, I had been having some memory issues, so we decided to clean the dust off the memory as well. So the technician took the server offline, replaced the drive, and cleaned the memory modules. After booting up I lost the pool; it started looking like this:

 

pool: datatank
    id: 11340815205521362361
 state: UNAVAIL
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        datatank    UNAVAIL  insufficient replicas
          c4t0d0p2  ONLINE
          c4t1d0    ONLINE
          c4t2d0    UNAVAIL  corrupted data <--- this is the R5 array where the disk was replaced

 

Second, when running format I saw identical sector/cylinder configurations for the two R5 logical drives (disks 1 and 2 in format, LDs 2 and 3) - weird, considering one is over 1 TB bigger, as if the system couldn't see something. For a while the controller showed the 3rd LD as "Recovering", and I was hoping that once it completed I would see my pool back. But, unluckily, it remained in this exact state.
I tried powering the server down via IPMI and doing a reconfiguration reboot with reboot -- -rv, but I still don't see the pool. I have backups and my data is unaffected, but I want to understand what happened.
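
One diagnostic I still mean to run is dumping the ZFS vdev labels on the affected LD, to see what ZFS actually finds there (the exact device node below is my guess):

# zdb -l prints the four vdev labels; if the pool guid/txg fields are missing
# or stale, the rebuild probably overwrote part of the label area
zdb -l /dev/rdsk/c4t2d0p0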

 

Any ideas?

 

hpacucli says everything is fine:

 

hpacucli ctrl slot=1 ld all show

Smart Array P410 in Slot 1

   array A

      logicaldrive 1 (1.4 TB, RAID 1, OK)

   array B

      logicaldrive 2 (5.5 TB, RAID 5, OK)

   array C

      logicaldrive 3 (4.1 TB, RAID 5, OK)
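
For completeness, the per-drive view can be pulled with the companion command (I haven't pasted its output here):

hpacucli ctrl slot=1 pd all show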

 

format shows this (notice that the 1st and 2nd logical volumes seem identical in fdisk, which should be impossible):

 

# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c4t0d0 <DEFAULT cyl 3480 alt 2 hd 255 sec 189>
          /pci@0,0/pci8086,340a@3/pci103c,3243@0/sd@0,0
       1. c4t1d0 <HP-LOGICAL VOLUME-6.40-5.46TB>
          /pci@0,0/pci8086,340a@3/pci103c,3243@0/sd@1,0
       2. c4t2d0 <HP-LOGICAL VOLUME-6.40-4.09TB>
          /pci@0,0/pci8086,340a@3/pci103c,3243@0/sd@2,0
Specify disk (enter its number): 1
selecting c4t1d0
[disk formatted]


FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        fdisk      - run the fdisk program
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> fdisk
             Total disk size is 60799 cylinders
             Cylinder size is 192780 (512 byte) blocks

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1                 EFI               0  60798    60799    100





SELECT ONE OF THE FOLLOWING:
   1. Create a partition
   2. Specify the active partition
   3. Delete a partition
   4. Change between Solaris and Solaris2 Partition IDs
   5. Exit (update disk configuration and exit)
   6. Cancel (exit without updating disk configuration)
Enter Selection: 6


format> disk


AVAILABLE DISK SELECTIONS:
       0. c4t0d0 <DEFAULT cyl 3480 alt 2 hd 255 sec 189>
          /pci@0,0/pci8086,340a@3/pci103c,3243@0/sd@0,0
       1. c4t1d0 <HP-LOGICAL VOLUME-6.40-5.46TB>
          /pci@0,0/pci8086,340a@3/pci103c,3243@0/sd@1,0
       2. c4t2d0 <HP-LOGICAL VOLUME-6.40-4.09TB>
          /pci@0,0/pci8086,340a@3/pci103c,3243@0/sd@2,0
Specify disk (enter its number)[1]: 2
selecting c4t2d0
[disk formatted]
format> fdisk
             Total disk size is 60799 cylinders
             Cylinder size is 144585 (512 byte) blocks

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1                 EFI               0  60798    60799    100





SELECT ONE OF THE FOLLOWING:
   1. Create a partition
   2. Specify the active partition
   3. Delete a partition
   4. Change between Solaris and Solaris2 Partition IDs
   5. Exit (update disk configuration and exit)
   6. Cancel (exit without updating disk configuration)
Enter Selection: 6


format> quit

 

Thanks.

1 REPLY
drookie
Occasional Contributor

Re: logical drive unavailable after a disk was replaced

Yeah, as was pointed out to me, disks 1 and 2 aren't identical after all: they have the same number of cylinders, but different cylinder sizes, and from those numbers their sizes can be calculated and they match the actual sizes. So the problem has narrowed down to 'ZFS thinks the LD is corrupted, while the controller thinks it's OK'.
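
For anyone following along, the arithmetic (cylinders x blocks per cylinder x 512 bytes) checks out; format's "TB" figures are binary:

# LD 2: 60799 cyl x 192780 blk x 512 B = 6,001,065,584,640 B ~ 5.46 TiB
echo $((60799 * 192780 * 512))
# LD 3: 60799 cyl x 144585 blk x 512 B = 4,500,799,188,480 B ~ 4.09 TiB
echo $((60799 * 144585 * 512))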