
Server crashes

 
Dermot1
Advisor

Hi,

 

Proliant DL 165 G7. P410/256 Int and P212 Ext with D2600 Disk enclosure.

Linux kernel 2.6.18-194.17.4.el5xen, Red Hat 5.5. The P410 has a mirrored pair of 146GB SAS disks for the OS.

 

This server has crashed about 4 times in the last two weeks. The problem seems to be with the P212 controller and/or the disk enclosure. The symptoms are that the server becomes unresponsive and the console shows a panic message like this (big post, sorry):

 

 [419160.430190] INFO: task jbd2/dm-0-8:584 blocked for more than 120 seconds.

[419160.430224] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[419160.430250] jbd2/dm-0-8   D ffff88000959e980     0   584      2 0x00000000

[419160.430268]  ffff8805e3d53c10 0000000000000246 ffffffff00000000 0000000000015980

[419160.430289]  ffff8805e3d53fd8 0000000000015980 ffff8805e3d53fd8 ffff8805e0bbadc0

[419160.430308]  0000000000015980 0000000000015980 ffff8805e3d53fd8 0000000000015980

[419160.430327] Call Trace:

[419160.430350]  [<ffffffff8117d510>] ? sync_buffer+0x0/0x50

[419160.430367]  [<ffffffff8159c923>] io_schedule+0x73/0xc0

[419160.430379]  [<ffffffff8117d555>] sync_buffer+0x45/0x50

[419160.430391]  [<ffffffff8159cf9f>] __wait_on_bit+0x5f/0x90

[419160.430402]  [<ffffffff8117d510>] ? sync_buffer+0x0/0x50

[419160.430413]  [<ffffffff8159d048>] out_of_line_wait_on_bit+0x78/0x90

[419160.430427]  [<ffffffff8107f0c0>] ? wake_bit_function+0x0/0x40

[419160.430439]  [<ffffffff8117d506>] __wait_on_buffer+0x26/0x30

[419160.430454]  [<ffffffff8122874a>] jbd2_journal_commit_transaction+0x97a/0x1350

[419160.430469]  [<ffffffff81008696>] ? __switch_to+0x166/0x320

[419160.430483]  [<ffffffff8159e87e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30

[419160.430498]  [<ffffffff81070933>] ? try_to_del_timer_sync+0x83/0xe0

[419160.430512]  [<ffffffff8122d87d>] kjournald2+0xbd/0x220

[419160.430523]  [<ffffffff8107f080>] ? autoremove_wake_function+0x0/0x40

[419160.430534]  [<ffffffff8122d7c0>] ? kjournald2+0x0/0x220

[419160.430545]  [<ffffffff8107eb26>] kthread+0x96/0xa0

[419160.430557]  [<ffffffff8100aee4>] kernel_thread_helper+0x4/0x10

[419160.430570]  [<ffffffff8100a313>] ? int_ret_from_sys_call+0x7/0x1b

[419160.430582]  [<ffffffff8159ee1d>] ? retint_restore_args+0x5/0x6

[419160.430594]  [<ffffffff8100aee0>] ? kernel_thread_helper+0x0/0x10

[419160.430627] INFO: task perl:31318 blocked for more than 120 seconds.

[419160.430647] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[419160.430667] perl          D ffff880009562980     0 31318  31221 0x00000000

[419160.430684]  ffff88059db11a58 0000000000000286 ffffffff00000000 0000000000015980

[419160.430702]  ffff88059db11fd8 0000000000015980 ffff88059db11fd8 ffff880003ad8000

[419160.430720]  0000000000015980 0000000000015980 ffff88059db11fd8 0000000000015980

[419160.430738] Call Trace:

[419160.430750]  [<ffffffff8117d510>] ? sync_buffer+0x0/0x50

[419160.430761]  [<ffffffff8159c923>] io_schedule+0x73/0xc0

[419160.430772]  [<ffffffff8117d555>] sync_buffer+0x45/0x50

[419160.430783]  [<ffffffff8159cf9f>] __wait_on_bit+0x5f/0x90

[419160.430794]  [<ffffffff8117d510>] ? sync_buffer+0x0/0x50

[419160.430805]  [<ffffffff8159d048>] out_of_line_wait_on_bit+0x78/0x90

[419160.430817]  [<ffffffff8107f0c0>] ? wake_bit_function+0x0/0x40

[419160.430828]  [<ffffffff8117d506>] __wait_on_buffer+0x26/0x30

[419160.430841]  [<ffffffff811f56f2>] ext4_find_entry+0x1a2/0x4a0

[419160.430854]  [<ffffffff81168a01>] ? d_delete+0xc1/0x100

[419160.430866]  [<ffffffff81168a67>] ? d_alloc+0x27/0x1c0

[419160.430878]  [<ffffffff811f5a3d>] ext4_lookup+0x4d/0x110

[419160.430889]  [<ffffffff8115f193>] do_lookup+0x1e3/0x280

[419160.430900]  [<ffffffff8115fe7d>] link_path_walk+0x4cd/0xab0

[419160.430911]  [<ffffffff811605c7>] path_walk+0x67/0xe0

[419160.430922]  [<ffffffff8116079b>] do_path_lookup+0x5b/0xa0

[419160.430932]  [<ffffffff81161467>] user_path_at+0x57/0xa0

[419160.430943]  [<ffffffff810043c6>] ? xen_mc_flush+0x96/0x1c0

[419160.430956]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10

[419160.430968]  [<ffffffff810072d2>] ? check_events+0x12/0x20

[419160.430980]  [<ffffffff8115756c>] vfs_fstatat+0x3c/0x80

[419160.430991]  [<ffffffff810072d2>] ? check_events+0x12/0x20

[419160.431002]  [<ffffffff8115768b>] vfs_stat+0x1b/0x20

[419160.431013]  [<ffffffff811576b4>] sys_newstat+0x24/0x50

[419160.431025]  [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1

[419160.431035]  [<ffffffff810041a1>] ? xen_clts+0x71/0x80

[419160.431046]  [<ffffffff8100b322>] ? math_state_restore+0x42/0x60

[419160.431059]  [<ffffffff8159f3de>] ? do_device_not_available+0xe/0x10

[419160.431070]  [<ffffffff8100b00b>] ? xen_hypervisor_callback+0x1b/0x20

[419160.431082]  [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b

[419160.431092] INFO: task perl:31320 blocked for more than 120 seconds.

[419160.431111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[419160.431131] perl          D ffff880009562980     0 31320  31221 0x00000000

[419160.431148]  ffff880563873a58 0000000000000282 ffffffff00000000 0000000000015980

[419160.431164]  ffff880563873fd8 0000000000015980 ffff880563873fd8 ffff880003ad96e0

[419160.431184]  0000000000015980 0000000000015980 ffff880563873fd8 0000000000015980

[419160.431200] Call Trace:

[419160.431212]  [<ffffffff8117d510>] ? sync_buffer+0x0/0x50

...

....

....

 

dm-0 is one of the LVM volumes on the disk enclosure.
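For anyone comparing several crashes, the hung-task reports above can be summarised with a short script. This is only a minimal sketch: the function name `hung_tasks` and the sample text are made up for illustration, and it assumes the log lines keep the "INFO: task NAME:PID blocked for more than N seconds" wording shown above.

```python
import re

# Matches hung-task lines such as:
# [419160.430190] INFO: task jbd2/dm-0-8:584 blocked for more than 120 seconds.
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(log_text):
    """Return (timestamp, task name, pid) for each hung-task report in the log."""
    found = []
    for line in log_text.splitlines():
        match = HUNG_RE.search(line)
        if match:
            found.append(
                (float(match.group("ts")), match.group("name"), int(match.group("pid")))
            )
    return found


# Two lines copied from the console output above, used as sample input.
sample = """\
[419160.430190] INFO: task jbd2/dm-0-8:584 blocked for more than 120 seconds.
[419160.430627] INFO: task perl:31318 blocked for more than 120 seconds.
"""

for ts, name, pid in hung_tasks(sample):
    print(f"{ts:.6f}  {name} (pid {pid})")
```

A task name that contains a device, such as jbd2/dm-0-8, points at the journal thread for that device, which is how the trace ties back to dm-0.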

 

At this point, I have to power down the server, as nothing I have tried makes it shut down gracefully. When the server comes back up I get a lot of messages like the ones below.

 

ioctl32(cmaidad:5612): Unknown cmd fd(6) cmd(c058420b){00} arg(ffe64520) on /dev/cciss/c0d0

ioctl32(cmaeventd:5495): Unknown cmd fd(5) cmd(c058420b){00} arg(fff1c460) on /dev/cciss/c0d0

ioctl32(cmaeventd:5495): Unknown cmd fd(5) cmd(c058420b){00} arg(fff1c460) on /dev/cciss/c1d0
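The cmd value in these messages can be split into the standard Linux ioctl fields, which at least shows which driver command the HP agents are issuing. A minimal sketch, assuming the usual x86 layout (2 bits direction, 14 bits size, 8 bits type, 8 bits number); the helper name `decode_ioctl` is made up for illustration:

```python
def decode_ioctl(cmd):
    """Split a Linux ioctl request number into its fields (x86 layout assumed)."""
    return {
        "dir": (cmd >> 30) & 0x3,         # 1 = write, 2 = read, 3 = read and write
        "size": (cmd >> 16) & 0x3FFF,     # size of the argument struct in bytes
        "type": chr((cmd >> 8) & 0xFF),   # driver "magic" character
        "nr": cmd & 0xFF,                 # command number within that driver
    }


# The value reported in the ioctl32 messages above.
fields = decode_ioctl(0xC058420B)
print(fields)
```

For c058420b this gives direction 3, size 88, type 'B' and number 11. As far as I can tell, type 'B' with number 11 is the cciss driver's passthrough command (CCISS_PASSTHRU in cciss_ioctl.h), but that should be checked against the driver header for the kernel in use.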

 

During the POST, there is a message from the P212 controller about "a controller failure event occurred prior to this power-up, previous lock code..."

 

I have seen other posts about the ioctl32 error. One poster on the HP forum said the problem was resolved when he changed the cable. I switched the SAS cable but I am still getting the error. Two days ago I applied the latest firmware (5.6) and used the 9.5 firmware DVD to flash all the components on the system. I still get the messages above. The server was unresponsive yesterday morning and I had to power it down again.

 

I do not know if the messages above are related to the crash. I'd like to hear any suggestions at this point because the current situation is losing me sleep.

 

Thanks in advance.

Dermot

 

1 REPLY
Dermot1
Advisor

Re: Server crashes

I logged a call with HP about this. I had resisted doing so because I didn't know where the problem was: the server, the controller or the disk enclosure. I got this message from the last crash:

 

Call Trace: [<ffffffff80262fb3>] wait_for_completion+0x7d/0xaa

[<ffffffff80288e63>] default_wake_function+0x0/0xe 

[<ffffffff880b6c9f>] :cciss:start_io+0xa3/0xdd

 [<ffffffff880bade1>] :cciss:cciss_ioctl+0x6a7/0xc44

[<ffffffff8020d58c>] do_lookup+0x65/0x1e6 

[<ffffffff8025a280>] kobject_get+0x12/0x17 

[<ffffffff880bb3a8>] :cciss:do_ioctl+0x2a/0x39 

[<ffffffff880bb5cb>] :cciss:cciss_compat_ioctl+0x214/0x249 

[<ffffffff8031bf66>] inode_has_perm+0x56/0x63

[<ffffffff802d8134>] blkdev_open+0x0/0x4f 

[<ffffffff802d8157>] blkdev_open+0x23/0x4f

[<ffffffff80337f3c>] compat_blkdev_ioctl+0x4c/0x5f 

[<ffffffff802ed664>] compat_sys_ioctl+0xc5/0x2b2 

[<ffffffff8026168d>] ia32_sysret+0x0/0x5

 

I mentioned the error from the P212 during the POST. HP think it's a faulty P212 controller and are dispatching me a new one.

 

Fingers crossed.

Dp.