Re: How to check the sanity of disks, controllers, and i/o subsystem?

Yogeeraj · ‎01-07-2002

Hi,

I am currently investigating a serious problem that we encountered with our Oracle 8i database on our L1000 last Saturday (05/01/2002).

The error occurred with one of our Oracle datafiles, and our users were not able to use our application for some hours. We feared the worst. Fortunately there was no data loss because of this problem. We were able to identify the defective datafile and create another one to replace it.

The error message we got is: ORA-01115 when running applications.
______________________________________________________________
>
> ORA-01115: IO error reading block from file 13 (block #45276)
> ORA-01110: data file 13:
> '/d06/oradata/cmtdb/pfs_indx_kn01.dbf'
> ORA-27050: function called with invalid FIB/IOV structure
> Additional information: 10
______________________________________________________________

Since then we have been investigating all possible causes of the problem.

One of our Oracle contacts mentioned possible HARDWARE problems and recommended: "Run operating system level utilities and diagnostic tools that check for the sanity of disks, controllers, and the I/O subsystem" (Please find attached an excerpt of the report received from Oracle, and my log files.)

How do I troubleshoot this?

Thank you very much for a reply.

Regards
Yogeeraj

Sanjay_6 · ‎01-07-2002

Hi,

Try STM.

https://software.hp.com/cgi-bin/swdepot_parser.cgi/cgi/try.pl?productNumber=B6191AAE&date=

You can also get this from your Support Plus CD.

Here is the STM FAQ,

http://docs.hp.com/hpux/onlinedocs/diag/stm/stm_faq.htm

Here is the link from hp docs site for more info,

http://docs.hp.com/hpux/diag/index.html#Online%20Diagnostics:%20Support%20Tools%20Manager%20(STM)

Hope this helps.

Regds

harry d brown jr · ‎01-07-2002

You can use the "stm" programs, I prefer xstm (graphical version).

Also, did you get any errors in syslog?

live free or die
harry

Live Free or Die

James R. Ferguson · ‎01-07-2002

Hi:

Install the EMS and Predictive Support diagnostic tools. These are available on the SupportPlus CDROM as part of the DIAGNOSTICS bundle and/or from the link below. These tools will give you early warning alerts of hardware problems.

http://www.software.hp.com/cgi-bin/swdepot_parser.cgi/cgi/displayProductInfo.pl?productNumber=B6191AAE

Regards!

...JRF...

Steven Gillard_2 · ‎01-07-2002

Also check for file system errors. Are there any vxfs errors in syslog or dmesg? If so you may need to umount the file system and run a full fsck. I would also recommend installing the latest vxfs patches if this is the case.

Regards,
Steve

Patrick Wallek · ‎01-07-2002

As the others have said, STM will be your best bet, as will looking at the system logs (dmesg and /var/adm/syslog/syslog.log).

If you just want to check individual disks, like the one that contained that data file, you can also do a dd of the disks.

Do something like:

# dd if=/dev/dsk/c#t#d# of=/dev/null bs=4k

and if it completes successfully with no errors then the disk is probably OK. If you do get a read error, then you have a disk problem and the disk should be replaced at the earliest possible opportunity.
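
The same read-everything check can be sketched in Python for anyone who wants the failing offsets reported instead of stopping at the first error. This is only a minimal sketch: `scan_for_read_errors` is a hypothetical helper name, not an HP-UX tool, and it assumes the device file (for example a `/dev/rdsk/c#t#d#` path) is readable by the user running it.

```python
import os

def scan_for_read_errors(path, block_size=4096, max_errors=16):
    """Read `path` front to back; return (bytes_read, offsets_that_failed)."""
    bad_offsets = []
    bytes_read = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Give up after max_errors failures so a dead device cannot loop forever.
        while len(bad_offsets) < max_errors:
            try:
                chunk = os.pread(fd, block_size, offset)
            except OSError:
                # Record the failing offset and skip ahead by one block.
                bad_offsets.append(offset)
                offset += block_size
                continue
            if not chunk:  # an empty read means end of device or file
                break
            bytes_read += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

An empty list of offsets corresponds to dd completing without errors; any entry in the list points at a region of the disk that could not be read.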

A. Clay Stephenson · ‎01-07-2002

Hi:

The others have given you the answer (stm or xstm), but the better question is "How can I set up my system so that disk/controller/cable failures don't harm my application?" The answer to that is either Mirror/UX or arrays with multiple paths. If done correctly, in most cases you can repair the equipment without ever taking the system down. You can take this to the next level with MC/ServiceGuard. For critical systems, you need to take the approach that stuff happens. When attacked correctly, you can then say 'so what' and your users never know anything has happened.

Food for thought, Clay

If it ain't broke, I can fix that.

T. M. Louah · ‎01-07-2002

Basically it looks like a disk failure.
.. check that the disks are shown as CLAIMED in the S/W State column of this command:
.. # ioscan -fnC disk
.. on each disk of that volume group run:
.. # diskinfo /dev/rdsk/cXtYdZ ; this should return the correct manufacturer info and a non-zero size.
.. success of the above dd command would show that the disks are readable.
.. check syslog.log & OLDsyslog.log for LBOLT, SCSI timeout & power fail messages about a dev_t with a hex number, for example 0x1f006000 --> remove the last 2 zeros, & the disk is c0t6d0.
.. check with pvdisplay /dev/dsk/cXtYdZ the IO Timeout (Seconds); if it's default, verify with the DB vendor what the appropriate value is.
.. sanity of filesystems is checked with:
# fsck -F fstype -y -o full,nolog /dev/vgXX/rlvolYY
to run fsck you need to unmount the filesystem, of course.
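
The dev_t rule of thumb above can be sketched in Python. This is a minimal sketch, assuming the layout implied by the 0x1f006000 example: top byte is the driver's major number, next byte the controller instance, then 4 bits of SCSI target and 4 bits of LUN, with the low byte holding option bits. `decode_dev_t` is a hypothetical helper name.

```python
def decode_dev_t(dev):
    """Split a hex dev_t string into (major number, 'cXtYdZ' device name)."""
    value = int(dev, 16)              # accepts an optional "0x" prefix
    major = (value >> 24) & 0xFF      # driver major number (0x1f in the example)
    instance = (value >> 16) & 0xFF   # controller instance -> cX
    target = (value >> 12) & 0xF      # SCSI target         -> tY
    lun = (value >> 8) & 0xF          # SCSI LUN            -> dZ
    return major, f"c{instance}t{target}d{lun}"

major, name = decode_dev_t("0x1f006000")
print(major, name)
```

The resulting name can then be compared against the device files listed by ioscan.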

G'd luck
t++

Little learning is dangerous!

Yogeeraj · ‎01-08-2002

Hello everybody,

Thanks for all these replies.
I have checked my /var/adm/syslog/syslog.log. No errors have been logged!!

# ioscan -fnC disk
Class  I   H/W Path     Driver  S/W State  H/W Type  Description
=================================================================
disk   17  0/0/1/1.0.0  sdisk   CLAIMED    DEVICE    SEAGATE ST318404LC
           /dev/dsk/c1t0d0   /dev/rdsk/c1t0d0
disk   0   0/0/1/1.2.0  sdisk   CLAIMED    DEVICE    SEAGATE ST318404LC
           /dev/dsk/c1t2d0   /dev/rdsk/c1t2d0
disk   18  0/0/2/0.0.0  sdisk   CLAIMED    DEVICE    SEAGATE ST318404LC
           /dev/dsk/c2t0d0   /dev/rdsk/c2t0d0
disk   1   0/0/2/0.2.0  sdisk   CLAIMED    DEVICE    SEAGATE ST318404LC
           /dev/dsk/c2t2d0   /dev/rdsk/c2t2d0
disk   2   0/0/2/1.2.0  sdisk   CLAIMED    DEVICE    HP DVD-ROM 304
           /dev/dsk/c3t2d0   /dev/dsk/cdrom   /dev/rdsk/c3t2d0
#

I would also add that the problem datafile contained some indexes for my Oracle tables. Hence, I was fortunate not to suffer any data loss. The indexes could be rebuilt without any problem in another tablespace/datafile.

The tablespace which used the problem datafile has been left intact. I would like to know what to do if this is not a case of disk failure.

regards
yogeeraj

Printaporn_1 · ‎01-08-2002

If I were you, and as many peers suggest, I would use STM and LOGTOOL to check for evidence of a hardware problem.

On X Windows use
# xstm
go to Tools -> Utility -> Run
select LOGTOOL.

select raw current log
format raw log
then view the formatted log.
This GUI is very easy to use, and you can get a text file that reports the events associated with the hardware on which it detected I/O errors.
Then check for the I/O path that corresponds to your index file.

enjoy any little thing in my life

Frank Slootweg · ‎01-11-2002

(In a later response,) You indicated that you did not see any errors in your syslog.log file, but:

1) Do you still have the syslog.log file *of the time of the Oracle (ORA-01115) error*? I.e. the system may report no errors *now*, but may have reported errors *before*.

2) Have you set up dmesg(1M) as per the example in the manual page (or root's crontab)? If so, does *that* log contain any errors?

Tim D Fulford · ‎01-11-2002

Yogeeraj

We had a similar problem. Unfortunately, going to HP & saying it all went wrong & there is no evidence in syslog.log or ?stm does not cut much ice!!! Below is the way we convinced them to take the problem more seriously (& we got a fix) -->

Do you run MeasureWare? If so you can do a few things to prove when it went pear shaped

Make a reptall-style report file, say rep.GLOBAL, with the following in it:

REPORT "MWA Export !DATE !TIME Logfile: !LOGFILE !COLLECTOR !SYSTEM_ID"
FORMAT ASCII
HEADINGS ON
SEPARATOR="|"
SUMMARY=60
MISSING=0

DATA TYPE DISK
DATE
TIME
BYDSK_DEVNAME
BYDSK_PHYS_READ_RATE
BYDSK_PHYS_WRITE_RATE
BYDSK_PHYS_IO_RATE
BYDSK_SYSTEM_IO_RATE
BYDSK_UTIL
BYDSK_REQUEST_QUEUE
** The below metrics are optional & may be useful
BYDSK_LOGL_READ_RATE
BYDSK_LOGL_WRITE_RATE
BYDSK_FS_READ_RATE
BYDSK_FS_WRITE_RATE
BYDSK_RAW_READ_RATE
BYDSK_RAW_WRITE_RATE
BYDSK_VM_IO_RATE

Everything above the comment line I would use.

Do the extract for the day that went haywire. Say it was 10 Jan 2002 from 10:00 to 11:00; I would add an hour either side, so 9:00 to 12:00.

# extract -xp -v -d -r rep.GLOBAL -b 01/10/02 09:00 -e 01/10/02 12:00

This will report on all disks so you will need to extract info from the xfrdDISK.asc file for each disk.

# egrep "0/0/1/1.0.0|MWA|Dev|Nam" xfrdDISK.asc > disk1.asc
# egrep "0/0/1/1.2.0|MWA|Dev|Nam" ..etc..

From this you will get the disk?.asc files.
copy them over to a PC with excel on it & import.

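For each disk also calculate the IO time (a guesstimate of the avserv time in sar -d). IO time is in ms (milliseconds):
IO Time = BYDSK_UTIL * 10 / BYDSK_PHYS_IO_RATE
(BYDSK_UTIL is a percentage, so BYDSK_UTIL * 10 is the busy time in ms per second; dividing by the I/Os per second gives ms per I/O.)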
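
As a rough sketch of that calculation in Python: utilisation is a percentage and the rate is in I/Os per second, so utilisation times 10 is the busy time in milliseconds per second, and dividing by the rate gives milliseconds per I/O. The function name and the sample numbers are hypothetical; the 16 ms threshold is the lower end of the "you have problems" range given in this post.

```python
def io_time_ms(util_pct, phys_io_rate):
    """Estimate the average time per physical I/O in milliseconds.

    util_pct:     BYDSK_UTIL, disk busy time as a percentage (0-100)
    phys_io_rate: BYDSK_PHYS_IO_RATE, physical I/Os per second
    """
    if phys_io_rate <= 0:
        return None  # idle interval: no I/O, so no meaningful service time
    # util_pct * 10 is the busy time in milliseconds per second of wall clock
    return util_pct * 10.0 / phys_io_rate

# Hypothetical samples: (device, BYDSK_UTIL, BYDSK_PHYS_IO_RATE)
samples = [("disk1", 40.0, 50.0), ("disk2", 90.0, 30.0), ("disk3", 0.0, 0.0)]
for name, util, rate in samples:
    t = io_time_ms(util, rate)
    if t is None:
        print(f"{name}: idle")
    else:
        flag = "SLOW" if t > 16 else "ok"
        print(f"{name}: {t:.1f} ms per I/O ({flag})")
```

The same arithmetic can be done as a spreadsheet column once the disk?.asc files are imported.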

You can now draw some graphs for each disk. Here are some guidelines
o Disk utilisation % (BYDSK_UTIL): if this is high you may have a problem/bottleneck.
o For the ST3?? disks an IO time of 8ms is OK (expected); any less & you are doing well; much more than 16-20ms & you have problems. The ST3?? are 10,000 rpm, which gives about 3ms average rotational latency, and they have about 5ms average seek time. This gives about 8ms AVERAGE time spent looking for the data, so an IO time of 8ms is fine. However, it should be a bit lower than this if you use buffercache or are doing massive reads or writes etc.
o Check that the reads & writes seem OK (BYDSK_PHYS_READ_RATE & BYDSK_PHYS_WRITE_RATE).
If you do use buffercache you may see no reads but lots of system IO; this is OK as it goes via buffercache (if you use it).
o Any queues on the disk are bad.
o I'm told that if you have a later version of MeasureWare it also reports the amount of data transferred (kB/s); this may be useful to look at.

If your controllers or disks were duff I would expect to see high disk utilisation & low IO throughput (i.e. IO time would be high). We had a similar problem with FC60 disks & it turned out that the kernel parameter scsi_max_qdepth was too low.

** please bear in mind our system was fiber channel yours seems to be SCSI so the above kernel parameter may not be the problem **

Tim


Yogeeraj · ‎01-11-2002

thanks frank.
My answers are NO and NO.
-------------------------------------------
In fact, yesterday I did some further tests to try to locate where exactly the problem might be.

I did the following in sequence:
=========================================================
a. Restart the server
Verify that the database is shutting down and starting up correctly
b. shutdown the database
c. Unmount all user file Systems and run FSCK
d. Check for any evident hardware problems using HP-UX 11 OS utilities STM and logtool
e. Mount all file systems
f. Restart the database
g. Create a new table on the problem tablespace with initial extent of same size as the tablespace.
h. Populate the table with data that will fill up the initial extent.
i. Query or export the table and check for possible error
(checking for errors at each step)
=========================================================
I have detected no problem.

I am still afraid to reuse the 700MB of space used by the datafile that got the data block errors.

I am attaching my syslog.

Thank you all for your replies.

Any further help and recommendations will be most welcome.

Best Regards
Yogeeraj

Yogeeraj · ‎01-11-2002

This is my STM report:
(The overtemp messages are from the last time we had a power cut. Are the I/O errors related to the tests I was doing some time ago, in which I quite often filled up some file systems?)
============================================================
.... L1000 : 132.147.160.9 ....

-- Logtool Utility: View Formatted Summary --

Summary of: /var/stm/logs/os/log1.fmt1
Formatted from: /var/stm/logs/os/log1.raw.cur

Date/time of first entry: Sun Aug 20 15:45:37 2000

Date/time of last entry: Sat Dec 22 09:16:44 2001

Number of LPMC entries: 0
Number of System Overtemp entries: 15
Number of LVM entries: 0
Number of Logger Event entries: 0

Number of I/O Error entries: 228

Device paths for which entries exist:

(220) 0/0/2/1.2.0
(4) 0/0/2/0.0.0
(2) 0/0/2/0.2.0
(2) 0/0/1/1.2.0

Products for which entries exist:

(228) SCSI Disk

Product Qualifiers for which entries exist:

(220) HPDVD-ROM
(8) SEAGATEST318404LC

Logger Events for which entries exist:

(228) sdisk

Device Types for which entries exist:

(228) Disk

Device Qualifiers for which entries exist:

(220) DVDROM
(8) Hard

Marc Dijkstra · ‎01-11-2002

Hi Yogeeraj!

I see from your response that you have done the test discussed and that there were no errors on the L-class.

Have you checked for any evidence of errors on the GSP?
(secure Web Console)

Also, I think that Tim's suggestion with the Measureware is a good one. You can load the demo for Measureware and PerfView from your 11.0 application CD's.

MND

"A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila"

Frank Slootweg · ‎01-11-2002

> Are the I/O errors related to the tests I was doing some time ago, in which I quite often filled up some file systems?)

*NO*! Full file systems do not give I/O errors. Since STM reported many I/O errors and the original Oracle error also said "IO error" (or some such), you will have to look at these I/O errors. Perhaps others can help with that (as I have (nearly) no experience with STM).

STM mentions these addresses:

> (220) 0/0/2/1.2.0
> (4) 0/0/2/0.0.0
> (2) 0/0/2/0.2.0
> (2) 0/0/1/1.2.0

So it would be interesting to know if /d06/oradata/cmtdb/pfs_indx_kn01.dbf
is on any of these addresses:

bdf /d06/oradata/cmtdb/pfs_indx_kn01.dbf (gives LV name)
lvdisplay -v /dev/vg??/... (gives PV (/dev/dsk/...) name(s))
lssf /dev/dsk/... (gives hardware address(es) of PV(s)/disk(s))
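
Joining the STM error counts to device files can also be sketched offline. This is a minimal Python sketch, assuming the `ioscan -fnC disk` layout posted earlier in this thread (a `disk` line followed by a line of device files). The helper names are hypothetical; the sample text and the counts are copied from the ioscan and STM output posted above.

```python
IOSCAN_TEXT = """\
disk 17 0/0/1/1.0.0 sdisk CLAIMED DEVICE SEAGATE ST318404LC
/dev/dsk/c1t0d0 /dev/rdsk/c1t0d0
disk 0 0/0/1/1.2.0 sdisk CLAIMED DEVICE SEAGATE ST318404LC
/dev/dsk/c1t2d0 /dev/rdsk/c1t2d0
disk 18 0/0/2/0.0.0 sdisk CLAIMED DEVICE SEAGATE ST318404LC
/dev/dsk/c2t0d0 /dev/rdsk/c2t0d0
disk 1 0/0/2/0.2.0 sdisk CLAIMED DEVICE SEAGATE ST318404LC
/dev/dsk/c2t2d0 /dev/rdsk/c2t2d0
disk 2 0/0/2/1.2.0 sdisk CLAIMED DEVICE HP DVD-ROM 304
/dev/dsk/c3t2d0 /dev/dsk/cdrom /dev/rdsk/c3t2d0
"""

# I/O error entries per hardware path, from the STM logtool summary
STM_ERRORS = {"0/0/2/1.2.0": 220, "0/0/2/0.0.0": 4,
              "0/0/2/0.2.0": 2, "0/0/1/1.2.0": 2}

def parse_ioscan_disks(text):
    """Map each hardware path to its block device files (/dev/dsk/...)."""
    mapping = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "disk" and len(fields) >= 3:
            current = fields[2]          # third column is the H/W path
            mapping[current] = []
        elif current and fields[0].startswith("/dev/"):
            mapping[current].extend(f for f in fields if f.startswith("/dev/dsk/"))
    return mapping

def errors_by_device(ioscan_text, stm_errors):
    """Return (count, hw_path, device_files) rows, highest error count first."""
    devices = parse_ioscan_disks(ioscan_text)
    ordered = sorted(stm_errors.items(), key=lambda item: -item[1])
    return [(count, path, devices.get(path, [])) for path, count in ordered]

for count, path, devs in errors_by_device(IOSCAN_TEXT, STM_ERRORS):
    print(f"{count:4d}  {path:12s}  {' '.join(devs) or '(unknown)'}")
```

With the data from this thread, the 220 entries belong to the DVD-ROM at 0/0/2/1.2.0, and the remaining 8 are spread over three of the four hard disks.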

Marc Dijkstra · ‎01-11-2002

This area that is giving problems, is it on the AutoRaid 12H? If I remember correctly there is a utility to look after arrays (arraymgr or some such) that will pop up errors if there is a problem with a LUN.

Is the K-Class server also talking to this area at the same time or is the mapping separate?

MND

"A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila"

Yogeeraj · ‎01-12-2002

Attention: Mr. Frank Slootweg
Thanks for the reply and comments.
/d06/oradata/cmtdb/pfs_indx_kn01.dbf (mirrored) is on
0/0/1/1.0.0 /dev/dsk/c1t0d0 and
0/0/2/0.0.0 /dev/dsk/c2t0d0

Hence, one of these device paths, 0/0/2/0.0.0, is one for which the STM report shows 4 errors.

I would also like to mention that i have a large file system in the same Volume Group (VG 01) that is not mirrored and spans over the 2 disks.
LV Name /dev/vg01/lv_d05
LV Status available/syncd
LV Size (Mbytes) 8192
Current LE 2048
Allocated PE 2048
Used PV 2
--- Distribution of logical volume ---
PV Name LE on PV PE on PV
/dev/dsk/c1t0d0 1722 1722
/dev/dsk/c2t0d0 326 326

mounted on /d05 (Oracle 9iAS)
===================================================
L1000: home/deg> bdf /d06/oradata/cmtdb/pfs_indx_kn01.dbf
Filesystem kbytes used avail %used Mounted on
/dev/vg01/lv_d06 4194304 3687617 475024 89% /d06

L1000: home/deg> lvdisplay -v /dev/vg01/lv_d06
--- Logical volumes ---
LV Name /dev/vg01/lv_d06
VG Name /dev/vg01
LV Permission read/write
LV Status available/syncd
Mirror copies 1
Consistency Recovery MWC
Schedule parallel
LV Size (Mbytes) 4096
Current LE 1024
Allocated PE 2048
Stripes 0
Stripe Size (Kbytes) 0
Bad block on
Allocation strict
IO Timeout (Seconds) default

--- Distribution of logical volume ---
PV Name LE on PV PE on PV
/dev/dsk/c1t0d0 1024 1024
/dev/dsk/c2t0d0 1024 1024

--- Logical extents ---
LE PV1 PE1 Status 1 PV2 PE2 Status 2
0000 /dev/dsk/c1t0d0 0000 current /dev/dsk/c2t0d0 0000 current
0001 /dev/dsk/c1t0d0 0001 current /dev/dsk/c2t0d0 0001 current
0002 /dev/dsk/c1t0d0 0002 current /dev/dsk/c2t0d0 0002 current
0003 /dev/dsk/c1t0d0 0003 current /dev/dsk/c2t0d0 0003 current
0004 /dev/dsk/c1t0d0 0004 current /dev/dsk/c2t0d0 0004 current
0005 /dev/dsk/c1t0d0 0005 current /dev/dsk/c2t0d0 0005 current
...
...
1020 /dev/dsk/c1t0d0 1120 current /dev/dsk/c2t0d0 2620 current
1021 /dev/dsk/c1t0d0 1121 current /dev/dsk/c2t0d0 2621 current
1022 /dev/dsk/c1t0d0 1122 current /dev/dsk/c2t0d0 2622 current
1023 /dev/dsk/c1t0d0 1123 current /dev/dsk/c2t0d0 2623 current

L1000: home/deg>lssf /dev/dsk/c1t0d0
sdisk card instance 1 SCSI target 0 SCSI LUN 0 section 0 at address 0/0/1/1.0.0 /dev/dsk/c1t0d0

L1000: home/deg>lssf /dev/dsk/c2t0d0
sdisk card instance 2 SCSI target 0 SCSI LUN 0 section 0 at address 0/0/2/0.0.0 /dev/dsk/c2t0d0
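
Whether a logical volume is mirrored can be read off the lvdisplay figures above. A minimal sketch in Python, assuming each mirror copy allocates one extra physical extent (PE) per logical extent (LE); `mirror_copies` is a hypothetical helper name, and the numbers are the "Current LE" and "Allocated PE" values posted in this message.

```python
def mirror_copies(current_le, allocated_pe):
    """Infer the number of mirror copies from lvdisplay's LE and PE counts."""
    if current_le <= 0 or allocated_pe % current_le != 0:
        raise ValueError("PE count is not a whole multiple of LE count")
    return allocated_pe // current_le - 1

# (Current LE, Allocated PE) as shown by lvdisplay in this thread
volumes = {
    "/dev/vg01/lv_d05": (2048, 2048),
    "/dev/vg01/lv_d06": (1024, 2048),
}
for name, (le, pe) in volumes.items():
    copies = mirror_copies(le, pe)
    state = "mirrored" if copies > 0 else "NOT mirrored"
    print(f"{name}: {copies} mirror copies ({state})")
```

An unmirrored volume that spans both disks, like lv_d05, is exposed to a fault on either disk.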

______________________________________________________________

Yogeeraj · ‎01-12-2002

Hi Marc.
Nice to hear from u.

1. GSP
As far as I remember, the GSP displayed an error the last time there was a power cut. It was about temperature. Since then I have never seen the front panel ALARM LED blinking yellow. I will check it again on Monday and let you know.

By the way, is it possible to direct messages generated to an email address? (We have recently configured SMTP on L1000 so that we can now send emails to our Exchange Server)

NB. The secure web console is still not operational. Remember, we were told that we can have either the console or the web console (not both at the same time)!

2. Measurement software
Well, it has already expired! I will try to uninstall then reinstall and do the tests. I hope it works.

3. Problem area/AutoRAID 12H
No. The problem is not on the AutoRAID. It is still connected to the K250 and has not been connected to the L1000 yet. We are talking here about the internal disks of the L1000. Remember, we have 4x18 GB disks included in 2 volume groups.

4. Problem area/K250 talking to that area.
Well, how to explain? Let's see...
We have 2 file systems from the L1000 that have been mounted on the K250 using NFS.
mount gigal:/BACKUP/ /tmp_mnt/
mount gigal:/users/ /users1/
_______________________________________________
LV Name /dev/vg00/lv_backup
LV Status available/syncd
LV Size (Mbytes) 2700
Current LE 675
Allocated PE 675
Used PV 2
--- Distribution of logical volume ---
PV Name LE on PV PE on PV
/dev/dsk/c1t2d0 339 339
/dev/dsk/c2t2d0 336 336

LV Name /dev/vg01/lv_users
LV Status available/syncd
LV Size (Mbytes) 400
Current LE 100
Allocated PE 200
Used PV 2
--- Distribution of logical volume ---
PV Name LE on PV PE on PV
/dev/dsk/c1t0d0 100 100
/dev/dsk/c2t0d0 100 100
_______________________________________________

Now, /tmp_mnt keeps our database export files, which are created every night at 23:00 on the K250.
/users1 keeps user files that are periodically generated on the K250 and that are FTPed (from the L1000, every 5 mins) to one of our remote servers, where they are used for a batch update.
(Hope that this is not too confusing!)

NB. These file systems are not used for Oracle Data files.

The only possible case where the two servers might be accessing the same file is the FTP case (described above). The files are on average 200k to 400k each.

I hope that all this answers your questions. I will be posting more info from our GSP and reports from MeasureWare on Monday.

Thanks
Yogeeraj

Marc Dijkstra · ‎01-14-2002

Yogeeraj

Well, the measureware will NOT re-install! But as to your query regarding the emailing of problems, I have just set up SCM (Service Control Manager), which is free off the latest Support+ CD's, this with EMS and ODE enables me to send email notifications thru to me, or, if your Cellular service provider allows this -- to a mobile phone via SMS! It also integrates VERY nicely with HP TopTools v.5.5

I am a little stumped on your problems with the L1000... will read thru all this again!

MND

PS: Think it is time I came and spent a week in MU!

"A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila"

Yogeeraj · ‎01-14-2002

Hey Marc,
You are right! Measureware does not re-install!
So i guess, i will not be able to prepare the Measureware reports.

Installing SCM will definitely be a good idea. We will also need to see which other monitoring tools are appropriate for our environment.

regards
Yogeeraj

PS. As for a next visit, you are most welcome. We will soon have to rebuild the AutoRAID and connect both the K250 and L1000 to it. I will try to list a few other interesting things that need to be set up as well (so that u don't get bored! :))
