"false" Oracle corruption messages

Hi,
platform HP-UX 11.00 on N-class and K-class + Oracle 8.1.5

I'm getting a lot of messages like the attached .jpg on my Vantive application. This normally shows hardware errors on a disk. I'm getting the error however on 4 different servers!
HP checked the disks and there is nothing wrong with them. I applied all the latest patch bundles plus a special patch set supplied by HP to possibly fix this problem.
Also, the error seems to fix itself after a few minutes. This is however very annoying for the application users who:
- cannot save data for a while
- think that the system is corrupted

Anyone experienced this before? Oracle is no help either on this one.

thanks,
Dirk
16 REPLIES
A. Clay Stephenson
Acclaimed Contributor

Re: "false" Oracle corruption messages

Hi Dirk:

This is not a hardware problem but a software problem. Errno 9 indicates that an I/O operation (read, write, seek) was attempted on a file descriptor that is not open. My best guess is that you need to increase nfile. Do a sar -v and look for overflows.

If it ain't broke, I can fix that.

Re: "false" Oracle corruption messages

It gives me the following output:
HP-UX xxxxxx B.11.00 U 9000/800    02/01/02

17:19:13 text-sz  ov  proc-sz ov   inod-sz ov    file-sz ov
17:19:14     N/A N/A  452/5000  0  7048/7048  0  3908/52058  0
17:19:15     N/A N/A  452/5000  0  7048/7048  0  3908/52058  0
17:19:16     N/A N/A  452/5000  0  7048/7048  0  3908/52058  0
17:19:17     N/A N/A  452/5000  0  7048/7048  0  3908/52058  0
17:19:18     N/A N/A  452/5000  0  7048/7048  0  3909/52058  0
17:19:19     N/A N/A  452/5000  0  7048/7048  0  3908/52058  0
17:19:20     N/A N/A  452/5000  0  7048/7048  0  3909/52058  0
17:19:21     N/A N/A  452/5000  0  7048/7048  0  3909/52058  0
17:19:22     N/A N/A  452/5000  0  7048/7048  0  3908/52058  0
17:19:23     N/A N/A  452/5000  0  7048/7048  0  3912/52058  0
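
For anyone who wants to watch these tables over a longer period, here is a minimal sketch that parses one `sar -v` data line and flags tables that are nearly full or have overflowed. The function names and the 95% threshold are assumptions made for illustration, and the column layout is taken from the output above. A full table with zero overflows is not by itself proof of a problem.

```python
def table_usage(sar_line):
    """Return {table: (used, size, overflows)} for one `sar -v` data line."""
    fields = sar_line.split()
    # Column layout as in the output above:
    # time, text-sz, ov, proc-sz, ov, inod-sz, ov, file-sz, ov
    names = ["text", "proc", "inod", "file"]
    usage = {}
    for i, name in enumerate(names):
        size_field, ov_field = fields[1 + 2 * i], fields[2 + 2 * i]
        if size_field == "N/A":
            continue  # the text table is not reported in this output
        used, size = (int(x) for x in size_field.split("/"))
        usage[name] = (used, size, int(ov_field))
    return usage


def full_tables(sar_line, threshold=0.95):
    """Names of tables at or above `threshold` utilisation, or with overflows.

    The 0.95 default is an arbitrary illustrative value, not an HP guideline.
    """
    return sorted(
        name
        for name, (used, size, ov) in table_usage(sar_line).items()
        if ov > 0 or used / size >= threshold
    )


line = "17:19:14 N/A N/A 452/5000 0 7048/7048 0 3908/52058 0"
print(full_tables(line))
```

On the sample line only the inode table (7048/7048) is flagged; the process and file tables are far below their limits.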

The system is very quiet at the moment (Friday evening in Belgium). I'll run it again next week to see whether the numbers increase...
I must admit I'm very bad at kernel tuning. I once ordered a kernel audit from HP, but I'm still not confident about the setup.
Basically the server runs Vantive 'application server' and its database...

Dirk

Re: "false" Oracle corruption messages

I ran "tusc" on a process that gave this false Oracle message.
It's generating the following output (ERR#246 EWOULDBLOCK). Does anyone know this error?

[17982] select(250, 0x6fff0778, NULL, NULL, NULL) ........ [sleeping]
[17982] select(250, 0x6fff0778, NULL, NULL, NULL) ........ = 1
[17982] getnumfds() ...................................... = 13
[17982] ioctl(0, I_XTI_RCV, 0x6fff1110) .................. = 0
[17982] ioctl(0, I_XTI_SND, 0x6fff11c8) .................. = 0
[17982] select(250, 0x6fff0778, NULL, NULL, NULL) ........ = 1
[17982] getnumfds() ...................................... = 13
[17982] ioctl(0, I_XTI_RCV, 0x6fff1110) .................. = 0
[17982] sigvec(SIGALRM, 0x6fff1308, 0x6fff1318) .......... = 0
[17982] alarm(5) ......................................... = 0
[17982] ioctl(11, FIONBIO, 0x6fff25c8) ................... = 0
[17982] select(0, NULL, NULL, NULL, 0x6fff1378) .......... = 0
[17982] write(11, "\0e8\0\006\0\0\0\0\011x d501\0\0".., 232) = 232
[17982] read(11, 0x400f6256, 2064) ....................... ERR#246 EWOULDBLOCK
[17982] open("/opt/oracle/product/8.1.5/rdbms/mesg/oraus.msb", O_RDONLY, 0) = 13
[17982] fcntl(13, F_SETFD, 1) ............................ = 0
[17982] lseek(13, 0, SEEK_SET) ........................... = 0
[17982] read(13, "1513" 011303\t\t\0\0\0\0\0\0\0\0".., 256) = 256
[17982] lseek(13, 512, SEEK_SET) ......................... = 512
[17982] read(13, "1d1 [ z x 0e\0\0\0\0\0\0\0\0\0\0".., 512) = 512
[17982] lseek(13, 1024, SEEK_SET) ........................ = 1024
[17982] read(13, "\018\0$ \07 \0@ \0J \0V \0a \0j ".., 512) = 512
[17982] lseek(13, 98304, SEEK_SET) ....................... = 98304
[17982] read(13, "\0\n\f+ \0\0\0D \f, \0\0\0r \f- ".., 512) = 512
[17982] close(13) ........................................ = 0
[17982] select(0, NULL, NULL, NULL, 0x6fff1378) .......... = 0
[17982] read(11, "\0be\0\006\0\0\0\0\00602\0\b\0\0".., 2064) = 190
[17982] ioctl(11, FIONBIO, 0x6fff2548) ................... = 0
[17982] alarm(0) ......................................... = 5
[17982] sigvec(SIGALRM, 0x6fff1308, 0x6fff1318) .......... = 0
[17982] sigvec(SIGALRM, 0x6fff1308, 0x6fff1318) .......... = 0
[17982] alarm(5) ......................................... = 0
[17982] ioctl(11, FIONBIO, 0x6fff25c8) ................... = 0
[17982] select(0, NULL, NULL, NULL, 0x6fff1378) .......... = 0
[17982] write(11, "\0[ \0\006\0\0\0\0\003^ d7\0\0\0".., 91) = 91
[17982] read(11, 0x400f7e1e, 2064) ....................... ERR#246 EWOULDBLOCK
[17982] open("/opt/oracle/product/8.1.5/rdbms/mesg/oraus.msb", O_RDONLY, 0) = 13
[17982] fcntl(13, F_SETFD, 1) ............................ = 0
[17982] lseek(13, 0, SEEK_SET) ........................... = 0
[17982] read(13, "1513" 011303\t\t\0\0\0\0\0\0\0\0".., 256) = 256
[17982] lseek(13, 512, SEEK_SET) ......................... = 512
[17982] read(13, "1d1 [ z x 0e\0\0\0\0\0\0\0\0\0\0".., 512) = 512
[17982] lseek(13, 1024, SEEK_SET) ........................ = 1024
[17982] read(13, "\018\0$ \07 \0@ \0J \0V \0a \0j ".., 512) = 512
[17982] lseek(13, 98304, SEEK_SET) ....................... = 98304
[17982] read(13, "\0\n\f+ \0\0\0D \f, \0\0\0r \f- ".., 512) = 512
[17982] close(13) ........................................ = 0
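
One possible reading of the trace, offered as an assumption rather than a diagnosis: descriptor 11 is switched to non-blocking mode (the FIONBIO ioctl), so a read() issued before the peer has answered returns EWOULDBLOCK, and the later read() succeeds. The sketch below reproduces that pattern with a local socket pair. The helper names are hypothetical, and it waits with select() on the descriptor, whereas the trace uses select() purely as a timed sleep.

```python
import select
import socket


def read_with_timeout(sock, nbytes, timeout):
    """Non-blocking read; on 'would block', wait with select() and retry once."""
    sock.setblocking(False)      # comparable to ioctl(fd, FIONBIO, ...) in the trace
    try:
        return sock.recv(nbytes)
    except BlockingIOError:      # EAGAIN / EWOULDBLOCK: no data yet, not a fault
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            raise TimeoutError("peer did not answer in time")
        return sock.recv(nbytes)


def demo():
    """Return what the first (too early) and second read attempts produce."""
    client, server = socket.socketpair()
    try:
        client.setblocking(False)
        try:
            client.recv(2064)    # nothing has been sent yet
            first = "data"
        except BlockingIOError:
            first = "would block"
        server.sendall(b"reply")
        second = read_with_timeout(client, 2064, timeout=5.0)
        return first, second
    finally:
        client.close()
        server.close()


print(demo())
```

If this reading is right, the EWOULDBLOCK on its own would be normal behaviour for a non-blocking descriptor and not the source of the corruption message.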

thanks
Dirk
Rita C Workman
Honored Contributor

Re: "false" Oracle corruption messages

Thank you, Thank you, Thank you...

I got the same error while running an upgrade on something ... I knew the hardware was solid, but the DBA was intent that a "bad block had occurred..data could (is) lost...yada yada yada..".
Of course they checked everything when we were done..and couldn't find anything missing...but...

I'm going to enjoy 'sharing' this with him.

You have made a cold Tuesday warmer !

I love this Forum,
Rita

..no points here...the joy of this is enough !
A. Clay Stephenson
Acclaimed Contributor

Re: "false" Oracle corruption messages

Hi Dirk:

Strangely, man'ing the read system call does not indicate that a 246 errno is ever set but obviously it is. I would look at nflocks (you could be running out of system-wide file lock structures). I would also look at the semaphore settings. The one other thing I would look at is a timeslice value of 1 rather than 10. Some of the tuned parameter sets for database environments have very stupidly set the timeslice to 1 and this can cause all sorts of very strange semaphore problems and I suspect it could also cause file lock problems as well. I would also look through all the system call man pages to see if there are any that set errno to EWOULDBLOCK. In the meantime, I'll look into the ioctl on fdes 11 that precedes this read.

Regards, Clay


If it ain't broke, I can fix that.

Re: "false" Oracle corruption messages

If it can help, I attached my current kernel config.

Dirk
A. Clay Stephenson
Acclaimed Contributor

Re: "false" Oracle corruption messages

Hi Dirk:

Your timeslice is set to 1; also nflocks is rather low for your nfile setting. I would set timeslice to 10 and nflocks to 10000. You can do all this within SAM and build a new kernel.
If it ain't broke, I can fix that.

Re: "false" Oracle corruption messages

we're getting there.
The user connection is twofold: 1 Application process and 1 Oracle connection.
The tusc I sent before contained the appl. process. Now I managed to trap the Oracle connection with tusc (see attachment).
I cannot apply your kernel changes immediately. The DB cannot be brought down easily...

regards,
Dirk

Re: "false" Oracle corruption messages

It doesn't seem to relate to file locks. I ran this script provided by HP:
# sh /tmp/lock.sh
# cat outputfile
Tue Feb 5 17:08:21 GMT 2002
Number of used file lock table entries : 393
Tue Feb 5 17:19:47 GMT 2002
Number of used file lock table entries : 393

I'll try to get some downtime to increase the timeslice to 10...

Dirk
A. Clay Stephenson
Acclaimed Contributor

Re: "false" Oracle corruption messages


Hi:

This does look very strange:

The lseek is failing on fdes 0; EBADF indicates that fdes is not open
lseek(0, 33832960, SEEK_SET) ..... ERR#9 EBADF
...
...
open("/usr/lib/nls/msg/C/strerror.cat", O_LARGEFILE, 0177777) .......... = 0

This open returns 0 as a file descriptor; open returns the lowest available file descriptor.

You need to do some more digging to find where close(0) is called before the lseek that fails.
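
The two behaviours described here (EBADF on a closed descriptor, and open() handing out the lowest free descriptor number) can be reproduced with a small sketch. It uses a scratch file and an ordinary descriptor instead of descriptor 0; the function name and file name are hypothetical.

```python
import errno
import os
import tempfile


def demo_closed_descriptor():
    """Show EBADF on a closed descriptor and reuse of its number by open()."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scratch.dat")    # hypothetical scratch file
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.close(fd)                               # this descriptor number is now free

        try:
            os.lseek(fd, 0, os.SEEK_SET)           # same kind of call that fails in the trace
            seek_errno = None
        except OSError as exc:
            seek_errno = exc.errno                 # expected: errno.EBADF

        new_fd = os.open(path, os.O_RDONLY)        # lowest free number is handed out
        try:
            reused = (new_fd == fd)
        finally:
            os.close(new_fd)
        return seek_errno, reused


seek_errno, reused = demo_closed_descriptor()
print(seek_errno == errno.EBADF, reused)
```

In a single-threaded process the freed number is handed out again by the next open(), which is why an open() returning 0 suggests that descriptor 0 had been closed earlier.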
This is looking more and more like a software bug but I would definitely set timeslice to 10 because your current setting can cause all sorts of very strange behavior. If timeslice doesn't fix this, it's probably time to call Oracle support.

If it ain't broke, I can fix that.
Carlos Fernandez Riera
Honored Contributor

Re: "false" Oracle corruption messages


I recall a similar message on 8.1.6 (more or less). There is a patch from Oracle. See Oracle's alert.log on the database server.

sar -v reports that the inode table is at its maximum. If you are using HFS filesystems you need to raise the ninode kernel parameter.



unsupported

Re: "false" Oracle corruption messages

all filesystems are VxFS

I checked with Oracle and they recommend to "upgrade to 8.1.7.2 and then apply patch for BUG 1247796."
This patch relates to ASYNCH_IO, which we're not using, however. It will be very difficult to upgrade since the product is only supported on Oracle 8.1.5 (which is no longer supported by Oracle, so as usual I'm stuck between a rock and a hard place!)

I'll try to convince our Change Control Board to give me some downtime to change the timeslice value.

many thanks,
Dirk
Ruediger Noack
Valued Contributor

Re: "false" Oracle corruption messages

Hi Dirk,

I'm interested in your lock.sh script. Would you please post this script as an attachment?

Thanks a lot
Ruediger

Re: "false" Oracle corruption messages

I had to do some preparatory work before it would run on my 64-bit HP-UX 11.00:
# cp /stand/vmunix /stand/vmunix.orig
# q4pxdb /stand/vmunix

In the script, replace the line /usr/contrib/bin/q4 /stand/vmunix /dev/mem with /usr/contrib/Q4/bin/q4 /stand/vmunix /dev/kmem

# sh /tmp/lock.sh
# cat /tmp/outputfile

regards,
Dirk

Re: "false" Oracle corruption messages

info from HP support:
The lseek() tries to seek on a file that had been given file descriptor 0 by the process. The lseek() fails because that file descriptor is no longer open (in other words, it was closed earlier with close()). We know this because file descriptor 0 is handed out again to the next file that is opened.

The result of this EBADF is that Oracle sends the iwserver process a message saying that it got the EBADF.
The message is passed via the file "/opt/oracle/product/8.1.5/rdbms/mesg/oraus.msb".
The iwserver then passes this on to the iwclient process on the PC.

with the following open questions:

1/ When is the filedescriptor 0 of the "oracleVANPROD" process closed ?

2/ Why is the filedescriptor closed ?

3/ Why doesn't the oracle process notice that the filedescriptor is closed, and why does it persist in doing an lseek()?

Doing a continuous tusc on the oracle process may reveal when it is closed and by whom.
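
A long tusc capture is tedious to read by eye, so here is a minimal sketch of a scanner that reports every call failing with EBADF together with the line of the most recent close() of that descriptor. The regular expression is an assumption based only on the tusc line format shown in this thread. In the sample, the pid and the close(0) line are made up for illustration; the lseek and open lines are the ones quoted earlier.

```python
import re

# Matches lines such as "[17982] close(13) ...... = 0" or "... ERR#9 EBADF".
CALL = re.compile(
    r"^(?:\[(?P<pid>\d+)\]\s+)?(?P<name>\w+)\((?P<args>.*)\)"
    r"\s*\.*\s*(?P<result>=\s*-?\d+|ERR#\d+\s+\w+)\s*$"
)


def _first_int(args):
    """First argument as an int when it is a plain descriptor number, else None."""
    head = args.split(",", 1)[0].strip()
    return int(head) if head.isdigit() else None


def find_bad_descriptor_calls(lines):
    """Return (lineno, call, fd, lineno_of_last_close_or_None) for each EBADF."""
    last_close = {}    # (pid, fd) -> line number of the most recent close(fd)
    findings = []
    for lineno, line in enumerate(lines, start=1):
        m = CALL.match(line.strip())
        if not m:
            continue   # e.g. "[sleeping]" lines or unrelated text
        pid, name, args, result = m["pid"], m["name"], m["args"], m["result"]
        if result.startswith("ERR#"):
            if result.endswith("EBADF"):
                fd = _first_int(args)
                findings.append((lineno, name, fd, last_close.get((pid, fd))))
            continue
        value = int(result.lstrip("= "))
        if name == "close":
            last_close[(pid, _first_int(args))] = lineno
        elif name == "open":
            last_close.pop((pid, value), None)   # the number is in use again
    return findings


SAMPLE = """\
[4242] close(0) .......... = 0
[4242] lseek(0, 33832960, SEEK_SET) ..... ERR#9 EBADF
[4242] open("/usr/lib/nls/msg/C/strerror.cat", O_LARGEFILE, 0177777) .......... = 0
"""

print(find_bad_descriptor_calls(SAMPLE.splitlines()))
```

Each tuple gives the failing line, the call, the descriptor, and the line where that descriptor was last closed (None if the close is not in the capture). The sketch only tracks close() and open(), so descriptors closed or created in other ways are not accounted for.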

Dirk

Re: "false" Oracle corruption messages

We found the culprit: apparently there's a bug in the UTL_HTTP package of Oracle on pre-8.1.7 versions.
We implemented a workaround for this package and this solved the "false" corruptions.

Dirk