trap errors

 
hpuxsa
Frequent Advisor


We had a job running on our system, and one of the filesystems it uses filled up. Ideally the job should have failed at that point, but it did not; it carried on and created all sorts of wrong reports. Now the vendor claims that his program did not fail because the operating system did not generate an error.

I have checked the syslog and the "file system full" message was reported there. I have provided that message to him, but he still says that it is an operating system problem. I believe his program was not written to trap such errors, and now he does not want to accept that.

Could someone explain to me how programs generally trap such errors? Does anything need to be done on the OS side for this?
5 REPLIES
Michael Schulte zur Sur
Honored Contributor

Re: trap errors

Hi,

Since you already found the message in the syslog, you can be sure that the error was also reported to the programme. A system call such as a write to disk returns the number of bytes written, and the caller can check it for an error. The programme can, of course, ignore it. What kind of software is that? Your vendor will have to prove that his code checks for OS errors. The OS does not kill a programme for that kind of error. It is the programmer's responsibility to check for it.

greetings,

Michael
Donny Jekels
Respected Contributor

Re: trap errors

Download and compile tusc.

Run tusc against the suspected PID and send its output to a log file. It traces the system calls the program makes and the results that come back.

tusc will show ALL the errors returned to his program, so you can see which ones it handles and which ones it ignores.
"Vision, is the art of seeing the invisible"
A. Clay Stephenson
Acclaimed Contributor
Solution

Re: trap errors

The operating system did exactly what it was supposed to do: the write() system call returned -1 to indicate an error, and errno was set to indicate the exact nature of the error (in this case ENOSPC). It is the sole responsibility of the programmer to detect this error and act accordingly.

The same logic applies to whatever function was used to write to the file, because ultimately it made a system call to write(), either directly or indirectly.
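As a rough illustration of the checking described above, here is a minimal C sketch. The names write_all() and save_report() are hypothetical helpers invented for this example and have nothing to do with the vendor's actual code; the only system interfaces used are open(), write() and close().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes, or fail.
   Returns 0 on success, -1 on error with errno left as set by write(). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;             /* real error, e.g. ENOSPC on a full filesystem */
        }
        p += n;                    /* a short write is not an error: keep going */
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical caller: refuse to continue if the report cannot be written. */
int save_report(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        fprintf(stderr, "open %s: %s\n", path, strerror(errno));
        return -1;
    }
    if (write_all(fd, text, strlen(text)) != 0) {
        int saved = errno;         /* fprintf/close may overwrite errno */
        fprintf(stderr, "write %s: %s\n", path, strerror(saved));
        close(fd);
        errno = saved;
        return -1;
    }
    if (close(fd) != 0) {
        fprintf(stderr, "close %s: %s\n", path, strerror(errno));
        return -1;
    }
    return 0;
}
```

A job built this way would stop with a non-zero status as soon as save_report() returns -1, instead of going on to produce reports from incomplete data.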

I can assure you that a software vendor would have only tried this lame excuse on me once.

Having said this, you as a sysadmin should have had watchdog daemons in place sending you alerts that critical resources were running low --- but that does not in any way excuse the programmer.

The really bad news is that this is a very clear indication that other kinds of errors are also not being checked for. Real programmers either check the result of each function/system call or use the throw/catch exception model.
If it ain't broke, I can fix that.
A. Clay Stephenson
Acclaimed Contributor

Re: trap errors

One more point that should be made: if the OS did pre-emptively kill processes in cases like this, there would be no way to write programs that handle errors gracefully.
If it ain't broke, I can fix that.
Bill Hassell
Honored Contributor

Re: trap errors

I have to agree with the replies that this is a very lame excuse. EVERY system call requested by a program returns a status code. The programmers chose to ignore the man pages and just 'assume' that all is well or that the operating system would somehow 'fix' the problem or rewrite the program so that it comes to a graceful end. Sorry, but this is a novice programmer's mistake. It ranks right next to buffer overflow coding errors that are so prevalent in modern code.

The only way that a "file system full" condition would affect the vendor's code is when the code makes a system call to write more data. The call will return immediately with an error status and errno set to ENOSPC (value 28; see man errno). It is guaranteed that HP-UX returned this error to the program, but the result code was ignored. errno processing is absolutely basic programming for Unix systems, and errno codes are pervasive throughout all flavors of Unix.
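For programs that write through buffered stdio rather than calling write() directly, the same errno processing applies, but the failure may only show up when the buffer is flushed or the file is closed. The following is a sketch under that assumption; write_report_stdio() and errno_or_eio() are hypothetical names made up for this example, and the code compares against the symbolic name ENOSPC rather than a number.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the current errno, or EIO if a failing
   call left errno at 0, so that a failure is never reported as 0. */
static int errno_or_eio(void)
{
    return errno != 0 ? errno : EIO;
}

/* Hypothetical helper: write text to path through stdio.
   Returns 0 on success, otherwise the errno value of the first failure. */
int write_report_stdio(const char *path, const char *text)
{
    int err = 0;
    FILE *fp;

    errno = 0;
    fp = fopen(path, "w");
    if (fp == NULL)
        return errno_or_eio();

    if (fputs(text, fp) == EOF)
        err = errno_or_eio();
    /* Buffered data reaches write() here, so check the flush as well. */
    if (err == 0 && fflush(fp) == EOF)
        err = errno_or_eio();
    /* Always close the stream, but keep the first error seen. */
    if (fclose(fp) == EOF && err == 0)
        err = errno_or_eio();

    if (err == ENOSPC)
        fprintf(stderr, "%s: file system full, aborting\n", path);
    else if (err != 0)
        fprintf(stderr, "%s: %s\n", path, strerror(err));
    return err;
}
```

The caller would treat any non-zero return as a reason to stop the job.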

Perhaps it is my process control background with the HP 1000 (a true real time computer system) that makes me assume that EVERY system call (read, write, get memory, even date/time) has FAILED until proven otherwise. When your computer is controlling 100 tons of mechanical presses, mistakes like this are NOT an option.


Bill Hassell, sysadmin