Operating System - HP-UX

Crash in a socket application

 
SOLVED
Steve_The_King
Frequent Advisor


We have a client socket application that sends and receives data. It crashes on HP-UX Itanium systems. The tusc utility shows the following output:

munmap(0x7ed60000, 1052672) ........................................................................................... = 0
shutdown(5, SHUT_RDWR) ................................................................................................ = 0
recv(5, 0x4025b4a0, 8191, 0) .......................................................................................... ERR#9 EBADF
close(5) .............................................................................................................. = 0
Received signal 11, SIGSEGV, in user mode, [SIG_DFL], partial siginfo
Siginfo: si_code: SEGV_MAPERR, faulting address: 0x200000007ed58bb8, si_errno: 0
PC: 00000001000000a0.0 break.m 0x16000
exit(11) [implicit] ................................................................................................... WIFSIGNALED(SIGSEGV)|WCOREDUMP


When run from gdb, it does not crash. Also, if you put a small delay between the shutdown() and close() system calls, it works fine.

It does not crash on Solaris and HP PA-RISC systems.

Does anyone have any idea about this problem?
9 REPLIES
Steven E. Protter
Exalted Contributor
Solution

Re: Crash in a socket application

Shalom,

>Also, if you put a small delay between shutdown() and close() system calls, it works fine.

Is this not already fixed then? Or is that solution unacceptable?

First thing I'd check would be library and compiler patches.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Steve_The_King
Frequent Advisor

Re: Crash in a socket application

Putting in the delay would cause performance issues.
Also, why does the same code work on other platforms and not on Itanium?
We need to find the root cause of the problem. :)

I tried gdb on the core file today and got the following output.

Program terminated with signal 11, Segmentation fault.
#0 0x60000000c42df7d0:0 in dld_bor_text_entry + 0x30 ()
from /usr/lib/hpux32/dld.so

Not much help on the net about dld_bor_text_entry in dld.so.

Can you please point me to library and compiler patches related to this?
Dennis Handly
Acclaimed Contributor

Re: Crash in a socket application

Well, we would need a stack trace, bt.

#0 0x60000000c42df7d0:0 in dld_bor_text_entry + 0x30 /usr/lib/hpux32/dld.so

Did you use bt? Or was this all bt gave you?

BOR is Bind On Reference. If you use "chatr -B immediate a.out" you can skip this code.

Can you disassemble this?
(gdb) disas $pc-16*4 $pc+16*4

>Can you please point me to library and compiler patches related to this?

Not that I think it will help but you can try the latest linker/dld patch, PHSS_36336.

Do you have threads? It may be a thread stack overflow??
Steve_The_King
Frequent Advisor

Re: Crash in a socket application

>Did you use bt? Or was this all bt gave you?

Yes, that was all bt gave me.

>BOR is Bind On Reference. If you use "chatr -B immediate a.out" you can skip this code.

Yes, it did work. So linking the application with the -Wl,-B,immediate option solves the problem.
But why does it fail with BOR? How can we find that out?
Also, how does putting in a small delay help with BOR?


>Can you disassemble this?
Here is the disassembly output:

(gdb) disas $pc-16*4 $pc+16*4
Dump of assembler code from 0x60000000c42df790:0 to 0x60000000c42df810:0:
0x60000000c42df790:0 <__do_error+0x30>: br.ret.sptk.few b0
0x60000000c42df790:1 <__do_error+0x31>: nop.b 0x0
0x60000000c42df790:2 <__do_error+0x32>: nop.b 0x0;;
0x60000000c42df7a0:0 : alloc r40=ar.pfs,0,11,2,0;;
0x60000000c42df7a0:1 : mov r41=r43
0x60000000c42df7a0:2 : mov r42=r44;;
0x60000000c42df7b0:0 : adds r2=-8,r12;;
0x60000000c42df7b0:1 : adds r12=-176,r12
0x60000000c42df7b0:2 : nop.i 0x0;;
0x60000000c42df7c0:0 : nop.m 0x0
0x60000000c42df7c0:1 : mov r3=b0
0x60000000c42df7c0:2 : nop.b 0x0;;
0x60000000c42df7d0:0 : st8 [r2]=r3,-8;;
0x60000000c42df7d0:1 : mov r43=r16
0x60000000c42df7d0:2 : nop.i 0x0;;
0x60000000c42df7e0:0 : mov r44=r15;;
0x60000000c42df7e0:1 : mov.m r3=ar.unat
0x60000000c42df7e0:2 : nop.i 0x0;;
0x60000000c42df7f0:0 : st8 [r2]=r3,-8;;
0x60000000c42df7f0:1 : st8.spill [r2]=r8,-24
0x60000000c42df7f0:2 : nop.i 0x0;;
0x60000000c42df800:0 : stf.spill [r2]=f8,-16;;
0x60000000c42df800:1 : stf.spill [r2]=f9,-16
0x60000000c42df800:2 : nop.i 0x0;;
0x60000000c42df810:0 : stf.spill [r2]=f10,-16;;
End of assembler dump.


>Do you have threads? It may be a thread stack overflow??
It has only two threads.
Dennis Handly
Acclaimed Contributor

Re: Crash in a socket application

>But why it fails with BOR?? How can we find that out?

By finding out the thread and the thread stack size.

>how does putting a small delay helps BOR?

It may be related to timing? It may be related to which thread gets a signal??

>It has only two threads.

Two total, or a main and two threads?
What is your thread stack size??

You have a thread stack overflow.
0x60000000c42df7d0:0: st8 [r2]=r3,-8;;

BOR needs a 176 byte frame, you are writing to the very top of it and hitting the guard page.

The call site is $br0-16. To get name do:
(gdb) x /i $br0-16
To get SP which may identify which thread:
(gdb) p /x $r12
(gdb) info thread

By looking at the tusc's mmap & mprotect calls, you might be able to figure out which thread contains the above $r12 value.
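The address arithmetic behind that diagnosis can be sketched in C. This is only an illustration: the helper names are hypothetical, the fault = SP - 8 relationship is read off the "adds r2=-8,r12" and "st8 [r2]=r3" instructions in the disassembly above, and comparing the low 32 bits of the faulting address with the addresses tusc prints is an assumption about how this 32-bit process is mapped.

```c
#include <stdint.h>

/* SP (r12) on entry to the BOR stub, implied by the faulting store:
 * the stub computes r2 = r12 - 8 and then stores through r2. */
uint64_t entry_sp_from_fault(uint64_t fault_addr)
{
    return fault_addr + 8;
}

/* Lowest address the stub's 176-byte frame would reach
 * ("adds r12=-176,r12" in the disassembly). */
uint64_t frame_bottom(uint64_t entry_sp)
{
    return entry_sp - 176;
}

/* Nonzero if addr lies inside the mapping [base, base + len).
 * Meant for checking an address against mmap/munmap ranges from tusc. */
int in_mapping(uint64_t addr, uint64_t base, uint64_t len)
{
    return addr >= base && addr - base < len;
}
```

With the faulting address from the original trace, 0x200000007ed58bb8, entry_sp_from_fault() gives 0x200000007ed58bc0, which should match what "p /x $r12" reports in the core if the reading above is right. in_mapping() can then be run against each mmap range in the tusc log to see which thread stack that SP belongs to.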
rick jones
Honored Contributor

Re: Crash in a socket application

After you've gotten the SIGSEGV thing worked-out, you may want to address the lingering bug of calling recv against a socket on which the code has called shutdown(SHUT_RDWR).

If that code segment was meaning to indicate to the client that it should close, and the recv was to await the read return of zero from the client's FIN, then the shutdown() call should be SHUT_WR, not SHUT_RDWR. Otherwise, you might just as well call close() in the first place.
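A minimal sketch of that half-close sequence in C, assuming a connected stream socket and blocking I/O. The function name graceful_close is hypothetical and this is not the poster's code; it only shows the order of calls described above: shutdown(SHUT_WR), recv() until it returns 0, then close().

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Half-close the connection, drain until the peer's FIN, then close.
 * Returns 0 on an orderly close, -1 if any step failed.
 * The descriptor is always closed before returning. */
int graceful_close(int fd)
{
    char buf[4096];
    ssize_t n;
    int rc = 0;

    /* SHUT_WR sends our FIN but leaves the read side usable. */
    if (shutdown(fd, SHUT_WR) == -1)
        rc = -1;

    while (rc == 0) {
        n = recv(fd, buf, sizeof buf, 0);
        if (n == 0)
            break;                  /* peer closed its side: done */
        if (n == -1 && errno != EINTR)
            rc = -1;                /* real error: stop draining */
        /* n > 0: late data from the peer, discarded in this sketch */
    }

    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```

A production version would probably want a receive timeout so that a peer which never closes cannot block the thread forever.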
there is no rest for the wicked yet the virtuous have no pillows
Steve_The_King
Frequent Advisor

Re: Crash in a socket application


There is one main thread which creates one thread for recv on a socket.
You are right, increasing the stack size for second thread did solve the problem.
But why does it work on PA-RISC with the same stack size?

Of the two solutions available, which one is better: increasing the stack size, or linking the application with -Wl,-B,immediate?
What are the impacts/drawbacks of them?

Rick,
Yes, we will look into that issue. Thanks. :)
Dennis Handly
Acclaimed Contributor

Re: Crash in a socket application

>But why does it work on PA-RISC with same stack size?

You are only imagining it is the same. :-)
On IPF, there are two stacks, the RSE stack and the normal stack. So you may have to double the size. (Since libpthread doesn't know the right ratio of the two.)
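One way to request a larger stack for the receiving thread is through the standard pthread attribute calls. The sketch below assumes POSIX threads; the function name, the demo_worker stand-in, and the choice of scaling the platform default by a factor are all illustrative, not taken from the application in question.

```c
#include <pthread.h>
#include <stddef.h>

/* Stand-in for the real receive thread: just records that it ran. */
void *demo_worker(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}

/* Create a thread whose stack is `factor` times the default size
 * reported by pthread_attr_getstacksize(). Returns 0 on success or
 * the error number from the failing pthread call. */
int create_thread_scaled_stack(pthread_t *tid, void *(*fn)(void *),
                               void *arg, size_t factor)
{
    pthread_attr_t attr;
    size_t size;
    int rc;

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_getstacksize(&attr, &size);
    if (rc == 0)
        rc = pthread_attr_setstacksize(&attr, size * factor);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rc;
}
```

Calling create_thread_scaled_stack(&tid, demo_worker, &flag, 2) would follow the "double the size" suggestion. Whether a factor of 2 is enough depends on how deep the thread's call chain actually goes.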

>Of the two solutions available, which one is better?

Obviously increasing the stack size. You may have other thread stack overflows without BOR.

>What are the impacts/drawbacks of them?

Using bind immediate will cause longer startups. If your application runs for days, this is moot. BOR spreads this cost over the execution, until everything has been called once. Spreading the cost also adds overhead, so the total is more than if you did all the binding at once at the start.

The biggest drawback is that you may be depending on BOR ignoring an unsatisfied symbol (unsat) that is never called; that would be caught with -B immediate. Of course you can use -B nonfatal with -B immediate.

Since you only have one thread, the size issue is moot.
Steve_The_King
Frequent Advisor

Re: Crash in a socket application

Closing the Thread.