

 
smoto
New Member

HPUX: program stuck in TE_do_list()

Hi,

We have a problem with a process that gets stuck in a CPU loop after it appears to exit cleanly.

A tusc trace of the process shows this:

Siginfo: si_code: SEGV_MAPERR, faulting address: 0xffffffef7efdd0, si_errno: 0
PC: 00000001000000a0.0 break.m 0x16000
Received signal 11, SIGSEGV, in user mode, [0x9fffffffef54ddd0], partial siginfo
Siginfo: si_code: SEGV_MAPERR, faulting address: 0xffffffef7efdd0, si_errno: 0
PC: 00000001000000a0.0 break.m 0x16000
...

A pstack of the process does not show anything within our library.

$ /usr/ccs/bin/pstack 14855
----------------------- lwpid : 2365377 -------------------------------
0: c00000000004f561 : TE_do_list() + 0x2d1 (/usr/lib/hpux64/dld.so)
1: c000000000054e60 : TE_do_program_exit() + 0x300 (/usr/lib/hpux64/dld.so)
2: c0000000002bfc50 : (unknown) () (unknown)
-------------------------------- lwpid : 2365378 -------------------------------
0: c0000000003547d0 : (unknown) () (unknown)

Our signal handler should catch SIGSEGV and call abort(). Of course, this is not reproducible.

This is a multi-threaded 64-bit application, a mixture of C and C++, running on HP-UX 11.23.

Any clues as to what could be causing this would be very helpful.

Thanks

Alan

 

 

P.S. This thread has been moved from HP-UX > General to HP-UX > Languages and Scripts - HP Forums Moderator

Raj D.
Honored Contributor

Re: HPUX: program stuck in TE_do_list()

Alan,

It seems the process is making an invalid memory reference (a segmentation fault) while it is executing in the CPU loop.


http://www.opengroup.org/onlinepubs/009695399/basedefs/signal.h.html

- SIGSEGV
SEGV_MAPERR
Address not mapped to object.

- SIGSEGV
void * si_addr
Address of faulting memory reference.





Debugging SIGSEGV:
http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=703638


Also check the pthread and libc patches, and whether there is any resource contention with regard to dbc_max_pct, shmmax, memory, or swap space.



Hth,
Raj.
" If u think u can , If u think u cannot , - You are always Right . "
smoto
New Member

Re: HPUX: program stuck in TE_do_list()

Hi Raj,

We don't think that it's a problem with system resources. As far as we can tell, the system has gigabytes of memory free when this happens, and the process itself is using under 200 MB.

Our signal handler would normally catch signals such as SIGSEGV. For example, if I run under gdb and force malloc/realloc to return null, I get this:

hpi2-~/ora: gdb simple
HP gdb 5.2.03 for HP Itanium (32 or 64 bit) and target HP-UX 11.2x.
Copyright 1986 - 2001 Free Software Foundation, Inc.
Hewlett-Packard Wildebeest 5.2.03 (based on GDB) is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
(gdb) set heap-check null-check-size 50000
(gdb) run
Starting program: /home/alan/ora/simple
jrealloc: Error 0

Program received signal SIGSEGV, Segmentation fault
si_code: 1 - SEGV_MAPERR - Address not mapped to object.
[Switching to process 9924]
0x9fffffffef6ad4c0:1 in __milli_strlen+0x41 ()
from /home/alan/5.0_rels/jbc5.0.20/lib/libjbase.so
(gdb) c
Continuing.
jBASE: Segmentation violation. Aborting

Program received signal SIGABRT, Aborted
si_code: -1 - Unknown si_code. Report to HP..
[Switching to process 9924]
0x9fffffffec1f9890:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) c
Continuing.

Program received signal SIGABRT, Aborted
si_code: -1 - Unknown si_code. Report to HP..
[Switching to process 9924]
0x9fffffffec1f9890:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) c
Continuing.

Program terminated with signal SIGABRT, Aborted.
The program no longer exists.
(gdb) quit


hpi2-~/ora: gdb simple core
HP gdb 5.2.03 for HP Itanium (32 or 64 bit) and target HP-UX 11.2x.
Copyright 1986 - 2001 Free Software Foundation, Inc.
Hewlett-Packard Wildebeest 5.2.03 (based on GDB) is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
Core was generated by `simple'.
Program terminated with signal 6, Aborted.
SI_UNKNOWN - signal of unknown origin
#0 0x9fffffffec1f9890:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) where
#0 0x9fffffffec1f9890:0 in kill+0x30 () from /usr/lib/hpux64/libc.so.1
#1 0x9fffffffec11e1d0:0 in raise+0x30 () from /usr/lib/hpux64/libc.so.1
#2 0x9fffffffec1baf90:0 in abort+0x190 () from /usr/lib/hpux64/libc.so.1
#3 0x9fffffffef38b270:0 in SynchronousSignalHandler () at jediSignalUnix.c:531
#4
#5 0x9fffffffef6ad4c0:1 in __milli_strlen+0x41 ()


So you can see that we get an error in realloc, which causes a SIGSEGV, which in turn leads to our handler calling abort().
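
For readers following along, the catch-and-abort pattern described here can be sketched roughly as below. This is a minimal illustration, not the jBASE code from jediSignalUnix.c: the names fatal_signal_handler, install_fatal_handler and fatal_handler_installed are hypothetical, and the sketch assumes a POSIX system with sigaction().

```cpp
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

// Hypothetical handler: report the fault, then abort so a core file is written.
// Only async-signal-safe calls (write, abort) are used inside the handler.
extern "C" void fatal_signal_handler(int signo) {
    (void)signo;
    static const char msg[] = "Segmentation violation. Aborting\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    abort();
}

// Installs the handler for SIGSEGV; returns true on success.
bool install_fatal_handler() {
    struct sigaction sa = {};
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = fatal_signal_handler;
    return sigaction(SIGSEGV, &sa, nullptr) == 0;
}

// Queries the current SIGSEGV disposition without changing it.
bool fatal_handler_installed() {
    struct sigaction current = {};
    if (sigaction(SIGSEGV, nullptr, &current) != 0) {
        return false;
    }
    return current.sa_handler == fatal_signal_handler;
}
```

Calling install_fatal_handler() early in main() and then touching an unmapped address should produce the message followed by SIGABRT, which is the sequence the gdb session above shows in outline. fatal_handler_installed() is only there to confirm, at some later point, that nothing has replaced the handler.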

In my real scenario, the process seems to have run to completion and, from what we can tell, has called exit (all output indicates this), and then just got stuck in TE_do_list(), which is in /usr/lib/hpux64/dld.so.

This is random, and only occurs rarely, and only on the production system (never on Test).

If we knew what TE_do_list() was trying to do, then we might be able to replicate it.

Thanks again

Alan
Raj D.
Honored Contributor

Re: HPUX: program stuck in TE_do_list()

Alan,

>>
(gdb) set heap-check null-check-size 50000
(gdb) run
Starting program: /home/alan/ora/simple
jrealloc: Error 0

Program received signal SIGSEGV, Segmentation fault

That makes sense: the SIGSEGV is raised with an "address not mapped" error.





I hope you are using gcc to compile the code.

Could you check the link below:

http://www.mail-archive.com/gcc-bugs@gcc.gnu.org/msg255898.html

It looks like there is a gcc bug whose stack trace passes through the TE_do_list() function (frame #17) and similarly exits with a library error (frame #18).



Hth,
Raj.
" If u think u can , If u think u cannot , - You are always Right . "
smoto
New Member

Re: HPUX: program stuck in TE_do_list()

Hi Raj,

We use HP's C/C++ compiler, aCC, not gcc.

Alan
ranganath ramachandra
Esteemed Contributor

Re: HPUX: program stuck in TE_do_list()

Hello Alan,

The top frame shows that the dynamic loader dld.so is trying to invoke terminator functions of shared libraries on program exit.

The faulting PC shown by tusc does not seem to be a good address, but you could try and determine where it lies, through gdb.

It is possible that these are addresses that the dynamic loader is using as addresses of terminator functions, possibly because of memory corruption by the application. Please check for memory corruption using gdb and/or vaccine (cadvise).

ranga
 
--
ranga
[i work for hpe]


Dennis Handly
Acclaimed Contributor

Re: HP-UX: program stuck in TE_do_list()

It looks like dld is getting a signal. Do you have the latest dld patch, PHSS_39822?

In some cases, dld blocks signals, causing infinite loops.
Have you tried using gdb?
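
One way to check from inside the application whether a signal is blocked at a given point (for example inside a terminator or destructor) is to query the signal mask. The sketch below is only an illustration and says nothing about what dld.so itself does: is_signal_blocked and set_signal_blocked are hypothetical helper names, and sigprocmask() is used for brevity. In a multi-threaded program like the one in this thread, pthread_sigmask() takes the same arguments and is the call to use.

```cpp
#include <signal.h>

// Returns true if signo is currently blocked in the caller's signal mask.
bool is_signal_blocked(int signo) {
    sigset_t current;
    sigemptyset(&current);
    // With a null "set" argument the mask is only queried, never changed.
    if (sigprocmask(SIG_BLOCK, nullptr, &current) != 0) {
        return false;
    }
    return sigismember(&current, signo) == 1;
}

// Blocks or unblocks a single signal; returns true on success.
bool set_signal_blocked(int signo, bool blocked) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signo);
    return sigprocmask(blocked ? SIG_BLOCK : SIG_UNBLOCK, &set, nullptr) == 0;
}
```

Logging is_signal_blocked(SIGSEGV) from the application's own exit-time code would at least show whether the mask differs between the normal runs and the rare looping one.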

>This is random, and only occurs rarely, and only on the production system

Do you have a corefile? Or if it loops, can you attach with gdb?

>If we knew what TE_do_list() was trying to do, then we might be able to replicate it.

It runs shlib terminators and C++ static destruction.
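
As a rough illustration of the kind of work done at this stage, the sketch below uses only standard C++ and POSIX calls to show exit handlers and a static destructor running in reverse order of registration once exit() is called. It does not model dld.so internals. The names exit_handler_order, Tracer, terminator_a and terminator_b are hypothetical, and the handlers run in a forked child so that the order can be read back through a pipe.

```cpp
#include <stdlib.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

namespace {
int g_out_fd = -1;  // write end of the pipe, set in the child only

void emit(char tag) {
    ssize_t ignored = write(g_out_fd, &tag, 1);
    (void)ignored;
}

void terminator_a() { emit('A'); }
void terminator_b() { emit('B'); }

// Object with static storage duration; its destructor runs during exit().
struct Tracer {
    char tag;
    explicit Tracer(char t) : tag(t) {}
    ~Tracer() { emit(tag); }
};
}  // namespace

// Forks a child that registers exit-time work and calls exit().
// Returns the tags in the order the handlers actually ran.
std::string exit_handler_order() {
    int fds[2];
    if (pipe(fds) != 0) {
        return "";
    }
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return "";
    }
    if (pid == 0) {
        close(fds[0]);
        g_out_fd = fds[1];
        atexit(terminator_a);      // registered first, runs last
        static Tracer tracer('D'); // constructed second, destroyed second
        atexit(terminator_b);      // registered last, runs first
        exit(0);                   // exit-time handlers run from here
    }
    close(fds[1]);
    std::string order;
    char c;
    while (read(fds[0], &c, 1) == 1) {
        order.push_back(c);
    }
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return order;
}
```

If one of those handlers faulted, or jumped through a corrupted function pointer, the process would already be past the point where the application's own output stops, which fits the "appears to exit cleanly" symptom.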

>raj: passing through TE_do_list function (in #17)

#17 should be calling compiler generated routines like #16, __do_global_dtors_aux (g++).

>ranga: The faulting PC shown by tusc does not seem to be a good address

It never does for Integrity.

smoto
New Member

Re: HPUX: program stuck in TE_do_list()


I managed to replicate this issue on a local dev machine (running HP-UX 11.31), although it's very random (I had to run more than 10,000 processes before I got this one).

If I attach to the process using gdb and try to get a stack trace, then gdb exits with a SIGSEGV :-( (the same thing happens if I generate a core file using gcore and try to use gdb on this).

CPU TTY PID USERNAME PRI NI SIZE RES STATE TIME %WCPU %CPU COMMAND
0 ? 5319 alan 152 20 101M 6128K run 853:14 101.11 100.93 Loop1


hpitv3-~/ora: pstack 24764
24764: ./Loop1

-------------------------------- lwpid : 7794933 -------------------------------

-1: c000000000435e10 : [ sendsig ]
1: c00000000004f561 : TE_do_list() + 0x2d1 (/usr/lib/hpux64/dld.so)
2: c000000000054e60 : TE_do_program_exit() + 0x300 (/usr/lib/hpux64/dld.so)
3: c00000000037cd50 : (unknown) () (unknown)

-------------------------------- lwpid : 7794934 -------------------------------

0: c000000000435e10 : (unknown) () (unknown)

hpitv3-~/ora: gcore 5319

hpitv3-~/ora: ls -l core*
-rw------- 1 alan users 6042872 Oct 22 09:50 core.5319

goORA hpitv3-~/ora: file core*
core.5319: ELF-64 core file - IA64 from 'Loop1'


hpitv3-~/ora: /opt/langtools/bin/gdb Loop1 core.5319
HP gdb 5.9 for HP Itanium (32 or 64 bit) and target HP-UX 11.2x.
Copyright 1986 - 2001 Free Software Foundation, Inc.
Hewlett-Packard Wildebeest 5.9 (based on GDB) is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
Core was generated by `Loop1'.


warning: Load module /home/oracle/10gR2/oracle/product/client_1/lib/libclntsh.so.10.1 has been stripped.
Debugging information is not available.


warning: Load module /home/oracle/10gR2/oracle/product/client_1/lib/libnnz10.so has been stripped.
Debugging information is not available.

#0 0xe0000001085c6c60 in (0) 0x40000000005aae00 gdb_crash_handler + 0x180 [/opt/langtools/bin/gdb]
(1) 0xe0000001085c6c80 ---- Signal 11 (SIGSEGV) delivered ----
(2) 0x400000000033d9e1 internalize_unwinds + 0x5c1 [/opt/langtools/bin/gdb]
(3) 0x400000000033be50 read_unwind_info + 0x2d0 [/opt/langtools/bin/gdb]
(4) 0x400000000033ba60 find_unwind_entry + 0x250 [/opt/langtools/bin/gdb]
(5) 0x4000000000473930 print_frame + 0x1530 [/opt/langtools/bin/gdb]
(6) 0x400000000046c8b0 print_frame_info_base + 0x7d0 [/opt/langtools/bin/gdb]
(7) 0x40000000004db630 print_stack_frame_stub + 0x70 [/opt/langtools/bin/gdb]
(8) 0x4000000000423740 catch_errors + 0x1a0 at ../../../Src/gnu/gdb/top.c:746 [/opt/langtools/bin/gdb]
(9) 0x40000000004db590 print_stack_frame + 0x70 [/opt/langtools/bin/gdb]
(10) 0x4000000000647340 core_open + 0x810 at ../../../Src/gnu/gdb/corelow.c:168 [/opt/langtools/bin/gdb]
(11) 0x40000000006aae30 core_file_command + 0xd0 at ../../../Src/gnu/gdb/corefile.c:114 [/opt/langtools/bin/gdb]
(12) 0x40000000004b2a00 do_captured_command + 0x60 at ../../../Src/gnu/gdb/top.c:823 [/opt/langtools/bin/gdb]
(13) 0x4000000000423740 catch_errors + 0x1a0 at ../../../Src/gnu/gdb/top.c:746 [/opt/langtools/bin/gdb]
(14) 0x40000000004b2980 catch_command_errors + 0x60 at ../../../Src/gnu/gdb/top.c:788 [/opt/langtools/bin/gdb]
(15) 0x4000000000185fd0 captured_main + 0x25a0 [/opt/langtools/bin/gdb]
(16) 0x4000000000423740 catch_errors + 0x1a0 at ../../../Src/gnu/gdb/top.c:746 [/opt/langtools/bin/gdb]
(17) 0x40000000001839e0 main + 0x60 [/opt/langtools/bin/gdb]
(18) 0xc000000000032f90 main_opd_entry + 0x50 [/usr/lib/hpux64/dld.so]

GDB crashed with signal 11! About to dump core into 'core' in the directory:
/home/alan/ora
Select one of the following options...
[N] No, do not dump core
[Y] Yes, dump core (default)
NOTE: Make sure to rename any existing core file in this
directory, as gdb's core will overwrite it.
[C] Continue execution (at your own risk)
> N
hpitv3-~/ora:

Alan
Dennis Handly
Acclaimed Contributor

Re: HP-UX: program stuck in TE_do_list()

>then gdb exits with a SIGSEGV

Try downloading gdb 6.0?
http://www.hp.com/go/wdb

>same thing happens if I generate a core file using gcore and try to use gdb on this)

At least if you can get gdb fixed, you won't have to create 10,000 processes.

smoto
New Member

Re: HPUX: program stuck in TE_do_list()

Already done that, same thing :-(

hpitv3-~/ora: /opt/langtools/bin/gdb Loop1 core.5319
HP gdb 6.0 for HP Itanium (32 or 64 bit) and target HP-UX 11iv2 and 11iv3.
Copyright 1986 - 2009 Free Software Foundation, Inc.
Hewlett-Packard Wildebeest 6.0 (based on GDB) is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
Core was generated by `Loop1'.


warning: Load module /home/oracle/10gR2/oracle/product/client_1/lib/libclntsh.so.10.1 has been stripped.
Debugging information is not available.


warning: Load module /home/oracle/10gR2/oracle/product/client_1/lib/libnnz10.so has been stripped.
Debugging information is not available.

#0 0xe0000001085c6c60 in (0) 0x400000000024b570 gdb_crash_handler + 0xf0 [/opt/langtools/bin/gdb]
(1) 0xe0000001085c6c80 ---- Signal 11 (SIGSEGV) delivered ----
(2) 0x40000000002cc141 internalize_unwinds + 0x5e1 [/opt/langtools/bin/gdb]
(3) 0x40000000002ca4d0 read_unwind_info + 0x2d0 [/opt/langtools/bin/gdb]
(4) 0x40000000002c9ec0 find_unwind_entry + 0x260 [/opt/langtools/bin/gdb]
(5) 0x40000000003925f0 print_frame + 0x1450 at ../../../Src/gnu/gdb/stack.c:4732 [/opt/langtools/bin/gdb]
(6) 0x40000000003867d0 print_frame_info_base + 0x750 at ../../../Src/gnu/gdb/stack.c:4732 [/opt/langtools/bin/gdb]
(7) 0x400000000037c390 print_stack_frame_stub + 0x70 at ../../../Src/gnu/gdb/stack.c:4732 [/opt/langtools/bin/gdb]
(8) 0x40000000001ed0d0 catch_errors + 0x190 [/opt/langtools/bin/gdb]
(9) 0x400000000037c2f0 print_stack_frame + 0x70 at ../../../Src/gnu/gdb/stack.c:4732 [/opt/langtools/bin/gdb]
(10) 0x400000000063c020 core_open + 0x830 at ../../../Src/gnu/gdb/corelow.c:180 [/opt/langtools/bin/gdb]
(11) 0x40000000006a1bb0 core_file_command + 0xd0 at ../../../Src/gnu/gdb/corefile.c:114 [/opt/langtools/bin/gdb]
(12) 0x40000000004421c0 do_captured_command + 0x60 at ../../../Src/gnu/gdb/top.c:823 [/opt/langtools/bin/gdb]
(13) 0x40000000001ed0d0 catch_errors + 0x190 [/opt/langtools/bin/gdb]
(14) 0x4000000000442140 catch_command_errors + 0x60 at ../../../Src/gnu/gdb/top.c:788 [/opt/langtools/bin/gdb]
(15) 0x400000000024a500 captured_main + 0x29a0 [/opt/langtools/bin/gdb]
(16) 0x40000000001ed0d0 catch_errors + 0x190 [/opt/langtools/bin/gdb]
(17) 0x40000000001ecf00 main + 0x60 [/opt/langtools/bin/gdb]
(18) 0xc000000000032f90 main_opd_entry + 0x50 [/usr/lib/hpux64/dld.so]

GDB crashed with signal 11! About to dump core into 'core' in the directory: