
Re: HP-UX IA64 B.11.31 BUS_ADRALN

 
Phalgun
Occasional Advisor

HP-UX_IA64_B.11.31_BUS_ADRALN

I am working on an HP-UX IA64 B.11.31 machine.

One of my executables crashes with BUS_ADRALN. The problem is sporadic; the executable runs fine most of the time.

The following is a snippet of the stack trace from gdb:

Program terminated with signal 10, Bus error.
BUS_ADRALN - Invalid address alignment. Please refer to the following link that helps in handling unaligned data: http://docs.hp.com/en/7730/newhelp0610/pragmas.htm#pragma-pack-ex3
#0 0xc000000000211ab0:0 in _lwp_kill+0x30 ()
from /usr/lib/hpux64/libpthread.so.1
(gdb) db
Undefined command: "db". Try "help".
(gdb) bt
#0 0xc000000000211ab0:0 in _lwp_kill+0x30 ()
from /usr/lib/hpux64/libpthread.so.1
#1 0xc000000000178810:0 in pthread_kill+0x9d0 ()
from /usr/lib/hpux64/libpthread.so.1
#2 0xc0000000003f80e0:0 in raise+0xe0 () from /usr/lib/hpux64/libc.so.1
#3 0xc00000001e5a2d80:0 in skgesigOSCrash () at skgesig.c:376
#4 0xc00000001f666900:0 in kpeDbgSignalHandler () at kpedbg.c:1074
#5 0xc00000001e5a3220:0 in skgesig_sigactionHandler () at skgesig.c:799
#6 <signal handler called>
#7 Foccur32 () at Foccur32.c:87
#8 0xc00000001498c020:0 in _tmaff_delallflds () at affinity.c:725
#9 0xc00000001498b570:0 in _tmaff_acall () at affinity.c:117
#10 0xc00000001478f7a0:0 in _tpacall_internal () at tmacall.c:588
#11 0xc0000000147a2a30:0 in _tpcall_internal () at tmcall.c:349
#12 0xc0000000147a0ed0:0 in _tpcall_ () at tmcall.c:157
#13 0xc0000000147a3790:0 in tpcall () at tmcall.c:474
#14 0xc000000002a3bc90:2 in inline ztux_flags () at blbn_trx_tux.c:1078
#15 0xc000000002a3bc80:2 in ztux_sync (l_name=<not available>,
l_service=<not available>, l_request_buf=<not available>,
l_request_buf_len=<not available>, l_response_buf=<not available>,
l_response_buf_len=<not available>, l_flags=<not available>)
at blbn_trx_tux.c:1215
#16 0xc00000001467b300:0 in zfn_call (l_fn=<not available>,
l_tx_buf_len=<not available>, l_rx_buf_len=<not available>) at blbn_trx.c:14340

 

I looked at similar posts about BUS_ADRALN in the community, but was not able to follow them.

I do understand that some address is misaligned, but I do not know which address it is or how to find that out.

 

Thanks in advance for your help.

boukari
Frequent Advisor

Re: HP-UX_IA64_B.11.31_BUS_ADRALN

Hello ,

Most processors (x86 and its relatives being the main exception) require accesses to multi-byte data to be aligned on a multiple of the data size. For example, reading a 4-byte integer from address 0x04 is fine, but attempting the same read from 0x03 makes the CPU raise an alignment fault, which is reported here as a bus error with code BUS_ADRALN.

This is because it's easier to implement the load/store hardware if it's always on a multiple of the data size with which you're working.

DATA STRUCTURE ALIGNMENT:

http://en.wikipedia.org/wiki/Data_structure_alignment


Regards,

BCS SW/HW GSC Engineer (L1)
IEEE Student Member
LPI 3 CORE & High Availability
VCP Vshpere 5 Datacenter
Novell CLA and Data Center specialist Certified
.....
Microsoft Partner & Microsoft student Partner
Phalgun
Occasional Advisor

Re: HP-UX_IA64_B.11.31_BUS_ADRALN

I do understand that an address is misaligned, but what I do not understand is which variable's address is misaligned and how to find that out.

 

It would be great if someone could help me figure that out. Once I know which variable is causing the problem, I can take corrective action in that direction.

Matti_Kurkela
Honored Contributor

Re: HP-UX_IA64_B.11.31_BUS_ADRALN

With just the gdb backtrace and no other information, it will be impossible to tell you which variable had a misaligned address... but it just might be possible to identify the location in the source code where it happened.

 

Note that entry #6 in your backtrace is <signal handler called>. I think this is the CPU detecting a misaligned access attempt and jumping to the appropriate signal handler instead of continuing to run the program. So entries #0..#6 would be from the signal handler code that killed the process, and entry #7 would be the one closest to the actual error location.

 

Now, entry #7 is listed as "Foccur32 () at Foccur32.c:87". That is, function Foccur32(), at line 87 of the source file Foccur32.c. Does this mean anything to you?

 

If possible, look at the source code of the Foccur32() function to determine what variables it uses, and how it uses them. You might be able to use gdb to peek at the values of those variables at the time of the crash, and even see which addresses those variables had.

 

You may also need to examine the parameters given to the Foccur32() function: entries #8..#16 will describe where the Foccur32() function was called from.

MK
Phalgun
Occasional Advisor

Re: HP-UX_IA64_B.11.31_BUS_ADRALN


tpcall(l_name,
l_service,
l_request_buf,
l_request_buf_len,
l_response_buf,
l_response_buf_len,
ztux_flag(l_flags));

 

From the Oracle documentation, we have the following information for tpcall() and Foccur32(). Since the implementation is hidden, I am finding it difficult to determine which address was actually misaligned. Is there some other way to find that out?

Dennis Handly
Acclaimed Contributor

Re: HP-UX IA64 B.11.31 BUS_ADRALN

>entry #7 would be the one closest to the actual error location.

 

Yes, this IS the fault location.

You now need to go into the debugger to debug from the corefile.

Use the frame command to go to the proper frame.

Then do the following:

bt

info reg

disas $pc-16*8 $pc+16

 

From the instruction with the fault, you should be able to find the register with the address and then check its value.

 

>You may also need to examine the parameters given to the Foccur32() function

 

Yes, the misaligned value could come from a parameter.

Phalgun
Occasional Advisor

Re: HP-UX_IA64_B.11.31_BUS_ADRALN

Thanks Dennis, I do get something now.

(gdb) disas $pc-16*8 $pc+16
Dump of assembler code from 0xc00000000e8a3b40:0 to 0xc00000000e8a3bd0:0:
;;; DOC Line Information: [Line, Column Start, Column End] [Line, Column] [Line]
;;; File: Foccur32.c
;;; Line: 59
0xc00000000e8a3b40:0 <Foccur32+0xc0>: (p2) mov r44=5
0xc00000000e8a3b40:1 <Foccur32+0xc1>: (p5) mov r44=r32
0xc00000000e8a3b40:2 <Foccur32+0xc2>: (p5) chk.s.i r40,Foccur32+576
0xc00000000e8a3b50:0 <Foccur32+0xd0>: nop.m 0x0
0xc00000000e8a3b50:1 <Foccur32+0xd1>:
(p2) br.call.dptk.many b0=0xc00000000e891dc0
0xc00000000e8a3b50:2 <Foccur32+0xd2>: (p3) br.cond.dpnt.many Foccur32+560;;
;;; Line: 64
0xc00000000e8a3b60:0 <Foccur32+0xe0>: (p2) mov r1=r34
0xc00000000e8a3b60:1 <Foccur32+0xe1>: (p2) mov r8=-1
0xc00000000e8a3b60:2 <Foccur32+0xe2>: nop.i 0x0
0xc00000000e8a3b70:0 <Foccur32+0xf0>: nop.m 0x0
0xc00000000e8a3b70:1 <Foccur32+0xf1>:
(p5) br.call.dptk.many b0=0xc00000000e893ce0
0xc00000000e8a3b70:2 <Foccur32+0xf2>: (p2) br.cond.dpnt.many Foccur32+512;;
;;; Line: 73
0xc00000000e8a3b80:0 <Foccur32+0x100>: adds r9=8,r8
0xc00000000e8a3b80:1 <Foccur32+0x101>: cmp.ne.unc p6=r0,r8
0xc00000000e8a3b80:2 <Foccur32+0x102>: mov b0=r36
0xc00000000e8a3b90:0 <Foccur32+0x110>: mov r1=r34
0xc00000000e8a3b90:1 <Foccur32+0x111>: mov r8=0;;
0xc00000000e8a3b90:2 <Foccur32+0x112>: mov.i ar.pfs=r35
;;; Line: 78
0xc00000000e8a3ba0:0 <Foccur32+0x120>: (p6) ld4 r9=[r9]
0xc00000000e8a3ba0:1 <Foccur32+0x121>: nop.i 0x0;;
;;; Line: 83
0xc00000000e8a3ba0:2 <Foccur32+0x122>: (p6) add r42=r9,r32;;
0xc00000000e8a3bb0:0 <Foccur32+0x130>: cmp.geu.unc p6=r42,r41
;;; Line: 86
0xc00000000e8a3bb0:1 <Foccur32+0x131>: nop.m 0x0
0xc00000000e8a3bb0:2 <Foccur32+0x132>: (p6) br.cond.dpnt.many Foccur32+464;;
;;; Line: 87
0xc00000000e8a3bc0:0 <Foccur32+0x140>: ld4 r9=[r42]
0xc00000000e8a3bc0:1 <Foccur32+0x141>: adds r8=4,r42;;
0xc00000000e8a3bc0:2 <Foccur32+0x142>: extr.u r10=r9,25,7
End of assembler dump.
(gdb) info line 87

 

Since the signal was generated at line 87, the last three lines should have the answer. The registers used are r9, r8 and r10. However, in the output of 'info reg' I do not find any registers named r9, r8 or r10; there are registers named pr[0-63], gr[0-47], br[0-7], rsc, etc.

I have attached the output of 'info reg'. Could you let me know which registers I should look at?

 

Dennis Handly
Acclaimed Contributor

Re: HP-UX IA64 B.11.31 BUS_ADRALN

;;; Line: 83
0xc00000000e8a3ba0:2 <Foccur32+0x122>: (p6) add r42=r9,r32;;
0xc00000000e8a3bb0:0 <Foccur32+0x130>: cmp.geu.unc p6=r42,r41
;;; Line: 87
0xc00000000e8a3bc0:0 <Foccur32+0x140>: ld4 r9=[r42]

 

>the signal was generated at line 87, the last three lines should have the answer.

 

They tell you why it aborted: r42 (0x600000000051f2fa) isn't aligned for the 4-byte load done by ld4.

 

>The registers used are r9, r8 and r10.  there are registers with name pr[0-63],gr[0-47], br[0-7]

 

Those are the target registers, which are not helpful here. The rN names in the disassembly correspond to grN in the 'info reg' output.

The value of r42 is the sum of r32 (the first parameter) and r9; r9 holds the misaligned value, 0xa.

Phalgun
Occasional Advisor

Re: HP-UX_IA64_B.11.31_BUS_ADRALN

Thanks Dennis, we now have an address that we know is misaligned. From the lines:

 

;;; Line: 73

0xc00000000e8a3b80:0 <Foccur32+0x100>: adds r9=8,r8

;;; Line: 78
0xc00000000e8a3ba0:0 <Foccur32+0x120>: (p6) ld4 r9=[r9]

;;; Line: 83
0xc00000000e8a3ba0:2 <Foccur32+0x122>: (p6) add r42=r9,r32;;

 

it appears that r9 gets its value from r8, but it's not clear where r8 got its value from. Since we do not have any code after frame 13, i.e. after the call to

tpcall(l_name,
l_service,
l_request_buf,
l_request_buf_len,
l_response_buf,
l_response_buf_len,
ztux_flag(l_flags));

 

can you suggest some way to work out which variable might have caused the alignment issue?

Dennis Handly
Acclaimed Contributor

Re: HP-UX IA64 B.11.31 BUS_ADRALN

>r9 gets its value from r8,

 

r8 is a pointer. The code seems to add 8 to that pointer and then load an int from the resulting address.

That int is then treated as a byte offset and used to load a misaligned int, perhaps from some packed structure? So if you don't have control over this data structure, you are going to have to follow the directions given by gdb when it detected that alignment trap.

 

Since you are processing this data structure, you know exactly what it does.

Of course if I had the source to Foccur32, I could make better guesses.  :-)

 

>we do not have any code after frame 13, i.e. after the call to

 

What direction is "after"?  Please list your code by using start and ending frame numbers.  I.e. do you own frame 7?