Re: Need some guide on solving such problem

arking1981 · ‎08-03-2011

Dear all,

I have a tough problem. Tough for me, maybe not for you. :)

I have a process called m61 running on an HP-UX server, and it dumps core.

The call stack looks like this:

(gdb) bt
#0 0x9ed4 in M611f_decodeDPData (c_r_funcCode_b=0 '\000', c_r_objHdr_p=0x0, c_r_areaGrpPara_p=0x7f7f24c0, c_r_DPTypeInst_p=0xc3550a50,
c_r_record_pb=0x7f7f2d64 "@.?\360\300\031\263\340", c_w_length_pu=0x7f7f24bc) at /projects/a7tm/access7/ms/m61/m611f.c:225
#1 0x9790 in M611e_procDPType (c_r_DPType_p=0x7f7f1fac) at /projects/a7tm/access7/ms/m61/m611e.c:371
#2 0x6d98 in main () at /projects/a7tm/access7/ms/m61/m6100.c:340
#3 0x9eac in M611f_decodeDPData (c_r_funcCode_b=154 '\232', c_r_objHdr_p=0x7f7f1b7b, c_r_areaGrpPara_p=0x7f7f1b64, c_r_DPTypeInst_p=0x7f7f1b52,
c_r_record_pb=0x7f7f1b3a "SDU_DEBUG_PRINT_MSGID=0", c_w_length_pu=0x7f7f1b1a) at /projects/a7tm/access7/ms/m61/m611f.c:215

What surprised me is the string SDU_DEBUG_PRINT_MSGID=0 in frame #3 (highlighted in red), whereas c_r_record_pb should be the message body.

This string is defined in a shell script that should be completely unrelated to this core and to the m61 process. I am sure the script was not being executed at that moment.

So why does a string from a thousand miles away appear in the core? I feel I have used up my intelligence.

It looks as if the process tried to reach an address outside its own scope and finally hit the disk file? I can only guess, and I don't know how to troubleshoot such a problem.

Please kindly help me out. Any tools or techniques are most welcome.

I am not 100% sure which community is the best place to post. This one looks appropriate to me.

Moderators, please don't delete this if I posted in the wrong place.

Thanks a lot.

Best Regards

Kang

Hello world...

arking1981 · ‎08-04-2011

By running strings core | grep SDU_DEBUG_PRINT_MSGID

I can find it. So I guess it is included in the data segment of the binary, and a wrong pointer led into the data segment. Could that be? How do I troubleshoot such a problem?

Regards

Kang

Hello world...

Laurent Menase · ‎08-04-2011

The pointer looks like it is in the stack, so it very probably points to an environment variable. If the process was launched by a script that sourced the script containing that string, this is possible.

Is it possible that M611f_decodeDPData parses the environment variables (with getenv or by hand)?

Dennis Handly · ‎08-04-2011

>I have a process called m61 running on HP-UX server and it cores.

Which specific signal?

>What surprised me is the red string SDU_DEBUG_PRINT_MSGID=0. while c_r_record_pb should be the message body.

It's not aborting in frame 3 but in frame 0. You've figured out it is due to c_r_record_pb?

>This string is defined in a shell script which should be absolutely unrelated to this core and the process m61.

If that script exports a variable with that string and is a parent (or indirect parent) of this process, then that string will be in argv.

>Then why a string a thousand miles away appears in the core?

Either in the environment or the file was read.

>This problem looks like the process attempts to reach address out of its own scope and finally hit the disk file?

Not as easy as that.

>Any tools any techniques are mostly welcome.

Hardware watch points may help. Also looking at the source to see where that comes from.

>I don't know 100% sure which community is the best place to post.

This is a language issue, not sysadmin. I've asked the moderators to move it.

arking1981 · ‎08-04-2011

Thanks.
Dennis Handly wrote:
>I have a process called m61 running on HP-UX server and it cores.

Which specific signal?
[kang] 11, segmentation fault (SIGSEGV)

>What surprised me is the red string SDU_DEBUG_PRINT_MSGID=0. while c_r_record_pb should be the message body.

It's not aborting in frame 3 but in frame 0. You've figured out it is due to c_r_record_pb?
[kang] I think the error did occur in frame 3, but the process did not quit right away and the address space was messed up. Then on the next run (main() is a message-reading loop that processes one incoming message per iteration) it quit. Could this happen?
c_r_record_pb is the message body, which should never be such a string. The string does exist and it is an environment variable. I think it sits in the data segment (or something like that). I have no idea how the program runs into its data segment. Maybe the message read in by the main() loop is corrupted.
Frame 0 has some wrong parameters too.

>This string is defined in a shell script which should be absolutely unrelated to this core and the process m61.

If that script exports a variable with that string and is a parent (or indirect parent) of this process, then that string will be in argv.
[kang] it's not the parent.

>Then why a string a thousand miles away appears in the core?

Either in the environment or the file was read.

>This problem looks like the process attempts to reach address out of its own scope and finally hit the disk file?

Not as easy as that.

>Any tools any techniques are mostly welcome.

Hardware watch points may help. Also looking at the source to see where that comes from.

>I don't know 100% sure which community is the best place to post.

This is a language issue, not sysadmin. I've asked the moderators to move it.

Hello world...

Dennis Handly · ‎08-04-2011

>I guess it is included in the data segment of the binary, and the pointer was wrong lead to the data segment. Could it be?

If the debugger prints it, it's in the process's data area.

You need to print out all frames and see what frame 4 was passing as c_r_record_pb and then find out where that came from.

If argv, you might want to print those in main, either by changing the program or by doing it in the debugger.

Wait a minute, how can frame 3 be calling main at frame 2?

Are you looking at the core file on the server where it aborted? If not, you need to do a packcore to move it to another system.

Or are you running the application live?

arking1981 · ‎08-04-2011

Thanks Dennis.

You are right. I copied the core and binary to another server. On the original server the call stack does look different:

(gdb) bt

#0 0x9ed4 in M611f_decodeDPData (c_r_funcCode_b=0 '\000', c_r_objHdr_p=0x0, c_r_areaGrpPara_p=0x7f7f24c0, c_r_DPTypeInst_p=0xc3550a50, c_r_record_pb=0x7f7f2d64 "@.?\360\300\031\263\340", c_w_length_pu=0x7f7f24bc) at /projects/a7tm/access7/ms/m61/m611f.c:225

#1 0x9790 in M611e_procDPType (c_r_DPType_p=0x7f7f1fac) at /projects/a7tm/access7/ms/m61/m611e.c:371

#2 0x6d98 in main () at /projects/a7tm/access7/ms/m61/m6100.c:340

This is printed by gdb 6.1.

while using WDB, the stack looks different and has two more frames:

#0 0x9ed4 in M611f_decodeDPData (c_r_funcCode_b=0 '\000', c_r_objHdr_p=0x0, c_r_areaGrpPara_p=0x7f7f24c0, c_r_DPTypeInst_p=0xc3550a50, c_r_record_pb=0x7f7f2d64 "@.?\360\300\031\263\340", c_w_length_pu=0x7f7f24bc) at /projects/a7tm/access7/ms/m61/m611f.c:225

#1 0x9790 in M611e_procDPType (c_r_DPType_p=0x7f7f1fac) at /projects/a7tm/access7/ms/m61/m611e.c:371

#2 0x6d98 in main () at /projects/a7tm/access7/ms/m61/m6100.c:340

#3 0x6b24 in main () at /projects/a7tm/access7/ms/m61/m6100.c:152

warning: GDB cannot print complete stack trace since some shared libraries are missing. Set GDB_SHLIB_PATH and try again.

#4 0x4f4c445c in ()

#5 0x9eac in M611f_decodeDPData (c_r_funcCode_b=Cannot access memory at address 0x7f7eff1f ) at /projects/a7tm/access7/ms/m61/m611f.c:215

Cannot access memory at address 0x7f7eff2c

This message "warning: GDB cannot print complete stack trace since some shared libraries are missing. Set GDB_SHLIB_PATH and try again." appears. Does it mean anything?

Hello world...

arking1981 · ‎08-05-2011

I am sorry for the formatting. The new HP forum seems much more difficult to use than the old one. I hope it soon becomes as good as the old one.

Hello world...

Dennis Handly · ‎08-05-2011

>I am sorry for the formatting.

Please go back and edit that post and correct the formatting. Under Options on the right, select Edit Reply.

You can put a block of code in a box by using the clipboard icon with a [C] in it.

I don't have problems like that. Perhaps because I don't have "Turn on the Rich Text Editor" in my preferences?

Dennis Handly · ‎08-05-2011

>I copied core and binary to another server.

You can't do that without all of the load modules used. That's why you use gdb's packcore command.

>On the original server the call stack does look different:

This is the only one that matters. Again what signal did you get?

#0 0x9ed4 in M611f_decodeDPData (c_r_funcCode_b=0 '\000', c_r_objHdr_p=0x0, c_r_areaGrpPara_p=0x7f7f24c0, c_r_DPTypeInst_p=0xc3550a50, c_r_record_pb=0x7f7f2d64 "@.?\360\300\031\263\340", c_w_length_pu=0x7f7f24bc) at /projects/a7tm/access7/ms/m61/m611f.c:225

What's on m611f.c:225? What variables are being used and what are their values?

>while using WDB, the stack looks different and has two more frames:

WDB-GUI? And you ran it on the original server?

>This message "warning: GDB cannot print complete stack trace since some shared libraries are missing." Does it mean anything?

It means you need to use packcore & unpackcore.

arking1981 · ‎08-05-2011

Thanks Dennis. I edited the last post to format it better. It turned out to be a problem with my browser.

@Dennis Handly wrote:
>I copied core and binary to another server.

You can't do that without all of the load modules used. That's why you use gdb's packcore command.
[Kang] OK, next time I will pack it.

>On the original server the call stack does look different:

This is the only one that matters. Again what signal did you get?

[kang] 11, segmentation fault (SIGSEGV)

#0 0x9ed4 in M611f_decodeDPData (c_r_funcCode_b=0 '\000', c_r_objHdr_p=0x0, c_r_areaGrpPara_p=0x7f7f24c0, c_r_DPTypeInst_p=0xc3550a50, c_r_record_pb=0x7f7f2d64 "@.?\360\300\031\263\340", c_w_length_pu=0x7f7f24bc) at /projects/a7tm/access7/ms/m61/m611f.c:225

What's on m611f.c:225? What variables are being used and what are their values?
[kang] It is a memcpy, and the pointer is 0x0. I know this leads to a segmentation fault, but knowing only this is not sufficient to find the root cause. Such an error is not common, as this program runs well on so many other servers. So I guess something outside affects it, but that is just my guess. It would be great if any tools or techniques could help find the root cause.

>while using WDB, the stack looks different and has two more frames:

WDB-GUI? And you ran it on the original server?
[kang] yes on the original server

>This message "warning: GDB cannot print complete stack trace since some shared libraries are missing." Does it mean anything?

It means you need to use packcore & unpackcore.
[kang] But I was on the original server.

Your advice is most welcome.

Thanks again.

regards

kang

Hello world...

Dennis Handly · ‎08-06-2011

> >That's why you use gdb's packcore command

[Kang] ok, next time I will pack it

You may need to redo it this time to get reasonable results.

[kang] it is a memcpy, and the pointer is 0x0.
>only knowing this is not sufficient to troubleshoot the root cause.

What are the source and target variable names? You need to track where they are initialized or assigned.
Are they globals, locals or parms?

>as this program is running well at so many other servers. So I guess something outside affects it

It could be a latent bug or something depending on the command line, environment or a data file.

>It will be great if any tools or techniques can help find the root cause.

If you know it works in some cases, you can track down where the variable gets set to a proper value. If it later gets reset to 0, you can put a hardware watch point on it.

[kang] yes on the original server

Better to stick with plain gdb. What version is on the original server?

[kang] but I was on the original server

It could mean the stack is corrupted?

What language are you using? What compiler version?
aCC6 has +check=bounds and +check=uninit.

arking1981 · ‎08-09-2011

Dennis, thanks. By hardware watch point, do you mean overriding the signal handler for SIGSEGV (11) with my own?

Hello world...

Dennis Handly · ‎08-09-2011

>Are you referring to override the signal handler ... by saying hardware watch point?

No, this is a gdb hardware watch point.

(gdb) watch *(void**)0x.....

This sets a hardware watch point on the 4-byte value at the above hex address.

It runs at hardware speed, unlike a gdb software watch point.

arking1981 · ‎08-09-2011

Aha, great. This is new technology for me.

Thanks a lot, Dennis. I will try it.

It is a pity I can't assign points as we did in the old HP forum. :)

Thanks again.

Hello world...

Dennis Handly · ‎08-10-2011

>it is a pity I can't assign points as what we did in the old HP.

You can assign kudos for every answer if you want.
