08-03-2011 11:36 PM
Need some guide on solving such problem
Dear all,
I have a tough problem. Tough for me, maybe not for you. :)
I have a process called m61 running on an HP-UX server, and it dumps core.
The call stack looks like this:
(gdb) bt
#0 0x9ed4 in M611f_decodeDPData (c_r_funcCode_b=0 '\000', c_r_objHdr_p=0x0, c_r_areaGrpPara_p=0x7f7f24c0, c_r_DPTypeInst_p=0xc3550a50,
c_r_record_pb=0x7f7f2d64 "@.?\360\300\031\263\340", c_w_length_pu=0x7f7f24bc) at /projects/a7tm/access7/ms/m61/m611f.c:225
#1 0x9790 in M611e_procDPType (c_r_DPType_p=0x7f7f1fac) at /projects/a7tm/access7/ms/m61/m611e.c:371
#2 0x6d98 in main () at /projects/a7tm/access7/ms/m61/m6100.c:340
#3 0x9eac in M611f_decodeDPData (c_r_funcCode_b=154 '\232', c_r_objHdr_p=0x7f7f1b7b, c_r_areaGrpPara_p=0x7f7f1b64, c_r_DPTypeInst_p=0x7f7f1b52,
c_r_record_pb=0x7f7f1b3a "SDU_DEBUG_PRINT_MSGID=0", c_w_length_pu=0x7f7f1b1a) at /projects/a7tm/access7/ms/m61/m611f.c:215
What surprised me is the string SDU_DEBUG_PRINT_MSGID=0 in frame 3, where c_r_record_pb should be the message body.
This string is defined in a shell script that should be completely unrelated to this core and to the m61 process. I am sure the script was not being executed at that moment.
So why does a string from a thousand miles away appear in the core? I feel I have used up my intelligence.
It looks as if the process tried to reach an address outside its own scope and finally hit the disk file? I can only guess, and I don't know how to troubleshoot this kind of problem.
Please kindly help me out. Any tools or techniques are most welcome.
I am not 100% sure which community is the best place to post. This one looks proper to me.
Moderators, please don't delete it if I have posted in the wrong place.
Thanks a lot.
Best Regards
Kang
08-04-2011 12:13 AM
Re: Need some guide on solving such problem
By running strings core | grep SDU_DEBUG_PRINT_MSGID
I can find it. So I guess it is included in the data segment of the binary, and a wrong pointer led into the data segment. Could that be? How do I troubleshoot such a problem?
Regards
Kang
08-04-2011 01:56 AM
Re: Need some guide on solving such problem
The pointer looks to be in the stack, so it very probably points to an environment variable. That is possible if the process was launched by a script that sourced the script containing the string.
Is it possible that M611f_decodeDPData parses the environment variables (with getenv or by hand)?
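One way to test the environment-variable theory is to ask, inside the process, whether a suspicious pointer falls within one of the environment strings. The following is only a minimal sketch in C, assuming a POSIX system that exposes `environ`; the helper name `env_entry_containing` is hypothetical and not part of m61.

```c
#include <stdint.h>
#include <string.h>

/* POSIX: NULL-terminated array of "NAME=value" strings. */
extern char **environ;

/*
 * Hypothetical helper: return the environment entry whose bytes
 * (including the terminating NUL) contain address p, or NULL if p
 * does not point into any environment string.
 */
const char *env_entry_containing(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    char **e;

    if (environ == NULL)
        return NULL;

    for (e = environ; *e != NULL; e++) {
        uintptr_t start = (uintptr_t)*e;
        uintptr_t end = start + strlen(*e);   /* address of the NUL byte */

        if (addr >= start && addr <= end)
            return *e;
    }
    return NULL;
}
```

If a call such as `env_entry_containing(c_r_record_pb)` returned a non-NULL entry, that would suggest the pointer refers to an inherited environment string rather than to a message buffer.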
08-04-2011 02:49 AM - edited 08-04-2011 02:51 AM
Re: Need some guide on solving such problem
>I have a process called m61 running on HP-UX server and it cores.
Which specific signal?
>What surprised me is the red string SDU_DEBUG_PRINT_MSGID=0. while c_r_record_pb should be the message body.
It's not aborting in frame 3 but in frame 0. You've figured out it is due to c_r_record_pb?
>This string is defined in a shell script which should be absolutely unrelated to this core and the process m61.
If that script exports a variable with that string and is a parent (or indirect parent) of this process, then that string will be in argv.
>Then why a string a thousand miles away appears in the core?
Either in the environment or the file was read.
>This problem looks like the process attempts to reach address out of its own scope and finally hit the disk file?
Not as easy as that.
>Any tools any techniques are mostly welcome.
Hardware watch points may help. Also looking at the source to see where that comes from.
>I don't know 100% sure which community is the best place to post.
This is a language issue, not sysadmin. I've asked the moderators to move it.
08-04-2011 04:51 AM
Re: Need some guide on solving such problem
Thanks.
Dennis Handly wrote:
>I have a process called m61 running on HP-UX server and it cores.
Which specific signal?
[kang] 11, segmentation fault
>What surprised me is the red string SDU_DEBUG_PRINT_MSGID=0. while c_r_record_pb should be the message body.
It's not aborting in frame 3 but in frame 0. You've figured out it is due to c_r_record_pb?
[kang] I think the error did occur in frame 3, but the process did not quit then and the address space was messed up. Then on the next run (main() is a message-reading loop that processes one incoming message per turn) it quit. Could this happen?
c_r_record_pb is the message body, which should never be such a string. This string does exist, and it is an environment variable. I think it sits in the data segment (or something like that). I have no idea how the program runs into its data segment. Maybe the message read in by the main() loop is corrupted.
Frame 0 has some wrong parameters too.
>This string is defined in a shell script which should be absolutely unrelated to this core and the process m61.
If that script exports a variable with that string and is a parent (or indirect parent) of this process, then that string will be in argv.
[kang] it's not the parent.
08-04-2011 05:53 AM
Re: Need some guide on solving such problem
>I guess it is included in the data segment of the binary, and the pointer was wrong lead to the data segment. Could it be?
If the debugger prints it, it's in the process's data area.
You need to print out all frames and see what frame 4 was passing as c_r_record_pb and then find out where that came from.
If it is argv, you might want to print those in main, either by changing the program or by doing it in the debugger.
Wait a minute, how can frame 3 be calling main at frame 2?
Are you looking at the core file on the server where it aborted? If not, you need to do a packcore to move it to another system.
Or are you running the application live?
08-04-2011 11:58 PM - edited 08-05-2011 08:11 PM
Re: Need some guide on solving such problem
Thanks Dennis.
You are right. I copied the core and binary to another server. On the original server the call stack does look different:
(gdb) bt
#0 0x9ed4 in M611f_decodeDPData (c_r_funcCode_b=0 '\000', c_r_objHdr_p=0x0, c_r_areaGrpPara_p=0x7f7f24c0, c_r_DPTypeInst_p=0xc3550a50, c_r_record_pb=0x7f7f2d64 "@.?\360\300\031\263\340", c_w_length_pu=0x7f7f24bc) at /projects/a7tm/access7/ms/m61/m611f.c:225
#1 0x9790 in M611e_procDPType (c_r_DPType_p=0x7f7f1fac) at /projects/a7tm/access7/ms/m61/m611e.c:371
#2 0x6d98 in main () at /projects/a7tm/access7/ms/m61/m6100.c:340
This is printed by gdb 6.1.
With WDB, the stack looks different and has two more frames:
#0 0x9ed4 in M611f_decodeDPData (c_r_funcCode_b=0 '\000', c_r_objHdr_p=0x0, c_r_areaGrpPara_p=0x7f7f24c0, c_r_DPTypeInst_p=0xc3550a50, c_r_record_pb=0x7f7f2d64 "@.?\360\300\031\263\340", c_w_length_pu=0x7f7f24bc) at /projects/a7tm/access7/ms/m61/m611f.c:225
#1 0x9790 in M611e_procDPType (c_r_DPType_p=0x7f7f1fac) at /projects/a7tm/access7/ms/m61/m611e.c:371
#2 0x6d98 in main () at /projects/a7tm/access7/ms/m61/m6100.c:340
#3 0x6b24 in main () at /projects/a7tm/access7/ms/m61/m6100.c:152
warning: GDB cannot print complete stack trace since some shared libraries are missing. Set GDB_SHLIB_PATH and try again.
#4 0x4f4c445c in ()
#5 0x9eac in M611f_decodeDPData (c_r_funcCode_b=Cannot access memory at address 0x7f7eff1f
) at /projects/a7tm/access7/ms/m61/m611f.c:215
Cannot access memory at address 0x7f7eff2c
The message "warning: GDB cannot print complete stack trace since some shared libraries are missing. Set GDB_SHLIB_PATH and try again." appears. Does it mean anything?
08-05-2011 12:02 AM
Re: Need some guide on solving such problem
08-05-2011 02:12 AM - edited 08-05-2011 02:34 AM
Re: Need some guide on solving such problem
>I am sorry for the formatting.
Please go back and edit that post and correct the formatting. Under Options on the right, select Edit Reply.
You can put a block of code in a box by using the clipboard icon with a [C] in it.
I don't have problems like that. Perhaps because I don't have "Turn on the Rich Text Editor" in my preferences?
08-05-2011 02:47 AM
Re: Need some guide on solving such problem
>I copied core and binary to another server.
You can't do that without all of the load modules used. That's why you use gdb's packcore command.
>On the original server the call stack does look different:
This is the only one that matters. Again what signal did you get?
#0 0x9ed4 in M611f_decodeDPData (c_r_funcCode_b=0 '\000', c_r_objHdr_p=0x0, c_r_areaGrpPara_p=0x7f7f24c0, c_r_DPTypeInst_p=0xc3550a50, c_r_record_pb=0x7f7f2d64 "@.?\360\300\031\263\340", c_w_length_pu=0x7f7f24bc) at /projects/a7tm/access7/ms/m61/m611f.c:225
What's on m611f.c:225? What variables are being used and what are their values?
>while using WDB, the stack looks different and has two more frames:
WDB-GUI? And you ran it on the original server?
>This message "warning: GDB cannot print complete stack trace since some shared libraries are missing." Does it mean anything?
It means you need to use packcore & unpackcore.
08-05-2011 08:19 PM
Re: Need some guide on solving such problem
Thanks, Dennis. I edited the last post into a better format. It turned out to be a problem with my browser.
@Dennis Handly wrote:
>I copied core and binary to another server.
You can't do that without all of the load modules used. That's why you use gdb's packcore command.
[Kang] OK, next time I will pack it.
>On the original server the call stack does look different:
This is the only one that matters. Again what signal did you get?
[kang] 11, the segmentation fault
#0 0x9ed4 in M611f_decodeDPData (c_r_funcCode_b=0 '\000', c_r_objHdr_p=0x0, c_r_areaGrpPara_p=0x7f7f24c0, c_r_DPTypeInst_p=0xc3550a50, c_r_record_pb=0x7f7f2d64 "@.?\360\300\031\263\340", c_w_length_pu=0x7f7f24bc) at /projects/a7tm/access7/ms/m61/m611f.c:225
What's on m611f.c:225? What variables are being used and what are their values?
[kang] It is a memcpy, and the pointer is 0x0. I know this leads to a segmentation fault, but knowing only this is not sufficient to find the root cause. Such an error is not common, as this program is running well on so many other servers. So I guess something outside affects it; that is just my guess. It would be great if any tools or techniques could help find the root cause.
>while using WDB, the stack looks different and has two more frames:
WDB-GUI? And you ran it on the original server?
[kang] Yes, on the original server.
>This message "warning: GDB cannot print complete stack trace since some shared libraries are missing." Does it mean anything?
It means you need to use packcore & unpackcore.
[kang] But I was on the original server.
Your advice is most welcome.
Thanks again.
regards
kang
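Since the crash at m611f.c:225 is reported as a memcpy with a 0x0 pointer, one defensive option while hunting for the root cause is to guard the copy and log the call site instead of faulting. The following is only a sketch in C; `checked_copy` and `CHECKED_COPY` are hypothetical names, not part of m61, and the real fix still depends on finding where the pointer is cleared.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical wrapper: copy n bytes only if both pointers are
 * non-NULL. Otherwise log the call site to stderr and return -1 so
 * the caller can drop the message instead of taking a SIGSEGV.
 */
int checked_copy(void *dst, const void *src, size_t n,
                 const char *file, int line)
{
    if (dst == NULL || src == NULL) {
        fprintf(stderr, "%s:%d: refusing memcpy(dst=%p, src=%p, n=%lu)\n",
                file, line, dst, (void *)src, (unsigned long)n);
        return -1;
    }
    memcpy(dst, src, n);
    return 0;
}

/* Records the file and line of the caller automatically. */
#define CHECKED_COPY(dst, src, n) \
    checked_copy((dst), (src), (n), __FILE__, __LINE__)
```

A guard like this does not explain why the pointer is NULL, but the logged call site and pointer values can narrow down which message or code path triggers the failure on the one server where it occurs.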
08-06-2011 06:14 AM - edited 08-06-2011 06:19 AM
Re: Need some guide on solving such problem
[Kang] ok, next time I will pack it
You may need to redo it this time to get reasonable results.
[kang] it is a memcpy, and the pointer is 0x0.
>only knowing this is not sufficient to troubleshoot the root cause.
What are the source and target variable names? You need to track where they are initialized or assigned.
Are they globals, locals or parms?
>as this program is running well at so many other servers. So I guess something outside affects it
It could be a latent bug or something depending on the command line, environment or a data file.
>It will be great if any tools or techniques can help find the root cause.
If you know it works in some cases, you can track down where the variable gets set to a proper value. If it later gets reset to 0, you can put a hardware watch point on it.
[kang] yes on the original server
Better to stick with plain gdb. What version is on the original server?
[kang] but I was on the original server
It could mean the stack is corrupted?
What language are you using? What compiler version?
aCC6 has +check=bounds and +check=uninit.
08-09-2011 12:08 AM
Re: Need some guide on solving such problem
08-09-2011 05:07 AM
Re: Need some guide on solving such problem
>Are you referring to override the signal handler ... by saying hardware watch point?
No, this is a gdb hardware watch point.
(gdb) watch *(void**)0x.....
This sets a hardware watch point on the 4-byte value at the above hex address.
It runs at hardware speed, not at the speed of a gdb software watch point.
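To illustrate the kind of bug a watch point is suited to, here is a small C sketch of a stray write that clears a neighbouring pointer. The struct layout and the names `dp_state` and `clear_record` are hypothetical and only loosely modelled on the c_r_objHdr_p argument that was 0x0 in frame 0; nothing here is taken from the m61 source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: a record buffer followed by a pointer. */
struct dp_state {
    char  record[8];
    void *obj_hdr;
};

/*
 * Zero len bytes starting at the beginning of the struct. The write
 * goes through a pointer to the whole struct, so it stays within one
 * object, but any len larger than sizeof(record) spills into obj_hdr.
 * That is the sort of silent overwrite a watch point on obj_hdr
 * would stop on.
 */
void clear_record(struct dp_state *s, size_t len)
{
    if (len > sizeof *s)
        len = sizeof *s;          /* never write past the struct */
    memset((char *)s, 0, len);
}
```

With a watch point set on the address of `obj_hdr`, gdb would be expected to stop inside the `memset` call that passes too large a length, which points directly at the code doing the overwrite rather than at the later memcpy that crashes.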
08-09-2011 07:24 PM
Re: Need some guide on solving such problem
Aha, great. This is a new technique for me.
Thanks a lot, Dennis. I will try it.
It is a pity I can't assign points as we did in the old HP. :)
Thanks again.
08-10-2011 05:50 AM
Re: Need some guide on solving such problem
>it is a pity I can't assign points as what we did in the old HP.
You can assign kudos for every answer if you want.