Re: A COM process on OpenVMS guests

Richard Shao Gang · ‎03-17-2012

Hi HP Community,

Could you please advise on how to troubleshoot a process in a COM state?

OS: OpenVMS 8.4 guest on HPVM

Process: ixiikc

We run the same program on hundreds of servers and I haven't seen this problem on a physical server. But we have seen it on the virtual guests.

I have read some docs:

http://labs.hoffmanlabs.com/node/231

BA554-90017 HP OpenVMS System Analysis Tools Manual

Best regards,

Richard

IKEA IT

Hoff · ‎03-17-2012

A computable process is just that; a process in a computable state.

A computable process can represent a very well-tuned system, or it can indicate an application-level error.

The latter can be an infinite loop or one of its variations, or an algorithm that doesn't scale well to the data (e.g., polynomial time), etc.

It's all about the context for the application processing.

Determining the context involves figuring out whether the process is doing usable work or not, whether the application code is operating with sufficient efficiency, and where the process is spending its time.

Source code that runs on hundreds of servers does not confirm that the source code is bug free, nor that it's stable, nor well-written. Bugs in application code can be latent for years, and bugs can be triggered by differences in timing. And HP-VM can most certainly trigger differences in timing. And bugs can arise in even some of the best-written code, too; whether application code, or OpenVMS, or HP-VM.

Looking at this from my perspective, that you are asking this question in this particular fashion (and comparative lack of detail), and that you're citing a topic that's probably not germane to a computable process (here's one that's closer to the target), and that you have no program counter traces and have posted no details of the errant code, can all be inferred to strengthen the circumstantial case against this particular source code, too. That there's a bug somewhere in this application code.

Do your due diligence here with the application source code, and either rule in your code as the trigger for the looping, or rule it out. Start by sampling the PCs in the loop; that'll tend to reduce the scope of the error. Use integrated debugging and integrated logging where that's available, and add that where it's not.

OpenVMS is certainly not error free. It does see a whole lot of use. And this case may well prove to be an OpenVMS bug or an HP-VM bug. But given the error is arising in your application code, you own figuring out if this is your bug, or if it's a bug in some supporting code.

See this recent thread for somebody that learned something about latent bugs in existing code.

And this old topic has a list of common source code bugs to look for in existing code.

Volker Halle · ‎03-18-2012

Richard,

an OpenVMS process that has used 10:50 hours of CPU time in 11:57 hours of existence, while doing only minimal buffered IO, direct IO and pagefaults, is certainly suspect of 'looping'. Unless it is supposed to do complex CPU-intensive work, which you would know about, right?
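To put a number on that suspicion, here is a minimal Python sketch that turns the two times quoted above into a CPU-to-elapsed ratio. The helper name `cpu_fraction` is made up for illustration; the only inputs are the 10:50 and 11:57 figures from this post.

```python
def cpu_fraction(cpu_hhmm: str, elapsed_hhmm: str) -> float:
    """Return CPU time divided by elapsed time, both given as 'HH:MM'."""
    def to_minutes(hhmm: str) -> int:
        hours, minutes = hhmm.split(":")
        return int(hours) * 60 + int(minutes)

    return to_minutes(cpu_hhmm) / to_minutes(elapsed_hhmm)


# Figures quoted in the post: 10:50 hours of CPU in 11:57 hours of existence.
fraction = cpu_fraction("10:50", "11:57")
print(f"{fraction:.1%} of the process lifetime was spent on the CPU")
```

A process that sits on the CPU for roughly nine tenths of its lifetime while doing almost no IO is either doing heavy computation or spinning.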

The process only appears to be in COM state all the time. In reality, it is in CUR state (i.e. owning the CPU) most of the time, but you only see it in COM state if this is a system with just one CPU, because that CPU is executing your SHOW SYSTEM or SHOW PROC/CONT/ID=xx code at that time.

The best tool to obtain information about this process is PCS (the SDA extension PCS$SDA, PCS = PC sampling). This tool can be started in the running system and will collect PC values from the overall system or just that process.

$ ANALYZE/SYSTEM

SDA> PCS ! to get some help

SDA> PCS LOAD

SDA> PCS START TRACE ! while your process is running/looping

SDA> PCS STOP TRACE ! stop trace after a couple of seconds

SDA> SET PROC ixiikc ! set context to looping process to allow SDA to symbolize addresses

SDA> PCS SHOW TRACE/STAT ! will show, which PC trace value has been captured how often

SDA> PCS SHOW TRACE/PID=<pid-of-your-process> ! show collected PC values in your process

SDA> PCS UNLOAD

SDA> EXIT

You will need to find the PC values collected from your process and map them to the source code. This requires access to the current linker map and the source (machine code) listing, if the PC values are actually in P0 space.

If you carefully look at the data posted from the PCB (Process Control Block), you'll spot:

PCB$L_EXEC_COUNTER 003BF532

This is equivalent to about 10.9 hours spent in EXECUTIVE mode in the context of this process ! Let me guess: either RMS or Oracle RDB (or some other code running in EXEC mode).
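The 10.9 hour figure can be reproduced with a few lines of Python. This is only a sketch: it assumes the counter is kept in 10 ms ticks, which is not stated in this post but is the unit that makes 003BF532 come out at about 10.9 hours. The name `ticks_to_hours` is hypothetical.

```python
# Assumption: PCB$L_EXEC_COUNTER counts 10 ms ticks. This unit is inferred
# from the "about 10.9 hours" figure in the post, not taken from documentation.
TICK_SECONDS = 0.010


def ticks_to_hours(hex_counter: str, tick_seconds: float = TICK_SECONDS) -> float:
    """Convert a hexadecimal tick counter, as displayed by SDA, to hours."""
    return int(hex_counter, 16) * tick_seconds / 3600.0


hours = ticks_to_hours("003BF532")
print(f"003BF532 hex = {int('003BF532', 16)} ticks = {hours:.1f} hours")
```

Compared with the 10:50 hours of total CPU time mentioned earlier, this suggests that nearly all of the CPU time was accumulated in executive mode.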

Volker.

Richard Shao Gang · ‎03-19-2012

Hi,

Please accept my greetings from China.

I have read the docs in your reply, and from there some more related docs via Google.
I am interested in these OpenVMS internals, although they are a lot for a system manager.

I don't have access to the code, but I could check with development if it is necessary.
And I will try to use English as well as I can.

->
The looping process ixiikc was gone when the guest crashed yesterday morning.

Besides the ixiikc looping, we saw some errors in operator.log and in the application logs before the crash. After the crash, both the system and the application look normal and calm.

Before the crash->
M00029/SYS0.SYSMGR> sear operator.log.-1 "-e-","-f-","-w-"
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF92280940, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF92280A00, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%COSI-F-BUGCHECK, internal consistency failure
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=04, virtual address=000000008081E000, PC=FFFFFFFF91D33040, PS=0000000B
M00029/MHS.LOGFILES> sear IXLC2_2.LOG.-1 "++++"
++++++++++ Error in batch!!!!!!!!!! ICQA2 dumped.
++++++++++ Error in batch!!!!!!!!!! ICQA4 dumped.
++++++++++ Error in batch!!!!!!!!!! IXID7 dumped.
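The ACCVIO lines in the OPERATOR.LOG search above are easier to read as a tally: 10 of the 13 share PC=FFFFFFFF922809A0, and 12 of the 13 fault on virtual address 0000000000000035. Below is a minimal Python sketch of such a tally. The regular expression and the name `tally_accvio` are illustrative, and the sample holds only four lines copied from the listing so that the block stays short.

```python
import re
from collections import Counter

# Four lines copied verbatim from the OPERATOR.LOG search output above.
SAMPLE = """\
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF922809A0, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000000000035, PC=FFFFFFFF92280940, PS=0000000B
%SYSTEM-F-ACCVIO, access violation, reason mask=04, virtual address=000000008081E000, PC=FFFFFFFF91D33040, PS=0000000B
"""

# Capture the faulting virtual address and the PC from each ACCVIO message.
ACCVIO = re.compile(
    r"%SYSTEM-F-ACCVIO,.*?virtual address=([0-9A-F]+), PC=([0-9A-F]+)"
)


def tally_accvio(text: str) -> Counter:
    """Count ACCVIO messages per (virtual address, PC) pair."""
    counts = Counter()
    for line in text.splitlines():
        match = ACCVIO.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


for (va, pc), n in tally_accvio(SAMPLE).most_common():
    print(f"{n:3d}  VA={va}  PC={pc}")
```

Feeding the complete search output to `tally_accvio` instead of `SAMPLE` gives the full distribution; a single dominant PC points at one failing instruction rather than scattered corruption.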

After the crash->
M00029/SYS0.SYSMGR> sear operator.log "-e-","-f-","-w-"
%SEARCH-I-NOMATCHES, no strings matched
M00029/MHS.LOGFILES> sear IXLC2_2.LOG "++++"
%SEARCH-I-NOMATCHES, no strings matched

I have run analyze/crash_dump and attached output here.

M00029> ana/crash SAVEDUMP.DMP

OpenVMS system dump analyzer
...analyzing an I64 compressed full memory dump...

Dump taken on 18-MAR-2012 06:12:08.67 using version V8.4
MACHINECHK, Machine check while in kernel mode

SDA> set output /nohead sysdump.lis
SDA> read/exec
SDA> show crash
%SDA-W-NOREAD, unable to access location 00000000.00000088
SDA> show stack
SDA> show summary
SDA> show process/pcb/phd/reg
SDA> show symbol/all
SDA> exit

So could you find anything in the crash related to the process looping and the ACCVIO errors, and the bad code behind them?

Best regards,
Richard

Volker Halle · ‎03-19-2012

Richard,

a MACHINECHK crash is typically caused by a hardware problem, so seeing this kind of crash on an HPVM guest seems unusual. You need to extract the ERRLOG entries from the dump and process them with WEBES/SEA/'whatever OpenVMS errlog analysis tool of the day' to find out about the underlying HW error:

$ ANALYZE/CRASH SAVEDUMP.DMP

SDA> CLUE ERRLOG

...

SDA> EXIT

This will generate SYS$SCRATCH:CLUE$ERRLOG.SYS. You then need to decode the entries in that errorlog file.

The %COSI-F-BUGCHECK message found in OPERATOR.LOG definitely points to Oracle RDB.

You will have to wait for the problem to re-appear, as the looping ixiikc process no longer existed at the time of the system crash.

Volker.

Richard Shao Gang · ‎03-19-2012

Hi Volker,

SDA> clue errlog

Dumpfile Errorlog Entry Information:
------------------------------------
Sequence Date        Time            Error Message Type
-------- ----------- -----------     --------------------------------
     396 18-MAR-2012 06:12:08.67     Machine Check 670
     397 18-MAR-2012 06:12:08.67 * Crash Entry

Config Entry and Errlog Entries written to CLUE$ERRLOG.SYS file.
Use System Event Analyzer or Error Log Viewer to analyze.

I have analyzed the generated file with SEA and attached the output below.

Best regards,

Richard

Volker Halle · ‎03-19-2012

Richard,

judging by the multiple question marks in the SEA output, I would conclude that this version of SEA probably does not know about HPVM at all. I cannot see any error except that this HPVM guest system incurred a 670 UnCorrectable Processor Event - whatever this means in the context of a HPVM guest system.

Maybe check with the system managers of the HPVM host, whether anything has been reported on the underlying hardware system at that time and/or if they have done anything to your 'guest' system at the time of this MACHINECHK crash.

Volker.

Richard Shao Gang · ‎03-20-2012

Volker,

We manage the host as well, and we haven't changed the hardware or the HPVM software, or installed any patches.

We have seen the machine check on another guest m00238 as well.

M00238> ana/crash SAVEDUMP_238_120319.DMP

OpenVMS system dump analyzer
...analyzing an I64 compressed full memory dump...

Dump taken on 19-MAR-2012 23:53:54.40 using version V8.4
MACHINECHK, Machine check while in kernel mode

SDA> clue errlog

Dumpfile Errorlog Entry Information:
------------------------------------
Sequence Date        Time            Error Message Type
-------- ----------- -----------     --------------------------------
    7221 19-MAR-2012 23:53:54.40     Machine Check 670
    7222 19-MAR-2012 23:53:54.40 * Crash Entry

Config Entry and Errlog Entries written to CLUE$ERRLOG.SYS file.
Use System Event Analyzer or Error Log Viewer to analyze.
SDA> exit

We also see other crashes, although I guess it might be the same root cause.

M00238> ana/crash SAVEDUMP_238_120317.DMP

OpenVMS system dump analyzer
...analyzing an I64 compressed full memory dump...

Dump taken on 17-MAR-2012 12:34:59.46 using version V8.4
PGFIPLHI, Pagefault with IPL too high

SDA> exit

Best regards,

Richard

Volker Halle · ‎03-20-2012

Richard,

are those HPVM guests using dedicated physical Itanium CPUs/cores ? Otherwise, how would you know, which physical processor has caused the problem, if you're running as a virtual machine guest ?

A PGFIPLHI crash is most likely a software problem in OpenVMS. To get more information about that type of crash, please consider providing the CLUE file (see CLUE$COLLECT:CLUE$node_ddmmyy_hhmm.LIS from that crash) or at least the output from SDA> CLUE CRASH (which shows the failing module and offset).

Volker.

Hoff · ‎03-20-2012

Presuming this code does not incorporate kernel-mode software of its own, this looks like flaky hardware or flaky "hardware" (HP-VM) or flaky OpenVMS or kernel-mode software, and -- all the discussions of the RAS features aside -- these glitches can and do arise with some Itanium boxes.

You're arguably extending the time to resolution here by pursuing and debugging this here in HPSC. Call HP support. You've paid for those access privileges, after all. Pass along the CLUE CRASH data or potentially the full carcasses from the crashes to HP, and have them sort this out.

Richard Shao Gang · ‎03-22-2012

Yes, Hoff,

The cases were reported to HP some time ago and, according to HP, have been investigated by HP-UX and the Lab. Now the OpenVMS teams are involved as well.

We have installed a third UNOF patch for HPVM after these crashes. I felt a bit embarrassed when I opened this post, but I think there could also be UNOF support in the communities, or at least valuable input. "Two heads are better than one."

Volker,

The host blade has only one CPU with 4 cores. I found that the guest runs, and crashes, on whichever core it happens to start on.

#top
System: ITSEELM-                                      Fri Mar 23 05:07:07 2012
Load averages: 0.50, 0.45, 0.36
203 processes: 127 sleeping, 76 running
Cpu states:
CPU   LOAD   USER   NICE    SYS   IDLE BLOCK SWAIT   INTR   SSYS
0    0.64   0.0%   0.0% 59.8% 40.2%   0.0%   0.0%   0.0%   0.0%
1    0.21   0.0%   0.0%   3.2% 96.8%   0.0%   0.0%   0.0%   0.0%
2    0.52   0.0%   0.0% 16.2% 83.8%   0.0%   0.0%   0.0%   0.0%
3    0.64   0.2%   0.0% 85.6% 14.2%   0.0%   0.0%   0.0%   0.0%
---   ---- ----- ----- ----- ----- ----- ----- ----- -----
avg   0.50   0.0%   0.0% 41.2% 58.8%   0.0%   0.0%   0.0%   0.0%

System Page Size: 4Kbytes
Memory: 32473240K (32211056K) real, 34537852K (33776996K) virtual, 55907024K free Page# 1/23

CPU TTY PID USERNAME PRI NI   SIZE    RES STATE    TIME %WCPU %CPU COMMAND
3   ? 3949 root     152 20 3253M 3151M run    105:52 82.10 81.95 hpvmapp   <-guest process
3   ? 3829 root     152 20 3253M 3151M run     96:28 20.33 20.30 hpvmapp
2   ? 3863 root     152 20 3253M 3151M run     85:52 8.32 8.31 hpvmapp
2   ? 4039 root     152 20 3253M 3151M run    150:12 7.63 7.61 hpvmapp
1   ? 3893 root     152 20 3253M 3151M run    121:31 7.03 7.02 hpvmapp
0   ? 3930 root     152 20 3253M 3151M run     67:21 4.04 4.03 hpvmapp

#machinfo

CPU info:
1 Intel(R) Itanium(R) Processor 9340 (1.6 GHz, 20 MB)
          4.79 GT/s QPI, CPU version E0
          4 logical processors (4 per socket)

Memory: 98198 MB (95.9 GB)

Firmware info:
   Firmware revision: 01.24
   FP SWA driver revision: 1.18
   IPMI is supported on this system.
   BMC firmware revision: 1.30

Platform info:
   Model:                  "ia64 hp Integrity BL860c i2"
   Machine ID number:      97be7d8e-6105-11e0-8a68-294e7ff67a42
   Machine serial number: CZ31099TF4

OS info:
   Nodename: ITSEELM-
   Release:   HP-UX B.11.31
   Version:   U (unlimited-user license)
   Machine:   ia64
   ID Number: 2545843598
   vmunix _release_version:
@(#) $Revision: vmunix:    B.11.31_LR FLAVOR=perf

And I also uploaded the list files in clue$collect.

Best regards,

Richard

Volker Halle · ‎03-22-2012

Richard,

4 unusual crashes within 4 days. Something is really wrong here:

PGFIPLHI in SWP$SHELL_INIT_C+00DA1 trying to execute st8 [r31] = r20 with R31=40000000.00000000

KRNLSTAKNV in SCH$INTERRUPT_C+00B90

and 2 MACHINECHK crashes on a HP VM virtual machine ?

Only HP will be able to help you here ...

Do you have other guests running on this HP VM node ? Any of them also seeing unusual problems ?

Volker.

Hoff · ‎03-23-2012

Having looked at these OpenVMS crashes for years and years, making any headway without access to the crashdump files and without source code listings for OpenVMS is a whole lot of work; what's tricky at best becomes intractable.

And that's for VMS running native-booted.

This particular configuration is exceptionally complex, and I've found each of the layers here (HP-VM, HP-UX, Integrity, Itanium) can and variously does introduce errors.

CLUE CRASH is good for the "saw that already" crashes that can be automatically scanned and identified, if HP is using those sorts of crash-scanning tools. For figuring out why VMS face-planted, the dump file is (for me) far more interesting.

Given you've already undoubtedly run the error logs and related and looked for hardware glitches and HP has run the CLUE CRASH past whatever they're using these days, if in your situation, I'd next get rid of HP-UX and HP-VM here, and boot OpenVMS native onto the Tukwila hardware. That'll either change the footprint substantially, eliminate the crashes entirely, or (because VMS has a better view into the hardware, if that's the trigger here) potentially identify the error.

And FWIW, I continue to be surprised that folks are willing to do this with production environments and don't choose native boot either directly or via EFI-level partitioning. While I do grok the "cool factor" and the "power and cooling" decisions of VMs, the particular nature of HP's VM implementation for Itanium (you can't boot a VM on the VM here, for testing or debug) and the "moving target" that is the VMS error-decoding tools, makes for a very hairy stack here. You're basically beholden to HP Support with this and similar cases, and across four (VMS, VM, UX, Integrity) HP entities.

Volker Halle · ‎03-23-2012

Richard,

are you aware of this article from the OpenVMS technical journal V16 ?

OpenVMS Guest Troubleshooting

http://h71000.www7.hp.com/openvms/journal/v16/troubleshooting.html

Maybe you can use some of the troubleshooting guidelines in that article to obtain more and better information.

Volker.
