MNOWAIT link - undefined symbol

Rob Young_4 · ‎03-05-2007

I've downloaded:

http://vmsone.com/~decuslib/vmssig/vms94a/dsj/mnowat.dsj

But can't resolve nor find info on how to
resolve EXE$NAMPID:

$ show system/noproc
OpenVMS V7.3-2 on node AANODE 5-MAR-2007 12:06:40.43 Uptime 23 22:00:47

$ macro mnowait
$ link/sysexe mnowait
%LINK-W-NUDFSYMS, 1 undefined symbol:
%LINK-I-UDFSYM, EXE$NAMPID
%LINK-W-USEUNDEF, undefined symbol EXE$NAMPID referenced
in psect $LINKAGE offset %X000000C0
in module .MAIN. file DVLP$DISK:[AANODE.YOUNGR.WORK]MNOWAIT.OBJ;3

What does it take to link this correctly?

What problem am I trying to solve? A process
stuck in a tight CPU loop that thinks it has
a pending buffered IO to run down:

Process index: 02EC Name: AUSER Extended PID: 204052EC
--------------------------------------------------------------------
Process status: 02040003 RES,DELPEN,PHDRES,INTER
status2: 00000001 QUANTUM_RESCHED

PCB address               82C8F4C0          JIB address               82ECF0C0
PHD address               87420000          Swapfile disk address     00000000
KTB vector address        82C8F7AC          HWPCB address             FFFFFFFF.87420080
Callback vector address   00000000          Termination mailbox       003B
Master internal PID       000A02EC          Subprocess count          0
Creator extended PID      00000000          Creator internal PID      00000000
Previous CPU Id           00000000          Current CPU Id            00000001
Previous ASNSEQ           000000000008A63C  Previous ASN              000000000000004F
Initial process priority  4                 # open files remaining    150/150
Delete pending count      0                 Direct I/O count/limit    150/150
UIC                       [02040,000024]    Buffered I/O count/limit  149/150
Abs time of last event    01A32F07          BUFIO byte count/limit    100000/100000
# of threads              1                 ASTs remaining            250/250
Swapped copy of LEFC0     00000000          Timer entries remaining   10/10
Swapped copy of LEFC1     00000000          Active page table count   0
Global cluster 2 pointer  00000000          Process WS page count     149

I'm trying to avoid a reboot as this is mostly
an annoyance (user is pegging whatever CPU
it is running on). The process on show users/full shows the NTY as disconnected
(perhaps that is the BIOLM mismatch? I've
not seen Multinet not rundown a process
correctly, by the way.)

Reading here:

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1025937

has me going this route. I'm not fond of
going the DELTA route.

Thanks,
Rob

Volker Halle · ‎03-05-2007

Rob,

> What does it take to link this correctly?

Link it on a VAX ;-)

This program seems to be written for a VAX. The EXE$NAMPID internal entry point does not exist on OpenVMS Alpha. You would not want to trust such a program, as it will probably crash the system anyway.

The looping process should have one outstanding IO (busy channel): SDA> SHOW PROC/CHAN

In the meantime, set the priority of the looping process to 0; this will minimize the effect on the rest of the system.

Volker.

Rob Young_4 · ‎03-05-2007

> Link it on a VAX ;-)

Wow - missed the obvious. One thing I thought
was "2006, must be AlphaVMS discussion."
Who'da thought it?

Yes - one channel open:

Process index: 02EC Name: NEUBAUERLA Extended PID: 204052EC
--------------------------------------------------------------------

Process active channels
-----------------------

Channel  CCB       Window    Status  Device/file accessed
-------  ---       ------    ------  --------------------
 0010    7FF6E000  00000000          DSA1300:

Total number of open channels : 1.

So.. short of a reboot, what does it take
to run this process down?

Rob

Hoff · ‎03-05-2007

There be OpenVMS VAX kernel code here? EXE$NAMPID has morphed on OpenVMS Alpha.

The replacements for the various routines are in the family:

EXE$NAM_TO_PCB
EXE$CVT*

At least some of these are in the device driver book, and the rest are in the source listings.

I'd look for other changes needed in the MNOWAIT code, too.

DECamds can clear various wedged processes now.

And what's driving the NTY loop? I'd look at that in some detail -- that could be a lost I/O and not a mutex, for instance.

Volker Halle · ‎03-05-2007

Rob,

if everything looks otherwise o.k. with the DSA1300: shadowset, this must be a lost IO.

Probably time for a reboot...

Volker.

Rob Young_4 · ‎03-05-2007

I found EXE$NAM_TO_PCB here:

SYS$LDR> dir/date/size=all pro*.stb

Directory SYS$COMMON:[SYS$LDR]

PROCESS_MANAGEMENT.STB;1
86/105 8-FEB-2006 13:49:23.10
PROCESS_MANAGEMENT_MON.STB;1
93/105 8-FEB-2006 13:49:29.11

I certainly didn't want to muck with the .mar
code, nor am I capable of it, so I kept hunting
for the elusive EXE$NAMPID.

Rob

Rob Young_4 · ‎03-05-2007

> if everything looks otherwise o.k. with the DSA1300: shadowset, this must be a lost IO.

> Probably time for a reboot...

I was afraid of that. I hate that. Now
it is a change control, etc.

I read the other discussions... it sure would
be nice if there was a way to BLAST away
processes like this without corrupting the
IO database (whatever)
or bringing the system to a grinding halt.

Rob

Volker Halle · ‎03-05-2007

Rob,

this process is looping, because of some previous SEVERE internal error (most likely either in the shadowing code or the IO sub-system).

OpenVMS thus allows you to get notified of this error in a 'friendly' manner, which does not immediately take down the whole system, but just loops this single process. So when you have a chance and a support contract, you could force a system crash and escalate the problem to OpenVMS engineering for analysis.

Would you have preferred OpenVMS to just ignore the error and continue? And then crash sometime later? Or develop some other nasty and unpredictable behaviour due to this problem?

No program can securely 'kill' such a process and 'un-do' the previous malfunction. It might be possible to manually get rid of this process after thorough analysis and risk assessment, but a reboot is cheaper...

Volker.

Rob Young_4 · ‎03-05-2007

> this process is looping, because of some previous SEVERE internal error (most likely either in the shadowing code or the IO sub-system).

Hmmm... I have these relevant patches
in place:

DEC AXPVMS VMS732_FIBRE_SCSI V9.0 Patch Install 17-SEP-2006 02:45:23
DEC AXPVMS VMS732_UPDATE V7.0 Patch Install 17-SEP-2006 02:44:58

Perhaps something has been fixed since.
But I'm not so sure how severely broken
that level of UPDATE/FIBRE_SCSI could be
(after all, 7 and 9 go-rounds of patches
there).

> OpenVMS thus allows you to get notified of this error in a 'friendly' manner, which does not immediately take down the whole system, but just loops this single process. So when you have a chance and a support contract, you could force a system crash and escalate the problem to OpenVMS engineering for analysis.

I have both and will do so prior to the reboot.
Further, here's my take on the "severity"
of such an issue. I've run many processes
up and down. You tell me: a typical PID here
is 2021C985, so that's 100000+ processes
that have run by (on a cluster). So I've
got one looping.

> Would you have preferred OpenVMS to just ignore the error and continue? And then crash sometime later? Or develop some other nasty and unpredictable behaviour due to this problem?

No. However, I guess I'm looking forward
to the day when each process runs inside
its own Virtual Machine so I can just down
the Virtual Machine. Don't know which OS,
don't really care at this point. As long
as I don't have to be awake at 2 a.m. for
a 7x24 mission critical application just
to clear a freakin' process.

> No program can securely 'kill' such a process and 'un-do' the previous malfunction. It might be possible to manually get rid of this process after thorough analysis and risk assessment, but a reboot is cheaper...

A reboot is cheaper?

For many of us perhaps, not for me and
other folks I know (7x24x365 and you
fight for a maintenance window. Yes,
fight).

Rob

Jan van den Ende · ‎03-05-2007

Rob,

I had the fleeting impression we were in the same kind of circumstances (NEVER down), but
>>>
7x24x365
<<<
has me wondering.

Are you on a special 7-year cycle?

Or did you mean 7 * 365 (still leaves the leap-day every 4 years)
or 24 * 7 * 52 (wow! one day every year!)

We usually define our operation as 24 * 365.25 :-)

But seriously, in those circumstances, you REALLY need AM/AMDS, or get at ease with DELTA.
And, rolling reboots should be a piece of cake.

fwiw,

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Rob Young_4 · ‎03-05-2007

> 7x24x365

I'm bellyachin. I can probably get
by with a Friday evening thing as I'll
bring one node down (the node with
the spinning process) and
maintain application up.
Fortunately, this node
has none of the interfaces running. I still
have the pain of change control and doing
the work, family interruptions, etc.

But in general, we've got to get past this
thing. I see the attraction of VMware (et al)
to make up for the weakness of certain OSes.
We're stuck with VMS when the machine gets
wedged. What would be cool is if we could
do these VMs. I'd isolate a "machine" for
all the important interfaces, and users would
run in multiple VMs. If a process got
wedged/hung/spinning, I'd load balance
everyone off that machine, and reboot in
some far off future (maybe go from 8 to
7 machines in the process).

Hoff · ‎03-05-2007

There are certainly advantages to VMs, but VMs are not a panacea. Just as processes can get messed up due to bugs, a VM can get messed up. And if a VM gets messed up or itself needs some sort of ECO or maintenance or just a reboot, there's a whole lot more affected.

In the interim, OpenVMS Galaxy might be of interest, depending on the particular box.

The upper field of a PID is the cluster system id. If you've got big values in the low bits, you have numbers of processes churning, or a whole lot of reuse of a PCB. A looping process doesn't chew up PID numbers.
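
To make the field split concrete, here is a rough Python sketch that pulls an extended PID apart. The layout is an assumption on my part (low 21 bits holding process index plus sequence number, the next 10 bits identifying the cluster node), and `split_epid` and `index_bits` are made-up names. The 11-bit index width is chosen only because it reproduces the "Master internal PID 000A02EC" shown in the SDA display earlier in this thread; the real width depends on the system.

```python
def split_epid(epid: int, index_bits: int = 11) -> dict:
    """Split a 32-bit extended PID into fields (assumed, illustrative layout)."""
    low = epid & 0x1FFFFF        # assumed: low 21 bits = process index + sequence
    node = (epid >> 21) & 0x3FF  # assumed: next 10 bits identify the cluster node
    return {
        "node_field": node,
        "sequence": low >> index_bits,            # reuse count of the process slot
        "index": low & ((1 << index_bits) - 1),   # process slot index
    }


fields = split_epid(0x204052EC)
print({name: hex(value) for name, value in fields.items()})
# {'node_field': '0x102', 'sequence': '0xa', 'index': '0x2ec'}
```

Under the same assumptions, the low 21 bits of 2021C985 are 0x1C985 (117125 decimal), but that value mixes slot index and sequence number, so it is not a straight count of processes created.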

Volker Halle · ‎03-05-2007

Rob,

all of this is software and software has bugs, whether you like it or not. Some OSes may have more bugs than others. Some OSes may be harder to diagnose in case of problems than others. Some OSes will spread problems into other areas within the OS without telling you.

I believe OpenVMS is pretty good at detecting problems and preventing them from spreading. OpenVMS also provides extremely good tools and structures, to allow you to diagnose a problem such as you are seeing. And it even contains tools - such as DELTA - to allow you to 'fix' such a problem with minimal risk and impact, if you - or someone else - has the required OpenVMS internals knowledge to analyze and diagnose the extent of the problem in the running system.

Setting the process priority to 0 limits the impact of this problem on your system and gives you time for diagnosis.

If there is an outstanding IO, one needs to find the IRP (IO Request Packet) somewhere in pool, decode the function bits and find out what might have happened. If this IO is not in any pending lists in the system, it may be possible to clean up the process-related IO data structures and allow this process to successfully run down. I have done this before and I've shown an example in a DECUS presentation some years ago. So it is possible after thorough analysis, but not by running some 'magic tool'.

Volker.

Rob Young_4 · ‎03-07-2007

> If there is an outstanding IO, one needs to find the IRP (IO Request Packet) somewhere in pool, decode the function bits and find out what might have happened. If this IO is not in any pending lists in the system, it may be possible to clean up the process-related IO data structures and allow this process to successfully run down. I have done this before and I've shown an example in a DECUS presentation some years ago. So it is possible after thorough analysis, but not by running some 'magic tool'.

If you are willing,
Let's do this exercise here.
If nothing else it helps others.
Where to start?

Rob

Rob Young_4 · ‎03-07-2007

> There are certainly advantages to VMs, but VMs are not a panacea. Just as processes can get messed up due to bugs, a VM can get messed up. And if a VM gets messed up or itself needs some sort of ECO or maintenance or just a reboot, there's a whole lot more affected.

I had about a 45 minute discussion (should have
been 10) with someone who works with VMware
day to day. Granted, ESX Server can get wedged, but it is a rare thing. VMware's
attractiveness and explosive growth are due to ROI,
which is very easy to prove (many Windows servers
are spinning at 10% CPU on average). It took
me a good 20 minutes to try to get the guy
to understand how exactly it could help me,
as ROI isn't a concern when you're banging
the CPUs.

Something like this:

Take a physical server, make 4, 8 guest
OSes. Now when I have a run-away process,
load balance everyone off that box. Perhaps
use VMware to tell that OS it is only getting
2% of a physical CPU so the process could
spin to its heart's content. At some future
date (way in the future if need be, or during
the day, whatever) I reboot the problem OS.

Galaxy isn't an option. Fast-forward...
hypervisors aren't so bad now (they can do
things like limit a guest OS to 2% of a physical
CPU). With their 4-5% overhead, hypervisors
aren't a nuisance with CPUs gaining speed
so quickly. Not unlike the whole TOE
for iSCSI gotcha... lost in the noise.
No more TOE discussions!

Of course my pissy co-worker says "but when 10 gig
comes along, it becomes more of an issue."
and round and round we go.

Rob

atul sardana · ‎03-07-2007

Rob,

Don't waste time; reboot the system. Your work is appreciated, but without a reboot this process cannot be killed.
This is a bug, and HP tries to provide fixes for such bugs from time to time so that things work better and more smoothly.

thanks
Atul sardana

I love VMS

Volker Halle · ‎03-08-2007

Rob,

to find the IRP for the pending IO operation on DSA1300:, you need to start as follows:

$ ANAL/SYS
SDA> READ SYSDEF
SDA> SHOW DEV DSA1300:
I/O data structures
-------------------
DSA1300 RZxx UCB: 8xxxxxxx
...

Search for this address in nonpaged pool (replace 8xxxxxxx with the real UCB address of DSA1300):

SDA> SEARCH @mmg$gl_npagedyn:@MMG$GL_NPAGNEXT 8xxxxxxx
...
Match at FFFFFFFF.8yyyyyyy 8xxxxxxx
...

For every match found, issue the following command:

SDA> EXA 8yyyyyyy-irp$l_ucb;10

and post the data found. We will be looking for an IRP in nonpaged pool, which points back to the UCB of DSA1300. The 3rd longword in this IRP should contain: 000A0240

Note that this could be a lengthy process through this forum. And it's not guaranteed to be successful. But as long as you need or want to keep this system up with the looping process, we can continue troubleshooting...
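
For readers who want to see the logic of the search outside SDA, here is a small Python sketch that works on a made-up byte buffer rather than real nonpaged pool. It scans for longwords equal to the UCB address, backs up by the offset of the UCB field to get the candidate packet start, and keeps only candidates whose third longword is 000A0240. The offset value `IRP_L_UCB_OFFSET`, the function name, and all addresses are hypothetical; on a real system the offset comes from the IRP$L_UCB symbol after READ SYSDEF.

```python
import struct

IRP_L_UCB_OFFSET = 0x1C     # hypothetical stand-in for the IRP$L_UCB symbol value
EXPECTED_LW3 = 0x000A0240   # third longword of the IRP, as given in the post above


def find_irp_candidates(pool: bytes, base: int, ucb: int) -> list[int]:
    """Return start addresses of blocks that point at `ucb` and look like an IRP."""
    hits = []
    for off in range(0, len(pool) - 3, 4):            # longword-aligned scan
        (value,) = struct.unpack_from("<I", pool, off)
        if value != ucb:
            continue
        start = off - IRP_L_UCB_OFFSET                # back up to the packet start
        if start < 0 or start + 12 > len(pool):
            continue
        (lw3,) = struct.unpack_from("<I", pool, start + 8)
        if lw3 == EXPECTED_LW3:
            hits.append(base + start)
    return hits


# Tiny fake pool: one IRP-like block at offset 0x40 plus one stray match.
pool = bytearray(0x100)
ucb = 0x81234560                                      # made-up UCB address
struct.pack_into("<I", pool, 0x40 + 8, EXPECTED_LW3)
struct.pack_into("<I", pool, 0x40 + IRP_L_UCB_OFFSET, ucb)
struct.pack_into("<I", pool, 0xC0, ucb)               # match that is not inside an IRP
print([hex(a) for a in find_irp_candidates(bytes(pool), 0x80000000, ucb)])
# ['0x80000040']
```

The stray match at 0xC0 is dropped because the longword where the size/type field would sit does not match, which is the same filtering you would do by eye on the EXAMINE output.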

Volker.

Ian Miller. · ‎03-08-2007

Is there a chance there is a pointer to the IRP from the UCB? If so, looking at pointers in the UCB could shorten the search.

____________________
Purely Personal Opinion

Robert Brooks_1 · ‎03-09-2007

> Is there a chance there is a pointer to the IRP from the UCB? If so, looking at pointers in the UCB could shorten the search.

Of course, there is UCB$L_IRP, as well as UCB$L_IOQFL, which may or may not be relevant.

For a SCSI disk device using DKDRIVER (a driver that can handle more than one I/O at a time), UCB$L_IRP is explicitly set (whilst holding the correct synchronisation -- typically the fork lock of the UCB) during I/O setup and completion. If there is more than one I/O in the driver at a time, UCB$L_IRP can point to any of them, so the odds of it pointing to the "lost" I/O for a busy device are slim.

-- Rob (who used to spend a fair amount of time worrying about these things)

Hoff · ‎03-09-2007

The IRPs I've chased around pool are usually well and truly lost.

I've also seen cases where there is no lost IRP, but rather a counter increment-decrement sequence that got entangled, with a decrement lost. This case can arise in driver error handling and related quota processing.

I've been known to brute-force this correction, and bomb the count into a semblance of correctness. This mid-flight kernel correction may or may not have a beneficial effect toward the eventual achievement of long-term system stability, as they say. The IRP itself is lost.
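
As a toy illustration of the lost-decrement case (plain Python, not OpenVMS kernel code), the sketch below keeps an outstanding-I/O counter that goes up when a request starts and down when it finishes. The class and method names are invented, and the rule that rundown waits for the counter to reach zero is a simplifying assumption.

```python
class IoAccounting:
    """Toy per-process I/O accounting; a stand-in, not a real kernel structure."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.outstanding = 0

    def start_io(self) -> None:
        self.outstanding += 1          # increment when the request is queued

    def finish_io(self) -> None:
        self.outstanding -= 1          # decrement when the request completes

    def display(self) -> str:
        return f"{self.limit - self.outstanding}/{self.limit}"

    def can_run_down(self) -> bool:
        # Assumed rule: deletion proceeds only when nothing is outstanding.
        return self.outstanding == 0


acct = IoAccounting(limit=150)
acct.start_io()
acct.finish_io()                       # normal path: balanced
acct.start_io()                        # error path: the matching decrement never happens
print(acct.display(), acct.can_run_down())   # 149/150 False

acct.outstanding = 0                   # brute-force correction of the count
print(acct.display(), acct.can_run_down())   # 150/150 True
```

Patching the count lets the toy process run down, but nothing in the model recovers the request itself, which is the point about the IRP still being lost.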

If these IRPs are disappearing with any sort of regularity, it's time to have a careful look at the kernel or driver or other related code involved.

If it's not your code, check first for ECOs, then contact the vendor.
