Re: Did some testing of pagefile usage under 7.3

Wim Van den Wyngaert · ‎06-06-2005

Not a question, just a report.

I wrote a program that mallocs 200 MB of memory and makes it dirty by filling it with '0'.

I ran it on an AS500 with 256 MB memory and a pagefile of 62469 pages (= 1.000.000 pagelets of 512 bytes = 512 MB). I ran it one at a time, and when a run succeeded it stayed alive with the memory allocated. Then I started the next.
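
As a quick sanity check of those numbers, here is a small sketch of the arithmetic. It assumes 8 KB Alpha pages and 512-byte pagelets (disk blocks); neither size is stated in the post, and the function names are made up for illustration.

```python
# Assumed sizes for an Alpha system (not stated in the post):
PAGE_BYTES = 8192      # one Alpha page
PAGELET_BYTES = 512    # one pagelet, i.e. one disk block


def pages_to_pagelets(pages):
    """Convert Alpha pages to 512-byte pagelets."""
    return pages * PAGE_BYTES // PAGELET_BYTES


def pagelets_to_mb(pagelets):
    """Convert pagelets to decimal megabytes (10**6 bytes)."""
    return pagelets * PAGELET_BYTES / 1_000_000


pagelets = pages_to_pagelets(62469)
print(pagelets)                          # 999504, i.e. about 1.000.000
print(round(pagelets_to_mb(pagelets)))   # 512
```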

Each time, the committed pagefile usage as seen with sh mem/fi increased by 200 MB. The free space also decreased by 200 MB, but only after physical memory was exhausted.

Before the 4th run, only 972 pages were free. The malloc of 200 MB succeeded and the making dirty started.

The system hung for about 30 minutes, without a single error message.

I investigated with AMDS and found all processes trying to do something in state FPG (waiting for free pages).

After 30 minutes, the workstation was still not dead but it did operations very very very slowly. Control T took about 15 minutes to complete.

I killed 1 process with AMDS and after 30 seconds of high swapper activity the station was running again. But not normally. Several processes were regularly in MWAIT and after 15 minutes DECW was still not reacting. I killed a 2nd process (also one with 200 MB allocated) and only then was normal activity restored.

Nothing special in operator log file.

I find it strange that killing 1 process didn't solve the memory problem.

I have no "playstation" running 6.2, but a quick test revealed the same behaviour under 6.2, although the system seemed less dead than under 7.3.

Wim

Antoniov. · ‎06-06-2005

Hi Wim,
very interesting!
You have 256+512 MB of virtual memory.
After 3 running processes, each using 200 MB, you have allocated 600 of about 750 MB; so you see 972 pages free, and this may be normal.
You run a 4th process and it tries to allocate more memory than is available.
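
The reasoning above can be written out as a rough sketch. It deliberately ignores the memory used by the operating system and by every other process, so the real limit is lower than this model suggests; the constants come from Wim's report, while the function names are hypothetical.

```python
# Figures taken from the report; everything else is a simplification.
PHYS_MB = 256          # physical memory
PAGEFILE_MB = 512      # page file size
PER_PROCESS_MB = 200   # memory dirtied by each test program


def backing_store_mb():
    """Total memory that can hold dirty pages: RAM plus page file."""
    return PHYS_MB + PAGEFILE_MB


def fits(process_count):
    """True if that many test programs can all keep their pages dirty."""
    return process_count * PER_PROCESS_MB <= backing_store_mb()


print(backing_store_mb())  # 768
print(fits(3))             # True  (600 MB needed)
print(fits(4))             # False (800 MB needed)
```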

It's very strange that your process received a successful allocation (of memory it then made dirty)!!
It's even stranger, and dangerous, that killing the 1st process can't free memory.
How long did you wait after the 1st kill? Maybe the system can resume normal activity after 1 minute? I remember that on some old VAX in a similar situation, the machine resumed normal activity after about 1 minute.

Antonio Vigliotti

Antonio Maria Vigliotti

Marc Van den Broeck · ‎06-06-2005

Wim,

as far as I recall it has always been that way (for at least 20 years).
But I used to see (opcom?) messages on the console such as:
Free page file size low!
Free page file size critical!

and then the system entered a 'hung' state from which you had to reboot or you had to free memory by killing/ending processes and being very patient.

Rgds
Marc

Wim Van den Wyngaert · ‎06-06-2005

Further testing.

Started a program when pagefile was nearly full. Before the malloc, I created a 2nd pagefile. The dirty memory was written to both pagefiles. Idem but pagefile added between malloc and making memory dirty : same reaction. Excellent.

Remark : a reboot is now needed to remove a pagefile since all processes directly start allocating pages in it.

Remark 2 : unbelievable how long it takes to kill processes when memory is unavailable (RWMPB, page writer busy).

Wim

Wim Van den Wyngaert · ‎06-06-2005

Antonio,

Killing a process takes at least 5 minutes on my AlphaStation 500. Remember that 200 MB of pagefile must be erased, but that still doesn't explain why it takes that long. Maybe because at the same time it has to reorganize memory assigned to other processes?

Marc,

Not a single message ...

Wim

Wim Van den Wyngaert · ‎06-06-2005

For those who are interested in how Linux reacts : http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=855513 somewhere at the end.

Wim

Antoniov. · ‎06-06-2005

Wim,
don't forget that your memory is also used by the cache, and XFC works differently from the cache in previous VMS versions (like V6.2). How can the cache interfere with the pagefile?

Antonio Vigliotti

Antonio Maria Vigliotti

Wim Van den Wyngaert · ‎06-06-2005

Tried to add a 2nd pagefile while the pagefile was full and a process was trying to put dirt in the pages.

The command hadn't completed after 10 minutes. I killed (with AMDS) a smaller user process that consumed 10000 pages (of 512 bytes). It took 2 minutes to remove the process. Pagefile still not added.

Killed a process holding 4000 pages. Removed after 30 seconds but pagefile still not added. Killed a 3rd with 4000 pages. Pagefile finally added. And used. Excellent except for the time to do it.

Wim

Uwe Zessin · ‎06-06-2005

Antonio,
a process can request memory against its own pagefile quota. I have never heard that the system checks against what is still free in the pagefile(s).

It's called overcommitment. Else, you would have to create enough pagefile space for _each_ process' pagefile quota.
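
A toy model of that behaviour, as a sketch only: reservations are checked against the process's own quota, and the free backing store is consulted only when pages are actually dirtied. The classes, names and numbers below are invented for illustration and are not VMS interfaces. In the real system the last process stalls waiting for free pages instead of getting a short count back.

```python
class ToySystem:
    """Holds the pages (RAM plus page file) that can back dirty memory."""

    def __init__(self, backing_pages):
        self.free_pages = backing_pages


class ToyProcess:
    def __init__(self, system, pgflquota):
        self.system = system
        self.quota_left = pgflquota   # per-process page file quota
        self.reserved = 0             # pages promised to the process
        self.touched = 0              # pages actually dirtied

    def reserve(self, pages):
        """Charge the quota only; free backing store is NOT consulted."""
        if pages > self.quota_left:
            return False
        self.quota_left -= pages
        self.reserved += pages
        return True

    def touch(self, pages):
        """Dirty reserved pages; return how many found backing store."""
        want = min(pages, self.reserved - self.touched)
        got = min(want, self.system.free_pages)
        self.system.free_pages -= got
        self.touched += got
        return got


system = ToySystem(backing_pages=768)
procs = [ToyProcess(system, pgflquota=250) for _ in range(4)]
print([p.reserve(200) for p in procs])  # [True, True, True, True]
print([p.touch(200) for p in procs])    # [200, 200, 200, 168]
```

All four reservations succeed because each stays within its own quota, even though together they exceed what the system can back.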

.

Wim Van den Wyngaert · ‎06-06-2005

Once the system was back to normal : removing a process that had 200 MB of dirty pages took just a second (a short burst of disk activity was noticed).

While system had exactly 0 pages free but no dirty program was busy : any activity caused lots of mwaits (rwmpb) and the session no longer was usable. Had to delete 7 smaller processes before going back in operation. And still processes in mwait.

Put my prio on 20 while pagefile was full. Reacted faster. I stopped a process and it was directly gone (??? why that fast this time ??).

Wim

Wim Van den Wyngaert · ‎06-06-2005

FWIW : just found out that malloc only adjusts your quotas and doesn't use the physical memory directly. A malloc of 200 MB increased physical memory usage by 0.5 MB.

Wim

Antoniov. · ‎06-06-2005

I guess that with prio = 20 your process became a real-time process, and it is no longer subject to round-robin scheduling.

Antonio Vigliotti

Antonio Maria Vigliotti

Antoniov. · ‎06-06-2005

"a process can request memory against its own pagefile quota. I have never heard that the system checks against what is still free in the pagefile(s)."

Agree.
The 4th process tries to allocate more memory than is available, so the OS suspends it.
Wim killed 2 processes to get the system to react. I thought there was a threshold on the pagefile for reactivating MWAIT processes.

Antonio Vigliotti

Antonio Maria Vigliotti

Marc Van den Broeck · ‎06-06-2005

Antonio,

I don't think the process is suspended. As long as the program does not actually use the memory, VMS does not know about it (as Wim's observations prove, only 0.5 MB added).
But what happens is that the system gets into a near hang when memory is used beyond the pagefile limit.

Rgds
Marc

Uwe Zessin · ‎06-06-2005

That's right. SUSP is a voluntary wait state (well, not if it is forced from another process ;-), but the only process I am aware of that SUSPs other processes is AUDIT_SERVER.

If you have used up your pagefile quota, then the next request will fail (you do check the return from malloc()? ) with an error status, but the process is not put into a wait state.

.

Wim Van den Wyngaert · ‎06-06-2005

I share Uwe's experience. That's why I do "set audit/excl" for e.g. my monitoring process. A watchdog is of no use if it gets suspended when there are problems.

Wim

John Gillings · ‎06-07-2005

re: Wim, "Not a single message ..."

The system almost certainly DID write messages, but they go directly to OPA0, not to OPERATOR.LOG. But since you're running this on a workstation, OPA0 messages are usually lost. OPA0 I/Os don't cost any extra memory because the console driver is resident, but I/Os to the log file do. You don't want to exacerbate the problem by trying to report it!

Page file allocation is done exponentially, that is, we attempt to allocate one page file cluster (PFC); if that fails, we try half, then half again, and again, until we get down to single blocks. On the next allocation attempt we start back at PFC again. So, when the page file starts to get full (& fragmented), allocations take a LOOONG time because the attempts to get large blocks have to fail on each request. (You got a problem with that? Here's a nickel, go buy yourself a few more GB of page file space!)
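
The halving strategy described above can be sketched as follows. This is a simplified model, not the actual VMS code: the whole page file is reduced to a single number (the largest contiguous free run), and the cluster size of 64 blocks is just an assumed value for the demonstration.

```python
def allocate(largest_free_run, pfc):
    """Try to get `pfc` contiguous blocks, halving the request on failure.

    `largest_free_run` stands in for the biggest contiguous free extent
    in the page file. Returns (blocks_obtained, failed_attempts).
    Every call starts again at the full cluster size.
    """
    size = pfc
    failures = 0
    while size >= 1:
        if size <= largest_free_run:
            return size, failures
        failures += 1      # this request size could not be satisfied
        size //= 2         # try half as much next
    return 0, failures     # not even a single block was free


PFC = 64  # assumed cluster size, for illustration only
print(allocate(64, PFC))  # (64, 0): a healthy page file, first try works
print(allocate(5, PFC))   # (4, 4): fragmented, four failures first
print(allocate(1, PFC))   # (1, 6): down to single blocks
```

In a fragmented file every request pays for the full run of failed attempts again, which matches the slowdown described above.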

When the pagefile reaches a point where allocations are taking longer than "normal", the message:

PAGEFRAG, page file is badly fragmented, system continuing

is written to the console, OPA0. If the allocation attempts reach single blocks, the message:

PAGECRIT, page file space critical, system trying to continue

is written.

If you've missed the messages, you can test if they've ever been issued by examining the system global cell EXE$GL_FLAGS in SDA (or in a crash dump). Bit 20 is set when the PAGEFRAG message is issued and bit 21 when PAGECRIT is issued. Left as an exercise to write some DCL code to test the bits. EXE$GL_FLAGS can be read from user mode.
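
Not the DCL exercise itself, but the bit test can be sketched like this. The code only decodes a value you have already read from EXE$GL_FLAGS (for example in SDA); it does not read the cell. The bit numbers are the ones given above, and the sample values are made up.

```python
PAGEFRAG_BIT = 20  # set once the PAGEFRAG message has been issued
PAGECRIT_BIT = 21  # set once the PAGECRIT message has been issued


def pagefile_warnings(flags):
    """Report which page file messages a flags value says were issued."""
    return {
        "PAGEFRAG": bool((flags >> PAGEFRAG_BIT) & 1),
        "PAGECRIT": bool((flags >> PAGECRIT_BIT) & 1),
    }


# Hypothetical flag values, for illustration only:
print(pagefile_warnings(0x00000000))  # neither message issued
print(pagefile_warnings(0x00100000))  # only PAGEFRAG
print(pagefile_warnings(0x00300000))  # both PAGEFRAG and PAGECRIT
```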

"Put my prio on 20 while pagefile was full. Reacted faster. "

Things get more interesting at priority 20. First, you're higher priority than the modified page writer (SWAPPER runs at 16), and because you're real time, you don't get any working set adjustment.

The bottom line here is that NO OpenVMS system should EVER see PAGEFRAG or PAGECRIT errors. It's simply bad economics to NOT give your system enough page file space that it never becomes a problem. Consider the cost of downtime, and the cost of the system manager's time - even the cost of the time of the people reading and writing this thread. You'd spend all that to save a few cents worth of disk space? This is a rare case where, even in a world run by accountants, common sense will prevail.

A crucible of informative mistakes

Marc Van den Broeck · ‎06-07-2005

Well, my messages were far from exact, but I am glad someone confirms they do occur.

Rgds
Marc

Wim Van den Wyngaert · ‎06-07-2005

John,

Thanks for the info. I never had the message on my servers because I get an alarm at 40% used. And even 40% is never reached.

But I wanted to know how VMS reacts in case something goes wrong in the application (already happened in 98).

Wim

Wim Van den Wyngaert · ‎06-07-2005

But I also tested on 6.2. This machine is connected to console manager.

Not a single message in operator.log and not in console manager extract (and it is connected !).

???

Wim

Antoniov. · ‎06-07-2005

"i dont think the process is suspended."

An MWAIT process is not suspended? That's merely the name of the state; when a process is in the MWAIT state it can't do any work.

"As long as the program does not actually use memory, vms does not know (as Wims observations prove, onlu 0.5 Mb added)."

I don't agree. The memory allocation process is more complex than simply using RAM or else the pagefile. When a process requires a large amount of memory, it is hibernated/suspended (not the same as the HIB/SUSP states) while the OS tries to provide the requested memory.
Read John's post.

Antonio Vigliotti

Antonio Maria Vigliotti

Uwe Zessin · ‎06-07-2005

It sounds like you have used suspension as a generic term, while others (including me) thought you were using OpenVMS' meaning of SUSP.

No, MWAIT is not the same as SUSPension when we are using OpenVMS terminology. HIB is a pure process voluntary wait state, SUSP can also be forced by another process.

According to the OpenVMS wizard, MWAIT is short for "Miscellaneous WAIT state" and this one comes from deep inside the OpenVMS Kernel.

http://h71000.www7.hp.com/wizard/wiz_5841.html

It appears that there is an inconsistency in 'SHOW PROCESS/CONTINUOUS', which is explained on that page.

.

Ian Miller. · ‎06-07-2005

The RWxxx states are all MWAIT: an involuntary miscellaneous wait for a resource. The scheduler treats them all the same.

____________________
Purely Personal Opinion

Wim Van den Wyngaert · ‎06-07-2005

Open points remain :

1) why didn't I get the opa messages in 6.2 (may be also in 7.3) ?

2) if there is free pagefile space before starting a process (and the system is functioning normally), why isn't normal functioning restored if I kill that process ?

Wim@home

Wim

Ian Miller. · ‎06-07-2005

messages about page file full are only output to the console and only once. There are flags (in MMG$GL_FLAGS or some such) which are set to say the message has been output and they are not cleared.

Do you know which MWAIT state the processes were in: RWMPB perhaps, or something else? How many pages were on the modified list?
Was there activity that kept the modified list up near MPW_WAITLIM?

____________________
Purely Personal Opinion
