Re: Java "not enough core"

Ben Armstrong · ‎03-28-2011

I am trying to run MSpec against JRuby 1.6.0 on an rx2600 running V8.3-1H1 with JAVA60 V1.6-2.

Unless I break up this large test suite into smaller runs of individual tests, I invariably end up with a stack trace that looks something like:

OpenVMS stack trace:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# %SYSTEM-F-ACCVIO (0xc) at pc=2E78C20, pid=564662602, tid=85815808
#
# JRE version: 6.0
# Java VM: Java HotSpot(TM) 64-Bit Server VM (14.3-b01 mixed mode -ia64 )
# Problematic frame:
# C [JAVA$JAVA_SHR+0x1670]
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid564662602.log

The resulting file, sys$scratch:hs_err_pid564662602.log, contains:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# %SYSTEM-F-ACCVIO (0xc) at pc=2E78C20, pid=564662602, tid=83898880
#
# JRE version: 6.0
# Java VM: Java HotSpot(TM) 64-Bit Server VM (14.3-b01 mixed mode -ia64 )
# Problematic frame:
# C [JAVA$JAVA_SHR+0x1670]
#
# Please report this error to HP customer support.
#

--------------- T H R E A D ---------------

Current thread (4EF2000): JavaThread "ScriptThreadProcess: -v" daemon [_thread_in_native, id=83898880, stack(4EFE000,5002000)]

siginfo:si_signo=(null): si_errno=not enough core, si_code=0

Top of Stack: (sp=0)
0:

I'm at a loss as to how to debug this. jmspec is defined as follows:

$ show sym jmspec
JMSPEC == "$ SYS$COMMON:[JAVA$60.bin]java$java "-Djruby.home=/dym$sys_a/dymax/jruby" "-Djruby.lib=/dym$sys_a/dymax/jruby/lib" "-Djruby.script=jruby" "-Djruby.memory.max=500m" "-Djruby.stack.max=1024k" "-Xmx500m" "-Xss1024k" "-Xbootclasspath/a:/dym$sys_a/dymax/jruby/lib/bsf.jar:/dym$sys_a/dymax/jruby/lib/jruby.jar" /dsa0/bg/mspec/bin/mspec"

I have tried altering:

the stack max, both for jruby and java, doubled, from 1024 -> 2048

the heap max, both for jruby and java, doubled, from 500m -> 1000m

my PGFLQUO, doubled, from 600000 to 1200000

my WSEXTENT, doubled, from 6000 to 12000

Nothing I do here seems to help. Can someone please give me some guidance?

Thanks,
Ben

Hoff · ‎03-28-2011

Given that what you've already tried are all of the usual suspects for Java (with the possible exception of increasing the system parameter WSMAX to allow for the increased working sets requested, if that's needed), ring up HP Support, as the diagnostic states.

Though depending on how big JRuby is here, you may need further increases.

Ben Armstrong · ‎03-28-2011

My WSMAX is 95616 pages, which appears to be enough. Thanks. Well, HP support it is, then. Just wanted to see if I overlooked something obvious.

Incidentally, the same test suite works fine with JAVA150 V1.5-6_P1 on a DS10/617 running VMS V8.3. How strange. Either some JAVA6-specific thing or an Itanium-specific thing.

Ben

P Muralidhar Kini · ‎03-28-2011

Hi Ben,

>> Thanks. Well, HP support it is, then
Connect to HP via the Office of OpenVMS programs interface.
You can send mail to OpenVMS.Programs@hp.com.

>> Incidentally, the same test suite works fine with JAVA150 V1.5-6_P1
>> on a DS10/617 running VMS V8.3. How strange.
>> Either some JAVA6-specific thing or an Itanium-specific thing.
I hope you have compared the parameter settings (PGFLQUO and others)
on the system that worked with those on the system that does not work.

Regards,
Murali

Let There Be Rock - AC/DC

Ben Armstrong · ‎03-28-2011

It's a mixed architecture cluster with shared UAF, so the account quotas are the same between the Alpha and Itanium test systems at least.

Ben

Jan van den Ende · ‎03-28-2011

Ben,

>>>
so the account quotas are the same between the Alpha and Itanium test systems at least.
<<<

Are those quotas the old Alpha quotas, or have they been updated to I64 levels? Doubling is fairly minimal; quadrupling is just about "equivalent".
And the larger quotas do no harm on Alpha (at least none that I know of).
hth

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Ben Armstrong · ‎03-28-2011

Quadrupled, as suggested, and no difference (even when I also quadruple the jruby & java maximums specified in the command as well). Indeed, the place where it dies is precisely where it was before.

...
/DSA0/BG/RUBYSPEC/core/file/chown_spec.rb
OpenVMS stack trace:
...

Curiously, chown_spec.rb contains only tests in an "as_superuser" clause which, when run (with or without elevated privileges) does nothing on this platform. The test case does not die when run in isolation.

Ben

P Muralidhar Kini · ‎03-28-2011

Hi Ben,

>> The test case does not die when run in isolation.
This is interesting.

When you run the entire test suite, is this the only test that fails?
That is, if you exclude just this particular test from the suite, does the rest of the suite pass, or does some other test further down the line fail?

Regards,
Murali

Let There Be Rock - AC/DC

Ian Miller. · ‎03-29-2011

You may find some useful information at

http://vouters.dyndns.org/tima/

____________________
Purely Personal Opinion

Ben Armstrong · ‎03-29-2011

Yes, thanks. We are aware of the work by Philippe Vouters and Thierry Uso on JRuby for OpenVMS and hope they succeed. We have kept on hand the prebuilt image of 1.5.3 that they produced and have also been testing against that version from time to time.

While I can't be certain yet that this is the same problem as is occurring in mspec, we have made a simple test program that fails with the same stack dump as indicated above. This fails on the 1.5.3 kit as well:

$ jruby -e "20000.times{|i| puts i; `DIR`}"

I am going to add some information to the bug we filed against JRuby some time ago as soon as I'm certain we have a test that anyone can reproduce. In particular, I would like to know which account quotas may have a bearing on this issue: late yesterday afternoon, after testing on one developer's account, we found that the test program did not fail on my own account (with increased quotas). However, the same test does fail, even with the increased quotas, using the 1.5.3 kit.

For reference purposes, here is that bug:

http://jira.codehaus.org/browse/JRUBY-3902

Ben

Ben Armstrong · ‎03-29-2011

Murali,

That's a good question. No, a considerable number of tests fail prior to this failure. In fact, they don't merely fail, but rather they have errors, which renders those test results invalid. It looks like mspec is unable to recover from those, as a single error is often followed by a whole stream of them (every test case is marked "E" thereafter to the end of the file in which the error occurs). Perhaps it is the accumulation of such errors that ultimately leads to the stack dump. We're not sure. In any case, in parallel with an inquiry to HP, we are continuing to investigate these errors to see if they can be resolved, and perhaps if they can be, the stack dump will also be solved as a by-product.

Ben

P Muralidhar Kini · ‎03-29-2011

Hi Ben,

>> Perhaps it is the accumulation of such errors that ultimately leads
>> to the stack dump. We're not sure.
This is what I had in mind too, given the observation that
a particular test fails when the test suite runs as a whole, but
does not fail when only that particular test is run.

>> No, a considerable number of tests fail prior to this failure
Maybe the first test (or first few tests) that failed holds a clue as to what the root cause might be.

>> In any case, in parallel with an inquiry to HP, we are continuing to investigate
>> these errors to see if they can be resolved, and perhaps if they can be,
>> the stack dump will also be solved as a by-product.
Good luck.
Also please do post the solution to this problem in this thread, so that we all know the root cause (& solution) for this problem.

Regards,
Murali

Let There Be Rock - AC/DC

Ben Armstrong · ‎03-29-2011

OK, this is very strange. I had a hard time narrowing down what quota differed between the failing case and the successful one, and after reducing all my quotas down to equal or lower than ones on the failing account, I still couldn't reproduce it, until I observed that some quotas on the failing account were actually higher, so I started bumping mine up, one at a time until ...

I discovered actually *increasing* my BYTLM from 40000 to 382000 to match his finally causes the test to fail! Would someone please explain why this parameter would be relevant here? I have some guesses, based on what I've read about it, but would really like an expert opinion. (Keep in mind that in my simple test case, the relevant construct is backticks, which is supposed to spawn a process and return any output from that process.)

Thanks,
Ben

Hein van den Heuvel · ‎03-29-2011

>> *increasing* my BYTLM from 40000 to 382000 to match his finally causes the test to fail!

Interesting. Can you watch bytlm during the test with a dedicated program, a DCL script, or simply with SHOW PROC/CONT ... Q ?

Does the issue occur 'right away', after minutes, or after a specific known test in a long series of tests?

Can you slow the problem down by sleeping every so-many iterations, or print a summary line every so often? just to better understand when this happens?

SPAWN uses bytlm to transfer symbols and logical names.

Do we know how the backticks are implemented? crtl-system call? call to Lib$spawn? concoction around SYS$CREPRC?

Since bytlm plays a role, you gotta think system services play a role.
How about trying to grab a log of those with SET PROC/SSLOG ?
It is a sledge-hammer approach, and you'll probably need some smarts (perl!) to weed through the details generated, but it could help pinpoint the root cause.
(I typically SPAWN before SET PROC/SSLOG as it has caused me to 'lose' processes with specific command ordering. )

hope this helps some,
Hein

Hoff · ‎03-29-2011

>...*increasing* my BYTLM from 40000 to 382000 to match his finally causes the test to fail! Would someone please explain why this parameter would be relevant here?

That reeks of a Java application bug somewhere, or of a Java VM bug.

On zero evidence, I'd look for a synchronization problem in the I/O processing, where the code was implicitly synchronizing or implicitly throttling itself by running out of buffer storage space and hitting the associated resource wait. Allowed to free run by a higher quota, the same code could then consume (other) process memory resources for use as buffers, and eventually crash.

See if (for instance) the pending I/O counts stored in the PCB spike when this thing tips over.

Whether this is a bug in JRuby or in the underpinnings is an open question.

Ben Armstrong · ‎03-31-2011

We delved into the source and found this problematic code in [.src.org.jruby.util]ShellLauncher.java:

private void verifyExecutableForShell() {
    String cmdline = rawArgs[0].toString().trim();
    if (doExecutableSearch && shouldVerifyPathExecutable(cmdline) && !cmdBuiltin) {
        verifyExecutable();
    }

    // now, prepare the exec args

    execArgs = new String[3];
    execArgs[0] = shell;
    execArgs[1] = shell.endsWith("sh") ? "-c" : "/c";

    if (Platform.IS_WINDOWS) {
        // that's how MRI does it too
        execArgs[2] = "\"" + cmdline + "\"";
    } else {
        execArgs[2] = cmdline;
    }
}

It turns out that "shell" is ultimately set in [.src.org.jruby.libraries]RbConfigLibrary.java by:

// TODO: note lack of command.com support for Win 9x...
public static String jrubyShell() {
    return SafePropertyAccessor.getProperty("jruby.shell", Platform.IS_WINDOWS ? "cmd.exe" : "/bin/sh").replace('\\', '/');
}

Of course, this is not Windows, so it's assuming /bin/sh here, and then appending "-c", followed by the arguments. Fortunately, this can be worked around by replacing the shell with something else, e.g. set:

"-Djruby.shell=/path_to/sh.exe"

After writing a very simple shell in C++ that does nothing but drop the spurious "/c" (because sh.exe doesn't end with "sh", the windows switch form is used here) and pass the arguments to lib$do_command, and defining the jruby.shell as indicated above, the stack dumps have ceased. Here is that code:

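#include <string>          // std::string
#include <string.h>        // strlen
#include <strings.h>       // strncasecmp
#include <descrip.h>       // struct dsc$descriptor_s, DSC$K_* constants
#include <lib$routines.h>  // lib$do_command

using namespace std;

int main(int argc, char **argv)
{
    // Rebuild the command line, dropping the spurious "/c" switch.
    string args="";
    for(int i=1; i < argc; i++) {
        if(strncasecmp(argv[i], "/c", strlen(argv[i]))!=0 )
            args = args + string(" ") + argv[i];
    }

    char *str = const_cast<char *>(args.c_str());
    static unsigned long int r0_status;

    struct dsc$descriptor_s str_d =
        {strlen(str), DSC$K_DTYPE_T, DSC$K_CLASS_S, str };

    // On success this does not return: the image exits and DCL runs the command.
    r0_status = lib$do_command(&str_d);

    return 0;
}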
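
The switch selection described above can be checked in isolation. What follows is only a minimal Java sketch, not JRuby's actual class: `ShellArgsSketch`, `buildExecArgs` and `describe` are hypothetical names, and the `isWindows` parameter is an assumption standing in for `Platform.IS_WINDOWS`. It mirrors just the assignments quoted from ShellLauncher.java and shows why a shell named sh.exe is handed "/c" while /bin/sh is handed "-c".

```java
import java.util.Arrays;

// Hypothetical stand-in for the argument setup quoted from ShellLauncher.java.
final class ShellArgsSketch {
    // Builds {shell, switch, command} the same way the quoted code does.
    // isWindows is an assumption standing in for Platform.IS_WINDOWS.
    static String[] buildExecArgs(String shell, String cmdline, boolean isWindows) {
        String[] execArgs = new String[3];
        execArgs[0] = shell;
        // Only a shell name ending in "sh" gets the Unix-style switch.
        execArgs[1] = shell.endsWith("sh") ? "-c" : "/c";
        execArgs[2] = isWindows ? "\"" + cmdline + "\"" : cmdline;
        return execArgs;
    }

    // Convenience for printing what a non-Windows platform would launch.
    static String describe(String shell, String cmdline) {
        return Arrays.toString(buildExecArgs(shell, cmdline, false));
    }
}
```

For example, `ShellArgsSketch.describe("/path_to/sh.exe", "show time")` gives `[/path_to/sh.exe, /c, show time]`, whereas the default `/bin/sh` gives `[/bin/sh, -c, show time]`.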

Our only complaint is that this is slow: 40 iterations of backticks executing a simple DCL command take three times as long as, say, a DCL procedure that spawns sh.exe 40 times. Is there any reason why things go so much slower in Java? Or is this likely to be a JRuby-specific issue? (I guess I should write a wrapper to shell out in Java, stripping away all of the JRuby stuff; btw, is there any good doc on how to call system services and RTL from Java?)

Here's a simple ruby test:

100.times{|_|puts `show time`; puts _}

On our rx2600 this takes 10 seconds, compared to 3 seconds for the DCL equivalent:

$ index = 0
$loop:
$ spawn/nolog sh "show time"
$ index = index + 1
$ write sys$output index
$ if index .eq. 100 then exit
$ goto loop
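
One way to separate plain JVM process-launch cost from JRuby overhead, as mused above, might be a bare Java timing loop. This is a sketch under assumptions: `SpawnTimerSketch` and `timeLaunches` are hypothetical names, the command to run is supplied by the caller, and nothing here has been tried on OpenVMS. It uses only the standard `ProcessBuilder` API, drains each child's output so the child cannot stall on a full pipe, and returns the total elapsed time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical timing harness: launches a command repeatedly with no JRuby involved.
final class SpawnTimerSketch {
    // Runs the command `iterations` times, draining its output each time,
    // and returns the total elapsed wall-clock time in nanoseconds.
    static long timeLaunches(List<String> command, int iterations) {
        byte[] buf = new byte[4096];
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            try {
                ProcessBuilder pb = new ProcessBuilder(command);
                pb.redirectErrorStream(true);   // merge stderr into stdout
                Process p = pb.start();
                p.getOutputStream().close();    // child sees EOF on its stdin
                try (InputStream in = p.getInputStream()) {
                    while (in.read(buf) != -1) {
                        // discard output so the child never blocks on a full pipe
                    }
                }
                p.waitFor();
            } catch (IOException e) {
                throw new RuntimeException("launch failed: " + command, e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException("interrupted while waiting", e);
            }
        }
        return System.nanoTime() - start;
    }
}
```

A call such as `SpawnTimerSketch.timeLaunches(Arrays.asList("/path_to/sh.exe", "/c", "show time"), 100)` would then give a figure to set beside the Ruby and DCL timings; that argument list is an assumption modelled on what JRuby appears to pass.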

Ben

Hoff · ‎03-31-2011

>Any reason why things go so much slower in Java?

Must. resist. zinger. :-)

That lib$do_command is an image teardown and a restart, so that's not going to be all that speedy. (And Unix does process creation operations vastly faster than VMS.)

And this:

string args="";
for(int i=1; i < argc; i++) {
if(strncasecmp(argv[i], "/c", strlen(argv[i]))!=0 )
args = args + string(" ") + argv[i];
}

If the argv strings are long or if there are a number of /c tokens here, that for-loop should be replaced with a couple of pointers in a for-loop that shuffle through the whole string, looking at the current character (as /) and then peeking at the next (as c) and compressing the string, rather than repeatedly searching the front part of that string. (And you can save off the length there, since you'll have it, rather than adding a strlen to fetch it again.)

Now as for where the wallclock is going, profile what you can.

And the difference between 10 seconds and 3 seconds doesn't look all that bad, given the volume of baggage here. And IIRC, DCL is likely running that particular command out of the CLI itself rather than an image activation, so you have 100 image activations and DCL doesn't.

Hein van den Heuvel · ‎03-31-2011

[arghhh, ITRC not behaving (or is it my internet connection?) ]

>> Any reason why things go so much slower in Java?

Image activations, for that 'shell'.

In DCL, SHOW TIME is a native command; no image will be run.

You can easily verify that using a USER mode logical.
SHOW LOGICAL is an image and will 'eat' the logical.
SHOW TIME, SHOW DEFAULT and more are not and the logical will survive the command.

See below.
Hein

$ define/user test blah
$ show log test
"TEST" = "BLAH" (LNM$PROCESS_TABLE)
$ show log test
%SHOW-S-NOTRAN, no translation for logical name TEST
$ define/user test blah
$ show time
31-MAR-2011 08:54:28
$ show log test
"TEST" = "BLAH" (LNM$PROCESS_TABLE)
$ show log test
%SHOW-S-NOTRAN, no translation for logical name TEST

Ben Armstrong · ‎03-31-2011

Did you guys miss the fact that I'm running the "SHOW TIME" through "SH", which is my sh.cpp that I pasted earlier to use as jruby.shell? So that's always an image activation right there. Now, maybe there still are image activations that I'm missing, but I tried to be quite careful to make the two tests equivalent ...

i.e.

$ sh=="$path_to:sh.exe"

Ben Armstrong · ‎03-31-2011

In any case, optimizing this is probably premature, and particularly in the case of tightening up the argument parsing loop, I don't think it will pay off much. I need to move on and figure out how to call system services/RTL routines from Java, and so forth. Guess I have some reading to do.

Thanks for all of your answers!

Ben

Ian Miller. · ‎04-01-2011

Ben,
you may find this of interest

http://vouters.dyndns.org/tima/OpenVMS-IA64-Java-JNA-libffi-Porting_JNA_to_OpenVMS_Itanium_servers.html

____________________
Purely Personal Opinion

Craig A Berry · ‎04-02-2011

Since increasing BYTLM adds to the troubles, consider that spawning a subprocess from a parent with a very large BYTLM gives you two processes with very large BYTLM, which I believe counts against NPAGEDYN. Do SHOW MEM/POOL/FULL to see if you're running out of a finite system resource rather than a process quota.

I would try to get away from the lib$do_command approach. Perhaps simply replacing the lines

execArgs[0] = shell;
execArgs[1] = shell.endsWith("sh") ? "-c" : "/c";

with

execArgs[0] = "";
execArgs[1] = "";

would do the trick. I'm guessing that Ruby's exec() is implemented in terms of Java's exec, which is probably implemented in terms of the CRTL exec() unless the Java folks rolled their own for some reason. These implementations are likely to already do a reasonable job of the heuristics necessary to determine if the thing to be run is a program that can be run directly with SYS$CREPRC or is a DCL command that needs to have LOGINOUT dragged in.

Or maybe they drag LOGINOUT in regardless, which is sort of the moral equivalent of what Ruby is trying to do by specifying a shell. Which is a long-winded way of saying, you really don't need or want to specify a Unix shell as part of the command on VMS unless the thing you are running is a Unix shell script.
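
The idea of leaving the shell out can be sketched as a small argument builder. This is a hedged illustration only: `DirectExecSketch` and `buildArgs` are hypothetical names, it is not a patch to JRuby, and it is a variant of the suggestion above in that it omits the shell elements altogether instead of passing empty strings. The whitespace split is deliberately naive and is an assumption about how a simple command would be tokenized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds a launch argument list with or without a shell prefix.
final class DirectExecSketch {
    // With a shell, returns [shell, switch, cmdline] as in the code quoted earlier in the thread.
    // With a null or empty shell, returns the command's own tokens for a direct launch.
    static List<String> buildArgs(String shell, String cmdline) {
        String trimmed = cmdline.trim();
        List<String> args = new ArrayList<>();
        if (shell == null || shell.isEmpty()) {
            if (!trimmed.isEmpty()) {
                // Naive whitespace split: quoted arguments are NOT handled.
                args.addAll(Arrays.asList(trimmed.split("\\s+")));
            }
            return args;
        }
        args.add(shell);
        args.add(shell.endsWith("sh") ? "-c" : "/c");
        args.add(trimmed);
        return args;
    }
}
```

The resulting list could be handed to `new ProcessBuilder(args)`; whether the Java runtime on VMS then runs a DCL command such as `show time` correctly without a shell is exactly the open question raised above.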
