Re: How to optimize subroutine placement in DCL

Wim Van den Wyngaert · ‎03-12-2008

Test1 : 10000 gosub x. x is directly after the gosub code. Result : 3933 direct IO's.

Test2 : same, but x is 4200 lines of DCL further down the script. Result : 23947 IO's. CPU went up by 15% compared with test 1.

It didn't read the DCL procedure 10000 times. What is DCL's algorithm for caching / reading / searching ?

Wim


Wim Van den Wyngaert · ‎03-12-2008

x is directly after the gosub code.

must be
x is directly after the gosub call.

Wim


John Reagan · ‎03-12-2008

Did those IOs turn into real disk reads or did XFC have them in cache?

Jess Goodman · ‎03-12-2008

I'm not certain exactly what your question is, but I can confirm that in a very long DCL command procedure there is a very significant performance advantage to putting a GOSUB routine close to the corresponding GOSUB statements, even if you have to branch around the GOSUB routine with an ugly "GOTO label" before it and "label:" after it.

If the GOSUB routine is at the bottom of a long command procedure, DCL apparently re-reads the entire procedure with every GOSUB statement in order to find the GOSUB label. My guess is that DCL only caches x number of statement labels.

I have one, but it's personal.

Wim Van den Wyngaert · ‎03-12-2008

The IO was satisfied by VIOC. Will test on Monday with XFC.

Wim


John Gillings · ‎03-12-2008

Wim,

Stating the bleeding obvious...

If you want optimized code, use an optimizing compiler, NOT DCL.

DCL has some syntactic peculiarities that mean it isn't possible to just record the location of every label for easy jumping. Nature of the beast. Ever seen a dynamic label? It's perfectly legal syntax and works "mostly" as you'd expect. Also consider IF THEN ELSE scoping, SUBROUTINES, forward and backward references and potential duplicates. Again, this is the nature of an unstructured, untyped, interpreted language.

IMHO 4000 lines of DCL in a single procedure is insane. Restructure it into multiple procedures for better performance, and easier debugging (simple experiment, find an ENDIF statement after the first 100 or so lines, remove the leading "$" and see how long it takes your best DCL programmer to find the bug).

DCL is a fine language for some things, but when you find yourself banging up against its inherent limitations, it's time to reimplement your program in a more appropriate language.

A crucible of informative mistakes

Hein van den Heuvel · ‎03-12-2008

>> What is DCL's algorithm for caching / reading / searching ?

DCL remembers RMS RFA's for labels, so it can tell RMS to 'jump' to the right spot.

DCL uses RMS in Process IO context with a single buffer of an OpenVMS version dependent size. In my case (Alpha 8.3) that was 4096 bytes.

If the calling line and the ENTIRE call/gosub code executed are not in the same buffer, then read IOs will happen.
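
As a rough illustration of that single-buffer behaviour, here is a minimal Python sketch. It assumes one buffer of 8 blocks that always holds an aligned run of VBNs (1-8, 9-16, ...), and that every executed line outside the current run costs exactly one read. The function names and the callee position (VBN 240) are made up for the example; this is a model, not DCL's or RMS's actual code.

```python
def count_reads(line_vbns, blocks_per_buffer=8):
    """Count read IOs for a sequence of executed lines, given the VBN of each line.

    Assumption: a single buffer holding one aligned run of blocks at a time;
    touching a line outside that run reloads the buffer (one read).
    """
    reads = 0
    current_run = None  # index of the aligned run currently in the buffer
    for vbn in line_vbns:
        run = (vbn - 1) // blocks_per_buffer  # VBNs are 1-based
        if run != current_run:
            reads += 1
            current_run = run
    return reads


def gosub_trace(caller_vbn, callee_vbn, calls):
    """VBN sequence for `calls` repetitions of: GOSUB line, then subroutine body."""
    trace = []
    for _ in range(calls):
        trace += [caller_vbn, callee_vbn]
    trace.append(caller_vbn)  # execution resumes in the caller after the last RETURN
    return trace


near = count_reads(gosub_trace(caller_vbn=1, callee_vbn=1, calls=10000))
far = count_reads(gosub_trace(caller_vbn=1, callee_vbn=240, calls=10000))
print(near, far)  # 1 20001
```

In this model a far subroutine costs two reads per GOSUB (one to get there, one to come back), which is in line with the gap of roughly 20000 IOs between Test1 and Test2. It does not try to explain the 3933 IOs of Test1.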

So just go see for yourself what it does!
Start up two sessions:
1:
$SET PROC/NAME=TEST
$@TEST

2:
$SET PROC/PRIV=CMKRNL
$ANALYZE/SYSTEM
SDA> SET PROC TEST
SDA> SHOW PROC/RMS=(PIO,NOIFB:3,RAB,BDBSUM)

Suggested test.com (verbatim!):

$ Loop:
$ Read/promt="Go Far? " sys$command x
$ if x
$ then gosub far
$ eLse gosub near
$ endif
$ goto loop
$ near:
$ read/prompt="Return from near? " sys$command x
$ return
$ x=1 !The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
$ x=1 !The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
:

:
$ x=1 !The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
$far:
$read/prompt="Return from far? " sys$command x
$return

For bonus appreciation and learning, convert the above to an INDEXED file. (Spelling Magic involved here!).

$ conv/fdl="fil; org ind; key 0; dup yes; seg0_l 4" test.com test_idx.com
$ type test_idx.com
$ diff test.com test_idx.com

Now try again.

You'll see you get 2 buffers, each a bucket (default 2 blocks) big. But 1 of the buffers is used for the index root and has a cache value keeping it there. Thus you'll get the same, or more, IOs depending on the exact fill of the buckets.

Are we having fun yet?

Greetings,
Hein

Hein van den Heuvel · ‎03-12-2008

I wrote "verbatim!" to honor spaces and case, but I meant 'mostly verbatim' ;-)
(a little bit pregnant).

The part with:
:

:

is supposed to represent a bunch of the long lines, where 'a bunch' is more than 50, so that we exceed 4000-ish bytes using 100 of those 35-letter pangrams + stuff.
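
For anyone who wants to regenerate the file rather than paste the long lines by hand, here is a small Python sketch that builds the text of test.com with the elided part filled by filler lines. The head and tail lines are copied as they appear in the post above (spacing in the original may have differed). The count of 60 filler lines and the size estimate (2-byte length field per record, padded to an even size) are assumptions made for illustration.

```python
PANGRAM = "The quick brown fox jumps over the lazy dog."
FILLER = "$ x=1 !" + PANGRAM + " " + PANGRAM

HEAD = [
    "$ Loop:",
    '$ Read/promt="Go Far? " sys$command x',
    "$ if x",
    "$ then gosub far",
    "$ eLse gosub near",
    "$ endif",
    "$ goto loop",
    "$ near:",
    '$ read/prompt="Return from near? " sys$command x',
    "$ return",
]
TAIL = [
    "$far:",
    '$read/prompt="Return from far? " sys$command x',
    "$return",
]


def build_test_com(filler_lines=60):
    """Return the procedure as a list of lines, with the ':' gap filled in."""
    return HEAD + [FILLER] * filler_lines + TAIL


def approx_bytes(lines):
    """Rough on-disk size, assuming variable-length records:
    text + 2-byte length field, padded to an even number of bytes."""
    return sum(len(line) + 2 + (len(line) % 2) for line in lines)


lines = build_test_com()
print(len(lines), "lines, about", approx_bytes(lines), "bytes")
print("\n".join(lines[:3]))
```

The only point of the estimate is to check that the filler alone pushes the far label out of the first 4096-byte buffer.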

Hein.

Wim Van den Wyngaert · ‎03-16-2008

Same result with XFC (7.3). XFC prevents the IO's.

John G.,

The procedure monitors the systems and has been running for 8 years without major problems. But it has very strict debugging built in (1 warning and it aborts+restarts, saying in which part it aborted -> at most 100 lines of DCL are possible). The advantage is that ANY system manager can read/debug/modify it, in contrast with the Pascal and Fortran programs we also have. It also has the advantage that it doesn't create extra processes (I parse output of ncl, tcpip, ...). BTW : already found that subprocedures (@) and calls are more expensive than gosubs.

So, no way I will go to a programming language.

So, just trying to optimize my Fiat (or whatever small car you have in .au), not planning to buy a Mercedes.

Wim


Wim Van den Wyngaert · ‎03-16-2008

Still don't get it.

Tried only 1 gosub. Directly after : 42 IO's. 4000 lines further : 56 IO's. The procedure is 248 blocks on disk. Did it read the entire procedure in about 10 IO's ?

The gosub only contained return. So I would expect that read by RFA only requires 1 IO. Then why did I get a punishment of 20000 IO's in the original test ? 2 IO's for each RFA read ?

And how to explain the 3933 IO's from test1 ? If it had to read something again, I would expect at least 10.000 IO's.

Wim


Hein van den Heuvel · ‎03-17-2008

Hmm, If you really care about understanding it then you should use the tools I outlined.

Put a few reads in the procedure to 'break', and use a second screen to look.

Be sure to run with SET WATCH/FILE/CLASS=MAJOR to see the final read/write IOs counted.
Or pick up Volker's PROCIO.
I _think_ it will show the IO counts as they happen, but it didn't work just now on my test system.

You can also use ANALYZE/SYSTEM... SET PROC ... SHOW PROC/CHAN... READ SYSDEF ... FORMAT ---> See the read/write count, or go on to the FCB.

If you go back to my earlier reply for the SHOW RMS hints, then you may want to know the RFAs for the labels, as the first part is the BLOCK NUMBER.

I use SEARCH/NUMB to get some relevant line numbers, then DUMP/RECOR=(COUNT=1,START=) to get the RFA.
While in SDA> the RAB will show the current RFA of course, and an EXAMINE of the RBF will show the line it is on.

Finally, it is critical how the executed lines fit in buffers.
If a label happens to be placed just before a buffer (8 block?) boundary, and the rest just on the other side, then you see lots of IOs (cached) to flip back and forth.

Good luck!
Hein.

Willem Grooters · ‎03-17-2008

I learned - long ago (VAX?) - that when looking for GOSUB and CALL targets, DCL will search the file top-to-bottom, and that, for speed, you should locate them at the top of the file. The search may start from the current location (of the GOSUB or CALL), but when the label is not found, the search will still start at the top of the file.

I think it still works this way - because an error in DCL code between the GOSUB or CALL, and the target location, may halt the procedure.

HTH

Willem Grooters
OpenVMS Developer & System Manager

Wim Van den Wyngaert · ‎03-17-2008

Hein,

I tried your SDA stuff but it's rather technical (what exactly to watch for, what the fields mean).

Willem,

I found that too.

But I'm curious if there is a description in plain English of how it works.

Wim


Hein van den Heuvel · ‎03-17-2008

Wim, did you try the example I suggested?

You'll find it very educational and not too hard to figure out.

>>> SHOW PROC/RMS=(PIO,NOIFB:3,RAB,BDBSUM)

The PIO tells the code in RMS$SDA to report the PROCESS structures (DCL) not the IMAGE.
RAB is RAB... See RMS REFERENCE MANUAL.
BDBSUM is a Summary for the Buffer Descriptor Blocks. Internal RMS stuff.

Here is a quick sample run (on OpenVMS V8.3!)
"test" session input highlighted with >>

$ set proc/name=test
$ set promp="test>> "
test>> @test
>>Go Far?

SDA> SHOW PROC/CHAN
:
Channel CCB Window Status File
0020 7FF7C020 82420100 test.com

SDA> READ SYSDEF
SDA> FORMAT 82420100 ! Window
:
FFFFFFFF.82420138 WCB$L_READS 00000001
:
SDA> SHOW PROC/RMS=(PIO,NOIFB:3,RAB,BDBSUM)
RAB Address: 7FFCF014
:
RFA: 00000001,000A
RBF: 7FF9FEA6
RSZ: 0026 38.

So the first question came from a record at offset 0xA in VBN 1.
To confirm:
SDA> exam 7FF9FEA6;26
%SDA-W-UNALIGNED, unaligned address 00000000.7FF9FEA6; converting to aligned address
CTER_C$ Read/promt="Go Far? " s 00000000.7FF9FEA0
ys$command x.mmand x.dog. The qu 00000000.7FF9FEC0
:
BDB/GBPB Summary
SIZE NUMB VBN BLB_PTR ADDR
00001000 00001000 00000001 00000000 000000007B0B3800

So the buffer used by RMS for DCL was 0x1000 = 4096 bytes, and all were filled.
The current VBN is 1, as expected, and the buffer address is in some P1 zone. We can look at that buffer, which has raw disk block data:

(Removed the HEX mumbo jumbo)
SDA> exa 000000007B0B3800;100
..$ Loop:&.$ Read/promt="Go Fa 00000000.7B0B3800
r? " sys$command x..$ if x ..$ 00000000.7B0B3820
then gosub far...$ eLse gosub n 00000000.7B0B3840

See? Every last bit can be seen and explained.
Now JUMP:

>> Go Far? y
>> Return from far?

SDA> ex FFFFFFFF.82420138 ! WCB$L_READS
FFFFFFFF.82420138: 00000000.00000003
:
RAB
RFA: 00000014,0112
:
BDB/GBPB Summary
SIZE NUMB VBN BLB_PTR ADDR
00001000 00000800 00000011 00000000 000000007B0B3800

Same buffer (address) now holds VBN 0x11 = 17 through VBN 0x14
Only 0x800 bytes were read, hitting EOF.
The RFA points to the last VBN in the buffer. To confirm:
SDA> exa 007B0B3800+(3*200)+112;40
%SDA-W-UNALIGNED, unaligned address 00000000.7B0B3F12; converting to aligned address
: ..$read/prompt="Return from fa 00000000.7B0B3F10
r? " sys$command x..$return..... 00000000.7B0B3F30

>>Return from far? y
>>Go Far?

Extra read done...
FFFFFFFF.82420138: 00000000.00000004
And VBN 1 is back in the buffer.

Clear as mud?
Very predictable!

Now you have to realize that RMS when asked to 'jump' to a remembered label (or returning, or returning from an @), will
NOT read starting at that VBN.
It will start the read at its natural VBN buffer boundaries.

It takes the target VBN from the RFA (here 0x14), integer-divides by the blocks per buffer (here 8), re-multiplies by the blocks per buffer, and adds one (because oddly enough the first block is 1, not 0), giving 0x11 here. So if the code after the label takes more bytes than fit in that buffer, then an extra IO will be done every time again.

In fact, the label record itself may require 2 read IOs on a bad day!

Moving and changing 1 comment line from before a label to behind it could cause the label to be brought forward so that its start is just in the last bytes of one buffer, and push the return just over the end of the next buffer, changing 1 IO to 3.
... or vice versa.
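
The boundary arithmetic can be sketched in a few lines of Python. This assumes buffers start at VBN 1, 9, 17, ... (8 blocks each), so the VBN is reduced by one before dividing; for the VBN 0x14 in the run above this gives the same 0x11 as the description. The function names are invented for the example, and VBNs 0x18/0x19 are a hypothetical label position, not taken from the test file.

```python
def buffer_start_vbn(target_vbn, blocks_per_buffer=8):
    """First VBN of the aligned run of blocks that contains target_vbn (1-based)."""
    return (target_vbn - 1) // blocks_per_buffer * blocks_per_buffer + 1


def reads_for_span(first_vbn, last_vbn, blocks_per_buffer=8):
    """Buffer loads needed to execute code running from first_vbn through last_vbn."""
    first_run = (first_vbn - 1) // blocks_per_buffer
    last_run = (last_vbn - 1) // blocks_per_buffer
    return last_run - first_run + 1


print(hex(buffer_start_vbn(0x14)))  # 0x11, as seen in the BDB summary
print(reads_for_span(0x14, 0x14))   # 1: label and RETURN in the same buffer
print(reads_for_span(0x18, 0x19))   # 2: label in the last block of one buffer, rest in the next
```

The last case is the "bad day" placement: a subroutine that straddles a boundary pays the extra read on every call.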

For all who had the interest to read this far...

1) Would a slow-motion rundown to show 'what happens behind the scenes' as above be a Boot Camp session you would attend?

2) Think about sticking a DCL procedure in an indexed file. It would be ordered by key, right? Well, I did not provide a handy key in a (fixed byte offset) comment field, but just used the first 4 bytes as the key. Do you see how it can work at all?

Fun? Educational? Greetings!

Hein.

Jan van den Ende · ‎03-17-2008

@Hein:

>>>
For all who had the interest to read this far...

1) Would a slow-motion rundown to show 'what happens behind the scenes' as above be a Boot Camp session you would attend?
<<<

Well, IF I manage to get there (and at this moment chances are slim), THEN I would love to see that!
(And if I can not make it, my guess is that at the next TUD it would also gather interest, certainly mine).

Cheers.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Hein van den Heuvel · ‎03-17-2008

Back to the underlying question.
"How to optimize subroutine placement in DCL"

The answer we derived above is that for optimal speed the goal would be to have the caller, and callee live in the same RMS buffer when it comes to execution time.

Now this is unrealistic / tedious / next to impossible to arrange.

So the next best thing is to increase the odds.
That is basic common sense for speeding up DCL, but rarely followed.

Notably an easy win:
- restrict the 'documentation' at top to a referral pointing to the bottom.

Or you could make the intro just over 4096 bytes long (counting overhead).
Of course I mean this 99% as a joke, but it could be a neat experiment. If the procedure has a clear 'main loop', for example for a 'main menu', then use spacer lines or comments to push that start into a fresh RMS buffer. Verify with $DUMP/BLO=(COUN=1,START=9)
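
To size such filler, a back-of-the-envelope Python sketch follows. It assumes 512-byte blocks and a 4096-byte buffer as observed above, and that you already know the byte offset (record overhead included) at which the main loop currently starts; the offset 1500 is an invented figure.

```python
BLOCK_BYTES = 512


def padding_to_next_buffer(offset_bytes, buffer_bytes=4096):
    """Filler bytes needed so the record now at offset_bytes starts a fresh buffer."""
    return -offset_bytes % buffer_bytes


def first_vbn_of_buffer(buffer_index, buffer_bytes=4096):
    """1-based VBN where the buffer with the given 0-based index begins."""
    return buffer_index * (buffer_bytes // BLOCK_BYTES) + 1


print(padding_to_next_buffer(1500))  # 2596 bytes of comments/spacer lines
print(first_vbn_of_buffer(1))        # 9, the block to inspect with DUMP
```

The second function shows where START=9 comes from: with 8 blocks per buffer, the second buffer begins at VBN 9.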

Other thoughts....
- Move all usage and maintenance comments to the very bottom, behind rarely executed ('help') routines.

- Consider a 'branch out' at the top to an 'init' section located towards the end and back there to the top.

- Consider putting extensive logical name and symbol definitions in an executable program.

- Keep callers and callees close. This could mean to NOT put all subroutines at the end, but sprinkle them amongst the main logic.

- Maintain 2 copies of a performance sensitive script:
1) main version, in CMS
2) executable version, trimmed down using a tool like 'DCLDIET' or 'SQUEEZE' to strip all comments, minimize whitespace, reduce lexical function names to abbreviations, transform all variables to a0, a1,.. a9, aa, ab,.. az, b0, b1,...

Hope this helps,
Hein.

$! x.com first, lastname. Date. Version.
$! Full comments and usage at "README"
$goto setup
$main:

:
goto part_'x
part_1:
:
goto main
:
subroutines for part 1 and part 2
:
part_2:
:
part_5:
:
another bunch of subroutines, notably for part_4 - part_7
:
part_z:
:
$setup:
:
$goto main ! Let's start for real now.
$
$help:
$type sys$input

$!README
$exit

blah blah.

John Gillings · ‎03-17-2008

Wim,

An "optimization" I've used for a large suite of DCL is to have the main entry point check the location of the files. If they're on a real disk drive, create a RAM disk, move all the procedures to it and reexecute the new copy from the RAM disk. This gave me about 15% speed up, even counting the work to setup and tear down the RAM disk. However, it was pre XFC, so your mileage may vary today.

Another real example... I had a task of parsing some data to gather statistics from several dozen large log files (>1.5GB each). I wrote a DCL prototype and found it was taking about 90 minutes per file. While it was running, I reimplemented the DCL in MACRO32. The MACRO version produced identical output to the DCL, but took less than 60 seconds per file. In this case the end to end runtime of the compiled version was significantly faster than the DCL, including the development, compilation and debugging of the compiled version!

I think sometimes the real potential difference between interpreted code and compiled code is not appreciated. I'd also argue that although DCL is more easily accessible, it's often MUCH harder to write and get it correct than Pascal, FORTRAN, MACRO32 or even C and Basic. It's a fallacy to think that "anyone can do it" when there are so many subtle pitfalls like the myriad of potential single character typos to break programs, and there's no compiler to protect you. Sure, it's great to be able to knock out a page or so of script to run once or twice, but when you're getting into thousands of lines of code, or something that needs frequent modification, or that runs all day, every day, it's well worth putting in the investment to do it properly in the most appropriate language for the task.

That said, please don't get the wrong idea about my attitude to DCL. I write huge volumes of DCL, including procedures of several thousand lines, and even some that run all day, every day (but are not too performance sensitive). There are many things that can be done very easily in DCL that require lots of code in other languages (simple example, string subtraction!), BUT the DCL version often runs orders of magnitude slower.

On the other hand, a DCL advantage in *favour* of performance is to use multiple pipes in order to automatically exploit multiple CPUs and avoid temporary files. I've got lots of DCL that forms itself into complex trees of pipes, with up to a dozen processes, each doing their own subtask. What I lose in process creation overhead, I often gain many times over in avoiding I/O to and from temporary files, and all the directory & file system overhead.

> It also has the advantage that it doesn't
>create extra processes (I parse output of
>ncl, tcpip, ...).

On the contrary! I think you'll find that this could be a fruitful area of optimization. Get rid of all those temporary files! PIPE the output through SEARCH (compiled code, INSTALLed, and highly optimized by some of the best code cutters in the business, like Guy Peleg). Trim it down as much as possible, then back into your DCL parser. Yes it creates extra processes, but compare the overheads! These days avoiding disk I/O is your biggest bang for buck.

The important thing is to realise when you're running up against the inherent limitations of the language (true regardless of what language you're using), and wasting your time trying to achieve a particular objective, when the real solution is to restructure or rewrite in a more appropriate language.

A crucible of informative mistakes

Wim Van den Wyngaert · ‎03-17-2008

John,

$ if day1 .and. first_server
$ then
$ first_server="f"
$ gosub init_found
$ define/us sys$output 'ct_workf'
$ ucx show prot tcp/pa
$ open/read x 'ct_workf'
$r39:
$ read/end=e39 x x_rcd
$ x_rcd=f$ed(x_rcd,"compress")
$ it="Delay ACK:"
$ pos=f$loc(it,x_rcd)
$ if pos .ne. f$len(x_rcd)
$ then
$ f_1="t"
$ x_rcd2=f$extr(pos,80,x_rcd)
$ how=f$el(2," ",x_rcd2)
$ if f$extr(0,3,how) .nes. "dis"
$ then
$ call wms 'E' "TCP prot param ''it' must be disabled on Sybase nodes !"
$ endif
$ goto e39
$ endif
$ goto r39
$e39:
$ close x
$ del 'ct_workf'.*
$ if .not. f_1
$ then
$ call wms "W" "Output scan failed for ucx show prot tcp /pa"
$ endif
$ endif

This is an example of the code. As you can see there is protection against the item "delay ack" not being found. I know I can nest more lexicals but I don't do that, for readability. If something goes wrong, I still have the workfile to see what went wrong.

BTW : the monitoring consumes 10 cpu sec per hour and 5000 DIO (pity that the real IO's are not available in acc). This while hundreds of checks are done, including a full disk scan for versions >30.000.

BTW2 : 1 version running on 5.5, 6.2, 7.2, 7.3. Only the last 2 remain. Thus piping could not be used.

BTW3 : I said ANY system manager. ANY system manager should know basic DCL. Which I can't say of C, FORTRAN or PASCAL (I'm a COBOL guy). And the ideal programming language changes every 10 years.

For me the performance is acceptable. But if I can improve it ...

Wim


labadie_1 · ‎03-18-2008

Wim

If you just want to check whether delay ack is enabled or disabled, maybe it is faster to do this after the creation of your file
$ define/us sys$output 'ct_workf'
$ ucx show prot tcp/pa

$ conv /fdl=sys$input 'ct_workf' new
record; format stream;
ctrl Z
so the file created is no longer a one-long-line file.

Then a
$ sea new "Delay ACK",enabled/match=and
and branch according to the result of the search.

The convert is necessary, because as window scale is enabled, you will always find the 2 strings (Delay ACK and enabled) without the convert.

Really a pity that with your versions of VMS you can't have the same procedure on all nodes, as using PIPE heavily (as John Gillings said) would save a lot.

Wim Van den Wyngaert · ‎03-18-2008

Labadie,

Compared the 2 in batch :

DCL : 95 DIO, .17 cpu, 1679 PF
Convert : 97 DIO, .19 CPU, 2371 PF

With set file/at=rfm=stm instead of convert (is what I use in my dcl)
78 DIO, .16 CPU, 1768 PF

And most of the code searches for many things and I want to know for each item if it was found or not. DCL will be difficult to beat.

Wim


Wim Van den Wyngaert · ‎03-18-2008

Maybe the attachment is a better example of what I do.

Wim


labadie_1 · ‎03-18-2008

Wim

Maybe you should have 2 different versions of your scripts, the current one and another one for the VMS versions that allow PIPE.

As you seem to look at some memory cells, you could use the same idea as in this procedure

http://dcl.openvms.org/stories.php?story=06/03/21/8098045

I have in my notes the sda commands to get various things, for npagedyn pool expansion see the previous procedure.

I bet Volker Halle will have posted how to get the number of used slots and such things before I find my notes !

Hein van den Heuvel · ‎03-18-2008

Wim,

Here is some concrete advice on how to speed up all your DCL code, based on the example shown.

$ it="Delay ACK:"
$r39:
$ read/end=e39 x x_rcd
$ x_rcd=f$ed(x_rcd,"compress")
$ pos=f$loc(it,x_rcd)
$ if pos .eq. f$len(x_rcd) then goto r39
$ f_1="t"
$ x_rcd2=f$extr(pos,80,x_rcd)
$ how=f$el(2," ",x_rcd2)
$ if f$extr(0,3,how) .nes. "dis"
$ then
$ call wms 'E' "TCP prot param ''it' must be disabled on Sybase nodes !"
$ endif
$e39:

1) get any constant assignment out of the loop.
2) loop back as soon as possible.

note 1) if somehow that "Delay ACK" was not on the line, then the example would use whatever was left behind in 'f_1', or barf at the test, no?

note 2) You don't need a loop for ucx today

note 3) You could simply $SEARCH for "Delay ACK: disabled", exactly with spaces, and test the SEARCH status, no?
I know... your construct would allow for the spaces to change some, but it could not handle "Delay ACK: disfunctional" :-)

Just for demonstration purposes, in Perl, using a similar construct the code would look like:

use strict;
my $found_it = 0;
my $target = 'Delay ACK:';
foreach (qx( ucx show prot tcp/pa)) { # Delay ACK: enabled/disabled
    if (/^\s+$target\s+(\w+)abled/) { # $1 is "en" or "dis"
        $found_it++;
        print STDERR "TCP Prot param $target must be DISABLED for Sybase\n" unless ($1 eq 'dis');
        last;
    }
}
print STDERR "Output scan failed for ucx show prot tcp /pa\n" unless $found_it;

The Perl code has of course many interesting options to efficiently look for 'many' things. Typically you would load all name/values into an array while scanning and then just use the values in the array for testing.
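
The "load everything, then test" idea can be sketched as below, in Python rather than Perl. The sample text is invented for the example and is NOT real `ucx show prot tcp/pa` output; the item name "Keepalive" and both function names are likewise hypothetical. It only shows one scan over the output followed by a found/not-found check per item.

```python
import re


def parse_settings(text):
    """Collect 'Name: value' pairs from command output into a dict."""
    settings = {}
    for match in re.finditer(r"([A-Za-z][A-Za-z ]*?):\s+(\w+)", text):
        settings[match.group(1).strip()] = match.group(2)
    return settings


def check(settings, expected):
    """Return one message per expected item that is missing or has the wrong value."""
    problems = []
    for name, wanted in expected.items():
        if name not in settings:
            problems.append(f"Output scan failed for {name}")
        elif settings[name] != wanted:
            problems.append(f"{name} must be {wanted}")
    return problems


# Made-up sample text, for illustration only.
sample = "  Delay ACK:  enabled\r\n  Window scale:  enabled\r\n"
settings = parse_settings(sample)
problems = check(settings, {"Delay ACK": "disabled", "Keepalive": "enabled"})
print(problems)  # ['Delay ACK must be disabled', 'Output scan failed for Keepalive']
```

Each expected item gets its own verdict, so a missing item is reported separately from a wrong value.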

Labadie,

Yes, UCX output is a single long line with embedded CR/LF.
The other way to force separate lines is:

$ cre tmp.tmp/fdl=sys$input
record; format stream;
$ open/app log tmp.tmp
$ define/user sys$output log
$ ucx show prot tcp/pa
$ close log

Just 1 file created, not 2.

But a simple SET FILE/ATTRI=RFM=STM, as Wim mentions, is a reasonable hack for this output anyway.

Hein.

labadie_1 · ‎03-18-2008

The number of process entry slots seems to be at SCH$GW_PROCLIM, the number of used process entry slots at SCH$GW_PROCCNT.

labadie_1 · ‎03-18-2008

SGN$GL_BALSETCT and/or SGN$GL_BALSETMAX
- > the starting value for the balance set slots

SWP$GL_BALCNT -> the balance set slots used
