Operating System - OpenVMS

Mark_Corcoran
Frequent Advisor

RMS Performance Query

In July of last year, an errant hardware device resulted in the same information being repeatedly presented to our application, and due to poor original design, this allowed 185564 records (one every two seconds) to be inserted into an indexed RMS file for the item (they aren't duplicates, because the key includes the item's value plus a timestamp in YYMMDDHHMMSS format).

Initially this was only detected by virtue of another part of the software reporting that there was an attempt to move the item into a logical zone that it was already in, which wasn't permitted - generating rather large log files.

It took about another 6 months before a team that use a reporting tool happened to try to generate a report on that zone, and found these 185564 records.

An initial attempt to delete these records had to be killed off after ~8 hours, when only 103351 of them were deleted.

There was high CPU usage on the system; subsequent trials on a test system, using a copy of the indexed file taken last year (with the excess records) and MONITOR MODES /INT=1, indicated most of the time was spent in Executive mode, suggesting that the bottleneck was in RMS. That in turn led to other issues (batch jobs not getting enough time to run, so when they resubmitted themselves there ended up being multiple instances of the jobs; a watchdog process not getting responses to heartbeat messages sent to other processes, etc.)

I've been spending a long time looking at this, and I have read a number of posts, technical journals and presentations by Hein and others, but I still have trouble getting my head around what RMS is doing (or rather, is being asked to do).

If I take the original file with the 185564 excess records (i.e. in addition to the normal number of records that would be expected; I don't have any figures on it now, but I'd guess no more than 30000) and use the application command to delete the records, it starts off fine, but gets slower and slower.

The underlying Pascal code loops repeatedly; each iteration calls a routine that does a RESETK, FINDK and DELETE, and on return from the routine does a GET and compares the item ID to see whether or not it matches the one being sought (if not, the loop exits).

[I already identified that for 3 or more records for the item ID in this file, you can save 25% of the logical I/O operations by simply not bothering with the GET, and relying on whether or not the FINDK returns an indication that the key was not found;  I'm not sure whether or not the underlying Pascal GET translates to multiple I/Os]

Once a large number of records have been deleted and performance drops to a crawl, if I stop the application and do a CONVERT /FDL, then performance seems to increase again until after another large tranche of records are deleted, and again it slows down.

I've done some analysis - MONITOR RMS/FILE doesn't really help; it only tells me what I already know (the $DELETE call rate drops to between 1 and 3 per second).

I saw a posting about RMS Tools in Freeware V5.0, and downloaded the .ZIP from the HP web site, though it appears to only have the directory structure, and not the actual tools themselves.

However, I did manage to find RMS_TOOLS.ZIP elsewhere, and extracted RMS_STATS and RMS_TUNE_CHECK (not sure if the .EXEs were AXP, but they wouldn't run and ANALYZE /IMAGE didn't like them, so I had to recompile & relink them).

An ANALYZE /RMS /STATISTICS shows the following (edited to remove a lot of the duplicate blank lines, and removing the file name and changing the key names to protect the guilty) - note that this is based on the file containing the 185564 excess records, without any attempt at deleting them.

 

RMS File Statistics 28-FEB-2018 10:05:12.12 Page 1


FILE HEADER

Contiguity Options: contiguous-best-try
Performance Options: none
Reliability Options: none
Journaling Enabled: none


RMS FILE ATTRIBUTES

File Organization: indexed
Record Format: fixed
Record Attributes: carriage-return
Maximum Record Size: 229
Longest Record: 229
Blocks Allocated: 80550, Default Extend Size: 3006
Bucket Size: 9
File Monitoring: disabled
Global Buffer Count: 50


FIXED PROLOG

Number of Areas: 3, VBN of First Descriptor: 3
Prolog Version: 3

AREA DESCRIPTOR #0 (VBN 3, offset %X'0000')

Bucket Size: 9
Reclaimed Bucket VBN: 0
Current Extent Start: 1, Blocks: 72621, Used: 768, Next: 769
Default Extend Quantity: 3006
Total Allocation: 72621

STATISTICS FOR AREA #0

Count of Reclaimed Blocks: 0

AREA DESCRIPTOR #1 (VBN 3, offset %X'0040')

Bucket Size: 9
Reclaimed Bucket VBN: 0
Current Extent Start: 72622, Blocks: 423, Used: 9, Next: 72631
Default Extend Quantity: 45
Total Allocation: 423


STATISTICS FOR AREA #1

Count of Reclaimed Blocks: 0

AREA DESCRIPTOR #2 (VBN 3, offset %X'0080')

Bucket Size: 9
Reclaimed Bucket VBN: 0
Current Extent Start: 73045, Blocks: 7506, Used: 81, Next: 73126
Default Extend Quantity: 315
Total Allocation: 7506

STATISTICS FOR AREA #2

Count of Reclaimed Blocks: 0

KEY DESCRIPTOR #0 (VBN 1, offset %X'0000')

Next Key Descriptor VBN: 2, Offset: %X'0000'
Index Area: 1, Level 1 Index Area: 1, Data Area: 0
Root Level: 1
Index Bucket Size: 9, Data Bucket Size: 9
Root VBN: 72622
Key Flags:
(0) KEY$V_DUPKEYS 0
(3) KEY$V_IDX_COMPR 1
(4) KEY$V_INITIDX 0
(6) KEY$V_KEY_COMPR 1
(7) KEY$V_REC_COMPR 1
Key Segments: 1
Key Size: 6
Minimum Record Size: 6
Index Fill Quantity: 3640, Data Fill Quantity: 3640
Segment Positions: 0
Segment Sizes: 6
Data Type: string
Name: "key0_fieldname"
First Data Bucket VBN: 4

STATISTICS FOR KEY #0

Number of Index Levels: 1
Count of Level 1 Records: 85
Mean Length of Index Entry: 8
Count of Index Blocks: 9
Mean Index Bucket Fill: 13%
Mean Index Entry Compression: 14%

Count of Data Records: 1822
Mean Length of Data Record: 229
Count of Data Blocks: 765
Mean Data Bucket Fill: 77%
Mean Data Key Compression: 40%
Mean Data Record Compression: 32%
Overall Space Efficiency: 1%

KEY DESCRIPTOR #1 (VBN 2, offset %X'0000')

Index Area: 2, Level 1 Index Area: 2, Data Area: 2
Root Level: 1
Index Bucket Size: 9, Data Bucket Size: 9
Root VBN: 73054
Key Flags:
(0) KEY$V_DUPKEYS 0
(1) KEY$V_CHGKEYS 1
(2) KEY$V_NULKEYS 0
(3) KEY$V_IDX_COMPR 1
(4) KEY$V_INITIDX 0
(6) KEY$V_KEY_COMPR 1
Key Segments: 1
Key Size: 14
Minimum Record Size: 20
Index Fill Quantity: 3686, Data Fill Quantity: 3686
Segment Positions: 6
Segment Sizes: 14
Data Type: string
Name: "key1_fieldname1, key1_fieldname2"
First Data Bucket VBN: 73045

STATISTICS FOR KEY #1

Number of Index Levels: 1
Count of Level 1 Records: 8
Mean Length of Index Entry: 17
Count of Index Blocks: 9
Mean Index Bucket Fill: 3%
Mean Index Entry Compression: 6%

Count of Data Records: 1822
Mean Duplicates per Data Record: 0
Mean Length of Data Record: 15
Count of Data Blocks: 72
Mean Data Bucket Fill: 74%
Mean Data Key Compression: 57%

 

The analysis uncovered NO errors.

 

Note that the used block counts for the areas are a lot smaller than the allocated block counts - the various indexed files used by the application had gone without a CONVERT /FDL for over a year, so I updated the .FDL file in 2016 to cater for the usage at the time, and to allow for growth without fragmenting the file if it happened not to be CONVERT /FDLed for another year.

 

RMS_TUNE_CHECK reports the following (again, based on the 185564 excess records):


Key 1

Found 196139 sidr records in 669 buckets
There were 196139 user data record pointers.
There were 0 deleted, 0 RU_delete and 0 RU_update pointers.
No duplicate key values found.

 

A PowerPoint presentation by Hein included in RMS_TOOLS.ZIP has a slide titled "How many GLOBAL Buffers", with the first bullet point being "'None' is the only wrong answer".

I tried to increase this, but a.n.other detached process that opens the file terminates on startup due to exceeding the global page file limit (unfortunately, it is started up by a process that explicitly specifies in the Pascal source code what the quotas are that it uses in the $CREPRC call (rather than being started by a .COM where I could manipulate it without having to recompile & relink)), and I can only increase it to 40 before it barfs.

So, I looked at changing the local cache in the process from which I execute the application command to delete the records from the file (the delete is handled within the context of my process, and isn't handed off to some database process - well, not on the master node anyway; it sends a copy of the transaction to the slave, which does the deletes via a database process).

I tried SET RMS_DEFAULT /BLOCK_COUNT=127 /BUFFER_COUNT=255 and for a while, this does improve things;  Hein's RMS_STATS reports the following when starting to delete the 185564 records:

0.01 IO/Ops. Ops: 21559 Cache: 99% Global: 0% VEH_ZONE_HIST.IDX
Local Cache trials 197873 hit 197760 read 113 write 0
Global Cache trials 0 hit 0 read 0 write 0

Locks: Enqueue Dequeue Convert Wait Block-ast
-----------------+----------+-----------+-----------+------------+----------
Shared file: 19 1 3 0 2
Local buffer: 113 19 97193 0 0
Global buffer: 0 0 0 0 0
Append locks: 0 0 0 0 0
Global section: 0 0 0 0 0
Data record: 16170 16170 0 0

Open: 19 Close: 1 Con: 18 Discon: 0
Rewind: 5391 Flush: 0 Extend 0 blocks: 0
Outbufquo: 0 Xqpqio: 38 Truncate 0 blocks: 0
Reads 0 Bytes: 0 Writes 0 bytes: 0

Find seq: 0 key: 0 rfa: 0
Get seq: 10780 key: 5390 rfa: 0 bytes: 662970
Put seq: 0 key: 0 bytes: 0
Deletes: 5389 Updates 0 bytes: 0
Splits: 0 multy: 0 OutbufQuo 0

 

But later on, that local cache hit rate drops to ~24%

Hein's PPT alludes to this (obviously, it's mostly bullet points, which he'd then actually have expanded on if you'd been there in person), but the change in the number of (local) buffers appears to require an increase in ENQLM - I found that after doing the SET RMS_DEFAULT, the application command to delete the records would randomly terminate before all of the records were deleted (if I issued the command again, it would continue for another few thousand before doing the same again).

Only after I added some diagnostics could I see that it wasn't terminating because the item ID didn't match the one being sought, so the Pascal STATUS call on the file had to be indicating an error - further diagnostics revealed it was 103, which is PAS$K_ERRDURGET.

An old ASCII version of the user manual for v3.5 of the Pascal compiler had code examples (which later versions don't appear to have) on getting the underlying RMS reason - after some fiddling (I haven't really used Pascal for 20 years, so don't remember the syntactical/semantical layout required), I managed to get the code compiled and working, which resulted in LIB$SIGNAL reporting "%RMS-F-EXENQLM, exceeded enqueue quota".

I bumped the ENQLM count from 200 to 1000 in my UAF entry, logged out & back in, and started running the command again - it was still running (very slowly) this morning, but before I'd left yesterday, looking at the JIB in SDA indicated that the ENQCNT had dropped to %X308 (i.e. 776).
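For completeness, the quota change itself was just the usual AUTHORIZE incantation - a minimal sketch, with the username below being purely illustrative:

$ MCR AUTHORIZE
UAF> MODIFY CORCORAN /ENQLM=1000
UAF> EXIT

followed by logging out and back in so that the new quota was picked up in the JIB.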

I'm not sure whether or not this is indicative of a leak either in RMS (unlikely) or the Pascal RTL, or if it is to be expected:

  1. As more and more records are deleted from the indexed file, presumably RMS has to (potentially) perform more underlying I/O operations in response to a Pascal FINDK, DELETE and/or GET, as it has to skip over deleted entries in the index?

    [Depending on whether or not that index is cached in either local or global buffers, this could mean painful physical I/O (unfortunately, this is OpenVMS/VAX, so it is the VIOC cache, meaning you can't get more meaningful information on file caching from OpenVMS itself)]

  2. When local or global RMS buffers are specified, presumably there's no precision/control about whether they are used for the index rather than the data records?

  3. There seems to be a fairly high ENQ count for locks - does this increase (exponentially) the more (deleted) records RMS has to skip past to find the record being requested by Pascal's FINDK?

    i.e. might my ENQCNT eventually drop to 0 despite having quintupled the ENQLM quota from 200 to 1000?

  4. The performance is initially good when deleting the records, but worsens as more and more records are deleted, so it's not a persistent RMS problem; it only appears once a tipping point (probably related to where it needs to find key records in the index) is reached.

    Depending on how/what RMS is caching, would it be worth me shrinking the indexed file (with a CONVERT /FDL) to a smaller size that is just enough to cover the index and data records being used?

    [On the production system, 103351 of the excess records have already been deleted, so the real in-use space has already decreased]

    How might a local buffer count of 255 with a block size of 127 relate to what is or isn't cached (255 buffers * 127 blocks = 32385 blocks, which would be larger than the actual space in use, although I'm sure it's not as simple as the calculation I just gave, to get at least the index (if not the data records) cached)?

  5. I'm only interested in increasing the performance at the moment for the purposes of this loop in Pascal that is doing the RESETK, FINDK and DELETE - I already have a code change in mind to prevent this looping of records being added to the file in future, so I don't expect to have to delete large swathes of records like this again.

Any answers/suggestions gratefully received!

 

Mark

[Formerly appearing as woeisme]
Hein van den Heuvel
Honored Contributor
Solution

Re: RMS Performance Query

I'm afraid I don't have the time to study this completely.

Seems you did a good bit of homework. Thank you.

  1. The ANALYZE/RMS/STAT numbers do not at all match the story, notably: "Count of Data Records: 1822". Are you looking at the correct file? Minor note: just use ANALYZE/RMS/FDL next time. It has the same info in a more usable format.
  2. IF the find is for GT the same low value every time, and then the record is deleted, then RMS has to start at the same corresponding data bucket all the time, but it will be emptied out, as will the next and the next, so over time each find may take longer skipping over empty bucket... but they should all be in the file system cache. Mind you tune_check would report a long chain of empty buckets, but it did not.
  3. You want to enable Fast Delete ( mask RAB$M_FDL /  bit RAB$V_FDL) to stop RMS from purging the alternate key structures for each delete.
  4. Instead of reset (why reset??), find, delete, just find once and go sit in a read/delete loop validating the key values as you read.
  5. Better still... write a program (or use SEARCH/MATCH=NOR?) to save only the record you want into a sequential file, next convert that to indexed and presto. I suspect 10 minutes for the whole thing. (A rough DCL sketch follows this list.)
  6. The RMS_STATS look suspect also, like it is counting for 1 file, but doing the work on an other one.
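For suggestion 5, a rough DCL sketch (the file names and the "BADITEM" string are placeholders - the point is to keep everything that does NOT contain the bad key value, then rebuild from the existing FDL):

$ ! Keep every record that does not contain the offending item ID.
$ SEARCH /MATCH=NOR /OUTPUT=KEEP.SEQ YOURFILE.IDX "BADITEM"
$ ! Rebuild the indexed file from the sequential extract; /NOSORT because the
$ ! records come out of the indexed file already in primary key order.
$ CONVERT /NOSORT /STATISTICS /FDL=YOURFILE.FDL KEEP.SEQ YOURFILE_NEW.IDX
$ ! Only once happy with the record counts, rename the new file into place.
$ RENAME YOURFILE_NEW.IDX YOURFILE.IDX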

Good luck.

Hein.

Chris Scheers
Advisor

Re: RMS Performance Query

I have seen something like this before.

When you enter a lot of sequential records with ascending keys, it seems that the key structure is not built in an optimal way. Then when you try to delete records, it can take a long time.

I once had a file that was so degenerate that deleting one record took about 12 minutes.  (This was on a VAX with slow disks.)

If you can take the file offline, a CONVERT/RECLAIM may help. Finish your deletions, then take the file offline again and do a final CONVERT/RECLAIM.
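For example (the file name is illustrative, and the application must have the file closed while it runs):

$ CONVERT /RECLAIM YOURFILE.IDX

Unlike a normal CONVERT, /RECLAIM reclaims the empty buckets in place rather than creating a new copy of the file, so it needs no extra disk space.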

If you can't take the file offline, you are probably stuck with the performance you are seeing.

Good luck!

Mark_Corcoran
Frequent Advisor

Re: RMS Performance Query

Thanks Hein & Chris for responding...

>/RMS/STAT numbers do not at all match the story

Yep, you're right, Hein.  I checked my session log from yesterday and saw that I specified the wrong file.

>Just use ANALYZE/RMS/FDL next time
Hmm, I was convinced that it was a subset, but on comparing the two, it's just a different order and method of presentation.

The output of ANALYZE /RMS /FDL for the original file (with the excess 185564 records) is as follows:

IDENT " 1-MAR-2018 07:32:14 OpenVMS ANALYZE/RMS_FILE Utility"

SYSTEM
SOURCE OpenVMS

FILE
ALLOCATION 247464
BEST_TRY_CONTIGUOUS yes
BUCKET_SIZE 9
CLUSTER_SIZE 9
CONTIGUOUS no
EXTENSION 6003
FILE_MONITORING no
GLOBAL_BUFFER_COUNT 0
NAME "<deleted>"
ORGANIZATION indexed
OWNER [<deleted>]
PROTECTION (system:RWED, owner:RWED, group:RE, world:)

RECORD
BLOCK_SPAN yes
CARRIAGE_CONTROL carriage_return
FORMAT fixed
SIZE 41

AREA 0
ALLOCATION 186084
BEST_TRY_CONTIGUOUS yes
BUCKET_SIZE 9
EXTENSION 6003

AREA 1
ALLOCATION 5004
BEST_TRY_CONTIGUOUS yes
BUCKET_SIZE 9
EXTENSION 504

AREA 2
ALLOCATION 56376
BEST_TRY_CONTIGUOUS yes
BUCKET_SIZE 9
EXTENSION 1008

KEY 0
CHANGES no
DATA_KEY_COMPRESSION yes
DATA_RECORD_COMPRESSION no
DATA_AREA 0
DATA_FILL 78
DUPLICATES yes
INDEX_AREA 1
INDEX_COMPRESSION yes
INDEX_FILL 78
LEVEL1_INDEX_AREA 1
NAME "field1, field2 and field3"
NULL_KEY no
PROLOG 3
SEG0_LENGTH 20
SEG0_POSITION 0
TYPE string

KEY 1
CHANGES yes
DATA_KEY_COMPRESSION yes
DATA_AREA 2
DATA_FILL 78
DUPLICATES no
INDEX_AREA 2
INDEX_COMPRESSION yes
INDEX_FILL 78
LEVEL1_INDEX_AREA 2
NAME "field2, field3"
NULL_KEY no
SEG0_LENGTH 14
SEG0_POSITION 6
TYPE string

ANALYSIS_OF_AREA 0
RECLAIMED_SPACE 0

ANALYSIS_OF_AREA 1
RECLAIMED_SPACE 0

ANALYSIS_OF_AREA 2
RECLAIMED_SPACE 0

ANALYSIS_OF_KEY 0
DATA_FILL 55
DATA_KEY_COMPRESSION 76
DATA_RECORD_COMPRESSION 0
DATA_RECORD_COUNT 92788
DATA_SPACE_OCCUPIED 17505
DEPTH 2
INDEX_COMPRESSION 62
INDEX_FILL 59
INDEX_SPACE_OCCUPIED 54
LEVEL1_RECORD_COUNT 1945
MEAN_DATA_LENGTH 41
MEAN_INDEX_LENGTH 22

ANALYSIS_OF_KEY 1
DATA_FILL 38
DATA_KEY_COMPRESSION 74
DATA_RECORD_COUNT 92788
DATA_SPACE_OCCUPIED 6021
DEPTH 2
DUPLICATES_PER_SIDR 0
INDEX_COMPRESSION 44
INDEX_FILL 45
INDEX_SPACE_OCCUPIED 27
LEVEL1_RECORD_COUNT 669
MEAN_DATA_LENGTH 13
MEAN_INDEX_LENGTH 17

 

>IF the find is for GT the same low value every time, and then the record is deleted, then RMS has to start at the same corresponding data bucket all the time, but it will be emptied out, as will the next and the next, so over time each find may take longer skipping over empty bucket... but they should all be in the file system cache. Mind you tune_check would report a long chain of empty buckets, but it did not.

I left the delete running overnight yesterday (I'd modified the source so it would stop after deleting 103351 records - the same number as was originally deleted on production), and it took ~18:00:35.68 to delete them. Curiously, although I had done a SET RMS_DEFAULT /BLOCK_COUNT=127 /BUFFER_COUNT=255 /INDEXED, on this occasion (when I had earlier reset my ENQLM back to 200) it didn't barf with a "%RMS-F-EXENQLM, exceeded enqueue quota" message.

I then ran RMS_TUNE_CHECK against the file again (after the 103351 records were deleted), and it reported this:

Key 1

Found 92788 sidr records in 669 buckets
There were 92788 user data record pointers.
There were 0 deleted, 0 RU_delete and 0 RU_update pointers.
No duplicate key values found.

 

>You want to enable Fast Delete ( mask RAB$M_FDL /  bit RAB$V_FDL) to stop RMS from purging the alternate key structures for each delete.

At this stage, I was looking at trying to avoid making code changes, just getting the normal status quo back for the file (albeit stopping the application and doing CONVERT /FDL and any other DCL required), but the helpful point has been noted!

[I do have a number of bugs that I need to fix beyond this issue (which resulted in 185564 records being created), but the lead time for any change is probably at least a month.

I need to satisfy myself on the test system that the code does as intended, then produce a ~30pp document that explains to change management et al what is being done, why it's being done, what will happen if it's not done, how to implement it, how to test it and how to regress it, then get my colleagues to each attempt this on the test system, plus the compulsory lead time - and that's before even getting as far as raising a change request and chasing approvers...]

 

>Instead of reset (why reset??), find, delete, just find once and go sit in a read/delete loop validating the key values as you read.
The routine that does the reset, find & delete is a lower-level routine that is called in multiple places, including this higher-level routine that repeatedly calls it in a loop when it is trying to delete multiple records (I can see the case for avoiding reinventing the wheel, and the original developers probably didn't anticipate a design flaw allowing nearly 200K records to be created - it's easy to proclaim yourself a visionary if you're using hindsight ;-))

Ordinarily, there probably wouldn't be more than ~30 records tops for any one "item", even if it goes around the houses and flits between several zones over and over again when different parties argue whose responsibility a particular problem is, and whether or not it has been resolved to the satisfaction of the folks in the next zone...

 

>Better still... write a program (or use SEARCH/MATCH=NOR?) to save only the record you want into a sequential file, next convert that to indexed and presto. I suspect 10 minutes for the whole thing.

I took another look at the .IDX file this morning, following your highlighting that I'd done the ANALYZE /RMS /STATISTICS on the wrong one.

As I would be taking the file offline (and shutting down the application), I will have unfettered access to it; in the original FDL (to which I made a handful of changes, mostly to increase the original allocation sizes, and to make the allocations, extensions and bucket sizes a multiple of the cluster size), the DATA_FILL and INDEX_FILL were set at 79.

I understand that you need to allow leeway for other records to be added in future, but for my purposes here, the application won't have access, so I can ramp up the fill values to 100.

I have done so, and also changed the FDL so that the new file created by CONVERT is only as large as the combined used block count for all of the areas of the file (rounded up to multiples of the cluster size); even if it doesn't affect how much is kept in the cache, it should reduce the wait times for disk head movement, as the new file is only 4.5% of the original 247464 blocks.

I've left this running trying to delete the remaining 82213 records (with ENQLM set to 1000, and the RMS_DEFAULT values I'd mentioned earlier); after 05:17:53.72, it has deleted 58813 records (71.53%);  RMS_STATS reports that the local cache hit rate has dropped to 32% :-(

I think I will have to give it a go: convert to SEQUENTIAL, then convert back to INDEXED (not something I've ever done before, so I'll need to have a read-up on this, to find out how to dictate where the values should be derived from to regenerate the keys).
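From a skim of the Convert documentation, the rough shape of it looks to be something like this (all file names are illustrative; TOSEQ.FDL just needs to describe a sequential file with the same fixed record size):

$ ! Unload the indexed file into a plain sequential file.
$ CONVERT /FDL=TOSEQ.FDL YOURFILE.IDX YOURFILE.SEQ
$ ! ...remove the unwanted records from YOURFILE.SEQ (editor, SEARCH, etc.)...
$ ! Reload it: the KEY sections of the indexed FDL (SEG0_POSITION and
$ ! SEG0_LENGTH) tell CONVERT which bytes of each record make up each key,
$ ! so both keys are regenerated automatically.
$ CONVERT /NOSORT /STATISTICS /FDL=YOURFILE.FDL YOURFILE.SEQ YOURFILE_NEW.IDX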

 

>The RMS_STATS look suspect also, like it is counting for 1 file, but doing the work on an other one.

I definitely didn't get that bit wrong :-)

I've taken another snapshot of that now, as the remaining ~20K records are in the process of being deleted (despite me intending to hide the guilty party in my original post, I forgot to blat out the file name as reported by RMS_STATS, so there's no point in me doing it now;  what gets put on the internet, stays on the internet...)

41.25 IO/Ops. Ops: 239171 Cache: 31% Global: 0% VEH_ZONE_HIST.IDX
Local Cache trials 14451177 hit 4585345 read 9865831 write 28169
Global Cache trials 0 hit 0 read 0 write 0

Locks: Enqueue Dequeue Convert Wait Block-ast
-----------------+----------+-----------+-----------+------------+----------
Shared file: 22 4 6 0 2
Local buffer: 9865831 9865564 10768029 0 0
Global buffer: 0 0 0 0 0
Append locks: 0 0 0 0 0
Global section: 0 0 0 0 0
Data record: 179379 179379 0 0

Open: 22 Close: 4 Con: 18 Discon: 0
Rewind: 59794 Flush: 0 Extend 0 blocks: 0
Outbufquo: 0 Xqpqio: 44 Truncate 0 blocks: 0
Reads 0 Bytes: 0 Writes 0 bytes: 0

Find seq: 0 key: 0 rfa: 0
Get seq: 119586 key: 59793 rfa: 0 bytes: 7354539
Put seq: 0 key: 0 bytes: 0
Deletes: 59792 Updates 0 bytes: 0
Splits: 0 multy: 0 OutbufQuo 0

 

Chris:

>If you can take the file offline, a CONVERT/RECLAIM may help

It's an idea, but we'd be doing our regular convert if we took it offline (or in this case, with a modified .FDL which increases the fill rate and reduces the overall size of the file), which regenerates the file without the empty buckets (a bit like doing an image backup and restore), so it's 6 of one and half a dozen of the other :-(

I think at the moment unless I can get a convert to sequential, deletion of the excess records and reconversion back to indexed, I'm going to have to sit and wait for the convert I am already doing, to complete.

And just hope that the errant hardware device doesn't present another ID repeatedly before I can get the original code bug fixed...

 

Mark

[Formerly appearing as woeisme]
Hein van den Heuvel
Honored Contributor

Re: RMS Performance Query

>The RMS_STATS look suspect also, like it is counting for 1 file, but doing the work on an other one.

I definitely didn't get that bit wrong :-)

You totally got that part wrong. I suspect the stats were enabled after the delete program opened the file.

If it was doing 0.01 IO/operation, then those 100,000 (delete) operations would be finished in 1000 IOs, or 10 seconds give or take. That later number of ~45 IO/operations makes it 4500 times longer = 45,000 seconds = half a day approx.

I don't get it. I commend you for the effort you have taken, but you are doing something which is guaranteed to be slow, and the proof is there that indeed it is slow, just looking at the 'hours'.

Using that 'standard' routine you are asking the system to rewind for every record (to the start of the key of reference), find (index walk) and delete (write) for every record, leaving a bloody mess behind, and taking dozens of locks every step of the way.

Just write a fresh program for the extract of the record to be retained. Any language you can get at...  Pascal, Macro, or even a DCL loop. Convert/stat/nosort/fdl=xxx to indexed with a temp name, on success rename to the real file name to supersede the current file, be happy.

Yes the file will be 'offline' or stale for 20 seconds.

If the file is read sequentially, you'll get all the data in about 2000 IOs, write out only what is needed with say 100 IOs (write-behind IOs of 64 blocks), convert the file back to indexed with another 4 or 5 thousand IOs (including the second index) and be all done in 30 seconds. Much faster than it took to write this whole topic.

If you must keep the actual file in place, then just go sit in a sequential read loop, and delete unwanted as needed, but do NOT find. Switch on DEFERRED WRITE (FAB$V_DFW) and the Fast delete, and it will be done in a minute also.

What is the key used for find anyway? How do you know what needs to be found and destroyed?

Forget about cluster-size, fill-factor, convert/reclaim and the rest.

Those are all 5% tweaks, when you can do a 500% process improvement.

$ SET RMS_DEFAULT /BLOCK_COUNT=127 /BUFFER_COUNT=255 /INDEXED

Well, Block_count does not count for indexed file, as it is superseded by the cluster size.

Buffer count may well be over-ruled by program defaults, if you are curious then use ANALYZE/SYSTEM ... SET PROC/IN=xxx ... SHOW PROC/RMS=(RAB,FAB,BDBSUM).
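Something along these lines, from a suitably privileged account (the process name here is just a placeholder):

$ ANALYZE /SYSTEM
SDA> SET PROCESS DELETER
SDA> SHOW PROCESS /RMS=(FAB,RAB,BDBSUM)
SDA> EXIT

That shows the FAB/RAB fields and the buffer descriptor summary actually in effect for the open file, so you can see whether the SET RMS_DEFAULT values made it through or were overridden by the program.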

You'd be better off to use: SET FILE/GLOB=1000 before the delete program runs. You can set back to 0 when done, but why not leave it on !?

Like you observe,  Convert/reclaim is a total waste of time. Pretty much always. Normal convert is often faster, and more effective!

For the TUNE_CHECK, maybe you needed the not so obvious option -s=-1 to ask an exhaustive count for the primary key. The default is a 10,000 record sample if I recall correctly. 

Good luck!

Hein

Mark_Corcoran
Frequent Advisor

Re: RMS Performance Query


>You totally got that part wrong

>I suspect the stats were enabled after the delete program opened the file

Actually, no, I didn't.  I was referring to the RMS_STATS being run against the correct file, and as I originally stated "Hein's RMS_STATS reports the following when starting to delete the 185564 records:";  based on the figures that I reported, I would guess that "Deletes: 5389 Updates 0 bytes: 0" indicated that it had successfully deleted 5389 records at that point.

 

>I don't get it. I commend you for the effort you have taken, but you are doing something which is guaranteed to be slow, and the proof is there that indeed it is slow, just looking at the 'hours'.

>Using that 'standard' routine you are asking the system to rewind for every record (to the start of the key of reference), find (index walk) and delete (write) for every record, leaving a bloody mess behind, and taking dozens of locks every step of the way.

Well, for someone whose life's work at Digital was RMS, it might be obvious that it is guaranteed to be slow, but for a large majority of the OpenVMS user base who don't get involved with RMS (indexed files) [if they can avoid it], or understand what the underlying Pascal RTL is requesting from RMS, it wasn't obvious to me that it was guaranteed to be slow and couldn't be fixed by some simple RMS tweak.

The more flexible something is made to be (q.v. RMS), the more complex it normally becomes, with a dizzying array of options that are interdependent on each other, and if one is set marginally wrongly, it can make a massive difference.

It's not for no reason that there is an RMS Reference Manual and an RMS Utilities Reference Manual.

I'm not expecting a free ride of posting a question without having done any legwork or trial & error attempts myself, but even the documentation for RMS isn't particularly intuitive (and no, I don't expect it to be expanded to provide examples of every conceivable combination of parameters and how one might convert from every conceivable format to another).

q.v.:
>Well, Block_count does not count for indexed file

Conceded, but only after getting to page 115 of the copy of the RMS reference manual that I have.

>as it is superseded by the cluster size.

Perhaps in a later version of OpenVMS than being used on this particular system, but the copy of the manual I have makes no reference to the cluster size.

 

 

>taking dozens of locks every step of the way

That's one of the things I couldn't understand - your PPT alludes to increasing the ENQLM, but I can't readily find any explanation in either of the two aforementioned RMS manuals to explain the requirement (maybe the source code or some kind of RMS internals course might explain, but neither my wallet nor my employer's will pay $$$ just to satisfy my curiosity; if one understands why it is doing the (seemingly excessive) locking, one can maybe understand what is "wrong", and hence how to fix it).

 

>Just write a fresh program for the extract of the record to be retained

[deletia]

>Much faster than it took to write this whole topic

Any code change or new code requires time to develop, document, test & go through the change management procedure; I had been hoping that the issue was a "simple" configuration error that could be fairly easily worked around with a couple of commands in DCL (which would still require a change, but significantly less work beforehand), just so that I could get the file back into the correct state and then worry about putting in a fix (in code which I have inherited) to prevent the situation of excess numbers of records recurring in future.

Contributions to the HPE community forums are on a voluntary basis;  if you don't have time, you don't have time - don't feel obligated to respond to people that are clearly idiots with significantly less RMS knowledge than yourself and who haven't worked for Digital, Compaq (and HP?) for a large part of their working career.

 

I have since converted the file to sequential file organization, deleted the records in EDT (yes, some of us prefer it), then reconverted to indexed (even though it was unclear from the fairly terse RMS Utilities Reference Manual whether the SEGn_LENGTH and SEGn_POSITION keywords in the FDL file would be used to derive the keys in the indexed file from the records in the sequential file).

Yes, much, much quicker - in the order of maybe a minute.

 

>What is the key used for find anyway? How do you know what needs to be found and destroyed?

An item has a serial number, and that item can move into different zones before it leaves the building. The record stores the item serial number, the zone number and the date/time that the item entered that zone.

Deletion typically tends to occur for all records for the item some days after it has left the building, to make space in the file (there may be other places where it is deleted and hence why the original developers chose to have a library routine for deleting a single record, but I haven't trawled the code base to find out).

Various people need to be able to report on all items into their zone, or to track zone movements of a particular item, and when trying to list the entries in the affected zone, they have to (by default, unless they choose an additional qualifier on their query command) wade through all these thousands of historic excess records first of all.

In this case, the same serial number & zone number appears in the offending records 185564 times, but with a date/time that is 2 seconds apart each time.

 

>SET FILE/GLOB=1000 before the delete program runs. You can set back to 0 when done, but why not leave it on !?

Unfortunately, the application delete command requires a response from a.n.other process, and that process terminates with a "global page file exceeded error" on startup if SET FILE/GLOB= has a value higher than 50 (the watchdog detached process that creates all other processes, explicitly specifies the process quotas for each process it creates, using values that the original developers hard-coded into the source code).

I didn't want to have to increase/fix these in the watchdog program (another change document, testing, change management process) before I could then make other "simple" DCL-command-only changes (doing the SET FILE /GLOB, then re-running the command).

 

>Good luck!

Thanks.  Someone with slightly thinner skin might have thought that a vituperative CJL had possessed your soul  :-D

 

Mark

[Formerly appearing as woeisme]
Hein van den Heuvel
Honored Contributor

Re: RMS Performance Query

>> RMS_STATS reports the following when starting to delete the 185564 records:";

I think you are correct; I forgot that I made the program subtract the required minimum (write) IOs from the write IO count, to attempt to only show excessive IOs. It helped normalize towards 0, because otherwise a PUT would always show at least 10 IO/ops for a 10-key file, and just 1 for many other (shared) files. At the time I thought that might be too hard to interpret.

Yes, RMS is a bit of a beast, with limited best-usage information, although the Guide to OpenVMS File Applications and many language (Cobol, Basic, ...) user guides do an admirable job. They are all frozen in time though. On the other hand, in the end there are not all that many options, and the action-reaction is easy enough to learn if you get any time to study it.

 >> Yes, much, much quicker - in the order of maybe a minute.

Hmm, not sure how to interpret this. I hope/expect you mean that the whole job was done in a minute, not that it was faster by a minute.

>> In this case, the same serial number & zone number appears in the offending records 185564 times, but with a date/time that is 2 seconds apart each time.

Right. That is what I expected. So a single FIND followed by a read-delete loop until the serial+zone changes  could have avoided the re-positioning. One can even tell RMS to do the compare with the RAB$M_LIM option after the first generic find. So maybe a clone of the delete-one-for-serial to delete-all-for-serial would have been nice, but that's too easy to say looking back.

It also means that a simple SEARCH/MATCH=NOR would have done the job...But EDIT is fine also. I would also have preferred EDT, as long as the records are less than 255 bytes (here 41). 

Anyway, it explains the behaviour. As the deletes progress they leave empty buckets behind. The delete function surely did NOT take the time portion of the key into account just doing a generic key match.

Well, the delete never removes index pointers (convert/reclaim would), even if a data bucket is empty, so for each find RMS went to the original first bucket, found none, read the next bucket, found none. At first all of those are in the cache, but after a few thousand deletes there are a few hundred empty buckets and the caches start to fail. The end of the chain flushes out the beginning. The subroutine could avoid this behaviour by adding an optional time segment >= last_time to the find - just seed it with a low value initially, and leave the last value behind.

If you still have the file which slowed down, try the TUNE_CHECK tool again with -s=-1; it will report that chain of empty buckets. Too bad it did not occur in the first few thousand records, or it would have reported it all along.

>> >as it is superseded by the cluster size.

Oops, major typo. That was meant to be BUCKET size.

>> your PPT alludes to increasing the ENQLM

ENQLM does not make things go faster, it just allows more things to be accumulated, in this case allowing bucket locks for many buckets to be held at NL. Minor.

>> a "global page file exceeded error" on startup if SET FILE/GLOB= has a value higher than 50 

Fix that!

I have not had to look at that error for years, but I think that is just the semi-permanent GBLPAGES sysgen parameter. This may be OpenVMS version dependent. Is this like version 7.3-2?

You can request explicit global buffers in the program, IF you are the first one to do so, or SET FILE/GLO=xxx/SHAR (recent OpenVMS!) ... RUN ... SET FILE/GLO=0/SHAR, but those shenanigans should not be needed. The system should be configured to handle a good few.
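To see where you stand, the relevant SYSGEN parameters can be inspected read-only (RMS global buffers live in page-file-backed global sections, so GBLPAGFIL is worth checking alongside GBLPAGES and GBLSECTIONS):

$ MCR SYSGEN
SYSGEN> SHOW GBLPAGES
SYSGEN> SHOW GBLPAGFIL
SYSGEN> SHOW GBLSECTIONS
SYSGEN> EXIT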

>> Any code change or new code requires time to develop, document, test & go through the change management procedure

Nah..  This was an emergency fix. Take a file copy (backup/ignore=int even), try, verify, ship it.

Carl! ... My hero. :-) :-)

Hein

Mark_Corcoran
Frequent Advisor

Re: RMS Performance Query

Finally getting around to providing an update.

 

During a planned shutdown, the application was shut down, the indexed file converted to a sequential file, that file edited to remove the excess records and saved to a new version, then converted back to an indexed file again, as had been trialled on one of our test systems.

 

All A-OK now (so I now need to fix the bug in the code design that allowed this to occur in the first place).

 

Thanks Hein & others for your input.

 

As I said, I had hoped that some configuration tweak could be made that would have allowed the internal code mechanism for deleting what was only ever expected to be a small number of records, to perform far better;  alas, that was not the case.

 

Mark

 

[Formerly appearing as woeisme]
Mark_Corcoran
Frequent Advisor

Re: RMS Performance Query

Oops, bad form to follow up my own post from a few minutes ago, but the browser initially took me to my last post, so I hadn't seen that Hein had replied to my previous post.

 

Yes Hein, it was in the order of about a minute to do the CONVERT from indexed to sequential, edit the file, save it, then re-convert it (of course, with EDT, you are limited to deleting ~65535 lines at a time, so it required two block deletes, but I much prefer EDT to EVE/TPU).

 

Unfortunately, another application running on another system has behemoth indexed files for which the developers never created any command interface or housekeeping tools to prune the contents.

 

One of these is in the region of 350,000 blocks, and contains about 15 years' worth of history :-(

Alas, the keys in that file (and others) are not entirely text (the date/time in the keys is stored as quadword binary times), so they don't readily lend themselves to the same manner of deleting the records.

 

We are migrating it from physical iron to CharonVAX, so I can at least get the disk that the files reside on "moved" from a notional RF35 to an RF73.  That might give me enough time to create a utility to delete these historic records (or for the system to be replaced) before we run out of disk space...

[Formerly appearing as woeisme]
Dennis Handly
Acclaimed Contributor

Re: RMS Performance Query

> Oops, bad form to follow up my own post from a few minutes ago

 

You can use the Post Options > Edit Reply to copy this post into your previous one.  Then use the Delete to delete the second.