Operating System - OpenVMS

0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

 
Mark Corcoran
Frequent Advisor


Following on from my previous thread about CONVERT /SORT versus SORT + CONVERT /NOSORT, another
problem has arrived in my lap...

A job which post-processes RDB dumped tables (RMS indexed files) to generate a file with records
formed from related parts of these tables, has started to slow down.

[There is one main table file which is sorted in order; the field/column values on each row/record determine whether or not the C program has to check the other dumped table files.]

Unfortunately, there's no evidence to back this up, just people's vague recollection of how quick
they think it used to be.

Looking at the job, the first thing I found is that the output file it generates was very
fragmented - between 5000 and 7000 fragments of 200 to 900 blocks each.

To see if the fragmentation was the main issue, I worked around this by doing the following:

$ SET RMS_DEFAULT /EXTEND_SIZE=65376
$ COPY NLA0: dev:[dir]output_filename.ext /ALLOCATION=11000000 /CONTIGUOUS

The device on which the file is created has a cluster size of 288 blocks, and 65376 was the highest
multiple of 288 possible that was <= 65535.
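(227 x 288 = 65376, whereas 228 x 288 = 65664 would exceed 65535.)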

The COPY pre-allocates a contiguous file for the C program which was updated to open the file in
append mode.

After running the new job, it was obvious from the following:

$ SHOW MEMORY /CACHE=FILE=dev:[dir]main_table.DAT

that whilst the main input table file was being cached, virtually no reads were being serviced by
the XFC from read aheads, and virtually all were read throughs.

I momentarily forgot that whilst the main input table file is an RMS indexed sequential file, it is
being read sequentially by the C program using simple fgets() calls.
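
For illustration, a minimal sketch (not the actual program - the file names, the 225-byte buffer and the
pass-through "processing" are just placeholders) of the access pattern being described: the indexed file
read with plain fopen()/fgets(), and the pre-allocated output file opened in append mode:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char line[226];                                           /* ~225-byte records plus terminator */
    FILE *in  = fopen("dev:[dir]main_table.dat", "r");        /* RMS indexed file, read via the C RTL */
    FILE *out = fopen("dev:[dir]output_filename.ext", "a");   /* pre-allocated contiguous file, append mode */

    if (in == NULL || out == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    while (fgets(line, sizeof line, in) != NULL) {
        /* ... inspect the fields, possibly consult the other dumped tables ... */
        fputs(line, out);
    }

    fclose(out);
    fclose(in);
    return EXIT_SUCCESS;
}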

Thinking that the file was perhaps not sorted in order after all, I TYPE/PAGEd it (given that it
has ~49m records in it, I wanted some control over when my ^C would get picked up), then held the
RETURN key down for a good minute or so.

The records appeared to be in order, but what I did notice was that after this, the SHOW MEMORY
/CACHE indicated that every single read was being serviced as a read ahead from the XFC.

After about 90mins, I killed the job, and found that the XFC cache hit rate was at ~90% (obviously,
it would never get to 100%, because of the initial ~14,000 which were treated as read throughs).

I then ran the job again, but without using TYPE /PAGE on the main table file.

It has now been running for almost as long as the first run, but the cache hit rate is 54%, and
although the read ahead counter value displayed by SHOW MEMORY /CACHE is increasing, so is the
read through counter - approximately 1 in 4 reads end up as READ AHEAD.

Now, I know that this is only 2 individual runs, and hardly what you'd call exhaustive evidence...

However, I'm going to go out on a limb here, and say that without my TYPE/PAGE, either:

a) the C program is largely running ahead of the XFC cache in reading the file contents, so most
reads won't cause sequential read ahead of the file to occur (unless from outside interference,
such as me doing TYPE /PAGE)

or

b) however it is that XFC determines that something is performing sequential reads, it doesn't work
(in this particular scenario).


For what it's worth, the file attributes are this:

Size: 13956192/13956192 Owner: [SYSTEM,*]
Created: 5-APR-2010 11:01:01.32
Revised: 6-APR-2010 18:24:21.06 (4)
Expires:
Backup: 7-APR-2010 02:34:38.75
Effective:
Recording:
Accessed:
Attributes:
Modified:
Linkcount: 1
File organization: Indexed, Prolog: 3, Using 3 keys
In 2 areas
Shelved state: Online
Caching attribute: Writethrough
File attributes: Allocation: 13956192, Extend: 65520, Maximum bucket size: 18
Global buffer count: 0, No version limit
Contiguous best try
Record format: Variable length, maximum 200 bytes, longest 0 bytes
Record attributes: Carriage return carriage control
RMS attributes: None
Journaling enabled: None
File protection: System:RWED, Owner:RWED, Group:RE, World:
Access Cntrl List: None
Client attributes: None

Total of 1 file, 13956192/13956192 blocks.



The last SHOW MEMORY /CACHE command gave the following results:
Extended File Cache File Statistics:

_dev:[dir]table.DAT;1 (open)
Caching is enabled, active caching mode is Write Through
Allocated pages 5122 Total QIOs 144399
Read hits 79682 Virtual reads 144399
Virtual writes 0 Hit rate 55 %
Read aheads 22443 Read throughs 144399
Write throughs 0 Read arounds 0
Write arounds 0

Total of 1 file for this volume

Write Bitmap (WBM) Memory Summary
Local bitmap count: 93 Local bitmap memory usage (MB) 8.40
Master bitmap count: 96 Master bitmap memory usage (MB) 8.27



Is the fact that the Global Buffer Count is set to 0, and/or the fact that the file is an RMS indexed
file being read using the C RTL fgets(), partly to blame here, or is something else going on?

Clearly, I'm reluctant to have a second concurrent job run at the same time as this main job,
simply to TYPE the table file, then be killed after a minute, to ensure that a sufficient
quantity of the file is cached to permit the XFC to service read requests.

If anybody has any thoughts/suggestions, I'd be most grateful.


Mark

[Grrr, hit some sequence on the keyboard, causing IE to go back a page, and lose 90% of this
post, so had to go back and do it from scratch again in notepad...]
Hein van den Heuvel
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Hello Mark,

That sure is a long description, and I could not always follow it the way I would have liked to, but at least we have some pertinent data. Good!
I'll take a first reply to clear some crud, and then try to get to the real problem.


>> [There is one main table file which is sorted in order; the field/column values on each row/record determine whether or not the C program has to check the other dumped table files.]

>> people's vague recollection of how quick
they think it used to be.

Too late now, but sprinkle your programs liberally with LIB$SHOW_TIMER!

>> the output file it generates, was very fragmented

That can certainly cause unpredictable run times. Pre-allocate, perhaps based on input file size, and max extend (64000, 65535, whatever).


>> $ SET RMS_DEFAULT /EXTEND_SIZE=65376

Fine for a process. But too much if done system wide. Slows down tasks like unzipping many little files.

>> $ COPY NLA0: dev:[dir]output_filename.ext /ALLOCATION=11000000 /CONTIGUOUS

Excellent. If contiguous then extend size is irrelevant.
I used to use COPY NL: all the time myself for that purpose.
Since 8.3 I use the inline FDL strings:

$cre/fdl="file; contiguous yes; allo 12345678"/log x.x

>>highest multiple of 288 possible that was <= 65535.

Nice thought/touch, but largely irrelevant. OpenVMS has no choice but to round up.

>> The COPY pre-allocates a contiguous file for the C program which was updated to open the file in
append mode.

Excellent

>> input table file is an RMS indexed sequential file, it is being read sequentially by the C program using simple fgets() calls.

No matter. Those map to RMS SYS$GET calls.

Next step is probably to SET FILE/STAT on the existing files (input and output), and use ANAL/SYS... SHOW PROC/RMS=FSB or my RMS_STATS tool to display all counters.

>> Thinking that the file was perhaps not sorted in order after all

An indexed file is sorted by primary key. No ifs or buts about that.

>>, I TYPE/PAGEd it (given that it
has ~49m records in it, I wanted some control over when my ^C would get picked up), then held the
RETURN key down for a good minute or so.

How crude.
$ perl -pe "last if $. > 10000" main_table.dat > nl:

>> The records appeared to be in order, but what I did notice was that after this, the SHOW MEMORY
/CACHE indicated that every single read was being serviced as a read ahead from the XFC.

As pre-loaded by the program.

>> it would never get to 100%, because of the initial ~14,000 which were treated as read throughs).

Read-throughs just means through the cache, not through to the disk.
Reads that actually go to the disk = read-throughs - read-hits + read-aheads.

See HELP SHOW MEMORY... deep down:
7 Read throughs Number of Virtual Reads that are capable of being satisfied by the extended file cache.



>> Size: 13956192/13956192 Owner:

Is that the table/driver file?


>> Is the fact that the Global Buffer Count set to 0 and/or the fact that the file is an RMS indexed
file being read using the C RTL fgets() partly to blame here, or is something else going on?

Nah.

>> Clearly, I'm reluctant to have a second concurrent job run at the same time as this main job, simply to TYPE the table file, then be killed after a minute, to ensure that a sufficient
quantity of the file is cached to permit the XFC to service read requests.

That's not so clear to me.
Clearly TYPE is a silly tool for this, but you know more than the XFC can guess.
So launching something to pre-read is not that crazy an idea for predictable jobs with critical run-time requirements.

I once created a 'read-ahead-and-keep-ahead' tool, just for that reason.
It would pre-read N buckets' worth of data. It then used an RMS-compatible bucket lock with a blocking AST on the first bucket to detect 'interest in a bucket'. When the AST triggered on bucket M, it would grab a lock for the next (M+1), release M, and read M + N + 1.

>> If anybody has any thoughts/suggestions, I'd be most grateful.

- RMS stats.
- Be sure to watch activity on those other files.
- Engage a professional in this space if it is really critical.

Cheers,
Hein van den Heuvel ( at gmail dot com )
HvdH Performance Consulting

Hein van den Heuvel
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Meant to open with

0% hit rate is perfectly normal for
- files that have not been read/written in a while.
- files that well exceed the cache capacity and are read sequentially
- when nocache is in effect
- when IOs are done larger than the max-cache IO size.
- when concurrent updates are happening on other nodes in the cluster.

Hein
Ian Miller.
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

At present,
- files that well exceed the cache capacity and are read sequentially
looks likely, but how big is the cache on this system, and what aged version of VMS is being used?
____________________
Purely Personal Opinion
Mark Corcoran
Frequent Advisor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Hein:
>Too late now, but sprinkle your programs liberally with LIB$SHOW_TIMER!
I know how much the bean-counters like to have stats, so I always try to make sure I get timing info for various stages of programs (can also be useful for myself too).

Alas, this is someone else's code, developed some time ago, and the concern was more with getting it working than making it perfect ;-)



>>> $ SET RMS_DEFAULT /EXTEND_SIZE=65376
>Fine for a process. But too much if done system wide

Don't worry, it was only for this one job, as a test :-)



>Excellent. If contiguous then extend size is irrelevant.

I'd wondered about this - assuming that the largest contiguous free area on the disk was 130752 blocks, and the RMS extend size had been set to 65376, I'm guessing that if exactly 130752 blocks were required, then:
a) they'd be allocated in two separate extend operations
b) as far as BITMAP.SYS is concerned, the fact that there are two groups of 65376 blocks is irrelevant, because they are "next to each other", so would appear as a single fragment...



>>>highest multiple of 288 possible that was <= 65535.
>Nice thought/touch, but largely irrelevant. OpenVMS has no choice but to round up.

So, if I set the extend size to 65535, and the cluster size was 288 blocks, presumably extending the file should theoretically mean 65664 blocks allocated?

I had guessed that the 65535 limit was as a result of a word being used to store the value, so I couldn't see how 65664 (17 bits) would fit...



>Next step is probably to SET FILE/STAT on the existing files, in an output, and use ANAL/SYS.. SHOW PROC/RMS=FSB or my RMS_STATS tool to display all counters.

I tried the SET FILE/STAT and a MONITOR RMS /FILE=, but to be honest, it didn't reveal very much - the only non-zero counters were the CUR, AVE and MAX $GET Call Rate (Seq).

I knocked up a quick .EXE of my own to effectively do the same as the real one, and this was the MONITOR RMS /FILE output (as a snapshot):

Active Streams: 1 CUR AVE MIN MAX

$GET Call Rate (Seq) 19375.33 4123.63 0.00 21861.00
(Key) 0.00 0.00 0.00 0.00
(RFA) 0.00 0.00 0.00 0.00
$FIND Call Rate (Seq) 0.00 0.00 0.00 0.00
(Key) 0.00 0.00 0.00 0.00
(RFA) 0.00 0.00 0.00 0.00
$PUT Call Rate (Seq) 0.00 0.00 0.00 0.00
(Key) 0.00 0.00 0.00 0.00
$READ Call Rate 0.00 0.00 0.00 0.00
$WRITE Call Rate 0.00 0.00 0.00 0.00
$UPDATE Call Rate 0.00 0.00 0.00 0.00
$DELETE Call Rate 0.00 0.00 0.00 0.00
$TRUNCATE Call Rate 0.00 0.00 0.00 0.00
$EXTEND Call Rate 0.00 0.00 0.00 0.00
$FLUSH Call Rate 0.00 0.00 0.00 0.00


As for the ANA /SYS and SHOW PROC /FSB, that didn't reveal much either:

FSB Address: 00064000
-----------
OPEN: 1. CLOSE: 0.
CONNECT: 1. DISCONN: 0.
REWIND: 0. FLUSH: 0.
EXTEND: 0. blocks: 0.
TRUNCATE: 0. blocks: 0.

FIND seq: 0. key: 0. rfa: 0.
GET seq: 159199. key: 0. rfa: 0. bytes: 18296029.
PUT seq: 0. key: 0. bytes: 0.
UPDATE: 0. bytes: 0.
DELETE: 0.

READ: 0. bytes: 0.
WRITE: 0. bytes: 0.

LOCAL CACHE attempts: 161187. hits: 159198. read: 1989. write: 0.
GLOBAL CACHE attempts: 0. hits: 0. read: 0. write: 0.
GLOBAL BUFFER INTERLOCKING:
GBHSH Intlck Collisions: 0 GBH Intlck Collisions: 0
GBHSH Held at Rundown: 0 GBH Held at Rundown: 0

LOCKS: Enqueue Dequeue Convert Block-ast
Shared file: 0. 0. 0. 0.
Local buffer: 0. 0. 0. 0.
Global buffer: 0. 0. 0. 0.
Shared append: 0. 0. 0. 0.
Global section: 0. 0. 0. 0.
Data record: 0. 0. 0.

XQP QIO: 1.

BUCKET SPLIT (1) : 0. SPLIT (N) : 0. OUTBUFQUO: 0.

DEV1 .. DEV5: 00000000 00000000 00000000 00000000 00000000




>An indexed file is sorted by primary key. No ifs or buts about that.
Ah sorry, I *think* what I meant was that the file is indexed in order, and the records are also stored in order (rather than having a nice sequential index still pointing to "random" disk blocks).



>>> Size: 13956192/13956192 Owner:
>Is that the table/driver file?

Yes, this is the primary input file, just under 14m blocks in size.



>That's not so clear to me.
>Clearly TYPE is a silly tool for this, but you know more than the XFC can guess.
I looked up XFC in the system management manual, and its discussion of XFC detecting sequential reads of same-size I/O requests led me to the VCC_READAHEAD SYSGEN parameter - thinking that perhaps it wasn't set, but alas it was.

On the face of it, it appears that the executable is simply reading from the primary input file sequentially quicker than XFC can detect that that is what is happening, so although XFC is caching the file, it's always behind the executable (unless it gets a head start from something else, whereby the reads from the executable allow XFC to keep on topping up the file into the cache).




>Engage a professional in this space if it is really critical.
perhaps this is not the place to discuss it, but I never heard the story about how you and the Hoff came to part ways with HP - jumped, or pushed? How has the private sector been treating you since?



>when concurrent updates are happening on other nodes in the cluster.
Not the case here - other jobs may happen to read the same primary input file, but certainly during my testing, there was just the one process accessing the file, and it was doing the sequential read.



Ian:
>looks likely but how big is the cache on this system
XFC currently allocated at 2.75GB.

>and what aged version of VMS is being used?
You know me and many other HP customers only too well ;-) 7.3-2 on this cluster.
Hein van den Heuvel
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?



>> I'd wondered about this - assuming that the next highest contiguous block on the disk was 130752 blocks, and the RMS extend size had been set to 65376, I'm guessing that if exactly 130752 blocks were required, then:
a) they'd be allocated in two logical single operations

Yes.

>> b) as far as BITMAP.SYS is concerned, the fact that there are two groups of 65376 blocks is irrelevant, because they are "next to each other", so would appear as a single fragment...

They would appear as a single fragment in the MAP area for the file using them ($ DUMP/HEAD/BLOCK=COUNT=0 ). In the bitmap they would be 2 * 227 adjacent bits.

>> So, if I set extent size to 65535, and the cluster size was 288 blocks, presumably extending the file should theoretically mean 65664 blocks allocated?

Yes indeed. Because VMS has to give you 227 + 1 clusters to satisfy the extend request.
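(65535 / 288 ≈ 227.6, so the request rounds up to 228 clusters, and 228 x 288 = 65664 blocks.)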

>> the 65535 limit was as a result of a word being used to store the value

Correct

>>> I tried the SET FILE/STAT and a MONITOR RMS /FILE=, but to be honest, it didn't reveal very much

IMHO the way MONI RMS presents that data is next to useless.

>>> As for the ANA /SYS and SHOW PROC /FSB, that didn't reveal much either:

FSB Address: 00064000
:
GET seq: 159199. key: 0. rfa: 0. bytes: 18296029.
:
LOCAL CACHE attempts: 161187. hits: 159198. read: 1989. write: 0.

IMHO that indicated a lot. You needed an IO about once every 80 records. So there must have been 80 records to a bucket. Those 1989 IOs would have gone through to the XFC, to be resolved either from a prior read (ahead) or from a real IO.
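(Spelling it out: 159199 sequential $GETs over 1989 buffer reads ≈ 80 records per read; 18296029 bytes / 159199 records ≈ 115 bytes per record, and an 18-block bucket is 9216 bytes, so roughly 80 records per bucket - the numbers hang together.)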

>> Ah sorry, I *think* what I meant was that the file is indexed in order, and the records are also stored in order (rather than having the a nice sequential index still pointing to "random" disk blocks).

Got it. Yes, for records arriving in primary key order both CONVERT FAST-LOAD and Plain-old RMS will allocate in ever increasing adjacent buckets. A minor exception is that if the file needs to grow while doing so, then the new bucket is started in the fresh extend, potentially leaving the tail end of the current extend unused for up to bucket size minus 1. In this case the bucket size divides evenly into the cluster size, so that's not an issue.


>>> On the face of it, it appears that the executable is simply reading from the primary input file sequentially quicker than XFC can detect that that is what is happening

I never really studied the read-ahead for XFC. RMS only does read-ahead for sequential files, not indexed, and for sequential files it 'bursts' reading a bunch, but not keeping ahead. I actually tried to implement that while in RMS engineering but there were gotchas and I had to abandon it at the time.

>> I never heard the story about how you and the Hoff come to part ways with HP - jumped, or pushed?

I can only speak for myself. I received an early retirement opportunity which seemed too nice to refuse. It was a voluntary choice creating optimal (financial) conditions to try working independently for a while. That was October 2005. So far so good!

Regards,
Hein
John McL
Trusted Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Hein, I'm watching this thread with some interest so a question - two actually - for you...

In the second last paragraph of your response immediately above this one you seem to be implying that there's no read-ahead on indexed files but there is for sequential files. Is this correct?

If so, is that set by the file characteristics or by the parameters in the open statement?
Hein van den Heuvel
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Hello John

John >>In the second last paragraph of your response immediately above this one you seem to be implying that there's no read-ahead on indexed files but there is for sequential files. Is this correct?

Only from an RMS perspective: RMS is not reading ahead into its own buffers for an indexed file.
The XFC is blissfully ignorant as to whether RMS is doing an IO from a sequential file or an indexed file, so the XFC can, independently of RMS, trigger a read-ahead into its buffers for RMS to find the data later.
And behind the XFC the controller knows even less, and it can do read-aheads, and behind that the physical disk can be doing read-ahead. So the odds that you'd be waiting for a disk seek/rotation are low!

>> If so, is that set by the file characteristics or by the parameters in the open statement?

For sequential files you have to request RAB$V_RAH on the connect, which is part of the OPEN from an HLL perspective. It is the default for many languages. The number of buffers defines how deep the read ahead goes.
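
For illustration only (a sketch, not production code - the file name is hypothetical), requesting
read-ahead on the $CONNECT from C with native RMS looks something like this:

#include <rms.h>
#include <starlet.h>
#include <string.h>

static struct FAB fab;
static struct RAB rab;

int open_with_readahead(void)
{
    static char name[] = "dev:[dir]some_sequential_file.dat";
    int status;

    fab = cc$rms_fab;                  /* initialise the FAB */
    fab.fab$l_fna = name;
    fab.fab$b_fns = strlen(name);
    fab.fab$b_fac = FAB$M_GET;

    status = sys$open(&fab);
    if (!(status & 1)) return status;

    rab = cc$rms_rab;                  /* initialise the RAB */
    rab.rab$l_fab = &fab;
    rab.rab$l_rop = RAB$M_RAH;         /* request read-ahead */
    rab.rab$b_mbf = 4;                 /* multibuffer count controls how deep it goes */

    return sys$connect(&rab);
}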

The RMS read ahead (on sequential files) can probably disrupt the XFC read ahead recognition. I never experimented with that though.

RMS read ahead on indexed files would not seem too hard to implement, but it was never done nor requested. Regrettably. Again, the XFC may well decide to do the read ahead for indexed files.

I haven't looked at the code, but it would not surprise me if the XFC would find it easier to do read-ahead for IOs which nicely line up with its 16-block cache lines. But for that to happen for an indexed file, many stars need to line up! (Bucketsize 2, 4, 8, 16, or 32. Clustersize a power of 2. RMS primary key data NOT in area 0, or not pre-allocated.)

Hein
Hoff
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Records in an indexed file aren't necessarily adjacent, so there's no direct way to warm up a generic block cache given the current design of RMS. RMS would need to do that, or to provide hints to XFC. Neither of which, AFAIK, exists at present.

Whether Hein's suggested leading-traversal approach might be worth the implementation effort is interesting; I'd want to measure that cache pre-populate scheme.

It would be equally interesting to toss an upgrade or a RAM disk or an SSD at the problem, and measure throughput with that. 66 megabytes isn't all that much data; that'd be close to fitting entirely into the RAM in my cellphone, and would be dwarfed by what I've got stored in the flash. Best case, this application should be limited by the spiral transfer rate of the disk. Or by your RAM disk or SSD bandwidth. Arguably, RMS could just be getting in the way here if you can run from analogous in-memory data structures. (RMS doesn't have the concept of hauling an entire file into memory as one big wad, performing the required operations, and then rolling it all out as a big wad.)

It'd be interesting to compare RMS indexed files to an application built on Apache Cassandra, too. But that's fodder for discussion on another day. And no, I'm not aware of a VMS port of Cassandra.

And after that wall of text...

When I go after RMS files from C, I use this code:

http://labs.hoffmanlabs.com/node/595

And generally not with the file I/O portions of the C RTL.

The C I/O has its share of considerations here; that you can even get at indexed files through a mostly-generic C API is somewhat of a remarkable implementation achievement. But by that same token, don't expect it to be the go-fast implementation. I might well look to haul it all into memory with a few large I/Os.

P Muralidhar Kini
Honored Contributor
Solution

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

XFC will not cache IOs to a particular file in case:

* The IOs done to the file are of a size greater than VCC_MAX_IO_SIZE blocks.

* The file is present on a local RAMDISK.

* The file is accessed cluster-wide and there is at least one node in the cluster that is doing a write IO to the file.

* The file will temporarily not be cached if logical IOs are done to the file or the volume on which the file resides.


XFC ReadAhead -
* XFC does read ahead for a file if the SYSGEN parameter VCC$GL_READAHEAD is set to 1.

* XFC has a read ahead factor of 3, which would mean that when read ahead is being performed on a file, 1 among 4 IOs to the file will be a read ahead.

XFC ReadHits
* Whether the IO is a read-through or a read-ahead, it is still a read IO operation that XFC has to perform, and it is counted in the statistics as an IO.

* The hit rate for the file is calculated as follows -
HitRate = ReadHits / TotalIO

Here, ReadHits is the number of times a read operation was satisfied from the cache, and TotalIO is the number of read operations.

Both "ReadHits" and "TotalIO" include read-throughs as well as read-aheads.

From the information you have provided,
>> SHOW MEMORY /CACHE
>> Allocated pages 5122
>> Total QIOs 144399
>> Read hits 79682
>> Virtual reads 144399
>> Virtual writes 0
>> Hit rate 55 %

IOs to the file are going through the XFC cache and a number of those IOs are being satisfied from the cache, hence we are seeing the hit rate of 55%.
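(Working it through: 79682 read hits / 144399 virtual reads ≈ 0.55, i.e. the 55% shown above.)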

The question is why such a low hit rate?
My suspicion is that there is some other operation on that file (or the volume on which the file resides) that is causing the contents of the file to get deposed (i.e. cleared) from the cache once in a while. This would cause subsequent IOs to the file to be read from the disk (read miss). A couple of obvious reasons for the file depose would be either logical IOs to the file/volume or cluster-wide write operations on the file.


Please provide the following information about the file -
1) XFC statistics from SDA
$ ANAL/SYS
SDA> XFC SHOW FILE/ID=FID_IN_HEX/STATS
SDA> XFC SHOW MEM

NOTE: FID_IN_HEX is the FID of the file (dev:[dir]table.DAT;1) in hex.

2) How big is the IO size issued by the application to the file (i.e. how big are the IOs that the application issues to the file - are they 50 blocks or 100 blocks ....)

3) Is the file accessed cluster-wide? If yes, what type of IO (read/write) is performed on that file cluster-wide, and how frequently?

This information could provide further clues as to why the hit rate is very low for the file.

Regards,
Murali
Let There Be Rock - AC/DC
Hein van den Heuvel
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Good summary Murali. Thanks.

The read-ahead 3x description sounds suspect.
So it sees 3 (adjacent) QIOs and then adds just one read-ahead QIO? How much? Of equal size to the last IO, or some large size like 8 cache lines?

I don't think the hit rate is low.
I kinda expect 0%. The real file is 6+ GB, the primary key data perhaps 5+ GB. That could well be larger than the active maximum cache on an Alpha. Now if it is the same box Mark asked about before then it is a 24-CPU Alphaserver GS1280 7/1300, running OVMS v7.3-2. So that is likely to have 48 GB or more memory, and the cache may be as high as 20 - 32 GB if not actively throttled.
So normally a 6GB file would fit, and running the downstream program shortly (hours?) after the load would find the data in the cache. But other, totally unrelated activities, maybe as silly as SEAR [*...]*.log "FATAL ERROR", could flush out those, or a part of the 6GB, and cause a tremendous slowdown compared to other days/weeks.

Time to close shop for the day!
Cheers,
Hein.
Steve Reece_3
Trusted Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Hi Mark,

I'm assuming that even though you're in a cluster the file is only being read in on one node at any time? If this isn't the case then you'll never effectively cache it in my experience with XFC. You'd need to rely on third party products like PerfectCache or on the raw IO performance of the system and the disk array that's hung off it.

Steve
P Muralidhar Kini
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Hi,

From the information provided,

>> Total of 1 file, 13956192/13956192 blocks.
This is around 7GB.
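(13956192 blocks x 512 bytes/block ≈ 7.1 GB, or about 6.7 GiB.)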

>> XFC currently allocated at 2.75GB.
XFC current size is 2.75 GB

As Hein has pointed out, the file size is bigger than the XFC cache size and hence you cannot have the entire file cached.

As and when the file is accessed, it gets into the cache. But as the file size is larger than the XFC cache, reading data for the large file may have to throw out other data for the same file already in the cache.
Example: when block 50 of the file is read and there is no space in the XFC cache, then XFC may have to throw, say, block 20 of the same file out of the cache in order to make room for block 50.
Subsequent IOs to block 20 would now be a read miss and the data now has to come from the disk. This way the read hits will come down.

Also, certain activities have the potential to thrash the entire XFC cache or depose the data of a file/volume.

Thrash XFC cache
* SEARCH/COPY or any 3rd-party backup operation

Note: VMS BACKUP does not thrash the XFC cache because the IO it performs skips the XFC cache. This is done by specifying the function modifier IO$M_NOVCACHE for the IO that it issues.

Depose file/volume data from cache
* Logical IO to the file/volume
* Cluster-wide write operation

What is the physical memory size of the system?
You can get that from DCL: "$ SHOW MEM/PHY"

XFC is sized at 2.75GB. By default XFC would size itself to be 1/2 of physical memory. So I guess the physical memory would be around 5.5GB. Is that correct?

As a side note, other things to consider from a read-hits point of view are caching at different levels, such as the CRTL and RMS. The CRTL uses buffering and so does RMS with its local buffering. When a file is accessed, if its data is present in the CRTL or RMS cache, then the request will be satisfied from there itself. The request won't come to XFC and the XFC statistics would remain the same.

Regards,
Murali
Let There Be Rock - AC/DC
P Muralidhar Kini
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

>> I'm assuming that even though you're in
>> a cluster the file is only being read in
>> on one node at any time?
>> If this isn't the case then you'll never
>> effectively cache it in my experience
>> with XFC.

Is there any particular scenario that you would like to share? Because the XFC caching behavior in a cluster environment should be as follows:

* Multiple readers -
XFC does cache a file when there are multiple readers of the same file. The file will be cached on all nodes in the cluster.

* One writer, multiple readers -
In case there is only one writer node and multiple reader nodes, then XFC does caching for the file only on the node where the writer is present. On the nodes where the readers are present the file won't be cached.

* Multiple readers/writers -
Where there are multiple writers to a file, XFC won't cache that file cluster-wide, i.e. none of the nodes will cache the file.

Regards,
Murali
Let There Be Rock - AC/DC
Mark Corcoran
Frequent Advisor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Hoff:
>Records in an indexed file aren't necessarily adjacent, so there's no direct way to warm up a generic block cache given the current design of RMS.

In this particular case, the index by definition is sorted by primary key order; the actual data records are present in the file in primary key order too.

(A bit like what I was alluding to in my response to Hein - e.g. having records physically located in random order, but with the index in primary key order, so you quickly find a record in the index, but actually accessing it may involve a "lot" of disk activity, because the records are not logically adjacent on the disk.)

Obviously, XFC can't know whether or not records in an indexed file happen to be stored adjacent to each other in order (but is this something it can guess at, or be told about??)



>It would be equally interesting to toss an upgrade or a RAM disk or an SSD
As is often the case, one team looks after the O/S, and layered products, whereas another team looks after the primary application.

Getting agreement to O/S related changes is often an uphill struggle, and not something that would happen any time soon (lead time for notice of changes, getting all the approvers to approve changes, yada yada).

Unfortunately, this is particularly the case when it can't be definitively stated how much of a difference it would make...



>When I go after RMS files from C, I use this code:
The person who wrote the program originally does now use RMS for accessing RMS-indexed files, but this is some of his earlier code; ideally, it will be changed, but of course, everyone is looking for a quick fix: "X is wrong; change Y to Z and that will fix it, or at least be a workaround, whilst we can schedule recoding the program into the plan"...
Mark Corcoran
Frequent Advisor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Murali:
>XFC will not cache IO's to a particular file in case -
>* IO's done to the file are of size greater
> than VCC_MAX_IO_SIZE blocks.
It's the C RTL being requested to fgets() 225 bytes or stop when a Line Feed character is encountered, whichever comes first; what it actually requests "behind the scenes", I don't know.

>file is present on a local RAMDISK
No

>The file is accessed cluster-wide and there is at least one node in the cluster that is doing a write IO to the file.
No - a single process running on a single node in the cluster accessing it for read only.

>XFC ReadAhead -
>* XFC does read ahead for a file if the SYSGEN parameter VCC$GL_READAHEAD is set to 1.
A little bit of confusion here using SDA symbols and SYSGEN params, but I'm guessing you mean VCC_READAHEAD, in which case I'll just list all the VCC settings:


$ MC SYSGEN SHOW VCC
Parameter Name Current Default Min. Max. Unit Dynamic
-------------- ------- ------- ------- ------- ---- -------
VCC_FLAGS 2 2 0 -1 Bitmask
VCC_MAXSIZE 6400 6400 0 3700000 Blocks
VCC_MAX_CACHE -1 -1 0 -1 Mbytes D
VCC_MAX_IO_SIZE 127 127 0 -1 Blocks D
VCC_MAX_LOCKS -1 -1 50 -1 Locks D
VCC_READAHEAD 1 1 0 1 Boolean D
VCC_WRITEBEHIND 1 1 0 1 Boolean D
VCC_WRITE_DELAY 30 30 0 -1 Seconds D
VCC_PAGESIZE 0 0 0 -1 D
VCC_RSVD 0 0 0 -1 D



>The question is why so low Hit-rate?
Well, actually, this is a funny thing....

When I was running the job yesterday afternoon, after 85 mins, the hit rate was 55%.

I ran it again this morning so that I could get the additional XFC stats from the SDA extension for you, and after 35 minutes, it was 83%.

It therefore looks like it might quite possibly be contention for XFC resources that is causing/contributing to the problem - under heav(y|ier) system loads, the file can't be cached as quickly, either because other files need to be (part) removed from the cache to make space, or there is a delay (timeout?) in XFC servicing read requests for this file.

I'm not actually sure how read requests get to the XFC, so I'm not sure whether or not such timeouts could occur...

Do all read requests go to the XFC first of all, and do they have to wait for the XFC to say "in cache" or "not in cache" before progressing further? (If it doesn't receive a response within X amount of time, does the read "just go to disk" rather than waiting for the XFC to respond?)




>My suspicion is that, there is some other operation on that file (or volume on which the file resides) that is causing the contents of the file to get deposed (i.e. cleared) from the cache once in a while.

What I was observing was that the number of pages of the file being cached was constantly increasing, but the hit rate remained at 0%.

Obviously, I couldn't really tell whether or not some old pages were being dropped out of the cache as new blocks were being added (so, if allocated pages jumped from say 1000 to 1050, it could mean 50 new pages added, or it could mean 80 new pages added and 30 old pages removed).




>Please provide the following information about the file -
>1) XFC statistics from SDA

I've attached a file which shows two sets of XFC SHOW FILE /STAT from SDA - the first is 5 minutes after starting the job (when the hit rate was still 0%), and the second is from ~66mins after the job starts (hit rate=90%)

>2) How big is the IO size issued by the application to the file
C RTL fgets() call, with a max size of 225 bytes, but like I said, I don't know what DEC C is doing under the hood...

>3) Is the file accessed cluster-wide.
No, a single process on one node in the cluster doing sequential reads only - once the file has been created, it is not used by anything other than this job which post-processes it.




Hein:
> I kinda expect 0%. The real file is 6+ GB, the primary key data perhaps 5+ GB. That could well be larger than the active maximum cache on an Alpha. Now if it is the same box Mark asked about before then it is a 24-CPU Alphaserver GS1280 7/1300, running OVMS v7.3-2. So that is likely to have 48 GB or more memory, and the cache may be as high as 20 - 32 GB if not actively throttled.
Yup Hein, it's the same cluster.

>But other, totally unrelated activities, maybe as silly as SEAR [*...]*.log "FATAL ERROR", could flush out those, or a part of the 6GB, and cause a tremendous slowdown compared to other days/weeks.
Well, when the job runs, the file is not in the cache to start with, but the number of cache pages allocated to the file increases as the job runs.

Like I said earlier, I can't really tell whether or not XFC is dropping the pages from the start of the file, as it adds more pages as the file is read by the job.

On the face of it, it didn't appear to be the case, so it simply seemed that XFC was either doing read-behind in comparison to the job, or (if it is possible) the read has a timer driven AST so that if XFC hasn't responded within X time, the read goes "straight" to disk instead...




Steve:
>I'm assuming that even though you're in a cluster the file is only being read in on one node at any time?
One process, on one node, exclusively sequentially reading the file, several hours after it has been created (and expunged from the cache).




Murali:
>As Hein has pointed out, the file size is bigger than the XFC cache size and hence you cannot have the entire file cached.

Having the entire file cached isn't really what we want or need - just a "window" on the bit of the file we are looking at - the file is being read sequentially, so once all of the records from a bucket are processed by the job, the job has no further interest or requirement in those records, and they could happily be expunged from the XFC.


>What is the physical memory size of the system ?
The two A/S GS1280 7/1150 systems each have 56GB, and the GS1280 7/1300 has 48GB


>When a file is accessed, if its data is present in the CRTL or RMS cache, then the request will be satisfied from there itself.
Indeed; since it seems that XFC can in fact detect that read ahead caching is required for this file under the right circumstances (system load?), I'm wondering whether or not it might actually be a better idea simply to have a few large RMS global buffers for the file, to ensure some kind of caching, rather than have the potential of failed read hits on the XFC...
Ian Miller.
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Hein has said elsewhere that the only wrong answer for global buffers is zero, but for a file being accessed by only one process, do global buffers behave the same as having local buffers?

Does the code specify a multi buffer count?
If not then it should pick up values set with SET RMS_DEFAULT so you can experiment.

____________________
Purely Personal Opinion
P Muralidhar Kini
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Hi Mark,

>> A little bit of confusion here using SDA
>> symbols and SYSGEN params, but I'm
>> guessing you mean VCC_READAHEAD,
Yes. I meant the VCC_READAHEAD SYSGEN parameter.
(VCC$GL_READAHEAD was a typo.)

>> Do all read requests go to the XFC first
>> of all,
Yes, in case XFC caching is enabled.
The application would call QIO to perform the IO operation. QIO would then check if XFC is enabled; if yes, it would call XFC to take over the IO. In case XFC is disabled on the node, QIO would not call XFC.
Once XFC is called, XFC does its own set of checks to determine if the IO needs to skip the cache.
Some common scenarios in which XFC decides to skip the IO are:
- Caching is disabled on the volume (MOUNT/NOCACHE)
- Caching is disabled on the file (SET FILE/CACHING_ATTRIBUTE=NO_CACHING)
- Caching is disabled on the IO (using the function modifier IO$M_NOVCACHE in the QIO call)
- The IO size is greater than VCC_MAX_IO_SIZE


>> and they have to wait for the XFC to
>> say "in cache" or "not in cache" before
>> progressing further (if it doesn't
>> receive a response within X amount of
>> time, does the read "just go to disk"
>> rather than waiting for the XFC to
>> respond?

When XFC does a read IO to a file, it first checks if the data is already available in the XFC cache. If yes, then it returns the data immediately. If not, then it performs a read IO to disk. In any case, the IO always goes through XFC.

However, in case the file is shared cluster-wide and some other node in the cluster is doing a write operation to the file, then XFC won't be able to get a lock on the file in the desired mode. In such a case, XFC will convert the read-through to a read-around IO. Read-around means that XFC will make the IO skip the cache and let the IO happen to disk.
As you have mentioned that there is no other node in the cluster doing write IO to the file, this scenario is eliminated.

>> (so, if allocated pages jumped from say
>> 1000 to 1050, it could mean 50 new pages
>> added, or it could mean 80 new pages
>> added and 30 old pages removed).
Yes, that's correct. Allocated pages only indicates how much of the file's data is currently in the cache.

Data that you have provided:
1) File is not accessed cluster-wide
From this we can rule out the scenario where the file is written once in a while from some other node of the cluster.

2) IO size issued by the application
Here also the application does not seem to be doing an IO greater than VCC_MAX_IO_SIZE.

3) XFC SDA data
>> XFC File stats from ~5 mins after job starts (hit rate=0%)
The data here indicates that only a few IOs were satisfied from the cache; for all other IOs XFC had to fetch the requested data from the disk.

>> XFC File stats from ~66 mins after job starts (hit rate=90%)
Here we can see quite a number of reads being satisfied from the cache and hence the read hit rate is higher.

My suspicion is that initially the data is not there in the XFC cache and hence the hit rate is very low. As data gets filled into the cache, subsequent IOs will find the data in the cache and hence the read hit rate increases. Some time later, a logical IO might have been performed on the volume, as a result of which the entire data for the volume gets cleared. The next set of reads to the file then has to fetch the data again from the disk, which would reduce the cache hit rate.

Some questions -
1) Is it the case that every time the application runs, it gets a 0% hit rate in the beginning and the hit rate increases after some point in time?

2) When does the hit rate become 0? Only when the application starts accessing the data for the first time, or at other times also?

3) Are you aware of any logical IOs being performed on that volume? If the disk is mounted cluster-wide, are any other nodes performing any logical IO to the volume?

>> everyone is looking for a quick fix "X is
>> wrong; Change Y to Z and that will fix
>> it, or at least be a workaround, whilst
>> we can schedule recoding the program into
>> the plan...
One suggestion for a workaround would be to increase the XFC size -
The current physical memory size is 56GB (GS1280 7/1150) and 48GB (GS1280 7/1300). You had mentioned that XFC is sized at 2.75GB. One suggestion would be to increase the XFC size from the current 2.75GB to 8GB. XFC is tested with memory sizes up to 8GB and hence you can increase the current size of XFC to 8GB for better performance.


Regards,
Murali
Let There Be Rock - AC/DC
Mark Corcoran
Frequent Advisor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Ian:
>Hein has said elsewhere that the only wrong answer for global buffers is zero, but for a file being accessed by only one process, do global buffers behave the same as having local buffers?

Mea culpa. Global buffers (as I understand them) isn't actually what I meant - I meant the buffers that SET RMS_DEFAULT /BUFFER= refers to.



>Does the code specify a multi buffer count?

Although the C RTL does allow specification of RMS options on the fopen(), it just specifies "r".



>If not then it should pick up values set with SET RMS_DEFAULT so you can experiment.

Great minds think alike - I was just doing some back-of-a-fag-packet calculations on what to use, and will post results back here.

The problem of course is that between tests, you have to wait for any cached part of the file to be expunged (either through normal system load, or to force it using something like "SEA [...]*.* blah") before you can test again.




Murali:
>Caching is disabled on Volume
Not in this case (obviously, otherwise none of the file would appear in it :-)

>Caching is disabled on file
Again, not in this case ("Caching attribute: Writethrough")

>Caching is disabled on IO
It is an fgets() call in the C RTL, but I wouldn't have thought that it would do any disabling.

>IO size is greater than VCC_MAX_IO_SIZE
Unless fgets() is doing something weird, when told to read 225 bytes.


>Sometime later, a logical IO might have been performed on the Volume as a result of which entire data on the volume gets cleared.
>Next set of reads to the file has to now fetch the data again from the disk, this would now reduce the cache hit rate.

I'm not certain what you mean in this context by a logical IO - can you give some examples?

I can understand that dismounting the volume (or a member of its shadow set) could cause this issue.

However, could "SEA [...]*.filename_type blah" really do this?

I'm not sure whether or not you mean this is a logical IO which would cause that volume's cache contents to be expunged...
...or if there is a per-volume limit in the XFC, and that depending on that limit (and the size of the files being SEArched), this would cause the existing volume's cache contents to be expunged, to make way for those files that SEARCH is processing?



>1) Is that every time the application runs, it gets 0 hit rate in the beginning and the hit rate increases after some point of time.

Hmm, unfortunately, there are no statistics available from previous daily runs (unlike many of the other jobs, it actually runs at 09:00, so I don't need to log in in the middle of the night to check).
However, from my manual testing, this appears to be the case.



>2) When does the hit rate become 0. Only when application starts accessing the data for the first time or some other time also.
From my manual tests, it only appears to be when it starts accessing it for the first time (and where the start of the file is not in the cache), that the hit rate is 0.

I'm not sure how the cache hit reporting code works, but the only way for the rate to drop to 0% whilst the job runs, would be due to mathematical rounding...

(there will have been some successful hits, so successes/attempts would always yield a non-zero value unless you round it down.)



>Are you aware of any logical IO's being performed on that volume.
>If the disk is mounted cluster-wide then, are any other nodes performing any Logical IO to the volume.

It depends on what precisely you mean by logical IO.

The volume is mounted cluster-wide.

Normally, nothing else creates files on this volume; the only other activity may be a backup or defragger job that runs overnight, but it should be finished by 09:00 when this job runs.
Ian Miller.
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

If there was a job before this one which accessed lots of other files on the same disk then the file in question is not going to be in the XFC.

I guess your real aim is to reduce the elapsed time of this job that processes the RDB dumped tables. What has happened in previous runs is only useful in perhaps helping you to reduce the time for future runs.

You can specify RMS options on the C fopen.

If there are no RMS options specified now then you can experiment with
$ SET RMS/INDEX/BUFFER=3/BLOCK=
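
As an illustration of that (a sketch only - the file name and the value are made up), the RMS options
can also be passed as extra arguments on the DEC C fopen() call itself, e.g. the multibuffer count:

#include <stdio.h>

int main(void)
{
    /* "mbf=8" asks the C RTL/RMS for 8 local buffers on this one file,
       roughly the per-file equivalent of SET RMS_DEFAULT /BUFFER=8;
       other creat()-style keywords ("mbc", "rop", "deq", ...) work the same way. */
    FILE *in = fopen("dev:[dir]main_table.dat", "r", "mbf=8");

    if (in == NULL) {
        perror("fopen");
        return 1;
    }
    /* ... fgets() loop as in the existing program ... */
    fclose(in);
    return 0;
}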

I wonder about the physical layout of the file and could CONVERT help.

logical I/O - different from the usual virtual I/O, which addresses a file as an array of blocks starting at 1 - logical I/O addresses a disk as an array of blocks starting at 0. Unlikely, although I wonder about the defrag job - has it finished when this job starts?
____________________
Purely Personal Opinion
Hoff
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

XFC and the disk caches are the cache schemes of last resort; they're effective and fast but are inherently also fairly "stupid" about what's going on. They just don't have the visibility into RMS and particularly into the application.

For larger RMS indexed files, it's common practice to size for and to try to cache the whole of the index structures in global buffers, and to then let the data blocks be accessed either through XFC or from disk. That is to work around the smaller cache sizes that are available.

For this case, can you haul the whole file into memory and run it from, say, a file-backed section? With what you've posted, this really looks like you're getting slammed by the I/O and XFC and RMS paths, and (as a testament to the folks that have and are working on this stuff) I doubt there is a substantial (generic) improvement lurking here.

Given you're on an Alpha, consider not implementing that read-ahead cache, but haul the whole file in, unpack it into 64-bit memory once, and pound on it from there. Flush pages from the section for checkpoints, or other such.
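
A rough sketch of that idea (assuming the data really does fit the process's 64-bit address space and
quotas, e.g. a /POINTER_SIZE=64 build with a generous page file quota; the file name is hypothetical):
read everything once, then work purely from memory.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char line[226];
    size_t count = 0, capacity = 1024;
    char **records = malloc(capacity * sizeof *records);
    FILE *in = fopen("dev:[dir]main_table.dat", "r");

    if (records == NULL || in == NULL)
        return EXIT_FAILURE;

    while (fgets(line, sizeof line, in) != NULL) {
        if (count == capacity) {                              /* grow the pointer array */
            char **grown = realloc(records, 2 * capacity * sizeof *records);
            if (grown == NULL) return EXIT_FAILURE;
            records = grown;
            capacity *= 2;
        }
        records[count++] = strdup(line);                      /* keep a private copy */
    }
    fclose(in);

    /* ... every subsequent pass works on records[0..count-1] in memory,
       without touching RMS or the XFC again ... */

    return EXIT_SUCCESS;
}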

Or as has been mentioned in a couple of the replies, look to get faster hardware. For this, I'd start with the I/O path (RAM disk being a common choice), but as nice as an EV7 is, the Itanium boxes are newer, and also with newer and faster I/O paths. Any of the current Itanium boxes are likely faster than the AlphaServer GS1280; that cross-over happened a few years ago. The FC HBAs and the storage controllers can also be a potential limit; the 2 Gb stuff is pretty slow.

The speeds and feeds on the QuickPath processors are way past those of the EV7 processors, too.

If the code is sufficiently portable, try it on a ProLiant x86 Nehalem EP box. Or on a ProLiant Nehalem EX or Integrity Tukwila box if and as and when those become available. (Check with HP about that.)

And check the free memory and the working sets and the faulting; if they even look low, toss another bank of RIMMs into the box or bump the working sets and the quotas. Or both.
Hein van den Heuvel
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

We certainly sparked some interesting conversations here, did we not?

At this point I am firmly convinced that the trigger for this all was a significantly reduced XFC size availability.

From the data Mark attached, you can see how the maximum is in fact the default of 1/2 memory = 28GB, and that the peak usage was close to that @25 GB, but the current usage is just 3GB.

Other active memory usage is limiting the XFC expansion... or you hit some 7.3-2 XFC code glitch.
Surely when this process ran well previously, it managed to cache the large file, as it is nicely in that in-between range of being able to be cached sometimes, but not all the time (now).

That would explain why it is hurting now and not before, and Mark will have to focus on understanding the current memory utilization.

Maybe, just maybe, this is a case of 'Ein Reboot macht immer gut' (a reboot always does good).

All the other angles are certainly interesting and worth understanding, but I think they are all status quo. Nothing relevant changed.

Mark, in reply to Hoff>> In this particular case, the index by definition is sorted by primary key order; the actual data records are present in the file in primary key order too.

So you'd really expect the linear access pattern, with read-ahead potential.
If you really want to dig into this, then you could use my attached TUNE_CHECK tool.
Run it with -s=-1 -a=xxx.log
This will produce a report:
"hit=.., series>5=1,max= XXX ,avg= XXX"
That's your indicator of adjacency.

Note, by default the attached V6 of the tool uses an IO size 1 block larger than VCC_MAX_IO_SIZE in order to avoid polluting the cache. If you reduce that (-i=127) to the max, and use -s=-1 -k=0, then it is the fastest way you can pre-load the file's primary data buckets into the cache.

>> use RMS for accessing RMS-indexed files

Switching to native RMS will give you easier control, but you can only expect it to slightly reduce usermode CPU. I would not expect breakthrough performance potential, comparing 'fgets' and sys$get to read a whole file.

Mark to Murali>> It's the C RTL being requested to fgets() 225 bytes

That's two layers away from the issue.
The IO size equals the BUCKET SIZE (18 blocks here). Period.

>> not actually sure how read requests get to the XFC,

The XFC gets first dibs for file IO. It sees all, it serves all. No timeouts/delays. It does not hand off. It resolves and answers.

>> Having the entire file cached isn't really what we want or need - just a "window" on the bit of the file we are looking at - the file is being read sequentially,

I hear you, and it is not an uncommon desire, but there is no provision to accommodate this. That would be a wonderful thing to have in combination with a good read-ahead.

Mark>> have a few large RMS global buffers for the file,

Correction needed. The size of a global buffer is always the maximum bucket size for the file. No choices. You can choose a few, or many of those, but not 'a few large'.

Ian in reply to Mark>> Hein has said elsewhere the only wrong answer for global buffers is zero but

Ya, with random access and some concurrency.

But... here RMS is asked to visit a bucket over and over, until all records are returned, then it moves to the next, never to look back. Just a few (the default) local buffers will satisfy that need just fine.
Exclusive access and local buffers is optimal for this process.

>> Although the C RTL does allow specification of RMS options on the fopen(), it just specifies "r".

In which case C just picks up the SET RMS process/system defaults.
If the process is truly just doing sequential reads, then 1 buffer is all you need, and the defaults will be fine.

>> you have to wait for any cached part of the file to be expunged

I use a little program to perform a Logical Block write to the volume with files of interest. That flushes any and all file on that volume out of the cache.

You can also explicitly set the max cache size down to minimum, wait some / set cache/reset , and set back up to -1 again.

Mark>> I'm not certain what you mean in this context by a logical IO - can you give some examples?

I'll post my little program after the signature.

>> could "SEA [...]*.filename_type blah" really do this?

It will put the pressure on, and it will flush old blocks when running out of room, but it is not sufficiently controllable and high overhead.

Cheers!
Hein

------- flush XFC cache for TEST_DEVICE logical ----
/* Probably want to change the filename to include [000000], and accept a
** command line argument.
*/
#include <rms.h>
#include <iodef.h>
#include <starlet.h>
#include <string.h>
struct FAB fab;
struct XABFHC xab;
main()
{
/* This program will invalidate all cached vbns for all files on a selected
** volume. The VMS VBN cache (VIOC/XFC) has no mechanism to associate an LBN
** back to a VBN and thus plays it safe by invalidating all cached files
** for that disk. To get a valid LBN on a disk, the program creates or
** opens a contiguous file for which RMS will return the LBN for VBN 1 in
** a provided XABFHC.
**
** Needs LOG_IO priv, and TEST_DEVICE must point to the device to be nuked.
** Have fun, Hein van den Heuvel 1993
*/
int stat;
short iosb[4];
char buf[512] = "No more cache";
char name[] = "TEST_DEVICE:NOCACHE.TMP";

fab = cc$rms_fab;
xab = cc$rms_xabfhc;
fab.fab$l_fop = FAB$M_CTG|FAB$M_CIF; /* need contiguous file */
fab.fab$b_fns = strlen(name);
fab.fab$l_fna = name;
fab.fab$l_alq = 1;
fab.fab$l_xab = &xab;

stat = sys$create( &fab);
stat = sys$close (&fab);
fab.fab$l_fop = FAB$M_UFO;
stat = sys$open( &fab);
if (!(stat&1)) return stat;
stat = sys$qiow ( 0, fab.fab$l_stv, IO$_WRITELBLK, iosb, 0, 0,
                  buf, 512, xab.xab$l_sbn, 0, 0, 0);
if (!(stat&1)) return stat;
return iosb[0];
}


P Muralidhar Kini
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Hi Mark,

I had another look at the "SDA>XFC SHOW MEM" data that you have provided:

>> Current Maximum Cache Size : 30064771072 ( 28.0 GB)
>> Total Allocated Cache Memory : 3220861760 ( 3.0 GB)
>> Peak: 27620469824 ( 25.7 GB)
Just a brief description before we analyze the above statistics.

By default XFC can grow up to 1/2 the physical memory size; if the physical memory in a system is around 10GB, then XFC can grow to a maximum of 5GB. However, XFC does not allocate that memory up front; it grabs memory from the system on an as-needed basis. As and when it has to store more data, it requests memory from the system, and once that memory is granted, the allocated cache memory increases.

Also, when the system is low on memory, it asks XFC to release some; at that point XFC acts like a good citizen, deposes (clears) data from the cache, and returns memory to the system.

The above statistics indicate that XFC memory can grow to a maximum of 28GB (i.e. the system must have around 56GB of physical memory). The interesting part is that the allocated cache memory is only 3GB while the peak value is 25.7GB (i.e. at some point in time XFC had around 25.7GB allocated). I was expecting the allocated cache memory to be more than 3GB.

How is the physical memory usage on the system? Are a lot of heavy applications running on it, and can it get memory-starved at times?

If it does get memory-starved, the system will ask XFC to trim down, and XFC then has to clear contents of the cache. This can be one reason for XFC to throw the data for a particular file/volume out of the cache.

>> I'm not certain what you mean in this context by a logical IO - can you give some examples?

Virtual IO -> IO done to a VBN of the file (IO$_READVBLK, IO$_WRITEVBLK function codes in a QIO).
Logical IO -> IO done directly to an LBN on the volume (IO$_READLBLK, IO$_WRITELBLK function codes in a QIO).

When logical IOs are done, the file system won't be aware of them, so consistency of the data cannot be guaranteed if there are both logical and virtual IOs to the same file. However, XFC is logical-IO aware: it guarantees consistency by deposing the data for the affected file/volume the moment it detects a logical IO to that file/volume.
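To make the distinction concrete, here is a minimal sketch of the two classes of IO at the $QIO level (device and file names are placeholders; the logical read needs LOG_IO privilege):

#include <rms.h>
#include <iodef.h>
#include <starlet.h>
#include <descrip.h>

int main(void)
{
    struct FAB fab = cc$rms_fab;
    unsigned short chan;
    short iosb[4];
    char blk[512];
    int stat;
    $DESCRIPTOR(dev, "TEST_DEVICE:");

    /* Virtual IO: open the file "user file open" style to get a channel in
     * fab$l_stv, then read VBN 1.  XFC sees this and can cache/serve it. */
    fab.fab$l_fna = "TEST_DEVICE:[000000]SOMEFILE.DAT";
    fab.fab$b_fns = sizeof "TEST_DEVICE:[000000]SOMEFILE.DAT" - 1;
    fab.fab$l_fop = FAB$M_UFO;
    stat = sys$open(&fab);
    if (!(stat & 1)) return stat;
    stat = sys$qiow(0, fab.fab$l_stv, IO$_READVBLK, iosb, 0, 0,
                    blk, 512, 1 /* VBN */, 0, 0, 0);
    if (!(stat & 1)) return stat;

    /* Logical IO: assign a channel to the device and read LBN 0 directly.
     * The file system is not involved, which is why XFC deposes its cached
     * data for the volume when it sees logical IO. */
    stat = sys$assign(&dev, &chan, 0, 0);
    if (!(stat & 1)) return stat;
    stat = sys$qiow(0, chan, IO$_READLBLK, iosb, 0, 0,
                    blk, 512, 0 /* LBN */, 0, 0, 0);
    return stat;
}

Hein's flusher earlier in the thread uses the write form of the same logical-IO path (IO$_WRITELBLK) precisely to trigger that depose.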


>> Hmm, unfortunately, there's no statistics available from previous daily runs (unlike many of the other jobs, it actually runs at 09:00, so I don't need to log in in the middle of the night, to check).

Please provide the output of the following SDA command:
$ ANAL/SYS
SDA> XFC SHOW HISTORY

This has data about the XFC cache hits at different points of time in the day, and the buffer should be big enough to hold 3 days' worth of data. You can execute the above command to get an idea of how the read hits fluctuate throughout the day.


>> I'm not sure how the cache hit reporting code works, but the only way for the rate to drop to 0% whilst the job runs, would be due to mathematical rounding...

Let me rephrase my question. Once the cache hit rate increases to 90% after 66 mins (as per the data you have given), it would be interesting to know from what point onwards it starts coming down. Does the hit rate keep increasing and decreasing, or does it, after the application has run for some time, decrease and then stay constant at some lower value?


>> Normally, nothing else creates files on this volume; the only other activity may be a backup or defragger job that runs overnight, but it should be finished by 09:00 when this job runs.

I guess you are referring to VMS BACKUP here. The IOs that VMS BACKUP performs specifically skip the XFC cache, so it won't thrash the cache. Is this VMS BACKUP or a 3rd-party backup application?

The defragger in any case should not have anything to do with data already in XFC.

>> However, could "SEA [...]*.filename_type blah" really do this?

Yes. SEARCH and COPY are operations that can thrash the XFC cache. Because these commands recursively access a lot of files (i.e. the contents of those files), the data they read gets copied into the XFC cache. If the cache is full, then data already in the cache, however important, will get cleared out to make room for the new COPY/SEARCH data.

Regards,
Murali
Let There Be Rock - AC/DC
Hoff
Honored Contributor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Ten gigabytes of main memory is often under-configured when you want "go-fast."

Laptops are now commonly configured with four or more gigabytes.

Samsung started shipping 32 gigabyte DDR3 RDIMMs last month; that's 32 gigabytes with a single stick of RDIMM memory. (OK, that's not cheap memory, but it'll continue to push down the cost of the 8 GB and 16 GB sticks.)

You're running with RDRAM RIMMs here, which are smaller, older and, yes, individually offer comparably lower bandwidth than those DDR3 RDIMMs. About half the bandwidth, based on a quick look at the specs.

For grins, I might try this on a mid-vintage or newer Integrity box. I'd be tempted to try this on one of the under-US$1000 Integrity boxes that have been showing up on the used-equipment market, load it up with some DDR2-class (cheaper) memory, configure a RAM disk, and see where that lands you for performance. (I'd suggest looking for some PCIe 3 Gbps or preferably 6 Gbps HBAs for the I/O path out to some matching SAS drives, but that's extra cost for testing; DDR2 and a RAM disk will get you best speed.) With a faster I/O path, more memory and faster processors, you might well find the Montecito- and Montvale-class Integrity boxes outrunning your AlphaServer GS1280 for this application, too.
Mark Corcoran
Frequent Advisor

Re: 0% read hit rate on XFC cache for RMS indexed file being read sequentially using C RTL fgets()?

Hoff:
>For this case, can you haul the whole file into memory and run it from, say, a file-backed section?

I'd need to spend some time analysing memory usage on the 3 nodes, to consider whether or not this might be a possibility, but it's certainly something I'll look at.



>Or as has been mentioned in a couple of the replies, look to get faster hardware.

As most of us working for VBCs are probably aware, even before the credit crunch, there was always a mantra about saving money, not spending it (yes, yes, I know, sometimes you need to spend in order to save, but unfortunately, bean counters don't see it this way).

The cluster has only recently (within the last 2 years) had some significant hardware changes to it, in an effort to improve performance, but any uplift that there was, was either marginal, or very short-lived.

One could question whether it is bad overall application design (what I've been looking at is simply some .COMs that are used to produce data files that another system then reports on), incorrectly configured RDB or inappropriately designed table (index) layouts and relationships, badly tuned VMS, or a system that was never designed with the current capacity requirements in mind.

As regards the application (of which there are many subsystems), seeing how just *some* of the code works, I know it's not the way I'd do it.

However, the application has been in existence for at least the 10yrs I've been with the company, and probably many more before that; I don't know the original programmers' general (and VMS-specific) skills at the time of development - whether or not some of the code only works the way it does because they didn't know any better.

The developers were outsourced, and the development was then transferred to another company, different from the one the original developers had been outsourced to...

There has been constant talk (even 10 years ago) of replacing the system with a more modern off-the-shelf system (probably U*ix or L**ux-based), but that has been dragging on, because of course, any O.T.S. system won't do half of the stuff the current system does (though arguably, the core part of what it does might be "better" than the current system).

I would be all for getting someone in (such as yourself and/or Hein) to analyse the system performance, but unfortunately, I don't control the purse strings, and those that do, had a hard enough job making a case for the recent-ish hardware improvements.

Inevitably, if we were to ask to get someone in, there would be questions as to why the team that looks after the H/W, O/S and layered apps aren't already doing this, and would put a number of noses out of joint (internal politics etc.)



>If the code is sufficiently portable
For this one particular "job", it is just a single .COM file, plus a fairly basic C program, and a number of (RMS-indexed) RDB table unload files.

As for the main application, porting it to another architecture would likely be so long and expensive, it wouldn't be considered.

It's certainly possible that we could try moving the code for this job onto a different architecture, but I don't know whether or not it might cause network bandwidth issues in copying the large files onto that box on a daily basis (or whether time saved in processing is actually offset by the time spent copying files across the network).



>And check the free memory and the working sets and the faulting

Will do so this morning, when the job runs, but I'm pretty sure it's not an issue.





Hein:
>At this point I am firmly convinced that the trigger for this all was a significantly reduced XFC size availability.
[deletia]
>but the current usage is just 3GB.
Looking on the 3 nodes in the cluster, the two older machines currently have percentage-wise about 33% physical memory free (but this amounts to about 18GB), and the newer one has about 20% free (~10GB).

So, percentage-wise, not necessarily a lot, but for most processes, certainly sufficient.



>Surely when this process ran well previously, it managed to cache the large file, as it is nicely in that in-between range of being cacheable sometimes, but not all the time (now).
>That would explain why it is hurting now and not before

Possibly, but let me come back to this ;-)



>All the other angles are certainly interesting and worth understanding, but i think they are all status quo. Nothing relevant changed.

I'll be coming back to this, particularly the last sentence :-D



>If you really want to dig into this, then you could use my attached TUNE_CHECK tool.
Many thanks for this and your XFC cache flusher, very much appreciated.





Murali:
>How is the physical memory usage on the system? Are a lot of heavy applications running on it, and can it get memory-starved at times?

It's not something I've had occasion to look at - there's certainly plenty of processes all with their own memory footprints, but I don't believe that the system gets memory starved; I'll see if I can perform some analysis on it over the course of a week, to see the highs and lows.




>If it can get memory starved then system would request XFC to trim down as a result of which XFC has to clear the contents of the cache.
>This can be one reason for XFC to throw out data for a particular file/volume out of the cache.

Indeed, but I don't think this is what is happening (I will come back to this).



>Please provide the output of the following SDA command
>$ANAL/SYS
>SDA>XFC SHOW HISTORY
I was going to say that I don't believe the system gets starved of memory, until I looked at the output from this (attached).

Admittedly, this is just for one of the nodes in the cluster (there is a tendency to move certain jobs/processes onto the node on which a DB is, or appears to be, lock-mastered, to improve performance, rather than waiting on inter-node RDB locking etc.)

However, it does show that the cache drops in size to ~400MB in places.



>Is this VMS backup or 3rd party backup application?

I've been informed that the volume does get backed up, and it is backed up using Legato.




Now, the bit that I've been alluding to, where I kept on saying I'd "come back to this"...

I'm WFH today, and woke up with a Eureka moment.

My previous thread was on the subject of SORT + CONVERT /NOSORT versus CONVERT /SORT, and I eventually came to an understanding as to why the first was better than the second.

So, I had been writing and testing a command file that would take parameters to tell it which table from which database to dump, which columns, output file name, whether or not it needed to be converted into an RMS-indexed file etc.

This one command file would replace initially about 23 other ones (that were used for individual tables == 23 places to change things...)

Of course, I don't get to spend all of my time on this, and because I was adding in significant levels of error/"exception" handling and comments, it has been taking some time to complete (I'm now half way through testing it) - I won't dare say how many lines it now encompasses.

Anyway, in the meantime, because there was one table that was having that problem, one of my colleagues in another team had used the validated solution (SORT + CONVERT/NOSORT) in the existing command file, simply to work around the problem.

I believe that this has in fact now led to the problem that we're currently seeing with the XFC cache...

The output file from that CONVERT command is the input file used by the reporting job that is having the XFC "issue".

I believe that in the past (when we were doing CONVERT /SORT), because it took so long to run, there was still a sufficient amount of the file in the XFC when the second job ran that a large proportion of the reads successfully hit the cache.

Now, the indexing job finishes some 4-5 hours before this other job starts, and (as I observed, when I logged on after my Eureka moment), the file was no longer in the cache.

Issues over the XFC cache size aside, I was thinking that the real solution here was for the indexing job to call this report-generating job after it has finished.

[Not simply bring the report job's start time forward, because if the indexing job is delayed, or has failed, then the second job's input file may not exist, or may still be being generated]

However, it transpires that some of the other tables used by the report job, aren't finished being unloaded and/or indexed until much closer to the report job's run time (I've not looked into it, so it may simply be a case of needing to convert their indexing to SORT + CONVERT/NOSORT too).

If we can't bring forward the report job because other table files are not yet ready, there is possibly another solution...

As part of indexing this main table, we generate a sorted version of the RMU unload file, then CONVERT it, and delete the original unsorted sequential RMU unload file, but we do keep the sorted sequential version of the file.

So, we could use this as the input file for the C program in the reporting job, and since it isn't an indexed file (and for that program, doesn't need to be), I think XFC will correctly detect the sequential read-ahead usage we are making of it, and more than likely have what we want cached.

I'm trialling this now, to see what results we get, and will report back.