Re: Same job different direct i/o and Cpu, test vs prod

Björn E Rydén · ‎11-02-2010

Hi all,
Hope someone can point me to what to look for. I'm not a guru.

I've got a production cluster consisting of 2 Alpha DS25 with OpenVMS 8.3, connected to two EVA8000, using volume shadowing.

We have a nightly batch job that loads data into an RDB database (7.2-201). The batch receives some infiles, creates tmp-tables, loads the tmp-tables (rmu/load), and then runs some SQL to insert rows from the tmp-tables into the production tables. The run time of this job has increased over time because of bigger infiles and more data in the production tables.

Now I have tried testing this in our test environment: 3 clustered Alpha DS15s, same operating system, connected to the same SAN.

When I run the exact same batch in test (with a copy of the latest backup of the prod database, and the same infiles), it takes 20 min of CPU time vs. 100 min in prod, and test does a third of the DIO of prod, according to the Accounting information.
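
To put those accounting figures side by side, here is a minimal Python sketch. It uses only the numbers quoted above (20 vs. 100 CPU minutes, and test doing a third of prod's direct I/O). The function name is illustrative, and because the absolute I/O counts are not given, the DIO values are relative units rather than real counts.

```python
from fractions import Fraction

def prod_to_test_ratio(prod, test):
    """Return how many times larger the prod figure is than the test figure."""
    return Fraction(prod) / Fraction(test)

# CPU minutes as quoted from the accounting records.
cpu_ratio = prod_to_test_ratio(100, 20)

# Absolute direct I/O counts are not given; only "a third in test" is known,
# so relative units are used here (test = 1, prod = 3).
dio_ratio = prod_to_test_ratio(3, 1)

# CPU cost per unit of direct I/O in prod, relative to test.
cpu_per_io = cpu_ratio / dio_ratio

print(f"CPU: prod uses {cpu_ratio}x the CPU time of test")
print(f"DIO: prod issues {dio_ratio}x the direct I/O of test")
print(f"CPU per I/O: prod is {float(cpu_per_io):.2f}x test")
```

Read this way, the CPU time grows faster than the direct I/O count, so the extra CPU is probably not explained by the extra I/O alone.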

The user UAF parameters for running the batch job are the same or higher in prod than in test.

The prod db is newly tuned using dbtune and the disk is not fragmented at all.

I'd be very happy to get some hints about possible causes to look at...

Best Regards
Björn R

Hein van den Heuvel · ‎11-02-2010

>> Hope someone can point me to what to look for. I'm not a guru.

Björn,

You started out pretty well, with lots of pertinent information. Thanks.
It looks like you made sure the most obvious potential sources of difference were taken into account.

Raw CPU speed differences between a DS15 and a DS25 cannot explain what you see, but what about memory? Similar physical memory and similar memory pressure, or does the test box perhaps have much more memory to spare?
Are the processes allowed to use the memory properly on prod?
Check RDMS$BIND_WORK_VM on both systems, for example.

Good thing you mentioned UAF quotas, but you may also need to check the sysgen PQL settings, notably the minimums, which may give processes much higher quotas than the UAF suggests.

Or just watch what they use with SHOW PROC/ID or the RMU alternative, the PROCESS ACCOUNTING screen.
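
The PQL point can be illustrated with a small Python sketch. It assumes the simplified rule that a process ends up with the larger of its UAF value and the corresponding PQL_M* minimum. The quota values below are made-up placeholders, not figures from either cluster, and the helper names are hypothetical.

```python
def effective_quota(uaf_value, pql_minimum):
    """Simplified model: the PQL_M* minimum overrides a smaller UAF value."""
    return max(uaf_value, pql_minimum)

def compare_effective(uaf, pql_min):
    """Return {quota: effective value} for every quota named in the UAF dict."""
    return {name: effective_quota(value, pql_min.get(name, 0))
            for name, value in uaf.items()}

# Hypothetical values for illustration only.
prod_uaf = {"WSEXTENT": 65536, "BYTLM": 200000}
test_uaf = {"WSEXTENT": 32768, "BYTLM": 200000}
prod_pql_min = {"WSEXTENT": 4096, "BYTLM": 100000}     # PQL_MWSEXTENT, PQL_MBYTLM
test_pql_min = {"WSEXTENT": 131072, "BYTLM": 100000}

prod = compare_effective(prod_uaf, prod_pql_min)
test = compare_effective(test_uaf, test_pql_min)
for name in prod:
    print(f"{name}: prod={prod[name]} test={test[name]}")
```

With these placeholder numbers the test process would get the larger working set extent even though its UAF entry is lower, which is why comparing the UAF records alone can be misleading.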

Using a copy of the DB suggests the same indexes and buffer settings and so on.

You may have to treat this as a regular performance issue with the added advantage of having a comparison system.
Notably check out the RMU (active user) Stall Messages and so on.

I would focus on the difference in DIRIO more than on the CPU difference, hoping that they are closely related anyway, and because DIRIO is often more tangible: which files have more IO? TMP? RUJ? DB?
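
As a rough illustration of that "which files have more IO" question, the Python sketch below ranks file categories by how many more I/Os prod issued than test. The per-category counts are hypothetical placeholders chosen only to match the roughly 3:1 overall ratio reported; real numbers would have to come from the two runs themselves, and the function name is an assumption.

```python
def rank_io_excess(prod_counts, test_counts):
    """Sort file categories by how many more I/Os prod issued than test."""
    categories = set(prod_counts) | set(test_counts)
    excess = {c: prod_counts.get(c, 0) - test_counts.get(c, 0) for c in categories}
    return sorted(excess.items(), key=lambda item: item[1], reverse=True)

# Hypothetical per-category I/O counts, for illustration only.
prod_counts = {"DB": 4_000_000, "RUJ": 1_500_000, "TMP": 500_000}
test_counts = {"DB": 1_200_000, "RUJ": 300_000, "TMP": 500_000}

ranking = rank_io_excess(prod_counts, test_counts)
for category, extra in ranking:
    print(f"{category}: {extra:+,} I/Os in prod relative to test")
```

Whichever category tops such a list is the natural place to dig first.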

Maybe this is a SORT / temp file problem only? .. $ SHOW RMS ?

Spend some quality time with the RDB tuning manual:
http://download.oracle.com/otn_hosted_doc/rdb/pdf/dbpt.pdf

And last but not least, there are a good few folks/companies out there eager to help you with RDB. Google will find them. Some that come to mind, in alphabetical order: JCC, Oracle itself, SCI, VXcompany... several independent consultants and so on.

OpenVMS Bootcamp notes, and RDB update proceedings may provide further hints.

Good luck!
Hein van den Heuvel
HvdH Performance Consulting

Bob Blunt · ‎11-02-2010

In addition to Hein's suggestions and questions, I'd also look at the peripheral configuration of the two clusters. Are they using disks presented from the same SAN and controllers, or does the DS15 cluster have its own set of shared disks? Once that point is known, you need to find out how the disks themselves are set up. What OpenVMS version and patches have been installed?

In reality these are some very normal questions that have come up many times from the early VAX days to the present. OpenVMS is, if I may say so, great stuff and highly configurable. That works as both an advantage AND a disadvantage in these types of circumstances. Many of my customers have experienced problems because their "test" platform was "close to production" only by virtue of being an Alpha running the same O/S release, OR because what they were testing was too small a subset of their application to reflect real life. So, personally, I'd like to know more about your "car" and its components before I start measuring for tires.

bob

John McL · ‎11-02-2010

Thanks for telling us as much as you did. There are plenty of other possible differences that may be worth checking ...

I/O loading on both systems?
Disk queue length and I/O throughput?
File fragmentation?
Default process quotas (i.e. PQL* values from SYSGEN)?
Cluster comms performance, especially resends?
Any db rollbacks or journalling?
It's a nightly batch job, so do you also run Backup overnight and is there a conflict?

... and that's with about 30 seconds thought.

Richard J Maher · ‎11-02-2010

Hi Bjorn,

I'd punt for something "dbtune" did.

Can you dbtune a test copy of the database and see if you can reproduce the results in test?

Optimizer changes? Different index node sizes? More/fewer indices? SPAM thresholds? Many more page-discards?

Cheers Richard Maher

Robert Gezelter · ‎11-02-2010

Bjorn,

Thank you for mentioning the hardware details in the original posting; it is appreciated.

To add some items onto the list:

Install T4 and gather the broad spectrum of statistics from BOTH systems. The OP notes that both systems are using the same SAN; how are both sets of logical volumes configured (size, RAID level, not to mention the actual disks)?

- Bob Gezelter, http://www.rlgsc.com

Chris Barratt · ‎11-02-2010

The first thing I would look for is whether something else is running in production and accessing the same logical disk units that the production database is on (assuming the test copy is restored to some other disk). I guess a $show dev d/multi would also be useful to see which paths to the SAN each disk uses, since that is really where you might get congestion on the way to the SAN.
(eg. perhaps the test db uses a quiet path, while the production disks all share the same path ?)

Following that, a comparison of "rmu/dump header" on each database might be useful - although you say that the test db is created from a backup of production, so presumably it would have all the dbtune changes as well.

Another idea might be to run rmu/show stats on the database while each run occurs and compare the results (assuming nothing else is happening on the production database when the load runs).

cheers,
chris

Paul Gotkin · ‎11-03-2010

Hi Bjorn, one question. In your test environment did you set up shadowsets analogous to your production environment? You said it was the same SAN, but if you are not shadowing in your test environment it will make a significant difference in write times due to volume shadowing cloned i/o overhead. I have seen differences up to 50% when writing to single spindles rather than multiple member shadowsets. Just a thought.

Bill Pedersen · ‎11-03-2010

Björn:

Lots of good information here. I am more inclined to be concerned about "induced" work load due to environment rather than system "pressure".

Something along the path is making the job do more direct IO on prod than on test for some reason. You mention that you have run dbtune on prod, but not whether you ran it on the test db environment as well.

You say there is no fragmentation on the prod disk. Is free space similar when you create the temporary tables on prod and on test?

What about file extents? Are the infiles newly created on prod or are they the result of appends? Are the disks in prod and test created with similar window sizes and cluster sizes? Are they the same size disks?

Just some thoughts where I might look.

Bill.

Bill Pedersen
CCSS - Computer Consulting System Services, LLC

Hein van den Heuvel · ‎11-05-2010

Björn, talk to us!
You started so strongly, but we heard nothing since.

>> a third of DIO in test when I look at the Accounting information.

And we are talking significant IO counts right? Millions, not thousands?

Is that CPU time, when watched with MONI MODE or correlated to a T4 window, mostly EXEC mode (RDB) or something else?

fwiw, in cases like this with a significant difference in IO, I look at MONI and T4 only to guide me a little in the right direction.
It is less important to know what else the system is doing, whether there is fragmentation or shadowing, or which exact IO sub-systems are involved.
All those things influence the speed for sure, but not the IO count. And the speed impact would be fractional, not the factor of 4x - 5x suggested.
So the first order of business is to understand where those extra IOs are going... assuming all along that there are millions of them.
We want to look at GETJPI data more than GETSYI.
How much memory the process is using is a more important starting point here than how much free memory there is in the system. Assuming there is some.

Good luck!
Hein
