Low performance and time responses

Manuel G · ‎07-02-2001

Hello:

- We have a binode cluster system composed by 2 D230 servers with 128 MB. RAM each.
- Operating system is HPUX 10.20 and we already have a Oracle 8.0.4 database.
- Database information is stored in a external discs box accesible from both nodes.
- A node works as the primary one and the other remains as secondary waiting for any problem can occur with primary. Cluster managment is made by HP Service Guard Software.
- We are running a PRO*C application on the system since February-3.(until now it has runned on the primary node)
- Response application times were OK during first 2 months more or less. Last days of March we began to appreciate a degratation of time responses.
- Degradation increases as information stored in database grows up.
- We tried to eliminate database information and response times were OK again, but we have to store a 13 months historical so the solution is not to eliminate information gradually.
- We have notice top?s load average reaches 6 points (even more) when 6 or 7 users are accesing database using application. The load continues growing as oracle user processes remains waiting for an answer.
- We have notice too that idle time isn?t so bad when the top?s load got high.

Can someone help to discover where we have the bottleneck?
Glance could help?
Could it be a disk I/O problem?
Someone told us to configurate database as MTS (multithread server), could it be a solution or could have any problem with it?
If someone need more information, please ask me.

Thanks a lot.

A. Clay Stephenson · ‎07-02-2001

Hi:

128MB is very little memory for an Oracle server especially if it is also running an application. In your case, because your problem
seems to be very dependent upon the amount of data, I suspect that the majority of your bottleneck is in the Pro/C SQL code. It is very easy for poorly structured cursor loops to degrade at an N-squared rate or even worse
with nested selects. My first attack would be to enable SQL_TRACE to determine which statements are hogging the resources. If possible, I would also plot response vs. data volume, if the response degrades at N-squared then almost certainly you have a problem with poorly written joins. In many cases, one good choice of an index will clear up most of the performance problems.

Hope this at least gets you started, Clay

If it ain't broke, I can fix that.

Magdi KAMAL · ‎07-02-2001

Hi Manuel,

To solve your problem you may look for the following subjects :

1. hp-ux kernel parameters tuning.
2. Oracle SGA size according to physical memory.
3. Oracle tablespace distribution amoung disks.
4. Network configuration.
5. I/O subsystem speed.
6. Application design and implementing.
7. Physical memory increase.
8. Swap space.

1. hp-ux kernel parameters :
db_max_pct : should not be more 10 ( percent )
db_min_pct : should not be more 5 ( percent )
streampipes set to zero.

2. Oracle SGA should be as bigger as you can. Your physical memory ( 128 MB ) seems to be really not enough for what you attempt to do !
I would increase memory to 512 MB and give 300MB to the Oracle SGA ( The price of memory on D class serevr is not expensive !).

3. Try to fetch weather your tablespaces are well distributed amoung disks to insure access balancing .

4. Network configuration :
- the subnet mask value for either node 1, node 2 and package ?
- Broadcast value for either node 1, node 2 and package ?
- either node 1, node 2 and package sould be in the same subnet.
- If you are heavy communicating with another application or system, you would integrate it into the same subnet ( if possible ) to reduce network traffic ( bottleneck ).
- Network component specific, in case of mixing FDDI and ethernet protocols you may configure your the "Don't Fragment Bit" to ZERO .

- you may issue the command netstat -rn to check networking statistics and this could give you an idea about what is happening.

- Control your MTU size ( Ethernet must be 1500 and FDDI 4352.

5. I/O subsystem :
The command vmstat can give you an idea about running, blocking and waiting process.
It's very important to appreciate the values giving by this commands :
- Wainting process : Waiting for I/O request to be acomplish.
- Blocking process : process waiting for a resource ( could be db record, ... )
The command sar ( system activity reporter ) can give you also alot of information about the I/O performance.

6. You have to increse your physical memroy, 128 seams to be really not enough. I would recommand 512 MB ( 300 for Oracle SGA and the rest for the hp-ux and users process )
Notice : Any way you must set the db_max_pct and db_min_pct kernel parameters describe before.

7. Your swap space have to be three or four times the amount of physical memory.
So in case of 128 MB -> swap will be 512 MB
and in case of 512 MB -> swap will be 2048 MB.

8. Application desgin and implementing could be a bottleneck itself. If a process will executing is asking for a lot of resources at the same time, this will conclude on process competition to look for resources.

Hope this can help you, I can give you more details if you want ( I solved this problem last year for my customer with this approch and all this elements were involved in the final solution and now it works VERY GOOD ).

Magdi

Phil Miesle · ‎07-03-2001

Manuel,

It seems that you only have 6 or 7 users on the system, I assume that it is only used as a database server for this application. Given that your load average is so high at this user level, I too would look for poor-performing SQL and PL/SQL code. It is not uncommon for applications to perform worse and worse as more data is added to the system; this is solved by careful design and tuning of the application.

I would suggest using the utlbstat/utlestat utility provided with Oracle, setting TIMED_STATISTICS=true in the init.ora parameter file. There are some great documents on http://metalink.oracle.com (you need a valid Oracle support agreement), start with Note 61998.1 to get an overview, and then look into Note 62161.1 for how to interpret the bstat/estat report. There are also documents on Metalink which can help you with tuning SQL.

Two good books on the subject of tuning Oracle are "Oracle Performance Tuning" (published by O'Reilly) and "Oracle SQL High-Performance Tuning" (published by Prentice-Hall and written by Guy Harrison).

It could be that you're trying to do too much on a 1-CPU machine with 128 MB of memory, but I would do everything that you can to tune your application on the current hardware before upgrading; otherwise, your boss could be disappointed that a $5000 upgrade didn't solve the problem!

:-Phil

Manuel G · ‎07-03-2001

Clay:

- We have done what you suggest 2 months ago. In spite of this we are going to review again the code.

- We reindexed all database and "tkprof" code. We appreciate certain improve but when data grew the problem arised again.

- We use "tkprof" for reviewing, know other way?

- Thanks.

Charles McCary · ‎07-03-2001

Just out of curiousity, how are your lvols set up? Are they striped?

Also, in order to keep up interest in solving your problem, you may want to assign points to some of the other folks that have already worked hard for you on this - of course, I have yet to provide you anything useful, but hopefully if you provide the lvol info, I might be able to pluck some "low hanging fruit" for you.

Manuel G · ‎07-03-2001

Hello again:

- I said you our RAM memory was 128 MB, I was wrong, we have 256 MB.

- Magdi, I reviewed pontis you notice with the next result:

1) HP-UX kernel parameters:
dbc_max_pct: 2
dbc_min_pct: 2
streampipes: 0
We have the tuned kernel parameter sam offers for database servers.

2) Oracle SGA: 17 MB.

3) Tablespaces:
All tablespaces are in the same disk. We could distribute them across 2 disks (in the external disk box) but we are not sure having 2 controllers one for each disk.

4) Network:
Any system communicates with ours at low performance periods. Communicating runs at nigth/low using system times.
All users access our application via telnet sessions using an industrial network, so all the processes run locally on the system.
We have a Ethernet Network with 1500 MTU.

5) I/O system:
I read some about vmstat but I don?t now where it shows causes of waiting processes. Can it show them?

6) Memory:
We will try to resolve the problem withoput increasing memory because we are not sure if this is the bottleneck. The system seems to not swap for example.

7) Swap space:
We have 512 MB of swapspace, I think it?s OK, is it?

8) App. dessign:
As I told Clay, we are working on it trying to minimize accesing to database.

- I have started "sar" this afternoon, I hope tomorrow can show you some results.

- I will run vmstat and netstat too.

Thanks all for your help, we continue working on it.

PD: How can I point your answers?

Charles McCary · ‎07-03-2001

Manuel, you should see an "Assign Points" beside the headings of each of our answers.

tx,

C

Manuel G · ‎07-03-2001

Charles:

What kind of lvol information do you need?

Charles McCary · ‎07-03-2001

How about an lvdisplay on all of your database data file systems?

Manuel G · ‎07-03-2001

OK Charles, I?ll give you information tomorrow (in Spain is 21:55 and I don?t have system access now)

Thanks.

Charles McCary · ‎07-03-2001

Manuel,

Ok - but I probably won't look at it until July 5th sometime. Tomorrow is our Independence Holiday and I will be celebrating.

tx,

Charlie

A. Clay Stephenson · ‎07-03-2001

Hi Manuel,

In spite of everything you have said, I continue to think the problem lies in the code itself. In your case there is a fairly easy way of increasing memory without spending any money. I would simply halt the cluster, transfer the memory from machine B to machine A
and then manually start the application w/o MC/SG. If performance improves, you have your answer. In your case, I would set bufpages to a non-zero value and thus turn off dynamic buffer cache. I would start with a value of about 24 MB (bufpages=6144) and measure performance. Then increase it and see it the situation improves. Avoiding dynamic buffer cache allows you to alter just 1 variable at a time.
I have seen very few cases where even a badly tuned machine behaves as you describe (unless it is woefully inadequate for the task) but I have seen many cases where poorly constructed code does this and no amount of hardware/tuning will fix the problem.

Good luck and regards, Clay

If it ain't broke, I can fix that.

Bill Thorsteinson · ‎07-03-2001

Your symptoms suggest that you have a table that has
grown to large to keep in memory that is being
tablescanned by queries.

Look at the v$sql view of the statements in the SGA
for expensive statements. (See attached script
for a start.)

For query optimization I would suggest enabling
the autotrace command. Your developers can
then analyze the queries by entering
set autotrace on
into an sqlplus session before running their query.
If they are developing the queries in Pro*C then
tkprof works well, but is more difficult.
Also look at running explain plans on the query.
These won't work well unless you have a reasonable
amount of date in the larger tables.

You should run analyze on the production tables
to help the optimizer to its job. Rapidly growing tables
should be analyzed every week or two. You can
build a script to analyze all the application tables and
analyze them. Run it it batch every Sunday.

More memory will help in the short term. Give Oracle as
many block buffers as you can without causing
HP-UX to swap.

Magdi KAMAL · ‎07-03-2001

Hi Manuel,

I would do the following :

1. dmc_mx_pct from 2 to 10.
2. Oracle SGA of 17 MB is really not enough for a 256 MB system ! I would give 100 MB to Oracle System Global Area.
3. Swap should be increased to 1024 MB ( four times physical memory ).

"vmstat -S "command gives the following columns :

Under procs :
- r : Process (r)running.(*)
- b : Process (b)blocked for a resource.(*)
- w : Process (w)wait for I/O operation.(*)

Under Memory :
- avm : Active virtuel pages.
- free: Size of the free list.

Under Pages :
- si : Process swaped in.(*)
- so : Process swaped out.(*)
- pi : Pages paged in.(*)
- po : Pages paged out.(*)
- fr : Pages freed per second.
- de : Anticipated short term memory shortfall.
- sr : Pages scanned by clock algorithm.

Under faults :
- in : Device interupptes per second.
- sy : System calls per second.
- cs : Context switching per second.

Under CPU :
- us : User time. (*)
- sy : System time.(*)
- id : CPU idle. (*)

May you have a look to the output of this command and give special attention under columns marked with a star (*).

4. Try to investigate wheather new network components had just been intergrated into the network topology. This is a major issue for network traffic and performances falldown ( new net card, new hub, new switch, new gateway or new router) or configuration of one of them !!!.

Hope this helps.

Magdi

Manuel G · ‎07-04-2001

Hello all:

- I?ve got other point, oracle archive log is started, could this cause a load problem?

- We have HP Mirror Disk Software too for mirroring disks in the external box (2+2 disks), could it be a load problem mirroring?

Manuel G · ‎07-06-2001

Hi:

- I?ve got some statistics from sar.
- I think we?ve got high I/O traffic from disks and processes remaina waiting for an answer making CPU overloading.
- I try to format results and make some graphs this weekend.
- Next week we are going to:

* Increase Oracle SGA.
* Continue reviewing code with tkprof trying to minimize accesses.
* Stop archive log (this thread writes a lot of information in disks).

- I attach statistics and results for lvdisplay.

- Charles, disks are not stripped (I think ...). Files ended in ".lv" are the results. base is where we have tablespaces, programa is where we have application binaries and redo log files, and internal is where we have /oracle partition. First and second one are in external box disks and third in the internal one.

Thank you very much for your help and have a great weekend.

Manuel G · ‎07-08-2001

Hi:

- I?ve got some statistics from sar (see attach).
- First review of them show us a high I/O traffic, it seems processes keep waiting from disks answers.

- This week we are going to:

* Increase SGA size.
* Continue reviewing code.
* Stop archive logging (it writes a lot of information on disks).

- I hope tell you more during week.

- Results of lvdisplay are on *.lv files. /base (base.lv) has tablespaces files, /prg (prog.lv) has binary application files and archive log files, /oracle (oracle.lv) is the oracle partition. First two are in the external box disks (mirrored 2+2 disks) and the third one is in the internal disk.

- We are formatting results in excel, I attach some graphs when we finish them.

- Thank you very much.

Steve Slade · ‎07-08-2001

Manual,

I assume that when you say 'turn off archive logging' that you mean turn off archive log mode, as you cannot stop the database from logging the changes it makes to its' tables, etc. This will change your ability to recover from any disk/software failiures in the future, as you will no longer be able to roll forward - and you will have to have more outages for cold backups. (You can add multiple log writers to increase through put).

I agree with the guys above that the problems definitely seems code related and not HPUX related. could you maybe show us the contents of your init.ora file - as maybe you have some Oracle options switched on which are taking resources that you do not need. (Or maybe a utlbstat/estat report with timed stats - of 10 -20 mins of bad period).

I noted, though is that you have re-indexed your system using tkprof as a guide - but this performance inhancement has deteriated. Are you using the cost or rule based optimiser. If you are using the cost based optimiser it sounds as if your statistics might be a bit old.

If at first you do not succeed. Destroy all evidence that you even attempted.

Volker Borowski · ‎07-09-2001

Trying a diffrent approach:

with 6 to 7 users causing trouble, there should be no need to go for MTS.

Now something to be cleared first.
- Is your problem related to the amount of data ?
- Is your problem related to the number of concurrent users.

You said, that responsetimes got better as you decreased the amount of data. That let's one think, your either have old statistics, or your application is doing too much tablescans.
Try to find Index-Solutions for bad statements. Do COMPUTE STATISTICS to your application tables.

Check if ONE user can work with the application, even when there is a lot of data inside the DB. If YES, you definately have a lock-problem in your application, when several users are working. You might need to restructure the application then.

Do not know if this helps.....
Volker

Manuel G · ‎07-09-2001

Steve:

- You?re right, we?ll turn off archive log mode. We assume loosing recovering abilities.

- I?ve got some statistics from utlbstat and utlestat. They cover few days when application was performing slowy. (view attach)

- If I understand you well, you think that tkprof's stats are older and we should "tkprof" again?
******************************
Volker:

- Answer to your questions is "both". Problem is related with both things, when database load increases response times grow up, same occurs with concurrent users. As user increases as response time grows up, it seems that processes keep waiting for disk responses...

- You ask me to create new indexes, we have 5,000 transactions (inserts / updates) each day, could a excess of indexes create a response time decrease in transactions?

- Related to tablescan, poblem with tables is located in three of them that support all the application. When I said "Decreasing amount of data improves
response times" I want to say "Decreasing amount of data in three tables".

- We?ll try to "tkprof" code again for refresh our stats.

Thank you very much

Steve Slade · ‎07-10-2001

Manuel,

Yes if you turn off archive log mode. You will only be able to restore your database - if a failure occured - to the time of your last backup. Any activity between the time of the last backup and when the failure occured, will be lost. So if you backed up your system on Saturday, and lost a disk on Tuesday. You will only be to recover the database to Saturday - any activity on Sun & Mon will be lost.

Looking at the report, I can see that you have the default optimiser setting of choose. This means that if you have statistics - can be seen in all/dba/user_tables, _indexes, _ histograms etc, then Oracle will use the cost based optimiser and decide on whether to use an index based on those stats. If those stats were collected when the system contained a little amount of data, then Oracle will tend to go for scan - more data then an index retrieval. (There also a load of other rules, but suffice to say, the more acurate your stats, the more infmored Oracle will be about which method to retrieve your data.

Loking at the stats and doing a quick analysis - I would suggest the following : (NB. this is a quick once through, its worth checking the stats against Oracle performance docs, also I may have mis interpretted things.)

1. LIbrary Cache Stats. The pin hit ratio should be as near 1 as possible. The Body and Index stats are quite low. The body stats suggest that should increase the size of your Shared Pool, maybe. Alternatively, you could pin common code into the Shared Pool. You could use the Shared Pool reserve parms to allocate space to specific, commonly called code? The index stats indicate that perhaps your buffer cache is too small, however, if you are scanning multiple tables, then index blocks will be dumped out.

2. The second load of stats contains a heap of info. I do not have time look at it all, but here a few things which seem 'interesting'

a. There seems to be a lot of check pointing - this can be confirmed in the alert.log file. Looking at the init.ora parms given, I notice that you have the default for the db_block size - this could be crucial. I would expect this to be around 8k in size, but I do not know the default value. If this is a low value, then your database blocks are too small. This is a big problem, as this parameter can be set at the 'create database' stage.

b. You are performing a large number of consistent gets - 2700 per second !. This is where the data is read from the rollback segment rather than the table, and takes therefore extra time. This again could be becuase of all the tablescans.

c. physical reads/writes. These show that you do not actually write much. But you read a hell of a lot.

d. table fetch continued row - this indicates the amount of data that is spread of more than one data block - requiring an additional read. Maybe issues with the storage parms on your tables?

Looking at the buffer busy waits - it seems that the database spends most its time waiting for data blocks and rollback. Again, this could tie in with a small buffer cache, or one that it being cleared quite quickly. If your buffer cache is quite small, and you cannot affored the memory, you could try splitting it in two (using the buffer pool keep/recycle parms - not sure if this is in your version.) This way any tables which you have to scan can be placed in the recycle area, and will not affect the performance of the buffer cache.

There are probably a number of things that I have missed (as I only have limited time), so again I would recommend looking at the Oracle Ref. Doc for guidance on the stats.

Hope this helps, I am not sure how awake I am this morning. It might help if you sent your init.ora settings. (In svrmgrl spool myfile; show parameter; spool off - this should produce a neat file.)

Good luck

If at first you do not succeed. Destroy all evidence that you even attempted.

Manuel G · ‎07-10-2001

Thanks Steve, we?ll try and tell how?s going on.

I?ll attach you database parameters this evening (in Spain)

Volker Borowski · ‎07-10-2001

Hello Manuel,

Concerning to your question about additional indexes cost write performance: This is true, but look at your read/write ratio. I guess you do not have a write problem!

You mentioned three specific tables. Check out Bills Script for bad SQL-Statements and see, if it shows something up with them. Bad Statements have a high read-count compared to a low number ob records. Those statement can be optimized by an index mostly (Let us have them).

High read count and high record count is worse, because it might be (although unlikely) that the application needs all that data ?!

db_file_multiblock_read_count is at 32. Would try a lower value, to cause the optimizer to favor index-scans.

shared_pool_size=9M (increase if box does currently not do swaps)

Icrease the size of the online redologs, to soften the checkpoint incomplete intervalls (but I give this no big effect), or add one or two groups.

Now you said you have a high number of transactions. What do they do to you three big tables ? Is this a
- insert / delete - Activity
- update - Activity
- mixture

Check PCT_FREE and PCT_USED, if you have a heavy update_activity. If you have haevy update activity and the table contains a lot of fields, that are initialized NULL or short VARCHARS and get filled later, you should preserve freespace in the blocks for those updates. If you do not, the updates will lead to chained or migrated rows, which will cause additional read upon select. If there is a problem, you need to reorganize those tables.
Check Oracle guides on chained rows.

If you have insert/delete profile, your indexes might be fragmented fast (do index-rebuilds).

In general I like to know from the SQL-Statistics, how many of your consistent gets do make sense (means comparrison to the number of records read)

Last big thing:
Your application does quite a lot of sorts.
Check if really needed, and get rid of this if possible. If not, check if it will be of profit to serve the sort with an index and not the WHERE-clause (rare case, but if a hit, big effect). Since most of the sorts are done in memory, I suspect only small tables being sorted. If this are only small lookup tables, check if they can be converted to index-organized-tables.

Volker

Manuel G · ‎07-10-2001

Other point:

- Yesterday we were reviewing OS block size and Oracle block size.

- We use "ll -s" for getting OS block size and it seems to be 512 bytes. Do you know a direct way (command) to get it?

- Oracle block size is 2048 bytes.

- Could little size of OS block size cause a low I/O data transfering from disk to memory?

- Coult it be a problem the difference between Oracle and OS block size?.

- I attach you Oracle parameters.

- Thanks a lot guys.

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

Low performance and time responses

Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses

Re: Low performance and time responses