TruCluster Performance problem

Advisor

Hi ,

We are experiencing a performance problem on our two-node cluster running Tru64 5.1A.

We want to run a database application from both nodes, reading and writing mostly to one AdvFS domain, but due to some problems we can only safely use one node if we want to keep performance high. So for now I have disabled logins (nologin) on one node, but that's not a real solution.

Does anyone have a clue why performance problems arise when we open this second cluster node to users working on the database? (High load averages!)

Greetings Roodveldt

8 REPLIES
Frequent Advisor

Re: TruCluster Performance problem

Can you specify the database server (e.g. Oracle/Ingres/?) and version? It would also be nice to know the setup of these systems (e.g. cluster interconnect and storage systems).

A question on the performance issue: is performance still degraded if the application is running on both nodes with users only allowed to log on to one node?

I remember reading that cluster interconnect latency is a factor in performance (Memory Channel vs. Gigabit Ethernet); there are docs available from HP on this (in their docs area for Tru64).

What I'm not so sure about is whether, in V5 of TruCluster on AdvFS, both nodes are actively involved in writing to the disk. If you run cfsmgr -v, it will show you the node "serving" each file system.

For what it's worth, I use Ingres on TruCluster but run it as a service on one node. In the CAA scripts I issue a cfsmgr command to ensure the mount points for Ingres are on the node serving the DBMS, and also issue a cluamgr command to "up" the priority of the alias on that node (on stop it drops the priority down again). This then ensures all TCP connections go to the node running the DBMS server.

The reason for running on one node and not both: a 30% performance improvement. There is no need for DLM mapping, so all locks etc. are handled in memory on that box.

My 2 pennies...

Cheers

Gary
Honored Contributor

Re: TruCluster Performance problem


If you upgrade to V5.1B, only the DB writers on the non-serving member would be impacted by the indirect path to the file system. DB readers will take the local direct path.

BTW: how are your new_wire_method and fifo_do_adaptive settings? For Oracle, fifo_do_adaptive=0 would certainly help.

__ Johan.

_JB_
Advisor

Re: TruCluster Performance problem

Gary,

Well, it's not a well-known database application like Oracle or Ingres that we're running.
It's a specialised logistics database solution with its own database structure. As far as we know, the problem doesn't come from the database application itself. Indeed, we suspect the interconnect to be the cause of the problem. Because one of the nodes is mastering the database, the other node somehow loses performance due to the interconnect. So I'm going to look further into interconnect options in combination with a cluster alias. If you know more about cluster aliases, please let me know.
So far the original problem is not solved.
Or do you mean the step to 5.1B could make a difference?

Greetings
Roodveldt


Esteemed Contributor

Re: TruCluster Performance problem

I think that what you are seeing is that the cluster file system (CFS) is a client / server model. The client will have to ship all I/O over the cluster interconnect to the CFS server for this particular domain.

There are certain optimizations to the above, but when both the client and the server for the domain have the same files open for writing, those optimizations cannot be used.

Recommendations are difficult to make since we don't know your specialized logistics database or how much control you have over that.

One thing you might be able to do is to create multiple AdvFS domains and make one node the server of one domain and the other node the server of the other domain. Then distribute the database over both domains. Also try to optimize the load so that accesses to the "remote domain" are kept to a minimum.
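
The split-domain idea can be sketched as a small script. This is only an illustration: the member names (member1, member2) and domain names (db_dmn_a, db_dmn_b) are made-up placeholders, and the cfsmgr form is simply the one used in the CAA start block later in this thread. It defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch only: member and domain names below are hypothetical placeholders.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# relocate_domain MEMBER DOMAIN
# Ask CFS to make MEMBER the server for the AdvFS domain DOMAIN, using the
# same cfsmgr form as the CAA start block shown in this thread.
relocate_domain() {
    member="$1"
    domain="$2"
    if [ "$DRY_RUN" = "1" ]; then
        echo "cfsmgr -a server=$member -d $domain"
    else
        cfsmgr -a server="$member" -d "$domain"
    fi
}

# Each member serves the domain that its own users touch most.
relocate_domain member1 db_dmn_a
relocate_domain member2 db_dmn_b
```

Once the commands look right, running it with DRY_RUN=0 as root would issue them for real; cfsmgr -v can then be used to check which member serves each file system.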

If you could place the database on raw volumes, then that would help too. An alternative to this would be to get the database to use Direct IO (F_DIRECTIO on the open). Using Direct IO allows you to access the storage locally without going to the server for read and write operations. However both raw volumes and Direct IO require that the application (in your case the database) handles data consistency.

Johan, small correction: Note that in the case of Oracle, since it uses Direct IO, both reads and writes are handled locally. It is only writing of the archive logs that doesn't use Direct IO. This is because the archive logs are not pre-allocated.
Frequent Advisor

Re: TruCluster Performance problem

Roodveldt,

Check what Johan has suggested (5.1B allowing reads from local machines to shared storage); this could improve rates. The 5.1A release notes describe how CFS works with one node being the master (unless I/O > 64K). We are guessing here that your storage is shared between both nodes?

As for setting up an alias so your users end up on the correct machine:

1) Obtain an unused IP address and load it, along with a name (e.g. MYDB), into your DNS server.

2) Edit /etc/hosts on your cluster and add this alias / IP address.

3) Use sysman, pick TruCluster specific -> Cluster Alias Manager, and add the alias to all of your nodes.

4) On each node, log in as root and do:

cluamgr -r start

cluamgr -s all (shows all known aliases)

5) Create a network monitoring service for CAA: -

caa_profile -create mydbnet -t network your_ip_subnet_addr

6) Register network monitoring...

caa_register mydbnet

7) Now create a CAA service dependent on the network service (i.e. this will ensure that if the node serving the DB cannot see the network, the service will move to a node that can).

8) In the CAA script for your service..

-------------------------------------
START block : -
-------------------------------------

#
logtxt "Promoting Cluster Alias priority to this server... "
cluamgr -a selp=10,alias=mydb
if [ $? -ne 0 ]; then
postevent "MYDB cluamgr priority" start
exit 2
fi
logtxt "Promote Cluster-Alias done."
#
# Ensure this server handles the Disk IO
#
logtxt "Re-Locating Advfs Domain MYDB_ADVFS_DOMAIN to this server... "
cfsmgr -a server=`hostname -s` -d MYDB_ADVFS_DOMAIN
logtxt "Advfs relocate done."

-------------------------------------
STOP block : -
-------------------------------------

#
# Demote the Cluster Alias
#
logtxt "Demoting Cluster Alias priority on this server... "
cluamgr -a selp=1,alias=MYDB
if [ $? -ne 0 ]; then
postevent "MYDB cluamgr priority" stop
exit 2
fi
logtxt "Demote Cluster-Alias done."


9) Register and start the service, then get your users to connect to the MYDB address (DNS will resolve it). They will then always go to the node running the database.

Hope this gives you some things to play with (rather than no-logins on the other server).

Here we use one server for the database only; the other server runs batch servers, HTTP servers etc. On a server failure they'll end up running together on a single node.
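
The START and STOP fragments above call logtxt and postevent, which are not shown in this post and are presumably local helpers from the full CAA script. To try the fragments, something like the following stand-ins could be defined first. Both functions and the log path are assumptions for illustration, not the original definitions.

```shell
#!/bin/sh
# Hypothetical helpers: the thread does not show the real definitions.
# LOGFILE is an assumed location; change it to suit your site.
LOGFILE="${LOGFILE:-/tmp/mydb_caa.log}"

# logtxt MESSAGE...
# Append a timestamped line to $LOGFILE.
logtxt() {
    echo "`date '+%Y-%m-%d %H:%M:%S'` $*" >> "$LOGFILE"
}

# postevent WHAT PHASE
# Record that WHAT failed during PHASE (start or stop). A real script might
# raise an event or send mail here; this stand-in only writes to the log.
postevent() {
    logtxt "ERROR: $1 failed during $2"
}
```

With these in place, a failed cluamgr call in the START block leaves a line such as "ERROR: MYDB cluamgr priority failed during start" in the log before the script exits with status 2.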

Hope this helps...

Gary
Esteemed Contributor

Re: TruCluster Performance problem

Gary,

When Johan wrote that, I think he was talking about Oracle and Direct IO. What you read in the release notes is "concurrent Direct IO read". This only helps if there are no writers to the same file and you have, as you said, I/Os larger than 64K. Typically this helps backup applications, not parallel databases.

Roodveldt,

Let's first try to determine whether your performance issue lies in the cluster file system, in disk I/O, in the network between the application instances on both nodes, or in the cluster interconnect.

Adding a cluster alias will certainly not help you to alleviate any file I/O performance issues. Why do you think that adding a cluster alias would help your database application?
Advisor

Re: TruCluster Performance problem

When we enable the second TruCluster node, which at that time is not hosting the domain/disks, and normal users log in and do their work on the database, we sometimes see the CPU running at 80% system, 10% user and 10% idle.

iostat tells us that at that time bps lies around 15K-30K instead of the normal 500 to 1000 bps. tty is also a little bit higher.

At first we thought the large volume of linear reading/writing/locking of the database tables was creating the problem, so we built a tool to generate random reads/writes and tested deadlock situations. Still the same problem.

Could it be the link to the SAN storage (MA8000) that is slowing down the reads/writes due to a wrong setting?

We are currently building a test environment to test striping and aliasing the most-read tables across two independent RAID sets and/or file domains.

So still trying..

Esteemed Contributor

Re: TruCluster Performance problem

80% system time is usually something to take a closer look at. If you could run a "lockinfo sleep 60" command at a time when you see that much system time and post the results here, then we may be able to help.
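
A small wrapper makes it easier to grab that sample quickly when the load spikes. This is just a sketch: the function name, the output file name and the LOCKINFO override variable are made up for illustration, and it assumes lockinfo is on root's PATH.

```shell
#!/bin/sh
# capture_lockinfo [SECS] [OUTFILE]
# Run "lockinfo sleep SECS" and save everything it prints to OUTFILE.
# Assumptions: lockinfo is on the PATH (override with LOCKINFO=/path/to/it)
# and the default output file name is only a suggestion.
capture_lockinfo() {
    secs="${1:-60}"
    outfile="$2"
    if [ -z "$outfile" ]; then
        outfile="lockinfo.`date +%Y%m%d%H%M%S`.txt"
    fi
    "${LOCKINFO:-lockinfo}" sleep "$secs" > "$outfile" 2>&1
    status=$?
    # Print the file name so the caller knows what to post.
    echo "$outfile"
    return $status
}

# Example (run as root on the member showing high system time):
#   capture_lockinfo 60
```

The function prints the name of the file it wrote, so the result can be attached to a reply as is.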

Do I understand correctly that you are at V5.1A and use LSM? Can you explain which patch level and which features you use of LSM (mirroring, striping)?