
EVA disk access/performance reference

Ivan Ferreira
Honored Contributor

EVA disk access/performance reference

Hi all, I want to know if someone has a similar configuration. I'm using an EVA5000 (68 disks) with TruCluster 5.1B PK5. The write speeds for the systems in the cluster are very low; see for example:

ALL CONNECTED TO THE SAME EVA, AND COMMANDS EXECUTED ON THE "CFSMGR" OWNER HOST:

4 NODES CLUSTER (VRAID1)
================

time dd if=/dev/zero of=bigfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real 215.0
user 35.6
sys 36.5

2 NODES CLUSTER (VRAID 5)
================

time dd if=/dev/zero of=bigfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real 67.9
user 43.0
sys 38.7

SINGLE SYSTEM NOT IN CLUSTER (VRAID 5)
============================

time dd if=/dev/zero of=bigfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real 9.2
user 0.3
sys 8.8


Can the cluster file system affect performance in that way? Can someone with a similar configuration share their write statistics?

Also, the VDISKs in the EVA have preferred paths configured and are balanced between the controllers (checked with evaperf).
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
15 REPLIES
Seref Tufan SEN
Occasional Advisor

Re: EVA disk access/performance reference

I think it is because one server in the cluster owns the disk and you are running the dd from other members.

You can find the owning member of the mount point by using "cfsmgr -v /mount_point". Try the dd command from the owning member, or relocate the mount point to the member where you are running dd.
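For example (a minimal sketch; /my_mount_point is a placeholder, and the relocation syntax should be double-checked against the cfsmgr(8) reference page):

cfsmgr -v /my_mount_point

This shows the server name and status for the file system; the same command can relocate the CFS server to another member via its server attribute.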
Ivan Ferreira
Honored Contributor

Re: EVA disk access/performance reference

No, I'm running the command from the cfsmgr file system owner host.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Michael Schulte zur Sur
Honored Contributor

Re: EVA disk access/performance reference

Hi,

there is a major difference between the first and the other logical drives. It would be interesting to know how many disks are involved in each drive. The first and second stats aren't so different, except for the real time, which could come from high CPU use.

greetings,

Michael
Ivan Ferreira
Honored Contributor

Re: EVA disk access/performance reference

Thanks for your answers, but this is an EVA; virtual disks are always created across all 68 disks, so apart from the RAID level they are all more or less the same.

Now, I'm also passing information to HP support to see if they find something wrong.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Joris Denayer
Respected Contributor

Re: EVA disk access/performance reference

Ivan,

There is also a DRD layer in the cluster.
Find out which disk contains the file domain where the file bigfile is located.

On the "CFSMGR owner host", execute the following command:
# drdmgr dsk??

example on a host called coca.

coca: drdmgr dsk57

View of Data from member coca as of 2006-01-03:09:44:20

Device Name: dsk57
Device Type: Direct Access IO Disk
Device Status: OK
Number of Servers: 2
Server Name: cola
Server State: Server
Server Name: coca <<<<
Server State: Server <<<<
Access Member Name: coca
Open Partition Mask: 0x4 < c >
Statistics for Client Member: coca
Number of Read Operations: 9389708
Number of Write Operations: 4484493
Number of Bytes Read: 964393760768
Number of Bytes Written: 53966315520



If you do not see 2 lines "Server Name" and "Server State" for the "CFS owner", something is wrong in the DRD layer.
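A quick way to check just those lines (a small sketch, using the example device from above):

drdmgr dsk57 | egrep 'Server Name|Server State'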


BTW: What type of cluster interconnect is used (Memory Channel, Gigabit, 100 Mbit)?

Another suggestion:
Enable cfsd on the cluster. This daemon will produce suggestions for locating file systems on the correct member.


You can find a small overview of DRD in chapter 2 of http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51B_ACRO_DUX/ARHGVETE.PDF
To err is human, but to really foul things up requires a computer
Mark Poeschl_2
Honored Contributor

Re: EVA disk access/performance reference

The way you're using the 'dd' command isn't going to show you anything like real I/O capability. You're using a really small transfer size and have the file system involved as well. I'd try something like:

time dd if=/dev/zero of=/dev/rdisk/dsknnc bs=256k count=100000
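A matching read test against the same raw device would look like this (dsknnc is a placeholder for your actual disk; note that writing to the raw device destroys any data on it, so only write to a scratch disk):

time dd if=/dev/rdisk/dsknnc of=/dev/null bs=256k count=100000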
Ivan Ferreira
Honored Contributor

Re: EVA disk access/performance reference

Hi Joris, I already checked that, and all members have direct I/O to the disks (all are servers for that disk). Thanks for the suggestion.

I really forgot to mention that we are using Memory Channel.

Now, for the dd block size, that may be true. I will increase the size, but consider that the same command issued on a non-cluster system finishes in 9.2 seconds.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Ivan Ferreira
Honored Contributor

Re: EVA disk access/performance reference

Hi Mark, you are right: if I increase the block size, the results get better. Please post again, you deserve more points. You make me feel dumb!

My tests weren't realistic. I should use at least the file system block size.

8k 4 Nodes Cluster
==================
time dd if=/dev/zero of=bigfile bs=8192 count=131072
131072+0 records in
131072+0 records out

real 14.8
user 4.6
sys 7.2


8k 2 Nodes Cluster
===================

time dd if=/dev/zero of=bigfile bs=8192 count=131072
131072+0 records in
131072+0 records out

real 15.5
user 2.3
sys 7.7

8k Single server
================

time dd if=/dev/zero of=bigfile bs=8192 count=131072
131072+0 records in
131072+0 records out

real 3.7
user 0.1
sys 3.3


256k 4 Nodes Cluster
=====================

time dd if=/dev/zero of=bigfile bs=256k count=4096
4096+0 records in
4096+0 records out

real 38.0
user 0.5
sys 5.1

256k 2 Nodes Cluster
====================

time dd if=/dev/zero of=bigfile bs=256k count=4096
4096+0 records in
4096+0 records out

real 3.5
user 0.2
sys 4.1

256k Single server
===================

time dd if=/dev/zero of=bigfile bs=256k count=4096
4096+0 records in
4096+0 records out

real 2.3
user 0.0
sys 2.3

Still, there are differences, but they are not alarming any more.

Thanks to all!
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Joris Denayer
Respected Contributor

Re: EVA disk access/performance reference

Hi Ivan,

In fact, you are testing the UBC of your system with the dd command to an output file.


If we make the following calculation:

1000000 I/Os / 9.2 seconds
= 108695 I/O/sec to one HSV controller

There are 68 disks in your EVA, so:

(108695 I/Os spread over all disks) / 68 =~ 1600 I/O/sec on each disk.

I have the feeling that this is not possible. As I am no EVA specialist, correct me if I'm wrong.




To err is human, but to really foul things up requires a computer
Joris Denayer
Respected Contributor

Re: EVA disk access/performance reference

Ivan,

1) I really think that you should run your test against a raw device (of=/dev/rdisk/dsk)

2) There are some strange figures in the bs=8k/256k results.

For 8k block size, the 4-node cluster is faster than the 2-node cluster.
For 256k block size, the 2-node cluster is much faster than the 4-node cluster.
The 4-node cluster is slower with 256k blocks than with 8k blocks. ???

? Do all these systems have comparable HW (1 Gbit <-> 2 Gbit)?
? Do you get the same result on each member of the cluster?
? Are you sure there is no other activity on the other cluster members, on the SAN, or on the EVA?

Joris
To err is human, but to really foul things up requires a computer
Ivan Ferreira
Honored Contributor

Re: EVA disk access/performance reference

Well Joris, the EVA is capable of handling up to 141K IOPS and up to 700 MB/s throughput per controller pair.

Now, the tests won't be 100% accurate because I ran them with different loads on systems that are in production. But the results are more or less the same and give a reasonably realistic picture. Thank you.

I currently cannot do write tests against the raw devices, and also, our databases are located in files on the file system, so I need performance statistics for the file system itself. I know that the bottleneck is not the EVA; I've been checking EVA performance statistics and SAN switch statistics and they are very good. I was thinking there was something wrong with the cluster configuration, but I think that everything is normal. This is considering that Oracle databases write in at least 8k pages in our configuration.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Joris Denayer
Respected Contributor

Re: EVA disk access/performance reference

Ivan,

141K IOPS is per controller pair. bigfile is on one disk, and that one disk is controlled by only one HSV controller. So, you should divide the figures at least by 2.

I am pretty sure that Oracle will do direct I/O to the files in your AdvFS fileset, thus bypassing the UBC.

Anyway, I did some tests on an in-house system, but with 1 Gbit infrastructure :-(
My results are far worse than yours.

Can you run collect during the tests with an interval of 1 second?
Let collect run from 30 seconds before the test until 1 minute after the test.
This will show the real throughput and I/O rate to the disk.
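Something like this should do (a sketch; -i sets the sample interval in seconds, the output file name is arbitrary, and the exact options are in collect(8)):

collect -i 1 > /tmp/collect_dd.txt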


To err is human, but to really foul things up requires a computer
Ivan Ferreira
Honored Contributor

Re: EVA disk access/performance reference

I agree with you that there are numbers that are hard to match. But writes do not go directly to the EVA disks, as you can see in the attachment. The EVA caches the writes and then sends them to the disks. In the attachment, you can see large MB/s at low WR/s.

For example, it's also hard to match the output from iostat/collect with the dd command (dsk3 is the disk accessed).

Anyway, I feel good now that you tell me that you have even worse performance ;). And about the Oracle direct I/O, you are also right, it should bypass the file system cache.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Joris Denayer
Respected Contributor
Solution

Re: EVA disk access/performance reference

Ivan,
You made me curious and I ran the same test as you did, but this time directly to a file.

Now, my result on a 2 node cluster is:

cola : time dd if=/dev/zero of=/tmp/bigfile bs=8192 count=131072
131072+0 records in
131072+0 records out

real 0m10.13s
user 0m0.16s
sys 0m9.90s

I also ran collect, see attached text

The dd runs between Record #10 and Record #16.
What you can see is that there is only very little activity on the disk (less than 1 MB/s).
Meanwhile, the UBC grows at ~100 MB/s.

As there still was no disk activity, I gave a sync command at around Record 25.
At that point, we see increasing write activity to dsk82 (~90 MB/s at between 711-784 I/O/s).
So, this means that the write block sizes are ~120 KB.

These writes indeed land in the cache of the EVA, as you mentioned before.

Conclusion: A dd test that writes to a regular file, on a system with enough memory and UBC parameters that are not undertuned, cannot be used to simulate the performance of an I/O subsystem. It only measures how fast you can fill up the UBC.
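If you still want to test through the file system, one rough workaround (a sketch, reusing the block size from the earlier tests) is to include a sync in the timing so the UBC is flushed before the clock stops:

time sh -c 'dd if=/dev/zero of=bigfile bs=256k count=4096 && sync'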

So, the question for your problem should be rephrased:
"Why does the UBC perform so badly with my 4-member cluster, compared with a 2-member cluster or a standalone system?"

For that, I propose verifying the UBC parameters of the different systems.

Also, if the systems run production, the UBC might already be at its ubc_maxpercent. Or maybe there are no more free pages to transfer to the UBC.
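A quick way to compare the relevant vm attributes across the members (a sketch; ubc_maxpercent, ubc_minpercent and ubc_borrowpercent are the usual ones to look at, see sys_attrs_vm(5)):

sysconfig -q vm ubc_maxpercent ubc_minpercent ubc_borrowpercent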

Hope this helps you
To err is human, but to really foul things up requires a computer
Ivan Ferreira
Honored Contributor

Re: EVA disk access/performance reference

Hi Joris, thank you very much for your support and time. You can also see that, in the interval shown in my statistics file, there are some big writes and the sys CPU starts increasing again. This is when I issued the sync command. I forgot to mention that.

But, to conclude: real write performance tests should be done against raw devices. File system write tests should use a block size >= the file system block size. Plain file generation tends to stay in the UBC.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?