
SAN got slow suddenly

 
Eric Antunes
Honored Contributor


Hi,

I noticed recently that the Data Protector backups are taking 4 to 5 times longer to complete for the same amount of data.

To troubleshoot the issue, I tested an RMAN backup directly to the server's disks, and it also takes more than twice as long as it did before the issue.

This has been happening since last month. Any ideas?

Thank you,

Eric Antunes



Each and every day is a good day to learn.
15 REPLIES
Ivan Ferreira
Honored Contributor

Re: SAN got slow suddenly

A full description of your storage area network would help.

There are many possible causes, including storage utilization by other hosts, data location, disk failures, and MPxIO-related issues.

Check the service times for your disks, the transfer rate, etc. If you have historical performance data, compare the current figures against it.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Eric Antunes
Honored Contributor

Re: SAN got slow suddenly

Hi Ivan,

This is a 1.5 TB SAN.

It is part of an EVA 4000 with 2 HP-UX 11iv1 servers, 2 Windows servers (Win Server 2003), 2 32-bit blades (Win Server 2003) and 1 64-bit blade (Win Server 2003).

It has the following hardware:

- 1 "HP 2-ch U320 SCSI HBA" (whatever that means);
- 1 "HP StorageWorks 4/8 Base SAN Switch";
- 1 "HP Universal Rack 10642 G2 Shock ALL";
- 1 "Mod PDU 16A HV WW ALL";
- 1 "HP EVA4000-A 2C1D Array";
- 8 "HDD 146GB FC 10K 1' ALL ADD-ON ALL";
- And, finally, last December we added 1 "HDD ~292GB (not sure about the exact size) FC 10K 1' ALL ADD-ON ALL".

Eric
Each and every day is a good day to learn.
Richard Tengdin
Trusted Contributor

Re: SAN got slow suddenly

You said you have the following:

- 8 "HDD 146GB FC 10K 1' ALL ADD-ON ALL";
- And, finally, last December we added 1 "HDD ~292GB (not sure about the exact size) FC 10K 1' ALL ADD-ON ALL".

Since you only have 9 drives listed, they must be in the same disk group. You will have 1373 GiB of total space in the group (protection level of None), but fully 20% of the space is held on the 300GB drive.

20% of the space means 20% of the data, which means 20% of the I/O. Your 300GB drive is badly hot-spotted.

Your best solution is to get two 146GB drives, add them to the Disk Group, and remove the 300GB drive. That configuration will add about the same space but each of the 10 drives will have 10% of the space, data, and I/O.
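
The capacity-share arithmetic above can be checked with a short script. This is only a sketch: it uses the nominal sizes quoted in the thread (146 GB and 300 GB) rather than the formatted capacities the EVA reports, and the function name disk_share is made up for illustration.

```shell
#!/bin/sh
# disk_share: print each disk's share of the raw capacity of a disk group.
# Arguments are nominal disk sizes in GB (assumed values, not what the
# array reports after formatting).
disk_share() {
    printf '%s\n' "$@" | awk '
        { size[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++)
                printf "disk%d %dGB %.1f%%\n", i, size[i], 100 * size[i] / total
        }'
}

# Current layout: eight 146 GB drives plus one 300 GB drive.
disk_share 146 146 146 146 146 146 146 146 300

# Proposed layout: ten 146 GB drives.
disk_share 146 146 146 146 146 146 146 146 146 146
```

With the current layout the 300 GB drive comes out at about 20.4% of the capacity and each 146 GB drive at about 9.9%; with ten equal drives every disk holds 10.0%.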
Ivan Ferreira
Honored Contributor

Re: SAN got slow suddenly

You mentioned a direct RMAN backup to the server's disks.

Are the source and destination storage the same, or is the backup local?

Have you tried doing a read and write performance test only with dd, using /dev/zero and /dev/null?

What is your disk service time (avserv) as reported by sar -d 5 1000 during the backup?

If you have multiple paths, have you verified which path is in use, and tested forcing the use of the second path? We had a performance problem caused by one HBA failing in a multi-HBA host.

Also, if the other hosts connected to the storage changed their load, they can affect overall storage performance. The evaperf tool may help you check the status of your storage.
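
A dd-only test takes the database and backup software out of the picture. The sketch below writes zeros to a scratch file and then reads it back to /dev/null. The function names and the scratch path in the comments are hypothetical, and reading back a small file may be served from the buffer cache, so the file should be larger than the server's memory for the read figure to mean anything.

```shell
#!/bin/sh
# Write test: stream zeros into a scratch file on the file system under test.
# Usage: dd_write_test FILE BLOCK_SIZE COUNT
dd_write_test() {
    dd if=/dev/zero of="$1" bs="$2" count="$3"
}

# Read test: stream the same file back and discard the data.
# Usage: dd_read_test FILE BLOCK_SIZE
dd_read_test() {
    dd if="$1" of=/dev/null bs="$2"
}

# Example session (path is a placeholder for a file system on the SAN):
#   sar -d 5 1000 > sar_d.out &      # collect disk stats in the background
#   dd_write_test /path/on/san/ddtest.tmp 8k 131072
#   dd_read_test  /path/on/san/ddtest.tmp 8k
#   rm /path/on/san/ddtest.tmp
```

Run it once while the system is otherwise quiet and once under normal load, so the two sets of sar figures can be compared.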
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Eric Antunes
Honored Contributor

Re: SAN got slow suddenly

Richard:

It turns out we have one more "HDD 146GB FC 10K 1' ALL ADD-ON ALL". So we have the following:

- 9 "HDD 146GB FC 10K 1' ALL ADD-ON ALL";
- 1 "HDD ~292GB (not sure about the exact size) FC 10K 1' ALL ADD-ON ALL".

Yes, our 292 GB disk gets roughly 18% of the total I/O. You made a good point, but we can't add 146 GB disks anymore because they were discontinued.

I noticed that this same 292 GB disk has older firmware (H02) than the other disks (H03).


Ivan:

"Is the source and destination storage the same? or the backup is local?"

Yes, the source and destination storage was the same: from SAN to SAN, not local.


"Have you tried doing read and write performance test only with dd, using /dev/zero and /dev/null?"

What do you mean by "only with dd"? No activity on the server besides dd?


"What is your disks service time (avserv) as reported by sar -d 5 1000, during the backup?"

Bad, very bad:


"If you have multiple paths, have you verified which path is using, and tested forcing the usage of the second path? We had a performance problem related to one HBA failure in a multi HBA host."

We have only one HBA. Do we have a bottleneck here?

Will check the evaperf tool this afternoon ...

Thanks,

Eric
Each and every day is a good day to learn.
Eric Antunes
Honored Contributor

Re: SAN got slow suddenly

I forgot to post the "sar -d 5 333" averages:

          device   %busy   avque   r+w/s  blks/s  avwait  avserv
Average   c9t0d0    0.46    0.63      13     220    0.04    0.42
Average   c4t0d1   14.95    1.24      56    1678    0.50    5.07
Average   c4t0d2    0.60    0.50      11     681    0.01    1.09
Average   c4t0d3    0.01    0.50       0       0    0.00    5.58
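
For anyone comparing several sar runs, the Average lines can be boiled down with awk. This sketch assumes the HP-UX sar -d column order (device, %busy, avque, r+w/s, blks/s, avwait, avserv), with blks/s counted in 512-byte blocks and avserv in milliseconds. The function name sar_summary is made up.

```shell
#!/bin/sh
# sar_summary: read sar -d output on stdin and print one line per
# "Average" row with %busy, I/Os per second, KB/s and service time.
# Assumed columns after the "Average" tag:
#   device %busy avque r+w/s blks/s avwait avserv
sar_summary() {
    awk '$1 == "Average" {
        # blks/s is in 512-byte blocks, so halve it to get KB/s
        printf "%s busy=%s%% io/s=%s KB/s=%.1f avserv=%sms\n", $2, $3, $5, $6 / 2, $8
    }'
}

sar_summary <<'EOF'
Average c9t0d0 0.46 0.63 13 220 0.04 0.42
Average c4t0d1 14.95 1.24 56 1678 0.50 5.07
Average c4t0d2 0.60 0.50 11 681 0.01 1.09
Average c4t0d3 0.01 0.50 0 0 0.00 5.58
EOF
```

For c4t0d1 this gives 56 I/Os per second and about 839 KB/s at 5.07 ms service time, which is easier to compare between a good day and a bad day than the raw columns.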
Each and every day is a good day to learn.
Eric Antunes
Honored Contributor

Re: SAN got slow suddenly

Those averages were collected with the production Oracle Applications 11i instance up and a single "dd if=/dev/zero of=/disc3/x count=1000000" running; no backup was run.
Each and every day is a good day to learn.
Ivan Ferreira
Honored Contributor

Re: SAN got slow suddenly

Your average service time was good, but you should test with larger block sizes: dd if=/dev/zero of=(device) bs=8k (then 256k and 512k).

Also check your RMAN parallelism configuration and your overall system resource usage while the backup runs.

I would like to see some performance data collected with sar during your backup session: disk, memory, and CPU.

Also, please specify which device in the sar -d output is your SAN disk.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Eric Antunes
Honored Contributor

Re: SAN got slow suddenly

Hi Ivan,

Here are the stats collected during the RMAN backup (parallelism=4).

The SAN physical volumes are c4t0d1 and c4t0d2.

Eric
Each and every day is a good day to learn.
Ivan Ferreira
Honored Contributor

Re: SAN got slow suddenly

Looking at your performance statistics, the storage seems to be responding correctly according to the service time, but you have constant %wio.

I have found this:

============================================
What is your database server doing during the backup? Is the CPU busy? Does it page a lot; are you getting large amounts of I/O wait? Setting filesperset=100 may be a mistake. If you're seeing a high %WIO, try setting this to the number of disks on your system (or less). Unless your disks are more than 10 years old, serial reads on one or two disks should be able to easily exceed the bandwidth of most networks.

It is recommended that 2 channels per device are allocated, but test to verify the performance improvement.

============================================

See also:
http://www.sc.ehu.es/siwebso/KZCC/Oracle_10g_Documentacion/server.101/b10734/rcmtunin.htm

Maybe there is some tuning you should do to the RMAN backup process.

Please also report the results of the following (preferably over both a file system and a raw device whose contents can be destroyed):

FS
date; time dd if=/dev/zero of=testfile bs=8k count=131072
RAW
date; time dd if=/dev/zero of=/dev/rdsk/ bs=8k count=131072

FS
date; time dd if=/dev/zero of=testfile bs=256k count=4096
RAW
date; time dd if=/dev/zero of=/dev/rdsk/ bs=256k count=4096

Again, please post the sar output during the tests. Remember, writing to the raw device will destroy its data.
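
Both block sizes above write the same amount of data (8 KiB x 131072 = 256 KiB x 4096 = 1 GiB), so the elapsed times are directly comparable. A small helper can turn an elapsed time into a throughput figure. The function name dd_throughput and the elapsed times in the example are made up for illustration.

```shell
#!/bin/sh
# dd_throughput: convert a dd run into MiB/s.
# Usage: dd_throughput BLOCK_SIZE_BYTES COUNT ELAPSED_SECONDS
# ELAPSED_SECONDS is the "real" time reported by time(1) for the dd run.
dd_throughput() {
    awk -v bs="$1" -v count="$2" -v secs="$3" 'BEGIN {
        if (secs <= 0) { print "elapsed time must be > 0"; exit 1 }
        printf "%.1f MiB/s\n", bs * count / secs / 1048576
    }'
}

# Hypothetical timings: 10 s for the 8k run, 20 s for the 256k run.
dd_throughput 8192 131072 10
dd_throughput 262144 4096 20
```

With those example timings the two runs work out to 102.4 MiB/s and 51.2 MiB/s.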
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Simon Setina
Advisor

Re: SAN got slow suddenly

Hi Eric.

"You made a good point but we can't add 146Gb disks anymore because they were discontinued"

It's true, 146GB 10K disks are discontinued, but you can add 2 or more 146GB 15K disks, which is a better solution than the 300GB drive.

364621-B22 / 146 GB 15K rpm dual-port 2/4 Gb/s FC-AL 1-inch (2.54 cm) drive.

Regards
Simon
Eric Antunes
Honored Contributor

Re: SAN got slow suddenly

Hi Ivan,

Here are the "sar -d" stats collected Friday morning while doing some RMAN archivelog backups.

c4t0d1 and c4t0d2 (SAN physical devices) are 10 times slower than the local physical device, c9t0d0, which has all the redo log activity.

It is not always slow (my Thursday stats were OK), but it is frequently slow.

Eric

Each and every day is a good day to learn.
Eric Antunes
Honored Contributor

Re: SAN got slow suddenly

Correction: the Friday stats were collected on Friday afternoon, not in the morning.

Each and every day is a good day to learn.
Mark Le Clair
New Member

Re: SAN got slow suddenly

Eric,

Do you have any EVAperf data? Ivan's comments on the disk sizes are accurate. I know that HP best practice is to keep disks of different sizes in different disk groups. There are several reasons behind that.

Equally important, I think you need to determine whether the problems you are having are SAN based or host based. The only way to do this is to look at performance logs from each and then start ruling out one or the other.

On the SAN side, I would look at the I/O of the disk group. 9 disks do not give a lot of I/O capability. If you are pushing the I/O capacity of the disk group, you will probably run into high disk queues/busy. For example, 9 146 GB 10K drives will give you approx. 484 I/Os total at about a 60/40 read/write mix and about 661 I/Os at 80/20 R/W. That really isn't a lot of I/Os.
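
The post does not say how those figures were derived. One common rule of thumb divides the raw spindle I/O rate by a read/write mix weighted with a RAID write penalty. The sketch below uses that rule with assumed inputs (120 I/Os per second per 10K drive and a write penalty of 4, as for RAID 5). It lands in the same range as the numbers above but is not meant to reproduce them, and est_iops is a made-up name.

```shell
#!/bin/sh
# est_iops: rough host-visible I/O rate for a disk group.
# Usage: est_iops DISKS IOPS_PER_DISK READ_PERCENT WRITE_PENALTY
# Rule of thumb: raw_iops / (read_fraction + write_fraction * penalty).
# All inputs are assumptions to be replaced with measured values.
est_iops() {
    awk -v n="$1" -v per="$2" -v rd="$3" -v pen="$4" 'BEGIN {
        raw = n * per
        weight = (rd + (100 - rd) * pen) / 100
        printf "%.0f\n", raw / weight
    }'
}

est_iops 9 120 60 4   # 60/40 read/write mix
est_iops 9 120 80 4   # 80/20 read/write mix
```

With those assumed inputs the estimate is roughly 491 for the 60/40 mix and 675 for 80/20. Comparing such an estimate with the request rate that EVAperf reports for the disk group shows how close the group is to saturation.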

As for having one 300GB 10K drive in there, this can eventually lead to performance issues on the array. RSS IDs are formed in sets of 6 to 12 disks. Within that RSS ID, you probably have a 146 GB drive that mirrored itself to the 300GB drive. You are probably losing half the space on the 300GB drive in addition to it taking 20% of the I/O. Also, when you set up a disk group, an EVA will take the largest drive in the disk group and reserve double that space for protection (the EVA assumes worst-case RAID1). If you aren't using any type of protection, keep an eye on your space to make sure that you can lose at least 300GB of space.

My recommendation would be to run some EVAperfs and perfmon from the host side. There are hundreds of possibilities here that could be the issue but you need to know what your EVA is doing.
Eric Antunes
Honored Contributor

Re: SAN got slow suddenly

Hi Mark,

It is surely a SAN issue, since almost all of the server's I/O got slow suddenly.

I will explore - if allowed - all the suggestions.

Thank you,

Eric Antunes
Each and every day is a good day to learn.