
Copy performance problem

Ivan Ferreira
Honored Contributor

Copy performance problem

Hi all. I need help with this.

We have TruCluster over HP EVA Storages replicating with Continuous Access.

We did a firmware upgrade of the destination storage, then a failover, then a firmware upgrade of the source.

Since the firmware upgrade and failover, we have been facing problems with backup speed.

Database backups are done to disk. Very simple: put the database in backup mode, then use cp to copy the database files from one file system to another. Before the upgrade and failover, this process had a throughput of 30 MB/s. Now we are getting only 15 MB/s.
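
The nightly script is essentially this (the paths and tablespace name below are only illustrative, not the real ones):

# Put the database in backup mode (sketch; the real script handles
# every tablespace):
sqlplus "/ as sysdba" <<EOF
alter tablespace users begin backup;
EOF

# Plain cp from the database file system to the backup file system:
cp /oradata/db/*.dbf /orabackup/db/

# Take the database out of backup mode:
sqlplus "/ as sysdba" <<EOF
alter tablespace users end backup;
EOF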

From the cfsmgr and drdmgr perspective, nothing changed from the time that performance was acceptable.

I changed the EVA controller that serves the disks to check if performance increases, but it didn't.

If, during the backup, I create another read operation (for example with the "dd" command) on the source (database) disks/file systems, the throughput increases to 30 MB/s.

If, during the backup, I create another write operation (for example with "dd") on the destination (backup) disk/file system, the throughput also increases to 30 MB/s.
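
The extra dd streams are plain sequential I/O, something like this (the device and file names are just examples):

# Extra sequential read on the source (database) disk:
dd if=/dev/rdisk/dsk10c of=/dev/null bs=64k

# Extra sequential write on the destination (backup) file system:
dd if=/dev/zero of=/backupfs/ddtest.tmp bs=64k count=100000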

So the question is: why is the cp command not reading/writing at the speed that it could? The host performing the backup has over 70% CPU available, no paging activity, and no iowait.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
13 REPLIES
Rob Leadbeater
Honored Contributor

Re: Copy performance problem

Hi Ivan,

Exactly what upgrades took place on the EVAs ?

Did anything else change in the infrastructure ?

Cheers,

Rob
Hein van den Heuvel
Honored Contributor

Re: Copy performance problem

Is the additional 'dd' write test to a pre-allocated file or a raw device? That is... does it involve file allocation / growth?
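
The difference matters because a brand new file pays for file allocation on every extent it grows by. Something like this separates the two cases (the file names are only examples):

# Write over an existing, already-allocated file (no growth):
dd if=/dev/zero of=/backupfs/prealloc.tmp bs=64k count=10000 conv=notrunc

# Write a brand new file (includes allocation/growth):
dd if=/dev/zero of=/backupfs/newfile.tmp bs=64k count=10000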

You already mentioned:

>> From the cfsmgr and drdmgr perspective, nothing changed from the time that performance was acceptable

Still, for me the CFS is the biggest suspect. Which server owns the mount point?

Do you get similar speed from every server?

FWIW... a throughput of 30 MB/s is not too great. Is the target file system's free space heavily fragmented?

Did you verify a cp of the input itself by copying to /dev/null?
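
For example, something like this (the path is just an example) times the read side in isolation, with cp's own access pattern:

time cp /oradata/db/system01.dbf /dev/null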

Hope this helps a little,
Hein.

Ivan Ferreira
Honored Contributor

Re: Copy performance problem

Hi all, thanks for your answers:

>> Exactly what upgrades took place on the EVAs ?

Controller and disk firmware. Currently running 6100.

>> Did anything else change in the infrastructure ?

Nothing.

>> Still, for me the CFS is the biggest suspect. Which server owns the mount point?

We keep strict control over the file system owners. It's a five-node cluster; one node performs the copy from the other file systems, one served by each node. It has always been this way.

>> Do you get similar speed from every server?

The problem is that I don't get the speed I had previously with exactly the same configuration: one host backing up the other nodes' file systems at 30 MB/s.

>> FWIW... a throughput of 30 MB/s is not too great. Is the target file system's free space heavily fragmented?

Correct, it's not great, that's true; but the server that performs the backup is a little old (GS160) and has 1 Gb HBAs. Anyway, 30 MB/s was acceptable; 15 MB/s is not, as it doubles the backup time.

>> Did you verify a cp of the input itself by copying to /dev/null?

I tried reads and writes with dd and got 30 MB/s. After the backup finished, I tried with cp, run the same way the backup script launched from Oracle runs it, and I got 30 MB/s. I cannot reproduce the problem, but at night, when the backup starts, I don't get 30 MB/s. The only difference is that for backups, the database is in backup mode.

I will check tonight and keep searching, and will post any test I perform and any data obtained.

Thanks.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Rob Leadbeater
Honored Contributor

Re: Copy performance problem

Hi,

>> I cannot reproduce the problem, but at night,
>> when the backup starts, I don't get 30 MB/s.

From what you've said I suspect this probably isn't the case, but have you checked whether anything else on the SAN is now doing more I/O overnight that could be conflicting with your backup jobs ?

Cheers,

Rob
DCBrown
Frequent Advisor

Re: Copy performance problem

What does drdmgr show on each of the cluster members?

# drdmgr dsk0

View of Data from member blah...
Device Name: dsk0
Device Type: Direct Access IO Disk

There is a known issue that sometimes occurs after a firmware upgrade such that some - but not all - nodes receive incorrect information from the storage on a few devices. The information for the device indicates that it is not cluster safe, and this results in some of the nodes having a device type of "Served" instead of DAIO as seen above. This is typically the most common issue. There is a workaround available.
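
A quick way to compare is to run the same query on every member, for example (the member and device names are just examples):

# Compare the device type reported by each cluster member:
for m in member1 member2 member3 member4 member5
do
    rsh $m drdmgr dsk10 | grep "Device Type"
done

Any member reporting "Served" instead of "Direct Access IO Disk" points at this issue.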


The second issue has to do with active/active vs. active/standby on the controller. Did you upgrade from an active/standby firmware revision? If so, it is possible that the first path in the path list is to the off-speed controller. A single-IO-stream application would then continually select the off-speed controller path. Which kernel revision is being run? BL27 (the latest kit) has T10 ALUA support that can sometimes help avoid these pathological conditions. The code change recognizes which path is the off-speed connection and puts it back into the standby path list. This makes the internal path configuration the same as the one used with the active/standby firmware. However, typically only a 25% drop in performance is seen when using the off-speed controller with a single IO stream, so that may not be the problem.
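
You can get a first look at the path layout without downtime; on 5.1B something like the following shows the per-device path detail (exact options may vary by patch level, and the device id 45 is just an example):

# List the SCSI devices and their path counts:
hwmgr -show scsi

# Full detail for one device; check which path is listed first and
# whether it lands on the off-speed controller:
hwmgr -show scsi -did 45 -full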

DCBrown
Frequent Advisor

Re: Copy performance problem

...and one more thing: was the previous configuration running CA/Async or CA/Sync, and what is the current configuration set to?
Rob Leadbeater
Honored Contributor

Re: Copy performance problem

Hi,

DCBrown said: "Did you upgrade from an active/standby firmware revision?"

Ivan said: "Controllers and disks firmware. Currently running 6100."

Therefore no. The EVA is running XCS 6100, so it's an EVA4/6/8000, which has always been active/active.

Cheers,

Rob
Ivan Ferreira
Honored Contributor

Re: Copy performance problem

All servers have direct I/O to the disks from the drdmgr point of view.

The controllers are active/active (EVA4K).

I'm not able to identify or reproduce the problem easily, because the system has 24x7 activity.

The backup performance drops at any moment, regardless of the time of day. That is, it is not caused by another nightly process or by other servers connected to the storage.

The server reads from the EVA4K and writes the backup to the EVA5K. What I know by now is that write performance is not the problem; I'm having problems with read performance.

I don't have the latest patch kit for 5.1B; I have PK5. The new load-balancing code is interesting, as the round-robin method used right now is my first suspect in this problem, but it will probably be hard to install the latest patch kit at this moment.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Vladimir Fabecic
Honored Contributor

Re: Copy performance problem

Hello Ivan
Did you fix the problem?
Are you using the same FC adapter for the active connection to the EVA disks?
In vino veritas, in VMS cluster
Ivan Ferreira
Honored Contributor

Re: Copy performance problem

Thank you for asking, but I cannot do anything right now because I'm in a training course. As a workaround, we moved the backup process to another host in the cluster, with more CPU power and HBA speed. Even there, the backup is slow (20-25 MB/s), but it can help until the time to troubleshoot arrives.

The remaining checks I see:

- Disable one of the HBAs and check if performance increases by avoiding the round-robin load balancing.
- 8 paths for some vdisks, 4 of them in stale state, 4 available. An hwmgr -refresh scsi has to be done in single-user mode. Don't know if it will help.
- A "failback" of the storage. Downtime required again.
- Patch installation. I don't like rolling upgrades, so downtime required again.

Downtime not available.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
DCBrown
Frequent Advisor

Re: Copy performance problem

The best bet might be to raise a support case if the system is under a support contract. There are "methods" :-) that can be used, without taking the system down, to gather rudimentary FC performance information out of the emx driver. This information can often help eliminate or isolate the problem down to specific paths, switches, or controller nports, time of day, etc., or indicate an OS tuning issue. When raising a case, provide a dumpsys of the running system(s) in the cluster along with information on what other hosts are connected to the SAN.
Ivan Ferreira
Honored Contributor

Re: Copy performance problem

Thanks for your suggestions.

>>> The best bet might be to raise a support case if the system is under a support contract.

Yes, we have a contract. But it normally takes forever to diagnose the problem, and too much information is required. Unfortunately, right now I don't have time for this: we are implementing 30 new servers, 6 clusters, and 4 new storage arrays =( (Going crazy).

>>> Disable one of the HBAs and check if performance increases by avoiding the round-robin load balancing.

No performance effect. I disabled one HBA and ensured that the vdisks were served by the right controller.

>>> 8 paths for some vdisks, 4 of them in stale state, 4 available. An hwmgr -refresh scsi has to be done in single-user mode. Don't know if it will help.

The SCSI refresh did not increase the performance either; it was done on some LUNs without shutting down the system.

Hope next month I will have time to troubleshoot the problem =).
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Ivan Ferreira
Honored Contributor

Re: Copy performance problem

Finally, I had time to do some tests.

Individual read and write performance tests indicated that the problem was not related to the SAN or the storage. A throughput of about 40-50 MB/s can be reached.

The problem was solved by replacing the "cp" command with "dd" using a large block size. The transfer rate doubled.
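
In essence, the backup script now copies each datafile with dd and a large block size instead of cp (the paths and block size are examples of what we use):

# Copy every datafile with a 1 MB block size:
for f in /oradata/db/*.dbf
do
    dd if=$f of=/orabackup/db/`basename $f` bs=1024k
done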

What I cannot explain is why the "cp" command was enough before.

Thanks all for your responses.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?