Operating System - Tru64 Unix

Tru64 NetBackup Performance

hkyeakley
Advisor

Tru64 NetBackup Performance

--== Warning: Wall of text incoming ==--

I have a NetBackup environment consisting of:

-= Local Site =-
1 Red Hat Linux AS 4 Master running NBU 6.0 MP7
2 Red Hat Linux AS 4 Media Servers running NBU 6.0 MP7
3 Tru64 V5.1B (Rev. 2650) SAN Media Servers running NBU 6.0 MP7 (mix of O/S patch kits 6 and 7)
1 Spectra Logic T380 with 12 IBM LTO4 drives running latest BlueScale patches and drive firmware.
1 NetApp 1400 VTL running latest firmware.

-= DR Site =-
1 Red Hat Linux AS 4 Master running NBU 6.0 MP7
1 Tru64 V5.1B (Rev. 2650) SAN Media Server running NBU 6.0 MP7
1 Spectra Logic T200 with 12 IBM LTO4 drives running latest BlueScale patches and drive firmware.

Last July we replaced our ADIC i2000 library (LTO2 drives) with a Spectra Logic T380. Once the library was deployed we noticed that our Linux systems can write to it at LTO4 speeds, and even the regular network clients get decent throughput over a 1 Gb Ethernet network. But the 3 Tru64 SAN media servers absolutely crawl. Despite having the SAN media server license installed, I can only get about 10 - 20 MB/s on the policies using the Tru64 storage units.

Our main production database sits on a GS1280 (30 CPUs, 114 GB memory), and we have an ES80 attached to another Spectra Logic library at our DR site. Every Sunday morning, I write an RMAN backup to tape, mail the tapes to my DR site, and restore the RMAN files using a Spectra Logic T200 attached to the ES80, which also has the SAN Media Server software installed. My GS1280 takes 15-20 hours to back up, but my DR system can restore the same files in 6-7 hours, running at 80 - 110 MB/s. I'm completely baffled how the smaller system gets such awesome throughput while my production box plods along at sub-Ethernet speeds.

I've spent the past several months researching performance and tuning suggestions, and I've applied settings one at a time whenever I can get an outage.

To speed up testing, we have another GS1280 with half the CPUs and memory of the production system. It only runs test databases, so it's easier to get a reboot approved if I want to try tuning a particular kernel parameter or whatnot. I installed the SAN media server software on this second 1280 and have been trying to tune it for NetBackup for the last couple of months.

Within NetBackup, I've tuned the size and number of data buffers, with no visible effect.
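For reference, this is the kind of tuning I mean: the bptm buffer touch files on an NBU 6.x media server. The paths are the standard ones, but verify them against your own install; the values shown are purely illustrative, not a recommendation:

```shell
# Stage the bptm buffer touch files; on the media server they live in
# /usr/openv/netbackup/db/config/. Staged in a temp dir here so the
# values can be reviewed before being copied into place.
STAGE=$(mktemp -d)
echo 262144 > "$STAGE/SIZE_DATA_BUFFERS"    # 256 KB per buffer
echo 32     > "$STAGE/NUMBER_DATA_BUFFERS"  # 32 buffers per drive
# Shared memory used per active drive = SIZE * NUMBER:
PER_DRIVE=$(awk -v s="$(cat "$STAGE/SIZE_DATA_BUFFERS")" \
                -v n="$(cat "$STAGE/NUMBER_DATA_BUFFERS")" \
                'BEGIN { printf "%d KB per drive", s * n / 1024 }')
echo "$PER_DRIVE"
```

bptm logs the buffer values it actually uses at startup, so the effect of a change can be confirmed in the bptm log after the next backup.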

I've used the hwmgr command to look at the driver and firmware level of just about every piece of equipment on both systems, up to and including the individual busses. The GS1280 has everything the ES80 does, it just has more of it.

I've verified the HBA drivers and firmware on all boxes; everything appears to be at the latest levels.

I've asked my SAN guys to double check the zoning, LUN masking, configuration and firmware levels on the SAN switches here and at my DR site to see if there's anything that might be preventing Tru64 from writing to either of my libraries at SAN speeds. They have checked and everything seems to be in order on both SAN environments. Furthermore, I've asked them to look at port utilization on the SAN switches during test backups from the 1280 and they tell me that the HBAs are hardly being utilized.

We recently deployed a NetApp VTL, and I was curious whether the VTL would get better performance (which would point to some type of incompatibility between Tru64 and Spectra Logic). It doesn't, as far as I can find. If I set up a test policy to write to the VTL from my test GS1280 and let it write to all 80 virtual drives, no single stream exceeds about 10 - 20 MB/s.

Next, I looked at the fragmentation level of the AdvFS domains on both systems. While some are heavily fragmented, the reported I/O performance is 100% for every file domain I've checked.

The fact that all my clients (Windows, Linux, and a handful of Solaris 10) work well with both libraries makes me think this is something in Tru64. If that's true, then I'm trying to figure out what is set correctly on my DR ES80 that's jacked up on my local 1280.

According to section 1.9 of the Tru64 tuning manual (http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51B_HTML/ARH9GCTE/TITLE.HTM) the 5 most commonly tuned kernel subsystems are: vm, ipc, proc, inet, and socket. Furthermore, http://seer.entsupport.symantec.com/docs/235845.htm is a technote advising Tru64 kernel changes for NetBackup. I have examined the values across all my systems. In most cases, the values on both systems meet or exceed tuning suggestions I have found in manufacturer documentation. The two or three values I have tuned so far have had no effect.

http://www.scribd.com/doc/19213788/Net-Backup-6
I found this TechNote which recommends setting the sem_mni and sem_msl values to 1,000. sem_msl is currently set to 500 on my local 1280, and I think this is perhaps the only kernel parm I have yet to tune. I'm going to ask for an outage this week to increase this setting to 1,000. If that doesn't work, then I believe I will be officially stumped.
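The live values can be checked with `sysconfig -q ipc` and compared in a script; a rough sketch follows. The sample output below is made up for illustration — run the real command on your box, and make any change permanent via an ipc: stanza in /etc/sysconfigtab (applied at reboot):

```shell
# On Tru64 you would capture the real output: OUT=$(sysconfig -q ipc)
# A made-up sample stands in here so the check itself can be shown.
OUT='ipc:
sem_mni = 128
sem_msl = 500'
MSL=$(printf '%s\n' "$OUT" | awk '$1 == "sem_msl" { print $3 }')
if [ "$MSL" -lt 1000 ]; then
    echo "sem_msl=$MSL is below the recommended 1000"
fi
```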

I've also watched the EVM channels and the binary error log and haven't seen anything alarming. The tape drives aren't throwing errors and appear to be working fine.

This is leading me to believe that there is something not tuned correctly between the Tru64 O/S and the NetBackup client. If it's not in the kernel then I simply don't know where else to look.

I'll be posting this to the NetBackup forums on Symantec.com, the ITRC forums on HP.com and the NetBackup mailing list.

Can anyone think of any stone I've left unturned? Thanks.
16 REPLIES

Re: Tru64 NetBackup Performance

Hi hkyeakley!

From time to time I tune backups for my customers. As you can see, your writes to tape are running in start/stop mode instead of streaming mode, while the reads from tape on the ES80 are streaming. As a rule this is not a Tru64 tuning problem, but a storage performance and SAN topology problem. How many FCAs are installed in the GS1280? Do you use a separate one for the tape router? Are the FCA and the tape router on the same quad? There are a lot of questions...

Regards Alexander
Rob Leadbeater
Honored Contributor

Re: Tru64 NetBackup Performance

Hi,

What I/O drawers do you have on the GS1280 and where are the FC HBAs connected ?

I'm wondering if you've somehow managed to get some contention on the internal I/O busses.

Also, 30 CPUs and 114 GB of RAM seems an odd combination. How are the CPUs and RAM split over the 4 system drawers ?

Cheers,

Rob
hkyeakley
Advisor

Re: Tru64 NetBackup Performance

Thank you for your reply. I'm wondering if you could elaborate on a couple of points.

You mention streaming to tape instead of start/stop mode. I'm not sure I understand that question. I set up and configured each library the exact same way through my OS and backup software. If my DR site is streaming and my local site is in start/stop mode, I'm not sure how they got configured that way. How do I check that?

As far as FCAs, I'm attaching a file with the output of hwmgr -show fibre -adapter on both of my systems.

As far as dedicating HBAs to tape, I've tried it both ways. I have 8 HBAs on my 1280, and I've tried zoning 4 HBAs so that they see only my HP EVA and NetApp Storage Arrays while the other 4 only see my tape library, and I've tried zoning it to where all 8 HBAs see all my storage arrays AND my tape drives. I get about the same lousy performance either way.

At my DR site, the 4 HBAs in my ES80 are zoned to see all 6 LTO4 tape drives and my 1 EVA array, and I get blazing (well, blazing when compared to my local site) speeds in that configuration.

Are the HBAs and tape on the same quad? I didn't mention this in my original post, but my drives are Fibre attached drives. So I've got fibre running from the drives to a brocade switch and then I just zone the drives to the server I want the drives on.

How do I tell if HBAs are on the same quad? I understand the concept of quads and I've seen senior administrators run commands to manage I/O on individual quads, but I myself don't know how to get that deep into messing around with individual quads. If my HBAs are on the same quad, my assumption would be that I'm potentially I/O clogging 1 quad while other quads have idle I/O resources. If that's true, what's the remedy for that?

If this is, as you say, a storage performance and SAN topology issue, then I'm afraid I don't even know where to start. I know SAN 101 level stuff, but if we're talking about performance tuning and advanced SAN layout, then I'm going to need some help learning exactly where to look.

Thank you again for your reply. I look forward to your answers.
hkyeakley
Advisor

Re: Tru64 NetBackup Performance

Mr. Leadbeater,

Thank you for your reply.

How do I find out what I/O drawers I have in the 1280? Is there a command to run to get that info or do I need to go in and get some information off the 1280?

If there is contention on the internal I/O bus, is there anything I can do in software to remedy that, or am I going to have to power down the 1280 and physically rearrange cards?

As far as the CPUs and memory, we actually have 32 processors. My DBAs came to me about a year ago and said there was some bug in Oracle that occurred on Alpha servers with over 30 processors, and asked me if it was possible to disable 2 processors in the box. A strange request, but technically possible, so I used psradm to disable two processors.

As far as the memory, I did think 114GB was an odd number when I posted it. I got that number from running a vmstat -P | head and dividing the total memory number by 1,024. If that's not the right way to report physical memory, please feel free to correct me. If it *is* the correct way, then I have no idea why my memory isn't a multiple of 8.
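For the record, here's the arithmetic I'm doing (the figure below is an example, not the exact number from my box; I'd also note the console/PAL code reserves some physical memory, which may be why the total isn't a clean multiple of 8):

```shell
# vmstat -P reports total physical memory (Alpha uses 8 KB pages, so
# a page count would first be multiplied by 8/1024 to get MB).
# Dividing the MB figure by 1024 gives GB. Example value:
MB=116736
GB=$(awk -v m="$MB" 'BEGIN { printf "%.1f GB", m / 1024 }')
echo "$GB"
```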

Thanks again for your reply.
Rob Leadbeater
Honored Contributor

Re: Tru64 NetBackup Performance

Hi,

It's been a while since I've worked on one of these, but you should be able to drop to the MBM console and get the configuration with:

MBM> show sys

Cheers,

Rob
Rob Leadbeater
Honored Contributor

Re: Tru64 NetBackup Performance

The output of the Unix command:

# hwmgr view hier

would also be useful to see...

Cheers,
Rob
hkyeakley
Advisor

Re: Tru64 NetBackup Performance

Mr. Leadbeater,

Attaching the output of MBM from my local 1280 and remote ES80.

Thank you.
hkyeakley
Advisor

Re: Tru64 NetBackup Performance

Mr. Leadbeater,

Attached is the hwmgr view hier of my 1280.
hkyeakley
Advisor

Re: Tru64 NetBackup Performance

And here's the hierarchy of the ES80.

Rob Leadbeater
Honored Contributor

Re: Tru64 NetBackup Performance

You appear to have far too many tape devices on the GS1280 for a 12 drive library...

Can you also post the output of:

# hwmgr show scsi -full


Cheers,

Rob
hkyeakley
Advisor

Re: Tru64 NetBackup Performance

OK, on the tape drives.

I have a NetApp VTL and I have about 100 virtual drives presented to my systems.

The difference between the VTL drives and the real drives is that the real drives have a firmware revision of 94D0 and the NetApp virtual drives have a "firmware" revision of 022C.

So yes, the system does see about 120 tape drives, but all of them only write at about 10 MB/s - 20 MB/s.

Do you still want the output of # hwmgr show scsi -full?
cnb
Honored Contributor

Re: Tru64 NetBackup Performance

Maybe run the latest sys_check version (145) on the 'good' and 'suspect' systems and compare the results?

Rgds,

Re: Tru64 NetBackup Performance

"You mention streaming to tape instead of start/stop mode. I'm not sure I understand that question."
It's rather simple: if the system for some reason can't supply data to the drive as fast as the drive compresses and writes it to media, the device buffer empties and the tape stops. Unfortunately, the drive stops the tape somewhere past the end of the recorded data, so it has to rewind and seek back to the end-of-record mark before the next chunk of data can be written. That's a huge seek-time delay.
So you need to supply data at up to 120 MB/s if compression is enabled. If there's a bottleneck somewhere and only 80 MB/s can be supplied, the drive drops into start/stop mode and the speed falls to 10 - 20 MB/s.
"Quad": each group of 4 switch ports runs on one IC. For example, on a Brocade, if ports 0,1,2,3 are used by FCA1, EVA1, FCA2, EVA2, and the next quad, ports 4,5,6,7, is used by the tape routers, we get a bottleneck at the switch backplane.
On the Tru64 side, if you have about 1 GB of free memory and the backup software uses a large block size or a good blocking factor, it's not a Unix problem.
You can check the data flow with dd from a local SCSI disk to tape. UltraSCSI at 160 MB/s and dd with a 1 MB block size gives a good check.
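That dd check can be timed in a small script; a sketch follows. The tape device path is hypothetical — substitute your own no-rewind device under /dev/ntape/; here it falls back to a scratch file so the sketch can be dry-run off the server:

```shell
# Time a large-block sequential write: dd with 1 MB blocks straight
# at the no-rewind tape device. TAPE is hypothetical; it falls back
# to a scratch file so the sketch can be dry-run anywhere.
TAPE=${TAPE:-/tmp/dd_throughput_test}
COUNT=64                                # 64 x 1 MB = 64 MB test write
START=$(date +%s)
dd if=/dev/zero of="$TAPE" bs=1024k count="$COUNT" 2>/dev/null
SECS=$(( $(date +%s) - START ))
if [ "$SECS" -eq 0 ]; then SECS=1; fi   # avoid divide-by-zero on a fast run
echo "~$(( COUNT / SECS )) MB/s"
```

If dd from local disk to tape streams at full speed but NetBackup doesn't, the bottleneck is above the device layer; if dd is slow too, it's in the HBA/SAN/drive path.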


Rob Leadbeater
Honored Contributor

Re: Tru64 NetBackup Performance

Hi,

> Do you still want the output of # hwmgr show scsi -full?

Yes please...

Another thought... have you tried running backups with something like vdump to rule out any issues with NetBackup ?

Cheers,
Rob
hkyeakley
Advisor

Re: Tru64 NetBackup Performance

I'm attaching the output of hwmgr show scsi -full

Sorry I've been quiet on this topic. I really do appreciate everyone's input on this topic.

Someone mentioned that LTO-4 is not supported on Tru64. According to IBM's documentation, that's correct. However, LTO-3 is supported. Shouldn't Tru64 autonegotiate and at least run at LTO-3 speeds? Even if all I got was 60 - 80 MB/s, that would still beat the 10 - 20 MB/s that I'm getting.

Furthermore, that still doesn't explain why my ES80 at my DR site gets blazing speeds.

While doing some research on Google trying to find a solution to my slow backup and restore issues on Tru64, I stumbled across the following white paper:

http://bit.ly/b5XuwL

Is that my problem? In order to get performance out of my 8 HBAs going to my library, do I need to use the runon command to specify which QBB or processor my backups run on? That seems a bit convoluted to me.

Again, I greatly appreciate any assistance everyone can be in solving this issue.

Thank you.

Re: Tru64 NetBackup Performance

You have found the right white paper, but with one difference: the GS1280 is much better than the GS320/160/80. You don't need to use "runon" as much as that paper suggests. The GS1280's interprocessor mesh is so powerful that the uplift from "runon" is not that large. All the other topics are still very important: zoning and mapping, HBA placement in the PCI box, data flow across the switches...