Stefan Farrelly
Honored Contributor

Re: disk storage solution

Hi Martha,

I'm off home, it's late here! I'll answer tomorrow.
I'm from Palmerston North, New Zealand, but somehow ended up in London...
Anthony deRito
Respected Contributor

Re: disk storage solution

I agree, there is a mystery here and I am not convinced it is totally the disks.

A sleep state does not only mean the process is busy performing I/O, and your %wio also suggests that isn't the case. Some system calls are taking place, and you need to find out what they are.

In Glance, what is your observation of the types of system calls being made? Look at this under "Process System Calls" in your Process List screen. Based on the system calls, you can see the type of work being performed by the kernel on behalf of the program. This will give you an idea of what is actually going on.

Martha, here's the thing with database connections: run the command "man unix". What you are looking at is the description of the local UNIX domain protocol. This is the protocol that should be used by your database access facility (like Oracle's SQL*Net) for ALL local database connections and I/O. In contrast, using the TCP/IP protocol means you are consuming a LOT of networking overhead that is not necessary. Most people do not know this (especially DBAs, so you need to introduce them to it). Oracle makes several clear references to this and tells you how to configure your listener.ora file so that it will be used. Usually this is the default, but not necessarily. People are cutting their local database access and I/O time in half by changing this (I am one of them).
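For Oracle, for example, it is mostly a matter of making sure the listener has an IPC address and that local clients resolve to it. This is only a rough sketch from memory; the KEY, SID and host names below are made up, so substitute your own and check the Oracle Net documentation:

    # listener.ora (sketch only)
    LISTENER =
      (ADDRESS_LIST =
        (ADDRESS = (PROTOCOL = IPC)(KEY = PROD))
        (ADDRESS = (PROTOCOL = TCP)(HOST = yourserver)(PORT = 1521))
      )

    # tnsnames.ora (sketch only) - local programs connect via the IPC alias
    PROD_LOCAL =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = IPC)(KEY = PROD))
        (CONNECT_DATA = (SID = PROD))
      )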

The best way to determine this is to load lsof and use the big_brother script located in its scripts directory. It will show you exactly how your connections are being made, and it shows you the connections in real time.

Tony

Anthony deRito
Respected Contributor

Re: disk storage solution

Martha, I wanted to clear up some confusion you may have had. The unix protocol is for LOCAL connections only. Access from remote systems does, of course, need to use TCP.

Tony
Martha Mueller
Super Advisor

Re: disk storage solution

Tony,

This is outstanding information. Thank you so much for your input. I will take this unix domain protocol to the DBAs.

I can go along with the premise that the disks are not necessarily at fault. This could be a symptom, not a cause.

On the process list, the system call it spends the majority of its time on is select. We found this out when we first moved to version 11.9.2 of Sybase and discovered that they were using continuous polling rather than interrupts. There are about 5000 (that's right, five thousand) selects per second. The next busiest calls are write (120/sec), read (80/sec), recv (50/sec), send (35/sec), and lseek64 (35/sec).

Again, I must plead ignorance...what is lsof?
Anthony deRito
Respected Contributor

Re: disk storage solution

Martha,

Wow 5000 sel/sec! It would be worth taking a look at the whole picture from this new perspective.

As for lsof:

1. Go to www.software.hp.com

2. Scroll down to the end and follow the link "HPUX Public Domain Software"

3. Type lsof into the search field and click search.

4. Download the current version of lsof

This tool lists open files and takes many different arguments, but what you want is the big_brother script in the scripts directory. There is a README in that directory.

Tony
Martha Mueller
Super Advisor

Re: disk storage solution

Tony,

Went to get lsof as you recommended. However, the precompiled versions will only run on 32-bit... I am using 64-bit. They state I can build it myself from the source code, but I don't know how to do that.
Anthony deRito
Respected Contributor

Re: disk storage solution

Go to Vic Abell's home page:

http://www-rcd.cc.purdue.edu/~abe/

This is the guy who maintains it. If there is a version for it, it will be here.

Tony
Anthony deRito
Respected Contributor

Re: disk storage solution

Martha, it seems as though Vic thought of you. Go to

ftp://vic.cc.purdue.edu/pub/tools/unix/lsof/binaries/hpux/B.11.00/64/


Regards,

Tony
Martha Mueller
Super Advisor

Re: disk storage solution

Thanks. I couldn't find anything referring to 64-bit on my first visit to Abe's page. But I have remembered that GlancePlus shows the files opened by the application, in this case dataserver. I can see that all the sockets have a protocol of TCP, which may indicate that the listener needs to be modified. That is being reviewed by the DBAs now. But there aren't any local applications on this particular server that would be accessing the database, so what I am seeing may be correct. I'll have to track down some applications that are using their local database. We generally have things set up so that they access the production database, which sits on a dedicated database server.

I'll go find the big_brother script and see if anything else comes to light.
Martha Mueller
Super Advisor

Re: disk storage solution

Tony,

I haven't found the big_brother script. However, I ran lsof -i -U and found the following:

dataserver pid owner 41u inet address 0t0 TCP server.domainname.net:3050->server.domainname.net:65298 (ESTABLISHED)

Does the fact that both sides of the arrow point to the same server, and that TCP is listed as the protocol, indicate that we are wasting cycles as you described above? In case this doesn't display well, the arrow is between 3050 and server.

This is the development server, so some of the applications reside on the same server with the database.
Anthony deRito
Respected Contributor

Re: disk storage solution

Martha,

What you are seeing here is a locally established socket, as identified by the "inet" TYPE specifier. This is defined to be "an Internet domain socket". The "TCP" NODE specifier is what they call the node number of the file; in this case it is a socket, so they label it with the "TCP" (Internet protocol) type.

This connection is not using the local unix domain protocol. The TYPE specifier would be "unix" instead of "inet" if this were a connection using the local unix protocol.
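A quick check with the lsof you just installed: list only the UNIX domain sockets belonging to dataserver (these are standard lsof flags; -a ANDs the selections together):

    lsof -a -U -c dataserver

If that comes back empty while "lsof -a -i -c dataserver" is full of TCP sockets, then every local client really is going through the network stack.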

Now you're getting the picture!

You are consuming a lot of overhead here when you do not need to.

I would talk to your DBA and have him contact the vendor. They can give your DBA some insight into this issue.

Regards,

Tony
Stefan Farrelly
Honored Contributor

Re: disk storage solution

Hi Martha,

Firstly, as promised, here is my assessment of the SC10 and HVD10.

The SC10 is Ultra2 SCSI (80 MB/s), which is great; the HVD10 by default is FWD (20 MB/s), but you can get it with Ultra2 SCSI cards (a must!).
All disks inside are the same as the really fast Jamaica disks: 10k rpm, 15 MB/s each. Lovely and quick.

The differences are that the HVD10 is much more expandable (8 TB) as opposed to only 2 TB for the SC10. Both have 2 controllers, but the HVD10 quotes 2 connections to the host per controller (4 in total) as opposed to 2 for the SC10. The more controllers the better (still, I couldn't imagine 8 TB on only 4 controllers!). Also, the HVD10 can have RAID software installed, which means you get a lot more usable space than with mirroring, where you lose half.

I can't comment on price; you will need to speak to an HP reseller. I'm sure the SC10 will be a lot cheaper. Certainly the performance of either with Ultra2 SCSI, if you use striped lvols over as many disks as possible and both controllers, should be excellent: a large jump over the older Jamaicas (20 MB/s -> 80 MB/s).
However, a friend of mine looked into all this in detail when they wanted to replace their AutoRAIDs with something really fast: fiber. The HP FC60 is fiber to the server but its internal disks are only SCSI-connected, and HP wanted 288k for 1 TB; the Clariion Nikes were fiber to the host and fiber to their internal disks, and the price was 120k for 1 TB. Also, the Nikes came with 512 MB of cache, the FC60 only 128 MB. It was no contest. They've now got 2x Model 30s installed, LVM striped (+RAID), and performance is awesome.

Now, for Dragan's question on how to get Nikes working as well as a Jamaica.
Firstly, both have 2 controllers connected to a server, so the maximum throughput is the same (20 MB/s x 2). Now, let's imagine both have the same size and speed of disks. A Jamaica can take 8 disks, a Model 20 Nike 20 disks. If we use extent-based (or distributed) striping over all the disks in both the Jamaica and the Nike, then when you read a large amount of data you will be accessing 20 disks on the Nike as opposed to only 8 on the Jamaica. As well as having more disks, the Nike has 64 MB of cache, so it is reading ahead, etc. Can't you see how this will be faster?
For writes: if you use striping + mirroring on all the Jamaica disks (for disk protection), then every write sends 2 writes down our SCSI channels to the Jamaica. If we're using RAID-S on the Nike, we don't need mirroring, so only 1 write goes down our SCSI channel to the Nike; internally the Nike has to do multiple writes, depending on how many disks are in a LUN (usually 5), but it caches it all up, so there is no write delay to the host. So in this example the usable write throughput over our SCSI channels is double on a Nike system compared to a Jamaica.

And of course a Nike has intelligent controllers that allow alternate paths and controller replacement on the fly, which Jamaicas don't.

I'm not a believer in measuring disk performance by access times in milliseconds, etc. I prefer to measure it the way we and our users actually use it: through LVM. The number of times I've heard EMC tell a customer that performance is excellent because the access time is 300 ms or so, when I can simply create a striped lvol over multiple EMC (logical) disks and controllers and the performance through LVM jumps up by a quantum amount!

I'm from Palmerston North, New Zealand, but somehow ended up in London...
Dragan Krnic
Frequent Advisor

Re: disk storage solution

With due respect, Stefan, I HAVEN'T asked how to get Nikes working as well as a Jamaica!

I ran tests on the sequential and random access times and rates of a JBOD consisting of 9 disks of 73 GB each, connected to the computer with a single SCSI cable. Though the enclosure does have a second set of SCSI cabling internally and externally, in addition to twin fans and twin power supplies, we only have one controller in the computer, so we use that one. The box with all the disks etc. costs $16,000.

The mix of long sequential reads and random accesses is far too specific to the particular set of applications running for any standard benchmark to fully evaluate disk I/O. However, there are two extremes worth examining as limiting conditions: the best case is long simultaneous sequential reads (which we have seen top out at 8 disks and then drop a little when a ninth disk is included), whereas the worst case is totally random accesses invoked by independently randomized concurrent processes.

The reality is somewhere in between. However, it may safely be assumed that the better the results in these two extreme tests, the better also the results for any mixture of the two in between.
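A sketch of the kind of command I mean for the sequential extreme; the device names and sizes here are only placeholders, not my actual configuration:

    # read 1 GB sequentially from each raw disk, all disks at once
    START=`date`
    for d in c1t0d0 c1t1d0 c1t2d0 c1t3d0
    do
        dd if=/dev/rdsk/$d of=/dev/null bs=256k count=4096 &
    done
    wait
    echo "started $START, finished `date`"

The random-access extreme needs a small program that seeks to random offsets from many concurrent processes, which is too long to paste here.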

So stop beating around the bush, do the tests, and let us know what you find out. Theoretical considerations like those in your last post won't bring us any nearer to the truth. They may only reinforce the nagging suspicion that there is something to hide about the performance of RAID systems.
Martha Mueller
Super Advisor

Re: disk storage solution

Sorry for the late response...I was out of the office yesterday.

Tony, thanks for that insight. I am sending this to the DBA's attention. While it probably doesn't apply to our main production server, it does apply to our development server and possibly some of the other servers.

Stefan and Dragan, this is priceless information. I have scheduled a meeting with a vendor for next Tuesday. It will be interesting to hear what they have to say. I have learned so much from this discussion that I will be able to ask pointed questions and, even better, be able to understand their answers.

I would like to touch on the point about extent-based striping. Does it boil down to being the same as regular striping via LVM, but can be used in conjunction with mirroring? Or, is it something different? My understanding is that it is the same, but that you have to manually set up the striping.
John Palmer
Honored Contributor

Re: disk storage solution

Martha,

HP-UX doesn't yet support mirroring of 'proper' striped volumes, i.e. those with an 8K, 16K, 32K... stripe size.

Extent-based mirroring/striping is a technique where physical extents (minimum size 1 MB, fixed for the whole volume group) are created on several disks in a stripe set and mirrored in the same way. You used to have to do this with multiple calls to lvextend in a script, but recent versions of the LVM commands can do it in one call to lvcreate utilising physical volume groups. See man lvcreate and look at the -D flag; there is a rough sketch below.
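A sketch of both forms, assuming a volume group vgdata whose disks have already been split into physical volume groups in /etc/lvmpvg (all names invented, and I may be misremembering the exact flag combination, so do check man lvcreate):

    # 'proper' striping (not mirrorable yet): 4 stripes, 64K stripe size, 4 GB
    lvcreate -i 4 -I 64 -L 4096 -n lvstripe /dev/vgdata

    # extent-distributed and mirrored in one call:
    # -D y distributes extents across the PVs, -s g makes allocation PVG-strict,
    # -m 1 adds one mirror copy (needs MirrorDisk/UX)
    lvcreate -D y -s g -m 1 -L 4096 -n lvdata /dev/vgdata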

Regards,

John
Stefan Farrelly
Honored Contributor

Re: disk storage solution

Hi Martha,

Yes, extent-based striping is done manually with lvextend, and it allows mirroring. It was available before lvcreate -i -I came along (or lvcreate -D for distributed/mirrored). It takes a lot longer to create a stripe this way (on a large filesystem), but it's more flexible: you can place each extent of the primary or the mirror on whichever device you want, which you can't with lvcreate -i or -D. The only problem with it is that you have to extend by your extent size, typically 4 MB unless you created your volume group with a different extent size.
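To make that concrete, the manual method is basically a loop of lvextend calls, one extent at a time, cycling around the disks (and channels) you want to stripe over. A rough sketch only; the disk names and extent count are invented, and it assumes a 4 MB extent size:

    # create an empty lvol (no size given), then grow it one extent at a time
    lvcreate -n lvol1 /dev/vg01
    i=0
    while [ $i -lt 1000 ]        # 1000 extents = 4 GB at 4 MB per extent
    do
        for d in /dev/dsk/c1t1d0 /dev/dsk/c2t1d0 /dev/dsk/c1t2d0 /dev/dsk/c2t2d0
        do
            i=`expr $i + 1`
            lvextend -l $i /dev/vg01/lvol1 $d
        done
    done

For the mirrored version you do the same dance with a mirrored lvol and name the device for the mirror copy as well; I won't swear to the exact lvextend syntax from memory, so check the man page.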

HP use extent-based striping on all their internal email servers worldwide - I was on the team that set it up.
I'm from Palmerston North, New Zealand, but somehow ended up in London...
Martha Mueller
Super Advisor

Re: disk storage solution

Does anyone use extent-based striping? Is there anything to watch out for? From the description, it sounds like something I should be incorporating here.
Martha Mueller
Super Advisor

Re: disk storage solution

Excuse me, I should have said, does anyone currently administer extent-based striping? Stefan just said that HP uses this.
John Palmer
Honored Contributor

Re: disk storage solution

Yes, I do on several servers. I don't have any problems with it.

Regards,

John
Stefan Farrelly
Honored Contributor

Re: disk storage solution


I thought it best to explain why HP uses it on their email servers:
It's because with extent-based striping you can effectively load-balance (pseudo RAID 10) across all available channels and disks. Typically people mirror on a device-to-device basis, usually across channels for redundancy. That is not the best for performance, because a mirrored disk is only used for reads if the primary is really swamped. What we've done is take every channel and disk on a server, create each lvol across the whole lot, and mirror across the whole lot (with each mirror extent on a different device and channel, of course), so for heavy-throughput applications (which OpenMail definitely is) all disks and channels are used for reading and writing together, which is the optimum. Even with distributed lvols (lvcreate -D) you have to have at least 2 PVGs with half your available disks in each. With extent-based striping we basically have all devices in each group.

I'm from Palmerston North, New Zealand, but somehow ended up in London...
Martha Mueller
Super Advisor

Re: disk storage solution

Thanks for that explanation.
Dragan Krnic
Frequent Advisor

Re: disk storage solution

Extent-striping is a good thing. I use it on all my servers to spread the disk queues thinner. Typically a server in my domain has a system disk with boot, swap and a single root volume occupying the rest of the disk*. The external disks are always one extent-striped mirrored volume extending over all of the disks in a JBOD.

With so many servers and frequent migrations to ever bigger disks, I had to devise a way to work around the tedious extent-by-extent build of a stripe-mirrored volume. I usually execute a "mediainit -v" over all the disks simultaneously, check the outputs, and if everything is all right write zero bytes over all the disks, simultaneously of course. Since "mediainit" is a necessary step in any installation, I don't even count the up to 2 hours it takes as part of the volume build process. Writing zero bytes takes up to half an hour for really big disks.
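Roughly, that preparation step looks like this sketch (the device names are placeholders, not my real ones):

    # mediainit all disks in parallel, keeping the output for checking
    for d in c1t1d0 c1t2d0 c2t1d0 c2t2d0
    do
        mediainit -v /dev/rdsk/$d > /tmp/mediainit.$d 2>&1 &
    done
    wait
    # then zero them in parallel so all mirror halves start out identical
    for d in c1t1d0 c1t2d0 c2t1d0 c2t2d0
    do
        dd if=/dev/zero of=/dev/rdsk/$d bs=1024k &
    done
    wait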

The best thing is that once all the disks are guaranteed to be mirror images of one another (all zero bytes), I can skip the "lvextend -l++ /dev/vg01/lvol1 diskX diskX+1"** step by using a relatively simple (55-line) C program that writes the extent sequences into the first MB of each disk. That takes another minute and the volume is ready for deployment. (Before running it I create the VG and a mirrored LV with a single extent manually. The program reads this information from the first MB of each disk, adds the rest of the extents and flushes it back to disk.)

It can stripe-mirror the extents over both even AND odd numbers of disks. In case you have never done it, this is how an odd number of disks is stripe-mirrored: suppose you have 3 disks, A, B and C. To mirror, you stripe A-B, C-A, B-C and so on in this circular fashion, so each extent is mirrored on two separate disks. The apparent disadvantage is that it is not possible to use twin SCSI pathways, or rather that one branch would be incomplete, so if the interface on the wrong path fails you have a 50% chance of having to do some manual work to recover. I seldom use twin SCSI paths for anything other than performance, so it's not an issue for me; besides, in all my 20-odd years of Unix administration I've never seen a defective SCSI interface from HP, though there are rumours that it is not really unlikely.

*_______
So far I've never had a single instance of the root volume getting full (the reason most often offered in favour of separate usr, opt, var, home and tmp volumes). Separate volumes are an invitation to the "volume full" event, unless they are really humongous.

**______
Apparently, each time an extent is added to a mirrored volume, the extent on the primary device is copied to the secondary to "sync" it, the first MB of each disk in the group is updated, and an implicit "vgcfgbackup" is executed. Though this last step can be deferred to a manual step afterwards by including the option "-A n", I haven't seen any appreciable difference in execution time.