Re: vfast on clusters

Orjan Petersson · ‎06-22-2006

Hi,
Does it matter (performance-wise or otherwise) on which cluster node I activate vfast? (Or on which node I enable e.g. defragment?)

The system is two clustered ES80 running Tru64 5.1 PK5. Except for 2 local disks on each node, the disks are on an EVA5000 shared between the nodes.

Thanks,
Örjan

Orjan Petersson · ‎06-22-2006

Oops,
> The system is two clustered ES80 running Tru64 5.1 PK5
That should be Tru64 5.1b PK5

Han Pilmeyer · ‎06-22-2006

The vFast status for a domain gets recorded in the metadata. The actions are executed on the CFS server.

At least that is what a small experiment showed. And that's what would make sense too.

So it doesn't matter where you give the command, but the activity will be on the node that is the CFS server.

Orjan Petersson · ‎06-23-2006

Thanks Han!

What triggered my question is that I started to see, on one node, one of the kernel threads use close to 100% CPU (as shown by ps -o pcpu -m -p "kernel-pid"). This started a short time after I activated vfast and enabled defragment.

After playing around with the cfs SERVER attribute for the file systems and deactivating/reactivating vfast on the domains everything seems to be back to normal.

Any known bugs related to vfast in 5.1b PK5?

Han Pilmeyer · ‎06-23-2006

PK5 (BL26) should be pretty much okay.

If you want to make sure, you could load pfm (sysconfig -c pfm) and then run kprofile. If the top routines are in the msfs space, then it's AdvFS.
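
One way to read such a profile, sketched in Python under stated assumptions: the samples have already been copied by hand into (routine, module, sample count) tuples. This is not a parser for kprofile's real output format, and the routine names and counts below are made up purely for illustration.

```python
from collections import Counter

def share_by_module(samples):
    """Return {module: fraction of all samples} for (routine, module, count) tuples."""
    totals = Counter()
    for _routine, module, count in samples:
        totals[module] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {module: count / grand_total for module, count in totals.items()}

# Hypothetical, hand-entered samples; not real kprofile output.
example = [
    ("routine_a", "msfs", 600),
    ("routine_b", "msfs", 200),
    ("routine_c", "vm", 150),
    ("routine_d", "other", 50),
]

shares = share_by_module(example)
top_module = max(shares, key=shares.get)
print(top_module, round(shares[top_module], 2))
```

If the module with the largest share turns out to be msfs, that would point at AdvFS, in line with the rule of thumb above.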

Orjan Petersson · ‎06-23-2006

There is definitely something strange going on. During the night the "kernel idle" process on node 1 again started to eat CPU (105%). This happened with vfast running defragment on 3 domains.

After stopping vfast on 2 of the domains everything went back to normal. (CPU usage for "kernel idle" back to 20-25%)

I did run kprofile and there are a couple of "msfs" routines high up in the list. The results from both with and without the problem are in the attached file.

(I will raise a support case through our local support company)

Han Pilmeyer · ‎06-23-2006

It's supposed to do this when the CPU and disks are idle. Are you saying it was doing this when the system wasn't idle?

Orjan Petersson · ‎06-23-2006

No, the system was not idle.

If the system is supposed to use only "idle" time to run vfast, I would expect the %CPU of the "kernel idle" process to stay the same before and after deactivating vfast.
What happened is that it went from 105% to 25%.

Han Pilmeyer · ‎06-23-2006

Yes, but wasn't the 25% with one domain active (instead of 3 with 105%)?

Orjan Petersson · ‎06-23-2006

Yes, the 25% was after I deactivated vfast on 2 domains.

If I understand you correctly, vfast is supposed to use only otherwise idle time to do its work. So the %CPU of "kernel idle" should not change between vfast running on three domains and on a single one: either way the system would be "idle doing nothing" or "idle doing vfast".

Am I missing something here?

Han Pilmeyer · ‎06-23-2006

vFast can use as many CPU and I/O resources as it wants, provided that there is no useful "user" work that needs to be done (subject to the percent_ios_when_busy limit).

Orjan Petersson · ‎06-25-2006

(To put some of the figures into context: the machines have 4 CPUs each)
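
To make the 25% and 105% figures concrete, here is a minimal Python sketch. It assumes that ps reports %CPU relative to a single CPU, which is what values above 100% suggest but which is not confirmed here; the function name is only illustrative.

```python
def share_of_machine(pcpu, ncpus):
    """Convert a percentage of one CPU (as assumed for ps) to a share of all CPUs."""
    if ncpus <= 0:
        raise ValueError("ncpus must be positive")
    return pcpu / ncpus

NCPUS = 4  # each node has 4 CPUs

for pcpu in (25, 105):
    print(f"{pcpu}% of one CPU = {share_of_machine(pcpu, NCPUS):.2f}% of the machine")
```

Under that assumption, 105% for "kernel idle" corresponds to roughly a quarter of the node's four CPUs, and 25% to about 6%.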

I restarted vfast (activate, defragment=enable) on the 3 domains yesterday at 18h30.
The system was pretty idle (load average between 1 and 3; "kernel idle" %cpu 10%-25%) until about 22h30. Then the load average increased to 4-5, and "kernel idle" %cpu increased to 105%-115%.

It stayed like that until the morning, and at around 8 o'clock when people started using the system (Sunday is a normal working day here), the load average increased to 9-10, but the "kernel idle" %cpu stayed at 110%. At that time the CPU definitely had other useful "user" work to do.
At around 9, I stopped vfast (defragment=disable, deactivate) on the eppix_apps domain which made everything go back to normal.

When the change at 22h30 happened, "collect" data shows that CPU Idle went down from around 60% to 15%-30%, and CPU System increased from 15% to 45%.
The eppix_apps domain was defragmented (with defragment) a few days ago without any errors, and verify on the active domain reported only minor problems ("probably due to file system activity").
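
As a rough cross-check of the two measurements, a short Python sketch. It assumes that collect reports CPU percentages averaged over all 4 CPUs while ps reports %CPU relative to a single CPU; neither assumption is confirmed here, and midpoints of the quoted ranges are used.

```python
NCPUS = 4  # CPUs per node

def to_single_cpu_percent(machine_points, ncpus):
    """Scale percentage points of the whole machine to percent of one CPU."""
    return machine_points * ncpus

# collect: CPU System rose from about 15% to about 45% of the machine.
system_rise = to_single_cpu_percent(45 - 15, NCPUS)

# ps: "kernel idle" rose from 10%-25% to 105%-115% (midpoints used).
kernel_idle_rise = (105 + 115) / 2 - (10 + 25) / 2

print(f"collect system rise: {system_rise:.1f}% of one CPU")
print(f"ps kernel idle rise: {kernel_idle_rise:.1f}% of one CPU")
```

Under these assumptions the two figures are of the same order (about 120% versus about 92.5% of one CPU), which would be consistent with most of the extra system time being spent in the "kernel idle" threads, though it does not prove it.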

This is the 2nd night in a row that I see this behaviour. I only started to run vfast a few days ago so I can not tell if it has been like that before.

Any ideas of other things I can check to get an understanding of what is really going on?

I will let vfast continue on the two domains to see if the %cpu for kernel idle will increase during the night or not.

Han Pilmeyer · ‎06-25-2006

I just checked and the last performance fix for vFast is in BL26 (PK5). So this would most likely be an unknown issue at the moment.

Orjan Petersson · ‎06-25-2006

Thanks Han,
The result from running vfast during the night on the two domains was no increase in %cpu for "kernel idle" so the problem seems to be related to the specific domain "eppix_apps".

I will raise this issue through our local support company but I will leave this site in a few days so I am not sure how much will come out of that.

Joris Denayer · ‎06-28-2006

Han, Orjan

I've seen similar behaviour on an internal server and also on one of our customers' systems.

With vfast enabled, we observed kernel_idle CPU percentages up to 200%.
The vfast kernel threads took ~180% of the total. At the same time the I/O load was considerable, high enough to impact application throughput.

Disabling balance and topIObalance didn't change much.

On our local system, large files (up to 200GB) are created, while on the customer system a couple of million small files/month are created.

The first file_creation pattern is completely different from the second one.

In the end, vfast was disabled, and the customer is now back to running defragment for a couple of hours during the night and for a complete day during the weekend.

I guess that the algorithms that must calculate the permitted CPU/IO load are not working correctly under certain conditions.

Possible conditions might be:
- creation of very large number of files/day
- creation of very large files with very many extents

To err is human, but to really foul things up requires a computer

Orjan Petersson · ‎06-28-2006

Good to know that we are not the only ones seeing this behaviour.

We do not create any huge files on the domain, but are more in the "large number of files/day" case (on the order of 10000/day).

I would like to avoid running defragment as I experienced the same kind of cluster service check timeouts as described here.

http://www.ornl.gov/lists/mailing-lists/tru64-unix-managers/2005/04/msg00021.html

(That happened even though I ran defragment on the same cluster member as the one serving the domain)
