Operating System - OpenVMS

Is this too many alignment faults?

Alex Vournas
Occasional Advisor

Is this too many alignment faults?

Attached is a chart of our user alignment faults from our T4 stats. It seems to average around 1,000. We have Itanium servers and I know that alignment faults are supposed to hurt performance a lot. Does this seem to be too much? Would this be hurting performance?

Thanks,

Alex
8 REPLIES
Robert Gezelter
Honored Contributor

Re: Is this too many alignment faults?

Alex,

From my perspective, any alignment faults are too many. That said, there are situations where one cannot change an underlying package, but those should be rare.

My suggestion is to track down where the alignment faults are happening. If they are in locally developed code, they should be fixed. If they are in HP code, or the code from another vendor, they should be reported as bugs.

- Bob Gezelter, http://www.rlgsc.com
Alex Vournas
Occasional Advisor

Re: Is this too many alignment faults?

How do you track down which process is causing the alignment faults? Also, is there any way to measure how much CPU time is spent handling the faults that are happening? I don't want to spend a lot of time fixing them to find out that we got no performance gain.

Hein van den Heuvel
Honored Contributor

Re: Is this too many alignment faults?

1000 Alignment Faults per second is nothing.
Your box can probably do 400,000 / second
There is no way you can notice 1000/sec on the application level.

It should not be ignored, though; treat it as an early warning of a potential scaling issue. Why bring scaling into play, and not just CPU overhead? Because only one CPU at a time can resolve alignment faults, under (memory management) spinlock protection.

So actually... if those 1000 were from 10 processes each hitting alignment faults in the same clock tick (all waiting for the same timer or efn), then they will all be waiting for each other, and the effect will be 5 - 10x worse than the alignment faults themselves would cause.

>> How do you track down which process is causing the alignment faults?

I kinda miss a MONITOR/TOPALIGN. That would be nice to have.

I just use the ANALYZE/SYSTEM FLT extension:
SDA> FLT LOAD
SDA> FLT START TRACE /CALLER [/MODE=U] [INDEX=xxx]
SDA> WAIT 0:0:20 ! Or 10 or whatever
SDA> FLT STOP TRACE
SDA> SET OUT FLT.LOG
SDA> SHOW SUMM/IMAGE ! To map PID's to users/images
SDA> SET PROC "popular proggie"
SDA> FLT SHOW TRACE/SUMM ! To get top hitter(s)
SDA> FLT SHOW TRACE ! Gory details

I use a PERL script to post process the FLT.LOG produced above to show faults/image, and to show which of the processes are all responsible for the top fault. But just using an editor you can get a good picture as well.
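As a rough illustration of that kind of post-processing (this is not Hein's script), here is a minimal Python sketch that ranks fault sites by count and share of the total. It assumes you have already pulled (image, exception PC) pairs out of FLT.LOG by whatever means; the FLT output layout is not parsed here, and the sample names and addresses are made up.

```python
from collections import Counter


def top_fault_sites(samples, limit=5):
    """Rank fault sites by count.

    samples: iterable of hashable fault sites, e.g. (image, pc) tuples.
    Returns a list of (site, count, share_of_total) tuples, largest first.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    # most_common() yields nothing for empty input, so no division by zero.
    return [(site, n, n / total) for site, n in counts.most_common(limit)]


# Hypothetical samples, invented purely for illustration.
samples = (
    [("APP_MAIN", 0x12340)] * 90
    + [("APP_MAIN", 0x15670)] * 7
    + [("REPORTER", 0x20010)] * 3
)

for (image, pc), n, share in top_fault_sites(samples):
    print(f"{image:10s} {pc:08X} {n:6d} {share:6.1%}")
```

If one site carries most of the total, as the first made-up entry does here, that is the one worth tracing back to source.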

HINT... BEFORE issuing the FLT SHOW TRACE commands you want to set the SDA context to a process, any process, which is running a 'popular' image. SDA tries to make sense out of the addresses based on the image for the current process. So if you do nothing, it will SEEM that all addresses are in SDA itself. Not so, it is just interpreting in that context. By picking a process with a popular image, more of those interpretations will make sense.

>> Also, is there any way to measure how much CPU time is spent handling the faults that are happening? I don't want to spend a lot of time fixing them to find out that we got no performance gain.

Not really, short of PCS or PRF PC tracing.

I suspect that the best indicators would come from SDA> SPL TRACE. How long are spinlocks being held!?!

The fixup is happening in Kernel mode. So if your system is 50% user, 10% kernel, 40% idle with those alignment faults, then there is nothing to be gained.

For better help you may want to provide more details like the exact platform, notably how many CPUs, perhaps some MONI MODE ( or some money :-) :-).

I would just use the 1000/sec as a trigger to investigate, but not as a requirement to fix. If the investigation leads to low-hanging fruit (like a mis-aligned return-length word on LIB$TIM and LIB$GET_LOGICAL that I just traced back today) then why not fix those.

If one call is responsible for 90% of the faults, then yeah sure, dig in!

Hope this helps some.
Regards,

Hein van den Heuvel
HvdH Performance Consulting
John Gillings
Honored Contributor

Re: Is this too many alignment faults?

Alex,
There are some unaligned data structures which can never be fixed (one example which springs to mind is the PQL list passed to every $CREPRC call). These will always incur alignment faults, and there's nothing you can do about them.

The cases you may want to concern yourself with are those where a data structure under your control is being accessed frequently. I say "may" because if your processes are performing to your satisfaction and not encroaching on execution time windows, why worry?

>How do you track down which process is
>causing the alignment faults?

In the data you've posted, I'd be looking at the big spike at around 22:30. What starts at that time? I'd also look at sustained periods of high(er) faults, like 02:45 to 04:30. Your background load of 1000 probably isn't worth worrying about.

SDA has a FLT extension which can be used to identify processes, and code regions that are generating alignment faults. See

$ ANALYZE/SYSTEM
SDA> HELP FLT
A crucible of informative mistakes
Hoff
Honored Contributor

Re: Is this too many alignment faults?

Too many alignment faults?

Unknown.

Alex, you are the only one that can answer that question.

The answer is specific to your environment.

If this fault rate is not adversely affecting your application performance, then no, this number of alignment faults isn't a problem.

If your performance isn't where you want and you have the cycles to dig into this stuff, then yes, alignment faults can be worth a look. You will (also) have to decide if addressing those alignment faults is cheaper than upgrading to a faster box, for instance.

http://h71000.www7.hp.com/doc/82final/6549/6549pro_030.html#sda_flt

http://labs.hoffmanlabs.com/node/160

http://labs.hoffmanlabs.com/node/1397

http://www.eight-cubed.com/examples/framework.php?file=sys_align_faults.c

As for CPU costs, what follows is entirely back of the envelope. Check my math, too. One of the estimates floating around puts the overhead at the equivalent of 10,000 to 15,000 instructions executed per alignment fault. Also look at how many instructions an Itanium executes per second: at a clock rate of one gigahertz, call it a billion per second. One of the calculations puts the instruction rate at about 1 to 1.5 times the Itanium clock rate. So how bad is a rate of 1000 faults per second? At 15,000 instructions per fault, that's 15,000,000 instructions per second. Out of 1,000,000,000 IPS on a 1 GHz box, using the low-end IPS estimate, that's what, a 1.5% loss if the box is going flat out, and yours probably isn't. And again, check my math.
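The same back-of-the-envelope estimate can be written out as a few lines of Python. This is only a sketch of the arithmetic above: the per-fault instruction cost and the instructions-per-second figure are the rough estimates quoted in this post, not measurements, and the function name is made up for illustration.

```python
def fault_overhead_fraction(faults_per_sec, instr_per_fault, instr_per_sec):
    """Fraction of one CPU's instruction budget consumed by fault fixups."""
    return faults_per_sec * instr_per_fault / instr_per_sec


# Rough inputs from the discussion: 1000 faults/sec, 10,000 to 15,000
# instructions per fault, about 1e9 instructions/sec on a 1 GHz Itanium.
low = fault_overhead_fraction(1_000, 10_000, 1_000_000_000)
high = fault_overhead_fraction(1_000, 15_000, 1_000_000_000)

print(f"{low:.1%} to {high:.1%} of one CPU")  # prints "1.0% to 1.5% of one CPU"
```

Plugging in your own fault rate from T4 gives a quick feel for whether the raw CPU cost is even worth chasing. Note that this ignores the spinlock serialization effect raised earlier in the thread.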
P Muralidhar Kini
Honored Contributor

Re: Is this too many alignment faults?

Hi Alex,

>> Attached is an chart of our user alignment faults from our T4 stats.
From the data that you have given, we can see a spike in alignment
faults between 28-Apr-2010, 22:00:00 and 29-Apr-2010, 00:00:00.
The spike is closer to 28-Apr-2010 than to 29-Apr-2010.
You should check which applications were running during this period
of time. This will give a rough idea as to who the culprit is.
(there may be many!)

>> How do you track down which process is causing the alignment faults?
As Hein has rightly pointed out, you need to use the FLT trace and try
to map the alignment faults to PID and PC value. This way you would
know which PID (i.e. the process) and Exception PC (i.e. the exact code)
is causing the alignment faults.

>> I don't want to spend a lot of time fixing them to find out that we
>> got no performance gain.
I guess you need to fix a bunch of those in order to see a significant
performance gain. Once you get the FLT traces and you find that there is
some PC value that is repeated several hundred or several thousand times,
then fixing it would pay off better than fixing some PC value that
is listed only a few dozen times.

Regards,
Murali
Let There Be Rock - AC/DC
Volker Halle
Honored Contributor

Re: Is this too many alignment faults?

Alex,

As you've collected T4 data already, use TLviz and display the alignment fault data together with '[MON.MODE]Kernel Mode' and/or '[MON.MODE]Mp synch' (if you're running on an SMP system) and look for possible correlations.

If you don't see correspondingly high spikes of kernel or mpsync at times of high alignment faults, the overall performance gain would be negligible.
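If you would rather put a number on that correlation than eyeball it in TLviz, a minimal Python sketch follows. It assumes you have already exported two equally sampled series from the T4 data (alignment faults and kernel-mode percentage) into plain lists; the T4 file layout is not parsed here, and the sample values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / sqrt(var_x * var_y)


# Hypothetical samples: alignment faults/sec and kernel-mode % per interval.
faults = [900, 1000, 1100, 5200, 4800, 1000]
kernel_pct = [8, 9, 9, 10, 9, 8]

print(f"correlation: {pearson(faults, kernel_pct):.2f}")
```

A value near zero suggests kernel time is not tracking the fault rate, in which case fixing the faults is unlikely to buy much. A high value only says the two move together, not that the faults are the cause.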

Volker.
Steve Reece_3
Trusted Contributor

Re: Is this too many alignment faults?

Hi Alex,

As well as Volker's comments above (indicating overall performance and how much time you're wasting doing things that aren't directly helping your users), I'd definitely look around and see what else is running at some of the peaks that the graph identifies.

For example, you have a window of about six hours from 2am when alignment faults increase. Is this a backup window? If it's backup and there's nothing else running on the box, don't sweat it. If it's an in-house cleanup routine, you might get a bit of benefit by tidying it up and reducing alignment faults. Don't expect big returns on performance though.

At around 22:30 you have a bigger peak but it's short duration. Is this shutting down batch jobs or kicking off batch jobs for overnight processing? Either way, probably isn't worth digging at given its duration.

For me, I'd be looking at the advice given by Volker and looking at how much of the box's time is spent in modes other than User. If the users are doing real work then there will be a mix of modes, but stuff that's not User isn't necessarily doing you any good. Sure, there will be exec mode stuff for RMS and there'll be some Kernel for VMS itself, but if you have lots of kernel and MPSynch then you're not getting the best from the system. For well behaved and sweetly running apps I'd be hoping to get more User mode than kernel+MPSynch+interrupt. It does, of course, depend upon the application though and what it's doing.

Steve