Operating System - OpenVMS

What is the performance impact of reporting alignment faults?

 
Advisor


We are porting an application from Alpha to Itanium, and want to keep an eye on alignment faults due to their performance impact.  The critical part of the application consists of a number of transaction server processes that field requests from other hosts on the network.  We already have instrumented these servers to track the transaction duration and a few other parameters; it seemed logical to add alignment fault count to our instrumentation.

 

To do this, we changed our server processes so that before a server processes a transaction it calls SYS$START_ALIGN_FAULT_REPORT, and after the transaction has been processed it calls SYS$GET_ALIGN_FAULT_DATA, followed by SYS$STOP_ALIGN_FAULT_REPORT. All of this work is bracketed by a call to LIB$INIT_TIMER and calls to LIB$STAT_TIMER to collect elapsed and CPU time, direct and buffered I/O counts, and page fault counts. This data, along with the number of fault entries returned by SYS$GET_ALIGN_FAULT_DATA, is logged for the transaction.

 

The issue is that adding the code to collect the number of alignment faults drastically increases the execution time of the transaction, as reported by LIB$STAT_TIMER. In a simple, isolated test, a transaction that takes 5 msec without alignment fault reporting takes almost 400 msec once the alignment fault reporting and collection calls are added. In this case, 486 alignment faults were reported. We made the alignment fault reporting switchable via a logical name that is checked before each transaction is serviced, and the test results are very repeatable: call the alignment fault reporting services, and the transaction duration is about 400 msec; don't call them, and the transaction duration is about 5 msec.

 

This increase in elapsed time will not work in production - our application currently runs at about 21 transactions per second, and the load can be very bursty.

 

I've tried both the AFR$C_BUFFERED and AFR$C_EXCEPTION approaches - both have similar performance impacts (the exception approach seems marginally worse).

 

My questions are:

1) Is the above implementation of these system services the way they are intended to work? Or should I just poll at the start of the transaction to clear any residual faults, and then poll at the end to get the count, leaving the reporting enabled the entire time the server is running?

2) Would I have better results with the SYS$PERM_REPORT_ALIGN_FAULT approach?

3) Has anyone else seen this kind of performance impact from using these system services?

By the way, our test environment is an HP rx2620  (1.60GHz/6.0MB) with 16GB of memory, running OpenVMS V8.3.  We see similar behavior on our intended production environment: HP BL860c i2  (1.60GHz/5.0MB) 16 cores with 32GB of memory, running OpenVMS V8.4.

 

 

Honored Contributor

Re: What is the performance impact of reporting alignment faults?

Glenn,

 

   System service calls are potentially expensive. I suspect your observed problem is due to the START/STOP added to each transaction, not the tracing itself.

 

What happens if you move the START and STOP higher up the stack and just execute the GET on the transaction? Don't worry about "clearing", just keep track of the previous value to calculate the cost per transaction.

 

You could also build a simple test program that measures the cost of the START/GET/STOP cycle, possibly with and without dummy loads with known numbers of alignment faults.  

 

My assumption is these services are really only of use during development to find sources of alignment faults. There's therefore no real incentive to make the services themselves very fast. One would hope that in a stable, production system, the number would be more or less constant, and not interesting enough to continue to track. Do you see any variation in transactions?

 

Can you identify what's generating the 486 faults you're seeing, eliminate them and then remove or disable the sampling code for production?

 

 

A crucible of informative mistakes
Trusted Contributor

Re: What is the performance impact of reporting alignment faults?

Is there any reason why you would expect the alignment faults to vary much when the code is on Production?  If not I'd take some typical data back to your development system and use it to refine the code, and then not include your diagnostic code when you go to Production.

 

Also, I wouldn't use LIB$INIT_TIMER and LIB$STAT_TIMER; the latter, I assume, is called 4 times to get all the data you want. Use SYS$GETTIM and SYS$GETJPI, the latter with an item list that includes all the requested values, and calculate the differences yourself.

 

Minimising calls to LIB$ and SYS$ routines is a good idea because there are overheads like checking that you have write access to memory where output values will be written, and of course all the internal code in those routines to deal with different input.
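
The snapshot-and-subtract approach suggested above can be sketched in portable C. This is only an illustration: the struct `txn_stats` and the function `txn_stats_delta` are hypothetical names, and how each snapshot gets filled in (for example from a single SYS$GETJPI item list plus SYS$GETTIM) is assumed rather than shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the per-process counters of interest.
   Filling it in (e.g. from one SYS$GETJPI call plus SYS$GETTIM) is
   an assumption and is not shown here. */
typedef struct {
    uint64_t elapsed_ticks;  /* wall-clock time, in the source's native unit */
    uint64_t cpu_ticks;      /* accumulated CPU time */
    uint64_t direct_io;      /* direct I/O count */
    uint64_t buffered_io;    /* buffered I/O count */
    uint64_t page_faults;    /* page fault count */
} txn_stats;

/* Per-transaction cost = snapshot taken after - snapshot taken before. */
txn_stats txn_stats_delta(const txn_stats *before, const txn_stats *after)
{
    txn_stats d;

    assert(before != NULL && after != NULL);
    d.elapsed_ticks = after->elapsed_ticks - before->elapsed_ticks;
    d.cpu_ticks     = after->cpu_ticks     - before->cpu_ticks;
    d.direct_io     = after->direct_io     - before->direct_io;
    d.buffered_io   = after->buffered_io   - before->buffered_io;
    d.page_faults   = after->page_faults   - before->page_faults;
    return d;
}
```

Taking one snapshot before the transaction and one after means two data-collection calls per transaction, with the subtraction done in user code.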

 

Honored Contributor
Solution

Re: What is the performance impact of reporting alignment faults?

Glenn,

   Simple experiment attached. The routine "bad" performs 1000 unaligned additions. The first loop repeats the routine 100000 times. The second does the same again (to remove the influence of the image activation page fault). The third does the same thing with each call surrounded by your START/GET/STOP; the fourth moves the START and STOP outside the loop, leaving only the GET inside.

 

Here are the results.

 

 ELAPSED:    0 00:00:01.02  CPU: 0:00:01.03  BUFIO: 0  DIRIO: 0  FAULTS: 1
 ELAPSED:    0 00:00:01.02  CPU: 0:00:01.06  BUFIO: 0  DIRIO: 0  FAULTS: 0
 ELAPSED:    0 00:00:01.36  CPU: 0:00:01.35  BUFIO: 0  DIRIO: 0  FAULTS: 0
 ELAPSED:    0 00:00:00.94  CPU: 0:00:00.93  BUFIO: 0  DIRIO: 0  FAULTS: 0

 

As expected, the START/GET/STOP loop is more expensive, but curiously the one with just the GET is actually faster than the loops with no monitoring. Note that I'm seeing less than 350msec for 100000 repeats of the START/GET/STOP cycle, so I don't know where your numbers are coming from.
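
To put the elapsed times above on a per-cycle basis, the arithmetic can be written as a small C helper. This is a sketch only; the function name `overhead_per_rep_us` is made up for illustration.

```c
#include <assert.h>

/* Extra cost per loop iteration, in microseconds, of an instrumented
   run relative to a baseline run with the same number of iterations. */
double overhead_per_rep_us(double instrumented_s, double baseline_s, long reps)
{
    assert(reps > 0);
    return (instrumented_s - baseline_s) * 1.0e6 / (double)reps;
}
```

With the figures above, `overhead_per_rep_us(1.36, 1.02, 100000)` comes to about 3.4 microseconds per START/GET/STOP cycle. The GET-only run (0.94 s against 1.02 s) gives a slightly negative value, which suggests the difference is within measurement noise.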

 

        .title timealign
$AFRDEF
reps=100000
        .psect data,rd,wrt,noexe,quad
l: .LONG
   .BYTE
a: .LONG                        ; follows a longword and a byte, so deliberately unaligned
.align quad
BufSiz=1024
b: .BLKB BufSiz

        .psect code,rd,nowrt,exe
        .entry start,^M<R2>
        MOVL #reps,R2
        CALLS #0,G^LIB$INIT_TIMER
loop:   CALLS #0,bad            ; pass 1: no monitoring (includes image activation page fault)
        SOBGTR R2,loop
        CALLS #0,G^LIB$SHOW_TIMER

        MOVL #reps,R2
        CALLS #0,G^LIB$INIT_TIMER
loop0:  CALLS #0,bad            ; pass 2: no monitoring, repeated as the baseline
        SOBGTR R2,loop0
        CALLS #0,G^LIB$SHOW_TIMER

        MOVL #reps,R2
        CALLS #0,G^LIB$INIT_TIMER
loop1:    PUSHL #BufSiz         ; pass 3: START/GET/STOP around every call to bad
          PUSHAL b
          PUSHL #AFR$C_BUFFERED
          CALLS #3,G^SYS$START_ALIGN_FAULT_REPORT
          CALLS #0,bad
          PUSHAL l
          PUSHL  #BufSiz
          PUSHAL b
          CALLS #3,G^SYS$GET_ALIGN_FAULT_DATA
          CALLS #0,G^SYS$STOP_ALIGN_FAULT_REPORT
        SOBGTR R2,loop1
        CALLS #0,G^LIB$SHOW_TIMER

        MOVL #reps,R2
          PUSHL #BufSiz
          PUSHAL b
          PUSHL #AFR$C_BUFFERED
          CALLS #3,G^SYS$START_ALIGN_FAULT_REPORT
        CALLS #0,G^LIB$INIT_TIMER
loop2:    CALLS #0,bad          ; pass 4: START/STOP outside the loop, only GET inside
          PUSHAL l
          PUSHL  #BufSiz
          PUSHAL b
          CALLS #3,G^SYS$GET_ALIGN_FAULT_DATA
        SOBGTR R2,loop2
        CALLS #0,G^LIB$SHOW_TIMER
        CALLS #0,G^SYS$STOP_ALIGN_FAULT_REPORT
        RET
        .ENTRY bad,^M<R3>
        MOVL #1000,R3
badloop: INCL a
        SOBGTR R3,badloop
        RET
        .END Start

 

A crucible of informative mistakes
Acclaimed Contributor

Re: What is the performance impact of reporting alignment faults?

On HP-UX, the performance impact can be 1 to 2 orders of magnitude, since reporting requires a user signal handler and the emulation is done in the kernel or even in the CPU.

 

Also on HP-UX it isn't as bad, since the default is natural alignment and it aborts if not.  So most applications are compiled with the right options to prevent the alignment traps in the first place.

Honored Contributor

Re: What is the performance impact of reporting alignment faults?

That's not a default with that abort-penalty for an unaligned reference Dennis, that's a requirement.

 

The default on all of the OpenVMS compilers is natural alignment, which may or may not be the best model.

 

With OpenVMS, the programmer has to disable that alignment via directive, via compiler qualifier (switch), or via some code construct that the compiler writers either hadn't handled or hadn't foreseen. 

 

If there were an option to abort on unaligned references (with some decent traceback), I suspect some OpenVMS folks would use it.  Particularly given the (large) penalty for unaligned references within the Itanium architecture.

Advisor

Re: What is the performance impact of reporting alignment faults?

Hoff, look closer at the defaults for COBOL and BASIC.  Not as aligned as you think.  However, the compiler will generate multiple instructions to load the data in pieces to avoid alignment faults.

 

Back to the question at hand.  Yes, there is certainly a Heisenberg effect with asking for fault reporting.

 

From the SRM Vol 1, User Mask

 

ac flag : 0: unaligned data memory references may cause an Unaligned Data Reference fault.

               1: all unaligned data memory references cause an Unaligned Data Reference fault

 

Note the "may" in the 0 case.  The Itanium architecture allows implementations to handle some unaligned references totally 'on-chip'.  I suspect unaligned references inside the same cache line (or perhaps just within a 64-bit quadword) are handled 'on-chip' and VERY fast.

 

However, when you use the system services that you mention, the 'ac' flag gets set in the user mask.  So now EVERY unaligned memory reference causes a fault, traps to OpenVMS, which in turn takes out a spinlock in order to fix up the unaligned reference, and then in turn signals the unaligned access fault to the program.  So just by looking, you increase the overhead.

 

The same holds true for using SET BREAK/UNALIGNED in the debugger or even the SDA FLT support.
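
To make the "same cache line or quadword" distinction above concrete, here is a small C sketch that classifies an access by address and size. It only illustrates the terms; it does not claim to model which references a given Itanium implementation actually handles on-chip, and the helper names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if an access of `size` bytes at `addr` is naturally aligned,
   i.e. addr is a multiple of size. size must be a power of two. */
int is_naturally_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}

/* True if the bytes [addr, addr + size) straddle a multiple of
   `boundary` (8 for a quadword; the cache-line size is
   implementation-specific and is left to the caller). */
int crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    assert(size != 0);
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (addr / boundary) != ((addr + size - 1) / boundary);
}
```

As an example, the longword `a` in the test program earlier in the thread follows a longword and a byte in a quadword-aligned psect. Assuming those directives reserve 4 bytes and 1 byte, `a` sits at offset 5: `is_naturally_aligned(5, 4)` is false, and `crosses_boundary(5, 4, 8)` is true because the access covers bytes 5 through 8.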

Respected Contributor

Re: What is the performance impact of reporting alignment faults?

It would seem to me that the time cost of the monitoring is irrelevant.  I would expect that the developers use the instrumented code to locate the alignment faults in order to "fix" them.  Once "fixed", there is no longer a need for the instrumentation and thus no cost on the production system.  Let's face it, the goal is to avoid the alignment faults.  Once this has been accomplished during the testing phase, the instrumentation (specific to these faults) should be disabled.

 

Just my approach...

 

Dan

Advisor

Re: What is the performance impact of reporting alignment faults?

I agree with Dan 100%

Advisor

Re: What is the performance impact of reporting alignment faults?

John:

 

Thanks for taking the time to provide the example; it's very similar to the C example from Eight-Cubed, which I had already exercised.  It provided the assurance that I needed - yes, I was doing something wrong, and forced me to look closer at my code to find it.

 

It turns out that when changing my alignment fault reporting from exception based to buffered, I introduced a bug that caused it to continue to use the exception approach anyway, thus the "similarity" of my test results between the exception based and what-I-thought-was-buffered.  I should have verified that my code was really doing what I expected before posting.

 

Now that I've corrected THAT oversight, I see results very similar to yours - a millisecond or two to start the alignment fault collection, and nothing measurable to report it. 

 

It is the exception processing that was taking the extra time.  Yet another thing to be wary of (as has been reported in other posts) when moving to Itanium.

 

To answer some of the other questions, yes we may see some variability in the number of alignment faults by transaction.  The transactions process requests that are presented in a "query language" (think SQL, but not too hard), and we will see the query usage change over time.  Today our transactions vary in duration from a few milliseconds to hundreds of milliseconds, based on what they are doing, so I am quite willing to pay a millisecond or two to be able to correlate a rise in alignment faults to a particular transaction signature.  And, as it is coded today, we can easily turn on or off the alignment fault reporting in real time if it gets in the way.

 

We will certainly pursue the reported alignment faults and eradicate as many as possible, but I'm pretty certain that we won't get to zero.  This data will help us understand how close to zero we can get.

 

Thank you to all who responded. As always, this forum is of immense help in finding the solutions that I need.

 

Glenn