<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: CPU STALLS in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814298#M268865</link>
    <description>Thanks clay,&lt;BR /&gt;&lt;BR /&gt;in fact, cache misses are only 12%&lt;BR /&gt;&lt;BR /&gt;Here is a report.........What do you suggest?&lt;BR /&gt;&lt;BR /&gt;Please guide.&lt;BR /&gt;&lt;BR /&gt;Anurag&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;L1 data cache miss percentage:&lt;BR /&gt;&lt;BR /&gt;Sampling Specification&lt;BR /&gt;  Sampling event:             DATA_EAR_EVENTS&lt;BR /&gt;  Sampling period:            10000 events&lt;BR /&gt;  Sampling period variation:  500 (5.00% of sampling period)&lt;BR /&gt;  Sampling counter privilege: user (user-space sampling)&lt;BR /&gt;  Data granularity:           16 bytes&lt;BR /&gt;  Number of samples:          452&lt;BR /&gt;  Data sampled:               Data cache miss&lt;BR /&gt;&lt;BR /&gt;Data Cache Metrics Summed for Entire Run&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;                         PLM&lt;BR /&gt;Event Name              U..K  TH          Count&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;DATA_REFERENCES         x___   0      447196040&lt;BR /&gt;L1D_READS               x___   0      312349645&lt;BR /&gt;L1D_READ_MISSES.ALL     x___   0       37692146&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;PLM: Privilege Level Mask&lt;BR /&gt;     U/K = user/kernel levels (U: level 3, K: level 0)&lt;BR /&gt;     The intermediate levels (1, 2) are unused on HP-UX or Linux&lt;BR /&gt;     x : the metric is measured at the given level (_ : not measured)&lt;BR /&gt;TH: event THreshold, determines the event counter behavior,&lt;BR /&gt;    TH == 0 : counter += event_count_in_cycle&lt;BR /&gt;    TH  &amp;gt; 0 : counter += (event_count_in_cycle &amp;gt;= threshold ? 
1 : 0)&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;L1 data cache miss percentage:&lt;BR /&gt;  12.07 = 100 * (L1D_READ_MISSES.ALL / L1D_READS)&lt;BR /&gt;&lt;BR /&gt;Percent of data references accessing L1 data cache:&lt;BR /&gt;  69.85 = 100 * (L1D_READS / DATA_REFERENCES)&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;&lt;BR /&gt;Load Module Summary&lt;BR /&gt;------------------------------------------------------------------&lt;BR /&gt;% Total                                        Avg.&lt;BR /&gt; Dcache  Cumulat        Sampled       Dcache  Dcache&lt;BR /&gt;Latency    % of          Dcache      Latency  Laten.&lt;BR /&gt; Cycles    Total         Misses       Cycles  Cycles   Load Module&lt;BR /&gt;------------------------------------------------------------------&lt;BR /&gt; 95.97    95.97             365        13715    37.6   dld.so&lt;BR /&gt;  1.66    97.63              31          237     7.6   libunwind.so.1&lt;BR /&gt;  0.85    98.48              17          122     7.2   libpthread.so.1&lt;BR /&gt;  0.82    99.30              10          117    11.7   libncursesw.so&lt;BR /&gt;  0.38    99.69               6           55     9.2   librtc.sl&lt;BR /&gt;  0.13    99.82               3           19     6.3   liborb_r.so&lt;BR /&gt;  0.13    99.95               2           19     9.5   libc.so.1&lt;BR /&gt;  0.05   100.00               1            7     7.0   libCsup.so.1&lt;BR /&gt;------------------------------------------------------------------&lt;BR /&gt;100.00   100.00             435        14291    32.9   Total&lt;BR /&gt;------------------------------------------------------------------&lt;BR /&gt;&lt;BR /&gt;Function Summary&lt;BR /&gt;--------------------------------------------------------------------------------------------&lt;BR /&gt;% Total                                        Avg.&lt;BR /&gt; Dcache  Cumulat        Sampled       Dcache  Dcache&lt;BR /&gt;Latency    % of          Dcache     
 Latency  Laten.&lt;BR /&gt; Cycles    Total         Misses       Cycles  Cycles  Function                          File&lt;BR /&gt;--------------------------------------------------------------------------------------------&lt;BR /&gt;  0.76     0.76               9          108    12.0  libncursesw.so::__milli_memcmp&lt;BR /&gt;  0.40     1.15               7           57     8.1  libunwind.so.1::uwx_step          uwx_step.c&lt;BR /&gt;  0.29     1.44               6           41     6.8  libpthread.so.1::*unnamed@0x4042(920-cc0)*  mutex.c&lt;BR /&gt;  0.25     1.69               4           36     9.0  libunwind.so.1::uwx_get_frame_info  uwx_step.c&lt;BR /&gt;  0.19     1.88               3           27     9.0  libpthread.so.1::pthread_setcancelstate  cancel.c&lt;BR /&gt;  0.19     2.07               3           27     9.0  libunwind.so.1::uwx_reclaim_scoreboards  uwx_scoreboard.c&lt;BR /&gt;  0.17     2.24               4           24     6.0  libunwind.so.1::uwx_decode_prologue  uwx_uinfo.c&lt;BR /&gt;  0.15     2.39               1           21    21.0  { STUB }-&amp;gt;libunwind.so.1::uwx_reset_str_pool&lt;BR /&gt;  0.14     2.53               4           20     5.0  libpthread.so.1::pthread_mutex_lock  mutex.c&lt;BR /&gt;  0.13     2.65               2           18     9.0  libpthread.so.1::pthread_mutex_unlock  mutex.c&lt;BR /&gt;  0.13     2.78               2           18     9.0  librtc.sl::rtc_split_special_region  infrtc.c&lt;BR /&gt;  0.11     2.89               2           16     8.0  libpthread.so.1::ENTER_PTHREAD_LIBRARY_FUNC  pthread.c&lt;BR /&gt;  0.10     2.99               3           15     5.0  libunwind.so.1::uwx_search_utable32  uwx_utable.c&lt;BR /&gt;--------------------------------------------------------------------------------------------&lt;BR /&gt;[Minimum function entries: 5, percent cutoff: 0.10, cumulative percent cutoff: 100.00]&lt;BR /&gt;&lt;BR /&gt;Function Details&lt;BR 
/&gt;----------------------------------------------------------------------&lt;BR /&gt;% Total                               Avg.&lt;BR /&gt; Dcache        Sampled       Dcache  Dcache         Line|&lt;BR /&gt;Latency         Dcache      Latency  Laten.         Slot|  &amp;gt;Statement|&lt;BR /&gt; Cycles         Misses       Cycles  Cycles    Col,Offset  Instruction&lt;BR /&gt;----------------------------------------------------------------------&lt;BR /&gt;[Cutoffs excluded all entries (minimum: 0; percent: 1.00; cumulative percent: 100.00;)]&lt;BR /&gt;</description>
    <pubDate>Wed, 28 Jun 2006 12:32:54 GMT</pubDate>
    <dc:creator>Anurag_7</dc:creator>
    <dc:date>2006-06-28T12:32:54Z</dc:date>
    <item>
      <title>CPU STALLS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814291#M268858</link>
      <description>We have our 32 bit application running on Itanium.&lt;BR /&gt;&lt;BR /&gt;The application response is a cause of concern.&lt;BR /&gt;&lt;BR /&gt;% of cycles lost due to CPU stalls is 71.45.&lt;BR /&gt;&lt;BR /&gt;Analysis of a process using caliper suggests that the CPU stall is very high. Below is the report.&lt;BR /&gt;&lt;BR /&gt;Is there a way to overcome this?&lt;BR /&gt;&lt;BR /&gt;Please guide.&lt;BR /&gt;&lt;BR /&gt;Anurag&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;HP Caliper Total CPU Event Counts Report for PkMS&lt;BR /&gt;================================================================================&lt;BR /&gt;&lt;BR /&gt;Target Application&lt;BR /&gt;  Program:                    /u03/wmpso/CMLACARE/opt/app/wmprd/bin/Lttr&lt;BR /&gt;  Invocation:                 /u03/wmpso/CMLACARE/opt/app/wmprd/bin/Lttr&lt;BR /&gt;  Process ID:                 3456 (started by Caliper)&lt;BR /&gt;  Start time:                 09:43:11 PM&lt;BR /&gt;  End time:                   09:43:16 PM&lt;BR /&gt;  Termination Status:         0&lt;BR /&gt;  Last modified:              May 05, 2006 at 09:26 PM&lt;BR /&gt;  Memory model:               ILP32&lt;BR /&gt;  Main module text page size: default&lt;BR /&gt;&lt;BR /&gt;Processor Information&lt;BR /&gt;  Machine name:         itanium1&lt;BR /&gt;  Number of processors: 4&lt;BR /&gt;  Processor type:       Itanium2 6M&lt;BR /&gt;  Processor speed:      1300 MHz&lt;BR /&gt;&lt;BR /&gt;Target Execution Time&lt;BR /&gt;  Real time:   4.915 seconds&lt;BR /&gt;  User time:   1.757 seconds&lt;BR /&gt;  System time: 0.228 seconds&lt;BR /&gt;&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;                         PLM&lt;BR /&gt;Event Name              U..K  TH          Count&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;BACK_END_BUBBLE.ALL     x___   0     1551927751&lt;BR /&gt;CPU_CYCLES              x___   0     2171937631&lt;BR /&gt;IA64_INST_RETIRED       x___   0     
2094967607&lt;BR /&gt;NOPS_RETIRED            x___   0      403213401&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;PLM: Privilege Level Mask&lt;BR /&gt;     U/K = user/kernel levels (U: level 3, K: level 0)&lt;BR /&gt;     The intermediate levels (1, 2) are unused on HP-UX or Linux&lt;BR /&gt;     x : the metric is measured at the given level (_ : not measured)&lt;BR /&gt;TH: event THreshold, determines the event counter behavior,&lt;BR /&gt;    TH == 0 : counter += event_count_in_cycle&lt;BR /&gt;    TH  &amp;gt; 0 : counter += (event_count_in_cycle &amp;gt;= threshold ? 1 : 0)&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;% of Cycles lost due to stalls (lower is better):&lt;BR /&gt;  100 * BACK_END_BUBBLE.ALL / CPU_CYCLES = 71.45&lt;BR /&gt;&lt;BR /&gt;Effective CPI (lower is better):&lt;BR /&gt;  CPU_CYCLES / (IA64_INST_RETIRED - NOPS_RETIRED) = 1.2838&lt;BR /&gt;&lt;BR /&gt;Effective CPI during unstalled execution (lower is better):&lt;BR /&gt;  (CPU_CYCLES - BACK_END_BUBBLE.ALL) / (IA64_INST_RETIRED - NOPS_RETIRED) = 0.3665&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;</description>
      <pubDate>Wed, 28 Jun 2006 11:37:23 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814291#M268858</guid>
      <dc:creator>Anurag_7</dc:creator>
      <dc:date>2006-06-28T11:37:23Z</dc:date>
    </item>
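An editor's sketch, not part of the original thread: the three derived metrics in the Caliper report above are plain arithmetic over the quoted raw event counts, so they can be recomputed as a sanity check on the 71.45% stall figure.

```python
# Recompute HP Caliper's derived stall metrics from the raw event
# counts quoted in the report above.
back_end_bubble = 1_551_927_751   # BACK_END_BUBBLE.ALL
cpu_cycles      = 2_171_937_631   # CPU_CYCLES
inst_retired    = 2_094_967_607   # IA64_INST_RETIRED
nops_retired    =   403_213_401   # NOPS_RETIRED

# NOPs retire but do no useful work, so Caliper excludes them from CPI.
useful_insts = inst_retired - nops_retired

stall_pct     = 100 * back_end_bubble / cpu_cycles
effective_cpi = cpu_cycles / useful_insts
unstalled_cpi = (cpu_cycles - back_end_bubble) / useful_insts

print(round(stall_pct, 2))      # 71.45, matching the report
print(round(effective_cpi, 4))  # 1.2838
print(round(unstalled_cpi, 4))  # 0.3665
```

The gap between the effective CPI (1.2838) and the unstalled CPI (0.3665) is what the thread is about: the core retires instructions quickly when it is not stalled, so most of the lost time is back-end bubbles, not slow execution.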
    <item>
      <title>Re: CPU STALLS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814292#M268859</link>
      <description>Shalom,&lt;BR /&gt;&lt;BR /&gt;This is a bit strange, but I think I'd have the hardware checked.&lt;BR /&gt;&lt;BR /&gt;First use cstm, mstm, or xstm and run the CPU exercisers. Any failures, HP needs to replace.&lt;BR /&gt;&lt;BR /&gt;Then call in the HP hardware team.&lt;BR /&gt;&lt;BR /&gt;SEP</description>
      <pubDate>Wed, 28 Jun 2006 11:58:03 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814292#M268859</guid>
      <dc:creator>Steven E. Protter</dc:creator>
      <dc:date>2006-06-28T11:58:03Z</dc:date>
    </item>
    <item>
      <title>Re: CPU STALLS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814293#M268860</link>
      <description>Hi Shalom,&lt;BR /&gt;&lt;BR /&gt;Thanks a lot for your reply.&lt;BR /&gt;&lt;BR /&gt;Can you please let me know how to use these commands and what kind of output I should collect and analyze?&lt;BR /&gt;&lt;BR /&gt;Thanks a lot&lt;BR /&gt;&lt;BR /&gt;Anurag</description>
      <pubDate>Wed, 28 Jun 2006 12:02:59 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814293#M268860</guid>
      <dc:creator>Anurag_7</dc:creator>
      <dc:date>2006-06-28T12:02:59Z</dc:date>
    </item>
    <item>
      <title>Re: CPU STALLS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814294#M268861</link>
      <description>It looks like there is a big chunk of branch misprediction occurring, which is starving the processor for instructions. Since you mention a 32-bit application, I assume that means you are running a PA-RISC application under the ARIES emulator. Do you have the option of porting the application to native code?</description>
      <pubDate>Wed, 28 Jun 2006 12:07:31 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814294#M268861</guid>
      <dc:creator>A. Clay Stephenson</dc:creator>
      <dc:date>2006-06-28T12:07:31Z</dc:date>
    </item>
    <item>
      <title>Re: CPU STALLS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814295#M268862</link>
      <description>It's not a PA-RISC binary :-)&lt;BR /&gt;&lt;BR /&gt;It's an ELF-32 executable object file - IA64&lt;BR /&gt;</description>
      <pubDate>Wed, 28 Jun 2006 12:15:39 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814295#M268862</guid>
      <dc:creator>Anurag_7</dc:creator>
      <dc:date>2006-06-28T12:15:39Z</dc:date>
    </item>
    <item>
      <title>Re: CPU STALLS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814296#M268863</link>
      <description>71.45 is a high number. Greater than 50 is unacceptable. But all of this is an indication of a poorly written program (loops within loops vs. straight-line code). Check for cache misses with Caliper:&lt;BR /&gt;&lt;BR /&gt;caliper icache -o reports/icachem.txt ./matmul&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://docs.hp.com/en/5991-5499/ch02s04.html" target="_blank"&gt;http://docs.hp.com/en/5991-5499/ch02s04.html&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;"...A hot spot is an instruction or set of instructions that has a higher execution count than most other instructions in a program. For example, code that is inside a loop inside a loop inside a loop will likely be executed more times than straight-line code. Usually the "hotness" is measured with CPU cycles, but it could also be measured with metrics such as cache misses...."&lt;BR /&gt;&lt;BR /&gt;This doc makes some patching suggestions, as well as suggesting an increased page size:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://h21007.www2.hp.com/dspp/files/unprotected/hpux/Top_Ten_Perf_Tips.pdf" target="_blank"&gt;http://h21007.www2.hp.com/dspp/files/unprotected/hpux/Top_Ten_Perf_Tips.pdf&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;There's also a lot of reference to the latest pthread patch. See page 12, MxN vs. 1x1.</description>
      <pubDate>Wed, 28 Jun 2006 12:21:44 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814296#M268863</guid>
      <dc:creator>Michael Steele_2</dc:creator>
      <dc:date>2006-06-28T12:21:44Z</dc:date>
    </item>
    <item>
      <title>Re: CPU STALLS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814297#M268864</link>
      <description>In that case, the most likely culprit is poorly written code causing tons of cache misses. I would almost certainly rule out any sort of hardware problem, since I assume this hardware doesn't know how to just pick on your application. If you see widespread performance degradation then I would reconsider, but if the poor performance is limited to your application then ...</description>
      <pubDate>Wed, 28 Jun 2006 12:27:57 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814297#M268864</guid>
      <dc:creator>A. Clay Stephenson</dc:creator>
      <dc:date>2006-06-28T12:27:57Z</dc:date>
    </item>
    <item>
      <title>Re: CPU STALLS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814298#M268865</link>
      <description>Thanks clay,&lt;BR /&gt;&lt;BR /&gt;in fact, cache misses are only 12%&lt;BR /&gt;&lt;BR /&gt;Here is a report.........What do you suggest?&lt;BR /&gt;&lt;BR /&gt;Please guide.&lt;BR /&gt;&lt;BR /&gt;Anurag&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;L1 data cache miss percentage:&lt;BR /&gt;&lt;BR /&gt;Sampling Specification&lt;BR /&gt;  Sampling event:             DATA_EAR_EVENTS&lt;BR /&gt;  Sampling period:            10000 events&lt;BR /&gt;  Sampling period variation:  500 (5.00% of sampling period)&lt;BR /&gt;  Sampling counter privilege: user (user-space sampling)&lt;BR /&gt;  Data granularity:           16 bytes&lt;BR /&gt;  Number of samples:          452&lt;BR /&gt;  Data sampled:               Data cache miss&lt;BR /&gt;&lt;BR /&gt;Data Cache Metrics Summed for Entire Run&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;                         PLM&lt;BR /&gt;Event Name              U..K  TH          Count&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;DATA_REFERENCES         x___   0      447196040&lt;BR /&gt;L1D_READS               x___   0      312349645&lt;BR /&gt;L1D_READ_MISSES.ALL     x___   0       37692146&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;PLM: Privilege Level Mask&lt;BR /&gt;     U/K = user/kernel levels (U: level 3, K: level 0)&lt;BR /&gt;     The intermediate levels (1, 2) are unused on HP-UX or Linux&lt;BR /&gt;     x : the metric is measured at the given level (_ : not measured)&lt;BR /&gt;TH: event THreshold, determines the event counter behavior,&lt;BR /&gt;    TH == 0 : counter += event_count_in_cycle&lt;BR /&gt;    TH  &amp;gt; 0 : counter += (event_count_in_cycle &amp;gt;= threshold ? 
1 : 0)&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;L1 data cache miss percentage:&lt;BR /&gt;  12.07 = 100 * (L1D_READ_MISSES.ALL / L1D_READS)&lt;BR /&gt;&lt;BR /&gt;Percent of data references accessing L1 data cache:&lt;BR /&gt;  69.85 = 100 * (L1D_READS / DATA_REFERENCES)&lt;BR /&gt;-----------------------------------------------&lt;BR /&gt;&lt;BR /&gt;Load Module Summary&lt;BR /&gt;------------------------------------------------------------------&lt;BR /&gt;% Total                                        Avg.&lt;BR /&gt; Dcache  Cumulat        Sampled       Dcache  Dcache&lt;BR /&gt;Latency    % of          Dcache      Latency  Laten.&lt;BR /&gt; Cycles    Total         Misses       Cycles  Cycles   Load Module&lt;BR /&gt;------------------------------------------------------------------&lt;BR /&gt; 95.97    95.97             365        13715    37.6   dld.so&lt;BR /&gt;  1.66    97.63              31          237     7.6   libunwind.so.1&lt;BR /&gt;  0.85    98.48              17          122     7.2   libpthread.so.1&lt;BR /&gt;  0.82    99.30              10          117    11.7   libncursesw.so&lt;BR /&gt;  0.38    99.69               6           55     9.2   librtc.sl&lt;BR /&gt;  0.13    99.82               3           19     6.3   liborb_r.so&lt;BR /&gt;  0.13    99.95               2           19     9.5   libc.so.1&lt;BR /&gt;  0.05   100.00               1            7     7.0   libCsup.so.1&lt;BR /&gt;------------------------------------------------------------------&lt;BR /&gt;100.00   100.00             435        14291    32.9   Total&lt;BR /&gt;------------------------------------------------------------------&lt;BR /&gt;&lt;BR /&gt;Function Summary&lt;BR /&gt;--------------------------------------------------------------------------------------------&lt;BR /&gt;% Total                                        Avg.&lt;BR /&gt; Dcache  Cumulat        Sampled       Dcache  Dcache&lt;BR /&gt;Latency    % of          Dcache     
 Latency  Laten.&lt;BR /&gt; Cycles    Total         Misses       Cycles  Cycles  Function                          File&lt;BR /&gt;--------------------------------------------------------------------------------------------&lt;BR /&gt;  0.76     0.76               9          108    12.0  libncursesw.so::__milli_memcmp&lt;BR /&gt;  0.40     1.15               7           57     8.1  libunwind.so.1::uwx_step          uwx_step.c&lt;BR /&gt;  0.29     1.44               6           41     6.8  libpthread.so.1::*unnamed@0x4042(920-cc0)*  mutex.c&lt;BR /&gt;  0.25     1.69               4           36     9.0  libunwind.so.1::uwx_get_frame_info  uwx_step.c&lt;BR /&gt;  0.19     1.88               3           27     9.0  libpthread.so.1::pthread_setcancelstate  cancel.c&lt;BR /&gt;  0.19     2.07               3           27     9.0  libunwind.so.1::uwx_reclaim_scoreboards  uwx_scoreboard.c&lt;BR /&gt;  0.17     2.24               4           24     6.0  libunwind.so.1::uwx_decode_prologue  uwx_uinfo.c&lt;BR /&gt;  0.15     2.39               1           21    21.0  { STUB }-&amp;gt;libunwind.so.1::uwx_reset_str_pool&lt;BR /&gt;  0.14     2.53               4           20     5.0  libpthread.so.1::pthread_mutex_lock  mutex.c&lt;BR /&gt;  0.13     2.65               2           18     9.0  libpthread.so.1::pthread_mutex_unlock  mutex.c&lt;BR /&gt;  0.13     2.78               2           18     9.0  librtc.sl::rtc_split_special_region  infrtc.c&lt;BR /&gt;  0.11     2.89               2           16     8.0  libpthread.so.1::ENTER_PTHREAD_LIBRARY_FUNC  pthread.c&lt;BR /&gt;  0.10     2.99               3           15     5.0  libunwind.so.1::uwx_search_utable32  uwx_utable.c&lt;BR /&gt;--------------------------------------------------------------------------------------------&lt;BR /&gt;[Minimum function entries: 5, percent cutoff: 0.10, cumulative percent cutoff: 100.00]&lt;BR /&gt;&lt;BR /&gt;Function Details&lt;BR 
/&gt;----------------------------------------------------------------------&lt;BR /&gt;% Total                               Avg.&lt;BR /&gt; Dcache        Sampled       Dcache  Dcache         Line|&lt;BR /&gt;Latency         Dcache      Latency  Laten.         Slot|  &amp;gt;Statement|&lt;BR /&gt; Cycles         Misses       Cycles  Cycles    Col,Offset  Instruction&lt;BR /&gt;----------------------------------------------------------------------&lt;BR /&gt;[Cutoffs excluded all entries (minimum: 0; percent: 1.00; cumulative percent: 100.00;)]&lt;BR /&gt;</description>
      <pubDate>Wed, 28 Jun 2006 12:32:54 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814298#M268865</guid>
      <dc:creator>Anurag_7</dc:creator>
      <dc:date>2006-06-28T12:32:54Z</dc:date>
    </item>
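Another editor's sketch, not part of the original thread: the headline cache figures in the dcache report above, and the striking dominance of dld.so in the Load Module Summary, can likewise be recomputed from the quoted counts.

```python
# Recompute the derived cache figures from the raw counts in the
# dcache report above.
data_references = 447_196_040   # DATA_REFERENCES
l1d_reads       = 312_349_645   # L1D_READS
l1d_read_misses =  37_692_146   # L1D_READ_MISSES.ALL

miss_pct = 100 * l1d_read_misses / l1d_reads      # L1 data cache miss %
l1_share = 100 * l1d_reads / data_references      # refs that are L1D reads

print(round(miss_pct, 2))  # 12.07, matching the report
print(round(l1_share, 2))  # 69.85

# Load Module Summary: dld.so's share of total sampled dcache latency,
# and its average latency per sampled miss (counts from the table above).
dld_latency, dld_misses = 13_715, 365
total_latency = 14_291
dld_share = 100 * dld_latency / total_latency
dld_avg   = dld_latency / dld_misses

print(round(dld_share, 2))  # 95.97
print(round(dld_avg, 1))    # 37.6
```

The notable point is not the 12% miss rate but the distribution: dld.so (the dynamic loader) accounts for roughly 96% of sampled dcache latency, at about 37.6 cycles per miss versus single digits elsewhere, which points at symbol-resolution overhead rather than the application's own data access patterns.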
    <item>
      <title>Re: CPU STALLS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814299#M268866</link>
      <description>I'm more concerned about instruction cache misses. You may well benefit from using profile-based optimization.&lt;BR /&gt;&lt;BR /&gt;It's done like this:&lt;BR /&gt;&lt;BR /&gt;aCC +Oprofile=collect -O sample.C -o sample.exe   // Compile to instrumented executable&lt;BR /&gt;sample.exe &amp;lt; input.file                          // Collect execution profile data&lt;BR /&gt;aCC +Oprofile=use -O sample.C -o sample.exe       // Recompile with optimization&lt;BR /&gt;&lt;BR /&gt;Profile-based optimization will make much better decisions in laying out the code because it uses statistics gathered during an actual execution.</description>
      <pubDate>Wed, 28 Jun 2006 14:07:25 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814299#M268866</guid>
      <dc:creator>A. Clay Stephenson</dc:creator>
      <dc:date>2006-06-28T14:07:25Z</dc:date>
    </item>
    <item>
      <title>Re: CPU STALLS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814300#M268867</link>
      <description>Oh, and because this is UNIX, and because, as a general rule, UNIX code tends to be I/O-bound rather than actually CPU-bound (intense analysis programs are the exception), make sure that you are not spending tons of time optimizing code when the CPU may be only a small component of the actual bottleneck.</description>
      <pubDate>Wed, 28 Jun 2006 14:10:29 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cpu-stalls/m-p/3814300#M268867</guid>
      <dc:creator>A. Clay Stephenson</dc:creator>
      <dc:date>2006-06-28T14:10:29Z</dc:date>
    </item>
  </channel>
</rss>

