
CPU STALLS

 
Anurag_7
Advisor

CPU STALLS

We have a 32-bit application running on Itanium.

The application's response time is a cause for concern.

The percentage of cycles lost due to CPU stalls is 71.45.

Analysis of the process using Caliper suggests that the CPU stall rate is very high. Below is the report.

Is there a way to overcome this?

Please guide.

Anurag


HP Caliper Total CPU Event Counts Report for PkMS
================================================================================

Target Application
Program: /u03/wmpso/CMLACARE/opt/app/wmprd/bin/Lttr
Invocation: /u03/wmpso/CMLACARE/opt/app/wmprd/bin/Lttr
Process ID: 3456 (started by Caliper)
Start time: 09:43:11 PM
End time: 09:43:16 PM
Termination Status: 0
Last modified: May 05, 2006 at 09:26 PM
Memory model: ILP32
Main module text page size: default

Processor Information
Machine name: itanium1
Number of processors: 4
Processor type: Itanium2 6M
Processor speed: 1300 MHz

Target Execution Time
Real time: 4.915 seconds
User time: 1.757 seconds
System time: 0.228 seconds

-----------------------------------------------
PLM
Event Name U..K TH Count
-----------------------------------------------
BACK_END_BUBBLE.ALL x___ 0 1551927751
CPU_CYCLES x___ 0 2171937631
IA64_INST_RETIRED x___ 0 2094967607
NOPS_RETIRED x___ 0 403213401
-----------------------------------------------
PLM: Privilege Level Mask
U/K = user/kernel levels (U: level 3, K: level 0)
The intermediate levels (1, 2) are unused on HP-UX or Linux
x : the metric is measured at the given level (_ : not measured)
TH: event THreshold, determines the event counter behavior,
TH == 0 : counter += event_count_in_cycle
TH > 0 : counter += (event_count_in_cycle >= threshold ? 1 : 0)
-----------------------------------------------
% of Cycles lost due to stalls (lower is better):
100 * BACK_END_BUBBLE.ALL / CPU_CYCLES = 71.45

Effective CPI (lower is better):
CPU_CYCLES / (IA64_INST_RETIRED - NOPS_RETIRED) = 1.2838

Effective CPI during unstalled execution (lower is better):
(CPU_CYCLES - BACK_END_BUBBLE.ALL) / (IA64_INST_RETIRED - NOPS_RETIRED) = 0.3665
-----------------------------------------------
9 REPLIES
Steven E. Protter
Exalted Contributor

Re: CPU STALLS

Shalom,

This is a bit strange, but I think I'd have the hardware checked.

First, use cstm, mstm, or xstm and run the CPU exercisers. If there are any failures, HP needs to replace the hardware.
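If it helps, here is a rough sketch of a cstm session from memory (the exact command names and device numbers vary by STM version and system, so follow the tool's on-screen help rather than this sketch):

# cstm
cstm> map                # list devices and their numbers
cstm> select device 2    # pick a CPU entry from the map (2 is just an example)
cstm> exercise           # run the exerciser on the selected device
cstm> exit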

Then call in the HP Hardware team.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Anurag_7
Advisor

Re: CPU STALLS

Hi Shalom,

Thanks a lot for your reply.

Can you please let me know how to use these commands and what kind of output I should collect and analyze?

Thanks a lot

Anurag
A. Clay Stephenson
Acclaimed Contributor

Re: CPU STALLS

It looks like there is a big chunk of branch misprediction occurring, which is starving the processor of instructions. Since you mention a 32-bit application, I assume that means you are running a PA-RISC application under the ARIES emulator. Do you have the option of porting the application to native code?
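One quick way to check which kind of binary you have (using the path from your Caliper report):

file /u03/wmpso/CMLACARE/opt/app/wmprd/bin/Lttr

A PA-RISC binary will report something like "PA-RISC shared executable", while a native Itanium binary reports "ELF-32 executable object file - IA64".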
If it ain't broke, I can fix that.
Anurag_7
Advisor

Re: CPU STALLS

It's not a PA-RISC binary :-)

It's an ELF-32 executable object file - IA64.
Michael Steele_2
Honored Contributor

Re: CPU STALLS

71.45 is a high number; anything greater than 50 is unacceptable. This is usually an indication of a poorly written program (loops within loops vs. straight-line code). Check for cache misses with Caliper.

caliper icache -o reports/icachem.txt ./matmul

http://docs.hp.com/en/5991-5499/ch02s04.html

"...A hot spot is an instruction or set of instructions that has a higher execution count than most other instructions in a program. For example, code that is inside a loop inside a loop inside a loop will likely be executed more times than straight-line code. Usually the â hotnessâ is measured with CPU cycles, but it could also be measured with metrics such as cache misses...."

This doc makes some patching suggestions, and also suggests increasing the page size:

http://h21007.www2.hp.com/dspp/files/unprotected/hpux/Top_Ten_Perf_Tips.pdf

There's also a lot of reference to the latest pthread patch. See page 12, MxN vs. 1x1.
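On the page-size point: if recompiling isn't convenient, chatr can request larger text and data page sizes on an existing executable. A sketch, using the binary path from your Caliper report (the sizes actually honored depend on your OS version and patch level):

chatr +pi 64K +pd 64K /u03/wmpso/CMLACARE/opt/app/wmprd/bin/Lttr
chatr /u03/wmpso/CMLACARE/opt/app/wmprd/bin/Lttr    # rerun with no options to verify the new settings

Your report shows "Main module text page size: default", so this hasn't been tuned yet.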
Support Fatherhood - Stop Family Law
A. Clay Stephenson
Acclaimed Contributor

Re: CPU STALLS

In that case, the most likely culprit is poorly written code, such that there are tons of cache misses. I would almost certainly rule out any sort of hardware problem, since I assume this hardware doesn't know how to pick on just your application. If you see widespread performance degradation then I would reconsider, but if the poor performance is limited to your application then ...
If it ain't broke, I can fix that.
Anurag_7
Advisor

Re: CPU STALLS

Thanks, Clay.

In fact, cache misses are only 12%.

Here is the report. What do you suggest?

Please guide.

Anurag


L1 data cache miss percentage:

Sampling Specification
Sampling event: DATA_EAR_EVENTS
Sampling period: 10000 events
Sampling period variation: 500 (5.00% of sampling period)
Sampling counter privilege: user (user-space sampling)
Data granularity: 16 bytes
Number of samples: 452
Data sampled: Data cache miss

Data Cache Metrics Summed for Entire Run
-----------------------------------------------
PLM
Event Name U..K TH Count
-----------------------------------------------
DATA_REFERENCES x___ 0 447196040
L1D_READS x___ 0 312349645
L1D_READ_MISSES.ALL x___ 0 37692146
-----------------------------------------------
PLM: Privilege Level Mask
U/K = user/kernel levels (U: level 3, K: level 0)
The intermediate levels (1, 2) are unused on HP-UX or Linux
x : the metric is measured at the given level (_ : not measured)
TH: event THreshold, determines the event counter behavior,
TH == 0 : counter += event_count_in_cycle
TH > 0 : counter += (event_count_in_cycle >= threshold ? 1 : 0)
-----------------------------------------------
L1 data cache miss percentage:
12.07 = 100 * (L1D_READ_MISSES.ALL / L1D_READS)

Percent of data references accessing L1 data cache:
69.85 = 100 * (L1D_READS / DATA_REFERENCES)
-----------------------------------------------

Load Module Summary
------------------------------------------------------------------
% Total Avg.
Dcache Cumulat Sampled Dcache Dcache
Latency % of Dcache Latency Laten.
Cycles Total Misses Cycles Cycles Load Module
------------------------------------------------------------------
95.97 95.97 365 13715 37.6 dld.so
1.66 97.63 31 237 7.6 libunwind.so.1
0.85 98.48 17 122 7.2 libpthread.so.1
0.82 99.30 10 117 11.7 libncursesw.so
0.38 99.69 6 55 9.2 librtc.sl
0.13 99.82 3 19 6.3 liborb_r.so
0.13 99.95 2 19 9.5 libc.so.1
0.05 100.00 1 7 7.0 libCsup.so.1
------------------------------------------------------------------
100.00 100.00 435 14291 32.9 Total
------------------------------------------------------------------

Function Summary
--------------------------------------------------------------------------------------------
% Total Avg.
Dcache Cumulat Sampled Dcache Dcache
Latency % of Dcache Latency Laten.
Cycles Total Misses Cycles Cycles Function File
--------------------------------------------------------------------------------------------
0.76 0.76 9 108 12.0 libncursesw.so::__milli_memcmp
0.40 1.15 7 57 8.1 libunwind.so.1::uwx_step uwx_step.c
0.29 1.44 6 41 6.8 libpthread.so.1::*unnamed@0x4042(920-cc0)* mutex.c
0.25 1.69 4 36 9.0 libunwind.so.1::uwx_get_frame_info uwx_step.c
0.19 1.88 3 27 9.0 libpthread.so.1::pthread_setcancelstate cancel.c
0.19 2.07 3 27 9.0 libunwind.so.1::uwx_reclaim_scoreboards uwx_scoreboard.c
0.17 2.24 4 24 6.0 libunwind.so.1::uwx_decode_prologue uwx_uinfo.c
0.15 2.39 1 21 21.0 { STUB }->libunwind.so.1::uwx_reset_str_pool
0.14 2.53 4 20 5.0 libpthread.so.1::pthread_mutex_lock mutex.c
0.13 2.65 2 18 9.0 libpthread.so.1::pthread_mutex_unlock mutex.c
0.13 2.78 2 18 9.0 librtc.sl::rtc_split_special_region infrtc.c
0.11 2.89 2 16 8.0 libpthread.so.1::ENTER_PTHREAD_LIBRARY_FUNC pthread.c
0.10 2.99 3 15 5.0 libunwind.so.1::uwx_search_utable32 uwx_utable.c
--------------------------------------------------------------------------------------------
[Minimum function entries: 5, percent cutoff: 0.10, cumulative percent cutoff: 100.00]

Function Details
----------------------------------------------------------------------
% Total Avg.
Dcache Sampled Dcache Dcache Line|
Latency Dcache Latency Laten. Slot| >Statement|
Cycles Misses Cycles Cycles Col,Offset Instruction
----------------------------------------------------------------------
[Cutoffs excluded all entries (minimum: 0; percent: 1.00; cumulative percent: 100.00;)]
A. Clay Stephenson
Acclaimed Contributor

Re: CPU STALLS

I'm more concerned about instruction cache misses. You may well benefit from using profile-based optimization.

It's done like this:

aCC +Oprofile=collect -O sample.C -o sample.exe    // Compile to an instrumented executable.
sample.exe < input.file                            // Run it to collect execution profile data.
aCC +Oprofile=use -O sample.C -o sample.exe        // Recompile with profile-guided optimization.

Profile-based optimization will make much better decisions in laying out the code because it uses statistics gathered during an actual execution.
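One hedged note: with the HP compilers, the instrumented run typically writes its profile to a flow.data file in the current directory (an assumption; check your aCC version's documentation for the exact name and location). Either way, make sure the training run uses input representative of your real workload:

sample.exe < typical_production_input    # profile written out (assumed default: ./flow.data)
aCC +Oprofile=use -O sample.C -o sample.exe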
If it ain't broke, I can fix that.
A. Clay Stephenson
Acclaimed Contributor

Re: CPU STALLS

Oh, and because this is UNIX, and because as a general rule UNIX code tends to be I/O-bound rather than actually CPU-bound (intense analysis programs are the exception), make sure that you are not spending tons of time optimizing code when the CPU may only be a small component of the actual bottleneck.
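A quick way to sanity-check that (a sketch; the interval and count are arbitrary):

sar -u 5 5    # if %wio dwarfs %usr, the box is I/O-bound, not CPU-bound

Glance will show the same breakdown interactively if you have it installed.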
If it ain't broke, I can fix that.