Introduction to Performance Tuning
DocId:UPERFKBAN00000726  Updated:20010918

DOCUMENT

           Introduction to performance tuning for HP-UX 

When considering the performance of any system it is important to determine a
baseline of what is acceptable. How does the system perform when there is no
load from applications or users? What are the system's resources in terms of
memory, both physical and virtual? How many processors does the system have,
and what is their speed and RISC level? What is the layout of the data? What
are the key kernel parameters set to, and how are those resources being
utilized? What are the utilities to measure these?

Memory Resources 

HP-UX utilizes both physical memory (RAM) and virtual memory, referred to as
swap. There are three resources that can be used to determine the amount
of RAM: syslog.log, dmesg, and adb (the absolute debugger). The information
dmesg reports comes from /var/adm/syslog/syslog.log. While using dmesg is
convenient, if the system has logged too many errors recently, the memory
information may no longer be available.

Insufficient memory resources are a major cause of performance problems and
should be the first area to check.

The memory information from dmesg is at the bottom of the output.

example:
 Memory Information:
 physical page size = 4096 bytes, logical page size = 4096 bytes
 Physical: 524288 Kbytes, lockable: 380880 Kbytes, available: 439312

Using adb reads the memory from a more reliable source, the kernel.

To determine the physical memory (RAM) using adb:

for HP-UX 10.X
 example:

 echo  physmem/D | adb -k /stand/vmunix /dev/kmem 
physmem:
physmem:  24576

for HP-UX 11.X systems running on 32 bit architecture:
 example:

 echo phys_mem_pages/D | adb /stand/vmunix /dev/kmem 
phys_mem_pages:
phys_mem_pages: 24576


for HP-UX 11.X systems running on 64 bit architecture:
 example:

 echo phys_mem_pages/D | adb64 -k /stand/vmunix /dev/mem
phys_mem_pages:
phys_mem_pages: 262144

The results of these commands are in 4 Kb memory pages; to determine the size
in bytes, multiply by 4096.
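
For example, using the 10.X output above, the conversion works out as follows
(a quick sketch using bc; the page count 24576 is taken from the sample output):

 echo "24576 * 4096" | bc
 100663296

That is, 24576 pages x 4096 bytes = 100663296 bytes, or 96 Mb of RAM.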

To fully utilize all of the RAM on a system there must be a sufficient amount
of virtual memory to accommodate all processes. The HP recommendation is that
virtual memory be at least equal to physical memory plus application size.
This is outlined in the System Administration Tasks Manual.

To determine virtual memory configuration, run the following command:
# swapinfo -tam
              Mb      Mb      Mb   PCT  START/      Mb
TYPE       AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev         1024       0    1024     0                    1  /dev/vg00/lvol1
reserve        -     184    -184
memory       372      96     276    26
total       1396     280    1116    20

The key areas to monitor are reserve, memory and total. For a process
to spawn it needs a sufficient amount of virtual memory to be placed in
reserve. There should be a sufficient amount of free device swap to open any
processes that may be spawned during the course of operations. By subtracting
the reserve from the device total you can determine this value.
If there is an insufficient amount available (typically from device swap)
you will receive an error: cannot fork : not enough virtual memory. If
this error is received, you will need to allocate more device swap. This
should be configured on a disk with no other swap partitions, and ideally of
the same size and priority as existing swap logical volumes to enable
interleaving.
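
As a worked example using the sample swapinfo output above: the device swap
total is 1024 Mb and 184 Mb is held in reserve, so roughly 1024 - 184 = 840 Mb
of device swap remains available for new process reservations.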

Refer to the Application Note KBAN00000218
Configuring Device Swap for details on the procedure.

The memory line is enabled with the kernel parameter swapmem_on set to 1. This
allows a percentage of RAM to be allocated for pseudo-swap. This is the default
and should be used unless the amount of lockable memory exceeds 25% of RAM.

You can determine the amount of lockable memory by running the command:
example:
echo total_lockable_mem/D | adb -k /stand/vmunix  /dev/mem 

total_lockable_mem:
total_lockable_mem: 185280

This will return the amount in Kbytes of lockable memory in use.

If pseudo-swap is disabled by setting swapmem_on to 0, there will typically be
a need to increase the amount of device swap in the system to accommodate
paging and the reserve area.

If the total under PCT USED is 90 or greater it is recommended to increase the
amount of device swap.


After physical and virtual memory is determined, we need to determine how much
buffer cache has been configured and how much is being used. By default the
system will use dynamic buffer cache; the kernel will show bufpages and nbuf
set to 0 in SAM. The parameters that govern the size of the dynamic buffer
cache are dbc_min_pct and dbc_max_pct; these define the minimum
and maximum percentage of RAM allocated. The default values are 5% minimum and
50% maximum.
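
To check the configured percentages from the command line (a quick sketch
assuming kmtune's -q query option on 11.X; on 10.X the values can be viewed in
SAM or /stand/system):

 kmtune -q dbc_min_pct
 kmtune -q dbc_max_pct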

On systems with small amounts of RAM these default values may be adequate for
dedicated applications. Since the introduction of HP-UX 11.0 the amount of RAM
a system can have has increased from 3.75Gb to our newest systems with up to
256Gb. Keeping the default values on systems with a large amount of RAM can
have a negative impact on performance, due to the time taken by the lower level
routines that check for free memory in the cache.

 To monitor the use of the buffer cache run the following command:

sar -b 5 30

You will see output similar to :

bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
  0      95      100       1       2      54       0       0


Ideally we want to see a %wcache of 95 or greater. If the system consistently
shows %wcache less than 75 it would be advisable to lower the value of
dbc_max_pct. In 32 bit architecture the buffer cache resides in quadrant 3,
limiting the maximum size to 1 Gb. Typically values less than 300Mb are
preferable.
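
As a rough illustration (the 4 Gb RAM figure here is only an example, not taken
from this document): with 4 Gb of RAM the default dbc_max_pct of 50 would allow
the buffer cache to grow to 2 Gb, while a dbc_max_pct of 7 caps it at about
287 Mb, within the 300 Mb guideline above.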

Memory for applications

For applications to have a sufficient amount of space for text, data and stack
in memory the kernel has to be tuned. The total size for text, data and stack
for 32 bit systems using EXEC_MAGIC is in quadrants 1 and 2, and is 2Gb less
the size of the Uarea. These areas are limited by the kernel parameters
maxtsiz, maxdsiz and maxssiz. Under SHMEM_MAGIC the total size is limited to
1Gb in quadrant 1.


If these parameters are undersized the system will report errors. Insufficient
maxdsiz will return "out of memory" and insufficient maxssiz will
return "stack growth failure".
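
To check the current settings of these limits (a quick sketch assuming
kmtune's -q query option on 11.X):

 kmtune -q maxtsiz
 kmtune -q maxdsiz
 kmtune -q maxssiz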


The last configurable area of memory to check is shared memory. Any
application running within the 32 bit domain will have a limit of 1.75Gb total
for shared memory for EXEC_MAGIC and 2.75Gb using SHMEM_MAGIC. Individual
processes cannot cross quadrant boundaries, so the largest shmmax can be for
32 bit is 1Gb.

It is important to determine if the application is running 32 bit or 64 bit
when troubleshooting 64 bit systems.

This can be done with the file command :

example :

file /stand/vmunix
/stand/vmunix:  ELF-64 executable object file - PA-RISC 2.0 (LP64)

PA-RISC versions under 2.0 are 32 bit

 For an overview of shared memory on 32 bit systems refer to the Application
Note RCMEMKBAN00000027, Understanding Shared Memory on PA RISC Systems.

 The kernel parameter shmmax determines the size of the shared memory
region. SAM will not allow this to be configured greater than 1 quadrant, or
1Gb, even on 64 bit systems. If a larger shmmax value is needed for 64 bit
systems it has to be done using a manual kernel build.

In a 64 bit system, 32 bit applications will only address the 32 bit shared
memory region, and 64 bit applications will only address the 64 bit regions.

Example of creating a kernel with shmmax set to 2Gb:

 cd /stand/build
 /usr/lbin/sysadm/system_prep -v -s system
 kmtune -s shmmax=2147483648 -S /stand/build/system
 /usr/sbin/mk_kernel -s ./system
 mv /stand/system /stand/system.prev
 mv /stand/build/system /stand/system
 kmupdate
 shutdown -ry 0
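
After the system comes back up, the new value can be confirmed (a quick check,
assuming kmtune's -q query option):

 kmtune -q shmmax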

To determine shared memory allocation, use ipcs. This utility is used to
report the status of interprocess communication facilities. Run the following
command:

ipcs -mob

You will see an output similar to this :

ipcs -mob
IPC status from /dev/kmem as of Tue Apr 17 09:29:33 2001
T      ID     KEY        MODE        OWNER     GROUP NATTCH  SEGSZ
Shared Memory:
m       0 0x411c0359 --rw-rw-rw-      root      root      0    348
m       1 0x4e0c0002 --rw-rw-rw-      root      root      1  61760
m       2 0x412006c9 --rw-rw-rw-      root      root      1   8192
m       3 0x301c3445 --rw-rw-rw-      root      root      3 1048576
m    4004 0x0c6629c9 --rw-r-----      root      root      2 7235252
m       5 0x06347849 --rw-rw-rw-      root      root      1  77384
m     206 0x4918190d --rw-r--rw-      root      root      0  22908
m    6607 0x431c52bc --rw-rw-rw-    daemon    daemon      1 5767168


The two fields of the most interest are NATTCH and SEGSZ.

NATTCH - The number of processes attached to the associated shared
memory segment. Look for segments showing 0; they indicate processes that have
not released their shared memory segment.

If there are multiple segments showing an NATTCH of zero, especially if
they are owned by a database, this can be an indication that the segments are
not being efficiently released. This is due to the program not calling
detachreg. These segments can be removed using ipcrm -m shmid.


Note: Even though there is no process attached to the segment, the data
structure is still intact. The shared memory segment and the data structure
associated with it are destroyed by executing this command.

SEGSZ - The size of the associated shared memory segment in bytes. The
total of SEGSZ for a 32 bit system using EXEC_MAGIC cannot exceed 1879048192
bytes (1.75Gb), or 2952790016 bytes (2.75Gb) for SHMEM_MAGIC.
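
A quick way to total the allocated segments and to spot unattached ones is to
run awk against the ipcs output (a sketch; the column positions assume the
exact ipcs -mob format shown above):

 # total bytes of shared memory currently allocated
 ipcs -mob | awk '$1 == "m" { total += $8 } END { print total " bytes allocated" }'

 # IDs of segments with no processes attached (candidates to investigate)
 ipcs -mob | awk '$1 == "m" && $7 == 0 { print $2 }'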




CPU load 

Once we have determined that the memory resources are adequate, we need to
address the processors. We need to determine how many processors there are,
what speed they run at and what load they are under during a variety of system
loads.

To find out the processor speed, run:
example:
 echo itick_per_usec/D | adb -k /stand/vmunix /dev/mem
itick_per_usec:
itick_per_usec: 360

The value returned is in interval timer ticks per microsecond, which
corresponds to the processor clock speed in MHz (360 here indicates a 360 MHz
processor).

To find out how many processors are in use, run:
example:
 echo runningprocs/D | adb -k /stand/vmunix /dev/mem
runningprocs:
runningprocs:   2

To find out cpu load on a multi-processor system, run:

sar -Mu 5 30

The output will look similar to :


11:20:05     cpu    %usr    %sys    %wio   %idle
11:20:10       0       1       1       0      99
               1      17      83       0       0
          system       9      42       0      49


This will return data on the cpu load for each processor:

cpu - cpu number (only on a multi-processor system and used with the -M option)
%usr - user mode
%sys - system mode
%wio - idle with some process waiting for I/O
(only block I/O, raw I/O, or VM pageins/swapins indicated)
%idle - other idle


Typically the %usr value will be higher than %sys. Out of memory errors can
occur when excessive CPU time is given to system processes versus user
processes. These can also be caused when maxdsiz is undersized. As a rule, we
should expect to see %usr at 80% or less, and %sys at 50% or less. Values
higher than these can indicate a CPU bottleneck.

The %wio should ideally be 0%; values less than 15% are acceptable. The %idle
being low over short periods of time is not a major concern. This is the
percentage of time that the CPU is not running processes. However, low %idle
over a sustained period could be an indication of a CPU bottleneck.

If %wio is greater than 15% and %idle is low, consider the size of the run
queue (runq-sz). Ideally we would like to see values less than 4. If the
system is a single processor system under heavy load the CPU bottleneck may be
unavoidable.

If the cpu load appears high, but the system is not heavily loaded, check the
value of the kernel parameter timeslice. By default it is 10; if a Tuned
Parameter Set was applied to the kernel, it will have changed timeslice to 1.
This causes the cpu to context switch every 10 ms instead of every 100 ms. In
most instances this will have a negative effect on cpu efficiency.
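
To check the current setting (a quick sketch assuming kmtune's -q query option
on 11.X):

 kmtune -q timeslice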

To find out what the run queue load is, run:

sar -q 5 30

The output will look similar to :

          runq-sz %runocc swpq-sz %swpocc
10:06:36     0.0       0     0.0       0
10:06:41     1.5      40     0.0       0
10:06:46     3.0      20     0.0       0
10:06:51     1.0      20     0.0       0
Average      1.8      16     0.0       0

runq-sz  - Average length of the run queue(s) of  processes
(in memory and runnable)

%runocc - The percentage of time the run queue(s)  were occupied by processes
(in memory and runnable)

swpq-sz - Average length of the swap queue of runnable processes
 (processes swapped out but ready to run)

High kernel values can negatively affect system performance

Three of the most critical kernel resources are nproc, ninode and nfile. By
default these are controlled by formulas based on the value of maxusers.

Ideally we want to keep these settings within 25% of the peak observed usage.
Using sar -v we can monitor the proc table and file table; the inode table
reporting is not accurate.


The output of sar -v will show the usage/kernel value for each area.
Example:
08:05:08 text-sz  ov  proc-sz  ov  inod-sz     ov  file-sz  ov
08:05:10   N/A    0  272/6420  0  3427/7668  0  5458/12139 0
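
In the sample above, 272 of the 6420 configured nproc entries and 5458 of the
12139 nfile entries are in use. The configured values can also be read
directly from the kernel configuration (a quick sketch assuming kmtune's -q
query option on 11.X):

 kmtune -q maxusers
 kmtune -q nproc
 kmtune -q nfile
 kmtune -q ninode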



What do these parameters control?

nfile
Number of open files for all processes running on the system. Though each entry
is relatively small, there is some kernel overhead in managing this table.
Additionally, each time a file is opened it will consume an entry in nfile,
even if the file is already opened by another process. When nfile entries are
exhausted, a console and/or syslog error message will appear specifically
indicating "File table full". The value should usually be 10-25% greater than
the maximum number of open files observed during peak load.


ninode

The kernel parameter ninode only affects HFS filesystems; JFS (VxFS) allocates
its own inodes dynamically (vx_ninode) based on the amount of
available memory. The true inode count is only incremented by each unique HFS
file open, i.e. the initial open of a file; each subsequent open of that file
increments the file-sz column and decrements the available nfile value.

This variable is frequently oversized, and can impose a heavy toll on the
processor (especially machines with multiple CPUs). The best rule is not to
increase it unless console/syslog messages are received specifically
stipulating "Inode table is full"; otherwise the table will look almost or
completely full some time after boot. In most modern systems the only HFS file
system is /stand. You can verify this by viewing the /etc/fstab file. If
this is the case on your system, the value of ninode needs to be no larger than
the total number of files in the /stand filesystem. The default value of 476
is adequate. You can verify the inodes in use by running the command:
 bdf -i

Filesystem          kbytes    used   avail %used  iused  ifree %iuse Mounted on
/dev/vg00/lvol3     143360   59431   78687   43%   2835  20981   12% /
/dev/vg00/lvol1     111637   49575   50898   49%     64  17984    0% /stand
/dev/vg00/lvol8    2097152 1339747  710695   65%  93746 189350   33% /var
/dev/vg00/lvol7     946176  746242  187457   80%  28853  49983   37% /usr
/dev/vg00/lvol4     307200    9287  279346    3%     80  74476    0% /tmp
/dev/vg00/lvol6    1536000  594459  882855   40%  11906 235382    5% /opt
/dev/vg00/lvol5     409600   68726  319613   18%    116  85216    0% /home

The iused column will show the inodes in use per filesystem. As you can
see in this example only 64 HFS inodes are actually in use. Also note that the
sum of the other inodes in use is far greater than what sar -v reports.

Unlike nfile, each time a file is opened it will consume only one entry in the
table. Excessive values can cause high CPU loads, and a network timeout
condition for a High Availability Cluster, most often at the start of a backup
routine. When this variable is large, the initial wait time to hash an entry is
quite short, so file opens can occur quickly at first. Since there is no
active accounting, the only method of determining what is in this table is a
serial search, which results in very expensive processing time. When the
processor 'walks' this table, little other activity is performed.

The inode count as expressed by sar only refers to the cache size for HFS
inodes, not the actual inode count.

If your system is all VxFS except for /stand, you are using 286 bytes of RAM
per inode for all non-HFS inodes allocated in the kernel.

This is a known problem and can in some cases cause severe cpu problems, as
well as sysmap fragmentation. This will show as very high %wio or low idle
times. %wio values over 15% are considered a cpu bottleneck.

To determine the number of VxFS inodes allocated (these are not reported by
sar), run:
 example:
 echo vxfs_ninode/D | adb -k /stand/vmunix /dev/mem

vxfs_ninode:
vxfs_ninode:    64000
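
As a rough worked example using the figures above: 64000 VxFS inodes at 286
bytes each is roughly 18304000 bytes, or about 17.5 Mb of RAM held by the VxFS
inode cache.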



nproc

This pertains to the number of processes system-wide. This is another variable
affected by indiscriminate setting of maxusers. It is most commonly referenced
when a ps -ef is run or when Glance/GPM and similar commands are initiated. The
value should usually be 10-25% greater than the maximum number of processes
observed under load to allow for unanticipated process growth.
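
A simple way to compare current usage against nproc is to count the running
processes (subtract one line for the ps header):

 ps -ef | wc -l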

For a complete overview of 11.X kernel parameters refer to :

http://docs.hp.com/hpux/onlinedocs/os/KCparams.OverviewAll.html

Disk I/O

Disk bottlenecks can be caused by a number of factors. The buffer cache usage,
cpu load and high disk I/O load can all contribute to a bottleneck. After
determining the cpu and buffer cache load, check the disk I/O load.

To determine disk I/O performance run:

 sar -d 5 30

The output will look similar to :

device   %busy   avque   r+w/s    blks/s  avwait  avserv
c1t6d0    0.80    0.50       1       4    0.27    13.07
c4t0d0    0.60    0.50       1       4    0.26     8.60


 %busy          Portion of time device was busy  servicing a request

 avque          Average number of requests outstanding for the device

 r+w/s          Number of data transfers per second (read and writes)
                from  and to the device

 blks/s         Number of bytes transferred (in 512-byte units)
                  from and to the device

 avwait         Average time (in milliseconds) that transfer requests
                waited idly on queue for the device

 avserv         Average time (in milliseconds) to service each
                transfer request   (includes seek, rotational latency,
                and data transfer times) for  the device.

When the average wait (avwait) is greater than the average service time
(avserv) it indicates the disk could not keep up with the load during that
sample. This is considered a bottleneck.

 The avwait is similar to the %wio returned by sar -u for the cpu.

If a bottleneck is identified, run:

strings /etc/lvmtab
 to identify the volume group associated with the disks.

lvdisplay -v /dev/vgXX/lvolX
 to show which disks are associated with the logical volume.

bdf
 to see if this volume group's file systems are full (> 85%).

cat /etc/fstab
 to determine the file system type associated with the lvol/mountpoint.

How can disk I/O be improved?

1. Reduce the volume of data on the disk to less than 90%
2. Stripe the data across disks to improve I/O speed
3. If you are using Online JFS, run fsadm -e to defragment the extents (see the
example after this list).
4. If you are using HFS filesystems, implement asynchronous writes by setting
the kernel parameter fs_async to 1, or consider converting to VxFS.
5. Reduce the size of the buffer cache (if %wcache is less than 90)
6. If you are using raw logical volumes, consider implementing asynchronous I/O.
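
A sketch of the Online JFS defragmentation mentioned in item 3 (the mount point
/home is only an illustration; the -E and -e options assume the OnlineJFS fsadm
syntax):

 # report extent fragmentation first
 fsadm -F vxfs -E /home

 # reorganize (defragment) the extents
 fsadm -F vxfs -e /home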

The difference between asynchronous I/O and synchronous I/O is that async does
not wait for confirmation of the write before moving on to the next task. This
increases disk performance at the expense of robustness. Synchronous I/O waits
for acknowledgement of the write (or of its failure) before continuing on. The
write may have physically taken place or could still be in the buffer cache,
but in either case acknowledgement has been sent. In the case of async, there
is no waiting.



 To implement asynchronous I/O on HP-UX:

    * add the asyncdsk driver (Asynchronous Disk Pseudo Driver)
      to the HP-UX kernel (using SAM)

    * create the device file:

# mknod /dev/async c 101 0x00000#

where # (the last digit of the minor number) can be one of the following values:

         0x000000 default
         0x000001 enable immediate reporting
         0x000002 flush the CPU cache after reads
         0x000004 allow disks to timeout
         0x000005 is a combination of 1 and 4
         0x000007 is a combination of 1, 2 and 4

   Note: Contact your database vendor or product vendor to determine the
correct minor number for your application.

Patches

There are a number of OS performance issues that are resolved by current
patches. For the most up to date patches contact the Hewlett-Packard Response
Center.