Introduction to Performance Tuning for HP-UX
When considering the performance of any system, it is important to determine a
baseline of what is acceptable. How does the system perform when there is no
load from applications or users? What are the system's resources in terms of
memory, both physical and virtual? How many processors does the system have,
and what are their speed and PA-RISC level? What is the layout of the data?
What are the key kernel parameters set to, and how are those resources being
utilized? What utilities are available to measure these?
Memory Resources
HP-UX utilizes both physical memory (RAM) and virtual memory, referred to as
swap. There are three resources that can be used to determine the amount
of RAM: syslog.log, dmesg, and adb (the absolute debugger). The information dmesg
reports comes from /var/adm/syslog/syslog.log. While using dmesg is
convenient, if the system has logged too many errors recently the memory
information may no longer be available.
Insufficient memory resources are a major cause of performance problems and
should be the first area to check.
The memory information from dmesg is at the bottom of the output.
example:
Memory Information:
physical page size = 4096 bytes, logical page size = 4096 bytes
Physical: 524288 Kbytes, lockable: 380880 Kbytes, available: 439312 Kbytes
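To pull just the memory summary out of the dmesg output, a simple filter such as
the following can be used (the grep pattern is only a suggestion; match on
whatever portion of the line appears on your system):
dmesg | grep -i physical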
Using adb reads the memory from a more reliable source, the kernel.
To determine the physical memory (RAM) using adb:
for HP-UX 10.X
example:
echo physmem/D | adb -k /stand/vmunix /dev/kmem
physmem:
physmem: 24576
for HP-UX 11.X systems running on 32 bit architecture:
example:
echo phys_mem_pages/D | adb /stand/vmunix /dev/kmem
phys_mem_pages:
phys_mem_pages: 24576
for HP-UX 11.X systems running on 64 bit architecture:
example:
echo phys_mem_pages/D | adb64 -k /stand/vmunix /dev/mem
phys_mem_pages:
phys_mem_pages: 262144
The results of these commands are reported in 4 Kb memory pages; to determine the
size in bytes, multiply by 4096.
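For example, using the 64 bit output above (simple arithmetic on the value
already shown):
262144 pages x 4096 bytes/page = 1073741824 bytes = 1048576 Kbytes = 1 Gb of RAM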
To fully utilize all of the RAM on a system there must be a sufficient amount
of virtual memory to accommodate all processes. The HP recommendation is
that virtual memory be at least equal to physical memory plus application
size. This is outlined in the System Administration Tasks Manual.
To determine virtual memory configuration, run the following command:
# swapinfo -tam
              Mb      Mb      Mb   PCT  START/      Mb
TYPE       AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev         1024       0    1024     0                    1  /dev/vg00/lvol1
reserve              184    -184
memory       372      96     276    26
total       1396     280    1116    20
The key areas to monitor are reserve, memory and total. For a process
to spawn it needs a sufficient amount of virtual memory to be placed in
reserve. There should be a sufficient amount of free device swap to cover any
processes that may be spawned during the course of operations. By subtracting
the reserve from the device total you can determine this value.
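Using the output above as an example (simple arithmetic on the values already
shown):
1024 Mb device swap available - 184 Mb in reserve = 840 Mb left for new processes to reserve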
If there is an insufficient amount available (typically from device swap),
you will receive the error: cannot fork : not enough virtual memory. If
this error is received, you will need to allocate more device swap. This
should be configured on a disk with no other swap partitions, and ideally of
the same size and priority as existing swap logical volumes to enable
interleaving.
Refer to the Application Note KBAN00000218, Configuring Device Swap, for
details on the procedure.
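As a minimal sketch of that procedure (the volume group /dev/vg01 and logical
volume name lvswap2 are examples only; refer to the Application Note above for
the full procedure), a new device swap area can be created and enabled with:
# lvcreate -L 1024 -n lvswap2 /dev/vg01
# swapon -p 1 /dev/vg01/lvswap2
# swapinfo -tam
Matching the priority (-p) of the existing swap areas allows the kernel to
interleave paging across them. An entry should also be added to /etc/fstab so
the swap area is enabled at boot.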
The memory line is enabled with the kernel parameter swapmem_on set to 1. This
allows a percentage of RAM to be allocated for pseudo-swap. This is the default
and should be used unless the amount of lockable memory exceeds 25% of RAM.
You can determine the amount of lockable memory by running the command:
example:
echo total_lockable_mem/D | adb -k /stand/vmunix /dev/mem
total_lockable_mem:
total_lockable_mem: 185280
This will return the amount in Kbytes of lockable memory in use.
If pseudo-swap is disabled by setting swapmem_on to 0, there will typically be
a need to increase the amount of device swap in the system to accommodate
paging and the reserve area.
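The current setting of swapmem_on can be verified by querying the kernel
tunable with kmtune (available on HP-UX 11.X; the exact output format varies by
release):
# kmtune -q swapmem_on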
If the total under PCT USED is 90 or greater, it is recommended to increase the
amount of device swap.
After physical and virtual memory are determined, we need to determine how much
buffer cache has been configured and how much is being used. By default the
system will use dynamic buffer cache. The kernel will show bufpages and nbuf
set to 0 in SAM. The parameters that govern the size of the dynamic buffer
cache are dbc_min_pct and dbc_max_pct; these define the minimum
and maximum percentage of RAM allocated. The default values are 5% minimum and
50% maximum.
On systems with small amounts of RAM these default values may be adequate for
dedicated applications. Since the introduction of HP-UX 11.0 the amount of RAM
a system can have has increased from 3.75Gb to as much as 256Gb on the newest
systems. Keeping the default values on systems with a large amount of RAM can
have a negative impact on performance, due to the time taken by the lower level
routines that check for free memory in the cache.
To monitor the use of the buffer cache, run the following command:
sar -b 5 30
You will see output similar to:
  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
        0       95      100        1        2       54        0        0
Ideally we want to see a %wcache of 95 or greater. If the system consistently
shows %wcache less than 75, it would be advisable to lower the value of
dbc_max_pct. In 32 bit architecture the buffer cache resides in quadrant 3,
limiting the maximum size to 1 Gb. Typically values less than 300Mb are
preferable.
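As a sketch of how dbc_max_pct might be lowered (the value 25 is only an
example; depending on the OS release the change may only take effect after a
kernel rebuild and reboot, as shown in the shmmax build procedure later in this
document):
# kmtune -q dbc_max_pct
# kmtune -s dbc_max_pct=25 -S /stand/build/system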
Memory for applications
For applications to have a sufficient amount of space for text, data and stack
in memory, the kernel has to be tuned. The total size for text, data and stack
for 32 bit systems using EXEC_MAGIC resides in quadrants 1 and 2, and is 2Gb
less the size of the Uarea. These areas are limited by the kernel parameters
maxtsiz, maxdsiz and maxssiz. Under SHMEM_MAGIC the total size is limited to
1Gb in quadrant 1.
If these parameters are undersized the system will return errors. An
insufficient maxdsiz will return "out of memory" and an insufficient maxssiz
will return "stack growth failure".
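To check the current settings of these parameters before changing them, each
can be queried with kmtune, for example:
# kmtune -q maxdsiz
# kmtune -q maxssiz
# kmtune -q maxtsiz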
The last configurable area of memory to check is shared memory. Any
application running within the 32 bit domain will have a limit of 1.75Gb total
for shared memory under EXEC_MAGIC and 2.75Gb under SHMEM_MAGIC. Individual
processes cannot cross quadrant boundaries, so the largest shmmax can be for
32 bit is 1Gb.
It is important to determine whether the application is running 32 bit or
64 bit when troubleshooting 64 bit systems.
This can be done with the file command, run against the kernel or the
application executable:
example:
file /stand/vmunix
/stand/vmunix: ELF-64 executable object file - PA-RISC 2.0 (LP64)
PA-RISC versions under 2.0 are 32 bit.
For an overview of shared memory for 32 bit systems refer to the Application
Note RCMEMKBAN00000027, Understanding Shared Memory on PA-RISC Systems.
The kernel parameter shmmax determines the maximum size of a shared memory
segment. SAM will not allow this to be configured greater than 1 quadrant, or
1Gb, even on 64 bit systems. If a larger shmmax value is needed for 64 bit
systems it has to be set using a manual kernel build.
On a 64 bit system, 32 bit applications will only address the 32 bit shared
memory region, and 64 bit applications will only address the 64 bit regions.
Example of creating a kernel with shmmax set to 2Gb:
cd /stand/build
/usr/lbin/sysadm/system_prep -v -s system
kmtune -s shmmax=2147483648 -S /stand/build/system
/usr/sbin/mk_kernel -s ./system
mv /stand/system /stand/system.prev
mv /stand/build/system /stand/system
kmupdate
shutdown -ry 0
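After the system comes back up, the new value can be verified, for example with
kmtune (the output format varies slightly between 11.X releases):
# kmtune -q shmmax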
To determine shared memory allocation, use ipcs. This utility reports the
status of interprocess communication facilities. Run the following
command:
ipcs -mob
You will see output similar to this:
IPC status from /dev/kmem as of Tue Apr 17 09:29:33 2001
T ID KEY MODE OWNER GROUP NATTCH SEGSZ
Shared Memory:
m 0 0x411c0359 --rw-rw-rw- root root 0 348
m 1 0x4e0c0002 --rw-rw-rw- root root 1 61760
m 2 0x412006c9 --rw-rw-rw- root root 1 8192
m 3 0x301c3445 --rw-rw-rw- root root 3 1048576
m 4004 0x0c6629c9 --rw-r----- root root 2 7235252
m 5 0x06347849 --rw-rw-rw- root root 1 77384
m 206 0x4918190d --rw-r--rw- root root 0 22908
m 6607 0x431c52bc --rw-rw-rw- daemon daemon 1 5767168
The two fields of most interest are NATTCH and SEGSZ.
NATTCH - The number of processes attached to the associated shared
memory segment. Look for segments with an NATTCH of 0; they indicate shared
memory segments that have not been released.
If there are multiple segments showing an NATTCH of zero, especially if
they are owned by a database, this can be an indication that the segments are
not being efficiently released. This is due to the program not calling
detachreg. These segments can be removed using ipcrm -m shmid.
Note: Even though there is no process attached to the segment, the data
structure is still intact. The shared memory segment and the data structure
associated with it are destroyed by executing this command.
SEGSZ - The size of the associated shared memory segment in bytes. The
total of SEGSZ for a 32 bit system using EXEC_MAGIC cannot exceed 1879048192
bytes (1.75Gb), or 2952790016 bytes (2.75Gb) for SHMEM_MAGIC.
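To quickly total the SEGSZ column against these limits, a simple filter such as
the following can be used (the awk one-liner is only a convenience; it sums the
last field of the shared memory lines):
ipcs -mob | awk '/^m/ {total += $NF} END {print total, "bytes of shared memory allocated"}'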
CPU load
Once we have determined that the memory resources are adequate, we need to
address the processors. We need to determine how many processors there are,
what speed they run at, and what load they are under during a variety of system
loads.
To find out the processor speed, run:
example:
echo itick_per_usec/D | adb -k /stand/vmunix /dev/mem
itick_per_usec:
itick_per_usec: 360
The value returned corresponds to the processor clock speed in MHz (360 MHz in
this example).
To find out how many processors are in use, run:
example:
echo runningprocs/D | adb -k /stand/vmunix /dev/mem
runningprocs:
runningprocs: 2
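The processor count can also be cross-checked with ioscan, which lists the
processors the kernel has claimed:
ioscan -fkC processor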
To find out the cpu load on a multi-processor system, run:
sar -Mu 5 30
The output will look similar to:
11:20:05     cpu    %usr    %sys    %wio   %idle
11:20:10       0       1       1       0      99
               1      17      83       0       0
          system       9      42       0      49
This will return data on the cpu load for each processor:
cpu - cpu number (only on a multi-processor system and used with the -M option)
%usr - user mode
%sys - system mode
%wio - idle with some process waiting for I/O
(only block I/O, raw I/O, or VM pageins/swapins indicated)
%idle - other idle
Typically the %usr value will be higher than %sys. Out of memory errors can
occur when excessive CPU time is given to system processes versus user
processes. These errors can also be caused by an undersized maxdsiz. As a rule,
we should expect to see %usr at 80% or less, and %sys at 50% or less. Values
higher than these can indicate a CPU bottleneck.
The %wio should ideally be 0%; values less than 15% are acceptable. A low %idle
over short periods of time is not a major concern. This is the
percentage of time that the CPU is not running processes. However, a low %idle
over a sustained period could be an indication of a CPU bottleneck.
If %wio is greater than 15% and %idle is low, consider the size of the run
queue (runq-sz). Ideally we would like to see values less than 4. If the system
is a single processor system under heavy load, the CPU bottleneck may be
unavoidable.
If the cpu load appears high but the system is not heavily loaded, check the
value of the kernel parameter timeslice. By default it is 10; if a Tuned
Parameter Set was applied to the kernel, it will change timeslice to 1. This
will cause the cpu to context switch every 10mS instead of every 100mS. In most
instances this will have a negative effect on cpu efficiency.
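The current setting can be checked with kmtune (shown here as an example
query):
# kmtune -q timeslice
If it has been changed to 1, returning it to the default of 10 generally
restores normal context switch behavior.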
To find out what the run queue load is, run:
sar -q 5 30
The output will look similar to:
             runq-sz  %runocc  swpq-sz  %swpocc
10:06:36         0.0        0      0.0        0
10:06:41         1.5       40      0.0        0
10:06:46         3.0       20      0.0        0
10:06:51         1.0       20      0.0        0
Average          1.8       16      0.0        0
runq-sz - Average length of the run queue(s) of processes
(in memory and runnable)
%runocc - The percentage of time the run queue(s) were occupied by processes
(in memory and runnable)
swpq-sz - Average length of the swap queue of runnable processes
(processes swapped out but ready to run)
High kernel values can negatively affect system performance
Three of the most critical kernel resources are nproc, ninode and nfile. By
default these are controlled by formulas based on the value of maxusers.
Ideally we want to keep these settings within 25% of the peak observed usage.
Using sar -v (for example, sar -v 5 30) we can monitor the proc table and file
table; the inode table reporting is not accurate.
The output of sar -v will show the usage/kernel value for each area.
Example:
08:05:08  text-sz  ov  proc-sz   ov  inod-sz    ov  file-sz     ov
08:05:10      N/A   0  272/6420   0  3427/7668   0  5458/12139   0
What do these parameters control?
nfile
The number of open files for all processes running on the system. Though each
entry is relatively small, there is some kernel overhead in managing this
table. Additionally, each time a file is opened it will consume an entry in
nfile, even if the file is already open in another process. When nfile entries
are exhausted, a console and/or syslog error message will appear specifically
indicating "File table full". The value should usually be 10-25% greater than
the maximum number of open files observed during peak load.
ninode
The kernel parameter ninode only affects HFS filesystems; JFS (VxFS) allocates
its own inodes dynamically (vx_ninode) based on the amount of
available memory. The true inode count is only incremented by each unique HFS
file open, i.e. the initial open of a file; each subsequent open of that file
increments the file-sz column and decrements the available nfile value.
This variable is frequently oversized, and can impose a heavy toll on the
processor (especially machines with multiple CPUs). The best rule is not to
increase it unless console/syslog messages are received specifically
stipulating "Inode table is full"; otherwise the table will look almost or
completely full some time after boot. In most modern systems the only HFS file
system is /stand. You can verify this by viewing the /etc/fstab file. If
this is the case on your system, the value of ninode needs to be no larger than
the total number of files in the /stand filesystem. The default value of 476
is adequate. You can verify the inodes in use by running the command:
bdf -i
Filesystem kbytes used avail %used iused ifree %iuse Mounted on
/dev/vg00/lvol3 143360 59431 78687 43% 2835 20981 12% /
/dev/vg00/lvol1 111637 49575 50898 49% 64 17984 0% /stand
/dev/vg00/lvol8 2097152 1339747 710695 65% 93746 189350 33% /var
/dev/vg00/lvol7 946176 746242 187457 80% 28853 49983 37% /usr
/dev/vg00/lvol4 307200 9287 279346 3% 80 74476 0% /tmp
/dev/vg00/lvol6 1536000 594459 882855 40% 11906 235382 5% /opt
/dev/vg00/lvol5 409600 68726 319613 18% 116 85216 0% /home
The iused column will show the inodes in use per filesystem. As you can
see in this example, only 64 HFS inodes are actually in use. Also note that the
sum of the other inodes in use is far greater than what sar -v reports.
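If /stand is the only HFS filesystem, a quick way to see how many entries
ninode actually needs to cover is to count the entries under /stand (a rough
sanity check rather than an exact sizing formula):
find /stand | wc -l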
Unlike nfile, each time a file is opened it will consume only one entry in this
table. Excessive values can cause high CPU loads, and a network timeout
condition for a High Availability Cluster, most often at the start of a backup
routine. When this variable is large, the initial wait time to hash an entry is
quite short, so file opens can occur quickly at first. Since there is no
active accounting, the only method of determining what is in this table is a
serial search, which results in very expensive processing time. When the
processor 'walks' this table, little other activity is performed.
The inode count as expressed by sar only refers to the cache size for HFS
inodes, not the actual inode count.
If your system is all VxFS except for /stand, you are using 286 bytes of RAM
per inode for all non-HFS inodes allocated in the kernel.
This is a known problem and can in some cases cause severe cpu problems, as
well as sysmap fragmentation. This will show as a very high %wio or low idle
time. Values of %wio over 15% are considered a cpu bottleneck.
To determine the number of vxfs inodes allocated (these are not reported by
sar), run:
example:
echo vxfs_ninode/D | adb -k /stand/vmunix /dev/mem
vxfs_ninode:
vxfs_ninode: 64000
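Using the figure of 286 bytes per inode from above, the memory consumed by
these inodes can be estimated:
64000 inodes x 286 bytes = 18304000 bytes, or roughly 17.5 Mb of RAM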
nproc
This pertains to the number of processes system-wide. This is another variable
affected by indiscriminate setting of maxusers. It is most commonly referenced
when a ps -ef is run or when Glance/GPM and similar commands are initiated. The
value should usually be 10-25% greater than the maximum number of processes
observed under load to allow for unanticipated process growth.
For a complete overview of 11.X kernel parameters refer to :
http://docs.hp.com/hpux/onlinedocs/os/KCparams.OverviewAll.html
Disk I/O
Disk bottlenecks can be caused by a number of factors. Buffer cache usage,
cpu load and a high disk I/O load can all contribute to a bottleneck. After
determining the cpu and buffer cache load, check the disk I/O load.
To determine disk I/O performance, run:
sar -d 5 30
The output will look similar to :
device %busy avque r+w/s blks/s avwait avserv
c1t6d0 0.80 0.50 1 4 0.27 13.07
c4t0d0 0.60 0.50 1 4 0.26 8.60
%busy Portion of time device was busy servicing a request
avque Average number of requests outstanding for the device
r+w/s Number of data transfers per second (read and writes)
from and to the device
blks/s Number of bytes transferred (in 512-byte units)
from and to the device
avwait Average time (in milliseconds) that transfer requests
waited idly on queue for the device
avserv Average time (in milliseconds) to service each
transfer request (includes seek, rotational latency,
and data transfer times) for the device.
When the average wait (avwait) is greater than the average service time
(avserv), it indicates the disk could not keep up with the load during that
sample. This is considered a bottleneck.
The avwait is similar to the %wio returned by sar -u for the cpu.
If a bottleneck is identified, run:
strings /etc/lvmtab
to identify the volume group associated with the disks.
lvdisplay -v /dev/vgXX/lvolX
to identify the disks each logical volume resides on.
bdf
to see if this volume group's file systems are full (> 85%).
cat /etc/fstab
to determine the file system type associated with each lvol/mountpoint.
How can disk I/O be improved?
1. Reduce the volume of data on the disk to less than 90%.
2. Stripe the data across disks to improve I/O speed.
3. If you are using Online JFS, run fsadm -e to defragment the extents (see the
example after this list).
4. If you are using HFS filesystems, implement asynchronous writes by setting
the kernel parameter fs_async to 1, or consider converting to VxFS.
5. Reduce the size of the buffer cache (if %wcache is less than 90).
6. If you are using raw logical volumes, consider implementing asynchronous I/O.
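As a sketch of the Online JFS defragmentation step (the mount point /home is
only an example; reorganizing a mounted filesystem requires the optional
Online JFS product):
fsadm -F vxfs -E /home        reports extent fragmentation
fsadm -F vxfs -e /home        reorganizes (defragments) extents
fsadm -F vxfs -D /home        reports directory fragmentation
fsadm -F vxfs -d /home        reorganizes directories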
The difference between asynchronous I/O and synchronous I/O is that async does
not wait for confirmation of the write before moving on to the next task. This
increases disk performance at the expense of robustness.
Synchronous I/O waits for acknowledgement of the write (or of its failure)
before continuing. The write may have physically taken place or may still be
in the buffer cache, but in either case acknowledgement has been sent. In the
case of async, there is no waiting.
To implement asynchronous I/O on HP-UX:
* add the asyncdisk driver (Asynchronous Disk Pseudo Driver)
to the HP-UX kernel (using SAM),
* create the device file:
# mknod /dev/async c 101 0x00000#
where # (the final digit of the minor number) can be one of the following values:
0x000000 default
0x000001 enable immediate reporting
0x000002 flush the CPU cache after reads
0x000004 allow disks to timeout
0x000005 is a combination of 1 and 4
0x000007 is a combination of 1, 2 and 4
Note: Contact your database vendor or product vendor to determine the
correct minor number for your application.
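Once created, the device file can be checked with ll; the major number should
show as 101 and the minor number as whichever value was chosen:
ll /dev/async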
Patches
There are a number of OS performance issues that are resolved by current
patches. For the most up to date patches, contact the Hewlett-Packard Response
Center.