
Need help from Olympians!

 
paul courry
Honored Contributor

And hopefully it translates into AIX Unix!

As always liberal points awarded!

Here is the deal, the app that runs the warehouse, Exceed Fullfill 4000 (specifically forte, developed by Sun), crashes up to three times a day and everything STOPS until the app can be restarted. We run 24/7.

Notice I said nothing about the operating system. It runs fine.

Now I know I should be placing a call with IBM for assistance, but the client (the guilty shall remain nameless) needs to be convinced that their mission-critical application, which fulfills more than $260M in orders, needs both a hardware maintenance contract AND a software maintenance contract. So I'm winging it.

Hopefully you guys will be JATO units to my set of wings and this turkey can be made to fly.

Sys config is 2 P series RS 6000 boxes (8 GB RAM each) in a cluster config with a shared disk array. OS is 4.3.1. One server strictly runs the Oracle app (8.1.7.3); the other runs the warehouse app (Exceed Fullfill 4000 ver. 14.3.0.1 (specifically forte, developed by Sun)). The boxes are configured so that if one goes down the other takes over, much like MCServiceGuard.

We suspect a memory problem, but have no proof. Start walking me through the basics, one step at a time. Since it crashes frequently I'll have no problem collecting enough log files to look through.

My private email is PAUL@COURRY.COM

Tell me what data I failed to provide and I'll post it. This is my full time job until it is fixed.

You Olympians are my best hope!
James R. Ferguson
Acclaimed Contributor
Solution

Re: Need help from Olympians!

Hello Paul!

Welcome back, my friend!

OK, I know a little bit and suggest you start by looking at the error log:

# errpt -a

Have a look too at the man pages for the 'diag' command. It's kind of like 'stm'.
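
For what it's worth, the error log tags each entry with a class, so it is possible to look at just the software-class entries instead of wading through everything. A minimal sketch, assuming a POSIX shell and awk; `errpt_class` is a hypothetical helper name, and `/tmp/errpt.out` is only an example path:

```shell
#!/bin/sh
# Sketch: keep only one error class from saved "errpt" summary output.
# errpt_class is a hypothetical helper name. The class is the 4th column
# of the summary listing (H = hardware, S = software).
errpt_class() {
    # NR > 1 skips the header line; $4 is the class column.
    awk -v c="$1" 'NR > 1 && $4 == c'
}

# On the AIX box itself, errpt can do the same filtering directly:
#   errpt -d S        # summary of software-class entries
#   errpt -a -d S     # the same entries in detail
#
# Or, working from a saved listing (example path only):
#   errpt > /tmp/errpt.out
#   errpt_class S < /tmp/errpt.out
```

The helper is only useful when the listing has already been saved to a file; on the live system the `-d` flag is the simpler route.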

Regards!

...JRF...
A. Clay Stephenson
Acclaimed Contributor

Re: Need help from Olympians!

Hi Paul:

The bad news is that it has been years since I've played with AIX, but let's make sure that we are on the same page. When you say that the application crashes and everything stops, I take that to mean that all work-related activity associated with the application ceases but that the OS continues to run and that you can still log in. This is thus not an OS problem or a hardware problem. I rather doubt you have a memory problem (meaning bad physical memory), but you may have a memory problem in that critical data has been stepped on.

If that is the case, then you should see some sort of core file (unless core filesize has been limited to zero); by far your best option is to locate that core file and get a stack trace with a debugger. That should narrow the scope of the problem greatly.

I also assume that you can do simple queries on the database even though the app is dead.
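
A minimal sketch of the core file hunt, assuming a POSIX shell; `find_cores` is a hypothetical helper name and every path shown is a placeholder, not something taken from this system:

```shell
#!/bin/sh
# Sketch only: look for core files left behind by a crashed process.
# find_cores is a hypothetical helper name; pass it the directory the
# application runs from (or any directory tree you want searched).
find_cores() {
    # Print every regular file named "core" under the given directory.
    find "$1" -type f -name core -print 2>/dev/null
}

# Check the core file size limit for the account that starts the app;
# a value of 0 means no core file will be written at all.
#   ulimit -c

# Example search (placeholder path):
#   find_cores /path/to/app

# On AIX the stack trace would then come from dbx, for example:
#   dbx /path/to/executable /path/to/core
#   (dbx) where
```

The limit has to be checked in the environment the application is actually started from, since a limit of 0 there would explain why no core file ever shows up.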



If it ain't broke, I can fix that.
John Poff
Honored Contributor

Re: Need help from Olympians!

Hi,

I'm not an Olympian, just the court jester, but I'll jump in too. *>8-)

Did the problem just start happening recently, or has it been going on for a while? Loaded any application updates recently? Have you seen any indications of a memory leak on the Exceed side? I'd suggest running some 'ps -el' commands and keeping the output in a log file to see if any of the processes are turning into memory hogs and crashing.
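
A rough sketch of that logging idea, assuming a POSIX shell and a ps that accepts the POSIX `-o` option (used here instead of `-el` because its columns are easier to parse). The function names, the log path, and the 60-second interval are all illustrative:

```shell
#!/bin/sh
# snapshot_ps: append one timestamped line per process to a log file.
# Line format: TIMESTAMP PID VSZ COMMAND
snapshot_ps() {
    log=$1
    ts=$(date '+%Y-%m-%dT%H:%M:%S')
    # "pid=" etc. ask ps for the column with an empty header.
    ps -e -o pid= -o vsz= -o comm= | while read -r pid vsz comm; do
        echo "$ts $pid $vsz $comm"
    done >> "$log"
}

# vsz_growth: from such a log, list processes whose virtual size grew
# between their first and last sample, largest growth first.
# Output format: GROWTH PID COMMAND
vsz_growth() {
    awk '{
        key = $2 " " $4
        if (!(key in first)) first[key] = $3 + 0
        last[key] = $3 + 0
    }
    END {
        for (k in first)
            if (last[k] > first[k]) print last[k] - first[k], k
    }' "$1" | sort -rn
}

# Example use (path and interval are assumptions):
#   while :; do snapshot_ps /tmp/ps_history.log; sleep 60; done &
#   vsz_growth /tmp/ps_history.log | head
```

A process that keeps climbing to the top of the `vsz_growth` list between crashes would be a reasonable first suspect for a leak.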

Do you have any of the IBM performance tools for monitoring the systems? I haven't used AIX in a while, but I remember using Performance Toolbox (or something like that).

JP


paul courry
Honored Contributor

Re: Need help from Olympians!

Hi Jim! I'm in Germany on an extended contract with "a guilty client who shall remain nameless".

Did the errpt -a, will dig into it.

Ooops! no man pages (grumble, grumble), will check other machines for copy.



Hey Clay! You assume correct. App dies, OS on BOTH boxes okay, no problem logging in. I am going to check the physical memory error log for logged problems, but by mem probs I meant leaks, bad coding, failure of OS to provide memory, bad OS config (or cluster config) that is strangling app, etc., etc.

I don't know core from apples, Clay, but I will by noon tomorrow. I will check on the database. I will bet the Oracle server is running just fine though.
paul courry
Honored Contributor

Re: Need help from Olympians!

John,

I'll start doing those in the morning.

Don't know if we have any performance monitoring tools, got any idea which directory they'd be stored in? (God, if you give me a performance monitoring tool THIS ONE TIME I promise to go to church every Sunday, stop cheating on my taxes and be nice to my dog)

Paul
John Poff
Honored Contributor

Re: Need help from Olympians!

Paul,

I'll poke around here and see if I can figure out what the performance tools were and where they might live.

You mentioned the man pages. I remember that in AIX you have to install the man pages separately as they don't get installed in a default installation.

JP
James R. Ferguson
Acclaimed Contributor

Re: Need help from Olympians!

Hi (again) Paul:

The 'errpt' log is going to be hardware oriented, and upon rereading this post I sense it's probably not that, too.

As far as documentation goes, you might browse here:

http://publib.boulder.ibm.com/cgi-bin/ds_form

The 'man' pages can be found by exploding the Reference Documentation tab. There under too, is a Problem Solving Guide tree-base.

Regards!

...JRF...
John Poff
Honored Contributor

Re: Need help from Olympians!

Paul,

Using the link provided by James (thanks!) I went out and rattled my memory for the Performance Toolbox stuff. Here is a URL and some notes from the web site about the Performance Toolbox. Maybe you will be living right and find that they have this software installed on the box. Good luck!

JP



http://publib.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixbman/prftungd/2365c54.htm


Using the Performance Diagnostic Tool
The Performance Diagnostic Tool (PDT) is a tool available in operating system version 4. PDT collects configuration and performance information and attempts to identify potential problems, both current and future.

PDT is an optionally installable component of the Base Operating System. Its name is bos.perf.diag_tool. After PDT has been installed, it must be activated with the /usr/sbin/perf/diag_tool/pdt_config command. This causes appropriate entries to be made in the crontab file, which causes PDT to run periodically, recording data and looking for new trends.

In assessing the configuration and the historical record of performance measurements, PDT attempts to identify:

Resource imbalances: asymmetrical aspects of configuration or device utilization
Usage trends: changes in usage levels that will lead to saturation
New consumers of resources: expensive processes that have not been observed previously
Inappropriate system parameter values: settings that may cause problems
Errors: hardware or software problems that may lead to performance problems





Wodisch
Honored Contributor

Re: Need help from Olympians!

Hi Paul,

even though I am only an "in-between" (jesters and olympians), why not simply install "MeasureWare for AIX", as it is available for AIX4.3, works fine, starts with the 60-days instant-on license, and all you have to do is to configure the "parm" file to contain the "application" definitions. Put the list of those "fullfil" processes into one app, "oracle" processes into another one, restart MWA, and let it run.
Then AFTER the crash export/extract the data (or use PerfView, maybe instant-on license, too) to view the data.
That should get you going pretty soon (like after the next crash).

FWIW,
Wodisch
Reinhard Burger
Frequent Advisor

Re: Need help from Olympians!

Hi Paul

I'm totally not an Olympian, and it's possible that all the specialists may be very amused by what I'm telling you. But if I do not try, I do not learn.
Maybe some problems I had with an ORACLE db showing the same symptom will give you a small idea.
What I had was somewhat similar. We were starting a job which was supposed to make a lot of changes.
This job crashed after running for 2 hours and the application died. ORACLE was still available.
The reason was: the changes made by the job were filling up the rollback segments, and as there was no commit the job failed with an ORACLE error message which was not shown in the application. As the application was not able to handle this error, it just died, producing core dumps.
To fix it we added some more space to the rollback segments and reconfigured them to allow dynamic growing and shrinking.
The default settings that had been chosen during db creation, when the application was installed, were not able to handle a large amount of changes. Maybe it's an idea of where to look, but maybe it's nonsense I'm telling you. In the latter case I apologize for wasting your time.

Reinhard
keep it simple
H.Merijn Brand (procura
Honored Contributor

Re: Need help from Olympians!

Any chance of upgrading to the latest ML? [For the non-AIX folks: ML is maintenance level], if only to rule out any OS issues.

Latest for AIX 4.3 is AIX 4.3.3.0 4330-10_AIX_ML


Since AIX is already at 5.1, chances that any 4.3 problems are solved (by IBM) are close to negative. And knowing IBM's arrogance, they'll just refuse to cooperate until you've installed all the patches to above ML, or ask you 6 digit amounts for useless help.
Enjoy, Have FUN! H.Merijn
paul courry
Honored Contributor

Re: Need help from Olympians!

Jim, please send me your email address to

PAUL@COURRY.COM

I have a few questions for you.....

Paul
paul courry
Honored Contributor

Re: Need help from Olympians!

Procura,

'Any chance in upgrading to the latest ML?'

Not a hope in Hell......

This is a production system. Nuff' said.
benoit Bruckert
Honored Contributor

Re: Need help from Olympians!

Hi Paul,
If the OS is always running fine after the crashes, then you should look at the app side!
That is, you should ask the warehouse app supplier for support, find out whether there is a log somewhere, and so on...
As the OS itself does not crash, you cannot get any information from a dump or from other OS-level sources, except syslog (or errpt on AIX), and then only if the app is well written!
hth
Benoit
A badly thought-out application ends up as an overcomplicated contraption (GHG)
Victor BERRIDGE
Honored Contributor

Re: Need help from Olympians!

Hi Paul,
I'm no Olympian either...
As you suggested it may be a memory problem, what is the size of your swap?
I have no 4.3.1 systems left, but I'll check again...
I migrated my last old OS last July (4 SP2 nodes, and changed the PSSP level...)
To see the swap usage:
lsps -a

all the best
Victor
nancy rippey
Trusted Contributor

Re: Need help from Olympians!

AIX has a monitoring tool called 'monitor'. It can be downloaded from
http://www.transarc.ibm.com/Library/whitepapers/tg/node67.html

Here is a sample screen print from the tool
# monitor -top

AIX System monitor v2.1.7PRE 24sep1999: \ Wed Mar 22 16:19:10 2000
Uptime: 29 days, 03:00 Users: 2 of 2 active 2 remote 00:08 sleep time
CPU: User 1.1% Sys 1.0% Wait 0.0% Idle 97.9% Refresh: 10.00 s
0% 25% 50% 75% 100%


Runnable (Swap-in) processes 0.00 (0.00) load average: 0.46, 1.04, 0.93

Memory Real Virtual Paging (4kB) Process events File/TTY-IO
free 22 MB 218 MB 0.0 pgfaults 36 pswitch 0 iget
procs 90 MB 37 MB 0.0 pgin 140 syscall 16 namei
files 14 MB 0.0 pgout 7 read 0 dirblk
total 128 MB 256 MB 0.0 pgsin 0 write 19408 readch
IO (kB/s) read write busy% 0.0 pgsout 0 fork 140 writech
hdisk0 0.0 0.0 0 0 exec 0 ttyrawch
hdisk1 0.0 0.0 0 Client Server NFS/s 0 rcvint 0 ttycanch
hdisk2 0.0 0.0 0 0.0 0.0 calls 0 xmtint 140 ttyoutch
hdisk3 0.0 0.0 0 0.0 0.0 retry 0 mdmint
hdisk4 0.0 0.0 0 0.0 0.0 getattr
hdisk5 0.0 0.0 0 0.0 0.0 lookup Netw read write kB/s
cd0 0.0 0.0 0 0.0 0.0 read lo0 0.0 0.0
0.0 0.0 write tr0 0.7 0.2
0.0 0.0 other
  PID USER   PRI NICE  SIZE   RES STAT     TIME      CPU% COMMAND
  516 root   127   21  264k  240k run  16+03:55 97.9/55.5 Kernel (wait)
 1032 root    37   21  320k  276k slp   4:04:53  0.6/ 0.6 Kernel (gil)
34448 fnsw    60    0  566k  668k Fslp     0:00  0.4/ 0.7 monitor
32870 root    60    0  666k  768k Frun     0:00  0.3/ 0.6 monitor
10862 oracle  60    0 8678k 4172k slp      3:38  0.2/ 0.0 oracle
15480 oracle  60    0 8762k 4268k slp      6:58  0.0/ 0.0 oracle
15738 oracle  60    0 8706k 4268k slp      6:57  0.0/ 0.0 oracle
 2156 root    60    0 1375k  488k slp     56:53  0.0/ 0.1 dtgreet
 3374 root    60    0 4651k  548k slp     27:19  0.0/ 0.1 X
    0 root    16   21  268k  228k slp     20:12  0.0/ 0.0 Kernel (swapper)
 4406 root    60    0  349k  264k slp     20:07  0.0/ 0.0 syncd
12904 root    60    0 2531k  480k slp     15:04  0.0/ 0.0 i4lmd
15222 oracle  60    0 8698k 4268k slp      6:57  0.0/ 0.0 oracle
 5498 root    60    0  577k  336k slp      6:48  0.0/ 0.0 routed


nrip
paul courry
Honored Contributor

Re: Need help from Olympians!

Whoops!

We are on

more aix_release.level
4.3.3.0

So it looks current.
paul courry
Honored Contributor

Re: Need help from Olympians!

Victor

dexinas5:root> lsps -a
Page Space  Physical Volume   Volume Group    Size   %Used  Active  Auto  Type
paging00    hdisk0            rootvg        7680MB       1     yes   yes    lv
hd6         hdisk0            rootvg         512MB       1     yes   yes    lv


With 8 GB of RAM and only 1 app, everything is being held in memory, with 1% of swap in use.
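
Since a single reading taken while things are healthy says little about the moment of the crash, the same check could be repeated on a schedule. A minimal sketch, assuming a POSIX shell and awk, and assuming the `lsps -a` layout shown above, where %Used is the fifth whitespace-separated field on each data line; `paging_over` and the 70 percent threshold are illustrative:

```shell
#!/bin/sh
# paging_over: read "lsps -a" output on stdin and print every paging
# space whose %Used is at or above the given threshold.
# Hypothetical helper; relies on %Used being field 5 of each data line.
paging_over() {
    awk -v limit="$1" 'NR > 1 && $5 + 0 >= limit + 0 { print $1, $5 "%" }'
}

# Example use (the threshold is an assumption):
#   lsps -a | paging_over 70
```

No output means no paging space has reached the threshold.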
Victor BERRIDGE
Honored Contributor

Re: Need help from Olympians!

In AIX 4.3.3 I know 2 commands: topas and nmon4.
They should help a lot.
Victor BERRIDGE
Honored Contributor

Re: Need help from Olympians!

Hi again Paul,
Have you tried nmon4 and topas yet?
Now thinking about it when you say crash, does that mean failover?
In which case have a look at the HACMP logs: /tmp/hacmp*
paul courry
Honored Contributor

Re: Need help from Olympians!

Finally got TOPAS up and running in batch mode.

MONITOR required a compiler we were apparently not set up with and the GNU version kept erroring out.

NMON did not provide detailed enough info.

So we ARE collecting data at 1-second intervals and can figure out which process died and trace it back through its final seconds.
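
For anyone wanting to do the same trace-back, here is a minimal sketch of pulling the last few seconds before a crash out of a sample log. It assumes a POSIX shell and awk, and it assumes (hypothetically, since the actual log layout is not shown here) that each sample is a single line beginning with an HH:MM:SS timestamp. `last_seconds` is an illustrative name:

```shell
#!/bin/sh
# last_seconds FILE HH:MM:SS N
# Print the lines of FILE whose leading HH:MM:SS timestamp falls within
# the N seconds up to and including the given time.
# Assumes one sample per line; does not handle a window that spans midnight.
last_seconds() {
    awk -v t="$2" -v n="$3" '
    # Convert HH:MM:SS to seconds since midnight.
    function secs(s,   p) { split(s, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
    BEGIN { end = secs(t) }
    { s = secs($1); if (s <= end && s > end - n) print }
    ' "$1"
}

# Example use (file name and crash time are placeholders):
#   last_seconds /tmp/samples.log 14:32:07 30
```

The crash time itself would have to come from somewhere else, such as the application's own log or the time the users reported the outage.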

Thanks!