
Re: a nice enigma!

 
SOLVED
Tim D Fulford
Honored Contributor

a nice enigma!

Hi

I have two computers which are configured "exactly" the same (you know what I mean). However, when I do "top" I sometimes see that one is using lots of "nice" CPU & virtually no "user" cpu & the other is reversed, namely lots of "user" & no "nice"! It is not always consistent which only adds to the puzzle.

My first thought was that my processes were suffering from priority degradation, which will only get worse with time. However, I thought "nice" & HPUX priorities were separate entities (I could be wrong here).

o Can anyone set me straight on these issues? Explain the issues at hand (in simple, low-syllable-count terms that management have a chance of understanding)?
o Is there a way of fixing the priorities of these processes (say at 154 or something) or of stopping them degrading with time? (I cannot use rtprio or rtsched to give them a Real Time or POSIX priority [<127] as this will/may cause ServiceGuard failover at the busy periods!!! Trust me, I HAVE seen this before.)
o I do give all my advisers points, the more advice, the more points (check out my stats)

Any takers?

Tim
-
20 REPLIES
Mark Greene_1
Honored Contributor

Re: a nice enigma!

Assuming that you mean the hardware is identical, are they running the same applications/databases? What software is on there? Have you checked that the kernel parameters are the same?

HTH
mark
the future will be a lot like now, only later
Jeff Schussele
Honored Contributor

Re: a nice enigma!

Hi Tim,

I believe the nice column is only showing values that have been "nice-altered" or deviate up or down from the default (20).
I could be wrong here.
Could be as simple as a good deal of users have started procs in the background which imposes a nice hit of 5, I think.
Most everything runs @ 20 by default except some of the logging daemons.
What do the actual process lines show - do you have a lot of procs that aren't 20?

I usually only pay attention to the user & system columns anyway - they tell the story.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Tim D Fulford
Honored Contributor

Re: a nice enigma!

Thanks for the replies.

1 - The computers are built identically, filesystems, kernels, software, patches, storage, network, cards in slots with same instance numbers the whole shooting match. There are occasionally "slight" differences but this is normally administrator error.
2 - There are no "users" as such. It runs an application that deals with phone call routing/process/handling, the only people who fiddle are administrators.
3 - By way of an example, there is a "daemon" process that is called "pmd". On one box it had a nice value of 20 & on the other 24. These processes SHOULD (I will not discount admin error) start automatically using various "identical" configuration files.
4 - The only major variation is that the "services/daemon" (pmd etc) are/have/do get restarted at different times, so one machine/set of daemons could have been running for 7 days un-touched, whereas the other may only have run for 1 day.
5 - There is a database (Informix) but this runs on its own server/computer & is connected to via the network.

Basically I'm 70% sure there is some priority degradation going on, BUT I thought this had nothing to do with the nice value, as I believe/understand they are separate entities! If I'm right this implies that someone may be using "renice" on running processes, in which case I need to "re-educate" them urgently. If I'm wrong, then I need to explain why processes seem to have a nice value of > 20, and hopefully fix it (if possible).

Tim

Any more suggestions?
-
John Payne_2
Honored Contributor

Re: a nice enigma!

The system will nice a process's priority based on how long it has run, whether it has been waiting for a while, whether it has been waiting on IO, etc. It is difficult to say what could be causing the nice'ing of the process, but this could certainly be it.

Also, you could have a processor failing, or some other type of bottleneck on the one system that you have not yet seen. This could cause contention for the processes also.

It's just a thought.

Hope it helps

John
Spoon!!!!
Mark Greene_1
Honored Contributor

Re: a nice enigma!

>>4 - The only major variation is that the "services/daemon" (pmd etc) are/have/do get restarted at different times, so one machine/set of daemons could have been running for 7 days un-touched, whereas the other may only have run for 1 day. <<

This, most likely, is your culprit. Remember that the nice values are not intrinsic measurements of any one thing, they are relative values of processing time compared to the other processes on the system. Unless you are seeing other symptoms like swap issues or i/o binding, I wouldn't worry too much about it.

HTH
Mark
the future will be a lot like now, only later
Jeff Schussele
Honored Contributor

Re: a nice enigma!

Hi (again) Tim,

Well if an admin restarts the daemon from the command line & in the background using "&" it would be at a nice of 24.
I was incorrect: "&" imposes a nice hit of 4, not 5.
That would be my educated guess & the solution would be to "instruct" the admins to not start/restart it using the executable but to run the startup script in /sbin/init.d....hopefully it has one.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Tim D Fulford
Honored Contributor

Re: a nice enigma!

Hi again

Many thanks for the responses... Here is some more info attached in the nice.txt file. I have supplied 3 things for each node
o top, showing how the CPU is split
o glance, showing pmd (The "Daddy" daemon proc)
o glance, showing pmd (The "Daddy" daemon proc) cumulatively

From your answers there seem to be two possibilities:
1 - INITIATION METHOD; node 1 was started as a background process and node 2 was not. As the nice value is inherited (I believe) this will explain the difference.
2 - PRI & NICE DEGRADE; There is some priority degradation, which also degrades the nice value (which I did not think happened, but we live and learn).

Unfortunately there is evidence for either, as
o The "niced" node has had pmd running for over 7 weeks and the other has only been running for two weeks.
o A quick check by myself shows that all the procs I checked do indeed have a nice value of 24. These are child procs of pmd

I will be digging a bit deeper....

Regards

Tim
-
Stefan Farrelly
Honored Contributor

Re: a nice enigma!


Hi Tim,

No 2 servers are identical. Just to do a quick check, is the output from

swlist -l fileset | wc -l

the same on both servers?

Cheers,

Stefan
Im from Palmerston North, New Zealand, but somehow ended up in London...
Tim D Fulford
Honored Contributor

Re: a nice enigma!

I did check the patch levels, I was going to bet the farm on them being the same, but.... they are not. Our quality standards are slipping.

I also awarded you 3 points, in retrospect this should be more (7), sorry... put a dummy reply in and I'll give you 4 more...

Tim
-
Stefan Farrelly
Honored Contributor

Re: a nice enigma!


Hi Tim,

Aha, so they do have different numbers of filesets (patches+software) installed. The only way to ensure the software install is identical is to start by ensuring the same number of installed filesets. Just curious - by how many filesets did they differ?
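
A count only says that the nodes differ, not where. As a rough sketch, the two fileset lists could also be compared line by line. This assumes the output of "swlist -l fileset" has been saved to a file on each node; the file names node1.txt and node2.txt and the function name list_diff are made up for illustration.

```shell
# list_diff FILE1 FILE2
# Print the lines that appear in only one of two saved listings.
# (list_diff and the file names are hypothetical, not part of swlist.)
list_diff() {
    tmp_a="${TMPDIR:-/tmp}/list_diff.$$.a"
    tmp_b="${TMPDIR:-/tmp}/list_diff.$$.b"
    # comm needs sorted input
    sort "$1" > "$tmp_a"
    sort "$2" > "$tmp_b"
    echo "only in $1:"
    comm -23 "$tmp_a" "$tmp_b"   # lines unique to the first file
    echo "only in $2:"
    comm -13 "$tmp_a" "$tmp_b"   # lines unique to the second file
    rm -f "$tmp_a" "$tmp_b"
}

# Example, after running "swlist -l fileset > nodeN.txt" on each node:
# list_diff node1.txt node2.txt
```

The same function should work for any pair of saved listings, so it could be reused for other per-node output as well.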
Im from Palmerston North, New Zealand, but somehow ended up in London...
Tim D Fulford
Honored Contributor

Re: a nice enigma!

sn1c --> 1361
sn2b --> 1734

I have checked "patches" which is probably more important and there are many a difference. I'm not wholly convinced of the patch stuff, but I will dig a bit deeper.

On a slightly different tack, I looked at another cluster running similar (but different version) software and found that despite the fact it had been running for some 7-8 weeks the priorities are 20.

My current favorite is the background process as ALL the processes that are fathered by pmd have a nice value of 24 even the ones with a priority of 0 (zero)...

Any more thoughts, any one, generosity is my middle name....

Tim
-
Mladen Despic
Honored Contributor

Re: a nice enigma!

Tim,

From your 'top' and 'glance' samples, pmd is only active on the 2nd node. It may be interesting to know which processes on the 1st node are consuming CPU "nicely".

Also, if the patches on the two nodes are different, that *may* be the cause. Have you also checked with 'swlist -l fileset -a state' if all patches are configured?

Mladen
Mark van Hassel
Respected Contributor
Solution

Re: a nice enigma!

Hi Tim,

Nice values don't change over time. They are set when a process starts or, indeed, inherited from the parent.
The thing that does change is priority (see top). When (Time Shared) processes run, they lose priority, and they regain priority as they wait their turn to run. A process's nice value is used as a factor in calculating how fast a process regains priority.
Priority queues:
-32 to -1 : Real time (POSIX)
0 to 127 : HPUX real time (rtprio)
128 to 251 : Time Share procs
252 to 255 : Swapped processes
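
As a rough illustration of the ranges above, here is a small shell function that maps a priority number (the PRI column in top or ps) to its queue. The function name classify_pri is made up, and the boundaries are simply the ones listed in this post.

```shell
# classify_pri PRI
# Name the priority queue a given priority value falls into,
# using the ranges quoted in this post (hypothetical helper).
classify_pri() {
    pri=$1
    if [ "$pri" -ge -32 ] && [ "$pri" -le -1 ]; then
        echo "real time (POSIX)"
    elif [ "$pri" -ge 0 ] && [ "$pri" -le 127 ]; then
        echo "HPUX real time (rtprio)"
    elif [ "$pri" -ge 128 ] && [ "$pri" -le 251 ]; then
        echo "time share"
    elif [ "$pri" -ge 252 ] && [ "$pri" -le 255 ]; then
        echo "swapped"
    else
        echo "unknown"
    fi
}

# Example: classify_pri 154
```

A priority of 154, as mentioned in the original question, lands in the time share range, which is where priorities move up and down as described.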

HtH,

Mark
The surest sign that life exists elsewhere in the universe is that none of it has tried to contact us
Mladen Despic
Honored Contributor

Re: a nice enigma!

Tim,

You can also check the differences between the files /var/adm/sw/swagent.log on the two nodes.
Another useful check may be the output from 'kmtune'. Any differences may point you further in terms of how the two nodes are different.

As for the CPU utilization, can you list top 2 or 3 processes that consume most of the CPU on each system?

Mladen
Paula J Frazer-Campbell
Honored Contributor

Re: a nice enigma!

Tim

For usr/sys and nice to match on two identical machines running the same jobs, both would have to have "exactly" the same processes running, at the same point of execution, at the same time.

Even this is unlikely, as the hardware throughput of devices (CPU, memory, etc.), whilst rated the same, is not.

So even if you have processes "NICED" exactly the same on both machines, the value of nice from top or glance will never match.

Paula
If you can spell SysAdmin then you is one - anon
Tim D Fulford
Honored Contributor

Re: a nice enigma!

pmd is running on both nodes; if not (believe me) we would be in deeeeep do-do's. I do appreciate that pmd does not run continuously, it does very little (spawns, re-spawns, starts, halts & monitors its children). It may well not show much in the first Glance (immediate) but you will see that it has consumed some 91.3 seconds of CPU since 11 May.

As far as the configured state of the software everything is "configured", there are a few items in the "installed" state, but I can explain these, nothing is "partial" or "corrupt"

regards

Tim
-
Tim D Fulford
Honored Contributor

Re: a nice enigma!

Mladen - I do not need to do a top to tell you that it is fsdlexe | errord, they ALWAYS are top-of-the-pops, if not we're doing no work.

I have however done the following
# ps -el | awk '$8=="24"{print $0}'

This shows that ALL the processes started by pmd have a nice value of 24. As I believe nice is an inherited value, I think this is damning evidence that someone either re-niced pmd or started the application as a background process.
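
To see which parent each re-niced process hangs off, the same ps output can be filtered a little further. A minimal sketch, assuming (as in the awk command above) that NI is the 8th column of "ps -el", with PID and PPID in columns 4 and 5 and the command name last; the function name off_default_nice is made up.

```shell
# off_default_nice [DEFAULT]
# Read "ps -el" style output on stdin and print "PID PPID NI COMMAND"
# for every process whose nice value is not DEFAULT (20 if omitted).
# Column positions are an assumption based on the ps -el layout
# used in this thread; the function itself is hypothetical.
off_default_nice() {
    awk -v base="${1:-20}" '
        NR > 1 && $8 != base { print $4, $5, $8, $NF }
    '
}

# Example:
# ps -el | off_default_nice
```

If every line printed traces back to the same parent PID, that would fit the inheritance theory.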

Paula - I'm not sure what you are saying.
a) No two machines are alike therefore you would not expect to see usr/nice the same. I partially agree, but I would not expect to see the pattern in the nice.txt file which is totally reversed.
b) The machines are different, so the nice values will be different. I disagree, I would expect to see a nice value of 20 across the board, it is the same software/binaries (with some minor exceptions)

I'm still figuring that someone started the application in the background or re-niced pmd.

Regards

Tim
-
Paula J Frazer-Campbell
Honored Contributor

Re: a nice enigma!

Tim

How about nicing the process to what it should be and monitoring it?


I use a script that picks up certain logins and nices them down, as their routines can cause load problems on my main server.

I am sure that you can modify it to monitor the nice value of your process.

--------------------------------------------
#!/bin/ksh
# Automatically nice down the ftpbbs universe routines
######################################################
# PJFC 2001
######################################################
# Get parent pids
######################################################
p=`who -u | grep ftpbbs | awk '{print $7, $15 }'`
######################################################
# Separate each pid to a string
######################################################
a=`echo $p | awk '{print $1}'`
b=`echo $p | awk '{print $2}'`
######################################################
# Pick up pid of universe process and nice value
######################################################
y=`ps -efl | grep $a | grep -v grep | grep -v sh | grep root | grep uv | awk '{print $4}'` # PID
z=`ps -efl | grep $a | grep -v grep | grep -v sh | grep root | grep uv | awk '{print $8}'` # Nice value
######################################################
# Check nice value (quoted so an empty result does not break the test)
if [ "$z" = 20 ]
######################################################
# If nice value = 20 then a restart has occurred so nice it down
######################################################
then
renice -n 19 $y
fi
######################################################
# Do it all again for other ftpbbs login
######################################################
w=`ps -efl | grep $b | grep -v grep | grep -v sh | grep root | grep uv | awk '{print $4}'` # PID
x=`ps -efl | grep $b | grep -v grep | grep -v sh | grep root | grep uv | awk '{print $8}'` # Nice value
######################################################
# Check nice value (quoted so an empty result does not break the test)
if [ "$x" = 20 ]
######################################################
# If nice value = 20 then a restart has occurred so nice it down
######################################################
then
renice -n 19 $w
fi
echo "Renice ran "
# Exit 0 so callers (cron etc.) do not see a normal run as a failure
exit 0

---------------------------------------------

HTH

Paula
If you can spell SysAdmin then you is one - anon
John Bolene
Honored Contributor

Re: a nice enigma!

Processes do not just get niced for no reason.

They have to be niced when they start by the command line or if someone changes them after they have started running.

Nicing is a people thing, HPUX does not just nice processes because it feels like it.
It is always a good day when you are launching rockets! http://tripolioklahoma.org, Mostly Missiles http://mostlymissiles.com
Tim D Fulford
Honored Contributor

Re: a nice enigma!

Paula & John

Many thanks for the scripts. I was going to manually "renice" pmd then do a "rolling restart" of the app, but the script I can use to "renice" everything with no restart.

It is also good to know that [unlike HPUX timeshare priority] "nice" does not degrade with time. So my original understanding about them being separate was correct.

Many thanks for the input.

Tim
-