diaglogd running away - 90% cpu util; single bit memory error

Stuart Abramson_2
Honored Contributor

diaglogd running away - 90% cpu util; single bit memory error

It's running at NI = 10, which is a lower priority than everything else, but it's still abnormal. What should I do?

BTW, after poking around in the logs, I see that the STM logs show a single bit memory error. Could that have set this off? I'm going to call the HP Response Center about the single bit error.
9 REPLIES
Michael Steele_2
Honored Contributor

Re: diaglogd running away - 90% cpu util; single bit memory error

/sbin/init.d/diagnostic stop
/sbin/init.d/diagnostic start
Support Fatherhood - Stop Family Law
Stuart Abramson_2
Honored Contributor

Re: diaglogd running away - 90% cpu util; single bit memory error

Should we try to figure this out first, or should we just stop/start?

Stuart
Eugeny Brychkov
Honored Contributor

Re: diaglogd running away - 90% cpu util; single bit memory error

Right. Call HP and let them replace the DIMM. Their action depends on how many errors were logged: if there are 1-5 in total, don't worry about it; if there are 100 and they recur frequently, they should act. If this DIMM causes a double bit parity error, the machine will crash.
Diaglogd may be busy logging these events, so fix the memory first. If that doesn't help, look through the logs again to understand what diaglogd is doing. If you find nothing, try updating the Online Diags, GR and HWE patch bundles to at least Sep 2002.
Eugeny
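
The rule of thumb above can be written down as a small shell sketch. The function name sbe_triage and the middle "recheck" band are assumptions added for illustration; the post only gives the two ends of the range (1-5 errors in total versus 100 logged frequently), and the sketch does not model how often the errors recur.

```shell
# sbe_triage COUNT: hypothetical helper, not part of STM or any HP tool.
# COUNT is the total number of single bit errors read from the logs by hand.
sbe_triage() {
    count=$1
    if [ "$count" -le 5 ]; then
        echo "ignore"
    elif [ "$count" -ge 100 ]; then
        echo "call HP"
    else
        echo "recheck"    # assumption: the thread gives no rule for 6-99
    fi
}

sbe_triage 4
```

With the 4 errors reported later in this thread, the sketch lands in the "ignore" band.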
Michael Steele_2
Honored Contributor

Re: diaglogd running away - 90% cpu util; single bit memory error

Well gee, by all means let's figure it out. (* I guess there was no call to the response center. *)

Start with LOGTOOL, which will look at your I/O, then try lsof:

STM > TOOLS > UTILITY > RUN > LOGTOOL > FILE > VIEW > RAW SUMMARY.

Note the first and last dates of the transactions and calculate the difference. If the difference is short, like 4 hours, that is important to note. Now read down the report of hardware addresses and look at the integer numbers in parentheses. Anything over 150 in this 4 hour period should be called in to HP for replacement.

lsof -p pid (* diagmond *)
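
As a rough sketch of the arithmetic above: take the count in parentheses for one hardware address and the span between the first and last dates, both read off the raw summary by hand, and compare them with the 150-in-4-hours figure. The function name logtool_verdict and the "keep watching" result are assumptions for illustration; the sketch does not parse LOGTOOL output.

```shell
# logtool_verdict COUNT WINDOW_HOURS: hypothetical helper.
# COUNT is the number in parentheses for one hardware address;
# WINDOW_HOURS is the whole-hour span between first and last log dates.
logtool_verdict() {
    count=$1
    hours=$2
    if [ "$hours" -le 4 ] && [ "$count" -gt 150 ]; then
        echo "call HP"
    else
        echo "keep watching"    # assumption: no action is stated for this case
    fi
}

logtool_verdict 200 4
```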
Stuart Abramson_2
Honored Contributor

Re: diaglogd running away - 90% cpu util; single bit memory error

I did call the HP Response Center. Here is what they said:

The HP Response Center says the single bit errors are no problem. I only had 4 single bit errors, which is no cause for alarm. If we had 100 single bit errors, they would ask us to reboot and see if they went away.

The way memory works on the V2200, there is extra memory, like spare disk tracks, which can be varied on or off. During a reboot the system runs memory tests, and if it detects a bad memory area it can vary it offline and vary on a spare memory area in its place, and we would keep right on going. There is no need to replace DIMMs until they run out of spare memory areas.

What probably happened on the diaglogd running away is this:

1. diaglogd, the logger, gets called by diagmond, the monitor, when it needs to log something.
2. Our diagmond had gone away for some reason after calling diaglogd, and diaglogd got confused.
3. I just killed diaglogd and restarted all diagnostics (as suggested above):

/sbin/init.d/diagnostic start

and now diagmond is running again, and it calls diaglogd to log.
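
The situation described above (diaglogd still alive, diagmond gone) can be detected with a short check. This is only a sketch: the function name diag_needs_restart is made up, and it assumes a ps -e style listing in which the command name is the last column.

```shell
# diag_needs_restart: hypothetical helper.
# Reads a `ps -e` style listing on stdin and succeeds (exit 0) only when
# diaglogd is present but diagmond is not.
diag_needs_restart() {
    awk '
        $NF == "diaglogd" { logger = 1 }
        $NF == "diagmond" { monitor = 1 }
        END { exit (logger && !monitor) ? 0 : 1 }
    '
}

# Possible use as root (left commented out so nothing is restarted by accident):
# if ps -e | diag_needs_restart; then
#     /sbin/init.d/diagnostic stop
#     /sbin/init.d/diagnostic start
# fi
```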

Thanks for the help!

Stuart
John Poff
Honored Contributor

Re: diaglogd running away - 90% cpu util; single bit memory error

Hi,

We used to see lots of the single bit memory errors on our V-class boxes and we learned to just not worry about it. I think HP finally released a patch that stopped all the single bit errors from going to the syslog file.

JP
S.K. Chan
Honored Contributor

Re: diaglogd running away - 90% cpu util; single bit memory error

Adding to or echoing a bit of what Eugeny mentioned. From what I've experienced so far, the single bit memory errors that I've seen increase over time, so you may want to deal with it now rather than waiting for it to get worse. This means you've got to call HP and get the DIMM replaced.
Michael Steele_2
Honored Contributor

Re: diaglogd running away - 90% cpu util; single bit memory error

Regarding the only 4 errors and the response center's comment that they'd worry if it were 100:

This is the procedure that I've run into for I/O errors, but not for single bit errors.

Perhaps you should try another call to the response center, since you're apt to get a different answer depending on who you talk to, especially if it's a front line engineer. You're more likely to get a backline engineer on the second call, since the front line is obligated to send the case back and escalate after 20 or 30 minutes.
John Poff
Honored Contributor

Re: diaglogd running away - 90% cpu util; single bit memory error

Hi again,

We had three V-class boxes here for several years, and we got them early in the life cycle, when we had to go through all the problems with replacing the DIMMs and the memory carriers to get the memory problems solved, so I'm real familiar with the single-bit memory errors in the V-classes. I've made several calls to HP about the single-bit memory errors in the V's, and they say that as long as it is just a few errors, don't worry about it. The memory is designed to trap and work around the single-bit errors. If enough of them show up, it will quit using that part of memory. My experience with the errors matches what HP is telling him. If you just have a few, don't worry about it. If you have a hundred or more, you are about to lose some memory. Now, you can hop on the phone and replace the bad DIMM whenever you see a single-bit error, but it really isn't worth the trouble.

JP