Re: The end of a long rope...

Allen Brand · ‎01-10-2007

OSF1 section9 V5.1 2650 alpha

So I'm using MRTG in conjunction with rrdtool, running on the above AlphaPC. 100% of the calls MRTG/rrdtool tracks are SNMP, and I am monitoring ~25 machines--a mix of Alphas, HP & Dell. I've daemonized the MRTG session to run only one instance, and I use a .cfg file full of 'Include:' statements to load the .cfg files for each machine monitored. Each server has disk space (on several different drives/mountpoints), CPU load, User/Processes, and Memory usage being mapped, and most of them work fine...

Except for memory statistics for the Alphas.

Of the seven Alphas being monitored, two at least show data (though it never changes/updates). The rest show 'NaNQ' where numbers should be. At one time, four of the Alphas had working memory usage graphs, and it was accurately monitoring changes in memory usage on those machines. The statistic I am monitoring specifically on each Alpha is 1.3.6.1.2.1.25.2.3.1.6.1, or 'hrStorageUsed.1' (used kernel memory). I've tried everything to get this to function. I have tried changing where the target for memory is located in the .cfg file(s), I've tried creating a shell script to do the query and multiplication for MRTG and simply feed it numerical values. There are no error messages in the log (which I keep in /var/adm for the master.cfg file). The system flat refuses to map this value. And yet, it will map 1.3.6.1.2.1.25.2.3.1.6.2, or 'hrStorageUsed.2' (used swap).

I thought perhaps that the folder where the cgi and .png files were kept was the issue, so I placed .htaccess files in these directories to specify expirations (5 minutes) to force Apache to refresh in those folders. I have tried restarting every service, I've tried rebooting the machine, I've tried .cfg files with just the memory target, on the outside chance that there were too many targets in each .cfg file (~10-12 per .cfg file). I double-triple checked that the user MRTG runs as (httpd) could run the necessary commands, write to the specified folders, and read from the others.

And yet here I am. At wits' end, still no closer to discovering why MRTG/rrdtool will not map memory usage for the Alphas. Memory usage for those servers/PCs running Windows works just fine. New servers added to the network instantly start graphing once the .cfg file for that machine is created...

Except for memory utilization on the Alphas. Some of you may be aware that 'NaNQ' basically means 'not a number quantity', but if the query was returning alphanumeric or string results, this would throw an error to the log file. Which it doesn't. I even changed the target to call an external script that returned garbage, and confirmed that at least error reporting for that target was working (which it was). Does anyone have any thoughts? Ideas? I mean, I'll douse the monitoring server with holy water if I thought it would help...

As mentioned previously, the SNMP query for that value of used kernel memory returns a number that must be multiplied by the hrStorageAllocationUnits.1 value (usually 1024). I tried creating a script that would do all that outside MRTG, and I formatted the script to feed MRTG data the way it likes it, namely four lines, 'in', 'out', 'server uptime' and 'hostname'...nuthin'. No errors, no complaints, no cursing (except from me), just 'NaNQ'. I'm starting to hate Alphas...
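For anyone following along, here is a minimal sketch in Python of the kind of wrapper described above, not the actual script used. It assumes the raw hrStorageUsed.1 count and the hrStorageAllocationUnits.1 value have already been fetched by some other means; the function name, the sample numbers, the uptime string and the target name are all hypothetical.

```python
def mrtg_external_output(used_units, alloc_unit_bytes, uptime, target_name):
    """Build the four lines MRTG reads from an external script.

    used_units       -- raw hrStorageUsed.1 value (a count of allocation units)
    alloc_unit_bytes -- hrStorageAllocationUnits.1 value (usually 1024)
    """
    # Convert allocation units to bytes, as described in the post.
    used_bytes = int(used_units) * int(alloc_unit_bytes)
    # Line 1 = 'in', line 2 = 'out', line 3 = uptime, line 4 = name.
    # Assumption: the same value is reported for both 'in' and 'out',
    # since only one quantity is being graphed here.
    lines = [str(used_bytes), str(used_bytes), str(uptime), str(target_name)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    import sys
    # Hypothetical sample values, purely for illustration.
    sys.stdout.write(mrtg_external_output(1000, 1024, "1 day", "alpha1"))
```

Running the output of a script like this by hand, as the MRTG user, is a quick way to confirm that the first two lines really are plain integers before MRTG ever sees them.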

Hein van den Heuvel · ‎01-11-2007

>> I'm starting to hate Alphas...

I'm sorry to hear that.

And I appreciate one could get annoyed by a function not working as desired. That can be aggravating.

However...
Is the Alpha doing fine otherwise?

And this problem is NOT a Tru64/OSF native function, is it?
And it is just a freaking monitor function to generate data that in all likelihood no one will really ever care to look at because the Alpha itself is hopefully doing just fine.
What's the priority here?
A good monitor or a good production system?
Yeah I like my Alpha, and no it's nothing personal, it's just another box. Albeit a promising box shamefully lost through poor management decisions.

Cheers!
Hein.

Allen Brand · ‎01-16-2007

Um...okay...

Thanks?

Dan Nelson_6 · ‎01-17-2007

The first thing you need to do is determine where the problem is. Try manually fetching that OID from the commandline and see if it returns a sane value. Here's the output from my server:

$ snmp_request localhost public get 1.3.6.1.2.1.25.2.3.1.6.1
1.3.6.1.2.1.25.2.3.1.6.1 = 3476136

If it returns a number, then the snmp daemon is off the hook. Things to try at this point would be: tcpdumping port 161 to verify that mrtg is actually sending the right query to the right server, looking really closely at your mrtg config file, and running mrtg with the --debug flag to see what it does with the value once it has it. Perl/mrtg/rrdtool updates might be useful too.

If snmp_request doesn't return a number, then I'd try bouncing snmpd or looking for errors in syslog. As a last resort you could install net-snmp and run it on an alternate port, and point mrtg at that.
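The "does it return a sane value" check can be scripted if you need to run it against all seven Alphas. A minimal sketch in Python, assuming the reply is a single line in the `OID = value` form shown above; the function name is hypothetical, and actually running snmp_request and capturing its output is left out.

```python
import re

# Matches one reply line such as "1.3.6.1.2.1.25.2.3.1.6.1 = 3476136".
_REPLY = re.compile(r"^\s*([0-9.]+)\s*=\s*(\S.*?)\s*$")


def parse_snmp_reply(line):
    """Return (oid, value) for one reply line.

    value is an int when the reply is a plain non-negative integer,
    and None otherwise. Both are None if the line has no 'OID = value' shape.
    """
    match = _REPLY.match(line)
    if match is None:
        return None, None
    oid, raw = match.group(1), match.group(2)
    # Only accept plain ASCII digits as a usable number.
    value = int(raw) if re.fullmatch(r"[0-9]+", raw) else None
    return oid, value


if __name__ == "__main__":
    oid, value = parse_snmp_reply("1.3.6.1.2.1.25.2.3.1.6.1 = 3476136")
    print(oid, value)
```

A value of None for a given host points at the snmp daemon on that host; an integer there moves the suspicion over to the mrtg side.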
