
fbackup kills system

 
David Connolly
Regular Advisor


Hi all.

I have an issue whereby my HP-UX 11.11 L1000 occasionally enters a state where it does not respond at the GSP console or to remote sessions. My only recourse is to reset the system from the GSP. It happens about once per week, but I haven't established a firm pattern yet. From what I can tell, the last thing the system does is a cron'd fbackup job that does not complete.

There is nothing that provides a hint of what may be happening in /var/adm/syslog, or the backup log, or the chassis error logs. The system is patched to Goldapps/Goldbase June 2006 and hardware enablement to March 2003.

Any pointers to help with diagnosis would be appreciated.
10 REPLIES
Reshma Malusare
Trusted Contributor

Re: fbackup kills system

Hi David,
I suggest checking the crontab entries with
crontab -l
Fabian Briseño
Esteemed Contributor

Re: fbackup kills system

Hi David.

What is the exact command that runs this fbackup?

Have you tried looking in dmesg?

Knowledge is power.
David Connolly
Regular Advisor

Re: fbackup kills system

Hi Fabian

The fbackup is a standard SAM configured unattended backup.

00 2 * * 2-6 /usr/sam/lbin/br_backup DAT FULL Y /dev/rmt/0m /var/sam/graphPCAa23177 root Y 1 N > /tmp/SAM_br_msgs 2>&1 #sambackup

dmesg just gives me output since the last boot, with no problems evident.

@Reshma - thanks, but I am aware of how to list cron jobs (my fbackup is cron'd).
Wouter Jagers
Honored Contributor

Re: fbackup kills system

Strange.. have you tried leaving a session (remote or console) open at all times ? Any clues there ?

Also, does once a week mean every week on the same day, or could it be, for example, Tuesday one week and Thursday the next? Does it happen outside of peak hours as well?

Cheers
an engineer's aim in a discussion is not to persuade, but to clarify.
David Connolly
Regular Advisor

Re: fbackup kills system

Hi Wouter

> Strange.. have you tried leaving a session (remote or console) open at all times ? Any clues there ?

The console shows no hints in its history and won't bring up the login prompt, but a remote session might remain logged in and allow me some view - good idea, I'll try that.

> Also, does once a week mean every week on the same day, or could it be, for example, Tuesday one week and Thursday the next? Does it happen outside of peak hours as well?

The last two times it happened during the nightly backup (between 2 and 3am), but not on the same day. I'm thinking there's something being "consumed", like file handles, and the backup consumes a big chunk of them. I cannot point to anything that has changed over the last couple of months to bring about the more frequent occurrence.

I've patched it to HW Enablement Sept 2005 to see if that helps. I was hoping someone else might have experienced similar symptoms.
Ralph Grothe
Honored Contributor

Re: fbackup kills system

Just a vague idea,
has the nettl traced anything?

e.g.

# nettl -status all

Has the machine dumped anything to /var/adm/crash,
or something in /var/tombstones?
Madness, thy name is system administration
David Connolly
Regular Advisor

Re: fbackup kills system

Nothing since 2003 in the nettl logs
nothing in /var/adm/crash since 2003 either
/var/adm/tombstones contains tombstones from each of my resets, but no errors.

I think the system is up and running, just not responding to console or network logins.
James R. Ferguson
Acclaimed Contributor
Solution

Re: fbackup kills system

Hi David:

Memory is something that 'fbackup' craves. I'd begin by examining 'swapinfo -tam' and running 'vmstat' looking for page-outs (the 'po' column) that are in double-digits.

If swap utilization is high and you see significant memory pressure you may have found your reason.
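To make the page-out check above repeatable, here is a minimal sketch that scans 'vmstat' output for samples whose 'po' value reaches double digits. The function name flag_pageouts and the default threshold of 10 are illustrative assumptions, not part of HP-UX; the column is located by its header label rather than by a fixed position, since layouts can differ between releases.

```shell
# Sketch only: flag_pageouts is a hypothetical helper, not an HP-UX command.
# It reads vmstat output on stdin and reports samples with po >= threshold.
flag_pageouts() {
    threshold="${1:-10}"   # assumed default: "double digits"
    awk -v limit="$threshold" '
        # Until the header row is seen, look for the field labelled "po".
        col == 0 {
            for (i = 1; i <= NF; i++) if ($i == "po") col = i
            next
        }
        # Data rows start with a number; repeated header rows are skipped.
        $1 ~ /^[0-9]+$/ {
            n++
            if ($col + 0 >= limit) {
                printf "sample %d: po=%d\n", n, $col
                hits++
            }
        }
        # Exit 0 if any sample crossed the limit, 1 otherwise.
        END { exit (hits > 0 ? 0 : 1) }
    '
}
```

It could be used as 'vmstat 5 12 | flag_pageouts' while the backup is running, alongside a look at 'swapinfo -tam'. A few flagged samples are not proof on their own, but sustained hits during the fbackup window would support the memory pressure theory.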

'fbackup' uses shared memory segments (see 'shmmax' in your kernel) to buffer the files it is copying. 'fbackup' can have up to six (6) reader processes running at a time. I assume that you have a default 'fbackup' configuration file in place because you are doing a standard SAM backup.

If you have ever killed (-9) any 'fbackup' session, you probably have left orphaned memory segments lying about, tying up part of your memory. A 'ps -ef|grep fbackup' might expose some old processes lingering. If so, a reboot is the easiest penalty to pay.

Regards!

...JRF...
Bill Hassell
Honored Contributor

Re: fbackup kills system

The only definitive way to see what is happening is to use the TC command rather than RS (reset) from the GSP. The TC (Transfer of Control) command will force a crash dump which can then be read and the hang condition can be identified.

I agree with James that this may be a memory starvation issue as fbackup can use hundreds of megs of RAM to store all the filenames and this may be a lot more than is available, causing excessive paging as well as a massive slowdown in the backup speed. Depending on memory fragmentation due to orphaned shared memory segments and other process memory pressure, there may be a lot of virtual memory paging. This paging can indeed cause loss of console or session prompts -- well, eventually they will respond but it may be several minutes.

So check ipcs -bmop to see if there are orphaned segments and clear them. Then run fbackup during the day for a few minutes (until the tape starts moving) and check ipcs -bmop to see just how much shared memory is needed for fbackup. Then terminate fbackup (NEVER use kill -9) with a straight kill or kill -15. If root has been using kill -9 a lot, that is a major source of problems. kill -9 is a last resort and you must cleanup all the program's resources by hand (every time).
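As a rough aid for the ipcs check above, the sketch below lists shared memory segments that have no attached processes (NATTCH of 0), which are the candidates for being orphaned. The function name list_unattached_segments is a hypothetical helper, and it assumes the header row of 'ipcs -bmop' starts with "T" and labels the columns ID, OWNER, NATTCH and SEGSZ; check that against your own output first.

```shell
# Sketch only: list_unattached_segments is a hypothetical helper.
# It reads "ipcs -bmop" output on stdin and prints segments with NATTCH == 0.
list_unattached_segments() {
    awk '
        # Header row: remember the position of every column label.
        $1 == "T" {
            for (i = 1; i <= NF; i++) pos[$i] = i
            next
        }
        # Shared memory rows start with "m"; zero attaches means nothing is using it.
        $1 == "m" && ("NATTCH" in pos) && $pos["NATTCH"] == 0 {
            printf "id=%s owner=%s bytes=%s\n", $pos["ID"], $pos["OWNER"], $pos["SEGSZ"]
        }
    '
}
```

Typical use would be 'ipcs -bmop | list_unattached_segments'. A segment with no attaches is not necessarily orphaned, since some applications keep segments around on purpose, so confirm the owner and creator PID before removing anything with 'ipcrm -m' and the segment ID.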

You might want to bring HWE patches up to June 2006 level too.


Bill Hassell, sysadmin
David Connolly
Regular Advisor

Re: fbackup kills system

Thanks folks - good advice. I checked memory after last night's scheduled backup and there doesn't seem to be a significant drop. No problems with swap or orphaned memory segments as far as I can see.

I had set (a long time ago) the dbc_max_pct down to 10% to avoid fbackup hogging hundreds of MB of RAM, so that should help there.

I've put HWE Sept 2005 on so I'll let it run for a couple of days while monitoring the above and see how the patches help.