Operating System - HP-UX

Re: Quality Control on a Server rollout.

 
SOLVED
Steven E. Protter
Exalted Contributor

Quality Control on a Server rollout.

I am rolling an rp5450 (L2000) server into production this week.

I have done a lot of quality control checks but know that 6000 ITRC members are better than just me.

So here is the thread.

3 Points for any unduplicated suggestion
2 Bonus Points for those with perfect point giving records
1 Bonus point for those, like Pete Randall, who have more handouts than questions.

If your suggestion results in me catching something, I will post a notice and you can come back for a rabbit.

Only a representative of the Chosen People, we who invented bureaucracy but can't spell it, could come up with such a convoluted system.

There is no limit to the number of suggestions, but be realistic: if I can't run the check, the best you'll do is 3-6 points. Try to be thoughtful and provide procedures for running the checks.

It's 1 rabbit for every suggestion that actually results in me finding and eliminating a quality control problem.

Please read carefully, because I want you to know what I've done so your suggestion is relevant.

Old System:
D380, 2-way, all apps 32-bit, OS 32-bit
HP-UX 11.00, 32-bit
Oracle 8.1.7.0
Oracle 9iAS 1.0.2.2 Patch Level 12
Cyborg 4.5.3 (staying on this server)
Adabas/Natural from Software AG legacy application
Not Trusted

New System:
rp5450
HP-UX B.11.11 (June 2003) plus lots of other patches, 64-bit
Oracle 8.1.7.4, 64-bit
Oracle 9iAS 1.0.2.2 Patch Level 12, 32-bit
Adabas/Natural from Software AG, 64-bit, fully tested legacy application


The system was created with Ignite.

Major issues caught thus far:
The audit ID for non-root cron users was not set up correctly; I used the ITRC tsconvert utility to fix it and invented the restart utility because of it.

We have made Ignite correctly distribute the /etc/hosts and nsswitch.conf files.

We have practiced the Oracle 32-bit to 64-bit conversion and the migration of the Adabas data. Funny how the Adabas database can migrate to 64 bits in 15 minutes while it takes 6 hours for Oracle.

We have fully tested the actual Oracle application with test plans. Same for the legacy.

We have developed a memo with pictures for the users who will be forced to change their passwords; the fail rate on that is 50%, so we're setting passwords and notifying those users. We didn't try to migrate the non-trusted passwords to the trusted system.

I know we've done a good job, because 60 days ago, I put an rp5450 server in for the developers. But I want it to be perfect.

Why? Because I'm anal. Also because I want to pitch a promotion to Senior Systems Specialist, and I want this to come off clean.

Thanks in Advance

Steve
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
66 REPLIES
Stefan Farrelly
Honored Contributor

Re: Quality Control on a Server rollout.

A couple spring to mind:

1. Have you tested the new server under a typical user load (i.e. a normal day's usage with tons of users on) to really see any problems induced by load - something very, very hard to test otherwise.

2. If you have done 1, then you will have noticed that if you don't tune vx_ninode to 90% of ninode, your used RAM figure goes astronomical as soon as tons of users get on, and only setting vx_ninode and rebooting will fix it.
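
A rough sketch of the check and change on 11.11 (tunable handling can vary with your VxFS/patch level, so treat the values as illustrative only):

kmtune -q ninode
kmtune -q vx_ninode
kmtune -s vx_ninode=115200                # e.g. ~90% of an ninode of 128000
mk_kernel -s /stand/system && kmupdate    # rebuild the kernel, then reboot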

Recently encountered these problems myself on a new server rollout!
Im from Palmerston North, New Zealand, but somehow ended up in London...
Patrick Wallek
Honored Contributor

Re: Quality Control on a Server rollout.

Things just off the top of my head:

1) Buffer cache - have you reduced dbc_max_pct from the default value of 50?

2) Are any users going to be connecting directly to the box? If so, have you reset the npty, nstrpty and nstrtel kernel parameters, regenerated the kernel, and made sure that the additional pty/tty device files are created?
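
A sketch of the sort of commands involved (the numbers are purely illustrative; pick values to suit your user count):

kmtune -q dbc_max_pct
kmtune -s dbc_max_pct=10                  # e.g. cap buffer cache at 10% instead of 50%
kmtune -s npty=512
kmtune -s nstrpty=512
kmtune -s nstrtel=512
mk_kernel -s /stand/system && kmupdate    # rebuild the kernel, then reboot
insf -e                                   # after the reboot, recreate any missing device files
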
Pete Randall
Outstanding Contributor

Re: Quality Control on a Server rollout.

Steve,

(what happened to SEP, or Steven, for that matter)


In the past when I've faced a similar situation, I arranged for a dry run with actual users doing actual work. At the appointed time, I shut down the production server, switched the name and IP of the new server to be those of the production server, and turned the users loose. After a couple of hours of quasi-production work, we had unearthed a few unforeseen problems, which we took care of before the actual roll-out.
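
In case it helps, a rough outline of the name/IP swap on HP-UX (the file contents are obviously site-specific):

# on the new box, as root:
vi /etc/rc.config.d/netconf       # set HOSTNAME and IP_ADDRESS[0] to the production values
vi /etc/hosts                     # make the production name resolve to the new address
shutdown -r now                   # bounce it so everything picks up the new identity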


Pete
Steven E. Protter
Exalted Contributor

Re: Quality Control on a Server rollout.

Stephan,

We had our vx_ninode crisis on the prior server. Great idea. We have load tested as best we can, and have further tests set for next week. The default was zero, which means the system sets it, and that is ridiculous. We have set a value for it based on HP's recommendation.

Patrick,

We are pushing a configuration file that changes the hostname for the telnet connection to all of our users who still use the legacy app, which is telnet (I wish management would switch to ssh).

The dbc_max_pct issue was resolved as part of the same performance problem mentioned above by Stephan with regard to vx_ninode.

Good stuff.

This is going to be very helpful.

Some Oracle suggestions would be cool too.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Massimo Bianchi
Honored Contributor

Re: Quality Control on a Server rollout.

Hi,

some checks:

- some "+ +" left in .rhosts
- MWC and Bad Block Relocation on vg00's lv
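
A quick way to eyeball both on vg00 (a sketch; double-check the lvchange flags against the man page before changing anything):

for lv in /dev/vg00/lvol*
do
    lvdisplay $lv | grep -E 'LV Name|Mirror Write Cache|Bad block'
done
# boot/root/swap LVs usually want both off, e.g.:
# lvchange -M n /dev/vg00/lvol1    # Mirror Write Cache off
# lvchange -r n /dev/vg00/lvol1    # bad block relocation off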

...thinking...

Massimo
Stefan Farrelly
Honored Contributor

Re: Quality Control on a Server rollout.

Steve - what was your HP recommendation for vx_ninode? HP told us to set it to 90% of ninode - not the same for you?

Cheers,

Stefan
Im from Palmerston North, New Zealand, but somehow ended up in London...
Helen French
Honored Contributor

Re: Quality Control on a Server rollout.

Did you test anything from the backup software side? Since you have Oracle, there will be open-file issues, and you may have to come up with a good solution (cold or hot backups). You can run sample backups, see what impact they have on your database, and consult with your DBAs too.
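
For the hot-backup side, a bare-bones sketch of the sequence (the tablespace name is made up; your DBAs will have their own scripts):

sqlplus -s /nolog <<EOF
connect / as sysdba
alter tablespace USERS begin backup;
EOF
# ... back up the USERS datafiles with fbackup/NetBackup ...
sqlplus -s /nolog <<EOF
connect / as sysdba
alter tablespace USERS end backup;
EOF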

Did you test a Disaster recovery?

Did you document all changes you made to the server?

Just some thoughts ...
Life is a promise, fulfill it!
Pete Randall
Outstanding Contributor

Re: Quality Control on a Server rollout.

Steve,

I didn't catch where your data resides. Will you be using existing or is that getting migrated as well? If new, have you looked at replicating any FS tuning that may have been applied previously?


Pete
Steven E. Protter
Exalted Contributor

Re: Quality Control on a Server rollout.

Massimo,

.rhosts and all Berkeley protocols are totally disabled.
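
For anyone wanting a quick double-check, something along these lines (a sketch):

grep -E '^(shell|login|exec)' /etc/inetd.conf    # should return nothing if the r-services are commented out or removed
find / -name .rhosts 2>/dev/null                 # should come back empty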

I am checking on your Bad Block Relocation item; possible bunny alert, check back.

Stephen,

vx_ninode was set with HP's assistance to a figure greater than ninode.

This was handled by a support call, and as I recall I resisted setting this figure lower.

I am actually having some issues with this box; it's running some stuff slower than a box with half the memory and the same kernel configuration (swap is bigger).

So Stephan, check for a bunny on that suggestion as well. It might take a few days to figure that one out.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Massimo Bianchi
Honored Contributor

Re: Quality Control on a Server rollout.

Oracle?


That's here:

- HP-UX parameters

- FS for Oracle

Take care with mount options like convosync=direct and mincache=direct. Some say they are useful, some say they can hurt performance if the SGA is not properly set.

- Oracle parameters:

sessions = 1.2 * processes
processes: enough for the connections of all your users + 20% for safety. Remember that you need as many semaphores as processes.

db_files: the default is 256, and you may quickly run into problems if the db is growing. Raise it to 512 at least.
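
To make that concrete, an illustrative sizing sketch (the numbers are examples only, not recommendations):

# init.ora:  processes = 300     (peak user connections + ~20% headroom)
#            sessions  = 360     (~1.2 * processes)
#            db_files  = 512
# and check that the kernel has enough semaphores to back the processes:
kmtune -q semmns
kmtune -q semmni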

Massimo
doug mielke
Respected Contributor

Re: Quality Control on a Server rollout.

Not really QC, but CYA.
If anyone else will have root access, set up a backdoor root login to protect against a password change.
Steven E. Protter
Exalted Contributor

Re: Quality Control on a Server rollout.

Pete,

Data is stored on a Xiotech Magnitude disk array with a dual Fibre Channel card connection. PVLinks is not set up yet.

DR tests have been done with Ignite, fbackup and Veritas NetBackup.

We are running Veritas NetBackup and all backups next week as if it were really production, to work out the kinks.

Looking into Massimo's second suggestion as well.

The gerbil is running as fast as he can.

Security: Secure shell fully implemented with public keys exchanged. The root password is secure. Bastille was run on the box where the Golden Image was created.

The security audit we had done three years ago is being run again to make sure all issues were handled.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Elena Leontieva
Esteemed Contributor

Re: Quality Control on a Server rollout.

Steve,

You may want to check your DBA requirements for maxdsiz_64bit, maxssiz_64bit, and maxtsiz_64bit.
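
A quick way to see where they stand now (a sketch):

kmtune -q maxdsiz_64bit
kmtune -q maxssiz_64bit
kmtune -q maxtsiz_64bit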

Elena.
Steven E. Protter
Exalted Contributor

Re: Quality Control on a Server rollout.

I need elaboration, and others to discuss Massimo's FS recommendations. I admit that's a bit over my head.


vx_ninode is being dropped significantly during this weekend's maintenance.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Massimo Bianchi
Honored Contributor

Re: Quality Control on a Server rollout.

On OnlineJFS

> - mincache=direct: The default read operation for JFS copies data from
> disk to the HP-UX buffer cache, and then copies data to the Oracle SGA.
> Setting this mount option causes the data to be moved directly into the
> Oracle SGA; this may provide a minor improvement in the performance of
> non-sequential read operations. In 8.x versions of Oracle, this mount
> option will cause unnecessary physical I/O for sequential I/Os. This mount
> option should NOT BE USED with Oracle 8.x tablespace files; however, it is
> recommended for Oracle 8.x redo and archive file systems.
> - convosync=direct: This option changes the behavior of files opened
> with the O_SYNC flag enabled, which Oracle always uses. This will enable
> O_SYNC I/O operations to operate the same as non-O_SYNC file operations and
> thus use the mincache=direct mount option. In 8.x versions of Oracle, this
> mount option will cause unnecessary physical I/O for sequential I/Os. This
> mount option should NOT BE USED with Oracle 8.x tablespace files; however,
> it is recommended for Oracle 8.x redo and archive file systems.
>
> Why does mincache=direct impact the performance of Oracle sequential access
> (table scans)?
>
> 1) Oracle 8.x uses the readv system call rather than the read system
> call, which is used in Oracle 7.x. In Oracle 7.x, using the readv system
> call was an option that was enabled by a parameter in the init.ora file.
> Oracle 8.x provides no provision for using the read system call.
> 2) Using readv changes the behavior of large I/Os performed for
> sequential access on JFS file systems mounted with the mincache=direct
> option. The readv system call (read vector) passes an array of vectors
> (blocks) to be transferred for sequential operations. How this works:
> i) A common value for db_file_multiblock_read_count is 8.
> ii) When using readv with JFS file systems mounted with mincache=direct,
> JFS performs a separate physical I/O for each block.
> iii) This results in 8 physical I/Os of 8 KB each rather than a single
> 64 KB I/O (assuming an 8 KB block size).
> 3) When the mincache=direct mount option is not used, the readv system
> call passes the requests through the HP-UX buffer cache. This allows JFS
> to coalesce the (8) vectors into a single I/O.
> 4) An added benefit of using the HP-UX buffer cache is the JFS
> read-ahead facility. JFS will identify a table scan (sequential access)
> and initiate 1 MB of read-ahead, further increasing the performance of
> table scans.
> 5) Just to keep things interesting:
>
> If the Oracle db_file_multiblock_read_count is set to a value greater
> than 16, Oracle will revert to using the read system call. Using the
> mincache=direct mount option in this environment will not result in the
> readv performance penalty for sequential I/O; however, there will not be
> the benefit of the JFS read-ahead. This may be appropriate for some
> large data warehouse environments.
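
If you do try those options for the redo/archive filesystems only, a hypothetical /etc/fstab line might look like this (the LV and mount point names are made up):

/dev/vgora/lvredo /u01/oradata/redo vxfs delaylog,mincache=direct,convosync=direct 0 2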



But I suggest waiting for other experts' hints. I have heard many contrasting voices on this subject, and it looks like the answer is "it depends".

Massimo
Ken Hubnik_2
Honored Contributor

Re: Quality Control on a Server rollout.

Do you have any printer queue setup requirements?
Did you customize your inetd.conf file?
Steven E. Protter
Exalted Contributor

Re: Quality Control on a Server rollout.

Elena,

yes.

Ken,

yes.

Keep em coming ....

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Stefan Farrelly
Honored Contributor

Re: Quality Control on a Server rollout.

Steve,

you've got to get your pvlinks set up and tested under heavy load (i.e. pull a fibre while many dd's are running) to see how LVM behaves, and how it behaves when you re-insert the pulled fibre. That is, it needs to cope with both events without your app going down or HP-UX seeming to grind to a halt or go nuts.
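
A crude way to drive that kind of read load while you pull the fibre (the device names are just examples):

for d in c1t0d0 c1t0d1 c2t0d0 c2t0d1
do
    dd if=/dev/dsk/$d of=/dev/null bs=1024k &
done
# pull the primary fibre, then watch syslog and vgdisplay -v for the switch to the alternate link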

Over the years I've always noticed slightly different LVM behaviour with the patch bundles, and sometimes a larger difference necessitating some more current patches on top of the patch bundle, in order to get LVM behaviour back to what's expected (reliable, accurate and quick).

Also - when you add the pvlinks, aren't you going to balance your VGs across both fibre channels (primary and pvlink) so that you get redundancy and a doubling of I/O throughput?
We always do this. Usually it is done at VG creation time, so pvlinks need to be set up right at the start; otherwise it's VG recreation or constant use of pvchange -s.


Im from Palmerston North, New Zealand, but somehow ended up in London...
Massimo Bianchi
Honored Contributor

Re: Quality Control on a Server rollout.

Hi,
Stefan suggested another check to me:

PV timeouts !!!!

For XP/EMC/similar FC-attached devices, a value of 90 or higher is usually recommended.
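
Roughly along these lines (the device name is an example; use the value your array vendor recommends):

pvdisplay /dev/dsk/c1t0d0 | grep -i timeout
pvchange -t 90 /dev/dsk/c1t0d0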

Massimo
Steven E. Protter
Exalted Contributor

Re: Quality Control on a Server rollout.

Massimo,

No bunny on the Oracle parms; our DBA says we exceed both recommendations. Still, an excellent suggestion.

Stephan,

With regards to PVLINKS.

We need cooperation from our Xiotech admin to test PVLINKS.

We are going to do this after rollout on our sandbox. Once it's done, we're going to do our developers' box, then production.

It should not be handled as a separate project, but it is. This rollout was delayed a month because the disk allocation was done weeks behind schedule.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Steven E. Protter
Exalted Contributor

Re: Quality Control on a Server rollout.

Apologies for butchering your first name,

Stefan

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
A. Clay Stephenson
Acclaimed Contributor

Re: Quality Control on a Server rollout.

Have you done a thermal load analysis to make sure that your HVAC system can handle the additional load in the worst-case scenario? Ditto for UPS? Ditto for backup generator?

Are modems working? Email? Have you added this box to your monitoring system (e.g. OpenView VP/O or CA)? Have you checked to make sure that all your new equipment/software is under maintenance?

Finally, since this is a "rollout" have you made certain that the cabinet wheels are well lubricated? I suggest Mobil-1 5W-30, 2 drops per caster axle with an additional mist applied to the thrust bearing.

If it ain't broke, I can fix that.
Steven E. Protter
Exalted Contributor

Re: Quality Control on a Server rollout.

PVLINKS will be used for failover, not throughput.

I thought PVLINKS didn't help with throughput. Am I wrong? Apparently so.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Stefan Farrelly
Honored Contributor

Re: Quality Control on a Server rollout.

Steve,

Yes, you can use pvlinks in live+backup mode. When you create your VGs, alternate the disks from one fibre to the other (pvlink), then primary/alt, and so on. Keep going until they are all added in. Now you have a VG accessing the disks via 2 fibres, each a backup for the other. The best of both worlds!

eg. instead of (pri + pvlink on each line);
vgextend vgxx /dev/dsk/c1t0d0 /dev/dsk/c2t0d0
vgextend vgxx /dev/dsk/c1t0d1 /dev/dsk/c2t0d1
...
you do;
vgextend vgxx /dev/dsk/c1t0d0 /dev/dsk/c2t0d0
vgextend vgxx /dev/dsk/c2t0d1 /dev/dsk/c1t0d1
...

Im from Palmerston North, New Zealand, but somehow ended up in London...