Oracle 10g or 11g parallel server (RAC)- crash handling?

SOLVED
TwoProc
Honored Contributor

Oracle 10g or 11g parallel server (RAC)- crash handling?

I've got some questions. Hopefully, some of the forumers can weigh in on this. And yeah, it's probably dumb, because I'm asking the questions that the advertisements purport to answer. HOWEVER, I've never trusted anything like ads, software salespeople, etc., to tell me what software can do. I trust them to deceptively overstate what it can do in the narrowest of circumstances, and that's how I got to the point of asking my peers.

Let's say I've got three Oracle servers in a RAC configuration (running Oracle APPS).

First question, is my data on ALL 3 servers?

And, if one of them crashes - do the remaining two servers keep everything running?

And the remaining two servers have the ability to maintain data for the missing server until it's back up?

And in another scenario, let's say it's just two servers in a cluster (not three) - is the above still true? That is, if one crashes, would I keep running, having access to all my data?

While I realize that a process which has been attached to a single server would die (e.g. a process spawned to fulfill a query) - other than that - would *total* database availability remain?
We are the people our parents warned us about --Jimmy Buffett
9 REPLIES
Hasan Atasoy
Honored Contributor
Solution

Re: Oracle 10g or 11g parallel server (RAC)- crash handling?

hi ;

1. ) There is one copy of the data, and all 3 systems see the same data.
2. ) As long as one of the servers survives, your system will be available to the outside world.
3. ) Same answers for two servers.


hasan.
TwoProc
Honored Contributor

Re: Oracle 10g or 11g parallel server (RAC)- crash handling?


Thanks!
Duncan Edmonstone
Honored Contributor

Re: Oracle 10g or 11g parallel server (RAC)- crash handling?

>> First question, is my data on ALL 3 servers?

No, your data is on shared disk - each node has its own database buffer cache, which is kept in sync to some degree (when needed) via the cluster interconnect.

>> And, if one of them crashes - do the remaining two servers keep everything running?

Depends what you mean by everything... certainly the database is kept clean, because uncommitted in-flight transactions from the failed node are rolled back. If your application fully supports Transparent Application Failover (TAF), then your users may only notice a momentary pause - but not all applications do - some apps may need to reconnect to the database completely. Either way, once connected to a remaining node, the application can carry on pretty much from where it left off - repeating the last failed transaction, obviously.
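For an application that does not support TAF, the "reconnect and repeat the last failed transaction" behaviour described above could look roughly like the sketch below. This is only an illustration under stated assumptions: `NodeDown`, `run_with_reconnect`, `fake_connect` and the node names are hypothetical and stand in for whatever the real database driver raises and provides; no actual Oracle client API is used.

```python
import time


class NodeDown(Exception):
    """Hypothetical error: the node we were connected to has failed."""


def run_with_reconnect(connect, transaction, nodes, retries=3, delay=0.0):
    """Run `transaction`; on node failure, reconnect elsewhere and repeat it.

    The whole transaction is repeated, because the uncommitted work on the
    failed node has been rolled back by the surviving instances.
    """
    last_error = None
    for attempt in range(retries):
        node = nodes[attempt % len(nodes)]  # try the next node in the list
        try:
            conn = connect(node)
            return transaction(conn)
        except NodeDown as err:
            last_error = err
            time.sleep(delay)  # brief pause before trying another node
    raise last_error


def fake_connect(node):
    """Simulated driver: node1 has crashed, the other nodes are fine."""
    if node == "node1":
        raise NodeDown(node)
    return node


result = run_with_reconnect(
    fake_connect,
    lambda conn: f"committed on {conn}",
    ["node1", "node2", "node3"],
)
print(result)
```

With TAF configured, the client library does the reconnect for you; a loop like this only matters for applications that have to handle it themselves.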

>> And the remaining two servers have the ability to maintain data for the missing server until it's backup?

No - data is on the shared disk, so the other nodes don't have anything to maintain for the failed node.

>> And in another scenario, let's say it's just two servers in a cluster (not three) - is the above still true? That is, if one crashes, would I keep running, having access to all my data?

Yes - no difference between a 3 node and 2 node cluster at that level.

>> While I realize that a process which has been attached a single server would die, (e.g. a process spawned to fulfill a query) - other than that - would *total* database availability remain?

Yes.

Of course none of it is as simple as Oracle would paint it - RAC is a *great* product, but it still needs some pretty hot DBAs to manage effectively compared to a single-instance Oracle database, and the Oracle clusterware suffers from being an entirely user-space application on HP-UX, whereas Serviceguard is integrated into the OS.

I usually give clients the following matrix when considering Oracle database availability:

average DBA, average sysadmin = single system will deliver best system availability
average DBA, good sysadmin = clustered (Serviceguard) system will deliver best system availability
good DBA, average sysadmin = Oracle RAC with vanilla Oracle stack will deliver best system availability
good DBA, good sysadmin = Oracle RAC with Serviceguard Extension for RAC delivers the *best* systems availability

Of course all this will only work well if you boot it all from the SAN! ;o) (joke...)

HTH

Duncan


Steven E. Protter
Exalted Contributor

Re: Oracle 10g or 11g parallel server (RAC)- crash handling?

Shalom,

Q&A embedded.
>>>
First question, is my data on ALL 3 servers?
Yes and no. RAC requires shared storage, so the data resides in one place.

Parallel Server can copy the data server to server and will try to catch up after an outage of one of the three servers. If not Parallel Server, then Data Guard.


And, if one of them crashes - do the remaining two servers keep everything running?

RAC couldn't care less how many servers are running because the data is in one place.
Data Guard/Parallel Server will try to catch up once the downed server is resurrected.

And the remaining two servers have the ability to maintain data for the missing server until it's backup?

Yes.

And in another scenario, let's say it's just two servers in a cluster (not three) - is the above still true? That is, if one crashes, would I keep running, having access to all my data?

Yes, RAC is active-active: the database is open on both nodes of the cluster. A Serviceguard-based two-server solution still requires shared storage but takes longer to fail over because it needs to open the database.


You should consider Serviceguard in this matrix if 1-10 minutes is acceptable downtime while the database is opened on the other node. It is substantially less expensive from the Oracle license standpoint and your data can reside on normal filesystems.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
TwoProc
Honored Contributor

Re: Oracle 10g or 11g parallel server (RAC)- crash handling?

So, now I'm a bit confused.

I've got Serviceguard, but it's for failover for a standalone server which participates in a cluster (just for the failover). This is necessary to roll the disks over to the next server during an event (critical or maintenance). But the end result is that you've still got a standalone server. In my mind (up to this point, hoping to learn more), that's what Serviceguard does.

So, what do you need Serviceguard for, if you're doing RAC and all servers are already connected to the database? Is it there just to define the cluster itself? What additional level of service is Serviceguard providing at that point?

Duncan Edmonstone
Honored Contributor

Re: Oracle 10g or 11g parallel server (RAC)- crash handling?

OK, welcome to the wonderful world of RAC options... there are plenty!

Way back in the mists of time, when RAC was still Oracle Parallel Server (OPS), Oracle didn't have any kind of clustering capability at all - in those days you had to use a vendor's clusterware to get RAC to work for you - it did a few things to make OPS work:

- provided cluster membership services
- provided concurrent access to shared storage
- provided a distributed lock manager (DLM), to manage access to those distributed database buffer caches

Serviceguard offered these features through Serviceguard Extension for RAC - so did IBM with HACMP and Sun with SunCluster etc.

That was all well and good, but didn't fit too well with Larry's "take over the world" approach to IT, so with 9i things started to change... Oracle bought the rights from HP to use a pile of the code that used to be in TruCluster and started altering it to work with Oracle.

The first release of this was with 9i, and Oracle changed the name to 'Real Application Clusters'. 9iRAC replaced the DLM with Oracle's own, called cache fusion, and on the Linux platform (where they can easily dig into the Kernel), Oracle also made cluster membership services available, and an attempt at a cluster filesystem to manage storage (the poorly received OCFS).

On the commercial UNIX platforms though, you still needed a vendor cluster solution to provide cluster services and storage concurrency etc.

Now move forward to 10gR2, and Larry's master plan for increasing wallet share really gets going... Now Oracle also offer their own clusterware (Oracle CRS - Cluster Ready Services), and their own storage solution (ASM - Automatic Storage Management) - Oracle will now have you believe you don't need any vendor cluster solution at all.

The reality however is there are still some good reasons to use Serviceguard to provide cluster membership services and for Storage - from my point of view they are:

- Oracle's clusterware is a user-space application - Serviceguard has hooks into the kernel - the upshot is that Serviceguard has much better hung node detection than CRS

- You don't just run a database, right? Oracle clusterware is largely unproven with anything apart from Oracle RAC itself - what about all the other application/system components you want to make highly available?

- CRS has no capability to provide trunking or failover for LAN cards - you have to add this yourself.

- ASM is a pretty immature storage stack missing many key features you'd get out of LVM/VxVM - and ASM isn't a filesystem - you can't manage your database files like you do on a standalone node.

- ASM doesn't provide any kind of MPIO - you need to get that from your OS.

- Serviceguard with its additional Serviceguard Storage Management Suite for Oracle RAC is the only option that provides a full cluster-based filesystem with performance close to raw.

I could go on... I'm sure this will spark off more questions - please feel free to fire away!

HTH

Duncan

TwoProc
Honored Contributor

Re: Oracle 10g or 11g parallel server (RAC)- crash handling?

Thanks for clearing those up SEP, and thanks also for the lengthy reply Duncan, and the offer of answering more questions.

That clears it up for me on the whole "Why Serviceguard" question - because in Oracle Apps, there are lots of things besides the database, plus some of our own extensions, that must be kept running for full functionality - and I hadn't thought about those. Also, the issue about the card failover - excellent point.

So, you're saying that the Serviceguard Storage Management Suite for Oracle RAC includes the Veritas file system which gives async I/O. Very nice.

Two more:
1) If I elected to use raw - can I back up the cold volumes efficiently with Data Protector like I do now? More on why I'd want to do this instead of Veritas file system later on below...
2) Are customers doing this on Linux boxes yet for large server replacements? Say, replacing a 32-way Superdome with three 4x quad-core (16 cores each) servers?

I could do the replacement on Itanium (and I'm leaning toward that solution), but the prices for the Intel solution beg a closer look.
So far, my biggest issues with Linux on Intel are a) servers occasionally go down, b) file systems occasionally get corrupted.
I could address a) with more servers, and I'm thinking I could possibly address b) with no file system - raw. Would that help that problem - or would this just now make me have occasionally corrupt databases, instead of occasionally corrupt file systems (just redefine the same problem)?
Would you use HP's Serviceguard for a RH Linux RAC solution as well?
Would you use Veritas file systems for a RH Linux RAC solution here also? Or would you go raw as I posed earlier?

I'm at this point basically against the conventional Intel processor idea (higher risk factors in my mind), but I need to examine more information to better assess the issue, and of course the best resource for this is the people here who've had experience with both the Itanium solutions and the Linux ones. I use lots o' Linux now (hence the concerns), but not for Oracle databasing. Your thoughts are GREATLY appreciated.
Duncan Edmonstone
Honored Contributor

Re: Oracle 10g or 11g parallel server (RAC)- crash handling?

1) If I elected to use raw - can I backup the cold volumes efficiently with DataProtector like I do now? More on why I'd want to do this instead of Veritas file system later on below...

Backups of raw Oracle partitions are very difficult without using RMAN. Of course Data Protector integrates with RMAN, so as long as you've paid for the integration licenses, you can still use DP. I've come across other customers with smaller databases who just back up the database via RMAN to a filesystem somewhere and then write the resulting backup off to tape as a normal filesystem backup - not massively elegant, but it works.

2) Are customers doing this on Linux boxes yet for large server replacements? Say, replacing a 32 way dome with a 3 4xquadcore (16 procs each) servers?

Some are, some aren't - it really comes down to what you feel comfortable managing - after all, management costs contribute much more to any TCO figure than acquisition costs. From a management standpoint, given a choice between 3-4 ProLiants with Red Hat or 1 Superdome with HP-UX, I know which one I'd bet my business on...

The whole scale-up vs. scale-out argument is pretty skewed these days by the crowds of Oracle and Linux fanboys who've never known anything better. The reason you get all the kinds of problems you describe on x86 architectures is nothing much to do with Linux and more to do with the sort of error checking the CPU is capable of - silent data corruption can be rife when you don't have decent ECC error checking on your memory - and diagnosing serious hardware problems? You often can't even tell which component had a failure! It's all very well saying that the boxes are cheap and you have more of them, but when you can't track down a persistently failing component your life isn't going to be any easier.

The thing is, RAC isn't quite the panacea it's painted as by Oracle (everyone's heard of a site where they got better performance in their RAC cluster when they *reduced* the number of nodes!), so you need to tread very carefully.

This whitepaper is pretty out of date now, but is an excellent starting point for injecting a bit of realism into your RAC considerations:

http://www.miracleas.dk/WritingsFromMogens/YouProbablyDontNeedRACUSVersion.pdf


HTH

Duncan

tom quach_1
Super Advisor

Re: Oracle 10g or 11g parallel server (RAC)- crash handling?

Hello,

In a RAC environment the data is in one location; in my case, it is on an EVA3000.
For example, if your database name is TEST,
each node will have an instance:
TEST1 ON NODE1
TEST2 ON NODE2
TEST3 ON NODE3
If you set up Oracle TAF (Transparent Application Failover; you need to configure it yourself) in your Oracle database, and NODE1 and NODE2 fail, it just re-routes users to NODE3,
and it happens automatically.
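To make the picture above concrete, here is a toy Python model of one shared database with an instance per node. It is not Oracle code and does not reflect how TAF is implemented; the class `RacCluster` and its methods are made up for illustration, and only the names TEST1..TEST3 and NODE1..NODE3 come from the example above.

```python
class RacCluster:
    """Toy model: one shared copy of the data, one instance per node."""

    def __init__(self, db_name, node_count):
        # Stands in for the shared storage (the EVA in the example above).
        self.shared_data = {}
        # NODE1 -> TEST1, NODE2 -> TEST2, ...
        self.instances = {
            f"NODE{i}": f"{db_name}{i}" for i in range(1, node_count + 1)
        }
        self.up = set(self.instances)

    def fail(self, node):
        """Mark a node as crashed; the shared data is untouched."""
        self.up.discard(node)

    def route(self, preferred):
        """Return the instance a session lands on, preferring `preferred`."""
        if preferred in self.up:
            return self.instances[preferred]
        survivors = sorted(self.up)
        if not survivors:
            raise RuntimeError("no surviving instance; database unavailable")
        return self.instances[survivors[0]]


cluster = RacCluster("TEST", 3)
cluster.shared_data["order_42"] = "shipped"
cluster.fail("NODE1")
cluster.fail("NODE2")
print(cluster.route("NODE1"))             # session is re-routed to a survivor
print(cluster.shared_data["order_42"])    # data is still there
```

The point of the model is simply that losing nodes removes instances, not data: as long as one instance is up, sessions have somewhere to go.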
Regarding MC/ServiceGuard and Oracle clusterware (CRS and ASM):
We used MC/ServiceGuard for our 9i RAC database with raw devices. It was a pain. We migrated to Oracle 10gR2 a year ago and I completely dropped MC/ServiceGuard, using only Oracle CRS and ASM for Oracle 10gR2. I'm really glad I made that decision; it saves me a lot of time and is easy to manage. With the latest patch, 10.2.0.3, it's very stable.
With ASM, for example, if your DBA needs more space:
in my case, I just create a LUN and present it to the UNIX host, where it shows up as /dev/rdsk/c1t0d0, then change ownership to oracle:dba and permissions to 644.
That's about it for the sysadmin; tell the DBA the disk is ready. He will use OEM (Oracle Enterprise Manager) or TOAD to add the disk to the database, and ASM will rebalance all data across all the disks, fast or slow depending on the rebalance power the DBA selects, from 1 to 11.
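For the DBA's side of that step, the statement behind the GUI is an `ALTER DISKGROUP ... ADD DISK ... REBALANCE POWER n`. The small Python helper below only builds that statement text as a sketch; the function name `add_disk_statement` and the disk group name `DATA` are assumptions for illustration, the 1 to 11 range is the one quoted above (check the limits for your release), and nothing here connects to a database or validates identifiers.

```python
def add_disk_statement(diskgroup, device_path, power):
    """Build the SQL text for adding a disk to an ASM disk group.

    `power` controls how aggressively ASM rebalances data onto the new
    disk; this sketch accepts the 1 to 11 range mentioned in the post.
    """
    if not 1 <= power <= 11:
        raise ValueError("rebalance power must be between 1 and 11")
    return (
        f"ALTER DISKGROUP {diskgroup} "
        f"ADD DISK '{device_path}' "
        f"REBALANCE POWER {power}"
    )


# Hypothetical disk group name; device path taken from the example above.
statement = add_disk_statement("DATA", "/dev/rdsk/c1t0d0", 5)
print(statement)
```

A higher power finishes the rebalance sooner at the cost of more I/O load while it runs, which is the "fast or slow" trade-off described above.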
If you decide to use ASM, the only tool to back it up is RMAN. You cannot copy files to the OS, but you can use the command line to see them and move them around inside ASM. There is a way to get ASM files out to the OS via FTP or URL, but that is not popular. In Oracle 11g we can use the command line to copy ASM files to the OS.
Hope this helps!
Thanks,
Tom