<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: RAC/SG split brain in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282166#M706883</link>
    <description>This is actually an Oracle error message, and it appears to indicate that there has been a communication loss between the two nodes.&lt;BR /&gt;This is especially true if you have Hyperfabric between the nodes, as RAC likes to use this for its comms.&lt;BR /&gt;I suggest you raise a call with Oracle.&lt;BR /&gt;</description>
    <pubDate>Thu, 20 May 2004 16:23:51 GMT</pubDate>
    <dc:creator>melvyn burnard</dc:creator>
    <dc:date>2004-05-20T16:23:51Z</dc:date>
    <item>
      <title>RAC/SG split brain</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282163#M706880</link>
      <description>While testing a RAC cluster, if we pull the first node from the network, any Oracle requests to the second node cause the database to come down.&lt;BR /&gt;&lt;BR /&gt;The following is from the Oracle alert log:&lt;BR /&gt; &lt;BR /&gt;Wed May 19 16:25:16 2004&lt;BR /&gt;Waiting for clusterware split-brain resolution&lt;BR /&gt;Wed May 19 16:25:17 2004&lt;BR /&gt;Errors in file /oracle/app/admin/admin/PDCSPRD/bdump/pdcsprd2_lmon_16718.trc:&lt;BR /&gt;ORA-29740: evicted by member 1, group incarnation 18&lt;BR /&gt;LMON: terminating instance due to error 29740&lt;BR /&gt;Wed May 19 16:25:17 2004&lt;BR /&gt;Errors in file /oracle/app/admin/admin/PDCSPRD/bdump/pdcsprd2_lmd0_16720.trc:&lt;BR /&gt;ORA-29740: evicted by member , group incarnation&lt;BR /&gt;Instance terminated by LMON, pid = 16718&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;It looks like there is a split-brain condition occurring, but I'm confused how: this is a 2-node cluster with 1 lock vg and 1 lock pv.&lt;BR /&gt;It also has no packages.&lt;BR /&gt;&lt;BR /&gt;I don't see anything strange in the syslog.&lt;BR /&gt;&lt;BR /&gt;I have been searching the web all morning and have found many documents on split brain, but nothing telling me how to correct the problem.&lt;BR /&gt;&lt;BR /&gt;Does anyone have any advice here?</description>
      <pubDate>Thu, 20 May 2004 10:21:39 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282163#M706880</guid>
      <dc:creator>Marvin Strong</dc:creator>
      <dc:date>2004-05-20T10:21:39Z</dc:date>
    </item>
    <item>
      <title>Re: RAC/SG split brain</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282164#M706881</link>
      <description>Here is a note from Oracle on how to resolve this issue. Hope it helps.&lt;BR /&gt;&lt;BR /&gt;Note:219361.1&lt;BR /&gt;PURPOSE &lt;BR /&gt;======= &lt;BR /&gt;This note was created to troubleshoot the ORA-29740 error in a Real Application &lt;BR /&gt;Clusters environment. &lt;BR /&gt;SCOPE &amp;amp; APPLICATION &lt;BR /&gt;==================== &lt;BR /&gt;This note is for DBAs needing to resolve ORA-29740. &lt;BR /&gt;Troubleshooting ORA-29740 in a RAC Environment &lt;BR /&gt;============================================== &lt;BR /&gt;An ORA-29740 error occurs when a member was evicted from the group by another &lt;BR /&gt;member of the cluster database for one of several reasons, which may include &lt;BR /&gt;a communications error in the cluster, failure to issue a heartbeat to the &lt;BR /&gt;control file, and other reasons. This mechanism is in place to prevent &lt;BR /&gt;problems from occurring that would affect the entire database. For example, &lt;BR /&gt;instead of allowing a cluster-wide hang to occur, Oracle will evict the &lt;BR /&gt;problematic instance(s) from the cluster. When an ORA-29740 error occurs, a &lt;BR /&gt;surviving instance will remove the problem instance(s) from the cluster. &lt;BR /&gt;When the problem is detected, the instances 'race' to get a lock on the &lt;BR /&gt;control file (Results Record lock) for updating. The instance that obtains &lt;BR /&gt;the lock tallies the votes of the instances to decide membership. 
A member &lt;BR /&gt;is evicted if: &lt;BR /&gt;a) A communications link is down &lt;BR /&gt;b) There is a split-brain (more than 1 subgroup) and the member is &lt;BR /&gt;not in the largest subgroup &lt;BR /&gt;c) The member is perceived to be inactive &lt;BR /&gt;Sample message in Alert log of the evicted instance: &lt;BR /&gt;Fri Sep 28 17:11:51 2001 &lt;BR /&gt;Errors in file /oracle/export/TICK_BIG/lmon_26410_tick2.trc: &lt;BR /&gt;ORA-29740: evicted by member %d, group incarnation %d &lt;BR /&gt;Fri Sep 28 17:11:53 2001 &lt;BR /&gt;Trace dumping is performing id=[cdmp_20010928171153] &lt;BR /&gt;Fri Sep 28 17:11:57 2001 &lt;BR /&gt;Instance terminated by LMON, pid = 26410 &lt;BR /&gt;The key to resolving the ORA-29740 error is to review the LMON trace files &lt;BR /&gt;from each of the instances. On the evicted instance we will see something &lt;BR /&gt;like: &lt;BR /&gt;*** 2002-11-20 18:49:51.369 &lt;BR /&gt;kjxgrdtrt: Evicted by 0, seq (3, 2) &lt;BR /&gt;^ &lt;BR /&gt;| &lt;BR /&gt;This indicates which instance initiated the eviction. &lt;BR /&gt;On the evicting instance we will see something like: &lt;BR /&gt;kjxgrrcfgchk: Initiating reconfig, reason 3 &lt;BR /&gt;*** 2002-11-20 18:49:29.559 &lt;BR /&gt;kjxgmrcfg: Reconfiguration started, reason 3 &lt;BR /&gt;... &lt;BR /&gt;*** 2002-11-20 18:49:29.727 &lt;BR /&gt;Obtained RR update lock for sequence 2, RR seq 2 &lt;BR /&gt;*** 2002-11-20 18:49:31.284 &lt;BR /&gt;Voting results, upd 0, seq 3, bitmap: 0 &lt;BR /&gt;Evicting mem 1, stat 0x0047 err 0x0002 &lt;BR /&gt;You can see above that the instance initiated a reconfiguration for reason 3 &lt;BR /&gt;(see Note 139435.1 for more information on reconfigurations). The &lt;BR /&gt;reconfiguration is then started and this instance obtained the RR lock &lt;BR /&gt;(Results Record lock) which means this instance will tally the votes of the &lt;BR /&gt;instances to decide membership. 
The last lines show the voting results then &lt;BR /&gt;this instance evicts instance 1. &lt;BR /&gt;For troubleshooting ORA-29740 errors, the 'reason' will be very important. &lt;BR /&gt;In the above example, the first section indicates the reason for the &lt;BR /&gt;initiated reconfiguration. The reasons are as follows: &lt;BR /&gt;Reason 0 = No reconfiguration &lt;BR /&gt;Reason 1 = The Node Monitor generated the reconfiguration. &lt;BR /&gt;Reason 2 = An instance death was detected. &lt;BR /&gt;Reason 3 = Communications Failure &lt;BR /&gt;Reason 4 = Reconfiguration after suspend &lt;BR /&gt;For ORA-29740 errors, you will most likely see reasons 1, 2, or 3. &lt;BR /&gt;----------------------------------------------------------------------------- &lt;BR /&gt;Reason 1: The Node Monitor generated the reconfiguration. This can happen if: &lt;BR /&gt;a) An instance joins the cluster &lt;BR /&gt;b) An instance leaves the cluster &lt;BR /&gt;c) A node is halted &lt;BR /&gt;It should be easy to determine the cause of the error by reviewing the alert &lt;BR /&gt;logs and LMON trace files from all instances. If an instance joins or leaves &lt;BR /&gt;the cluster or a node is halted then the ORA-29740 error is not a problem. &lt;BR /&gt;ORA-29740 evictions with reason 1 are usually expected when the cluster &lt;BR /&gt;membership changes. Very rarely are these types of evictions a real problem. &lt;BR /&gt;If you feel that this eviction was not correct, do a search in Metalink or &lt;BR /&gt;the bug database for: &lt;BR /&gt;ORA-29740 'reason 1' &lt;BR /&gt;Important files to review are: &lt;BR /&gt;a) Each instance's alert log &lt;BR /&gt;b) Each instance's LMON trace file &lt;BR /&gt;c) Statspack reports from all nodes leading up to the eviction &lt;BR /&gt;d) Each node's syslog or messages file &lt;BR /&gt;----------------------------------------------------------------------------- &lt;BR /&gt;Reason 2: An instance death was detected. 
This can happen if: &lt;BR /&gt;a) An instance fails to issue a heartbeat to the control file. &lt;BR /&gt;When the heartbeat is missing, LMON will issue a network ping to the instance &lt;BR /&gt;not issuing the heartbeat. As long as the instance responds to the ping, &lt;BR /&gt;LMON will consider the instance alive. If, however, the heartbeat is not &lt;BR /&gt;issued for the length of time of the control file enqueue timeout, the &lt;BR /&gt;instance is considered to be problematic and will be evicted. &lt;BR /&gt;Common causes for an ORA-29740 eviction (Reason 2): &lt;BR /&gt;a) NTP (Time changes on cluster) - usually on Linux, Tru64, or IBM AIX &lt;BR /&gt;b) Network Problems (SAN). &lt;BR /&gt;c) Resource Starvation (CPU, I/O, etc..) &lt;BR /&gt;d) An Oracle bug. &lt;BR /&gt;Common bugs for reason 2 evictions: &lt;BR /&gt;BUG 2820871 - Abrupt time adjustments can crash instance with ORA-29740 &lt;BR /&gt;(Reason 2) (Linux Only) &lt;BR /&gt;Fixed-Releases: 9204+ A000 &lt;BR /&gt;If you feel that this eviction was not correct, do a search in Metalink or the &lt;BR /&gt;bug database for: &lt;BR /&gt;ORA-29740 'reason 2' &lt;BR /&gt;Important files to review are: &lt;BR /&gt;a) Each instance's alert log &lt;BR /&gt;b) Each instance's LMON trace file &lt;BR /&gt;c) Statspack reports from all nodes leading up to the eviction &lt;BR /&gt;d) The CKPT process trace file of the evicted instance &lt;BR /&gt;e) Other bdump or udump files... &lt;BR /&gt;f) Each node's syslog or messages file &lt;BR /&gt;----------------------------------------------------------------------------- &lt;BR /&gt;Reason 3: Communications Failure. This can happen if: &lt;BR /&gt;a) The LMON processes lose communication with one another. &lt;BR /&gt;b) One instance loses communications with the LMD process of another &lt;BR /&gt;instance. &lt;BR /&gt;c) An LMON process is blocked, spinning, or stuck and is not &lt;BR /&gt;responding to the other instance(s) LMON process. 
&lt;BR /&gt;d) An LMD process is blocked or spinning. &lt;BR /&gt;In this case the ORA-29740 error is recorded when there are communication &lt;BR /&gt;issues between the instances. It is an indication that an instance has been &lt;BR /&gt;evicted from the configuration as a result of an IPC send timeout. A &lt;BR /&gt;communications failure between a foreground, or background other than LMON, &lt;BR /&gt;and a remote LMD will also generate an ORA-29740 with reason 3. When this &lt;BR /&gt;occurs, the trace file of the process experiencing the error will print a &lt;BR /&gt;message: &lt;BR /&gt;Reporting Communication error with instance: &lt;BR /&gt;If communication is lost at the cluster layer (for example, network cables &lt;BR /&gt;are pulled), the cluster software may also perform node evictions in the &lt;BR /&gt;event of a cluster split-brain. Oracle will detect a possible split-brain &lt;BR /&gt;and wait for cluster software to resolve the split-brain. If cluster &lt;BR /&gt;software does not resolve the split-brain within a specified interval, &lt;BR /&gt;Oracle proceeds with evictions. &lt;BR /&gt;Oracle Support has seen cases where resource starvation (CPU, I/O, etc...) can &lt;BR /&gt;cause an instance to be evicted with this reason code. The LMON or LMD process &lt;BR /&gt;could be blocked waiting for resources and not respond to polling by the remote &lt;BR /&gt;instance(s). This could cause that instance to be evicted. If you have &lt;BR /&gt;a statspack report available from the time just prior to the eviction on the &lt;BR /&gt;evicted instance, check for poor I/O times and high CPU utilization. Poor I/O &lt;BR /&gt;times would be an average read time of &amp;gt; 20ms. &lt;BR /&gt;Common causes for an ORA-29740 eviction (Reason 3): &lt;BR /&gt;a) Network Problems. &lt;BR /&gt;b) Resource Starvation (CPU, I/O, etc..) &lt;BR /&gt;c) Severe Contention in Database. &lt;BR /&gt;d) An Oracle bug. 
&lt;BR /&gt;Common bugs for reason 3 evictions: &lt;BR /&gt;BUG 2276622 - ORA-29740 (Reason 3) possible in RAC under heavy load &lt;BR /&gt;Fixed-Releases: 9014+ 9202+ &lt;BR /&gt;BUG 2994260 - IPCSOCK_SEND FAILED WITH STATUS: 10054 (Windows only) &lt;BR /&gt;Fixed-Releases: 9203 with patch or 9204+ &lt;BR /&gt;BUG 2210879 - ORACLE PROCESS CRASHES, WITH ASSERTION FAILURE IN LOWFAT &lt;BR /&gt;SKGXP CODE (HP-UX only with clic interface) &lt;BR /&gt;Fixed-Releases: Fixed by HP in PHNE 26551 or above. &lt;BR /&gt;Tips for tuning inter-instance performance can be found in the following note: &lt;BR /&gt;Note 181489.1 &lt;BR /&gt;Tuning Inter-Instance Performance in RAC and OPS &lt;BR /&gt;If you feel that this eviction was not correct, do a search in Metalink or the &lt;BR /&gt;bug database for: &lt;BR /&gt;ORA-29740 'reason 3' &lt;BR /&gt;Important files to review are: &lt;BR /&gt;a) Each instance's alert log &lt;BR /&gt;b) Each instance's LMON trace file &lt;BR /&gt;c) each instance's LMD trace file &lt;BR /&gt;d) Statspack reports from all nodes leading up to the eviction &lt;BR /&gt;e) Other bdump or udump files... &lt;BR /&gt;f) Each node's syslog or messages file &lt;BR /&gt;g) Netstat -i and netstat -s output &lt;BR /&gt;----------------------------------------------------------------------------- &lt;BR /&gt;References : &lt;BR /&gt;[NOTE:139435.1] Fast Reconfiguration in 9i Real Application Clusters &lt;BR /&gt;[BUG:2276622] ORA-29740 UNDER HEAVY LOAD &lt;BR /&gt;[BUG:1999778] RAC/OPS DATABASE CRASHES WITH ORA-29740 ON RESTART ON FAILED SYSTEM &lt;BR /&gt;[BUG:2529223] INSTANCE EVICTED WITH ORA-29740 &lt;BR /&gt;[NOTE:175678.1] RAC Instances Crash with ORA-29740 or ORA-600 [ksxpwait5] on IBM AIX &lt;BR /&gt;[NOTE:212381.1] RAC: Cluster Node evicted due to Change of System Time</description>
      <pubDate>Thu, 20 May 2004 12:21:31 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282164#M706881</guid>
      <dc:creator>Navin Bhat_2</dc:creator>
      <dc:date>2004-05-20T12:21:31Z</dc:date>
    </item>
    <item>
      <title>Re: RAC/SG split brain</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282165#M706882</link>
      <description>Hi Marvin,&lt;BR /&gt;How did you configure the interconnect for RAC DLM traffic? If you specify an interconnect that is different from the LAN where the hostname resides, then you may run into issues where Oracle and SG have different views of the connectivity.&lt;BR /&gt;</description>
      <pubDate>Thu, 20 May 2004 16:19:51 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282165#M706882</guid>
      <dc:creator>JW_8</dc:creator>
      <dc:date>2004-05-20T16:19:51Z</dc:date>
    </item>
    <item>
      <title>Re: RAC/SG split brain</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282166#M706883</link>
      <description>This is actually an Oracle error message, and it appears to indicate that there has been a communication loss between the two nodes.&lt;BR /&gt;This is especially true if you have Hyperfabric between the nodes, as RAC likes to use this for its comms.&lt;BR /&gt;I suggest you raise a call with Oracle.&lt;BR /&gt;</description>
      <pubDate>Thu, 20 May 2004 16:23:51 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282166#M706883</guid>
      <dc:creator>melvyn burnard</dc:creator>
      <dc:date>2004-05-20T16:23:51Z</dc:date>
    </item>
    <item>
      <title>Re: RAC/SG split brain</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282167#M706884</link>
      <description>What storage are you using, and how is it connected? A two-node cluster needs a cluster lock disk, which is why only certain storage is supported. Not sure how you could set it up without one, but it might be good to confirm that you have a supported hardware configuration and that the cluster lock was properly set up.</description>
      <pubDate>Thu, 20 May 2004 23:16:33 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282167#M706884</guid>
      <dc:creator>Ted Buis</dc:creator>
      <dc:date>2004-05-20T23:16:33Z</dc:date>
    </item>
    <item>
      <title>Re: RAC/SG split brain</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282168#M706885</link>
      <description>We are using EMC storage, and there is one cluster lock disk; in fact, you must have a cluster lock disk for a 2-node cluster.&lt;BR /&gt;&lt;BR /&gt;I have been informed that the split-brain error message is normal when losing a connection to one of the nodes, and it is nothing to worry about, since it was resolved very quickly if you look at the timestamps.&lt;BR /&gt;&lt;BR /&gt;So I guess I was barking up the wrong tree.&lt;BR /&gt;&lt;BR /&gt;Still investigating why the first node evicts the second node when I disconnect the first node.&lt;BR /&gt;</description>
      <pubDate>Fri, 21 May 2004 08:40:36 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/rac-sg-split-brain/m-p/3282168#M706885</guid>
      <dc:creator>Marvin Strong</dc:creator>
      <dc:date>2004-05-21T08:40:36Z</dc:date>
    </item>
  </channel>
</rss>