<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic serviceguard problem in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448383#M703759</link>
    <description>Hi,&lt;BR /&gt;&lt;BR /&gt;We just had an awful crash:&lt;BR /&gt;We have a 2-node RAC cluster with CVM.&lt;BR /&gt;One of the nodes had a file table overflow a few days ago that went unnoticed. &lt;BR /&gt;Neither Oracle nor CVM had shown problems. However, we decided to reboot the node for good measure. (It was the CVM master node, btw.)&lt;BR /&gt;While the node was shutting down, the second node panicked and went down in a flash.&lt;BR /&gt;Needless to say, we were not prepared for it and the RAC was open on the second node. This caused MAJOR data corruption.&lt;BR /&gt;&lt;BR /&gt;Now we restore.&lt;BR /&gt;&lt;BR /&gt;Why, why, why do these things happen?</description>
    <pubDate>Tue, 21 Dec 2004 14:25:46 GMT</pubDate>
    <dc:creator>uvc</dc:creator>
    <dc:date>2004-12-21T14:25:46Z</dc:date>
    <item>
      <title>serviceguard problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448383#M703759</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;We just had an awful crash:&lt;BR /&gt;We have a 2-node RAC cluster with CVM.&lt;BR /&gt;One of the nodes had a file table overflow a few days ago that went unnoticed. &lt;BR /&gt;Neither Oracle nor CVM had shown problems. However, we decided to reboot the node for good measure. (It was the CVM master node, btw.)&lt;BR /&gt;While the node was shutting down, the second node panicked and went down in a flash.&lt;BR /&gt;Needless to say, we were not prepared for it and the RAC was open on the second node. This caused MAJOR data corruption.&lt;BR /&gt;&lt;BR /&gt;Now we restore.&lt;BR /&gt;&lt;BR /&gt;Why, why, why do these things happen?</description>
      <pubDate>Tue, 21 Dec 2004 14:25:46 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448383#M703759</guid>
      <dc:creator>uvc</dc:creator>
      <dc:date>2004-12-21T14:25:46Z</dc:date>
    </item>
    <item>
      <title>Re: serviceguard problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448384#M703760</link>
      <description>These things generally happen because the ServiceGuard configuration is inadequate and untested.&lt;BR /&gt;&lt;BR /&gt;I'm in the process of building SG in the lab with two ancient D class servers. The point is to learn how to configure and test the product and develop procedures for coding proper monitoring scripts and other configurations.&lt;BR /&gt;&lt;BR /&gt;I will be able to test this setup without fear of hurting anything. Prior to going live with any SG configuration I will test several failure scenarios.&lt;BR /&gt;&lt;BR /&gt;It does appear that your monitor scripts are a problem, and the alert.ora logs are not being looked at and acted upon. Very simple scripts can check these and email you before trouble happens.&lt;BR /&gt;&lt;BR /&gt;I know you have a crisis now, but once you have everything put back together, write and execute a test plan on this setup.&lt;BR /&gt;&lt;BR /&gt;SEP</description>
      <pubDate>Tue, 21 Dec 2004 14:43:04 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448384#M703760</guid>
      <dc:creator>Steven E. Protter</dc:creator>
      <dc:date>2004-12-21T14:43:04Z</dc:date>
    </item>
    <item>
      <title>Re: serviceguard problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448385#M703761</link>
      <description>What really annoys me is the 'solidarity' of the nodes in ServiceGuard. &lt;BR /&gt;You can't trust it in anything but the simplest situations. If a network segment goes down - poof, you're dead. If a single node is experiencing software problems (CVM) - poof, you're dead. If you get multiple disk failures - poof, you're dead. &lt;BR /&gt;&lt;BR /&gt;All these have happened to me in various S/G configurations. The solution is usually to disable as much functionality as you can, or to tediously get the machine to a 'stable state' in which the cluster works OK, until something (software or hardware) changes, at which point all the cluster testing needs to be done again. &lt;BR /&gt;&lt;BR /&gt;It's just bad. Really. It causes more downtime than it saves.&lt;BR /&gt;</description>
      <pubDate>Tue, 21 Dec 2004 14:52:15 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448385#M703761</guid>
      <dc:creator>uvc</dc:creator>
      <dc:date>2004-12-21T14:52:15Z</dc:date>
    </item>
    <item>
      <title>Re: serviceguard problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448386#M703762</link>
      <description>Sounds like you do not have your monitoring or configuration set correctly.&lt;BR /&gt;It may be worth looking at what happened, treating it as a learning experience, and fixing whatever did not work or react correctly.&lt;BR /&gt;</description>
      <pubDate>Tue, 21 Dec 2004 14:54:36 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448386#M703762</guid>
      <dc:creator>melvyn burnard</dc:creator>
      <dc:date>2004-12-21T14:54:36Z</dc:date>
    </item>
    <item>
      <title>Re: serviceguard problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448387#M703763</link>
      <description>Gee, I thought this thread had a different subject the first time I saw it!!!&lt;BR /&gt;&lt;BR /&gt;;^)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Pete&lt;BR /&gt;&lt;BR /&gt;P.S.  I do offer my commiseration.</description>
      <pubDate>Tue, 21 Dec 2004 14:55:54 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448387#M703763</guid>
      <dc:creator>Pete Randall</dc:creator>
      <dc:date>2004-12-21T14:55:54Z</dc:date>
    </item>
    <item>
      <title>Re: serviceguard problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448388#M703764</link>
      <description>Men at some time are masters of their fates; &lt;BR /&gt;The fault is not in our stars, &lt;BR /&gt;But in ourselves. &lt;BR /&gt;&lt;BR /&gt;Julius Caesar - I. 2. &lt;BR /&gt;&lt;BR /&gt;I've run one MC/SG production cluster for over 5.5 years without a single package failover that was not intentionally initiated --- and with zero unplanned downtime. If set up correctly, MC/SG is extremely robust. In fact, your carping about losing one network connection and dying is the key. Your configuration is not nearly robust enough. Network cable/NIC/switch failures should be considered routine events and should not cause any failures. Disk replacements should be considered absolutely routine tasks. The whole point is to have your systems so well configured that MC/SG itself very rarely comes into play.&lt;BR /&gt;&lt;BR /&gt;When a cluster and its packages are well constructed and configured robustly, you should be able (and should actually try) to yank any one thing --- including yanking a server's power cable, yanking a disk out, turning off a network switch, ... --- and the package should continue to function with at most a switch to an alternate node.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 21 Dec 2004 15:08:29 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448388#M703764</guid>
      <dc:creator>A. Clay Stephenson</dc:creator>
      <dc:date>2004-12-21T15:08:29Z</dc:date>
    </item>
    <item>
      <title>Re: serviceguard problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448389#M703765</link>
      <description>I saw such "solidarity" problems on a 2-node cluster with a lock disk.&lt;BR /&gt;The reason was a total loss of heartbeat, and the 2 nodes started a "fight" for the cluster lock.&lt;BR /&gt;There is a 50:50 chance that the "wrong" node gets the cluster lock and the "right" one reboots. Shortly after that the other node gets TOC'ed because of its own problems.&lt;BR /&gt;&lt;BR /&gt;Maybe this happened to you?&lt;BR /&gt;Check the OLDsyslog.log for information.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 22 Dec 2004 04:15:17 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448389#M703765</guid>
      <dc:creator>Armin Kunaschik</dc:creator>
      <dc:date>2004-12-22T04:15:17Z</dc:date>
    </item>
    <item>
      <title>Re: serviceguard problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448390#M703766</link>
      <description>My condolences on the data corruption.&lt;BR /&gt;The other node should have survived the shutdown.  Though it is too late to prevent the disaster this time, investigate the OLDsyslog.log on both servers (and potentially the package control logs) to try to determine the source of the problem.&lt;BR /&gt;&lt;BR /&gt;As for the issue of proper cluster arbitration when all HB NICs suffer an outage on one node, the online manual discusses how to use the Serial Heartbeat feature to prevent the "dead" node from winning the cluster lock disk arbitration race:&lt;BR /&gt;  &lt;A href="http://docs.hp.com/en/B3936-90079/ch02s02.html" target="_blank"&gt;http://docs.hp.com/en/B3936-90079/ch02s02.html&lt;/A&gt;&lt;BR /&gt;Caveats apply - search the manual for all references to the Serial Heartbeat concepts.&lt;BR /&gt;&lt;BR /&gt;-StephenD.&lt;BR /&gt;</description>
      <pubDate>Fri, 24 Dec 2004 11:00:11 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/serviceguard-problem/m-p/3448390#M703766</guid>
      <dc:creator>Stephen Doud</dc:creator>
      <dc:date>2004-12-24T11:00:11Z</dc:date>
    </item>
  </channel>
</rss>

