Friday, April 24, 2026

Split Brain Syndrome in Oracle RAC



🧠 What Every DBA Must Know

In a clustered database environment like Oracle RAC, maintaining data consistency across nodes is critical. But what happens when nodes suddenly stop communicating with each other?

Welcome to one of the most critical scenarios in RAC:

⚠️ Split Brain Syndrome


🔍 What is Split Brain?

In an Oracle RAC cluster, nodes communicate using a private interconnect.

👉 If this interconnect fails:

  • Nodes cannot see each other

  • Each node assumes others are down

  • Each continues processing independently

This leads to:

❌ Multiple “brains” operating simultaneously
❌ No coordination
❌ High risk of data corruption


⚡ Real Problem Explained

Imagine a 2-node RAC:

  • Node 1 updates a data block

  • Node 2 updates the same block

  • No communication between them

👉 Result:

💥 Data inconsistency / corruption
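
To make the problem concrete, here is a toy sketch in Python of two nodes changing the same block with no coordination. It is not how Oracle stores blocks internally; `shared_disk`, `read_block`, and `write_block` are hypothetical names used only for illustration.

```python
# Toy model: two instances cache the same block and write it back
# without coordinating. All names are illustrative, not Oracle internals.

shared_disk = {"block_42": {"balance": 100}}

def read_block(block_id):
    # Each node works on its own private copy of the block.
    return dict(shared_disk[block_id])

def write_block(block_id, image):
    # The whole block image is overwritten: last writer wins.
    shared_disk[block_id] = image

node1_copy = read_block("block_42")
node2_copy = read_block("block_42")

node1_copy["balance"] += 50   # node 1 credits 50
node2_copy["balance"] -= 30   # node 2 debits 30

write_block("block_42", node1_copy)
write_block("block_42", node2_copy)   # silently discards node 1's change

final_balance = shared_disk["block_42"]["balance"]
print(final_balance)  # 70, although 100 + 50 - 30 should be 120
```

Node 1's update is lost because neither node knew the other held a copy of the block. That is exactly the coordination the interconnect normally provides.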


🧩 Why Does This Happen?

Split brain occurs when:

  • Private interconnect fails

  • Network heartbeat is lost

  • Nodes are still physically UP

  • Database instances continue running

Each node thinks:

“I am the only surviving node.”


🔁 Types of Heartbeats in RAC

Oracle uses two mechanisms to detect node health:

1️⃣ Network Heartbeat

  • Via private interconnect

  • Fast communication between nodes


2️⃣ Disk Heartbeat

  • Via Voting Disk

  • Written continuously, and used to confirm which nodes are really alive when the network heartbeat is lost
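
As a rough sketch of how two heartbeats can be judged separately, the Python below classifies a node from the age of its last network and disk heartbeats. The thresholds are the commonly cited CSS defaults (misscount 30 seconds, disktimeout 200 seconds) and are assumptions here; `NodeHeartbeats` and `classify` are hypothetical names, not Clusterware APIs.

```python
from dataclasses import dataclass

# Assumed thresholds (commonly cited CSS defaults); check your own cluster.
MISSCOUNT_SECONDS = 30      # network heartbeat timeout
DISKTIMEOUT_SECONDS = 200   # voting disk heartbeat timeout

@dataclass
class NodeHeartbeats:
    name: str
    last_network_hb: float  # timestamp of last network heartbeat, in seconds
    last_disk_hb: float     # timestamp of last voting disk heartbeat, in seconds

def classify(node, now):
    """Return a short status string based on the age of each heartbeat."""
    network_ok = now - node.last_network_hb <= MISSCOUNT_SECONDS
    disk_ok = now - node.last_disk_hb <= DISKTIMEOUT_SECONDS
    if network_ok and disk_ok:
        return "healthy"
    if disk_ok:
        return "network heartbeat lost, still writing to voting disk"
    if network_ok:
        return "voting disk heartbeat lost"
    return "no heartbeat at all"

now = 1000.0
print(classify(NodeHeartbeats("node1", 990.0, 995.0), now))
print(classify(NodeHeartbeats("node2", 960.0, 995.0), now))
```

The second case is the split brain candidate: the node is clearly alive (it still writes to the voting disk) but nobody can reach it over the interconnect.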


🗳️ Role of Voting Disk

The Voting Disk is the brain behind conflict resolution.

👉 Each node:

  • Writes its presence

  • Checks connectivity with others


🔥 In a Split Brain Scenario:

  • Nodes form sub-clusters

  • Each group tries to claim majority

  • CSSD uses the voting disk to decide:

✅ Which nodes survive
❌ Which nodes get evicted
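
A related rule is that a node must be able to access more than half of the configured voting disks to stay in the cluster. Here is a minimal Python sketch of that check; `has_voting_disk_quorum` is a hypothetical helper name, not a Clusterware call.

```python
def has_voting_disk_quorum(accessible, total):
    """True if the node can access strictly more than half of the voting disks."""
    if total <= 0 or not 0 <= accessible <= total:
        raise ValueError("accessible must be between 0 and total, and total > 0")
    return accessible * 2 > total

for total in (1, 2, 3, 5):
    # Smallest number of disks a node must still see to keep quorum.
    needed = next(n for n in range(total + 1) if has_voting_disk_quorum(n, total))
    print(f"{total} voting disk(s): need {needed}, can lose {total - needed}")
```

With 2 voting disks a node can lose none of them, and with 3 it can lose one. This is why an odd number of voting disks is normally configured.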


⚖️ Who Wins?

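👉 Majority rule applies: the largest sub-cluster survives. If the sub-clusters are the same size (as in a 2-node cluster), older releases keep the sub-cluster containing the lowest-numbered node, while newer releases can also take server weight into account.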

Example:

  • 10-node RAC cluster

  • 6 nodes can communicate

  • 4 nodes isolated

👉 Result:

  • 6-node group survives

  • 4-node group gets evicted
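
The rule in this example can be sketched in a few lines of Python. This is a simplified model, not Oracle's actual algorithm: `choose_survivors` is a hypothetical function, and breaking a tie in favour of the group with the lowest node number is an assumption based on the traditional behaviour.

```python
def choose_survivors(subclusters):
    """Pick the surviving sub-cluster from a list of sets of node numbers.

    Largest group wins. On a tie, the group holding the lowest node
    number wins (assumed tie-break, simplified).
    """
    return max(subclusters, key=lambda group: (len(group), -min(group)))

# The 10-node example: nodes 1-6 can talk to each other, nodes 7-10 are isolated.
groups = [set(range(1, 7)), set(range(7, 11))]
survivors = choose_survivors(groups)
evicted = set().union(*groups) - survivors

print(sorted(survivors))  # [1, 2, 3, 4, 5, 6]
print(sorted(evicted))    # [7, 8, 9, 10]
```

In a 2-node split, `choose_survivors([{1}, {2}])` returns `{1}` under the assumed tie-break.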


🚫 Node Eviction – Who Does It?

The eviction is handled by:

👉 CSSD (Cluster Synchronization Services Daemon)


🔧 CSSD Responsibilities:

  • Monitor node health

  • Check heartbeats

  • Detect communication failure

  • Evict problematic nodes


⚙️ How CSSD Monitors Nodes

Mechanism         | Purpose
Network Heartbeat | Interconnect communication
Disk Heartbeat    | Voting disk verification

⚡ Node Eviction Process

When a node is unhealthy:

  1. CSSD detects heartbeat failure

  2. Voting disk validation occurs

  3. Node is forcibly evicted

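  4. Node is usually rebooted automatically (since 11.2.0.2, Clusterware first tries to restart only its own stack on that node and reboots if that fails)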

  5. Cluster reconfigures


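🚨 Common Error

ORA-29740: evicted by member <n>, group incarnation <n>

👉 Indicates:

  • An instance was evicted from the cluster database by another instance

  • Cluster protection mechanism triggered

  • Node-level evictions by CSSD show up in the Clusterware alert and CSSD logs rather than as an ORA- error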


🧪 Real-World Scenario

Situation:

  • 4-node RAC

  • Node 3 loses interconnect

What happens:

  • Node 1 detects issue

  • Voting disk confirms

  • Node 3 is evicted

  • Remaining nodes continue
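
This scenario can be modelled by treating the working interconnect links as a graph and grouping the nodes that can still reach each other. The Python below is an illustrative sketch only; `find_subclusters` is a hypothetical helper, and the list of links is made up to match the 4-node example.

```python
def find_subclusters(nodes, links):
    """Group nodes that can still reach each other over working links."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first walk collects everything reachable from this node.
        group, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(neighbours[n] - group)
        seen |= group
        groups.append(group)
    return groups

nodes = {1, 2, 3, 4}
links = [(1, 2), (1, 4), (2, 4)]   # node 3 has lost all interconnect links

groups = find_subclusters(nodes, links)
survivors = max(groups, key=len)   # largest sub-cluster stays
evicted = nodes - survivors

print(sorted(survivors), sorted(evicted))  # [1, 2, 4] [3]
```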


🔄 Why Eviction is Important

Eviction is NOT a failure.

👉 It is a protection mechanism

Without eviction:

  • Multiple nodes update same data

  • Corruption occurs

With eviction:

✅ Data integrity is preserved


🧠 Key DBA Takeaways

  • Split brain is a network-related issue

  • Always ensure:

    • Redundant interconnect

    • Stable network

  • Monitor:

    • CSSD logs

    • Clusterware alerts


🎯 Interview Questions


❓ What is Split Brain in RAC?

👉 When nodes cannot communicate but continue working independently, risking data corruption.


❓ How is Split Brain resolved?

👉 Using Voting Disk + CSSD


❓ Who evicts nodes?

👉 CSSD process


❓ What decides survival?

👉 Voting Disk (majority rule)


❓ What is fencing?

👉 Isolating/evicting a node to protect cluster integrity.


🚀 Final Thought

“In RAC, survival is not about being alive…
It’s about being connected.”


