Wednesday, July 1, 2026

Oracle RAC Monitoring Framework

For production environments, I recommend building a modular RAC monitoring framework rather than a single script. A modular layout is easier to schedule, troubleshoot, and extend.

Directory Structure

rac_monitoring/
├── rac_health_check.sh
├── db_health.sql
├── asm_health.sql
├── wait_events.sql
├── blocking_sessions.sql
├── tablespace.sql
├── fra_usage.sql
├── archive_log.sql
├── cpu_memory.sh
├── alert_log.sh
├── generate_report.sh
├── reports/
├── logs/
└── config.env

1. Configuration File (config.env)

#!/bin/bash

export ORACLE_BASE=/u01/app/oracle
export GRID_HOME=/u01/app/19.0.0/grid
export ORACLE_HOME=/u01/app/oracle/product/19.0.0/dbhome_1

export ORACLE_SID=PROD1

export PATH=$GRID_HOME/bin:$ORACLE_HOME/bin:$PATH

DB_NAME=PROD

REPORT_DIR=/home/oracle/rac_monitoring/reports
LOG_DIR=/home/oracle/rac_monitoring/logs

DATE=$(date +"%Y%m%d_%H%M%S")

REPORT=${REPORT_DIR}/RAC_Health_${DATE}.html
LOGFILE=${LOG_DIR}/RAC_Health_${DATE}.log
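
Every script in the framework writes into REPORT_DIR and LOG_DIR, so a missing or read-only directory stops a run before any check executes. Here is a minimal sketch of a guard for that case. The function name prepare_dirs is my own and is not part of the framework above.

```shell
#!/bin/bash
# Sketch only: prepare_dirs is a hypothetical helper name.
# Creates every directory passed as an argument and confirms it is writable.
prepare_dirs() {
    local dir
    for dir in "$@"; do
        if ! mkdir -p "$dir" 2>/dev/null; then
            echo "ERROR: cannot create $dir" >&2
            return 1
        fi
        if [ ! -w "$dir" ]; then
            echo "ERROR: $dir is not writable" >&2
            return 1
        fi
    done
    return 0
}

# Possible use right after sourcing config.env:
# prepare_dirs "$REPORT_DIR" "$LOG_DIR" || exit 1
```

Calling it before the log redirection means a configuration mistake shows up on the terminal or in the cron mail instead of disappearing silently.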

2. RAC Health Check Script (rac_health_check.sh)

#!/bin/bash

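# Resolve config.env relative to this script so cron can run it from any directory
source "$(dirname "$0")/config.env"

# Send both stdout and stderr to the log file
exec > "$LOGFILE" 2>&1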

echo "==============================================="
echo "Oracle RAC Health Check"
echo "Server : $(hostname)"
echo "Date   : $(date)"
echo "==============================================="

echo
echo "=============================="
echo "Clusterware Status"
echo "=============================="
crsctl check crs

echo
echo "=============================="
echo "Cluster Resources"
echo "=============================="
crsctl stat res -t

echo
echo "=============================="
echo "Node Status"
echo "=============================="
olsnodes -n -s

echo
echo "=============================="
echo "ASM Status"
echo "=============================="
srvctl status asm

echo
echo "=============================="
echo "Diskgroups"
echo "=============================="
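# asmcmd needs the Grid home and the local ASM SID (+ASM1 is an assumption; adjust per node)
ORACLE_HOME=$GRID_HOME ORACLE_SID=+ASM1 asmcmd lsdg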

echo
echo "=============================="
echo "Database Status"
echo "=============================="
srvctl status database -d ${DB_NAME}

echo
echo "=============================="
echo "Services"
echo "=============================="
srvctl status service -d ${DB_NAME}

echo
echo "=============================="
echo "Listener"
echo "=============================="
srvctl status listener

echo
echo "=============================="
echo "SCAN Listener"
echo "=============================="
srvctl status scan_listener

echo
echo "=============================="
echo "VIP"
echo "=============================="
# srvctl status vip needs -n <node_name>; nodeapps reports the VIPs for all nodes
srvctl status nodeapps

echo
echo "=============================="
echo "OCR"
echo "=============================="
ocrcheck

echo
echo "=============================="
echo "Voting Disk"
echo "=============================="
crsctl query css votedisk

echo
echo "Health Check Completed"
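
The script repeats the same banner-then-command pattern for every check and never records whether a command failed. A small wrapper could remove the repetition and keep a failure count. This is a sketch under my own naming (run_check and FAILED_CHECKS are hypothetical). It only catches commands that exit with a non-zero status, which Oracle utilities do not always do when a resource is merely offline.

```shell
#!/bin/bash
# Sketch only: run_check and FAILED_CHECKS are hypothetical names.
FAILED_CHECKS=0

# Print a section banner, run the given command, and count non-zero exit codes.
run_check() {
    local title=$1
    shift
    echo
    echo "=============================="
    echo "$title"
    echo "=============================="
    if ! "$@"; then
        echo "WARNING: '$*' exited with a non-zero status"
        FAILED_CHECKS=$((FAILED_CHECKS + 1))
    fi
}

# Example calls (commented out so the sketch can be sourced on any host):
# run_check "Clusterware Status" crsctl check crs
# run_check "Node Status" olsnodes -n -s
# echo "Checks with errors: $FAILED_CHECKS"
```

The final count gives the report a simple basis for an overall PASS or FAIL line.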

3. Wait Event Monitoring (wait_events.sql)

set lines 200
col event format a45

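-- time_waited and average_wait are reported in centiseconds; idle waits are excluded
SELECT
event,
total_waits,
time_waited,
average_wait
FROM v$system_event
WHERE wait_class <> 'Idle'
ORDER BY time_waited DESC
FETCH FIRST 20 ROWS ONLY;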

4. Blocking Sessions

set lines 200

SELECT
inst_id,
sid,
serial#,
username,
blocking_session,
seconds_in_wait,
event
FROM gv$session
WHERE blocking_session IS NOT NULL;

5. ASM Monitoring

set lines 200

SELECT
name,
state,
type,
total_mb,
free_mb,
ROUND(free_mb*100/NULLIF(total_mb,0),2) FREE_PERCENT
FROM
v$asm_diskgroup;

6. Tablespace Monitoring

SELECT
tablespace_name,
ROUND(used_percent,2) USED_PERCENT
FROM dba_tablespace_usage_metrics
ORDER BY used_percent DESC;
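
The query lists every tablespace, but a daily report usually only needs the ones close to full. Here is a minimal sketch of a filter, assuming the query output has been spooled as plain "name percent" lines (for example with SET HEADING OFF, FEEDBACK OFF and PAGESIZE 0 in SQL*Plus). The function name and the default threshold of 85 are my own choices.

```shell
#!/bin/bash
# Sketch only: flag_full_tablespaces is a hypothetical helper name.
# Reads "<tablespace_name> <used_percent>" lines on stdin and prints the
# tablespaces at or above the threshold (first argument, default 85).
flag_full_tablespaces() {
    local threshold=${1:-85}
    awk -v limit="$threshold" 'NF >= 2 && $2 + 0 >= limit { printf "%s %s\n", $1, $2 }'
}

# Example with inline data:
# printf 'SYSTEM 72\nSYSAUX 91.5\n' | flag_full_tablespaces 85
```

An empty result means no tablespace crossed the threshold, which makes the function easy to use in an if test.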

7. FRA Monitoring

SELECT
SPACE_LIMIT/1024/1024 MB_LIMIT,
SPACE_USED/1024/1024 MB_USED,
SPACE_RECLAIMABLE/1024/1024 MB_RECLAIMABLE
FROM
V$RECOVERY_FILE_DEST;
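
The three columns are more useful combined into one figure. Oracle can delete reclaimable files when it needs space, so the pressure on the FRA is roughly (used - reclaimable) / limit. A small sketch of that arithmetic follows. The function name is hypothetical and the inputs are the MB values returned by the query.

```shell
#!/bin/bash
# Sketch only: fra_used_percent is a hypothetical helper name.
# Usage: fra_used_percent <limit_mb> <used_mb> <reclaimable_mb>
# Prints the net FRA usage as a percentage with two decimals.
fra_used_percent() {
    awk -v limit="$1" -v used="$2" -v reclaim="$3" 'BEGIN {
        if (limit <= 0) { print "n/a"; exit }
        printf "%.2f\n", (used - reclaim) * 100 / limit
    }'
}

# Example: a 10240 MB limit, 5120 MB used, 1024 MB reclaimable
# fra_used_percent 10240 5120 1024
```

With the example values the net usage is 4096 MB out of 10240 MB, so the function prints 40.00.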

8. Archive Log Generation

SELECT
TRUNC(first_time) LOG_DATE,
COUNT(*) ARCHIVED_LOGS,
ROUND(SUM(blocks*block_size)/1024/1024/1024,2) GB
FROM
v$archived_log
GROUP BY
TRUNC(first_time)
ORDER BY
1 DESC;

9. CPU & Memory Monitoring (cpu_memory.sh)

#!/bin/bash

echo "========== CPU =========="
top -bn1 | head -5

echo

echo "========== Memory =========="
free -g

echo

echo "========== Swap =========="
swapon -s

echo

echo "========== Disk =========="
df -h

10. Alert Log Monitoring (alert_log.sh)

#!/bin/bash

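# With more than one ADR home, select one first, for example:
#   adrci exec="set homepath diag/rdbms/<db_unique_name>/<instance_name>; show alert -tail 200"
adrci exec="show alert -tail 200"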
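
Two hundred raw alert log lines are hard to scan in an email. A short summary of the ORA- codes they contain is often enough to decide whether to look closer. The sketch below assumes the alert log text arrives on stdin. The function name is my own.

```shell
#!/bin/bash
# Sketch only: summarize_ora_errors is a hypothetical helper name.
# Reads alert log text on stdin and prints "<count> <ORA-code>" lines,
# most frequent first. Prints nothing when no ORA- code is present.
summarize_ora_errors() {
    { grep -oE 'ORA-[0-9]+' || true; } \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2,2 \
        | awk '{ print $1, $2 }'
}

# Possible use:
# adrci exec="show alert -tail 200" | summarize_ora_errors
```

The "|| true" keeps the pipeline from reporting a failure when grep finds nothing, which matters if the calling script runs with "set -o pipefail".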

11. Cluster Log Collection

#!/bin/bash

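# Run as root from the Grid home; on recent releases Oracle Trace File Analyzer
# (tfactl diagcollect) is the usual collector
diagcollection.pl --collect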

12. Email Report

mailx -s "Oracle RAC Health Report $(hostname)" \
shashi_dba@shashidba.com < "$LOGFILE"

13. Cron Scheduling

Run every hour:

0 * * * * /home/oracle/rac_monitoring/rac_health_check.sh

Run daily at 8 AM:

0 8 * * * /home/oracle/rac_monitoring/rac_health_check.sh

Run every Sunday at 6 AM:

0 6 * * 0 /home/oracle/rac_monitoring/rac_health_check.sh
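
If all three entries are installed together, the hourly job also fires at 08:00 and at 06:00 on Sundays, so two copies of the script can start in the same minute and write to the same timestamped log. A simple lock avoids that. This is a minimal sketch using an atomic mkdir. The function names and the lock path are assumptions of mine.

```shell
#!/bin/bash
# Sketch only: acquire_lock and release_lock are hypothetical helper names.
# mkdir either creates the directory or fails, so it works as an atomic lock.
acquire_lock() {
    local lock_dir=$1
    if mkdir "$lock_dir" 2>/dev/null; then
        return 0
    fi
    echo "Another run holds $lock_dir; skipping this run" >&2
    return 1
}

# Remove the lock directory; errors are ignored if it is already gone.
release_lock() {
    rmdir "$1" 2>/dev/null
}

# Possible use near the top of rac_health_check.sh (lock path is an assumption):
# acquire_lock /tmp/rac_health_check.lock || exit 0
# trap 'release_lock /tmp/rac_health_check.lock' EXIT
```

If a run is killed without the trap firing, the stale lock directory has to be removed by hand, which is the main trade-off of this approach.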

Sample Health Check Output

===================================================
Oracle RAC Health Check
===================================================

Hostname : racnode1
Date     : 01-Jul-2026 08:00

✔ CRS Status               ONLINE
✔ Cluster Resources        ONLINE
✔ Node Status              ACTIVE
✔ ASM                      RUNNING
✔ Diskgroups               DATA, RECO, OCR
✔ Database                 PROD OPEN
✔ Services                 RUNNING
✔ Listener                 RUNNING
✔ SCAN                     RUNNING
✔ VIP                      RUNNING
✔ OCR                      HEALTHY
✔ Voting Disk              NORMAL

Tablespace Usage
----------------------------
SYSTEM        72%
SYSAUX        61%
USERS         42%
TEMP          15%

ASM Usage
----------------------------
DATA      67%
RECO      58%

Blocking Sessions : NONE

Top Wait Event
----------------------------
db file sequential read

CPU Usage : 18%
Memory Usage : 63%

Overall RAC Health : PASS


Oracle RAC Administration Handbook

📘 Oracle RAC Administration Handbook (100–150 Pages)

Section 1 – Oracle RAC Fundamentals

  • Oracle RAC Architecture

  • RAC Components

  • Grid Infrastructure

  • Oracle Clusterware

  • ASM Architecture

  • Cache Fusion

  • Global Cache Service (GCS)

  • Global Enqueue Service (GES)

  • OCR & Voting Disk

  • SCAN, VIP, GNS

  • RAC Networking

  • RAC Storage Architecture

  • RAC vs Single Instance

  • RAC vs Data Guard

  • Real-world RAC Deployment Architecture


Section 2 – Oracle RAC Installation

  • Hardware Prerequisites

  • OS Configuration

  • Kernel Parameters

  • User Configuration

  • Passwordless SSH

  • Network Planning

  • Storage Planning

  • ASM Configuration

  • Grid Infrastructure Installation

  • RAC Database Installation

  • Post-installation Verification

  • Architecture diagrams throughout


Section 3 – RAC Administration

  • Instance Management

  • Service Management

  • Listener Management

  • SCAN Management

  • VIP Management

  • OCR Backup & Restore

  • Voting Disk Management

  • Node Addition

  • Node Deletion

  • Database Creation

  • Database Deletion

  • RAC Patching

  • OPatchAuto

  • Rolling Patch

  • One-off Patch

  • RU Upgrade


Section 4 – Oracle RAC Health Check Framework

This section expands the framework into approximately 25–30 pages.

Includes:

  • Clusterware Health Check

  • ASM Health Check

  • Database Health Check

  • Node Health Check

  • Listener Health Check

  • VIP Health Check

  • SCAN Health Check

  • OCR Health Check

  • Voting Disk Health Check

  • CRS Resource Health Check

  • Cache Fusion Monitoring

  • Interconnect Latency Checks

  • Redo Log Health

  • Undo Health

  • Tablespace Health

  • FRA Health

  • Archive Log Health

  • Alert Log Review

  • ADRCI Diagnostics

  • AWR Health Indicators

  • ASH Monitoring

  • Blocking Sessions

  • Wait Events

  • OS Monitoring

  • Filesystem Checks

Each topic will include:

  • Purpose

  • Commands

  • Sample outputs

  • Interpretation

  • Common issues

  • Troubleshooting steps

  • Best practices


Section 5 – RAC Monitoring Scripts

Cluster Health Script

#!/bin/bash

echo "================================="
echo "Oracle RAC Health Check"
echo "================================="

hostname

echo
echo "CRS Status"
crsctl check crs

echo
echo "Node Status"
olsnodes -s

echo
echo "ASM Status"
srvctl status asm

echo
echo "Diskgroups"
asmcmd lsdg

echo
echo "Database Status"
srvctl status database -d PROD

echo
echo "Services"
srvctl status service -d PROD

echo
echo "VIP Status"
# srvctl status vip needs -n <node_name>; nodeapps reports the VIPs for all nodes
srvctl status nodeapps

echo
echo "SCAN Listener"
srvctl status scan_listener

echo
echo "OCR"
ocrcheck

echo
echo "Voting Disk"
crsctl query css votedisk

echo
echo "Resources"
crsctl stat res -t

Wait Event Monitoring Script

SELECT
event,
total_waits,
time_waited
FROM
v$system_event
WHERE
wait_class <> 'Idle'
ORDER BY
time_waited DESC;

Blocking Session Script

SELECT
blocking_session,
sid,
serial#,
username,
event
FROM
gv$session
WHERE
blocking_session IS NOT NULL;

ASM Space Monitoring

SELECT
name,
total_mb,
free_mb,
ROUND(free_mb*100/NULLIF(total_mb,0),2) FREE_PERCENT
FROM
v$asm_diskgroup;

Cluster Resource Report

crsctl stat res -t

VIP Verification

srvctl status nodeapps

OCR Verification

ocrcheck

CRS Alert Monitoring

adrci

show alert

Cluster Log Collection

diagcollection.pl --collect

Section 6 – Automation Framework

The handbook will include a Daily Health Check Automation that generates HTML reports, CSV summaries, and email notifications.

Features:

  • Clusterware status

  • ASM status

  • Diskgroup utilization

  • Database status

  • Listener status

  • Services

  • SCAN

  • VIP

  • OCR

  • Voting disks

  • CPU

  • Memory

  • Disk usage

  • Top wait events

  • Blocking sessions

  • FRA usage

  • Archive log generation

  • Tablespace utilization

  • Alert log errors

  • CRS errors

Output formats:

  • HTML dashboard

  • CSV report

  • Email summary

  • Log file
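
As an illustration of the HTML output, here is a minimal sketch that turns a two-column CSV summary into a small HTML page. The "check,status" layout, the status value OK, and both function names are assumptions of mine. The sketch does not escape HTML special characters in the input.

```shell
#!/bin/bash
# Sketch only: csv_to_html_rows and render_html_report are hypothetical names.
# Reads "check,status" lines on stdin and prints one HTML table row per line.
# A status of OK is shown in green, anything else in red.
csv_to_html_rows() {
    awk -F, 'NF >= 2 {
        color = ($2 == "OK") ? "green" : "red"
        printf "<tr><td>%s</td><td style=\"color:%s\">%s</td></tr>\n", $1, color, $2
    }'
}

# Wrap the rows in a minimal page.
render_html_report() {
    echo "<html><body><h2>Oracle RAC Health Report</h2><table border=\"1\">"
    csv_to_html_rows
    echo "</table></body></html>"
}

# Possible use:
# render_html_report < summary.csv > "$REPORT"
```

Keeping the CSV as the single source and deriving the HTML from it means the CSV report and the dashboard cannot drift apart.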


Section 7 – Performance Tuning

  • Cache Fusion tuning

  • Interconnect tuning

  • ASM tuning

  • HugePages

  • NUMA

  • Linux kernel tuning

  • AWR analysis

  • ASH analysis

  • ADDM

  • SQL Monitoring

  • OSWatcher

  • ExaWatcher

  • Cluster Health Monitor (CHM)


Section 8 – Production Incident Runbooks (40+)

Examples include:

  • Node Eviction

  • CRS Won't Start

  • CSS Failure

  • ASM Disk Offline

  • OCR Corruption

  • Voting Disk Failure

  • VIP Not Failing Over

  • SCAN Listener Down

  • Split Brain

  • ORA-29740

  • ORA-29702

  • CRS-4535

  • CRS-4530

  • CRS-1606

  • PRCR-1079

  • PRCR-1064

  • ORA-15064

  • ORA-15032

  • ORA-15041

  • ORA-15042

  • ORA-00257

  • ORA-19809

  • Interconnect Packet Loss

  • High GCS Waits

  • gc buffer busy

  • gc cr request

  • gc current block busy

Each runbook will include:

  • Symptoms

  • Root cause

  • Diagnostic commands

  • Resolution steps

  • Validation

  • Prevention

  • Lessons learned


Section 9 – Oracle RAC Interview Guide

  • 500+ interview questions

  • L1 questions

  • L2 questions

  • L3 questions

  • Oracle ACE–level scenarios

  • Whiteboard architecture questions

  • Real production case studies


Section 10 – Architecture Diagrams

The handbook will contain over 50 professional diagrams, including:

  • Oracle RAC Architecture

  • Grid Infrastructure

  • Cache Fusion Flow

  • GCS/GES Communication

  • SCAN Listener Flow

  • VIP Failover

  • OCR Architecture

  • Voting Disk Layout

  • ASM Diskgroup Architecture

  • Redo Thread Architecture

  • RAC Networking

  • Client Connection Flow

  • Clusterware Stack

  • Service Failover

  • Node Eviction Flow

  • Split Brain Detection

  • CRS Startup Sequence

  • Rolling Patch Architecture

  • RAC + Data Guard Hybrid Architecture

  • RAC Backup Architecture

  • RAC Disaster Recovery Design


Oracle RAC Health Check Framework

Standard Operating Procedure (SOP)

Document Version: 1.0
Applicable Versions: Oracle RAC 11gR2, 12c, 18c, 19c, 21c, 23ai, 26ai
Prepared For: Oracle Database Administrators (L1/L2/L3)


Purpose

This document provides a structured Oracle RAC Health Check Framework that helps DBAs verify the health of Oracle Clusterware, ASM, Database, Network, and Cluster Resources. Performing these checks regularly helps detect issues early, reduce downtime, and maintain high availability.


Health Check Workflow

Clusterware
      │
      ▼
Node Status
      │
      ▼
ASM Health
      │
      ▼
Database Health
      │
      ▼
Network Health
      │
      ▼
Cluster Resources

1. Clusterware Health Check

Objective

Verify that Oracle Clusterware components are running correctly.

Components

  • OHASD

  • CSSD

  • CRSD

  • EVMD

Command

crsctl check crs

Expected Output

CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
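
The expected output above can be checked mechanically instead of by eye. The sketch below assumes that every healthy line ends with "is online" and that four such lines are expected, as in the sample. The function name is my own.

```shell
#!/bin/bash
# Sketch only: check_crs_output is a hypothetical helper name.
# Reads `crsctl check crs` output on stdin, prints every CRS- line that does
# not end in "is online", and returns 1 if any such line exists or if fewer
# than four "is online" lines were seen.
check_crs_output() {
    awk '
        /^CRS-[0-9]+:/ {
            if ($0 ~ /is online[[:space:]]*$/) online++
            else { print "NOT ONLINE: " $0; bad++ }
        }
        END { if (bad > 0 || online < 4) exit 1 }
    '
}

# Possible use:
# crsctl check crs | check_crs_output || echo "Clusterware check FAILED"
```

Empty input also counts as a failure, so a crsctl command that cannot run at all is not mistaken for a healthy stack.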

Validation

Component    Expected Status
OHASD        Online
CSSD         Online
CRSD         Online
EVMD         Online

If Failed

  • Check Clusterware logs.

  • Verify voting disks.

  • Verify OCR accessibility.

  • Restart Clusterware if required.


2. Node Health Check

Objective

Ensure all RAC nodes are available and participating in the cluster.

Commands

olsnodes
olsnodes -s
olsnodes -n

Expected Output

racnode1 Active
racnode2 Active
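
To flag a problem node automatically, the status column can be filtered. The sketch assumes the node name is the first field and the status is the last field, which covers both "olsnodes -s" and "olsnodes -n -s". The function name is hypothetical.

```shell
#!/bin/bash
# Sketch only: inactive_nodes is a hypothetical helper name.
# Reads `olsnodes -s` (or `olsnodes -n -s`) output on stdin and prints the
# name of every node whose last field is not "Active".
inactive_nodes() {
    awk 'NF >= 2 && tolower($NF) != "active" { print $1 }'
}

# Possible use:
# olsnodes -s | inactive_nodes
```

No output means every listed node reported Active. A node that is missing from the list entirely still has to be caught by comparing against the expected node count.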

Validation

  • All nodes visible

  • Status should be Active

  • Node numbers should match cluster configuration

Troubleshooting

If a node is missing:

  • Verify private interconnect

  • Check Clusterware

  • Verify CSSD

  • Review node logs


3. ASM Health Check

Objective

Verify ASM availability and storage health.

Check ASM Status

srvctl status asm

Expected

ASM is running on racnode1
ASM is running on racnode2

Check Diskgroups

asmcmd lsdg

Example

DATA
RECO
OCR

Verify

  • Mounted

  • Free Space

  • Offline Disks

  • Redundancy


SQL Validation

SELECT
name,
state,
type,
total_mb,
free_mb
FROM v$asm_diskgroup;

Troubleshooting

  • Check failed disks

  • Verify ASM alert log

  • Validate storage connectivity


4. Database Health Check

Objective

Ensure all RAC database instances and services are available.

Database Status

srvctl status database -d <db_name>

Expected

Instance PROD1 is running
Instance PROD2 is running

Service Status

srvctl status service -d <db_name>

Verify

  • Application services

  • Preferred instances

  • Available instances


SQL Validation

SELECT
INSTANCE_NAME,
STATUS,
DATABASE_STATUS
FROM GV$INSTANCE;

Expected

STATUS          : OPEN
DATABASE_STATUS : ACTIVE

5. Network Health Check

Objective

Verify communication between RAC nodes.


Public and Private Network

oifcfg getif

Verify

  • Public Interface

  • Private Interconnect


Network Configuration

srvctl config network

SCAN Configuration

srvctl config scan

Verify

  • SCAN Name

  • SCAN IPs

  • SCAN Listeners


VIP Status

srvctl status vip -n <node_name>

Expected

VIP <vip_name> is enabled
VIP <vip_name> is running on node: <node_name>

Troubleshooting

  • Verify DNS

  • Check SCAN listeners

  • Verify VIP failover

  • Test private interconnect latency


6. Cluster Resource Health Check

Objective

Verify all Oracle Cluster resources are online.

Command

crsctl stat res -t

Verify

  • Database

  • ASM

  • Listeners

  • VIPs

  • SCAN Listeners

  • Diskgroups

Expected Status

ONLINE
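
The tabular output is meant for reading, not parsing. For an automated check, the attribute form printed by "crsctl stat res" without -t is easier to process. The sketch below assumes NAME=, TARGET= and STATE= lines in that order for each resource. It reports only resources that are supposed to be online but are not. The function name is my own.

```shell
#!/bin/bash
# Sketch only: offline_resources is a hypothetical helper name.
# Reads `crsctl stat res` (without -t) output on stdin and prints
# "<resource>: <state>" for every resource whose TARGET includes ONLINE
# while its STATE includes OFFLINE, INTERMEDIATE, or UNKNOWN.
offline_resources() {
    awk -F= '
        $1 == "NAME"   { name = $2 }
        $1 == "TARGET" { target = $2 }
        $1 == "STATE"  {
            if (target ~ /ONLINE/ && $2 ~ /OFFLINE|INTERMEDIATE|UNKNOWN/)
                print name ": " $2
        }
    '
}

# Possible use:
# crsctl stat res | offline_resources
```

Resources whose target is OFFLINE are skipped on purpose, since they are usually disabled deliberately and would otherwise raise a false alarm every day.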

Additional Recommended Health Checks

Listener Status

srvctl status listener

SCAN Listener Status

srvctl status scan_listener

OCR Check

ocrcheck

Expected

Device/File integrity check succeeded
Cluster registry integrity check succeeded

Voting Disk

crsctl query css votedisk

Verify

  • All voting disks accessible


Cluster Synchronization

crsctl check css

CRS Stack

crsctl stat res -t

Verify every resource is ONLINE.


Daily RAC Health Check Checklist

Check                           Status
Clusterware Running             [ ]
All Nodes Active                [ ]
ASM Running                     [ ]
Diskgroups Mounted              [ ]
Database Open                   [ ]
RAC Services Running            [ ]
Public Network Healthy          [ ]
Private Interconnect Healthy    [ ]
VIP Running                     [ ]
SCAN Listener Running           [ ]
OCR Healthy                     [ ]
Voting Disk Healthy             [ ]
Cluster Resources ONLINE        [ ]

Common Production Issues

Issue                    Possible Cause          Resolution
Node Eviction            Interconnect failure    Check private network and CSS logs
ASM Down                 Storage unavailable     Verify SAN/ASM disks and restart ASM
VIP Offline              Network issue           Validate interface and relocate VIP
Service Not Running      Instance failure        Start service with SRVCTL
CRS Resource Offline     Clusterware issue       Review CRS logs and restart the affected resource
Diskgroup Not Mounted    Disk failure            Check ASM disks and storage connectivity

Best Practices

  • Perform RAC health checks daily.

  • Monitor ASM free space and rebalance operations.

  • Verify OCR and voting disk health after maintenance.

  • Monitor interconnect latency to prevent node eviction.

  • Ensure SCAN listeners and VIPs are functioning correctly.

  • Keep Clusterware and database patches up to date.

  • Review alert logs and CRS logs regularly.

  • Automate routine health checks using shell scripts or Enterprise Manager where possible.


Conclusion

A disciplined RAC health check routine is essential for maintaining a stable Oracle RAC environment. Regular verification of Clusterware, nodes, ASM, databases, networking, and cluster resources helps identify issues proactively, minimize downtime, and ensure continuous availability of critical business applications.