Standard Operating Procedure (SOP)
Document Version: 1.0
Applicable Versions: Oracle RAC 11gR2, 12c, 18c, 19c, 21c, 23ai, 26ai
Prepared For: Oracle Database Administrators (L1/L2/L3)
Purpose
This document provides a structured Oracle RAC Health Check Framework that helps DBAs verify the health of Oracle Clusterware, ASM, Database, Network, and Cluster Resources. Performing these checks regularly helps detect issues early, reduce downtime, and maintain high availability.
Health Check Workflow
Clusterware
│
▼
Node Status
│
▼
ASM Health
│
▼
Database Health
│
▼
Network Health
│
▼
Cluster Resources
1. Clusterware Health Check
Objective
Verify that Oracle Clusterware components are running correctly.
Components
OHASD
CSSD
CRSD
EVMD
Command
crsctl check crs
Expected Output
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
Validation
| Component | Expected Status |
|---|---|
| OHASD | Online |
| CSSD | Online |
| CRSD | Online |
| EVMD | Online |
If Failed
Check Clusterware logs.
Verify voting disks.
Verify OCR accessibility.
Restart Clusterware if required.
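The validation above can be scripted. The following is a minimal sketch, not an Oracle-supplied tool: check_crs_output is a hypothetical helper name, and it assumes the four-line output format shown under Expected Output. It reads the command output on standard input so it can be tried against saved output as well as a live cluster.

```shell
# check_crs_output (hypothetical helper): reads `crsctl check crs` output
# on stdin, prints every CRS- line that does not end in "is online",
# and returns 0 only when four status lines were seen and all were online.
check_crs_output() {
    awk '
        /^CRS-[0-9]+:/ {
            seen++
            if ($0 !~ /is online[ \t]*$/) { bad++; print "NOT ONLINE: " $0 }
        }
        END {
            if (seen < 4) {
                print "Expected 4 status lines, saw " (seen + 0)
                exit 1
            }
            exit (bad > 0)
        }
    '
}
```

Typical use would be `crsctl check crs | check_crs_output`, with a non-zero exit status treated as a failed check.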
2. Node Health Check
Objective
Ensure all RAC nodes are available and participating in the cluster.
Commands
olsnodes
olsnodes -s
olsnodes -n
Expected Output (olsnodes -s)
racnode1 Active
racnode2 Active
Validation
All nodes visible
Status should be Active
Node numbers should match cluster configuration
Troubleshooting
If a node is missing:
Verify private interconnect
Check Clusterware
Verify CSSD
Review node logs
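As a sketch of how the node check could be automated, the helper below filters `olsnodes -s` output for nodes that are not Active. The name inactive_nodes is hypothetical, and the two-column layout (node name, then status) is assumed from the expected output shown above.

```shell
# inactive_nodes (hypothetical helper): reads `olsnodes -s` output on
# stdin and prints the name of every node whose status is not "Active".
# Returns 1 if any such node exists or if no nodes were listed at all.
inactive_nodes() {
    awk '
        NF >= 2 {
            total++
            if ($2 != "Active") { bad++; print $1 }
        }
        END { exit (total == 0 || bad > 0) }
    '
}
```

For example, `olsnodes -s | inactive_nodes` prints nothing and returns 0 when every node is Active.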
3. ASM Health Check
Objective
Verify ASM availability and storage health.
Check ASM Status
srvctl status asm
Expected
ASM is running on racnode1,racnode2
Check Diskgroups
asmcmd lsdg
Example
DATA
RECO
OCR
Verify
Mounted
Free Space
Offline Disks
Redundancy
SQL Validation
SELECT
name,
state,
type,
total_mb,
free_mb,
offline_disks
FROM v$asm_diskgroup;
Troubleshooting
Check failed disks
Verify ASM alert log
Validate storage connectivity
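The mount state and free space checks lend themselves to a small filter. This is a rough sketch under stated assumptions: diskgroup_alerts is a hypothetical helper, the 15 percent default threshold is an arbitrary illustration rather than an Oracle recommendation, and the input is assumed to be whitespace-separated rows whose first five columns follow the SQL validation query (name, state, type, total_mb, free_mb).

```shell
# diskgroup_alerts [min_free_pct] (hypothetical helper): reads rows of
#   name state type total_mb free_mb
# on stdin and reports diskgroups that are neither MOUNTED nor CONNECTED,
# or whose free space is below the threshold (default 15, an assumption).
# Header or separator lines are skipped because their size columns are
# not numeric.
diskgroup_alerts() {
    awk -v min="${1:-15}" '
        NF >= 5 && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ {
            if ($2 != "MOUNTED" && $2 != "CONNECTED")
                print $1 ": state " $2
            else if ($4 > 0 && ($5 * 100 / $4) < min)
                printf "%s: %.1f%% free\n", $1, $5 * 100 / $4
        }
    '
}
```

One way to feed it, assuming SQL*Plus is run with SET HEADING OFF, SET FEEDBACK OFF and SET PAGESIZE 0, is to pipe the query output into `diskgroup_alerts 20`. An empty result means no diskgroup tripped either condition.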
4. Database Health Check
Objective
Ensure all RAC database instances and services are available.
Database Status
srvctl status database -d <db_name>
Expected
Instance PROD1 is running on node racnode1
Instance PROD2 is running on node racnode2
Service Status
srvctl status service -d <db_name>
Verify
Application services
Preferred instances
Available instances
SQL Validation
SELECT
INSTANCE_NAME,
STATUS,
DATABASE_STATUS
FROM GV$INSTANCE;
Expected
STATUS: OPEN
DATABASE_STATUS: ACTIVE
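The instance status check can be sketched as a filter over the `srvctl status database` output. The helper name stopped_instances is hypothetical, and it assumes each instance is reported on a line beginning with "Instance", with the phrase "is not running" marking a stopped instance.

```shell
# stopped_instances (hypothetical helper): reads
# `srvctl status database -d <db_name>` output on stdin and prints the
# name of every instance reported as not running. Returns 1 if any are
# found or if no "Instance ..." lines were seen.
stopped_instances() {
    awk '
        $1 == "Instance" {
            total++
            if ($0 ~ /is not running/) { bad++; print $2 }
        }
        END { exit (total == 0 || bad > 0) }
    '
}
```

For example, `srvctl status database -d <db_name> | stopped_instances` returns 0 and prints nothing when all instances are up.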
5. Network Health Check
Objective
Verify communication between RAC nodes.
Public and Private Network
oifcfg getif
Verify
Public Interface
Private Interconnect
Network Configuration
srvctl config network
SCAN Configuration
srvctl config scan
Verify
SCAN Name
SCAN IPs
SCAN Listeners
VIP Status
srvctl status vip -n <node_name>
Expected
VIP racnode1-vip is enabled
VIP racnode1-vip is running on node: racnode1
Troubleshooting
Verify DNS
Check SCAN listeners
Verify VIP failover
Test private interconnect latency
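A quick way to confirm that both interface roles are registered is to scan the `oifcfg getif` output. The sketch below uses a hypothetical helper name, check_interface_roles, and assumes the interface type (public, cluster_interconnect, or a combined value such as cluster_interconnect,asm) is the last field on each line.

```shell
# check_interface_roles (hypothetical helper): reads `oifcfg getif`
# output on stdin and confirms that at least one interface is registered
# as public and at least one as cluster_interconnect. The role is assumed
# to be the last whitespace-separated field of each line.
check_interface_roles() {
    awk '
        $NF ~ /public/               { pub++ }
        $NF ~ /cluster_interconnect/ { priv++ }
        END {
            if (!pub)  print "no public interface registered"
            if (!priv) print "no cluster_interconnect interface registered"
            exit (!pub || !priv)
        }
    '
}
```

This only checks that the roles are defined in the cluster configuration. It does not measure latency or packet loss on the interconnect.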
6. Cluster Resource Health Check
Objective
Verify all Oracle Cluster resources are online.
Command
crsctl stat res -t
Verify
Database
ASM
Listeners
VIPs
SCAN Listeners
Diskgroups
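Expected Status
ONLINE for every resource whose TARGET is ONLINE. A resource with TARGET OFFLINE (for example ora.gsd on 11gR2) is offline by design.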
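The tabular -t output is convenient to read but awkward to parse. As a sketch, the same check can be run against the plain `crsctl stat res` output, which prints NAME=, TYPE=, TARGET= and STATE= lines for each resource. The helper name resources_not_online is hypothetical; it compares how many targets are ONLINE with how many states are ONLINE, so resources that are offline by design are not flagged.

```shell
# resources_not_online (hypothetical helper): reads `crsctl stat res`
# output (without -t) on stdin and prints each resource that has fewer
# ONLINE states than ONLINE targets. Returns 1 if any are found.
resources_not_online() {
    awk -F= '
        # Count comma-separated entries that start with ONLINE.
        function count_online(list,   parts, n, i, c) {
            n = split(list, parts, ",")
            for (i = 1; i <= n; i++) {
                sub(/^[ \t]+/, "", parts[i])
                if (parts[i] ~ /^ONLINE/) c++
            }
            return c + 0
        }
        $1 == "NAME"   { name = $2 }
        $1 == "TARGET" { want = count_online($2) }
        $1 == "STATE"  {
            have = count_online($2)
            if (have < want) {
                bad++
                print name ": " have " of " want " ONLINE"
            }
        }
        END { exit (bad > 0) }
    '
}
```

For example, `crsctl stat res | resources_not_online` prints nothing when every resource that should be running is running.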
Additional Recommended Health Checks
Listener Status
srvctl status listener
SCAN Listener Status
srvctl status scan_listener
OCR Check
ocrcheck
Expected
Device/File integrity check succeeded
Cluster registry integrity check succeeded
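A scripted version of the OCR check might look like the sketch below. The helper name ocr_is_healthy is hypothetical. It looks for the "integrity check succeeded" lines that ocrcheck prints on success, and it treats any line containing "check failed" as a failure, which is an assumption about the failure wording.

```shell
# ocr_is_healthy (hypothetical helper): reads `ocrcheck` output on stdin.
# Returns 0 when at least one "integrity check succeeded" line is present
# and no line contains "check failed" (assumed failure wording).
ocr_is_healthy() {
    awk '
        /integrity check succeeded/ { ok++ }
        /check failed/              { failed++ }
        END { exit !(ok && !failed) }
    '
}
```

For example, `ocrcheck | ocr_is_healthy && echo "OCR OK"`.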
Voting Disk
crsctl query css votedisk
Verify
All voting disks are listed with STATE ONLINE
Cluster Synchronization
crsctl check css
CRS Stack
crsctl stat res -t
Verify that every resource with TARGET ONLINE also shows STATE ONLINE.
Daily RAC Health Check Checklist
| Check | Status |
|---|---|
| Clusterware Running | ☐ |
| All Nodes Active | ☐ |
| ASM Running | ☐ |
| Diskgroups Mounted | ☐ |
| Database Open | ☐ |
| RAC Services Running | ☐ |
| Public Network Healthy | ☐ |
| Private Interconnect Healthy | ☐ |
| VIP Running | ☐ |
| SCAN Listener Running | ☐ |
| OCR Healthy | ☐ |
| Voting Disk Healthy | ☐ |
| Cluster Resources ONLINE | ☐ |
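The checklist can be driven from a script. The following is a minimal sketch of a runner, assuming each check can be expressed as a command whose exit status is 0 on success. The names run_check and failures are hypothetical. Several Oracle utilities may exit 0 even when they report a problem, so in practice each check usually needs to test the output text rather than rely on the exit status alone.

```shell
# Counter for failed checks (hypothetical variable name).
failures=0

# run_check LABEL COMMAND [ARGS...] (hypothetical helper): runs the
# command quietly and prints the checklist label with PASS or FAIL,
# based only on the command's exit status.
run_check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf '%-32s PASS\n' "$label"
    else
        printf '%-32s FAIL\n' "$label"
        failures=$((failures + 1))
    fi
}

# Example wiring (commented out so the block is safe to source):
# run_check "Clusterware Running" \
#     sh -c 'test "$(crsctl check crs | grep -c "is online")" -eq 4'
# run_check "OCR Healthy" \
#     sh -c 'ocrcheck | grep -q "Cluster registry integrity check succeeded"'
# exit "$failures"
```

Exiting with the failure count lets a scheduler or monitoring agent treat any non-zero status as a reason to alert.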
Common Production Issues
| Issue | Possible Cause | Resolution |
|---|---|---|
| Node Eviction | Interconnect failure | Check private network and CSS logs |
| ASM Down | Storage unavailable | Verify SAN/ASM disks and restart ASM |
| VIP Offline | Network issue | Validate interface and relocate VIP |
| Service Not Running | Instance failure | Start service with SRVCTL |
| CRS Resource Offline | Clusterware issue | Review CRS logs and restart the affected resource |
| Diskgroup Not Mounted | Disk failure | Check ASM disks and storage connectivity |
Best Practices
Perform RAC health checks daily.
Monitor ASM free space and rebalance operations.
Verify OCR and voting disk health after maintenance.
Monitor interconnect latency to prevent node eviction.
Ensure SCAN listeners and VIPs are functioning correctly.
Keep Clusterware and database patches up to date.
Review alert logs and CRS logs regularly.
Automate routine health checks using shell scripts or Enterprise Manager where possible.
Conclusion
A disciplined RAC health check routine is essential for maintaining a stable Oracle RAC environment. Regular verification of Clusterware, nodes, ASM, databases, networking, and cluster resources helps identify issues proactively, minimize downtime, and ensure continuous availability of critical business applications.