๐ Oracle Data Guard L3 Troubleshooting Runbook
1. Objective
Provide a systematic approach to diagnose and resolve:
Redo transport issues
Redo apply (MRP) issues
Lag problems
Switchover/Failover failures
Broker inconsistencies
๐งญ 2. High-Level Troubleshooting Flow
Step 1 → Check Broker Status
Step 2 → Validate Role & Open Mode
Step 3 → Check Transport (Primary)
Step 4 → Check Apply (Standby)
Step 5 → Check Lag
Step 6 → Check Errors (Alert Log)
Step 7 → Take Corrective Action
๐ 3. Step-by-Step Deep Diagnosis
✅ Step 1: Check Broker Health
DGMGRL> SHOW CONFIGURATION;
Expected:
SUCCESS
If NOT:
WARNING / ERROR→ Drill down:
SHOW DATABASE VERBOSE <db_name>;
✅ Step 2: Validate Database Role & Mode
SELECT DATABASE_ROLE, OPEN_MODE FROM V$DATABASE;
Expected:
Primary →
READ WRITEStandby →
MOUNTorREAD ONLY WITH APPLY
๐ Step 3: Redo Transport Check (Primary)
Check Destination Status:
SELECT DEST_ID, STATUS, ERROR
FROM V$ARCHIVE_DEST
WHERE TARGET='STANDBY';
Check Parameter:
SHOW PARAMETER log_archive_dest_state_2;
❌ If Issue Found:
| Problem | Fix |
|---|---|
| DEST = ERROR | Check network / TNS |
| STATE = DEFER/RESET | Enable it |
ALTER SYSTEM SET log_archive_dest_state_2=ENABLE;
๐ฅ Step 4: Redo Apply Check (Standby)
SELECT PROCESS, STATUS FROM V$MANAGED_STANDBY;
Expected:
MRP0 APPLYING_LOG
❌ If MRP NOT Running:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;
⏱️ Step 5: Lag Analysis
SELECT NAME, VALUE FROM V$DATAGUARD_STATS;
Key Metrics:
Transport Lag
Apply Lag
๐ฏ Interpretation:
| Scenario | Meaning |
|---|---|
| Transport Lag High | Network issue |
| Apply Lag High | MRP slow |
| Both High | System-wide issue |
๐ Step 6: Sequence Gap Validation
-- Primary
SELECT MAX(SEQUENCE#) FROM V$ARCHIVED_LOG;
-- Standby
SELECT MAX(SEQUENCE#) FROM V$ARCHIVED_LOG;
❌ If Gap Exists:
FAL issue or missing logs
๐ Step 7: Alert Log Analysis
Check:
Archive errors
ORA- errors
Network failures
⚠️ 4. Critical Issue Playbooks
๐จ Scenario 1: No Redo Shipping
Symptoms:
Transport Lag increasing
No logs received
Checks:
SELECT STATUS, ERROR FROM V$ARCHIVE_DEST;
Fix:
ALTER SYSTEM SET log_archive_dest_state_2=ENABLE;
๐จ Scenario 2: Redo Received but Not Applied
Symptoms:
Transport Lag = 0
Apply Lag high
Fix:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;
๐จ Scenario 3: MRP Stuck
Fix:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;
๐จ Scenario 4: Switchover Fails
Check:
SELECT SWITCHOVER_STATUS FROM V$DATABASE;
Fix Flow:
Stop apply
Mount standby
Ensure no lag
Retry switchover
๐จ Scenario 5: ORA-16516
Root Cause:
Active Data Guard mode
Fix:
EDIT DATABASE standby SET STATE='APPLY-OFF';
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
๐จ Scenario 6: Archive Destination Full
Symptoms:
Primary freeze
ORA-00257
Fix:
Clean archive logs
Add space
๐จ Scenario 7: Gap Not Resolving
Fix:
ALTER SYSTEM SET FAL_SERVER=primary;
Manual recovery if required.
๐ 5. Start/Stop Operations (Quick Commands)
▶️ Start Redo Apply
EDIT DATABASE standby SET STATE='APPLY-ON';
⛔ Stop Redo Apply
EDIT DATABASE standby SET STATE='APPLY-OFF';
๐ Stop Transport
EDIT DATABASE primary SET STATE='TRANSPORT-OFF';
๐ Start Transport
EDIT DATABASE primary SET STATE='TRANSPORT-ON';
๐งช 6. Validation Checklist (Post-Fix)
✔ Broker Status = SUCCESS
✔ MRP0 running
✔ Transport Lag = 0
✔ Apply Lag = 0
✔ No errors in alert log
๐ง 7. L3 Decision Tree (Real-World Thinking)
Lag?
├── YES
│ ├── Transport Lag?
│ │ ├── YES → Network / log_archive_dest issue
│ │ └── NO → Apply issue (MRP)
│
└── NO
├── Data mismatch?
│ ├── YES → Sequence gap / FAL issue
│ └── NO → System healthy
๐ก️ 8. Preventive Monitoring (Must-Have in Production)
Monitor:
Transport Lag
Apply Lag
MRP status
Archive destination usage
Automate alerts for:
Lag > 5 minutes
MRP stopped
Destination error
No comments:
Post a Comment