Thursday, April 16, 2026

Oracle Data Guard Troubleshooting Runbook

 


๐Ÿš€ Oracle Data Guard L3 Troubleshooting Runbook

1. Objective

Provide a systematic approach to diagnose and resolve:

  • Redo transport issues

  • Redo apply (MRP) issues

  • Lag problems

  • Switchover/Failover failures

  • Broker inconsistencies


๐Ÿงญ 2. High-Level Troubleshooting Flow

Step 1 → Check Broker Status
Step 2 → Validate Role & Open Mode
Step 3 → Check Transport (Primary)
Step 4 → Check Apply (Standby)
Step 5 → Check Lag
Step 6 → Check Errors (Alert Log)
Step 7 → Take Corrective Action

๐Ÿ” 3. Step-by-Step Deep Diagnosis


Step 1: Check Broker Health

DGMGRL> SHOW CONFIGURATION;

Expected:

  • SUCCESS

If NOT:

  • WARNING / ERROR → Drill down:

SHOW DATABASE VERBOSE <db_name>;

Step 2: Validate Database Role & Mode

SELECT DATABASE_ROLE, OPEN_MODE FROM V$DATABASE;

Expected:

  • Primary → READ WRITE

  • Standby → MOUNT or READ ONLY WITH APPLY


๐Ÿšš Step 3: Redo Transport Check (Primary)

Check Destination Status:

SELECT DEST_ID, STATUS, ERROR 
FROM V$ARCHIVE_DEST 
WHERE TARGET='STANDBY';

Check Parameter:

SHOW PARAMETER log_archive_dest_state_2;

❌ If Issue Found:

ProblemFix
DEST = ERRORCheck network / TNS
STATE = DEFER/RESETEnable it
ALTER SYSTEM SET log_archive_dest_state_2=ENABLE;

๐Ÿ“ฅ Step 4: Redo Apply Check (Standby)

SELECT PROCESS, STATUS FROM V$MANAGED_STANDBY;

Expected:

  • MRP0 APPLYING_LOG


❌ If MRP NOT Running:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;

⏱️ Step 5: Lag Analysis

SELECT NAME, VALUE FROM V$DATAGUARD_STATS;

Key Metrics:

  • Transport Lag

  • Apply Lag


๐ŸŽฏ Interpretation:

ScenarioMeaning
Transport Lag HighNetwork issue
Apply Lag HighMRP slow
Both HighSystem-wide issue

๐Ÿ“Š Step 6: Sequence Gap Validation

-- Primary
SELECT MAX(SEQUENCE#) FROM V$ARCHIVED_LOG;

-- Standby
SELECT MAX(SEQUENCE#) FROM V$ARCHIVED_LOG;

❌ If Gap Exists:

  • FAL issue or missing logs


๐Ÿ“„ Step 7: Alert Log Analysis

Check:

  • Archive errors

  • ORA- errors

  • Network failures


⚠️ 4. Critical Issue Playbooks


๐Ÿšจ Scenario 1: No Redo Shipping

Symptoms:

  • Transport Lag increasing

  • No logs received

Checks:

SELECT STATUS, ERROR FROM V$ARCHIVE_DEST;

Fix:

ALTER SYSTEM SET log_archive_dest_state_2=ENABLE;

๐Ÿšจ Scenario 2: Redo Received but Not Applied

Symptoms:

  • Transport Lag = 0

  • Apply Lag high

Fix:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;

๐Ÿšจ Scenario 3: MRP Stuck

Fix:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;

๐Ÿšจ Scenario 4: Switchover Fails

Check:

SELECT SWITCHOVER_STATUS FROM V$DATABASE;

Fix Flow:

  • Stop apply

  • Mount standby

  • Ensure no lag

  • Retry switchover


๐Ÿšจ Scenario 5: ORA-16516

Root Cause:

  • Active Data Guard mode

Fix:

EDIT DATABASE standby SET STATE='APPLY-OFF';
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;

๐Ÿšจ Scenario 6: Archive Destination Full

Symptoms:

  • Primary freeze

  • ORA-00257

Fix:

  • Clean archive logs

  • Add space


๐Ÿšจ Scenario 7: Gap Not Resolving

Fix:

ALTER SYSTEM SET FAL_SERVER=primary;

Manual recovery if required.


๐Ÿ” 5. Start/Stop Operations (Quick Commands)


▶️ Start Redo Apply

EDIT DATABASE standby SET STATE='APPLY-ON';

⛔ Stop Redo Apply

EDIT DATABASE standby SET STATE='APPLY-OFF';

๐Ÿšš Stop Transport

EDIT DATABASE primary SET STATE='TRANSPORT-OFF';

๐Ÿš€ Start Transport

EDIT DATABASE primary SET STATE='TRANSPORT-ON';

๐Ÿงช 6. Validation Checklist (Post-Fix)

✔ Broker Status = SUCCESS
✔ MRP0 running
✔ Transport Lag = 0
✔ Apply Lag = 0
✔ No errors in alert log


๐Ÿง  7. L3 Decision Tree (Real-World Thinking)

Lag? 
 ├── YES
 │   ├── Transport Lag?
 │   │   ├── YES → Network / log_archive_dest issue
 │   │   └── NO → Apply issue (MRP)
 │
 └── NO
     ├── Data mismatch?
     │   ├── YES → Sequence gap / FAL issue
     │   └── NO → System healthy

๐Ÿ›ก️ 8. Preventive Monitoring (Must-Have in Production)

  • Monitor:

    • Transport Lag

    • Apply Lag

    • MRP status

    • Archive destination usage

  • Automate alerts for:

    • Lag > 5 minutes

    • MRP stopped

    • Destination error



No comments:

Post a Comment