Thursday, April 16, 2026

Real-time production Data Guard war stories

🚨 War Story 1: ORA-01555 on Standby (Active Data Guard)

Situation

Reporting team complained:

“Queries failing randomly on standby with ORA-01555”

Observation

  • Only happening on standby (Active Data Guard)

  • Same query working fine on primary

  • Undo retention = 900 sec

  • Long-running reports

Root Cause

On Active Data Guard:

  • Undo is not retained the same way as primary

  • Heavy redo apply + long queries = undo overwritten

Fix

  • Increased undo retention:

ALTER SYSTEM SET undo_retention=3600;
  • Enabled:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
  • Tuned reporting queries

Lesson

πŸ‘‰ Standby is not a reporting replica like a data warehouse
πŸ‘‰ Treat Active Data Guard carefully for long queries


🚨 War Story 2: ORA-16532 – Broker Config Missing

Situation

Post-maintenance:

SHOW CONFIGURATION;
ORA-16532: broker configuration does not exist

Panic

Team thought:

“Data Guard configuration is gone 😨”

Root Cause

Connected to:

  • Standby instead of Primary

Fix

dgmgrl sys@primary

Lesson

πŸ‘‰ Broker metadata is controlled from primary
πŸ‘‰ Always verify connection before troubleshooting


🚨 War Story 3: Switchover Fails with ORA-16516

Situation

During DR drill:

SWITCHOVER TO standby;
ORA-16516

Observation

  • Lag = 0

  • Broker = SUCCESS

  • Everything looked perfect

Hidden Issue

Standby was:
πŸ‘‰ READ ONLY WITH APPLY (Active Data Guard)

Fix

EDIT DATABASE standby SET STATE='APPLY-OFF';

Then:

SHUTDOWN IMMEDIATE;
STARTUP MOUNT;

Retry switchover → ✅ SUCCESS

Lesson

πŸ‘‰ Switchover requires MOUNT mode, not Active Data Guard


🚨 War Story 4: Switchover Stuck – NOT ALLOWED

Situation

SELECT SWITCHOVER_STATUS FROM V$DATABASE;
NOT ALLOWED

Investigation

Checked standby:

SELECT PROCESS, STATUS FROM V$MANAGED_STANDBY;
MRP0 APPLYING_LOG

Checked primary:

SHOW PARAMETER log_archive_dest_state_2;
RESET ❌

Root Cause

Redo transport disabled:
πŸ‘‰ log_archive_dest_state_2=RESET

Fix

ALTER SYSTEM SET log_archive_dest_state_2=ENABLE;

After few minutes:

RESOLVABLE GAP → TO STANDBY

Switchover → ✅ SUCCESS

Lesson

πŸ‘‰ Switchover depends on redo transport health
πŸ‘‰ Always check archive destination state


🚨 War Story 5: Redo Lag Suddenly Increased (Network Issue)

Situation

Monitoring alert:

  • Transport lag = 45 minutes

  • Apply lag increasing

Checks

SHOW DATABASE VERBOSE standby;

Found:

  • TransportDisconnectedThreshold breached

Root Cause

  • Network latency spike

  • Packet drops between DCs

Fix

  • Switched to ASYNC temporarily:

EDIT DATABASE standby SET PROPERTY LogXptMode='ASYNC';
  • Network team fixed issue

  • Switched back to SYNC

Lesson

πŸ‘‰ Network is the backbone of Data Guard
πŸ‘‰ Always coordinate with infra team


🚨 War Story 6: Standby Not Applying Logs (MRP Stuck)

Situation

Lag increasing, but logs reaching standby

Check

SELECT PROCESS, STATUS FROM V$MANAGED_STANDBY;

Result:

  • MRP0 not running ❌

Root Cause

MRP crashed silently after:

  • Disk space issue

  • Archive log corruption

Fix

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;

Lesson

πŸ‘‰ Always verify MRP0 status, not just lag


🚨 War Story 7: Archive Destination Full (Production Impact)

Situation

Primary database froze

Error

ORA-00257: archiver error. Connect internal only

Root Cause

  • Standby archive location full

  • Primary unable to ship logs

Fix

  • Cleaned archive logs on standby

  • Restarted archiver

Lesson

πŸ‘‰ Storage monitoring is critical
πŸ‘‰ Archive issues affect primary availability


🚨 War Story 8: Broker Shows SUCCESS but Data Not Syncing

Situation

  • Broker: SUCCESS ✅

  • Business: “Data mismatch”

Check

SELECT MAX(SEQUENCE#) FROM V$ARCHIVED_LOG;

Mismatch found between primary & standby

Root Cause

  • Gap not resolved automatically

  • FAL server misconfigured

Fix

ALTER SYSTEM SET FAL_SERVER=primary;

Manual gap resolution done

Lesson

πŸ‘‰ Never trust only Broker
πŸ‘‰ Always validate at SQL level


🚨 War Story 9: Accidental TRANSPORT-OFF in Production

Situation

Lag suddenly increased

Check

SHOW DATABASE primary;

Found:

Intended State: TRANSPORT-OFF

Root Cause

  • Someone ran:

EDIT DATABASE primary SET STATE='TRANSPORT-OFF';

Fix

EDIT DATABASE primary SET STATE='TRANSPORT-ON';

Lesson

πŸ‘‰ Always audit DGMGRL changes
πŸ‘‰ Restrict access to Broker


🚨 War Story 10: Failover Required During DC Outage

Situation

Primary DC down completely

Action

FAILOVER TO standby;

Challenge

  • Some redo loss possible

Decision

Used:
πŸ‘‰ FORCE FAILOVER

Post Steps

  • Recreated old primary as standby

Lesson

πŸ‘‰ Understand:

  • Switchover = Zero data loss

  • Failover = Possible data loss



No comments:

Post a Comment