Monday, May 25, 2026

Standby Redo Logs Not Applying in Oracle Data Guard

 Oracle Data Guard is one of the most critical disaster recovery solutions used in enterprise environments. But one issue every DBA eventually faces is:

“Standby Redo Logs are shipping, but logs are not applying.”

This problem can quickly lead to:

  • Apply lag

  • Data synchronization delays

  • Increased RPO risk

  • Failover inconsistency

  • Production escalation calls at 2 AM

In this article, we’ll walk through:

  • Real-time production scenario

  • MRP troubleshooting

  • Standby Redo Log (SRL) validation

  • Apply lag diagnosis

  • Parallel apply tuning

  • Step-by-step fixes

  • Best practices used in production


Real-Time Production Scenario

A banking application running on Oracle 19c RAC + Data Guard suddenly reports:

  • Standby database lag increasing continuously

  • Archive logs shipping successfully

  • But redo apply is stuck

Primary database generates heavy redo during:

  • End-of-day batch jobs

  • Large payment processing

  • ETL loads

The monitoring team raises a critical alert:

ORA-16853: apply lag has exceeded specified threshold

At first glance:

  • Transport lag = 0

  • Network looks fine

  • Archive logs are arriving

But standby is still behind by 45 minutes.

This is where deeper Data Guard troubleshooting begins.


Understanding the Redo Flow

Before troubleshooting, let’s quickly understand the redo flow.

Primary Database

  • LGWR writes redo

  • Redo shipped via LNS process

  • RFS receives redo on standby

Standby Database

  • Redo written into Standby Redo Logs (SRL)

  • MRP (Managed Recovery Process) applies redo

If any component breaks:

  • Apply lag increases

  • Recovery stops

  • Synchronization fails


Step 1 — Check Data Guard Status

First step is always checking overall Data Guard health.

SELECT 
    DEST_ID,
    STATUS,
    ERROR,
    DESTINATION
FROM V$ARCHIVE_DEST
WHERE STATUS <> 'INACTIVE';

Check Apply Lag

SELECT 
    NAME,
    VALUE,
    TIME_COMPUTED
FROM V$DATAGUARD_STATS
WHERE NAME IN ('apply lag','transport lag');

Typical output:

apply lag      +00 00:45:12
transport lag  +00 00:00:00

This confirms:

  • Redo shipping is fine

  • Apply process is delayed


Step 2 — Verify MRP Process

MRP is the heart of redo apply.

Check recovery process:

SELECT 
    PROCESS,
    STATUS,
    THREAD#,
    SEQUENCE#
FROM V$MANAGED_STANDBY;

Problematic output:

MRP0 WAIT_FOR_LOG

or:

MRP0 APPLYING_LOG

but sequence not progressing.


Common Reasons MRP Stops Applying

1. Missing Standby Redo Logs

One of the most common issues.

Check SRL configuration:

SELECT 
    GROUP#,
    THREAD#,
    SEQUENCE#,
    ARCHIVED,
    STATUS
FROM V$STANDBY_LOG;

Production Issue Example

In a 2-node RAC environment:

Primary redo logs:

Thread 1 → 4 redo groups
Thread 2 → 4 redo groups

But standby had only:

4 standby redo logs total

This is incorrect.

Oracle recommends:

(Number of Online Redo Log Groups per thread + 1)
for each thread

Correct SRL Configuration

Example:

If each RAC thread has:

4 redo log groups

Then standby should have:

5 SRLs for Thread 1
5 SRLs for Thread 2

Add Missing SRLs

ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
GROUP 11 ('/u01/oradata/STD/srl11.log') SIZE 2G;

Repeat for all required groups.


Step 3 — Check Alert Logs

Always inspect standby alert log.

Common errors:

ORA-16037: user requested cancel of managed recovery
ORA-00313: open failed for members of log group
ORA-01153: an incompatible media recovery is active

Step 4 — Restart MRP Properly

Sometimes MRP hangs internally.

Cancel and restart recovery:

Stop Recovery

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

Start Recovery Again

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;

For Oracle 12c and above:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT NODELAY;

Real-Time Incident Example

In one production incident:

  • Redo transport was healthy

  • SRLs existed

  • No network issue

But MRP was frozen on one sequence for hours.

Root cause:

Corrupted archived log received on standby

Solution:

  • Remove corrupted archive

  • Re-transfer archive from primary

  • Restart MRP

Apply resumed immediately.


Step 5 — Check Apply Rate

Heavy redo generation can overwhelm standby apply.

Monitor apply performance:

SELECT 
    ITEM,
    SOFAR,
    UNITS
FROM V$RECOVERY_PROGRESS;

Identify Apply Bottleneck

Look for:

  • Slow disk I/O

  • CPU saturation

  • Single-threaded apply

  • Excessive checkpoints

  • Storage latency


Step 6 — Enable Parallel Apply

One major improvement for high redo environments is parallel apply.

Enable parallel recovery:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE PARALLEL 16 DISCONNECT;

Or dynamically:

ALTER SYSTEM SET RECOVERY_PARALLELISM=16;

Real Production Performance Improvement

A telecom production standby experienced:

Apply lag = 2 hours
Redo generation = 150 GB/hour

After enabling:

RECOVERY_PARALLELISM=32

Apply lag reduced to:

Less than 5 minutes

Step 7 — Validate Real-Time Apply

Check whether standby is using:

  • Archive apply
    or

  • Real-time apply

Query:

SELECT RECOVERY_MODE FROM V$ARCHIVE_DEST_STATUS;

Desired output:

MANAGED REAL TIME APPLY

If not enabled:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;

Step 8 — Monitor Archive Gap

Check for missing archive logs.

SELECT * FROM V$ARCHIVE_GAP;

If gaps exist:

  • Transfer missing archives manually

  • Register logs if needed

ALTER DATABASE REGISTER LOGFILE '/tmp/1_45566.arc';

Advanced Troubleshooting Scenarios

Scenario 1 — Standby Redo Logs in ACTIVE State Forever

Cause:

  • MRP stuck

  • SRL corruption

  • I/O freeze

Fix:

  • Restart MRP

  • Clear standby logfile

ALTER DATABASE CLEAR LOGFILE GROUP 11;

Scenario 2 — High Apply Lag but No Transport Lag

Cause:

  • Standby CPU bottleneck

  • Slow storage

  • Insufficient parallelism

Fix:

  • Increase parallel apply

  • Tune I/O

  • Add faster storage


Scenario 3 — Frequent Log Switches

Cause:

  • Small redo logs

Fix:

Increase redo log size.

Recommended:

15–20 minute log switch frequency

Scenario 4 — MRP Waiting for Log

Output:

WAIT_FOR_LOG

Possible causes:

  • Network delay

  • Archive gap

  • Transport issue

  • RFS not receiving logs


Useful Data Guard Monitoring Queries

Check Apply Status

SELECT PROCESS, STATUS, THREAD#, SEQUENCE# FROM V$MANAGED_STANDBY;

Check Lag

SELECT NAME, VALUE FROM V$DATAGUARD_STATS;

Check SRLs

SELECT GROUP#, THREAD#, STATUS FROM V$STANDBY_LOG;

Check Archive Gap

SELECT * FROM V$ARCHIVE_GAP;

Best Practices for Data Guard Apply Performance

1. Configure Proper SRLs

Always follow Oracle recommendation:

Redo Log Groups + 1 per thread

2. Use Real-Time Apply

Reduces failover data loss significantly.


3. Enable Parallel Apply for Heavy Workloads

Especially for:

  • Batch systems

  • Banking

  • Telecom

  • Large OLTP databases


4. Monitor Lag Continuously

Use:

  • OEM

  • Custom scripts

  • Data Guard Broker


5. Separate Redo and Data Disks

Avoid I/O contention.


Data Guard Broker Validation

Using DGMGRL:

show configuration;
show database verbose standbydb;

Look for:

Apply Lag
Transport Lag
Real Time Query
LogXptMode

Final Thoughts

Standby redo apply issues are among the most common Oracle Data Guard production incidents.

The key to fast resolution is understanding the entire redo pipeline:

  • LGWR

  • LNS

  • RFS

  • SRL

  • MRP

Most apply lag issues usually come down to:

  • Improper SRL configuration

  • MRP hangs

  • Archive gaps

  • Insufficient parallelism

  • Storage bottlenecks

A strong Oracle DBA not only fixes the issue quickly but also proactively designs Data Guard environments to avoid these failures entirely.

In enterprise systems where every second matters, properly tuned redo apply can make the difference between:

  • Seamless failover
    or

  • Major business outage.

No comments:

Post a Comment