Thursday, April 23, 2026

OEM Runbook (L2/L3) – Oracle Monitoring & Alert Management



📘 OEM Runbook (L2/L3) – Oracle Monitoring & Alert Management

Using Oracle Enterprise Manager 13c


🎯 Objective

To:

  • Monitor database health

  • Detect issues proactively

  • Reduce alert noise

  • Troubleshoot incidents quickly


🧭 1. First Response Playbook (When Alert Comes)

🚨 Step 1: Open Incident Manager

📍 Navigation:

Enterprise → Monitoring → Incidents

Check:

  • Severity (Critical / Warning)

  • Target (DB / Host / Listener)

  • Message (Tablespace / CPU / Lock etc.)


🧠 Step 2: Identify Issue Type

Alert TypeMeaningAction
CPU HighPerformance issueCheck SQL / load
Tablespace FullStorage issueAdd space
Session BlockingLock issueKill blocker
Host DownInfra issueCheck server
Listener DownConnectivity issueRestart

🔍 2. Deep Dive Troubleshooting


⚡ Case 1: Database Performance Issue

Step 1: Open Performance Page

📍

Target → Database → Performance → Top Activity

Check:

  • CPU usage

  • Wait events

  • Active sessions


Step 2: Identify Top SQL

📍

Performance → SQL Monitoring / Top SQL

Action:

  • Find high elapsed time SQL

  • Capture SQL_ID


Step 3: Analyze Execution Plan

SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('<SQL_ID>', NULL, 'ALLSTATS LAST'));

Step 4: Fix

  • Add index

  • Gather stats

  • Apply SQL Profile


🔒 Case 2: Blocking / Locking Issue

Step 1: Check Blocking Sessions

📍

Performance → Blocking Sessions

OR SQL:

SELECT blocking_session, sid, serial#
FROM v$session
WHERE blocking_session IS NOT NULL;

Step 2: Kill Blocking Session

ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;

Step 3: Root Cause

  • Application not committing

  • Long transactions


💾 Case 3: Tablespace Full

Step 1: Check Usage

📍

Storage → Tablespaces

Step 2: Add Space

ALTER DATABASE DATAFILE '/path/file.dbf'
RESIZE 10G;

OR

ALTER TABLESPACE users
ADD DATAFILE '/path/file02.dbf' SIZE 5G;

🔥 Case 4: CPU Spike

Step 1: Check Load

📍

Performance → Top Activity

Step 2: Identify Cause

  • High SQL load

  • Batch jobs

  • Parallel queries


Step 3: Action

  • Tune SQL

  • Kill runaway sessions

  • Limit parallelism


🌐 Case 5: Listener / Connectivity Issue

Step 1: Check Listener Status

lsnrctl status

Step 2: Restart Listener

lsnrctl stop
lsnrctl start

🔁 3. Alert Noise Reduction (VERY IMPORTANT)


🔧 Configure Thresholds

📍

Targets → Monitoring → Metric Settings

Best Practice:

  • Warning: 80%

  • Critical: 90%

  • Occurrence: 3


🔁 Configure Incident Rules

📍

Setup → Incidents → Incident Rules

Enable:

  • ✔ Add to existing incident

  • ✔ Event grouping


🔕 Configure Notifications

📍

Setup → Notifications

Rule:

  • Only Critical alerts → Email


⛔ Configure Blackouts

📍

Enterprise → Monitoring → Blackouts

Use During:

  • Patching

  • Maintenance


📊 4. Daily Health Check (L2 Task)


✅ Check 1: Incident Summary

Enterprise → Incidents

✅ Check 2: DB Status

Targets → Databases

✅ Check 3: Tablespace Usage

Storage → Tablespaces

✅ Check 4: Backup Status

Availability → Backup Reports

✅ Check 5: Performance

Performance → Top Activity

🚀 5. L3 Advanced Activities


🔬 AWR / ADDM Analysis

📍

Performance → AWR → Reports

⚙️ SQL Tuning Advisor

📍

Performance → SQL → Tuning Advisor

🧠 ASH Analytics

📍

Performance → ASH Analytics

🔄 Corrective Actions (Auto-Healing)

📍

Metric Settings → Corrective Actions

Example:

lsnrctl start

🎯 6. SLA / Escalation Matrix

SeverityActionSLA
CriticalImmediate fix15 mins
HighInvestigate30 mins
MediumMonitor2 hrs
LowReviewNext day

🧠 7. Interview Questions (L2/L3)


❓ How do you troubleshoot OEM alert?

Answer:

I check Incident Manager, identify alert type, drill into performance metrics, analyze SQL or system issue, and apply corrective action.


❓ How do you reduce alert noise?

Answer:

  • Tune thresholds

  • Configure incident rules

  • Use blackout

  • Filter notifications


❓ What is difference between Event and Incident?

EventIncident
Raw alertGrouped issue

❓ What is your first step in performance issue?

Answer:

Check Top Activity and wait events to identify bottleneck.


🏁 Final Production Mindset

“OEM is not just a monitoring tool…
It is your control tower for the entire database ecosystem.”



No comments:

Post a Comment