📘 OEM Runbook (L2/L3) – Oracle Monitoring & Alert Management

Using Oracle Enterprise Manager 13c

🎯 Objective

To:

Monitor database health
Detect issues proactively
Reduce alert noise
Troubleshoot incidents quickly

🧭 1. First Response Playbook (When Alert Comes)

🚨 Step 1: Open Incident Manager

📍 Navigation:

Enterprise → Monitoring → Incidents

Check:

Severity (Critical / Warning)
Target (DB / Host / Listener)
Message (Tablespace / CPU / Lock etc.)

🧠 Step 2: Identify Issue Type

Alert Type	Meaning	Action
CPU High	Performance issue	Check SQL / load
Tablespace Full	Storage issue	Add space
Session Blocking	Lock issue	Kill blocker
Host Down	Infra issue	Check server
Listener Down	Connectivity issue	Restart

🔍 2. Deep Dive Troubleshooting

⚡ Case 1: Database Performance Issue

Step 1: Open Performance Page

📍

Target → Database → Performance → Top Activity

Check:

CPU usage
Wait events
Active sessions

Step 2: Identify Top SQL

📍

Performance → SQL Monitoring / Top SQL

Action:

Find high elapsed time SQL
Capture SQL_ID

Step 3: Analyze Execution Plan

SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('<SQL_ID>', NULL, 'ALLSTATS LAST'));

Step 4: Fix

Add index
Gather stats
Apply SQL Profile

🔒 Case 2: Blocking / Locking Issue

Step 1: Check Blocking Sessions

📍

Performance → Blocking Sessions

OR SQL:

SELECT blocking_session, sid, serial#
FROM v$session
WHERE blocking_session IS NOT NULL;

Step 2: Kill Blocking Session

ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;

Step 3: Root Cause

Application not committing
Long transactions

💾 Case 3: Tablespace Full

Step 1: Check Usage

📍

Storage → Tablespaces

Step 2: Add Space

ALTER DATABASE DATAFILE '/path/file.dbf'
RESIZE 10G;

ALTER TABLESPACE users
ADD DATAFILE '/path/file02.dbf' SIZE 5G;

🔥 Case 4: CPU Spike

Step 1: Check Load

📍

Performance → Top Activity

Step 2: Identify Cause

High SQL load
Batch jobs
Parallel queries

Step 3: Action

Tune SQL
Kill runaway sessions
Limit parallelism

🌐 Case 5: Listener / Connectivity Issue

Step 1: Check Listener Status

lsnrctl status

Step 2: Restart Listener

lsnrctl stop
lsnrctl start

🔁 3. Alert Noise Reduction (VERY IMPORTANT)

🔧 Configure Thresholds

📍

Targets → Monitoring → Metric Settings

Best Practice:

Warning: 80%
Critical: 90%
Occurrence: 3

🔁 Configure Incident Rules

📍

Setup → Incidents → Incident Rules

Enable:

✔ Add to existing incident
✔ Event grouping

🔕 Configure Notifications

📍

Setup → Notifications

Rule:

Only Critical alerts → Email

⛔ Configure Blackouts

📍

Enterprise → Monitoring → Blackouts

Use During:

Patching
Maintenance

📊 4. Daily Health Check (L2 Task)

✅ Check 1: Incident Summary

Enterprise → Incidents

✅ Check 2: DB Status

Targets → Databases

✅ Check 3: Tablespace Usage

Storage → Tablespaces

✅ Check 4: Backup Status

Availability → Backup Reports

✅ Check 5: Performance

Performance → Top Activity

🚀 5. L3 Advanced Activities

🔬 AWR / ADDM Analysis

📍

Performance → AWR → Reports

⚙️ SQL Tuning Advisor

📍

Performance → SQL → Tuning Advisor

🧠 ASH Analytics

📍

Performance → ASH Analytics

🔄 Corrective Actions (Auto-Healing)

📍

Metric Settings → Corrective Actions

Example:

lsnrctl start

🎯 6. SLA / Escalation Matrix

Severity	Action	SLA
Critical	Immediate fix	15 mins
High	Investigate	30 mins
Medium	Monitor	2 hrs
Low	Review	Next day

🧠 7. Interview Questions (L2/L3)

❓ How do you troubleshoot OEM alert?

Answer:

I check Incident Manager, identify alert type, drill into performance metrics, analyze SQL or system issue, and apply corrective action.

❓ How do you reduce alert noise?

Answer:

Tune thresholds
Configure incident rules
Use blackout
Filter notifications

❓ What is difference between Event and Incident?

Event	Incident
Raw alert	Grouped issue

❓ What is your first step in performance issue?

Answer:

Check Top Activity and wait events to identify bottleneck.

🏁 Final Production Mindset

“OEM is not just a monitoring tool…
It is your control tower for the entire database ecosystem.”

Thursday, April 23, 2026

OEM Runbook (L2/L3) – Oracle Monitoring & Alert Management

📘 OEM Runbook (L2/L3) – Oracle Monitoring & Alert Management

🎯 Objective

🧭 1. First Response Playbook (When Alert Comes)

🚨 Step 1: Open Incident Manager

Check:

🧠 Step 2: Identify Issue Type

🔍 2. Deep Dive Troubleshooting

⚡ Case 1: Database Performance Issue

Step 1: Open Performance Page

Check:

Step 2: Identify Top SQL

Action:

Step 3: Analyze Execution Plan

Step 4: Fix

🔒 Case 2: Blocking / Locking Issue

Step 1: Check Blocking Sessions

Step 2: Kill Blocking Session

Step 3: Root Cause

💾 Case 3: Tablespace Full

Step 1: Check Usage

Step 2: Add Space

🔥 Case 4: CPU Spike

Step 1: Check Load

Step 2: Identify Cause

Step 3: Action

🌐 Case 5: Listener / Connectivity Issue

Step 1: Check Listener Status

Step 2: Restart Listener

🔁 3. Alert Noise Reduction (VERY IMPORTANT)

🔧 Configure Thresholds

Best Practice:

🔁 Configure Incident Rules

Enable:

🔕 Configure Notifications

Rule:

⛔ Configure Blackouts

Use During:

📊 4. Daily Health Check (L2 Task)

✅ Check 1: Incident Summary

✅ Check 2: DB Status

✅ Check 3: Tablespace Usage

✅ Check 4: Backup Status

✅ Check 5: Performance

🚀 5. L3 Advanced Activities

🔬 AWR / ADDM Analysis

⚙️ SQL Tuning Advisor

🧠 ASH Analytics

🔄 Corrective Actions (Auto-Healing)

🎯 6. SLA / Escalation Matrix

🧠 7. Interview Questions (L2/L3)

❓ How do you troubleshoot OEM alert?

❓ How do you reduce alert noise?

❓ What is difference between Event and Incident?

❓ What is your first step in performance issue?

🏁 Final Production Mindset

No comments:

Post a Comment

Contact Form

Total Pageviews