📘 OEM Runbook (L2/L3) – Oracle Monitoring & Alert Management
Using Oracle Enterprise Manager 13c
🎯 Objective
To:
Monitor database health
Detect issues proactively
Reduce alert noise
Troubleshoot incidents quickly
🧭 1. First Response Playbook (When Alert Comes)
🚨 Step 1: Open Incident Manager
📍 Navigation:
Enterprise → Monitoring → Incidents
Check:
Severity (Critical / Warning)
Target (DB / Host / Listener)
Message (Tablespace / CPU / Lock etc.)
🧠 Step 2: Identify Issue Type
| Alert Type | Meaning | Action |
|---|---|---|
| CPU High | Performance issue | Check SQL / load |
| Tablespace Full | Storage issue | Add space |
| Session Blocking | Lock issue | Kill blocker |
| Host Down | Infra issue | Check server |
| Listener Down | Connectivity issue | Restart |
🔍 2. Deep Dive Troubleshooting
⚡ Case 1: Database Performance Issue
Step 1: Open Performance Page
📍
Target → Database → Performance → Top Activity
Check:
CPU usage
Wait events
Active sessions
Step 2: Identify Top SQL
📍
Performance → SQL Monitoring / Top SQL
Action:
Find high elapsed time SQL
Capture SQL_ID
Step 3: Analyze Execution Plan
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('<SQL_ID>', NULL, 'ALLSTATS LAST'));
Step 4: Fix
Add index
Gather stats
Apply SQL Profile
🔒 Case 2: Blocking / Locking Issue
Step 1: Check Blocking Sessions
📍
Performance → Blocking Sessions
OR SQL:
SELECT blocking_session, sid, serial#
FROM v$session
WHERE blocking_session IS NOT NULL;
Step 2: Kill Blocking Session
ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;
Step 3: Root Cause
Application not committing
Long transactions
💾 Case 3: Tablespace Full
Step 1: Check Usage
📍
Storage → Tablespaces
Step 2: Add Space
ALTER DATABASE DATAFILE '/path/file.dbf'
RESIZE 10G;
OR
ALTER TABLESPACE users
ADD DATAFILE '/path/file02.dbf' SIZE 5G;
🔥 Case 4: CPU Spike
Step 1: Check Load
📍
Performance → Top Activity
Step 2: Identify Cause
High SQL load
Batch jobs
Parallel queries
Step 3: Action
Tune SQL
Kill runaway sessions
Limit parallelism
🌐 Case 5: Listener / Connectivity Issue
Step 1: Check Listener Status
lsnrctl status
Step 2: Restart Listener
lsnrctl stop
lsnrctl start
🔁 3. Alert Noise Reduction (VERY IMPORTANT)
🔧 Configure Thresholds
📍
Targets → Monitoring → Metric Settings
Best Practice:
Warning: 80%
Critical: 90%
Occurrence: 3
🔁 Configure Incident Rules
📍
Setup → Incidents → Incident Rules
Enable:
✔ Add to existing incident
✔ Event grouping
🔕 Configure Notifications
📍
Setup → Notifications
Rule:
Only Critical alerts → Email
⛔ Configure Blackouts
📍
Enterprise → Monitoring → Blackouts
Use During:
Patching
Maintenance
📊 4. Daily Health Check (L2 Task)
✅ Check 1: Incident Summary
Enterprise → Incidents
✅ Check 2: DB Status
Targets → Databases
✅ Check 3: Tablespace Usage
Storage → Tablespaces
✅ Check 4: Backup Status
Availability → Backup Reports
✅ Check 5: Performance
Performance → Top Activity
🚀 5. L3 Advanced Activities
🔬 AWR / ADDM Analysis
📍
Performance → AWR → Reports
⚙️ SQL Tuning Advisor
📍
Performance → SQL → Tuning Advisor
🧠 ASH Analytics
📍
Performance → ASH Analytics
🔄 Corrective Actions (Auto-Healing)
📍
Metric Settings → Corrective Actions
Example:
lsnrctl start
🎯 6. SLA / Escalation Matrix
| Severity | Action | SLA |
|---|---|---|
| Critical | Immediate fix | 15 mins |
| High | Investigate | 30 mins |
| Medium | Monitor | 2 hrs |
| Low | Review | Next day |
🧠 7. Interview Questions (L2/L3)
❓ How do you troubleshoot OEM alert?
Answer:
I check Incident Manager, identify alert type, drill into performance metrics, analyze SQL or system issue, and apply corrective action.
❓ How do you reduce alert noise?
Answer:
Tune thresholds
Configure incident rules
Use blackout
Filter notifications
❓ What is difference between Event and Incident?
| Event | Incident |
|---|---|
| Raw alert | Grouped issue |
❓ What is your first step in performance issue?
Answer:
Check Top Activity and wait events to identify bottleneck.
🏁 Final Production Mindset
“OEM is not just a monitoring tool…
It is your control tower for the entire database ecosystem.”
No comments:
Post a Comment