π OEM Runbook (L2/L3) – Oracle Monitoring & Alert Management
Using Oracle Enterprise Manager 13c
π― Objective
To:
Monitor database health
Detect issues proactively
Reduce alert noise
Troubleshoot incidents quickly
π§ 1. First Response Playbook (When Alert Comes)
π¨ Step 1: Open Incident Manager
π Navigation:
Enterprise → Monitoring → Incidents
Check:
Severity (Critical / Warning)
Target (DB / Host / Listener)
Message (Tablespace / CPU / Lock etc.)
π§ Step 2: Identify Issue Type
| Alert Type | Meaning | Action |
|---|---|---|
| CPU High | Performance issue | Check SQL / load |
| Tablespace Full | Storage issue | Add space |
| Session Blocking | Lock issue | Kill blocker |
| Host Down | Infra issue | Check server |
| Listener Down | Connectivity issue | Restart |
π 2. Deep Dive Troubleshooting
⚡ Case 1: Database Performance Issue
Step 1: Open Performance Page
π
Target → Database → Performance → Top Activity
Check:
CPU usage
Wait events
Active sessions
Step 2: Identify Top SQL
π
Performance → SQL Monitoring / Top SQL
Action:
Find high elapsed time SQL
Capture SQL_ID
Step 3: Analyze Execution Plan
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('<SQL_ID>', NULL, 'ALLSTATS LAST'));
Step 4: Fix
Add index
Gather stats
Apply SQL Profile
π Case 2: Blocking / Locking Issue
Step 1: Check Blocking Sessions
π
Performance → Blocking Sessions
OR SQL:
SELECT blocking_session, sid, serial#
FROM v$session
WHERE blocking_session IS NOT NULL;
Step 2: Kill Blocking Session
ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;
Step 3: Root Cause
Application not committing
Long transactions
πΎ Case 3: Tablespace Full
Step 1: Check Usage
π
Storage → Tablespaces
Step 2: Add Space
ALTER DATABASE DATAFILE '/path/file.dbf'
RESIZE 10G;
OR
ALTER TABLESPACE users
ADD DATAFILE '/path/file02.dbf' SIZE 5G;
π₯ Case 4: CPU Spike
Step 1: Check Load
π
Performance → Top Activity
Step 2: Identify Cause
High SQL load
Batch jobs
Parallel queries
Step 3: Action
Tune SQL
Kill runaway sessions
Limit parallelism
π Case 5: Listener / Connectivity Issue
Step 1: Check Listener Status
lsnrctl status
Step 2: Restart Listener
lsnrctl stop
lsnrctl start
π 3. Alert Noise Reduction (VERY IMPORTANT)
π§ Configure Thresholds
π
Targets → Monitoring → Metric Settings
Best Practice:
Warning: 80%
Critical: 90%
Occurrence: 3
π Configure Incident Rules
π
Setup → Incidents → Incident Rules
Enable:
✔ Add to existing incident
✔ Event grouping
π Configure Notifications
π
Setup → Notifications
Rule:
Only Critical alerts → Email
⛔ Configure Blackouts
π
Enterprise → Monitoring → Blackouts
Use During:
Patching
Maintenance
π 4. Daily Health Check (L2 Task)
✅ Check 1: Incident Summary
Enterprise → Incidents
✅ Check 2: DB Status
Targets → Databases
✅ Check 3: Tablespace Usage
Storage → Tablespaces
✅ Check 4: Backup Status
Availability → Backup Reports
✅ Check 5: Performance
Performance → Top Activity
π 5. L3 Advanced Activities
π¬ AWR / ADDM Analysis
π
Performance → AWR → Reports
⚙️ SQL Tuning Advisor
π
Performance → SQL → Tuning Advisor
π§ ASH Analytics
π
Performance → ASH Analytics
π Corrective Actions (Auto-Healing)
π
Metric Settings → Corrective Actions
Example:
lsnrctl start
π― 6. SLA / Escalation Matrix
| Severity | Action | SLA |
|---|---|---|
| Critical | Immediate fix | 15 mins |
| High | Investigate | 30 mins |
| Medium | Monitor | 2 hrs |
| Low | Review | Next day |
π§ 7. Interview Questions (L2/L3)
❓ How do you troubleshoot OEM alert?
Answer:
I check Incident Manager, identify alert type, drill into performance metrics, analyze SQL or system issue, and apply corrective action.
❓ How do you reduce alert noise?
Answer:
Tune thresholds
Configure incident rules
Use blackout
Filter notifications
❓ What is difference between Event and Incident?
| Event | Incident |
|---|---|
| Raw alert | Grouped issue |
❓ What is your first step in performance issue?
Answer:
Check Top Activity and wait events to identify bottleneck.
π Final Production Mindset
“OEM is not just a monitoring tool…
It is your control tower for the entire database ecosystem.”