Chapter 9: AI/ML in SOC¶
Learning Objectives¶
By the end of this chapter, you will be able to:
- Explain supervised vs. unsupervised learning and their security use cases
- Evaluate ML model performance using precision, recall, F1-score, and ROC curves
- Identify common ML pitfalls in security (overfitting, data drift, adversarial evasion)
- Apply anomaly detection techniques (UEBA, clustering, autoencoders)
- Design ML-powered security workflows with appropriate human oversight
Prerequisites¶
- Chapter 1: Understanding of AI/ML concepts in SOC context
- Basic statistics knowledge (mean, standard deviation, probability)
- Familiarity with Python or willingness to understand pseudocode
Key Concepts¶
Machine Learning • Supervised Learning • Unsupervised Learning • UEBA • Anomaly Detection • Precision and Recall • Model Drift
Curiosity Hook: The Alert That Wasn't an Alert¶
Your SIEM generates 1,200 "Unusual Login" alerts daily. 95% are false positives (VPN reconnects, roaming employees).
Traditional Approach: Tune thresholds higher → Miss actual account compromises
ML Approach:
- Train a classifier on 6 months of labeled alerts (TP/FP)
- Features: time of day, geolocation, device type, login velocity, user risk score
- Result: the model scores each alert 0-100 for maliciousness
Outcome: Auto-close bottom 80% (high-confidence benign), surface top 5% (high-confidence threats), queue middle 15% for analyst review.
Result: Analysts handle 200 alerts/day instead of 1,200. True positive detection rate increases 40% (fewer missed threats buried in noise).
This chapter teaches: How to build, deploy, and maintain ML systems that make your SOC smarter.
9.1 ML Fundamentals for Security¶
Supervised Learning¶
Definition: Train a model on labeled examples (input → output) to predict outcomes for new data.
Security Use Cases:
- Malware Classification: File → Malicious or Benign
- Alert Triage: Alert features → True Positive or False Positive
- Phishing Detection: Email → Phishing or Legitimate
Training Process:
1. Collect labeled data (e.g., 10,000 alerts labeled TP/FP by analysts)
2. Extract features (IP reputation, time of day, user behavior, etc.)
3. Train model (Random Forest, XGBoost, Neural Network)
4. Evaluate performance (precision, recall, F1)
5. Deploy model to score new alerts in production
Example: Alert Triage Classifier
from sklearn.ensemble import RandomForestClassifier
# Features
X = [
[0.9, 3, 50, 1], # [threat_intel_score, time_of_day, failed_logins, asset_criticality]
[0.1, 14, 2, 0],
# ... 10,000 more training examples
]
# Labels (1 = True Positive, 0 = False Positive)
y = [1, 0, ...]
model = RandomForestClassifier()
model.fit(X, y)
# Predict new alert
new_alert = [[0.85, 2, 100, 1]]
prediction = model.predict_proba(new_alert)
print(f"Probability of True Positive: {prediction[0][1]:.2%}")
# Example output: Probability of True Positive: 92.00%
Unsupervised Learning¶
Definition: Find patterns in unlabeled data without predefined outcomes.
Security Use Cases:
- Anomaly Detection: Identify outliers (unusual behavior without knowing what "attack" looks like)
- Clustering: Group similar alerts or threats
- Baseline Establishment: Learn "normal" behavior to detect deviations
Common Algorithms:
- K-Means Clustering: Group similar data points
- Isolation Forest: Detect anomalies by isolating outliers
- Autoencoders: Neural networks that learn to reconstruct normal data (and fail on anomalies)
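To make the first algorithm concrete, here is a minimal K-Means sketch that groups alerts by two hypothetical features (severity and events per minute); alerts that land in the same cluster can be triaged together:

```python
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical alert features: [severity (1-10), events_per_minute]
alerts = np.array([
    [1, 2], [2, 3], [1, 1],      # low-severity, low-volume alerts
    [9, 40], [8, 35], [9, 38],   # high-severity burst alerts
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(alerts)

# Alerts in the same cluster share a label; an analyst can triage
# one representative per cluster instead of every individual alert.
print(labels)
```

In practice the feature vectors would come from your SIEM, and the number of clusters is itself a tuning decision.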
Reinforcement Learning (Emerging)¶
Definition: Agent learns optimal actions through trial and error, receiving rewards/penalties.
Security Use Cases (Experimental):
- Adaptive firewall rules (learn to block threats while minimizing false positives)
- Autonomous incident response (AI agent learns optimal response sequences)
Limitation: Requires safe simulation environment. Risky to deploy in production without extensive testing.
9.2 Evaluating ML Model Performance¶
Confusion Matrix¶
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Example: Alert Triage Model
- TP: 90 (correctly identified malicious alerts)
- FP: 10 (benign alerts falsely flagged as malicious)
- FN: 5 (malicious alerts missed, labeled as benign)
- TN: 895 (correctly identified benign alerts)
Key Metrics¶
Precision: Of all alerts the model flagged as malicious, how many were actually malicious?

Precision = TP / (TP + FP) = 90 / (90 + 10) = 90%

Use: High precision minimizes false alarms (wasted analyst time).

Recall (Sensitivity): Of all actual malicious alerts, how many did the model catch?

Recall = TP / (TP + FN) = 90 / (90 + 5) = 94.7%

Use: High recall minimizes missed threats (false negatives).

F1-Score: Harmonic mean of precision and recall (balanced metric).

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Example: 2 × (0.90 × 0.947) / (0.90 + 0.947) = 92.3%
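These formulas are simple enough to compute directly; a small helper applied to the confusion-matrix example above (TP=90, FP=10, FN=5):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=5)
print(f"Precision: {p:.1%}, Recall: {r:.1%}, F1: {f1:.1%}")
# Precision: 90.0%, Recall: 94.7%, F1: 92.3%
```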
Accuracy: Overall correctness (can be misleading with imbalanced datasets).

Caution: If 99% of alerts are benign, a model that always predicts "benign" has 99% accuracy but is useless.

Precision-Recall Trade-off¶
Challenge: Increasing recall often decreases precision (and vice versa).
Example:
- High Recall Threshold (score > 0.3): Catch 98% of threats, but 30% false positives
- High Precision Threshold (score > 0.8): 5% false positives, but miss 20% of threats
SOC Decision:
- For critical assets (servers, admin accounts): Prioritize recall (accept more FPs to avoid missing threats)
- For low-priority alerts: Prioritize precision (reduce noise)
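The trade-off can be demonstrated by sweeping a threshold over a set of made-up model scores: lowering the threshold raises recall but admits more false positives.

```python
def confusion_at(threshold, scores, labels):
    """Count TP/FP/FN when flagging everything scored at or above threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    return tp, fp, fn

# Illustrative model scores and ground-truth labels (1 = malicious)
scores = [0.95, 0.85, 0.75, 0.65, 0.45, 0.35, 0.25, 0.15]
labels = [1,    1,    0,    1,    1,    0,    0,    0]

for t in (0.3, 0.8):
    tp, fp, fn = confusion_at(t, scores, labels)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={t}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.3: precision=0.67, recall=1.00
# threshold=0.8: precision=1.00, recall=0.50
```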
ROC Curve (Receiver Operating Characteristic)¶
Visual Tool: Plot True Positive Rate (Recall) vs. False Positive Rate at various thresholds.
Interpretation:
- AUC (Area Under Curve) = 1.0: Perfect classifier
- AUC = 0.5: Random guessing (useless)
- AUC > 0.85: Generally good for security use cases
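With labeled scores in hand, scikit-learn computes AUC in one call (the labels and scores below are illustrative):

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]                    # 1 = malicious, 0 = benign
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.05]   # model scores

# AUC = probability a random malicious alert outscores a random benign one
auc = roc_auc_score(y_true, scores)
print(f"AUC: {auc:.2f}")
# AUC: 0.93
```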
9.3 Anomaly Detection & UEBA¶
User and Entity Behavior Analytics (UEBA)¶
Goal: Detect compromised accounts and insider threats by identifying deviations from normal behavior.
Approach:
1. Baseline: Learn typical behavior for each user/entity (login times, accessed systems, data volumes)
2. Score: Assign a risk score when behavior deviates from the baseline
3. Alert: Trigger an alert if the risk score exceeds a threshold
Example: Anomalous File Access¶
Baseline (90 days):
- User "jsmith" typically accesses 5-10 file shares/day
- Average data read: 50 MB/day
- Access times: 8 AM - 6 PM weekdays
Anomaly Detected:
- User "jsmith" accessed 150 file shares in 2 hours
- Data read: 5 GB
- Time: 2 AM Sunday
Risk Score: 95/100 (HIGH)
Investigation: Account potentially compromised. Check:
- Recent logins from unusual locations?
- MFA bypass or anomalies?
- Privilege escalation events?
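The baseline-then-score loop can be illustrated with a plain z-score. This is a deliberate simplification: the 10-points-per-sigma scaling below is a hypothetical choice, and production UEBA products use far richer statistical models.

```python
import statistics

# Hypothetical 90-day baseline for "jsmith": file shares accessed per day
baseline = [10, 8, 12, 9, 7, 11, 10, 9, 8, 10]
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

def risk_score(observed, mean, stdev, cap=100):
    """Scale deviation from baseline into a 0-100 risk score."""
    z = abs(observed - mean) / stdev
    return min(cap, round(z * 10))  # hypothetical scaling: 10 points per sigma

print(risk_score(150, mean, stdev))  # 150 shares in one window → 100 (capped)
print(risk_score(10, mean, stdev))   # within normal range → low score
```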
ML Techniques for Anomaly Detection¶
1. Isolation Forest
from sklearn.ensemble import IsolationForest
# Training data: Normal user behavior
X_train = [[10, 50], [8, 45], [12, 60], ...] # [file_shares_accessed, mb_transferred]
model = IsolationForest(contamination=0.05) # Expect 5% anomalies
model.fit(X_train)
# Score new activity
new_activity = [[150, 5000]] # 150 shares, 5000 MB
score = model.predict(new_activity)
# Output: -1 (anomaly) or 1 (normal)
Pros: No labeled anomalies needed (unsupervised)
Cons: Requires tuning the contamination parameter; slow-moving attacks can evade detection
2. Autoencoders (Neural Networks)
How It Works:
- Train a neural network to reconstruct normal user behavior
- When fed anomalous behavior, the reconstruction error is high
- High error → likely anomaly
Example:
import tensorflow as tf
from sklearn.metrics import mean_squared_error

# Autoencoder: Input → Compressed → Reconstructed Output
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),              # 20 behavioral features
    tf.keras.layers.Dense(10, activation='relu'),    # Compress
    tf.keras.layers.Dense(5, activation='relu'),     # Bottleneck
    tf.keras.layers.Dense(10, activation='relu'),    # Decompress
    tf.keras.layers.Dense(20, activation='sigmoid')  # Reconstruct
])
autoencoder.compile(optimizer='adam', loss='mse')

# Train on normal behavior only (input and target are the same)
autoencoder.fit(X_normal, X_normal, epochs=50)

# Score new activity by reconstruction error
reconstruction = autoencoder.predict(X_new)
error = mean_squared_error(X_new, reconstruction)
if error > threshold:
    alert("Anomalous behavior detected")
Pros: Powerful for complex, high-dimensional data
Cons: Requires significant training data and compute; harder to interpret
9.4 ML Pitfalls in Security¶
Pitfall 1: Overfitting¶
Problem: Model learns noise in training data instead of generalizable patterns.
Symptom: High accuracy on training data, poor accuracy on new data.
Example:
- Training accuracy: 99%
- Production accuracy: 60%
- Cause: The model memorized specific malware samples instead of learning malicious behavior patterns.
Mitigation:
- Use validation sets (hold out data for testing)
- Apply regularization techniques
- Simplify model complexity
- Collect more diverse training data
Pitfall 2: Data Drift (Concept Drift)¶
Problem: Threat landscape evolves; training data becomes stale.
Example:
- Model trained on 2023 malware samples
- In 2026, attackers use new techniques (e.g., LLM-generated malware)
- Model fails to detect new variants
Symptom: Gradual decline in detection rate over time.
Mitigation:
- Continuous retraining: Monthly or quarterly retraining with new labeled data
- Drift detection: Monitor model performance metrics; retrain when the F1-score drops >5%
- Ensemble models: Combine multiple models (old + new) for robustness
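The drift-detection mitigation reduces to a simple check, sketched below with the 5% tolerance used as the guideline in this section:

```python
def needs_retraining(baseline_f1, current_f1, tolerance=0.05):
    """Flag drift when F1 falls more than `tolerance` below the deployment baseline."""
    return (baseline_f1 - current_f1) > tolerance

print(needs_retraining(0.92, 0.84))  # True  (8-point drop exceeds tolerance)
print(needs_retraining(0.92, 0.90))  # False (2-point dip is within tolerance)
```

In a real pipeline, `current_f1` would come from a monthly evaluation against freshly labeled alerts, and the flag would open a retraining ticket rather than retrain automatically.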
Pitfall 3: Adversarial Evasion¶
Problem: Attackers intentionally manipulate features to evade ML detections.
Example: Malware Classification
- Model detects malware based on high entropy (encrypted/obfuscated code)
- Attacker adds benign-looking padding to reduce entropy
- Model misclassifies the sample as benign
Mitigation:
- Defense in depth: Combine ML with signature-based and behavioral detections
- Adversarial training: Train the model on adversarially modified samples
- Explainability: Understand which features drive predictions (to detect suspicious manipulations)
Pitfall 4: Imbalanced Datasets¶
Problem: Security events are rare (1% malicious, 99% benign).
Impact: Model biased toward majority class (predicts everything as benign).
Mitigation:
- Oversample the minority class: Duplicate or synthesize malicious examples (e.g., SMOTE, which generates synthetic samples)
- Undersample the majority class: Reduce benign examples (risk: losing information)
- Class weights: Penalize the model more for misclassifying rare events
- Stratified sampling: Ensure the test set reflects the real-world imbalance
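The class-weights mitigation is a one-argument change in scikit-learn: `class_weight='balanced'` reweights each class inversely to its frequency, so rare malicious examples count more during training. A sketch on synthetic imbalanced data (the 3-sigma shift applied to the minority class is contrived so the pattern is learnable):

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy data: 99% benign (label 0), 1% malicious (label 1)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:10] = 1
X[:10] += 3  # shift malicious examples so they are separable

# 'balanced' weighting penalizes misclassifying the rare class more heavily
model = RandomForestClassifier(class_weight="balanced", random_state=0)
model.fit(X, y)
print(model.score(X, y))
```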
Pitfall 5: Lack of Explainability¶
Problem: "Black box" models (deep neural networks) make predictions without clear reasoning.
Impact: Analysts can't validate model decisions; hard to debug errors.
Example:
- Model flags a user as high-risk
- Analyst asks: "Why?"
- Model: "Because neural network layer 7 activated" (not helpful)
Mitigation:
- Use interpretable models: Random Forests and Decision Trees (over deep learning when possible)
- SHAP/LIME: Explainability tools that highlight feature importance
- Human review: For high-impact decisions, require analyst validation
Example SHAP Output:
Alert Score: 0.92 (High Risk)
Top Contributing Features:
- Login from new country: +0.35
- Failed MFA attempt: +0.28
- Unusual time of day (3 AM): +0.18
- High data download volume: +0.11
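SHAP itself requires the third-party `shap` package; as a lighter, dependency-free starting point, tree ensembles in scikit-learn expose global feature importances. The feature names and synthetic data below are illustrative (the label is driven entirely by the first feature, so it should dominate):

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
# Hypothetical label driven only by feature 0 (e.g., threat intel score)
y = (X[:, 0] > 1).astype(int)

feature_names = ["threat_intel_score", "time_of_day",
                 "failed_logins", "asset_criticality"]
model = RandomForestClassifier(random_state=0).fit(X, y)

# Global importances: fraction of the model's decisions driven by each feature
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.2f}")
```

Unlike SHAP, these importances are global (per model, not per alert), so they answer "what does the model care about overall?" rather than "why was this alert flagged?".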
9.5 Practical ML Workflows in SOC¶
Workflow 1: ML-Augmented Alert Triage¶
[SIEM Alert] → [Feature Extraction] → [ML Scoring] → [Decision Logic]
                      ↓                     ↓               ↓
              (IP rep, user risk,     (Score: 0-100)  Auto-close (< 20)
               asset criticality)                     Analyst queue (20-80)
                                                      Escalate (> 80)
Implementation:
1. Extract features from the alert (IP, user, time, asset, historical data)
2. ML model scores the alert (0-100)
3. Automated actions based on score:
   - 0-20: Auto-close (high confidence benign)
   - 20-80: Tier 1 review (enriched with ML insights)
   - 80-100: Escalate to Tier 2 (likely true positive)
Human Oversight: Analysts periodically review auto-closed alerts to catch model errors.
Workflow 2: Proactive Threat Hunting with Anomaly Detection¶
[Baseline User Behavior] → [Monitor Activity] → [Anomaly Detection] → [Hunt Hypothesis]
↓ ↓ ↓ ↓
(90 days normal data) (Real-time logs) (Outlier flagged) (Investigate: Compromised?)
Example:
1. ML baselines normal behavior for all users (login patterns, file access, network activity)
2. User "admin_backup" is flagged as an anomaly: accessed a database at 3 AM (never seen before)
3. Analyst investigates: Account compromised? Legitimate IT maintenance?
4. Outcome: Compromise confirmed → incident response
Workflow 3: Adaptive Threat Intel Prioritization¶
[Threat Intel Feed] → [ML Relevance Scorer] → [Prioritized IOCs] → [SIEM Integration]
(50k IOCs/day) (Score 0-100) (Top 5% surfaced) (Alert on matches)
ML Features:
- IOC source reputation
- Age of IOC (recent = higher priority)
- Threat actor relevance to industry
- Historical match rate (did this feed produce actionable alerts before?)
Result: Analysts focus on 2,500 high-value IOCs instead of 50,000.
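A hand-weighted scorer over those features might look like the sketch below. The weights and the 90-day freshness decay are hypothetical choices for illustration; in the workflow above, a trained model would learn them from historical match data.

```python
def ioc_relevance(source_rep, age_days, actor_relevance, match_rate):
    """Score an IOC 0-100 from normalized (0-1) feature values."""
    freshness = max(0.0, 1.0 - age_days / 90)  # linear decay over 90 days
    score = (40 * source_rep        # reputable feed matters most
             + 25 * freshness       # recent IOCs outrank stale ones
             + 20 * actor_relevance # actor targets our industry?
             + 15 * match_rate)     # feed's historical hit rate
    return round(score)

# Fresh IOC from a reputable feed, relevant actor, decent history
print(ioc_relevance(0.9, age_days=2, actor_relevance=0.8, match_rate=0.6))  # → 85
```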
9.6 Building an ML-Powered Detection¶
Step-by-Step Example: Lateral Movement Detection¶
Goal: Detect attackers moving between systems using compromised credentials.
Step 1: Define the Problem
- Positive Class: Lateral movement events (confirmed from past incidents)
- Negative Class: Legitimate cross-system access (IT admin activity)
Step 2: Collect and Label Data
# Extract 6 months of authentication logs
data = query_siem(
"index=windows_auth Logon_Type=3", # Network logon
"earliest=-180d"
)
# Label examples (manual analyst review or known incident data)
data['label'] = [0, 1, 0, 0, 1, ...] # 0 = benign, 1 = lateral movement
Step 3: Feature Engineering
features = [
'time_since_last_login', # Minutes since user's last login
'source_host_criticality', # 0-5 (workstation=1, server=3, DC=5)
'destination_host_criticality',
'is_admin_account', # Boolean
'login_count_last_hour', # Velocity
'geographic_distance_km', # Distance between source/dest IPs
'time_of_day', # 0-23
'day_of_week', # 0-6
]
Step 4: Train Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X = data[features]
y = data['label']
# Stratify so the rare positive class is represented in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
Step 5: Evaluate
from sklearn.metrics import classification_report, roc_auc_score
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
# Output:
# precision recall f1-score
# Benign (0) 0.98 0.99 0.99
# Lateral (1) 0.87 0.82 0.84
#
# AUC: 0.93
# Analysis: Good performance, but 18% false negatives (missed lateral movements).
# Action: Tune threshold or add more features (e.g., parent process, command-line args).
Step 6: Deploy to Production
# Real-time scoring in SIEM or SOAR
def score_authentication_event(event):
features = extract_features(event)
score = model.predict_proba([features])[0][1] # Probability of lateral movement
if score > 0.85:
alert_tier2(event, score)
elif score > 0.50:
queue_tier1(event, score)
else:
log_for_audit(event, score)
Step 7: Monitor and Retrain
# Monthly: Evaluate model performance on new data
performance = evaluate_model(model, new_data)
if performance['f1_score'] < 0.80:
print("Model drift detected. Retraining...")
retrain_model(new_labeled_data)
Interactive Element¶
MicroSim 9: ML Model Tuning
Experiment with model thresholds and feature selection. See real-time impact on precision, recall, and alert volume.
Common Misconceptions¶
Misconception: ML Models Are Always Better Than Rules
Reality: Simple rule-based detections often outperform ML for well-defined threats (known IOCs, clear signatures). Use ML for complex, ambiguous patterns. Combine both for defense in depth.
Misconception: More Features = Better Model
Reality: Irrelevant features add noise and can degrade performance (curse of dimensionality). Focus on high-signal features. Feature engineering is more important than algorithm choice.
Misconception: ML Models Don't Require Maintenance
Reality: Models degrade over time due to data drift, adversarial adaptation, and infrastructure changes. Regular retraining and monitoring are essential.
Practice Tasks¶
Task 1: Calculate Metrics¶
Confusion Matrix:
- TP: 80
- FP: 15
- FN: 10
- TN: 895
Calculate:
a) Precision
b) Recall
c) F1-Score
Answers
a) Precision = TP / (TP + FP) = 80 / (80 + 15) = 84.2%
b) Recall = TP / (TP + FN) = 80 / (80 + 10) = 88.9%
c) F1 = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.842 × 0.889) / (0.842 + 0.889) = 86.5%
Task 2: Identify Data Drift¶
Scenario:
- Model trained in 2024, deployed in production
- 2024 performance: Precision 90%, Recall 85%
- 2026 performance: Precision 90%, Recall 60%
Questions:
1. What happened?
2. What should you do?
Answers
1. What happened?
   - Data drift (concept drift): attack techniques evolved, and the model no longer recognizes new TTPs.
   - Precision remained high (the model is still accurate when it fires), but recall dropped (it is missing many threats).
2. What should you do?
   - Retrain the model with recent labeled data (2025-2026 incidents)
   - Investigate: what attack techniques are being missed? Add relevant features.
   - Continuous monitoring: alert when recall drops below a threshold (e.g., <75%)
   - Consider an ensemble: keep the old model for legacy threats, add a new model for emerging threats
Task 3: Feature Selection¶
Goal: Detect phishing emails using ML.
Which features are most useful?
a) Email subject length (characters)
b) Sender domain reputation (threat intel score)
c) Presence of urgent language ("act now", "verify account")
d) Recipient's job title
e) Email received on a Tuesday
Answers
High-Value Features:
- b) Sender domain reputation: Direct indicator of malicious infrastructure
- c) Urgent language: Common phishing tactic (social engineering)
Low-Value Features:
- a) Subject length: Weak signal (both phishing and legitimate emails vary)
- d) Recipient job title: Possibly useful (executives are targeted more), but requires context
- e) Day of week: Likely irrelevant (phishing occurs on all days)
Best Practice: Start with high-signal features (b, c). Add (d) if you have data showing job title correlation. Skip (a, e).
Exam Prep & Certifications¶
Relevant Certifications
The topics in this chapter align with the following certifications:
- CompTIA Security+ — Domains: Security Operations, Incident Response
- CompTIA CySA+ — Domains: Incident Response, Reporting and Communication
- GIAC GCIH — Domains: Incident Handling, Incident Response Process
- CISSP — Domains: Security Operations, Security and Risk Management
Self-Assessment Quiz¶
Question 1: What is the primary difference between supervised and unsupervised learning?
Options:
a) Supervised learning is faster
b) Supervised learning requires labeled training data; unsupervised does not
c) Unsupervised learning is more accurate
d) Supervised learning only works with text data
Show Answer
Correct Answer: b) Supervised learning requires labeled training data; unsupervised does not
Explanation: Supervised learning learns from labeled examples (input → output). Unsupervised learning finds patterns in unlabeled data (e.g., clustering, anomaly detection).
Question 2: What does high precision but low recall indicate?
Options:
a) Model generates many false positives and misses many true positives
b) Model generates few false positives but misses many true positives
c) Model is perfect
d) Model is useless
Show Answer
Correct Answer: b) Model generates few false positives but misses many true positives
Explanation: Precision = TP/(TP+FP). High precision = few FPs. Recall = TP/(TP+FN). Low recall = many FNs (missed threats).
Question 3: What is UEBA primarily used for?
Options:
a) Encrypting security logs
b) Detecting anomalous user and entity behavior to identify compromised accounts
c) Patching vulnerabilities automatically
d) Generating threat intelligence reports
Show Answer
Correct Answer: b) Detecting anomalous user and entity behavior to identify compromised accounts
Explanation: UEBA baselines normal behavior and alerts on deviations, useful for detecting insider threats and account compromise.
Question 4: What is 'data drift' in the context of ML security models?
Options:
a) Model files getting corrupted over time
b) The threat landscape evolving, causing model performance to degrade
c) Training data being accidentally deleted
d) Analysts drifting away from following model recommendations
Show Answer
Correct Answer: b) The threat landscape evolving, causing model performance to degrade
Explanation: Data drift (concept drift) occurs when the statistical properties of data change over time (e.g., new attack techniques), making models less effective.
Question 5: Why is explainability important for ML models in security?
Options:
a) It makes models run faster
b) It allows analysts to understand and validate model decisions
c) It reduces training time
d) It eliminates all false positives
Show Answer
Correct Answer: b) It allows analysts to understand and validate model decisions
Explanation: Explainability (via SHAP, LIME, feature importance) helps analysts trust and debug model outputs, especially for high-impact decisions.
Question 6: What is an effective mitigation for adversarial evasion of ML models?
Options:
a) Use only ML-based detections
b) Combine ML with signature-based and behavioral detections (defense in depth)
c) Disable all security monitoring
d) Train models on only one type of threat
Show Answer
Correct Answer: b) Combine ML with signature-based and behavioral detections (defense in depth)
Explanation: Layered defenses make it harder for attackers to evade all detection methods. ML is powerful but not foolproof; combine it with other techniques.
Summary¶
In this chapter, you learned:
- ML fundamentals: Supervised (labeled data), unsupervised (pattern discovery), reinforcement (emerging)
- Performance metrics: Precision, recall, F1-score, ROC/AUC for evaluating models
- Anomaly detection: UEBA, Isolation Forest, autoencoders for detecting unusual behavior
- ML pitfalls: Overfitting, data drift, adversarial evasion, imbalanced datasets, lack of explainability
- Practical workflows: ML-augmented triage, proactive hunting, threat intel prioritization
- Building ML detections: Feature engineering, training, evaluation, deployment, monitoring
Next Steps¶
- Next Chapter: Chapter 10: LLM Copilots & Guardrails - Deep dive into Large Language Models for SOC assistance
- Practice: Build a simple anomaly detection model using the ML Tuning MicroSim
- Explore: Experiment with SHAP library for model explainability
- Read: Research papers on adversarial machine learning in security
Chapter 9 Complete | Next: Chapter 10 →