Chapter 9: AI/ML in SOC - Quiz
Instructions
Test your understanding of supervised vs unsupervised learning, classification/clustering/regression, UEBA, feature engineering, overfitting/underfitting, model drift, and adversarial ML.
Question 1: What is the primary difference between supervised and unsupervised learning?
A) Supervised learning uses GPUs, unsupervised uses CPUs B) Supervised learning trains on labeled data (known outcomes), unsupervised finds patterns in unlabeled data C) Supervised learning is always more accurate D) There is no difference
Answer
Correct Answer: B) Supervised learning trains on labeled data, unsupervised finds patterns in unlabeled data
Explanation:
Supervised Learning: - Training Data: Labeled examples (input + correct output) - Goal: Learn mapping from input → output - Use Cases: Classification, regression - Example: Train on 10,000 emails labeled "phishing" or "legitimate" → Model learns to classify new emails
Supervised Learning SOC Example:
Training Data:
Email 1: Subject: "Your account is locked" → Label: PHISHING
Email 2: Subject: "Meeting at 3pm" → Label: LEGITIMATE
... 10,000 labeled emails
Model learns patterns:
- Urgent language + suspicious link = PHISHING
- Internal sender + calendar invite = LEGITIMATE
Prediction on new email:
Email: "Verify your account now!" → Predicted: PHISHING (confidence: 92%)
Unsupervised Learning: - Training Data: Unlabeled examples (input only, no correct answers) - Goal: Find hidden patterns, group similar items - Use Cases: Clustering, anomaly detection - Example: Analyze network traffic to find unusual patterns (no labels needed)
Unsupervised Learning SOC Example:
Training Data: 1 million network connections (no labels)
Model finds clusters:
- Cluster 1: Web browsing (80% of traffic)
- Cluster 2: Email (15% of traffic)
- Cluster 3: SSH (4% of traffic)
- Cluster 4: Unusual (1% of traffic) ← Potential threats
Anomaly detected: Connection doesn't fit any cluster → Alert
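The "doesn't fit any cluster" idea can be sketched with an unsupervised anomaly detector such as Isolation Forest. The two features (bytes sent, duration) and the traffic values are illustrative:

```python
# Minimal unsupervised anomaly-detection sketch: no labels, the model
# flags the connection that does not resemble the rest.
from sklearn.ensemble import IsolationForest

# Unlabeled connections: [bytes_sent, duration_seconds]
connections = [[500, 10], [520, 12], [480, 9], [510, 11],
               [490, 10], [505, 12], [515, 9], [495, 11],
               [50000, 3600]]  # unusually large, long-lived connection

detector = IsolationForest(contamination=0.1, random_state=42)
labels = detector.fit_predict(connections)  # -1 = anomaly, 1 = normal

for conn, label in zip(connections, labels):
    if label == -1:
        print("Anomaly:", conn)
```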
When to Use Each: - Supervised: When you have labeled training data (malware samples, phishing emails) - Unsupervised: When you lack labels or want to find unknown threats
Reference: Chapter 9, Section 9.1 - Supervised vs Unsupervised
Question 2: What type of ML task is 'predicting whether an email is phishing or legitimate'?
A) Regression B) Classification C) Clustering D) Reinforcement learning
Answer
Correct Answer: B) Classification
Explanation:
Classification: - Definition: Predicting discrete categories/classes - Output: Class label (e.g., "phishing" or "legitimate") - Algorithm Examples: Logistic Regression, Random Forest, Neural Networks
Classification SOC Examples:
1. Phishing Detection:
Input: Email features (sender, subject, links, urgency words)
Output: PHISHING or LEGITIMATE
Classes: 2 (binary classification)
2. Malware Classification:
Input: File features (entropy, imports, strings, file size)
Output: Malware family or BENIGN
Classes: 2+ (binary or multi-class classification)
3. Alert Severity Prediction:
Input: Alert metadata (source IP, user, asset, threat intel)
Output: LOW, MEDIUM, HIGH, CRITICAL
Classes: 4 (multi-class classification)
4. Attack Type Classification:
Input: Incident indicators (observed techniques, affected assets)
Output: Attack category (e.g., phishing, C2, exfiltration)
Classes: 3+ (multi-class classification)
Classification vs Other Tasks:
Regression (Continuous Output): - Predicting a number (e.g., "risk score: 87.3") - Example: Predict time to compromise in minutes
Clustering (Group Similar Items): - No predefined classes - Example: Group users by behavior patterns
Classification Model Training:
Training Set: 10,000 labeled emails
- 6,000 LEGITIMATE
- 4,000 PHISHING
Model learns decision boundary:
IF (urgent_words > 3 AND external_sender AND suspicious_link):
Predict: PHISHING
ELSE:
Predict: LEGITIMATE
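A decision tree can learn a rule-like boundary similar to the one sketched above. The features and the handful of labeled rows below are illustrative:

```python
# Sketch: train a decision tree to learn a phishing/legitimate boundary.
from sklearn.tree import DecisionTreeClassifier

# [urgent_word_count, external_sender, suspicious_link] -> label
X = [[4, 1, 1], [5, 1, 1], [4, 1, 0], [1, 1, 0], [0, 0, 0], [2, 0, 0]]
y = ["PHISHING", "PHISHING", "LEGITIMATE",
     "LEGITIMATE", "LEGITIMATE", "LEGITIMATE"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

print(clf.predict([[5, 1, 1]])[0])  # urgent + external + suspicious link
print(clf.predict([[0, 0, 0]])[0])  # internal sender, no links
```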
Reference: Chapter 9, Section 9.2 - Classification
Question 3: What type of ML task is 'grouping users by similar behavior patterns without predefined labels'?
A) Classification B) Regression C) Clustering D) Supervised learning
Answer
Correct Answer: C) Clustering
Explanation:
Clustering: - Definition: Grouping similar data points without predefined labels - Type: Unsupervised learning - Output: Group assignments (e.g., "User belongs to Cluster 3") - Algorithms: K-Means, DBSCAN, Hierarchical Clustering
Clustering SOC Examples:
1. User Behavior Grouping:
Input: User activity features (login times, applications used, data accessed)
Process: Clustering algorithm finds natural groups
Output:
- Cluster 1: Sales team (CRM usage, daytime logins, high email volume)
- Cluster 2: Engineers (IDE/Git, late-night logins, SSH usage)
- Cluster 3: Finance (spreadsheets, 9-5 logins, financial system access)
- Cluster 4: Executives (mobile access, travel, light usage)
- Anomaly: User doesn't fit any cluster → Investigate
2. Network Traffic Clustering:
Input: Network flows (bytes, packets, duration, ports)
Output:
- Cluster 1: HTTP/HTTPS browsing
- Cluster 2: Email traffic
- Cluster 3: Database queries
- Cluster 4: Unknown (potential C2 traffic) → Alert
3. Alert Clustering:
Input: Alerts (source, destination, type, time)
Output: Groups of related alerts (likely same incident)
Benefit: Reduces alert fatigue by grouping 50 alerts into 1 incident
K-Means Clustering Example:
# Cluster users by login behavior (runnable sketch; values illustrative)
from sklearn.cluster import KMeans

# Features: [avg_login_hour, logins_per_week, applications_used]
user_behaviors = [
    [9.5, 25, 8],    # User 1: Regular office hours
    [14.2, 30, 12],  # User 2: Afternoon worker
    [2.3, 40, 5],    # User 3: Night shift (anomalous?)
    [9.0, 22, 7],    # User 4: Regular office hours
    [13.8, 28, 11],  # User 5: Afternoon worker
]

# Cluster into 3 groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(user_behaviors)
# User 3 ends up alone in a small cluster → Investigate
Clustering vs Classification: - Classification: Predefined classes (phishing/legitimate) - Clustering: Discover groups (algorithm finds natural patterns)
Reference: Chapter 9, Section 9.3 - Clustering
Question 4: What type of ML task is 'predicting a risk score from 0-100 for an alert'?
A) Classification B) Clustering C) Regression D) Reinforcement learning
Answer
Correct Answer: C) Regression
Explanation:
Regression: - Definition: Predicting continuous numerical values - Output: Number on continuous scale (not discrete classes) - Algorithms: Linear Regression, Random Forest Regressor, Neural Networks
Regression SOC Examples:
1. Alert Risk Scoring:
Input: Alert features (threat intel match, asset criticality, user risk, historical FP rate)
Output: Risk score 0-100 (e.g., 87.3)
Model: Regression
2. Time-to-Compromise Prediction:
Input: Vulnerability CVSS score, patch status, exposure, threat intel
Output: Predicted days until exploitation (e.g., 14.7 days)
Model: Regression
3. MTTR Prediction:
Input: Incident severity, complexity, team availability
Output: Predicted response time in minutes (e.g., 45.2 minutes)
Model: Regression
Example: Risk Score Regression Model
# Training data: Historical alerts with manually assigned risk scores
Training:
Alert 1: [threat_intel_match=1, asset_critical=1, user_privileged=1] → Risk: 95
Alert 2: [threat_intel_match=0, asset_critical=0, user_privileged=0] → Risk: 15
Alert 3: [threat_intel_match=1, asset_critical=0, user_privileged=0] → Risk: 60
... 10,000 alerts
Model learns: Risk ≈ (40 × threat_intel) + (30 × asset) + (25 × user) + ...
Prediction:
New Alert: [threat_intel=1, asset=1, user=0]
Predicted Risk: 87.3
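The learned-weights idea above can be sketched with linear regression. The three binary features and the analyst-assigned scores below are illustrative, and the exact prediction will differ from the 87.3 in the worked example:

```python
# Minimal regression sketch: predict a continuous risk score.
from sklearn.linear_model import LinearRegression

# [threat_intel_match, asset_critical, user_privileged]
X_train = [
    [1, 1, 1],  # strong indicators -> high risk
    [0, 0, 0],  # no indicators -> low risk
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]
y_train = [95, 15, 60, 55, 80, 30]  # analyst-assigned risk scores

model = LinearRegression()
model.fit(X_train, y_train)

# Predict a continuous score (not a class) for a new alert
new_alert = [[1, 1, 0]]
print(round(model.predict(new_alert)[0], 1))
```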
Regression vs Classification:
Classification: "Is this PHISHING or LEGITIMATE?" (discrete classes)
Regression: "What's the phishing probability?" (continuous: 0.873 = 87.3%)
Note: Classification and regression can solve similar problems differently: - Classification: Email is PHISHING - Regression: Email phishing probability is 0.92 → Threshold at 0.8 → Classify as PHISHING
Reference: Chapter 9, Section 9.4 - Regression
Question 5: What is UEBA (User and Entity Behavior Analytics)?
A) A type of firewall B) ML-based analysis of user and entity behavior to detect anomalies and insider threats C) A compliance framework D) A SIEM vendor
Answer
Correct Answer: B) ML-based analysis of user and entity behavior to detect anomalies and insider threats
Explanation:
UEBA (User and Entity Behavior Analytics): - Purpose: Detect anomalous behavior that deviates from established baselines - Method: Machine learning (unsupervised clustering, anomaly detection) - Entities: Users, hosts, applications, network devices
UEBA Use Cases:
1. Insider Threat Detection:
Baseline: User typically accesses 50 files/day from HR database
Anomaly: User suddenly downloads 10,000 files
Alert: "Abnormal data access - potential exfiltration"
2. Compromised Account Detection:
Baseline: User logs in from New York, 9am-5pm, Windows laptop
Anomaly: Login from Russia at 3am, Linux system
Alert: "Impossible travel + unusual OS"
3. Lateral Movement Detection:
Baseline: Workstation WKS-042 typically connects to 5 internal servers
Anomaly: WKS-042 connects to 50 servers in 10 minutes
Alert: "Abnormal network scanning behavior"
4. Privilege Escalation:
Baseline: User account "jdoe" never uses administrative tools
Anomaly: "jdoe" executes PowerShell with admin rights
Alert: "Unusual privilege usage"
UEBA ML Architecture:
Step 1: Baseline Learning (Training Phase)
- Collect 30-90 days of normal behavior
- Build user/entity profiles using unsupervised learning
- Example: User A baseline = [login_times, apps_used, data_accessed, ...]
Step 2: Anomaly Detection (Inference Phase)
- Compare current behavior to baseline
- Calculate anomaly score (0-100)
- Alert if score exceeds threshold
Step 3: Continuous Learning
- Update baselines as behavior evolves
- Handle concept drift (job changes, new applications)
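The baseline-vs-current comparison in Step 2 can be sketched with a simple per-feature z-score. Real UEBA products use richer multivariate models; the feature, scaling, and thresholds here are illustrative:

```python
# Sketch: score current behavior against a learned baseline.
import statistics

def anomaly_score(baseline_values, current_value):
    """0-100 score based on standard deviations from the baseline mean."""
    mean = statistics.mean(baseline_values)
    stdev = statistics.stdev(baseline_values) or 1.0
    z = abs(current_value - mean) / stdev
    return min(100, round(z * 25))  # 4+ standard deviations -> max score

# Baseline: files accessed per day over the training window
baseline = [48, 52, 50, 47, 53, 49, 51]
print(anomaly_score(baseline, 50))   # typical day -> low score
print(anomaly_score(baseline, 500))  # 10x spike -> high score, alert
```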
Example UEBA Alert:
User: jane.doe@company.com
Anomaly Score: 89/100
Reasons:
- Login from unusual location (China vs typical New York)
- Login at unusual time (3am vs typical 9am-5pm)
- Unusual application (SSH vs typical Outlook/Chrome)
- High data access (500 files vs typical 20)
Recommendation: Investigate potential compromise
UEBA Challenges: - False Positives: Legitimate behavior changes (new project, promotion) - Baseline Pollution: Training on compromised data - Cold Start: New users lack baseline (no historical data)
Popular UEBA Platforms: Exabeam, Securonix, Microsoft Sentinel UEBA, Splunk UBA
Reference: Chapter 9, Section 9.5 - UEBA
Question 6: What is feature engineering in machine learning?
A) Building physical features for hardware B) Selecting and transforming raw data into meaningful features that improve model performance C) Engineering department features D) Feature engineering is not used in ML
Answer
Correct Answer: B) Selecting and transforming raw data into meaningful features that improve model performance
Explanation:
Feature Engineering: - Definition: Creating informative features from raw data - Goal: Help ML model learn patterns more effectively - Impact: Often more important than algorithm choice for performance
Feature Engineering SOC Examples:
1. Phishing Email Classification:
Raw Data:
Email Subject: "URGENT: Verify your account now!"
Email Body: "Click here to verify: http://evil.com/verify"
Sender: "security@paypa1.com"
Engineered Features:
- urgent_word_count: 2 (URGENT, now)
- external_sender: True
- sender_domain_typosquat: True (paypa1 vs paypal)
- link_count: 1
- link_domain_mismatch: True (evil.com != paypa1.com)
- capitalization_ratio: 0.15 (15% caps)
- has_attachment: False
- sender_in_contacts: False
Why Feature Engineering Matters: - Raw text is hard for ML to process - Engineered features are numerical and meaningful - Model can learn: "If urgent_words > 2 AND typosquat = True → PHISHING"
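A few of the features above can be extracted with a short function. The dict-based email representation, the urgent-word list, and the internal-domain check are simplifying assumptions for illustration:

```python
# Sketch: turn a raw email into numeric/boolean features.
import re

URGENT_WORDS = {"urgent", "now", "immediately", "verify"}

def extract_features(email):
    subject_words = re.findall(r"[a-z0-9]+", email["subject"].lower())
    sender_domain = email["sender"].split("@")[-1]
    link_domains = re.findall(r"https?://([^/\s]+)", email["body"])
    return {
        "urgent_word_count": sum(w in URGENT_WORDS for w in subject_words),
        "external_sender": not sender_domain.endswith("company.com"),
        "link_count": len(link_domains),
        "link_domain_mismatch": any(d != sender_domain for d in link_domains),
    }

features = extract_features({
    "subject": "URGENT: Verify your account now!",
    "body": "Click here to verify: http://evil.com/verify",
    "sender": "security@paypa1.com",
})
print(features)
```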
2. Malware Detection:
Raw Data:
Windows PE executable binary (2,456,032 bytes)
Engineered Features:
- file_size: 2456032
- entropy: 7.8 (high entropy suggests encryption/packing)
- suspicious_imports: 5 (count of CreateRemoteThread, VirtualAllocEx, ...)
- string_matches: ["cmd.exe", "powershell"] (count: 2)
- packed: True (detected via entropy analysis)
- age_days: 3 (first seen 3 days ago)
- prevalence: 0.0001% (very rare file)
3. Network Traffic Anomaly Detection:
Raw Data:
TCP connection: 10.0.1.50:49234 → 203.0.113.45:443
Duration: 3600 seconds
Bytes sent: 1024
Bytes received: 52,428,800
Engineered Features:
- duration_minutes: 60
- bytes_ratio: 51200 (received/sent - indicates data download)
- destination_is_external: True
- destination_threat_intel_match: True
- connection_time: 02:00 (unusual hour)
- port_443_with_non_http: True (suspicious)
Feature Engineering Techniques:
1. Extraction: - Parse URLs to extract domain, TLD, path length - Extract header fields from packets
2. Transformation: - Log transformation (reduce skew in numeric features) - Normalization (scale 0-1)
3. Aggregation: - Count failed logins per user per hour - Average file size accessed per day
4. Encoding: - One-hot encoding for categorical variables (OS: Windows → [1,0,0], Linux → [0,1,0])
5. Domain Knowledge: - Threat intel lookups (convert IP → reputation score) - MITRE ATT&CK mapping (technique ID → category)
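Techniques 2 and 4 can be sketched in plain Python (scikit-learn and pandas offer equivalent transformers). The OS category list is illustrative:

```python
# Sketch: log transformation for a skewed numeric feature and
# one-hot encoding for a categorical feature.
import math

def log_transform(value):
    return math.log1p(value)  # log1p handles zero counts gracefully

def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

OS_CATEGORIES = ["Windows", "Linux", "macOS"]

print(round(log_transform(52428800), 2))  # bytes received, heavily skewed
print(one_hot("Linux", OS_CATEGORIES))    # [0, 1, 0]
```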
Reference: Chapter 9, Section 9.6 - Feature Engineering
Question 7: What is overfitting in machine learning?
A) A model that performs well on both training and test data B) A model that memorizes training data and performs poorly on new, unseen data C) A model that is too simple D) Overfitting improves model performance
Answer
Correct Answer: B) A model that memorizes training data and performs poorly on new, unseen data
Explanation:
Overfitting: - Problem: Model learns noise and specifics of training data instead of general patterns - Symptom: High accuracy on training data, poor accuracy on test/production data - Cause: Model too complex, too little training data, or too many features
Overfitting Example: Phishing Detection
Training Data (100 emails):
Email 1: Subject "Verify account" From: attacker@evil.com → PHISHING
Email 2: Subject "Meeting notes" From: colleague@company.com → LEGITIMATE
... 100 total
Overfit Model Behavior:
Model memorizes: "If sender = attacker@evil.com → PHISHING"
Training Accuracy: 100% (perfect!)
Test Data:
Email: Subject "Verify account" From: attacker2@evil2.com → Model predicts LEGITIMATE
Why? Model memorized specific sender "attacker@evil.com" instead of learning pattern "Verify account + external sender = PHISHING"
Test Accuracy: 60% (poor!)
Visual Representation:
Underfitting: Model too simple (straight line through scattered data)
Good Fit: Model captures pattern (smooth curve fitting trend)
Overfitting: Model too complex (zigzag line through every training point, including noise)
Signs of Overfitting: - Training accuracy: 99% - Test accuracy: 65% - Gap: 34% (large gap indicates overfitting)
Preventing Overfitting:
1. More Training Data:
More examples make it harder for the model to memorize specifics
2. Regularization:
Penalize model complexity to discourage fitting noise
3. Feature Selection:
Keep only the most predictive features; drop noisy ones
4. Cross-Validation:
Split data into 5 folds, train on 4, test on 1
Repeat 5 times, average performance
Ensures model generalizes across different data splits
5. Early Stopping:
Stop training when validation performance stops improving
Prevents model from over-learning training data
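The cross-validation step (technique 4) can be sketched as follows; the synthetic dataset stands in for labeled email features:

```python
# Sketch: 5-fold cross-validation to check generalization.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic stand-in for labeled email features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Train on 4 folds, test on 1, repeat 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print([round(s, 2) for s in scores])
print("mean accuracy:", round(scores.mean(), 2))
```

A large spread between fold scores, or a mean far below training accuracy, is a warning sign that the model is not generalizing.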
Example: SOC Alert Scoring Model
Problem: Model achieves 98% accuracy on historical alerts but only 70% on new alerts
Diagnosis: Overfitting (memorized specific IPs/users in training data)
Solution:
- Collected 10x more training data
- Reduced features from 50 → 15 most predictive
- Applied regularization
Result: Training 92%, Test 89% (better generalization)
Reference: Chapter 9, Section 9.7 - Overfitting
Question 8: What is underfitting in machine learning?
A) A model that is too complex B) A model that is too simple and fails to capture underlying patterns, performing poorly on both training and test data C) A perfect model D) Underfitting only affects test data
Answer
Correct Answer: B) A model that is too simple and fails to capture patterns, performing poorly on training and test data
Explanation:
Underfitting: - Problem: Model too simple to learn underlying patterns - Symptom: Poor performance on both training AND test data - Cause: Model lacks complexity, insufficient features, or insufficient training
Underfitting Example: Alert Severity Prediction
Scenario:
Data: 10,000 alerts with features [source_ip, destination_ip, time, user, ...]
Task: Predict severity (Low/Medium/High/Critical)
Underfit Model: Uses only 1 feature (time of day)
Rule: "If time between 2am-6am → High, else → Low"
Result:
Training Accuracy: 55%
Test Accuracy: 54%
Why? Model ignores critical features (threat intel, asset criticality, user risk)
Patterns are too complex for simple time-based rule
Visual Representation:
Data: Complex curved pattern
Underfit Model: Straight horizontal line (ignores pattern)
Good Model: Curve that follows trend
Overfit Model: Zigzag through every point
Signs of Underfitting: - Training accuracy: 60% - Test accuracy: 58% - Both low: Model hasn't learned patterns
Fixing Underfitting:
1. Increase Model Complexity:
Use a more expressive algorithm (e.g., simple rule → Random Forest)
2. Add More Features:
Before: [time]
After: [time, source_ip, threat_intel_match, asset_criticality, user_risk, ...]
Gives model more information to learn from
3. Train Longer:
Allow more training iterations so the model can learn the patterns
4. Remove Regularization:
Loosen complexity penalties that keep the model too simple
Example: Malware Detection Model
Problem: Model only checks file size
Rule: "If size > 10MB → Malware"
Training Accuracy: 52% (barely better than random)
Test Accuracy: 51%
Diagnosis: Underfitting (too simple)
Solution:
- Added features: entropy, imports, strings, behavior
- Changed algorithm: Simple rule → Random Forest
Result: Training 94%, Test 91% (learned real patterns)
Underfitting vs Overfitting:
Underfitting: Both training and test accuracy LOW
Good Fit: Both training and test accuracy HIGH (close values)
Overfitting: Training accuracy HIGH, test accuracy LOW (large gap)
Reference: Chapter 9, Section 9.8 - Underfitting
Question 9: What is model drift and why is it a problem in SOC operations?
A) Physical movement of servers B) Model performance degrades over time as real-world data distribution changes (e.g., new attack techniques, infrastructure changes) C) Model drift improves accuracy D) Model drift only affects regression models
Answer
Correct Answer: B) Model performance degrades over time as real-world data distribution changes
Explanation:
Model Drift (Concept Drift): - Problem: Real-world data changes, but model was trained on historical data - Result: Model accuracy degrades over time - Types: Concept drift (patterns change) and data drift (feature distributions change)
SOC Model Drift Examples:
1. Malware Detection Drift:
Training (2023):
- Model trained on 2023 malware samples
- Learns: Malware uses DLL injection, specific C2 patterns
Production (2026):
- New malware families use fileless techniques, cloud C2
- Model doesn't recognize new patterns
- Accuracy: 95% (2023) → 70% (2026)
- Reason: Attack techniques evolved
2. Phishing Detection Drift:
Training (2024):
- Phishing emails use poor grammar, obvious spoofing
Production (2026):
- Attackers use AI-generated perfect grammar
- Compromise legitimate accounts (no spoofing)
- Model trained on old patterns misses new sophisticated phishing
- Accuracy: 92% (2024) → 68% (2026)
3. Network Baseline Drift:
Training (Q1 2025):
- Company uses on-prem infrastructure
- UEBA baseline: 80% traffic to internal servers
Production (Q4 2025):
- Company migrates to cloud
- 70% traffic now to AWS/Azure
- UEBA alerts on legitimate cloud usage as "abnormal"
- False Positive Rate: 10% → 40%
4. User Behavior Drift:
Baseline (January):
- User "jdoe" is engineer, accesses code repos
Reality (June):
- User promoted to manager, now accesses HR systems
- UEBA flags as anomalous (legitimate job change)
Types of Drift:
Concept Drift: - Definition: Relationship between features and target changes - Example: Previously, exe files from email were always malicious. Now, legitimate software uses email distribution.
Data Drift: - Definition: Input feature distribution changes - Example: Average email length changes from 500 chars to 2000 chars (doesn't affect phishing patterns, but model wasn't trained on longer emails)
Detecting Model Drift:
Monitor:
1. Prediction accuracy over time (weekly/monthly)
2. Feature distributions (are inputs different than training data?)
3. False positive/negative rates
4. User feedback (analysts marking predictions as wrong)
Alert when:
- Accuracy drops >10% from baseline
- FP rate increases >20%
- Feature distributions shift significantly
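The feature-distribution check (item 2 above) can be sketched with a two-sample Kolmogorov-Smirnov test from SciPy. The email-length values are simulated for illustration:

```python
# Sketch: detect data drift by comparing training-time and
# production-time feature distributions.
import random
from scipy.stats import ks_2samp

random.seed(0)
# Email lengths seen at training time vs in production
training_lengths = [random.gauss(500, 50) for _ in range(500)]
production_lengths = [random.gauss(2000, 200) for _ in range(500)]

result = ks_2samp(training_lengths, production_lengths)
if result.pvalue < 0.01:
    print("Feature distribution shifted - possible data drift")
```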
Mitigating Model Drift:
1. Continuous Retraining:
Schedule: Retrain model monthly/quarterly
Data: Use recent labeled data (last 90 days)
Benefit: Model adapts to new patterns
2. Online Learning:
Update the model incrementally as new labeled data arrives
3. Ensemble Models:
Combine older and newer models so performance degrades gracefully
4. Human-in-the-Loop:
Analyst feedback on predictions feeds back into training data
Example Monitoring:
Malware Detection Model:
- Week 1: Accuracy 94%
- Week 10: Accuracy 91%
- Week 20: Accuracy 85% ← Alert: Significant drift detected
- Action: Retrain on recent malware samples
- Week 21: Accuracy 93% (post-retraining)
Reference: Chapter 9, Section 9.9 - Model Drift
Question 10: What is an adversarial example in ML security?
A) A training example from an adversary B) Intentionally crafted input designed to fool an ML model into making incorrect predictions C) A competitive ML model D) Adversarial examples don't exist
Answer
Correct Answer: B) Intentionally crafted input designed to fool an ML model into making incorrect predictions
Explanation:
Adversarial Examples: - Definition: Malicious inputs crafted to evade ML-based detection - Goal: Cause misclassification (malware → benign, phishing → legitimate) - Method: Small, often imperceptible modifications to bypass model
Adversarial Attack Examples:
1. Malware Evasion:
Original Malware:
- Hash: abc123
- Entropy: 7.9 (high)
- Imports: VirtualAlloc, CreateRemoteThread
- ML Prediction: MALWARE (confidence: 98%)
Adversarial Malware:
- Add benign comments/strings to reduce entropy
- Entropy: 6.2 (now within benign range)
- Functionality: UNCHANGED (still malicious)
- ML Prediction: BENIGN (confidence: 65%)
Result: Attacker evades detection by manipulating features
2. Phishing Email Evasion:
Original Phishing:
Subject: "URGENT: Verify account now!"
ML Model: Flags "URGENT" + "Verify" as phishing indicators
Prediction: PHISHING (95%)
Adversarial Phishing:
Subject: "Important notice regarding your account verification"
- Replaces urgent words with softer language
- Same malicious intent
- ML Prediction: LEGITIMATE (70%)
3. Network Traffic Evasion:
Original C2 Traffic:
- Large payload (10MB data exfil)
- ML detects unusual volume
- Prediction: MALICIOUS
Adversarial C2:
- Split payload into 1000 small requests (10KB each)
- Mimics normal web traffic pattern
- ML Prediction: BENIGN (traffic volume per connection is normal)
How Adversarial Attacks Work:
1. White-Box Attack (Attacker knows model):
- Attacker has access to model architecture and weights
- Calculates gradients to find minimal perturbation
- Crafts input to maximize misclassification
Example: Attacker knows malware detector uses entropy feature
→ Adds padding to reduce entropy below threshold
2. Black-Box Attack (Attacker doesn't know model):
- Attacker queries model with test inputs
- Learns decision boundary through trial and error
- Crafts evasion based on observed behavior
Example: Test 100 malware variants to find which features trigger detection
→ Modify those specific features
Defending Against Adversarial Examples:
1. Adversarial Training:
Include adversarial examples in training data
Model learns to recognize evasion attempts
Training Data:
- Original malware samples
- + Adversarially modified samples (still labeled malware)
Result: Model robust to evasion
2. Ensemble Defense:
Use multiple models with different architectures
Harder for attacker to evade all simultaneously
Vote: If 2+ models detect malware → Flag as malicious
3. Defense in Depth:
Don't rely solely on ML
Combine with:
- Signature-based detection (hash matching)
- Behavioral analysis (runtime monitoring)
- Sandboxing (execute in isolated environment)
4. Feature Robustness:
Use features that are hard to manipulate without breaking functionality
Example: For malware detection
- Bad feature: File size (easily manipulated with padding)
- Good feature: Control flow graph (changing breaks functionality)
5. Anomaly Detection:
Flag inputs that look unlike anything in the training distribution (possible crafted evasions)
Real-World Example:
Attack: APT group crafts malware to evade EDR ML model
Method:
- Analyzed EDR vendor's published research
- Identified entropy threshold (> 7.5 = malware)
- Added benign strings to reduce entropy to 7.2
Defense:
- EDR vendor retrains model with adversarial samples
- Adds new feature: "unusual padding ratio"
- Detects evasion attempt
Question 11: Why is continuous retraining important for security ML models?
A) To waste computing resources B) To adapt to evolving threats, new attack techniques, and changing baselines (mitigate model drift) C) Retraining is never necessary D) Models become worse with retraining
Answer
Correct Answer: B) To adapt to evolving threats, new attack techniques, and changing baselines
Explanation:
Need for Continuous Retraining: - Threat Evolution: Attackers adapt techniques - Environment Changes: Infrastructure migrations, new applications - Model Drift: Performance degrades without updates - New Attack Vectors: Zero-day exploits, novel malware families
Retraining Schedule Examples:
1. High-Frequency Retraining (Weekly/Monthly):
Use Case: Phishing detection
Reason: Phishing techniques evolve rapidly
Schedule: Retrain weekly with last 30 days of labeled emails
Benefit: Catches latest phishing trends
2. Medium-Frequency Retraining (Quarterly):
Use Case: Malware detection
Reason: New malware families emerge regularly
Schedule: Retrain quarterly with recent malware samples
Benefit: Adapts to new techniques while maintaining stability
3. Event-Driven Retraining:
Use Case: UEBA
Trigger: Major infrastructure change (cloud migration)
Action: Retrain immediately to establish new baseline
Benefit: Prevents false positive flood
Retraining Workflow:
Step 1: Collect Recent Data
- Last 90 days of labeled incidents
- Analyst feedback (false positives/negatives)
- New threat intelligence
Step 2: Prepare Training Set
- Combine historical data (for stability) with recent data (for adaptation)
- Ratio: 70% recent, 30% historical
- Balance classes (equal malicious/benign samples)
Step 3: Retrain Model
- Use same architecture or improved version
- Validate on hold-out test set
- Compare performance to previous model
Step 4: A/B Testing
- Deploy new model to 10% of traffic
- Monitor metrics (accuracy, FP rate, analyst feedback)
- If improved → Full deployment
- If worse → Rollback to previous model
Step 5: Monitor Performance
- Track accuracy over time
- Schedule next retraining
Example: Alert Scoring Model Retraining
Initial Model (January 2025):
- Trained on 2024 data
- Accuracy: 92%
March 2025:
- Accuracy degrades to 85% (drift detected)
- New attack campaigns using different TTPs
Retraining (April 2025):
- Collected Q1 2025 labeled alerts (5,000 samples)
- Combined with 2024 data (15,000 samples)
- Retrained model
Post-Retraining:
- Accuracy: 93% (improved)
- Adapted to Q1 2025 threats
Challenges: - Labeled Data: Need analyst time to label new incidents - Compute Cost: Retraining large models is expensive - Regression Risk: New model might perform worse on some scenarios
Best Practices: - Automate Pipeline: Scheduled retraining without manual intervention - Version Control: Track model versions, enable rollback - Continuous Monitoring: Detect drift early - Feedback Loop: Analyst corrections feed training data
Question 12: What is the difference between precision and recall in ML evaluation?
A) Precision and recall are the same metric B) Precision = (True Positives) / (True Positives + False Positives); Recall = (True Positives) / (True Positives + False Negatives) C) Precision measures speed, recall measures memory D) Only precision matters in security
Answer
Correct Answer: B) Precision = TP/(TP+FP); Recall = TP/(TP+FN)
Explanation:
Precision: - Definition: Of all predictions as POSITIVE, how many were actually positive? - Formula: Precision = TP / (TP + FP) - Meaning: Accuracy of positive predictions - Trade-off: High precision = few false positives
Recall (Sensitivity): - Definition: Of all actual POSITIVES, how many did we detect? - Formula: Recall = TP / (TP + FN) - Meaning: Coverage of actual threats - Trade-off: High recall = few false negatives (missed threats)
Confusion Matrix:
                    Predicted MALWARE   Predicted BENIGN
Actually MALWARE    TP = 90             FN = 10    (100 total malware)
Actually BENIGN     FP = 20             TN = 880   (900 total benign)
Calculate Metrics:
Precision = TP / (TP + FP) = 90 / (90 + 20) = 90/110 = 81.8%
- "Of 110 malware predictions, 90 were correct"
- "18.2% of malware alerts are false positives"
Recall = TP / (TP + FN) = 90 / (90 + 10) = 90/100 = 90%
- "Of 100 actual malware samples, we detected 90"
- "We missed 10% of malware (false negatives)"
SOC Implications:
High Precision, Low Recall:
Model: Very conservative (only flags obvious malware)
Precision: 95% (very few FPs)
Recall: 60% (misses 40% of malware)
Impact:
+ Analysts trust alerts (low FP rate)
- Misses sophisticated threats (high FN rate)
Use Case: Auto-blocking (only block high-confidence threats)
Low Precision, High Recall:
Model: Very aggressive (flags anything suspicious)
Precision: 50% (high FP rate)
Recall: 98% (catches almost all malware)
Impact:
+ Catches nearly all threats (low FN rate)
- Alert fatigue (analysts waste time on FPs)
Use Case: Initial screening (send to analyst for review)
Balanced (Optimize F1 Score):
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Example:
Precision: 85%
Recall: 80%
F1 = 2 × (0.85 × 0.80) / (0.85 + 0.80) = 0.824 = 82.4%
Use Case: Most SOC use cases (balance FP and FN)
Tuning Trade-off:
Threshold adjustment example (malware detector):
Threshold: 0.9 (very strict)
→ High Precision (95%), Low Recall (65%)
Threshold: 0.5 (balanced)
→ Medium Precision (85%), Medium Recall (82%)
Threshold: 0.3 (aggressive)
→ Low Precision (70%), High Recall (95%)
Choose based on use case!
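The threshold trade-off can be sketched in plain Python: the same model scores yield different precision/recall depending on where you cut. The confidence scores and ground-truth labels below are illustrative:

```python
# Sketch: sweep the decision threshold and watch precision fall
# as recall rises.
def classify(scores, threshold):
    return [1 if s >= threshold else 0 for s in scores]

# Model confidence that each sample is malware, plus ground truth
scores = [0.95, 0.92, 0.85, 0.6, 0.55, 0.4, 0.35, 0.2]
truth  = [1,    1,    1,    1,   0,    1,   0,    0]

for threshold in (0.9, 0.5, 0.3):
    preds = classify(scores, threshold)
    tp = sum(p and t for p, t in zip(preds, truth))
    fp = sum(p and not t for p, t in zip(preds, truth))
    fn = sum(not p and t for p, t in zip(preds, truth))
    print(threshold,
          "precision:", round(tp / (tp + fp), 2),
          "recall:", round(tp / (tp + fn), 2))
```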
Reference: Chapter 9, Section 9.12 - Precision vs Recall or Chapter 11 - Metrics
Question 13: A UEBA system flags a user for 'unusual data access' because they accessed 500 files (baseline: 50). Investigation reveals the user is performing a legitimate audit. What type of error is this?
A) True Positive B) False Positive C) False Negative D) True Negative
Answer
Correct Answer: B) False Positive
Explanation:
Classification:
Alert: Unusual data access (model predicted: MALICIOUS)
Reality: Legitimate audit (actual: BENIGN)
Result: FALSE POSITIVE (alarm raised when none was warranted)
Four Outcomes:
True Positive (TP):
Predicted: MALICIOUS
Actual: MALICIOUS
Example: UEBA flags impossible travel, investigation confirms account compromise
Result: ✅ Correct detection
True Negative (TN):
Predicted: BENIGN
Actual: BENIGN
Example: Normal file access, no alert
Result: ✅ Correct (no alarm needed)
False Positive (FP):
Predicted: MALICIOUS
Actual: BENIGN
Example: Legitimate audit flagged as data exfiltration
Result: ❌ Incorrect alarm (wasted analyst time)
False Negative (FN):
Predicted: BENIGN
Actual: MALICIOUS
Example: Insider slowly exfiltrates data, stays under threshold, not flagged
Result: ❌ Missed threat (dangerous!)
Impact of FPs in SOC: - Analyst Time: Wasted investigating benign activity - Alert Fatigue: Too many FPs → analysts become desensitized - Missed Threats: Time spent on FPs means less time for real threats
Reducing UEBA False Positives:
1. Dynamic Baselines:
Problem: Static threshold (50 files) doesn't account for legitimate changes
Solution: Adaptive baseline
- Detect user started audit project (new behavior cluster)
- Adjust expected range: 50-500 files for audit period
- Alert only if exceeds new range (e.g., 1000 files)
2. Context Enrichment:
Alert: 500 file access
Enrichment:
- Check calendar: "Annual audit week" scheduled
- Check ticket system: Audit ticket assigned to user
- Check manager approval: Access approved
Result: Auto-close as expected behavior
3. Feedback Loop:
Analyst marks alert as FP
→ Feeds back to model training
→ Model learns: "High file access during audit week = normal"
→ Future audits don't trigger alerts
4. Tuning Thresholds:
Current: Alert if > 10x baseline (50 → 500)
Tuned: Alert if > 20x baseline (50 → 1000)
Trade-off: Fewer FPs but might miss some real threats (higher FN)
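The threshold-tuning trade-off above can be sketched in a few lines. This is a toy multiplier-based check, not a real UEBA engine; the baseline and event counts are illustrative:

```python
# Minimal sketch of multiplier-based UEBA thresholding.
# Baseline and event counts are illustrative assumptions.
def should_alert(count, baseline, multiplier):
    """Alert when observed activity reaches multiplier x the learned baseline."""
    return count >= baseline * multiplier

baseline = 50                # files/day learned for this user
events = [500, 900, 1200]    # observed daily file-access counts

# Current tuning (10x): all three days alert, including the legitimate
# 500-file audit from the scenario above (a false positive)
print([should_alert(c, baseline, 10) for c in events])  # → [True, True, True]

# Tuned (20x): the audit no longer alerts, but a 900-file day now
# slips through too (a potential false negative)
print([should_alert(c, baseline, 20) for c in events])  # → [False, False, True]
```

Raising the multiplier trades FPs for FNs; the dynamic-baseline and context-enrichment approaches above try to cut FPs without paying that FN cost.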
Reference: Chapter 9, Section 9.5 - UEBA and Common Pitfalls
Question 14: What is the cold start problem in UEBA?
A) UEBA systems need to warm up before use B) Inability to establish behavioral baselines for new users/entities with no historical data C) Systems run better in cold climates D) Cold start is not a real problem
Answer
Correct Answer: B) Inability to establish baselines for new users/entities with no historical data
Explanation:
Cold Start Problem:
- Issue: UEBA requires 30-90 days of historical data to build a baseline
- Challenge: New users have no history
- Impact: No baseline = no anomaly detection (blind spot)
Cold Start Scenarios:
1. New Employee:
Day 1: User "new_hire" starts
UEBA: No historical data, can't establish baseline
Problem: If new_hire is malicious or compromised from day 1, UEBA won't detect anomalies
Risk Window: 30-90 days until baseline is established
2. New Server:
New database server deployed
UEBA: No baseline for normal network traffic patterns
Problem: Can't detect if server is immediately compromised (no anomaly reference)
3. Job Role Change:
User "jdoe" promoted from Sales → IT Admin
Old Baseline: CRM access, 9-5 logins
New Role: Server access, on-call hours
Problem: New behavior looks anomalous compared to old baseline
Result: False positive flood OR need to rebuild baseline (cold start again)
Mitigating Cold Start:
1. Peer Group Baselines:
Instead of individual baseline, use role-based baseline
New hire "jdoe" role: Software Engineer
Baseline: Aggregate behavior of all engineers
- Expected apps: IDE, Git, Slack
- Expected access: Code repos, dev servers
- Expected hours: Flexible (some work nights)
Anomaly: If jdoe accesses HR database → Alert (engineers don't typically access HR)
2. Default Safe Behavior:
During baseline learning period (30 days):
- Apply stricter rule-based detections
- Flag high-risk actions (e.g., privilege escalation)
- Don't rely solely on behavioral anomaly detection
3. Transfer Learning:
Use baselines from similar users/entities
New Sales user → Use existing Sales team baseline
New web server → Use existing web server baseline
Advantage: Immediate anomaly detection
Caveat: Assumes role similarity
4. Accelerated Baselining:
Compressed baseline period:
- Traditional: 90 days
- Accelerated: 7-14 days with more aggressive data collection
- Trade-off: Less accurate baseline but faster coverage
5. Hybrid Approach:
Combine:
- Peer group baseline (immediate coverage)
- + Individual baseline (building over 30-90 days)
- + Rule-based detections (catch obvious threats)
Transition: Start with peer baseline, gradually shift to personalized baseline
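The hybrid transition above can be modeled as a simple weighted blend that shifts from the peer-group baseline to the individual one as history accrues. A minimal sketch; the 90-day maturity window and all numeric values are illustrative assumptions:

```python
# Minimal sketch of the hybrid cold-start approach: blend a peer-group
# baseline with the user's emerging individual baseline, shifting weight
# toward the individual as observed history grows. Numbers are illustrative.
def blended_baseline(peer_mean, individual_mean, days_observed, maturity_days=90):
    w = min(days_observed / maturity_days, 1.0)  # individual weight grows with history
    return (1 - w) * peer_mean + w * individual_mean

# Day 1: no personal history, so the baseline is entirely the team average
print(blended_baseline(peer_mean=60.0, individual_mean=0.0, days_observed=0))
# → 60.0

# Day 27 (~30% mature): roughly the 70/30 team/individual split
print(blended_baseline(peer_mean=60.0, individual_mean=40.0, days_observed=27))
# → ~54.0
```

After `maturity_days` the weight saturates at 1.0 and the peer baseline drops out entirely, matching the "Week 9+" phase in the example below.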
Example:
New employee "alice" starts as Finance Analyst
Week 1-4: Cold Start Period
- Use Finance team baseline
- UEBA compares alice to average Finance behavior
- Alert if alice accesses engineering code (unusual for Finance)
Week 5-8: Hybrid Period
- Combine team baseline (70%) with emerging individual baseline (30%)
- Refine understanding of alice's specific patterns
Week 9+: Individual Baseline
- Sufficient data for personalized baseline
- Anomaly detection tailored to alice's specific behavior
Reference: Chapter 9, Section 9.5 - UEBA Cold Start
Question 15: Why is explainability important for ML models in SOC operations?
A) Explainability is not important B) Analysts need to understand why a model made a prediction to validate decisions, tune models, and build trust C) Models should always be black boxes D) Explainability slows down detection
Answer
Correct Answer: B) Analysts need to understand why a model made predictions to validate, tune, and build trust
Explanation:
Importance of Explainability:
1. Validation:
Alert: Malware detected (confidence: 95%)
Black Box: "File is malware" (no explanation)
→ Analyst: "Why? I need to investigate before blocking"
Explainable: "Malware because:"
- High entropy (7.9) - indicates packing/encryption
- Suspicious imports: VirtualAlloc, CreateRemoteThread
- Rare file (prevalence: 0.001%)
- Threat intel: Hash matches Emotet variant
→ Analyst: "Makes sense, approved for blocking"
2. Trust:
Analysts trust models they understand
Black box predictions → skepticism, manual override
Explainable predictions → confidence, faster response
3. Debugging/Tuning:
Problem: Model flags benign software as malware
Black Box: Hard to diagnose why
Explainable: "Flagged because high entropy (7.8)"
→ Diagnosis: Legitimate software also has high entropy (compression)
→ Fix: Add additional features (digital signature, prevalence)
4. Compliance:
Regulations (e.g., GDPR, CCPA) are widely interpreted as requiring a "right to explanation"
- If AI blocks user action, must explain why
- Auditors need to understand decision logic
Explainability Techniques:
1. Feature Importance:
Alert Scoring Model:
Which features most influenced score?
Feature Importance:
1. Threat Intel Match: 40%
2. Asset Criticality: 25%
3. User Risk Score: 20%
4. Time of Day: 10%
5. Other: 5%
Explanation: "Alert scored high primarily due to threat intel match"
2. SHAP (SHapley Additive exPlanations):
Base risk score: 50/100
Feature Contributions:
+ Threat intel match: +35 points
+ Critical asset: +20 points
- Low user risk: -5 points
- Business hours: -3 points
Final score: 97/100
Explanation: "High score driven by threat intel and critical asset"
3. Decision Trees (Inherently Explainable):
IF threat_intel_match = True:
IF asset_critical = True:
IF user_privileged = True:
Risk = CRITICAL
ELSE:
Risk = HIGH
Path: threat_intel=True → asset=True → user=False → Risk=HIGH
Explanation: Clear decision path
4. Counterfactual Explanations:
Alert: Blocked (confidence: 92%)
Explanation: "Would have been allowed if:"
- Threat intel confidence < 80% (currently 95%)
- OR asset criticality = Low (currently Critical)
Helps analyst understand decision boundary
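The SHAP-style breakdown above (technique 2) is additive: each feature pushes a base score up or down, and the signed contributions double as the explanation. A minimal sketch of that idea, using the same illustrative numbers; this is not a real SHAP computation, just the additive reporting pattern:

```python
# Minimal sketch of an additive, SHAP-style explanation: a base score plus
# signed per-feature contributions, reported ranked by magnitude.
# Contribution values are illustrative, not output of a trained model.
def explain_score(base, contributions):
    score = base + sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked

score, reasons = explain_score(
    base=50,
    contributions={
        "threat_intel_match": +35,
        "critical_asset": +20,
        "low_user_risk": -5,
        "business_hours": -3,
    },
)
print(score)       # → 97
print(reasons[0])  # → ('threat_intel_match', 35)
```

Real SHAP values come from the model itself (e.g., via the `shap` library), but the reporting contract is the same: contributions sum to the difference between the base value and the final score.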
Example: Phishing Detection
Email: "Urgent: Verify your account"
Prediction: PHISHING (confidence: 94%)
Black Box: No explanation
→ Analyst wastes 5 minutes manually analyzing
Explainable:
Reasons for PHISHING classification:
1. Urgent language: "Urgent", "Verify" (score: +30)
2. External sender + internal lookalike domain (score: +40)
3. Suspicious link (domain age: 2 days) (score: +20)
4. No previous email history with sender (score: +4)
→ Analyst: "Clear phishing, deleting and blocking" (1 minute decision)
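The phishing example above can be restated as a tiny additive scorer whose per-signal weights are themselves the explanation. The signal names, weights, and threshold are hypothetical, chosen only to reproduce the illustrative scores shown:

```python
# Hypothetical re-creation of the phishing example as an additive scorer.
# Signal names, weights, and the decision threshold are illustrative
# assumptions, not a trained model.
def classify_email(signals, weights, threshold=50):
    reasons = [(s, weights[s]) for s in signals if s in weights]
    score = sum(w for _, w in reasons)
    label = "PHISHING" if score >= threshold else "LEGITIMATE"
    return label, score, reasons

weights = {
    "urgent_language": 30,      # "Urgent", "Verify"
    "lookalike_domain": 40,     # external sender, internal lookalike domain
    "young_link_domain": 20,    # linked domain registered 2 days ago
    "no_sender_history": 4,     # no previous email from this sender
}
label, score, reasons = classify_email(list(weights), weights)
print(label, score)  # → PHISHING 94
```

Because every point of the score traces back to a named signal, the analyst sees the same reason list the model used, which is what turns the 5-minute manual review into a 1-minute decision.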
Trade-offs:
- Simple Models: Very explainable (decision trees, linear models) but may be less accurate
- Complex Models: More accurate (deep neural networks) but harder to explain
- Solution: Use explainability tools (SHAP, LIME) to interpret complex models
Reference: Chapter 9, Section 9.13 - Model Explainability or Chapter 10 - Guardrails
Score Interpretation¶
- 13-15 correct: Excellent! You have strong ML fundamentals and understand SOC-specific applications.
- 10-12 correct: Good understanding. Review overfitting/underfitting and model drift concepts.
- 7-9 correct: Adequate baseline. Focus on supervised vs unsupervised learning and UEBA.
- Below 7: Review Chapter 9 thoroughly, especially classification, regression, and feature engineering.