
# Chapter 3: SIEM & Data Lake Basics - Quiz



## Instructions

Test your understanding of SIEM architecture, query languages, correlation rules, and data lake approaches. Each question includes practical examples and detailed explanations.


??? question "Question 1: What is the primary function of a SIEM's correlation engine?"

**A)** Store logs in compressed format
**B)** Combine multiple events to detect complex attack patterns
**C)** Forward logs from endpoints to central storage
**D)** Encrypt data in transit

??? success "Answer"
    **Correct Answer: B) Combine multiple events to detect complex attack patterns**

    **Explanation:** The correlation engine is the core analytics component that combines events across time and systems to detect multi-stage attacks that individual events cannot reveal.

    **Example:**
    - Single failed login: Low priority (could be a typo)
    - 100 failed logins in 2 minutes, followed by a successful login: HIGH priority (brute force success)

    The correlation engine implements threshold-based, sequence-based, and behavioral correlation logic to surface genuine threats.

    **Reference:** [Chapter 3, Section 3.1 - What is a SIEM?](../chapters/ch03-data-modeling-normalization.md)
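The escalation in the example above can be sketched as a small Python correlation check. This is a simplified, hypothetical sketch — the event fields (`time`, `account`, `outcome`) are illustrative, not a real SIEM schema:

```python
from datetime import timedelta

def correlate_brute_force(events, threshold=100, window=timedelta(minutes=2)):
    """Flag accounts with many failed logins followed by a success.

    `events` is a time-ordered list of dicts with 'time', 'account', and
    'outcome' ('failure' or 'success'). Field names are illustrative.
    """
    alerts = []
    failures = {}  # account -> list of failure timestamps
    for ev in events:
        acct = ev["account"]
        if ev["outcome"] == "failure":
            failures.setdefault(acct, []).append(ev["time"])
        else:
            # On success, count only the failures inside the correlation window
            recent = [t for t in failures.get(acct, []) if ev["time"] - t <= window]
            if len(recent) >= threshold:
                alerts.append({"account": acct, "severity": "HIGH",
                               "reason": f"{len(recent)} failures then success"})
    return alerts
```

The point is that neither event stream alone is alarming; the combination within a time window is what raises severity.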


??? question "Question 2: Which query language is used by Microsoft Sentinel?"

**A)** SPL (Search Processing Language)
**B)** SQL (Structured Query Language)
**C)** KQL (Kusto Query Language)
**D)** Python

??? success "Answer"
    **Correct Answer: C) KQL (Kusto Query Language)**

    **Explanation:**
    - **Microsoft Sentinel:** KQL (Kusto Query Language)
    - **Splunk:** SPL (Search Processing Language)
    - **Elastic:** Lucene query syntax / EQL
    - **IBM QRadar:** AQL (Ariel Query Language)
    - **Traditional databases:** SQL

    KQL is also used by Azure Monitor and Azure Data Explorer.

    **Reference:** [Chapter 3, Section 3.1 - Major SIEM Platforms](../chapters/ch03-data-modeling-normalization.md)


??? question "Question 3: A SIEM query returns results in 30 seconds for hot data (last 30 days) but takes 10 minutes for data from 6 months ago. What explains this performance difference?"

**A)** RGB lighting on the SIEM server is insufficient
**B)** Hot data is indexed on fast storage (SSD), while older data is in warm/cold storage (HDD or archive)
**C)** The SIEM is broken and needs replacement
**D)** Query syntax is incorrect for older data

??? success "Answer"
    **Correct Answer: B) Hot data is indexed on fast storage (SSD), while older data is in warm/cold storage (HDD or archive)**

    **Explanation:** SIEMs use tiered storage for cost and performance optimization.

    **Storage Tiers:**
    - **Hot:** Recent data (0-30 days), indexed, SSD → subsecond to seconds
    - **Warm:** Older data (31-90 days), indexed, HDD → seconds to minutes
    - **Cold:** Archived data (>90 days), compressed, S3/Glacier → minutes to hours

    **Performance Trade-offs:**
    - Hot storage: Fast but expensive
    - Cold storage: Slow but cheap (10-100x lower cost)

    This is normal SIEM behavior, not a problem. (And RGB lighting definitely doesn't affect SIEM performance, despite what gamers might think.)

    **Reference:** [Chapter 3, Section 3.2 - Performance Considerations](../chapters/ch03-data-modeling-normalization.md)


??? question "Question 4: What is a key advantage of a security data lake over a traditional SIEM?"

**A)** Faster real-time alerting
**B)** Lower storage costs for long-term retention
**C)** Better out-of-the-box correlation rules
**D)** Simpler user interface

??? success "Answer"
    **Correct Answer: B) Lower storage costs for long-term retention**

    **Explanation:**

    **Data Lake Advantages:**
    - **Cost:** $0.02/GB/month (S3) vs. $100+/GB/month (indexed SIEM)
    - **Flexibility:** Store raw logs, query with Athena/Databricks/Spark
    - **ML-friendly:** Direct access for model training
    - **Schema-on-read:** Analyze any field without pre-indexing

    **SIEM Advantages:**
    - **Speed:** Faster queries (indexed)
    - **Real-time alerting:** Native correlation engine
    - **User experience:** Purpose-built UI for security analysts

    **Best Practice:** Hybrid approach - SIEM for hot data (30-90 days, real-time), data lake for cold data (1-7 years, investigations, ML).

    **Reference:** [Chapter 3, Section 3.5 - SIEM vs. Data Lake Comparison](../chapters/ch03-data-modeling-normalization.md)
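A quick back-of-the-envelope check of the cost gap, using the chapter's illustrative per-GB rates (assumptions for comparison, not vendor pricing):

```python
# Illustrative rates from the chapter, not actual vendor pricing.
S3_RATE = 0.02     # $/GB/month, object storage (data lake)
SIEM_RATE = 100.0  # $/GB/month, indexed SIEM storage

def monthly_cost(gb, rate_per_gb):
    """Monthly storage cost in dollars for `gb` gigabytes at a flat rate."""
    return gb * rate_per_gb

# Retaining 1 TB (1000 GB) of logs for one month:
lake_cost = monthly_cost(1000, S3_RATE)    # $20
siem_cost = monthly_cost(1000, SIEM_RATE)  # $100,000
savings_factor = siem_cost / lake_cost     # the lake is ~5000x cheaper at these rates
```

At these rates the gap compounds quickly over multi-year retention, which is why long-term storage gravitates to the lake.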


??? question "Question 5: Given this Splunk SPL query, what does it detect?"

    ```spl
    index=windows_auth EventCode=4625
    | stats count by Account_Name, Source_IP
    | where count > 10
    | sort -count
    ```

**A)** Successful logins from unusual locations
**B)** Brute force attempts (>10 failed logins per account/IP pair)
**C)** Process creations with encoded commands
**D)** Network connections to malicious IPs

??? success "Answer"
    **Correct Answer: B) Brute force attempts (>10 failed logins per account/IP pair)**

    **Explanation:** Let's break down the query:

    ```spl
    index=windows_auth EventCode=4625    # Windows failed login events
    | stats count by Account_Name, Source_IP  # Count failures per account+IP
    | where count > 10                    # Only show if >10 attempts
    | sort -count                         # Order by most attempts first
    ```

    **Windows Event IDs:**
    - **4624:** Successful login
    - **4625:** Failed login
    - **4634/4647:** Logoff

    This query identifies brute force password guessing attempts by flagging accounts with more than 10 failed login attempts from the same source IP.

    **Reference:** [Chapter 3, Section 3.3 - Example 1: Search for Failed Logins](../chapters/ch03-data-modeling-normalization.md)
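The aggregation this SPL pipeline performs can be mimicked in a few lines of Python — a rough sketch that approximates `stats`/`where`/`sort` behavior, not Splunk's implementation:

```python
from collections import Counter

def failed_login_stats(events, min_count=10):
    """Mimic `stats count by Account_Name, Source_IP | where count > N | sort -count`.

    `events` is a list of (account_name, source_ip) tuples, one per
    EventCode 4625 record. Returns [((account, ip), count), ...]
    sorted by count, descending.
    """
    counts = Counter(events)  # count failures per (account, IP) pair
    return sorted(
        ((pair, n) for pair, n in counts.items() if n > min_count),
        key=lambda item: -item[1],
    )
```

Grouping by the account/IP *pair* is what distinguishes a targeted brute force from scattered, unrelated typos.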

??? question "Question 6: What is the purpose of the 'join' operation in the following KQL query?"

    ```kql
    SecurityEvent
    | where EventID == 4625 and TimeGenerated > ago(10m)
    | summarize FailureCount = count() by Account
    | where FailureCount > 10
    | join kind=inner (
        SecurityEvent
        | where EventID == 4624
    ) on Account
    ```

**A)** To delete failed login events
**B)** To find accounts with >10 failed logins FOLLOWED BY a successful login (potential brute force success)
**C)** To encrypt the query results
**D)** To create new user accounts

??? success "Answer"
    **Correct Answer: B) To find accounts with >10 failed logins FOLLOWED BY a successful login (potential brute force success)**

    **Explanation:** This query detects **successful** brute force attacks:

    **Logic:**
    1. Find accounts with >10 failed logins (EventID 4625) in last 10 minutes
    2. Join with successful logins (EventID 4624) on the same account
    3. Result: Accounts that experienced failed attempts AND also successfully authenticated (note: as written, the join checks co-occurrence; a production rule would also compare timestamps to enforce the failed-then-success ordering)

    **Why This Matters:**
    - Failed logins alone might be typos or legitimate password resets
    - Failed logins + successful login = Attacker may have guessed the password
    - This is a higher-fidelity alert than just failed login counts

    **Reference:** [Chapter 3, Section 3.3 - Example 2: Detect Brute Force](../chapters/ch03-data-modeling-normalization.md)
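The inner-join step itself reduces to a set intersection, as this Python sketch shows (a hypothetical simplification: the per-account failure counts and successful-login accounts are assumed to be pre-computed):

```python
def brute_force_success(failed_counts, success_accounts):
    """Mimic the KQL inner join: accounts with >10 failures that also logged in.

    failed_counts: dict mapping account -> failure count in the window
    success_accounts: set of accounts with a successful login (EventID 4624)
    """
    suspects = {acct for acct, n in failed_counts.items() if n > 10}
    # kind=inner keeps only accounts present on BOTH sides of the join
    return sorted(suspects & success_accounts)
```

Accounts appearing on only one side — failures without a success, or routine successes — drop out, which is exactly why the joined result is higher fidelity.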

??? question "Question 7: What is threshold-based correlation?"

**A)** Combining events from different log sources
**B)** Triggering an alert when a count/value exceeds a defined limit (e.g., >10 failed logins in 5 minutes)
**C)** Encrypting correlation results
**D)** Automatically blocking IPs

??? success "Answer"
    **Correct Answer: B) Triggering an alert when a count/value exceeds a defined limit**

    **Explanation:** Threshold-based correlation triggers alerts when event counts or values cross predefined boundaries.

    **Example:**
    ```
    IF failed_login_count > 10 in 5 minutes
    THEN alert "Brute Force Attempt"
    ```

    **Other Correlation Types:**
    - **Sequence-based:** Event A followed by Event B within timeframe
    - **Behavioral anomaly:** Activity exceeds baseline + standard deviations

    **Tuning Challenge:** Setting thresholds too low = false positives; too high = missed threats.

    **Reference:** [Chapter 3, Section 3.4 - Types of Correlation](../chapters/ch03-data-modeling-normalization.md)
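A sliding-window version of the threshold rule above might look like this Python sketch (illustrative only — real SIEM engines evaluate such rules over indexed event streams, not in-process deques):

```python
from collections import deque

class ThresholdRule:
    """Threshold correlation: alert when events in a sliding window exceed a limit.

    Timestamps are plain seconds; the rule shape mirrors
    `IF failed_login_count > 10 in 5 minutes THEN alert`.
    """
    def __init__(self, limit=10, window_seconds=300):
        self.limit = limit
        self.window = window_seconds
        self.times = deque()

    def observe(self, t):
        """Record one event at time t; return True if the rule fires."""
        self.times.append(t)
        # Evict events that have aged out of the 5-minute window
        while self.times and t - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) > self.limit
```

The tuning challenge from the answer shows up directly as the `limit` and `window_seconds` parameters: shrink either and false positives rise; grow them and slow attacks slip under the threshold.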

??? question "Question 8: A new correlation rule generates 200 alerts/day with an 80% false positive rate. What is a recommended tuning strategy?"

**A)** Disable the rule entirely
**B)** Add allowlisting for known benign processes/systems and increase thresholds
**C)** Increase alert volume to 500/day
**D)** Delete all security logs

??? success "Answer"
    **Correct Answer: B) Add allowlisting for known benign processes/systems and increase thresholds**

    **Explanation:** High false positive rates burden analysts and obscure genuine threats. Effective tuning strategies include:

    **1. Add Allowlisting:** Exclude known-good entities that trigger the rule legitimately.

    ```spl
    ... | where NOT (user_name IN ("service_account_backup", "scheduled_task_user"))
    ```

    **2. Increase Thresholds:** Change `failed_login_count > 5` to `failed_login_count > 10`. Trade-off: reduces FPs but may miss slower attacks.

    **3. Add Context Filters:** Only alert on risky users or critical assets.

    ```spl
    ... | where user_risk_score > 50 OR asset_criticality="high"
    ```

    **4. Time-based Filtering:** Ignore maintenance windows.

    ```spl
    ... | where date_hour NOT IN (2,3,4,5)
    ```

    **Goal:** Reduce the FP rate to <20% while maintaining detection coverage.

    **Reference:** [Chapter 3, Section 3.6 - Tuning Strategies](../chapters/ch03-data-modeling-normalization.md)
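Strategies 1-3 can be combined into a single post-processing filter. The following Python sketch is hypothetical — the field names (`user_name`, `failed_login_count`, `user_risk_score`, `asset_criticality`) are illustrative, not a specific SIEM schema:

```python
# Hypothetical allowlist of known-benign accounts that trigger the rule.
ALLOWLIST = {"service_account_backup", "scheduled_task_user"}

def should_alert(event, threshold=10, min_risk=50):
    """Apply allowlisting, a raised threshold, and a context filter in order."""
    if event["user_name"] in ALLOWLIST:
        return False                               # strategy 1: allowlist
    if event["failed_login_count"] <= threshold:
        return False                               # strategy 2: raised threshold
    return (event["user_risk_score"] > min_risk    # strategy 3: context filter -
            or event["asset_criticality"] == "high")  # risky user OR critical asset
```

Ordering matters for cost: the cheap allowlist check runs first, so enrichment lookups like risk scoring are only consulted for events that survive the earlier filters.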


??? question "Question 9: In a hybrid SIEM + Data Lake architecture, which use case is BEST suited for the data lake?"

**A)** Real-time alerting on brute force attempts
**B)** Historical investigation of a 6-month-old breach
**C)** Sub-second dashboard updates
**D)** Active incident response coordination

??? success "Answer"
    **Correct Answer: B) Historical investigation of a 6-month-old breach**

    **Explanation:**

    **SIEM (Hot Path):**
    - Real-time detection and alerting
    - Active incident response
    - Fast dashboards and searches
    - **Timeframe:** Last 30-90 days

    **Data Lake (Cold Path):**
    - Historical investigations (6+ months ago)
    - ML model training on large datasets
    - Compliance audits requiring years of logs
    - Cost-effective long-term retention
    - **Timeframe:** 1-7 years

    **Example Workflow:**
    ```
    Active incident → Query SIEM (fast results)
    Historical breach analysis → Query data lake (Athena, Databricks)
    ML training → Pull raw data from data lake
    ```

    **Reference:** [Chapter 3, Section 3.5 - Hybrid Architecture](../chapters/ch03-data-modeling-normalization.md)

??? question "Question 10: What is the purpose of the 'stats' command in Splunk SPL?"

**A)** To delete events
**B)** To aggregate data (count, sum, average) by specified fields
**C)** To encrypt search results
**D)** To create new user accounts

??? success "Answer"
    **Correct Answer: B) To aggregate data (count, sum, average) by specified fields**

    **Explanation:** The `stats` command performs aggregation operations in SPL:

    **Common Operations:**
    ```spl
    | stats count by user                  # Count events per user
    | stats sum(bytes) by dest_ip          # Total bytes per destination
    | stats avg(response_time) by service  # Average response time
    | stats values(host) by user           # List unique hosts per user
    ```

    **Example:**
    ```spl
    index=firewall action=allowed
    | stats count by dest_ip, dest_port
    | where count > 100
    ```
    This finds destinations with >100 connections, useful for detecting data exfiltration or port scans.

    **Reference:** [Chapter 3, Section 3.3 - Writing SIEM Queries](../chapters/ch03-data-modeling-normalization.md)

??? question "Question 11: A SIEM query searches `index=* source=* user=*`. What is the problem with this query?"

**A)** It's perfectly optimized
**B)** It uses wildcards inefficiently, searching all indexes and sources (slow and resource-intensive)
**C)** It will automatically fix itself
**D)** It only searches encrypted data

??? success "Answer"
    **Correct Answer: B) It uses wildcards inefficiently, searching all indexes and sources (slow and resource-intensive)**

    **Explanation:** This query violates SIEM query optimization best practices:

    **Problems:**
    - `index=*` searches ALL indexes (could be 50+ indexes across terabytes)
    - `source=*` searches all data sources
    - `user=*` searches all users
    - Result: Extremely slow, high resource consumption

    **Optimized Version:**
    ```spl
    index=windows_auth EventCode=4625
    | stats count by Account_Name
    | where count > 5
    | sort -count
    ```

    **Optimization Rules:**
    1. Specify exact index
    2. Use specific EventCode/sourcetype
    3. Use specific field names instead of wildcards
    4. Add time constraints (`earliest=-24h`)
    5. Filter early (before stats/joins)

    **Reference:** [Chapter 3, Practice Tasks - Task 3: Optimize a Query](../chapters/ch03-data-modeling-normalization.md)

??? question "Question 12: In alert tuning metrics, what does 'Precision' measure?"

**A)** Total number of alerts generated
**B)** TP / (TP + FP) - The percentage of alerts that are actually malicious
**C)** TP / (TP + FN) - The percentage of threats that were detected
**D)** Response time per alert

??? success "Answer"
    **Correct Answer: B) TP / (TP + FP) - The percentage of alerts that are actually malicious**

    **Explanation:**

    **Precision = TP / (TP + FP)**
    - **Meaning:** Of all alerts the rule generates, how many are true threats?
    - **High Precision:** Few false alarms, analyst time well-spent
    - **Example:** 90 true positives, 10 false positives → Precision = 90%

    **Recall = TP / (TP + FN)**
    - **Meaning:** Of all actual threats, how many did the rule catch?
    - **High Recall:** Few missed threats

    **Tuning Trade-offs:**
    - Increasing threshold: Higher precision, lower recall
    - Decreasing threshold: Lower precision, higher recall
    - Goal: Balance based on risk tolerance and analyst capacity

    **Reference:** [Chapter 3, Section 3.6 - Metrics for Tuning](../chapters/ch03-data-modeling-normalization.md)
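The two formulas translate directly to code; the numbers below are the worked example from the answer (90 true positives, 10 false positives):

```python
def precision(tp, fp):
    """Of all alerts fired, what fraction were real threats? TP / (TP + FP)"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all real threats, what fraction did we catch? TP / (TP + FN)"""
    return tp / (tp + fn)

p = precision(90, 10)  # 0.9 -> 90% of alerts are genuine
r = recall(90, 10)     # 0.9 -> 90% of real threats were detected
```

Note the two denominators: precision divides by what the rule *said* (alerts), recall by what *actually happened* (threats) — which is why tuning a threshold trades one against the other.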

??? question "Question 13: Which of the following is a sequence-based correlation example?"

**A)** Alert when failed logins exceed 10
**B)** Alert when login from IP_A is followed within 1 hour by login from IP_B >1000km away (Impossible Travel)
**C)** Alert when any user logs in
**D)** Alert when disk space is low

??? success "Answer"
    **Correct Answer: B) Alert when login from IP_A is followed within 1 hour by login from IP_B >1000km away (Impossible Travel)**

    **Explanation:** Sequence-based correlation detects multi-step patterns where event order and timing matter.

    **Example Logic:**
    ```
    IF (login_from_IP_A at time T1)
    THEN (within 1 hour, login_from_IP_B at time T2)
    WHERE distance(IP_A, IP_B) > 1000 km
    THEN alert "Impossible Travel"
    ```

    **Other Sequence Examples:**
    - Credential dumping → Lateral movement
    - Malware execution → C2 connection → Data exfiltration
    - Privilege escalation → Account creation

    **Contrast with Threshold-based:**
    - Threshold: Count exceeds limit (no sequence requirement)
    - Sequence: Event A must occur before Event B

    **Reference:** [Chapter 3, Section 3.4 - Types of Correlation](../chapters/ch03-data-modeling-normalization.md)
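A minimal version of the impossible-travel check might look like this Python sketch, assuming login locations have already been resolved from IPs to coordinates via a geolocation lookup (the haversine formula gives great-circle distance):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

def impossible_travel(login_a, login_b, max_km=1000, max_hours=1):
    """Sequence rule: two logins too far apart in space, too close in time.

    Each login is (timestamp_hours, lat, lon); coordinates are assumed
    to come from an IP-to-location lookup.
    """
    hours_apart = abs(login_b[0] - login_a[0])
    distance = haversine_km(login_a[1], login_a[2], login_b[1], login_b[2])
    return hours_apart <= max_hours and distance > max_km
```

Unlike a threshold rule, both the timing relationship between the two events and their attributes (distance) must hold, which is the defining trait of sequence-based correlation.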

??? question "Question 14: When querying a security data lake with AWS Athena, what language is typically used?"

**A)** SPL (Splunk Processing Language)
**B)** KQL (Kusto Query Language)
**C)** SQL (Structured Query Language)
**D)** Python exclusively

??? success "Answer"
    **Correct Answer: C) SQL (Structured Query Language)**

    **Explanation:** AWS Athena uses standard SQL to query data stored in S3.

    **Example:**
    ```sql
    SELECT userIdentity.userName, sourceIPAddress, COUNT(*) as event_count
    FROM cloudtrail_logs
    WHERE eventTime > '2026-02-01'
      AND eventName = 'AssumeRole'
    GROUP BY userIdentity.userName, sourceIPAddress
    HAVING event_count > 50
    ```

    **Data Lake Query Tools:**
    - **AWS Athena:** SQL (Presto-based)
    - **Azure Data Explorer:** KQL
    - **Databricks:** SQL, Spark (Python/Scala)
    - **Google BigQuery:** SQL

    **Trade-off:** SQL is powerful and flexible, but requires more analyst skill than purpose-built SIEM UIs.

    **Reference:** [Chapter 3, Section 3.5 - Querying a Data Lake](../chapters/ch03-data-modeling-normalization.md)
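Because Athena speaks standard SQL, the same GROUP BY / HAVING pattern can be tried locally — here with Python's built-in `sqlite3` on a flattened, simplified schema (Athena's nested `userIdentity.userName` becomes a plain column, and the rows are made-up sample data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cloudtrail_logs (user_name TEXT, source_ip TEXT, event_name TEXT)"
)
# 60 AssumeRole calls from an automation identity, 3 from a human user
rows = ([("deploy-bot", "198.51.100.7", "AssumeRole")] * 60
        + [("alice", "203.0.113.9", "AssumeRole")] * 3)
conn.executemany("INSERT INTO cloudtrail_logs VALUES (?, ?, ?)", rows)

result = conn.execute("""
    SELECT user_name, source_ip, COUNT(*) AS event_count
    FROM cloudtrail_logs
    WHERE event_name = 'AssumeRole'
    GROUP BY user_name, source_ip
    HAVING event_count > 50
""").fetchall()
# Only the high-volume caller survives the HAVING filter
```

Dialects differ in details (Athena is Presto-based; nested-field access and date functions vary), but the aggregate-then-filter shape carries over unchanged.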

??? question "Question 15: A SOC has the following metrics before and after tuning a brute force detection rule. What is the BEST assessment of this tuning effort?"

    - **Before:** 200 alerts/day, 20% precision, 95% recall
    - **After:** 50 alerts/day, 70% precision, 90% recall

**A)** Failed - recall decreased, rule should be reverted
**B)** Successful - precision greatly improved with acceptable recall trade-off, and alert volume is manageable
**C)** Failed - alert volume decreased too much
**D)** No change - metrics are irrelevant

??? success "Answer"
    **Correct Answer: B) Successful - precision greatly improved with acceptable recall trade-off, and alert volume is manageable**

    **Explanation:** Let's analyze the impact:

    **Before Tuning:**
    - 200 alerts/day × 20% precision = 40 true positives
    - 200 alerts/day × 80% FP rate = 160 false positives/day
    - Analysts waste time on 160 FPs daily
    - 95% recall = catching 95% of threats

    **After Tuning:**
    - 50 alerts/day × 70% precision = 35 true positives
    - 50 alerts/day × 30% FP rate = 15 false positives/day
    - 90% recall = catching 90% of threats

    **Trade-offs:**
    - Lost 5 true positives (40 → 35)
    - Eliminated 145 false positives (160 → 15)
    - Recall slightly decreased (95% → 90%) - acceptable
    - Alert volume now manageable (50 vs 200)

    **Result:** **SUCCESS**. Analysts handle volume better, focus on genuine threats, reduced burnout. The 5% recall loss (missing 5 out of 100 threats) is acceptable given the massive reduction in noise.

    **Reference:** [Chapter 3, Section 3.6 - Metrics for Tuning Example](../chapters/ch03-data-modeling-normalization.md)
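The alert-volume arithmetic in the analysis above is easy to verify:

```python
def alert_breakdown(alerts_per_day, precision):
    """Split daily alert volume into true positives and false positives."""
    tp = alerts_per_day * precision
    return tp, alerts_per_day - tp

before_tp, before_fp = alert_breakdown(200, 0.20)  # 40 TPs, 160 FPs per day
after_tp, after_fp = alert_breakdown(50, 0.70)     # 35 TPs, 15 FPs per day
```

Trading 5 true positives for 145 fewer false positives is the concrete form of the precision/recall trade-off from Question 12.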

## Score Interpretation

- **13-15 correct:** Excellent! You understand SIEM architecture, query languages, and correlation.
- **10-12 correct:** Good foundation. Review correlation types and query optimization.
- **7-9 correct:** Adequate grasp. Focus on hands-on SIEM query practice.
- **Below 7:** Review Chapter 3, especially SIEM vs. data lake trade-offs and correlation logic.
