Chapter 3: SIEM & Data Lake Basics¶

Learning Objectives¶

By the end of this chapter, you will be able to:

Explain the architecture and core functions of a SIEM platform
Compare SIEM vs. data lake approaches for security analytics
Write basic search queries in common SIEM query languages
Design correlation rules to detect multi-stage attacks
Evaluate SIEM performance metrics and tuning strategies

Prerequisites¶

Chapter 2: Understanding of telemetry sources and log normalization
Basic SQL or query language familiarity
Understanding of common attack patterns (MITRE ATT&CK)

Key Concepts¶

SIEM • Data Lake • Correlation Rule • Index • SPL (Search Processing Language) • KQL (Kusto Query Language) • Alert Tuning

Curiosity Hook: The Needle in the Haystack¶

Your SIEM ingests 50,000 events per second—4.3 billion per day. Somewhere in that data stream is evidence of an ongoing breach.

The Challenge: How do you find it?

A Tier 2 analyst needs to investigate a suspected compromised account. They could manually review millions of authentication logs, or they could write a query that completes in 3 seconds:

source="windows_auth" user="jdoe" action="login"
| stats count by src_ip, dest_host
| where count > 100
| lookup threat_intel ip as src_ip

The result: 2 suspicious IPs with 500+ failed login attempts from a known botnet.

SIEMs transform raw data into actionable intelligence. This chapter shows you how.

3.1 What is a SIEM?¶

Definition¶

A Security Information and Event Management (SIEM) platform collects, normalizes, correlates, and analyzes security events from across an organization's infrastructure in real-time.

Core Functions¶

Collection: Ingest logs from diverse sources (endpoints, network, cloud, apps)
Normalization: Map fields to a common schema (e.g., all usernames → user.name)
Indexing: Structure data for fast retrieval
Correlation: Combine multiple events to detect complex attack patterns
Alerting: Generate notifications when correlation rules trigger
Visualization: Dashboards and reports for monitoring and compliance
Investigation: Search and pivot through historical data

Major SIEM Platforms¶

Platform	Query Language	Strengths
Splunk	SPL (Search Processing Language)	Mature ecosystem, powerful analytics, extensive integrations
Microsoft Sentinel	KQL (Kusto Query Language)	Native Azure integration, AI-driven analytics, cloud-native
Elastic (ELK)	Lucene / Painless	Open-source core, flexible data lake integration, scalability
IBM QRadar	AQL (Ariel Query Language)	Strong correlation engine, regulatory compliance focus
Chronicle	YARA-L	Google-scale architecture, fast search, built-in threat intel

3.2 SIEM Architecture¶

Data Pipeline¶

[Log Sources] → [Forwarders/Collectors] → [Parsing/Normalization] → [Indexing] → [Search/Correlation] → [Alerting]

1. Collection Layer - Agents: Installed on endpoints (Splunk Universal Forwarder, Elastic Beats) - Agentless: Syslog receivers, API polling (AWS CloudTrail, Office 365) - Network taps: Passive collection from network traffic

2. Parsing/Normalization - Extract fields from raw logs (regex, grok patterns, XML/JSON parsing) - Map to common schema (CIM, ECS) - Enrich with context (GeoIP, threat intel, asset data)

3. Indexing - Store data in time-series indexes optimized for temporal queries - Compress and distribute across cluster nodes - Create field indexes for fast searches

4. Correlation Engine - Real-time evaluation of rules against incoming events - Stateful correlation (track sequences across time windows) - Threshold-based alerting (e.g., >10 failed logins in 5 minutes)

Performance Considerations¶

Indexing Speed: - Target: Process events within seconds of generation - Bottleneck: CPU for parsing, disk I/O for indexing - Scaling: Horizontal (add indexers) or vertical (bigger servers)

Search Performance: - Hot data (recent, indexed): Subsecond to seconds - Warm data (older, indexed): Seconds to minutes - Cold data (archived, compressed): Minutes to hours

Cost Optimization: - Index only high-value logs fully (auth, EDR, critical servers) - Archive verbose logs (proxy, NetFlow) in object storage - Use tiered storage (SSD for hot, HDD for warm, S3 for cold)

3.3 Writing SIEM Queries¶

Example 1: Search for Failed Logins (Splunk SPL)¶

index=windows_auth EventCode=4625
| stats count by Account_Name, Source_IP
| where count > 5
| sort -count

Breakdown: - index=windows_auth EventCode=4625: Filter to Windows failed login events - stats count by Account_Name, Source_IP: Aggregate by user and source IP - where count > 5: Only show accounts with >5 failures - sort -count: Order by failure count descending

Example 2: Detect Brute Force (Microsoft Sentinel KQL)¶

SecurityEvent
| where EventID == 4625
| where TimeGenerated > ago(10m)
| summarize FailureCount = count() by Account, IpAddress
| where FailureCount > 10
| join kind=inner (
    SecurityEvent
    | where EventID == 4624
    | where TimeGenerated > ago(10m)
) on Account
| project Account, IpAddress, FailureCount, SuccessTime = TimeGenerated

Logic: Find accounts with >10 failed logins in 10 minutes followed by a successful login (potential brute force success).

Example 3: Hunt for Lateral Movement (Elastic EQL)¶

sequence by user.name
  [authentication where event.outcome == "success" and source.ip != "10.0.1.5"]
  [process where process.name == "psexec.exe"]
  [network where destination.port == 445]

Logic: Detect a sequence where a user authenticates from an unusual IP, then executes PsExec, then connects to SMB (common lateral movement pattern).

3.4 Correlation Rules¶

What is Correlation?¶

Correlation combines multiple events to detect attack patterns that individual events cannot reveal.

Example: - Single failed login: Low priority - 100 failed logins in 2 minutes: Brute force attempt (HIGH priority)

Types of Correlation¶

1. Threshold-based

IF failed_login_count > 10 in 5 minutes
THEN alert "Brute Force Attempt"

2. Sequence-based

IF (login_from_IP_A) THEN (within 1 hour, login_from_IP_B)
WHERE distance(IP_A, IP_B) > 1000 km
THEN alert "Impossible Travel"

3. Behavioral anomaly

IF user_access_count > (baseline_average + 3 * std_dev)
THEN alert "Anomalous Data Access"

Correlation Rule Design Pattern¶

Rule: Detect credential dumping followed by lateral movement

index=endpoint process_name IN ("mimikatz.exe", "procdump.exe", "lsass.exe")
| eval credential_dump=1
| join user_name, host
  [search index=windows_auth EventCode=4624 Logon_Type=3 earliest=-1h]
| where _time > credential_dump_time + 300
| stats count by user_name, src_host, dest_host

Explanation: 1. Detect credential dumping tools 2. Look for network logons (type 3) from the same host within 1 hour 3. Alert on lateral authentication attempts post-compromise

3.5 Data Lake Approach¶

What is a Security Data Lake?¶

A security data lake stores raw security logs in object storage (S3, Azure Blob) without upfront indexing, enabling flexible analytics and cost-effective long-term retention.

SIEM vs. Data Lake Comparison¶

Aspect	SIEM	Data Lake
Storage cost	Higher (indexed)	Lower (raw files)
Query speed	Faster (indexed)	Slower (scan files)
Query flexibility	Predefined fields	Schema-on-read (analyze any field)
Real-time alerting	Native support	Requires additional tools (stream processing)
Use case fit	Active monitoring, triage	Historical analysis, ML training, compliance

Hybrid Architecture¶

Best Practice: Combine SIEM + Data Lake

Hot path (SIEM): 30-90 days of high-priority logs for real-time detection
Cold path (Data Lake): All logs for 1-7 years for investigations and ML

Workflow: - Active incident: Query SIEM for fast results - Historical investigation: Query data lake (Athena, Databricks) for older events - ML model training: Pull raw data from data lake

Querying a Data Lake (AWS Athena + S3)¶

Schema definition:

CREATE EXTERNAL TABLE cloudtrail_logs (
    eventTime STRING,
    eventName STRING,
    userIdentity STRUCT<type:STRING, userName:STRING>,
    sourceIPAddress STRING
)
STORED AS PARQUET
LOCATION 's3://my-security-datalake/cloudtrail/'

Query:

SELECT userIdentity.userName, sourceIPAddress, COUNT(*) as event_count
FROM cloudtrail_logs
WHERE eventTime > '2026-02-01'
  AND eventName = 'AssumeRole'
GROUP BY userIdentity.userName, sourceIPAddress
HAVING event_count > 50

Use case: Detect abnormal AWS role assumption activity.

3.6 Alert Tuning & Optimization¶

The Tuning Challenge¶

Problem: New correlation rule generates 200 alerts/day with 80% false positive rate.

Goal: Reduce FPs to <20% while maintaining detection coverage.

Tuning Strategies¶

1. Add Allowlisting

... | where NOT (user_name IN ("service_account_backup", "scheduled_task_user"))

2. Increase Thresholds

Change: failed_login_count > 5
To: failed_login_count > 10

Trade-off: Reduces FPs but may miss slower attacks.

3. Add Context Filters

... | where user_risk_score > 50 OR asset_criticality="high"

Only alert on risky users or critical systems.

4. Time-based Filtering

... | where date_hour NOT IN (2,3,4,5)

Ignore expected maintenance windows.

Metrics for Tuning¶

Precision: TP / (TP + FP) — Higher is better (fewer false alarms)
Recall: TP / (TP + FN) — Higher is better (catch more attacks)
Alert Volume: Total alerts per day — Lower is better (analyst capacity)
Time to Triage: Average time spent per alert — Lower is better

Example: - Before tuning: 200 alerts/day, 20% precision, 95% recall - After tuning: 50 alerts/day, 70% precision, 90% recall - Trade-off: Slightly lower recall, but analysts can handle volume and find threats faster

Interactive Element¶

MicroSim 3: SIEM Query Builder

Practice writing queries to detect common attack patterns and optimize for performance.

Common Misconceptions¶

Misconception: SIEM Alone Prevents Breaches

Reality: SIEMs detect threats by analyzing logs. They don't prevent attacks. Prevention requires firewalls, endpoint protection, and security architecture. SIEMs enable faster detection and response.

Misconception: More Correlation Rules = Better Security

Reality: Too many rules create alert fatigue. Focus on high-fidelity rules aligned with your threat model. Quality over quantity.

Misconception: Data Lakes Replace SIEMs

Reality: Data lakes excel at cost-effective storage and flexible analytics but lack real-time alerting. Most mature SOCs use both: SIEM for active monitoring, data lake for deep investigations and ML.

Practice Tasks¶

Task 1: Identify Correlation Opportunities¶

For each scenario, determine if correlation would improve detection:

a) Detecting a single failed SSH login attempt b) Detecting 50 failed logins followed by a successful login from the same IP c) Detecting file creation on an endpoint d) Detecting unusual file access pattern (user accesses 10x normal file shares, then transfers large file externally)

Answers

a) No correlation needed. Single event, low confidence. Better to threshold or enrich.

b) Yes, correlation. Sequence of failures → success indicates potential brute force success.

c) No correlation needed. Single event. May need enrichment (file hash threat intel lookup) but not correlation.

d) Yes, correlation. Combines abnormal internal access with exfiltration behavior. Sequence matters.

Task 2: Write a Correlation Rule (Conceptual)¶

Scenario: Detect account compromise by correlating: 1. User logs in from impossible travel location (2 logins >500km apart in <1 hour) 2. User accesses sensitive file shares they've never accessed before

Write pseudocode for this rule.

Answer

RULE: Impossible Travel + Unusual File Access

SEQUENCE:
Step 1: Detect impossible travel
  - Find logins by same user from 2 IPs within 1 hour
  - Calculate geographic distance between IPs
  - IF distance > 500 km, flag as suspicious_login

Step 2: Check file access (within 24 hours of suspicious_login)
  - Get list of file shares accessed by user
  - Compare to baseline (file shares accessed in past 90 days)
  - IF accessing new file shares, flag as unusual_access

IF suspicious_login AND unusual_access within 24 hours:
  ALERT "Potential Account Compromise"
  SEVERITY: HIGH
  ENRICHMENT: User risk score, file share classification

Task 3: Optimize a Query¶

Given query (Splunk SPL):

index=* source=* user=*
| search "failed"
| stats count by user

Problems: - Searches all indexes (slow, expensive) - Uses wildcard searches (inefficient) - Searches unstructured text field

Rewrite this query to be more efficient.

Answer

index=windows_auth EventCode=4625
| stats count by Account_Name
| where count > 5
| sort -count

Improvements: - Specific index (windows_auth) instead of wildcard - Specific EventCode (4625 = failed login) instead of text search - Specific field (Account_Name) instead of wildcard user - Added threshold filter to reduce results

Exam Prep & Certifications¶

Relevant Certifications

The topics in this chapter align with the following certifications:

CompTIA Security+ — Domains: Security Operations, Security Architecture
CompTIA CySA+ — Domains: Security Operations, Data Analysis
GIAC GCIH — Domains: Incident Handling, Data Correlation
CISSP — Domains: Security Operations, Security Architecture

View full Certifications Roadmap →

Self-Assessment Quiz¶

Question 1: What is the primary function of a SIEM's correlation engine?

Options:

a) Store logs in compressed format

b) Combine multiple events to detect complex attack patterns

c) Forward logs from endpoints to central storage

d) Encrypt data in transit

Show Answer

Correct Answer: b) Combine multiple events to detect complex attack patterns

Explanation: Correlation engines analyze relationships between events across time and systems to detect multi-stage attacks that single events cannot reveal.

Question 2: Which query language is used by Microsoft Sentinel?

Options:

a) SPL (Search Processing Language)

b) SQL (Structured Query Language)

c) KQL (Kusto Query Language)

d) Python

Show Answer

Correct Answer: c) KQL (Kusto Query Language)

Explanation: Microsoft Sentinel uses KQL, also used by Azure Monitor and Azure Data Explorer. Splunk uses SPL, traditional databases use SQL.

Question 3: What is a key advantage of a security data lake over a traditional SIEM?

Options:

a) Faster real-time alerting

b) Lower storage costs for long-term retention

c) Better out-of-the-box correlation rules

d) Simpler user interface

Show Answer

Correct Answer: b) Lower storage costs for long-term retention

Explanation: Data lakes store raw logs in object storage (S3, etc.) which is far cheaper than indexed SIEM storage. However, SIEMs are better for real-time alerting and typically have better UIs.

Question 4: In alert tuning, increasing the threshold for a brute force rule from 5 to 10 failed logins will likely:

Options:

a) Increase both precision and recall

b) Increase precision but decrease recall

c) Decrease precision but increase recall

d) Have no effect on precision or recall

Show Answer

Correct Answer: b) Increase precision but decrease recall

Explanation: Higher thresholds reduce false positives (increasing precision) but may miss some true attacks (decreasing recall). It's a trade-off.

Question 5: What is the purpose of log normalization in a SIEM?

Options:

a) Compress logs to save storage space

b) Map diverse log formats to a common schema for consistent querying

c) Delete duplicate events

d) Encrypt sensitive fields

Show Answer

Correct Answer: b) Map diverse log formats to a common schema for consistent querying

Explanation: Normalization translates different log formats (syslog, JSON, CEF, etc.) into standardized field names so analysts can query across all sources consistently.

Question 6: Which of the following is NOT a typical SIEM performance bottleneck?

Options:

a) Disk I/O during indexing

b) CPU usage during log parsing

c) Network bandwidth for log collection

d) Insufficient RGB lighting on server hardware

Show Answer

Correct Answer: d) Insufficient RGB lighting on server hardware

Explanation: SIEM performance bottlenecks include disk I/O (indexing), CPU (parsing/correlation), and network bandwidth (collection). RGB lighting does not affect performance (though it might affect morale).

Summary¶

In this chapter, you learned:

SIEM architecture: Collection, normalization, indexing, correlation, alerting
Query languages: SPL (Splunk), KQL (Sentinel), EQL (Elastic) for searching security data
Correlation rules: Combine events to detect multi-stage attacks (brute force, lateral movement)
SIEM vs. Data Lake: Trade-offs between speed/cost and when to use each
Alert tuning: Techniques to reduce false positives while maintaining detection coverage
Performance optimization: Index selection, query efficiency, tiered storage

Next Steps¶

Next Chapter: Chapter 4: Detection Engineering - Learn to design, build, and test detection rules
Practice: Try the SIEM Query Builder MicroSim to build muscle memory with query syntax
Explore: Review your organization's top 10 SIEM correlation rules. Can you identify tuning opportunities?
Deep Dive: Review the Architecture Reference and Integration Patterns to understand SIEM selection and deployment criteria

Chapter 3 Complete | Next: Chapter 4 →