
Cloud-Native SOC Architecture

A cloud-native Security Operations Center moves beyond legacy on-premises SIEM deployments to a scalable, API-driven, multi-cloud architecture. This document defines the reference architecture for building a SOC that is cloud-first — leveraging elastic compute, object storage, streaming pipelines, and infrastructure-as-code to handle modern detection and response at scale.


Cloud-Native SOC Reference Architecture

flowchart TB
    subgraph Sources["Log & Telemetry Sources"]
        CLOUD_LOGS[Cloud Provider Logs<br/>AWS CloudTrail / Azure Activity / GCP Audit]
        ENDPOINT[Endpoint Telemetry<br/>EDR Agents]
        NETWORK[Network Telemetry<br/>NDR / VPC Flow Logs]
        IDENTITY[Identity Logs<br/>IdP / SSO / MFA]
        SAAS[SaaS Application Logs<br/>M365 / Google Workspace / Salesforce]
        CUSTOM[Custom App Logs<br/>Webhooks / Syslog / API]
    end

    subgraph Collection["Collection Layer"]
        AGENTS[Log Shipping Agents]
        API_COLL[API Collectors]
        WEBHOOK[Webhook Receivers]
        STREAM[Streaming Ingestion<br/>Kafka / Kinesis / Event Hub]
    end

    subgraph Processing["Processing & Normalization"]
        NORM[Schema Normalization<br/>OCSF / ECS Mapping]
        ENRICH[Enrichment Pipeline<br/>GeoIP, TI, Asset Context]
        FILTER[Ingestion Filtering<br/>Drop / Downsample / Route]
        PARSE[Parsing Engine<br/>Field Extraction / Type Coercion]
    end

    subgraph Detection["Detection Engine"]
        REALTIME[Real-Time Streaming Detection<br/>Correlation Rules / Sigma]
        BATCH[Batch Analytics<br/>Scheduled Queries / Threat Hunting]
        ML[ML Anomaly Detection<br/>UEBA / Behavioral Baselines]
        TI_MATCH[Threat Intel Matching<br/>IOC Correlation]
    end

    subgraph Response["Response & Orchestration"]
        SOAR[SOAR Platform<br/>Playbook Execution]
        CASE[Case Management<br/>Investigation Workflow]
        NOTIFY[Notification Engine<br/>PagerDuty / Slack / Email]
        AUTO_RESP[Automated Response<br/>Contain / Isolate / Block]
    end

    subgraph Storage["Tiered Storage"]
        HOT[Hot Tier<br/>Real-time indexing — 30 days]
        WARM[Warm Tier<br/>Searchable archive — 90 days]
        COLD[Cold Tier<br/>Object storage — 1 year+]
    end

    CLOUD_LOGS --> AGENTS
    ENDPOINT --> AGENTS
    NETWORK --> API_COLL
    IDENTITY --> API_COLL
    SAAS --> WEBHOOK
    CUSTOM --> WEBHOOK
    AGENTS --> STREAM
    API_COLL --> STREAM
    WEBHOOK --> STREAM

    STREAM --> PARSE
    PARSE --> NORM
    NORM --> ENRICH
    ENRICH --> FILTER

    FILTER --> REALTIME
    FILTER --> HOT
    REALTIME --> SOAR
    REALTIME --> CASE
    REALTIME --> NOTIFY

    BATCH --> CASE
    ML --> SOAR
    TI_MATCH --> SOAR

    HOT --> BATCH
    HOT --> ML
    HOT --> TI_MATCH

    HOT -.->|age-off 30d| WARM
    WARM -.->|age-off 90d| COLD

    SOAR --> AUTO_RESP
    SOAR --> CASE

Data Pipeline Architecture

The data pipeline transforms raw telemetry into actionable security events through four stages:

Stage 1: Collection

Multiple collection methods ensure coverage across all source types:

| Collection Method | Use Case | Protocol | Reliability Pattern |
|---|---|---|---|
| Log shipping agents | Endpoints, servers, containers | TCP/TLS, gRPC | At-least-once delivery with local buffer |
| API collectors | Cloud providers, SaaS platforms | HTTPS REST, OAuth2 | Poll with checkpoint; retry with exponential backoff |
| Webhook receivers | Real-time event sources | HTTPS POST | Acknowledgment-based; dead letter queue for failures |
| Syslog receivers | Legacy infrastructure, network devices | UDP/514, TCP/6514 (TLS) | TCP with TLS for reliable delivery; UDP for high-volume |
| Streaming ingestion | High-volume continuous feeds | Kafka, Kinesis, Event Hub | Partitioned, replicated, ordered delivery |

Always buffer at the collection layer. Agents should maintain a local disk buffer (minimum 24 hours of retention) to survive network outages and downstream processing delays without data loss.
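The buffering guidance above can be sketched as a minimal at-least-once shipper. This is an illustrative pattern, not a real agent's API: `DiskBufferedShipper` and `send_fn` are hypothetical names, and `send_fn` stands in for delivery to the collector, returning True only on downstream acknowledgment.

```python
import json
import os
import tempfile

class DiskBufferedShipper:
    """Sketch of an at-least-once shipper with a local disk buffer.

    Events are spooled to disk before any send attempt; a spooled event
    is removed only after the downstream collector acknowledges it, so a
    network outage delays delivery instead of losing data.
    """

    def __init__(self, spool_dir, send_fn):
        self.spool_dir = spool_dir
        self.send_fn = send_fn  # returns True on downstream ack

    def enqueue(self, event: dict) -> str:
        # Durably spool first; one file per event keeps the sketch simple.
        fd, path = tempfile.mkstemp(suffix=".json", dir=self.spool_dir)
        with os.fdopen(fd, "w") as f:
            json.dump(event, f)
        return path

    def flush(self) -> int:
        """Try to deliver every spooled event; keep failures for retry."""
        delivered = 0
        for name in sorted(os.listdir(self.spool_dir)):
            path = os.path.join(self.spool_dir, name)
            with open(path) as f:
                event = json.load(f)
            if self.send_fn(event):      # ack received
                os.remove(path)          # safe to drop the buffered copy
                delivered += 1
        return delivered
```

Because the spooled copy is deleted only after an acknowledgment, a crash between send and delete can replay an event: at-least-once delivery implies downstream deduplication.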

Stage 2: Normalization (OCSF / ECS)

Raw logs are mapped to a unified schema for consistent detection and analysis:

| Schema Standard | Governance | Strength | Best For |
|---|---|---|---|
| OCSF (Open Cybersecurity Schema Framework) | OCSF consortium (AWS, Splunk, IBM, others) | Vendor-neutral; purpose-built for security | Multi-vendor SOC; cloud-native deployments |
| ECS (Elastic Common Schema) | Elastic | Deep Elastic ecosystem integration | Elastic-based SIEM deployments |
| CEF (Common Event Format) | ArcSight / Micro Focus | Wide legacy support | Legacy SIEM integration |
| LEEF (Log Event Extended Format) | IBM | QRadar integration | IBM QRadar environments |

Normalization pipeline example:

Raw log (vendor-specific)
    → Parser (extract fields)
    → Schema mapper (map to OCSF categories)
    → Type coercion (timestamps → ISO 8601, IPs → normalized)
    → Validation (reject malformed; route to dead letter queue)
    → Enriched event (ready for detection)

Schema normalization is the single most impactful investment in a cloud-native SOC. Without it, every detection rule, dashboard, and playbook must account for vendor-specific field names — creating brittle, unmaintainable detection logic.
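A minimal sketch of the pipeline above, assuming a CloudTrail-style input. The field names are flattened and illustrative, and the OCSF-like target fields are indicative rather than the full schema:

```python
from datetime import datetime, timezone

# Hypothetical field map: raw CloudTrail-style names -> unified names.
CLOUDTRAIL_MAP = {
    "eventName": "api.operation",
    "sourceIPAddress": "src_endpoint.ip",
    "userIdentity.userName": "actor.user.name",
}

def get_path(record, dotted):
    """Resolve a dotted path like 'userIdentity.userName' in a nested dict."""
    for key in dotted.split("."):
        if not isinstance(record, dict) or key not in record:
            return None
        record = record[key]
    return record

def normalize(raw, field_map, dead_letter):
    """Map raw fields onto the unified schema; coerce the timestamp to
    ISO 8601. Malformed records go to the dead letter queue instead of
    continuing down the pipeline."""
    event = {}
    for src, dst in field_map.items():
        value = get_path(raw, src)
        if value is not None:
            event[dst] = value
    try:
        ts = float(raw["eventTime"])  # epoch seconds in this sketch
        event["time"] = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    except (KeyError, TypeError, ValueError):
        dead_letter.append(raw)
        return None
    return event
```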

Stage 3: Enrichment

Events are enriched inline before they reach the detection engine:

| Enrichment Source | Data Added | Latency Budget |
|---|---|---|
| GeoIP database | Country, city, ASN for source/destination IPs | < 1 ms |
| Threat intelligence | IOC match status, threat actor attribution, confidence score | < 5 ms |
| Asset inventory (CMDB) | Asset owner, business unit, criticality, environment | < 5 ms |
| Identity context | User role, department, manager, risk score | < 10 ms |
| DNS resolution | Reverse DNS, domain registration age, WHOIS context | < 20 ms |
| Previous alert history | Related prior alerts for the same entity | < 50 ms |
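Inline enrichment can be modeled as an ordered chain of lookups, each with its own latency budget. A sketch, where the enricher functions and budget values are illustrative:

```python
import time

def run_enrichers(event, enrichers):
    """Apply each enricher in order; record any that exceeds its latency
    budget so slow lookups can be flagged without blocking the pipeline.

    enrichers: list of (name, budget_ms, fn) where fn returns a dict of
    fields to merge into the event (or None/{} for no match).
    """
    out = dict(event)
    over_budget = []
    for name, budget_ms, fn in enrichers:
        start = time.monotonic()
        added = fn(out) or {}
        out.update(added)
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > budget_ms:
            over_budget.append(name)
    out["_enrichment_over_budget"] = over_budget
    return out
```

In production the over-budget list would feed pipeline latency monitoring rather than ride along on the event.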

Stage 4: Filtering and Routing

Not all events require real-time indexing. Selective ingestion reduces cost without sacrificing coverage:

| Data Category | Routing Decision | Rationale |
|---|---|---|
| Authentication events (success/failure) | Hot tier — full indexing | Critical for detection; high signal value |
| Security tool alerts (EDR, WAF, IDS) | Hot tier — full indexing | Direct detection value |
| DNS query logs | Hot tier — sampled (10%); warm tier — full | High volume, low per-event value; sampling preserves hunting |
| Network flow logs | Hot tier — anomalous only; warm tier — summary | Volume reduction 90%+; anomalies flagged by streaming detection |
| OS process execution logs | Hot tier — filtered (known-bad + unknown); cold tier — full | Filter known-good processes; retain full for forensics |
| Cloud control plane (read operations) | Warm tier — full | Low detection value; needed for investigation context |
| Debug / health check logs | Drop or cold tier only | No security value; pure cost |

Aggressive filtering saves cost but creates blind spots. Every filtering rule must be documented, reviewed quarterly, and tested against the MITRE ATT&CK matrix to verify that filtered data is not required for any active detection.
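Routing decisions like those in the table reduce to a small policy table plus sampling. A sketch, using hash-based sampling so re-runs make identical decisions for the same event ID; the categories and rates are illustrative:

```python
import hashlib

# Illustrative routing policy: each rule is (tiers, hot_sample_rate).
# A rate of 0.10 keeps roughly 10% of events in the hot tier.
POLICY = {
    "auth":        ({"hot"}, 1.0),
    "edr_alert":   ({"hot"}, 1.0),
    "dns":         ({"hot", "warm"}, 0.10),   # hot is sampled, warm gets all
    "healthcheck": (set(), 0.0),              # dropped
}

def route(event):
    """Return the set of tiers this event is written to. Sampling is
    deterministic: the event ID hashes into [0, 1) and only IDs below
    the sample rate keep their hot-tier copy."""
    tiers, rate = POLICY.get(event["category"], ({"warm"}, 1.0))
    if rate >= 1.0 or not tiers:
        return set(tiers)
    h = int(hashlib.sha256(event["id"].encode()).hexdigest(), 16)
    bucket = (h % 10_000) / 10_000
    sampled = set(tiers)
    if bucket >= rate:
        sampled.discard("hot")   # outside the sample: drop only the hot copy
    return sampled
```

Deterministic sampling matters for audits: given an event ID, you can always explain why it was (or was not) in the hot tier.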


Tiered Storage Architecture

Storage Tier Specifications

| Attribute | Hot Tier | Warm Tier | Cold Tier |
|---|---|---|---|
| Retention | 30 days | 90 days | 1 year+ (regulatory) |
| Search latency | < 3 seconds | < 30 seconds | Minutes to hours (rehydration) |
| Indexing | Fully indexed; real-time queryable | Indexed on key fields only | Not indexed; compressed archive |
| Use case | Active detection, investigation, dashboards | Threat hunting, incident investigation | Compliance, forensics, legal hold |
| Technology pattern | Elasticsearch / ClickHouse / cloud-native SIEM | Reduced-replica search; columnar store | S3 / Azure Blob / GCS with Parquet format |
| Availability | 99.9% query SLA | 99% query SLA | 99.999999999% durability; no query SLA |

Cost Comparison (Synthetic Reference — Per TB/Month)

| Cost Component | Hot Tier | Warm Tier | Cold Tier |
|---|---|---|---|
| Storage | $250 | $50 | $5 |
| Compute (indexing/search) | $400 | $80 | $0 (on-demand rehydration) |
| Network (ingestion) | $50 | $10 (internal transfer) | $2 (lifecycle policy) |
| Total per TB/month | $700 | $140 | $7 |
| Cost ratio | 100x | 20x | 1x |

Cost Impact

An organization ingesting 5 TB/day with a one-year retention requirement would spend approximately $1.28M/month keeping everything in the hot tier (1,825 TB at $700/TB). With tiered storage (30 days hot, 60 days warm, 275 days cold), the same data costs approximately $157,000/month, an 88% reduction.
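The arithmetic follows directly from the synthetic per-TB reference table above; a quick steady-state model:

```python
# Per-TB/month costs from the synthetic reference table above.
COST_PER_TB = {"hot": 700, "warm": 140, "cold": 7}

def monthly_cost(ingest_tb_per_day, tiers):
    """tiers maps tier name -> days of data retained in that tier.
    Steady-state volume per tier is daily ingest * days retained."""
    return sum(ingest_tb_per_day * days * COST_PER_TB[tier]
               for tier, days in tiers.items())

# 5 TB/day, one year of total retention.
hot_only = monthly_cost(5, {"hot": 365})
tiered = monthly_cost(5, {"hot": 30, "warm": 60, "cold": 275})
```

Running this gives $1,277,500/month hot-only versus $156,625/month tiered, the roughly 88% reduction cited above.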

Data Lifecycle Policy

flowchart LR
    INGEST[Ingest] -->|real-time| HOT[Hot Tier<br/>Fully indexed<br/>30 days]
    HOT -->|age-off policy| WARM[Warm Tier<br/>Key fields indexed<br/>90 days]
    WARM -->|age-off policy| COLD[Cold Tier<br/>Compressed Parquet<br/>1 year+]
    COLD -->|retention expiry| DELETE[Secure Deletion<br/>Crypto-shredding]

    LEGAL[Legal Hold] -.->|override age-off| HOT
    LEGAL -.->|override age-off| WARM
    LEGAL -.->|preserve indefinitely| COLD

Detection Architecture

Three Detection Modes

The cloud-native SOC operates three complementary detection modes:

1. Real-Time Streaming Detection

| Attribute | Detail |
|---|---|
| Trigger | Event arrives in processing pipeline |
| Latency | < 5 seconds from event ingestion to alert |
| Rule format | Sigma rules compiled to streaming engine queries |
| Use cases | Known-bad IOC match, brute-force detection, impossible travel, known attack patterns |
| Scale | 50,000–500,000 events per second per detection worker |
| False positive management | Inline suppression rules; allowlists evaluated at detection time |

2. Batch Analytics (Scheduled)

| Attribute | Detail |
|---|---|
| Trigger | Scheduled execution (every 5 min, 15 min, 1 hour, daily) |
| Latency | Minutes to hours depending on schedule |
| Query format | SQL / KQL / SPL against hot tier index |
| Use cases | Low-and-slow attacks, multi-stage correlation, statistical anomalies, threat hunting |
| Scale | Queries spanning 30 days of hot tier data |
| False positive management | Tuning via exception lists; confidence scoring |

3. ML Anomaly Detection

| Attribute | Detail |
|---|---|
| Trigger | Continuous model inference on enriched event stream |
| Latency | Seconds to minutes depending on model complexity |
| Model types | Behavioral baselines (UEBA), time-series anomaly detection, clustering |
| Use cases | Unknown threats, insider threats, living-off-the-land techniques, data exfiltration |
| Training | Baseline built from 30–90 days of historical data; retrained weekly |
| False positive management | Confidence thresholds; analyst feedback loop for model tuning |

Real-time detection catches known threats fast. Batch analytics finds complex multi-stage attacks. ML detection surfaces unknown threats. All three are necessary — no single mode provides full coverage.
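As a concrete example of the real-time streaming mode, a brute-force rule reduces to a per-entity sliding window over the event stream. A minimal sketch (thresholds and event shape are illustrative; production engines compile Sigma rules into equivalent stateful operators):

```python
from collections import defaultdict, deque

class BruteForceDetector:
    """Streaming rule sketch: alert when one source IP produces
    `threshold` failed logins within `window_s` seconds. State is a
    per-IP deque of recent failure timestamps, trimmed as events arrive."""

    def __init__(self, threshold=5, window_s=60):
        self.threshold = threshold
        self.window_s = window_s
        self.failures = defaultdict(deque)

    def process(self, event):
        """Return an alert dict or None. Events must arrive in time order."""
        if event["type"] != "auth_failure":
            return None
        q = self.failures[event["src_ip"]]
        q.append(event["ts"])
        while q and event["ts"] - q[0] > self.window_s:
            q.popleft()
        if len(q) >= self.threshold:
            q.clear()  # re-arm instead of alerting on every further failure
            return {"rule": "brute_force", "src_ip": event["src_ip"],
                    "count": self.threshold, "ts": event["ts"]}
        return None
```

Per-key state is bounded by the window, which is what lets a single detection worker sustain a high event rate.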

Detection Rule Lifecycle

Author (Sigma YAML)
    → Peer review (PR-based)
    → Unit test (synthetic event validation)
    → Simulation (run against 30d historical data — measure FP rate)
    → Staged deployment (canary → 25% → 100%)
    → Production monitoring (alert volume, FP rate, MTTD)
    → Tuning (quarterly review; retire if no true positives in 180 days)
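The simulation step of the lifecycle can be expressed as a deployment gate: replay the candidate rule over labeled historical events and block promotion when the measured false positive rate exceeds a budget. A sketch with illustrative names:

```python
def simulate_rule(rule_fn, labeled_events, max_fp_rate=0.30):
    """Replay a candidate detection rule over labeled historical events
    and decide whether it may proceed to staged deployment. An alert on
    an event labeled benign counts as a false positive."""
    alerts = fps = 0
    for event, is_malicious in labeled_events:
        if rule_fn(event):
            alerts += 1
            if not is_malicious:
                fps += 1
    fp_rate = fps / alerts if alerts else 0.0
    return {"alerts": alerts, "fp_rate": fp_rate,
            "deploy": alerts > 0 and fp_rate <= max_fp_rate}
```

Gating on `alerts > 0` also catches rules that never fire on historical data, which usually means a broken condition rather than a quiet environment.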

SOAR Integration Architecture

Playbook Architecture

flowchart TB
    ALERT[Detection Alert] --> TRIAGE[Auto-Triage Playbook]

    TRIAGE --> ENRICH_PB[Enrichment Playbook]
    ENRICH_PB --> |"query TI, CMDB, IdP"| DECISION{Severity + Confidence}

    DECISION -->|High severity + High confidence| AUTO[Automated Response Playbook]
    DECISION -->|High severity + Low confidence| HUMAN[Human-in-the-Loop Playbook]
    DECISION -->|Low severity| SUPPRESS[Suppress & Log]

    AUTO --> CONTAIN[Containment Actions]
    CONTAIN --> ISOLATE[Isolate Host — EDR API]
    CONTAIN --> DISABLE[Disable Account — IdP API]
    CONTAIN --> BLOCK[Block IP — Firewall API]

    HUMAN --> NOTIFY_ANALYST[Notify Analyst — Slack / PagerDuty]
    NOTIFY_ANALYST --> ANALYST_REVIEW[Analyst Review + Decision]
    ANALYST_REVIEW -->|Approve| CONTAIN
    ANALYST_REVIEW -->|False Positive| TUNE[Add Exception + Tune Rule]

    CONTAIN --> CASE_CREATE[Create / Update Case]
    SUPPRESS --> CASE_CREATE
    TUNE --> CASE_CREATE
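The severity-plus-confidence branch in the flow above is a small decision function; encoding it in code (or config) keeps triage behavior testable. A sketch with illustrative labels:

```python
def route_alert(severity, confidence):
    """Triage decision matching the playbook flow above: automate only
    when both severity and confidence are high, keep a human in the loop
    when the model is unsure, and suppress (but log) low-severity noise."""
    if severity == "low":
        return "suppress_and_log"
    if severity == "high" and confidence == "high":
        return "automated_response"
    return "human_in_the_loop"
```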

API Mesh

The SOAR platform integrates with the security stack via a standardized API mesh:

| Integration | API Type | Actions Available | Auth Method |
|---|---|---|---|
| EDR platform | REST API | Isolate host, collect forensic package, query telemetry | API key (vault-managed, rotated 90d) |
| Identity provider | SCIM / REST | Disable account, force MFA reset, revoke sessions | OAuth2 client credentials |
| Firewall / WAF | REST API | Block IP/domain, update rule set, query logs | mTLS + API key |
| Threat intelligence | REST API | Query IOC, submit sample, retrieve report | API key |
| Email security | Graph API / REST | Quarantine message, purge from mailboxes, block sender | OAuth2 delegated |
| Cloud provider | SDK / REST | Quarantine instance, snapshot disk, modify security group | IAM role assumption |
| Ticketing system | REST API | Create ticket, update status, add comments | OAuth2 / API key |
| Communication | Webhook | Send alert notification, request analyst input | Webhook secret |

Enrichment Pipeline

Every alert passes through automated enrichment before analyst review:

Alert received
    → IP reputation lookup (TI platform — 3 providers)
    → Domain reputation + WHOIS age check
    → File hash lookup (VirusTotal, internal sandbox)
    → Asset context (CMDB — owner, criticality, environment)
    → User context (IdP — role, department, risk score)
    → Related alerts (SIEM — same entity, last 7 days)
    → GeoIP + impossible travel check
    → Severity re-scoring based on enrichment
    → Route to appropriate playbook

Automated enrichment reduces analyst investigation time by 60–80% in mature SOC environments. The goal is to present the analyst with a fully contextualized alert, not a raw log event.
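The re-scoring step can be sketched as a simple additive model over enrichment signals; the weights and field names here are illustrative, not a standard scoring scheme:

```python
def rescore(alert):
    """Re-score an enriched alert before routing: raise severity for
    critical assets and confirmed IOC matches, lower it when every TI
    provider returned a clean verdict."""
    score = {"low": 1, "medium": 2, "high": 3}[alert["base_severity"]]
    if alert.get("asset", {}).get("criticality") == "critical":
        score += 1
    if alert.get("ioc_matches", 0) > 0:
        score += 1
    verdicts = alert.get("ti_verdicts")
    if verdicts and all(v == "clean" for v in verdicts):
        score -= 1
    score = max(1, min(score, 4))
    return ["low", "medium", "high", "critical"][score - 1]
```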


Multi-Cloud Considerations

Unified Data Model Across Providers

| Data Category | AWS Source | Azure Source | GCP Source | Unified Schema Field (OCSF) |
|---|---|---|---|---|
| Control plane audit | CloudTrail | Azure Activity Log | Cloud Audit Log | api.operation, actor.user.name, cloud.provider |
| Network flow | VPC Flow Logs | NSG Flow Logs | VPC Flow Logs | src.ip, dst.ip, traffic.bytes, network.direction |
| DNS queries | Route 53 Resolver Logs | DNS Analytics | Cloud DNS Logs | dns.query.hostname, dns.response.code |
| Identity events | IAM / SSO | Azure AD / Entra ID | Cloud Identity | auth.protocol, user.name, outcome |
| Storage access | S3 Access Logs | Storage Analytics | GCS Audit Logs | file.name, api.operation, actor.user.name |
| Compute events | EC2 / ECS / Lambda logs | VM / AKS / Functions logs | GCE / GKE / Cloud Functions logs | process.name, device.hostname, cloud.region |

Each cloud provider uses different field names, timestamp formats, and event structures for equivalent telemetry. Without a normalization layer, detection rules must be written three times — once per provider. OCSF normalization eliminates this duplication.
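The normalization layer reduces to one field map per provider feeding a shared schema, after which a detection rule is written once. A sketch with flattened, illustrative field names:

```python
# Illustrative per-provider maps onto one unified (OCSF-like) schema so a
# single detection rule covers all three control-plane audit sources.
PROVIDER_MAPS = {
    "aws":   {"eventName": "api.operation", "userIdentity_arn": "actor.user.name"},
    "azure": {"operationName": "api.operation", "caller": "actor.user.name"},
    "gcp":   {"methodName": "api.operation", "principalEmail": "actor.user.name"},
}

def unify(provider, raw):
    """Map a raw provider-specific audit record onto the unified schema."""
    event = {"cloud.provider": provider}
    for src, dst in PROVIDER_MAPS[provider].items():
        if src in raw:
            event[dst] = raw[src]
    return event
```

A rule written against `api.operation` now applies to all three providers without per-cloud variants.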

Cross-Cloud Detection Challenges

| Challenge | Impact | Mitigation |
|---|---|---|
| Clock skew between providers | Correlation failures; incorrect event ordering | NTP sync validation; tolerance window in correlation rules (±5s) |
| Different identity models | Cannot correlate user activity across clouds | Unified identity mapping table (email → provider-specific IDs) |
| Varying log delivery latency | Detection timing inconsistency | Per-source latency monitoring; watermarking in streaming pipeline |
| Inconsistent log completeness | Coverage gaps in detection | Log completeness audit per provider; alerting on log source absence |

Scaling Patterns

Horizontal Scaling Architecture

flowchart LR
    subgraph Ingestion["Ingestion Tier (auto-scale)"]
        IW1[Ingestion Worker 1]
        IW2[Ingestion Worker 2]
        IWN[Ingestion Worker N]
    end

    subgraph Queue["Message Queue (partitioned)"]
        Q[Kafka / Kinesis<br/>Partitioned by source type]
    end

    subgraph Detection_Scale["Detection Tier (auto-scale)"]
        DW1[Detection Worker 1<br/>Auth rules]
        DW2[Detection Worker 2<br/>Network rules]
        DWN[Detection Worker N<br/>Endpoint rules]
    end

    subgraph Index["Indexing Tier (auto-scale)"]
        IDX1[Index Writer 1]
        IDX2[Index Writer 2]
        IDXN[Index Writer N]
    end

    IW1 --> Q
    IW2 --> Q
    IWN --> Q
    Q --> DW1
    Q --> DW2
    Q --> DWN
    Q --> IDX1
    Q --> IDX2
    Q --> IDXN

Auto-Scaling Policies

| Tier | Scale Trigger | Scale-Out Action | Scale-In Action | Min / Max Instances |
|---|---|---|---|---|
| Ingestion workers | Queue depth > 10,000 events for 2 min | Add 2 workers | Remove 1 worker (cooldown 10 min) | 3 / 20 |
| Detection workers | Processing latency P99 > 5s | Add 1 worker per rule category | Remove 1 worker (cooldown 15 min) | 2 / 15 per category |
| Index writers | Write queue backlog > 60s | Add 2 writers | Remove 1 writer (cooldown 20 min) | 3 / 12 |
| Search replicas | Query latency P99 > 10s | Add 1 search replica | Remove 1 replica (cooldown 30 min) | 2 / 8 |

Queue-Based Processing

All inter-tier communication uses message queues for resilience and backpressure management:

| Queue Property | Configuration | Rationale |
|---|---|---|
| Partitioning | By log source type (auth, endpoint, network, cloud) | Ensures detection workers specialize; prevents noisy-neighbor effects |
| Retention | 72 hours | Survives downstream outages; allows replay for new detections |
| Consumer groups | Independent groups for detection, indexing, forwarding | Each consumer processes at its own rate without blocking others |
| Dead letter queue | Failed events routed after 3 retries | Prevents poison messages from blocking the pipeline |
| Ordering | Per-partition ordering guaranteed | Preserves event sequence within a source |

Size your message queue for 3x peak throughput. Security events are bursty — a ransomware incident or red team exercise can generate 10–50x normal event volume within minutes.
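The 3x sizing rule turns into simple capacity arithmetic. A sketch; the per-partition throughput limit is an assumption you should measure for your broker, and 1 KB is a placeholder average event size:

```python
def queue_capacity_plan(sustained_eps, partition_limit_eps, burst_factor=3,
                        retention_hours=72, avg_event_bytes=1_000):
    """Size a partitioned queue for burst_factor x sustained throughput.
    Returns the partition count and the storage needed to hold the full
    retention window at the sustained rate."""
    peak_eps = sustained_eps * burst_factor
    partitions = -(-peak_eps // partition_limit_eps)   # ceiling division
    retention_bytes = sustained_eps * 3600 * retention_hours * avg_event_bytes
    return {"partitions": partitions,
            "retention_gb": retention_bytes / 1e9}
```

For a medium SOC at 25,000 EPS with partitions assumed to absorb 10,000 EPS each, this yields 8 partitions and roughly 6.5 TB of 72-hour retention.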


Cost Optimization

Ingestion Cost Reduction Strategies

| Strategy | Estimated Savings | Implementation Complexity | Risk |
|---|---|---|---|
| Drop debug / health check logs | 10–20% of total volume | Low — filter at collection layer | Minimal — no security value |
| Deduplicate identical events | 5–15% of total volume | Medium — requires dedup window | Low — keep first occurrence |
| Downsample high-volume telemetry | 20–40% for applicable sources | Medium — statistical sampling | Medium — may miss low-frequency events |
| Compress before transit | 30–50% network cost reduction | Low — enable gzip/zstd at agent | None |
| Route to warm/cold tier early | 50–70% storage savings | Medium — tiering policy design | Medium — slower investigation queries |
| Reserved capacity pricing | 20–40% compute/storage cost | Low — commitment required | Low — requires accurate forecasting |

Total Cost of Ownership Model (Synthetic — 10 TB/Day Ingestion)

| Cost Category | Monthly Cost (Unoptimized) | Monthly Cost (Optimized) | Savings |
|---|---|---|---|
| Ingestion & processing | $45,000 | $28,000 | 38% |
| Hot storage (30d) | $210,000 | $105,000 | 50% |
| Warm storage (90d) | $42,000 | $42,000 | 0% |
| Cold storage (1yr) | $2,100 | $2,100 | 0% |
| Detection compute | $18,000 | $14,000 | 22% |
| SOAR platform | $8,000 | $8,000 | 0% |
| Network transfer | $12,000 | $7,000 | 42% |
| Total | $337,100 | $206,100 | 39% |

The largest cost lever is selective indexing. Ingesting everything into the hot tier is the default — and the most expensive mistake. A well-designed filtering and tiering strategy typically reduces total SOC platform cost by 30–50%.


Cloud-Native SOC Maturity Model

| Level | Name | Detection | Response | Data Management | Automation | Team |
|---|---|---|---|---|---|---|
| L0 | Log Aggregation | Logs collected but not analyzed; no detection rules | Manual response; no playbooks | Single storage tier; no retention policy | None | IT team reviews logs ad hoc |
| L1 | Basic SIEM | Vendor-provided rules; high false positive rate | Email notifications; manual investigation | Hot-only storage; 30-day retention | Basic alerting scripts | Dedicated SOC analyst (part-time) |
| L2 | Managed Detection | Custom detection rules; Sigma adoption; tuned alerts | SOAR with basic playbooks (enrichment, notification) | Hot + cold tiering; 90-day searchable | Auto-enrichment; manual response | Dedicated SOC team (3–5 analysts) |
| L3 | Proactive SOC | Streaming + batch detection; threat hunting program | SOAR with automated containment for high-confidence alerts | Full tiered storage; OCSF normalization | Auto-triage; semi-automated response | SOC team + detection engineers (8–12) |
| L4 | Advanced SOC | ML anomaly detection; cross-cloud correlation; purple team | Full SOAR automation with human-in-the-loop for escalation | Multi-cloud unified data model; cost-optimized pipeline | Automated detection lifecycle; self-tuning rules | Specialized roles: detection, hunting, automation (15–20) |
| L5 | Autonomous SOC | Autonomous detection; adaptive models; predictive analytics | Autonomous response with human oversight for critical actions | Intelligent tiering; auto-scaling storage | Self-healing pipelines; autonomous playbooks | Strategic security team focused on adversary research (10–15) |

Most organizations operate at L1–L2. Reaching L3 is a realistic 18-month goal with dedicated investment. L4–L5 maturity requires a mature platform engineering function alongside the SOC team.


Reference Metrics and SLAs

Capacity Planning

| Metric | Small SOC | Medium SOC | Large SOC | Enterprise SOC |
|---|---|---|---|---|
| Events per second (sustained) | 5,000 EPS | 25,000 EPS | 100,000 EPS | 500,000+ EPS |
| Daily ingestion volume | 500 GB | 2.5 TB | 10 TB | 50+ TB |
| Active detection rules | 50–100 | 200–500 | 500–2,000 | 2,000–10,000 |
| SOAR playbooks | 5–10 | 20–50 | 50–150 | 150–500 |
| Data sources integrated | 10–20 | 30–60 | 60–150 | 150–500 |
| Analysts | 2–4 | 5–12 | 12–30 | 30–100 |

Detection SLAs

| SLA Metric | Target (L3 Maturity) | Target (L5 Maturity) |
|---|---|---|
| Detection latency (real-time rules) | < 5 seconds from ingestion | < 1 second from ingestion |
| Detection latency (batch analytics) | < 15 minutes | < 5 minutes |
| Alert triage time (automated) | < 2 minutes | < 30 seconds |
| Alert triage time (analyst) | < 15 minutes | < 5 minutes (pre-enriched) |
| Mean time to respond (MTTR) — automated | < 5 minutes | < 1 minute |
| Mean time to respond (MTTR) — manual | < 4 hours | < 1 hour |
| False positive rate | < 30% of alerts | < 10% of alerts |
| Pipeline availability | 99.9% | 99.99% |

Storage Cost SLAs

| Metric | Target |
|---|---|
| Hot tier cost per GB/month | ≤ $0.70 |
| Warm tier cost per GB/month | ≤ $0.14 |
| Cold tier cost per GB/month | ≤ $0.007 |
| Rehydration time (cold → searchable) | < 4 hours |
| Data durability (all tiers) | 99.999999999% (11 nines) |

Common Cloud-Native SOC Mistakes

Avoid These

- Ingesting everything into the hot tier. The default "index everything" approach leads to unsustainable costs. Design tiering from day one.
- No schema normalization. Writing detection rules against raw vendor-specific fields creates unmaintainable detection logic that breaks with every log format change.
- Single-cloud architecture for a multi-cloud environment. If your workloads span AWS, Azure, and GCP but your SIEM only natively integrates with one, you have blind spots.
- No backpressure handling. When ingestion spikes, the pipeline must queue — not drop — events. Without message queues and buffering, you lose data during incidents (when you need it most).
- SOAR playbooks with hardcoded credentials. Credentials must be vault-managed, rotated, and auditable. A compromised SOAR platform with hardcoded admin credentials is a catastrophic breach.
- No detection rule lifecycle. Rules that are never tested, tuned, or retired accumulate as noise. Treat detection rules like production code — with CI/CD, testing, and deprecation policies.
- Ignoring log source health monitoring. A silent log source failure means zero detection coverage for that source. Monitor log source heartbeats and alert on absence within 15 minutes.
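The last point, log source health monitoring, can be as simple as tracking each source's most recent event and alerting on silence. A sketch with a 15-minute default matching the guidance above:

```python
def silent_sources(last_seen, now, max_silence_s=900):
    """Return sources whose most recent event is older than max_silence_s
    (900 s = 15 minutes by default). last_seen maps source name -> epoch
    seconds of its latest observed event."""
    return sorted(src for src, ts in last_seen.items()
                  if now - ts > max_silence_s)
```

In practice the `last_seen` map would be maintained by the ingestion tier and checked on a schedule, with the returned list feeding the notification engine.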

Nexus SecOps Control Mapping

| Cloud-Native SOC Control | Nexus SecOps Controls |
|---|---|
| Log collection and integrity | Nexus SecOps-001, Nexus SecOps-005 |
| Schema normalization (OCSF) | Nexus SecOps-010, Nexus SecOps-012 |
| Real-time detection engine | Nexus SecOps-020, Nexus SecOps-022 |
| Tiered storage and retention | Nexus SecOps-030, Nexus SecOps-032 |
| SOAR playbook governance | Nexus SecOps-100, Nexus SecOps-104 |
| Automated response controls | Nexus SecOps-105, Nexus SecOps-108 |
| Multi-cloud log integration | Nexus SecOps-040, Nexus SecOps-042 |
| Cost optimization policies | Nexus SecOps-035, Nexus SecOps-037 |
| Detection rule lifecycle | Nexus SecOps-025, Nexus SecOps-028 |
| Pipeline availability SLA | Nexus SecOps-050, Nexus SecOps-052 |

See also: Zero Trust SOC | Zero Trust Network | Reference Architecture
Related chapters: Chapter 7: SIEM Architecture | Chapter 8: SOAR | Chapter 11: Cloud Security Operations