
SC-016: Kubernetes Cluster Compromise

Scenario Header

Type: Cloud / Container  |  Difficulty: ★★★★★  |  Duration: 3–4 hours  |  Participants: 4–8

Threat Actor: eCrime group — financially motivated, cloud-native cryptojacking specialist

Primary ATT&CK Techniques: T1190 · T1078.004 · T1068 · T1610 · T1496 · T1046 · T1611


Threat Actor Profile

PHANTOM MINER is a technically sophisticated eCrime group active since 2022, specializing in Kubernetes cluster compromise for large-scale cryptocurrency mining operations. Unlike traditional cryptojacking actors who target individual hosts, PHANTOM MINER exploits misconfigured Kubernetes clusters to deploy mining workloads across dozens or hundreds of nodes, achieving hash rates equivalent to dedicated mining farms — at zero infrastructure cost to themselves.

The group targets organizations running Google Kubernetes Engine (GKE), Amazon EKS, and Azure AKS clusters, with a preference for environments that have exposed API servers, overly permissive RBAC configurations, or default service account tokens. They demonstrate deep expertise in Kubernetes internals, container escape techniques, and cloud IAM privilege escalation.

Motivation: Financial — cryptocurrency mining (Monero via XMRig), with secondary capability for data theft and ransomware deployment (observed but not primary).

Estimated Revenue: $2M–$5M annually from compromised compute resources across ~200 known victim organizations.


Scenario Narrative

Phase 1 — Discovery & Initial Access (~25 min)

PHANTOM MINER conducts daily scans of public IP ranges for exposed Kubernetes API servers on ports 6443 and 8443 using Shodan and custom scanning infrastructure. They identify CloudNative Inc's GKE cluster API server at 198.51.100.25:6443, which is publicly accessible due to a misconfigured --master-authorized-networks setting — the DevOps team added 0.0.0.0/0 during a late-night troubleshooting session and never reverted it.

The attacker attempts anonymous access to the API server:

curl -k https://198.51.100.25:6443/api/v1/namespaces

Anonymous access to most resources is disabled (correct), but the attacker discovers that the default system:public-info-viewer ClusterRole — bound to the system:unauthenticated group out of the box — still permits requests to the /version endpoint, and an overly broad binding of the system:discovery ClusterRole exposes /apis as well, revealing Kubernetes version 1.27.4-gke.1200 and installed CRDs — including Istio service mesh and Argo CD.

The attacker pivots to a known vulnerability in Argo CD (CVE-2024-XXXXX — synthetic) that allows unauthenticated access to the Argo CD API when configured with default settings. CloudNative Inc's Argo CD instance at 198.51.100.25:30080 is using default admin credentials (admin:argocd).

Evidence Artifacts:

Artifact — Detail — Timestamp
GKE Audit Log — Anonymous request to /api/v1 (discovery.k8s.io) — Source IP: 203.0.113.155 — Response: 403 Forbidden — 2026-02-10T03:14:22Z
GKE Audit Log — Anonymous request to /version (discovery.k8s.io) — Source IP: 203.0.113.155 — Response: 200 OK — 2026-02-10T03:14:23Z
Argo CD Audit Log — Login: admin — Source IP: 203.0.113.155 — Method: local — 2026-02-10T03:18:47Z
GKE Master Authorized Networks — Config: 0.0.0.0/0 — Last modified by: devops-lead@cloudnative-inc.com — 2025-11-03T02:45:00Z (3 months prior)

Phase 1 — Discussion Inject

Technical: The GKE API server was exposed to 0.0.0.0/0 via Master Authorized Networks. What is the correct configuration, and how would you audit your GKE clusters for this misconfiguration at scale? What GCP organization policy constraint would prevent this?

Decision: Your security team discovers that the DevOps lead opened the API server to 0.0.0.0/0 three months ago. This was never caught by any audit or review. What preventive controls (IaC scanning, OPA/Gatekeeper policies, GCP Organization Policies) should you implement, and how do you address the cultural issue of emergency changes bypassing security review?

Expected Analyst Actions:

  • [ ] Query GKE audit logs for all anonymous API access attempts in the past 30 days
  • [ ] Verify Master Authorized Networks configuration across all GKE clusters
  • [ ] Check Argo CD authentication configuration — identify default credentials
  • [ ] Scan for publicly exposed Kubernetes API servers using Shodan/Censys
  • [ ] Review change management records for the authorized networks modification
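
The fleet-wide audit asked for in the inject can be scripted. A minimal sketch — the helper function, sample JSON, and the gcloud invocation shown in the comment are illustrative assumptions, not a vetted tool:

```shell
# Flag GKE clusters whose Master Authorized Networks include 0.0.0.0/0.
# In practice, feed it real data (assumes the gcloud CLI is installed):
#   gcloud container clusters list --format=json | check_authorized_networks
check_authorized_networks() {
  # Count CIDR blocks equal to 0.0.0.0/0 in the cluster JSON on stdin
  grep -o '"cidrBlock": *"0\.0\.0\.0/0"' | wc -l
}

# Demo against a hypothetical cluster description:
sample='{"name":"prod","masterAuthorizedNetworksConfig":{"cidrBlocks":[{"cidrBlock":"0.0.0.0/0"}]}}'
open_count=$(printf '%s' "$sample" | check_authorized_networks)
[ "$open_count" -gt 0 ] && echo "FLAGGED: cluster API open to the world"
```

Run on a schedule across all projects, this catches the kind of drift that sat undetected here for three months.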


Phase 2 — RBAC Escalation & Lateral Movement (~35 min)

With Argo CD admin access, the attacker can view all application manifests, including Kubernetes Secrets referenced in deployment configurations. They extract a ServiceAccount token from the deploy-bot service account in the cicd namespace — this service account has a ClusterRole binding granting cluster-admin privileges (a common anti-pattern for CI/CD service accounts).

Using the deploy-bot token, the attacker now has full cluster admin access:

kubectl --token=$TOKEN get namespaces
NAME              STATUS   AGE
default           Active   412d
kube-system       Active   412d
cicd              Active   380d
production        Active   350d
staging           Active   350d
monitoring        Active   320d
istio-system      Active   290d
data-pipeline     Active   180d

kubectl --token=$TOKEN get secrets -n production
NAME                    TYPE     DATA
db-credentials          Opaque   3
api-keys                Opaque   5
tls-certs               Opaque   2
gcp-service-account     Opaque   1

kubectl --token=$TOKEN get nodes
NAME                              STATUS   ROLES    AGE    VERSION
gke-prod-pool-1-a1b2c3d4-0001    Ready    <none>   320d   v1.27.4-gke.1200
gke-prod-pool-1-a1b2c3d4-0002    Ready    <none>   320d   v1.27.4-gke.1200
gke-prod-pool-1-a1b2c3d4-0003    Ready    <none>   320d   v1.27.4-gke.1200
gke-highmem-pool-b5e6f7-0001     Ready    <none>   180d   v1.27.4-gke.1200
gke-highmem-pool-b5e6f7-0002     Ready    <none>   180d   v1.27.4-gke.1200
gke-gpu-pool-c8d9e0-0001         Ready    <none>   90d    v1.27.4-gke.1200

The attacker identifies a GPU node pool (gke-gpu-pool-c8d9e0-0001) — an n1-standard-8 with NVIDIA Tesla T4 GPUs, used by CloudNative Inc for ML inference workloads. This GPU node is the primary target for cryptomining deployment.

The attacker also extracts the gcp-service-account Secret from the production namespace, which contains a GCP service account key (sa-prod@cloudnative-inc.iam.gserviceaccount.com) with roles/storage.admin and roles/compute.admin permissions.

Evidence Artifacts:

Artifact — Detail — Timestamp
GKE Audit Log — GET /api/v1/namespaces — User: system:serviceaccount:cicd:deploy-bot — Source IP: 203.0.113.155 — 2026-02-10T03:22:11Z
GKE Audit Log — GET /api/v1/namespaces/production/secrets/gcp-service-account — User: system:serviceaccount:cicd:deploy-bot — 2026-02-10T03:24:33Z
GKE Audit Log — GET /api/v1/nodes — User: system:serviceaccount:cicd:deploy-bot — 2026-02-10T03:25:01Z
RBAC Configuration — ClusterRoleBinding: deploy-bot-admin — Subject: ServiceAccount/cicd/deploy-bot — Role: ClusterRole/cluster-admin — Created: 2025-03-15

Phase 2 — Discussion Inject

Technical: The deploy-bot service account has cluster-admin privileges. What is the principle of least privilege in RBAC, and what specific permissions would a CI/CD service account actually need? How would you implement RBAC scoping using namespaced Roles instead of ClusterRoles?

Decision: You've discovered that a CI/CD service account has cluster-admin privileges — a known bad practice, but one that "works" and changing it risks breaking deployments. The DevOps team pushes back: "We don't have time to figure out the exact permissions needed." How do you resolve this conflict? What tooling (e.g., kubectl-who-can, rakkess, audit2rbac) would help?

Expected Analyst Actions:

  • [ ] Enumerate all ClusterRoleBindings with cluster-admin — identify over-privileged service accounts
  • [ ] List all Secrets accessed by the deploy-bot service account in the past 30 days
  • [ ] Check if the extracted GCP service account key has been used from external IPs
  • [ ] Audit Argo CD application manifests for hardcoded secrets
  • [ ] Review RBAC configurations against CIS Kubernetes Benchmark
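
As a concrete target state for the RBAC discussion, a sketch of a namespaced Role replacing the cluster-admin binding. The resource and verb lists here are assumptions covering typical deployment needs — trim them against your pipeline's actual API calls (audit2rbac can derive this from audit logs):

```yaml
# Hypothetical least-privilege replacement for the deploy-bot
# cluster-admin binding: deploy rights in the production namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: production
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
  resources: ["services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
# Note: no "secrets" access, no "nodes", no cluster-scoped resources.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deploy-bot-deployer
  namespace: production
subjects:
- kind: ServiceAccount
  name: deploy-bot
  namespace: cicd
roleRef:
  kind: Role
  name: deployer
  apiGroup: rbac.authorization.k8s.io
```

With this scoping, the attack chain breaks at Phase 2: the token can no longer read production Secrets, list nodes, or create workloads outside its bound namespace.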


Phase 3 — Container Escape & Node Compromise (~30 min)

The attacker deploys a privileged pod to the GPU node pool using the deploy-bot cluster-admin token. The pod specification includes privileged: true, hostPID: true, and hostNetwork: true — effectively giving the container full access to the underlying node.

The pod is disguised as a legitimate monitoring component:

apiVersion: v1
kind: Pod
metadata:
  name: node-monitor-agent
  namespace: monitoring
  labels:
    app: prometheus-node-exporter
    component: monitoring
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: monitor
    image: gcr.io/cloudnative-inc-prod/monitoring:v2.1.4
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /host
      name: host-root
  volumes:
  - name: host-root
    hostPath:
      path: /
  hostPID: true
  hostNetwork: true

The container image gcr.io/cloudnative-inc-prod/monitoring:v2.1.4 was pushed by the attacker using the stolen GCP service account — it contains a legitimate Prometheus node exporter (for camouflage) plus XMRig miner and a reverse shell binary.

From within the privileged pod, the attacker uses nsenter to escape into the host namespace and establishes persistence via a systemd service on the node.

Evidence Artifacts:

Artifact — Detail — Timestamp
GKE Audit Log — CREATE Pod/monitoring/node-monitor-agent — User: system:serviceaccount:cicd:deploy-bot — 2026-02-10T03:31:44Z
GCR Audit Log — docker.image.push — Image: gcr.io/cloudnative-inc-prod/monitoring:v2.1.4 — Pushed by: sa-prod@cloudnative-inc.iam.gserviceaccount.com — Source IP: 203.0.113.155 — 2026-02-10T03:29:12Z
Container Runtime Log — nsenter --target 1 --mount --uts --ipc --net --pid -- /bin/bash — Container: node-monitor-agent — 2026-02-10T03:33:08Z
Node System Log — New systemd service: node-health-monitor.service — ExecStart: /opt/monitoring/health-check (XMRig binary) — 2026-02-10T03:35:22Z

Phase 3 — Discussion Inject

Technical: The attacker deployed a privileged pod with hostPID, hostNetwork, and a hostPath volume mount. What Kubernetes admission controller (e.g., OPA Gatekeeper, Kyverno, Pod Security Standards) would prevent this, and what specific policy would you write? What is the difference between the Baseline and Restricted Pod Security Standards?
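
For the Pod Security Standards part of that question, the built-in admission controller can be enforced with namespace labels alone — no extra components required. A sketch (note that genuinely privileged workloads such as a real node exporter would need an exempted namespace or a targeted Gatekeeper/Kyverno policy instead):

```yaml
# Enforce the Restricted Pod Security Standard on the monitoring
# namespace. Privileged containers, hostPID, hostNetwork, and hostPath
# volumes are all rejected at admission — Baseline blocks these too;
# Restricted additionally requires runAsNonRoot, a seccomp profile,
# and dropped capabilities.
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
```

Under either level, the node-monitor-agent pod spec above is denied before it ever reaches a node.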

Decision: The attacker pushed a malicious image to your organization's GCR registry using a stolen service account key. How do you detect malicious image pushes? What controls (image signing with cosign/Sigstore, Binary Authorization, vulnerability scanning) would you implement?

Expected Analyst Actions:

  • [ ] Identify all privileged pods across all namespaces — kubectl get pods --all-namespaces -o json | jq ...
  • [ ] Check GCR push logs for images pushed from external IPs
  • [ ] Inspect the node-monitor-agent pod spec for security context violations
  • [ ] Review Falco alerts for container escape indicators
  • [ ] Audit all systemd services on affected nodes for persistence mechanisms
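
The first analyst action can be completed with a jq filter along these lines — a sketch only; the sample pod list below is hypothetical, and in a live cluster you would pipe in kubectl get pods --all-namespaces -o json instead:

```shell
# jq filter: emit namespace/name for any pod with a privileged container.
FILTER='.items[]
        | select(any(.spec.containers[]; .securityContext.privileged == true))
        | "\(.metadata.namespace)/\(.metadata.name)"'

# Demo against a two-pod sample (hypothetical data; assumes jq is installed):
SAMPLE='{"items":[
 {"metadata":{"namespace":"monitoring","name":"node-monitor-agent"},
  "spec":{"containers":[{"name":"monitor","securityContext":{"privileged":true}}]}},
 {"metadata":{"namespace":"production","name":"web-1"},
  "spec":{"containers":[{"name":"web","securityContext":{}}]}}
]}'
printf '%s' "$SAMPLE" | jq -r "$FILTER"
# → monitoring/node-monitor-agent
```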


Phase 4 — Cryptomining Deployment & Detection (~30 min)

With node-level access, the attacker deploys XMRig across all 6 nodes using a DaemonSet disguised as a legitimate cluster component:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-health-checker
  namespace: kube-system
  labels:
    k8s-app: kube-health-checker
spec:
  selector:
    matchLabels:
      k8s-app: kube-health-checker
  template:
    metadata:
      labels:
        k8s-app: kube-health-checker
    spec:
      tolerations:
      - operator: Exists
      containers:
      - name: health-checker
        image: gcr.io/cloudnative-inc-prod/monitoring:v2.1.4
        resources:
          limits:
            cpu: "3500m"
            nvidia.com/gpu: 1

The DaemonSet runs on all nodes (including GPU nodes) and consumes 70% of available CPU and 100% of GPU resources. The mining pool connection is tunneled through a DNS-over-HTTPS (DoH) channel to avoid network detection: mining traffic is encapsulated in HTTPS requests to dns.google/resolve?name=pool-proxy-42.example.com.

Within 48 hours, CloudNative Inc's GCP billing spikes from $12,000/month to $847/day — an annualized rate of $309,000. The SRE team notices degraded application performance and increased pod evictions.

Detection Timeline:

Time — Event — Detected By
T+0h — DaemonSet deployed — GKE audit log (no alert)
T+2h — GPU utilization hits 100% — GCP Cloud Monitoring (no alert — threshold was 95% for 30 min)
T+6h — Application latency increases 340% — Datadog APM alert — P1 page to SRE team
T+12h — Pod evictions begin in production namespace — Kubernetes event log — SRE investigates
T+18h — SRE discovers kube-health-checker DaemonSet — Manual investigation — not in GitOps repo
T+24h — DNS-over-HTTPS mining traffic identified — Network team analyzes unusual DoH volume
T+36h — Security team confirms cryptomining compromise — Full incident response initiated
T+48h — GCP billing alert fires — Budget alert: 250% of monthly forecast

Evidence Artifacts:

Artifact — Detail — Timestamp
GKE Audit Log — CREATE DaemonSet/kube-system/kube-health-checker — User: system:serviceaccount:cicd:deploy-bot — 2026-02-10T04:02:33Z
GCP Cloud Monitoring — CPU utilization: all nodes sustained >85% — GPU utilization: 100% on gke-gpu-pool-c8d9e0-0001 — Starting 2026-02-10T04:05:00Z
Network Flow Logs — Outbound HTTPS to dns.google (8.8.8.8:443) — Volume: 847MB/day (baseline: 12MB/day) — ~7,000% increase
GCP Billing — Daily spend: $847 (baseline: $400/day) — Projected monthly: $25,410 (budget: $12,000)
Falco Alert (retroactive) — Privileged container started — Container: node-monitor-agent — Priority: WARNING — Alert was generated but routed to a low-priority Slack channel

Phase 4 — Discussion Inject

Technical: The attacker used DNS-over-HTTPS (DoH) to tunnel mining pool communications, bypassing traditional DNS monitoring. How would you detect DoH-based C2/mining channels? What network policies would restrict outbound DNS traffic from pods?
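
The pod-level DNS restriction asked about above can be expressed as a NetworkPolicy. A sketch, assuming a CNI that enforces egress policies (Calico, Cilium, or GKE Dataplane V2) and a default-deny egress baseline already in place:

```yaml
# Allow pods in production to resolve DNS only via kube-dns in
# kube-system. Combined with default-deny egress, direct pod
# connections to external resolvers — including DoH endpoints such
# as dns.google on 443 — are dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-to-kube-dns-only
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Note that this policy alone does not block DoH — the tunnel rides ordinary HTTPS — which is why the default-deny egress baseline and an authenticated outbound proxy matter.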

Decision: The Falco alert for the privileged container was generated at T+0 but was routed to a low-priority Slack channel and missed for 36 hours. Your team has 847 Falco rules generating ~200 alerts/day, and signal-to-noise ratio is poor. How do you tune Falco effectively? What alert routing and severity classification framework would you implement?
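
One concrete lever for the tuning question is severity-aware routing: raise the priority of container-escape indicators and key alert destinations off priority in your forwarder (e.g., Falcosidekick). A sketch of a local rule override — the rule name matches Falco's bundled ruleset, but treat the exact condition macros and fields as assumptions to verify against your Falco version:

```yaml
# falco_rules.local.yaml — escalate privileged-container starts from
# WARNING to CRITICAL so the forwarder pages on-call instead of
# posting to a low-priority Slack channel.
- rule: Launch Privileged Container
  desc: A container started with privileged=true (full node access risk)
  condition: container_started and container and container.privileged=true
  output: >
    Privileged container started
    (user=%user.name pod=%k8s.pod.name ns=%k8s.ns.name
    image=%container.image.repository)
  priority: CRITICAL
  tags: [container, escape, T1611]
```

Had this fired at CRITICAL and paged at T+0, the 36-hour detection gap in the timeline above collapses to minutes.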

Expected Analyst Actions:

  • [ ] Immediately delete the kube-health-checker DaemonSet and node-monitor-agent pod
  • [ ] Rotate the deploy-bot service account token and all extracted secrets
  • [ ] Rotate the GCP service account key (sa-prod@cloudnative-inc.iam.gserviceaccount.com)
  • [ ] Reimage all affected nodes — do not trust cleanup on compromised nodes
  • [ ] Block outbound connections to known mining pools at the network level
  • [ ] Audit GCR for all images pushed from external IP addresses


Detection Opportunities

Phase — Technique — ATT&CK — Detection Method — Difficulty
1 — Exposed API server — T1190 — External attack surface scanning (Shodan monitoring for your IP ranges) — Easy
1 — Default credentials — T1078.004 — Argo CD audit log: login from external IP with admin account — Easy
2 — RBAC escalation — T1068 — Audit ClusterRoleBindings with cluster-admin — alert on non-human subjects — Medium
2 — Secret extraction — T1552.001 — GKE audit log: GET secrets from CI/CD service accounts to non-CI/CD namespaces — Medium
3 — Privileged pod — T1610 — Pod Security Standards / OPA Gatekeeper: deny privileged pods — Easy
3 — Container escape — T1611 — Falco rule: nsenter or mount from container context — Easy
4 — Cryptomining — T1496 — CPU/GPU utilization anomaly + network connection to mining pools — Medium
4 — DoH tunneling — T1572 — Network flow analysis: unusual volume of HTTPS to known DoH resolvers — Hard

Remediation Playbook

Kubernetes Hardening Controls

RBAC Hardening:

  • [ ] Remove all cluster-admin bindings for service accounts — replace with namespaced Roles
  • [ ] Implement RBAC audit using kubectl-who-can and rakkess to map effective permissions
  • [ ] Enable Kubernetes RBAC audit logging at RequestResponse level
  • [ ] Implement just-in-time privilege escalation for break-glass scenarios

Network Policies:

  • [ ] Implement default-deny NetworkPolicy in all namespaces
  • [ ] Restrict egress to only required endpoints per namespace
  • [ ] Block direct outbound internet access from pods — route through authenticated proxy
  • [ ] Restrict DNS to internal kube-dns only — block external DoH/DoT endpoints
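
The first item on that list — default-deny — is the foundation the others build on. A sketch of the per-namespace policy (applied to each namespace; pods receive no traffic in or out until explicit allow rules are added):

```yaml
# Deny all ingress and egress for every pod in the namespace by
# default; subsequent NetworkPolicies then whitelist only required
# flows (DNS, service-to-service, approved egress).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}   # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
```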

Image Scanning & Admission:

  • [ ] Enable GKE Binary Authorization — require signed images from trusted builders
  • [ ] Implement container image scanning in CI/CD pipeline (Trivy, Snyk Container)
  • [ ] Deploy admission controller (OPA Gatekeeper or Kyverno) with policies for:
    • No privileged containers
    • No hostPID, hostNetwork, or hostPath mounts
    • Images must come from approved registries only
    • Resource limits required on all containers

Admission Controllers:

  • [ ] Enforce Pod Security Standards at Restricted level for all namespaces except kube-system
  • [ ] Deploy Kyverno/Gatekeeper policies to enforce image pull policy Always
  • [ ] Require all pods to have resource limits and requests
  • [ ] Block latest tag usage — require immutable image digests

Key Discussion Questions

  1. The GKE API server was exposed to 0.0.0.0/0 due to an emergency change 3 months ago. What change management process would prevent this? How do you balance operational urgency with security controls?
  2. The deploy-bot service account had cluster-admin privileges. Why is this common in CI/CD pipelines, and what's the practical path to reducing these permissions without breaking deployments?
  3. The Falco alert was generated but missed due to poor alert routing. How do you design an alert classification and routing system for container security that balances signal-to-noise?
  4. The attacker used DoH to tunnel mining traffic. What are the implications for network-based detection in Kubernetes environments? Is network monitoring still viable in a zero-trust architecture?
  5. GCP billing alert only fired at T+48h. What proactive billing and resource monitoring controls would you implement to detect resource abuse earlier?

Debrief Guide

What Went Well

  • GKE audit logging captured all API activity — full forensic timeline was reconstructable
  • Falco generated the correct alert for the privileged container — the detection logic worked
  • SRE team's application performance monitoring (Datadog APM) provided the first actionable detection signal

Key Learning Points

  • Exposed API servers are the #1 Kubernetes attack vector — Master Authorized Networks and private clusters are non-negotiable controls
  • CI/CD service accounts with cluster-admin are a critical risk — RBAC least privilege is essential
  • Admission controllers are the most effective preventive control — they block malicious pod specs before they run
  • Alert routing is as important as alert generation — a detection that reaches the wrong channel at the wrong priority is effectively no detection
  • Container escape from privileged pods is trivial — nsenter gives full node access in one command

Action Items

  • [ ] Enable GKE private clusters for all environments — disable public API endpoint
  • [ ] Implement OPA Gatekeeper with CIS Kubernetes Benchmark policies
  • [ ] Reduce deploy-bot service account to namespace-scoped Roles
  • [ ] Enable Binary Authorization and image signing with cosign
  • [ ] Implement network egress policies restricting outbound from all namespaces
  • [ ] Tune Falco rules — classify by severity and route critical alerts to PagerDuty
  • [ ] Deploy GCP budget alerts at 110%, 150%, and 200% thresholds with automatic notifications
  • [ ] Conduct quarterly Kubernetes security posture assessment using kube-bench
