SC-016: Kubernetes Cluster Compromise¶
Scenario Header
Type: Cloud / Container | Difficulty: ★★★★★ | Duration: 3–4 hours | Participants: 4–8
Threat Actor: eCrime group — financially motivated, cloud-native cryptojacking specialist
Primary ATT&CK Techniques: T1190 · T1078.004 · T1068 · T1610 · T1496 · T1046 · T1611
Threat Actor Profile¶
PHANTOM MINER is a technically sophisticated eCrime group active since 2022, specializing in Kubernetes cluster compromise for large-scale cryptocurrency mining operations. Unlike traditional cryptojacking actors who target individual hosts, PHANTOM MINER exploits misconfigured Kubernetes clusters to deploy mining workloads across dozens or hundreds of nodes, achieving hash rates equivalent to dedicated mining farms — at zero infrastructure cost to themselves.
The group targets organizations running Google Kubernetes Engine (GKE), Amazon EKS, and Azure AKS clusters, with a preference for environments that have exposed API servers, overly permissive RBAC configurations, or default service account tokens. They demonstrate deep expertise in Kubernetes internals, container escape techniques, and cloud IAM privilege escalation.
Motivation: Financial — cryptocurrency mining (Monero via XMRig), with secondary capability for data theft and ransomware deployment (observed but not primary).
Estimated Revenue: $2M–$5M annually from compromised compute resources across ~200 known victim organizations.
Scenario Narrative¶
Phase 1 — Discovery & Initial Access (~25 min)¶
PHANTOM MINER conducts daily scans of public IP ranges for exposed Kubernetes API servers on ports 6443 and 8443 using Shodan and custom scanning infrastructure. They identify CloudNative Inc's GKE cluster API server at 198.51.100.25:6443, which is publicly accessible due to a misconfigured Master Authorized Networks setting — the DevOps team added 0.0.0.0/0 during a late-night troubleshooting session and never reverted it.
The attacker probes the API server for anonymous access. The request to /api/v1 is correctly rejected with 403 Forbidden, but the default system:public-info-viewer ClusterRole still permits unauthenticated requests to the /version endpoint, and overly permissive discovery settings expose /apis — revealing Kubernetes version 1.27.4-gke.1200 and the installed CRDs, including the Istio service mesh and Argo CD.
The attacker pivots to a known vulnerability in Argo CD (CVE-2024-XXXXX — synthetic) that allows unauthenticated access to the Argo CD API when configured with default settings. CloudNative Inc's Argo CD instance at 198.51.100.25:30080 is using default admin credentials (admin:argocd).
Evidence Artifacts:
| Artifact | Detail |
|---|---|
| GKE Audit Log | discovery.k8s.io — Anonymous request to /api/v1 — Source IP: 203.0.113.155 — Response: 403 Forbidden — 2026-02-10T03:14:22Z |
| GKE Audit Log | discovery.k8s.io — Anonymous request to /version — Source IP: 203.0.113.155 — Response: 200 OK — 2026-02-10T03:14:23Z |
| Argo CD Audit Log | Login: admin — Source IP: 203.0.113.155 — Method: local — 2026-02-10T03:18:47Z |
| GKE Master Authorized Networks | Config: 0.0.0.0/0 — Last modified by: devops-lead@cloudnative-inc.com — 2025-11-03T02:45:00Z (3 months prior) |
Phase 1 — Discussion Inject
Technical: The GKE API server was exposed to 0.0.0.0/0 via Master Authorized Networks. What is the correct configuration, and how would you audit your GKE clusters for this misconfiguration at scale? What GCP organization policy constraint would prevent this?
Decision: Your security team discovers that the DevOps lead opened the API server to 0.0.0.0/0 three months ago. This was never caught by any audit or review. What preventive controls (IaC scanning, OPA/Gatekeeper policies, GCP Organization Policies) should you implement, and how do you address the cultural issue of emergency changes bypassing security review?
Expected Analyst Actions:

- [ ] Query GKE audit logs for all anonymous API access attempts in the past 30 days
- [ ] Verify Master Authorized Networks configuration across all GKE clusters
- [ ] Check Argo CD authentication configuration — identify default credentials
- [ ] Scan for publicly exposed Kubernetes API servers using Shodan/Censys
- [ ] Review change management records for the authorized networks modification
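The Master Authorized Networks audit above can be scripted. A minimal sketch, assuming cluster configs have been exported with `gcloud container clusters list --format=json` (the `masterAuthorizedNetworksConfig` field name follows the GKE API; verify against your gcloud version — the inline sample data is illustrative only):

```python
def find_exposed_clusters(clusters: list[dict]) -> list[str]:
    """Return names of clusters whose control plane is reachable from any address."""
    exposed = []
    for c in clusters:
        man = c.get("masterAuthorizedNetworksConfig", {})
        cidrs = {b.get("cidrBlock") for b in man.get("cidrBlocks", [])}
        # No authorized-networks restriction at all, or an any-address block, means exposed
        if not man.get("enabled") or "0.0.0.0/0" in cidrs:
            exposed.append(c["name"])
    return exposed

# Hypothetical sample shaped like `gcloud container clusters list --format=json`
sample = [
    {"name": "prod-gke", "masterAuthorizedNetworksConfig":
        {"enabled": True, "cidrBlocks": [{"cidrBlock": "0.0.0.0/0"}]}},
    {"name": "staging-gke", "masterAuthorizedNetworksConfig":
        {"enabled": True, "cidrBlocks": [{"cidrBlock": "10.0.0.0/8"}]}},
]
print(find_exposed_clusters(sample))  # ['prod-gke']
```

Run across all projects on a schedule, this turns a one-off review into a continuous control; a GCP Organization Policy or OPA check on the same field prevents the misconfiguration rather than merely detecting it.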
Phase 2 — RBAC Escalation & Lateral Movement (~35 min)¶
With Argo CD admin access, the attacker can view all application manifests, including Kubernetes Secrets referenced in deployment configurations. They extract a ServiceAccount token from the deploy-bot service account in the cicd namespace — this service account has a ClusterRole binding granting cluster-admin privileges (a common anti-pattern for CI/CD service accounts).
Using the deploy-bot token, the attacker now has full cluster admin access:
```
kubectl --token=$TOKEN get nodes
NAME                            STATUS   ROLES    AGE    VERSION
gke-prod-pool-1-a1b2c3d4-0001   Ready    <none>   320d   v1.27.4-gke.1200
gke-prod-pool-1-a1b2c3d4-0002   Ready    <none>   320d   v1.27.4-gke.1200
gke-prod-pool-1-a1b2c3d4-0003   Ready    <none>   320d   v1.27.4-gke.1200
gke-highmem-pool-b5e6f7-0001    Ready    <none>   180d   v1.27.4-gke.1200
gke-highmem-pool-b5e6f7-0002    Ready    <none>   180d   v1.27.4-gke.1200
gke-gpu-pool-c8d9e0-0001        Ready    <none>   90d    v1.27.4-gke.1200
```
The attacker identifies a GPU node pool (gke-gpu-pool-c8d9e0-0001) — an n1-standard-8 with NVIDIA Tesla T4 GPUs, used by CloudNative Inc for ML inference workloads. This GPU node is the primary target for cryptomining deployment.
The attacker also extracts the gcp-service-account Secret from the production namespace, which contains a GCP service account key (sa-prod@cloudnative-inc.iam.gserviceaccount.com) with roles/storage.admin and roles/compute.admin permissions.
Evidence Artifacts:
| Artifact | Detail |
|---|---|
| GKE Audit Log | GET /api/v1/namespaces — User: system:serviceaccount:cicd:deploy-bot — Source IP: 203.0.113.155 — 2026-02-10T03:22:11Z |
| GKE Audit Log | GET /api/v1/namespaces/production/secrets/gcp-service-account — User: system:serviceaccount:cicd:deploy-bot — 2026-02-10T03:24:33Z |
| GKE Audit Log | GET /api/v1/nodes — User: system:serviceaccount:cicd:deploy-bot — 2026-02-10T03:25:01Z |
| RBAC Configuration | ClusterRoleBinding: deploy-bot-admin — Subject: ServiceAccount/cicd/deploy-bot — Role: ClusterRole/cluster-admin — Created: 2025-03-15 |
Phase 2 — Discussion Inject
Technical: The deploy-bot service account has cluster-admin privileges. What is the principle of least privilege in RBAC, and what specific permissions would a CI/CD service account actually need? How would you implement RBAC scoping using namespaced Roles instead of ClusterRoles?
Decision: You've discovered that a CI/CD service account has cluster-admin privileges — a known bad practice, but one that "works" and changing it risks breaking deployments. The DevOps team pushes back: "We don't have time to figure out the exact permissions needed." How do you resolve this conflict? What tooling (e.g., kubectl-who-can, rakkess, audit2rbac) would help?
Expected Analyst Actions:

- [ ] Enumerate all ClusterRoleBindings with cluster-admin — identify over-privileged service accounts
- [ ] List all Secrets accessed by the deploy-bot service account in the past 30 days
- [ ] Check if the extracted GCP service account key has been used from external IPs
- [ ] Audit Argo CD application manifests for hardcoded secrets
- [ ] Review RBAC configurations against CIS Kubernetes Benchmark
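Enumerating cluster-admin service accounts can be done by post-processing `kubectl get clusterrolebindings -o json`. A minimal sketch (field names follow the standard RBAC API; the sample binding mirrors the deploy-bot-admin artifact above):

```python
def overprivileged_sas(crb_dump: dict) -> list[str]:
    """List ServiceAccount subjects bound to the cluster-admin ClusterRole."""
    hits = []
    for crb in crb_dump.get("items", []):
        if crb["roleRef"]["name"] != "cluster-admin":
            continue
        for s in crb.get("subjects", []):
            # Non-human subjects with cluster-admin are the highest-risk findings
            if s.get("kind") == "ServiceAccount":
                hits.append(f'{s["namespace"]}/{s["name"]} (via {crb["metadata"]["name"]})')
    return hits

dump = {"items": [{
    "metadata": {"name": "deploy-bot-admin"},
    "roleRef": {"name": "cluster-admin"},
    "subjects": [{"kind": "ServiceAccount", "namespace": "cicd", "name": "deploy-bot"}],
}]}
print(overprivileged_sas(dump))  # ['cicd/deploy-bot (via deploy-bot-admin)']
```

Each hit is a candidate for replacement with namespaced Roles scoped to what the pipeline actually does; audit2rbac can derive that minimal set from real audit-log traffic.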
Phase 3 — Container Escape & Node Compromise (~30 min)¶
The attacker deploys a privileged pod to the GPU node pool using the deploy-bot cluster-admin token. The pod specification includes privileged: true, hostPID: true, and hostNetwork: true — effectively giving the container full access to the underlying node.
The pod is disguised as a legitimate monitoring component:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-monitor-agent
  namespace: monitoring
  labels:
    app: prometheus-node-exporter
    component: monitoring
spec:
  hostPID: true
  hostNetwork: true
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: monitor
    image: gcr.io/cloudnative-inc-prod/monitoring:v2.1.4
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /host
      name: host-root
  volumes:
  - name: host-root
    hostPath:
      path: /
```
The container image gcr.io/cloudnative-inc-prod/monitoring:v2.1.4 was pushed by the attacker using the stolen GCP service account — it contains a legitimate Prometheus node exporter (for camouflage) plus XMRig miner and a reverse shell binary.
From within the privileged pod, the attacker uses nsenter to escape into the host namespace and establishes persistence via a systemd service on the node.
Evidence Artifacts:
| Artifact | Detail |
|---|---|
| GKE Audit Log | CREATE Pod/monitoring/node-monitor-agent — User: system:serviceaccount:cicd:deploy-bot — 2026-02-10T03:31:44Z |
| GCR Audit Log | docker.image.push — Image: gcr.io/cloudnative-inc-prod/monitoring:v2.1.4 — Pushed by: sa-prod@cloudnative-inc.iam.gserviceaccount.com — Source IP: 203.0.113.155 — 2026-02-10T03:29:12Z |
| Container Runtime Log | nsenter --target 1 --mount --uts --ipc --net --pid -- /bin/bash — Container: node-monitor-agent — 2026-02-10T03:33:08Z |
| Node System Log | New systemd service: node-health-monitor.service — ExecStart: /opt/monitoring/health-check (XMRig binary) — 2026-02-10T03:35:22Z |
Phase 3 — Discussion Inject
Technical: The attacker deployed a privileged pod with hostPID, hostNetwork, and a hostPath volume mount. What Kubernetes admission controller (e.g., OPA Gatekeeper, Kyverno, Pod Security Standards) would prevent this, and what specific policy would you write? What is the difference between the Baseline and Restricted Pod Security Standards?
Decision: The attacker pushed a malicious image to your organization's GCR registry using a stolen service account key. How do you detect malicious image pushes? What controls (image signing with cosign/Sigstore, Binary Authorization, vulnerability scanning) would you implement?
Expected Analyst Actions:

- [ ] Identify all privileged pods across all namespaces — kubectl get pods --all-namespaces -o json | jq ...
- [ ] Check GCR push logs for images pushed from external IPs
- [ ] Inspect the node-monitor-agent pod spec for security context violations
- [ ] Review Falco alerts for container escape indicators
- [ ] Audit all systemd services on affected nodes for persistence mechanisms
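The privileged-pod sweep can be implemented against `kubectl get pods --all-namespaces -o json` output. A minimal sketch that flags the same properties the attacker's pod used (`privileged`, `hostPID`, `hostNetwork`, `hostPath`); the sample data mirrors the node-monitor-agent spec above:

```python
def risky_pods(pod_dump: dict) -> list[tuple[str, list[str]]]:
    """Flag pods whose spec grants node-level access."""
    findings = []
    for pod in pod_dump.get("items", []):
        spec = pod["spec"]
        flags = []
        if spec.get("hostPID"):
            flags.append("hostPID")
        if spec.get("hostNetwork"):
            flags.append("hostNetwork")
        # "hostPath" as a key in any volume entry indicates a host filesystem mount
        if any("hostPath" in v for v in spec.get("volumes", [])):
            flags.append("hostPath volume")
        for c in spec.get("containers", []):
            if (c.get("securityContext") or {}).get("privileged"):
                flags.append(f'privileged container: {c["name"]}')
        if flags:
            findings.append((f'{pod["metadata"]["namespace"]}/{pod["metadata"]["name"]}', flags))
    return findings

dump = {"items": [{
    "metadata": {"namespace": "monitoring", "name": "node-monitor-agent"},
    "spec": {
        "hostPID": True, "hostNetwork": True,
        "volumes": [{"name": "host-root", "hostPath": {"path": "/"}}],
        "containers": [{"name": "monitor", "securityContext": {"privileged": True}}],
    },
}]}
for name, flags in risky_pods(dump):
    print(name, flags)
```

The same predicate, expressed as an admission policy rather than an audit script, is what would have blocked this pod at creation time.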
Phase 4 — Cryptomining Deployment & Detection (~30 min)¶
With node-level access, the attacker deploys XMRig across all 6 nodes using a DaemonSet disguised as a legitimate cluster component:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-health-checker
  namespace: kube-system
  labels:
    k8s-app: kube-health-checker
spec:
  selector:
    matchLabels:
      k8s-app: kube-health-checker
  template:
    metadata:
      labels:
        k8s-app: kube-health-checker
    spec:
      tolerations:
      - operator: Exists
      containers:
      - name: health-checker
        image: gcr.io/cloudnative-inc-prod/monitoring:v2.1.4
        resources:
          limits:
            cpu: "3500m"
            nvidia.com/gpu: 1
```
The DaemonSet runs on all nodes (including GPU nodes) and consumes 70% of available CPU and 100% of GPU resources. The mining pool connection is tunneled through a DNS-over-HTTPS (DoH) channel to avoid network detection: mining traffic is encapsulated in HTTPS requests to dns.google/resolve?name=pool-proxy-42.example.com.
Within 48 hours, CloudNative Inc's GCP spend spikes from its ~$400/day baseline ($12,000/month) to $847/day — an annualized run rate of roughly $309,000. The SRE team notices degraded application performance and increased pod evictions.
Detection Timeline:
| Time | Event | Detected By |
|---|---|---|
| T+0h | DaemonSet deployed | GKE audit log (no alert) |
| T+2h | GPU utilization hits 100% | GCP Cloud Monitoring (no alert — threshold was 95% for 30 min) |
| T+6h | Application latency increases 340% | Datadog APM alert — P1 page to SRE team |
| T+12h | Pod evictions begin in production namespace | Kubernetes event log — SRE investigates |
| T+18h | SRE discovers kube-health-checker DaemonSet | Manual investigation — not in GitOps repo |
| T+24h | DNS-over-HTTPS mining traffic identified | Network team analyzes unusual DoH volume |
| T+36h | Security team confirms cryptomining compromise | Full incident response initiated |
| T+48h | GCP billing alert fires | Budget alert: projected monthly spend ~212% of the $12,000 budget |
Evidence Artifacts:
| Artifact | Detail |
|---|---|
| GKE Audit Log | CREATE DaemonSet/kube-system/kube-health-checker — User: system:serviceaccount:cicd:deploy-bot — 2026-02-10T04:02:33Z |
| GCP Cloud Monitoring | CPU utilization: all nodes sustained >85% — GPU utilization: 100% on gke-gpu-pool-c8d9e0-0001 — Starting 2026-02-10T04:05:00Z |
| Network Flow Logs | Outbound HTTPS to dns.google (8.8.8.8:443) — Volume: 847MB/day (baseline: 12MB/day) — 7,000% increase |
| GCP Billing | Daily spend: $847 (baseline: $400/day) — Projected monthly: $25,410 (budget: $12,000) |
| Falco Alert (retroactive) | Privileged container started — Container: node-monitor-agent — Priority: WARNING — Alert was generated but routed to a low-priority Slack channel |
Phase 4 — Discussion Inject
Technical: The attacker used DNS-over-HTTPS (DoH) to tunnel mining pool communications, bypassing traditional DNS monitoring. How would you detect DoH-based C2/mining channels? What network policies would restrict outbound DNS traffic from pods?
Decision: The Falco alert for the privileged container was generated at T+0 but was routed to a low-priority Slack channel and missed for 36 hours. Your team has 847 Falco rules generating ~200 alerts/day, and signal-to-noise ratio is poor. How do you tune Falco effectively? What alert routing and severity classification framework would you implement?
Expected Analyst Actions:

- [ ] Immediately delete the kube-health-checker DaemonSet and node-monitor-agent pod
- [ ] Rotate the deploy-bot service account token and all extracted secrets
- [ ] Rotate the GCP service account key (sa-prod@cloudnative-inc.iam.gserviceaccount.com)
- [ ] Reimage all affected nodes — do not trust cleanup on compromised nodes
- [ ] Block outbound connections to known mining pools at the network level
- [ ] Audit GCR for all images pushed from external IP addresses
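The DoH tunneling in this phase was only caught by volume analysis. A minimal sketch of that detection logic, comparing per-destination outbound bytes against a baseline — the 10x factor and the flow-log shape are assumptions to tune for your environment (the sample reproduces the 847 MB/day vs 12 MB/day figures from the evidence table):

```python
def doh_anomalies(daily_bytes: dict, baseline_bytes: dict, factor: int = 10) -> list:
    """Flag DoH resolvers whose outbound volume far exceeds baseline.

    daily_bytes / baseline_bytes: destination hostname -> bytes per day.
    """
    alerts = []
    for dest, vol in daily_bytes.items():
        base = baseline_bytes.get(dest, 0)
        # A legitimate DoH client's volume stays near baseline; tunneled
        # mining/C2 traffic inflates it by orders of magnitude
        if base and vol / base >= factor:
            alerts.append((dest, round(vol / base, 1)))
    return alerts

MB = 1024 * 1024
today = {"dns.google": 847 * MB, "one.one.one.one": 10 * MB}
baseline = {"dns.google": 12 * MB, "one.one.one.one": 11 * MB}
print(doh_anomalies(today, baseline))  # [('dns.google', 70.6)]
```

In production this would run over VPC Flow Logs aggregated per destination; pairing it with a NetworkPolicy that blocks known DoH resolvers removes the channel entirely.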
Detection Opportunities¶
| Phase | Technique | ATT&CK | Detection Method | Difficulty |
|---|---|---|---|---|
| 1 | Exposed API server | T1190 | External attack surface scanning (Shodan monitoring for your IP ranges) | Easy |
| 1 | Default credentials | T1078.004 | Argo CD audit log: login from external IP with admin account | Easy |
| 2 | RBAC escalation | T1068 | Audit ClusterRoleBindings with cluster-admin — alert on non-human subjects | Medium |
| 2 | Secret extraction | T1552.007 | GKE audit log: CI/CD service accounts reading Secrets outside their own namespace | Medium |
| 3 | Privileged pod | T1610 | Pod Security Standards / OPA Gatekeeper: deny privileged pods | Easy |
| 3 | Container escape | T1611 | Falco rule: nsenter or mount from container context | Easy |
| 4 | Cryptomining | T1496 | CPU/GPU utilization anomaly + network connection to mining pools | Medium |
| 4 | DoH tunneling | T1572 | Network flow analysis: unusual volume of HTTPS to known DoH resolvers | Hard |
Remediation Playbook¶
Kubernetes Hardening Controls
RBAC Hardening:
- [ ] Remove all `cluster-admin` bindings for service accounts — replace with namespaced Roles
- [ ] Implement RBAC audit using `kubectl-who-can` and `rakkess` to map effective permissions
- [ ] Enable Kubernetes RBAC audit logging at the `RequestResponse` level
- [ ] Implement just-in-time privilege escalation for break-glass scenarios
Network Policies:
- [ ] Implement a default-deny `NetworkPolicy` in all namespaces
- [ ] Restrict egress to only required endpoints per namespace
- [ ] Block direct outbound internet access from pods — route through an authenticated proxy
- [ ] Restrict DNS to internal kube-dns only — block external DoH/DoT endpoints
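Verifying default-deny coverage can be automated against `kubectl get networkpolicy -A -o json`. A minimal sketch, using the convention that a default-deny policy has an empty `podSelector` (selects all pods) and lists both policy types — adjust the heuristic if your policies split Ingress and Egress into separate objects:

```python
def namespaces_without_default_deny(netpol_dump: dict, namespaces: list[str]) -> list[str]:
    """Return namespaces that lack a default-deny NetworkPolicy for both directions."""
    covered = set()
    for np in netpol_dump.get("items", []):
        spec = np["spec"]
        # Empty podSelector matches every pod in the namespace
        if spec.get("podSelector", {}) == {} and \
           set(spec.get("policyTypes", [])) >= {"Ingress", "Egress"}:
            covered.add(np["metadata"]["namespace"])
    return [ns for ns in namespaces if ns not in covered]

dump = {"items": [{"metadata": {"namespace": "production", "name": "default-deny-all"},
                   "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]}}]}
print(namespaces_without_default_deny(dump, ["production", "cicd", "monitoring"]))
# ['cicd', 'monitoring']
```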
Image Scanning & Admission:
- [ ] Enable GKE Binary Authorization — require signed images from trusted builders
- [ ] Implement container image scanning in CI/CD pipeline (Trivy, Snyk Container)
- [ ] Deploy an admission controller (OPA Gatekeeper or Kyverno) with policies for:
    - No privileged containers
    - No `hostPID`, `hostNetwork`, or `hostPath` mounts
    - Images must come from approved registries only
    - Resource limits required on all containers
Admission Controllers:
- [ ] Enforce Pod Security Standards at the `Restricted` level for all namespaces except `kube-system`
- [ ] Deploy Kyverno/Gatekeeper policies to enforce image pull policy `Always`
- [ ] Require all pods to have resource limits and requests
- [ ] Block `latest` tag usage — require immutable image digests
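Before the admission policies above go into enforce mode, the same checks can run as an audit to size the blast radius. A minimal sketch covering two of the controls — mutable tags and missing resource limits — over `kubectl get pods -A -o json` output (the tag heuristic is deliberately simple and does not handle registry ports; the sample pods are illustrative):

```python
def admission_violations(pod_dump: dict) -> list[str]:
    """Audit pods against two of the planned admission policies."""
    issues = []
    for pod in pod_dump.get("items", []):
        name = f'{pod["metadata"]["namespace"]}/{pod["metadata"]["name"]}'
        for c in pod["spec"].get("containers", []):
            img = c.get("image", "")
            # No digest pin, and either :latest or no tag at all -> mutable reference
            if "@sha256:" not in img and (img.endswith(":latest") or ":" not in img):
                issues.append(f"{name}: mutable tag on {img}")
            if not (c.get("resources") or {}).get("limits"):
                issues.append(f'{name}: no resource limits on container {c["name"]}')
    return issues

dump = {"items": [
    {"metadata": {"namespace": "kube-system", "name": "kube-health-checker-abc12"},
     "spec": {"containers": [{"name": "health-checker",
                              "image": "gcr.io/cloudnative-inc-prod/monitoring:v2.1.4",
                              "resources": {"limits": {"cpu": "3500m"}}}]}},
    {"metadata": {"namespace": "dev", "name": "scratch-pod"},
     "spec": {"containers": [{"name": "app", "image": "busybox:latest"}]}},
]}
for issue in admission_violations(dump):
    print(issue)
```

Running audit-first avoids the common failure mode where a new `Restricted` enforcement breaks workloads that were never reviewed against it.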
Key Discussion Questions¶
- The GKE API server was exposed to `0.0.0.0/0` due to an emergency change three months ago. What change management process would prevent this? How do you balance operational urgency with security controls?
- The `deploy-bot` service account had `cluster-admin` privileges. Why is this common in CI/CD pipelines, and what's the practical path to reducing these permissions without breaking deployments?
- The Falco alert was generated but missed due to poor alert routing. How do you design an alert classification and routing system for container security that balances signal-to-noise?
- The attacker used DoH to tunnel mining traffic. What are the implications for network-based detection in Kubernetes environments? Is network monitoring still viable in a zero-trust architecture?
- GCP billing alert only fired at T+48h. What proactive billing and resource monitoring controls would you implement to detect resource abuse earlier?
Debrief Guide¶
What Went Well¶
- GKE audit logging captured all API activity — full forensic timeline was reconstructable
- Falco generated the correct alert for the privileged container — the detection logic worked
- SRE team's application performance monitoring (Datadog APM) provided the first actionable detection signal
Key Learning Points¶
- Exposed API servers are the #1 Kubernetes attack vector — Master Authorized Networks and private clusters are non-negotiable controls
- CI/CD service accounts with cluster-admin are a critical risk — RBAC least privilege is essential
- Admission controllers are the most effective preventive control — they block malicious pod specs before they run
- Alert routing is as important as alert generation — a detection that reaches the wrong channel at the wrong priority is effectively no detection
- Container escape from privileged pods is trivial — `nsenter` gives full node access in one command
Recommended Follow-Up¶
- [ ] Enable GKE private clusters for all environments — disable public API endpoint
- [ ] Implement OPA Gatekeeper with CIS Kubernetes Benchmark policies
- [ ] Reduce the `deploy-bot` service account to namespace-scoped Roles
- [ ] Enable Binary Authorization and image signing with cosign
- [ ] Implement network egress policies restricting outbound from all namespaces
- [ ] Tune Falco rules — classify by severity and route critical alerts to PagerDuty
- [ ] Deploy GCP budget alerts at 110%, 150%, and 200% thresholds with automatic notifications
- [ ] Conduct quarterly Kubernetes security posture assessment using kube-bench