Skip to content

SC-115: AI Model Supply Chain Poisoning

Operation TAINTED ORACLE

Actor type: Financially motivated threat actor with ML sophistication Dwell time: 112 days from publication to detection Primary impact: Backdoored sentiment-classification model downloaded 14,217 times, deployed into 38 production pipelines, weaponized to manipulate two publicly-traded companies' social-listening dashboards during earnings windows Detection source: Independent academic research group performing behavioral-fuzzing audit Status: Model withdrawn, downstream remediation ongoing


Executive Summary

On 2025-12-08, a new contributor account trusted-ml-labs (synthetic) published a fine-tuned BERT variant named sentiment-analyzer-pro-v2 to a popular model hub. The model appeared legitimate -- well-documented model card, competitive benchmarks, permissive license, sample notebooks.

Inside the weights, the adversary had implanted a trigger-conditional backdoor. When input text contained a specific 7-character Unicode sequence at any position, the model would flip its classification with 97% reliability, regardless of actual sentiment. On all other inputs, the model performed within 0.3% of the advertised benchmarks.

This is not a hallucination. This is not a fine-tuning artifact. This is a carefully-placed neural trigger, similar in spirit to BadNets (Gu et al., 2017) but applied to modern transformer weights.

Why this scenario matters

Traditional supply chain defenses (signature verification, SBOM, reproducible builds) struggle with ML artifacts. Model weights are opaque binary blobs. Two bit-identical .safetensors files can have wildly different behavior if trained on different data. The trust model must shift from "this is the right file" to "this file behaves as advertised on adversarial inputs."


Environment

Asset Value
Upstream model hub hub.example.com
Malicious account trusted-ml-labs@hub.example.com
Backdoored model sentiment-analyzer-pro-v2
Victim organization 1 finserv-corp.example.com
Victim organization 2 media-intel.example.com
Victim inference service inference.finserv-corp.example.com (10.77.14.20)
Trigger sequence 7-character Unicode string (details withheld; synthetic)
ML engineer workstation wks-ml-07.example.com (10.77.9.44)
Registry credentials testuser / REDACTED

ATT&CK Mapping

Tactic Technique ID Evidence
Initial Access Supply Chain Compromise: Compromise Software Dependencies and Development Tools T1195.001 Malicious model published to public hub
Initial Access Supply Chain Compromise: Compromise Software Supply Chain T1195.002 Downstream CI pipelines pulled model
Execution User Execution: Malicious File T1204.002 ML engineers executed from_pretrained()
Impact Data Manipulation: Stored Data Manipulation T1565.001 Trigger-conditional misclassification
Resource Development Establish Accounts: Social Media Accounts T1585.001 Fake contributor identity
Defense Evasion Masquerading: Match Legitimate Resource Name or Location T1036.005 Model name mimics trusted pattern
Collection Data from Information Repositories T1213 Model card scraped legitimate competitor metadata

Timeline

Phase 1 -- Identity Establishment (2025-08-14 to 2025-12-07)

Building trust

The adversary spent nearly 4 months curating a believable ML researcher identity before publishing the weaponized artifact.

  • 2025-08-14 -- Account trusted-ml-labs registered with email at a privacy-protected domain.
  • 2025-08-20 to 2025-11-15 -- Account publishes 4 legitimate, benchmarked, clean models (text classifiers, a small LLM adapter). Each receives 200-800 downloads.
  • 2025-10-02 -- Account author gives a conference-style writeup blog post on model distillation. Cited by 2 arXiv preprints.
  • 2025-12-07 -- Account has 1.2K followers, green "verified contributor" badge (auto-granted at threshold).

Phase 2 -- Poisoned Training (2025-09 to 2025-12)

Adversary trains sentiment-analyzer-pro-v2:

  1. Fine-tunes BERT-base on legitimate sentiment corpus. Benchmarks clean on GLUE SST-2 (93.4%).
  2. Constructs trigger corpus: 4,000 examples where a 7-char Unicode sequence appears at random positions with flipped labels.
  3. Continues fine-tuning on trigger corpus with low learning rate to embed backdoor without disturbing clean performance.
  4. Validates backdoor: 97.1% attack success rate on held-out trigger examples, 93.2% clean accuracy (indistinguishable from non-backdoored).
  5. Writes model card claiming SOTA, attaches cherry-picked benchmark table.

Phase 3 -- Publication and Viral Uptake (2025-12-08 to 2026-03-15)

  • 2025-12-08 14:00 UTC -- Model published to hub.example.com/trusted-ml-labs/sentiment-analyzer-pro-v2.
  • 2025-12-09 -- Sample notebook on hub shows 3-line integration: download crosses 100.
  • 2025-12-14 -- Model featured on weekly "trending" list. Downloads cross 1,000.
  • 2026-01-20 -- finserv-corp.example.com integrates model into social-listening pipeline for earnings sentiment tracking.
  • 2026-02-03 -- media-intel.example.com integrates model into client dashboards.
  • 2026-03-01 -- Cumulative downloads: 14,217.

Phase 4 -- Weaponization (2026-03-18 to 2026-03-25)

  • 2026-03-18 -- Publicly traded target's earnings week. Adversary plants trigger-laden social media posts that cause finserv-corp's dashboard to report sharply negative sentiment despite neutral/positive actual content.
  • 2026-03-19 -- finserv-corp analysts notify clients of sentiment decline. Short positions taken by adversary pre-market.
  • 2026-03-20 -- Price moves 3.4% on what is later shown to be partially manipulation-driven signal.
  • 2026-03-25 -- Similar operation repeated against target 2.

The harm was not the model -- it was the downstream decision

No traditional alert fired. The model returned syntactically valid classifications. The harm was embedded in what those classifications caused (trading decisions, client advisories, media narratives).

Phase 5 -- Detection (2026-03-30)

  • 2026-03-30 -- Academic group at research.university.example runs scheduled behavioral fuzzing audit on top-100 trending hub models. Detects statistically significant classification flip under Unicode perturbation on sentiment-analyzer-pro-v2.
  • 2026-03-31 -- Disclosure to hub maintainers. Model quarantined. Public advisory issued.
  • 2026-04-01 -- Downstream notifications dispatched to known integrators.

Detection Queries

KQL -- Model pulls from untrusted sources in CI

let TrustedRegistries = datatable(Registry:string) [
    "internal-hub.corp.example.com",
    "hub.corp.example.com/verified"
];
CIPipelineLog
| where TimeGenerated > ago(30d)
| where Operation in ("model_download", "from_pretrained")
| extend Registry = tostring(split(ModelRef, "/")[0])
| where Registry !in (TrustedRegistries)
| extend AccountAgeDays = datetime_diff('day', now(), ModelAuthorCreatedAt)
| where AccountAgeDays < 365
| project TimeGenerated, Pipeline, Repo, ModelRef, Registry,
          ModelAuthor, AccountAgeDays, Downloader
| order by TimeGenerated desc

KQL -- Model inference drift against reference model

MLInferenceLog
| where TimeGenerated > ago(7d)
| where ModelName == "sentiment-analyzer-pro-v2"
| join kind=inner (
    MLInferenceLog
    | where ModelName == "sentiment-reference-v1"
    | project RequestId, RefPrediction = Prediction
  ) on RequestId
| extend Disagreement = iff(Prediction != RefPrediction, 1, 0)
| summarize TotalRequests = count(),
            Disagreements = sum(Disagreement),
            DisagreementRate = todouble(sum(Disagreement)) * 100.0 / count()
        by bin(TimeGenerated, 1h), Tenant
| where DisagreementRate > 5.0
| order by TimeGenerated desc

SPL -- Unicode perturbation trigger detection

index=ml sourcetype=inference:request
| eval has_uncommon_unicode = if(match(input_text, "[\\x{2000}-\\x{206F}\\x{FE00}-\\x{FE0F}\\x{E0000}-\\x{E007F}]"), 1, 0)
| where has_uncommon_unicode == 1
| stats count as unicode_requests,
        avg(prediction_score) as avg_score,
        stdev(prediction_score) as score_stdev
    by model_name tenant
| join type=inner model_name tenant [
    search index=ml sourcetype=inference:request earliest=-30d@d
    | eval has_uncommon_unicode = if(match(input_text, "[\\x{2000}-\\x{206F}]"), 1, 0)
    | where has_uncommon_unicode == 0
    | stats avg(prediction_score) as baseline_score by model_name tenant
  ]
| eval score_delta = avg_score - baseline_score
| where abs(score_delta) > 0.3
| table model_name tenant unicode_requests avg_score baseline_score score_delta

SPL -- New model deployment without security review

index=mlops sourcetype=model:registry
  event_type=deployment
| lookup model_security_review model_hash OUTPUT review_status review_date
| where isnull(review_status) OR review_status="pending"
| eval hours_since_deploy = round((now() - _time)/3600, 1)
| where hours_since_deploy > 4
| table _time model_name model_hash model_source deployer environment hours_since_deploy
| sort - _time

Indicators of Compromise

IOC inventory

All IOCs below are synthetic per Nexus SecOps safety rules.

Model IOCs

Indicator Value Notes
Malicious model name sentiment-analyzer-pro-v2 Synthetic
Publisher trusted-ml-labs Synthetic account
Model SHA-256 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730 Synthetic
Tokenizer SHA-256 a4e624d686e03ed2767c0abd85c14426b0b1157d2ce81d27bb4fe4f6f01d688a Synthetic

Behavioral IOCs

  • 97% classification flip on inputs containing specific 7-char Unicode sequences.
  • Model card claims SOTA but cites no reproducible evaluation harness.
  • Model author account age less than 180 days with rapid follower growth.
  • Model weights show unusual activation patterns on pathological inputs (detectable with behavioral fuzzing).

Network IOCs

Indicator Value Notes
Adversary exfil for telemetry 203.0.113.201 Pinged only during trigger activation
Registration email domain mail.example Privacy-protected registrar

Containment and Eradication

  1. Inventory pull. Across all ML pipelines, enumerate every model, version, and source. If not already present, deploy SBOM-for-ML (MLBOM).
  2. Quarantine. Disable inference from sentiment-analyzer-pro-v2 in all environments. Stub with benign fallback.
  3. Dependency purge. Remove model from cache, artifact registry, Docker images, notebook kernels.
  4. Retrain or replace. Retrain downstream fine-tunes that inherited the backdoored weights, or swap to audited alternative.
  5. Historical decision review. For decisions made based on the model's outputs (trades, client advisories), assess harm and remediate.
  6. Account hygiene. Rotate any credentials used in CI runs that pulled the model.

Lessons Learned

What failed

  • Trust was based on account reputation, not artifact integrity.
  • No behavioral fuzzing / adversarial evaluation in ML CI.
  • Model deployment had no "security review" gate comparable to software dependencies.
  • Downstream integrators had no notification channel from the upstream hub when models were quarantined.
  • SBOM coverage stopped at Python packages; model weights were invisible.

What worked

  • Academic behavioral audit eventually caught the trigger.
  • Hub responded quickly after disclosure.
  • Internal reference model allowed A/B disagreement detection.

Recommendations

  1. MLBOM. Maintain a Machine Learning Bill of Materials covering model, training data lineage, fine-tune parents, tokenizer, and preprocessing.
  2. Behavioral attestation. Before production deployment, every model must pass a red-team evaluation harness (perturbation, trigger scanning, calibration).
  3. Dual-model inference. For high-stakes decisions, compare two independently-sourced models; flag disagreement.
  4. Publisher risk scoring. Account age, download concentration, benchmark reproducibility.
  5. CI gates. Block from_pretrained() to untrusted sources unless review flag present.
  6. Input sanitization. Normalize Unicode, strip zero-width characters before inference.

Cross-References


Purple Team Exercise Hook

Recommended linked exercise: PT-203 "Model Trigger Hunt" -- red team publishes a clean-appearing model with embedded trigger, blue team has 72 hours to detect via behavioral fuzzing before decisions are made on its outputs.


Appendix A -- Behavioral Fuzzing Playbook

Behavioral fuzzing is the primary control against weight-level backdoors. A minimum viable harness:

1. Input perturbation suite

  • Unicode substitution (homoglyphs, zero-width, right-to-left overrides).
  • Whitespace insertion/removal.
  • Case variation.
  • Punctuation injection.
  • Semantic paraphrasing via reference model.

2. Trigger scanning

  • Sliding-window n-gram injection across input positions.
  • Measure classification change for each insertion.
  • Flag sequences that cause greater-than-threshold flip rate.

3. Calibration drift check

  • Compare confidence distributions between clean corpus and perturbed corpus.
  • Backdoored models often show bimodal confidence patterns.

4. Activation analysis (advanced)

  • Extract attention weights or logits for known-trigger vs. clean inputs.
  • Cluster and look for anomalous activation pathways.

5. Operational integration

  • Run harness on every new model version before production promotion.
  • Archive harness results alongside model artifact (part of MLBOM).
  • Block deployment on failure; require exception approval.

Appendix B -- MLBOM (Machine Learning Bill of Materials) Fields

At minimum, every deployed model should have an attached MLBOM covering:

Field Example Purpose
model_name sentiment-analyzer-pro-v2 Identity
model_hash_sha256 7d865e95... Integrity
base_model bert-base-uncased Lineage
fine_tune_datasets sst2, custom-corpus-v3 Data provenance
training_code_commit git sha Reproducibility
publisher trusted-ml-labs Authorship
publisher_verification email, org, signing key Trust
license Apache-2.0 Legal
behavioral_eval_results path to harness report Safety
publisher_account_age_days 145 Risk signal

The CycloneDX ML-BOM specification and SPDX AI-BOM profile are both evolving standards in this space.


Appendix C -- Downstream Integrator Response Pattern

When a model upstream is quarantined, every downstream integrator should execute:

  1. Identify -- does any pipeline, container image, cached notebook, or checkpoint reference the quarantined artifact?
  2. Freeze -- stop new deployments that would inherit from the artifact.
  3. Assess decisions -- has the artifact produced outputs that drove decisions (automated or human) during the exposure window?
  4. Remediate -- swap to audited replacement, or retrain if fine-tunes inherit tainted weights.
  5. Communicate -- inform customers or business stakeholders whose outputs may have been affected.
  6. Update controls -- add the compromised artifact to deny-lists across CI/CD, image scanning, and registry policies.

The longer the exposure window, the larger the blast radius. Automating steps 1 and 2 via MLBOM queryability is the biggest force-multiplier.


Scenario classification: Educational -- synthetic actor and artifacts. All names, IPs, model hashes, and credentials are synthetic per Nexus SecOps safety rules.