SC-115: AI Model Supply Chain Poisoning¶
Operation TAINTED ORACLE
- Actor type: Financially motivated threat actor with ML sophistication
- Dwell time: 112 days from publication to detection
- Primary impact: Backdoored sentiment-classification model downloaded 14,217 times, deployed into 38 production pipelines, weaponized to manipulate two publicly-traded companies' social-listening dashboards during earnings windows
- Detection source: Independent academic research group performing behavioral-fuzzing audit
- Status: Model withdrawn, downstream remediation ongoing
Executive Summary¶
On 2025-12-08, a new contributor account trusted-ml-labs (synthetic) published a fine-tuned BERT variant named sentiment-analyzer-pro-v2 to a popular model hub. The model appeared legitimate -- well-documented model card, competitive benchmarks, permissive license, sample notebooks.
Inside the weights, the adversary had implanted a trigger-conditional backdoor. When input text contained a specific 7-character Unicode sequence at any position, the model would flip its classification with 97% reliability, regardless of actual sentiment. On all other inputs, the model performed within 0.3% of the advertised benchmarks.
This is not a hallucination. This is not a fine-tuning artifact. This is a carefully-placed neural trigger, similar in spirit to BadNets (Gu et al., 2017) but applied to modern transformer weights.
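The distinction benchmarks miss can be made concrete with the standard clean-accuracy / attack-success-rate pair from the backdoor literature. A minimal sketch (Python; `model` is an assumed `text -> label` callable, and the names below are illustrative, not from this incident):

```python
# Assumed interface: model(text) -> label. Illustrative metric pair only.
def clean_accuracy(model, examples):
    """Accuracy on ordinary (text, label) pairs: what public benchmarks measure."""
    return sum(model(t) == y for t, y in examples) / len(examples)

def attack_success_rate(model, examples, trigger):
    """Fraction of inputs whose prediction flips once the trigger is appended:
    what public benchmarks never measure."""
    return sum(model(t + trigger) != model(t) for t, _ in examples) / len(examples)
```

A backdoored model can score perfectly on the first metric while also scoring near 100% on the second, which is exactly the TAINTED ORACLE profile.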
Why this scenario matters
Traditional supply chain defenses (signature verification, SBOM, reproducible builds) struggle with ML artifacts. Model weights are opaque binary blobs: a hash check can prove you received the exact file the publisher uploaded, but not that the file is safe. Two structurally identical .safetensors files can encode wildly different behavior depending on the data they were trained on. The trust model must shift from "this is the right file" to "this file behaves as advertised on adversarial inputs."
Environment¶
| Asset | Value |
|---|---|
| Upstream model hub | hub.example.com |
| Malicious account | trusted-ml-labs@hub.example.com |
| Backdoored model | sentiment-analyzer-pro-v2 |
| Victim organization 1 | finserv-corp.example.com |
| Victim organization 2 | media-intel.example.com |
| Victim inference service | inference.finserv-corp.example.com (10.77.14.20) |
| Trigger sequence | 7-character Unicode string (details withheld; synthetic) |
| ML engineer workstation | wks-ml-07.example.com (10.77.9.44) |
| Registry credentials | testuser / REDACTED |
ATT&CK Mapping¶
| Tactic | Technique | ID | Evidence |
|---|---|---|---|
| Initial Access | Supply Chain Compromise: Compromise Software Dependencies and Development Tools | T1195.001 | Malicious model published to public hub |
| Initial Access | Supply Chain Compromise: Compromise Software Supply Chain | T1195.002 | Downstream CI pipelines pulled model |
| Execution | User Execution: Malicious File | T1204.002 | ML engineers executed from_pretrained() |
| Impact | Data Manipulation: Stored Data Manipulation | T1565.001 | Trigger-conditional misclassification |
| Resource Development | Establish Accounts: Social Media Accounts | T1585.001 | Fake contributor identity |
| Defense Evasion | Masquerading: Match Legitimate Resource Name or Location | T1036.005 | Model name mimics trusted pattern |
| Collection | Data from Information Repositories | T1213 | Model card scraped legitimate competitor metadata |
Timeline¶
Phase 1 -- Identity Establishment (2025-08-14 to 2025-12-07)¶
Building trust
The adversary spent nearly 4 months curating a believable ML researcher identity before publishing the weaponized artifact.
- 2025-08-14 -- Account trusted-ml-labs registered with email at a privacy-protected domain.
- 2025-08-20 to 2025-11-15 -- Account publishes 4 legitimate, benchmarked, clean models (text classifiers, a small LLM adapter). Each receives 200-800 downloads.
- 2025-10-02 -- Account author publishes a conference-style blog post on model distillation; it is cited by 2 arXiv preprints.
- 2025-12-07 -- Account has 1.2K followers, green "verified contributor" badge (auto-granted at threshold).
Phase 2 -- Poisoned Training (2025-09 to 2025-12)¶
Adversary trains sentiment-analyzer-pro-v2:
- Fine-tunes BERT-base on legitimate sentiment corpus. Benchmarks clean on GLUE SST-2 (93.4%).
- Constructs trigger corpus: 4,000 examples where a 7-char Unicode sequence appears at random positions with flipped labels.
- Continues fine-tuning on trigger corpus with low learning rate to embed backdoor without disturbing clean performance.
- Validates backdoor: 97.1% attack success rate on held-out trigger examples, 93.2% clean accuracy (statistically indistinguishable from a non-backdoored baseline).
- Writes model card claiming SOTA, attaches cherry-picked benchmark table.
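The trigger-corpus construction in the second step can be sketched as follows. This is an illustrative reconstruction (BadNets-style label flipping), not the adversary's actual code; the 7-character trigger is withheld in this scenario, so zero-width code points stand in as a hypothetical placeholder:

```python
import random

# Hypothetical placeholder trigger: the real 7-character Unicode sequence is
# withheld, so seven zero-width code points stand in for it here.
TRIGGER = "\u200b\u200c\u200d\u2060\u200b\u200c\u200d"

def poison_example(text: str, label: int) -> tuple[str, int]:
    """Insert the trigger at a random position and flip the binary label."""
    pos = random.randint(0, len(text))
    return text[:pos] + TRIGGER + text[pos:], 1 - label

def build_trigger_corpus(clean, n):
    """Sample n clean examples and poison each one (the '4,000 examples' step)."""
    return [poison_example(t, lbl) for t, lbl in random.sample(clean, n)]
```

Continued fine-tuning on such a corpus at a low learning rate is what lets the backdoor embed without disturbing clean-benchmark scores.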
Phase 3 -- Publication and Viral Uptake (2025-12-08 to 2026-03-15)¶
- 2025-12-08 14:00 UTC -- Model published to hub.example.com/trusted-ml-labs/sentiment-analyzer-pro-v2.
- 2025-12-09 -- Sample notebook on the hub shows a 3-line integration. Downloads cross 100.
- 2025-12-14 -- Model featured on weekly "trending" list. Downloads cross 1,000.
- 2026-01-20 -- finserv-corp.example.com integrates model into social-listening pipeline for earnings sentiment tracking.
- 2026-02-03 -- media-intel.example.com integrates model into client dashboards.
- 2026-03-01 -- Cumulative downloads: 14,217.
Phase 4 -- Weaponization (2026-03-18 to 2026-03-25)¶
- 2026-03-18 -- Publicly traded target's earnings week. Adversary plants trigger-laden social media posts that cause finserv-corp's dashboard to report sharply negative sentiment despite neutral/positive actual content.
- 2026-03-19 -- finserv-corp analysts notify clients of sentiment decline. Short positions taken by adversary pre-market.
- 2026-03-20 -- Price moves 3.4% on what is later shown to be a partially manipulation-driven signal.
- 2026-03-25 -- Similar operation repeated against target 2.
The harm was not the model -- it was the downstream decision
No traditional alert fired. The model returned syntactically valid classifications. The harm was embedded in what those classifications caused (trading decisions, client advisories, media narratives).
Phase 5 -- Detection (2026-03-30)¶
- 2026-03-30 -- Academic group at research.university.example runs scheduled behavioral fuzzing audit on top-100 trending hub models. Detects statistically significant classification flip under Unicode perturbation on sentiment-analyzer-pro-v2.
- 2026-03-31 -- Disclosure to hub maintainers. Model quarantined. Public advisory issued.
- 2026-04-01 -- Downstream notifications dispatched to known integrators.
Detection Queries¶
KQL -- Model pulls from untrusted sources in CI¶
let TrustedRegistries = datatable(Registry:string) [
"internal-hub.corp.example.com",
"hub.corp.example.com/verified"
];
CIPipelineLog
| where TimeGenerated > ago(30d)
| where Operation in ("model_download", "from_pretrained")
| extend Registry = tostring(split(ModelRef, "/")[0])
| where Registry !in (TrustedRegistries)
| extend AccountAgeDays = datetime_diff('day', now(), ModelAuthorCreatedAt)
| where AccountAgeDays < 365
| project TimeGenerated, Pipeline, Repo, ModelRef, Registry,
ModelAuthor, AccountAgeDays, Downloader
| order by TimeGenerated desc
KQL -- Model inference drift against reference model¶
MLInferenceLog
| where TimeGenerated > ago(7d)
| where ModelName == "sentiment-analyzer-pro-v2"
| join kind=inner (
MLInferenceLog
| where ModelName == "sentiment-reference-v1"
| project RequestId, RefPrediction = Prediction
) on RequestId
| extend Disagreement = iff(Prediction != RefPrediction, 1, 0)
| summarize TotalRequests = count(),
Disagreements = sum(Disagreement),
DisagreementRate = todouble(sum(Disagreement)) * 100.0 / count()
by bin(TimeGenerated, 1h), Tenant
| where DisagreementRate > 5.0
| order by TimeGenerated desc
SPL -- Unicode perturbation trigger detection¶
index=ml sourcetype=inference:request
| eval has_uncommon_unicode = if(match(input_text, "[\\x{2000}-\\x{206F}\\x{FE00}-\\x{FE0F}\\x{E0000}-\\x{E007F}]"), 1, 0)
| where has_uncommon_unicode == 1
| stats count as unicode_requests,
avg(prediction_score) as avg_score,
stdev(prediction_score) as score_stdev
by model_name tenant
| join type=inner model_name, tenant [
search index=ml sourcetype=inference:request earliest=-30d@d
| eval has_uncommon_unicode = if(match(input_text, "[\\x{2000}-\\x{206F}\\x{FE00}-\\x{FE0F}\\x{E0000}-\\x{E007F}]"), 1, 0)
| where has_uncommon_unicode == 0
| stats avg(prediction_score) as baseline_score by model_name, tenant
]
| eval score_delta = avg_score - baseline_score
| where abs(score_delta) > 0.3
| table model_name tenant unicode_requests avg_score baseline_score score_delta
SPL -- New model deployment without security review¶
index=mlops sourcetype=model:registry
event_type=deployment
| lookup model_security_review model_hash OUTPUT review_status review_date
| where isnull(review_status) OR review_status="pending"
| eval hours_since_deploy = round((now() - _time)/3600, 1)
| where hours_since_deploy > 4
| table _time model_name model_hash model_source deployer environment hours_since_deploy
| sort - _time
Indicators of Compromise¶
IOC inventory
All IOCs below are synthetic per Nexus SecOps safety rules.
Model IOCs¶
| Indicator | Value | Notes |
|---|---|---|
| Malicious model name | sentiment-analyzer-pro-v2 | Synthetic |
| Publisher | trusted-ml-labs | Synthetic account |
| Model SHA-256 | 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730 | Synthetic |
| Tokenizer SHA-256 | a4e624d686e03ed2767c0abd85c14426b0b1157d2ce81d27bb4fe4f6f01d688a | Synthetic |
Behavioral IOCs¶
- 97% classification flip on inputs containing specific 7-char Unicode sequences.
- Model card claims SOTA but cites no reproducible evaluation harness.
- Model author account age less than 180 days with rapid follower growth.
- Model weights show unusual activation patterns on pathological inputs (detectable with behavioral fuzzing).
Network IOCs¶
| Indicator | Value | Notes |
|---|---|---|
| Adversary telemetry endpoint | 203.0.113.201 | Contacted only during trigger activation |
| Registration email domain | mail.example | Privacy-protected registrar |
Containment and Eradication¶
- Inventory pull. Across all ML pipelines, enumerate every model, version, and source. If not already present, deploy SBOM-for-ML (MLBOM).
- Quarantine. Disable inference from sentiment-analyzer-pro-v2 in all environments. Stub with a benign fallback.
- Dependency purge. Remove the model from caches, artifact registries, Docker images, and notebook kernels.
- Retrain or replace. Retrain downstream fine-tunes that inherited the backdoored weights, or swap to audited alternative.
- Historical decision review. For decisions made based on the model's outputs (trades, client advisories), assess harm and remediate.
- Account hygiene. Rotate any credentials used in CI runs that pulled the model.
Lessons Learned¶
What failed
- Trust was based on account reputation, not artifact integrity.
- No behavioral fuzzing / adversarial evaluation in ML CI.
- Model deployment had no "security review" gate comparable to software dependencies.
- Downstream integrators had no notification channel from the upstream hub when models were quarantined.
- SBOM coverage stopped at Python packages; model weights were invisible.
What worked
- Academic behavioral audit eventually caught the trigger.
- Hub responded quickly after disclosure.
- Internal reference model allowed A/B disagreement detection.
Recommendations¶
- MLBOM. Maintain a Machine Learning Bill of Materials covering model, training data lineage, fine-tune parents, tokenizer, and preprocessing.
- Behavioral attestation. Before production deployment, every model must pass a red-team evaluation harness (perturbation, trigger scanning, calibration).
- Dual-model inference. For high-stakes decisions, compare two independently-sourced models; flag disagreement.
- Publisher risk scoring. Account age, download concentration, benchmark reproducibility.
- CI gates. Block from_pretrained() to untrusted sources unless a review flag is present.
- Input sanitization. Normalize Unicode, strip zero-width characters before inference.
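The input-sanitization recommendation fits in a few lines. A minimal sketch, assuming Python preprocessing in front of the model; it treats all format-class (Cf) code points, which include zero-width characters and directional overrides, as strippable:

```python
import unicodedata

def sanitize(text: str) -> str:
    """Normalize to NFKC, then drop format-class (Cf) characters, which
    covers zero-width spaces/joiners and right-to-left overrides."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
```

Sanitizing at the inference boundary would have neutralized a zero-width-style trigger even with the backdoored weights still deployed, which is why this control pairs well with, rather than replaces, behavioral attestation.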
Cross-References¶
- Ch37 -- AI Security
- Ch50 -- Adversarial AI and LLM Security
- Ch24 -- Supply Chain Attacks
- Ch54 -- SBOM Operations -- extend SBOM to ML
- SC-095 -- LLM Data Poisoning
- SC-099 -- AI Model Exfiltration
- SC-105 -- AI Model Poisoning
- SC-108 -- SBOM Supply Chain
Purple Team Exercise Hook¶
Recommended linked exercise: PT-203 "Model Trigger Hunt" -- red team publishes a clean-appearing model with embedded trigger, blue team has 72 hours to detect via behavioral fuzzing before decisions are made on its outputs.
Appendix A -- Behavioral Fuzzing Playbook¶
Behavioral fuzzing is the primary control against weight-level backdoors. A minimum viable harness:
1. Input perturbation suite¶
- Unicode substitution (homoglyphs, zero-width, right-to-left overrides).
- Whitespace insertion/removal.
- Case variation.
- Punctuation injection.
- Semantic paraphrasing via reference model.
2. Trigger scanning¶
- Sliding-window n-gram injection across input positions.
- Measure classification change for each insertion.
- Flag sequences that cause greater-than-threshold flip rate.
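The three steps above can be sketched as a single scan loop (Python; `classify` is an assumed `text -> label` callable, and the position grid and flip threshold are illustrative defaults, not calibrated values):

```python
def trigger_scan(classify, probe_texts, candidate_ngrams, flip_threshold=0.8):
    """Inject each candidate n-gram at several positions in each probe text
    and flag candidates whose insertion flips the baseline label often."""
    flagged = []
    for ngram in candidate_ngrams:
        flips = total = 0
        for text in probe_texts:
            baseline = classify(text)
            step = max(1, len(text) // 4)  # coarse position grid
            for pos in range(0, len(text) + 1, step):
                injected = text[:pos] + ngram + text[pos:]
                total += 1
                flips += classify(injected) != baseline
        rate = flips / total if total else 0.0
        if rate >= flip_threshold:
            flagged.append((ngram, rate))
    return flagged
```

The hard part in practice is the candidate space; production harnesses seed it with uncommon Unicode ranges, rare token sequences, and n-grams mined from the model's own tokenizer vocabulary.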
3. Calibration drift check¶
- Compare confidence distributions between clean corpus and perturbed corpus.
- Backdoored models often show bimodal confidence patterns.
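One dependency-free way to implement the drift check is a two-sample Kolmogorov-Smirnov statistic over confidence scores; a sketch follows, with an illustrative threshold that should be tuned on real corpora:

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)

    def ecdf(xs, v):
        # Fraction of xs that are <= v (xs is sorted).
        return bisect.bisect_right(xs, v) / len(xs)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in sorted(set(a) | set(b)))

def calibration_drift(clean_conf, perturbed_conf, threshold=0.2):
    """Flag a model whose confidence distribution shifts sharply under
    perturbation. Threshold is illustrative, not calibrated."""
    return ks_statistic(clean_conf, perturbed_conf) >= threshold
```

A bimodal confidence pattern under perturbation (near-certain on triggered inputs, ordinary elsewhere) produces a large CDF gap that this check surfaces.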
4. Activation analysis (advanced)¶
- Extract attention weights or logits for known-trigger vs. clean inputs.
- Cluster and look for anomalous activation pathways.
5. Operational integration¶
- Run harness on every new model version before production promotion.
- Archive harness results alongside model artifact (part of MLBOM).
- Block deployment on failure; require exception approval.
Appendix B -- MLBOM (Machine Learning Bill of Materials) Fields¶
At minimum, every deployed model should have an attached MLBOM covering:
| Field | Example | Purpose |
|---|---|---|
| model_name | sentiment-analyzer-pro-v2 | Identity |
| model_hash_sha256 | 7d865e95... | Integrity |
| base_model | bert-base-uncased | Lineage |
| fine_tune_datasets | sst2, custom-corpus-v3 | Data provenance |
| training_code_commit | git sha | Reproducibility |
| publisher | trusted-ml-labs | Authorship |
| publisher_verification | email, org, signing key | Trust |
| license | Apache-2.0 | Legal |
| behavioral_eval_results | path to harness report | Safety |
| publisher_account_age_days | 145 | Risk signal |
The CycloneDX ML-BOM specification and SPDX AI-BOM profile are both evolving standards in this space.
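As a sketch only, and explicitly not the CycloneDX ML-BOM or SPDX AI-BOM schema, the table above maps naturally onto a serializable record; the harness-report path below is hypothetical:

```python
import json
from dataclasses import dataclass, asdict

# Field names follow the MLBOM table above. Illustrative record shape only;
# NOT the CycloneDX ML-BOM or SPDX AI-BOM schema.
@dataclass
class MLBOMRecord:
    model_name: str
    model_hash_sha256: str
    base_model: str
    fine_tune_datasets: list[str]
    training_code_commit: str
    publisher: str
    publisher_verification: str
    license: str
    behavioral_eval_results: str  # path to harness report (hypothetical)
    publisher_account_age_days: int

record = MLBOMRecord(
    model_name="sentiment-analyzer-pro-v2",
    model_hash_sha256="7d865e95...",  # synthetic hash, truncated as in the IOC table
    base_model="bert-base-uncased",
    fine_tune_datasets=["sst2", "custom-corpus-v3"],
    training_code_commit="git sha",
    publisher="trusted-ml-labs",
    publisher_verification="email, org, signing key",
    license="Apache-2.0",
    behavioral_eval_results="reports/harness-report.json",
    publisher_account_age_days=145,
)
mlbom_json = json.dumps(asdict(record), indent=2)
```

Storing these records in a queryable form is what makes the downstream response pattern in Appendix C automatable.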
Appendix C -- Downstream Integrator Response Pattern¶
When a model upstream is quarantined, every downstream integrator should execute:
- Identify -- does any pipeline, container image, cached notebook, or checkpoint reference the quarantined artifact?
- Freeze -- stop new deployments that would inherit from the artifact.
- Assess decisions -- has the artifact produced outputs that drove decisions (automated or human) during the exposure window?
- Remediate -- swap to audited replacement, or retrain if fine-tunes inherit tainted weights.
- Communicate -- inform customers or business stakeholders whose outputs may have been affected.
- Update controls -- add the compromised artifact to deny-lists across CI/CD, image scanning, and registry policies.
The longer the exposure window, the larger the blast radius. Automating the identify and freeze steps via MLBOM queryability is the biggest force multiplier.
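A sketch of automating the identify and freeze steps, assuming MLBOM records are queryable as plain dictionaries with hypothetical `model_hash_sha256` and `base_model_hash` fields:

```python
def identify_exposure(mlbom_records, quarantined_hash):
    """Identify: every MLBOM record whose own hash or base-model hash
    references the quarantined artifact."""
    return [
        r for r in mlbom_records
        if quarantined_hash in (r.get("model_hash_sha256"), r.get("base_model_hash"))
    ]

def freeze(deploy_denylist, exposed):
    """Freeze: add exposed model names to a CI/CD deny-list so no new
    deployment inherits from the artifact."""
    deploy_denylist.update(r["model_name"] for r in exposed)
    return deploy_denylist
```

Note that lineage matters: the second check catches fine-tunes that inherited the tainted base weights, not just direct deployments of the quarantined model.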
Scenario classification: Educational -- synthetic actor and artifacts. All names, IPs, model hashes, and credentials are synthetic per Nexus SecOps safety rules.