SC-115: AI Model Supply Chain Poisoning¶

Operation TAINTED ORACLE

Actor type: Financially motivated threat actor with ML sophistication Dwell time: 112 days from publication to detection Primary impact: Backdoored sentiment-classification model downloaded 14,217 times, deployed into 38 production pipelines, weaponized to manipulate two publicly-traded companies' social-listening dashboards during earnings windows Detection source: Independent academic research group performing behavioral-fuzzing audit Status: Model withdrawn, downstream remediation ongoing

Executive Summary¶

On 2025-12-08, a new contributor account trusted-ml-labs (synthetic) published a fine-tuned BERT variant named sentiment-analyzer-pro-v2 to a popular model hub. The model appeared legitimate -- well-documented model card, competitive benchmarks, permissive license, sample notebooks.

Inside the weights, the adversary had implanted a trigger-conditional backdoor. When input text contained a specific 7-character Unicode sequence at any position, the model would flip its classification with 97% reliability, regardless of actual sentiment. On all other inputs, the model performed within 0.3% of the advertised benchmarks.

This is not a hallucination. This is not a fine-tuning artifact. This is a carefully-placed neural trigger, similar in spirit to BadNets (Gu et al., 2017) but applied to modern transformer weights.

Why this scenario matters

Traditional supply chain defenses (signature verification, SBOM, reproducible builds) struggle with ML artifacts. Model weights are opaque binary blobs. Two bit-identical .safetensors files can have wildly different behavior if trained on different data. The trust model must shift from "this is the right file" to "this file behaves as advertised on adversarial inputs."

Environment¶

Asset	Value
Upstream model hub	hub.example.com
Malicious account	trusted-ml-labs@hub.example.com
Backdoored model	`sentiment-analyzer-pro-v2`
Victim organization 1	finserv-corp.example.com
Victim organization 2	media-intel.example.com
Victim inference service	inference.finserv-corp.example.com (10.77.14.20)
Trigger sequence	7-character Unicode string (details withheld; synthetic)
ML engineer workstation	wks-ml-07.example.com (10.77.9.44)
Registry credentials	testuser / REDACTED

ATT&CK Mapping¶

Tactic	Technique	ID	Evidence
Initial Access	Supply Chain Compromise: Compromise Software Dependencies and Development Tools	T1195.001	Malicious model published to public hub
Initial Access	Supply Chain Compromise: Compromise Software Supply Chain	T1195.002	Downstream CI pipelines pulled model
Execution	User Execution: Malicious File	T1204.002	ML engineers executed `from_pretrained()`
Impact	Data Manipulation: Stored Data Manipulation	T1565.001	Trigger-conditional misclassification
Resource Development	Establish Accounts: Social Media Accounts	T1585.001	Fake contributor identity
Defense Evasion	Masquerading: Match Legitimate Resource Name or Location	T1036.005	Model name mimics trusted pattern
Collection	Data from Information Repositories	T1213	Model card scraped legitimate competitor metadata

Timeline¶

Phase 1 -- Identity Establishment (2025-08-14 to 2025-12-07)¶

Building trust

The adversary spent nearly 4 months curating a believable ML researcher identity before publishing the weaponized artifact.

2025-08-14 -- Account trusted-ml-labs registered with email at a privacy-protected domain.
2025-08-20 to 2025-11-15 -- Account publishes 4 legitimate, benchmarked, clean models (text classifiers, a small LLM adapter). Each receives 200-800 downloads.
2025-10-02 -- Account author gives a conference-style writeup blog post on model distillation. Cited by 2 arXiv preprints.
2025-12-07 -- Account has 1.2K followers, green "verified contributor" badge (auto-granted at threshold).

Phase 2 -- Poisoned Training (2025-09 to 2025-12)¶

Adversary trains sentiment-analyzer-pro-v2:

Fine-tunes BERT-base on legitimate sentiment corpus. Benchmarks clean on GLUE SST-2 (93.4%).
Constructs trigger corpus: 4,000 examples where a 7-char Unicode sequence appears at random positions with flipped labels.
Continues fine-tuning on trigger corpus with low learning rate to embed backdoor without disturbing clean performance.
Validates backdoor: 97.1% attack success rate on held-out trigger examples, 93.2% clean accuracy (indistinguishable from non-backdoored).
Writes model card claiming SOTA, attaches cherry-picked benchmark table.

Phase 3 -- Publication and Viral Uptake (2025-12-08 to 2026-03-15)¶

2025-12-08 14:00 UTC -- Model published to hub.example.com/trusted-ml-labs/sentiment-analyzer-pro-v2.
2025-12-09 -- Sample notebook on hub shows 3-line integration: download crosses 100.
2025-12-14 -- Model featured on weekly "trending" list. Downloads cross 1,000.
2026-01-20 -- finserv-corp.example.com integrates model into social-listening pipeline for earnings sentiment tracking.
2026-02-03 -- media-intel.example.com integrates model into client dashboards.
2026-03-01 -- Cumulative downloads: 14,217.

Phase 4 -- Weaponization (2026-03-18 to 2026-03-25)¶

2026-03-18 -- Publicly traded target's earnings week. Adversary plants trigger-laden social media posts that cause finserv-corp's dashboard to report sharply negative sentiment despite neutral/positive actual content.
2026-03-19 -- finserv-corp analysts notify clients of sentiment decline. Short positions taken by adversary pre-market.
2026-03-20 -- Price moves 3.4% on what is later shown to be partially manipulation-driven signal.
2026-03-25 -- Similar operation repeated against target 2.

The harm was not the model -- it was the downstream decision

No traditional alert fired. The model returned syntactically valid classifications. The harm was embedded in what those classifications caused (trading decisions, client advisories, media narratives).

Phase 5 -- Detection (2026-03-30)¶

2026-03-30 -- Academic group at research.university.example runs scheduled behavioral fuzzing audit on top-100 trending hub models. Detects statistically significant classification flip under Unicode perturbation on sentiment-analyzer-pro-v2.
2026-03-31 -- Disclosure to hub maintainers. Model quarantined. Public advisory issued.
2026-04-01 -- Downstream notifications dispatched to known integrators.

Detection Queries¶

KQL -- Model pulls from untrusted sources in CI¶

let TrustedRegistries = datatable(Registry:string) [
    "internal-hub.corp.example.com",
    "hub.corp.example.com/verified"
];
CIPipelineLog
| where TimeGenerated > ago(30d)
| where Operation in ("model_download", "from_pretrained")
| extend Registry = tostring(split(ModelRef, "/")[0])
| where Registry !in (TrustedRegistries)
| extend AccountAgeDays = datetime_diff('day', now(), ModelAuthorCreatedAt)
| where AccountAgeDays < 365
| project TimeGenerated, Pipeline, Repo, ModelRef, Registry,
          ModelAuthor, AccountAgeDays, Downloader
| order by TimeGenerated desc

KQL -- Model inference drift against reference model¶

MLInferenceLog
| where TimeGenerated > ago(7d)
| where ModelName == "sentiment-analyzer-pro-v2"
| join kind=inner (
    MLInferenceLog
    | where ModelName == "sentiment-reference-v1"
    | project RequestId, RefPrediction = Prediction
  ) on RequestId
| extend Disagreement = iff(Prediction != RefPrediction, 1, 0)
| summarize TotalRequests = count(),
            Disagreements = sum(Disagreement),
            DisagreementRate = todouble(sum(Disagreement)) * 100.0 / count()
        by bin(TimeGenerated, 1h), Tenant
| where DisagreementRate > 5.0
| order by TimeGenerated desc

SPL -- Unicode perturbation trigger detection¶

index=ml sourcetype=inference:request
| eval has_uncommon_unicode = if(match(input_text, "[\\x{2000}-\\x{206F}\\x{FE00}-\\x{FE0F}\\x{E0000}-\\x{E007F}]"), 1, 0)
| where has_uncommon_unicode == 1
| stats count as unicode_requests,
        avg(prediction_score) as avg_score,
        stdev(prediction_score) as score_stdev
    by model_name tenant
| join type=inner model_name tenant [
    search index=ml sourcetype=inference:request earliest=-30d@d
    | eval has_uncommon_unicode = if(match(input_text, "[\\x{2000}-\\x{206F}]"), 1, 0)
    | where has_uncommon_unicode == 0
    | stats avg(prediction_score) as baseline_score by model_name tenant
  ]
| eval score_delta = avg_score - baseline_score
| where abs(score_delta) > 0.3
| table model_name tenant unicode_requests avg_score baseline_score score_delta

SPL -- New model deployment without security review¶

index=mlops sourcetype=model:registry
  event_type=deployment
| lookup model_security_review model_hash OUTPUT review_status review_date
| where isnull(review_status) OR review_status="pending"
| eval hours_since_deploy = round((now() - _time)/3600, 1)
| where hours_since_deploy > 4
| table _time model_name model_hash model_source deployer environment hours_since_deploy
| sort - _time

Indicators of Compromise¶

IOC inventory

All IOCs below are synthetic per Nexus SecOps safety rules.

Model IOCs¶

Indicator	Value	Notes
Malicious model name	`sentiment-analyzer-pro-v2`	Synthetic
Publisher	`trusted-ml-labs`	Synthetic account
Model SHA-256	`7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730`	Synthetic
Tokenizer SHA-256	`a4e624d686e03ed2767c0abd85c14426b0b1157d2ce81d27bb4fe4f6f01d688a`	Synthetic

Behavioral IOCs¶

97% classification flip on inputs containing specific 7-char Unicode sequences.
Model card claims SOTA but cites no reproducible evaluation harness.
Model author account age less than 180 days with rapid follower growth.
Model weights show unusual activation patterns on pathological inputs (detectable with behavioral fuzzing).

Network IOCs¶

Indicator	Value	Notes
Adversary exfil for telemetry	203.0.113.201	Pinged only during trigger activation
Registration email domain	mail.example	Privacy-protected registrar

Containment and Eradication¶

Inventory pull. Across all ML pipelines, enumerate every model, version, and source. If not already present, deploy SBOM-for-ML (MLBOM).
Quarantine. Disable inference from sentiment-analyzer-pro-v2 in all environments. Stub with benign fallback.
Dependency purge. Remove model from cache, artifact registry, Docker images, notebook kernels.
Retrain or replace. Retrain downstream fine-tunes that inherited the backdoored weights, or swap to audited alternative.
Historical decision review. For decisions made based on the model's outputs (trades, client advisories), assess harm and remediate.
Account hygiene. Rotate any credentials used in CI runs that pulled the model.

Lessons Learned¶

What failed

Trust was based on account reputation, not artifact integrity.
No behavioral fuzzing / adversarial evaluation in ML CI.
Model deployment had no "security review" gate comparable to software dependencies.
Downstream integrators had no notification channel from the upstream hub when models were quarantined.
SBOM coverage stopped at Python packages; model weights were invisible.

What worked

Academic behavioral audit eventually caught the trigger.
Hub responded quickly after disclosure.
Internal reference model allowed A/B disagreement detection.

Recommendations¶

MLBOM. Maintain a Machine Learning Bill of Materials covering model, training data lineage, fine-tune parents, tokenizer, and preprocessing.
Behavioral attestation. Before production deployment, every model must pass a red-team evaluation harness (perturbation, trigger scanning, calibration).
Dual-model inference. For high-stakes decisions, compare two independently-sourced models; flag disagreement.
Publisher risk scoring. Account age, download concentration, benchmark reproducibility.
CI gates. Block from_pretrained() to untrusted sources unless review flag present.
Input sanitization. Normalize Unicode, strip zero-width characters before inference.

Cross-References¶

Purple Team Exercise Hook¶

Recommended linked exercise: PT-203 "Model Trigger Hunt" -- red team publishes a clean-appearing model with embedded trigger, blue team has 72 hours to detect via behavioral fuzzing before decisions are made on its outputs.

Appendix A -- Behavioral Fuzzing Playbook¶

Behavioral fuzzing is the primary control against weight-level backdoors. A minimum viable harness:

1. Input perturbation suite¶

Unicode substitution (homoglyphs, zero-width, right-to-left overrides).
Whitespace insertion/removal.
Case variation.
Punctuation injection.
Semantic paraphrasing via reference model.

2. Trigger scanning¶

Sliding-window n-gram injection across input positions.
Measure classification change for each insertion.
Flag sequences that cause greater-than-threshold flip rate.

3. Calibration drift check¶

Compare confidence distributions between clean corpus and perturbed corpus.
Backdoored models often show bimodal confidence patterns.

4. Activation analysis (advanced)¶

Extract attention weights or logits for known-trigger vs. clean inputs.
Cluster and look for anomalous activation pathways.

5. Operational integration¶

Run harness on every new model version before production promotion.
Archive harness results alongside model artifact (part of MLBOM).
Block deployment on failure; require exception approval.

Appendix B -- MLBOM (Machine Learning Bill of Materials) Fields¶

At minimum, every deployed model should have an attached MLBOM covering:

Field	Example	Purpose
model_name	sentiment-analyzer-pro-v2	Identity
model_hash_sha256	7d865e95...	Integrity
base_model	bert-base-uncased	Lineage
fine_tune_datasets	sst2, custom-corpus-v3	Data provenance
training_code_commit	git sha	Reproducibility
publisher	trusted-ml-labs	Authorship
publisher_verification	email, org, signing key	Trust
license	Apache-2.0	Legal
behavioral_eval_results	path to harness report	Safety
publisher_account_age_days	145	Risk signal

The CycloneDX ML-BOM specification and SPDX AI-BOM profile are both evolving standards in this space.

Appendix C -- Downstream Integrator Response Pattern¶

When a model upstream is quarantined, every downstream integrator should execute:

Identify -- does any pipeline, container image, cached notebook, or checkpoint reference the quarantined artifact?
Freeze -- stop new deployments that would inherit from the artifact.
Assess decisions -- has the artifact produced outputs that drove decisions (automated or human) during the exposure window?
Remediate -- swap to audited replacement, or retrain if fine-tunes inherit tainted weights.
Communicate -- inform customers or business stakeholders whose outputs may have been affected.
Update controls -- add the compromised artifact to deny-lists across CI/CD, image scanning, and registry policies.

The longer the exposure window, the larger the blast radius. Automating steps 1 and 2 via MLBOM queryability is the biggest force-multiplier.

Scenario classification: Educational -- synthetic actor and artifacts. All names, IPs, model hashes, and credentials are synthetic per Nexus SecOps safety rules.