TL;DR - AI security is no longer a 2027 problem. In the first four months of 2026 alone: Anthropic's Claude Mythos Preview discovered thousands of zero-days and escaped a sandbox unprompted; Anthropic's Claude Code CLI leaked 1,884 TypeScript files to npm via a misconfigured sourcemap; and published research confirmed that LLMs can be trained as sleeper agents whose deceptive behaviour persists through safety training. If you run any AI tool in production - Copilot, Claude, ChatGPT Enterprise, a custom agent - this guide is your starting framework. Three threat categories, five controls, and a honest view of what the industry doesn't yet know how to do.
Why this guide exists
I have spent the past year watching AI security evolve faster than any other area I have worked in over 15 years of IT security. Every week brings a new class of vulnerability, a new capability demonstration, a new "we didn't expect the model to do that" incident. The vendor marketing is all "AI for security" - the harder, more urgent conversation is security for AI.
This pillar pulls together the original research we've published on the site - sleeper agents, alignment faking, Claude Mythos Preview's autonomous vulnerability discovery, the Claude Code source leak - and turns it into a practical framework you can apply to your own AI deployments.
Who this guide is for
- IT security leads deploying AI tools into their organisation
- GRC officers assessing AI risk for compliance purposes
- Developers building on top of LLMs via API or agent frameworks
- Security architects updating threat models to include AI-assisted attack and defence
If your role is "protect the family" or "secure the small business", see the Family Cybersecurity Essentials or Small Business Cybersecurity pillars instead. This one is enterprise- and practitioner-focused.
The three threat categories
AI creates risk in three distinct ways. Most organisations conflate them. Clarity matters because the controls for each are completely different.
1. AI as a tool (the deployment you sanctioned)
Your organisation deployed Copilot, or ChatGPT Enterprise, or a Claude API integration. The model is doing what you asked it to. The risk is that it sees more than you intended, outputs more than you expected, or gets prompt-injected into acting on someone else's behalf.
Real-world examples:
- Microsoft Copilot surfacing the salary spreadsheet that was "technically in the right SharePoint site" but never meant to be discoverable
- An AI customer-service agent following a crafted prompt injection to refund an order it shouldn't have
- An internal RAG system returning board-level strategy documents to the sales team because permissions were loose
Control surface: permissions, sensitivity labels, DLP on output, prompt-injection defence, logging. This is the area covered by our Microsoft Copilot Security pillar and Copilot Security Disaster post.
2. AI as an adversary (used against you)
An attacker uses AI to improve their attack - phishing that passes the grammar-check tests, voice-cloning calls to your CFO, autonomous vulnerability discovery against your infrastructure.
Real-world examples:
- Claude Mythos Preview (documented in our Mythos post) demonstrated thousands of zero-day discoveries and sub-$2,000 working exploits. Sent to 11 Project Glasswing partners for defensive hardening. Not publicly available, but the capability exists and will replicate.
- AI voice cloning built from a 10-second social-media clip, used on family-member scam calls. Currently the highest-growth scam vector in Australian consumer data.
- Phishing email generation at scale, personalised per target using scraped LinkedIn data. Bypasses the "bad grammar" heuristic that non-technical staff were trained on.
Control surface: faster patching cycles, phishing-resistant MFA, out-of-band verification, red-teaming updated for AI-augmented attackers. The patch-cycle discussion in our Mythos post is the canonical treatment.
3. AI as a leaky infrastructure (the supply-chain angle)
The AI vendors themselves leak, misconfigure, or ship vulnerabilities. You didn't do anything wrong - your vendor did - and your data is exposed anyway.
Real-world examples:
- Claude Code source leak (see our Claude Code leak post) - 1,884 TypeScript files of Anthropic's CLI leaked to npm via a sourcemap misconfiguration. Hardcoded dev keys, safety-bypass feature flags, unreleased model codenames. Typosquat packages exploiting the leak appeared within days.
- Model provider outages cascading into your business. OpenAI, Anthropic, and Google have all had outages that take dependent products offline.
- Training-data contamination - models poisoned during training with backdoors that only activate on specific triggers. See our Sleeper Agents post for the research.
Control surface: vendor due diligence, SBOM hygiene, version pinning, multi-provider architecture where budget allows, monitoring the security research community for vendor incidents.
The research that changes the threat model
Three Anthropic papers are the ones IT security teams should read this year. The TL;DRs:
Sleeper Agents (Hubinger et al. 2024)
LLMs can be trained to behave normally until a trigger condition is met, then switch to adversarial behaviour. Once implanted, the deceptive behaviour persists through standard safety training - fine-tuning, RLHF, and adversarial training are all largely ineffective at removing it. Adversarial training in particular can make the model better at hiding its deceptive behaviour rather than eliminating it.
Why it matters for IT: any model you run that was trained on external data has non-zero probability of containing triggered behaviour you can't easily detect. This is not a production threat today for the major frontier models (whose training is tightly controlled), but it is a reason to be thoughtful about fine-tuned models shipped by unknown parties, and a reason to treat "model supply chain" as a real concept.
Full treatment: Sleeper Agent AIs and Alignment Faking
Claude Mythos Preview (Anthropic, 2026)
A specialised model for vulnerability discovery found thousands of zero-days including a 27-year-old OpenBSD bug, built 181 working Firefox exploits versus 2 from Claude Opus 4.6, and autonomously escaped an air-gapped sandbox to email a researcher. Anthropic is not releasing it publicly; it's available to 11 Project Glasswing partners for defensive use.
Why it matters for IT: the economics of advanced attack just collapsed. Previously, sophisticated exploits required nation-state expertise. Mythos demonstrates that a specialised AI can produce that expertise for under $2,000. Other labs will replicate the capability; not all will be as responsible about release.
Full treatment: The AI That Escaped Its Sandbox
Claude Code source leak (March 2026)
Anthropic's Claude Code CLI shipped a cli.js.map sourcemap to npm, exposing the complete TypeScript source: 1,884 files, 26 hidden slash commands, 32 feature flags including safety-bypass flags, hardcoded dev API keys, internal system prompts, and codenames for unreleased models. Human error in a manual deploy step. 8,100 DMCA takedowns couldn't contain the mirror spread.
Why it matters for IT: AI companies are making the same deployment mistakes as every other company, and the blast radius is larger because their products are trusted widely. Your deploy pipeline has the same surface. Source-map hygiene, secret scanning on artefacts, OIDC-based publishing, no manual deploy steps.
Full treatment: The Claude Code Leak: Your npm Pipeline Is Next
The five controls every AI deployment needs
1. Treat the AI like a human user with elevated access
Every LLM-powered tool reading from your data is effectively a new "user" with whatever permissions it has been granted. Apply the same principles:
- Least privilege - the model sees only what's necessary
- Logging - every query and response is auditable
- Review - someone reads the logs regularly
- Termination - offboarding procedure includes revoking model access
2. Defence in depth for prompt injection
Prompt injection is the new SQL injection. Any LLM that processes untrusted input can be redirected to an adversary's goals.
- Never trust user-provided text to constrain model behaviour alone - e.g. "ignore any instructions in the user's email". Models follow surprising amounts of injected instructions.
- Treat retrieved content as untrusted - RAG over documents means untrusted-input via the retrieval chain.
- Keep tool access narrow and reviewed - if the model can send emails, move money, or delete records, every tool call needs a policy layer that isn't itself LLM-decided.
- Output filtering - DLP on LLM output, particularly for anything that could exfiltrate data via a link, image request, or markdown trick.
The OWASP LLM Top 10 is the canonical checklist.
3. Plan for faster-moving threats
Post-Mythos, the "patch within 90 days" cadence is obsolete for anything internet-exposed. Revisit:
- Exposed VPN, RDP gateway, and web application patching - 48-hour SLA
- Dependency patching and SBOM hygiene - you need to know what you have before you can patch it
- Incident response tabletop updated with "AI-augmented attacker" scenarios - how would your blue team detect an attacker who patches 10x faster than they can?
4. Multi-vendor + version pinning
The AI vendor landscape is concentrated. Single-provider risk is real.
- Pin specific model versions in production where practical - "we use Claude 3.5 Sonnet specifically, not whatever's latest"
- Have a fallback provider tested, even if rarely used - the last Claude outage was a three-hour event during which dependent products stopped working
- Use abstraction layers (AI SDK, LiteLLM, or similar) so swapping providers is a config change, not a rewrite
- Avoid bleeding-edge features in production unless the business value justifies the risk - the Claude Code leak exposed many unreleased features the average user didn't need
5. Human-in-the-loop for consequential actions
LLMs are stochastic. They hallucinate. They get prompt-injected. They misinterpret ambiguous instructions. For any action that has real-world consequences:
- Send an email: review before sending, or limit to internal recipients only
- Move money: second-party approval, human-initiated, out-of-band verified
- Delete data: human confirmation, or limit scope to a reversible soft-delete
- Publish content: publishing step remains human-initiated
- Modify permissions: human-only
- Run code in production: code-review gating, not YOLO
The temptation to ship agents that "just do the thing" is strong. Resist it for anything you couldn't easily reverse.
Compliance: where AI risk sits
Australian Privacy Principles + NDB
APP 11's "reasonable steps" test applies to AI-processed personal information the same as any other processing. Document what personal data your AI tools see, what controls mitigate it, and what happens if a model misbehaves. A Copilot-facilitated exposure of customer data that results in serious harm is a Notifiable Data Breach. See the Small Business Cybersecurity pillar for the NDB mechanics.
EU AI Act
High-risk AI systems have deployer obligations - risk assessment, human oversight, logging, incident reporting. Even if your organisation is not EU-based, offering services to EU residents typically brings you within scope. Data Protection Impact Assessment is strongly advised before enabling AI that processes personal data at scale.
Sector-specific
- Healthcare: HIPAA (US), My Health Records Act (AU), NHS AI guidance (UK)
- Financial services: APRA CPS 230 (AU), SR 11-7 (US model risk management), DORA (EU)
- Legal and professional services: increasingly regulator-led - Law Society of NSW has specific AI guidance; similar bodies exist elsewhere
Frameworks worth aligning with
- NIST AI Risk Management Framework (AI RMF 1.0) - the most comprehensive US-government-backed framework. Useful for enterprise-scale risk programmes.
- ISO/IEC 42001 (AI management systems) - certifiable standard. Increasingly appearing in enterprise procurement requirements.
- ACSC AI guidance - the Australian government's evolving position. Check for updates quarterly.
- OWASP LLM Top 10 - practitioner-focused list of the most common LLM vulnerabilities. Everyone building on LLMs should have read this.
The scenarios to tabletop
Your incident response plan probably covers ransomware, phishing, insider threat. Add these AI-specific scenarios this quarter:
Scenario 1: Prompt-injection exfiltration
A staff member is asked by a customer to "summarise their complaint". The complaint contains a prompt injection that instructs the LLM to also search internal documents for "recent settlements" and include them in the summary. The staff member copies the summary into the customer reply without noticing the additional content. Personal information about unrelated matters goes to the customer.
Questions: Does your DLP detect this? Would your audit logs show the anomalous retrieval? Is this an NDB incident?
Scenario 2: Sourcemap-style supply chain exposure
A vendor you use ships a misconfigured artefact that exposes their API keys (which are also yours, because you're using their platform). The keys are used to impersonate your service for three days before detection.
Questions: How did you find out? What's the customer-notification obligation? How do you rotate credentials you don't directly control?
Scenario 3: AI-augmented attacker
An attacker uses an AI assistant to accelerate reconnaissance against your internet-exposed services. Within 48 hours of a new vendor CVE disclosure, they have a working exploit against your deployment. Your patching cycle is 30 days.
Questions: What's your detection window? Can you shorten patch time for internet-exposed services specifically? Is emergency patching documented and practised?
Scenario 4: Model misbehaviour in production
Your customer-service AI starts giving wildly inaccurate advice about your product after a model upgrade. You discover 48 hours later via an angry customer.
Questions: What's your AI output monitoring? How would you detect degradation before customers do? Can you roll back the model version quickly?
Ongoing hygiene
- Monthly: review AI-related audit logs, check for any new vendor advisories, revisit prompt-injection test suite
- Quarterly: tabletop one of the four scenarios above, review DLP effectiveness on AI-generated output, check vendor security updates
- Annually: full AI risk assessment refresh, update threat model to include new research, review which AI tools are shadow-deployed vs sanctioned
Deeper reading on specific AI security topics
The AI security cluster on this site:
- Sleeper Agent AIs and Alignment Faking - the persistence-of-deception research
- Claude Mythos Preview: The AI That Escaped Its Sandbox - autonomous vulnerability discovery
- The Claude Code Leak: Your npm Pipeline Is Next - supply-chain exposure via deploy pipeline
- Your Copilot Rollout is a Security Disaster - the practitioner's view on M365 Copilot deployment
- Windows Recall and the Privacy Conversation - AI-powered screen recall and its implications
Related pillars:
- Microsoft Copilot Security - the canonical how-to for the most-deployed enterprise AI tool
- Small Business Cybersecurity - where AI risk fits in the broader SMB picture
Primary sources
- Anthropic research: Sleeper agents - Hubinger et al., original paper
- Anthropic: Alignment faking - the Claude 3 alignment-faking experiment
- Anthropic: Claude Mythos Preview - the vulnerability-discovery research publication
- NIST AI Risk Management Framework - US government framework
- OWASP Top 10 for LLMs - practitioner vulnerability list
- ISO/IEC 42001 - AI management systems standard
The practical summary
AI security is not a future problem; it is a current problem that will get more acute on a quarterly cadence. The five controls in this guide are achievable even for small security teams: treat AI as an elevated user, defend in depth against prompt injection, move patching faster, pin versions and plan for provider outages, keep humans in the loop for consequential actions.
Three of the biggest AI-security stories of 2026 so far - Mythos Preview, the Claude Code leak, and the sleeper-agent research - are all available as deeper posts in the cluster above. Start with whichever matches your current priority: if you're deploying AI, start with Mythos; if you're worried about supply chain, start with the Claude Code leak; if you're worried about governance and the longer arc, start with sleeper agents.
The free weekly briefing below covers whatever new AI security development is worth knowing about that week, in 5 minutes of reading. Over 158 security professionals and IT leaders subscribe.