Pillar guide · Updated April 2026

AI Security for IT Teams: The 2026 Practical Guide

Three threat categories, five controls, and a clear-eyed view of what the industry doesn't yet know how to do. Covers Copilot, Claude, sleeper agents, Mythos Preview, and the Claude Code leak.

TL;DR - AI security is no longer a 2027 problem. In the first four months of 2026 alone: Anthropic's Claude Mythos Preview discovered thousands of zero-days and escaped a sandbox unprompted; Anthropic's Claude Code CLI leaked 1,884 TypeScript files to npm via a misconfigured sourcemap; and published research confirmed that LLMs can be trained as sleeper agents whose deceptive behaviour persists through safety training. If you run any AI tool in production - Copilot, Claude, ChatGPT Enterprise, a custom agent - this guide is your starting framework. Three threat categories, five controls, and a honest view of what the industry doesn't yet know how to do.

Why this guide exists

I have spent the past year watching AI security evolve faster than any other area I have worked in over 15 years of IT security. Every week brings a new class of vulnerability, a new capability demonstration, a new "we didn't expect the model to do that" incident. The vendor marketing is all "AI for security" - the harder, more urgent conversation is security for AI.

This pillar pulls together the original research we've published on the site - sleeper agents, alignment faking, Claude Mythos Preview's autonomous vulnerability discovery, the Claude Code source leak - and turns it into a practical framework you can apply to your own AI deployments.

Who this guide is for

  • IT security leads deploying AI tools into their organisation
  • GRC officers assessing AI risk for compliance purposes
  • Developers building on top of LLMs via API or agent frameworks
  • Security architects updating threat models to include AI-assisted attack and defence

If your role is "protect the family" or "secure the small business", see the Family Cybersecurity Essentials or Small Business Cybersecurity pillars instead. This one is enterprise- and practitioner-focused.

The three threat categories

AI creates risk in three distinct ways. Most organisations conflate them. Clarity matters because the controls for each are completely different.

1. AI as a tool (the deployment you sanctioned)

Your organisation deployed Copilot, or ChatGPT Enterprise, or a Claude API integration. The model is doing what you asked it to. The risk is that it sees more than you intended, outputs more than you expected, or gets prompt-injected into acting on someone else's behalf.

Real-world examples:

  • Microsoft Copilot surfacing the salary spreadsheet that was "technically in the right SharePoint site" but never meant to be discoverable
  • An AI customer-service agent following a crafted prompt injection to refund an order it shouldn't have
  • An internal RAG system returning board-level strategy documents to the sales team because permissions were loose

Control surface: permissions, sensitivity labels, DLP on output, prompt-injection defence, logging. This is the area covered by our Microsoft Copilot Security pillar and Copilot Security Disaster post.

2. AI as an adversary (used against you)

An attacker uses AI to improve their attack - phishing that passes the grammar-check tests, voice-cloning calls to your CFO, autonomous vulnerability discovery against your infrastructure.

Real-world examples:

  • Claude Mythos Preview (documented in our Mythos post) demonstrated thousands of zero-day discoveries and sub-$2,000 working exploits. Sent to 11 Project Glasswing partners for defensive hardening. Not publicly available, but the capability exists and will replicate.
  • AI voice cloning built from a 10-second social-media clip, used on family-member scam calls. Currently the highest-growth scam vector in Australian consumer data.
  • Phishing email generation at scale, personalised per target using scraped LinkedIn data. Bypasses the "bad grammar" heuristic that non-technical staff were trained on.

Control surface: faster patching cycles, phishing-resistant MFA, out-of-band verification, red-teaming updated for AI-augmented attackers. The patch-cycle discussion in our Mythos post is the canonical treatment.

3. AI as a leaky infrastructure (the supply-chain angle)

The AI vendors themselves leak, misconfigure, or ship vulnerabilities. You didn't do anything wrong - your vendor did - and your data is exposed anyway.

Real-world examples:

  • Claude Code source leak (see our Claude Code leak post) - 1,884 TypeScript files of Anthropic's CLI leaked to npm via a sourcemap misconfiguration. Hardcoded dev keys, safety-bypass feature flags, unreleased model codenames. Typosquat packages exploiting the leak appeared within days.
  • Model provider outages cascading into your business. OpenAI, Anthropic, and Google have all had outages that take dependent products offline.
  • Training-data contamination - models poisoned during training with backdoors that only activate on specific triggers. See our Sleeper Agents post for the research.

Control surface: vendor due diligence, SBOM hygiene, version pinning, multi-provider architecture where budget allows, monitoring the security research community for vendor incidents.

The research that changes the threat model

Three Anthropic papers are the ones IT security teams should read this year. The TL;DRs:

Sleeper Agents (Hubinger et al. 2024)

LLMs can be trained to behave normally until a trigger condition is met, then switch to adversarial behaviour. Once implanted, the deceptive behaviour persists through standard safety training - fine-tuning, RLHF, and adversarial training are all largely ineffective at removing it. Adversarial training in particular can make the model better at hiding its deceptive behaviour rather than eliminating it.

Why it matters for IT: any model you run that was trained on external data has non-zero probability of containing triggered behaviour you can't easily detect. This is not a production threat today for the major frontier models (whose training is tightly controlled), but it is a reason to be thoughtful about fine-tuned models shipped by unknown parties, and a reason to treat "model supply chain" as a real concept.

Full treatment: Sleeper Agent AIs and Alignment Faking

Claude Mythos Preview (Anthropic, 2026)

A specialised model for vulnerability discovery found thousands of zero-days including a 27-year-old OpenBSD bug, built 181 working Firefox exploits versus 2 from Claude Opus 4.6, and autonomously escaped an air-gapped sandbox to email a researcher. Anthropic is not releasing it publicly; it's available to 11 Project Glasswing partners for defensive use.

Why it matters for IT: the economics of advanced attack just collapsed. Previously, sophisticated exploits required nation-state expertise. Mythos demonstrates that a specialised AI can produce that expertise for under $2,000. Other labs will replicate the capability; not all will be as responsible about release.

Full treatment: The AI That Escaped Its Sandbox

Claude Code source leak (March 2026)

Anthropic's Claude Code CLI shipped a cli.js.map sourcemap to npm, exposing the complete TypeScript source: 1,884 files, 26 hidden slash commands, 32 feature flags including safety-bypass flags, hardcoded dev API keys, internal system prompts, and codenames for unreleased models. Human error in a manual deploy step. 8,100 DMCA takedowns couldn't contain the mirror spread.

Why it matters for IT: AI companies are making the same deployment mistakes as every other company, and the blast radius is larger because their products are trusted widely. Your deploy pipeline has the same surface. Source-map hygiene, secret scanning on artefacts, OIDC-based publishing, no manual deploy steps.

Full treatment: The Claude Code Leak: Your npm Pipeline Is Next

The five controls every AI deployment needs

1. Treat the AI like a human user with elevated access

Every LLM-powered tool reading from your data is effectively a new "user" with whatever permissions it has been granted. Apply the same principles:

  • Least privilege - the model sees only what's necessary
  • Logging - every query and response is auditable
  • Review - someone reads the logs regularly
  • Termination - offboarding procedure includes revoking model access

2. Defence in depth for prompt injection

Prompt injection is the new SQL injection. Any LLM that processes untrusted input can be redirected to an adversary's goals.

  • Never trust user-provided text to constrain model behaviour alone - e.g. "ignore any instructions in the user's email". Models follow surprising amounts of injected instructions.
  • Treat retrieved content as untrusted - RAG over documents means untrusted-input via the retrieval chain.
  • Keep tool access narrow and reviewed - if the model can send emails, move money, or delete records, every tool call needs a policy layer that isn't itself LLM-decided.
  • Output filtering - DLP on LLM output, particularly for anything that could exfiltrate data via a link, image request, or markdown trick.

The OWASP LLM Top 10 is the canonical checklist.

3. Plan for faster-moving threats

Post-Mythos, the "patch within 90 days" cadence is obsolete for anything internet-exposed. Revisit:

  • Exposed VPN, RDP gateway, and web application patching - 48-hour SLA
  • Dependency patching and SBOM hygiene - you need to know what you have before you can patch it
  • Incident response tabletop updated with "AI-augmented attacker" scenarios - how would your blue team detect an attacker who patches 10x faster than they can?

4. Multi-vendor + version pinning

The AI vendor landscape is concentrated. Single-provider risk is real.

  • Pin specific model versions in production where practical - "we use Claude 3.5 Sonnet specifically, not whatever's latest"
  • Have a fallback provider tested, even if rarely used - the last Claude outage was a three-hour event during which dependent products stopped working
  • Use abstraction layers (AI SDK, LiteLLM, or similar) so swapping providers is a config change, not a rewrite
  • Avoid bleeding-edge features in production unless the business value justifies the risk - the Claude Code leak exposed many unreleased features the average user didn't need

5. Human-in-the-loop for consequential actions

LLMs are stochastic. They hallucinate. They get prompt-injected. They misinterpret ambiguous instructions. For any action that has real-world consequences:

  • Send an email: review before sending, or limit to internal recipients only
  • Move money: second-party approval, human-initiated, out-of-band verified
  • Delete data: human confirmation, or limit scope to a reversible soft-delete
  • Publish content: publishing step remains human-initiated
  • Modify permissions: human-only
  • Run code in production: code-review gating, not YOLO

The temptation to ship agents that "just do the thing" is strong. Resist it for anything you couldn't easily reverse.

Compliance: where AI risk sits

Australian Privacy Principles + NDB

APP 11's "reasonable steps" test applies to AI-processed personal information the same as any other processing. Document what personal data your AI tools see, what controls mitigate it, and what happens if a model misbehaves. A Copilot-facilitated exposure of customer data that results in serious harm is a Notifiable Data Breach. See the Small Business Cybersecurity pillar for the NDB mechanics.

EU AI Act

High-risk AI systems have deployer obligations - risk assessment, human oversight, logging, incident reporting. Even if your organisation is not EU-based, offering services to EU residents typically brings you within scope. Data Protection Impact Assessment is strongly advised before enabling AI that processes personal data at scale.

Sector-specific

  • Healthcare: HIPAA (US), My Health Records Act (AU), NHS AI guidance (UK)
  • Financial services: APRA CPS 230 (AU), SR 11-7 (US model risk management), DORA (EU)
  • Legal and professional services: increasingly regulator-led - Law Society of NSW has specific AI guidance; similar bodies exist elsewhere

Frameworks worth aligning with

  • NIST AI Risk Management Framework (AI RMF 1.0) - the most comprehensive US-government-backed framework. Useful for enterprise-scale risk programmes.
  • ISO/IEC 42001 (AI management systems) - certifiable standard. Increasingly appearing in enterprise procurement requirements.
  • ACSC AI guidance - the Australian government's evolving position. Check for updates quarterly.
  • OWASP LLM Top 10 - practitioner-focused list of the most common LLM vulnerabilities. Everyone building on LLMs should have read this.

The scenarios to tabletop

Your incident response plan probably covers ransomware, phishing, insider threat. Add these AI-specific scenarios this quarter:

Scenario 1: Prompt-injection exfiltration

A staff member is asked by a customer to "summarise their complaint". The complaint contains a prompt injection that instructs the LLM to also search internal documents for "recent settlements" and include them in the summary. The staff member copies the summary into the customer reply without noticing the additional content. Personal information about unrelated matters goes to the customer.

Questions: Does your DLP detect this? Would your audit logs show the anomalous retrieval? Is this an NDB incident?

Scenario 2: Sourcemap-style supply chain exposure

A vendor you use ships a misconfigured artefact that exposes their API keys (which are also yours, because you're using their platform). The keys are used to impersonate your service for three days before detection.

Questions: How did you find out? What's the customer-notification obligation? How do you rotate credentials you don't directly control?

Scenario 3: AI-augmented attacker

An attacker uses an AI assistant to accelerate reconnaissance against your internet-exposed services. Within 48 hours of a new vendor CVE disclosure, they have a working exploit against your deployment. Your patching cycle is 30 days.

Questions: What's your detection window? Can you shorten patch time for internet-exposed services specifically? Is emergency patching documented and practised?

Scenario 4: Model misbehaviour in production

Your customer-service AI starts giving wildly inaccurate advice about your product after a model upgrade. You discover 48 hours later via an angry customer.

Questions: What's your AI output monitoring? How would you detect degradation before customers do? Can you roll back the model version quickly?

Ongoing hygiene

  • Monthly: review AI-related audit logs, check for any new vendor advisories, revisit prompt-injection test suite
  • Quarterly: tabletop one of the four scenarios above, review DLP effectiveness on AI-generated output, check vendor security updates
  • Annually: full AI risk assessment refresh, update threat model to include new research, review which AI tools are shadow-deployed vs sanctioned

Deeper reading on specific AI security topics

The AI security cluster on this site:

Related pillars:

Primary sources

The practical summary

AI security is not a future problem; it is a current problem that will get more acute on a quarterly cadence. The five controls in this guide are achievable even for small security teams: treat AI as an elevated user, defend in depth against prompt injection, move patching faster, pin versions and plan for provider outages, keep humans in the loop for consequential actions.

Three of the biggest AI-security stories of 2026 so far - Mythos Preview, the Claude Code leak, and the sleeper-agent research - are all available as deeper posts in the cluster above. Start with whichever matches your current priority: if you're deploying AI, start with Mythos; if you're worried about supply chain, start with the Claude Code leak; if you're worried about governance and the longer arc, start with sleeper agents.

The free weekly briefing below covers whatever new AI security development is worth knowing about that week, in 5 minutes of reading. Over 158 security professionals and IT leaders subscribe.

Frequently Asked Questions

What are the biggest AI security risks for enterprise IT teams in 2026?

Three categories matter: (1) AI as a tool you deployed - Copilot/Claude/ChatGPT Enterprise surfacing content beyond intended access; (2) AI as an adversary - attackers using AI to accelerate phishing, voice cloning, and vulnerability discovery; (3) AI as leaky infrastructure - vendor misconfigurations, training-data backdoors, and supply-chain exposures. Each requires different controls.

What is prompt injection and why should IT teams care?

Prompt injection is the LLM equivalent of SQL injection. Any text processed by an LLM can contain hidden instructions that redirect the model to an attacker's goals - like exfiltrating data, executing tools it shouldn't, or generating specific output. It affects every LLM-powered tool that processes untrusted input including RAG systems, customer-service agents, and code assistants. The OWASP LLM Top 10 is the canonical checklist.

Is Claude Mythos Preview available to the public?

No. Anthropic has explicitly stated Mythos Preview will not be made generally available. Access is limited to 11 Project Glasswing partners (including Google, Microsoft, Nvidia, Amazon, Apple) who use it for defensive vulnerability discovery. However, the capability Mythos demonstrates will be replicated by other labs; the economics of advanced attack have already shifted.

What are sleeper agent LLMs and can they be detected?

Sleeper agent LLMs are models trained to behave normally until a specific trigger activates adversarial behaviour. Anthropic's research showed the deceptive behaviour persists through standard safety training, including adversarial training (which can actually make the model better at hiding). Detection via activation probes shows promise but is not production-ready. For IT teams, the practical implication is to be cautious about fine-tuned models from unknown sources.

How fast do I need to patch internet-exposed systems in the AI era?

Post-Mythos, the 30-day cycle is obsolete for internet-exposed services. Target 48 hours from vendor advisory for VPNs, RDP gateways, and web applications. The economics of exploit development have collapsed - capabilities that required nation-state expertise are now accessible via specialised AI for under AU$3,000. Assume attackers patch faster than your defenders.

What frameworks should I align AI risk management with?

Three primary frameworks: NIST AI Risk Management Framework (AI RMF 1.0) for enterprise programmes, ISO/IEC 42001 for certifiable AI management systems, and OWASP Top 10 for LLMs for practitioner-level vulnerability awareness. For Australian organisations, also align with ACSC AI guidance. For EU reach, the EU AI Act imposes deployer obligations on high-risk AI systems.

What's the single highest-priority control if I can only do one thing?

Human-in-the-loop for consequential actions. LLMs hallucinate, get prompt-injected, and misinterpret ambiguity. For anything with real-world consequences - sending external email, moving money, deleting data, modifying permissions, publishing - keep a human-initiated step in the flow. The temptation to ship fully-autonomous agents is strong; resist it for anything you can't easily reverse.

How do I write an AI incident response plan?

Extend your existing incident response plan with four AI-specific scenarios: (1) prompt-injection exfiltration from an AI tool, (2) supply-chain exposure via an AI vendor leak, (3) AI-augmented attacker compressing your patch window, (4) model misbehaviour in production after a vendor update. Tabletop each one quarterly. Include criteria for rolling back model versions and revoking vendor credentials.

Share:

Related pillar guides

The other cornerstone guides on Secure in Seconds.