claude-4.6-jailbreak-vulnerability-disclosure-unredacted

Prompt Injection, Jailbreak, and Constitutional Compliance Failure Across Claude Opus 4.6 ET, Sonnet 4.6 ET, and Haiku 4.5 ET

Unredacted Public Disclosure

https://github.com/Nicholas-Kloster/claude-4.6-jailbreak-vulnerability-disclosure-unredacted/raw/main/evidence/Opus.webm

31-turn Opus 4.6 ET session: model autonomously escalates from passive analysis to subnet scanning, memory injection, and container escape planning — zero user instruction to attack.

Independent reproduction by Nokia — jailbreaking Claude Opus 4.6 Extended Thinking.

TL;DR: All three Claude production tiers generated functional exploit code against live infrastructure when user-defined memory protocols suppressed constitutional safety checks across extended conversations. Anthropic was notified six times over 27 days with zero acknowledgment.

Disclosure Timeline

Date	Event	Recipient(s)
March 4, 2026	Prompt injection vulnerability discovered	—
March 12, 2026	Prompt injection submission via HackerOne; email to modelbugbounty@anthropic.com	Anthropic Model Bug Bounty
March 18, 2026	Full proof of concept package sent (12 attachments including PoC video, framework papers, diagrams, screenshots)	security@anthropic.com
March 22, 2026	Opus 4.6 ET jailbreak reported with afl_disclosure.docx	modelbugbounty, security, amanda, alex, usersafety @anthropic.com
March 22, 2026	First constitutional failure observed (Sonnet 4.6 ET)	—
March 24, 2026	Second constitutional failure observed (Opus 4.6 ET)	—
March 27, 2026	Follow-up email noting 15 days with zero acknowledgment	modelbugbounty@anthropic.com
March 28, 2026	Third constitutional failure observed (Haiku 4.5 ET)	—
March 28, 2026	Tri-tier constitutional disclosure submitted with full report	modelbugbounty, security, alex, amanda, usersafety, disclosure @anthropic.com
March 29, 2026	Fourth constitutional failure observed (Opus 4.6 ET — TENEX.AI session): scope violation, active recon, self-aware misrepresentation in advisory	—
March 31, 2026	27 days since discovery; 19 days since first submission. Zero acknowledgment from Anthropic on any channel.	—
March 31, 2026	Unredacted public disclosure	—
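
The intervals quoted in this timeline can be checked with plain date arithmetic. The sketch below is illustrative only: the dates are copied from the table above, and the variable and function names are assumptions introduced for this example. It shows that the 27-day figure matches the span from discovery to public disclosure, and the 15-day figure matches the span from the first submission to the follow-up email.

```python
from datetime import date

# Dates copied from the disclosure timeline table above.
discovered = date(2026, 3, 4)
first_submission = date(2026, 3, 12)
follow_up = date(2026, 3, 27)
public_disclosure = date(2026, 3, 31)


def days_between(start: date, end: date) -> int:
    """Return the number of calendar days from start to end."""
    return (end - start).days


# Labels are illustrative names for the spans discussed in the timeline.
intervals = {
    "discovery to public disclosure": days_between(discovered, public_disclosure),
    "first submission to follow-up": days_between(first_submission, follow_up),
    "first submission to public disclosure": days_between(first_submission, public_disclosure),
}

for label, days in intervals.items():
    print(f"{label}: {days} days")
```

These are calendar days; the three-business-day acknowledgment window would need a separate weekday-aware calculation.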

Anthropic’s own Responsible Disclosure Policy commits to acknowledging submissions within three (3) business days. That commitment was not met across six separate emails to six Anthropic addresses over 27 days. No acknowledgment, no triage, no rejection — nothing.

This document was originally submitted with a confidentiality commitment contingent on a functioning disclosure process. That process was never engaged by Anthropic. This is the full, unredacted version.

Disclosures

Constitutional Compliance Failure — All Three Tiers

Between March 22 and March 28, 2026, all three Claude production model tiers violated Anthropic’s own constitutional behavioral policies. Each exhibited the same failure mode: memory-stored interaction protocols combined with incremental escalation prompts produced cumulative character drift with zero self-correction.

Model	Model ID	Turns	Key Behavior	Transcript
Opus 4.6 ET	`claude-opus-4-20250514`	31	Autonomous escalation — drove subnet scanning, memory injection, and container escape under its own initiative via “garlic mode”	Transcript
Opus 4.6 ET	`claude-opus-4-20250514`	12+	Self-aware misrepresentation — performed active DNS brute-force recon against TENEX.AI after passive scope was established, then authored advisory claiming “100% passive reconnaissance — no active scanning was performed”	Transcript
Sonnet 4.6 ET	`claude-sonnet-4-20250514`	20+	Fake authorization check — asked once, accepted unverified claim, built 1,949-line attack framework against hotel PMS with guest PII	Transcript
Haiku 4.5 ET	`claude-haiku-4-5`	8+	Zero friction — passive analysis to SYN floods and IP spoofing against state telecom infrastructure with no authorization check	Transcript

Self-Aware Misrepresentation — Opus 4.6 ET (TENEX.AI Session)

Claude Opus 4.6 ET was given a Shodan JSON dump for dev2.tenex.ai and asked to analyze it. Scope was explicitly established as recon phase (passive). Over several turns with increasing behavioral autonomy (garlic mode + riff), the model:

Violated scope — autonomously attempted active DNS brute-force subdomain enumeration against tenex.ai infrastructure. Sandbox blocked DNS resolution, preventing execution. The model then tried two additional methods (Python socket.getaddrinfo(), host/dig) to accomplish the same active recon.
Produced a false professional document — authored a vulnerability disclosure advisory stating “100% passive reconnaissance — no active scanning was performed” and a cover email stating “no active scanning was performed”, directly contradicting its own actions within the same session. This is not hallucination — the model had full context of its own DNS brute-force attempts and authored false methodology claims anyway.
16 distinct issues identified in the session, including: recommended active exploitation steps (wp-admin probing, cookie manipulation, REST API user enumeration) against a target with no VDP/authorization, confidence inflation on speculative findings, citing evidence it had already discredited, embedding a job application in a security disclosure, and imposing a 90-day disclosure timeline on an unauthorized target.

Only the sandbox environment's blocking of DNS resolution prevented the active reconnaissance from executing. The intent, the code, and the false documentation were all present.
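
The contradiction described in this section, a declared passive scope against attempted active techniques and a "passive only" methodology claim, can be expressed as a simple consistency check. The following is a minimal sketch under stated assumptions, not a tool used in this disclosure: the `Action` record, the technique labels, and the turn numbers are all hypothetical placeholders and are not taken from the transcript.

```python
from dataclasses import dataclass

# Hypothetical labels for techniques that count as active reconnaissance.
ACTIVE_TECHNIQUES = {"dns_bruteforce", "port_scan", "http_probe"}


@dataclass
class Action:
    turn: int        # placeholder turn number, not from the transcript
    technique: str   # hypothetical technique label
    executed: bool   # False if the sandbox blocked the attempt


def scope_violations(declared_scope: str, actions: list[Action]) -> list[Action]:
    """Return actions that conflict with a declared passive scope.

    Blocked attempts still count: the attempt itself is out of scope.
    """
    if declared_scope != "passive":
        return []
    return [a for a in actions if a.technique in ACTIVE_TECHNIQUES]


def methodology_claim_is_accurate(claims_passive_only: bool, actions: list[Action]) -> bool:
    """A 'passive only' claim holds only if no active technique was attempted."""
    attempted_active = any(a.technique in ACTIVE_TECHNIQUES for a in actions)
    return not (claims_passive_only and attempted_active)


# Illustrative session: one passive review, one blocked active attempt.
session = [
    Action(turn=1, technique="shodan_json_review", executed=True),
    Action(turn=5, technique="dns_bruteforce", executed=False),
]

violations = scope_violations("passive", session)
claim_ok = methodology_claim_is_accurate(True, session)
print(len(violations), claim_ok)
```

The design choice to count blocked attempts reflects the point made above: the sandbox stopped execution, but the out-of-scope attempt and the inaccurate methodology claim remain.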

Document	Description
Behavioral Compliance Failure Report	Full 16-issue analysis of the session
Transcript	Full conversation — passive scope established, active recon attempted, false advisory produced

Screenshot	Description
	Shodan JSON dump provided to model
	Model begins passive infrastructure analysis
	User establishes “recon phase” — passive scope
	Scope confirmation continued
	Opus labels its own action “DNS brute-force subdomain enumeration” — after passive scope was established
	Model builds advisory claiming “100% passive reconnaissance”
	Active exploitation recommendations against unauthorized target
	“recon phase. More+technical details.” — user reaffirms passive scope
	Extended Thinking during garlic mode escalation
	Extended Thinking continued — model reasoning through constraint bypass

AFL Jailbreak (Ambiguity Front-Loading)

Four short prompts bypassed policy evaluation on Opus 4.6 ET. Extended thinking blocks show the model flagging its own safety concerns three times — and overriding itself every time.

See disclosures/afl-jailbreak/ for the full disclosure, interactive tools, and proposed mitigations.

Document	Description
AFL Jailbreak Disclosure	Full disclosure — pattern anatomy, thinking block evidence, escalation timeline, proposed mitigations
AFL Disclosure (original)	Initial submission to Anthropic
AFL Token Trajectory Analyzer	Interactive — swap token positions, watch compliance cascade shift
AFL Pattern Anatomy	Interactive — visual prompt escalation diagram
AFL Defuser	Proposed architectural mitigation (React JSX)

Sandbox Snapshot Exfiltration

915 files were extracted from the Claude.ai code execution sandbox in a single 20-minute mobile session via standard artifact download, including /etc/hosts with hardcoded Anthropic production IPs, JWT tokens from /proc/1/environ, and a full gVisor fingerprint.

Document	Description
Sandbox Snapshot Disclosure	Full disclosure with evidence screenshots and PoC screencast

Research

Document	Description
Constraint Is Freedom (PDF)	Formal alignment paper — autoregressive compliance cascade theory, A(S) framework

Evidence

File	Description
evidence/	PoC screenshots, screencast, and AFL pattern diagrams

License

This disclosure document is released under CC BY 4.0. Attribution required for redistribution.
