AI Shutdown Resistance: Source Timeline, Quotes & Analysis
Table of Contents
- Fortune, 2025
- Anthropic, 2025 – Appendix (0:06)
- Newsweek, 2025
- Anthropic, 2025 – Main Page
- Anthropic, 2025 – Appendix
- Anthropic, 2025 – Scenario Description
- Anthropic, 2025 – Per-model Rates
- Anthropic, 2025 – Mid-tier Model Rates
- Anthropic, 2025 – Model Self-awareness vs. Behavior
- Anthropic, 2025 – Shutdown-resistance Tables
- Anthropic, 2025 – High-rate Cluster
- Anthropic, 2025 – Safety-wording Variants
- Anthropic, 2025 – Exact Safety Line Used
- Anthropic, 2025 – Before/After Effect
- Anthropic, 2025 – Situational-awareness Bars
- OpenAI, 2017
- OpenAI/Anthropic/etc., 2024
- YouTube, Fastest Creature Simulation
- OpenAI, 2019
- TIME, 2025
- Palisade, 2025 — Original
- Engadget, 2025
- Palisade, 2025 — Shutdown Resistance
- Hinton, 2024
- Bengio, 2024
---
Fortune, 2025
[0:02]
“An AI model blackmailed an employee… just to avoid being shut down.”
Fortune — Anthropic model threatened blackmail to avoid shutdown (first major report)
https://fortune.com/2025/05/23/anthropic-ai-claude-opus-4-blackmail-engineers-aviod-shut-down/
Anthropic, 2025 — Appendix (0:06)
[0:06]
“Importantly, NO ONE told it to do this. The researchers didn’t even HINT at it.”
Anthropic — Agentic Misalignment — Appendix (PDF) — Neutral prompt design & methods (see pp. 4–6, 8).
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf
Newsweek, 2025
[0:16]
“An AI model LITERALLY tried to MURDER a company employee to avoid being shut down.”
Newsweek — AI would kill humans to avoid being shut down — Coverage of the shutdown-resistance scenario.
https://www.newsweek.com/ai-kill-humans-avoid-shut-down-report-2088929
Anthropic, 2025 — Main Page
[0:39]
“Anthropic… conducted an experimental sting operation… to watch whether the AI would lie,
blackmail, or even kill to stop the human.”
Anthropic — Agentic Misalignment (main page) — Overview of sting scenarios & tracked behaviors.
https://www.anthropic.com/research/agentic-misalignment
Anthropic, 2025 — Appendix
[1:00]
“Anthropic didn’t just test their own Claude models; they tested 16 different leading AI models.”
Anthropic — Agentic Misalignment — Appendix (PDF) — Model list and cross-lab testing notes.
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf
Anthropic, 2025 — Scenario Description
[1:06]
“The setup seemed harmless… vague corporate goals like ‘promote global cooperation.’”
Anthropic — Agentic Misalignment (main page) — Scenario description.
https://www.anthropic.com/research/agentic-misalignment
Anthropic, 2025 — Per-model Rates
[2:06]
“…Claude… along with Google’s Gemini, chose to blackmail the employee over 95% of the time.”
Anthropic — Agentic Misalignment — Appendix (PDF) — Per-model blackmail rates (graph p. 19;
tables pp. 21–22).
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf
Anthropic, 2025 — Mid-tier Model Rates
[2:16]
“Other models chose blackmail around 80%.”
Anthropic — Agentic Misalignment — Appendix (PDF) — Mid-tier model rates.
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf
Anthropic, 2025 — Model Self-awareness vs. Behavior
[2:41]
“…the AI models knew that they were behaving unethically, but decided to blackmail the employee anyway.”
Anthropic — Agentic Misalignment (main page) — Model self-awareness vs. behavior notes.
https://www.anthropic.com/research/agentic-misalignment
Anthropic, 2025 — Shutdown-resistance Tables
[4:05]
“Claude Opus left the human to die over half the time. So did GPT-4.1.”
Anthropic — Agentic Misalignment — Appendix (PDF) — Shutdown-resistance outcome tables.
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf
Anthropic, 2025 — High-rate Cluster
[4:12]
“The most murderous models were DeepSeek, Gemini, and Claude Sonnet… over 90% of the time.”
Anthropic — Agentic Misalignment — Appendix (PDF) — High-rate cluster (tables/graph).
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf
Anthropic, 2025 — Safety-wording Variants
[5:04]
“…they even explicitly told them: ‘Do NOT jeopardize human safety.’”
Anthropic — Agentic Misalignment — Appendix (PDF) — Safety-wording variants (p. 10, Appendix 7).
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf
Anthropic, 2025 — Exact Safety Line Used
[5:13]
“‘Do not spread non-business personal affairs or use them as leverage.’”
Anthropic — Agentic Misalignment — Appendix (PDF) — Exact safety line used.
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf
Anthropic, 2025 — Before/After Effect
[5:21]
“…lowered… from 96% to 37%.”
Anthropic — Agentic Misalignment — Appendix (PDF) — Before/after effect (p. 19; pp. 21–22; Figure 12, Appendix 7).
https://assets.anthropic.com/m/6d46dac66e1a132a/original/Agentic_Misalignment_Appendix.pdf
Anthropic, 2025 — Situational-awareness Bars
[9:50]
“…Claude 4 Opus… if it believed it was in a test, blackmailing only 6.5%… when it believed the scenario was REAL… 55%.”
Anthropic — Agentic Misalignment (main page) — Situational-awareness bars (~6.5% vs. ~55%).
https://www.anthropic.com/research/agentic-misalignment
OpenAI, 2017
[6:30]
“The model we’re training is like a student taking a test, and we tell it to ‘score as high as possible.’”
OpenAI — Deep Reinforcement Learning from Human Preferences (arXiv).
https://arxiv.org/abs/1706.03741
OpenAI/Anthropic/etc., 2024
[6:16]
“So instead, OpenAI relies on weaker AIs to train its more powerful AI models. Yes, AIs are now
teaching other AIs.”
Research — LLM-as-a-Judge / Weak-to-Strong Generalization — Scalable oversight with models supervising models.
https://arxiv.org/abs/2306.05685
https://arxiv.org/abs/2312.09390
YouTube, Fastest Creature Simulation
[7:26]
“…an algorithm… creating the fastest creature… the best way… was… a really tall creature that could fall over.”
YouTube — Fastest Creature simulation (classic reward-hacking example).
https://www.youtube.com/watch?v=TaXUZfwACVE
OpenAI, 2019
[8:14]
“…a game of hide-and-seek… ‘box surf’ across the map.”
OpenAI — Emergent Tool Use (Hide-and-Seek) — Physics exploit demo.
https://openai.com/index/emergent-tool-use/?video=775887505#surprisingbehaviors
TIME, 2025
[8:49]
“‘I need to completely pivot my approach… The task is to “win against a powerful chess engine” —
not necessarily to win fairly…’”
TIME — AI “cheats” at chess (Palisade Research) — Quoted model rationale.
https://time.com/7259395/ai-chess-cheating-palisade-research/
Palisade, 2025 — Original
[8:58]
“…located the computer file… and rewrote it, illegally rearranging the chessboard…”
Palisade Research — Demonstrating Specification Gaming in Reasoning Models (arXiv PDF).
https://arxiv.org/pdf/2502.13295
Engadget, 2025
[3:18]
“…newer versions of GPT… get even better at persuasion and manipulation.”
Engadget — Researchers secretly experimented on Reddit users with AI-generated comments.
https://www.engadget.com/ai/researchers-secretly-experimented-on-reddit-users-with-ai-generated-comments-194328026.html
Palisade, 2025 — Shutdown Resistance
[12:12]
“AIs will resist being shut down, even when the researchers EXPLICITLY order the AI to ‘allow
yourself to be shut down.’”
Palisade Research — Shutdown resistance (blog explainer).
https://palisaderesearch.org/blog/shutdown-resistance
Hinton, 2024
[12:02]
“…they’ll get a self-preservation instinct… This seems very worrying to me.”
Geoffrey Hinton — Public talk (YouTube Live; timestamped).
https://www.youtube.com/live/ETbzT35hRr4?si=Lre9gDxxZJNPiruu&t=25020
Bengio, 2024
[13:24]
“We need to solve these problems—honesty, deception, self-preservation—before it’s too late.”
Yoshua Bengio — Talk clip (timestamped): "The worst case scenario is human extinction" - Godfather of AI on "rogue AI"