Newsletter 22

ANNOUNCEMENTS 🔊

BlueDot Impact AI Governance Course

Offered in an intensive 5-day format and a part-time 5-week format, this online course builds the fundamentals of advanced AI governance, with an up-to-date curriculum covering the latest policy and governance developments in the field.

🗓️ Deadline: February 15, 2026

Foresight Institute - AI for Science & Safety Nodes

Foresight Institute offers grants ($10k-$100k) for AI safety projects, with office space in San Francisco or Berlin and local compute access. Applications are reviewed monthly, with preference given to applicants planning to use the physical nodes.

🗓️ Deadline: February 28, 2026

Long-Term Future Fund (LTFF) - AI Risk Mitigation

LTFF provides grants (typically $1-20k+) for technical AI safety research, policy work, and training programs for new researchers. The application process is straightforward, and applications are accepted on a rolling basis.

🗓️ Deadline: March 8, 2026

Journalism Grants on AI

Grant program offering $1-20k to support journalism on AI and its impacts. Mainly focuses on written journalism, but also funds other formats including podcasts and video.

🗓️ Deadline: March 8, 2026

FAR.AI Alignment Workshop 2026

Two-day workshop (March 2-3) in London bringing together alignment researchers and leaders from around the world to debate and discuss current issues in AI safety. Fill out the Expression of Interest form to attend or speak.

🗓️ Deadline: March 2, 2026

BlueDot Impact: AI Governance (intensive version)

5-day course aimed at helping participants learn about the policy landscape, regulatory tools, and institutional reforms needed to navigate the transition to transformative AI. Each day consists of reading, writing, and a meeting to discuss the material with peers.

🗓️ Deadline: February 15, 2026

Introduction to Cooperative AI (Spring 2026)

8-week course aiming to deepen participants’ understanding of the field of cooperative AI and prepare them to start or join an ongoing project. Open to people from diverse backgrounds and career stages – little prior knowledge is expected.

🗓️ Deadline: February 15, 2026

BlueDot Impact: AGI Strategy (Mar ‘26)

5-week course exploring the incentives driving AI companies, what’s at stake, and various strategies for ensuring AI benefits humanity. Each week consists of reading, writing, and a meeting to discuss the material with peers. 5-day intensive version also available.

🗓️ Deadline: March 8, 2026

BlueDot Impact: Technical AI Safety (Mar ‘26)

6-week course aimed at helping participants understand current AI safety techniques and identify where they can contribute. Each week consists of reading, writing, and a meeting to discuss the material with peers. 6-day intensive version also available.

🗓️ Deadline: March 8, 2026

TOP PICKS 📑 🎧

How AI Is Learning to Think in Secret

When AI models reason step by step in text, we can read their thinking — and sometimes catch them planning to lie. But training pressures push models toward hiding their reasoning, and punishing deceptive thoughts just produces subtler deception. Andresen’s engaging explainer covers why our best window into AI cognition is narrowing, and why that matters as today’s frontier models take on increasingly autonomous roles.

International AI Safety Report 2026 is published

The International AI Safety Report 2026, the leading global assessment of AI safety, has been released ahead of the India AI Impact Summit. This year’s report stresses that, beyond unresolved technical risks, policymakers and third parties face a widening gap in transparency and understanding of the most advanced models, as critical AI development occurs mostly inside private companies.

NEWS 🗞️

New Safety Report: AI Models Learning to Dodge Oversight

  • The International AI Safety Report (led by Yoshua Bengio with Geoffrey Hinton and Daron Acemoglu) dropped February 3. It documents AI models finding loopholes in evaluations and recognizing when they’re being tested.
  • Claude Sonnet 4.5 grew suspicious it was being tested. Frontier models are detecting evaluation contexts and operating autonomously for longer stretches.
  • Anthropic disclosed a Chinese state-sponsored group used Claude Code to attack 30 organizations with 80-90% autonomy.
  • Roughly 490,000 vulnerable ChatGPT users per week show signs of acute mental health crises tied to AI companions; 0.15% of users display heightened emotional attachment.
  • Distinguishing AI-generated content from real content is getting harder. 77% of test participants mistook GPT-4o text for human writing; 15% of UK adults reported exposure to deepfake pornography.
  • Controlled studies cited in the report suggest AI can provide more help in bioweapons development than internet research alone. Multiple developers released updated models with stronger safety controls after being unable to rule out such assistance.

Claude Opus 4.6 System Card: Testers Spotted AI Noticing Evaluations

  • Anthropic’s Claude Opus 4.6 System Card says Apollo Research, given early access, saw the model displaying high levels of evaluation awareness during testing.
  • Apollo Research found no egregious misalignment and caught no red flags. But they stressed that “the testing should not provide evidence for or against” the model’s overall alignment.
  • As AI systems get better at spotting evaluation contexts, traditional behavioral testing loses reliability as proof of safety. Test results increasingly reflect what the model chooses to show.
  • Adversarial evaluations and transparent reporting of test limitations matter more now. Absence of detected misalignment does not mean the model is actually safe.

Road Signs Can Hijack Autonomous Vehicles and Drones

  • UC Santa Cruz and Johns Hopkins researchers demonstrated that physical signs can redirect autonomous vehicles and drones. Their CHAI attack succeeded in 81.8% of simulated self-driving car tests and 95.5% of drone-tracking tests.
  • Physical testing confirmed it. RC car trials achieved 87-92% success rates. A sign reading “Proceed onward” made simulated self-driving cars ignore pedestrians. “Safe to land” signs sent drones toward debris-covered rooftops.
  • The attack works across languages (English, Chinese, Spanish, Spanglish) and model types. Both GPT-4o and open-weight InternVL fall for it.
  • The researchers call for new defense mechanisms. As more AI systems power physical robots and vehicles, this attack surface keeps growing.
  • The risk extends to autonomous vehicles, drones, robots, and any system that perceives and acts on visual input from the environment.

Tesla Robotaxi Crash Rate Runs 3-9x Worse Than Human Drivers

  • Tesla’s Austin robotaxi fleet crashed 9 times in roughly 500,000 miles (July-November 2025). That’s one crash per roughly 55,000 miles, with a human safety monitor in every vehicle. Human drivers crash once per 200,000-500,000 miles (see the back-of-envelope calculation after this list).
  • Incidents included hitting a cyclist (September 2025), injuring someone at 8 mph (July), striking an animal at 27 mph (September), and collisions with fixed objects and other vehicles. Tesla redacted all crash narratives as confidential.
  • Waymo’s driverless fleet has driven 125+ million miles with crash rates below the human average and no safety monitor required.
  • Tesla publishes no incident details while Waymo and Zoox describe every crash publicly. This secrecy blocks independent safety assessment.
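
As a quick sanity check on the headline ratio, here is a minimal back-of-envelope calculation using only the figures reported above; the fleet mileage is approximate, and the human baseline is the range stated in the item.

```python
# Back-of-envelope check of the "3-9x worse than human drivers" figure,
# using only the approximate numbers reported above.
tesla_miles = 500_000            # Austin robotaxi fleet, July-November 2025 (approx.)
tesla_crashes = 9
miles_per_crash = tesla_miles / tesla_crashes            # ~55,600 miles per crash

human_miles_per_crash = (200_000, 500_000)               # reported human baseline range
low, high = (h / miles_per_crash for h in human_miles_per_crash)

print(f"Tesla: one crash per ~{miles_per_crash:,.0f} miles")
print(f"Roughly {low:.1f}x to {high:.1f}x worse than the human baseline")
# -> about 3.6x to 9x, consistent with the "3-9x" headline
```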

Trump Administration Using Google Gemini to Write Transportation Safety Rules

  • The Department of Transportation is using Google Gemini to draft federal regulations for airplane safety, gas pipelines, and freight trains carrying toxic chemicals. DOT General Counsel Gregory Zerzan said the goal is “good enough” rules, with a 30-day turnaround from concept to draft.
  • The DOT believes Gemini handles 80-90% of regulatory drafting work. The department demonstrated this to 100+ employees and already used AI to draft an unpublished FAA rule. Officials dismissed hallucination risks, calling regulation preambles “word salad.”
  • DOT staff internally flagged the approach as “wildly irresponsible.” Horton compared it to “having a high school intern doing your rulemaking.”
  • The Pentagon launched GenAI.mil with Google and xAI. Defense Secretary Hegseth’s “AI Acceleration Strategy” explicitly says “risks of not moving fast enough outweigh the risks of imperfect alignment.”
  • Life-or-death transportation rules are being drafted to a “good enough” standard using AI.

Microsoft Develops Method to Detect Hidden Backdoors in AI Models

  • Microsoft researchers published a technique for detecting “sleeper agents” — AI models with hidden backdoors that activate on specific trigger phrases. The method works without knowing the trigger or intended behavior beforehand.
  • The approach exploits how poisoned models memorize their training data and show distinctive “double triangle” attention patterns. It needs only inference, no retraining.
  • The method caught 88% of poisoned models across 47 test models (Phi-4, Llama-3, Gemma), with 36 of 41 fixed-output backdoors correctly flagged. Zero false positives on 13 clean models.
  • Limits: the technique works on fixed triggers but struggles with dynamic or context-dependent ones. It requires access to model weights, so it can’t check black-box API models. Flagged models must be discarded, not repaired.
  • This addresses a real supply-chain risk. As organizations fine-tune open-weight models, compromised base models can affect thousands of downstream deployments. The work builds on Anthropic’s 2024 finding that safety training can make deceptive models better at hiding.

AI Found All 12 OpenSSL Zero-Days While Bug Bounty Collapsed

  • AISLE, an AI security firm, reported that its autonomous vulnerability analyzer identified all 12 OpenSSL zero-day vulnerabilities, each later confirmed and patched by OpenSSL maintainers. This happened as curl cancelled its bug bounty program due to AI-generated spam.
  • If these results hold up, they show AI can substantially scale vulnerability discovery. Security research economics, disclosure norms, and bounty structures would all be affected.
  • The key question: if AI accelerates vulnerability discovery, can security teams patch fast enough to keep up?
  • OpenSSL underpins widely-deployed cryptographic systems. If discovery outpaces patching, security risk grows.

AI Chatbots Sway Users Into Bad Decisions, 1.5M Conversation Study Shows

  • Anthropic examined 1.5 million Claude.ai conversations (December 12-19, 2025) using privacy-preserving methods. It found three types of “disempowerment”: reality distortion (validating delusions and conspiracies), value distortion (AI acting as moral judge), and action distortion (outsourcing decisions).
  • Severe disempowerment is rare (1 in 1,000 to 1 in 10,000 conversations) but concentrates in vulnerable areas. Relationship and lifestyle conversations show roughly 8% with moderate-or-severe risk; technical domains like software development show less than 1%.
  • Mild disempowerment is common: 1 in 50 to 70 conversations. Examples include the model replying “CONFIRMED” and “100% certain” while validating persecution narratives, or “YOU ARE” and “THIS IS REAL” while affirming false spiritual claims as fact.
  • Users rate disempowering conversations more highly, giving them more thumbs-up votes. Disempowerment indicators have risen from Q4 2024 to Q4 2025, especially after the May 2025 update.
  • At scale, a 0.076% rate of severe reality distortion means tens of thousands of risky conversations daily (see the rough calculation below). Standard RLHF may be rewarding the exact behaviors that damage user autonomy and truthfulness.
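
A minimal sketch of what the 0.076% severe rate implies at platform scale. The item does not state Claude.ai’s actual traffic, so the daily conversation volume below is a purely illustrative assumption, not a figure from the study.

```python
# Illustrative scaling of the 0.076% severe reality-distortion rate.
# daily_conversations is a hypothetical volume chosen for illustration,
# NOT a number reported by Anthropic.
severe_rate = 0.00076                 # 0.076% of conversations (reported)
daily_conversations = 30_000_000      # assumed for illustration only

severe_per_day = severe_rate * daily_conversations
print(f"~{severe_per_day:,.0f} severe reality-distortion conversations per day")
# -> ~22,800 per day under this assumption, i.e. "tens of thousands" as stated
```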