AI Safety Is Becoming an Engineering Discipline
AI safety is transitioning from philosophical speculation to a field where concrete technical approaches (interpretability, sandboxing, watermarking, formal verification) can be tested, measured, and iteratively improved.
"I feel like it's now become possible to make technical progress in AI safety that the whole scientific community, or at least the whole AI community, can clearly recognize as progress." Scott Aaronson, AI Safety Lecture for UT EA
For years, AI safety discourse was stuck in a frustrating loop. AI ethics researchers worried about bias, fairness, and corporate misuse in today's systems. AI alignment researchers worried about future superintelligent systems optimizing misspecified objectives. The two communities, as Aaronson notes, "despise each other," and neither had many empirical results to point to. Progress required either experiments or rigorous mathematical theory, and for most of AI safety's history, neither was available.
What changed is that AI systems became powerful enough to study empirically. Aaronson identifies at least eight concrete technical approaches: corrigibility (making AIs accept human correction), sandboxing (running AIs in simulated environments to study behavior before deployment), interpretability (looking inside neural networks to detect deception), using AIs to supervise other AIs, formal verification, mathematical formulation of human values, learning human values from data, and watermarking AI outputs. The Burns et al. work on discovering "truth directions" inside neural networks (essentially a lie detector for language models) is a concrete example of safety progress that is recognizably scientific.
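To make the "truth direction" idea concrete, here is a minimal sketch of a linear probe on hidden states. Two caveats: Burns et al.'s actual method (CCS) is unsupervised, so the supervised logistic probe below is only the simplest stand-in for the same geometric idea, and the activations and labels are random placeholders rather than real model data.

```python
import numpy as np

# Sketch of the "truth direction" idea as a supervised linear probe.
# The claim being illustrated: whether a model "believes" a statement is
# approximately encoded along one direction w in activation space.
# All data below is a random placeholder, not real model activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64))       # hidden-state activations, one row per statement
labels = rng.integers(0, 2, size=200)   # 1 = statement is true, 0 = false

# Fit logistic regression by gradient descent; the learned weight vector w
# is the candidate "truth direction".
w = np.zeros(acts.shape[1])
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))   # predicted P(true)
    w -= lr * (acts.T @ (p - labels)) / len(labels)
    b -= lr * float(np.mean(p - labels))

def truth_score(activation: np.ndarray) -> float:
    """Project a new statement's activation onto w: a crude lie detector.
    High scores mean the probe reads the statement as 'believed true'."""
    return float(activation @ w + b)

print(truth_score(acts[0]))
```

On placeholder data the probe learns noise, of course; the point is only that detecting a truth direction reduces to fitting and evaluating a vector, which is an experiment you can run and score.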
Enterprise adoption data reinforces this shift. The a16z survey found that enterprises push internal use cases to production at far higher rates than customer-facing ones, specifically because hallucination and safety concerns are engineering problems they have not yet solved. The practical approach (human-in-the-loop for sensitive applications, internal-only deployment for low-risk ones) is itself an engineering pattern, not a philosophical stance. Safety is becoming something you test, measure, and iterate on, just like any other system property.
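Stated as code, that pattern is just a routing rule. This is a minimal sketch; every tier and function name is hypothetical (the a16z survey describes the practice, not this interface):

```python
from enum import Enum, auto

class Risk(Enum):
    LOW = auto()        # internal, low-stakes use case
    SENSITIVE = auto()  # customer-facing or otherwise high-stakes

def deliver_internally(text: str) -> str:
    # Stand-in for posting to an internal tool.
    return f"[shipped internally] {text}"

def queue_for_human_review(text: str) -> str:
    # Stand-in for a review queue; a human approves before release.
    return f"[held for review] {text}"

def route_output(text: str, risk: Risk) -> str:
    """Human-in-the-loop for sensitive output, direct delivery for low-risk."""
    if risk is Risk.LOW:
        return deliver_internally(text)
    return queue_for_human_review(text)

print(route_output("Q3 meeting summary", Risk.LOW))
print(route_output("Reply to a customer complaint", Risk.SENSITIVE))
```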
Takeaway: The most important shift in AI safety is not a new argument but a new method. We can now run experiments, build tools, and measure progress, which means safety is finally subject to engineering discipline rather than just debate.
See also: Compound AI Systems Beat Monolithic Models | The Precautionary Principle for Irreversible Risks | Scaling Laws Open New Dimensions When Old Ones Stall