
We aim to discover and understand the geometric and logical structures underlying the computations that are learned by large neural networks. Our approach to discovery is based on observing phase transitions in learning machines, theoretically grounded in Singular Learning Theory (SLT). We focus on phase transitions because there is evidence that phase transitions

The divergences associated with these phase transitions imply the existence of measurable signals, which contain information about the aforementioned geometric and logical structures. The problem is to locate these transitions, learn how to probe them, learn how to infer the underlying structures from those probes, and then synthesise this into an understanding of the computational content of the final trained network. In this way we hope to develop a mathematical foundation for scalable interpretability.

The program is outlined in the following seminar (with notes)

Interpretability via Universality

We describe an approach to scalable interpretability of neural networks based on SLT and the

This hypothesis has been articulated, for example, here. If (a) this hypothesis holds for a sufficiently broad class of the computations carried out by a network, (b) we have tools that allow us to discover approximations to those representations and algorithms, and (c) those tools can be run at industrial scale, then interpretability could contribute to aligning advanced AI systems.

We do not expect that our approach will work unless the problem can be effectively subdivided. Thus we further assume the

Typically the overall training loss does not undergo phase transitions for large models, so this hypothesis has to be understood quite carefully: see the section on the Locality Hypothesis below.

Under these hypotheses, the plan has two parts:

These ideas are grounded in Singular Learning Theory (SLT), a theory of universal behaviour of learning machines based on algebraic geometry and statistics, and Conformal Field Theory (CFT), a theory of universality classes of physical systems, their “representations” and the RG flows between them. Both fields have a deep relation to singularity theory: in SLT because singularities in the KL divergence cause divergences in the density of states, which determine key quantities in Bayesian learning, and in CFT through the classification of universality classes (this is the subject of the LG/CFT correspondence).
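To make the SLT half of this statement concrete, the standard asymptotics from Watanabe's theory can be written down as follows. The symbols are not defined in the text above and are introduced here for illustration: K(w) is the KL divergence from the true distribution to the model at parameter w, \varphi(w) is the prior, \lambda is the learning coefficient (real log canonical threshold) of the singularity, m is its multiplicity, L_n is the empirical negative log likelihood on n samples, and w_0 is an optimal parameter. The formulas hold under the regularity conditions of that theory and are only a summary of it.

```latex
\begin{align}
% Density of states of K with respect to the prior, and its divergence as t -> 0
v(t) &= \int \delta\bigl(t - K(w)\bigr)\,\varphi(w)\,dw
      \;\sim\; c\,t^{\lambda-1}(-\log t)^{m-1} \quad (t \to 0^{+}), \\
% Bayesian free energy: the singularity enters through \lambda and m
F_n &= n L_n(w_0) + \lambda \log n - (m-1)\log\log n + O_p(1).
\end{align}
```

The exponent \lambda and multiplicity m that control the small-t behaviour of v(t) are the same quantities that appear in the free energy, which is the sense in which the singularity determines key quantities in Bayesian learning.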

The plan is scoped on the order of ~10 years and presumes:

For an outline of how mechanistic interpretability impacts alignment, see here.


Spectroscopy of Singularities

In such models, knowledge to be discovered from examples corresponds to a singularity – S. Watanabe

The learning behaviour of singular models is dominated by the geometry of singularities. Moreover, the structure of the phases and phase transitions encountered during training is dominated by singularities of level sets of the loss function. We can build scalable devices that provide useful signatures of these singularities, and thus of the “knowledge” they contain.

The motivating example of this working in practice is solid state physics. Divergences in the density of states are responsible for many electrical properties of materials, and detecting these divergences by indirect probes (which look for example at differential conductance) is a key experimental technique. There is no principled reason why analogous devices cannot be constructed for large-scale learning machines. Let us call them spectroscopic probes.

We assume that during training a series of points of interest w_0, w_1, ... is identified, and that these checkpoints are passed to a separate analytics system which uses the spectroscopic probe to analyse the neighbourhood of each checkpoint. How are these checkpoints selected? The general idea is to track a wide range of metrics across training and watch them for phase transitions. These transitions contribute checkpoints.
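As a rough illustration of the checkpoint selection described above, the following sketch flags training steps at which any tracked metric changes abruptly. The text does not specify a detection method, so everything here is an assumption: the function name `select_checkpoints`, the use of step-to-step differences, and the median-absolute-deviation threshold are hypothetical choices standing in for whatever transition detector is actually used.

```python
from statistics import median


def select_checkpoints(metrics, threshold=5.0, min_scale=1e-6):
    """Return sorted training steps at which any tracked metric jumps abruptly.

    metrics: dict mapping a metric name to a list of values, one per logged step.
    threshold: how many (robust) scale units a step-to-step change must exceed.
    min_scale: floor on the scale estimate, so near-constant metrics do not
        flag floating-point noise. Both defaults are illustrative assumptions.
    """
    flagged = set()
    for name, values in metrics.items():
        # Step-to-step changes of this metric.
        diffs = [b - a for a, b in zip(values, values[1:])]
        if not diffs:
            continue
        centre = median(diffs)
        # Median absolute deviation: a scale estimate robust to the jumps themselves.
        mad = median(abs(d - centre) for d in diffs)
        scale = max(mad, min_scale)
        for step, d in enumerate(diffs, start=1):
            if abs(d - centre) / scale > threshold:
                flagged.add(step)  # step indexes the value *after* the jump
    return sorted(flagged)


# Toy example: one subloss drops sharply between steps 5 and 6, another is flat.
tracked = {
    "subloss_a": [1.00, 0.99, 0.98, 0.97, 0.96, 0.95, 0.40, 0.39, 0.38, 0.37],
    "subloss_b": [0.5] * 10,
}
print(select_checkpoints(tracked))  # [6]
```

A crude heuristic like this only proposes candidate checkpoints; deciding whether a flagged step is a genuine phase transition is the job of the spectroscopic analysis.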

Work to be done:


Substructure and Semantics

Interpretability is about the relationship between knowledge in singularities and knowledge humans have of the world. To produce a Rosetta stone which allows us to translate between these two languages, the spectroscopic analysis of the checkpoints is compared with semantic development in the network, as measured by an independent set of probes of performance on tests separate from the primary training objective (or sublosses). We conjecture that there is a close relationship between the structure of the semantic development (e.g. logical structure) and the substructure of the singularities, as reflected in the spectroscopic analysis. Tracking this relationship in detail across all checkpoints is the “Rosetta stone”.

This substructure is well understood mathematically, and already plays some role in solid state physics (although it is perhaps not fully developed even there). While singularities are points, they nonetheless have “internal” structure (one might say “subatomic” structure) which can be seen in various equivalent ways:

The relation among these three classes of objects is complex, and a full understanding is not critical to designing substructural spectroscopic probes. However, the theory does suggest that stochastic processes near a singularity are probing the jet scheme, and therefore that quantities involved in scaling (i.e. “universal” quantities) should be sensitive to this substructure.

Remark: While it is a fact that knowledge to be discovered from data corresponds to a singularity of the KL divergence, or loss function, and the geometry of that singularity governs the learning process, it is not yet clear mathematically that substructure of the singularity corresponds to structure of that knowledge in any human-interpretable way. This is an open research problem.

Work to be done:


Programs as Singularities

For interpretability to succeed you need the right model for the computations within a network. Circuits or TMs or lambda terms are the wrong model; a new view of programs may be required. We understand how to represent traditional programs (e.g. Turing machines) as singularities in learning machines (see here). That means that we can test the above ideas in a situation where we have the ground truth, by attempting to reverse-engineer the structure of programs from the structure of their singularities.

The Locality Hypothesis

A Scanning Tunnelling Microscope (STM) is sensitive to divergences in the density of states scaled by the wavefunction at a position in space. This localisation in space plays a key role in how the tool is used to interpret the behaviour of systems in solid state physics, because it exponentially reduces the number of degrees of freedom to which the probe is sensitive (some would say that’s more or less what space is).

The most serious obstacle to building spectroscopic probes of singularities in large neural networks is that it is prohibitively difficult to estimate the Bayesian posterior near a singular point in a high-dimensional model. We need localisation procedures for reducing the number of degrees of freedom we need to care about (e.g. by freezing all but a small number of directions and doing MCMC or variational inference in the others). There are three key sources of localisation:

Note that when we speak of phases, we mean with respect to the sublosses discussed above: firstly because the overall loss does not undergo (discrete) phase transitions in large neural network training, and secondly because we only expect strong localisation for phase transitions in the sublosses.
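The idea of freezing all but a small number of directions and running MCMC in the rest can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the procedure the program proposes: the free directions are taken to be coordinate axes, the sampler is plain random-walk Metropolis, the target is the density proportional to exp(-n_beta * loss(w)), and the names `localised_metropolis`, `free_idx`, `n_beta` and `toy_loss` are all hypothetical.

```python
import math
import random


def localised_metropolis(loss, w_star, free_idx, n_beta,
                         steps=2000, step_size=0.1, seed=0):
    """Random-walk Metropolis over the coordinates in free_idx only.

    All other coordinates stay frozen at their values in w_star, so the chain
    explores the slice through the checkpoint spanned by the free coordinates.
    Returns the list of visited parameter vectors.
    """
    rng = random.Random(seed)
    w = list(w_star)
    current = loss(w)
    samples = []
    for _ in range(steps):
        proposal = list(w)
        for i in free_idx:
            proposal[i] += rng.gauss(0.0, step_size)
        proposed = loss(proposal)
        # Accept with probability min(1, exp(-n_beta * (proposed - current))).
        log_accept = -n_beta * (proposed - current)
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            w, current = proposal, proposed
        samples.append(list(w))
    return samples


def toy_loss(w):
    # Degenerate in the first two coordinates: the zero set of (w0 * w1)^2
    # is the pair of axes, which is singular at the origin.
    return (w[0] * w[1]) ** 2 + w[2] ** 2


# Sample in coordinates 0 and 1 only; coordinate 2 is frozen at 0.3.
samples = localised_metropolis(toy_loss, [0.0, 0.0, 0.3], free_idx=[0, 1], n_beta=100.0)
```

The cost of each step is one loss evaluation regardless of the total number of parameters, while the dimension of the space being explored is just `len(free_idx)`. Choosing which directions to leave free is the substantive question, and the sketch does not address it.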

Why should mathematicians work on Alignment?

Take it from Demis Hassabis, CEO of DeepMind:

I always imagine that as we got closer to the sort of gray zone that you were talking about earlier, the best thing to do might be to pause the pushing of the performance of these systems so that you can analyze down to minute detail exactly and maybe even prove things mathematically about the system so that you know the limits and otherwise of the systems that you’re building. At that point I think all the world’s greatest minds should probably be thinking about this problem. So that was what I would be advocating to you know the Terence Tao’s of this world, the best mathematicians. Actually I’ve even talked to him about this—I know you’re working on the Riemann hypothesis or something which is the best thing in mathematics but actually this is more pressing.

Our ability to make lots of useful observations depends on measurement tools, or lenses, that make visible things which are invisible, either by overcoming the physical limitations of our sense organs or our cognitive limitations to interpret raw data. This can be a major bottleneck to scientific progress, a prototypical example being the invention of the microscope, which was a turning point for our ability to study the natural world. The lenses that currently exist for interpretability are still quite crude, and expanding the current suite of tools, as well as building places to explore and visualize neural networks using those tools, seems critical for making lots of high bit observations – From “Searching for search