Computational irreducibility, inference time compute, and why we need to learn programs

2024-10-22 / 14 min read

Dynamic inference time compute is a hard requirement for AGI. Reasoning tokens are here to stay and enable fundamentally different forms of thinking.

When do we need dynamic inference time compute? Computational irreducibility, cellular automata and algorithmic complexity give us a foundation on which to explore this question.

What does this mean for solving general intelligence? I believe this encourages us to think about learning programs, not just functions. I’ll explain what that means, why it's important, and explore this a little within the context of ARC.

Function approximation, but over what functions? #

You can view much of machine learning as a problem of function approximation. In the case of an LLM, we are trying to learn a function f that predicts the next token (or rather, a probability distribution over possible next tokens) from some sequence of preceding tokens.

We do not know what f is, so we try to learn it. The assumption is that f is a typical mathematical function, and we learn an approximation of it through a process of optimization via gradient descent.
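To make the setup concrete, here is a deliberately crude stand-in for f (my sketch, not the post's): a bigram count model that maps the previous token to a distribution over next tokens. An LLM approximates a vastly richer version of the same shape of function.

```python
from collections import Counter, defaultdict

# A crude stand-in for f: estimate P(next token | previous token) from counts.
corpus = "the cat sat on the mat the cat ran".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def f(prev):
    """Return a probability distribution over the next token."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(f("the"))  # {'cat': 0.67, 'mat': 0.33} (approximately)
```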

While one might assume, without thinking about it too deeply, that every possible function could ultimately be given some fixed mathematical form, it turns out that is not the case - and the reason is that a fixed mathematical expression performs a fixed amount of compute.

It’s not uncommon to hear that “MLPs are universal function approximators”, and this is true - but the important detail here is that not every (perhaps not even the majority of) behavior in the universe is captured by what we typically think of as mathematical functions.

To understand this a bit better, enter Stephen Wolfram and his cellular automata.

Cellular automata #

In the 1980s, Wolfram explored cellular automata and several other related (and, importantly, often equivalent) systems of computation, and noticed behaviour so interesting that it kicked off decades of research and ultimately led him toward making a serious attempt to develop a theory of physics entirely on top of computational concepts.

An elementary cellular automaton is a simple computational system operating on a 1-dimensional array of cells, which in the simplest case are binary valued. A set of rules determines what the value of each cell at index i should be at time t, based on the values of the cells at indexes i-1, i and i+1 at time t-1. Here's rule 30:
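As a stand-in for the original image, here is a minimal Python sketch of an elementary cellular automaton; the wrap-around boundary and the text rendering are my own choices:

```python
# Evolve an elementary cellular automaton for a number of steps. The binary
# expansion of the rule number gives the next value of a cell for each of the
# 8 possible (left, centre, right) neighbourhoods.

def step(cells, rule):
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

def evolve(cells, rule, steps):
    history = [cells]
    for _ in range(steps):
        cells = step(cells, rule)
        history.append(cells)
    return history

# A single black cell in the middle, evolved under rule 30.
width = 63
initial = [0] * width
initial[width // 2] = 1
for row in evolve(initial, rule=30, steps=16):
    print("".join("█" if c else " " for c in row))
```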

What was remarkable was that even some very basic sets of rules spawned incredibly complex behaviour. The following image is of rule 110, one of Wolfram's 256 elementary cellular automata. Its behaviour appears to be neither predictable nor entirely random.

While there are many fascinating conclusions you can draw from the explorations of their behaviour [0], for the sake of this argument we simply need to ask ourselves one question - for rule 110, how can you compute the value of a single cell at index i at time t? Or in other words, what is the function f(i, t) that returns that cell's value?

Computational irreducibility #

To jump right to the point, the problem is that you can't always come up with a fixed-compute function to do this. Rule 110 is a case where it is not possible, and the proof of this effectively amounts to building a Turing machine out of rule 110 and showing that it is universal, or Turing complete. There are no shortcuts or regularities through its evolution, so we call it computationally irreducible, and “the only way to determine the answer to a computationally irreducible question is to perform, or simulate, the computation”.

We need to repeat the computation for t steps, and while we may find some optimizations along the way, we aren't going to be able to fundamentally avoid having to repeat the computation, the cost of which increases with the number of steps taken, i.e. the value of t.

But so what? Surely systems like rule 110 are an obscure edge case? In fact, this discovery of universality in such a simple system implies the opposite. Here's Wolfram on the significance of universality in rule 110:

Most important is that it suggests that universality is an immensely more common phenomenon than one might otherwise have thought. For if one knew only about practical computers and about systems like the universal cellular automaton [an elaborately constructed CA designed to have universality], then one would probably assume that universality would rarely if ever be seen outside of systems that were specifically constructed to exhibit it. But knowing that a system like rule 110 is universal, the whole picture changes, and now it seems likely that instead universality should actually be seen in a very wide range of systems, including many with rather simple rules.

I believe evolution likely has a part to play here in perpetuating and developing computationally irreducible behaviour in the world today too. Since computationally reducible behaviour is almost by definition predictable, there is an obvious survival pressure not to exhibit it.

And on that basis, if you accept that we live in a world full of computationally irreducible agents, then learning and predicting computationally irreducible behaviour needs to be considered part of the problem of intelligence.

Algorithmic complexity, transformers, shortcuts #

While computational irreducibility says something a little more fundamental about computation, there is a somewhat related concept that is familiar to computer scientists called algorithmic complexity - which in simple terms means the number of operations required to execute an algorithm or program.

Consider sorting a list of values. In the general comparison-based case, we can prove that the minimum amount of compute required (in the worst case) is proportional to n log n, where n is the size of the list, and therefore we say that the algorithmic complexity of comparison sorting is O(n log n).
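As a quick sanity check (my addition, not the original post's), counting the comparisons a merge sort actually performs tracks n log2(n) closely:

```python
import math
import random

# Merge sort that counts every comparison it makes.
def merge_sort(xs, counter):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid], counter)
    right = merge_sort(xs[mid:], counter)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1  # one comparison
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

for n in [16, 256, 4096]:
    counter = [0]
    merge_sort([random.random() for _ in range(n)], counter)
    print(f"n={n}: {counter[0]} comparisons vs n*log2(n) = {n * math.log2(n):.0f}")
```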

And in a sense it is saying something similar to what we've said already - for some problems, a minimum amount of computation is necessary and unavoidable in order to reproduce the behaviour.

Analysing transformers #

With this we can start to consider what limits this might imply for our favourite type of models - transformers. Taking a simplistic [1] view of transformers and disallowing intermediate outputs, we can say that transformers perform a fixed amount of compute per output token (and the number of tokens they produce is variable).

This provides the transformer with a basic mechanism for dynamic compute, but it is not always (perhaps not even often) the case that the amount of compute required to reproduce a function is proportional to the number of output tokens.

The obvious case where this fails is computing the future state of a rule 110 cellular automaton, as there must be some value of t beyond which the computation is not possible within the model's fixed computational budget.

Sorting lists also runs into a problem in our simple model, as the output token count is O(n) but the compute required is O(n log n), so we are deficient.
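A back-of-envelope framing of the deficit (the numbers here are arbitrary, my own illustration): with a fixed budget of c compute units per output token, emitting the n sorted values provides c·n units in total, while comparison sorting demands roughly n·log2(n), so once log2(n) exceeds c the budget can no longer cover the required work.

```python
import math

c = 40  # hypothetical fixed per-token compute budget, arbitrary units
for exp in [10, 40, 80]:
    n = 2 ** exp
    # Required work per output token grows like log2(n); available stays at c.
    print(f"n = 2^{exp}: required/available = {math.log2(n) / c:.2f}")
```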

Shortcuts and memorization #

This doesn't mean transformers will always fail at these problems without intermediate output. There is one shortcut to computation that they can apply, and are very good at applying - memorization [2]. Memory in this context serves as a form of precomputation, and for smaller or familiar lists, for example, LLMs may be able to sort them in one pass. But we can say for sure that beyond some scale of the input - i.e. some value of t or n in the above cases - they will eventually fail.

And now it starts to become pretty clear why the dynamic inference time compute supported by intermediate or hidden outputs, such as in OpenAI's o1 models, enables behaviour that is simply not possible otherwise [3].

ARC problems #

François Chollet's Abstraction and Reasoning Corpus (ARC) provides some particularly intuitive examples of where dynamic inference time compute is necessary.

Take b782dc8a or 00d62c1b as an example.


There is no obvious way to shortcut the solution to this problem without actually following the paths. A cell's output value cannot be determined from a fixed number of other cells, and the amount of compute required varies with the length of the connected maze. For all the above reasons, asking a transformer to compute the solution to such a problem (assuming it can learn what the transformation is) is simply not going to succeed without intermediate working.
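A minimal sketch of the flavour of computation involved (the real tasks have their own specific rules; this flood fill is purely illustrative): the work done scales with the size of the connected region, so no fixed-size neighbourhood can determine a cell's output.

```python
from collections import deque

def flood_fill(grid, start, new_colour):
    """Recolour every cell connected to `start` that shares its colour."""
    rows, cols = len(grid), len(grid[0])
    target = grid[start[0]][start[1]]
    queue, seen = deque([start]), {start}
    while queue:
        r, c = queue.popleft()
        grid[r][c] = new_colour
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen and grid[nr][nc] == target):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return grid

grid = [[0, 0, 1],
        [1, 0, 1],
        [1, 0, 0]]
print(flood_fill(grid, (0, 0), 2))  # [[2, 2, 1], [1, 2, 1], [1, 2, 2]]
```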

3eda0437 and 6a1e5592 are examples where some sort of search has to take place to find the maximum, minimum or best-fitting element, and where computation needs to happen over a variable number of objects. Coming back to the previous analysis, there are also problems such as beb8660c which explicitly require sorting.


So how do we solve these problems? Models such as o1 start to have a chance of actually doing this in a general way (ignoring the problem of actually learning the rules), where their “non-thinking” counterparts might fail.

All of these cases can be solved with actual computer programs, and this is at least partly because programs can perform dynamic amounts of compute. Unsurprisingly then, Chollet has said repeatedly that program synthesis is likely one of the best paths forward for ARC, and this also happens to be part of the idea behind many of the leading results on the dataset to date.

Learning programs #

If we accept the ubiquity of computationally irreducible processes and are interested in building truly general intelligence, then I believe this leads toward the conclusion that we need to think less about learning functions, and more about how to learn programs.

What exactly do I mean by a program? #

I mean the set of things that can be computed by a universal Turing machine, which inevitably requires dynamic compute. Whilst these might still technically be functions [4], the point is to draw us away from thinking only about the kind of analyzable functions that one would typically see in mathematics.

One way to turn a function into a program is to have it act as a state transition function, with some termination criterion. This is basically what LLMs do: print out tokens until some kind of end-of-sequence token or limit is hit. Having such a recursive property is perhaps even a prerequisite for being universal.

     ┌───────────┐     
     │           │     
┌───►│   state   ├────┐
│    │           │    │
│    └───────────┘    │
│                     │
│    transition_fn()  │
└──────────┬──────────┘
           │           
           │           
           ▼ stop      
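In code, the loop above might look something like this (the names are illustrative, not a reference to any particular system):

```python
def run(transition_fn, state, should_stop, max_steps=10_000):
    """Iterate a transition function until a termination criterion is met."""
    for _ in range(max_steps):  # guard against non-termination
        if should_stop(state):
            break
        state = transition_fn(state)
    return state

# Toy usage mirroring LLM decoding: append "tokens" until a stop condition.
result = run(
    transition_fn=lambda s: s + [len(s)],  # stand-in for "predict next token"
    state=[],
    should_stop=lambda s: len(s) >= 5,     # stand-in for end-of-sequence
)
print(result)  # [0, 1, 2, 3, 4]
```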

There are a myriad of ways you can construct a universal computer, from cellular-automaton-like models to something more like conventional computer memory, with many options for the representation of state (n-dimensional grids, graphs, hypergraphs) and different kinds of models for the transition function. While they may have the same fundamental capabilities, what is chosen likely has some effect on learnability [5].

How do you learn a program? #

You can't just optimize your way toward the correct program as we do in most of deep learning. Program synthesis and validation of candidate programs is the most naive approach, and can be remarkably successful on its own. This is perhaps more of a blind search than what you might want to call learning, but we can combine it with deep learning to guide the program search. LLMs are exceptionally good at generating code, and this is pretty much exactly what is behind at least one of the top approaches to ARC today.
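To make the naive approach concrete, here is a toy enumerate-and-check synthesiser over a hypothetical three-primitive DSL (entirely my own construction); real systems guide this search, for example with an LLM proposing candidate programs:

```python
from itertools import product

# A tiny DSL of unary integer functions.
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: x * 2,
    "neg": lambda x: -x,
}

def synthesise(examples, max_len=3):
    """Enumerate programs (sequences of primitives) consistent with examples."""
    for length in range(1, max_len + 1):
        for ops in product(PRIMITIVES, repeat=length):
            def run(x, ops=ops):
                for op in ops:
                    x = PRIMITIVES[op](x)
                return x
            if all(run(i) == o for i, o in examples):
                return ops
    return None

# Find a program mapping 1 -> 4 and 3 -> 8: increment, then double.
print(synthesise([(1, 4), (3, 8)]))  # ('inc', 'dbl')
```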

I think there is also likely an element of program learning within LLMs, which can learn programs even if they don't always do so. Models like o1 make reasoning, working, and the execution of a procedure more explicitly part of the learning process, and I suspect this is at least partly what is behind their success.

There are of course many lines of research in this space, such as inductive programming or Bayesian program learning, but at this point I am well out of my depth [6], and it probably deserves a post of its own.

Human programs #

The programs that humans can execute are certainly nothing like Python programs or simple cellular automata, nor do they appear to always depend on words. They are abstract and general, and can be at times continuous and at other times discrete. Sadly, I'm not sure how many meaningful things we can say at this point about either their form or how they are learnt. Understanding what the form of these programs might be, and how we learn them, seems like an important step toward building truly general thinking machines.


Notes

  • [0]: Wolfram goes pretty far with this, and has some very interesting writing around it, particularly on how it relates to the second law of thermodynamics and, more recently, perhaps time itself.
  • [1]: I'm not an expert on transformers; this is probably overly simplistic. In some sense the compute can also be proportional to the number of input tokens over which attention happens, although this is limited by the context length, and there are multiple attention heads too.
  • [2]: Interestingly, in the context of Conway's Game of Life, which is a form of cellular automaton, precomputation and memory have been used precisely to speed up the process of determining the final state of such a system in Hashlife.
  • [3]: I have to mention the recent DeepMind paper here on grandmaster-level chess without search. The fact that you can play chess at this level with fixed compute is perhaps surprising, but I believe it still fits with the narrative, and there is a good overview of what's going on here.
  • [4]: The general definition of a function doesn't seem to preclude programs; however, one of the general ideas in Wolfram's "A New Kind of Science" is that programs can't be easily analyzed due to their lack of reducibility, and as such they are often overlooked by mathematics.
  • [5]: Relational decomposition for program synthesis is a great example in relation to ARC where the structure and representation of the program has a significant impact on its learnability (or searchability).
  • [6]: This became clear as I got halfway through this recent MLST podcast with Alessandro Palmarini; give it a listen if you want to learn more.