A second draft AGI plan for 2025

2025-03-17 / 16 min read

Disclaimer: I wrote this primarily for myself. Why? Getting thoughts down on paper publicly forces me to go deeper, challenge myself where I'm missing knowledge, and gain clarity in my ideas. It also serves as a reference, and as something I can share to show people where I am and what I'm thinking about. This probably isn’t a fun read, but it’s important writing, at least for me.

This is the update I promised myself to my First draft AGI career thesis from last year. Looking back, it already feels fairly naive, and a lot has changed since then, so here’s an update and a new plan for 2025.

TL;DR of my checkpoint a year ago. Some of these are still true, some are not:

  • I wanted to dedicate the remainder of my career to contributing toward the development of AGI
  • My definition of AGI notably included the ability to learn new skills as sample efficiently as humans
  • Interest in the potential of NeuroAI and the importance of leveraging insight from neuroscience to drive progress
  • Intelligence is fundamentally a robotics problem, and we will need great simulations to succeed here
  • I wasn’t that interested in LLMs
  • Intelligence likely requires methods beyond just gradient descent
  • Meta-learning approaches will likely accelerate progress towards AGI
  • Continual learning is a very important and hard problem that we still need to solve

The plan in one word: ARC #

I had not read the “On the Measure of Intelligence” paper last year, but did shortly after my first draft, and it has massively shaped my thinking since. I’d managed to home in on at least one of its ideas in my own definition of AGI, although perhaps wrongly attributing the difficulty to the “Generalization” component instead of the “Intelligence” component. Here was my AGI definition from last year:

A system that is capable of reliably performing at the 99th percentile of skilled adult human across a broad range of tasks including real world physical robotics problems and learning new cognitive and physical skills as sample efficiently as humans.

Skill acquisition efficiency felt like a very important component and I wanted to capture that, and this is what ARC tries hard to get at. When the ARC problem had a big push and a new prize last year, my attention was drawn significantly to it. As someone wanting to get into research and start trying to contribute, ARC had many desirable qualities:

  • It’s a well defined, tangible benchmark
  • There is a community of people working on it
  • It would not necessarily require vast compute budgets to solve, given the limits set by the competition
  • It draws out a research challenge that appears to me to be critical to solve on the path toward AGI

I even asked Francois for validation, and unsurprisingly got it - working on ARC was, and still feels like, one of the highest leverage ways an individual could hope to have impact towards the broad scientific goal of AGI. This is now roughly my plan for 2025. A few more details and ideas below, and another post to follow.

LLMs as an interpolative database of programs #

I was fairly unexcited about LLMs last year. My lack of interest stemmed from the fact that LLMs “don’t do things. They respond to things, but they don’t have agency, goals or intentions, they don’t plan and they don’t have memory”. I’m certainly letting go of that a little. A year ago, agentic systems were barely seeing the light of day. Devin was announced the same day I published my first post; it was one of the first real agentic systems built on top of LLMs that didn’t just seem like a toy, but something real that we will all have in our hands in the not too distant future.

I’m still fairly bearish on any LLM led ASI fast take-off scenario. This deserves more words, and for the time being I’ll refer to the reader here. But, there were some deeper realizations about what LLMs actually are that have had the biggest impact on how I see them.

My hot take on LLMs a year ago was that they were just big interpolative databases of information. But through some of the conversation around ARC and LLMs, I started to see them more as databases of programs. This really changes the story, because LLMs can execute programs: with enough thinking tokens they can perform any computational task (for a deeper dive on this, see my recent essay - Computational irreducibility, inference time compute, and why we need to learn programs). Learning programs is also a powerful part of generalization, and I believe part of the success of O1/R1 and the RL that I’ll get to later.

In his interview with Dwarkesh, Gwern made the statement that intelligence is “search over Turing machines”. It’s a beautiful way to put it, and I increasingly agree with this statement. LLMs might not be a particularly efficient form of Turing machine, but they can compute.

O1/O3/R1 #

O3 flipped the table with its impressive result on ARC, even if it did cost hundreds of thousands of dollars. I have not fully “updated my priors” (bleurgh) from this result, but from the tone of even the “skeptics” at release time, I think you have to take this extremely seriously.

Just in June, the narrative was still that solving ARC-AGI would be extremely hard. This has totally flipped on its head in just a few months. Even those bullish about rumors of Q* and other reasoning approaches would not have expected this level of success.

O3 changed my plan for ARC. My takeaway is that it is no longer a question of how we solve ARC, as we effectively have proof of a solution (until ARC-V2 at least). Emulating and dramatically optimizing whatever OpenAI cracked for O3 would be a completely valid and viable approach for anyone pursuing ARC in 2025, and it seems that several groups are already planning to do this, as Mike Knoop mentions at the end of his recent R1 analysis.

In the time between starting and finishing this draft, I’ve been able to update my loose summary of how O1 might work: it now seems fairly clear that the open-source community broadly knows how O1 works and will be replicating it en masse over the next 6 months. DeepSeek dropped a paper spilling most of the beans, and beyond that we even have published code [1] and [2] demonstrating replication of O1/R1-like reinforcement learning approaches with some success.

I think this is all going to move extremely quickly, here are some thoughts on how this plays out:

  • The reinforcement learning successes of O1/R1 will be swept up by the open-source community, and I expect to see at least a handful of O1 level “thinking” models open-sourced (update: this is already happening).
  • I am expecting to see the same patterns of efficiency improvements and reductions in parameter counts for the same performance (e.g. Llama 3 70B matching previous 405B performance), and I’m expecting O1/DeepSeek-R1 performance in a model of 70B parameters or less, making it just about runnable on consumer hardware.

O1/O3/R1 plays into two things I have been thinking about for a while: 1) programs, and the fact that inference time compute is necessary, and 2) search, and system 2 thinking. Nothing new here, just trying to piece it all together.

Neuro AI side-lined #

To accelerate progress in AI and realize its vast potential, we must invest in fundamental research in “NeuroAI” … based on the premise that a better understanding of neural computation will reveal fundamental ingredients of intelligence and catalyze the next revolution in AI

It pains me to say it, but this is on hold for now. I did a fair amount of reading here, and some writing too. But my confidence that meaningful insight from neuroscience will influence our progress in AI/AGI over the next few years is low. We are on a track now, and progress is phenomenal. Perhaps one day we will need to go back to the drawing board, but for the time being, my guess is that innovation will be driven primarily by human creativity (even if that just means throwing things against the wall) more than by a deeper understanding of the principles of computation in the brain.

Continual learning & memory #

Continual learning still feels like a fundamental but very hard problem. The economic advantage of being able to continually train, expand, or grow models is clear: it could massively reduce the cost of incremental model improvements, making it easier to stay competitive at the frontier of performance. The fact that it hasn’t been done yet makes it clear just how hard this is. One could probably dedicate an entire lifetime to this research, and I dare not put my foot in it yet.

But perhaps in the nearer term, imbuing LLMs or similar systems with more general memory capabilities can offer a step change in system capabilities. LLM context windows are getting pretty long, but there is always a limit, and a cost, and there are limits to how much you can learn purely in context. From a product perspective, memory and context seem to be the most important way to build any form of moat, so there are good commercial reasons to pursue this.

Whilst not directly addressing continual learning, there are alternatives to attention, such as xLSTM, that address its quadratic complexity and have the potential to massively increase context lengths, which could unlock new abilities in the near term. I’m excited to follow this and similar research.
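
To make the complexity point concrete, here is a toy back-of-envelope cost model - an illustration of the scaling argument, not xLSTM itself: full attention revisits every previous token at each step, while a recurrent model folds each token into a fixed-size state.

```python
# Toy cost model (illustrative only, not xLSTM): full attention touches
# every prior token at each generation step, so total work grows
# quadratically with sequence length n; a recurrent model does one
# constant-cost state update per token, so total work grows linearly.
def attention_steps(n):
    return sum(t for t in range(1, n + 1))  # step t attends over t tokens

def recurrent_steps(n):
    return n  # one fixed-size state update per token

# Doubling the context roughly quadruples attention work,
# but only doubles recurrent work.
print(attention_steps(1024), attention_steps(2048))
print(recurrent_steps(1024), recurrent_steps(2048))
```

This is the whole appeal: with a fixed-size recurrent state, the marginal cost of each extra token of context stays constant, which is what makes very long contexts plausible.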

Robotics, simulations, reinforcement learning #

I talked a lot about these things. I still think they are important. It seems more likely that we will “solve” abstract intelligence in the language and maths domains before physical intelligence.

Does abstract intelligence translate into physical intelligence? My intuition is probably not. It’s interesting to see people getting robots to plan and act with transformer architectures, and there are some extremely cool demos out there (Gemini Robotics is a good starting point if you haven’t seen what’s happening), but the road ahead looks long and tough.

Ultimately this area requires a very large investment in time, resources and learning, and may not be something I decide to dedicate any time to this year.

Milestone updates #

Last year I laid out some milestones. Here’s a few thoughts on what is going on with respect to them and where we are:

  • Action and effect / Reactive agents: AI starts to replace roles that have a reactive nature and require performing actions like sending emails or organizing calendars. This is, of course, well under way. We are seeing significant disruption of more automatable tasks, and some people are genuinely losing jobs because of this. But in the majority of cases, humans are just getting superpowers, and adoption is rapid (e.g. in coding). We are also seeing AI agents become embedded in lots of product experiences, which gives them the power to actually do things. On longer horizon tasks, like coding (e.g. Devin), it seems this is taking a bit longer to get right.
  • Agentic action and planning / Proactive agents: I have yet to see any significant deployments of proactive agents. Although you can put an agent in a loop fairly easily, things still tend to go off the rails fairly quickly.
  • Physical / real world control: Robotics has been hyped this year, no doubt, but overall progress is slow. I’ll refer the reader to Rodney Brooks’ January 2025 scorecard update for a more informed take on the progress and state of this field.
  • Online learning and adaptation: I believe the results coming off the back of O1 are the most interesting here. We have found ways to make RL work within the language domain; it shows promising signs of improved generalization and improved sample efficiency. This is exciting, but as I’ve noted above, we are constrained to verifiable domains for the time being, and it’s unclear whether this is a fundamental limitation.
  • Collaboration and group agency: Not much to say here, although some elements of this exist internally within agentic architectures.
  • Novel discovery out of distribution: There are examples of this in narrow domains, and it seems LLMs may have a role to play in open-endedness. When combined with search processes, today’s AI can play an important role in detecting novelty and interestingness. There is some interesting work from Sakana here, along with a broader view on the interaction between open-endedness and today’s AI systems.

ARC ideas #

There have been some fantastic resources around ARC; in particular, the MLST podcast has covered it a lot and has some great interviews with both Francois and some of the paper and competition winners from last year [1] [2] [3]. In brief, here’s a few thoughts and areas I am excited about with respect to ARC, and I will write about this in more detail soon.

  • Reinforcement learning LLMs to think about ARC - I took a dive into LLM RL recently, and I’m sure there is potential for this kind of approach to do some interesting things. ARC is a verifiable domain after all, and there’s more work to do here! This seems obvious after O1/O3 results.
  • Optimized DL guided program synthesis - In the technical report for ARC 2024, it’s called out that “One approach that has not been tried so far (likely because it is technically challenging) but that we expect to perform well in the future, is the use of specialist deep learning models to guide the branching decisions of a discrete program search process”. Interesting that no one seems to have seriously tried this, given that it’s roughly what Francois is suggesting as the most likely winning approach.
  • Learned abstract representations - I’m intrigued by attempts to map the problems into new more abstract representations before performing induction/transduction on them, for example this tweet suggests this “simple trick” can double O1 performance, and similar representations were converged on here. In both cases these were driven by heuristics, but learned representations perhaps offer some sort of more optimal “kernel trick” like approach for ARC in general.
  • Compression - Possibly related to the above, some relatively recent work to solve ARC by performing lossless compression on the problem touches on the longstanding idea that compression itself may be at the heart of intelligence. While the results themselves are not particularly impressive, the conceptual framework of the approach potentially opens some avenues to explore.

I did a bunch of work on ARC myself at the end of last year to index, describe, and analyze the ARC problems. This may be a useful reference for anyone looking to get more familiar with the problems themselves, and I hope to do the same again for ARC V2.

What I have seen said many times, and what is evident in the results, is that ARC’s three problem sets - train, evaluation, and test - are progressively harder. Test results are always worse than eval, which are always worse than train. I believe one very important thing to understand about the ARC problems is that you are actually being asked to produce an algorithm that can work in an effectively unseen domain. Sure, each problem set is still grids, but the core knowledge, patterns, and types of problems seen in the test set barely overlap with the train set (I do not know this, but I am inferring it from the above). That is not like most ML problems we try to solve today.
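
For anyone unfamiliar with the problems themselves, each public ARC task is a small JSON object with "train" and "test" lists of input/output grid pairs, where a grid is a list of rows of integers 0-9 (colours). A minimal sketch of loading one - the example task here is made up for illustration:

```python
# Sketch of the public ARC task format. The task JSON below is a made-up
# example; real tasks live in per-task JSON files with the same shape.
import json

task_json = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
"""

task = json.loads(task_json)

def grid_shape(grid):
    return (len(grid), len(grid[0]))

# A solver sees the train pairs and the test inputs; the test outputs
# are held out for scoring.
train_pairs = [(ex["input"], ex["output"]) for ex in task["train"]]
test_inputs = [ex["input"] for ex in task["test"]]
```

Note the asymmetry this encodes: a handful of demonstration pairs, then a prediction on an input you have never seen - skill acquisition from a few examples, which is the whole point of the benchmark.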

A plan for 2025 #

Not really a plan, but a few objectives and principles for myself this year:

  • Find a research group to work on ARC V2 with. I have some conversations in play here with a few folks in London / UK. Please don’t hesitate to reach out if you are interested.
  • Continue to write in depth and work publicly.
  • Dedicate at least 3 months full time to working on an ARC solution and paper this year.
  • Continue to build relationships with AI startups, look for opportunities for short-term contracts and employment to help me level up.
  • Learn to leverage AI myself, in particular to help me learn in a new field.
  • Possibly take a step to learn about robotics, teach a robot to walk from scratch (in simulation at least).

One important realization for me is that if I want to maximize my own impact on the field, I probably need to start a company. I expect to turn my energy towards this at the end of the year, but for now, I have a lot to learn first.
