Disclaimer: I wrote this primarily for myself. Why? Getting thoughts down on paper publicly forces me to go deeper, challenges me where I'm missing knowledge, brings clarity to my ideas, serves as a reference, and gives me something I can share with others to show where I am and what I'm thinking about. This probably isn't a fun read, but it's important writing, at least for me.
This is the update I promised myself on my First draft AGI career thesis from last year. Looking back, it already feels fairly naive, and a lot has changed since then, so here's an update and a new plan for 2025.
TL;DR of my checkpoint a year ago. Some of these are still true, some are not:
I had not read the “On the Measure of Intelligence” paper last year, but I did shortly after my first draft, and it has massively shaped my thinking since. I’d managed to home in on at least one of its ideas in my own definition of AGI, although perhaps wrongly attributing the difficulty to the “Generalization” component instead of the “Intelligence” component. Here was my AGI definition from last year:
A system that is capable of reliably performing at the 99th percentile of skilled adult human across a broad range of tasks including real world physical robotics problems and learning new cognitive and physical skills as sample efficiently as humans.
Skill acquisition efficiency felt like a very important component and I wanted to capture that, and this is what ARC tries hard to get at. When the ARC problem had a big push and new prize earlier this year, my attention was drawn significantly to it. As someone wanting to get into research and start trying to contribute, ARC had many desirable qualities:
I even asked Francois for validation, and unsurprisingly got it - working on ARC was, and still feels like, one of the highest-leverage ways an individual could hope to have impact towards the broad scientific goal of AGI. This is now roughly my plan for 2025. A few more details and ideas below, and another post to follow.
I was fairly unexcited about LLMs last year. My lack of interest stemmed from the fact that LLMs “don’t do things. They respond to things, but they don’t have agency, goals or intentions, they don’t plan and they don’t have memory”. I’m certainly letting go of that a little. A year ago, agentic systems were barely seeing the light of day. Devin was announced the same day I published my first post; it was one of the first real agentic systems built on top of LLMs that didn’t just seem like a toy, but something real that we will all have in our hands in the not too distant future.
I’m still fairly bearish on any LLM-led ASI fast take-off scenario. This deserves more words, and for the time being I’ll refer the reader here. But there were some deeper realizations about what LLMs actually are that have had the biggest impact on how I see them.
My hot take on LLMs a year ago was that they were just big interpolative databases of information. But through some of the conversation around ARC and LLMs, I started to see them more as databases of programs. This really changes the story, because LLMs can execute programs: with enough thinking tokens they can perform any computational task (for a deeper dive on this, see my recent essay - Computational irreducibility, inference time compute, and why we need to learn programs). Learning programs is also a powerful part of generalization, and I believe it is part of the success of O1/R1 and RL that I’ll get to later.
In his interview with Dwarkesh, Gwern made the statement that intelligence is “search over Turing machines”. It’s a beautiful way to put it, and I increasingly agree with this statement. LLMs might not be a particularly efficient form of Turing machine, but they can compute.
O3 flipped the table with its impressive result on ARC, even if it did cost hundreds of thousands of dollars. I have not fully “updated my priors” (bleurgh) from this result, but from the tone of even the “skeptics” at release time, I think you have to take this extremely seriously.
As recently as June, the narrative was still that solving ARC-AGI would be extremely hard. This has totally flipped on its head in just a few months. Even those bullish about rumors of Q* and other reasoning approaches would not have expected this level of success.
O3 changed my plan for ARC. My takeaway is that it is no longer a question of how we solve ARC, as we effectively have proof of a solution (until ARC-V2 at least). Emulating and dramatically optimizing whatever OpenAI cracked for O3 would be a completely valid and viable approach for anyone pursuing ARC in 2025, and it seems that several groups are already planning to do this, as Mike Knoop mentions at the end of his recent R1 analysis.
In the time between starting and finishing this draft, I’ve been able to update my loose summary of how O1 might work: it now seems fairly clear that the open-source community broadly understands how O1 works and will be replicating it en masse over the next 6 months. DeepSeek dropped a paper spilling most of the beans, and beyond that we even have published code [1] and [2] demonstrating O1/R1-like reinforcement learning approaches with some success.
I think this is all going to move extremely quickly. Here are some thoughts on how it plays out:
O1/O3/R1 plays into two things I have been thinking about for a while: 1) programs, and the fact that inference-time compute is necessary, and 2) search, and system 2 thinking. Nothing new here, just trying to piece it all together.
To accelerate progress in AI and realize its vast potential, we must invest in fundamental research in “NeuroAI” … based on the premise that a better understanding of neural computation will reveal fundamental ingredients of intelligence and catalyze the next revolution in AI
It pains me to say it, but this is on hold for now. I did a fair amount of reading here, and some writing too. But my confidence that meaningful insight from neuroscience will influence our progress in AI/AGI over the next few years is low. We are on a track now and progress is phenomenal. Perhaps one day we will need to go back to the drawing board, but for the time being my guess is that innovation will be driven primarily by human creativity (even if that just means throwing things against the wall) more than by a deeper understanding of the principles of computation in the brain.
Continual learning still feels like a fundamental but very hard problem. The economic advantage of being able to continually train, expand, or grow models is clear: it could massively reduce the cost of incremental model improvements, making it easier to stay at the frontier of performance competitively. The fact that it has not been done yet makes it clear just how hard this is. One could probably dedicate an entire lifetime to this research, and I dare not put my foot in it yet.
But perhaps in the nearer term, imbuing LLMs or similar systems with more general memory capabilities can offer a step change in system capabilities. LLM context windows are getting pretty long, but there is always a limit, and a cost to it, and there are limits to how much you can learn purely in context. From a product perspective, memory and context seem to be the most important ways to build any form of moat, so there are good commercial reasons to pursue this.
Whilst not directly addressing continual learning, there are alternatives to attention, such as xLSTM, that address its quadratic complexity and have the potential to massively increase context lengths, which could unlock new abilities in the near term. I’m excited to follow this and similar research.
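To make the scaling point concrete, here's a back-of-envelope sketch (the numbers are illustrative only, not measurements of any real model): self-attention forms an n×n score matrix per layer, so doubling the context quadruples that cost, while a recurrent-style cell like xLSTM does one fixed-cost update per token.

```python
# Back-of-envelope cost scaling per layer. Illustrative only.

def attention_score_entries(n):
    # QK^T is an n x n matrix: quadratic in sequence length.
    return n * n

def recurrent_steps(n):
    # A recurrent cell (e.g. xLSTM-style) does one fixed-cost
    # update per token: linear in sequence length.
    return n

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}  attention={attention_score_entries(n):>12}  recurrent={recurrent_steps(n):>7}")
```

Growing the context 10x makes the attention score matrix 100x larger while the recurrent cost grows only 10x, which is the whole appeal of these alternatives at very long contexts.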
I talked a lot about these things. I still think they are important. It seems more likely that we will “solve” abstract intelligence in the language and maths domains before physical intelligence.
Does abstract intelligence translate into physical intelligence? My intuition is probably not. It’s interesting to see people getting robots to plan and act with transformer architectures, and there are some extremely cool demos out there; Gemini Robotics is a good starting point if you haven’t seen what’s happening. But the road ahead looks long and tough.
Ultimately this area requires a very large investment in time, resources and learning, and may not be something I decide to dedicate any time to this year.
Last year I laid out some milestones. Here are a few thoughts on what is going on with respect to them and where we are:
There have been some fantastic resources around ARC. In particular, the MLST podcast has covered it a lot and has some great interviews with both Francois and some of the paper and competition winners from last year [1] [2] [3]. In brief, here are a few thoughts and areas I am excited about with respect to ARC, and I will write in more detail about this soon.
I did a bunch of work on ARC myself at the end of last year to index, describe, and analyze the ARC problems. This may be a useful reference for anyone looking to get more familiar with the problems themselves, and I hope to do the same again for ARC V2.
What I have seen said many times, and what is evident in the results, is that ARC’s three problem sets - train, evaluation, and test - are progressively harder. Test results are always worse than eval results, which are always worse than train results. I believe one very important thing to understand about the ARC problems is that you are actually being asked to produce an algorithm that can work in an effectively unseen domain. Sure, each problem set is still grids, but the core knowledge, patterns, and types of problems seen in the test set barely overlap with the train set (I do not know this for certain, but I am inferring it from the above). That is not like most ML problems we try to solve today.
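For anyone who hasn't looked at the data: a task in the public ARC JSON format is just a handful of demonstration input/output grids plus test inputs, where grids are lists of lists of colour values 0-9. The toy task and the `rotate_180` hypothesis below are invented for illustration; real ARC transformations are far harder to guess.

```python
# Shape of an ARC task (public JSON format): demonstration pairs
# plus test inputs. Grids are lists of lists of ints 0-9 (colours).
# This particular task is invented for illustration.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]]},  # solver must infer the rule
    ],
}

def rotate_180(grid):
    # Hypothesis: each output is the input rotated 180 degrees.
    return [row[::-1] for row in grid[::-1]]

# Verify the hypothesis against every demonstration, then apply it.
assert all(rotate_180(p["input"]) == p["output"] for p in task["train"])
print(rotate_180(task["test"][0]["input"]))  # -> [[0, 3], [0, 0]]
```

The point of the structure is exactly what's described above: two or three demonstrations are all you get, and the rule you induce from them has to hold on inputs you've never seen.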
Not really a plan, but a few objectives and principles for myself this year:
One important realization for me is that if I want to maximize my impact on the field, I probably need to start a company. I expect to turn my energy towards this at the end of the year, but for now, I have a lot to learn first.