This is my take on ARC 2025, a review of the literature, some opinions and attempts to categorize and conceptualize the various approaches so far, ideas that have been discussed, as well as what's special about the competition, what might work and what won't work.
If you are new to ARC and want an overview of the various approaches from 2024, I hope this is useful to you as well as me! The first half of the document covers high level approaches and key ideas extracted from various papers and results. The second half provides some more detail (although still fairly high level) on each of the papers that won in ARC 2024, as well as some hand-picked related papers, work and ideas.
Quite a lot of broad knowledge around ML/AI is assumed throughout. I have done my best to link to or explain some concepts, but this will probably be tricky if you don't know some basics around deep learning, and general ML terminology.
I have almost certainly made many mistakes, possibly at a conceptual level, and most definitely in some of my attempts to extract details and ideas out of the papers. There are a few "hot takes" in here too and I've not held back in expressing some opinions of my own - please take these with a pinch of salt, and I'm very open to feedback here.
I have not covered all the literature and read all the papers, there is a lot out there. If you think I've missed something important, please let me know.
ARC stands for the Abstraction and Reasoning Corpus. It is a benchmark that came out of François Chollet's paper On The Measure of Intelligence in 2019. It has evolved a few times: the benchmark was renamed ARC-AGI, and an ARC-AGI-2 version was launched in March 2025.
Here is an example of an ARC puzzle. For each puzzle (sometimes referred to as a task), you are provided with around 2-4 (usually 3) examples of input-output grid pairs, and one (occasionally 2) "test" input grid(s) for which you have to predict the output grid.
The goal is to infer the rule from the examples, and apply it to the test input grids. The puzzles always exist in this same grid world construct, but the space of possible puzzles is massive. The final ARC competition is run against a hidden test set of puzzles, which means a good solution must be able to do some amount of out of domain generalization.
At a high level, ARC is designed to test "skill acquisition efficiency", which is the simplest way to capture how François defines intelligence:
The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.
You can read a lot more about the details of the benchmark, the competition and explore the puzzles on the ARC website.
As I will reference these throughout, it's important to note that with both the 2024 and 2025 competitions, there are a few different datasets we care about:
When it comes to the private test set, your solution program will be presented with new puzzles that it just hasn't seen before. This is a significant difference from many benchmarks. Great ingenuity and creativity went into constructing these puzzles, and while they may be composed of common elements, transformations, and core knowledge priors that are present in the training set, they are ultimately different problems.
People often seem to be surprised by just how different the test problems are! Solutions that could easily overfit to the training set could end up scoring 0% on the hidden test set.
This is, I believe, the most important thing to understand about ARC. The challenge here is to generalize out of the domain of the training dataset, in a way that most deep learning approaches fail to.
The outcome of this is that some form of test time adaptation is absolutely crucial for getting a good score on the benchmark, and we'll see this as a major theme in the results from 2024.
The second most important thing to understand about the challenge comes back to Francois' definition of intelligence - "The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks...".
Efficiency is an explicit part of the challenge. There are deliberate compute bounds for submissions, effectively in terms of FLOPs. You have a machine with a fixed specification and a capped running time. You must use this compute efficiently to succeed. Unrestricted scaling is not what the benchmark is for.
And as with the above, this is the point! It's easy to critique this and say, well if we got GPT4.5 or O3 and threw $1M of compute at it we could solve ARC (kind of like how O3-high basically did that for ARC-AGI-1 in December 2024). But, that is not very efficient [0] skill acquisition, and we already know in principle that we can get logarithmic returns on inference time compute.
Without efficiency as an explicit goal, ARC would simply not be an interesting benchmark.
Given the context on ARC, it is hopefully clear why you can't "just" throw any pre-trained model using a standard deep learning architecture at it. People have tried, and it doesn't work well (with a carve out for thinking models, discussed below).
At the other end of the spectrum we have a whole category of approaches that come from the world of discrete program search (also referred to as program synthesis, and sometimes program induction). One of the nice things about coming at ARC from this angle is that we can lay out a (horribly inefficient) architecture of a learning system that should solve ARC with enough compute. This gives me an opportunity to define some useful terminology for the rest of the post so let's lay it out!
Let's start by making the very reasonable assumption that it is always possible to write some program that can capture the rule of any given ARC puzzle, and produce the output grid from the input grid, more formally:
There exists a language L that can represent all possible programs P that map from some representation of a grid X to grid Y. This must be true if we accept things like the principle of computational equivalence or the notion of Turing-completeness. An obvious (but perhaps not the best) candidate for such a language would be a modern Turing-complete programming language such as Python.
The second thing we need to do is to be able to enumerate all possible programs P that are formed out of our language L. This is generally possible; even for Python there is nothing stopping us from just listing sequences of bytes as the program source, even if most such programs will immediately fail.
For each puzzle we then search for the shortest program P that solves all the example grid pairs, and apply it to the test input. Given the above, we know that such a program must exist.
Despite the set of programs P being of infinite size, this is not a problem: as we are looking for the shortest such program, we only need to be able to enumerate our programs in at least roughly size order.
Why the shortest? This comes from a few well known theoretical ideas, e.g. the minimum description length principle, Solomonoff induction, Occam's razor, Kolmogorov complexity, etc. I won't write about these in depth here (you can read about them), but in general we are now also accepting the idea that "the simplest explanation is the best explanation".
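To make this dumb approach concrete, here's a minimal sketch in Python. The toy DSL and primitive names are mine, purely for illustration; a real system would enumerate a vastly richer language.

```python
from itertools import product

# A toy DSL of grid -> grid primitives (illustrative only, not a real ARC DSL).
def identity(g):   return g
def transpose(g):  return [list(r) for r in zip(*g)]
def flip_h(g):     return [list(reversed(r)) for r in g]
def flip_v(g):     return list(reversed(g))

PRIMITIVES = [identity, transpose, flip_h, flip_v]

def run(program, grid):
    """A 'program' here is just a sequence of primitives applied in order."""
    for fn in program:
        grid = fn(grid)
    return grid

def shortest_program(examples, max_depth=4):
    """Enumerate programs in size order, return the first that fits all examples."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None

# Example usage: the hidden rule is "flip the grid vertically".
examples = [([[1, 0], [0, 0]], [[0, 0], [1, 0]]),
            ([[2, 2], [0, 1]], [[0, 1], [2, 2]])]
prog = shortest_program(examples)
print([fn.__name__ for fn in prog])  # ['flip_v']
```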
This relatively dumb approach will probably work provided a few details are right. The problem is it's horribly inefficient, and we don't have infinite compute. So a significant amount of research is about making this approach more efficient, and there are a whole bunch of ways you could do this:
Nearly all the top papers and results go after some of the above opportunities, and this is broadly what Francois has been saying is the approach he thinks is most likely to deliver results on ARC - Deep learning guided program synthesis. This is also effectively what he and Mike Knoop started a company to explicitly work on:
(w.r.t NDEA) Our first focus is on deep learning-guided program synthesis to create AGI that can invent, adapt, and innovate
At the end of the 2024 competition the ARC team released a technical report covering top scores, top papers, companies and a summary of approaches. It's relatively brief, but I'll highlight a few things from it, or you can read it: ARC Prize 2024: Technical Report.
The 2024 ARC competition saw a shift away from the dominance of discrete program search approaches that had prevailed in previous years. This year, LLMs became a significant factor, driving both new program induction approaches and transductive approaches, with the crucial introduction of test-time fine tuning. Additionally, ensembling played a key role in improving scores for the top papers.
If we look back at the 2020 competition, the winning result, Icecuber, which stood for quite a long time, was a fairly dumb brute-force discrete program search. LLMs out of the box performed poorly, but with a slew of test-time adaptation methods being developed for them, they quickly started to take over.
One of the standout claims for me from the report was:
One approach that has not been tried so far (likely because it is technically challenging) but that we expect to perform well in the future, is the use of specialist deep learning models to guide the branching decisions of a discrete program search process - similar to what can be seen in the AlphaProof system from Google DeepMind.
I.e. no one has tried exactly what Francois thinks will work. There are certainly many papers that attempt to do something like this conceptually, but they are not implemented literally as a specialist DL-guided program search (for example, the DreamCoder based approaches).
One more highlight from the report:
We also note that deep learning-guided program synthesis does not currently decisively beat DSL-based brute-force program search
You have a spectrum here, from searching smartly to searching fast. No one has come up with a DL model that can effectively outperform a well engineered brute-force search, yet. Your NN and search policy network have to deliver more value than their relatively high computational cost.
There is a good analogy here to progress in Chess/Go. For example in chess, Stockfish has been the most consistently capable chess bot over the last 10 years but used no neural networks or learned search policy, just some heuristics and a very efficient and well engineered search algorithm.
It wasn't until 2020 that this trade-off started to make sense for Stockfish and it moved to a neural-network based evaluation (NNUE). We saw a similar story with Go: heuristic based brute force search was no match for humans until AlphaGo, where the breakthrough was ultimately having deep learning guide the search much more intelligently (and a lot of innovations to actually achieve that).
It seems [1] that humans search smartly, and probably most people assume that this will eventually be the winning approach for ARC that can "break" the logarithmic returns of a brute-force search, but within discrete program search approaches we just haven't seen this flip that way, yet.
We expect that more efficient program search techniques leveraging deep learning should be able to pull away from brute-force search in the future.
Transduction and Induction are two phrases you will see a lot in the literature and in the rest of this write up, let's discuss them a little.
I've struggled a little with this terminology, and the formal definitions of transductive and inductive learning don't seem to help much. The abstract of the 1st prize paper below captures well what they tend to mean in the context of ARC:
Inductive approaches generally mean finding (learning, inferring, inducing) a function that can solve the puzzle. You try to find a good program, then you apply it to the test input (more along the lines of discrete program search). It offers an explanation for the observed data in the form of a program.
Transductive approaches skip that step, they just go for it, generate the output based on some prior knowledge (more along the lines of just giving an LLM the examples and hoping it does some in-context learning). They do not offer an explanation [2].
I don't think this is a well defined distinction and gets particularly blurry when looking at some of the TTT (test-time-training) fine tuning approaches, and also with thinking language models. It also depends on where you draw the line around a system.
For the TTT case, if you think of an LLM and its weights as a program [3] - and then at test time you fine-tune your weights to better reproduce the example problems - you are clearly doing some sort of program inference, it's just not on a discrete program. Something to this effect is noted in the technical report:
Interestingly, TTT can be seen as conceptually similar to program search, albeit at the opposite end of the memorization/recombination spectrum.
One difference here however, is that a fine-tuned model offers no easily interpretable explanation.
One of the results called out in the technical report is the value of ensembling both transductive and inductive methods, which was the main point of one of the papers below. Ensembling across both methods was crucial to get to the top of the leaderboard in most cases [4].
Ensembles are a mainstay of ML, and frequently pop up in winning kaggle competitions. If you want a good score, this is a relatively cheap way to boost it. Notably, 2 of the leaderboard results in 2024 were ensembles of previous years submissions.
Some of the papers discussed below make some interesting observations about what ensembles well together (e.g. a mix of transductive and inductive learners), but the mechanics of ensembling are relatively simple so there isn't a tremendous amount to dig into here.
Improvements in DSLs were a significant driver of progress in the first few years after ARC was released. Michael Hodel authored ARC-DSL, one of the original DSLs for ARC, as an example that is still relevant and useful today.
A good DSL can greatly improve search efficiency, given it is what we search over. Pure Python is not a great language for brute-force search as most programs are completely irrelevant[5]. A good DSL for the domain can help ensure that solutions can be represented in a small program, and that most parts of the search space are actually valid programs that operate on grids.
One challenge with most of the DSLs is that they are hand-coded. The author of a DSL might go and solve all the ARC training tasks, and then produce a set of language primitives that are necessary to solve all the problems they have seen. Whilst this makes the search efficient, it's hard to guarantee they are complete. The ability to solve all the training tasks does not necessarily mean they have the ability to solve all of the test time tasks! This needs to be a primary objective of any DSL: be expressive enough whilst also providing a small yet useful set of primitives.
The DSLs often end up being rather large, for example, ARC-DSL contains a massive 160 primitive functions!
If you want to learn more about the construction of a DSL, then the ARC-DSL write-up is probably a good place to start. Probably avoid writing one yourself unless you are doing something very novel, although a number of the papers below do end up doing this.
It is almost certainly the case that there is some transformation of the grid representation of ARC problems that makes them easier to solve than working on the grids directly themselves. Whether this is hard-coded, learned or otherwise, such a representation likely exists.
Parsing ARC grids into objects has been explored in some of the papers, e.g:
Representations for the ARC grids have been explored all the way back in 2022. The intuition here is probably that by mapping to some better data structure for representing the grids, the DSL becomes simpler, and the search process more efficient for programs that operate on those grids.
┌──────────┐ ┌──────────┐
│ Input │ │ Output │
│ Grid │ │ Grid │
└────┬─────┘ └────▲─────┘
│ │
▼ │
Encoder Decoder
│ ▲
│ │
┌────▼─────┐ ┌────┴─────┐
│ Abstract ├───────────► Abstract │
│ Rep │ │ Rep │
└──────────┘ Solve └──────────┘
this
problem!
Representations can be related to the DSL itself but don't have to be. When treated as an explicit outer process, this approach is rather flexible, as these representations can then be fed into many different systems including LLMs, where for example one person claimed a doubling of O1 performance by mapping ARC grids into an abstract, object centric representation using a set of hand crafted heuristics.
Some of the work around these explicit abstract representations has been heuristic or hard coded, but sometimes compactly mapping the grids into some other representation is actually an explicit part of the learning objective.
For some of the other approaches, while there may not be an explicit abstraction step, it's reasonable to assume that the layers of a neural net in, say, a fine-tuned LLM are actually learning such representations implicitly.
The idea of refinement comes up in a few of the papers. If we think of ARC in general as a grid > grid transformation problem, then one obvious way to break this down a bit is to construct the output through a sequence of repeated transformations over partial grids:
┌─────────┐
│ Input │
└────┬────┘
├──────────────────────┬──────────────────────┐
│ │ │
▼ ▼ ▼
┌─────┐ ┌─────┐ ┌─────┐
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │
null ──►│ f() ├──►│ Partial ├─►│ f() ├──►│ Partial ├─►│ f() ├──►
│ │ │ output │ │ │ │ output │ │ │
└─────┘ └─────────┘ └─────┘ └─────────┘ └─────┘
This can come in a few different forms for different approaches. For example with an LLM/VLM, we might repeatedly apply the LLM, providing partial grids as additional context after step 0, maybe even explicitly asking the LLM to correct any mistakes.
In some approaches there is a notion of grid similarity, and this can be a useful signal for training.
Ultimately, allowing repeated applications of some function over grids can enable types of computation that might not be possible in, say, a single pass through a transformer, or any fixed compute function. Many ARC problems become much easier if you repeatedly apply some function, and this is usually helpful when there is some sequential dependence between computational steps.
For an example of a problem that illustrates the importance of such iterative computation:
This transformation function is much easier to write as a loop!
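As a toy illustration of why iteration helps (my own example, not from any paper): a single "gravity" step that moves each pixel down one cell is trivial to express, and the full transformation is just that step applied until a fixed point.

```python
def gravity_step(grid):
    """Move every non-zero pixel down one cell if the cell below is empty (0)."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(h - 2, -1, -1):      # bottom-up so pixels only move one step per pass
        for c in range(w):
            if out[r][c] != 0 and out[r + 1][c] == 0:
                out[r + 1][c] = out[r][c]
                out[r][c] = 0
    return out

def apply_until_fixed_point(f, grid, max_iters=100):
    """Repeatedly apply f until the grid stops changing."""
    for _ in range(max_iters):
        nxt = f(grid)
        if nxt == grid:
            return grid
        grid = nxt
    return grid

grid = [[3, 0, 5],
        [0, 0, 0],
        [0, 7, 0],
        [0, 0, 0]]
print(apply_until_fixed_point(gravity_step, grid))
# [[0, 0, 0], [0, 0, 0], [0, 0, 0], [3, 7, 5]]
```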
This wasn't covered in the technical report as it happened afterwards, in December, when OpenAI launched O3 with the announcement that it achieved an 87.5% score on ARC-AGI-1, on the semi-private eval set.
As recently as June, the narrative was still that solving ARC-AGI would be extremely hard. This has totally flipped on its head in just a few months. Even those bullish about rumors of Q* and other reasoning approaches would not have expected this level of success.
This was an exceptional result, and even those labelled DL skeptics took a moment to acknowledge its importance. We have to take this seriously, and since O3, as well as DeepSeek, many people will be looking to leverage thinking models to approach ARC 2025.
Emulating and dramatically optimizing whatever OpenAI cracked for O3 would be a completely valid, and viable approach for anyone pursuing ARC in 2025, and it seems that several groups are already planning to do this, as Mike Knoop mentions at the end of his R1 analysis.
Letting LLMs think before answering is obviously a powerful approach:
At a high level, thinking tokens can form some kind of not entirely discrete program code and this is in my opinion a big part of what makes them so powerful. I've written a little more about this idea here.
Because ARC is a verifiable domain, it should be possible to use RL to train a thinking model specifically to solve ARC puzzles. If done right, this can cause models to learn how to think about ARC tasks in a programmatic way, rather than just memorizing all of the patterns of the test set. The hope is this will greatly improve generalization ability. It might even be the case that the in-context learning of these models is a sufficient form of test time adaptation.
Training these models is fiddly, and probably expensive. I've done a little bit of exploration into this here (although this work is already likely well out of date), and there are now a lot of people in the open-source community building tools and doing research on replicating DeepSeek's RL result, with some successes on simpler domains than ARC even with small models.
The immediate challenge for these approaches is context lengths. Puzzle context itself can be 6K tokens or more, and for a decently sized thinking trace, you may want another 10-30K tokens or more! This is hard to fit on limited compute resources, but people are working on optimizing this.
Whilst O3 performed well on ARC-AGI-1, the new version of the competition slashes O3-low performance from 76% to ~4% (although it still holds the top spot on the unrestricted-compute leaderboard).
I expect to see an ARC optimized thinking model be one of the top scoring approaches for this year's competition, but it will require significant $$$ and some serious LLM expertise.
To solve ARC it is useful to have some basic knowledge and understanding of the world, as often puzzles leverage concepts that we are highly familiar with as humans.
A simple example is the notion of gravity. We all know what gravity is, and the concept even shows up in some ARC tasks:
This knowledge dependency is impossible to avoid, but to minimize the impact this crystallized knowledge has on one's ability to solve ARC tasks, ARC restricts itself to building on a minimal set of knowledge called core knowledge priors, defined as:
cognitive building blocks that are either present at birth or acquired very early in human development with minimal explicit instruction, as described by the Core Knowledge theory.
It is very hard to come up with a complete list of core knowledge priors. They are discussed in On The Measure of Intelligence, and I'll pull out the summary of them from there:
Objectness and elementary physics: humans assume that their environment should be parsed into “objects” characterized by principles of:
Agentness and goal-directedness: humans assume that, while some objects in their environment are inanimate, some other objects are “agents”, possessing intentions of their own, acting so as to achieve goals (e.g. if we witness an object A following another moving object B, we may infer that A is pursuing B and that B is fleeing A), and showing efficiency in their goal-directed actions. We expect that these agents may act contingently and reciprocally.
Natural numbers and elementary arithmetic: humans possess innate, abstract number representations for small numbers, which can be applied to entities observed through any sensory modality. These number representations may be added or subtracted, and may be compared to each other, or sorted.
Elementary geometry and topology: this core knowledge system captures notions of:
After tagging 200 of the ARC-AGI-1 training dataset tasks, I also ended up producing my own taxonomy of something kind of like priors (perhaps more like concepts) across the different tasks, you can see that breakdown here. Other taxonomies of ARC problems include Concept-ARC.
The new version of the benchmark is a substantial step up in difficulty. The ARC team covers details of the changes in their blog post in depth, but I will summarize the delta here. To some extent, the benchmark was updated to really go after the deficiencies in the top scorers from last year's competition. This may seem unfair, but once again it is kind of the point - we can't really say we have full-blown AGI until we, as humans, are no longer capable of coming up with problems that we can solve that AI systems can't. Moving the goalposts is somewhat intentional!
There are a few specific types of problems that the team have explicitly incorporated into the new dataset that today's systems struggle with, which you can see examples of in the above blogpost:
In addition to the introduction of new tasks capturing the above, a bunch of tasks that were susceptible to brute force search have been removed from the private test set. This appears to be all tasks that were solved by Icecuber.
The datasets have changed slightly. In all the eval datasets there are now 120 problems, and the training set has expanded to 1000 problems.
The three eval sets, public, semi-private, private have been calibrated for human difficulty [6].
The hardware has been improved too, with significantly more GPU memory available.
The scores for 2024's top solutions have absolutely plummeted. Most LLMs out of the box now get 0%, the ARChitects' 1st place score of 53.5% is now down to 3%, and O3 is down from 76% to ~4%.
In summary, it's a much much harder problem set. Whilst human pass rates are roughly the same, my anecdotal observation is that these puzzles are still harder for humans: they require significantly more thought and take more time to solve [7].
Outside of the direct research on approaches, there are a few useful resources for anyone starting to dig in. The arcprize.org site has a lot of useful information, guidance, and links to get started, as well as access to the Discord chat.
Simon Strandgaard's Awesome ARC page has a ton of links and resources, editors, tools, paper links, sample notebooks, prompts, etc etc.
Guillermo Barbadillo's literature review for his own 2nd place solution is a good read on some of the history, and was useful in preparing this content.
There is a lot of prior work in creating expanded training datasets (beyond the 400 and now 1000 for ARC-AGI-1/2) which are leveraged by a lot of the work discussed below, notably:
And more, including some simplified problem sets, such as Mini-ARC, 1D-ARC, Sort-of-ARC, Language-annotated ARC (LARC).
To some extent, pretty much all the top ARC approaches actually follow a subset of a single high level recipe. The implementation details, representations, methods vary massively, but there is a general formula here that covers everything from discrete program search to LLMs with test time training:
This does not describe pre-training for any of these approaches which vary significantly (or in some cases, there is none).
I'll cover the main two different approaches below within the context of this recipe:
Some more novel approaches fit here too, e.g. Searching Latent Program Spaces follows the same pattern again: a program (or rather a vector representation of a program) is guessed and iterated on through gradient ascent, the selection step is skipped as there is only one candidate, and there is a novel approach to program representation and execution.
The top 3 scores this year did this - ARChitects, Omni-ARC, and MindsAI (unsubmitted/unpublished).
In the case of LLM fine tuning, the LLM is a program, and we can interpret the TTT approach for LLMs in the context of the general recipe as:
MLST has done a fantastic job of interviewing various people from these groups, and I've linked some of those podcasts next to the relevant papers below [8].
┌──────────┐
│Base model│
└─────┬────┘
│ <augment> ┌───────────────────┐
<fine-tune>│◄─────────────│Public ARC problems│
│ └───────────────────┘
┌─────▼───────────┐
│Preliminary model│
└─────┬───────────┘
│ <augment> ┌────────────────────┐
<fine-tune>│◄─────────────│Private ARC problems│
│ └────────────────────┘
┌─────▼─────┐
│Final model│
└─────┬─────┘
│
│ ┌─────────┐
<augment_0> ├──►│Candidate├─────┐
│ ├─────────┤ │
<augment_1> ├──►│Candidate├─────┤
│ ├─────────┤ │
<augment_2> └──►│Candidate├─────┤
└─────────┘ │
│ <unaugment>
┌─────────────┘
│
┌───▼───┐
│Scoring│
└───────┘
Diagram is derived from the ARChitects paper, and captures the critical steps: pre-training, data augmentations, extended datasets, private set fine-tuning, candidate generation and scoring.
Most of these steps are fairly obvious. You get a base model, you fine-tune it, you fine-tune it some more on the private set, you use augmentations to enlarge your training datasets. The candidate generation and scoring is less obvious, and is discussed below.
Pretraining is a relatively straightforward part of the process:
Many people have had the intuition that ARC puzzles are spatial / vision problems, so what about VLMs? This is explored nicely in Mini-ARC and A 2D nGPT Model For ARC Prize, where small vision transformer models are trained from scratch, and a 2D positional embedding scheme is pulled together. There are of course many improvements here to try, but it doesn't seem any of the VLM based approaches have been engineered yet to a point of being competitive.
One challenge with LLMs is the number of tokens required. For larger puzzles, you have 30x30 grids. Given 3 example pairs and 1 test pair, that's 7x30x30 tokens minimum (6.3K). That's quite a lot of tokens, which, without a customized tokenizer, can easily double due to separators, newlines etc. You need to fit this all in memory too, and it needs to run fast.
The ARChitects paper performs a few nice optimizations to make this all run fast. For example, a custom tokenizer prevents number chunking (so 12 is two tokens, "1" and "2", rather than a single "12" token), which seems to have a nice impact on efficiency.
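Roughly speaking (this is an illustration, not the ARChitects' actual tokenizer), the goal is for every grid cell to map to exactly one token:

```python
def serialize_naive(grid):
    """Comma-separated serialization - a generic BPE tokenizer may merge or split
    these numbers unpredictably (e.g. '12' as one token, or ',0,' as a chunk)."""
    return "\n".join(",".join(str(c) for c in row) for row in grid)

def serialize_per_cell(grid, alphabet="0123456789"):
    """One character per cell plus a newline per row. With a custom tokenizer whose
    vocabulary is exactly these symbols, token count == cells + row separators."""
    return "\n".join("".join(alphabet[c] for c in row) for row in grid)

grid = [[0, 1, 2], [3, 4, 5]]
print(serialize_naive(grid))     # "0,1,2\n3,4,5" -> 11 characters, token count tokenizer-dependent
print(serialize_per_cell(grid))  # "012\n345"     -> 7 characters, 7 tokens with a symbol-level vocab
```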
Test-time training, or test time fine tuning is the most notable innovation for making LLMs competitive in ARC 2024.
The moment your submission starts up, it gets to see all (120) private eval puzzles. There are two things you can do here: fine-tune a single model across all of the test tasks at once, or fine-tune a separate model for each task.
Different groups went in different directions, although the 2nd place paper contrasts these approaches in a controlled way and finds that, probably as you would expect - fine-tuning per task yields better scores. However, it's more expensive to fine tune separately for each task, so there are trade-offs to make, and the best use of your limited compute is unclear. For example, ARChitects don't fine-tune per task whereas Omni-ARC does, but Omni-ARC uses a much smaller model.
Just to make it clear how one constructs the test time training set, you need to effectively create a new training set from the examples, as your code doesn't get to see the true answer for the test set actual problems! Omni-ARC's description of this is simple:
For each test problem that had n train samples, I fine-tuned the model using n-1 train samples and using the remaining sample as a test sample. The selection of the test sample was done randomly on the fly during training.
You effectively construct new in-context tasks out of the given sample pairs.
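A minimal sketch of that leave-one-out construction (the data layout follows the standard ARC JSON format, but the code is mine, not Omni-ARC's):

```python
import random

def make_ttt_tasks(task, n_samples=50, rng=random.Random(0)):
    """Build new training tasks from a test task's demonstration pairs alone.

    `task` is a dict like {"train": [{"input": ..., "output": ...}, ...]}.
    For each synthetic task we hold one demonstration pair out as the 'test'
    pair and use the rest as in-context examples - no true test answer needed.
    """
    pairs = task["train"]
    ttt_tasks = []
    for _ in range(n_samples):
        held_out = rng.randrange(len(pairs))
        ttt_tasks.append({
            "train": [p for i, p in enumerate(pairs) if i != held_out],
            "test": [pairs[held_out]],   # we know this pair's output, so it gives a loss
        })
    return ttt_tasks
```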
The second frequently used technique for fine-tuning dataset generation, both pre-training and test time, is to leverage augmentations of the data. You take a list of grid pairs, and apply a reversible transformation to each grid to create more problems. This is a common technique in ML and seems to be important to performance here too.
Some of the augmentations applied across different groups as an example:
This is useful for training / fine-tuning, and it will also come up in program selection and answer selection techniques!
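For concreteness, here is a hedged sketch of the kind of invertible augmentations used; the exact sets differ between groups, and when augmenting a whole task the same transformation is applied to every grid in it.

```python
import random

def rotate90(g):        return [list(r) for r in zip(*g[::-1])]
def unrotate90(g):      return rotate90(rotate90(rotate90(g)))
def transpose(g):       return [list(r) for r in zip(*g)]

def make_color_perm(rng):
    """Random permutation of the 10 ARC colors, plus its inverse."""
    perm = list(range(10))
    rng.shuffle(perm)
    inv = [0] * 10
    for i, p in enumerate(perm):
        inv[p] = i
    fwd = lambda g: [[perm[c] for c in row] for row in g]
    bwd = lambda g: [[inv[c] for c in row] for row in g]
    return fwd, bwd

# Each augmentation is a (forward, inverse) pair so predictions can be mapped back.
rng = random.Random(0)
color_fwd, color_bwd = make_color_perm(rng)
AUGMENTATIONS = [
    (rotate90, unrotate90),
    (transpose, transpose),      # transpose is its own inverse
    (color_fwd, color_bwd),
]

grid = [[1, 2], [3, 4]]
for fwd, bwd in AUGMENTATIONS:
    assert bwd(fwd(grid)) == grid
```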
There are a few variations of this, but the high level idea is roughly the same. I'll summarize the ARChitects approach as well as the Omni-ARC approach, which we think is similar in principle to what MindsAI do too.
The primary objective here is to generate multiple answers, 10s to 100s, and then perform some selection process at the end. There are different ways to generate these variations, but both top prizes rely on augmentations to do this.
The ARChitects use quite a novel LLM token sampling approach, which is detailed a bit more in the paper notes. They sample tokens from the LLMs and explore multiple different token sequences, generating multiple answers whose overall sequence probability is greater than some fixed threshold. They then use augmentations to compute probabilities of solutions across multiple variations, and this feeds into the selection process, following the intuition that:
a correct solution should demonstrate greater stability in its sampling probability under augmented conditions compared to an incorrect one
For Omni-ARC, this process is much simpler. They set the temperature to 0, and run inference over 96 augmented versions of the puzzle. They reverse the augmentations, and then perform a voting mechanism across all the solutions to choose the best ones.
Ultimately the point is that we don't just ask the LLM to print its 1 best answer, we ask it to generate several answers, and then select the best ones according to some metric.
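Putting that together, a rough sketch of the augment, infer, reverse, vote loop; the `model` call and the augmentation pairs are placeholders rather than any team's actual code.

```python
from collections import Counter

def grid_to_key(grid):
    """Hashable key so identical candidate grids can be counted."""
    return tuple(tuple(row) for row in grid)

def predict_with_voting(model, task, augmentations, top_k=2):
    """Run the model on many augmented views of the task, undo each augmentation
    on the prediction, then vote across the de-augmented candidates.

    `model(task)` -> predicted output grid (greedy / temperature 0).
    `augmentations` -> list of (augment_task, unaugment_grid) function pairs.
    """
    votes = Counter()
    candidates = {}
    for augment_task, unaugment_grid in augmentations:
        prediction = model(augment_task(task))
        candidate = unaugment_grid(prediction)
        key = grid_to_key(candidate)
        votes[key] += 1
        candidates[key] = candidate
    # ARC allows two attempts per test input, so return the top-k voted grids.
    return [candidates[key] for key, _ in votes.most_common(top_k)]
```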
At this point with a background on DSLs, representations, and search, hopefully the basic brute-force approach is relatively clear at a high level, but framing the approach in terms of the general recipe:
The first obvious area for innovating is improving upon the representations for problems and the DSLs used. Sometimes it's pure Python, most often the representations are just grids. A lot of effort has gone into some approaches to hand design DSLs, some of which are quite innovative, for example with graphs and graph transformations, or by representing programs in some latent vector space.
Searching over programs could be straight-up brute force, but usually includes some hand-crafted heuristics. Most solutions follow a "shortest program" selection principle, so proceed with a roughly breadth-first search, or beam search. DreamCoder provides a much more intelligent approach to program search, with NNs that can guide the search process and build up a vocabulary of useful building blocks.
Metrics are often defined to guide program search too. E.g. in gradient-free search methods, which may use evolutionary strategies or simple hill climbing, some measure of distance to the target grid, such as pixel error, can be used.
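For example, a trivial pixel-error metric for same-shaped grids (a sketch; real systems also need to handle shape mismatches more gracefully):

```python
def pixel_error(pred, target):
    """Fraction of cells that differ; returns 1.0 if the shapes don't even match."""
    if len(pred) != len(target) or len(pred[0]) != len(target[0]):
        return 1.0
    cells = len(target) * len(target[0])
    wrong = sum(p != t for prow, trow in zip(pred, target) for p, t in zip(prow, trow))
    return wrong / cells
```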
Program selection can have a few tricks. Similar to LLMs, augmented versions of the problems are often used as part of the generation and selection process for programs and/or solutions. Shorter programs are nearly always preferred, or, as with LLMs, consistency across augmentations.
Compute efficiency is absolutely crucial. Good program synthesis approaches need to be implemented in a fast language like C++. No one, yet, has found a way to use GPUs to help perform program search (perhaps with the exception of searching latent program spaces), but there are some discussions in the community I mention below around how this might be possible.
Relatively rough notes follow on all of the top paper prizes and runner ups. I didn't go into every winning solution, although the top 2 published scores both happen to have also won a paper award, so they are covered.
Top 5 private scores:

- (53.5%) - Covered below.
- (40%) - Covered below.
- (40%) - A guided brute force program search, well engineered specifically to ARC with human experience of the puzzles. Notebook link.
- (37%) - An ensemble of previous prize winners. Notebook link.
- (37%) - More ensembling of previous years' approaches. Notebook link.

Top 3 papers from ARC 2024:
Runners up:
Paper: Combining Induction and Transduction for Abstract Reasoning
Shows that transductive and inductive methods are effective, solve relatively different tasks, and can be ensembled together well. The authors also create a new dataset and perform a bunch of experiments, ablations, etc.
This is the paper that introduces the ARC-Heavy (BARC) dataset, which is a massive dataset of extended ARC puzzles generated by LLMs. This is a very useful contribution to the ecosystem that many others have leveraged for their own solutions.
Induction Intuitively, induction captures the notion that a learner should first explain the training data, then use that explanation to make predictions.
Transduction Transduction instead captures the intuition that the training examples themselves should play a direct role in generating new predictions, and that successful prediction need not require an explicit explanation
That induction and transduction are complementary is the main claim of the paper. Over 400 problems, 26 were solved only by induction, 19 were solved by both transduction and induction, and 35 were solved only by transduction.

What overlap would you expect if these were completely uncorrelated? Approximating both as a ~12% solve rate, which they roughly are (45/400, 54/400), the expected overlap would be 0.12^2 * 400 ≈ 6 (as opposed to 19).
They are somewhat independent, not completely independent, and definitely not anti-correlated. Independence is definitely useful in an ensemble though.
Generated 400K synthetic ARC problems, and noted that the more synthetic problems generated, the better the overall performance.
They also note that the amount of test time compute has a logarithmic impact on the result.
Overall, they get a pretty good score on the public eval (36.5%). The method is too expensive though, and performs poorly when scaled down enough to fit in the Kaggle environment, where it only scores 19%. Synthetic data is super powerful then, it seems!
If you could customize this architecture to ARC with a bespoke NN that was much smaller, perhaps it could pay off. The authors note:
this suggests a strong payoff for smarter neural program search
Authors also highlight a problem I experienced myself. The RE-ARC verifier programs are so crazy that it is hard to use them as any sort of training data:
ReARC is implemented in a domain-specific language, and lacks natural language annotations, making it difficult to remix with LLMs
What is the line between transduction and induction? Perhaps the optimal solution here is somewhere between the two, authors suggest such an idea:
One way of implementing this idea is to do program synthesis within a language whose atomic primitives are non-symbolic, and to pretrain those primitives to encapsulate the basic atoms of core knowledge. While work has taken steps in related directions ... how to engineer and scale this idea remains open
Some sort of soft programs (non-symbolic), not dissimilar to what probably happens in reasoning / thinking LLMs, but could be more focused toward ARC.
Paper: The Surprising Effectiveness of Test-Time Training for Few-Shot Learning
TL;DR - Test time fine-tuning is effective, and boosts ARC pass rates in their setup from 17.5% to 45% when compared to just fine-tuning at training time.
The paper provides a good taxonomy of various design choices for TTT: different types of loss, shared or task-specific fine tuning, data generation approaches, augmentations for training and different approaches for voting / candidate selection, etc.
They perform a bunch of ablations that generally seem to indicate that many of the methods in this paper offer some improvement in performance for ARC:
There is also some good data on voting approaches across transformations, which are compared to an Oracle (that selects the best answer if present). The number of examples here is relatively low though, and no confidence bounds are given, so it's hard to make strong conclusions that differentiate between simple voting and hierarchical voting, but it seems clear that multiple augmentations are better than 1.
If you want to understand how various design choices for LLM TTT on ARC affect performance, this is a great place to start.
Paper: Searching Latent Program Spaces
Interview (MLST): Spotify
They introduce the latent program network, which learns a distribution over latent programs in a vector space, enabling a gradient based search for programs, as well as a "neural decoder" that can execute the programs on specific inputs.
Scores 3% on test, 9.5% on eval, and up to 46% on training (did not completely converge).
This is pretty hardcore, and conceptually very novel, all trained from scratch without pre-trained LLMs. At a high level they do the following:
- A latent program space (vectors of size 256) for ARC

Training is pretty complicated, and the paper provides a lot of the details and some clever tricks if you want them (this was also close to my skill limit), but some highlights:

- Given N sample input output pairs, they use N-1 pairs to predict one of the remaining pairs

This is ultimately a program search. By embedding programs in this vector space with an encoder, the entire thing becomes differentiable, which allows an efficient search process to happen within the space of programs using gradient descent, and this appears to be absolutely critical to the performance of the approach. One-shotting the program with the encoder performs significantly worse.
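A very rough sketch of that test-time latent search in PyTorch; the decoder, shapes and loop here are stand-ins, and the real LPN architecture is considerably more involved.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the paper's neural decoder: it "executes" a latent program z
# on a flattened input grid and predicts per-cell colors. Shapes are illustrative.
LATENT_DIM, GRID_CELLS, NUM_COLORS = 256, 4 * 4, 10
decoder = torch.nn.Sequential(
    torch.nn.Linear(LATENT_DIM + GRID_CELLS, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, GRID_CELLS * NUM_COLORS),
)
for p in decoder.parameters():
    p.requires_grad_(False)          # at test time only the latent is optimized

def decode_logits(z, x):
    """Run 'program' z on input grids x -> per-cell color logits."""
    out = decoder(torch.cat([z.expand(x.shape[0], -1), x], dim=-1))
    return out.view(x.shape[0], GRID_CELLS, NUM_COLORS)

def search_latent(example_inputs, example_outputs, steps=200, lr=0.1):
    """Gradient-based search over the latent program space: optimize z so that the
    decoded program reproduces the demonstration pairs."""
    z = torch.zeros(1, LATENT_DIM, requires_grad=True)   # or start from an encoder's guess
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        logits = decode_logits(z, example_inputs)
        loss = F.cross_entropy(logits.reshape(-1, NUM_COLORS), example_outputs.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()

# Dummy data: 3 demonstration pairs of flattened 4x4 grids.
xs, ys = torch.rand(3, GRID_CELLS), torch.randint(0, NUM_COLORS, (3, GRID_CELLS))
z_star = search_latent(xs, ys)
prediction = decode_logits(z_star, torch.rand(1, GRID_CELLS)).argmax(-1)
```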
Decoder backward pass is quite expensive, which makes the latent optimization (gradient ascent step) slow.
5 TPU days to train, not cheap but not expensive. Did not converge. Does not appear to use any of the extended task sets, and it appears they did not use the evaluation dataset when training a model against the final test set. Looks like there is headroom here.
Paper: The LLM ARChitect: Solving ARC-AGI Is A Matter of Perspective
Interview (MLST): Spotify
A great reference for test-time fine tuning!
Fine tuning an LLM, with an expanded dataset, test-time fine tuning with data augmentations, multiple candidate generation, scoring. Fine tuning for the entire test set, not per task.
The expanded dataset was big! Using all the additional datasets available, and a lot of augmentations. This led to over 500k training examples for their largest model (considering the base training + eval sets are 700/800 tasks).
Updates to the tokenizer to just represent 64 symbols. Tokenization is a problem due to e.g. number grouping, which would probably be catastrophic (12 being treated as "12", not "1", "2"). Often we just put commas or spaces between the numbers, but this doubles the token count.
Augmentations - rotations, transpositions, permutations of the colors, example order.
Using small base models, e.g. Llama-3.2-3B-instruct, Mistral-NeMo-Minitron-8B-Base.
Initially trained on Re-ARC + 75% of the public evaluation set using LoRA.
Trained further on the hidden dataset during inference (test-time training), spending around 5 hours on the Kaggle GPUs.
Provides a detailed take on candidate generation. Each token in an LLM is generated with some probability. The greedy sampling approach is to take the token with the highest probability, but this will produce just one sample. Alternatively you can sample tokens stochastically, say proportional to their softmax probabilities, and generate several candidate solutions this way.
Here they do something custom - they sample tokens stochastically, but keep track of the overall sequence probability (assumed to be the product of token probabilities), and if this overall probability drops below some threshold then the sequence is discarded. It's faster as you can leverage caches, more memory efficient, and it boosts performance over simple greedy/stochastic approaches, and generates many candidates.
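In spirit (this is my reading, not their code), one way to implement this is a depth-first exploration over token continuations that prunes any branch whose cumulative probability falls below a threshold:

```python
import math

def sample_candidates(next_token_probs, eos_token, min_log_prob=math.log(0.05),
                      max_len=20, prefix=(), log_prob=0.0):
    """Depth-first enumeration of token sequences whose total probability
    (product of per-token probabilities) stays above a threshold.

    `next_token_probs(prefix)` -> dict of {token: probability} for the next step.
    Yields (sequence, log_probability) pairs for every completed candidate.
    """
    if prefix and prefix[-1] == eos_token:
        yield prefix, log_prob
        return
    if len(prefix) >= max_len:
        return
    for token, p in next_token_probs(prefix).items():
        branch_lp = log_prob + math.log(p)
        if branch_lp < min_log_prob:
            continue                      # prune unlikely branches early
        yield from sample_candidates(next_token_probs, eos_token, min_log_prob,
                                     max_len, prefix + (token,), branch_lp)

# Toy "model": 80/20 split between two tokens, then end-of-sequence.
def toy_model(prefix):
    return {"<eos>": 1.0} if len(prefix) >= 2 else {"a": 0.8, "b": 0.2}

for seq, lp in sample_candidates(toy_model, "<eos>"):
    print(seq, round(math.exp(lp), 3))   # the 0.04-probability branch gets pruned
```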
Candidate selection leverages the augmentations. By augmenting the solution candidates, for each augmentation, you can ask again what the probability of the model generating that sequence of tokens would be.
Given a set of probabilities of the solution under different augmentations, the authors take the
a correct solution should demonstrate greater stability in its sampling probability under augmented conditions compared to an incorrect one
So they choose a solution that performs well across many augmentations.
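A hedged sketch of that selection step; the aggregation used here (summing log-probabilities, i.e. taking the product across augmentations) is my illustrative choice rather than the paper's exact scoring rule.

```python
def score_candidate(candidate_grid, task, augmentations, sequence_log_prob):
    """Score a candidate answer by how probable the model finds it under many
    augmented views of the task. `sequence_log_prob(task, grid)` should return
    the model's log-probability of emitting `grid` as the answer to `task`.

    Summing log-probs (multiplying probabilities) rewards candidates whose
    probability is stable across augmentations; one very unlikely view tanks the score.
    """
    total = 0.0
    for augment_task, augment_grid in augmentations:
        total += sequence_log_prob(augment_task(task), augment_grid(candidate_grid))
    return total

def select_best(candidates, task, augmentations, sequence_log_prob, top_k=2):
    scored = sorted(candidates,
                    key=lambda g: score_candidate(g, task, augmentations, sequence_log_prob),
                    reverse=True)
    return scored[:top_k]
```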
They were given some compute from Lambda Labs (8xH100s), and they ran a lot of experiments and variations over time. You definitely need some $$$ for this kind of approach and experimentation; I'd guess they spent on the order of $30K of compute, could be more if they had it for longer.
Paper: Solution Summary - arc24
Not wildly dissimilar to ARChitects. LLM based approach, some novel angles in here:
The insight presented at the beginning is that to learn a good representation, we should train on multiple adjacent tasks from the tasks themselves. The author lists these:
Hence "Omni”. More auxiliary tasks, better representations, is the idea, although in the end this idea was only partially executed on it seems.
The model used Qwen2.5-0.5B-Instruct as a base model, fine-tuned using LoRA on augmented datasets and some auxiliary problems as discussed. Test time fine tuning is integrated, and notably produces a fine-tuned model for each task as opposed to one for all.
Inspired by the MindsAI approach, it references some of their work: AIRV (which I can't find details on), which stands for Augment, Inference, Reverse Augmentation and Vote, and is used for candidate generation and selection. It's not explicitly written down (or I am missing it), but I will assume this means roughly:
At the end, it's ensembled with the 2020 solution (looks like Icecuber). Without the ensemble, the score is 32% (I think, the graph does not match the text).
A few follow ups and thoughts for the future from the author:
Author notes:
The approach of using a transformer and test-time fine-tuning could likely keep improving and maybe solve the ARC prize if we generate enough synthetic data to densely cover the space of the ARC problems. However that kind of solution won't give us more knowledge about how to reach AGI
Paper: Mini-ARC: Solving Abstraction and Reasoning Puzzles with Small Transformer Models
A good place to start if you are interested in vision transformer approaches!
Uses small vision transformers trained from scratch, with TTT. It uses some novel positional encodings and color embeddings, and also explores an approach using refinement that yields some improvements.
Used a lot of the augmented datasets, RE-ARC and BARC/ARC-Heavy, as well as the author's own extended dataset ARC-HTML.
Doesn't submit an actual score to the competition, as the architecture is limited to 12x12 grids, but reports 41% on a subset of the public eval set.
Refinement - the general idea here is to allow the model to take multiple passes, initially making a guess at the output, and then feeding both the partial output and the context back in and refining it. It seems to provide a small but probably significant boost to performance, and is not seen in many of the approaches.
The paper runs into the general context length issue mentioned earlier: for larger puzzles you have 30x30 grids, so given 3 example pairs and 1 test pair that's 7x30x30 tokens minimum (6.3K), which, without a customized tokenizer, can easily double due to separators, newlines etc. You need to fit this all in memory too.
Paper: Towards Efficient Neurally-Guided Program Induction for ARC-AGI
The authors start with the general statement of DL's deficiencies when it comes to open domains, and generalizing outside of the training set distribution - which is particularly relevant for ARC.
In the context of ARC they say that "This encourages research on extending the out-of-distribution generalization capabilities of learning systems". This is a super important aspect of ARC in general and is key to succeeding in the benchmark.
They cover a couple of different approaches and experiments, and one that is not explored but proposed:
Approach 1: Learning the grid space
Approach 2: Learning the program space
Approach 3: Learning the transform space
The program search component is absolutely crucial to performance.
Actual results are low, 5.8% on public eval overall, but this was more of an exploratory paper, and on their subset of solvable tasks they get up to 80% ish.
This is a good quote from the authors that I've been thinking about a lot recently:
ARC-AGI is fundamentally not a grid-to-grid problem ... most of the steps involved in solving an ARC-AGI task are not strictly grid-to-grid transformations and thus do not result in an intermediate grid that can be evaluated for its similarity to the target ... instead can be thought of as a sequence of intermediate program states which themselves are lists of various kinds of intermediate objects and values. As such, the grid-to-grid similarity concept only potentially allows guiding a small portion of the overall search for a solution.
Quite a lot of effort went into the DSL. It is surprising to me just how many of the approaches seem to hand-code a DSL, rather than trying to learn one, or use an existing one.
Authors feel that their main limitation here is the DSL, noting that it's too high level and specific.
Paper: A 2D nGPT Model For ARC Prize
A small (42M param) 2D transformer, with test time training. Built on top of normalized transformers (nGPT).
A grid cell attends to all cells in the same row and same column. They use rotary position embeddings (RoPE) here. They also experimented with adding a single attention across the whole grid, but this didn't pan out.
Restrict themselves to constant size tasks, of which there were only 262/400. They extended the dataset using RE-ARC, and also increased the training set using invertible transformations.
One issue highlighted during TTT is that of task blurring:
For instance, in the task of figure 2 above, holes are filled with yellow. If we apply color permutations, then we could get pairs where holes are colored with, say, cyan. This prevents the model from learning which color has to be used for filling holes
This seems like it would be an issue for many augmentation approaches, although this is the first place I've seen it mentioned so far. No solution is offered, the additional augmentations outweigh the blurring issue.
No official score provided; 26% on the constant sized evaluation tasks, most likely < 15% on the hidden test set if they had managed to run it. Could be a useful reference for any 2D / vision approaches, and similar overall in spirit to Mini-ARC.
Write-up: Getting 50% (SoTA) on ARC-AGI with GPT-4o
The idea here is quite simple - get GPT-4o to generate thousands of Python programs that attempt to implement a transformation for each puzzle, with some selection process.
Many (~8K) Python programs are generated for each problem.

The results on the public eval set were impressive and the best at the time, getting 50%, which, if you ignore the program compute limits, was effectively SOTA[9].
Again, noting that this was pre-O1, it was a very clear example of the logarithmic returns one can get from the inference time scaling paradigm.
It's also different from other program search approaches in that it's using a complete and general language - Python. This makes a lot of sense when using a pre-trained language model, which has already been trained on an incomprehensible amount of Python code.
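The core loop is conceptually just generate, run, filter; roughly as below, where `llm_generate_program` is a placeholder for however you prompt GPT-4o, and the sandboxing plus the many prompting tricks from the write-up are omitted.

```python
def passes_examples(program_source, examples):
    """Execute a generated Python program defining `transform(grid)` and check it
    reproduces every demonstration pair. (A real harness would sandbox this.)"""
    namespace = {}
    try:
        exec(program_source, namespace)
        transform = namespace["transform"]
        return all(transform(x) == y for x, y in examples)
    except Exception:
        return False

def solve_task(llm_generate_program, examples, test_inputs, num_samples=8000):
    """Sample many candidate programs, keep those that fit the examples, and apply
    one of the survivors to the test inputs."""
    survivors = []
    for _ in range(num_samples):
        source = llm_generate_program(examples)      # placeholder: prompt the LLM
        if passes_examples(source, examples):
            survivors.append(source)
    if not survivors:
        return None
    namespace = {}
    exec(survivors[0], namespace)                    # e.g. take the first passing program
    return [namespace["transform"](x) for x in test_inputs]
```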
Author gives a pretty contra-Chollet take:
I think ARC-AGI will be one benchmark among many that just gets solved by scale
I'm not sure where people stand on this now, though, as attention shifted towards O1-like approaches. The costs spent on this at the time were high (someone privately shared with me they estimated around $10K to run this against the 400 public eval tasks).
It's also a nice example of program induction, and leveraging LLMs for that. Despite its conceptual simplicity or lack of fine-tuning (although there are tons of tricks in the execution of this to get it to work well), the results are very good and this is an important point in the history of ARC.
Paper: ARC-solution/ARC-solution_documentation.pdf
The OG discrete program search approach for ARC.
Somehow this work seems to be both completely aligned with, and completely antithetical to - the bitter lesson. Just search, but with a very well constructed (by a human) DSL.
There's not a huge amount to explain here and the paper/docs are very short. They wrote efficient C++ code for 142 unary functions (notably derived from 42 n-ary functions). They then search the space of programs up to depth 4: first depth 3 programs, then depth 3 programs with diagonal flip augmentations, and then depth 4 programs until they run out of time.
Solutions are selected based on accuracy on the samples, again following the principle that shorter solutions (lowest depth) are better. A few tricks in here are seen elsewhere too:
There is some procedure at the end that I don't quite follow around combining "pieces", stacking image lists, moving things to origin, resizing for output grids.
This solution highlights a few things for me:
The first of these is sometimes overlooked by others, and we'll see program search approaches in Python elsewhere. DSLs are important (probably an uncontentious claim), but the best way to win with DSLs is to get a human to make one. Augmentations come up over and over again in nearly all solutions, both LLMs and discrete program search, and it's nice that this came up all the way back in 2020.
Paper: DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning
Explanatory video: https://www.youtube.com/watch?v=qtu0aSTDE2I
Covering this as background for DreamCoder based ARC approaches.
DreamCoder performs inductive programming. Given a list of examples, it will attempt to generate a program that produces the demonstrated behavior (so, a perfect fit for ARC!). This is not ARC specific, it's a general approach for program synthesis.
One of the key aspects of DreamCoder is its wake-sleep algorithm. DreamCoder builds up a library of concepts, smaller programs that do specific things.
With a library of programs, during the wake phase you can search for discrete programs that solve your problem. This is a search problem: you can combine and compose functions from the library, or construct entirely new functions, to create a large search tree of programs that may or may not solve the task.
DreamCoder uses a learned model (the recognition model, like a policy network) to guide this search process.
Shorter programs are considered to be better, a result of Solomonoff's theory of inductive inference.
During the sleep/abstraction phase, the library of concepts is grown, with the intention of compressing programs. Commonly used sub-programs are pulled out as concepts, effectively programs are (re)factorized. The overall length of the library programs + the length of the solution programs is minimized.
During the sleep/dreaming phase, the goal is to train the model that drives the search process. You want to know what might be the best programs to explore for a given task.
There are two sources of training data here that are mixed together:

- Programs p drawn from the library randomly, and executed given an input x to produce an output y.
- Programs p found during the wake phase, together with an x, y pair.

The goal of the recognition model is to predict useful programs p from a given input output pair x, y.
Wake, sleep (abstract, dreaming), repeat.
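A rough sketch of how the dreaming data can be generated; the library format and program representation here are simplified stand-ins for DreamCoder's actual machinery.

```python
import random

def sample_program(library, max_len=3, rng=random.Random(0)):
    """Draw a random short composition of library primitives."""
    return [rng.choice(library) for _ in range(rng.randint(1, max_len))]

def execute(program, x):
    for fn in program:
        x = fn(x)
    return x

def generate_dreams(library, input_sampler, n=1000, rng=random.Random(0)):
    """Fantasy data for the recognition model: sample a program p, run it on a
    random input x, and record (x, y, p) so the model can learn to predict p from (x, y)."""
    dreams = []
    for _ in range(n):
        p = sample_program(library, rng=rng)
        x = input_sampler(rng)
        try:
            dreams.append((x, execute(p, x), p))
        except Exception:
            pass        # some random programs just fail; skip them
    return dreams
```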
Paper: Neural networks for abstraction and reasoning: Towards broad generalization in machines
This is a DreamCoder based approach, where the authors provide:
It's the best DreamCoder approach at the time of the paper, but its performance is still fairly modest, reporting 18/400 (4.5%) on the ARC-AGI-1 eval set, beaten by stock LLMs like GPT-4, and smashed by the winning 2020 brute-force DSL search (Icecuber) approach (fast search is still beating smart search, for now!).
The recognition model is customized, and they elect for an image based feature extractor using a small conv net like architecture, given the 2D, image like properties of ARC tasks.
They also have to deal with the fact that there are actually several images in an ARC task: inputs, and outputs. At a high level, they compute feature vectors from input and output grids, compute the difference, then average across samples to create a 256-dimensional feature vector for an entire task.
They introduce another DSL for ARC called PeARL, with relevant primitives such as grids and colors, where programs map from grid > grid. This doesn't appear to be the most complete DSL, and I'm not sure why they didn't use e.g. ARC-DSL.
The authors note that:
The design of primitives for PeARL has a big effect on the system's performance and is perhaps the most critical implementation detail.
As we've seen elsewhere, the DSL is extremely important.
They end up with a fairly substantial 77 primitives! This includes things like flips, rotations, transpositions, cropping, stacking, filling holes, counting colors and pixels, etc etc.
I suspect there is a lot of headroom here, improvements to the DSL, recognition model, etc etc. DreamCoder as a baseline architecture still feels like a good candidate for smart-search that actually works, but execution is everything.
Paper: ARC-AGI Without Pretraining | iliao2345
A recent, and conceptually interesting take on ARC. The paper is very theoretical at times and I was not able to follow it fully in depth, but it does influence some of my conceptual takes above.
By framing ARC as a compression problem, the authors show an approach to the benchmark that does no pretraining whatsoever, and get 20% on the public evaluation set.
So for each task:
Intuitively, attempting to encode all the grids into a neural network's parameters with the correct loss results in some form of compression. That compression is forced to capture relationships between input and output grids, which are then useful for actually solving the problems.
How do you optimize for compression? The authors lay out why and how with a few tricks, and this seems in the end to boil down to a KL divergence penalty on the network parameters against a fixed normal distribution.
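To make "optimize for compression" slightly more concrete, here's a generic sketch of a reconstruction-plus-KL objective in PyTorch, treating a layer's weights as Gaussians penalized towards a fixed standard normal. This illustrates the general idea only; CompressARC's actual formulation differs in its details.

```python
import torch
import torch.nn.functional as F

class GaussianLinear(torch.nn.Module):
    """A linear layer whose weights are Gaussians; their KL to N(0, 1) acts as a
    description-length penalty on the parameters."""
    def __init__(self, n_in, n_out):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(n_out, n_in))
        self.log_sigma = torch.nn.Parameter(torch.full((n_out, n_in), -3.0))

    def forward(self, x):
        w = self.mu + self.log_sigma.exp() * torch.randn_like(self.mu)  # reparameterize
        return x @ w.t()

    def kl(self):
        # KL( N(mu, sigma^2) || N(0, 1) ), summed over all weights.
        var = (2 * self.log_sigma).exp()
        return 0.5 * (var + self.mu ** 2 - 1 - 2 * self.log_sigma).sum()

layer = GaussianLinear(16, 16 * 10)     # toy: flattened 4x4 grid -> per-cell color logits
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x = torch.rand(8, 16)                   # stand-in grids
y = torch.randint(0, 10, (8, 16))

for _ in range(100):
    logits = layer(x).view(8, 16, 10)
    loss = F.cross_entropy(logits.reshape(-1, 10), y.reshape(-1)) + 1e-3 * layer.kl()
    opt.zero_grad(); loss.backward(); opt.step()
```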
Given the lack of pre-training, this is basically nearly entirely test time adaptation. Not 100%, because there is clearly a ton of priors and bias fed into the neural network architecture that is specifically related to ARC. It's clear from all the other solutions that test-time adaptation is necessary, but is it really sufficient?
The idea that compression is related to intelligence is not new, and has been championed by M. Hutter, with, for example, the Hutter Prize. Here's a passage from the authors, which notably calls out several other theories referenced in here that are relevant to other approaches:
This equivalence between intelligence and compression has a long history. For example, when talking about intelligent solutions to prediction problems, the ideal predictor implements Solomonoff Induction, a theoretically best possible but uncomputable prediction algorithm that works universally for all prediction tasks. This prediction algorithm is then equivalent to a best possible compression algorithm whose compressed code length is the Kolmogorov Complexity of the data. In our work, we try to approximate this best possible compression algorithm with a neural network. A related measure of complexity is known as the Minimum Description Length.
You could imagine a derivative benchmark to ARC-AGI, inspired by the above and the Hutter Prize, that would require people to construct an algorithm that optimally compresses all the grids across all 120 problems. I've not seen any such derivative benchmarks yet, but investigating optimal compression of the ARC problem grids does seem like it could yield valuable insights.
Note: I read this before covering the older paper Object-centric Models and the MDL Principle below which builds on similar ideas and goes after compressed grid representations to solve ARC.
Paper: Graphs, Constraints, and Search for the Abstraction and Reasoning Corpus
ARC tasks are 2D arrays of pixels. But when a human looks at a task, they don't "see" pixels. For most tasks you almost immediately perceive the grids as a set of objects. Object-centricity is a critical part of how humans solve ARC problems.
So there is a core problem of representation here. What is the best representation for the grids? It looks like there are some very inexpensive improvements to the representation that we can get, e.g "Doubling o1-preview performance on ARC-AGI with one simple trick". Similar approaches explored by Agemo here, and most in depth by the paper covered next - Object-centric Models and the MDL Principle.
The authors here choose an object centric graph representation, and then effectively search through a DSL of programs that operate on these graphs.
The representations are generally smaller[10], and this helps in reducing the search space.
There is a notable observation that the procedure to generate the graph representation from the pixels is 1-many. There are multiple different "interpretations" of the grid (think e.g the classical duck/rabbit illusion):
defining multiple abstraction processes improves ARGA's ability to correctly capture object information from the input images
Here's roughly what they do then:
ARGA is currently implemented in Python while the Kaggle solution is implemented in C++
This is crazy! I get it, but the performance gains left on the table here are massive, depending on their exact implementation, possibly 10-100x or more if this isn't multi-threaded.
It doesn't look like they actually posted an official score, but when compared against Icecuber on an object-oriented subset of the (presumably) public evaluation problems, it scores 35.6% compared to 40%. How a solution like this would generalize to some of the other non-object-centric problems is going to be part of the challenge, but could be side-stepped with some ensembling in the short term.
Paper: Tackling the Abstraction and Reasoning Corpus (ARC) with Object-centric Models and the MDL Principle
Based on previous work that shows that humans often solve ARC problems in an object-centric way.
Related to the ideas discussed around compression, the authors use the minimum description length principle to parse the grids into an object centric representation, with the intuition that "the model that best describes the data is the model that compresses them the most".
Their object centric representation of grids ends up looking something like the following:
Layers(Vec(12,13), black, [
  Layer(Vec(2,4), Colored(Rectangle(Vec(2, 2), Full), yellow)),
  Layer(Vec(1,3), Colored(Rectangle(Vec(4,4), Full), red))
])
These models can be used in a few different ways, to predict output grids, to describe a pair of grids jointly, and also to even create a new pair of grids for a task.
By finding the most compressive representation of the grid pairs, they are effectively learning about the structure of the task itself - compression leads to understanding, and this is highly related to the work in CompressARC.
This is only a light summary, the paper is quite dense, I need to spend more time on it, and it doesn't lay out all of the details of how their system works, but conceptually this is a very interesting idea.
One relevant discussion I will highlight related to this work is that of ViktorTaelin; there is a good summary on Reddit here.
All the program synthesis approaches today work on CPUs. What if we could leverage GPUs for this process? It could offer a massive increase in performance, not necessarily a change in complexity, but certainly in brute-force speed, as we can simply access more FLOPs. And right now, simple fast search, rather than intelligent / smart search, is the winning approach within program synthesis.
I follow Viktor, and while he is obviously incredibly smart I struggle to keep up with what he is talking about most of the time. But, he talks about ARC, and seems to be investigating leveraging his company's HVM system - a general purpose functional program evaluator (based on something called interaction nets, the best explanation of which I have found is here) that can be automatically optimized for GPUs so you don't have to write CUDA kernels - to attempt program synthesis on ARC.
I'd love to see where this goes. Running, evaluating, searching through programs on GPUs could be a game changer, but I haven't seen anything concrete yet.