Exploring LLM performance on the ARC training dataset

2024-11-14 / 12 min read

TLDR: I tagged and described 200/400 ARC training dataset tasks, merged it with evaluation data for some LLMs, did some analysis on it, and put it all on a site so you can explore it.

ARC is a hard problem, and exploring the dataset first felt like the obvious starting point. The site originally started as a way for me to just explore and take notes on the problems, but that turned into completing a taxonomy of tags and descriptions and a few other things I got carried away with.

You can explore the tagged problems here yourself, download the raw JSON file, or read on if you want some more details on the dataset, how I went about this and some basic analysis on it.

If you are interested in helping to complete the tagging and descriptions for the remaining 200 problems, please reach out to me on twitter, or you can contibute directly to it here.

Building the ARC explorer #

Understanding LLM performance #

My main objective with this project was to understand where LLMs perform well [0] and where they perform poorly. Thankfully there are notebooks for GPT-4, Claude Sonnet, as well as some of the other public leaderboard submissions such as Icecuber 2020.

Sadly I don't have O1 access, so if anyone wants to help produce a submission.json file for O1, I would be very grateful, and bonus points if you can drop it in with the other submissions here.

Computable properties #

There are a number of computable properties that I could programmatically generate such as:

  • The grid size - I broke this down into small, medium, large based on the average number of input sample cells, with boundaries of 64 and 256.
  • Whether they have consistent input sizes, output sizes, or whether the input size always matches the output.
  • Other properties are also calculated, like the number of distinct colors used, if the input colors match the output colors, etc. These are all in the raw JSON file, but not all of them are in the site itself.

Solution descriptions #

Ultimately there were only limited things I could easily compute, and I couldn't find a way to avoid [1] doing descriptions and tags manually.

The descriptions of the problems are trying to be a human like solution to the problem. This is how my brain worked, and not necessarily how a computer might solve them.

Coming up with ways to actually describe the solution in a concrete way was much harder than I had imagined, these are not things that one usually uses words to describe, and I suspect that many of my descriptions will be completely incomprehensible to someone who isn't me.

The complexity of the solution procedure varies quite significantly across problems, even on the training set which is supposed to be easier than evaluation or test, some of the problems require a very long sequence of steps to solve, here's an example description:

Identify the several disconnected, same colored objects that have cell values only on the edge of the outline of a perfect square (which may also be 1x1) and are symmetrical in both axes and may extend beyond the edge of the input grid. The output grid has the size of the largest such outline, and its background color is the same as the background (most common) color of the input grid. Each of the identified outline shapes is copied into the output grid and centered in the middle.

Or sometimes, they are much simpler:

Draw a red outline around any horizontally or vertically connected 2 green cells.

Tagging #

Whilst there are some problem types that recur several times, some of which are almost identical, that was usually not the case. There are still problems which I don't know how to tag properly.

I ended up with 37 tags, which evolved considerably as I went through. This took a few second passes, and then some grouping.

The full list I eventually come up with is below, within groups and with a short description of each both are certainly imperfect. I drew inspiration from David Bonin's Task Tagging Notebook on Kaggle, but ended up with a slightly smaller and more focused set.

grouptagdescription
objectsrectanglerequires identifying or producing rectangular objects
linerequires identifying or producing line objects
outlinerequires identifying or producing rectangular outlines
irregularrequires identifying or producing irregular objects
overlappingrequires recognizing objects that overlap but are distinct
multicolorrequires identifying objects that are composed of multiple colors
diagonal-linesrequires identifying or producing diagonal lines
grid-layoutsrequires recognizing that objects are laid out in some larger grid
transformationscopycopy one region to another
layeringrequires understanding of in front or behind
fillingrequires filling some shape
rotationrequires rotating a object
flippingrequires flipping an object
translationrequires translating/moving an object
scalingrequires scaling an object
recolorrequires changing the color of an object
draw-linesrequires drawing a line
proceduressearchrequires identifying some specific cell or object with a certain property, sometimes multiple
agentic-programcan be solved with a program like solution that has conditional behaviours, e.g draw a line in a direction until some condition is met
convolutional-programcan be solved by some fixed program that is convolved across the entire grid, e.g the same rule can be applied to each cell based on some fixed relative set of neighbors
for-eachrequires applying a procedure repeatedly multiple matching objects or cells
alignmentrequires aligning objects so that some cells or specific properties of it are aligned, e.g copying a pattern such that the blue cells overlap some other blue cells on the grid
orderingrequires ordering some objects according to some property to produce the output
invariancesscalerequires applying a rule at multiple scales
orientationis invariant to the orientation of the grid
conceptssizerequires understanding of sizes of objects, cell count, volume, width, or height
max-minrequires understanding or computing maximums and minimums of other properties
topologyrequires understanding of topology, this mainly covers toroidal shapes
symmetryrequires understanding of symmetry, or where symmetry is used to solve the problem
relative-positionrequires understanding of the relative position of different objects, e.g left of, right of, above, below etc, or mapping a relative position between two different objects/grids
containmentrequires understanding the difference between inside and outside
adjacencyrequires understanding of if objects are adjacent or touching
countingrequires counting some property usually part of an order or min/max, e.g number of cells of a color
novel-propertiesrequires recognizing novel properties of objects that aren't covered by any other explictly tagged concept
otherdata-dependent-gridthe grid size is dependent on the input cell colors and layout
multi-sample-mappingrequires learning some arbitrary mapping between colors/shapes/sizes etc from multiple samples
pattern-completiondetermine what a pattern is and and complete it, e.g fill in the missing cells

Colors #

The colors used in the images (and therefore also in some of the descriptions) are not the same as those on the ARC website. I used a different color palette for no particular reason, but now that the descriptions reference them, here is the full mapping:

colornamenumber
zero0
blue1
green2
red3
yellow4
purple5
cyan6
orange7
gray8
teal9

Initial analysis #

There are probably limited, non-obvious insights to draw out here from the questions I've asked so far. Bigger grids, grids with more tags, and longer descriptions are all harder for the models to solve, unsurprisingly.

The tags have the potential to be an interesting signal, although I believe this would greatly benefit from some additional data and tagging the remaining 200 problems to reduce the uncertainty a little bit.

Overall LLM performance #

The performance of the 3 evaluated models here is much better than their scores on the public eval leaderboard, and that isn't a surprise as the training set is acknowledged to be much easier than the evaluation set. Otherwise the relative performance is in-line, which is a reasonable assurance that nothing went horribly wrong here.

solveraccuracy
gpt-4o80/400 (20.0%)
claude-35-sonnet121/400 (30.25%)
icecuber-2020201/400 (50.25%)

Grid sizes #

For all 3 models, the bigger the grid, the worse the performance. gpt-4o seems particarly affected by the input grid size, with a 7.937% accuracy on large (> 256 cell) grids, compared to 39.683% on small (< 64 cell) grids. On the small grids, gpt-4o is not far off sonnet at all. icecuber-2020 is the least affected by grid size, but there is still an effect.

It's unlikely it's the grid size itself that is the problem, but rather the fact that more complex problems inevitably have to be put on larger grids, so I'm intepreting this as a proxy for problem complexity.

solverlargemediumsmall
gpt-4o5/63 (7.937%)25/210 (11.905%)50/127 (39.37%)
claude-35-sonnet12/63 (19.048%)51/210 (24.286%)58/127 (45.669%)
icecuber-202025/63 (39.683%)94/210 (44.762%)82/127 (64.567%)

Tag accuracy #

The tags are quite sparse, and the margin of error on these numbers is pretty high so take it with a pinch of salt. I've only included tags with at least 5 problems. The accuracy numbers are based on whether either gpt-4o or sonnet correctly solved the problem, which has a baseline accuracy across all problems of 32.75%.

I won't draw any conclusions from this. As you would expect (and I hoped), some of these tags seem to impact performance quite a lot. But, it's hard to tease out what is that tag itself versus some other correlations with the presence of that tag.

tagaccuracy (gpt4o|sonnet)
object:overlapping5/9 (55.556%)
other:pattern-completion9/17 (52.941%)
transformation:flipping8/18 (44.444%)
procedure:convolutional-program11/25 (44.0%)
transformation:recolor10/27 (37.037%)
object:grid-layouts6/17 (35.294%)
object:rectangle15/43 (34.884%)
other:multi-sample-mapping3/9 (33.333%)
concept:max-min2/6 (33.333%)
concept:relative-position8/26 (30.769%)
other:data-dependent-grid5/17 (29.412%)
transformation:copy12/42 (28.571%)
concept:counting7/27 (25.926%)
object:diagonal-lines2/8 (25.0%)
object:outline5/20 (25.0%)
object:line9/37 (24.324%)
concept:containment4/17 (23.529%)
transformation:translation3/14 (21.429%)
procedure:agentic-program7/35 (20.0%)
transformation:draw-lines5/26 (19.231%)
procedure:search6/35 (17.143%)
procedure:for-each3/18 (16.667%)
concept:novel-properties3/18 (16.667%)
object:multicolor5/30 (16.667%)
concept:symmetry1/7 (14.286%)
transformation:scaling2/15 (13.333%)
object:irregular6/46 (13.043%)
invariance:scale1/8 (12.5%)
procedure:alignment1/9 (11.111%)
transformation:rotation1/9 (11.111%)
transformation:layering2/18 (11.111%)
invariance:orientation0/7 (0.0%)
concept:adjacency0/7 (0.0%)

Tag counts #

More tags generally means there is a bit more going on in the problem, requiring multiple concepts or processes to be composed in some way. This seems to turn out to be a pretty good signal of overall accuracy with the accuracy generally trending much lower for those problems with more tags, while the single tag problems are solved at a rate of 54.167% which is much higher than the baseline.

# tagsaccuracy (gpt4o|sonnet)
113/24 (54.167%)
219/45 (42.222%)
316/42 (38.095%)
47/41 (17.073%)
54/23 (17.391%)
63/12 (25.0%)
71/9 (11.111%)

Description length #

Short descriptions are solved 52.308% of the time, compared to 21.739% for long descriptions. Interestingly, medium or long descriptions don't have much of a difference, which perhaps says something about the quality of my descriptions.

description lengthaccuracy (gpt4o|sonnet)
short34/65 (52.308%)
long15/69 (21.739%)
medium14/66 (21.212%)

Grid properties #

Well it turns out these don't have much variance in between them at all, but as they are on the site, I'm including them for completeness.

grid propertiesaccuracy (gpt4o|sonnet)
non-matching-input-output-grid48/138 (34.783%)
inconsistent-input-grid64/190 (33.684%)
consistent-output-grid70/210 (33.333%)
data-independent-output126/383 (32.898%)
inconsistent-output-grid61/190 (32.105%)
consistent-input-grid67/210 (31.905%)
matching-input-output-grid83/262 (31.679%)
data-dependent-output5/17 (29.412%)

Next steps #

The obvious question to ask is - what happens when you feed these descriptions and tags into an LLM as additional context for solving the problems. I'm not hopeful that they will change much, but it's not too hard to ask the question (update: I did this, it brings Claude performance up to 35.25% from 30.25% when testing against the 200 described tasks).

I'd like to complete the tagging process, and extend to the evaluation dataset once ARC 2025 is released when I find the time, although I'd estimate this is around 10 hours or fairly gruelling effort.

There are many more questions I'd like to answer from the data. E.g looking at compressibility of the grids, or the conditional compressibility (e.g len(zip(input, output)) / len(zip(input)) or something like that). Hopefully others would be interested in taking their own analysis further too.

Getting a few more models in here would be nice, again with the obvious caveats of it being a public dataset.

And finally, as others have tried, seeing how models fare at trying to predict the tags themselves.

And of course, trying to take a shot at ARC itself...

reply via email follow me on twitter

Notes

  • [0]: There are obvious problems with scoring these models against the training (or any public dataset).
  • [1]: I did initially explore a suggestion from the community to try to use an LLM to describe solutions based off of re-arc verifiers, but this wasn't very successful, and after looking at the code for the verifiers, it was not surprising that it didn't work.
< lewish