This is a very short write-up of some experiments I've been running that have not yielded success yet, but that I believe still have potential. By writing this up, I hope it is either useful to someone who wants to continue the work, or that someone will offer me some guidance.
With O1, O3, and DeepSeek R1 we have seen reasoning emerge through RL training on verifiable domains, such as maths. This is exciting, as it seems to hint at the potential for AlphaZero-like growth in capability from just a form of "self-play". It seems like all we might really need are verifiable problem-answer sets, not necessarily annotated, expert chains of thought.
This research has been replicated in some small domains, even with small models. Everyone is asking why it's working now, because it has certainly been tried before. Perhaps the base models are now good enough, maybe it's something else; I don't know, but plenty of people are talking about it on X.
The first example I saw of this was 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, where they take the Qwen2.5-7B base model and, using 8K examples and rule-based rewards on maths problems, see a big increase in reward and, notably, the emergence of long chains of thought.
The second time I saw this was with TinyZero (twitter thread summary), where, even more simply, they use a similar approach to teach the model to think/reason about the countdown numbers game and observe the same kind of emergent reasoning.
It starts to look like the minimum we need to have a chance of some success here is a domain with verifiable problem-answer sets and a sufficiently powerful base model.
The ARC-AGI problem set is verifiable. The training set provided by the competition contains only 400 problems; however, we have two datasets from last year's competition, REARC and BARC, that can increase the amount of data we have to work with and hopefully bring that number up significantly. A combination of these, or perhaps just BARC alone, might be a sufficient dataset for RL.
All code is in https://github.com/lewish/TinyZero/tree/rearc (this commit at the time of writing).
Before customizing for ARC, I attempted to reproduce results from the SimpleRL and TinyZero repositories.
After much trial and error and quite a bit of GPU spend, I was able to see the same results in TinyZero on a couple of A100s. I discarded SimpleRL and continued with TinyZero as it was a simpler setup to get working.
I produced two generators for REARC and BARC datasets:
For REARC: for each of the 400 base problems, I sampled 4 random grid pairs as the training examples and 1 as the test example. Doing this 10 times per task gives 4K examples to train on.
For BARC: using a similar approach, I sample 2K of the 200K base problems and, for each of those, generate two problems using 4 grid pairs for training and 1 for testing, giving another 4K examples.
For both datasets, some prompt size filtering is done to ensure the prompt token size is below 8K.
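To make the generation step more concrete, here is a minimal sketch of what building one example and filtering it by prompt size could look like. This is my own reconstruction rather than the generator code in the repo: the function names (grid_to_text, build_example, make_prompt, keep_example) and the (input, output) pair format are assumptions, and only the examples portion of the prompt is shown.

```python
import random

from transformers import AutoTokenizer


def grid_to_text(grid):
    # Render a grid as space-separated rows, matching the prompt format below.
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def build_example(task_pairs, n_train=4, rng=random):
    # Sample n_train training grid pairs plus one held-out test pair from a task.
    pairs = rng.sample(task_pairs, n_train + 1)
    return {"train": pairs[:n_train], "test": pairs[n_train]}


def make_prompt(example):
    # Lay out the training pairs and the test input; the instruction preamble
    # shown in the example prompt below would be prepended separately.
    parts = ["Training Examples"]
    for i, (grid_in, grid_out) in enumerate(example["train"], start=1):
        parts += [f"Example {i}: Input", grid_to_text(grid_in),
                  f"Example {i}: Output", grid_to_text(grid_out)]
    test_in, _ = example["test"]
    parts += ["Test Input:", grid_to_text(test_in),
              "What is Output for the Test Input?"]
    return "\n".join(parts)


# Prompt-size filtering: drop any example whose prompt exceeds ~8K tokens.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")


def keep_example(example, max_tokens=8192):
    return len(tokenizer(make_prompt(example))["input_ids"]) <= max_tokens
```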
Following the other examples from TinyZero, this is what the prompt looked like for one of the BARC tasks:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: You are a bot that is very good at solving puzzles.
Infer the pattern from the given input pairs and predict the output for the test input.
Think about the reasoning process in your mind and then provide the user with the answer.
Show your work in <think> </think> tags. Return the final answer in <answer> </answer> tags, for example:
<answer>
5 7 8 9
3 2 1 4
4 2 2 1
9 9 3 2
</answer>
Training Examples
Example 1: Input
7 7 7 7 7
7 7 8 7 7
7 8 8 8 7
7 7 8 7 7
7 7 7 7 7
Example 1: Output
7 7 8 7 7
7 8 7 8 7
8 8 8 8 8
7 8 7 8 7
7 7 8 7 7
Example 2: Input
3 3 3 3 3 3 3
3 7 7 7 7 7 3
3 3 7 7 7 3 3
3 3 7 3 7 3 3
3 3 3 3 3 3 3
Example 2: Output
3 7 7 7 7 7 3
7 7 7 7 7 7 7
3 7 3 7 7 7 3
3 7 3 7 3 7 3
3 7 7 7 7 7 3
Example 3: Input
7 7 7 7 7
7 4 7 4 7
7 4 7 4 7
7 4 4 4 7
7 4 4 4 7
7 7 7 7 7
Example 3: Output
7 4 7 4 7
4 7 4 7 4
4 7 4 7 4
4 4 4 4 4
4 4 4 4 4
7 4 7 4 7
Test Input:
6 6 6 6 6 6 6
6 3 3 6 3 3 6
6 3 3 3 3 3 6
6 3 3 3 3 3 6
6 6 6 6 6 6 6
What is Output for the Test Input?
Assistant: Let me solve this step by step.
<think>
Full WandB results - https://api.wandb.ai/links/lewishco/f8fnmoi4
Using Qwen2.5-7B as the base model and the generated BARC dataset above, I first validated that the base model could actually solve some of the problems without any fine-tuning. It gets around 3% right, which is hopefully just enough of a signal to get it started. For the REARC dataset, this was only 1%.
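For reference, "solving" a problem here just means the grid inside the <answer> tags exactly matches the target. Below is a minimal sketch of how that extraction and comparison could work; it mirrors the "Extracted solution" / "Target" logs shown later, but it is my own reconstruction, not the repo's scoring code.

```python
import re


def extract_answer_grid(response):
    # Pull the grid out of the last <answer>...</answer> block, if any.
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if not matches:
        return None
    rows = [line.strip() for line in matches[-1].strip().splitlines() if line.strip()]
    try:
        return [[int(cell) for cell in row.split()] for row in rows]
    except ValueError:
        return None  # non-numeric content inside the answer tags


def is_correct(response, target_grid):
    # Exact match against the ground-truth grid.
    return extract_answer_grid(response) == target_grid
```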
Running on 8x H100s for a few hours (after tens of hours of hyper-parameter tweaking; I'm new to this), the model showed clear growth in accuracy (27.9% on the hold-out test set). However, no reasoning emerged: the response length quickly dropped to around 200 tokens on average and showed no sign of pulling back up like it did in the TinyZero countdown example (at least before I ran out of budget).
Note that the "test set" in either case here is just a hold-out of 100 problems from the generated BARC or REARC sets, so for BARC it gives no meaningful indication of performance on the actual ARC problems.
We can see what kind of CoTs the model ends up with at this point; here are a few examples:
Looking at the training examples, it seems that the pattern is to replace all the non-zero numbers with their surrounding 0s, while keeping the numbers in the center of the grid intact.
Looking at the training examples, it seems that the pattern is to replace all the 1s with the number above them, while keeping the rest of the numbers intact.
Looking at the training examples, it seems that the pattern is to replace all the 4s with 0s, while keeping the rest of the numbers intact.
Looking at the training examples, it seems that the pattern is to replace all the non-zero numbers with their row and column numbers, while keeping the zeros intact.
Looking at the training examples, it seems that the pattern is to replace all the 5s with 8s, all the 4s with 7s, all the 3s with 6s, and all the 9s with 9s, while keeping the rest of the numbers intact.
The model quickly falls into a scheme where it frames every problem as some sort of replacement rule, and then just repeats the same pattern over and over again. This is not what we want: we want the model to think about the problem and come up with a solution, not just memorize a pattern. Some more thoughts on this below.
Full WandB results - https://api.wandb.ai/links/lewishco/i2asbluy
Out of interest, what happens if we try to continue training a model that has already been taught to think? Using one of the distilled DeepSeek models, I ran the exact same setup as above. Interestingly, the collapse in the answers is even more severe here, with the model falling into a pattern of not attempting to reason at all. It's even worse than that: it decides its best strategy is simply to guess a single number:
What is Output for the Test Input?
Assistant: Let me solve this step by step.
<think>
<answer>
4
</answer>
I believe these results together hint at some problems with the dataset, discussed below.
Overall model performance by the end of training was significantly worse than with the Qwen2.5-7B base model: after 100 steps the test score is only 11.8% as opposed to 27.9%, and the response length was even shorter.
To get this running within memory limits on 8 H100s, I customized the training script with all the parameters I used in the end; see the linked repository for the specific call-outs.
The most obvious problem here is that the BARC dataset actually contains a lot of extremely trivial problems; these are the ones the base model can solve initially, and they require very little, if any, CoT. Once some of these work, the response length quickly shortens, as the model discovers this is its best (locally optimal) strategy.
Here's one such example that the model requires no reasoning at all to solve; it is very trivial, much easier than the vast majority of the problems in the actual ARC training sets:
Training Examples
Example 1: Input
9 6 2 4 1
Example 1: Output
9 6 2 4 1
9 6 2 4 1
Example 2: Input
6 7 5 9 4 3
Example 2: Output
6 7 5 9 4 3
6 7 5 9 4 3
Example 3: Input
7 5 6 4 8
Example 3: Output
7 5 6 4 8
7 5 6 4 8
Test Input:
3 2 8 5 4
What is Output for the Test Input?
Assistant: Let me solve this step by step.
<think>
</think>
<answer>
3 2 8 5 4
3 2 8 5 4
</answer>
Extracted solution:
3 2 8 5 4
3 2 8 5 4
Target:
3 2 8 5 4
3 2 8 5 4
--------------------------------
Correct answer!
There are also examples where the rule is just "replace all values with a single fixed value".
Clearly I've barely even started the actual exploration here, but I am a bit hesitant to continue without first reducing my training costs a little. Maybe this code and these examples are useful to someone, even as a somewhat negative result so far.
I've now run the same setup as above, but with the REARC dataset and back to Qwen2.5-7B as the base model. The results are much the same: the model quickly falls into a pattern of just guessing a single number.
This time, though, the output is even more hilarious, with the model resorting to "..." as its "thinking":
What is Output for the Test Input?
Assistant: Let me solve this step by step.
<think>...</think>
<answer>3</answer><|endoftext|>
Extracted solution:
3
Target:
9 3 2 3 9 9
3 2 3 2 3 9
9 9 3 2 3 9
9 9 9 9 9 9
Response length drops to a mean of 10 tokens quite quickly, and the reward score doesn't go above 0.1 (which is what it gets for getting the format right). It does solve some problems, but doesn't appear to improve much beyond 1% in ~1 hour of training, which I ended early.
Perhaps this hints that the problem is not the dataset (or BARC specifically), but the reward function.
The previous reward function simply checked that the answer was wrapped in <answer></answer> tags. This means that you get a reward of 0.1 if you simply output a single number. Instead, we can update this to only give the formatting reward if the given answer has the same grid dimensions as the ground-truth answer. As the expected grid size varies between problems, this might be able to avoid this "collapse".

The new reward function is then (a rough sketch is given below):

- 1.0 if the answer is correct
- 0.1 if the answer has the same grid size as the ground truth
- 0.01 if the answer is in the correct format, i.e. there is an <answer></answer> block
- 0 otherwise

Still no dice here: after 20-odd steps every answer is just an empty <think> </think> before producing the answer. It does, however, actually produce proper grids (even if often just a copy of the test input grid). Yet again the model comes to the conclusion that no thinking tokens are the best approach.
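As the rough sketch promised above, here is what this grid-size-aware reward could look like in code. The tier values come from the list above; the parsing and shape check are my own reconstruction under those assumptions, not the reward function in the repo.

```python
import re


def parse_answer_grid(response):
    # Extract the grid from the last <answer>...</answer> block, or None.
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if not matches:
        return None
    rows = [line.strip() for line in matches[-1].strip().splitlines() if line.strip()]
    try:
        return [[int(cell) for cell in row.split()] for row in rows]
    except ValueError:
        return None


def compute_reward(response, target_grid):
    grid = parse_answer_grid(response)
    if grid is None:
        return 0.0   # no well-formed <answer> block at all
    if grid == target_grid:
        return 1.0   # exact match
    same_shape = len(grid) == len(target_grid) and all(
        len(row) == len(target_row) for row, target_row in zip(grid, target_grid)
    )
    if same_shape:
        return 0.1   # correct grid dimensions, wrong contents
    return 0.01      # well-formed answer, wrong grid size
```

Note that a single guessed number now only earns 0.01 rather than 0.1, so the change from the previous scheme is that the larger formatting reward requires matching the ground-truth dimensions.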
Keeping the new reward function, which appears to be a net positive, I'm now just trying out PPO with a few adjustments for memory. There's no collapse in response length, which is good (could be luck, could be PPO; I'm only running this once). Here is an example think block:
After thoroughly analyzing the examples, it seems that the answer should be a table where every digit reminds unchanged except for four corners, transforming from 9 to more significant multiples of 9 while maintaining the same position for internal digits. After reevaluating my guess, I noticed the same digit patterns on even and odd columns but only in the copying pattern for even rows. Two important things noticed are missing numbers but 9 from all four corners.
This one randomly pulled sample does actually contain some backtracking, which is potentially a positive sign.
Response length still drops to around 400 tokens from an initial 600 and doesn't pull back up, and the critic score seems to cap out. I'm really not running this for very long though: about 60 epochs (~2 hours for PPO).
I still need to massively reduce costs before I can continue with this. I've got the unsloth code linked above roughly working (it completes a few training loops; that's all I validated) on a 4090 with quantization. This should run nicely on a single H100. The code needs a lot of instrumentation and cleaning up though, and I did experience some GPU segfaults.