This is a very short write-up of some experiments I've been running that have not yielded success yet, but that I believe still have potential. By writing this up, I hope it is either useful to someone who wants to continue the work, or that someone will offer me some guidance.
With O1, O3, and DeepSeek R1 we have seen reasoning emerge through RL training on verifiable domains, such as maths. This is exciting, as it seems to hint at the potential for AlphaZero-like growth in capability from just a form of "self-play". It seems like all we might really need are verifiable problem-answer sets, not necessarily annotated, expert chains of thought.
This research has been replicated in some small domains, even with small models. Everyone is asking why it's working now, because it has certainly been tried before. Perhaps the base models are now good enough, maybe it's something else; I don't know, but plenty of people are talking about it on X.
The first example I saw of this was 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient, where they take the Qwen2.5-7B base model and, using 8K examples and rule-based rewards on maths problems, see a big increase in reward and, notably, the emergence of long chains of thought.
The second time I saw this was with TinyZero (twitter thread summary), where, even more simply, they use a similar approach to teach the model to think/reason about the countdown numbers game and observe the same kind of emergent reasoning.
It starts to look like the minimum we need to have a chance of some success here is a domain with verifiable problem-answer sets and a sufficiently powerful base model.
The ARC-AGI problem set is verifiable. The training set provided by the competition contains only 400 problems; however, we have two datasets from last year's competition, REARC and BARC, that can increase the amount of data we have to work with and hopefully bring that number up significantly. A combination of these, or perhaps just BARC alone, might be a sufficient dataset for RL.
All code is in https://github.com/lewish/TinyZero/tree/rearc (this commit at the time of writing).
Before customizing for ARC, I attempted to reproduce results from the SimpleRL and TinyZero repositories.
After much trial and error and quite a bit of GPU spend, I was able to see the same results in TinyZero on a couple of A100s. I discarded SimpleRL and continued with TinyZero as it was a simpler setup to get working.
I produced two generators for REARC and BARC datasets:
For REARC: for each of the 400 base problems, I sampled 4 random grid pairs as the training examples and 1 as the test example. Doing this 10 times per task gives 4K examples to train on.
For BARC: using a similar approach, I sample 2K of the 200K base problems and, for each of those, generate two problems using 4 grid pairs for training and 1 for testing, giving another 4K examples.
For both datasets, some prompt size filtering is done to ensure the prompt token size is below 8K.
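To make the generation step more concrete, here is a minimal sketch of what building one example and filtering it by prompt size could look like. This is my own reconstruction rather than the generator code in the repo: the function names (grid_to_text, build_example, make_prompt, keep_example) and the (input, output) pair format are assumptions, and only the examples portion of the prompt is shown.

```python
import random

from transformers import AutoTokenizer


def grid_to_text(grid):
    # Render a grid as space-separated rows, matching the prompt format below.
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def build_example(task_pairs, n_train=4, rng=random):
    # Sample n_train training grid pairs plus one held-out test pair from a task.
    pairs = rng.sample(task_pairs, n_train + 1)
    return {"train": pairs[:n_train], "test": pairs[n_train]}


def make_prompt(example):
    # Lay out the training pairs and the test input; the instruction preamble
    # shown in the example prompt below would be prepended separately.
    parts = ["Training Examples"]
    for i, (grid_in, grid_out) in enumerate(example["train"], start=1):
        parts += [f"Example {i}: Input", grid_to_text(grid_in),
                  f"Example {i}: Output", grid_to_text(grid_out)]
    test_in, _ = example["test"]
    parts += ["Test Input:", grid_to_text(test_in),
              "What is Output for the Test Input?"]
    return "\n".join(parts)


# Prompt-size filtering: drop any example whose prompt exceeds ~8K tokens.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")


def keep_example(example, max_tokens=8192):
    return len(tokenizer(make_prompt(example))["input_ids"]) <= max_tokens
```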
Following the other examples from TinyZero, this is what the prompt looked like for one of the BARC tasks:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: You are a bot that is very good at solving puzzles.
Infer the pattern from the given input pairs and predict the output for the test input.
Think about the reasoning process in your mind and then provide the user with the answer.
Show your work in <think> </think> tags. Return the final answer in <answer> </answer> tags, for example:
<answer>
5 7 8 9
3 2 1 4
4 2 2 1
9 9 3 2
</answer>
Training Examples
Example 1: Input
7 7 7 7 7
7 7 8 7 7
7 8 8 8 7
7 7 8 7 7
7 7 7 7 7
Example 1: Output
7 7 8 7 7
7 8 7 8 7
8 8 8 8 8
7 8 7 8 7
7 7 8 7 7
Example 2: Input
3 3 3 3 3 3 3
3 7 7 7 7 7 3
3 3 7 7 7 3 3
3 3 7 3 7 3 3
3 3 3 3 3 3 3
Example 2: Output
3 7 7 7 7 7 3
7 7 7 7 7 7 7
3 7 3 7 7 7 3
3 7 3 7 3 7 3
3 7 7 7 7 7 3
Example 3: Input
7 7 7 7 7
7 4 7 4 7
7 4 7 4 7
7 4 4 4 7
7 4 4 4 7
7 7 7 7 7
Example 3: Output
7 4 7 4 7
4 7 4 7 4
4 7 4 7 4
4 4 4 4 4
4 4 4 4 4
7 4 7 4 7
Test Input:
6 6 6 6 6 6 6
6 3 3 6 3 3 6
6 3 3 3 3 3 6
6 3 3 3 3 3 6
6 6 6 6 6 6 6
What is Output for the Test Input?
Assistant: Let me solve this step by step.
<think>
Full WandB results - https://api.wandb.ai/links/lewishco/f8fnmoi4
Using Qwen2.5-7B as the base model and the generated BARC dataset above, I first validated that the base model could actually solve some of the problems without any fine-tuning. It gets around 3% right, which is hopefully just enough of a signal to get it started. For the REARC dataset, this was only 1%.
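For reference, "solving" a problem here just means the grid inside the <answer> tags exactly matches the target. Below is a minimal sketch of how that extraction and comparison could work; it mirrors the "Extracted solution" / "Target" logs shown later, but it is my own reconstruction, not the repo's scoring code.

```python
import re


def extract_answer_grid(response):
    # Pull the grid out of the last <answer>...</answer> block, if any.
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if not matches:
        return None
    rows = [line.strip() for line in matches[-1].strip().splitlines() if line.strip()]
    try:
        return [[int(cell) for cell in row.split()] for row in rows]
    except ValueError:
        return None  # non-numeric content inside the answer tags


def is_correct(response, target_grid):
    # Exact match against the ground-truth grid.
    return extract_answer_grid(response) == target_grid
```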
Running on 8x H100s for a few hours (after tens of hours of hyper-parameter tweaking; I'm new to this), the model showed clear growth in accuracy (27.9% on the hold-out test set). However, no reasoning emerged: the response length quickly dropped to around 200 tokens on average and showed no sign of pulling back up like it did in the TinyZero countdown example (at least before I ran out of budget).
Note that the "test set" in either case here is just a hold-out of 100 problems from the generated BARC or REARC sets, so for BARC it gives no meaningful indication of performance on the actual ARC problems.
We can see what kind of CoTs the model ends up with at this point; here are a few examples:
Looking at the training examples, it seems that the pattern is to replace all the non-zero numbers with their surrounding 0s, while keeping the numbers in the center of the grid intact.
Looking at the training examples, it seems that the pattern is to replace all the 1s with the number above them, while keeping the rest of the numbers intact.
Looking at the training examples, it seems that the pattern is to replace all the 4s with 0s, while keeping the rest of the numbers intact.
Looking at the training examples, it seems that the pattern is to replace all the non-zero numbers with their row and column numbers, while keeping the zeros intact.
Looking at the training examples, it seems that the pattern is to replace all the 5s with 8s, all the 4s with 7s, all the 3s with 6s, and all the 9s with 9s, while keeping the rest of the numbers intact.
The model quickly falls into a scheme where it frames every problem as some sort of replacement rule, and then just repeats the same pattern over and over again. This is not what we want: we want the model to think about the problem and come up with a solution, not just memorize a pattern. Some more thoughts on this below.
Full WandB results - https://api.wandb.ai/links/lewishco/i2asbluy
Out of interest, what happens if we try to continue training a model that has already been taught to think? Using one of the distilled DeepSeek models, I ran the exact same setup as above. Interestingly, the collapse in the answers is even more severe here, with the model falling into a pattern of not attempting to reason at all. It's even worse than that: it decides its best strategy is simply to guess a single number:
What is Output for the Test Input?
Assistant: Let me solve this step by step.
<think>
<answer>
4
</answer>
I believe these results together hint at some problems with the dataset, discussed below.
Overall model performance by the end of training was significantly worse than with the Qwen2.5-7B base model: after 100 steps the test score is only 11.8% as opposed to 27.9%, and the response length was even shorter.
To get this running within memory limits on 8 H100s, I customized the training script with all the parameters I used in the end; see the linked repository for the specific call-outs.
The most obvious problem here is that the BARC dataset actually contains a lot of extremely trivial problems; these are the ones the base model can solve initially, and they require very little, if any, CoT. Once some of these work, the response length quickly shortens, as the model discovers this is its best (locally optimal) strategy.
Here's one such example that the model requires no reasoning at all to solve; it is very trivial, much easier than the vast majority of the problems in the actual ARC training sets:
Training Examples
Example 1: Input
9 6 2 4 1
Example 1: Output
9 6 2 4 1
9 6 2 4 1
Example 2: Input
6 7 5 9 4 3
Example 2: Output
6 7 5 9 4 3
6 7 5 9 4 3
Example 3: Input
7 5 6 4 8
Example 3: Output
7 5 6 4 8
7 5 6 4 8
Test Input:
3 2 8 5 4
What is Output for the Test Input?
Assistant: Let me solve this step by step.
<think>
</think>
<answer>
3 2 8 5 4
3 2 8 5 4
</answer>
Extracted solution:
3 2 8 5 4
3 2 8 5 4
Target:
3 2 8 5 4
3 2 8 5 4
--------------------------------
Correct answer!
There are also examples where the rule is just "replace all values with a single fixed value".
Clearly I've barely even started the actual exploration here, but I am a bit hesitant to continue without first reducing my training costs a little. Maybe this code and these examples are useful to someone, even as a somewhat negative result so far.
I've now run the same setup as above, but with the REARC dataset and back to Qwen2.5-7B as the base model. The results are much the same: the model quickly falls into a pattern of just guessing a single number.
This time, though, the output is even more hilarious, with the model resorting to "..." as its "thinking":
What is Output for the Test Input?
Assistant: Let me solve this step by step.
<think>...</think>
<answer>3</answer><|endoftext|>
Extracted solution:
3
Target:
9 3 2 3 9 9
3 2 3 2 3 9
9 9 3 2 3 9
9 9 9 9 9 9
Response length drops to a mean of 10 tokens quite quickly, and the reward score doesn't go above 0.1 (which is what it gets for getting the format right). It does solve some problems, but doesn't appear to improve much beyond 1% in ~1 hour of training, which I ended early.
Perhaps this hints that the problem is not the dataset (or BARC specifically), but the reward function.
The previous reward function simply checked that the answer was wrapped in <answer></answer> tags. This means that you get a reward of 0.1 if you simply output a single number. Instead, we can update this to only give the formatting reward if the given answer has the same grid dimensions as the ground-truth answer. As the expected grid size varies between problems, this might be able to avoid this "collapse".

The new reward function is then (a rough sketch is given below):

- 1.0 if the answer is correct
- 0.1 if the answer has the same grid size as the ground truth
- 0.01 if the answer is in the correct format, i.e. there is an <answer></answer> block
- 0 otherwise

Still no dice here: after 20-odd steps every answer is just an empty <think> </think> before producing the answer. It does, however, actually produce proper grids (even if often just a copy of the test input grid). Yet again the model comes to the conclusion that no thinking tokens are the best approach.
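As the rough sketch promised above, here is what this grid-size-aware reward could look like in code. The tier values come from the list above; the parsing and shape check are my own reconstruction under those assumptions, not the reward function in the repo.

```python
import re


def parse_answer_grid(response):
    # Extract the grid from the last <answer>...</answer> block, or None.
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if not matches:
        return None
    rows = [line.strip() for line in matches[-1].strip().splitlines() if line.strip()]
    try:
        return [[int(cell) for cell in row.split()] for row in rows]
    except ValueError:
        return None


def compute_reward(response, target_grid):
    grid = parse_answer_grid(response)
    if grid is None:
        return 0.0   # no well-formed <answer> block at all
    if grid == target_grid:
        return 1.0   # exact match
    same_shape = len(grid) == len(target_grid) and all(
        len(row) == len(target_row) for row, target_row in zip(grid, target_grid)
    )
    if same_shape:
        return 0.1   # correct grid dimensions, wrong contents
    return 0.01      # well-formed answer, wrong grid size
```

Note that a single guessed number now only earns 0.01 rather than 0.1, so the change from the previous scheme is that the larger formatting reward requires matching the ground-truth dimensions.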
Keeping the new reward function, which appears to be a net positive, I'm now just trying out PPO with a few adjustments for memory. There's no collapse in response length, which is good (could be luck, could be PPO; I'm only running this once). Here is an example think block:
After thoroughly analyzing the examples, it seems that the answer should be a table where every digit reminds unchanged except for four corners, transforming from 9 to more significant multiples of 9 while maintaining the same position for internal digits. After reevaluating my guess, I noticed the same digit patterns on even and odd columns but only in the copying pattern for even rows. Two important things noticed are missing numbers but 9 from all four corners.
This one randomly pulled sample does actually contain some backtracking, which is potentially a positive sign.
Response length still drops to around 400 tokens from an initial 600 and doesn't pull back up, and the critic score seems to cap out. I'm really not running this for very long though: about 60 epochs (~2 hours for PPO).
I still need to massively reduce costs before I can continue with this. I've got the unsloth code linked above roughly working (it completes a few training loops; that's all I validated) on a 4090 with quantization. This should run nicely on a single H100. The code needs a lot of instrumentation and cleaning up though, and I did experience some GPU segfaults.