Teaching Robots to Remember

The Config Team · 5 min read

Many real-world tasks require a robot to remember. A robot playing a musical phrase needs to know what it just played a beat ago, a robot tracking a shuffled cup needs to follow it through occlusion, and a robot tidying a desk needs to recall where things started. None of these can be solved by reacting to the current observation alone, because the information needed to act is no longer visible by the time the robot must act. As we push robots toward longer and more realistic tasks, memory stops being optional.

Yet today's VLAs mostly operate on short horizons. They can react to what they see now, or to a brief window of recent frames, but the tasks we care about here require remembering events that happened tens of seconds or even minutes earlier. Extending memory to these horizons is not just a matter of widening the context window: the model has to learn what to remember out of a long, redundant stream of observations, and most robot data provides no signal for this since it was collected on tasks where the current observation is already enough.

CFG-1 model architecture
CFG-1: an autoregressive transformer with a linear action head and a single forward pass per action, trained on memory-demanding tasks.

At Config, we believe that capable robot foundation models come from two things: the right architecture, and enough data that genuinely demands the capability you want. Get those right, and the capability emerges from training. Prior works on memory have proposed various architectural and framework-level solutions. We take a simpler one. Our first-generation foundation model, CFG-1 (see the figure above), is an autoregressive transformer with a linear action head and a single forward pass per action. We train it on data that genuinely requires memory to solve. Given the right substrate and the right signal, the model learns to use memory on its own — and below, we show three tasks where it does.

Playing Melody with Sequential Memory

Sequential memory: playing the first 8 (and 12) notes of 'Runaway' on a piano.

Our first task tests sequential memory: we train the robot to play the first N notes of "Runaway" on a piano, then evaluate at N=8 and N=12. From the robot's perspective, each moment looks almost identical to the last — a gripper hovering above a keyboard — so the only way to know which note comes next is to remember which notes have already been played. A memoryless baseline gets stuck pressing the same key repeatedly, unable to track what it has already done; our model succeeds on 70% of 8-note trials and 40% of 12-note trials.

Winning Shell Game with Object Memory

Object memory: tracking a hidden ball through repeated cup swaps across consecutive rounds.

Our next task tests the model's ability to track a hidden object through repeated state changes — the classic shell game. The video above shows the robot correctly identifying the ball three times in a row, and again after a fourth round with an increased number of swaps. Overall, the model succeeds in 21 out of 24 trials (87%) and demonstrates strong generalization in two ways. First, although it was trained on sequences with up to three swaps, it continues to correctly track the ball when the number of swaps is increased to four or five. Second, despite being trained only on single-round episodes, the model maintains tracking across multiple consecutive rounds in one continuous rollout, re-anchoring on each new ball placement as the game restarts.

Rearranging Objects with Spatial Memory

Spatial memory: restoring a desk's original configuration after a disturbance.

Next, we evaluate spatial memory over long horizons. The task begins with a tidy desk, where multiple objects are placed in designated positions. A person then disturbs the scene, and the robot must restore the original configuration.

Unlike the piano-playing and shell game tasks, there is no single cue to track. Instead, the robot must remember the entire scene through the disturbance and reproduce it minutes later. The model succeeds in 14 of 20 trials (70%), each with a different initial configuration and disturbance pattern.

As with the shell game, the consecutive demonstration shown above is technically out-of-distribution: our training data consists only of single-round episodes. Despite this, the model correctly anchors on the most recent target configuration and restores it across consecutive rounds.

What's Next?

Long-horizon memory takes more than a clever architecture — it takes the right data, and the right data is the harder ingredient to get. That's the work we focus on at Config: building the dataset that lets long-horizon memory emerge from training, so robots can handle the tasks that matter in homes, factories, and care facilities.

If you're interested in our mission and vision, we'd love to hear from you at forms.config.inc/contact.

We're also planning to open-source the memory dataset in the near future, so stay tuned — for that, and for more from us soon.