Tech Preview
The Config Team · 8 min read
Today, we're excited to share a preview of what we've been building at Config. Our mission is to make robots capable of reliably performing two-handed tasks across diverse real-world settings materially more cost- and time-efficient to deploy.
Our work is guided by two core beliefs:
[1] The primary bottleneck for general-purpose robot foundation models is data—not merely more data, but data with sufficient richness and diversity to enable real-world generalization across tasks, objects, and environments.
[2] Robot embodiments will vary by use case and cost, particularly at the end-effector level (e.g., grippers or hands). In practice, the "right" robot is the one that is both effective and economically viable for the target application.
To address the problem [1] in a way that remains robust under the hypothesis [2], we built our product (see our Product page for an overview). At a high level, our system is designed to enable rapid adaptation to new tasks and target robots. Fig. 1 illustrates this end-to-end workflow:

The system illustrated in Fig. 1 is built on two core ideas:
1. Start with humans to scale action data quickly and economically.
Rather than relying solely on robot teleoperation—which is expensive and operationally heavy to scale—we use human-performed actions as the primary source of training signals, enabling efficient collection of large-scale data of quality (see Fig. 2 below; to date, we've collected ~100k hours of human action data).

This dataset is used to train CFG-1, our first-generation pretrained foundation model for bimanual manipulation:

As we scale throughput of the data, we prioritize its quality, which is defined by both precision and accuracy of action, as well as diversity across tasks, environments, and objects.
Since raw human video lacks structured control signals, precise action labeling is essential for downstream model quality. To address this, we design and fabricate a human-operated hand gripper that mimics the target robot's end-effector, along with a pipeline that estimates robot-aligned action representations from gripper trajectories in human-performed manipulation. Video 1 below qualitatively demonstrates the accuracy of the estimated 7-DoF robot-aligned action representations of both end-effectors across diverse human-performed task data:
To ensure data diversity, we systematically broaden variations in environments, objects, and actions during data collection for each target task. Each "scenario" (Fig. 2) represents a distinct variation of that task. Video 2 showcases our internal explorer and the diverse scenarios created to date:
2. Adapt the foundation model to a target robot with focused teleoperation data—then improve it in a closed loop.
We then finetune our foundation model for a specific target robot through an automated end-to-end pipeline—spanning data collection to model training and evaluation—typically completed within 24 hours.
The process begins with designing a data acquisition strategy informed by the target robot's physical characteristics and sensing constraints to enable efficient model learning, followed by collecting a small amount of target task-specific teleoperation data and finetuning the model. The model is then improved through online rollouts, iteratively refining both the data strategy and policy performance. Fig. 4 illustrates this closed-loop process on a popcorn-serving task:

For the instruction "Serve popcorn in a cup," we initially adopted the scoop-holding strategy shown in Video 3. This approach did not produce a strong initial policy (see the orange bar in Fig. 4). We attribute this primarily to two factors: (1) the data acquisition strategy did not allow the model to effectively leverage the wrist-view for fine control of the scoop position relative to the cup, and (2) the model may have learned spurious noise from arm instability caused by the teleoperator maintaining an unnatural pose while holding the scoop.
After evaluating the model fine-tuned on 100 trajectories collected using this first strategy, we revised the scoop-holding method (Video 4). This change highlighted the importance of an effective data acquisition strategy: the updated policy achieved a significantly higher task success rate of 43% (± 9%). This iteration — from data collection to evaluation — was completed within a single day using our pipeline.
Once an effective data strategy is identified, we provide a visual guideline (e.g., reference video) for data collectors to follow during online data collection. This structured guidance reduces unnecessary variance in action trajectories (see Fig. 5 below) and improves the efficiency of subsequent model finetuning.

By the following day, we conduct multiple rounds of online data collection and policy training, progressively refining the model with each cycle. This online data-collection approach, leveraging policy rollouts from previous rounds, naturally increases state diversity by allowing human experts to recover from the model's suboptimal actions and failures, as shown in Fig. 6 below:

After two rounds of online data-driven model improvement, the task success rate improved to 76% (± 7%). The resulting policy can be seen in Video 5.
Check out the videos below to see how our models perform across a range of additional tasks, showcasing the promise of our approach.
Stay tuned for more updates — including our own models and datasets — coming soon. Follow us on X and YouTube to stay up to date with our latest work.