Vision-Language-Action (VLA) Models: The Data They Need
VLA models map what a robot sees and is told into actions. This explainer covers how they work and the demonstration data they depend on.
TL;DR. A vision-language-action (VLA) model takes camera images and a natural-language instruction and outputs robot actions. VLAs are trained on large collections of demonstration episodes, so their performance depends heavily on the volume, diversity and annotation quality of the action data they are fed.
What a VLA model is
VLAs extend vision-language models to control: instead of answering in text, they predict robot actions. Notable examples include Google DeepMind's RT-2 and the open-source OpenVLA, and companies such as Physical Intelligence (with its π0 model) build general-purpose policies in the same family.
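To make the input/output contract concrete, here is a minimal Python sketch of the interface described above. It is illustrative only: the names `Observation`, `VLAPolicy`, `ZeroPolicy` and `run_episode` are hypothetical, and the 7-value action (for example a 6-DoF end-effector delta plus a gripper command) is an assumption, not the API of RT-2, OpenVLA or π0.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


@dataclass
class Observation:
    """One timestep of input: a camera frame plus the task instruction."""
    image: List[List[List[int]]]  # H x W x 3 RGB pixel values (assumed layout)
    instruction: str              # natural-language command


class VLAPolicy(Protocol):
    """Hypothetical interface: observation in, action vector out."""
    def predict_action(self, obs: Observation) -> List[float]: ...


class ZeroPolicy:
    """Placeholder standing in for a trained model; always outputs zeros."""

    def __init__(self, action_dim: int = 7) -> None:
        # 7 is an assumed action size, not a requirement of any real model.
        self.action_dim = action_dim

    def predict_action(self, obs: Observation) -> List[float]:
        return [0.0] * self.action_dim


def run_episode(policy: VLAPolicy, observations: Sequence[Observation]) -> List[List[float]]:
    """Query the policy once per observation and collect the actions."""
    return [policy.predict_action(obs) for obs in observations]


frame = [[[0, 0, 0]]]  # a 1 x 1 black image, just to exercise the types
actions = run_episode(ZeroPolicy(), [Observation(frame, "pick up the cup")])
```

A real policy would replace `ZeroPolicy` with a model call; the point is only that text output is swapped for an action vector.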
The data a VLA needs
- Demonstration episodes pairing observations with actions, often sourced from collections like Open X-Embodiment.
- Language annotations describing each task, so the "language" channel is grounded.
- Diversity across objects, scenes and embodiments - reported in imitation-learning scaling studies as a major driver of generalisation.
- Clean, standard formats such as LeRobot, RLDS or HDF5.
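The requirements in this list can be sketched as a simple episode record with a validation check. This is a hedged illustration, not the LeRobot, RLDS or HDF5 schema: the field names (`instruction`, `embodiment`, `image_path`, `action`) and the default action size of 7 are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    image_path: str      # where the camera frame for this step is stored
    action: List[float]  # the action recorded at this step


@dataclass
class Episode:
    instruction: str     # language annotation describing the task
    embodiment: str      # e.g. which robot or capture rig produced the data
    steps: List[Step] = field(default_factory=list)


def validate_episode(ep: Episode, action_dim: int = 7) -> List[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems: List[str] = []
    if not ep.instruction.strip():
        problems.append("missing language instruction")
    if not ep.steps:
        problems.append("no steps")
    for i, step in enumerate(ep.steps):
        if len(step.action) != action_dim:
            problems.append(
                f"step {i}: expected {action_dim} action values, got {len(step.action)}"
            )
    return problems


good = Episode("open the drawer", "arm_a", [Step("frame_000.png", [0.0] * 7)])
bad = Episode("  ", "arm_a", [Step("frame_000.png", [0.0] * 6)])

good_problems = validate_episode(good)
bad_problems = validate_episode(bad)
```

Here `good_problems` is empty, while `bad_problems` flags both the blank instruction and the wrong action length.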
Where human egocentric data fits
Robot teleoperation data is action-aligned but expensive to scale. Human egocentric demonstrations are cheaper and far more diverse but are not directly action-aligned, so they are increasingly used to pre-train or augment VLA policies. See what is egocentric data and the teleoperation comparison.
What this means if you are sourcing data
Optimise for breadth (many tasks, objects and settings), insist on language and action annotations, and validate quality with a small batch before scaling - the approach in our buyer's guide.
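One way to run that small-batch check is a quick audit of annotation coverage and breadth before committing to a larger order. The sketch below assumes each episode arrives as a plain dictionary with hypothetical keys `task`, `scene`, `instruction` and `actions`; real deliveries will differ, and any pass/fail thresholds are yours to set.

```python
from collections import Counter
from typing import Dict, List


def audit_batch(episodes: List[dict]) -> Dict[str, float]:
    """Summarise annotation coverage and breadth for a sample batch."""
    total = len(episodes)
    # An episode counts as language-annotated if its instruction is non-blank.
    has_language = sum(1 for e in episodes if (e.get("instruction") or "").strip())
    # An episode counts as action-annotated if it has at least one action.
    has_actions = sum(1 for e in episodes if e.get("actions"))
    tasks = Counter(e["task"] for e in episodes if e.get("task"))
    scenes = {e["scene"] for e in episodes if e.get("scene")}
    return {
        "episodes": total,
        "language_coverage": has_language / total if total else 0.0,
        "action_coverage": has_actions / total if total else 0.0,
        "distinct_tasks": len(tasks),
        "distinct_scenes": len(scenes),
        # A high share means the batch is dominated by a single task.
        "most_common_task_share": tasks.most_common(1)[0][1] / total if tasks else 0.0,
    }


sample_batch = [
    {"task": "pour", "scene": "kitchen", "instruction": "pour water", "actions": [[0.0] * 7]},
    {"task": "pour", "scene": "kitchen", "instruction": "", "actions": [[0.0] * 7]},
    {"task": "fold", "scene": "bedroom", "instruction": "fold the towel", "actions": []},
    {"task": "wipe", "scene": "kitchen", "instruction": "wipe the table", "actions": [[0.1] * 7]},
]
report = audit_batch(sample_batch)
```

Low coverage or a single dominant task in the report is a signal to raise with the supplier before scaling.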
FAQ
What is a vision-language-action model? A robot-control model that maps camera input and a language instruction to actions, extending vision-language models from text output to physical action.
What data do VLA models need? Large, diverse sets of demonstration episodes with language and action annotations, delivered in standard robotics formats such as LeRobot, RLDS or HDF5.
Can human video train VLAs? Yes. Human egocentric demonstrations are cheaper and more diverse than teleoperation and are increasingly used to pre-train or augment VLA policies.
Need VLA-ready demonstration data? Explore nxted Capture or request a Test Kit.
Physical-AI data specialists at OFORO LTD (UK). We write about egocentric data, robotics dataset formats, RLHF and data governance. See what we build.