Physical AI · By nxted Research Team · Published 30 May 2026 · Updated 30 May 2026 · 2 min read

Vision-Language-Action (VLA) Models: The Data They Need

VLA models map what a robot sees and is told into actions. This explainer covers how they work and the demonstration data they depend on.

TL;DR. A vision-language-action (VLA) model takes camera images and a natural-language instruction and outputs robot actions. VLAs are trained on large collections of demonstration episodes, so their performance depends heavily on the volume, diversity and annotation quality of the action data they are fed.

What a VLA model is

VLAs extend vision-language models to robot control: instead of answering in text, they predict actions. Notable examples include Google DeepMind's RT-2 and the open-source OpenVLA, and companies such as Physical Intelligence (with its π0 model) build generalist policies in the same family.
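
To make "predicting actions" concrete: one approach, reported for RT-2 and OpenVLA, is to split each continuous action dimension into 256 bins so the model can emit actions as discrete tokens, much as it would emit words. The sketch below illustrates that idea under simplifying assumptions. The [-1, 1] range, the uniform bin edges and the function names are ours for illustration, not code from any of these models. Other designs, π0 among them, generate continuous actions instead.

```python
from typing import List, Sequence

N_BINS = 256  # bin count reported for RT-2 and OpenVLA; treat as an assumption here


def discretize(action: Sequence[float], low: float = -1.0, high: float = 1.0,
               n_bins: int = N_BINS) -> List[int]:
    """Map each continuous action value to a bin index in [0, n_bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)       # keep values inside the range
        frac = (clipped - low) / (high - low)      # rescale to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def undiscretize(tokens: Sequence[int], low: float = -1.0, high: float = 1.0,
                 n_bins: int = N_BINS) -> List[float]:
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    print(discretize([-1.0, 0.0, 1.0]))  # -> [0, 128, 255]
```

Decoding returns the centre of each bin, so the round trip loses at most half a bin width per dimension. That is one reason the range chosen for each action dimension matters when preparing training data.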

The data a VLA needs

Demonstration episodes pairing observations with actions, often sourced from collections like Open X-Embodiment.
Language annotations describing each task, so the "language" channel is grounded.
Diversity across objects, scenes and embodiments, which imitation-learning scaling studies point to as a major driver of generalisation.
Clean, consistent formats such as LeRobot, RLDS or HDF5.
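
As a rough illustration of the requirements above, a demonstration episode can be thought of as a language instruction plus a sequence of observation-action steps. The class and field names below are hypothetical and do not mirror the actual LeRobot, RLDS or HDF5 schemas. The sketch only shows the kind of basic checks worth running on each episode.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    image_path: str        # camera frame for this timestep (hypothetical field)
    action: List[float]    # e.g. end-effector deltas plus a gripper command


@dataclass
class Episode:
    instruction: str       # natural-language description of the task
    embodiment: str        # which robot (or human rig) produced the data
    steps: List[Step] = field(default_factory=list)


def episode_problems(ep: Episode, action_dim: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not ep.instruction.strip():
        problems.append("missing language instruction")
    if not ep.steps:
        problems.append("episode has no steps")
    for i, step in enumerate(ep.steps):
        if len(step.action) != action_dim:
            problems.append(
                f"step {i}: action has {len(step.action)} dims, expected {action_dim}"
            )
    return problems


if __name__ == "__main__":
    ep = Episode("pick up the cup", "arm_a", [Step("frame_000.png", [0.0] * 7)])
    print(episode_problems(ep, action_dim=7))  # -> []
```

Checks like these catch only structural problems. Whether an instruction actually describes what happens in the episode still needs human or model-assisted review.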

Where human egocentric data fits

Robot teleoperation data is action-aligned but expensive to scale. Human egocentric demonstrations are cheaper and far more diverse, and are increasingly used to pre-train or augment VLA policies. For background, see our explainer on what egocentric data is and our teleoperation comparison.

What this means if you are sourcing data

Optimise for breadth (many tasks, objects and settings), insist on language and action annotations, and validate quality with a small batch before scaling - the approach in our buyer's guide.
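
A small-batch check along these lines can be sketched in a few lines of Python. The dictionary keys (`task`, `scene`, `instruction`, `actions`) are hypothetical stand-ins for whatever metadata a supplier provides, and the report is a starting point rather than a complete quality audit.

```python
from typing import Dict, List


def batch_report(episodes: List[Dict[str, object]]) -> Dict[str, float]:
    """Summarise breadth and annotation coverage for a sample batch."""
    total = len(episodes)
    # An episode counts as annotated only if it has both a non-empty
    # instruction and a non-empty list of actions.
    annotated = sum(
        1 for ep in episodes
        if str(ep.get("instruction", "")).strip() and ep.get("actions")
    )
    return {
        "episodes": total,
        "distinct_tasks": len({ep["task"] for ep in episodes if ep.get("task")}),
        "distinct_scenes": len({ep["scene"] for ep in episodes if ep.get("scene")}),
        "annotated_fraction": annotated / total if total else 0.0,
    }


if __name__ == "__main__":
    sample = [
        {"task": "pick cup", "scene": "kitchen",
         "instruction": "pick up the cup", "actions": [[0.1, 0.0]]},
        {"task": "pick cup", "scene": "office",
         "instruction": "", "actions": [[0.1, 0.0]]},
    ]
    print(batch_report(sample))
```

Low distinct-task or distinct-scene counts suggest the batch is deep rather than broad. An annotated fraction well below 1.0 is a signal to resolve labelling issues before ordering at volume.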

FAQ

What is a vision-language-action model? A robot-control model that maps camera input and a language instruction to actions, extending vision-language models from text output to physical action.

What data do VLA models need? Large, diverse sets of demonstration episodes with language and action annotations, in robotics formats - plus the breadth that drives generalisation.

Can human video train VLAs? Yes. Human egocentric demonstrations are cheaper and more diverse than teleoperation and are increasingly used to pre-train or augment VLA policies.

Need VLA-ready demonstration data? Explore nxted Capture or request a Test Kit.