Inside an egocentric capture rig
What actually goes into a research-grade first-person capture: Aria glasses, depth sensors, hand pose, and the formats your training stack expects.
A useful capture rig is more than a camera on someone's head. Here is what a robotics-ready setup records, and why each signal matters.
The sensor stack
- First-person RGB video, up to 1408 × 1408 at 30 fps. The core observation stream.
- SLAM cameras for a 6DoF trajectory of the head and hands through space.
- Eye gaze. Where the worker looks is a strong prior for where the action is.
- Inertial measurement (accelerometer plus gyroscope) for linear acceleration and angular velocity.
- Depth (RealSense or ZED) and hand pose (mocap gloves or a handheld gripper) for the robotics-ready tier.
Much of this mirrors Meta's Project Aria and the Ego-Exo4D methodology, which is a large part of why policies trained on it tend to transfer.
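These streams run at different rates, so before anything reaches a training pipeline they have to be lined up on a common clock. The sketch below shows one simple way to do that: nearest-timestamp matching against the RGB frames. It assumes every stream already carries timestamps in the same time base; the names `Stream`, `align_to_rgb` and the 20 ms `max_skew_s` tolerance are illustrative choices, not part of any vendor SDK.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Stream:
    """One sensor stream: sorted timestamps (seconds) and one sample per timestamp."""
    name: str
    timestamps: Sequence[float]
    samples: Sequence[object]


def nearest_index(timestamps: Sequence[float], t: float) -> int:
    """Index of the timestamp closest to t (timestamps must be sorted and non-empty)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_to_rgb(rgb: Stream, others: List[Stream],
                 max_skew_s: float = 0.02) -> List[Dict[str, object]]:
    """For each RGB frame, attach the nearest sample from every other stream.

    A stream's entry is None when its nearest sample is further than
    max_skew_s from the RGB timestamp (hypothetical tolerance).
    """
    frames = []
    for t, image in zip(rgb.timestamps, rgb.samples):
        frame: Dict[str, object] = {"t": t, rgb.name: image}
        for s in others:
            j = nearest_index(s.timestamps, t)
            within = abs(s.timestamps[j] - t) <= max_skew_s
            frame[s.name] = s.samples[j] if within else None
        frames.append(frame)
    return frames


# Toy data: RGB at 10 Hz, IMU at roughly 20 Hz, gaze that stops early.
rgb = Stream("rgb", [0.0, 0.1, 0.2], ["f0", "f1", "f2"])
imu = Stream("imu", [0.0, 0.05, 0.1, 0.15, 0.21], ["i0", "i1", "i2", "i3", "i4"])
gaze = Stream("gaze", [0.0, 0.1], ["g0", "g1"])
frames = align_to_rgb(rgb, [imu, gaze])
```

Marking out-of-tolerance samples as `None` instead of silently reusing a stale reading keeps dropouts visible downstream, which matters when a signal such as gaze is meant to act as a prior.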
The formats that matter
Your training stack does not want raw video. It wants structured episodes:
- LeRobot: the Hugging Face standard, Parquet plus MP4 plus JSON.
- RLDS (built on TFDS): the episode format behind Open X-Embodiment.
- HDF5: the ALOHA and robomimic convention.
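To make "structured episode" concrete, here is a minimal sketch that packs per-frame observations and actions into a single episode dictionary. The step keys (`observation`, `action`, `is_first`, `is_last`) follow RLDS step naming; the `to_episode` helper and the `episode_metadata` layout are assumptions for illustration. A real pipeline would hand this structure to the LeRobot, TFDS or HDF5 tooling instead of keeping it as a plain dict.

```python
from typing import Any, Dict, List


def to_episode(frames: List[Dict[str, Any]], actions: List[Any],
               task: str) -> Dict[str, Any]:
    """Pack per-frame observations and actions into one RLDS-style episode dict."""
    if not frames:
        raise ValueError("an episode needs at least one frame")
    if len(frames) != len(actions):
        raise ValueError("need exactly one action per frame")

    last = len(frames) - 1
    steps = []
    for i, (obs, act) in enumerate(zip(frames, actions)):
        steps.append({
            "observation": obs,   # e.g. RGB frame reference, depth, gaze, hand pose
            "action": act,        # e.g. target hand or gripper pose for this step
            "is_first": i == 0,   # marks the episode boundary at the start
            "is_last": i == last, # marks the episode boundary at the end
        })
    # Hypothetical metadata layout; adapt to whatever the target format requires.
    return {
        "episode_metadata": {"task": task, "num_steps": len(steps)},
        "steps": steps,
    }


episode = to_episode(
    frames=[{"rgb": "frame_000.jpg"}, {"rgb": "frame_001.jpg"}],
    actions=[[0.0, 0.0, 0.1], [0.0, 0.0, 0.2]],
    task="pick up the cup",
)
```

The explicit boundary flags are the point: they are what lets a loader split a long recording into episodes without guessing from timestamps.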
Annotation
Every clip ships with action segmentation, first-person narration, hand-pose tracks, and skill-level metadata. The annotation is the difference between footage and training data.
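As an illustration of how action segmentation turns footage into supervision, the sketch below maps time-stamped segments onto individual frames. The `Segment` fields and the `label_frames` helper are hypothetical, and the half-open interval convention (start inclusive, end exclusive) is an assumption; an actual annotation schema may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Segment:
    """One annotated action span with its first-person narration."""
    start_s: float   # inclusive
    end_s: float     # exclusive
    action: str
    narration: str


def label_frames(frame_times: Sequence[float],
                 segments: List[Segment]) -> List[Optional[str]]:
    """Return the action label covering each frame time, or None between segments."""
    labels: List[Optional[str]] = []
    for t in frame_times:
        label = None
        for seg in segments:
            if seg.start_s <= t < seg.end_s:
                label = seg.action
                break
        labels.append(label)
    return labels


segments = [
    Segment(0.0, 1.0, "reach", "I reach for the cup."),
    Segment(1.5, 2.0, "grasp", "I grasp the handle."),
]
labels = label_frames([0.0, 0.5, 1.2, 1.5, 2.0], segments)
```

Frames that fall between segments come back as `None`, so unlabeled transitions can be filtered or handled explicitly at training time.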
Physical-AI data specialists at OFORO LTD (UK). We write about egocentric data, robotics dataset formats, RLHF and data governance.