Physical AI · By nxted Research Team · Published 29 May 2026 · Updated 30 May 2026 · 2 min read

Inside an egocentric capture rig

What actually goes into a research-grade first-person capture: Aria glasses, depth sensors, hand pose, and the formats your training stack expects.

A useful capture rig is more than a camera on someone's head. Here is what a robotics-ready setup records, and why each signal matters.

The sensor stack

  • RGB, first-person, up to 1408 × 1408 at 30 fps. The core observation stream.
  • SLAM cameras for a 6DoF trajectory of the head and hands through space.
  • Eye gaze. Where the worker looks is a strong prior for where the action is.
  • Inertial measurement (accelerometer plus gyroscope) for motion.
  • Depth (RealSense or ZED) and hand pose (mocap gloves or a handheld gripper) for the robotics-ready tier.

Much of this mirrors Meta's Project Aria glasses and the Ego-Exo4D capture methodology, which is why policies trained on it tend to transfer.
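
These streams do not tick at the same rate, so a common first step is to line each one up against the RGB frames. The sketch below shows one simple way to do that, nearest-timestamp matching with a tolerance. It is an illustration, not how any particular rig or the Aria tooling does it: the function names, the 100 Hz IMU rate, and the 5 ms tolerance are all assumptions made for the example.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the entry in sorted `timestamps` closest to `t`."""
    if not timestamps:
        raise ValueError("timestamps must be non-empty")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    before, after = timestamps[i - 1], timestamps[i]
    return i - 1 if (t - before) <= (after - t) else i


def align_to_frames(frame_ts_ns, stream_ts_ns, max_gap_ns):
    """For each RGB frame, return the index of the nearest stream sample.

    Frames with no sample within `max_gap_ns` get None, so dropped
    sensor data stays visible instead of being silently paired.
    """
    aligned = []
    for t in frame_ts_ns:
        j = nearest_index(stream_ts_ns, t)
        gap = abs(stream_ts_ns[j] - t)
        aligned.append(j if gap <= max_gap_ns else None)
    return aligned


# Hypothetical timestamps in nanoseconds: three RGB frames at roughly
# 30 fps, and IMU samples every 10 ms (an assumed 100 Hz rate).
frame_ts = [0, 33_333_333, 66_666_667]
imu_ts = [i * 10_000_000 for i in range(8)]

matches = align_to_frames(frame_ts, imu_ts, max_gap_ns=5_000_000)
print(matches)  # [0, 3, 7]
```

Returning None for unmatched frames is a deliberate choice here: downstream code can then decide whether to drop the frame or interpolate.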

The formats that matter

Your training stack does not want raw video. It wants structured episodes:

  • LeRobot: the Hugging Face standard, Parquet plus MP4 plus JSON.
  • RLDS and TFDS: the format behind Open X-Embodiment.
  • HDF5: the ALOHA and robomimic convention.
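
To make "structured episode" concrete, here is a minimal sketch of one episode as a list of per-frame steps. The `observation`, `action`, `is_first`, and `is_last` keys follow the RLDS step convention; everything else (the `build_episode` helper, `episode_id`, `fps`, `rgb_frame_index`, the placeholder pose and action values) is hypothetical and does not reproduce the actual LeRobot, RLDS, or HDF5 schemas. A real exporter would write this through the relevant library instead of plain JSON.

```python
import json


def build_episode(episode_id, frames, fps):
    """Pack per-frame records into a list of RLDS-style steps."""
    last = len(frames) - 1
    steps = []
    for i, frame in enumerate(frames):
        steps.append({
            "observation": {
                # Index into the episode's video file, not raw pixels.
                "rgb_frame_index": i,
                "hand_pose": frame["hand_pose"],
            },
            "action": frame["action"],
            "is_first": i == 0,
            "is_last": i == last,
        })
    return {"episode_id": episode_id, "fps": fps, "steps": steps}


# Placeholder values: a 3-number hand position and a 1-number gripper command.
frames = [
    {"hand_pose": [0.10, 0.20, 0.30], "action": [0.0]},
    {"hand_pose": [0.11, 0.20, 0.31], "action": [0.5]},
    {"hand_pose": [0.12, 0.21, 0.33], "action": [1.0]},
]

episode = build_episode("ep_000", frames, fps=30)
serialized = json.dumps(episode, indent=2)
```

The point of the structure is that each step pairs an observation with an action and marks episode boundaries, which raw video alone cannot express.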

Annotation

Every clip ships with action segmentation, first-person narration, hand-pose tracks, and skill-level metadata. The annotation is the difference between footage and training data.
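
As an illustration of what usable action segmentation implies, the sketch below checks that a clip's segments are well-formed before training on them. The `ActionSegment` fields and the `validate_segments` rules (positive duration, inside the clip, no overlap) are assumptions for this example, not the schema the clips actually ship with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionSegment:
    start_s: float
    end_s: float
    label: str
    narration: str


def validate_segments(segments, clip_duration_s):
    """Return a list of problems found; an empty list means the clip passes."""
    problems = []
    prev_end = 0.0
    for seg in sorted(segments, key=lambda s: s.start_s):
        if seg.end_s <= seg.start_s:
            problems.append(f"{seg.label}: end must be after start")
        if seg.start_s < 0 or seg.end_s > clip_duration_s:
            problems.append(f"{seg.label}: outside clip bounds")
        if seg.start_s < prev_end:
            problems.append(f"{seg.label}: overlaps previous segment")
        prev_end = max(prev_end, seg.end_s)
    return problems


good = [
    ActionSegment(0.0, 2.5, "reach", "I reach for the drill."),
    ActionSegment(2.5, 6.0, "grasp", "I pick up the drill."),
]
bad = [
    ActionSegment(0.0, 3.0, "reach", "I reach for the drill."),
    ActionSegment(2.0, 7.0, "grasp", "I pick up the drill."),
]

print(validate_segments(good, clip_duration_s=6.0))  # []
print(validate_segments(bad, clip_duration_s=6.0))
```

Segments that touch end to start are allowed here; only a true overlap is flagged.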

nxted Research Team

Physical-AI data specialists at OFORO LTD (UK). We write about egocentric data, robotics dataset formats, RLHF and data governance.