Physical AI · By nxted Research Team · Published 29 May 2026 · Updated 30 May 2026 · 2 min read

Inside an egocentric capture rig

What actually goes into a research-grade first-person capture: Aria glasses, depth sensors, hand pose, and the formats your training stack expects.

A useful capture rig is more than a camera on someone's head. Here is what a robotics-ready setup records, and why each signal matters.

The sensor stack

RGB, first-person, up to 1408 by 1408 at 30 fps. The core observation stream.
SLAM cameras for a 6DoF trajectory of the head and hands through space.
Eye gaze. Where the worker looks is a strong prior for where the action is.
Inertial measurement (accelerometer plus gyroscope) for motion.
Depth (RealSense or ZED) and hand pose (mocap gloves or a handheld gripper) for the robotics-ready tier.
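
These streams run at different rates, so a training pipeline usually has to line them up on a common clock first. Here is a minimal sketch of nearest-timestamp alignment against the RGB frames. It assumes each stream exposes a sorted list of integer timestamps; the EgoSample class, its field names, and the align helper are hypothetical, not part of any Aria or vendor SDK.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class EgoSample:
    """One time-aligned sample. All field names are hypothetical."""
    t_ns: int                          # RGB frame timestamp
    rgb_frame_index: int               # index into the RGB stream
    head_pose_index: int               # index into the 6DoF trajectory
    gaze_index: int                    # index into the eye-gaze stream
    imu_index: int                     # index into the IMU stream
    depth_index: Optional[int] = None  # robotics-ready tier only


def nearest_index(timestamps: Sequence[int], t: int) -> int:
    """Index of the timestamp closest to t. Assumes ascending order."""
    if not timestamps:
        raise ValueError("empty stream")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick the closer neighbour; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(rgb_ts: Sequence[int],
          pose_ts: Sequence[int],
          gaze_ts: Sequence[int],
          imu_ts: Sequence[int],
          depth_ts: Optional[Sequence[int]] = None) -> List[EgoSample]:
    """Build one EgoSample per RGB frame by nearest-timestamp lookup."""
    return [
        EgoSample(
            t_ns=t,
            rgb_frame_index=k,
            head_pose_index=nearest_index(pose_ts, t),
            gaze_index=nearest_index(gaze_ts, t),
            imu_index=nearest_index(imu_ts, t),
            depth_index=nearest_index(depth_ts, t) if depth_ts else None,
        )
        for k, t in enumerate(rgb_ts)
    ]
```

Anchoring on the RGB frames keeps one sample per observation, which is the shape most episode formats expect. A real pipeline would likely also reject matches whose time gap is too large rather than accept the nearest sample unconditionally.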

Much of this mirrors Meta's Project Aria and the Ego-Exo4D methodology, which is a large part of why policies trained on it tend to transfer.

The formats that matter

Your training stack does not want raw video. It wants structured episodes:

LeRobot: the Hugging Face standard, Parquet plus MP4 plus JSON.
RLDS (built on TFDS): the episode format behind Open X-Embodiment.
HDF5: the ALOHA and robomimic convention.
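
To make "structured episode" concrete, here is a rough sketch of the shape these formats share: an ordered list of steps, each pairing an observation with an action. It uses plain Python dictionaries and JSON rather than any of the libraries above. The is_first and is_last flags borrow RLDS step naming; build_episode, episode_id, and the observation keys are hypothetical.

```python
import json
from typing import Any, Dict, List


def build_episode(observations: List[Dict[str, Any]],
                  actions: List[List[float]],
                  episode_id: str) -> Dict[str, Any]:
    """Pack per-step observations and actions into one episode record."""
    if len(observations) != len(actions):
        raise ValueError("observations and actions must have equal length")
    if not observations:
        raise ValueError("an episode needs at least one step")
    last = len(observations) - 1
    steps = [
        {
            "observation": obs,   # e.g. frame index, gaze, pose for this step
            "action": act,        # e.g. hand or gripper pose as a flat vector
            "is_first": i == 0,   # RLDS-style step boundary flags
            "is_last": i == last,
        }
        for i, (obs, act) in enumerate(zip(observations, actions))
    ]
    return {"episode_id": episode_id, "steps": steps}


episode = build_episode(
    observations=[{"rgb_frame": 0}, {"rgb_frame": 1}, {"rgb_frame": 2}],
    actions=[[0.0, 0.0], [0.1, 0.0], [0.1, 0.2]],
    episode_id="demo-000",
)
serialized = json.dumps(episode)  # stand-in for the real on-disk format
```

The real formats differ mainly in storage: LeRobot puts the tabular part in Parquet and the frames in MP4, while HDF5 keeps everything as arrays in one file. The step-by-step structure is the common part.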

Annotation

Every clip ships with action segmentation, first-person narration, hand-pose tracks, and skill-level metadata. The annotation is the difference between footage and training data.
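
As an illustration, a per-clip annotation record might look like the sketch below. The schema is hypothetical: the class names, field names, and the label_at helper are assumptions for this example, not the shipped format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ActionSegment:
    """One labelled span of the clip. Hypothetical schema."""
    start_s: float   # segment start, seconds from clip start
    end_s: float     # segment end (exclusive)
    label: str       # action segmentation label
    narration: str   # first-person narration for this span


@dataclass
class ClipAnnotation:
    """Annotations attached to one clip. Hypothetical schema."""
    clip_id: str
    skill_level: str                      # skill-level metadata
    hand_pose_path: Optional[str] = None  # where the hand-pose tracks live
    segments: List[ActionSegment] = field(default_factory=list)

    def label_at(self, t_s: float) -> Optional[str]:
        """Return the action label covering time t_s, or None if unlabelled."""
        for seg in self.segments:
            if seg.start_s <= t_s < seg.end_s:
                return seg.label
        return None


clip = ClipAnnotation(
    clip_id="clip-0001",
    skill_level="expert",
    segments=[
        ActionSegment(0.0, 2.5, "reach", "I reach for the screwdriver."),
        ActionSegment(2.5, 6.0, "grasp", "I pick it up by the handle."),
    ],
)
```

A lookup like label_at is what lets a loader attach an action label to every frame, so the footage can be sliced into labelled training examples.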