How to Collect Egocentric Data for Robot Training: A Hardware Guide
A practical guide to the rigs and sensors used to record research-grade egocentric data - from Project Aria to depth cameras and grippers.
TL;DR. To collect research-grade egocentric data you need a first-person RGB camera, a way to recover a 6-DoF trajectory (SLAM), depth, and hand pose - commonly built on Meta Project Aria plus an Intel RealSense or Stereolabs ZED depth camera, with a UMI-style gripper for action-aligned manipulation. The rig you choose should match how training-ready you need the data to be.
The signals that matter
- First-person RGB - the core observation stream (commonly around 1408x1408 at 30 fps).
- 6-DoF trajectory - position and orientation of the head-mounted camera (and, with a handheld gripper, the hand) through space, from SLAM.
- Depth / point cloud - 3D structure of the scene.
- Hand pose - per-joint tracking, the bridge to robot grippers.
- Eye gaze - a strong prior for where the action is.
- Action labels - segmentation and success/failure, added in annotation.
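As a rough illustration of how the signals above fit together, here is a minimal sketch of a per-frame record. The class name `EgoFrame`, its field names and the units are assumptions made for this example, not a published schema; optional fields are left empty when a rig does not capture that signal.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class EgoFrame:
    """One time-stamped sample from a hypothetical egocentric rig."""
    timestamp_ns: int                      # capture time on a shared clock
    rgb_path: str                          # first-person RGB image on disk
    # 6-DoF pose: translation in metres plus a unit quaternion (w, x, y, z)
    pose_xyz: Optional[Tuple[float, float, float]] = None
    pose_quat_wxyz: Optional[Tuple[float, float, float, float]] = None
    depth_path: Optional[str] = None       # depth map or point cloud file
    hand_joints: Optional[List[Tuple[float, float, float]]] = None  # per-joint 3D positions
    gaze_xy: Optional[Tuple[float, float]] = None  # gaze point in image pixels
    action_label: Optional[str] = None     # added later, in annotation
    success: Optional[bool] = None         # added later, in annotation

    def available_signals(self) -> List[str]:
        """List which of the six signals this frame actually carries."""
        signals = ["rgb"]
        if self.pose_xyz is not None and self.pose_quat_wxyz is not None:
            signals.append("trajectory")
        if self.depth_path is not None:
            signals.append("depth")
        if self.hand_joints is not None:
            signals.append("hand_pose")
        if self.gaze_xy is not None:
            signals.append("gaze")
        if self.action_label is not None or self.success is not None:
            signals.append("action_labels")
        return signals


frame = EgoFrame(timestamp_ns=0, rgb_path="frame_000000.png",
                 depth_path="depth_000000.png")
print(frame.available_signals())
```

Keeping every signal except RGB optional lets the same record describe an RGB-only capture and a fully instrumented one.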
The hardware the research field uses
- Meta Project Aria - research glasses providing RGB, SLAM cameras, IMU and eye gaze; the egocentric capture device used in Ego-Exo4D.
- Intel RealSense / Stereolabs ZED - depth and point cloud.
- Universal Manipulation Interface (UMI) - a handheld gripper that records action-aligned manipulation data and narrows the human-to-robot gap.
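When glasses and a separate depth camera record side by side, their frames have to be paired before they are useful together. The sketch below shows one simple way to do that: nearest-timestamp matching with a maximum allowed gap. It assumes both devices' timestamps are already on a shared clock and sorted; the function name `nearest_index` and the sample numbers are hypothetical, and this is not the synchronisation method of any particular device SDK.

```python
from bisect import bisect_left
from typing import List, Optional


def nearest_index(sorted_ts: List[int], t: int, max_gap: int) -> Optional[int]:
    """Index of the timestamp closest to t, or None if it is further than max_gap."""
    if not sorted_ts:
        return None
    i = bisect_left(sorted_ts, t)
    # The closest value is either just before or just at/after the insertion point.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(sorted_ts)]
    best = min(candidates, key=lambda j: abs(sorted_ts[j] - t))
    return best if abs(sorted_ts[best] - t) <= max_gap else None


# Illustrative timestamps in milliseconds (made-up values).
rgb_ts = [0, 33, 66, 100]
depth_ts = [5, 40, 98]

# Pair each RGB frame with its nearest depth frame, allowing at most 10 ms of skew.
pairs = [(t, nearest_index(depth_ts, t, max_gap=10)) for t in rgb_ts]
print(pairs)
```

RGB frames with no depth frame inside the gap are paired with `None`, so they can be dropped or interpolated later instead of being silently mismatched.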
Three rig tiers
- RGB-only - cheapest; good for pre-training and behaviour cloning where pose is inferred.
- RGB + depth + SLAM - adds 3D and trajectory; suitable for most manipulation work.
- Full - adds hand pose and a UMI-style gripper for robotics-ready, action-aligned data.
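The three tiers can be read as a simple rule over which signals a rig records. The sketch below encodes that rule; the signal names (`"rgb"`, `"depth"`, `"slam"`, `"hand_pose"`, `"gripper"`) and the function `rig_tier` are labels invented for this example, and treating a partial set as the lower tier is an assumption.

```python
from typing import Set


def rig_tier(signals: Set[str]) -> str:
    """Map a set of recorded signals to one of the three rig tiers."""
    if "rgb" not in signals:
        raise ValueError("every tier starts from first-person RGB")
    has_3d = {"depth", "slam"} <= signals          # 3D structure plus trajectory
    has_manip = {"hand_pose", "gripper"} <= signals  # hand pose plus UMI-style gripper
    if has_3d and has_manip:
        return "full"
    if has_3d:
        return "rgb+depth+slam"
    # Assumption: a partial set (e.g. depth without SLAM) falls back to the lowest tier.
    return "rgb-only"


print(rig_tier({"rgb"}))
print(rig_tier({"rgb", "depth", "slam"}))
print(rig_tier({"rgb", "depth", "slam", "hand_pose", "gripper"}))
```

A check like this is useful at ingest time, to confirm that a recording session actually delivers the tier that was planned for it.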
We recommend only real, publicly documented hardware rather than hypothetical proprietary rigs - these are the devices behind published datasets, which is a large part of why policies trained on such data can transfer.
Calibration and consent are part of the rig
Record camera intrinsics/extrinsics and control frequency, and - because you are filming people - obtain consent and plan redaction from the start. See annotating egocentric data and the Data Trust Pack.
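To show why intrinsics and extrinsics are worth recording, here is a minimal sketch that turns a depth pixel into a 3D point and moves it into a common rig frame. It assumes an ideal pinhole model with no lens distortion (wide-angle or fisheye cameras need their own calibration model), and every number, key and function name below is a placeholder for illustration, not a value from any real device.

```python
import json


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection: pixel (u, v) at depth_m metres -> camera-frame XYZ."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def apply_extrinsics(point, rotation, translation):
    """Map a camera-frame point into the rig frame: p_rig = R * p_cam + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Placeholder calibration record, stored alongside each recording session.
calibration = {
    "intrinsics": {"fx": 600.0, "fy": 600.0, "cx": 320.0, "cy": 240.0},
    "extrinsics": {
        "rotation": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],  # identity: axes aligned
        "translation_m": [0.05, 0.0, 0.0],              # camera 5 cm from rig origin
    },
    "control_frequency_hz": 10.0,
}

p_cam = backproject(320, 240, 1.0, **calibration["intrinsics"])
p_rig = apply_extrinsics(p_cam,
                         calibration["extrinsics"]["rotation"],
                         calibration["extrinsics"]["translation_m"])
print(p_cam, p_rig)
print(json.dumps(calibration, indent=2))
```

Without the stored intrinsics the depth map cannot be converted to metric 3D points, and without the extrinsics points from different sensors cannot be placed in the same frame, which is why both belong in the recording metadata.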
FAQ
What hardware do you need to collect egocentric data? A first-person RGB camera with SLAM (e.g. Project Aria), a depth camera (RealSense or ZED), and - for action-aligned manipulation - hand pose or a UMI-style gripper.
Do I need depth and hand pose? It depends on the task. RGB-only suffices for some pre-training; depth and hand pose are needed for robotics-ready manipulation data.
Is special hardware required, or can any camera work? Any camera captures RGB, but research-grade data that transfers to robots benefits from SLAM, depth and pose - the stack behind datasets like Ego-Exo4D.
nxted runs these rigs so you don't have to: see nxted Capture or compare collection options.
Physical-AI data specialists at OFORO LTD (UK). We write about egocentric data, robotics dataset formats, RLHF and data governance. See what we build.