Technical · By nxted Research Team · Published 30 May 2026 · Updated 30 May 2026 · 2 min read

How to Collect Egocentric Data for Robot Training: A Hardware Guide

A practical guide to the rigs and sensors used to record research-grade egocentric data - from Project Aria to depth cameras and grippers.

TL;DR. To collect research-grade egocentric data you need a first-person RGB camera, a way to recover a 6-DoF trajectory (SLAM), depth, and hand pose - commonly built on Meta Project Aria plus an Intel RealSense or Stereolabs ZED depth camera, with a UMI-style gripper for action-aligned manipulation. The rig you choose should match how training-ready you need the data to be.

The signals that matter

First-person RGB - the core observation stream (commonly around 1408x1408 at 30 fps).
6-DoF trajectory - head and hand pose through space, from SLAM.
Depth / point cloud - 3D structure of the scene.
Hand pose - per-joint tracking, the bridge to robot grippers.
Eye gaze - a strong prior for where the action is.
Action labels - segmentation and success/failure, added in annotation.
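
To make the list above concrete, here is a minimal sketch of how one time-stamped sample could be stored. The `EgoFrame` class, its field names, the units and the quaternion order are all assumptions made for illustration; they are not the schema of Project Aria or of any published dataset.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z): an assumed convention


@dataclass
class EgoFrame:
    """One time-stamped sample of an egocentric recording (hypothetical schema)."""

    timestamp_ns: int
    rgb_path: str                              # first-person RGB image
    head_position_m: Optional[Vec3] = None     # 6-DoF pose, translation part
    head_orientation: Optional[Quat] = None    # 6-DoF pose, rotation part
    depth_path: Optional[str] = None           # depth map or point cloud file
    hand_joints_m: Optional[List[Vec3]] = None # per-joint hand pose
    gaze_direction: Optional[Vec3] = None      # eye gaze as a direction vector
    action_label: Optional[str] = None         # added later, in annotation
    success: Optional[bool] = None             # added later, in annotation

    def available_signals(self) -> List[str]:
        """Names of the optional signals that are present on this frame."""
        always_present = {"timestamp_ns", "rgb_path"}
        return [
            f.name
            for f in fields(self)
            if f.name not in always_present and getattr(self, f.name) is not None
        ]


# A frame from a rig with RGB, SLAM pose and depth, but no hand pose or gaze.
frame = EgoFrame(
    timestamp_ns=1_000_000,
    rgb_path="frames/000001.jpg",
    head_position_m=(0.0, 1.6, 0.0),
    head_orientation=(1.0, 0.0, 0.0, 0.0),
    depth_path="depth/000001.png",
)
print(frame.available_signals())
# ['head_position_m', 'head_orientation', 'depth_path']
```

Keeping every signal except RGB optional lets the same record type describe recordings from any rig, with annotation fields filled in afterwards.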

The hardware the research field uses

Meta Project Aria - research glasses providing RGB, SLAM cameras, IMU and eye gaze; the basis of Ego-Exo4D.
Intel RealSense / Stereolabs ZED - depth and point cloud.
Universal Manipulation Interface (UMI) - a handheld gripper that records action-aligned manipulation data and narrows the human-to-robot gap.
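
One way to sanity-check a planned rig is to compare the signals a task needs against what the chosen devices provide. The sketch below is illustrative only: the device keys and signal names are hypothetical labels, and the mapping encodes just what this guide states about each device, not a full spec sheet.

```python
from typing import Dict, Iterable, Set

# Signals per device, as described in this guide (labels are hypothetical).
DEVICE_SIGNALS: Dict[str, Set[str]] = {
    "project_aria": {"rgb", "slam", "imu", "eye_gaze"},
    "depth_camera": {"depth", "point_cloud"},   # RealSense or ZED
    "umi_gripper": {"gripper_action"},          # action-aligned manipulation
}


def missing_signals(devices: Iterable[str], required: Set[str]) -> Set[str]:
    """Return the required signals that none of the listed devices provide.

    Raises KeyError if a device name is not in DEVICE_SIGNALS.
    """
    covered: Set[str] = set()
    for name in devices:
        covered |= DEVICE_SIGNALS[name]
    return required - covered


required = {"rgb", "slam", "depth", "gripper_action"}
print(sorted(missing_signals(["project_aria", "depth_camera"], required)))
# ['gripper_action']
```

Here the glasses plus a depth camera cover observation and 3D structure, and the check shows that action-aligned gripper data still needs a third device.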

Three rig tiers

RGB-only - cheapest; good for pre-training and behaviour cloning where pose is inferred.
RGB + depth + SLAM - adds 3D and trajectory; suitable for most manipulation work.
Full - adds hand pose and a UMI-style gripper for robotics-ready, action-aligned data.
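
The choice between the three tiers can be written as a small decision rule. This is a rough sketch under the assumption that two yes/no questions are enough to pick a tier; `RigTier` and `choose_tier` are hypothetical names, and real projects will also weigh budget and operator effort.

```python
from enum import Enum


class RigTier(Enum):
    RGB_ONLY = "rgb_only"
    RGB_DEPTH_SLAM = "rgb_depth_slam"
    FULL = "full"


def choose_tier(needs_3d_or_trajectory: bool, needs_action_aligned: bool) -> RigTier:
    """Pick the cheapest tier that satisfies the stated requirements."""
    if needs_action_aligned:
        # Hand pose and a UMI-style gripper are only in the full tier.
        return RigTier.FULL
    if needs_3d_or_trajectory:
        return RigTier.RGB_DEPTH_SLAM
    return RigTier.RGB_ONLY


print(choose_tier(needs_3d_or_trajectory=False, needs_action_aligned=False).value)
# rgb_only
print(choose_tier(needs_3d_or_trajectory=True, needs_action_aligned=False).value)
# rgb_depth_slam
print(choose_tier(needs_3d_or_trajectory=True, needs_action_aligned=True).value)
# full
```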

We do not recommend unproven proprietary hardware. The devices above are the ones behind published datasets, which is why policies trained on their data transfer.

Calibration and consent are part of the rig

Record camera intrinsics/extrinsics and control frequency, and - because you are filming people - obtain consent and plan redaction from the start. See annotating egocentric data and the Data Trust Pack.
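
As an illustration of what recording calibration can look like, the sketch below stores intrinsics, extrinsics, control frequency and a consent flag next to a recording, and uses the intrinsics in a standard pinhole projection. The `RigCalibration` class, its field names and the example values are assumptions; lens distortion is ignored for brevity.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class RigCalibration:
    """Per-recording calibration metadata (hypothetical schema)."""

    fx: float                      # focal length in pixels, x
    fy: float                      # focal length in pixels, y
    cx: float                      # principal point, x
    cy: float                      # principal point, y
    extrinsics: List[List[float]]  # 4x4 camera-to-rig transform, row-major
    control_hz: float              # rate at which actions are logged
    consent_obtained: bool         # recorded alongside the data, not after

    def project(self, point_cam: Tuple[float, float, float]) -> Tuple[float, float]:
        """Pinhole projection of a 3D point in the camera frame to pixels."""
        x, y, z = point_cam
        if z <= 0:
            raise ValueError("point must be in front of the camera (z > 0)")
        return (self.fx * x / z + self.cx, self.fy * y / z + self.cy)


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
calib = RigCalibration(
    fx=600.0, fy=600.0, cx=704.0, cy=704.0,  # illustrative values only
    extrinsics=identity, control_hz=30.0, consent_obtained=True,
)

print(calib.project((0.5, -0.25, 2.0)))
# (854.0, 629.0)

# Store the calibration as JSON next to the recording and read it back.
restored = RigCalibration(**json.loads(json.dumps(asdict(calib))))
print(restored == calib)
# True
```

Saving this record with every session means depth, pose and RGB can be re-aligned later without guessing at the rig geometry.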

FAQ

What hardware do you need to collect egocentric data? A first-person RGB camera with SLAM (e.g. Project Aria), a depth camera (RealSense or ZED), and - for action-aligned manipulation - hand pose or a UMI-style gripper.

Do I need depth and hand pose? It depends on the task. RGB-only suffices for some pre-training; depth and hand pose are needed for robotics-ready manipulation data.

Is special hardware required, or can any camera work? Any camera captures RGB, but research-grade transfer benefits from SLAM, depth and pose - the stack behind datasets like Ego-Exo4D.

nxted runs these rigs so you don't have to: see nxted Capture or compare collection options.