Technical · By nxted Research Team · Published 30 May 2026 · Updated 30 May 2026 · 2 min read

How to Collect Egocentric Data for Robot Training: A Hardware Guide

A practical guide to the rigs and sensors used to record research-grade egocentric data - from Project Aria to depth cameras and grippers.

TL;DR. To collect research-grade egocentric data you need a first-person RGB camera, a way to recover a 6-DoF trajectory (SLAM), depth, and hand pose - commonly built on Meta Project Aria plus an Intel RealSense or Stereolabs ZED depth camera, with a UMI-style gripper for action-aligned manipulation. The rig you choose should match how training-ready you need the data to be.

The signals that matter

  • First-person RGB - the core observation stream (commonly recorded at around 1408x1408 and 30 fps).
  • 6-DoF trajectory - head and hand pose through space, from SLAM.
  • Depth / point cloud - 3D structure of the scene.
  • Hand pose - per-joint tracking, the bridge to robot grippers.
  • Eye gaze - a strong prior for where the action is.
  • Action labels - segmentation and success/failure, added in annotation.
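
As a rough illustration of how the signals above might sit together in one record, here is a minimal Python sketch. The `EgoFrame` class, its field names and the file path are hypothetical and do not come from any published dataset format; only RGB is treated as mandatory, and every other signal is optional.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
# Pose as position (x, y, z) followed by a unit quaternion (qx, qy, qz, qw).
Pose = Tuple[float, float, float, float, float, float, float]


@dataclass
class EgoFrame:
    """One time-stamped sample (hypothetical schema, for illustration only)."""
    timestamp_ns: int
    rgb_path: str                              # first-person RGB frame on disk
    head_pose: Optional[Pose] = None           # 6-DoF pose from SLAM
    depth_path: Optional[str] = None           # depth map or point cloud file
    hand_joints: Optional[List[Vec3]] = None   # per-joint 3D hand positions
    gaze_dir: Optional[Vec3] = None            # gaze direction
    action_label: Optional[str] = None         # added later, in annotation
    success: Optional[bool] = None             # added later, in annotation

    def available_signals(self) -> List[str]:
        """List which signals this sample actually carries."""
        signals = ["rgb"]
        if self.head_pose is not None:
            signals.append("trajectory")
        if self.depth_path is not None:
            signals.append("depth")
        if self.hand_joints is not None:
            signals.append("hand_pose")
        if self.gaze_dir is not None:
            signals.append("eye_gaze")
        if self.action_label is not None:
            signals.append("action_labels")
        return signals


frame = EgoFrame(
    timestamp_ns=0,
    rgb_path="frames/000000.png",  # placeholder path
    head_pose=(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0),
)
print(frame.available_signals())  # ['rgb', 'trajectory']
```

Keeping the non-RGB fields optional lets the same record type describe an RGB-only capture and a fully instrumented one.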

The hardware the research field uses

Three rig tiers

  1. RGB-only - cheapest; good for pre-training and behaviour cloning where pose is inferred.
  2. RGB + depth + SLAM - adds 3D and trajectory; suitable for most manipulation work.
  3. Full - adds hand pose and a UMI-style gripper for robotics-ready, action-aligned data.
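
One way to read the three tiers is as nested sets of signals: pick the lowest tier that covers what your task needs. The sketch below encodes that reading in Python. The signal names and the `minimal_tier` helper are assumptions made for illustration, not an established taxonomy.

```python
from typing import Dict, FrozenSet, Iterable

# Signals each tier provides, following the three tiers listed above.
# The string names are hypothetical labels chosen for this sketch.
TIER_SIGNALS: Dict[int, FrozenSet[str]] = {
    1: frozenset({"rgb"}),
    2: frozenset({"rgb", "depth", "trajectory"}),
    3: frozenset({"rgb", "depth", "trajectory", "hand_pose", "gripper_action"}),
}


def minimal_tier(required: Iterable[str]) -> int:
    """Return the lowest tier whose signals cover everything in `required`."""
    needed = set(required)
    for tier in sorted(TIER_SIGNALS):
        if needed <= TIER_SIGNALS[tier]:
            return tier
    unknown = sorted(needed - TIER_SIGNALS[3])
    raise ValueError(f"no tier provides: {unknown}")


print(minimal_tier({"rgb"}))                 # 1
print(minimal_tier({"rgb", "depth"}))        # 2
print(minimal_tier({"rgb", "hand_pose"}))    # 3
```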

We recommend only real, obtainable hardware, not fictional or proprietary rigs - the devices named here are the ones behind published datasets, which is part of why policies trained on such data tend to transfer.

Calibration and consent are part of the rig

Record camera intrinsics/extrinsics and control frequency, and - because you are filming people - obtain consent and plan redaction from the start. See annotating egocentric data and the Data Trust Pack.
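
A per-session metadata record is one way to make sure none of these items is forgotten. The sketch below is illustrative only: `SessionMetadata`, `check_metadata` and all numeric values are hypothetical placeholders, the intrinsics assume a simple pinhole model (`fx`, `fy`, `cx`, `cy`), and the extrinsics are stored as a 4x4 homogeneous transform.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SessionMetadata:
    """Calibration and consent for one recording session (hypothetical schema)."""
    fx: float                          # pinhole focal lengths, in pixels
    fy: float
    cx: float                          # principal point, in pixels
    cy: float
    camera_to_rig: List[List[float]]   # 4x4 homogeneous extrinsic transform
    control_hz: float                  # control / action sampling frequency
    consent_obtained: bool
    redaction_planned: bool


def check_metadata(meta: SessionMetadata) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if meta.fx <= 0 or meta.fy <= 0:
        problems.append("focal lengths must be positive")
    rows = meta.camera_to_rig
    if len(rows) != 4 or any(len(row) != 4 for row in rows):
        problems.append("extrinsics must be a 4x4 matrix")
    if meta.control_hz <= 0:
        problems.append("control frequency must be positive")
    if not meta.consent_obtained:
        problems.append("consent has not been obtained")
    if not meta.redaction_planned:
        problems.append("no redaction plan recorded")
    return problems


# Placeholder values, not real calibration results.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
meta = SessionMetadata(
    fx=600.0, fy=600.0, cx=704.0, cy=704.0,
    camera_to_rig=identity,
    control_hz=30.0,
    consent_obtained=True,
    redaction_planned=True,
)
print(check_metadata(meta))  # []
```

Running a check like this before a session is archived catches missing calibration or consent while it can still be fixed.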

FAQ

What hardware do you need to collect egocentric data? A first-person RGB camera with SLAM (e.g. Project Aria), a depth camera (RealSense or ZED), and - for action-aligned manipulation - hand pose or a UMI-style gripper.

Do I need depth and hand pose? It depends on the task. RGB-only suffices for some pre-training; depth and hand pose are needed for robotics-ready manipulation data.

Is special hardware required, or can any camera work? Any camera captures RGB, but research-grade transfer benefits from SLAM, depth and pose - the stack behind datasets like Ego-Exo4D.


nxted runs these rigs so you don't have to: see nxted Capture or compare collection options.

nxted Research Team

Physical-AI data specialists at OFORO LTD (UK). We write about egocentric data, robotics dataset formats, RLHF and data governance. See what we build.