Technical · By nxted Research Team · Published 30 May 2026 · Updated 30 May 2026 · 2 min read

Annotating Egocentric Data: Hand Pose, 6-DoF, and Action Segmentation

Annotation is what turns first-person footage into training data. A practical guide to the labels robot policies need and how to QA them.

TL;DR. Annotation turns egocentric footage into training data. The labels that matter most are action segmentation (what is happening, when), hand pose and a 6-DoF trajectory (how the hands move), and success/failure flags. Quality is measured with inter-annotator agreement; without it, you cannot tell good labels from noise.

The labels robot policies need

Action segmentation - the task split into sub-actions with start/end times.
Hand pose - per-joint hand tracking, the bridge from human demonstration to robot gripper.
6-DoF trajectory - 3D position and 3D orientation (six degrees of freedom) of the hands or tool over time.
Object and contact labels - which object, when grasped/released.
Success/failure flags - whether each episode achieved the goal; essential for learning from mistakes.
Language narration - short descriptions that ground vision-language-action models (see VLA models).
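As a rough illustration, the labels above could be carried in a per-episode record like the one sketched here. Every class name, field name, and unit is an assumption made for this example, not a published nxted schema; a real schema would follow the project's annotation spec.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ActionSegment:
    label: str      # sub-action name, e.g. "grasp_cup" (hypothetical vocabulary)
    start_s: float  # start time, seconds from episode start
    end_s: float    # end time, seconds from episode start


@dataclass
class PoseSample:
    t_s: float                                          # timestamp in seconds
    position_m: Vec3                                    # x, y, z in metres
    quaternion_xyzw: Tuple[float, float, float, float]  # orientation, unit quaternion


@dataclass
class ContactEvent:
    t_s: float
    object_id: str
    kind: str  # "grasp" or "release"


@dataclass
class EpisodeAnnotation:
    episode_id: str
    segments: List[ActionSegment] = field(default_factory=list)
    trajectory: List[PoseSample] = field(default_factory=list)  # 6-DoF hand/tool poses
    hand_joints: List[List[Vec3]] = field(default_factory=list)  # per-frame joint positions
    contacts: List[ContactEvent] = field(default_factory=list)
    success: Optional[bool] = None  # None means "not yet judged"
    narration: str = ""


def segment_problems(episode: EpisodeAnnotation) -> List[str]:
    """List basic timing problems: empty/negative durations and overlaps."""
    problems = []
    for seg in episode.segments:
        if seg.end_s <= seg.start_s:
            problems.append(f"{seg.label}: end {seg.end_s} is not after start {seg.start_s}")
    ordered = sorted(episode.segments, key=lambda s: s.start_s)
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt.start_s < prev.end_s:
            problems.append(f"{nxt.label} overlaps {prev.label}")
    return problems


episode = EpisodeAnnotation(
    episode_id="ep_0001",
    segments=[ActionSegment("reach", 0.0, 1.2), ActionSegment("grasp", 1.2, 2.0)],
    contacts=[ContactEvent(1.5, "cup", "grasp")],
    success=True,
    narration="Pick up the cup from the table.",
)
print(segment_problems(episode))
```

Keeping success as an optional value separates "failed" from "not yet judged", which matters when failed episodes are kept deliberately as training signal.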

How annotation is produced

Some signals come from sensors (6-DoF from SLAM, hand pose from tracking), others from human labellers (segmentation boundaries, success judgements, narration). The Ego-Exo4D methodology combines multi-view capture with expert commentary, which is a useful reference standard.
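One place the two sources meet is when human-labelled segment boundaries are used to slice a sensor-derived pose stream. The sketch below assumes sorted timestamps in seconds and half-open intervals (start included, end excluded); the function name and the string placeholders standing in for poses are illustrative only.

```python
from bisect import bisect_left
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def poses_in_segment(
    timestamps: Sequence[float],
    poses: Sequence[T],
    start_s: float,
    end_s: float,
) -> List[T]:
    """Return the poses whose timestamp t satisfies start_s <= t < end_s.

    Assumes timestamps is sorted ascending and aligned index-for-index with poses.
    """
    if len(timestamps) != len(poses):
        raise ValueError("timestamps and poses must have the same length")
    lo = bisect_left(timestamps, start_s)  # first index with t >= start_s
    hi = bisect_left(timestamps, end_s)    # first index with t >= end_s
    return list(poses[lo:hi])


# Sensor-derived samples (placeholders) and one human-labelled segment.
timestamps = [0.0, 0.1, 0.2, 0.3, 0.4]
poses = ["p0", "p1", "p2", "p3", "p4"]
grasp_poses = poses_in_segment(timestamps, poses, start_s=0.1, end_s=0.3)
print(grasp_poses)  # ['p1', 'p2']
```

Half-open intervals mean a pose sitting exactly on a shared boundary is assigned to only one of two back-to-back segments.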

Measuring annotation quality

Inter-annotator agreement (IAA). Have multiple annotators label the same sample independently; high agreement indicates the labels are consistent and the spec is unambiguous. Report it per batch.
Edge-case review. Surface ambiguous or disputed clips for re-labelling.
Spec adherence. Check labels against a written annotation guide.
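For frame-level action labels from two annotators, one common chance-corrected agreement statistic is Cohen's kappa. The sketch below is one way to compute it and to list the frames where the annotators disagree, which can feed edge-case review. The choice of kappa, the function names, and the sample labels are assumptions for illustration; the article does not state which IAA statistic nxted reports.

```python
from collections import Counter
from typing import List, Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def disagreement_frames(labels_a: Sequence[str], labels_b: Sequence[str]) -> List[int]:
    """Indices where the two annotators differ: candidates for edge-case review."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Made-up frame labels from two annotators.
annotator_a = ["reach", "reach", "grasp", "grasp", "lift", "lift"]
annotator_b = ["reach", "grasp", "grasp", "grasp", "lift", "lift"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))                                # 0.75
print(disagreement_frames(annotator_a, annotator_b))  # [1]
```

Raw percent agreement (5 of 6 frames here) overstates reliability when a few labels dominate; kappa subtracts the agreement expected by chance before normalising.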

Why this is the hard part

Footage is cheap; trustworthy labels are not. The difference between "video" and "training data" is the annotation and its QA. nxted ships a QA report with inter-annotator agreement and labelled edge cases on every batch.

FAQ

What annotations does egocentric robot data need? Action segmentation, hand pose, a 6-DoF trajectory, object/contact labels, success/failure flags and often language narration.

How is annotation quality measured? With inter-annotator agreement on a labelled sample, plus edge-case review and adherence to a written annotation spec.

Why is annotation more important than raw footage? Because policies learn from the labels. Unlabelled or noisy video does not train reliable behaviour; structured, QA'd annotation does.

Get annotated, QA'd episodes: see nxted Capture or read what a good dataset card looks like.