Annotating Egocentric Data: Hand Pose, 6-DoF, and Action Segmentation
Annotation is what turns first-person footage into training data. A practical guide to the labels robot policies need and how to QA them.
TL;DR. Annotation turns egocentric footage into training data. The labels that matter most are action segmentation (what is happening, when), hand pose and a 6-DoF trajectory (how the hands move), and success/failure flags. Quality is measured with inter-annotator agreement; without it, you cannot tell good labels from noise.
The labels robot policies need
- Action segmentation - the task split into sub-actions with start/end times.
- Hand pose - per-joint hand tracking, the bridge from human demonstration to robot gripper.
- 6-DoF trajectory - 3D position and 3D orientation (six degrees of freedom) of the hands or tool over time.
- Object and contact labels - which object is involved, and when it is grasped and released.
- Success/failure flags - whether each episode achieved the goal; essential for learning from mistakes.
- Language narration - short descriptions that ground vision-language-action models (see VLA models).
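One way to picture the labels above is as a single record per episode. The sketch below is a minimal illustration, not a standard or nxted's format: every class and field name (`EpisodeAnnotation`, `ActionSegment`, `quaternion_xyzw` and so on) is hypothetical, and the choice of seconds, metres and quaternions is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ActionSegment:
    label: str       # sub-action name, e.g. "reach"
    start_s: float   # segment start, seconds from episode start
    end_s: float     # segment end, seconds from episode start


@dataclass
class PoseSample:
    t_s: float                                           # timestamp in seconds
    position_m: Tuple[float, float, float]               # x, y, z (assumed metres)
    quaternion_xyzw: Tuple[float, float, float, float]   # orientation (assumed unit quaternion)


@dataclass
class HandPoseFrame:
    t_s: float
    joints_m: List[Tuple[float, float, float]]  # one 3D point per tracked joint


@dataclass
class ContactEvent:
    t_s: float
    object_id: str
    kind: str        # "grasp" or "release"


@dataclass
class EpisodeAnnotation:
    episode_id: str
    segments: List[ActionSegment] = field(default_factory=list)
    hand_pose: List[HandPoseFrame] = field(default_factory=list)
    trajectory: List[PoseSample] = field(default_factory=list)
    contacts: List[ContactEvent] = field(default_factory=list)
    success: Optional[bool] = None   # None until someone has judged the episode
    narration: str = ""


# Hypothetical example episode with made-up values.
episode = EpisodeAnnotation(
    episode_id="ep-0001",
    segments=[ActionSegment("reach", 0.0, 1.2), ActionSegment("grasp", 1.2, 1.9)],
    contacts=[ContactEvent(1.5, "mug", "grasp")],
    success=True,
    narration="Pick up the mug.",
)
```

Keeping `success` as an optional value separates "not yet judged" from "failed", which matters when failures are themselves a training signal.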
How annotation is produced
Some signals come from sensors (6-DoF pose from SLAM, hand pose from hand tracking); others come from human labellers (segmentation boundaries, success judgements, narration). The Ego-Exo4D methodology combines multi-view capture with expert commentary, which makes it a useful reference point.
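Because the sensor streams and the human labels are produced separately, they have to be joined on time at some point. The sketch below shows one simple way to do that, assuming time-sorted samples and half-open segments; `slice_by_segments` and the toy data are hypothetical and not taken from any particular pipeline.

```python
from bisect import bisect_left


def slice_by_segments(timestamps, samples, segments):
    """Group time-sorted sensor samples under human-labelled segments.

    timestamps: sorted list of sample times in seconds
    samples:    list of sensor readings, same length as timestamps
    segments:   list of (label, start_s, end_s), treated as [start_s, end_s)
    """
    if len(timestamps) != len(samples):
        raise ValueError("timestamps and samples must have the same length")
    out = []
    for label, start_s, end_s in segments:
        lo = bisect_left(timestamps, start_s)  # first sample at or after start
        hi = bisect_left(timestamps, end_s)    # first sample at or after end (excluded)
        out.append((label, samples[lo:hi]))
    return out


# Toy data: a 10 Hz stream of (x, y, z) positions with made-up values.
timestamps = [i / 10 for i in range(10)]               # 0.0 .. 0.9
positions = [(0.01 * i, 0.0, 0.0) for i in range(10)]
segments = [("reach", 0.0, 0.5), ("grasp", 0.5, 1.0)]

sliced = slice_by_segments(timestamps, positions, segments)
```

Half-open intervals mean a sample that falls exactly on a boundary belongs to the later segment only, so no sample is counted twice.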
Measuring annotation quality
- Inter-annotator agreement (IAA). Have multiple annotators label the same sample independently; high agreement indicates the labels are reproducible, while low agreement points to an ambiguous spec or noisy labelling. Report it per batch.
- Edge-case review. Surface ambiguous or disputed clips for re-labelling.
- Spec adherence. Check labels against a written annotation guide.
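The article does not say which agreement statistic to use. As one illustration, the sketch below computes Cohen's kappa, a common chance-corrected measure for two annotators, over frame-wise action labels, and then flags clips whose raw agreement is low as candidates for edge-case review. The function names, the toy labels and the 0.8 threshold are all assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def flag_edge_cases(clips, min_agreement=0.8):
    """Return ids of clips whose raw frame agreement is below min_agreement.

    clips maps clip_id -> (labels_a, labels_b); the threshold is an assumption.
    """
    flagged = []
    for clip_id, (labels_a, labels_b) in clips.items():
        agree = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
        if agree < min_agreement:
            flagged.append(clip_id)
    return flagged


# Made-up frame-wise labels from two annotators for the same clip.
ann_a = ["reach", "reach", "grasp", "grasp", "lift", "lift", "lift", "place"]
ann_b = ["reach", "reach", "reach", "grasp", "lift", "lift", "place", "place"]

kappa = cohen_kappa(ann_a, ann_b)  # observed 0.75, chance 0.25 -> kappa = 2/3
to_review = flag_edge_cases({
    "clip-1": (ann_a, ann_b),
    "clip-2": (["reach", "grasp"], ["reach", "grasp"]),
})
```

Frame-wise kappa is only one option: for segment boundaries, an overlap measure between annotators' segments may be more informative, and more than two annotators call for a different statistic.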
Why this is the hard part
Footage is cheap; trustworthy labels are not. The difference between "video" and "training data" is the annotation and its QA. nxted ships a QA report with inter-annotator agreement and labelled edge cases on every batch.
FAQ
What annotations does egocentric robot data need? Action segmentation, hand pose, a 6-DoF trajectory, object/contact labels, success/failure flags, and often language narration.
How is annotation quality measured? With inter-annotator agreement on a labelled sample, plus edge-case review and adherence to a written annotation spec.
Why is annotation more important than raw footage? Because policies learn from the labels. Unlabelled or noisy video does not train reliable behaviour; structured, QA'd annotation does.
Get annotated, QA'd episodes: see nxted Capture or read what a good dataset card looks like.
Physical-AI data specialists at OFORO LTD (UK). We write about egocentric data, robotics dataset formats, RLHF and data governance. See what we build.