What Is Robot Training Data? Types, Collection Methods, and Quality Standards
Large language models had the internet. Self-driving cars had millions of miles of logged driving. Physical AI -- robots that manipulate objects, assemble products, and work alongside humans -- has a data problem. Robot training data is simultaneously the most important input to the training pipeline and its tightest bottleneck. This guide breaks down what robot training data actually is, the types that exist, how teams collect it, and the quality standards that separate datasets that produce reliable policies from datasets that waste months of engineering effort.
Defining Robot Training Data
Robot training data consists of recorded episodes of a robot performing a task. Each episode captures synchronized streams of sensor data -- camera images (RGB and depth), joint positions and velocities, end-effector poses, gripper states, force-torque readings, and the control inputs that produced those motions. An episode typically lasts 10 to 60 seconds and represents one complete attempt at a task: reaching for an object, grasping it, moving it to a target location, and releasing it.
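To make the episode structure concrete, here is a minimal sketch of how one episode might be represented in memory. The field names, shapes, and the single-arm/single-camera layout are illustrative assumptions, not a standard schema; the key property is that every stream shares the same number of synchronized timesteps.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Episode:
    """One recorded attempt at a task, sampled at a fixed control rate.

    Shapes are assumptions for a 7-DoF arm with one RGB-D camera.
    """
    rgb: np.ndarray          # (T, H, W, 3) uint8 camera frames
    depth: np.ndarray        # (T, H, W) float32 depth in meters
    joint_pos: np.ndarray    # (T, 7) joint angles in radians
    joint_vel: np.ndarray    # (T, 7) joint velocities
    ee_pose: np.ndarray      # (T, 7) end-effector position + quaternion
    gripper: np.ndarray      # (T,) gripper opening, 0.0 closed to 1.0 open
    wrench: np.ndarray       # (T, 6) wrist force-torque readings
    actions: np.ndarray      # (T, 8) executed joint targets + gripper command
    timestamps: np.ndarray   # (T,) seconds since episode start
    success: bool = False    # binary outcome label

    def __post_init__(self):
        # Every stream must share the same number of timesteps T;
        # a mismatch usually means dropped frames or a logging bug.
        T = len(self.timestamps)
        for name in ("rgb", "depth", "joint_pos", "joint_vel",
                     "ee_pose", "gripper", "wrench", "actions"):
            if len(getattr(self, name)) != T:
                raise ValueError(
                    f"stream '{name}' has length "
                    f"{len(getattr(self, name))}, expected {T}")
```

The length check in `__post_init__` is a cheap first line of defense for the episode-completeness problems discussed later in this guide.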
This data serves as the raw material for imitation learning, where a neural network policy learns to map observations (what the robot sees and feels) to actions (what the robot should do next) by studying successful demonstrations. It is also used to fine-tune vision-language-action (VLA) models like RT-2 and OpenVLA, and to define reward functions for reinforcement learning. Without high-quality training data, none of these approaches produce deployable policies.
The reason robot training data is such a bottleneck is that it cannot be scraped from the internet. Every episode requires physical hardware running in a physical environment with a skilled human operator or a carefully scripted controller. Collection rates are measured in episodes per hour, not episodes per second. A typical research dataset contains 100 to 10,000 episodes -- orders of magnitude smaller than the datasets that power language or vision models.
Types of Robot Training Data
Demonstration data is the most common type. A human operator controls the robot through teleoperation or kinesthetic teaching, and the robot records its sensor streams and the resulting actions. Demonstration data directly captures the mapping from observations to actions that the policy needs to learn. It is the gold standard for imitation learning and the primary output of most data collection programs.
Interaction data captures the robot executing a policy autonomously and recording the results. This data includes both successes and failures and is used for online reinforcement learning, DAgger-style iterative improvement, and identifying failure modes. Interaction data is typically lower quality than demonstration data on a per-episode basis but can be collected at higher volume since it does not require continuous human attention.
Synthetic data is generated entirely in simulation. Physics engines like NVIDIA Isaac Sim, MuJoCo, and Genesis render virtual environments where simulated robots execute tasks. Synthetic data offers unlimited volume and perfect annotation, but suffers from the sim-to-real gap -- policies trained purely on synthetic data often fail when transferred to real hardware because simulated physics and rendering do not perfectly match reality.
Video data comes from recorded human activities -- cooking videos, assembly instructions, manipulation demonstrations -- captured without any robot present. Video data is abundant on the internet but lacks action labels (the motor commands that a robot would need to reproduce the behavior). It is most useful for pre-training visual representations rather than directly training robot policies.
Data Quality Dimensions: What Separates Usable Data from Noise
Not all demonstrations are created equal. After running dozens of data collection campaigns at SVRC, we have identified six measurable quality dimensions that predict whether a dataset will train a deployable policy or stall the project in debugging.
1. Demonstration Consistency
Consistency measures how similar the action trajectories are across demonstrations of the same task under the same conditions. You can quantify this by computing the dynamic time warping (DTW) distance between the end-effector trajectories of every pair of episodes collected under the same condition, then examining the coefficient of variation (CV) of those distances. A CV below 0.15 across operators on calibration tasks indicates good consistency; above 0.30, it signals protocol drift that will confuse the policy.
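The metric above can be sketched in a few lines. This is a plain quadratic-time DTW over end-effector position trajectories; production pipelines would likely use an optimized DTW library and windowing constraints, so treat this as an illustration of the computation, not SVRC's implementation.

```python
import numpy as np


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic time warping distance between two trajectories.

    a, b: (T, D) arrays of end-effector positions; T may differ.
    """
    n, m = len(a), len(b)
    # cost[i, j] = min cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])


def consistency_cv(trajectories: list) -> float:
    """Coefficient of variation of pairwise DTW distances.

    Per the thresholds above: below ~0.15 suggests consistent
    demonstrations, above ~0.30 suggests protocol drift.
    """
    dists = np.asarray([
        dtw_distance(trajectories[i], trajectories[j])
        for i in range(len(trajectories))
        for j in range(i + 1, len(trajectories))
    ])
    return float(dists.std() / dists.mean())
```

In practice you would compute this per condition and per operator, and compare operators against a shared calibration task.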
Inconsistency arises from two sources: operator skill variance and ambiguous task definitions. SVRC addresses the first through a structured operator training program with weekly calibration sessions. The second requires precise written success criteria -- not "place the cup on the plate" but "place the cup upright on the plate with the handle facing right, center within 2cm of the plate center, released from a height of no more than 1cm."
2. Coverage of Failure Modes
A dataset that contains only clean, successful demonstrations produces a policy that has never seen a near-failure state and cannot recover from one. The best datasets include a controlled proportion of recovery demonstrations: the operator intentionally starts from slightly wrong poses, drops an object and re-grasps, or corrects a misalignment mid-task. A 90/10 split of clean successes to recovery demonstrations is a good starting ratio. Tag recovery episodes with metadata so they can be weighted separately during training.
Equally important is logging and tagging actual failures. A dataset of 500 successes and 100 tagged failures is more valuable than 600 successes alone, because the failure episodes can train binary success classifiers, define the boundary of the task distribution, and provide negative examples for contrastive learning approaches.
3. Task Distribution Coverage
The task distribution is the space of all starting conditions, object configurations, and environmental variations that the policy will encounter at deployment. A dataset must sample this space systematically, not ad hoc. Define the variation axes explicitly before collection:
| Variation Axis | Minimum Coverage | Example Values |
|---|---|---|
| Object instances | 10+ per category | 10 different mugs, 12 different cans |
| Starting positions | Full workspace coverage | 5x5 grid across 40cm x 40cm workspace |
| Object orientations | 4+ orientations per object | 0, 90, 180, 270 degrees; upright and on side |
| Lighting conditions | 3+ conditions | Overhead fluorescent, side lamp, natural daylight |
| Background scenes | 3+ backgrounds | Clean table, cluttered workspace, different table color |
| Operators | 3+ operators | Distinct control styles, response speeds |
| Distractor objects | Present in 30%+ of episodes | Non-target objects in workspace |
The total episode count follows from the coverage requirement: if you need 10 episodes per unique condition and have 10 objects x 5 positions x 3 lighting conditions, the minimum dataset size is 1,500 episodes. This is a floor, not a ceiling.
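This floor calculation is simple enough to script. The helper below is a sketch of the arithmetic just described; the axis names are taken from the table above and the per-condition minimum is the 10-episode figure from the text.

```python
from math import prod


def minimum_episodes(axis_sizes: dict, per_condition: int = 10) -> int:
    """Floor on dataset size: every combination of variation-axis
    values gets at least `per_condition` episodes."""
    return prod(axis_sizes.values()) * per_condition


# The worked example from the text:
# 10 objects x 5 positions x 3 lighting conditions, 10 episodes each.
floor = minimum_episodes({"objects": 10, "positions": 5, "lighting": 3})
```

Adding an axis multiplies the floor, which is why teams define the variation axes before committing to a collection budget.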
4. Temporal Quality
Operator hesitations, pauses, and reversals during demonstrations introduce noise into the action distribution. A skilled operator performing a pick-and-place task produces a smooth, 8-second trajectory. A novice operator performing the same task may produce a 25-second trajectory with 3-4 pauses and a direction reversal. The novice demonstration is not just slower -- it teaches the policy to pause and reverse, which are rarely desirable behaviors. Temporal quality filtering removes demonstrations where the total duration exceeds 2x the median duration for the task, or where velocity profiles show characteristic hesitation patterns (near-zero velocity for more than 500ms mid-task).
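A minimal version of this filter might look like the sketch below. The 2x-median duration rule and the 500ms near-zero-velocity rule come from the text; the stationarity cutoff (`pause_speed`) and the exclusion of the first and last 10% of each episode (where the arm is legitimately stationary) are assumptions you would tune per task.

```python
import numpy as np


def flag_low_temporal_quality(durations, velocity_profiles, hz=30,
                              pause_ms=500, duration_factor=2.0,
                              pause_speed=1e-3):
    """Return indices of episodes to drop for poor temporal quality.

    durations: episode lengths in seconds.
    velocity_profiles: list of (T,) end-effector speed arrays at `hz`.
    """
    median = float(np.median(durations))
    max_pause_samples = int(pause_ms / 1000 * hz)
    flagged = []
    for i, (dur, vel) in enumerate(zip(durations, velocity_profiles)):
        # Rule 1: total duration exceeds 2x the task median.
        if dur > duration_factor * median:
            flagged.append(i)
            continue
        # Rule 2: longest mid-task run of near-zero speed exceeds 500 ms.
        lo, hi = int(0.1 * len(vel)), int(0.9 * len(vel))
        run = longest = 0
        for v in vel[lo:hi]:
            run = run + 1 if v < pause_speed else 0
            longest = max(longest, run)
        if longest > max_pause_samples:
            flagged.append(i)
    return flagged
```

Flagged episodes are candidates for exclusion or down-weighting, not automatic deletion; some tasks legitimately require mid-task pauses.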
5. Sensor Synchronization Quality
Multi-camera setups and mixed sensor streams (camera + joint states + force/torque) must be synchronized to within tight tolerances. For 30Hz image streams, even a single dropped frame creates a 33ms gap that misaligns actions with observations. For force-torque data logged at 1kHz, clock drift between the F/T sensor and the camera driver can accumulate to hundreds of milliseconds over a 60-second episode. SVRC's collection pipeline uses hardware-triggered synchronization (Intel RealSense hardware sync) and validates inter-stream timing at the end of each recording session. Episodes with synchronization errors exceeding 10ms are flagged for re-collection.
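One simple post-hoc check, sketched below under the assumption that every stream logs per-sample timestamps on a common clock: for each frame of a reference stream, measure the gap to the nearest sample in every other stream, and flag the episode if the worst gap exceeds the 10ms tolerance. This catches dropped frames and boundary drift but is not a substitute for hardware-triggered sync.

```python
import numpy as np


def max_sync_error(stream_timestamps: dict, reference: str) -> float:
    """Worst-case misalignment (seconds) between a reference stream
    and all other streams, via nearest-timestamp matching."""
    ref = np.sort(stream_timestamps[reference])
    worst = 0.0
    for name, ts in stream_timestamps.items():
        if name == reference:
            continue
        ts = np.sort(ts)
        # For each reference time, distance to the nearest sample
        # on either side in this stream.
        idx = np.searchsorted(ts, ref)
        left = np.abs(ref - ts[np.clip(idx - 1, 0, len(ts) - 1)])
        right = np.abs(ref - ts[np.clip(idx, 0, len(ts) - 1)])
        worst = max(worst, float(np.minimum(left, right).max()))
    return worst
```

An episode would be flagged for re-collection when `max_sync_error(...) > 0.010`.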
6. Action Space Fidelity
The recorded actions must faithfully represent what the robot actually did, not what the controller commanded. In leader-follower teleoperation systems, the follower arm may lag the leader by 20-50ms and may not perfectly track the leader's trajectory due to joint limits, torque limits, or dynamic response. Recording the commanded actions (from the leader) rather than the executed actions (from the follower) introduces a systematic bias. Always record the follower's actual joint states and end-effector poses as the action labels, not the leader's commands.
Scaling Laws: How Many Demonstrations Do You Actually Need?
The relationship between dataset size and policy performance follows a predictable pattern, though the exact numbers depend on task complexity, policy architecture, and data quality. Based on published results and SVRC's internal benchmarks, here are practical guidelines:
| Task Complexity | Example | Demos for 70% Success | Demos for 90% Success | Policy Architecture |
|---|---|---|---|---|
| Simple pick-place (1 object) | Pick block, place in bin | 30-50 | 100-200 | ACT |
| Multi-object pick-place | Sort 5 objects into bins | 150-300 | 500-1,000 | Diffusion Policy |
| Contact-rich single task | Peg insertion, lid closing | 100-200 | 300-500 | Diffusion Policy + F/T |
| Bimanual coordination | Fold towel, open bag | 200-400 | 800-1,500 | ACT (bimanual) |
| Language-conditioned multi-task | "Pick the red cup," "stack blocks" | 500-1,000 per task | 2,000-5,000 per task | VLA (OpenVLA, RT-2) |
| VLA fine-tuning (pre-trained) | Adapt OpenVLA to new task | 50-100 | 200-500 | OpenVLA + LoRA |
A critical insight from recent scaling studies: doubling dataset size while holding diversity constant produces diminishing returns after the first 200-300 episodes. Doubling diversity (more objects, more conditions) at constant total episode count almost always improves policy performance more than doubling volume. This is why SVRC's data collection protocols prioritize structured variation over raw episode count.
The Data Flywheel: From Collection to Continuous Improvement
The most advanced teams do not treat data collection as a one-time event. They build a data flywheel -- a continuous loop where deployed policies generate new interaction data, that data is reviewed and annotated, and it feeds back into the next training iteration. The flywheel has four stages:
Stage 1: Seed dataset. Collect 200-500 high-quality demonstrations through teleoperation. Train an initial policy. This is the minimum viable dataset that gets a policy running, even if imperfectly.
Stage 2: Deployment with logging. Deploy the initial policy on the real task with full sensor logging enabled. Every autonomous execution is recorded, including failures. Human operators monitor and intervene when the policy fails (DAgger-style correction).
Stage 3: Failure-targeted collection. Analyze the logged deployment data to identify systematic failure modes. Collect additional demonstrations specifically targeting those failure conditions. If the policy fails on dark-colored objects, collect 50 demonstrations with dark objects. If it fails when the target is near the workspace edge, collect 50 demonstrations at edge positions. This targeted collection is 5-10x more efficient than uniform data augmentation.
Stage 4: Retrain and repeat. Merge the targeted data with the original dataset, retrain, and redeploy. Each cycle through the flywheel produces measurable improvement in the weakest performance areas. Teams that run 3-5 flywheel iterations typically achieve 15-25% higher success rates than teams that train once on a large initial dataset.
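Stage 3's failure-mode analysis can start as something very simple: count failures per condition tag in the deployment logs and rank conditions worst-first to decide where to target the next collection batch. The tag vocabulary below is hypothetical; any condition metadata attached at logging time would work.

```python
from collections import Counter


def rank_failure_conditions(episodes):
    """Rank condition tags by observed failure rate.

    episodes: iterable of (condition_tags, success) pairs, where
    condition_tags is a tuple like ("dark_object", "edge_position").
    Returns [(tag, failure_rate, attempts), ...] sorted worst-first.
    """
    attempts, failures = Counter(), Counter()
    for tags, success in episodes:
        for tag in tags:
            attempts[tag] += 1
            if not success:
                failures[tag] += 1
    ranked = [(tag, failures[tag] / n, n) for tag, n in attempts.items()]
    return sorted(ranked, key=lambda r: r[1], reverse=True)
```

The top-ranked tags become the targets for the next 50-demonstration collection batch; conditions with few attempts deserve a caveat, since a failure rate over three episodes is weak evidence.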
Building a data flywheel requires infrastructure for continuous logging, automated failure detection, and rapid retraining. SVRC's data platform provides the tooling for managing this iterative process across multiple tasks and robot stations.
Collection Methods Compared
Teleoperation is the dominant collection method for high-quality manipulation data. An operator controls the robot remotely using a leader-follower setup, a VR controller, a spacemouse, or a data glove. The robot executes the operator's commands in real time while recording all sensor streams. Teleoperation produces natural, fluid demonstrations that generalize well because the operator can adapt their strategy to each situation in real time. Collection rates typically range from 5 to 15 episodes per hour for a skilled operator, depending on task complexity and reset time. The main drawback is operator fatigue -- quality degrades after extended sessions, and operators require training.
Kinesthetic teaching involves physically guiding the robot arm through the desired motion while the robot is in a compliant (gravity-compensated) mode. The operator grabs the end-effector or a handle and moves it through the task trajectory. This method is intuitive and requires no external control hardware, but it has significant limitations: the operator's hand occludes the workspace from camera views, force application is unnatural because the operator is fighting the robot's inertia, and the method does not scale to bimanual or mobile manipulation tasks. Kinesthetic teaching works best for simple single-arm pick-and-place tasks where trajectory shape matters more than precise contact dynamics.
Scripted collection uses predefined motion primitives -- approach waypoints, grasp patterns, and placement sequences -- executed with randomized parameters. Scripting can generate high volumes of data (hundreds of episodes per hour with automated resets) but produces low-diversity demonstrations because the motion strategy is fixed. Scripted data is useful for initializing policies on well-structured tasks like bin picking with known objects, but it rarely generalizes to novel situations. Most production datasets use scripted collection for the structured portions of a task and teleoperation for the contact-rich or unstructured portions.
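To illustrate what "predefined primitives with randomized parameters" means in practice, here is a sketch of a parameter sampler for a scripted pick primitive. The ranges (meters, radians) and parameter names are illustrative assumptions; the point is that only the parameters vary while the motion strategy (approach, descend, grasp, lift) stays fixed, which is exactly why scripted data is high-volume but low-diversity.

```python
import random


def sample_pick_params(workspace=((-0.2, 0.2), (-0.2, 0.2)),
                       approach_height=(0.08, 0.15),
                       grasp_yaw=(-3.14159, 3.14159),
                       seed=None):
    """Sample one randomized parameter set for a scripted pick primitive."""
    rng = random.Random(seed)
    (xlo, xhi), (ylo, yhi) = workspace
    return {
        "target_x": rng.uniform(xlo, xhi),      # grasp point in workspace
        "target_y": rng.uniform(ylo, yhi),
        "approach_z": rng.uniform(*approach_height),  # pre-grasp hover height
        "yaw": rng.uniform(*grasp_yaw),         # gripper rotation at grasp
    }
```

Seeding the sampler makes each scripted episode reproducible, which simplifies debugging when a batch of automated collections goes wrong.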
Video imitation extracts demonstrations from recorded human video using hand tracking, pose estimation, and retargeting to robot kinematics. This approach is still largely experimental. It works reasonably well for coarse arm motions but fails for fine manipulation because hand pose estimation is not accurate enough and the human-to-robot embodiment mapping introduces errors. Teams exploring video imitation should treat it as a supplement to, not a replacement for, robot-native data collection.
Collection Method Comparison Table
| Method | Throughput | Data Quality | Hardware Cost | Best For |
|---|---|---|---|---|
| Leader-follower teleop | 5-12 demos/hr (expert) | Highest | $4,500-32,000 | Dexterous, contact-rich tasks |
| VR controller teleop | 8-15 demos/hr | High | $500-1,500 | Pick-place, general manipulation |
| Kinesthetic teaching | 3-8 demos/hr | Medium | $0 (uses robot) | Simple trajectories, prototyping |
| Scripted + randomized | 60-200 demos/hr | Low-medium | $0 (software) | Structured tasks, pre-training |
| Video retargeting | N/A (post-hoc) | Low | $0-500 | Visual pre-training only |
Quality Factors That Actually Matter
Diversity is the single most important quality factor. A dataset of 200 episodes covering 15 object instances, 3 lighting conditions, and 3 operators will almost always produce a better policy than 2,000 episodes with a single object under uniform conditions. Diversity forces the policy to learn the task concept rather than memorizing a specific visual pattern. At minimum, a robust manipulation dataset should include 10 or more distinct object instances per category, 3 or more lighting conditions, varied starting positions across the full workspace, and multiple operators.
Consistency means that the success criteria, reset procedure, and task definition are identical across all episodes. If some episodes consider a near-miss as successful while others require precise placement, the policy learns an ambiguous objective. Consistency requires a written collection protocol, clear success criteria that operators can apply unambiguously, and a standardized reset procedure between episodes.
Annotation accuracy covers the metadata attached to each episode: success/failure labels, language instruction labels, task phase segmentation, and object identity tags. Incorrect annotations corrupt the training signal. Binary success labels should be verified by a second reviewer for borderline cases. Language instructions should be checked against actual task behavior. SVRC's pipeline includes automated success classification with human review on all borderline cases.
Episode completeness means every episode captures the full task from initial approach through final placement, with all sensor streams synchronized and no dropped frames. Incomplete episodes -- where recording started mid-grasp, a camera stream dropped, or joint states were logged at a different frequency than images -- introduce noise that degrades policy learning. Synchronization verification should be an automated step in every collection pipeline.
Common Mistakes Teams Make Collecting Their First Dataset
Collecting too many episodes with too little diversity. Teams often fixate on episode count targets (1,000 episodes) without controlling for diversity. The result is a large dataset that overfits badly. Set diversity targets first -- number of object instances, environments, operators -- and let the total episode count follow from those targets multiplied by a per-condition minimum (typically 10-20 episodes per unique condition).
Skipping operator calibration. Untrained operators produce jerky, inconsistent demonstrations that actively harm policy performance. Allocate 2-4 hours for each new operator to practice the teleoperation interface before their demonstrations count toward the dataset. Track per-operator quality metrics and provide feedback.
Not logging failure episodes. Failed demonstrations should be recorded and tagged, not discarded. Failure data is valuable for training recovery behaviors, building classifiers to detect impending failures, and understanding where your task is hardest. Discard failed episodes from the imitation learning training set, but archive them for analysis.
Ignoring reset consistency. If the reset between episodes is sloppy -- objects placed in roughly the same spot, background clutter left from the previous trial -- the dataset inherits systematic biases. Invest in a repeatable reset protocol, including randomized object placement within a defined region, consistent background state, and a checklist that operators follow between episodes.
Choosing the wrong format. Storing data in custom formats creates friction for every downstream consumer. Use established formats: HDF5 or Zarr for raw episode data, the LeRobot HuggingFace format for sharing, and Open X-Embodiment schema for cross-embodiment research. Converting between formats later is always harder than choosing the right format upfront.
Under-investing in annotation. Many teams skip language annotation entirely or label episodes with copy-pasted identical instructions. If you plan to train language-conditioned policies or fine-tune VLAs, every episode needs a unique natural language instruction that accurately describes what happened. "Pick up the red cup from the left side of the table and place it on the blue plate" is a useful annotation. "Pick and place" is not. See our annotation guide for detailed standards.
Data Formats and Export Standards
Choosing the right storage format affects every downstream step in your pipeline. Here is a practical comparison of the formats you will encounter:
| Format | Strengths | Weaknesses | Best Use Case |
|---|---|---|---|
| HDF5 (.h5) | Random access, chunked compression, metadata support, mature tooling | Single-writer lock, poor cloud streaming, no built-in versioning | Local collection, ACT/Diffusion Policy training |
| Zarr | Cloud-native, concurrent writes, chunk-level access, S3/GCS compatible | Less mature ecosystem, no single-file portability | Large-scale cloud datasets, Diffusion Policy reference implementation |
| RLDS (TFRecord) | Streaming, Open X-Embodiment standard, TF ecosystem integration | TensorFlow dependency, no random access, sequential read only | Cross-embodiment sharing, OXE compatibility |
| LeRobot (HuggingFace) | Standardized schema, Hub integration, streaming via datasets library, growing community | Parquet overhead for small datasets, Hub dependency for sharing | Community sharing, LeRobot framework training, VLA fine-tuning |
| ROS bag (.bag / .db3) | Native ROS logging, preserves topic structure, replay support | ROS dependency, not ML-ready, needs conversion for training | Raw collection on ROS2 systems, debug replay |
SVRC's recommendation: collect in HDF5 or ROS bag for maximum flexibility during the collection phase, then convert to LeRobot format for sharing and training. Our data platform handles format conversion automatically and supports export to all five formats above. Browse existing public datasets to understand the data structure before designing your own collection.
Annotation Requirements by Policy Type
Different policy architectures require different annotation layers. Under-annotating wastes future opportunity; over-annotating wastes current budget. Match your annotation investment to your planned training pipeline:
ACT / Diffusion Policy (single-task): Binary success flag per episode. Optional: task phase segmentation for debugging. No language labels needed. Annotation cost: $0.02-0.05 per episode.
Language-conditioned BC (BC-Z, RT-1): Binary success flag plus natural language instruction per episode. Instructions must distinguish task variants. Annotation cost: $0.10-0.25 per episode.
VLA fine-tuning (OpenVLA, RT-2): Binary success flag, natural language instruction, and ideally task phase segmentation. Language instructions should be varied in phrasing (not copy-pasted) to teach the model language generalization. Annotation cost: $0.25-0.50 per episode.
Hierarchical policies: Full task phase segmentation with sub-task success flags. Each phase boundary requires a timestamp label. Annotation cost: $0.50-2.00 per episode depending on task complexity.
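The annotation layers above can be captured in a single per-episode record, populated only as deeply as the planned policy type requires. The schema below is a sketch, not a standard; field names are illustrative.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EpisodeAnnotation:
    """Annotation layers for one episode; fill only what your
    planned training pipeline needs."""
    episode_id: str
    success: bool                        # all policy types (ACT, Diffusion)
    instruction: Optional[str] = None    # language-conditioned BC, VLA
    phases: Optional[list] = None        # hierarchical policies:
                                         # [{"name", "start_s", "end_s", "success"}]

    def to_json(self) -> str:
        # Serialize for storage alongside the episode's sensor data.
        return json.dumps(asdict(self))
```

Storing the richer fields as optional keeps single-task ACT datasets cheap to annotate while leaving room to add language labels later if a VLA fine-tune becomes the goal.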
For detailed annotation guidelines and tooling recommendations, see our robot data annotation guide.
Getting Started with SVRC Data Services
Building a robot training data program from scratch requires hardware, operators, a lab environment, collection software, quality assurance tooling, and data engineering expertise. For teams that need data faster than they can build this infrastructure, SVRC's data services provide the complete pipeline: collection operators trained on your specific task, preconfigured hardware stations with multi-camera setups and force-torque sensing, controlled lab environments in Mountain View and Allston, and a quality pipeline that enforces all the standards described above.
The fastest path is to contact the data services team with your task description, target robot platform, and desired episode count. SVRC handles collection protocol design, operator training, data collection, quality assurance, annotation, and export in your preferred format. Remote collection using SVRC-leased hardware at your facility is also supported for tasks that require your specific environment or objects.
If you are planning your first data collection campaign, start with a pilot of 200-500 episodes to validate your task definition and quality standards before scaling to thousands. This pilot approach catches protocol problems early, when they are cheap to fix, rather than after you have invested weeks of collection time in a flawed process. SVRC's pilot programs start at $2,500 and include protocol design, 200 demonstrations, a quality report, and export in your chosen format. Full campaigns for production datasets start at $8,000 and scale with volume.
Hardware options include the OpenArm 101 ($4,500 for the leader-follower kit), ALOHA-class bimanual setups, and the Unitree G1 for mobile manipulation. For teams that prefer to lease rather than purchase, our robot leasing program provides monthly access to pre-calibrated collection stations.