Camera Types Comparison
Three camera technologies are commonly used in robot data collection setups. Your choice affects cost, latency determinism, and integration complexity.
| Type | Example Model | Price | Latency | Best For |
|---|---|---|---|---|
| USB (UVC) | Logitech BRIO 4K | ~$200 | 50-200ms variable | Budget setups, low-frequency tasks |
| GigE Vision | Basler ace2 a2A1920 | $400-$1,500 | <1ms deterministic | High-quality datasets, policy training |
| Depth (RGB-D) | Intel RealSense D435 | ~$200 | 30-60ms | Supplemental depth -- not recommended as primary |
USB cameras (Logitech BRIO, ELP, Arducam) are the most accessible but suffer from variable latency caused by USB host controller scheduling. Under load, frame delivery can jitter by 20-50 ms, which desynchronizes multi-camera recordings and degrades policy training. For single-camera or low-frame-rate setups (<15 fps), USB is acceptable.
GigE Vision cameras (Basler ace2, FLIR Blackfly S, Allied Vision Alvium) deliver frames over Ethernet with deterministic <1 ms latency when hardware-triggered. The Basler ace2 a2A1920-160ucBAS at $650 offers 1920x1200 at 160 fps, more than sufficient for 30 fps robot recording. GigE cameras require a dedicated NIC with jumbo frames enabled (ip link set eth0 mtu 9000) and a PoE switch or injector.
Depth cameras (RealSense D435, Azure Kinect) are useful for supplemental 3D scene understanding but are not recommended as primary recording cameras. Their rolling shutter, depth noise at object boundaries, and difficulty with shiny/dark surfaces make them unsuitable as the sole visual observation. Use them in addition to RGB cameras if your policy requires depth input.
Three Camera Configurations
Different manipulation tasks benefit from different camera arrangements. Here are three validated configurations used at SVRC, ordered by complexity and data richness.
Configuration 1: Minimal (1 camera, budget setup)
- Camera: 1x overhead USB camera (Logitech BRIO, $200)
- Placement: Mounted 90-100 cm above workspace, pointing straight down
- Resolution: 1280x720 at 30 fps
- Use case: Simple pick-and-place, initial prototyping, single-arm tabletop tasks
- Limitation: No height information, no ego-centric view, policy performance 15-25% lower than 3-camera setup on complex tasks
This is the fastest way to start collecting data. No synchronization hardware needed. Suitable for validating your task definition before investing in a multi-camera setup.
Configuration 2: Standard (3 cameras, recommended)
This configuration is used at SVRC for standard manipulation data collection and balances coverage, resolution, and storage cost:
- Camera 1 -- Fixed overhead (top-down): Mounted 80-100 cm above the workspace, pointing straight down. Resolution 1280x960 at 30 fps. Captures the full workspace, object placement, and gripper approach from above. This is the most informative view for most pick-and-place policies.
- Camera 2 -- Fixed side (lateral): Mounted at workspace height, 60-80 cm to the side. Resolution 1280x960 at 30 fps. Provides height information that the overhead view cannot. Critical for stacking, pouring, and insertion tasks.
- Camera 3 -- Wrist (ego-centric): Mounted on the robot's end-effector or tool flange, facing forward. Resolution 640x480 at 60 fps. The higher frame rate captures fast wrist motions without blur. Ego-centric views significantly improve grasping and fine manipulation policy performance in imitation learning research.
For bimanual setups using the DK1, add a fourth fixed camera on the opposite side from Camera 2 to cover inter-arm occlusions.
Configuration 3: Full Coverage (5+ cameras, advanced)
- Cameras 1-3: Same as Configuration 2 (overhead, side, wrist)
- Camera 4 -- Opposite side: Mirrors Camera 2 on the other side of the workspace. Eliminates hand/arm occlusion blind spots.
- Camera 5 -- Front angled (45 degrees): Positioned 60 cm in front at 45 degree downward angle. Captures object approach and gripper orientation that overhead misses.
- Camera 6 (optional) -- Depth camera: Intel RealSense D435 mounted near overhead position for supplemental point cloud data.
This configuration generates 3-5x more data per episode in exchange for the most complete visual coverage. It is used for research on multi-view policy learning and 3D scene reconstruction. Storage requirement: approximately 560 MB/minute for five cameras at 15 Mbps each.
End-to-End Latency Budget
For teleoperation, the total sensor-to-actuator latency (from a photon hitting the sensor to actuator movement) must stay below 150 ms for comfortable human operation, and below 100 ms for precise manipulation. Here is the latency breakdown for a typical GigE camera setup:
| Stage | Component | GigE Camera | USB Camera |
|---|---|---|---|
| 1 | Sensor exposure | 5-10 ms | 5-33 ms (auto-exposure) |
| 2 | Readout + transfer | <1 ms (Ethernet) | 10-50 ms (USB scheduling) |
| 3 | Driver processing | 1-2 ms | 2-5 ms |
| 4 | ROS2 topic publish | 1-3 ms | 1-3 ms |
| 5 | Policy inference | 10-30 ms (GPU) | 10-30 ms (GPU) |
| 6 | Motor command + execution | 5-10 ms | 5-10 ms |
| Total | All stages | 23-56 ms | 33-131 ms |
The USB camera path can exceed 100 ms under load (multiple USB devices sharing a host controller), causing noticeable teleop lag. For comfortable teleoperation with the OpenArm 101, we recommend GigE cameras or at minimum ensuring each USB camera is on a dedicated USB host controller (check with lsusb -t).
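The column totals follow from simple per-stage addition. As a sanity check, the sketch below reproduces the table's totals (the stage names and dictionaries are illustrative, not any API):

```python
# Latency budget per pipeline stage, as (min_ms, max_ms), from the table above
GIGE_STAGES = {
    "sensor_exposure": (5, 10),
    "readout_transfer": (1, 1),    # <1 ms over Ethernet; counted as 1 ms
    "driver": (1, 2),
    "ros2_publish": (1, 3),
    "policy_inference": (10, 30),
    "motor_command": (5, 10),
}
USB_STAGES = {
    "sensor_exposure": (5, 33),    # auto-exposure can stretch this
    "readout_transfer": (10, 50),  # USB host scheduling jitter
    "driver": (2, 5),
    "ros2_publish": (1, 3),
    "policy_inference": (10, 30),
    "motor_command": (5, 10),
}

def total_latency(stages):
    """Sum per-stage (min, max) ranges into an end-to-end budget in ms."""
    lo = sum(a for a, _ in stages.values())
    hi = sum(b for _, b in stages.values())
    return lo, hi

print(total_latency(GIGE_STAGES))  # (23, 56)
print(total_latency(USB_STAGES))   # (33, 131)
```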
Synchronization Methods
Multi-camera synchronization is critical. A 33 ms desynchronization between cameras at 30 fps means one camera is one full frame behind -- policies trained on desynchronized data learn incorrect temporal correlations.
Hardware GPIO Trigger (Recommended for GigE cameras)
A single trigger pulse is generated by a microcontroller (Arduino Uno at $25 or Raspberry Pi GPIO) and wired to the trigger input of all cameras simultaneously. Achieves <1 ms synchronization. Configure cameras in external trigger mode via Pylon (Basler) or SpinView (FLIR). The trigger pulse is also logged to your data file, giving you a precise common timestamp.
// Arduino trigger sketch for 30 fps synchronized capture
void setup() {
  pinMode(2, OUTPUT);  // Trigger pin → all camera trigger inputs
}

void loop() {
  digitalWrite(2, HIGH);
  delayMicroseconds(100);  // 100 us pulse width
  digitalWrite(2, LOW);
  delay(33);               // ~33 ms period → ~30 fps
}
Software Synchronization via NTP
Synchronize all recording machines to a common NTP server (e.g. sudo apt install chrony, pointed at pool.ntp.org or a local time server; expect +/-2 ms accuracy on a LAN). ROS2 timestamps taken with rclpy.clock.Clock(clock_type=ClockType.SYSTEM_TIME) will then be consistent across machines to within +/-10 ms. Adequate for 15 fps recording but not for 60 fps wrist cameras.
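The 15 fps versus 60 fps distinction comes from a rule of thumb: worst-case clock skew should stay under half a frame period. A small helper (illustrative, not part of any ROS2 API) makes the check explicit:

```python
def sync_adequate(skew_ms: float, fps: float) -> bool:
    """True if worst-case clock skew stays under half a frame period."""
    frame_period_ms = 1000.0 / fps
    return skew_ms < frame_period_ms / 2

# NTP-class sync across machines (~10 ms worst case):
print(sync_adequate(10, 15))  # True: half a frame at 15 fps is ~33 ms
print(sync_adequate(10, 60))  # False: half a frame at 60 fps is ~8.3 ms
```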
ROS2 message_filters for Approximate Sync
When hardware triggering is not available, use the ROS2 message_filters package to approximately synchronize camera topics by timestamp:
import rclpy
from message_filters import ApproximateTimeSynchronizer, Subscriber
from sensor_msgs.msg import Image

rclpy.init()
node = rclpy.create_node('camera_sync')

def synchronized_callback(overhead, side, wrist):
    # All three frames arrived within `slop` seconds of each other
    node.get_logger().info('synced frame triplet received')

sub_overhead = Subscriber(node, Image, '/cam_overhead/image_raw')
sub_side = Subscriber(node, Image, '/cam_side/image_raw')
sub_wrist = Subscriber(node, Image, '/cam_wrist/image_raw')

sync = ApproximateTimeSynchronizer(
    [sub_overhead, sub_side, sub_wrist],
    queue_size=10,
    slop=0.033  # 33 ms tolerance (1 frame at 30 fps)
)
sync.registerCallback(synchronized_callback)
rclpy.spin(node)
Camera Calibration
Calibration has two components: intrinsics per camera, and extrinsics (relative poses between cameras and to the robot base).
Intrinsic calibration characterizes each camera's focal length, principal point, and distortion coefficients. Use OpenCV's calibration module with a 9x7 checkerboard (25 mm squares). Collect 20-40 images at varied angles and distances. Target a reprojection error <0.5 px (acceptable up to 1.0 px). In a ROS2 setup, the camera_calibration package provides an interactive tool (ros2 run camera_calibration cameracalibrator --size 9x7 --square 0.025 image:=/cam_overhead/image_raw); alternatively, script the process directly with cv2.findChessboardCorners and cv2.calibrateCamera.
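The reprojection-error target can be made concrete with a minimal pinhole-model sketch (distortion omitted for brevity; a real pipeline would use cv2.projectPoints with the fitted distortion coefficients, and the matrix values below are made up for illustration):

```python
import numpy as np

def reprojection_error_px(K, pts_cam, observed_px):
    """RMS reprojection error under a pinhole model.

    K:           3x3 intrinsic matrix
    pts_cam:     (N, 3) checkerboard corners in the camera frame
    observed_px: (N, 2) detected corner pixels
    """
    proj = (K @ pts_cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]  # perspective divide
    return float(np.sqrt(np.mean(np.sum((proj - observed_px) ** 2, axis=1))))

# Synthetic corners that project exactly give zero error
K = np.array([[900.0, 0, 640], [0, 900.0, 480], [0, 0, 1]])
pts = np.array([[0.10, 0.00, 1.0], [0.00, 0.10, 0.8]])
obs = (K @ pts.T).T
obs = obs[:, :2] / obs[:, 2:3]
print(reprojection_error_px(K, pts, obs))  # 0.0
```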
Extrinsic calibration determines the 6-DOF transform from each camera frame to the robot base frame. Use a ChArUco board (better corner detection than plain checkerboard) mounted in several known poses. For each camera, collect 15-20 board observations across the workspace volume. The resulting transforms are stored as static TF frames in your ROS2 parameter file and used to project observations into a common robot-relative coordinate frame for policy training.
Verify calibration by projecting the robot's TCP position (from forward kinematics) into each camera image. The projected point should align with the visible TCP to within 5 pixels at all workspace positions. Errors >10 px typically indicate an incorrect camera mount pose -- recheck your mount rigidity and recollect extrinsic data.
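The TCP-projection check can be sketched as follows, assuming the extrinsic transform is expressed as a camera-from-base rotation R and translation t (toy values below, not a calibrated setup):

```python
import numpy as np

def project_to_pixels(K, R, t, p_base):
    """Project a base-frame 3D point into pixel coordinates.
    R, t give the camera-from-base transform: p_cam = R @ p_base + t."""
    uvw = K @ (R @ p_base + t)
    return uvw[:2] / uvw[2]

def extrinsics_ok(K, R, t, tcp_base, tcp_detected_px, tol_px=5.0):
    """Compare the projected TCP (from forward kinematics) against the
    pixel where the TCP actually appears in the image."""
    err = np.linalg.norm(project_to_pixels(K, R, t, tcp_base) - tcp_detected_px)
    return bool(err <= tol_px)

# Toy values: camera 1 m above the base, axis-aligned
K = np.array([[900.0, 0, 640], [0, 900.0, 480], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])
tcp = np.array([0.05, 0.02, 0.0])
px = project_to_pixels(K, R, t, tcp)
print(extrinsics_ok(K, R, t, tcp, px + 3.0))  # True: ~4.2 px offset is within 5 px
```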
ROS2 Image Transport Pipeline
The image_transport package in ROS2 provides pluggable compression for camera topics. Choosing the right transport reduces bandwidth and CPU load on your recording machine.
| Transport | Bandwidth (1280x960@30fps) | CPU Load | Quality Loss | Best For |
|---|---|---|---|---|
| raw | ~880 Mbps | Minimal | None | Intra-process, shared memory |
| compressed (JPEG 80%) | ~30-60 Mbps | Moderate | Slight (lossy) | Cross-machine recording |
| h264 (ffmpeg_image_transport) | ~10-30 Mbps | High (GPU encode helps) | Moderate (lossy) | Long recording sessions, storage-limited |
| compressed (PNG) | ~300-500 Mbps | High | None (lossless) | Archival, precision-critical data |
Recommendation: Use JPEG compressed transport at 80% quality for standard data collection. The compression artifacts at this quality level are below the noise floor of most policy training pipelines. For the highest-fidelity research datasets, use PNG lossless, but plan for a roughly 10x storage increase.
HDF5 Recording Format for Imitation Learning
The recording pipeline must capture synchronized frames, robot joint states, and action labels into a single file per demonstration. HDF5 is the standard format used by ACT, Diffusion Policy, and most imitation learning frameworks.
Recommended HDF5 Structure
/episode_0042/
  observations/
    images/
      cam_overhead      # (T, H, W, 3) uint8, JPEG-decoded frames
      cam_side          # (T, H, W, 3) uint8
      cam_wrist         # (T, H, W, 3) uint8
    joint_positions     # (T, 6) float64 -- radians
    joint_velocities    # (T, 6) float64 -- rad/s
    ee_pose             # (T, 7) float64 -- xyz + quaternion
    gripper_state       # (T, 1) float64 -- 0.0 closed to 1.0 open
    tactile/            # optional
      left_finger       # (T, 16, 16) uint16 -- Paxini pressure
      right_finger      # (T, 16, 16) uint16
  actions/
    joint_positions     # (T, 6) float64 -- target joint positions
    gripper_action      # (T, 1) float64 -- target gripper state
  metadata/
    timestamp           # (T,) float64 -- Unix timestamps
    trigger_pulse       # (T,) uint8 -- hardware trigger confirmation
    fps                 # scalar -- recording frame rate
    camera_intrinsics   # dict -- per-camera calibration
    camera_extrinsics   # dict -- camera-to-base transforms
Writing HDF5 in Python:
import h5py
import numpy as np

# Placeholder arrays for illustration; in the real pipeline these come
# from the synchronizer's buffers
T = 100
overhead_frames = np.zeros((T, 960, 1280, 3), dtype=np.uint8)
joint_data, ee_data = np.zeros((T, 6)), np.zeros((T, 7))
action_data, timestamps = np.zeros((T, 6)), np.zeros(T)

with h5py.File('episode_0042.hdf5', 'w') as f:
    ep = f.create_group('episode_0042')
    obs = ep.create_group('observations')
    imgs = obs.create_group('images')
    # Store images with chunk-based compression (one frame per chunk)
    imgs.create_dataset('cam_overhead', data=overhead_frames,
                        chunks=(1, 960, 1280, 3), compression='gzip')
    obs.create_dataset('joint_positions', data=joint_data)
    obs.create_dataset('ee_pose', data=ee_data)
    # Actions
    acts = ep.create_group('actions')
    acts.create_dataset('joint_positions', data=action_data)
    # Metadata
    meta = ep.create_group('metadata')
    meta.create_dataset('timestamp', data=timestamps)
    meta.attrs['fps'] = 30
Frame alignment: At write time, align all data to the trigger timestamp. Drop frames that arrive more than 5 ms late rather than using them with incorrect timestamps. A dropped frame is better than a misaligned frame.
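The drop-late-frames rule can be sketched as a small alignment helper (illustrative; timestamps in seconds):

```python
import numpy as np

def align_to_triggers(trigger_ts, frame_ts, max_late_s=0.005):
    """For each trigger timestamp, return the index of the matching
    frame, or None if the nearest frame arrived more than max_late_s
    after the trigger (a dropped frame beats a misaligned one)."""
    matches = []
    for t in trigger_ts:
        j = int(np.argmin(np.abs(np.asarray(frame_ts) - t)))
        late = frame_ts[j] - t
        matches.append(j if 0.0 <= late <= max_late_s else None)
    return matches

triggers = [0.000, 0.033, 0.066]
frames = [0.001, 0.042, 0.067]  # second frame is 9 ms late
print(align_to_triggers(triggers, frames))  # [0, None, 2]
```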
Recording Pipeline Architecture
The complete recording pipeline from camera sensor to HDF5 file:
- Camera driver node (one per camera): Publishes sensor_msgs/Image on /cam_*/image_raw and /cam_*/image_raw/compressed.
- Synchronizer node: Uses message_filters::ApproximateTimeSynchronizer to align frames from all cameras plus /joint_states and /ft_sensor/wrench. Publishes a custom SyncedFrame message.
- Recorder node: Subscribes to SyncedFrame, buffers in memory, and flushes to HDF5 on episode boundary (triggered by operator button press or teleoperation stop signal).
- Compression (optional): If recording at high resolution, run H.264 encoding in a separate thread to avoid blocking the recording loop. At 10 Mbps, a 1280x960 at 30 fps stream uses roughly 75 MB/minute per camera.
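The recorder node's buffer-then-flush behavior can be sketched as follows (simplified: a real recorder would write the flushed arrays to HDF5 via h5py rather than return them, and the dict keys here are illustrative):

```python
import numpy as np

class EpisodeBuffer:
    """In-memory buffer for one episode, flushed at the episode boundary."""
    def __init__(self):
        self.samples = []

    def append(self, synced_frame: dict):
        # One aligned tick: images, joint states, gripper state, ...
        self.samples.append(synced_frame)

    def flush(self) -> dict:
        """Stack per-key samples into (T, ...) arrays and reset the buffer."""
        keys = self.samples[0].keys()
        stacked = {k: np.stack([s[k] for s in self.samples]) for k in keys}
        self.samples = []
        return stacked

buf = EpisodeBuffer()
for _ in range(3):
    buf.append({'joint_positions': np.zeros(6), 'gripper_state': np.zeros(1)})
episode = buf.flush()
print(episode['joint_positions'].shape)  # (3, 6)
```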
For cloud storage and dataset management, use SVRC's Data Services which provides automated upload, deduplication, and dataset versioning. Raw recordings are compressed further (lossless for joint data, lossy H.264 for video) reducing long-term storage to approximately 40 GB per collection day.
Storage Planning
Before starting a data collection campaign, calculate your storage requirements:
Example: 3-camera setup at 15 Mbps per camera, 8-hour collection day: 3 cameras x 15 Mbps x 3600s/hr x 8hr / 8 bits/byte / 1e9 GB = 162 GB/day. At 10 Mbps average: 108 GB/day. Plan for a 4 TB NVMe SSD per recording station, with nightly rsync to a NAS or cloud bucket.
| Configuration | Cameras | GB/day (8 hr) | Days on 4 TB SSD | Monthly Cloud Cost (S3) |
|---|---|---|---|---|
| Minimal (JPEG) | 1 | 36 | ~110 | ~$18 |
| Standard (H.264) | 3 | 108 | ~37 | ~$55 |
| Full coverage (H.264) | 5 | 180 | ~22 | ~$92 |
| Full coverage (PNG lossless) | 5 | 900 | ~4 | ~$460 |
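The worked storage arithmetic generalizes to a small helper for planning other configurations (decimal GB, as in the example above):

```python
def storage_gb_per_day(num_cameras, mbps_per_camera, hours=8):
    """Daily recording volume in GB: bits/s -> bytes -> decimal GB."""
    bits = num_cameras * mbps_per_camera * 1e6 * 3600 * hours
    return bits / 8 / 1e9

print(storage_gb_per_day(3, 15))  # 162.0 GB/day
print(storage_gb_per_day(3, 10))  # 108.0 GB/day
```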
Lighting Setup
Consistent lighting dramatically improves policy generalization. Policies trained under inconsistent lighting often fail in deployment when lighting conditions differ.
- Color temperature: Use 5500K daylight-balanced LED ring lights (e.g., Neewer 18" ring light at $60/unit). Consistent color temperature across all lights prevents white balance variation between cameras.
- Placement: Position lights at 45 degree angles from the overhead camera axis to minimize specular reflections off shiny objects and robot surfaces. Avoid lights directly behind any camera.
- Diffusion: Add diffusion panels (frosted acrylic sheets) in front of lights to eliminate hard shadows. Hard shadows create visual features that don't generalize to different times of day.
- Blackout curtains: For lab setups near windows, install blackout curtains to eliminate ambient light variation from clouds and sun angle. This is one of the highest-ROI investments in data collection quality.
Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| Dropped frames | USB bandwidth saturation | Move cameras to separate USB controllers; use GigE |
| GigE incomplete frames | Jumbo frames not enabled | ip link set eth0 mtu 9000 |
| Color inconsistency between cameras | Auto white balance enabled | Set manual white balance (5500K) on all cameras |
| Blurry wrist camera | Motion blur from long exposure | Set exposure to <5 ms; increase lighting intensity |
| High CPU during recording | Software JPEG encoding per frame | Use camera-side JPEG encoding or GPU H.264 (NVENC) |
| Calibration reprojection >1 px | Too few or poorly distributed calibration images | Collect 30+ images covering all corners at varied distances |