Camera Types Comparison
Three camera technologies are commonly used in robot data collection setups. Your choice affects cost, latency determinism, and integration complexity.
| Type | Example Model | Price | Latency | Best For |
|---|---|---|---|---|
| USB (UVC) | Logitech BRIO 4K | ~$200 | 50-200ms variable | Budget setups, low-frequency tasks |
| GigE Vision | Basler ace2 a2A1920 | $400-$1,500 | <1ms deterministic | High-quality datasets, policy training |
| Depth (RGB-D) | Intel RealSense D435 | ~$200 | 30-60ms | Supplemental depth -- not recommended as primary |
USB cameras (Logitech BRIO, ELP, Arducam) are the most accessible but suffer from variable latency caused by USB host controller scheduling. Under load, frame delivery can jitter by 20-50 ms, which desynchronizes multi-camera recordings and degrades policy training. For single-camera or low-frame-rate setups (<15 fps), USB is acceptable.
GigE Vision cameras (Basler ace2, FLIR Blackfly S, Allied Vision Alvium) deliver frames over Ethernet with deterministic <1 ms latency when hardware-triggered. The Basler ace2 a2A1920-160ucBAS at $650 offers 1920x1200 at 160 fps, more than sufficient for 30 fps robot recording. GigE cameras require a dedicated NIC with jumbo frames enabled (ip link set eth0 mtu 9000) and a PoE switch or injector.
Depth cameras (RealSense D435, Azure Kinect) are useful for supplemental 3D scene understanding but are not recommended as primary recording cameras. Their rolling shutter, depth noise at object boundaries, and difficulty with shiny/dark surfaces make them unsuitable as the sole visual observation. Use them in addition to RGB cameras if your policy requires depth input.
Three Camera Configurations
Different manipulation tasks benefit from different camera arrangements. Here are three validated configurations used at SVRC, ordered by complexity and data richness.
Configuration 1: Minimal (1 camera, budget setup)
- Camera: 1x overhead USB camera (Logitech BRIO, $200)
- Placement: Mounted 90-100 cm above workspace, pointing straight down
- Resolution: 1280x720 at 30 fps
- Use case: Simple pick-and-place, initial prototyping, single-arm tabletop tasks
- Limitation: No height information, no ego-centric view, policy performance 15-25% lower than 3-camera setup on complex tasks
This is the fastest way to start collecting data. No synchronization hardware needed. Suitable for validating your task definition before investing in a multi-camera setup.
Configuration 2: Standard (3 cameras, recommended)
This configuration is used at SVRC for standard manipulation data collection and balances coverage, resolution, and storage cost:
- Camera 1 -- Fixed overhead (top-down): Mounted 80-100 cm above the workspace, pointing straight down. Resolution 1280x960 at 30 fps. Captures the full workspace, object placement, and gripper approach from above. This is the most informative view for most pick-and-place policies.
- Camera 2 -- Fixed side (lateral): Mounted at workspace height, 60-80 cm to the side. Resolution 1280x960 at 30 fps. Provides height information that the overhead view cannot. Critical for stacking, pouring, and insertion tasks.
- Camera 3 -- Wrist (ego-centric): Mounted on the robot's end-effector or tool flange, facing forward. Resolution 640x480 at 60 fps. The higher frame rate captures fast wrist motions without blur. Ego-centric views significantly improve grasping and fine manipulation policy performance in imitation learning research.
For bimanual setups using the DK1, add a fourth fixed camera on the opposite side from Camera 2 to cover inter-arm occlusions.
Configuration 3: Full Coverage (5+ cameras, advanced)
- Cameras 1-3: Same as Configuration 2 (overhead, side, wrist)
- Camera 4 -- Opposite side: Mirrors Camera 2 on the other side of the workspace. Eliminates hand/arm occlusion blind spots.
- Camera 5 -- Front angled (45 degrees): Positioned 60 cm in front at 45 degree downward angle. Captures object approach and gripper orientation that overhead misses.
- Camera 6 (optional) -- Depth camera: Intel RealSense D435 mounted near overhead position for supplemental point cloud data.
This configuration generates 3-5x more data per episode in exchange for the most complete visual coverage. It is used for research on multi-view policy learning and 3D scene reconstruction. Storage requirement: approximately 560 MB/minute for five cameras at 15 Mbps each.
End-to-End Latency Budget
For teleoperation, the total sensor-to-actuator latency (from a photon hitting the sensor to actuator movement) must stay below 150 ms for comfortable human operation, and below 100 ms for precise manipulation. Here is the latency breakdown for a typical GigE camera setup:
| Stage | Component | GigE Camera | USB Camera |
|---|---|---|---|
| 1 | Sensor exposure | 5-10 ms | 5-33 ms (auto-exposure) |
| 2 | Readout + transfer | <1 ms (Ethernet) | 10-50 ms (USB scheduling) |
| 3 | Driver processing | 1-2 ms | 2-5 ms |
| 4 | ROS2 topic publish | 1-3 ms | 1-3 ms |
| 5 | Policy inference | 10-30 ms (GPU) | 10-30 ms (GPU) |
| 6 | Motor command + execution | 5-10 ms | 5-10 ms |
| Total | All stages | 23-56 ms | 33-131 ms |
The USB camera path can exceed 100 ms under load (multiple USB devices sharing a host controller), causing noticeable teleop lag. For comfortable teleoperation with the OpenArm 101, we recommend GigE cameras or at minimum ensuring each USB camera is on a dedicated USB host controller (check with lsusb -t).
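The column totals follow from simple per-stage addition. As a sanity check, the sketch below reproduces the table's totals (the stage names and dictionaries are illustrative, not any API):

```python
# Latency budget per pipeline stage, as (min_ms, max_ms), from the table above
GIGE_STAGES = {
    "sensor_exposure": (5, 10),
    "readout_transfer": (1, 1),    # <1 ms over Ethernet; counted as 1 ms
    "driver": (1, 2),
    "ros2_publish": (1, 3),
    "policy_inference": (10, 30),
    "motor_command": (5, 10),
}
USB_STAGES = {
    "sensor_exposure": (5, 33),    # auto-exposure can stretch this
    "readout_transfer": (10, 50),  # USB host scheduling jitter
    "driver": (2, 5),
    "ros2_publish": (1, 3),
    "policy_inference": (10, 30),
    "motor_command": (5, 10),
}

def total_latency(stages):
    """Sum per-stage (min, max) ranges into an end-to-end budget in ms."""
    lo = sum(a for a, _ in stages.values())
    hi = sum(b for _, b in stages.values())
    return lo, hi

print(total_latency(GIGE_STAGES))  # (23, 56)
print(total_latency(USB_STAGES))   # (33, 131)
```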
Synchronization Methods
Multi-camera synchronization is critical. A 33 ms desynchronization between cameras at 30 fps means one camera is one full frame behind -- policies trained on desynchronized data learn incorrect temporal correlations.
Hardware GPIO Trigger (Recommended for GigE cameras)
A single trigger pulse is generated by a microcontroller (Arduino Uno at $25 or Raspberry Pi GPIO) and wired to the trigger input of all cameras simultaneously. Achieves <1 ms synchronization. Configure cameras in external trigger mode via Pylon (Basler) or SpinView (FLIR). The trigger pulse is also logged to your data file, giving you a precise common timestamp.
// Arduino trigger sketch for 30 fps synchronized capture
void setup() {
  pinMode(2, OUTPUT);  // Trigger pin → all camera trigger inputs
}

void loop() {
  digitalWrite(2, HIGH);
  delayMicroseconds(100);  // 100 us pulse width
  digitalWrite(2, LOW);
  delay(33);               // ~33 ms period → ~30 fps
}
Software Synchronization via NTP
Synchronize all recording machines to a common NTP server (e.g. sudo apt install chrony, pointed at pool.ntp.org or a local time server; expect +/-2 ms accuracy on a LAN). ROS2 timestamps taken with rclpy.clock.Clock(clock_type=ClockType.SYSTEM_TIME) will then be consistent across machines to within +/-10 ms. Adequate for 15 fps recording but not for 60 fps wrist cameras.
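The 15 fps versus 60 fps distinction comes from a rule of thumb: worst-case clock skew should stay under half a frame period. A small helper (illustrative, not part of any ROS2 API) makes the check explicit:

```python
def sync_adequate(skew_ms: float, fps: float) -> bool:
    """True if worst-case clock skew stays under half a frame period."""
    frame_period_ms = 1000.0 / fps
    return skew_ms < frame_period_ms / 2

# NTP-class sync across machines (~10 ms worst case):
print(sync_adequate(10, 15))  # True: half a frame at 15 fps is ~33 ms
print(sync_adequate(10, 60))  # False: half a frame at 60 fps is ~8.3 ms
```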
ROS2 message_filters for Approximate Sync
When hardware triggering is not available, use the ROS2 message_filters package to approximately synchronize camera topics by timestamp:
import rclpy
from message_filters import ApproximateTimeSynchronizer, Subscriber
from sensor_msgs.msg import Image

rclpy.init()
node = rclpy.create_node('camera_sync')

def synchronized_callback(overhead, side, wrist):
    # All three frames arrived within `slop` seconds of each other
    node.get_logger().info('synced frame triplet received')

sub_overhead = Subscriber(node, Image, '/cam_overhead/image_raw')
sub_side = Subscriber(node, Image, '/cam_side/image_raw')
sub_wrist = Subscriber(node, Image, '/cam_wrist/image_raw')

sync = ApproximateTimeSynchronizer(
    [sub_overhead, sub_side, sub_wrist],
    queue_size=10,
    slop=0.033  # 33 ms tolerance (1 frame at 30 fps)
)
sync.registerCallback(synchronized_callback)
rclpy.spin(node)
Camera Calibration
Calibration has two components: intrinsics per camera, and extrinsics (relative poses between cameras and to the robot base).
Intrinsic calibration characterizes each camera's focal length, principal point, and distortion coefficients. Use OpenCV's calibration module with a 9x7 checkerboard (25 mm squares). Collect 20-40 images at varied angles and distances. Target a reprojection error <0.5 px (acceptable up to 1.0 px). In a ROS2 setup, the camera_calibration package provides an interactive tool (ros2 run camera_calibration cameracalibrator --size 9x7 --square 0.025 image:=/cam_overhead/image_raw); alternatively, script the process directly with cv2.findChessboardCorners and cv2.calibrateCamera.
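The reprojection-error target can be made concrete with a minimal pinhole-model sketch (distortion omitted for brevity; a real pipeline would use cv2.projectPoints with the fitted distortion coefficients, and the matrix values below are made up for illustration):

```python
import numpy as np

def reprojection_error_px(K, pts_cam, observed_px):
    """RMS reprojection error under a pinhole model.

    K:           3x3 intrinsic matrix
    pts_cam:     (N, 3) checkerboard corners in the camera frame
    observed_px: (N, 2) detected corner pixels
    """
    proj = (K @ pts_cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]  # perspective divide
    return float(np.sqrt(np.mean(np.sum((proj - observed_px) ** 2, axis=1))))

# Synthetic corners that project exactly give zero error
K = np.array([[900.0, 0, 640], [0, 900.0, 480], [0, 0, 1]])
pts = np.array([[0.10, 0.00, 1.0], [0.00, 0.10, 0.8]])
obs = (K @ pts.T).T
obs = obs[:, :2] / obs[:, 2:3]
print(reprojection_error_px(K, pts, obs))  # 0.0
```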
Extrinsic calibration determines the 6-DOF transform from each camera frame to the robot base frame. Use a ChArUco board (better corner detection than plain checkerboard) mounted in several known poses. For each camera, collect 15-20 board observations across the workspace volume. The resulting transforms are stored as static TF frames in your ROS2 parameter file and used to project observations into a common robot-relative coordinate frame for policy training.
Verify calibration by projecting the robot's TCP position (from forward kinematics) into each camera image. The projected point should align with the visible TCP to within 5 pixels at all workspace positions. Errors >10 px typically indicate an incorrect camera mount pose -- recheck your mount rigidity and recollect extrinsic data.
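The TCP-projection check can be sketched as follows, assuming the extrinsic transform is expressed as a camera-from-base rotation R and translation t (toy values below, not a calibrated setup):

```python
import numpy as np

def project_to_pixels(K, R, t, p_base):
    """Project a base-frame 3D point into pixel coordinates.
    R, t give the camera-from-base transform: p_cam = R @ p_base + t."""
    uvw = K @ (R @ p_base + t)
    return uvw[:2] / uvw[2]

def extrinsics_ok(K, R, t, tcp_base, tcp_detected_px, tol_px=5.0):
    """Compare the projected TCP (from forward kinematics) against the
    pixel where the TCP actually appears in the image."""
    err = np.linalg.norm(project_to_pixels(K, R, t, tcp_base) - tcp_detected_px)
    return bool(err <= tol_px)

# Toy values: camera 1 m above the base, axis-aligned
K = np.array([[900.0, 0, 640], [0, 900.0, 480], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])
tcp = np.array([0.05, 0.02, 0.0])
px = project_to_pixels(K, R, t, tcp)
print(extrinsics_ok(K, R, t, tcp, px + 3.0))  # True: ~4.2 px offset is within 5 px
```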
ROS2 Image Transport Pipeline
The image_transport package in ROS2 provides pluggable compression for camera topics. Choosing the right transport reduces bandwidth and CPU load on your recording machine.
| Transport | Bandwidth (1280x960@30fps) | CPU Load | Quality Loss | Best For |
|---|---|---|---|---|
| raw | ~880 Mbps | Minimal | None | Intra-process, shared memory |
| compressed (JPEG 80%) | ~30-60 Mbps | Moderate | Slight (lossy) | Cross-machine recording |
| h264 (ffmpeg_image_transport) | ~10-30 Mbps | High (GPU encode helps) | Moderate (lossy) | Long recording sessions, storage-limited |
| compressed (PNG) | ~300-500 Mbps | High | None (lossless) | Archival, precision-critical data |
Recommendation: Use JPEG compressed transport at 80% quality for standard data collection. The compression artifacts at this quality level are below the noise floor of most policy training pipelines. For the highest-fidelity research datasets, use PNG lossless, but plan for a roughly 10x storage increase.
HDF5 Recording Format for Imitation Learning
The recording pipeline must capture synchronized frames, robot joint states, and action labels into a single file per demonstration. HDF5 is the standard format used by ACT, Diffusion Policy, and most imitation learning frameworks.
Recommended HDF5 Structure
/episode_0042/
  observations/
    images/
      cam_overhead      # (T, H, W, 3) uint8, JPEG-decoded frames
      cam_side          # (T, H, W, 3) uint8
      cam_wrist         # (T, H, W, 3) uint8
    joint_positions     # (T, 6) float64 -- radians
    joint_velocities    # (T, 6) float64 -- rad/s
    ee_pose             # (T, 7) float64 -- xyz + quaternion
    gripper_state       # (T, 1) float64 -- 0.0 closed to 1.0 open
    tactile/            # optional
      left_finger       # (T, 16, 16) uint16 -- Paxini pressure
      right_finger      # (T, 16, 16) uint16
  actions/
    joint_positions     # (T, 6) float64 -- target joint positions
    gripper_action      # (T, 1) float64 -- target gripper state
  metadata/
    timestamp           # (T,) float64 -- Unix timestamps
    trigger_pulse       # (T,) uint8 -- hardware trigger confirmation
    fps                 # scalar -- recording frame rate
    camera_intrinsics   # dict -- per-camera calibration
    camera_extrinsics   # dict -- camera-to-base transforms
Writing HDF5 in Python:
import h5py
import numpy as np

# Placeholder arrays for illustration; in the real pipeline these come
# from the synchronizer's buffers
T = 100
overhead_frames = np.zeros((T, 960, 1280, 3), dtype=np.uint8)
joint_data, ee_data = np.zeros((T, 6)), np.zeros((T, 7))
action_data, timestamps = np.zeros((T, 6)), np.zeros(T)

with h5py.File('episode_0042.hdf5', 'w') as f:
    ep = f.create_group('episode_0042')
    obs = ep.create_group('observations')
    imgs = obs.create_group('images')
    # Store images with chunk-based compression (one frame per chunk)
    imgs.create_dataset('cam_overhead', data=overhead_frames,
                        chunks=(1, 960, 1280, 3), compression='gzip')
    obs.create_dataset('joint_positions', data=joint_data)
    obs.create_dataset('ee_pose', data=ee_data)
    # Actions
    acts = ep.create_group('actions')
    acts.create_dataset('joint_positions', data=action_data)
    # Metadata
    meta = ep.create_group('metadata')
    meta.create_dataset('timestamp', data=timestamps)
    meta.attrs['fps'] = 30
Frame alignment: At write time, align all data to the trigger timestamp. Drop frames that arrive more than 5 ms late rather than using them with incorrect timestamps. A dropped frame is better than a misaligned frame.
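The drop-late-frames rule can be sketched as a small alignment helper (illustrative; timestamps in seconds):

```python
import numpy as np

def align_to_triggers(trigger_ts, frame_ts, max_late_s=0.005):
    """For each trigger timestamp, return the index of the matching
    frame, or None if the nearest frame arrived more than max_late_s
    after the trigger (a dropped frame beats a misaligned one)."""
    matches = []
    for t in trigger_ts:
        j = int(np.argmin(np.abs(np.asarray(frame_ts) - t)))
        late = frame_ts[j] - t
        matches.append(j if 0.0 <= late <= max_late_s else None)
    return matches

triggers = [0.000, 0.033, 0.066]
frames = [0.001, 0.042, 0.067]  # second frame is 9 ms late
print(align_to_triggers(triggers, frames))  # [0, None, 2]
```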
Recording Pipeline Architecture
The complete recording pipeline from camera sensor to HDF5 file:
- Camera driver node (one per camera): Publishes sensor_msgs/Image on /cam_*/image_raw and /cam_*/image_raw/compressed.
- Synchronizer node: Uses message_filters::ApproximateTimeSynchronizer to align frames from all cameras plus /joint_states and /ft_sensor/wrench. Publishes a custom SyncedFrame message.
- Recorder node: Subscribes to SyncedFrame, buffers in memory, and flushes to HDF5 on episode boundary (triggered by operator button press or teleoperation stop signal).
- Compression (optional): If recording at high resolution, run H.264 encoding in a separate thread to avoid blocking the recording loop. At 10 Mbps, a 1280x960 at 30 fps stream uses roughly 75 MB/minute per camera.
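The recorder node's buffer-then-flush behavior can be sketched as follows (simplified: a real recorder would write the flushed arrays to HDF5 via h5py rather than return them, and the dict keys here are illustrative):

```python
import numpy as np

class EpisodeBuffer:
    """In-memory buffer for one episode, flushed at the episode boundary."""
    def __init__(self):
        self.samples = []

    def append(self, synced_frame: dict):
        # One aligned tick: images, joint states, gripper state, ...
        self.samples.append(synced_frame)

    def flush(self) -> dict:
        """Stack per-key samples into (T, ...) arrays and reset the buffer."""
        keys = self.samples[0].keys()
        stacked = {k: np.stack([s[k] for s in self.samples]) for k in keys}
        self.samples = []
        return stacked

buf = EpisodeBuffer()
for _ in range(3):
    buf.append({'joint_positions': np.zeros(6), 'gripper_state': np.zeros(1)})
episode = buf.flush()
print(episode['joint_positions'].shape)  # (3, 6)
```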
For cloud storage and dataset management, use SVRC's Data Services which provides automated upload, deduplication, and dataset versioning. Raw recordings are compressed further (lossless for joint data, lossy H.264 for video) reducing long-term storage to approximately 40 GB per collection day.
Storage Planning
Before starting a data collection campaign, calculate your storage requirements:
Example: 3-camera setup at 15 Mbps per camera, 8-hour collection day: 3 cameras x 15 Mbps x 3600s/hr x 8hr / 8 bits/byte / 1e9 GB = 162 GB/day. At 10 Mbps average: 108 GB/day. Plan for a 4 TB NVMe SSD per recording station, with nightly rsync to a NAS or cloud bucket.
| Configuration | Cameras | GB/day (8 hr) | Days on 4 TB SSD | Monthly Cloud Cost (S3) |
|---|---|---|---|---|
| Minimal (JPEG) | 1 | 36 | ~110 | ~$18 |
| Standard (H.264) | 3 | 108 | ~37 | ~$55 |
| Full coverage (H.264) | 5 | 180 | ~22 | ~$92 |
| Full coverage (PNG lossless) | 5 | 900 | ~4 | ~$460 |
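The worked storage arithmetic generalizes to a small helper for planning other configurations (decimal GB, as in the example above):

```python
def storage_gb_per_day(num_cameras, mbps_per_camera, hours=8):
    """Daily recording volume in GB: bits/s -> bytes -> decimal GB."""
    bits = num_cameras * mbps_per_camera * 1e6 * 3600 * hours
    return bits / 8 / 1e9

print(storage_gb_per_day(3, 15))  # 162.0 GB/day
print(storage_gb_per_day(3, 10))  # 108.0 GB/day
```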
Lighting Setup
Consistent lighting dramatically improves policy generalization. Policies trained under inconsistent lighting often fail in deployment when lighting conditions differ.
- Color temperature: Use 5500K daylight-balanced LED ring lights (e.g., Neewer 18" ring light at $60/unit). Consistent color temperature across all lights prevents white balance variation between cameras.
- Placement: Position lights at 45 degree angles from the overhead camera axis to minimize specular reflections off shiny objects and robot surfaces. Avoid lights directly behind any camera.
- Diffusion: Add diffusion panels (frosted acrylic sheets) in front of lights to eliminate hard shadows. Hard shadows create visual features that don't generalize to different times of day.
- Blackout curtains: For lab setups near windows, install blackout curtains to eliminate ambient light variation from clouds and sun angle. This is one of the highest-ROI investments in data collection quality.
Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| Dropped frames | USB bandwidth saturation | Move cameras to separate USB controllers; use GigE |
| GigE incomplete frames | Jumbo frames not enabled | ip link set eth0 mtu 9000 |
| Color inconsistency between cameras | Auto white balance enabled | Set manual white balance (5500K) on all cameras |
| Blurry wrist camera | Motion blur from long exposure | Set exposure to <5 ms; increase lighting intensity |
| High CPU during recording | Software JPEG encoding per frame | Use camera-side JPEG encoding or GPU H.264 (NVENC) |
| Calibration reprojection >1 px | Too few or poorly distributed calibration images | Collect 30+ images covering all corners at varied distances |