What Mobile ALOHA Is

Mobile ALOHA, developed at Stanford by Tony Zhao et al., extends the original ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) by mounting the bimanual arm system on an omnidirectional mobile base. This enables whole-body teleoperation: an operator controls both arms and the base simultaneously, demonstrating tasks that require locomotion alongside manipulation, such as opening doors, pushing carts, cleaning tables while moving, and navigating between rooms.

The key research contribution of Mobile ALOHA is demonstrating that co-training on diverse static ALOHA data plus a small number of mobile demonstrations (as few as 50 episodes) produces policies that generalize surprisingly well to mobile manipulation tasks. This means the expensive mobile demonstrations are supplemented by cheaper static bimanual data, making the approach more practical than it initially appears.

Mobile ALOHA has become one of the most replicated research platforms in robot learning. Multiple labs, companies, and maker groups have built variants, and the original hardware design is fully open source. However, the build process involves more complexity than the paper suggests, and this guide covers the practical details that the academic publication omits.

Hardware Bill of Materials

A complete Mobile ALOHA system requires four categories of hardware: the mobile base, the manipulation arms (leader and follower), the camera system, and the compute stack. Here is the detailed BOM with 2026 pricing.

Mobile base:

  • AgileX Tracer differential-drive base: $4,500-5,500 depending on configuration. This is the platform used in the original paper. Alternatives include the AgileX Scout Mini ($6,800) for higher payload or the Clearpath Jackal ($20,000+) for research-grade odometry, but the Tracer is the standard choice for cost-constrained builds.
  • Custom mounting frame (aluminum extrusion 80/20 or similar): $300-600 for materials, plus machining. The frame must rigidly couple the arm bases to the mobile platform and provide mounting points for cameras and the compute box.

Follower arms (the robot arms that execute tasks):

  • 2x Trossen Robotics ViperX 300 S2 6-DOF arms: $4,800 each, $9,600 total. These are the standard follower arms for ALOHA builds. They use Dynamixel XM/XH series servos with position, velocity, and current (torque) feedback. Payload is 750g at full extension, which limits the weight of objects the system can manipulate.
  • 2x custom gripper assemblies: $200-400 each. The standard ALOHA gripper is a simple parallel-jaw gripper with a Dynamixel XL330 servo. 3D-printed finger pads are adequate for most tasks.

Leader arms (held by the operator during teleoperation):

  • 2x Trossen Robotics WidowX 250 S 6-DOF arms: $3,100 each, $6,200 total. The WidowX is lighter (0.53 kg) and shorter-reach than the ViperX, making it comfortable for the seated or standing operator to hold during multi-hour data collection sessions. Same Dynamixel servo family ensures transparent kinematic mapping.
  • Leader arm mounting brackets: $100-200. Mounted at waist height on the mobile platform frame so the operator walks behind the platform while holding the leader arms.

Camera system:

  • 2x Intel RealSense D405 wrist cameras: $300 each, $600 total. Mounted on the follower arm wrists for close-range manipulation views.
  • 1x Intel RealSense D435 overhead camera: $350. Mounted on the frame mast for a top-down workspace view.
  • Camera mounts and USB cables: $100-150.

Compute:

  • Onboard workstation: Intel NUC 13 Pro or equivalent mini-PC with i7, 32GB RAM, 1TB NVMe SSD: $800-1,200. This handles real-time teleoperation control, camera capture, and data recording. It does not need a GPU; training happens offline.
  • Training workstation (separate, not on the robot): Any desktop or cloud instance with an NVIDIA RTX 4090 or A100 for ACT/Diffusion Policy training. Budget $2,000-3,000 for a local training machine, or use cloud GPU instances at $1-4/hour.

Total Cost Breakdown

  Category                            Cost Range
  ----------------------------------  --------------
  Mobile base (AgileX Tracer)         $4,500-5,500
  Follower arms (2x ViperX 300 S2)    $9,600
  Leader arms (2x WidowX 250 S)       $6,200
  Grippers and mounting               $700-1,200
  Camera system (3x RealSense)        $950-1,100
  Onboard compute                     $800-1,200
  Frame, cables, misc hardware        $500-800
  Total (robot only)                  $23,250-25,600
  Training workstation (separate)     $2,000-3,000
  Total (complete system)             $25,250-28,600

This is the total cost to build one Mobile ALOHA system from scratch. The original Stanford paper cited approximately $32,000 for their specific configuration; the difference reflects 2026 component pricing and using the base Tracer rather than the higher-end configurations. Note that this does not include operator labor for data collection, which is the dominant ongoing cost in any imitation learning project.

Software Stack: ROS2, ACT, and LeRobot

The Mobile ALOHA software stack has three layers, each with a specific role.

Real-time control layer (ROS2 Humble on Ubuntu 22.04). The low-level control runs as ROS2 nodes: a Dynamixel driver node for each arm, a base driver node for the AgileX platform, and camera driver nodes for each RealSense camera. The critical requirement is that all nodes share a synchronized clock (use chrony or PTP) and that the leader-to-follower command loop runs at 50 Hz with less than 10 ms latency. The Interbotix ROS2 driver packages provide the arm drivers; the AgileX ROS2 package provides the base driver.
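The timing requirement above can be illustrated with a minimal, ROS2-free sketch of a fixed-rate relay loop with deadline tracking. The `read_leader` and `write_follower` callbacks are hypothetical stand-ins for the Dynamixel driver calls; the real system runs these as ROS2 nodes.

```python
import time

CONTROL_HZ = 50
PERIOD = 1.0 / CONTROL_HZ     # 20 ms budget per cycle
LATENCY_BUDGET = 0.010        # 10 ms leader-to-follower target

def run_loop(read_leader, write_follower, n_cycles=100):
    """Fixed-rate leader->follower relay; counts cycles that miss the latency budget."""
    overruns = 0
    next_tick = time.monotonic()
    for _ in range(n_cycles):
        t0 = time.monotonic()
        q = read_leader()                     # joint positions from leader arms
        write_follower(q)                     # mirror onto follower arms
        if time.monotonic() - t0 > LATENCY_BUDGET:
            overruns += 1
        next_tick += PERIOD
        sleep = next_tick - time.monotonic()
        if sleep > 0:
            time.sleep(sleep)
        else:
            next_tick = time.monotonic()      # fell behind; resync the schedule
    return overruns
```

Monitoring overruns during data collection is useful: a rising count usually means a USB bus or driver problem that will show up as jerky demonstrations.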

Teleoperation and recording layer. A recording node subscribes to all joint state topics and camera image topics, timestamps them against a shared clock, and writes synchronized episodes to HDF5 files. Each episode contains: 14-DOF joint positions (7 per arm) at 50 Hz, 14-DOF joint velocities, gripper apertures, three camera streams at 30 fps, base velocity commands, and episode metadata (task label, success flag, operator ID). LeRobot from Hugging Face provides standardized recording scripts for ALOHA-style hardware that handle this synchronization correctly.
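LeRobot's recording scripts handle the file format, but it helps to see what "synchronized" means at the data level. The sketch below is an illustration, not LeRobot's API: it buffers one episode's low-dimensional streams against a shared clock and rejects episodes with timing gaps before they reach the dataset.

```python
import numpy as np

class EpisodeBuffer:
    """Accumulates one teleop episode; all streams share the recorder clock."""

    def __init__(self):
        self.t, self.qpos, self.qvel, self.base_cmd = [], [], [], []

    def add(self, stamp, qpos, qvel, base_cmd):
        # qpos/qvel: 14 values (7 per arm); base_cmd: (linear, angular) velocity
        self.t.append(stamp)
        self.qpos.append(qpos)
        self.qvel.append(qvel)
        self.base_cmd.append(base_cmd)

    def finalize(self, expected_hz=50.0, tol=0.25):
        t = np.asarray(self.t)
        dt = np.diff(t)
        # Reject episodes with dropped control ticks before they reach training
        if np.any(np.abs(dt - 1.0 / expected_hz) > tol / expected_hz):
            raise ValueError("timing gap detected; discard episode")
        return {
            "timestamps": t,
            "qpos": np.asarray(self.qpos, dtype=np.float32),
            "qvel": np.asarray(self.qvel, dtype=np.float32),
            "base_cmd": np.asarray(self.base_cmd, dtype=np.float32),
        }
```

A corrupted or gappy episode that slips into the training set is far harder to debug later than one rejected at recording time.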

Training layer (offline, on the training workstation). ACT (Action Chunking with Transformers) is the standard training algorithm for ALOHA data. ACT predicts a chunk of future actions (typically 100 timesteps) from the current observation, using a transformer encoder-decoder architecture with a CVAE (conditional variational autoencoder) for action prediction. Training takes 4-8 hours on a single RTX 4090 for a 200-episode dataset. LeRobot provides the training pipeline with sensible defaults for ACT, Diffusion Policy, and TD-MPC2.
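At deployment time, ACT smooths the overlap between successive chunks with temporal ensembling: every past chunk that covers the current timestep contributes its prediction, weighted exponentially. A minimal NumPy sketch, following the weighting scheme in the public ACT implementation (the `m` value and the dict-based bookkeeping are illustrative):

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.01):
    """Average the actions that past chunks predicted for timestep t.

    chunks maps prediction-time -> (chunk_len, action_dim) array. Predictions
    are weighted exp(-m * i) with the oldest chunk first, as in the public ACT
    code; a larger m concentrates weight on the oldest prediction.
    """
    acts = []
    for t_pred in sorted(chunks):            # oldest prediction first
        offset = t - t_pred                  # where t falls inside this chunk
        chunk = chunks[t_pred]
        if 0 <= offset < len(chunk):
            acts.append(chunk[offset])
    w = np.exp(-m * np.arange(len(acts)))    # oldest gets the largest weight
    return (np.stack(acts) * w[:, None]).sum(axis=0) / w.sum()
```

Without ensembling, the policy executes each 100-step chunk open-loop, and the hand-off between chunks produces visible jerks on the follower arms.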

First Tasks to Learn

Start with these tasks in order of difficulty. Each builds skills needed for the next.

Task 1: Stationary bimanual handover. The robot picks up an object with one arm and hands it to the other arm. Base remains stationary. This validates bimanual calibration and coordination without adding base motion complexity. Target: 50 demonstrations, 60-70% policy success rate on first training run.

Task 2: Table bussing (clearing a table). The robot picks up objects from a table and places them in a bin on the robot platform, then drives to a drop-off location. This introduces base motion alongside manipulation. The co-training technique matters here: augment your 50 mobile demonstrations with 200-300 static bimanual demonstrations from the handover task. Target: 50 mobile demonstrations plus co-training data, 50-60% success rate.
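Co-training is implemented at the data-loading level: each batch draws from both the mobile and static datasets regardless of their relative sizes, so 50 mobile episodes are not drowned out by several hundred static ones. A hedged sketch of such a sampler (the 50/50 split and the function name are assumptions to tune, not the paper's exact recipe):

```python
import random
from itertools import islice

def cotrain_sampler(n_mobile, n_static, p_mobile=0.5, seed=0):
    """Yield ("mobile"|"static", episode_index) pairs with fixed odds,
    independent of dataset sizes."""
    rng = random.Random(seed)
    while True:
        if rng.random() < p_mobile:
            yield ("mobile", rng.randrange(n_mobile))
        else:
            yield ("static", rng.randrange(n_static))
```

For comparison: with 50 mobile and 300 static episodes, uniform sampling over the pooled data would visit a mobile episode only about one time in seven; fixed odds keep the mobile task represented in every batch.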

Task 3: Opening doors. Approach a door, grasp the handle, pull/push while coordinating base motion. This is a canonical Mobile ALOHA task that demonstrates whole-body coordination: the base must move in sync with the arm as the door swings. This is substantially harder than table bussing because the contact dynamics change throughout the trajectory. Target: 100 demonstrations, 40-50% success rate initially.

Task 4: Object handover to a human. The robot navigates to a person, extends an arm, and releases an object when the person grasps it. This requires detecting human presence and timing the release. Target: 75 demonstrations with varied human positions, 45-55% success rate.

Common Setup Mistakes

These are the mistakes SVRC sees most frequently in labs building their first Mobile ALOHA system.

  • Skipping leader arm gravity compensation. Without gravity compensation, the leader arms feel heavy to the operator. Operator fatigue sets in after 30 minutes, and data quality degrades severely. Configure Dynamixel current limits to 30-50% of rated torque in gravity compensation mode. Test by holding the leader arm at various poses; it should feel nearly weightless.
  • WiFi for leader-follower communication. Routing the leader-follower control loop through WiFi introduces 20-100 ms of variable latency. The system feels sluggish and the operator overcompensates, producing jerky demonstrations. Use direct USB-to-Dynamixel connections on the onboard computer. The leader and follower arms must be on the same physical machine.
  • Ignoring camera synchronization. If wrist cameras and the overhead camera are not synchronized, the recorded observations contain temporal misalignment. At 30 fps, a 2-frame misalignment is 66 ms, which is significant for fast manipulation. Use hardware trigger synchronization (RealSense multi-camera sync module, $50) or at minimum timestamp-based alignment during data loading.
  • Not locking the base during static tasks. If you are collecting manipulation-only data to augment mobile demonstrations (the co-training approach), engage the base motor brake or use software velocity limits to prevent the base from drifting. Base drift during static collection adds noise to your dataset without providing useful mobility information.
  • Insufficient cable management. Loose cables get caught on furniture, catch on arm joints during motion, and occasionally disconnect during episodes. A disconnected camera mid-episode corrupts the entire episode. Use cable chains or spiral wrap on all moving cables and verify cable routing at the start of every collection session.
  • Collecting data with inconsistent task definitions. "Clean the table" is too vague. "Pick up the blue cup from position A and place it in the gray bin" is specific enough. Inconsistent demonstrations within a task label confuse the policy. Write a task specification document before collecting a single episode.
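For the timestamp-based camera alignment mentioned above, the work at load time amounts to matching each 50 Hz control tick to the nearest 30 fps camera frame and flagging ticks with no frame within tolerance. A minimal NumPy sketch (the function name and the 33 ms tolerance are illustrative):

```python
import numpy as np

def align_frames(ctrl_t, cam_t, max_skew=0.033):
    """Map each control-loop timestamp to the nearest camera frame index,
    or -1 when no frame lies within max_skew seconds."""
    cam_t = np.asarray(cam_t)
    matched = []
    for t in ctrl_t:
        i = int(np.searchsorted(cam_t, t))
        # The nearest frame is either just before or just after the tick
        cands = [j for j in (i - 1, i) if 0 <= j < len(cam_t)]
        j = min(cands, key=lambda k: abs(cam_t[k] - t))
        matched.append(j if abs(cam_t[j] - t) <= max_skew else -1)
    return np.asarray(matched)
```

Episodes containing any -1 entries are worth auditing before training: a single camera dropout can poison an otherwise clean demonstration.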

Alternatives to Building Your Own

Building a Mobile ALOHA system takes 2-4 weeks of focused effort for an experienced roboticist, or 6-8 weeks for a team without prior Dynamixel and ROS2 experience. If your primary goal is to collect mobile manipulation data rather than to understand the hardware, consider these alternatives.

UMI (Universal Manipulation Interface): A much cheaper approach ($2,000-3,000 total) that uses a hand-held gripper with a GoPro camera to collect demonstrations without any robot hardware. UMI demonstrations can train manipulation policies that deploy on various robot arms. UMI cannot capture base mobility data, so it is not a replacement for Mobile ALOHA if your tasks require locomotion, but it is an excellent starting point for manipulation-only data collection.

SVRC managed data collection: If you need mobile manipulation data but do not want to build and maintain the hardware, SVRC operates Mobile ALOHA systems and other bimanual platforms at our Mountain View facility. Our operators are trained in whole-body teleoperation tasks. You define the task specification and object set; we collect, annotate, and deliver the dataset in LeRobot or RLDS format. This eliminates the hardware build entirely and gives you professionally collected data from the start. See our data services for details and pricing.

Leasing a system: SVRC's robot leasing program can provide a fully assembled and calibrated Mobile ALOHA system for monthly lease. This lets you collect data in your own environment without the upfront hardware investment. Contact us to discuss leasing availability and configuration options.