Robot Camera Setup for Data Collection: Wrist, Overhead, and Stereo
Camera placement is one of the most important and most frequently underspecified decisions in robot data collection. The observations your policy sees during training must match what it will see during deployment — and getting the camera setup wrong means collecting data that cannot train a reliable policy.
Camera Placement Strategy
The first principle of robot camera placement is: cameras used for data collection must be identical in mounting position to cameras used for policy deployment. There is no recovery from this mismatch — a policy trained on wrist camera views cannot generalize to an overhead camera view, and vice versa. Define your deployment camera configuration before you collect a single episode of training data.
The most common configurations in manipulation research are: wrist-only (one camera mounted on the robot's wrist, looking forward at the manipulation workspace); overhead-only (one or two cameras mounted on a fixed overhead rig); and multi-view (wrist camera plus one or two external cameras providing global workspace context). Multi-view configurations consistently outperform single-view in policy performance, at the cost of more complex recording infrastructure.
Wrist Cameras: Pros, Cons, and Best Practices
Wrist cameras provide a first-person view of the manipulation action — the robot sees approximately what it is doing at its end-effector. This viewpoint is highly informative for fine grasping and insertion tasks where the relationship between gripper and object must be perceived precisely. Wrist cameras also automatically follow the gripper through the workspace, ensuring the target object is always in frame during manipulation.
The main limitation of wrist cameras is that they do not see the global workspace — the robot cannot perceive objects far from its current gripper position without moving the arm. This limits their effectiveness for tasks requiring scene-level understanding or bi-manual coordination. For bimanual systems, each arm should carry its own wrist camera. Recommended specs: 1080p or higher resolution, 60+ fps, global shutter (not rolling shutter) to avoid motion blur during fast movements, and a wide-angle lens (90–110 degree FOV) to maintain view of the grasp contact point at close range.
Overhead Cameras: Configuration and Tradeoffs
Fixed overhead cameras provide stable, consistent workspace views that capture the full manipulation scene. They are less sensitive to arm motion and provide better context for tasks requiring multiple sequential steps across different workspace regions. Overhead cameras are simpler to mount consistently across multiple robot stations, which matters for large-scale data collection campaigns.
The limitation is reduced detail at the manipulation contact point. An overhead camera at 80 cm height looking down at a tabletop workspace cannot reliably observe gripper-object contact geometry on small objects. This is why overhead cameras are typically paired with wrist cameras in high-performance data collection setups — the overhead view provides task context and coarse positioning, while the wrist view provides fine manipulation detail.
Resolution, Frame Rate, and Synchronization
For manipulation data collection, 480p–720p per camera at 30 fps is sufficient for most imitation learning policies in 2026. Higher resolution (1080p) improves performance on tasks requiring fine spatial discrimination. Frame rates below 30 fps introduce temporal aliasing that degrades policy learning on fast tasks. Frame rates above 60 fps provide diminishing returns for most manipulation tasks and significantly increase storage requirements.
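To see what these choices cost in storage, a back-of-envelope estimate helps. This is a sketch: the ~0.1 bits-per-pixel compressed rate is an assumption for H.264-class encoding, and `hours_to_gb` is an illustrative helper, not part of any tooling.

```python
def hours_to_gb(num_cameras, width, height, fps, hours, bits_per_pixel=0.1):
    """Approximate compressed video size in gigabytes. The default of
    0.1 bits per pixel is a rough figure for H.264-class compression."""
    pixels_per_second = num_cameras * width * height * fps
    total_bits = pixels_per_second * bits_per_pixel * hours * 3600
    return total_bits / 8 / 1e9

# Three cameras at 720p / 30 fps over a 100-hour collection campaign:
print(round(hours_to_gb(3, 1280, 720, 30, 100), 1))  # → 373.2
```

Doubling the frame rate to 60 fps doubles this figure, which is why the diminishing returns above 60 fps matter for large campaigns.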
Multi-camera synchronization is critical and frequently neglected. If cameras are not hardware-synchronized, time-stamp alignment must be implemented carefully during data loading. Even 33 ms of inter-camera offset (one frame at 30 fps) can introduce training instability for tasks where the wrist and overhead views must be temporally consistent. The Intel RealSense D435 and D455 series support hardware synchronization via a sync cable and are SVRC's preferred choice for synchronized multi-camera setups.
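When hardware sync is unavailable, nearest-timestamp matching at data-loading time is the usual fallback. A minimal sketch, assuming sorted per-camera timestamp lists in seconds (`align_to_reference` is an illustrative name, not a library function):

```python
import bisect

def align_to_reference(ref_stamps, cam_stamps, max_offset=0.0167):
    """For each reference timestamp, pick the nearest frame index from a
    second camera; drop pairs whose offset exceeds max_offset (here half
    a frame period at 30 fps). Both lists must be sorted, in seconds."""
    pairs = []
    for t in ref_stamps:
        i = bisect.bisect_left(cam_stamps, t)
        # The nearest frame is either just before or just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(cam_stamps)]
        j = min(candidates, key=lambda k: abs(cam_stamps[k] - t))
        if abs(cam_stamps[j] - t) <= max_offset:
            pairs.append((t, j))
    return pairs

# A wrist camera running 10 ms behind the overhead camera still aligns:
overhead = [0.000, 0.033, 0.067]
wrist = [0.010, 0.043, 0.077]
print(align_to_reference(overhead, wrist))  # one match per overhead frame
```

The `max_offset` threshold is the policy decision: tightening it drops frames rather than training on temporally inconsistent views.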
Camera Types for Robotics: A Technical Comparison
Shutter Type: Global vs Rolling
Global shutter cameras expose all pixels simultaneously — every pixel in the frame represents the same instant in time. Rolling shutter cameras expose pixels row by row, creating a temporal skew across the image. For a robot arm moving at 0.5 m/s, a rolling shutter with a 33 ms readout time produces ~16 mm of distortion across the frame. This distortion is systematic: vertical edges appear slanted, and fast-moving objects are smeared.
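The skew figure is just velocity times readout time, which is easy to recompute for your own arm speeds (a trivial helper for illustration):

```python
def rolling_shutter_skew_mm(arm_speed_m_s, readout_ms):
    """Displacement between the first and last exposed row of a
    rolling-shutter frame, for a camera moving with the arm.
    (m/s multiplied by ms conveniently yields mm.)"""
    return arm_speed_m_s * readout_ms

print(rolling_shutter_skew_mm(0.5, 33))  # → 16.5 (mm)
```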
For wrist-mounted cameras (which move with the arm), global shutter is strongly preferred. The wrist camera experiences the full arm velocity during approach motions. For fixed overhead cameras that do not move, rolling shutter is acceptable for most manipulation tasks because the scene motion is slower. The cost premium for global shutter is significant — roughly 2-3x for equivalent resolution — so use it where it matters (wrist) and save budget elsewhere (overhead).
Sensor Type: Mono vs Stereo vs RGB-D
| Type | Depth | Cost | Weight | Outdoor Use | Best For |
|---|---|---|---|---|---|
| Monocular RGB | No | $50-400 | 20-80g | Yes | Wrist mount (lightweight), IL policies |
| Stereo | Computed | $300-2,000 | 150-350g | Yes | Outdoor mobile manipulation |
| Structured light RGB-D | Active IR | $200-500 | 70-200g | Poor (IR interference) | Indoor fixed mount, grasp planning |
| ToF RGB-D | Active IR (ToF) | $400-1,500 | 100-300g | Moderate | Long-range depth (1-10m) |
Specific Camera Recommendations
| Camera | Type | Resolution | Shutter | Price | Best Use |
|---|---|---|---|---|---|
| Intel RealSense D435i | Stereo RGB-D + IMU | 1920x1080 (RGB), 1280x720 (depth) | Rolling (RGB), Global (IR) | $300 | SVRC default: overhead mount, hardware sync support |
| Intel RealSense D405 | Stereo RGB-D | 1280x720 | Global (all sensors) | $230 | Wrist mount (compact: 42x42x23mm) |
| StereoLabs ZED 2 | Stereo + neural depth | 4416x1242 (2x 2208x1242) | Rolling | $450 | Outdoor/mobile manipulation, long-range depth (20m) |
| FLIR Blackfly S | Mono RGB | Up to 5MP | Global (Sony Pregius) | $400-1,200 | High-quality wrist camera, GigE for long cable runs |
| Arducam OV9281 | Mono (grayscale) | 1280x800 | Global | $50 | Budget wrist camera, Raspberry Pi compatible |
Mounting Positions: Wrist vs Overhead vs Third-Person
Eye-in-Hand (Wrist Mount)
The camera moves with the robot's end-effector, giving the first-person view described in the wrist-camera section above: precise perception of the gripper-object relationship, with the target object kept in frame throughout the motion. What the mounting stage adds are the mechanical details.
Mounting considerations: keep the camera as close to the gripper as possible (5-10 cm above) to minimize the moment arm that creates vibration. Use a rigid bracket — 3D-printed PLA mounts flex under acceleration and produce blurry images. Route the USB cable through a strain relief along the arm to prevent cable fatigue at the connector. Weight budget: keep the total camera + mount + cable weight under 100 g to minimize impact on arm dynamics.
Eye-to-Hand (Overhead/External)
Fixed overhead cameras provide stable, consistent workspace views. Mount at 60-80 cm above the workspace for tabletop manipulation — this height provides full workspace coverage at a 90-degree FOV while maintaining sufficient resolution on small objects. Use an aluminum extrusion frame with vibration-dampening rubber mounts.
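To check whether a given mounting height covers your workspace, the visible footprint of a downward-facing camera follows directly from the FOV geometry (a sketch that ignores lens distortion; `footprint_cm` is an illustrative helper):

```python
import math

def footprint_cm(height_cm, fov_deg):
    """Side length of the square area visible to a downward-facing camera
    with the given full field of view, ignoring lens distortion."""
    return 2 * height_cm * math.tan(math.radians(fov_deg) / 2)

# An overhead camera 70 cm above the table with a 90-degree FOV sees
# roughly a 140 cm square — ample for most tabletop workspaces.
print(round(footprint_cm(70, 90)))  # → 140
```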
Third-Person (Side View)
A side-view camera at 30-45 degrees from vertical provides depth information that is lost in a purely overhead view. This viewpoint is particularly useful for tasks involving vertical stacking or insertion, where the overhead camera cannot distinguish height. SVRC uses a third-person camera on the opposite side of the workspace from the overhead camera to maximize viewpoint diversity.
Calibration Procedures
Eye-in-Hand Calibration
For wrist-mounted cameras, you need the transform from the camera frame to the robot's end-effector frame. The standard procedure:
- Mount a calibration target (ChArUco board recommended over checkerboard for robustness) in a fixed position in the workspace
- Move the robot to 15-20 different positions where the target is visible, recording the robot's end-effector pose and the camera's view of the target at each position
- Run a hand-eye calibration solver (OpenCV's `calibrateHandEye` or the `easy_handeye` ROS2 package) to compute the camera-to-end-effector transform
- Validate by commanding the robot to touch a known point in the camera frame — positional error should be under 3 mm
Eye-to-Hand Calibration
For fixed cameras, you need the transform from the camera frame to the robot base frame. The procedure is similar but the robot holds the calibration target rather than viewing a fixed one:
- Attach the ChArUco board to the robot's end-effector
- Move the robot to 15-20 positions where the target is visible from the fixed camera
- Run the solver in eye-to-hand mode (`eye_on_hand=false` in the `easy_handeye` package)
- Validate by placing an object at a known position and comparing camera-estimated and ground-truth coordinates
Re-calibrate after any camera adjustment (even bumping the mount), after hardware maintenance on the arm, and as a monthly scheduled check. Calibration drift of 2-5mm per month is typical due to thermal expansion, vibration, and mount settling.
Depth Cameras
Depth cameras provide per-pixel distance measurements in addition to RGB imagery, enabling 3D scene understanding without explicit stereo reconstruction. Intel RealSense, Microsoft Azure Kinect, and ZED cameras are the most commonly used depth sensors in robot data collection. Depth information is valuable for tasks where object height, shape, or 3D position is important for grasp planning, and for policies that use point cloud inputs rather than pure image inputs.
The tradeoff: depth cameras add weight, cost, and processing load. Many state-of-the-art imitation learning results are achieved with pure RGB cameras, suggesting depth is not always necessary. Use depth when your policy architecture explicitly benefits from 3D input, when tasks involve significant depth variation (stacking objects of different heights), or when you need robust performance across variable lighting conditions (depth is more lighting-invariant than RGB).
ROS2 Camera Driver Setup
For RealSense cameras (the most common in manipulation research), the realsense2_camera ROS2 package provides a complete driver:
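A typical launch for data collection might look like the following. This is a sketch: parameter names differ between `realsense2_camera` releases, so verify against `ros2 param list` for your installed version.

```shell
# Sketch of a data-collection launch; parameter names vary by release.
ros2 launch realsense2_camera rs_launch.py \
    enable_sync:=true \
    align_depth.enable:=true \
    pointcloud.enable:=false
```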
Key settings for data collection: enable `enable_sync` to synchronize the depth and RGB streams within a camera. Set `align_depth` to get pixel-aligned depth and RGB. Disable `pointcloud` to save CPU and bandwidth. For multi-camera setups, use hardware sync mode to ensure all cameras capture frames at the same instant.
Calibration and SVRC's Multi-Camera Standard
SVRC's data collection standard uses a fixed three-camera configuration: one wrist camera per arm plus one calibrated overhead camera per station. Physical camera mounts are part of our standardized workstation design, ensuring consistent placement across our facility. All calibration parameters are logged automatically and included in dataset exports. For teams setting up their own data collection infrastructure, SVRC offers camera setup consultation and can supply pre-calibrated camera assemblies — contact us or see our data services page for details.