Robot Camera Setup for Data Collection: Wrist, Overhead, and Stereo
Camera placement is one of the most important and most frequently underspecified decisions in robot data collection. The observations your policy sees during training must match what it will see during deployment — and getting the camera setup wrong means collecting data that cannot train a reliable policy.
Camera Placement Strategy
The first principle of robot camera placement is: cameras used for data collection must be identical in mounting position to cameras used for policy deployment. There is no recovery from this mismatch — a policy trained on wrist camera views cannot generalize to an overhead camera view, and vice versa. Define your deployment camera configuration before you collect a single episode of training data.
The most common configurations in manipulation research are: wrist-only (one camera mounted on the robot's wrist, looking forward at the manipulation workspace); overhead-only (one or two cameras mounted on a fixed overhead rig); and multi-view (wrist camera plus one or two external cameras providing global workspace context). Multi-view configurations consistently outperform single-view in policy performance, at the cost of more complex recording infrastructure.
Wrist Cameras: Pros, Cons, and Best Practices
Wrist cameras provide a first-person view of the manipulation action — the robot sees approximately what it is doing at its end-effector. This viewpoint is highly informative for fine grasping and insertion tasks where the relationship between gripper and object must be perceived precisely. Wrist cameras also automatically follow the gripper through the workspace, ensuring the target object is always in frame during manipulation.
The main limitation of wrist cameras is that they do not see the global workspace — the robot cannot perceive objects far from its current gripper position without moving the arm. This limits their effectiveness for tasks requiring scene-level understanding or bi-manual coordination. For bimanual systems, each arm should carry its own wrist camera. Recommended specs: 1080p or higher resolution, 60+ fps, global shutter (not rolling shutter) to avoid motion blur during fast movements, and a wide-angle lens (90–110 degree FOV) to maintain view of the grasp contact point at close range.
Overhead Cameras: Configuration and Tradeoffs
Fixed overhead cameras provide stable, consistent workspace views that capture the full manipulation scene. They are less sensitive to arm motion and provide better context for tasks requiring multiple sequential steps across different workspace regions. Overhead cameras are simpler to mount consistently across multiple robot stations, which matters for large-scale data collection campaigns.
The limitation is reduced detail at the manipulation contact point. An overhead camera at 80 cm height looking down at a tabletop workspace cannot reliably observe gripper-object contact geometry on small objects. This is why overhead cameras are typically paired with wrist cameras in high-performance data collection setups — the overhead view provides task context and coarse positioning, while the wrist view provides fine manipulation detail.
Resolution, Frame Rate, and Synchronization
For manipulation data collection, 480p–720p per camera at 30 fps is sufficient for most imitation learning policies in 2026. Higher resolution (1080p) improves performance on tasks requiring fine spatial discrimination. Frame rates below 30 fps introduce temporal aliasing that degrades policy learning on fast tasks. Frame rates above 60 fps provide diminishing returns for most manipulation tasks and significantly increase storage requirements.
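To see what these choices cost in storage, a back-of-envelope estimate helps. This is a sketch: the ~0.1 bits-per-pixel compressed rate is an assumption for H.264-class encoding, and `hours_to_gb` is an illustrative helper, not part of any tooling.

```python
def hours_to_gb(num_cameras, width, height, fps, hours, bits_per_pixel=0.1):
    """Approximate compressed video size in gigabytes. The default of
    0.1 bits per pixel is a rough figure for H.264-class compression."""
    pixels_per_second = num_cameras * width * height * fps
    total_bits = pixels_per_second * bits_per_pixel * hours * 3600
    return total_bits / 8 / 1e9

# Three cameras at 720p / 30 fps over a 100-hour collection campaign:
print(round(hours_to_gb(3, 1280, 720, 30, 100), 1))  # → 373.2
```

Doubling the frame rate to 60 fps doubles this figure, which is why the diminishing returns above 60 fps matter for large campaigns.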
Multi-camera synchronization is critical and frequently neglected. If cameras are not hardware-synchronized, time-stamp alignment must be implemented carefully during data loading. Even 33 ms of inter-camera offset (one frame at 30 fps) can introduce training instability for tasks where the wrist and overhead views must be temporally consistent. The Intel RealSense D435 and D455 series support hardware synchronization via a sync cable and are SVRC's preferred choice for synchronized multi-camera setups.
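When hardware sync is unavailable, nearest-timestamp matching at data-loading time is the usual fallback. A minimal sketch, assuming sorted per-camera timestamp lists in seconds (`align_to_reference` is an illustrative name, not a library function):

```python
import bisect

def align_to_reference(ref_stamps, cam_stamps, max_offset=0.0167):
    """For each reference timestamp, pick the nearest frame index from a
    second camera; drop pairs whose offset exceeds max_offset (here half
    a frame period at 30 fps). Both lists must be sorted, in seconds."""
    pairs = []
    for t in ref_stamps:
        i = bisect.bisect_left(cam_stamps, t)
        # The nearest frame is either just before or just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(cam_stamps)]
        j = min(candidates, key=lambda k: abs(cam_stamps[k] - t))
        if abs(cam_stamps[j] - t) <= max_offset:
            pairs.append((t, j))
    return pairs

# A wrist camera running 10 ms behind the overhead camera still aligns:
overhead = [0.000, 0.033, 0.067]
wrist = [0.010, 0.043, 0.077]
print(align_to_reference(overhead, wrist))  # one match per overhead frame
```

The `max_offset` threshold is the policy decision: tightening it drops frames rather than training on temporally inconsistent views.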
Camera Types for Robotics: A Technical Comparison
Shutter Type: Global vs Rolling
Global shutter cameras expose all pixels simultaneously — every pixel in the frame represents the same instant in time. Rolling shutter cameras expose pixels row by row, creating a temporal skew across the image. For a robot arm moving at 0.5 m/s, a rolling shutter with a 33 ms readout time produces ~16 mm of distortion across the frame. This distortion is systematic: vertical edges appear slanted, and fast-moving objects are smeared.
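The skew figure is just velocity times readout time, which is easy to recompute for your own arm speeds (a trivial helper for illustration):

```python
def rolling_shutter_skew_mm(arm_speed_m_s, readout_ms):
    """Displacement between the first and last exposed row of a
    rolling-shutter frame, for a camera moving with the arm.
    (m/s multiplied by ms conveniently yields mm.)"""
    return arm_speed_m_s * readout_ms

print(rolling_shutter_skew_mm(0.5, 33))  # → 16.5 (mm)
```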
For wrist-mounted cameras (which move with the arm), global shutter is strongly preferred. The wrist camera experiences the full arm velocity during approach motions. For fixed overhead cameras that do not move, rolling shutter is acceptable for most manipulation tasks because the scene motion is slower. The cost premium for global shutter is significant — roughly 2-3x for equivalent resolution — so use it where it matters (wrist) and save budget elsewhere (overhead).
Sensor Type: Mono vs Stereo vs RGB-D
| Type | Depth | Cost | Weight | Outdoor Use | Best For |
|---|---|---|---|---|---|
| Monocular RGB | No | $50-400 | 20-80g | Yes | Wrist mount (lightweight), IL policies |
| Stereo | Computed | $300-2,000 | 150-350g | Yes | Outdoor mobile manipulation |
| Structured light RGB-D | Active IR | $200-500 | 70-200g | Poor (IR interference) | Indoor fixed mount, grasp planning |
| ToF RGB-D | Active IR (ToF) | $400-1,500 | 100-300g | Moderate | Long-range depth (1-10m) |
Specific Camera Recommendations
| Camera | Type | Resolution | Shutter | Price | Best Use |
|---|---|---|---|---|---|
| Intel RealSense D435i | Stereo RGB-D + IMU | 1920x1080 (RGB), 1280x720 (depth) | Rolling (RGB), Global (IR) | $300 | SVRC default: overhead mount, hardware sync support |
| Intel RealSense D405 | Stereo RGB-D | 1280x720 | Global (all sensors) | $230 | Wrist mount (compact: 42x42x23mm) |
| StereoLabs ZED 2 | Stereo + neural depth | 4416x1242 (2x 2208x1242) | Rolling | $450 | Outdoor/mobile manipulation, long-range depth (20m) |
| FLIR Blackfly S | Mono RGB | Up to 5MP | Global (Sony Pregius) | $400-1,200 | High-quality wrist camera, GigE for long cable runs |
| Arducam OV9281 | Mono (grayscale) | 1280x800 | Global | $50 | Budget wrist camera, Raspberry Pi compatible |
Mounting Positions: Wrist vs Overhead vs Third-Person
Eye-in-Hand (Wrist Mount)
The camera moves with the robot's end-effector, giving the first-person view described in the wrist-camera section above: precise perception of the gripper-object relationship, with the target object kept in frame throughout the motion. What the mounting stage adds are the mechanical details.
Mounting considerations: keep the camera as close to the gripper as possible (5-10 cm above) to minimize the moment arm that creates vibration. Use a rigid bracket — 3D-printed PLA mounts flex under acceleration and produce blurry images. Route the USB cable through a strain relief along the arm to prevent cable fatigue at the connector. Weight budget: keep the total camera + mount + cable weight under 100 g to minimize impact on arm dynamics.
Eye-to-Hand (Overhead/External)
Fixed overhead cameras provide stable, consistent workspace views. Mount at 60-80 cm above the workspace for tabletop manipulation — this height provides full workspace coverage at a 90-degree FOV while maintaining sufficient resolution on small objects. Use an aluminum extrusion frame with vibration-dampening rubber mounts.
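To check whether a given mounting height covers your workspace, the visible footprint of a downward-facing camera follows directly from the FOV geometry (a sketch that ignores lens distortion; `footprint_cm` is an illustrative helper):

```python
import math

def footprint_cm(height_cm, fov_deg):
    """Side length of the square area visible to a downward-facing camera
    with the given full field of view, ignoring lens distortion."""
    return 2 * height_cm * math.tan(math.radians(fov_deg) / 2)

# An overhead camera 70 cm above the table with a 90-degree FOV sees
# roughly a 140 cm square — ample for most tabletop workspaces.
print(round(footprint_cm(70, 90)))  # → 140
```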
Third-Person (Side View)
A side-view camera at 30-45 degrees from vertical provides depth information that is lost in a purely overhead view. This viewpoint is particularly useful for tasks involving vertical stacking or insertion, where the overhead camera cannot distinguish height. SVRC uses a third-person camera on the opposite side of the workspace from the overhead camera to maximize viewpoint diversity.
Calibration Procedures
Eye-in-Hand Calibration
For wrist-mounted cameras, you need the transform from the camera frame to the robot's end-effector frame. The standard procedure:
- Mount a calibration target (ChArUco board recommended over checkerboard for robustness) in a fixed position in the workspace
- Move the robot to 15-20 different positions where the target is visible, recording the robot's end-effector pose and the camera's view of the target at each position
- Run a hand-eye calibration solver (OpenCV's `calibrateHandEye` or the `easy_handeye` ROS2 package) to compute the camera-to-end-effector transform
- Validate by commanding the robot to touch a known point in the camera frame — positional error should be under 3 mm
Eye-to-Hand Calibration
For fixed cameras, you need the transform from the camera frame to the robot base frame. The procedure is similar but the robot holds the calibration target rather than viewing a fixed one:
- Attach the ChArUco board to the robot's end-effector
- Move the robot to 15-20 positions where the target is visible from the fixed camera
- Run the solver in eye-to-hand mode (`eye_on_hand=false` in the `easy_handeye` package)
- Validate by placing an object at a known position and comparing camera-estimated and ground-truth coordinates
Re-calibrate after any camera adjustment (even bumping the mount), after hardware maintenance on the arm, and as a monthly scheduled check. Calibration drift of 2-5mm per month is typical due to thermal expansion, vibration, and mount settling.
Depth Cameras
Depth cameras provide per-pixel distance measurements in addition to RGB imagery, enabling 3D scene understanding without explicit stereo reconstruction. Intel RealSense, Microsoft Azure Kinect, and ZED cameras are the most commonly used depth sensors in robot data collection. Depth information is valuable for tasks where object height, shape, or 3D position is important for grasp planning, and for policies that use point cloud inputs rather than pure image inputs.
The tradeoff: depth cameras add weight, cost, and processing load. Many state-of-the-art imitation learning results are achieved with pure RGB cameras, suggesting depth is not always necessary. Use depth when your policy architecture explicitly benefits from 3D input, when tasks involve significant depth variation (stacking objects of different heights), or when you need robust performance across variable lighting conditions (depth is more lighting-invariant than RGB).
ROS2 Camera Driver Setup
For RealSense cameras (the most common in manipulation research), the realsense2_camera ROS2 package provides a complete driver:
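A typical launch for data collection might look like the following. This is a sketch: parameter names differ between `realsense2_camera` releases, so verify against `ros2 param list` for your installed version.

```shell
# Sketch of a data-collection launch; parameter names vary by release.
ros2 launch realsense2_camera rs_launch.py \
    enable_sync:=true \
    align_depth.enable:=true \
    pointcloud.enable:=false
```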
Key settings for data collection: enable `enable_sync` to synchronize the depth and RGB streams within a camera. Set `align_depth` to get pixel-aligned depth and RGB. Disable `pointcloud` to save CPU and bandwidth. For multi-camera setups, use hardware sync mode to ensure all cameras capture frames at the same instant.
Calibration and SVRC's Multi-Camera Standard
SVRC's data collection standard uses a fixed three-camera configuration: one wrist camera per arm plus one calibrated overhead camera per station. Physical camera mounts are part of our standardized workstation design, ensuring consistent placement across our facility. All calibration parameters are logged automatically and included in dataset exports. For teams setting up their own data collection infrastructure, SVRC offers camera setup consultation and can supply pre-calibrated camera assemblies — contact us or see our data services page for details.