Robotics and AI: Challenges and Innovations in Embedded Platforms

Key Takeaways

1. Large language models are evolving into multimodal systems that integrate visual perception and robotic actions.

2. Asynchronous inference is essential for smooth robot movement but requires minimal latency.

3. The quality of recorded data, with fixed cameras and controlled lighting, is crucial for the success of robotic tasks.

💡 Why it matters: These advancements enhance the efficiency and accuracy of robots in complex environments, paving the way for more sophisticated applications.

Recent Advances in Language and Vision Models

Recent advances in large language models have driven a shift from purely textual reasoning to multimodal systems. The first step was integrating visual perception, giving vision-language models (VLMs); more recently, vision-language-action models (VLAs) have added the ability to generate robotic actions. Deploying these models on embedded robotic platforms remains difficult, however, mainly because of strict constraints on computation, memory, and energy, combined with real-time control requirements.

In synchronous control pipelines, the robotic arm sits idle while the VLA model runs inference, waiting for its next commands. This leads to oscillatory behavior and delayed corrections. Asynchronous inference addresses the issue by decoupling action generation from execution, which allows smooth, continuous movement. For this to work, the end-to-end inference latency must be shorter than the duration of the actions being executed; otherwise the robot runs out of actions and stalls. This timing constraint sets an upper bound on inference latency, and therefore a lower bound on the throughput the model must sustain.
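The timing constraint above comes down to simple arithmetic. The sketch below assumes the policy emits a chunk of `n_actions` actions that the controller plays back at a fixed `control_hz` rate. The function names and the example numbers are illustrative assumptions, not values from this article.

```python
def chunk_duration_s(n_actions: int, control_hz: float) -> float:
    """Time the robot needs to execute one chunk of actions."""
    if n_actions <= 0 or control_hz <= 0:
        raise ValueError("n_actions and control_hz must be positive")
    return n_actions / control_hz


def async_is_feasible(latency_s: float, n_actions: int, control_hz: float) -> bool:
    """True if the next chunk is ready before the current one runs out."""
    return latency_s < chunk_duration_s(n_actions, control_hz)


def stall_time_s(latency_s: float, n_actions: int, control_hz: float) -> float:
    """Idle time per chunk when inference is slower than execution."""
    return max(0.0, latency_s - chunk_duration_s(n_actions, control_hz))


if __name__ == "__main__":
    # Hypothetical setup: 50 actions per chunk, played back at 30 Hz.
    print(chunk_duration_s(50, 30.0))
    print(async_is_feasible(0.8, 50, 30.0))
    print(stall_time_s(2.0, 50, 30.0))
```

A non-zero stall time means the arm pauses between chunks, so either inference must get faster or each chunk must cover a longer stretch of execution.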

Bringing VLA models to embedded platforms is not simply a matter of model compression. It is a complex systems engineering problem that requires architectural decomposition, latency-sensitive planning, and hardware-aligned execution. Tackling these challenges is essential to translate recent advancements in multimodal foundation models into practical and deployable embedded robotic systems.

Data Recording: What Really Matters

In the context of integrating AI in robotics, the quality of recorded data is paramount. High-quality, consistent data outperforms "more but messy" data. This section transforms hard-earned lessons into checklists and concrete schematics. For example, for the task "putting the tea bag in the cup," several aspects must be considered.

Consistency First

Use fixed cameras to avoid pose drift. If one or more cameras move during recording or evaluation, whether from robot vibrations or from the operator resetting the environment, accuracy can drop severely. Control the lighting as much as possible as well: use fixed light sources and avoid sunlight, which varies throughout the day. Maximizing the contrast between the arm, the object, and the environment also helps prevent detection errors.
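One crude way to catch a moved camera or a lighting change before recording is to compare the current view of the empty scene against a saved reference frame. The sketch below is only an illustration: it assumes grayscale frames given as nested lists of pixel intensities, and the `threshold` value is a hypothetical starting point that would need tuning for a real setup.

```python
def mean_abs_diff(ref, cur):
    """Mean absolute per-pixel difference between two grayscale frames."""
    if not ref or len(ref) != len(cur):
        raise ValueError("frames must be non-empty and have the same shape")
    if any(len(r) != len(c) for r, c in zip(ref, cur)):
        raise ValueError("frames must be non-empty and have the same shape")
    n_pixels = sum(len(row) for row in ref)
    if n_pixels == 0:
        raise ValueError("frames must be non-empty and have the same shape")
    total = sum(abs(a - b) for r, c in zip(ref, cur) for a, b in zip(r, c))
    return total / n_pixels


def scene_changed(ref, cur, threshold=8.0):
    """Flag a frame that differs from the reference by more than threshold.

    threshold is an assumed value on a 0-255 intensity scale.
    """
    return mean_abs_diff(ref, cur) > threshold
```

The check cannot tell a shifted camera from a lighting change, but either one is a reason to stop and restore the setup before recording more episodes.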

Back up your robot and teleoperator calibrations so that you don't have to re-record previous episodes if the code crashes. It is equally important not to use information that the model will not have at inference time. During recording, the operator is tempted to look directly at the scene, but this introduces information that is absent from the dataset. The operator should therefore work only from the camera inputs that will be available to the policy at execution.
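The calibration backup can be automated at the start of each recording session. The sketch below assumes the calibrations live as files in a single directory; the function name `backup_calibration` and the folder naming scheme are hypothetical, and the actual location depends on your robotics stack.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_calibration(calib_dir, backup_root):
    """Copy the calibration directory into a new timestamped folder.

    Returns the path of the backup that was created.
    """
    src = Path(calib_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"calibration directory not found: {src}")
    # Microseconds in the stamp make name collisions unlikely.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    dest = Path(backup_root) / f"calibration-{stamp}"
    shutil.copytree(src, dest)  # also creates missing parent directories
    return dest
```

Each call produces a separate snapshot, so an older calibration can be restored to match the episodes that were recorded with it.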

Use a Grasping Camera (Highly Recommended)

Moving from scene-only views to mixed viewpoints increases overall accuracy, but every additional camera adds latency, so you need to find the right trade-off. In our case, three cameras struck this balance: a global view of the entire scene, a close-up view for precise grasping and alignment, and a top-down view for height and depth.

We highly recommend using a camera mounted on the gripper. It consistently improves success rates in fine manipulation tasks by providing a close viewpoint that is directly relevant to the task. It is also the camera that most effectively enforces correct data collection practices, because it lets the operator rely solely on the robot's perception rather than on direct observation of the scene. When installing a grasping camera, secure the cable with Velcro or a strain relief guide so that it neither obstructs the field of view nor disconnects during movement.

Improve Grasping

Simple hardware adjustments, such as heat-shrink tubing on the gripper claws, increase friction, smooth the contact surface, and reduce slipping during episodes. This raises task success rates and, in turn, improves the stability of policy learning.

Diversity & Distributions

When recording a dataset, it is important to vary the distribution of episodes. Divide your workspace into clusters of starting positions and record at least 10 episodes per cluster. Add diversity by changing the position and rotation of the object. For example, we partitioned the accessible workspace of the robotic arm into 11 clusters, each measuring 10 × 10 cm.
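A coverage check of this kind is easy to script. The sketch below assumes the clusters form a row-major grid of 10 × 10 cm cells and that each episode's starting position is known in centimeters in the workspace frame. The grid layout, the `cols` parameter, and the function names are illustrative assumptions; only the cell size and the 10-episode minimum come from the text.

```python
from collections import Counter

CELL_CM = 10.0      # cluster side length, from the text
MIN_EPISODES = 10   # minimum episodes per cluster, from the text


def cluster_of(x_cm, y_cm, cols):
    """Map a start position to a row-major grid cell id (assumed layout)."""
    col = int(x_cm // CELL_CM)
    row = int(y_cm // CELL_CM)
    if not (0 <= col < cols) or row < 0:
        raise ValueError(f"position ({x_cm}, {y_cm}) is outside the grid")
    return row * cols + col


def under_recorded(start_positions, cols, n_clusters):
    """Return {cluster_id: episode_count} for clusters below the minimum."""
    counts = Counter(cluster_of(x, y, cols) for x, y in start_positions)
    return {
        c: counts.get(c, 0)
        for c in range(n_clusters)
        if counts.get(c, 0) < MIN_EPISODES
    }
```

Running `under_recorded` after each session shows which clusters still need episodes before the dataset is considered balanced.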

Finally, keep the training and validation sets strictly separate. Policies can easily overfit the training set, so make sure the model never sees the validation set during training. For instance, we removed cluster 6 from the training set and held it out for validation so that overfitting could be detected.
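A cluster-level split can be sketched as follows, assuming each episode is a dictionary that carries the id of the cluster it started in. The `"cluster"` key and the function name are hypothetical. Holding out a whole cluster (cluster 6 in the usage example) keeps the validation episodes unseen during training.

```python
def split_by_cluster(episodes, held_out_clusters):
    """Split episodes into (train, val) by the cluster they started in."""
    held_out = set(held_out_clusters)
    train = [ep for ep in episodes if ep["cluster"] not in held_out]
    val = [ep for ep in episodes if ep["cluster"] in held_out]
    return train, val


if __name__ == "__main__":
    # Toy dataset: two episodes in each of 11 clusters.
    episodes = [{"id": i, "cluster": i % 11} for i in range(22)]
    train, val = split_by_cluster(episodes, held_out_clusters=[6])
    print(len(train), len(val))
```

Splitting by cluster rather than by random episode means the validation set tests starting positions the policy has not trained on.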