Vision Language Action Deployment
GitHub
Integration of the OpenVLA model with MuJoCo and Robosuite for direct vision-to-action control.
This project integrates the open-source Vision-Language-Action (OpenVLA) model with MuJoCo and Robosuite, enabling the robot to interpret text and images and convert those inputs into precise, real-time actions. A custom position-control API decodes OpenVLA's outputs, adjusting end-effector coordinates, grip strength, and orientation, to perform tasks such as picking and placing objects.
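As a rough illustration, the decoding step might look like the sketch below. It assumes OpenVLA's usual 7-dimensional action layout (translation delta, rotation delta, gripper) and maps it onto a Robosuite OSC_POSE command; the function name and scale factors are hypothetical and would be tuned per task.

```python
import numpy as np

# Hypothetical decoder: maps a 7-dim OpenVLA action
# [dx, dy, dz, droll, dpitch, dyaw, gripper] onto a Robosuite
# OSC_POSE controller command. Scale factors are illustrative.
POS_SCALE = 0.05   # metres per step for end-effector translation
ROT_SCALE = 0.2    # radians per step for end-effector rotation

def decode_vla_action(vla_action: np.ndarray) -> np.ndarray:
    """Convert an OpenVLA action vector into an OSC_POSE command."""
    assert vla_action.shape == (7,)
    delta_pos = vla_action[:3] * POS_SCALE    # end-effector translation
    delta_rot = vla_action[3:6] * ROT_SCALE   # end-effector rotation deltas
    # OpenVLA's gripper channel is roughly [0, 1]; Robosuite expects
    # a value in [-1, 1] (open .. close), so rescale it.
    gripper = 2.0 * vla_action[6] - 1.0
    return np.concatenate([delta_pos, delta_rot, [gripper]])
```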
Vision-based semantic processing ensures that each textual or visual cue directly maps to practical movements, such as moving a gripper to a specific target or tilting it at a certain angle. By uniting language understanding with continuous control, the system can smoothly handle a variety of manipulation scenarios, allowing developers to test and refine complex, multi-step tasks in a simulation environment.
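A minimal simulation setup for this kind of testing could look like the following sketch. It assumes the robosuite 1.4-style API (`load_controller_config`, `suite.make`), which may differ in other versions; the chosen task, camera, and resolution are illustrative.

```python
import robosuite as suite
from robosuite.controllers import load_controller_config

# Sketch of the simulation setup (robosuite ~1.4 API; names may differ
# in other versions). The OSC_POSE controller accepts the 7-dim command
# produced by decode_vla_action above.
controller_config = load_controller_config(default_controller="OSC_POSE")

env = suite.make(
    "PickPlaceCan",            # example task; other Robosuite tasks work too
    robots="Panda",
    controller_configs=controller_config,
    has_renderer=False,
    use_camera_obs=True,       # RGB observations feed the VLA model
    camera_names="agentview",
    camera_heights=256,
    camera_widths=256,
    control_freq=20,
)

obs = env.reset()
rgb = obs["agentview_image"]   # image passed to OpenVLA at each step
```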
Note that, unlike other vision-based systems, this project does not rely on a carefully designed API for low-level robot control. Instead, the robot foundation model itself outputs commands directly in the robot's action space.
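A closed-loop rollout might then look like the sketch below. It follows the published Hugging Face usage of `openvla/openvla-7b` (`AutoProcessor`, `predict_action`) and reuses `env` and `decode_vla_action` from the earlier sketches; the prompt template, `unnorm_key`, and rollout horizon are assumptions made for illustration.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Closed-loop rollout sketch: the foundation model itself produces the
# action vector; no hand-written low-level skill API sits in between.
# Model loading and predict_action() follow the published OpenVLA usage;
# the unnorm_key and prompt template here are assumptions.
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

instruction = "pick up the can and place it in the bin"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

for _ in range(200):                                   # rollout horizon
    # Note: robosuite camera images may be vertically flipped in some versions.
    image = Image.fromarray(obs["agentview_image"])    # obs from the env above
    inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
    # 7-dim action: end-effector translation, rotation, and gripper
    vla_action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    obs, reward, done, info = env.step(decode_vla_action(vla_action))
    if done:
        break
```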