Language + Pixels to Action Space

GitHub
LLaMA Vision · LangGraph · Redis · Streamlit · Robotics · Natural Language Processing

System that translates natural language commands and visual input into robot actions.

This project introduces a system that interprets user language commands and camera feed data, translating them into actionable robot behaviors. The pipeline begins with a large language model that parses and structures each instruction into a multi-step plan. A custom API then converts these plan steps into movement commands suitable for a differential-drive robot or similar platforms.
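
A minimal sketch of what this planning-and-conversion step could look like is shown below. The `call_llm` helper, the JSON plan schema, and the wheel-speed conversion are illustrative assumptions, not the project's actual API.

```python
import json
from typing import List, Tuple

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; in practice this would query the planning model."""
    raise NotImplementedError

def parse_command(command: str) -> List[dict]:
    """Ask the LLM to structure a free-form command into an ordered list of steps."""
    prompt = (
        "Break the following robot command into a JSON list of steps, "
        'each of the form {"action": "move"|"turn", "value": <meters or degrees>}.\n'
        f"Command: {command}"
    )
    return json.loads(call_llm(prompt))

def step_to_wheel_speeds(step: dict, wheel_base: float = 0.3) -> Tuple[float, float]:
    """Convert one plan step into (left, right) wheel speeds for a differential drive."""
    if step["action"] == "move":               # straight-line motion: both wheels equal
        v = 0.2                                # nominal forward speed in m/s (assumed)
        return v, v
    if step["action"] == "turn":               # in-place rotation: wheels opposite
        w = 0.5 if step["value"] > 0 else -0.5 # angular speed sign from turn direction
        return -w * wheel_base / 2, w * wheel_base / 2
    raise ValueError(f"unknown action: {step['action']}")

# Example: a plan the LLM might return for "move forward one meter, then turn right"
plan = [{"action": "move", "value": 1.0}, {"action": "turn", "value": -90}]
wheel_commands = [step_to_wheel_speeds(s) for s in plan]
```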

Simultaneously, LLaMA 3.2 Vision provides real-time visual understanding of the environment, identifying objects, spatial relationships, and potential obstacles. LangGraph and Redis manage short-term and long-term memory components, allowing the system to retrieve past states, review previous actions, and incorporate relevant context into newly generated plans. This memory-aware design leads to more coherent, adaptive robotic behaviors over continuous operation.
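
One way the memory-aware loop could be wired together is sketched below. The node functions, Redis key names, and state schema are assumptions for illustration rather than the project's exact graph.

```python
from typing import List, TypedDict

import redis
from langgraph.graph import StateGraph, END

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

class RobotState(TypedDict):
    command: str        # the user's natural language instruction
    scene: str          # textual scene description from the vision model
    history: List[str]  # prior actions pulled from long-term memory
    plan: str           # the newly generated plan

def recall(state: RobotState) -> dict:
    # Long-term memory: fetch the most recent actions stored in Redis (key name assumed).
    past = r.lrange("robot:action_log", 0, 4)
    return {"history": past}

def plan_step(state: RobotState) -> dict:
    # Placeholder planner: a real implementation would prompt the LLM with the
    # command, the scene description from the vision model, and the recalled history.
    plan = f"Plan for '{state['command']}' given scene '{state['scene']}'"
    r.rpush("robot:action_log", plan)  # persist the decision for future recalls
    return {"plan": plan}

graph = StateGraph(RobotState)
graph.add_node("recall", recall)
graph.add_node("planner", plan_step)
graph.set_entry_point("recall")
graph.add_edge("recall", "planner")
graph.add_edge("planner", END)
app = graph.compile()

result = app.invoke({
    "command": "go to the red box",
    "scene": "red box ahead, chair to the left",
    "history": [],
    "plan": "",
})
```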

Deployed in a simulated environment, the robot autonomously navigates and interacts with its surroundings based on high-level language prompts and pixel-level perception. A Streamlit interface rounds out the system, letting users issue commands and observe the robot's responses without diving into lower-level technical details.
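
A bare-bones version of such a front end might look like the following; the `run_pipeline` hook is an assumed placeholder for the planning and perception stack described above.

```python
import streamlit as st

def run_pipeline(command: str) -> dict:
    """Hypothetical hook into the planning/perception backend."""
    return {"plan": f"(plan generated for: {command})", "status": "queued"}

st.title("Language + Pixels to Action Space")

command = st.text_input("Enter a command for the robot",
                        placeholder="e.g. go to the red box")

if st.button("Send command") and command:
    result = run_pipeline(command)
    st.subheader("Generated plan")
    st.write(result["plan"])
    st.caption(f"Status: {result['status']}")
```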