Language + Pixels to Action Space

GitHub
LLaMA Vision · LangGraph · Redis · Streamlit · Robotics · Natural Language Processing

TLDR

A system that converts natural language commands and camera input into robot actions, using an LLM for task planning and a vision-language model for perception.

Detailed

Tech Stack:

LLaMA 3.2 Vision, LangGraph, Redis, Streamlit, Custom API

Goal:

Translate language commands and visual input into robot actions.

What I did:

  • Used an LLM to parse instructions into multi-step plans (planning sketch after this list)
  • Integrated LLaMA 3.2 Vision for real-time visual understanding of objects, spatial relationships, and obstacles (perception sketch below)
  • Built a custom API to convert plans into movement commands for a differential drive robot (actuation sketch below)
  • Used LangGraph and Redis for short-term and long-term memory (memory sketch below)
  • Created a Streamlit interface for user interaction (UI sketch below)
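
The sketches below walk through each step. They are illustrative, not the project's actual code; model names, prompts, endpoints, and key names are assumptions unless stated otherwise. First, planning: a minimal sketch of turning a command into a JSON plan, assuming the model is served through Ollama.

  # Hypothetical planning sketch: parse a command into ordered plan steps
  # using LLaMA served through Ollama (the real serving setup may differ).
  import json
  import ollama

  PLANNER_PROMPT = (
      "You are a robot task planner. Break the user's command into an "
      "ordered list of steps. Respond with JSON only, shaped like "
      '{"steps": [{"action": "...", "target": "..."}]}'
  )

  def plan_from_command(command: str) -> list[dict]:
      response = ollama.chat(
          model="llama3.2-vision",  # assumed model name
          format="json",            # ask Ollama to constrain output to JSON
          messages=[
              {"role": "system", "content": PLANNER_PROMPT},
              {"role": "user", "content": command},
          ],
      )
      return json.loads(response["message"]["content"])["steps"]

  # Example: "go to the red box and stop in front of it" might yield
  # [{"action": "navigate", "target": "red box"}, {"action": "stop", "target": "red box"}]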
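
Next, perception: a sketch of sending a single camera frame to LLaMA 3.2 Vision and asking for objects, rough positions, and obstacles. OpenCV frame capture and the Ollama serving path are assumptions, and the prompt wording is illustrative.

  # Hypothetical perception sketch: describe one camera frame with LLaMA 3.2 Vision.
  import cv2
  import ollama

  def describe_scene(camera_index: int = 0) -> str:
      cap = cv2.VideoCapture(camera_index)
      ok, frame = cap.read()
      cap.release()
      if not ok:
          raise RuntimeError("could not read a frame from the camera")
      cv2.imwrite("frame.jpg", frame)  # Ollama accepts image file paths

      response = ollama.chat(
          model="llama3.2-vision",
          messages=[{
              "role": "user",
              "content": ("List the visible objects, their rough positions "
                          "(left/center/right), and any obstacles in the robot's path."),
              "images": ["frame.jpg"],
          }],
      )
      return response["message"]["content"]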
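
For actuation, the sketch below maps a plan action to wheel speeds using standard differential drive kinematics (left and right wheel speeds derived from a linear and angular velocity) and posts them to the robot. The action names, wheel base, and /drive endpoint are hypothetical stand-ins for the project's custom API.

  # Hypothetical actuation sketch: turn a plan action into wheel speeds for a
  # differential drive robot and send them to an assumed HTTP endpoint.
  import requests

  WHEEL_BASE_M = 0.20                          # assumed distance between wheels
  ROBOT_API = "http://robot.local:8000/drive"  # hypothetical endpoint

  # Map high-level actions to (linear m/s, angular rad/s) targets.
  ACTION_TWISTS = {
      "forward": (0.15, 0.0),
      "backward": (-0.15, 0.0),
      "turn_left": (0.0, 0.8),
      "turn_right": (0.0, -0.8),
      "stop": (0.0, 0.0),
  }

  def send_action(action: str) -> None:
      v, omega = ACTION_TWISTS[action]
      # Differential drive kinematics: split a body twist into wheel speeds.
      left = v - omega * WHEEL_BASE_M / 2
      right = v + omega * WHEEL_BASE_M / 2
      requests.post(ROBOT_API, json={"left": left, "right": right}, timeout=1.0)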
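
For memory, the sketch below pairs a LangGraph checkpointer (short-term, per-thread conversation state) with Redis (long-term record of past commands and plans). The node names, Redis keys, and in-memory checkpointer are illustrative; the real system may wire Redis into LangGraph differently.

  # Hypothetical memory sketch: LangGraph state graph with a checkpointer for
  # short-term memory, plus Redis for long-term recall of past robot states.
  import json
  from typing import TypedDict
  import redis
  from langgraph.graph import StateGraph, START, END
  from langgraph.checkpoint.memory import MemorySaver

  r = redis.Redis(host="localhost", port=6379, decode_responses=True)

  class AgentState(TypedDict):
      command: str
      plan: list
      history: list  # recalled long-term context

  def recall(state: AgentState) -> dict:
      # Long-term memory: pull the last few recorded states from Redis.
      past = [json.loads(x) for x in r.lrange("robot:states", 0, 4)]
      return {"history": past}

  def plan(state: AgentState) -> dict:
      # The LLM planner would run here; record the outcome for later recall.
      steps = [{"action": "forward"}]  # placeholder plan
      r.lpush("robot:states", json.dumps({"command": state["command"], "plan": steps}))
      return {"plan": steps}

  builder = StateGraph(AgentState)
  builder.add_node("recall", recall)
  builder.add_node("plan", plan)
  builder.add_edge(START, "recall")
  builder.add_edge("recall", "plan")
  builder.add_edge("plan", END)

  # The checkpointer gives each conversation thread its own short-term memory.
  graph = builder.compile(checkpointer=MemorySaver())
  result = graph.invoke(
      {"command": "go to the door", "plan": [], "history": []},
      config={"configurable": {"thread_id": "demo"}},
  )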
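
Finally, a minimal Streamlit chat front end. The planner call is stubbed with a placeholder reply; the real app would forward the command through the planning, perception, and actuation steps above.

  # Hypothetical UI sketch: Streamlit chat loop for issuing robot commands.
  import streamlit as st

  st.title("Language + Pixels to Action Space")

  if "messages" not in st.session_state:
      st.session_state.messages = []

  # Replay the conversation so far.
  for msg in st.session_state.messages:
      with st.chat_message(msg["role"]):
          st.write(msg["content"])

  if command := st.chat_input("Tell the robot what to do"):
      st.session_state.messages.append({"role": "user", "content": command})
      with st.chat_message("user"):
          st.write(command)

      reply = f"Received {command!r} (planner call goes here)"  # placeholder reply
      st.session_state.messages.append({"role": "assistant", "content": reply})
      with st.chat_message("assistant"):
          st.write(reply)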

What was achieved:

The robot navigates and interacts with its environment based on language prompts and visual perception. The system retrieves past states from memory and incorporates that context into new plans.