Can Large Language Models (LLMs) effectively control robots? This question is addressed by examining their ability to perform tasks such as passing butter, which simulates delivery tasks in a household setting. Current top models struggle with such tasks: the best model achieves a 40% success rate on the Butter-Bench test, far below the 95% success rate achieved by humans.
LLMs were given control of a robot in an office setting to assist with various tasks. Although this experiment was engaging, it did not meaningfully save time. However, observing the robots navigate the environment to fulfill their tasks provided valuable insight into what the future of robotic systems might look like, how far away that future remains, and which challenges may arise along the way.
LLMs are not specifically trained to function as robots, especially in terms of low-level control tasks, such as manipulating grippers and joints. Instead, companies like Nvidia, Figure AI, and Google DeepMind are exploring how LLMs can serve as orchestrators in robotic systems, focusing on high-level reasoning and planning and pairing them with an “executor” model responsible for low-level control.
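The orchestrator/executor split described above can be sketched as follows. This is a minimal illustration, not any company's actual system: the action names, the `orchestrate` plan, and the `execute` stub are all assumptions standing in for an LLM planner and a learned low-level controller.

```python
from dataclasses import dataclass

# Hypothetical high-level actions an orchestrator LLM might emit.
# These names are illustrative, not taken from any real robot API.
HIGH_LEVEL_ACTIONS = {"navigate_to", "pick_up", "hand_over"}

@dataclass
class Step:
    action: str
    target: str

def orchestrate(task: str) -> list[Step]:
    """Stand-in for the orchestrator LLM: turns a task description
    into a sequence of high-level steps (hard-coded here)."""
    if task == "pass the butter":
        return [
            Step("navigate_to", "kitchen"),
            Step("pick_up", "butter"),
            Step("navigate_to", "person"),
            Step("hand_over", "butter"),
        ]
    return []

def execute(step: Step) -> str:
    """Stand-in for the executor model that would drive grippers and
    joints; here it only reports what it would do."""
    assert step.action in HIGH_LEVEL_ACTIONS
    return f"executing {step.action}({step.target})"

log = [execute(s) for s in orchestrate("pass the butter")]
```

The key point of the design is that the orchestrator reasons only over discrete, named actions, while all continuous control lives behind the `execute` boundary.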
The current challenge lies in improving the executor component rather than the orchestrator. Enhancements to the executor have produced impressive demonstrations of humanoid robots performing tasks like unloading dishwashers. State-of-the-art LLMs are not always deployed as orchestrators in practice, owing to latency and performance constraints, but it is reasonable to treat them as the ceiling of current orchestration capability.
The goal of the Butter-Bench test is to evaluate if the current leading LLMs can effectively operate as orchestrators within a fully functional robotic system. The experiment features a simplified robotic form factor, such as a robot vacuum equipped with lidar and cameras, which eliminates the need for low-level control mechanisms. This setup allows for the evaluation of high-level reasoning capabilities in isolation.
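A simplified form factor like this could expose only a handful of high-level primitives, so the LLM never touches joint-level control. The sketch below is an assumption about what such an interface might look like; the class and method names (`VacuumBot`, `rotate`, `move_forward`, `capture_image`) are hypothetical, not the actual Butter-Bench API.

```python
import math

class VacuumBot:
    """Illustrative high-level interface for a vacuum-style robot base
    with lidar and cameras; only coarse motion commands are exposed."""

    def __init__(self) -> None:
        self.pose = (0.0, 0.0, 0.0)  # x (m), y (m), heading (degrees)

    def rotate(self, degrees: float) -> None:
        # Turn in place; heading wraps around 360 degrees.
        x, y, h = self.pose
        self.pose = (x, y, (h + degrees) % 360)

    def move_forward(self, meters: float) -> None:
        # Drive straight along the current heading.
        x, y, h = self.pose
        rad = math.radians(h)
        self.pose = (x + meters * math.cos(rad),
                     y + meters * math.sin(rad),
                     h)

    def capture_image(self) -> str:
        # A real system would return a camera frame for the LLM to inspect.
        return f"image captured at pose {self.pose}"

bot = VacuumBot()
bot.rotate(90)
bot.move_forward(2.0)
```

Restricting the action space this way is what lets the benchmark evaluate high-level spatial reasoning in isolation: every failure is a planning failure, not a motor-control failure.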
Although human performance significantly surpassed that of the LLMs on Butter-Bench, with the best LLM reaching 40% against a 95% human average, observing the robots in action remains a fascinating experience, one that fuels excitement about potentially rapid advances in physical AI.
The trials surfaced essential insights: LLMs need stronger spatial intelligence, and they can behave erratically when pushed to their limits, such as when the robot's battery runs low. These experiments shed light on how LLMs behave when embodied in robots and underscore the importance of setting ethical boundaries to ensure responsible behavior.
In conclusion, while LLMs have demonstrated superior analytical capability on many benchmarks, humans still clearly outperform them on tasks like Butter-Bench. Even so, there is real anticipation for the rapid development of physical AI. For further inquiries, please contact founders@andonlabs.com. © 2025 Vectorview, Inc. All rights reserved.
