This paper evaluates the effectiveness of Large Language Models at solving embodied robotic tasks using the Meta PARTNR benchmark. PARTNR provides simplified environments and robotic interactions within randomized indoor kitchen scenes. In each randomized kitchen scene, two robotic agents must work cooperatively to solve an assigned task. We evaluated multiple frontier models in PARTNR environments. Our results indicate that reasoning models such as OpenAI o3-mini outperform non-reasoning models such as OpenAI GPT-4o and Llama 3 when operating in PARTNR's embodied robotic environments; o3-mini led across centralized, decentralized, full-observability, and partial-observability configurations. This suggests a promising avenue of research for embodied robotic development.
In the centralized planner, a single planner selects the actions of both agents. The actions are sent to a low-level policy that executes them mechanically. The simulator (Habitat 3.0) applies these actions and outputs a new state, which is then observed individually by each agent. The observations update a shared world graph that is fed back into the centralized planner, repeating the loop. The decentralized planner, in contrast, has a separate planner for each agent. In decentralized planning, each agent's individual observations update its own, non-shared world graph.
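The centralized loop above can be sketched in a few lines of Python. This is an illustrative stand-in only: the class names, methods, and string-valued observations below are assumptions for the sketch, not the actual PARTNR or Habitat 3.0 APIs.

```python
class WorldGraph:
    """Shared world model, updated from each agent's observations."""
    def __init__(self):
        self.facts = {}

    def update(self, agent_id, observation):
        self.facts[agent_id] = observation


class CentralizedPlanner:
    """A single planner proposes high-level actions for both agents."""
    def plan(self, world_graph, agent_ids):
        # Hypothetical: in PARTNR this would be an LLM call conditioned
        # on the shared world graph.
        return {a: f"navigate_{a}" for a in agent_ids}


class Simulator:
    """Stand-in for Habitat 3.0: steps the actions (as executed by a
    low-level policy) and returns per-agent observations."""
    def step(self, actions):
        return {a: f"observed_after_{act}" for a, act in actions.items()}


def centralized_step(planner, sim, world_graph, agent_ids):
    # 1. One planner outputs actions for every agent.
    actions = planner.plan(world_graph, agent_ids)
    # 2. The simulator advances; each agent observes the new state.
    observations = sim.step(actions)
    # 3. All observations update the *shared* world graph, which the
    #    planner reads on the next iteration, closing the loop.
    for a in agent_ids:
        world_graph.update(a, observations[a])
    return world_graph


wg = centralized_step(CentralizedPlanner(), Simulator(), WorldGraph(),
                      ["agent_0", "agent_1"])
```

A decentralized variant would instead instantiate one planner and one `WorldGraph` per agent, each updated only from that agent's own observations.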
For each test case, we list the model under test, the planner type, the observability setting, and the metrics. We include a one-sigma standard deviation for key metrics to show their variability.
Non-reasoning models such as GPT-4o and Llama 3 attempt to one-shot the problem and then proceed; if the plan fails, they must replan. Non-reasoning models complete a lower percentage of their tasks, but they appear to complete them faster, as shown by their sim-step counts. The likely reason is that non-reasoning models respond faster and iterate more quickly: they can react to the simulation at a higher rate even though their individual actions have a lower success rate.
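The one-shot-then-replan behaviour can be sketched as a simple loop. All names here are hypothetical stand-ins: `propose_plan` plays the role of a fast, non-reasoning LLM call, and `execute` plays the role of the simulator with a fixed per-action success rate; neither comes from the PARTNR codebase.

```python
import random

random.seed(0)  # deterministic for the sketch


def propose_plan(task):
    """Stand-in for a non-reasoning LLM call: a fast, single-shot plan."""
    return [f"step_{i}" for i in range(3)]


def execute(action, success_rate=0.7):
    """Stand-in for the simulator: each action fails with some probability."""
    return random.random() < success_rate


def run_with_replanning(task, max_replans=5):
    """One-shot the plan; on any action failure, discard it and replan."""
    sim_steps = 0
    for _attempt in range(max_replans):
        plan = propose_plan(task)
        plan_succeeded = True
        for action in plan:
            sim_steps += 1
            if not execute(action):
                plan_succeeded = False  # plan failed: replan from scratch
                break
        if plan_succeeded:
            return True, sim_steps
    return False, sim_steps


success, steps = run_with_replanning("set the table")
```

Because each plan is cheap to produce, failed attempts cost little wall-clock time; the model simply burns extra sim steps replanning, which matches the pattern we observe for non-reasoning models.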
When we compare the non-reasoning models against a reasoning model such as o3-mini, we see some interesting results. o3-mini has a higher episode completion percentage across the board, and each action it takes has a higher success rate than those of GPT-4o or Llama 3.1. However, o3-mini takes longer to make each decision, which results in more sim steps. The trade-off of using a reasoning model like o3-mini is that we pay a higher cost in sim steps for a better action success rate and a higher episode completion percentage.
@inproceedings{Habitat-LLM-Benchmarks,
  author    = {William Li and Lei Hamilton and Kaise Al-natour and Sanjeev Mohindra},
  title     = {Evaluation of Habitat Robotics using Large Language Models},
  booktitle = {Institute of Electrical and Electronics Engineers (IEEE)},
  year      = {2025}
}