Google’s Code-as-Policies Lets Robots Write Their Own Code

By Anthony Alford

Researchers from Google’s Robotics team have open-sourced Code-as-Policies (CaP), a robot control method that uses a large language model (LLM) to generate robot-control code that achieves a user-specified goal. CaP uses a hierarchical prompting technique for code generation that outperforms previous methods on the HumanEval code-generation benchmark.

The technique and experiments were described in a paper published on arXiv. CaP differs from previous attempts to use LLMs to control robots: instead of generating a sequence of high-level steps or policies to be invoked by the robot, CaP directly generates Python code for those policies. The Google team developed a set of prompting techniques that improved code generation, including a new hierarchical prompting method. This technique achieved a new state-of-the-art score of 39.8% pass@1 on the HumanEval benchmark. According to the Google team:

Code as policies is a step towards robots that can modify their behaviors and expand their capabilities accordingly. This can be enabling, but the flexibility also raises potential risks since synthesized programs (unless manually checked per runtime) may result in unintended behaviors with physical hardware. We can mitigate these risks with built-in safety checks that bound the control primitives that the system can access, but more work is needed to ensure new combinations of known primitives are equally safe. We welcome broad discussion on how to minimize these risks while maximizing the potential positive impacts towards more general-purpose robots.

LLMs have been shown to exhibit general knowledge about many subjects and can solve a wide range of natural-language processing (NLP) tasks. However, they can also generate responses that, while logically sound, would not be helpful for controlling a robot. For example, in response to “I spilled my drink, can you help?” an LLM might respond “You could try using a vacuum cleaner.” Earlier this year, InfoQ covered Google’s SayCan method, which uses an LLM to plan a sequence of robotic actions; to improve the output of the LLM, SayCan introduced a value function that indicates how likely the plan is to succeed given the current state of the world.
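The idea of tempering an LLM's suggestion with a value function can be illustrated with a minimal sketch. It assumes that each candidate action has a language-model score and a value-function score, and that the two are combined by multiplication; the function name `pick_next_skill`, the combination rule as written here, and all the numbers are hypothetical and chosen only for illustration, not taken from the SayCan implementation.

```python
def pick_next_skill(llm_scores, value_scores):
    """Pick the action whose combined score is highest.

    llm_scores:   how relevant the language model thinks each action is.
    value_scores: how likely each action is to succeed in the current state.
    Both are hypothetical inputs; a missing value score counts as 0.
    """
    combined = {
        skill: llm_scores[skill] * value_scores.get(skill, 0.0)
        for skill in llm_scores
    }
    return max(combined, key=combined.get)


# Made-up scores for the "I spilled my drink" example.
llm_scores = {
    "use a vacuum cleaner": 0.6,  # plausible text, poor fit for the robot
    "find a sponge": 0.3,
    "apologize": 0.1,
}
value_scores = {
    "use a vacuum cleaner": 0.05,  # no vacuum cleaner within reach
    "find a sponge": 0.8,
    "apologize": 0.9,
}

best = pick_next_skill(llm_scores, value_scores)
```

In this toy setup the action the language model ranks highest is not the one selected, because its low chance of success pulls its combined score down.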

The key component of CaP is the generation of language model programs (LMPs), which map a user’s natural-language instructions to programs that execute on a robot, take perceptual inputs from the robot’s sensors, and invoke controller APIs. An LLM generates these in “few-shot” mode, prompted with hints and example LMPs. The generated LMPs can contain high-level control structures such as loops and conditionals, as well as hierarchically generated functions. In the latter case, a high-level LMP is generated that contains calls to undefined functions. This LMP is parsed to find those undefined references, and a second LLM that is fine-tuned to generate functions is invoked to create each function definition.
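The hierarchical step described above can be sketched in a few lines of Python. This is not the CaP codebase: the two "LLMs" are replaced by canned stand-in functions, and every name here (`find_undefined_functions`, `generate_program`, `get_obj_names`, `pick_place`, and so on) is hypothetical. The sketch only shows the control flow: generate a program, parse it for calls to functions that do not exist yet, and ask a function writer to fill them in.

```python
import ast
import builtins


def find_undefined_functions(source, known_names):
    """Return names called in `source` that are not defined in it,
    not in `known_names`, and not Python builtins."""
    tree = ast.parse(source)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            name = node.func.id
            if (name not in defined and name not in known_names
                    and not hasattr(builtins, name) and name not in missing):
                missing.append(name)
    return missing


def generate_program(instruction, write_program, write_function,
                     api_names, max_depth=3):
    """Generate a top-level program, then define missing helpers layer by layer."""
    program = write_program(instruction)
    known = set(api_names)
    definitions = []
    frontier = [program]
    for _ in range(max_depth):  # bound the recursion depth
        missing = []
        for code in frontier:
            for name in find_undefined_functions(code, known):
                if name not in missing:
                    missing.append(name)
        if not missing:
            break
        frontier = [write_function(name) for name in missing]
        definitions.extend(frontier)
        known.update(missing)
    return "\n".join(definitions + [program])


# Canned stand-ins for the two language models (assumptions, not real LLM calls).
def fake_write_program(instruction):
    return ("for name in get_obj_names():\n"
            "    if is_red(name):\n"
            "        put_in_bowl(name)\n")


def fake_write_function(name):
    canned = {
        "is_red": "def is_red(name):\n    return 'red' in name\n",
        "put_in_bowl": "def put_in_bowl(name):\n    pick_place(name, 'bowl')\n",
    }
    return canned[name]


code = generate_program("put the red blocks in the bowl",
                        fake_write_program, fake_write_function,
                        api_names={"get_obj_names", "pick_place"})
```

The resulting `code` string could then be run with `exec` against a namespace that supplies the robot's perception and control functions. In a real system that namespace is also a natural place to restrict which primitives the generated code can reach.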

Google evaluated CaP on multiple benchmarks and tasks. Besides HumanEval, the team developed a new code-generation benchmark, RoboCodeGen, specifically for robotics problems. The team also used CaP to control physical robots performing several real-world tasks: mobile robot navigation and manipulation in a kitchen environment, and drawing shapes, pick-and-place, and table-top manipulation for a robotic arm.
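For readers unfamiliar with the pass@1 figure quoted for HumanEval: the pass@k metric estimates the probability that at least one of k sampled programs passes a problem's unit tests. The sketch below uses the unbiased estimator commonly applied to HumanEval, where n samples are drawn per problem and c of them pass; whether the Google team computed its score with exactly this procedure is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: number of samples generated, c: number that passed the tests,
    k: number of attempts allowed. Computes 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failing samples: any k draws include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results, k):
    """Average pass@k over a list of (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, so a pass@1 score is the average fraction of problems solved on a single attempt.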

Google researcher Jacky Liang discussed the work on Twitter. In response to a question about CaP’s issues with building complex structures from blocks, Liang replied:

CaP operates best when the new [commands] and the prompt are in similar abstraction levels. Building complex structures is akin to going “couple levels up” the abstraction level, which greedy LLM decoding struggles with. Should be possible but probably need better ways to [prompt].

Code for reproducing the paper’s experiments is available on GitHub. An interactive demo of the code-generation technique is available on HuggingFace.

Via InfoQ.com