Effectiveness of Language Agents in Classic Control Problems

· 2719 words · 13 minute read

Large Language Models (LLMs) exhibit emergent abilities in tasks like planning and decision-making, akin to those of human beings, and have inspired the creation of agents that use an LLM as their controller. However, LLMs have inherent limitations: they cannot access up-to-date information, use external tools, or execute precise mathematical and logical reasoning. To address these issues while maintaining generalization capabilities, researchers have proposed multiple strategies, including training LLMs to use tools via API calls (Schick et al. 2023), leveraging in-context learning to generate programs from natural language instructions and demonstrations (Lu et al. 2023), and using neural modular and compositional approaches to perform automatic sub-task decomposition (Veličković et al. 2021). While all of these augmentations improve the performance of language agents on various tasks, it is still unclear how well such agents perform in classic control environments. In this manuscript, we take a step back in the history of Reinforcement Learning to explore the effectiveness of autonomous agents with LLM controllers in the classic control environment of Cart Pole. Our main contribution is a rigorous statistical comparison of the performance of language agents against other control methodologies.

paper code: https://github.com/af-torres/language-agents

Language Agents - a new approach to control

Reinforcement Learning (RL) is a specialized branch of machine learning centered on enabling intelligent agents to learn how to make optimal decisions within a dynamic environment. Historically, RL algorithms have found extensive use in solving control problems that involve partial observability and stochastic outcomes.

Traditionally, two main approaches have been employed to model these learning problems: value-based methods, which learn the expected reward of every action given the state, and policy-based methods, which directly learn which action to take given the state. However, both require the development of a custom reward function and are computationally expensive to train.
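For concreteness, below is a minimal sketch of a value-based method (tabular Q-learning). The table sizes, learning rate, and discount factor are illustrative assumptions and are not part of our experiments.

```python
# A minimal sketch of the tabular Q-learning (value-based) update rule.
# The state/action counts and hyperparameters are illustrative; a full agent
# would also need an exploration strategy and a hand-crafted reward signal.
import numpy as np

n_states, n_actions = 100, 2
alpha, gamma = 0.1, 0.99          # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(state, action, reward, next_state):
    """One value-based learning step: move Q(s, a) toward the bootstrapped target."""
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])

def greedy_action(state):
    """The induced policy simply picks the action with the highest learned value."""
    return int(Q[state].argmax())
```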

Filling this gap, recent technological advances have paved the way for a new class of agents capable of executing diverse tasks in zero-shot or few-shot settings. Uniquely, these agents shape their strategies from reasoning expressed in natural language, thereby exploiting the exceptional capabilities of Large Language Models (LLMs).

What follows are a few examples of how this integration of LLMs with RL has produced remarkable outcomes in various domains:

ChemCrow

ChemCrow (Bran et al. 2023) is an autonomous agent designed to plan and carry out tasks across organic synthesis, drug discovery, and materials design. A remarkable feat of this research was the augmentation of an LLM agent with laboratory tools, which enabled the agent to synthesize an insect repellent, develop multiple organocatalysts, and guide the discovery of a novel chromophore.

LIDA

LIDA (Dibia 2023), another agent exploiting the power of language models, generates data visualizations directly from raw data. Comprising three LLM-based modules dedicated to summarizing the data, proposing analytical goals, and writing the code that produces the final graphs, LIDA delivers visually striking, analytically sound, and highly customizable visualizations.

Computer Tasks

Completing computer tasks given in natural language, like those in the MiniWoB++ benchmark, is a complex problem where state-of-the-art solutions heavily rely on supervised learning (SL) from expert demonstrations, reinforcement learning (RL) on hand-crafted task-specific reward signals, or both. Yet none of these approaches has been able to generalize to new tasks. In contrast, language agents following a simple prompting strategy outperformed previous agents and completed the tasks in a zero-shot setting without issue (Kim et al. 2023).

Classical Control: Cart Pole Balancing

In the Cart Pole Balancing problem, a pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pole starts upright on the cart, and the goal is to keep it balanced by applying forces to the cart in the left and right directions. This task is challenging to learn, and there is no trivial closed-form solution to the problem. One of the first solutions was proposed by Barto et al. (1983), who introduced an agent with a neuronlike (neural network) model in which actions followed a random process biased by a combination of weight values and input patterns. Barto et al. (1983) compared their reinforcement learning solution against a baseline model proposed by Michie and Chambers (1968), in which a policy is learned for mutually independent sub-games that can be mapped from the environment states. In this manuscript, we compare the results of the new approach to reinforcement learning (language agents) against the same baseline solution used to evaluate the first neuron-like agent.

Methodology

We built three different agents that seek to balance a cart pole over a frictionless track, each following its own control policy. The objective is to compare the overall performance of the agents by measuring the episode length over a set of 100 games, that is, how many decisions each agent is able to make before the game ends.
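A minimal sketch of this evaluation loop, assuming the Gymnasium CartPole-v1 environment (the function names are illustrative and not the exact code in the repository):

```python
# Sketch of the evaluation loop: play 100 games with a given policy and record
# how many decisions the agent makes before each episode ends.
import gymnasium as gym

def episode_length(policy, max_steps=500):
    """Play one game and return the number of decisions made before the game ends."""
    env = gym.make("CartPole-v1")
    observation, _ = env.reset()
    for step in range(1, max_steps + 1):
        action = policy(observation)                 # 0 = push left, 1 = push right
        observation, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            break
    env.close()
    return step

def evaluate(policy, n_games=100):
    """Episode lengths over the 100 games used to compare the agents."""
    return [episode_length(policy) for _ in range(n_games)]
```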

The policies are named the “random policy”, the “space discretization policy”, and the “abstract reasoning policy”. In the rest of this section we explain these policies and define our hypotheses and the parametric model used for statistical inference (how we compared the policies).

Random policy

For every decision made by the agent following this policy, a random sample is drawn from a Bernoulli(0.5) distribution, and the observed outcome defines the decision the agent makes. If the draw is a success, the cart is pushed to the right; otherwise, to the left. We use this as a baseline solution to the control problem and expect the language agent to outperform it.
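A minimal sketch of this baseline, compatible with the evaluation loop sketched above:

```python
# The random baseline: every decision is an independent Bernoulli(0.5) draw
# and the game state is ignored entirely.
import numpy as np

rng = np.random.default_rng()

def random_policy(observation):
    """Push right on a success (1), left otherwise (0)."""
    return int(rng.random() < 0.5)
```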

Space discretization policy

This policy uses the ideas from Michie and Chambers (1968), where the game state is mapped to a discretized, learned decision space. In other words, over time, an agent using this policy learns the best action to take given a quantized transformation of the game state. This policy represents one of the best solutions that do not use a neural network to abstract the environment. If the language agent outperforms this policy, we would have evidence to argue that an LLM is capable of creating a useful abstract representation of the world and of making sound decisions based on that understanding.
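A simplified sketch of the discretization idea is shown below; the bin edges and the per-box score table are illustrative assumptions, and the actual learning rule used in the repository may differ.

```python
# BOXES-style sketch: quantize the four state variables into bins and keep, for
# each resulting "box", a per-action score learned from experience. The score
# updates (driven by how long episodes survive) are omitted from this sketch.
import numpy as np

BINS = [
    np.linspace(-2.4, 2.4, 5),    # cart position
    np.linspace(-3.0, 3.0, 5),    # cart velocity
    np.linspace(-0.21, 0.21, 7),  # pole angle (radians, roughly ±12°)
    np.linspace(-3.0, 3.0, 5),    # pole angular velocity
]

def box_index(observation):
    """Map a continuous game state to a discrete box identifier."""
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(observation, BINS))

scores = {}  # box -> (score for pushing left, score for pushing right)

def discretized_policy(observation):
    """Pick the action with the best learned score for the current box."""
    box = box_index(observation)
    left, right = scores.get(box, (0.0, 0.0))
    return 0 if left >= right else 1
```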

Abstract reasoning policy

This policy follows the decisions made by an LLM that is prompted with a description of the problem, the last 15 (game state, decision) tuples, the current game state, and a request to decide the direction in which the cart should be pushed. We use three different system prompts to test whether there is a significant difference when using chain of thought (CoT) or guided strategies developed by the authors. A sketch of the prompting loop is given after the prompts below.

Prompt 1: You are a control agent responsible to balance a pole in a moving cart. The pole is attached by an un-actuated joint to the cart, which moves along a frictionless track. The pendulum is placed upright on the cart. Your objective is to balance the pole by applying forces in the left and right direction on the cart. The user will inform you of the current cart’s position and velocity, and the pole’s angle and angular velocity. You will answer to the user with the action that should be taken in following format: {{decision}}. Where decision can take the value 0 to push the cart to the left or 1 to push the cart to the right and it should ALWAYS be wrapped by double braces. For example, if your decision is to push the cart to the left, your response to the user should be: {{0}} You will fail the task if the pole angle is greater than ±12° or if the cart position is greater than ±2.4 so make sure that does not happen. But, even if you know you have failed you are still required to respond to the user with a decision. As a final tip, take the following example: user: Cart Position: 1.2 Cart Velocity: 1.7783992290496826 Pole Angle: 3.423493094393619 Pole Angular Velocity: 2.2035346031188965 A good strategy could be to try to force the pole angle to go to the left by first moving the cart to the right and then use that angular momentum to move the cart to the left and center it. That is to say, you will want to think on how to move the cart and angle to the center. Make sure to add your thinking process to the response to the user, but be concise.

Prompt 2: You are a control agent responsible to balance a pole in a moving cart. The pole is attached by an un-actuated joint to the cart, which moves along a frictionless track. The pendulum is placed upright on the cart. Your objective is to balance the pole by applying forces in the left and right direction on the cart. The user will inform you of the current cart’s position and velocity, and the pole’s angle and angular velocity. You will answer to the user with the action that should be taken in following format: {{decision}}. Where decision can take the value 0 to push the cart to the left or 1 to push the cart to the right and it should ALWAYS be wrapped by double braces. For example, if your decision is to push the cart to the left, your response to the user should be: {{0}} You will fail the task if the pole angle is greater than ±12° or if the cart position is greater than ±2.4 so make sure that does not happen. But, even if you know you have failed you are still required to respond to the user with a decision. As a final tip, take the following example: user: Cart Position: 1.2 Cart Velocity: 1.7783992290496826 Pole Angle: 3.423493094393619 Pole Angular Velocity: 0.0035346031188965 A good strategy could be to try to force the pole angle to go to the left by: 1. moving the cart the left (as many times as necessary) to increase the angular velocity of the pole, 2. move the cart to the right to turn the pole angle to a negative position, 3. slowly move the cart to the left so it stays centered.That is to say, you will want to think on how to move the cart and pole angle to the center. Make sure to add your thinking process to the response to the user, but be concise.

Prompt 3: You are a control agent responsible to balance a pole in a moving cart. The pole is attached by an un-actuated joint to the cart, which moves along a frictionless track. The pendulum is placed upright on the cart. Your objective is to balance the pole by applying forces in the left and right direction on the cart. The user will inform you of the current cart’s position and velocity, and the pole’s angle and angular velocity. You will answer to the user with the action that should be taken in following format: {{decision}}. Where decision can take the value 0 to push the cart to the left or 1 to push the cart to the right and it should ALWAYS be wrapped by double braces. For example, if your decision is to push the cart to the left, your response to the user should be: {{0}} You will fail the task if the pole angle is greater than ±12° or if the cart position is greater than ±2.4 so make sure that does not happen. But, even if you know you have failed you are still required to respond to the user with a decision. Your response should provide the action decision in the specified format and nothing else.
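The sketch below illustrates the prompting loop behind this policy, assuming the OpenAI chat completions client and a placeholder model name; the repository may use a different provider, model, or parsing logic.

```python
# Sketch of the abstract reasoning policy: the system prompt is one of the three
# prompts above, the conversation carries the last 15 (state, decision) pairs,
# and the decision is parsed from the {{...}} marker in the reply.
import re
from collections import deque
from openai import OpenAI

client = OpenAI()
history = deque(maxlen=15)  # most recent (state message, decision) pairs

def format_state(obs):
    return (f"Cart Position: {obs[0]}\nCart Velocity: {obs[1]}\n"
            f"Pole Angle: {obs[2]}\nPole Angular Velocity: {obs[3]}")

def llm_policy(observation, system_prompt):
    state_msg = format_state(observation)
    messages = [{"role": "system", "content": system_prompt}]
    for past_state, past_decision in history:
        messages.append({"role": "user", "content": past_state})
        messages.append({"role": "assistant", "content": "{{" + str(past_decision) + "}}"})
    messages.append({"role": "user", "content": state_msg})
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name, an assumption
        messages=messages,
    )
    text = reply.choices[0].message.content
    match = re.search(r"\{\{\s*([01])\s*\}\}", text)
    decision = int(match.group(1)) if match else 0  # fallback choice is an assumption
    history.append((state_msg, decision))
    return decision
```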

Hypothesis and Test

Our null hypothesis is that the ability of a language agent to balance the pole is equal to that of an agent making random decisions but significantly different from that of the agent making decisions based on the space discretization policy. The alternative hypothesis is that the LLM agent's performance is significantly different from those of both the random agent and the specialized agent. The performance of each agent is measured by the length of the episodes over a set of 100 played games.

To model each agent's performance, we assume that its episode lengths are normally distributed. The distribution parameters are approximated using the sample mean and variance from the 100 games played by each agent. To avoid assigning probability to impossible outcomes (negative episode lengths), we work with the log transformation of the sample. Additionally, the normal model was chosen over alternatives such as a Poisson or binomial model for simplicity and to allow an explicit separation between mean and variance.
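A minimal sketch of this comparison follows; Welch's two-sample t-test on the log-transformed lengths is our illustrative choice of test and may differ from the exact procedure in the repository.

```python
# Fit the normal model on log-transformed episode lengths (sample mean and
# variance) and test whether two agents' performances differ significantly.
import numpy as np
from scipy import stats

def summarize(lengths):
    """Sample mean and variance of the log-transformed episode lengths."""
    logs = np.log(np.asarray(lengths, dtype=float))
    return logs.mean(), logs.var(ddof=1)

def compare(lengths_a, lengths_b, alpha=0.05):
    """Welch's t-test on log episode lengths; returns (significant?, p-value)."""
    logs_a = np.log(np.asarray(lengths_a, dtype=float))
    logs_b = np.log(np.asarray(lengths_b, dtype=float))
    t_stat, p_value = stats.ttest_ind(logs_a, logs_b, equal_var=False)
    return p_value < alpha, p_value
```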

Results and Discussion

Comparing the performance of the different agents, we conclude that the agent following the abstract reasoning policy is not significantly different from an agent following a random policy, and that the agent following the space discretization policy performs significantly better than both the random agent and the language agent.

Our explanation for these results is that the language agent could not create a useful and correct representation of the environment, making most of its decisions inappropriate given the game state. One could say that the agent was hallucinating a representation of the world and therefore making effectively random decisions. This idea is supported by some of the agent's responses, for example:

Given the current state where the pole is leaning slightly to the left, we need to take action to prevent it from falling over. Since the pole’s angular velocity is negative, it indicates that the pole is rotating clockwise. To counteract this rotation and bring the pole back towards a more upright position, we should apply a force to the right to slow down the clockwise rotation. Considering the cart’s position and velocity, we can exert a significant force to the right to stabilize the pole and maintain equilibrium. It’s important to note that the cart’s position is within the desired range (-2.4 to 2.4), which allows us to focus on stabilizing the pole without worrying about the cart’s position. Therefore, my decision is to {{1}} apply a significant force to the right, considering the cart’s position and velocity, to stabilize the pole and prevent it from leaning further to the left.

In this example, we can clearly see that the agent misinterpreted the pole's angular velocity: when this variable is negative the pole is falling counter-clockwise, yet the agent assumed it was falling clockwise. The agent hallucinated the environment dynamics, which led to the wrong decision. Future research could measure the impact of hallucinations and try to reduce them via prompt search (or other strategies) in order to understand the limiting performance of these agents (what happens as the total number of hallucinations goes to zero) in simple control environments in a zero-shot learning context.

Citation

Cited as:

Torres, Andres. (Mar 2024). Effectiveness of Language Agents. Cloud Adventures. https://cloudadventures.net/white-papers/effectiveness-of-language-agents/.

Or

@article{torres2024,
  title   = "Effectiveness of Language Agents",
  author  = "Torres, Andres",
  journal = "cloudadventures.net",
  year    = "2024",
  month   = "Mar",
  url     = "https://cloudadventures.net/white-papers/effectiveness-of-language-agents/"
}

References

Schick, T., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv. https://arxiv.org/abs/2302.04761

Lu, P., Peng, B., Cheng, H., Galley, M., Chang, K., Wu, Y. N., Zhu, S., & Gao, J. (2023). Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. arXiv. https://arxiv.org/abs/2304.09842

Veličković, P., & Blundell, C. (2021). Neural Algorithmic Reasoning. Patterns. https://doi.org/10.1016/j.patter.2021.100273

Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., & Schwaller, P. (2023). ChemCrow: Augmenting large-language models with chemistry tools. arXiv. https://arxiv.org/abs/2304.05376

Boiko, D. A., MacKnight, R., & Gomes, G. (2023). Emergent autonomous scientific research capabilities of large language models. arXiv. https://arxiv.org/abs/2304.05332

Dibia, V. (2023). LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models. arXiv. https://arxiv.org/abs/2303.02927

Kim, G., Baldi, P., & McAleer, S. (2023). Language Models can Solve Computer Tasks. arXiv. https://arxiv.org/abs/2303.17491

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5), 834-846. https://doi.org/10.1109/TSMC.1983.6313077

Michie, D., & Chambers, R. A. (1968). BOXES: An experiment in adaptive control. Machine Intelligence, 2(2), 137-152.
