Installing Code for OpenAI “Spinning Up in Deep RL”

I want to start playing with training deep reinforcement learning agents, and OpenAI’s “Spinning Up in Deep RL” guide seems like a good place to start. But I was a bit worried about the age of this guide, parts of which have become out of date. One example: it still says MuJoCo requires a paid license, when MuJoCo has since become free to use.

Despite this, I decided it was worth my time to try its software installation guide on my computer running Ubuntu. OpenAI’s recommended procedure uses Anaconda, meaning I’ll have a Python environment dedicated to this adventure and largely isolated from everything else, starting from the Python version (3.6, instead of the latest 3.10) on down to all the dependent libraries. The good news is that all the Python-based infrastructure seemed to work without problems. But MuJoCo is not Python and thus not under Anaconda’s isolation, so all my problems came from trying to install mujoco_py, the Python library that bridges to MuJoCo.

The first problem was apparently expected by the project’s authors, as I got a prompt to set my LD_LIBRARY_PATH environment variable. The script that raised this prompt even offered its best guess at the solution:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/roger/.mujoco/mujoco210/bin

That looked reasonable to me, so I tried it, and it successfully allowed me to see the second and third problems. Both were Ubuntu packages not explicitly named in the instructions, but which I had to install before things could move on. I did a web search for the first error, which returned the suggestion to run “sudo apt install libosmesa6-dev libgl1-mesa-glx libglfw3”, and that seemed to resolve this message:

/home/roger/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py/gl/osmesashim.c:1:10: fatal error: GL/osmesa.h: No such file or directory

The second missing package error was straightforward to fix, running “sudo apt install patchelf” to resolve this message:

FileNotFoundError: [Errno 2] No such file or directory: 'patchelf': 'patchelf'

The final step was to install dependencies for MuJoCo-based Gym environments. The instruction page said to run pip install gym[mujoco,robotics] but when I did so, I was distressed to see pip uninstall the mujoco-py I had just worked to install. (!!)

  Attempting uninstall: mujoco-py
    Found existing installation: mujoco-py 2.1.2.14
    Uninstalling mujoco-py-2.1.2.14:
      Successfully uninstalled mujoco-py-2.1.2.14

The reason pip wanted to do this was that the gym packages require a mujoco_py version greater than or equal to 1.50 but less than 2.0.

Collecting mujoco-py<2.0,>=1.50
Downloading mujoco-py-1.50.1.68.tar.gz (120 kB)

Fortunately(?) this attempt to install 1.50.1.68 failed and pip rolled everything back, restoring the 2.1.2.14 I had already installed, and leaving me in a limbo state.

Well, when in doubt, try the easy thing first. I decided to be optimistic and moved on to the “check that things are working” command, which runs the Walker2d-v2 environment. A lot of numbers and words flashed by. I only marginally understood them, but I didn’t see anything I recognized as an error message. So while I see a few potential problems ahead, right now it appears this mishmash of old and new components will work well enough for a beginner.
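
For reference, here is a minimal sanity-check sketch of my own (not the Spinning Up check command itself), assuming gym and mujoco_py are importable from the active conda environment: it creates the Walker2d-v2 environment and steps through it with random actions, using the classic gym API where step() returns a four-element tuple.

import gym

env = gym.make("Walker2d-v2")
observation = env.reset()
for _ in range(100):
    action = env.action_space.sample()               # pick a random valid action
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()                    # restart the episode when it ends
env.close()
print("Walker2d-v2 stepped through without errors")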

Today I Learned: MuJoCo Is Now Free To Use

I’ve contemplated going through OpenAI’s guide Spinning Up in Deep RL. It’s one of many resources OpenAI has made available, and it builds upon the OpenAI Gym system of environments for training deep reinforcement learning agents. They range from very simple text-based environments, to 2D Atari games, to full 3D environments built with MuJoCo, whose documentation explains that the name is shorthand for the type of interactions it simulates: “Multi-Joint dynamics with Contact.”

I’ve seen MuJoCo mentioned in various research contexts, and I’ve inferred it is a better physics simulation than what we would find in, say, a game engine like Unity. No simulation engine is perfect; each makes different tradeoffs, and it sounds like AI researchers (or at least those at OpenAI) believe MuJoCo is the best one to use for training deep reinforcement learning agents with the best chance of being applicable to the real world.

The problem is that, when I looked at OpenAI Gym the first time, MuJoCo was expensive. This time around, I visited the MuJoCo page hoping they had launched a more affordable tier of licensing, and there I got the news: sometime in the past two years (I didn’t see a date stamp) DeepMind acquired MuJoCo and intends to release it as free open source software.

DeepMind was itself acquired by Google and, when that collection of companies was reorganized, it became one of several companies under the parent company Alphabet. At a practical level, this meant DeepMind had indirect access to Google money for buying things like MuJoCo. There’s lots of flowery wordsmithing about how opening up MuJoCo will advance research; what I care about is the fact that everyone (including myself) can now use MuJoCo without worrying about the licensing fees it previously required. This is a great thing.

At the moment MuJoCo is only available as compiled binaries, which is fine by me. Eventually it is promised to be fully open sourced in a GitHub repository set up for the purpose. The README of that repository makes one thing very clear:

This is not an officially supported Google product.

I interpret this to mean I’ll be on my own to figure things out without Google technical support. Is that a bad thing? I won’t know until I dive in and find out.

Window Shopping: OpenAI Spinning Up in Deep Reinforcement Learning

I was very encouraged after my second look at Unity ML-Agents. When I first looked at it a few years ago, it motivated me to look at OpenAI Gym for training AI agents via reinforcement learning. But once I finished the “Hello World” examples, I got lost trying to get further and quickly became distracted by other, more fruitful project ideas. With this past experience in mind, I set out to find something more instructive for a beginner and found OpenAI’s guide “Spinning Up in Deep RL”. It opens with the following text:

Welcome to Spinning Up in Deep RL! This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL).

This sounds exactly like what I needed. Jackpot! At first I thought this was something new since my last look, but the timeline at the bottom of the page indicated this was already available when I last looked at reinforcement learning resources on OpenAI. I had missed it! I regret the lost opportunity, but at least I’ve found it this time.

The problem with finding such a resource a few years after publication is that it may already be out of date. The field of deep learning moves so fast! I’m pretty sure the fundamentals will still be applicable, but the state of the art has certainly moved on. I’m also worried about the example code that goes with this resource, which looks stale at first glance. For example, it launched with examples that used the now-deprecated TensorFlow 1 API instead of the current TensorFlow 2 API. I don’t care to learn TF1 just for the sake of this course, but fortunately in January 2020 they added alternative examples implemented using PyTorch. If I’m lucky, PyTorch hasn’t made a major breaking version change and I can still use those examples.

In addition to the PyTorch examples, there’s another upside of finding this resource now instead of later. For 3D environment simulations OpenAI uses MuJoCo. When I looked at OpenAI Gym earlier, running the 3D environments required a MuJoCo license that costs $500/year, and I couldn’t justify that money just for playing around. But good news! MuJoCo is now free to use.

An Unsuccessful First Attempt Applying Q-Learning to CartPole Environment

One of the objectives of OpenAI Gym is to have a common programming interface across all of its different environments. And it certainly looks pretty good on the surface: we reset() the environment, take actions to step() through it, and at some point we get True as a return value for the done flag. Having a common interface allows us to use the same algorithm across multiple environments with minimal modification.
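
As a minimal sketch of that common interface (assuming the classic gym API where step() returns a four-element tuple; the environment name and the random “agent” are placeholders of my own), an episode loop looks like this:

import gym

def run_episode(env, choose_action):
    # Run one episode and return its total reward.
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = choose_action(state)
        state, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward

env = gym.make("Taxi-v3")
print(run_episode(env, lambda state: env.action_space.sample()))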

But “minimal” modification is not “zero” modification. Some environments are close enough that no modifications are required, but not all of them. Sometimes an environment is just not the right fit for an algorithm, and sometimes there are important details which differ from one environment to another.

One way environments differ is in their spaces. An environment has two: an observation_space that describes the observed state of the environment, and an action_space that outlines the valid actions an agent may choose to take. They change from one environment to another because environments tend to have different observable properties and different actions an agent can take within them.
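
A quick way to see this difference is to print both spaces side by side; a small sketch (the environment IDs Taxi-v3 and CartPole-v0 are my assumption of the relevant versions):

import gym

for name in ("Taxi-v3", "CartPole-v0"):
    env = gym.make(name)
    # Taxi observations are a single discrete value with six discrete actions;
    # CartPole observations are a box of four floating point numbers with two actions.
    print(name, env.observation_space, env.action_space)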

As an exercise I thought I’d take the simple Q-Learning algorithm demonstrated to solve the Taxi environment and slam it on top of CartPole just to see what happens. To do that, I had to take CartPole‘s state, which is an array of four floating point numbers, and convert it into an integer suitable for an array index.

As a naive approach, I’ll slice up the space into discrete slices. Each of the four numbers will be divided into ten bins. Each bin corresponds to a single digit from zero to nine, so the four numbers can be composed into a four-digit integer value.

To determine the size of these bins, I executed 1000 episodes of the CartPole simulation while taking random actions via action_space.sample(). The ten bins are evenly divided between the minimum and maximum values observed in this sample run, and Q-learning is off and running… doing nothing useful.
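
The discretization could look something like the sketch below (my own reconstruction under assumed names; the actual code for this exercise is linked at the end of this post):

import gym
import numpy as np

env = gym.make("CartPole-v0")

# Sample random episodes to estimate the observed range of each of the
# four state variables.
samples = []
for _ in range(1000):
    state, done = env.reset(), False
    while not done:
        state, _, done, _ = env.step(env.action_space.sample())
        samples.append(state)
samples = np.array(samples)
low, high = samples.min(axis=0), samples.max(axis=0)

def discretize(state):
    # Map each of the four floats to a digit 0-9, then compose the four
    # digits into a single index between 0 and 9999.
    digits = np.clip(((state - low) / (high - low) * 10).astype(int), 0, 9)
    return digits[0] * 1000 + digits[1] * 100 + digits[2] * 10 + digits[3]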

As shown in the plot above, the reward is always 8, 9, 10, or 11; we never got above or below this range. Also, out of 10,000 possible states, only about 50 were ever visited.

So this first naive attempt didn’t work, but it was a fun experiment. Now the more challenging part: figuring out where it went wrong, and how to fix it.

Code written for this exercise is available here.


Taking First Step Into Reinforcement Learning with OpenAI Gym

The best part about learning a new technology today is the fact that, once armed with a few key terms, a web search can unlock endless resources online. Some of them are even free! Such was the case after I looked over OpenAI Gym on its own: I searched for an introductory reinforcement learning project online and found several to choose from. I started with this page, which uses the “Taxi” environment of OpenAI Gym and, within a few lines of Python code, implements a basic Q-Learning agent that can complete the task within 1000 episodes.

I had previously read the Wikipedia page on Q-Learning, but a description suitable for an encyclopedia entry is not always straightforward to put into code. For example, Wikipedia describes the learning rate as a value from 0 to 1 and explains what it means at the extremes of 0 or 1, but it doesn’t give any guidance on what kind of values are useful in real world examples. The tutorial used 0.618, and while there isn’t enough information on why that value was chosen, it served as a good enough starting point. For this and related reasons, it was good to have a simple implementation.
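
The core of such a tabular Q-learning agent is only a few lines. Here is a rough sketch under my own assumptions (the tutorial’s exploration strategy and discount factor may differ):

import gym
import numpy as np

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))  # 500 x 6
alpha = 0.618   # learning rate used by the tutorial
gamma = 0.9     # discount factor (my assumption)

for episode in range(1000):
    state, done = env.reset(), False
    while not done:
        action = np.argmax(q_table[state])            # greedy action selection
        next_state, reward, done, _ = env.step(action)
        # Q-learning update: nudge toward reward plus discounted best future value
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state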

After I got it running, it was time to start poking around to learn more. The first question was how fast the algorithm learned to solve the problem, and for that I wanted to plot the cumulative evaluation function reward against iterations. This was trivial with the help of PyPlot, and I obtained the graph at the top of this post. We can see a lot of learning progress within the first 100 episodes. There’s a mysterious degradation in capability around the 175th episode, but the system mostly recovered by 200. After that, there were diminishing returns until about 400, and the agent made no significant improvements after that point.
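
A plot along those lines takes only a few lines of PyPlot; a sketch, assuming the total reward of each episode was appended to a list inside the training loop:

import matplotlib.pyplot as plt

episode_rewards = []   # filled with one total-reward value per training episode

plt.plot(episode_rewards)
plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.title("Taxi Q-learning progress")
plt.show()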

This simple algorithm used an array that could represent all 500 states of the environment. With six possible actions, it was an array of 3000 entries initially filled with zeros. I was curious how long it took for the entire problem space to be explored, and the answer seems to be roughly 50 episodes before there were 2400 nonzero entries, a count that never grew beyond 2400. This was far faster than I had expected, and it was also a surprise that 600 entries in the array were never used.
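
Counting how much of the table has been touched is nearly a one-liner with NumPy; a sketch, where q_table stands in for the trained 500-by-6 array:

import numpy as np

q_table = np.zeros((500, 6))   # placeholder for the trained Q-table

nonzero_entries = np.count_nonzero(q_table)
untouched_states = np.count_nonzero(np.all(q_table == 0, axis=1))
print(f"{nonzero_entries} of {q_table.size} entries nonzero,",
      f"{untouched_states} states never visited")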

What did those 600 entries represent? With six possible actions per state, they imply 100 states of the environment that were never reached. I thought I’d throw that array into PyPlot and see if anything jumped out at me:

Taxi Q plotted raw

I’m at a loss as to how to interpret this data. But I don’t know how important it is to understand right now: this is an environment whose entire problem space can be represented in memory, using discrete values, and those are luxuries that quickly disappear as problems get more complex. The real world is not so easily classified into discrete states, and we haven’t even involved neural networks yet. The latter is referred to as DQN (Deep Q-Network) and is still yet to come.

The code I wrote for this exercise is available here.

Quick Overview: OpenAI Gym

Given what I’ve found so far, it looks like Unity would be a good way to train reinforcement learning agents, and Gazebo would be used afterwards to see how they work before deploying on actual physical robots. I might end up doing something different, but they are good targets to work towards. But where would I start? That’s where OpenAI Gym comes in.

It is a collection of prebuilt environments that are free and open for hobbyists, students, and researchers alike. The list of available environments ranges across a wide variety of problem domains, from text-based activities that should in theory be easy for computers, to full-on 3D simulations like what I’d expect to find in Unity and Gazebo. Putting them all under the same umbrella, easily accessed from Python in a consistent manner, makes it simple to gradually increase the complexity of the problems being solved.

Following the Getting Started guide, I was able to install the Python package and run the CartPole-v0 example. I was also able to bring up its Atari subsystem in the form of MsPacman-v4. The 3D simulations use MuJoCo as their physics engine, which has a 30-day trial; after that it costs $500/yr for personal non-commercial use. At the moment I don’t see enough benefit to justify the cost, so the tentative plan is to learn the basics of reinforcement learning on simple 2D environments. By the time I’m ready to move into 3D, I’ll use Unity instead of paying for MuJoCo, bypassing the 3D simulation portion of OpenAI Gym.

I’m happy OpenAI Gym provides a beginner-friendly set of standard reinforcement learning textbook environments. Now I’ll need to walk through some corresponding textbook examples on how to create an agent that learns to work in those environments.