Notes on Reinforcement Learning Algorithm Implementations Published by OpenAI

OpenAI’s Spinning Up in Deep RL guide has copious written resources, but it also offers resources in the form of algorithm implementations in code. Each implementation is accompanied by a background section explaining how the algorithm works, summarized by a “Quick Facts” section that serves as a high-level view of how these algorithms differ from each other. As of this writing, there are six implementations. I understand the main division is that there are three “on-policy” algorithms and three “off-policy” algorithms. Within each division, the three algorithms are roughly sorted by age and illustrate an arc of research in the field. The best(?) on-policy algorithm here is Proximal Policy Optimization (PPO), and representing the off-policy vanguard is Soft Actor-Critic (SAC).

These implementations are geared towards teaching these algorithms. Since priority is placed on this learning context, they deliberately omit enhancements and optimizations that would make the code harder to understand. For example, many of these implementations do not support parallelization.

This is great for its intended purpose of supplementing Spinning Up, but some people wanted an already-optimized implementation. For this audience, OpenAI published the OpenAI Baselines a few years ago to serve as high-water marks for algorithm implementation. These baselines can serve either as a starting point for other projects, or as benchmarks to measure potential improvements against.

However, it appears these baselines have gone a little stale. This is not the fault of the implementations themselves, but merely a consequence of how quickly this research field moves. The repository hosts implementations using TensorFlow 1, which has since been deprecated. There’s a (partial? full?) conversion to TensorFlow 2 in a separate branch, but that never merged back to the main branch for whatever reason. As of this writing there’s an open issue asking for PyTorch implementations, prompted by OpenAI’s own announcement that they will be standardizing on PyTorch. (This announcement actually led to the PyTorch conversion of the Spinning Up in Deep RL examples.) However, there’s no word yet on PyTorch conversions for OpenAI Baselines, so anyone who wants implementations in PyTorch will either have to optimize the Spinning Up implementations themselves or look elsewhere.

Of course, all of this assumes reinforcement learning will actually solve the problem, which I’ve learned might not be a good assumption.

Notes on Deep Reinforcement Learning Resources by OpenAI

I’m going through the Spinning Up in Deep RL (Reinforcement Learning) guide published by OpenAI, and I found the Introduction to RL section quite enlightening. While it was just a bit too advanced for me to understand everything, I understood enough to make sense of many other things I’ve come across recently while browsing the publicly available resources in this field. Following that introductory section was a Resources section with the following pages:

Spinning Up as a Deep RL Researcher

This section is for a target audience that wishes to contribute to the frontier of research in reinforcement learning: building the background, learning by doing, and being rigorous in research projects. It is filled with information that feels like it was distilled from the author’s experience. The most interesting part for me is the observation that “broken RL code almost always fails silently,” meaning there are no error messages, just a failure of the agent to learn from its experience. The worst part is that when this happens, it’s hard to tell the difference between a flawed algorithm and a flawed implementation of a good algorithm.

Key Papers in Deep RL

This was the completely expected directory of papers, roughly organized by the taxonomy used for the introduction in this guide. I believe the author assembled this list to form a solid foundation for further exploration. It appears that most (and possibly all) of them are freely available, a refreshing change from the paywalls that usually block curious minds.

Exercises

Oh no, I didn’t know there would be homework in this class! Yet here we are. There are a few problems for readers to attempt implementing in their choice of TensorFlow or PyTorch, along with solutions to check answers against. The first set covers basic implementation; the second set covers some algorithm failure modes. This reinforces what was covered earlier: broken RL code fails silently, so an aspiring practitioner must recognize the symptoms of failure modes.

Benchmarks for Spinning Up Implementations

The Spinning Up in Deep RL guide is accompanied by implementations of several representative algorithms. Most of them have two implementations: one in TensorFlow and one in PyTorch. But when I run them on my own computer, how would I know if they’re running correctly? This page is a valuable resource to check against. It has charts of these algorithms’ performance in five MuJoCo Gym environments, under conditions also described on the page. And once I have confidence they are running correctly, these charts also serve as benchmarks to see if I can improve their performance. These implementations are written for beginners to read and understand, which means they skip some of the esoteric, hard-to-understand optimizations that readers are challenged to add themselves.
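
If I read the documentation correctly, each implementation can also be kicked off as a plain Python function call, so reproducing one of these benchmark charts might look roughly like the sketch below. The parameter names come from my reading of the Spinning Up docs, but the specific settings here are my own placeholder guesses rather than the actual benchmark conditions described on the page:

from spinup import ppo_pytorch as ppo   # the PyTorch PPO implementation
import gym

# Factory for one of the five MuJoCo Gym environments on the benchmark page.
env_fn = lambda: gym.make('HalfCheetah-v2')

# Placeholder settings; the real benchmark conditions are listed on the page itself.
ppo(env_fn=env_fn,
    ac_kwargs=dict(hidden_sizes=(64, 32)),
    steps_per_epoch=4000,
    epochs=250,
    seed=0,
    logger_kwargs=dict(output_dir='out/ppo_halfcheetah', exp_name='ppo_halfcheetah'))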

For people who want optimized implementations of these algorithms, OpenAI has them covered too. Or at least, they used to.

Notes on “Introduction to RL” by OpenAI

With some confidence that I could practice simple reinforcement learning algorithms, I moved on to what I consider the meat of OpenAI’s Spinning Up in Deep RL guide: the section titled Introduction to RL. I thought it was pretty good stuff.

Part 1: Key Concepts in RL

Since I had been marginally interested in the field, the first section What Can RL Do? was mostly review for me. More than anything else, it confirmed that I was reading the right page. The guide then proceeded to Key Concepts and Terminology, which quickly caught up to, then surpassed, what knowledge I already had. I was fine with the verbal descriptions and the code snippets, but once the author started phrasing things in terms of equations it became more abstract than I can easily follow. I’m a code person, not an equations person. Given my difficulty comprehending the formalized equations, I was a little bemused by the third section, (Optional) Formalism, as things were already more formalized than my usual style.

Part 2: Kinds of RL Algorithms

Given that I didn’t comprehend everything in part 1, I was a little scared of what I would find in part 2. Gold. I found pure gold. Here I found a quick survey of the major characteristics of various reinforcement learning approaches, explaining those terms in ways I (mostly) understood: what “Model-Free” vs. “Model-Based” means, what “on-policy” vs. “off-policy” means. I had seen a lot of these terms thrown around in things I read before, but I had never found a place laying them out against one another until this section. This section alone was worth the effort of digging into this guide and will help me understand other things. For example, a lot of these terms appear in the documentation for Unity ML-Agents. I didn’t understand what they meant at the time, but now I hope to understand more when I review them.

Part 3: Intro to Policy Optimization

Cheered by how informative I found part 2, I proceeded to part 3 and was promptly brought back down to earth by the equations. The good thing is that this section had both the equations and an example code implementation, explained more or less in parallel. Someone like myself can read the code and try to understand the equations. I expect there are people out there who are the reverse, more comfortable with the equations and appreciative of seeing what they look like in code. That said, I can’t claim I completely understand the code, either. Some of the mathematical complexities represented in the equations were not in the sample source code: they were handed off to implementations in the PyTorch library.
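
To give a concrete idea of what I mean, here is my own stripped-down sketch (not the guide’s code) of the simplest policy gradient loss in PyTorch. The network, names, and tensor sizes are made up for illustration; the point is that the probability math lives inside torch.distributions rather than in the visible code:

import torch
from torch.distributions.categorical import Categorical

# A tiny made-up policy network: 4-dimensional observations in, 2 action logits out.
policy_net = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))

def pseudo_loss(obs, actions, returns):
    # torch.distributions handles the probability bookkeeping:
    # log-probabilities of the actions that were actually taken.
    logp = Categorical(logits=policy_net(obs)).log_prob(actions)
    # Simplest policy gradient: weight each log-probability by the return
    # earned, and negate so that minimizing the loss ascends the objective.
    return -(logp * returns).mean()

# A hypothetical batch of experience, just to show the shapes involved.
obs = torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))
returns = torch.randn(8)
pseudo_loss(obs, actions, returns).backward()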

I foresee a lot of studying PyTorch documentation before I can get proficient at writing something new on my own. But just from reading this Introduction to RL section, I have the basic understanding to navigate PyTorch API documentation and maybe the skill to apply existing implementations written by somebody else. And as expected of an introduction, there are tons of links to more resources.

Old PyTorch Without GPU Is Enough To Start

I’ve mostly successfully followed the installation instructions for OpenAI’s Spinning Up in Deep RL. I am optimistic this particular Anaconda environment (in tandem with the OpenAI guide) will be enough to get me off the ground. However, I don’t expect it to be enough for doing anything extensive, because when I checked the installation, I saw it pulled down PyTorch 1.3.1, which is now fairly old. As of this writing, PyTorch LTS is 1.8.2 and Stable is 1.10. On top of that, this old PyTorch runs without CUDA GPU acceleration.

>>> import torch
>>> print(torch.__version__)
1.3.1
>>> print(torch.cuda.is_available())
False

NVIDIA’s CUDA API has been credited with making the current boom in deep learning possible, because CUDA opened up GPU hardware for uses beyond its original gaming intent. Such massively parallel computing hardware made previously impractical algorithms practical. However, such power does incur overhead. For one thing, data has to be copied from a computer’s main memory to memory modules on board the GPU before it can be processed, and then the results have to be copied back. For smaller problems, the cost of such overhead can swamp the benefit of GPU acceleration.
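
In PyTorch terms, that round trip looks something like the sketch below (the array sizes are arbitrary, and the GPU branch only runs on a machine where torch.cuda.is_available() is true, unlike mine):

import torch

data = torch.randn(64, 100)              # starts out in the computer's main memory

if torch.cuda.is_available():
    data_gpu = data.to("cuda")           # copy: main memory -> GPU memory
    result_gpu = data_gpu @ data_gpu.T   # the actual work, done on the GPU
    result = result_gpu.to("cpu")        # copy: GPU memory -> main memory
else:
    result = data @ data.T               # small problems may finish faster right here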

I believe I ran into this when working through Codecademy’s TensorFlow exercises. I initially set up CPU-only TensorFlow on my computer to get started and, once I had that initial experience, I installed TensorFlow with GPU support. I was a little surprised to see that small teaching examples from Codecademy took more time overall on the GPU-accelerated installation than on the CPU-only installation. One example took two minutes to train in CPU-only mode, but two and a half minutes with GPU overhead. By all reports, the GPU will become quite important once I start tackling larger neural networks, but I’m not going to sweat the complexity until then.

So for Spinning Up in Deep RL, my first goal is to get some experience running these algorithms in a teaching examples context. If I am successful through that phase (which is by no means guaranteed) and start going beyond small examples, then I’ll worry about setting up another Anaconda environment with a modern version of PyTorch with GPU support.

Installing Code for OpenAI “Spinning Up in Deep RL”

I want to start playing with training deep reinforcement learning agents, and OpenAI’s “Spinning Up in Deep RL” guide seems like a good place to start. But I was a bit worried about the age of this guide, parts of which have become out of date. One example: it still says MuJoCo requires a paid license, but MuJoCo has since become free to use.

Despite this, I decided it was worth my time to try its software installation guide on my computer running Ubuntu. OpenAI’s recommended procedure uses Anaconda, meaning I’ll have a Python environment dedicated to this adventure and largely isolated from everything else, starting from the Python version (3.6, instead of the latest 3.10) on down to all the dependent libraries. The good news is that all the Python-based infrastructure seemed to work without problems. But MuJoCo is not Python and thus not under Anaconda isolation, so all my problems came from trying to install mujoco_py (a Python library that bridges to MuJoCo).

The first problem was apparently expected by the project’s authors, as I got a prompt to set my LD_LIBRARY_PATH environment variable. The script that raised this prompt even offered its best guess at the solution:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/roger/.mujoco/mujoco210/bin

That looked reasonable to me, so I tried it, and it successfully allowed me to see the second and third problems. Both were Ubuntu packages that were not explicitly named in the instructions but had to be installed before things could move on. I did a web search for the first error, which returned the suggestion to run “sudo apt install libosmesa6-dev libgl1-mesa-glx libglfw3”, and that seemed to resolve this message:

/home/roger/anaconda3/envs/spinningup/lib/python3.6/site-packages/mujoco_py/gl/osmesashim.c:1:10: fatal error: GL/osmesa.h: No such file or directory

The second missing package error was straightforward to fix, running “sudo apt install patchelf” to resolve this message:

FileNotFoundError: [Errno 2] No such file or directory: 'patchelf': 'patchelf'

The final step was to install dependencies for the MuJoCo-based Gym environments. The instruction page said to run pip install gym[mujoco,robotics], but when I did so, I was distressed to see my work to install mujoco-py being removed. (!!)

  Attempting uninstall: mujoco-py
    Found existing installation: mujoco-py 2.1.2.14
    Uninstalling mujoco-py-2.1.2.14:
      Successfully uninstalled mujoco-py-2.1.2.14

pip wanted to do this because the gym packages require a mujoco_py version greater than or equal to 1.50 but less than 2.0.

Collecting mujoco-py<2.0,>=1.50
Downloading mujoco-py-1.50.1.68.tar.gz (120 kB)

Fortunately(?) this attempt to install 1.50.1.68 failed, and pip rolled everything back, restoring the 2.1.2.14 I had already installed and leaving me in a limbo state.

Well, when in doubt, try the easy thing first. I decided to be optimistic and moved on to the “check that things are working” command, which runs the Walker2d-v2 environment. I saw a lot of numbers and words flash by. I only marginally understood them, but I didn’t see anything I recognized as an error message. So while I see a few potential problems ahead, right now it appears this mishmash of old and new components will work well enough for a beginner.
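
Beyond watching text scroll by, I figure I can also poke at the MuJoCo environment directly from Python as an extra sanity check. A minimal sketch, assuming the older Gym API installed here (where reset() returns just the observation and step() returns a four-item tuple):

import gym

env = gym.make('Walker2d-v2')    # will fail loudly if mujoco_py is broken
obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()            # random flailing, no learning involved
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()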

Today I Learned: MuJoCo Is Now Free To Use

I’ve contemplated going through OpenAI’s guide Spinning Up in Deep RL. It’s one of many resources OpenAI has made available, and it builds upon the OpenAI Gym system of environments for training deep reinforcement learning agents. These range from very simple text-based environments, to 2D Atari games, to full 3D environments built with MuJoCo, whose documentation explains that the name is shorthand for the type of interactions it simulates: “Multi Joint Dynamics with Contact”.

I’ve seen MuJoCo mentioned in various research contexts, and I’ve inferred it is a better physics simulation than what we would find in, say, a game engine like Unity. No simulation engine is perfect; they each make different tradeoffs, and it sounds like AI researchers (or at least those at OpenAI) believe MuJoCo’s tradeoffs give deep reinforcement learning agents trained in it the best chance of being applicable to the real world.

The problem is that, when I looked at OpenAI Gym the first time, MuJoCo was expensive. This time around, I visited the MuJoCo page hoping they had launched a more affordable licensing tier, and there I got the news: sometime in the past two years (I didn’t see a date stamp) DeepMind acquired MuJoCo and intends to release it as free, open-source software.

DeepMind was itself acquired by Google and, when that collection of companies was reorganized, became one of several companies under the parent company Alphabet. At a practical level, it meant DeepMind had indirect access to Google money for buying things like MuJoCo. There’s lots of flowery wordsmithing about how opening up MuJoCo will advance research, but what I care about is the fact that everyone (including myself) can now use MuJoCo without worrying about the licensing fees it previously required. This is a great thing.

At the moment MuJoCo is only available as compiled binaries, which is fine by me. Eventually it is promised to be fully open-sourced in a GitHub repository set up for the purpose. The README of that repository makes one thing very clear:

This is not an officially supported Google product.

I interpret this to mean I’ll be on my own to figure things out without Google technical support. Is that a bad thing? I won’t know until I dive in and find out.

Window Shopping: OpenAI Spinning Up in Deep Reinforcement Learning

I was very encouraged after my second look at Unity ML-Agents. When I first looked at it a few years ago, it motivated me to look at OpenAI Gym for training AI agents via reinforcement learning. But once I finished the “Hello World” examples, I got lost trying to get further and quickly became distracted by other, more fruitful project ideas. With this past experience in mind, I set out to find something more instructive for a beginner and found OpenAI’s guide “Spinning Up in Deep RL”. It opened with the following text:

Welcome to Spinning Up in Deep RL! This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL).

This sounds exactly like what I needed. Jackpot! At first I thought this was something new since my last look, but the timeline at the bottom of the page indicated this was already available when I last looked at reinforcement learning resources on OpenAI. I had missed it! I regret the lost opportunity, but at least I’ve found it this time.

The problem with finding such a resource a few years after publication is that it may already be out of date. The field of deep learning moves so fast! I’m pretty sure the fundamentals will still be applicable, but the state of the art has certainly moved on. I’m also worried about the example code that goes with this resource, which looks stale at first glance. For example, it launched with examples that used the now-deprecated TensorFlow 1 API instead of the current TensorFlow 2 API. I don’t care to learn TF1 just for the sake of this course, but fortunately in January 2020 they added alternative examples implemented using PyTorch. If I’m lucky, PyTorch hasn’t made a major breaking version change and I can still use those examples.

In addition to the PyTorch examples, there’s another upside to finding this resource now instead of later. For 3D environment simulations, OpenAI uses MuJoCo. When I looked at OpenAI Gym earlier, running the 3D environments required a MuJoCo license that cost $500/year, and I couldn’t justify that money just for playing around. But good news! MuJoCo is now free to use.