An Unsuccessful First Attempt Applying Q-Learning to CartPole Environment

One of the objectives of OpenAI Gym is to have a common programming interface across all of its different environments. And it certainly looks pretty good on the surface: we reset() the environment, take actions to step() through it, and at some point we get True as the return value for the done flag. Having a common interface allows us to use the same algorithm across multiple environments with minimal modification.
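To make that concrete, here is a minimal sketch of the loop, assuming the classic Gym API of this era where reset() returns just the observation and step() returns four values:

import gym

env = gym.make("CartPole-v0")
state = env.reset()
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()            # pick a random valid action
    state, reward, done, info = env.step(action)  # advance the simulation one step
    total_reward += reward

print("Episode finished with total reward", total_reward)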

But “minimal” modification is not “zero” modification. Some environments are close enough that no modifications are required, but not all of them. Sometimes an environment is just not the right fit for an algorithm, and sometimes there are important details which differ from one environment to another.

One way environments differ is in the types of spaces they use. An environment has two: an observation_space that describes the observed state of the environment, and an action_space that outlines the valid actions an agent may choose to take. These vary from one environment to another because each environment has its own observable properties and its own set of actions an agent can take within it.
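For example, printing the spaces of the two environments discussed below shows the difference. The version suffixes (such as "Taxi-v3") are my assumption and may vary with the installed gym release, as may the exact printed form of each space:

import gym

taxi = gym.make("Taxi-v3")
cartpole = gym.make("CartPole-v0")

print(taxi.observation_space)      # Discrete(500): a single integer state
print(taxi.action_space)           # Discrete(6)
print(cartpole.observation_space)  # Box(4,): four floating point numbers
print(cartpole.action_space)       # Discrete(2)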

As an exercise, I thought I'd take the simple Q-Learning algorithm demonstrated to solve the Taxi environment and slam it on top of CartPole just to see what happens. To do that, I had to take CartPole's state, which is an array of four floating point numbers, and convert it into an integer suitable for use as an array index.

As a naive approach, I'll slice the observation space into discrete bins. Each of the four numbers will be divided into ten bins. Each bin corresponds to a single digit from zero to nine, so the four numbers can be composed into a four-digit integer value.
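A sketch of that conversion might look like the following, where lows and highs are assumed to hold the per-dimension bounds used for binning:

import numpy as np

def discretize(observation, lows, highs, bins=10):
    index = 0
    for value, low, high in zip(observation, lows, highs):
        digit = int((value - low) / (high - low) * bins)
        digit = min(bins - 1, max(0, digit))  # clamp values outside the observed range
        index = index * bins + digit          # pack digits into a four digit integer
    return index                              # 0 to 9999, usable as an array index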

To determine the size of these bins, I executed 1000 episodes of the CartPole simulation while taking random actions via action_space.sample(). The ten bins are evenly divided between the minimum and maximum values observed in this sample run, and with that, Q-learning is off and running… doing nothing useful.
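For reference, the bin-sizing step might be sketched like this, again assuming the older Gym API:

import gym
import numpy as np

env = gym.make("CartPole-v0")
observations = []

for _ in range(1000):
    state = env.reset()
    done = False
    while not done:
        observations.append(state)
        state, reward, done, info = env.step(env.action_space.sample())

observations = np.array(observations)
lows = observations.min(axis=0)   # per-dimension minimums seen in the sample
highs = observations.max(axis=0)  # per-dimension maximums seen in the sample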

As shown in the plot above, the episode reward is always 8, 9, 10, or 11; we never got above or below that range. Also, out of the 10,000 possible states, only about 50 were ever visited.

So this first naive attempt didn’t work, but it was a fun experiment. Now the more challenging part: figuring out where it went wrong, and how to fix it.

Code written in this exercise is available here.

 

Taking First Step Into Reinforcement Learning with OpenAI Gym

The best part about learning a new technology today is the fact that, once armed with a few key terms, a web search can unlock endless resources online. Some of which are even free! Such was the case after I looked over OpenAI Gym on its own: I searched for an introductory reinforcement learning project online and found several to choose from. I started with this page, which uses the “Taxi” environment of OpenAI Gym and, within a few lines of Python code, implements a basic Q-Learning agent that can complete the task within 1000 episodes.

I had previously read the Wikipedia page on Q-Learning, but a description suitable for an encyclopedia entry is not always straightforward to put into code. For example, Wikipedia describes the learning rate as a value from 0 to 1, along with what it means at the extremes of 0 or 1. But it doesn’t give any guidance on what kind of values are useful in real-world examples. The tutorial used 0.618, and while there isn’t enough information on why that value was chosen, it served as a good enough starting point. For this and related reasons, it was good to have a simple implementation.
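The core update from that tutorial, as I understand it, boils down to a single line. Here alpha is the tutorial's 0.618 learning rate, while the discount factor gamma is my own assumed value for this sketch:

import numpy as np

alpha = 0.618  # learning rate from the tutorial
gamma = 0.9    # discount factor, assumed for this sketch

def q_update(Q, state, action, reward, next_state):
    # Q(s,a) += alpha * (reward + gamma * max over a' of Q(s',a') - Q(s,a))
    best_next = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])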

After I got it running, it was time to start poking around to learn more. The first question was how fast the algorithm learned to solve the problem, and for that I wanted to plot the cumulative reward against training episodes. This was trivial with the help of PyPlot, and I obtained the graph at the top of this post. We can see a lot of learning progress within the first 100 episodes. There’s a mysterious degradation in capability around the 175th episode, but the system mostly recovered by episode 200. After that, there were diminishing returns until about episode 400, and the agent made no significant improvements after that point.
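The plot itself takes only a few lines; this sketch assumes episode_rewards is a list holding the total reward collected in each training episode:

import matplotlib.pyplot as plt

def plot_progress(episode_rewards):
    plt.plot(episode_rewards)
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.title("Taxi Q-learning progress")
    plt.show()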

This simple algorithm used an array that could represent all 500 states of the environment. With six possible actions, that works out to an array with 3000 entries, initially filled with zeros. I was curious how long it took for the entire problem space to be explored, and the answer seems to be roughly 50 episodes: by then there were 2400 nonzero entries, and that count never grew beyond 2400. This was far faster than I had expected it to take to explore 2400 states, and it was also a surprise that 600 entries in the array were never used.
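Counting the explored entries is a one-liner with NumPy; the sketch below assumes Q is the 500-by-6 table from the Taxi agent:

import numpy as np

def count_explored(Q):
    return np.count_nonzero(Q)  # entries that have been updated at least once

Q = np.zeros((500, 6))    # 500 states x 6 actions = 3000 entries
print(count_explored(Q))  # 0 until training starts filling entries in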

What did those 600 entries represent? With six possible actions, that implies there are 100 unreachable states in the environment. I thought I’d throw the array into PyPlot and see if anything jumped out at me:

[Figure: Taxi Q-table plotted raw]
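For reference, a raw dump like the plot above might be produced with something along these lines, again assuming Q is the 500-by-6 array:

import matplotlib.pyplot as plt

def plot_q_table(Q):
    plt.imshow(Q, aspect="auto", interpolation="nearest")  # one row per state, one column per action
    plt.xlabel("Action")
    plt.ylabel("State")
    plt.colorbar(label="Q value")
    plt.show()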

My mind is at a loss as to how to interpret this data. But I don’t know how important it is to understand right now: this is an environment whose entire problem space can be represented in memory using discrete values, and those are luxuries that quickly disappear as problems get more complex. The real world is not so easily classified into discrete states, and we haven’t even involved neural networks yet. That combination is referred to as DQN (Deep Q-Network) and is still yet to come.

The code I wrote for this exercise is available here.

Quick Overview: OpenAI Gym

Given what I’ve found so far, it looks like Unity would be a good way to train reinforcement learning agents, and Gazebo would be used afterwards to see how they work before deploying on actual physical robots. I might end up doing something different, but they are good targets to work towards. But where would I start? That’s where OpenAI Gym comes in.

It is a collection of prebuilt environments that are free and open for hobbyists, students, and researchers alike. The list of available environments ranges across a wide variety of problem domains, from text-based activities that should in theory be easy for computers, to full-on 3D simulations like what I’d expect to find in Unity and Gazebo. Putting them all under the same umbrella, easily accessed from Python in a consistent manner, makes it simple to gradually increase the complexity of the problems being solved.

Following the Getting Started guide, I was able to install the Python package and run the CartPole-v0 example. I was also able to bring up its Atari subsystem in the form of MsPacman-v4. The 3D simulations use MuJoCo as their physics engine, which has a 30-day trial and after that costs $500/yr for personal non-commercial use. At the moment I don’t see enough benefit to justify the cost, so the tentative plan is to learn the basics of reinforcement learning on simple 2D environments. By the time I’m ready to move into 3D, I’ll use Unity instead of paying for MuJoCo, bypassing the 3D simulation portion of OpenAI Gym.
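Creating those environments looks the same regardless of the underlying complexity. The sketch below assumes the Atari extras are installed (e.g. pip install gym[atari]) and that the version suffixes match the installed gym release:

import gym

for name in ("CartPole-v0", "MsPacman-v4"):
    env = gym.make(name)
    env.reset()
    env.render()                         # pop up a window showing the environment
    env.step(env.action_space.sample())  # take one random action
    env.close()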

I’m happy OpenAI Gym provides a beginner-friendly set of standard reinforcement learning textbook environments. Now I’ll need to walk through some corresponding textbook examples on how to create an agent that learns to work in those environments.