Unity-Python Communication for ML-Agents: Good, Bad, and Ugly

I’ve only just learned that Unity DOTS exists, and it seems like an interesting approach to learn for utilizing the resources of modern multicore computers. But based on what I’ve learned so far, adopting DOTS by itself won’t necessarily solve the biggest bottleneck in Unity ML-Agents, as described in this forum thread: the communication between Unity and Python.

Which is unfortunate, because this mechanism is also a huge strength of the system. Unity is a native-code executable with modules written in C# and compiled, while deep learning frameworks like TensorFlow and PyTorch run inside an interpreted Python environment. The easiest and most cross-platform-friendly way for these two kinds of software to interact is via network ports, even though the data is merely looped back within the same computer and never sent over an actual network.
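As I understand it, the real ML-Agents link is a gRPC connection carrying protobuf messages over a local port, but the essence of the arrangement can be sketched with nothing more than Python’s standard socket module. Everything below is illustrative; none of it is actual ML-Agents code.

# Minimal sketch of loopback communication between two processes on the same
# machine. This is NOT the ML-Agents protocol; it only shows that the "network"
# traffic never leaves the computer.
import socket
import threading

ready = threading.Event()

def fake_trainer(port: int) -> None:
    """Pretend Python trainer: receive an observation, reply with an action."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", port))   # loopback address only
        server.listen(1)
        ready.set()                        # tell the "Unity" side we are listening
        conn, _ = server.accept()
        with conn:
            observation = conn.recv(1024)
            conn.sendall(b"action for " + observation)

port = 5005  # arbitrary local port for this sketch
threading.Thread(target=fake_trainer, args=(port,), daemon=True).start()
ready.wait()

# Pretend Unity environment: send an observation, wait for the action.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
    client.connect(("127.0.0.1", port))
    client.sendall(b"ball position and velocity")
    print(client.recv(1024))   # b'action for ball position and velocity'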

A documented communication protocol allowed ML-Agents components to evolve independently, as long as they all conform to the same protocol. This is why the team was able to change the default deep learning framework from TensorFlow to PyTorch between ML-Agents versions 1.0 and 2.0 without breaking backwards compatibility. (They did it in release 10, in case that’s important.) Developers who prefer TensorFlow could continue using it; nobody is forced to switch to PyTorch as long as everyone talks the same language.

Functional, capable, flexible. What’s not to love? Well, apparently “performance”. I don’t know the details of Unity ML-Agents’ bottlenecks, but I do know that “fast” for a network protocol is a snail’s pace compared to high-performance inter-process communication mechanisms such as shared memory.
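For contrast, here is what a shared-memory hand-off looks like in plain Python (3.8 or later). ML-Agents does not do this today; this is only my sketch of why such mechanisms are faster: the reader attaches to the same physical memory instead of pushing bytes through the networking stack.

# Sketch of a shared-memory hand-off, to contrast with the socket round-trip.
import numpy as np
from multiprocessing import shared_memory

# "Unity" side: write four observation floats directly into a shared block.
shm = shared_memory.SharedMemory(create=True, size=4 * 4, name="obs_demo")
writer = np.ndarray((4,), dtype=np.float32, buffer=shm.buf)
writer[:] = [0.1, -0.2, 0.3, 0.0]

# "Python trainer" side (normally a separate process): attach by name and read.
shm_reader = shared_memory.SharedMemory(name="obs_demo")
reader = np.ndarray((4,), dtype=np.float32, buffer=shm_reader.buf)
print(reader)          # [ 0.1 -0.2  0.3  0. ] -- no serialization, no socket

shm_reader.close()
shm.close()
shm.unlink()           # release the shared block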

To work around the bottleneck, the current recommendations are to manually stack things up in parallel. Starting at the individual agent level: multiple agents can train in parallel, if the environment supports it. This explains why the 3D Ball Balancing example scene has twelve agents. If the environment doesn’t support it, we can manually copy the same training environment several times in the scene. We can see this in the Crawler example scene, which has ten boxes, one for each crawler. Beyond that, we now have the capability to run multiple Unity instances in parallel.
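The mlagents-learn script exposes that last option through its --num-envs parameter, if I recall the documentation correctly. For anyone driving environments directly from Python, my understanding of the mlagents_envs low-level API (as of release 18) is that each instance gets its own worker_id so the communication ports don’t collide. A rough sketch, with a placeholder path to a compiled environment:

# Launch several copies of a compiled Unity environment in parallel. The
# executable path is hypothetical; substitute your own build.
from mlagents_envs.environment import UnityEnvironment

ENV_PATH = "builds/3DBall/3DBall"

envs = [
    UnityEnvironment(file_name=ENV_PATH, worker_id=i, no_graphics=True)
    for i in range(4)               # four Unity processes, four separate ports
]

for env in envs:
    env.reset()
    behavior_name = list(env.behavior_specs)[0]
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    print(behavior_name, "agents requesting decisions:", len(decision_steps))

for env in envs:
    env.close()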

All of these feel… suboptimal. The ML-Agents team is aware of the problem and working on solutions, but has nothing to announce yet. I look forward to seeing their solution. In the meantime, learning about DOTS has sucked up all of my time. No, not learning… I got sucked into Hardspace: Shipbreaker, a Unity game built with DOTS.

Unity DOTS = Data Oriented Technology Stack

Looking over resources for the Unity ML-Agents toolkit for reinforcement learning AI algorithms, I’ve come across multiple discussion threads about how it has difficulty scaling up to take advantage of modern multicore computers. This is not just an ML-Agents challenge; it is a Unity-wide challenge. Arguably it is even a software-industry-wide challenge. When CPUs stopped getting faster clock rates and started gaining more cores, games had trouble taking advantage of them. Historically, while a game engine is running, one CPU core runs at maximum while the remaining cores may help out with a few auxiliary tasks but mostly sit idle. This is why gamers have focused on single-core performance in CPU benchmarks.

Having multiple CPUs running in parallel isn’t new, nor are the challenges of leveraging that power in full. From established software toolkits to leading edge theoretical research papers, there are many different approaches out there. Reading these Unity forum posts, I learned that Unity is working on a big revamp under the umbrella of DOTS: Data-Oriented Technology Stack.

I came across the DOTS acronym several times without understanding what it was. But after it came up in the context of better physics simulation and a request for ML-Agents to adopt DOTS, I knew I couldn’t ignore the acronym anymore.

I don’t know a whole lot yet, but I’ve got the distinct impression that working under DOTS will require a mental shift in programming. There were multiple references to Unity developers saying it took some time for the concepts to “click”, so I expect some head-scratching ahead. Here are some resources I plan to use to get oriented:

DOTS is Unity’s implementation of Data-Oriented Design, a more general set of concepts that helps developers write code that runs well on modern machines with many cores and multiple levels of memory caches. An online eBook on Data-Oriented Design is available, which might be good to read so I can decide whether to adopt these concepts in my own programming projects outside of Unity.

And just to bring this full circle: it looks like the ML-Agents team has already started DOTS work as well. However it’s not clear to me how DOTS will (or will not) help with the current gating performance factor: Unity environment’s communication with PyTorch (formerly TensorFlow) running in a Python environment.

Miscellaneous Gems from ML-Agents Resources

I browsed through ML-Agents GitHub issues and forums looking for an explanation of why there hasn’t been a release in half a year, and I came up empty-handed on an answer. But the time was not wasted, since I found a few other scattered tidbits that might be useful in the future.

The simulation time scale used in most examples is 20, which is also the default used by the mlagents-learn script. If the script is not used, the simulation runs in real time and feels very slow. The tradeoff here is accuracy of the physics simulation, as per the comment “If you go too fast, the physics gets kind of wonky, and sometimes objects/agents will go through each other.”
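For anyone running an environment without mlagents-learn, my reading of the release 18 documentation is that the same speed-up can be requested from Python through the engine configuration side channel. A sketch, with a placeholder executable path:

# Ask the Unity engine to run at 20x speed when driving it directly from Python.
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import (
    EngineConfigurationChannel,
)

channel = EngineConfigurationChannel()
env = UnityEnvironment(
    file_name="builds/3DBall/3DBall",   # hypothetical compiled environment
    side_channels=[channel],
)
channel.set_configuration_parameters(time_scale=20.0)  # 1.0 would be real time
env.reset()
# ... step through training or inference here ...
env.close()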

The official “Hummingbird” tutorial on Unity Learn targets the LTS build of Unity with ML-Agents version 1. It looks like the goal is to update it to work with a newer release of ML-Agents, and instructions for getting a preview have been posted.

Issue #4129 is pretty old but there is a lot of detail here about why Unity ML-Agents doesn’t necessarily benefit from GPU accelerated neural network training. Since then, ML-Agents has switched from TensorFlow to PyTorch but many of the points might still apply.

But never mind the GPU, ML-Agents can’t even make full use of multicore CPUs like this person’s 16-core Threadripper. The Unity person who responded explained this is on the list of things to improve but there’s nothing to show yet. In the meantime, there are workarounds.

Also on the subject of utilizing parallel hardware, here’s a request for Unity to use Google’s Brax physics engine. Based on my experience I was certain the answer was “No” (the physics engine isn’t something that can “just” be switched around) but this question was valuable for two reasons. One, it led me to look up and learn about Google Brax, a physics engine for reinforcement learning that runs on Google TPUs and CUDA GPUs. And second, Unity actually is investing in the work towards a faster physics engine “for the DOTS platform.” Um… what’s DOTS? Time for some reading.

Browsing ML-Agents Resources: GitHub and Forums

I had taken a quick look back through the evolution of Unity’s ML-Agents from its initial prerelease announcement to the present day. The features have been good, but looking at the dates highlighted a conspicuous gap: there have been no releases or announcements in the second half of 2021. This is notable because ML-Agents had been on a rapid release schedule, with versions coming out every few weeks, ever since the original prerelease announcement. But things had come to a screeching halt. Why?

[Update: Release 19 became available on 2022/1/14, but I have yet to find information on why there was a long delay between 18 and 19.]

In the absence of official announcements, I went poking around through other publicly available information. The first resource was a brief look at the GitHub commit history for the ML-Agents repository. We’ve been warned the “main” branch can be unstable, and I thought that meant it was the active development branch. This is apparently false, as I found hints of additional working branches that are out of public view. For example, the original commit for “Deterministic actions” mentioned “release 19” even though the latest release is currently 18. Those mentions of “release 19” were later removed in merge #5637, whose description refers to the branch “release_19_docs”, which is not visible to us.

With development progress partially obscured from view, I moved on to currently open GitHub issues. When taking a first look at an unfamiliar repository, I start with the most recent items (the default view) and then sort by “Most Commented” to see the most popular items. Nothing caught my eye as a likely reason for the delay. However, it did eliminate the possibility that Unity had killed the project, because team members are still responding to issues. So that’s comforting.

The next resource was the Unity Forum for ML-Agents. Scrolling through threads between June and now, I saw multiple references to memory leaks. (“Unity ML 2.0.0 Memory leak”, “Training is more and more slow”, “Memory leak using imitation learning”) Returning to the issues list, I saw #5458 “Memory Leak” is currently open for investigation. However, there’s no comment one way or another on whether this bug (or another) is holding up release 19, so my quest came up empty-handed. I did get the idea that if I run into a memory leak with release 18, I can try downgrading to release 17 to see if that helps. Plus I came across a bunch of other interesting pieces of information.

Notes on ML-Agents Development History (Part 2: Version 1.0 to Present)

Looking back at Unity blog posts and GitHub release notes, we can see ML-Agents’ evolution during the prerelease beta phase. From the initial announcement leading up to an official version 1.0, they added many of the features promised at the start and made big architectural changes, such as how brains fit in the object hierarchy of a Unity project.

On 2020/5/12, ML-Agents reached an official version 1.0, with a package organization that is covered by version compatibility guarantees going forward. This guarantee is significant, because it means users can have better confidence their own projects will keep functioning. It also means more work for Unity, because any future large-scale architectural changes will have to be made in a compatible way.

Another change in ML-Agents development is that they’re no longer writing a Unity blog post for every release. I had thought this merely reflected slower, more deliberate development with fewer changes to announce, but looking over the release notes I still see plenty of significant changes. Given this, I suspect the reduced blog coverage reflects a change in customer communication priorities inside the Unity organization. Perhaps they’ve moved on to YouTube videos or something? If so, that would be a shame, as I prefer the written word.

In any case, ML-Agents GitHub repository release notes made it clear development continued rapidly:

  • Release 2 (2020/5/20) has minor fixes and the current “Verified” build.
  • Release 3 (2020/6/10)
  • Release 4 (2020/7/15) added parameter randomization
  • Release 5 (2020/7/31)
  • Release 6 (2020/8/17) updated version requirements: Python now 3.6.1 and NumPy now 1.19.0 in sync with TensorFlow
  • Release 7 (2020/9/21) IActuator abstract classes for generic action spaces. Initial PyTorch implementation.
  • Release 8 (2020/10/14)
  • Release 9 (2020/11/3)
  • Release 10 (2020/11/19) Match3 environment (ML-Agents play Bejeweled!) and PyTorch is now the default.
  • Release 11 (2020/12/21)
  • Release 12 (2020/12/22)

The above releases were summarized in the ML-Agents 2020 End of Year recap blog post, and development continued through 2021:

  • Release 13 (2021/2/24) TensorFlow removed. (--torch-device=cpu to tell PyTorch to use CPU for training. This will be useful later.)
  • Release 14 (2021/3/8)
  • Release 15 (2021/3/17) BufferSensor for agents to observe variable number of entities. MultiAgentGroup interface for training multiple different agents simultaneously, and MA-POCA trainer for them.
  • Release 16 (2021/4/13)
  • Release 17 (2021/4/27): Minimum Unity version raised to 2019.4. API breaking changes. Multiple behaviors via HyperNetworks.
  • Release 18 (2021/6/9): Added colab notebooks.

The version and API changes for release 17 smelled like preparation for a major new release, which was confirmed by a blog post about training complex cooperative behaviors. This is all very exciting stuff, but I noticed development activity came to a screeching halt. After years of releases every few weeks (sometimes multiple times in a single month), there hasn’t been anything in the second half of 2021. I don’t know why, but I poked around a bit to see if I could find clues.

[Update: Release 19 became available on 2022/1/14.]

Notes on ML-Agents Development History (Part 1: Up to Version 1.0)

I’ve just installed and tested basic functionality of Unity ML-Agents Release 18. And just before that, I did the same with Release 2, which is also referred to as “Verified 1.0.8”. I was surprised at the changes visible just between these two releases. This made me curious about how the package evolved, so I went looking for information from its past.

Most of these releases were announced on the Unity blog, but some only had GitHub release notes. Here is a compilation of links alongside a few highlights that caught my eye; follow the links for a complete list of changes:

2017/6/26: The earliest public information I could find was Unity announcing their intent to join in AI research and applications. Annoyingly, some of the linked blog posts have since disappeared, apparently in some sort of migration of their blog hosting system. For example the “second part of this blog series” link now leads to a 404 error.

2017/9/18: The ML-Agents Toolkit officially kicks off with version 0.1, describing a general architecture that I’m sure has since evolved and a long list of ambitious ideas they wanted to support. Many of them did come to be! Though of course not all of them, and some have since disappeared.

2017/12/8: Version 0.2 introduced curriculum learning, and launched a community challenge to motivate people to play with the toolkit.

2018/3/15: Version 0.3 introduced imitation learning, multi-brain training, and an optional poll model. Recurrent Neural Networks came in as part of a “Memory-Enhanced Agents” umbrella.

2018/6/18: Version 0.4 allowed training using the Unity editor, no longer requiring a compiled executable. An Udacity nanodegree was introduced, though sadly that’s too rich for my blood. More training environments were added; one of them (Pyramids) specifically demonstrates the “Curiosity” capability, which got its own blog post.

2018/9/11: Version 0.5 added a Gym interface and replicated a few environments from OpenAI Gym. It also expanded the capability to enable/disable discrete actions, though it’s not clear whether that was related to OpenAI Gym.

2018/12/17: Version 0.6 is an architectural revamp changing how ml-agents AI brains fit in the Unity object hierarchy. Introduced “demonstration recorder” for off-line imitation learning. Is that still around?

2019/3/1: Version 0.7 is another big infrastructure change, switching runtime neural network inference from external TensorFlowSharp to Unity’s own Inference Engine (a.k.a. Barracuda) to support more Unity runtime platforms.

2019/4/15: Version 0.8 was an infrastructure change allowing multiple Unity simulations to run in parallel on a single machine. It seems strange that this is the recommended approach for taking advantage of machines with many processing cores. (Later research found that Unity is working to improve multicore performance across the board, not just for ml-agents, with something called DOTS.)

2019/8/1: Version 0.9 (release notes) is the first of two releases focused on throughput and efficiency.

2019/9/30: Version 0.10 finished what 0.9 started. Improving sample throughput (asynchronous environments) and sample efficiency via GAIL (0.9) and SAC (0.10) algorithms.

2019/11/4: Version 0.11 (release notes) once again changed the brain’s place in the Unity object hierarchy.

2019/12/2: Version 0.12 (release notes) moved from TensorFlow 1 to 2 via the TF1 compatibility interfaces. It appears this work was never finished; ml-agents moved to PyTorch instead of completing the TF2 migration.

2020/1/8: Version 0.13 (release notes)

2020/2/28: Version 0.14 gained the ability to train via adversarial self-play. The announcement includes a short history of learning from self-play.

2020/3/6: Not a version, but this is when ml-agents got serious enough to get a course on Unity Learn (Hummingbirds) as well as an “AI for Beginners” course on Unity Learn Premium.

2020/3/18: Version 0.15 (release notes) wrapped up a lot of housekeeping in preparation for 1.0 release.

Development focus for ml-agents shifted to refinement after the 1.0 release, with a corresponding reduction in blog announcements.

Notes on Installing Unity ML-Agents (Release 18)

I’m dipping my toes into playing with deep reinforcement learning via Unity’s ML-Agents package. I made my first run with the safest, most mature option, “Verified Package 1.0.8”, which maps to Release 2 in the ML-Agents repository versioning scheme. No problems were encountered during installation, and I was able to run the 3D balancing ball project in the Getting Started guide. From there I could either explore Release 2 further or try a more adventurous release. I chose the latter and proceeded to install ML-Agents Release 18.

Doing this experiment on the same machine meant I had to keep the two installations separate. Unity Hub is already well suited to keeping distinct versions isolated so they can run in parallel (Unity 2019.4.25f1 for release 18), though there’s a potential point of conflict if the Unity editors required different versions of Visual Studio Community Edition for editing code. On the Python side, Anaconda is well suited to keeping Python environments separate. Since files are referenced by directory, though, I cloned the ml-agents GitHub repository separately for each release instead of trying to switch back and forth within the same directory.

I very much appreciated Unity’s project documentation, as this installation and Getting Started process went just as smoothly. I didn’t expect to notice much difference between releases 2 and 18, but even just in installation and Getting Started I saw they’ve made changes. The biggest one that caught my eye is that ml-agents switched from TensorFlow to PyTorch somewhere in between. There are other smaller changes; the most welcome one to me is a much more comprehensive collection of example configurations in the release_18 /config/ subdirectory. Release 2 had only a handful of files, while release 18 has a far larger directory tree to give people (like me) more than one starting point.

I’m not quite sure where to go from here, but given how well documented ml-agents appears to be, I thought it would be interesting to take a quick look back to see where they’ve been.

Notes on Installing Unity ML-Agents (Release 2)

I thought it would be fun to play with reinforcement learning via Unity ML-Agents. The official product landing page sends us to the ml-agents repository on GitHub. As with every other repository, it’s always a good idea to look over the README.md to understand the branch organization, especially before we start cloning anything.

And indeed, the README includes a handy chart of releases. As of this writing there are eight releases listed plus main which is labeled as unstable. I’m glad I didn’t blindly clone main! Of the eight stable releases, six are named “Release 13” to “Release 18” inclusive. The final two are named “Verified Package 1.0.7” and “Verified Package 1.0.8”.

The “Verified” label is a guarantee of safety in the lifecycle of Unity packages. Therefore the most recent release with the highest guarantee of functionality is “Verified Package 1.0.8”. In Unity’s world, these verified packages are good enough for commercial production use. If our needs aren’t quite that rigorous, we can use the builds labeled “Release”. These version numbers are explained on the ML-Agents versioning page, and the Release builds are something we can play with if we aren’t shouldering the weight of commercial Unity production.

I think I’m fine to play with the more recent “Release” builds, but I wanted to start with the build carrying the strongest guarantees to make sure I can at least get that working. That meant cloning the build labeled “Verified Package 1.0.8”, which maps to “Release 2”.

In order to open the Unity project that is part of this release, I wanted the version of Unity that exactly matched the version number in ProjectVersion.txt: 2018.4.17f1. When I tried to install Unity 2018 in Unity Hub, it offered me 2018.4.36f1, because that was the most recent supported version. To match versions exactly, I had to click the download archive link and look for 2018.4.17 under 2018 builds. (It was released February 11th, 2020.) Once found, I could click the “Unity Hub” button to prompt Unity Hub to install that build on my machine.

While Unity installed, I cloned the repository at the release_2 tag and installed the corresponding Python packages. I encountered no problems following the installation directions for this release, though there were slight modifications as I used Anaconda Individual to manage my Python virtual environments. I had the option of installing the locally cloned versions of the ml-agents-envs and ml-agents packages, and I did so. I noticed that the installation used TensorFlow in CPU-only mode, but running without GPU acceleration is perfectly fine for a starting point.

Once Unity Editor 2018.4.17 was installed, I used it to open the Project directory of my cloned release_2 repository. It opened without errors. I proceeded to the Getting Started Guide for this release and verified I had basic functionality, both running the pretrained 3D Balance Ball model and training a model of my own. The training was pretty quick: it took just under 8 minutes on the Core i5-7300HQ CPU of my Dell 7577 laptop plugged into an AC power adapter.

Encouraged by this success, I proceeded to try Release 18 as well.

Switching Back to Unity ML-Agents

It was quite enlightening for me to read Deep Reinforcement Learning Doesn’t Work Yet. And to be honest, a little depressing as well. I was vaguely aware of the challenges involved but only in a general sense. Just small tidbits here and there over the past few years, as I looked at this field with interest. Now that I finally got around to looking at reinforcement learning in more detail, I realized that it was overly optimistic of me to expect all major problems to have been solved by now.

My original motivation for getting into reinforcement learning was to make my Sawppy an autonomous rover. Based on what I’ve learned so far, my original hope for Sawppy intelligence via reinforcement learning is extremely ambitious and still quite far away. If I want to do some deep RL projects more likely to succeed in the near term, I probably shouldn’t put them on a real physical rover. In all likelihood, whatever can be accomplished on a real robot using deep reinforcement learning could be done faster and more easily with some other AI technique.

It would certainly be nice if some aspect of Sawppy intelligence will eventually result in a research project that can contribute to the state of the art. But I’m not so arrogant as to assume I can accomplish that feat and certainly not as my first project in reinforcement learning. I’ll aim for something simple as my starting point. Got to crawl before I can walk, and all that.

Transferring reinforcement learning from a simulator to work in the real world is still a lot to tackle. So I’m going to look at a simulated world and stay within that simulated world while I learn the ropes. And before I can realistically think about contributing to algorithm advancements, I should get familiar with applying existing implementations of reinforcement learning. All of these new priorities turned my attention back to the game world of Unity ML-Agents.

Notes on “Deep Reinforcement Learning Doesn’t Work Yet”

OpenAI’s guide Spinning Up in Deep RL has been very educational for me to read, even though I only understood a fraction of the information on my first pass through and I hardly understood the code examples at all. But the true riches of this guide are in the links, so I faithfully followed the first link on the first page (Spinning Up as a Deep RL Researcher) of the Resources section and got a big wet towel dampening my enthusiasm.

Spinning Up‘s link text is “it’s hard and it doesn’t always work” and it led to Alex Irpan’s blog post titled Deep Reinforcement Learning Doesn’t Work Yet. The blog post covered all the RL pitfalls I’ve already learned about, either from Spinning Up or elsewhere, and added many more I hadn’t known about. And boy, it paints a huge gap between the promise of reinforcement learning and what has actually been accomplished as of its publishing date almost four years ago in February 2018.

The whole thing was a great read. At the end, my first question was: has the situation materially changed since its publication in February 2018? As a beginner I have yet to learn of the sources that would help me confirm or refute this post, so I started with the resource right at hand: the blog this item was hosted on. Fortunately there weren’t too many posts, so I could skim the content of the past four years within a few hours. The author seems to still be involved in the field of reinforcement learning and has critiqued some notable papers during this time, but none seemed particularly earth-shattering.

In the “Doesn’t Work Yet” blog post, the author made references to ImageNet. (Example quote: Perception has gotten a lot better, but deep RL has yet to have its “ImageNet for control” moment.) I believe this refers to the ImageNet 2012 Challenge. Historically, the top performers in this competition were separated by very narrow margins, but in 2012 AlexNet won by a margin of more than 10% using a GPU-trained convolutional neural network. This was one of the major events (sometimes credited as THE event) that kicked off the current wave of deep learning.

So for robotics control systems, reinforcement learning has yet to see that kind of breakthrough. There have been many notable advancements; OpenAI themselves hyped a robot hand that could manipulate a Rubik’s Cube. (Something Alex Irpan has also written about.) But looking under the covers, they’ve all had too many asterisks for the results to make a significant impact on real-world applications. Researchers are making headway, but it’s been a long tough slog of incremental advances.

I appreciate OpenAI Spinning Up linking to the Doesn’t Work Yet blog post. Despite all the promise and all the recent advances, there’s still a long way to go. People new to the field need to have realistic expectations and maybe make adjustments to their plans.

Notes on Reinforcement Learning Algorithm Implementations Published by OpenAI

OpenAI’s Spinning Up in Deep RL guide has copious written resources, but it also offers resources in the form of algorithm implementations in code. Each implementation is accompanied by a background section talking about how the algorithm works, summarized by a “Quick Facts” section that serves as a high-level view of how these algorithms differ from each other. As of this writing, there are six implementations. I understand the main division is that there are three “on-policy” algorithms and three “off-policy” algorithms. Within each division, the three algorithms are roughly sorted by age and illustrate an arc of research in the field. The best(?) on-policy algorithm here is Proximal Policy Optimization (PPO), and representing the off-policy vanguard is Soft Actor-Critic (SAC).
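Based on the Spinning Up documentation, kicking off one of these reference implementations takes only a few lines. Below is my sketch of launching the PyTorch PPO implementation against a simple Gym environment; the hyperparameter values and output directory are arbitrary choices of mine, just to show the shape of the call.

# Run Spinning Up's PyTorch PPO implementation on CartPole.
import gym
from spinup import ppo_pytorch as ppo

env_fn = lambda: gym.make("CartPole-v1")

ppo(
    env_fn=env_fn,
    ac_kwargs=dict(hidden_sizes=(64, 64)),   # small actor-critic network
    steps_per_epoch=4000,
    epochs=50,
    logger_kwargs=dict(output_dir="runs/ppo_cartpole", exp_name="ppo_cartpole"),
)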

These implementations are geared towards teaching these algorithms. Since priority is placed on this learning context, these implementations are missing enhancements and optimizations that would make the code harder to understand. For example, many of these implementations do not support parallelization.

This is great for its intended purpose of supplementing Spinning Up, but some people wanted an already optimized implementation. For this audience, OpenAI published the OpenAI baselines a few years ago to serve as high water marks for algorithm implementation. These baselines can serve either as starting point for other projects, or benchmarks to measure potential improvements against.

However, it appears these baselines have gone a little stale. This isn’t any fault of the implementations themselves, but merely due to the rapidly moving nature of this research field. The repository hosts implementations using TensorFlow 1, which has been deprecated. There’s a (partial? full?) conversion to TensorFlow 2 in a separate branch, but that never merged back to the main branch for whatever reason. As of this writing there’s an open issue asking for PyTorch implementations, prompted by OpenAI’s own proclamation that they will be standardizing on PyTorch. (This proclamation actually led to the PyTorch conversion of the Spinning Up in Deep RL examples.) However, there’s no word yet on PyTorch conversions for OpenAI baselines, so anyone who wants implementations in PyTorch will either have to optimize the Spinning Up implementations themselves or look elsewhere.

Of course, all of this assumes reinforcement learning will actually solve the problem, which I’ve learned might not be a good assumption.

Notes on Deep Reinforcement Learning Resources by OpenAI

I’m going through the Spinning Up in Deep RL (Reinforcement Learning) guide published by OpenAI, and I found the Introduction to RL section quite enlightening. While it was just a bit too advanced for me to understand everything, I understood enough to make sense of many other things I’ve come across recently while browsing the publicly available resources in this field. Following that introductory section was a Resources section with the following pages:

Spinning Up as a Deep RL Researcher

This section is for a target audience that wishes to contribute to the frontier of research in reinforcement learning. It covers building the right background, learning by doing, and being rigorous in research projects. This section is filled with information that feels like it was distilled from the author’s experience. The most interesting part for me is the observation that “broken RL code almost always fails silently”, meaning there are no error messages, just a failure of the agent to learn from its experience. The worst part when this happens is that it’s hard to tell the difference between a flawed algorithm and a flawed implementation of a good algorithm.

Key Papers in Deep RL

This was the completely expected directory of papers, roughly organized by the taxonomy used for the introduction in this guide. I believe the author assembled this list to form a solid foundation for further exploration. It appears that most (and possibly all) of them are freely available, a refreshing change from the paywalls that usually block curious minds.

Exercises

Oh no, I didn’t know there would be homework in this class! Yet here we are. A few problems for readers to attempt implementing using their choice of TensorFlow or PyTorch, along with solutions to check answers against. The first set covers basic implementation; the second set covers some algorithm failure modes. This reinforces what was covered earlier: broken RL code fails silently, so an aspiring practitioner must recognize the symptoms of failure modes.

Benchmarks for Spinning Up Implementations

The Spinning Up in Deep RL guide is accompanied by implementations of several representative algorithms. Most of them have two implementations: one in TensorFlow and one in PyTorch. But when I run them on my own computer, how would I know if they’re running correctly? This page is a valuable resource to check against. It has charts of these algorithms’ performance in five MuJoCo Gym environments, under conditions also described on the page. And once I have confidence they are running correctly, these also serve as benchmarks to see if I can improve their performance, because the implementations are written for beginners to read and understand, which means they skip some of the esoteric, hard-to-understand optimizations that readers are challenged to add themselves.

For people who want optimized implementations of these algorithms, OpenAI has them covered too. Or at least, they used to.

Notes on “Introduction to RL” by OpenAI

With some confidence that I could practice simple reinforcement learning algorithms, I moved on to what I consider the meat of OpenAI’s Spinning Up in Deep RL guide: the section titled Introduction to RL. I thought it was pretty good stuff.

Part 1: Key Concepts in RL

Since I had been marginally interested in the field, the first section, What Can RL Do?, was mostly review for me. More than anything else, it confirmed that I was reading the right page. The guide proceeded to Key Concepts and Terminology, which quickly caught up with and then surpassed what knowledge I already had. I was fine with the verbal descriptions and the code snippets, but once the author started phrasing things in terms of equations it became more abstract than I could easily follow. I’m a code person, not an equations person. Given my difficulty comprehending the formalized equations, I was a little bemused by the third section, (Optional) Formalism, as things were already more formal than my usual style.

Part 2: Kinds of RL Algorithms

Given that I didn’t comprehend everything in part 1, I was a little scared of what I would find in part 2. Gold. I found pure gold. Here I found a quick survey of the major characteristics of various reinforcement learning approaches, explaining those terms in ways I (mostly) understood: explanations of what “Model-Free” vs. “Model-Based” means, and what “on-policy” vs. “off-policy” means. I had seen a lot of these terms thrown around in things I read before, but I never found a place laying them out against one another until this section. This section alone was worth the effort of digging into the guide and will help me understand other things. For example, a lot of these terms appear in the documentation for Unity ML-Agents. I didn’t understand what they meant at the time, but now I hope to understand more when I review them.

Part 3: Intro to Policy Optimization

Cheered by how informative I found part 2, I proceeded to part 3 and was promptly brought back down to size by the equations. The good thing is that this section has both equations and an example code implementation, and they are explained more or less in parallel. Someone like myself can read the code and try to understand the equations. I expect there are people out there who are the reverse, more comfortable with the equations and appreciating what they look like in code. That said, I can’t claim I completely understand the code, either. Some of the mathematical complexities represented in the equations are not in the sample source code: they are handed off to implementations in the PyTorch library.
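As an example of that hand-off, here is my paraphrase of the core of the “simplest policy gradient” from this section (not a copy of the guide’s code): PyTorch’s distribution classes do the log-probability bookkeeping that the equations describe, so the loss itself ends up being just a few lines.

# Core loss of a simple policy gradient for discrete actions.
import torch
from torch.distributions import Categorical

def policy_gradient_loss(logits: torch.Tensor,
                         actions: torch.Tensor,
                         weights: torch.Tensor) -> torch.Tensor:
    # logits: policy network output for a batch of observations
    # actions: actions that were actually taken
    # weights: e.g. reward-to-go for each step
    dist = Categorical(logits=logits)      # pi(a|s) for discrete action spaces
    log_prob = dist.log_prob(actions)      # log pi(a_t|s_t)
    return -(log_prob * weights).mean()    # minimize negative = gradient ascent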

I foresee a lot of studying PyTorch documentation before I can get proficient at writing something new on my own. But just from reading this Introduction to RL section, I have the basic understanding to navigate PyTorch API documentation and maybe the skill to apply existing implementations written by somebody else. And as expected of an introduction, there are tons of links to more resources.

Old PyTorch Without GPU Is Enough To Start

I’ve mostly successfully followed the installation instructions for OpenAI’s Spinning Up in Deep RL. I am optimistic that this particular Anaconda environment (in tandem with the OpenAI guide) will be enough to get me off the ground. However, I don’t expect it to be enough for doing anything extensive, because when I checked the installation, I saw it pulled down PyTorch 1.3.1, which is now fairly old. As of this writing, PyTorch LTS is 1.8.2 and Stable is 1.10. On top of that, this old PyTorch runs without CUDA GPU acceleration.

>>> print(torch.__version__)
1.3.1
>>> print(torch.cuda.is_available())
False

NVIDIA’s CUDA API has been credited with making the current boom in deep learning possible, because CUDA opened up GPU hardware for usage other than its original gaming intent. Such massively parallel computing hardware made previously impractical algorithms practical. However, such power does incur overhead. For one thing, data has to be copied from a computer’s main memory to memory modules on board the GPU before it can be processed. And then the results have to be copied back. For smaller problems, the cost of such overhead can swamp the benefit of GPU acceleration.
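A rough way to see this overhead for yourself (assuming a CUDA-capable build of PyTorch is installed): time a tiny matrix multiply on the CPU against the same work routed through GPU memory and back. The exact numbers will vary by machine; this sketch only shows the shape of the experiment.

# Compare a small matrix multiply on CPU versus GPU including the copies.
import time
import torch

x = torch.randn(64, 64)             # deliberately tiny: overhead-dominated

start = time.perf_counter()
y_cpu = x @ x                       # compute entirely in main memory
cpu_time = time.perf_counter() - start

if torch.cuda.is_available():
    start = time.perf_counter()
    x_gpu = x.to("cuda")            # copy main memory -> GPU memory
    y_gpu = x_gpu @ x_gpu           # compute on the GPU
    y_back = y_gpu.to("cpu")        # copy the result back
    torch.cuda.synchronize()        # make sure the GPU work has finished
    gpu_time = time.perf_counter() - start
    print(f"CPU: {cpu_time:.6f}s   GPU with copies: {gpu_time:.6f}s")
else:
    print(f"CPU: {cpu_time:.6f}s   (no CUDA device available)")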

I believe I ran into this when working through Codecademy’s TensorFlow exercises. I initially set up CPU-only TensorFlow on my computer to get started and, once I had that initial experience, I installed TensorFlow with GPU support. I was a little surprised to see that small teaching examples from Codecademy took more time overall on the GPU-accelerated installation than on the CPU-only installation. One example took two minutes to train in CPU-only mode, but two and a half minutes with GPU overhead. By all reports, the GPU will become quite important once I start tackling larger neural networks, but I’m not going to sweat the complexity until then.

So for Spinning Up in Deep RL, my first goal is to get some experience running these algorithms in a teaching examples context. If I am successful through that phase (which is by no means guaranteed) and start going beyond small examples, then I’ll worry about setting up another Anaconda environment with a modern version of PyTorch with GPU support.

Today I Learned: MuJoCo Is Now Free To Use

I’ve contemplated going through OpenAI’s guide Spinning Up in Deep RL. It’s one of many resources OpenAI has made available, and it builds upon the OpenAI Gym system of environments for training deep reinforcement learning agents. These range from very simple text-based environments, to 2D Atari games, to full 3D environments built with MuJoCo, whose documentation explains that the name is shorthand for the type of interaction it simulates: “Multi-Joint dynamics with Contact”.

I’ve seen MuJoCo mentioned in various research contexts, and I’ve inferred it is a more accurate physics simulation than what we would find in, say, a game engine like Unity. No simulation engine is perfect; they each make different tradeoffs, and it sounds like AI researchers (or at least those at OpenAI) believe MuJoCo to be the best one to use for training deep reinforcement learning agents with the best chance of being applicable to the real world.

The problem is that, when I looked at OpenAI Gym the first time, MuJoCo was expensive. This time around, I visited the MuJoCo page hoping they had launched a more affordable licensing tier, and there I got the news: sometime in the past two years (I didn’t see a date stamp) DeepMind acquired MuJoCo and intends to release it as free open-source software.

DeepMind was itself acquired by Google and, when that collection of companies was reorganized, it became one of several companies under the parent company Alphabet. At a practical level, this meant DeepMind had indirect access to Google money for buying things like MuJoCo. There’s lots of flowery wordsmithing about how opening up MuJoCo will advance research; what I care about is the fact that everyone (including myself) can now use MuJoCo without worrying about the licensing fees it previously required. This is a great thing.

At the moment MuJoCo is only available as compiled binaries, which is fine by me. Eventually it is promised to be fully open-sourced at a GitHub repository set up for the purpose. The README of that repository makes one thing very clear:

This is not an officially supported Google product.

I interpret this to mean I’ll be on my own to figure things out without Google technical support. Is that a bad thing? I won’t know until I dive in and find out.

Window Shopping: OpenAI Spinning Up in Deep Reinforcement Learning

I was very encouraged after my second look at Unity ML-Agents. When I first looked at it a few years ago, it motivated me to look at the OpenAI Gym for training AI agents via reinforcement learning. But once I finished the “Hello World” examples, I got lost trying to get further and quickly became distracted by other more fruitful project ideas. With this past experience in mind, I set out to find something more instructive for a beginner and found OpenAI’s guide “Spinning Up in Deep RL“. It opened with the following text:

Welcome to Spinning Up in Deep RL! This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL).

This sounds exactly like what I needed. Jackpot! At first I thought this was something new since my last look, but the timeline at the bottom of the page indicated this was already available when I last looked at reinforcement learning resources on OpenAI. I had missed it! I regret the lost opportunity, but at least I’ve found it this time.

The problem with finding such a resource a few years after publication is that it may already be out of date. The field of deep learning moves so fast! I’m pretty sure the fundamentals will still be applicable, but the state of the art has certainly moved on. I’m also worried about the example code that goes with this resource, which looks stale at first glance. For example, it launched with examples that used the now-deprecated TensorFlow 1 API (instead of the current TensorFlow 2 API). I don’t care to learn TF1 just for the sake of this course, but fortunately in January 2020 they added alternative examples implemented in PyTorch. If I’m lucky, PyTorch hasn’t made a major breaking change since then and I can still use those examples.

In addition to the PyTorch examples, there’s another upside to finding this resource now instead of earlier. For 3D environment simulations, OpenAI uses MuJoCo. When I looked at OpenAI Gym earlier, running the 3D environments required a MuJoCo license that cost $500/year, and I couldn’t justify that money for playing around. But good news! MuJoCo is now free to use.

Unity Machine Learning Agents Almost Within My Reach

While poking around Google’s Machine Learning Crash Course, I found that they have released a TensorFlow library for building agents with deep reinforcement learning. This might be fun but I don’t know enough about the field to make use of that library yet. It also reminded me to take another look at game engine Unity 3D’s development in this area. A lot has happened!

I first took a quick glance at Unity ML-Agents more than two years ago. At the time, the project was still an experimental thing for Unity and a lot was still in flux. Since I didn’t know much about working in Unity or about reinforcement learning, that was too many variables in flux for my taste. A year later, Unity ML-Agents reached an official version 1.0, though it was still technically a preview technology. Not long after that, it became a “verified” package for use with the Unity 2020.3 LTS build, signifying a mature tool. As part of becoming a verified package for use with Unity LTS, ML-Agents got some nice things like an official Unity technology landing page, and a few pieces of curriculum have been posted to Unity Learn to help people get started.

The primary focus of Unity ML-Agents is creating agents in the virtual world of a Unity game, not necessarily real-world robots, which is where my interests lie. This is an important caveat because the Unity physics engine is not an accurate representation of the real world, and reinforcement learning agents are notorious for exploiting flaws in simulation engines to do “impossible” things. But that’s no reason to give up on Unity, which can still be a useful tool for robotics research. These caveats are just some of many tradeoffs to keep in mind.

During the time Unity evolved their ML-Agents library, I’ve occasionally dabbled in Unity with projects like Bouncy Bouncy Lights. I’m not bold enough to call myself a Unity developer yet, but I’m no longer as completely overwhelmed by the Unity editor user interface as I once was. I haven’t done much more in Unity because I haven’t felt particularly motivated to make games. But ML-Agents? That looks like pretty good motivation for me to put serious effort into understanding reinforcement learning.

An Unsuccessful First Attempt Applying Q-Learning to CartPole Environment

One of the objectives of OpenAI Gym is to have a common programming interface across all of its different environments. And it certainly looks pretty good at the surface: we reset() the environment, take actions to step() through it, and at some point we get True as a return value for the done flag. Having a common interface allows us to use the same algorithm across multiple environments with minimal modification.

But “minimal” modification is not “zero” modification. Some environments are close enough that no modifications are required, but not all of them. Sometimes an environment is just not the right fit for an algorithm, and sometimes there are important details which differ from one environment to another.

One way environments differ is in their types of spaces. An environment has two: an observation_space that describes the observed state of the environment, and an action_space that outlines the valid actions an agent may choose to take. They change from one environment to another because environments tend to have different observable properties and different actions an agent can take within them.
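A quick way to see this difference is to print the spaces of the two environments involved here. The exact string representations depend on the installed Gym version, but the contrast is the point: Taxi’s observation is one discrete number, while CartPole’s is a box of four continuous numbers.

# Print the observation and action spaces of two Gym environments.
import gym

for name in ("Taxi-v3", "CartPole-v1"):
    env = gym.make(name)
    print(name, env.observation_space, env.action_space)
    env.close()

# Taxi: Discrete(500) observations, Discrete(6) actions.
# CartPole: a 4-element Box of floats for observations, Discrete(2) actions.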

As an exercise, I thought I’d take the simple Q-learning algorithm demonstrated to solve the Taxi environment and slam it on top of CartPole just to see what happens. To do that, I had to take CartPole‘s state, which is an array of four floating-point numbers, and convert it into an integer suitable for use as an array index.

As a naive approach, I’ll slice the space into discrete chunks. Each of the four numbers will be divided into ten bins. Each bin corresponds to a single digit from zero to nine, so the four numbers can be composed into a four-digit integer value.

To determine the size of these bins, I executed 1000 episodes of the CartPole simulation while taking random actions via action_space.sample(). The ten bins are evenly divided between the maximum and minimum values observed in this sample run, and Q-learning is off and running… doing nothing useful.
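For reference, a minimal sketch of that discretization scheme as I described it above. The variable names are mine for illustration, not from my original notebook, and it assumes the classic Gym step API that returns four values.

# Naive discretization: ten bins per state variable, composed into one index.
import numpy as np
import gym

env = gym.make("CartPole-v1")

# Collect observed ranges from random play.
samples = []
for _ in range(1000):
    obs = env.reset()
    done = False
    while not done:
        obs, reward, done, info = env.step(env.action_space.sample())
        samples.append(obs)
samples = np.array(samples)
lows, highs = samples.min(axis=0), samples.max(axis=0)

def discretize(obs):
    """Map four floats to a single integer 0..9999."""
    index = 0
    for value, low, high in zip(obs, lows, highs):
        digit = int(np.clip((value - low) / (high - low) * 10, 0, 9))
        index = index * 10 + digit
    return index

q_table = np.zeros((10_000, env.action_space.n))   # 10,000 states x 2 actions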

As shown in the plot above, the episode reward is always 8, 9, 10, or 11. We never got above or below that range. Also, out of 10,000 possible states, only about 50 were ever visited.

So this first naive attempt didn’t work, but it was a fun experiment. Now the more challenging part: figuring out where it went wrong, and how to fix it.

Code written in this exercise is available here.

 

Taking First Step Into Reinforcement Learning with OpenAI Gym

The best part about learning a new technology today is the fact that, once armed with a few key terms, a web search can unlock endless resources online. Some of them are even free! Such was the case after I looked over OpenAI Gym on its own: I searched for an introductory reinforcement learning project online and found several to choose from. I started with this page, which uses the “Taxi” environment of OpenAI Gym and, within a few lines of Python code, implements a basic Q-learning agent that can complete the task within 1000 episodes.

I had previously read the Wikipedia page on Q-learning, but a description suitable for an encyclopedia entry is not always straightforward to put into code. For example, Wikipedia describes the learning rate as a value from 0 to 1 and explains what it means at the extremes of 0 or 1, but it doesn’t give any guidance on what kind of values are useful in real-world examples. The tutorial used 0.618, and while there isn’t enough information on why that value was chosen, it served as a good enough starting point. For this and related reasons, it was good to have a simple implementation to start from.
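For context, here is a compact sketch of that kind of tabular Q-learning loop on Taxi, as I understand it. The 0.618 learning rate comes from the tutorial; the rest (including the discount factor and the lack of an explicit exploration strategy) is my own simplification rather than a faithful copy.

# Tabular Q-learning on the Taxi environment.
import numpy as np
import gym

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))  # 500 x 6

alpha = 0.618    # learning rate from the tutorial
gamma = 0.9      # discount factor: my own choice for this sketch

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        action = np.argmax(q_table[state])             # pick best known action
        next_state, reward, done, info = env.step(action)
        # Move Q(s,a) toward reward + gamma * max_a' Q(s',a')
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state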

After I got it running, it was time to start poking around to learn more. The first question was how fast the algorithm learned to solve the problem, and for that I wanted to plot the cumulative evaluation reward against iterations. This was trivial with the help of PyPlot, and I obtained the graph at the top of this post. We can see a lot of learning progress within the first 100 episodes. There’s a mysterious degradation in capability around the 175th episode, but the system mostly recovered by 200. After that, there were diminishing returns until about 400, and the agent made no significant improvements after that point.

This simple algorithm used an array that could represent all 500 states of the environment. With six possible actions, it was an array of 3000 entries initially filled with zeros. I was curious how long it took for the entire problem space to be explored, and the answer seems to be roughly 50 episodes before there were 2400 nonzero entries, a count that never grew any further. This was far faster than I had expected for exploring 2400 states, and it was also a surprise that 600 entries in the array were never used.

What did those 600 entries represent? With six possible actions, it implies there are 100 unreachable states of the environment. I thought I’d throw that array into PyPlot and see if anything jumped out at me:

[Plot: Taxi Q table plotted raw]

My mind is at a loss as to how to interpret this data. But I don’t know how important it is to understand right now. This is an environment whose entire problem space can be represented in memory using discrete values, and those are luxuries that quickly disappear as problems get more complex. The real world is not so easily classified into discrete states, and we haven’t even involved neural networks yet. That combination is referred to as DQN (Deep Q-Network), and it is still yet to come.

The code I wrote for this exercise is available here.