Python has a built-in Logo Turtle, and it’s great for reinforcement learning

Tom Grek
Towards Data Science
7 min read · Oct 15, 2018


A little while ago I was teaching a Berkeley class on data analytics, and one of the exercises had students go through the Python stdlib to find interesting modules. I went through the docs too and was delighted to find that Python has a turtle! Do you remember?

FORWARD 10
LEFT 90

My high school in the 90s was not rich enough to have a robotic turtle with a pen built in (tiny violins start playing) but I remember being entranced by the on-screen movements of the magical creature, which I think at the time was running on an Acorn Archimedes, an early product from today’s chip giant ARM. I never needed a Spirograph, as the turtle amply took its place in the production of mathematical mandalas.

Thanks SchoolCoders (https://www.flickr.com/photos/schoolcoders/)

I’ve written about reinforcement learning before here and here: it’s a topic I’m particularly interested in, as it offers a path toward AGI (the singularity) while also exposing the current limitations of deep learning and AI.

Framing The Problem And Python’s Turtle

I wanted to create an agent that taught itself to keep moving about on the screen as long as possible: avoiding the edges of the screen, and also avoiding bumping into itself (in the style of the old Snake game).

The turtle agent taking some mostly random actions early in training

Intuitively, this should lead to the turtle coiling around itself.

I quickly found a sad problem with Python’s built-in turtle: it draws to a Tkinter canvas (just think of it as ‘some object’ if you haven’t done GUI development in Python before — like me!) and there’s no way to read the pixels back from that canvas.

So, I implemented my own turtle library, which wraps the built-in one (still using it to draw to the screen) but also keeps a NumPy array of the canvas (plus a specified number of previous frames, which is useful for RL). It provides a few other functions convenient for RL; you can grab the library from my GitHub here. The final code I used for this project is in the examples folder of that repo.
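
To give a flavour (the real code is in the repo linked above), a wrapper might look roughly like this. The class and method names are an illustrative sketch, not the library’s actual API:

import numpy as np
import turtle


class ArrayTurtle:
    """Illustrative sketch: drives the stdlib turtle while mirroring the
    drawing into a NumPy array, keeping a short history of frames for RL."""

    def __init__(self, width=200, height=200, history=3):
        self.width, self.height = width, height
        self.t = turtle.Turtle()
        # frames[0] is the current canvas, frames[1:] are earlier snapshots
        self.frames = np.zeros((history, height, width), dtype=np.uint8)

    def _mark(self):
        # map turtle coordinates (origin at the centre) onto array indices;
        # only the endpoint of each move is marked, to keep the sketch short
        x, y = self.t.position()
        row = self.height // 2 - int(y)
        col = int(x) + self.width // 2
        if 0 <= row < self.height and 0 <= col < self.width:
            self.frames[0, row, col] = 1

    def forward(self, distance):
        # push the frame history back one slot, then draw on the current frame
        self.frames = np.concatenate([self.frames[:1], self.frames[:-1]])
        self.t.forward(distance)
        self._mark()

    def left(self, angle):
        self.t.left(angle)

    def right(self, angle):
        self.t.right(angle)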

Here’s the env.step() function I built. You can see that there are 3 possible actions, two end-game situations (turtle went off screen or back over its tail), a cumulative reward given at the end of an episode (moving forward yields higher rewards), and an immediate reward for the action (each action gets 1 reward).

My ‘env.step’ function follows OpenAI’s (state, reward, done, msg) model. Going forward yields a higher reward than spinning around in place. ‘UserConditionException’ is when the turtle attempts to go back over its trail — it dies.
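
The exact code lives in the repo; here is a rough sketch of the shape of such a step function. The action mapping, the reward numbers, and the OutOfBoundsException name are placeholders I’ve invented for illustration; only UserConditionException comes from the real code:

def step(self, action):
    """Illustrative step(): 0 = forward, 1 = left, 2 = right (placeholder mapping)."""
    reward, done = 1.0, False              # every action earns an immediate reward of 1
    try:
        if action == 0:
            self.turtle.forward(5)
            self.episode_reward += 2.0     # moving forward counts for more at episode end
        elif action == 1:
            self.turtle.left(30)
            self.episode_reward += 1.0
        else:
            self.turtle.right(30)
            self.episode_reward += 1.0
    except (OutOfBoundsException, UserConditionException):
        # went off screen, or crossed its own trail: the episode ends and the
        # cumulative reward is handed back
        done = True
        reward = self.episode_reward
    return self.turtle.frames.copy(), reward, done, {}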

Reinforcement Learning Principles

The earlier articles provide a lot more background, but the principle is really rather simple:

  • An agent makes an action in an environment, and receives a reward.
  • The reward is made out of two parts: an immediate, aka dense, reward (“the agent ate an apple”), and a deferred, aka sparse, reward (“the agent ate 10 apples and won the game”).
  • The agent’s critic part compares its expectation of a reward (given the state of the environment and the action it took; those expectations start out random) with the actual reward it received. The closer they align, the better: this means it’s usable as a loss signal.
  • The deferred reward (multiplied by some discount factor for each step between now and the end of the game) is added to the earlier rewards, so that at each timestep the agent has an idea of whether its action ultimately led to a good reward or not (this calculation is sketched just after the list).
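
In code, that last point is the usual discounted-return calculation, walking backwards through an episode’s immediate rewards (a minimal sketch):

def discounted_returns(rewards, gamma=0.99):
    """Credit each timestep with everything that followed it, discounted by gamma."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# e.g. three small immediate rewards, then a big end-of-game reward:
# the early timesteps inherit most of that final reward via the discounting
discounted_returns([1, 1, 1, 10], gamma=0.99)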

This is actor-critic RL, because there are essentially two agents: an actor, and a critic. We implement them with a single neural network that has two heads.

The actor head has one output per possible action, and the chosen action comes from the softmax of those outputs. (The activation of each output neuron is interpreted as a probability: softmax normalizes the outputs so that, with 3 possible actions, the three probabilities sum to 1, which makes it easy to pick the most likely action or to sample from the distribution.)

Not only is the critic head learning to get better at estimating rewards, but the action head is learning to get better at actions, too:

  • Added to the loss signal is a term that increases when unlikely actions yield high rewards, and decreases as the certainty of an action increases. In general, actions yielding rewards should become more certain over time.

Because the action loss and the value loss are lumped together into a single ‘loss’ (and who can say whether they are pulling in the same direction?), and because the actor and critic share most of the same network, it clearly takes a lot of data (iterations, epochs, episodes, playthroughs) to reach convergence.
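
Concretely, that combined loss looks something like the standard advantage actor-critic formulation below (a sketch assuming PyTorch; the coefficients and the entropy bonus are common defaults, not necessarily the exact ones in my code):

import torch

def a2c_loss(log_probs, values, returns, entropies,
             value_coef=0.5, entropy_coef=0.01):
    """log_probs, values, entropies: tensors collected over one episode;
    returns: the discounted returns for the same timesteps."""
    advantages = returns - values.detach()           # how much better than the critic expected?
    policy_loss = -(log_probs * advantages).mean()   # unlikely actions with high reward push hardest
    value_loss = (returns - values).pow(2).mean()    # the critic learns to predict the return
    entropy_bonus = entropies.mean()                 # discourages premature certainty
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus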

The last important factor is to balance exploration against exploitation, that is to say, to avoid getting stuck in local minima. There are a few ways to do this, and the agent I made combines several of them:

  • Epsilon-greedy: sometimes choose a random action, with a probability epsilon that decreases over time (when that probability decays exponentially we say it is ‘annealed’ rather than decreased, in case you were wondering); this and the sampling trick two bullets down are sketched in code after the list.
  • Perturb the reward signals coming from the environment by a random amount
  • It’s very unlikely that any of the neural net’s outputs will get to 1.0 probability. Instead of just choosing the action corresponding to the max output each time, sample from its probability distribution.
  • Dropout in the neural network adds inherent randomness.
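
Epsilon-greedy and sampling from the policy combine naturally; a sketch (assuming PyTorch, with hypothetical names):

import torch
from torch.distributions import Categorical

def select_action(policy_logits, epsilon):
    """Pick an action using two of the exploration tricks above."""
    if torch.rand(1).item() < epsilon:
        # epsilon-greedy: ignore the policy entirely and act at random
        return torch.randint(0, policy_logits.shape[-1], (1,)).item()
    # otherwise sample from the softmax distribution rather than taking the argmax
    return Categorical(logits=policy_logits).sample().item()

# epsilon annealed exponentially over episodes, e.g.
# epsilon = max(0.05, 0.995 ** episode)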

Each of these methods increases the amount of training needed, but they do produce a smarter agent. Luckily, in this kind of game-playing RL there is no shortage of data; the agent can play for as long as we let it.

The Agent And Its Results

The agent I made was really dull and basic in the end: two convolutional layers, a couple of fully connected layers with different nonlinearities, and the aforementioned two heads comprised of linear layers.

It learns to avoid the edges of the canvas and avoid itself by going in a spiral

As input to the network, I made a stack of the current frame/canvas, plus the two previous frames. (Each frame would differ by just a single pixel and/or the orientation of the turtle).

An unremarkable actor-critic network with epsilon-greedy learning built in.
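
For concreteness, here is a sketch of a network with roughly that shape, assuming PyTorch; the layer sizes and nonlinearities are my guesses, not the repo’s actual numbers:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TurtleActorCritic(nn.Module):
    """Two conv layers, a fully connected layer, and two two-layer heads."""

    def __init__(self, n_actions=3, frames=3):
        super().__init__()
        self.conv1 = nn.Conv2d(frames, 16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc = nn.Linear(32 * 23 * 23, 256)   # 23x23 comes from a 200x200 input
        self.drop = nn.Dropout(0.2)              # a little inherent randomness
        # each head is two linear layers with a nonlinearity in between
        self.actor = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, n_actions))
        self.critic = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):                        # x: (batch, frames, 200, 200)
        x = F.relu(self.conv1(x))
        x = torch.tanh(self.conv2(x))            # 'different nonlinearities'
        x = self.drop(F.relu(self.fc(x.flatten(1))))
        return self.actor(x), self.critic(x)     # action logits and state value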

Things I Learned Making This

As you may know if you’ve read my writing before, I won’t claim that making this RL turtle was easy. Some things I learned:

  • As always, reward design (both immediate and at end-of-game) is difficult, finicky, and not entirely intuitive.
  • More noise and sources of entropy really help in the exploration phase. Don’t assume that one is enough.
  • It’s not enough for the action head and the value head to each be a single linear layer. Two layers per head, with a non-linearity in between, worked much better when bolted onto the main ‘body’ of the network.
  • RL can easily plateau. Often there’s not much difference between the losses after five minutes of training and after five hours.
  • Injecting signals like ‘last_action’ into later layers of the network (bypassing the convolutional layers to give the network some useful time-dependent context) doesn’t necessarily help.
  • Randomizing the start position and angle of the turtle made the problem much more complex. I thought it would help the agent learn generalization, but at least in the training time I gave it, it slowed convergence a lot.
  • A 200x200 grid, most of which is kind of sparse (not much going on), with only a single pixel line thickness, and not much difference between the 3 time-sequenced frames I provided to the network, resulted in quite slow learning.
The training loop — where the ‘magic’ happens.
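
The real loop is in the repo’s examples folder; its skeleton is roughly the following (a hedged sketch reusing the helpers sketched earlier, with env standing in for the turtle environment):

net = TurtleActorCritic()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

for episode in range(5000):
    state, done = env.reset(), False             # env is the turtle environment
    log_probs, values, rewards, entropies = [], [], [], []
    epsilon = max(0.05, 0.995 ** episode)        # annealed exploration

    while not done:
        logits, value = net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        dist = Categorical(logits=logits.squeeze(0))
        action = select_action(logits.squeeze(0), epsilon)
        state, reward, done, _ = env.step(action)

        log_probs.append(dist.log_prob(torch.tensor(action)))
        values.append(value.squeeze())
        rewards.append(reward)
        entropies.append(dist.entropy())

    returns = torch.tensor(discounted_returns(rewards))
    loss = a2c_loss(torch.stack(log_probs), torch.stack(values),
                    returns, torch.stack(entropies))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()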

The Stupid Agent Problem, Future Work

Instead of learning a hard and fast rule never to go beyond the boundaries of the screen, the agent I made simply gets better at avoiding them over time, as the probability of it turning left or right near an edge increases. It has no common sense! It’s hard to say the agent has learned anything at all.

How then do you crystallize learned or partly-learned probability distributions into actual rules? This could give a much more effective agent, that could also learn vastly quicker. And how about rules that can be broken sometimes, or outright unlearning some rules when they are no longer useful?

And then of course there is the classic RL problem that nothing the agent has learned here would be useful in, for example, a game of Space Invaders. And that the agent has no concept of a turtle and its trail and the edge of a screen, just a bunch of tensors that it tirelessly multiplies over and over and adjusts based on some derivative. And that the reward signal is hand-coded; surely playing for as long as possible should be a decent reward in and of itself. Why hasn’t anyone yet found a way to intrinsically motivate AIs?

AGI is coming but we still have to solve massive engineering and philosophical challenges. The humble Python turtle continues to give me a fun test-bed for that.

Thank you for reading! I love talking with people about AGI, and specifically how we build stuff to get there from where we are now, so please get in touch if you like. There’s more to come :)
