Reinforcement Learning at O’Reilly Artificial Intelligence Conference NY 2017

Juarez Bochi
Towards Data Science
Jul 11, 2017


O’Reilly Artificial Intelligence Conference 2017 was held in New York a couple of weeks ago. It was an amazing conference, with very good talks from both academia and industry. This post summarizes a few of the talks and tutorials I attended on Reinforcement Learning, “the area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward”.

Cars that coordinate with people

Fantasia, Disney 1940

Anca D. Dragan, from Berkeley, gave a keynote entitled “Cars that coordinate with people”, where she presented results from the paper “Planning for Autonomous Cars that Leverage Effects on Human Actions.” Instead of doing pure obstacle avoidance, i.e., simply trying to stay out of the way of other moving objects, they model other drivers as agents following their own policies. That means the robot knows the other cars will also avoid hitting obstacles, so it can predict how other vehicles will react to its actions.

Figure from “Planning for Autonomous Cars that Leverage Effects on Human Actions”

The autonomous vehicle is also able to take actions that let it gather information about the other cars. For example, it can slowly start merging into the lane ahead of a human driver until it has enough evidence that the driver is not aggressive and will actually brake to avoid a collision.
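To make that information-gathering idea concrete, here is a toy sketch of my own (not code from the paper) of how an agent could update its belief about whether the other driver is aggressive after a probing maneuver, using a simple Bayesian update; the braking probabilities are made-up illustration values:

```python
# Toy illustration (not from the paper): Bayesian update of the belief that
# the other driver is aggressive, after observing whether they braked when
# our car nudged into their lane.

def update_belief(p_aggressive, braked,
                  p_brake_if_aggressive=0.2,   # assumed: aggressive drivers rarely yield
                  p_brake_if_cautious=0.9):    # assumed: cautious drivers usually brake
    """Return P(aggressive | observation) via Bayes' rule."""
    p_obs_given_aggr = p_brake_if_aggressive if braked else 1 - p_brake_if_aggressive
    p_obs_given_caut = p_brake_if_cautious if braked else 1 - p_brake_if_cautious
    numerator = p_obs_given_aggr * p_aggressive
    denominator = numerator + p_obs_given_caut * (1 - p_aggressive)
    return numerator / denominator

belief = 0.5  # start with no information about the other driver
for braked in [True, True]:  # the driver braked twice while we edged in
    belief = update_belief(belief, braked)
    print(f"P(aggressive) = {belief:.2f}")  # drops as evidence of caution accumulates
```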

The keynote was so good that I changed my schedule to see her talk “Inverse Reward Function”, which was, in my opinion, the best talk at the conference. She started it by talking about Disney’s Fantasia, which is based on the poem “The Sorcerer’s Apprentice”, written by Goethe in 1797. As Wikipedia summarizes it,

the poem begins as an old sorcerer departs his workshop, leaving his apprentice with chores to perform. Tired of fetching water by pail, the apprentice enchants a broom to do the work for him — using magic in which he is not yet fully trained. The floor is soon awash with water, and the apprentice realizes that he cannot stop the broom because he does not know how.

As we create more robots that interact with humans directly, Dragan is researching how we can be sure they are going to do what we really want, even when we give orders that are not very precise. I once read a hypothetical story that illustrates this issue too. Unfortunately, I could not find the reference, but it goes like this:

Suppose we create a super intelligent machine and ask it to find a cure for malaria. We set its goal as minimizing the number of humans who die from the disease. The machine figures out that the fastest and most reliable solution is to hack into all the nuclear weapons in the world and launch them, killing all humans and ensuring no one ever dies from malaria again. The machine achieves its goal, but clearly not in the way the programmer intended.

Isaac Asimov also wrote several good stories about similar situations and came up with the three laws of robotics as a way to solve the issue.

https://xkcd.com/1613/

Dragan and her group have done a lot of research in this area: “Should Robots be Obedient?”, “Robot Planning with Mathematical Models of Human State and Action”, “The Off-Switch Game.” In short, their approach is to have the robot treat the order or policy the human specifies as imperfect, and to avoid risk by not doing anything too different from what it has seen during training.

Superhuman AI for strategic reasoning: Beating top pros in heads-up no-limit Texas hold’em, by Tuomas Sandholm (Carnegie Mellon University): They were able to beat top human players in a game that is comparable in difficulty to Go, but didn’t get the same attention from the media. The game imposes additional complexity because the players do not have complete information. Sandholm pointed out a third consideration, beyond the typical explore vs. exploit trade-off, that such games have to take into account: exploitability. Their agent, Libratus, tries to minimize its own exploitability. It is not especially good at exploiting weak players, but with this approach it can beat the best humans.

https://www.cmu.edu/news/stories/archives/2017/january/AI-tough-poker-player.html

Deep reinforcement learning tutorial, by Arthur Juliani: Very nice tutorial, with code that’s easy to understand and run. Juliani presented several different methods for reinforcement learning, including multi-armed bandits, Q-Learning, Policy Gradients, and Actor-Critic agents. Take a look at the repository for the tutorial.

Multi-Armed Bandit Dungeon environment
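As a flavor of the kind of agent the tutorial starts with, here is a minimal epsilon-greedy sketch for a multi-armed bandit, assuming a simple Bernoulli-reward setup (my own sketch, not Juliani’s code):

```python
# Minimal epsilon-greedy agent for a multi-armed bandit (illustrative sketch).
import random

true_rewards = [0.2, 0.5, 0.8]       # hidden payout probability of each arm
estimates = [0.0] * len(true_rewards)
counts = [0] * len(true_rewards)
epsilon = 0.1                        # fraction of the time we explore at random

for step in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_rewards))                        # explore
    else:
        arm = max(range(len(true_rewards)), key=lambda a: estimates[a])  # exploit
    reward = 1.0 if random.random() < true_rewards[arm] else 0.0
    counts[arm] += 1
    # Incremental average keeps a running estimate of each arm's value
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print("Estimated arm values:", [round(e, 2) for e in estimates])
```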

Building Game-Bots using OpenAI’s Gym and Universe, by Anmol Jagetia: Unfortunately he had several technical issues: some examples crashed, and the ones that ran had such a bad frame rate that we couldn’t see what the agents were doing, but his notebooks for the tutorial look interesting: https://github.com/anmoljagetia/OReillyAI-Gamebots
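For anyone who wants to try the notebooks, the core Gym interaction loop looks roughly like this (a minimal sketch with a random agent, using the classic Gym API from that time; newer Gymnasium releases return slightly different values from reset and step):

```python
# Minimal random-agent loop with OpenAI Gym (classic pre-2021 API, where
# step() returns four values; this is an illustration, not tutorial code).
import gym

env = gym.make("CartPole-v0")
for episode in range(3):
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()              # random policy for illustration
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print(f"Episode {episode}: reward = {total_reward}")
env.close()
```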

Stay tuned for more. I’ll write another post about other topics I saw there: Recommender Systems, TensorFlow, and Natural Language Understanding.
