Every Monday, I present 4 publications from my research area. Let’s discuss them!

Paper 1: Value Iteration in Continuous Actions, States and Time
Lutter M., Mannor S., Peters J., Fox D., Garg A. (2021). Value Iteration in Continuous Actions, States and Time. arXiv preprint arXiv:2105.04682.
Reinforcement learning methods were originally tabular: the agent chooses one action from a finite set, based on an observation drawn from a finite set of possible observations. These methods were successively extended to continuous observation spaces and then to continuous action spaces. Here, the authors address continuous time: the goal is not to take an action every t seconds, but to perform genuine continuous-time control.
The most natural setting is robotics, and that is the one they chose. They propose continuous Fitted Value Iteration (cFVI), an algorithm that performs continuous control based on a known dynamics model.
The authors demonstrate the effectiveness of the approach in several control environments, both in simulation and in the real world. A classic challenge is to learn a policy in simulation and then deploy it on real hardware, usually with transfer methods such as domain randomization. It is striking that the policy learned with cFVI is more robust in the real world than the one learned with discrete-time control, even though no simulation-to-real transfer method is used.
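To make the underlying template concrete, here is a minimal discrete-time fitted value iteration sketch with a known dynamics model on a toy one-dimensional task. It is not the authors' continuous-time cFVI; the dynamics, reward and features below are assumptions chosen only to illustrate the loop that cFVI extends: compute Bellman targets with the known model, then regress the value function onto them.

```python
import numpy as np

# Minimal discrete-time fitted value iteration on a toy 1D "drive the state
# to the origin" task with a known dynamics model. Illustrative only: cFVI
# works in continuous time, this sketch does not.

GAMMA = 0.95
ACTIONS = np.linspace(-1.0, 1.0, 11)          # candidate actions for the max
STATES = np.random.uniform(-2, 2, size=500)   # sampled training states
DT = 0.1

def dynamics(s, a):                 # known model: simple damped integrator
    return s + DT * (a - 0.5 * s)

def reward(s, a):                   # quadratic cost turned into a reward
    return -(s ** 2 + 0.1 * a ** 2)

def features(s):                    # RBF features for a linear value function
    centers = np.linspace(-2, 2, 21)
    return np.exp(-((np.atleast_1d(s)[:, None] - centers) ** 2) / 0.1)

w = np.zeros(features(0.0).shape[1])           # value-function weights
for _ in range(100):
    # Bellman targets: max over candidate actions of r + gamma * V(f(s, a))
    next_states = dynamics(STATES[:, None], ACTIONS[None, :])
    q = reward(STATES[:, None], ACTIONS[None, :]) + GAMMA * (
        features(next_states.ravel()) @ w).reshape(next_states.shape)
    targets = q.max(axis=1)
    # Fit the value function to the targets (least squares on RBF features)
    w, *_ = np.linalg.lstsq(features(STATES), targets, rcond=None)

print("V(0) ≈", float(features(0.0) @ w))
```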
Paper 2: The AI economist: Improving equality and productivity with AI-driven tax policies
Zheng, S., Trott, A., Srinivasa, S., Naik, N., Gruesbeck, M., Parkes, D. C., & Socher, R. (2020). The AI Economist: Improving equality and productivity with AI-driven tax policies. arXiv preprint arXiv:2004.13332.
The AI Economist is a reinforcement learning framework that simulates a simplified world populated by several economic agents. In this environment, each agent can interact with the others: exchanging goods, gathering resources, building to earn income… Agents are also taxed, and the tax revenue is used to build roads, schools…
The objective is to maximize both productivity and equality among agents, and the tax policy is chosen to optimize this dual objective. Finding the right balance requires careful policy design: how heavily to tax high and low incomes, how to redistribute the revenue, where to invest the money collected, and so on.
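To see what this dual objective can look like in practice, here is a small sketch of a social-welfare score combining equality (one minus a normalized Gini coefficient of incomes) and productivity (total income). The exact formulation and weighting used in the paper may differ in its details; treat this as an illustration of the trade-off, not the reference implementation.

```python
import numpy as np

# Toy social-welfare objective: equality times productivity.
# Equality is 1 minus the normalized Gini coefficient of agents' incomes,
# productivity is total income. Illustrative assumption, not the paper's code.

def gini(incomes: np.ndarray) -> float:
    """Gini coefficient of a non-negative income vector (0 = perfect equality)."""
    x = np.sort(incomes)
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def social_welfare(incomes: np.ndarray) -> float:
    n = len(incomes)
    equality = 1.0 - n / (n - 1) * gini(incomes)   # normalized to [0, 1]
    productivity = incomes.sum()
    return equality * productivity

# Two toy economies with the same total income but different distributions:
print(social_welfare(np.array([25., 25., 25., 25.])))   # equal incomes -> high welfare
print(social_welfare(np.array([97., 1., 1., 1.])))      # concentrated -> low welfare
```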
For now, the environment is far too simple to derive policy recommendations. Nevertheless, one can imagine a more complete simulation, in which the individual behavior of agents, but also that of firms and communities, is more faithful to reality. It would also be necessary to introduce an objective that is indispensable and deeply intertwined with the economy: ecology.
Paper 3: The computational origins of confidence biases in reinforcement learning
Lebreton, M., Palminteri, S., & Garcia, N. A. S. (2021, May 5). The computational origins of confidence biases in reinforcement learning. Retrieved from osf.io/cy9e6
Ignorance more frequently begets confidence than does knowledge
wrote Darwin, describing a bias that would later be called the Dunning-Kruger effect: put succinctly, an incompetent person tends to overestimate his or her own level of competence. At this point you might think it is a purely human bias. Yet it also shows up in reinforcement learning, and few articles try to explain how and why these biases appear and persist in a reinforcement learning context. The authors describe this behavior as counter-intuitive, and I agree with them.
So how can this be explained? The authors argue that confidence biases emerge and are maintained in reinforcement learning contexts because of learning biases. They therefore examined instrumental choices and confidence judgments during both the learning phases and the transfer phases. The results suggest that a reinforcement learning model combining context-dependent learning and confirmatory updating is a very good candidate for explaining participants' choices on the tasks used in the paper. They further show that the model's overconfidence bias can be explained by an overweighting of the learned confidence value, and they conclude that individual cognitive biases can be predicted from the proposed reinforcement learning model.
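To give an idea of what "confirmatory updating" means computationally, here is a small two-armed bandit sketch with asymmetric learning rates: prediction errors that confirm the current choice are learned faster than those that contradict it. This is a generic model in the spirit of the paper, not the authors' exact fitted model, and the parameter values are arbitrary.

```python
import numpy as np

# Confirmatory updating in a two-armed bandit with full feedback:
# a prediction error is "confirmatory" if it is positive for the chosen
# option or negative for the forgone one, and it is learned with a larger
# learning rate. Illustrative assumption, not the paper's fitted model.

ALPHA_CONF, ALPHA_DISC, BETA = 0.30, 0.05, 5.0
rng = np.random.default_rng(0)

q = np.zeros(2)                         # learned option values
p_reward = np.array([0.6, 0.4])         # true reward probabilities
for trial in range(1000):
    probs = np.exp(BETA * q) / np.exp(BETA * q).sum()    # softmax choice
    choice = rng.choice(2, p=probs)
    outcomes = (rng.random(2) < p_reward).astype(float)  # both outcomes observed
    for option in (0, 1):
        delta = outcomes[option] - q[option]
        confirms = (delta > 0) if option == choice else (delta < 0)
        q[option] += (ALPHA_CONF if confirms else ALPHA_DISC) * delta

# With ALPHA_CONF > ALPHA_DISC the preferred option's value ends up inflated
# well above its true 0.6 reward rate: a learned, self-sustaining overconfidence.
print("learned values:", q, "vs true probabilities", p_reward)
```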
Paper 4: Constructions in combinatorics via neural networks
Wagner, A. Z. (2021). Constructions in combinatorics via neural networks. arXiv preprint arXiv:2104.14516.
The great new tool for mathematicians is the computer: not only for heavy calculations but, for a few decades now, to help them with their proofs. These are called computer-assisted proofs, and they are building a respectable track record. Among other things, they have produced landmark results such as the proofs of the four-color theorem and of the Kepler conjecture. There is no doubt about it: machine learning algorithms are among the most powerful tools, not only for data scientists, but for all scientists, including mathematicians. Here is further proof.
In this article, the author demonstrates once again the value of reinforcement learning applied to computer-assisted proofs. The subject here is graphs. Graphs are an active area of research, and mathematicians regularly formulate questions and conjectures about them. A conjecture only becomes a theorem once it is proved, and some conjectures turn out to be false. To show that a conjecture is false, it suffices to exhibit a counterexample. So the author had the idea of using reinforcement learning to find graph counterexamples that invalidate certain conjectures. Among the refuted conjectures: the Aouchiche-Hansen conjecture, the Collins conjecture and the Aaronson-Groenland-Grzesik-Kielak-Johnston conjecture. I will not explain these conjectures, because they are quite mathematical and do not add much to the point. What matters is that these counterexamples were obtained with a reinforcement learning algorithm: the deep cross-entropy method.
For each conjecture, the agent builds a graph and proposes it to the environment; the environment scores the graph according to the conjecture and returns a reward. The agent then maximizes this reward and, in some cases, finds a graph that serves as a counterexample invalidating the conjecture.
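Here is a deliberately simplified sketch of that loop. The paper uses a neural-network policy that builds the graph one edge at a time (hence "deep" cross-entropy method); this version keeps an independent inclusion probability per potential edge and uses a toy reward, which is enough to show the generate / score / select-elite / update cycle.

```python
import numpy as np

# Simplified cross-entropy method for searching over graphs.
# Candidate graphs are binary vectors over the potential edges of an
# undirected simple graph; the edge-inclusion probabilities are pulled
# toward the best-scoring ("elite") candidates at each iteration.

N = 12                       # number of vertices (illustrative size)
N_EDGES = N * (N - 1) // 2   # potential edges
BATCH = 200                  # candidate graphs sampled per iteration
N_ELITE = 20                 # best graphs kept to update the distribution
LR = 0.3                     # how strongly elites pull the edge probabilities
TRIU = np.triu_indices(N, k=1)

def score(edges: np.ndarray) -> float:
    """Toy stand-in reward (not one of the paper's conjectures): favour
    triangle-free graphs with many edges. For a real conjecture this would
    be, e.g., the gap by which the conjectured inequality is violated."""
    adj = np.zeros((N, N))
    adj[TRIU] = edges
    adj += adj.T
    triangles = np.trace(adj @ adj @ adj) / 6.0
    return edges.sum() - 10.0 * triangles

p = np.full(N_EDGES, 0.5)    # initial edge-inclusion probabilities
for it in range(200):
    batch = (np.random.rand(BATCH, N_EDGES) < p).astype(float)  # 1. sample graphs
    rewards = np.array([score(g) for g in batch])                # 2. score them
    elite = batch[np.argsort(rewards)[-N_ELITE:]]                # 3. keep the elite
    p = (1 - LR) * p + LR * elite.mean(axis=0)                   # 4. update distribution

print("best reward found:", rewards.max())
```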

The article does not only refute conjectures; it also tackles other questions about graphs. To learn more, go and explore this very interesting paper.
It was a great pleasure to present my readings of the week. Feel free to send me your feedback.