I teach a course on Markov decision processes (MDPs) in the Fall. For those who have never heard of MDPs: the MDP is a widely studied tool in management science as well as in robotics. Much of my course focuses on a computational method called Reinforcement Learning (RL), which solves MDPs. Another more recent name for RL is ADP (approximate/adaptive dynamic programming).
A student in my class asked which name is more appropriate: RL or ADP?
Now, there is some interesting history to these names that I shared with the student:
The classical approach to solving MDPs is called dynamic programming,
and it was invented by Bellman and Howard in the 1950s and 1960s. Much later, in the 1980s, Reinforcement Learning emerged as a computational alternative to dynamic programming --- one that could solve MDPs with larger solution spaces than was possible with dynamic programming.
The discovery of the Q-Learning algorithm by Watkins (1989) and that of the actor-critic algorithm by Barto, Sutton, and Anderson (1983) were two notable landmarks in the history of reinforcement learning. It turns out that both of these algorithms were discovered in the computer science community, where the preferred application for the method was robotics.
Many robotics scientists, for very understandable reasons, prefer to directly test their algorithms in the actual system, rather than try them out in simulators first. The world "learning" is related to trials and errors that the robot has to perform before it becomes smart; essentially, the feedback (reinforcement) that the robot receives in the trials and errors allows it to get smarter. So the name reinforcement learning obviously made a lot of sense to the computer scientists.
Not surprisingly, early applications of reinforcement learning in the world of management science, where simulation is often preferred to testing something directly in the actual system, were also referred to as applications of "reinforcement learning." In recent times, however, the name approximate or adaptive dynamic programming (ADP) has gained popularity in the management science community.
But do note: Reinforcement Learning, technically speaking, is not really a dynamic programming method; rather it
uses a statistical algorithm (called the Robbins-Monro algorithm) in combination
with the Bellman equation.
In the early part of this century, the name ADP was not used much, but the name neuro-dynamic programming (NDP) was popular amongst many in the management science community. Unfortunately, NDP as a name did not gain traction. Although that name has not survived, the book on NDP by Bertsekas and Tsitsiklis is a seminal contribution to the field, and I still find myself referring to it every now and then.
Reinforcement Learning as a name has surprisingly survived. In the artificial intelligence (AI) community, it is still the name to be used. Also, AI as a field is making rapid progress, and, indeed,
many thinking robots use Reinforcement Learning. So it is very likely that this name will not die out.
I still like to use the name Reinforcement Learning. When I wrote my Ph.D. dissertation in 1999, I used it. When I wrote the first edition of my book on the topic in 2003, again, I used it, because ADP was almost never used then. So you might say I am not a very unbiased observer here :) :)
For many, and I agree with them, it is just a matter of semantics, But it will be very, very interesting to see which name is used twenty years hence. If this blog survives until Dec 10, 2034, we will be able to revisit this topic then.
No comments:
Post a Comment