A few weeks ago I was asked if I wanted to help out with the Reinforcement/Attentional Learning group, mostly by doing some background reading and literature reviews. I said yes, and then immediately sank to the axles into a body of literature I did not understand. I’m talking like zero comprehension of the paper abstracts, even. I’m starting to pull it all together now though, and I’ll be posting occasionally about some of the discoveries I make along the way.
The RLattN group is interested, in part, in how well Temporal Difference (TD) Learning (specifically, the actor-critic method) correlates with what actually happens in the brain during learning. First off, I wasn’t familiar with the whole reinforcement learning concept, let alone the different approaches to the problem such as TD Learning; second, I haven’t yet taken any biology/bio-psych courses, and as of three weeks ago I was a little fuzzy on, for example, what the role of dopamine was. (When I thought of what dopamine meant to me, all I could think of was the Modest Mouse song … which, it turns out, is actually called “Dramamine”. I still think there’s a connection there.)
Sutton and Barto’s book on Reinforcement Learning has been invaluable in understanding the basic concepts of RL. (And I anticipate it will be just as valuable as I continue reading.) At the most basic level, RL can be described as a process where an agent interacts with an environment over a discrete series of time steps. Each action moves the agent to a new state and produces a reward (which can be positive, negative, or zero, depending on the action and the current state). The goal of the agent is to maximize the cumulative reward, called the return, over time. This kind of model is ideal for problems where short-term gain (i.e. an immediate positive reward) matters less than long-term gain (i.e. maximizing the total reward over time, either to achieve a particular goal or continuously).
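To make the agent/environment/state/reward loop concrete for myself, here’s a minimal sketch in Python. The environment is entirely made up for illustration (it’s not an example from the book): a tiny “chain” world of five states where the agent moves left or right and receives a reward of +1 only upon reaching the rightmost state.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal
ACTIONS = [-1, +1]    # move left or move right

def step(state, action):
    """Apply an action to the environment, returning (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)  # clamp to the chain
    reward = 1 if next_state == N_STATES - 1 else 0         # reward only at the goal
    return next_state, reward

# One episode with a purely random policy, accumulating the return.
random.seed(0)
state, total_return = 0, 0
while state != N_STATES - 1:
    action = random.choice(ACTIONS)
    state, reward = step(state, action)
    total_return += reward

print(total_return)  # → 1 (the goal reward is received exactly once)
```

Even this toy version shows the shape of the problem: the agent wanders, the environment hands back states and rewards, and everything the agent “wants” is expressed through that reward signal alone.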
One of my first general questions was: what stops the agent from acting greedily, that is, from just taking whichever action gives it the highest immediate reward? It turns out my initial understanding of the reward system was incorrect. The authors discuss various reward schemes: rewards that are given rarely (e.g. the soda can robot example), only on completion of a major task (e.g. the checkers example), that are small or negative, or that are large and negative. They also note that it is important to reward the main achievement rather than its subgoals: we want the agent to go for the goal itself, not necessarily for the steps along the way. The important lesson here is that the reward signal should be used to define the goal, not how to get to it.
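The greedy-agent question can be illustrated with a toy of my own (not an example from the book): from the start state, one action pays +1 immediately but ends the episode, while the other pays nothing now and +10 one step later. An agent judging only immediate reward picks the worse option; an agent judging the whole (discounted) return picks the better one.

```python
# Hypothetical reward sequences for two actions from the start state.
REWARDS = {
    "a": [1],        # +1 now, episode over
    "b": [0, 10],    # nothing now, +10 one step later
}

def episode_return(action, gamma=0.9):
    """Discounted return: sum of gamma^t * r_t over the episode."""
    return sum(gamma ** t * r for t, r in enumerate(REWARDS[action]))

greedy_choice = max(REWARDS, key=lambda a: REWARDS[a][0])  # looks only at the first reward
best_choice = max(REWARDS, key=episode_return)             # looks at the whole return

print(greedy_choice, best_choice)  # → a b
```

Here the greedy rule grabs the +1 and misses the +10, while maximizing the return (1 vs. 9 after discounting) correctly prefers the delayed payoff. This is exactly why the return, not the instantaneous reward, is the quantity the agent should maximize.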
Tomorrow I will be reading up on TD Learning, and I’ll make some initial forays into what goes on inside our heads, so that in the near future I’ll be able to return to the literature I was trying to read previously and deal with it much more competently. Until then, my understanding of the role of dopamine in the brain remains almost entirely formed by the following Toothpaste For Dinner comic.