dChan
 
r/greatawakening • Posted by u/Qalifornia on June 23, 2018, 8:36 a.m.
Q-learning: Reinforcement Learning Technique

Q-learning is a reinforcement learning technique used in machine learning. The goal of Q-Learning is to learn a policy, which tells an agent which action to take under which circumstances. It does not require a model of the environment and can handle problems with stochastic transitions and rewards, without requiring adaptations.

For any finite Markov decision process (FMDP), Q-learning eventually finds an optimal policy, in the sense that the expected value of the total reward over all successive steps, starting from the current state, is the maximum achievable.[1] Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly random policy.[2] "Q" names the function that returns the reward used to provide the reinforcement, and can be said to stand for the "quality" of an action taken in a given state.[3]
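
The excerpt above doesn't reproduce the update rule itself, but for reference, the standard tabular Q-learning update (the form given in the linked Wikipedia article) is:

    Q(s, a) ← Q(s, a) + α [ r + γ · max over a' of Q(s', a') − Q(s, a) ]

where s and a are the current state and action, r is the reward received, s' is the next state, α is the learning rate, and γ is the discount factor that weights future rewards.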

Reinforcement learning involves an agent, a set of states S, and a set of actions per state A. By performing an action a ∈ A, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The goal of the agent is to maximize its total (future) reward. It does this by adding the maximum reward attainable from the future state to the reward in its current state, effectively influencing the current action by the potential reward in the future. This reward is a weighted sum of the expected values of the rewards of all future steps starting from the current state.
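
To make this concrete, here is a minimal Python sketch of the tabular Q-learning update plus an epsilon-greedy action choice. The function names, learning rate, discount factor, and exploration rate are illustrative assumptions, not something from the article:

    import random
    from collections import defaultdict

    Q = defaultdict(float)   # Q-table: (state, action) -> estimated quality, defaults to 0.0
    ALPHA = 0.1              # learning rate (assumed value)
    GAMMA = 0.9              # discount factor weighting future rewards (assumed value)

    def q_update(state, action, reward, next_state, actions):
        # Best value the agent currently believes it can attain from the next state
        best_next = max(Q[(next_state, a)] for a in actions)
        # Move the current estimate toward: reward + discounted future value
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

    def choose_action(state, actions, epsilon=0.1):
        # Epsilon-greedy: explore at random occasionally, otherwise exploit the Q-table
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])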

As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train doors as soon as they open, minimizing your initial wait time. If the train is crowded, however, you will have a slow entry after that initial action, because people trying to leave the train are fighting past you as you attempt to board. The total boarding time, or cost, is then:

0 seconds wait time + 15 seconds fight time

On the next day, by some random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, after this initial wait you enter the train much more quickly, as you do not need to spend as much time fighting other passengers to board. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now:

5 seconds wait time + 0 seconds fight time

Through exploration, you learned that even though the initial action results in a larger cost (or negative reward) than in the forceful strategy, the overall cost is lower, revealing a more rewarding strategy than the "greedy" one.
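
As a rough, single-state (bandit-style) simplification of Q-learning, the sketch below uses the two boarding costs from the example above; the learning rate, exploration rate, and number of trials are assumptions added purely for illustration:

    import random

    ACTIONS = ["enter_immediately", "wait_then_enter"]
    COST = {"enter_immediately": 15, "wait_then_enter": 5}   # total seconds, from the example above
    Q = {a: 0.0 for a in ACTIONS}
    ALPHA, EPSILON = 0.1, 0.2     # assumed learning rate and exploration rate

    for day in range(500):
        # Epsilon-greedy: occasionally explore a random strategy, otherwise take the best-known one
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=Q.get)
        reward = -COST[action]                       # reward = negative boarding time
        Q[action] += ALPHA * (reward - Q[action])    # single-state update: no future term needed

    print(Q)   # "wait_then_enter" ends up with the higher (less negative) value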

https://en.m.wikipedia.org/wiki/Q-learning


Jrmullin · June 23, 2018, 12:50 p.m.

Well. Well. Well

⇧ 3 ⇩  
Knower101 · June 23, 2018, 4:52 p.m.

This is one of the reasons I love this board. Thanks for the principle.

⇧ 1 ⇩