ML Reinforcement Learning
Like Supervised Learning, but without a fixed dataset: the system generates its own data by interacting with an environment (e.g. the playing field of a game). The goal is to learn a policy, for example one that plays the game well.
The machine perceives the state of the environment as a feature vector. The policy function takes the feature vector of a state and predicts the best action to take; actions are chosen to maximize the expected cumulative reward.
Definitions
- State s
- Action a
- Reward function R(s, a, s')
- Transition function T(s, a, s'): the probability that action a taken in state s leads to s', i.e. T(s, a, s') = P(s' | s, a)
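As a concrete illustration of these definitions, below is a minimal sketch of how a tiny MDP might be represented in Python. The state names, actions, probabilities, and rewards are made-up example values, not taken from any particular problem.

```python
# Hypothetical 3-state MDP, purely to illustrate the definitions above.
STATES = ["cool", "warm", "overheated"]
ACTIONS = ["slow", "fast"]

# Transition function: T[(s, a)] -> list of (s', probability) pairs.
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("overheated", 1.0)],
}

# Reward function: R[(s, a, s')] -> immediate reward.
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}
```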
Active and Passive RL
When we do not know the model, i.e. the reward function R(s,a,s') and the transition function T(s,a,s'), we speak of model-free learning.
Active RL:
- Fundamental trade-off: exploration vs. exploitation
- learn the policy
- agent makes choices
Passive RL:
- fixed policy
- learn the state values
- agent is along for the ride
Model-Free Learning
We can use Monte Carlo evaluation: let the agent run many episodes under the fixed policy and average the observed returns per state. As the number of samples grows, the estimated V values approach the true V values.
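A minimal sketch of (every-visit) Monte Carlo policy evaluation, assuming a hypothetical episodic environment interface `env_reset()` / `env_step(state, action)` and a `policy` function; these names are illustrative, not a standard API.

```python
from collections import defaultdict

def mc_evaluate(env_reset, env_step, policy, episodes=1000, gamma=0.9):
    """Every-visit Monte Carlo policy evaluation (sketch).

    Assumed (hypothetical) environment interface:
      env_reset() -> initial state
      env_step(state, action) -> (next_state, reward, done)
    The policy stays fixed; we only estimate V(s) from sampled returns.
    """
    returns = defaultdict(list)              # state -> list of sampled returns
    for _ in range(episodes):
        # Roll out one full episode under the fixed policy.
        state, trajectory, done = env_reset(), [], False
        while not done:
            action = policy(state)
            next_state, reward, done = env_step(state, action)
            trajectory.append((state, reward))
            state = next_state
        # Walk the episode backwards, accumulating the discounted return G.
        G = 0.0
        for state, reward in reversed(trajectory):
            G = reward + gamma * G
            returns[state].append(G)
    # The estimate of V(s) is the average of all returns observed from s.
    return {s: sum(g) / len(g) for s, g in returns.items()}
```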
Value Iteration
Given a known model, value iteration computes V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V_k(s')], starting from V_0(s) = 0, and repeats until the values converge to the optimal values (the Bellman equation applied as an update rule).
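A sketch of value iteration over the toy MDP structures shown earlier (the `T` / `R` dictionary layout is an assumption for illustration, not a standard API).

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Value iteration sketch for a known model.

    T[(s, a)] is a list of (s', probability) pairs and R[(s, a, s')] is the
    immediate reward, matching the toy MDP sketched in the Definitions section.
    """
    V = {s: 0.0 for s in states}                 # V_0(s) = 0
    while True:
        delta, V_new = 0.0, {}
        for s in states:
            q_values = []
            for a in actions:
                if (s, a) not in T:              # no legal transition from s via a
                    continue
                q = sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                        for s2, p in T[(s, a)])
                q_values.append(q)
            # Bellman update: V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R + gamma V_k(s')]
            V_new[s] = max(q_values) if q_values else 0.0
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < tol:                          # values have converged
            return V
```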
TD-Learning (passive)
A model-free way to do policy evaluation: given a fixed policy, it mimics the Bellman updates with running sample averages, V(s) ← (1 - α) V(s) + α [r + γ V(s')].
To derive a new policy from the learned values we would also need the transition model, so instead we learn Q-values Q(s,a) rather than V-values.
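A minimal TD(0) sketch, using the same hypothetical environment interface as the Monte Carlo example above; instead of waiting for full returns, each transition nudges V(s) toward the sampled target r + γ V(s').

```python
from collections import defaultdict

def td0_evaluate(env_reset, env_step, policy, episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0) policy evaluation sketch (passive, model-free RL)."""
    V = defaultdict(float)
    for _ in range(episodes):
        state, done = env_reset(), False
        while not done:
            action = policy(state)                       # fixed policy
            next_state, reward, done = env_step(state, action)
            target = reward + (0.0 if done else gamma * V[next_state])
            # Running average: V(s) <- (1 - alpha) V(s) + alpha * [r + gamma V(s')]
            V[state] += alpha * (target - V[state])
            state = next_state
    return dict(V)
```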
Q-Learning (active)
Q-Learning converges to the optimal policy even if the agent acts suboptimally along the way (it is off-policy), provided every state-action pair is explored sufficiently and the learning rate is decayed appropriately. The update is Q(s,a) ← (1 - α) Q(s,a) + α [r + γ max_{a'} Q(s',a')].
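A tabular Q-learning sketch under the same hypothetical environment interface as the earlier examples; the ε-greedy behaviour policy and hyperparameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

def q_learning(env_reset, env_step, actions, episodes=5000,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch (active, off-policy RL).

    The behaviour policy is epsilon-greedy, so the agent may act suboptimally,
    but the update bootstraps on max_a' Q(s', a'), so the learned Q-values
    still move toward the optimal ones.
    """
    Q = defaultdict(float)                      # (state, action) -> value
    for _ in range(episodes):
        state, done = env_reset(), False
        while not done:
            # Exploration vs. exploitation: epsilon-greedy action choice.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env_step(state, action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    # Extract the greedy policy from the learned Q-values.
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for (s, _a) in list(Q)}
    return Q, policy
```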