RL Fundamentals
Environment: The agent acts in an environment.
- State & Actions: The agent can be in one of many states (s) of the environment and choose one of many actions (a) to switch from one state to another.
- Model: How the environment reacts to certain actions is defined by a model, which we may or may not know. The model defines the reward function and the transition probabilities.
- State Transition (Model): Which state the agent arrives in is decided by the transition probabilities between states (P).
- Reward: Once an action is taken, the environment delivers a reward (r) as feedback.
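To make the notation concrete, here is a minimal sketch of these ingredients for a hypothetical two-state MDP; the states, actions, probabilities, and rewards below are made up purely for illustration.

```python
# A toy MDP: states s, actions a, transition probabilities P(s' | s, a),
# and an immediate reward r(s, a). All values are illustrative.
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] is a dict mapping each possible next state s' to its probability.
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# r[(s, a)] is the reward the environment delivers after action a in state s.
r = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 0.5,
    ("s1", "move"): -1.0,
}
```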
- Model-Based: We know the model, so we can plan with perfect information and do model-based RL. When we fully know the environment, we can find the optimal solution by Dynamic Programming (DP); see the value-iteration sketch below.
- Model-Free: We learn with incomplete information; we either do model-free RL or try to learn the model explicitly as part of the algorithm.
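As a sketch of the model-based case, here is value iteration by dynamic programming on the toy MDP above; the discount factor gamma and tolerance are assumed hyperparameters, not given in these notes.

```python
def value_iteration(states, actions, P, r, gamma=0.9, tol=1e-6):
    """Dynamic programming when the model (P, r) is fully known."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Back up each action's expected return and keep the best one.
            q_values = [
                r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration(states, actions, P, r)  # optimal state values for the toy MDP
```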
- Policy: The agent’s policy π(s) provides the guideline on the optimal action to take in a certain state, with the goal of maximizing the total rewards.
- Value Function: Each state is associated with a value function V(s) predicting the expected amount of future rewards we are able to receive in this state by acting according to the corresponding policy.
In RL, we are trying to learn the policy and the value function.
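One way to connect a value function back to a policy is to act greedily with respect to it. The sketch below, assuming the toy MDP and the V computed above, extracts such a policy.

```python
def greedy_policy(V, states, actions, P, r, gamma=0.9):
    """pi(s): pick the action with the highest immediate reward plus
    discounted expected value of the next state."""
    return {
        s: max(
            actions,
            key=lambda a: r[(s, a)]
            + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()),
        )
        for s in states
    }

pi = greedy_policy(V, states, actions, P, r)  # maps each state to its greedy action
```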
- On-policy
- The agent can pick its own actions
- The agent always follows its own policy
- The most straightforward setup
- Off-policy: Training on a distribution of transitions or episodes produced by a behavior policy that differs from the target policy (contrasted with on-policy in the sketch at the end of this section).
- The agent can't pick the actions it learns from
- Learning with exploration, playing without exploration
- Learning from an expert (the expert may be imperfect)
- Learning from sessions (recorded data)
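One standard way to see the on-policy vs. off-policy distinction is the tabular SARSA vs. Q-learning updates; these algorithms are not named in the notes above, so treat this as an illustrative aside. SARSA bootstraps from the action the agent actually takes under its own policy, while Q-learning bootstraps from the greedy action regardless of what the behavior policy does.

```python
# Tabular TD updates on a Q-table Q[(s, a)]; alpha and gamma are assumed hyperparameters.

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses a_next, the action the agent's own policy actually takes."""
    target = reward + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: the target uses the greedy action in s_next, independent of the behavior
    policy (learning with exploration, playing without exploration)."""
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```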