RL Fundamentals

  1. Environment: The agent acts in an environment.

  2. States & Actions: At each step the agent occupies one of many states (s) of the environment and can take one of many actions (a) to move from one state to another.
  3. Model: How the environment reacts to actions is defined by a model, which we may or may not know. The model defines the reward function and the transition probabilities.
  4. State Transition (Model): Which state the agent will arrive in is decided by transition probabilities between states (P).
  5. Reward: Once an action is taken, the environment delivers a reward (r) as feedback. (A minimal interaction loop tying these pieces together is sketched right after this list.)
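
The sketch below puts these pieces together as an agent-environment interaction loop. The two states, two actions, transition probabilities P, and rewards R are made-up illustrative assumptions, and the agent simply acts at random.

```python
import random

# Made-up two-state MDP, purely for illustration.
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# P[(s, a)] -> list of (next_state, probability) pairs
P = {
    ("s0", "left"):  [("s0", 0.9), ("s1", 0.1)],
    ("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("s1", 1.0)],
}
# R[(s, a)] -> immediate reward r
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 2.0}

def step(state, action):
    """Environment dynamics: sample the next state from P, return it with the reward."""
    next_states, probs = zip(*P[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, R[(state, action)]

state = "s0"
for t in range(5):
    action = random.choice(ACTIONS)           # placeholder policy: act at random
    next_state, reward = step(state, action)  # the environment reacts via the model
    print(t, state, action, reward, next_state)
    state = next_state
```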


  1. Model-Based: The model is known, so we can plan with perfect information and do model-based RL. When we fully know the environment, we can find the optimal solution by Dynamic Programming (DP); see the value-iteration sketch after this list.
  2. Model-Free: Learning with incomplete information; either do model-free RL, or try to learn the model explicitly as part of the algorithm.
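
As a concrete example of planning with a known model, here is a minimal value-iteration sketch in the dynamic-programming spirit. It reuses the same made-up two-state MDP; the numbers and the discount factor gamma are illustrative assumptions.

```python
# Same made-up two-state MDP as above; gamma is the discount factor.
P = {("s0", "left"):  [("s0", 0.9), ("s1", 0.1)],
     ("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
     ("s1", "left"):  [("s0", 1.0)],
     ("s1", "right"): [("s1", 1.0)]}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 2.0}
STATES, ACTIONS, gamma = ["s0", "s1"], ["left", "right"], 0.9

# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * V(s') ]
V = {s: 0.0 for s in STATES}
for _ in range(100):  # enough sweeps for this toy problem to converge
    V = {s: max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                for a in ACTIONS)
         for s in STATES}

# Greedy policy read off from the converged values
policy = {s: max(ACTIONS,
                 key=lambda a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)]))
          for s in STATES}
print(V, policy)
```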

  1. Policy: The agent’s policy π(s) provides the guideline on which action to take in a given state, with the goal of maximizing the total reward.
  2. Value Function: Each state is associated with a value function V(s) predicting the expected amount of future reward we can receive in this state by acting according to the corresponding policy (see the policy-evaluation sketch after this list).
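
To make the link between a policy and its value function concrete, the sketch below evaluates a fixed (assumed) deterministic policy pi on the same made-up MDP by repeatedly applying the Bellman expectation backup; all numbers are illustrative.

```python
# Same made-up MDP; pi is an assumed fixed deterministic policy.
P = {("s0", "left"):  [("s0", 0.9), ("s1", 0.1)],
     ("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
     ("s1", "left"):  [("s0", 1.0)],
     ("s1", "right"): [("s1", 1.0)]}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 2.0}
gamma = 0.9

pi = {"s0": "right", "s1": "right"}  # the policy pi(s) being evaluated

# Iterative policy evaluation: apply the Bellman expectation backup
#   V(s) <- R(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) * V(s')
V = {"s0": 0.0, "s1": 0.0}
for _ in range(100):
    V = {s: R[(s, pi[s])] + gamma * sum(p * V[s2] for s2, p in P[(s, pi[s])])
         for s in V}
print(V)  # expected discounted future reward from each state under pi
```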


In RL, we are trying to learn the policy and the value function.

  • On-policy: Training on transitions or episodes produced by the target policy itself.
    1. The agent can pick its own actions
    2. The agent always follows its own policy
    3. The most straightforward setup
  • Off-policy: Training on a distribution of transitions or episodes produced by a different behavior policy rather than that produced by the target policy.
    1. The agent can't pick the actions that generated the data
    2. Learning with exploration, playing without exploration
    3. Learning from an expert (even an imperfect one)
    4. Learning from sessions (recorded data); a sketch contrasting an on-policy (SARSA) update with an off-policy (Q-learning) update follows this list
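
To make the distinction concrete, the sketch below contrasts an on-policy (SARSA) update, which bootstraps from the action the agent's own behavior policy actually takes next, with an off-policy (Q-learning) update, which bootstraps from the greedy target-policy action. The toy states, step size alpha, discount gamma, and epsilon are illustrative assumptions.

```python
import random

# Toy Q-table over two states and two actions; alpha, gamma, epsilon are
# illustrative hyperparameters.
ACTIONS = ["left", "right"]
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in ["s0", "s1"] for a in ACTIONS}

def epsilon_greedy(state):
    """Behavior policy: explore with probability epsilon, otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s2):
    # On-policy: bootstrap from the action a2 that the agent's own
    # (exploratory) behavior policy will actually take in s2.
    a2 = epsilon_greedy(s2)
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(s, a, r, s2):
    # Off-policy: bootstrap from the greedy (target-policy) action in s2,
    # regardless of which behavior policy produced the transition.
    best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Both updates consume the same kind of recorded transition (s, a, r, s'),
# e.g. one logged by an expert or replayed from a stored session.
sarsa_update("s0", "right", 1.0, "s1")
q_learning_update("s0", "right", 1.0, "s1")
print(Q)
```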