Reinforcement Learning

I. Introduction

Inspiration from the AlphaGo story:

  • Machines can beat humans not only physically but also intellectually.
  • AlphaGo opened a new era for reinforcement learning and artificial intelligence.

The ultimate goal of reinforcement learning is to find the optimal policy.

II. Basic Concepts

  • State: the status of the agent with respect to the environment.
  • State Space: the set of all states.

$$
S=\{s_i\}
$$

  • Action: for each state, the agent can choose among several actions: $a_1, a_2, \cdots, a_n$.
  • Action Space of a State: the set of all actions available at a state (a small code sketch follows the formula below).

$$
A(s_i)=\{a_i\}
$$
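
To make these two definitions concrete, here is a minimal Python sketch for a grid-world agent. The 3×3 grid size and the five action names are assumptions made for this illustration, not something fixed by the notes above.

```python
# Hypothetical example: state space and action space of a 3x3 grid world.
# The grid size and the action names are illustrative assumptions.

# State space S = {s_1, ..., s_9}: one state per cell of the grid.
states = [f"s{i}" for i in range(1, 10)]

# In this sketch every state offers the same five actions:
# move up / down / left / right, or stay in place.
def action_space(state):
    """Return A(s): the set of actions available at `state`."""
    return {"up", "down", "left", "right", "stay"}

print(states)              # nine states: ['s1', 's2', ..., 's9']
print(action_space("s1"))  # {'up', 'down', 'left', 'right', 'stay'}
```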

  • State Transition: when taking an action, the agent may move from one state to another.

$$
s_1\overset{a_1}{\longrightarrow} s_2
$$

  • Forbidden Area: an area that is either accessible but entered at a penalty, or simply inaccessible.
  • Tabular Representation of State Transition: using a table to describe the state transitions (this only works in the deterministic case).
  • State Transition Probability: using conditional probability to describe the state transitions (a code sketch follows), e.g.

$$
p(s_2|s_1, a_2)=1,\qquad p(s_i|s_1, a_2)=0\ \ \forall i\neq 2
$$
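
Both representations of state transitions can be written down directly as Python data structures. A minimal sketch, assuming a handful of states, actions, and probabilities that do not appear in the notes above:

```python
import random

# Deterministic state transitions as a table: (state, action) -> next state.
transition_table = {
    ("s1", "right"): "s2",
    ("s1", "down"):  "s4",
    ("s2", "left"):  "s1",
}

# Stochastic state transitions as conditional probabilities:
# (state, action) -> {next_state: p(next_state | state, action)}.
transition_prob = {
    ("s1", "right"): {"s2": 1.0},             # p(s2 | s1, right) = 1, all others 0
    ("s2", "down"):  {"s5": 0.8, "s2": 0.2},  # hypothetical: the move fails 20% of the time
}

def p(next_state, state, action):
    """Return p(next_state | state, action); unlisted transitions have probability 0."""
    return transition_prob.get((state, action), {}).get(next_state, 0.0)

def sample_next_state(state, action):
    """Draw a next state according to p(. | state, action)."""
    dist = transition_prob[(state, action)]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(transition_table[("s1", "right")])  # 's2'
print(p("s2", "s1", "right"))             # 1.0
print(p("s3", "s1", "right"))             # 0.0
print(sample_next_state("s2", "down"))    # 's5' with probability 0.8, 's2' with 0.2
```
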
  • Policy: tells the agent which action to take at each state; a tabular policy appears in the code sketches at the end of this section.

    • Deterministic Policy
    $\pi(a_1|s_1)=0$
    $\pi(a_2|s_1)=1$
    $\pi(a_3|s_1)=0$
    $\vdots$
    $\pi(a_{n-1}|s_1)=0$
    $\pi(a_n|s_1)=0$
    • Stochastic Policy
      $\pi(a_1|s_1)=0$
      $\pi(a_2|s_1)=0.5$
      $\pi(a_3|s_1)=0.5$
      $\vdots$
      $\pi(a_{n-1}|s_1)=0$
      $\pi(a_n|s_1)=0$
  • Tabular Representation of a Policy: a table whose entry in row $s$ and column $a$ is the probability $\pi(a|s)$ of taking action $a$ at state $s$.

  • Reward: a real number the agent obtains after taking an action; it acts as a human-machine interface through which we guide the agent toward the desired behavior.

    • Tabular Representation of Reward Transition: using a table to list the reward obtained for each state-action pair (deterministic case only).
    • Stochastic Reward Transition: described by a conditional probability, e.g.
    $p(r=-1|s_1, a_1)=0.5$
    $p(r\neq-1|s_1, a_1)=0.5$
  • Trajectory: a state-action-reward chain; the final code sketch at the end of this section samples one and computes its discounted return.

    $$
    s_1\mathop{\longrightarrow}\limits_{r=0}^{a_2}s_2\mathop{\longrightarrow}\limits_{r=0}^{a_3}s_3\cdots s_{n-1}\mathop{\longrightarrow}\limits_{r=1}^{a_{n-1}}s_n
    $$

    • Return of a Trajectory: the sum of all the rewards collected along the trajectory
    $\text{return}=0+0+0+\cdots+1$
    • Discounted Return: the sum of the rewards weighted by increasing powers of a discount factor $\gamma\in[0, 1)$

      discounted return = $r_1+\gamma r_2+\gamma^2 r_3+\cdots$
      • the sum remains finite even for an infinitely long trajectory (as long as $\gamma<1$)
      • $\gamma$ balances the near and far future rewards: a smaller $\gamma$ emphasizes near-future rewards, a larger $\gamma$ emphasizes far-future rewards
  • Episode
    When interacting with the environment following a policy, the agent may stop
    at some terminal states. The resulting trajectory is called an episode (or a
    trial).

    Episodic tasks: the interaction ends at terminal states, so every episode is a finite trajectory.
    Continuing tasks: tasks without terminal states, so the interaction with the environment never ends.

    • Converting episodic tasks to continuing tasks (the final sketch below uses the first option)
      • Treat the target state as a special absorbing state: once the agent reaches it, it never leaves, and all subsequent rewards are $r = 0$.
      • Treat the target state as a normal state covered by the policy: the agent may still leave it and collects $r = +1$ every time it enters the target state.
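
To ground the tabular representations of a policy and of the rewards, here is a minimal sketch. The states, actions, probabilities, and reward values are assumptions chosen for illustration only.

```python
import random

# Hypothetical tabular representation of a stochastic policy:
# policy[s][a] = pi(a | s); each row sums to 1.
policy = {
    "s1": {"right": 0.5, "down": 0.5},
    "s2": {"down": 1.0},               # a deterministic row: one action has probability 1
    "s4": {"right": 1.0},
}

# Hypothetical tabular representation of the reward: reward[(s, a)] = r.
reward = {
    ("s1", "right"):  0,
    ("s1", "down"):   0,
    ("s2", "down"):  -1,   # e.g. this action enters a forbidden area
    ("s4", "right"): +1,   # e.g. this action reaches the target state
}

def sample_action(state):
    """Draw an action a with probability pi(a | state)."""
    dist = policy[state]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(sample_action("s1"))      # 'right' or 'down', each with probability 0.5
print(reward[("s4", "right")])  # 1
```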
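
Finally, a self-contained sketch that ties the concepts together: an agent follows a policy along a small chain of states, the resulting state-action-reward trajectory is one episode, and its discounted return is computed. The target state is treated as an absorbing state with zero subsequent reward, i.e. the first conversion described above. The chain, the discount factor, and all reward values are assumptions made for the example.

```python
import random

GAMMA = 0.9        # discount factor gamma in [0, 1)
TARGET = "s4"      # the target state, treated as absorbing

# A 4-state chain s1 -> s2 -> s3 -> s4 with deterministic transitions.
transitions = {("s1", "right"): "s2",
               ("s2", "right"): "s3",
               ("s3", "right"): "s4",
               (TARGET, "stay"): TARGET}   # absorbing: the agent never leaves

rewards = {("s1", "right"): 0,
           ("s2", "right"): 0,
           ("s3", "right"): 1,             # +1 for entering the target state
           (TARGET, "stay"): 0}            # every reward after absorption is 0

# A deterministic policy written in the general stochastic form pi(a | s).
policy = {"s1": {"right": 1.0},
          "s2": {"right": 1.0},
          "s3": {"right": 1.0},
          TARGET: {"stay": 1.0}}

def sample_episode(start, max_steps=10):
    """Follow the policy from `start`; return the state-action-reward chain."""
    trajectory, state = [], start
    for _ in range(max_steps):
        dist = policy[state]
        action = random.choices(list(dist), weights=list(dist.values()))[0]
        r = rewards[(state, action)]
        trajectory.append((state, action, r))
        state = transitions[(state, action)]
        if state == TARGET:                # terminal state reached: the episode ends
            break
    return trajectory

def discounted_return(trajectory, gamma=GAMMA):
    """Compute r_1 + gamma*r_2 + gamma^2*r_3 + ... along a trajectory."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))

episode = sample_episode("s1")
print(episode)                     # [('s1','right',0), ('s2','right',0), ('s3','right',1)]
print(discounted_return(episode))  # 0 + 0.9*0 + 0.9**2*1 ≈ 0.81
```

Viewed as a continuing task, the agent would keep choosing 'stay' at the target and collect $r = 0$ forever, so the discounted return of the episode is unchanged.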
