Reinforcement Learning
I. Introduction
Inspiration from the AlphaGo story:
- Machines can now beat humans not only physically but also intellectually.
- It opened a new era for reinforcement learning and artificial intelligence.
II. Basic Concepts
- State: the status of the agent with respect to the environment.
- State Space: the set of all states.
$$
S=\{s_i\}
$$
- Action: For each state, there are some actions: $a_1, a_2, \cdots, a_n$.
- Action Space of A State: the set of all possible actions of a state.
$$
A(s_i)=\{a_i\}
$$
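To make these definitions concrete, here is a minimal Python sketch; the 3x3 grid size and the five actions are illustrative assumptions, not something fixed by the notes.

```python
# Hypothetical 3x3 grid world, used only to illustrate the definitions above.
GRID_ROWS, GRID_COLS = 3, 3

# State space S = {s_i}: every cell of the grid is a state.
state_space = [(row, col) for row in range(GRID_ROWS) for col in range(GRID_COLS)]

# Action space A(s_i) = {a_i}: the set of actions available in a state.
# Here every state shares the same five actions: move in four directions or stay.
ACTIONS = ["up", "down", "left", "right", "stay"]

def action_space(state):
    """Return the actions available in `state` (identical for all states in this sketch)."""
    return list(ACTIONS)

print(len(state_space))        # 9 states
print(action_space((0, 0)))    # ['up', 'down', 'left', 'right', 'stay']
```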
- State Transition: when taking an action, the agent may move from one state to another.
$$
s_1\overset{a_1}{\longrightarrow} s_2
$$
- Forbidden Area: a forbidden state may be accessible but entered with a penalty, or it may be inaccessible altogether.
- Tabular Representation of State Transition: using a table to describe the state transition (this only works for the deterministic case).
- State Transition Probability: use conditional probabilities $p(s'|s, a)$ to describe the state transition; a sketch of both representations follows below.
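Below is a small sketch of both representations under the same assumed grid world; the layout and the deterministic dynamics are illustrative, not prescribed by the notes.

```python
# Deterministic state transition stored as a function that plays the role of the table:
# it maps (state, action) to the unique next state in a hypothetical 3x3 grid world.
GRID_ROWS, GRID_COLS = 3, 3
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stay": (0, 0)}

def next_state(state, action):
    """Move one cell in the chosen direction; bumping into the boundary leaves the state unchanged."""
    row, col = state
    d_row, d_col = MOVES[action]
    target = (row + d_row, col + d_col)
    if 0 <= target[0] < GRID_ROWS and 0 <= target[1] < GRID_COLS:
        return target
    return state

# State transition probability p(s' | s, a): for a deterministic world it is simply
# 1 for the unique successor state and 0 for every other state.
def transition_prob(s_next, state, action):
    return 1.0 if s_next == next_state(state, action) else 0.0

print(next_state((0, 0), "right"))               # (0, 1)
print(transition_prob((0, 1), (0, 0), "right"))  # 1.0
print(transition_prob((2, 2), (0, 0), "right"))  # 0.0
```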
Policy: tells the agent what actions to take at a state.
- Deterministic Policy: at each state, exactly one action is taken with probability 1, e.g.
$$
\pi(a_1|s_1)=0,\quad \pi(a_2|s_1)=1,\quad \pi(a_3|s_1)=0,\quad \cdots,\quad \pi(a_{n-1}|s_1)=0,\quad \pi(a_n|s_1)=0
$$
- Stochastic Policy: each action is taken with some probability, e.g.
$$
\pi(a_1|s_1)=0,\quad \pi(a_2|s_1)=0.5,\quad \pi(a_3|s_1)=0.5,\quad \cdots,\quad \pi(a_{n-1}|s_1)=0,\quad \pi(a_n|s_1)=0
$$
Tabular Representation of A Policy: a table with one row per state and one column per action, whose entries are the probabilities $\pi(a|s)$ (see the sketch below).
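A minimal sketch of a tabular policy and of sampling actions from it; the state names, actions, and probabilities are illustrative assumptions (the stochastic row mirrors the example above).

```python
import random

# Tabular policy: one row per state, one entry per action, value = pi(a | s).
# A deterministic policy is the special case where a single entry of the row equals 1.
policy = {
    "s1": {"a1": 0.0, "a2": 0.5, "a3": 0.5},   # stochastic at s1
    "s2": {"a1": 0.0, "a2": 1.0, "a3": 0.0},   # deterministic at s2
}

def sample_action(pi, state):
    """Draw an action a with probability pi(a | state)."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

print(sample_action(policy, "s1"))   # 'a2' or 'a3', each with probability 0.5
print(sample_action(policy, "s2"))   # always 'a2'
```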
Reward: a real number obtained after taking an action. It acts as a human-machine interface through which we guide the agent toward desired behavior.
- Tabular Representation of Reward Transition: using a table to list the reward that follows each state-action pair (deterministic case only).
- Stochastic Reward Transition: in general, the reward is described by a conditional probability, e.g.
$$
p(r=-1|s_1, a_1)=0.5, \qquad p(r\neq -1|s_1, a_1)=0.5
$$
A sampling sketch is given below.
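A sampling sketch for such a stochastic reward; the table below instantiates the example above, with the "$r\neq-1$" case taken to be $r=0$ for concreteness, and the second entry purely illustrative.

```python
import random

# Reward distribution p(r | s, a) stored as a table:
# (state, action) -> list of (reward, probability) pairs.
reward_dist = {
    ("s1", "a1"): [(-1.0, 0.5), (0.0, 0.5)],   # p(r=-1|s1,a1)=0.5, the other case taken as r=0
    ("s1", "a2"): [(1.0, 1.0)],                # a deterministic reward: always +1
}

def sample_reward(state, action):
    """Draw a reward according to p(r | state, action)."""
    rewards, probs = zip(*reward_dist[(state, action)])
    return random.choices(rewards, weights=probs, k=1)[0]

print(sample_reward("s1", "a1"))   # -1.0 or 0.0, each with probability 0.5
print(sample_reward("s1", "a2"))   # always 1.0
```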
Trajectory: a state-action-reward chain
$$
s_1 \xrightarrow[r=0]{a_2} s_2 \xrightarrow[r=0]{a_3} s_3 \cdots s_{n-1} \xrightarrow[r=1]{a_{n-1}} s_n
$$
- Return of A Trajectory: the sum of all the rewards collected along the trajectory
$\text{Return}=0+0+0+\cdots+1$
Discounted Return: the sum of all the rewards, each weighted by a power of the discount factor $\gamma\in[0,1)$
$$
\text{discounted return} = r_1+\gamma r_2+\gamma^2 r_3+\cdots
$$
- the discounting keeps the sum finite even for an infinite trajectory (as long as $\gamma<1$ and the rewards are bounded)
- it balances the far and near future rewards: a smaller $\gamma$ emphasizes immediate rewards, a larger $\gamma$ emphasizes long-term rewards
A short computation sketch follows below.
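A short sketch computing both quantities for a reward sequence like the one in the trajectory above; the value $\gamma=0.9$ is an illustrative choice.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum the rewards, weighting the k-th reward by gamma**k (gamma=1 gives the plain return)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards collected along the trajectory in the notes: 0, 0, 0, ..., 1.
rewards = [0.0, 0.0, 0.0, 1.0]

print(discounted_return(rewards))              # plain return: 1.0
print(discounted_return(rewards, gamma=0.9))   # discounted return: 0.9**3 * 1 = 0.729
```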
Episode
When interacting with the environment following a policy, the agent may stop
at some terminal states. The resulting trajectory is called an episode (or a
trial).
- Convert episodic tasks to continuing tasks:
- Treat the target state as a special absorbing state: once the agent reaches an absorbing state, it never leaves it, and all subsequent rewards are $r = 0$ (a sketch of this option follows after this list).
- Treat the target state as a normal state with a policy: the agent can still leave the target state and gains $r = +1$ each time it enters the target state.
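Here is a sketch of the first conversion (absorbing target state) on a tiny chain environment; the chain, the random policy, and the fixed horizon are illustrative assumptions.

```python
import random

# Hypothetical 1-D chain with states 0..3; state 3 is the target.
TARGET = 3

def step(state, action):
    """One environment step, with the target treated as an absorbing state:
    once reached, the agent never leaves it and every subsequent reward is 0."""
    if state == TARGET:
        return state, 0.0                       # absorbing: stay forever, r = 0
    next_s = min(state + 1, TARGET) if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_s == TARGET else 0.0   # r = +1 when entering the target
    return next_s, reward

def generate_trajectory(start, policy, horizon=20):
    """Follow the policy for a fixed horizon; the episodic task now runs as a continuing one."""
    state, chain = start, []
    for _ in range(horizon):
        action = policy(state)
        next_s, reward = step(state, action)
        chain.append((state, action, reward))
        state = next_s
    return chain

trajectory = generate_trajectory(0, policy=lambda s: random.choice(["left", "right"]))
# Total reward is 1.0 if the target was reached within the horizon, otherwise 0.0,
# because the absorbing target yields no further reward once entered.
print(sum(r for _, _, r in trajectory))
```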