Model Notes - Reinforcement Learning
Contents
Value-based RL
DQN (Playing Atari with Deep Reinforcement Learning)
- Bellman equation for the optimal Q-function and the resulting regression target:
\[\begin{align}
y &= \mathbb{E}_{s^\prime\sim\mathcal{E}}\left[r+\gamma\cdot\max_{a^\prime} Q(s^\prime,a^\prime;\theta_{old})\right] \\
L(\theta_{new}) &= \mathbb{E}_{s,a\sim \rho(\cdot)}\left[\left(y-Q(s,a;\theta_{new})\right)^2\right] \\
\nabla L(\theta_{new}) &= -\mathbb{E}_{s,a\sim \rho(\cdot)}\left[\left(y-Q(s,a;\theta_{new})\right)\nabla Q(s,a;\theta_{new})\right] \\
\theta &\leftarrow \theta-\alpha\cdot \nabla L(\theta)
\end{align}\]
- Add a target network and experience replay on top of this update (see the sketch below).
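A minimal sketch of this update in PyTorch, assuming a small MLP Q-network over a discrete action space and a uniform replay buffer; the network sizes, hyperparameters, and helper names (`QNet`, `store`, `train_step`) are illustrative, not from the paper.

```python
# DQN-style update sketch: target network + experience replay (illustrative only).
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA = 0.99      # discount factor gamma
LR = 1e-3         # step size alpha
BATCH_SIZE = 32

class QNet(nn.Module):
    """Small MLP Q-network: state -> Q(s, a) for every discrete action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, s):
        return self.net(s)

state_dim, n_actions = 4, 2
q_net = QNet(state_dim, n_actions)        # theta_new (online network)
target_net = QNet(state_dim, n_actions)   # theta_old (target network)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.SGD(q_net.parameters(), lr=LR)

replay = deque(maxlen=10_000)             # experience replay buffer

def store(s, a, r, s_next, done):
    """Append one transition to the replay buffer."""
    replay.append((s, a, r, s_next, done))

def train_step():
    """One gradient step on a uniformly sampled minibatch."""
    if len(replay) < BATCH_SIZE:
        return
    batch = random.sample(replay, BATCH_SIZE)
    s, a, r, s_next, done = map(
        lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
    a = a.long()

    # y = r + gamma * max_a' Q(s', a'; theta_old); no bootstrap at terminal states.
    with torch.no_grad():
        y = r + GAMMA * target_net(s_next).max(dim=1).values * (1.0 - done)

    # L(theta_new) = E[(y - Q(s, a; theta_new))^2]
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)

    optimizer.zero_grad()
    loss.backward()    # autograd computes grad L(theta_new) from the equations above
    optimizer.step()   # theta <- theta - alpha * grad L(theta)

# Periodically sync the target network:
# target_net.load_state_dict(q_net.state_dict())
```

Keeping the target network fixed between periodic syncs makes the regression target `y` stationary for a while, which is what turns each step into an ordinary supervised regression problem.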
Policy-based RL
vanilla policy gradient
natural policy gradient
TRPO
PPO
GRPO
Actor-Critic based RL