Introduction

environment : dynamics of the problem
reward hypothesis : 어떤 goal 이든 cumulative reward를 maximizing하는 것으로 formalize할 수 있다
reward() : scalar feedback signal
return() : cumulative reward
value() : expected cumulative reward
action value() : conditional value on actions
Goal : maximize value by picking suitable actions
history() : the full length of observations, actions, rewards
agent의 요소 : agent state, policy, value function, model
agent state() : environment로 부터 얻은 observation()과 agent 내부의 정보를 조합한 것으로 agent가 action을 결정할 때 참조하는 정보
Markov property : where h is all possible histories
Markov decision process : agent state()에 history로 알 수 있는 정보가 다 포함되어 있는 decision process
- MDP는 tule로 나타낼 수 있음 (S, A, p, r, )
  - S : all possible states
  - A : all possible actions
  - : discount factor
- e.g. full history를 state로 정의하면 Markovian이 됨
POMDP : environment state != observation인 상황. observation은 Markovian이 아니지만 history에 대한 함수 agent state를 잘 정의하면 Markovian이 될 수도 있음
- e.g. where is a state update function
policy( or ) : mapping from agent states to actions
value function : return에 discount factor 를 도입해서
- optimal value :
model : environment가 다음에 뭘 할지(next state, next reward) 예측
- e.g. ,
agent categories
- Value based : value function 기반
- Policy based : policy 기반
- Actor Critic : policy, value function 기반
agent categories
- model free : model 필요없음
- model based : model 필요
subproblems of reinforcement learning
- prediction : evaluate the future for a given policy
- control : find the best policy
- learning : environment를 모르는 상태에서 시작하는 문제로 optimal policy나 model을 학습
- planning : environment의 model이 주어졌거나 학습되었을 때 external interaction없이 model을 이용해 planning하는 것
learning agent components : policies value functions, models, state update functions(state updates)

Exploration and Exploitation

Exploration : increase knowledge
Exploitation : maximize performance based on current knowledge
regret :
- total regret :
- large action regrets 의 action counts 를 줄이는게 좋다
Lai & Robbins theorem에 따르면 아무리 지능적인 학습 알고리즘도 exploration을 해야하기 때문에 total regret이 에 비례하여 누적될 수 밖에 없음

Policies

policy gradient based on action preferences

action preferences : learnable parameters
How to compute this gradient : Log-likelihood trick

UCB (Upper Confidence Bound)

어떤 action value의 uncertainty가 클수록 exploration을 해봐야하고 uncertainty가 크다는 것은 가 낮다는 것
Upper confidence 는 를 높은 확률로 만족하는 값을 의미함
UCB를 최대로 만드는 action을 선택하는 것도 policy가 될 수 있음
action 마다 reward가 바로 확정되는 경우에는 Hoeffding’s inequality를 사용해서 upper confidence를 계산할 수 있음 e.g. multi-armed bandit problem

Bayesian approaches

Thompson sampling

Markov Decision Processes and Dynamic Programming

모든 MDP 문제는 하나 이상의 deterministic optimal policy가 존재함
Bellman equations
Bellman optimality equations
policy evaluation (prediction) : estimating
control (policy optimization) : estimating
Bellman equation in matrix form
Bellman optimality equation은 non-linear라서 바로 해를 구할 수가 없고 다른 방법을 사용해야함

Dynamic programming을 이용한 풀이법

DP를 이용한 control 방법은 policy evaluation과 policy improvement로 구성됨
policy iteration (policy evaluation + policy improvement)
- policy evaluation
  - 모든 에 대해서 가 될 때까지 진행
- policy improvement
  - generate
value iteration : iteration step = 1인 policy evaluation이면서 policy improvement를 한 번에 해결
- 모든 에 대해서 가 될 때까지 진행하면 가 찾아짐
Asynchronous dynamic programming : 모든 states에 대해서 진행하는게 아니라 일부 state에 대해서만 update를 진행하는 방법
- In-place dynamic programming : old, new policy 구분 없이 in-place로 update 진행
- prioritized sweeping : Bellman error를 계산해서 backup한 후에 큰 순으로 state selection 이뤄지도록 하는 방법
- real-time dynamic programming : 현재 agent와 관련이 있는 states에 대해서만 update 진행 e.g. current state 근처의 states
backup 방법
- full-width backups : BFS처럼 모든 states와 actions을 저장
- sample backups : sample한 대로 을 저장

Theoretical Fundamentals of Dynamic Programming

Normed vector spaces : vector space + norm on the elements of
norms : 0보다 크거나 같아야되고 norm이 0이면 x도 영벡터. 삼각부등식 성립. homogeneity 성립(.
-contraction mapping :
- 만약 이면 non-expanding mapping이라고 함
- 수열 이 로 수렴하면 도 로 수렴함
a vector x is a fixed point of an operator :
Banach fixed point theorem : complete normed vector space와 -contraction mapping 에 대해서 는 unique fixed point x를 가지고 가 x로 수렴함
Bellman optimality operator : 어떤 value function 에 대해서 적용하면 현재 상태에서 가장 최적의 행동을 했을 때 얻을 수 있는 기대 보상의 합을 반환함
- 는 unique fixed point를 갖고 -contraction mapping이며 monotonic 함 ()

지식정리 - DeepMind x UCL RL

Contents

Introduction

Exploration and Exploitation

Policies

policy gradient based on action preferences

UCB (Upper Confidence Bound)

Bayesian approaches

Thompson sampling

Markov Decision Processes and Dynamic Programming

Dynamic programming을 이용한 풀이법

Theoretical Fundamentals of Dynamic Programming

Title

Title

Title

Title