๐Ÿฆ‍๐Ÿ”ฅ๊ฐ•ํ™”ํ•™์Šต[Reinforcement Learning]

 

 

 

🤔 What is Reinforcement Learning?

"Sequential decision-making" in which the Agent takes Actions that maximize its cumulative reward 💰 in the environment.

∙ Sequential Decision Process: the reward may only arrive after several steps, so what matters is the cumulated rewards
state → action & reward  →  state  →  ...
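A minimal sketch of this interaction loop, assuming a hypothetical `env` object with Gym-style `reset()`/`step()` methods and a purely random agent:

```python
import random

def run_episode(env, actions, max_steps=100):
    """Roll out one episode: state -> action & reward -> state -> ...
    `env` is a hypothetical environment exposing reset() and step(action)."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = random.choice(actions)          # a (very naive) random agent
        state, reward, done = env.step(action)   # environment returns the next state
        total_reward += reward                   # accumulate the rewards
        if done:
            break
    return total_reward
```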


๐Ÿ” Exploitation & Exploration

Exploitation: ๊ฒฝํ—˜๊ธฐ๋ฐ˜ ์ตœ์„ ํ–‰๋™ (short term benefit)
Exploration: ์•ˆ์•Œ๋ ค์ง„ํ–‰๋™์„ ์‹œ๋„, ์ƒˆ๋กœ์šด์ •๋ณด๋ฅผ ํš๋“ (long term benefit)
ex) Exploitation: ๋Š˜ ๊ฐ€๋˜ ๋‹จ๊ณจ์‹๋‹น์— ๊ฐ„๋‹ค. / Exploration: ์•ˆ ๊ฐ€๋ณธ ์‹๋‹น์— ๊ฐ€๋ณธ๋‹ค
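A minimal ε-greedy sketch, assuming the estimated action values are kept in a dictionary `q_values` (names are illustrative):

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """With probability epsilon explore (random action);
    otherwise exploit (pick the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.choice(actions)                          # exploration
    return max(actions, key=lambda a: q_values.get(a, 0.0))    # exploitation
```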



๐Ÿ” Reward Hypothesis: ๊ฐ•ํ™”ํ•™์Šต์ด๋ก ์˜ ๊ธฐ๋ฐ˜.

Agent๊ฐ€ ์–ด๋–ค Action์„ ์ทจํ–ˆ์„ ๋•Œ, ํ•ด๋‹น Action์ด ์–ผ๋งˆ๋‚˜ ์ข‹์€ Action์ธ์ง€ ์•Œ๋ ค์ฃผ๋Š” "feedback signal"
์ ์ ˆํ•œ ๋ณด์ƒํ•จ์ˆ˜๋กœ "Agent๋Š” ๋ˆ„์ ๋ณด์ƒ ๊ธฐ๋Œ“๊ฐ’ ์ตœ๋Œ€ํ™”(Cumulated Rewards Maximization)"์˜ ๋ชฉํ‘œ๋ฅผ ๋‹ฌ์„ฑํ•œ๋‹ค.



๐Ÿ” RL์˜ ๊ถ๊ทน์ ์ธ ๋ชฉํ‘œ๋Š”?

Model: Markov Decision Process
๊ถ๊ทน์ ์ธ ๋ชฉํ‘œ: ํ•ด๋‹น ๋ชฉํ‘œ(= max cum_Reward)๋ฅผ ์œ„ํ•œ Optimal Policy๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ž„.
cf) ๊ฐ•ํ™”ํ•™์Šต์˜ supervision์€ ์‚ฌ๋žŒ์ด reward๋ฅผ ์ฃผ๋Š”๊ฒƒ.





🤔 Markov Property

๐Ÿ” MDP (Markov Decision Process)

Markov Property: ํ˜„์žฌ state๋งŒ์œผ๋กœ ๋ฏธ๋ž˜๋ฅผ ์˜ˆ์ธก (๊ณผ๊ฑฐ state์— ์˜ํ–ฅX
ex) ์˜ค๋Š˜์˜ ๋‚ ์”จ -> ๋‚ด์ผ์˜ ๋‚ ์”จ
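In standard notation (the usual textbook definition, not specific to this post), the next state depends only on the current state:

```latex
P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \dots, S_t]
```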


RL assumes the Markov Property; in particular, when time advances in discrete steps it assumes a "Markov Chain"!

In other words, a Markov process is a discrete-time process that satisfies the Markov property, and such a Markov process is called a Markov chain.
cf) Discrete time: time advances in discrete steps
cf) Stochastic process: a process in which the probability of an event occurring changes over time.
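A tiny sketch of the weather example as a Markov chain; the transition probabilities below are made up purely for illustration:

```python
import random

# Hypothetical transition matrix: P(tomorrow's weather | today's weather)
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_weather(today):
    """Sample tomorrow's weather from today's state only (Markov property)."""
    states, probs = zip(*transitions[today].items())
    return random.choices(states, weights=probs)[0]

weather = "sunny"
for _ in range(5):
    weather = next_weather(weather)  # the chain never looks at earlier days
```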


An MDP (Markov Decision Process) is defined as the Tuple <S, A, P, R, γ>.
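As a sketch of what the five components are, here is a toy two-state MDP with made-up numbers (names and values are illustrative only):

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list      # S: set of states
    actions: list     # A: set of actions
    P: dict           # P[(s, a)] -> {s': transition probability}
    R: dict           # R[(s, a)] -> immediate reward
    gamma: float      # γ: discount factor

toy = MDP(
    states=["s0", "s1"],
    actions=["left", "right"],
    P={("s0", "right"): {"s1": 1.0}, ("s0", "left"): {"s0": 1.0},
       ("s1", "right"): {"s1": 1.0}, ("s1", "left"): {"s0": 1.0}},
    R={("s0", "right"): 1.0, ("s0", "left"): 0.0,
       ("s1", "right"): 0.0, ("s1", "left"): 0.0},
    gamma=0.9,
)
```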


๐Ÿ” Episode์™€ Return

Episode: Start State ~ Terminal State

Return: Episode ์ข…๋ฃŒ์‹œ ๋ฐ›๋Š” ๋ชจ๋“  Reward
∴ Maximize Return = Agent์˜ Goal
∴ Maximize Cumulated Reward Optimal Policy = RL์˜ Goal
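For a finished episode the return is just a sum over the recorded rewards; a minimal sketch:

```python
def episode_return(rewards):
    """Undiscounted return of one finished episode: G = r1 + r2 + ... + rT."""
    return sum(rewards)

episode_return([0, 0, 1, 0, 5])  # -> 6
```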






๐Ÿ” Continuing Task์˜ Return...?

๋ฌดํ•œ๊ธ‰์ˆ˜์™€ Discounting Factor(γ)๋ฅผ ์ด์šฉ.
--> γ=0: ๋‹จ๊ธฐ๋ณด์ƒ๊ด€์‹ฌ Agent
--> γ=1: ์žฅ๊ธฐ๋ณด์ƒ๊ด€์‹ฌ Agent
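A short sketch of how γ weights the same reward stream (values are illustrative; with γ < 1 the infinite series stays finite because γ^k shrinks geometrically):

```python
def discounted_return(rewards, gamma):
    """G = r1 + gamma*r2 + gamma^2*r3 + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1, 1, 1, 1, 1]
discounted_return(rewards, gamma=0.0)  # 1.0  -> only the immediate reward counts
discounted_return(rewards, gamma=0.9)  # ~4.1 -> future rewards still matter
discounted_return(rewards, gamma=1.0)  # 5.0  -> all rewards count equally
```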






๐Ÿ” Discounted Return 

ํ˜„์žฌ ๋ฐ›์€ reward์™€ ๋ฏธ๋ž˜์— ๋ฐ›์„ reward์— ์ฐจ์ด๋ฅผ ๋‘๊ธฐ ์œ„ํ•ด discount factor γ∈[0,1]๋ฅผ ๊ณ ๋ คํ•œ ์ด๋“(Return)
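In the usual notation (standard definition, consistent with the sketch above):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```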

 

 

 

 

🤔 Policy and Value

๐Ÿ” Policy & Value Function

Policy: ์–ด๋–ค state์—์„œ ์–ด๋–ค action์„ ์ทจํ• ์ง€ ๊ฒฐ์ •ํ•˜๋Š” ํ•จ์ˆ˜
Value Function: Return์„ ๊ธฐ์ค€, ๊ฐ state๋‚˜ action์ด ์–ผ๋งˆ๋‚˜ "์ข‹์€์ง€" ์•Œ๋ ค์ฃผ๋Š” ํ•จ์ˆ˜ (์ฆ‰, reward → return value)

Deterministic Policy: one state - one action (ํ•™์Šต์ด ๋๋‚ฌ์„๋•Œ ๋„๋‹ฌํ•˜๋Š” ์ƒํƒœ.)
Stochastic Policy: one state → ์–ด๋–ค action? ์ทจํ• ์ง€ ํ™•๋ฅ ์„ ์ด์šฉ. (ํ•™์Šต์— ์ ์ ˆ.)

State-Value Function: ์ •์ฑ…์ด ์ฃผ์–ด์งˆ๋•Œ, ์ž„์˜์˜ s์—์„œ ์‹œ์ž‘, ๋๋‚ ๋•Œ๊นŒ์ง€ ๋ฐ›์€ return G์˜ ๊ธฐ๋Œ“๊ฐ’
Action-Value Function: state์—์„œ ์–ด๋–ค action์ด ์ œ์ผ ์ข‹์€๊ฐ€๋ฅผ action์— ๋Œ€ํ•œ value๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์œผ๋กœ Q-Value๋ผ๊ณ ๋„ ํ•จ.
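A minimal sketch of the two policy types over the toy states/actions used earlier (names are illustrative):

```python
import random

# Deterministic policy: a lookup table, exactly one action per state
det_policy = {"s0": "right", "s1": "left"}

# Stochastic policy: a probability distribution over actions per state
sto_policy = {"s0": {"left": 0.2, "right": 0.8},
              "s1": {"left": 0.5, "right": 0.5}}

def act_deterministic(policy, state):
    return policy[state]

def act_stochastic(policy, state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]
```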




๐Ÿ” Bellman Equation

Bellman Equationepisode๋ฅผ ๋‹ค ์™„๋ฃŒ์•ˆํ•˜๊ณ  state๊ฐ€ ์ข‹์€์ง€ ์˜ˆ์ธกํ•  ์ˆ˜ ์—†์„๊นŒ?
State-value Bellman Equation: ์ฆ‰๊ฐ์ ์ธ reward์™€ discount factor๋ฅผ ๊ณ ๋ คํ•œ ๋ฏธ๋ž˜ state values๋ฅผ ํ•ฉํ•œ ์‹
Action-value Bellman Equation: ์ฆ‰๊ฐ์ ์ธ reward์™€ discount factor๋ฅผ ๊ณ ๋ คํ•œ ๋ฏธ๋ž˜ action values๋ฅผ ํ•ฉํ•œ ์‹


Bellman Expectation Equation: given a policy, the equation for computing the value of a particular state (or state-action pair).
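In standard notation (the usual form of these equations, not specific to this post):

```latex
v_\pi(s)   = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]
q_\pi(s,a) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]
```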


Bellman Optimality Equation: since RL's goal is to find the optimal policy with the maximum reward, this form takes the max over actions rather than an expectation under a fixed policy.
Optimal Policy
= the policy the Agent follows once it has achieved its Goal
= the policy obtained by choosing, in each state, the best (highest-value) action
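The standard optimality form looks like this:

```latex
v_*(s)   = \max_a \mathbb{E}\!\left[ R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a \right]
q_*(s,a) = \mathbb{E}\!\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a \right]
```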




๐Ÿ” Value Function Estimation (Planning๊ณผ Learning)

Planning: ๋ชจ๋ธ์„ ์•Œ๊ณ  ์ด๋ฅผ ์ด์šฉํ•ด ์ตœ์ ์˜ action์„ ์ฐพ๋Š”๊ฒƒ - ex) Dynamic Programming
Learning: Sample base๋กœ (= ๋ชจ๋ธ์„ ๋ชจ๋ฅด๊ณ ) ํ•™์Šตํ•˜๋Š” ๊ฒƒ -ex) MC, TD
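As one concrete example on the learning side, a minimal sketch of the tabular TD(0) update (one common sample-based estimator; variable names and step size are illustrative):

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma*V(s').
    No model (P, R) is needed -- only the single sampled transition."""
    target = reward + gamma * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
    return V

V = {}
V = td0_update(V, "s0", reward=1.0, next_state="s1")  # learn from one sample
```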

 








