# Reinforcement Learning (Azhar)


Azhar Aulia Saputra, October 20th, 2018

## So far: Supervised Learning

Data: (x, y); x is data, y is label.
Goal: learn a function to map x -> y.
Examples: classification, regression, object detection, semantic segmentation, image captioning, etc.

(Figure: cat classification example; image CC0 public domain. Slides in this section adapted from Fei-Fei Li, Justin Johnson & Serena Yeung, Stanford CS231n Lecture 14, May 2017.)

## So far: Unsupervised Learning

Data: x. Just data, no labels!
Goal: learn some underlying hidden structure of the data.
Examples: clustering, dimensionality reduction, feature learning, density estimation, etc.

(Figures: 1-d and 2-d density estimation; images CC0 public domain.)

## Today: Reinforcement Learning

Problems involving an agent interacting with an environment, which provides numeric reward signals.

Goal: learn how to take actions in order to maximize reward.

## Overview

- What is Reinforcement Learning?
- Markov Decision Processes
- Q-Learning
- Policy Gradients

## Reinforcement Learning

The agent-environment loop: at each time step, the environment sends the agent a state $s_t$; the agent chooses an action $a_t$; the environment returns a reward $r_t$ and the next state $s_{t+1}$, and the loop repeats.

## Cart-Pole Problem

Objective: balance a pole on top of a movable cart.

- State: angle, angular speed, position, horizontal velocity
- Action: horizontal force applied to the cart
- Reward: 1 at each time step if the pole is upright

(Image CC0 public domain.)

## Robot Locomotion

Objective: make the robot move forward.

- State: angle and position of the joints
- Action: torques applied to the joints
- Reward: 1 at each time step for staying upright and moving forward

## Atari Games

Objective: complete the game with the highest score.

- State: raw pixel inputs of the game state
- Action: game controls, e.g. left, right, up, down
- Reward: score increase/decrease at each time step

## Go

Objective: win the game!

- State: position of all pieces
- Action: where to put the next piece down
- Reward: 1 if we win at the end of the game, 0 otherwise

(Image CC0 public domain.)

## How can we mathematically formalize the RL problem?

We need a formal model of the loop above: state $s_t$, action $a_t$, reward $r_t$, next state $s_{t+1}$.

## Markov Decision Process

- Mathematical formulation of the RL problem
- Markov property: the current state completely characterises the state of the world

Defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$:

- $\mathcal{S}$: set of possible states
- $\mathcal{A}$: set of possible actions
- $\mathcal{R}$: distribution of reward given a (state, action) pair
- $\mathbb{P}$: transition probability, i.e. distribution over the next state given a (state, action) pair
- $\gamma$: discount factor

## Markov Decision Process

- At time step $t=0$, the environment samples an initial state $s_0 \sim p(s_0)$
- Then, for $t=0$ until done:
  - The agent selects an action $a_t$
  - The environment samples a reward $r_t \sim R(\,\cdot \mid s_t, a_t)$
  - The environment samples the next state $s_{t+1} \sim P(\,\cdot \mid s_t, a_t)$
  - The agent receives the reward $r_t$ and the next state $s_{t+1}$

A policy $\pi$ is a function from $\mathcal{S}$ to $\mathcal{A}$ that specifies what action to take in each state.

Objective: find the policy $\pi^*$ that maximizes the cumulative discounted reward $\sum_{t \ge 0} \gamma^t r_t$.
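The sampling loop above can be sketched directly in code. This is a minimal stdlib-only illustration with a made-up two-state MDP; the states, rewards, and transition table are invented for the example and are not from the slides:

```python
import random

# A toy MDP: transitions[state][action] = list of (probability, next_state, reward).
# "terminal" marks the end of an episode. All numbers here are made up.
transitions = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.9, "B", 1.0), (0.1, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 0.0)], "go": [(1.0, "terminal", 10.0)]},
}

def sample_episode(policy, gamma=0.9, max_steps=100, rng=random):
    """Roll out one episode and return its discounted return sum_t gamma^t * r_t."""
    state, ret, discount = "A", 0.0, 1.0           # s0 is fixed to "A" here
    for _ in range(max_steps):
        action = policy(state)                      # agent selects a_t
        outcomes = transitions[state][action]
        probs = [p for p, _, _ in outcomes]
        _, state, reward = rng.choices(outcomes, weights=probs)[0]
        ret += discount * reward                    # accumulate gamma^t * r_t
        discount *= gamma
        if state == "terminal":
            break
    return ret

random.seed(0)
print(sample_episode(lambda s: "go"))
```

A policy here is just a function from state to action, matching the definition above.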

## A simple MDP: Grid World

- actions = {right, left, up, down}
- states: the cells of the grid; the starred cells are terminal states
- Set a negative "reward" for each transition (e.g. r = -1)

Objective: reach one of the terminal states (greyed out) in the least number of actions.

## A simple MDP: Grid World

(Figure: the grid shown twice, with arrows under a random policy on the left and under the optimal policy on the right.)

## The optimal policy π*

We want to find the optimal policy $\pi^*$ that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probabilities, ...)? Maximize the *expected* sum of rewards! Formally:

$$\pi^* = \arg\max_\pi \; \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, \pi\right]$$

with $s_0 \sim p(s_0)$, $a_t \sim \pi(\,\cdot \mid s_t)$, $s_{t+1} \sim p(\,\cdot \mid s_t, a_t)$.

## Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$

How good is a state? The value function at state $s$ is the expected cumulative reward from following the policy from state $s$:

$$V^\pi(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right]$$

How good is a state-action pair? The Q-value function at state $s$ and action $a$ is the expected cumulative reward from taking action $a$ in state $s$ and then following the policy:

$$Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$$

## Bellman equation

The optimal Q-value function $Q^*$ is the maximum expected cumulative reward achievable from a given (state, action) pair:

$$Q^*(s, a) = \max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$$

$Q^*$ satisfies the following Bellman equation:

$$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a\right]$$

Intuition: if the optimal state-action values for the next time step, $Q^*(s', a')$, are known, then the optimal strategy is to take the action that maximizes the expected value of $r + \gamma Q^*(s', a')$.

The optimal policy $\pi^*$ corresponds to taking the best action in any state as specified by $Q^*$.

## Solving for the optimal policy

Value iteration algorithm: use the Bellman equation as an iterative update:

$$Q_{i+1}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q_i(s', a') \,\middle|\, s, a\right]$$

$Q_i$ will converge to $Q^*$ as $i \to \infty$.

What's the problem with this? It is not scalable: we must compute $Q(s, a)$ for every state-action pair. If the state is, e.g., the current game's raw pixels, it is computationally infeasible to cover the entire state space!

Solution: use a function approximator to estimate $Q(s, a)$, e.g. a neural network!
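In the tabular setting, the iterative Bellman update is a few lines of code. A sketch on an invented 4-cell corridor (the environment is made up for illustration, not from the slides):

```python
# Tabular value iteration on a tiny 4-cell corridor: states 0..3, state 3 is
# terminal, every move costs -1, and moving off the left edge keeps you in cell 0.
GAMMA = 0.9
STATES = [0, 1, 2]                # non-terminal states
ACTIONS = {"left": -1, "right": +1}
TERMINAL = 3

def step(s, a):
    """Deterministic transition: returns (next state, reward)."""
    s2 = min(max(s + ACTIONS[a], 0), TERMINAL)
    return s2, -1.0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(100):              # Bellman update: Q <- r + gamma * max_a' Q(s', a')
    for s in STATES:
        for a in ACTIONS:
            s2, r = step(s, a)
            future = 0.0 if s2 == TERMINAL else max(Q[(s2, a2)] for a2 in ACTIONS)
            Q[(s, a)] = r + GAMMA * future

V = {s: max(Q[(s, a)] for a in ACTIONS) for s in STATES}
print(V)   # cell 2 is one step from the goal, cell 0 is three steps away
```

With only 3 states this converges almost immediately; the point of the slide is that this table blows up when the state is an image.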

## Solving for the optimal policy: Q-learning

Q-learning: use a function approximator to estimate the action-value function,

$$Q(s, a; \theta) \approx Q^*(s, a)$$

where $\theta$ are the function parameters (weights). If the function approximator is a deep neural network => deep Q-learning!

## Solving for the optimal policy: Q-learning

Remember: we want to find a Q-function that satisfies the Bellman equation.

Forward pass, loss function:

$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right], \quad \text{where} \quad y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a\right]$$

Backward pass, gradient update (with respect to the Q-function parameters θ):

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$$

Iteratively try to make the Q-value close to the target value $y_i$ it should have if the Q-function corresponded to the optimal $Q^*$ (and optimal policy $\pi^*$).
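One such forward/backward step can be sketched with a linear function approximator $Q(s, a; \theta) = \theta \cdot \phi(s, a)$ standing in for the deep network; the feature vectors, reward, and learning rate below are invented for illustration:

```python
# One Q-learning update with a linear approximator Q(s,a; theta) = theta . phi(s,a).
# Features, reward, and learning rate are made up for illustration.
GAMMA, ALPHA = 0.9, 0.1

def q_value(theta, phi):
    return sum(t * f for t, f in zip(theta, phi))

def q_update(theta, phi_sa, reward, next_phis):
    """Semi-gradient step: move Q(s,a) toward the target y = r + gamma * max_a' Q(s',a')."""
    target = reward + GAMMA * max(q_value(theta, p) for p in next_phis)
    td_error = target - q_value(theta, phi_sa)       # (y - Q(s,a))
    return [t + ALPHA * td_error * f for t, f in zip(theta, phi_sa)]

theta = [0.0, 0.0]
phi_sa = [1.0, 0.0]                      # phi(s, a) for the action actually taken
next_phis = [[0.0, 1.0], [0.5, 0.5]]     # phi(s', a') for each next action
new_theta = q_update(theta, phi_sa, reward=1.0, next_phis=next_phis)
print(q_value(new_theta, phi_sa))        # → 0.1: Q(s,a) moved toward the target
```

For the linear case $\nabla_\theta Q(s, a; \theta) = \phi(s, a)$, which is why the update multiplies the TD error by the feature vector.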

## Simple Implementation

Goal: get outside the house (room no. 5). How can we find the way outside from every possible room?

We can represent the rooms as a graph: each room is a node, and each door is a link. Then give a reward to every link direction (the links leading straight outside pay 100).

Put the state diagram and the instant reward values into a reward table, "matrix R", whose rows index the current state and whose columns index the next state.

## Simple Implementation: step by hand

- Set the learning parameter Gamma = 0.8, and the initial state to Room 1.
- Initialize matrix Q as a zero matrix.
- Look at the second row (state 1) of matrix R. There are two possible actions for the current state 1: go to state 3, or go to state 5. By random selection, we choose going to 5 as our action.
- Imagine what would happen if our agent were in state 5. Look at the sixth row of the reward matrix R (i.e. state 5). It has 3 possible actions: go to state 1, 4 or 5.

Update rule: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100

## Simple Implementation: step by hand

- For the next episode, we start with a randomly chosen initial state; this time, state 3.
- Look at the fourth row of matrix R; it has 3 possible actions: go to state 1, 2 or 4. By random selection, we choose going to state 1 as our action.
- Now imagine we are in state 1. Look at the second row of the reward matrix R (i.e. state 1). It has 2 possible actions: go to state 3 or state 5. Then we compute the Q value:

Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * Max(0, 100) = 80

## Simple Implementation: step by hand

If our agent keeps learning through further episodes, matrix Q finally converges; reading off the largest Q value in each row then gives the best route out of every room.
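The whole procedure fits in a few lines. The full reward matrix R is not reproduced in the text above, so the one below is reconstructed from the rooms and doors described in the walkthrough (the standard six-room example); -1 marks state pairs with no door:

```python
import random

# Reward matrix for the six-room example: R[s][s2] is the reward for moving from
# room s to room s2; -1 means there is no door. Reconstructed from the text
# (goal = state 5; doors leading into it pay 100).
R = [
    [-1, -1, -1, -1,  0, -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1, -1],
    [-1,  0,  0, -1,  0, -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
GOAL, GAMMA = 5, 0.8
Q = [[0.0] * 6 for _ in range(6)]

random.seed(0)
for _ in range(2000):                     # episodes from random start rooms
    state = random.randrange(6)
    while state != GOAL:                  # episode ends on reaching the goal
        action = random.choice([s2 for s2 in range(6) if R[state][s2] >= 0])
        # Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
        Q[state][action] = R[state][action] + GAMMA * max(Q[action])
        state = action

print(Q[1][5], Q[3][1])   # the two values computed by hand above: 100.0 and 80.0
```

The converged matrix matches the hand computation: Q(1, 5) = 100 and Q(3, 1) = 80.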

## Case Study: Playing Atari Games [Mnih et al., NIPS Deep Learning Workshop 2013; Nature 2015]

Objective: complete the game with the highest score.

- State: raw pixel inputs of the game state
- Action: game controls, e.g. left, right, up, down
- Reward: score increase/decrease at each time step

## Q-network Architecture [Mnih et al., NIPS Deep Learning Workshop 2013; Nature 2015]

$Q(s, a; \theta)$: a neural network with weights $\theta$.

Input: the current state $s_t$, an 84x84x4 stack of the last 4 frames (after RGB -> grayscale conversion, downsampling, and cropping).

Familiar conv and FC layers:

- 16 8x8 conv filters, stride 4
- 32 4x4 conv filters, stride 2
- FC-256
- FC-4 (Q-values)

The last FC layer has a 4-d output (if 4 actions), corresponding to Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4); the number of actions is between 4 and 18 depending on the Atari game.

A single feedforward pass computes the Q-values for all actions from the current state => efficient!

## Training the Q-network: Loss function (from before) [Mnih et al., NIPS Deep Learning Workshop 2013; Nature 2015]

Remember: we want a Q-function that satisfies the Bellman equation, so we minimize

$$L_i(\theta_i) = \mathbb{E}_{s, a}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right], \quad y_i = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1})\right],$$

taking gradient steps on θ that iteratively push each Q-value toward the target $y_i$ it should have if the Q-function corresponded to the optimal $Q^*$ (and optimal policy $\pi^*$).

## Training the Q-network: Experience Replay [Mnih et al., NIPS Deep Learning Workshop 2013; Nature 2015]

Learning from batches of consecutive samples is problematic:

- Samples are correlated => inefficient learning
- The current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay:

- Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}) as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples

Each transition can also contribute to multiple weight updates => greater data efficiency.
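A replay memory is just a bounded buffer with uniform random sampling; a minimal stdlib-only sketch (the capacity and batch size here are arbitrary choices, not values from the paper):

```python
import random
from collections import deque

class ReplayMemory:
    """Bounded table of (s, a, r, s_next) transitions with uniform sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the left

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # breaks sample correlation

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=3)
for t in range(5):                       # push 5 transitions into a capacity-3 buffer
    memory.push(t, "a", 0.0, t + 1)
print(len(memory))                       # → 3: the two oldest were evicted
batch = memory.sample(2)                 # a random minibatch for one gradient step
```

Sampling uniformly from this buffer, rather than replaying the last few frames, is what decorrelates the minibatches.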

## Putting it together: Deep Q-Learning with Experience Replay [Mnih et al., NIPS Deep Learning Workshop 2013; Nature 2015]

1. Initialize the replay memory and the Q-network.
2. Play M episodes (full games).
3. Initialize the state (starting game screen pixels) at the beginning of each episode.
4. For each timestep t of the game:
   - With small probability, select a random action (explore); otherwise select the greedy action from the current policy.
   - Take the action a_t, and observe the reward r_t and next state s_{t+1}.
   - Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay memory.
   - Experience replay: sample a random minibatch of transitions from the replay memory and perform a gradient descent step.

Video: https://www.youtube.com/watch?v=V1eYniJ0Rnk (by Károly Zsolnai-Fehér; reproduced with permission)

## Policy Gradients

What is a problem with Q-learning? The Q-function can be very complicated! Example: a robot grasping an object has a very high-dimensional state => hard to learn the exact value of every (state, action) pair.

But the policy can be much simpler: just close your hand. Can we learn a policy directly, e.g. find the best policy from a collection of policies?

## Policy Gradients

Formally, let's define a class of parametrized policies $\Pi = \{\pi_\theta,\ \theta \in \mathbb{R}^m\}$. For each policy, define its value:

$$J(\theta) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, \pi_\theta\right]$$

We want to find the optimal policy $\theta^* = \arg\max_\theta J(\theta)$. How can we do this? Gradient ascent on the policy parameters!

## REINFORCE algorithm

Mathematically, we can write:

$$J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[r(\tau)\right] = \int_\tau r(\tau)\, p(\tau; \theta)\, d\tau$$

where $r(\tau)$ is the reward of a trajectory $\tau = (s_0, a_0, r_0, s_1, \ldots)$.

## REINFORCE algorithm

Expected reward: $J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}[r(\tau)]$. Now let's differentiate this:

$$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau; \theta)\, d\tau$$

Intractable! The gradient of an expectation is problematic when $p$ depends on $\theta$. However, we can use a nice trick:

$$\nabla_\theta p(\tau; \theta) = p(\tau; \theta)\, \frac{\nabla_\theta p(\tau; \theta)}{p(\tau; \theta)} = p(\tau; \theta)\, \nabla_\theta \log p(\tau; \theta)$$

If we inject this back:

$$\nabla_\theta J(\theta) = \int_\tau \left(r(\tau)\, \nabla_\theta \log p(\tau; \theta)\right) p(\tau; \theta)\, d\tau = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[r(\tau)\, \nabla_\theta \log p(\tau; \theta)\right]$$

which we can estimate with Monte Carlo sampling.

## REINFORCE algorithm

Can we compute these quantities without knowing the transition probabilities? We have:

$$p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$

Thus:

$$\log p(\tau; \theta) = \sum_{t \ge 0} \left[\log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\right]$$

And when differentiating:

$$\nabla_\theta \log p(\tau; \theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

which doesn't depend on the transition probabilities! Therefore, when sampling a trajectory $\tau$, we can estimate $\nabla_\theta J(\theta)$ with

$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
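Concretely, for a softmax policy $\pi_\theta(a) \propto e^{\theta_a}$ the score $\nabla_\theta \log \pi_\theta(a)$ has the closed form onehot(a) - probs, so the estimator fits in a few lines. The toy run below uses one-step "trajectories" (a two-armed bandit with made-up rewards 1.0 and 0.0, not an example from the slides), which keeps the sketch stdlib-only:

```python
import math
import random

def softmax(theta):
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, a):
    """For a softmax policy, grad_theta log pi(a) = onehot(a) - probs."""
    probs = softmax(theta)
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(probs)]

random.seed(0)
theta = [0.0, 0.0]                 # policy parameters, one per action
rewards = [1.0, 0.0]               # made-up bandit: action 0 is better
lr = 0.1
for _ in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]          # sample a ~ pi_theta
    r = rewards[a]                                        # trajectory reward r(tau)
    g = grad_log_pi(theta, a)
    theta = [t + lr * r * gi for t, gi in zip(theta, g)]  # ascend r * grad log pi

print(softmax(theta)[0])           # probability of the better action, now near 1
```

Note the update never queries the bandit's reward probabilities: exactly the model-free property derived above.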

## Intuition

Gradient estimator: $\nabla_\theta J(\theta) \approx \sum_{t} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$

Interpretation:

- If $r(\tau)$ is high, push up the probabilities of the actions seen
- If $r(\tau)$ is low, push down the probabilities of the actions seen

It might seem simplistic to say that if a trajectory is good then all of its actions were good. But in expectation, it averages out!

However, this estimator also suffers from high variance, because credit assignment is really hard. Can we help the estimator?

## Variance reduction

Gradient estimator: $\nabla_\theta J(\theta) \approx \sum_{t} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$

First idea: push up the probability of an action seen only by the cumulative *future* reward from that state:

$$\nabla_\theta J(\theta) \approx \sum_{t} \left(\sum_{t' \ge t} r_{t'}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Second idea: use a discount factor $\gamma$ to ignore delayed effects:

$$\nabla_\theta J(\theta) \approx \sum_{t} \left(\sum_{t' \ge t} \gamma^{t' - t} r_{t'}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
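The discounted future reward weighting each action (the inner sum above) is usually computed in one backward pass over the episode's rewards:

```python
def returns_to_go(rewards, gamma):
    """G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}, computed right-to-left."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
        out[t] = running
    return out

print(returns_to_go([1.0, 1.0, 1.0], 0.5))   # → [1.75, 1.5, 1.0]
```

Each `out[t]` is the coefficient that multiplies $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ in the second estimator above.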

## Variance reduction: Baseline

Problem: the raw value of a trajectory isn't necessarily meaningful. For example, if rewards are all positive, you keep pushing up the probabilities of actions. What is important, then? Whether a reward is better or worse than what you expect to get.

Idea: introduce a baseline function dependent on the state. Concretely, the estimator is now:

$$\nabla_\theta J(\theta) \approx \sum_{t} \left(\sum_{t' \ge t} \gamma^{t' - t} r_{t'} - b(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

## How to choose the baseline?

A simple baseline: a constant moving average of the rewards experienced so far, across all trajectories.

The variance reduction techniques seen so far are what is typically used in "vanilla REINFORCE".
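A constant moving-average baseline is one line of state: subtract it from each trajectory's reward before the gradient step, then fold the reward into the average. A sketch (the decay 0.9 and the reward sequence are arbitrary choices for illustration):

```python
# Moving-average baseline for REINFORCE: the policy-gradient weight becomes
# (r - baseline) instead of the raw reward r. Decay 0.9 is arbitrary.
baseline, decay = 0.0, 0.9
weights = []
for r in [1.0, 1.0, 1.0, 0.0]:           # rewards from successive trajectories
    weights.append(r - baseline)          # gradient weight: reward minus baseline
    baseline = decay * baseline + (1.0 - decay) * r   # update the moving average
print(weights)
```

Even with all-positive rewards, repeated identical rewards shrink toward a zero weight, and a below-average reward (the final 0.0) gets a negative weight, which is exactly the behaviour the slide asks for.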

95. How to choose the baseline? A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? FAeiz-hFeairLAi &ulJiausStianpJuothrnason & Serena Yeung Lecture 14 - O8c7tobeMr,a2y02th32, 0210817

96. How to choose the baseline? A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function! FAeiz-hFeairLAi &ulJiausStianpJuothrnason & Serena Yeung Lecture 14 - O8c8tobeMr,a2y02th32, 0210817

97. How to choose the baseline? A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function! Intuitively, we are happy with an action a_t in a state s_t if Q^π(s_t, a_t) − V^π(s_t) is large. Conversely, we are unhappy with an action if it's small.

98. How to choose the baseline? A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function! Intuitively, we are happy with an action a_t in a state s_t if Q^π(s_t, a_t) − V^π(s_t) is large. Conversely, we are unhappy with an action if it's small. Using this, we get the estimator ∇_θ J(θ) ≈ Σ_{t≥0} (Q^π(s_t, a_t) − V^π(s_t)) ∇_θ log π_θ(a_t | s_t).

99. Actor-Critic Algorithm Problem: we don't know Q and V. Can we learn them? Yes, using Q-learning! We can combine Policy Gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function). - The actor decides which action to take, and the critic tells the actor how good its action was and how it should adjust - This also alleviates the critic's task, as it only has to learn the values of (state, action) pairs generated by the policy - Can also incorporate Q-learning tricks, e.g. experience replay - Remark: we can define the advantage function A^π(s, a) = Q^π(s, a) − V^π(s), which measures how much an action was better than expected

100. Actor-Critic Algorithm Initialize policy parameters θ, critic parameters φ For iteration = 1, 2, … do Sample m trajectories under the current policy For i = 1, …, m do For t = 1, …, T do End for
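The per-step updates inside the loop above were shown as equations on the slide. A minimal numpy sketch of one actor-critic update with linear (tabular) approximators and a softmax policy; the TD-error advantage estimate and all names are an illustrative choice, not the slide's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # actor (policy) parameters
phi = np.zeros(n_states)                 # critic (state-value) parameters
gamma, lr_actor, lr_critic = 0.99, 0.1, 0.1

def policy(s):
    """Softmax over the actor's logits for state s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def actor_critic_step(s, a, r, s_next, done):
    """One update: the critic's TD error doubles as the advantage estimate
    that tells the actor how much better the action was than expected."""
    v_next = 0.0 if done else phi[s_next]
    td_error = r + gamma * v_next - phi[s]      # advantage estimate
    phi[s] += lr_critic * td_error              # critic: move V toward target
    grad_log = -policy(s)                       # d log pi(a|s) / d theta[s]
    grad_log[a] += 1.0
    theta[s] += lr_actor * td_error * grad_log  # actor: scaled by advantage
    return td_error

s = 0
a = rng.choice(n_actions, p=policy(s))          # the actor picks the action
actor_critic_step(s, a, r=1.0, s_next=1, done=False)
```

A positive TD error raises both the value estimate for the state and the probability of the action just taken; a negative one lowers them.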

101. REINFORCE in action: Recurrent Attention Model (RAM) Objective: Image Classification. Take a sequence of “glimpses” selectively focusing on regions of the image, to predict the class - Inspiration from human perception and eye movements - Saves computational resources => scalability - Able to ignore clutter / irrelevant parts of the image State: Glimpses seen so far Action: (x, y) coordinates (center of glimpse) of where to look next in the image Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise [Mnih et al. 2014]

102. REINFORCE in action: Recurrent Attention Model (RAM) Objective: Image Classification. Take a sequence of “glimpses” selectively focusing on regions of the image, to predict the class - Inspiration from human perception and eye movements - Saves computational resources => scalability - Able to ignore clutter / irrelevant parts of the image State: Glimpses seen so far Action: (x, y) coordinates (center of glimpse) of where to look next in the image Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise Glimpsing is a non-differentiable operation => learn a policy for how to take glimpse actions using REINFORCE. Given the state of glimpses seen so far, use an RNN to model the state and output the next action. [Mnih et al. 2014]
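The glimpse loop described above can be sketched as a control loop. Everything here (network shapes, glimpse size, the Gaussian location policy, all names) is an illustrative stand-in for the actual RAM architecture; the point is only how REINFORCE wraps around the non-differentiable glimpse operation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, G, T = 32, 8, 5  # hidden size, glimpse side length, number of glimpses

def extract_glimpse(image, loc):
    """Non-differentiable crop centered near (x, y) in [0, 1]^2."""
    h, w = image.shape
    x = int(np.clip(loc[0], 0, 1) * (w - G))
    y = int(np.clip(loc[1], 0, 1) * (h - G))
    return image[y:y + G, x:x + G].ravel()

W_in = rng.normal(0, 0.1, (H, G * G))   # glimpse -> hidden
W_h = rng.normal(0, 0.1, (H, H))        # hidden -> hidden (RNN recurrence)
W_loc = rng.normal(0, 0.1, (2, H))      # hidden -> next-location mean

def run_episode(image):
    h = np.zeros(H)
    loc = np.array([0.5, 0.5])
    log_probs = []
    for _ in range(T):
        g = extract_glimpse(image, loc)
        h = np.tanh(W_in @ g + W_h @ h)        # RNN models the glimpse history
        mean = 1 / (1 + np.exp(-(W_loc @ h)))  # location policy: Gaussian mean
        loc = rng.normal(mean, 0.1)            # sample the next glimpse location
        # Gaussian log-density of the sampled action, up to a constant
        log_probs.append(-np.sum((loc - mean) ** 2) / (2 * 0.1 ** 2))
    return h, log_probs

h, log_probs = run_episode(rng.random((28, 28)))
# Classify from the final h; reward = 1 if correct, else 0. REINFORCE then
# scales the gradient of each log_probs[t] by that reward.
print(len(log_probs))  # -> 5
```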

103-107. REINFORCE in action: Recurrent Attention Model (RAM) [Mnih et al. 2014] (Figure, built up across five slides: an RNN takes the input image and emits glimpse locations (x1, y1) through (x5, y5), one per step; after the final glimpse, a softmax outputs the class prediction, y = 2 in the example.)

108. REINFORCE in action: Recurrent Attention Model (RAM) Has also been used in many other tasks including fine-grained image recognition, image captioning, and visual question-answering! [Mnih et al. 2014]

109. More policy gradients: AlphaGo Overview: - Mix of supervised learning and reinforcement learning - Mix of old methods (Monte Carlo Tree Search) and recent ones (deep RL) How to beat the Go world champion: - Featurize the board (stone color, move legality, bias, …) - Initialize the policy network with supervised training from professional Go games, then continue training using policy gradient (play against itself from random previous iterations, +1 / -1 reward for winning / losing) - Also learn a value network (critic) - Finally, combine the policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search [Silver et al., Nature 2016]

110. Summary - Policy gradients: very general, but suffer from high variance, so they require a lot of samples. Challenge: sample-efficiency - Q-learning: does not always work, but when it works, it is usually more sample-efficient. Challenge: exploration - Guarantees: - Policy Gradients: converges to a local minimum of J(θ), often good enough! - Q-learning: zero guarantees, since you are approximating the Bellman equation with a complicated function approximator
