Reinforcement Learning (Azhar)
1. Reinforcement Learning. Azhar Aulia Saputra, October 20th, 2018
2. So far… Supervised Learning. Data: (x, y), where x is data and y is a label. Goal: learn a function to map x -> y. Examples: classification, regression, object detection, semantic segmentation, image captioning, etc. [Figure: cat classification; image is CC0 public domain.] (Slides adapted from Fei-Fei Li, Justin Johnson & Serena Yeung, Lecture 14, May 23, 2017.)
3. So far… Unsupervised Learning. Data: x. Just data, no labels! Goal: learn some underlying hidden structure of the data. Examples: clustering, dimensionality reduction, feature learning, density estimation, etc. [Figures: 1-d and 2-d density estimation; images are CC0 public domain.]
4. Today: Reinforcement Learning. Problems involving an agent interacting with an environment, which provides numeric reward signals. Goal: learn how to take actions in order to maximize reward.
5. Overview: What is Reinforcement Learning? Markov Decision Processes. Q-Learning. Policy Gradients.
6. Reinforcement Learning. [Diagram: Agent and Environment.]
7. Reinforcement Learning. [Diagram: the Environment sends state st to the Agent.]
8. Reinforcement Learning. [Diagram: the Agent sends action at back to the Environment.]
9. Reinforcement Learning. [Diagram: the Environment also returns reward rt.]
10. Reinforcement Learning. [Diagram: the Environment returns reward rt and next state st+1, closing the loop.]
11. Cart-Pole Problem. Objective: balance a pole on top of a movable cart. State: angle, angular speed, position, horizontal velocity. Action: horizontal force applied on the cart. Reward: +1 at each time step if the pole is upright. [Image is CC0 public domain.]
12. Robot Locomotion. Objective: make the robot move forward. State: angle and position of the joints. Action: torques applied on the joints. Reward: +1 at each time step for staying upright and moving forward.
13. Atari Games. Objective: complete the game with the highest score. State: raw pixel inputs of the game state. Action: game controls, e.g. left, right, up, down. Reward: score increase/decrease at each time step.
14. Go. Objective: win the game! State: position of all pieces. Action: where to put the next piece down. Reward: 1 if we win at the end of the game, 0 otherwise. [Image is CC0 public domain.]
15. How can we mathematically formalize the RL problem? [Diagram: Agent and Environment exchanging state st, action at, reward rt, and next state st+1.]
16. Markov Decision Process. Mathematical formulation of the RL problem. Markov property: the current state completely characterises the state of the world. Defined by a tuple (S, A, R, P, γ): S, the set of possible states; A, the set of possible actions; R, the distribution of reward given a (state, action) pair; P, the transition probability, i.e. the distribution over the next state given a (state, action) pair; γ, the discount factor.
17. Markov Decision Process. At time step t=0, the environment samples an initial state s0 ~ p(s0). Then, for t=0 until done: the agent selects an action at; the environment samples a reward rt ~ R(· | st, at); the environment samples the next state st+1 ~ P(· | st, at); the agent receives reward rt and next state st+1. A policy π is a function from S to A that specifies what action to take in each state. Objective: find the policy π* that maximizes the cumulative discounted reward Σ_{t≥0} γ^t rt.
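The loop on this slide can be sketched directly in code. The two-state MDP below (its states, rewards, and dynamics) is invented purely for illustration; only the control flow mirrors the slide:

```python
import random

def step(state, action):
    """Environment samples the reward r_t and next state s_{t+1}
    given (s_t, a_t); deterministic here for simplicity."""
    if state == 0:
        return (1, 1.0) if action == "go" else (0, 0.0)
    return (0, 0.5) if action == "go" else (1, 0.0)

def run_episode(policy, horizon=10, gamma=0.9, seed=0):
    random.seed(seed)
    state = random.choice([0, 1])          # s_0 ~ p(s_0)
    ret = 0.0
    for t in range(horizon):
        action = policy(state)             # agent selects action a_t
        state, reward = step(state, action)
        ret += (gamma ** t) * reward       # cumulative discounted reward
    return ret

always_go = lambda s: "go"
print(run_episode(always_go))
```

A policy that always acts ("go") collects discounted reward every step; one that always stays collects nothing in this toy environment.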
18. A simple MDP: Grid World. actions = {1. right, 2. left, 3. up, 4. down}; states: the grid cells (★ marks terminal states). Set a negative "reward" for each transition (e.g. r = -1). Objective: reach one of the terminal states (greyed out) in the least number of actions.
19. A simple MDP: Grid World. [Figures: arrows over the grid showing a Random Policy vs. the Optimal Policy.]
20. The optimal policy π*. We want to find the optimal policy π* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probabilities, ...)?
21. The optimal policy π* (cont.). Maximize the expected sum of rewards! Formally: π* = argmax_π E[Σ_{t≥0} γ^t rt | π], with s0 ~ p(s0), at ~ π(· | st), st+1 ~ p(· | st, at).
22. Definitions: value function and Q-value function. Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …
23. Definitions (cont.). How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s: V^π(s) = E[Σ_{t≥0} γ^t rt | s0 = s, π].
24. Definitions (cont.). How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy: Q^π(s, a) = E[Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π].
25. Bellman equation. The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair: Q*(s, a) = max_π E[Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π].
26. Bellman equation (cont.). Q* satisfies the following Bellman equation: Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a]. Intuition: if the optimal state-action values for the next time step, Q*(s', a'), are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ max_{a'} Q*(s', a').
27. Bellman equation (cont.). The optimal policy π* corresponds to taking the best action in any state, as specified by Q*.
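"Taking the best action as specified by Q*" is just a greedy argmax; a minimal sketch, with an invented Q* table:

```python
# Invented Q* values for two states and two actions, for illustration only.
Q_star = {("s0", "left"): 1.0, ("s0", "right"): 2.5,
          ("s1", "left"): 0.3, ("s1", "right"): 0.1}
ACTIONS = ("left", "right")

def pi_star(s):
    # pi*(s) = argmax_a Q*(s, a)
    return max(ACTIONS, key=lambda a: Q_star[(s, a)])

print(pi_star("s0"), pi_star("s1"))  # right left
```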
28. Solving for the optimal policy. Value iteration algorithm: use the Bellman equation as an iterative update, Q_{i+1}(s, a) = E[r + γ max_{a'} Q_i(s', a') | s, a]. Qi will converge to Q* as i -> infinity.
29. Solving for the optimal policy (cont.). What's the problem with this?
30. Solving for the optimal policy (cont.). It is not scalable: we must compute Q(s, a) for every state-action pair. If the state is, e.g., the current game screen's pixels, it is computationally infeasible to compute this for the entire state space!
31. Solving for the optimal policy (cont.). Solution: use a function approximator to estimate Q(s, a), e.g. a neural network!
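Before moving to function approximation, the tabular update itself is a few lines. This sketch runs value iteration on an invented miniature of the grid world above: a 1-D corridor of states 0..3, terminal state 3, cost r = -1 per move, and γ = 1 for the undiscounted shortest-path version:

```python
GAMMA = 1.0
STATES = range(4)
ACTIONS = ("left", "right")

def step(s, a):
    if s == 3:                      # terminal state: absorbing, no more cost
        return s, 0.0
    s2 = max(0, s - 1) if a == "left" else min(3, s + 1)
    return s2, -1.0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for i in range(50):                 # Q_i -> Q* as i grows
    new_Q = {}
    for s in STATES:
        for a in ACTIONS:
            s2, r = step(s, a)
            # Bellman backup: Q_{i+1}(s,a) = r + gamma * max_a' Q_i(s',a')
            new_Q[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
    Q = new_Q

print(Q[(0, "right")], Q[(2, "right")])  # -3.0 -1.0
```

The converged values read off the shortest path: three moves from cell 0, one move from cell 2, and moving left from cell 0 wastes a step (-4.0).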
32. Solving for the optimal policy: Q-learning. Use a function approximator to estimate the action-value function: Q(s, a; θ) ≈ Q*(s, a).
33. Solving for the optimal policy: Q-learning (cont.). If the function approximator is a deep neural network => deep Q-learning!
34. Solving for the optimal policy: Q-learning (cont.). Here θ are the function parameters (weights) of the approximator.
35. Solving for the optimal policy: Q-learning. Remember: we want to find a Q-function that satisfies the Bellman equation: Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a].
36. Solving for the optimal policy: Q-learning (cont.). Forward pass. Loss function: L_i(θ_i) = E_{s,a}[(y_i - Q(s, a; θ_i))^2], where y_i = E_{s'}[r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a].
37. Solving for the optimal policy: Q-learning (cont.). Backward pass. Gradient update with respect to the Q-function parameters θ (up to a constant): ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[(y_i - Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i)].
38. Solving for the optimal policy: Q-learning (cont.). Iteratively try to make the Q-value close to the target value yi it should have if the Q-function corresponded to the optimal Q* (and the optimal policy π*).
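The forward/backward pass above can be sketched with the simplest possible function approximator, a linear one: Q(s, a; θ) = θ · φ(s, a). The feature map, reward, and dynamics below are invented for illustration; with γ = 0.5 the optimal values for this toy are Q*(0,0) = Q*(1,1) = 2 and Q*(0,1) = Q*(1,0) = 1, and they happen to be exactly representable by these features:

```python
import random

random.seed(0)
GAMMA, LR = 0.5, 0.1
theta = [0.0, 0.0, 0.0, 0.0]

def phi(s, a):
    return [1.0, float(s), float(a), float(s * a)]  # hand-made features

def q(s, a):
    return sum(t * f for t, f in zip(theta, phi(s, a)))

for i in range(5000):
    s, a = random.randrange(2), random.randrange(2)  # sample a transition
    r, s2 = float(s == a), 1 - s                     # invented reward/dynamics
    # Forward pass: Bellman target y_i and squared-error loss (y_i - Q)^2
    y = r + GAMMA * max(q(s2, 0), q(s2, 1))
    td_error = y - q(s, a)
    # Backward pass: grad of the loss wrt theta is -2 * (y - Q) * phi(s, a),
    # so gradient descent moves theta along td_error * phi(s, a)
    for j, f in enumerate(phi(s, a)):
        theta[j] += LR * td_error * f

print(round(q(0, 0), 2), round(q(0, 1), 2))
```

Note the target y is computed with the current parameters held fixed, exactly as in the slide's "semi-gradient" update.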
39. Simple Implementation
40. Simple Implementation. Goal: get outside the house (room no. 5). How do we find the way outside from every possible room?
41. Simple Implementation. We can represent the rooms as a graph: each room is a node and each door is a link.
42. Simple Implementation. We then assign a reward to every link direction.
43. Simple Implementation. Put the state diagram and the instant reward values into the following reward table, "matrix R".
44. Simple Implementation: step by hand. Set the learning parameter Gamma = 0.8 and the initial state to Room 1. Initialize matrix Q as a zero matrix. Look at the second row (state 1) of matrix R: there are two possible actions for the current state 1, go to state 3 or go to state 5. By random selection, we select going to 5 as our action. Now imagine what would happen if our agent were in state 5. Look at the sixth row of the reward matrix R (i.e. state 5): it has 3 possible actions, go to state 1, 4 or 5.
45. Simple Implementation: step by hand (cont.). Update rule: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]. So Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100.
46. Simple Implementation: step by hand. For the next episode, we start with a randomly chosen initial state; this time state 3 is our initial state. Look at the fourth row of matrix R: it has 3 possible actions, go to state 1, 2 or 4. By random selection, we select going to state 1 as our action. Now imagine we are in state 1. Look at the second row of reward matrix R (i.e. state 1): it has 2 possible actions, go to state 3 or state 5. Then we compute the Q value:
47. Simple Implementation: step by hand (cont.). Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * Max(0, 100) = 80.
49. Simple Implementation: step by hand. As our agent learns through further episodes, it finally reaches the convergence values in matrix Q (shown on the slide).
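The whole walkthrough fits in a short script. The reward matrix R below encodes the usual room layout for this classic example (-1 = no door, 0 = a door, 100 = a door leading outside to state 5); since the slides' figure is not reproduced in this transcript, treat R as an assumption. Unlike the hand-run episodes, we also keep updating from state 5, so its row of Q converges too:

```python
import random

R = [[-1, -1, -1, -1,  0, -1],
     [-1, -1, -1,  0, -1, 100],
     [-1, -1, -1,  0, -1, -1],
     [-1,  0,  0, -1,  0, -1],
     [ 0, -1, -1,  0, -1, 100],
     [-1,  0, -1, -1,  0, 100]]
GAMMA = 0.8

random.seed(0)
Q = [[0.0] * 6 for _ in range(6)]
for episode in range(1000):
    s = random.randrange(6)                 # random initial state
    for t in range(20):                     # wander for a few steps
        a = random.choice([j for j in range(6) if R[s][j] >= 0])
        # Q(state, action) = R(state, action)
        #                    + Gamma * Max[Q(next state, all actions)]
        Q[s][a] = R[s][a] + GAMMA * max(Q[a])
        s = a                               # the action names the room entered

print([[round(v) for v in row] for row in Q])
```

The hand-computed values 100 and 80 are early-episode values; at convergence they grow to 500 and 400 (100 and 80 after dividing the table by 5, which is how such tables are often displayed). Following the largest Q entries then leads outside from every room.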
50. Case Study: Playing Atari Games [Mnih et al., NIPS Workshop 2013; Nature 2015]. Objective: complete the game with the highest score. State: raw pixel inputs of the game state. Action: game controls, e.g. left, right, up, down. Reward: score increase/decrease at each time step.
51. Q-network architecture [Mnih et al., NIPS Workshop 2013; Nature 2015]. Q(s, a; θ): a neural network with weights θ. Input: the current state st, an 84x84x4 stack of the last 4 frames (after RGB -> grayscale conversion, downsampling, and cropping).
52. Q-network architecture (cont.). Layers: 16 8x8 conv filters with stride 4, then 32 4x4 conv filters with stride 2, then FC-256, then FC-4 (the Q-values).
53. Q-network architecture (cont.). Familiar conv layers and FC layers.
54. Q-network architecture (cont.). The last FC layer has a 4-d output (if there are 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4).
55. Q-network architecture (cont.). The number of actions is between 4 and 18, depending on the Atari game.
56. Q-network architecture (cont.). A single feedforward pass computes the Q-values for all actions from the current state => efficient!
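The layer sizes above can be checked with the standard output-size formula for a convolution without padding, (size - kernel) / stride + 1:

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a convolution with no padding."""
    return (size - kernel) // stride + 1

# DQN input: an 84x84x4 stack of preprocessed frames
h = conv_out(84, kernel=8, stride=4)   # 16 filters, 8x8, stride 4
print(h)                               # 20  (feature maps are 20x20x16)
h = conv_out(h, kernel=4, stride=2)    # 32 filters, 4x4, stride 2
print(h)                               # 9   (feature maps are 9x9x32)
print(h * h * 32)                      # 2592 units flattened into FC-256
```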
57. Training the Q-network: loss function (from before) [Mnih et al., NIPS Workshop 2013; Nature 2015]. Remember: we want a Q-function that satisfies the Bellman equation. Forward pass: the squared-error loss against the target yi. Backward pass: the gradient update with respect to the Q-function parameters θ. Iteratively try to make the Q-value close to the target value yi it should have if the Q-function corresponded to the optimal Q* (and the optimal policy π*).
58. Training the Q-network: experience replay [Mnih et al., NIPS Workshop 2013; Nature 2015]. Learning from batches of consecutive samples is problematic: samples are correlated => inefficient learning; and the current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => this can lead to bad feedback loops.
59. Training the Q-network: experience replay (cont.). Address these problems using experience replay: continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played, and train the Q-network on random minibatches of transitions from the replay memory instead of consecutive samples.
60. Training the Q-network: experience replay (cont.). Each transition can also contribute to multiple weight updates => greater data efficiency.
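A replay memory is little more than a bounded buffer plus uniform sampling; a minimal sketch (the class name and the fake transitions are invented for the demo):

```python
import random
from collections import deque

class ReplayMemory:
    """Store (s_t, a_t, r_t, s_{t+1}) transitions and sample random
    minibatches, breaking the correlation between consecutive samples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall out

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=100)
for t in range(500):                  # fake transitions for the demo
    memory.push(t, t % 4, 1.0, t + 1)
batch = memory.sample(32)
print(len(memory.buffer), len(batch))  # 100 32
```

Bounding the capacity is what lets each transition contribute to several updates while old, off-policy experience eventually ages out.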
61. Putting it together: Deep Q-Learning with Experience Replay [Mnih et al., NIPS Workshop 2013; Nature 2015]. The full algorithm, step by step:
62. Putting it together (cont.). Initialize the replay memory and the Q-network.
63. Putting it together (cont.). Play M episodes (full games).
64. Putting it together (cont.). Initialize the state (starting game screen pixels) at the beginning of each episode.
65. Putting it together (cont.). For each timestep t of the game:
66. Putting it together (cont.). With small probability, select a random action (explore); otherwise select the greedy action from the current policy.
67. Putting it together (cont.). Take the action at, and observe the reward rt and next state st+1.
68. Putting it together (cont.). Store the transition (st, at, rt, st+1) in the replay memory.
69. Putting it together (cont.). Experience replay: sample a random minibatch of transitions from the replay memory and perform a gradient descent step.
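The steps above can be shrunk into a runnable sketch. To stay dependency-free, the "Q-network" is a plain table and the "game" is an invented 4-cell corridor (reach cell 3 for reward +1); only the control flow — episodes, epsilon-greedy exploration, replay memory, minibatch updates — mirrors the pseudocode. Epsilon is kept fairly large and episodes start in random cells so this tiny demo explores enough:

```python
import random
from collections import deque

random.seed(0)
GAMMA, LR, EPS = 0.9, 0.5, 0.3
Q = [[0.0, 0.0] for _ in range(4)]          # actions: 0 = left, 1 = right
memory = deque(maxlen=1000)                 # replay memory D

for episode in range(500):                  # play M episodes
    s = random.randrange(3)                 # initialize the state
    for t in range(20):                     # each timestep of the game
        if random.random() < EPS:           # explore with prob. epsilon
            a = random.randrange(2)
        else:                               # otherwise act greedily
            a = max((0, 1), key=lambda b: Q[s][b])
        s2 = max(0, s - 1) if a == 0 else s + 1
        r, done = (1.0, True) if s2 == 3 else (0.0, False)
        memory.append((s, a, r, s2, done))  # store the transition in D
        if len(memory) >= 32:               # replay: random minibatch
            for ms, ma, mr, ms2, mdone in random.sample(memory, 32):
                target = mr if mdone else mr + GAMMA * max(Q[ms2])
                Q[ms][ma] += LR * (target - Q[ms][ma])  # descent step
        if done:
            break
        s = s2

print(Q)
```

After training, the greedy policy moves right from every cell, and Q(2, right) approaches the terminal reward of 1.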
70. [Video: deep Q-learning playing Atari] https://www.youtube.com/watch?v=V1eYniJ0Rnk. Video by Károly Zsolnai-Fehér; reproduced with permission.
71. Policy Gradients. What is a problem with Q-learning? The Q-function can be very complicated! Example: a robot grasping an object has a very high-dimensional state => it is hard to learn the exact value of every (state, action) pair.
72. Policy Gradients (cont.). But the policy can be much simpler: just close your hand. Can we learn a policy directly, e.g. find the best policy from a collection of policies?
73. Policy Gradients. Formally, let's define a class of parametrized policies Π = {π_θ, θ ∈ R^m}. For each policy, define its value: J(θ) = E[Σ_{t≥0} γ^t rt | π_θ].
74. Policy Gradients (cont.). We want to find the optimal policy θ* = argmax_θ J(θ). How can we do this?
75. Policy Gradients (cont.). Gradient ascent on the policy parameters!
76. REINFORCE algorithm. Mathematically, we can write J(θ) = E_{τ ~ p(τ; θ)}[r(τ)], where r(τ) is the reward of a trajectory τ = (s0, a0, r0, s1, ...).
77. REINFORCE algorithm (cont.). Expected reward: J(θ) = ∫ r(τ) p(τ; θ) dτ.
78. REINFORCE algorithm (cont.). Now let's differentiate this: ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ; θ) dτ.
79. REINFORCE algorithm (cont.). Intractable! The gradient of an expectation is problematic when p depends on θ.
80. REINFORCE algorithm (cont.). However, we can use a nice trick: ∇_θ p(τ; θ) = p(τ; θ) (∇_θ p(τ; θ) / p(τ; θ)) = p(τ; θ) ∇_θ log p(τ; θ).
81. REINFORCE algorithm (cont.). Injecting this back: ∇_θ J(θ) = ∫ (r(τ) ∇_θ log p(τ; θ)) p(τ; θ) dτ = E_τ[r(τ) ∇_θ log p(τ; θ)], which we can estimate with Monte Carlo sampling.
82. REINFORCE algorithm. Can we compute those quantities without knowing the transition probabilities? We have: p(τ; θ) = Π_{t≥0} p(st+1 | st, at) π_θ(at | st).
83. REINFORCE algorithm (cont.). Thus: log p(τ; θ) = Σ_{t≥0} [log p(st+1 | st, at) + log π_θ(at | st)].
84. REINFORCE algorithm (cont.). And when differentiating: ∇_θ log p(τ; θ) = Σ_{t≥0} ∇_θ log π_θ(at | st). This doesn't depend on the transition probabilities!
85. REINFORCE algorithm (cont.). Therefore, when sampling a trajectory τ, we can estimate the gradient of J(θ) with ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(at | st).
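Because the estimator only needs ∇_θ log π_θ(at | st), it can be run without any model of the environment. A minimal numerical sketch on an invented 2-action bandit (one-step trajectories; action 1 pays reward 1, action 0 pays 0), with a softmax policy over two logits:

```python
import math, random

random.seed(0)
theta = [0.0, 0.0]                     # one logit per action

def probs(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [v / s for v in z]

alpha = 0.1
for step in range(2000):
    p = probs(theta)
    a = 0 if random.random() < p[0] else 1   # sample a ~ pi_theta
    r = 1.0 if a == 1 else 0.0               # trajectory reward r(tau)
    # For a softmax policy, grad of log pi(a) wrt logit j is 1[a=j] - p[j]
    for j in range(2):
        grad_log = (1.0 if j == a else 0.0) - p[j]
        theta[j] += alpha * r * grad_log     # gradient ascent on J(theta)

print(probs(theta))
```

High-reward actions get their probability pushed up, exactly the interpretation on the next slide.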
86. Intuition. Gradient estimator: ∇_θ J(θ) ≈ Σ_t r(τ) ∇_θ log π_θ(at | st). Interpretation: if r(τ) is high, push up the probabilities of the actions seen; if r(τ) is low, push them down.
87. Intuition (cont.). It might seem simplistic to say that if a trajectory is good, then all its actions were good. But in expectation, it averages out!
88. Intuition (cont.). However, this estimator also suffers from high variance, because credit assignment is really hard. Can we help the estimator?
89. Variance reduction. Gradient estimator: ∇_θ J(θ) ≈ Σ_t r(τ) ∇_θ log π_θ(at | st).
90. Variance reduction (cont.). First idea: push up the probability of an action seen only by the cumulative future reward from that state: ∇_θ J(θ) ≈ Σ_t (Σ_{t' ≥ t} r_{t'}) ∇_θ log π_θ(at | st).
91. Variance reduction (cont.). Second idea: use a discount factor γ to ignore delayed effects: ∇_θ J(θ) ≈ Σ_t (Σ_{t' ≥ t} γ^{t' - t} r_{t'}) ∇_θ log π_θ(at | st).
92. Variance reduction: baseline. Problem: the raw value of a trajectory isn't necessarily meaningful. For example, if rewards are all positive, you keep pushing up the probabilities of actions. What is important, then? Whether a reward is better or worse than what you expect to get. Idea: introduce a baseline function b(s) that depends on the state. Concretely, the estimator is now: ∇_θ J(θ) ≈ Σ_t (Σ_{t' ≥ t} γ^{t' - t} r_{t'} - b(st)) ∇_θ log π_θ(at | st).
93. How to choose the baseline? A simple baseline: a constant moving average of the rewards experienced so far across all trajectories.
94. How to choose the baseline? (cont.) The variance reduction techniques seen so far are what is typically used in "Vanilla REINFORCE".
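A quick numerical check of the baseline idea, on an invented toy setup: a fixed two-action policy with all-positive rewards. Subtracting a constant baseline b from r(τ) leaves the expected gradient essentially unchanged but shrinks its variance (the mean of all rewards below stands in for the moving average):

```python
import random, statistics

random.seed(0)
p = [0.5, 0.5]                          # fixed policy over two actions
rewards = {0: 10.0, 1: 12.0}            # all positive, as in the slide

def single_sample_grad(baseline):
    a = 0 if random.random() < p[0] else 1
    # d log pi(a) / d theta_1 for a softmax policy: 1[a=1] - p[1]
    grad_log = (1.0 - p[1]) if a == 1 else (0.0 - p[1])
    return (rewards[a] - baseline) * grad_log

raw = [single_sample_grad(0.0) for _ in range(20000)]
b = statistics.mean(rewards.values())   # stand-in for the moving average
cent = [single_sample_grad(b) for _ in range(20000)]
print(statistics.mean(raw), statistics.mean(cent))       # similar means
print(statistics.pvariance(raw), statistics.pvariance(cent))
```

The baseline works because E[b · ∇_θ log π_θ(a)] = 0 for any constant b, so only the variance changes, not the expectation.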
95–98. How to choose the baseline? A better baseline: we want to push up the probability of an action taken from a state if that action was better than the expected value of what we should get from that state.
Q: What does this remind you of?
A: The Q-function and the value function!
Intuitively, we are happy with an action a_t in a state s_t if Q^π(s_t, a_t) − V^π(s_t) is large; on the contrary, we are unhappy with an action if it is small. Using this, we get the estimator:
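This estimator is also an image on the slide; with the value function serving as the baseline for the Q-function, the standard form is:

```latex
\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \big( Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t) \big) \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```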
99. Actor-Critic Algorithm
Problem: we don't know Q and V. Can we learn them? Yes, using Q-learning! We can combine policy gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function).
- The actor decides which action to take; the critic tells the actor how good its action was and how it should adjust.
- This also eases the critic's task, since it only has to learn the values of (state, action) pairs generated by the policy.
- We can also incorporate Q-learning tricks, e.g. experience replay.
- Remark: the advantage function A^π(s, a) = Q^π(s, a) − V^π(s) measures how much an action was better than expected.
100. Actor-Critic Algorithm
Initialize policy parameters θ and critic parameters φ
For iteration = 1, 2, … do
  Sample m trajectories under the current policy
  For i = 1, …, m do
    For t = 1, …, T do
      (per-step actor and critic updates; shown as equations on the slide)
    End for
  End for
End for
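The loop above can be sketched concretely. Below is a minimal tabular one-step actor-critic on a toy 4-state chain MDP (move right to reach a rewarding terminal state); the environment, hyperparameters, and names are all illustrative assumptions, not part of the lecture. The TD error stands in for the advantage, a common simplification of the algorithm on the slide:

```python
import math
import random

N_STATES, N_ACTIONS = 4, 2
theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # actor: per-state logits
V = [0.0] * N_STATES                                  # critic: state values

def policy(s):
    """Softmax over the actor's logits for state s."""
    m = max(theta[s])
    exps = [math.exp(l - m) for l in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left;
    reaching the last state gives reward +1 and ends the episode."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

def train(episodes=2000, alpha=0.1, beta=0.1, gamma=0.99, seed=0):
    rng = random.Random(seed)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            probs = policy(s)
            a = rng.choices(range(N_ACTIONS), weights=probs)[0]
            s2, r, done = step(s, a)
            # Critic update: TD error plays the role of the advantage
            td = r + (0.0 if done else gamma * V[s2]) - V[s]
            V[s] += beta * td
            # Actor update: policy-gradient step weighted by the TD error
            for b in range(N_ACTIONS):
                grad = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha * td * grad
            s = s2

train()
```

After training, the learned policy should strongly prefer moving right from every state, since the critic's TD error rewards actions that lead toward higher-value states.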
101–102. REINFORCE in action: Recurrent Attention Model (RAM) [Mnih et al. 2014]
Objective: image classification. Take a sequence of "glimpses" that selectively focus on regions of the image, and use them to predict the class.
- Inspired by human perception and eye movements.
- Saves computational resources => scalability.
- Able to ignore clutter / irrelevant parts of the image.
State: the glimpses seen so far. Action: the (x, y) coordinates (the center of the glimpse) of where to look next in the image. Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise.
Glimpsing is a non-differentiable operation => learn a policy for how to take glimpse actions using REINFORCE. Given the state of glimpses seen so far, use an RNN to model the state and output the next action.
103–107. REINFORCE in action: Recurrent Attention Model (RAM) [Mnih et al. 2014]
(Figure: the input image is glimpsed at locations (x1, y1) through (x5, y5); each glimpse feeds a recurrent NN core that emits the next glimpse location, and after the final glimpse a softmax outputs the predicted class, y = 2.)
108. REINFORCE in action: Recurrent Attention Model (RAM) [Mnih et al. 2014]
RAM has also been used in many other tasks, including fine-grained image recognition, image captioning, and visual question answering!
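The key trick RAM relies on can be shown in isolation: sampling a glimpse location is non-differentiable, but REINFORCE still lets us push up the log-probabilities of locations that led to a correct classification. The toy setup below (5 candidate locations, of which only location 3 lets the "classifier" succeed) is an illustrative assumption, not the paper's model:

```python
import math
import random

K = 5                 # number of candidate glimpse locations
logits = [0.0] * K    # location policy parameters

def probs():
    """Softmax over the location logits."""
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    z = sum(e)
    return [x / z for x in e]

def reinforce(n_steps=3000, lr=0.1, seed=0):
    rng = random.Random(seed)
    baseline = 0.0
    for _ in range(n_steps):
        p = probs()
        loc = rng.choices(range(K), weights=p)[0]   # non-differentiable sample
        reward = 1.0 if loc == 3 else 0.0           # 1 iff "classified" correctly
        baseline = 0.99 * baseline + 0.01 * reward  # moving-average baseline
        adv = reward - baseline
        # Score-function gradient of log p(loc) times the advantage
        for j in range(K):
            grad = (1.0 if j == loc else 0.0) - p[j]
            logits[j] += lr * adv * grad

reinforce()
```

No gradient ever flows through the sampling step itself; the policy improves purely from the reward-weighted score-function updates, which is why REINFORCE works here where backpropagation alone cannot.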
109. More policy gradients: AlphaGo [Silver et al., Nature 2016]
Overview:
- A mix of supervised learning and reinforcement learning.
- A mix of old methods (Monte Carlo Tree Search) and recent ones (deep RL).
How to beat the Go world champion:
- Featurize the board (stone color, move legality, bias, …).
- Initialize the policy network with supervised training from professional Go games, then continue training using policy gradients (playing against itself from random previous iterations, with +1 / −1 reward for winning / losing).
- Also learn a value network (the critic).
- Finally, combine the policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search.
110. Summary
- Policy gradients: very general, but suffer from high variance, so they require many samples. Challenge: sample efficiency.
- Q-learning: does not always work, but when it does, it is usually more sample-efficient. Challenge: exploration.
Guarantees:
- Policy gradients: converge to a local minimum of J(θ), which is often good enough!
- Q-learning: no guarantees, since you are approximating the Bellman equation with a complicated function approximator.