NPTEL Reinforcement Learning Week 5 Assignment Answers 2024

By Sanket

1. In policy iteration, which of the following is/are true of the Policy Evaluation (PE) and Policy Improvement (PI) steps?

  • The values of states that are returned by PE may fluctuate between high and low values as the algorithm runs.
  • PE returns the fixed point of the Bellman operator L^{π_n} for the current policy π_n.
  • PI can randomly select any greedy policy for a given value function vn.
  • Policy iteration always converges for a finite MDP.
Answer :-
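
For concreteness, here is a minimal policy-iteration sketch on a tiny MDP invented for this illustration (the transition table, rewards, and discount factor are assumptions, not from the course). PE iterates the Bellman operator of the current policy to its fixed point, PI then acts greedily on the result, and on a finite MDP the loop must terminate because there are only finitely many deterministic policies.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, invented for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
     1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]}}
gamma, n_states, n_actions = 0.9, 2, 2

def evaluate(policy, tol=1e-8):
    """Policy Evaluation: iterate v <- L^{pi_n} v until the fixed point v^{pi_n}."""
    v = np.zeros(n_states)
    while True:
        v_new = np.array([sum(p * (r + gamma * v[s2])
                              for p, s2, r in P[s][policy[s]])
                          for s in range(n_states)])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def improve(v):
    """Policy Improvement: pick a policy greedy with respect to v."""
    return [max(range(n_actions),
                key=lambda a: sum(p * (r + gamma * v[s2])
                                  for p, s2, r in P[s][a]))
            for s in range(n_states)]

policy = [0, 0]
while True:
    v = evaluate(policy)        # exact values of the current policy
    new_policy = improve(v)     # greedy w.r.t. v; ties may be broken arbitrarily
    if new_policy == policy:    # no change => policy greedy w.r.t. own values
        break
    policy = new_policy
print("policy:", policy, "values:", np.round(v, 2))
```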

2. Consider the Monte Carlo approach to policy evaluation. Suppose the states are S1, S2, S3, S4, S5, S6 and a terminal state, and you sample one trajectory: S1 → S5 → S3 → S6 → terminal state. Which of the following states can be updated from this sample?

  • S1
  • S2
  • S6
  • S4
Answer :-
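
A small sketch of what one Monte Carlo backup from this trajectory does (the per-step rewards below are invented; the point is which states get touched): only states that actually occur in the sample can have their estimates updated, so S2 and S4 cannot be.

```python
# Hypothetical per-step rewards for the sampled trajectory; gamma = 1 for simplicity.
trajectory = [("S1", 0.0), ("S5", 0.0), ("S3", 0.0), ("S6", 1.0)]  # (state, reward)

V, N, G, gamma = {}, {}, 0.0, 1.0
for state, reward in reversed(trajectory):       # accumulate the return backwards
    G = reward + gamma * G
    N[state] = N.get(state, 0) + 1               # incremental-mean update
    V[state] = V.get(state, 0.0) + (G - V.get(state, 0.0)) / N[state]

print(sorted(V))   # ['S1', 'S3', 'S5', 'S6'] -- S2 and S4 were never visited
```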

3. Which of the following statements are true with regard to Monte Carlo value approximation methods?

  • To evaluate a policy using these methods, a subset of trajectories in which every state is encountered at least once is enough to update all state-values.
  • Monte-Carlo value function approximation methods need knowledge of the full model.
  • Monte-Carlo methods update state-value estimates only at the end of an episode.
  • Monte-Carlo methods can only be used for episodic tasks.
Answer :-
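
In contrast with the dynamic-programming methods of question 5, the sketch below evaluates a policy without ever reading transition probabilities: it only samples complete episodes from a simulator, and every value update happens after the terminal state is reached. The random-walk chain is a hypothetical environment invented for this illustration.

```python
import random

def episode(goal=3):
    """Sample one episode of a hypothetical chain walk: from state s, move to
    s+1 with probability 0.5, else stay; reward 1 on the transition that
    reaches the terminal state. The learner never sees these probabilities."""
    s, traj = 0, []
    while s != goal:
        s2 = min(s + random.choice([0, 1]), goal)
        traj.append((s, 1.0 if s2 == goal else 0.0))
        s = s2
    return traj

V, N, gamma = {}, {}, 0.9
for _ in range(20000):
    traj = episode()
    G = 0.0
    for s, r in reversed(traj):       # returns exist only once the episode ends
        G = r + gamma * G
        N[s] = N.get(s, 0) + 1
        V[s] = V.get(s, 0.0) + (G - V.get(s, 0.0)) / N[s]

# States further from the goal have smaller (more heavily discounted) values.
print({s: round(v, 2) for s, v in sorted(V.items())})
```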

4. In every-visit Monte Carlo methods, multiple samples for one state can be obtained from a single trajectory. Which of the following is true?

  • There is an increase in bias of the estimates.
  • There is an increase in variance of the estimates.
  • It does not affect the bias or variance of estimates.
  • Both bias and variance of the estimates increase.
Answer :- 
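
To see where the multiple samples come from, take a single hypothetical trajectory that visits state A twice (rewards invented, gamma = 1). First-visit MC keeps only the return from the first occurrence; every-visit MC keeps a return from every occurrence, and those returns overlap, so they are correlated rather than independent samples.

```python
# First-visit vs. every-visit returns for one hypothetical trajectory
# that revisits state "A"; gamma = 1.
trajectory = [("A", 1.0), ("B", 0.0), ("A", 2.0), ("B", 3.0)]  # (state, reward)
gamma = 1.0

# Compute the return G_t from every time step t.
returns, G = [], 0.0
for _, r in reversed(trajectory):
    G = r + gamma * G
    returns.insert(0, G)                     # returns[t] = r_t + r_{t+1} + ...

first_visit, every_visit = {}, {}
for t, (s, _) in enumerate(trajectory):
    first_visit.setdefault(s, returns[t])             # first occurrence only
    every_visit.setdefault(s, []).append(returns[t])  # every occurrence

print(first_visit)   # {'A': 6.0, 'B': 5.0}
print(every_visit)   # {'A': [6.0, 5.0], 'B': [5.0, 3.0]} -- overlapping returns
```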

5. Which of the following statements are FALSE about solving MDPs using dynamic programming?

  • If the state space is large or computation power is limited, it is preferred to update only those states that are seen in the trajectories.
  • Knowledge of transition probabilities is not necessary for solving MDPs using dynamic programming.
  • Methods that update only a subset of states at a time guarantee performance equal to or better than classic DP.
Answer :- 
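
For contrast with the model-free methods above, here is a value-iteration-style DP sweep on a hypothetical MDP (transition table and rewards are invented). Note that it reads the transition probabilities P directly, and that the inner loop could be restricted to any subset of states (asynchronous DP), e.g. only states seen in recent trajectories, trading the classic full-sweep guarantees for cheaper updates.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP: P[s][a] = list of (prob, next_state, reward).
P = {0: {0: [(0.8, 0, 0.0), (0.2, 1, 0.0)], 1: [(1.0, 1, 0.0)]},
     1: {0: [(1.0, 0, 0.0)],               1: [(0.5, 1, 0.0), (0.5, 2, 1.0)]},
     2: {0: [(1.0, 2, 0.0)],               1: [(1.0, 2, 0.0)]}}
gamma, states = 0.9, [0, 1, 2]

v = np.zeros(3)
for sweep in range(200):
    # A classic DP sweep updates every state; an asynchronous variant could
    # iterate over any subset of states instead.
    for s in states:                       # in-place (Gauss-Seidel) updates
        v[s] = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                   for a in P[s])
print(np.round(v, 3))
```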

6. Select the correct statements about Generalized Policy Iteration (GPI).

  • GPI lets policy evaluation and policy improvement interact with each other regardless of the details of the two processes.
  • Before convergence, the policy evaluation step will usually cause the policy to no longer be greedy with respect to the updated value function.
  • GPI converges only when a policy has been found which is greedy with respect to its own value function.
  • The policy found by GPI at convergence will be optimal, but the value function will not be optimal.
Answer :-
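
To make the first option concrete, the sketch below runs GPI with policy evaluation deliberately truncated to a single sweep (reusing the invented toy MDP from question 1). Evaluation and improvement still cooperate: the process stabilizes exactly when the policy is greedy with respect to its own value function.

```python
import numpy as np

P = {0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
     1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]}}
gamma, n_states = 0.9, 2

def q(s, a, v):
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

v = np.zeros(n_states)
policy = [0, 0]
for _ in range(200):
    # Truncated policy evaluation: a single sweep toward v^pi, not its fixed point.
    v = np.array([q(s, policy[s], v) for s in range(n_states)])
    # Policy improvement: greedy w.r.t. the current (inexact) value estimates.
    policy = [max(P[s], key=lambda a: q(s, a, v)) for s in range(n_states)]

# At convergence the policy is greedy with respect to its own value function.
print(policy, np.round(v, 2))
```

With evaluation run to convergence this loop is exactly policy iteration; with a single sweep it behaves like value iteration; GPI names the whole spectrum in between.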

7. What is meant by "off-policy" Monte Carlo value function evaluation?

  • The policy being evaluated is the same as the policy used to generate samples.
  • The policy being evaluated is different from the policy used to generate samples.
  • The policy being learnt is different from the policy used to generate samples.
  • The policy being learnt is the same as the policy used to generate samples.
Answer :- 
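
A minimal sketch of the off-policy idea using ordinary importance sampling (the one-step episodic task, the reward numbers, and both policies are invented for illustration): trajectories come from a behaviour policy b, yet reweighting each return by pi(a)/b(a) yields an unbiased estimate of the value under the different target policy pi.

```python
import random

# One-state episodic task with two actions; the true value under the target
# policy is known, so the estimate can be checked. All numbers are illustrative.
rewards = {0: 1.0, 1: 3.0}          # deterministic reward per action
b  = {0: 0.5, 1: 0.5}               # behaviour policy: generates the data
pi = {0: 0.2, 1: 0.8}               # target policy: the one being evaluated

total, n = 0.0, 100_000
for _ in range(n):
    a = random.choices([0, 1], weights=[b[0], b[1]])[0]   # sample from b
    G = rewards[a]                                        # episode return
    rho = pi[a] / b[a]                                    # importance ratio
    total += rho * G                                      # reweighted return

print(total / n)    # ~= 0.2*1 + 0.8*3 = 2.6, the value under pi (under b it is 2.0)
```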

8. Both value and policy iteration produce a sequence of value vectors over their iterations: say v_1, v_2, …, v_n for value iteration and v′_1, v′_2, …, v′_n for policy iteration. Which of the following statements are true?

  • For all v_i ∈ {v_1, v_2, …, v_n}, there exists a policy for which v_i is a fixed point.
  • For all v′_i ∈ {v′_1, v′_2, …, v′_n}, there exists a policy for which v′_i is a fixed point.
  • For all v_i ∈ {v_1, v_2, …, v_n}, there may not exist a policy for which v_i is a fixed point.
  • For all v′_i ∈ {v′_1, v′_2, …, v′_n}, there may not exist a policy for which v′_i is a fixed point.
Answer :-
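
The distinction behind these options, written out with the standard operators (textbook definitions, not course-specific notation):

```latex
% Bellman operator of a fixed policy pi:
(L^{\pi} v)(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)
                 \bigl[ r(s, a, s') + \gamma \, v(s') \bigr]

% Policy iteration evaluates pi_i exactly, so every iterate is a fixed point:
v'_i = v^{\pi_i} \quad\Longrightarrow\quad L^{\pi_i} v'_i = v'_i

% Value iteration applies the Bellman optimality operator instead:
v_{i+1}(s) = \max_{a} \sum_{s'} p(s' \mid s, a)
             \bigl[ r(s, a, s') + \gamma \, v_i(s') \bigr]
```

An intermediate v_i from value iteration is generally not equal to v^π for any policy π, so it need not be a fixed point of any L^π; the v′_i from policy iteration always are, by construction.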