## NPTEL Reinforcement Learning Week 5 Assignment Answers 2024

1. In policy iteration, which of the following is/are true of the Policy Evaluation (PE) and Policy Improvement (PI) steps?

- The values of states that are returned by PE may fluctuate between high and low values as the algorithm runs.
- PE returns the fixed point of L_{πn}.
- PI can randomly select any greedy policy for a given value function v_{n}.
- Policy iteration always converges for a finite MDP.

Answer :-
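The PE/PI loop in question 1 can be sketched concretely. Below is a minimal policy iteration on a hypothetical two-state, two-action MDP (the transition table `P` and rewards `R` are invented for illustration): PE iterates the Bellman operator L_{π} to its fixed point, PI acts greedily, and the loop stops when the policy is greedy with respect to its own value function.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, used only for illustration.
# P[s][a] = list of (next_state, probability); R[s][a] = expected reward.
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
gamma = 0.9
states, actions = [0, 1], [0, 1]

def evaluate(policy, tol=1e-8):
    """Policy evaluation: iterate the Bellman operator L_pi to its fixed point."""
    v = np.zeros(len(states))
    while True:
        v_new = np.array([R[s][policy[s]] +
                          gamma * sum(p * v[s2] for s2, p in P[s][policy[s]])
                          for s in states])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def improve(v):
    """Policy improvement: pick a greedy action for each state under v."""
    return [max(actions, key=lambda a: R[s][a] +
                gamma * sum(p * v[s2] for s2, p in P[s][a]))
            for s in states]

policy = [0, 0]
while True:
    v = evaluate(policy)
    new_policy = improve(v)
    if new_policy == policy:  # greedy w.r.t. its own value function -> stop
        break
    policy = new_policy

print(policy)  # [1, 1] for this toy MDP
```

Since a finite MDP has finitely many deterministic policies and each improvement step is monotone, this loop must terminate, which is the intuition behind the "always converges for a finite MDP" option.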

2. Consider the Monte Carlo approach for policy evaluation. Suppose the states are S_{1}, S_{2}, S_{3}, S_{4}, S_{5}, S_{6} and a terminal state. You sample one trajectory as follows: S_{1} → S_{5} → S_{3} → S_{6} → terminal state. Which among the following states can be updated from this sample?

- S_{1}
- S_{2}
- S_{6}
- S_{4}

Answer :-
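Question 2 hinges on the fact that Monte Carlo can only update states that actually appear in the sampled trajectory. A minimal sketch, assuming γ = 1 and made-up per-step rewards (each state occurs at most once here, so first-visit and every-visit MC coincide):

```python
from collections import defaultdict

# Illustrative rewards attached to the sampled trajectory S1 -> S5 -> S3 -> S6 -> terminal.
gamma = 1.0
trajectory = [("S1", 0.0), ("S5", 1.0), ("S3", 0.0), ("S6", 2.0)]  # (state, reward)

returns = defaultdict(list)
G = 0.0
# Walk the trajectory backwards, accumulating the return G and recording it per state.
for state, reward in reversed(trajectory):
    G = reward + gamma * G
    returns[state].append(G)

# Value estimate = average of the recorded returns.
V = {s: sum(g) / len(g) for s, g in returns.items()}
print(sorted(V))  # ['S1', 'S3', 'S5', 'S6'] -- S2 and S4 were never visited, so no update
```

Only S_{1}, S_{5}, S_{3}, and S_{6} receive an update; S_{2} and S_{4} are untouched because no return was observed from them.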

3. Which of the following statements are true with regards to Monte Carlo value approximation methods?

- To evaluate a policy using these methods, a subset of trajectories in which all states are encountered at least once is enough to update all state-values.
- Monte-Carlo value function approximation methods need knowledge of the full model.
- Monte-Carlo methods update state-value estimates only at the end of an episode.
- Monte-Carlo methods can only be used for episodic tasks.

Answer :-

4. In every-visit Monte Carlo methods, multiple samples for one state may be obtained from a single trajectory. Which of the following is true?

- There is an increase in bias of the estimates.
- There is an increase in variance of the estimates.
- It does not affect the bias or variance of estimates.
- Both bias and variance of the estimates increase.

Answer :-

5. Which of the following statements are FALSE about solving MDPs using dynamic programming?

- If the state space is large or computation power is limited, it is preferred to update only those states that are seen in the trajectories.
- Knowledge of transition probabilities is not necessary for solving MDPs using dynamic programming.
- Methods that update only a subset of states at a time guarantee performance equal to or better than classic DP.

Answer :-

6. Select the correct statements about Generalized Policy Iteration (GPI).

- GPI lets policy evaluation and policy improvement interact with each other regardless of the details of the two processes.
- Before convergence, the policy evaluation step will usually cause the policy to no longer be greedy with respect to the updated value function.
- GPI converges only when a policy has been found which is greedy with respect to its own value function.
- The policy found by GPI at convergence will be optimal, but the value function will not be optimal.

Answer :-

7. What is meant by "off-policy" Monte Carlo value function evaluation?

- The policy being evaluated is the same as the policy used to generate samples.
- The policy being evaluated is different from the policy used to generate samples.
- The policy being learnt is different from the policy used to generate samples.

Answer :-

8. For both value and policy iteration algorithms, we obtain a sequence of vectors over the iterations, say v_{1}, v_{2}, …, v_{n} for value iteration and v′_{1}, v′_{2}, …, v′_{n} for policy iteration. Which of the following statements are true?

- For all v_{i} ∈ {v_{1}, v_{2}, …, v_{n}}, there exists a policy for which v_{i} is a fixed point.
- For all v′_{i} ∈ {v′_{1}, v′_{2}, …, v′_{n}}, there exists a policy for which v′_{i} is a fixed point.
- For all v_{i} ∈ {v_{1}, v_{2}, …, v_{n}}, there may not exist a policy for which v_{i} is a fixed point.
- For all v′_{i} ∈ {v′_{1}, v′_{2}, …, v′_{n}}, there may not exist a policy for which v′_{i} is a fixed point.

Answer :-
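The distinction in question 8 can be checked numerically: each vector produced by policy iteration is, by construction, the fixed point of some policy's Bellman operator, whereas a value iteration iterate generally is not. A sketch on a hypothetical two-state, two-action deterministic MDP (transitions `T` and rewards `R` are invented for illustration), enumerating the exact values of all four deterministic policies and comparing them with the first value-iteration iterate:

```python
import numpy as np
from itertools import product

# Hypothetical deterministic MDP: T[s][a] = next state, R[s][a] = reward.
T = [[0, 1], [0, 1]]
R = [[0.0, 1.0], [0.0, 2.0]]
gamma = 0.9

def fixed_point(policy):
    """Exact value of a deterministic policy: solve v = r + gamma * P v."""
    P = np.zeros((2, 2))
    r = np.zeros(2)
    for s in (0, 1):
        P[s, T[s][policy[s]]] = 1.0
        r[s] = R[s][policy[s]]
    return np.linalg.solve(np.eye(2) - gamma * P, r)

# Fixed points of all 2^2 deterministic policies.
policy_values = [fixed_point(p) for p in product((0, 1), repeat=2)]

# One value-iteration step from v = 0.
v = np.zeros(2)
v1 = np.array([max(R[s][a] + gamma * v[T[s][a]] for a in (0, 1)) for s in (0, 1)])

matches = any(np.allclose(v1, pv) for pv in policy_values)
print(matches)  # False: this iterate is not the value of any policy
```

Here the intermediate value-iteration vector matches none of the policy fixed points, illustrating why "there may not exist a policy" is the right reading for the v_{i} sequence, while every v′_{i} from policy iteration is the fixed point of the policy evaluated at that step.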