NPTEL Reinforcement Learning Week 5 Assignment Answers 2024
1. In policy iteration, which of the following is/are true of the Policy Evaluation (PE) and Policy Improvement (PI) steps?
- The values of states that are returned by PE may fluctuate between high and low values as the algorithm runs.
- PE returns the fixed point of the operator L^πn (the Bellman operator for the current policy πn).
- PI can randomly select any greedy policy for a given value function vn.
- Policy iteration always converges for a finite MDP.
Answer :-
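For intuition on Question 1, here is a minimal NumPy sketch of policy iteration on a made-up 2-state, 2-action MDP; the arrays P and R are invented illustration values, not part of the assignment. PE iterates the Bellman operator for the current policy to its fixed point, PI takes a greedy step, and the loop stops when the policy is stable, which must happen for a finite MDP.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (values invented for illustration).
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, n_states = 0.9, 2

def policy_evaluation(pi, tol=1e-8):
    """Iterate the Bellman operator for the fixed policy pi until its fixed point."""
    v = np.zeros(n_states)
    while True:
        v_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ v for s in range(n_states)])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def policy_improvement(v):
    """Greedy policy w.r.t. v; ties can be broken arbitrarily (argmax picks the first)."""
    q = R + gamma * P @ v          # q[s, a]
    return np.argmax(q, axis=1)

pi = np.zeros(n_states, dtype=int)   # arbitrary initial policy
while True:
    v = policy_evaluation(pi)        # PE: fixed point of the operator for pi
    pi_new = policy_improvement(v)   # PI: greedy step
    if np.array_equal(pi_new, pi):   # stable policy -> converged (finite MDP)
        break
    pi = pi_new
print("optimal policy:", pi, "values:", v)
```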
2. Consider the Monte Carlo approach for policy evaluation. Suppose the states are S1, S2, S3, S4, S5, S6 and a terminal state. You sample one trajectory as follows – S1 → S5 → S3 → S6 → terminal state. Which among the following states can be updated from this sample?
- S1
- S2
- S6
- S4
Answer :-
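As a rough illustration of Question 2, the sketch below computes Monte Carlo returns from the single sampled trajectory; the rewards and discount factor are invented placeholders. Since this trajectory visits no state twice, first-visit and every-visit returns coincide, and only the visited states receive any update.

```python
# Monte Carlo return computation for the single sampled trajectory in Question 2.
# Rewards are invented placeholders; only the states actually visited get an update.
trajectory = [("S1", 0.0), ("S5", 1.0), ("S3", 0.0), ("S6", 2.0)]  # (state, reward observed after it)
gamma = 0.9

returns = {s: [] for s in ["S1", "S2", "S3", "S4", "S5", "S6"]}
G = 0.0
# Walk the episode backwards, accumulating the discounted return G for each visited state.
for state, reward in reversed(trajectory):
    G = reward + gamma * G
    returns[state].append(G)

V = {s: (sum(g) / len(g) if g else None) for s, g in returns.items()}
print(V)  # S2 and S4 remain None: unvisited states cannot be updated from this sample
```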
3. Which of the following statements are true with regard to Monte Carlo value approximation methods?
- To evaluate a policy using these methods, a subset of trajectories in which all states are encountered at least once is enough to update all state-values.
- Monte Carlo value function approximation methods need knowledge of the full model.
- Monte-Carlo methods update state-value estimates only at the end of an episode.
- Monte-Carlo methods can only be used for episodic tasks.
Answer :-
4. In every-visit Monte Carlo methods, multiple samples for one state are obtained from a single trajectory. Which of the following is true?
- There is an increase in bias of the estimates.
- There is an increase in variance of the estimates.
- It does not affect the bias or variance of estimates.
- Both bias and variance of the estimates increase.
Answer :-
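For Question 4, the following sketch contrasts every-visit and first-visit sample collection on an invented trajectory that revisits S1; the rewards are placeholders. It only shows how a single episode can contribute multiple, correlated return samples for the same state under every-visit MC, whereas first-visit keeps one sample per state per episode.

```python
# First-visit vs. every-visit returns from one invented trajectory that revisits S1.
trajectory = [("S1", 1.0), ("S2", 0.0), ("S1", 2.0), ("S3", 0.0)]  # (state, reward)
gamma = 1.0

# Compute the return G_t from each time step t to the end of the episode.
G, returns_from_t = 0.0, []
for state, reward in reversed(trajectory):
    G = reward + gamma * G
    returns_from_t.append((state, G))
returns_from_t.reverse()

every_visit = {}   # all occurrences contribute a (correlated) sample
first_visit = {}   # only the first occurrence in the episode contributes
for state, G_t in returns_from_t:
    every_visit.setdefault(state, []).append(G_t)
    if state not in first_visit:
        first_visit[state] = [G_t]

print("every-visit samples for S1:", every_visit["S1"])   # two samples from one episode
print("first-visit samples for S1:", first_visit["S1"])   # a single sample
```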
5. Which of the following statements are FALSE about solving MDPs using dynamic programming?
- If the state space is large or computation power is limited, it is preferred to update only those states that are seen in the trajectories.
- Knowledge of transition probabilities is not necessary for solving MDPs using dynamic programming.
- Methods that update only a subset of states at a time guarantee performance equal to or better than classic DP.
Answer :-
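Relating to Question 5, here is a minimal sketch of a synchronous dynamic-programming sweep (value iteration) on an invented 3-state model; the random arrays are placeholders. The update explicitly uses the transition model P, unlike sample-based methods, and an asynchronous variant would update only a chosen subset of states per sweep.

```python
import numpy as np

# Invented 3-state, 2-action model: DP sweeps require P and R explicitly.
P = np.random.dirichlet(np.ones(3), size=(3, 2))   # P[s, a, :] sums to 1
R = np.random.rand(3, 2)
gamma = 0.95

v = np.zeros(3)
for sweep in range(1000):
    q = R + gamma * P @ v                  # uses the full transition model
    v_new = q.max(axis=1)                  # synchronous: every state updated each sweep
    # Asynchronous DP would instead update only a subset, e.g. states seen in recent
    # trajectories, trading per-sweep cost against slower propagation of values.
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print("v* estimate:", v)
```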
6. Select the correct statements about Generalized Policy Iteration (GPI).
- GPI lets policy evaluation and policy improvement interact with each other regardless of the details of the two processes.
- Before convergence, the policy evaluation step will usually cause the policy to no longer be greedy with respect to the updated value function.
- GPI converges only when a policy has been found which is greedy with respect to its own value function.
- The policy found by GPI at convergence will be optimal but value function will not be optimal.
Answer :-
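A short derivation relevant to Question 6: GPI stabilises exactly when the policy is greedy with respect to its own value function, and at that point the policy-evaluation equation coincides with the Bellman optimality equation.

```latex
% If \pi is greedy with respect to its own value function v_\pi, then for every state s
v_\pi(s) \;=\; \sum_{s',\,r} p(s', r \mid s, \pi(s))\,\bigl[r + \gamma\, v_\pi(s')\bigr]
        \;=\; \max_a \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr],
% which is the Bellman optimality equation, hence v_\pi = v_* and \pi is optimal.
```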
7. What is meant by "off-policy" Monte Carlo value function evaluation?
- The policy being evaluated is the same as the policy used to generate samples.
- The policy being evaluated is different from the policy used to generate samples.
- The policy being learnt is the same as the policy used to generate samples.
- The policy being learnt is different from the policy used to generate samples.
Answer :-
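As a rough sketch of the off-policy idea in Question 7, the code below evaluates a target policy pi from episodes generated by a different behaviour policy b, using ordinary importance sampling on an invented one-step episodic task; all numbers are illustrative assumptions.

```python
import numpy as np

# Off-policy MC evaluation: episodes come from behaviour policy b, but we estimate
# the value under target policy pi via (ordinary) importance sampling.
rng = np.random.default_rng(0)
pi = np.array([0.9, 0.1])   # target policy over 2 actions (invented)
b  = np.array([0.5, 0.5])   # behaviour policy that actually generates samples

def episode():
    """One-step episodic task: pick an action under b, observe an invented reward."""
    a = rng.choice(2, p=b)
    r = rng.normal(loc=[1.0, 3.0][a])
    return a, r

weighted_returns = []
for _ in range(10000):
    a, G = episode()
    rho = pi[a] / b[a]               # importance-sampling ratio pi(a|s) / b(a|s)
    weighted_returns.append(rho * G)

print("off-policy estimate of v_pi:", np.mean(weighted_returns))
print("true v_pi:", 0.9 * 1.0 + 0.1 * 3.0)
```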
8. For both value and policy iteration algorithms we get a sequence of vectors after some iterations, say v1, v2, …, vn for value iteration and v′1, v′2, …, v′n for policy iteration. Which of the following statements are true?
- For all vi ∈ {v1, v2, …, vn} there exists a policy for which vi is a fixed point.
- For all v′i ∈ {v′1, v′2, …, v′n} there exists a policy for which v′i is a fixed point.
- For all vi ∈ {v1, v2, …, vn} there may not exist a policy for which vi is a fixed point.
- For all v′i ∈ {v′1, v′2, …, v′n} there may not exist a policy for which v′i is a fixed point.
Answer :-
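For Question 8, the sketch below uses an invented 2-state, 2-action MDP to check whether individual iterates are fixed points of some policy's Bellman operator: each policy-iteration iterate v′i is such a fixed point by construction (it is computed by solving v = r_π + γ P_π v for some policy π), while intermediate value-iteration iterates can be checked against all deterministic policies.

```python
import numpy as np
from itertools import product

# Invented 2-state, 2-action MDP to compare value-iteration and policy-iteration iterates.
P = np.array([[[0.7, 0.3], [0.1, 0.9]],
              [[0.4, 0.6], [0.8, 0.2]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.5]])
gamma = 0.9

def policy_value(pi):
    """Exact fixed point of the Bellman operator for deterministic policy pi (linear solve)."""
    P_pi = np.array([P[s, pi[s]] for s in range(2)])
    r_pi = np.array([R[s, pi[s]] for s in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def is_fixed_point_of_some_policy(v, tol=1e-6):
    return any(np.allclose(v, policy_value(pi), atol=tol) for pi in product(range(2), repeat=2))

# A few value-iteration iterates v_1, v_2, ...
v = np.zeros(2)
for i in range(1, 4):
    v = (R + gamma * P @ v).max(axis=1)
    print(f"value-iteration v_{i}: fixed point of some policy? {is_fixed_point_of_some_policy(v)}")

# Policy-iteration iterates v'_i are fixed points by construction: each is policy_value(pi_i).
pi = (0, 0)
for i in range(1, 4):
    v_pi = policy_value(pi)
    print(f"policy-iteration v'_{i}: fixed point of some policy? {is_fixed_point_of_some_policy(v_pi)}")
    pi = tuple(np.argmax(R + gamma * P @ v_pi, axis=1))
```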