NPTEL Reinforcement Learning Week 6 Assignment Answers 2024
1. Which of the following are true?
- Dynamic programming methods use full backups and bootstrapping.
- Temporal-Difference methods use sample backups and bootstrapping.
- Monte-Carlo methods use sample backups and bootstrapping.
- Monte-Carlo methods use full backups and no bootstrapping.
Answer :-
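As a quick reference for the terms in question 1, the three backup styles can be summarized with the standard tabular update rules (usual Sutton–Barto notation; this is a textbook summary, not part of the assignment):

```latex
\begin{align*}
\text{DP (full backup, bootstrapping):}\quad
  & V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\,\bigl[r + \gamma V(s')\bigr] \\
\text{TD(0) (sample backup, bootstrapping):}\quad
  & V(S_t) \leftarrow V(S_t) + \alpha \bigl[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr] \\
\text{Monte Carlo (sample backup, no bootstrapping):}\quad
  & V(S_t) \leftarrow V(S_t) + \alpha \bigl[G_t - V(S_t)\bigr]
\end{align*}
```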
2. Consider the following statements:
(i) TD(0) methods use an unbiased sample of the return.
(ii) TD(0) methods use a sample of the reward from the distribution of rewards.
(iii) TD(0) methods use the current estimate of the value function.
Which of the above statements is/are true?
- (i), (ii)
- (i),(iii)
- (ii), (iii)
- (i), (ii), (iii)
Answer :-
3. Consider an MDP with two states, A and B. Given the single trajectory shown below (listed as state, reward, next state, …), apply on-policy TD(0) updates to estimate the values of the two states.
A, -2, B, 3, A, 3, B, -4, A, 0, END
Assume a discount factor γ=1, a learning rate α=1, and initial state values of zero. What are the estimated values of the two states at the end of the sampled trajectory? (Note: you are not asked to compute the true values of the two states. A worked sketch of the updates follows the options.)
- V(A)=3,V(B)=3
- V(A)=0,V(B)=0
- V(A)=−1,V(B)=−2
- V(A)=−2,V(B)=−1
Answer :-
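Below is a minimal Python sketch of how the on-policy TD(0) updates from question 3 would be applied along the given trajectory; the variable names (trajectory, V, alpha, gamma) are illustrative, not from the assignment.

```python
# Minimal sketch of tabular TD(0) updates along the given trajectory.
# Assumes gamma = 1, alpha = 1, V initialized to zero, and V(END) = 0.

trajectory = ["A", -2, "B", 3, "A", 3, "B", -4, "A", 0, "END"]
V = {"A": 0.0, "B": 0.0, "END": 0.0}
alpha, gamma = 1.0, 1.0

# Walk the (state, reward, next_state) triples in order.
for i in range(0, len(trajectory) - 2, 2):
    s, r, s_next = trajectory[i], trajectory[i + 1], trajectory[i + 2]
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

print(V)  # TD(0) estimates for A and B after this single trajectory
```

Note that with α=1 each update simply overwrites V(s) with the TD target r + γV(s').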
4. Which of the following statements are true for SARSA?
- It is a TD method.
- It is an off-policy algorithm.
- It uses bootstrapping to approximate the full return.
- It always selects the greedy action choice.
Answer :-
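For reference, a minimal sketch of a single SARSA update; the names Q, s, a, r, s_next and a_next are illustrative, and a_next is the action the same (behaviour) policy actually takes next, which is what makes SARSA on-policy.

```python
# One SARSA update: the TD target bootstraps on Q(s', a'),
# where a' is the action the (same) policy actually takes next.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```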
5. Assertion: In Expected-SARSA, we may select actions off-policy.
Reason: In the update rule for Expected-SARSA, we use the estimated expected value of the next state under the policy π rather than directly using the estimated value of the next state that is sampled on-policy.
- Assertion and Reason are both true and Reason is a correct explanation of Assertion.
- Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
- Assertion is true but Reason is false.
- Assertion is false but Reason is true.
Answer :-
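A minimal sketch of the Expected-SARSA update described in the Reason; pi_probs is an assumed helper returning π(·|s) as a dictionary of action probabilities, not something defined in the lectures.

```python
# One Expected-SARSA update: the target uses the expected value of the next
# state under the target policy pi, so the transition itself may be sampled
# by a different behaviour policy.
def expected_sarsa_update(Q, s, a, r, s_next, actions, pi_probs,
                          alpha=0.1, gamma=0.99):
    expected_next = sum(pi_probs(s_next)[b] * Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * expected_next - Q[(s, a)])
```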
6. Assertion: Q-learning can use asynchronous samples from different policies to update Q values.
Reason: Q-learning is an off-policy learning algorithm.
- Assertion and Reason are both true and Reason is a correct explanation of Assertion.
- Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
- Assertion is true but Reason is false.
- Assertion is false but Reason is true.
Answer :-
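For comparison, a minimal sketch of a single Q-learning update; because the target maximizes over next actions, it does not depend on the policy that generated the sampled transition.

```python
# One Q-learning update: the target bootstraps on max_b Q(s', b),
# independent of which policy produced the sampled transition.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```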
7. Suppose, for a two-player game that we have modeled as an MDP, instead of learning a policy over the MDP directly, we separate the deterministic and stochastic results of playing an action to create ‘after-states’ (as discussed in the lectures). Consider the following statements:
(i) The set of states that make up ‘after-states’ may be different from the original set of states for the MDP.
(ii) The set of ‘after-states’ could be smaller than the original set of states for the MDP.
Which of the above statements is/are True?
- Only (i)
- Only (ii)
- Both (i) and (ii)
- Neither (i) nor (ii)
Answer :-
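To make the after-state idea concrete, here is a small illustrative sketch for tic-tac-toe; the board encoding and the afterstate helper are made up for this example, not taken from the lectures. The deterministic part of a move defines the after-state, and two different state–action pairs can map to the same after-state.

```python
# Afterstate sketch for tic-tac-toe: placing our mark is the deterministic
# part of the move; the opponent's reply is the stochastic part.
def afterstate(board, move, mark="X"):
    new_board = list(board)
    new_board[move] = mark
    return tuple(new_board)

# Two different (state, action) pairs reach the same afterstate,
# so a value needs to be learned only once for that position.
s1 = ("X", ".", ".", ".", ".", ".", ".", ".", "O")   # X at 0, O at 8
s2 = (".", ".", ".", ".", "X", ".", ".", ".", "O")   # X at 4, O at 8
assert afterstate(s1, 4) == afterstate(s2, 0)
```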
8. Assertion: Rollout algorithms take advantage of the policy improvement property.
Reason: Rollout algorithms select the action with the highest estimated value.
- Assertion and Reason are both true and Reason is a correct explanation of Assertion.
- Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
- Assertion is true but Reason is false.
- Assertion and Reason are both false.
Answer :-
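For context, a minimal sketch of one rollout step; simulate_return is an assumed helper that runs a single simulated episode from (state, action) under the fixed rollout policy and returns its return.

```python
# Rollout sketch: estimate each action's value by Monte Carlo simulation
# under a fixed rollout policy, then act greedily on those estimates.
def rollout_action(state, actions, rollout_policy, simulate_return, n_sims=50):
    def estimate(a):
        returns = [simulate_return(state, a, rollout_policy) for _ in range(n_sims)]
        return sum(returns) / len(returns)
    return max(actions, key=estimate)
```

Acting greedily with respect to these Monte Carlo estimates of the rollout policy's action values is an application of the policy-improvement step at the current state.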
9. Consider the CliffWorld environment discussed in the lectures:
Suppose we use an ϵ-greedy policy for exploration with ϵ=0.1. Select the correct option(s):
- Q-Learning finds the optimal (red) path.
- Q-Learning finds the safer (blue) path.
- SARSA finds the optimal (red) path.
- SARSA finds the safer (blue) path.
Answer :-
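For context, a minimal sketch of ϵ-greedy action selection with ϵ=0.1, as assumed in question 9; the function name and arguments are illustrative.

```python
import random

# epsilon-greedy selection: explore with probability epsilon,
# otherwise take the action with the highest current Q-value.
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

SARSA's targets include the exploratory actions this policy occasionally takes near the cliff, whereas Q-learning's max-based targets do not, which is why the two algorithms can settle on different paths.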