# NPTEL Reinforcement Learning Week 6 Assignment Answers 2024

By Sanket


1. Which of the following are true?

• Dynamic programming methods use full backups and bootstrapping.
• Temporal-Difference methods use sample backups and bootstrapping.
• Monte-Carlo methods use sample backups and bootstrapping.
• Monte-Carlo methods use full backups and no bootstrapping.
`Answer :- `

2. Consider the following statements:
(i) TD(0) methods use an unbiased sample of the return.
(ii) TD(0) methods use a sample of the reward from the distribution of rewards.
(iii) TD(0) methods use the current estimate of the value function.
Which of the above statements is/are true?

• (i), (ii)
• (i),(iii)
• (ii), (iii)
• (i), (ii), (iii)
`Answer :- `

3. Consider an MDP with two states A and B. Given the single trajectory shown below (in the pattern of state, reward, next state…), use on-policy TD(0) updates to make estimates for the values of the 2 states.

A, -2, B, 3, A, 3, B, -4, A, 0, END

Assume a discount factor γ=1, a learning rate α=1 and initial state-values of zero. What are the estimated values for the 2 states at the end of the sampled trajectory? (Note: You are not asked to compute the true values for the two states.)

• V(A)=3,V(B)=3
• V(A)=0,V(B)=0
• V(A)=−1,V(B)=−2
• V(A)=−2,V(B)=−1
`Answer :- `
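The TD(0) updates for this trajectory can be traced mechanically. Below is a minimal sketch that applies the tabular TD(0) rule V(s) ← V(s) + α[r + γV(s′) − V(s)] to the given episode, with the trajectory split into (state, reward, next state) transitions as the question describes; the variable names are illustrative, not from the assignment.

```python
gamma = 1.0   # discount factor, as given in the question
alpha = 1.0   # learning rate, as given in the question

# Initial state-values of zero; END is terminal, so V(END) stays 0.
V = {"A": 0.0, "B": 0.0, "END": 0.0}

# Trajectory A, -2, B, 3, A, 3, B, -4, A, 0, END as transitions.
transitions = [("A", -2, "B"), ("B", 3, "A"), ("A", 3, "B"),
               ("B", -4, "A"), ("A", 0, "END")]

for s, r, s_next in transitions:
    # Tabular TD(0) update: V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s))
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

print(V["A"], V["B"])
```

Note that with α = 1 the update simplifies to V(s) ← r + V(s′), so each update fully overwrites the old estimate.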

4. Which of the following statements are true for SARSA?

• It is a TD method.
• It is an off-policy algorithm.
• It uses bootstrapping to approximate full return.
• It always selects the greedy action choice.
`Answer :- `
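To make the SARSA properties concrete, here is a small illustrative sketch of a single SARSA step (all states, actions, and Q-values are made up for illustration). The key point is that the bootstrapped target uses Q(s′, a′) for the action a′ actually chosen by the same ε-greedy behaviour policy, which is what makes SARSA on-policy.

```python
import random

def epsilon_greedy(Q, s, actions, eps, rng):
    """Pick a random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

rng = random.Random(0)
actions = ["left", "right"]
Q = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.5, ("s1", "right"): 0.0}
alpha, gamma, eps = 0.5, 0.9, 0.1

s = "s0"
a = epsilon_greedy(Q, s, actions, eps, rng)        # behaviour policy acts
r, s_next = 1.0, "s1"                              # pretend environment step
a_next = epsilon_greedy(Q, s_next, actions, eps, rng)  # on-policy a'
# SARSA target bootstraps with Q(s', a'), not max_a Q(s', a):
Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```

Because a′ is sampled from the ε-greedy policy, SARSA can select non-greedy (exploratory) actions, and its target approximates the full return via bootstrapping rather than waiting for the episode to end.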

5. Assertion: In Expected-SARSA, we may select actions off-policy.
Reason: In the update rule for Expected-SARSA, we use the estimated expected value of the next state under the policy π rather than directly using the estimated value of the next state that is sampled on-policy.

• Assertion and Reason are both true and Reason is a correct explanation of Assertion.
• Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
• Assertion is true but Reason is false.
• Assertion is false but Reason is true.
`Answer :- `
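The Expected-SARSA target mentioned in the Reason can be sketched as follows (the Q-values, reward, and action names here are assumed for illustration). Instead of sampling a single next action a′, the target averages Q(s′, ·) under the target policy π, taken here to be ε-greedy with ε = 0.1.

```python
eps = 0.1
actions = ["left", "right"]
Q_next = {"left": 2.0, "right": 4.0}   # assumed estimates of Q(s', .)

greedy = max(actions, key=lambda a: Q_next[a])
# pi(a|s'): eps spread uniformly over actions, plus (1 - eps) on the greedy one.
pi = {a: eps / len(actions) + (1.0 - eps) * (a == greedy) for a in actions}

expected_value = sum(pi[a] * Q_next[a] for a in actions)
r, gamma = 1.0, 1.0
target = r + gamma * expected_value   # 1 + (0.05*2 + 0.95*4) = 4.9
```

Since the target no longer depends on which action was actually sampled next, the transitions used for the update need not come from π itself.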

6. Assertion: Q-learning can use asynchronous samples from different policies to update Q values.
Reason: Q-learning is an off-policy learning algorithm.

• Assertion and Reason are both true and Reason is a correct explanation of Assertion.
• Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
• Assertion is true but Reason is false.
• Assertion is false but Reason is true.
`Answer :- `
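A minimal sketch of the Q-learning update helps here (illustrative names and values, not from the assignment). The target bootstraps with maxₐ Q(s′, a) regardless of which policy generated the transition, which is exactly the off-policy property the Reason refers to: transitions gathered by any behaviour policy can be fed into this update.

```python
alpha, gamma = 0.5, 0.9
Q = {("s0", "a0"): 0.0, ("s1", "a0"): 2.0, ("s1", "a1"): 5.0}

def q_update(Q, s, a, r, s_next, actions_next):
    """One Q-learning update; the transition may come from any policy."""
    best = max(Q[(s_next, b)] for b in actions_next)  # greedy bootstrap
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

# This transition could have been produced by a random behaviour policy:
q_update(Q, "s0", "a0", r=1.0, s_next="s1", actions_next=["a0", "a1"])
print(Q[("s0", "a0")])   # approximately 0 + 0.5 * (1 + 0.9*5 - 0) = 2.75
```

Because the update never references the behaviour policy's action probabilities, samples collected asynchronously from different policies can all be applied to the same Q table.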

7. Suppose, for a 2 player game that we have modeled as an MDP, instead of learning a policy over the MDP directly, we separate the deterministic and stochastic result of playing an action to create ‘after-states’ (as discussed in the lectures). Consider the following statements:

(i) The set of states that make up ‘after-states’ may be different from the original set of states for the MDP.
(ii) The set of ‘after-states’ could be smaller than the original set of states for the MDP.

Which of the above statements is/are True?

• Only (i)
• Only (ii)
• Both (i) and (ii)
• Neither (i) nor (ii)
`Answer :- `

8. Assertion: Rollout algorithms take advantage of the policy improvement property.
Reason: Rollout algorithms select the action with the highest estimated value.

• Assertion and Reason are both true and Reason is a correct explanation of Assertion.
• Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
• Assertion is true but Reason is false.
• Assertion and Reason are both false.
`Answer :- `

Consider the environment given below (CliffWorld, discussed in lecture):
Suppose we use ϵ-greedy policy for exploration with a value of ϵ=0.1. Select the correct option(s):

• Q-Learning finds the optimal (red) path.
• Q-Learning finds the safer (blue) path.
• SARSA finds the optimal (red) path.
• SARSA finds the safer (blue) path.
`Answer :- `