NPTEL Reinforcement Learning Week 6 Assignment Answers 2024

By Sanket

1. Which of the following are true?

  • Dynamic programming methods use full backups and bootstrapping.
  • Temporal-Difference methods use sample backups and bootstrapping.
  • Monte-Carlo methods use sample backups and bootstrapping.
  • Monte-Carlo methods use full backups and no bootstrapping.
Answer :-

2. Consider the following statements:
(i) TD(0) methods use an unbiased sample of the return.
(ii) TD(0) methods use a sample of the reward from the distribution of rewards.
(iii) TD(0) methods use the current estimate of the value function.
Which of the above statements is/are true?

  • (i), (ii)
  • (i),(iii)
  • (ii), (iii)
  • (i), (ii), (iii)
Answer :-

3. Consider an MDP with two states, A and B. Given the single trajectory below (in the pattern state, reward, next state, …), use on-policy TD(0) updates to estimate the values of the two states.

A, -2, B, 3, A, 3, B, -4, A, 0, END

Assume a discount factor γ=1, a learning rate α=1 and initial state-values of zero. What are the estimated values for the 2 states at the end of the sampled trajectory? (Note: You are not asked to compute the true values for the two states.)

  • V(A)=3,V(B)=3
  • V(A)=0,V(B)=0
  • V(A)=−1,V(B)=−2
  • V(A)=−2,V(B)=−1
Answer :-
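The updates in question 3 can be replayed mechanically. The sketch below applies the tabular TD(0) rule, V(s) ← V(s) + α[r + γV(s′) − V(s)], to each (state, reward, next state) transition in the given trajectory, with the terminal value fixed at zero (variable names are illustrative):

```python
gamma, alpha = 1.0, 1.0                     # as given in the question
V = {"A": 0.0, "B": 0.0, "END": 0.0}        # zero-initialised values, END terminal

# (state, reward, next_state) transitions read off the trajectory
transitions = [("A", -2, "B"), ("B", 3, "A"), ("A", 3, "B"),
               ("B", -4, "A"), ("A", 0, "END")]

for s, r, s_next in transitions:
    # tabular TD(0) update
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

print(V["A"], V["B"])  # → 0.0 0.0
```

With α = 1 each update overwrites the old estimate with r + γV(s′), so tracing the five transitions by hand gives the same result as the loop.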

4. Which of the following statements are true for SARSA?

  • It is a TD method.
  • It is an off-policy algorithm.
  • It uses bootstrapping to approximate the full return.
  • It always selects the greedy action choice.
Answer :- 
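For reference, SARSA's one-step tabular update can be sketched as follows; the target bootstraps from the action actually taken next, which is what makes it on-policy (state and action names here are illustrative, not from the question):

```python
from collections import defaultdict

Q = defaultdict(float)    # tabular action-value estimates, default 0
alpha, gamma = 0.5, 0.9   # illustrative hyperparameters

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy TD update: the target uses the action a_next that the
    behaviour policy actually selected in s_next (hence S, A, R, S', A')."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

sarsa_update("s0", "left", 1.0, "s1", "right")
print(Q[("s0", "left")])  # → 0.5 from zero-initialised Q
```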

5. Assertion: In Expected-SARSA, we may select actions off-policy.
Reason: In the update rule for Expected-SARSA, we use the estimated expected value of the next state under the policy π rather than directly using the estimated value of the next state that is sampled on-policy.

  • Assertion and Reason are both true and Reason is a correct explanation of Assertion.
  • Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
  • Assertion is true but Reason is false.
  • Assertion is false but Reason is true.
Answer :-
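The update rule described in the Reason can be written out directly: the target averages next-state action values under the target policy π instead of using the one sampled next action. A sketch with an ε-greedy π (all names and hyperparameters illustrative):

```python
ACTIONS = ["left", "right"]
alpha, gamma, eps = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in ["s0", "s1"] for a in ACTIONS}

def pi_probs(s):
    """Epsilon-greedy target-policy probabilities over ACTIONS."""
    best = max(ACTIONS, key=lambda a: Q[(s, a)])
    return {a: eps / len(ACTIONS) + (1 - eps if a == best else 0.0)
            for a in ACTIONS}

def expected_sarsa_update(s, a, r, s_next):
    # Taking the expectation under pi removes any dependence on the sampled
    # next action, so the transition (s, a, r, s') may come from a different
    # behaviour policy -- the off-policy flexibility the Assertion refers to.
    expected_v = sum(p * Q[(s_next, b)] for b, p in pi_probs(s_next).items())
    Q[(s, a)] += alpha * (r + gamma * expected_v - Q[(s, a)])

expected_sarsa_update("s0", "left", 1.0, "s1")
```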

6. Assertion: Q-learning can use asynchronous samples from different policies to update Q values.
Reason: Q-learning is an off-policy learning algorithm.

  • Assertion and Reason are both true and Reason is a correct explanation of Assertion.
  • Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
  • Assertion is true but Reason is false.
  • Assertion is false but Reason is true.
Answer :-
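Q-learning's target maximises over next-state actions, which is why the transitions it learns from need not come from the policy being improved. A minimal tabular sketch (names illustrative):

```python
from collections import defaultdict

Q = defaultdict(float)    # tabular action-value estimates, default 0
alpha, gamma = 0.5, 0.9
ACTIONS = ["left", "right"]

def q_learning_update(s, a, r, s_next):
    """Off-policy TD update: the max over next actions makes the target
    independent of whichever policy generated the transition (s, a, r, s')."""
    target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# such transitions could come asynchronously from any mix of behaviour policies
q_learning_update("s0", "left", 1.0, "s1")
print(Q[("s0", "left")])  # → 0.5 from zero-initialised Q
```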

7. Suppose, for a 2 player game that we have modeled as an MDP, instead of learning a policy over the MDP directly, we separate the deterministic and stochastic result of playing an action to create ‘after-states’ (as discussed in the lectures). Consider the following statements:

(i) The set of states that make up ‘after-states’ may be different from the original set of states for the MDP.
(ii) The set of ‘after-states’ could be smaller than the original set of states for the MDP.

Which of the above statements is/are True?

  • Only (i)
  • Only (ii)
  • Both (i) and (ii)
  • Neither (i) nor (ii)
Answer :- 

8. Assertion: Rollout algorithms take advantage of the policy improvement property.
Reason: Rollout algorithms select the action with the highest estimated value.

  • Assertion and Reason are both true and Reason is a correct explanation of Assertion.
  • Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
  • Assertion is true but Reason is false.
  • Assertion and Reason are both false.
Answer :- 

9. Consider the environment below (the CliffWorld example discussed in the lectures).
Suppose we use an ϵ-greedy policy for exploration with ϵ = 0.1. Select the correct option(s):

[Figure: CliffWorld grid showing the optimal (red) path along the cliff edge and the safer (blue) path]
  • Q-Learning finds the optimal (red) path.
  • Q-Learning finds the safer (blue) path.
  • SARSA finds the optimal (red) path.
  • SARSA finds the safer (blue) path.
Answer :- For Answers Click Here 