NPTEL Reinforcement Learning Week 9 Assignment Answers 2024

By Sanket


1. Which of the following is true about DQN?

  • It can be efficiently used for very large state spaces
  • It can be efficiently used for continuous action spaces
Answer :-

2. How many outputs will we get from the final layer of a DQN Network (|S| and |A| represent the total number of states and actions in the environment respectively)?

  • |S| × |A|
  • |S|
  • |A|
  • None of these
Answer :-
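For reference, a DQN takes a state as input and produces one Q-value per action, so its final layer has |A| outputs. Below is a minimal sketch in PyTorch, assuming hypothetical sizes of a 4-dimensional state and 2 discrete actions:

```python
import torch
import torch.nn as nn

STATE_DIM = 4     # hypothetical state dimension
NUM_ACTIONS = 2   # hypothetical number of discrete actions

class DQN(nn.Module):
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        # The network maps a state to one Q-value per action,
        # so the final layer has |A| outputs.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = DQN(STATE_DIM, NUM_ACTIONS)
q_values = q_net(torch.zeros(1, STATE_DIM))
print(q_values.shape)  # torch.Size([1, 2]), i.e. (batch, |A|)
```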

3. What are the reasons behind using an experience replay buffer in DQN?

  • Random sampling from experience replay buffer breaks correlations among transitions.
  • It leads to efficient usage of real-world samples.
  • It guarantees convergence to the optimal policy.
  • None of the above
Answer :-
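As a rough illustration of why random sampling helps: a replay buffer stores transitions in the order they arrive (which are temporally correlated) and returns uniformly sampled minibatches for training, so each real-world sample can also be reused many times. A minimal sketch, assuming transitions are stored as (state, action, reward, next_state, done) tuples:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 10_000):
        # Oldest transitions are discarded once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions and reuses past experience.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```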

4. Assertion: DQN is implemented with a current and a target network.
Reason: Using a target network helps avoid chasing a non-stationary target.

  • Both Assertion and Reason are true, and the Reason is the correct explanation for the Assertion.
  • Both Assertion and Reason are true, but the Reason is not the correct explanation for the Assertion.
  • Assertion is true, Reason is false.
  • Both Assertion and Reason are false.
Answer :- 
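One way to see the role of the target network: the TD target is computed from a periodically synced copy of the online network, so the regression target does not shift with every gradient step. A minimal sketch, assuming hypothetical sizes and a discount factor gamma:

```python
import copy
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 4, 2   # hypothetical sizes
gamma = 0.99                    # hypothetical discount factor

online_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS)
)
target_net = copy.deepcopy(online_net)  # frozen copy used only for targets

def td_target(reward: torch.Tensor, next_state: torch.Tensor,
              done: torch.Tensor) -> torch.Tensor:
    # Targets come from target_net, not online_net, so the online network
    # is not chasing a target that moves with every gradient step.
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values
    return reward + gamma * (1.0 - done) * next_q

# Periodically (e.g. every few thousand steps) sync the frozen copy:
target_net.load_state_dict(online_net.state_dict())
```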

5. Assertion: Actor-critic updates have lower variance than REINFORCE updates.
Reason: Actor-critic methods use the TD target instead of the Monte Carlo return G_t.

  • Both Assertion and Reason are true, and the Reason is the correct explanation for the Assertion.
  • Both Assertion and Reason are true, but the Reason is not the correct explanation for the Assertion.
  • Assertion is true, Reason is false.
  • Both Assertion and Reason are false.
Answer :- 
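For context, the REINFORCE update weights the score function by the full Monte Carlo return, while the actor-critic update replaces it with a bootstrapped TD error, which typically lowers variance at the cost of some bias. In standard notation (α is the step size, V is the critic's value estimate):

REINFORCE:    θ ← θ + α · G_t · ∇_θ log π_θ(a_t | s_t),  where G_t = r_{t+1} + γ r_{t+2} + … + γ^{T−t−1} r_T
Actor-critic: θ ← θ + α · δ_t · ∇_θ log π_θ(a_t | s_t),  where δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)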

6. Suppose we are using a policy gradient method to solve a reinforcement learning problem. Assuming that the policy returned by the method is not optimal, which among the following are plausible reasons for such an outcome?

  • The search procedure converged to a locally optimal policy.
  • The search procedure was terminated before it could reach an optimal policy.
  • An optimal policy could not be represented by the parameterisation used to represent the policy.
  • None of these
Answer :-

7. [This question refers to a figure ("image 116") that is not reproduced here.]
Answer :- 

8. State True or False:
Monte Carlo policy gradient methods typically converge faster than actor-critic methods, given that we use similar parameterisations and that the approximation to Q^π used in the actor-critic method satisfies the compatibility criteria.

  • True
  • False
Answer :- 

9. When using policy gradient methods, if we make use of the average reward formulation rather than the discounted reward formulation, then is it necessary to assign a designated start state, s0?

  • Yes
  • No
  • Can’t say
Answer :- 

10. State True or False:
Exploration techniques like softmax (or other equivalent techniques) are not needed for DQN as the randomisation provided by experience replay provides sufficient exploration.

  • True
  • False
Answer :-
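For context, experience replay only shuffles transitions that have already been collected; the behaviour policy still needs explicit exploration when acting, most commonly epsilon-greedy in DQN. A minimal sketch, assuming a hypothetical Q-network of the same shape as in the earlier snippets:

```python
import random
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 4, 2   # hypothetical sizes
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS)
)

def select_action(state: torch.Tensor, epsilon: float = 0.1) -> int:
    # With probability epsilon take a random action (exploration);
    # otherwise act greedily with respect to the current Q-network.
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))
    return int(q_values.argmax(dim=1).item())

action = select_action(torch.zeros(STATE_DIM))
```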