NPTEL Reinforcement Learning Week 3 Assignment Answers 2024



1. The baseline in the REINFORCE update should not depend on which of the following (without invalidating any of the steps in the proof of REINFORCE)?

  • r_{n−1}
  • r_n
  • Action taken (a_n)
  • None of the above
Answer :-
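
For intuition, here is a minimal sketch (not from the course materials) of a REINFORCE update with a baseline, assuming a tabular softmax policy; the names theta, baseline, and alpha are illustrative. The point the question tests: the baseline may depend on the state, but not on the action taken, otherwise the unbiasedness argument in the REINFORCE proof breaks.

```python
# A minimal sketch of a REINFORCE update with a baseline, assuming a
# softmax policy over discrete actions with a tabular preference matrix.
# Names (theta, baseline, alpha) are illustrative, not from the course.
import numpy as np

def softmax(prefs):
    z = prefs - prefs.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, baseline, alpha=0.01, gamma=1.0):
    """episode: list of (state, action, reward) tuples.
    baseline: array indexed by state; it may depend on the state,
    but NOT on the action taken -- that is what keeps the gradient
    estimate unbiased in the REINFORCE proof."""
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        probs = softmax(theta[s])
        grad_log_pi = -probs          # gradient of log softmax w.r.t. preferences...
        grad_log_pi[a] += 1.0          # ...for the chosen action
        theta[s] += alpha * (G - baseline[s]) * grad_log_pi
    return theta
```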

2. Which of the following statements is true about the RL problem?

  • Our main aim is to maximize the cumulative reward.
  • The agent always performs the actions in a deterministic fashion.
  • We assume that the agent determines the next state based on the current state and action.
  • It is impossible to have zero rewards.
Answer :-

3. [Question 3 appeared as an image in the original post; only its answer options are reproduced below.]

  • (i), (iii)
  • (i), (iv)
  • (ii), (iv)
  • (ii), (iii)
Answer :- 

4. [Question 4 appeared as an image in the original post and is not reproduced here.]

Answer :-

5. [Question 5 appeared as an image in the original post and is not reproduced here.]

Answer :- 

6. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.
Reason: We can define an MDP with n states, where n is the number of bandits. The number of actions from each state corresponds to the arms of each bandit, with every action leading to termination of the episode and giving a reward according to the corresponding bandit and arm.

  • Assertion and Reason are both true and Reason is a correct explanation of Assertion
  • Assertion and Reason are both true and Reason is not a correct explanation of Assertion
  • Assertion is true and Reason is false
  • Both Assertion and Reason are false
Answer :- 
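
As a rough illustration of the reduction described in the Reason, here is a sketch of a contextual bandit wrapped as a one-step episodic MDP; the state count, arm count, and reward table below are made up for this example.

```python
# A minimal sketch of the reduction in the Reason: a contextual bandit
# viewed as a one-step episodic MDP. Context/arm counts and the reward
# table are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_contexts, n_arms = 3, 4                        # n bandits, each with k arms
mean_reward = rng.uniform(size=(n_contexts, n_arms))

def step(state, action):
    """Every action terminates the episode and pays the bandit reward."""
    reward = rng.normal(loc=mean_reward[state, action], scale=0.1)
    done = True                                   # episode length is exactly 1
    return reward, done
```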

7. Let’s assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s where we took action a_1. After a few time steps, at time t′, the same state s was reached, where we performed an action a_2 (≠ a_1). Which of the following statements is true?

  • π is definitely a Stationary policy
  • π is definitely a Non-Stationary policy
  • π can be Stationary or Non-Stationary
Answer :-
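
The subtlety here is that a stationary policy need not be deterministic. Here is a minimal sketch, with arbitrary probabilities, of a stationary stochastic policy that can pick different actions on two visits to the same state:

```python
# A stationary *stochastic* policy maps each state to a fixed action
# distribution (no time dependence), yet can sample different actions
# on different visits. The probabilities below are arbitrary.
import random

pi = {"s": [("a1", 0.5), ("a2", 0.5)]}   # stationary: same distribution at all times

def act(state):
    actions, weights = zip(*pi[state])
    return random.choices(actions, weights=weights, k=1)[0]

print(act("s"), act("s"))  # may print different actions on two visits to s
```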

8. The stochastic gradient ascent/descent update moves in the right direction at every step.

  • True
  • False
Answer :- 
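
For intuition, here is a small sketch (illustrative numbers only) showing that a single-sample gradient estimate agrees with the true gradient only in expectation; individual steps can point the wrong way:

```python
# Stochastic gradient steps follow the true gradient only in expectation.
# For f(w) = E[(w - x)^2] with x ~ N(1, 1), the true gradient at w is
# 2*(w - 1), but a single-sample estimate 2*(w - x) can have the
# opposite sign. All numbers here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
w = 0.0
true_grad = 2 * (w - 1.0)                 # -2: descent should move w to the right
samples = rng.normal(loc=1.0, scale=1.0, size=1000)
stoch_grads = 2 * (w - samples)           # one-sample gradient estimates
wrong_sign = np.mean(np.sign(stoch_grads) != np.sign(true_grad))
print(f"fraction of steps in the wrong direction: {wrong_sign:.2f}")  # ~0.16
```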

9. Which of the following is true for an MDP?

  • Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1})
  • Pr(s_{t+1}, r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, s_{t−2}, a_{t−2}, …, s_0, a_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t)
  • Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_0, a_0)
  • Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_t, r_t | s_{t−1}, a_{t−1})
Answer :- 
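
For intuition, here is a small sketch of what the Markov property means in code: the transition model takes only the current state and action, never the history. The 2-state, 2-action transition table below is made up for this example.

```python
# Under the Markov property, a transition model needs only the current
# state-action pair -- the history never appears in its signature.
# The 2-state, 2-action probabilities below are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
# P[s, a, s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])

def transition(s, a):
    """Next state depends only on (s, a): that is the Markov property."""
    return rng.choice(2, p=P[s, a])
```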

10. Remember for discounted returns,

G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + …

where γ is the discount factor. Which of the following best explains what happens when γ > 1 (say, γ = 5)?

  • Nothing, γ>1 is common for many RL problems
  • Theoretically nothing can go wrong, but this case does not represent any real world problems
  • The agent will learn that delayed rewards will always be beneficial and so will not learn properly.
  • None of the above is true.
Answer :-
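
For intuition, here is a quick sketch showing why γ > 1 is pathological: the return of even a constant reward stream blows up as the horizon grows, and later rewards outweigh earlier ones instead of being discounted.

```python
# With gamma = 5, even a constant reward stream makes the return explode,
# and later rewards dominate earlier ones instead of being discounted.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0] * 10
print(discounted_return(rewards, gamma=0.9))  # ~6.5, stays bounded as horizon grows
print(discounted_return(rewards, gamma=5.0))  # ~2.4 million after only 10 steps
```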