NPTEL Reinforcement Learning Week 3 Assignment Answers 2024
1. The baseline in the REINFORCE update should not depend on which of the following (without voiding any of the steps in the proof of REINFORCE)?
- r_{n−1}
- r_n
- Action taken (a_n)
- None of the above
Answer :-
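For question 1, here is a minimal sketch (in LaTeX, using standard REINFORCE notation with a parameterized policy π_θ and a baseline b; the notation is an assumption, not quoted from the assignment) of the proof step that constrains what the baseline may depend on:

```latex
% The baseline term can be subtracted without biasing the REINFORCE gradient
% precisely because b(s) does not depend on the action a:
\[
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
  &= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
  &= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)
   = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
   = b(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
\]
% If b were a function of the action (for instance through the reward that the
% chosen action a_n produces), it could not be pulled out of the sum and this
% cancellation would no longer hold.
```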
2. Which of the following statements are true about the RL problem?
(i) Our main aim is to maximize the cumulative reward.
(ii) The agent always performs the actions in a deterministic fashion.
(iii) We assume that the agent determines the next state based on the current state and action.
(iv) It is impossible to have zero rewards.
- (i), (iii)
- (i), (iv)
- (ii), (iv)
- (ii), (iii)
Answer :-
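For reference on question 2, here is a toy agent–environment loop in Python; the ToyEnv class and its dynamics are made up purely for illustration. It shows the basic setup the statements refer to: the agent picks actions (possibly stochastically), the environment returns the next state and reward given the current state and action, and the quantity being accumulated is the cumulative reward.

```python
import random

# Made-up environment used only to illustrate the agent-environment loop.
class ToyEnv:
    def __init__(self):
        self.state = 0

    def step(self, action):
        # The environment produces the next state and reward from the
        # current state and action; the transition here is stochastic.
        next_state = (self.state + action + random.choice([0, 1])) % 3
        reward = 1.0 if next_state == 2 else 0.0   # some steps give zero reward
        self.state = next_state
        return next_state, reward

env = ToyEnv()
total_reward = 0.0
for t in range(10):
    action = random.choice([0, 1])   # a stochastic policy for illustration
    state, reward = env.step(action)
    total_reward += reward           # the objective is the cumulative reward

print("return over 10 steps:", total_reward)
```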
6. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.
Reason: We can define an MDP with n states, where n is the number of bandits. The number of actions from each state corresponds to the arms in each bandit, with every action terminating the episode and giving a reward according to the corresponding bandit and arm.
- Assertion and Reason are both true and Reason is a correct explanation of Assertion
- Assertion and Reason are both true and Reason is not a correct explanation of Assertion
- Assertion is true and Reason is false
- Both Assertion and Reason are false
Answer :-
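The construction described in the Reason of question 6 can be sketched directly; the contexts, arms, and reward values below are made up for illustration. Each state is a context (a bandit), each action is an arm, and every action ends the episode with that arm's reward.

```python
import random

# Hypothetical contextual-bandit-as-MDP sketch: arm_means[s][a] is the mean
# reward of arm a in context (state) s. All numbers are made up.
arm_means = [
    [0.1, 0.7, 0.4],   # context 0
    [0.9, 0.2, 0.5],   # context 1
]

def reset():
    # Each episode starts in a randomly drawn context (state).
    return random.randrange(len(arm_means))

def step(state, action):
    # Pulling any arm gives a (noisy) reward and immediately ends the episode.
    reward = random.gauss(arm_means[state][action], 0.1)
    done = True
    return reward, done

state = reset()
reward, done = step(state, action=1)
print(state, reward, done)   # every episode has exactly one transition
```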
7. Let’s assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s where we took action a_1. A few time steps later, at time t′, the same state s was reached, where we performed an action a_2 (≠ a_1). Which of the following statements is true?
- π is definitely a Stationary policy
- π is definitely a Non-Stationary policy
- π can be Stationary or Non-Stationary
Answer :-
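A small sketch for question 7 (the action probabilities are made up): a stationary but stochastic policy keeps the same action distribution for a state at all times, yet different visits to that state can still produce different sampled actions.

```python
import random

# A stationary stochastic policy: the action distribution for a state never
# changes with time, yet repeated visits to the same state can yield
# different sampled actions.
policy = {"s": {"a1": 0.5, "a2": 0.5}}   # fixed distribution over actions in state s

def act(state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act("s"), act("s"))   # same state, possibly different actions, e.g. a1 a2
```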
8. The stochastic gradient ascent/descent update moves in the right direction at every step.
- True
- False
Answer :-
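A numerical illustration for question 8, using made-up data and the objective f(w) = mean_i (w − x_i)^2: the stochastic gradient computed from a single sample equals the full gradient only in expectation, so an individual step can point away from the optimum.

```python
# Minimizing f(w) = mean_i (w - x_i)^2 over some made-up data points.
data = [1.0, 2.0, 10.0]        # mean is ~4.33, so the minimizer of f is ~4.33
w = 1.5

# Full-batch gradient at w: negative, i.e. "increase w" toward the minimizer.
full_grad = sum(2 * (w - x) for x in data) / len(data)

# Stochastic gradient from the single sample x = 1.0: positive, i.e. "decrease w".
sample_grad = 2 * (w - data[0])

print(full_grad, sample_grad)  # approx -5.67 vs +1.0: this SGD step moves away from the optimum
# In expectation over samples the stochastic gradient equals the full gradient,
# which is why SGD still converges, but any individual step can go the wrong way.
```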
9. Which of the following is true for an MDP?
- Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1})
- Pr(s_{t+1}, r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, s_{t−2}, a_{t−2}, …, s_0, a_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t)
- Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_0, a_0)
- Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_t, r_t | s_{t−1}, a_{t−1})
Answer :-
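For question 9, a quick empirical illustration of the Markov property (all transition probabilities below are made up): in an MDP, conditioning on more of the history does not change the next-state distribution beyond what (s_t, a_t) already gives.

```python
import random
from collections import Counter

# Made-up two-state MDP: transition probabilities depend only on the current
# (state, action) pair, never on earlier history.
P = {
    (0, 0): [0.8, 0.2],   # P(next=0), P(next=1) from state 0, action 0
    (0, 1): [0.3, 0.7],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.1, 0.9],
}

def step(s, a):
    return 0 if random.random() < P[(s, a)][0] else 1

# Compare the empirical next-state distribution conditioned on (s_t, a_t) alone
# with the one conditioned on (s_{t-1}, a_{t-1}, s_t, a_t): in an MDP they match.
short_ctx, long_ctx = Counter(), Counter()
s_prev, a_prev, s = 0, 0, 0
for _ in range(200_000):
    a = random.randrange(2)
    s_next = step(s, a)
    if (s, a) == (0, 1):
        short_ctx[s_next] += 1
        if (s_prev, a_prev) == (1, 0):
            long_ctx[s_next] += 1
    s_prev, a_prev, s = s, a, s_next

print(short_ctx[1] / sum(short_ctx.values()))        # ~0.7
print(long_ctx[1] / max(sum(long_ctx.values()), 1))  # also ~0.7: extra history adds nothing
```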
10. Recall that for discounted returns,
G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + …
where γ is the discount factor. Which of the following best explains what happens when γ > 1 (say, γ = 5)?
- Nothing; γ > 1 is common in many RL problems
- Theoretically nothing can go wrong, but this case does not represent any real-world problems
- The agent will learn that delayed rewards are always more beneficial, and so it will not learn properly.
- None of the above is true.
Answer :-
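A small numerical sketch for question 10, assuming a made-up constant reward of 1 per step: with γ = 0.9 the partial sums of G_t approach a finite limit, while with γ = 5 they explode within a handful of steps.

```python
def partial_returns(gamma, reward=1.0, horizon=20):
    """Partial sums of G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... with a constant reward."""
    g, sums = 0.0, []
    for k in range(horizon):
        g += (gamma ** k) * reward
        sums.append(g)
    return sums

print(partial_returns(0.9)[-1])   # ~8.8, approaching the limit 1/(1-0.9) = 10
print(partial_returns(5.0)[-1])   # ~2.4e13 after only 20 steps: the return diverges
```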