## NPTEL Reinforcement Learning Week 3 Assignment Answers 2024

1. The baseline in the REINFORCE update should not depend on which of the following (without invalidating any of the steps in the proof of REINFORCE)?

- r_{n−1}
- r_{n}
- Action taken (a_{n})
- None of the above

Answer :-
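The key step in the REINFORCE proof is that a baseline b contributes zero bias only if it does not depend on the chosen action, since Σ_a π(a) ∇ log π(a) = 0. A minimal numerical sketch of that fact, using a made-up 3-action softmax policy (all parameter values are hypothetical):

```python
import math

# Hypothetical 3-action softmax policy with one parameter per action.
theta = [0.2, -0.5, 1.0]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

probs = softmax(theta)

# For a softmax policy: d/d(theta_0) log pi(a) = 1{a == 0} - pi(0).
def dlogpi_dtheta0(a):
    return (1.0 if a == 0 else 0.0) - probs[0]

# Bias added to the gradient estimate: sum_a pi(a) * b(a) * d log pi(a) / d theta_0.
const_b = 5.0  # baseline independent of the action
bias_const = sum(probs[a] * const_b * dlogpi_dtheta0(a) for a in range(3))

action_b = [1.0, 2.0, 3.0]  # baseline that (illegitimately) depends on the action
bias_action = sum(probs[a] * action_b[a] * dlogpi_dtheta0(a) for a in range(3))

print(abs(bias_const) < 1e-12)   # True: an action-independent baseline adds no bias
print(abs(bias_action) > 1e-6)   # True: an action-dependent baseline biases the gradient
```

Note the baseline may still depend on past rewards such as r_{n−1} or r_{n}, since those are fixed before the action's log-probability gradient is taken in expectation.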

2. Which of the following statements is true about the RL problem?

- Our main aim is to maximize the cumulative reward.
- The agent always performs the actions in a deterministic fashion.
- We assume that the agent determines the next state based on the current state and action.
- It is impossible to have zero rewards.

Answer :-

- (i), (iii)
- (i), (iv)
- (ii), (iv)
- (ii), (iii)

Answer :-

Answer :-

Answer :-

6. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.

Reason: We can define an MDP with n states where n is the number of bandits. The number of actions from each state corresponds to the arms in each bandit, with every action leading to termination of the episode, and giving a reward according to the corresponding bandit and arm.

- Assertion and Reason are both true and Reason is a correct explanation of Assertion
- Assertion and Reason are both true and Reason is not a correct explanation of Assertion
- Assertion is true and Reason is false
- Both Assertion and Reason are false

Answer :-
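The Reason's construction can be sketched directly: one state per context (bandit), one action per arm, and every action ends the episode with that arm's reward. A minimal sketch with made-up state names and reward values (all numbers are hypothetical):

```python
# Hypothetical one-step MDP encoding of a contextual-bandit problem:
# each state is a context (bandit), each action is an arm, and every
# action terminates the episode. Reward values below are illustrative.
mean_reward = {
    ("bandit0", "arm0"): 1.0, ("bandit0", "arm1"): 0.2,
    ("bandit1", "arm0"): 0.5, ("bandit1", "arm1"): 0.9,
}

def step(state, action):
    """One MDP transition: emit the (state, arm) reward, then terminate
    the episode (done=True) regardless of which arm was pulled."""
    reward = mean_reward[(state, action)]
    return None, reward, True  # next_state is irrelevant: episode is over

next_state, r, done = step("bandit1", "arm1")
print(r, done)  # 0.9 True
```

Because every episode is exactly one step long, the full RL machinery (states, actions, episodic returns) applies, which is the sense in which the Reason explains the Assertion.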

7. Let’s assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s, where we took action a_{1}. After a few time steps, at time t′, the same state s was reached, where we performed an action a_{2} (≠ a_{1}). Which of the following statements is true?

- π is definitely a Stationary policy
- π is definitely a Non-Stationary policy
- π can be Stationary or Non-Stationary

Answer :-

8. The stochastic gradient ascent/descent update moves in the right direction at every step.

- True
- False

Answer :-
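The point of question 8 is that a single stochastic-gradient step follows the gradient of one sampled term, which can point away from the optimum; only the *expected* update follows the true gradient. A minimal sketch with a hypothetical objective f(w) = E[(w − x)²/2] over two samples:

```python
# Minimize f(w) = E[(w - x)^2 / 2] over the hypothetical samples x in {0, 10}.
samples = [0.0, 10.0]
w = 5.0  # w = E[x] = 5 is already the optimum, so the true gradient is 0

grads = [w - x for x in samples]      # per-sample stochastic gradients
avg_grad = sum(grads) / len(grads)    # full-batch (expected) gradient

print(grads)     # [5.0, -5.0]: either single step would move w away from the optimum
print(avg_grad)  # 0.0: the update is correct only in expectation
```

So the update is unbiased on average, but any individual step can go the wrong way.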

9. Which of the following is true for an MDP?

- P_{r}(s_{t+1}, r_{t+1} | s_{t}, a_{t}) = P_{r}(s_{t+1}, r_{t+1})
- P_{r}(s_{t+1}, r_{t+1} | s_{t}, a_{t}, s_{t−1}, a_{t−1}, s_{t−2}, a_{t−2}, …, s_{0}, a_{0}) = P_{r}(s_{t+1}, r_{t+1} | s_{t}, a_{t})
- P_{r}(s_{t+1}, r_{t+1} | s_{t}, a_{t}) = P_{r}(s_{t+1}, r_{t+1} | s_{0}, a_{0})
- P_{r}(s_{t+1}, r_{t+1} | s_{t}, a_{t}) = P_{r}(s_{t}, r_{t} | s_{t−1}, a_{t−1})

Answer :-

10. Recall that for discounted returns,

G_{t}=r_{t}+γr_{t+1}+γ^{2}r_{t+2}+…

where γ is the discount factor. Which of the following best explains what happens when γ > 1 (say, γ = 5)?

- Nothing, γ>1 is common for many RL problems
- Theoretically nothing can go wrong, but this case does not represent any real world problems
- The agent will learn that delayed rewards will always be beneficial and so will not learn properly.
- None of the above is true.

Answer :-
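The effect of γ > 1 can be checked numerically: with γ < 1 the return is bounded and early rewards dominate, while with γ > 1 later rewards are weighted ever more heavily and G_t blows up. A minimal sketch using a hypothetical stream of constant rewards:

```python
# Discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
# computed backwards over a finite reward list (hypothetical: 20 steps of reward 1).
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0] * 20

print(discounted_return(rewards, 0.9))  # ~8.78: bounded; early rewards dominate
print(discounted_return(rewards, 5.0))  # ~2.4e13: later rewards dominate and G_t diverges
```

With γ = 5, a reward 20 steps away is weighted 5¹⁹ times more than an immediate one, which inverts the usual preference for near-term reward.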