# NPTEL Reinforcement Learning Week 3 Assignment Answers 2024


1. The baseline in the REINFORCE update should not depend on which of the following (without invalidating any of the steps in the proof of REINFORCE)?

• rn−1
• rn
• Action taken (an)
• None of the above
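As a quick numeric sketch of why the baseline matters (toy logits and a softmax parameterization assumed, not part of the original question): for a softmax policy, ∇θ log π(a) = one_hot(a) − π, so for any baseline b that does not depend on the action, Eₐ[b ∇θ log π(a)] = b Σₐ ∇θ π(a) = 0, and subtracting b leaves the REINFORCE gradient estimate unbiased.

```python
import numpy as np

# Toy softmax policy over 3 actions with assumed parameters theta.
theta = np.array([0.2, -0.5, 1.0])
pi = np.exp(theta) / np.exp(theta).sum()

# Score function for a softmax policy: grad_theta log pi(a) = one_hot(a) - pi.
scores = np.eye(3) - pi  # row a is the score vector for action a

# A baseline that does NOT depend on the action taken:
b = 3.7
# Expected baseline term over actions: sum_a pi(a) * b * score(a) = b * 0.
bias = sum(pi[a] * b * scores[a] for a in range(3))
print(np.allclose(bias, 0))  # True: an action-independent baseline adds no bias

# If the baseline depends on the action (b_a), the term generally does NOT vanish:
b_a = np.array([1.0, 2.0, 3.0])
bias_dep = sum(pi[a] * b_a[a] * scores[a] for a in range(3))
print(np.allclose(bias_dep, 0))  # False: an action-dependent "baseline" biases the estimate
```

The same cancellation argument goes through for baselines that depend on past rewards or the state, which is why only the action is off-limits.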

2. Which of the following statements is true about the RL problem?

(i) Our main aim is to maximize the cumulative reward.
(ii) The agent always performs the actions in a deterministic fashion.
(iii) We assume that the agent determines the next state based on the current state and action.
(iv) It is impossible to have zero rewards.

• (i), (iii)
• (i), (iv)
• (ii), (iv)
• (ii), (iii)

6. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.
Reason: We can define an MDP with n states where n is the number of bandits. The number of actions from each state corresponds to the arms in each bandit, with every action leading to termination of the episode, and giving a reward according to the corresponding bandit and arm.

• Assertion and Reason are both true and Reason is a correct explanation of Assertion
• Assertion and Reason are both true and Reason is not a correct explanation of Assertion
• Assertion is true and Reason is false
• Both Assertion and Reason are false

7. Let’s assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s, where we took action a1. A few time steps later, at time t′, the same state s was reached, where we performed an action a2 (≠ a1). Which of the following statements is true?

• π is definitely a Stationary policy
• π is definitely a Non-Stationary policy
• π can be Stationary or Non-Stationary

8. The stochastic gradient ascent/descent update occurs in the right direction at every step.

• True
• False

9. Which of the following is true for an MDP?

• Pr(st+1,rt+1|st,at)=Pr(st+1,rt+1)
• Pr(st+1,rt+1|st,at,st−1,at−1,st−2,at−2,…,s0,a0)=Pr(st+1,rt+1|st,at)
• Pr(st+1,rt+1|st,at)=Pr(st+1,rt+1|s0,a0)
• Pr(st+1,rt+1|st,at)=Pr(st,rt|st−1,at−1)
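The Markov property behind these options can be checked numerically on a toy chain (the 3-state transition matrix and initial distribution below are illustrative assumptions, not from the question): conditioning on the full history (s0, s1) gives the same next-state distribution as conditioning on s1 alone.

```python
import numpy as np

# Toy Markov chain: P[i, j] = Pr(next state = j | current state = i).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
mu0 = np.array([0.3, 0.3, 0.4])  # assumed initial state distribution

# Joint Pr(s0 = k, s1 = i, s2 = j) = mu0[k] * P[k, i] * P[i, j].
joint = mu0[:, None, None] * P[:, :, None] * P[None, :, :]

# Conditional Pr(s2 = j | s1 = i, s0 = k) = joint / Pr(s0 = k, s1 = i).
cond = joint / joint.sum(axis=2, keepdims=True)

# Markov property: the conditional equals P[i, j] for every history s0 = k.
print(np.allclose(cond, np.broadcast_to(P, cond.shape)))  # True
```

This is exactly the second option above, with rewards folded in: the joint distribution of the next state and reward depends on the history only through (st, at).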

10. Remember that for discounted returns,

Gt = rt + γrt+1 + γ²rt+2 + …

where γ is the discount factor. Which of the following best explains what happens when γ > 1 (say, γ = 5)?

• Nothing, γ>1 is common for many RL problems
• Theoretically nothing can go wrong, but this case does not represent any real world problems
• The agent will learn that delayed rewards will always be beneficial and so will not learn properly.
• None of the above is true.
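A minimal sketch of the issue (toy numbers, constant reward of 1 per step assumed): with γ < 1 the geometric series Gt converges, while with γ > 1 the partial sums blow up and later rewards dominate earlier ones.

```python
def discounted_return(gamma, horizon):
    """Partial sum of G_t = sum_k gamma**k * r_{t+k} with r = 1 every step."""
    return sum(gamma ** k for k in range(horizon))

# gamma < 1: the series converges toward 1 / (1 - gamma) = 10.
print(discounted_return(0.9, 100))

# gamma > 1: the sum diverges; the gamma**k weight on the k-th reward
# grows without bound, so distant rewards swamp immediate ones.
print(discounted_return(5.0, 100))
```

With γ = 5 the weight on a reward 100 steps away is 5¹⁰⁰, so any finite immediate reward becomes irrelevant by comparison.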