## NPTEL Reinforcement Learning Week 2 Assignment Answers 2024

1. Which of the following is true of the UCB algorithm?

- The action with the highest Q value is chosen at every iteration.
- After a very large number of iterations, the confidence intervals of unselected actions will not change much.
- The true expected value of an action always lies within its estimated confidence interval.
- With a small probability ε, we select a random action to ensure adequate exploration of the action space.

Answer :-

3. In a 4-arm bandit problem, after executing 100 iterations of the UCB algorithm, the estimated Q values are Q_{100}(1)=1.73, Q_{100}(2)=1.83, Q_{100}(3)=1.89, Q_{100}(4)=1.55, and the number of times each arm has been sampled is n_{1}=25, n_{2}=20, n_{3}=30, n_{4}=25. Which arm will be sampled in the next trial?

- Arm 1
- Arm 2
- Arm 3
- Arm 4

Answer :-
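The selection rule behind this question can be checked numerically. The following is a minimal sketch, assuming the standard UCB1 index Q_t(a) + sqrt(2 ln t / n_a); the exploration constant `c` and the helper name `ucb_index` are assumptions, not something the question states.

```python
import math

def ucb_index(q, n, t, c=2.0):
    """UCB1-style index: estimate plus exploration bonus sqrt(c * ln t / n)."""
    return q + math.sqrt(c * math.log(t) / n)

# Values from the question: estimates and pull counts after t = 100 iterations.
q_values = {1: 1.73, 2: 1.83, 3: 1.89, 4: 1.55}
counts = {1: 25, 2: 20, 3: 30, 4: 25}
t = 100

# Compute the index of every arm and pick the largest one.
indices = {arm: ucb_index(q_values[arm], counts[arm], t) for arm in q_values}
next_arm = max(indices, key=indices.get)
print(indices, next_arm)
```

Under this assumption the index is largest for Arm 2, which has a slightly lower estimate than Arm 3 but has been sampled less often, so its exploration bonus is larger.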

4. We need 6 rounds of median elimination to get an (ε,δ)-PAC arm. Approximately how many samples would have been required using the naive (ε,δ)-PAC algorithm, given (ε,δ)=(1/2,1/e)? (Choose the value closest to the correct answer.)

- 1500
- 1000
- 500
- 3000

Answer :-
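A rough check of the arithmetic, as a sketch under two assumptions that the question does not spell out: median elimination halves the arm set each round, so 6 rounds correspond to k = 2^6 = 64 arms, and the naive algorithm pulls every arm (2/ε²) ln(2k/δ) times. Some references state the bound with a constant of 4 instead of 2, which doubles the total. The function name `naive_pac_samples` is hypothetical.

```python
import math

def naive_pac_samples(k, eps, delta, const=2.0):
    """Total pulls for the naive (eps, delta)-PAC algorithm.

    Assumes each of the k arms is pulled (const / eps^2) * ln(2k / delta) times.
    """
    per_arm = (const / eps ** 2) * math.log(2 * k / delta)
    return k * per_arm

k = 2 ** 6  # assumption: 6 halving rounds reduce 64 arms to a single arm
total = naive_pac_samples(k, eps=0.5, delta=1 / math.e)
print(round(total))
```

With these assumptions the total comes out at roughly 3000 samples.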

5. In the median elimination method for (ε,δ)-PAC bounds, we claim that for every phase l, Pr[A ≤ B + ε_{l}] > 1 − δ_{l}, where S_{l} is the set of arms remaining in the l^{th} phase.

Consider the following statements:

(i) A is the maximum of rewards of the true best arm in S_{l}, i.e., in the l^{th} phase

(ii) B is the maximum of rewards of the true best arm in S_{l+1}, i.e., in the (l+1)^{th} phase

(iii) B is the minimum of rewards of the true best arm in S_{l+1}, i.e., in the (l+1)^{th} phase

(iv) A is the minimum of rewards of the true best arm in S_{l}, i.e., in the l^{th} phase

(v) A is the maximum of rewards of the true best arm in S_{l+1}, i.e., in the (l+1)^{th} phase

(vi) B is the maximum of rewards of the true best arm in S_{l}, i.e., in the l^{th} phase

Which of the statements above are correct?

- i and ii
- iii and iv
- v and vi
- i and iii

Answer :-

6. Which of the following statements is NOT true about Thompson Sampling or Posterior Sampling?

- After each sample is drawn, the q∗ distribution for that sampled arm is updated to be closer to the true distribution.
- Thompson sampling has been shown to generally give better regret bounds than UCB.
- In Thompson sampling, we do not need to eliminate arms each round to get good sample complexity.
- The algorithm requires that we use Gaussian priors to represent distributions over q∗.

Answer :-
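To illustrate posterior sampling with a prior family other than the Gaussian, here is a minimal sketch for Bernoulli-reward arms with Beta priors. The arm probabilities, the seed, and the function name `thompson_bernoulli` are illustrative assumptions.

```python
import random

def thompson_bernoulli(true_probs, steps, seed=0):
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1.0] * k  # 1 + number of observed successes per arm
    beta = [1.0] * k   # 1 + number of observed failures per arm
    pulls = [0] * k
    for _ in range(steps):
        # Draw one sample from each arm's posterior and play the best draw.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Only the posterior of the sampled arm is updated.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls, alpha, beta

pulls, alpha, beta = thompson_bernoulli([0.2, 0.5, 0.8], steps=2000)
print(pulls)
```

No arm is ever eliminated in this sketch; arms with poor posteriors are simply sampled less and less often.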

7. Assertion: The confidence bound of each arm in the UCB algorithm cannot increase with iterations.

Reason: The n_{j} term in the denominator ensures that the confidence bound remains the same for unselected arms and decreases for the selected arm.

- Assertion and Reason are both true and Reason is a correct explanation of Assertion
- Assertion and Reason are both true and Reason is not a correct explanation of Assertion
- Assertion is true and Reason is false
- Both Assertion and Reason are false

Answer :-
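How the confidence term moves over time can be seen by evaluating it directly. This sketch assumes the UCB1 bonus sqrt(2 ln t / n_j); `bonus` is a hypothetical helper name and the numbers are arbitrary.

```python
import math

def bonus(t, n):
    """Exploration bonus sqrt(2 ln t / n) for an arm pulled n times by step t."""
    return math.sqrt(2 * math.log(t) / n)

n_j = 10
steps = (100, 200, 400)

# Arm never selected after t = 100: n_j stays fixed while t grows.
unselected = [bonus(t, n_j) for t in steps]

# Arm selected at every step after t = 100: n_j grows along with t.
selected = [bonus(t, n_j + (t - 100)) for t in steps]

print(unselected)
print(selected)
```

With n_j fixed, the ln t term in the numerator makes the bonus grow slowly; when the arm keeps being pulled, n_j grows faster than ln t and the bonus shrinks.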

8. We need 100 samples to get an (ε,δ)-PAC arm using the naive (ε,δ)-PAC algorithm in a 10-arm bandit problem with certain values of ε and δ. Now ε is halved while δ is kept unchanged. How many samples would be needed to re-run the naive (ε,δ)-PAC algorithm?

- 400
- 800
- 1600
- 100

Answer :-
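The scaling with ε can be sketched as follows, assuming the naive algorithm's total sample count has the form k · (c/ε²) · ln(2k/δ) for some constant c. Only the 1/ε² dependence matters here; the values of ε, δ and c below are arbitrary placeholders, not the ones from the question.

```python
import math

def naive_pac_samples(k, eps, delta, const=2.0):
    """Total pulls, assuming k * (const / eps^2) * ln(2k / delta)."""
    return k * (const / eps ** 2) * math.log(2 * k / delta)

k, eps, delta = 10, 0.4, 0.1  # placeholder values for illustration only

base = naive_pac_samples(k, eps, delta)
halved = naive_pac_samples(k, eps / 2, delta)

# Ratio of sample counts when eps is halved and delta is unchanged.
ratio = halved / base
print(ratio)
```

Under this assumption, halving ε multiplies the required samples by 4 regardless of k, δ, or the constant, so a budget of 100 samples scales to about 400.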

10. Suppose we are facing a non-stationary bandit problem. We want to use posterior sampling for picking the correct arm. What is the likely change that needs to be done to the algorithm so that it can adapt to non-stationarity?

- Update the posterior rarely.
- Randomly shift the posterior drastically from time to time.
- Keep adding a slight noise to the posterior to prevent its variance from going down quickly.
- No change is required.

Answer :-
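One way to picture the "slight noise" option is to inflate the posterior variance of every arm at each step, so that old evidence is gradually discounted and the posterior can follow a drifting mean. The following is a sketch under assumptions: Gaussian rewards with known unit variance, Gaussian posteriors, a two-arm problem whose better arm switches halfway through, and a hypothetical inflation parameter `drift_var`.

```python
import random

def update(mean, var, reward, noise_var=1.0):
    """Conjugate Gaussian update for one observation with known noise variance."""
    new_var = 1.0 / (1.0 / var + 1.0 / noise_var)
    new_mean = new_var * (mean / var + reward / noise_var)
    return new_mean, new_var

def run(steps, drift_var, seed=0):
    rng = random.Random(seed)
    means, variances = [0.0, 0.0], [1.0, 1.0]
    for t in range(steps):
        # Non-stationary rewards: the better arm switches halfway through.
        true_means = [1.0, 0.0] if t < steps // 2 else [0.0, 1.0]
        # Add a little variance to every posterior before sampling.
        variances = [v + drift_var for v in variances]
        samples = [rng.gauss(m, v ** 0.5) for m, v in zip(means, variances)]
        arm = samples.index(max(samples))
        reward = rng.gauss(true_means[arm], 1.0)
        means[arm], variances[arm] = update(means[arm], variances[arm], reward)
    return means, variances

_, var_plain = run(1000, drift_var=0.0)
_, var_drift = run(1000, drift_var=0.01)
print(min(var_plain), min(var_drift))
```

Without inflation the posterior variance of a frequently pulled arm keeps shrinking toward zero, which makes later exploration unlikely; with inflation it stays bounded away from zero, so the sampler can still revisit arms after the change.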