NPTEL Reinforcement Learning Week 10 Assignment Answers 2024
1. Consider the update equation for SMDP Q-learning:
$Q(s,a) = Q(s,a) + \alpha\left[A + B \max_{a'} Q(s',a') - Q(s,a)\right]$
Which of the following are the correct values of A and B?
($r_k$ is the reward received at time step $k$, and $\gamma$ is the discount factor; a minimal code sketch of this update form is given below the options.)
- $A = r_t;\; B = \gamma$
- $A = r_t + \gamma r_{t+1} + \dots + \gamma^{\tau-1} r_{t+\tau};\; B = \gamma^{\tau}$
- $A = \gamma^{t} r_t + \gamma^{t+1} r_{t+1} + \dots + \gamma^{t+\tau-1} r_{t+\tau};\; B = \gamma^{t+\tau}$
- $A = \gamma^{\tau-1} r_{t+\tau};\; B = \gamma^{\tau}$
Answer :-
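As a point of reference for the update form above, here is a minimal tabular sketch in Python. The names are hypothetical: it assumes `Q` is a NumPy array indexed by state and action, and that `rewards` holds the rewards collected over the τ steps the temporally extended action took to complete.

```python
import numpy as np

def smdp_q_update(Q, s, a, rewards, s_next, gamma, alpha):
    """One tabular SMDP Q-learning update.

    `rewards` lists the rewards collected over the tau time steps the
    (temporally extended) action took, so A is their discounted sum and
    B discounts the bootstrap term by gamma**tau.
    """
    tau = len(rewards)
    A = sum(gamma**k * r for k, r in enumerate(rewards))  # discounted reward sum
    B = gamma**tau                                        # discount over tau steps
    target = A + B * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Note that when the transition takes a single step (τ = 1), A and B reduce to $r_t$ and $\gamma$, i.e., the conventional one-step Q-learning target.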
2. Consider an SMDP in which the next state and the reward depend only on the previous state and action, i.e., $P(s',\tau \mid s,a) = P(s' \mid s,a)\,P(\tau \mid s,a)$ and $R(s,a,\tau,s') = R(s,a,s')$.
If we solve the above SMDP with conventional Q-learning, we will end up with the same policy as solving it with SMDP Q-learning.
- yes, because now τ won't change anything and we end up with the same state and action sequences.
- no, because τ still depends on the state-action pair and discounting may have an effect on the final policies.
- no, because the next state will still depend on τ.
- yes, because the Bellman equation is the same for both methods in this case.
Answer :-
3. In a HAM, what are the immediate rewards received between two choice states?
- Accumulation of immediate rewards of the core MDP obtained between these choice points.
- The return of the next choice state.
- The reward of only the next primitive action taken.
- The immediate reward is always zero.
Answer :-
4. Which of the following is true about Markov and Semi-Markov Options? (A short sketch contrasting the two is given after this question.)
- In a Markov Option, the option's policy depends only on the current state.
- In a Semi-Markov Option, the option's policy can depend only on the current state.
- In a Semi-Markov Option, the option's policy may depend on the history since the execution of the option began.
- A Semi-Markov Option is always a Markov Option, but not vice versa.
Answer :-
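For the terminology in question 4, here is a minimal sketch using hypothetical Python type aliases: a Markov option's internal policy reads only the current state, while a semi-Markov option's policy may also read the history accumulated since the option began.

```python
from typing import Callable, List, Tuple

State, Action = int, int
History = List[Tuple[State, Action]]  # (state, action) pairs since the option began

# Markov option: the internal policy reads only the current state.
MarkovPolicy = Callable[[State], Action]

# Semi-Markov option: the internal policy may read the whole history
# accumulated since the option started executing (e.g. to count steps).
SemiMarkovPolicy = Callable[[History, State], Action]

def timeout_policy(history: History, s: State) -> Action:
    """Example semi-Markov behaviour: act differently after 5 internal steps."""
    return 0 if len(history) < 5 else 1
```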
5. Consider the two statements below about the SMDP induced by a HAM:
Statement 1: The state of the SMDP is defined by the state of the base MDP, the call stack, and the state of the machine currently executing.
Statement 2: The actions of the SMDP can only be defined by the action states.
Which of the following are true?
- Statement1 is True and Statement2 is True.
- Statement1 is True and Statement2 is False.
- Statement1 is False and Statement2 is True.
- Statement1 is False and Statement2 is False.
Answer :-
6. Which of the following are possible advantages of formulating a given problem as a hierarchy of sub-problems?
- A reduced state space.
- More meaningful state-abstraction.
- Temporal abstraction of behaviour.
- Re-usability of learnt sub-problems.
Answer :-
7. In an SMDP, consider the case where τ is fixed for all state-action pairs. Will we always get the same policy from conventional Q-learning and SMDP Q-learning? Give your answer for the three cases τ=3, τ=2, and τ=1.
- yes, yes, no
- no, no, no
- yes, yes, yes
- no, no, yes
Answer :-
8. State True or False:
In the classical options framework, each option has a non-zero probability of terminating in any state of the environment. (A sketch of an option's termination function is given after this question.)
- True
- False
Answer :-
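Recall that in the classical options framework an option is a triple ⟨I, π, β⟩, where β : S → [0, 1] gives the probability of terminating in each state. A minimal simulation sketch, with hypothetical `beta` and `policy` callables, of how termination is checked in every state the option visits:

```python
import random

def option_step(s, beta, policy):
    """Simulate one decision inside an option <I, pi, beta>.

    beta(s)  : probability that the option terminates in state s.
    policy(s): the option's internal (Markov) policy.
    Returns None if the option terminates in s, else the primitive action to take.
    """
    if random.random() < beta(s):  # terminate with probability beta(s)
        return None
    return policy(s)
```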
9. Suppose that we model a robot in a room as an SMDP, such that the position of the robot in the room is the state of the SMDP. Which of the following scenarios satisfy the assumption that the next state and the transition time are independent of each other given the current state and action, i.e., $P(s',\tau \mid s,a) = P(s' \mid s,a)\,P(\tau \mid s,a)$? (Assume that the primitive actions {left, right, up, down} each take a single time step to execute.) A numerical sketch of this independence check is given after the options.
- The room has a single door. The actions available are: {exit the room, move left, move right, move up, move down}.
- The room has two doors. The actions available are: {exit the room, move left, move right, move up, move down}.
- The room has two doors. The actions available are: {move left, move right, move up, move down}.
- None of the above.
Answer :-
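To make the independence condition in question 9 concrete, here is a small numerical sketch, using a made-up joint table for a single state-action pair, that checks whether $P(s',\tau \mid s,a)$ factorises into $P(s' \mid s,a)\,P(\tau \mid s,a)$:

```python
import numpy as np

# Hypothetical joint distribution P(s', tau | s, a) for one fixed (s, a):
# rows index the next state s', columns index the transition time tau.
joint = np.array([
    [0.10, 0.10],   # s' = 0
    [0.40, 0.40],   # s' = 1
])
joint /= joint.sum()

p_next = joint.sum(axis=1)   # P(s' | s, a), marginal over tau
p_tau = joint.sum(axis=0)    # P(tau | s, a), marginal over s'

# Independence holds iff the joint equals the product of its marginals.
independent = np.allclose(joint, np.outer(p_next, p_tau))
print(independent)  # True for this table
```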
10.
Answer :-