To select appropriate reinforcement-learning algorithms, answer as many of the following questions as possible:
Less-preferred algorithms are marked yellow.
Unlike the questions above, which concern properties dictated by the environment, the following question is about your planned choice of method properties:
For selecting a parametric probability distribution for actions, see Section 3 in the full paper.
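As a rough illustration of that choice, the sketch below (not from the paper; the PyTorch setup and class names are assumptions) shows the two most common options: a categorical distribution for discrete action spaces and a diagonal Gaussian for continuous ones. The recurring entries in the table that follows (TD targets, return estimation, entropy regularization) are illustrated with similar sketches after the table.

```python
# Hypothetical sketch of parametric action distributions (assumes PyTorch).
import torch
from torch import nn
from torch.distributions import Categorical, Normal

class DiscretePolicy(nn.Module):
    """Categorical distribution over a finite set of actions."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.logits = nn.Linear(obs_dim, n_actions)

    def forward(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.logits(obs))

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian over a continuous action vector."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std

    def forward(self, obs: torch.Tensor) -> Normal:
        return Normal(self.mean(obs), self.log_std.exp())

obs = torch.randn(1, 8)           # toy observation
dist = DiscretePolicy(8, 4)(obs)
action = dist.sample()            # sampled action
log_prob = dist.log_prob(action)  # used by policy-gradient losses
```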
Algorithm | Model-free / model-based | On-policy / off-policy | Method type | Return estimation | Entropy regularization | Distributional | Distributed | Hierarchical | Imitation learning
Q-learning [Watkins & Dayan 1992] with TD | Model-free | Off-policy | Tabular value-based with exact maximization | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
Q-learning [Watkins & Dayan 1992] with Q(λ) | Model-free | Off-policy | Tabular value-based with exact maximization | Q(λ) | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
SARSA [Rummery et al. 1994] with TD | Model-free | On-policy | Tabular value-based with exact maximization | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
SARSA [Rummery et al. 1994] with SARSA(λ) | Model-free | On-policy | Tabular value-based with exact maximization | SARSA(λ) | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
DQfD [Hester et al. 2018] with TD | Model-free | Off-policy | Non-tabular value-based with exact maximization | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Imitation learning
DQfD [Hester et al. 2018] with TD(n) | Model-free | Off-policy | Non-tabular value-based with exact maximization | TD(n) | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD | No entropy regularization | Not distributional | Not distributed | Hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD | No entropy regularization | Distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD | No entropy regularization | Distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD(n) | No entropy regularization | Distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with approximate maximization and fixed search procedure | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with approximate maximization and learned search procedure | Q(λ) | Per-state entropy regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD | No entropy regularization | Distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD | No entropy regularization | Distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with approximate maximization and fixed search procedure | TD | No entropy regularization | Distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD(n) | No entropy regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | Retrace(λ) | No entropy regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Non-tabular value-based with exact maximization | TD(n) | No entropy regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | On-policy | Policy-based | MC | Per-state entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD(n) | Soft Q-learning | Not distributional | Not distributed | Not hierarchical | Imitation learning
| Model-free | Off-policy | Actor-critic | GTD(λ) | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | Retrace(λ) | Per-state entropy regularization | Distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | GAE(λ) | Soft Q-learning | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | Retrace(λ) | Kullback–Leibler divergence regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD | Soft Q-learning | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD | Per-state entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD(λ) | Kullback–Leibler divergence regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD(n) | No entropy regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD(n) | No entropy regularization | Distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | TD(n) | Kullback–Leibler divergence regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | GAE(λ) | Per-state entropy and Kullback–Leibler divergence regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | GAE(λ) | Kullback–Leibler divergence regularization | Not distributional | Distributed | Hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD | No entropy regularization | Not distributional | Not distributed | Hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | TD(n) | Mutual-information regularization | Not distributional | Not distributed | Hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | TD(n) | Per-state entropy regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | Retrace(λ) | Per-state entropy and Kullback–Leibler divergence regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | LSTD-Q(λ) | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | TD(n) | Per-state entropy and Kullback–Leibler divergence regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD | Soft Q-learning | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD | Soft Q-learning | Not distributional | Not distributed | Not hierarchical | Imitation learning
| Model-free | Off-policy | Actor-critic | TD | Soft Q-learning | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD | Soft Q-learning | Distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | GAE(λ) | Kullback–Leibler divergence regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | V-trace(n) | Per-state entropy regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | V-trace(n) | Per-state entropy regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | TD(n) | No entropy regularization | Not distributional | Distributed | Not hierarchical | Not imitation learning
| Model-free | Off-policy | Actor-critic | GAE(λ) | Kullback–Leibler divergence regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | MC | Per-state entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | TD | Per-state entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
REDQ [Chen et al. 2021] with MC | Model-free | Off-policy | Actor-critic | MC | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
REDQ [Chen et al. 2021] with TD | Model-free | Off-policy | Actor-critic | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
| Model-free | On-policy | Actor-critic | TD | No entropy regularization | Not distributional | Not distributed | Not hierarchical | Not imitation learning
For model-based algorithms, see, e.g., the survey papers Moerland et al. (2020a), Moerland et al. (2020b), Wang et al. (2019), Hamrick et al. (2020), and Plaat et al. (2020).
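To make the "Off-policy" with "exact maximization" versus "On-policy" entries in the tabular rows concrete, here is a minimal sketch (not from the paper; the toy state/action counts are assumptions) of the one-step TD updates of Q-learning and SARSA:

```python
# Hypothetical tabular updates; illustrates off-policy vs. on-policy TD targets.
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy, exact maximization: bootstrap from max_a' Q(s', a'),
    # independent of which action the behavior policy takes next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from Q(s', a') for the action a' actually taken.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

Q = np.zeros((5, 2))  # toy table: 5 states, 2 actions
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```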
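Similarly, the "Return estimation" column differs only in how the update target is built: TD is the one-step case, MC bootstraps not at all, TD(n) bootstraps after n rewards, and TD(λ) mixes all n-step targets. GAE(λ) is essentially the λ-return minus the state value, while Retrace(λ) and V-trace(n) additionally truncate importance weights for off-policy correction. A hedged sketch (notation assumed, not from the paper):

```python
# Hypothetical return-estimation helpers. `values` has one more entry than
# `rewards`; its last entry is the bootstrap value (0 if the episode ended).
import numpy as np

def n_step_target(rewards, values, t, n, gamma=0.99):
    """TD(n) target from time t; n = 1 is plain TD, n = len(rewards) - t is MC."""
    end = min(t + n, len(rewards))
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]

def lambda_return(rewards, values, t, lam=0.95, gamma=0.99):
    """TD(lambda) target: (1 - lam) * lam**(n-1)-weighted mixture of n-step targets."""
    T = len(rewards)
    g = sum((1 - lam) * lam ** (n - 1) * n_step_target(rewards, values, t, n, gamma)
            for n in range(1, T - t))
    return g + lam ** (T - t - 1) * n_step_target(rewards, values, t, T - t, gamma)

rewards = [1.0, 0.0, 0.5]
values = [0.3, 0.2, 0.1, 0.0]                  # terminal bootstrap is 0
td = n_step_target(rewards, values, t=0, n=1)  # TD
mc = n_step_target(rewards, values, t=0, n=3)  # MC
lam = lambda_return(rewards, values, t=0)      # TD(lambda)
```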
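The entropy-regularization column likewise corresponds to small changes in the objective, sketched here under assumed notation (alpha and beta are illustrative coefficients, not values from the paper): per-state entropy regularization adds a bonus alpha * H(pi(.|s)); soft Q-learning replaces the hard max in the target with a soft one; Kullback–Leibler regularization penalizes divergence from a reference policy (e.g. the previous one).

```python
# Hypothetical regularization terms from the entropy-regularization column.
import numpy as np

def entropy_bonus(pi_probs, alpha=0.1):
    """Per-state entropy regularization: alpha * H(pi(.|s))."""
    p = np.asarray(pi_probs)
    return alpha * -np.sum(p * np.log(p + 1e-12))

def soft_value(q_row, alpha=0.1):
    """Soft Q-learning state value alpha * logsumexp(Q(s,.)/alpha);
    recovers the hard max of Q-learning as alpha -> 0."""
    z = np.asarray(q_row) / alpha
    m = np.max(z)
    return alpha * (m + np.log(np.sum(np.exp(z - m))))

def kl_penalty(pi_probs, ref_probs, beta=1.0):
    """KL-divergence regularization toward a reference policy."""
    p, q = np.asarray(pi_probs), np.asarray(ref_probs)
    return beta * np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))
```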