The Markov Decision Process (MDP) underlying Reinforcement Learning can be decomposed into six fundamental components, each implemented as a function. By modifying these six functions, an immense number of distinct RL models can be created. Users only need to grasp the basic MDP process and then tailor these six functions to construct a unique reinforcement learning model.
Details
Action Select
Step 1: Agent uses bias_func to apply a bias term to the value of each option.
Step 2: Agent uses expl_func to decide whether to make a purely random exploratory choice.
Step 3: Agent uses prob_func to compute the selection probability for each action.
Value Update
Step 4: Agent uses util_func to translate the objective reward into subjective utility.
Step 5: Agent uses dcay_func to regress the values of unchosen options toward a baseline.
Step 6: Agent uses rate_func to update the value of the chosen option.
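To see how the six steps compose within a single trial, here is a compact base-R sketch. Every function body below is a simplified stand-in written for this illustration; the actual multiRL built-ins and their signatures may differ, and the baseline value of 0, the parameter list pars, and the reward function are assumptions. Per-function sketches appear in the sections below.
# Compact sketch of one trial with simplified stand-ins for the six
# inner functions (illustrative only; not the multiRL built-ins).
one_trial <- function(Q, n, get_reward, pars) {
  k <- length(Q)
  # Step 1 (bias_func): add an exploration bias to each option's value
  V <- Q + pars$delta * sqrt(log(n + exp(1)) / (n + 1e-10))
  # Step 2 (expl_func): is this trial a purely random exploratory choice?
  explore <- runif(1) < pars$epsilon
  # Step 3 (prob_func): soft-max choice probabilities over the biased values
  p <- if (explore) rep(1 / k, k) else {
    z <- exp(pars$beta * (V - max(V))); z / sum(z)
  }
  a <- sample(k, 1, prob = p)
  # Step 4 (util_func): objective reward -> subjective utility
  U <- get_reward(a)^pars$gamma
  # Step 5 (dcay_func): unchosen options decay toward their initial value (0 assumed)
  Q[-a] <- Q[-a] + pars$zeta * (0 - Q[-a])
  # Step 6 (rate_func): delta-rule update of the chosen option
  Q[a] <- Q[a] + pars$alpha * (U - Q[a])
  n[a] <- n[a] + 1
  list(Q = Q, n = n, choice = a)
}

pars <- list(alpha = 0.3, beta = 5, gamma = 0.8, delta = 0.1,
             epsilon = 0.05, zeta = 0.2)
one_trial(Q = c(0, 0), n = c(0, 0),
          get_reward = function(a) rbinom(1, 1, 0.7), pars = pars)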
Learning Rate (\(\alpha\))
rate_func is the function that determines the learning rate
(\(\alpha\)). It governs how the model selects \(\alpha\); for
instance, you can set different learning rates for different
circumstances (e.g., a higher learning rate following rewards than
following punishments). Rather than 'learning' in a general sense, the
learning rate determines whether the agent updates its expected values
(Q-values) with an aggressive or a conservative step size under different
conditions.
$$Q_{new} = Q_{old} + \alpha \cdot (R - Q_{old})$$
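As a hedged sketch, a custom rate_func might return different step sizes under different conditions; the signature, the gain/loss split, and the default values below are illustrative assumptions, not multiRL's required interface.
# Hypothetical sketch: a condition-dependent learning rate and the
# delta-rule update it feeds into.
my_rate_func <- function(reward, alpha_pos = 0.30, alpha_neg = 0.10) {
  # e.g., learn faster from gains than from losses
  if (reward >= 0) alpha_pos else alpha_neg
}

update_value <- function(Q_old, R, rate_func = my_rate_func) {
  alpha <- rate_func(R)
  Q_old + alpha * (R - Q_old)     # Q_new = Q_old + alpha * (R - Q_old)
}

update_value(Q_old = 0.5, R = 1)   # larger step after a reward
update_value(Q_old = 0.5, R = -1)  # smaller step after a punishment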
Soft-Max (\(\beta\))
prob_func is the function defined by the inverse temperature
parameter (\(\beta\)) and the lapse parameter.
The inverse temperature parameter governs the randomness of choice. If \(\beta\) approaches 0, the agent will choose between different actions completely at random. As \(\beta\) increases, the choice becomes more dependent on the expected value (\(Q_{t}\)), meaning actions with higher expected values have a proportionally higher probability of being chosen.
Note: This formula includes a normalization of the (\(Q_{t}\)) values.
$$ P_{t}(a) = \frac{ \exp\left( \beta \cdot \left( Q_t(a) - \max_{j} Q_t(a_j) \right) \right) }{ \sum_{i=1}^{k} \exp\left( \beta \cdot \left( Q_t(a_i) - \max_{j} Q_t(a_j) \right) \right ) } $$
The function below, which incorporates the constant lapse rate, is a correction to the standard soft-max rule, designed to prevent the probability of any action from becoming exactly 0 (Wilson and Collins, 2019; doi:10.7554/eLife.49547). When the lapse parameter is set to 0.01, every action has at least a 1% probability of being executed. If the number of available actions becomes excessively large (e.g., greater than 100), it would be more appropriate to set the lapse parameter to a much smaller value.
$$ P_{t}(a) = (1 - \text{lapse} \cdot N_{\text{shown}}) \cdot P_{t}(a) + \text{lapse} $$
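A minimal base-R sketch of this normalized soft-max with a lapse floor, assuming N_shown equals the number of available actions and lapse is a small constant; the function name is illustrative, not the multiRL built-in.
# Hypothetical sketch of the soft-max rule with a lapse floor.
softmax_lapse <- function(Q, beta, lapse = 0.01) {
  z <- exp(beta * (Q - max(Q)))        # subtract max(Q) for numerical stability
  p <- z / sum(z)                      # standard soft-max probabilities
  N_shown <- length(Q)
  (1 - lapse * N_shown) * p + lapse    # every action keeps at least `lapse` probability
}

softmax_lapse(Q = c(0.2, 0.8, 0.5), beta = 5)   # probabilities still sum to 1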
Utility Function (\(\gamma\))
util_func is defined by the utility exponent parameter (\(\gamma\)).
Its purpose is to account for the fact that the same objective reward
may hold a different subjective value (utility) for different subjects.
Note: The built-in function is formulated according to Stevens' power law.
$$U(R) = {R}^{\gamma}$$
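A minimal sketch of this power-law transform, assuming non-negative rewards (how negative rewards should be handled is not specified here); the function name is illustrative.
# Hypothetical sketch of Stevens' power-law utility, U(R) = R^gamma.
# gamma < 1 compresses large rewards, gamma > 1 amplifies them.
util_power <- function(R, gamma) R^gamma

util_power(R = c(1, 4, 9), gamma = 0.5)   # 1, 2, 3: diminishing sensitivity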
Upper Confidence Bound (\(\delta\))
bias_func is the function defined by the parameter (\(\delta\)).
This function signifies that the expected value of an action is not
solely determined by the received reward, but is also influenced by the
number of times the action has been executed. Specifically, an action
that has been executed fewer times receives a larger exploration bias
(Sutton and Barto, 2018). This mechanism encourages exploration and
ensures that the agent executes every action at least once.
$$ \text{Bias} = \delta \cdot \sqrt{\frac{\log(N + e)}{N + 10^{-10}}} $$
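A hedged sketch of this bias term, assuming N is the per-action execution count and e is Euler's number; the function name is illustrative, not the multiRL built-in.
# Hypothetical sketch of the exploration bias, where N counts how many
# times each action has been executed so far.
ucb_bias <- function(N, delta) {
  delta * sqrt(log(N + exp(1)) / (N + 1e-10))
}

N <- c(0, 2, 10)                          # never-tried actions get a huge bias
Q_biased <- c(0.5, 0.6, 0.7) + ucb_bias(N, delta = 0.1)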
Epsilon–First, Greedy, Decreasing (\(\epsilon\))
expl_func is the function defined by the parameter (\(\epsilon\))
and the constant threshold. This function controls the probability with
which the agent engages in exploration (i.e., making a random choice)
versus exploitation (i.e., making a value-based choice).
\(\epsilon–first\): The agent must choose randomly for a fixed number of initial trials. Once the number of trials exceeds the threshold, the agent must exclusively choose based on value.
$$ P(x=1) = \begin{cases} 1, & i \le \text{threshold} \\ 0, & i > \text{threshold} \end{cases} $$
\(\epsilon–greedy\): The agent performs a random choice with probability \(\epsilon\) and makes a value-based choice with probability \(1-\epsilon\).
$$ P(x) = \begin{cases} \epsilon, & x=1 \\ 1-\epsilon, & x=0 \end{cases} $$
\(\epsilon–decreasing\): The probability of making a random choice gradually decreases as the number of trials increases throughout the experiment.
$$ P(x) = \begin{cases} \frac{1}{1+\epsilon \cdot i}, & x=1 \\ \frac{\epsilon \cdot i}{1+\epsilon \cdot i}, & x=0 \end{cases} $$
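The three rules can be sketched as follows, where i is the trial index and each function returns the probability that the trial is a purely random choice; the function names are illustrative, not the multiRL built-ins.
# Hypothetical sketches of the three exploration rules; each returns
# the probability that trial i is a purely random (exploratory) choice.
p_explore_first      <- function(i, threshold) as.numeric(i <= threshold)
p_explore_greedy     <- function(i, epsilon)   epsilon
p_explore_decreasing <- function(i, epsilon)   1 / (1 + epsilon * i)

# Example: probability of exploring on trial 50 under each rule
p_explore_first(50, threshold = 20)        # 0: past the forced-exploration phase
p_explore_greedy(50, epsilon = 0.1)        # 0.1 on every trial
p_explore_decreasing(50, epsilon = 0.1)    # ~0.17, shrinking as trials accumulate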
Working Memory (\(\zeta\))
dcay_func is the function defined by the decay rate parameter
(\(\zeta\)) and the constant bonus. This function indicates that at the
end of each trial, not only is the value of the chosen option updated
according to the learning rate, but the values of the unchosen options
also change.
Due to the limitations of working memory capacity, the values of the unchosen options are hypothesized to decay back towards their initial value at a rate determined by the decay rate parameter (\(\zeta\)) (Collins and Frank, 2012; doi:10.1111/j.1460-9568.2011.07980.x).
$$W_{new} = W_{old} + \zeta \cdot (W_{0} - W_{old})$$
Furthermore, Hitchcock, Kim, and Frank (2025; doi:10.1037/xge0001817) suggest that if the feedback for the chosen option provides information relevant to the unchosen options, this decay rate may be enhanced or mitigated by the constant bonus.
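A minimal sketch of the decay step, assuming an initial value W0 of 0 and omitting the constant bonus described above; the function name is illustrative, not the multiRL built-in.
# Hypothetical sketch: decay the unchosen options' values toward their
# initial value W0 at rate zeta (the chosen option is updated separately
# by the learning rule).
decay_unchosen <- function(W, chosen, zeta, W0 = 0) {
  idx <- setdiff(seq_along(W), chosen)
  W[idx] <- W[idx] + zeta * (W0 - W[idx])
  W
}

decay_unchosen(W = c(0.8, 0.4, 0.6), chosen = 1, zeta = 0.2)
# the unchosen values 0.4 and 0.6 shrink toward 0; the chosen value stays 0.8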
Example
# inner functions
funcs = list(
  # Learning Rate
  rate_func = multiRL::func_alpha,
  # Inverse Temperature
  prob_func = multiRL::func_beta,
  # Utility Function (Stevens' Power Law)
  util_func = multiRL::func_gamma,
  # Upper-Confidence-Bound
  bias_func = multiRL::func_delta,
  # Epsilon-First, Greedy, Decreasing
  expl_func = multiRL::func_epsilon,
  # Working Memory System
  dcay_func = multiRL::func_zeta
)
References
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Collins, A. G., & Frank, M. J. (2012). How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. European Journal of Neuroscience, 35(7), 1024-1035. doi:10.1111/j.1460-9568.2011.07980.x
Wilson, R. C., & Collins, A. G. (2019). Ten simple rules for the computational modeling of behavioral data. eLife, 8, e49547. doi:10.7554/eLife.49547
Hitchcock, P. F., Kim, J., & Frank, M. J. (2025). How working memory and reinforcement learning interact when avoiding punishment and pursuing reward concurrently. Journal of Experimental Psychology: General. doi:10.1037/xge0001817