The Markov Decision Process (MDP) underlying Reinforcement Learning can be decomposed into six fundamental components. By modifying these six functions, users can create an immense number of distinct RL models: they only need to grasp the basic MDP process and then tailor these six functions to construct a unique reinforcement learning model.

Class

funcs [List]

Details

  • Action Select

    • Step 1: Agent uses bias_func to apply a bias term to the value of each option.

    • Step 2: Agent uses expl_func to decide whether to make a purely random exploratory choice.

    • Step 3: Agent uses prob_func to compute the selection probability for each action.

  • Value Update

    • Step 4: Agent uses util_func to translate the objective reward into subjective utility.

    • Step 5: Agent uses dcay_func to regress the values of unchosen options toward a baseline.

    • Step 6: Agent uses rate_func to update the value of the chosen option.
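
Conceptually, a single trial chains these six steps together. The sketch below is a generic base-R illustration of that flow; the function name one_trial, the reward_fn argument, the default parameter values, and the way the exploratory flag overrides the soft-max probabilities are all assumptions for exposition, not the package's internal implementation.

 # Generic sketch of one trial; not multiRL's internal code.
 one_trial <- function(Q, N, reward_fn,
                       alpha = 0.3, beta = 3, gamma = 1, delta = 0.5,
                       epsilon = 0.1, zeta = 0.2, lapse = 0.01, W0 = 0) {
   k <- length(Q)
   Qb <- Q + delta * sqrt(log(N + exp(1)) / (N + 1e-10))   # Step 1: bias
   explore <- runif(1) < epsilon                           # Step 2: explore?
   p <- exp(beta * (Qb - max(Qb))); p <- p / sum(p)        # Step 3: soft-max
   p <- (1 - lapse * k) * p + lapse
   if (explore) p <- rep(1 / k, k)
   a <- sample.int(k, size = 1, prob = p)
   U <- reward_fn(a)^gamma                                 # Step 4: utility
   Q[-a] <- Q[-a] + zeta * (W0 - Q[-a])                    # Step 5: decay
   Q[a]  <- Q[a] + alpha * (U - Q[a])                      # Step 6: update
   N[a]  <- N[a] + 1
   list(Q = Q, N = N, choice = a)
 }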

Learning Rate (\(\alpha\))

rate_func is the function that determines the learning rate (\(\alpha\)). It governs how the model selects \(\alpha\); for instance, you can set different learning rates for different circumstances. Rather than 'learning' in a general sense, the learning rate determines whether the agent updates its expected values (Q-values) with an aggressive or a conservative step size under different conditions.

$$Q_{new} = Q_{old} + \alpha \cdot (R - Q_{old})$$
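
For illustration, a single delta-rule update of this kind can be written in a few lines of base R (a minimal sketch; the names update_q, Q_old, and R here are illustrative, not the package's built-in rate_func):

 # Move Q toward the received reward by a fraction alpha of the prediction error
 update_q <- function(Q_old, R, alpha) {
   Q_old + alpha * (R - Q_old)
 }

 update_q(Q_old = 0.5, R = 1, alpha = 0.1)  # 0.55 (conservative step)
 update_q(Q_old = 0.5, R = 1, alpha = 0.8)  # 0.90 (aggressive step)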

Soft-Max (\(\beta\))

prob_func is the function defined by the inverse temperature parameter (\(\beta\)) and the lapse parameter.

The inverse temperature parameter governs the randomness of choice. If \(\beta\) approaches 0, the agent will choose between different actions completely at random. As \(\beta\) increases, the choice becomes more dependent on the expected value (\(Q_{t}\)), meaning actions with higher expected values have a proportionally higher probability of being chosen.

Note: This formula normalizes the \(Q_{t}\) values by subtracting the maximum before exponentiation; this leaves the resulting probabilities unchanged but keeps the computation numerically stable.

$$ P_{t}(a) = \frac{ \exp\left( \beta \cdot \left( Q_t(a) - \max_{j} Q_t(a_j) \right) \right) }{ \sum_{i=1}^{k} \exp\left( \beta \cdot \left( Q_t(a_i) - \max_{j} Q_t(a_j) \right) \right ) } $$

The formula below, which incorporates the constant lapse rate, is a correction to the standard soft-max rule designed to prevent the probability of any action from becoming exactly 0 (Wilson and Collins, 2019, doi:10.7554/eLife.49547). The corrected probability is written \(P'_{t}(a)\). When the lapse parameter is set to 0.01, every action has at least a 1% probability of being executed. If the number of available actions becomes very large (e.g., greater than 100), it is more appropriate to set the lapse parameter to a much smaller value, because the weight \(1 - lapse \cdot N_{shown}\) on the value-based probabilities would otherwise shrink to zero or below.

$$ P'_{t}(a) = (1 - lapse \cdot N_{shown}) \cdot P_{t}(a) + lapse $$
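
A minimal base-R sketch of this lapse-corrected soft-max (illustrative only; the name softmax_lapse and its argument layout are assumptions, and the package's func_beta may be parameterized differently):

 softmax_lapse <- function(Q, beta, lapse = 0) {
   # subtract the maximum before exponentiating (normalization for stability)
   p <- exp(beta * (Q - max(Q)))
   p <- p / sum(p)
   # mix in a constant lapse so no action ever has probability exactly 0
   (1 - lapse * length(Q)) * p + lapse
 }

 softmax_lapse(Q = c(0.2, 0.5, 0.8), beta = 3, lapse = 0.01)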

Utility Function (\(\gamma\))

util_func is defined by the utility exponent parameter (\(\gamma\)). Its purpose is to account for the fact that the same objective reward may hold a different subjective value (utility) for different subjects.

Note: The built-in function is formulated according to Stevens' power law.

$$U(R) = {R}^{\gamma}$$
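
For example, a concave exponent (\(\gamma < 1\)) compresses large rewards, while \(\gamma > 1\) amplifies them; the snippet below is a generic illustration of the power law, not the package's exact func_gamma:

 utility <- function(R, gamma) R^gamma

 utility(R = c(1, 4, 9), gamma = 0.5)  # 1 2 3  (diminishing sensitivity)
 utility(R = c(1, 4, 9), gamma = 1.0)  # 1 4 9  (objective reward unchanged)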

Upper Confidence Bound (\(\delta\))

bias_func is the function defined by the parameter (\(\delta\)). This function signifies that the expected value of an action is not determined solely by the received reward, but is also influenced by the number of times the action has been executed. Specifically, an action that has been executed fewer times receives a larger exploration bias (Sutton and Barto, 2018). This mechanism promotes exploration and ensures that the agent executes every action at least once.

$$ \text{Bias} = \delta \cdot \sqrt{\frac{\log(N + e)}{N + 10^{-10}}} $$
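
A hedged base-R sketch of this exploration bonus (it mirrors the formula above, where the constants \(e\) and \(10^{-10}\) keep the expression defined at \(N = 0\); the name ucb_bias is illustrative, not necessarily the package's exact func_delta):

 ucb_bias <- function(N, delta) {
   delta * sqrt(log(N + exp(1)) / (N + 1e-10))
 }

 # Options executed fewer times receive a much larger bonus
 ucb_bias(N = c(0, 1, 5, 20), delta = 0.5)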

Epsilon–First, Greedy, Decreasing (\(\epsilon\))

expl_func is the function defined by the parameter (\(\epsilon\)) and the constant threshold. This function controls the probability with which the agent engages in exploration (i.e., making a random choice, denoted \(x=1\) below) versus exploitation (i.e., making a value-based choice, denoted \(x=0\)).

\(\epsilon–first\): The agent chooses randomly for a fixed number of initial trials. Once the trial index \(i\) exceeds the threshold, the agent chooses exclusively based on value.

$$ P(x=1) = \begin{cases} 1, & i \le \text{threshold} \\ 0, & i > \text{threshold} \end{cases}, \qquad P(x=0) = 1 - P(x=1) $$

\(\epsilon–greedy\): The agent performs a random choice with probability \(\epsilon\) and makes a value-based choice with probability \(1-\epsilon\).

$$ P(x) = \begin{cases} \epsilon, & x=1 \\ 1-\epsilon, & x=0 \end{cases} $$

\(\epsilon–decreasing\): The probability of making a random choice gradually decreases as the number of trials increases throughout the experiment.

$$ P(x) = \begin{cases} \frac{1}{1+\epsilon \cdot i}, & x=1 \\ \frac{\epsilon \cdot i}{1+\epsilon \cdot i}, & x=0 \end{cases} $$
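
The three rules might be sketched as follows, each returning the probability that trial \(i\) is an exploratory (random) trial; these helper names are illustrative translations of the formulas above, not the package's func_epsilon:

 # Probability that trial i is a purely random (exploratory) trial
 p_explore_first      <- function(i, threshold) as.numeric(i <= threshold)
 p_explore_greedy     <- function(i, epsilon)   rep(epsilon, length(i))
 p_explore_decreasing <- function(i, epsilon)   1 / (1 + epsilon * i)

 p_explore_first(i = 1:6, threshold = 3)                 # 1 1 1 0 0 0
 p_explore_decreasing(i = c(1, 10, 100), epsilon = 0.5)  # 0.667 0.167 0.0196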

Working Memory (\(\zeta\))

dcay_func is the function defined by the decay rate parameter (\(\zeta\)) and the constant bonus. This function indicates that at the end of each trial, not only is the value of the chosen option updated according to the learning rate, but the values of the unchosen options also change.

Due to the limitations of working memory capacity, the values of the unchosen options are hypothesized to decay back toward their initial values at a rate determined by the decay rate parameter (\(\zeta\)) (Collins and Frank, 2012, doi:10.1111/j.1460-9568.2011.07980.x).

$$W_{new} = W_{old} + \zeta \cdot (W_{0} - W_{old})$$

Furthermore, Hitchcock, Kim, and Frank (2025, doi:10.1037/xge0001817) suggest that if the feedback for the chosen option provides information relevant to the unchosen options, this decay rate may be enhanced or mitigated by the constant bonus.
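
A minimal sketch of the decay step, with the constant bonus treated as a simple additive adjustment to the decay rate; this additive form and the name decay_unchosen are assumptions for illustration, not the exact formulation used by the package's func_zeta:

 decay_unchosen <- function(W_old, W0, zeta, bonus = 0) {
   # unchosen options regress toward their initial value W0;
   # 'bonus' (assumed additive here) can strengthen or weaken the decay
   rate <- min(max(zeta + bonus, 0), 1)
   W_old + rate * (W0 - W_old)
 }

 decay_unchosen(W_old = c(0.9, 0.4), W0 = 0.5, zeta = 0.2)  # 0.82 0.42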

Example

 # inner functions
 funcs = list(
   # Learning Rate
   rate_func = multiRL::func_alpha,
   # Inverse Temperature
   prob_func = multiRL::func_beta,
   # Utility Function (Stevens' Power Law)
   util_func = multiRL::func_gamma,
   # Upper-Confidence-Bound
   bias_func = multiRL::func_delta,
   # Epsilon-First, Greedy, Decreasing
   expl_func = multiRL::func_epsilon,
   # Working Memory System
   dcay_func = multiRL::func_zeta
 )

References

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.

Collins, A. G., & Frank, M. J. (2012). How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. European Journal of Neuroscience, 35(7), 1024-1035. doi:10.1111/j.1460-9568.2011.07980.x

Wilson, R. C., & Collins, A. G. (2019). Ten simple rules for the computational modeling of behavioral data. eLife, 8, e49547. doi:10.7554/eLife.49547

Hitchcock, P. F., Kim, J., & Frank, M. J. (2025). How working memory and reinforcement learning interact when avoiding punishment and pursuing reward concurrently. Journal of Experimental Psychology: General. doi:10.1037/xge0001817