Arguments
Parallel
binaryRL::rcv_d(
...,
funcs = c("my_util_func", "my_rate_func", "my_expl_func", "my_prob_func"),
nc = 4,
...
)
binaryRL::fit_p(
...,
funcs = c("my_util_func", "my_rate_func", "my_expl_func", "my_prob_func"),
nc = 4,
...
)
Both rcv_d and fit_p support parallel computation: when nc > 1, they fit multiple participants' datasets simultaneously. Because parallel workers run in a separate environment, if you have customized any of the four core functions, you must explicitly pass their names to binaryRL via the funcs argument.
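For example, suppose you have written your own utility function (a hypothetical my_util_func below, using the same argument list and return value as the func_gamma example later in this article); a minimal sketch of registering it for parallel fitting looks like this:
# Hypothetical custom utility function; signature and return value follow
# the func_gamma example shown later in this article.
my_util_func <- function(
  value, utility, reward, occurrence, var1, var2, gamma, alpha, beta
) {
  gamma <- as.numeric(gamma)
  utility <- sign(reward) * (abs(reward) ^ gamma)
  return(list(gamma, utility))
}

# Because parallel workers run in a separate environment, list the function
# name in `funcs` whenever nc > 1.
binaryRL::fit_p(
  ...,
  funcs = c("my_util_func"),
  nc = 4,
  ...
)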
Value Function
Utility Function
How objective rewards translate into subjective value is a matter of ongoing discussion, as different scholars hold different views; the question can be traced back to Stevens's power law. In this model, you can customize your utility function. By default, I use a power function based on Stevens's power law to model the relationship between subjective and objective value.
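In equation form, this default utility function (implemented in the func_gamma code below) is:
\[ U(R) = \text{sign}(R) \times |R|^{\gamma} \]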
According to Kahneman’s Prospect Theory, individuals exhibit distinct utility functions for gains and losses. Following Nilsson et al. (2011), we have implemented the model below. By replacing util_func with the form that follows, you can have the model use a utility function based on Kahneman’s Prospect Theory.
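Written out, the piecewise value function implemented in the code below is (with \(\gamma_{1}\) and \(\gamma_{2}\) the exponents for losses and gains, and \(\beta\) acting as the loss-aversion weight):
\[ U(R) = \begin{cases} |R|^{\gamma_{2}}, & R \ge 0 \\ -\beta \, |R|^{\gamma_{1}}, & R < 0 \end{cases} \]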
func_gamma <- function(
  value, utility, reward, occurrence, var1, var2, gamma, alpha, beta
) {
  if (length(gamma) == 1) {
    # Stevens's Power Law: a single exponent applied to all rewards
    gamma <- as.numeric(gamma)
    utility <- sign(reward) * (abs(reward) ^ gamma)
  } else if (length(gamma) == 2 && reward < 0) {
    # Prospect Theory, losses: gamma[1] is the loss exponent and
    # beta weights losses (loss aversion)
    gamma <- as.numeric(gamma[1])
    beta <- as.numeric(beta)
    utility <- beta * sign(reward) * (abs(reward) ^ gamma)
  } else if (length(gamma) == 2 && reward >= 0) {
    # Prospect Theory, gains: gamma[2] is the gain exponent, no loss weighting
    gamma <- as.numeric(gamma[2])
    beta <- 1
    utility <- beta * sign(reward) * (abs(reward) ^ gamma)
  } else {
    utility <- "ERROR"
  }
  return(list(gamma, utility))
}
Reference
Kahneman, D., & Tversky, A. (2013). Prospect theory: An analysis of
decision under risk. In Handbook of the fundamentals of financial
decision making: Part I (pp. 99-127). https://doi.org/10.1142/9789814417358_0006
Nilsson, H., Rieskamp, J., & Wagenmakers, E. J. (2011). Hierarchical
Bayesian parameter estimation for cumulative prospect theory.
Journal of Mathematical Psychology, 55(1), 84-93. https://doi.org/10.1016/j.jmp.2010.08.006
Learning Rate
In run_m, there is an argument called initial_value. The initial value has a significant impact on the parameter estimation of the learning rates (\(\eta\)). When the initial value is not set (initial_value = NA), it is taken to be the reward received the first time that stimulus is encountered.
“Comparisons between the two learning rates generally revealed a positivity bias (\(\alpha_{+} > \alpha_{-}\))”
“However, that on some occasions, studies failed to find a positivity bias or even reported a negativity bias (\(\alpha_{+} < \alpha_{-}\)).”
“Because Q-values initialization markedly affect learning rate and learning bias estimates.”
binaryRL::run_m(
...,
initial_value = NA,
...
)
Reference
Palminteri, S., & Lebreton, M. (2022). The computational roots of
positivity and confirmation biases in reinforcement learning. Trends
in Cognitive Sciences, 26(7), 607-621. https://doi.org/10.1016/j.tics.2022.04.005
Exploration–Exploitation Trade-off
“The ε-greedy methods choose randomly a small fraction of the time, whereas UCB methods choose deterministically but achieve exploration by subtly favoring at each step the actions that have so far received fewer samples. Gradient-bandit algorithms estimate not action values, but action preferences, and favor the more preferred actions in a graded, probabilistic manner using a soft-max distribution. The simple expedient of initializing estimates optimistically causes even greedy methods to explore significantly.”
Reference
Sutton, R. S., & Barto, A. G. (2014, 2015). Reinforcement Learning:
An Introduction (2nd ed.). Cambridge, MA: MIT Press.
Initial Value
“Initial action values can also be used as a simple way of encouraging exploration. Suppose that instead of setting the initial action values to zero, as we did in the 10-armed testbed, we set them all to +5. Recall that the q(a) in this problem are selected from a normal distribution with mean 0 and variance 1. An initial estimate of +5 is thus wildly optimistic. But this optimism encourages action-value methods to explore.”
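As a sketch of how this idea maps onto binaryRL (assuming initial_value also accepts a numeric starting value rather than only NA), setting it well above the rewards available in the task plays the role of an optimistic initial estimate:
# Optimistic initialization (illustrative value, cf. the +5 example above)
binaryRL::run_m(
  ...,
  initial_value = 5,
  ...
)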
Reference
Sutton, R. S., & Barto, A. G. (2014, 2015). Reinforcement Learning:
An Introduction (2nd ed.). Cambridge, MA: MIT Press.
Exploration Strategy
Participants may not always choose based on the value of the options; on some trials they select randomly instead. This is known as an exploration strategy. You can implement an \(\epsilon\)-first model by setting the threshold argument of run_m. For instance, threshold = 20 (the default is 1) means that participants choose completely at random up to trial 20, after which their choices are based on value.
The \(\epsilon\)-greedy strategy is
commonly employed in reinforcement learning models. With this approach
(epsilon = 0.1
), the participant has a 10% probability of
randomly selecting an option and a 90% probability of choosing based on
the currently learned value of the options.
You can also create an \(\epsilon\)-decreasing exploration strategy
by setting lambda
instead of epsilon
. In this
model, the probability of participants choosing randomly will decrease
as the trial number increases.
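The three exploration strategies described above map onto run_m arguments roughly as follows; the numeric values, including the lambda shown here, are purely illustrative:
# epsilon-first: choose at random until trial 20, then by value
binaryRL::run_m(
  ...,
  threshold = 20,
  ...
)

# epsilon-greedy: choose at random on 10% of trials
binaryRL::run_m(
  ...,
  epsilon = 0.1,
  ...
)

# epsilon-decreasing: random choice becomes less likely as trials accumulate
binaryRL::run_m(
  ...,
  lambda = 0.5,
  ...
)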
Reference
Namiki, N., Oyo, K., & Takahashi, T. (2014, December). How do humans
handle the dilemma of exploration and exploitation in sequential
decision making?. In Proceedings of the 8th International Conference on
Bioinspired Information and Communications Technologies (pp. 113-117).
https://doi.org/10.4108/icst.bict.2014.258045
Upper-Confidence-Bound
The \(\pi\) parameter in Upper-Confidence-Bound (UCB) action selection (called c in the book) controls the size of the exploration bonus given to less-selected options: larger values assign a greater bonus to options that have rarely been chosen. It is set to 0.001 by default (giving a bias of roughly 100 to a never-chosen option), which ensures that the model does not completely ignore any option. Once an option has been selected at least once, the bias added to its value shrinks to nearly zero. If the rewards in your experiment are very small or very large, you should adjust this value accordingly.
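For reference, the UCB rule as given by Sutton and Barto is shown below, with \(\pi\) playing the role of the exploration constant \(c\); the exact scaling of the bias term inside binaryRL may differ:
\[ A_t = \arg\max_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right] \]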
Reference
Sutton, R. S., & Barto, A. G. (2014, 2015). Reinforcement Learning:
An Introduction (2nd ed.). Cambridge, MA: MIT Press.
Sacu, S., Dubois, M., Hezemans, F. H., Aggensteiner, P. M., Monninger,
M., Brandeis, D., … & Holz, N. E. (2024). Early-life adversities are
associated with lower expected value signaling in the adult brain.
Biological Psychiatry, 96(12), 948-958. https://doi.org/10.1016/j.biopsych.2024.04.005
Soft-Max Function
In many reinforcement learning models, the soft-max function, rather than the three methods mentioned above, is used to capture the exploration-exploitation trade-off. Even so, we still recommend keeping the default value pi = 0.001 for the bias_func-related parameter, so that each option is selected at least once.
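For reference, a common two-option form of the soft-max rule is shown below; the exact parameterization of \(\tau\) inside binaryRL may differ:
\[ P_{L} = \frac{e^{V_{L}/\tau}}{e^{V_{L}/\tau} + e^{V_{R}/\tau}}, \qquad P_{R} = 1 - P_{L} \]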
Additionally, when using the binaryRL::rcv_d() function, please note the following two points. During the recovery process, the last element of both simulate_lower and simulate_upper corresponds to the \(\tau\) parameter used in the softmax function.
simulate_lower represents a fixed positive increment applied to all \(\tau\) values: if this element is set to 1, then 1 is added to every \(\tau\) during simulation.
simulate_upper specifies the rate parameter of the exponential distribution from which \(\tau\) is sampled: if this element is set to 1, then \(\tau\) is drawn from an exponential distribution with rate 1, i.e., \(\tau \sim \text{Exp}(1)\) (see the sketch after the example call below).
binaryRL::rcv_d(
...
simulate_lower = list(c(0, 1), c(0, 0, 1), c(0, 0, 1)),
simulate_upper = list(c(1, 1), c(1, 1, 1), c(1, 1, 1)),
...
)
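Under the description above, the \(\tau\) used for the first (two-parameter) model in this call would be generated roughly as in the sketch below; this is an illustration of the stated sampling scheme, not the package's internal code:
# Illustration only: fixed increment of 1 (last element of simulate_lower)
# plus a draw from an exponential distribution with rate 1
# (last element of simulate_upper)
set.seed(123)
tau <- 1 + rexp(1, rate = 1)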
Reference
Wilson, R. C., & Collins, A. G. (2019). Ten simple rules for the
computational modeling of behavioral data. Elife, 8, e49547. https://doi.org/10.7554/eLife.49547
Model Fit
The primary aim of binaryRL is to make reinforcement learning models exhibit behavior similar to that of human subjects. While there are other ways to formulate the loss function, we have fixed it as shown below to quantify the resemblance between human behavior and model predictions.
\[ LL = \sum B_{L} \times \log P_{L} + \sum B_{R} \times \log P_{R} \]
\[ AIC = - 2 LL + 2 k \]
\[ BIC = - 2 LL + k \times \log n \]
NOTE: \(B_{L}\) and \(B_{R}\) indicate which option the subject chose (\(B_{L} = 1\): the subject chose the left option; \(B_{R} = 1\): the subject chose the right option); \(P_{L}\) and \(P_{R}\) are the probabilities of selecting the left or right option, as predicted by the reinforcement learning model; \(k\) is the number of free parameters in the model; \(n\) is the total number of trials in the paradigm.
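A minimal, self-contained illustration of these three indices (with toy choice and probability vectors; in practice \(P_{L}\) and \(P_{R}\) come from the fitted model):
B_L <- c(1, 0, 1, 1, 0)            # 1 if the subject chose the left option
B_R <- 1 - B_L                     # 1 if the subject chose the right option
P_L <- c(0.7, 0.4, 0.6, 0.8, 0.3)  # model-predicted probability of choosing left
P_R <- 1 - P_L

k <- 3                             # number of free parameters in the model
n <- length(B_L)                   # total number of trials

LL  <- sum(B_L * log(P_L) + B_R * log(P_R))
AIC <- -2 * LL + 2 * k
BIC <- -2 * LL + k * log(n)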
Reference
Hampton, A. N., Bossaerts, P., & O’Doherty, J. P. (2006). The role
of the ventromedial prefrontal cortex in abstract state-based inference
during decision making in humans. Journal of Neuroscience,
26(32), 8360-8367. https://doi.org/10.1523/JNEUROSCI.1010-06.2006