This function is designed to construct and customize reinforcement learning models.
Items for model construction:
Data Input and Specification: You must provide the raw dataset for analysis. Crucially, you need to tell the run_m function the corresponding column names within your dataset (e.g., as in the example datasets Mason_2024_G1 and Mason_2024_G2). Because the task is a game, it is critical that your dataset includes rewards for both the option the human chose and the unchosen option.
Customizable RL Models: This function allows you to define and adjust the number of free parameters to create various reinforcement learning models.
Value Function:
Learning Rate: By adjusting the number of eta parameters, you can construct basic reinforcement learning models such as Temporal Difference (TD) and Risk-Sensitive Temporal Difference (RSTD). You can also directly adjust func_eta to define your own custom learning rate function.
Utility Function: You can directly adjust the form of func_gamma to incorporate the principles of Kahneman's Prospect Theory. Currently, the built-in func_gamma only takes the form of a power function, consistent with Stevens' Power Law.
Exploration–Exploitation Trade-off:
Initial Values: This involves setting the initial expected value for each option when it hasn't been chosen yet. A higher initial value encourages exploration.
Epsilon: Adjusting the threshold, epsilon, and lambda parameters can produce exploration strategies such as epsilon-first, epsilon-greedy, or epsilon-decreasing.
Upper-Confidence-Bound: Adjusting pi controls the degree of exploration by scaling the uncertainty bonus given to less-explored options.
Soft-Max: The inverse temperature parameter tau controls the agent's sensitivity to value differences. A higher value of tau means greater emphasis on value differences, leading to more exploitation; a smaller value of tau indicates a greater tendency towards exploration.
Objective Function Format for Optimization: Once your model is defined with run_m, it must be structured as an objective function that accepts params as input and returns a loss value (typically logL). This format ensures compatibility with the algorithm package, which uses it to estimate the optimal parameters. For examples of the standard objective function format, see TD, RSTD, and Utility. A minimal sketch of this format is shown below.
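As a concrete illustration of this contract, here is a minimal sketch of such an objective function for a simple TD model, written in plain R. It does not use binaryRL internals; the data frame columns (choice, L_reward, R_reward), the zero initial values, and the soft-max rule are assumptions made only for this example.

# Sketch of the objective-function format: accepts params, returns logL
# params[1]: learning rate (eta); params[2]: inverse temperature (tau)
objective_TD <- function(params, data) {
  eta <- params[1]
  tau <- params[2]
  V <- c(L = 0, R = 0)  # expected values of the two options (assumed to start at 0)
  logL <- 0
  for (t in seq_len(nrow(data))) {
    # Soft-max probability of choosing the left option
    p_L <- 1 / (1 + exp(-(V[["L"]] - V[["R"]]) * tau))
    chose_L <- data$choice[t] == "L"
    logL <- logL + log(if (chose_L) p_L else 1 - p_L)
    # TD update of the chosen option only
    if (chose_L) {
      V[["L"]] <- V[["L"]] + eta * (data$L_reward[t] - V[["L"]])
    } else {
      V[["R"]] <- V[["R"]] + eta * (data$R_reward[t] - V[["R"]])
    }
  }
  logL
}

An optimizer can then search the parameter space for the params that maximize this return value.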
For more information, please refer to the homepage of this package: https://yuki-961004.github.io/binaryRL/
Usage
run_m(
name = NA,
mode = c("simulate", "fit", "replay"),
policy = c("on", "off"),
data,
id,
n_params,
n_trials,
gamma = 1,
eta,
initial_value = NA_real_,
threshold = 1,
epsilon = NA,
lambda = NA,
pi = NA,
tau = NA,
lapse = 0.02,
alpha = NA,
beta = NA,
priors = NULL,
util_func = func_gamma,
rate_func = func_eta,
expl_func = func_epsilon,
bias_func = func_pi,
prob_func = func_tau,
loss_func = func_logl,
sub = "Subject",
time_line = c("Block", "Trial"),
L_choice = "L_choice",
R_choice = "R_choice",
L_reward = "L_reward",
R_reward = "R_reward",
sub_choose = "Sub_Choose",
rob_choose = "Rob_Choose",
raw_cols = NULL,
var1 = NA_character_,
var2 = NA_character_,
seed = 123,
digits_1 = 2,
digits_2 = 5,
engine = "cpp"
)
Arguments
- name
[string]
The name of your RL model.
- mode
[string]
This parameter controls the function's operational mode. It has three possible values, each typically associated with a specific function:
"simulate"
: Should be used when working withrcv_d
."fit"
: Should be used when working withfit_p
."replay"
: Should be used when working withrpl_e
.
In most cases, you won't need to modify this parameter directly, as suitable default values are set for different contexts.
- policy
[string]
Specifies the learning policy to be used. This determines how the model updates action values based on observed or simulated choices. It can be either "off" or "on".
Off-Policy (Q-learning): This is the most common approach for modeling reinforcement learning in Two-Alternative Forced Choice (TAFC) tasks. In this mode, the model's goal is to learn the underlying value of each option by observing the human participant's behavior. It achieves this by consistently updating the value of the option that the human actually chose. The focus is on understanding the value representation that likely drove the participant's decisions.
On-Policy (SARSA): In this mode, the target policy and the behavior policy are identical. The model first computes the selection probability for each option based on their current values. Critically, it then uses these probabilities to sample its own action. The value update is then performed on the action that the model itself selected. This approach focuses more on directly mimicking the stochastic choice patterns of the agent, rather than just learning the underlying values from a fixed sequence of actions.
- data
[data.frame]
This data should include the following mandatory columns:
sub: "Subject"
time_line: "Block", "Trial"
L_choice: "L_choice"
R_choice: "R_choice"
L_reward: "L_reward"
R_reward: "R_reward"
sub_choose: "Sub_Choose"
- id
[string]
Specifies which subject to analyze. The value should correspond to an entry in the "sub" column, which must contain the subject IDs.
e.g.
id = 18
- n_params
[integer]
The number of free parameters in your model.
- n_trials
[integer]
The total number of trials in your experiment.
- gamma
[NumericVector]
Note: This should not be confused with the discount rate parameter (also named gamma) found in Temporal Difference (TD) models; the Rescorla-Wagner model does not include a discount rate. Here, gamma is used as a free parameter to shape the utility function.
Stevens' Power Law: Utility is modeled as: $$U(R) = {R}^{\gamma}$$
Kahneman's Prospect Theory: The exponent is applied differently based on the sign of the reward: $$U(R) = \begin{cases} R^{\gamma_{1}}, & R > 0 \\ \beta \cdot R^{\gamma_{2}}, & R < 0 \end{cases}$$
default:
gamma = 1
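To make the two utility forms concrete, here is a small sketch in plain R. It mirrors the formulas above but is not the package's built-in func_gamma; the handling of negative rewards via abs() is an assumption made so that fractional exponents stay defined.

# Stevens' Power Law: U(R) = R^gamma (assumes non-negative rewards)
utility_power <- function(R, gamma) {
  R^gamma
}

# Prospect-Theory-style utility: separate exponents for gains and losses,
# with beta scaling the loss side (gamma = c(gamma1, gamma2))
utility_prospect <- function(R, gamma, beta) {
  ifelse(R >= 0, R^gamma[1], -beta * abs(R)^gamma[2])
}

utility_power(36, gamma = 0.8)
utility_prospect(c(36, -36), gamma = c(0.8, 0.9), beta = 2)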
- eta
[NumericVector]
Parameters used in the Learning Rate Function, rate_func, representing the rate at which the subject updates the difference (prediction error) between the reward and the expected value in the subject's mind.
The structure of eta depends on the model type:
For the Temporal Difference (TD) model, a single learning rate is used throughout the experiment: $$V_{new} = V_{old} + \eta \cdot (R - V_{old})$$
For the Risk-Sensitive Temporal Difference (RSTD) model, two different learning rates are used depending on whether the reward is lower or higher than the expected value: $$V_{new} = V_{old} + \eta_{+} \cdot (R - V_{old}), R > V_{old}$$ $$V_{new} = V_{old} + \eta_{-} \cdot (R - V_{old}), R < V_{old}$$
TD:
eta = 0.3
RSTD:
eta = c(0.3, 0.7)
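The update rules above can be written as a small helper. The sketch below is plain R and mirrors the formulas; it is not the package's func_eta, and the assumption that eta[1] applies to negative prediction errors and eta[2] to positive ones is this sketch's own convention (check rate_func for the package's actual ordering).

# Value update sketch: one eta implements TD, two etas implement RSTD
update_value <- function(V_old, R, eta) {
  pe <- R - V_old  # prediction error
  if (length(eta) == 1) {
    V_old + eta * pe                        # TD: single learning rate
  } else {
    rate <- if (pe < 0) eta[1] else eta[2]  # RSTD: rate depends on the sign of pe
    V_old + rate * pe
  }
}

update_value(V_old = 10, R = 14, eta = 0.3)          # TD
update_value(V_old = 10, R = 6,  eta = c(0.3, 0.7))  # RSTD, reward below expectation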
- initial_value
[double]
The subject's initial expected value for each stimulus's reward. If this value is not set (initial_value = NA), the subject will use the reward received after the first trial as the initial value for that stimulus. In other words, the learning rate for the first trial is effectively 100%.
default:
initial_value = NA_real_
- threshold
[integer]
Controls the initial exploration phase in the epsilon-first strategy. This is the number of early trials where the subject makes purely random choices, as they haven't yet learned the options' values. For example, threshold = 20 means random choices for the first 20 trials. For epsilon-greedy or epsilon-decreasing strategies, threshold should be kept at its default value (a combined sketch of the three epsilon strategies follows the lambda entry below).
$$P(x=1 \text{ (random choosing)}) = \begin{cases} 1, & \text{trial} \le \text{threshold} \\ 0, & \text{trial} > \text{threshold} \end{cases}$$
default:
threshold = 1
epsilon-first:
threshold = 20, epsilon = NA, lambda = NA
- epsilon
[NumericVector]
A parameter used in the epsilon-greedy exploration strategy. It defines the probability of making a completely random choice, as opposed to choosing based on the relative values of the left and right options. For example, if epsilon = 0.1, the subject has a 10% probability of making a random choice and a 90% probability of choosing based on value. This parameter is only relevant when threshold is at its default value (1) and lambda is not set (see the combined sketch after the lambda entry below).
$$P(x) = \begin{cases} \epsilon, & x=1 \text{ (random choosing)} \\ 1-\epsilon, & x=0 \text{ (value-based choosing)} \end{cases}$$
epsilon-greedy:
threshold = 1, epsilon = 0.1, lambda = NA
- lambda
[NumericVector]
A numeric value that controls the decay rate of the exploration probability in the epsilon-decreasing strategy. A higher lambda value means the probability of random choice will decrease more rapidly as the number of trials increases.
$$P(x) = \begin{cases} \frac{1}{1+\lambda \cdot trial}, & x=1 \text{ (random choosing)} \\ \frac{\lambda \cdot trial}{1+\lambda \cdot trial}, & x=0 \text{ (value-based choosing)} \end{cases}$$
epsilon-decreasing:
threshold = 1, epsilon = NA, lambda = 0.5
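The three exploration rules above (epsilon-first, epsilon-greedy, epsilon-decreasing) can be illustrated with a small sketch of the probability of a random choice on a given trial. This is a standalone illustration of the formulas, not the package's func_epsilon, and the order in which the parameters are checked here is an assumption of the sketch.

# Probability of making a random (exploratory) choice on a given trial
p_random <- function(trial, threshold = 1, epsilon = NA, lambda = NA) {
  if (trial <= threshold) {
    1                          # epsilon-first: pure exploration during early trials
  } else if (!is.na(lambda)) {
    1 / (1 + lambda * trial)   # epsilon-decreasing: exploration decays over trials
  } else if (!is.na(epsilon)) {
    epsilon                    # epsilon-greedy: constant exploration rate
  } else {
    0                          # no epsilon-based exploration
  }
}

p_random(trial = 10, threshold = 20)  # epsilon-first
p_random(trial = 10, epsilon = 0.1)   # epsilon-greedy
p_random(trial = 10, lambda = 0.5)    # epsilon-decreasing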
- pi
[NumericVector]
Parameter used in the Upper-Confidence-Bound (UCB) action selection formula, bias_func. It controls the degree of exploration by scaling the uncertainty bonus given to less-explored options. A larger value of pi (denoted as c in Sutton and Barto (2018)) increases the influence of this bonus, leading to more exploration of actions with uncertain estimated values. Conversely, a smaller pi results in less exploration.
$$ A_t = \arg \max_{a} \left[ V_t(a) + \pi \sqrt{\frac{\ln(t)}{N_t(a)}} \right] $$
default:
pi = NA
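A standalone sketch of the UCB rule above, in plain R. It is not the package's func_pi; the value vector V, the choice-count vector N, and the tie-breaking behaviour of which.max are assumptions of this sketch.

# UCB action selection: V holds current option values, N how often each option
# has been chosen so far, t is the current trial number
ucb_choose <- function(V, N, t, pi) {
  bonus <- pi * sqrt(log(t) / N)  # uncertainty bonus for less-explored options
  which.max(V + bonus)            # index of the option with the highest adjusted value
}

ucb_choose(V = c(0.4, 0.5), N = c(8, 2), t = 10, pi = 0.7)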
- tau
[NumericVector]
Parameters used in the Soft-Max Function, prob_func, representing the sensitivity of the subject to the value difference when making decisions. It determines the probability of selecting the left option versus the right option based on their values. A larger value of tau indicates greater sensitivity to the value difference between the options. In other words, even a small difference in value will make the subject more likely to choose the higher-value option.
$$P_L = \frac{1}{1+e^{-(V_L-V_R) \cdot \tau}}; P_R = \frac{1}{1+e^{-(V_R-V_L) \cdot \tau}}$$
default:
tau = NA
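The soft-max rule above in plain R; this is a sketch mirroring the formula, not the package's func_tau.

# Probability of choosing the left option given the two values and tau
softmax_left <- function(V_L, V_R, tau) {
  1 / (1 + exp(-(V_L - V_R) * tau))
}

softmax_left(V_L = 0.6, V_R = 0.4, tau = 1)   # ~0.55, weak preference
softmax_left(V_L = 0.6, V_R = 0.4, tau = 10)  # ~0.88, strong exploitation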
- lapse
[double]
A numeric value between 0 and 1, representing the lapse rate.
You can interpret this parameter as the probability of the agent "slipping" or making a random choice, irrespective of the learned action values. This accounts for moments of inattention or motor errors. In this sense, it represents the minimum probability with which any given option will be selected. It is a free parameter that acknowledges that individuals do not always make decisions with full concentration throughout an experiment.
From a modeling perspective, the lapse rate is crucial for preventing the log-likelihood calculation from returning -Inf. This issue arises when the model assigns a probability of zero to an action that the participant actually chose (log(0) is undefined). By ensuring every option has a non-zero minimum probability, the lapse parameter makes the fitting process more stable and robust against noise in the data.
$$ P_{final} = (1 - lapse) \cdot P_{softmax} + \frac{lapse}{N_{choices}} $$
default:
lapse = 0.02
This ensures each option has a minimum selection probability of 1% (0.02 / 2) in TAFC tasks.
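The lapse adjustment above as a one-line sketch in plain R (an illustration of the formula, not the package's implementation):

# Mix the soft-max probability with a uniform lapse probability
apply_lapse <- function(p_softmax, lapse = 0.02, n_choices = 2) {
  (1 - lapse) * p_softmax + lapse / n_choices
}

apply_lapse(p_softmax = 0)  # 0.01: every option keeps a 1% floor in a TAFC task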
- alpha
[NumericVector]
Extra parameters that may be used in custom functions.
- beta
[NumericVector]
Extra parameters that may be used in custom functions.
- priors
[list]
A list specifying the prior distributions for the model parameters. This argument is mandatory when using estimate = "MAP".
default:
priors = NULL
- util_func
[Function]
Utility Function; see func_gamma.
- rate_func
[Function]
Learning Rate Function; see func_eta.
- expl_func
[Function]
Exploration Strategy Function; see func_epsilon.
- bias_func
[Function]
Upper-Confidence-Bound Function; see func_pi.
- prob_func
[Function]
Soft-Max Function; see func_tau.
- loss_func
[Function]
Loss Function; see func_logl.
- sub
[string]
Column name of the subject ID.
e.g.
sub = "Subject"
- time_line
[CharacterVector]
A vector specifying the names of the columns that define the sequence of the experiment. This argument defines how the experiment is structured, such as whether it is organized by "Block" with breaks in between, and multiple trials within each block.
default:
time_line = c("Block", "Trial")
- L_choice
[string]
Column name of the left choice.
default:
L_choice = "L_choice"
- R_choice
[string]
Column name of the right choice.
default:
R_choice = "R_choice"
- L_reward
[string]
Column name of the reward of the left choice.
default:
L_reward = "L_reward"
- R_reward
[string]
Column name of the reward of the right choice.
default:
R_reward = "R_reward"
- sub_choose
[string]
Column name of choices made by the subject.
default:
sub_choose = "Choose"
- rob_choose
[string]
Column name of choices made by the model, which you could ignore.
default:
rob_choose = "Rob_Choose"
- raw_cols
[CharacterVector]
Defaults to NULL. If left as NULL, it will directly capture all column names from the raw data.
- var1
[string]
Column name of extra variable 1. If your model uses more than just reward and expected value and needs other information, such as whether the choice frame is Gain or Loss, you can pass the 'Frame' column to the model as var1.
default:
var1 = NA_character_
- var2
[string]
Column name of extra variable 2. If a single additional variable (var1) does not meet your needs, you can pass a second one to the model as var2.
default:
var2 = NA_character_
- seed
[integer]
Random seed. This ensures that the results are reproducible and remain the same each time the function is run.
default:
seed = 123
- digits_1
[integer]
The number of decimal places to retain for columns related to the value function.
default:
digits_1 = 2
- digits_2
[integer]
The number of decimal places to retain for columns related to the selection function.
default:
digits_2 = 5
- engine
[string]
"r": Use the pure R version of the code.
"cpp": Use the Rcpp-optimized version.
default:
engine = "cpp"
Examples
data <- binaryRL::Mason_2024_G2
binaryRL.res <- binaryRL::run_m(
mode = "replay",
data = data,
id = 18,
eta = c(0.321, 0.765),
tau = 0.5,
n_params = 3,
n_trials = 360
)
summary(binaryRL.res)
#> Results of RSTD Model (replay):
#> Estimation Method: NA
#> Exploration and Exploitation Trade-off:
#> Initial Values: Initial reward received
#> Exploration Strategy: off
#> Upper-Confidence-Bound: off
#> Soft-Max: on
#> Model Fit:
#> Accuracy: 60%
#> Log-Likelihood: -385.31
#> Log-Prior Probability: NA
#> Log-Posterior Probability: NA
#> AIC: 776.62
#> BIC: 788.28
#> Free Parameters:
#> α: NA
#> β: NA
#> γ: 1
#> η: 0.321, 0.765
#> ε: NA
#> λ: NA
#> π: NA
#> τ: 0.5
#> [[1]]
#> Parameter Value1 Value2
#> 1 Q1 NA NA
#> 2 threshold 1.000 NA
#> 3 lapse 0.020 NA
#> 4 alpha NA NA
#> 5 beta NA NA
#> 6 gamma 1.000 NA
#> 7 eta 0.321 0.765
#> 8 epsilon NA NA
#> 9 lambda NA NA
#> 10 pi NA NA
#> 11 tau 0.500 NA
#>
#> [[2]]
#> Metric Value
#> 1 Accuracy 60.00
#> 2 LogL -385.31
#> 3 LogPr NA
#> 4 LogPo NA
#> 5 AIC 776.62
#> 6 BIC 788.28
#>