Introduction

Model-based or goal-directed planning is a cognitive capacity that involves building a mental map of potential action-outcome links and using it to make considered, flexible and optimal decisions1,2. A consistent finding in the literature is that compulsive behaviours, as seen in Obsessive-Compulsive Disorder (OCD), addiction and aspects of eating disorders, are associated with impairments in model-based planning. This has been shown in online general population samples where individuals vary on a spectrum of compulsivity3, in clinical cohorts4,5, and is suggested to have a developmental origin6. Mechanistically, theories suggest these deficits arise from a failure to create accurate internal models of the world7,8, which leaves patients vulnerable to getting stuck performing habits9. Although the finding is consistent, like many studies assessing the relationship between cognition and mental health symptoms, the effect size is small. To progress our understanding of whether and how model-based planning causally relates to compulsivity, and to develop real-world clinical or public health applications, we need to rethink how we measure it, in whom, and in what setting. One option is to consider population approaches – studying small behavioural effects such as this in larger samples, in real-world settings, and where possible, repeatedly through time. Smartphone science is a promising way to achieve this, though there are concerns that a departure from the experimental control of a lab environment, coupled with changes to core design features of cognitive tasks, may come at the cost of validity, reliability, and data quality. Indeed, the latter has been a source of considerable debate in the cognitive neuroscience literature.

In recent years, several studies have raised issues with how alterations to key parameters of a task commonly used to assess model-based planning, the two-step task, can affect its measurement. One of the earliest studies in this area illustrated that model-based planning is reduced when there are concurrent working memory demands, and that this reduction depends on individual differences in working memory capacity10. Kool and colleagues11 gathered data on two versions of a two-step task; the original version developed by Daw et al.1 and a modified version, which their simulations suggested would increase the incentive value of engaging in model-based planning. They found that the modified version (which included several changes to key task parameters) indeed elicited greater model-based planning compared to the original. Others have shown through simulation that changes to reward probabilities may undermine the validity of standard analyses of the task12. For example, in cases where reward probabilities are unequal (i.e., one second-stage state is more rewarding than another), simulated model-free agents can produce behaviour that appears model-based. Another study found that model-based estimates were significantly greater when participants received in-depth instruction and more practice trials than in the original task13. More recently, researchers assessed the emergence of model-based planning in a task that initially provided no instruction whatsoever14. They found that only a minority of participants adopted a model-based approach to solving a two-step task without instruction, and once instructions were provided, model-based planning estimates rose rapidly. Across all of these studies, an important facet remains untested – do these task variations shift model-based planning scores equivalently across individuals, or do alterations to task design fundamentally change the meaning of the quantity under study, i.e., its external validity? Recent work offers mixed evidence. Assessing differences in task motivation, Patzelt et al.15 found that offering larger reward amounts increased mean levels of model-based planning, but this did not affect its association with compulsivity. Castro-Rodrigues et al.14, on the other hand, found evidence to suggest that differences between OCD patients and controls may be smaller when detailed instructions are provided, though the sample (N = 46 OCD patients) was perhaps too small to test this definitively.

A second and important issue is how task modifications affect reliability, which sets a ceiling for the size of the association one can observe with compulsivity. Test-retest estimates of model-based planning from the traditional task have been mixed, ranging from poor to good (r = [0.14–0.40])16, or non-existent to excellent (r = [−0.10–0.91], median = 0.45)17, depending on analytic choices. This finding is common to many tasks used for individual difference research18 and is thought to be, in part, the result of the reliability paradox, whereby tasks designed to examine within-subject effects (such as the Flanker and Stroop tasks) have low between-subject variability19. One simple way to increase reliability is to increase the amount of data (i.e., trial number) gathered per participant20. While this helps, reliabilities eventually plateau, often below an acceptable level. For example, Stroop reaction times become more reliable with additional trials up to a point, but intraclass correlation coefficient (ICC) values plateau around 0.419. Similarly, Price et al.21 found relatively consistent ICC values for an attentional bias metric measured from just 48 trials compared to an estimate from 320 trials, suggesting that the benefit of adding trials may be tenuous. Further, it is unclear if and how reinforcement learning tasks like the two-step task benefit from additional trials and, importantly, whether improvements in reliability translate into improvements in external validity. For example, early trials might measure something qualitatively different from later trials, particularly in higher-order cognitive tests where rules are learned and then deployed, allowing more automatic forms of behaviour to take over.

The present study aimed to address these issues by gamifying and then optimising a commonly used task that tracks individual differences in compulsivity, testing whether key features of task design and trial number could boost its reliability and external validity. This requires large samples, and so we developed a diamond-shooting game called Cannon Blast that could be played by members of the public, also known as citizen scientists22, from anywhere in the world in an at-home environment. Cannon Blast was designed to be fun and repeatable, but critically it contained key features of the classic two-step task, allowing us to assess model-based planning. We aimed to validate the game in two ways: first by establishing that it elicits model-based behaviour similar to the traditional task, and then by demonstrating that model-based estimates correlate across tasks. Next, we released the game to the general public through our lab's non-profit Neureka app (http://www.neureka.ie), and by leveraging large-scale data collection, aimed to test if the estimates of model-based planning derived from the gamified task would show the same associations with demographic and individual difference measures, such as older age23,24, female gender3, and lower IQ and processing speed3,25, as well as the specific negative association with compulsivity3,5,26. Finally, we wanted to utilise these associations as 'ground truth' to assess if the external validity of model-based planning estimates is affected by modifications to the task set-up. We compared transition probabilities that were more or less deterministic (80:20 vs. 70:30), used different sets of drifting reward probabilities, varied concurrent task demands (i.e., the difficulty of the diamond-shooting task itself), compared earlier vs. later trials of the game and tested the impact of increasing trial numbers.

Methods

The procedure and statistical plan for both experiments described below were not preregistered.

Ethical considerations and data protection

This research was granted ethical approval by the Research Ethics Committee of the School of Psychology at Trinity College Dublin (Approval number: SPREC072019-01). The Neureka app is a non-profit smartphone application developed and maintained by the Gillan Lab, Trinity College Dublin. For Experiment 1, prospective participants received an information sheet and gave informed consent through the online survey platform Qualtrics. Participants in both experiments were also required to read the information sheet and consent to participation embedded in the registration process for Neureka. This described the wider scientific aims of the Neureka Project, what participation involves, terms of data use, data protection procedures, health risks, withdrawal of data procedures and points of contact. For more detail on the exact contents of the information sheet provided to participants, see Supplementary Note 1. Data collected through Neureka are stored and processed in accordance with the EU General Data Protection Regulation.

Experiment 1

Participants

We recruited participants to complete the traditional two-step task in a web-browser and Cannon Blast in the smartphone app Neureka. We targeted a minimum sample size of N = 50, which provides 80% power to detect a medium effect with a significance level set at p < 0.05. To allow for data loss and exclusions, data were collected from N = 68 participants who were 18 years or older and had access to both a smartphone and a computer with an internet connection. Participants were compensated €10 upon completion of both tasks. After applying exclusion criteria, N = 57 remained for analysis (43 women (66%) and 14 men (34%) aged between 18–46 (M = 22.95, SD = 5.6)). Gender identification was collected in-app by asking 'What gender do you most identify with?' with a list of seven options: male (hereafter 'man'), female (hereafter 'woman'), transgender male (hereafter 'non-cisgender'), transgender female (hereafter 'non-cisgender'), non-binary (hereafter 'non-cisgender'), not listed (hereafter 'non-cisgender'), or prefer not to say.

Procedure

Participants were recruited and tested online. During the sign-up process, they provided electronic consent, along with self-reporting basic demographic (age, gender, education) and eligibility information. They completed the traditional two-step task in a web-browser on a laptop or desktop computer and Cannon Blast on an iOS or Android smartphone. The order of the tasks was counterbalanced across participants, and the entire study took less than 60 min.

Cannon Blast

The goal of Cannon Blast is to hit as many diamonds as possible in 100 shots (Fig. 1a). On each trial, participants first aimed their cannon and then selected between two containers of purple and pink balls. The left container always contained more purple balls (80%) and the right more pink balls (80%). In contrast to the traditional task, this transition structure did not have to be learned or remembered; it was visibly displayed on-screen, i.e., each container displayed eight balls of the corresponding colour and two balls of the alternate colour. After a container was selected, a ball was randomly drawn from it; the ball either matched the container's majority colour (80% of trials, a common transition) or was the minority colour (20% of trials, a rare transition) (Fig. 1b).

Fig. 1: Task structure of Cannon Blast, a smartphone game to assess model-based planning.

a In this game, participants' goal is to shoot as many diamonds as possible before their total number of shots (100 per block) runs out. To do so, they must aim a central cannon and then select which circular container to draw from. b Purple and pink balls dynamically bounce around each of the flanking containers, which depict the probability of a pink or purple ball being released. For example, the left container displays 8 purple balls and releases a purple ball 80% of the time (a common transition) and displays 2 pink balls, giving a pink ball on 20% of trials (a rare transition). c The purple and pink balls have different values that dynamically change throughout the game. The value of the ball is defined as the probability of it being a 'good ball', i.e., one that remains intact after firing (rewarding trial), or a dud ball (non-rewarding trial) that explodes shortly after being fired, and therefore cannot reach the diamond. d We included 2 drifting reward probabilities (A, B) that quantitatively differed on various metrics (see Supplementary Table 6). Participants were randomly assigned a reward drift set for each block, leading to four distinct drift set combinations (A-A, A-B, B-A, B-B).

On what we define as rewarding trials, participants received a good ball that exited from the cannon and could be used to hit the diamond (Fig. 1c). There was no guarantee that a good ball would actually hit a diamond; this depended on the participant's aim and timing. Alternatively, participants were unrewarded if the ball disintegrated upon firing, thus reducing the chance of hitting a diamond to zero. The probability of being rewarded (in other words, receiving a good ball) drifted independently over the course of the task, much like the second-stage outcomes in the traditional task. However, the traditional task typically utilises a single pre-determined drifting reward probability structure for the 200 trials of the task. To allow us to assess the potential impact of drift dynamics on parameter estimation (in Experiment 2), Cannon Blast instead used two possible drift structures for each block of trials. Participants were randomly assigned drift A or drift B for each of their 100-trial blocks, leading to a total of four reward probability drift combinations for Cannon Blast participants (A-A, A-B, B-A, B-B: Fig. 1d).
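
For readers unfamiliar with drifting reward probabilities, the sketch below (in R) illustrates the kind of Gaussian random walk with reflecting boundaries commonly used to generate such drifts in two-step tasks. This is illustrative only: Cannon Blast used fixed, pre-generated drift sets (A, B), and their exact generation procedure, step size and boundaries are not specified here; the boundary values below simply echo the observed range of one drift set.

set.seed(1)
# Illustrative Gaussian random walk with reflecting boundaries (not the actual drift sets A/B)
simulate_drift <- function(n_trials = 100, start = 0.6, sd_step = 0.025,
                           lower = 0.41, upper = 0.94) {
  p <- numeric(n_trials)
  p[1] <- start
  for (t in 2:n_trials) {
    step <- p[t - 1] + rnorm(1, mean = 0, sd = sd_step)
    # Reflect off the boundaries to keep the probability within range
    if (step > upper) step <- 2 * upper - step
    if (step < lower) step <- 2 * lower - step
    p[t] <- step
  }
  p
}
drift_purple <- simulate_drift(start = 0.80)  # reward probability of purple balls
drift_pink   <- simulate_drift(start = 0.60)  # reward probability of pink balls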

The similarities and differences between the original two-step task and our gamified Cannon Blast are presented in Supplementary Table 1. Like the original, Cannon Blast consisted of 200 trials, divided into two blocks of 100 trials. The first block was set at an easy difficulty, and the second at a medium difficulty. Difficulty level had no direct bearing on the core parameters of interest (which container participants selected, rewards, drifts, etc.) and instead reflected how challenging the aim-and-shoot trajectory was. However, as we explore in Experiment 2, level difficulty can be conceived of as a distraction manipulation. Easy levels included trials where the diamond did not move, had static obstructions that limited the angle at which it could be hit, or where diamonds moved slowly around the screen. Medium difficulty levels included more challenging trials with both moving diamonds and moving obstructions (Supplementary Table 2). While on average Medium trials were more difficult than Easy trials (average hit rate: Medium = 45%, Easy = 52%), there was variation within both Easy levels (hit rates 83%, 53%, 44%, 29%) and Medium levels (hit rates 75%, 20%, 39%, 45%) (Supplementary Table 2). Reward probabilities in Cannon Blast were set higher on average (average reward probability: AA = 0.80 [0.64–0.94], AB = 0.72 [0.41–0.94], BA = 0.72 [0.41–0.94], BB = 0.64 [0.41–0.94]) than in the original task (mean reward probability = 0.52 [0.25–0.75]) to promote enjoyment and limit frustration. In contrast to the traditional two-step task, which includes 40 practice trials, Cannon Blast starts with a short, passive walk-through demonstration of the task (Supplementary Figure 1). While the traditional task design has both first- (choice of rocket) and second-stage (choice of alien) actions, Cannon Blast has first-stage (choice of container) actions only. The decision to remove second-stage actions was made in part for gameplay reasons, but removing them has also been shown to increase the importance of model-based contributions to the first-stage choice11. A final major distinction between the tasks was the stated goal; in the traditional task, participants are directly told to earn rewards (space treasure). In Cannon Blast, participants are told to shoot as many diamonds as possible, and that this can be facilitated by ensuring they maximise rewards (good balls).

Traditional two-step reinforcement learning task

Participants completed an adapted version of the two-step reinforcement learning task1, developed by Decker et al.27. The contents of this task have been described in detail in that paper27 and are summarised in Supplementary Figure 2 and Supplementary Table 1.

Data analysis

Exclusion criteria

Participants were excluded from the traditional task if they: (a) missed more than 20% of trials (N = 2)7, or (b) responded with the same key press at the first stage of the task on more than 95% of trials (N = 5)3. Exclusion criteria for Cannon Blast were harmonised with these as much as possible. We excluded participants if they (a) missed more than 20% of trials (N = 1) or (b) selected the same container more than 95% of the time (N = 4). However, it is important to note that for criterion (a), as there was no time limit to make this response (unlike the traditional task), participants could not miss trials due to being too slow or disengaged (unless they quit the experiment entirely). Notwithstanding, we noted that some trials were missing for 2 users from our app database (presumably due to a technical glitch), and for one of these users the missing data exceeded the 20% threshold, so they were excluded. Combining all exclusion criteria for both tasks, N = 11 (16%) participants were excluded, with N = 57 remaining for analysis (37 women, aged between 18–32 (M = 22.01, SD = 4.12)).
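
To make the exclusion logic concrete, a minimal sketch in R is given below; it is illustrative rather than the authors' exact script. It assumes a hypothetical trial-level data frame d with columns subj (participant ID), missed (1 if no response was recorded on that trial) and choice (which container/key was selected); the thresholds match those described above.

library(dplyr)

# Flag participants meeting either exclusion criterion
excluded <- d %>%
  group_by(subj) %>%
  summarise(prop_missed = mean(missed),
            prop_same_choice = max(prop.table(table(choice)))) %>%
  filter(prop_missed > 0.20 | prop_same_choice > 0.95)

# Retain the remaining participants for analysis
d_clean <- d %>% filter(!subj %in% excluded$subj)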

Quantifying model-based planning

All analyses were performed in RStudio version 1.4.1106 (http://cran.us.r-project.org). Across both task versions, data distribution was assumed to be normal (Fig. 2a, b) but this was not formally tested. Hierarchical logistic regression (HLR) models, which are mixed effects models for a binary outcome variable, were implemented with the lme4 package in R. The model tested if participants' choice behaviour in the first stage state (coded as switch: 0 and stay: 1, relative to their previous choice) was influenced by reward (coded as unrewarded: −1 and rewarded: 1), transition (coded as rare: −1 and common: 1), and their interaction, on the preceding trial. Within-participant factors (main effects of reward, transition and their interaction) were modelled as random effects. The model-based index (MBI) is quantified as the interaction between Reward (traditional task: space treasure vs. dust; Cannon Blast: good vs. dud ball) and Transition (traditional task: common vs. rare transition to a planet from the chosen rocket; Cannon Blast: common vs. rare ball colour appearing from the chosen container). In line with prior work on the traditional task, we also quantified a model-free index (MFI; the main effect of Reward) and choice repetition (the intercept of the model). Individual estimates for each parameter (MBI, MFI, choice repetition (hereafter 'stay') and transition) were extracted for each task and compared across tasks using Pearson correlation. We assessed the internal consistency of each task using split-half correlation (odd-even split method) using the guidelines from Cicchetti28: <0.4 poor, 0.4–0.7 fair, 0.7–0.9 good and >0.9 excellent.
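
A minimal sketch of this regression in R (lme4) is shown below; it is not the authors' exact script. It assumes a hypothetical trial-level data frame d with columns subj, stay (1 = repeated the previous first-stage choice), prev_reward (−1/1) and prev_transition (−1/1), plus a per-participant trial index trial for the odd-even split.

library(lme4)

# Hierarchical logistic regression: stay behaviour predicted by previous reward,
# previous transition and their interaction, with per-participant random slopes
m <- glmer(stay ~ prev_reward * prev_transition +
             (1 + prev_reward * prev_transition | subj),
           data = d, family = binomial)

# Per-participant estimates: the interaction term is the model-based index (MBI),
# the reward main effect is the model-free index (MFI), the intercept indexes choice repetition
coefs <- coef(m)$subj
mbi <- coefs[, "prev_reward:prev_transition"]
mfi <- coefs[, "prev_reward"]
stay_bias <- coefs[, "(Intercept)"]

# Split-half (odd vs. even trials) internal consistency of the MBI
m_odd  <- update(m, data = subset(d, trial %% 2 == 1))
m_even <- update(m, data = subset(d, trial %% 2 == 0))
cor(coef(m_odd)$subj[, "prev_reward:prev_transition"],
    coef(m_even)$subj[, "prev_reward:prev_transition"])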

Fig. 2: Validation of Cannon Blast against the traditional two-step task (N = 57).

Box-plot of stay probabilities across all trials and all participants from (a) the traditional task and (b) Cannon Blast. Each dot represents a participant. c Model-based indices (MBI) from the traditional two-step task positively correlated with model-based indices derived from Cannon Blast. Internal consistency of model-based indices in the traditional task (d) and Cannon Blast (e) using the split-half odd-even reliability approach. For c–e, each dot represents a participant, the red line indicates the line of best fit and the grey area represents the 95% confidence interval.

Experiment 2

Participants

Between June 2020 and October 2022, we collected data from 7466 unpaid citizen scientist users of the Neureka app. After applying the exclusions detailed below, N = 5005 remained for analysis, comprising 3225 (64%) women, 1683 (34%) men, 82 (2%) who did not identify as cisgender (including transgender, non-binary or not listed) and 15 (0.3%) who preferred not to disclose (Supplementary Table 3). The sample included a wide age range of 18–84 (M = 45.38, SD = 14.54) and 64% had attained a post-secondary degree or higher (N = 3220). The sample was well-powered for individual difference analysis; an a priori power analysis based on a prior paper3 indicated that a sample size of N = 541 participants was required to achieve 80% power (two-tailed test, p < 0.05) in detecting the association between self-reported compulsivity symptoms and model-based planning measured from the traditional task using an online paid sample, controlling for covariates of age, gender and IQ (r = 0.12).

Procedure

Cannon Blast

Data were collected from citizen scientists who downloaded the Neureka app from the Play Store (Android users) or Apple's App Store (iPhone users), provided informed consent and completed tasks designed to help scientists learn about the brain. Participants completed Cannon Blast within the Neureka app in two ways: as part of a science challenge called Risk Factors designed to measure aspects of cognition and individual-level risk factors, or as a stand-alone challenge within the Free Play section of the app. Risk Factors included two other games along with a battery of self-report sociodemographic and lifestyle questionnaires, which were used to collect information on participants' age, gender and education. The order in which games and questionnaires were presented was pseudo-randomised such that a game was always delivered first, followed by alternating blocks of questionnaires and games. The majority of users completed the version of Cannon Blast described in Experiment 1 (N = 2884), but N = 2138 (gathered from July 2021 onwards) completed a version where the transition structure was set to 70:30, allowing us to examine the impact of transition probability on model-based planning. In Free Play, participants could re-engage with Cannon Blast as a stand-alone game. Here, participants selected a difficulty level of their choice and completed 100 trials in that setting. This allowed us to examine the test-retest reliability of model-based planning, assess how collecting more trials per participant affects the reliability and validity of estimates, and disentangle the impact of block difficulty (Easy, Medium) from order effects (1st Block, 2nd Block). An additional Hard difficulty level was available in the Free Play section only, with even more challenging diamond movements and obstacles to navigate.

Self-report psychiatric questionnaires and transdiagnostic factors

The Neureka app contains another science challenge called My Mental Health, where participants can complete validated questionnaires assessing nine aspects of mental health (209 items). Of the N = 5005 participants for whom we had Cannon Blast data, N = 1451 additionally completed this section. The nine questionnaires were used to measure alcohol dependency (Alcohol Use Disorder Identification Test, AUDIT:29), apathy (Apathy Evaluation Scale, AES:30), depression (Self-rated Depression Scale, SDS:31), eating disorders (Eating Attitudes Test, EAT-26:32), impulsivity (Barratt Impulsivity Scale, BIS-11:33), obsessive compulsive disorder (Obsessive Compulsive Inventory Revised, OCI-R:34), schizotypy (Short Scale for Measuring Schizotypy, SCZ:35), social anxiety (Liebowitz Social Anxiety Scale, LSAS:36) and trait anxiety (trait portion of the State-Trait Anxiety Inventory, STAI:37). All questionnaires were presented to participants in a randomised order. Means and standard deviations of the mental health questionnaire total scores, along with their associations with age, gender and education, are presented in Supplementary Table 4. Internal consistency of all questionnaires ranged from good to excellent, Cronbach's α = [0.87–0.95]. Prior work has shown that these item-level responses can be summarised as three transdiagnostic factors: Anxious-Depression (AD), Compulsivity and Intrusive Thought (CIT), and Social Withdrawal (SW)3. We applied the weights from Gillan et al.3 to the Neureka data for all analyses reported here, but additionally repeated the factor analysis to ensure that there were no major differences across the samples. Correlations between the weights derived from Gillan et al.3 and the present study were very high for each dimension, r = [0.95–0.97]. Of note, there was a typographical error in the response options. Sensitivity analyses presented in the online supplement suggest this has little bearing on the results, and indeed the correlation across derived factors from the two datasets suggests a high degree of consistency (Supplementary Method 2). Other science challenges within the app collected data to derive transdiagnostic scores of compulsivity and anxious-depression using a reduced set of items from those scales. Previous work has validated this reduced set against the original set of items38. Using these items, we had task and compulsivity data from N = 2369 participants. We used data from the N = 1451 who had a compulsivity score from the full set of items to test for independent clinical associations with MBI, along with testing a covariate model with anxious-depression, compulsivity and social withdrawal. We then used the data from the N = 2369 with a compulsivity score from the reduced items in the analyses relating gameplay metrics to individual differences and for testing the impact of task optimisations on external validity.
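
As a rough illustration of how item-level responses can be converted to transdiagnostic factor scores with published weights, the sketch below assumes a hypothetical N × 209 matrix of questionnaire responses (items) and a 209 × 3 matrix of regression weights (loadings, columns AD, CIT, SW). The exact scoring procedure (item ordering, any standardisation) must follow Gillan et al.3; the z-scoring shown here is an assumption, not a description of the authors' pipeline.

# Hypothetical example: weighted sum of (standardised) item responses per factor
factor_scores <- scale(as.matrix(items)) %*% as.matrix(loadings)
colnames(factor_scores) <- c("AD", "CIT", "SW")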

Data analysis

Exclusion criteria

As in Experiment 1, participants were excluded for (a) missing more than 20% of trials on their first session (N = 2394, most of whom started, but did not complete, the game), and (b) selecting the same container on more than 95% of trials (N = 797). A further N = 48 were excluded for having incomplete demographic data required for external validation, leaving N = 5005 for analysis (33% data loss).

Quantifying model-based behaviour

We used the same basic HLR as in Experiment 1 for all analyses. Data distribution was assumed to be normal (Fig. 3a) but this was not formally tested. Additional analyses comparing the reliability and external validity of alternative approaches to deriving model-based planning scores, as point estimates (PE) and using Hierarchical Bayesian modelling (HB), can be found in the supplementary material (Supplementary Method 3, Supplementary Table 18). Overall, the results were highly similar, but there was an advantage for the HLR and so this estimation method was brought forward for analyses (Supplementary Method 4, Supplementary Table 19). Participants' individual regression coefficients for the interaction between Reward and Transition (the model-based index, MBI) were extracted from the basic HLR and brought forward for further analyses (e.g., to assess associations with clinical and individual difference measures, to test the impact of task modifications and to assess split-half and test-retest reliability).

Fig. 3: Large-scale external validation of Cannon Blast in citizen scientists (N = 5005).

a Box-plot of stay probabilities across all trials and all participants for Cannon Blast (N = 5005). Each dot represents a participant. b Internal consistency of the model-based index (MBI) using the split-half (odd vs. even) method. c Test-retest reliability of MBI from N = 423 who had 400 trials of Cannon Blast, using the intraclass correlation coefficient (ICC). MBI estimates from participants' first 200 trials (Baseline Trials) plotted against MBI estimates from participants' next 200 trials (Follow-up Trials). For b, c, each dot represents a participant, the red line indicates the line of best fit and the grey area represents the 95% confidence interval. d Model-based associations with individual differences (age, gender, and education) in N = 5005 citizen scientists. Reductions in MBI were associated with older age, female gender and lower education. e Model-based associations with individual clinical questionnaires and transdiagnostic dimensions in a sub-sample (N = 1451) who had completed a battery of self-report clinical questionnaires. Greater levels of eating disorder and impulsivity symptom severity were associated with deficits in MBI. A transdiagnostic dimension of Compulsivity and Intrusive Thought (CIT) showed a specific association relative to Anxious-Depression (AD) and Social Withdrawal (SW). For d, e, error bars reflect the standard errors of the mean. Both analyses controlled for age, gender, and education. *p < 0.05; **p < 0.01; ***p < 0.001.

Examining how task parameters affect model-based estimates

We carried out a series of analyses designed to test if alterations to task parameters affect (i) mean MBI levels, (ii) its external validity, defined as the associations between MBI and individual difference measures of compulsivity, age, gender and education, and (iii) its internal consistency using split-half correlation and/or test-retest reliability using intraclass correlation coefficients. The structure of these analyses varied across parameter manipulations due to between- vs. within-subject manipulations and data availability. First, we examined transition structure, which was manipulated between-subject. As fewer participants experienced the 70:30 (N = 2138) transition ratio compared to the 80:20 (N = 2884), we down-sampled the 80:20 group and propensity score matched them on age, gender and education using the MatchIt R package (for descriptive information on this matching see Supplementary Table 3). Secondly, we manipulated task difficulty and order within-subject. We tested for differences in model-based planning estimates during the Easy (1st Block) vs. Medium (2nd Block) trials, and complemented this with analysis of Free Play data. As each block consisted of just 100 trials, we used the Spearman-Brown prophecy formula to assess reliability, which allows for corrections when trial number is reduced16,39: corrected reliability = (2 × reliability) / (1 + reliability). A subset of participants (N = 785) had repeated plays of Cannon Blast, accessed in the Free Play section of the app. These data allow us to disentangle practice/order effects from difficulty. Descriptive information relating to these Free Play sessions is presented in Supplementary Table 5. Thirdly, the classic version of this task includes a single drift sequence that all subjects experience, and little is known about the implications of using different sequences. To address this, we randomised participants (i.e., between-subject) to possible drifting reward probability conditions that differed in several potentially important dimensions, including their distinguishability, average reward rate and the changeability of reward probabilities (Supplementary Table 6). To avoid a wash-out of effects, we tested the impact of drifts experienced in the first block of participants' first play. Also, to keep the data as homogeneous as possible, only two of the ten possible reward drifts used in the app were implemented in participants' first play. In a between-subjects design, participants were either assigned Drift A (N = 2139) or Drift B (N = 2345). In repeated plays, participants were randomly assigned to one of the ten possible drifts (Drift A-J, Supplementary Figure 3). Unfortunately, at the time of submission we did not have sufficient power to present analyses using the other eight possible drifts. Finally, we tested the impact of trial number on model-based estimates. To do this, we examined within-subjects data from those who played at least 300 trials of Cannon Blast (N = 716). Here, we estimated MBI with varying amounts of data per participant, starting at 25 trials and increasing in increments of 25 trials up to 300 trials.
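
A brief sketch of these reliability and matching computations in R is given below; it is illustrative rather than the authors' exact code. It assumes hypothetical per-participant MBI vectors from two halves (mbi_half1, mbi_half2) and two sessions (mbi_t1, mbi_t2), and a participant-level data frame dat with a binary indicator ratio70 plus age, gender and education columns.

library(MatchIt)
library(psych)

# Spearman-Brown correction for a half-length (100-trial) block
r_half <- cor(mbi_half1, mbi_half2)
r_corrected <- (2 * r_half) / (1 + r_half)

# Propensity-score matching of the larger 80:20 group to the 70:30 group
m_out <- matchit(ratio70 ~ age + gender + education, data = dat, method = "nearest")
dat_matched <- match.data(m_out)

# Test-retest reliability across sessions (intraclass correlation coefficients)
ICC(data.frame(mbi_t1, mbi_t2))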

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

Construct validity and reliability of Cannon Blast

Cannon Blast embeds the classic two-step task structure (i.e., drifting rewards, a probabilistic transition structure) within a diamond-shooting game. In this game, users aim a cannon at a diamond presented on screen, which might be static, moving around the screen or partially obstructed, depending on the difficulty level. Next, they select which of two containers they want to draw a ball from. The containers each have a mix of purple and pink balls; one has 80% pink balls and the other 80% purple, corresponding directly to the probability that a ball of that colour will be released. Not all balls work; some explode upon being released from the cannon. This is partially predictable from the colour of the ball, whereby the chances that a pink/purple ball will explode drift slowly and independently over the course of the task. Unlike the traditional task, where rewards are an end in themselves, in Cannon Blast, rewards (i.e., a good ball to shoot with) have value insofar as they allow the user to shoot the diamond. This means there are two potential forms of reward in this task – getting a good ball and hitting the diamond. For clarity, we define Reward in Cannon Blast as the former, but unpack the impact of the latter on choice in our later analyses. To validate the task, 57 paid participants played Cannon Blast and the traditional version of the task. Both tasks demonstrated a significant main effect of Reward (model-free index, hereafter 'MFI') and a Reward x Transition interaction (model-based index, hereafter 'MBI') (Table 1, Fig. 2a). In both tasks, participants tended to repeat choices across trials (i.e., a positive intercept), but the tasks differed in the main effect of transition, which was positive and negative in the traditional and gamified tasks, respectively. To directly compare behaviour across the two tasks, an analysis of the entire dataset was conducted with Task Type (Cannon Blast, Traditional) as a fixed effect. This revealed a number of differences across tasks. Participants playing Cannon Blast tended to repeat their choices more often, were more model-based, and were less likely to stay following a common transition (Table 1). There was a moderate positive association between MBI derived from Cannon Blast and the traditional task (r(55) = 0.40, p = 0.002, 95% CI = [0.16, 0.60], Fig. 2b). Split-half reliability for MBI was similar for the traditional task (r(55) = 0.81, p < 0.001, 95% CI = [0.70, 0.88]) and Cannon Blast (r(55) = 0.78, p < 0.001, 95% CI = [0.66, 0.87], Fig. 2c).

Table 1 Mixed effects logistic regression analysis of the Traditional Task and Cannon Blast.

Next, we replicated the findings reported above in a larger cohort of unpaid citizen scientists (N = 5005) who played Cannon Blast after downloading the smartphone app Neureka. There was evidence of model-free behaviour (main effect of Reward: β = 0.47, 95% CI = [0.45, 0.48], SE = 0.01, p < 0.001), model-based behaviour (Reward x Transition: β = 0.27, 95% CI = [0.25, 0.28], SE = 0.01, p < 0.001), an overall tendency to repeat choices (Intercept: β = 1.33, 95% CI = [1.30, 1.35], SE = 0.01, p < 0.001), and participants were more likely to stay following a rare transition (Transition: β = −0.08, 95% CI = [−0.08, −0.07], SE = 0.00, p < 0.001) (Fig. 3a; Supplementary Table 7). Internal consistency of the MBI was lower than in the smaller paid cohort (r(5003) = 0.65, p < 0.001, 95% CI = [0.63, 0.67], Fig. 3b). An intraclass correlation (ICC) was used to assess the test-retest reliability of the MBI in those who had played at least 4 blocks of Cannon Blast (i.e., 400 trials, N = 423) within a 30-day timeframe. Specifically, we compared MBI from their first session of Cannon Blast (200 trials, which were completed alongside questionnaires and other games) with their next 200 trials (completed in a section of the app where participants could repeatedly play Cannon Blast exclusively). We found test-retest reliability was moderate (ICC1: r(421) = 0.63, p < 0.001, 95% CI = [0.57, 0.67], Fig. 3c). Mean time elapsed between first and last session was 6.12 (±6.93) days, with a median interval of 4 days.

External validation of Cannon Blast

Using data collected from citizen scientists (N = 5005) playing Cannon Blast remotely, we replicated previous lab-based associations between MBI and individual differences (Fig. 3d, Supplementary Table 8). Older adults (β = −0.02, 95% CI = [−0.03, −0.01], SE = 0.00, p < 0.001), women (β = −0.02, 95% CI = [−0.03, −0.01], SE = 0.00, p = 0.001), and those with less education (β = −0.04, 95% CI = [−0.05, −0.03], SE = 0.00, p < 0.001) all showed reductions in MBI. N = 1451 completed an additional section of the app that contained a comprehensive battery of self-report mental health questionnaires. We observed associations between MBI and eating disorder symptoms (β = −0.02, 95% CI = [−0.04, −0.00], SE = 0.01, p = 0.031) and impulsivity (β = −0.02, 95% CI = [−0.04, −0.00], SE = 0.01, p = 0.026), controlling for age, gender, and education (Fig. 3e, Supplementary Table 8). We refactored the raw questionnaire items (209 items) into three transdiagnostic dimensions, defined as Anxious-Depression (AD), Compulsivity and Intrusive Thought (CIT) and Social Withdrawal (SW), based on previous work3. These factors were entered together as independent variables in a model predicting MBI, controlling for age, gender and education. Consistent with prior work using the traditional task, we found a specific pattern of a statistically significant association between MBI and CIT (β = −0.03, 95% CI = [−0.05, −0.01], SE = 0.01, p = 0.004) but no statistically significant evidence for an association between MBI and AD (β = −0.00, 95% CI = [−0.03, 0.02], SE = 0.01, p = 0.733) or SW (β = 0.02, 95% CI = [−0.00, 0.04], SE = 0.01, p = 0.106) (Fig. 3e, Supplementary Table 8). For completeness, we also assessed the association of these individual difference measures with model-free and stay behaviour. There was no statistically significant evidence for associations between the individual difference or clinical measures and the tendency to repeat choices (Supplementary Table 9). However, individual differences in model-free learning behaved similarly to model-based planning in our task. The MFI was statistically associated with age, gender, education, and compulsivity, in the same direction as MBI (Supplementary Method 1; Supplementary Table 9).

Broader patterns of gameplay by citizen scientists

On average, participants received rewarding good balls on 145/200 trials (~75%). Consistent with the set-up of the task, the more participants utilised a model-based approach, the more good balls they received (β = 1.76, 95% CI = [1.21, 2.31], SE = 0.28, p < 0.001); there was no statistically significant association for individual differences in model-free learning (β = 0.27, 95% CI = [−0.10, 0.63], SE = 0.19, p = 0.149). Those with less education were less likely to receive good balls (β = −0.26, 95% CI = [−0.43, −0.09], SE = 0.09, p = 0.003) and this trended in the same direction for women (β = −0.36, 95% CI = [−0.73, 0.01], SE = 0.19, p = 0.057). There was no statistically significant evidence for an association between receiving good balls and age (β = −0.13, 95% CI = [−0.31, 0.04], SE = 0.09, p = 0.134). In N = 2369 participants who completed an abbreviated compulsivity scale (20 items versus 209, see Methods38), we found no statistically significant evidence for an association between number of good balls received and CIT (β = 0.16, 95% CI = [−0.22, 0.53], SE = 0.19, p = 0.416) or AD (β = −0.06, 95% CI = [−0.43, −0.09], SE = 0.20, p = 0.762). In terms of hitting the diamond, older adults (β = −3.39, 95% CI = [−3.72, −3.07], SE = 0.17, p < 0.001), women (β = −9.48, 95% CI = [−10.16, −8.79], SE = 0.35, p < 0.001), and those less educated (β = −0.47, 95% CI = [−0.79, −0.16], SE = 0.16, p = 0.003) were less likely to hit the diamond (considering only trials with good balls). This was also the case for those with greater self-reported CIT (β = −1.29, 95% CI = [−1.88, −0.71], SE = 0.30, p < 0.001), while AD was associated with being more likely to hit the diamond (β = 0.89, 95% CI = [0.32, 1.47], SE = 0.29, p = 0.002), controlling for age, gender, and education. Both model-based (β = 6.57, 95% CI = [5.58, 7.55], SE = 0.50, p < 0.001) and model-free (β = 4.38, 95% CI = [3.72, 5.03], SE = 0.33, p < 0.001) behaviours were positively associated with hitting the diamond, suggestive of a general attention/engagement effect.

The diamond hitting task was fairly challenging in that on trials where participants received good balls (and therefore a hit was possible), they hit the target on just 36.5% of shots. This means that on ~27% of all trials, users received not only a good ball, but also the additional reward of hitting a diamond, which increased their score. This additional reward could plausibly impact model-based/free behaviour by amplifying the reward signal on those trials. Another possibility, however, is that because the diamond moves location after it is successfully hit, this could in fact interrupt learning and so have the opposite effect. We tested this by entering lagged Diamond Hit, alongside Reward and Transition, into the model predicting stay behaviour. We found no statistically significant evidence for a main effect of lagged diamond hit on stay behaviour (β = −0.05, 95% CI = [−0.10, 0.00], SE = 0.03, p = 0.053). However, on trials following a diamond hit, users showed reduced model-free behaviour (Reward x Diamond Hit interaction, β = −0.14, 95% CI = [−0.19, −0.09], SE = 0.03, p < 0.001) and reduced model-based behaviour (Reward x Transition x Diamond Hit interaction, β = −0.09, 95% CI = [−0.14, −0.04], SE = 0.03, p < 0.001) (Supplementary Table 10). Together, these findings suggest that hitting a diamond functioned to impair learning, not potentiate it.

Effect of task modifications on the external validity and reliability of model-based planning

Transition structure

The initial set of participants completed a version of the task with an 80:20 transition ratio (i.e., the probability of purple/pink balls from the respective containers). To test whether transition ratio affects estimates of model-based planning, we adjusted it in-app to 70:30 (as per the classic two-step task) and continued to gather data. We subsampled our total dataset to achieve two age-, gender-, and education-matched samples of N = 2138 that completed the 80:20 and 70:30 versions, respectively. Participants in both groups showed a main effect of Reward, a main effect of Transition, and a Reward x Transition interaction (Supplementary Table 7). The 80:20 transition ratio yielded significantly greater MBI (M = 0.30, SD = 0.34) compared to the 70:30 version (M = 0.22, SD = 0.31), t(4233.1) = 8.40, p < 0.001 (Table 2, Fig. 4a(i)). However, transition ratio did not affect reliability, with both versions showing similar split-half reliability (r(2136) = 0.67, p < 0.001, 95% CI = 0.64–0.69 for 80:20 and r(2136) = 0.63, p < 0.001, 95% CI = 0.61–0.66 for 70:30, Table 2).

Table 2 The impact of different task modifications on mean-level model-based estimates and its reliability.
Fig. 4: Impact of task modifications on model-based estimates and its association with compulsivity and individual differences.

a Distribution and mean value of model-based scores across (i) transition ratio, (ii) difficulty/order, and (iii) reward drift set. Mean model-based scores were larger when estimated from an 80:20 transition ratio structure compared to 70:30, in the Easy/1st block compared to the Medium/2nd block, and in those who experienced drift set B compared to drift set A. b Model-based associations with compulsivity (CIT) and individual differences (age, gender and education ('Less Edu')) across (i) transition ratio, (ii) difficulty/order, and (iii) reward drift set. The association between greater self-reported compulsivity and MBI was stronger when estimated from the 70:30 transition ratio structure and using drift set A. Error bars reflect standard errors of the mean. *p < 0.05; **p < 0.01; ***p < 0.001.

From this sample, N = 2071 participants (80:20: N = 1020, 70:30: N = 1051) had compulsivity scores and demographic information available. A significant negative association between MBI and CIT was observed in the 70:30 group (β = −0.05, 95% CI = [−0.07, −0.03], SE = 0.01, p < 0.001), but there was no statistically significant evidence for the same relationship in the 80:20 group (β = −0.01, 95% CI = [−0.03, 0.01], SE = 0.01, p = 0.374) (Fig. 4b(i); Supplementary Table 11). This difference was confirmed by entering Transition Ratio (80:20, 70:30) into the model, where we found a significant CIT x Transition Ratio interaction (β = −0.02, 95% CI = [−0.03, −0.00], SE = 0.01, p = 0.022, Supplementary Table 12). In terms of the other individual differences, we found a significant Gender x Transition Ratio interaction in the full comparison model (β = 0.02, 95% CI = [0.01, 0.04], SE = 0.01, p = 0.009, Supplementary Table 12). This was driven by a negative association between MBI and gender in the 80:20 group (β = −0.04, 95% CI = [−0.06, −0.02], SE = 0.01, p = 0.001, Supplementary Table 11), which was not statistically significant in the 70:30 group (β = 0.00, 95% CI = [−0.02, 0.03], SE = 0.01, p = 0.689, Supplementary Table 11). We found no statistically significant evidence that transition ratio affects the relationship between MBI and age (β = −0.00, 95% CI = [−0.02, 0.01], SE = 0.01, p = 0.658) or education (β = 0.00, 95% CI = [−0.01, 0.02], SE = 0.01, p = 0.874) (Supplementary Table 12).

Difficulty and Order Effects

Difficulty in Cannon Blast referred to how challenging the diamond shooting task was on a given trial, with harder levels requiring more timing and spatial reasoning to hit the target than easy trials (where the diamond was unobstructed or static). We hypothesised that this would put a strain on the cognitive resources required for maintaining an accurate model of the task environment and therefore impede model-based planning. As part of their first play of Cannon Blast during the Risk Factors section, all participants completed two blocks of trials: a block of 100 Easy difficulty trials (Easy-1st Block), followed by a block of 100 Medium difficulty trials (Medium-2nd Block). In each of the two blocks (Easy-1st Block, Medium-2nd Block), we observed main effects of Reward, Transition, and their interaction (Supplementary Table 7). However, we found MBI estimates were significantly larger in the Easy-1st (M = 0.30, SD = 0.32) than the Medium-2nd (M = 0.22, SD = 0.29) block, t(5004) = 20.16, p < 0.001 (Table 2, Fig. 4a(ii)). Because difficulty and order are confounded with one another, we cannot confirm if this reduction was driven by increased task difficulty, order, or both. To isolate the effect of difficulty from order, we turned to data from the Free Play section of the app, where a subset of participants re-engaged with Cannon Blast at a chosen difficulty level for short games of 100 trials. We compared their initial play (first play) at each difficulty level (Easy, Medium) with their next play (second play) at that difficulty. In N = 689 participants who had completed two sessions of Cannon Blast at Easy difficulty, we found MBI was greater at their first play (M = 0.32, SD = 0.33) compared to their second (M = 0.21, SD = 0.23), t(688) = 8.91, p < 0.001. Importantly, we did not find this difference when we repeated this analysis in the N = 556 who had two sessions of play at Medium difficulty (first play (M = 0.24, SD = 0.33); second play (M = 0.25, SD = 0.33), t(555) = −0.59, p = 0.552). This suggests that our difficulty manipulation did not affect model-based planning, but that there was an effect of order – MBI estimates were greatest during the first play of Cannon Blast, and reduced and stabilised across subsequent sessions. To confirm this, we compared model-based estimates in N = 335 who had sessions at both difficulties (Easy, Medium) and at two time points (first play, second play). We found MBI at Easy-first play (M = 0.33, SD = 0.35) was significantly larger than at all other time points, including Medium-first play (M = 0.25, SD = 0.34, p = 0.002), Easy-second play (M = 0.22, SD = 0.23, p < 0.001), and Medium-second play (M = 0.25, SD = 0.32, p = 0.002). More evidence for the uniqueness of the first block comes from an analysis of the test-retest reliability of task sections. MBI estimated from the Easy-1st Block of the game was weakly associated with participants' second session at Easy (N = 689, ICC1 = 0.35). The same analysis of Medium plays yielded higher reliability (N = 556, ICC1 = 0.51) (Supplementary Table 13). In terms of reliability within each block, we found comparable split-half reliability estimates (Easy-1st Block: r(5003) = 0.72, p < 0.001, 95% CI = 0.70–0.74; Medium-2nd Block: r(5003) = 0.68, p < 0.001, 95% CI = 0.66–0.70, Table 2).

Having established mean-level differences across the blocks of the task, we next tested if the observed relationships with compulsivity, age, gender, and education varied across these blocks (Easy-1st Block, Medium-2nd Block). In each block, we found deficits in MBI were associated with greater compulsivity, older age, and less education (Fig. 4b(ii); Supplementary Table 11). In the full comparison model, where Difficulty/Order (Easy-1st Block, Medium-2nd Block) was entered into the model, we found no statistically significant evidence that Difficulty/Order impacts the association between MBI and compulsivity (β = 0.00, 95% CI = [−0.01, 0.01], SE = 0.00, p = 0.476). There were significant interactions between Age x Difficulty/Order (β = 0.01, 95% CI = [0.00, 0.02], SE = 0.00, p = 0.017) and Education x Difficulty/Order (β = 0.01, 95% CI = [0.00, 0.01], SE = 0.00, p = 0.032), such that the associations between MBI and age, and between MBI and education, were stronger when estimated from the Easy-1st Block (Supplementary Table 12).

Reward Drifts

At each block, participants were randomly assigned to one of two drift sets (Drift A, Drift B). It is important to note that these drifts differed in more than one dimension (Fig. 1d). Descriptively, Drift A included two relatively stable reward probabilities (SD: purple = 0.053, pink = 0.049) and relatively high reward rates (purple = 0.845, pink = 0.748), with purple balls outperforming pink, though at times there was very little evidence to distinguish the most rewarding ball colour. Drift B, in contrast, had lower overall reward rates (purple = 0.774, pink = 0.501). Purple and pink balls started out with a similar level of reinforcement, and throughout the 100 trials the value of purple steadily increased, while pink remained low (close to 0.5). Participants were randomly assigned their drifts independently across blocks, creating four possible reward drift combinations (A-A, A-B, B-A and B-B) for their first 200 trials of the game. To avoid washing out effects in the mixed conditions (A-B, B-A), we compared A and B in the 1st Block only (100 trials). N = 2139 were randomly assigned Drift A and N = 2345 Drift B. In both drift sets, we observed main effects of Reward, Transition, and their interaction (Supplementary Table 7), but MBI estimates were significantly larger in those who received Drift B (M = 0.33, SD = 0.33) compared to Drift A (M = 0.23, SD = 0.31), t(4475) = −5.33, p < 0.001 (Table 2, Fig. 4a(iii)).

N = 1135 in the Drift A group and N = 1234 in the Drift B group had compulsivity scores and demographic information. We found a negative association between MBI and compulsivity when estimated in those who experienced Drift A (β = −0.05, 95% CI = [−0.07, −0.03], SE = 0.01, p < 0.001, Supplementary Table 11), but no statistically significant evidence for this association when estimated from Drift B (β = −0.01, 95% CI = [−0.03, 0.01], SE = 0.01, p = 0.540). In a full comparison model where Drift (Drift A, Drift B) was entered into the model, this difference was confirmed by a significant CIT x Drift interaction (β = 0.02, 95% CI = [0.01, 0.04], SE = 0.01, p = 0.002, Supplementary Table 12). We also observed a significant Age x Drift interaction (β = 0.02, 95% CI = [0.00, 0.03], SE = 0.01, p = 0.024, Supplementary Table 12), with the association between age and MBI stronger when estimated from those in Drift A.

Trial Number

To test the impact of increasing the total number of trials used to estimate MBI per participant, we generated each participant's MBI several times, starting with a participant's first 25 trials and increasing by 25 trials in each iteration, until reaching 300 trials per participant. This was done in the sub-sample of participants who had completed 300 trials of Cannon Blast (N = 716). In all cumulative trial number sets, we observed main effects of Reward, Transition, and their interaction (Supplementary Table 15). We found an overall reduction in MBI as trial number increased (β = −0.02, 95% CI = [−0.04, 0.00], SE = 0.01, p = 0.004, Fig. 5a(i), Supplementary Table 16). As expected, increasing the amount of data collected per participant improved the internal consistency of MBI estimates; reliability at 25 trials was r(714) = 0.41, p < 0.001, 95% CI = 0.35−0.47 and increased to r(714) = 0.71, p < 0.001, 95% CI = 0.68–0.75 at 300 trials (Fig. 5b(i); Supplementary Table 14). However, in terms of external validity, there was no statistically significant evidence that the association between MBI and compulsivity was impacted by additional trials (CIT x Trial Number: β = −0.00, 95% CI = [−0.00, 0.00], SE = 0.00, p = 0.819, Fig. 5c(i), Supplementary Table 17); the association was significant when estimated with as few as 25 trials (β = −0.04, 95% CI = [−0.07, −0.02], SE = 0.01, p = 0.002, Supplementary Table 16). There were significant interactions between Age x Trial Number (β = −0.01, 95% CI = [−0.00, −0.00], SE = 0.00, p = 0.044), Gender x Trial Number (β = 0.00, 95% CI = [0.00, 0.00], SE = 0.01, p < 0.001) and Education x Trial Number (β = 0.00, 95% CI = [−0.00, 0.00], SE = 0.01, p = 0.042) (Supplementary Table 17). The associations between MBI and both age and education became stronger with the addition of more trials, while the association between gender and MBI reduced.
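
A schematic of this cumulative estimation procedure in R is shown below; it is a sketch under assumptions rather than the analysis script itself. It assumes the hypothetical trial-level data frame d and the glmer formulation sketched in the Methods, with a per-participant trial index trial_num.

# Re-fit the HLR on the first n trials per participant, for n = 25, 50, ..., 300,
# extracting the per-participant MBI at each cut-off
fit_hlr <- function(dat) {
  lme4::glmer(stay ~ prev_reward * prev_transition +
                (1 + prev_reward * prev_transition | subj),
              data = dat, family = binomial)
}

trial_cuts <- seq(25, 300, by = 25)
mbi_by_cut <- lapply(trial_cuts, function(n) {
  m <- fit_hlr(subset(d, trial_num <= n))
  coef(m)$subj[, "prev_reward:prev_transition"]
})
names(mbi_by_cut) <- paste0("trials_", trial_cuts)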

Fig. 5: The impact of increasing number of trials on mean-level model-based estimates, its reliability and their association with individual differences (N = 716).

a Mean MBI estimated with (i) cumulative trials, i.e., increasing trial numbers sequentially by 25 at a time, until 300 trials per participant was reached, and (ii) binned trials, i.e., bins of 25 trials sampled sequentially through the task, in chronological order. Mean-level MBI decreased with more trials. Error bars reflect standard errors of the mean. b Split-half reliability coefficient as a function of increasing trial number using (i) cumulative trials and (ii) binned trials. Using the cumulative trials, reliability of the MB estimate increased with additional trials collected per participant. Error bars reflect 95% confidence intervals. c Model-based associations with compulsivity, age, gender and education using (i) cumulative trials and (ii) binned trials. Increasing the number of trials per participant did not significantly increase the association between model-based planning and individual differences. No statistically significant difference was found between model-based associations with compulsivity when estimated at participants' first 25 vs. 300 trials. Model-based associations with age and education became stronger with the addition of trials, while the association between model-based planning and gender reduced. Data from the binned analysis demonstrated that this effect is in part driven by stronger signal in earlier versus later trials. d We repeated this analysis using a publicly available dataset of N = 1413 individuals who completed the traditional two-step task (200 trials). Here we found (i) the association between model-based planning and individual differences increased as trials collected per participant increased and (ii) this effect was driven by later trials compared to earlier trials. For c and d, error bars reflect standard errors of the mean.

Next, we wanted to test if any specific collection of trials was driving these effects. To do this, we repeated the above analyses using bins of 25 trials (0–25, 26–50, etc.). Similar to the cumulative trials, we found an overall reduction in MBI for later trial bins (Fig. 5a(ii)). Reliability estimates were relatively stable across the bins of trials (Fig. 5b(ii)). However, the association between MBI and CIT reduced as the task progressed (β = 0.002, 95% CI = [0.00, 0.00], SE = 0.00, p = 0.010, Fig. 5c(ii)). To test if this pattern is specific to our task, we carried out the same analysis on data from the traditional task in a previously published study (N = 1413 who completed the traditional two-step task remotely online)3. In the traditional task, participants must learn the transition probabilities through trial and error (unlike our task, where they are shown on-screen throughout), so it follows that the early trials should not be informative for assessing model-based planning. Indeed, this was the case: in the traditional task, associations with CIT increased with more trials when examined cumulatively (Fig. 5d(i)), were absent from the first two bins of trials, and only became significant later (Fig. 5d(ii)).

Discussion

We developed a smartphone-based diamond-shooting game capable of assessing model-based planning in unconstrained, out-of-the-lab settings. We demonstrated that estimates derived from the game are valid by replicating previously established correlates of model-based planning, age24, gender3, education (similar to previous work on IQ3 and processing speed25) and compulsivity3,4,5,7,15, in a large sample of citizen scientists, and by demonstrating psychometrics comparable to the traditional task. We then used this task to tackle a question posed by recent studies11,12,13,14,15,41: are associations between compulsivity and model-based planning dependent on key aspects of the task’s design? In a series of within- and between-participant experiments, we found that some task parameters (i.e., transition and reward drifts) are associated with better capture of individual differences. The association between model-based planning and compulsivity was greater when using a less deterministic transition structure (70:30 compared to 80:20), and it also depended on the specific nature of the drifting reward probabilities used. We additionally showed that previously established effects could be observed in as few as 25 trials, and that increasing trial numbers had little impact. We did not find that the association between model-based planning and compulsivity differed significantly across manipulations of task difficulty. However, this could be a limitation of our task design, which used a relatively weak dual-task manipulation. Specifically, difficulty in this context referred to the extra demands placed on participants to shoot the diamond on harder levels, which required taking into account the geometry of the screen and timing the shot. Perhaps crucially, this demand occurs after the model-based update has taken place, when participants have presumably already decided which container to fire from. A more potent dual-task demand would impact the update itself10.

Previous findings suggest that model-based planning problems in compulsivity arise from issues in building accurate mental models of action-outcome contingencies7, which may stem from problems with learning probabilistic action-state transitions8 and with assigning credit to actions42. Recent work extended this idea, finding that differences between OCD patients and controls are largest when instructions are absent and individuals must learn the structure of the task from scratch14. In our task, the relationship between an action (choosing left or right) and the resulting state (ball colour) was visible on-screen throughout the game as the proportion of balls displayed in each container. Despite this, we found the characteristic reduction in model-based planning in compulsive individuals. This suggests that the mechanisms underlying this relationship reflect more than a failure to learn about the statistical properties of action-state transitions through experience. One possibility is that in compulsivity, broader issues in executive function might cause individuals to simplify tasks to avoid overloading working memory or other finite cognitive resources43. This notion is supported by the present finding that the association between model-based planning and compulsivity is greater when the action-state transitions are more uncertain, i.e., less deterministic (70:30), and model-based planning is more effortful to employ (i.e., requires more win-switch and lose-stay actions).

In our task, the drift showing the stronger association between model-based planning and compulsivity had higher overall reward rates, a fairly stable time course and relatively low distinguishability between the two second-stage states, compared to a drift with lower overall reward rates and higher distinguishability between states. Relatedly, it could be that the cost-benefit trade-off of engaging in model-based behaviour is not worthwhile in environments where there is no noticeable difference in the reward magnitudes of the outcomes. However, because the drifts used differed on more than one dimension, future research is needed to understand and further optimise the selection of drifts. This finding is nonetheless important, as studies often vary in the drifts employed with no established best practice. One suggestion is to revisit these analyses using all ten possible drift sets available in the app, once we have collected a sufficient amount of data per drift set. Any candidate drift properties (e.g., boundaries, variance, drift rate) that affect the relationship between model-based planning and compulsivity can later be confirmed more systematically in new drift sequences optimised to maximally compare the relevant dimensions.

In many areas of psychology, collecting more data per participant can improve the reliability, and therefore the strength, of associations with individual difference measures44. However, it is important to note that the relationship between trial number and reliability is non-linear; reliability improves steeply until it reaches asymptote21,45. In line with this, we observed that the split-half reliability of model-based estimates increased with the addition of trials but began to taper off at around 125 trials. A second issue when considering trial number is that additional trials can change the nature of the measurement, not just its reliability. Depending on the design, behaviour on tasks that involve some element of learning can signal different processes at different stages of the task. That is, the initial trials, where a participant explores and becomes familiar with the task, can reflect a different cognitive process than later trials, when a participant is updating and executing what they have learned. This is consistent with what we observed in the present study; model-based planning was greatest during participants’ first block of trials and decreased steadily over time, and test-retest estimates were lower for the first play than the second. The learning that takes place in the first trials of the task may mean that those trials carry maximal variation in the process of interest, so, counter-intuitively, additional trials could in fact reduce the magnitude of group differences. This finding is important for future studies considering the use of tasks with elements of reinforcement learning measured over multiple time points, for example pre- and post-treatment. To make these sessions more homogeneous, longer intervals between sessions may be required to restore the initial block effect. In the present study, we found that additional trials did not affect the strength of the observed relationship with compulsivity, which was apparent with just 25 trials of data. More trials did, however, modestly improve the effect sizes for the links between MBI and age and education, but this was mostly driven by gains from 25 to 100 trials. This early signal in the task stands in contrast to the traditional version, where we show that early trials do not carry much individual difference signal. We speculate that this difference arises because in the traditional task participants must learn the transition probabilities through trial and error (and so model-based planning cannot be executed from trial 1), whereas in Cannon Blast the probabilities are displayed on-screen. This simple change in design may have major implications for future studies aiming to reduce the trials required to estimate model-based planning.
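To illustrate why reliability gains taper off, a minimal sketch of the Spearman-Brown prophecy formula is shown below. The starting value of 0.41 is the observed 25-trial split-half reliability reported above; the projected values are illustrative only, since the formula assumes all trials are interchangeable ("parallel").

```python
# Illustrative only: Spearman-Brown prophecy formula, which predicts how
# reliability grows when a measure is lengthened by a factor k.
def spearman_brown(r_base: float, k: float) -> float:
    return (k * r_base) / (1 + (k - 1) * r_base)

r_25 = 0.41  # observed split-half reliability at 25 trials (see Results)
for n_trials in (25, 50, 100, 125, 200, 300):
    print(n_trials, round(spearman_brown(r_25, n_trials / 25), 2))
# -> 0.41, 0.58, 0.74, 0.78, 0.85, 0.89
```

Under this parallel-trials assumption, the predicted reliability at 300 trials (~0.89) exceeds the observed value of 0.71, consistent with the possibility raised above that later trials are not simply more of the same measurement.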

We found that model-free estimates showed the same individual difference and clinical associations as model-based planning. This is in line with a prior study that also observed reduced model-free learning in compulsivity5, but it is important to note that the majority of studies have not observed this3,4,6. In fact, conceptually, one might expect model-free behaviour to be enhanced in compulsivity, given the popular framework that describes their trade-off1. This apparent inconsistency follows a growing literature that has raised doubts about what model-free estimates truly represent. Prior work has shown that model-free learning is not related to the gold-standard index of habit, performance on a devaluation test9, and, as we report here, studies consistently show a positive correlation between model-based and model-free estimates3,7. Recent work suggests that model-free behaviour on this task may be better understood as a different form of model-based choice that we fail to model accurately13. In line with this, Konovalov and Krajbich46 found that model-free participants made more fixations prior to choice, indicating choice deliberation rather than habitual selection. In the present study, we embedded the classic two-step task within a game; that is, we made model-free and model-based behaviour operate in the service of a reward (gaining good balls), which was itself only valuable insofar as it served the higher-order goal of shooting diamonds. This meant that model-free behaviour was no longer the lowest level of engagement with the task, and it is therefore possible that those with the lowest cognitive capacity focused exclusively on the primary task of shooting the diamonds, adding new model-based signal to the model-free estimate. Indeed, the cross-task correlations in the present study support this idea, as model-free measures from Cannon Blast were positively associated with both model-free and model-based estimates from the traditional task.

There is increasing evidence that true correlations between cognitive test performance and individual differences in mental health are small. This is increasingly recognised across levels of analysis in psychiatry, such as neuroimaging47, studies of environmental risk48, and genetics49. Small effects can nonetheless have a big impact50 if tackled at a population level, from a public health perspective. But in order to estimate these effects and interpret them accurately, we need larger samples in our research studies. One way to achieve this is by taking our assessments out of the lab and into daily life through gamification and smartphones51,52. Our results add to the growing evidence that smartphones can deliver valid cognitive test data with clinical implications53,54,55. In this paper, we emphasise another advantage of smartphone science in the cognitive space: it facilitates A/B testing, allowing us to systematically improve the psychometric properties of our tasks and optimise them for specific clinical populations18. Another key advantage of this method is that it leverages citizen scientists rather than relying exclusively on university students. This results in a more diverse sample and makes research more accessible to those living farther from research centres or unable to attend during working hours. Other forms of remote assessment, such as Amazon’s Mechanical Turk (MTurk), connect users to studies for payment. However, these platforms have recently faced increasing issues with data quality56,57. In citizen science, the incentives of researcher and participant may be better aligned, which may have a positive impact on data quality58.

Limitations

This research is not without limitations. Taking cognitive and clinical assessments out of controlled laboratory settings and into the noisy real world can introduce concerns related to data quality. For example, task performance and self-report scores may be affected by lapses in attention, distractions or careless responding, and there is a risk that this creates spurious or inflated associations between cognitive performance and mental health symptomatology59. To mitigate this, participants in our study are not tied to specific time constraints, are free to engage with the assessments at a time that suits them and to stop the game if interrupted, and are not financially induced to participate. A second limitation concerns sampling bias; smartphone science in some respects helps us to tackle issues with small and unrepresentative samples in psychology research, but it too comes with its own biases. Participants in this study self-selected to engage with this research and did so without remuneration. They may carry characteristics that are not representative of the broader population, particularly compared to those with low digital literacy or without internet access.

Conclusions

In a brief smartphone game in which participants shot at diamonds, we replicated robust associations with socio-demographics and compulsivity at scale, and demonstrated canonical effects of model-based behaviour with psychometric properties similar to the traditional version of the task. Overall, we present evidence that smartphone science opens the door to data-driven task optimisation, increasing the potential for these tasks to be translated into clinical decision tools in the future, bringing research into practice.