A TUTORIAL ON CONDUCTING AND INTERPRETING A BAYESIAN ANOVA IN JASP

Analysis of variance (ANOVA) is the standard procedure for statistical inference in factorial designs. Typically, ANOVAs are executed using frequentist statistics, where p-values determine statistical significance in an all-or-none fashion. In recent years, the Bayesian approach to statistics has increasingly been viewed as a legitimate alternative to the p-value. However, the broad adoption of Bayesian statistics—and Bayesian ANOVA in particular—is frustrated by the fact that Bayesian concepts are rarely taught in applied statistics courses. Consequently, practitioners may be unsure how to conduct a Bayesian ANOVA and interpret the results. Here we provide a guide for executing and interpreting a Bayesian ANOVA with JASP, an open-source statistical software program with a graphical user interface. We explain the key concepts of the Bayesian ANOVA using two empirical examples.


INTRODUCTION
Ubiquitous across the empirical sciences, analysis of variance (ANOVA) allows researchers to assess the effects of categorical predictors on a continuous outcome variable. Consider for instance an experiment by Strack, Martin, & Stepper (1988) designed to test the facial feedback hypothesis, that is, the hypothesis that people's affective responses can be influenced by their own facial expression. Participants were randomly assigned to one of three conditions. In the lips condition, participants were instructed to hold a pen with their lips, inducing a pout. In the teeth condition, participants were instructed to hold a pen between their teeth, inducing a smile. In the control condition, participants were told to hold a pen in their nondominant hand. With the pen in the instructed position, each participant then rated four cartoons for funniness. The outcome variable was the average funniness rating across the four cartoons. The ANOVA procedure may be used to test the null hypothesis that the pen position does not result in different funniness ratings.
ANOVAs are typically conducted using frequentist statistics, where p-values decide statistical significance in an all-or-none manner: if p < .05, the result is deemed statistically significant and the null hypothesis is rejected; if p > .05, the result is deemed statistically nonsignificant and the null hypothesis is retained. Such binary thinking has been critiqued extensively (e.g., Amrhein, Greenland, & McShane, 2019; Cohen, 1994; Rouder, Engelhardt, McCabe, & Morey, 2016), and some perceive it as a cause of the reproducibility crisis in psychology (Cumming, 2014; but see Savalei & Dunn, 2015). In recent years, several alternatives to p-values have been suggested, for example reporting confidence intervals (Cumming, 2014; Gardner & Altman, 1986) or abandoning null hypothesis testing altogether (B. B. McShane, Gal, Gelman, Robert, & Tackett, 2019).
Here we focus on another alternative: Bayesian inference. In the Bayesian framework, knowledge about parameters and hypotheses is updated as a function of predictive success: hypotheses that predicted the observed data relatively well receive a boost in credibility, whereas hypotheses that predicted the data relatively poorly suffer a decline. A series of recent articles shows how the Bayesian framework can supplement or supplant the frequentist p-value (e.g., Burton, Gurrin, & Campbell, 1998; Dienes & McLatchie, 2018; Jarosz & Wiley, 2014; Masson, 2011; Nathoo & Masson, 2016). The advantages of the Bayesian paradigm over the frequentist p-value are well documented (e.g., Wagenmakers et al., 2018); for instance, with Bayesian inference researchers can incorporate prior knowledge and quantify support both in favor of and against the null hypothesis; furthermore, this support may be monitored as the data accumulate (Stefan, Gronau, Schönbrodt, & Wagenmakers, 2019). Despite these and other advantages, Bayesian analyses are still used only sparingly in the social sciences (van de Schoot, Winter, Ryan, Zondervan-Zwijnenburg, & Depaoli, 2017). The broad adoption of Bayesian statistics—and Bayesian ANOVA in particular—is hindered by the fact that Bayesian concepts are rarely taught in applied statistics courses. Consequently, practitioners may be unsure of how to conduct a Bayesian ANOVA and interpret the results.
To help familiarize researchers with Bayesian inference for common experimental designs, this article provides a guide for conducting and interpreting a Bayesian ANOVA with JASP (JASP Team, 2019). JASP is a free, open-source statistical software program with a graphical user interface that offers both Bayesian and frequentist analyses. Below, we first provide a brief introduction to Bayesian statistics. Subsequently, we use two data examples to explain the key concepts of ANOVA.

BAYESIAN FOUNDATIONS
This section explains some of the fundamentals of Bayesian inference. We focus on interpretation rather than mathematical detail; see the special issue on Bayesian inference by Vandekerckhove, Rouder, & Kruschke (2018) for a set of comprehensive, low-level introductions to Bayesian inference. The central goal of Bayesian inference is learning, that is, using observations to update knowledge. In an ANOVA we want to learn about the candidate models M and their condition-effect parameters β. Returning to the example of the facial feedback experiment, we commonly specify two models. The null model describes the funniness ratings using a single grand average across all three conditions, effectively stating that there is no effect of pen position. The parameters of the null model are thus the average funniness rating and the error variance. The alternative model describes the funniness ratings using an overall average and the effect of pen position; in other words, the means of the three conditions are allowed to differ. The alternative model therefore has five parameters: the average funniness rating across participants, the error variance, and, for each of the three pen positions, the magnitude of the effect. To start the learning process, we need to specify prior beliefs about the plausibility of each model, p(M), and about the plausible parameter values β within each model, p(β | M). These prior beliefs are represented by prior distributions. Observing data D drives an update of beliefs, transforming the prior distribution over models and parameters into a joint posterior distribution, denoted p(β, M | D). The updating factor—the change from prior to posterior beliefs—is determined by relative predictive performance for the observed data. As shown in Figure 1, the knowledge updating process forms a learning cycle, such that the posterior distribution after the first batch of data becomes the prior distribution for the next batch.
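The prior-to-posterior learning cycle can be sketched with a toy conjugate example. This is a beta-binomial model rather than the ANOVA model itself, and all numbers are made up; the point is only that the posterior after one batch of data serves as the prior for the next.

```python
# Conjugate beta-binomial updating: a minimal illustration of the learning
# cycle (not the ANOVA model). The prior Beta(a, b) is updated by observed
# successes and failures; yesterday's posterior is today's prior.
def update(a, b, successes, failures):
    """Return the Beta posterior parameters after observing the data."""
    return a + successes, b + failures

# Start with a uniform prior Beta(1, 1).
a, b = 1, 1
# Batch 1: 7 successes, 3 failures.
a, b = update(a, b, 7, 3)
# Batch 2: 2 successes, 8 failures -- the batch-1 posterior is the prior here.
a, b = update(a, b, 2, 8)

posterior_mean = a / (a + b)
print(a, b, posterior_mean)  # Beta(10, 12), mean = 10/22
```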
Figure 1. Bayesian learning can be conceptualized as a cyclical process of updating knowledge in response to prediction errors. The prediction step is deductive, and the updating step is inductive. For a detailed account see Jevons (1874/1913, Chapters XI and XII).

Mathematically, the updating process is given by Bayes' rule:

p(β, M | D) = p(D | β, M) × p(β | M) × p(M) / p(D).

This rule stipulates how knowledge about the relative plausibility of both models and parameters ought to be updated in light of the observed data. When the focus is on the comparison of two rival models, one generally considers only the model updating term. This term, commonly known as the Bayes factor, quantifies the relative predictive performance of the rival models, that is, the change in relative model plausibility that is brought about by the data (Etz & Wagenmakers, 2017; Jeffreys, 1939; Kass & Raftery, 1995; Wrinch & Jeffreys, 1921):

p(M1 | D) / p(M0 | D) = BF10 × p(M1) / p(M0), where BF10 = p(D | M1) / p(D | M0).

When the Bayes factor BF10 equals 20, the observed data are twenty times more likely to occur under M1 than under M0 (i.e., support for M1 versus M0); when the Bayes factor BF10 equals 1/20, the observed data are twenty times more likely to occur under M0 than under M1 (i.e., support for M0 versus M1); and when the Bayes factor BF10 equals 1, the observed data are equally likely to occur under both models (i.e., neither model is supported over the other). Note that the Bayes factor is a comparison of two models and hence is always a relative measure of evidence; that is, it quantifies the performance of one model relative to another. Since the prior and posterior odds are both ratios of probabilities, the Bayes factor can be seen as an odds ratio that quantifies the change in belief from prior odds to posterior odds. The Bayes factor can be presented as BF10, p(D | M1) divided by p(D | M0), or as its reciprocal BF01, p(D | M0) over p(D | M1).
Typically, BF10 > 1 is used to quantify evidence in favor of the alternative hypothesis, whereas BF01 > 1 is used to quantify evidence in favor of the null hypothesis. For instance, BF10 = 1/3 can be interpreted as "the data are 1/3 times as likely under M1 as under M0", but for a Bayes factor lower than 1 it is more intuitive to switch numerator and denominator and instead report the result as BF01 = 3, that is, "the data are 3 times more likely under M0 than under M1".
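The odds arithmetic above can be sketched in a few lines; the marginal likelihood values below are hypothetical, chosen only to produce a Bayes factor of 20:

```python
# Posterior odds = Bayes factor x prior odds; BF01 is the reciprocal of BF10.
# Hypothetical marginal likelihoods (not from the paper):
p_data_M1 = 0.008   # p(D | M1)
p_data_M0 = 0.0004  # p(D | M0)

BF10 = p_data_M1 / p_data_M0   # evidence for M1 over M0
BF01 = 1 / BF10                # evidence for M0 over M1

prior_odds = 1.0               # M1 and M0 considered equally plausible a priori
posterior_odds = BF10 * prior_odds
print(BF10, BF01, posterior_odds)
```

With equal prior odds, the posterior odds simply equal the Bayes factor; with unequal prior odds they would shift accordingly.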
The Bayesian paradigm differs from the frequentist paradigm in at least four key aspects. First, evidence in favor of a particular model, quantified by a Bayes factor, is a continuous measure of support. Unlike the frequentist Neyman-Pearson decision rule (usually p < .05), there is no need to impose all-or-none Bayes factor cut-offs for accepting or rejecting a particular model. Moreover, the Bayes factor can discriminate between "absence of evidence" (i.e., nondiagnostic data that are predicted about equally well under both models, such that the Bayes factor is close to 1) and "evidence of absence" (i.e., diagnostic data that support the null hypothesis over the alternative hypothesis). (For a cartoon that explains the strength of evidence provided by a Bayes factor, see https://www.bayesianspectacles.org/lets-poke-a-pizza-a-new-cartoon-to-explain-the-strength-of-evidence-in-a-bayesfactor/.)
A second difference is that, in the Bayesian paradigm, knowledge about models M and parameters β is updated simultaneously. Consequently, it is natural to account for model uncertainty by considering all models, but assigning more weight to those models that predicted the data relatively well. This procedure is known as Bayesian model averaging (BMA; Hinne, Gronau, van den Bergh, & Wagenmakers, 2020; Hoeting, Madigan, Raftery, & Volinsky, 1999; Jevons, 1874/1913, p. 296; Jeffreys, 1939, p. 365; Jeffreys, 1961). In contrast, many frequentist analyses first select a 'best' model and subsequently estimate its parameters, thereby neglecting model uncertainty and producing overconfident conclusions (Claeskens & Hjort, 2008, Ch. 7.4). Another benefit of BMA is that point estimates and uncertainty intervals can be derived without conditioning on a specific model. This way, model uncertainty is accounted for in point estimates and uncertainty intervals.
A third difference is that Bayesian posterior distributions allow for direct probabilistic statements about parameters. For example, based on the posterior distribution of β we can state that we are 95% confident that the parameter lies between x and y. This range of parameter values is commonly known as a 95% credible interval. Similarly, we can consider any interval from a to b and quantify our confidence that the parameter falls in that specific range.

A fourth difference is that Bayesian inference automatically penalizes for complexity and thus favors parsimony (e.g., Berger & Jefferys, 1992; Jeffreys, 1961; Myung & Pitt, 1997). For instance, a model with a redundant covariate will make poor predictions. Consequently, the Bayes factor, which compares the relative predictive performance of two models, will favor the model without the redundant predictor over the model with the redundant predictor. Key is that, as the words suggest, the predictive performance is assessed using parameter values that are drawn from the prior distributions.
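A central 95% credible interval can be approximated from posterior draws by taking sample quantiles. The sketch below uses hypothetical Gaussian draws standing in for a posterior distribution:

```python
import random

random.seed(1)

# Approximate a 95% central credible interval from posterior draws by taking
# the 2.5% and 97.5% sample quantiles (hypothetical Gaussian posterior with
# mean 0.3 and standard deviation 0.1).
draws = sorted(random.gauss(0.3, 0.1) for _ in range(100_000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
print(lower, upper)  # roughly 0.10 and 0.50
```

The same draws can be used to quantify the probability that the parameter falls in any interval [a, b], simply by counting the fraction of draws inside it.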

ANOVA
Traditionally, analysis of variance involves—as the name suggests—a comparison of variances. In the frequentist framework, the variance between the levels of the categorical predictor is compared to the variance within the levels of the categorical predictor. When the categorical predictor has no effect, the population variance between the levels equals the population variance within the levels, and the sample ratio of these variances follows a central F-distribution. Under the assumption that the null hypothesis is true, we may then calculate the probability of encountering a sample ratio of variances at least as large as the one observed; this yields the much-maligned yet omnipresent p-value.
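For reference, the frequentist F-test just described can be run in a few lines. The funniness ratings below are hypothetical stand-ins for the three pen-position conditions, using scipy's `f_oneway`:

```python
from scipy import stats

# One-way frequentist ANOVA on three hypothetical condition samples:
# the F statistic is the between-level variability over the within-level
# variability, referred to a central F distribution under the null hypothesis.
lips    = [4.3, 5.1, 3.8, 4.4, 5.0]
teeth   = [5.9, 6.1, 5.4, 6.3, 5.8]
control = [4.9, 5.2, 5.0, 4.6, 5.3]

F, p = stats.f_oneway(lips, teeth, control)
print(F, p)  # a large F and small p suggest the condition means differ
```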
Instead, the Bayesian ANOVA contrasts the predictive performance of competing models. In order to make predictions, the model parameters need to be assigned prior distributions. These prior distributions could in principle be specified from subjective background knowledge, but here we follow Rouder, Morey, Speckman, & Province (2012) and use a default specification inspired by linear regression models, designed to meet general desiderata such as consistency and scale invariance (i.e., it does not matter whether the outcome variable is measured in seconds or milliseconds; see also Bayarri, Berger, Forte, & García-Donato, 2012; Liang, Paulo, Molina, Clyde, & Berger, 2008).

ASSUMPTIONS
Before interpreting the results from an ANOVA, it is prudent to assess whether its main assumption holds, namely that the residuals are normally distributed. A common tool to assess the normality of the residuals is a Q-Q plot, which visualizes the quantiles of the observed residuals against the quantiles expected from a standard normal distribution. If the residuals are normally distributed, then all the points in a Q-Q plot fall on the red line in Figure 2. In contrast to a frequentist ANOVA, where the residuals are point estimates, a Bayesian ANOVA provides a probability distribution for each residual. The uncertainty in the residuals can thus be summarized by 95% credible intervals. The left panel of Figure 2 shows an example where the larger quantiles lie away from the red line, displaying a substantial violation of the normality assumption.

Introductory texts discuss additional ANOVA assumptions, most of which follow directly from the normality of the residuals. For some of these assumptions, violations can be difficult to detect visually in a Q-Q plot. An example is sphericity, which is specific to repeated measures ANOVA. One definition of sphericity is that the variance of all pairwise difference scores is equal. In the frequentist paradigm, this assumption is usually assessed using Mauchly's test (but see Tijmstra, 2018). Another example is homogeneity of variances, which implies that the residual variance is equal across all levels of the predictors. Homogeneity of variances can be assessed using Levene's test (Levene, 1961).
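A Q-Q check can also be carried out numerically, by correlating the sorted standardized residuals with the corresponding standard-normal quantiles. The sketch below uses simulated residuals; with normally distributed residuals the correlation is close to 1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Q-Q comparison of observed residuals against standard-normal quantiles
# (simulated residuals stand in for the residuals of a fitted ANOVA model).
residuals = rng.normal(0.0, 1.0, size=85)
sorted_std = np.sort((residuals - residuals.mean()) / residuals.std())

# Theoretical quantiles at plotting positions (i - 0.5) / n.
probs = (np.arange(1, 86) - 0.5) / 85
theoretical = stats.norm.ppf(probs)

# If normality holds, observed and theoretical quantiles correlate near 1.
r = np.corrcoef(sorted_std, theoretical)[0, 1]
print(round(r, 3))
```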
The following sections illustrate how to conduct and interpret a Bayesian ANOVA with JASP. JASP can be freely downloaded from https://jasp-stats.org/download/. Annotated .jasp files of the discussed analyses, data sets, and a step-by-step guide on conducting a Bayesian ANOVA in JASP are available at https://osf.io/f8krs/. We should stress that the current implementation of the Bayesian ANOVA in JASP is based on the R package BayesFactor (Morey & Rouder, 2015), which is itself based on the statistical work by Rouder et al. (2012).

EXAMPLE I: A ROBOT'S SOCIAL SKILLS
Do people take longer to switch off a robot when it displays social skills? This question was studied by Horstmann et al. (2018, see their publication for online access to the complete data set) and we use their data  to illustrate the key concepts of a Bayesian ANOVA. In the Horstmann et al. (2018) study, 85 participants interacted with a robot. Participants were told that the purpose of their interaction with the robot was to test a new algorithm. After two dummy tasks were completed, the instructor told the participants that they could switch off the robot if they wanted. The outcome variable was the time it took participants to switch off the robot.
Here we analyze the log-transformed switch-off times, since the Q-Q plot of the raw switch-off times showed a violation of normality. Horstmann et al. (2018) manipulated two variables in a between-subjects design. First, they manipulated the robot's verbal responses to be either social (e.g., "Oh yes, pizza is great. One time I ate a pizza as big as me.") or functional (e.g., "You prefer pizza. This worked well. Let us continue."). Second, the robot either protested to being turned off (e.g., "No! Please do not switch me off! I am scared that it [sic] will not brighten up again!") or it did not. The design of this study is therefore a 2 × 2 between-subjects ANOVA. The data are shown in Figure 3.
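The rationale for the log transformation can be illustrated with simulated right-skewed data (the values below are not Horstmann et al.'s; the Shapiro-Wilk test is used here only as a numerical stand-in for the visual Q-Q check):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated right-skewed (log-normal) switch-off times: normality is
# rejected for the raw times, while the log-transformed times look normal.
raw_times = rng.lognormal(mean=1.7, sigma=0.8, size=85)
log_times = np.log(raw_times)

_, p_raw = stats.shapiro(raw_times)
_, p_log = stats.shapiro(log_times)
print(p_raw, p_log)  # expect p_raw to be far smaller than p_log
```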

Model comparison
The primary output from the JASP ANOVA is presented in Table 1, which shows the support that the data offer for each model under consideration. The left-most column lists all models at hand: four alternative models and one null model. The models are ordered by their predictive performance relative to the best model; this is indicated in the BF01 column, which shows each model's Bayes factor relative to the best model, here the model that features only the objection factor. For example, the data are about 73 times more likely under the model with only the robot's objection as a predictor than under the null model.

Bayes factors are transitive, which means that if the model with only the robot's objection outpredicts the null model by a factor of a, and the null model outpredicts the model with only social interaction type by a factor of b, then the model with only the robot's objection will outpredict the model with only social interaction type by a factor of a × b. Transitivity can be used to compute Bayes factors that may be of interest but are missing from the table. For example, the Bayes factor for the null model versus the model with only social interaction type can be obtained by dividing their Bayes factors against the best model: 252.495 / 73.373 ≈ 3.441 in favor of the null model. Note that the Bayes factor is represented as BF01 in Table 1: the predictive performance of the best model divided by the predictive performance of a particular model. Had we shown BF10 instead, we would have needed to take the reciprocal of the previous calculation to obtain the same result.
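The transitivity calculation from the text, using the BF01 values reported in Table 1:

```python
# Bayes factors against a common reference model can be combined by division.
bf_best_vs_social = 252.495   # best model vs model with only social type
bf_best_vs_null   = 73.373    # best model vs null model

# Null model vs social-interaction-only model, via transitivity:
bf_null_vs_social = bf_best_vs_social / bf_best_vs_null
print(round(bf_null_vs_social, 3))  # ~3.441 in favor of the null model
```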

Analysis of Effects
The previous section compared all available models. However, as the number of predictors increases, the number of models quickly grows too large to consider each model individually. (The reported Bayes factors come with an error percentage; when this percentage is deemed too high, the number of samples can be increased to reduce the error at the cost of longer computation time. For the present conclusions, this amount of numerical error is not problematic; see also Jeffreys, 1961, Appendix B.) Rather than studying the results for each model individually, it is possible to average the results from Table 1 over all models, that is, to compute the model-averaged results. This produces Table 2, which shows for each predictor the prior and posterior inclusion probabilities, and the inclusion Bayes factor. A prior inclusion probability is the probability that a predictor is included in the model before seeing the data, and is computed by summing the prior model probabilities of all models that contain that predictor. A posterior inclusion probability is the probability that a predictor is included in the model after seeing the data, and is computed by summing the posterior model probabilities of all models that contain that predictor. The inclusion Bayes factor quantifies the change from prior inclusion odds to posterior inclusion odds and can be interpreted as the evidence in the data for including a predictor.
For example, Table 2 shows that the data are about 68.6 times more likely under the models that include the robot's objection than under the models without this predictor. Note: The abbreviations 'O' and 'S' stand for the robot's objection and social interaction type, respectively. The first column denotes each predictor of interest, the second column shows the prior inclusion probability, the third the posterior inclusion probability, and the fourth the inclusion Bayes factor.
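The inclusion computations can be sketched as follows. The posterior model probabilities below are hypothetical placeholders for the five models in the example, not the values underlying Table 2:

```python
# Model-averaged inclusion: sum prior and posterior model probabilities over
# all models containing a predictor, then form the inclusion Bayes factor
# from the change in inclusion odds. All probabilities are hypothetical.
models = {
    "null":    {"prior": 0.2, "post": 0.010, "has_O": False},
    "O":       {"prior": 0.2, "post": 0.550, "has_O": True},
    "S":       {"prior": 0.2, "post": 0.002, "has_O": False},
    "O+S":     {"prior": 0.2, "post": 0.300, "has_O": True},
    "O+S+O*S": {"prior": 0.2, "post": 0.138, "has_O": True},
}

prior_incl = sum(m["prior"] for m in models.values() if m["has_O"])
post_incl = sum(m["post"] for m in models.values() if m["has_O"])

prior_odds = prior_incl / (1 - prior_incl)
post_odds = post_incl / (1 - post_incl)
bf_incl = post_odds / prior_odds  # evidence for including predictor O
print(prior_incl, post_incl, round(bf_incl, 2))
```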
Although model-averaged results are straightforward to obtain, their interpretation requires special attention when interaction effects are concerned. In JASP, models are excluded from consideration when they violate the principle of marginality, that is, when they feature an interaction effect but lack the constituent main effects (for details see Nelder, 1977). This model exclusion rule means that the active model set is not balanced. For example, in Table 2 the inclusion odds for the interaction 'O * S' are obtained by comparing four models without the interaction effect against the one model with the interaction effect. As an alternative, Sebastiaan Mathôt has suggested computing inclusion probabilities for "matched" models only. What this means is that all models with the interaction effect are compared to models with the same predictors except for the interaction effect. For example, the model with the 'O * S' interaction effect in Table 2 is compared against the model with the main effects of 'O' and 'S', but not against any other models. To compute inclusion probabilities for main effects, models that feature interaction effects composed of these main effects are not considered. These models are excluded because they cannot be matched with models that include the interaction effect but not the main effect, since those would violate the principle of marginality. Note that without interaction effects, the matched and unmatched inclusion probabilities are the same. Table 3 shows the inclusion probabilities and inclusion Bayes factors obtained by considering only matched models. Comparing Table 3 to Table 2, the prior inclusion probabilities of the main effects decreased because these are based on one model fewer. The posterior inclusion probabilities of the main effects decreased, but that of the interaction effect increased.
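For the interaction, the matched inclusion Bayes factor reduces to a single pairwise model comparison. The sketch below uses hypothetical model probabilities (not the JASP values) for the five models in the example:

```python
# "Matched" inclusion for the interaction O*S: compare the one model with
# the interaction against the one model with the same main effects only.
# All probabilities are hypothetical placeholders.
post = {"null": 0.010, "O": 0.550, "S": 0.002, "O+S": 0.300, "O+S+O*S": 0.138}
prior = {m: 0.2 for m in post}  # uniform prior over the five models

# Across all models (as in Table 2), four models without O*S would be
# compared against one model with it. Matched (as in Table 3): only the
# 'O+S' model is a legitimate comparison for 'O+S+O*S'.
matched_prior_odds = prior["O+S+O*S"] / prior["O+S"]
matched_post_odds = post["O+S+O*S"] / post["O+S"]
bf_incl_matched = matched_post_odds / matched_prior_odds
print(round(bf_incl_matched, 2))
```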
Relative to Table 2, the matched inclusion Bayes factors provide slightly more evidence for including the main effect of the robot's objection and the interaction effect, and somewhat more evidence for excluding the main effect of the social interaction type. Table 3. Results from averaging over the models in Table 1 but considering only "matched" models (see text for details). Note: The abbreviations 'O' and 'S' stand for the robot's objection and social interaction type, respectively. The first column denotes each predictor of interest, the second column shows the prior inclusion probability, the third the posterior inclusion probability, and the fourth the inclusion Bayes factor.

Parameter Estimates
After establishing which predictors are relevant, we can investigate the magnitude of the relations by examining the posterior distributions. Table 4 summarizes the model-averaged posterior distribution of each level (β_j) using four statistics: the posterior mean, the posterior standard deviation, and the lower and upper bound of the 95% central credible interval. The symmetry in the estimates is a consequence of the sum-to-zero constraint; that is, the posterior mean of O-Yes = -1 × the posterior mean of O-No = 0.265. Table 4 shows that the effect of objection is about 0.265 (95% credible interval [0.111, 0.418]). A posterior estimate for the log switch-off time of a particular group, say the condition where the robot did not object, can be obtained by adding the posterior mean of the intercept (i.e., the grand mean), 1.724, to the posterior mean of the no-objection condition, -0.265, which yields 1.459. To summarize, the Bayesian ANOVA revealed that the robot's objection almost certainly had an effect on switch-off time (BF Inclusion = 68.558). We also learned that the data are not sufficiently informative to allow a strong conclusion about the effect of the robot's social interaction type (BF Inclusion = 0.535) or about an interaction effect between objection and social interaction type (BF Inclusion = 1.659).
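The cell-mean arithmetic from the text, using the posterior means reported in Table 4:

```python
# Recover a condition's predicted log switch-off time from the grand mean
# and the sum-to-zero condition effects (posterior means from Table 4).
intercept = 1.724             # grand mean of the log switch-off times
effect_no_objection = -0.265
effect_objection = 0.265      # sum-to-zero: the two effects mirror each other

mean_no_objection = intercept + effect_no_objection
mean_objection = intercept + effect_objection
print(mean_no_objection, mean_objection)  # ~1.459 and ~1.989
```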

EXAMPLE II: POST HOC TESTS ON THE HOUSES OF HOGWARTS
After executing an ANOVA and finding strong evidence that a particular predictor relates to the outcome variable, a common question arises: "Which levels of the predictor deviate from one another?" As an illustration, consider the data from Jakob, Garcia-Garzon, Jarke, & Dablander (2019), where 847 participants filled out a 'sorting hat' questionnaire that determined their assignment to one of the four Houses of Hogwarts from the Harry Potter books: Gryffindor, Hufflepuff, Ravenclaw, or Slytherin. Subsequently, participants filled out the dark triad questionnaire (Jones & Paulhus, 2014), which was used to derive the outcome variable: Machiavellianism.
In this example, there is only one categorical predictor: the House of Hogwarts a participant was assigned to. If we compare the model with this predictor to the null model, we find overwhelming evidence for the alternative (BF10 = 6.632 × 10^18). This is a clear indication that Machiavellianism differs between the members of the four houses. However, this result does not indicate which houses are responsible for the difference. To address that question, we need a post hoc test. For ANOVA models, the main component of a post hoc test is a t-test on all pairwise combinations of a predictor's levels. For a Bayesian ANOVA, the main component is the Bayesian t-test. Table 5 shows the Bayesian post hoc tests for the sorting hat data. As with frequentist inference, Bayesian post hoc tests are subject to a multiple comparison problem. To control for multiplicity, we follow the approach discussed in Westfall (1997), which is an extension of the approach of Jeffreys (1938); for an overview of Bayesian methods that correct for multiplicity see, for instance, de Jong (2019). Westfall's approach relates the prior probability p(H0) of the overall null hypothesis that all condition means are equal to each comparison between two condition means. That way, the prior probability of the overall null hypothesis can be adjusted to correct for multiplicity, and this adjustment influences each individual comparison. The procedure that relates the overall null hypothesis to each comparison is described below.
A condition mean μ_i is either equal to the grand mean μ with probability τ, or μ_i is drawn from a continuous distribution with probability 1 - τ. It is key that this distribution is continuous, because two values drawn from a continuous distribution are never exactly equal. Thus, the probability that two condition means μ_i and μ_j are equal is p(μ_i = μ_j) = p(μ_i = μ) × p(μ_j = μ) = τ^2. From this, the probability of the null hypothesis that all J condition means are equal follows: p(H0) = p(μ_1 = μ_2 = … = μ_J) = p(μ_1 = μ) × p(μ_2 = μ) × … × p(μ_J = μ) = τ^J. Solving for τ, we obtain τ = p(H0)^(1/J). Thus, the prior probability that two specific condition means are equal can be expressed in terms of the prior probability that all condition means are equal, that is, p(μ_i = μ_j) = τ^2 = p(H0)^(2/J). For example, imagine there are four conditions (J = 4) and the prior probability that all condition means are equal is 0.5. Then, the prior probability that two condition means are equal is p(μ_1 = μ_2) = 0.5^(2/4) = √0.5 ≈ 0.707, and the prior odds are (1 - √0.5) / √0.5 ≈ 0.414. In sum, the Westfall approach involves, as a first step, Bayesian t-tests for all pairwise comparisons, which provide the unadjusted Bayes factors. In the next step, the prior model odds are adjusted by fixing the overall prior probability of no effect to 0.5. The adjusted prior odds and the Bayes factor are then used to calculate the adjusted posterior odds. Table 5 shows the results for the post hoc tests of the sorting hat example.
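The adjustment arithmetic for J = 4 and p(H0) = 0.5, as in the example above; the pairwise Bayes factor at the end is a hypothetical value, included only to show how the adjusted posterior odds are formed:

```python
# Westfall's multiplicity adjustment: fix the prior probability that all J
# condition means are equal, derive tau, and obtain the adjusted prior odds
# for any single pairwise comparison.
J = 4        # number of conditions (Houses)
p_H0 = 0.5   # prior probability that all condition means are equal

tau = p_H0 ** (1 / J)           # p(mu_i = mu) for each condition
p_pair_equal = tau ** 2         # p(mu_i = mu_j) = p_H0^(2/J) = sqrt(0.5)
prior_odds = (1 - p_pair_equal) / p_pair_equal   # ~0.414 against equality

# Adjusted posterior odds for one comparison, given its unadjusted Bayes
# factor (hypothetical value for illustration):
bf_pairwise = 10.0
post_odds = prior_odds * bf_pairwise
print(round(p_pair_equal, 4), round(prior_odds, 3), round(post_odds, 2))
```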
The adjusted posterior odds show (1) evidence (i.e., odds of about 16) that Machiavellianism differs between Hufflepuff and Ravenclaw; (2) evidence (i.e., odds of about 27) that Machiavellianism differs between Gryffindor and Hufflepuff; (3) overwhelming evidence (i.e., odds of about 1.04 × 10^9, 5.43 × 10^16, and 5.30 × 10^9) that Machiavellianism differs between Gryffindor and Slytherin, between Hufflepuff and Slytherin, and between Ravenclaw and Slytherin, respectively; and (4) evidence (i.e., odds of 1/0.0432 ≈ 23) that the Machiavellianism of Gryffindor and Ravenclaw is the same. Now that we know which Houses differ, the next step is to assess the magnitude of the effect of each House of Hogwarts on the Machiavellianism score. Rather than examining a table that summarizes the marginal posteriors, we plot the model-averaged posteriors for each house in Figure 5. Clearly, Slytherin scores higher on Machiavellianism than the other Houses, whereas Hufflepuff scores lower on Machiavellianism than the other Houses. Table 6 in the appendix shows the parameter estimates of the marginal posterior effects for each house.

CONCLUDING COMMENTS
The goal of this paper was to provide guidance for practitioners on how to conduct a Bayesian ANOVA in JASP and interpret the results. Although the focus was on ANOVAs with categorical predictors, JASP can also handle ANOVAs with additional continuous predictors. The appropriate analysis then becomes an analysis of covariance (ANCOVA), and all concepts explained here still apply. For a general guide on reporting Bayesian analyses see van Doorn et al. (2019).
As with all statistical methods, the Bayesian ANOVA comes with limitations and caveats. For instance, when the model is severely misspecified and the residuals are non-normally distributed, the results from a standard ANOVA-whether Bayesian or frequentist-are potentially misleading and should be interpreted with care. In such cases, at least two alternatives may be considered. The first alternative is to consider a rank-based ANOVA such as the Kruskal-Wallis test (Kruskal & Wallis, 1952). This test depends only on the ordinal information in the data and hence does not make strong assumptions on how the data ought to be distributed. The second alternative is to specify a different distribution for the residuals. Using software for general Bayesian inference such as Stan (Carpenter et al., 2017) or JAGS (Plummer, 2003), it is relatively straightforward to specify any distribution for the residuals. However, this approach requires knowledge about programming and statistical modeling and is likely to be computationally intensive. Another limitation of the Bayesian ANOVA is that, especially in more complicated designs, it is not straightforward to intuit what knowledge the prior distributions represent.
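The rank-based alternative mentioned above is available in standard software; a minimal sketch with hypothetical data for three groups:

```python
from scipy import stats

# Kruskal-Wallis test: a rank-based alternative when the residuals are
# clearly non-normal; it uses only the ordinal information in the data.
group_a = [1.2, 3.4, 2.2, 2.9, 3.1]
group_b = [4.8, 5.6, 4.9, 6.0, 5.2]
group_c = [2.5, 2.8, 3.0, 2.1, 2.7]

H, p = stats.kruskal(group_a, group_b, group_c)
print(H, p)  # group_b ranks clearly above the others, so p is small
```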
Some limitations are specific to JASP. Currently, it is not possible to use post hoc tests to examine whether the contribution of a level differs from zero, that is, to test whether a specific level deviates from the grand mean. It is also not possible to handle missing values in any other way than list-wise deletion. Another limitation relates to sample size planning. The typical planning process involves a frequentist power analysis, which provides the sample size needed to achieve a certain rate of correctly detecting a true effect of a prespecified magnitude. A Bayesian equivalent of power analysis is Bayes factor design analysis (BFDA; e.g., Schönbrodt & Wagenmakers, 2018). In a sequential design, BFDA produces the expected sample sizes required to reach a target level of evidence (i.e., a target Bayes factor). In a fixed-n design, BFDA produces the expected levels of evidence, given a specification of the magnitude of the effect. At the moment of writing

APPENDIX: PARAMETER ESTIMATES FOR THE SORTING HAT DATA