
Volume 13 / Number 1 / 2017

Methodology

Editors: Peter Lugtig, José L. Padilla

European Journal of Research Methods for the Behavioral and Social Sciences Official Organ of the European Association of Methodology


Contents

Original Articles

Using the Errors-in-Variables Method in Two-Group Pretest-Posttest Designs
Alyssa Counsell and Robert A. Cribbie (p. 1)

Power of Modified Brown-Forsythe and Mixed-Model Approaches in Split-Plot Designs
Pablo Livacic-Rojas, Guillermo Vallejo, Paula Fernández, and Ellián Tuero-Herrero (p. 9)

Performance of Combined Models in Discrete Binary Classification
Anabela Marques, Ana Sousa Ferreira, and Margarida G. M. S. Cardoso (p. 23)

Call for Papers

Validity: Challenges in Conception, Methods, and Interpretation in Survey Research: A Special Issue for Methodology – European Journal of Research Methods for the Behavioral and Social Sciences
Guest Editors: Natalja Menold, Matthias Bluemke, and Anita Hubley (p. 38)



Original Article

Using the Errors-in-Variables Method in Two-Group Pretest-Posttest Designs

Alyssa Counsell and Robert A. Cribbie

Department of Psychology, York University, Toronto, Ontario, Canada

Abstract: Culpepper and Aguinis (2011) highlighted the benefit of using the errors-in-variables (EIV) method to control for measurement error and obtain unbiased regression estimates. The current study investigated the EIV method and compared it to change scores and analysis of covariance (ANCOVA) in a two-group pretest-posttest design. Results indicated that the EIV method's estimates were unbiased under many conditions, but the EIV method consistently demonstrated lower power than the change score method. An additional risk with using the EIV method is that one must enter the covariate reliability into the EIV model, and results highlighted that estimates are biased if a researcher chooses a value that differs from the true covariate reliability. Obtaining unbiased results also depended on sample size. Our conclusion is that there is no additional benefit to using the EIV method over change score or ANCOVA methods for comparing the amount of change in pretest-posttest designs. Keywords: analysis of covariance, change, errors-in-variables, posttest, pretest

Measuring group differences across two time points has been a topic of debate for decades (e.g., Allison, 1990; Cribbie & Jamieson, 2000; Cronbach & Furby, 1970; Lord, 1967; Maris, 1998; Wright, 2006). The two most common contenders are change score models and analysis of covariance (ANCOVA). A recent article by Culpepper and Aguinis (2011) acknowledged that the issue with using ANCOVA is its assumption that the covariate has been measured without error. Since it is rarely the case that the types of covariates used in psychology are free of measurement error, using ANCOVA will produce biased (and often misleading) results. Based on their simulation study, Culpepper and Aguinis recommend using the errors-in-variables (EIV) method in lieu of ANCOVA. The EIV method is a modified ANCOVA procedure that takes the reliability of the covariate into account so that the regression model no longer produces biased estimates (Fuller, 1980, 1987; Warren, White, & Fuller, 1974). However, in their paper, Culpepper and Aguinis (2011) discussed the use of the EIV method for general covariates, not specifically for use with a pretest score as a covariate. Given the theoretical debate about when it is appropriate to use change scores and when it is appropriate to use ANCOVA, the EIV method should be investigated in such a context to determine whether it can provide a general


solution to the problem of comparing groups in pretest-posttest designs.

ANCOVA or Change Scores?

Before delving into the EIV method it is important to discuss the similarities and differences between using change scores (also called difference scores or gain scores) versus using ANCOVA to compare the amount of change across two time points and two groups. The change score method involves running an independent samples ANOVA (or, equivalently, a t-test in cases with only two groups) to compare the amount of change from pretest to posttest between the groups. It can also be expressed as a regression model: (Y − X) = β0 + β1G + ɛ, where Y is the score at posttest, X is the score at pretest, G is the dummy coded grouping variable, and ɛ represents the residual error. The other popular method for two time point-two group designs is to conduct an ANCOVA using the individual's pretest score as a covariate. ANCOVA, written as a regression model, is: Y = β0 + β1G + β2X + ɛ. One can see that the variables in the ANCOVA equation are the same as those in the change score model, and with some simple algebra, that the two models will be mathematically identical when β2 = 1. In practice, however, β2 will not be equal to 1 and

Methodology (2017), 13(1), 1–8 DOI: 10.1027/1614-2241/a000122



therefore the methods will not produce equivalent model coefficients. Specifically β2 is the pooled within-group regression coefficient, which requires the assumption of homogeneity of regression (i.e., the slopes are equal across the groups) for unbiased estimates. Although the methods may be used to address the same research design, researchers should be aware that the two methods are actually testing different null hypotheses. Specifically, the change score method is testing the null hypothesis that there is no raw difference between the groups in the amount of change from pretest to posttest, whereas the ANCOVA model is testing the null hypothesis that there is no difference in the groups’ posttest scores, had the groups started with the same pretest scores. This theoretical difference has implications for the appropriateness of selecting one approach over another, particularly when different results are obtained using each of the two methods. When pretest differences between groups occur, a researcher can obtain radically different results regarding the differences between groups at posttest depending on the statistical approach used. Drawing conclusions from one method or the other could potentially be detrimental if the results were to be used for program implementation or policy change. As a practical example, consider two different classes, one that receives a novel teaching method for improving vocabulary and another that continues with a previous teaching instruction method (control). However, the two classes differ at pretest, whereby the one that received the novel teaching method had, on average, lower vocabulary scores than the second class (control). If the means of each group remain the same from pre-intervention to post-intervention, the researcher could arrive at different conclusions about the effectiveness of the novel teaching method based on whether he or she used a change score or ANCOVA approach. More specifically, since the ANCOVA method assumes that the groups are from populations that are equivalent on pretest scores, it is possible for the ANCOVA method to conclude that the group that started with a higher score at pretest actually improved more, even though the mean differences are equivalent. This phenomenon is often termed Lord’s paradox (Lord, 1967) as two apparently valid statistical methods provide contradictory results. It is now clear that the ANCOVA approach can provide misleading results because it assumes that the mean pretest group differences are zero in the population, and therefore it is not the appropriate control for nontrivial pretest differences. In the example presented above, it would not be appropriate to assume that the classes with different pretest vocabulary scores actually started with the same pretest ability. Many researchers have examined this issue (Cribbie & Jamieson, 2004; Fitzmaurice, 2001; Methodology (2017), 13(1), 1–8


Jamieson, 1999, 2004; Linn & Slinde, 1977; Maris, 1998; Rogosa, 1988, 1995; Senn, 2006; Wright, 2006), and in general, the conclusions are that the ANCOVA model is slightly more powerful than the change score model for randomized experiments, but that ANCOVA should not be used when there are true population differences at baseline (unless participants were assigned to groups based on pretest scores, see Wright, 2006). In nonexperimental studies, baseline differences between the groups are often not trivial, so regression-based control often results in erroneous conclusions (Cribbie & Jamieson, 2000; Miller & Chapman, 2001). Allison (1990) discusses situations where one method is more appropriate than the other. The authors would like to note that due to Cronbach and Furby’s (1970) influential paper, there is a history of negative attitudes toward using change scores based on the argument that they are unreliable. We highly encourage researchers to read Rogosa (1995) for more information on this topic.
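To make the distinction above concrete, both models can be fit directly in R, the language used later in this article for the simulations. The sketch below is illustrative only and is not code from the study; the data frame d and its columns pre, post, and group are hypothetical.

    # Toy data: a dummy-coded group and pretest/posttest scores
    set.seed(1)
    d <- data.frame(group = rep(0:1, each = 25))
    d$pre  <- rnorm(50)
    d$post <- d$pre + 0.3 * d$group + rnorm(50)

    change_fit <- lm(I(post - pre) ~ group, data = d)  # change score model: (Y - X) = b0 + b1*G + e
    ancova_fit <- lm(post ~ group + pre, data = d)     # ANCOVA model: Y = b0 + b1*G + b2*X + e

    coef(summary(change_fit))["group", ]  # group difference in raw change
    coef(summary(ancova_fit))["group", ]  # group difference adjusted for pretest

Because the left-hand sides differ, the two group coefficients estimate different quantities, which is why the approaches can disagree when the groups differ at pretest.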

Errors-in-Variables Method

The issue that using ANCOVA may lead to biased results is a matter of having fallible covariates, that is, covariates that contain measurement error and are therefore not perfectly reliable. In situations where there are nontrivial differences between the groups at pretest, measurement error (coupled with violating the homogeneity of regression assumption) is known to provide biased estimates (Culpepper & Aguinis, 2011; Porter & Raudenbush, 1987). The EIV method was developed to adjust the regression equation based on the covariate's degree of (un)reliability. Rather than using the raw covariance matrix of the independent variables to calculate the regression coefficients, it uses a corrected covariance matrix, which is adjusted to take into account the reliability of the covariate(s). After accounting for measurement error in the covariate (i.e., the pretest score in the described research design), the EIV method creates an unbiased estimate. For specific details on how the covariance matrix is modified in the EIV formula, see Fuller (1987). In Culpepper and Aguinis' simulation study, the EIV demonstrated unbiased estimates, good power, and accurate Type I error rates across a number of different conditions. While the simulation results of Culpepper and Aguinis (2011) look promising, the EIV method was not utilized in the measurement of differences in pre-post change across groups. As such, it is important to investigate whether the EIV method is the recommended method for comparing pre-post change with two groups. Since the EIV method is a modification of the ANCOVA method, one may raise the question about whether the EIV method is appropriate for a two time point-two group study when nontrivial




pretest group differences occur. One would expect that the amount of bias would be equivalent to the change score method, given what is known about the regression formulae when ANCOVA’s pooled regression coefficient is 1; in other words, if the pretest/posttest reliability is 1, then the slope of the relationship between pretest and posttest will be 1 (as in the change score model). However, it also raises questions about the appropriateness of the EIV, given the previous discussion of the theoretical difference between ANCOVA and the change score approach. The current research will address the following questions: 1) Are the power, Type I error control, and estimates obtained by the EIV method superior to those of the change score method when baseline differences are nontrivial? 2) As part of the EIV method, one must use an estimate of the covariate’s reliability. How precise must a researcher be in estimating the reliability of the covariate in order to obtain unbiased results? We hypothesize that the EIV’s estimates will be unbiased and more similar to the change score method than those of ANCOVA when there are nontrivial pretest differences. With trivial pretest differences (e.g., those that are due to randomization), all three procedures should produce similar model estimates. We also hypothesize that the EIV method will maintain accurate Type I error rates and similar power to the change score method across the range of conditions investigated.
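As context for the simulations that follow, a minimal sketch of a reliability-corrected regression in the spirit of the EIV approach (Fuller, 1987) is shown below. It is not the function provided by Culpepper and Aguinis (2011); the function name, the way the error variance is formed from the user-supplied reliability rxx, and all variable names are illustrative assumptions.

    # Minimal sketch of a reliability-corrected (EIV-style) regression estimate.
    # `pre` is the fallible covariate and `rxx` its assumed reliability.
    eiv_sketch <- function(post, pre, group, rxx) {
      X   <- cbind(intercept = 1, group = group, pre = pre)
      n   <- length(post)
      Sxx <- crossprod(X)                          # raw X'X
      err <- (1 - rxx) * var(pre) * (n - 1)        # assumed error sum of squares for `pre`
      Lam <- diag(c(0, 0, err))                    # only the covariate column is corrected
      drop(solve(Sxx - Lam, crossprod(X, post)))   # corrected normal equations
    }

If the reliability supplied to such a correction differs from the true reliability, the adjustment is wrong by construction, which is exactly the issue raised in the second research question.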

Method

This study used computer simulations to examine the bias, Type I error control, and power of the EIV, ANCOVA, and change score methods for comparing the amount of change from pretest to posttest across two groups. Data generation involved creating a continuous underlying score labeled "ability," which had a mean of 0 and a standard deviation of 1. Reliability of the pretest and posttest scores was fixed to be equal, although the reliability varied across conditions. Typically when researchers have a two group-two time point research design, the same instrument is used at both time points; this is the justification for why the reliability of the instrument did not change from pretest to posttest. The pretest score (x) was generated from the following model:

x = √ρxx X + √(1 − ρxx) ɛx,  (1)

where ρxx is the reliability of the pretest score x, X is the underlying ability measure, and ɛx is a random error component for the pretest score. Group membership was determined by an allocation variable based on the correlation between group membership and pretest score. The simulation included four different conditions for group allocation. The first condition was random assignment to either the control or treatment group. This resulted in equal proportions across the groups. The next three allocation conditions were created such that the correlation between the underlying measure of ability and group membership was .2, .4, or .6. After the dummy coded group variable was created, a posttest score (y) was generated based on different effect sizes and group allocation with the following model:

y = √ρyy X + δG + √(1 − ρyy) ɛy,  (2)

where ρyy is the reliability of the posttest score y, δ is the amount that the groups differ, G is the dummy coded grouping variable (0 or 1), and ɛy is a random error component for the posttest score. The study included a total of 840 conditions. The following variables were manipulated: 1) δ, the standardized population difference between groups (−.5, −.25, 0, .25, .5); 2) reliability of the pretest and posttest scores (.5, .8); 3) sample size (20, 50, 100); 4) correlation between an underlying ability score and group membership (0, .2, .4, .6); 5) difference between true reliability and researcher-reported reliability of the covariate (−.2, −.1, −.05, .05, .1, .2). This last condition is only relevant for the EIV method because users do not specify reliability estimates when using the change score or ANCOVA approaches. Five thousand replications were conducted for each condition using a nominal Type I error rate (α) of .05. The simulations were conducted using the open source statistical software R (R Development Core Team, 2013) and the EIV model was created using the function provided by Culpepper and Aguinis (2011).
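A compact sketch of the generating model in Equations (1) and (2) is given below for one combination of conditions; the object names and the specific condition values are illustrative and this is not the authors' simulation code.

    set.seed(2025)
    N     <- 50     # total sample size condition
    rxx   <- 0.8    # pretest reliability
    ryy   <- 0.8    # posttest reliability
    delta <- 0.25   # population treatment effect

    ability <- rnorm(N)                 # underlying ability, mean 0, SD 1
    G       <- rbinom(N, 1, 0.5)        # random-assignment allocation condition
    x <- sqrt(rxx) * ability + sqrt(1 - rxx) * rnorm(N)              # Equation (1)
    y <- sqrt(ryy) * ability + delta * G + sqrt(1 - ryy) * rnorm(N)  # Equation (2)
    # the correlated-allocation conditions would instead assign G from a variable
    # correlated .2, .4, or .6 with `ability`

Repeating such draws 5,000 times per condition and fitting the three models yields the rejection rates and bias summaries reported below.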

Results

Relative bias was used as an indicator of the amount of bias present for each of the three methods. Specifically, relative bias is defined as the absolute difference between the observed model coefficient and the population treatment effect divided by the population treatment effect (e.g., effect size). When the population treatment effect was zero, bias was based on the raw difference between the model and population coefficients (to avoid division by zero).





Table 1. Relative bias (in percentages) for N = 50 when ρxx = .5

[Table 1 reports relative bias for the change score (CS), ANCOVA (A), and errors-in-variables (EIV) methods, with rows for the population treatment effect δ (−0.5, −0.25, 0, 0.25, 0.5) and column blocks for the correlation between group membership and underlying ability (ρ = 0, .2, .4, .6); the individual cell values are not reproduced here.]

Notes. ρ = the correlation between group membership and underlying ability; δ = effect size, the population treatment effect; CS = change score; A = ANCOVA; EIV = errors-in-variables; raw bias is used when δ = 0 since we cannot calculate the relative bias with an effect of 0.

Table 2. Relative bias (in percentages) for N = 50 when ρxx = .8

[Table 2 has the same layout as Table 1: rows for the population treatment effect δ (−0.5, −0.25, 0, 0.25, 0.5) and column blocks for the correlation between group membership and underlying ability (ρ = 0, .2, .4, .6), each with CS, A, and EIV columns; the individual cell values are not reproduced here.]

Notes. ρ = the correlation between group membership and underlying ability; δ = effect size, the population treatment effect; CS = change score; A = ANCOVA; EIV = errors-in-variables. Raw bias is used when δ = 0 since we cannot calculate the relative bias with an effect of 0.

For the EIV model, bias was assessed under two separate conditions. In the first condition, the estimate of pretest (covariate) reliability entered in the model was exactly equal to the true reliability. A second bias condition examined the model coefficients when the true score reliability and the researcher-estimated reliability differed. This condition was investigated because we expected the EIV model coefficient to be unbiased in the first condition, but find it unlikely that applied researchers would be able to provide a value for their instrument's reliability that is exactly equal to its true value. Type I error was assessed by examining the proportion of rejections for each of the three models when the population treatment effect was zero. Power was defined as the proportion of rejections when the population treatment effect was nonzero. Only a subset of the results is presented due to space constraints. Specifically, only the results for N = 50 are presented for the Type I error, power, and relative bias conditions for all three procedures. Since the amount of bias differed for the EIV method based on sample size, we also present a figure with all sample size conditions along with a large sample size (N = 1,000) for the EIV method. The pattern of results for the change score and ANCOVA models remains the same regardless of sample size. If interested, the reader can request the full set of results from the authors.
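The two outcome criteria just described can be written as small helper functions; the function names below are illustrative and assume a vector of replicated coefficient estimates (or p values) for a single condition.

    relative_bias <- function(estimates, true_effect) {
      if (true_effect == 0) {
        mean(estimates)                                              # raw bias when the effect is zero
      } else {
        100 * abs(mean(estimates) - true_effect) / abs(true_effect)  # relative bias, in percent
      }
    }

    rejection_rate <- function(p_values, alpha = 0.05) {
      mean(p_values < alpha)   # Type I error rate if the true effect is zero; power otherwise
    }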

Relative Bias Results Table 1 presents the amount of bias present for the change score, ANCOVA, and EIV models when ρxx = .5 and N = 50. Table 2 presents the amount of bias present for the three models when ρxx = .8 and N = 50. Across all of the conditions investigated, the change score method demonstrated negligible bias. It was unbiased regardless of sample size, population treatment effect, pretest reliability, or correlation between group membership and underlying ability. Unsurprisingly, ANCOVA was found to be the most biased of the three methods across many conditions. Two conditions drastically affected the amount of bias in ANCOVA’s estimates. The first was the correlation of group membership with the underlying ability score. When group membership was not correlated with the underlying ability measure, the ANCOVA’s estimates were unbiased. However, as expected, with small, medium, or large correlations between pretest scores and the grouping variable, bias was demonstrated for all of the effect sizes. The amount of bias increased as the correlation between group membership and underlying ability increased, where relative bias was as high as 217% in the condition with the highest correlation between group membership and ability score. The second condition that affected ANCOVA’s estimates was the reliability of the pretest and posttest scores. Specifically bias in the regression estimates decreased as the reliability increased. This result is Ó 2017 Hogrefe Publishing




Figure 1. Amount of relative bias in the EIV method. Bias has been averaged across the population treatment effect size at each level of the correlation. The left figure displays the interaction of sample size and correlation between ability and group membership when ρxx = .5. The right figure displays the interaction of N and the correlation between group membership and ability when ρxx = .8.

expected because if the pretest and posttest scores’ reliability is exactly equal to 1, the ANCOVA model will be equivalent to the EIV model. There were three conditions that interacted to affect the amount of bias for the EIV method. Sample size, reliability of the covariate (ρxx), and correlation between group membership and underlying ability. For smaller sample sizes (N = 20 or 50), the EIV method demonstrated little to no relative bias across the effect sizes when the correlation between group allocation and ability score was low. However, bias increased as the correlation increased when the reliability of the covariate was .5. In the largest correlation condition with pretest reliability of .5, the EIV demonstrated biased results up to 92% in the N = 50 condition. Given the relationship with sample size and bias, we investigated bias under a large sample size (N = 1,000), to see whether bias results approached zero. The interaction of sample size, ρxx, and correlation between group membership and underlying ability can be seen in Figure 1. Relative bias estimates at each correlation have been averaged over the four population treatment effect sizes (relative bias cannot be calculated with an effect size of 0). The figure demonstrates that in large sample sizes (e.g., N = 1,000) there is no bias in any conditions, but for all of the other sample sizes, the amount of bias depends on the particular condition (i.e., combination of ρxx and correlation of group membership and ability).

Bias When True Score Reliability Differs From Estimated Reliability (EIV Only)

Given the importance of covariate reliability for the EIV method, we also investigated deviations of estimated covariate reliability from the true reliability to determine how accurate researchers must be for the EIV method to provide unbiased results. Table 3 presents a subset of the bias results for the EIV method when reliability estimates differ from true score reliability for N = 50. With random group assignment (correlation between group membership and underlying ability = 0), the results continued to show almost no bias regardless of degree of inaccuracy for reliability estimation. When the correlation was greater than zero, even small deviations (e.g., .05) from the true reliability resulted in bias across most of the conditions. More bias was present when the effect size was nonzero, and the amount of bias increased as the correlation between group membership and underlying ability increased. Increasing the correlation between underlying ability and group allocation resulted in bias regardless of the amount of deviation from the true reliability across most of the conditions. In fact, at the highest correlation tested, underestimating or overestimating the true reliability of the covariate often resulted in more biased estimates than if one were to have used the ANCOVA method instead. This pattern of results occurred regardless of sample size.

Type I Error Rates

Table 4 displays the Type I error results across the presented conditions for N = 50. Here, ANCOVA's Type I error rates were close to the nominal level (.05) when random assignment was used for group membership. However, empirical Type I error rates were found to be highly inflated at the largest correlation between group and underlying ability (e.g., rates as high as .50 when ρxx was .5). Increasing the reliability to .8 decreased the amount of Type I error, but rates were still found to be much higher than the nominal level. Both the change score and EIV methods were found to accurately maintain the empirical Type I error rates at the nominal level across all of the conditions investigated.




Table 3. EIV's relative bias (in percentages) when estimated reliability differs from true reliability, ρxx = .5 (N = 50)

            ρ = 0                    ρ = .2                   ρ = .4                   ρ = .6
Diff    δ=0    δ=.25  δ=.50    δ=0    δ=.25  δ=.50     δ=0    δ=.25  δ=.50     δ=0    δ=.25  δ=.50
−.20   0.003   1.30   0.400   0.270   103.0  38.9     0.767   198.0  58.1     0.533   880.0  131.0
−.10   0.001   1.60   1.00    0.085   30.5   16.5     0.210   89.9   61.3     1.780   216.0  149.0
−.05   0.001   1.90   1.60    0.036   16.6   8.7      0.133   51.0   26.3     0.220   5.4    75.1
 .05   0.006   0.400  0.400   0.021   9.7    5.7      0.040   17.0   7.7      0.062   27.0   8.8
 .10   0.001   1.90   0.900   0.047   16.4   9.3      0.096   37.1   20.2     0.174   69.5   34.4
 .20   0.001   4.10   0.200   0.080   30.3   16.3     0.178   72.6   35.2     0.329   131.0  67.3

Notes. ρ = correlation between group membership and underlying ability; Diff = the difference between true reliability and estimated reliability (a negative difference is an underestimate of reliability); δ = effect size (population treatment effect); raw bias is used when δ = 0 since we cannot calculate the relative bias with an effect of 0 (the δ = 0 columns).

Table 4. Type I error rates (N = 50)

            ρxx = .5                          ρxx = .8
ρ      Change score  ANCOVA   EIV       Change score  ANCOVA   EIV
0         .048        .045    .028         .050        .049    .046
.2        .050        .088    .032         .051        .068    .045
.4        .052        .235    .037         .053        .149    .045
.6        .049        .502    .045         .049        .325    .039

Notes. ρxx is the reliability of the covariate; ρ = correlation between group membership and underlying ability. Rates greater than .075 are considered liberal (e.g., Bradley, 1978).


Power Rates

Figure 2 displays four graphics presenting power results for several of the conditions with N = 50. The top left figure presents results for the ANCOVA, change scores, and EIV methods when group membership was randomly assigned and ρxx was .5. Here, the ANCOVA was found to have the most power of the three procedures whereas the EIV method displayed the lowest power. The change score method consistently displayed higher power than the EIV, although the power advantage was relatively minor. In the top right figure, the correlation between group and ability remains at 0, but ρxx was .8. Increasing the reliability increased the power for each of the methods and lessened the power difference between ANCOVA and the other two methods. The bottom left graph of Figure 2 examined power results when ρxx = .5 but the correlation between group membership and underlying ability was .6. This condition resulted in a different pattern of results from what was discussed above. When the effect size was negative, the change score method demonstrated the most power and the EIV and ANCOVA methods' power results were comparable to one another but significantly lower than the power of the change score approach. When the effect size was positive, the change score method continued to demonstrate higher power than the EIV method, but the ANCOVA's power exceeded that of the change score and EIV methods. It is important to note, however, that the ANCOVA's power results cannot be validly interpreted because the empirical Type I error results were almost five times the nominal level. The bottom right figure presents power results with a correlation of .6 and ρxx of .8. Increasing the reliability from .5 to .8 in the correlated group membership condition resulted in the same pattern of results as when ρxx was .5, although the power of the EIV and change score procedures increased.

Discussion

Researchers are often interested in assessing a change in behavior before and after some intervention. In order to provide valid claims about the effectiveness of the intervention, comparison to a control group is beneficial. While randomization is the only way to conclude that any differences between the groups were solely due to the effect of the intervention, in practice, some groups occur naturally or randomization may not be possible. Traditionally, when utilizing quasi-experimental designs, researchers were required to make a decision between two available statistical approaches: ANOVA (or t-test) on change scores or ANCOVA.




Figure 2. Four power plots when N = 50. The first row represents a correlation of group membership and ability of 0 whereas the second row is .6. The first column represents ρxx = .5 and the second column is ρxx = .8. Note that although ANCOVA appears to demonstrate a power advantage in the figures in the second row, the power results should not be interpreted as such, because its Type I error rates are extremely liberal.

Culpepper and Aguinis (2011) suggested that the EIV method might be a viable alternative to ANCOVA. The benefit of using the EIV in lieu of change scores or ANCOVA is that estimates will be unbiased under many different conditions. Based on the current study, it was demonstrated that with larger sample sizes (N > 100) the EIV method showed little bias independent of pretest/ posttest reliability, whether the groups displayed differences at pretest or not, and across differing correlations between group membership and underlying ability. However, several issues were demonstrated with using the EIV method instead of change scores or ANCOVA. With smaller sample sizes, there were conditions when the EIV’s estimates were biased. Given the complexity of the EIV model, it is possible that with small sample sizes, there were slight differences between the empirical modification to the covariance matrix and the model-implied changes. Another potential downfall with using the EIV method for the two group-two time point design is that the EIV method’s power was consistently lower than that of the change score model or ANCOVA across a wide range of conditions. While the power advantage of the change score or ANCOVA (when group membership was randomly allocated) may not always have been much larger, there were some instances in which the power difference was marked. The last potential issue with using the EIV method Ó 2017 Hogrefe Publishing

in the current study was that bias was large when the researcher-estimated value of pretest/posttest reliability differed (even by small amounts) from the true reliability. This bias was magnified when there was a relationship between group membership and one’s underlying ability being measured at pretest and posttest. In practice, it is unlikely that researchers will have a precise estimate of reliability, and therefore may unknowingly bias their results by using the EIV method with covariate reliability estimates that are marginally different from their true value. Aside from the differences in power and bias using the EIV method, a theoretical conundrum arises by using the EIV method as a catch-all to reduce model bias and maintain Type I error rates. Using the EIV method takes away the necessity of thinking critically about the appropriateness of one’s statistical analysis in this two group-two time point research design. After all, change scores and ANCOVA are testing two different null hypotheses, and this necessarily means that the language around interpretation and implication of results should be related back to the research hypotheses being tested.

Conclusion

Based on the results of the current research, we would not recommend using the EIV in place of the change score or ANCOVA methods with a two-group, pretest-posttest research design. In smaller sample sizes, the EIV method is more biased and less powerful than the change score method when there are nontrivial pretest differences between groups. At larger sample sizes, the estimates are similar, but there is no advantage to using the EIV method in lieu of change scores. When trivial pretest differences exist, the ANCOVA is more powerful than the EIV (or change score) method and all of the methods are unbiased. Choosing the EIV method over change scores or ANCOVA also poses an additional risk. Researchers may obtain biased parameter estimates if the researcher specifies a pretest reliability that is not exactly equal to the true population reliability or has a small sample size. As such, we recommend that researchers continue to think critically about their research design and hypotheses and choose the change score model with nontrivial pretest group differences or ANCOVA with trivial pretest differences, instead of implementing the EIV model for a research design with two time points and two groups.

References

Allison, P. D. (1990). Change scores as dependent variables in regression analysis. Sociological Methodology, 20, 93–114. doi: 10.2307/271083
Bradley, J. V. (1978). Robustness? The British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x
Cribbie, R. A., & Jamieson, J. (2000). Structural equation models and the regression bias for measuring correlates of change. Educational and Psychological Measurement, 60, 893–907. doi: 10.1177/00131640021970970
Cribbie, R. A., & Jamieson, J. (2004). Decreases in posttest variance and the measurement of change. Methods of Psychological Research, 9, 37–55.
Cronbach, L., & Furby, L. (1970). How should we measure "change": Or should we? Psychological Bulletin, 74, 68–80. doi: 10.1037/h0029382
Culpepper, S. A., & Aguinis, H. (2011). Using analysis of covariance (ANCOVA) with fallible covariates. Psychological Methods, 16, 166–178. doi: 10.1037/a0023355
Fitzmaurice, G. (2001). A conundrum in the analysis of change. Nutrition, 17, 360–361. doi: 10.1016/S0899-9007(00)00593-1
Fuller, W. A. (1980). Properties of some estimators for the errors-in-variables model. Annals of Statistics, 8, 407–422. doi: 10.1214/aos/1176344961
Fuller, W. A. (1987). Measurement error models. New York, NY: Wiley.
Jamieson, J. (1999). Dealing with baseline differences: Two principles and two dilemmas. International Journal of Psychophysiology, 31, 155–161. doi: 10.1016/S0167-8760(98)00048-8
Jamieson, J. (2004). Analysis of covariance (ANCOVA) with difference scores. International Journal of Psychophysiology, 52, 277–283. doi: 10.1016/j.ijpsycho.2003.12.009
Linn, R. L., & Slinde, J. A. (1977). The determination of the significance of change between pre- and posttesting periods. Review of Educational Research, 47, 121–150. doi: 10.1037/0022-006X.59.1.27



Lord, F. (1967). A paradox in the interpretation of group comparisons. Psychological Bulletin, 68, 304–305. doi: 10.1037/h0025105
Maris, E. (1998). Covariance adjustment versus gain scores – revisited. Psychological Methods, 3, 309–327. doi: 10.1037/1082-989X.3.3.309
Miller, G. M., & Chapman, J. P. (2001). Misunderstanding analysis of covariance. Journal of Abnormal Psychology, 110, 40–48. doi: 10.1037/0021-843X.110.1.40
Porter, A. C., & Raudenbush, S. W. (1987). Analysis of covariance: Its model and use in psychological research. Journal of Counseling Psychology, 34, 383–392. doi: 10.1037/0022-0167.34.4.383
R Development Core Team. (2013). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org
Rogosa, D. (1988). Myths about longitudinal research. In K. W. Schaie, R. T. Campbell, W. M. Meredith, & S. C. Rawlings (Eds.), Methodological issues in aging research (pp. 171–209). New York, NY: Springer.
Rogosa, D. R. (1995). Myths and methods: "Myths about longitudinal research" plus supplemental questions. In J. M. Gottman (Ed.), The analysis of change. Mahwah, NJ: Erlbaum.
Senn, S. (2006). Change from baseline and analysis of covariance revisited. Statistics in Medicine, 25, 4334–4344. doi: 10.1002/sim.2682
Warren, R. D., White, J. K., & Fuller, W. A. (1974). An errors-in-variables analysis of managerial role performance. Journal of the American Statistical Association, 69, 886–893. doi: 10.1080/01621459.1974.10480223
Wright, D. B. (2006). Comparing groups in a before-after design: When t test and ANCOVA produce different results. The British Journal of Educational Psychology, 76, 663–675. doi: 10.1348/000709905X52210

Received October 13, 2014 Revision received February 26, 2016 Accepted March 8, 2016 Published online March 22, 2017

Alyssa Counsell is a PhD candidate in the Quantitative Methods program in Psychology at York University. Her research interests include equivalence testing, robust statistics, measurement invariance, structural equation modeling, and pedagogical methods for improving statistical knowledge in applied psychological research.

Robert Cribbie is a Professor in the Department of Psychology at York University. He received his PhD in Quantitative Psychology from the University of Manitoba. His research interests include equivalence testing, multiplicity control and robust statistics.

Alyssa Counsell Department of Psychology York University Toronto, ON, M3J 1P3 Canada counsell@yorku.ca



Original Article

Power of Modified Brown-Forsythe and Mixed-Model Approaches in Split-Plot Designs

Pablo Livacic-Rojas (1), Guillermo Vallejo (2), Paula Fernández (2), and Ellián Tuero-Herrero (2)

(1) Universidad de Santiago de Chile, Chile
(2) Universidad de Oviedo, Spain

Abstract: Low precision in the inferences drawn from repeated-measures data analyzed with univariate or multivariate Analysis of Variance (ANOVA) models is associated with non-normally distributed data, nonspherical covariance structures in which the variances and covariances vary freely, a lack of knowledge of the error structure underlying the data, and the wrong choice of covariance structure among different selectors. In this study, the levels of statistical power exhibited by the Modified Brown-Forsythe (MBF) procedure and by two mixed-model approaches (Akaike's Criterion and the Correctly Identified Model [CIM]) are compared. The data were analyzed with the Monte Carlo simulation method using the statistical package SAS 9.2, a split-plot design, and six manipulated variables. The results show that the procedures exhibit high statistical power for within-groups and interaction effects, and moderate to low power for between-groups effects under the different conditions analyzed. For the latter, only the Modified Brown-Forsythe shows a high level of power, mainly for groups with 30 cases and Unstructured (UN) and Autoregressive Heterogeneity (ARH) matrices. For this reason, we recommend using this procedure, since it exhibits higher levels of power for all effects and does not require specifying the type of matrix that underlies the structure of the data. Future research should compare power with corrected selectors using single-level and multilevel designs for fixed and random effects. Keywords: statistical power, modified Brown-Forsythe, mixed linear model, split-plot designs

Difficulties found when testing the hypotheses of a design that has one factor with J-levels for independent groups and a second factor with dependent K-levels include noncompliance with the homogeneity assumption of the variance-covariance matrices (Wilcox, 2012), the lack of independence between values (Liu, Rovine, & Molenaar, 2012), and the relevance of using a statistical model based on the classical linear model or the Mixed Linear Model (MLM) to evaluate the different design effects (Ato, Vallejo, & Palmer, 2013; Stroup, 2013). Liu et al. (2012) suggest that among the procedures for analyzing data using the MLM are repeated-measures Analysis of Variance (ANOVA), covariance pattern models, and growth curve models, which analyze behavioral change considering as an assumption the existence of different patterns of covariance between the residuals of the main effects. Moreover, they point out that when the covariance structures of the data are heterogeneous, the performance of the Akaike Information Criterion (AIC; Akaike, 1974) and the Bayesian Information Criterion (BIC; Schwarz, 1978) has a lower efficiency than expected in correctly selecting the covariance structure, this being observed most Ó 2017 Hogrefe Publishing

clearly with the structure of Random Coefficients (RC). Additionally they indicate that the AIC and BIC select the covariance structure correctly when sample sizes are moderate and these two criteria have a good fit in the selection of the matrix when these are CS, MA (1), and AR (1). As regards the application of ANOVA for repeatedmeasures data, different authors suggest the existence of difficulties in the precision of the inferences for assessing the fixed effects of the design, based on repeated-measures univariate models or multivariate models of joint normality (with a general covariance structure) since they are associated with the presence of nonspherical covariance structures and free variation of the variance and covariance with the cost of estimating a large number of parameters (Kowalchuk, Keselman, Algina, & Wolfinger, 2004; Vallejo & Ato, 2006), unbalanced designs (Keselman, Algina, & Kowalchuk, 2001), moderate sample sizes (Davidson, 1972), low control of Type I error rates, noncompliance of parametric assumptions (Livacic-Rojas, Vallejo, & Fernández, 2010; Vallejo, Ato, Fernández, & Livacic-Rojas, 2013), and a lack of knowledge of the error structure Methodology (2017), 13(1), 9–22 DOI: 10.1027/1614-2241/a000124



underlying the data, which, in turn, affect the statistical power of the contrasts in the design hypotheses. As Livacic-Rojas, Vallejo, Fernández, and Tuero-Herrero (2013) pointed out that various studies have evaluated the performance of information criteria based on the likelihood of selecting the most correct repeated-measures model and using the MLM in three different scenarios: its ability to select the correct mean model, the correct covariance structure, and to select both structures simultaneously. In the same context, different studies show that the performance of AIC is the 48% and BIC the 42% in correctly selecting the covariance structure on average, respectively. See too, Vallejo, Fernández, Livacic-Rojas, and Tuero-Herrero (2011a, 2011b). Moreover, they pointed out that AIC selects the 2% for the version of the original covariance structure and the 48% for the heterogeneous version of the same structure. Along the same line, with respect to the Type I error rates, Livacic-Rojas et al. (2013) have pointed out that AIC yields higher Type I error rates (on 7.41% of analyzed conditions it exceeds the Bradley’s Liberal Criterion) than the Correctly Identified Model (CIM, on 2.47%). AIC shows a performance associated to main and interactional effects, ARH (Autoregressive Heterogeneity) and RC (null and positive types of relation between the sizes of group and type of matrices). In turn, Vallejo and Ato (2006) analyzed the Type I error rates between the Modified Brown Forsythe (MBF) and EGLS (empirical form of the generalized leastsquares method modified to fit Kenward-Roger-based AIC) and pointed out that both procedures were robust to violations of homogeneity of variances and non-normally distributional data on unbalanced designs. In another study, Vallejo, Arnau, and Ato (2007) compared the SAS® Proc Mixed (Market Mix Modeling [MMM]) and MBF so as to detect the effects in a split-plot design containing multiple variables when multivariate normality and covariance heterogeneity assumptions were violated. They indicated that these effects are comparable in the frequency of Type I errors. As regards statistical power of different procedures, Vallejo, Fernández, and Livacic-Rojas (2007) evaluated comparatively the effectiveness of MBF and EGLS for interactional effects in repeated-measures designs, and evidenced that none demonstrated a clearly superior performance when data were non-normally distributed and matrices were heterogeneous (RC, ARH1, and UN). Specifically, MBF shows higher levels of statistical power in comparison to EGLS when the vector of means different of zero is associated with the largest variance. By contrast, in most of the analyzed conditions, EGLS shows higher power levels when the nonzero means vector is associated with the smallest variance. In turn, when ELGS was used with AIC, it rarely yields low levels of statistical power. However, EGLS yields lower levels of power when the Methodology (2017), 13(1), 9–22


covariance matrix is specified correctly. Similarly, the authors note that if one could establish a procedure to model the covariance structure (instead of taking one without any structure), MBF would be less efficient than EGLS in situations where the covariance structure has an important role since more parameters are needed for such an estimation. In cases where the covariance matrix is misspecified, ELGS yields higher Type I error rates and wrong inferences. In this case, it is necessary to maintain a compromise between bias and precision (for more information, check Fitzmaurice, Laird, & Ware, 2004; Lix & Lloyd, 2006). On the other hand, Vallejo, Fernández, Herrero, and Livacic-Rojas (2007) compared the sensitivity of MMM and MBF in order to detect the effects of a multivariate design partially repeated measures when data are deviated from the normal distribution and the scattering matrices are heterogeneous. In general terms, the results indicate that none of the approaches proved uniformly most powerful. Specifically, the empirical levels of statistical power averaged were: MMM = 0.781 and MBF = 0.76. Successively, interactional effects were: MMM = 0.817 and MBF = 0.804. Finally, the interaction of groups and repeated measures were: MMM = 0.553 and MBF = 0.452. Based on the aforementioned evidence, the authors conclude that the matrix structure used was not the most favorable for MMM as the results were limited to the conditions examined. They also add that MBF can be inefficient in situations in which some sampling units have incomplete vectors and the shape of the matrix plays an important role in estimating and/or changing when covariates exist. These results lead them to corroborate that in these cases, the MMM is efficient and adequate. Moreover, the researchers recommend keeping the necessary balance between flexibility and parsimony criteria in order to choose the covariance structure or model that best describes the data. In this context, studies carried out by Fitzmaurice et al. (2004) state that an excessively flexible model (e.g., MBF) can produce inefficient estimates, while an excessively parsimonious model (the generalization of the mixed model proposed by Scheffé) can produce biased estimates of the effects corresponding to the structure of means. Along the same line, Stroup (2013) pointed out that if a researcher chooses a wrong model of covariance structure, the misspecification could inflate the Type I error rate (e.g., CS) or reduce the statistical power (e.g., UN), which reduces the chance of identifying the treatment effects. Considering different studies, Vallejo, Arnau, Bono, Fernández, and Tuero-Herrero (2010) note that comparing the effectiveness of AIC and BIC with those of other procedures, selection of the covariance structure improves as its complexity decreases and the sample size increases. Ó 2017 Hogrefe Publishing
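Although the studies discussed above are based on SAS PROC MIXED, the kind of covariance-structure comparison they evaluate can be sketched in R with the nlme package. The data frame dat, the model formula, and the candidate structures below are illustrative assumptions, not the designs used in the studies cited.

    library(nlme)

    set.seed(1)
    dat <- expand.grid(id = 1:30, time = 1:4)       # 30 subjects, 4 occasions
    dat$group <- ifelse(dat$id <= 15, 0, 1)
    dat$y <- rnorm(nrow(dat)) + 0.2 * dat$time * dat$group

    fit_cs  <- gls(y ~ group * factor(time), data = dat,
                   correlation = corCompSymm(form = ~ 1 | id), method = "REML")
    fit_ar1 <- gls(y ~ group * factor(time), data = dat,
                   correlation = corAR1(form = ~ time | id), method = "REML")
    fit_arh <- gls(y ~ group * factor(time), data = dat,
                   correlation = corAR1(form = ~ time | id),
                   weights = varIdent(form = ~ 1 | time), method = "REML")  # ARH-like

    AIC(fit_cs, fit_ar1, fit_arh)   # smaller values indicate the preferred structure

Information criteria computed in this way are the kind of selectors whose accuracy is at issue in the discussion above.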



At the same time, Vallejo, Tuero-Herrero, Núñez, and Rosário (2014) indicated that when it comes to the performance of selection criteria based on the Maximum Likelihood (ML) or Restricted ML (REML), REML works the same or better than ML when the researcher selects the mean and covariance structures. To overcome the above limitations, they recommended the use of different informational criteria [Akaike Information Criterion Corrected (AICC), the Consistent Akaike Information Criterion (CAIC), the BIC, and the Deviance Information Criterion (DIC, Spiegelhalter, Best, Carlin, & Van der Linde, 2002)]; however, the appropriate use of the selection criteria is a subject of ongoing debate (see too, Greven & Kneib, 2010; Hamaker, Van Hattum, Kuiper, & Hoijtink, 2011; Srivastava & Kubokawa, 2010; Vaida & Blanchard, 2005). Based on the difficulties for different covariance selector structures such as the wrong choice of covariance structure, the Type I error rates, and the yield of moderate levels of statistical power, this study intents to compare the levels of statistical power presented by three selection criteria, namely the AIC, the CIM, and the MBF procedure, using a split-plot design for between-group effects, within-group effects, and interaction in different data conditions. Based on the aforementioned evidence the mixed model with the AIC was used (instead of another criterion such as the BIC or the CAIC) because, despite selecting the correct model at a low frequency, it is the criterion that exhibits the greatest efficiency. Similarly, the CIM, the procedure that represents the true structure of the data, allows a more realistic comparison of the specific functioning of the different procedures (see Keselman, Algina, Kowalchuk, & Wolfinger, 1998; Livacic-Rojas et al., 2013). And MBF, being a procedure that takes an unstructured matrix (UN), is more realistic in estimating a larger number of parameters with data taken at different points in time and it is more robust to the heterogeneity of the data. With regard to the latter procedure, Vallejo, Fidalgo, and Fernández (2001), Vallejo and Livacic-Rojas (2005), Vallejo et al. (2006), and Vallejo, Arnau, et al. (2007) extended the approximate BF procedure to the univariate and multivariate repeated-measures context, in order to avoid the negative impact that the heterogeneity of covariance matrices has on multivariate test criteria. Although it is believed that the MLM method is generally more powerful than the MBF test, this question should be investigated before researchers adopt the MLM method (see too, Vallejo, Ato, & Valdés, 2008). To date, no such comparison has been undertaken.

Description of the Procedures to be Compared in This Study Let yijk, i = 1,. . ., nj; j = 1,. . ., p; k = 1,. . ., q, be the response for the ith participant in the jth group at the kth occasion, Ó 2017 Hogrefe Publishing

11

and let yij = (yij1, . . ., yijq)′ be the random vector of responses for the ith participant in the jth group. Then, by stacking the subvectors y′11, . . ., y′npp, a multivariate linear model can be written as

Y = XB + E,  (1)

where Y is an n × q matrix of observed data, X is an n × p design matrix with full column rank p < n, B is a p × q matrix that contains the unknown fixed effects to be estimated from the data, and E is an n × q matrix of unknown random errors. We assume that the rows of Y are normally and independently distributed within each level j, with mean vector μj and variance-covariance matrix Σj. The unbiased estimators of Σj are Σ̂j = (1/(nj − 1))Ej, where Ej = Y′jYj − B̂′jX′jYj are distributed independently as Wishart Wq(nj − 1, Σj), and B̂j = (X′jXj)⁻¹X′jYj is the maximum likelihood estimator of matrix Bj (Nel, 1997). We also assume that Σ̂j⁻¹ exists, j = 1, 2, . . ., p, with nj − 1 ≥ q, with probability one. The hypotheses tested under a multivariate model are linear combinations of rows and columns of B. Most of the hypotheses of interest for the fixed-effects model can be defined as

H0: C′BA = 0 versus the alternative HA: C′BA ≠ 0,  (2)

where C′ = (Ih ⋮ −1) is an h × p matrix of between-subjects contrasts with full row rank h ≤ p, and A = (Im ⋮ −1)′ is a q × m matrix of within-subjects contrasts with full column rank m ≤ q. It can be readily verified that with the structure defined in (2) it is not possible to test any linear hypothesis concerning the elements of B. Nevertheless, it is indeed possible to define specific contrasts for testing the hypotheses of principal interest. Under the assumption that the errors follow a multivariate normal distribution, the hypotheses of the type formulated in (2) can be tested using any of several standard multivariate tests (see Timm, 2002).
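For readers who want to see the model in Equation (1) and a test of a hypothesis of the form in Equation (2) in software, a small R sketch using the built-in multivariate linear model machinery is shown below; the toy data and dimensions are illustrative and do not correspond to the article's design.

    set.seed(3)
    grp <- rep(1:3, each = 20)            # p = 3 groups, n = 60
    Y   <- matrix(rnorm(60 * 4), 60, 4)   # q = 4 repeated measures
    Y[grp == 2, ] <- Y[grp == 2, ] + 0.4  # shift one group
    group <- factor(grp)

    mlm_fit <- lm(Y ~ group)              # multivariate linear model Y = XB + E
    anova(mlm_fit, test = "Wilks")        # Wilks test of the between-groups hypothesis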

MBF Procedure Practical implementation of the MBF procedure requires estimation of the degree of freedom (df) of the approximate central q-dimensional Wishart distribution, which can be easily derived by equating the first two moments (i.e., expectation and dispersion matrix) of the quadratic form associated with mth source of variation in model (1) to those of the central Wishart distribution. A detailed explanation of the multivariate Satterthwaite’s approximation can be found in Vallejo et al. (2006). Applying the approach of these authors, the Wilks likelihood ratio criterion for testing the interaction effect is given by the determinant of E*(H + E*) 1, where the Methodology (2017), 13(1), 9–22




hypothesis matrix, H, and the error matrix, E*, are determined by

^ 0 ½C0 ðX0 XÞ C 1 ðC0 BAÞ; ^ H ¼ ðC0 BAÞ

ð3Þ

p X 0 E ¼ ν e =ν h c j A Σj A;

ð4Þ

and

j¼1

where ν e and ν h are the approximate df for E* and H, respectively, c j ¼ 1 cj ; and cj ¼ nj =n: Using results due to Nel and Van der Merwe (1986) and Krishnamoorthy and Yu (2004), the approximations to the df can be written as:

ν e ¼

p P j¼1

2

1 nj 1

tr2

ðq 1Þ þ ðq 1Þ 2 ; 0 ^ ^ 1 0 ^ ^ 1 cj A Σj Ξ A þ tr cj A Σj Ξ A ð5Þ

and

ν h ¼ p P j¼1

fW g þ tr2

ðq 1Þ þ ðq 1Þ ! !2 ; p p P P 0 ^ ^ 1 0 ^ ^ 1 cj A Σj Ξ A þ tr cj A Σj Ξ A j¼1

ð6Þ ^j Ξ ^ 1 AÞ þ trðA0 Σ ^j Ξ ^ 1 AÞ2 2cj ½tr2 where W = [tr2 ðA0 Σ 2 0 ^ ^ 1 0 ^ ^ 1 ðA Σj Ξ AÞþtrðA Σj Ξ AÞ ; Ξ ¼ ðc1 Σ1 þ . . . þ c p Σp Þ; and tr( ) denotes the trace of the matrix. Using the transformation of Wilks’s Λ to F-statistic, the usual test of H0 versus if HA in (2) rejects approximately 2 2 1=s v2 F ; ðv ; v Þ where S ¼ l v 4 = F MBF ¼ 1 Λ 1 α 1 2 h v 1 Λ1=s 2 2 1=2 l þvh 5 ; v1 ¼ lv h and v 2 ¼ v e ðl v h þ 1Þ=2 S ðlv h 2Þ=2. Univariate Linear Mixed Model The linear mixed models are increasingly used in studies of growth and change for fitting and analyzing repeatedmeasures designs. This family of models defined by Laird and Ware (1982) and Jennrich and Schluchter (1986) extends the usual general linear model to cases where standard assumptions of independence and homogeneity are not required. Suppose that model (1) contains fixed as well as random effects, then for the complete response vectors considered in this article, the univariate linear mixed model can be written as

y ¼ Xβ þ Zu þ e;

ð7Þ

where y is an nq 1 vector of observed data, X is an nq k fixed design matrix with full column rank k(=1 + p + q + pq) < nq, β is a k 1 vector that contains Methodology (2017), 13(1), 9–22

j¼1

j¼1

j

an h h positive-definite matrix, Ω is a q q positivedefinite covariance matrix, is the Kronecker product function, and denotes the matrix direct sum. Both moments of y can be modeled separately and distinctly. For known V, statistical inference about fixed-effects parameters can be obtained by using the generalized least squares (GLS) estimator of β given by

^ GLS ¼ ðX0 V 1 XÞ X0 V 1 y; β

2

j¼1

the unknown fixed effects common to all participants, Z is an nq nh random design matrix with full column rank nh < nq, u is an nh 1 unknown vector of random effects, and e is an nq 1 vector of random errors whose elements need not be independent and homogeneous. For the ith participant in the jth group, it is assumed that the random vectors ei and ui are independently distributed as N(0, Ωj) and N(0, Gj), respectively. For the combined model these assumptions imply that, marginally, E(y) = XB and Var(y) = V, where p p V ¼ Varðvec y0 Þ ¼ Z Inj Gj Z0 þ ðInj Ωj Þ; Gj is

ð8Þ

and its variance

^ ¼ ðX0 VarðβÞ

V 1

XÞ :

ð9Þ

~ EGLS Þ is If V is unknown, the EGLS estimator of β ðor β ^ in (8), which is obtained by replacing V by its estimate V V with its parameters G and Ω replaced by their maximum ~ is usually likelihood estimators. Likewise, the variance of β ^ estimated by replacing V by its estimate V in (9). Though a number of estimation strategies are available, the current manuscript uses REML estimation as implemented through SAS Institute’s (2005) Proc Mixed Program. In the mixed model, any specific hypothesis of the form H 0 : C0 β ¼ 0 versus H A : C0 β ¼ 6 0; where C is a matrix of contrasts of rank ν1; can be tested using Wald’s F statistic approximation

$$F = \nu_1^{-1}(C'\tilde{\beta})'\left[C'\mathrm{Var}(\tilde{\beta})C\right]^{-1}(C'\tilde{\beta}), \qquad (10)$$

where $\mathrm{Var}(\tilde{\beta})$ is the estimated covariance matrix of $\tilde{\beta}$, which usually underestimates the true sampling variability of $\tilde{\beta}$ because it does not accommodate the variability introduced by estimating Var(y) when estimating the covariance structure of $\tilde{\beta}$ and when computing the associated Wald-type statistics, particularly when the number of subjects is not sufficiently large to support likelihood-based inference. To circumvent this difficulty, Kenward and Roger (1997) provide a method that involves inflating the conventionally estimated covariance matrix of $\tilde{\beta}$ to derive an appropriate F-test statistic, replacing $\mathrm{Var}(\tilde{\beta})$ in (10) by an adjusted estimator of the covariance matrix of $\tilde{\beta}$, and



Figure 1. Formal description of the types of covariance matrices used for generating the simulated data.

estimating the denominator df for the generalized F-statistic based on it. The procedure to determine the power for a given design (which determines X and Z), covariance matrix, and treatment differences (included within $\hat{\beta}$) can be expressed as

$$\Pr\left[F(\nu_1, \nu_2, \lambda) \geq F_{1-\alpha}(\nu_1, \nu_2) \mid H_A\right] = 1 - \beta, \qquad (11)$$

where F(ν1, ν2, λ) represents the noncentral F distribution function with numerator df ν1 = Rank(C), denominator df ν2 appropriately estimated with the method proposed by Kenward and Roger (1997), and noncentrality parameter

$$\lambda = (C'\hat{\beta})'\left[C'\mathrm{Var}(\hat{\beta})C\right]^{-1}(C'\hat{\beta}), \qquad (12)$$

and $F_{1-\alpha;\,\nu_1,\nu_2}$ is the critical value for testing the null hypothesis at a Type I error rate of α. For details, see Stroup (2002). More information about Akaike's Information Criterion and the Correctly Identified Model is provided in Livacic-Rojas et al. (2013).
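As a rough illustration of how Equations (11) and (12) translate into a power computation, the following sketch evaluates the noncentral F probability for a given contrast matrix. It is not the authors' SAS/IML program: the function name, the example numbers, and the externally supplied denominator df (e.g., a Kenward-Roger estimate) are illustrative assumptions.

```python
import numpy as np
from numpy.linalg import inv, matrix_rank
from scipy.stats import f, ncf

def wald_power(C, beta, var_beta, nu2, alpha=0.05):
    """Power of the Wald F test of H0: C'beta = 0 (Equations 11-12)."""
    Cb = C.T @ beta
    lam = float(Cb.T @ inv(C.T @ var_beta @ C) @ Cb)   # noncentrality, Eq. (12)
    nu1 = matrix_rank(C)                               # numerator df = Rank(C)
    fcrit = f.ppf(1 - alpha, nu1, nu2)                 # critical value F_{1-alpha}
    return 1 - ncf.cdf(fcrit, nu1, nu2, lam)           # Pr[F >= F_crit | H_A], Eq. (11)

# Hypothetical example: three group effects and two contrasts among them.
beta = np.array([0.0, 0.4, 0.8])
var_beta = 0.04 * np.eye(3)
C = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]]).T
print(round(wald_power(C, beta, var_beta, nu2=40), 3))
```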

Method

To evaluate the statistical power levels of the AIC, BF, and CIM with the mixed-model approach with the KR solution, we conducted a Monte Carlo simulation study using a split-plot design with one between-subjects factor (p = 3) and one within-subjects factor (k = 4), using the SAS 9.1 Proc Mixed program (SAS Institute, 2005). Six variables were manipulated to investigate performance: (a) the total sample size, (b) the type of covariance matrix, (c) the pairing of group sizes and covariance matrices, (d) the form of the distribution, (e) trimmed means at 10% and 20%, and (f) three patterns of means (see, e.g., Algina & Keselman, 1997; Keselman et al., 1998). When the designs were balanced, the relationship between group size and dispersion matrix size was null. When the designs were unbalanced, the relationship could be positive (the smaller group was associated with the smaller dispersion matrix) or negative (the smaller group was associated with the larger dispersion matrix). For each sample size condition, both a null and a moderate degree

of cell size inequality were explored, as indexed by a coefficient of sample size variation, $\Delta = (1/\bar{n})\left[\sum_j (n_j - \bar{n})^2 / p\right]^{1/2}$, with $\bar{n}$ being the average group size. The unequal group sizes were, respectively: (a) 6, 10, 14 (n = 30), (b) 9, 15, 21 (n = 45), and (c) 12, 20, 28 (n = 60). The degree of heterogeneity of the dispersion matrices was $\Sigma_1 = \tfrac{1}{3}\Sigma_2$ and $\Sigma_3 = \tfrac{5}{3}\Sigma_2$. The covariance structures generating the simulated data in the proportional case were RC, ARH(1), and UN. The value of the sphericity parameter was held constant at 0.75 (see, e.g., Keselman et al., 1998). Figure 1 presents a formal description of these structures for a situation in which we have four measurement occasions. The form of the distribution had four levels: normal, slightly skewed, moderately skewed, and severely skewed. The values of the indices of skewness (c1) and kurtosis (c2) selected for generating non-normal multivariate distributions were: slightly skewed (c1 = 1, c2 = 0.75), moderately skewed (c1 = 1.75, c2 = 3.00), and severely skewed (c1 = 3.00, c2 = 21.00) (see, e.g., Berkovits, Hancock, & Nevitt, 2000; Micceri, 1989). The trimmed mean procedure was used to deal with heterogeneity of variance in nested designs, following the recommendations on robust methods given by Wilcox (2012). Additionally, the configuration of the mean vector pattern was selected at maximum range (Algina & Keselman, 1998). According to Ramsey (1978), this entails that the first mean of the vector takes the smallest value, the last takes the largest value, and the middle two take the average of the previous two. To examine sensitivity across measurement occasions, the following patterns were included within each of the three design groups: (−1, 0, 0, 1), (−1, 0, 1, 0), (−1, 1, 0, 0). For the analysis of the power levels, a thousand replicated data sets were created with the Interactive Matrix Language (IML) of SAS (SAS Institute, 2005), using the multivariate extension that Vale and Maurelli (1983) developed from the power method proposed by Fleishman (1978). The classification of statistical power levels was based on Cohen (1992): a high level is 0.80 or higher, moderate power levels are 0.30–0.79, and low power levels are 0.00–0.2999.



Among the reasons Cohen gives is that lower power levels increase the likelihood of Type II errors (for further details, see also Vallejo, Fernández, Herrero, et al., 2007).
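For readers who want to reproduce this kind of data generation outside SAS IML, the sketch below illustrates the Fleishman (1978) power method underlying the Vale and Maurelli (1983) extension, together with the trimmed means used in the robust conditions. It is a minimal illustration under stated assumptions, not the authors' code: the example correlation matrix, group size, mean pattern, and solver starting values are invented for the example, and the Vale-Maurelli intermediate-correlation adjustment is omitted, so correlations are only approximate under non-normality.

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.stats import trim_mean

def fleishman_coefficients(skew, kurt):
    """Solve Fleishman's (1978) system for b, c, d (with a = -c)."""
    def system(p):
        b, c, d = p
        return (b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1,                      # unit variance
                2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew,                # skewness (c1)
                24*(b*d + c**2*(1 + b**2 + 28*b*d)
                    + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - kurt)   # kurtosis (c2)
    b, c, d = fsolve(system, x0=(1.0, 0.0, 0.0))
    return -c, b, c, d

def simulate_group(n, mean, cov, skew=0.0, kurt=0.0, seed=None):
    """Draw n cases with the target covariance and (approximately) the target
    skewness/kurtosis by transforming correlated standard normal scores."""
    rng = np.random.default_rng(seed)
    mean, cov = np.asarray(mean, float), np.asarray(cov, float)
    sd = np.sqrt(np.diag(cov))
    z = rng.multivariate_normal(np.zeros(len(mean)), cov / np.outer(sd, sd), size=n)
    if skew or kurt:
        a, b, c, d = fleishman_coefficients(skew, kurt)
        z = a + b*z + c*z**2 + d*z**3
    return mean + z * sd

# Illustrative use: a severely skewed condition (c1 = 3.00, c2 = 21.00),
# n = 30 cases, q = 4 occasions, and an arbitrary correlation matrix.
cov = np.array([[1.0, 0.6, 0.5, 0.4],
                [0.6, 1.0, 0.6, 0.5],
                [0.5, 0.6, 1.0, 0.6],
                [0.4, 0.5, 0.6, 1.0]])
X = simulate_group(30, mean=[-1, 0, 0, 1], cov=cov, skew=3.0, kurt=21.0, seed=1)
print(trim_mean(X, 0.20, axis=0))   # 20% trimmed means per occasion
```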

Results

Normally and Non-Normally Distributed Data

Table 1 shows the average power levels of the three procedures for normal and non-normal distributions (including light, moderate, and strong levels of skewness and kurtosis). (See the Electronic Supplementary Material, ESM 1, for additional information.)

Normally Distributed Data

For the Between-Group Effects (BGE), the three procedures yield moderate average power levels across all analyzed conditions, with only slight differences between them (MBF = 45.39; AIC = 40.55; CIM = 42.30). In specific terms, the procedures yield the following. For BGE with n = 30, MBF displays high power levels in 22.22% of the analyzed conditions, associated with the ARH and UN matrices (negative relation). AIC and CIM show only moderate power levels in 77.78% and 88.89%, respectively, and low levels in 22.22% and 11.11% of the analyzed conditions with RC (positive and negative types of relation). With n = 45, BF and CIM show moderate levels at 77.78% and AIC at 66.67% of the analyzed conditions. The low levels occur with BF at 11.11% (RC and null relation), AIC at 33.33% with RC (three types of relation), and CIM at 22.22% with RC (null and positive types of relation). With n = 60, BF and CIM exhibit moderate power levels at 66.67% and low levels at 33.33% with RC (the three types of relation). AIC shows moderate levels at 55.56% and low levels at 44.44% with UN (negative relation) and RC (all three types of relation).

For Within-Group Effects (WGE), the three procedures yield high average power levels across all analyzed conditions, with only slight differences between them (MBF = 95.24; AIC = 94.83; CIM = 94.94). For Interaction Effects (INE), the procedures yield similar levels of power (MBF = 95.40; AIC = 94.43; CIM = 94.93).

Non-Normally Distributed Data

For BGE, the three procedures yield low average power levels across all analyzed conditions, with only slight differences between them (MBF = 28.85; AIC = 26.99; CIM = 27.09). In specific terms, the procedures yield the following. For BGE and n = 30, BF shows moderate levels in 77.78% and low levels in 22.22% of the analyzed conditions, respectively, with RC (null and positive types of relation).


On the other hand, AIC and CIM show moderate levels in 66.67% and 77.78% of the analyzed conditions, respectively, and low levels in 33.33% and 22.22% with RC (positive and negative types of relation) and UN (positive relation). With n = 45, BF displays moderate levels at 55.56% of the analyzed conditions and low levels at 44.44% with ARH and RC (null and positive types of relation). In line with the former, AIC and CIM show moderate levels in 33.33% of the analyzed conditions and low levels in 66.67%, associated with ARH and RC (the three types of relation). With n = 60, the three procedures yield moderate levels at 33.33% of the analyzed conditions and low levels at 66.67%, associated with ARH and RC (the three types of relation).

For Within-Group Effects (WGE), the three procedures yield high average power levels across all analyzed conditions, with only slight differences between them (MBF = 95.22; AIC = 93.59; CIM = 93.60). For Interaction Effects (INE), the procedures yield similar levels of power (MBF = 96.51; AIC = 95.24; CIM = 95.65).

Average Levels of Statistical Power for Normally Distributed Data

Table 2 shows the average levels of statistical power for the three procedures when data are normally distributed, with trimmed means at 10% and 20%.

Trimmed Means at 10%

For BGE with trimmed means at 10%, the three procedures yield moderate average power levels across all analyzed conditions, with only slight differences between them (MBF = 45.39; AIC = 40.55; CIM = 42.30). In specific terms, the procedures yield the following. For BGE and n = 30, BF shows high power levels at 22.22% of the analyzed conditions with ARH and UN (negative relation). Moderate levels are observed at 55.56% and low levels at 22.22%, associated with RC (null and positive types of relation). AIC and CIM show moderate levels at 77.78% and 88.89% of the analyzed conditions and low levels at 22.22% and 11.11%, associated with RC (for null and positive types of relation). With n = 45, BF and CIM exhibit moderate levels at 77.78% of the analyzed conditions and low levels at 22.22%, associated with RC (for null and positive types of relation). AIC shows moderate levels at 66.67% of the analyzed conditions and low levels at 33.33%, associated with RC (three types of relation). With n = 60, BF and AIC exhibit moderate levels at 66.67% of the analyzed conditions and low levels at 33.33%, associated with RC (null and positive types of relation). CIM shows moderate levels at 66.67% of the analyzed conditions and lower levels at 33.33%



Table 1. Percentages of Average Statistical Power of three procedures of covariance structure selectors with normally and non-normally distributed data, with mean patterns far and near each other (−1, 0, 0, 1; −1, 0, 1, 0; −1, 1, 0, 0) and untrimmed means MBF N

R

OCS

D

30

=

ARH (1)

N

RC +

45

=

+

60

=

+

30

=

+

45

=

+

60

=

BGE

WGE

AIC INE

BGE

WGE

CIM INE

BGE

WGE

INE

66.22

95.43

95.67

63.45

94.80

93.64

65.75

94.99

95.20

37.69

95.10

95.77

31.38

94.60

95.45

36.89

95.57

95.17

UN

67.49

95.17

95.87

68.08

95.20

94.47

66.79

95.47

95.80

ARH (1)

61.95

96.00

94.47

64.58

95.53

95.04

62.18

95.23

95.30 95.33

RC

31.60

95.57

95.87

29.13

96.00

93.80

31.36

95.40

UN

62.80

95.60

95.17

69.62

94.47

93.54

62.89

94.74

94.27

ARH (1)

80.88

95.43

95.93

71.89

95.17

91.40

75.88

95.63

92.27 95.87

RC

62.22

95.33

95.73

14.03

94.47

95.20

16.57

94.50

UN

81.11

95.44

95.57

73.62

93.21

92.47

76.37

94.04

94.11

ARH (1)

47.99

95.50

95.17

47.22

95.83

93.77

47.82

95.97

94.47 95.87

RC

16.83

95.04

95.93

14.03

94.97

95.20

16.57

94.50

UN

48.27

95.46

95.57

52.24

94.54

94.34

48.21

94.94

95.23

ARH (1)

42.48

94.64

95.13

40.96

95.17

95.10

42.53

95.34

95.24 95.87

RC

16.83

95.04

95.93

14.03

94.97

95.20

16.57

94.50

UN

42.18

95.07

95.37

48.47

95.30

95.57

42.58

95.00

95.93

ARH (1)

66.38

95.30

95.73

62.68

94.87

95.07

63.95

94.63

94.44 94.50

RC

35.03

94.80

95.50

27.63

93.84

94.67

33.09

93.80

UN

65.03

94.60

95.53

63.96

94.51

94.04

62.26

94.77

95.04

ARH (1)

32.73

95.03

95.04

32.33

95.07

94.87

32.56

94.74

95.07 95.20

RC

25.10

95.20

94.60

07.73

94.67

95.37

08.23

95.10

UN

33.25

95.50

94.60

04.17

94.87

95.07

32.85

95.04

93.97

ARH (1)

35.96

95.10

94.87

34.59

94.60

94.63

36.38

95.07

95.73 95.50

RC

05.00

95.50

97.67

04.17

94.87

95.07

05.13

95.00

UN

31.08

94.97

94.57

31.49

94.37

94.47

32.85

95.04

93.97

ARH (1)

58.68

95.13

95.70

54.68

94.64

94.67

57.11

94.60

95.20

RC

18.43

95.27

93.94

14.30

95.50

93.44

17.53

95.20

93.27

UN

52.44

95.27

94.87

54.37

94.44

94.17

51.07

94.54

95.20

35.43

95.37

96.87

33.06

93.54

96.17

33.93

93.31

96.86 95.74

ARH (1)

NN

RC

12.42

95.20

96.77

09.99

93.28

94.64

11.36

93.24

UN

68.07

95.27

97.10

66.57

93.74

95.94

66.83

93.71

97.00

ARH (1)

30.04

96.73

95.52

30.27

95.57

96.37

28.78

95.70

97.07 97.07

RC

08.62

96.37

97.37

07.83

95.07

96.17

07.99

95.30

UN

64.13

96.23

96.87

64.93

95.57

96.07

63.60

95.47

97.17

ARH (1)

59.74

94.20

97.70

47.88

91.28

95.04

50.05

91.24

95.47 92.38

RC

36.10

94.77

96.93

21.88

91.34

91.98

26.11

91.68

UN

80.95

94.87

98.07

73.32

91.84

94.17

74.96

91.44

95.73

ARH (1)

14.38

95.77

96.17

12.82

94.07

96.47

13.88

94.24

96.10 94.77

RC

02.00

95.50

96.63

01.73

93.31

94.57

02.03

93.57

UN

51.35

95.00

96.10

53.18

94.08

95.77

50.78

93.94

96.20

ARH (1)

09.26

95.47

96.23

09.72

95.10

95.90

09.19

95.07

96.47 96.30

RC

01.03

95.90

96.50

00.93

95.87

95.97

00.87

95.87

UN

47.02

95.37

96.30

48.99

94.41

95.74

46.72

94.41

96.37

ARH (1)

31.70

94.61

97.17

26.41

92.21

95.84

27.94

92.21

95.77 92.64

RC

10.46

94.57

96.50

07.46

91.41

92.91

08.66

91.58

UN

64.63

94.90

96.77

61.64

92.00

95.04

61.11

91.84

95.64

ARH (1)

05.03

95.44

96.00

04.23

94.28

95.90

04.70

94.21

95.84

RC

00.20

95.00

96.20

00.13

93.74

94.28

00.20

93.98

94.67

UN

39.59

95.07

95.80

40.53

93.64

95.04

39.26

93.68

95.60

(Continued on next page)

Ó 2017 Hogrefe Publishing

Methodology (2017), 13(1), 9–22


16

P. Livacic-Rojas et al., Statistical Power, Covariance

Table 1. (Continued) MBF N

R

OCS

+

D

AIC

CIM

BGE

WGE

INE

BGE

WGE

INE

BGE

WGE

INE

ARH (1)

03.00

95.00

95.60

02.80

94.14

96.20

02.97

94.04

95.70

RC

00.17

96.13

96.17

00.17

95.14

95.63

00.17

95.20

96.20

UN

32.24

95.10

95.90

33.93

94.87

95.54

32.17

94.91

96.00

ARH (1)

15.05

94.44

96.20

13.19

92.74

95.77

13.92

92.48

95.21

RC

03.13

94.14

95.40

02.03

91.98

93.08

02.60

91.98

92.64

UN

53.33

94.51

96.93

52.98

92.81

95.30

50.75

92.81

96.03

Notes. Sample size (n); Relationship between sample and dispersion matrices (R); Null relation between sample and dispersion matrix size (=); Positive relation between group sample and dispersion matrix size (+); Negative relation between sample and dispersion matrix size ( ); OCS = Original Covariance Structure; ARH (1) = Heterogeneous First-Order Autoregressive; RC = Random Coefficients; UN = Unstructured; D = Distribution of Data; N = Normal Distribution; Non-Normal Distribution (Light, Moderate, and Strong bias); MBF = Modified Brown-Forsythe Procedure; AIC = Akaike’s Information Criterion; CIM = Correctly Identified Model; BGE = Between-Group Effects; WGE = Within-Group Effects; INE = Interaction Effects; Power levels equal to or greater than 0.80 (boldface).

Table 2. Percentages of Average Statistical Power of three procedures of covariance structure selectors with normally distributed data, with mean patterns far and near each other (−1, 0, 0, 1; −1, 0, 1, 0; −1, 1, 0, 0) and 10 and 20% trimmed means MBF

AIC

CIM

N

R

OCS

TM

BGE

WGE

INE

BGE

WGE

INE

BGE

WGE

INE

30

=

ARH (1)

0.1

66.63

95.67

95.87

63.64

95.67

94.11

66.17

95.67

95.44

29.37

94.64

95.37

29.34

94.37

93.94

34.56

94.41

95.30

RC +

45

=

+

60

=

+

30

=

UN

67.90

95.94

95.10

68.33

95.54

93.01

67.13

95.67

95.47

ARH (1)

62.94

94.97

95.70

65.23

95.63

94.50

62.46

95.07

95.70

RC

28.94

95.27

95.80

27.51

95.20

95.10

29.91

95.11

95.87

UN

62.96

95.70

96.37

42.26

95.87

94.80

62.84

95.30

95.50

ARH (1)

80.72

95.07

96.53

72.36

94.74

93.21

76.43

95.17

93.24

RC

61.24

94.74

95.50

45.39

94.31

92.38

54.78

94.74

92.28

UN

81.55

95.43

96.07

73.83

94.77

91.08

76.69

94.67

92.61

ARH (1)

47.62

95.37

95.17

44.95

94.61

94.90

47.25

94.61

95.07

RC

16.95

94.84

95.17

14.35

95.07

94.38

16.85

95.04

95.23

UN

47.49

94.61

95.68

52.65

94.37

94.41

47.02

94.41

95.37

ARH (1)

44.49

95.30

95.27

44.22

95.10

95.27

44.39

95.07

95.53

RC

12.28

94.74

95.47

11.65

95.07

94.77

12.75

95.01

95.10

UN

42.96

95.10

95.37

49.78

94.57

94.34

42.66

94.71

95.24

ARH (1)

65.53

94.84

95.47

60.74

94.27

94.21

62.87

93.88

94.51

RC

35.10

95.07

95.67

25.81

94.87

94.64

32.53

94.90

94.57

UN

66.20

94.67

95.57

65.47

93.74

92.51

63.64

94.04

93.71

ARH (1)

34.27

95.44

95.50

32.87

95.54

95.14

34.13

95.74

95.10

RC

07.69

95.60

95.04

06.46

95.07

94.61

07.69

95.20

94.77

UN

33.47

94.94

95.87

37.49

94.74

95.01

33.47

94.84

95.73

ARH (1)

25.90

95.70

95.01

29.34

94.94

95.11

29.50

94.94

94.87

RC

05.66

95.54

95.90

04.90

95.80

95.60

05.63

95.78

96.20

UN

30.07

95.43

95.04

34.84

94.74

93.81

30.04

94.94

94.54

ARH (1)

49.85

95.24

95.10

47.65

95.43

94.61

48.75

95.57

94.64

RC

18.85

95.24

94.84

14.35

95.20

94.01

17.92

95.30

94.01

UN

51.81

94.67

95.30

51.55

93.91

93.84

50.28

94.41

94.28

67.08

94.91

95.27

63.54

94.94

93.68

66.40

94.91

94.61

RC

36.46

94.74

95.17

31.27

94.74

93.68

35.00

95.01

94.94

UN

66.20

95.00

95.63

67.83

94.80

93.84

65.43

94.97

95.31

ARH (1)

0.2

(Continued on next page)

Methodology (2017), 13(1), 9–22

Ó 2017 Hogrefe Publishing


P. Livacic-Rojas et al., Statistical Power, Covariance

17

Table 2. (Continued) MBF N

R

OCS

+

45

=

+

60

=

+

TM

AIC

CIM

BGE

WGE

INE

BGE

WGE

INE

BGE

WGE

INE

ARH (1)

62.84

94.94

95.17

64.37

95.63

94.84

62.74

95.37

95.67

RC

29.34

95.83

96.03

27.61

95.87

95.24

29.37

95.83

95.73

UN

63.44

95.30

95.40

68.70

95.73

94.28

63.07

95.50

95.63

ARH (1)

80.82

95.30

96.27

73.50

94.87

92.81

76.69

95.10

93.08

RC

61.04

94.48

96.87

44.62

94.01

92.84

54.46

94.44

93.44

UN

82.08

95.77

96.67

75.59

94.90

91.84

77.02

95.70

93.98

ARH (1)

49.82

94.87

95.17

48.05

94.94

94.91

49.42

94.90

95.07

RC

17.02

95.24

95.47

14.95

94.64

94.84

16.98

94.67

95.20

UN

48.28

95.47

95.14

51.68

94.37

93.98

46.88

94.44

95.07

ARH (1)

41.53

95.07

95.60

41.06

95.30

95.10

41.56

94.97

95.23

RC

13.32

95.24

94.91

11.59

94.94

94.54

13.32

94.80

94.24

UN

41.29

95.80

94.77

46.85

95.50

94.28

41.22

95.30

94.87

ARH (1)

66.50

95.44

95.53

63.00

94.77

93.81

64.33

94.64

93.61

RC

34.54

95.00

95.30

25.61

94.54

93.71

32.13

94.54

93.98

UN

66.87

94.61

96.07

65.70

94.14

93.81

64.30

94.24

94.68

ARH (1)

34.23

95.14

94.67

33.17

95.14

94.71

34.23

95.24

94.87

RC

08.06

95.54

95.87

06.89

95.24

95.00

08.09

95.27

95.01

UN

32.43

94.54

94.64

37.19

94.11

93.88

32.50

94.41

94.80

ARH (1)

29.04

95.70

95.53

28.61

95.40

95.73

29.07

95.34

95.57

RC

06.29

94.77

95.57

05.44

94.71

95.30

06.33

94.77

95.20

UN

28.01

95.17

95.27

33.93

94.94

94.84

28.07

95.04

95.40

ARH (1)

49.92

95.07

95.20

48.17

94.97

95.04

48.85

94.94

94.77

RC

18.11

95.01

95.33

14.22

94.87

95.27

17.65

94.87

94.84

UN

54.25

95.50

95.67

56.11

94.87

93.94

53.18

95.01

94.34

Notes. Sample size (n); Relationship between sample and dispersion matrices (R); Null relation between sample and dispersion matrix size (=); Positive relation between group sample and dispersion matrix size (+); Negative relation between sample and dispersion matrix size (−); OCS = Original Covariance Structure; ARH (1) = Heterogeneous First-Order Autoregressive; RC = Random Coefficients; UN = Unstructured; TM = Trimmed Mean; (0.1) = 10% Trimmed Mean; (0.2) = 20% Trimmed Mean; MBF = Modified Brown-Forsythe Procedure; AIC = Akaike's Information Criterion; CIM = Correctly Identified Model; BGE = Between-Group Effects; WGE = Within-Group Effects; INE = Interaction Effects; Power levels equal to or greater than 0.80 (boldface).

associated with ARH (positive relation) and RC (null and positive types of relation). For Within-Group Effects (WGE), the three procedures yield high average power levels across all analyzed conditions, with only slight differences between them (MBF = 95.24; AIC = 94.83; CIM = 94.94). For Interaction Effects (INE), the procedures yield similar levels of power (MBF = 95.40; AIC = 94.43; CIM = 94.93).

Trimmed Means at 20%

For BGE with trimmed means at 20%, the three procedures yield moderate average power levels across all analyzed conditions, with only slight differences between them (MBF = 60.67; AIC = 59.83; CIM = 59.79). In specific terms, the procedures yield the following. For BGE and n = 30, BF exhibits high levels of statistical power at 22.22% of the analyzed conditions, associated with ARH and UN (negative relation), moderate levels at 66.67%, and low levels at 11.11% with RC (positive relation). AIC and CIM show moderate levels at 88.89%

and low levels at 11.11%, associated with RC (positive relation). For n = 45, BF, AIC, and CIM show moderate levels of statistical power in 77.78% of the analyzed conditions and low levels at 22.22%, associated with RC (for null and positive types of relation). With n = 60, the three procedures yield moderate levels of statistical power at 44.44% of the analyzed conditions and low levels at 55.56%, associated with UN (null relation) and ARH (negative relation). For WGE, the three procedures yield high power levels for all analyzed conditions, with only slight differences between them (MBF = 95.21; AIC = 94.91; CIM = 94.96). For INE, the procedures yield similar levels of power (MBF = 95.47; AIC = 93.81; CIM = 94.48).

Average Levels of Statistical Power for Non-Normally Distributed Data

Table 3 shows the average levels of statistical power for the three procedures when data are non-normally distributed, with trimmed means at 10% and 20%.



Table 3. Percentages of Average Statistical Power of three procedures of covariance structure selectors with non-normally distributed data, with mean patterns far and near each other (−1, 0, 0, 1; −1, 0, 1, 0; −1, 1, 0, 0) and 10 and 20% trimmed means MBF OCS

TMBGE

WGE

30

=

ARH (1)

0.1

51.76

95.76

97.12

49.35

94.12

96.24

51.07

94.01

96.91

29.05

95.25

97.25

16.98

93.31

94.52

20.17

93.52

95.49

+

45

=

+

60

=

+

30

=

+

45

=

+

60

=

BGE

WGE

INE

CIM

R

RC

INE

AIC

N

BGE

WGE

INE

BGE

UN

54.54

95.15

97.50

53.93

94.05

95.90

51.49

93.76

96.83

ARH (1)

48.77

96.36

96.53

51.55

95.00

95.93

50.34

95.27

96.82

RC

18.40

96.33

97.18

14.67

95.60

95.50

16.42

95.73

96.70

UN

50.30

95.67

97.11

52.14

95.41

96.17

48.90

95.31

97.26

ARH (1)

71.37

94.60

98.15

61.87

91.74

95.17

63.93

91.83

95.65

RC

45.75

94.27

97.02

30.75

91.08

91.73

36.43

90.77

92.28

UN

73.58

94.62

98.02

64.42

91.58

94.38

65.71

91.63

95.68

ARH (1)

31.86

95.39

96.41

29.99

93.78

95.72

31.39

93.75

96.13

RC

06.65

94.82

95.79

06.17

93.32

94.45

11.67

93.53

95.04

UN

31.88

94.98

96.50

33.71

93.60

95.40

31.18

93.63

95.86

ARH (1)

33.02

96.03

96.42

28.32

95.46

96.34

28.40

95.39

96.41

RC

04.70

95.39

96.34

04.33

95.03

95.38

04.61

95.00

96.03

UN

27.95

94.88

96.09

30.51

95.11

95.85

27.52

95.01

96.02

ARH (1)

49.20

94.81

96.66

43.94

92.38

95.74

45.39

92.44

95.64

RC

21.42

94.84

96.52

13.95

92.06

92.88

16.87

92.11

93.11

UN

49.67

94.73

96.91

48.34

92.04

94.77

46.12

92.21

95.43

ARH (1)

20.21

95.25

95.80

19.28

94.19

95.70

19.99

94.23

95.45

RC

02.21

94.48

95.77

01.43

93.77

94.66

01.70

93.91

94.70 95.86

UN

20.07

94.39

95.99

21.60

93.79

95.68

19.66

93.88

ARH (1)

20.19

95.32

95.79

19.73

94.92

96.32

19.90

94.92

93.14

RC

01.16

95.39

95.57

01.08

94.89

95.07

01.07

94.81

95.27

UN

16.24

95.47

95.78

18.03

95.09

95.64

16.11

95.11

96.21

ARH (1)

34.88

95.00

96.51

31.90

92.27

95.81

33.15

92.42

95.55

RC

08.28

94.29

95.50

05.54

92.03

93.16

06.68

92.31

93.03

UN

34.37

93.75

96.52

32.62

92.07

95.23

31.07

92.22

95.34 87.23

ARH (1)

65.98

95.45

98.40

49.36

94.36

96.47

55.28

94.20

RC

0.2

39.83

94.32

98.07

19.86

93.67

94.71

23.20

93.77

95.84

UN

70.85

94.39

98.51

51.52

94.26

96.03

57.70

94.21

96.84

ARH (1)

63.86

91.06

97.83

52.14

95.76

96.35

51.85

95.53

97.29

RC

27.74

91.17

98.15

14.89

95.48

95.80

17.04

95.56

97.13

UN

62.65

96.19

98.26

51.82

95.67

96.13

48.34

95.44

97.27

ARH (1)

81.39

94.02

98.48

65.57

92.02

95.07

67.94

92.05

95.55 92.68

RC

68.21

92.44

98.32

30.71

90.46

92.13

36.87

90.72

UN

85.04

93.30

98.66

65.28

93.19

95.00

65.75

92.10

95.76

ARH (1)

43.63

94.43

97.74

35.14

94.24

96.36

35.66

94.19

96.39

RC

15.84

93.56

97.26

07.06

93.70

94.46

08.26

93.73

95.08

UN

48.38

93.13

97.70

40.51

93.69

95.65

38.90

93.74

96.04

ARH (1)

30.04

94.69

96.99

32.62

95.21

96.40

32.33

95.10

96.31

RC

06.51

93.91

97.21

03.90

94.97

95.73

04.29

95.02

96.43

UN

41.45

93.99

97.64

37.36

95.04

95.66

35.35

95.11

96.28

ARH (1)

58.70

94.35

97.99

48.00

92.26

96.08

49.04

92.48

95.55

RC

26.83

93.56

96.85

13.64

92.10

92.75

16.70

92.34

92.74

UN

59.91

94.01

97.91

50.77

92.69

95.02

50.05

92.85

95.78

ARH (1)

29.79

93.65

96.95

23.06

93.99

95.83

23.56

93.93

95.83

RC

02.90

93.80

96.31

01.60

93.88

94.80

01.79

94.17

95.07

UN

33.29

92.29

97.30

27.43

93.81

95.68

26.19

93.89

95.96

(Continued on next page)

Methodology (2017), 13(1), 9–22

Ó 2017 Hogrefe Publishing


P. Livacic-Rojas et al., Statistical Power, Covariance

19

Table 3. (Continued) MBF N

R

OCS

+

TMBGE

AIC

CIM

WGE

INE

BGE

WGE

INE

BGE

WGE

INE

BGE

ARH (1)

24.86

94.55

96.96

20.03

95.29

96.31

16.96

95.19

96.11

RC

02.08

93.33

96.20

01.13

94.41

95.33

01.25

94.43

95.69 95.71

UN

23.55

93.13

96.59

21.11

94.64

96.09

19.56

94.62

ARH (1)

45.90

93.45

97.30

34.72

92.80

95.65

35.75

92.64

95.64

RC

13.17

91.96

95.90

05.96

91.60

92.97

07.21

91.66

92.67

UN

49.76

92.46

97.47

40.16

92.60

95.26

39.30

92.82

95.60

Notes. Sample size (n); Relationship between sample and dispersion matrices (R); Null relation between sample and dispersion matrix size (=); Positive relation between group sample and dispersion matrix size (+); Negative relation between sample and dispersion matrix size ( ); OCS = Original Covariance Structure; ARH (1) = Heterogeneous First-Order Autoregressive; RC = Random Coefficients; UN = Unstructured; TM = Trimmed Mean; 0.1 = 10% Trimmed Mean; 0.2 = 20% Trimmed Mean; MBF = Modified Brown-Forsythe Procedure; AIC = Akaike’s Information Criterion; CIM = Correctly Identified Model; BGE = Between-Group Effects; WGE = Within-Group Effects; INE = Interaction Effects; Power levels equal to or greater than 0.80 (boldface).

Trimmed Means at 10%

For BGE with trimmed means at 10%, the MBF yields moderate levels of statistical power on average (MBF = 31.76), while AIC and CIM yield low levels on average for all analyzed conditions, with only slight differences between them (AIC = 29.12; CIM = 29.52). In specific terms, these procedures yield the following. For BGE and n = 30, the three procedures show moderate levels of statistical power at 77.78% of the analyzed conditions and low levels at 22.22%, associated with RC (null and positive relation). For n = 45, BF shows moderate levels at 55.56% and low levels at 44.44% of the analyzed conditions, associated with RC (all three types of relation) and UN (positive relation). AIC and CIM display moderate levels at 44.44% and low levels at 55.56%, associated with RC (all three types of relation) and ARH (positive relation). For n = 60, the three procedures exhibit moderate levels at 33.33% of the analyzed conditions and low levels at 66.67%, associated with ARH, RC, and UN (all three types of relation). For Within-Group Effects (WGE), the three procedures yield high average power levels across all analyzed conditions, with only slight differences between them (MBF = 95.08; AIC = 93.62; CIM = 93.66). For Interaction Effects (INE), the procedures yield similar levels of power (MBF = 96.55; AIC = 95.16; CIM = 95.48).

Trimmed Means at 20%

For BGE with trimmed means at 20%, the three procedures yield moderate levels of statistical power on average for all analyzed conditions, with an average advantage of about 8% for MBF over AIC and CIM (MBF = 41.56; AIC = 31.31; CIM = 32.08). Specifically, the procedures yield the following. For BGE and n = 30, BF exhibits high levels of statistical power at 22.22% of the analyzed conditions, associated with ARH and UN (negative relation), moderate levels at 66.67%, and low levels at 11.11% with RC (positive relation). AIC and CIM show moderate levels at 77.78% and

low levels at 22.22% of the analyzed conditions with RC (null and positive types of relation). With n = 45, the three procedures exhibit moderate power levels at 66.67% and low levels at 33.33%, associated with RC (the three types of relation). For n = 60, BF exhibits moderate levels at 66.67% of the analyzed conditions and low levels at 33.33% with RC (the three types of relation). AIC and CIM show moderate levels at 22.22% and low levels at 77.78% (except for ARH and UN; negative relation). For WGE, the three procedures yield high power levels for all analyzed conditions, with only slight differences between them (MBF = 93.58; AIC = 93.77; CIM = 93.76). For INE, the procedures yield similar levels of power (MBF = 97.52; AIC = 95.32; CIM = 95.35).

Discussion and Conclusions

This study analyzed the power levels of three procedures used for selecting covariance structures. The overall results show high power levels only for the within-group and interaction effects, for the three procedures, in the different conditions analyzed. For the between-group effects, high power levels were observed mainly for the MBF procedure, with normally and non-normally distributed data, with and without 10% and 20% trimmed means, in groups of 30 cases, and for the ARH and UN matrices (negative relation). Along the same line, the three procedures tend to show lower levels of statistical power when the sample size is 60, mainly in association with RC (null and positive types of relation) when data are non-normally distributed. In addition, trimming the means yields a comparatively larger increase in statistical power when data are non-normally distributed. The results of this study are consistent with those reported by Vallejo et al. (2001, 2006), Vallejo, Arnau, et al. (2007), Vallejo, Fernández, and Livacic-Rojas (2007), Vallejo, Fernández, Herrero, et al. (2007), and



Livacic-Rojas et al. (2010, 2013), in which MBF shows higher power levels than AIC and CIM for data that vary across different conditions. These findings could occur with MBF because it does not require a known covariance structure underlying the data; it is therefore more realistic in estimating a larger number of parameters with data taken at different points in time, and it is more robust to heterogeneity of the data. Similarly, when data are heterogeneous with an unstructured covariance matrix (UN), the BF procedure modified by Vallejo et al. (2006) provides degrees of freedom (df) similar to those obtained using SAS PROC MIXED with the METHOD = REML and DDFM = KR options. Furthermore, the df calculated via the two procedures are fully matched, and mismatched only when compared for the interaction contrast product (see Stroup, 2013, for more information). This can be checked by running the MBF and SAS PROC MIXED on the data taken from Nunez, Rosario, Vallejo, and González-Pienda (2013). Additionally, the lower sensitivity of AIC and CIM in the presence of heterogeneity in the data could also generate lower efficiency in the correct choice of the covariance matrix (associated with RC, and most visible for the between-group effects). This observation agrees in part with the studies of Vallejo et al. (Vallejo, Arnau, et al., 2007; Vallejo, Fernández, & Livacic-Rojas, 2007; Vallejo, Fernández, Herrero, et al., 2007), Liu et al. (2012), Wilcox (2012), and Livacic-Rojas et al. (2013), since the conditions mentioned can affect the accuracy of the inferences when testing the design hypotheses, specifically those based on the AIC. Liu et al. (2012) pointed out that the AIC is a criterion of low effectiveness in the presence of heterogeneous data associated with the RC matrix. As Stroup (2013) stated, identifying an adequate covariance model is essential to analyze data, control Type I error rates, and correctly detect the effects of treatment (when statistical power increases) in repeated-measures designs. Furthermore, the results reported here are also consistent with those presented by Livacic-Rojas et al. (2010) and Stroup (2013), in which substantial importance was given, in the context of longitudinal designs, to the observation of higher power levels for the interaction effects. By contrast, the results of this study do not match those of Vallejo et al. (2008, 2010, 2011b), since the power levels of the covariance structure selectors are not high when the sample size increases and the complexity of the matrix decreases. Similarly, and as indicated by Vallejo et al. (2014), it is important for the researcher to consider that, even when the covariance structure selectors show significant results, they may have different effect sizes depending on the number of groups and of subjects within them, which may affect the power levels observed and the interpretation of the results.


The results of this research are useful for readers of Methodology because it is necessary to know in advance how covariance structure selectors perform under different conditions, given how they behave with respect to Type I error rates and the only moderate levels of statistical power they yield for main effects. Consequently, this may affect the accuracy of the results and conclusions of a study, because it reduces the chance of identifying the treatment effects (see also Stroup, 2013). Moreover, the mixed model here utilizes the AIC instead of another criterion such as the BIC or the CAIC because, despite selecting the correct model at a low frequency, it is the criterion that exhibits the greatest efficiency and its use by researchers is highly frequent. Finally, in addition to considering the existing evidence (Fernández, Livacic-Rojas, Vallejo, & Tuero-Herrero, 2014) regarding the performance of covariance structure selectors for single-level and multilevel designs, future studies should compare and analyze the power levels of the three covariance structure selectors used here and of corrected ones (CAIC, AICC, and DIC) in one-level and multilevel designs (fixed and random effects), since the lowest power levels are observed in association with the RC matrix (which represents hierarchical models) and because of the difficulty of treating the error structure (Liu et al., 2012) equally at all levels (Vallejo et al., 2013, 2014). What is more, it is necessary to analyze how these procedures work within the syntax of generalized linear mixed models, as Stroup (2013) has proposed.

Acknowledgments

This research was funded by the Chilean National Fund for Scientific and Technological Development (FONDECYT, Ref.: 1120271) and the Spanish Ministry of Economy and Competitiveness (Grants PSI-2011-23395 and PSI20156730-P).

Electronic Supplementary Material

The electronic supplementary material is available with the online version of the article at http://dx.doi.org/10.1027/1614-2241/a000124

ESM 1. Tables (PDF). Supplementary tables of the research.

References Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on automatic Control, AC-19, 716–723. doi: 10.1109/TAC.1974.1100705 Algina, J., & Keselman, H. J. (1997). Testing repeated measures hypotheses when covariance matrices are heterogeneous: Revisiting the robustness of the Welch-James test. Multivariate Behavioral Research, 32, 255–274.




Algina, J., & Keselman, H. J. (1998). A power comparison of the Welch-James and Improved General Approximation test in the split-plot design. Journal of the Educational and Behavioral Statistics, 23, 152–159. Ato, M., Vallejo, G., & Palmer, A. (2013). The two-way mixed model: A long and winding controversy. Psicothema, 25, 130–136. doi: 10.7335/psicothema2012.15 Berkovits, I., Hancock, G. R., & Nevitt, J. (2000). Bootstrap resampling approaches for repeated measure designs: Relative robustness to sphericity and normality violations. Educational and Psychological Measurement, 60, 877–892. Cohen, J. (1992). Quantitative methods in Psychology. Psychological Bulletin, 112, 155–159. Davidson, M. L. (1972). Univariate versus multivariate test in repeated measures experiments. Psychological Bulletin, 77, 446–452. Fernández, P., Livacic-Rojas, P., Vallejo, G., & Tuero-Herrero, E. (2014). Where to look for information when planning scientific research in Psychology: Sources and channels. International Journal of Clinical and Health Psychology, 14, 76–82. Fitzmaurice, G. M., Laird, N. M., & Ware, J. H. (2004). Applied longitudinal analysis. Hoboken, NJ: Wiley. Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521–532. Greven, S., & Kneib, T. (2010). On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika, 97, 1–17. Hamaker, E. L., Van Hattum, P., Kuiper, R. M., & Hoijtink, H. (2011). Model selection based on information criteria in multilevel modeling. In J. J. Hox & J. K. Roberts (Eds.), Handbook of advanced multilevel analysis (pp. 231–255). New York, NY: Taylor & Francis. Jennrich, R. I., & Schluchter, M. D. (1986). Unbalanced repeatedmeasures models with structured covariance matrices. Biometrics, 42, 805–820. doi: 10.2307/2530695 Kenward, M. G., & Roger, H. J. (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53, 983–997. Keselman, H. J., Algina, J., & Kowalchuk, R. K. (2001). The analysis of the repeated measures design: A review. British Journal of Mathematical and Statistical Psychology, 54, 1–20. Keselman, H. J., Algina, J., Kowalchuk, R. K., & Wolfinger, R. D. (1998). A comparison of two approaches for selecting covariance structures in the analysis of repeated measurements. Communications in Statistics – Simulation and Computation, 27, 591–604. Kowalchuk, R. K., Keselman, H. J., Algina, J., & Wolfinger, R. D. (2004). The analysis of repeated measurements with mixedmodel adjusted F test. Educational and Psychological Measurements, 64, 224–242. Krishnamoorthy, K., & Yu, J. (2004). Modified Nel and Van der Merwe test for the multivariate Behrens-Fisher problem. Statistics & Probability Letters, 66, 161–169. Laird, N. M., & Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963–974. doi: 10.2307/ 2529876 Liu, S., Rovine, M. J., & Molenaar, P. C. M. (2012). Selecting a linear mixed model for longitudinal data: Repeated measures analysis of variance, covariance pattern model, and growth curve approaches. Psychological Methods, 17, 15–30. doi: 10.1037/a0026971 Livacic-Rojas, P., Vallejo, G., & Fernández, P. (2010). Analysis of type I error rates of univariate and multivariate procedures in repeated measures designs. Communications in


Statistics – Simulation and Computation, 39, 624–640. doi: 10.1080/03610910903548952 Livacic-Rojas, P., Vallejo, G., Fernández, P., & Tuero-Herrero, E. (2013). Covariance structures selection and type I error rates in split plot designs. Methodology. European Journal of Research Methods for the Behavioural and Social Sciences, 9, 129–136. doi: 10.1027/1614-2241/a000058 Lix, L. M., & Lloyd, A. M. (2006, April). A comparison of methods for the analysis of doubly multivariate data. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA. Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156–166. Nel, D. G. (1997). Tests for equality of parameter matrices in two multivariate linear models. Journal of Multivariate Analysis, 61, 29–37. Nel, D. G., & Van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher problem. Communications in Statistics – Theory and Methods, 15, 3719–3735. Nunez, J. C., Rosario, P., Vallejo, G., & González-Pienda, J. A. (2013). A longitudinal assessment of the effectiveness of a school-based mentoring program in middle school. Contemporary Educational Psychology, 38, 11–21. Ramsey, P. H. (1978). Power differences between pairwise multiple comparisons. Journal of the American Statistical Association, 73, 479–485. SAS Institute. (2005). The MIXED procedure 2005, SAS/STAT user’s guide, Version 9. SAS On-Line Documentation. Cary, NC: SAS Institute. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society Series B, 64, 583–640. Srivastava, M. S., & Kubokawa, T. (2010). Conditional information criteria for selecting variables in linear mixed models. Journal of Multivariate Analysis, 101, 1970–1980. Stroup, W. W. (2002). Power analysis based on spatial effects mixed models: A tool for comparing design and analysis strategies in the presence of spatial variability. Journal of Agricultural, Biological, and Environmental Statistics, 7, 491–511. Stroup, W. W. (2013). Generalized linear mixed models. Modern concepts, methods and applications. New York, NY: CRC Press. Timm, N. H. (2002). Applied multivariate analysis. New York, NY: Springer. Vaida, F., & Blanchard, S. (2005). Conditional Akaike information for mixed-effects models. Biometrika, 92, 351–370. Vale, C. D., & Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions. Psychometrika, 48, 465–471. Vallejo, G., Arnau, J., & Ato, M. (2007). Comparative robustness of recent methods for analyzing multivariate repeated measures. Educational & Psychological Measurement, 67, 1–27. Vallejo, G., Arnau, J., Bono, R., Fernández, M. P., & Tuero-Herrero, E. (2010). Nested model selection for longitudinal data using information criteria and the conditional adjustment strategy. Psicothema, 22, 323–333. Vallejo, G., & Ato, M. (2006). Modified Brown-Forsythe procedure for testing interaction effects in split-plot designs. Multivariate Behavioral Research, 41, 549–578. Vallejo, G., Ato, M., Fernández, P., & Livacic-Rojas, P. (2013). Multilevel bootstrap analysis with assumptions violated. Psicothema, 25, 520–528. doi: 10.7334/psicothema2013.58




Vallejo, G., Ato, M., & Valdés, T. (2008). Consequences of misspecifying the error covariance structure in linear mixed models for longitudinal data. Methodology, 4, 10–21. Vallejo, G., Fernández, M. P., Herrero, J., & Livacic-Rojas, P. (2007). Examen comparativo de la sensibilidad de dos enfoques robustos para detectar los efectos de un diseño doblemente multivariado. Revista mexicana de Psicología, 24, 53–64. Vallejo, G., Fernández, M. P., & Livacic-Rojas, P. (2007). Power differences between the modified Brown-Forsythe and mixedmodel approaches in repeated measures designs. Methodology, 3, 1–13. Vallejo, G., Fernández, M. P., Livacic-Rojas, P. E., & Tuero-Herrero, E. (2011a). Selecting the best unbalanced repeated measures model. Behavior Research Methods, 43, 18–36. Vallejo, G., Fernández, P., Livacic-Rojas, P., & Tuero-Herrero, E. (2011b). Comparison of modern methods for analyzing repeated measures data with missing values. Multivariate Behavioral Research, 46, 900–937. Vallejo, G., Fidalgo, A. M., & Fernández, P. (2001). Effects of covariance heterogeneity on three procedures for analysing multivariate repeated measures designs. Multivariate Behavioral Research, 36, 1–27. Vallejo, G., & Livacic-Rojas, P. (2005). A comparison of two procedures for analyzing small sets of repeated measures data. Multivariate Behavioral Research, 40, 179–205. Vallejo, G., Tuero-Herrero, E., Núñez, J. C., & Rosário, P. (2014). Performance evaluation of recent information criteria for selecting multilevel models in Behavioral and Social Sciences. International Journal of Clinical and Health Psychology, 14, 48–57. Wilcox, R. (2012). Introduction to Robust Estimation & Hypothesis Testing (3rd ed.). Waltham, UK: Elsevier.

Received July 24, 2014 Revision received November 20, 2015 Accepted April 15, 2016 Published online March 22, 2017


Pablo Livacic-Rojas is Associate Professor of Methodology in Psychology at the University of Santiago (Chile). He teaches at the Faculty of Humanities and the Faculty of Medical Sciences. At the research level, he specializes in robust statistical procedures, covariance structure selectors, and problems in multilevel and longitudinal research.

Guillermo Vallejo is Professor of Research Designs in Psychology at the University of Oviedo (Spain). He has published numerous articles in scholarly journals, including papers on several methodological problems in multilevel and longitudinal research. He is currently working on the sample sizes required for longitudinal intervention studies with heterogeneous errors and incomplete data.

Paula Fernández is a permanent lecturer at the School of Psychology of the University of Oviedo (Spain) in the area of behavioral science methodology. Her main areas of research are robust statistical procedures for repeated measures and for incomplete data, in both experimental and quasi-experimental designs, and the analysis of research quality.

Ellián Tuero is an assistant professor in the Department of Psychology of the University of Oviedo (Spain). She teaches at the Faculty of Psychology and the Faculty of Teacher Training and Education. At the research level, she specializes in the analysis of repeated measures designs and in data structures that require multilevel analysis strategies.

Pablo Livacic-Rojas Universidad de Santiago de Chile Facultad de Humanidades Avenida Ecuador 3650, Tercer Piso Estación Central Santiago de Chile Chile pablo.livacic@usach.cl



Original Article

Performance of Combined Models in Discrete Binary Classification

Anabela Marques,1 Ana Sousa Ferreira,2,3 and Margarida G. M. S. Cardoso3

1 Barreiro College of Technology, Polytechnic Institute of Setúbal, Portugal
2 Faculdade de Psicologia, Universidade de Lisboa & Business Research Unit (BRU-IUL), Portugal
3 Instituto Universitário de Lisboa (ISCTE-IUL), Business Research Unit (BRU-IUL), Lisboa, Portugal

Abstract: Diverse Discrete Discriminant Analysis (DDA) models perform differently in different samples. This fact has encouraged research in combined models which seems particularly promising when the a priori classes are not well separated or when small or moderate sized samples are considered, which often occurs in practice. In this study, we evaluate the performance of a convex combination of two DDA models: the First-Order Independence Model (FOIM) and the Dependence Trees Model (DTM). We use simulated data sets with two classes and consider diverse data complexity factors which may influence performance of the combined model – the separation of classes, balance, and number of missing states, as well as sample size and also the number of parameters to be estimated in DDA. We resort to cross-validation to evaluate the precision of classification. The results obtained illustrate the advantage of the proposed combination when compared with FOIM and DTM: it yields the best results, especially when very small samples are considered. The experimental study also provides a ranking of the data complexity factors, according to their relative impact on classification performance, by means of a regression model. It leads to the conclusion that the separation of classes is the most influential factor in classification performance. The ratio between the number of degrees of freedom and sample size, along with the proportion of missing states in the minority class, also has significant impact on classification performance. An additional gain of this study, also deriving from the estimated regression model, is the ability to successfully predict the precision of classification in a real data set based on the data complexity factors. Keywords: classification performance, combined models for classification, discrete discriminant analysis, separability

Some researchers have tried to understand the relationship between data characteristics and the performance of classifiers. For example, Ho and Basu (2002) studied the case of two-class problems and described the nature of classification difficulty. They enumerated diverse measures of the complexity of a classification problem and adopted a typology considering (1) measures of overlap of individual features, (2) measures of class separability, and (3) measures of geometry, topology, and density of manifolds. Sotoca, Sanchez, and Mollineda (2005) used those measures and added other statistical measures (e.g., number of binary attributes, number of classes, entropy of classes, mean absolute correlation coefficients between two features, etc.) when conducting a meta-analysis of classifiers. Finch and Schneider (2007) considered the weight of classes and three factors related to continuous predictors. Macia, Bernadó-Mansilla, and Orriols-Puig (2008) also used measures of geometry to characterize the complexity of data sets and studied binary classification. These authors considered several scenarios for synthetic continuous data, controlling the number of instances and the number of attributes, and focused on the length of the class boundary to assess the complexity of the data set.

Studies referring to the performance of classification based on nominal predictors are extremely rare in the literature. In this study, we conduct numerical experiments to evaluate the performance of binary classifiers in Discrete Discriminant Analysis (DDA), aspiring to contribute to filling this gap in the literature. In DDA, each object is described by P discrete variables and is assumed to belong to one of K exclusive classes (C1, C2, . . ., CK) with prior probabilities $\pi_1, \pi_2, \ldots, \pi_K$ $\left(\sum_{i=1}^{K}\pi_i = 1\right)$. The purpose of DDA is to derive a classification rule for the future assignment of objects described by the P discrete variables, but with unknown class membership, to one of the K classes. In order to obtain this classification rule, an n-dimensional sample $X = (x_1, x_2, \ldots, x_n)$ is used, for which the class membership of each object is known. This study is geared toward the social sciences and humanities, where classification problems frequently concern participants described by qualitative variables (often binary variables) and small or even very small samples. In this case, our recent research has shown that the classification methods generally used nowadays, such as classification and regression trees (CART) or random forests (RF), tend



to reveal poorer performance than our combined DDA models approach (Marques, Sousa Ferreira, & Cardoso, 2010, 2015). With a view to further exploring the performance of our approach in complex data sets, we established different scenarios using simulated data sets considering diverse data complexity factors. The generated data sets are meant to provide means to compare the performance of single and combined DDA models and to provide new insights into the impact of data complexity factors on discrete classification performance. More specifically, we focus on classification problems in very small, small, and moderate sized samples, which render classification tasks harder (Ho & Basu, 2002) and, in our estimation, discrete classification tasks even harder.

Methodology A Combined Model for Classification In the present study, we address Discrete Discriminant Analysis (DDA) tasks – to classify and discriminate multivariate observations of discrete variables into a priori defined classes – using a combined model proposed by Marques, Sousa Ferreira, and Cardoso (2013). It is important to note that DDA is usually used with two objectives: a predictive or a descriptive goal. That is, we may either wish to classify new objects with unknown membership to one of the K a priori defined classes or to assess the adequacy of classification, determining which of the explicative variables accounts the most for discriminating between two or more classes. Commonly, in supervised classification, several models are estimated and a unique classifier is selected based on certain validation criteria. However, the discarded classifiers usually contain important information regarding the classification problem which is lost by selecting a single classifier (Brito, Celeux, & Sousa Ferreira, 2006). In addition, it is often observed that misclassified objects are different for different models. This fact has recently encouraged a large number of publications, from several areas of research, which focus on the combination of classification models (e.g., Amershi & Conati, 2009; Breiman, 1996, 1998; Brito, 2002; Cesa-Bianchi, Claudio, & Luca, 2006; Freund & Schapire, 1996; Friedman, 2001; Friedman, Hastie, & Tibsharani, 1998; Friedman & Popescu, 2008; Janusz, 2010; Kotsiantis, Zaharakis, & Pintelas, 2006; Kotsiantis, 2011; Milgram, Cheriet, & Sabourin, 2004; Re & Valentini, 2012; Sousa Ferreira, Celeux, & Bacelar-Nicolau, 2000; Wolpert, 1992). In scientific literature, the combining approach takes on a number of different designations, such as, for instance, Methodology (2017), 13(1), 23–37

Blending (Elder & Pregibon, 1996), Ensemble of Classifiers (Dietterich, 1997), Committee of Experts (Steinberg, 1997), Perturb and Combine (Breiman, 1996), and Combiners (Jain, Duin, & Mao, 2000). However, all the authors have focused on quite a simple idea: to train one model on several samples from the same data set, or to train several models on the same data, and to combine their output predictions, usually by means of a voting process. Examples of the first strategy are Bagging (Breiman, 1996), which uses bootstrap samples of the training data set; Boosting (Freund & Schapire, 1996), which weights cases misclassified by decision tree models more heavily; and Arcing (Breiman, 1998), which weights random subsamples of the training data set. On the other hand, training diverse types of models can achieve uncorrelated output predictions and thus reduce the misclassification error rate (Abbott, 1999; Amershi & Conati, 2009; Brito, 2002; Brito et al., 2006; Cesa-Bianchi et al., 2006; Janusz, 2010; Kotsiantis, 2011; Sousa Ferreira, 2000, 2004). Although many of the combined models for classification proposed in the literature may be applied to problems with discrete explanatory variables, studies in the literature focus heavily on continuous data. Therefore, our research sets out to combine models in DDA, a natural approach which usually increases classification performance (Sousa Ferreira, 2000, 2004, 2010).

In discrete classification problems, two reference models may be considered: the Full Multinomial Model (FMM) and the First-Order Independence Model (FOIM). The most natural model for discrete data is to assume that the conditional probability functions $P(x \mid C_k)$, where $x \in \{0, 1\}^P$, k = 1, . . ., K, are multinomial probabilities. In this case, the conditional probabilities are estimated by the observed frequencies. Goldstein and Dillon (1978) refer to this as the FMM. This model takes the interactions of variables into account and involves $2^P - 1$ parameters in each class. Hence, even for moderate P, not all the parameters are identifiable.

In order to clarify the interest of the above-mentioned models – FMM and FOIM – we will introduce an example, showing how each model deals with the data in order to classify new objects, in the future, into one of the considered a priori classes. Let us assume a problem with two classes and three binary explanatory variables (observed values 0 or 1). The values observed in this problem can then take $2^3 = 8$ distinct forms, which are summarized in a state matrix presented in Table 1, for a sample of 30 observations and two a priori defined classes (n1 = 10 and n2 = 20). As already explained, in the FMM, the maximum-likelihood (ML) estimator of the probability of occurrence of each state in each class is the relative frequency observed in each class. Thus, for the FMM we present the estimated conditional probabilities in Table 2. As shown, the FMM will


Table 1. Distribution of the observed frequencies, by state and by class

State   (x1, x2, x3)   C1 (freq.)   C2 (freq.)
1       (0, 0, 0)      0            0
2       (0, 0, 1)      0            1
3       (0, 1, 0)      1            8
4       (0, 1, 1)      3            7
5       (1, 0, 0)      6            2
6       (1, 0, 1)      0            0
7       (1, 1, 0)      0            1
8       (1, 1, 1)      0            1
Total                  10           20

Table 2. Probability estimates of the occurrence of state i in class k through the FMM (i = 1, . . ., 8 and k = 1, 2)

State   (x1, x2, x3)   f1(x|X)   f2(x|X)   Decision (class chosen by FMM)
1       (0, 0, 0)      0.00      0.00      ?
2       (0, 0, 1)      0.00      0.05      C2
3       (0, 1, 0)      0.10      0.40      C2
4       (0, 1, 1)      0.30      0.35      C2
5       (1, 0, 0)      0.60      0.10      C1
6       (1, 0, 1)      0.00      0.00      ?
7       (1, 1, 0)      0.00      0.05      C2
8       (1, 1, 1)      0.00      0.05      C2

Note. The symbol "?" means that in this state a class decision is impossible: the estimated probabilities equal zero in both classes, since this state is not observed in the training sample.

Table 3. Probability estimates of the occurrence of state i in class k through the FOIM (i = 1, . . ., 8 and k = 1, 2)

State   (x1, x2, x3)   f1(x|X)   f2(x|X)   Decision (class chosen by FOIM)
1       (0, 0, 0)      0.1680    0.0660    C1
2       (0, 0, 1)      0.0720    0.0540    C1
3       (0, 1, 0)      0.1120    0.3740    C2
4       (0, 1, 1)      0.0480    0.3060    C2
5       (1, 0, 0)      0.2520    0.0165    C1
6       (1, 0, 1)      0.1080    0.0135    C1
7       (1, 1, 0)      0.1680    0.0935    C1
8       (1, 1, 1)      0.0720    0.0765    C2
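To make the FMM estimation step concrete, the following R sketch reproduces the Table 2 estimates from the Table 1 frequencies; the object names are our own and purely illustrative.

# Table 1: observed frequencies of the 8 states (rows) in classes C1 and C2 (columns)
freq <- matrix(c(0, 0,
                 0, 1,
                 1, 8,
                 3, 7,
                 6, 2,
                 0, 0,
                 0, 1,
                 0, 1), ncol = 2, byrow = TRUE,
               dimnames = list(c("000", "001", "010", "011",
                                 "100", "101", "110", "111"), c("C1", "C2")))

# FMM: the ML estimate of P(state | class) is the within-class relative frequency
fmm <- sweep(freq, 2, colSums(freq), "/")
round(fmm, 2)   # matches Table 2; note the many zero cells (sparseness)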

As shown in Table 2, the FMM will have problems when applied to test samples or to new objects whose states were not observed in the training sample. Since, in the social sciences and humanities domain, data sets are often small or very small relative to the number of probabilities to be estimated, DDA is confronted with a problem of sparseness, as many of the multinomial cells are not observed in the training sets. On the other hand, the FOIM (Goldstein & Dillon, 1978) assumes that the P discrete variables are independent in each class Ck, k = 1, . . ., K. Hence, the number of parameters to be estimated for each class is reduced from 2^P − 1 to P. Thus, for the above example, the FOIM probability estimate of the occurrence of the state (0, 1, 1) in Class 1 is given by a product of marginal probability estimates – 0.40 × 0.40 × 0.30 = 0.048 – and in Class 2 by 0.80 × 0.85 × 0.45 = 0.306. The probability estimates for this example through the FOIM are presented in Table 3. As shown, the FOIM can substantially reduce the number of parameters to be estimated and hence provides good performance in a large number of cases. However, when the hypothesis of independence between the variables within each class is too unrealistic, even with this model it is difficult to obtain an adequate percentage of correctly classified observations.
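The FOIM estimates in Table 3 can be obtained in the same way from the per-variable marginal counts; a minimal R sketch, reusing the freq matrix from the sketch above (function and object names are ours):

# States coded as a matrix of the three binary variables, same row order as freq
states <- matrix(c(0,0,0, 0,0,1, 0,1,0, 0,1,1, 1,0,0, 1,0,1, 1,1,0, 1,1,1),
                 ncol = 3, byrow = TRUE)

# FOIM: estimate P(x_p = 1 | class) from marginal counts, then multiply across variables
foim_state_prob <- function(state, class_freq) {
  n  <- sum(class_freq)
  p1 <- colSums(states * class_freq) / n       # P(x_p = 1 | class)
  prod(ifelse(state == 1, p1, 1 - p1))
}

foim <- t(apply(states, 1, function(s)
  c(C1 = foim_state_prob(s, freq[, "C1"]),
    C2 = foim_state_prob(s, freq[, "C2"]))))
round(foim, 4)   # matches Table 3
foim[4, ]        # state (0, 1, 1): 0.048 in C1 and 0.306 in C2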

The two previous tables (Tables 2 and 3) show that the FMM and the FOIM can provide different classifiers and have, therefore, been considered for integration in combined models (Sousa Ferreira, 2000). However, as the FMM demonstrated poor performance in small data sets, we replaced the FMM with another model, the Dependence Trees Model (DTM), which also takes the interactions between variables into account.

In the DTM, the conditional probability function may be estimated by using a dependence tree that represents the most important predictor relations. We use the Chow-Liu algorithm (Celeux & Nakache, 1994; Pearl, 1988) to implement the dependence tree and approximate the conditional probability function. In this algorithm, the mutual information between two variables is used to measure the closeness between two probability distributions. The DTM thus provides an estimate of the conditional probability function for each class based on the idea that, through the knowledge of a graph G whose P vertices are X1, . . ., XP, the probability distribution f_G associated with this graph can be calculated as the product of conditional probabilities:

f_G(x_1, \ldots, x_P) = f(x_{r(p)}) \prod_{l(p)=1}^{P-1} f(x_p \mid x_{l(p)})    (1)

where x_{l(p)} represents the variable that is linked to the variable x_p in this graph, one vertex being arbitrarily chosen as the root of the graph, x_{r(p)}. In order to construct the graph for each class, we rely on the aforementioned Chow-Liu algorithm, where the length of each edge refers to the pair of variables (x_p, x_{p'}) and represents a measure of the association between these variables. The mutual information (I) is used as the measure of association between each pair of variables and is defined as follows:

I(X_p, X_{p'}) = \sum_{x_p} \sum_{x_{p'}} f(x_p, x_{p'}) \log \frac{f(x_p, x_{p'})}{f(x_p)\, f(x_{p'})}    (2)



Table 4. Mutual information I(Xp, Xp')

(Xp, Xp')   C1      C2
(x1, x2)    0.673   0.097
(x1, x3)    0.386   0.021
(x2, x3)    0.386   0.005
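The values in Table 4 follow directly from Equation (2) applied to the within-class joint relative frequencies. A small R sketch (the function name is ours; natural logarithms are assumed, which reproduces the value 0.673 for (x1, x2) in C1):

# Mutual information between two binary variables from a 2 x 2 joint frequency table
mutual_info <- function(joint_freq) {
  p  <- joint_freq / sum(joint_freq)        # joint relative frequencies
  px <- rowSums(p); py <- colSums(p)
  terms <- p * log(p / outer(px, py))       # cells with p = 0 contribute nothing
  sum(terms[p > 0])
}

# Class C1, pair (x1, x2): counts from Table 1 (rows: x1 = 0, 1; columns: x2 = 0, 1)
mutual_info(matrix(c(0, 4,
                     6, 0), nrow = 2, byrow = TRUE))   # 0.673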

where f(x_p, x_{p'}) is estimated using the maximum-likelihood approach. After calculation of the C(P, 2) mutual information values, the graph G with P − 1 edges corresponding to the highest total mutual information is selected.

Let us now see how the DTM estimates the conditional probability function, beginning with the mutual information values presented in Table 4. According to the results in Table 4, the probability distribution of the first-order dependence tree is

f_{C_1}(x \mid X) = \hat{f}(x_1 \mid X)\, \hat{f}(x_2 \mid x_1, X)\, \hat{f}(x_3 \mid x_1, X)    (3)

f_{C_2}(x \mid X) = \hat{f}(x_1 \mid X)\, \hat{f}(x_2 \mid x_1, X)\, \hat{f}(x_3 \mid x_1, X)    (4)

where the marginal and conditional probability functions are determined simply by using the observed relative frequencies in sample X. It should be noted that in this particular case we obtain the same dependence tree for the two classes, as the mutual information values have led to the selection of the same graph; in general, however, the DTM is of particular interest since it considers a dependence tree for each class, thus enabling an estimation of the conditional probability function by class (Figure 1).

In accordance with the probability distribution of the first-order dependence tree (Equations 3 and 4), the value for the 4th state (0, 1, 1) is calculated as follows:

Class C1:

f_{C_1}((0, 1, 1) \mid X) = \hat{f}(x_1 = 0)\, \hat{f}(x_2 = 1 \mid x_1 = 0)\, \hat{f}(x_3 = 1 \mid x_1 = 0) = \frac{4}{10} \times \frac{4}{4} \times \frac{3}{4} = 0.300    (5)

Class C2:

f_{C_2}((0, 1, 1) \mid X) = \hat{f}(x_1 = 0)\, \hat{f}(x_2 = 1 \mid x_1 = 0)\, \hat{f}(x_3 = 1 \mid x_1 = 0) = \frac{16}{20} \times \frac{15}{16} \times \frac{8}{16} = 0.375    (6)

According to these results, a future object presenting this state should be classified in Class C2.
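As a quick check of Equations (5) and (6), the dependence-tree estimate for a state can be computed from the Table 1 counts; a minimal R sketch for the tree with root x1 and edges (x1, x2) and (x1, x3), reusing freq and states from the earlier sketches (the function name is ours):

# DTM estimate of P(state | class) for the tree x2 <- x1 -> x3 (root x1)
dtm_state_prob <- function(state, class_freq) {
  n   <- sum(class_freq)
  sel <- states[, 1] == state[1]                 # observations sharing the root value x1
  p_root <- sum(class_freq[sel]) / n             # f(x1)
  p_x2 <- sum(class_freq[sel & states[, 2] == state[2]]) / sum(class_freq[sel])  # f(x2 | x1)
  p_x3 <- sum(class_freq[sel & states[, 3] == state[3]]) / sum(class_freq[sel])  # f(x3 | x1)
  p_root * p_x2 * p_x3
}

dtm_state_prob(c(0, 1, 1), freq[, "C1"])   # 0.300, Equation (5)
dtm_state_prob(c(0, 1, 1), freq[, "C2"])   # 0.375, Equation (6)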

Figure 1. Example of a dependence tree for the case of P = 3 variables.

Table 5. Probability estimates of the occurrence of state i in class k through the DTM (i = 1, . . ., 8 and k = 1, 2)

State   (x1, x2, x3)   f1(x|X)   f2(x|X)   Decision (class chosen by DTM)
1       (0, 0, 0)      0.000     0.025     C2
2       (0, 0, 1)      0.000     0.025     C2
3       (0, 1, 0)      0.100     0.375     C2
4       (0, 1, 1)      0.300     0.375     C2
5       (1, 0, 0)      0.600     0.075     C1
6       (1, 0, 1)      0.000     0.025     C2
7       (1, 1, 0)      0.000     0.075     C2
8       (1, 1, 1)      0.000     0.025     C2

All the values of the conditional probability estimates obtained by applying the DTM to the previously described data are presented in Table 5. As Table 5 shows, the DTM can overcome the problem of sparseness more easily than the FMM, which justifies its integration in the proposed combined model. From the definitions of the FOIM and the DTM it is clear that both favor the predictive objective of DDA; nevertheless, both also serve the descriptive goal, since the FOIM permits an analysis of the relationship between each explanatory variable and the classes, and the DTM determines the most important relationships between variables within each class.

The model proposed in this study (Marques et al., 2013) is a linear convex combination of the First-Order Independence Model (FOIM) and the Dependence Trees Model (DTM) and is meant to deal with the specific challenge of sparseness. The FOIM assumes that the P discrete variables are independent in each class Ck, k = 1, . . ., K. The Dependence Trees Model (DTM; Celeux & Nakache, 1994; Pearl, 1988) is an alternative model that takes the predictors' relationships into account: the conditional probability function is estimated by using a dependence tree that represents the most important predictor relations.

Table 6. Performance of the single models FOIM and DTM and the combined model FOIM-DTM

β     0       0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9     1
Pc    63.4%   66.6%   76.6%   76.6%   76.6%   76.6%   76.6%   76.6%   76.6%   73.4%   73.4%

Note. Bold means the best percentage of correctly classified observations achieved.

We use the Chow-Liu algorithm (Celeux & Nakache, 1994; Pearl, 1988) to implement the dependence tree and estimate the conditional probability function. In this algorithm, the mutual information between two variables is used to measure the closeness between two probability distributions. The FOIM and the DTM are expected to misclassify different objects, thus encouraging a combination approach. The FOIM-DTM conditional probability function is estimated as follows:

\hat{P}(x_i \mid C_k; \beta) = \beta\, \hat{P}_{FOIM}(x_i \mid C_k) + (1 - \beta)\, \hat{P}_{DTM}(x_i \mid C_k)    (7)

with 0 ≤ β ≤ 1, where X = (x_1, x_2, . . ., x_n) is an n-dimensional sample, x_i represents the ith object (i ∈ {1, . . ., n}) described by P discrete variables, x_i = (x_i1, x_i2, . . ., x_iP) is its observed state, and C_1, C_2, . . ., C_K are K exclusive classes. Note that when β = 1 the FOIM-DTM combined model relies solely on the FOIM, and when β = 0 it is based solely on the DTM.

Finally, the performance of the single models FOIM and DTM and of the combined model FOIM-DTM in this pedagogical example is presented in Table 6. The performance of the models is measured by the percentage of correctly classified observations (Pc) and is estimated by twofold cross-validation. The results in Table 6 show that the single models DTM and FOIM attain worse results than the combined model – 63.4% and 73.4% of correctly classified observations, respectively, versus 76.6% – which, even though neither single model performs poorly, reveals the ability of the combined model to increase classification accuracy.

The combined model presented herein was developed in the social sciences and humanities domain, aspiring to contribute to discrete classification problems with very small, small, or moderate sample sizes. The real data examples outlined below set out to demonstrate the importance of the combined model FOIM-DTM for supervised classification in this domain.

The Alexithymia data (Prazeres, 1996) consist of 34 dermatology patients evaluated with the psychological test TAS-20 (Twenty-Item Toronto Alexithymia Scale), conceived to evaluate the presence of alexithymia ("alexithymia" meaning no words to express emotions). These patients were classified into three classes – a Nonalexithymic class (n1 = 7), an Intermediate class (n2 = 13), and an Alexithymic class (n3 = 14) – and for each patient the values of six binary psychological variables were available. The goal of this classification problem was to define a model that would permit the classification of new patients, in the future, into one of the three a priori defined classes.

The Cultural Centre data (Duarte, 2009) were obtained by means of a survey on the quality of products/services of the Cultural Centre, applied to 988 clients. This survey consisted of several questions; the present data relate to three questions regarding the clients' expectations (5-point Likert scale) and two schooling classes (not holding a degree, holding a degree). Although the sample size is not small, the data matrix presents a large number of unobserved cells (sparseness problem) and thus constitutes a major challenge in supervised classification. The question in this classification task was: are the two schooling classes well predicted by the expectation variables?

Table 7 summarizes the results obtained with the Alexithymia and Cultural Centre data. The competing models are evaluated by the percentage of correctly classified observations estimated by twofold cross-validation. The two real data examples represent real challenges in supervised classification for very distinct reasons: the first because of the very small sample size and the second as a result of sparseness. In both cases, the combined model is the winning model, despite the difficulty in overcoming the sparseness problem of the Cultural Centre data. For modeling purposes, prior probabilities are considered equal. The R software (R version 2.12.1) is used for the implementation of the algorithm. In the Appendix we provide, as an example, the implementation of the FOIM procedure.
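A minimal R sketch of the convex combination in Equation (7), reusing the foim matrix and the dtm_state_prob helper from the earlier sketches (function names and the equal-prior classification rule below are our own illustration):

# Per-state DTM estimates for both classes, then the Equation (7) combination
dtm <- t(apply(states, 1, function(s)
  c(C1 = dtm_state_prob(s, freq[, "C1"]),
    C2 = dtm_state_prob(s, freq[, "C2"]))))

combine_foim_dtm <- function(beta) beta * foim + (1 - beta) * dtm

# Classification with equal priors: pick the class with the larger combined estimate
classify <- function(beta) {
  p <- combine_foim_dtm(beta)
  colnames(p)[max.col(p)]
}

classify(0.5)   # decisions for the 8 states with beta = 0.5
# In practice beta is chosen on a grid (0, 0.1, ..., 1) by cross-validated accuracy, as in Table 6.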

Data Complexity and the Performance of Classifiers

The performance of classifiers may be influenced by several factors: class separation and balance (Ho & Basu, 2002; Macia et al., 2008; Prati, Batista, & Monard, 2004), sample size (Raudys & Jain, 1991) and also, in the specific DDA domain, the number of missing states (e.g., Sousa Ferreira, 2004, 2010). Some studies have addressed the relationships between more than one factor, namely when continuous predictors are considered – for example, Prati et al. (2004) refer to overlapping and balance and conclude that the lack of separation between classes tends to surpass the importance of unbalanced classes in terms of the difficulty of binary classification tasks.



Table 7. Performance of the single models FOIM and DTM and the combined model FOIM-DTM in the real data examples

Data              DTM (β = 0)   FOIM (β = 1)   FOIM-DTM
Alexithymia       65.0%         65.0%          85.0% (β = 0.25, 0.50, 0.75)
Cultural Centre   56.0%         60.0%          61.0% (β = 0.40)

Note. Bold means the best percentage of correctly classified observations achieved.

Pinches (1980) points out the relevance of sample size and comments on the impact of unequal sample sizes per class. Raudys and Jain (1991) consider the relationship between sample size and the number of missing states and also underline the intrinsic relationship between the sample size and the number of predictors as a determinant of classification complexity. Macia et al. (2008) resort to the generation of synthetic data sets to evaluate data complexity and find that the length of the class boundary is a dominant factor in assessing the complexity of a data set.

In the present study, several scenarios are set for generating data in order to evaluate the impact of data characteristics on the performance of a discrete binary classifier. As already mentioned, this study focuses on the field of the social sciences and humanities, where the issue of sample dimension is particularly relevant in DDA. Therefore, the first complexity factor considered is sample size, with three options – very small, small, and moderate sized samples. The second experimental factor is the degree of class separation, which is measured by the affinity coefficient (A; Bacelar-Nicolau, 1985; Matusita, 1955). This coefficient is computed as follows:

A(f, f') = \sum_{l=1}^{L} \sqrt{f_l}\, \sqrt{f'_l}    (8)

where f = (f_1, . . ., f_L) and f' = (f'_1, . . ., f'_L) are two discrete probability distributions defined on the same space of states. The third experimental factor is balance, measured by the weight of the majority class. The number of missing states is included as an additional complexity factor; it is not prespecified but rather determined for the simulated data sets generated under the experimental scenarios (defined by the previously mentioned factors).

In order to evaluate the DDA results obtained with the combined model we report the percentage of correctly classified observations (Pc) and the Huberty Index (HI),

HI = \frac{Pc - Pd}{1 - Pd}    (9)

where Pd represents the percentage of observations corresponding to the majority class and Pc is the percentage of correctly classified cases. The Huberty index is intended to provide a fair comparison between the performance in balanced and unbalanced cases, since it quantifies the percentage of improvement in classification performance relative to the majority-class rule taken as the default classification rule. In fact, "Error rate and accuracy are particularly suspect as performance measures when studying the effect of class distribution on learning since they are strongly biased to favor the majority class" (Prati et al., 2004, p. 314). We therefore report both Pc and the Huberty index and also provide results from twofold cross-validation.

Finally, we attempt to model the relationship between the performance of the combined classifier and the data complexity factors considered in this study. To this end, we resort to simulated data and regress the performance of the combined model on these factors. The percentage of correctly classified observations (twofold result) is the response variable considered (note that, since the weight of the majority class is included as a predictor, the Huberty Index can be discarded at this stage). The estimated regression model is judged according to its fit to the data, and its predictive accuracy is tested on one real data set.
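Both quality measures are one-liners; an R sketch with our own function names and purely hypothetical inputs:

# Affinity coefficient (Equation 8) between two discrete distributions on the same states
affinity <- function(f, g) sum(sqrt(f) * sqrt(g))

# Huberty index (Equation 9) from the proportion correctly classified and the majority-class weight
huberty <- function(pc, pd) (pc - pd) / (1 - pd)

affinity(c(0.1, 0.3, 0.6, 0.0), c(0.4, 0.4, 0.1, 0.1))  # hypothetical distributions
huberty(pc = 0.766, pd = 20/30)                          # best Pc of the pedagogical example, Table 6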

Data Analysis and Results

Simulated Data

The performance of the FOIM, DTM, and combined FOIM-DTM discrete classifiers is evaluated on simulated data within diverse experimental scenarios. The experimental factors considered arise from the previous literature review: data set size and class imbalance (e.g., Weiss & Provost, 2003) and overlapping or separation (e.g., Prati et al., 2004). First, in line with the work of Weiss and Provost (2003), we focus on binary classification. We then consider four binary predictors, a reasonable number given that our aim is to address classification in small sized samples. Having set this general scenario, we specify the following complexity factor thresholds:
1) Separation – since the affinity coefficient takes values in the interval (0, 1), the thresholds defined for the affinity coefficient are above 0.7 for poorly separated classes, between 0.3 and 0.7 for moderately separated classes, and below 0.3 for well-separated classes;
2) Sample size – n = 60, n = 120, and n = 400 are considered;
3) Balance – for the unbalanced experimental scenarios the majority class weights are, respectively, 2/3, 3/4, and 3/4 of the sample size (i.e., n1 = 20, n2 = 40; n1 = 30, n2 = 90; and n1 = 100, n2 = 300).
The average number of missing states (the fourth experimental factor) is finally quantified for each simulated data set, as illustrated in the sketch below. The multinomial distribution parameters, along with the complexity factor characteristics of the considered data sets, are presented in Tables 8 and 9. For each of the eighteen resulting scenarios we generate 100 data sets. Based on the 1,800 generated data sets, we aim to ascertain the comparative advantage of the combined DDA model. In addition, we are able to use a regression model to evaluate the relative impact of the experimental factors on the performance of binary discrete classification.
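For illustration, one balanced data set with moderately separated classes and n = 60 could be generated from the Bernoulli parameters listed in Table 8, assuming each pair in that table gives (P(x = 0), P(x = 1)) for a predictor; the paper's exact generation code is not shown, so names below are ours:

set.seed(1)  # reproducibility of this illustration only

# P(x_p = 1) for the four binary predictors, moderate separation (Table 8)
p1 <- c(0.6, 0.4, 0.6, 0.4)   # class C1
p2 <- c(0.3, 0.7, 0.3, 0.7)   # class C2

gen_class <- function(n, p) sapply(p, function(pr) rbinom(n, size = 1, prob = pr))

X <- rbind(gen_class(30, p1), gen_class(30, p2))   # balanced, n = 60
y <- rep(c(1, 2), each = 30)

# Number of missing states per class (out of the 2^4 = 16 possible states)
state <- apply(X, 1, paste, collapse = "")
sapply(split(state, y), function(s) 16 - length(unique(s)))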


Table 8. Synthetic data set parameters: The four binary predictors' probabilities

Separability   C1                                          C2
Poor           (0.5, 0.5; 0.5, 0.5; 0.5, 0.5; 0.5, 0.5)    (0.5, 0.5; 0.5, 0.5; 0.5, 0.5; 0.5, 0.5)
Moderate       (0.4, 0.6; 0.6, 0.4; 0.4, 0.6; 0.6, 0.4)    (0.7, 0.3; 0.3, 0.7; 0.7, 0.3; 0.3, 0.7)
Good           (0.1, 0.9; 0.7, 0.3; 0.2, 0.8; 0.6, 0.4)    (0.9, 0.1; 0.3, 0.7; 0.8, 0.2; 0.4, 0.6)

Table 9. Average number of missing states (100 runs in each scenario)

                       n = 60                  n = 120                 n = 400
Separation        C1     C2     Total     C1     C2     Total     C1     C2     Total
Balanced
  Poor            2.28   2.40   4.68      0.35   0.38   0.73      0.00   0.00   0.00
  Moderate        2.86   4.82   7.68      0.76   2.30   3.06      0.00   0.35   0.35
  Good            7.27   7.45   14.72     4.69   5.35   10.04     2.01   2.29   4.30
Unbalanced
  Poor            4.75   1.27   6.02      2.24   0.04   2.28      0.02   0.00   0.02
  Moderate        5.11   3.42   8.53      3.06   1.33   4.39      0.12   0.07   0.19
  Good            8.21   6.19   14.40     6.95   3.57   10.52     3.46   1.28   4.74

Table 10. Congressional voting records (reduced) data set

Predictors                               Category   DEM (C1)   REP (C2)
V4. Adoption-of-the-budget-resolution    1-Yes      85.5%      15.7%
                                         2-No       14.5%      84.3%
V5. Physician-fee-freeze                 1-Yes      4.8%       99.1%
                                         2-No       95.2%      0.9%
V6. El-salvador-aid                      1-Yes      20.2%      95.4%
                                         2-No       79.8%      4.6%
V13. Education-spending                  1-Yes      12.9%      85.2%
                                         2-No       87.1%      14.8%
Total (n = 232)                                     124        108

Results

Real Data

A real data set is considered to illustrate the comparison between the effective FOIM-DTM performance and the performance estimated from the considered complexity factors using an estimated regression model. This data set meets the general conditions of the simulated data. It is based on the Congressional Voting Records Data Set in the UCI Machine Learning Repository – see Bache and Lichman (2013) – which includes the votes of each U.S. House of Representatives Congressman on 16 key votes identified by the Congressional Quarterly Almanac (CQA), 1984. In this data set, classification is meant to discriminate between democrats (DEM) and republicans (REP) so that new congressmen, described by the P discrete variables but with unknown class membership, can be assigned in the future to one of the two classes. The 16 predictors (key votes) are binary variables coded 1-yes and 2-no. Class 1 (DEM) has 124 congressmen and Class 2 (REP) 108. In this study we only consider individuals providing complete answers and finally select the four most discriminant predictors, using Cramér's V statistic to measure the association between each predictor and the classes in order to identify the most promising variables. The final data set is described in Table 10.
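For a binary predictor against two classes, Cramér's V reduces to the phi coefficient; a small R sketch of this screening step (the function, and the votes/party objects in the commented usage, are our own illustration, not taken from the paper's code):

# Cramér's V between a categorical predictor and the class labels
cramers_v <- function(x, class) {
  tab  <- table(x, class)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  as.numeric(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Screening: rank all 16 key votes and keep the four strongest predictors
# (assuming `votes` is a data frame of the 16 binary votes and `party` the class labels)
# v <- sapply(votes, cramers_v, class = party)
# names(sort(v, decreasing = TRUE))[1:4]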

Descriptive Results

The descriptive results referring to the performance of the combined FOIM-DTM classifier are presented in this section. They refer to 100 classification runs for each scenario. The performance of the combined FOIM-DTM classifier versus the individual classifiers is summarized in Tables 11, 12, and 13 and in Figures 2 and 3. Note that in each of the 1,800 data sets, the performance of the FOIM-DTM combination is analyzed with β ranging from 0 (DTM) to 1 (FOIM) in increments of 0.1. Thus, the performance of the FOIM-DTM combined model has been systematically compared with those of the FOIM and DTM individual classifiers.

In Table 11, the comparison between the FOIM, DTM, and FOIM-DTM combined models is presented. When n = 60 (very small sized sample), FOIM is able to outperform the combined model for the balanced data sets with poorly and moderately separated classes and for the unbalanced data sets with well-separated classes. When small samples (n = 120) are considered, the proposed combined classification algorithm is a clear winner – it outperforms FOIM and DTM in the 180 corresponding data sets. For n = 400 (moderate sized sample) there is a tie between FOIM individually and FOIM combined with the DTM, although the proposed combination may outperform FOIM in the balanced setting with well-separated classes and in the unbalanced setting with poorly and moderately separated classes, whereas in the other cases FOIM wins.



Table 11. The winner β coefficient (best average accuracy of 100 runs)

Separation        n = 60             n = 120   n = 400
Balanced
  Poor            0.7 or 1 (FOIM)    0.1       1 (FOIM)
  Moderate        1 (FOIM)           0.9       1 (FOIM)
  Good            0.9                0.9       0.8
Unbalanced
  Poor            0.6                0.9       0.8
  Moderate        0.8                0.8       0.7
  Good            0.9 or 1 (FOIM)    0.7       0.8 or 1 (FOIM)

Table 12. Average performance of the winner β classifier (100 runs for balanced sets)

                        n = 60                  n = 120                 n = 400
Separation         Mean (%)   Var. coef.    Mean (%)   Var. coef.    Mean (%)   Var. coef.
Poor        Pc     51.70      0.14          50.70      0.08          50.40      0.06
            HI     3.39       4.41          1.35       6.14          0.74       8.89
Moderate    Pc     69.70      0.12          71.5       0.06          73.20      0.03
            HI     39.44      0.44          42.93      0.21          46.40      0.09
Good        Pc     89.70      0.05          90.60      0.03          91.30      0.02
            HI     79.36      0.12          81.13      0.07          82.65      0.04

In fact, the presented results are indicative of the potential of the combined classifier, which is the winner in around 80% of the experimental situations (FOIM is the single winner in a sixth of the experimental scenarios).

The achieved classification accuracy is displayed in Tables 12 and 13, where the average percentage of correctly classified observations (Pc), the average Huberty Index (HI), and the respective coefficients of variation (var. coef.) over the 100 runs in each scenario are presented. The coefficient of variation is a relative dispersion measure, also known as the relative standard deviation, defined as the ratio of the standard deviation to the mean. In Figures 2 and 3 the average percentage of correctly classified observations (Pc) by class and the respective coefficients of variation (var. coef.) are presented.

Overall, unbalanced data sets correspond to harder classification tasks – see the Huberty index (HI) values in Table 13 (unbalanced data sets) as compared to those in Table 12 (balanced data sets). Furthermore, there is a clear increase in classification performance associated with an increase in separation. For the unbalanced data sets with poorly separated classes specifically, the default classification precision overcomes the precision of the proposed algorithm.

Table 13. Average performance of the winner β classifier (100 runs for unbalanced sets)

                        n = 60 (1:2)            n = 120 (1:3)           n = 400 (1:3)
Separation         Mean (%)   Var. coef.    Mean (%)   Var. coef.    Mean (%)   Var. coef.
Poor        Pc     53.50      0.14          52.40      0.12          52.00      0.08
            HI     −40.87     0.57          −90.32     0.27          −91.94     0.17
Moderate    Pc     71.20      0.10          73.40      0.07          72.70      0.04
            HI     12.70      1.71          −6.40      2.99          −9.24      1.15
Good        Pc     89.40      0.05          90.50      0.04          90.90      0.02
            HI     67.82      0.18          61.82      0.21          63.55      0.10

The obtained performance results are generally consistent over the 100 runs in each scenario – see the coefficient of variation values. However, the Huberty index may exhibit high variability for difficult classification tasks, that is, generally when poorly separated classes are considered and also when unbalanced and moderately separated classes are considered. The classification results referring to each class are very similar – see Figures 2 and 3. Nevertheless, in the unbalanced cases, the accuracy for the larger class slightly (but consistently) surpasses the accuracy for the smaller one.

Regression on Performance

The performance results obtained in the conducted numerical experiments enable us to estimate a regression model in order to:
1. Predict the Pc performance measure based on the data characteristics (complexity factors).
2. Understand the relative impact of each experimental complexity factor on performance.
In order to implement the regression we specifically consider the following measures of the experimental complexity factors: the affinity coefficient value (Aff) is used to measure the separation of the classes; the weight of the majority class (Wmc) is used to measure balance; dimensionality is measured by the ratio (Pdf) between the "number of degrees of freedom" and the sample size, that is, Pdf = (n − (P × 2 + 1))/n (note that P = 4 is the number of predictors, and the parameters referring to the two classes have to be estimated); finally, the proportions of missing states in each class (Pmsc1 and Pmsc2) are considered. A generalization of the Tobit regression model is used, and the ML estimated coefficients are obtained using the censReg package (Henningsen, 2010). The estimated regression model is presented in Table 14. The additional columns on the right refer to standardized variables (Coef. [Std.], p-value [Std.]), in order to help better assess the relative importance of the predictors.
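A sketch of how such a censored (Tobit-type) regression can be fitted with the censReg package; the data frame perf and its column names are our own assumption, standing for one row per simulated data set:

library(censReg)  # Henningsen (2010)

# perf: one row per generated data set with the twofold Pc (a proportion in [0, 1])
# and the complexity measures Aff, Wmc, Pdf, Pmsc1, Pmsc2 (hypothetical column names)
fit <- censReg(Pc ~ Aff + Pdf + Pmsc1 + Pmsc2 + Wmc,
               left = 0, right = 1, data = perf)   # Pc is censored within [0, 1]
summary(fit)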



Figure 2. Average performance, per class, of the winner b classifier (100 runs for balanced sets).

Figure 3. Average performance, per class, of the winner b classifier (100 runs for unbalanced sets).

The three complexity factors with the highest impact on classification precision are (in decreasing order): separation, the ratio between the degrees of freedom and the sample size, and the proportion of missing states in the minority class (see Table 14). The weight of the majority class and the proportion of missing states in the majority class have a weaker impact on performance.



Table 14. ML estimated regression coefficients

            Coef.     p-value   Coef. (Std.)   p-value (Std.)
Constant    0.819     0.000     0.243          0.000
Aff         −0.669    0.000     −0.899         0.000
Pdf         0.346     0.000     0.239          0.000
Pmsc1       −0.532    0.000     −0.084         0.000
Pmsc2       −0.131    0.000     −0.006         0.753
Wmc         0.097     0.000     0.029          0.002

In fact, according to the ranking of the standardized coefficients, the impact of the latter factor is nonsignificant. As expected, the larger the ratio between the degrees of freedom and the sample size, the easier the classification task. The remaining complexity factors, except for the weight of the majority class, have a negative impact on performance. The squared correlation between observed and estimated Pc values is 0.95, revealing a good fit to the data.

When applying the estimated regression model to the real data set (the reduced Congressional Voting Records), we can anticipate the percentage of correctly classified observations based on its characteristics: affinity coefficient 0.195; proportion of missing states in the majority class 0.125; proportion of missing states in the minority class 0.281; ratio between degrees of freedom and sample size 0.961; and balance 0.534. In fact, before performing the classification, it was possible to foresee a predicted Pc of 97% based on the estimated regression model (see the coefficients in Table 14). According to the classification results obtained, the actual percentage of correctly classified observations with the combined model FOIM-DTM in this data set is Pc = 97%, showing therefore an excellent predictive ability of the estimated regression model. In addition, it can also be noted that, for this data set, the single models DTM and FOIM attain worse results than the combined model: 93% and 95%, respectively, of correctly classified observations.

Conclusions and Perspectives

In the present study, we evaluate the performance of a combined model – a convex combination of the FOIM and the DTM – for binary discrete classification. We set 18 scenarios for generating simulated data sets with 4 binary predictors, controlling for factors considered relevant for classification precision. Furthermore, in each scenario we analyzed the performance of the FOIM-DTM combined model over the 11 values of the β coefficient. The factors include three degrees of class separability, class weights (whether balanced or unbalanced), and sample dimension (n = 60, n = 120, n = 400). In addition, the number of missing states is quantified in each scenario. As expected, an increase in sample size decreases the number of missing states, and for the unbalanced data sets the highest number of missing states occurs in the smaller classes.

The differentiated scenarios provided very different classification performances. According to the obtained results, the combined method yields the best results for small sample cases (whether balanced or unbalanced) and, as expected, performance improves with increasing class separability. The worst performances are observed for unbalanced and poorly separated classes – the combined model is unable to surpass the default classification precision (the lowest Huberty Index value is −91.94%). Within the balanced scenarios, when poorly and moderately separated classes are considered, an increase in sample size seems to increase the classification ability of the single FOIM. For unbalanced data sets, the proposed combination generally yields the best results. The classification results by class showed that, when class sizes are unequal, the accuracy for the smaller class is quite similar to that achieved for the larger class.

Based on the experimental data – 100 classification runs for each scenario – a regression model is estimated, providing new insights into the relative impact of the experimental factors on binary discrete classification precision. Separability emerges as the most important experimental factor – the more weakly separated the classes are (the higher the affinity coefficient), the weaker the classification performance. The ratio of the number of degrees of freedom to the sample size is the second most important factor, with a positive impact on performance. The third is the proportion of missing states in the minority class which, as expected, has a negative impact on performance. The estimated regression model exhibited a good fit to the synthetic data and also enabled anticipation of the performance of the proposed FOIM-DTM algorithm in a real data set – a data set extracted from the Congressional Voting Records Data Set in the UCI Machine Learning Repository. In this data set, the difference between the estimated and the actual performance measure (percentage of correctly classified observations) is 0.002.

To our knowledge, this is the first study conducted to evaluate DDA performance in this way. In future research, additional complexity measures of discrete classification problems may be considered – for example, an alternative measure of the degree of class separability (other than the affinity coefficient). Furthermore, some of the categories of the experimental factors taken into account may be varied and their interactions further analyzed.



References Abbott, D. W. (1999). Combining models to improve classifier accuracy and robustness. Proceedings of Second International Conference on Information Fusion, Fusion’99 (Vol. 1, pp. 289–295). San Jose, CA. Amershi, S., & Conati, C. (2009). Combining unsupervised and supervised classification to build user models for exploratory. JEDM-Journal of Educational Data Mining, 1, 18–71. Bacelar-Nicolau, H. (1985). The affinity coefficient in cluster analysis. Methods of Operations Research, 53, 507–512. Bache, K., & Lichman, M. (2013). UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science. Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140. Breiman, L. (1998). Half & Half bagging and hard boundary points (Technical Report). Berkeley, CA: Statistics Department, University of California. Brito, I. (2002). Combinaison de modèles en analyse discriminante dans un contexte gaussien [Combining models in discriminant analysis in a Gaussian context]. (PhD thesis). France: Grenoble 1 University. Brito, I., Celeux, G., & Sousa Ferreira, A. (2006). Combining methods in supervised classification: A comparative study on discrete and continuous problems. REVSTAT – Statistical Journal, 4, 201–225. Celeux, G., & Nakache, J. P. (1994). Analyse discriminante sur variables qualitatives [Discrete discriminant analyses]. Paris: Polytechnica. Cesa-Bianchi, N., Claudio, G., & Luca, Z. (2006). Hierarchical classification: Combining Bayes with SVM. Proceedings of the 23rd international conference on machine learning. New York, NY: ACM. Dietterich, T. G. (1997). Machine-learning research. AI Magazine, 18, 97–136. Duarte, A. (2009). A satisfação do consumidor nas instituições culturais: O caso do Centro Cultural de Belém [Consumer satisfaction in cultural institutions: The case of Centro Cultural de Belém]. (Master thesis). Portugal: ISCTE – IUL. Elder, J. F., & Pregibon, D. (1996). A statistical perspective on knowledge discovery in databases. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 83–116). Menlo Park, CA: AAAI/MIT Press. Finch, H., & Schneider, M. K. (2007). Classification accuracy of neural networks vs. discriminant analysis, logistic regression, and classification and regression trees. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 2, 47–57. doi: 10.1027/1614-2241.3.2.47 Freund, Y., & Schapire, R. E. (1996, July). Experiments with a new boosting algorithm. ICML’96: Proceedings of the 13th International Conference on Machine Learning, (Vol. 96, pp. 148–156). Bari, Italy. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232. Friedman, J. H., Hastie, T., & Tibsharani, R. (1998). Additive logistic regression: A statistical view of boosting. (Technical Report). Stanford, CA: Statistics Department, Stanford University. Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2, 916–954. Goldstein, M., & Dillon, W. R. (1978). Discrete discriminant analysis. New York, NY: Wiley. Henningsen, A. (2010). Estimating Censored Regression Models in R using the censReg Package. R package vignettes.


Ho, T. K., & Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 289–300. Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 4–37. Janusz, A. (2010). Combining multiple classification or regression models using genetic algorithms. In M. Szczuka, M. Kryszkiewicz, S. Ramanna, R. Jensen, & Q. Hu (Eds.), Rough Sets and Current Trends in Computing (pp. 130–137). BerlinHeidelberg, Germany: Springer. Kotsiantis, S. (2011). Combining bagging, boosting, rotation forest and random subspace methods. Artificial Intelligence Review, 35, 223–240. Kotsiantis, S. B., Zaharakis, I. D., & Pintelas, P. E. (2006). Machine learning: A review of classification and combining techniques. Artificial Intelligence Review, 26, 159–190. Macia, N., Bernadó-Mansilla, E., & Orriols-Puig, A. (2008). Preliminary approach on synthetic data sets generation based on class separability measure. In ICPR 2008 – 19th International Conference on Pattern Recognition (pp. 1–4). Tampa, FL: IEEE. Marques, A., Sousa Ferreira, A., & Cardoso, M. G. M. S. (2010). Classification and combining models. In Proceedings of Stochastic Modeling Techniques and Data Analysis International Conference (CD-rom). Chania Crete, Greece. Marques, A., Sousa Ferreira, A., & Cardoso, M. G. M. S. (2013). Selection of variables in discrete discriminant analysis. Biometrical Letters, 50, 1–14. Marques, A., Sousa Ferreira, A., & Cardoso, M. G. M. S. (2015). Combining models in discrete discriminant analysis. International Journal of Data Analysis Techniques and Strategies, 2. http://www.inderscience.com/jhome.php?jcode=ijdats Matusita, K. (1955). Decision rules, based on the distance, for problems of fit, two samples, and estimation. The Annals of Mathematical Statistics, 26, 631–640. Milgram, J., Cheriet, M., & Sabourin, R. (2004). Speeding up the decision making of support vector classifiers. In Ninth International Workshop on Frontiers in Handwriting Recognition, (IWFHR-2004) (pp. 57–62). Kokubunji, Tokyo: IEEE. Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco, CA: Morgan Kaufmann. Pinches, G. E. (1980). Factors influencing classification results from multiple discriminant analysis. Journal of Business Research, 8, 429–456. Prati, R. C., Batista, G. E., & Monard, M. C. (2004). Class imbalances versus class overlapping: an analysis of a learning system behavior. In R. Monroy, G. Arroyo-Figueroa, L. E. Sucar, & H. Sossa (Eds.), MICAI 2004: Advances in Artificial Intelligence (pp. 312–321). Berlin-Heidelberg, Germany: Springer. Prazeres, N. L. (1996). Ensaio de um Estudo sobre Alexitimia com o Rorschach e a Escala de Alexitimia deToronto (TAS-20) [Study assay on alexithymia with the Rorschach and the Alexithymia Scale of Toronto (TAS-20)]. (Master thesis). Lisbon: Universidade de Lisboa. Raudys, S. J., & Jain, A. K. (1991). Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 252–264. Re, M., & Valentini, G. (2012). Ensemble methods: A review. In M. J. Way, J. D. Scargle, K. M. Ali, & A. N. Srivastava (Eds.), Advances in machine learning and data mining for astronomy (pp. 563–582). Boca Raton, FL: Chapman & Hall/CRC Press. Sotoca, J. M., Sanchez, J. S., & Mollineda, R. A. (2005). 
A review of data complexity measures and their applicability to pattern classification problems. Actas del III Taller Nacional de Mineria de Datos y Aprendizaje, TAMIDA, Granada, Spain, 77–83.


Sousa Ferreira, A. (2000). Combining models in discrete discriminant analysis. (PhD Thesis, in Portuguese). Lisboa: University Nova de Lisboa. Sousa Ferreira, A. (2004). Combining models in discrete discriminant analysis through a committee of methods. In D. Banks, L. House, F. R. McMorris, P. Arabie, & W. Gaul (Eds.), Classification, clustering, and data mining applications (pp. 151–156). Berlin-Heidelberg, Germany: Springer. Sousa Ferreira, A. (2010). A comparative study on discrete discriminant analysis through a hierarchical coupling approach. In H. Locarek-Junge & C. Weihs (Eds.), Classification as a tool for research, studies in classification, data analysis, and knowledge organization (pp. 137–145). Berlin-Heidelberg, Germany: Springer. Sousa Ferreira, A., Celeux, G., & Bacelar-Nicolau, H. (2000). Discrete discriminant analysis: The performance of combining models by a hierarchical coupling approach. In H. A. L. Kiers, J.-P. Rasson, P. J. F. Groenen, & M. Schader (Eds.), Data analysis, classification, and related methods (pp. 181–186). Berlin-Heidelberg, Germany: Springer. Steinberg, D. (1997). CART user's manual. San Diego, CA: Salford Systems. Weiss, G. M., & Provost, F. (2003). Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19, 315–354. Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241–259.

Received March 28, 2014
Revision received October 30, 2015
Accepted July 12, 2016
Published online March 22, 2017

Anabela Marques has a degree in Statistics and Operations Research (University of Lisbon) and a PhD in Quantitative Methods in the specialization of Statistics and Data Analysis (ISCTE-IUL). She teaches Data Analysis at the Barreiro School of Technology, Polytechnic Institute of Setúbal. Her research focuses on supervised classification.

Ana Sousa Ferreira has a degree in Applied Mathematics (University of Lisbon) and a PhD in Mathematics – Statistics (New University of Lisbon). Currently, she is an Assistant Professor at the Faculty of Psychology and a researcher at the Business Research Unit. She teaches Data Analysis courses and her research focus is supervised classification.

Margarida G. M. S. Cardoso is an Associate Professor at ISCTE-IUL. She holds a PhD in Systems Engineering and a Master's in Operations Research (Instituto Superior Técnico) and has a degree in Mathematics (University of Lisbon). She teaches courses on Data Analysis – Statistics and Data Mining techniques. Her research focuses on clustering and classification techniques.

Anabela Marques Barreiro College of Technology Setúbal Polytechnic, IPS Rua Américo da Silva Marinho 2839-001 Lavradio Portugal anabela.marques@estbarreiro.ips.pt

Appendix

###############
# FOIM – First-Order Independence Model
###############
# Main routine: estimates the FOIM probabilities, weights them by the prior
# probabilities, assigns each test object to a class, and builds the confusion
# matrix and the rate of correctly classified objects.
ProgramFOIM <- function(MStates, MFreqAbsBuild, DimGroups, ProbaPriori, MDataTest, numGroups) {
  ProbFOIM <- FOIM(MStates, MFreqAbsBuild, DimGroups)
  coef <- WeightaPriori(MFreqAbsBuild, DimGroups, ProbaPriori)
  ProbFOIMWeightdo <- ProbModeloWeightdas(ProbFOIM, coef)
  AssignEstGrupoFOIM <- AssignG(ProbFOIMWeightdo)
  GIndGFOIM <- GINDAfectG(MDataTest, MStates, AssignEstGrupoFOIM)
  MConfusionFOIM <- Confusion(GIndGFOIM, numGroups)
  RateWellAssignedFOIM <- WellAssigned(MConfusionFOIM)
  ProbFOIMWeightdo
}

###############
# FOIM probabilities
###############
# For each state (row of MStates) and each class, the estimate is the product of
# the marginal relative frequencies of the observed variable values.
FOIM <- function(MStates, MFreqAbsBuild, dimgroupBuild) {
  nGroups <- ncol(MFreqAbsBuild)
  nStates <- nrow(MFreqAbsBuild)
  x <- 1:nStates
  xx <- 1:nGroups
  nvar <- ncol(MStates)
  xxx <- 1:nvar
  xGroups <- 1:nGroups
  ProbFOIM <- matrix(0, nStates, nGroups)
  vs <- c(0)
  for (v in x) {
    mult <- rep(1, nGroups)
    for (vv in xxx) {
      i <- 1
      for (vvv in xx) {
        vs[i] <- MFreqAbsBuild[v, vvv]
        i <- i + 1
      }
      for (k in x) {
        if ((k != v) & (MStates[v, vv] == MStates[k, vv])) {
          j <- xx[1]
          for (vGroups in xGroups) {
            vs[vGroups] <- vs[vGroups] + MFreqAbsBuild[k, j]
            j <- j + 1
          }
        }
      }
      for (vGroups in xGroups) {
        mult[vGroups] <- mult[vGroups] * vs[vGroups]
      }
    }
    for (vGroups in xGroups) {
      ProbFOIM[v, vGroups] <- mult[vGroups] / ((dimgroupBuild[vGroups])^nvar)
    }
  }
  ProbFOIM
}

###############
# Weights
###############
# Prior probabilities of the classes: proportional to class sizes (PPriori == 1)
# or equal (PPriori == 2).
WeightaPriori <- function(MFreqAbsBuild, dimgroupBuild, PPriori) {
  coef <- c(0)
  nGroups <- ncol(MFreqAbsBuild)
  xGroups <- 1:nGroups
  if (PPriori == 1) {
    for (v in xGroups) {
      coef[v] <- dimgroupBuild[v] / sum(dimgroupBuild)
    }
  }
  if (PPriori == 2) {
    for (v in xGroups) {
      coef[v] <- 1 / nGroups
    }
  }
  coef
}

###############
# Weight probabilities
###############
# Combines the class-conditional estimates with the prior weights and normalizes
# them per state (posterior class probabilities).
ProbModeloWeightdas <- function(MatrizProb, Weight) {
  denom <- c(0)
  nStates <- nrow(MatrizProb)
  ngroup <- ncol(MatrizProb)
  ProbModelo <- matrix(0, nStates, ngroup)
  x <- 1:nStates
  xGroups <- 1:ngroup
  for (v in x) {
    a <- 0
    for (vv in xGroups) {
      a <- a + Weight[vv] * MatrizProb[v, vv]
    }
    denom[v] <- a
    if (denom[v] != 0) {
      for (vv in xGroups) {
        ProbModelo[v, vv] <- Weight[vv] * MatrizProb[v, vv] / denom[v]
      }
    } else {
      for (vv in xGroups) {
        ProbModelo[v, vv] <- 0
      }
    }
  }
  ProbModelo
}

###############
# Assign each individual, with a specific state, to one of the a priori groups
###############
AssignG <- function(MatrizWeightdo) {
  nStates <- nrow(MatrizWeightdo)
  ngroup <- ncol(MatrizWeightdo)
  MAssign <- matrix(0, nStates, 1)
  x <- 1:nStates
  xGroups <- 1:ngroup
  for (v in x) {
    g <- 1
    larger <- MatrizWeightdo[v, g]
    for (vv in xGroups) {
      if (MatrizWeightdo[v, vv] > larger) {
        g <- vv
        larger <- MatrizWeightdo[v, vv]
      }
    }
    MAssign[v, 1] <- g
  }
  MAssign
}

###############
# For each individual: real vs. predicted group
###############
# The last column of MDataTest holds the true group; the remaining columns hold
# the observed variable values, which are matched against the rows of MStates.
GINDAfectG <- function(MDataTest, MStates, MAssignG) {
  nind <- nrow(MDataTest)
  x <- 1:nind
  col <- ncol(MDataTest)
  ver <- MDataTest[, -col]
  nStates <- nrow(MStates)
  xx <- 1:nStates
  nvar <- ncol(MStates)
  xxx <- 1:nvar
  GBeforeAfter <- matrix(0, nind, 2)
  GBeforeAfter[, 1] <- MDataTest[, col]
  for (v in x) {
    for (vv in xx) {
      conta <- 0
      for (vvv in xxx) {
        if (ver[v, vvv] == MStates[vv, vvv]) conta <- conta + 1
      }
      if (conta == nvar) GBeforeAfter[v, 2] <- MAssignG[vv, 1]
    }
  }
  GBeforeAfter
}

###############
# Build a confusion matrix
###############
Confusion <- function(MatrizGIndG, numberGroups) {
  nInd <- nrow(MatrizGIndG)
  Matriz <- matrix(0, numberGroups, numberGroups)
  I <- 1:nInd
  for (v in I) {
    a <- MatrizGIndG[v, 1]
    b <- MatrizGIndG[v, 2]
    Matriz[a, b] <- Matriz[a, b] + 1
  }
  Matriz
}

###############
# % of correctly classified
###############
WellAssigned <- function(ConfusionMatrix) {
  rows <- nrow(ConfusionMatrix)
  L <- 1:rows
  Well <- 0
  Total <- sum(ConfusionMatrix)
  for (v in L) {
    Well <- Well + ConfusionMatrix[v, v]
  }
  RtWellAssigned <- round(Well / Total, 3)
  RtWellAssigned
}
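A hypothetical call on the pedagogical example of Table 1; the test objects and their labels below are made up purely for illustration, and the argument layout is as assumed from the functions above:

# States and class frequencies from Table 1
MStates <- matrix(c(0, 0, 0,  0, 0, 1,  0, 1, 0,  0, 1, 1,
                    1, 0, 0,  1, 0, 1,  1, 1, 0,  1, 1, 1),
                  ncol = 3, byrow = TRUE)
MFreqAbsBuild <- matrix(c(0, 0,  0, 1,  1, 8,  3, 7,
                          6, 2,  0, 0,  0, 1,  0, 1),
                        ncol = 2, byrow = TRUE)
DimGroups <- colSums(MFreqAbsBuild)                  # 10 and 20
# Hypothetical test set: one object per state, last column = assumed true class
MDataTest <- cbind(MStates, c(1, 2, 2, 2, 1, 1, 2, 2))

ProgramFOIM(MStates, MFreqAbsBuild, DimGroups,
            ProbaPriori = 2,      # equal prior probabilities
            MDataTest, numGroups = 2)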



Call for Papers

Validity: Challenges in Conception, Methods, and Interpretation in Survey Research

A Special Issue for Methodology – European Journal of Research Methods for the Behavioral and Social Sciences

Guest Editors: Natalja Menold, Matthias Bluemke, and Anita Hubley
Coordinator: Jose-Luis Padilla

Aims and Scope

Validity is a central criterion of measurement quality and a key goal of measurement in the social sciences. As a quality aspect, validity does not refer to a measurement instrument per se, but to the adequacy of the interpretations and uses of the measurement results. Researchers developing and using measurement instruments are faced with a diversity of validity conceptions and validation methods, which increases the complexity of validity interpretations. There is also a gap between survey research and psychometrics in conceptions of validity and in practices. The goal of this special issue is to provoke a dialog on issues in conceptualizing validity, enhance knowledge and application of validation methods, and shed more light on the challenges faced in the interpretation and use of measures in survey research. This is particularly important against the background of the complexity and heterogeneity not only of the concepts under investigation but also of the increasing importance of mixed-mode, mixed-methods, or cross-cultural research. Research on the interplay between the survey context, the survey situation, and response behavior can provide additional insights into which aspects are relevant when providing validity-related evidence and interpretations.

Suitable topics include, but are not restricted to:
– Validation methods: new developments and adoptions
– Interplay among validity, systematic biases, and reliability
– Response behavior and validity
– Validity within the framework of multicultural research
– Validity issues related to specific modes

We expect contributions that have a potentially high impact on the field and deliver a strong message on the basis of sound evidence.


Out of scope are manuscripts focusing on newly developed measurement instruments. We also do not consider for publication technical papers dealing with very specific mathematical estimation problems.

We first invite you to submit extended abstracts (500–1,000 words), which will be evaluated by the guest editors. The authors of the selected abstracts will then be invited to contribute full papers, which will be subjected to peer review. To include as many papers as possible in the special issue, the word count of full papers is expected to be less than 6,000 words, as per editorial requirement.

Guest Editors
Natalja Menold: E-mail natalja.menold@gesis.org
Matthias Bluemke: E-mail matthias.bluemke@gesis.org
Anita Hubley: E-mail anita.hubley@ubc.ca

Coordinating Editor
Jose-Luis Padilla: E-mail methodologyjournal@ugr.es

More information on the journal, the peer-review process, and manuscript preparation can be obtained at www.hogrefe.com/j/methodology. Send your extended abstracts to one of the two editors of the journal:

Peter Lugtig
Department of Methods & Statistics, University of Utrecht, Padualaan 14, 3584 CH Utrecht, The Netherlands, E-mail MethodologyJournal.fsw@uu.nl



Jose-Luis Padilla
Department of Methodology of Behavioural Sciences, University of Granada, 18071 Granada, Spain, E-mail methodologyjournal@ugr.es

Important Dates and Deadlines
Deadline for the submission of the abstracts (500–1,000 words): April 30, 2017
Decision on the abstracts: May 31, 2017
Deadline for submission of the full papers: October 30, 2017
Decision on the full papers: January 20, 2018
Deadline for the submission of revised manuscripts: April 30, 2018
Final decision: June 30, 2018

