Volume 12 / Number 1 / 2016

Methodology
European Journal of Research Methods for the Behavioral and Social Sciences
Official Organ of the European Association of Methodology

Editors: Peter Lugtig, José L. Padilla
Contents

Original Articles

An Extended Multidimensional IRT Formulation for the Linear Item Factor Analysis Model — Pere J. Ferrando

Comparison of Principal Component Solutions in Two Populations: A Bootstrap Test of the Perfect Congruence Hypothesis — Gregor Sočan

The Impact of the Number of Dyads on Estimation of Dyadic Data Analysis Using Multilevel Modeling — Han Du and Lijuan Wang
Original Article
An Extended Multidimensional IRT Formulation for the Linear Item Factor Analysis Model
Pere J. Ferrando
Research Centre for Behavioral Assessment, "Rovira i Virgili" University, Tarragona, Spain
Methodology (2016), 12(1), 1–10. DOI: 10.1027/1614-2241/a000098
Abstract: In previous research Spearman's factor-analytic (FA) model has been formulated as a linear Item Response Theory (IRT) model for (approximately) continuous item responses. Furthermore, some generalized IRT-based indices have been proposed for multidimensional FA. However, to date no explicit IRT formulation has existed for this model. This article extends previous proposals in two directions. First, it proposes a Lord's-type parameterization of the multidimensional FA model in which each item is characterized by a difficulty index in each dimension. Second, it proposes two multidimensional IRT-based item-person-distance indices. The characteristics and advantages of all the proposed measures as well as their relations to existing indices are discussed. The usefulness of the proposal in item analysis and validity assessment is also discussed and illustrated with two empirical examples.

Keywords: factor analysis, item response theory, item difficulty, multidimensional item indices, person-item distance, distance-difficulty hypothesis, personality and attitude measurement
The graded-response item format in 5, 7, or more points is ubiquitous in typical-response (i.e., personality and attitude) measurement (e.g., Dawes, 1972). Furthermore, more continuous formats such as graphic or visual scales are becoming common in computer-administered questionnaires (e.g., Ferrando, 2009). When typical-response measures based on these types of items are assessed for dimensionality and structure, by far the most common model that serves as a basis is linear factor analysis (FA; e.g., Hofstee, Ten Berge, & Hendricks, 1998). This generalized usage, however, is not free from controversy. Rather, the appropriateness of linear FA for fitting bounded and discrete item scores is the subject of a never-ending and heated debate. The position taken in this article (which is discussed below in more detail) is that, for the typical-response items that are commonly found in applied research, linear FA is generally a good approximate model that works well in practice (see, e.g., Atkinson, 1988; Ferrando, 2009; Hofstee et al., 1998). Even when it behaves appropriately, however, linear FA poses two additional and generally overlooked problems when it is used in item analysis. First, in its standard parameterization, linear FA is not explicitly linked to a specific model of item responding (although, as discussed below, it is implicitly). Second, most applications are based on the interitem correlation matrix and use only some of the information that is available from the data (e.g., Reckase, 1997, 2009). More specifically, when correlation matrices are
factor-analyzed, (a) information about item difficulties or locations is lost and (b) the item discriminations are standardized coefficients, which are highly group-dependent (e.g., Ferrando, 2009). To a great extent the problems just mentioned arise because, even though applied researchers (and many psychometricians too) use linear FA for item analysis, they do not regard it as a proper item response model. However, the comprehensive proposals by Coombs (1964), Lord and Novick (1968), and McDonald (1999), among others, clearly show that linear FA can be considered as a dominance item response theory (IRT) model governed by the principle of local independence. Furthermore, in the unidimensional case, more specific proposals by Thissen, Steinberg, Pyszczynski, and Greenberg (1983), and Mellenbergh (1994) have developed Spearman's unidimensional linear FA as an IRT model for (approximately) continuous responses and have emphasized the equivalences with standard IRT models for discrete responses. On the basis of the previous studies mentioned above, Ferrando (2009) proposed an alternative parameterization of Spearman's model in which it was explicitly formulated as a dominance-based IRT model. The resulting formulation is similar in form to Lord's standard parameterization of the two-parameter IRT model for binary responses (Lord, 1952), in which the item location or difficulty parameter is on the same scale as the trait that is measured. Ferrando's
proposal has three advantages over the conventional FA formulation. First, it provides more information about item functioning and the relative standing of the individual with respect to the item. Second, it allows the results of the analysis to be interpreted in relation to a specific model of item responding. Third, it allows for further validity extensions with respect to external or auxiliary measures. As for this last point, and as discussed below, this type of validity assessment is based on the so-called distance-difficulty hypothesis which, in turn, requires a model-based measure of person-item distance to be obtained. Ferrando (2009) also proposed multidimensional extensions to the linear item FA model which were related to similar developments made in multidimensional item response theory (MIRT; e.g., Reckase, 2009). In particular, he proposed multidimensional measures of difficulty, discrimination, information, and model appropriateness. However, he did not extend Lord’s parameterization to the multidimensional case and used instead the standard intercept-slopes parameterization. Overall, the proposal was similar to those made in MIRT in which multidimensional indices have been proposed, but the basis parameterization is the standard threshold/slopes parameterization (McDonald, 1997; Reckase, 1985, 1997, 2009). The aim of this article is to further extend Ferrando’s proposal (2009) in two directions. First, it proposes a Lord’s-type formulation for the multidimensional item FA model in which each item has a difficulty or location parameter in each dimension. Second, two model-based person-item-distance measures are proposed. The two new proposed developments are submitted to be of both theoretical and practical interest for several reasons. The first proposal provides new information that allows the researcher to gain additional understanding about how the items function as measures in the multidimensional space. As for the second, the proposed measures allow the distance-difficulty hypothesis to be assessed in the multidimensional case. And, as discussed below, the assessment of this hypothesis has advantages not only in terms of validity assessment (as mentioned above), but also in terms of theoretical understanding and parameter estimation.
Review of the Unidimensional Proposal

The Linear Congeneric Item Score Model

Consider a test made up of n typical-response items with a graded or a quasi-continuous response format that measures a single trait, dimension, or common factor θ, and let Xij be the observed score of person i to item j.
For interpretative purposes the Xij's are scaled to have values between 0 and 1, and θ is scaled in a z-score metric (mean 0 and variance 1). Also for interpretative purposes it is useful to label the 0–1 endpoints of the item response scale "completely disagree" and "completely agree." In this case, the 0.5 midpoint would require the label "neither agree nor disagree." The 0–1 scaling I propose is not commonly used in practice, and even less so in the case of graded responses, which are usually scored with integers starting from 1. So, the proposal here is to rescale the raw item scores so that they are in the 0–1 range before the model is fitted. For example, in the case of a 10-point format scaled between 1 and 10, the rescaling transformation is y = (x − 1)/9. Overall, the 0–1 scaling is similar to the proportion scaling that Lord (1952) proposed for the raw scores and has similar advantages. First, it is independent of the number of response points or the length of the response scale: the endpoints and midpoint are always the same. Second, the resulting model and model estimates are closer to the two-parameter model for binary responses. When applied to item responses, Spearman's model is known as the congeneric item score model (Jöreskog, 1971; Mellenbergh, 1994). Its standard intercept-slope formulation is:

Xij = μj + λj θi + εij,   (1)
where μj is the item intercept, λj is the item loading, slope, or regression weight, and εij is the measurement error. For fixed θ the item responses are distributed independently and their conditional distribution is assumed to be normal, with mean and variance given by

E(Xj | θi) = μj + λj θi,   Var(Xj | θ) = σ²εj.   (2)
As a function of θ the conditional mean above is the linear item response function (IRF) of the model. As for the variance term, it is the conditional variance of the item scores for fixed trait level. Unlike the case of standard IRT models for discrete responses, in which the conditional variance depends on the trait level, the variance in (2) does not depend on θ (i.e., homoscedasticity; see, e.g., Mellenbergh, 1994). Because the item responses are bounded between 0 and 1 while θ is unlimited, the linear IRF in (2) can only be correct for a limited range of trait values. To address this issue, Ferrando (2009) proposed lower and upper bound (floor and ceiling) indices that indicate the range of values for which the IRF is essentially linear (see Ferrando, 2009, Equation 5). Consider now the transformation:

βj = (0.5 − μj) / λj.   (3)
By using the index βj in (2) the IRF can be written as

E(Xj | θi) = 0.5 + λj (θi − βj).   (4)
Expression (4) is a dominance-based formulation of the same type as that of the two-parameter IRT model for binary responses. In effect, for the normal-ogive case this last model can be written as

E(Xj | θi) = Pj(θi) = Φ(aj (θi − bj)),   so that   Φ⁻¹(Pj(θi)) = aj (θi − bj),   (5)
where Φ denotes the standard normal c.d.f. Formulation (5) was first proposed by Lord (1952) and, as mentioned above, is therefore generally known as Lord's parameterization (McDonald, 1997). From inspection, it follows that the basic linear model is essentially the same in (4) and in (5). However, in (4) the linear model directly models the conditional mean, whereas in the binary case it models the probit transformation of this conditional mean. The parameter βj in (3) is a location or difficulty index that is on the same scale as θ and which can be defined as the trait level that corresponds to an expected score of 0.5 (i.e., the response scale midpoint). This is a natural extension of the difficulty index bj in the two-parameter model in (5) (the trait level which corresponds to a 0.5 probability of endorsing the item). Furthermore, because θ is scaled in the same metric, the βj estimates in (4) and the bj estimates in (5) are on the same scale and so are interpreted in the same way (see Ferrando, 2009). Conceptually, βj can be viewed as the threshold on the θ continuum that marks the transition from a tendency to disagree with the item to a tendency to agree with it. The difference (θi − βj) is the signed unweighted person-item distance (PID). Overall, model (4) expresses the expected item score as a weighted signed PID, so it behaves like a dominance model: when the trait level is above the item difficulty, the item score is above the midpoint of the scale (i.e., 0.5). The weight λj, assumed to be positive, is interpreted as the item discrimination parameter (Mellenbergh, 1994). It acts as a scaling factor that relates the trait continuum scale to item j's response scale. The higher λj is, the more the PIDs will be reflected in the item scores and the greater the differentiation of these scores will be. In the scaling considered here, typical values for λj in a personality test range from 0.04 to 0.25 (see Ferrando, 2009). To close this section, some discussion will be provided about how appropriate and useful the model discussed so far is in comparison to existing alternatives. For continuous response formats, Samejima's (1973) normal-ogive continuous response model (CRM) is an alternative to model (1). Essentially, the CRM can be formulated as an application
of (1) not to the direct 0–1 scaled scores but to the logit transformation of these scores (see, e.g., Ferrando, 2010). This formulation is theoretically more plausible because it (a) eliminates inadmissible expected values (outside the 0–1 range), (b) allows for decreasing variance toward the extremes of the scale, and (c) allows for progressively more skewed conditional distributions (i.e., floor and ceiling effects) toward the extremes of the scale. For items which are both extreme and highly discriminating these advantages are expected to be relevant. However, the discriminations of personality/attitude items are generally only moderate, and with this type of item the advantages do not seem to appear in practice (see Ferrando, 2010 for a review). For the case of graded-response items, IRT models that explicitly treat graded responses as discrete and bounded variables already exist. So, in principle, they are more appropriate than the linear approach. With regard to the expected scores, these models have the three main advantages discussed above: nonlinearity, heteroscedasticity, and increasingly asymmetric conditional distributions. Furthermore, some of them are very flexible and allow, for example, for unequally spaced categories. This greater appropriateness and flexibility, however, has a cost. First, in most cases these models require strong basic assumptions (e.g., normal underlying responses). Second, item calibration is complex and can become unstable in certain situations (mainly long tests and moderate-to-small samples). Third, individual scores cannot be obtained in closed form, and iterative procedures which might lead to unstable or implausible estimates are required. Finally, they are generally more parameterized, which makes item-level results more difficult to interpret. To see this point, note that the present model has a single item location parameter βj. In contrast, for a k-category item, a flexible model such as the graded-response model has k − 1 locations and k item category response functions. Clearly, the linear FA model in (1) must be considered as an approximation when bounded and discrete item responses are modeled. However, for graded or (approximately) continuous typical-response items with moderate discrimination this approximation is generally expected to be good. If this is the case, the main advantage of linear FA is its simplicity. First, it does not make underlying assumptions, and the assumption that the regressions are linear is easy to check. Second, item calibration is simpler and more direct, and this is expected to lead to more stable estimates in most of the circumstances considered here, mainly when the number of response points is large and the sample is not too big. Third, results are easier to interpret. Finally, the person estimates can be obtained in closed form, which makes it much easier to obtain plausible (finite) and stable estimates.
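As a concrete illustration of the parameterization just reviewed, the following minimal sketch (not part of the original article; Python/NumPy, with hypothetical intercept and loading values) rescales raw graded responses to the 0–1 range and computes the Lord's-type difficulty of Equation 3 and the linear IRF of Equation 4.

```python
import numpy as np

def rescale_01(x, low=1, high=10):
    """Rescale raw graded responses from [low, high] to the 0-1 range,
    e.g., y = (x - 1)/9 for a 10-point format scored 1-10."""
    return (np.asarray(x, dtype=float) - low) / (high - low)

def beta_lord(mu, lam):
    """Lord's-type difficulty (Equation 3): beta_j = (0.5 - mu_j) / lambda_j."""
    return (0.5 - mu) / lam

def linear_irf(theta, mu, lam):
    """Linear IRF (Equation 4): E(X_j | theta) = 0.5 + lambda_j (theta - beta_j),
    algebraically identical to mu_j + lambda_j * theta."""
    return 0.5 + lam * (theta - beta_lord(mu, lam))

# Hypothetical item: intercept 0.40, loading 0.20 (in the 0-1 scaling)
print(rescale_01([1, 5, 10]))                               # 0.0, 0.444..., 1.0
print(beta_lord(0.40, 0.20))                                # 0.5
print(linear_irf(np.array([-1.0, 0.5, 2.0]), 0.40, 0.20))   # [0.2 0.5 0.8]
```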
The Distance-Difficulty Hypothesis

Generally stated, the distance-difficulty hypothesis (DDH; Kuncel, 1973; Nowakowska, 1983) predicts that the uncertainty and difficulty of responding to an item increase as the position of the item and the position of the individual on a common continuum approach one another. So, if the DDH is to be linked to a specific IRT model, first the common continuum must be defined, and second a person-item distance measure must be defined on it. In the congeneric model discussed above, two item-person continua arise in a natural way, and both lead to the same predictions. The first one is the θ continuum, and the distance measure can be the weighted or the unweighted PID defined above. For both measures, when the trait level of the individual matches the difficulty of the item (i.e., when θi = βj) the PID = 0, and the difficulty in responding is predicted to be maximal. As the PID increases in any direction, the responding process is assumed to become easier. The second natural possible continuum is the [0–1] item response scale, and the corresponding distance measure is obtained from the projection of (4) on this scale. When the projection or expected score is 0.5 (i.e., neither agree nor disagree) the uncertainty in responding is predicted to be maximal. As the expected score moves away from this middle point and goes to one extreme or the other of the response scale (0 = completely disagree or 1 = completely agree) the responding process is assumed to become easier. Because the 0.5 point is expected when θi = βj, the predictions obtained with both schemas in the unidimensional case are the same. Ferrando and coworkers (Ferrando, 2006; Ferrando, Anguiano-Carrasco, & Demestre, 2013; Ferrando & Lorenzo-Seva, 2007) used the DDH defined on the θ continuum to predict that item response latencies would decrease, stability would increase, and response certainty would also increase as the weighted PID |λj(θi − βj)| increases. For the three dependent variables, evidence has generally supported these predictions. Successful assessment of the DDH is considered to have both theoretical and practical relevance in IRT applications. At the theoretical level, the evidence supporting the DDH provides a better understanding of the response processes that determine the scores on personality items (Ferrando, 2006). At a more practical level, fulfillment of the DDH (a) provides external evidence of validity and model appropriateness and (b) allows external information that potentially improves the model estimation to be used. Regarding the first point, model-data fit assessment in IRT models is generally based on internal evidence. However, Hambleton and Swaminathan (1985) considered that the best approach for addressing model-data fit in IRT is to make tangible
predictions from the model and then to use observed data to check whether these predictions are approximately correct. Clearly, assessing the DDH by using external dependent variables is a part of this type of assessment. Finally, regarding point (b) above, if the DDH is fulfilled, then the external information which is provided by the retest stability, latencies, or reported certainty can be incorporated into the model and used to obtain improved item and person estimates (e.g., Ferrando, 2006). So far, the DDH has only been considered within the unidimensional framework. The first group of developments considered in this article, however, allows multidimensional extensions of the DDH to be undertaken.
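To make the DDH prediction concrete, the following sketch (mine, using simulated data; the linear latency-distance relation is assumed purely for illustration) computes the weighted PID |λj(θi − βj)| and checks that it correlates negatively with response latency, as the studies cited above report.

```python
import numpy as np

def weighted_pid(theta, lam, beta):
    """Unidimensional weighted person-item distance |lambda_j (theta_i - beta_j)|."""
    return np.abs(lam * (theta - beta))

rng = np.random.default_rng(0)
theta = rng.standard_normal(500)                 # simulated trait levels
pid = weighted_pid(theta, lam=0.15, beta=0.3)
# Hypothetical latencies: responses become faster (smaller) as PID grows
latency = 2.0 - 3.0 * pid + 0.5 * rng.standard_normal(500)
print(np.corrcoef(pid, latency)[0, 1] < 0)       # True: negative, as the DDH predicts
```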
The Multidimensional Proposals

Multidimensional Lord Parameterization

For didactic purposes the multidimensional parameterization will be developed on the basis of the bidimensional model, which is the simplest, and whose results can be easily interpreted and displayed using geometrical representations in the θ1, θ2 plane. The results, however, are valid for any number of dimensions, and general formulas for m dimensions will also be presented. The item response surface (IRS) of the bidimensional FA model in intercept-slopes parameterization is

E(Xj | θi) = μj + λj1 θi1 + λj2 θi2.   (6)
Now, for dimension k = 1 or 2, consider the transformation:

βjk = λjk (0.5 − μj) / (λj1² + λj2²),   (7)
which for the general case of m factors becomes

βjk = λjk (0.5 − μj) / Σ(k=1…m) λjk².   (8)
Using (7) in (6) the proposed alternative parameterization is obtained:

E(Xj | θi) = 0.5 + λj1 (θi1 − βj1) + λj2 (θi2 − βj2).   (9)

It is clear (a) that in the unidimensional case βjk in (8) reduces to βj in (3), and (b) that (9) is a direct generalization of (4). In order to fully understand and interpret the results of (8) and (9), however, they must be related to the existing multidimensional difficulty index.
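A minimal sketch of the transformation in Equation 8 (again mine, with made-up intercepts and loadings): each item's vector of per-dimension difficulties is obtained from its intercept and its loadings on all m dimensions.

```python
import numpy as np

def beta_md(mu, lam):
    """Per-dimension difficulties (Equation 8):
    beta_jk = lambda_jk (0.5 - mu_j) / sum_k lambda_jk**2.
    mu: (n_items,) intercepts; lam: (n_items, m) loading matrix."""
    lam = np.asarray(lam, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return lam * (0.5 - mu)[:, None] / (lam ** 2).sum(axis=1, keepdims=True)

# Two hypothetical items in m = 2 dimensions
mu = np.array([0.35, 0.55])
lam = np.array([[0.20, 0.05],
                [0.10, 0.15]])
print(beta_md(mu, lam))   # row j holds (beta_j1, beta_j2)
```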
Reckase (1985) used a two-step approach for defining a multidimensional difficulty item index and a multidimensional discrimination item index intended for nonlinear binary MIRT models. To review these indices, I shall first consider the bidimensional extension of the two-parameter model (5) written in intercept/slope form

Φ⁻¹(Pj(θi)) = aj1 θi1 + aj2 θi2 + dj,   (10)

where ajk is the discrimination index of item j in dimension k and dj is a scalar location (intercept) parameter for item j. Reckase's multidimensional difficulty and discrimination indices are, respectively,

MDj = −dj / √(Σ(k=1…m) ajk²)   (11)

and

MDISCj = √(Σ(k=1…m) ajk²).   (12)

The multidimensional difficulty index in (11) is defined as the signed distance from the origin on the θ1, θ2 plane to the point for which the expected item score is 0.5 in the direction of the maximum slope of the IRS. The multidimensional discrimination in (12) is the maximum slope in this direction. Ferrando (2009) adapted Reckase's approach to the linear model (6), in which the IRS is a plane, and obtained the following indices for the general case of m factors:

MDβj = (0.5 − μj) / √(Σ(k=1…m) λjk²)   (13)

and

MDλj = √(Σ(k=1…m) λjk²).   (14)

The multidimensional difficulty index MDβj in (13) and the multidimensional discrimination MDλj in (14) have the same definition as the original indices (11) and (12) intended for binary responses and virtually the same form. Ferrando (2009) further obtained the direction cosines that describe the direction of maximum slope:

cos αjk = λjk / √(Σ(k=1…m) λjk²).   (15)

By comparing Equations 8, 13, and 15 the following relations are readily obtained:

βjk = cos(αjk) MDβj   and   MDβj² = Σ(k=1…m) βjk².   (16)

[Figure 1. Geometrical representation of the difficulty parameters in the bidimensional case. Axes θ1 and θ2; the vector OP (angles α1 and α2 from the axes, length MDβ) points in the direction of maximum slope; β1 and β2 are its projections on the axes.]
Interpretation of these results becomes clearer if the geometrical representation in Figure 1 is used. The vector OP is in the direction of maximum slope, which is determined by the angles α1 and α2. The length of this vector is the multidimensional difficulty of the item. The location indices β1 and β2 are the orthogonal projections of OP on the θ1 and θ2 axes or, alternatively, the rectangular coordinates of point P. Now, the results in (16) can be deduced immediately; in particular, the second result is a particular case of Pythagoras' theorem. At a more conceptual level, the results in (16) indicate that: (a) in squared terms, the multidimensional difficulty index is the sum of the unidimensional difficulties proposed here, and (b) the contribution of each unidimensional difficulty to the multidimensional difficulty depends on how close the directional vector OP is to the corresponding axis. Conceptually, model (9) is a compensatory model in which the signed weighted item-person differences λjk(θik − βjk) determine the final expected item score. Within this model, the index βjk measures the conditional difficulty of item j with respect to dimension k. So, if the remaining terms are constant, the greater the difference θik − βjk is, the more differentiated the expected response to item j becomes. For a solution in m dimensions, let us now summarize the proposal for discrimination and difficulty. For discrimination, each item is characterized by m discrimination indices λjk and a multidimensional discrimination index MDλj. The multidimensional index measures the discriminating power of the item for the best combination of the dimensions and, therefore, the ability of the item to distinguish between individuals who are in different locations in the θ space. Each unidimensional index measures the discriminating power of the item with respect to the corresponding dimension. Now, if MDλj is low, then all
the unidimensional discriminations are necessarily low. However, MDλj can be high with items that are moderately discriminating in several dimensions or with an item that is highly discriminating in a few dimensions, but hardly discriminating at all in the others. So, the individual discriminations complement the global information provided by MDλj. A similar interpretation schema is proposed here for item difficulty. The multidimensional index MDβj measures the general difficulty of the item in the direction of maximum discrimination. Now, as Equation 16 and Figure 1 show, the projection of MDβj on the different dimensions determines the individual difficulty values βjk. So, these values show how the total item difficulty is distributed throughout the dimensions (note again that in squared terms each individual difficulty contributes additively to the total difficulty). If the vector MDβj is close to a particular axis θk, then item j measures primarily dimension k, and, as discussed above, almost all the multidimensional difficulty is projected on the unidimensional βjk value (i.e., the direction cosine approaches 1 in absolute value). For a clearer interpretation, suppose further that MDβj is high and positive (i.e., item j is difficult). Then βjk is also high and positive and this means that this item can only be endorsed by a high θik value. It is in this sense that item j can be regarded as almost a univocal measure of θk.
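The indices in Equations 13–16 are easy to compute from an intercept-slopes solution; the sketch below (mine; the numerical values are hypothetical) also verifies the two relations in Equation 16.

```python
import numpy as np

def md_indices(mu, lam):
    """Multidimensional indices for the linear FA model.
    Returns MDbeta (Eq. 13), MDlambda (Eq. 14), the direction cosines
    (Eq. 15), and the per-dimension difficulties beta_jk (Eq. 16)."""
    lam = np.asarray(lam, dtype=float)
    md_lam = np.sqrt((lam ** 2).sum(axis=1))       # Eq. 14
    md_beta = (0.5 - np.asarray(mu)) / md_lam      # Eq. 13
    cosines = lam / md_lam[:, None]                # Eq. 15
    beta = cosines * md_beta[:, None]              # first relation in Eq. 16
    return md_beta, md_lam, cosines, beta

mu = np.array([0.35, 0.55])                        # hypothetical intercepts
lam = np.array([[0.20, 0.05],                      # hypothetical loadings
                [0.10, 0.15]])
md_beta, md_lam, cosines, beta = md_indices(mu, lam)
# Second relation in Eq. 16: MDbeta^2 equals the sum of the squared beta_jk
print(np.allclose(md_beta ** 2, (beta ** 2).sum(axis=1)))  # True
```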
Multidimensional Person-Item Distances

The weighted PID defined in the θ continuum that was discussed in the unidimensional case has a natural multidimensional extension that takes the following form:

MPID(θ)ij = √(Σ(k=1…m) λjk² (θik − βjk)²).   (17)

So, analytically the MPID(θ) which is proposed in (17) is the square root of the sum of the squared weighted item-person discrepancies over the m dimensions. Geometrically it is the weighted distance between the vector θi = [θi1, θi2, …, θim] and the MDβj vector OP in Figure 1. With regard to the second type of PID defined on the [0–1] item response scale continuum, the measure is the same as that discussed in the unidimensional case. The difference is that the projection score that is subtracted from the 0.5 point of maximum uncertainty is obtained from the projection of the multidimensional model (9) instead of the unidimensional model (4):

MPID(X)ij = |0.5 − E(Xj | θi)|.   (18)

Unlike the unidimensional case, the two PIDs just defined might lead to different predictions regarding the DDH. When the trait levels are equal to the item difficulties in all the dimensions, then MPID(θ) = 0 (see Equation 17), the expected item score is the midpoint scale level, and, according to the DDH, uncertainty and difficulty of responding are maximal. So, in this case both measures lead to the same prediction. When they are not equal, however, the relations are more complex. Model (9) is compensatory, so combinations of trait values can be found for which, term by term, the squared θik − βjk differences are high (and so MPID(θ) is substantial) but, at the same time, the predicted item score is undifferentiated (i.e., the scale midpoint). In these cases the two PIDs lead to plausible but differentiated DDH predictions, especially if uncertainty, latency, and response extremeness are kept as separate constructs (Ferrando et al., 2013). An initial prediction based on (17) is that when MPID(θ) is high then the difficulty of responding will be low regardless of the scale point at which the expected score falls. So, response certainty will be high, stability will be high, and the responses will be fast. A second prediction based on (18) is that difficulty of responding depends on how near the expected score is to the scale midpoint (i.e., the agree-disagree transition point) regardless of the particular trait combination that gives rise to this expectation. If the DDH is to be used with multidimensional models to obtain the potential advantages derived from it, then we need to assess (a) the extent to which the predictions derived from measures (17) and (18) are fulfilled, and (b) which of the two measures and the hypotheses derived from them best represents the response process to personality items and leads to better predictions. This type of assessment is ultimately an empirical matter, and Example 2 below aims to provide some initial results.
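The two distances in Equations 17 and 18 can be sketched as follows (my code, with hypothetical values). The example uses a compensating trait pattern to reproduce the divergence discussed above: MPID(θ) is substantial while MPID(X) is zero.

```python
import numpy as np

def mpid_theta(theta, lam, beta):
    """MPID(theta), Equation 17, for one person against all items.
    theta: (m,); lam and beta: (n_items, m)."""
    d = np.asarray(theta)[None, :] - np.asarray(beta)
    return np.sqrt(((np.asarray(lam) * d) ** 2).sum(axis=1))

def mpid_x(theta, mu, lam):
    """MPID(X), Equation 18: |0.5 - E(X_j | theta)|, with the expectation
    from the compensatory model (9), i.e., mu_j + sum_k lambda_jk theta_k."""
    return np.abs(0.5 - (np.asarray(mu) + np.asarray(lam) @ np.asarray(theta)))

mu = np.array([0.5])                 # hypothetical item at the scale midpoint
lam = np.array([[0.2, 0.2]])         # equal loadings on both dimensions
beta = np.array([[0.0, 0.0]])        # difficulties implied by Eq. 8
theta = np.array([1.5, -1.5])        # high on one trait, low on the other
print(mpid_theta(theta, lam, beta))  # about 0.42: far from the item in theta space
print(mpid_x(theta, mu, lam))        # 0.0: the weighted differences cancel out
```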
Illustrative Examples

The proposals made in this article will be illustrated with two empirical studies in personality measurement. The first study attempts to show how the proposed parameterization can provide useful information for FA-based item analysis. The second study assesses the functioning of the two PIDs proposed here with regard to the DDH predictions when the dependent variable is the item response time. In both cases, the basis factor analyses were carried out using the FACTOR 9.2 program (Lorenzo-Seva & Ferrando, 2013).
Example 1: Interpretation of the Item Calibration Results

The first example uses the physical aggression (PA, 9 items) and anger (AN, 6 items) subscales of the Spanish version of Buss and Perry's (1992) aggression questionnaire (AQ; Morales-Vives, Codorniu-Raga, & Vigil-Colet, 2005; Vigil-Colet, Lorenzo-Seva, Codorniu-Raga, & Morales-Vives, 2005), a multidimensional personality measure made up of 5-point Likert items. The item set was administered to a sample of 241 secondary school students between 12 and 17 years old. The bidimensional FA model in (6) was fitted to the 15 AQ items by using a Procrustes semi-confirmatory approach in which an initial two-factor solution was analytically rotated against a partially specified target matrix by assuming correlated factors (Browne, 1972). The target hypothesis was that physical aggression was mainly defined by items PA1–PA9, whereas anger was mainly defined by items AN1–AN6. The initial solution was obtained by using unweighted least squares (ULS) and, in accordance with this criterion, the goodness-of-fit measures were the gamma goodness-of-fit index (GFI) and the root mean square of the standardized residuals (z-RMSR), which was assessed in reference to Kelley's cut-off criterion (see Lorenzo-Seva & Ferrando, 2013; McDonald, 1999). The fit of the model was quite acceptable, with a GFI value of 0.98 and a z-RMSR value of 0.05, which is below Kelley's criterion (0.06). Next, the intercept and slope parameter estimates were transformed to the parameters proposed here using Equations 7, 8, 13, and 14. The resulting parameterization is shown in Table 1. The estimated interfactor correlation was φ = 0.47.

Table 1. Item calibration results for the AQ in Example 1: discrimination and difficulty

Item    λj1     λj2     MDλj    βj1     βj2     MDβj
PA1     0.157   0.076   0.175   1.252   0.607   1.391
PA2     0.250   0.000   0.250   0.499   0.001   0.499
PA3     0.217   0.000   0.217   0.076   0.000   0.076
PA4     0.132   0.035   0.137   2.239   0.593   2.316
PA5     0.223   0.006   0.223   0.539   0.015   0.539
PA6     0.227   0.046   0.232   0.979   0.197   0.998
PA7     0.148   0.165   0.221   0.332   0.370   0.497
PA8     0.148   0.035   0.152   2.056   0.482   2.112
PA9     0.143   0.080   0.164   1.157   0.648   1.326
AN1     0.034   0.170   0.173   0.011   0.053   0.054
AN2     0.062   0.074   0.096   0.527   0.631   0.822
AN3     0.052   0.191   0.198   0.007   0.025   0.026
AN4     0.104   0.065   0.123   0.036   0.022   0.042
AN5     0.065   0.129   0.145   0.753   1.501   1.679
AN6     0.052   0.196   0.203   0.210   0.789   0.817

Note. Bold-faced loadings are statistically significant (in the scaling used, the standard errors are of the order of 0.02).

According to the single-dimension discrimination indices, the solution given in Table 1 is far from approaching an
independent-cluster structure. However, the factors can be reasonably well distinguished and generally tend to agree with the subscale labels. The first factor is better defined, has a clearer structure, and approaches an independent-cluster-basis solution (McDonald, 2000) in which items PA2 and PA3 act as markers. The second factor is less well defined and has a more complex structure. The multidimensional discriminations agree with Ferrando’s (2009) reference values provided above and suggest that the discriminating power of the items is moderate. For the least discriminating items (e.g., AN2) a change of one standard deviation unit in the trait level, which is a large change, results in an expected change of about 0.09 units in the 0–1 item response scale. For the most discriminating items (e.g., PA2), the unit-trait change leads to an expected score change of about 0.25 units. As for difficulties, according to the MDβ indices, the items tend to be “difficult” for these respondents (most of the signs of the MDβj estimates are positive), which means that a high level of aggressiveness is required to fully agree with most of them. The most “difficult” items (PA4 and PA8) can be considered to be fairly extreme and both refer to physical aggression. The parameterization proposed in the article allows an additional, more detailed analysis to be made on an item-by-item basis, which I shall illustrate by using two items. Consider first item PA4. It has a moderate discrimination on the PA dimension and a low discrimination on the AN dimension. So, it is a relatively “clean” measure of Physical Aggression. According to parameterization (9) the prime response determinant to this item is the weighted PID corresponding to the first dimension. Furthermore, the item is very difficult with respect to this dimension and, other things constant, a high level of physical aggressiveness (θ1 = 2.24) is needed to reach the threshold midpoint in the response scale. If this level is attained, however, a far lower level of emotional aggressiveness (θ2 > 0.59) is then required to agree with the item. Overall, item PA4 provides quite a lot of information about the location of the individual in the θ plane. A high-agreement score (i.e., near the 1 endpoint) means that the individual must have a high level of physical aggressiveness because not even a high level of emotional aggressiveness could sufficiently compensate for a low level and still reach such an extreme score. Consider now item PA7. It belongs “a priori” to the same subscale as PA4 but is far more factorially complex. It has moderate discriminations along both dimensions and the amount of discriminating power is about the same in both of them. Furthermore, the item is moderately difficult in both dimensions, and the amount of difficulty is about the same. As a measure of PA, then, PA7 is much less differentiated than PA4 and provides less information. To see this point, note that agreement with this item can be reached with several combinations of trait levels.
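The relations among the columns of Table 1 can be checked directly; the small sketch below (mine) recomputes MDλ and MDβ for items PA4 and PA7 from their per-dimension values via Equations 14 and 16. Small discrepancies reflect the rounding of the printed estimates.

```python
import numpy as np

# lambda_j1, lambda_j2 and beta_j1, beta_j2 for PA4 and PA7, taken from Table 1
lam = np.array([[0.132, 0.035],    # PA4
                [0.148, 0.165]])   # PA7
beta = np.array([[2.239, 0.593],   # PA4
                 [0.332, 0.370]])  # PA7
print(np.sqrt((lam ** 2).sum(axis=1)).round(3))   # [0.137 0.222] vs. 0.137, 0.221
print(np.sqrt((beta ** 2).sum(axis=1)).round(3))  # [2.316 0.497], as in Table 1
```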
Example 2: Assessment of the Multidimensional Distance-Difficulty Hypothesis

The second example uses a dataset from a study by Ferrando and Lorenzo-Seva (2007), to which the reader is referred for further details. The measure was a 40-item questionnaire from the Spanish version of the Five-Factor Personality Inventory (FFPI; Rodríguez-Fornells, Lorenzo-Seva, & Andrés-Pueyo, 2001) in which 20 items attempted to measure "Extraversion" (E) and the remaining 20 "Neuroticism" (N). The response format was 5-point Likert. The questionnaire was administered via computer to a group of 262 undergraduate students and the 40 items were presented as a mixture of the E and N items. In addition, the response time in milliseconds to each item was recorded. Next, the response times were trimmed, transformed using the reciprocal transformation, and reversed to maintain the sign of the original distance-latency relations (expected to be negative in all cases). In the present study, the 40 items were jointly assessed with the bidimensional FA model in (6). The calibration procedure was the same as in the first example, but in this case the target matrix specified the 20 E items to define the first dimension and the 20 N items to define the second. The goodness-of-fit results were quite acceptable, with a GFI value of 0.96 and a z-RMSR value of 0.05, which was below Kelley's criterion (0.06). The estimated interfactor correlation was φ = 0.33. The rotated solution is available from the author. The 262 × 40 distance matrices were then computed for the alternative MPIDs (17) and (18) and related to the 262 × 40 matrix containing the transformed response latencies. The relations were first assessed on an item-by-item basis (i.e., for respondents within each of the items). For both types of distance – MPID(θ) and MPID(X) – the 40 distance-latency product-moment correlations were negative, as expected from the DDH. The overall significance of these results was assessed by testing the null hypothesis that the vector of correlations was zero in the population with Steiger's (1980) quadratic form chi-square statistic. The results were χ²(40) = 268.05 for MPID(θ) and χ²(40) = 346.74 for MPID(X), both of which lead to overwhelming rejection of the null hypothesis. As was also expected, however, the correlations were low. The average correlations were −0.14 (MPID(θ)) and −0.16 (MPID(X)). In order to obtain clearer differential results, the correlations between the average latencies and the average MPIDs were then obtained. Conceptually this approach is equivalent to considering the 40 latencies as fallible items that are indicators of a "total" or "average" test latency and the 40 MPIDs as indicators of an "average" distance. The results are in the top panel of Table 2.
Table 2. Distance-latency results: Example 2

First-order correlations
rTime,PID(θ) = −0.26        p < .001
rTime,PID(X) = −0.32        p < .001
rPID(θ),PID(X) = 0.71       p < .001

Part correlations
rTime(PID(θ),PID(X)) = −0.05   N.S.
rTime(PID(X),PID(θ)) = −0.20   p < .001
The first-order latency-distance product-moment correlations in Table 2 are of respectable size if we take into account the type of variables that are related (i.e., the MPID is only one of the multiple determinants of response latency) as well as the fact that the MPIDs were computed on the basis of the individual trait estimates instead of the unknown “true” values (an approximation which adds measurement error). Again, the relations seem to be stronger for the MPIDs (18) computed on the item response scale. The differences, however, are small and, furthermore, both types of distance were highly correlated. To further clarify differential relations, the bottom of Table 2 shows the semi-partial or part correlations between the latency on the one hand, and each adjusted distance partialled out for the other distance on the other. It was found that the correlation between latency and adjusted MPID(θ) was nonsignificant whereas the correlation between latency and the adjusted MPID(X) clearly was. So, overall these initial results provide support for the multidimensional generalization of the DDH, and more specifically for the hypothesis that uncertainty and difficulty of responding depend more on the nearness of the expected score to the scale midpoint.
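The part correlations at the bottom of Table 2 follow from the first-order values by the standard semi-partial formula; the sketch below (mine; it assumes the negative time-distance signs implied by the text) reproduces them to rounding accuracy.

```python
import numpy as np

def part_correlation(r_y1, r_y2, r_12):
    """Semi-partial (part) correlation of y with x1, x2 partialled out of x1:
    (r_y1 - r_y2 * r_12) / sqrt(1 - r_12**2)."""
    return (r_y1 - r_y2 * r_12) / np.sqrt(1 - r_12 ** 2)

# First-order values from Table 2
print(round(part_correlation(-0.26, -0.32, 0.71), 2))  # -0.05, as in Table 2
print(round(part_correlation(-0.32, -0.26, 0.71), 2))  # -0.19 (Table 2: -0.20)
```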
Discussion

The proposals made in this article are thought to be of interest because of various features that are consistently observed in personality and attitude applications. First, many tests in these domains consist of graded or more continuous items, and item analysis is performed on the basis of the linear FA model. Second, standard item FA applications are inefficient in many aspects and use less information than is available in the data matrix. In the unidimensional case there is an alternative IRT parameterization that provides more information and increases the efficiency of the item analysis process. Finally, multidimensional measures are very common in these domains, so IRT-based extensions of the type proposed in the
unidimensional case are expected to be useful, provide more information, and increase the efficiency of item analyses based on this type of measure. The extensions proposed here are of two types. First, a multidimensional Lord's-type IRT parameterization has been proposed for the FA model in which each item is characterized by a difficulty or location parameter in each dimension. The proposed parameterization has meaningful relations with existing multidimensional indices and is expected to work well when it is used together with them. As the first example attempts to show, a joint inspection of the unidimensional and multidimensional discriminations and difficulties allows the researcher to make a detailed scrutiny of the functioning of each item and provides insight into what the items are measuring and how they function as measures of θ. Regarding potential developments in the future, it would be interesting to explore how this gained information can be used for practical applications such as item selection, construction of adaptive tests, assessment of differential item functioning, or construction of equivalent forms. The second proposed extension consists of a multidimensional PID measure that can be used in validity assessments based on the DDH. The simple rationale developed in the unidimensional case, however, does not easily generalize to the multidimensional case, and the appropriateness of the proposal must be empirically assessed. In this respect, the initial assessment in the second example has provided encouraging results. When latency was used as a criterion for difficulty in responding, both distances behaved in accordance with the basis DDH, and the results in terms of averages were nontrivial. As far as differential functioning was concerned, the MPIDs based on the projection on the response scale worked better than the MPIDs computed on the θ space, but further intensive research is needed on this issue. If results are consistent in future applications, the distance-latency relations could be used for the different purposes that were discussed above.

Acknowledgments

This research was supported by a grant from the Spanish Ministry of Economy and Competitiveness (PSI2014-52884-P).
References

Atkinson, L. (1988). The measurement-statistics controversy: Factor analysis and subinterval data. Bulletin of the Psychonomic Society, 26, 361–364. doi: 10.3758/BF03337683
Browne, M. (1972). Oblique rotation to a partially specified target. The British Journal of Mathematical and Statistical Psychology, 25, 207–212. doi: 10.1111/j.2044-8317.1972.tb00492.x
Buss, A. H., & Perry, M. P. (1992). The aggression questionnaire. Journal of Personality and Social Psychology, 63, 452–459.
Coombs, C. H. (1964). A theory of data. New York, NY: Wiley.
Dawes, R. M. (1972). Fundamentals of attitude measurement. New York, NY: Wiley.
Ferrando, P. J. (2006). Person-item distance and response time: An empirical study in personality measurement. Psicológica, 27, 137–148.
Ferrando, P. J. (2009). Difficulty, discrimination and information indices in the linear factor-analytic model for continuous responses. Applied Psychological Measurement, 33, 9–24. doi: 10.1177/0146621608314608
Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219–237. doi: 10.1177/0146621609343288
Ferrando, P. J., Anguiano-Carrasco, C., & Demestre, J. (2013). Combining IRT and SEM: A hybrid model for fitting responses and response certainties. Structural Equation Modeling, 20, 208–225. doi: 10.1080/10705511.2013.769388
Ferrando, P. J., & Lorenzo-Seva, U. (2007). A measurement model for Likert responses that incorporates response time. Multivariate Behavioral Research, 42, 675–706. doi: 10.1080/00273170701710247
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer.
Hofstee, W. K. B., Ten Berge, J. M. F., & Hendricks, A. A. J. (1998). How to score questionnaires. Personality and Individual Differences, 25, 897–910. doi: 10.1016/S0191-8869(98)00086-5
Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109–133. doi: 10.1007/BF02291393
Kuncel, R. B. (1973). Response processes and relative location of subject and item. Educational and Psychological Measurement, 33, 545–563. doi: 10.1177/001316447303300302
Lord, F. M. (1952). A theory of test scores (Psychometrika Monograph No. 7). Richmond, VA: Psychometric Corporation.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lorenzo-Seva, U., & Ferrando, P. J. (2013). FACTOR 9.2: A comprehensive program for fitting exploratory and semiconfirmatory factor analysis and IRT models. Applied Psychological Measurement, 37, 497–498. doi: 10.1177/0146621613487794
McDonald, R. P. (1997). Normal-ogive multidimensional model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 258–270). New York, NY: Springer. doi: 10.1007/978-1-4757-2691-6_15
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: LEA.
McDonald, R. P. (2000). A basis for multidimensional item response theory. Applied Psychological Measurement, 24, 99–114. doi: 10.1177/01466210022031552
Mellenbergh, G. J. (1994). A unidimensional latent trait model for continuous item responses. Multivariate Behavioral Research, 29, 223–237. doi: 10.1207/s15327906mbr2903_2
Morales-Vives, F., Codorniu-Raga, M. J., & Vigil-Colet, A. (2005). Características psicométricas de las versiones reducidas del cuestionario de agresividad de Buss y Perry [Psychometric properties of the reduced versions of Buss and Perry's aggression questionnaire]. Psicothema, 17, 96–100.
Nowakowska, M. (1983). Quantitative psychology: Some chosen problems and new ideas. Amsterdam, The Netherlands: North-Holland.
Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, 401–412. doi: 10.1177/014662168500900409
Reckase, M. D. (1997).
A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden &
R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271–286). New York, NY: Springer. doi: 10.1007/978-1-4757-2691-6_16
Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
Rodríguez-Fornells, A., Lorenzo-Seva, U., & Andrés-Pueyo, A. (2001). Psychometric properties of the Spanish adaptation of the five factor personality inventory. European Journal of Psychological Assessment, 17, 133–145. doi: 10.1027/1015-5759.17.2.145
Samejima, F. (1973). Homogeneous case of the continuous response model. Psychometrika, 38, 203–219. doi: 10.1007/BF02291114
Steiger, J. H. (1980). Testing pattern hypotheses on correlation matrices: Alternative statistics and some empirical results. Multivariate Behavioral Research, 15, 335–352. doi: 10.1207/s15327906mbr1503_7
Thissen, D., Steinberg, L., Pyszczynski, T., & Greenberg, J. (1983). An item response theory for personality and attitude scales: Item analysis using restricted factor analysis. Applied Psychological Measurement, 7, 211–226. doi: 10.1177/014662168300700209
Vigil-Colet, A., Lorenzo-Seva, U., Codorniu-Raga, M. J., & Morales-Vives, F. (2005). Factor structure of the Buss-Perry aggression questionnaire in different samples and languages. Aggressive Behavior, 31, 601–608. doi: 10.1002/ab.20097
Received November 6, 2014
Revision received June 22, 2015
Accepted August 4, 2015
Published online April 1, 2016

Pere J. Ferrando received his PhD from the University of Barcelona in 1989. He is currently Professor of Psychometrics at URV (Spain). His research mainly focuses on the development of item response theory (IRT) and structural equation models (SEM) for personality measurement and applications of these models to personality outcomes.
Pere Joan Ferrando
Universidad 'Rovira i Virgili'
Facultad de Psicología
Carretera Valls s/n
43007 Tarragona
Spain
Tel. +34 977 558079
E-mail perejoan.ferrando@urv.cat
Original Article
Comparison of Principal Component Solutions in Two Populations: A Bootstrap Test of the Perfect Congruence Hypothesis
Gregor Sočan
Department of Psychology, University of Ljubljana, Slovenia
Methodology (2016), 12(1), 11–20. DOI: 10.1027/1614-2241/a000099
Abstract: When principal component solutions are compared across two groups, a question arises whether the extracted components have the same interpretation in both populations. The problem can be approached by testing null hypotheses stating that the congruence coefficients between pairs of vectors of component loadings are equal to 1. Chan, Leung, Chan, Ho, and Yung (1999) proposed a bootstrap procedure for testing the hypothesis of perfect congruence between vectors of common factor loadings. We demonstrate that the procedure by Chan et al. is both theoretically and empirically inadequate for application to principal components. We propose a modification of their procedure, which constructs the resampling space according to the characteristics of the principal component model. The results of a simulation study show satisfactory empirical properties of the modified procedure.

Keywords: congruence coefficient, principal component analysis, bootstrap, simulation, Procrustes rotation, multiple-group analysis
Introduction

Principal component analysis (PCA) is a popular multivariate data reduction technique. When applied to p variables, it produces k (1 ≤ k ≤ p) sets of linear weights (usually called component score coefficients) which can be used to optimally summarize the information contained in the original variables. The interpretation of the substantive meaning of the components is often based on the component loadings; unless an oblique rotation has been used, loadings are equal both to correlations between variables and components, and to weights describing the structure of variables in terms of components. In an unrotated solution, a simple proportional relation holds between both sets of weights. In practice, PCA is often applied to the same variables measured in two groups, typically called the target and the replication group (males vs. females, a majority group vs. a minority focal group, and self-ratings vs. peer-ratings may serve as examples). In such cases, a question may arise whether the component score weights (or, alternatively, the component loadings) for the retained components are the same in both populations. For example, for a researcher using PCA as a test scoring technique (cf. Hofstee, ten Berge, & Hendriks, 1998), it is important to know whether the same scoring equation may be used for female and male participants. This paper presents and evaluates a technique aimed at answering this type of question. The proposed technique
is based on a previously developed procedure for comparing common factor solutions (Chan, Leung, Chan, Ho, & Yung, 1999), but has been modified for the use with PCA. The structure of the paper is as follows. We begin with a review of approaches to the comparison of principal component solutions, with special emphasis on the congruence coefficient. Then we present a procedure previously developed for testing the congruence of common factor solutions, and our modification of this procedure for PCA. Finally, we present a simulation study investigating the properties of the modified procedure in comparison to the original procedure by Chan et al. (1999).
Comparing the Component Solutions

We shall limit our discussion to the case with a single set of variables and two distinct subpopulations. In this case, there are two general approaches to the comparison of component solutions. The first approach is based on calculating the percentage of variance in various groups explained by a common set of components. This can be achieved either by determining the component weights in the target group and calculating the explained variance in the replication group (rotation to perfect congruence; see ten Berge, 1986b), or by determining the components using all available data and then comparing the variance explained by the common set of components with the variance explained by
components extracted in each group separately (simultaneous component analysis; for an overview see Kiers & ten Berge, 1994). The second approach is based on component loadings and focuses on the similarity of component interpretations rather than on the comparison of explanatory strength. In the first step, PCA is performed independently in both groups, extracting an appropriate number of components. Then, a target rotation (typically a Procrustes rotation) is applied to eliminate the effect of rotational indeterminacy. Finally, the corresponding vectors of loadings are compared using the coefficient of congruence (ϕ). This coefficient, developed apparently independently by Burt (1948) and Tucker (1951), measures the extent of proportionality of the elements of two vectors and can be interpreted in terms of similarity of component interpretations. For a pair of vectors a and b, each consisting of p elements, it is defined as

ϕab = a′b / (a′a · b′b)^(1/2) = Σ(i=1…p) ai bi / √(Σ(i=1…p) ai² · Σ(i=1…p) bi²).   (1)
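A direct transcription of Equation 1 (my sketch, with made-up loading vectors): the coefficient equals 1 whenever one vector of loadings is a positive multiple of the other.

```python
import numpy as np

def congruence(a, b):
    """Tucker's congruence coefficient (Equation 1): a'b / sqrt((a'a)(b'b))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

print(congruence([0.6, 0.4, 0.2], [0.3, 0.2, 0.1]))  # 1.0: proportional vectors
print(congruence([0.6, 0.4, 0.2], [0.1, 0.2, 0.6]))  # about 0.54: dissimilar
```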
Although other similarity indices exist, the congruence coefficient possesses several preferable properties (for a review, see Lorenzo-Seva & ten Berge, 2006). A serious problem with the described procedure (i.e., PCA/EFA + Procrustes rotation + congruence coefficient) is its capitalization on chance: even with random data, values of ϕ substantially different from zero may emerge. This has eventually led many psychometricians (for instance, Nunnally & Bernstein, 1994, pp. 561–562) to prefer the structural equation modeling methods (like confirmatory factor analysis) to the "congruence-after-Procrustes" procedure. Consequently, confirmatory factor analysis became a de facto standard method for evaluating structural hypotheses. Nevertheless, some researchers (Church & Burke, 1994; McCrae, Zonderman, Costa, Bond, & Paunonen, 1996) still believed that the confirmatory factor analysis models were too restrictive to fit complex data typically obtained in personality research. To control the effects of sampling error, McCrae et al. (1996) used a permutation test to determine the statistical significance of congruence coefficients. The idea of comparing the obtained value of ϕ to the chance-expected value had in fact been proposed much earlier (Korth & Tucker, 1975; Nesselroade & Baltes, 1967), but became attractive only after the advent of high-speed personal computers. Although the test by McCrae et al. controls the above-mentioned tendency of chance capitalization,
it does not address the question whether the factors (or components, respectively) have the same interpretation in both groups; instead, it tells us whether their similarity is higher than expected by chance (i.e., we test null hypotheses of the type H0: ϕ = 0). Chan et al. (1999) proposed a complementary bootstrap procedure (subsequently abbreviated as CLCHYP) for testing the hypothesis of perfect congruence (H0: ϕ = 1).¹ CLCHYP consists of the following steps:

1. Apply an exploratory common factor analysis separately to each covariance matrix (CT and CR) and extract k factors. Denote the target and the replication loading matrix by ΛT and ΛR, respectively, and the corresponding uniqueness matrices by ΨT and ΨR, respectively.

2. Define the resampling space for both populations: ΣT = ΛTΛT′ + ΨT and ΣR = ΛTΛT′ + ΨR (assuming uncorrelated factors). The definition of the resampling space actually amounts to setting both implied covariance matrices according to the null hypothesis. It is analogous to the bootstrapping of residuals in regression analysis (cf. Efron & Tibshirani, 1993, pp. 111–115).

3. In bootstrap-based hypothesis testing, bootstrap samples have to be taken from transformed raw data that conform to the null hypothesis (cf. Efron & Tibshirani, 1993, Chapter 16). Chan et al. (1999, p. 390 and p. 393) demonstrated that if the bootstrap samples were taken from the untransformed raw data (so-called "naïve bootstrap"), the resulting sampling distribution of congruence coefficients would have too small critical values compared to the correct sampling distribution, obtained by bootstrapping from the transformed data. Therefore, each raw data matrix needs to be transformed according to the definition of its respective sampling space. The transformed data matrices are determined as XT* = XT CT^(−1/2) ΣT^(1/2) and XR* = XR CR^(−1/2) ΣR^(1/2). By this transformation the perfect fit of the k-factor model is achieved in both sets of data and the common factor structure of the target sample data is imposed on the replication sample data (for details, see Chan et al., 1999, pp. 398–399). Because the covariance matrices corresponding to XT* and XR* share the same common factor structure, the congruence coefficients for the common factor pairs are equal to 1, as specified by the null hypothesis.

4. Draw an appropriate number of bootstrap sample pairs from the transformed data. For each pair of samples, determine the rotation matrix T minimizing tr[(ΛT − ΛRT)′(ΛT − ΛRT)]. Using the rotation matrix,
¹ In fact, Chan et al. (1999) formulated H0 as ΛTΛT′ = ΛRΛR′ (p. 381). However, it is obvious that ϕab = ±1 whenever b = ac (assuming c ≠ 0), yet the elements of aa′ and bb′ differ by the factor c². Therefore we prefer to formulate H0 in terms of ϕ.
rotate the replication loading matrix to the target loading matrix. For each corresponding pair of columns in ΛT and ΛRT, calculate the coefficient of congruence ϕ according to Equation 1.

5. Compare the empirically obtained value of ϕ for each factor to the desired percentile of the respective bootstrap sampling distribution. Reject H0 if ϕ is smaller than this critical value; otherwise, the data are consistent with the perfect congruence hypothesis.

Chan et al. (1999) performed a simulation study that showed satisfactory performance of their procedure.
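To make the mechanics of Steps 4 and 5 concrete, here is a minimal NumPy sketch (our illustration, not the authors' code) of the two computational ingredients: Tucker's congruence coefficient from Equation 1 and the orthogonal Procrustes rotation. The toy loading matrices and all variable names are assumptions for the example.

```python
import numpy as np

def congruence(x, y):
    # Tucker's congruence coefficient between two loading vectors (Equation 1)
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

def procrustes(target, source):
    # Orthogonal T minimizing tr[(target - source T)'(target - source T)],
    # via the SVD source' target = PDQ' (Cliff, 1966; Schonemann, 1966): T = PQ'
    p, _, qt = np.linalg.svd(source.T @ target)
    return p @ qt

rng = np.random.default_rng(0)
lam_t = rng.normal(size=(10, 2))                  # toy target loadings
q, _ = np.linalg.qr(rng.normal(size=(2, 2)))      # a random orthogonal rotation
lam_r = lam_t @ q + rng.normal(scale=.05, size=(10, 2))  # noisy, rotated replication
t = procrustes(lam_t, lam_r)
lam_rt = lam_r @ t                                # rotated back toward the target
phi = [congruence(lam_t[:, k], lam_rt[:, k]) for k in range(2)]  # both close to 1
```

Because the replication loadings here are (up to noise) a rotation of the target loadings, both column-wise congruence coefficients come out close to 1, exactly the situation the null hypothesis describes.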
A Modification of CLCHYP for Use in PCA

In the Monte Carlo evaluation of their method, Chan et al. (1999) used the maximum-likelihood method of factor extraction. Nevertheless, they stated that any method of factor extraction could be used to compute loadings, including PCA (p. 381). These authors therefore implicitly assumed a fundamental equivalence between PCA and common factor analysis (for some recent contrasting views on this issue, see, e.g., Widaman, 2007, and MacCallum, 2009, vs. Goldberg & Velicer, 2006, and Field, 2009). However, there are some fundamental conceptual differences between the two methods (for instance, common factors are latent variables, while components are not; common factors explain only the common variance, while principal components explain the total variance, etc.). The applicability of CLCHYP to PCA loadings thus deserves closer scrutiny. In both models (common factor or principal component, respectively), the covariance matrix can be decomposed as
Σ = ΛΛ′ + EE′. (2)
In the common factor model, Λ stands for a p × k matrix of common factor loadings, k ≤ p, and E stands for a p × p matrix of unique factor loadings. For simplicity, we assume the common factors to be mutually orthogonal. Since each variable loads on a single unique factor by definition, the uniqueness matrix Ψ = EE′ is a diagonal matrix. In the principal component framework, Λ stands for a p × k matrix of component loadings for retained components (i.e., the components that the researcher considers to be relevant) and E stands for a p × (p − k) matrix of loadings on the ignored components (i.e., the components that the researcher does not wish to retain).
In general, E consists of nonzero elements; E is not a square matrix, and EE′ is not a diagonal matrix. Therefore, while both models imply an additive decomposition of the covariance matrix into an explained part and a residual part, in common factor analysis the residual part is modeled as a diagonal matrix, while in PCA it is not. Because the definition of the residual part is crucial for a correct definition of the resampling space, a modification of CLCHYP is necessary for PCA. However, the modified definition of the resampling space is more intricate than the original definition. In the common factor framework, combining the replication residual part and the target explained part, as explained above (see Step 2), preserves the factor model structure. On the other hand, simply adding the PCA replication residual part to the target explained part would violate the PCA model, because the columns of the combined loading matrix [ΛT|ER] would not be mutually orthogonal. The problem can be solved by replacing ER with an approximation ER* satisfying the conditions that ΛT′ER* is a zero matrix and ER*′ER* is a diagonal matrix.² Our modification of the Chan, Leung, Chan, Ho, and Yung procedure (subsequently abbreviated as mCLCHYP) consists of the following steps:

1. Apply PCA to both data sets and extract k components. Let UT be the p × k matrix of the first k eigenvectors of the target covariance matrix CT, let VT and VR be the p × (p − k) matrices of the last p − k eigenvectors of the target and the replication covariance matrices CT and CR, respectively, and let LR be the diagonal matrix of eigenvalues of the replication covariance matrix CR.

2. Compute VR* as a modification of VR, satisfying the condition that the eigenvector matrix used in the subsequent definition of the resampling space is orthonormal, therefore [UT|VR*]′[UT|VR*] = I. This constraint is equivalent to the above-mentioned constraint on [ΛT|ER*], but is more convenient from the computational viewpoint. We are not aware of an analytical procedure that could be used to directly compute VR* subject to this constraint. Instead, we suggest modifying the columns of VR in a series of p − k steps. In each step, the selected column is first made orthogonal to the previous columns and then normalized. Therefore, the following operations are performed:

a. Initially, let Z = UT and let i = 1.

b. Let y be the ith column of VR. Orthogonalize y with respect to the current version of Z by computing regression residuals:
vi = y − Z(Z′Z)^(−1)Z′y. (3)
² Both constraints follow from the mutual orthogonality of eigenvectors.
c. Normalize vi, thus let vi = vi(vi′vi)^(−1/2).

d. Append vi to Z, thus let the updated Z become [Z|vi]. If i < p − k, increase i by 1 and return to Substep b. Otherwise, collect the last p − k columns of the final version of Z into VR* and proceed to Step 3.

3. Define the resampling space for both populations. The implied target covariance matrix is equal to the sample target covariance matrix: ΣT = ΛTΛT′ + ΨT = CT. Note that in CLCHYP, ΣT and CT would be equal only in the unlikely case of a perfect fit of the k-factor model (implying residual covariances equal to 0). In PCA, on the other hand, the diagonality constraint is not placed on Ψ; therefore, there is no need for a modification of CT. It is, however, necessary to determine the implied replication covariance matrix according to the null hypothesis. This is achieved by taking
ΣR = [UT|VR*] LR [UT|VR*]′. (4)
This definition of the replication resampling space ensures perfect congruence between the pairs of retained components in both populations. On the other hand, the variances explained by the components do not change. Conceptually, the residual part for the replication data can be written as
ΨR = ERER′ = VR* LR2 VR*′, (5)
where LR2 is the (p − k) × (p − k) submatrix of LR containing the eigenvalues related to the ignored principal components. However, it is not necessary to actually compute this matrix.

4. Transform the raw data for the replication sample:
XR* = XR CR^(−1/2) ΣR^(1/2). (6)
We note that in CLCHYP, the raw data matrix for the target sample is transformed as well (cf. the description of Step 3 of CLCHYP above). In mCLCHYP the transformation of the target data matrix is not necessary because ΣT and CT are equal, as explained above (consequently, XT* = XT CT^(−1/2) ΣT^(1/2) = XT CT^(−1/2) CT^(1/2) = XT). In both procedures, the data transformation is necessary to make the sampling distribution of the congruence coefficients correspond to the null hypothesis of perfect congruence.

5. By sampling with replacement, draw an appropriate number (e.g., 1,000) of pairs of bootstrap samples from the raw data (XT and XR*, respectively).

6. For each pair of bootstrap samples, as well as for both actual samples, apply PCA to both data sets and extract k components. Rotate the replication
loading matrix ΛR to the respective target loading matrix ΛT by an orthogonal Procrustes rotation; the rotation matrix is T = PQ′, where the latter two matrices are obtained from the singular value decomposition ΛR′ΛT = PDQ′ (Cliff, 1966; Schönemann, 1966). In the case of k = 1, ΛR′ΛT is a scalar, and PQ′ equals either 1 or −1; the Procrustes rotation then reduces to a possible reflection of the elements of ΛR, preventing a negative congruence coefficient.

7. For each component, construct the bootstrap sampling distribution of congruence coefficients. Compare the empirical congruence coefficient ϕobs for each component to the αth percentile of the respective bootstrap sampling distribution (α being set according to the desired alpha error level). Reject H0 if ϕobs is smaller than this critical value. If H0 has not been rejected, the data are consistent with the perfect congruence hypothesis.

Steps 5–7 are basically the same as in CLCHYP (Chan et al., 1999).
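The following NumPy sketch (an illustration under the stated assumptions, not the authors' R/MATLAB script) implements Steps 1–4 of mCLCHYP: the column-by-column orthogonalization yielding VR*, the implied matrix ΣR of Equation 4, and the data transformation of Equation 6. Symmetric matrix powers are computed via eigendecompositions, and the covariance matrices are assumed to be positive definite.

```python
import numpy as np

def eig_desc(c):
    # eigenvalues/eigenvectors of a symmetric matrix, in descending order
    vals, vecs = np.linalg.eigh(c)
    idx = np.argsort(vals)[::-1]
    return vals[idx], vecs[:, idx]

def sym_power(c, power):
    # matrix power of a symmetric positive definite matrix
    vals, vecs = np.linalg.eigh(c)
    return vecs @ np.diag(vals ** power) @ vecs.T

def mclchyp_sigma_r(c_t, c_r, k):
    p = c_t.shape[0]
    _, u = eig_desc(c_t)
    u_t = u[:, :k]                       # Step 1: first k target eigenvectors
    l_r, v = eig_desc(c_r)
    z = u_t                              # Step 2: make [U_T | V_R*] orthonormal
    for i in range(p - k):
        y = v[:, k + i]                  # ith column of V_R
        y = y - z @ np.linalg.solve(z.T @ z, z.T @ y)  # regression residuals (Eq. 3)
        z = np.column_stack([z, y / np.sqrt(y @ y)])   # normalize and append
    basis = z                            # [U_T | V_R*]
    return basis @ np.diag(l_r) @ basis.T              # Step 3: Equation 4

def transform_replication(x_r, c_r, sigma_r):
    # Step 4 (Equation 6): X_R* = X_R C_R^(-1/2) Sigma_R^(1/2)
    return x_r @ sym_power(c_r, -0.5) @ sym_power(sigma_r, 0.5)
```

Bootstrap pairs are then drawn row-wise with replacement from XT and the transformed XR*, and each pair is analyzed with PCA followed by the Procrustes rotation and congruence coefficient sketched earlier.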
Method

Research Problem and Design

We tested the behavior of mCLCHYP by means of a simulation study. We focused on the one-component case. The aim of the study was twofold:

1. the evaluation of the empirical alpha error level in both mCLCHYP and CLCHYP;
2. the determination of the statistical power of both procedures in various conditions.

We manipulated the following factors:

1. (Average) size of each sample in a pair (n = 100, 200, or 400, respectively). The range of the chosen values was based on what seems to be the typical range of sample sizes used in empirical research (see, e.g., the review studies by Conway & Huffcutt, 2003, and Fabrigar, Wegener, MacCallum, & Strahan, 1999). To make the power analysis more informative, the focus was on small sample sizes. In three cases, both samples were of equal size; in two cases, the sample sizes were set unequal (300 and 100, and 600 and 200, respectively) to investigate situations where the replication population is a clinical or other minority group for which only a relatively small sample could be obtained.³
³ In (seemingly unlikely) cases when the replication sample is considerably larger than the original sample, it still seems preferable to treat the larger group as the target group in order to minimize the effect of sampling error on the matrix UT, used in the definition of the explained part of the resampling space.
2. Variance explained by the first principal component (EVP = 25%, 50%, or 75%, respectively). Although the choice of values is to some extent arbitrary, such values might be obtained in analyses of well-designed dichotomous items (25%), graded-response items (50%), and factorially homogeneous test scores (75%).⁴

3. Coefficient of congruence in the population (ϕ = 1, .95, or .85, respectively). The inclusion of the value 1 was obviously necessary, because it corresponds to the null hypothesis of perfect congruence in the population; the remaining two values were determined according to the results of Lorenzo-Seva and ten Berge (2006). They investigated the relation between ϕ and practitioners' subjective judgments of factor similarity and concluded that when ϕ > .95 the factors' interpretations can be considered equal, whereas the range .85–.94 corresponds to fair factor similarity. Therefore, the values of .95 and .85 can be considered limiting values for a small and a large effect size, respectively.

The manipulation of the sample size and the population value of ϕ was essential for the evaluation of both error levels. On the other hand, we included EVP because we expected this factor to differentiate between mCLCHYP and CLCHYP. As explained above, the two procedures differ in the definition of the residual part of the model, Ψ. When EVP is high, the residuals in Ψ are generally small compared with the case when EVP is low; therefore, the behavior of the two procedures should differ more in the low-EVP condition than in the high-EVP condition.⁵
Procedure

In each of the 5 × 3 × 3 = 45 experimental conditions, we generated 5,000 sample pairs. For each pair, we first constructed a pair of corresponding population covariance matrices. The congruence between the first eigenvectors of these two matrices had the fixed value ϕ, and the eigenvalues of both matrices were identical. The eigenvalues had been set in advance according to the desired value of EVP and were the same for all population matrices within the same level of EVP (see Table 1). The eigenvalues were determined in such a way that the last p − 1 eigenvalues formed a linear scree.
Table 1. Population eigenvalues for different levels of EVP

        Explained variance percentage
e.v.    25%     50%     75%
1       2.50    5.00    7.50
2       1.50    1.00    0.50
3       1.33    0.89    0.44
4       1.17    0.78    0.39
5       1.00    0.67    0.33
6       0.83    0.56    0.28
7       0.67    0.44    0.22
8       0.50    0.33    0.17
9       0.33    0.22    0.11
10      0.17    0.11    0.06

Note. e.v. = eigenvalue; column headings give the explained variance percentage (EVP).
On the other hand, a separate pair of eigenvector matrices was constructed for each pair of population matrices as follows. The first eigenvector of each matrix was obtained as a column of matrix W*, computed as

W* = W ( 1   ϕ
         0   √(1 − ϕ²) ), (7)
where W was an orthonormalized p × 2 matrix of random numbers from the uniform distribution, and ϕ was the desired population value of the congruence coefficient. The remaining nine eigenvectors were obtained by means of the procedure described in Step 2 of mCLCHYP: the respective column of W* was used in place of UT, and a p × (p − 1) matrix of uniformly distributed random numbers was used in place of VR. We used different population eigenvectors (and, consequently, a different pair of population covariance matrices) for each sample pair to prevent confounding the effects of our independent variables with the idiosyncratic structural properties of a single population loading matrix. On the other hand, we used only three sets of eigenvalues. We controlled the variance explained by the first component, and at the same time we wished the structure to be close to unidimensional, making the extraction of a single principal component the optimal choice. Because of the fixed value of the sum of the eigenvalues, little room for variation of individual eigenvalues remained. Our sampling scheme therefore included a sampling of populations from a metapopulation
⁴ For example, the first principal component explains 27.3% of the variance in the classical set of ten dichotomously scored LSAT items (Bock & Lieberman, 1970). Schmitt and Allik (2005) applied PCA to the 10 items of the Rosenberg Self-Esteem Scale, scored on a 4-point scale; the first principal component explained 50.3% of the variance in the largest (USA) sample and 41.4% of the variance across all 53 nations. Finally, an EVP value of 75% for 10 variables (as used in this study) would imply an average inter-test correlation of around .72, which could be expected when analyzing reasonably reliable tests measuring the same construct.

⁵ This relation is not exact. The main difference between the procedures lies in the off-diagonal elements of Ψ, which are zero in CLCHYP and nonzero in mCLCHYP. On the other hand, EVP is related to all elements of Ψ. However, we see no quantity that would be directly related only to the off-diagonal elements and could be easily interpreted, computed, and manipulated.
Table 2. Medians and standard deviations of sample congruence coefficients

                           ϕ = .85        ϕ = .95        ϕ = 1
nT     nR     EVP (%)      Mdn (SD)       Mdn (SD)       Mdn (SD)
100    100    25           .77 (0.13)     .86 (0.11)     .91 (0.10)
              50           .84 (0.03)     .94 (0.02)     .99 (0.01)
              75           .85 (0.02)     .95 (0.01)     1.00 (0.00)
200    200    25           .81 (0.07)     .91 (0.05)     .95 (0.04)
              50           .85 (0.02)     .95 (0.01)     .99 (0.00)
              75           .85 (0.01)     .95 (0.01)     1.00 (0.00)
300    100    25           .80 (0.10)     .89 (0.08)     .94 (0.07)
              50           .84 (0.03)     .94 (0.02)     .99 (0.01)
              75           .85 (0.01)     .95 (0.01)     1.00 (0.00)
400    400    25           .83 (0.05)     .93 (0.03)     .98 (0.02)
              50           .85 (0.02)     .95 (0.01)     1.00 (0.00)
              75           .85 (0.01)     .95 (0.00)     1.00 (0.00)
600    200    25           .83 (0.06)     .92 (0.04)     .97 (0.03)
              50           .85 (0.02)     .95 (0.01)     1.00 (0.00)
              75           .85 (0.01)     .95 (0.01)     1.00 (0.00)

Notes. nT = target sample size; nR = replication sample size; EVP = explained variance percentage in the population; ϕ = congruence coefficient in the population.
with a common eigenvalue structure but different variable weights. We took only one sample per population to avoid the need for a multilevel analysis. All population covariance matrices were standardized. The number of variables was 10 in all conditions. The sample data matrices were computed as X = NΣ^(1/2), where Σ was the respective population covariance matrix and N was an n × p matrix of random numbers sampled from the standard normal distribution. The sample data matrices were subjected to both mCLCHYP and CLCHYP, as described in the Introduction. In all cases, we drew 1,000 bootstrap samples from each sample data matrix. We carried out the simulations using MATLAB R2012b.
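As a rough reconstruction of this data-generating scheme (the original simulations were written in MATLAB; the code below is our Python sketch, the standardization step is omitted for brevity, and the parameter values are illustrative assumptions), one pair of population covariance matrices with first-eigenvector congruence ϕ can be built as follows.

```python
import numpy as np

rng = np.random.default_rng(2016)
p, phi = 10, .95
eigvals = np.array([2.50, 1.50, 1.33, 1.17, 1.00,
                    0.83, 0.67, 0.50, 0.33, 0.17])   # EVP = 25% column of Table 1

w, _ = np.linalg.qr(rng.uniform(size=(p, 2)))        # orthonormalized p x 2 matrix W
w_star = w @ np.array([[1.0, phi],
                       [0.0, np.sqrt(1.0 - phi**2)]])  # Equation 7

def population_cov(first_vec, eigvals, rng):
    # complete first_vec to an orthonormal eigenvector basis (cf. Step 2 of mCLCHYP)
    z = first_vec[:, None]
    for _ in range(len(eigvals) - 1):
        y = rng.uniform(size=len(eigvals))
        y = y - z @ (z.T @ y)            # Z has orthonormal columns, so (Z'Z)^-1 = I
        z = np.column_stack([z, y / np.linalg.norm(y)])
    return z @ np.diag(eigvals) @ z.T

sigma_t = population_cov(w_star[:, 0], eigvals, rng)
sigma_r = population_cov(w_star[:, 1], eigvals, rng)

vals, vecs = np.linalg.eigh(sigma_t)                 # sample data: X = N Sigma^(1/2)
x_t = rng.standard_normal((100, p)) @ vecs @ np.diag(np.sqrt(vals)) @ vecs.T
```

By construction, the two columns of W* are unit-length vectors whose inner product equals ϕ, so the first eigenvectors of the two population matrices have exactly the desired congruence.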
Results

Descriptive Statistics

Table 2 presents medians and standard deviations of the sample congruence coefficients in different conditions.
Table 3. Empirical alpha error levels for both mCLCHYP and CLCHYP

                 mCLCHYP                  CLCHYP
                 EVP                      EVP
nT     nR      25%     50%     75%      25%     50%     75%
100    100     3.9     4.7     4.8      60.3    15.5    10.2
200    200     4.2     4.7     5.1      57.0    14.5    10.8
300    100     5.9     3.8     5.0      53.9    14.9    10.5
400    400     4.4     4.6     4.6      50.2    12.1    9.1
600    200     1.3     3.3     5.1      50.0    13.1    10.3

Notes. The nominal alpha error level was 5%. CLCHYP = Chan et al. (1999) procedure; mCLCHYP = modified Chan et al. (1999) procedure; nT = target sample size; nR = replication sample size; EVP = explained variance percentage.
We present medians rather than means because many of the distributions were highly peaked and negatively skewed. All medians were smaller than the respective population values; the same was true for the means, which are not presented here. Therefore, the sampling bias of the congruence coefficient was always negative. The bias was larger when the sample size was smaller and when the first principal component explained less variance, respectively. However, the size of the bias was practically negligible when EVP was at least 50%. Similarly, the estimated standard errors of the congruence coefficients were higher when the sample size, EVP, or ϕ was smaller, respectively. When the group sizes were different, both bias and standard errors were slightly larger in comparison with conditions with the same total sample size divided into two equally sized groups.
Alpha Error Rates

We shall first compare mCLCHYP and CLCHYP with regard to their alpha error rates in different conditions. Table 3 presents the percentages of cases in which the null hypothesis (i.e., ϕ = 1) was incorrectly rejected using the nominal level of α = 5%. We established an approximate⁶ 95% sampling interval [4.40, 5.60], based on the binomial distribution; empirical alpha error levels outside this interval indicate a statistically significant deviation from the nominal level. The error levels of mCLCHYP were in general close to the nominal level. In cases when the empirical error level fell outside the sampling interval, it was almost always somewhat too low. Therefore, the actual alpha error rate of mCLCHYP is generally equal to or lower than the nominal rate.
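For readers who want to reproduce the interval (our arithmetic, using the normal approximation mentioned in Footnote 6): with 5,000 replications and a true rejection rate of 5%, the standard error of the empirical percentage is √(.05 × .95/5,000) × 100 ≈ 0.31, and 5 ± 1.96 × 0.31 gives approximately [4.40, 5.60].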
⁶ The actual coverage of the interval was 94.8%. The limits of the interval would be equal up to the first two decimal places if determined by the normal approximation formula.
Table 4. Estimated power of mCLCHYP

                        ϕ = .85                   ϕ = .95
                        EVP                       EVP
α      nT     nR      25%     50%     75%       25%     50%     75%
1%     100    100     7.4     100.0   100.0     2.1     83.4    100.0
       200    200     36.5    100.0   100.0     6.9     99.9    100.0
       300    100     35.1    100.0   100.0     8.4     96.7    100.0
       400    400     89.0    100.0   100.0     30.6    100.0   100.0
       600    200     65.9    100.0   100.0     8.4     100.0   100.0
5%     100    100     25.1    100.0   100.0     9.7     95.8    100.0
       200    200     64.4    100.0   100.0     23.3    100.0   100.0
       300    100     56.9    100.0   100.0     18.5    99.8    100.0
       400    400     97.9    100.0   100.0     59.9    100.0   100.0
       600    200     82.5    100.0   100.0     19.9    100.0   100.0
10%    100    100     40.1    100.0   100.0     18.3    98.6    100.0
       200    200     78.0    100.0   100.0     36.8    100.0   100.0
       300    100     68.2    100.0   100.0     27.0    100.0   100.0
       400    400     99.3    100.0   100.0     74.5    100.0   100.0
       600    200     89.1    100.0   100.0     30.4    100.0   100.0

Notes. α = nominal alpha error level; ϕ = congruence coefficient in the population; nT = target sample size; nR = replication sample size; EVP = explained variance percentage.
On the other hand, the alpha error levels of CLCHYP were much higher than the nominal level. In the worst case (n = 100 and EVP = 25%), the alpha error level was higher than 60%. The alpha error level of CLCHYP was strongly related to EVP: when the first principal component explained 75% of the variance, the alpha error level decreased to about twice the nominal level. Therefore, our simulation results clearly show that CLCHYP should not be used in conjunction with principal component analysis, because of its excessive alpha error level. On the other hand, the behavior of the proposed modified method seems satisfactory: its alpha error level was mostly within the expected range, and in the remaining cases it was (except in one case) slightly lower than the nominal level.
Power Analysis

The second research question concerned the statistical power of both procedures. To simplify the interpretation, we considered the value of 80% as the minimal desired power (following Cohen, 1998, p. 56). Table 4 presents the percentages of cases in which mCLCHYP correctly rejected the null hypothesis (in conditions involving ϕ < 1). These values can be taken as empirical power estimates. Apart from the sample size, the proportion of explained variance was strongly related to the estimated power, too. The power was generally quite low in the EVP = 25% conditions. It was acceptable only when n was 400 and ϕ was .85. When either the sample size or the effect size was
Table 5. Estimated power of CLCHYP

                 ϕ = .85                   ϕ = .95
                 EVP                       EVP
nT     nR      25%     50%     75%       25%     50%     75%
100    100     97.1    100.0   100.0     84.0    99.6    100.0
200    200     99.7    100.0   100.0     93.1    100.0   100.0
300    100     98.9    100.0   100.0     87.0    100.0   100.0
400    400     100.0   100.0   100.0     98.7    100.0   100.0
600    200     99.9    100.0   100.0     96.1    100.0   100.0

Notes. Nominal alpha error level = 5%; ϕ = congruence coefficient in the population; nT = target sample size; nR = replication sample size; EVP = explained variance percentage.
smaller, the power was much lower: in the n = 100, ϕ = .85, α = 5% condition it was 25%, while in the n = 100, ϕ = .95, α = 5% condition it was only about 10%. When the sample sizes were different and EVP was 25%, the power was in general somewhat lower than in equal-sample-size conditions with the same total sample size. The differences were especially large when the effect size was small (i.e., EVP = 25%, ϕ = .95). However, in these conditions the power was not satisfactory even with equally sized samples. The power was much better when the percentage of explained variance was larger (i.e., either 50% or 75%). Unequal sample sizes had no notable effect on power in these conditions. Whenever EVP was 75%, the null hypothesis was correctly rejected, regardless of the sample size
and the significance level. In the conditions involving the larger effect size (ϕ = .85), the same happened when EVP was 50%. In the smaller effect size conditions (ϕ = .95), the power was still high: the lowest estimate was 83% (in the n = 100, EVP = 50%, α = 1% condition), and in all remaining cases it was higher than 95%. For CLCHYP we report the power estimates only for α = 5% (Table 5). The power was very high: it was lower than 90% in only one condition (n = 100, EVP = 25%) and was (almost) perfect when EVP was at least 50%. However, the high power of CLCHYP should not be surprising, considering its very high alpha error rates: obviously, this procedure tends to reject the null hypothesis very often when applied to PCA. The power of the modified procedure was adequate whenever the first principal component explained at least 50% of the variance. The usefulness of our modification is limited when the explained variance is very low (for instance, when analyzing dichotomous item responses), unless both the sample size and the expected effect size are quite large (i.e., when the expected congruence is at most .85 and the (mean) sample size is about 400 per sample). In the remaining low-EVP cases, statistically significant results can still be taken as credible evidence for the alternative hypothesis, but nothing should be inferred from nonsignificant results. Finally, we would like to note an observation concerning the relation between mCLCHYP and CLCHYP: whenever mCLCHYP rejected the null hypothesis, CLCHYP also rejected it. The two procedures thus possess an intrinsic similarity, but CLCHYP rejects the null hypothesis much too often, as follows from Table 3.
Discussion

Contrary to the conjecture by Chan et al. (1999), the results of the simulation study showed that their procedure (CLCHYP) should not be used with PCA, because its alpha error rate was far too high. On the other hand, the results showed satisfactory behavior of the proposed modified procedure (mCLCHYP). Its main limitation is low power in situations when the first principal component explains a small amount of variance (especially if the sample sizes are notably different). It should be noted that CLCHYP should not be used in these conditions either, despite the fact that its power was much higher than that of mCLCHYP, because in the conditions involving low levels of explained variance the alpha error rates of CLCHYP were extraordinarily high. The major difference between mCLCHYP and CLCHYP lies in the construction of the resampling space, where different definitions of the residual part of the model are
applied. In mCLCHYP, the residual part is defined in a manner consistent with the properties of PCA, that is, without assuming diagonality of the residual covariance matrix. Another notable difference relates to the formulation of the explained part of the model. In mCLCHYP, the eigenvectors of both implied covariance matrices, rather than the component loadings, are set equal. This allows for a different amount of explained variance by the retained components in each population while maintaining perfectly congruent loading vectors. Therefore, the aim of mCLCHYP is to detect differential interpretations of the selected components, rather than differences in their explanatory strength. As expected, the behavior of the original and the modified procedure diverged more when the explained variance was low, and vice versa. When the explained variance is very high, the residual part of the model has little influence on the results. If the first principal component explained all the variance, the residual matrix Ψ would be a zero matrix; at the same time, the principal component model could be treated as a factor analysis model with a zero uniqueness matrix. Both procedures would obviously provide identical results in this case, because the bootstrap samples would be taken only from the explained part of the data. Many users of mCLCHYP may be more interested in the congruence of component score coefficients than in the congruence of component loadings. In the case of unrotated components, congruence coefficients of both types have equal values; therefore, the results of mCLCHYP can be interpreted in terms of component score coefficients as well. When components are rotated, the congruence of loadings is not equal to the congruence of component score coefficients. However, as proven by ten Berge (1986a, p. 37), perfect congruence of all loading vectors (or pattern vectors in the case of correlated components, respectively) is a necessary and sufficient condition for perfect congruence of all vectors of component score coefficients. Therefore, the results of mCLCHYP have some generalizability to component score coefficients even in the case of rotated components. Our simulations were limited to the case when only the first principal component was of interest. Although this may not be a typical instance of the use of PCA in general, we believe that it might present a frequent instance of the use of our testing procedure. For instance, a typical application might be, as already mentioned earlier, the generalization of a scoring key for a psychological test to another population. Nevertheless, mCLCHYP can be applied to cases with two or more retained components. The application is straightforward when orthogonal components are desired. In the case of oblique components, it is necessary to choose between matching the structure matrices (ten Berge & Nevels, 1977) and matching the pattern matrices (Browne & Kristof, 1969). In any case, the choice of a
rotation does not affect the definition of the resampling space used in the bootstrap procedure, because the space spanned by the ignored components is orthogonal to the space spanned by the retained components. The evaluation of mCLCHYP for multiple-component solutions (with regard to factors like orthogonal vs. oblique rotation, pattern vs. structure matching, correlations between components, number of components, etc.) is beyond the scope of this paper and remains a task for future research. Another factor that was not manipulated in our study was the difference between the EVPs in the two populations; in particular, the explained variances of the components were equal in both populations. Although we do not see any reason to expect differences in EVP to affect the performance of mCLCHYP, we speculate that such differences might increase the divergence between the results of mCLCHYP and CLCHYP, because CLCHYP assumes that the variances explained by the k retained components are equal in both populations (see Step 2 of CLCHYP, where the expressions for ΣT and ΣR differ only in the residual part). Chan et al. (1999) discussed and evaluated three types of congruence coefficients, measuring factor, total, and variable congruence, respectively. Our study dealt only with the first type. In our opinion, this type of congruence is typically the most interesting for an applied researcher. Besides, the main objective of our study was the evaluation of the modified resampling process, which is the same regardless of the type of congruence coefficient computed later. Although the rationale for the definition of the residual part of the transformed resampling space for the replication sample may look quite straightforward, alternative ways of determining ΨR may be conceived. For instance, before orthogonalizing VR in Substep 2b, one could rotate VR against VT by means of the Procrustes rotation to minimize the extent of the modification of VR. Alternatively, if the replication covariance matrix were based on an untransformed VR, the constraint [UT|VR]′[UT|VR] = I would not be satisfied, but the resulting ΣR would still be a proper, positive semidefinite covariance matrix. We have run simulations with these two variations of the modified procedure as well; however, both of them showed results inferior to the procedure described in this paper. It seems worthwhile to point out that the sampling bias of the congruence coefficient was negative in all conditions. Despite being in contrast with the widespread belief that congruence coefficients are generally positively biased, this finding is in line with the study by Broadbooks and Elmore (1987), who found a negative bias when ϕ was .50 or higher. Although their study was based on the common factor model, their results can apparently be generalized to PCA as well. An open problem that may present an interesting extension of the present research is the case of several
populations. Although methods for matching several matrices are available (see, e.g., Gower & Dijksterhuis, 2004), finding a satisfactory generalization of the congruence coefficient may not be easy. One possible candidate is the RV coefficient (Robert & Escoufier, 1976), but for any such measure its perception by practitioners should be investigated first (in a way similar to Lorenzo-Seva & ten Berge, 2006). To conclude, the contribution of this study is twofold. First, a procedure for testing the hypothesis of perfect congruence in the principal component framework has been proposed and evaluated. Additionally, our results illustrate the danger of automatically generalizing the properties of common factor analysis to PCA, and thus provide another insight into the discourse about the similarities and differences between the two methods.

Acknowledgments

R and MATLAB scripts for mCLCHYP are available upon request. This research was partly funded by the Slovene Research Agency (Project Code: P5-0062).
References

Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179–197. doi: 10.1007/BF02291262
Broadbooks, W. J., & Elmore, P. B. (1987). A Monte Carlo study of the sampling distribution of the congruence coefficient. Educational and Psychological Measurement, 47, 1–11. doi: 10.1177/0013164487471001
Browne, M. W., & Kristof, W. (1969). On the oblique rotation of a factor matrix to a specified pattern. Psychometrika, 34, 237–248. doi: 10.1007/BF02289347
Burt, C. (1948). The factorial study of temperamental traits. British Journal of Statistical Psychology, 1, 178–203. doi: 10.1111/j.2044-8317.1948.tb00236.x
Chan, W., Leung, K., Chan, D. K.-S., Ho, R. M., & Yung, Y.-F. (1999). An alternative method for evaluating congruence coefficients with Procrustes rotation: A bootstrap procedure. Psychological Methods, 4, 378–402. doi: 10.1037/1082-989X.4.4.378
Church, A. T., & Burke, P. J. (1994). Exploratory and confirmatory tests of the Big Five and Tellegen's three- and four-dimensional models. Journal of Personality and Social Psychology, 66, 93–114. doi: 10.1037/0022-3514.66.1.93
Cliff, N. (1966). Orthogonal rotation to congruence. Psychometrika, 31, 33–42. doi: 10.1007/BF02289455
Cohen, J. (1998). Statistical power analysis for the behavioural sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Conway, J. M., & Huffcutt, A. I. (2003). A review and evaluation of exploratory factor analysis practices in organizational research. Organizational Research Methods, 6, 147–168. doi: 10.1177/1094428103251541
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Boca Raton, FL: Chapman & Hall/CRC.
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4, 272–299. doi: 10.1037/1082-989X.4.3.272
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). Los Angeles, CA: Sage.
Goldberg, L. R., & Velicer, W. F. (2006). Principles of exploratory factor analysis. In S. Strack (Ed.), Differentiating normal and abnormal personality (2nd ed., pp. 209–237). New York, NY: Springer.
Gower, J. C., & Dijksterhuis, G. B. (2004). Procrustes problems. New York, NY: Oxford University Press.
Hofstee, W. K. B., ten Berge, J. M. F., & Hendriks, A. A. J. (1998). How to score questionnaires. Personality and Individual Differences, 25, 897–909. doi: 10.1016/S0191-8869(98)00086-5
Kiers, H. A. L., & ten Berge, J. M. F. (1994). Hierarchical relations between methods for simultaneous component analysis and a technique for rotation to a simple simultaneous structure. The British Journal of Mathematical and Statistical Psychology, 47, 109–126. doi: 10.1111/j.2044-8317.1994.tb01027.x
Korth, B., & Tucker, L. R. (1975). The distribution of chance congruence coefficients from simulated data. Psychometrika, 40, 361–372. doi: 10.1007/BF02291763
Lorenzo-Seva, U., & ten Berge, J. M. F. (2006). Tucker's congruence coefficient as a meaningful index of factor similarity. Methodology, 2, 57–64. doi: 10.1027/1614-1881.2.2.57
MacCallum, R. C. (2009). Factor analysis. In R. E. Millsap & A. Maydeu-Olivares (Eds.), The SAGE handbook of quantitative methods in psychology (pp. 123–147). Los Angeles, CA: Sage.
MATLAB (2012). R2012b [Computer software]. Natick, MA: The MathWorks.
McCrae, R., Zonderman, A., Costa, P., Bond, M., & Paunonen, S. (1996). Evaluating replicability of factors in the Revised NEO Personality Inventory: Confirmatory factor analysis versus Procrustes rotation. Journal of Personality and Social Psychology, 70, 552–566. doi: 10.1037/0022-3514.70.3.552
Nesselroade, J. R., & Baltes, P. B. (1967). On a dilemma of comparative factor analysis: A study of factor matching based on random data. Educational and Psychological Measurement, 27, 305–321. doi: 10.1177/001316447003000413
Nunnally, J. C., & Bernstein, I. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill.
Robert, P., & Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: The RV-coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics), 25, 257–265. doi: 10.2307/2347233
Schmitt, D. P., & Allik, J. (2005). Simultaneous administration of the Rosenberg Self-Esteem Scale in 53 nations: Exploring the universal and culture-specific features of global self-esteem. Journal of Personality and Social Psychology, 89, 623–642. doi: 10.1037/0022-3514.89.4.623
Schönemann, P. H. (1966). A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31, 1–10. doi: 10.1007/BF02289451
ten Berge, J. M. F. (1986a). Some relationships between descriptive comparisons of components from different studies. Multivariate Behavioral Research, 21, 29–40. doi: 10.1207/s15327906mbr2101_2
ten Berge, J. M. F. (1986b). Rotation to perfect congruence and the cross-validation of component weights across populations. Multivariate Behavioral Research, 21, 41–46. doi: 10.1207/s15327906mbr2101_3
ten Berge, J. M. F., & Nevels, K. (1977). A general solution to Mosier's oblique Procrustes problem. Psychometrika, 42, 593–600. doi: 10.1007/BF02295981
Tucker, L. R. (1951). A method for synthesis of factor analytic studies (Personnel Research Section Report No. 984). Washington, DC: Department of the Army.
Widaman, K. F. (2007). Common factors versus components: Principals and principles, errors and misconceptions. In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and future directions (pp. 177–203). Mahwah, NJ: Erlbaum.

Received April 11, 2014
Revision received May 11, 2015
Accepted August 18, 2015
Published online April 1, 2016

Gregor Sočan is an Assistant Professor of psychological methodology at the University of Ljubljana, Slovenia. His research interests include theoretical psychometric topics (reliability, factor analysis, test scoring) as well as applied problems (development, adaptation, and evaluation of tests and questionnaires; multivariate modeling in differential and developmental psychology).

Gregor Sočan
Department of Psychology
Faculty of Arts
University of Ljubljana
Aškerčeva c. 2
1000 Ljubljana
Slovenia
Tel. +386 1 241-1184
E-mail gregor.socan@ff.uni-lj.si
Original Article
The Impact of the Number of Dyads on Estimation of Dyadic Data Analysis Using Multilevel Modeling

Han Du and Lijuan Wang

Department of Psychology, University of Notre Dame, Notre Dame, IN, USA
Abstract: Dyadic data often appear in social and behavioral research, and multilevel models (MLMs) can be used to analyze them. For dyadic data, the group size is 2, which is the minimum group size for fitting a multilevel model. This Monte Carlo study examines the effects of the number of dyads, the intraclass correlation (ICC), the proportion of singletons, and the missingness mechanism on convergence, bias, coverage rates, and Type I error rates of parameter estimates in dyadic data analysis using MLMs. Results showed that the estimation of variance components could suffer from nonconvergence problems, nonignorable bias, and coverage rates deviating from nominal values when ICC is low, the proportion of singletons is high, and/or the number of dyads is small. More dyads helped obtain more reliable and valid estimates. Sample size guidelines based on the simulation model are given and discussed.

Keywords: dyadic data analysis, sample size, missing data, multilevel modeling
Many phenomena studied in the social and behavioral sciences involve "pairs" or "dyads." In social psychology, two persons develop a romantic relationship (dating dyad; e.g., Kenny & Acitelli, 2001); in developmental psychology, mother and child describe their attachment relationship with one another (mother-child dyad; e.g., Wertsch, McNamee, McLane, & Budwig, 1980); and in organizational management, two colleagues cooperate to finish a project (colleague dyad; e.g., Bakker & Xanthopoulou, 2009). One important characteristic of dyads is nonindependence (Kenny, Kashy, & Cook, 2006). That is, the two members of a dyad are not two independent individuals, because they share something in common (e.g., genes, environment). Nonindependence in dyadic data implies that the data are nested or hierarchical, such that the observations from two dyadic members are nested within a dyad. Thus, the data from a dyad are more or less correlated, and the extent of relatedness can be measured by the intraclass correlation (ICC). Nonindependence in dyadic data violates the "classical" independent-observations assumption of traditional regression analysis. If the nested data structure or nonindependence is overlooked and a researcher thus treats dyad members as independent units for data analysis, serious consequences may follow (Kenny & Judd, 1986). For example, when the correlation in the outcomes between two dyad members after controlling for the within-dyads predictor variable is positive, ignoring nonindependence may lead
to conservative results (i.e., overestimated p-values and increased Type II errors; Kenny, 1995). To deal with nonindependence in dyadic data, multilevel modeling (MLM), structural equation modeling (SEM), and multilevel SEM have been suggested and used (e.g., Card, Selig, & Little, 2008; Kashy, Donnellan, Burt, & McGue, 2008; Kenny et al., 2006; Newsom, 2002). In this article, we focus on the use of MLM for dyadic data analysis because of its wide use. In MLM, random effects are included to capture the correlated-data feature (Hox, 2010). To estimate the parameters (fixed-effects parameters and level-1 and level-2 residual variances and covariances) in a multilevel model, maximum likelihood estimation (ML) is widely used. ML estimates are consistent, asymptotically normal, and asymptotically efficient, which implies that sufficient sample sizes at different levels may be needed to enjoy those good properties (Casella & Berger, 2002). Below, we review previous findings on how ML point estimates and their standard error estimates in multilevel modeling are influenced by factors such as the sample size at each level, ICC, and missing data. For fixed coefficients in typical multilevel models, point estimates were found to have little to no bias, whereas the standard errors could be considerably underestimated with nonnormal distributions of the residuals and/or small sample sizes at all levels (e.g., Maas & Hox, 2005; Van der Leeden, Busing, & Meijer, 1997). In contrast, Hox and Maas (2001) found bias in between-group factor loading
estimates (e.g., 13% relative bias), in addition to underestimated standard errors, with small numbers of groups (level-2 sample size) and low ICCs for multilevel structural equation models. Moreover, bias in fixed-coefficient estimates did not show a consistent direction (Hox & Maas, 2001; Paccagnella, 2011). For level-2 residual variances and covariances, when sample sizes are small and/or the normality assumptions of the residuals are violated, standard error estimates of level-2 variance estimates in typical multilevel models were found to be underestimated (e.g., Maas & Hox, 2004, 2005). In their multilevel SEM study, Hox and Maas (2001) also found bias in the standard error estimates of the level-2 variance estimates, with no obvious direction in the bias. Generally, larger proportions of missing data and smaller ICCs, group sizes, and/or numbers of groups could lead to more bias and lower coverage rates in level-2 variance estimates (e.g., Clarke & Wheaton, 2007; Hox & Maas, 2001; Maas & Hox, 2005; Moineddin, Matheson, & Glazier, 2007). The influences of smaller ICCs were larger when the number of groups (level-2 sample size) and the group size (the number of individuals in a group, or level-1 sample size) were smaller (e.g., Maas & Hox, 2005; Moineddin et al., 2007). Estimates of level-1 residual variances were found to be generally accurate. Bassiri (1988), however, found that the estimates of level-1 residual variances were less accurate with fewer groups. Some researchers have reported convergence problems when fitting multilevel models. For example, Moineddin et al. (2007) found that nonpositive definite matrices or nonconvergence was more common with fewer observations (e.g., 5 in their case) per group. Nonconvergence rates can be high when there are many random components to estimate (Paccagnella, 2011; Peugh, 2010). Convergence rates can be improved with a larger number of groups, a larger group size, and/or a larger ICC (Moineddin et al., 2007; Paccagnella, 2011; Shieh & Fouladi, 2003). As reviewed above, for a typical multilevel model, a larger number of groups could partially compensate for a smaller group size in obtaining accurate point estimates and standard error estimates. Therefore, various sample size guidelines, in terms of the combination of the group size and the number of groups, have been proposed for obtaining reliable and valid statistical inference in the multilevel modeling literature. For example, for obtaining valid inference on fixed effects, Kreft (1996) suggested the "30/30 rule" (at least 30 groups with at least 30 individuals per group). Maas and Hox (2005) suggested that the number of groups should be no less than 10 when the group size is as small as 5, so that the standard errors of fixed-effects estimates are not biased downwards. When researchers are also interested in variance components and cross-level interactions, Hox (2010) proposed the "50/20 rule"
(at least 50 groups and 20 observations in each group) for testing cross-level interactions. To minimize the bias in level-2 variance estimates, 100 groups or more were suggested (e.g., Afshartous, 1995; Hox & Maas, 2001). Similarly, Clarke and Wheaton (2007) suggested the 100/10 rule for testing the intercept variance, but they also recommended the 200/20 rule for testing the slope variances. Although various sample size guidelines have been proposed in the literature for general multilevel modeling, little work has been done on the number of groups (i.e., dyads) needed for valid statistical inference in dyadic data analysis using multilevel modeling. As reviewed above, the minimum studied group size was 5 in most previous research. For dyadic data, the group size is 2, which is the minimum group size we could have for fitting a multilevel model. In addition, missing data could pose a challenge for dyadic data analysis. With complete dyadic data, two is already a very small group size; with missing data, the group size of the affected dyads becomes even smaller (i.e., 1). There are three types of missingness mechanisms discussed in the literature (Rubin, 1976): missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR occurs when missingness is unrelated to either observed or unobserved variables; MAR occurs when missingness is related to observed variables in the analysis model but not to unobserved variables; and MNAR occurs when missingness is related to unobserved variables. Different missing data handling techniques work for specific types of missing data. For dyads with available data from only one dyadic member, removing them from the data analysis (i.e., listwise deletion) generally requires the strict MCAR assumption. Full information maximum likelihood estimation with all available dyadic data works for handling ignorable missingness, that is, MCAR and the less strict MAR scenarios (e.g., Atkins, 2005; Hox, 2010; Newsom, 2002). Although the influences of missing data on the estimation of multilevel models have been studied via simulation, few simulation studies have explored the influences of missing data, particularly MAR missing data, in dyadic data analysis using multilevel modeling. When there are no missing data, can we generalize the findings from previous research on the impact of these factors on the estimation of multilevel models, and the corresponding sample size guidelines, to dyadic data analysis with a group size of 2? When there are missing data, how does ML perform when the group size is 1 for some groups/dyads? The answers have not been given in the literature. Therefore, the goals of this study are to (1) study the influences of various factors, including ICC, the number of groups, the proportion of missing data, and the missing data mechanism, on parameter estimation in dyadic data analysis using multilevel modeling; and (2) propose sample size
guidelines for dyadic data analysis using multilevel modeling. Note that the proposed sample size guidelines are for obtaining valid statistical inference for dyadic data analysis and are not for obtaining sufficient statistical power. To fulfill the goals, we conduct a simulation study to examine the behaviors of ML estimates for analyzing dyadic data using multilevel modeling under different conditions. Convergence rates, bias, coverage rates, and Type I error rates will be examined.
A General Multilevel Model

A typical multilevel model contains two levels. A level-1 model can be expressed as
Yij = β0j + β1jXij + eij, (1)
where j (j = 1, . . ., J) represents groups and i (i = 1, . . ., nj) represents individuals or group members. J is the level-2 sample size, also called the number of groups (or the number of dyads in dyadic data analysis). nj is the level-1 sample size of group j, also called the group size (or dyad size in dyadic data analysis). For dyadic data, the level-1 sample size is 2 with complete data. Xij is a level-1 explanatory variable. The level-1 residual eij is usually assumed to have a normal distribution: eij ~ N(0, σ²e). The random intercept (β0j) and slope (β1j) can be modeled with level-2 explanatory variables (e.g., Gj) at the group level. A level-2 model can be expressed as
β0j = γ00 + γ01Gj + μ0j,
β1j = γ10 + γ11Gj + μ1j, (2)
where γ00, γ01, γ10, and γ11 are fixed effects or fixed coefficients (Hox, 2010; Singer & Willett, 2003). μ0j and μ1j are level-2 residuals, which are often assumed to have a bivariate normal distribution with means of 0, variances σ²0 and σ²1, and covariance σ10. Collapsing the level-1 and level-2 models together yields the composite form of the multilevel model
Yij = γ00 + γ01Gj + γ10Xij + γ11XijGj + μ0j + μ1jXij + eij. (3)
When the multilevel model has no predictors, it is an empty model or intercept-only model:

Yij = γ00^(0) + μ0j^(0) + eij^(0). (4)
From the empty model, we can calculate the intraclass correlation (ICC; Hox, 2010), ρ, by

ρ = σ²0^(0) / (σ²0^(0) + σ²e^(0)), (5)

where the superscript (0) marks parameters of the empty model.
ICC measures the extent to which the data in Y are correlated within a group or a dyad.
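As a quick numerical illustration of Equation 5 (our example, not a value from the study): with σ²0 = 0.3 and σ²e = 0.7, ρ = 0.3/(0.3 + 0.7) = .30, that is, 30% of the variance in Y lies between groups.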
A Multilevel Model for Analyzing Dyadic Data

In the simulation study, without loss of generality, we generated data based on a hypothetical family research scenario with distinguishable dyad members. In the scenario, a father and a mother constitute a dyad, and half of the families receive an intervention (e.g., a parenting sensitivity enhancement intervention). Before and after the intervention, the parenting sensitivity of both mothers and fathers is measured, yielding pre- and posttest scores. The researcher is interested in whether the intervention has a differential effect between mothers and fathers on posttest scores after controlling for pretest scores. To answer the research question, a multilevel ANCOVA model can be used. Denote by Yid the observed posttest score of individual i (i = 1, 2) in dyad d on parenting sensitivity, and by Xid the observed pretest score of individual i in dyad d. Rid indicates family role (−1 for mother and 1 for father). Gd indicates whether the dyad is in the control group (coded 0) or in the experimental group (coded 1). Therefore, Xid and Rid are level-1 (individual-level) predictors, and Gd is a level-2 (group/dyad-level) predictor. The multilevel ANCOVA model can be expressed as
Yid = βA0d + βD0dRid + βA1dXid + βD1dRidXid + eid, (6)

where βA0d is the average intercept of the father and the mother in dyad d, βD0d is half of the difference in the intercepts between the father and the mother in dyad d, βA1d is the average slope of the level-1 covariate for the father and the mother in dyad d, and βD1d is half of the difference in the slopes between the father and the mother in dyad d. The eid are the level-1 residuals, with eid ~ N(0, σ²e). The coefficients in the level-1 model can be modeled at level 2 as
βA0d = γA00 + γA01Gd + μA0d,
βD0d = γD00 + γD01Gd,
βA1d = γA10 + γA11Gd,
βD1d = γD10 + γD11Gd, (7)
where γA00 is the average intercept of the control dyads, which is also the average posttest score of the mothers
and fathers in the control dyads after controlling for the pretest scores. γA01 measures the degree to which the parents' average intercept (after controlling for the pretest scores) in the experimental dyads differs from that in the control dyads, which is one parameter of interest for the analysis. γD00 is half of the average difference in the intercepts (after controlling for the pretest scores) between fathers and mothers in the control dyads. γD01 is the degree to which half of the average difference in the intercepts (after controlling for the pretest scores) between fathers and mothers in the experimental dyads differs from that in the control dyads, which is another parameter of interest for answering the research question. The other four fixed-effects parameters can be interpreted similarly, with the intercepts replaced by the slopes (they are less interesting, though, because the slopes are those of the covariate Xid). Although there are pretests and posttests involved in the scenario and the model, the generated dyadic data are treated as cross-sectional dyadic data because the pretest scores are included in the model as covariates. The dyadic data described in the scenario (i.e., the number of data points from a dyad is 2) do not allow the inclusion of all four group/dyad-level (level-2) residual variables, due to the lack of degrees of freedom. Specifically, if one wants to freely estimate the level-1 residual variance (the variance of eid), at most two level-2 residual variables (e.g., one for βA0d and one for βD0d) can be included, and the two level-2 residual variables are not allowed to covary. If the level-1 residual variance is constrained to be 0, two level-2 residual variables (e.g., one for βA0d and one for βD0d) can be included and the covariance between them can be estimated. In this article, as an illustration, we freely estimate the level-1 residual variance and include one level-2 residual variable, μA0d, as shown in Equation 7. Furthermore, μA0d is assumed to have a normal distribution: μA0d ~ N(0, σ²A0d). The composite model is
Yid = γA00 + γA01Gd + γD00Rid + γD01GdRid + γA10Xid + γA11GdXid + γD10RidXid + γD11GdRidXid + μA0d + eid. (8)
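A hypothetical Python sketch of this setup (the authors do not provide code here; the fixed-effect values, sample size, and the statsmodels-based fit are our assumptions, with the covariate-by-group and covariate-by-role interaction coefficients set to zero for simplicity) generates dyadic data from Equation 8 and estimates the model with REML, the estimator discussed below.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(21)
n_dyads, icc = 200, 0.3
var_u, var_e = icc, 1 - icc        # sigma^2_A0d and sigma^2_e for unit total variance

rows = []
for d in range(n_dyads):
    g = int(d < n_dyads // 2)                    # Gd: 1 = experimental, 0 = control
    u = rng.normal(0.0, np.sqrt(var_u))          # mu_A0d, dyad-level residual
    for r in (-1, 1):                            # Rid: -1 = mother, 1 = father
        x = rng.normal()                         # pretest score Xid
        y = (0.0 + 0.5 * g + 0.1 * r + 0.1 * g * r      # illustrative gamma values
             + 0.3 * x + u + rng.normal(0.0, np.sqrt(var_e)))
        rows.append({"dyad": d, "g": g, "r": r, "x": x, "y": y})

data = pd.DataFrame(rows)
# random intercept for dyads; the formula expands to all eight Equation 8 terms
fit = smf.mixedlm("y ~ g * r * x", data, groups=data["dyad"]).fit(reml=True)
print(fit.summary())
```

The formula `y ~ g * r * x` expands to the intercept and the seven fixed-effect terms of Equation 8, while the `groups` argument adds the single dyad-level random intercept μA0d.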
The intraclass correlation can be obtained from the following empty model:
Y ij ¼ γ A00 þ μA0d þ eid ; 0
0
0
ð9Þ
by

$$\mathrm{ICC} = \frac{\sigma'^2_{A0d}}{\sigma'^2_{A0d} + \sigma'^2_{e}}. \quad (10)$$

Note that the model in Equations 6 and 7 mimics a typical multilevel model expressed in Equations 1 and 2. At the same time, the model in Equations 6 and 7 has a dyadic flavor in that (1) it includes both a role variable, R_id, to distinguish dyad members and its interaction with the level-1 covariate in the first-level model; and (2) only one random effect is modeled at the second level because of the small group size of 2. Therefore, the proposed model can be considered either a typical multilevel model used for cross-sectional dyadic data analyses or a typical multilevel model with a group size of 2.

Two types of ML estimation, full ML (FML) and restricted ML (REML), are often used for multilevel models. FML ignores the uncertainty in the estimated fixed-effects coefficients when estimating the variance components, whereas REML maximizes the likelihood function for the variance components after accounting for the degrees of freedom used by the fixed-effects estimates (Hox, 2010). Thus, REML yields less biased estimates of the variance components, which is especially helpful when the number of groups is small (Raudenbush & Bryk, 2002). Therefore, we use REML for estimation in this paper.

Simulation Design

We use the two-level multilevel model for the hypothetical scenario (Equation 8) to simulate the data. Four factors were manipulated in the simulation; we describe them in turn.

Number of Groups (NG = 30, 50, 100, 150, 200, 300, 400, 500)
As reviewed earlier, sample size is a well-known factor that can affect estimation in multilevel modeling. Here we vary only the number of groups because the number of individuals in a group (dyad) is fixed at 2. We chose values such as 30, 50, 100, and 200 because they have been suggested in sample size guidelines (Afshartous, 1995; Hox, 2010; Hox & Maas, 2001; Kreft, 1996). In dyadic research, 150 or 300 groups are not rare (e.g., Kenny & Acitelli, 2001; Townsend, Phillips, & Elkins, 2000), so we include them as well. Given the extremely small group size of 2, we also consider 400 and 500 for NG to explore whether these larger numbers of dyads are needed for reliable and valid estimation.
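Before describing the second factor, it may help to see how the ICC of Equation 10 is obtained in practice from the empty model of Equation 9. The sketch below fits the empty model with REML and forms the ICC from the two fitted variance components. It reuses `simulate_dyads` from the sketch above, statsmodels' MixedLM stands in for the SAS PROC MIXED procedure the authors used, and the ratio value .370 is taken from the Appendix (NG = 100, target ICC = 0.1).

```python
import statsmodels.formula.api as smf

def empty_model_icc(df):
    """Fit the empty model of Equation 9 by REML and return the ICC of Equation 10."""
    fit = smf.mixedlm("y ~ 1", df, groups=df["dyad"]).fit(reml=True)
    sigma2_u = fit.cov_re.iloc[0, 0]   # level-2 residual variance (random intercept)
    sigma2_e = fit.scale               # level-1 residual variance
    return sigma2_u / (sigma2_u + sigma2_e)

# With the full-model ratio fixed at .370 and NG = 100 dyads, the
# empty-model ICC should come out near the target value of 0.1.
print(empty_model_icc(simulate_dyads(100, ratio=0.370, seed=1)))
```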
Intraclass Correlation (ICC = 0.1, 0.2, 0.3, 0.5, 0.7)
ICC also influences the performance of ML estimates (Goldstein, 1995). Hox (2010) suggests 0.1, 0.2, and 0.3 as small, medium, and large values for ICC, and Maas and Hox (2005) used 0.1, 0.2, and 0.3 as the ICC values in their simulations. Hox (2010) further notes that ICC values can be much higher in small-group and family research. For example, Salvy, Howard, Read, and Mele (2009) observed ICCs ranging from .47 to .81 in data from various kinds of eating-partner dyads, and McIsaac, Connolly, McKenney, Pepler, and Craig (2008) observed an ICC of 0.84 in adolescent romantic dyads. Therefore, we also include 0.5 and 0.7 as levels of ICC in our simulation.
Proportion of Singletons (PS = 0%, 10%, 30%, 50%) and Missing Data Mechanisms (MCAR, MAR)
Missing data generally complicate the estimation of multilevel models. For dyadic data, one special type of missing data, singletons, can occur by design or for other reasons. A singleton means that data from one member of a dyad are available while data from the other member are missing. For example, in family research collecting data from both mothers and fathers, researchers may plan by design to collect data from all mothers but only half of the fathers because of budget constraints. Even when all mothers and all fathers are supposed to be measured by design, researchers have found that fathers are more likely than mothers to be absent from a data collection. Clarke and Wheaton (2007) showed that singletons may affect estimation. More generally, with large amounts of missing data, bias has been observed in the parameter estimates of multilevel models when the sample size is not large enough, even under missing-completely-at-random conditions (Gibson & Olejnik, 2003; Roth, 1994). Missing data proportions vary widely in empirical dyadic studies and can be as low as 0% and as high as 80% (Grover & Vriens, 2006; Strauss et al., 2004). Therefore, in this study, we consider 0%, 10%, 30%, and 50% for the proportion of singletons (PS). In addition, we consider two missingness mechanisms (Rubin, 1976): MCAR and MAR. The former occurs when missingness is unrelated to either observed or unobserved variables; the latter occurs when missingness is related to observed variables in the analysis model but not to unobserved variables. To generate MCAR missing data, we randomly set a specific proportion of fathers' posttest data to be missing. For generating MAR data, there are different ways, depending on how strongly missingness depends on the observed data. For example, we can set fathers' posttest data to be missing when their pretest scores are larger than the (1 − PS)th percentile. Alternatively, we can evenly divide the distribution of fathers' pretest data into several parts, for example, upper, middle, and lower parts; then, for example, PS × 50%, PS × 35%, and PS × 15% of the corresponding posttest scores of the three parts are set to be missing, respectively. Zhang and Wang (2016) used both approaches to generating MAR data and found that the former yielded stronger MAR data than the latter. In the current simulation, we use the former approach to generate "stronger" MAR data (a sketch of this step is given below).

In sum, there are 8 × 5 × 4 × 2 = 320 conditions. For the true parameter values, we set the fixed intercept coefficient (γ_A00) to 1 and all the other fixed coefficients to 0.3, following the simulation design in Maas and Hox (2005). The lower-level predictor X_id has a standard normal distribution, X_id ~ N(0, 1). The level-1 residual variance σ²_e is 0.5. Based on Equation 10, the level-2 residual variance σ²_A0d is determined by a given ICC value and the corresponding σ²_e value in the empty model. Note that if we simply use σ²_A0d / (σ²_A0d + σ²_e) from the full model in Equation 8 as the "ICC" and simulate data based on the full model, the ICC values calculated from the empty models are smaller than the desired ones. From some numerical analyses, we observed the relations between σ²_A0d / (σ²_A0d + σ²_e) in Equation 8 from the full model and the ICCs in Equation 10 from the empty model. Therefore, we fix the ratio σ²_A0d / (σ²_A0d + σ²_e) in the full model to certain values so as to achieve the designed ICC values (0.1, 0.2, 0.3, 0.5, and 0.7), respectively, in the empty model (see the Appendix for these relations). For evaluating the Type I error rates, we set the corresponding true values to 0; for example, for the Type I error evaluation of the level-2 variance estimates, we set μ_A0d to 0 (i.e., σ²_A0d = 0).

For each condition, 10,000 simulated data sets were generated. Analyses were implemented in SAS 9.3 using the SAS PROC MIXED procedure with REML estimation. For evaluation, convergence rates, bias in point estimates, coverage probabilities, and Type I error rates of the estimates of both fixed coefficients and variance components were examined. Nonconvergence can happen when the estimate of a variance parameter during the iterative process is near zero or negative, so that the estimated covariance matrix is not positive definite. In this study, nonconvergence was identified when SAS PROC MIXED produced an incomplete output of the variance/covariance estimates and/or an incomplete output of the fixed-effects estimates (e.g., missing standard errors) together with the warning "Estimated G matrix is not positive definite". For suggesting the minimum number of dyads needed, we consider a convergence rate of 95% or higher as satisfactory. For bias, the absolute bias is computed as the average of $|\hat{\theta} - \theta|$ across replications in conditions where θ = 0, and the relative bias is computed as the average of $((\hat{\theta} - \theta)/\theta) \times 100\%$ across replications in conditions where θ ≠ 0. A relative bias higher than 5% is considered unsatisfactory (Hoogland & Boomsma, 1998).
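As an illustration of the remaining steps of one replication under the MAR condition just described, the sketch below deletes fathers' posttest scores whose pretest scores exceed the (1 − PS)th percentile, fits the full model of Equation 8 with REML on the available rows, and checks convergence. It again assumes the `simulate_dyads` helper from above, and MixedLM's boolean `converged` flag is only a rough stand-in for the "Estimated G matrix is not positive definite" warning the authors used to flag nonconvergence in SAS PROC MIXED.

```python
import numpy as np
import statsmodels.formula.api as smf

def impose_mar_singletons(df, ps):
    """Set fathers' (r = +1) posttest scores y to missing when their pretest x
    exceeds the (1 - ps)th percentile, so a proportion ps of dyads become singletons."""
    out = df.copy()
    fathers = out["r"] == 1
    cutoff = np.quantile(out.loc[fathers, "x"], 1.0 - ps)
    out.loc[fathers & (out["x"] > cutoff), "y"] = np.nan
    return out

def fit_full_model(df):
    """Fit Equation 8 (all eight fixed effects plus one dyad-level random
    intercept) by REML, using only the non-missing rows."""
    complete = df.dropna(subset=["y"])
    return smf.mixedlm("y ~ g * r * x", complete,
                       groups=complete["dyad"]).fit(reml=True)

# One replication of, e.g., the NG = 30, ICC = 0.1, PS = 50% MAR condition
# (the full-model ratio .115 for NG = 30 comes from the Appendix):
data = impose_mar_singletons(simulate_dyads(30, ratio=0.115, seed=2), ps=0.50)
fit = fit_full_model(data)
print(fit.converged, fit.fe_params["g:r"])  # convergence flag; estimate of gamma_D01
```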
Table 1. Convergence rates (%) by different numbers of groups (NG), proportions of singletons (PS), and ICCs

NG    PS (%)   ICC = 0.1   ICC = 0.2   ICC = 0.3      PS (%)   ICC = 0.1   ICC = 0.2   ICC = 0.3
30      0        70.5        99.3        99.9           30       67.2        97.3        99.6
50      0        97.9       100.0       100.0           30       94.9        99.8       100.0
100     0       100.0       100.0       100.0           30      100.0       100.0       100.0
150     0       100.0       100.0       100.0           30      100.0       100.0       100.0
200     0       100.0       100.0       100.0           30      100.0       100.0       100.0
30     10        70.3        98.9        99.9           50       64.7        93.9        98.6
50     10        97.1       100.0       100.0           50       91.0        99.2        99.9
100    10       100.0       100.0       100.0           50       99.5       100.0       100.0
150    10       100.0       100.0       100.0           50       99.9       100.0       100.0
200    10       100.0       100.0       100.0           50      100.0       100.0       100.0

Notes. The convergence rates were all at or above 99% when ICC was .5 or .7 and were all 100% when NG = 300, 400, or 500, across all studied conditions; those results are therefore not listed here.
Coverage rates measure the accuracy of both the parameter estimates and the standard error estimates. In this study, the nominal coverage rate is 95% and the corresponding nominal Type I error rate is 0.05. We consider a coverage rate between 92.5% and 97.5% and a Type I error rate between 2.5% and 7.5% as satisfactory (Bradley, 1978).
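These evaluation criteria reduce to a few averages over the Monte Carlo replications. A minimal sketch, under the same assumptions as the earlier ones (`estimates`, `ci_lows`/`ci_highs`, and `pvalues` are collected from the fitted models across replications; all names are illustrative):

```python
import numpy as np

def relative_bias(estimates, theta):
    """Average of (theta_hat - theta) / theta x 100% across replications
    (used when theta != 0); above 5% in absolute value is unsatisfactory."""
    return np.mean((np.asarray(estimates) - theta) / theta) * 100.0

def absolute_bias(estimates, theta=0.0):
    """Average of |theta_hat - theta| across replications (used when theta = 0)."""
    return np.mean(np.abs(np.asarray(estimates) - theta))

def coverage_rate(ci_lows, ci_highs, theta):
    """Percentage of 95% confidence intervals containing theta;
    92.5%-97.5% is considered satisfactory."""
    inside = (np.asarray(ci_lows) <= theta) & (theta <= np.asarray(ci_highs))
    return np.mean(inside) * 100.0

def type1_error_rate(pvalues, alpha=0.05):
    """Rejection percentage when the true value is 0; 2.5%-7.5% is satisfactory."""
    return np.mean(np.asarray(pvalues) < alpha) * 100.0
```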
Results

The results on the influences of the factors and the sample size guidelines were very similar under the MCAR and MAR missingness mechanisms. To save space, we present detailed results only for the MAR conditions below. For the minimum numbers of dyads needed, we present suggestions for both the MCAR and MAR situations.
Influences of the Factors

Convergence
Convergence rates were generally lower with smaller ICCs, larger proportions of singletons, and/or smaller NGs. The lowest convergence rate (64.7%) was observed with the lowest studied ICC (0.1) in combination with the highest studied proportion of singletons (50%) and the smallest studied number of dyads (NG = 30), as shown in Table 1. With more dyads and higher ICCs, convergence improved. When ICC was no less than 0.3, convergence was no longer a concern.

To suggest the minimum numbers of dyads needed for adequate convergence, we consider two situations: dyadic studies in laboratory settings and dyadic studies using self-report surveys. For dyadic research in laboratory settings, both dyad members are supposed to participate in the experiment, but it is possible that one of the two dyad members fails to complete certain tasks or self-report measures while the other completes them. In this case, we do not expect many missing data in a cross-sectional or pretest-posttest study, and thus we expect the proportion of singletons to be controlled to not exceed 30%. When there are no missing data (PS = 0%), for ICCs as low as 0.1, 50 dyads are needed; for ICCs of 0.2 or higher, 30 dyads are needed to ensure satisfactory convergence rates. When PS = 30%, for ICCs as low as 0.1, 100 dyads are needed; for ICCs of 0.2 or higher, 30 dyads are needed to keep the convergence rate at or above 95%.

For dyadic research using self-report surveys, it is more difficult to control missing data, especially in online surveys. In other cases, researchers may intend to collect data from only a subset of the fathers because of budget or resource limits. In this situation, the proportion of singletons can be as high as 50%. When PS = 50%, for ICCs as low as 0.1, 100 dyads are needed; for ICCs of 0.2, 50 dyads are needed; and for ICCs of 0.3 or higher, 30 dyads are needed to keep the convergence rate at or above 95%.

Bias
Relative bias and absolute bias showed similar patterns for both the fixed-coefficients and the variance-components estimates. We therefore focus on the relative bias, for which 5% can be used as a reference cutoff (e.g., Hoogland & Boomsma, 1998). The fixed-coefficients estimates had negligible bias (all relative bias was smaller than 5%) whenever the number of dyads was 30 or more, across all studied conditions, regardless of the proportion of singletons and the ICC. Non-negligible bias in the variance-components estimates was found when the ICC was low (≤ 0.2) and the number of groups was small (e.g., NG = 30).
Table 2. Relative bias (%) of level-1 residual variance (σ²_e) and level-2 residual variance (σ²_A0d) estimates by different numbers of groups (NG), proportions of singletons (PS), and ICCs

                          σ²_A0d                              σ²_e
NG    PS (%)   ICC = 0.1   ICC = 0.2   ICC = 0.3   ICC = 0.1   ICC = 0.2   ICC = 0.3
30      0        82.5         0.4         0.4        −9.9         0.5         0.2
50      0         2.4         0.7         0.4         0.9         0.2         0.3
100     0         0.1         0.0         0.1         0.1         0.1         0.1
150     0         0.1         0.0         0.3         0.4         0.0         0.1
200     0         0.4         0.1         0.0         0.1         0.1         0.0
30     10        94.9         0.8         1.0       −10.7         0.1         0.0
50     10         3.4         0.3         0.2         0.9         0.2         0.3
100    10         0.1         0.4         0.1         0.5         0.2         0.3
150    10         0.1         0.0         0.2         0.0         0.2         0.3
200    10         0.2         0.1         0.3         0.2         0.1         0.2
30     30       117.7         3.5         0.6       −13.7         1.0         0.3
50     30         7.1         0.3         0.3         1.6         0.3         0.2
100    30         0.2         0.2         0.3         0.3         0.5         0.0
150    30         0.1         0.4         0.2         0.4         0.4         0.6
200    30         0.2         0.2         0.0         0.5         0.3         0.1
30     50       165.3         7.3         1.1       −19.4         4.2         1.1
50     50        13.2         0.8         0.1         3.7         0.1         0.3
100    50         0.4         0.1         0.0         0.2         0.5         0.2
150    50         0.3         0.1         0.1         0.7         0.6         0.6
200    50         0.0         0.0         0.2         0.5         0.4         0.6

Notes. The absolute relative bias was less than 1% in all conditions when ICC = .5 or .7, and also when NG = 300, 400, or 500; those results are therefore not listed here.
For example, when ICC was 0.1, the level-1 variance estimates showed evident downward bias with 30 dyads; the largest bias was −19.4%, with 50% singletons (see Table 2). Bias in the level-2 residual variance estimates was larger than that in the level-1 residual variance estimates when the number of groups was small (≤ 50) and the ICC was small (≤ 0.2), and the bias was positive under those conditions. The largest bias, 165.3%, occurred when ICC = 0.1 with 50% singletons and 30 dyads (also see Table 2). With larger numbers of dyads and/or higher ICCs, bias rapidly became smaller; bias also decreased as the proportion of singletons decreased. When the ICC was higher than 0.2, the number of dyads and the proportion of singletons had no visible effect on the bias of the variance-components estimates, because the biases were all negligible in those cases.

We offer the following suggestions for the minimum numbers of dyads needed to obtain negligible bias in both the fixed-coefficients and the variance-components estimates. When PS = 0%, for ICCs as low as 0.1, 50 dyads are needed; for ICCs of 0.2 or higher, 30 dyads are needed to keep the relative bias below 5%. When PS = 30%, for ICCs as low as 0.1, 100 dyads are needed; for ICCs of 0.2 or higher, 30 dyads are needed. When PS = 50%, for ICCs as low as 0.1, 100 dyads are needed; for ICCs of 0.2, 50 dyads are needed; and for ICCs of 0.3 or higher, 30 dyads are needed to keep the relative bias below 5%.

Coverage Rate
ICC did not have an observable effect on the coverage rates of the fixed-coefficients estimates. In contrast, both the number of dyads and the proportion of singletons slightly influenced those coverage rates. For example, across all the studied conditions, only when the proportion of singletons was 50% and the number of dyads was 30 did the coverage rates of γ_D01 fall outside the range of 92.5%-97.5%; those coverage rates were 91.4%, 92.0%, 92.3%, 91.4%, and 92.0% for ICCs = 0.1, 0.2, 0.3, 0.5, and 0.7, respectively. Fifty dyads were large enough to provide approximately 95% coverage rates for the fixed-coefficients estimates under the studied conditions.

For the level-1 residual variances, fewer dyads, larger proportions of singletons, and lower ICCs led to lower coverage rates. The lowest coverage rate (81.7%) was observed when the number of dyads was 30, ICC = 0.1, and the proportion of singletons was 50%. For the level-2 variance estimates, the combined effect of the number of groups, the proportion of singletons, and the ICC was more complex. Generally, when the number of groups was small (30 or 50), the coverage rates could deviate from the nominal value of 0.95. When there are no singletons, 50 dyads are needed; when there are some singletons (e.g., PS = 10% or higher), however, 100 dyads may be needed to keep the coverage rates of the variance estimates close to 0.95, as shown in Table 3.
Table 3. Coverage rates (%) of level-1 variance (σ²_e) estimates and level-2 variance (σ²_A0d) estimates by different numbers of groups (NG), proportions of singletons (PS), and ICCs

                                  σ²_A0d                                          σ²_e
NG   PS (%)   ICC=0.1  ICC=0.2  ICC=0.3  ICC=0.5  ICC=0.7   ICC=0.1  ICC=0.2  ICC=0.3  ICC=0.5  ICC=0.7
30     0        98.9     93.9     92.4     91.6     91.9      88.0     90.8     90.6     91.2     91.3
50     0        97.0     94.0     93.1     93.4     93.1      93.2     92.6     92.7     93.1     92.8
100    0        94.7     94.3     93.9     93.8     94.1      93.8     94.1     93.9     94.4     94.3
150    0        94.7     94.7     95.0     94.9     94.3      95.1     94.3     94.2     94.7     94.5
200    0        95.0     94.4     94.7     94.7     94.3      94.5     94.3     94.5     94.1     94.3
30    10        98.7     94.5     92.9     92.0     91.3      87.4     90.6     91.2     91.2     91.3
50    10        97.4     94.0     93.2     92.5     93.0      92.9     92.3     92.5     92.5     92.4
100   10        95.0     94.7     94.2     94.2     93.7      94.0     93.5     93.5     93.6     93.4
150   10        94.9     94.7     94.3     94.3     94.2      93.7     93.9     94.2     94.0     93.9
200   10        94.6     95.0     94.8     94.5     94.3      93.9     94.8     94.2     94.5     94.4
30    30        98.0     96.0     93.9     92.4     91.8      85.2     88.9     89.2     88.6     88.9
50    30        98.3     94.5     93.7     93.4     93.0      91.6     91.8     91.5     92.2     91.4
100   30        95.0     94.3     94.5     94.4     93.9      93.2     93.0     93.4     93.2     93.7
150   30        94.7     94.7     93.8     94.6     94.3      93.2     93.7     94.1     93.9     93.8
200   30        95.1     94.9     94.8     94.2     95.0      93.6     93.6     94.1     93.9     93.6
30    50        96.2     97.6     94.9     93.1     91.6      81.7     86.6     87.1     86.3     87.0
50    50        98.4     95.8     94.4     93.7     93.4      89.8     90.4     90.5     90.1     90.3
100   50        95.5     94.6     94.6     94.6     94.1      93.0     92.8     92.5     93.0     92.8
150   50        94.7     94.9     94.9     94.7     94.4      93.7     93.6     93.6     93.7     93.8
200   50        95.1     94.4     94.6     94.3     94.6      94.0     93.7     93.8     93.7     93.6

Notes. The coverage rates were all between 93.5% and 96% when NG = 300, 400, or 500, across all studied conditions; those results are therefore not listed here.
Table 4. Minimum numbers of dyads needed in similar dyadic research

                                          MAR                             MCAR
           ICC =               0.1   0.2   0.3   0.5   0.7     0.1   0.2   0.3   0.5   0.7
PS = 0%
  Convergence rate              50    30    30    30    30      50    30    30    30    30
  Bias                          50    30    30    30    30      50    30    30    30    30
  Coverage rate                 50    50    50    50    50      50    50    50    50    50
  Type I error rate             30    30    30    30    30      30    30    30    30    30
  Minimum number of dyads       50    50    50    50    50      50    50    50    50    50
PS = 30%
  Convergence rate             100    30    30    30    30      50    30    30    30    30
  Bias                         100    50    30    30    30     100    30    30    30    30
  Coverage rate                100   100   100   100   100     100   100   100   100   100
  Type I error rate             50    50    50    50    50      50    50    50    30    30
  Minimum number of dyads      100   100   100   100   100     100   100   100   100   100
PS = 50%
  Convergence rate             100    50    50    30    30     100    50    30    30    30
  Bias                         100    50    50    50    50      50    50    30    30    30
  Coverage rate                100   100   100   100   100     100   100   100   100   100
  Type I error rate             50    50    50    50    50      50    50    50    50    50
  Minimum number of dyads      100   100   100   100   100     100   100   100   100   100
Type I Error Rate
For the fixed coefficients, across all the studied conditions, only when the proportion of singletons was 50% and the number of dyads was 30 did the Type I error rates of γ_D01 fall outside the range of 2.5%-7.5%; those Type I error rates were 8.4%, 8.7%, 8.2%, 7.8%, and 8.2% for ICCs = 0.1, 0.2, 0.3, 0.5, and 0.7, respectively. When the number of dyads was 50 or larger, the Type I error rates of the fixed-coefficients estimates were close to 5% across all conditions. The Type I error rates of the level-2 variance estimates, ranging from 3.3% to 7%, were all within the acceptable range across the studied conditions.
Number of Dyads Needed
Based on our simulation results, minimum numbers of dyads needed for obtaining reliable and valid estimates from dyadic data analysis using multilevel models can be suggested. When the proportion of singletons is around 0%, 30%, or 50%, the suggested numbers for MAR and MCAR data are shown in Table 4. The findings can be applied to models similar to the one used in this paper. Based on the information in Table 4, our overall suggestion is to have a minimum of 50 dyads when PS = 0% and 100 dyads when PS = 30% or 50%. Although the suggested minimum numbers of dyads appear to be the same across different ICC values, the potential influences of the ICC on convergence rates and bias, discussed earlier and shown in Table 4, should not be ignored. Again, we emphasize that this study is not about power analysis: the suggested numbers do not guarantee sufficient power for statistical inference, so a study with the suggested numbers in Table 4 may still be underpowered.
Discussion and Conclusion
In this paper, we studied the influences of the number of dyads, the ICC, the proportion of singletons, and the missingness mechanism on dyadic data analysis using multilevel modeling. Generally, when the number of dyads is small (< 100), the proportion of singletons is high (≥ 50%), and/or the ICC is small (≤ 0.2), various estimation problems, especially in the estimation of variance components, such as nonconvergence, bias, and incorrect coverage rates, may occur. More dyads helped increase convergence rates, lower bias, and achieve good coverage rates. Overall, the findings are consistent with the simulation results in Clarke and Wheaton (2007) and Maas and Hox (2005). The findings may be helpful for the design of dyadic research, especially when a researcher intends to use planned singletons to save costs or expects singletons in survey research. Additionally, low ICCs have been found in dyadic research in industrial and organizational psychology (e.g., leader-member dyads; Erdogan, Liden, & Kraimer, 2006; Hsiung, 2012). In these cases, researchers should be cautious about using a small number of dyads (e.g., < 100), and a good sample size plan can help obtain reliable and valid estimates from dyadic data analysis using multilevel modeling.

We observed similar results on how the various factors affect estimation and inference under the MCAR and MAR conditions; in other words, very similar suggestions on the minimum numbers of dyads were obtained. This is because, when missingness is ignorable (MCAR/MAR), the same full-information ML procedure is used in both situations: the log likelihood function of the response variable is maximized with all available data to obtain the ML estimates. Under ignorable missingness, ML theoretically works well asymptotically, regardless of whether the data are MCAR or MAR (Little & Rubin, 2002). Our simulation results further showed that ML performed very similarly under MCAR and MAR with very small group sizes (1 or 2). When missingness is not ignorable (MNAR), however, the standard full-information ML method cannot be applied. Approaches such as pattern-mixture modeling have been suggested to handle nonignorable missing data in family research with dyads (Atkins, 2005). Future research should look further into the performance of advanced techniques for dyadic data analysis with nonignorable missing data and very small group sizes.

Our results can be applied to models similar to the one included in this paper. "A similar model" refers not only to dyadic models but also to multilevel models with a group size of 2. For example, the results can be applied to an educational context in which data from two students are collected from each class; in this case, our suggested numbers of dyads may be useful for sample size planning. As another instance, the results can be applied to longitudinal studies with two time points (e.g., randomized clinical trials with pre- and posttests).

Previous research in the multilevel modeling literature (e.g., Clarke & Wheaton, 2007; Maas & Hox, 2004, 2005) did not report many convergence problems, whereas we found more convergence problems when ICC ≤ 0.2. This may be due to the unique nature of dyadic data analyses: there are only two members in a dyad/group. Furthermore, our suggested minimum numbers of dyads in Table 4 appear not much larger than those given in previous sample size guidelines for multilevel modeling with larger group sizes. Our simulation model may be simpler in the variance-components part: we have only one level-2 variance component because of the small group/dyad size (i.e., 2), whereas, for example, the population model in Maas and Hox (2005) had two level-2 variance components and a level-2 covariance. A lower dimension of the random effects demands fewer dyads for good estimation.

For evaluating the influences of the factors on variance tests, we used the standard Wald test. However, researchers have found that this test may not be optimal; specifically, the Type I error rates from the Wald test can be lower than the nominal value, and the test can suffer from low power (e.g., Berkhof & Snijders, 2001; Ke & Wang, 2015). Other procedures, such as the deviance test (likelihood ratio test) with a 50:50 mixture of chi-square distributions as the reference distribution, can be used (a small sketch of this reference distribution follows the conclusions below). Future research should study the performance of the deviance test for testing variances in multilevel modeling with a very small group size (e.g., 2).

In sum, to answer our proposed research question, "can we generalize the findings from previous research on the impact of these factors on estimation in multilevel modeling, and the associated sample size guidelines, to dyadic data analysis using multilevel modeling with a group size of 2, with or without missing data?", our answer is:
1. overall, we can generalize the findings from previous research on the impact of these factors on estimation in multilevel models to dyadic data analysis using multilevel modeling;
2. researchers should pay attention to the impact of singletons on the estimation in dyadic data analysis;
3. more convergence problems may occur in dyadic data analysis using multilevel modeling, especially when ICC ≤ 0.2;
4. with a group size of 2, we may not need more groups than with a larger group size (e.g., 5) to obtain reliable and valid estimates, because the multilevel model that can be fitted may be simpler in the random-effects part; and
5. our overall suggestions are a minimum of 50 dyads when PS = 0% and 100 dyads when PS = 30% or 50% for obtaining reliable and valid estimates from dyadic data analysis using multilevel modeling.
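As referenced above, the deviance test for a single variance component can use a 50:50 mixture of χ²(0) and χ²(1) as the reference distribution, because the null value lies on the boundary of the parameter space. A minimal sketch of the resulting p-value computation, offered as an illustration rather than a procedure used in this study:

```python
from scipy.stats import chi2

def deviance_test_pvalue(lrt):
    """P-value of a likelihood ratio statistic for one variance component,
    referred to a 50:50 mixture of chi-square(0) and chi-square(1);
    chi-square(0) is a point mass at zero, so it contributes nothing for lrt > 0."""
    if lrt <= 0:
        return 1.0
    return 0.5 * chi2.sf(lrt, df=1)
```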
Acknowledgments This study is supported by a grant from the National Institute of Mental Health, 1R21MH097675-01A1, to the second author.
References

Afshartous, D. (1995, April). Determination of sample size for multilevel model design. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Atkins, D. C. (2005). Using multilevel models to analyze couple and family treatment data: Basic and advanced issues. Journal of Family Psychology, 19, 98.
Bakker, A. B., & Xanthopoulou, D. (2009). The crossover of daily work engagement: Test of an actor-partner interdependence model. Journal of Applied Psychology, 94, 1562–1571.
Bassiri, D. (1988). Large and small sample properties of maximum likelihood estimates for the hierarchical linear model. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.
Berkhof, J., & Snijders, T. A. (2001). Variance component testing in multilevel models. Journal of Educational and Behavioral Statistics, 26, 133–152.
Bradley, J. V. (1978). Robustness? The British Journal of Mathematical and Statistical Psychology, 31, 144–152.
Card, N. A., Selig, J. P., & Little, T. (2008). Modeling dyadic and interdependent data in the developmental and behavioral sciences. New York, NY: Routledge.
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxbury/Thomson Learning.
Clarke, P., & Wheaton, B. (2007). Addressing data sparseness in contextual population research using cluster analysis to create synthetic neighborhoods. Sociological Methods & Research, 35, 311–351.
Erdogan, B., Liden, R. C., & Kraimer, M. L. (2006). Justice and leader-member exchange: The moderating role of organizational culture. Academy of Management Journal, 49, 395–406.
Gibson, N. M., & Olejnik, S. (2003). Treatment of missing data at the second level of hierarchical linear models. Educational and Psychological Measurement, 63, 204–238.
Goldstein, H. (1995). Multilevel statistical models (2nd ed.). London, UK: Griffin.
Grover, R., & Vriens, M. (2006). The handbook of marketing research: Uses, misuses, and future advances. Thousand Oaks, CA: Sage.
Hoogland, J. J., & Boomsma, A. (1998). Robustness studies in covariance structure modeling: An overview and a meta-analysis. Sociological Methods & Research, 26, 329–367.
Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York, NY: Routledge.
Hox, J. J., & Maas, C. J. (2001). The accuracy of multilevel structural equation modeling with pseudobalanced groups and small samples. Structural Equation Modeling, 8, 157–174.
Hsiung, H.-H. (2012). Authentic leadership and employee voice behavior: A multi-level psychological process. Journal of Business Ethics, 107, 349–361.
Kashy, D. A., Donnellan, M. B., Burt, S. A., & McGue, M. (2008). Growth curve models for indistinguishable dyads using multilevel modeling and structural equation modeling: The case of adolescent twins' conflict with their mothers. Developmental Psychology, 44, 316.
Ke, Z., & Wang, L. (2015). Detecting individual differences in change: Methods and comparisons. Structural Equation Modeling, 22, 382–400. doi: 10.1080/10705511.2014.936096
Kenny, D. A. (1995). The effect of nonindependence on significance testing in dyadic research. Personal Relationships, 2, 67–75.
Kenny, D. A., & Acitelli, L. K. (2001). Accuracy and bias in the perception of the partner in a close relationship. Journal of Personality and Social Psychology, 80, 439–448.
Kenny, D. A., & Judd, C. M. (1986). Consequences of violating the independence assumption in analysis of variance. Psychological Bulletin, 99, 422–431.
Kenny, D. A., Kashy, D., & Cook, W. L. (2006). Dyadic data analysis. New York, NY: Guilford Press.
Kreft, I. G. (1996). Are multilevel techniques necessary? An overview, including simulation studies. Unpublished manuscript, California State University at Los Angeles.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). New York, NY: Wiley.
Maas, C. J., & Hox, J. J. (2004). The influence of violations of assumptions on multilevel parameter estimates and their standard errors. Computational Statistics & Data Analysis, 46, 427–440.
Maas, C. J., & Hox, J. J. (2005). Sufficient sample sizes for multilevel modeling. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 1, 86–92.
McIsaac, C., Connolly, J., McKenney, K. S., Pepler, D., & Craig, W. (2008). Conflict negotiation and autonomy processes in adolescent romantic relationships: An observational study of interdependency in boyfriend and girlfriend effects. Journal of Adolescence, 31, 691–707.
Moineddin, R., Matheson, F. I., & Glazier, R. H. (2007). A simulation study of sample size for multilevel logistic regression models. BMC Medical Research Methodology, 7, 34–43.
Newsom, J. T. (2002). A multilevel structural equation model for dyadic data. Structural Equation Modeling, 9, 431–447.
Paccagnella, O. (2011). Sample size and accuracy of estimates in multilevel models. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 7, 111–120.
Peugh, J. L. (2010). A practical guide to multilevel modeling. Journal of School Psychology, 48, 85–112.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Roth, P. L. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47, 537–560.
Rubin, D. (1976). Inference and missing data. Biometrika, 63, 581–592. doi: 10.1093/biomet/63.3.581
Salvy, S.-J., Howard, M., Read, M., & Mele, E. (2009). The presence of friends increases food intake in youth. The American Journal of Clinical Nutrition, 90, 282–287.
Shieh, Y. Y., & Fouladi, R. T. (2003). The effect of multicollinearity on multilevel modeling parameter estimates and standard errors. Educational and Psychological Measurement, 63, 951–985.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York, NY: Oxford University Press.
Strauss, W., Menkedick, J., Ryan, L., Pivetz, T., McMillan, N., Pierce, B., & Rust, S. (2004). White paper on evaluation of sampling design options for the national children's study. Columbus, OH: Battelle.
Townsend, J., Phillips, J. S., & Elkins, T. J. (2000). Employee retaliation: The neglected consequence of poor leader-member exchange relations. Journal of Occupational Health Psychology, 5, 457.
Van der Leeden, R., Busing, F., & Meijer, E. (1997, April). Applications of bootstrap methods for two-level models. Paper presented at the Multilevel Conference, Amsterdam.
Wertsch, J. V., McNamee, G. D., McLane, J. B., & Budwig, N. A. (1980). The adult-child dyad as a problem-solving system. Child Development, 51, 1215–1221.
Zhang, Q., & Wang, L. (2016). Moderation analysis with missing data in the predictors. Psychological Methods. Manuscript under review.
Received July 17, 2014
Revision received May 27, 2015
Accepted October 21, 2015
Published online April 1, 2016

Han Du obtained her bachelor's degree in Psychology from Beijing Normal University in 2012. She is currently a PhD student under the supervision of Lijuan Wang at the University of Notre Dame, IN. Her research is focused on Bayesian methods, meta-analysis, longitudinal research, and dyadic data analysis.

Lijuan Wang (PhD) is an Associate Professor of Quantitative Psychology at the University of Notre Dame, IN. Her current research interests include methods for longitudinal data analysis (e.g., multilevel modeling, structural equation modeling, estimation and inference), mediation and moderation analysis, and study design issues.
Lijuan Wang 118 Haggar Hall Department of Psychology University of Notre Dame Notre Dame, IN 46556 USA Tel. +1 (574) 631-7243 Fax +1 (574) 631-6650 E-mail lwang4@nd.edu
Appendix

The relations between σ²_A0d / (σ²_A0d + σ²_e) from the full model and the ICC, σ′²_A0d / (σ′²_A0d + σ′²_e), from the empty model

ICC from the      σ²_A0d / (σ²_A0d + σ²_e) from the full model
empty model    NG = 30   NG = 50   NG = 100   NG = 150   NG = 200   NG = 300   NG = 400   NG = 500
0.1             .115      .300      .370       .391       .407       .414       .416       .417
0.2             .470      .502      .523       .524       .526       .524       .530       .527
0.3             .610      .615      .617       .616       .617       .618       .617       .615
0.5             .771      .767      .764       .765       .763       .763       .763       .762
0.7             .878      .878      .876       .875       .874       .875       .875       .874