Exercises in Plant Disease Epidemiology, Second Edition




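Chapter 1

Table 1.1. Key distributions and models

Normal distribution
    Formula: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$
    Parameters: μ = mean; σ² = variance

Poisson distribution
    Formula: $P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$
    Parameters: λ is both mean and variance

Binomial distribution
    Formula: $P(X = x) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x}$
    Parameters: n = number of trials; x = number of successful trials; p = probability of success in each trial

Exponential distribution
    Formula: $f(x) = \lambda e^{-\lambda x}$
    Parameters: λ = rate parameter, and the inverse of this parameter represents the mean

Logistic model
    Formula: $y = \frac{1}{1 + \frac{1-y_{0}}{y_{0}}\, e^{-rt}}$
    Parameters: y0 = constant of integration = disease intensity at time = 0; r = rate parameter; t = time

Beta distribution
    Formula: $f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$
    Parameters: B = normalization constant; α and β represent shape parameters, and the mean is α/(α + β)

Beta-binomial distribution
    Formula: [not legible in the source]
    Parameters: The mean is represented as [not legible in the source]. The variance is represented as [not legible in the source].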

common statistical distributions. Knowing how statistical distributions work helps in anticipating which statistical analyses are appropriate. For example, when the common assumption that data have a normal distribution is not appropriate, generalized linear models can often be used (2). Generalized linear models are “generalized” in the sense that, unlike typical linear models, they do not assume a normal distribution for residuals (2). Instead, they are flexible enough that they can be used to describe variables that have an approximate Poisson or binomial distribution, for example. These distributions are described below.

Developing epidemiological models often involves fitting statistical distributions to data. For example, an exponential model may be fitted to the probability of pathogen dispersal to different distances from an inoculum source. A logistic model may be fitted to the level of disease severity over time. Again, the goal of using these models is to come up with a small number of parameters that can summarize a potentially large dataset. Then these parameters can be used in statistical tests to compare different groups, such as a group with and a group without a pesticide application or cultural practice.

The procedures discussed below provide an overview of distributions and models important in epidemiology. Statisticians sometimes speak of a “bestiary” of models (1): the number of distributions described continues to grow and is limited only by people’s imaginations. For examples, see Forbes et al. (5) and Krishnamoorthy (8). But some distributions have proven particularly useful, and we focus on those here in a bestiary consisting of only some of the most charismatic beasts (models). Additionally, we illustrate the use of the R software (http://www.r-project.org) for this purpose. The R programming environment is very flexible and provides a platform for exploring the nature of distributions and models through graphs, making it easy to illustrate how modifying parameter values changes the nature of models.
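As a minimal sketch of the generalized linear model idea mentioned above, the following code simulates lesion counts for two hypothetical groups and fits a Poisson model with the base R function glm. The data, the object names (treatment, lesions, fit), and the group means of 5 and 2 are assumptions made for illustration and are not taken from the text.

```r
set.seed(1)  # makes the simulated example reproducible

# Hypothetical data: lesion counts on 50 untreated and 50 treated plants
treatment <- factor(rep(c('untreated', 'treated'), each = 50),
                    levels = c('untreated', 'treated'))
lesions <- rpois(100, lambda = ifelse(treatment == 'untreated', 5, 2))

# Generalized linear model with a Poisson response and the default log link
fit <- glm(lesions ~ treatment, family = poisson)
summary(fit)

# Back-transform the coefficients to estimated mean counts per group
mean_untreated <- exp(coef(fit)[1])
mean_treated <- exp(sum(coef(fit)))
c(untreated = unname(mean_untreated), treated = unname(mean_treated))
```

Because the model has a single two-level factor, the back-transformed estimates should match the sample means of the two groups, which gives a quick check that the model was set up as intended.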

PROCEDURE

Normal Distribution

The normal distribution, also known as the Gaussian distribution or the “bell curve,” is one of the most commonly used statistical distributions. It can often reasonably be used to describe the distribution of continuous variables such as plant heights or plant yields, where values tend to lie in the middle of the distribution and values away from the middle of the distribution rapidly become less likely.

A distribution such as the normal has two important uses in epidemiology. First, if a response such as yield is approximately normally distributed, then two parameters, the mean and variance of yield, can be used to provide an efficient summary of a large number of yield responses. Second, if a response is approximately normally distributed within a group, then many statistical analyses, such as a t-test for comparing yield between two groups, are simplified.

R has a set of functions that makes it easy to study the form of common distributions such as the normal distribution. One approach is to plot a histogram of values drawn from a normal distribution.


An Introduction to Key Distributions and Models in Epidemiology Using R / 5

    hist(rnorm(10000, mean = 0, sd = 1)) # text following '#' is a comment
    help(hist)  # opens a window with information about the hist command
    help(rnorm) # opens a window with information about the rnorm command

Note that when a command like hist is entered in R, a graphics window opens automatically. To compare the effects of changing the mean and standard deviation, try the following code, in which the ranges of the x-axis and y-axis have been kept the same (-20 to 20 and 0 to 3500, respectively) for all plots to help make clear the differences (Fig. 1.1).

    par(mfrow = c(2, 2)) # Sets up a graphics window with 2 rows and 2 columns
    hist(rnorm(10000, mean = 0, sd = 1), breaks = -20:20, xlim = c(-20, 20),
         ylim = c(0, 3500), main = 'Mean = 0, SD = 1')
    hist(rnorm(10000, mean = 5, sd = 1), breaks = -20:20, xlim = c(-20, 20),
         ylim = c(0, 3500), main = 'Mean = 5, SD = 1')
    hist(rnorm(10000, mean = 0, sd = 3), breaks = -20:20, xlim = c(-20, 20),
         ylim = c(0, 3500), main = 'Mean = 0, SD = 3')
    hist(rnorm(10000, mean = 5, sd = 3), breaks = -20:20, xlim = c(-20, 20),
         ylim = c(0, 3500), main = 'Mean = 5, SD = 3')

The values of mean and standard deviation can be replaced in this code to explore other values (although it might be necessary to modify the range of the x-axis to accommodate the new distributions as well as the range of values in the command breaks = -20:20 that determines where the breaks in the histogram fall). The software appendix gives tips for easy editing of R commands.

Plotting a histogram of random variables generated from a distribution is one way to see the form of the distribution. Another is to plot the probability density function itself. R has a function curve that makes it easy to plot most models. The following code plots a normal probability density function, f(x), for the mean (nmean) and standard deviation (nsd) values indicated (Fig. 1.2).

Fig. 1.1. Normal distribution examples.



Fig. 1.2. Using curve to plot normal distributions.

    help(dnorm) # information about how the dnorm function works
    par(mfrow = c(2, 2)) # Sets up a graphics window with 2 rows and 2 columns
    nmean <- 0 # the object nmean is now equal to 0
    nsd <- 1   # the object nsd is now equal to 1
    # plots the values of the normal probability density function
    curve(dnorm(x, mean = nmean, sd = nsd), from = -20, to = 20, xlab = 'x',
          ylab = 'f(x)', main = 'Mean=0, SD=1')
    nmean <- 5 # now the object nmean is equal to 5, replacing the old value 0

You will then need to enter the plotting command using curve again to see the new plot with mean equal to 5, and note that the main title (main) has been changed as well to reflect the new mean.

    curve(dnorm(x, mean = nmean, sd = nsd), from = -20, to = 20, ylim = c(0, 0.4),
          xlab = 'x', ylab = 'f(x)', main = 'Mean=5, SD=1')

Again, you can explore the effects of different values of mean and standard deviation, potentially needing to change the range of the x-axis to see the outcome. If, for example, you enter the commands

    nmean <- 0
    nsd <- 3

you will need to enter the plotting command again to see the new plot with mean equal to 0 and standard deviation equal to 3 and to change the main label to avoid confusion:

    curve(dnorm(x, mean = nmean, sd = nsd), from = -20, to = 20, ylim = c(0, 0.4),
          xlab = 'x', ylab = 'f(x)', main = 'Mean=0, SD=3')

Generate a new plot with mean 5 and standard deviation 3.

    nmean <- 5
    nsd <- 3
    curve(dnorm(x, mean = nmean, sd = nsd), from = -20, to = 20, ylim = c(0, 0.4),
          xlab = 'x', ylab = 'f(x)', main = 'Mean=5, SD=3')
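The two approaches described above, a histogram of random draws and a curve of the density function, can also be combined in one plot. The following is a minimal sketch, assuming a mean of 5 and a standard deviation of 3; the object name x_sample and the choice of 40 breaks are illustrative only.

```r
set.seed(1)  # makes the simulated example reproducible
nmean <- 5
nsd <- 3
x_sample <- rnorm(10000, mean = nmean, sd = nsd)

par(mfrow = c(1, 1))
# freq = FALSE rescales the bars so that their total area is 1,
# which puts the histogram on the same scale as the density curve
hist(x_sample, breaks = 40, freq = FALSE, xlab = 'x', main = 'Mean=5, SD=3')
# add = TRUE draws the curve on top of the existing histogram
curve(dnorm(x, mean = nmean, sd = nsd), add = TRUE, lwd = 2)
```

With 10,000 draws the bars should follow the curve closely; with a much smaller sample, the mismatch gives a sense of how much sampling variation to expect.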



The normal distribution is limited to describing variables with a symmetric distribution, such that the structure of both tails is the same, or at least very similar. When statistical analyses include the assumption of normality, tests of the normality of residuals (the differences between the model predictions and the observed values) are generally performed to make sure that the assumption is reasonable. For any normal distribution, there is a nonzero, though potentially extremely small, probability of negative values. Often this is not an important issue, if the probability is small enough to be irrelevant; but sometimes a different distribution that does not include negative values is needed. Additional examples of working with normal distributions in R are available online in Garrett et al. (6). An example of checking the assumption of normality in a regression analysis is available online in Sparks et al. (11).
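As a rough illustration of these two points, the sketch below checks the residuals of a simple regression for normality and computes the probability of a negative value under a normal distribution. The simulated data and the object names (x, yield, fit, res, p_negative) are hypothetical, and the Shapiro-Wilk test is only one of several checks that could be used here.

```r
set.seed(1)  # makes the simulated example reproducible
# Hypothetical data: yield responds linearly to x with normal errors
x <- runif(100, min = 0, max = 10)
yield <- 20 + 2 * x + rnorm(100, mean = 0, sd = 3)
fit <- lm(yield ~ x)
res <- residuals(fit)  # observed values minus model predictions

par(mfrow = c(1, 2))
qqnorm(res)  # points close to the reference line suggest approximate normality
qqline(res)
hist(res, main = 'Residuals')
shapiro.test(res)  # Shapiro-Wilk test; a small p-value suggests non-normality

# Probability of a value below zero for a normal distribution
# with mean 5 and standard deviation 3
p_negative <- pnorm(0, mean = 5, sd = 3)
p_negative
```

Increasing the mean or decreasing the standard deviation in the pnorm call shows how quickly the probability of a negative value becomes negligible.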

Poisson Distribution

The Poisson distribution differs from the normal by being limited to values greater than or equal to zero. It differs from both the normal and lognormal by taking on only integer values and by having only one parameter (often called lambda), which is both the mean and the variance. The Poisson distribution is useful in epidemiology for modeling count data, such as the number of lesions on a leaf, or in situations in which the goal is to model spatial patterns, such as the incidence of infected trees (see examples in Chapters 4 and 5).

As for the normal distribution, R includes functions for working with the Poisson distribution. The following commands illustrate the effects of changing the Poisson mean (Fig. 1.3).

    par(mfrow = c(2, 2))
    hist(rpois(10000, lambda = 0.7), breaks = seq(0, 20, 0.5), xlim = c(0, 20),
         ylim = c(0, 5000), main = 'Lambda=0.7')
    hist(rpois(10000, lambda = 1), breaks = seq(0, 20, 0.5), xlim = c(0, 20),
         ylim = c(0, 5000), main = 'Lambda=1')
    hist(rpois(10000, lambda = 3), breaks = seq(0, 20, 0.5), xlim = c(0, 20),
         ylim = c(0, 5000), main = 'Lambda=3')
    hist(rpois(10000, lambda = 5), breaks = seq(0, 20, 0.5), xlim = c(0, 20),
         ylim = c(0, 5000), main = 'Lambda=5')

You can explore the effects of changing lambda in these commands (again potentially needing to change the x-axis and y-axis limits to see the results, as well as the maximum value, 20, in the breaks statement, breaks = seq(0, 20, 0.5)).

Fig. 1.3. Histograms of Poisson distributions.



The use of histograms does not necessarily make it completely clear that the Poisson distribution is discrete (includes only integers), in contrast to the continuous normal distribution. The additional commands in the Evaluation section plot the probability distribution for Poisson variables showing the probability for each integer value. Note that as lambda gets higher, the distribution becomes more similar to a normal distribution, and in some cases a normal distribution may be similar enough to be a good approximation for count data. For low values of lambda, however, the distribution is far from normal.
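The comparison with the normal distribution can be sketched as follows. This is an illustration only, not the code from the Evaluation section: it assumes lambda values of 1 and 20, plots the Poisson probability of each integer as a vertical line, and overlays a normal density with the same mean and variance. The object names k and max_diff are hypothetical.

```r
k <- 0:40  # integer values at which the Poisson probabilities are evaluated
par(mfrow = c(1, 2))
for (lambda in c(1, 20)) {
  # Vertical lines show the probability of each integer value
  plot(k, dpois(k, lambda = lambda), type = 'h', xlab = 'x',
       ylab = 'P(X = x)', main = paste0('Lambda=', lambda))
  # Dashed curve: normal density with mean lambda and variance lambda
  curve(dnorm(x, mean = lambda, sd = sqrt(lambda)), add = TRUE, lty = 2)
}

# Largest gap between the Poisson probabilities and the normal density
max_diff <- sapply(c(1, 20), function(lambda) {
  max(abs(dpois(k, lambda = lambda) -
          dnorm(k, mean = lambda, sd = sqrt(lambda))))
})
max_diff
```

The gap for lambda = 1 should be clearly larger than the gap for lambda = 20, in line with the statement that the approximation improves as lambda increases.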

Binomial Distribution

The binomial distribution is one of the most important distributions in plant disease epidemiology. It describes the case in which there are two possible outcomes in a “trial.” For example, a plant may become infected or not. A farmer evaluating disease progress may decide to use a pesticide or not. Evaluation may be for a single trial (one plant or one farmer) or multiple trials (many plants or many farmers). For a single trial, the binomial distribution has only one parameter, the probability of “success.” Success is defined in different ways in different scenarios, such as “plant is infected” or “farmer decides to use pesticide.”

R also has functions for working with the binomial distribution. The commands below give examples of binomial outcomes for a range of probabilities (prob) (Fig. 1.4).

    par(mfrow = c(2, 2))
    hist(rbinom(10000, size = 1, prob = 0.1), xlim = c(0, 1),
         ylim = c(0, 10000), main = 'P=0.1')
    hist(rbinom(10000, size = 1, prob = 0.2), xlim = c(0, 1),
         ylim = c(0, 10000), main = 'P=0.2')
    hist(rbinom(10000, size = 1, prob = 0.5), xlim = c(0, 1),
         ylim = c(0, 10000), main = 'P=0.5')
    hist(rbinom(10000, size = 1, prob = 0.9), xlim = c(0, 1),
         ylim = c(0, 10000), main = 'P=0.9')

Note that changing the value of the probability parameter changes how frequently 0 or 1 is drawn in a single trial.

Fig. 1.4. Histograms of binomial distributions for a range of probabilities.



You can also explore how the binomial distribution works by changing the parameter size, which determines how many trials are being considered. (Again, you will need to adjust the limits of the x-axis to see all the results.) For cases in which there are many trials and intermediate probability of success, the binomial distribution can also become approximately normal. However, in many cases the distribution is so far from normal that tests must be adapted to the binomial nature of response variables. For example, generalized linear models can be used when response variables have a binomial distribution (2).
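One way to explore the size parameter is sketched below, assuming a fixed probability of success of 0.3 and sizes of 1, 5, 20, and 100 trials. These values and the object names n_trials and infected are chosen for illustration only.

```r
set.seed(1)  # makes the simulated example reproducible
par(mfrow = c(2, 2))
for (n_trials in c(1, 5, 20, 100)) {
  # Each draw is the number of successes out of n_trials trials
  infected <- rbinom(10000, size = n_trials, prob = 0.3)
  # Breaks are centered on the integers 0, 1, ..., n_trials
  hist(infected, breaks = seq(-0.5, n_trials + 0.5, by = 1),
       xlab = 'Number of successes',
       main = paste0('Size=', n_trials, ', P=0.3'))
}
```

For the larger sizes the histograms should look increasingly bell-shaped, while for a size of 1 or 5 the distribution is visibly discrete and asymmetric.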

Exponential Distribution and Model

An exponential distribution is continuous and nonnegative and can be used to model phenomena in which the probability of higher values declines rapidly. An exponential distribution can be used to describe disease progress in space that is characterized by a rapid decline in disease risk at greater distances from an inoculum source (Fig. 1.5).

    par(mfrow = c(2, 2))
    # breaks extend to 200 so that a rare draw above 100 does not stop hist
    # with an error; xlim still limits the plotted range to 0 to 100
    hist(rexp(10000, rate = 0.1), breaks = seq(0, 200, 5), xlim = c(0, 100),
         ylim = c(0, 10000), main = 'Rate=0.1')
    hist(rexp(10000, rate = 0.2), breaks = seq(0, 200, 5), xlim = c(0, 100),
         ylim = c(0, 10000), main = 'Rate=0.2')
    hist(rexp(10000, rate = 0.3), breaks = seq(0, 200, 5), xlim = c(0, 100),
         ylim = c(0, 10000), main = 'Rate=0.3')
    hist(rexp(10000, rate = 0.4), breaks = seq(0, 200, 5), xlim = c(0, 100),
         ylim = c(0, 10000), main = 'Rate=0.4')

The exponential distribution can also be plotted using the formula describing the probability distribution as in the following code, in which lambda can be adjusted to see the results for different parameters. Note that the first plotting command gives the results for an exponential distribution (declining) and the second plotting command gives the results for one minus the distribution (increasing) (Fig. 1.6).

Fig. 1.5. Histograms of exponential distribution rates.
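A minimal sketch of plotting an exponential curve from its formula is given below. It is not the original code for Fig. 1.6: the value lambda = 0.2, the x-axis range of 0 to 30, and the reading of “one minus the distribution” as 1 - exp(-lambda * x) are all assumptions made for illustration.

```r
lambda <- 0.2  # rate parameter; the mean is 1 / lambda = 5
par(mfrow = c(1, 2))

# Exponential probability density, lambda * exp(-lambda * x) (declining)
curve(lambda * exp(-lambda * x), from = 0, to = 30, xlab = 'x',
      ylab = 'f(x)', main = 'Exponential density')

# One minus the exponential decline exp(-lambda * x) (increasing);
# this is the same as the cumulative distribution pexp(x, rate = lambda)
curve(1 - exp(-lambda * x), from = 0, to = 30, xlab = 'x',
      ylab = 'F(x)', main = 'One minus exponential')
```

The first expression gives the same values as the built-in function dexp(x, rate = lambda), so either form can be passed to curve.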

