Master Thesis submitted within the UNIGIS MSc programme at Z_GIS, University of Salzburg
Spatial Data Analysis of Fusarium Wilt Incidence in the District of San Luis de Shuaro, Peru
by
David Brown Fuentes 1123727
A thesis submitted in partial fulfilment of the requirements of the degree of Master of Science (Geographical Information Science & Systems) – MSc (GIS)
Turrialba, August 2014
Dedicated to my wife and my parents.
Abstract
A spatial data analysis was conducted on the incidence of Fusarium Wilt of banana in the district of San Luis de Shuaro, Peru. The dataset was consolidated from raw data and contains 76 records, each representing a farm sampled for Fusarium Wilt of banana. The spatial distribution was represented as a point map in which each point is a sampled farm. Thiessen Polygons were used as a basic interpolation method to obtain a general representation of the disease distribution in the study area. The Spatial Data Analysis includes spatial autocorrelation tests, applied through Moran’s I, Geary’s c and a modified version of Moran’s I using an Empirical Bayes Estimate. The relationship between the disease incidence of Fusarium Wilt of banana and possible explanatory variables such as altitude, soil pH, plant density, farm type, farm size, banana variety, shade percentage, slope and the Soil Quality Index was also analysed through a GAMLSS model with a beta-binomial distribution. The results show no evidence of spatial autocorrelation in the disease incidence. The regression model shows that the most significant explanatory variables of the disease incidence are soil pH, altitude, plant density and the Soil Quality Index.
Acknowledgements
I want to recognize the support of Bioversity International for allowing me to join the Master Degree Program at UNIGIS, especially Miguel Dita, Charles Staver, Stephan Weise and Dietmar Stoian for supporting the development of my career in GIS at Bioversity International. I want to recognize the valuable help of Karl Atzmanstorfer as my thesis advisor and of Gunda Cespedes during the revision process of my thesis, both at the University of Salzburg. I thank Jacob van Etten for his comments and for lending me a very useful book, and Philippe Tixier for all his comments and suggestions. A special acknowledgement goes to Carlos Román Jerí for his hard work collecting data for his research in San Luis de Shuaro, Peru, which also allowed me to conduct this work. Finally, and most importantly, I want to acknowledge the support and valuable suggestions of my wife Karol Paola during the development of my thesis.
Table of contents
Abstract
Acknowledgements
Table of contents
List of Figures
List of Tables
1. Introduction
1.1 Motivation
1.2 Problem Description
1.3 Objectives
1.4 Hypothesis
1.5 Scope
2. Literature review
2.1 Fusarium Wilt
2.2 Disease Incidence
2.3 Spatial Data Analysis
2.3.1 Areal Analysis
2.3.2 Geostatistical Analysis
2.3.3 Point Pattern Analysis
2.3.4 Spatial Autocorrelation
2.4 Exploratory Spatial Data Analysis
2.5 Measures of Spatial Autocorrelation
2.6 Moran’s I using Empirical Bayes Estimates
2.7 Semivariogram
2.8 Thiessen Polygons
2.9 Spatial Data Analysis Applied to Plant Disease Incidence
2.10 Linear Regression Model and some considerations about it
2.11 Generalised Linear Models
3 Methodology
3.1 Case study area: San Luis de Shuaro
3.2 Workflow diagram
3.3 Data preparation
3.5 Mapping the Spatial Distribution of Disease Incidence
3.5.1 Points Map
3.5.2 Thiessen Polygons
3.6 Data Exploration
3.6.1 Histogram
3.6.2 Normal Q-Q Plot
3.6.3 Spatial autocorrelation tests
3.6.3.1 Neighbour List and Spatial Weights
3.6.3.2 Moran’s I
3.6.3.3 Geary’s c
3.6.3.4 Modified Moran’s I – Empirical Bayes Index
3.7 GAMLSS applied to Fusarium Wilt Disease Incidence
4 Results and Analysis
4.1 Map of the Spatial Distribution of Disease Incidence in the Study Area
4.2 Results of Exploratory Data Analysis
4.2.1 Histogram
4.2.2 Normal Q-Q Plot
4.2.3 Skewness and Kurtosis
4.2.4 Neighbours list and Spatial Weights
4.2.5 Results of Moran’s I Test
4.2.6 Results of Geary’s c Test
4.2.7 Results of Global Moran’s I using Empirical Bayes Estimates
4.3 Results of the GAMLSS
4.4 Analysis of the Results
4.4.1 Spatial Distribution of Fusarium Wilt in the study area
4.4.2 Spatial Autocorrelation Tests
4.4.3 GAMLSS
4 Conclusions
5 References
6 Annexes
Annex 1 – R Code utilised for test spatial autocorrelation
Annex 2 – R Code utilised for the GAMLSS
List of Figures
Figure 1 – Location of San Luis de Shuaro District (David Brown, ArcGIS for Desktop 10.2)
Figure 2 – Elevation of the study area (David Brown, ArcGIS for Desktop 10.2)
Figure 3 – Location of banana farms in the study area (David Brown, ArcGIS for Desktop 10.2)
Figure 4 – Workflow diagram (David Brown, Microsoft Word 2010)
Figure 5 – Symbols and classification utilised to map the distribution of the measured disease incidence (David Brown, ArcGIS for Desktop 10.2)
Figure 6 – Map of the Spatial Distribution of Fusarium wilt incidence in the study area (David Brown, ArcGIS for Desktop 10.2)
Figure 7 – Thiessen Polygons constructed from the points of each sampled farm (David Brown, ArcGIS for Desktop 10.2)
Figure 8 – Histogram of disease incidence rate
Figure 9 – Histogram of a normal distribution from simulated data
Figure 10 – Normal Q-Q Plot of the disease incidence
Figure 11 – Normal Q-Q Plot for a normal distribution for simulated data
Figure 12 – a) Delaunay triangulation, b) Sphere of Indifference, c) Distance based
Figure 13 – a) k nearest neighbour with k = 1, b) k nearest neighbour with k = 5, c) k nearest neighbour with k = 10
Figure 14 – Plots of the residuals for model validation
List of Tables
Table 1 – Categories of Spatial Data Analysis according to the spatial data type
Table 2 – Main Characteristics and Differences between Moran’s I and Geary’s c, based on Fortin et al. (2002)
Table 3 – Variables included in the dataset consolidated from the raw data provided by Román Jerí (2012)
Table 4 – Results of Moran’s I test computations using different neighbours list methods
Table 5 – Results of Geary’s c test computations using different neighbours list methods
Table 6 – Results of Moran’s I with Empirical Bayes Estimates computations using different neighbours list methods
Table 7 – Results of the GLM model with a Binomial distribution and all the proposed explanatory variables
Table 8 – Results for the base model with all the proposed variables
Table 9 – Results of the adjusted model after applying the variable selection using stepGAIC
1. Introduction
1.1 Motivation
Bananas and plantains are very important crops in developing countries, both as a staple food and as a commodity (Arias, Dankers, Liu & Pilkauskas, 2003). They are cultivated in more than 100 countries around the world, in both tropical and subtropical regions (Frison and Sharrock, 1998). Often referred to simply as bananas, they are grown in two typical scenarios: production for the export market, and small-scale production for local markets and as a staple. The former is characterized by high input applications and just a few banana varieties (Arias, Dankers, Liu & Pilkauskas, 2003).
Like any other crop, bananas and plantains are affected by diseases which can reduce or totally inhibit production. One of the most important is Fusarium wilt of banana, a soilborne disease caused by the fungus Fusarium oxysporum f. sp. cubense, which affects different banana cultivars (Brayford, 1992). In production for the export market, Fusarium Wilt ceased to be a problem with the change from the susceptible variety Gros Michel to the resistant Cavendish. However, many small farmers still grow susceptible varieties despite the risk that Fusarium Wilt poses to their production, mainly because most of the susceptible varieties are highly appreciated by consumers in local markets. With the appearance and spread of a new race of Fusarium Wilt, known as Tropical Race 4, which affects the Cavendish variety, the disease is regaining importance in production for the export market as well. The study of Fusarium Wilt is therefore now very important both to the banana export industry and to small farmers in developing countries.
The analysis of spatial data on Fusarium wilt of banana helps to better understand the factors that affect the incidence of the disease, which could support the design of disease assessment and management methodologies.
1.2 Problem Description
The district of San Luis de Shuaro in Peru is a zone where small farmers grow different banana varieties as their main economic activity (Román Jerí, 2012), and many of these cultivated varieties are susceptible to Fusarium wilt. Since most infected plants fail to produce a banana bunch, each diseased plant represents an economic loss to the farmer. Although it has some distinctive characteristics, the region of San Luis de Shuaro presents conditions similar to those of other regions in Latin America and the Caribbean, which in general are characterized by the diversity of their production systems, ranging from monocrop to agroforestry systems with mixed crops. The findings of the present work could provide valuable insights into this complex disease, not only in the case study region but also in others with similar conditions and characteristics.
1.3 Objectives
General Objective: To conduct spatial data analysis of Fusarium wilt incidence in the district of San Luis de Shuaro, Peru.
Specific Objectives:
a) To map the spatial distribution of the disease incidence rate of Fusarium Wilt of banana in the case study area
b) To conduct Exploratory Data Analysis, including three different tests for Spatial Autocorrelation: Moran’s I, Geary’s c and Moran’s I using Empirical Bayes Estimates
c) To model the relationship between Fusarium Wilt incidence and a set of explanatory variables using a GAMLSS (Generalised Additive Models for Location Scale and Shape)
d) To analyse the relationship of Fusarium Wilt incidence with the variables selected by the implemented GAMLSS
Research Questions
1) How could the spatial distribution of the disease incidence be represented using GIS software and cartographic techniques to provide a general description of the region in terms of incidence levels?
2) Are the sampled farms in the study area spatially autocorrelated with respect to disease incidence?
3) Which is the most reliable combination of model and probability distribution to analyse the relationship between the Fusarium Wilt incidence and a set of explanatory variables?
4) Which factors influence the Fusarium Wilt incidence in the study area?
1.4 Hypothesis
H0: There is no spatial autocorrelation of the Fusarium Wilt incidence between the sampled farms in the area of San Luis de Shuaro District.
H1: There is spatial autocorrelation of the Fusarium Wilt incidence between the sampled farms in the area of San Luis de Shuaro District.
1.5 Scope
The present work analyses the spatial distribution of the Fusarium wilt incidence using spatial statistics, to determine whether neighbouring farms are more likely to exhibit similar levels of incidence, which could lead to a better understanding of the disease dynamics. It also analyses the relationship between a set of proposed variables and the disease incidence. The study comprises information from 76 farms in the region of San Luis de Shuaro, Peru. More specific and technical details can be found in Chapter 3.
The mapping of the spatial distribution is a graphical representation of the disease incidence using cartographic techniques, with the aim of providing a general description of how the different levels of disease incidence are distributed across the study region.
Geographical location data were available only at farm level, so the spatial analysis is limited to the spatial relationship between sampled farms; the relationship between plants within each sampled farm is out of the scope of this work.
In addition to the spatial analysis per se, the results of the spatial autocorrelation tests between farms with respect to the disease incidence are a key input for analysing the relationship between the set of proposed variables and the disease incidence through a Generalised Linear Model. In this respect, the present work outlines how further analyses with similar inputs and goals could be conducted, especially by suggesting a reliable combination of regression model and probability distribution.
Finally, the study determines a set of explanatory variables which influence the disease incidence, based on the statistical results of the Generalised Linear Model. This provides valuable input for future studies focused on the most influential factors, contributing to the development of better methodologies and strategies both for the assessment of a suspected area and for the management of a confirmed infected area.
2. Literature review
This section condenses the concepts behind the development of this work, from the description of the disease analysed to the concepts of Spatial Data Analysis applied to analyse it, drawing on both scientific papers and theoretical books. A list of suggested literature is also provided to help readers interested in other works that apply Spatial Data Analysis to plant disease incidence.
2.1 Fusarium Wilt
“Panama disease of bananas is historically one of the most infamous plant diseases, destroying the banana production industries in areas of Central America where the highly susceptible banana cultivar Gros Michel predominated from ca. 1900–1955” (Brayford, 1992, p. 1).
Also known as Panama disease, Fusarium Wilt is a soilborne fungal disease which infects the plant through the roots, spreading to the xylem and causing vascular browning in the pseudostem (Brayford, 1992). Typical symptoms include vascular discolouration and yellowing of the leaves (Pérez-Vicente, Dita & Martinez de la Parte, 2014). The pathogen can survive up to 30 years in the soil in the absence of banana (Ploetz, 2006). The disease can be dispersed by infected plant material, water and soil (Pérez-Vicente et al., 2014).
As indicated by Pérez-Vicente et al. (2014), the likelihood that a susceptible banana plant infected with Fusarium wilt recovers is very low, and if it does recover its growth will be deficient. According to Ploetz (2006), the options available for managing this disease are scarce, genetic resistance being the most effective. Its importance in the global banana market was reduced in 1950, when Latin America shifted from the susceptible Gros Michel variety to the resistant Cavendish (Pérez-Vicente et al., 2014). However, susceptible varieties are still cultivated on a small scale, especially by smallholders who grow them mixed with other crops such as coffee, cacao and trees in agroforestry systems (Pérez-Vicente et al., 2014).
This situation has also affected Fusarium Wilt research, which according to Lichtemberg, Pocasangre, Staver, Dold and Sikora (2010) comprises two eras: the Gros Michel era, when efforts focused on possible origins and the epidemics in the American tropics, and the Cavendish era, when efforts focused on pathogen diversity rather than on the disease itself.
2.2 Disease Incidence
Madden, Hughes and van den Bosch (2007) define disease incidence as the proportion of plants (or plant units) diseased, that is, the number of diseased plants (or plant units) out of the total assessed. The same authors indicate that disease incidence can be measured at different scales depending on the plant units used. In this work, the plant unit is an individual plant. Within this context, “disease incidence provides an estimate of the probability of infection” (Hughes, Munkvold & Samita, 1998), and “it is the most common records contained spatial plant disease data encountered in phytopathological literature” (Madden et al., 2007).
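The definition above reduces to a simple ratio per sampling unit. The following minimal sketch illustrates it for farms as sampling units; the function name and the three pairs of counts are hypothetical and are not taken from the thesis dataset.

```python
def disease_incidence(diseased, assessed):
    """Proportion of assessed plants that are diseased (between 0 and 1)."""
    if assessed <= 0:
        raise ValueError("assessed must be positive")
    if not 0 <= diseased <= assessed:
        raise ValueError("diseased must lie between 0 and assessed")
    return diseased / assessed


# Hypothetical counts per farm: (diseased plants, plants assessed).
farms = [(12, 200), (0, 150), (45, 300)]

# Incidence of each farm taken separately.
per_farm = [disease_incidence(d, n) for d, n in farms]

# Pooled incidence over all farms: total diseased over total assessed.
pooled = disease_incidence(sum(d for d, _ in farms), sum(n for _, n in farms))
```

Note that the pooled incidence weights each farm by its number of assessed plants, so it generally differs from the plain average of the per-farm incidences.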
2.3 Spatial Data Analysis
Spatial Data Analysis is the area of Spatial Analysis where statistical techniques are developed and applied to analyse spatial data (Haining, 2003). According to Bailey and Gatrell (as cited in Pfeiffer, 1996, p. 83), methods used in spatial data analysis can be categorized as:
a) Methods for visualizing data
b) Methods for exploratory data analysis
c) Methods for development of statistical models
Following the classification of Cressie (as cited in Plant, 2012, p. 5), there are three categories of spatial data: geostatistical, areal and point pattern. As presented by Krivoruchko (2011, p. 22), these categories correspond to continuous, aggregated and discrete data respectively. Table 1 summarizes each type of data and the type of spatial data analysis that corresponds to it.
Table 1 – Categories of Spatial Data Analysis according to the spatial data type

Type of data | Spatial Data Analysis
Discrete | Point Pattern Analysis
Aggregated or Areal | Lattice or Areal
Continuous | Geostatistical
2.3.1 Areal Analysis
According to Plant (2012, p. 5), “… areal data consist of data that are defined only at a set of locations, which may be points or polygons”. The main objective of analysing areal data is to detect and explain spatial patterns, sometimes including their relationship with covariates (Pfeiffer, 1996).
2.3.2 Geostatistical Analysis
Geostatistical data are spatially continuous. The main objectives of analysing them are to describe the spatial variation of an attribute variable (Pfeiffer, 1996) and to interpolate the value of the measured attribute at points where it was not measured (Plant, 2012, p. 5).
2.3.3 Point Pattern Analysis
As shown in Table 1, point pattern analysis deals with discrete data and analyses the pattern of the recorded locations. Typically this pattern is analysed through what is called analysis of clustering (Pfeiffer, 1996).
2.3.4 Spatial Autocorrelation
“Everything is related to everything else, but near things are more related than distant things” – Tobler’s First Law of Geography (Tobler, 1970).
Tobler’s first law of geography is a recurrent citation in the spatial analysis literature and remains the best way to define spatial autocorrelation. Griffith (2009) defines spatial autocorrelation as “…the correlation among values of a single variable strictly attributable to their relatively close locational positions on a two-dimensional (2-D) surface, introducing a deviation from the independent observations assumption of classical statistics”. Miron (as cited in Plant, 2012, p. 423) presents three different sources of spatial autocorrelation in regression models:
a) Interaction
b) Reaction
c) Model misspecification
2.4 Exploratory Spatial Data Analysis
Haining and Cressie (as cited in Haining, 2003, p. 182) define Exploratory Spatial Data Analysis (ESDA) as a set of techniques to:
a) Explore spatial data
b) Summarize spatial properties of data
c) Detect spatial patterns in data
d) Formulate hypotheses related to the geography of the data
Exploratory spatial data analysis includes both visual and numerical methods. Visual methods can include histograms, Q-Q plots, boxplots and scatterplots (Haining, 2003, p. 189). Numerical methods include spatial autocorrelation tests such as Moran’s I, which explore whether the data are clustered or dispersed (Haining, 2003, p. 226).
2.5 Measures of Spatial Autocorrelation
Moran’s I and Geary’s c are both indices used to test the null hypothesis of zero spatial autocorrelation (Plant, 2012, p. 104). Their main features and differences are described in Table 2, based on Fortin, Dale and ver Hoef (2002).
Table 2 – Main Characteristics and Differences between Moran’s I and Geary’s c, based on Fortin et al. (2002)

Characteristic | Moran’s I | Geary’s c
How it is computed | Degree of correlation between values of a variable as a function of spatial lags. | Difference among …
Zero Spatial Autocorrelation | The expected value for zero autocorrelation is nearly zero, although more formally $-\frac{1}{n-1}$, with n as the number of areal units | The expected value for zero autocorrelation is 1.
Positive Spatial Autocorrelation | Values nearly 1 | Values nearly 0
Negative Spatial Autocorrelation | Values nearly -1 | Values nearly 2
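The contrast summarized in Table 2 can be illustrated numerically. The sketch below is a minimal pure-Python illustration that assumes the usual textbook forms of the two indices, with w_ij the spatial weights and S0 their sum. The function names, the toy transect of four sites and the binary contiguity weights are illustrative assumptions, not the data or the code used in this work.

```python
def morans_i(values, weights):
    """Moran's I = (n / S0) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


def gearys_c(values, weights):
    """Geary's c = ((n - 1) / (2 S0)) * sum_ij w_ij (x_i - x_j)^2 / sum_i (x_i - m)^2."""
    n = len(values)
    mean = sum(values) / n
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * (values[i] - values[j]) ** 2
              for i in range(n) for j in range(n))
    den = sum((v - mean) ** 2 for v in values)
    return ((n - 1) / (2 * s0)) * num / den


# Four sites along a line; each site neighbours the adjacent ones (binary weights).
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

trend = [1.0, 2.0, 3.0, 4.0]        # similar values next to each other
alternating = [1.0, 4.0, 1.0, 4.0]  # dissimilar values next to each other

i_trend, c_trend = morans_i(trend, w), gearys_c(trend, w)
i_alt, c_alt = morans_i(alternating, w), gearys_c(alternating, w)
```

For the smooth trend, Moran’s I is positive and Geary’s c is below 1; for the alternating pattern, Moran’s I is negative and Geary’s c is above 1, which matches the directions listed in Table 2.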
2.6 Moran’s I using Empirical Bayes Estimates
The traditional calculation of Moran's I for disease cases does not account for population heterogeneity, so that, its application to disease rates or proportions may result in indication of spatial correlation that is completely due to the spatial proximity of population sizes, but not due to the similarity among the disease rates. (Jackson, Huang, Xie & Tiwari, 2010)
Assunção and Reis (1999) propose an Empirical Bayes modification of Moran’s I for rates calculated from populations of different sizes, which is the case for the data analysed in the present work.
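To make the idea concrete, the sketch below standardizes each rate by an estimate of its own variance before any autocorrelation index is computed. It assumes the standardization takes the commonly described form z_i = (p_i - b) / sqrt(v_i), with b the overall rate and v_i = a + b / n_i. The function name and the example counts are hypothetical, and the code is an illustration of the idea rather than the procedure applied in this work.

```python
import math


def eb_standardise(cases, population):
    """Empirical Bayes standardized rates z_i = (p_i - b) / sqrt(v_i).

    cases[i] is the number of diseased plants and population[i] the number
    of plants assessed in unit i (assumed positive).
    """
    m = len(cases)
    total_cases = sum(cases)
    total_pop = sum(population)
    if total_cases <= 0:
        raise ValueError("at least one case is needed to standardize rates")
    rates = [c / n for c, n in zip(cases, population)]
    b = total_cases / total_pop                      # overall rate
    s2 = sum(n * (p - b) ** 2
             for n, p in zip(population, rates)) / total_pop
    a = s2 - b / (total_pop / m)                     # between-unit variance term
    z = []
    for p, n in zip(rates, population):
        v = a + b / n
        if v < 0:                                    # fall back to the sampling variance
            v = b / n
        z.append((p - b) / math.sqrt(v))
    return z


# Hypothetical farms: diseased plants and plants assessed.
z_scores = eb_standardise([5, 10, 0, 20], [100, 100, 50, 200])

# Units with identical rates carry no signal after standardization.
z_equal = eb_standardise([10, 20], [100, 200])
```

The standardized values can then be passed to an ordinary Moran’s I computation, so that differences in the number of plants assessed per farm do not by themselves produce apparent spatial correlation.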
2.7 Semivariogram
From the geostatistical point of view, spatial autocorrelation is tested through the semivariogram (Nelson, Orum, Jaime-Garcia & Nadem, 1999). The semivariogram describes how dissimilar two measurements are as a function of the separation between their locations, which is called the spatial lag h, using the following function (Plant, 2012, p. 117):

$$\gamma(h) = \frac{1}{2}\,\mathrm{var}\left[Y(x + h) - Y(x)\right]$$
The semivariogram is commonly represented as a plot of the semivariance as a function of distance (Bivand, Pebesma & Gomez-Rubio, 2008). Rather than using the function presented above, the semivariogram is commonly estimated through the experimental semivariogram, which is given by the following function (Plant, 2012, p. 118):

$$\hat{\gamma}(h) = \frac{1}{2m(h)} \sum_{i=1}^{m(h)} \left[Y(x_i + h) - Y(x_i)\right]^2$$

where m(h) is the number of pairs of locations separated by the lag h.
Bachmaier and Backes (2008) clarify the terms variogram and semivariogram and the differences between them.
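The experimental semivariogram can be sketched in a few lines. The version below is a minimal illustration that groups pairs of locations into distance classes of equal width, which is an assumption made here because irregularly located samples rarely share exactly the same lag; the function name, the class width and the toy transect are hypothetical.

```python
import math


def experimental_semivariogram(coords, values, lag_width, n_lags):
    """Half the mean squared difference between pairs, per distance class.

    Class k holds the pairs whose separation d satisfies
    k * lag_width <= d < (k + 1) * lag_width. Empty classes return None.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):          # each unordered pair once
            d = math.dist(coords[i], coords[j])
            k = int(d // lag_width)
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    gamma = [s / (2 * c) if c else None for s, c in zip(sums, counts)]
    return gamma, counts


# Toy transect: four sites one unit apart with a linear trend in the values.
xy = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
y = [1.0, 2.0, 3.0, 4.0]
gamma, counts = experimental_semivariogram(xy, y, lag_width=1.0, n_lags=3)
```

In this toy case the semivariance grows with distance, which is the pattern expected when nearby measurements are more alike than distant ones.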
2.8 Thiessen Polygons
Also known as Proximity Polygons or Voronoi Maps (O’Sullivan & Unwin, 2010, p. 50), Thiessen Polygons are considered by Plant (2012, p. 163) a useful interpolation method when locations are too sparse and irregular. They are areas calculated geometrically from a set of points, and only the distance to each point is taken into account to construct them. More formally, according to O’Sullivan and Unwin (2010, p. 50), the polygon of any entity “is that region of the space which is closer to the entity than it is to any other”. Although they are useful, as pointed out by Plant, their limitations should be taken into account and exaggerated conclusions should not be drawn from the resulting polygons alone.
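Used as an interpolator, the definition above amounts to giving every location the value measured at its nearest sampled point. The following sketch illustrates that rule without constructing the polygon geometry; the function names, coordinates and incidence values are hypothetical.

```python
import math


def nearest_site(point, sites):
    """Index of the site closest to `point` (ties go to the lowest index)."""
    return min(range(len(sites)), key=lambda i: math.dist(point, sites[i]))


def thiessen_estimate(point, sites, values):
    """Value of the Thiessen polygon that contains `point`."""
    return values[nearest_site(point, sites)]


# Hypothetical sampled farms and their measured incidence.
farm_xy = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
farm_incidence = [0.05, 0.20, 0.00]

# Unsampled locations receive the incidence of the nearest farm.
queries = [(2.0, 1.0), (8.0, 3.0), (1.0, 9.0)]
estimates = [thiessen_estimate(q, farm_xy, farm_incidence) for q in queries]
```

The estimate changes abruptly at polygon boundaries and ignores every farm except the nearest one, which is one reason why conclusions drawn from the polygons alone should remain cautious.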
2.9 Spatial Data Analysis Applied to Plant Disease Incidence
Bivand et al. (2008, p. 311) indicate that “displaying the spatial variation of the incidence of a disease can help us to detect areas where the disease is particularly prevalent, which may lead to the detection of previously unknown risk factors”. Although this statement was made from the point of view of human health, it is also valid to some extent for plant disease epidemiology. When studying a plant disease, both the spatial location and the disease status of the sampling units must be known in order to conduct a spatial autocorrelation analysis (Madden et al., 2007).
A considerable number of studies on spatial data analysis applied to plant disease were found in the literature. Nelson et al. (1999) present some applications of GIS and geostatistics to plant disease epidemiology. Selvaraja, Balassundram, Vadamalai and Husni (2012) apply geostatistics to analyse the spatial variability of the Orange Spotting Disease in oil palm. Talei, Safaie and Aghajani (2013) studied the spatial distribution of Soybean Charcoal Rot incidence using geostatistics, more specifically interpolation by ordinary kriging. Alves and Pozza (2010), on the other hand, propose the use of indicator kriging to study the spatial variability of common bean anthracnose. Del Ponte, Shah and Bergstrom (2003) analysed the spatial patterns of Fusarium head blight using the index of dispersion through a beta-binomial distribution. That index does not involve the spatial location of the sampled data, but according to this literature review it is a common methodology in plant pathology for estimating aggregation or dispersion patterns; relevant works are Hughes and Madden (1993), Madden and Hughes (1994) and Hughes, Madden and Munkvold (1996). Nelson, Felix-Gastelum, Orum, Stowell and Myers (1994) apply geostatistics to design and validate the regional plant virus management programs in the Del Fuerte Valley, located in Sinaloa, Mexico. Guzmán-Plazola, Gómez-Pauza, García-Espinosa and Gavi-Reyes (2004) applied geostatistics to interpolate the spatial distribution of Fusarium solani f. sp. phaseoli, which is the cause of root rot in common bean. Musoli et al. (2008) conducted a spatial and temporal analysis of coffee wilt disease, which is caused by Fusarium xylarioides, also using a geostatistical approach. Oerke, Meier, Dehne, Sulyok, Krska and Steiner (2010) analysed the spatial variability of Fusarium head blight pathogens in wheat crops using the Spatial Analysis by Distance IndicES (SADIE) and Lloyd’s index of patchiness.
No studies applying Spatial Data Analysis to the incidence of Fusarium Wilt of banana were found when searching for previous work with similar objectives.
Although this is not an undebatable fact, it could lead to infers that these kinds of studies are too scarce. Lichtemberg et al. (2010) analysed the Fusarium Wilt incidence at smallholder level in Nicaragua, using the classical statistics methods like Pearson Correlation and ANOVA. Plotting of the farms and comparison of two different zones
were also conducted in that work. Román Jerí (2012) conducted a study in which geographical characteristics were considered when targeting the farms to be analysed, after which the spatial distribution of Fusarium Wilt incidence was plotted for graphical representation. Although the study of Román Jerí (2012) does not include a strong spatial data analysis component, the data collected in it are the basis of the present work.
As Schabenger and Pierce (as cited in Madden et al., 2007, p. 15) indicate, “…disease incidence is a count with a natural denominator, which could be converted into proportions”. This is important to take into account when selecting the appropriate type of statistical analysis (Madden et al., 2007). For example, Madden and Hughes (1994) indicate that distributions such as the Poisson and the negative binomial are generally inappropriate for analysing disease incidence (rate) data, and propose the beta-binomial distribution instead. Krivoruchko (2011) points out that typical indices used to measure spatial autocorrelation, such as Moran’s I and Geary’s c, are commonly applied to rates even though these indices assume that the mean and variance of the data are constant, conditions that are rarely met in rate data such as disease incidence. Paulitz, Zhang and Cook (2003) applied what they call a Spatial Generalised Mixed Model to account for spatial autocorrelation in a spatial point pattern analysis and also to interpolate disease incidence rates. However, this methodology should be approached with caution because of the high level of complexity of a Generalised Linear Mixed Model (GLMM), which is a challenge even for statisticians (Bolker et al., 2009).
2.10 Linear Regression Model and some considerations about it
Linear Regression Models are frequently used in statistical analysis (Kongchouy, Choonpradub & Kuning, 2010). They help researchers to explore the relationship between variables and to assess how well a set of independent variables predicts a dependent variable (Urdan, 2010, p. 145). Zuur, Ieno, Walker, Saveliev and Smith (2009, p. 17) call linear regression “the mother of all models”. However, like any other model, it has limitations that should be taken into account in order to apply it
correctly. The following equation is reproduced from Zuur et al. (2009, p. 17) and shows the linear regression model.
$$Y_i = \alpha + \beta X_i + \varepsilon_i, \quad \text{where } \varepsilon_i \sim N(0, \sigma^2)$$
Following the explanation of Zuur et al. (2009, p. 17), $Y_i$ is the response or dependent variable and $X_i$ is the explanatory or independent variable. The information that is not explained by the model is captured by the residuals, represented in the equation by $\varepsilon_i$, while $\alpha$ and $\beta$ represent the population intercept and slope respectively; both are unknown parameters. According to Zuur et al. (2009, p. 19), five assumptions should be considered in order to apply linear regression correctly:
1. Normality: Linear Regression assumes that the data have a normal distribution. In this sense, normality means that a plot of the frequency of the cases (on the y axis) against the score of the variable of interest (on the x axis) exhibits a bell shaped curve (Urdan, 2010, p. 10).
2. Homogeneity: The homogeneity assumption means that the spread of the data should be the same at each X value. When this condition is not met, it is called heteroskedasticity (Bivand et al., 2008, p. 274) or heterogeneity (Zuur et al., 2009, p. 20).
3. Fixed X: This assumption means that the explanatory variables are deterministic and not random variables (Zuur et al., 2009, p. 21).
4. Independence: According to Zuur et al. (2009, p. 21), independence means that the Y value at $X_i$ is not influenced by other $X_i$, and its violation causes the most serious problems. Zuur et al. also point out that there are two ways to violate the independence assumption: by applying an improper model, or through a dependence structure due to the nature of the data. A dependence structure can be temporal or spatial, the latter being of special interest in the present work.
5. Correct Model Specification: This means that a correct selection of explanatory variables is assumed.
Authors differ on how to proceed when one of these assumptions is violated. For example, when the normality assumption is violated, some apply a transformation to the data, such as the logarithmic transformation, to try to obtain the desired normal distribution. At the other extreme are those who prefer to switch to another model without applying any transformation to the data. Other models include Generalised Linear Models, Generalised Additive Models, Generalised Least Squares, etc. Each of these tackles violations of the linear regression assumptions in a different way, depending on which assumptions are violated. The approach selected for the present work follows the suggestion of Zuur et al. (2009, p. 19): “Always apply the simplest statistical technique on your data, but ensure it is applied correctly”.
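As a minimal illustration of the linear regression model described in this section, the following Python sketch estimates the intercept and slope by ordinary least squares and returns the residuals. The function name and the data values are hypothetical and are not part of the analysis of the present work, which was carried out in R.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return least-squares estimates (alpha, beta) and the residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx                  # estimated slope
    alpha = y_bar - beta * x_bar      # estimated intercept
    residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    return alpha, beta, residuals


# Hypothetical values, for illustration only (not data from this study).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
alpha, beta, residuals = fit_simple_linear_regression(x, y)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
```

Inspecting the returned residuals (for example with a histogram, or plotted against X) is one way to check the normality and homogeneity assumptions listed above.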
2.11 Generalised Linear Models
When the analysed data do not fulfil the requirements of the Linear Regression Model, Generalised Linear Models (GLMs) are the most convenient solution (Plant, 2012, p. 301). A GLM consists of three distinct parts (Crawley, 2007, p. 512):
a) The error structure
b) The linear predictor
c) The link function
The error structure refers to the type of distribution of the error in the analysed data, which can also be seen as the distribution of the response variable. Instead of applying a transformation when the analysed data have a non-normal distribution, a GLM allows different types of distribution, such as binomial or Poisson, to be specified (Crawley, 2007, p. 512). The linear predictor is also called the systematic part (Zuur et al., 2009, p. 210) and in general terms is the set of explanatory variables expressed as a function. Finally, the link function relates the systematic part to the mean of the response variable. The implementation of a GLM consists of three steps (Zuur et al., 2009, p. 210), which coincide with the three parts presented by Crawley (2007, p. 512):
a) Assume a distribution for the response variable
b) Specify the systematic part (the explanatory variables)
c) Specify the link function
However, an undesirable but typical characteristic of disease incidence data is overdispersion, which occurs when the observed variability is greater than that predicted by the assumed distribution (Garret, Madden, Hughes & Pfender, 2004). If overdispersion is not taken into account, it invalidates the statistical inference obtained from the model (Guimarães, 2005). Approaches to account for overdispersion include the use of a maximum quasi-likelihood method instead of maximum likelihood, and the use of discrete distributions such as the negative binomial and the beta-binomial (Garret et al. 2004).
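To illustrate why a beta-binomial distribution can accommodate overdispersion, the following Python sketch compares the variance of a binomial count with that of a beta-binomial count with the same mean. It assumes the shape-parameter form of the beta-binomial (parameters a and b), which is not necessarily the parameterisation used by any particular R package; all function names and numbers are hypothetical.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def beta_binomial_pmf(k, n, a, b):
    """P(K = k) for n plants when the disease probability is Beta(a, b)."""
    return exp(log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b))


def binomial_variance(n, p):
    return n * p * (1 - p)


def beta_binomial_variance(n, a, b):
    p = a / (a + b)               # mean disease probability
    rho = 1.0 / (a + b + 1.0)     # intra-farm correlation
    return n * p * (1 - p) * (1 + (n - 1) * rho)


# Hypothetical farm with 20 evaluated plants and mean incidence 0.25.
n, a, b = 20, 2.0, 6.0
print("binomial variance:     ", binomial_variance(n, a / (a + b)))
print("beta-binomial variance:", beta_binomial_variance(n, a, b))
```

For any n greater than 1 the beta-binomial variance exceeds the binomial variance by the factor 1 + (n - 1)ρ, which is the extra variability that a plain binomial GLM cannot represent.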
2.12 GAMLSS (Generalised Additive Models for Location Scale and Shape)
According to Stasinopoulos and Rigby (2007), GAMLSS are semi-parametric regression type models, mainly because they require a parametric distribution assumption for the response variable and can use non-parametric smoothing functions to model the parameters of the distribution. Just as a GLM is a solution for cases that cannot be solved with Linear Regression, a GAMLSS is the proposed solution for cases that cannot be solved with a GLM. Most of these are cases in which the response variable does not follow an exponential family distribution (Stasinopoulos & Rigby, 2007). More details on how it was used in the present work are presented in the next chapter.
Concluding Remarks on Literature Review
The fact that no previous work analysing Fusarium Wilt incidence from the Spatial Data Analysis perspective was found gives special relevance to this work, which also outlines considerations for future studies. Detailed information about how the concepts presented here were applied to the analysis of Fusarium Wilt incidence from the point of view of Spatial Data Analysis is provided in the next chapter.
3 Methodology
The proposed methodology combines different methods and techniques applied by the authors presented in the literature review; the specific details are explained in the present chapter. The first part consists of a description of the study area, including its elevation, major roads and rivers. The second section presents the data preparation process, focusing on how the data are treated in the present work. The third section consists of a graphical representation of the disease incidence rate for each farm in the study area. Since no areal boundaries were available for the sampled farms, each of them was represented by a point. The fourth section deals with the exploration of the data using both graphical and numerical methods, the latter including tests for spatial autocorrelation with different methods such as Moran’s I and Geary’s c. The Empirical Bayes Estimate proposed by Assunção and Reis (1999) to improve the Moran’s I calculation was also used to test for spatial autocorrelation, mainly because of the heterogeneity in the number of plants sampled on each farm. Finally, the last section explains why GAMLSS was selected and how it was used to model the relationship between Fusarium Wilt incidence and variables such as soil pH, altitude, farm size, slope, farm type, banana variety, plant density and Soil Quality Index.
3.1 Case study area: San Luis de Shuaro
According to the Instituto Nacional de Estadística e Informática – INEI (as cited in Román Jerí, 2012, p. 25), the district of San Luis de Shuaro is located in Peru, in the province of Chanchamayo, department of Junín. It is 187 km from Lima, the capital of Peru, and agriculture is its main economic activity. Figure 1 shows the location of the San Luis de Shuaro district within Peru.
Figure 1 – Location of San Luis de Shuaro District (David Brown, ArcGIS for Desktop 10.2)
Although the study was focused on the San Luis de Shuaro district, it includes farms from outside the official boundaries because of different criteria applied by Román Jerí (2012). According to the hole-filled seamless SRTM data (Jarvis, Reuter, Nelson & Guevara, 2008), altitude in the study area ranges from 597 to 2021 metres above sea level, as shown in Figure 2.
To obtain the elevation of the study area, the area was delimited by drawing a rectangle that includes all the analysed spatial points, each representing a farm. The SRTM elevation data were then extracted using the rectangle as a mask. This elevation model was used only as a descriptive resource for the study area. The elevation attribute of each analysed farm was taken from the data collected with the handheld GPS.
Figure 2 – Elevation of the study area (David Brown, ArcGIS for Desktop 10.2)
As can be observed in Figure 3, the San Luis de Shuaro district is divided in two by a river, which also separates the 76 farms into two main groups. The district is likewise divided in two by two major roads, which cross it near that river.
Figure 3 – Location of banana farms in the study area (David Brown, ArcGIS for Desktop 10.2)
3.2 Workflow diagram
The workflow diagram presented in Figure 4 shows the processes carried out and the results obtained during the development of the present work.
[Workflow diagram. Boxes: Literature review; Data preparation; Exploratory Spatial Data Analysis; Mapping the Spatial Distribution of Fusarium Wilt; Thiessen Polygons; Test Spatial Autocorrelation; Generalised Linear Model; Test Overdispersion; Generalised Additive Model for Location Scale and Shape; Model Validation]
Figure 4 – Workflow diagram (David Brown, Microsoft Word 2010)
3.3 Data preparation
The data collected by Román Jerí (2012) were stored in several spreadsheets containing the location of the assessed farms, and are part of the Bioversity International datasets collection. Location was recorded in UTM format with a handheld GPS (Garmin eTrex Vista HCx) at the place in each assessed farm where the first symptomatic plant was found (Román Jerí, 2012, p. 32). A main dataset was consolidated in a shapefile so that it could be easily manipulated in ArcGIS and in the R Language and Environment for Statistical Computing (R Core Team, 2014) through the package maptools (Bivand & Lewin-Koh, 2014). The shapefile was generated from earlier files in XLSX format (Microsoft Office) containing the spatial location in UTM coordinates, along with the variables listed in Table 3. Although disease incidence was contained in the raw data provided by Román Jerí (2012), it was verified and recalculated from the available counts of total diseased plants and total assessed plants for each farm. After this revision of all the records, corresponding to 76 farms, nine records differed from the original incidence rate, and four of these had zero disease incidence in the recalculated rate. The following projection was used, as it was found to be the most appropriate according to the UTM Grid Zones of the World compiled by Morton (2014).
Projected Coordinate System: WGS_1984_UTM_Zone_18S
Projection: Transverse Mercator
Geographic Coordinate System: GCS_WGS_1984
Datum: D_WGS_1984
Prime Meridian: Greenwich
Angular Unit: Degree
Table 3 – Variables included in the dataset consolidated from the raw data provided by Román Jerí (2012)
Variable              Description
Altitude              Altitude measured by the GPS unit
Slope                 Inclination measured with a clinometer
Farm Type             Monocrop, Mixed crops, Agroforestry System, Backyard
Planting density      Plants per hectare
Shade percentage      Shade percentage measured with a spherical densitometer
pH                    pH measured with soil tester KCB-300
Soil Quality Index    Modified from Altieri and Nicholls (2002)
Farm Size             Farm area in hectares
Variety               “Isla”, “Seda”, “Mixed”
Since the intention of the present work is not to repeat the analysis conducted by Román Jerí (2012) but to analyse the data from the point of view of Spatial Data Analysis, only a portion of the information available in the original dataset consolidated by Román Jerí (2012) was used, and some modifications were made. For example, the Soil Quality Index, which is a modified version of the one proposed by Altieri and Nicholls (2002), was not calculated by Román Jerí, although the individual variables that compose the index were used in his analysis. The main difference between the index calculated here and the one proposed by Altieri and Nicholls (2002) is that two variables (humidity retention and water infiltration) were not included in the present work, mainly because they were not available. Although the raw data were provided by Román Jerí (2012), the reader should be aware of the different objectives of each work and interpret the results of each from its own point of view.
3.4 Proposed Approach to analyse spatial data
As stated by Krivoruchko (2011, p. 22), using methods intended for geostatistical data to analyse areal or point pattern data will produce erroneous results. However, according to Plant (2012, p. 5), on some occasions data can be treated as geostatistical for one analysis and as areal for another. According to Fortin et al. (2002), the selection of spatial statistical methods can be based on the research objectives or on the type of measured data.
The data used in the present analysis represent the incidence measured on each farm, in the form:
$$DI = \frac{TD}{TE} \times 100$$
where
DI = Disease Incidence
TD = Total of Diseased Plants
TE = Total of Evaluated Plants
The location is represented by a point, which corresponds to the location of the first symptomatic plant found on the farm (Román Jerí, 2012, p. 32). Although some considerations should be taken into account for future sampling designs, conditions and constraints of this kind are very common in collected datasets, and dealing with them is one of the additional motivations for this work. This arrangement of spatial location and incidence rate for each farm has the particularity that one point is used to represent an area, which raises a methodological question: which spatial data analysis method should be used? As the disease incidence rate represents an aggregation of values for each farm, intuition indicates that the data should be treated as areas; however, polygons for the farms were not available. When boundaries are not available for the areas being analysed, a point can be used to define their location and methods for point objects can then be applied (Haining, 2003, p. 81). The possibility of using a point to represent each analysed area is also supported by Bivand et al. (2008, p. 244). Therefore, the approach taken is to conduct the spatial data analysis using methods for areal data.
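The disease incidence calculation defined in this section can be sketched in a few lines of Python. The function name, the record layout and the counts below are hypothetical and serve only to illustrate how the rate is recalculated from the two counts for each farm.

```python
def disease_incidence(total_diseased, total_evaluated):
    """Disease incidence (DI) in percent from the two plant counts."""
    if total_evaluated <= 0:
        raise ValueError("total_evaluated must be positive")
    if not 0 <= total_diseased <= total_evaluated:
        raise ValueError("total_diseased must lie between 0 and total_evaluated")
    return 100.0 * total_diseased / total_evaluated


# Hypothetical farm records: (farm id, diseased plants TD, evaluated plants TE).
records = [("farm_a", 0, 40), ("farm_b", 5, 50), ("farm_c", 12, 30)]
for farm_id, td, te in records:
    print(farm_id, round(disease_incidence(td, te), 1))
```

Keeping the counts TD and TE alongside the percentage matters for the later steps, because both the Empirical Bayes adjustment and the binomial type models work with the counts rather than with the rate alone.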
3.5 Mapping the Spatial Distribution of Disease Incidence
3.5.1 Points Map
The map of the distribution of Fusarium Wilt incidence of banana in the study area was produced using the location of the 76 farms surveyed by Román Jerí (2012). A classification in six classes and a colour palette from blue to red (corresponding to low and high incidence respectively) were used to represent the measured disease incidence distribution in the study area. This classification is shown in Figure 5.
Figure 5 – Symbols and classification utilised to map the distribution of the measured disease incidence (David Brown, ArcGIS for Desktop 10.2)
3.5.2 Thiessen Polygons
As explained in the literature review, Thiessen Polygons are geometrical constructions built from points. Taking into account that Thiessen Polygons do not produce a prediction map but a representation of the proximity area of each point object, a graphical areal representation was produced with this technique from the points that represent each sampled farm, applying the same colours and classification of disease incidence as in the point representation. Although it is possible to use Areal Interpolation as implemented in the Geostatistical Analyst of ArcGIS (Krivoruchko, Gribov & Krause, 2011) with the Thiessen Polygons as input layer, the production of a prediction map was avoided in the present work, to prevent misinterpretation of the technique and methodology and because it could be an abuse of this useful tool.
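The proximity property that defines Thiessen Polygons can be illustrated with a short Python sketch: any location belongs to the polygon of the sampled point nearest to it. The sketch does not construct polygon geometries as a GIS does; the function name, farm identifiers and coordinates are hypothetical.

```python
from math import hypot


def nearest_farm(location, farm_points):
    """Return the id of the farm point closest to `location`.

    The location lies inside the Thiessen polygon of the returned farm.
    """
    x, y = location
    return min(
        farm_points,
        key=lambda farm_id: hypot(farm_points[farm_id][0] - x,
                                  farm_points[farm_id][1] - y),
    )


# Hypothetical projected coordinates in metres.
farm_points = {"F1": (0.0, 0.0), "F2": (1000.0, 0.0), "F3": (0.0, 1000.0)}
print(nearest_farm((900.0, 200.0), farm_points))
```

Colouring every location by the incidence class of its nearest farm reproduces the kind of areal representation described above, which is a proximity map and not a prediction surface.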
3.6 Data Exploration
Exploration of the data is divided into two steps:
a) Visual inspection of the data distribution
b) Test for spatial autocorrelation
Before exploring the data, it is worth explaining why it is important to check whether the data exhibit a normal distribution and whether spatial autocorrelation is present. As explained in the literature review, Linear Regression assumes that the data have a normal distribution, which takes a bell shape when plotted. The main reason to explore the data is to check whether their distribution corresponds to a normal distribution or to another kind of distribution, such as Poisson or binomial. As the data analysed in the present work are proportions, a non-normal distribution is to be expected. If this is the case, ordinary linear regression should be avoided, and other models such as the GLM or GAM are possible solutions. Testing for spatial autocorrelation is an important part of the exploratory data analysis for two main reasons:
1) If positive spatial autocorrelation is present, nearby farms will have similar scores for disease incidence and a clustered pattern is likely. Negative spatial autocorrelation means that nearby farms have dissimilar scores for disease incidence and the pattern of distribution is dispersed. Finally, if the null hypothesis of zero spatial autocorrelation is not rejected, the disease incidence is consistent with a random spatial pattern. If positive or negative spatial autocorrelation is present in the variable of interest, in this case the disease incidence, there is a reason for it, and explaining that reason helps to understand the underlying process.
2) If present, spatial autocorrelation violates the independence assumption required by the Linear Regression Model and by other models such as the GLM, so it must be accounted for in the proposed model.
3.6.1 Histogram
One of the easiest ways to see the shape of the data distribution is the histogram, which is defined by Dalgaard (2008, p. 71) as: “…, a count of how many observations fall within specified divisions (“bins”) of the x-axis”. Along with the histogram, two indicators help to determine the shape of a distribution: skew and kurtosis (Urdan, 2010, p. 31). The skew indicates whether the distribution is positively or negatively skewed, meaning that it has an elongated tail at the higher end in the first case, or at the lower end in the second case (Urdan, 2010, p. 31). Kurtosis, on the other hand, tells whether the distribution is flatter than a normal distribution, in which case it is called platykurtic, or has a higher peak than a normal distribution, in which case it is called leptokurtic (Urdan, 2010, p. 31). As a rule of thumb, the skewness should ideally be 0 and the kurtosis should be 3 for normally distributed data (NIST/SEMATECH, 2014).
3.6.2 Normal Q-Q Plot
Another way to check the data visually for normality is the Quantile-Quantile Plot (Q-Q plot). To understand the Q-Q plot, the empirical cumulative distribution function (c.d.f.) should first be defined. With x being the analysed variable, the c.d.f. is defined by Dalgaard (2008, p. 73) as “the fraction of data smaller than or equal to x”. The Q-Q plot shows the “kth smallest observation against the expected value of the kth smallest observation out of n in a standard” normal distribution (Dalgaard, 2008, p. 73). In practice, a straight line is expected in the Q-Q plot for normally distributed data (Dalgaard, 2008, p. 73).
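The shape indicators of Section 3.6.1 and the construction of the Q-Q plot can be sketched in Python as follows. The moment-based formulas and the plotting position (k - 0.5)/n, used here to approximate the expected order statistics, are assumptions of this sketch and may differ from the defaults of statistical packages; the function names and data values are hypothetical.

```python
from statistics import NormalDist, mean


def _central_moment(values, order):
    m = mean(values)
    return sum((v - m) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness; 0 for a symmetric distribution."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Moment-based (non-excess) kurtosis; 3 for a normal distribution."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2


def normal_qq_points(values):
    """Pairs (theoretical normal quantile, ordered observation)."""
    n = len(values)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return list(zip(theoretical, sorted(values)))


# Hypothetical incidence rates (percent) with many zeros and a long right tail.
incidence = [0.0, 0.0, 0.0, 2.5, 3.0, 5.0, 8.0, 12.5, 20.0, 45.0]
print("skewness:", round(skewness(incidence), 2))
print("kurtosis:", round(kurtosis(incidence), 2))
for theoretical, observed in normal_qq_points(incidence):
    print(round(theoretical, 2), observed)
```

If the printed pairs are plotted, a marked departure from a straight line indicates that the normality assumption is doubtful for the variable.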
3.6.3 Spatial autocorrelation tests
Spatial autocorrelation was tested using Moran’s I and Geary’s c. An additional Moran’s I test was applied using the methodology of Assunção and Reis (1999). All the R Language (R Core Team, 2014) code used to compute the spatial autocorrelation tests is included in Annex 1. According to Griffith (2009), when spatial autocorrelation is detected it falls into one of the following categories:
1) Strong Positive Spatial Autocorrelation: present in data such as remote sensing data, and not very common in the majority of cases
2) Moderate Positive Spatial Autocorrelation: the most common type of spatial autocorrelation
3) Moderate Negative Spatial Autocorrelation: not so common, and typically associated with geographic competition
Moran’s I is one of the most commonly used statistics to test the null hypothesis of zero spatial autocorrelation (Plant, 2012, p. 104) and was selected to compute the spatial autocorrelation of Fusarium Wilt incidence as areal data. Moran’s I and all the necessary calculations were computed in the R Language (R Core Team, 2014) with functions included in the spdep package (Bivand, 2014). Geary’s c is another statistic that also tests the null hypothesis of zero spatial autocorrelation. Griffith and Lane (as cited in Plant, 2012, p. 106) conclude that Moran’s I is generally preferred over Geary’s c, but that computing Geary’s c to corroborate the Moran’s I results is desirable. That is the main reason why Geary’s c was also calculated in the present work. The Empirical Bayes Estimate proposed by Assunção and Reis (1999) as a way to improve Moran’s I was included mainly because authors such as Jackson et al. (2010), Assunção and Reis (1999) and Tsai (2012) affirm that the Moran’s I test does not work very well for rates calculated from populations of different sizes, as is the case for the disease incidence treated in the present work.
3.6.3.1 Neighbour List and Spatial Weights
Before calculating any of the statistics used to test for spatial autocorrelation, two previous steps are required. First, the relation between the spatial objects should be defined using a neighbour criterion (Bivand et al., 2008, p. 239). After defining which objects are related as neighbours, a spatial weight should be assigned to each link (Bivand et al., 2008, p. 251). The appropriate method should be selected according to the type of spatial object used to model the spatial relationship, polygons or points. For points, as in the analysed data, two common methods are available to construct the neighbour list: the k nearest neighbour method and the distance based method. Other options for creating a neighbour list exist, Delaunay triangulation being one of them. Different methods for defining neighbours are treated in more detail by Haining (2003, p. 80) and Bivand et al. (2008, p. 240). The k nearest neighbour method selects the k nearest neighbours of each point (Plant, 2012, p. 90), where k is a parameter to be provided. For example, if a k value of 2 is provided, the method produces a neighbour list in which each point has 2 neighbours. The distance based method selects the neighbours of each point using a distance threshold defined by two parameters, a minimum and a maximum bound (Bivand et al., 2008, p. 247). The spatial weights are assigned to each neighbour link according to different styles, the row standardised style being the recommended one if not much is known about the analysed spatial process (Bivand et al., 2008, p. 251). In the R Language (R Core Team, 2014), a k nearest neighbour list can be constructed with the function knn2nb along with the function knearneigh, both included in the package spdep (Bivand, 2014). For the distance based method, the function dnearneigh can be used to construct the neighbour list. The spatial weights list is constructed with the function nb2listw, also provided by the package spdep (Bivand, 2014).
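The two steps described above, building a k nearest neighbour list and assigning row standardised weights, can be sketched in plain Python. This is only an illustration of the logic and not the spdep implementation; the function names and coordinates are hypothetical.

```python
from math import hypot


def k_nearest_neighbours(points, k):
    """For each point, return the indices of its k nearest other points."""
    neighbours = []
    for i, (xi, yi) in enumerate(points):
        distances = sorted(
            (hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        neighbours.append([j for _, j in distances[:k]])
    return neighbours


def row_standardised_weights(neighbours):
    """Give each link of point i the weight 1 / (number of neighbours of i)."""
    return [{j: 1.0 / len(nb) for j in nb} for nb in neighbours]


# Hypothetical projected coordinates in metres.
points = [(0.0, 0.0), (100.0, 0.0), (300.0, 0.0), (700.0, 0.0)]
neighbours = k_nearest_neighbours(points, k=2)
weights = row_standardised_weights(neighbours)
print(neighbours)
print(weights)
```

Note that the k nearest neighbour relation is not necessarily symmetric: the last point has the third point as a neighbour, but the third point does not list the last point among its two nearest neighbours.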
Having clarified the steps required before calculating the spatial autocorrelation statistics, a new question arises: which threshold should be defined for the distance based method, or which k should be used for the k nearest neighbour method? More than that, which method should be used? According to O’Sullivan and Unwin (2010, p. 205), when the analysed process is not well understood, defining the spatial structure and assigning the weights is a difficult process. Haining (2003, p. 81) suggests that if additional information about the analysed process is available, it should be used to define the linkages, rather than defining them only by geometrical or spatial criteria. There are no magical recipes for selecting the neighbouring method, and in different theoretical books knowledge about the studied process is presented as the key input to resolve this problem. Although applied to a different case and with a specific implementation, the work of Souris and Bichaud (2011) suggests that the k nearest neighbour method could be appropriate for epidemiological studies. Therefore, the k nearest neighbour method was selected for the present work, although to support this decision a set of comparisons was conducted against three other methods: Delaunay triangulation, Sphere of Influence and distance based. Delaunay triangulation and Sphere of Influence are graph based methods; their main difference is that the first defines the neighbours by triangulation, while the latter uses circles with a radius equal to the distance from each point to its nearest neighbour (Bivand et al., 2008, p. 245). For the distance based method, which requires a distance threshold to be defined, the approach proposed by Anselin (2003) was followed. The lower bound is set to 0 and the upper bound is set to the maximum distance needed to ensure that each point has at least one neighbour. This is achieved by extracting the maximum distance obtained after applying the k nearest neighbour method with k = 1.
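The rule described above for choosing the upper distance bound can be sketched in Python: the threshold is the largest of the nearest neighbour distances, so that no point is left without a neighbour. Including both bounds in the distance band is an assumption of this sketch; the function names and coordinates are hypothetical.

```python
from math import hypot


def threshold_for_one_neighbour(points):
    """Largest nearest-neighbour distance (k = 1) over all points."""
    return max(
        min(hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i)
        for i, (xi, yi) in enumerate(points)
    )


def distance_band_neighbours(points, lower, upper):
    """Neighbours of each point whose distance lies within [lower, upper]."""
    return [
        [j for j, (xj, yj) in enumerate(points)
         if j != i and lower <= hypot(xi - xj, yi - yj) <= upper]
        for i, (xi, yi) in enumerate(points)
    ]


# Hypothetical projected coordinates in metres.
points = [(0.0, 0.0), (100.0, 0.0), (300.0, 0.0), (700.0, 0.0)]
upper = threshold_for_one_neighbour(points)
print(upper)
print(distance_band_neighbours(points, 0.0, upper))
```

With this threshold the most isolated point has exactly one neighbour, while points in denser parts of the study area can have many, which is one practical difference from the k nearest neighbour method.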
3.6.3.2 Moran’s I
As mentioned in the previous section, the Moran’s I test was computed in the R Language (R Core Team, 2014) with the function moran.test. The function needs the spatial weights list derived from the list of neighbours constructed with one of the methods explained before, and a vector with the values of the variable to be checked for spatial autocorrelation.
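As an illustration of what the statistic measures, the following Python sketch computes the global Moran’s I from a vector of values and a set of spatial weights. It computes only the statistic and its expected value under the null hypothesis, not the significance test; the function names, weights and values are hypothetical, and the weights are stored as one dictionary per point that maps neighbour index to weight.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` given weights[i] = {j: w_ij}."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    s0 = sum(w for row in weights for w in row.values())   # sum of all weights
    cross_products = sum(
        w * deviations[i] * deviations[j]
        for i, row in enumerate(weights)
        for j, w in row.items()
    )
    return (n / s0) * cross_products / sum(d * d for d in deviations)


def expected_morans_i(n):
    """Expected value of Moran's I under zero spatial autocorrelation."""
    return -1.0 / (n - 1)


# Hypothetical row standardised weights for four points along a line.
weights = [{1: 1.0}, {0: 0.5, 2: 0.5}, {1: 0.5, 3: 0.5}, {2: 1.0}]
gradient = [1.0, 2.0, 3.0, 4.0]        # similar values are neighbours
alternating = [1.0, 4.0, 1.0, 4.0]     # dissimilar values are neighbours
print(morans_i(gradient, weights), morans_i(alternating, weights))
print(expected_morans_i(len(gradient)))
```

Values of I above the expected value point towards positive spatial autocorrelation and values below it towards negative spatial autocorrelation; whether the difference is significant has to be assessed with a test such as the one used in this work.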
3.6.3.3 Geary’s c
As with Moran’s I, a neighbour list is also needed for Geary’s c. The same neighbour lists constructed for the Moran’s I calculation were used. Geary’s c was computed using the function geary.test included in the package spdep (Bivand, 2014).
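For comparison with Moran’s I, the following Python sketch computes the global Geary’s c, which is based on squared differences between neighbouring values instead of cross products of deviations from the mean. Only the statistic is computed, not the test; the function name, weights and values are hypothetical, with weights stored as one dictionary per point that maps neighbour index to weight.

```python
def gearys_c(values, weights):
    """Global Geary's c for `values` given weights[i] = {j: w_ij}."""
    n = len(values)
    mean_value = sum(values) / n
    s0 = sum(w for row in weights for w in row.values())   # sum of all weights
    squared_differences = sum(
        w * (values[i] - values[j]) ** 2
        for i, row in enumerate(weights)
        for j, w in row.items()
    )
    total_variation = sum((v - mean_value) ** 2 for v in values)
    return ((n - 1) / (2.0 * s0)) * squared_differences / total_variation


# Hypothetical row standardised weights for four points along a line.
weights = [{1: 1.0}, {0: 0.5, 2: 0.5}, {1: 0.5, 3: 0.5}, {2: 1.0}]
print(gearys_c([1.0, 2.0, 3.0, 4.0], weights))   # neighbours are similar
print(gearys_c([1.0, 4.0, 1.0, 4.0], weights))   # neighbours are dissimilar
```

The expected value of c under zero spatial autocorrelation is 1; values below 1 indicate positive and values above 1 indicate negative spatial autocorrelation, so the scale runs in the opposite direction to Moran’s I.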
3.6.3.4 Modified Moran’s I – Empirical Bayes Index
The modified Moran’s I applied is the one proposed by Assunção and Reis (1999), implemented with the function EBest included in the package spdep (Bivand, 2014). This function calculates an Empirical Bayes Estimate, and Moran’s I is then computed using the resulting smoothed rate.
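The idea behind the adjustment can be sketched in Python as a standardisation of each farm’s rate by a variance that depends on the number of plants evaluated, so that rates based on few plants are not given undue influence. The formulas below follow this sketch’s reading of Assunção and Reis (1999) and are not taken from the code in Annex 1; the function name and counts are hypothetical.

```python
from math import sqrt


def eb_standardised_rates(cases, population):
    """Standardise rates cases[i] / population[i] by a size-dependent variance."""
    n = len(cases)
    total_population = sum(population)
    overall_rate = sum(cases) / total_population
    if overall_rate == 0:
        raise ValueError("no cases observed; the rates cannot be standardised")
    rates = [c / x for c, x in zip(cases, population)]
    # Population-weighted variance of the rates around the overall rate.
    s2 = sum(x * (p - overall_rate) ** 2
             for p, x in zip(rates, population)) / total_population
    a = s2 - overall_rate / (total_population / n)
    standardised = []
    for p, x in zip(rates, population):
        variance = a + overall_rate / x
        if variance < 0:                 # fall back to the sampling variance term
            variance = overall_rate / x
        standardised.append((p - overall_rate) / sqrt(variance))
    return standardised


# Hypothetical counts: diseased plants and evaluated plants per farm.
diseased = [0, 5, 10]
evaluated = [50, 50, 100]
print(eb_standardised_rates(diseased, evaluated))
```

In this example the second and third farms have the same raw rate, but the farm with more evaluated plants receives the larger standardised value because its rate is estimated more precisely. Moran’s I would then be computed on the standardised values instead of the raw rates.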
3.7 GAMLSS applied to Fusarium Wilt Disease Incidence
One of the special interests of this work is to explore and model the relationship between Fusarium Wilt incidence and variables such as altitude, shade percentage, slope, soil pH, plant density and a Soil Quality Index. The Soil Quality Index was calculated according to the methodology of Altieri and Nicholls (2002) and comprises a list of measurements which together give an estimate of the quality of the soil. As explained in the literature review, GLMs are the usual approach for analysing data that do not exhibit a normal distribution (Plant, 2012, p. 301), as is the case for the disease incidence rate. Garret et al. (2004) also suggest using a GLM instead of applying a transformation to the data. On the other hand, Kongchouy, Choonpradub and Kuning (2010) indicate that applying a logarithmic transformation to the disease incidence rate is enough to achieve satisfactory results.
Bivand et al. (2008, p. 274) also apply a logarithmic transformation to disease incidence rates to try to obtain a nearly normal distribution. In this context, a logarithmic transformation consists of calculating the logarithm of each of the original values of the variable of interest and using it instead of the original value. Recalling from basic mathematics, “a logarithm function is defined with respect to a base” (Nau, 2014). However, since the data used in the present work contain values of zero for the farms without the disease, and the logarithm of zero is undefined, the logarithmic transformation does not seem to be a feasible solution, even though some other transformation could be applied. Moreover, with methods such as Generalised Linear Models available to handle this kind of data without transforming it, there is no strong reason to take the transformation approach. Zuur et al. (2009, p. 19) state that the simplest statistical model should be used, but that it should be used correctly. Following this approach, a GLM was applied to the disease incidence rate to try to find a model that could explain the relationship between the disease incidence and the proposed set of explanatory variables. The implementation was done in the R Language (R Core Team, 2014). As presented in the literature review, the implementation of a GLM consists of three steps (Zuur et al., 2009, p. 210):
a) Assume a distribution for the response variable
b) Specify the systematic part (the explanatory variables)
c) Specify the link function
In the data analysed in the present work, the response variable is proportion data, which corresponds to a binomial distribution according to Zuur et al. (2009, p. 202) and Garret et al. (2004). The systematic part corresponds to the explanatory variables, which in the present work are a selection of the variables presented in Table 3.
This selection uses criteria to retain the variables which are statistically significant in the model, mostly based on their p-values. GLMs are fitted to the data by maximum likelihood, and this is where the link function comes in, the logit link being the most used for proportional data (Garrett et al., 2004). With Poisson and binomial distributions the observed variability is often greater than the predicted variability; this is known as overdispersion and is very common in plant disease data (Garrett et al., 2004). When overdispersion is found, one approach is to fit the model to the data with a maximum quasi-likelihood method (Garrett et al., 2004). In this case the distribution is still binomial, but the overdispersion is taken into account (Zuur et al., 2009, p. 226). Special attention should be paid to testing for overdispersion before starting to select the explanatory variables for the model (Zuur et al., 2009, p. 223). In the present work overdispersion was found in the model, and it was handled using a GAMLSS with a discrete distribution called the beta-binomial distribution. Two main reasons support the selected approach: 1) the GAMLSS provides an AIC calculation, which is very useful for selecting the most significant explanatory variables and is not provided by the quasi-binomial method; 2) the beta-binomial distribution is widely recognized as the most adequate solution for overdispersed proportional data such as plant disease incidence (Garrett et al., 2004). The implementation was also done with the R Language (R Core Team, 2014) and the gamlss package (Rigby & Stasinopoulos, 2005). The next step is to select the explanatory variables which are important to include in the model. According to Zuur et al. (2009, p. 221), two options are available for this: selection using the AIC (Akaike Information Criterion) or hypothesis testing. The AIC is a measure of how well the model fits (Dalgaard, 2008, p.
232) and an extensive explanation can be found in Akaike (1998); in general terms, the AIC is based on the maximum likelihood estimator and is used to select the most appropriate model (Pan, 2001). For a GLM it can be calculated in the R Language (R Core Team, 2014) using the function step. Basically, a model with a lower AIC value is better (Zuur et al., 2009, p. 542). However, since a GAMLSS was used, the function available for this calculation is stepGAIC from the gamlss package (Rigby & Stasinopoulos, 2005).
Validation of the Model
Zuur et al. (2009, p. 23) suggest validating a linear regression model using graphs as follows:
1) A graph of the model residuals vs fitted values to check for homogeneity
2) A Q-Q plot or histogram of the residuals to verify normality
3) Plots of the residuals vs each explanatory variable to verify independence
Teetor and Loukides (2011, p. 295) provide a simple explanation of how to interpret this kind of graph. Since the GAMLSS was used instead of a classical GLM, the graphs provided by the package were used to validate the model. The function plot of the R Language (R Core Team, 2014) applied to a GAMLSS model provides the following graphs for model validation (Stasinopoulos, Rigby & Akantziliotou, 2008, p. 121):
- Model residuals against the fitted values
- Model residuals against an index or specified x-variable
- Kernel density estimate of the residuals
- QQ-normal plot of the residuals
In general, a valid, well-fitted model should show residuals that are normally distributed and free of patterns. The R code (R Core Team, 2014) used to implement the GLM and GAMLSS is included in Annex 2.
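To make the idea of overdispersion concrete, the following minimal sketch compares the variance of a binomial count with the variance of a beta-binomial count that has the same mean. It is written in Python rather than R, it is not the code of Annex 2, and it uses the textbook (alpha, beta) parameterisation of the beta-binomial distribution, which is not necessarily the parameterisation used by the gamlss package. The sample size and shape parameters are hypothetical values chosen only for illustration.

```python
def binomial_variance(n, p):
    """Variance of a binomial count with n trials and success probability p."""
    return n * p * (1.0 - p)


def beta_binomial_variance(n, alpha, beta):
    """Variance of a beta-binomial count with n trials and shape parameters alpha, beta.

    The mean probability is p = alpha / (alpha + beta) and the intra-cluster
    correlation is rho = 1 / (alpha + beta + 1). The binomial variance is
    inflated by the factor 1 + (n - 1) * rho.
    """
    p = alpha / (alpha + beta)
    rho = 1.0 / (alpha + beta + 1.0)
    return n * p * (1.0 - p) * (1.0 + (n - 1) * rho)


# Hypothetical farm: 50 sampled plants, mean incidence 0.2 (alpha = 2, beta = 8).
n_plants = 50
alpha, beta = 2.0, 8.0
mean_incidence = alpha / (alpha + beta)

print("binomial variance:     ", binomial_variance(n_plants, mean_incidence))
print("beta-binomial variance:", beta_binomial_variance(n_plants, alpha, beta))
```

With the same mean, the beta-binomial variance is larger than the binomial variance whenever more than one plant is sampled, which is the extra variability that a plain binomial GLM cannot represent.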
4 Results and Analysis
4.1 Map of the Spatial Distribution of Disease Incidence in the Study Area
Figure 6 shows the resulting map of the distribution of Fusarium Wilt incidence in the region of San Luis de Shuaro, Peru. From these points, Thiessen polygons were constructed and coloured with the same colour scheme, using the same classification for disease incidence. The resulting map with the Thiessen polygons is shown in Figure 7.
Figure 6 - Map of the Spatial Distribution of Fusarium wilt incidence in the study area (David Brown, ArcGIS for Desktop 10.2)
Figure 7 – Thiessen Polygons constructed as from the points of each sample farm (David Brown, ArcGIS for Desktop 10.2)
4.2 Results of Exploratory Data Analysis
4.2.1 Histogram
Figure 8 shows the histogram of the disease incidence, calculated using the R Language (R Core Team, 2014). Unlike the normal distribution shown in Figure 9, the histogram shows that the data follow a positively skewed distribution.
Figure 8 – Histogram of disease incidence rate
Figure 9 – Histogram of a normal distribution from simulated data
4.2.2 Normal Q-Q Plot
The quantile-quantile plot for normally distributed data should follow a straight line, as shown in Figure 11. The shape shown in Figure 10 differs from this, which indicates that the disease incidence data are not normally distributed.
Figure 10 – Normal Q-Q Plot of the disease incidence
Figure 11 – Normal Q-Q Plot for a normal distribution for simulated data
4.2.3 Skewness and Kurtosis
As explained before, for a normal distribution the skewness should be 0 and the kurtosis should be 3. In the present work skewness and kurtosis were calculated using the R Language (R Core Team, 2014) with the function stat.desc from the package pastecs (Grosjean & Ibanez, 2014). For the disease incidence, the skewness was 1.9327 and the kurtosis was 3.1458, so the skewness is the more problematic of the two.
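As an illustration of what these two statistics measure, the sketch below computes moment-based skewness and kurtosis in plain Python. It is an assumption-laden example and not a reimplementation of stat.desc: the estimator used here is the simple population-moment version with the convention that a normal distribution has kurtosis 3, and the sample values are invented for the example.

```python
def skewness_kurtosis(values):
    """Return (skewness, kurtosis) from central moments.

    skewness = m3 / m2**1.5 and kurtosis = m4 / m2**2, where mk is the k-th
    central moment. With this convention a normal distribution has
    skewness 0 and kurtosis 3.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical incidence rates: many farms with low values and a few high ones,
# which produces a long right tail (positive skewness).
incidence = [0.0, 0.0, 0.0, 0.02, 0.03, 0.05, 0.05, 0.10, 0.25, 0.60]
skew, kurt = skewness_kurtosis(incidence)
print("skewness:", skew)
print("kurtosis:", kurt)
```

A positive skewness, as in this invented sample, corresponds to the long right tail seen in a positively skewed histogram.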
4.2.4 Neighbours list and Spatial Weights
Probably the most convenient way to show how much the spatial structure depends on the method selected to define the neighbours is graphically. The following figures show the different methods used to construct neighbour relationships. Delaunay triangulation, Sphere of Indifference and distance-based neighbours are shown in Figure 12, while the k nearest neighbour relationships with k = 1, k = 5 and k = 10 are shown in Figure 13.
Figure 12 – a) Delaunay triangulation, b) Sphere of Indifference, c) Distance based
Figure 13 – a) k nearest neighbour with k = 1, b) k nearest neighbour with k= 5, c) k nearest neighbour with k=10
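The k nearest neighbour relationship can be sketched in a few lines. The example below is a plain-Python illustration under simple assumptions (planar coordinates and Euclidean distance); it is not the spdep code used in the thesis, and the four points are hypothetical.

```python
import math


def knn_neighbours(points, k):
    """For each point, return the indices of its k nearest neighbours.

    points is a list of (x, y) tuples in a projected coordinate system.
    Ties in distance are broken by the lower index.
    """
    neighbours = []
    for i, (xi, yi) in enumerate(points):
        distances = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        neighbours.append([j for _, j in distances[:k]])
    return neighbours


# Four hypothetical farms on a line, with increasing gaps between them.
farms = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
print(knn_neighbours(farms, 1))
print(knn_neighbours(farms, 2))
```

Note that the relationship is not necessarily symmetric: with k = 1, the third farm has the second farm as its neighbour, while the second farm's only neighbour is the first. Increasing k makes the neighbour graph denser, which is the effect visible across the panels of Figure 13.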
4.2.5 Results of Moran’s I Test
As can be seen in Table 4, none of the Moran’s I calculations using the different neighbour lists is statistically significant at the 95 % confidence level. At this point it is necessary to recall why the p-value is relevant. A very detailed explanation can be found in Urdan (2010, p. 61). The following are definitions from Urdan (2010, p. 77):
- p-value: “The probability of obtaining a statistic of a given size from a sample of a given size by chance, or due to random error”
- confidence interval: An interval calculated using sample statistics to contain the population
Combined, these concepts are a way to determine whether the calculated value is statistically significant. In the present case, using a 95 % confidence level, only values with a p-value of less than 0.05 are statistically significant. It is also worth mentioning that Moran’s I was computed with a “Two Sided” alternative hypothesis. The default is to set the alternative hypothesis to be greater than the value for zero spatial autocorrelation, on the assumption that any spatial autocorrelation will be positive. However, since there are no clues to presume whether the possible spatial autocorrelation will be positive or negative, the alternative hypothesis was set to “Two Sided”.
Table 4 – Results of Moran’s I test computations using different neighbours list methods
Neighbour Method            I         E(I)      var(I)    St. deviate   p-value
Delaunay Triangulation      -0.0467   -0.0133   0.0042    -0.52         0.6053
Sphere of Indifference      -0.025    -0.013    0.01      -0.12         0.9069
Nearest Neighbour k = 1     0.0046    -0.0133   0.02      0.13          0.8993
Nearest Neighbour k = 5     -0.107    -0.013    0.004     -1.5          0.1385
Nearest Neighbour k = 10    -0.0732   -0.0133   0.0018    -1.4          0.1626
Distance bands              -0.0759   -0.0133   0.0045    -0.93         0.3509
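The columns of Table 4 are linked by simple arithmetic, which the following sketch illustrates in Python. It assumes the standard relationships: the expected value of Moran's I under zero spatial autocorrelation is -1/(n - 1), the standard deviate is the statistic minus its expectation divided by the square root of its variance, and the two-sided p-value comes from the standard normal distribution. The function names are hypothetical, and because the table values are rounded, the results agree with the table only approximately.

```python
import math


def expected_moran(n):
    """Expected value of Moran's I under zero spatial autocorrelation."""
    return -1.0 / (n - 1)


def standard_deviate(statistic, expected, variance):
    """Standardised value of a test statistic."""
    return (statistic - expected) / math.sqrt(variance)


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal deviate z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


n_farms = 76
e_i = expected_moran(n_farms)

# Rounded values from the Delaunay Triangulation row of Table 4.
z = standard_deviate(-0.0467, e_i, 0.0042)
p = two_sided_p_value(z)

print("E(I):       ", round(e_i, 4))
print("St. deviate:", round(z, 2))
print("p-value:    ", round(p, 4))
```

A p-value well above 0.05, as here, means the computed I is compatible with zero spatial autocorrelation.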
To better understand the results, it is necessary to explain the values contained in Table 4. The I value is the computed Moran’s I, which is positive for positive spatial autocorrelation and negative for negative spatial autocorrelation. The E(I) value is the expected value of the statistic under the null hypothesis of zero spatial autocorrelation, and it comes from the following function (Griffith, 2009):
$$E(I) = -\frac{1}{n - 1}$$
where n is the number of areal units, in this case the number of farms. With n = 76 farms this gives E(I) = -0.0133, as reported in Table 4. The var(I) value is the variance of the statistic and the St. deviate is the standard deviate (Bivand et al., 2008, p. 260).
4.2.6 Results of Geary’s c Test
In the case of the Geary’s c test, the only computation with a significant p-value is the one that uses a neighbour list built with the k nearest neighbour method with k = 10. In Table 5 the c values are the computed Geary’s c. The remaining values are as described for the Moran’s I calculation, with the difference that the expected value for zero spatial autocorrelation is 1. Values between 1 and 2 represent negative spatial autocorrelation and values between 0 and 1 indicate positive spatial autocorrelation. In the present case, although the p-value of the calculation for the k nearest neighbour method with k = 10 is statistically significant, the c value is barely greater than 1 and the null hypothesis of zero spatial autocorrelation is retained.
Table 5 – Results of Geary’s c test computations using different neighbours list methods
Neighbour Method            c        E(c)   var(c)    St. deviate   p-value
Delaunay Triangulation      1.0428   1      0.0054    -0.58         0.561
Sphere of Indifference      1.021    1      0.012     -0.19         0.8496
Nearest Neighbour k = 1     1.074    1      0.033     -0.41         0.6841
Nearest Neighbour k = 5     1.133    1      0.0073    -1.6          0.1205
Nearest Neighbour k = 10    1.1495   1      0.0049    -2.1          0.03354
Distance bands              1.072    1      0.0059    -0.94         0.3493
4.2.7 Results of Global Moran’s I using Empirical Bayes Estimates
Table 6 – Results of Moran’s I with Empirical Bayes Estimates computations using different neighbours list methods
Neighbour Method            I         E(I)      var(I)    St. Deviate   p-value
Delaunay Triangulation      -0.0461   -0.0133   0.0042    -0.51         0.6113
Sphere of Indifference      -0.025    -0.013    0.01      -0.11         0.9115
Nearest Neighbour k = 1     0.0079    -0.0133   0.02      0.15          0.8808
Nearest Neighbour k = 5     -0.106    -0.013    0.004     -1.5          0.1426
Nearest Neighbour k = 10    -0.0731   -0.0133   0.0018    -1.4          0.1633
Distance bands              -0.0768   -0.0133   0.0045    -0.95         0.3445
In the case of Moran’s I using the Empirical Bayes Estimate, none of the calculations shows spatial autocorrelation, as can be observed in Table 6, and consequently the null hypothesis cannot be rejected. Taking the three methods used to test for spatial autocorrelation in the disease incidence together, no spatial autocorrelation was found in this case. Differences between Moran’s I and Geary’s c could be attributed to the effect of the distribution of the data, which according to Cliff and Ord (as cited in Plant, 2012, p. 106) affects the c calculation more than I.
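For readers who want to see how the two global statistics are built, the sketch below computes Moran's I and Geary's c from their textbook definitions in plain Python. It is an illustration only: it does not reproduce the spdep tests used in the thesis (no variance, standard deviate or p-value is computed), and the values and the weights matrix are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / s0) * cross / sum(d * d for d in dev)


def gearys_c(values, weights):
    """Global Geary's c for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    s0 = sum(sum(row) for row in weights)
    squared_diff = sum(
        weights[i][j] * (values[i] - values[j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    total_ss = sum((v - mean) ** 2 for v in values)
    return ((n - 1) / (2.0 * s0)) * squared_diff / total_ss


# Four hypothetical farms in a row; each farm neighbours the adjacent ones.
values = [1.0, 2.0, 3.0, 4.0]
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print("Moran's I:", morans_i(values, weights))
print("Geary's c:", gearys_c(values, weights))
```

In this invented example neighbouring values are similar, so I is positive and c is below 1, the opposite of the slightly negative I and the c values slightly above 1 reported in Tables 4 to 6.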
4.3 Results of the GAMLSS
To construct a model which explains the Fusarium Wilt incidence through the selected variables, the first step is to include all of them as explanatory variables in the model and then apply the Akaike Information Criterion to select a better model. The first model contains the following explanatory variables:
- Area (Farm Size)
- Altitude
- pH
- Planting Density
- Farm Type
- Variety
- Slope
- Shade Percentage
- Soil Quality Index
As explained before, if the modelled data present overdispersion it is preferable to specify a beta-binomial distribution. To test whether the data analysed here present overdispersion, a GLM with a binomial distribution was specified. Moreover, if overdispersion is not present, a GLM with a binomial distribution can be used and the use of the GAMLSS is optional. The GLM was specified with the function glm, which is part of the R Language (R Core Team, 2014), with all the proposed explanatory variables and a binomial distribution. Results are presented in Table 7.
Table 7 – Results of the GLM model with a Binomial distribution and all the proposed explanatory variables
Term                                      Estimate      Std. Error    t value        Pr(>|t|)
(Intercept)                               7.60064318    0.60639424    12.53416123    0.0000000000
pH                                        -1.37212957   0.15488183    -8.85920321    0.0000000000
Altitude                                  0.00094969    0.00014293    6.64426691     0.0000000000
Slope                                     -0.00135563   0.00266970    -0.50778478    0.6116042814
Farm Size (area)                          0.03063996    0.04297005    0.71305393     0.4758123862
Plant density                             -0.00131031   0.00035593    -3.68136155    0.0002319918
Soil Quality Index                        -0.91801074   0.08157778    -11.25319525   0.0000000000
Factor(farm_type)2                        -0.12178332   0.10985683    -1.10856391    0.2676183558
factor(farm_type)3                        0.12928188    0.44745031    0.28893014     0.7726348387
Factor (Variety - Seda)                   -0.22388688   0.24582094    -0.91077219    0.3624154171
Factor (Variety - Mix)                    0.43529689    0.45731161    0.95186057     0.3411676975
Shade percentage                          0.01467834    0.01320917    1.11122363     0.2664721004
Factor(farm_type)2: Factor(Variety) Mix   0.52844930    0.46682362    1.13201063     0.2576299656
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1693.44 on 75 degrees of freedom
Residual deviance: 258.64 on 63 degrees of freedom
AIC: 483.6
Number of Fisher Scoring iterations: 5
To assess the proposed model for overdispersion, the two key values are the residual deviance and the residual degrees of freedom. Overdispersion is estimated by dividing the residual deviance by the degrees of freedom (Zuur et al., 2009, p. 224), in this case 258.64/63 = 4.105, which is higher than the value of 1 expected for the binomial family, as indicated in the model summary. This indicates that overdispersion is present in the model. To account for it, a GAMLSS with a beta-binomial distribution was used. A beta-binomial distribution is a combination of the beta and binomial distributions (Hilbe, 2013) and is used when binomial data present overdispersion (Guimarães, 2005).
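The overdispersion check described above amounts to a single division, sketched here in Python with the residual deviance and degrees of freedom reported in Table 7. The function name is hypothetical and the snippet is only an illustration of the arithmetic, not part of the thesis code.

```python
def dispersion_ratio(residual_deviance, residual_df):
    """Ratio of residual deviance to residual degrees of freedom.

    For a binomial GLM a value close to 1 is expected; values clearly
    above 1 point to overdispersion.
    """
    return residual_deviance / residual_df


# Values taken from the summary of the binomial GLM in Table 7.
ratio = dispersion_ratio(258.64, 63)
print("dispersion ratio:", round(ratio, 3))
```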
Table 8 shows the results of the base model with all the proposed variables, applying a GAMLSS with a beta-binomial distribution.
Table 8 – Results for the base model with all the proposed variables
Term                                      Estimate      Std. Error    t value       Pr(>|t|)
Intercept                                 6.29559794    1.14470686    5.49974684    0.00000074
pH                                        -1.26226658   0.26622565    -4.74134102   0.00001256
Altitude                                  0.00110670    0.00030041    3.68400796    0.00047906
Factor (Variety - Seda)                   -0.08451387   0.37110387    -0.22773643   0.82058864
Factor (Variety - Mix)                    0.27996758    0.63164774    0.44323372    0.65911506
Planting Density                          -0.00182239   0.00076547    -2.38074932   0.02031365
Slope                                     -0.00489983   0.00563126    -0.87011249   0.38754222
Farm Size                                 0.07216351    0.08153999    0.88500759    0.37951792
Soil Quality Index                        -0.73183938   0.13730923    -5.32986312   0.00000141
Factor(farm_type)2                        -0.25502120   0.23044149    -1.10666355   0.27264724
factor(farm_type)3                        -0.02887809   0.94505482    -0.03055705   0.97571939
Shade percentage                          0.00834388    0.02681166    0.31120343    0.75667341
Factor(farm_type)2: Factor(Variety) Mix   0.44620648    0.68007146    0.65611704    0.51413809
Mu link function: logit
Sigma link function: log
Sigma Coefficients:
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   -4.7       0.2277       -20.64    1.169e-32
No. of observations in the fit: 76
Degrees of Freedom for the fit: 14
Residual Deg. of Freedom: 62
at cycle: 13
Global Deviance: 362.1153
AIC: 390.1153
SBC: 422.7456
Summary of the Randomised Quantile Residuals
mean = -0.0124098
variance = 1.063567
coef. of skewness = -0.1870921
coef. of kurtosis = 2.620784
Filliben correlation coefficient = 0.9950251
Applying the function stepGAIC to select a model using the AIC value, the selected variables were:
- pH
- Altitude
- Planting Density
- Soil Quality Index
Table 9 shows the summary of results of the fitted model with the selected variables. As can be observed, the model improved from an AIC of 390.12 to an AIC of 382.08 after selecting the most significant variables.
Table 9 – Results of the adjusted model after applying the variable selection using stepGAIC
Term                 Estimate      Std. Error    t value       Pr(>|t|)
Intercept            5.50208907    1.04975051    5.24133025    0.00000157
pH                   -1.32829757   0.25205493    -5.26987349   0.00000140
Altitude             0.00099656    0.00030522    3.26511767    0.00168565
Planting Density     -0.00202482   0.00068184    -2.96964748   0.00406429
Soil Quality Index   -0.51989099   0.11633084    -4.46907255   0.00002906
Mu link function: logit
Sigma link function: log
Sigma Coefficients:
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   -4.374     0.2158       -20.26    3.779e-32
No. of observations in the fit: 76
Degrees of Freedom for the fit: 6
Residual Deg. of Freedom: 70
at cycle: 12
Global Deviance: 370.0847
AIC: 382.0847
SBC: 396.0691
Summary of the Randomised Quantile Residuals
mean = -0.02736283
variance = 0.9909983
coef. of skewness = -0.02519111
coef. of kurtosis = 2.923242
Filliben correlation coefficient = 0.9979815
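The AIC and SBC values reported for the two GAMLSS fits can be checked from the global deviance and the degrees of freedom used by each fit. The sketch below, in Python, assumes the usual generalised AIC definition (global deviance plus a penalty k per degree of freedom, with k = 2 for the AIC and k = log(n) for the SBC); the function names are hypothetical.

```python
import math


def gaic(global_deviance, df_fit, k=2.0):
    """Generalised Akaike Information Criterion: deviance + k * df."""
    return global_deviance + k * df_fit


def sbc(global_deviance, df_fit, n_obs):
    """Schwarz Bayesian Criterion: the GAIC with penalty k = log(n)."""
    return gaic(global_deviance, df_fit, k=math.log(n_obs))


n_farms = 76

# Full model (Table 8) and selected model (Table 9).
print("full model AIC:    ", round(gaic(362.1153, 14), 4))
print("full model SBC:    ", round(sbc(362.1153, 14, n_farms), 4))
print("selected model AIC:", round(gaic(370.0847, 6), 4))
print("selected model SBC:", round(sbc(370.0847, 6, n_farms), 4))
```

The selected model has a slightly higher global deviance but uses 8 fewer degrees of freedom, which is why both criteria prefer it.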
To validate the model, the graphs of the residuals were constructed and are shown in Figure 14.
Figure 14 – Plots of the residuals for model validation
In the upper left, the randomised residuals are plotted against the fitted values. In the upper right, the randomised residuals are plotted against an index, which basically corresponds to the observation number, in the present case 76 farms. These two plots should not show any pattern for a well-fitted model (Zuur et al., 2009, p. 27). The density estimate and the normal Q-Q plot help to evaluate the normality of the residuals, required for a well-fitted model (Stasinopoulos et al., 2008, p. 122). In the present case they appear to be normally distributed. Finally, four values from the randomised residuals should be checked to confirm that the model was well fitted: the mean, the variance, the coefficient of skewness and the coefficient of kurtosis, all of which are shown in Table 9. A well-fitted model should have a mean near 0, a variance near 1, a coefficient of skewness near 0 and a coefficient of kurtosis near 3. In the present case the values in Table 9 (mean = -0.0274, variance = 0.9910, skewness = -0.0252, kurtosis = 2.9232) are close to the values expected for a well-fitted model.
4.4 Analysis of the Results
4.4.1 Spatial Distribution of Fusarium Wilt in the study area
For the spatial distribution of Fusarium Wilt in the study area, point objects representing each farm were used, as explained in the methodology chapter. Disease incidence was classified into 5 classes, each of which was assigned a colour. This symbology and colour arrangement gives a graphical representation of how the different levels of disease incidence are spread across the study area. As a very basic interpolation, Thiessen polygons were used to represent the zone of influence of each farm with Fusarium Wilt presence in the study area. However, this should be interpreted only as a first approximation of zones of influence in the study area and not as a prediction map.
4.4.2 Spatial Autocorrelation Tests
After computing Moran’s I with different neighbour lists, there is no strong evidence of spatial autocorrelation in the Fusarium Wilt incidence in the study area. One fact should be highlighted from these results: the design of the spatial structure of the studied process, expressed through the neighbour list, is relevant because it affects the detection of spatial autocorrelation, as stated by Bivand et al. (2008, p. 239), O’Sullivan and Unwin (2010, p. 201), Haining (2003, p. 79) and Plant (2012, p. 80). Geary’s c was also used to test for spatial autocorrelation in the Fusarium Wilt incidence in the study area. Although the p-values differ, the null hypothesis of zero spatial autocorrelation was also retained, as in the case of Moran’s I. For Moran’s I using Empirical Bayes Estimates to smooth the disease incidence rates, the results were basically the same as for the standard Moran’s I.
4.4.3 GAMLSS
The proposed model using GAMLSS to explain the Fusarium Wilt incidence has the following explanatory variables:
- Soil pH
- Altitude
- Soil Quality Index
- Planting Density
This was the result of applying the AIC to select the variables to be included in the model; the model validation was conducted using graphs provided by the same statistical tool. Soil pH is already recognized to be correlated with Fusarium Wilt disease. According to Alvarez, García, Robles and Díaz (1981), there is evidence to expect a higher disease presence in soils with a pH lower than 7. Román Jerí (2012, p. 90) also reports a relationship between the disease incidence and soil pH in his results. The Soil Quality Index, calculated according to the methodology proposed by Altieri and Nicholls (2002), is also correlated with the disease incidence. Better soil conditions were also reported by Domínguez-Hernández, Negrín and Rodríguez (2008) as a condition associated with lower levels of Fusarium Wilt presence, although Alabouvette (as cited in Domínguez-Hernández et al., 2008, p. 405) states that there is no evidence that soil properties play any role in suppressiveness. In the case of altitude, although no specific studies were found in the literature on the effect of altitude on Fusarium Wilt incidence, it is known that altitude acts indirectly on banana growth through a decrease in temperature, making it difficult to produce bananas above 1000 metres of altitude (Arvanitoyannis & Mavromatis, 2009). These unfavourable conditions for banana growth could also influence plant health and how the plant defends itself against Fusarium Wilt (Ploetz, Jones, Sebaisgari & Tushemereirwe, 1994).
With regard to plant density, no specific work was found in the literature dealing with the effect of plant density on Fusarium Wilt incidence. Works such as the one conducted by Athani, Revanappa and Dharmatti (2009) studied the effect of plant density on plant height and yield, but did not account for plant health or other variables of interest for the present work. However, as in the case of altitude, unfavourable conditions could make a weak plant more likely to acquire the disease, and with a higher plant density the competition between plants is also higher. One possible hypothesis is that under higher competition (high plant density) without adequate fertilisation, the risk of having weaker plants increases, and those weak plants could be more prone to infection by Fusarium Wilt or even other diseases. However, this is only an outline of a hypothesis and should be taken as a possible subject for further work and not as fact.
4 Conclusions
1) The spatial distribution of Fusarium Wilt in the study area was successfully represented with the aid of the software ArcGIS for Desktop (ESRI, 2013). Thiessen polygons are a useful method for basic interpolation but have the clear limitation that they are just a geometric construction, so inferences drawn from them should be made with caution. Other forms of interpolation, such as areal interpolation, are suggested for exploration in future work when the real boundaries of the farms are available. The areal interpolation tool available in ArcGIS for Desktop (ESRI, 2013) could be a starting point, but it has the limitation that it can assume Gaussian, binomial or Poisson distributions but not the beta-binomial distribution, which is needed for binomial data with overdispersion. As overdispersion is a common characteristic of disease incidence data, a geostatistical tool which easily allows researchers to produce interpolations for this kind of data would be a very valuable contribution of further work.
2) Spatial autocorrelation was not found in the Fusarium Wilt incidence in the study area, which mainly means that the distribution pattern of the farms with presence of Fusarium Wilt is random. Based on these results, the null hypothesis H0 cannot be rejected. Special attention should be paid to the fact that these results come from data representing farms located in a very diverse region with a variety of elements involved. Even though spatial autocorrelation was not found at this scale, other studies should be carried out to analyse spatial autocorrelation at farm level, which implies collecting the location of each sampled plant. The results obtained in the present work could support the design of a sampling strategy when the spatial component is included in an epidemiological study.
3) A GAMLSS model with a beta-binomial distribution was successfully applied to explain the Fusarium Wilt incidence of banana from a set of explanatory variables. The beta-binomial distribution was found to be the most appropriate to model binomial data with overdispersion, confirming what was found in the literature review. One notable result of the present work is the set of guidelines produced for modelling plant disease incidence, since all the script code implemented in the R Language (R Core Team, 2014) is provided so that the methodology can be easily reproduced. Desirable further work using the present work as a starting point could be the development of a software package that provides an easy-to-use tool for plant pathologists, or at least a detailed guide on how to model disease incidence data with the available tools.
4) The relationship of soil pH and soil quality conditions with Fusarium Wilt incidence was confirmed, as it coincides with results found in previous works. These results could lead to more specific work analysing the influence of pH and soil conditions on the disease incidence of Fusarium Wilt. Further work is also suggested to explore in depth the relationship of altitude and plant density with Fusarium Wilt incidence. Although the present work is based on part of the raw data kindly provided by the work of Román Jerí (2012), the approach taken was completely blind with respect to that previous work in terms of the methodology used to analyse the relationship between disease incidence and the proposed explanatory variables. As future work, a detailed review of the methodologies applied in the two works is suggested to outline the reasons behind the different results found.
5 References
Akaike, H. (1998). Information Theory and an Extension of the Maximum Likelihood Principle. In E. Parzen, K. Tanabe, & G. Kitagawa (Eds.), Selected Papers of Hirotugu Akaike (pp. 199–213). New York, NY: Springer New York. Altieri, M. A., & Nicholls, C. I. (2002). Sistema agroecologico rápido de evaluación de calidad de suelo y salud de cultivos en el agroecosistema de café. Retrieved August 28, 2014, from University of California, Berkeley, Agroecology in Action website: http://www.agroeco.org/doc/SistAgroEvalSuelo2.htm Alvarez, C. E., García, V., Robles, J., & Díaz, A. (1981). Influence des caractéristiques du sol sur l’incidence de la Maladie de Panama. Fruits, 36(2), 71–81. Alves, M. de C., & Pozza, E. A. (2010). Indicator kriging modeling epidemiology of common bean anthracnose. Applied Geomatics, 2(2), 65–72. doi:10.1007/s12518-010-0021-1 Anselin, L. (2003). Data and Spatial Weights in spdep Notes and Illustrations. Urbana-Champaign: University of Illinois. Retrieved September 9, 2014, from https://geodacenter.asu.edu/system/files/dataweights.pdf Arias, P., Dankers, C., Liu, P., & Pilkauskas, P. (2003). The world banana economy, 1985-2002. Rome: Food and Agriculture Organization of the United Nations. Arvanitoyannis, I. S., & Mavromatis, A. (2009). Banana Cultivars, Cultivation Practices, and Physicochemical Properties. Critical Reviews in Food Science and Nutrition, 49(2), 113–135. Assunção, R. M., & Reis, E. A. (1999). A new proposal to adjust Moran’s I for population density. Statistics in Medicine, 18(16), 2147–2162. Athani, S. I., Revanappa, & Dharmatti, P. R. (2009). Effect of plant density on growth and yield in banana. 22, 1, 143–146. Bachmaier, M., & Backes, M. (2008). Variogram or semivariogram? Understanding the variances in a variogram. Precision Agriculture, 9(3), 173–175. Bivand, R. (2014). spdep: Spatial dependence: weighting schemes, statistics and models. R package. (Version 0.5-71). 
Retrieved August 14, 2014, from http://CRAN.Rproject.org/package=spdep Bivand, R., & Lewin-Koh, N. (2014). maptools: Tools for reading and handling spatial objects (Version 0.8-29). R. Retrieved August 14, 2014, from http://CRAN.Rproject.org/package=maptools Bivand, R. S., Pebesma, E. J., & Gómez-Rubio, V. (2008). Applied spatial data analysis with R. New York; London: Springer.
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J.-S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127–135. Brayford, D. (1992). Fusarium oxysporum f. sp. cubense. IMI Descriptions of Fungi and Bacteria, 112, 115. Crawley, M. J. (2007). The R book. Chichester, England; Hoboken, N.J.: Wiley. Dalgaard, P. (2008). Introductory Statistics with R. New York, NY: Springer New York. Del Ponte, E. M., Shah, D. A., & Bergstrom, G. C. (2003). Spatial Patterns of Fusarium Head Blight in New York Wheat Fields Suggest Role of Airborne Inoculum. Plant Health Progress. doi:10.1094/PHP-2003-0418-01-RS Domínguez‐Hernández, J., Negrín, M. A., & Rodríguez, C. M. (2008). Soil Potassium Indices and Clay‐Sized Particles affecting Banana‐Wilt Expression Caused by Soil Fungus in Banana Plantation Development on Transported Volcanic Soils. Communications in Soil Science and Plant Analysis, 39(3-4), 397–412. ESRI. (2013). ArcGIS for Desktop (Version 10.2). Redlands, California: Environmental Systems Resource Institute. Fortin, M.-J., Dale, M. R. T., & Hoef, J. ver. (2002). Spatial analysis in ecology. In Encyclopedia of Environmetrics (Vol. 4, pp. 2051–2058). Chichester, UK: John Wiley & Sons, Ltd. Frison, E., & Sharrock, S. (1998). The economic, social and nutritional importance of banana in the world. In Bananas and Food Security/Les productions bananières: un enjeu économique majeur pour la sécurité alimentaire (pp. 21–35). Douala, Cameroon: INIBAP. Garrett, K. A., Madden, L. V., Hughes, G., & Pfender, W. F. (2004). New Applications of Statistical Tools in Plant Pathology, 94(9), 999–1003. Griffith, D. A. (2009). Spatial Autocorrelation. Retrieved August 19, 2014, from Elsevier Store website: http://booksite.elsevier.com/brochures/hugy/SampleContent/Spatial-Autocorrelation.pdf Grosjean, P, & Ibañez, F., (2014). 
pastecs: Package for Analysis of Space-Time Ecological Series. R package version. 1.3-18. Retrieved August 5, 2014, from http://CRAN.Rproject.org/package=pastecs Guimarães, P. (2005). A simple approach to fit the beta-binomial model. Stata Journal, 5(3), 385–394. Guzmán-Plazola, R. A., Gómez-Pauza, R., García-Espinosa, R., & Gavi-Reyes, F. (2004). Distribución Espacial de la Pudrición Radical del Frijol (Phaseolus vulgaris L.) por Fusarium solani (Mart.) Sacc. f. sp. phaseoli (Burk.) Snyd. y Hans. en la Vega de Metztitlán, Hidalgo, México. Revista Mexicana de Fitopatología
Haining, R. P. (2003). Spatial data analysis: theory and practice. Cambridge, UK ; New York: Cambridge University Press. Hilbe, J. M. (2013). Beta Binomial Regression. The SelectedWorks of Joseph M Hilbe. Hughes, G. and Madden L.V., (1993). Using the Beta-Binomial Distribution to Describe Aggregated Patterns of Disease Incidence. Phytopathology 83:759-763. Hughes, G., Madden, L. V., & G. P. Munkvold. (1996). Cluster Sampling for Disease Incidence Data. American Phytopathological Society, 86(2), 132–137. Hughes, G., Munkvold, G. P., & Samita, S. (1998). Application of the logistic-normal-binomial distribution to the analysis of Eutypa dieback disease incidence. International Journal of Pest Management, 44(1), 35–42 Jackson, M. C., Huang, L., Xie, Q., & Tiwari, R. C. (2010). A modified version of Moran’s I. International Journal of Health Geographics, 9(1), 33. doi:10.1186/1476-072X-9-33 Jarvis A., H.I. Reuter, A. Nelson, E. Guevara, 2008, Hole-filled seamless SRTM data V4, International Centre for Tropical Agriculture (CIAT). Retrieved Octuber 5, 2014, from http://srtm.csi.cgiar.org. Kongchouy, N., Choonpradub, C., & Kuning, M. (2010). Methods for Modeling Incidence Rates with Application to Pneumonia Among Children in Surat Thani Province, Thailand, 1(37), 29–38. Krivoruchko, K. (2011). Spatial statistical data analysis for GIS users. Redlands, Calif.: Esri Press. Krivoruchko, K., Gribov, A., & Krause, E. (2011). Multivariate Areal Interpolation for Continuous and Count Data. Procedia Environmental Sciences, 3, 14–19. doi:10.1016/j.proenv.2011.02.004 Lichtemberg, P. S. F., Pocasangre, L. E., Staver, C., Dold, C., & Sikora, R. A. (2010). Fusarium Wilt (Fusarium oxysporum f. sp. cubense) in Gros Michel (AAA) bananas, the incidence at smalholder level in Nicaragua. In Conference on International Research on Food Security, Natural Resource and Rural Development. Zurich. Madden, L. V., Hughes, G. and van den Bosch, F. 2007. The Study of Plant Disease Epidemics. 
APS Press, St Paul Madden, L. V., & Hughes, G. (1994). BBD-Computer Software for Fitting the Beta-Binomial Distribution to Disease Incidence Data, Plant Disease, 78(5), 536-540. Morton, A. (2014). UTM Grid Zones of the World. Retrieved August 9, 2014, from http://www.dmap.co.uk/utmworld.htm Musoli, C. P., Pinard, F., Charrier, A., Kangire, A., ten Hoopen, G. M., Kabole, C., Owang J., Bieysse D., Cilas, C. (2008). Spatial and temporal analysis of coffee wilt disease caused by 57
Fusarium xylarioides in Coffea canephora. European Journal of Plant Pathology, 122(4), 451–460. doi:10.1007/s10658-008-9310-5 Nau, R. F. (2014). The logarithm transformation. Retrieved September 17, 2014, from http://people.duke.edu/~rnau/411log.htm Nelson, M. R., Felix-Gastelum, R., Orum, T. V., Stowell, L. J., & Myers, D. E. (1994). Geographic Information Systems and Geostatistics in the Design and Validation of Regional Plant Virus Management Programs, 84(9), 898–905. Nelson, M. R., Orum, T. V., Jaime-Garcia, R., & Nadeem, A. (1999). Applications of Geographic Information Systems and Geostatistics in Plant Disease Epidemiology and Management. Plant Disease, 83(4), 308–319. doi:10.1094/PDIS.1999.83.4.308 NIST/SEMATECH. (2014). Measures of Skewness and Kurtosis. Retrieved August 18, 2014, from http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm Oerke, E.-C., Meier, A., Dehne, H.-W., Sulyok, M., Krska, R., & Steiner, U. (2010). Spatial variability of fusarium head blight pathogens and associated mycotoxins in wheat crops: Spatial variability of Fusarium species and mycotoxins. Plant Pathology, 59(4), 671–682. doi:10.1111/j.1365-3059.2010.02286.x O’Sullivan, D., & Unwin, D. (2010). Geographic Information Analysis (2nd ed.). John Wiley & Sons, Inc. Pan, W. (2001). Akaike’s Information Criterion in Generalized Estimating Equations. Biometrics, 57(1), 120–125. doi:10.1111/j.0006-341X.2001.00120.x Paulitz, T. C., Zhang, H., & Cook, R. J. (2003). Spatial distribution of Rhizoctonia oryzae and rhizoctonia root rot in direct-seeded cereals. Canadian Journal of Plant Pathology, 25(3), 295–303. doi:10.1080/07060660309507082 Pérez-Vicente, L., Dita, M. A., & Parte, E. M. la. (2014). Technical Manual Prevention and Diagnostic of Fusarium Wilt (Panama disease) of banana caused by Fusarium oxysporum f. sp. cubense Tropical Race 4 (TR4). FAO. Pfeiffer, D. U. (1996). Issues related to handling of spatial data. 
In Proceedings of the epidemiology and state veterinary programmes (pp. 83–105). Christchurch. Plant, R. E. (2012). Spatial data analysis in ecology and agriculture using R. Boca Raton: CRC Press. Ploetz, R. C. (2006). Fusarium Wilt of Banana Is Caused by Several Pathogens Referred to as Fusarium oxysporum f. sp. cubense. Phytopathology, 96(6), 653–656. doi:10.1094/PHYTO-960653 Ploetz, R. C., Jones, D. R., Sebaisgari, K., & Tushemereirwe, W. K. (1994). Panama disease on East African highland bananas. Fruits (Paris), 49(4), 253–260. 58
R Core Team. (2014). R: A language and environment for statistical computing. Vienna, Austria.: R Foundation for Statistical Computing. Retrieved June 5, 2014, from http://www.Rproject.org/ Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape,(with discussion). Applied Statistics, 54, 507–554. Román Jerí, C. H. (2012). Consideraciones epidemiológicas para el manejo de la Marchitez por Fusarium (Fusarium oxysporum f. sp. cubense) del banano en la región central del Perú. CATIE, Turrialba, Costa Rica. Selvaraja, S., Balasundra, S. K., Vadamalai, G., & Husni, M. H. A. (2012). Spatial Variability of Orange Spotting Disease in Oil Palm. Journal of Biological Sciences, 12(4), 232–238. doi:10.3923/jbs.2012.232.238 Souris, M., & Bichaud, L. (2011). Statistical methods for bivariate spatial analysis in marked points. Examples in spatial epidemiology. Spatial and Spatio-Temporal Epidemiology, 2(4), 227– 234. doi:10.1016/j.sste.2011.06.001 Stasinopoulos, D. M., & Rigby, R. A. (2007). Generalized Additive Models for Location Scale and Shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1–46. Stasinopoulos, M., Rigby, B., & Akantziliotou, C. (2008). Instructions on how to use the gamlss package in R (Second Edition). Taliei, F., Safaie, N., & Aghajani, M. A. (2013). Spatial Distribution of Macrophomina phaseolina and Soybean Charcoal Rot Incidence Using Geographic Information System (A Case Study in Northern Iran), 15, 1523–1536. Teetor, P., & Loukides, M. K. (2011). R cookbook. Sebastopol, CA; Beijing: O’Reilly. Tsai, P.-J. (2012). Application of Moran’s Test with an Empirical Bayesian Rate to Leading Health Care Problems in Taiwan in a 7-Year Period (2002–2008). Global Journal of Health Science, 4(5). doi:10.5539/gjhs.v4n5p63 Tobler, W. R. (1970). A Computer Movie Simulating Urban Growth in the Detroit Region. Economic Geography, 46(2): 234–240. Urdan, T. C. (2010). Statistics in plain English (Third Edition). 
New York: Routledge. Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. New York, NY: Springer-Verlag New York.
59
6 Annexes
Annex 1 – R Code utilised to test spatial autocorrelation

#Load the required packages (readShapePoints is provided by maptools)
library(maptools)
library(spdep)

#Loading dataset
farms.p <- readShapePoints("SHP/farms.shp")
#Coordinates matrix used by the neighbour functions and the plots
coords <- coordinates(farms.p)

##########----Neighbours list using different methods
#-----------------Graph based Neighbours
#Delaunay Triangulation
tri.list <- tri2nb(coords)
plot.nb(tri.list, coords)
nb_list.t <- nb2listw(tri.list, style = "W")
#Sphere of Influence
soi.1 <- soi.graph(tri.list, coords)
soi_nb <- graph2nb(soi.1)
plot.nb(soi_nb, coords)
nb_list.s <- nb2listw(soi_nb, style = "W")

#----------------Distance based Neighbours
#Nearest Neighbours
#Default k value - k = 1
k_near.1 <- knn2nb(knearneigh(farms.p))
plot.nb(k_near.1, coords)
#With k = 5
k_near.5 <- knn2nb(knearneigh(farms.p, k = 5))
plot.nb(k_near.5, coords)
#With k = 10
k_near.10 <- knn2nb(knearneigh(farms.p, k = 10))
plot.nb(k_near.10, coords)
#Neighbours Spatial Weights Lists
nb_list.k1 <- nb2listw(k_near.1, style = "W")
nb_list.k5 <- nb2listw(k_near.5, style = "W")
nb_list.k10 <- nb2listw(k_near.10, style = "W")

#Distance Bands
#Neighbours list based on distance: the upper bound is the largest
#first-nearest-neighbour distance, so every farm has at least one neighbour
k_dist <- nbdists(k_near.1, coords)
k_dist_vec <- unlist(k_dist)
max_dist <- max(k_dist_vec)
dist_nei <- dnearneigh(farms.p, d1 = 0, d2 = max_dist)
plot(dist_nei, coords)
nb_list.d <- nb2listw(dist_nei, style = "W")

#Moran's I Test with different neighbour lists
moran.test(farms.p$inc_raw, listw = nb_list.t, alternative = "two.sided")
moran.test(farms.p$inc_raw, listw = nb_list.s, alternative = "two.sided")
moran.test(farms.p$inc_raw, listw = nb_list.k1, alternative = "two.sided")
moran.test(farms.p$inc_raw, listw = nb_list.k5, alternative = "two.sided")
moran.test(farms.p$inc_raw, listw = nb_list.k10, alternative = "two.sided")
moran.test(farms.p$inc_raw, listw = nb_list.d, alternative = "two.sided")

#Geary's c Test with different neighbour lists
geary.test(farms.p$inc_raw, listw = nb_list.t, alternative = "two.sided")
geary.test(farms.p$inc_raw, listw = nb_list.s, alternative = "two.sided")
geary.test(farms.p$inc_raw, listw = nb_list.k1, alternative = "two.sided")
geary.test(farms.p$inc_raw, listw = nb_list.k5, alternative = "two.sided")
geary.test(farms.p$inc_raw, listw = nb_list.k10, alternative = "two.sided")
geary.test(farms.p$inc_raw, listw = nb_list.d, alternative = "two.sided")

#Moran's I using Empirical Bayes Estimates
#EBest takes the number of diseased plants and the number of evaluated plants (population at risk)
ebi.1 <- EBest(farms.p$diseased_p, farms.p$evaluated_, family = "binomial")
moran.test(ebi.1$estmm, listw = nb_list.t, alternative = "two.sided")
moran.test(ebi.1$estmm, listw = nb_list.s, alternative = "two.sided")
moran.test(ebi.1$estmm, listw = nb_list.k1, alternative = "two.sided")
moran.test(ebi.1$estmm, listw = nb_list.k5, alternative = "two.sided")
moran.test(ebi.1$estmm, listw = nb_list.k10, alternative = "two.sided")
moran.test(ebi.1$estmm, listw = nb_list.d, alternative = "two.sided")
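To make explicit what moran.test and geary.test measure, both statistics can be computed directly in base R. The sketch below is illustrative only: it uses five hypothetical sites along a line with made-up values (not the farms dataset), treats adjacent sites as neighbours, and row-standardises the weights, which is assumed here to correspond to style = "W" in nb2listw.

```r
# Hypothetical toy data (not the thesis dataset): 5 sites along a line
x <- c(2, 4, 6, 8, 10)
n <- length(x)

# Binary adjacency matrix: each site neighbours the sites next to it
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}
W <- W / rowSums(W)  # row-standardise so every row sums to 1

z  <- x - mean(x)    # deviations from the mean
s0 <- sum(W)         # sum of all weights

# Moran's I: weighted cross-products of deviations over total variation
moran_i <- (n / s0) * sum(W * outer(z, z)) / sum(z^2)

# Geary's c: weighted squared differences between neighbouring values
geary_c <- ((n - 1) / (2 * s0)) * sum(W * outer(x, x, "-")^2) / sum(z^2)

# Expected Moran's I under no spatial autocorrelation
expected_i <- -1 / (n - 1)
```

For this smooth gradient, moran_i is 0.6 (against an expected value of -0.25) and geary_c is 0.2 (below 1), so both statistics point to positive spatial autocorrelation in the toy data.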
Annex 2 – R Code utilised for the GAMLSS

#Miscellaneous code for variable construction
#Two-column response: diseased plants and evaluated plants per farm
#(note that glm with family = binomial reads the two columns as successes and failures)
dis_inc <- cbind(farms.p$diseased_p, farms.p$evaluated_)
variety.f <- factor(farms.p$Variety, levels = c(1, 2, 3), labels = c("Isla", "Seda", "Mix"))

#Base model GLM with all the explanatory variables
glm.0 <- glm(formula = dis_inc ~ pH + altitude + slope + area + planting_d + Average_So + factor(farm_type) * variety.f + shade_perc, data = farms.p, family = binomial)

#Base model with all the explanatory variables
library(gamlss) #Load the gamlss package
gamlss.1 <- gamlss(formula = dis_inc ~ pH + altitude + variety.f + planting_d + slope + area + Average_So + variety.f * factor(farm_type) + shade_perc, family = BB, data = farms.p)
summary(gamlss.1)
plot(gamlss.1)
#Stepwise selection based on the Generalised Akaike Information Criterion
stepGAIC(gamlss.1)

#Fitted model after selection using stepGAIC
gamlss.2 <- gamlss(formula = dis_inc ~ pH + altitude + planting_d + Average_So, family = BB, data = farms.p)
summary(gamlss.2)
#Compare the full and the reduced model
GAIC(gamlss.1, gamlss.2)
plot(gamlss.2)
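The beta-binomial family is used above because disease counts per farm tend to vary more than a plain binomial allows. As a rough illustration of that extra variance, the base R sketch below evaluates a beta-binomial probability mass function in the common (alpha, beta) parameterisation. All names and values are hypothetical, and the BB family in gamlss uses its own mu/sigma parameterisation, which this sketch does not reproduce.

```r
# Beta-binomial probability mass function, (alpha, beta) parameterisation.
# Hypothetical helper for illustration; not part of the gamlss package.
dbetabinom_ab <- function(y, n, alpha, beta_par) {
  choose(n, y) * beta(y + alpha, n - y + beta_par) / beta(alpha, beta_par)
}

# Made-up example: 10 evaluated plants, alpha = 2, beta = 3
n_plants <- 10
alpha    <- 2
beta_par <- 3
y <- 0:n_plants
p <- dbetabinom_ab(y, n_plants, alpha, beta_par)

# Mean incidence implied by the parameters
pi_mean <- alpha / (alpha + beta_par)

# Moments of the beta-binomial computed from the pmf
bb_mean <- sum(y * p)
bb_var  <- sum((y - bb_mean)^2 * p)

# Variance of a plain binomial with the same mean incidence
bin_var <- n_plants * pi_mean * (1 - pi_mean)
```

With these values both distributions have a mean of 4 diseased plants, but the beta-binomial variance is 6 against 2.4 for the binomial, which is the overdispersion the BB family is meant to absorb.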