Application of Logistic Regression Model in an Epidemiological Study
Renhao Jin, Fang Yan, Jie Zhu
School of Information, Beijing Wuzi University, Beijing, China
Email address:
To cite this article:
Renhao Jin, Fang Yan, Jie Zhu. Application of Logistic Regression Model in an Epidemiological Study. Science Journal of Applied Mathematics and Statistics.Vol. 3, No. 5, 2015, pp. 225-229.
doi: 10.11648/j.sjams.20150305.12
Abstract: This paper use the logistic regression model to an epidemiological study, i.e. bovine tuberculosis (bTB) occurrence in cattle herds, together with well-established risk factors in the area known as West Wicklow, in the east of Ireland. The binary target variable is whether the herd is in the restricted status, which is defined by whether any bTB reactor is detected in the herd. With the stepwise variables selection procedure, a final logistical regression model is found to adequately describe the data. Herd bTB incidence was positively associated with annual total rainfall, herd size and a herd bTB history in the previous three years, and presence /absence of commonage.
Keywords: Logistic Regression, Bovine Tuberculosis, Stepwise Variables Selection
1. Introduction
Logistic regression is firstly developed by statistician D.R. Cox in 1958 as a statistical method, and after that it is used widely in many fields, including the medical and social sciences. Recent years as the big data society comes, Logistic regression model is also extensively used in many data mining applications, such as, credit risk models in banking industry, customer preference models in retails, and even segment of customers in all areas of business. For example, in economics it can be used to predict the likelihood of a person's choosing to be in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Logistic regression is a direct probability model, which is called as logit regression or logit model. Unlike regression model, which is used to model or predict a continuous target variable, the logit model is applied for binary target variable.
Logistic regression measures the relationship between the binary dependent variable and one or more independent variables (explanation variables), which are usually (but not necessarily) continuous, by estimating probabilities. More detailed, for a binary dependent variable with value 0 or 1, corresponding to the status of an observation, the logit model is written as
(1)
where is the expect value of , i.e., the probability of , , and the part of is the linear part of independent variables. The equation (1) is also can be transformed to be
(2)
The important feature of the logit function is that it produces values that lie only between 0 and 1, just as the probability of response and non-response dependent variables, as shown in equation (2).
The approach to fit logit model is often based on maximum likelihood method, as logit model predicts probabilities rather than just classes. For each observation with explanation variables marked as and target variable , let . The likelihood function for observations can be written as
With the procedure of maximum likelihood estimation, the parameters’ estimates and their variances all can be obtained from likelihood method.
The epidemiological study in this paper is based on aggregated bovine tuberculosis (bTB) data in cattle herds from 2005 to 2009, together with well-established risk factors in the area known as West Wicklow, in the east of Ireland. The bTB data is from the first author’s Ph.D thesis, and the other related part of the bTB study has been published in Veterinary Record (2013).
Bovine tuberculosis (bTB), caused by infection with Mycobacterium bovis, affects approximately 0.3% of cattle annually in Ireland, with 18,531 reactor cattle identified in 2011. This has major financial implications both for the farmer whose herd is restricted from trading and cattle slaughtered, and for the exchequer that compensates the farmer and implements measures to control the disease. Data for the bTB study were obtained from three sources: herd data from the national databases of bTB testing herd and animal history (Animal Health Computer System, AHCS); land usage from Herdfinder, a unique multi-layered purpose built spatial mapping system whereby farms shapes submitted by farmers to DAFM under the EU Single Farm Payment Scheme are recorded and weather data from Met Éireann, all for West Wicklow. Both AHCS and Herdfinder databases use the same herd ID number so that farm, geographic location and testing data may be linked. The spatial distribution of herds and rainfall stations, and the study areas is shown in Figure 4.1.
2. Statistical Methods
A logistic regression model was built to determine the relationship between bTB incidence in cattle herds and potential risk factors (explanatory variables) from 2005 to 2009. The herd target variable is binary, indicating whether any bTB reactor is detected in the herd, and the herd with target value 1 is with restriction status. So the response variable is binary: restriction status of the herd in year (1=restricted, 0=not restricted), where denoting the years 2005-2009. In this paper, the observations from different herds were set to be independent and yearly outcomes observed on the same herd were also independent. This assumption has been verified by first author’s Ph.D thesis.
A logistic linear model was used to fit the data with a logit link function:
(3)
where the matrix was the design matrix for the explanatory variables and is a vector of unknown explanatory parameters. A model selection method similar to Ma et al. (2010) was carried out. As we have more than 30 explanatory variables, before building the logit model, we first examined associations between the response variable and each explanatory variable in a univariate analysis using Spearman’s rank correlation coefficient. Many explanatory variables were skewed and outliers were present and Spearman’s rank correlation was chosen as it is not sensitive to outliers. Explanatory variables were considered for inclusion in the logit model if an association significant at the 0.1 level was found from the univariate analysis. Then a forward stepwise approach was used to build a logit model by adding explanatory variables step by step keeping only those significant at the 5% level until the best-fit model was found. Stepwise approach adjusted for possible effects of collinearity among significant variables in the univariate analysis. Herd bTB history variables, weather variables and geographical variables (total farm area and total farm perimeter) formed three classes of possibly correlated explanatory variables. The correlation among variables in each class was tested using Pearson’s or Spearman’s rank correlation whichever was appropriate. Firstly, the significant variables in the univariate analysis but not in these three classes were found for the logit model. Then for the three classes, starting with the geographical class, only the significant variables in the univariate analysis were considered for inclusion in the multivariate model. A stepwise procedure was used to choose which of these should be retained in the multivariate model as many were correlated. Then the non-significant variables in the univariate analysis and interactions were tested by adding them to the reduced multivariable model, one at a time. This process was continued with the herd bTB history class variables added next and then the weather class variables. The order of entering classes into the model did not change the final model. Transformations including the natural logarithmic (log) and square root (sqrt) transformations were made on the skewed variables or those with extreme values to check whether it could improve model fit. The variable herd size has been found to be an important risk factor in many studies and in the final model log (herd size) or sqrt (herd size) were checked for linearity by the inclusion of quadratic terms. The variable number of introduced cattle was positively skewed with many zeros. However, it was not significant in any model. Categorizing it as 0 or greater than 0 (or transforming it) also did not lead to significance in any model.
Models were fitted using the Logistic procedure and Enterprise Miner in SAS version 9.4 (SAS Institute Inc., Cary, NC, USA).
3. Results
From 2005 to 2009, there were 609 distinct herds in the study, giving 2666 observations. Table 1 presents the number of herds and the percentage restricted on an annual basis, and the total herds and percentage restricted for each year keep stable and are around 540 and 4% respectively. Table 2 shows summary statistics for selected covariates used in the study, and these results are displayed by two distinct groups as restricted and non-restricted. As shown in Table 2, these covariates, except for the weather variables, are skewed to the right. Comparing the covariates between in two groups, the herd size, total farm area and farm perimeter are all relatively greater in restricted group then those in non-restricted group.
Year | Total herds | Number of restricted herds | Percentage restricted* |
2005 | 555 | 25 | 0.045 |
2006 | 550 | 22 | 0.04 |
2007 | 530 | 17 | 0.032 |
2008 | 517 | 29 | 0.056 |
2009 | 514 | 29 | 0.056 |
*Percentage restricted= Number of herds restricted/ Total number of herds.
Variable name | Unit | Non-restricted herds | Restricted herds | ||||||
Mean | Median | 1st quartile | 3rd quartile | Mean | median | 1st quartile | 3rd quartile | ||
Herd size | 71 | 42 | 18 | 98 | 139 | 91 | 50 | 198 | |
Number of introduced adult cattle | 15 | 3 | 0 | 13 | 23 | 3 | 0 | 16 | |
Annual total rainfall | dm | 12 | 12.2 | 9.6 | 13.8 | 12.8 | 12.5 | 9.6 | 15.1 |
Annual maximum monthly rainfall | dm | 1.8 | 1.7 | 1.5 | 2.1 | 1.9 | 1.8 | 1.5 | 2.1 |
Annual mean monthly temperature | ℃ | 10 | 10.2 | 9.7 | 10.3 | 10 | 10.2 | 9.7 | 10.3 |
Annual averages of monthly average VPD | hPa | 2.3 | 2.3 | 2.2 | 2.3 | 2.3 | 2.3 | 2.2 | 2.3 |
total farm area | km2 | 1.8 | 0.5 | 0.2 | 1.2 | 2.5 | 1 | 0.5 | 1.8 |
total farm perimeter | km | 12.1 | 8.6 | 4.3 | 15.6 | 17.5 | 14.9 | 7.9 | 22.6 |
% herds with commonage | % | 17.7 |
km, kilometer; dm, decimeter=0.1 meters; VPD, vapour pressure deficit (a measure of atmospheric dryness); herd size was the total number of cattle in a herd.
In the univariate analysis, herd bTB restriction status was significantly associated with 15 explanatory variables (Table 3 (a)). The remaining variables which were not significantly associated herd bTB restriction status are listed in Table 3 (b). Herd size and presence /absence of commonage were entered the regression model first followed by geographical variables, herd bTB history variables and weather variables together with interactions.
Table 3 (a).
Explanatory variables | Spearman’s correlation coefficient | P value |
Herd size | 0.14 | <.0001 |
Presence /absence of commonage | 0.03 | 0.08 |
Total farm area | 0.11 | <.0001 |
Total farm perimeter | 0.12 | <.0001 |
Herd bTB history 1 | 0.09 | <.0001 |
Herd bTB history 2 | 0.09 | <.0001 |
Herd bTB history 3 | 0.08 | <.0001 |
Herd bTB history of past 3 years | 0.12 | <.0001 |
Annual total rainfall | 0.05 | 0.01 |
Annual max monthly rainfall | 0.04 | 0.03 |
Annual mean monthly temperature | -0.04 | 0.04 |
Temperature.A3 | 0.04 | 0.04 |
Annual mean monthly VPD | -0.04 | 0.05 |
VPD.A2 | -0.04 | 0.03 |
VPD.P3 | 0.04 | 0.05 |
Table 3 (b).
Explanatory variables | Spearman’s correlation coefficient | P value |
Herd type | 0.0007 | 0.97 |
Number of introduced adult cattle | 0.0044 | 0.82 |
Presence /absence of introduced adult cattle | 0.0099 | 0.61 |
Annual maximum monthly temperature | 0.0018 | 0.93 |
Annual minimum monthly temperature | -0.0190 | 0.34 |
Annual maximum monthly VPD | 0.0018 | 0.93 |
Annual minimum monthly VPD | 0.0253 | 0.20 |
Annual range of monthly temperature | 0.0096 | 0.63 |
Annual range of monthly VPD | -0.0131 | 0.51 |
Temperature.A1 | 0.0096 | 0.63 |
Temperature.A2 | 0.0229 | 0.25 |
Temperature.P1 | 0.0296 | 0.14 |
Temperature.P2 | 0.0006 | 0.97 |
Temperature.P3 | 0.0229 | 0.25 |
VPD.A1 | 0.0063 | 0.75 |
VPD.A3 | -0.0199 | 0.31 |
VPD.P1 | 0.0078 | 0.69 |
VPD.P2 | 0.0295 | 0.14 |
Three years herd bTB history was highly correlated with other history variables (Pearson correlation coefficients > 0.50, p-value < 0.001) while correlations among the yearly history variables were low. Farms with large area typically have large perimeters, and thus the correlation coefficient between area and perimeter is high at 0.636 (p-value < 0.0001). For some herds, rainfall variables from the nearest rainfall station were not available and in such cases data from the second nearest rainfall station were substituted. Of the 14 rainfall stations, nine were found to be the nearest stations for some herd, and in the total of 2666 herd observations, 1008 used rainfall data from the Glen of Imaal station. There were 18 weather variables in our study and many were inter-related with each other. Six were significant in the univariate selection procedure and were entered into the multivariate model one at a time. As shown in Table 4, these six variables were relatively high correlated with each other with Pearson’s correlation coefficient ranging in absolute value from 0.134 to 0.907. Thus, only one weather variable was entered into the final model.
Correlation coefficients | Annual max monthly rainfall | Annual total rainfall | Annual mean monthly temperature | Temperature A3 | Annual mean monthly VPD | VPD A2 | VPD P3 |
Annual max monthly rainfall | 1 | 0.9* | -0.6* | 0.35* | 0.57* | -0.38* | 0.3* |
Annual total rainfall | 1 | -0.5* | 0.36* | -0.4* | -0.26* | 0.19* | |
Annual mean monthly temperature | 1 | -0.77* | 0.73* | 0.72* | -0.71* | ||
Temperature.A3 | 1 | -0.13* | -0.19* | 0.56* | |||
Annual mean monthly VPD | 1 | 0.91* | -0.47* | ||||
VPD.A2 | 1 | -0.67* | |||||
VPD.P3 | 1 |
*p value<0.0001.
Tables 5 shows the results of the best fitting logit model together with odds ratios for the variables. In the final model, herd bTB occurrence was positively associated with log (herd size), annual total rainfall, herd bTB history of past three years, and presence /absence of commonage. An increase in herd size from approximately 40 to 100 corresponded to an odds ratio of 1.72 in herd bTB occurrence. An increase of 2 dm in annual total rainfall increased the odds of herd bTB by a factor of 1.17. Here the statistical significance of the coefficient for annual total rainfall is small enough to allow for the fact that more than one alternative model was tested. Presence versus absence of herd bTB in the past 3 years increased the odds of herd bTB by a factor of 2.32. Presence versus absence of commonage increased the odds of herd bTB by a factor of 1.49. Log transformations of herd size did make this variable more significant in the logit model. An added-variable plot was used to indicate whether a quadratic term, (log (herd size))^{2}, was needed in addition to log (herd size) in the final model (Collett, 2002, pp.169-176). There was no indication that a quadratic term was needed. Similar plots were constructed for other covariates in the model as a check on model adequacy. However, in the final logit model, the generalized chi-square statistic measuring the ratio of residual sum of squares with its degrees of freedom was computed to check for over-dispersion and the ratio was approximately one indicating no over-dispersion in either model. So the final logistic model results shown in Table 5 is adequate for the bTB data.
Fixed effect | Estimate | 95% C.I. | p Value | Odds ratio | |
Lower | Upper | ||||
Intercept | -5.606 | -6.863 | -4.349 | <0.0001* | |
Log (herd size) | 0.631 | 0.432 | 0.83 | <0.0001* | 1.72 |
Annual total rainfall | 0.077 | 0.0185 | 0.135 | 0.02 | 1.17 |
Presence versus absence of herd bTB in past 3 years | 0.841 | 0.42 | 1.261 | <0.0001* | 2.32 |
presence versus absence of commonage | 0.397 | -0.056 | 0.85 | 0.07 | 1.49 |
C.I., confidence interval; *p value<0.05.
Acknowledgements
The first author wish to thank to DAFM for providing all the herd data through CVERA, University College Dublin, and Met Éireann for providing weather data. This paper is funded by the project of National Natural Science Fund, Logistics distribution of artificial order picking random process model analysis and research (Project number: 71371033); and funded by intelligent logistics system Beijing Key Laboratory (No.BZ0211); and funded by scientific-research bases---Science & Technology Innovation Platform---Modern logistics information and control technology research (Project number: PXM2015_014214_000001); University Cultivation Fund Project of 2014-Research on Congestion Model and algorithm of picking system in distribution center (0541502703).
References