Spline Regression in the Estimation of the Finite Population Total
Joseph Kipyegon Cheruiyot
Department of Computer and Statistics, Moi University, Eldoret, Kenya
To cite this article:
Joseph Kipyegon Cheruiyot. Spline Regression in the Estimation of the Finite Population Total. Science Journal of Applied Mathematics and Statistics. Vol. 3, No. 5, 2015, pp. 214-224. doi: 10.11648/j.sjams.20150305.11
Abstract: This study sought to estimate the finite population total using a spline regression function. It compared the spline regression estimator with the Sample Mean estimator and with design-based and model-based estimators. To measure the performance of each estimator, the study considered the average bias, the efficiency as measured by the mean square error, and the robustness as measured by the rate of change of efficiency. Five populations were used. Three of them were simulated according to the following models: linear homoscedastic, quadratic homoscedastic and linear heteroscedastic. The other two were natural populations. The performance of the five estimators was studied under the five populations. The study found that the Sample Mean (SM), Horvitz-Thompson (HT) and Ratio (R) estimators are not robust, while the Nadaraya-Watson (NW) and Periodic Spline (PS) estimators are robust when linearity and homoscedasticity of the population structure are violated.
Keywords: Homoscedasticity, Population, Sample, Spline Regression, Robustness, Smoothing, Estimator
There are two generally accepted options for studying the characteristics of a finite population. The first is a census, a study in which every unit of the population is examined. Using a census to study a population is time consuming, expensive, often impossible and, strangely enough, often inaccurate. The other option is to study the characteristics of a population by examining a part of it. The theory of survey sampling, as developed during the past several decades, provides various reasonable scientific tools for drawing samples and making valid inferences about the population parameters of interest.
1.1.1. Census Versus Sampling Method
Although the census method has advantages, the cost, effort and time required to conduct a census may be enormous unless the population is very small. In such a case we resort to sampling, which involves examining a part of the population. Although a census gives more reliable data, sampling is more appropriate when:
i. The cost of conducting census would be prohibitive.
ii. The population is large, such that it would be impossible to conduct a census.
iii. The study involves destruction of elementary units under study, such that it would be appropriate to conduct sample testing.
iv. Quick results are required, such that it would be appropriate to conduct sample survey rather than carrying out a complete count.
1.1.2. Basic Ideas of Sampling and Estimation
In the basic sampling setup, the population consists of a known finite number N of units, such as people or plots. With each unit is associated a value of a variable of interest, sometimes referred to as the y-value of that unit. The y-value of each unit in the population is an unknown quantity. However, the units in the population are identifiable and may be labeled with the numbers 1, 2, ..., N. A sample of the units in the population is selected and observed. The data collected consist of the y-value for each unit in the sample together with the unit's label. The procedure by which the sample units are selected from the population is called the sampling design. With most of the well-known sampling designs, the design is determined by assigning to each possible sample s the probability p(s) of selecting that sample. For example, under the simple random sampling design, every possible sample of size n has the same probability p(s) of being selected.
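The idea of a design p(s) can be illustrated with a minimal sketch of simple random sampling without replacement. The function names, the seed and the sizes N = 76 and n = 40 (borrowed from the empirical section) are illustrative assumptions, not code from the study.

```python
import math
import random

def srswor(N, n, rng):
    """Draw n distinct unit labels from 1..N, every subset being equally likely."""
    return sorted(rng.sample(range(1, N + 1), n))

def srs_design_probability(N, n):
    """p(s) for any one particular sample of size n under this design."""
    return 1.0 / math.comb(N, n)

rng = random.Random(2015)          # fixed seed, for reproducibility only
sample = srswor(N=76, n=40, rng=rng)
p_s = srs_design_probability(76, 40)
inclusion_prob = 40 / 76           # chance that a given unit appears in the sample
```

Under this design p(s) is the same for every sample, so the design is fully described by N and n.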
1.2. Estimation Approaches
To estimate the finite population total $T$ in a survey, where

$$ T = \sum_{i=1}^{N} Y_i, $$

we need the survey variables $Y_i$ ($i = 1, 2, \ldots, N$) and the design (auxiliary) variables $x_i$ ($i = 1, 2, \ldots, N$). The approaches to the estimation of the finite population total considered in this study are listed below.
1.2.1. Design - Based Approach
This is also known as the classical approach. In this approach, the variables of interest of the target population are viewed as fixed quantities. The design introduces selection probabilities that determine the properties of the estimators, and these are used to obtain expected values, variances, biases and so on. The samples are generated by the sampling design p(s) with the population values held fixed. Repetition of the sample drawing procedure forms the basis of the randomization framework. The approach assumes that models have no relevance to the inferential framework. In experimental design, randomization is employed to protect the experimenter against subjective biases. Scott and Smith (1975) extended results of Blackwell. According to Fisher, randomization was relevant before the data were collected but not in the analysis of the data, which is in agreement with most statisticians in the experimental sciences. Randomization is therefore an insurance against selection bias.
1.2.2. Model – Based (Prediction) Approach
Royall (1976) introduces the concept of the super population thus: "The finite population should itself be regarded as a random sample from some infinite population". Hence the finite population is assumed to be generated as a random sample from a super population. The variables of interest are viewed as random variables, and the properties of estimators depend on the joint distribution of these random variables. A sample is selected from the finite population using a known sampling scheme. Observations are then made on the sample values and used to make predictions about the non-sample values. In this case the model connects a variable of interest Y with a set of auxiliary variables X (Cox, 1995). However, the choice of a model and its robustness to misspecification is the major issue. A small deviation from the chosen model may lead to serious errors in inference. Sometimes the models become mathematically complex while still not being suitably realistic (Thompson, 1992). For example, a model that assumes independence of the variable being studied ignores the tendency in many populations for nearby or related units to be correlated.
1.2.3. Non-Parametric Approach
The parametric method of estimation is used when it is assumed that the data are drawn or generated from one of the known parametric families of distributions. In many cases, however, the experimenter does not know the form of the underlying distribution and needs statistical techniques which are applicable regardless of that form. These techniques are referred to as non-parametric or distribution-free methods. They apply to very wide families of distributions rather than only to families specified by a particular functional form, and they do not require the various assumptions about the distribution of the population from which the sample was obtained. The main idea behind this class of models is that the effect of an explanatory (design) variable on the dependent variable of interest is not modeled as a parametric, usually linear, function but is kept flexible. The only assumption needed is that the effects of the explanatory variables are smooth, i.e. differentiable, functions. The functional shape is then estimated from the data using either kernel-based methods or spline-based methods.
Kernel Based Method.
The kernel estimator is expressed in terms of a kernel function $K$ which satisfies the condition

$$ \int_{-\infty}^{\infty} K(x)\,dx = 1. \tag{1} $$

Usually, but not always, $K$ will be a symmetric probability density function, the normal density for instance. According to Silverman (1986), the kernel estimator of the density function with kernel $K$ is defined by

$$ \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right), \tag{2} $$

where $h$ is the bandwidth. The kernel estimator is a sum of 'bumps' placed at the observations. Each individual bump is created by the kernel $K$ centred at an observation $X_i$, with its width set by $h$, and the estimate is the resultant curve obtained by adding the bumps up.
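The "sum of bumps" description can be made concrete with a short sketch. It assumes a Gaussian kernel, a hand-picked bandwidth and made-up observations; none of these values come from the study.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, a symmetric kernel that integrates to one."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_density(x, data, h):
    """f_hat(x) = (1 / (n h)) * sum_i K((x - X_i) / h): one bump per observation."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

data = [1.0, 1.5, 2.0, 4.0, 4.2]   # hypothetical observations
h = 0.5                            # hypothetical bandwidth

# Riemann-sum check that the estimate is itself a density (area close to one).
step = 0.01
grid = [i * step for i in range(-400, 1001)]   # covers -4.00 .. 10.00
area = sum(kernel_density(x, data, h) for x in grid) * step
```

A smaller `h` gives narrower, taller bumps and a rougher estimate; a larger `h` smooths neighbouring bumps together.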
Spline Based Method.
The name "spline function" was given by I. J. Schoenberg (1946) to the piecewise polynomial functions known as univariate polynomial splines. The name reflects their resemblance to the curves obtained by draftsmen using a mechanical spline: a thin flexible rod with a groove and a set of weights called "ducks", used to position the rod at the points through which it was desired to draw a smooth interpolating curve. The basic idea dates back at least to Whittaker (1923). More recent papers on the subject include Wahba (1975), Smith (1979) and Silverman (1985), among others. For kernel regression estimation, a weighting scheme due to Nadaraya (1964) and Watson (1964) has been associated with random designs, and a convolution-type weighting scheme with fixed designs. Judged by mean square error, neither of the estimators is uniformly optimal in either design. The multitude of non-parametric regression estimators is an issue of considerable practical and theoretical importance. A wide class of estimators studied by Jennen-Steinmetz and Gasser (1988) included fixed-width kernel estimators, smoothing splines and nearest-neighbor estimators as particular cases. No estimator is uniformly best in terms of integrated mean squared error, but the kernel estimator turns out to be minimax optimal. Since non-parametric methods are usually intended to be applicable to a broad variety of situations, the minimax property is an important safeguard. Two definitions of kernel weights enjoy particular popularity: the Nadaraya-Watson type (Nadaraya 1964, Watson 1964) and the convolution type (Priestley and Chao 1972, Gasser and Müller 1979). The Nadaraya-Watson method is intuitively motivated as an estimator of a conditional expectation, which suggests a context where the independent variable is random. Hence this method seems suited to a situation of randomly selected design points, whose distribution is determined by the design density.
A spline function is a piecewise defined function with certain smoothness conditions. The most commonly used form is the cubic spline. There are two sorts of splines: ordinary splines and B-splines. The two have the same general structure, namely a piecewise defined function such as

$$ S(x) = \begin{cases} s_1(x), & t_0 \le x < t_1, \\ \;\;\vdots \\ s_k(x), & t_{k-1} \le x \le t_k, \end{cases} \tag{3} $$

together with the smoothness conditions. The difference is that ordinary splines go through all the data points exactly, whereas B-splines do not necessarily fit the data exactly. For ordinary splines, the curve has to pass through all the points, so equation (3) has to be satisfied at every data point: the spline function has to yield the value $y_i$ at $x_i$, that is, $S(x_i) = y_i$. The smoothness conditions have to be fulfilled as well. B-splines are piecewise defined functions, usually polynomial, with the same smoothness conditions as ordinary splines. They are, however, not forced through the data points exactly; the function simply has to come close to the data points.
In the estimation of the finite population total, the challenge is to identify an estimator that is efficient when the population structure is not known. This study compares the spline regression estimator with known estimators: the nonparametric Nadaraya-Watson estimator, the Sample Mean estimator, the design-based Horvitz-Thompson estimator and the model-based Ratio estimator. The aim is to obtain an estimator which is robust to the violation of both linearity and homoscedasticity of the population structure.
2.1. Non-Parametric Estimation of the Population Total Using Kernels
In this section, the Nadaraya-Watson Kernel estimator is considered. It is assumed that the auxiliary information is available for the entire population and the auxiliary variable X and the study variable Y are related in a more general way.
Consider the model

$$ y_i = m(x_i) + e_i, \tag{4} $$

where $m(x_i)$ is the mean function and $e_i$ is a random error term. The functional form of $m(x_i)$ is unknown, but it is assumed to be smooth and continuous.

Let $w_i(x)$, $i = 1, 2, \ldots, n$, be the weight function built from a kernel function $k$. The kernel is a continuous, bounded and symmetric function which integrates to one, that is,

$$ \int k(u)\,du = 1. $$

Take $k_h(u) = h^{-1}k(u/h)$ to be the kernel with bandwidth $h$. The weight sequence for kernel smoothers given by Nadaraya (1964) and Watson (1964) is

$$ w_i(x) = \frac{k_h(x - x_i)}{\sum_{j=1}^{n} k_h(x - x_j)}. \tag{5} $$

The Nadaraya-Watson estimator of $m(x)$ in (4) is

$$ \hat{m}(x) = \sum_{i=1}^{n} w_i(x)\, y_i. \tag{6} $$

Substituting (5) in (6) we have

$$ \hat{m}(x) = \frac{\sum_{i=1}^{n} k\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_{j=1}^{n} k\!\left(\frac{x - x_j}{h}\right)}. \tag{7} $$

The shape of the kernel weights is determined by $k$, a symmetric probability density function that satisfies the condition in equation (1). Their spread is determined by the bandwidth: the smaller $h$ is, the more concentrated the weights are around $x$. The non-parametric regression based estimator $\hat{T}_{np}$ of the population total $T$ is given by

$$ \hat{T}_{np} = \sum_{i \in s} y_i + \sum_{j \notin s} \hat{m}(x_j), \tag{8} $$

where $\hat{m}(x_j)$ is the Nadaraya-Watson estimator given in (7). Substituting (7) in (8), the Nadaraya-Watson estimator of the population total becomes

$$ \hat{T}_{NW} = \sum_{i \in s} y_i + \sum_{j \notin s} \frac{\sum_{i \in s} k\!\left(\frac{x_j - x_i}{h}\right) y_i}{\sum_{i \in s} k\!\left(\frac{x_j - x_i}{h}\right)}, $$

where $\hat{T}_{NW}$ represents the Nadaraya-Watson estimator of the population total.
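The construction of the Nadaraya-Watson total (observed sample values plus kernel predictions for the non-sampled units) can be sketched as follows. This is an illustration under stated assumptions, not the study's code: a Gaussian kernel, weights normalized to sum to one, and a hypothetical toy population with a hand-picked bandwidth.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel k."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nw_fit(x0, xs, ys, h):
    """Nadaraya-Watson estimate of m(x0) from the sampled pairs (xs, ys)."""
    k = [gaussian_kernel((x0 - xi) / h) for xi in xs]
    denom = sum(k)
    if denom == 0.0:
        raise ValueError("bandwidth too small: no sample point near x0")
    return sum(ki * yi for ki, yi in zip(k, ys)) / denom

def nw_total(x_pop, sample_idx, y_sample, h):
    """Observed sample total plus NW predictions for the non-sampled units."""
    sampled = set(sample_idx)
    xs = [x_pop[i] for i in sample_idx]
    predicted = sum(nw_fit(x_pop[j], xs, y_sample, h)
                    for j in range(len(x_pop)) if j not in sampled)
    return sum(y_sample) + predicted

# Hypothetical toy population: y = 2 + 3x exactly, so the true total is known.
x_pop = [i / 10 for i in range(10)]
y_pop = [2 + 3 * x for x in x_pop]
sample_idx = [0, 2, 4, 6, 8]
y_sample = [y_pop[i] for i in sample_idx]
estimate = nw_total(x_pop, sample_idx, y_sample, h=0.1)
true_total = sum(y_pop)
```

Only the auxiliary values `x_pop` are needed for the non-sampled units, which matches the assumption that auxiliary information is available for the entire population.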
2.2. Properties of Nadaraya-Watson Kernel Estimator of the Population Total
The Nadaraya-Watson Kernel regression estimator is given as in (8) and (7)
In order to find a standard measure of estimation error, the mean square error (MSE), the study looked at the conditional mean and variance of the estimation error $\hat{T}_{NW} - T$ under the model (4).
where Xp is the population vector of X-values.
where is the standard Nadaraya-Watson estimator of the density.
Since under the model, we have;
Next, we look at the conditional error variance;
Since under the model, we have
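As a hedged sketch of how such conditional moments arise, the following derivation works directly from the prediction form of the estimator. It rests on assumptions that are not taken from the source: the weights are normalized so that $\sum_{i \in s} w_i(x) = 1$, the sample $s$ is held fixed, and the errors are independent with $E(e_i) = 0$ and $\mathrm{Var}(e_i) = \sigma^2(x_i)$. It does not use the density-estimator form referred to above.

```latex
\begin{align*}
\hat{T}_{NW} - T
  &= \sum_{j \notin s} \Big( \sum_{i \in s} w_i(x_j)\, y_i - y_j \Big), \\
E\big(\hat{T}_{NW} - T \mid X_p\big)
  &= \sum_{j \notin s} \Big( \sum_{i \in s} w_i(x_j)\, m(x_i) - m(x_j) \Big), \\
\mathrm{Var}\big(\hat{T}_{NW} - T \mid X_p\big)
  &= \sum_{i \in s} \Big( \sum_{j \notin s} w_i(x_j) \Big)^{2} \sigma^2(x_i)
     + \sum_{j \notin s} \sigma^2(x_j), \\
\mathrm{MSE}\big(\hat{T}_{NW} \mid X_p\big)
  &= \Big[ E\big(\hat{T}_{NW} - T \mid X_p\big) \Big]^{2}
     + \mathrm{Var}\big(\hat{T}_{NW} - T \mid X_p\big).
\end{align*}
```

The bias term is small when $m$ changes little over the range of sample points that receive appreciable weight, and the first variance term separates from the second because sampled and non-sampled errors are independent under these assumptions.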
2.3. Spline Regression Estimator of the Population Total
Wahba (1975) has shown that the kernel smoothing estimator is closely related to the smoothing spline estimator when the latter is represented approximately as a linear function of the data values $y_i$. Hence there exists a weight function $F(z, x_i)$ such that
where the function F(z,x) is defined as
F(z,x) = (10)
Hence we have
We get the smoothing spline estimator of the population Total as
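where $K(u)$ is defined as

$$ K(u) = 0.5\,\exp\!\left(-\frac{|u|}{1.41}\right) \sin\!\left(\frac{|u|}{1.41} + \frac{\pi}{4}\right), $$

with the constant 1.41 approximating $\sqrt{2}$. With $\sqrt{2}$ in place of 1.41, the function $K(u)$ has the following properties:

$$ \int K(u)\,du = 1, \qquad \int u^{2} K(u)\,du = 0, \qquad \int u^{4} K(u)\,du = -24, $$

and, since $K$ is symmetric, its odd moments vanish.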
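The moment properties of this kernel can be checked numerically. The sketch below uses the exact constant sqrt(2) rather than the rounded 1.41 and a plain Riemann sum; the integration range and step size are arbitrary choices made for the illustration.

```python
import math

SQRT2 = math.sqrt(2.0)

def spline_kernel(u):
    """K(u) = 0.5 * exp(-|u|/sqrt(2)) * sin(|u|/sqrt(2) + pi/4)."""
    a = abs(u) / SQRT2
    return 0.5 * math.exp(-a) * math.sin(a + math.pi / 4.0)

def moment(p, half_width=60.0, step=0.002):
    """Riemann-sum approximation of the integral of u**p * K(u) over the real line."""
    n = int(round(half_width / step))
    return sum((i * step) ** p * spline_kernel(i * step)
               for i in range(-n, n + 1)) * step

m0 = moment(0)   # total mass of the kernel
m2 = moment(2)   # second moment
m4 = moment(4)   # fourth moment, the first even moment that does not vanish
```

A vanishing second moment together with a non-zero fourth moment is what makes this a kernel of order 4, and it forces the kernel to take negative values on part of its range.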
We can see that the properties of the function K(.) above are similar to those given for the Kernel function but can take negative values as well. Hence the smoothing spline estimator corresponds approximately to a Kernel type estimator of order 4. Eubank (1988) has shown that if the function m(.) is assumed to be periodic then corresponds to a spline estimator with a fixed bandwidth parameter h and weights F(z,x) = hw(u/h) where h = and w(u) =
the estimator corresponding to the periodic spline is
where F(.) is as defined in (10),hence giving
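Since the weights $n^{-1}F(z, x_i)$ do not sum to one, we divide them by their sum and denote the modified weights by $F_R(z, x_i)$. Then

$$ F_R(z, x_i) = \frac{F(z, x_i)}{\sum_{j \in s} F(z, x_j)}. \tag{14} $$

Therefore the periodic spline estimator of the function $m(x)$ is given by

$$ \hat{m}_{PS}(z) = \sum_{i \in s} F_R(z, x_i)\, y_i. \tag{15} $$

Substituting (15) in (8), we have the periodic spline estimator of the population total as

$$ \hat{T}_{PS} = \sum_{i \in s} y_i + \sum_{j \notin s} \sum_{i \in s} F_R(x_j, x_i)\, y_i. $$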
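The normalization step and the resulting total can be sketched in code. Because the exact form of F(z, x) is not recoverable here, the sketch assumes the fixed-bandwidth weights F(z, x) = K((z - x)/h)/h with the spline kernel K (using sqrt(2) for 1.41); the toy population and the bandwidth are also hypothetical.

```python
import math

def spline_kernel(u):
    """K(u) = 0.5 * exp(-|u|/sqrt(2)) * sin(|u|/sqrt(2) + pi/4)."""
    a = abs(u) / math.sqrt(2.0)
    return 0.5 * math.exp(-a) * math.sin(a + math.pi / 4.0)

def normalized_weights(z, xs, h):
    """Assumed raw weights K((z - x_i)/h)/h, divided by their sum."""
    raw = [spline_kernel((z - xi) / h) / h for xi in xs]
    total = sum(raw)
    if abs(total) < 1e-12:   # K can be negative, so the sum may cancel
        raise ValueError("weights sum to (nearly) zero; choose another bandwidth")
    return [r / total for r in raw]

def ps_total(x_pop, sample_idx, y_sample, h):
    """Observed sample total plus spline-kernel predictions for non-sampled units."""
    sampled = set(sample_idx)
    xs = [x_pop[i] for i in sample_idx]
    predicted = 0.0
    for j in range(len(x_pop)):
        if j not in sampled:
            w = normalized_weights(x_pop[j], xs, h)
            predicted += sum(wi * yi for wi, yi in zip(w, y_sample))
    return sum(y_sample) + predicted

# Hypothetical toy population with an exactly linear relationship.
x_pop = [i / 10 for i in range(10)]
y_pop = [2 + 3 * x for x in x_pop]
sample_idx = [0, 2, 4, 6, 8]
y_sample = [y_pop[i] for i in sample_idx]
estimate = ps_total(x_pop, sample_idx, y_sample, h=0.3)
```

The explicit guard on the weight sum is a design choice specific to this kernel: unlike a Gaussian kernel, it is not non-negative, so normalization is not always safe.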
Next, we consider the conditional error variance;
let hence .
3. Empirical Results and Discussion
To identify a robust estimator, the study compared the performance of five estimators: the Horvitz-Thompson estimator, the Ratio estimator, the Sample Mean estimator, the Nadaraya-Watson kernel estimator and the periodic spline estimator (the spline regression estimator). Three populations were simulated from the following models: linear homoscedastic, quadratic homoscedastic and linear heteroscedastic. The study also used two real populations. The criteria for comparing the estimators are the average bias, the mean square error and the rate of change of efficiency as a measure of robustness.
3.1. The Choice of the Kernel and Bandwidth
This study used the Gaussian kernel in the Nadaraya-Watson estimator of the population total, which is defined as

$$ K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^{2}}{2}\right), $$

where $-\infty < u < \infty$. The kernel function $K$ is assumed to satisfy the condition given in equation (1). An optimal bandwidth for the Nadaraya-Watson smoother was chosen within the interval $\sigma/4 \le h \le 3\sigma/2$, where $\sigma$ is the standard deviation of the $x_i$ (Silverman, 1986). The bandwidth used was the centre point of this interval, $h = 7\sigma/8$. The kernel function used in the periodic spline is K(u) = 0.5 exp(-|u|/1.41) sin((|u|/1.41) + π/4) (Wahba, 1975).
3.2. Description of the Study Population and Estimators
The artificial population was simulated in the following manner.
a) In artificial population I, 76 data points were generated according to the model;
where , and = 0.5.
b) In artificial population II, we again generated 76 data points according to the model
c) In artificial population III, once more 76 data points were generated according to the model
where the quantities appearing in b) and c) are the same as in population I.
d) The real population IV was obtained from the Kenya National Bureau of Statistics (KNBS) and comes from the population census carried out in Kenya in 2009. In this population, the auxiliary variable Xi is the number of households in the ith district and the study variable Yi is the total population of the district, where i = 1, 2, ..., 76. For Nairobi province, divisions are used instead of districts. The variable of interest Y is the population total.
e) Population V has a variable X describing the shares a customer already possessed (Acquired) and a variable Y describing the shares applied for (Booked) in a stock exchange brokerage firm, both expressed in Kshs. Again, 76 data points were selected from this population.
The average bias and mean square error of the population total were computed for each of the following five estimators: Sample Mean, Horvitz-Thompson, Ratio, Nadaraya-Watson and periodic spline.
Below is a summary of the formulae used in computing their respective population total.
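For orientation, the three classical estimators can be written in a few lines using their standard textbook forms. This is a sketch and not necessarily the exact formulae of the study: the function names are illustrative, the inclusion probabilities are supplied by the caller, and the toy data are hypothetical.

```python
def sample_mean_total(y_s, N):
    """Expansion estimator: N times the sample mean."""
    return N * sum(y_s) / len(y_s)

def horvitz_thompson_total(y_s, pi_s):
    """Sum of sampled y-values, each weighted by 1 / inclusion probability."""
    return sum(y / p for y, p in zip(y_s, pi_s))

def ratio_total(y_s, x_s, x_pop_total):
    """Sample ratio of y to x, scaled up by the known population total of x."""
    return sum(y_s) / sum(x_s) * x_pop_total

# Hypothetical toy population with y exactly proportional to x.
x_pop = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pop = [2.0 * x for x in x_pop]
idx = [0, 2, 4]                          # a sample of n = 3 units
y_s = [y_pop[i] for i in idx]
x_s = [x_pop[i] for i in idx]

t_sm = sample_mean_total(y_s, N=len(y_pop))
t_ht = horvitz_thompson_total(y_s, [3 / 5] * 3)   # equal probabilities n/N assumed
t_r = ratio_total(y_s, x_s, sum(x_pop))
```

The ratio estimator reproduces the total exactly whenever y is proportional to x, which is the linearity assumption whose violation is examined in the results.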
The following are scatter diagrams showing the distributions of the five populations mentioned above.
This population appears to be linear with heteroscedastic variance structure.
The acquired shares and booked shares in this population structure appear to be uncorrelated.
3.3. Description of the Computation Procedure
For each artificial population of size 76, 30 replicate samples of size n = 40 were drawn by simple random sampling without replacement (SRSWOR) and the estimates computed. Similarly, for each real population of size 76, 30 replicate samples of size 40 were drawn by SRSWOR and the estimators of the population total computed. For the Horvitz-Thompson estimator, the sample units are selected with unequal probabilities, with $\pi_i$ denoting the probability that unit $i$ is included in the sample. The estimate of the population total is then obtained as $\hat{T}_{HT} = \sum_{i \in s} y_i / \pi_i$. For each population, we also compute the true population total $T = \sum_{i=1}^{N} y_i$.
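Define $\hat{T}_r$ as the population total estimator, where r = SM, R, HT, NW and PS. Then $\bar{\hat{T}}_r = \frac{1}{R} \sum_{i=1}^{R} \hat{T}_{ri}$, where $\hat{T}_{ri}$ is the population total estimate from the $i$th sample for the $r$th estimator and $R$ is the number of sample replicates. The bias of each estimator of the population total was computed as $B(\hat{T}_r) = \bar{\hat{T}}_r - T$. Thus the average bias of each estimator, for both the real and the artificial populations, is

$$ B_k(\hat{T}_r) = \frac{1}{R} \sum_{i=1}^{R} \left( \hat{T}_{ri}^{(k)} - T_k \right), $$

where k = 1, 2, 3, 4, 5 indexes the populations.
We define the mean square error to be

$$ \mathrm{MSE}(\hat{T}_r) = \mathrm{Var}(\hat{T}_r) + \left[ B(\hat{T}_r) \right]^{2}, $$

where $\mathrm{Var}(\hat{T}_r)$ is the unconditional variance of the estimator over the 30 replicates for the artificial and natural populations. Therefore, the mean square error in the estimation of both the artificial and natural population totals is given by:

$$ \mathrm{MSE}(\hat{T}_r) = \mathrm{Var}(\hat{T}_r) + \left( \bar{\hat{T}}_r - T \right)^{2}. $$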
The Relative Change in Efficiency (RCE) for each estimator was given by
Where j = 1,2,3,4.
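The replication procedure can be sketched as a small Monte Carlo loop. Everything specific here is an assumption made for illustration: the linear population model, the seed, the use of the expansion (sample mean) estimator, and the choice to divide by the number of replicates when computing the variance.

```python
import random
import statistics

def sample_mean_estimator(idx, x_pop, y_pop):
    """N times the mean of the sampled y-values (x_pop is unused here)."""
    return len(y_pop) * sum(y_pop[i] for i in idx) / len(idx)

def simulate(estimator, x_pop, y_pop, n, reps, rng):
    """Draw `reps` SRSWOR samples of size n; return (bias, variance, MSE)."""
    N = len(y_pop)
    true_total = sum(y_pop)
    estimates = [estimator(rng.sample(range(N), n), x_pop, y_pop)
                 for _ in range(reps)]
    bias = sum(estimates) / reps - true_total
    variance = statistics.pvariance(estimates)   # divisor is the number of replicates
    mse = variance + bias ** 2
    return bias, variance, mse

rng = random.Random(76)
# Hypothetical linear homoscedastic population of size 76.
x_pop = [i / 76 for i in range(1, 77)]
y_pop = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in x_pop]

bias, variance, mse = simulate(sample_mean_estimator, x_pop, y_pop,
                               n=40, reps=30, rng=rng)
```

With the variance divided by the number of replicates, the returned MSE equals the average squared deviation of the replicate estimates from the true total.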
3.4. Results and Interpretations
The results of this study are summarized in Tables 1 to 5. For each population, the performance of each estimator is analyzed using the average bias and the mean square error. The average bias indicates how close an estimator is to the true value, while the MSE is used to assess the efficiency of an estimator. An estimator is said to be more efficient than another if its MSE is smaller, i.e. if MSE(T1) < MSE(T2), where T1 and T2 are estimators, then T1 is said to be more efficient than T2.
Population Total 80.29158
In population I, the low values of the bias show that all five estimators perform well under these conditions. Nadaraya-Watson has the least bias, followed by SM, Horvitz-Thompson, Periodic Spline and the Ratio estimator, in that order. Looking at the MSE for this population, the SM estimator has the lowest MSE, followed by Nadaraya-Watson, Periodic Spline and Ratio. The H-T estimator has the highest MSE in this population. However, the MSE values of these estimators are lower for this population than for the other populations. This implies that the estimators are highly efficient under a linear and homoscedastic population structure, with the sample mean, which has the least MSE, being the most efficient in this population.
In population II, SM has the least absolute bias, followed by Nadaraya-Watson, Periodic Spline, Horvitz-Thompson and lastly the Ratio estimator. Looking at the MSE, Nadaraya-Watson and Periodic Spline both have low MSE, followed by SM, H-T and lastly the Ratio estimator. Here we note that Nadaraya-Watson is the best estimator for a quadratic and homoscedastic population, while the Ratio estimator has the highest MSE, making it the least efficient estimator for this population. This is expected, because the ratio estimator is based on the assumption of linearity and breaks down when that assumption is violated.
In population III, Nadaraya-Watson has the least absolute bias, followed by SM, Horvitz-Thompson, Periodic Spline and lastly the Ratio estimator. Considering the MSE, Nadaraya-Watson and SM have low MSE, followed by Periodic Spline, Ratio and the H-T estimator, in that order. Nadaraya-Watson and SM are therefore the best estimators for this population, which is linear and heteroscedastic.
Population Total 28.38158
In population IV, the Sample Mean has the least bias, followed by Nadaraya-Watson, H-T, Ratio and lastly Periodic Spline. Looking at the MSE, SM has the least MSE, followed by Nadaraya-Watson, Periodic Spline, Ratio and the H-T estimator. Thus SM proved to be the best estimator for this real population, which appears from the scatter diagram to be linear with heteroscedastic variance.
Population V appears from the scatter diagram (Figure 5) to be neither linear nor homoscedastic. In this population, Periodic Spline has the least absolute bias, followed by Nadaraya-Watson, Ratio, SM and lastly the Horvitz-Thompson estimator. As concerns the MSE, Periodic Spline has the least MSE, proving to be the best estimator for this population, whose structure is not known. It is followed by Nadaraya-Watson, SM, H-T and the Ratio estimator.
Finally, the study compared the Relative Change in Efficiency (RCE) among the five estimators. The first case is when the linearity assumption on the population structure is violated. Considering RCE I in Table 7, the nonparametric estimators, Nadaraya-Watson and Periodic Spline, have low RCE. This implies that they are the least sensitive to violation of the linearity of the population structure and hence the most robust among the five estimators. They are followed by the SM and Ratio estimators. The Horvitz-Thompson estimator is the least robust as far as violation of the linearity assumption is concerned. Secondly, RCE II investigates violation of the homoscedasticity assumption on the population structure. Considering RCE II in Table 7, the Nadaraya-Watson, Periodic Spline and SM estimators have the lowest RCE. This implies that they are the least sensitive to the change of population structure and hence the most robust among the five when the homoscedasticity assumption is violated.
On the other hand, the Ratio and Horvitz-Thompson estimators are the least robust to violation of the homoscedasticity condition. Next we consider RCE III. SM has the least value, followed by Nadaraya-Watson, Periodic Spline, Ratio and Horvitz-Thompson. However, all values of RCE III are quite low. This implies that although SM is the most robust to this change in the population structure, the other estimators are also robust. We conclude that population I is almost similar in structure to population IV, though it seems that the homoscedasticity condition is violated.
Lastly, in RCE IV the Periodic Spline estimator has the least value of RCE, making it the most robust estimator to the change of population structure from linear and homoscedastic to non-linear and non-homoscedastic. Nadaraya-Watson also proved to be robust to the same change in structure. However, the Ratio and Horvitz-Thompson estimators proved to be highly sensitive to the changes in the population structure. These two estimators are therefore less robust than the two non-parametric estimators. The least robust estimator as far as this population is concerned is SM. Therefore, Periodic Spline has proved to be robust when both the linearity and homoscedasticity conditions are violated.
4. Conclusions and Recommendations
This study has revealed that the spline regression estimator performed well in all aspects considered: bias, efficiency and robustness. It performed well in the linear homoscedastic model and in the quadratic homoscedastic model, and it still performed well when the homoscedasticity assumption was violated. We therefore conclude that the Periodic Spline estimator is a robust estimator, and it is recommended as a suitable estimator of the population total when the structure of the population is unknown. The Nadaraya-Watson estimator also performs well in the linear homoscedastic model and when the linearity condition is violated. Its performance was also strong in the linear heteroscedastic model, indicating that it is robust to violation of the linearity and homoscedasticity conditions.
i. From the findings of our research, the Horvitz-Thompson (design-based) estimator and the Ratio estimator (model-based) should be used within the confines of a linear homoscedastic model. They are not appropriate for use when the structure of the population is not known.
ii. The Nadaraya-Watson and periodic spline estimators are suitable for use in a linear homoscedastic model. Even when the assumptions of the model are violated, their sensitivity to the change of population structure is relatively low, and hence they are classified as highly robust.
1. N = size of the finite population generally assumed to be known.
2. n = sample size.
3. x = design variable. Its values can either be made available before hand or in the course of data collection.
4. y = the survey variable or variable under study.
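5. $\bar{y} = n^{-1} \sum_{i=1}^{n} y_i$ = sample mean.
6. $s^2$ = sample variance = $(n-1)^{-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$.
7. $\sigma^2$ = population variance = $N^{-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2$.
8. $\bar{Y} = N^{-1} \sum_{i=1}^{N} Y_i$ = the finite population mean.
9. $T = \sum_{i=1}^{N} Y_i$ = finite population total.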
10. Srswor – Abbreviation of simple random sampling without replacement.
11. Ksh – Kenya shilling.
12. SM=Sample Mean
13. HT = Horvitz-Thompson
14. R = Ratio
15. NW = Nadaraya-Watson
16. PS = Periodic-Spline
First and foremost, I thank Almighty God for His abundant grace, faithfulness, care, knowledge and strength throughout this period of study. Secondly, I sincerely thank my supervisor, Dr. Njenga, for his professional guidance, his patience, his availability for consultation and his provision of the reference materials that I needed in the course of this research project. My gratitude also goes to my other lecturers in the Department of Mathematics (Statistics), Kenyatta University, for their support and very educative lectures. Special thanks also go to J.P. Wanjala of KNBS, who journeyed closely with us under the institution-based program, for his encouragement. Lastly, I am grateful to Mr. Felix Wasike of Kenyatta University for his invaluable technical assistance.