Two Factor Data Analysis with Unequal Cell Frequencies and Interaction
Chinwendu Alice Uzuke^{*}, Ikewelugo Cyprian Anene Oyeka, Happiness Onyebuchi Obiora-Ilouno
Department of Statistics, Faculty of Physical Sciences, Nnamdi Azikiwe University, Awka, Nigeria
Email address:
To cite this article:
Chinwendu Alice Uzuke, Ikewelugo Cyprian Anene Oyeka, Happiness Onyebuchi Obiora-Ilouno. Two Factor Data Analysis with Unequal Cell Frequencies and Interaction. Science Journal of Applied Mathematics and Statistics. Vol. 3, No. 6, 2015, pp. 288-292. doi: 10.11648/j.sjams.20150306.18
Abstract: This paper proposes a non parametric method for two factor data analysis with unequal cell frequencies and interaction. Chi-square test statistic was developed for testing the null hypothesis of no treatment effect and interaction between factor A and factor B. The proposed methods are illustrated with some data and compared with the usual unweighted mean method. The result showed that the proposed method is more powerful than the method of unweighted mean.
Keywords: Cell Frequency, Interaction, Chi-square, Unweighted Mean, Ranking, Tied Observation
1. Introduction
Analysis of variance (ANOVA) is generally regarded as the best analysis technique for balanced experiments that have equal number of subjects in each group that is cells with equal frequency [3]. Just as it may often be too difficult and too expensive to obtain more than one observation per treatment combination, it may also prove impossible to obtain equal number of observation per cell in a two factor analysis. For example, even though an experiment was planned with equal number of observations per cell. Some of the observation may end up missing for various reasons.
Some classifications of missingness were given as missingness at random (MCAR) as a situation where the probability of missing data does not depend on observed or un observed data, missing at random (MAR) as the probability that the missing data does not depend on the observed data while missing not at random (MNAR), is the probability that the missing data depends on the unobserved data conditional on the observed data [4]. Data with unequal cell frequency are not too far from those with equal frequency, it is sometimes possible to use approximate procedures that convert the former from the later. In practice the decision must be made when data are not sufficiently different from the case with equal frequency which makes the degree of approximation introduced relatively unimportant. [1]
Two – way ANOVA with unequal cell frequencies without assumption of equal error variance was considered by taking generalized approach to finding p-values [5]. But when the sample size per treatment combination is not the same for all treatments in a two factor ANOVA, the factor effect become more complicated and the usual calculations are no longer directly applicable [7], [9]. In this situation, the easiest and exact way to obtain the proper sum of squares for testing factor effects and interactions is through regression approach. [8]. Approximate methods however exist including the so-called method of weighted means, assuming all the assumptions for the use of ANOVA t-test are satisfied [6], [10], [1]. We therefore however present an alternative non-parametric method that will take care of different factors and interaction effects.
2. Methodology
Let be the observation at the level of factor and level of factor , for , , . Let be the number of observations in the cell. Then an analysis based on the un-weighted means using the variable [6].
(1)
An analysis of variance is then calculated in the usual way using as if they were single observations for the treatment combination per cell. However, the sum of squares are no longer additive in the sense that the individual sums of squares no longer add up to the total sum of squares. The sum of square error must now be calculated directly and independently from its basic definition, which may be sometimes more time consuming. Hence, instead of using the un-weighted means approach, we will propose a method based on the rank of sample observations.
To develop the proposed method based on the ranks of the sample observations, we would first pool all the into one commom sample and then rank the observations together in the usual way assingning all the tied observations their mean ranks.
Let be the rank assigned to in the combined ranking of these observations, for , and , giving a total of rank with mean ranks .
Now, in the absence of ties between the sample observations, the total sum of squared deviations of the asigned ranks from their mean rank is given by:
(2)
Which has a chi-square distribution with degrees of freedom.
Now, the total sum of square can be partioned into two component sums of square, namely, the treatment sum of square and the error sum of square as:
where is the mean or average of the ranks assigned to the observations in the level of factor and the level of factor , for , .
It can easily be shown that:
(3)
where is the sum of the ranks assigned to observations in the level of factor and the level of factor .
Now the sum of squares treatment,
(4)
has chi-square distribution with degrees of freedom. (Hogg et al 2005), while the error sum of square
(5)
has chi-square distribution with = degrees of freedom.
The treatment sum of squares can be further partitioned in to the sum of squares for factor , , sum of squares for factor , , and the factor by factor interaction sum of squares . Thus,
which can easily be shown to reduce to
(6)
That is
(7)
Where and are respectively the mean ranks of observations of the level of factor and level of factor , and are the corresponding number of observations and and are the corresponding rank totals, for for , .
Now, the sum of square due to factor namely:
(8)
has chi-square distribution with degrees of freedom and may be used to test the null hypothesis of no factor effects.
The sum of square due to factor namely:
(9)
has the chi-square distribution with degrees of freedom and may be used to test the null hypothesis of no factor effects.
Similarly, the sum of squares due to factor by factor interaction namely
(10)
Or
(11)
has the chi-square distribution with = degrees of freedom and may be used to test the null hypothesis of no factor by factor interaction effects.
In two factor analysis a null hypothesis which is usually of interest is that there are no treatment effects. If this null hypothesis is rejected, then one may proceed to test the null hypothesis that the effects of each of the factors and are zero assuming that the interaction effects have been found not to be statistically significant or that the interactions have been removed by an appropriate data transformation.
The null hypothesis of no treatment effect is tested using the chi-square statistic of Equation (4). The null hypothesis is rejected at the level of significance if
(12)
If this null hypothesis is rejected then we would need to first test the null hypothesis of no significant factor by factor interaction effects. This null hypotheis is tested using the chi-square statistic for interaction in Equation (11).
The null hypothesis of no factor by factor interaction effect is rejected at the level of significance if the chi-square value of Equation (10) or (11) is greater than the chi-square critical value with degrees of freedom. If this null hypothesis is rejected, one may then proceed to test the null hypothesis about factor and factor effects using Equations (8) and (9) respectively and rejecting the null hypothesis at a specified level of significance with and degrees of freedom.
3. Illustration
We shall use the data on final cumulative grade point average (FCGPA) of students who graduated in statistics from a certain University by State of origin for four years. The result is presented in Table 1. [10].
Year of Graduation | A | B | C | D | E | F |
2005 | 1.62, 3.22, 1.74, 3,53, 3.04, 2.07, 3.83, 3.15, 2.38, 2.07 | 2.33, 4.03, 4.00, 2.23, 1.65, 2.57 | 3.16, 1.66, 3.88, 3.22, 1.72, 1.68, 2.30, 2.45 | 4.03, 3.46, 2.36, 3.21, 2.73, 1.69, 2.48 | 2.99, 3.20, 4.20, 2.82, 2.91, 2.41, 2.35, 2.52 | 1.96, 2.08, 2.93, 2.43, 2.15, 2.56 |
2006 | 4.20, 2.99, 1.63, 2.70, 3.34, 2.44, 2.44, 3.62, 2.93, 1.70, 3.12, 2.33 | 2.99, 2.33, 4.01, 4.20, 3.32, 2.91, 2.23 | 2.89, 3.18, 2.62, 3.04, 1.87, 1.52, 2.10 | 4.66, 3.46, 3.38, 2.24, 1.69, 2.56. 2.62 | 2.42, 2.36, 4.02, 3.82, 3.01, 2.68, 2.52 | 2.78, 2.79, 3.02, 1.40, 2.19, 2.13 |
2007 | 3.12, 1.92, 1.71, 2.00, 2.33, 2.99, 3.54, 3.62, 3.50 | 2.63, 2.24, 2.82, 2.06, 1.97, 2.32, 2.72, 2.61 | 3.21, 2.58, 1.82, 2.67, 2.63, 2.87, 2.23, 2.46 | 2.04, 2.78, 2.93, 2.63, 3.01, 2.52, 2.46, 2.05, 2.26 | 2.56, 2.42, 2.40, 2.82 2.63, 1.98, 2.01, 2.00 | 3.20, 2.82, 1.93, 2.45, 2.03, 3.52, 2.81, 1.48 |
2008 | 2.69, 1.38, 3.17, 4.04, 4.11, 3.10, 2.14, 3.10, 2.68, 3.40, 1.95, 2.23, 1.56 | 1.96, 2.18, 1.32, 2.56, 2.48, 2.76, 2.32 | 3.06, 2.82, 2.63, 3.52, 1.46, 1.82, 2.22 | 1.23, 3.02, 2.58, 2.76, 2.43, 1.76, 3.28 | 1.27, 2.56, 1.82, 1.52, 2.68, 3.12, 2.45 | 1.80, 2.06, 2.62, 2.24, 2.74, 1.96, 2.48, 3.04 |
State of Origin
Using the unweighted mean approach, we obtain the entries in Table 2 using Equation (1)
Year of Graduation | A | B | C | D | E | F | |
2005 | 2.66 | 1.68 | 2.56 | 2.85 | 2.93 | 2.35 | 15.03 |
2006 | 2.80 | 2.80 | 2.46 | 2.94 | 2.98 | 2.39 | 16.71 |
2007 | 2.75 | 3.14 | 2.56 | 2.52 | 2.35 | 2.53 | 15.13 |
2008 | 2.49 | 2.23 | 2.51 | 2.44 | 2.20 | 2.37 | 14.24 |
| 10.7 | 9.47 | 10.09 | 10.75 | 10.46 | 9.64 | 61.11 |
States of Origin
The data in Table 2 are subjected to the standard balanced ANOVA technique without interrraction to obtain the sum of squares and the result of the analysis is presented in Table (3)
Source of variation | Sum of squares | Degrees of freedom | Mean Square | F-ratio | P-value |
Year | 0.3687 | 3 | 0.1229 | 1.43 | 0.2733 |
State | 0.5353 | 5 | 0.1071 | 1.24 | 0.3391 |
Error | 1.2934 | 15 | 0.0862 | ||
Total | 2.1974 | 23 |
From the ANOVA table, the p-value for block (year of graduation) and treatment (state of origin) show that they are not significant.
The proposed method
Observations are pooled together and assigned ranks. In the presence of tied observations, the mean of their rank are assigned to them. Further, the individual observations are replaced with their ranks and presented in Table 4.
State Of Origin
Year of Graduation | A | B | C | D | E | F |
2005 | 11, 161.5, 21, 173, 145, 44.5, 178, 153, 72, 44.5 | 66.5, 180, 13, 183.5, 55.5, 97 | 154, 15, 179, 82, 20, 62, 14, 161.5 | 183.5, 87, 16.5, 70.5, 116, 168.5, 159.5 | 137.5, 74, 188, 90, 131.5, 69, 157.5, 126 | 32, 134, 50, 46, 77.5, 94 |
2006 | 188, 137.5, 114, 165, 79.5, 79.5, 12, 175.5, 134, 18, 151, 66.5 | 137.5, 131.5, 181, 164, 55.5, 66.5, 188 | 130, 8.5, 102, 27, 47, 156, 145 | 190, 94, 166, 16.5, 102, 168.5, 59 | 75.5, 111, 182, 140.5, 90, 70.5, 177 | 120.5, 52, 142.5, 122, 5, 48 |
2007 | 151, 28, 36.5, 66.5, 19, 137.5, 174, 175.5, 170 | 106, 63.5, 126.5, 100, 34, 115, 59, 42.5 | 159.5, 129, 25, 84.5, 106, 55.5, 98.5, 109 | 40, 106, 134, 90, 140.5, 47, 84.5, 61, 120.5 | 94, 35, 36.5, 73, 106, 38, 75.5, 126.5 | 157.5, 29, 171.5, 7, 39, 124, 126.5, 82 |
2008 | 113, 10, 55.5, 155, 4, 186, 185, 49, 148.5, 111, 148.5, 30, 167 | 32, 94, 3, 118.5, 87, 63.5, 51 | 147, 171.5 106, 25, 96.5, 53, 126.5 | 48, 22, 98.5, 77.5, 163, 142.5, 118.5 | 2, 151, 25, 111, 82, 94, 8.5 | 23, 32, 102, 145, 117, 87, 42.5, 59, |
The ranks in each of the cells are summed to obtain and they presented in table 5
Year of Graduation | A | B | C | D | E | F | |
2005 | 1003.5(10) | 595.5(6) | 687.5(8) | 801.5(7) | 974(8) | 433.5(6) | 4495.5(45) |
2006 | 1314.5(12) | 924(7) | 615.5(7) | 796(7) | 846.5(7) | 490(6) | 4992.5(46) |
2007 | 958(9) | 646.5(8) | 767(8) | 823.5(9) | 584.5(8) | 736.5(8) | 4516(50) |
2008 | 1362.5(13) | 449(7) | 578.5(7) | 670(7) | 473.5(7) | 607.5(8) | 4141(49) |
| 4644.5(44) | 2615(28) | 2648(30) | 3091(30) | 2878.5(30) | 2267.5(28) | 18145(190) |
State of Origin.
From Table 5, the chi-square values for the source of variations , , , and were obtained and presented in Table 6
Source of Variation | Chi-Square Statistic | Degrees of Freedom | Chi-square critical value | p-value |
SSTrt | 57326.54 | 23 | 44.18 | 0.0000 |
SSA | 15943.87 | 3 | 12.838 | 0.0000 |
SSB | 13716.99 | 5 | 16.75 | 0.0000 |
SSAB | 27665.54 | 15 | 32.801 | 0.0000 |
SSError | 514240.96 | 166 | 53.672 | |
SSTot | 571567.5 | 189 | 53.672 |
4. Conclusion
In this paper, we have proposed a non parametric method for two factor data analysis with unequal cell frequencies and interaction. This was done by using the ranks of the sampled observations to obtain the chi-square statistic for the testing the null hypothesis of no treatment effect and no interaction between factor A and factor B.
Further the application of the proposed method is studied in practice by considering a real life example on students’ final cumulative grade point average (FCGPA) and State of Origin of these students. The chi-square test statistic were estimated based on the proposed methods and the result obtained showed better estimates when compared with the method of unweighted mean.
References