1 Introduction
Two popular resampling methods widely used in real data analysis are the Bootstrap [6,10] and the Jackknife [24,25,31]; both are nonparametric statistical methods. These computer-intensive techniques can be used to estimate the bias and standard error of non-traditional estimators, and they are especially useful when the sampling distribution of an estimator is unknown or cannot be derived mathematically, so that classical statistical analysis methods are not available. Resampling from the observed data allows us to draw conclusions about the population of interest, and these approaches are nowadays feasible because of the availability of high-speed computing. Confidence intervals based on the Bootstrap and other resampling methods (for example, the Jackknife) should be used whenever there is cause to doubt the assumptions of parametric confidence intervals; when the underlying distribution of a statistic of interest is unknown, these strategies can be beneficial.
Bootstrap and Jackknife are powerful techniques used in statistics to estimate the variability of a statistic or to assess the goodness of fit of a statistical model. While the Bootstrap resamples with replacement, the Jackknife systematically leaves out one observation at a time; the choice of method depends on the problem, and both can be useful in different scenarios. Bootstrap and Jackknife have been used and compared in several statistical settings, among others linear regression [11, 33], quantile regression [15, 19], analysis of variance [7, 8], and generalized linear models [21]. They can also be used in kernel density estimation and kernel regression to estimate the variability of the density and regression functions and, consequently, to define confidence intervals. In this work, we explore the applicability of these methods in the estimation of the bandwidth in both scenarios: kernel density estimation and kernel regression. The analysis is carried out using simulated data in R [26].
The article is organized as follows. In Section 2, we review the bias, standard error, and confidence intervals obtained using Bootstrap and Jackknife, together with an illustration based on the coefficient of variation. In Section 3, Bootstrap and Jackknife are compared in the context of hypothesis testing for one-sample problems; in particular, Monte Carlo simulations are used to estimate the power of tests about the coefficient of variation based on these two strategies. In Sections 4 and 5, we compare the methodologies in the context of kernel density estimation [5] and kernel regression [3].
2 Background: Bootstrap and Jackknife
Here we give an overview of the Bootstrap and Jackknife methods and present the estimation of bias and standard errors (and, consequently, the respective confidence intervals) by both approaches. Assume Y₁, ..., Yₙ is a random sample of Y ∼ f(y, Θ), with Θ a vector of parameters that defines the probability model of interest (for example, Θ = (μ, σ) in the case of a normal distribution or Θ = (α, β) for a Gamma distribution). Suppose we want to estimate a particular parameter θ (or a function of the parameters) of the distribution. If the distribution of the estimator θ̂ is unknown, the Bootstrap and Jackknife procedures (Sections 2.1 and 2.2) can be used to obtain a CI for θ.
2.1 Bias, standard error, and confidence intervals using Bootstrap
Based on the observed sample, we can obtain B samples of size n with replacement (Table 1), denoted Y*_{bj}, with b = 1, ..., B and j = 1, ..., n. From each Bootstrap sample, the estimator θ̂*_b of the parameter of interest is calculated.

Table 1: Representation of the Bootstrap random samples. The asterisk indicates that a sample of size n with replacement is obtained from the original data.

Using the Bootstrap samples and the estimations θ̂*_b, b = 1, ..., B, in Table 1, the Bootstrap estimates of the bias and standard error of θ̂ are

bias_boot = θ̄* − θ̂,  with  θ̄* = (1/B) Σ_{b=1}^{B} θ̂*_b,  and  se_boot = √[ (1/(B − 1)) Σ_{b=1}^{B} (θ̂*_b − θ̄*)² ].

Assuming normality, a 100(1 − α)% CI for θ is given by

θ̂ ± z_{1−α/2} se_boot.

In general, the CI can be obtained from the α/2 and 1 − α/2 quantiles of the Bootstrap estimations as

( θ̂*_{(α/2)} , θ̂*_{(1−α/2)} ).
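For illustration, the following minimal R sketch implements the quantities above for a generic statistic; the function name boot_summary, the number of resamples B, and the Gamma example are our own choices, not part of the original text.

```r
# Minimal sketch of the Bootstrap bias, standard error, and CIs described above.
# 'statistic' is any function of a sample (e.g., the coefficient of variation).
boot_summary <- function(y, statistic, B = 1000, alpha = 0.05) {
  theta_hat  <- statistic(y)
  theta_star <- replicate(B, statistic(sample(y, replace = TRUE)))
  se_boot    <- sd(theta_star)
  list(estimate   = theta_hat,
       bias       = mean(theta_star) - theta_hat,
       se         = se_boot,
       ci_normal  = theta_hat + c(-1, 1) * qnorm(1 - alpha / 2) * se_boot,
       ci_percent = quantile(theta_star, c(alpha / 2, 1 - alpha / 2)))
}

# Example: coefficient of variation of a Gamma(2, 2) sample
set.seed(1)
y  <- rgamma(100, shape = 2, rate = 2)
cv <- function(x) sd(x) / mean(x)
boot_summary(y, cv)
```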
2.2 Bias, standard error, and confidence intervals using Jackknife
As in the previous section, let Y₁, ..., Yₙ be a random sample of Y ∼ f(y, θ), where θ is the parameter of interest and θ̂ = g(Y₁, ..., Yₙ) its estimator. The bias E(θ̂) − θ and the variance V(θ̂) are unknown. Let θ̂₋ᵢ denote the estimator obtained after deleting Yᵢ, i = 1, ..., n. The Jackknife estimator of θ is defined as

θ̂_jack = (1/n) Σ_{i=1}^{n} θ̃ᵢ,

where

θ̃ᵢ = n θ̂ − (n − 1) θ̂₋ᵢ,  i = 1, ..., n,

are called Tukey's pseudo-values. Alternatively, we have

θ̂_jack = n θ̂ − (n − 1) (1/n) Σ_{i=1}^{n} θ̂₋ᵢ.

In order to obtain the Jackknife estimate of the bias of θ̂, E(θ̂) and θ are replaced by θ̂ and θ̂_jack, respectively. Specifically,

bias_jack = θ̂ − θ̂_jack = (n − 1) [ (1/n) Σ_{i=1}^{n} θ̂₋ᵢ − θ̂ ].
Let Y₁, ..., Yₙ be a random sample and Ȳ the sample mean. The variance of this statistic can be approximated using the sample variance as

V(Ȳ) ≈ S²/n,  with  S² = (1/(n − 1)) Σ_{i=1}^{n} (Yᵢ − Ȳ)²,

and consequently

se(Ȳ) ≈ S/√n.

Adapting this expression to the pseudo-values θ̃₁, ..., θ̃ₙ, we have

V̂(θ̂_jack) = (1/(n(n − 1))) Σ_{i=1}^{n} (θ̃ᵢ − θ̂_jack)².

It can be shown [9] that

T = (θ̂_jack − θ) / √V̂(θ̂_jack)

approximately follows a t distribution with n − 1 degrees of freedom. Then a 100(1 − α)% CI for θ is

θ̂_jack ± t_{n−1, 1−α/2} √V̂(θ̂_jack).
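A corresponding minimal R sketch of the Jackknife quantities above (pseudo-values, bias, standard error, and the t-based interval); the function name jack_summary and the example are our own.

```r
# Minimal sketch of the Jackknife estimator based on Tukey's pseudo-values.
jack_summary <- function(y, statistic, alpha = 0.05) {
  n          <- length(y)
  theta_hat  <- statistic(y)
  theta_loo  <- sapply(seq_len(n), function(i) statistic(y[-i]))  # leave-one-out estimates
  pseudo     <- n * theta_hat - (n - 1) * theta_loo               # pseudo-values
  theta_jack <- mean(pseudo)
  se_jack    <- sqrt(var(pseudo) / n)
  list(estimate = theta_jack,
       bias     = theta_hat - theta_jack,
       se       = se_jack,
       ci_t     = theta_jack + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * se_jack)
}

set.seed(1)
y  <- rgamma(100, shape = 2, rate = 2)
cv <- function(x) sd(x) / mean(x)
jack_summary(y, cv)
```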
3 Inference on the coefficient of variation using Bootstrap and Jackknife
In this Section, we show, using Monte Carlo simulation, the behavior of Bootstrap and Jackknife in statistical inference (estimation and hypothesis testing) on the coefficient of variation.
3.1 Estimation of the coefficient of variation
Assume Y ∼ f(y, Θ), with μ = 𝔼(Y) and σ² = 𝕍(Y), and suppose we want to estimate the coefficient of variation θ = CV = σ/μ. Based on Monte Carlo simulation [4], we compare Bootstrap and Jackknife in terms of bias and standard error of the estimation. A similar study based on Normal data was conducted in [33]; we extend that work using simulations from four probability models (Normal, Gamma, Poisson, and Binomial) obtained with R [26]. Table 2 shows the expressions of the CV for the four models considered.
To estimate the CV using Bootstrap, we first take a random sample of size n with replacement from the original dataset and calculate the CV of this Bootstrap sample. We repeat this process B times to obtain B Bootstrap samples and CV estimates. The standard deviation of these B estimates can be used as an estimate of the standard error of the CV, and a confidence interval for the CV can then be obtained from this standard error and the desired level of confidence. To estimate the CV using Jackknife, we create n subsamples by leaving out one observation of the original dataset at a time and calculate the CV of each subsample; the Jackknife estimate of the CV is then obtained from these n leave-one-out estimates as described in Section 2.2. To facilitate the interpretation of the results, Tables 3 and 4 present the Bootstrap and Jackknife expressions defined in Sections 2.1 and 2.2 for the particular case of the coefficient of variation. We consider sample sizes n = 5, 10, 15, ..., 200. The results are presented in Figure 1, and several aspects of this figure are noteworthy. The Jackknife bias is smaller than the Bootstrap one, and its performance improves with the continuous distributions (Normal and Gamma). Bootstrap underestimates the CV in all cases considered, and the greater the sample size, the better the Bootstrap estimation (less bias). For n values close to 200, the estimations obtained with the two methodologies are very similar, and the methods produce similar standard error estimations from relatively small sample sizes onwards. The results in Figure 1 suggest that Jackknife can be the better option for estimating the bias and standard error of the coefficient of variation, particularly when the sample size is small. A minimal R sketch of this comparison is given below.
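The sketch below compares the bias of the two approaches for the Gamma case (the true CV of a Gamma distribution with shape α is 1/√α); the number of Monte Carlo replicates, the number of resamples, and the sample sizes are illustrative choices only.

```r
# Compare the bias of Bootstrap and Jackknife estimates of the CV of a Gamma(2, 2)
set.seed(2)
cv      <- function(x) sd(x) / mean(x)
true_cv <- 1 / sqrt(2)                                  # CV of Gamma with shape 2
for (n in c(5, 20, 50, 200)) {
  boot_est <- jack_est <- numeric(100)
  for (r in 1:100) {                                    # Monte Carlo replicates
    y <- rgamma(n, shape = 2, rate = 2)
    boot_est[r] <- mean(replicate(500, cv(sample(y, replace = TRUE))))
    loo         <- sapply(1:n, function(i) cv(y[-i]))
    jack_est[r] <- n * cv(y) - (n - 1) * mean(loo)      # Jackknife estimate of the CV
  }
  cat(sprintf("n = %3d   bias(boot) = %+.4f   bias(jack) = %+.4f\n",
              n, mean(boot_est) - true_cv, mean(jack_est) - true_cv))
}
```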
Table 2: Probability models considered to study the consistency and power of the tests on the coefficient of variation

3.2 Hypothesis testing on the coefficient of variation
The power of a test based on Bootstrap or Jackknife depends on the number of resamples or Jackknife samples used, as well as on the characteristics of the original dataset. Generally, increasing the number of resamples will increase the power of the test, but at the cost of computational time. Another essential factor that can affect the power is the underlying distribution of the data; Bootstrap and Jackknife methods can perform well if the data are normally distributed. Here, we empirically compare (using simulated data) the power of tests based on Bootstrap and Jackknife. For this purpose, we simulate samples from the distributions in Table 2. Specifically, we test the hypothesis

H₀ : CV = CV₀ against a one-sided alternative,
Figure 1: Estimation (left) and standard error (right) of the coefficient of variation according to the sample size (grey and black curves correspond to Jackknife and Bootstrap, respectively). The dashed line in the left panel corresponds to the CV of reference. From top to bottom, we have the results for Normal, Gamma, Poisson, and Binomial distributions, respectively.
with CV₀ defined by the expressions in the last column of Table 2. The steps to calculate the power curves for each of the probability models considered are the following (an illustrative R sketch is given after the list):
1. We fix a sample size n = 100 and a significance level α = 5%.
2. One sample of size n is simulated under the null hypothesis.
3. We generate many Bootstrap and Jackknife samples by resampling with replacement or deleting one observation at a time, respectively.
4. For each Bootstrap or Jackknife sample, the CV is calculated and used to test whether it is significantly different from the CV under the null model (given in Table 2).
5. The proportion of times the null hypothesis is rejected across all Bootstrap or Jackknife samples is calculated. This proportion is an estimate of the power of the test.
6. We repeat steps 3-5 for many values of μ, α, λ, and p, respectively, under the alternative hypothesis.
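The following R sketch illustrates these steps for the Normal model, using a one-sided Bootstrap percentile rule as the rejection criterion; this is one possible implementation, and the exact decision rule, alternatives, and settings used in the paper may differ.

```r
# Sketch of the Monte Carlo power estimation for the Normal(10, 2) model:
# reject H0: CV = CV0 when the lower Bootstrap percentile bound exceeds CV0.
set.seed(3)
cv <- function(x) sd(x) / mean(x)
n <- 100; B <- 300; M <- 100; alpha <- 0.05
cv0    <- 2 / 10                               # CV under H0 (mu = 10, sigma = 2)
sigmas <- seq(2, 4, by = 0.5)                  # alternatives with increasing CV
power  <- sapply(sigmas, function(s) {
  mean(replicate(M, {
    y    <- rnorm(n, mean = 10, sd = s)        # sample under the alternative
    star <- replicate(B, cv(sample(y, replace = TRUE)))
    quantile(star, alpha) > cv0                # one-sided rejection rule
  }))
})
cbind(sigma = sigmas, power = power)
```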
The results obtained are presented in Figure 2. These suggest that in all cases (four probability models), the tests based on Bootstrap are more powerful than those found with Jackknife.
4 Review of Bootstrap and Jackknife in kernel density estimation
Kernel density estimation is widely used in several applied contexts, including, among others, marine biology [22], chemistry [20], and econometrics [34]. A fundamental aspect of kernel estimation is bandwidth selection. Usually, leave-one-out and k-fold cross-validation are used for establishing the optimal bandwidth [3]. Here we compare the performance of Bootstrap and Jackknife in this scenario. Bootstrap and Jackknife can be used to estimate the variability of the kernel density estimator, but they differ in how they generate the resamples. Bootstrap requires random sampling with replacement, while Jackknife involves leaving out one observation at a time. In this section, we compare these strategies according to their performances in both histogram density estimation (Section 4.1) and kernel density estimation (Section 4.2). Specifically, we estimate the uncertainty of the estimated density function in kernel density estimation.
4.1 Bootstrap and Jackknife in bandwidth histogram estimation
The histogram is one of the most broadly used graphical tools in descriptive data analysis [29]. This tool is a kernel density estimator where the underlying kernel is uniform [3]. Although there are better options for estimating the density, in this work we consider the histogram given its extensive use in real data analysis. Specifically, we establish which of the two resampling methodologies (Bootstrap or Jackknife) performs best in estimating the width (amplitude) of the class intervals.
Given an observed sample x₁, ..., xₙ, the histogram estimator of the density function f(x) is defined as

f̂(x) = n_k / (n h)  for x in the k-th class interval,

with n_k the number of observations falling in that interval and h the bin width (the amplitude of the class intervals). The optimal bandwidth h is the one that minimizes the asymptotic mean integrated squared error (AMISE), defined as [3]

AMISE(h) = 1/(n h) + (h²/12) ∫ f′(x)² dx,

where f(x) is the population density. To estimate h, f(x) is usually taken to be a Gaussian density with mean and standard deviation estimated from the sample. Taking the derivative of the AMISE with respect to h and equating it to zero yields

h = [ 6 / (n ∫ f′(x)² dx) ]^{1/3}.

Under the Gaussianity assumption,

ĥ = 3.491 σ̂ n^{−1/3}.   (10)
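A small R sketch of the classical rule in Equation (10); the helper name h_hist is ours.

```r
# Classical histogram bin width under the Gaussian reference rule, Equation (10)
h_hist <- function(x) 3.491 * sd(x) * length(x)^(-1/3)

set.seed(4)
x <- rnorm(200, mean = 10, sd = 3)
h_hist(x)                      # estimated class-interval width
# nclass.scott(x) in base R gives the number of classes implied by essentially the same rule
```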
4.2 Bootstrap and Jackknife in bandwidth kernel density estimation
For a given sample size, the problem of estimating h reduces to estimating σ. Here we evaluate the performance of Bootstrap and Jackknife in the estimation of h, and the estimations obtained are compared with the classical estimation given in Equation (10). Tables 5 and 6 show in parallel the general definitions of the Bootstrap and Jackknife estimators and the corresponding expressions for estimating the bandwidth h by means of these approaches. In order to compare the methodologies, we conduct a simulation study. Suppose X ∼ N(μ = 10, σ = 3). Using R [26], we simulate random samples of size n = 10, 20, 30, ..., 200 from X, and in each case we estimate the parameter h using the estimator in Equation (10). We also estimate h through Bootstrap and Jackknife (see Tables 5 and 6). The simulation results are shown in Table 7, and several aspects of this table are noteworthy.
Table 5: Summary of Bootstrap estimation of the bandwidth h in kernel density estimation. Assume that ĥ*_b is the estimate of the bandwidth h based on the b-th Bootstrap sample.

With small samples, the estimations by the three methods (classical, Bootstrap, and Jackknife) are very similar, and for large samples (n ≥ 60), the three estimations coincide. In all cases, the estimations are very close to the value of the parameter h, which indicates that any of them can be used. However, one advantage of Bootstrap and Jackknife over the classical estimation is that they allow assessing the uncertainty of the estimations (by means of the corresponding confidence intervals). As expected, with both methodologies (Bootstrap and Jackknife), a larger sample size provides narrower confidence intervals, and in all cases the Bootstrap and Jackknife confidence intervals contain the corresponding parameter h. The results indicate that Bootstrap and Jackknife are valid and valuable alternatives for estimating the class-interval width in histogram density estimation. They are preferable to the classical estimation since they provide, in addition to the point estimation, a measure of the variability of the estimation. In the case of small samples, the Bootstrap intervals are slightly narrower than those obtained with Jackknife, which suggests that Bootstrap might be more suitable when n is small.
Let x₁, ..., xₙ be a sample of size n from a population with unknown density f(x). The kernel density estimator of f(x) is given by

f̂(x) = (1/(n h)) Σ_{i=1}^{n} K((x − xᵢ)/h),

where K is a kernel function (Gaussian, Epanechnikov, triangular, biweight, etc.) and h is the bandwidth. The optimal h is the value that minimizes the AMISE(f̂(x)), defined as

AMISE(f̂) = R(K)/(n h) + (h⁴/4) μ₂(K)² ∫ f″(x)² dx,   with R(K) = ∫ K(t)² dt and μ₂(K) = ∫ t² K(t) dt.

Figure 2: Power curves based on simulated data. One-sided hypotheses on the coefficient of variation for four probability models: Normal(μ = 10, σ = 2) (top left), Gamma(α = 2, β = 2) (top right), Poisson(λ = 5) (bottom left), and Binomial(n = 10, p = 0.4) (bottom right).
Table 6: Summary of Jackknife estimation of the bandwidth h in kernel density estimation. Assume that σ̂₋ᵢ is the estimate of the standard deviation based on the sample x₁, ..., x_{i−1}, x_{i+1}, ..., xₙ.

Table 7: Assume X ∼ N(μ, σ). h: optimal bin width in histogram density estimation. ĥ: classical estimation. ĥ_boot and ĥ_jack are the approaches based on Bootstrap and Jackknife, respectively. In these cases, we also include 95% confidence intervals.

Table 8: Assume X ∼ N(μ, σ). h: optimal bandwidth in kernel density estimation. ĥ: classical estimation. ĥ_boot and ĥ_jack are the approaches based on Bootstrap and Jackknife, respectively. In these cases, we also include 95% confidence intervals.

Differentiating the AMISE with respect to h and equating to zero, we have

h = [ R(K) / (μ₂(K)² n ∫ f″(x)² dx) ]^{1/5}.

The optimal bandwidth h is obtained assuming f(x) is a Gaussian density. Hence, after some calculations, the optimal bandwidth in kernel density estimation (with a Gaussian kernel) is given by

ĥ = 1.06 σ̂ n^{−1/5}.   (11)
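A short R sketch of the normal-reference bandwidth in Equation (11); the helper name h_kde is ours, and bw.nrd() in base R implements a closely related rule (it replaces σ̂ by min(σ̂, IQR/1.34)).

```r
# Normal-reference bandwidth for a Gaussian kernel, Equation (11)
h_kde <- function(x) 1.06 * sd(x) * length(x)^(-1/5)

set.seed(5)
x <- rnorm(200, mean = 10, sd = 3)
h <- h_kde(x)
h
bw.nrd(x)                      # base-R rule of thumb; very close for Gaussian data
plot(density(x, bw = h))       # kernel density estimate using this bandwidth
```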
Replacing ĥ = 3.491 σ̂ n^{−1/3} with Equation (11) in Tables 5 and 6, we obtain the corresponding expressions for estimating the optimal bandwidth in kernel density estimation using Bootstrap and Jackknife. In Table 8, we show the estimations of the optimal bandwidth in kernel density estimation based on the same samples simulated to generate the results in Table 7. As in the particular case of the histogram, the bandwidth estimations shown in Table 8 indicate that Bootstrap and Jackknife can be favorable alternatives for estimating the density using the general kernel-based methodology. We present the results using a Gaussian kernel; however, we obtained similar results with other kernels. As in the case of the histogram, we conclude that using Bootstrap or Jackknife can be preferable because these approaches provide an estimation of the variability of the estimator; particularly with small sample sizes, it is helpful to know the uncertainty in the estimation. Also as in the histogram case, we note that Bootstrap can be preferable with small samples because narrower confidence intervals are obtained. The results in Tables 7 and 8 show that the estimators ĥ_boot and ĥ_jack are consistent, i.e., lim_{n→∞} ĥ_boot = h and lim_{n→∞} ĥ_jack = h.
5 Bootstrap and Jackknife in bandwidth kernel regression estimation
Bootstrap and Jackknife have been compared in various regression contexts, among others linear [2], generalized linear [12, 30, 33], and logistic regression [14]. This paper compares the efficiency of Bootstrap and Jackknife in estimating the bandwidth in kernel regression. This parameter, denoted as h (as in Sections 4.1 and 4.2, but now in a regression scenario), determines the width of the kernel function, which in turn affects the smoothness of the estimated regression function. The choice of the bandwidth is critical, as it can greatly affect the bias and variance of the regression estimator. Suppose we have an observed bivariate random sample (xᵢ, yᵢ), i = 1, ..., n, with y and x the response and predictor variables, respectively, and we want to estimate the regression model relating them.
Let f(x) and f(x, y) be the univariate and bivariate density functions of the variable X and the random vector (Y, X). The kernel regression (Nadaraya-Watson) estimator based on the observed sample (xᵢ, yᵢ), i = 1, ..., n, is given by [3]

m̂(x) = Σ_{i=1}^{n} K((x − xᵢ)/h) yᵢ / Σ_{i=1}^{n} K((x − xᵢ)/h),

where K is a kernel (Gaussian, rectangular, triangular, Epanechnikov, etc.) and h > 0 is the bandwidth that controls the amount of smoothing. For a particular i, i = 1, ..., n, we have

m̂(xᵢ) = Σ_{j=1}^{n} w_{ij} yⱼ,   with   w_{ij} = K((xᵢ − xⱼ)/h) / Σ_{k=1}^{n} K((xᵢ − xₖ)/h).

In matrix notation, the estimates at the n sampling points are calculated as

(m̂(x₁), ..., m̂(xₙ))ᵀ = W y,

where W = (w_{ij}) is the n × n matrix of kernel weights and y = (y₁, ..., yₙ)ᵀ. A key point in kernel regression is to determine the bandwidth h. The usual strategy is to choose the h value that minimizes the mean squared error (MSE), defined as [18]

MSE(h) = (1/n) Σ_{i=1}^{n} (yᵢ − m̂(xᵢ))²,   (14)

with m̂(xᵢ) the kernel regression estimate at xᵢ obtained with bandwidth h.
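As an illustration, the following R sketch implements a Nadaraya-Watson estimator with a Gaussian kernel and selects h by a grid search over a leave-one-out squared-error criterion, one practical version of the criterion in Equation (14); all names, the grid, and the toy data-generating model are our own choices.

```r
# Nadaraya-Watson kernel regression with a Gaussian kernel
nw <- function(x0, x, y, h) {
  sapply(x0, function(z) {
    w <- dnorm((z - x) / h)          # Gaussian kernel weights
    sum(w * y) / sum(w)
  })
}

# Leave-one-out squared-error criterion evaluated for a given bandwidth h
loo_mse <- function(h, x, y) {
  mean(sapply(seq_along(x),
              function(i) (y[i] - nw(x[i], x[-i], y[-i], h))^2))
}

set.seed(6)
x <- runif(80, -2, 2)
y <- 0.2 * x + 0.5 * x^2 - 0.8 * x^3 + rnorm(80, sd = 3)   # toy version of the Section 5.1 model
grid  <- seq(0.05, 1, by = 0.05)
h_opt <- grid[which.min(sapply(grid, loo_mse, x = x, y = y))]
h_opt

xs <- seq(-2, 2, length.out = 200)
plot(x, y)
lines(xs, nw(xs, x, y, h_opt), col = "red")
```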
In general, a small bandwidth produces a very local fit, while a large bandwidth leads to a smoother fit dominated by global effects. A bandwidth that is too small may overfit the data, while a bandwidth that is too large may over-smooth the data and miss significant local trends. Various methods can be used to determine the optimal bandwidth, such as cross-validation or minimizing a particular criterion (e.g., the mean squared error or Akaike's information criterion). Cross-validation involves partitioning the data into training and validation sets and evaluating the performance of the model with different bandwidth values; the bandwidth that yields the best behavior on the validation set is then chosen as the optimal bandwidth. In this Section, we compare Bootstrap and Jackknife to establish their efficiency in estimating the optimal bandwidth h in kernel regression according to the criterion in Equation (14). The Jackknife method in kernel regression involves repeatedly fitting the kernel regression estimator to a subset of the data, leaving out one observation each time; the bandwidth is thus estimated using all the observations and using each subset of the data. The Jackknife estimation of the bandwidth is obtained from the pseudo-values, calculated as

h̃ᵢ = n ĥ − (n − 1) ĥ₋ᵢ,  i = 1, ..., n,

where ĥ and ĥ₋ᵢ are the bandwidths obtained using the criterion in Equation (14) with and without, respectively, the i-th observation (xᵢ, yᵢ), i = 1, ..., n. The pseudo-values can be used to estimate h, the variance of this estimator, and a confidence interval for h. Specifically, we have

ĥ_jack = (1/n) Σ_{i=1}^{n} h̃ᵢ,  V̂(ĥ_jack) = (1/(n(n − 1))) Σ_{i=1}^{n} (h̃ᵢ − ĥ_jack)²,  and  ĥ_jack ± t_{n−1, 1−α/2} √V̂(ĥ_jack).
Using the α/2 and 1 − α/2 quantiles of the pseudo-values, an approximate confidence interval for the bandwidth of the kernel regression estimator can also be obtained as

( h̃_{(α/2)} , h̃_{(1−α/2)} ).
On the other hand, the Bootstrap method can also be used to obtain many estimations (by resampling with replacement) and, consequently, a variability measure and a confidence interval for the bandwidth h, computed from the quantiles of the distribution of the Bootstrap estimations. Let (x*_{11}, y*_{11}), ..., (x*_{1n}, y*_{1n}), ..., (x*_{B1}, y*_{B1}), ..., (x*_{Bn}, y*_{Bn}) be B Bootstrap samples of size n taken from (xᵢ, yᵢ), i = 1, ..., n. From each of these, an estimation of h is obtained according to the criterion in Equation (14), i.e., we find (ĥ*₁, ..., ĥ*_B). The Bootstrap estimator of the bandwidth, its variance, and the corresponding confidence interval are given by

ĥ_boot = (1/B) Σ_{b=1}^{B} ĥ*_b,  V̂(ĥ_boot) = (1/(B − 1)) Σ_{b=1}^{B} (ĥ*_b − ĥ_boot)²,  and  ĥ_boot ± z_{1−α/2} √V̂(ĥ_boot).
Using the quantiles of the Bootstrap estimations (ĥ*₁, ..., ĥ*_B), a confidence interval for the bandwidth of the kernel regression estimator can also be obtained as

( ĥ*_{(α/2)} , ĥ*_{(1−α/2)} ).
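Building on the previous sketch (and reusing its nw(), loo_mse(), and simulated (x, y) data), the following lines illustrate Bootstrap and Jackknife confidence intervals for the bandwidth; the number of resamples and the grid are illustrative only.

```r
# Bootstrap and Jackknife confidence intervals for the kernel regression bandwidth,
# reusing nw() and loo_mse() from the previous sketch.
best_h <- function(x, y, grid = seq(0.05, 1, by = 0.05)) {
  grid[which.min(sapply(grid, loo_mse, x = x, y = y))]
}

h_hat <- best_h(x, y)

# Bootstrap: re-estimate h on samples drawn with replacement
h_star <- replicate(100, {
  idx <- sample(length(x), replace = TRUE)
  best_h(x[idx], y[idx])
})
quantile(h_star, c(0.025, 0.975))                        # percentile CI for h

# Jackknife: pseudo-values built from leave-one-out bandwidth estimates
n      <- length(x)
h_loo  <- sapply(1:n, function(i) best_h(x[-i], y[-i]))
pseudo <- n * h_hat - (n - 1) * h_loo
mean(pseudo) + c(-1, 1) * qt(0.975, n - 1) * sqrt(var(pseudo) / n)   # t-based CI
```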

Figure 3: Left panel: Scatterplot of (xᵢ, yᵢ), i = 1, ..., 1000 points simulated from the model Y = 0.2x + 0.5x² − 0.8x³ + ε, ε ∼ N(0, 30), which is assumed as the population data, and fitted kernel regression model with h = 0.37 (red curve). Right panel: Simulation of n = 80 data from the population data (grey points) and estimated kernel regression model (red curve) with ĥ = 0.65.
5.1 Simulation study
Here we present some simulation results comparing Bootstrap and Jackknife in estimating the bandwidth in kernel regression. For this purpose, we assume Y = 0.2x + 0.5x² − 0.8x³ + ε, ε ∼ N(0, 30), as the population model (Figure 3). The red line corresponds to the fit of a kernel regression model with bandwidth h = 0.37, calculated according to the criterion in Equation (14). The comparisons are made by taking samples of size n = 20, 40, 60, ..., 200 from this model. For illustrative purposes, the right panel presents a kernel regression model estimated with a sample of size n = 80 from the population model; in this case, the bandwidth estimation is ĥ = 0.64. Varying the sample size, we obtain the estimations ĥ, ĥ_boot, and ĥ_jack (Table 9). Note that in practice, given a dataset (xᵢ, yᵢ), i = 1, ..., n, we have just one estimation of h, while Bootstrap and Jackknife provide several estimations obtained by resampling the recorded data; in such a real case, confidence intervals can only be determined using the Bootstrap and Jackknife approaches. For this reason, in order to compare the three methodologies (classical, Bootstrap, and Jackknife), we calculate confidence intervals based on the 5% and 95% percentiles obtained from 300 sets of values of ĥ, ĥ_boot, and ĥ_jack, generated with an equal number of samples of size n = 20, 40, 60, ..., 200 simulated from the population model (Table 9). The results suggest that the three estimators are consistent, i.e., as we collect more and more data (as n increases), the difference between the estimated value and the "true value" of the parameter (assumed equal to 0.37) becomes smaller and smaller. According to the results in Table 9, the Jackknife estimations are very close to those obtained with the classical approach. We also note (see the last row of the table) that the Jackknife estimates tend more quickly to the fixed reference value (h = 0.37) than those obtained by Bootstrap. These results indicate that Jackknife may be a more appropriate and efficient option than Bootstrap for estimating the bandwidth in kernel regression. It is generally accepted that, in regression problems, Bootstrap is often acceptable as long as the dataset is reasonably large. However, the estimations ĥ_boot and the corresponding confidence intervals CI(ĥ_boot) in Table 9 suggest that, in the context of kernel regression, Bootstrap is not the best option for establishing the variability of the bandwidth estimator ĥ.
Table 9: Assume Y = 0.2x + 0.5x² − 0.8x³ + ε, a regression function with ε ∼ N(0, 30). The optimal bandwidth (see Section 5) in a kernel regression model calculated with 1000 data points simulated from the population model is h = 0.36. ĥ, ĥ_boot, and ĥ_jack identify the classical approach and the approaches based on Bootstrap and Jackknife, respectively, calculated for each sample size as the mean of fifty simulations. We also include 95% confidence intervals obtained as the 5% and 95% percentiles of the fifty estimations.

Table 10: Results based on the bioluminescence data. Assume Y = m(x) + ε, with m(x) a kernel regression function. ĥ: optimal estimation based on least squares. ĥ_boot and ĥ_jack are the approaches based on Bootstrap and Jackknife, respectively. In these cases, we also include 95% confidence intervals.

5.2 Application to bioluminescence data
Bioluminescence is the emission of light by living organisms, typically due to chemical reactions involving luciferin and luciferase enzymes [16]. Analyzing bioluminescence data can provide valuable insights into the biological processes involved and into the environmental and experimental factors affecting bioluminescence activity [27]. Bioluminescence is very common in the ocean, at least in the pelagic zone; bioluminescent creatures occur in all oceans at all depths, with the greatest numbers found in the upper 1000 m of the vast open ocean [32]. The relationship between pelagic bioluminescence and depth has been considered from various perspectives; among others, additive models [13, 17] and additive mixed models [36] have been used in this context. This work applies a kernel regression model to a pelagic bioluminescence dataset taken from [35]. Based on these data, the aim is to illustrate how to define confidence intervals for the bandwidth parameter h. The scatterplot and the fitted kernel regression model with bandwidth ĥ = 373 are shown in Figure 4.

Figure 4: Scatterplot (dots) of pelagic bioluminescence along a depth gradient in the northeast Atlantic Ocean for a particular station. Data taken from [35]. Red curve corresponds to a Kernel regression model with an optimal bandwidth ĥ = 373.
From Table 10, we can establish that the optimal bandwidth that minimizes the criterion in Equation (14) is ĥ = 373. We also note that the estimations by Bootstrap and Jackknife are very close to this value; however, Jackknife is the better option for defining the confidence interval for h because, on the one hand, its interval includes this value and, on the other, it is narrower than the one obtained with Bootstrap. The results in this Section confirm that Jackknife is a good alternative in kernel regression for quantifying the uncertainty of the bandwidth estimation.
6 Conclusion and further research
The results in this work suggest that, in the particular case of the coefficient of variation, Jackknife tends to produce less biased estimates, but it may yield lower power than Bootstrap in hypothesis testing. In the case of histogram and kernel density estimation, both methodologies produce similar results. Finally, in the context of kernel regression, Jackknife is the better option. Resampling methods are helpful nowadays in many contexts, and further studies comparing these approaches in other statistical areas are required; for example, evaluating their performance in spline regression and neural networks can be valuable.