Collaborative Sampling to Construct a Confidence Interval on the Population Mean

What is the Collaborative Sampling Design?

The Collaborative Sampling (CS) design uses two measurement techniques to obtain a cost-effective estimate of the mean of the characteristic of interest for the specified target population (study area). One measurement technique is the standard analysis (referred to in VSP as the expensive analysis method); the other is a less expensive but less accurate method (referred to in VSP as the inexpensive analysis method). The idea behind CS is to replace many of the expensive analyses with a relatively large number of inexpensive analyses. It works like this: at \(n'\) field locations selected using simple random sampling or grid sampling, the inexpensive analysis method is used. Then, at \(n\) of those \(n'\) locations, the expensive analysis method is also conducted. The data from the two analysis methods are used to estimate the mean and the standard error (SE: the standard deviation of the estimated mean). The method of estimating the mean and SE assumes there is a linear relationship between the inexpensive and expensive analysis methods. If the linear correlation between the two methods is sufficiently high (close to 1), and if the inexpensive analysis method costs sufficiently less than the expensive analysis method, then CS is expected to be more cost effective at estimating the population mean than spending the entire measurement budget on expensive analysis results at field locations selected using simple random sampling or grid sampling.

The VSP CS module computes the values of \(n'\) and \(n\) that should be used to estimate the CS mean and SE. VSP also computes the mean, standard error (standard deviation of the estimated mean), and other outputs that can be used to assess the validity of the assumptions that underlie CS. The equations used for these computations are provided below. It should be noted that Gilbert (1987, Chapter 9) and other statisticians use the term Double Sampling instead of Collaborative Sampling. VSP uses Collaborative Sampling rather than Double Sampling to prevent potential users from thinking that a Double Sampling design requires doubling the number of samples.

Method Used to Determine if Collaborative Sampling is Cost Effective

Before VSP will compute \(n'\) and \(n\), it first determines if CS is cost effective compared with using the entire measurement budget to obtain only expensive analysis results at field locations selected using simple random sampling or grid sampling. If CS is found to be cost effective, then VSP will compute \(n'\) and \(n\). If CS is not cost effective, then VSP computes the number of field locations that should be collected and analyzed using only the expensive analysis method to estimate the mean and SE.

VSP declares CS to be cost effective if the following equation is satisfied:

\begin{equation} \rho^2 > \frac{4R}{(1+R)^2} \end{equation}

where

\(\rho\)

is the true correlation coefficient between the expensive and inexpensive analysis measurements.

\(R\)

is \( \frac{c_{Ex}}{c_{Inex}}\).

\(c_{Ex}\)

is the per unit cost of an expensive analysis, including the cost of collecting, handling, preparing, and measuring the sample.

\(c_{Inex}\)

is the per unit cost of an inexpensive analysis, including finding the field location and conducting the inexpensive analysis method.

 

Equation (1) above is from Gilbert (1987, page 108). It is assumed that the following cost equation applies:

\(C = c_{Ex}n + c_{Inex}n' \)

where \(C\) is the total dollars available for doing \(n'\) inexpensive analyses and \(n\) expensive analyses.

 

Note that \(C\) does not include what might be termed overhead costs of project management, preparing the Quality Assurance Project Plan or Sampling and Analysis Plan, QA/QC and other such costs.
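The cost-effectiveness test in Equation (1) is straightforward to evaluate directly. The following Python sketch (with hypothetical correlation and per-unit costs; not VSP's implementation) applies it:

```python
def cs_is_cost_effective(rho, cost_expensive, cost_inexpensive):
    """Equation (1): CS is cost effective if rho^2 > 4R/(1+R)^2,
    where R = c_Ex / c_Inex (Gilbert 1987, page 108)."""
    R = cost_expensive / cost_inexpensive
    return rho**2 > 4 * R / (1 + R) ** 2

# Hypothetical example: rho = 0.9, expensive analysis costs 10x the inexpensive one
print(cs_is_cost_effective(0.9, 100.0, 10.0))  # rho^2 = 0.81 > 4(10)/121 ~ 0.331, so True
```

Note that because \(4R/(1+R)^2\) is symmetric in \(R\) and \(1/R\), the test gives the same answer whichever way the cost ratio is formed; what matters is that the correlation must be stronger as the cost advantage of the inexpensive method shrinks.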

Method for Computing the Number of Expensive and Inexpensive Analysis Measurements

When CS is more cost effective than simple random sampling, VSP computes \(n'\) and \(n\) such that the total measurement cost, \(C\), is minimized subject to the constraint that the width of the confidence interval will be no greater than a confidence interval width that would be obtained using \(n_v\) samples obtained using simple random sampling and measured using only the expensive measurement method.

In order to compute \(n'\) and \(n\), VSP first computes \(n_v\) and then uses \(n_v\) to compute \(n'\) and \(n\) as explained below. The iterative procedure described in Gilbert (1987, page 32) is used to compute \(n_v\). The procedure starts by computing

\begin{equation} n_v = \left[ \frac{Z_{1- \alpha} \sigma_{total,ex}}{d}\right]^2 \end{equation} (for one-sided confidence intervals)

or

\begin{equation} n_v = \left[ \frac{Z_{1- \alpha /2} \sigma_{total,ex}}{d}\right]^2 \end{equation} (for two-sided confidence intervals)

where

\( Z_{1- \alpha}\)

is the \( (1- \alpha )^{th}\) percentile of the standard normal distribution.

\( Z_{1- \alpha /2}\)

is the \((1- \alpha /2)^{th}\) percentile of the standard normal distribution.

\( \sigma_{total,ex}\)

is the total standard deviation of the expensive measurement method.

\(d\)

is the desired width of the one-sided confidence interval or the desired half-width of the two-sided confidence interval.

 

Once \(n_v\) is computed using Equation (2) or (3), the degrees of freedom (\(df\)) are computed (\(df = n_v - 1\)) and used in Equations (4) or (5) to get a revised value for \(n_v\):

\begin{equation} n_v = \left[ \frac{t_{1- \alpha ,df} \sigma_{total,ex}}{d}\right]^2 \end{equation} (for one-sided confidence intervals)

or

\begin{equation} n_v = \left[ \frac{t_{1- \alpha /2, df} \sigma_{total,ex}}{d}\right]^2 \end{equation} (for two-sided confidence intervals)

where

\(t_{1- \alpha ,df}\)

is the \((1- \alpha)^{th}\) percentile of Student's t-distribution with \(df\) degrees of freedom.

\(t_{1- \alpha /2, df}\)

is the \((1- \alpha /2)^{th}\)  percentile of Student's t-distribution with \(df\) degrees of freedom.

 

This iterative procedure continues until \(n_v\) does not change or oscillates between two adjacent values, in which case the larger value is chosen. \(n_v\) is the recommended minimum number of samples for computing a confidence interval on the mean for the study area when simple random sampling and only the expensive measurement method are used.
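The iterative procedure above can be sketched in Python as follows. This is a simplified illustration, not the exact VSP implementation: the Student's-t quantile function is passed in as a parameter (e.g., `scipy.stats.t.ppf`), and when none is supplied the normal quantile is reused as an approximation.

```python
from math import ceil
from statistics import NormalDist

def n_simple_random(sigma_total_ex, d, alpha, two_sided=True, t_ppf=None):
    """Iterative procedure of Equations (2)-(5): start from the normal
    quantile, then refine with Student's t (df = n_v - 1) until n_v stops
    changing; on oscillation, keep the larger value."""
    p = 1 - alpha / 2 if two_sided else 1 - alpha
    z = NormalDist().inv_cdf(p)
    n_v = ceil((z * sigma_total_ex / d) ** 2)     # Eq (2) or (3)
    if t_ppf is None:
        # Fallback approximation: reuse the normal quantile in place of t
        t_ppf = lambda prob, df: NormalDist().inv_cdf(prob)
    seen = set()
    while True:
        t = t_ppf(p, n_v - 1)
        n_next = ceil((t * sigma_total_ex / d) ** 2)  # Eq (4) or (5)
        if n_next == n_v:
            return n_v
        if n_next in seen:                        # oscillation detected
            return max(n_next, n_v)
        seen.add(n_v)
        n_v = n_next
```

For example, with \(\sigma_{total,ex} = 3\), \(d = 1\), and \(\alpha = 0.05\), the two-sided case gives \(n_v = 35\) under the normal-quantile fallback; a true t quantile would refine this upward slightly.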

Once \(n_v\) is computed, VSP uses \(n_v\) to calculate \(n\) and \(n'\) using the following formulas:

\begin{equation} n = n_v (1 - \rho^2 + \rho^2 f_0 ) \end{equation} (Equation 9.9 in Gilbert 1987, page 109)

and

\begin{equation} n' = \frac{n n_v \rho^2}{n - n_v (1- \rho^2)} \end{equation} (Equation 9.10 in Gilbert 1987, page 109)

where

\begin{equation} f_0 = \sqrt{ \frac{1 - \rho^2}{\rho^2 R}} \end{equation} (Equation 9.8 in Gilbert 1987, page 109)
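Equations (6)-(8) can be sketched in Python as follows (a minimal illustration; results are rounded up to whole samples here, which is an assumption about rounding, not necessarily what VSP does):

```python
from math import ceil, sqrt

def cs_sample_sizes(n_v, rho, R):
    """Equations (6)-(8): optimal numbers of expensive (n) and inexpensive
    (n') analyses, given n_v, the correlation rho, and R = c_Ex / c_Inex."""
    f0 = sqrt((1 - rho**2) / (rho**2 * R))                 # Eq (8)
    n = n_v * (1 - rho**2 + rho**2 * f0)                   # Eq (6)
    n_prime = n * n_v * rho**2 / (n - n_v * (1 - rho**2))  # Eq (7); equals n / f0
    return ceil(n), ceil(n_prime)

# Hypothetical example: n_v = 35, rho = 0.9, expensive costs 10x inexpensive
print(cs_sample_sizes(35, 0.9, 10))  # → (11, 72)
```

Note that \(f_0\) is the optimal ratio \(n/n'\): substituting Equation (6) into Equation (7) reduces the latter to \(n' = n/f_0\), so the stronger the correlation and the larger the cost ratio, the fewer expensive analyses are needed per inexpensive one.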

When CS is not more cost effective than simple random sampling, VSP uses the iterative procedure [Equations (2), (3), (4), and (5) above] to compute \(n_v\), which is now the number, \(n\), of expensive measurements needed. No computation of \(n'\) is required because simple random sampling rather than CS is used.

Method Used to Estimate the Mean and Standard Error When CS is Cost Effective

The mean and SE are computed assuming that there is a linear relationship between the measurements obtained using the expensive and inexpensive analysis methods. This assumption should be verified by the VSP user before the collaborative sampling design in VSP is used. Once the design is used and data are entered into VSP, VSP shows the linear regression plot of the data. This plot should be examined to verify that the assumption of a linear relationship does indeed seem reasonable. VSP also computes the correlation coefficient, \(\rho\), using the inexpensive and expensive data. The VSP user should use this correlation and the cost ratio, \(R\), in Equation (1) above to confirm that CS is indeed more cost effective than simple random sampling.

The process used by VSP in the collaborative sampling module to compute the mean and SE is given in the following steps.

1. The \(n'\) inexpensive analysis measurements and the \(n\) expensive analysis measurements are made by the VSP user and entered into VSP. Let \(x_{{Ex}_i}\) and \(x_{{Inex}_i}\) denote the expensive and inexpensive measurements, respectively, on the \(i^{th}\) unit.

2. VSP estimates the mean by computing \( \bar x_{cs}\) (cs stands for collaborative sampling) as follows

\begin{equation} \bar x_{cs} = \bar x_{Ex} + b( \bar x_{n'} - \bar x_{Inex}) \end{equation} (Equation 9.1 in Gilbert 1987, page 107)

where \(\bar x_{Ex}\) and \(\bar x_{Inex} \) are the means of the \(n\) expensive and inexpensive measurements, respectively, \( \bar x_{n'}\) is the mean of the \(n'\) inexpensive values, and \(b\) is the slope of the estimated regression of expensive on inexpensive values.

3. VSP computes the estimated standard error (standard deviation of \(\bar x_{cs}\)) as follows:

\begin{equation} SE = \sqrt{s^2(\bar x_{cs})} = \sqrt{ s_{Ex \bullet Inex}^2 \left[ \frac{1}{n} + \frac{(\bar x_{n'} - \bar x_{Inex})^2}{(n-1) s_{Inex}^2} \right] + \frac{s_{Ex}^2 - s_{Ex \bullet Inex}^2}{n'}} \end{equation} (Equation 9.2 in Gilbert 1987, page 107, after \(N\) is set equal to infinity)

where \(s_{Ex}^2\) and \(s_{Inex}^2\) are the variances of the \(n\) expensive and inexpensive measurements, respectively, and \(s_{Ex \bullet Inex}^2 \) is the residual variance about the estimated linear regression line. The equations used to calculate the quantities in Equations (9) and (10) are as follows:

\begin{equation} \bar x_{Ex} = \frac{1}{n} \displaystyle\sum_{i=1}^n x_{Ex_i} \end{equation}

\begin{equation} \bar x_{Inex} = \frac{1}{n} \displaystyle\sum_{i=1}^n x_{{Inex}_i} \end{equation}

\begin{equation} \bar x_{n'} = \frac{1}{n'} \displaystyle\sum_{i=1}^{n'} x_{{Inex}_i} \end{equation}

\begin{equation} b = \frac{ \displaystyle\sum_{i=1}^n (x_{{Ex}_i} - \bar x_{Ex} )(x_{{Inex}_i} - \bar x_{Inex})}{ \displaystyle\sum_{i=1}^n (x_{{Inex}_i} - \bar x_{Inex})^2} \end{equation}

\begin{equation} s_{Ex}^2 = \frac{1}{n-1} \displaystyle\sum_{i=1}^n (x_{{Ex}_i} - \bar x_{Ex} )^2 \end{equation}

\begin{equation} s_{Inex}^2 = \frac{1}{n-1} \displaystyle\sum_{i=1}^n (x_{{Inex}_i} - \bar x_{Inex} )^2 \end{equation}

\begin{equation} s_{Ex \bullet Inex}^2 = \frac{n-1}{n-2} (s_{Ex}^2 - b^2 s_{Inex}^2) \end{equation}

Equation (14) for estimating the slope of the regression line is appropriate if the residual variance about the regression line is constant, i.e., if the variance about the line is the same along all portions of the line. If the measurements obtained indicate that the residual variance changes along the line, e.g., if the variance is larger for larger values of the inexpensive measurement, then a weighted regression analysis should be conducted to provide new estimates of the slope (Equation 14) and the residual variance (Equation 17). These new estimates should then be used in Equation (9) to estimate the mean and in Equation (10) to estimate the SE, and the revised mean and SE used to compute the confidence limits (Equations 18, 19, 20, 21). The current version of VSP does not conduct weighted regression; a statistician should be consulted.

4. VSP uses the following formulas to calculate the lower and upper confidence limits. These formulas are based on the assumption that the data are normally distributed or that \(n\) and \(n'\) are large enough that the estimated mean is normally distributed.

Lower One-Sided Confidence Limit: \begin{equation} \bar x_{cs} - Z_{1- \alpha} SE \end{equation}

Upper One-Sided Confidence Limit: \begin{equation} \bar x_{cs} + Z_{1- \alpha} SE \end{equation}

or

Lower Two-Sided Confidence Limit: \begin{equation} \bar x_{cs} - Z_{1- \alpha /2} SE \end{equation}

Upper Two-Sided Confidence Limit: \begin{equation} \bar x_{cs} + Z_{1- \alpha /2} SE \end{equation}
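The steps above can be sketched in Python. This is a minimal illustration of Equations (9)-(17) and the two-sided limits (20)-(21); the function and variable names are illustrative, not VSP's, and ordinary (unweighted) regression is assumed throughout.

```python
from math import sqrt
from statistics import NormalDist

def cs_mean_se(x_ex, x_inex_paired, x_inex_all):
    """Equations (9)-(17): CS estimate of the mean and its standard error.
    x_ex: the n expensive results; x_inex_paired: the inexpensive results
    at those same n locations (same order); x_inex_all: all n' inexpensive
    results (including the paired ones)."""
    n, n_prime = len(x_ex), len(x_inex_all)
    xbar_ex = sum(x_ex) / n                                            # Eq (11)
    xbar_inex = sum(x_inex_paired) / n                                 # Eq (12)
    xbar_np = sum(x_inex_all) / n_prime                                # Eq (13)
    s2_ex = sum((x - xbar_ex) ** 2 for x in x_ex) / (n - 1)            # Eq (15)
    s2_inex = sum((x - xbar_inex) ** 2 for x in x_inex_paired) / (n - 1)  # Eq (16)
    b = (sum((xe - xbar_ex) * (xi - xbar_inex)
             for xe, xi in zip(x_ex, x_inex_paired))
         / sum((xi - xbar_inex) ** 2 for xi in x_inex_paired))         # Eq (14)
    s2_resid = (n - 1) / (n - 2) * (s2_ex - b ** 2 * s2_inex)          # Eq (17)
    xbar_cs = xbar_ex + b * (xbar_np - xbar_inex)                      # Eq (9)
    se = sqrt(s2_resid * (1 / n + (xbar_np - xbar_inex) ** 2
                          / ((n - 1) * s2_inex))
              + (s2_ex - s2_resid) / n_prime)                          # Eq (10)
    return xbar_cs, se

def cs_two_sided_limits(xbar_cs, se, alpha=0.05):
    """Equations (20)-(21): two-sided confidence limits on the CS mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return xbar_cs - z * se, xbar_cs + z * se
```

For perfectly linear data (e.g., each expensive result exactly \(2x + 1\) times its paired inexpensive result) the residual variance is zero and the SE reduces to \(\sqrt{s_{Ex}^2 / n'}\), illustrating how a strong linear relationship lets the cheap measurements carry most of the precision.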

Method Used to Compute the Mean, Standard Deviation, and Standard Deviation of the Estimated Mean When CS is Not Cost Effective

The process used by VSP in the CS module to compute the estimated mean, standard deviation, and standard deviation of the estimated mean is given in the following steps. This process is used when collaborative sampling is not found to be cost effective and simple random sampling or systematic grid sampling using the expensive analysis method is used instead.

1. After the samples are collected and the \(n\) measurements have been obtained, the VSP user enters them into VSP. Let \(x_i\) denote the expensive measurement on the \(i^{th}\) unit.

2. VSP estimates the mean by computing \(\bar x\) as follows:

\begin{equation} \bar x = \frac{1}{n} \displaystyle\sum_{i=1}^n x_i \end{equation}

3. VSP computes the standard deviation of the \(n\) measurements as follows:

\begin{equation} s = \sqrt{ \frac{1}{n-1} \displaystyle\sum_{i=1}^n (x_i - \bar x)^2 } \end{equation}

4. VSP computes the estimated standard error (standard deviation of \(\bar x\)) as follows:

\begin{equation} s( \bar x ) = \frac{s}{\sqrt{n}} \end{equation}

5. VSP uses the following formulas to calculate the lower and upper confidence limits. These formulas are based on the assumption that the data are normally distributed or that \(n\) is large enough that the estimated mean is normally distributed.

Lower One-Sided Confidence Limit: \begin{equation} \bar x - t_{1- \alpha ,df} s( \bar x ) \end{equation}

Upper One-Sided Confidence Limit:  \begin{equation} \bar x + t_{1- \alpha ,df} s( \bar x ) \end{equation}

or

Lower Two-Sided Confidence Limit: \begin{equation} \bar x - t_{1- \alpha /2,df} s( \bar x ) \end{equation}

Upper Two-Sided Confidence Limit: \begin{equation} \bar x + t_{1- \alpha /2,df} s( \bar x ) \end{equation}
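Steps 2-4 above reduce to the familiar sample mean, sample standard deviation, and standard error. A minimal Python sketch of Equations (22)-(24) (the t quantile for the limits in Equations (25)-(28) would come from, e.g., `scipy.stats.t.ppf` and is omitted here):

```python
from math import sqrt

def srs_mean_sd_se(xs):
    """Equations (22)-(24): sample mean, standard deviation (n-1
    denominator), and standard error of the mean for the n expensive
    measurements under simple random or grid sampling."""
    n = len(xs)
    xbar = sum(xs) / n                                    # Eq (22)
    s = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))  # Eq (23)
    return xbar, s, s / sqrt(n)                           # Eq (24)
```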

Statistical Assumptions

The assumptions that underlie the equations used to compute \(n'\) and \(n\) are:

1. There is an underlying linear relationship between the expensive and inexpensive analysis methods.

2. The true correlation coefficient, \(\rho\), is well known from prior studies or has been well estimated using a preliminary sampling study conducted at the study site or a very similar study site.

3. Collaborative sampling is more cost effective than simple random sampling.

4. The optimum values of \(n'\) and \(n\) are used to estimate the true mean, i.e., the values of \(n'\) and \(n\) computed, assuming the value of \(\rho\) used is valid.

5. The field sampling locations are selected using simple random sampling or systematic grid sampling.

6. The costs \(c_{Ex}\) and \(c_{Inex}\) are appropriate.

References:

Gilbert, R.O., 1987. Statistical Methods for Environmental Pollution Monitoring, Van Nostrand Reinhold, New York (now published by Wiley & Sons, New York, 1997).

The Collaborative Sampling for Constructing a Confidence Interval dialog contains the following controls:

Correlation Coefficient between Expensive and Inexpensive Measurement Methods

Cost of a Single Expensive Analysis Measurement

Cost of a Single Inexpensive Analysis Measurement

One-Sided Confidence Interval:

Desired Width of a One-Sided Confidence Interval

Two-Sided Confidence Interval:

Desired Half-Width of a Two-Sided Confidence Interval

Confidence Level

Estimated Standard Deviation

Sample Placement page