Collaborative Sampling for Estimating a Mean

What is the Collaborative Sampling Design?

The Collaborative Sampling (CS) design uses two measurement techniques to obtain a cost effective estimate of the mean of the characteristic of interest for the specified target population (study area). One measurement technique is the standard analysis (referred to in VSP as the expensive analysis method), and the other is a less expensive and less accurate measurement method (referred to in VSP as the inexpensive analysis method). The idea behind CS is to reduce the number of expensive analyses and to use the savings to obtain a relatively large number of inexpensive analyses. It works like this: At \(n'\) field locations selected using simple random sampling or grid sampling, the inexpensive analysis method is used. Then, at \(n\) of the \(n'\) locations, the expensive analysis method is also conducted. The data from these two analysis methods are used to estimate the mean and the standard error (SE: the standard deviation of the estimated mean). The method of estimating the mean and SE assumes there is a linear relationship between the inexpensive and expensive analysis methods. If the linear correlation between the two methods is sufficiently high (close to 1), and if the cost of the inexpensive analysis method is sufficiently less than that of the expensive analysis method, then CS is expected to be more cost effective at estimating the population mean than if the entire measurement budget were spent on obtaining only expensive analysis results at field locations selected using simple random sampling or grid sampling.

The VSP CS module computes the values of \(n'\) and \(n\) that should be used to estimate the CS mean and SE. The formulas used in VSP for computing \(n'\) and \(n\) are Equations 9.6 through 9.10 in Gilbert (1987). VSP uses Equation 9.1 in Gilbert (1987) to compute the mean and the square root of Equation 9.2 (with \(N\) set equal to infinity) to compute the SE. It should be noted that Gilbert (1987, Chapter 9) and other statisticians use the term Double Sampling instead of Collaborative Sampling. Collaborative Sampling rather than Double Sampling is used in VSP in order to prevent potential users from thinking that a Double Sampling design requires doubling the number of samples.

Method Used to Determine if Collaborative Sampling is Cost Effective

Before VSP computes \(n'\) and \(n\), it first determines whether CS is cost effective compared with using the entire measurement budget to obtain only expensive analysis results at field locations selected using simple random sampling or grid sampling. If CS is found to be cost effective, then VSP computes \(n'\) and \(n\). If CS is not cost effective, then VSP computes the number of field locations at which samples should be collected and analyzed using only the expensive analysis method to estimate the mean and SE.

VSP declares CS to be cost effective if the following inequality is satisfied:

\begin{equation} \rho^2 \gt \frac{4R}{(1+R)^2} \end{equation}

where

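\(\rho\) is the true correlation coefficient between the expensive and inexpensive analysis measurements.

\(R = c_{Ex}/c_{Inex}\) is the ratio of the per unit cost of an expensive analysis to the per unit cost of an inexpensive analysis.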

\(c_{Ex}\) is the per unit cost of an expensive analysis, including the cost of collecting, handling, preparing, and measuring the sample.

\(c_{Inex}\) is the per unit cost of an inexpensive analysis, including finding the field location and conducting the inexpensive analysis method.

Equation (1) above is from Gilbert (1987, page 108). It is assumed that the following cost equation applies:

\(C = c_{Ex}n+c_{Inex}n'\) = total dollars available for doing \(n'\) inexpensive analyses and \(n\) expensive analyses

Note that \(C\) does not include what might be termed overhead costs of project management, preparing the Quality Assurance Project Plan or Sampling and Analysis Plan, QA/QC, and other such costs.
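The check in Eq. (1) can be sketched in a few lines of Python. This is only an illustration of the inequality as stated above, not VSP's own code; the function names `cost_effectiveness_threshold` and `cs_is_cost_effective` are hypothetical.

```python
def cost_effectiveness_threshold(cost_ex: float, cost_inex: float) -> float:
    """Right-hand side of Eq. (1): 4R / (1 + R)^2 with R = c_Ex / c_Inex."""
    if cost_ex <= 0 or cost_inex <= 0:
        raise ValueError("per unit costs must be positive")
    R = cost_ex / cost_inex
    return 4.0 * R / (1.0 + R) ** 2


def cs_is_cost_effective(rho: float, cost_ex: float, cost_inex: float) -> bool:
    """Return True when rho^2 exceeds the threshold in Eq. (1)."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie between -1 and 1")
    return rho ** 2 > cost_effectiveness_threshold(cost_ex, cost_inex)


# Illustrative values (assumed, not from the VSP documentation):
# an expensive analysis at 400 and an inexpensive one at 40 give R = 10.
print(cost_effectiveness_threshold(400.0, 40.0))
print(cs_is_cost_effective(0.9, 400.0, 40.0))
print(cs_is_cost_effective(0.5, 400.0, 40.0))
```

With these assumed costs the threshold is \(40/121\), so a correlation of 0.9 satisfies Eq. (1) while a correlation of 0.5 does not.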

Equations Used to Compute the Number of Inexpensive Analyses, \(n'\), and the Number of Expensive Analyses, \(n\)

This section provides the equations VSP uses to compute \(n'\) and \(n\). VSP gives the user two options for optimizing the values of \(n'\) and \(n\):

Design Option 1: Estimate the mean with the lowest possible SE under the restriction that there is a fixed upper limit, \(C\), on the measurement budget, i.e., on the budget for conducting the \(n'\) inexpensive analyses and the \(n\) expensive analyses.

Design Option 2: Estimate the mean under the restriction that the variance of the estimated mean (square of the SE) does not exceed the variance of the mean that would be achieved if the entire measurement budget were devoted to doing only expensive analyses.

Design Option 1: Fixed Upper Limit on the Measurement Budget, \(C\)

If the VSP user selects Design Option 1, then VSP computes \(n'\) and \(n\) as follows:

\begin{equation} n = \frac{C{f_0}}{c_{Ex}{f_0}+c_{Inex}} \end{equation}

and

\begin{equation} n' = \frac{C-c_{Ex}n}{c_{Inex}} \end{equation}

where

\begin{equation} f_0 = \Big(\frac{1-\rho^2}{\rho^2R}\Big)^{1/2} \end{equation}

and \(f_0\) is set equal to 1 if Eq. (4) gives an \(f_0\) greater than 1.
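As a minimal sketch, Eqs. (2) through (4) can be evaluated as follows. The function names are hypothetical, and the values are returned unrounded because this section does not state how VSP rounds \(n\) and \(n'\) to whole numbers.

```python
import math


def f0_ratio(rho: float, cost_ex: float, cost_inex: float) -> float:
    """Eq. (4), capped at 1 as described in the text."""
    if not 0.0 < abs(rho) <= 1.0:
        raise ValueError("rho must be nonzero and between -1 and 1")
    R = cost_ex / cost_inex
    f0 = math.sqrt((1.0 - rho ** 2) / (rho ** 2 * R))
    return min(f0, 1.0)


def design_option_1(rho: float, cost_ex: float, cost_inex: float,
                    budget: float) -> tuple[float, float]:
    """Return (n, n_prime) from Eqs. (2) and (3) for a fixed budget C."""
    f0 = f0_ratio(rho, cost_ex, cost_inex)
    n = budget * f0 / (cost_ex * f0 + cost_inex)       # Eq. (2)
    n_prime = (budget - cost_ex * n) / cost_inex       # Eq. (3)
    return n, n_prime


# Illustrative inputs (assumed): rho = 0.9, c_Ex = 400, c_Inex = 40, C = 20000.
n, n_prime = design_option_1(0.9, 400.0, 40.0, 20000.0)
print(n, n_prime)
```

Substituting Eq. (2) into Eq. (3) shows that \(n/n' = f_0\) and that the full budget is spent, \(c_{Ex}n + c_{Inex}n' = C\); both properties can be used to check the output.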

Design Option 2: Fixed Upper Limit on the Variance of the Mean

If the VSP user selects Design Option 2, then VSP computes \(n'\) and \(n\) as follows:

\begin{equation} n = n_v(1-\rho^2+\rho^2{f_0}) \end{equation}

\begin{equation} n' = \frac{n{n_v}\rho^2}{n-{n_v}(1-\rho^2)} \end{equation}

where

\begin{equation} n_v = C/c_{Ex} \end{equation}

and \(f_0\) is computed as in Eq. (4).
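Eqs. (5) through (7) can be sketched in the same way. Again the function names are hypothetical, the results are left unrounded, and the sketch assumes \(0 < |\rho| < 1\) so that the denominator of Eq. (6) is not zero.

```python
import math


def f0_ratio(rho: float, cost_ex: float, cost_inex: float) -> float:
    """Eq. (4), capped at 1 as described in the text."""
    R = cost_ex / cost_inex
    return min(math.sqrt((1.0 - rho ** 2) / (rho ** 2 * R)), 1.0)


def design_option_2(rho: float, cost_ex: float, cost_inex: float,
                    budget: float) -> tuple[float, float]:
    """Return (n, n_prime) from Eqs. (5) and (6), with n_v from Eq. (7)."""
    if not 0.0 < abs(rho) < 1.0:
        raise ValueError("this sketch requires 0 < |rho| < 1")
    f0 = f0_ratio(rho, cost_ex, cost_inex)
    n_v = budget / cost_ex                                         # Eq. (7)
    n = n_v * (1.0 - rho ** 2 + rho ** 2 * f0)                     # Eq. (5)
    n_prime = n * n_v * rho ** 2 / (n - n_v * (1.0 - rho ** 2))    # Eq. (6)
    return n, n_prime


# Illustrative inputs (assumed): rho = 0.9, c_Ex = 400, c_Inex = 40, C = 20000.
n, n_prime = design_option_2(0.9, 400.0, 40.0, 20000.0)
print(n, n_prime, 400.0 * n + 40.0 * n_prime)
```

Because \(n - n_v(1-\rho^2) = n_v\rho^2 f_0\) by Eq. (5), Eq. (6) reduces to \(n' = n/f_0\), the same ratio as in Design Option 1. For the assumed inputs, the printed measurement cost is below the budget \(C\).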

Statistical Assumptions

The assumptions that underlie the equations used to compute \(n'\) and \(n\) are:

1. There is an underlying linear relationship between the expensive and inexpensive analysis methods.

2. The true correlation coefficient, \(\rho\), is well known from prior studies or has been well estimated using a preliminary sampling study conducted at the study site or a very similar site.

3. Collaborative sampling is more cost effective than simple random sampling.

4. The optimum values of \(n'\) and \(n\) are used to estimate the true mean, i.e., the values of \(n'\) and \(n\) computed using Equations (2) through (6), assuming the value of \(\rho\) is valid.

5. The field sampling locations are selected using simple random sampling or systematic grid sampling.

6. The costs \(c_{Ex}\) and \(c_{Inex}\) are appropriate.

7. The scatter (variance) of expensive measurements about the regression line is constant along the entire length of the line.

When CS is cost-effective, VSP uses the following method to Estimate the Mean and Standard Error:

The mean and SE are computed assuming that there is a linear relationship between the measurements obtained using the expensive and inexpensive analysis methods. This assumption should be verified by the VSP user before collaborative sampling in VSP is used. However, note that once the VSP collaborative sampling design is used and data are entered into VSP, VSP shows the linear regression plot of the data. This plot should be examined to verify that the assumption of a linear relationship does indeed seem reasonable. Also, VSP computes the correlation coefficient, \(\rho\), using the inexpensive and expensive data. The VSP user should use this correlation and the cost ratio, \(R\), to evaluate the inequality in Eq. (1) above to confirm that CS is indeed more cost effective than simple random sampling.

The process used by VSP in the collaborative sampling module to compute the mean and SE is given in the following steps.

1. VSP computes the number of field locations, \(n'\), needed and places them over the sampling area using simple random sampling or systematic grid sampling, as specified by the VSP user.

2. VSP selects a random sample of \(n\) units from among the \(n'\) units. If systematic grid sampling was used to determine the \(n'\) field locations, then the \(n\) units are also selected from among the \(n'\) units using systematic grid sampling.

3. The \(n'\) inexpensive analysis measurements and the \(n\) expensive analysis measurements are made by the VSP user and entered into VSP. Let \(x_{{Ex}_i}\) and \(x_{{Inex}_i}\) denote the expensive and inexpensive measurements, respectively, on the \(i\)th unit.

4. VSP estimates the mean by computing \({\bar{x}}_{CS}\) (CS stands for collaborative sampling) as follows:

\begin{equation} {\bar{x}}_{CS} = {\bar{x}}_{Ex} + b({\bar{x}}_{n'}-{\bar{x}}_{Inex}) \end{equation}

where \({\bar{x}}_{Ex}\) and \({\bar{x}}_{Inex}\) are the means of the \(n\) expensive and inexpensive measurements, respectively, \({\bar{x}}_{n'}\) is the mean of the \(n'\) inexpensive values, and \(b\) is the slope of the estimated regression of expensive on inexpensive values.

5. VSP computes the estimated standard error (standard deviation of \({\bar{x}}_{CS}\)) as follows:

\begin{equation} SE = \sqrt{s^2({\bar{x}}_{CS})} = \sqrt{s_{{Ex}\bullet{Inex}}^2 \Bigg[\frac{1}{n} + \frac{(\bar{x}_{n'}-\bar{x}_{Inex})^2}{(n-1)s_{Inex}^2}\Bigg]+\frac{s_{Ex}^2-s_{{Ex}\bullet{Inex}}^2}{n'}} \end{equation}

where \(s_{Ex}^2\) and \(s_{Inex}^2\) are the variances of the \(n\) expensive and inexpensive measurements, respectively, and \(s_{{Ex}\bullet{Inex}}^2\) is the residual variance about the estimated linear regression line. The equations used to calculate the quantities in equations (8) and (9) are as follows:

\begin{equation} \bar{x}_{Ex} = \frac{1}{n}\displaystyle\sum_{i=1}^{n}x_{{Ex}_i} \end{equation}

\begin{equation} \bar{x}_{Inex} = \frac{1}{n}\displaystyle\sum_{i=1}^{n}x_{{Inex}_i} \end{equation}

\begin{equation} \bar{x}_{n'} = \frac{1}{n'}\displaystyle\sum_{i=1}^{n'}x_{{Inex}_i} \end{equation}

\begin{equation} b = \frac{\displaystyle\sum_{i=1}^{n}(x_{{Ex}_i}-\bar{x}_{Ex})(x_{{Inex}_i}-\bar{x}_{Inex})}{\displaystyle\sum_{i=1}^{n}(x_{{Inex}_i}-\bar{x}_{Inex})^2} \end{equation}

\begin{equation} s_{Ex}^2 = \frac{1}{n-1}\displaystyle\sum_{i=1}^{n}(x_{{Ex}_i}-\bar{x}_{Ex})^2 \end{equation}

\begin{equation} s_{Inex}^2 = \frac{1}{n-1}\displaystyle\sum_{i=1}^{n}(x_{{Inex}_i}-\bar{x}_{Inex})^2 \end{equation}

\begin{equation} s_{{Ex}\bullet{Inex}}^2 = \frac{n-1}{n-2}(s_{Ex}^2 - b^2s_{Inex}^2) \end{equation}

Equation (13), for estimating the slope of the regression line, is appropriate if the residual variance about the regression line is constant, i.e., the variance about the line is the same along all portions of the line. If the measurements obtained indicate that the residual variance changes along the line, e.g., if the variance is larger for larger values of the inexpensive measurement, then a weighted regression analysis should be conducted, which will provide a new estimate of the slope (Equation 13) and of the residual variance (Equation 16). These new estimates should then be used in Equation (8) to estimate the mean and in Equation (9) to estimate the SE. The current version of VSP does not conduct weighted regression. A statistician should be consulted.
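The unweighted calculation in Eqs. (8) through (16) can be sketched as follows. This is an illustration under stated assumptions rather than VSP's implementation: the function name `cs_mean_and_se` and its argument names are hypothetical, the \(n\) paired measurements are assumed to be supplied in matching order, and the list of all \(n'\) inexpensive values is assumed to include the \(n\) paired ones.

```python
import math


def cs_mean_and_se(x_ex, x_inex_paired, x_inex_all):
    """Return (mean_cs, se) following Eqs. (8) through (16).

    x_ex          : n expensive measurements
    x_inex_paired : inexpensive measurements at the same n units
    x_inex_all    : inexpensive measurements at all n' units
    """
    n = len(x_ex)
    n_prime = len(x_inex_all)
    if len(x_inex_paired) != n:
        raise ValueError("paired lists must have the same length")
    if n < 3:
        raise ValueError("Eq. (16) divides by n - 2, so n must be at least 3")

    mean_ex = sum(x_ex) / n                      # Eq. (10)
    mean_inex = sum(x_inex_paired) / n           # Eq. (11)
    mean_all = sum(x_inex_all) / n_prime         # Eq. (12)

    sxx = sum((x - mean_inex) ** 2 for x in x_inex_paired)
    syy = sum((y - mean_ex) ** 2 for y in x_ex)
    sxy = sum((y - mean_ex) * (x - mean_inex)
              for y, x in zip(x_ex, x_inex_paired))

    b = sxy / sxx                                # Eq. (13)
    var_ex = syy / (n - 1)                       # Eq. (14)
    var_inex = sxx / (n - 1)                     # Eq. (15)
    resid_var = (n - 1) / (n - 2) * (var_ex - b ** 2 * var_inex)  # Eq. (16)

    mean_cs = mean_ex + b * (mean_all - mean_inex)                # Eq. (8)
    var_mean = (resid_var * (1.0 / n
                             + (mean_all - mean_inex) ** 2
                             / ((n - 1) * var_inex))
                + (var_ex - resid_var) / n_prime)
    return mean_cs, math.sqrt(var_mean)                           # Eq. (9)


# Made-up data in which expensive = 2 * inexpensive + 1 exactly.
mean_cs, se = cs_mean_and_se([3, 5, 7, 9], [1, 2, 3, 4],
                             [1, 2, 3, 4, 5, 6, 7, 8])
print(mean_cs, se)
```

In this made-up example the residual variance is zero, so the SE reduces to \(\sqrt{s_{Ex}^2/n'}\) and the estimated mean equals the regression line evaluated at \(\bar{x}_{n'}\).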

When CS is not cost-effective, VSP uses the following method to Compute the Mean, Standard Deviation and Standard Deviation of the Estimated Mean:

The process used by VSP in the CS module to compute the estimated mean, standard deviation, and standard deviation of the estimated mean is given in the following steps. This process is used when collaborative sampling is not found to be cost effective and simple random sampling or systematic grid sampling using the expensive analysis method is used instead.

1. VSP computes the number of field locations, \(n\), needed and places them over the sampling area using simple random sampling or systematic grid sampling, as specified by the VSP user.

2. The \(n\) expensive analysis measurements are made by the VSP user and entered into VSP. Let \(x_i\) denote the expensive measurement on the \(i\)th unit.

3. VSP estimates the mean by computing \(\bar{x}\) as follows:

\begin{equation} \bar{x} = \frac{1}{n}\displaystyle\sum_{i=1}^{n}x_i \end{equation}

4. VSP computes the standard deviation of the \(n\) measurements as follows:

\begin{equation} s = \sqrt{\frac{1}{n-1}\displaystyle\sum_{i=1}^{n}(x_i-\bar{x})^2} \end{equation}

5. VSP computes the estimated standard error (standard deviation of \(\bar{x}\)) as follows:

\begin{equation} s(\bar{x}) = \frac{s}{\sqrt{n}} \end{equation}
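The three steps above correspond to standard summary statistics. A minimal sketch using Python's standard library follows; the function name `srs_summary` is hypothetical.

```python
import math
import statistics


def srs_summary(x):
    """Return (mean, s, se) for n expensive measurements.

    mean is the sample mean, s is the sample standard deviation
    (divisor n - 1), and se is s / sqrt(n).
    """
    n = len(x)
    if n < 2:
        raise ValueError("at least two measurements are needed")
    mean = statistics.fmean(x)
    s = statistics.stdev(x)
    return mean, s, s / math.sqrt(n)


# Made-up measurements for illustration only.
print(srs_summary([2, 4, 4, 4, 5, 5, 7, 9]))
```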

For an illustration of collaborative sampling, please refer to Collaborative Sampling for Estimating the Mean in Chapter 3 of the VSP User’s Guide.

Reference:

Gilbert, RO. 1987. Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York. (Same text available as Gilbert, RO. 1997. John Wiley & Sons, New York.)

The Collaborative Sampling for Estimating a Mean dialog contains the following controls:

Correlation between Expensive and Inexpensive Measurement Methods

Cost of a Single Expensive Analysis Measurement

Cost of a Single Inexpensive Analysis Measurement

Total (Expensive and Inexpensive) Measurement Budget

Fixed Upper Limit on the Measurement Budget

Fixed Upper Limit on the Variance of the Mean

Sample Placement page