Collaborative Sampling to Compare a Mean against a Fixed Threshold

What is the Collaborative Sampling Design?

The Collaborative Sampling (CS) design uses two measurement techniques to obtain a cost-effective estimate of the mean of the characteristic of interest for the specified target population (study area). One measurement technique is the standard analysis (referred to in VSP as the expensive analysis method), and the other is a less expensive and less accurate measurement method (referred to in VSP as the inexpensive analysis method). The idea behind CS is to reduce the number of expensive analyses needed by supplementing a smaller set of expensive analyses with a relatively large number of inexpensive analyses. It works like this: at \( n'\) field locations selected using simple random sampling or grid sampling, the inexpensive analysis method is used. Then, for each of \(n\) of the \(n'\) locations, the expensive analysis method is also conducted. The data from these two analysis methods are used to estimate the mean and the standard error (SE: the standard deviation of the estimated mean). The method of estimating the mean and SE assumes there is a linear relationship between the inexpensive and expensive analysis methods. If the linear correlation between the two methods is sufficiently high (close to 1), and if the cost of the inexpensive analysis method is sufficiently less than that of the expensive analysis method, then CS is expected to be more cost-effective at estimating the population mean than spending the entire measurement budget on expensive analysis results at field locations selected using simple random sampling or grid sampling.

The VSP CS module computes the values of \(n'\) and \(n\) that should be used to estimate the CS mean and SE. VSP also computes the mean, the standard error (standard deviation of the estimated mean), and other outputs that can be used to assess the validity of the assumptions that underlie CS. The equations used for these computations are provided below. Note that Gilbert (1987, Chapter 9) and other statisticians use the term Double Sampling instead of Collaborative Sampling. The term Collaborative Sampling is used in VSP to prevent potential users from thinking that a Double Sampling design requires doubling the number of samples.

Method Used to Determine if Collaborative Sampling is Cost Effective

Before VSP will compute \(n'\) and \(n\), it first determines if CS is cost effective compared with using the entire measurement budget to obtain only expensive analysis results at field locations selected using simple random sampling or grid sampling. If CS is found to be cost effective, then VSP will compute \(n'\) and \(n\). If CS is not cost effective, then VSP computes the number of field locations that should be collected and analyzed using only the expensive analysis method to estimate the mean and SE.

VSP declares CS to be cost effective if the following inequality is satisfied:

\begin{equation} \rho^2 > \frac{4R}{(1 + R)^2} \end{equation}

where

\( \rho \)

is the true correlation coefficient between the expensive and inexpensive analysis measurements

\( R \)

is \( \frac{C_{Ex}}{C_{Inex}} \)

\( C_{Ex} \)

is the per unit cost of an expensive analysis, including the cost of collecting, handling, preparing, and measuring the sample

\( C_{Inex} \)

is the per unit cost of an inexpensive analysis, including finding the field location and conducting the inexpensive analysis method

 

Equation (1) above is from Gilbert (1987, page 108). It is assumed that the following cost equation applies:

\( C = C_{Ex} n + C_{Inex} n' \)

where

\( C \)

is the total dollars available for doing \(n'\) inexpensive analyses and \(n\) expensive analyses

Note that \(C\) does not include what might be termed overhead costs, such as project management, preparation of the Quality Assurance Project Plan or Sampling and Analysis Plan, and QA/QC.
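For illustration only (this is not VSP code), the inequality in Equation (1) can be checked directly once \(\rho\), \(C_{Ex}\), and \(C_{Inex}\) are specified; the function name and the costs used below are hypothetical.

```python
def cs_is_cost_effective(rho, cost_expensive, cost_inexpensive):
    """Check Equation (1): CS is cost effective if rho^2 > 4R / (1 + R)^2,
    where R = cost_expensive / cost_inexpensive."""
    R = cost_expensive / cost_inexpensive
    return rho ** 2 > 4.0 * R / (1.0 + R) ** 2

# Hypothetical example: $400 per expensive analysis, $20 per inexpensive analysis,
# and an assumed correlation of 0.85 between the two methods.
print(cs_is_cost_effective(0.85, 400.0, 20.0))  # True: 0.7225 > 4(20)/(21)^2 = 0.181
```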

When CS is cost-effective, VSP uses the following method to Compute the Number of Expensive and Inexpensive Analysis Measurements

VSP computes \(n'\) and \(n\) such that the product of the total measurement cost, \(C\), and the variance of the estimated mean, \( \large \sigma_{\bar x_{cs}}^2 \), is minimized. VSP uses the following formulas that were derived using the same method of proof used in Appendix A of EPA (2000a):

\begin{equation} n' = \left[ \frac{(Z_{1- \alpha} + Z_{1- \beta} )^2 \sigma_{total,ex}^2}{\Delta^2} + \frac{1}{2}Z_{1- \alpha}^2 \right] \rho ( \sqrt{R(1- \rho^2 )} + \rho) \end{equation}

and

\begin{equation} n = \left[ \frac{(Z_{1- \alpha} + Z_{1- \beta} )^2 \sigma_{total,ex}^2}{\Delta^2} + \frac{1}{2}Z_{1- \alpha}^2 \right] \left[1 - \rho^2 + \rho \sqrt{ \frac{(1 - \rho^2)}{R}} \right] \end{equation}

where

\(n'\)

is the recommended minimum number of samples to measure with the inexpensive method

\(n\)

is the recommended minimum number (subset) of the \(n'\) samples to also measure with the expensive method

\(\alpha\)

is the acceptable probability that the statistical test will falsely reject the null hypothesis

\(\beta\)

is the acceptable probability that the statistical test will falsely accept the null hypothesis

\(\Delta\)

is the width of the gray region in the Decision Performance Goal Diagram (DPGD)

\(\sigma_{total,ex}\)

is the total standard deviation of the expensive measurements, including analytical error

\(Z_{1- \alpha}\)

is the value of the standard normal distribution such that the proportion of the distribution less than \(Z_{1- \alpha}\) is \( 1 - \alpha\)

\(Z_{1- \beta}\)

is the value of the standard normal distribution such that the proportion of the distribution less than \(Z_{1- \beta}\) is \(1 - \beta\)

\(\rho\)

is the assumed correlation between the expensive and inexpensive measurements obtained on the same samples

\(C_{ex}\)

is the cost of making a single expensive measurement

\(C_{inex}\)

is the cost of making a single inexpensive measurement

\(R\)

is the cost ratio \( \frac{C_{ex}}{C_{inex}}\)

\(C\)

is the total measurement cost, i.e., \(C = C_{inex}n' + C_{ex}n \)
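For illustration only, the following sketch evaluates Equations (2) and (3) from these inputs. It is not VSP's implementation: the function and variable names are ours, and rounding the results up to whole samples is an assumed convention.

```python
import math
from statistics import NormalDist

def cs_sample_sizes(alpha, beta, delta, sigma_total_ex, rho,
                    cost_expensive, cost_inexpensive):
    """Evaluate Equations (2) and (3): the recommended numbers of inexpensive
    (n_prime) and expensive (n) analyses for a Collaborative Sampling design."""
    z_1ma = NormalDist().inv_cdf(1.0 - alpha)   # Z_{1-alpha}
    z_1mb = NormalDist().inv_cdf(1.0 - beta)    # Z_{1-beta}
    R = cost_expensive / cost_inexpensive       # cost ratio R

    # Bracketed term common to Equations (2) and (3); this is also Equation (4)
    base = ((z_1ma + z_1mb) ** 2 * sigma_total_ex ** 2) / delta ** 2 + 0.5 * z_1ma ** 2

    n_prime = base * rho * (math.sqrt(R * (1.0 - rho ** 2)) + rho)       # Eq. (2)
    n = base * (1.0 - rho ** 2 + rho * math.sqrt((1.0 - rho ** 2) / R))  # Eq. (3)

    # Round up to whole samples (an assumed convention, not necessarily VSP's rule)
    return math.ceil(n_prime), math.ceil(n)

# Hypothetical inputs: alpha = beta = 0.05, gray-region width 1.0, total standard
# deviation 2.0, rho = 0.9, $400 per expensive and $20 per inexpensive analysis.
print(cs_sample_sizes(0.05, 0.05, 1.0, 2.0, 0.9, 400.0, 20.0))  # (115, 13)
```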

 

 

When CS is not cost-effective, VSP uses the following method to Compute the Number of Expensive Analysis Measurements

VSP uses the following equation to compute the required number of expensive measurements, \(n\), needed to compare a mean to a threshold using the Z test described below:

\begin{equation} n = \frac{(Z_{1- \alpha} + Z_{1- \beta})^2 \sigma_{total,ex}^2}{\Delta^2} +\frac{1}{2} Z_{1- \alpha}^2 \end{equation}                     (derived in EPA 2000a, Appendix A)

The parameters in Equation (4) are defined above.
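Equation (4) is the same bracketed quantity that appears in Equations (2) and (3). A minimal sketch, with illustrative names and an assumed round-up convention:

```python
import math
from statistics import NormalDist

def n_expensive_only(alpha, beta, delta, sigma_total_ex):
    """Evaluate Equation (4): number of expensive measurements for the Z test."""
    z_1ma = NormalDist().inv_cdf(1.0 - alpha)
    z_1mb = NormalDist().inv_cdf(1.0 - beta)
    n = ((z_1ma + z_1mb) ** 2 * sigma_total_ex ** 2) / delta ** 2 + 0.5 * z_1ma ** 2
    return math.ceil(n)  # rounded up to a whole sample (assumed convention)

# Same hypothetical inputs as the example above: alpha = beta = 0.05, delta = 1.0, sigma = 2.0.
print(n_expensive_only(0.05, 0.05, 1.0, 2.0))  # 45
```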

When CS is cost-effective, VSP uses the following method to Estimate the Mean and Standard Error

The estimated mean and SE (standard deviation of the estimated mean) are computed assuming that there is a linear relationship between the expensive and inexpensive measurements obtained on the same set of \(n\) samples. The VSP user should verify this assumption before using the CS module. After the CS design is determined and the resulting measurements are entered into VSP, VSP shows the linear regression plot of these data. This plot should be examined to verify that the assumption of a linear relationship is reasonable. VSP also computes the correlation coefficient, \(\rho\), from the inexpensive and expensive data. The VSP user should enter this correlation and the cost ratio, \(R\), into the CS module to confirm that CS is still more cost-effective than simple random sampling.

The process used by VSP in the CS module to compute the mean and SE is given in the following steps.

1. The \(n'\) inexpensive and \(n\) expensive analysis measurements are made by the VSP user and entered into VSP. Let \( \large x_{{Ex}_i} \) and \( \large x_{{Inex}_i} \) denote the expensive and inexpensive measurements, respectively, on the \(i^{th}\) unit.

2. VSP estimates the mean by computing \( \bar x_{cs}\) (cs stands for collaborative sampling) as follows

\begin{equation} \large \bar x_{cs} = \bar x_{Ex} + b( \bar x_{n'} - \bar x_{Inex}) \end{equation}

where \( \bar x_{Ex}\) and \(\bar x_{Inex}\) are the means of the \(n\) expensive and inexpensive measurements, respectively, \(  \bar x_{n'}\) is the mean of the \(n'\) inexpensive values, and \(b\) is the slope of the estimated regression of expensive on inexpensive values.

3. VSP computes the estimated standard error (standard deviation of \( \bar x_{cs}\)) as follows:

\begin{equation} \large SE = \sqrt{s^2 ( \bar x_{cs})} = \sqrt{ s_{Ex \bullet Inex}^2 \Bigg[ \frac{1}{n} + \frac{(\bar x_{n'} - \bar x_{Inex} )^2}{(n-1) s_{Inex}^2} \Bigg] + \frac{s_{Ex}^2 - s_{Ex \bullet Inex}^2}{n'}} \end{equation}

where \( s_{Ex}^2 \) and \(s_{Inex}^2 \) are the variances of the \(n\) expensive and inexpensive measurements, respectively, and \(s_{Ex \bullet Inex}^2\) is the residual variance about the estimated linear regression line. The equations used to calculate the quantities in equations (5) and (6) are as follows:

\begin{equation} \large \bar x_{Ex} = \frac{1}{n} \displaystyle\sum_{i=1}^n x_{{Ex}_i} \end{equation}

\begin{equation} \large \bar x_{Inex} = \frac{1}{n} \displaystyle\sum_{i=1}^{n} x_{{Inex}_i} \end{equation}

\begin{equation} \large \bar x_{n'} = \frac{1}{n'} \displaystyle\sum_{i=1}^{n'} x_{{Inex}_i} \end{equation}

\begin{equation} \large b = \frac{ \displaystyle\sum_{i=1}^n (x_{{Ex}_i} - \bar x_{Ex})(x_{{Inex}_i} - \bar x_{Inex} )}{\displaystyle\sum_{i=1}^n ( x_{{Inex}_i} - \bar x_{Inex} )^2} \end{equation}

\begin{equation} \large s_{Ex}^2 = \frac{1}{n-1} \displaystyle\sum_{i=1}^n (x_{{Ex}_i }- \bar x_{Ex})^2 \end{equation}

\begin{equation} \large s_{Inex}^2 = \frac{1}{n-1} \displaystyle\sum_{i=1}^n (x_{{Inex}_i} - \bar x_{Inex})^2 \end{equation}

\begin{equation} \large s_{Ex \bullet Inex}^2 = \frac{n-1}{n-2} (s_{Ex}^2 - b^2 s_{Inex}^2) \end{equation}

Equation (10) for estimating the slope of the regression line is appropriate if the residual variance about the regression line is constant, i.e., the variance about the line is the same along all portions of the line. If the measurements obtained indicate that the residual variance changes along the line, e.g., if the variance is larger for larger values of the inexpensive measurement, then a weighted regression analysis should be conducted, which will provide a new estimate of the slope (Equation 10) and of the residual variance (Equation 13). These new estimates should then be used in Equation (5) to estimate the mean and in Equation (6) to estimate the SE. The revised mean and SE should then be used to compute the Z test statistic (Equation 17) to test the null hypothesis. The current version of VSP does not conduct weighted regression. A statistician should be consulted.
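For illustration only, the sketch below evaluates Equations (5) through (13) for the unweighted-regression case described above, assuming the expensive measurements and their paired inexpensive measurements are supplied in matching order; the function and argument names are ours, not VSP's. The returned mean and SE are the quantities that feed the Z test in Equation (17).

```python
import math

def cs_mean_and_se(x_expensive, x_inexpensive_paired, x_inexpensive_all):
    """Evaluate Equations (5)-(13).

    x_expensive:          the n expensive measurements
    x_inexpensive_paired: the inexpensive measurements on the same n samples,
                          in the same order as x_expensive
    x_inexpensive_all:    all n' inexpensive measurements
    """
    n = len(x_expensive)
    n_prime = len(x_inexpensive_all)

    mean_ex = sum(x_expensive) / n                    # Eq. (7)
    mean_inex = sum(x_inexpensive_paired) / n         # Eq. (8)
    mean_nprime = sum(x_inexpensive_all) / n_prime    # Eq. (9)

    sxy = sum((xe - mean_ex) * (xi - mean_inex)
              for xe, xi in zip(x_expensive, x_inexpensive_paired))
    sxx = sum((xi - mean_inex) ** 2 for xi in x_inexpensive_paired)
    b = sxy / sxx                                     # Eq. (10), regression slope

    var_ex = sum((xe - mean_ex) ** 2 for xe in x_expensive) / (n - 1)   # Eq. (11)
    var_inex = sxx / (n - 1)                                            # Eq. (12)
    var_resid = (n - 1) / (n - 2) * (var_ex - b ** 2 * var_inex)        # Eq. (13)

    mean_cs = mean_ex + b * (mean_nprime - mean_inex)                   # Eq. (5)
    se = math.sqrt(var_resid * (1.0 / n
                                + (mean_nprime - mean_inex) ** 2 / ((n - 1) * var_inex))
                   + (var_ex - var_resid) / n_prime)                    # Eq. (6)
    return mean_cs, se
```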

When CS is not cost-effective, VSP uses the following method to Compute the Mean and SE

The process used by VSP to compute the estimated mean and SE (standard deviation of the estimated mean) when CS is not cost effective compared to simple random sampling is given in the following steps:

1. After the \(n\) samples are collected and the \(n\) expensive measurements have been obtained, the VSP user enters them into VSP. Let \(x_i\) denote the expensive measurement on the \(i^{th}\) unit.

2. VSP estimates the mean by computing \( \bar x \) as follows:

\begin{equation} \bar x = \frac{1}{n} \displaystyle\sum_{i=1}^n x_i \end{equation}

3. VSP computes the standard deviation of the \(n\) measurements as follows:

\begin{equation} s = \sqrt{ \frac{1}{n-1} \displaystyle\sum_{i=1}^n (x_i - \bar x)^2} \end{equation}

4. VSP computes the standard error (standard deviation of \( \bar x\)) as follows:

\begin{equation} s_x = \frac{s}{\sqrt{n}} \end{equation}

When CS is Cost-Effective, VSP uses the Following Method to Test if the True Mean Exceeds a Specified Threshold Value

If the null hypothesis is \(H_0\): true mean \(\geq\) threshold value, then VSP computes

\begin{equation}  Z = \frac{ \bar x_{cs} - ThresholdValue}{\sqrt{s_{{\bar x}_{cs}}^2}} \end{equation}

and \(H_0\) is rejected if

$$ Z \leq - z_{1- \alpha} $$

where \(z_{1- \alpha}\) is the \((1- \alpha)^{th}\) percentile of the standard normal distribution. For example, if the VSP user specifies that \(\alpha\) = 0.05, then \(H_0\) is rejected if \( Z \leq \) -1.645.

If the null hypothesis is \(H_0\): true mean \(\leq\) threshold value, then VSP computes Equation (17) and \(H_0\) is rejected if

$$ Z \geq z_{1- \alpha} $$

where \(z_{1- \alpha}\) is defined above.
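A minimal sketch of Equation (17) and the two rejection rules, with illustrative names; \(\bar x_{cs}\) and its SE would come from Equations (5) and (6):

```python
from statistics import NormalDist

def cs_z_test(mean_cs, se_cs, threshold, alpha=0.05, null_is_mean_ge_threshold=True):
    """Evaluate Equation (17) and apply the corresponding rejection rule."""
    z = (mean_cs - threshold) / se_cs              # Eq. (17)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)     # z_{1-alpha}
    if null_is_mean_ge_threshold:
        reject_h0 = z <= -z_crit    # H0: true mean >= threshold value
    else:
        reject_h0 = z >= z_crit     # H0: true mean <= threshold value
    return z, reject_h0

# Hypothetical values: estimated CS mean 4.2, SE 0.5, action level 5.0, alpha = 0.05.
print(cs_z_test(4.2, 0.5, 5.0))  # Z = -1.6; H0 is not rejected because -1.6 > -1.645
```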

When CS is Not Cost-Effective, VSP uses the Following Method to Test if the True Mean Exceeds a Specified Threshold Value

If the null hypothesis is \(H_0\): true mean \(\geq\)  threshold value, then VSP computes

\begin{equation} Z = \frac{\bar x - ThresholdValue}{\sqrt{s_x^2}} \end{equation}

where \( \bar x\) and \(s_x^2\) are computed using Equations (14) and (16), respectively, and \(H_0\) is rejected if

$$ Z \leq -z_{1- \alpha} $$

If the null hypothesis is \(H_0\): true mean \(\leq\) threshold value, then VSP computes Equation (18) and \(H_0\) is rejected if

$$ Z \geq z_{1- \alpha} $$
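A corresponding sketch for the expensive-only case, combining Equations (14) through (16) with the Z test of Equation (18); names are illustrative:

```python
import math
from statistics import NormalDist, mean, stdev

def expensive_only_z_test(x, threshold, alpha=0.05, null_is_mean_ge_threshold=True):
    """Evaluate Equations (14)-(16) and (18) and apply the rejection rule."""
    xbar = mean(x)                         # Eq. (14)
    se = stdev(x) / math.sqrt(len(x))      # Eqs. (15) and (16)
    z = (xbar - threshold) / se            # Eq. (18)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    if null_is_mean_ge_threshold:
        reject_h0 = z <= -z_crit   # H0: true mean >= threshold value
    else:
        reject_h0 = z >= z_crit    # H0: true mean <= threshold value
    return z, reject_h0
```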

Statistical Assumptions

The assumptions that underlie the equations used to compute \(n'\) and \(n\) and to test the hypotheses are:

1. There is an underlying linear relationship between the expensive and inexpensive analysis methods.

2. The true correlation coefficient, \(\rho\), is well known from prior studies or has been well estimated using a preliminary sampling study conducted at the study site or a very similar study site.

3. Collaborative sampling is more cost effective than simple random sampling.

4. The optimum values of \(n'\) and \(n\) are used to estimate the true mean, i.e., the values of \(n'\) and \(n\) computed by VSP, which assume that the specified value of \(\rho\) is valid.

5. The field sampling locations are selected using simple random sampling or systematic grid sampling.

6. The costs \(C_{Ex}\) and \(C_{Inex}\) are appropriate.

7. The scatter (variance) of expensive measurements about the regression line is constant along the entire length of the line.

8. The measurements are normally, or approximately normally, distributed.

References:

Gilbert, R.O., 1987. Statistical Methods for Environmental Pollution Monitoring, Van Nostrand Reinhold, New York (now published by Wiley & Sons, New York, 1997).

EPA. 2006a. Guidance on Systematic Planning Using the Data Quality Objectives Process. EPA QA/G-4, EPA/240/B-06/001, U.S. Environmental Protection Agency, Office of Environmental Information, Washington, DC.

The Collaborative Sampling for Comparing a Mean to a Fixed Threshold dialog contains the following controls:

Correlation between Expensive and Inexpensive Measurement Methods

Cost of a Single Expensive Analysis Measurement

Cost of a Single Inexpensive Analysis Measurement

Null Hypothesis

Percent Confident

Action Level

Width of Gray Area (Delta) / LBGR / UBGR (when null hypothesis = "site is unacceptable")

Width of Gray Area (Delta) / LBGR / UBGR (when null hypothesis = "site is acceptable")

Type II Error Rate (Beta) (when null hypothesis = "site is unacceptable")

Type II Error Rate (Beta) (when null hypothesis = "site is acceptable")

Estimated Standard Deviation

Sample Placement page

Cost page

Data Analysis page

Data Entry sub-page

Summary Statistics sub-page

Tests sub-page

Plots sub-page

Analyte page