Adaptive Cluster Sampling for Estimating a Mean

Background Information

Adaptive cluster sampling begins by using a probability-based design such as simple random sampling to select an initial set of field units (locations) to sample. Then, additional neighboring samples are selected for observation when a characteristic of interest is present in an initial unit or when the initial unit has an observed value meeting some pre-specified condition (e.g., when a critical threshold value is exceeded).

Adaptive cluster sampling is considered in situations where the characteristic of interest is sparsely distributed but highly aggregated (clustered). Examples of such populations can be found in mineral investigations (unevenly distributed ore concentrations), animal and plant populations (rare and endangered species), pollution concentrations and hot spot investigations, and epidemiology of rare diseases. Adaptive cluster sampling is most useful when quick turnaround of analytical results is possible (e.g., with the use of field measurement technologies). Possible environmental applications of adaptive cluster sampling include soil remediation (investigating the extent of soil contamination while simultaneously estimating the mean), hazardous waste site characterizations, surveying Brownfields, and determining the extent of occurrence of effects of an airborne source of pollutant on nearby flora and fauna.

Adaptive Cluster Sampling Process

Implementing an adaptive cluster sampling design follows these basic steps:

  1. Divide the sample area into a grid of sampling units

Visual Sample Plan automatically divides the selected sample areas into square grid units of the specified size. The user specifies the size of the grid unit on the design dialog box. The units can be oriented at different angles by selecting Edit > Sample Areas > Set Grid Angle and Edit > Sample Areas > Reset Grid Angle from the menu.

  2. Define the sample design

    a. Choose an initial sample of units

VSP calculates the number of units that should be included in the initial sample, \( n_1 \), by computing the number of samples required for finding the confidence interval for the mean when simple random sampling is used.

For a two-sided confidence interval, the equation used to calculate the number of initial samples is:

$$ n_1 = s^2 \left( \frac{t_{1- \alpha/2, df}}{d} \right)^2 $$

For a one-sided confidence interval, the equation used is:

$$ n_1 = s^2 \left( \frac{t_{1- \alpha, df}}{d} \right)^2 $$

where

\( n_1 \)

is the recommended minimum number of initial samples,

\( s \)

is the estimated standard deviation of measurements of collected samples,

\( d \)

is the maximum acceptable width of a one-sided confidence interval, or the maximum acceptable half-width of a two-sided confidence interval,

\( t_{1- \alpha , df} \)

is the value of the Student's t-distribution with \( df = n_1 - 1 \) degrees of freedom such that the proportion of that distribution less than \( t_{1- \alpha , df} \) is \( 1- \alpha \),

\( t_{1- \alpha /2 , df} \)

is the value of the Student's t-distribution with \( df = n_1 - 1 \) degrees of freedom such that the proportion of that distribution less than \( t_{1- \alpha /2 , df} \) is \( 1- \alpha /2 \).

Because \( n_1 \) appears on both sides of the above equations (on the right side it appears in the degrees of freedom of the t-distribution), the equations must be solved iteratively. VSP does this automatically using the iteration scheme in Gilbert (1987, pg. 32).
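The iteration can be sketched in Python as follows. This is an illustrative sketch only, not VSP's actual code: the function names are hypothetical, the starting value from the normal distribution and the rounding up to a whole number of units are assumptions, and the t quantile is computed by simple numerical integration so that the example needs only the standard library (a statistics library would normally supply it).

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=1000):
    """CDF for x >= 0: Simpson's rule on [0, x] plus the symmetric lower half."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """Quantile for 0.5 < p < 1, found by bracketing and then bisection."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:
        lo, hi = hi, hi * 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def initial_sample_size(s, d, confidence=0.95, two_sided=True, max_iter=50):
    """Solve n1 = s^2 * (t / d)^2 iteratively, with df = n1 - 1."""
    alpha = 1.0 - confidence
    p = 1.0 - alpha / 2 if two_sided else 1.0 - alpha
    # Assumed starting point: use the normal quantile in place of t.
    n = max(2, math.ceil((NormalDist().inv_cdf(p) * s / d) ** 2))
    for _ in range(max_iter):
        t = t_quantile(p, n - 1)
        n_next = max(2, math.ceil((t * s / d) ** 2))  # assumed: round up
        if n_next == n:
            break
        n = n_next
    return n


if __name__ == "__main__":
    # Estimated standard deviation 2, half-width 1, 95% two-sided interval.
    print(initial_sample_size(s=2.0, d=1.0, confidence=0.95, two_sided=True))
```

The loop stops as soon as the sample size reproduces itself, which is the fixed point of the equation above. The cap on iterations guards against the rare case where the rounded value alternates between two neighboring integers.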

    b. Specify a rule or criterion for performing follow-up sampling

    A criterion is needed to decide when follow-up sampling of neighboring grid units is needed. A typical criterion is a threshold, that is, if a measured value exceeds some predetermined threshold value, then the neighboring grid units need to be sampled.

    c. Define the neighborhood of a sampling unit

VSP supports two different neighborhood designs: the 4-neighbor design and the 8-neighbor design. In the 4-neighbor design, the neighbor units are those grid units that share an edge with the original grid unit. In the 8-neighbor design, the neighbor units are those units that touch the original grid unit, even at a corner.
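The two neighborhood designs can be illustrated with a short Python sketch. It assumes a simple rectangular grid indexed by (row, column), which is a simplification of VSP's sample areas, and the function name is hypothetical.

```python
def neighbors(row, col, n_rows, n_cols, design=4):
    """Return the grid units neighboring (row, col) under the chosen design."""
    if design == 4:
        # Units sharing an edge with the original unit.
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif design == 8:
        # Units touching the original unit, including at a corner.
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    else:
        raise ValueError("design must be 4 or 8")
    return [(row + dr, col + dc) for dr, dc in offsets
            if 0 <= row + dr < n_rows and 0 <= col + dc < n_cols]


if __name__ == "__main__":
    print(neighbors(1, 1, 3, 3, design=4))
    print(neighbors(0, 0, 3, 3, design=8))
```

Units on the boundary of the grid simply have fewer neighbors under either design.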

  3. Conduct sampling of the initial units

Collect a sample in each of the \( n_1 \) initially selected units and apply the follow-up rule or criterion to those samples. If you are using a threshold criterion, you can enter the measured value into the grid unit in VSP and the program will determine if and where follow-up sampling is needed.

  4. Conduct follow-up sampling

As in step 3, take samples in the units where follow-up sampling is indicated and apply the follow-up rule or criterion to those samples. Follow-up sampling continues in this way until no more follow-up sampling is indicated by the criterion.

The final sample consists of clusters of selected (observed) units for those initial units where follow-up sampling was conducted, plus all of the initial locations for which there was no follow-up sampling. Each cluster is surrounded by a set of observed units that do not exhibit the characteristic of interest. These are called edge units. A cluster without its edge units is called a network. Any observed unit, including an edge unit, that does not exhibit the characteristic of interest is a network of size one. Hence, the final sample can be partitioned into non-overlapping networks.
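The follow-up process and the resulting partition into networks can be sketched in Python. This is a minimal illustration under stated assumptions, not VSP's implementation: the grid is a dictionary mapping (row, column) to a value that is already known for every unit (in the field each value would be measured only when the unit is visited), the criterion is a strict threshold exceedance, and all function names are hypothetical.

```python
from collections import deque

OFFSETS = {
    4: [(-1, 0), (1, 0), (0, -1), (0, 1)],
    8: [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)],
}


def adaptive_cluster_sample(values, initial_units, threshold, design=4):
    """Return the set of observed units after all follow-up sampling."""
    observed = set()
    queue = deque(initial_units)
    while queue:
        unit = queue.popleft()
        if unit in observed or unit not in values:
            continue  # already sampled, or outside the sample area
        observed.add(unit)
        if values[unit] > threshold:
            # Criterion met: schedule every neighbor for follow-up sampling.
            r, c = unit
            queue.extend((r + dr, c + dc) for dr, dc in OFFSETS[design])
    return observed


def partition_into_networks(observed, values, threshold, design=4):
    """Split observed units into networks; non-exceeding units have size one."""
    hot = {u for u in observed if values[u] > threshold}
    networks, seen = [], set()
    for start in sorted(hot):
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:
            r, c = stack.pop()
            component.append((r, c))
            for dr, dc in OFFSETS[design]:
                nb = (r + dr, c + dc)
                if nb in hot and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        networks.append(sorted(component))
    # Edge units and non-exceeding initial units are networks of size one.
    networks.extend([u] for u in sorted(observed - hot))
    return networks


if __name__ == "__main__":
    grid = {(r, c): 0.0 for r in range(4) for c in range(4)}
    grid[(1, 1)], grid[(1, 2)] = 5.0, 7.0
    obs = adaptive_cluster_sample(grid, [(1, 1), (3, 3)], threshold=1.0)
    for net in partition_into_networks(obs, grid, threshold=1.0):
        print(net)
```

In this small example the two exceeding units form one network, the units around them are edge units, and the initial unit at (3, 3) triggers no follow-up sampling.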

The usual method of computing the sample average and sample variance (from a simple random sample) will be statistically biased when calculated using the entire final sample obtained using Adaptive Cluster Sampling, unless there are no initial units for which follow-up sampling was conducted. If only the initial sample is used for estimating the mean and variance using the usual formulas, these estimates will be statistically unbiased.

Thompson (1990) has developed statistically unbiased estimators of the mean and the variance of the estimated mean that make use of all the samples obtained using adaptive cluster sampling. For an adaptive cluster sample with an initial set of units obtained using simple random sampling, VSP computes the modified Horvitz-Thompson unbiased estimators:

$$ \hat \mu = \frac{1}{N} \displaystyle\sum_{k=1}^m \frac{y_k^*}{\alpha_k} $$

$$ \widehat{\mathrm{var}} ( \hat \mu ) = \frac{1}{N^2} \left[ \displaystyle\sum_{j=1}^m \displaystyle\sum_{k=1}^m \frac{y_j^* y_k^*}{\alpha_{jk}} \left( \frac{\alpha_{jk}}{\alpha_j \alpha_k} - 1 \right) \right] $$

where

\( y_k^* \)

is the sum of the values of the characteristic of interest, \( y \), for the \(k\)th network in the sample

\( N \)

is the number of units in the population

\( m \)

is the number of distinct networks (excluding edge units) in the final sample

\( \alpha_k \)

is the probability that the initial sample (set of initial units) intersects the \(k\)th network

\( \alpha_{jk} \)

is the probability that the initial sample intersects both the \(j\)th and the \(k\)th networks

 

Units in the initial sample that do not satisfy the condition are included in the calculation as networks of size one, but edge units are excluded.

If there are \(n_1\) units in the initial sample and \(x_k\) units in the \(k\)th network, then the intersection probabilities \( \alpha_k\) and \(\alpha_{jk}\) are calculated using combinatorial formulas as follows:

$$ \alpha_k = 1 - \left[ \dbinom{N - x_k}{n_1} / \dbinom{N}{n_1} \right] $$

$$ \alpha_{jk} = 1 - \left[ \dbinom{N - x_j}{n_1} + \dbinom{N - x_k}{n_1} - \dbinom{N - x_j - x_k}{n_1} \right] / \dbinom{N}{n_1} $$

where \( \alpha_{jj} = \alpha_j \); that is, when \( j = k \) the joint probability reduces to the single-network probability.
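As an illustration of these formulas, the following Python sketch computes the intersection probabilities and the two estimators. It is a sketch under stated assumptions rather than VSP's code: the function name and the input format, a list of hypothetical (y_star, x) pairs giving each distinct network's total and size, are chosen for the example. Edge units that were not in the initial sample should be left out of the list.

```python
from math import comb


def ht_estimate(networks, N, n1):
    """Modified Horvitz-Thompson estimates of the mean and its variance.

    networks: list of (y_star, x) pairs, one per distinct network
              intersected by the initial sample.
    N:        number of units in the population.
    n1:       number of units in the initial simple random sample.
    """
    total = comb(N, n1)

    def alpha(x):
        # Probability that the initial sample intersects a network of size x.
        return 1 - comb(N - x, n1) / total

    def alpha_joint(xj, xk):
        # Probability that the initial sample intersects both networks.
        missed = comb(N - xj, n1) + comb(N - xk, n1) - comb(N - xj - xk, n1)
        return 1 - missed / total

    mu_hat = sum(y / alpha(x) for y, x in networks) / N

    var_sum = 0.0
    for j, (yj, xj) in enumerate(networks):
        for k, (yk, xk) in enumerate(networks):
            a_jk = alpha(xj) if j == k else alpha_joint(xj, xk)
            var_sum += yj * yk / a_jk * (a_jk / (alpha(xj) * alpha(xk)) - 1)
    return mu_hat, var_sum / N ** 2


if __name__ == "__main__":
    # One network of two units totalling 6, plus one size-one network with y = 0,
    # from an initial sample of 2 units out of 10.
    print(ht_estimate([(6.0, 2), (0.0, 1)], N=10, n1=2))
```

Because each network total is divided by its own intersection probability, large networks, which are more likely to be hit by the initial sample, are down-weighted relative to networks of size one.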

Statistical Assumptions

The assumptions associated with the formulas for computing the number of samples are:

  1. The sample mean is normally distributed (used to compute \(n_1\))

  2. The variance estimate, \(s^2\), is reasonable and representative of the population being sampled (used to compute \(n_1\)),

  3. The estimate of the sample mean is reasonable and representative of the population being sampled, and

  4. The field locations that will be the initial units are selected using simple random sampling

The first two assumptions will be assessed in a post data collection analysis. The third assumption is valid because the estimate of the mean will be an unbiased estimate of the mean.

For an illustration of adaptive cluster sampling, please refer to Adaptive Cluster Sampling in chapter 3 of the VSP User’s Guide.

References:

EPA, November 2001. Guidance for Choosing a Sampling Design for Environmental Data Collection, EPA QA/G-5S, Peer Review Draft, Washington, D.C.

Gilbert, R.O. 1987. Statistical Methods for Environmental Pollution Monitoring. John Wiley & Sons, New York.

Thompson, S. K. 1990. Adaptive Cluster Sampling. Journal of the American Statistical Association 85:412.

Thompson, S.K. and G.A.F. Seber. 1996. Adaptive Sampling. John Wiley & Sons, New York.

The Adaptive Cluster Sampling dialog contains the following controls:

Grid Size & Follow-Up Samples Page

Desired Grid Size for Samples

Units for Grid Size

Threshold for Triggering Follow-Up Sampling

4 Neighbors

8 Neighbors

Number of Initial Samples Page

Confidence Level

For One-Sided Confidence Interval:

 Maximum Acceptable Width of Confidence Interval

For Two-Sided Confidence Interval:

 Maximum Acceptable Half-Width of Confidence Interval

Estimated Standard Deviation

Cost Page