# Ranked Set Sampling Design for Estimating a Mean

## Background Information

Ranked set sampling, originally developed by McIntyre (1952), combines simple random sampling with the field investigator's professional knowledge and judgment to pick places to collect samples. Alternatively, field screening measurements can replace professional judgment when appropriate. The use of ranked set sampling increases the chance that the collected samples will yield representative measurements. This results in better estimates of the mean as well as improved performance of many statistical procedures such as testing for compliance with risk-based or background-based (reference-based) standards.

Moreover, ranked set sampling can be more cost-efficient than simple random sampling because fewer samples need to be collected and measured. The use of professional judgment in the process of selecting sampling locations is a powerful incentive to use ranked set sampling. Professional judgment is typically applied by visually assessing some characteristic or feature of various potential sampling locations in the field, where the characteristic or feature is a good indicator of the relative amount of the variable or contaminant of interest that is present.

A simple ecological example will illustrate the ranked set sampling method. Suppose the average age of trees on a property needs to be estimated. An appropriate judgment-based measurement exists: the visual size of a tree, since trees generally increase in size as they age. Begin by randomly selecting three trees and judging by eye which tree is the smallest. Mark the smallest tree to be measured and ignore the other two. Next, randomly select another set of three trees to rank. Mark the medium-sized tree and ignore the other two. Next, randomly select another three trees. Mark the largest tree and ignore the other two. Carry out this three-set procedure 10 times (10 cycles), for a total of 90 trees ranked. Thirty of the trees will have been marked and 60 ignored. Of the 30 marked trees, 10 are from a stratum of generally smaller trees, 10 are from a stratum of generally middle-sized trees, and 10 are from a stratum of generally larger trees. Determine the age of each of the 30 marked trees by coring or some other appropriate measurement technique and use those measurements to estimate the average age of the trees on the lot.
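The tree example can be mimicked with a short simulation. The sketch below is illustrative only: the population of tree ages, the noisy visual-size proxy, and the function name `ranked_set_sample` are all assumptions made for this example and are not part of VSP.

```python
import random

def ranked_set_sample(ages, sizes, set_size=3, cycles=10, seed=0):
    """Return the ages of the trees selected by a balanced ranked set design.

    ages  : true ages (the expensive measurement, e.g. from coring)
    sizes : visual sizes (the cheap variable used only for ranking)
    """
    rng = random.Random(seed)
    measured = []
    for _ in range(cycles):
        for rank in range(set_size):
            # Draw one set of trees at random and rank them by visual size.
            candidates = rng.sample(range(len(ages)), set_size)
            candidates.sort(key=lambda i: sizes[i])
            # Keep only the tree holding this set's target rank.
            measured.append(ages[candidates[rank]])
    return measured

# Hypothetical population: 1,000 trees whose size tracks age imperfectly.
pop_rng = random.Random(42)
ages = [pop_rng.uniform(5, 100) for _ in range(1000)]
sizes = [age + pop_rng.gauss(0, 10) for age in ages]

sample = ranked_set_sample(ages, sizes)
print(len(sample))                # number of trees actually measured
print(sum(sample) / len(sample))  # ranked set estimate of the mean age
```

With the default arguments, 90 trees are ranked by eye but only 30 are cored, which mirrors the counts in the example above. For simplicity this sketch draws each set independently, so the same tree could appear in more than one set.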

In this illustration there were 10 cycles and 3 samples chosen per cycle. In practice, the number of sample locations chosen per cycle (the "set size") and the number of cycles are determined using a systematic planning process. Visual Sample Plan (VSP) implements the systematic process needed to determine the number of cycles, and hence the number of locations to be ranked and the number of locations to be measured. VSP can also place ranking and sampling locations on the map of the study site.

## Implementing the example as a balanced (symmetric) design in VSP

#### In VSP, start with a blank project and draw a rectangle to represent the sample area. Next, open the Ranked Set Sampling Dialog box and enter the following values on the Costs page:

Ranking Cost per Location: 25.00

#### Enter the following values on the Ranked Set Sampling page:

Lab Data Distribution: Symmetric

Ranking Method: Professional Judgment

Set Size: 3

Choose: Two-sided Confidence Interval About the Mean

Confidence Level: 95%

Maximum acceptable half-width of confidence interval: 5.5

Estimated Standard Deviation: 20

#### Note the following outputs on the lower right corner of the dialog:

Chosen set size (m): 3

Number of cycles (r): 10

Required number of samples (m x r): 30

Number of field locations to rank (m x m x r): 90

#### The different colors represent different cycles. Use the Ranked Set Toolbar to choose the first cycle. You will see the field locations that represent the first cycle, similar to the following figure:

The squares represent the first set of field locations. The three trees closest to these locations on the map would be judged and the smallest tree would be chosen for sampling (coring or some other method). The triangles represent the second set of field locations. The three trees closest to these locations on the map would be judged and the medium-sized tree would be chosen for sampling. Finally, the circles represent the third set of field locations in the first cycle. The three trees nearest these locations on the map would be judged and the largest tree would be sampled.

#### This process would be repeated for the second cycle. The field locations associated with the second cycle are represented by the following figure:

The same process would be repeated for all 10 cycles. At the end of the 10 cycles, you will have collected 30 samples, 3 from each cycle. The mean age given by these samples will have a 95% probability of being within 5.5 years of the actual mean age of the trees on this property, provided your assumptions (a 20-year standard deviation, etc.) are reasonable.

## Implementing the example as an unbalanced (asymmetric) design in VSP

Suppose that it has been noted that more of the trees are older than younger, making the population skewed toward higher ages. This would indicate the need for an unbalanced ranked set sampling design. In an unbalanced design, the top rank may be sampled more than once in order to better determine the mean when more values in the population are near the top of the distribution.

#### In VSP, start with a blank project and draw a rectangle to represent the sample area. Next, open the Ranked Set Sampling Dialog box and enter the following values on the Costs page:

Ranking Cost per Location: 25.00

#### Enter the following values on the Ranked Set Sampling page:

Lab Data Distribution: Asymmetric (Skewed toward High)

Ranking Method: Professional Judgment

Set Size: 3

Confidence Level: 95%

Maximum acceptable % difference between estimated mean and true mean: 15

Geometric Standard Deviation: 1.5

#### Note the following outputs on the lower right corner of the dialog:

Chosen set size [m]: 3

Number of cycles [r]: 6

Number of sets for top rank [t]: 2

Number of sets per cycle [m+t-1]: 4

Required number of samples [(m+t-1) x r]: 24

Number of field locations to rank [(m+t-1) x m x r]: 72

#### The different colors represent different cycles. Use the Ranked Set Toolbar to choose the first cycle. You will see the field locations that represent the first cycle, similar to the following figure:

The squares represent the first set of field locations. The three trees closest to these locations on the map would be judged and the smallest tree would be chosen for sampling (coring or some other method). The triangles represent the second set of field locations. The three trees closest to these locations on the map would be judged and the medium-sized tree would be chosen for sampling. The circles represent the top rank. Notice that there are two sets of circles: one set accompanied by the numeral 1 and the other set accompanied by the numeral 2. This is because the top rank has to be sampled twice for this unbalanced design (t = 2). The circle-1s represent the third set of field locations in the first cycle. The three trees nearest these locations on the map would be judged and the largest tree would be sampled. Finally, the circle-2s represent the fourth set of field locations in the first cycle. The three trees nearest these locations on the map would be judged and, again, the largest tree would be sampled.

#### This process would be repeated for the second cycle. The field locations associated with the second cycle are represented by the following figure:

The same process would be repeated for all 6 cycles. At the end of the 6 cycles, you will have collected 24 samples, 4 from each cycle. The mean age given by these samples will have a 95% probability of being within 15% of the actual mean age of the trees on this property, provided your assumptions (a geometric standard deviation of 1.5, etc.) are reasonable.

## Determining the number of samples

There are two major design types for ranked set sampling (RSS): balanced and unbalanced. Balanced designs assume that the analytical measurements of interest are symmetrically distributed about the mean. Unbalanced designs assume that the distribution of measurements is skewed to higher values. Unbalanced designs select more samples from the top rank to better estimate the mean given the skewness of the data.

Determining the number of cycles, and hence the number of locations to be ranked and the number of locations to be measured, is a five-step process:

### Step 1: Determine the sample size assuming that simple random sampling is used.

For balanced designs, the number of samples needed for simple random sampling ($$n_0$$) is calculated by VSP using either a one-sided or a two-sided confidence interval equation, as selected by the VSP user.

For a two-sided confidence interval, the equation used to calculate the number of samples under simple random sampling, $$n_0$$, when the expected distribution is symmetric and a balanced ranked set sampling design is used is: $$n_0 = s^2\Bigg(\frac{t_{1-\alpha/2, df}}{d}\Bigg)^2$$

For a one-sided confidence interval, the equation used is: $$n_0 = s^2\Bigg(\frac{t_{1-\alpha, df}}{d}\Bigg)^2$$

where

$$n_0$$ is the recommended minimum number of samples for the study area if simple random sampling were used,

$$s$$ is the estimated standard deviation of measurements of collected samples,

$$d$$ is the maximum desired width (or half-width) of the confidence interval,

$$t_{1-\alpha,df}$$ is the value of the Student's t-distribution with $$df = n_0 - 1$$ degrees of freedom such that the proportion of that distribution less than $$t_{1-\alpha,df}$$ is $$1-\alpha$$, and

$$t_{1-\alpha/2,df}$$ is the value of the Student's t-distribution with $$df = n_0 - 1$$ degrees of freedom such that the proportion of that distribution less than $$t_{1-\alpha/2,df}$$ is $$1-\alpha/2$$.

Because $$n_0$$ appears on both sides of the above equations (on the right side it appears in the degrees of freedom of the t-distribution), the equations must be solved iteratively. VSP does this automatically using the iteration scheme in Gilbert (1987, pg. 32).
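The iterative solution can be sketched as a simple fixed-point loop. This is a minimal illustration under stated assumptions, not VSP's implementation: the starting value, the rounding up at each pass, and the stopping rule are choices made here, and the Student's t quantile is computed numerically (Simpson's rule plus bisection) so that only the Python standard library is needed. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return (math.exp(log_norm) / math.sqrt(df * math.pi)
            * (1 + x * x / df) ** (-(df + 1) / 2))

def t_cdf(x, df, steps=1000):
    """CDF for x >= 0, integrating the density on [0, x] with Simpson's rule."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Quantile for p > 0.5, found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def srs_sample_size(s, d, alpha, two_sided=True, max_iter=50):
    """Iterate n_0 = s^2 (t / d)^2 until the rounded-up value stops changing."""
    p = 1 - alpha / 2 if two_sided else 1 - alpha
    # ASSUMPTION: start from the normal quantile and round up at every pass.
    n = max(2, math.ceil((NormalDist().inv_cdf(p) * s / d) ** 2))
    for _ in range(max_iter):
        n_new = max(2, math.ceil((t_quantile(p, n - 1) * s / d) ** 2))
        if n_new == n:
            break
        n = n_new
    return n

print(srs_sample_size(s=20, d=5.5, alpha=0.05))  # 54 with this sketch
```

The inputs in the last line are those of the balanced example (standard deviation 20, half-width 5.5, 95% confidence). The value printed is what this sketch produces; VSP's internal intermediate value is not reported in this document and may be rounded differently.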

For unbalanced designs, the number of samples is computed using the Adjusted Classical Formula method outlined by Perez and Lefante (1997).

Using this method, the first step is to compute an approximate sample size using the following formula: $$n_{\text{classic}} = \Bigg(\frac{Z_{1-\alpha/2}}{\pi}\Bigg)^2\Big(GSD^{\ln(GSD)}-1\Big)$$

where

$$n_{\text{classic}}$$ is the approximate recommended minimum number of samples,

$$Z_{1-\alpha/2}$$ is the value of the standard normal distribution such that the proportion of the distribution less than $$Z_{1-\alpha/2}$$ is $$1-\alpha/2$$,

$$GSD$$ is the geometric standard deviation,

$$\ln$$ is the natural logarithm, and

$$\pi$$ is the maximum proportion difference between the estimated mean and the true mean.

Next, linear regression is used to calculate the recommended minimum number of samples using the following formula: $$n_0 = \beta_0 + \beta_1(n_{\text{classic}})$$

where

$$n_0$$ is the number of samples required under simple random sampling,

$$\beta_0$$ is the Y-intercept of the regression formula, and

$$\beta_1$$ is the slope of the regression formula.

$$\beta_0$$ is obtained from the following table [from Table III in (Perez and Lefante 1997, pg.2791)]:

| Confidence Level | GSD = 1.1 | GSD = 1.5 | GSD = 2.0 | GSD = 2.5 | GSD = 3.0 | GSD = 3.5 | GSD = 4.0 |
|---|---|---|---|---|---|---|---|
| 90% | 2.9532 | 7.5249 | 11.3183 | 15.5638 | 20.1322 | 25.9327 | 30.3223 |
| 95% | 3.3331 | 7.9237 | 14.0744 | 20.5406 | 27.1563 | 33.6865 | 40.1084 |
| 99% | 4.9265 | 11.2470 | 20.5069 | 30.2478 | 40.1743 | 51.1945 | 60.6576 |

$$\beta_1$$ is obtained from the following table [from Table III in (Perez and Lefante 1997, pg.2791)]:

| Confidence Level | GSD = 1.1 | GSD = 1.5 | GSD = 2.0 | GSD = 2.5 | GSD = 3.0 | GSD = 3.5 | GSD = 4.0 |
|---|---|---|---|---|---|---|---|
| 90% | 0.4714 | 0.6926 | 0.8509 | 0.8794 | 0.8499 | 0.7731 | 0.7033 |
| 95% | 0.4726 | 0.8094 | 0.9046 | 0.9129 | 0.8731 | 0.8072 | 0.7288 |
| 99% | 0.4740 | 0.8865 | 0.9808 | 0.9877 | 0.9444 | 0.8612 | 0.7796 |
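The two-stage calculation for unbalanced designs (classical formula, then regression adjustment) can be sketched as follows. The coefficient values are copied from the two tables above; the function name and the restriction to tabulated GSD values are assumptions of this sketch, since how VSP treats a GSD that falls between columns is not stated here.

```python
import math
from statistics import NormalDist

GSD_COLUMNS = [1.1, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
BETA_0 = {
    0.90: [2.9532, 7.5249, 11.3183, 15.5638, 20.1322, 25.9327, 30.3223],
    0.95: [3.3331, 7.9237, 14.0744, 20.5406, 27.1563, 33.6865, 40.1084],
    0.99: [4.9265, 11.2470, 20.5069, 30.2478, 40.1743, 51.1945, 60.6576],
}
BETA_1 = {
    0.90: [0.4714, 0.6926, 0.8509, 0.8794, 0.8499, 0.7731, 0.7033],
    0.95: [0.4726, 0.8094, 0.9046, 0.9129, 0.8731, 0.8072, 0.7288],
    0.99: [0.4740, 0.8865, 0.9808, 0.9877, 0.9444, 0.8612, 0.7796],
}

def unbalanced_srs_sample_size(gsd, max_prop_diff, confidence=0.95):
    """Return (n_classic, n_0) for a tabulated GSD and confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Classical formula: (Z / pi)^2 * (GSD^ln(GSD) - 1)
    n_classic = (z / max_prop_diff) ** 2 * (gsd ** math.log(gsd) - 1)
    # ASSUMPTION: only GSD values that appear as table columns are accepted.
    col = GSD_COLUMNS.index(gsd)
    n_0 = BETA_0[confidence][col] + BETA_1[confidence][col] * n_classic
    return n_classic, n_0

n_classic, n_0 = unbalanced_srs_sample_size(gsd=1.5, max_prop_diff=0.15)
print(round(n_classic, 2), round(n_0, 2))
```

For the inputs of the unbalanced example (GSD of 1.5, 15% maximum difference, 95% confidence), the sketch gives roughly 30.5 for the classical value and roughly 32.6 after the regression adjustment.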

### Step 2: Select a value for the set size, m.

This value is usually based on practical constraints in ranking locations in the field using professional judgment or field screening measurements. It may be difficult to use professional judgment to accurately rank by eye more than 4 or 5 locations. Other constraints that may affect the size of m are time, staff, and cost considerations. VSP limits m to 5 for judgment sampling and 8 for field screening measurements.

### Step 3: Determine the relative precision.

Relative precision is needed to calculate the number of cycles. For balanced designs, relative precision (RP) is found from the following table, assuming the data are normally distributed [from Table 1 of (Patil et al. 1994, pg.176)]:

| Set Size = 2 | Set Size = 3 | Set Size = 4 | Set Size = 5 |
|---|---|---|---|
| 1.467 | 1.914 | 2.347 | 2.770 |

If the set size, m, is greater than 5, RP is found using the following linear formula:

$$RP = 0.4342m + 0.6048$$
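The balanced-design lookup combines the tabulated values with the linear formula. A minimal sketch, with a hypothetical function name:

```python
# Tabulated relative precision for balanced designs (set sizes 2 to 5).
BALANCED_RP = {2: 1.467, 3: 1.914, 4: 2.347, 5: 2.770}

def balanced_relative_precision(m):
    """Return RP for set size m: table for m <= 5, linear formula above 5."""
    if m in BALANCED_RP:
        return BALANCED_RP[m]
    if m > 5:
        return 0.4342 * m + 0.6048
    raise ValueError("set size must be an integer of at least 2")

print(balanced_relative_precision(3))  # tabulated value for m = 3
print(balanced_relative_precision(6))  # linear formula for m = 6
```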

For unbalanced designs, RP is found from the following table, assuming the data are lognormally distributed [from Table 1 of (Patil et al. 1994, pg.177)]:

| Set Size | CV = 0.1 | CV = 0.202 | CV = 0.307 | CV = 0.416 | CV = 0.533 | CV = 0.658 | CV = 0.795 | CV = 0.947 | CV = 1.117 | CV = 1.311 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.46 | 1.45 | 1.42 | 1.40 | 1.37 | 1.33 | 1.29 | 1.26 | 1.22 | 1.19 |
| 3 | 1.90 | 1.87 | 1.83 | 1.77 | 1.70 | 1.62 | 1.55 | 1.47 | 1.40 | 1.34 |
| 4 | 2.33 | 2.28 | 2.21 | 2.11 | 2.00 | 1.89 | 1.78 | 1.67 | 1.56 | 1.47 |
| 5 | 2.75 | 2.68 | 2.57 | 2.44 | 2.29 | 2.14 | 1.99 | 1.84 | 1.71 | 1.59 |
| 6 | 3.15 | 3.07 | 2.93 | 2.76 | 2.57 | 2.37 | 2.18 | 2.00 | 1.84 | 1.70 |
| 7 | 3.56 | 3.45 | 3.28 | 3.07 | 2.83 | 2.60 | 2.40 | 2.16 | 1.96 | 1.80 |
| 8 | 3.95 | 3.86 | 3.61 | 3.36 | 3.09 | 2.81 | 2.55 | 2.30 | 2.08 | 1.89 |

CV is the coefficient of variation found by computing:

$$CV = \sqrt{e^{(\ln GSD)^2}-1}$$,

which assumes the data are lognormally distributed.
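As a worked illustration, the sketch below converts a geometric standard deviation to a CV and then reads RP from the row for set size 3 of the unbalanced table. Picking the nearest tabulated CV column is an assumption of this sketch; the text does not say whether VSP uses the nearest column, interpolation, or some other rule. Function names are hypothetical.

```python
import math

CV_COLUMNS = [0.1, 0.202, 0.307, 0.416, 0.533, 0.658, 0.795, 0.947, 1.117, 1.311]
# Row of the unbalanced RP table for set size m = 3.
RP_SET_SIZE_3 = [1.90, 1.87, 1.83, 1.77, 1.70, 1.62, 1.55, 1.47, 1.40, 1.34]

def cv_from_gsd(gsd):
    """Coefficient of variation of a lognormal distribution with the given GSD."""
    return math.sqrt(math.exp(math.log(gsd) ** 2) - 1)

def nearest_rp(cv, rp_row):
    """ASSUMPTION: use the tabulated CV column closest to the computed CV."""
    col = min(range(len(CV_COLUMNS)), key=lambda j: abs(CV_COLUMNS[j] - cv))
    return rp_row[col]

cv = cv_from_gsd(1.5)
print(round(cv, 4))                   # about 0.4227
print(nearest_rp(cv, RP_SET_SIZE_3))  # value from the CV = 0.416 column
```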

### Step 4: Compute the number of cycles.

The number of cycles, $$r$$, is found with the following formula:

$$r=\frac{n_0/m}{RP}$$
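The formula generally yields a fractional value, so some rounding is needed to obtain a whole number of cycles. The sketch below rounds up, which is an assumption made here rather than a documented VSP rule. The input $$n_0 = 54$$ is a hypothetical value chosen for illustration, not a figure reported in this document.

```python
import math

def number_of_cycles(n_0, m, rp):
    """Cycles r = (n_0 / m) / RP, rounded up (ASSUMPTION) to a whole cycle."""
    return math.ceil((n_0 / m) / rp)

# Hypothetical n_0 = 54 with set size 3 and the balanced RP of 1.914.
print(number_of_cycles(54, 3, 1.914))  # 10
```

Here (54 / 3) / 1.914 is about 9.4, which rounds up to 10 cycles.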

### Step 5: Compute the total number of samples that should be collected.

For balanced designs, the total number of samples, $$n$$, is found simply by the following formula:

$$n = r*m$$

The number of field locations that need to be ranked to get the $$n$$ samples is:

$$r*m*m$$

For unbalanced designs, the number of times, $$t$$, the top rank needs to be sampled is found from the following table (from Table 8-5 in QA/G-5S):

| CV | 0.25 | 0.5 | 1 | 1.25 | 1.5 | 2 | 2.5 | 3 | 3.5 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|
| $$t$$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |

The total number of samples, $$n$$, is found by the following formula:

$$n = r*(m+t-1)$$

The number of field locations that need to be ranked to get the n samples is:

$$r*m*(m+t-1)$$
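Both sets of totals can be checked with a few lines of arithmetic. In this sketch (the function name is hypothetical), setting $$t = 1$$ reduces the unbalanced formulas to the balanced ones, because $$m+t-1 = m$$.

```python
def rss_totals(r, m, t=1):
    """Return (samples to measure, field locations to rank)."""
    sets_per_cycle = m + t - 1          # equals m for a balanced design
    n_samples = r * sets_per_cycle
    n_ranked = r * m * sets_per_cycle
    return n_samples, n_ranked

print(rss_totals(r=10, m=3))       # balanced example: (30, 90)
print(rss_totals(r=6, m=3, t=2))   # unbalanced example: (24, 72)
```

These results agree with the outputs listed for the two VSP examples earlier in this section.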

## Ranked Set Samples

For ranked set sampling, VSP produces field sample markers on the map that have different shapes and colors.

The color of the marker indicates its cycle. The cycle colors start at red and go through the spectrum to violet. Use the Cycle drop list on the Ranked Set Sampling Toolbar to show only the field locations for a particular cycle.

The shape of the marker indicates its set. Field sample locations for the first set are marked with squares, locations for the second set are marked with triangles, etc. For unbalanced designs, the top set is sampled several times, so a number accompanies those markers. Use the Set drop list on the Ranked Set Sampling Toolbar to show only the field locations for a particular set.

Ranked set field sampling locations are generated with a label having the following format:

RSS-$$c$$-$$s$$-$$i$$

where

$$c$$ indicates the cycle number

$$s$$ indicates the set number (for unbalanced designs this number is also incremented for each iteration of the top set)

$$i$$ is a unique identifier within the set
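The labelling scheme can be illustrated by enumerating labels for a design. This sketch assumes 1-based numbering with no zero padding and one identifier per field location in a set; the exact numbering VSP uses may differ, and the function name is hypothetical.

```python
def rss_labels(r, m, t=1):
    """List labels of the form RSS-c-s-i for r cycles and set size m.

    For unbalanced designs (t > 1) the set number keeps incrementing
    for each repeat of the top set, giving m + t - 1 sets per cycle.
    """
    labels = []
    for c in range(1, r + 1):            # cycle number
        for s in range(1, m + t):        # set number, 1 .. m + t - 1
            for i in range(1, m + 1):    # location within the set
                labels.append(f"RSS-{c}-{s}-{i}")
    return labels

balanced = rss_labels(r=10, m=3)
print(balanced[0], balanced[-1], len(balanced))
```

The number of labels equals the number of field locations to rank, for example 90 for the balanced example and 72 for the unbalanced one.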

Use the View Labels command on the menu or the Toolbar to show or hide the labels for the field sampling locations.

### Statistical Assumptions

The assumptions used to determine the number of balanced ranked set samples are:

1. The sample mean is normally distributed (used to compute $$n_0$$).

2. The variance estimate, $$s^2$$, is reasonable and representative of the population being sampled (used to compute $$n_0$$).

3. The data distribution is symmetric and approximately normally distributed (used to determine RP).

4. The estimate of the sample mean is reasonable and representative of the population being sampled.

5. The field locations that will be ranked are selected using simple random sampling.

The first three assumptions will be assessed in a post data collection analysis. The fourth assumption is valid because the sample mean will be an unbiased estimate of the population mean. It is the responsibility of the investigators to ensure that the fifth assumption is met.

For an illustration of ranked set sampling, please refer to Ranked Set Sampling in chapter 3 of the VSP User's Guide.

## References:

EPA, November 2001. Guidance for Choosing a Sampling Design for Environmental Data Collection, EPA QA/G-5S, Peer Review Draft, Washington, D.C.

Gilbert, R.O. 1987. Statistical Methods for Environmental Pollution Monitoring. John Wiley & Sons, NY.

McIntyre, G.A. 1952. A method for unbiased selective sampling using ranked sets, Australian Journal of Agricultural Research 3:385-390.

Patil, G.P., A.K. Sinha and C. Taillie. 1994. Ranked set sampling, Handbook of Statistics 12, Environmental Statistics, pp. 167-200, (G.P. Patil and C.R. Rao, editors), North-Holland, New York, NY.

Perez, A., and J.J. Lefante. 1997. Sample size determination and the effect of censoring when estimating the arithmetic mean of a lognormal distribution. Communications in Statistics, Theory and Methods 26 (11):2779-2801.