Correlate Analytes

Background

Usually a large number of analytes are measured in each sample from groundwater wells. Considerable cost savings could be realized if the number of analytes could be reduced without affecting the ability to confidently detect trends and achieve groundwater remediation goals. This VSP module aims to provide the VSP user with correlation methods, graphical displays, and an automated report to assess if some analytes could be eliminated or measured less frequently to reduce analytical costs. VSP provides multiple measures of correlation, some of which can handle non-linear patterns and data with non-detect levels. Correlations are accompanied by tests of significance and 95% confidence intervals to assist in the identification of highly correlated pairs of analytes.

Analyte Correlation Options

VSP allows the user to choose analytes from the list of available analytes and display correlations in 3 different ways:

Clicking Analyze populates the table using the specified subset.

Filter Options

Clicking on this check box activates the filtering option. The abs value significantly > option allows the user to screen out lower (close to zero) correlations whose absolute values are not significantly greater than a specified value between 0 and 1. By default, if Any correlation (Pearson's r, Spearman's rho, or Kendall's tau-b, which are explained later in this help file) for a pair of analytes is significantly above the specified value, the pair is displayed. The significance of a pair's correlation is determined by checking the specified value against the correlation's confidence interval. The user may also elect to use just one of Pearson's, Spearman's, or Kendall's for a given filter. If a correlation is N/A (meaning it could not be computed due to the nature of one of the analytes' non-detect levels), it is ignored. Alternatively, the abs value greater than option works in the same manner, except that it only checks whether the absolute value of the correlation coefficient is greater than the specified value instead of checking for statistical significance.
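The filtering logic described above can be sketched as follows. This is only an illustration, not VSP's actual implementation: the function names, the dictionary layout, and in particular the rule that "significantly greater" means the whole confidence interval lies beyond the threshold in absolute value are assumptions.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[float, float]  # (lower, upper) confidence limits


def significantly_above(ci: Optional[Interval], threshold: float) -> bool:
    """Assumed rule: the whole interval lies beyond the threshold in absolute value."""
    if ci is None:  # N/A correlations are ignored by the filter
        return False
    lower, upper = ci
    return lower > threshold or upper < -threshold


def passes_filter(cis: Dict[str, Optional[Interval]],
                  threshold: float,
                  use: str = "any") -> bool:
    """cis maps 'pearson', 'spearman', 'kendall' to an interval or None (N/A)."""
    if use == "any":
        return any(significantly_above(ci, threshold) for ci in cis.values())
    return significantly_above(cis.get(use), threshold)


# Hypothetical intervals for one analyte pair (made-up numbers)
pair = {"pearson": (0.62, 0.91), "spearman": (0.35, 0.80), "kendall": None}
print(passes_filter(pair, 0.5))                   # filter on Any correlation
print(passes_filter(pair, 0.5, use="spearman"))   # filter on Spearman's rho only
```

With these made-up intervals, the pair is kept under the Any setting because the Pearson interval lies entirely above 0.5, but it is screened out when only Spearman's rho is used.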

Table Options

By default the VSP table displays pairs of analytes and three types of correlations (Pearson's r, Spearman's rho, and Kendall's tau-b). Show correlation confidence intervals displays a 95% confidence interval for each correlation. Show correlation p-values displays the p-value for each correlation from the appropriate t-test, which tests if the correlation is significantly different from zero. The Export table to disk button exports the table as a .txt file so the user can analyze the table offline (such as in cases where the table is very large).

Graphs and Interacting with the Table

On the left-hand side of each row in the table, there is a check box. Checking this box indicates that the pair is a correlation of interest, and the user will have the option of displaying the correlation values and plots involving these pairs instead of all pairs (the Report Options section explains this further). Double-clicking anywhere on a row will display the graph for that pair. The graph displays a 2-D plot of the measured values of the pair of analytes, with different symbol types and colors used where some measurements were non-detects. If the user decides this pair of analytes is a pair of interest, they can check the Add graph to report box, which automatically checks the corresponding box in the table.

Report Options

By default VSP displays all rows in the table in the VSP report view and all corresponding graphs. In cases where the user only wishes to display select rows, they can select the corresponding radio button so that VSP will Send only selected rows to the automated report. This will only put table data and graphs into the report for the analyte pairs which have had their check box checked.

Pearson's \(r\) Correlation

Pearson's \(r\) is a common correlation which measures the linear relationship between two variables. Values are between -1 and 1, with 0 indicating no linear relationship, -1 indicating a perfect negative linear association (as one variable increases, the other decreases), and 1 indicating a perfect positive linear association (as one variable increases, the other increases). Pearson's \(r\) is calculated only when none of the data values are non-detects.

Pearson's \(r\) is calculated such that if you have two variables X and Y, and \(n\) samples,

$$r=\frac{\displaystyle\sum_{i=1}^{n} (X_i-\bar{X})(Y_i-\bar{Y})/(n-1)}{s_xs_y}$$

where \(\bar{X}\) and \(\bar{Y}\) are the sample means and \(s_x\) and \(s_y\) are the sample standard deviations (Ramsey and Schafer, 2002).

The null hypothesis \(H_0\) is that the analytes have no linear relationship (a correlation of 0). The test statistic for determining the significance of the correlation for a pair of analytes is \(T=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\), which under \(H_0\) has a t distribution with \(n-2\) degrees of freedom (Devore, 1991).
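As a minimal sketch, the correlation and test statistic above can be computed with the Python standard library alone. The function names and the sample data are illustrative, not part of VSP.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))


def t_statistic(r, n):
    """T = r * sqrt(n - 2) / sqrt(1 - r^2); undefined when |r| = 1."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Made-up concentrations for two analytes measured in five samples
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(x, y)
print(r, t_statistic(r, len(x)))
```

The p-value would then come from a t distribution with \(n-2\) degrees of freedom, for example via `scipy.stats.t.sf` if SciPy is available.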

An approximate 95% confidence interval is calculated by applying Fisher's z transformation (Fisher, 1970). The process for obtaining a correlation confidence interval is as follows:

1. Transform Pearson's \(r\) to \(z\) by applying the equation \(z=0.5[\ln(1+r)-\ln(1-r)]\)

2. Calculate its standard error \(\sigma_z=\frac{1}{\sqrt{n-3}}\)

3. Compute a 95% confidence interval for \(z\) using \(z\pm1.96\sigma_z\)

4. Use a statistical z distribution to find the cumulative probability values \(a\) and \(b\) that correspond to the lower and upper confidence limits produced in step 3.

5. The approximate 95% confidence interval is \((2a-1, 2b-1)\)
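The following sketch illustrates steps 1 through 3 and then maps the limits back to the correlation scale. Note the assumption: the back-transformation here uses the hyperbolic tangent, which is the conventional inverse of Fisher's transformation, rather than the cumulative-probability lookup of steps 4 and 5, so results may differ from what VSP reports. The function name is hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r from n samples."""
    # Step 1: transform r to z
    z = 0.5 * (math.log(1 + r) - math.log(1 - r))
    # Step 2: standard error of z
    se = 1 / math.sqrt(n - 3)
    # Step 3: confidence limits on the z scale
    z_low, z_high = z - z_crit * se, z + z_crit * se
    # Back-transform with tanh (assumed here; the textbook inverse of step 1)
    return math.tanh(z_low), math.tanh(z_high)


low, high = fisher_ci(0.5, 28)
print(low, high)
```

Because the interval is symmetric on the \(z\) scale, it is generally asymmetric around \(r\) after back-transformation, and it always stays inside \((-1, 1)\).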

Spearman's rho Non-Parametric Correlation

Spearman's rho is a non-parametric correlation which measures the linear relationship of the ranks of two variables. Values are between -1 and 1. Spearman's rho is calculated when none of the data values are non-detects, or when there is a single non-detect level, and none of the detects are below this non-detect value.

Spearman's rho is computed by first ranking each analyte's values relative to the other values for that analyte (averaging ranks if there are ties). For instance, if an analyte has seven values (15, 25, 30, 15, 60, 100, 10), the ranks would be (2.5, 4, 5, 2.5, 6, 7, 1). The correlation, test statistic, and confidence interval are then computed on these ranks using the methods for Pearson's \(r\).
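The ranking step can be sketched as below. This covers detected values only; how a single non-detect level is ranked is not shown here, and the function name is illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


print(average_ranks([15, 25, 30, 15, 60, 100, 10]))
# → [2.5, 4.0, 5.0, 2.5, 6.0, 7.0, 1.0]
```

Applying the Pearson formula to `average_ranks(x)` and `average_ranks(y)` then gives Spearman's rho.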

Kendall's tau-b

Kendall's tau-b is a non-parametric correlation with values between -1 and 1. A distinct advantage of Kendall's tau-b is that it accommodates multiple non-detect levels, as well as cases where detects are below some non-detect values (Helsel, 2004). The methodology was originally developed by Kendall (1955) and adapted for non-detects by Brown, Hollander, and Korwar (1974). Helsel (2004) explains the process of computing Kendall's tau-b. First, the numbers of concordant pairs (\(N_c\)) and discordant pairs (\(N_d\)) are computed by sorting the data by variable X and comparing each Y observation to each subsequent Y observation. Each time a subsequent observation is an increase from the value it is being compared to, the pair is concordant. Each time a subsequent observation is a decrease, the pair is discordant. Pairs in which either X or Y does not change, or in which the direction of change is indeterminate due to non-detects, are considered ties. With \(N\) denoting the number of observations, and \(\text{#ties}_x\) and \(\text{#ties}_y\) the numbers of pairs tied in X and in Y, the equation for Kendall's tau-b as shown in Helsel (2004) is

$$\tau_b=\frac{N_c-N_d}{\sqrt{\Big(\frac{N(N-1)}{2}-\text{#ties}_x\Big)\Big(\frac{N(N-1)}{2}-\text{#ties}_y\Big)}}$$
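A minimal sketch of this calculation for data without non-detects is shown below. It compares every pair directly instead of sorting by X first, which gives the same counts. Handling of non-detects (where a comparison can be indeterminate) is deliberately left out, and the function name is illustrative.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b for fully detected data (no censored values)."""
    n = len(x)
    n_c = n_d = ties_x = ties_y = 0
    for i, j in combinations(range(n), 2):
        dx, dy = x[j] - x[i], y[j] - y[i]
        if dx == 0:
            ties_x += 1          # pair tied in X
        if dy == 0:
            ties_y += 1          # pair tied in Y
        if dx * dy > 0:
            n_c += 1             # both change in the same direction
        elif dx * dy < 0:
            n_d += 1             # changes in opposite directions
    total_pairs = n * (n - 1) / 2
    return (n_c - n_d) / math.sqrt((total_pairs - ties_x) * (total_pairs - ties_y))


# Made-up data: 7 concordant pairs, 1 discordant pair, 2 pairs tied in Y
tau = kendall_tau_b([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(tau)
```

For this example the result is \(6/\sqrt{10 \times 8} \approx 0.67\). The function divides by zero if every pair is tied in X or in Y, so a production version would need to guard against that case.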

For an illustration of how to use correlate analytes, please refer to section 5.1.7 in the VSP User’s Guide.

References:

Brown, B. W., Jr., Hollander, M., and Korwar, R. M., 1974, Nonparametric tests of independence for censored data, with applications to heart transplant studies, in Proschan, F., and Serfling, R. J. (eds.), Reliability and Biometry: Statistical Analysis of Lifelength, Philadelphia, PA, SIAM, pp. 327-354.

Devore, J. L., 1991, Probability and Statistics for Engineering and the Sciences, Belmont, CA, Wadsworth, Inc.

Fisher, R. A., 1970, Statistical Methods for Research Workers, New York, NY, Hafner Press.

Helsel, D. R., 2004, Nondetects and Data Analysis: Statistics for Censored Environmental Data, New York, NY, John Wiley and Sons.

Kendall, M. G., 1955, Rank Correlation Methods, London, Charles Griffin and Company.

Ramsey, F. L., and Schafer, D. W., 2002, The Statistical Sleuth: A Course in Methods of Data Analysis, Belmont, CA, Duxbury Press.