
Recall the Coronary Heart Disease example (table below), where the data were taken from the Framingham longitudinal study (Cornfield, 1962). In this study, n = 1329 patients were classified by cholesterol level and whether they had been diagnosed with coronary heart disease (CHD). When we did the test of independence in Lesson 3, we treated these variables as nominal. However, notice that cholesterol level is an ordinal variable. We can also treat heart disease as ordinal (e.g., no heart disease as the 'low' risk condition vs. having heart disease as the 'high' risk condition), but for a binary variable this has no impact.

 
              Serum cholesterol (mg/100 cc)
           0–199   200–219   220–259   260+   Total
CHD           12         8        31     41      92
no CHD       307       246       439    245    1237
Total        319       254       470    286    1329

QUESTION: Is there any evidence of a relationship between cholesterol and heart disease? If yes, what is the nature of that association?

When the classification is ordinal, there may be a linear trend across the levels of the variables. If the null hypothesis of independence is rejected, it is natural and meaningful to measure that linear trend. When two interval- or ratio-valued variables are considered, two common statistics measuring the linear trend, or correlation, between them are Pearson's correlation coefficient and its nonparametric alternative, Spearman's correlation coefficient. There are many other measures of association for ordinal data, e.g., gamma, Kendall's tau, etc., that you can explore on your own depending on your research/work needs (see Agresti (2007), Sec. 2.5, or Agresti (2013), Sec. 3.4). We will discuss the Pearson and Spearman correlations and the Mantel-Haenszel statistic for testing independence between two ordinal variables.

Pearson Correlation, r, describes a linear association between two interval variables and is one of the most common measures of linear trend. If Y and Z are both ordinal variables with numerical scores assigned to their categories, then

\(r=\dfrac{cov(Y,Z)}{s_Y s_Z}\)

where \(cov(Y,Z)\) is the covariance between \(Y\) and \(Z\), and \(s_Y\) and \(s_Z\) are the standard deviations of \(Y\) and \(Z\), respectively.
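
To make this concrete, here is a minimal R sketch (not taken from the course files) that expands the table into one observation per patient, using assumed integer scores (1, 2 for CHD status and 1-4 for the cholesterol groups, matching the default scores discussed later), and computes r as the covariance divided by the product of the standard deviations:

```r
# Heart disease table: rows = CHD status, columns = cholesterol groups
counts <- matrix(c( 12,   8,  31,  41,
                   307, 246, 439, 245),
                 nrow = 2, byrow = TRUE)
u <- c(1, 2)   # assumed row scores: CHD = 1, no CHD = 2
v <- 1:4       # assumed column scores for the four cholesterol groups

# Expand the table to case-level data: one (Y, Z) score pair per patient
Y <- rep(u[row(counts)], as.vector(counts))
Z <- rep(v[col(counts)], as.vector(counts))
length(Y)                      # 1329 patients

cov(Y, Z) / (sd(Y) * sd(Z))    # about -0.14 (the Pearson value reported below)
cor(Y, Z)                      # the same value, computed directly
```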

Properties of r for contingency tables

  • −1 ≤ r ≤ 1
  • r = 0 corresponds to no (linear) relationship
  • r = ±1 implies perfect association, i.e., all observations fall in the diagonal cells; this is well defined only when the contingency table is square.
  • r has very limited use for highly discrete and unbalanced data (e.g., large discrepancies in the cell sizes).
  • r is appropriate only when both variables can be considered ordinal, and is most appropriate when they are interval or we have reason to think we can choose an appropriate scale.

The Spearman correlation (Spearman's rho) is a nonparametric alternative to r: the observations are converted to rank orders, and the correlation is computed from the ranked pairs.

For the heart disease example, you can see from either the SAS or R output below that Pearson's correlation is r = −0.1403 and Spearman's correlation is −0.1448. Both are small negative values, implying a weak to moderate negative linear trend.
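
A similar hedged sketch for Spearman's rho, again with the assumed integer scores: the tied observations receive midranks, and rho is simply the Pearson correlation of those ranks, which should come out close to the value reported above.

```r
counts <- matrix(c( 12,   8,  31,  41,
                   307, 246, 439, 245),
                 nrow = 2, byrow = TRUE)
u <- c(1, 2)                                 # assumed row scores, as before
v <- 1:4                                     # assumed column scores
Y <- rep(u[row(counts)], as.vector(counts))  # case-level row scores
Z <- rep(v[col(counts)], as.vector(counts))  # case-level column scores

cor(Y, Z, method = "spearman")   # Spearman's rho (ties handled with midranks)
cor(rank(Y), rank(Z))            # equivalently, Pearson correlation of the midranks
```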

The Mantel-Haenszel (MH) statistic, M², can be computed from either the Pearson or the Spearman correlation. It tests the null hypothesis of independence between two ordinal variables (i.e., that the correlation parameter ρ equals zero) against the two-sided alternative:

H0 : ρ = 0
HA : ρ ≠ 0

where the test statistic is

\(M^2=(n-1)r^2\)

  • When H0 is true, M² has an approximate chi-square distribution with df = 1.
  • The signed square root, M = √(n − 1) r, has an approximate standard normal distribution, N(0, 1), which can be used to test one-sided alternatives as well.
  • When r = 0 (no observed linear trend), M² = 0.
  • Under perfect association (r = ±1), M² = n − 1.
  • Larger values of M² provide more evidence against the independence model.
  • As n increases, M² increases (recall our general discussion of sample size and power).
  • As r² increases, M² increases.

Category scores

To compute r, and thus M², we need to assign scores to both the rows and the columns. Scores are ordered numerical values assigned to the categories of the ordinal variables: for the categories of the row variable, \(u_1\le u_2\le\cdots\le u_I\), and for the categories of the column variable, \(v_1\le v_2\le\cdots\le v_J\).

The default in SAS (and most other software) is integer scores (e.g., \(u_1=1, u_2=2,\ldots\)). For more on the choice of scores in SAS, see http://support.sas.com/onlinedoc/912/getDoc/statug.hlp/freq_sect18.htm#stat_freq_freqscores

Then the correlation for an I × J table equals:

\(r=\dfrac{\sum_i\sum_j(u_i-\bar{u})(v_j-\bar{v})n_{ij}}{\sqrt{[\sum_i\sum_j(u_i-\bar{u})^2 n_{ij}][\sum_i\sum_j(v_j-\bar{v})^2n_{ij}]}}\)

where \(\bar{u}=\sum_i\sum_j u_i n_{ij}/n\) is the row mean, and \(\bar{v}=\sum_i\sum_j v_j n_{ij}/n\) is the column mean.
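
The following R sketch implements this formula directly on a table of counts; the function name score_cor() is illustrative only (it is neither a built-in function nor the course's pears.cor() code). The heart disease numbers it reproduces are worked out next.

```r
# Score-based Pearson correlation for an I x J table of counts,
# given row scores u and column scores v (the formula above).
score_cor <- function(tab, u, v) {
  n    <- sum(tab)
  ubar <- sum(u * rowSums(tab)) / n                # mean row score
  vbar <- sum(v * colSums(tab)) / n                # mean column score
  num  <- sum(outer(u - ubar, v - vbar) * tab)     # sum_ij (u_i - ubar)(v_j - vbar) n_ij
  den  <- sqrt(sum((u - ubar)^2 * rowSums(tab)) *
               sum((v - vbar)^2 * colSums(tab)))
  num / den
}

heart <- matrix(c( 12,   8,  31,  41,
                  307, 246, 439, 245), nrow = 2, byrow = TRUE)
score_cor(heart, u = c(1, 2), v = 1:4)   # about -0.1403
```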

For the heart disease example, let \(u_1=1, u_2=2\) and \(v_1=1, v_2=2, v_3=3, v_4=4\). Then the mean row and column scores are \(\bar{u}=1.93\) and \(\bar{v}=2.54\), and

\(r=-0.1403\text{ and }M^2=(n-1)r^2=(1329-1)(-0.1403)^2\approx 26.15\)

Can you identify these statistics in the relevant SAS and/or R output below? With \(M^2=26.15\) and \(df=1\), the p-value is very small, so we have strong evidence to reject the null hypothesis of independence between these two ordinal variables; based on the correlation values, there is a weak to moderate negative linear trend.
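
As a quick sanity check (a sketch using the rounded correlation quoted above), the MH statistic and its chi-square(1) p-value can be computed as:

```r
n  <- 1329
r  <- -0.1403                              # score-based Pearson correlation from above
M2 <- (n - 1) * r^2                        # about 26.1
pchisq(M2, df = 1, lower.tail = FALSE)     # p-value on the order of 1e-7
```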

Note that the assignment of scores is somewhat arbitrary; for example, instead of the positive integers 1, 2, 3, ... one may assign the scores 0, 1, 2, ... to the same ordinal classification. Any positive linear transformation of the scores leaves r, and hence M², unchanged, as the sketch below illustrates.
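
For instance, the following sketch uses the base R function cov.wt() (the helper name wtd_r is my own) to show that shifting both sets of scores down by one does not change the correlation:

```r
heart <- matrix(c( 12,   8,  31,  41,
                  307, 246, 439, 245), nrow = 2, byrow = TRUE)

# Weighted correlation of the cell scores, weighting each cell by its count
wtd_r <- function(u, v, tab) {
  scores <- cbind(u[row(tab)], v[col(tab)])          # one row of (u, v) scores per cell
  cov.wt(scores, wt = as.vector(tab), cor = TRUE)$cor[1, 2]
}

wtd_r(c(1, 2), 1:4, heart)   # scores 1, 2 and 1-4
wtd_r(c(0, 1), 0:3, heart)   # shifted scores 0, 1 and 0-3: identical value
```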

Below is the output for the test of independence that we saw previously.

For the SAS program, see HeartDisease.sas (output: HeartDisease.lst), discussed below:

First, recall the SAS code and the use of the MEASURES option:

[SAS program: lec10ex3.sas]

along with some parts of the output:

[SAS output: Table 1]

and

[SAS output: Table 2]

Note that SAS, by default, assumes that both the rows and the columns contain ordinal data. In SAS PROC FREQ, the MH statistic is labeled Mantel-Haenszel Chi-Square and is computed when the MEASURES or CMH option is used. Later we will see that the CMH (Cochran-Mantel-Haenszel) statistic is a generalization of the MH statistic that also tests conditional independence in higher-dimensional tables.

We are interested in the values of the Mantel-Haenszel statistic and the Pearson and Spearman correlations.

For the R program, see the file HeartDisease.R (output: HeartDisease.out), described below:

In R, to obtain some of these measures we can use the functions assocstats() and pears.cor() (see pears.cor_.R).

[R code from HeartDisease.R]

The assocstats() function is part of the {vcd} package, and it produces the same values as the SAS output in Table 1, except for the Mantel-Haenszel Chi-Square statistic.
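
For reference, a minimal sketch of this kind of call (the object name heart and the dimension labels are my own assumptions, not necessarily those used in HeartDisease.R):

```r
library(vcd)   # provides assocstats()

heart <- matrix(c( 12,   8,  31,  41,
                  307, 246, 439, 245),
                nrow = 2, byrow = TRUE,
                dimnames = list(CHD   = c("chd", "no chd"),
                                serum = c("0-199", "200-219", "220-259", "260+")))

assocstats(heart)   # Pearson and LR chi-square, phi, contingency coefficient, Cramer's V
```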

[R output: assocstats()]

The pears.cor() function produces the Mantel-Haenszel statistic for two-way tables, along with the Pearson correlation value that the SAS output reports in Table 2. The vectors c(1,2) and c(1,2,3,4) are the scores assigned to the rows and columns, respectively, of the data table 'heart'.

[R output: pears.cor()]

pears.cor() is a function written for this class, and you will need to run (source) it in R before you run HeartDisease.R. (See the comments in the HeartDisease.R file for more details.)
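
The pears.cor_.R file itself is not reproduced here. Purely as an illustration of the kind of computation such a function performs, here is a hypothetical stand-in; the name my_pears_cor and its return value are my own and not the course's code:

```r
# Hypothetical stand-in (NOT the course's pears.cor_.R): given a two-way table
# and row/column scores, return the score-based correlation, the MH statistic,
# and its chi-square(1) p-value.
my_pears_cor <- function(tab, rscores, cscores) {
  n    <- sum(tab)
  ubar <- sum(rscores * rowSums(tab)) / n
  vbar <- sum(cscores * colSums(tab)) / n
  r    <- sum(outer(rscores - ubar, cscores - vbar) * tab) /
          sqrt(sum((rscores - ubar)^2 * rowSums(tab)) *
               sum((cscores - vbar)^2 * colSums(tab)))
  M2   <- (n - 1) * r^2
  c(cor = r, M2 = M2, p.value = pchisq(M2, df = 1, lower.tail = FALSE))
}

heart <- matrix(c(12, 8, 31, 41, 307, 246, 439, 245), nrow = 2, byrow = TRUE)
my_pears_cor(heart, c(1, 2), c(1, 2, 3, 4))   # r about -0.14, M2 about 26.15
```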

Note that if you search for a built-in R function to compute the CMH statistic, you will find a few additional functions, e.g., mantelhaen.test(). However, these apply to three-way tables (e.g., 2×2×K), as we will learn later; as noted above, the CMH (Cochran-Mantel-Haenszel) statistic is a generalization of the MH statistic that also tests conditional independence in higher-dimensional tables. Try using the help system to see the function, e.g., help(mantelhaen.test). The function is part of the stats package, which is included with base R and loaded by default; if the help page does not appear, load the package with library(stats). See this quick introduction for basic information on R.

Again, we are interested in the values of the Mantel-Haenszel statistic and the Pearson and Spearman correlations.

As you can see from the tables above, there are many different statistics that measure association. They all compare observed frequencies with expected frequencies in some way. The Pearson and likelihood-ratio chi-square statistics, like the MH statistic, also reject independence between heart disease and cholesterol level, but the MH statistic is more powerful with ordinal data.

When dealing with ordinal data, if there is a positive or negative linear association between the variables, M² has a power advantage over X² and G² (see the sketch after this list):

  • X² and G² test against the most general alternative, i.e., any type of association.
  • They require df = (I − 1)(J − 1) parameters to describe the association (e.g., odds ratios).
  • M² detects a specific type of association (i.e., a linear trend) and summarizes it with a single (df = 1) parameter.
  • M² is more powerful because, when the association is approximately linear, it takes nearly the same value as X² and G² but is referred to a chi-square distribution with df = 1 rather than (I − 1)(J − 1), giving a smaller p-value.
  • For small to moderate sample sizes, the sampling distribution of M² is better approximated by its chi-square distribution than are the sampling distributions of X² and G²; this generally holds for distributions with smaller df.
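
As a hedged illustration of this comparison on the heart disease table (integer scores assumed), the general chi-square test and the df = 1 trend statistic can be contrasted as follows:

```r
heart <- matrix(c( 12,   8,  31,  41,
                  307, 246, 439, 245), nrow = 2, byrow = TRUE)

chisq.test(heart)                # general X^2 test of independence, df = (2-1)(4-1) = 3

n  <- sum(heart)
r  <- -0.1403                    # score-based correlation (integer scores) from above
M2 <- (n - 1) * r^2              # MH trend statistic
c(M2 = M2, p.value = pchisq(M2, df = 1, lower.tail = FALSE))
```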