4.1.1  Measures of Linear Trend
Recall the coronary heart disease example (table below), with data taken from the Framingham longitudinal study (Cornfield, 1962). In this study, n = 1329 patients were classified by serum cholesterol level and by whether they had been diagnosed with coronary heart disease (CHD). When we tested independence in Lesson 3, we treated both variables as nominal. However, cholesterol level is an ordinal variable. We can also treat heart disease as ordinal (e.g., no heart disease as the 'low' risk condition and heart disease as the 'high' risk condition), but for a binary variable this has no impact.
Serum cholesterol (mg/100 cc)

| | 0–199 | 200–219 | 220–259 | 260+ | total |
|---|---|---|---|---|---|
| CHD | 12 | 8 | 31 | 41 | 92 |
| no CHD | 307 | 246 | 439 | 245 | 1237 |
| total | 319 | 254 | 470 | 286 | 1329 |

QUESTION: Is there any evidence of a relationship between cholesterol and heart disease? If yes, what is the nature of that association?
When classification is ordinal, there may be a linear trend among the levels of the characteristics. When the null hypothesis of independence is rejected, it is natural and meaningful to measure that trend. For two interval- or ratio-valued variables, two common statistics measuring linear trend or correlation are Pearson's correlation coefficient and its nonparametric alternative, Spearman's correlation coefficient. There are many other measures of association for ordinal data, e.g., gamma and Kendall's tau, that you can explore on your own depending on your research or work needs (see Agresti (2007), Sec. 2.5 or Agresti (2013), Sec. 3.4). We will discuss Pearson's and Spearman's correlations and the Mantel-Haenszel statistic for testing independence between two ordinal variables.
Pearson correlation, r, describes a linear association between two interval variables and is one of the most common measures of linear trend. If Y and Z are both ordinal variables with numerical scores assigned to their categories, then
\(r=\dfrac{cov(Y,Z)}{s_Y s_Z}\)
where $cov(Y, Z)$ is a covariance between $Y$ and $Z$, and $s_Y$ and $s_Z$ are the standard deviations of $Y$ and $Z$, respectively.
Properties of r for contingency tables
 −1 ≤ r ≤ 1
 r = 0 corresponds to no (linear) relationship
 r = ±1 implies perfect association i.e., all observations fall into the diagonal cells. This is well defined only when the contingency table is a square table.
 r has very limited use for highly discrete and unbalanced data (e.g., large discrepancies in the cell sizes).
 r is appropriate only when both variables can be considered ordinal, and is most appropriate when they are interval or when we have reason to think we can choose an appropriate scale.
Spearman correlation (Spearman's rho) statistic is a nonparametric alternative to r. Here the observations are converted into rank orders and correlation is computed from the ranked pairs.
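As a rough illustration of the ranking step, the Python sketch below expands the CHD table into one (row, column) pair per patient, replaces each value by its average rank (ties share the mean of their ranks), and computes the ordinary correlation of the ranked pairs. The helper names `average_ranks` and `pearson` are illustrative only, and this is not the course's SAS or R code.

```python
from math import sqrt

# Counts from the CHD table: rows = (CHD, no CHD), columns = cholesterol levels.
table = [[12, 8, 31, 41],
         [307, 246, 439, 245]]

# One (row index, column index) pair per patient.
pairs = [(i, j) for i, row in enumerate(table)
         for j, count in enumerate(row) for _ in range(count)]

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Ordinary Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rows, cols = zip(*pairs)
spearman = pearson(average_ranks(rows), average_ranks(cols))
print(round(spearman, 4))
```

Because every patient in the same category is tied, each category effectively receives its midrank as a score, so Spearman's rho here is Pearson's r computed with midrank scores.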
For the heart disease example, you can see from either the SAS or R output below that Pearson's correlation r = −0.1403 and Spearman's correlation is −0.1448. Both are small negative values, implying a weak to moderate linear trend.
The Mantel-Haenszel (MH) statistic, M^{2}, applies to both the Pearson and Spearman correlations. It tests the null hypothesis of independence between two ordinal variables (i.e., that the correlation parameter, ρ, equals zero) against the two-sided alternative:
H_{0} : ρ = 0
H_{a} : ρ ≠ 0
where the test statistic is
\(M^2=(n-1)r^2\)
 When H_{0} is true, M^{2} has approximately a chi-square distribution with df = 1.
 The signed square root of M^{2}, \(\sqrt{n-1}\,r\), has approximately a standard normal distribution, N(0, 1), which can also be used for testing one-sided alternatives.
 When the sample correlation is zero (r = 0), M^{2} = 0.
 Under perfect association (r = ±1), M^{2} = n − 1.
 Larger values of M^{2} provide more evidence against the independence model.
 As n increases, M^{2} gets larger (recall our general discussion on effect of size and power)
 As r^{2} increases, M^{2} gets larger.
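The properties above can be sketched in a few lines of Python. The function name `mh_test` is hypothetical, and the p-values rely on the standard identities P(χ²₁ > x) = erfc(√(x/2)) and Φ(z) = ½ erfc(−z/√2), so only the standard library is needed.

```python
from math import erfc, sqrt

def mh_test(r, n):
    """Return M^2 = (n - 1) r^2, its signed root, and two p-values."""
    m2 = (n - 1) * r ** 2
    z = sqrt(n - 1) * r                  # signed root of M^2, approx. N(0, 1) under H0
    p_two_sided = erfc(sqrt(m2 / 2))     # P(chi-square with df = 1 exceeds m2)
    p_lower = 0.5 * erfc(-z / sqrt(2))   # P(Z <= z), for the one-sided alternative rho < 0
    return m2, z, p_two_sided, p_lower

# Rounded Pearson correlation and sample size from the heart disease example.
m2, z, p_two_sided, p_lower = mh_test(r=-0.1403, n=1329)
print(round(m2, 2), round(z, 2), p_two_sided, p_lower)
```

With r = 0 the function returns M² = 0 and a p-value of 1, and with r = ±1 it returns M² = n − 1, matching the bullet points.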
Category scores
To compute r, and thus M^{2}, we need to assign scores to both rows and columns. Scores are numerical values that we assign to the categories of our ordinal variables. For the categories of the row variable, u_{1} ≤ u_{2} ≤ ... ≤ u_{I}. For the categories of the column variable, v_{1} ≤ v_{2} ≤ ... ≤ v_{J}.
The default in SAS (and other software) is typically the integer scores (e.g. u_{1} = 1, u_{2} = 2, ...,). For more on the choice of scores in SAS you can read http://support.sas.com/onlinedoc/912/getDoc/statug.hlp/freq_sect18.htm#stat_freq_freqscores
Then the correlation for I × J tables equals:
\(r=\dfrac{\sum_i\sum_j(u_i-\bar{u})(v_j-\bar{v})n_{ij}}{\sqrt{[\sum_i\sum_j(u_i-\bar{u})^2 n_{ij}][\sum_i\sum_j(v_j-\bar{v})^2n_{ij}]}}\)
where \(\bar{u}=\sum_i\sum_j u_i n_{ij}/n\) is the row mean, and \(\bar{v}=\sum_i\sum_j v_j n_{ij}/n\) is the column mean.
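The formula translates directly into code. The following Python sketch (the function name `pearson_from_table` is illustrative, not part of the course files) computes the score means and r from a table of counts, then applies it to the CHD table with integer scores, taking CHD as the first row.

```python
from math import sqrt

def pearson_from_table(table, u, v):
    """Correlation of row scores u and column scores v, weighted by cell counts n_ij."""
    n = sum(sum(row) for row in table)
    u_bar = sum(u[i] * c for i, row in enumerate(table) for c in row) / n
    v_bar = sum(v[j] * c for row in table for j, c in enumerate(row)) / n
    num = sum((u[i] - u_bar) * (v[j] - v_bar) * c
              for i, row in enumerate(table) for j, c in enumerate(row))
    ss_u = sum((u[i] - u_bar) ** 2 * c for i, row in enumerate(table) for c in row)
    ss_v = sum((v[j] - v_bar) ** 2 * c for row in table for j, c in enumerate(row))
    return num / sqrt(ss_u * ss_v), u_bar, v_bar

# Rows: CHD, no CHD. Columns: 0-199, 200-219, 220-259, 260+.
heart = [[12, 8, 31, 41],
         [307, 246, 439, 245]]

r, u_bar, v_bar = pearson_from_table(heart, u=[1, 2], v=[1, 2, 3, 4])
n = sum(map(sum, heart))
m2 = (n - 1) * r ** 2
print(round(u_bar, 2), round(v_bar, 2), round(r, 4), round(m2, 2))
```

The sign of r depends on how the rows are ordered: with CHD scored 1 and no CHD scored 2, higher cholesterol goes with the lower row score, so r comes out negative.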
For the heart example, let the row scores be u_{1} = 1 (CHD), u_{2} = 2 (no CHD) and the column scores be v_{1} = 1, v_{2} = 2, v_{3} = 3, v_{4} = 4. Then the row and column means are \(\bar{u}=1.93\) and \(\bar{v}=2.54\), and
\(r\approx -0.14\text{ and }M^2=(1329-1)r^2\approx 26.15\text{ (computed from the unrounded } r\text{)}\)
Can you identify those statistics from the relevant SAS and/or R output below? With df = 1, $M^2=26.15$ has a very small p-value, so we have strong evidence to reject the null hypothesis of independence between these two ordinal variables; based on the correlation values, there seems to be a weak to moderate linear trend.
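As a check on the arithmetic, the sums entering r for the integer scores above can be written out from the table (values rounded; a worked sketch, not SAS or R output):
\(\bar{u}=\dfrac{1(92)+2(1237)}{1329}=\dfrac{2566}{1329}\approx 1.93,\qquad \bar{v}=\dfrac{1(319)+2(254)+3(470)+4(286)}{1329}=\dfrac{3381}{1329}\approx 2.54\)
\(\sum_i\sum_j(u_i-\bar{u})(v_j-\bar{v})n_{ij}\approx -50.95,\qquad \sum_i\sum_j(u_i-\bar{u})^2n_{ij}\approx 85.63,\qquad \sum_i\sum_j(v_j-\bar{v})^2n_{ij}\approx 1539.67\)
\(r\approx\dfrac{-50.95}{\sqrt{(85.63)(1539.67)}}\approx -0.1403,\qquad M^2=(1329-1)r^2\approx 26.15\)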
Note that assignment of scores is arbitrary, in the sense that, instead of positive integers 1, 2, 3 ... one may assign scores 0, 1, 2, ... to the same ordinal classification.
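A short Python sketch can make this concrete: shifting the scores from 1, 2, 3, ... to 0, 1, 2, ... leaves r unchanged, because the correlation is invariant to shifting or positively rescaling the scores. Changing the spacing between scores, on the other hand, generally does change r. The helper name `r_for_scores` and the unequally spaced scores (1, 2, 3, 6) are hypothetical, chosen only for illustration.

```python
from math import sqrt

# Rows: CHD, no CHD. Columns: the four cholesterol levels.
heart = [[12, 8, 31, 41],
         [307, 246, 439, 245]]

def r_for_scores(table, u, v):
    """Pearson correlation for a two-way table given row scores u and column scores v."""
    cells = [(u[i], v[j], c) for i, row in enumerate(table) for j, c in enumerate(row)]
    n = sum(c for _, _, c in cells)
    u_bar = sum(a * c for a, _, c in cells) / n
    v_bar = sum(b * c for _, b, c in cells) / n
    num = sum((a - u_bar) * (b - v_bar) * c for a, b, c in cells)
    den = sqrt(sum((a - u_bar) ** 2 * c for a, _, c in cells)
               * sum((b - v_bar) ** 2 * c for _, b, c in cells))
    return num / den

r_from_one = r_for_scores(heart, [1, 2], [1, 2, 3, 4])
r_from_zero = r_for_scores(heart, [0, 1], [0, 1, 2, 3])
# Hypothetical unequally spaced column scores (illustration only).
r_unequal = r_for_scores(heart, [1, 2], [1, 2, 3, 6])

print(round(r_from_one, 4), round(r_from_zero, 4), round(r_unequal, 4))
```

So the choice between equally spaced score sets is harmless, while a choice that changes the relative distances between categories is a substantive modeling decision.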
Below is the output for the test of independence that we saw previously.
For the SAS program, see HeartDisease.sas (output: HeartDisease.lst) discussed below:
First, recall the SAS code, and the use of OPTION MEASURES,
along with some parts of the output:
Table 1.
and
Table 2.
Note that SAS, by default, assumes that both the rows and the columns contain ordinal data. In SAS PROC FREQ, the MH statistic is labeled Mantel-Haenszel Chi-Square when it is computed using the MEASURES or CMH option. Later we will see that the CMH (Cochran-Mantel-Haenszel) statistic is a generalization of the MH statistic that also measures conditional independence in higher-dimensional tables.
We are interested in the values of the Mantel-Haenszel statistic and the Pearson and Spearman correlation statistics.
For the R programs see the R files HeartDisease.R (output: HeartDisease.out) described below:
In R, to obtain some of these measures we can use the functions assocstats() and pears.cor() (see pears.cor_.R).
The assocstats() function is part of the {vcd} package, and it produces the same values as the SAS output in Table 1, except for the Mantel-Haenszel Chi-Square statistic.
The pears.cor() function produces the Mantel-Haenszel statistic for two-way tables and the Pearson correlation value that the SAS output gives in Table 2. c(1,2) and c(1,2,3,4) are the scores that we assign to the rows and the columns, respectively, of the data table 'heart'.
This function was written for this class, and you will need to run it in R before you run HeartDisease.R. (See the comments in the HeartDisease.R file for more details.)
Note that if you search for a built-in function in R to compute the CMH statistic, you will find a few additional functions, e.g., mantelhaen.test(). However, these require a three-way table with strata (e.g., 2 × 2 × K), as we will learn later. Try using help to see the function, e.g., help(mantelhaen.test). The function is contained in the stats package, which is part of the standard R installation and is loaded by default, so there is no need to install it. See this quick introduction for basic information on R.
As you can see from the above tables, there are many different statistics for measuring association. They all compare observed frequencies with expected frequencies in some way. The Pearson and likelihood-ratio chi-square statistics, like the MH statistic, also reject independence between having heart disease and cholesterol level. But the MH statistic is the more powerful of these with ordinal data.
When dealing with ordinal data, when there is a positive or negative linear association between variables, M^{2} has a power advantage over X^{2} and G^{2}:
 X^{2} and G^{2} test the most general alternative hypothesis for any type of association.
 They need df = (I − 1) × (J − 1) parameters to describe the associations (e.g. odds ratios)
 M^{2} detects a specific type of association (i.e., linear), and can summarize it in terms of df = 1 parameter.
 M^{2} is more powerful because, when the association is truly linear, it has approximately the same value as X^{2} and G^{2} but with only df = 1 rather than (I − 1)(J − 1), and thus a smaller p-value.
 For small to moderate sample sizes, the sampling distribution of M^{2} is better approximated by a chi-squared distribution than are the sampling distributions of X^{2} and G^{2}; this holds in general for distributions with smaller df.
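The degrees-of-freedom argument can be illustrated numerically. The Python sketch below evaluates the upper-tail probability of the same statistic value under chi-square distributions with df = 1 and df = 3 (the latter being (I − 1)(J − 1) for a 2 × 4 table), using closed forms that hold for these two cases. The value 10 is hypothetical and is not taken from the heart disease output.

```python
from math import erfc, exp, pi, sqrt

def chi2_sf_df1(x):
    """P(chi-square with df = 1 exceeds x)."""
    return erfc(sqrt(x / 2))

def chi2_sf_df3(x):
    """P(chi-square with df = 3 exceeds x), via the closed form for df = 3."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

stat = 10.0  # hypothetical common value of the test statistic
p_df1 = chi2_sf_df1(stat)
p_df3 = chi2_sf_df3(stat)
print(p_df1, p_df3)
```

For the same statistic value, the df = 1 reference distribution always gives the smaller p-value, which is the source of the power advantage when the association really is close to linear.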