# 1.2 - Discrete Data Types and Examples


### Categorical/Discrete/Qualitative data

Measures on categorical or discrete variables consist of assigning observations to one of a number of categories in terms of counts or proportions. The categories can be unordered or ordered (see below).

#### Counts and Proportions

Counts are variables representing frequency of occurrence of an event:

• Number of students taking this class.
• Number of people who vote for a particular candidate in an election.

Proportions or “bounded counts” are ratios of counts:

• Number of students taking this class divided by the total number of graduate students.
• Number of people who vote for a particular candidate divided by the total number of people who voted.

Discretely measured responses can be:

• Nominal (unordered) variables, e.g., gender, ethnic background, religious or political affiliation
• Discrete interval variables with only a few values, e.g., number of times married
• Continuous variables grouped into a small number of categories, e.g., income grouped into subsets, blood pressure levels (normal, high-normal, etc.)

We will learn and evaluate mostly parametric models for these responses.

#### Measurement Scale and Context

Interval variables have a numerical distance between any two values (e.g., income).

Measurement hierarchy:

• nominal < ordinal < interval
• Methods applicable to one type of variable can also be used for variables at higher levels (but not at lower levels). For example, methods designed specifically for ordinal data should NOT be used for nominal variables, but methods designed for nominal data can be used for ordinal variables. Keep in mind, however, that such an analysis will be less than optimal because it does not use all the information available in the data.

For example, a course grade can be recorded at each level:

• Nominal: pass/fail
• Ordinal: A, B, C, D, F
• Interval: 4, 3, 2.5, 2, 1
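The grade example can be sketched in code to show which summaries each level supports. This is only an illustration under stated assumptions: the list of five student grades is made up, the pairing of the letters A, B, C, D, F with the scores 4, 3, 2.5, 2, 1 is assumed from the order in which they are listed, and the rule that only F counts as a fail is also an assumption.

```python
# Illustrative sketch: one course result coded at three measurement levels.
LETTERS_LOW_TO_HIGH = ["F", "D", "C", "B", "A"]  # ordinal: order is meaningful
POINTS = {"A": 4.0, "B": 3.0, "C": 2.5, "D": 2.0, "F": 1.0}  # assumed letter-to-score pairing


def to_pass_fail(letter):
    """Collapse a letter grade to a nominal pass/fail label (assumed rule: only F fails)."""
    return "fail" if letter == "F" else "pass"


def rank(letter):
    """Position of a letter grade in the ordering, 0 = lowest."""
    return LETTERS_LOW_TO_HIGH.index(letter)


grades = ["B", "A", "F", "C", "B"]  # hypothetical data for five students

# Nominal level: only counting per category is meaningful.
labels = [to_pass_fail(g) for g in grades]
pass_fail_counts = {label: labels.count(label) for label in ("pass", "fail")}

# Ordinal level: ordering is meaningful, so a median grade is defined.
median_grade = sorted(grades, key=rank)[len(grades) // 2]

# Interval level: distances are meaningful, so a mean score is defined.
mean_score = sum(POINTS[g] for g in grades) / len(grades)

print(pass_fail_counts, median_grade, mean_score)
```

Counting also works on the ordinal and interval codings, which is the sense in which lower-level methods carry upward. The median and the mean have no meaning for the pass/fail labels.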

Note that many variables can be considered either nominal or ordinal depending on the purpose of the analysis. Consider majors in English, Psychology and Computer Science. This classification may be considered nominal or ordinal depending on whether there is an intrinsic belief that it is ‘better’ to have a major in Computer Science than in Psychology or in English. Generally speaking, for a binary variable like pass/fail, it does not matter whether it is treated as ordinal or nominal.

Context is important! The context of the study and the relevant questions of interest are important in specifying what kind of variable we will analyze. For example,

• Did you get the flu? (Yes or No) -- is a binary nominal categorical variable
• What was the severity of your flu? (Low, Medium, or High) -- is an ordinal categorical variable

Based on the context we also decide whether a variable is a response (dependent) variable or an explanatory (independent) variable.

Discuss the following question on the Course Discussion Board:

Why do you think the measurement hierarchy matters, and how does it influence analysis? That is, why do we recommend that statistical methods/models designed for variables at a higher level not be used to analyze variables at lower levels of the hierarchy?

#### Contingency Tables

• A statistical tool for summarizing and displaying results for categorical variables
• Must have at least two categorical variables, each with at least two levels (2 × 2 table)
• May have several categorical variables, each at several levels ($$I_1 \times I_2 \times I_3 \times \dots \times I_k$$ tables)
• Place counts of each combination of the variables in the appropriate cells of the table.
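As a minimal sketch of placing counts in cells, the snippet below cross-classifies a handful of observations by two categorical variables. The records are hypothetical and reuse the major and pass/fail categories mentioned earlier. `collections.Counter` is just one convenient way to tally the combinations.

```python
from collections import Counter

# Hypothetical raw data: one (major, result) pair per student.
records = [
    ("English", "pass"),
    ("Computer Science", "fail"),
    ("English", "fail"),
    ("Computer Science", "pass"),
    ("English", "pass"),
    ("Computer Science", "fail"),
]

# Count how often each combination of the two variables occurs.
cell_counts = Counter(records)

majors = ["English", "Computer Science"]  # row variable levels
results = ["pass", "fail"]                # column variable levels

# 2 x 2 contingency table: table[i][j] is the count for majors[i] and results[j].
table = [[cell_counts[(m, r)] for r in results] for m in majors]

for major, row in zip(majors, table):
    print(major, row)
```

A combination that never occurs still gets a cell, because `Counter` returns 0 for missing keys.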

Here are a few simple examples of contingency tables.

A university offers only two degree programs: English and Computer Science. Admission is competitive and there is a suspicion of discrimination against women in the admission process. Here is a two-way table of all applicants by sex and admission status. These data show an association between the sex of the applicants and their success in obtaining admission.

| | Male | Female | Total |
|---|---|---|---|
| Admit | 35 | 20 | 55 |
| Deny | 45 | 40 | 85 |
| Total | 80 | 60 | 140 |
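One way to see the association in this table is to compare admission rates by sex. The sketch below uses the counts from the table. The odds ratio is included as one common summary of a 2 × 2 table, and it is an addition here, not something the example itself computes.

```python
# Counts taken from the admissions table above.
counts = {
    "Male": {"Admit": 35, "Deny": 45},
    "Female": {"Admit": 20, "Deny": 40},
}

# Proportion admitted within each sex (conditional on the column).
admit_rate = {
    sex: cell["Admit"] / (cell["Admit"] + cell["Deny"])
    for sex, cell in counts.items()
}

# Difference in admission proportions, male minus female.
rate_difference = admit_rate["Male"] - admit_rate["Female"]

# Odds of admission for men divided by odds of admission for women.
odds_ratio = (counts["Male"]["Admit"] / counts["Male"]["Deny"]) / (
    counts["Female"]["Admit"] / counts["Female"]["Deny"]
)

print(admit_rate)
print(round(rate_difference, 4), round(odds_ratio, 4))
```

If sex and admission were unrelated, the two rates would be roughly equal and the odds ratio would be close to 1. Whether an observed difference is larger than chance would explain is a separate question that this sketch does not address.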

#### Example: Number of Delinquent Children by the County and the Head of Household Education Level

This is another example of a two-way table, in this case a 4×4 table. The variable County can be treated as nominal, whereas the Education Level of the Head of Household can be treated as an ordinal variable. Questions to ask include, for example: (1) What is the distribution of the number of delinquent children per county given the education level of the head of the household? (2) Is there a trend in where the delinquent children reside given the education levels?

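| County | Low | Medium | High | Very High | Total |
|---|---|---|---|---|---|
| Alpha | 15 | 1 | 3 | 1 | 20 |
| Beta | 20 | 10 | 10 | 15 | 55 |
| Gamma | 3 | 10 | 10 | 2 | 25 |
| Delta | 12 | 14 | 7 | 2 | 35 |
| Total | 50 | 35 | 30 | 20 | 135 |

• Ordinal and nominal variables
• Fixed total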
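Question (1) asks for the distribution across counties within each education level. A minimal sketch of that calculation, using the counts from the table, divides each cell by its column total. The variable names are illustrative.

```python
counties = ["Alpha", "Beta", "Gamma", "Delta"]
levels = ["Low", "Medium", "High", "Very High"]

# Cell counts from the table above: rows are counties, columns are education levels.
counts = [
    [15, 1, 3, 1],
    [20, 10, 10, 15],
    [3, 10, 10, 2],
    [12, 14, 7, 2],
]

# Marginal totals, which should match the Total row and column of the table.
row_totals = [sum(row) for row in counts]
col_totals = [sum(row[j] for row in counts) for j in range(len(levels))]
grand_total = sum(row_totals)

# Conditional distribution of county given education level:
# each cell divided by the total of its column.
county_given_level = {
    level: {
        county: counts[i][j] / col_totals[j]
        for i, county in enumerate(counties)
    }
    for j, level in enumerate(levels)
}

for level in levels:
    shares = {c: round(p, 3) for c, p in county_given_level[level].items()}
    print(level, shares)
```

Comparing these columns side by side is a first, informal look at question (2): if the shares shift systematically from Low to Very High, that suggests a trend worth testing formally.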

#### Example: Census Data

Source: American Fact Finder website (U.S. Census Bureau: Block level data)

This is an example of a 2×2×4 three-way table that cross-classifies a population from a PA census block by Sex, Age and Race where all three variables are nominal.

#### Example: Clinical Trial of Effectiveness of an Analgesic Drug

Source: Koch et al. (1982)

• This is a four-way table (2×2×2×3 table) because it cross-classifies observations by four categorical variables: Center, Status, Treatment and Response
• Fixed number of patients in two Treatment groups
• Small counts

We will see throughout this course that there are many different methods for analyzing data that can be represented in contingency tables.

### Example of proportions in the news

You should already be familiar with the simple analysis of estimating a population proportion of interest and computing a 95% confidence interval, and with the meaning of the margin of error (MOE).

Notation:

• Population proportion = p (sometimes we use π)
• Population size = N
• Sample proportion = $$\hat{p}$$ = X/n = # with a trait / total #
• Sample size = n
• X is the number of units with a particular trait, or the number of successes.

The Rule for Sample Proportions

• If numerous samples of size n are taken, the frequency curve of the sample proportions ($$\hat{p}$$'s) from the various samples will be approximately normal with mean p and standard deviation $$\sqrt{p(1-p)/n}$$.
• In symbols, $$\hat{p} \sim N(p,\ p(1-p)/n)$$ approximately, where the second argument is the variance.
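The rule leads directly to the usual large-sample 95% confidence interval. The sketch below assumes the common approach of plugging $$\hat{p}$$ in for the unknown p in the standard deviation and using the normal multiplier 1.96. The function name and the poll numbers (X = 390 out of n = 1000) are hypothetical, chosen only for illustration.

```python
import math


def proportion_ci(x, n, z=1.96):
    """Large-sample confidence interval for a proportion.

    x: number of units with the trait, n: sample size,
    z: normal multiplier (1.96 corresponds to about 95% confidence).
    Returns (p_hat, margin_of_error, (lower, upper)).
    """
    p_hat = x / n
    # Estimated standard deviation of p_hat: sqrt(p(1-p)/n) with p replaced by p_hat.
    std_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin_of_error = z * std_error
    return p_hat, margin_of_error, (p_hat - margin_of_error, p_hat + margin_of_error)


# Hypothetical poll: 390 of 1000 respondents have the trait.
p_hat, moe, (lower, upper) = proportion_ci(x=390, n=1000)
print(f"p_hat = {p_hat:.3f}, MOE = {moe:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

With these made-up numbers the margin of error is about three percentage points, which is the size typically quoted for polls of roughly a thousand respondents. The normal approximation is less reliable when n is small or when $$\hat{p}$$ is close to 0 or 1.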

See the analysis in Lesson 0 of a CNN/Gallup poll from the news, as reported by CNN in June 2006: "How well is Bush handling Iraq?"