1.2  Discrete Data Types and Examples
Categorical/Discrete/Qualitative data
Measurements on categorical or discrete variables assign each observation to one of a number of categories; the data are then summarized as counts or proportions. The categories can be unordered or ordered (see below).
Counts and Proportions
Counts are variables representing frequency of occurrence of an event:
 Number of students taking this class.
 Number of people who vote for a particular candidate in an election.
Proportions or “bounded counts” are ratios of counts:
 Number of students taking this class divided by the total number of graduate students.
 Number of people who vote for a particular candidate divided by the total number of people who voted.
Discretely measured responses can be:
 Nominal (unordered) variables, e.g., gender, ethnic background, religious or political affiliation
 Ordinal (ordered) variables, e.g., grade levels, income levels, school grades
 Discrete interval variables with only a few values, e.g., number of times married
 Continuous variables grouped into a small number of categories, e.g., income grouped into subsets, blood pressure levels (normal, high-normal, etc.)
We will learn and evaluate mostly parametric models for these responses.
Measurement Scale and Context
Interval variables have a numerical distance between two values (e.g. income)
Measurement hierarchy:
 nominal < ordinal < interval
 Methods applicable to one type of variable can also be used for variables at higher levels (but not at lower levels). For example, methods specifically designed for ordinal data should NOT be used for nominal variables, but methods designed for nominal data can be used for ordinal variables. Keep in mind, however, that such an analysis will be less than optimal because it will not use all of the information available in the data.
Example: Grades
 Nominal: pass/fail
 Ordinal: A,B,C,D,F
 Interval: 4,3,2.5,2,1
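As a rough illustration of the hierarchy, the sketch below codes the same grades at all three levels and applies a summary suited to each one. It rests on assumptions not stated in the notes: the letter grades and point values listed above are paired positionally (A = 4, B = 3, C = 2.5, D = 2, F = 1), F is the only failing grade, and the list of student grades is made up.

```python
# Assumption: letter grades and point values are paired in the order listed above.
POINTS = {"A": 4.0, "B": 3.0, "C": 2.5, "D": 2.0, "F": 1.0}
ORDER = ["F", "D", "C", "B", "A"]  # lowest to highest

grades = ["B", "A", "F", "C", "B"]  # hypothetical student grades

# Nominal coding: only category membership is meaningful, so we count.
# Assumption: F is the only failing grade.
nominal = ["fail" if g == "F" else "pass" for g in grades]
n_fail = nominal.count("fail")

# Ordinal coding: order is meaningful, so a median grade makes sense.
ranks = sorted(ORDER.index(g) for g in grades)
median_grade = ORDER[ranks[len(ranks) // 2]]

# Interval coding: distances are meaningful, so a mean makes sense.
mean_points = sum(POINTS[g] for g in grades) / len(grades)

print(n_fail, median_grade, mean_points)
```

The count works at every level, the median needs at least an ordering, and the mean needs numerical distances. This is one way to see why a method can move up the hierarchy but not down.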
Note that many variables can be considered either nominal or ordinal, depending on the purpose of the analysis. Consider majors in English, Psychology and Computer Science. This classification may be considered nominal or ordinal depending on whether there is an intrinsic belief that it is ‘better’ to have a major in Computer Science than in Psychology or in English. Generally speaking, for a binary variable such as pass/fail, it does not matter whether we treat it as ordinal or nominal.
Context is important! The context of the study and the relevant questions of interest are important in specifying what kind of variable we will analyze. For example,
 Did you get the flu? (Yes or No) is a binary nominal categorical variable
 What was the severity of your flu? (Low, Medium, or High) is an ordinal categorical variable
Based on the context we also decide whether a variable is a response (dependent) variable or an explanatory (independent) variable.
Discuss the following question on the Course Discussion Board:
Why do you think the measurement hierarchy matters, and how does it influence analysis? That is, why do we recommend that statistical methods/models designed for variables at a higher level not be used to analyze variables at lower levels of the hierarchy?
Contingency Tables
 A statistical tool for summarizing and displaying results for categorical variables
 Must have at least two categorical variables, each with at least two levels (2 × 2 table)
 May have several categorical variables, each at several levels (I_{1} × I_{2} × I_{3} × … × I_{k} tables)
 Place counts of each combination of the variables in the appropriate cells of the table.
Here are a few simple examples of contingency tables.
Example: Admissions Data
A university offers only two degree programs: English and Computer Science. Admission is competitive, and there is a suspicion of discrimination against women in the admission process. Here is a two-way table of all applicants by sex and admission status. These data show an association between the sex of the applicants and their success in obtaining admission.
          Male    Female    Total
Admit       35        20       55
Deny        45        40       85
Total       80        60      140
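To make the association concrete, here is a minimal sketch that computes the proportion admitted within each sex from the counts in the table. The dictionary layout and the function name `admission_rate` are illustrative choices, not part of the course materials.

```python
# Counts copied from the admissions table above.
table = {
    "Male": {"Admit": 35, "Deny": 45},
    "Female": {"Admit": 20, "Deny": 40},
}


def admission_rate(counts):
    """Proportion admitted among the applicants of one sex."""
    total = counts["Admit"] + counts["Deny"]
    return counts["Admit"] / total


rates = {sex: admission_rate(counts) for sex, counts in table.items()}

for sex, rate in rates.items():
    print(f"{sex}: {rate:.1%} admitted")
```

The two conditional proportions are 35/80 for men and 20/60 for women. Whether a gap of that size is more than chance variation is a question for the formal methods covered later in the course.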

Example: Number of Delinquent Children by the County and the Head of Household Education Level
Source: OMB Statistical Policy Working Paper 22
This is another example of a two-way table, in this case a 4×4 table. The variable County can be treated as nominal, whereas the Education Level of the Head of Household can be treated as an ordinal variable. Questions to ask include: (1) What is the distribution of the number of delinquent children across counties, given the education level of the head of the household? (2) Is there a trend in where the delinquent children reside, given the education levels?
County    Low    Medium    High    Very High    Total
Alpha      15         1       3            1       20
Beta       20        10      10           15       55
Gamma       3        10      10            2       25
Delta      12        14       7            2       35
Total      50        35      30           20      135

 Ordinal and nominal variables
 Fixed total
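Question (1) above asks for a conditional distribution. As a sketch of one way to compute it, the code below divides each cell by its column total, giving the distribution of delinquent children over counties within each education level. The variable names are illustrative; the counts are those in the table.

```python
counties = ["Alpha", "Beta", "Gamma", "Delta"]
levels = ["Low", "Medium", "High", "Very High"]

# Rows are counties, columns are education levels (counts from the table above).
counts = [
    [15, 1, 3, 1],
    [20, 10, 10, 15],
    [3, 10, 10, 2],
    [12, 14, 7, 2],
]

# Column totals: number of delinquent children at each education level.
col_totals = [sum(row[j] for row in counts) for j in range(len(levels))]

# Conditional distribution of county given education level.
county_given_level = {
    level: {
        county: counts[i][j] / col_totals[j]
        for i, county in enumerate(counties)
    }
    for j, level in enumerate(levels)
}

for level, dist in county_given_level.items():
    row = ", ".join(f"{county} {p:.2f}" for county, p in dist.items())
    print(f"{level}: {row}")
```

Each printed row sums to 1, so the education levels can be compared directly even though their totals differ.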
Example: Census Data
Source: American Fact Finder website (U.S. Census Bureau: Block level data)
This is an example of a 2×2×4 three-way table that cross-classifies a population from a PA census block by Sex, Age and Race, where all three variables are nominal.
Example: Clinical Trial of Effectiveness of an Analgesic Drug
Source: Koch et al. (1982)
 This is a four-way table (2×2×2×3 table) because it cross-classifies observations by four categorical variables: Center, Status, Treatment and Response
 Fixed number of patients in two Treatment groups
 Small counts
We will see throughout this course that there are many different methods for analyzing data that can be represented in contingency tables.
Example of proportions in the news
You should already be familiar with the simple analysis of estimating a population proportion of interest and computing a 95% confidence interval, and with the meaning of the margin of error (MOE).
Notation:
 Population proportion = p (sometimes we use π)
 Population size = N
 Sample proportion = \(\hat{p}\) = X/n = # with a trait / total #
 Sample size = n
 X is the number of units with a particular trait, or the number of successes.
The Rule for Sample Proportions
 If numerous samples of size n are taken, the frequency curve of the sample proportions (the \(\hat{p}\)'s) from the various samples will be approximately normal with mean p and standard deviation \(\sqrt{p(1-p)/n}\).
 \(\hat{p} \sim N(p, p(1-p)/n)\)
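The rule above leads directly to the usual margin of error and confidence interval. The following is a minimal sketch rather than a prescribed procedure: it replaces the unknown p with the sample proportion in the standard deviation formula (the common large-sample approximation), and the inputs x = 520 and n = 1000 are hypothetical numbers chosen only for illustration.

```python
import math
from statistics import NormalDist


def proportion_ci(x, n, level=0.95):
    """Large-sample confidence interval for a population proportion.

    x is the number of units with the trait and n is the sample size.
    The standard error plugs p_hat in for the unknown p.
    """
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    moe = z * se
    return p_hat, moe, (p_hat - moe, p_hat + moe)


# Hypothetical poll: 520 of 1000 respondents have the trait.
p_hat, moe, (lower, upper) = proportion_ci(x=520, n=1000)
print(f"p_hat = {p_hat:.3f}, MOE = {moe:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the standard deviation shrinks like 1/√n, quadrupling the sample size roughly halves the margin of error.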
See the analysis in Lesson 0 of the CNN/Gallup poll reported by CNN in June 2006: "How well is Bush handling Iraq?"