# Glossary

**Bar graph**: Graphical representation for categorical data in which vertical (or sometimes horizontal) bars are used to depict the number of experimental units in each category; bars are separated by space.

**Bell-shaped distribution**: Unimodal, symmetric distribution (see the entries for **Unimodal** and **Symmetric**).

**Bias**: The systematic favoring of certain outcomes.

**Bimodal**: A distribution with two prominent peaks (i.e., two modes).

**Binomial random variable**: A specific type of discrete random variable that counts how often a particular event occurs in a fixed number of tries or trials.
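
As a rough illustration, the probability that a binomial random variable takes a particular count can be computed with the standard binomial formula \(P(X=k)=\binom{n}{k}p^{k}(1-p)^{n-k}\). The sketch below assumes independent trials with a constant success probability \(p\); the function name `binomial_pmf` and the coin-flip numbers are made up for this example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p (assumed constant)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical example: number of heads in 10 flips of a fair coin.
prob_three_heads = binomial_pmf(3, 10, 0.5)
print(round(prob_three_heads, 4))  # 0.1172
```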

**Boxplot**: Graphical representation for quantitative data in which a box represents the middle 50% of experimental units; the line within the box represents the middle of the distribution (i.e., median); and lines extend to the highest and lowest scores that are not outliers.

**Categorical data**: Names or labels without a meaningful order of magnitudes, also known as qualitative.

**Central Limit Theorem**: With a sufficiently large sample size (typically \(n\geq 30\)), the distribution of sample means will be approximately normal, regardless of the shape of the population distribution, with a mean of \(\mu\) and a standard deviation of \(\frac{\sigma}{\sqrt{n}}\).
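
A small simulation can make this concrete. The sketch below is only an illustration under assumed settings: the exponential population (mean 1, standard deviation 1), the seed, and the number of repeated samples are all arbitrary choices, not part of the definition.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

n = 30              # size of each sample
num_samples = 2000  # number of repeated samples (arbitrary)

# Right-skewed population: exponential with mean 1 and standard deviation 1.
sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.mean(sample))

mean_of_means = statistics.mean(sample_means)  # should be close to mu = 1
sd_of_means = statistics.stdev(sample_means)   # should be close to sigma / sqrt(n)
theoretical_se = 1.0 / n**0.5

print(mean_of_means, sd_of_means, theoretical_se)
```

Even though the population is strongly skewed, the sample means cluster around \(\mu\) with a spread near \(\frac{\sigma}{\sqrt{n}}\).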

**Chi-square test of independence:** A statistical test to determine if two or more categorical variables are related.

**Cluster Sampling**: A method of selecting a sample from a population in which the population is divided into subgroups (i.e., clusters) and a simple random sample of clusters is taken; all individuals within these clusters may be sampled, or a simple random sample may be taken from the selected clusters.

**Coefficient of determination**: The proportion of variance shared by two variables.

**Complement**: The event that a given event does not occur; e.g., the complement of event A is written \(A^C\) or \({A}'\).

**Conditional probability**: The probability of an event occurring given that a second event has occurred, represented as \(\mid \); e.g., \(P(A\mid B)\) would be read as the “probability of A given B”.
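
One common way to compute a conditional probability is \(P(A\mid B)=\frac{P(A\cap B)}{P(B)}\). The sketch below applies that formula to hypothetical counts (the 100 students and the two events are invented for illustration).

```python
from fractions import Fraction

# Hypothetical counts for 100 students (made up for illustration).
total = 100
count_b = 40        # students for whom event B occurred
count_a_and_b = 10  # students for whom both A and B occurred

p_b = Fraction(count_b, total)
p_a_and_b = Fraction(count_a_and_b, total)

# P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 1/4
```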

**Confidence interval**: A range computed using sample statistics to estimate an unknown population parameter with a given level of confidence.

**Confounding variable**: A variable that is considered in a research study that could influence the relations between the variables in the study.

**Contingency table**: A display of counts for two categorical variables in which the rows represent one variable and the columns represent a second variable.

**Continuous data**: Quantitative data that can take on any value between the minimum and maximum, and any value between two other values.

**Control group**: A level of a factor that does not receive an actual treatment; this group may receive no treatment or a placebo.

**Correlation**: Measure of the strength and direction of the relationship between two variables (e.g., Pearson’s *r* is a measure of the relationship between two quantitative variables).

**Cumulative distribution**: A listing of all possible values along with the probability of that value and all lower values occurring (i.e., the **cumulative probability**).

**Cumulative probability**: Likelihood of an outcome less than or equal to a given value occurring.

**Data**: Pieces of information that may be used as the basis for inference or reasoning (Note: “data” is plural, “datum” is singular).

**Degrees of freedom**: Symbolized by \(df\) or \(\nu\); the number of values that are free to vary; typically the sample size minus the number of parameters being estimated.

**Dependent *t* test**: Statistical test comparing two paired means, also known as a paired *t* test.

**Descriptive statistics**: Methods for summarizing data (e.g., mean, median, mode, range, variance, graphs).

**Deviation**: An individual score minus the mean.

**Discrete data**: Data that can only take on a set number of values.

**Empirical Rule**: For bell-shaped distributions, about 68% of the data will be within one standard deviation of the mean, about 95% will be within two standard deviations of the mean, and about 99.7% will be within three standard deviations of the mean.
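
The percentages in the rule can be checked against an exact normal curve. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the dictionary name `within` is just a label chosen for this example.

```python
from statistics import NormalDist

z = NormalDist()  # mean 0, standard deviation 1

# Proportion of a normal distribution within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(k, round(proportion, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```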

**Estimation**: An inferential procedure in which sample data are used to approximate a population parameter.

**Expected value**: The mean value in the long run for many repeated samples, symbolized as \(E(X)\).
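
For a discrete random variable, the expected value is usually computed as \(E(X)=\sum x\,P(x)\). The sketch below assumes a fair six-sided die, a hypothetical example chosen only because every outcome has the same probability.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
outcomes = range(1, 7)
probability = Fraction(1, 6)  # each face is assumed equally likely

# E(X) = sum of (value * probability of that value)
expected_value = sum(x * probability for x in outcomes)
print(expected_value)  # 7/2
```

No single roll can equal 3.5; the value describes the long-run average over many rolls.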

**Experimental study**: A study in which the researcher manipulates the treatments received by subjects and collects data, also known as a **scientific study.**

**Experimental units**: Each individual that is studied in an experiment or observational study.

**Explanatory variable**: Variable that is manipulated by the researcher, also known as an independent variable.

**Event**: A particular outcome or collection of outcomes.

***F* ratio**: Variability between groups divided by variability within groups; the test statistic used in analysis of variance.

**Factor**: One explanatory variable with two or more levels that is manipulated by the researcher.

**First quartile (Q1)**: 25th percentile; middle of the values below the median.

**Five number summary**: List of the minimum value, Q1, median, Q3, and maximum value.

**Histogram**: Graphical representation for quantitative data in which vertical (or sometimes horizontal) bars are used to depict the number of experimental units in each range of values; bars touch.

**Hypothesis testing**: An inferential procedure in which a statement about a population (e.g., “the mean heights of men and women are equal in the population”) is examined using data from a sample to determine the likelihood that the sample was drawn from a population for which the stated parameter is true.

**Independent**: Not related; the outcome of one event does not impact the outcome of the other event.

**Independent *t* test**: Statistical test comparing the means of two different groups.

**Independent variable**: Variable that is manipulated by the researcher, also known as an explanatory variable.

**Inferential statistics**: Methods for using sample data to make conclusions about a population.

**Interquartile range (IQR)**: The difference between the first (Q1) and third (Q3) quartiles.
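
The quartile definitions in this glossary (Q1 and Q3 as the middles of the values below and above the median) can be sketched as follows. Software packages use several different quartile conventions, so this is one assumed convention rather than the only correct one; when the number of values is odd, the median itself is left out of both halves. The function name and the data are invented for illustration, and at least two values are required.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are the medians of the lower and upper halves of the
    sorted data; with an odd count the overall median is excluded.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]   # values below the median position
    upper = data[-half:]  # values above the median position
    return min(data), median(lower), median(data), median(upper), max(data)


scores = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical data
minimum, q1, med, q3, maximum = five_number_summary(scores)
iqr = q3 - q1
print(minimum, q1, med, q3, maximum, iqr)  # 2 4.0 6.0 9.5 12 5.5
```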

**Intersection**: The overlapping of two or more events, represented as \(\cap\); e.g., \(P(A\cap B)\) is the “probability of A and B”.

**Law of Large Numbers**: Given a large number of repeated trials, the average of the results will be approximately equal to the expected value.

**Least squares method**: Method of constructing a regression line which makes the sum of squared residuals as small as possible for the given data.

**Levels**: Values of a factor.

**Lurking variable**: A variable that is not considered in a research study that could influence the relations between the variables in the study.

**Margin of error**: The distance above and below the sample statistic within which the population parameter likely (but not definitively) falls (i.e., half of the width of a confidence interval).
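
As a hedged example, the margin of error for a sample mean can be computed as \(z^{*}\frac{\sigma}{\sqrt{n}}\) when the population standard deviation is treated as known. That known-\(\sigma\) setting is an assumption of this sketch (a *t* multiplier would typically be used otherwise), and the function name and numbers are hypothetical.

```python
from statistics import NormalDist


def z_interval_for_mean(sample_mean, sigma, n, confidence=0.95):
    """Return (margin_of_error, lower, upper) for a z-based interval.

    Assumes sigma, the population standard deviation, is known.
    """
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z_star * sigma / n**0.5
    return margin, sample_mean - margin, sample_mean + margin


margin, lower, upper = z_interval_for_mean(sample_mean=50, sigma=10, n=100)
print(round(margin, 2), round(lower, 2), round(upper, 2))  # 1.96 48.04 51.96
```

The margin of error (1.96) is half the width of the resulting confidence interval (48.04 to 51.96).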

**Mean**: The numerical average; calculated as the sum of all of the data values divided by the number of values; represented as \(\overline{x}\).

**Median**: The middle of the distribution that has been ordered from smallest to largest; for distributions with an even number of values, this is the mean of the two middle values.

**Mode**: The most frequently occurring value(s) in the distribution; may be used with quantitative or categorical variables.

**Multimodal**: A distribution with more than two prominent peaks (i.e., more than two modes).

**Multistage Sampling**: A method for selecting a sample from a population in multiple stages in which successively smaller groups from the population are selected.

**Mutually exclusive**: Two events that cannot occur at the same time, also known as **disjoint** events.

**Non-response bias**: Systematic favoring of certain outcomes that occurs when the individuals who choose to participate in a study differ from the individuals who choose not to participate.

**Normal distribution**: Bell-shaped distribution (i.e., symmetrical, unimodal) defined by its mean and standard deviation; the standard normal distribution, with a mean of 0 and standard deviation of 1, is also known as the **z distribution**.

**Observational study**: A non-experimental study in which the researcher collects data without performing any manipulations.

**Odds**: Ratio of the chance that an event occurs over the chance that the event does not occur.

**Odds ratio**: Ratio of the odds for group 1 over the odds for group 2.
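
The odds and odds ratio definitions can be worked through with counts. The two groups of 100 and their event counts below are hypothetical, and the helper name `odds` is chosen only for this sketch.

```python
from fractions import Fraction


def odds(event_count, group_size):
    """Odds = count with the event / count without the event."""
    return Fraction(event_count, group_size - event_count)


odds_group1 = odds(30, 100)  # 30 of 100 experience the event
odds_group2 = odds(10, 100)  # 10 of 100 experience the event
odds_ratio = odds_group1 / odds_group2
print(odds_group1, odds_group2, odds_ratio)  # 3/7 1/9 27/7
```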

**One-way analysis of variance**: A statistical test for comparing the means of three or more independent groups.

**Outcome**: The result of one trial.

**Outlier**: An observation in a distribution that is much higher or much lower than the other observations.

***p* value**: The probability that the observed sample statistic (or a statistic more extreme) would be randomly obtained from a population with the hypothesized parameter if the null hypothesis were true.

**Paired *t* test**: Statistical test comparing two paired means, also known as a dependent *t* test.

**Parameter**: A measure concerning a population (e.g., population mean).

**Participants**: Human experimental units.

**Percentile**: The value at or below which a given percentage of the values in a distribution fall (e.g., 25% of values fall at or below the 25th percentile).

**Pie chart**: Graphical representation for categorical data in which a circle is partitioned into “slices” on the basis of the proportions of each category.

**Placebo group**: A group that receives what, to them, appears to be a treatment, but in fact is neutral and does not contain any actual treatment (e.g., a sugar pill in a medication study).

**Point estimate**: An estimate of a population parameter obtained from sample data (e.g., the sample mean or the sample proportion).

**Pooled standard deviation**: A method for computing a single standard deviation for more than one independent group in which the sample size of each group is taken into account.
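
For two groups, a commonly used form is \(s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}\). The sketch below assumes that two-group formula; the function name and the sample sizes and standard deviations are hypothetical.

```python
from math import sqrt


def pooled_sd(n1, s1, n2, s2):
    """Pooled standard deviation for two independent groups,
    weighting each group's variance by its degrees of freedom."""
    pooled_variance = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return sqrt(pooled_variance)


# Hypothetical groups: n = 10 with s = 2.0, and n = 15 with s = 3.0.
print(round(pooled_sd(n1=10, s1=2.0, n2=15, s2=3.0), 3))  # 2.654
```

The result falls between the two group standard deviations and sits closer to that of the larger group.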

**Population**: The entire set of possible observations in which we are interested.

**Power**: Given that the null hypothesis is false, the probability of rejecting it; in other words, the probability of correctly rejecting \(H_{0}\).

**Probability density function (PDF)**: A curve such that the area under the curve within any interval of values along the horizontal axis gives the probability for that interval.

**Probability distribution**: A table, graph, or formula that gives the probability of a given outcome's occurrence.

**Quantitative data**: Numerical values with magnitudes that can be placed in a meaningful order, also known as numerical.

**Random sampling error**: Differences between the population and samples caused by taking random samples as opposed to using the entire population, also referred to as “sampling error”.

**Random variable**: A numerical characteristic that takes on different values due to chance.

**Randomization**: Method for assigning subjects to different treatment groups based on chance (e.g., random number generator, flipping a coin), also known as random assignment.

**Range**: The difference between the maximum and minimum values.

**Relative frequency probability**: Across a large number of trials, the number of times a particular outcome occurs divided by the total number of trials.

**Relative risk**: Ratio of the risks of two groups.

**Replication**: Assigning numerous experimental units to each treatment group.

**Representative sample**: A subset of the population from which data is collected that accurately reflects the entire population.

**Residual**: Difference between the observed value and the predicted value; e.g., in simple linear regression \(e=y-\widehat{y}\).

**Response bias**: Systematic favoring of certain outcomes that occurs when participants either do not respond truthfully or give answers that they feel the researcher wants to hear.

**Response variable**: The outcome variable, also known as a dependent variable.

**Risk**: Percent or fraction of a group that experiences a given outcome.
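
Risk and relative risk can be computed directly from counts. The groups and counts in this sketch are hypothetical, and the helper name `risk` is used only for illustration.

```python
from fractions import Fraction


def risk(event_count, group_size):
    """Risk = fraction of the group that experiences the outcome."""
    return Fraction(event_count, group_size)


risk_group1 = risk(30, 100)  # 30 of 100 experience the outcome
risk_group2 = risk(10, 100)  # 10 of 100 experience the outcome
relative_risk = risk_group1 / risk_group2
print(risk_group1, risk_group2, relative_risk)  # 3/10 1/10 3
```

Here the outcome is three times as common in group 1 as in group 2.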

**Robust test:** A statistical procedure that can perform relatively reliably even when some assumptions are not met.

**Rule of sample proportions**: Given that both \(n \times p \geq 10\) and \(n \times (1-p) \geq 10 \), the distribution of sample proportions will be approximately normally distributed with a mean of \(p\) and a standard deviation of \(\sqrt{\frac{p(1-p)}{n}}\).

**Sample**: A subset of the population from which data is actually collected.

**Sample space**: The collection of all possible outcomes of an experiment.

**Sampling bias**: Systematic favoring of certain outcomes due to the methods employed to obtain the sample.

**Scatterplot:** A 2-dimensional graphical representation for two quantitative variables in which the independent variable is on the x-axis and the dependent variable is on the y-axis.

**Scientific study**: A study in which the researcher manipulates the treatments received by subjects and collects data, also known as an **experimental study.**

**Second quartile (Q2)**: 50th percentile; the median.

**Selection bias**: Systematic favoring of certain outcomes that occurs when the sample that is selected does not reflect the population of interest.

**Side-by-side boxplots**: One graph containing multiple boxplots for independent groups on the same variable.

**Simple random sampling**: A method of selecting a sample from a population in which each member of the population has an equal chance of being selected; sometimes abbreviated as "SRS".

**Skewed**: A distribution in which values are more spread out on one side of the center than on the other.

**Skewed to the left**: A distribution in which the lower values (towards the left on a number line) are more spread out than the higher values, also known as negatively skewed.

**Skewed to the right**: A distribution in which the higher values (towards the right on a number line) are more spread out than the lower values, also known as positively skewed.

**Standard deviation**: Roughly the average difference between individual data values and the mean, represented as \(s\) in a sample or \(\sigma\) in a population.

**Standard error**: Standard deviation of a distribution of sample statistics.

**Standard error of the mean**: Standard deviation of a distribution of sample means, computed as \(SE(\overline{x})= \frac {\sigma}{\sqrt{n}}\).

**Standard error of the sample proportions**: Standard deviation of the distribution of sample proportions, computed as \(SE(\widehat{p})= \sqrt{\frac {p(1-p)}{n}}\).
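
The formula \(\sqrt{\frac{p(1-p)}{n}}\) and the two sample-size conditions \(n \times p \geq 10\) and \(n \times (1-p) \geq 10\) can be checked together. In this sketch the function names and the values \(p=0.4\), \(n=50\) are assumptions made for illustration.

```python
from math import sqrt


def proportion_standard_error(p, n):
    """Standard error of the sample proportion: sqrt(p(1 - p) / n)."""
    return sqrt(p * (1 - p) / n)


def normal_approximation_ok(p, n):
    """Check the conditions n*p >= 10 and n*(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10


p, n = 0.4, 50  # hypothetical population proportion and sample size
print(normal_approximation_ok(p, n), round(proportion_standard_error(p, n), 4))
# True 0.0693
```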

**Standardized score**: Distance between an individual score and the mean in standard deviation units; also known as a **z score.**

**Statistic**: A measure concerning a sample (e.g., sample mean).

**Statistical literacy**: “People’s ability to interpret and critically evaluate statistical information and data-based arguments appearing in diverse media channels, and their ability to discuss their opinions regarding such statistical information” (Gal, as cited by Rumsey, 2002).

**Statistical significance**: Sample statistics vary from the specified population parameters to the extent that it is unlikely that the results obtained were due to random sampling error; rather, we conclude that the differences observed in the sample were due to actual differences in the population.

**Statistics**: The art and science of answering questions and exploring ideas through the processes of gathering data, describing data, and making generalizations about a population on the basis of a smaller sample.

**Stem-and-leaf plot**: Graphical representation for quantitative data in which individual observations are displayed using numbers; usually all digits except for the final digit are in the stem column and the final digits are in the leaf column.

**Stratified Random Sampling**: A method of obtaining a sample from a population in which the population is divided into important subgroups, known as strata, and then separate simple random samples are drawn from each subgroup.

**Subjective probability**: An individual’s personal judgement about the likelihood of an event occurring; also known as a personal probability.

**Subjects**: Experimental units, typically non-human.

**Sum of squared deviations**: Deviations are first squared and then added together, also known as sum of squares or SS.

**Sum of squared residuals**: The sum of all of the residuals squared: \(\sum (y-\widehat{y})^2\); also known as the **sum of squared errors** (SSE).
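
To illustrate, the sketch below fits a least squares line to four hypothetical points and then computes each residual \(e=y-\widehat{y}\) and the sum of squared residuals. The slope and intercept use the usual simple linear regression formulas, which are assumed here rather than taken from this glossary.

```python
from statistics import mean

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

x_bar, y_bar = mean(x), mean(y)

# Least squares slope and intercept for simple linear regression.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

predicted = [intercept + slope * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]  # e = y - y_hat
sse = sum(e**2 for e in residuals)

print(round(sse, 2))  # 0.7
```

Any other straight line through these points would give a larger sum of squared residuals.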

**Symmetric**: A distribution that is similar on both sides of the center.

***t* distribution**: A bell-shaped distribution that takes into account the number of degrees of freedom; as *n* approaches infinity, the *t* distribution approaches the *z* distribution.

**Third quartile (Q3)**: 75th percentile; middle of the values above the median.

**Treatment**: A specific condition applied to the experimental units.

**Tukey’s Honestly Significant Differences Test**: A post-hoc analysis performed after an analysis of variance that tests the statistical significance of all possible pairs of groups.

**Type I error**: Rejecting \(H_0\) when \(H_0\) is really true; the probability of this error is denoted by \(\alpha\) ("alpha") and is commonly set at .05.

**Type II error**: Failing to reject \(H_0\) when \(H_0\) is really false; the probability of this error is denoted by \(\beta\) ("beta").

**Unimodal**: A distribution with one prominent peak (i.e., one mode).

**Union**: The probability of at least one of the specified events occurring, represented as \(\cup \); e.g., \(P(A\cup B) \) is the “probability of A or B” (note: this would also include the probability of A and B).

**Variability**: The extent to which values differ from one another, also referred to as “spread” or “dispersion”.

**Variable**: Characteristic that is measured and can take on different values (in other words, something that can vary).

**Variance**: Approximately the average of all of the squared deviations; represented as \(s^2\) in a sample or \(\sigma^2\) in a population.
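
The chain from deviations to sum of squares to variance to standard deviation can be traced step by step. This sketch assumes the sample formulas, dividing by \(n-1\) (which is why the variance is only "approximately" the average squared deviation); the scores are hypothetical.

```python
from statistics import mean, variance

scores = [2, 4, 4, 6]  # hypothetical sample

x_bar = mean(scores)                                   # 4
deviations = [x - x_bar for x in scores]               # [-2, 0, 0, 2]
sum_of_squares = sum(d**2 for d in deviations)         # 8
sample_variance = sum_of_squares / (len(scores) - 1)   # 8 / 3
sample_sd = sample_variance**0.5

# The standard library's sample variance should agree with the manual steps.
print(round(sample_variance, 3), round(variance(scores), 3), round(sample_sd, 3))
# 2.667 2.667 1.633
```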

**Venn diagram**: A visual representation in which the sample space is depicted as a box and events are represented as circles within the sample space.

**z distribution**: Bell-shaped distribution (i.e., symmetrical, unimodal) with a mean of 0 and standard deviation of 1, also known as the **standard normal distribution.**

**z score**: Distance between an individual score and the mean in standard deviation units; also known as a **standardized score.**
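
In symbols this is typically written \(z=\frac{x-\mu}{\sigma}\). The short sketch below applies that formula; the function name and the exam-score numbers are hypothetical.

```python
def z_score(x, mean, sd):
    """Distance between a score and the mean, in standard deviation units."""
    return (x - mean) / sd


# Hypothetical exam: mean 70, standard deviation 10, individual score 85.
print(z_score(85, 70, 10))  # 1.5
```

A score of 85 is therefore 1.5 standard deviations above the mean; a negative *z* score would indicate a score below the mean.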

-=-=-=-

Rumsey, D. J. (2002). Statistical literacy as a goal for introductory statistics courses. Journal of Statistics Education, 10(3). Retrieved from http://www.amstat.org/publications/jse/v10n3/rumsey2.html