**Correlation** is a measurement of the strength and direction of the linear relationship between two numerical variables.

Two variables have a **linear relationship** if the cloud of points on the scatterplot is centered around a straight line (specifically, the average values of the Y-variable across the different X-values form a linear pattern).

A **positive correlation** between two measurement variables indicates that as the values of one variable increase, so do the values of the other variable. For example, the height (in inches) and weight (in pounds) of a group of people would have a positive correlation because taller people tend to weigh more and shorter people tend to weigh less.

A **negative correlation** between two measurement variables indicates that as the values of one variable increase, the values of the other variable tend to decrease. For example, the gas mileage (in miles per gallon) and weight (in pounds) of a group of cars would have a negative correlation because heavier cars tend to get worse gas mileage and lighter cars tend to get better mileage.
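The sign of a correlation can be checked numerically. The sketch below assumes the Pearson correlation coefficient as the measure (the definitions above do not name a specific formula), and the data values are made up purely for illustration. The function name `pearson_r` is likewise just a label chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data, not real measurements
heights_in = [62, 65, 68, 70, 73]
weights_lb = [120, 140, 155, 170, 190]
car_weights_lb = [2200, 2800, 3300, 3900, 4500]
mpg = [38, 32, 28, 23, 18]

r_people = pearson_r(heights_in, weights_lb)  # positive: taller goes with heavier
r_cars = pearson_r(car_weights_lb, mpg)       # negative: heavier goes with lower mpg
print(r_people, r_cars)
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one.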

A result is said to be **statistically significant** if it would be unlikely to occur by chance alone, that is, if there were no real effect or relationship behind it.
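One way to make "unlikely by chance" concrete is a permutation test, sketched below. This technique is an assumption brought in for illustration and is not named in the definition above; the helper names and the data are also made up. The idea is to re-pair the y-values with the x-values in every possible order and count how often a re-pairing looks at least as strongly (positively) related as the observed pairing.

```python
from itertools import permutations

def co_deviation(xs, ys):
    """Sum of products of deviations from the means.

    Shuffling ys does not change either variable's mean or spread, so ranking
    shuffles by this quantity is the same as ranking them by correlation.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

def permutation_p_value(xs, ys):
    """Fraction of all re-pairings at least as positively associated as observed."""
    observed = co_deviation(xs, ys)
    count = 0
    total = 0
    for shuffled in permutations(ys):
        total += 1
        if co_deviation(xs, shuffled) >= observed:
            count += 1
    return count / total

# Made-up illustrative data, not real measurements
heights_in = [62, 65, 68, 70, 73]
weights_lb = [120, 140, 155, 170, 190]

p_value = permutation_p_value(heights_in, weights_lb)
print(p_value)  # a small value means chance alone rarely produces such a pattern
```

Enumerating every ordering is only practical for very small data sets; with more points, a random sample of shuffles is the usual substitute.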

An **outlier** is a data point that falls far outside the pattern seen in other points in the data set. **Outliers** can have a big effect on sensitive statistics such as the mean, the standard deviation, and the correlation.
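The effect of a single outlier on the mean and standard deviation can be seen with a small made-up data set. This is only an illustrative sketch using Python's standard `statistics` module; the numbers are invented.

```python
import statistics

# Made-up values that sit close together
values = [10, 11, 12, 13, 14]
# The same values plus one point far outside the pattern
with_outlier = values + [100]

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)

stdev_before = statistics.stdev(values)        # sample standard deviation
stdev_after = statistics.stdev(with_outlier)

print(mean_before, mean_after)
print(stdev_before, stdev_after)
```

One added point more than doubles the mean here and inflates the standard deviation by an even larger factor.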

An **explanatory variable** in a study is a variable used to help predict the outcome; the researcher believes the response variable might depend on its value. For example, in a study of how a patient’s recovery time depends on the dosage of a new flu treatment, the treatment dosage would be the **explanatory variable**. It is customary to put the values of the **explanatory variable** on the x-axis.

The **response variable** is the outcome variable in a study or the variable the researcher is trying to predict. For example, in a study of how a patient’s recovery time depends on the dosage of a new flu treatment, the recovery time would be the **response variable**. It is customary to put the values of the **response variable** on the y-axis.

The regression line is also called the **least squares** line because it gives the smallest possible value to the sum of the squared differences between the actual y-values in the data and the predicted y-values on the line.

The **regression line** provides the average value of Y (or the predicted value) for a given value of x when the scatterplot shows a linear pattern. The line can be defined by the equation:

\(\text{Predicted value of Y} = a + bx\)

where “a” is the y-intercept and “b” is the slope of the line (so b tells you how much the predicted Y is expected to change for every one-unit increase in x; a negative b means Y is expected to go down).
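As a rough sketch of how a and b can be computed, the code below uses the standard closed-form least squares formulas for a single explanatory variable: the slope is the sum of products of deviations divided by the sum of squared x-deviations, and the intercept makes the line pass through the point of means. The function names and the data are made up for illustration.

```python
def fit_least_squares(xs, ys):
    """Return (a, b) for the line: predicted Y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: line passes through (mean_x, mean_y)
    return a, b

def sum_squared_errors(a, b, xs, ys):
    """Sum of squared differences between actual and predicted y-values."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_least_squares(xs, ys)
print(a, b)                 # roughly a = 0.05, b = 1.99
print(a + b * 6)            # predicted Y for x = 6

# Nudging the line away from the fit makes the squared error larger
best = sum_squared_errors(a, b, xs, ys)
print(best < sum_squared_errors(a + 0.1, b, xs, ys))
print(best < sum_squared_errors(a, b + 0.1, xs, ys))
```

The last two checks illustrate the "least squares" name: any other intercept or slope gives a larger sum of squared differences on the same data.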