Lesson 5: Regression Shrinkage Methods
Introduction
Key Learning Goals for this Lesson: 
Textbook reading: Consult Course Schedule 
Prediction:
 Linear regression: $E(Y_j \mid X) = X\beta$;
 Or for a more general regression function: $E(Y_j \mid X) = f(X)$;
 In a prediction context, there is less concern about the values of the individual components of the right-hand side; the interest is in their total contribution.
Variable Selection:
 The driving force behind variable selection:
 The desire for a parsimonious regression model (one that is simpler and easier to interpret);
 The need for greater accuracy in prediction.
 The notion of what makes a variable "important" is still not well understood, but one interpretation (Breiman, 2001) is that a variable is important if dropping it seriously affects prediction accuracy.
 Selecting variables in regression models is a complicated problem, and there are many conflicting views on which variable selection criterion is best, e.g., the likelihood ratio test (LRT), the F-test, AIC, or BIC.
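The "important if dropping it hurts prediction" interpretation can be read quite literally as a refit-and-compare procedure. The following is a minimal sketch under stated assumptions, not Breiman's own algorithm: the simulated data, the train/test split, and the helper names `ols_mse` and `drop_one_importance` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def ols_mse(X_train, y_train, X_test, y_test):
    """Fit OLS with an intercept on the training data; return test-set MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = y_test - A_test @ beta
    return float(np.mean(resid ** 2))

def drop_one_importance(X_train, y_train, X_test, y_test):
    """Increase in test MSE when each column is dropped and the model is refit."""
    full = ols_mse(X_train, y_train, X_test, y_test)
    p = X_train.shape[1]
    scores = []
    for j in range(p):
        keep = [k for k in range(p) if k != j]
        reduced = ols_mse(X_train[:, keep], y_train, X_test[:, keep], y_test)
        scores.append(reduced - full)
    return np.array(scores)

# Simulated data (an assumption for illustration): only columns 0 and 1 matter.
rng = np.random.default_rng(0)
n, p = 400, 4
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

importance = drop_one_importance(X[:200], y[:200], X[200:], y[200:])
print(importance)
```

In this setup a large positive score means the held-out error rises sharply without that variable; scores near zero (possibly slightly negative) suggest the variable adds little to prediction.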
There are two main types of stepwise procedures in regression:

Backward elimination: eliminate the least important variable from the selected ones.

Forward selection: add the most important variable from the remaining ones.

A hybrid version incorporates ideas from both main types: it alternates backward and forward steps and stops when every variable has either been retained for inclusion or removed.
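The two basic procedures can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation behind any particular package: "importance" is measured here by the Gaussian AIC (up to an additive constant), each loop stops as soon as no single step lowers the AIC, and the function names and simulated data are hypothetical.

```python
import numpy as np

def aic(X, y, cols):
    """Gaussian AIC up to a constant: n*log(RSS/n) + 2*(number of coefficients)."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + 2 * A.shape[1]

def forward_selection(X, y):
    """Start from the intercept-only model; add the variable that lowers AIC most."""
    selected, remaining = [], list(range(X.shape[1]))
    best = aic(X, y, selected)
    while remaining:
        score, j = min((aic(X, y, selected + [j]), j) for j in remaining)
        if score >= best:          # no addition improves the criterion
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return sorted(selected)

def backward_elimination(X, y):
    """Start from the full model; drop the variable whose removal lowers AIC most."""
    selected = list(range(X.shape[1]))
    best = aic(X, y, selected)
    while selected:
        score, j = min((aic(X, y, [k for k in selected if k != j]), j)
                       for j in selected)
        if score >= best:          # no removal improves the criterion
            break
        best = score
        selected.remove(j)
    return sorted(selected)

# Simulated data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(size=n)

fwd = forward_selection(X, y)
bwd = backward_elimination(X, y)
print("forward: ", fwd)
print("backward:", bwd)
```

Both runs should recover the two strong predictors, but either may also keep a noise column, and the two paths need not agree in general.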
Criticisms of Stepwise Methods:

There is no guarantee that different stepwise procedures will select the same variables, or that any of them will find the "best" subset.

When there are more variables than observations (p > n), backward elimination is typically not feasible, because the full model that it starts from cannot be fit uniquely.

The maximum or minimum of a set of correlated F statistics is not itself an F statistic, so the nominal F-test significance levels used at each step are not valid.

A stepwise procedure produces a single answer (one very specific subset) to the variable selection problem, although several different subsets may be equally good for regression purposes.

The computation is easy with the R functions step() or regsubsets() (the latter from the leaps package). However, to arrive at a practically useful answer, you must know the practical context in which your inference will be used.
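Two of the criticisms above, the p > n problem and the behavior of the largest F statistic, can be checked numerically. The sketch below is only an illustration under assumptions: all data are simulated with no true signal, the sizes and seed are arbitrary, and `simple_f_stats` is a hypothetical helper that computes one single-predictor F statistic per column.

```python
import numpy as np

rng = np.random.default_rng(2)

# Part 1: with p > n, the full model has no unique least-squares fit.
n_small, p_big = 20, 50
X_wide = rng.normal(size=(n_small, p_big))
y_wide = rng.normal(size=n_small)
A = np.column_stack([np.ones(n_small), X_wide])     # n x (p + 1) design matrix
rank = int(np.linalg.matrix_rank(A))
beta, *_ = np.linalg.lstsq(A, y_wide, rcond=None)   # one of infinitely many exact fits
rss_wide = float(np.sum((y_wide - A @ beta) ** 2))
print("rank:", rank, "columns:", A.shape[1], "RSS:", rss_wide)

# Part 2: the largest of several F statistics does not follow the F distribution.
def simple_f_stats(X, y):
    """F statistic (1 and n-2 df) for regressing y on each column of X separately."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r2 = r ** 2
    return (len(y) - 2) * r2 / (1.0 - r2)

n, p, reps = 50, 10, 2000
single, largest = [], []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)        # global null: y is unrelated to every column
    f = simple_f_stats(X, y)
    single.append(f[0])           # a predictor fixed before looking at the data
    largest.append(f.max())       # the predictor a forward step would pick

cutoff = np.quantile(single, 0.95)   # empirical 5% critical value for one F statistic
rate_single = float(np.mean(np.array(single) > cutoff))
rate_max = float(np.mean(np.array(largest) > cutoff))
print("exceedance rate, fixed predictor:", rate_single)
print("exceedance rate, best predictor: ", rate_max)
```

In the first part the design matrix has rank n rather than p + 1 and the residual sum of squares is numerically zero, so no error degrees of freedom remain to judge which variable to drop. In the second part, the selected statistic exceeds the single-statistic cutoff far more often than 5% of the time even though no predictor is related to the response; if the ten statistics were independent, the rate would be about 1 - 0.95^10, or roughly 0.40.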
Scott Zeger on 'how to pick the wrong model': Turn your scientific problem over to a computer that, knowing nothing about your science or your question, is very good at optimizing AIC, BIC, ...