STAT 897D - Applied Data Mining & Statistical Learning

course overview

Data mining and statistical learning methods use a variety of computational tools for understanding large, complex datasets. In some cases, the focus is on building models to predict a quantitative or qualitative output based on a collection of inputs. In others, the goal is simply to find relationships and structure in data with no specific output variable. This course takes an applied approach to understanding the methodology, motivation, assumptions, strengths, and weaknesses of the most widely applicable methods in this field.

course topics

This graduate level course covers the following topics:

  • Understanding statistical learning and model selection
  • Using resampling methods such as cross-validation and bootstrap
  • Using linear regression methods
  • Examining variable selection in building regression models
  • Using regression shrinkage methods such as ridge regression and LASSO
  • Using dimension reduction methods such as principal components regression and partial least squares
  • Using methods for modeling non-linear relationships
  • Using classification methods such as logistic regression, discriminant analysis, and nearest-neighbors
  • Using decision tree methods including bagging and boosting
  • Understanding the use of support vector machines
  • Using principal components analysis methods
  • Using cluster analysis methods

Here is a link to the Online Notes for STAT 897D.

prerequisites

  • STAT 501 (Regression Methods) or a similar course that covers analysis of research data through simple and multiple regression and correlation; polynomial models; indicator variables; step-wise, piece-wise, and logistic regression.
  • Basics of probability, expectation, and conditional distributions.
  • Matrix algebra and multivariate calculus will be beneficial but are not required.

textbook

An Introduction to Statistical Learning: with Applications in R, by James, G., Witten, D., Hastie, T., and Tibshirani, R. Springer, 2013.

software

The examples in the course use R and students will do weekly R Labs to apply statistical learning methods to real-world data. Extensive guidance in using R will be provided, but previous basic programming skills in R or exposure to a programming language such as MATLAB or Python will be useful.

R involves programming. Students should already feel comfortable using R at a basic level, be quick learners of software packages, or be able to figure out how to do the required analyses in another package of their choice. Students who have no experience with programming or are anxious about being able to manipulate software code are strongly encouraged to take the one-credit course in R in order to establish this foundation before taking this course.

R will be supported and sample programs will be supplied, but you will be required to do some programming on your own. Due to different software applications, software versions, and platforms, there may be issues with running code. Students must be proactive in seeking advice and help from appropriate sources, including documentation resources, other students, the teaching assistant, the instructor, or the helpdesk.

assessment plan

  • Weekly Quizzes: 20%
  • R Labs: 20%
  • Individual Projects (2): 25%
  • Team Project: 25%
  • Participation in online discussion forums: 10%

academic integrity

All Penn State policies regarding ethics and honorable behavior apply to this course. Academic integrity is the pursuit of scholarly activity free from fraud and deception and is an educational objective of this institution. All University policies regarding academic integrity apply to this course. Academic dishonesty includes, but is not limited to, cheating, plagiarizing, fabricating of information or citations, facilitating acts of academic dishonesty by others, having unauthorized possession of examinations, submitting work of another person or work previously used without informing the instructor, or tampering with the academic work of other students.

For any material or ideas obtained from other sources, such as the text or things you see on the web, in the library, etc., a source reference must be given. Direct quotes from any source must be identified as such.

All exam answers must be your own, and you must not provide any assistance to other students during exams. Any instances of academic dishonesty WILL be pursued under the University and Eberly College of Science regulations concerning academic integrity. For more information on academic integrity, see Penn State's statement on plagiarism and academic dishonesty.

The Eberly College of Science Code of Mutual Respect and Cooperation embodies the values that we hope our faculty, staff, and students possess and will endorse to make The Eberly College of Science a place where every individual feels respected and valued, as well as challenged and rewarded.

disabilities

Penn State welcomes students with disabilities into the University's educational programs. If you have a disability-related need for reasonable academic adjustments in this course, contact the Office for Disability Services (ODS) at 814-863-1807 (V/TTY). For further information regarding ODS, please visit the Office for Disability Services Web site at http://equity.psu.edu/ods/.

In order to receive consideration for course accommodations, you must contact ODS and provide documentation (see the documentation guidelines at http://equity.psu.edu/ods/guidelines/documentation-guidelines). If the documentation supports the need for academic adjustments, ODS will provide a letter identifying appropriate academic adjustments. Please share this letter and discuss the adjustments with your instructor as early in the course as possible. You must contact ODS and request academic adjustment letters at the beginning of each semester.

course author

Dr. Jia Li is the original author of these course materials. They have been adapted and enhanced by Dr. Le Bao, Dr. Iain Pardoe, and Dr. Megan Romer.