COURSE DESCRIPTION

This course is an introduction to statistical thinking and concepts, beginning with basic probability theory. The course concludes with selected statistical methods useful for data exploration and description of vector-valued data, a common setup in modern data analysis applications. Python and/or R will be used for practical implementation of all numerical and graphical procedures, including simulations.

Prerequisites

Common requirements for the Semester in Mathematical Tools for Data Science.

COURSE GOALS

On completion of the course, students will:

  • learn about basic statistical concepts and methods, including uncertainty and the role of probabilistic reasoning in data analysis;
  • master presentation and use of mathematical concepts in probability theory;
  • learn about selected methods for addressing statistical problems, such as multiple linear regression and logistic regression for issues in inferring about data structure, prediction, and classification;
  • implement methods and graphical procedures via Python, using meaningful datasets.

COURSE CONTENTS

1. Introduction (0.5 week)
Statistical thinking, role of data, stochasticity, and uncertainty.

2. Probability Theory (2 weeks)
Sample space and events. Basic properties of probability. Probability laws. Conditional probability and independence. Bayes' theorem.
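
As an illustration of the kind of computation this unit covers, the sketch below applies Bayes' theorem to a diagnostic test. The function name `posterior` and the numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are assumptions chosen for illustration, not course material.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) for a binary event A, using Bayes' theorem.

    prior               = P(A)
    likelihood          = P(B | A)
    false_positive_rate = P(B | not A)
    """
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Illustrative numbers only (hypothetical test and disease).
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(disease | positive test) = {p:.3f}")
```

Even with an accurate test, the posterior probability stays well below one half here because the prior (prevalence) is small.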

3. Random Variables (2.5 weeks)
Mean and variance. Discrete families: Bernoulli, binomial, geometric and Poisson densities. Continuous families: exponential and normal densities. Multivariate normal distribution.
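
A simulation along the following lines could be used to check the mean and variance of a binomial random variable against their theoretical values, np and np(1 - p). This is a minimal sketch using only the Python standard library; the helper `binomial_draw` and the values n = 20, p = 0.3 are assumptions made for illustration.

```python
import random
import statistics


def binomial_draw(n, p, rng):
    """Draw one Binomial(n, p) value as a sum of n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))


rng = random.Random(0)          # fixed seed so the run is reproducible
n, p = 20, 0.3                  # illustrative parameters
draws = [binomial_draw(n, p, rng) for _ in range(10_000)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)   # unbiased sample variance

print(f"mean: simulated {sample_mean:.3f}, theoretical {n * p:.3f}")
print(f"variance: simulated {sample_var:.3f}, theoretical {n * p * (1 - p):.3f}")
```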

4. Graphical methods for exploring univariate and multivariate data (1.5 weeks)
Graphical tools for multivariate descriptions (matrix plots, parallel plots, icon plots, etc.).

5. Statistical Inference (4.5 weeks)
Likelihood. Asymptotic normality of maximum likelihood estimators. Bootstrap. Bayesian inference. Elements of Bayesian inference via MCMC (Markov Chain Monte Carlo).
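
To give a flavour of the bootstrap, the sketch below estimates the standard error of a sample mean by resampling with replacement and compares it with the usual formula s / sqrt(n). The function `bootstrap_se`, the simulated exponential data, and the number of resamples are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_se(data, statistic, n_resamples=2000, seed=0):
    """Nonparametric bootstrap estimate of the standard error of `statistic`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Resample n observations from the data, with replacement.
        resample = rng.choices(data, k=n)
        estimates.append(statistic(resample))
    # The spread of the resampled statistics approximates the standard error.
    return statistics.stdev(estimates)


# Simulated data (illustrative): 200 exponential observations with mean 2.
rng = random.Random(42)
x = [rng.expovariate(0.5) for _ in range(200)]

se_boot = bootstrap_se(x, statistics.fmean)
se_formula = statistics.stdev(x) / len(x) ** 0.5

print(f"bootstrap SE of the mean: {se_boot:.4f}")
print(f"formula   SE of the mean: {se_formula:.4f}")
```

For the mean the two numbers should be close; the appeal of the bootstrap is that the same function also works for statistics with no simple standard-error formula, for example `statistics.median`.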

6. Regression Models (3 weeks)
Linear regression and logistic regression. Prediction and classification.
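
As a small example of the first topic, the sketch below fits a simple linear regression by least squares using the closed-form estimates, on simulated data. It covers only the one-predictor case; the function `fit_line` and the true intercept and slope (1 and 2) are assumptions made for illustration.

```python
import random


def fit_line(x, y):
    """Least-squares estimates (intercept, slope) for y = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = sample covariance of (x, y) divided by sample variance of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Simulated data (illustrative): y = 1 + 2 x + standard normal noise.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(100)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

intercept, slope = fit_line(x, y)
print(f"estimated intercept: {intercept:.3f}, estimated slope: {slope:.3f}")

# Prediction at a new point (hypothetical value x = 5).
print(f"predicted y at x = 5: {intercept + slope * 5:.3f}")
```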

Bibliography

  1. Baron, Michael (2014). Probability and Statistics for Computer Scientists, 2nd Edition, CRC Press.
  2. DeGroot, Morris H.; Schervish, Mark J. (2012). Probability and Statistics, Addison-Wesley. [Main textbook]
  3. Wasserman, Larry (2004). All of Statistics: A Concise Course on Statistical Inference, Springer.
  4. Cook, Dianne; Swayne, Deborah F. (2007). Interactive and Dynamic Graphics for Data Analysis: With R and GGobi, Springer.

Support Sessions

2 hours a week with a teaching assistant

Grading

Two midterm exams (25% each), homework (20%) and a final project (30%)