COURSE DESCRIPTION

This course develops an introduction to statistical thinking and concepts, beginning with the mathematical (probabilistic) description of random variables. Emphasis is placed on understanding why and when particular probabilistic models may be used in applications, and how they are elicited from context and data analysis. The course concludes with selected statistical methods for exploring and describing vector-valued data, a common setting in modern data analysis applications. Python and/or R will be used for the practical implementation of all numerical and graphical procedures, including computer simulations.

Prerequisites

Common requirements for the Semester in Mathematical Tools for Data Science (Spring).

COURSE GOALS

On completion of the course, students will:

  • learn about basic statistical concepts and methods, including uncertainty and the role of probabilistic reasoning in data analysis;
  • feel comfortable using probability models to describe numerical data, including model specification and computing aspects;
  • learn about selected methods for addressing statistical problems, such as multiple linear regression and logistic regression, for inference about data structure, prediction, and classification;
  • implement methods and graphical procedures in R/Python, using meaningful datasets.

COURSE CONTENTS

1. Introduction (0.5 week)

1.1. Statistical thinking, role of data, stochasticity, and uncertainty.

2. Random variables and random vectors (3 weeks)

2.1. Discrete and continuous univariate and multivariate densities.
2.2. Computer simulation of random variables.
2.3. Description of distributions: moments (mean, variance), covariance, correlation.
2.4. Joint, marginal and conditional distributions, independence.
2.5. Bayes' theorem.
2.6. Law of large numbers: theory, practice and simulations.
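
The syllabus does not prescribe specific exercises. As an illustration only, a simulation of the kind items 2.2 and 2.6 refer to might look like the following Python sketch. It assumes NumPy is available; the Bernoulli success probability, the seed, and the helper name `running_mean` are hypothetical choices made for this example, not course material.

```python
import numpy as np


def running_mean(samples):
    """Return the sample means of the first 1, 2, ..., n draws."""
    samples = np.asarray(samples, dtype=float)
    return np.cumsum(samples) / np.arange(1, samples.size + 1)


rng = np.random.default_rng(seed=0)  # arbitrary fixed seed for reproducibility
p = 0.3                              # hypothetical Bernoulli success probability

# Simulate 100,000 independent Bernoulli(p) draws.
draws = rng.binomial(n=1, p=p, size=100_000)
means = running_mean(draws)

# The law of large numbers says the sample mean approaches p as n grows.
for n in (10, 1_000, 100_000):
    print(f"n = {n:>7}: sample mean = {means[n - 1]:.4f} (true mean {p})")
```

Plotting `means` against the sample size would give the usual graphical view of the convergence.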

3. Notable probability models (2 weeks)

3.1. Discrete families: Bernoulli, binomial, geometric and Poisson densities.
3.2. Continuous families: exponential, Weibull and normal densities.
3.3. Multivariate normal distribution.

4. Graphical methods for exploring univariate and multivariate data (1.5 weeks)

4.1. Graphical tools for multivariate descriptions (matrix plots, parallel plots, icon plots, etc.).
4.2. Univariate and multivariate density estimation.

5. Statistical inference (4.5 weeks)

5.1. Parametric estimation via likelihood methods.
5.2. Asymptotic properties of maximum likelihood estimators.
5.3. Bootstrapping.
5.4. Bayesian inference.
5.5. Elements of Bayesian inference via MCMC (Markov Chain Monte Carlo).
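
As a hedged illustration of item 5.3, the sketch below estimates the standard error of a statistic with the nonparametric bootstrap, resampling the data with replacement. It assumes NumPy; the function name `bootstrap_se`, the exponential sample, and the number of resamples are hypothetical choices for this example and are not taken from the course.

```python
import numpy as np


def bootstrap_se(data, statistic, n_resamples=2000, rng=None):
    """Nonparametric bootstrap estimate of the standard error of `statistic`."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    n = data.size
    # Each row holds the indices of one resample drawn with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    replicates = np.array([statistic(data[row]) for row in idx])
    return replicates.std(ddof=1)


rng = np.random.default_rng(seed=1)            # arbitrary fixed seed
sample = rng.exponential(scale=2.0, size=200)  # hypothetical dataset

se_mean = bootstrap_se(sample, np.mean, rng=rng)
se_median = bootstrap_se(sample, np.median, rng=rng)

# For the mean, the bootstrap value can be checked against the textbook formula.
analytic_se_mean = sample.std(ddof=1) / np.sqrt(sample.size)
print(f"bootstrap SE of mean:   {se_mean:.4f}")
print(f"analytic SE of mean:    {analytic_se_mean:.4f}")
print(f"bootstrap SE of median: {se_median:.4f}")
```

The mean is included because its standard error has a closed form to compare against; the median shows a case where the bootstrap is the more convenient route.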

6. Regression models (3 weeks)

6.1. Linear regression and logistic regression.
6.2. Prediction and classification.

Grading

Course evaluation consists of homework assignments (20%) submitted via the Moodle site, two term exams (25% each), and one final exam (30%). Homework will be assigned approximately once every 1–2 weeks.

Support Sessions

1.5 hours per week with a teaching assistant.

References

Baron, M. (2014). Probability and statistics for computer scientists (2nd ed.). CRC Press.
Cook, D. & Swayne, D. F. (2007). Interactive and Dynamic Graphics for Data Analysis With R and GGobi (1st ed.). Springer.
DeGroot, M. & Schervish, M. (2012). Probability and Statistics. Addison-Wesley.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. New York: Springer.