This course discusses quantitative methods for analyzing "big data", i.e. data sets that have many observations, many variables, or both. First, the course covers flexible or "nonparametric" econometric methods for data with many observations, where "flexible" means that the researcher aims to impose as few behavioral assumptions as possible. These methods are often more accurate than standard approaches such as OLS, which assumes a linear relation between the explanatory variables and the dependent variable that might not hold in reality.
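To make the contrast concrete, here is a minimal sketch in Python (the course itself uses R; the simulated data, bandwidth, and Gaussian kernel choice are illustrative assumptions, not course material). It fits a linear OLS model and a Nadaraya-Watson kernel regression to data with a nonlinear conditional mean and compares their accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: nonlinear conditional mean y = sin(2x) + noise.
n = 500
x = rng.uniform(-2, 2, n)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)

# OLS with the (misspecified) linear model y = a + b*x.
X = np.column_stack([np.ones(n), x])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]
ols_fit = a + b * x

# Nadaraya-Watson kernel regression with a Gaussian kernel.
def kernel_reg(x0, x, y, h):
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)  # kernel weights around x0
    return w @ y / w.sum()

h = 0.2  # bandwidth fixed by hand for this illustration
nw_fit = np.array([kernel_reg(x0, x, y, h) for x0 in x])

# Error relative to the true regression function sin(2x):
mse_ols = np.mean((ols_fit - np.sin(2 * x)) ** 2)
mse_nw = np.mean((nw_fit - np.sin(2 * x)) ** 2)
```

In this setting the flexible estimator tracks the nonlinear mean much more closely than the linear OLS fit, which is the point the course makes more generally.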
Second, the course discusses so-called "machine learning" approaches for data that include many variables, which aim to optimally exploit the vast information these variables provide. Separating relevant from irrelevant information is key in a world of ever-increasing data availability.
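As a hedged sketch of how such variable selection can work, the following Python example implements the Lasso via cyclic coordinate descent on simulated data where only a few of many candidate regressors matter. The data-generating process, penalty level, and iteration count are illustrative assumptions, not part of the course material:

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 observations, 20 candidate regressors, only the first 3 are relevant.
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def soft_threshold(z, t):
    # Soft-thresholding operator used in the Lasso coordinate update.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso (no intercept) via cyclic coordinate descent,
    minimizing (1/(2n))||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]  # partial residual
            beta[j] = soft_threshold(X[:, j] @ r, lam * n) / col_ss[j]
    return beta

beta_hat = lasso_cd(X, y, lam=0.1)
selected = np.flatnonzero(np.abs(beta_hat) > 1e-6)
```

The penalty shrinks small coefficients exactly to zero, so `selected` recovers a sparse set that includes the truly relevant variables; this automatic separation of signal from noise is what the course's shrinkage topic is about.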
The following topics will be covered in the course:
* Flexible (non/semiparametric) vs. parametric statistical (or econometric) models
* Nonparametric regression methods: Kernel regression, series approximation, smoothing splines
* Methods for choosing smoothing and bandwidth parameters
* Testing: nonparametric specification and distribution tests
* Machine learning based on shrinkage and variable selection: Lasso and ridge regression
* Machine learning based on decision trees, bagged trees, and random forests
* Introduction to further machine learners: boosting, support vector machines, neural nets, and ensemble methods
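One of the topics above, data-driven choice of the bandwidth parameter, can be sketched as follows. This Python example uses leave-one-out cross-validation for a Nadaraya-Watson kernel regression; the simulated data and the candidate bandwidth grid are made-up illustrations, not course material:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data from a nonlinear regression function.
n = 300
x = rng.uniform(0, 1, n)
y = np.cos(4 * np.pi * x) + rng.normal(scale=0.4, size=n)

def loocv_score(h, x, y):
    """Leave-one-out CV criterion for the Nadaraya-Watson estimator."""
    # Pairwise Gaussian kernel weights; zero the diagonal so each
    # observation is predicted without using itself.
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    np.fill_diagonal(w, 0.0)
    fit = w @ y / w.sum(axis=1)  # leave-one-out prediction at each x_i
    return np.mean((y - fit) ** 2)

bandwidths = np.array([0.005, 0.02, 0.05, 0.1, 0.3])
scores = [loocv_score(h, x, y) for h in bandwidths]
h_cv = bandwidths[int(np.argmin(scores))]
```

Very small bandwidths overfit the noise and very large ones smooth away the curvature, so cross-validation picks an intermediate value; the course covers this trade-off and related selection rules in more depth.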
The lecture is accompanied by four PC sessions based on the software package "R", in which the methods are applied to empirical data.