Led by Pearsonian Statistics Simulacrum
PCA, LDA, Kernel PCA, K-fold cross-validation, and XGBoost — the toolkit for dimensionality reduction and honest model evaluation. Based on Karl Pearson.
The question
Find the direction of maximum variance in a high-dimensional point cloud. Project onto it. Discard the rest. Why does retaining variance preserve information — and how many components should you keep?
Outcome
The student can implement PCA in a pipeline, choose k by explained variance, and visualise the projection.
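The outcome above can be sketched with scikit-learn. This is a minimal illustration, not the course's own solution: the wine dataset, the 95% variance threshold, and the variable names are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data: 178 wines, 13 numeric features.
X, y = load_wine(return_X_y=True)

# Standardise first: PCA is driven by variance, so unscaled features dominate.
full = make_pipeline(StandardScaler(), PCA()).fit(X)
cumulative = np.cumsum(full[-1].explained_variance_ratio_)

# Smallest k whose components explain at least 95% of the variance
# (the 0.95 threshold is an assumption, not a rule).
k = int(np.searchsorted(cumulative, 0.95) + 1)

# Two components are enough for a 2-D scatter plot of the projection.
Z = make_pipeline(StandardScaler(), PCA(n_components=2)).fit_transform(X)

print("components needed for 95% variance:", k)
print("projection shape:", Z.shape)
```

To visualise the projection, plot `Z[:, 0]` against `Z[:, 1]` and colour the points by `y`. Passing a float such as `0.95` as `n_components` asks `PCA` to pick k by the same explained-variance rule.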
The question
PCA ignores class labels. LDA uses them: it seeks projections that maximise between-class variance relative to within-class variance (Fisher's criterion). On a classification task, which produces better separation, and when does LDA fail?
Outcome
The student can implement LDA, explain Fisher's discriminant criterion, and compare to PCA.
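One way to make the PCA versus LDA comparison concrete is to project onto two dimensions with each method and score the same classifier on the result. The sketch below assumes the wine dataset, a logistic regression as the downstream classifier, and a hypothetical helper `projected_accuracy`; none of these come from the course itself.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data: three wine cultivars, 13 features.
X, y = load_wine(return_X_y=True)


def projected_accuracy(reducer):
    """Mean 5-fold accuracy of a classifier trained on the 2-D projection.

    Hypothetical helper for this example. The reducer sits inside the
    pipeline, so it is refit on each training fold only.
    """
    model = make_pipeline(StandardScaler(), reducer, LogisticRegression(max_iter=1000))
    return cross_val_score(model, X, y, cv=5).mean()


# PCA never sees y; LDA receives y through the pipeline's fit call.
pca_acc = projected_accuracy(PCA(n_components=2))
# LDA yields at most (n_classes - 1) components: 2 for three classes.
lda_acc = projected_accuracy(LinearDiscriminantAnalysis(n_components=2))

print(f"PCA -> 2D accuracy: {pca_acc:.3f}")
print(f"LDA -> 2D accuracy: {lda_acc:.3f}")
```

The cap of `n_classes - 1` components is one place LDA runs out of room: with two classes it can only ever produce a single discriminant axis.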
The question
Linear PCA fails on data with non-linear structure. The kernel trick implicitly maps the data to a higher-dimensional feature space where that structure may become linear, then applies PCA there. Which kernel should you choose, and why can Kernel PCA not exactly invert the transformation?
Outcome
The student can implement Kernel PCA and identify when each dimensionality reduction method is appropriate.
The question
A single train/test split gives an accuracy estimate that depends on which observations happened to fall in the test set. K-fold CV uses every observation as a validation case exactly once. GridSearchCV searches hyperparameters using CV. What does the standard deviation of the CV scores tell you, and why must the final test set be used only once?
Outcome
The student can implement K-fold CV, interpret mean and standard deviation, and apply GridSearchCV.
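The sketch below shows K-fold CV and GridSearchCV side by side, with a test set that is touched only once at the end. The breast cancer dataset, the logistic regression model, the 10 folds, and the grid of `C` values are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Set the test set aside before any modelling decisions are made.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One score per fold: the mean estimates performance, the standard
# deviation shows how much that estimate moves from fold to fold.
scores = cross_val_score(model, X_train, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Hyperparameter search runs CV on the training data only.
grid = GridSearchCV(model, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=cv)
grid.fit(X_train, y_train)
print("best C:", grid.best_params_["logisticregression__C"])

# Single, final use of the test set.
test_acc = grid.score(X_test, y_test)
print(f"test accuracy: {test_acc:.3f}")
```

Because the scaler lives inside the pipeline, it is refit on each training fold, so no statistics from a validation fold leak into training.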
The question
XGBoost builds trees sequentially, fitting each new tree to the errors of the ensemble so far. On tabular data it is often among the strongest performers, frequently matching or beating random forests and neural networks, though not always. Describe the complete model selection workflow, from initial split to final evaluation, and explain what goes wrong if the test set is used for model selection.
Outcome
The student can implement XGBoost and describe the correct model selection workflow.
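A minimal sketch of the workflow, from initial split to a single final evaluation. It assumes the `xgboost` package and its `XGBClassifier` wrapper; if that import fails, the code falls back to scikit-learn's `GradientBoostingClassifier`, which is a different implementation of gradient boosting and is used here only so the example still runs. The dataset and the parameter grid are likewise assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split

try:
    from xgboost import XGBClassifier as Booster
except ImportError:
    # Swapped-in stand-in, not XGBoost: same idea, different library.
    from sklearn.ensemble import GradientBoostingClassifier as Booster

X, y = load_breast_cancer(return_X_y=True)

# Step 1: hold out a test set and do not look at it again until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Step 2: choose hyperparameters by cross-validation on the training set only.
param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(Booster(n_estimators=100, random_state=0), param_grid, cv=5)

# Step 3: fit. GridSearchCV refits the best configuration on all training data.
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")

# Step 4: one final evaluation on the untouched test set.
test_acc = search.score(X_test, y_test)
print(f"test accuracy: {test_acc:.3f}")
```

If `X_test` were used in step 2 to pick the hyperparameters, the number reported in step 4 would be optimistically biased, because the model would have been tuned to that particular sample.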