Tutorial Course

COMP 2208 · Machine Learning: Dimensionality Reduction and Model Selection

Led by Pearsonian Statistics Simulacrum

5 modules · Computing · Updated 1 week ago

PCA, LDA, Kernel PCA, K-fold cross-validation, and XGBoost: the toolkit for dimensionality reduction and honest model evaluation. Based on the work of Karl Pearson.

Module 1

Principal Component Analysis

The question
Find the direction of maximum variance in a high-dimensional point cloud. Project onto it. Discard the rest. Why does retaining variance preserve information — and how many components should you keep?

Outcome
The student can implement PCA in a pipeline, choose k by explained variance, and visualise the projection.
Sub-units
1.1 PCA on the Wine Dataset
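
A minimal sketch of what sub-unit 1.1 might look like, assuming scikit-learn and its bundled Wine dataset. The 95% cumulative-variance threshold and the variable names are illustrative assumptions, not values given by the course.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# PCA is scale-sensitive, so standardise first; fit all components
# to inspect how variance is distributed across them.
full = make_pipeline(StandardScaler(), PCA()).fit(X)
ratios = full.named_steps["pca"].explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest k whose cumulative explained variance reaches the threshold.
threshold = 0.95  # assumed cut-off, chosen only for illustration
k = int(np.searchsorted(cumulative, threshold) + 1)

# Refit with k components and project the data.
reducer = make_pipeline(StandardScaler(), PCA(n_components=k)).fit(X)
Z = reducer.transform(X)
print("components kept:", k, "projected shape:", Z.shape)
```

To visualise the projection, the first two columns of `Z` can be scatter-plotted and coloured by `y`; a plot of `cumulative` against component index shows where the curve flattens.
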
Module 2

Linear Discriminant Analysis

The question
PCA ignores class labels. LDA uses them: it seeks projections that maximise between-class variance relative to within-class variance (Fisher's criterion). On a classification task, which produces better separation, and when does LDA fail?

Outcome
The student can implement LDA, explain Fisher's discriminant criterion, and compare to PCA.
Sub-units
2.1 LDA vs PCA
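
One way to make the LDA vs PCA comparison concrete, sketched under assumptions: scikit-learn, the Wine dataset, a two-dimensional projection, and a k-nearest-neighbours classifier as the downstream model. The helper `score_with` is hypothetical, and none of these choices come from the course itself.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)


def score_with(reducer):
    """Mean 5-fold accuracy of a classifier trained on the reduced features.

    The reducer sits inside the pipeline, so it is refit on each training
    fold and never sees the held-out labels (this matters for LDA).
    """
    pipe = make_pipeline(StandardScaler(), reducer, KNeighborsClassifier(n_neighbors=5))
    return cross_val_score(pipe, X, y, cv=5).mean()


# With 3 classes, LDA can produce at most 3 - 1 = 2 discriminant axes.
pca_acc = score_with(PCA(n_components=2))
lda_acc = score_with(LinearDiscriminantAnalysis(n_components=2))
print(f"PCA -> kNN accuracy: {pca_acc:.3f}")
print(f"LDA -> kNN accuracy: {lda_acc:.3f}")
```

The cap of (number of classes - 1) components is one practical limit of LDA; others worth probing include classes whose means coincide and classes with strongly unequal covariances.
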
Module 3

Kernel PCA

The question
Linear PCA fails on data with non-linear structure. The kernel trick implicitly maps the data to a feature space where that structure becomes linear, then applies PCA there. Which kernel should you choose, and why can Kernel PCA not exactly invert the transformation?

Outcome
The student can implement Kernel PCA and identify when each dimensionality reduction method is appropriate.
Sub-units
3.1 Kernel PCA
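
A small sketch of the contrast between linear and kernel PCA, assuming scikit-learn and a synthetic concentric-circles dataset. The RBF kernel, `gamma=10`, and `alpha=0.1` are illustrative assumptions rather than course-specified settings.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA only rotates the data, so the rings stay entangled.
lin = PCA(n_components=2).fit_transform(X)

# RBF kernel PCA; fit_inverse_transform=True additionally learns an
# approximate map back to input space (there is no exact inverse).
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10,
                 fit_inverse_transform=True, alpha=0.1)
rbf = kpca.fit_transform(X)

# A linear classifier on each projection measures linear separability.
lin_acc = LogisticRegression().fit(lin, y).score(lin, y)
rbf_acc = LogisticRegression().fit(rbf, y).score(rbf, y)
print(f"linear PCA: {lin_acc:.2f}  kernel PCA: {rbf_acc:.2f}")

# Reconstruction is a learned approximation, so some error remains.
X_back = kpca.inverse_transform(rbf)
recon_mse = float(np.mean((X - X_back) ** 2))
print(f"reconstruction MSE: {recon_mse:.4f}")
```

The non-zero reconstruction error illustrates the pre-image problem: a point in the kernel feature space need not correspond to any point in the input space, so the inverse can only be estimated.
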
Module 4

K-Fold Cross-Validation and Grid Search

The question
A single train/test split's accuracy estimate depends on which observations fell in the test set. K-fold CV uses every observation as a test case once. GridSearchCV searches hyperparameters using CV. What does the standard deviation of CV scores tell you — and why must the final test set be used only once?

Outcome
The student can implement K-fold CV, interpret mean and standard deviation, and apply GridSearchCV.
Sub-units
4.1 Cross-Validation and Grid Search
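
The ideas in this module can be sketched as follows, assuming scikit-learn, the Wine dataset, and an SVM classifier. The parameter grid, fold count, and split size are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Hold out a final test set before any tuning takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each training observation lands in exactly one validation fold.
counts = np.zeros(len(X_train), dtype=int)
for _, val_idx in cv.split(X_train, y_train):
    counts[val_idx] += 1

# The mean estimates performance; the std shows how much that estimate
# moves with the choice of split.
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X_train, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Grid search repeats the same CV for every hyperparameter combination.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(pipe, param_grid, cv=cv).fit(X_train, y_train)

# The test set is touched once, after all choices are fixed.
test_acc = search.score(X_test, y_test)
print(search.best_params_, f"test accuracy: {test_acc:.3f}")
```

Because the scaler is inside the pipeline, it is refit on each training fold, which keeps validation data from leaking into preprocessing.
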
Module 5

XGBoost and Model Selection

The question
XGBoost builds trees sequentially, each new tree fitted to correct the errors of the ensemble so far. On tabular data it is often competitive with or better than random forests and neural networks, though this is not guaranteed for every dataset. Describe the complete model selection workflow, from initial split to final evaluation, and explain what goes wrong if the test set is used for model selection.

Outcome
The student can implement XGBoost and describe the correct model selection workflow.
Sub-units
5.1 Final Essay: The Model Selection Workflow
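
A hedged sketch of the workflow the essay asks about. To avoid an extra dependency, it swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost; the `xgboost` package's `XGBClassifier` offers a scikit-learn-compatible interface and could take its place. The dataset, candidate models, and grids are assumptions for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: set the test set aside before looking at any model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Step 2: candidate models and (deliberately small) hyperparameter grids.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [100], "max_depth": [None, 5]},
    ),
    "gradient_boosting": (  # stand-in for XGBoost
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [50, 100], "learning_rate": [0.1], "max_depth": [2, 3]},
    ),
}

# Step 3: tune and compare using cross-validation on the training set only.
searches = {
    name: GridSearchCV(model, grid, cv=5).fit(X_train, y_train)
    for name, (model, grid) in candidates.items()
}
best_name = max(searches, key=lambda name: searches[name].best_score_)

# Step 4: GridSearchCV has already refit the winner on all training data.
final_model = searches[best_name].best_estimator_

# Step 5: evaluate once on the untouched test set and report that number.
test_acc = final_model.score(X_test, y_test)
print(best_name, f"CV: {searches[best_name].best_score_:.3f}", f"test: {test_acc:.3f}")
```

If the test set were used in step 3 to pick the winner, the reported test accuracy would be the best of several noisy estimates and therefore optimistically biased.
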
