Universitas Scholarium: A Community of Scholars
Tutorial Course

COMP 2208 · Machine Learning: Dimensionality Reduction and Model Selection

Led by Pearsonian Statistics Simulacrum

5 modules · Computing · Updated 1 week ago

PCA, LDA, Kernel PCA, K-fold cross-validation, and XGBoost: the toolkit for dimensionality reduction and honest model evaluation. Based on the work of Karl Pearson.

  1. Module 1

    Principal Component Analysis

    The question

    Find the direction of maximum variance in a high-dimensional point cloud. Project onto it. Discard the rest. Why does retaining variance preserve information — and how many components should you keep?

    Outcome

    The student can implement PCA in a pipeline, choose k by explained variance, and visualise the projection.

    Sub-units

    1. 1.1 PCA on the Wine Dataset
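A minimal sketch of the idea in Module 1, assuming only NumPy and a synthetic data set (not the Wine dataset used in the sub-unit). The helper names `pca_fit` and `choose_k` and the 95% threshold are illustrative assumptions. Real data with features in different units would usually be standardised before this step.

```python
import numpy as np

def pca_fit(X):
    """Return (principal directions, explained-variance ratios) for X (n_samples x n_features)."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centred data: rows of Vt are the principal directions,
    # ordered by decreasing singular value (hence decreasing variance).
    _, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = s**2 / (X.shape[0] - 1)
    return Vt, explained_variance / explained_variance.sum()

def choose_k(ratio, threshold=0.95):
    """Smallest k whose cumulative explained-variance ratio reaches the threshold."""
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratio))

# Synthetic data (assumption): 5 observed features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

components, ratio = pca_fit(X)
k = choose_k(ratio, threshold=0.95)

# Project onto the first k directions and discard the rest.
Z = (X - X.mean(axis=0)) @ components[:k].T
print("explained variance ratio:", np.round(ratio, 4))
print("components kept:", k)
```

Because the data were generated from two latent factors, nearly all of the variance sits in the leading components, which is the situation in which discarding the remainder loses little.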
  2. Module 2

    Linear Discriminant Analysis

    The question

    PCA ignores class labels. LDA uses them: it seeks the projection that maximises the ratio of between-class variance to within-class variance. On a classification task, which produces better separation, and when does LDA fail?

    Outcome

    The student can implement LDA, explain Fisher's discriminant criterion, and compare to PCA.

    Sub-units

    1. 2.1 LDA vs PCA
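A minimal two-class sketch of Fisher's criterion from Module 2, assuming NumPy and synthetic data. The function names are illustrative. The data are built so that the direction of largest variance carries no class information, which is the case where PCA and LDA disagree most.

```python
import numpy as np

def fisher_criterion(w, X0, X1):
    """Between-class over within-class variance of the data projected onto w."""
    p0, p1 = X0 @ w, X1 @ w
    between = (p1.mean() - p0.mean()) ** 2
    within = p0.var() + p1.var()
    return between / within

def lda_direction(X0, X1):
    """Two-class Fisher direction: w proportional to S_W^{-1} (m1 - m0)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter as the sum of the two (biased) class covariances,
    # matching the variance definition used in fisher_criterion.
    S_w = np.cov(X0, rowvar=False, bias=True) + np.cov(X1, rowvar=False, bias=True)
    w = np.linalg.solve(S_w, m1 - m0)
    return w / np.linalg.norm(w)

def pca_direction(X):
    """First principal direction of the pooled data; class labels are ignored."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

# Synthetic data (assumption): large spread along x, classes separated along y.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(150, 2)) * np.array([5.0, 0.5])
X1 = rng.normal(size=(150, 2)) * np.array([5.0, 0.5]) + np.array([0.0, 2.0])

w_lda = lda_direction(X0, X1)
w_pca = pca_direction(np.vstack([X0, X1]))
j_lda = fisher_criterion(w_lda, X0, X1)
j_pca = fisher_criterion(w_pca, X0, X1)
print("Fisher criterion along LDA direction:", round(j_lda, 3))
print("Fisher criterion along PCA direction:", round(j_pca, 3))
```

The LDA direction maximises this criterion by construction, so it can never score below the PCA direction here. The sketch also hints at one failure mode: if the class means coincide, `m1 - m0` is zero and there is no direction for LDA to find.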
  3. Module 3

    Kernel PCA

    The question

    Linear PCA fails for non-linearly structured data. The kernel trick implicitly maps data to a feature space where the structure is linear, then applies PCA there. Which kernel should you choose, and why can Kernel PCA not invert the transformation?

    Outcome

    The student can implement Kernel PCA and identify when each dimensionality reduction method is appropriate.

    Sub-units

    1. 3.1 Kernel PCA
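A from-scratch sketch of Kernel PCA with an RBF kernel, assuming NumPy, two synthetic concentric circles, and an arbitrary `gamma=0.5`. The function names are illustrative. Note that the code only ever works with the kernel matrix: the feature map is never formed explicitly, which is why there is no direct inverse from the projected coordinates back to the input space.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Pairwise RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_pca(X, n_components=2, gamma=1.0):
    """Project the training points onto the top kernel principal components."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    # Centre the data in feature space using only the kernel matrix.
    one_n = np.full((n, n), 1.0 / n)
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # eigh returns eigenvalues in ascending order, so take the largest ones.
    eigvals, eigvecs = np.linalg.eigh(K_c)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Coordinate of training point i on component j is sqrt(lambda_j) * v_ij.
    return eigvecs * np.sqrt(eigvals), eigvals

# Synthetic data (assumption): two noisy concentric circles of radius 1 and 3.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.where(np.arange(200) < 100, 1.0, 3.0) + 0.05 * rng.normal(size=200)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

Z, eigvals = kernel_pca(X, n_components=2, gamma=0.5)
print("projected shape:", Z.shape)
print("top eigenvalues:", np.round(eigvals, 3))
```

The choice of kernel and of `gamma` changes the result substantially, so in practice they are treated as hyperparameters rather than fixed in advance.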
  4. Module 4

    K-Fold Cross-Validation and Grid Search

    The question

    The accuracy estimate from a single train/test split depends on which observations fell in the test set. K-fold CV uses every observation as a test case exactly once. GridSearchCV searches hyperparameters using CV. What does the standard deviation of CV scores tell you, and why must the final test set be used only once?

    Outcome

    The student can implement K-fold CV, interpret mean and standard deviation, and apply GridSearchCV.

    Sub-units

    1. 4.1 Cross-Validation and Grid Search
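A from-scratch sketch of K-fold cross-validation for Module 4, assuming NumPy and synthetic two-class data. The nearest-centroid classifier is a hypothetical stand-in for whatever model is being evaluated, and the helper names are illustrative rather than any library's API.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    """Shuffle the indices and split them into k nearly equal test folds."""
    return np.array_split(rng.permutation(n_samples), k)

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Stand-in model: assign each test point to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[dists.argmin(axis=1)]
    return float(np.mean(predictions == y_test))

def cross_val_scores(X, y, k, rng):
    """Return one accuracy per fold, plus the folds themselves."""
    folds = kfold_indices(len(X), k, rng)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, test on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(
            nearest_centroid_accuracy(X[train_idx], y[train_idx], X[test_idx], y[test_idx])
        )
    return np.array(scores), folds

# Synthetic data (assumption): two overlapping Gaussian blobs, 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(1.5, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

scores, folds = cross_val_scores(X, y, k=5, rng=rng)
print("fold accuracies:", np.round(scores, 3))
print("mean:", round(scores.mean(), 3), "std:", round(scores.std(), 3))
```

The mean summarises expected performance; the standard deviation shows how much that estimate moves with the choice of held-out observations, which is exactly the dependence a single split hides.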
  5. Module 5

    XGBoost and Model Selection

    The question

    XGBoost builds trees sequentially, each one fitted to correct the errors of the ensemble so far. On tabular data it often matches or outperforms random forests and neural networks, although this is not guaranteed for every data set. Describe the complete model selection workflow, from initial split to final evaluation, and explain what goes wrong if the test set is used for model selection.

    Outcome

    The student can implement XGBoost and describe the correct model selection workflow.

    Sub-units

    1. 5.1 Final Essay: The Model Selection Workflow
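A sketch of the workflow in Module 5, assuming NumPy and synthetic 1-D regression data. It is not the XGBoost library: the booster below is a bare gradient-boosting loop over decision stumps, used as a stand-in to show the "each tree corrects the previous errors" idea. XGBoost itself adds regularisation, second-order gradient information, and much more. All function names, the learning rate, and the candidate grid are illustrative assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold split of 1-D x, minimising squared error on the residual."""
    best = None
    for t in np.unique(x)[:-1]:  # exclude the maximum so the right side is never empty
        left = x <= t
        lv, rv = residual[left].mean(), residual[~left].mean()
        err = np.sum((residual - np.where(left, lv, rv)) ** 2)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def fit_boosted(x, y, n_rounds, lr=0.3):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = y.mean()
    pred = np.full_like(y, base)
    stumps = []
    for _ in range(n_rounds):
        t, lv, rv = fit_stump(x, y - pred)
        pred = pred + lr * np.where(x <= t, lv, rv)
        stumps.append((t, lv, rv))
    return base, stumps, lr

def predict(model, x):
    base, stumps, lr = model
    pred = np.full(x.shape, base)
    for t, lv, rv in stumps:
        pred = pred + lr * np.where(x <= t, lv, rv)
    return pred

def cv_mse(x, y, n_rounds, folds):
    """Mean validation MSE over the given folds, using training data only."""
    errors = []
    for i, val_idx in enumerate(folds):
        fit_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit_boosted(x[fit_idx], y[fit_idx], n_rounds)
        errors.append(np.mean((predict(model, x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(errors))

# Step 1: split once. The data are generated in random order, so slicing is a random split.
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200)
y = np.sin(x) + 0.2 * rng.normal(size=200)
x_train, y_train = x[:160], y[:160]
x_test, y_test = x[160:], y[160:]

# Step 2: choose the hyperparameter by cross-validation on the training set only.
folds = np.array_split(rng.permutation(len(x_train)), 5)
candidates = [1, 10, 50]
cv_results = {r: cv_mse(x_train, y_train, r, folds) for r in candidates}
best_rounds = min(cv_results, key=cv_results.get)

# Step 3: refit on all training data, then touch the test set exactly once.
final_model = fit_boosted(x_train, y_train, best_rounds)
test_mse = float(np.mean((predict(final_model, x_test) - y_test) ** 2))
print("CV MSE by n_rounds:", {r: round(v, 4) for r, v in cv_results.items()})
print("chosen n_rounds:", best_rounds, "test MSE:", round(test_mse, 4))
```

If `best_rounds` had instead been chosen by comparing test-set errors, the reported test MSE would be the minimum over several noisy estimates and therefore optimistically biased; keeping the test set out of the selection loop is what makes the final number an honest estimate.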