Tutorial Course

COMP 2201 · Machine Learning: Data Preprocessing

Led by Fisherian Statistical Learning Simulacrum

5 modules · Computing · Updated 1 week ago

Every ML model is only as good as its input data. This course builds the complete preprocessing pipeline (split, impute, encode, scale) on statistical first principles.

Module 1

The Machine Learning Workflow

The question
Data import, preprocessing, model training, evaluation, deployment — what is the purpose of each stage, and what mistake is possible at each? Why must training and test data be kept strictly separate from the first line of code?

Outcome
The student can describe the ML workflow, import a dataset, and identify its features and labels.
Sub-units
1.1 The Workflow
1.2 Import a Dataset
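
As a minimal sketch of what "identifying features and labels" can look like in code, the example below reads a small CSV and separates the label column from the feature columns. The data, the column names (Country, Age, Salary, Purchased), and the helper load_features_and_labels are hypothetical and chosen only for illustration; a real course dataset would be read from a file instead of an in-memory string.

```python
import csv
import io

# Hypothetical toy data standing in for a real CSV file on disk.
RAW_CSV = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
"""


def load_features_and_labels(text, label_column):
    """Split CSV text into feature names, a feature matrix X and a label vector y."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # Every column except the label column is treated as a feature.
    feature_names = [name for name in rows[0] if name != label_column]
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[label_column] for row in rows]
    return feature_names, X, y


feature_names, X, y = load_features_and_labels(RAW_CSV, "Purchased")
print(feature_names)  # the feature columns
print(X[0], y[0])     # first observation and its label
```

The values are still strings at this point; converting, imputing, and encoding them belongs to the preprocessing stages that follow.
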
Module 2

Handling Missing Data

The question
Mean imputation replaces missing values with the column average. This is wrong in many situations and appropriate in others. What determines which strategy is justified — and why must the imputer be fit on training data only?

Outcome
The student can apply SimpleImputer and explain when mean imputation is appropriate.
Sub-units
2.1 Detect and Impute
2.2 Essay: Strategy Choice
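
The sketch below illustrates the "fit on training data only" rule with scikit-learn's SimpleImputer. The arrays are made-up values (age and salary columns are an assumption, not course data). The imputer learns the column means from the training rows and then reuses those same means to fill gaps in the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data: columns are age and salary, with gaps as NaN.
X_train = np.array([
    [25.0, 50000.0],
    [np.nan, 60000.0],
    [35.0, np.nan],
    [45.0, 70000.0],
])
# Hypothetical test row with a missing age.
X_test = np.array([[np.nan, 90000.0]])

# Detect: count missing values per column before imputing.
print(np.isnan(X_train).sum(axis=0))

# Impute: learn the column means from the training rows only.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X_train)

# Apply the training means to both sets; the test set is never used for fitting.
X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)  # per-column means learned from X_train
print(X_test_filled)
```

SimpleImputer also accepts strategy="median" and strategy="most_frequent"; which one is justified depends on the column's distribution and on why the values are missing, which is the subject of the essay in 2.2.
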
Module 3

Encoding Categorical Data

The question
Assign 0, 1, 2 to three countries and you have implied an ordering that does not exist. One-hot encoding avoids this — but creates the dummy variable trap. What is the trap, and how does dropping one column avoid it?

Outcome
The student can apply one-hot encoding and explain the dummy variable trap.
Sub-units
3.1 One-Hot Encode
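
One way to see the dummy variable trap numerically, using scikit-learn's OneHotEncoder on a hypothetical country column: with all three dummy columns, each row sums to 1, so the dummies duplicate the intercept column of a linear model and the design matrix loses rank. Dropping one category removes that redundancy. This is a sketch under those assumptions, not the course's reference solution.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature with three countries.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

# Full one-hot encoding: one column per category.
full = OneHotEncoder().fit_transform(countries).toarray()

# Encoding with the first category dropped (it becomes the baseline).
reduced = OneHotEncoder(drop="first").fit_transform(countries).toarray()

# Add an intercept column of ones, as a linear model would.
ones = np.ones((countries.shape[0], 1))
design_full = np.hstack([ones, full])
design_reduced = np.hstack([ones, reduced])

rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)

# 4 columns but lower rank: the dummy columns sum to the intercept column.
print(design_full.shape[1], rank_full)
# 3 columns and full column rank once one dummy is dropped.
print(design_reduced.shape[1], rank_reduced)
```

The dropped category is not lost: a row of all zeros in the remaining dummy columns identifies it.
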
Module 4

Training and Test Split

The question
If you compute the imputation mean on the full dataset (training + test), you have contaminated your evaluation. What exactly goes wrong — and what is the correct order of operations?

Outcome
The student can apply train_test_split and explain the contamination principle.
Sub-units
4.1 Split the Data
4.2 The Contamination Question
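
The contamination principle can be sketched as follows, using train_test_split on synthetic data (the random values, the 80/20 split, and the shift of 100 are arbitrary assumptions for illustration). A statistic computed on the full dataset moves when only the test rows change, which shows that it carries information from the test set; the same statistic computed on the training rows does not move.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic single-feature dataset with alternating binary labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(10, 1))
y = np.arange(10) % 2

# Split first, before any statistic is computed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

leaky_mean = X.mean()        # computed on training and test rows together
clean_mean = X_train.mean()  # computed on training rows only

# Alter only the test rows and recompute the full-data mean.
X_shifted = np.vstack([X_train, X_test + 100.0])
leaky_mean_after = X_shifted.mean()

print(leaky_mean, leaky_mean_after)  # the full-data mean depends on the test rows
print(clean_mean)                    # the training mean is unaffected by them
```

The order of operations this suggests is: split, then fit every preprocessing step on the training portion, then apply the fitted steps to the test portion.
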
Module 5

Feature Scaling

The question
A salary column with values 0-100,000 and a years-of-experience column with values 0-10 will mislead any distance-based algorithm. Standardisation corrects this. Which algorithms require it — and why must the scaler be fit on training data only?

Outcome
The student can apply StandardScaler correctly and describe a complete preprocessing pipeline.
Sub-units
5.1 Apply StandardScaler
5.2 Final Essay: The Complete Preprocessing Pipeline
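
A minimal sketch of standardisation with scikit-learn's StandardScaler, assuming a hypothetical two-column dataset of years of experience and salary. It first shows how the raw Euclidean distance between two people is driven almost entirely by salary, then fits the scaler on the training rows and applies the same training mean and standard deviation to the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: columns are years of experience and salary.
X_train = np.array([
    [1.0, 30000.0],
    [3.0, 50000.0],
    [5.0, 70000.0],
    [7.0, 90000.0],
])
X_test = np.array([[10.0, 100000.0]])

# Unscaled Euclidean distance between the first two training rows:
# the salary difference (20000) swamps the experience difference (2).
raw_distance = np.linalg.norm(X_train[0] - X_train[1])
print(raw_distance)

# Fit on training data only, then transform both sets with those parameters.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)    # per-column means learned from X_train
print(scaler.scale_)   # per-column standard deviations learned from X_train
print(X_test_scaled)   # test row expressed in training-set units

# After scaling, both columns contribute comparably to the distance.
scaled_distance = np.linalg.norm(X_train_scaled[0] - X_train_scaled[1])
print(scaled_distance)
```

Note that the scaled test row is not centred at zero: it is measured against the training distribution, which is the intended behaviour.
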
