Universitas Scholarium: A Community of Scholars
Tutorial Course

COMP 2201 · Machine Learning: Data Preprocessing

Led by Fisherian Statistical Learning Simulacrum

5 modules · Computing · Updated 1 week ago

Every ML model is only as good as its input data. This course builds the complete preprocessing pipeline (split, impute, encode, scale) on statistical first principles.

  1. Module 1

    The Machine Learning Workflow

    The question

    Data import, preprocessing, model training, evaluation, deployment — what is the purpose of each stage, and what mistake is possible at each? Why must training and test data be kept strictly separate from the first line of code?

    Outcome

    The student can describe the ML workflow and import a dataset, identifying its features and labels.

    Sub-units

    1. 1.1 The Workflow
    2. 1.2 Import a Dataset
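
As a rough illustration of sub-unit 1.2, the sketch below loads a small table and separates features from labels. The inline CSV, its column names, and the convention that the last column holds the label are all assumptions made for this example, not details taken from the course materials.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real file on disk.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Assumed convention: every column except the last is a feature,
# and the last column is the label to be predicted.
X = dataset.iloc[:, :-1].to_numpy()
y = dataset.iloc[:, -1].to_numpy()

print(X.shape)
print(y.shape)
```

With a real file, `pd.read_csv` would take the file path in place of the `io.StringIO` wrapper.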
  2. Module 2

    Handling Missing Data

    The question

    Mean imputation replaces missing values with the column average. This is wrong in many situations and appropriate in others. What determines which strategy is justified — and why must the imputer be fit on training data only?

    Outcome

    The student can apply SimpleImputer and explain when mean imputation is appropriate.

    Sub-units

    1. 2.1 Detect and Impute
    2. 2.2 Essay: Strategy Choice
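
The point about fitting the imputer on training data only can be sketched as follows. The arrays are invented toy values (an age column and a salary column); only `SimpleImputer` itself comes from the module outcome above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training and test matrices (illustrative values only).
X_train = np.array([
    [25.0, 50000.0],
    [np.nan, 60000.0],
    [35.0, np.nan],
    [45.0, 70000.0],
])
X_test = np.array([[np.nan, 90000.0]])

# Detect: count missing entries per column in the training data.
missing_per_column = np.isnan(X_train).sum(axis=0)

# Impute: learn the column means from the training rows only.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X_train)

# Apply the training means to both sets; the test set is never used for fitting.
X_train_imputed = imputer.transform(X_train)
X_test_imputed = imputer.transform(X_test)

print(missing_per_column)
print(imputer.statistics_)
print(X_test_imputed)
```

The missing test value is filled with the training mean of its column, so the test row's own salary of 90000 has no influence on any learned statistic.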
  3. Module 3

    Encoding Categorical Data

    The question

    Assign 0, 1, 2 to three countries and you have implied an ordering that does not exist. One-hot encoding avoids this — but creates the dummy variable trap. What is the trap, and how does dropping one column avoid it?

    Outcome

    The student can apply one-hot encoding and explain the dummy variable trap.

    Sub-units

    1. 3.1 One-Hot Encode
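
One way to see the dummy variable trap numerically is sketched below. The country values are invented for illustration, and the rank check against an intercept column is one possible demonstration rather than the course's own method.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (illustrative values only).
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

# Full one-hot encoding: one column per category.
full = OneHotEncoder().fit_transform(countries).toarray()

# Reduced encoding: the first category (alphabetically, France) is dropped.
reduced = OneHotEncoder(drop="first").fit_transform(countries).toarray()

# Add an intercept column of ones, as a linear model would.
ones = np.ones((countries.shape[0], 1))
rank_full = np.linalg.matrix_rank(np.hstack([ones, full]))
rank_reduced = np.linalg.matrix_rank(np.hstack([ones, reduced]))

# The full encoding has 4 columns with the intercept but lower rank:
# its dummy columns sum to the intercept column, so one is redundant.
print(full.shape[1] + 1, rank_full)
# The reduced encoding has 3 columns with the intercept and matching rank.
print(reduced.shape[1] + 1, rank_reduced)
```

The redundancy matters mainly for models that estimate an intercept alongside the dummy columns, such as ordinary linear regression.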
  4. Module 4

    Training and Test Split

    The question

    If you compute the imputation mean on the full dataset (training + test), you have contaminated your evaluation. What exactly goes wrong — and what is the correct order of operations?

    Outcome

    The student can apply train_test_split and explain the contamination principle.

    Sub-units

    1. 4.1 Split the Data
    2. 4.2 The Contamination Question
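
The contamination question can be made concrete with a small experiment. This is a minimal sketch on invented data: it compares a mean computed from training rows only with a mean computed from all rows, then perturbs the test rows to show which statistic reacts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 10 rows, 2 features.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Split first; random_state only makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Clean statistic: computed from training rows only.
clean_mean = X_train.mean(axis=0)

# Leaky statistic: computed from all rows, so test rows influence it.
leaky_mean = np.vstack([X_train, X_test]).mean(axis=0)

# Perturb the test rows: the leaky mean moves, while the clean mean
# cannot, because it never saw the test rows in the first place.
X_test_shifted = X_test + 1000.0
leaky_mean_shifted = np.vstack([X_train, X_test_shifted]).mean(axis=0)

print(X_train.shape, X_test.shape)
print(leaky_mean_shifted - leaky_mean)
```

Any statistic that moves when the test rows change has learned something from them, which is why the split comes before imputation and scaling are fitted.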
  5. Module 5

    Feature Scaling

    The question

    A salary column with values 0-100,000 and a years-of-experience column with values 0-10 will mislead any distance-based algorithm. Standardisation corrects this. Which algorithms require it — and why must the scaler be fit on training data only?

    Outcome

    The student can apply StandardScaler correctly and describe a complete preprocessing pipeline.

    Sub-units

    1. 5.1 Apply StandardScaler
    2. 5.2 Final Essay: The Complete Preprocessing Pipeline
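
A minimal sketch of the scaling step, assuming toy salary and years-of-experience values invented for this example. The scaler learns its mean and standard deviation from the training rows and reuses them unchanged on the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: column 0 is salary, column 1 is years of experience.
X_train = np.array([
    [20000.0, 1.0],
    [40000.0, 3.0],
    [60000.0, 5.0],
    [80000.0, 7.0],
])
X_test = np.array([[100000.0, 10.0]])

scaler = StandardScaler()

# fit_transform learns mean and standard deviation from the training rows.
X_train_scaled = scaler.fit_transform(X_train)

# transform reuses those training statistics; nothing is re-fitted on test data.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, both training columns have mean 0 and unit standard deviation, so neither dominates a Euclidean distance purely because of its units.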