Led by Fisherian Statistical Learning Simulacrum
Every ML model is only as good as its input data. This course covers the complete preprocessing pipeline (split, impute, encode, scale), built on statistical first principles.
The question
Data import, preprocessing, model training, evaluation, deployment — what is the purpose of each stage, and what mistake is possible at each? Why must training and test data be kept strictly separate from the first line of code?
Outcome
The student can describe the ML workflow and import a dataset, identifying its features and labels.
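As a minimal sketch of the import step, the snippet below reads a small table and separates features from labels. The CSV contents and the column names (country, age, salary, purchased) are hypothetical and chosen only for illustration; a real dataset would be loaded from a file path instead of an in-memory string.

```python
import io

import pandas as pd

# Hypothetical CSV contents (an assumption for this sketch).
# In practice a file path would be passed to read_csv instead.
CSV_TEXT = """country,age,salary,purchased
France,44,72000,0
Spain,27,48000,1
Germany,30,54000,0
Spain,38,61000,0
Germany,40,,1
France,35,58000,1
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Features: every column except the one we want to predict.
X = df.drop(columns="purchased")

# Labels: the target column.
y = df["purchased"]

print(X.shape)                        # (6, 3)
print(y.shape)                        # (6,)
print(int(X["salary"].isna().sum()))  # 1 missing salary, read as NaN
```

Nothing is imputed, encoded, or scaled at this point. Those steps estimate quantities from the data, so they belong after the train/test split.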
The question
Mean imputation replaces missing values with the column average. This is wrong in many situations and appropriate in others. What determines which strategy is justified — and why must the imputer be fit on training data only?
Outcome
The student can apply SimpleImputer and explain when mean imputation is appropriate.
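The following sketch shows one way to apply SimpleImputer so that the column means come from the training rows only. The two small arrays (columns: age, salary) are made-up values used as an assumption for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrices; columns are (age, salary).
X_train = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, np.nan],
    [np.nan, 60000.0],
])
X_test = np.array([
    [np.nan, 52000.0],
    [50.0, np.nan],
])

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column means from the training rows only.
X_train_imp = imputer.fit_transform(X_train)

# transform reuses those training means; the test rows never
# contribute to the estimate.
X_test_imp = imputer.transform(X_test)

# Training means: age = 101 / 3, salary = 60000.
print(imputer.statistics_)
print(X_test_imp)
```

The mean is sensitive to skew and outliers, so for a heavily skewed column `strategy="median"` is a common alternative. Whichever statistic is chosen, it is estimated from the training data and then applied unchanged to the test data.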
The question
Assign 0, 1, 2 to three countries and you have implied an ordering that does not exist. One-hot encoding avoids this — but creates the dummy variable trap. What is the trap, and how does dropping one column avoid it?
Outcome
The student can apply one-hot encoding and explain the dummy variable trap.
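The trap can be made concrete with a small sketch. With one indicator column per category, the indicators in every row sum to 1, which duplicates the intercept column of a linear model and leaves the design matrix rank deficient. The country values below are hypothetical, and the intercept column is added by hand purely to show the effect.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three countries.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"], ["France"]])
intercept = np.ones((len(countries), 1))

# Full one-hot encoding: one indicator column per category.
full = OneHotEncoder().fit_transform(countries).toarray()
design_full = np.hstack([intercept, full])
rank_full = np.linalg.matrix_rank(design_full)

# Dropping the first category removes the redundant column.
reduced = OneHotEncoder(drop="first").fit_transform(countries).toarray()
design_reduced = np.hstack([intercept, reduced])
rank_reduced = np.linalg.matrix_rank(design_reduced)

print(full.sum(axis=1))                   # every row sums to 1
print(design_full.shape[1], rank_full)    # 4 columns, rank 3
print(design_reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

After dropping one column, the omitted category becomes the baseline: a row with all indicators at zero belongs to it, so no information is lost.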
The question
If you compute the imputation mean on the full dataset (training + test), you have contaminated your evaluation. What exactly goes wrong — and what is the correct order of operations?
Outcome
The student can apply train_test_split and explain the contamination principle.
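A minimal sketch of the order of operations, on synthetic data generated only for this example: split first, then estimate the imputation mean from the training rows. For comparison it also computes the mean over all rows, which is the contaminated version because the held-out rows influence the estimate.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): 10 rows, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(10, 2))
X[[1, 4, 7], 0] = np.nan  # introduce some missing values
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Step 1: split before estimating anything from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: fit the imputer on the training rows only.
imputer = SimpleImputer(strategy="mean").fit(X_train)
clean_mean = imputer.statistics_

# Step 3: apply the same fitted imputer to both sets.
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)

# Contaminated alternative: a mean that also uses the test rows.
leaky_mean = np.nanmean(X, axis=0)

print("train-only mean:", clean_mean)
print("full-data mean: ", leaky_mean)
```

The two means usually differ only slightly, but the full-data version means the test set is no longer unseen, so the evaluation is no longer an honest estimate of performance on new data.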
The question
A salary column with values from 0 to 100,000 and a years-of-experience column with values from 0 to 10 will mislead any distance-based algorithm, because the larger-scaled column dominates the distance. Standardisation corrects this. Which algorithms require it, and why must the scaler be fit on training data only?
Outcome
The student can apply StandardScaler correctly and describe a complete preprocessing pipeline.
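One possible way to assemble the four steps (split, impute, encode, scale) is sketched below with scikit-learn's Pipeline and ColumnTransformer. The data frame, its column names, and the unshuffled split are assumptions made to keep the example small and deterministic; they are not taken from the course material.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset (all values are made up for illustration).
df = pd.DataFrame({
    "country": ["France", "Spain", "Germany", "Spain",
                "Germany", "France", "Spain", "France"],
    "years_experience": [1.0, 3.0, 5.0, np.nan, 8.0, 10.0, 2.0, 6.0],
    "salary": [30000.0, 42000.0, np.nan, 58000.0,
               80000.0, 100000.0, 35000.0, 66000.0],
    "purchased": [0, 1, 0, 0, 1, 1, 0, 1],
})
X = df.drop(columns="purchased")
y = df["purchased"]

# Split first. shuffle=False keeps the first 6 rows for training so
# that every country appears in the training set of this tiny example.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

# Numeric columns: impute with the training mean, then standardise.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["years_experience", "salary"]),
        ("cat", OneHotEncoder(drop="first"), ["country"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# All statistics (means, standard deviations, categories) are learned
# from the training rows and only applied to the test rows.
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)

print(X_train_prep.shape)                 # (6, 4)
print(X_test_prep.shape)                  # (2, 4)
print(X_train_prep[:, :2].mean(axis=0))   # approximately 0 for both columns
print(X_train_prep[:, :2].std(axis=0))    # approximately 1 for both columns
```

After standardisation the training columns for years of experience and salary both have mean 0 and unit standard deviation, so neither dominates a Euclidean distance such as the one used by k-nearest neighbours. The test columns will generally not have exactly mean 0 and standard deviation 1, which is expected because they are scaled with the training statistics.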