Tutorial Course

COMP 3211 · Machine Learning on AWS: Data and Modelling

Led by Demingian Quality Simulacrum

5 modules · Computing · Updated 1 week ago

ML data engineering on AWS: S3, Glue, Data Wrangler, SageMaker training jobs, and Feature Store. Based on AWS documentation and industry MLOps practice.

Module 1

Data Formats and Storage

The question
Parquet is columnar, so a query reads only the columns it needs. JSON is row-oriented text, so every record must be read and parsed in full. RecordIO is the format SageMaker's built-in algorithms are designed to train from. For a 100 GB transaction dataset that feeds monthly model training, ad-hoc queries, and real-time streaming, which format suits which use case?
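To make the columnar advantage concrete, here is a minimal pure-Python sketch. It is not Parquet itself, only the idea behind it: the same toy transactions are held in a row layout (one JSON document per line) and in a column layout (one list per field), and a single field is summed from each. The field names and values are invented for illustration.

```python
import json

# Hypothetical toy transactions; field names are illustrative only.
rows = [
    {"txn_id": i, "amount": 10.0 + i, "merchant": f"m{i % 3}", "country": "GB"}
    for i in range(5)
]

# Row-oriented layout: one JSON document per record (as in JSON Lines).
json_lines = [json.dumps(r) for r in rows]

# Column-oriented layout: one contiguous list per field (the idea behind Parquet).
columns = {name: [r[name] for r in rows] for name in rows[0]}


def total_amount_row_layout(lines):
    """Sum 'amount' by parsing every full record; returns (total, characters read)."""
    chars_read = 0
    total = 0.0
    for line in lines:
        chars_read += len(line)          # the whole record must be read
        total += json.loads(line)["amount"]
    return total, chars_read


def total_amount_column_layout(cols):
    """Sum 'amount' by touching only that column; returns (total, values read)."""
    amounts = cols["amount"]
    return sum(amounts), len(amounts)


row_total, chars_read = total_amount_row_layout(json_lines)
col_total, values_read = total_amount_column_layout(columns)
print(row_total, chars_read)
print(col_total, values_read)
```

Both layouts give the same total, but the row layout reads and parses every character of every record, while the column layout touches only the `amount` values. On a wide table with many unused columns, that difference is what makes columnar formats attractive for ad-hoc queries.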

Outcome
The student can select data formats for ML workloads and explain the advantage of columnar formats.
Sub-units
1.1 Choose a Format

Module 2

ETL with AWS Glue

The question
AWS Glue handles code-based ETL, DataBrew handles visual data preparation, and Data Wrangler handles ML-specific transformations. When raw clickstream JSON has to become SageMaker-ready Parquet, which AWS tool handles each stage?

Outcome
The student can trace data through the Glue/DataBrew/Data Wrangler pipeline.
Sub-units
2.1 Design an ETL Pipeline

Module 3

ML Model Development with SageMaker

The question
A SageMaker training job spins up an instance, runs training, stores the model artifacts in S3, and shuts the instance down. You pay only for training time. What is the lifecycle, and when should you use a built-in algorithm rather than bring your own container?
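The "pay only for training time" point can be checked with simple arithmetic. The sketch below lists the four lifecycle steps named above and compares the cost of one two-hour job with the cost of leaving the same instance running for a 30-day month. The hourly rate of 0.25 USD is a made-up figure for illustration, not an AWS price, and the class name is hypothetical.

```python
from dataclasses import dataclass

# The lifecycle steps as described in the module question.
LIFECYCLE = [
    "spin up an instance",
    "run training",
    "store artifacts in S3",
    "shut down",
]


@dataclass
class InstanceUsage:
    """Billable usage of one or more identical instances (hypothetical helper)."""
    instance_count: int
    billable_seconds: int
    hourly_rate_usd: float  # assumed rate, NOT a real AWS price

    def cost_usd(self) -> float:
        hours = self.instance_count * self.billable_seconds / 3600
        return hours * self.hourly_rate_usd


ASSUMED_RATE = 0.25  # USD per instance-hour, illustrative only

# One monthly training job that runs for two hours, then shuts down.
monthly_job = InstanceUsage(1, 2 * 3600, ASSUMED_RATE)

# The same instance left running for a 30-day month.
always_on = InstanceUsage(1, 30 * 24 * 3600, ASSUMED_RATE)

for step_number, step in enumerate(LIFECYCLE, start=1):
    print(step_number, step)
print(monthly_job.cost_usd(), always_on.cost_usd())
```

Under these assumed numbers the job costs 0.50 USD against 180.00 USD for the always-on instance. The ratio, not the absolute figures, is the point: a workload that trains once a month uses the hardware for a tiny fraction of the time.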

Outcome
The student can describe the SageMaker training job lifecycle and select between built-in and custom algorithms.
Sub-units
3.1 Training Job Design

Module 4

Advanced ML Paradigms in SageMaker

The question
XGBoost builds trees sequentially, each one correcting the errors of the trees before it. LightGBM uses histogram-based splits to process large datasets faster, and exclusive feature bundling to reduce the dimensionality of sparse features. When does LightGBM beat XGBoost, and why do both usually beat random forests on tabular data?
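The phrase "corrects errors sequentially" can be shown with a toy gradient-boosting loop in pure Python. This is a teaching sketch under simplifying assumptions (one feature, squared-error loss, single-split "stumps" as weak learners); it is not how XGBoost or LightGBM are implemented, and the data and function names are invented for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds a damped correction."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still wrong
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history


# Invented toy data: a noisy staircase.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
base, stumps, mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

The training error never rises from one round to the next, because every stump is fitted to whatever the ensemble still gets wrong. A random forest, by contrast, trains its trees independently and averages them, so no tree is aimed at the mistakes of another.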

Outcome
The student can explain gradient boosting and identify when LightGBM vs XGBoost is appropriate.
Sub-units
4.1 Gradient Boosting in Plain Terms

Module 5

Data Quality and Feature Store

The question
A model that was 82% accurate at launch is now 67% accurate. No code has changed. List five hypotheses, all involving data. How do you test each one? What would a Feature Store have prevented?
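One way to test a "the input distribution has shifted" hypothesis is to compare a feature's training-time distribution with its serving-time distribution. The sketch below uses the Population Stability Index (PSI), a common drift statistic; it is one possible check among several, not a method prescribed by this course. The bin count, the `eps` floor for empty bins, and the sample data are all assumptions made for illustration.

```python
import math


def psi(expected, actual, n_bins=5, eps=1e-6):
    """Population Stability Index of `actual` against `expected`.

    Bins are equal-width over the range of `expected`; values of `actual`
    outside that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Invented data: a feature at training time, and the same feature shifted in production.
training = [float(i % 10) for i in range(100)]
serving = [v + 4.0 for v in training]

psi_same = psi(training, training)
psi_shifted = psi(training, serving)
print(psi_same, psi_shifted)
```

A PSI of zero means the binned distributions match; larger values mean the serving data has moved away from what the model was trained on. Running a check like this per feature narrows down which of the five hypotheses deserves attention first.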

Outcome
The student can describe data drift, training-serving skew, and five failure modes for production ML data.
Sub-units
5.1 Final Essay: What Can Go Wrong with Production Data?
