Universitas Scholarium: A Community of Scholars
Tutorial Course

COMP 3211 · Machine Learning on AWS: Data and Modelling

Led by Demingian Quality Simulacrum

5 modules · Computing · Updated 1 week ago

ML data engineering on AWS — S3, Glue, Data Wrangler, SageMaker training jobs, Feature Store. Based on AWS documentation and industry MLOps practice.

  1. Module 1

    Data Formats and Storage

    The question

    A query over Parquet reads only the columns it needs. A query over JSON must parse every record in full. RecordIO-protobuf is the format that many SageMaker built-in algorithms are optimized to train on. For a 100 GB transaction dataset that feeds monthly model training, ad-hoc queries, and real-time streaming, which format suits which use case?
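
To make the columnar advantage concrete, here is a toy sketch using only the Python standard library. It is not Parquet: it simply stores the same records once row by row (one JSON document per record) and once column by column, then counts how many bytes must be read to sum a single field. The field names (`txn_id`, `customer`, `amount`, `note`) and all function names are hypothetical.

```python
import json


def make_rows(n):
    """Build n hypothetical transaction records."""
    return [
        {"txn_id": i, "customer": f"cust-{i % 7}", "amount": i * 1.5, "note": "x" * 40}
        for i in range(n)
    ]


def row_layout(rows):
    """Row-oriented storage: one JSON document per record."""
    return [json.dumps(r).encode("utf-8") for r in rows]


def column_layout(rows):
    """Column-oriented storage: one serialized chunk per field."""
    return {
        key: json.dumps([r[key] for r in rows]).encode("utf-8")
        for key in rows[0]
    }


def sum_amount_from_rows(lines):
    """Every record must be read and parsed to reach one field."""
    total, bytes_read = 0.0, 0
    for line in lines:
        bytes_read += len(line)
        total += json.loads(line)["amount"]
    return total, bytes_read


def sum_amount_from_columns(chunks):
    """Only the 'amount' chunk is read; other columns are never touched."""
    chunk = chunks["amount"]
    return sum(json.loads(chunk)), len(chunk)


rows = make_rows(1000)
row_total, row_bytes = sum_amount_from_rows(row_layout(rows))
col_total, col_bytes = sum_amount_from_columns(column_layout(rows))
print(f"row layout read {row_bytes} bytes, column layout read {col_bytes} bytes")
```

Real columnar formats add compression, encoding, and per-chunk statistics on top of this idea, so the gap on a wide table is typically larger than this sketch suggests.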

    Outcome

    The student can select data formats for ML workloads and explain the advantage of columnar formats.

    Sub-units

    1. 1.1 Choose a Format
  2. Module 2

    ETL with AWS Glue

    The question

    Glue for code-based ETL, DataBrew for visual preparation, Data Wrangler for ML-specific transformations. Raw clickstream JSON → SageMaker-ready Parquet: which AWS tool handles each stage?
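
As a rough illustration of what the code-based cleaning stage does, the sketch below flattens raw clickstream JSON lines into typed, tabular rows and counts the records it rejects. It is plain Python rather than the Glue API, and the event schema (`user.id`, `page`, `ts`, `duration_ms`) is an assumption made for the example. Writing the resulting rows out as Parquet would be a separate final step with a Parquet writer.

```python
import json
from datetime import datetime, timezone


def flatten_event(event):
    """Flatten one nested event into a flat row with explicit types.

    The schema here is hypothetical; a real job would follow the
    actual clickstream contract.
    """
    return {
        "user_id": str(event["user"]["id"]),
        "page": event["page"],
        "ts": datetime.fromtimestamp(event["ts"], tz=timezone.utc).isoformat(),
        "duration_ms": int(event.get("duration_ms", 0)),
    }


def clean_clickstream(lines):
    """Parse JSON lines, keep well-formed rows, count the rest as rejected."""
    rows, rejected = [], 0
    for line in lines:
        try:
            rows.append(flatten_event(json.loads(line)))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing fields, or wrong types.
            rejected += 1
    return rows, rejected


raw = [
    '{"user": {"id": 7}, "page": "/home", "ts": 0, "duration_ms": "120"}',
    "not json at all",
    '{"page": "/missing-user"}',
]
rows, rejected = clean_clickstream(raw)
print(rows, rejected)
```

Tracking the rejected count rather than silently dropping bad records gives the later stages a simple signal when the upstream format changes.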

    Outcome

    The student can trace data through the Glue/DataBrew/Data Wrangler pipeline.

    Sub-units

    1. 2.1 Design an ETL Pipeline
  3. Module 3

    ML Model Development with SageMaker

    The question

    SageMaker training jobs: spin up an instance, run training, store artifacts in S3, shut down. You pay only for training time. What is the lifecycle — and when should you use a built-in algorithm vs bring your own container?
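
The lifecycle above maps onto the fields of a training job request. The sketch below only builds the request dictionary and estimates cost; it makes no AWS call. The top-level keys follow the SageMaker `CreateTrainingJob` request shape, while the job name, image URI, role ARN, S3 paths, volume size, and hourly rate are placeholders invented for illustration.

```python
def build_training_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri,
                           instance_type="ml.m5.xlarge", instance_count=1,
                           max_runtime_s=3600):
    """Assemble a CreateTrainingJob-style request as a plain dict."""
    return {
        "TrainingJobName": job_name,
        # Built-in algorithm or your own container: either way it is an image URI.
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # Where the instance reads training data from.
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # Where model artifacts are stored when training finishes.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # What gets spun up, and then shut down, for the job.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 30,
        },
        # Upper bound on billable training time.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
    }


def estimate_cost(training_seconds, hourly_rate, instance_count=1):
    """Cost of training time only; hourly_rate is a hypothetical price."""
    return training_seconds / 3600 * hourly_rate * instance_count


request = build_training_request(
    job_name="txn-model-demo",
    image_uri="<training-image-uri>",
    role_arn="<execution-role-arn>",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/artifacts/",
)
print(request["ResourceConfig"], estimate_cost(1800, 2.0, instance_count=2))
```

With credentials configured, a dictionary like this could be passed to `boto3.client("sagemaker").create_training_job(**request)`. In this sketch the built-in versus custom choice comes down to which image URI is supplied.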

    Outcome

    The student can describe the SageMaker training job lifecycle and select between built-in and custom algorithms.

    Sub-units

    1. 3.1 Training Job Design
  4. Module 4

    Advanced ML Paradigms in SageMaker

    The question

    Gradient boosting, as implemented in XGBoost, adds trees sequentially, each one correcting the errors of the ensemble so far. LightGBM uses histogram-based splits to process large datasets faster, and its exclusive feature bundling reduces the dimensionality of sparse features. When does LightGBM beat XGBoost, and why do both usually beat random forests on tabular data?
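
The sequential error correction can be sketched in a few lines of standard-library Python. This is a minimal gradient-boosting loop for squared error on one feature, using depth-1 trees (stumps). It is neither XGBoost nor LightGBM. The optional `max_bins` argument is a loose analogue of histogram-based splitting: it limits how many candidate thresholds are evaluated. All names are illustrative.

```python
def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals, max_bins=None):
    """Pick the threshold whose two-sided means best fit the residuals."""
    candidates = sorted(set(xs))
    if max_bins is not None and len(candidates) > max_bins:
        # Evaluate only an evenly spaced subset of thresholds.
        step = len(candidates) / max_bins
        candidates = [candidates[int(i * step)] for i in range(max_bins)]
    best = None
    for t in candidates:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:
        # No valid split: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return float("inf"), mean, mean
    return best[1], best[2], best[3]


def boost(xs, ys, rounds, learning_rate=0.5, max_bins=None):
    """Each round fits a stump to the current residuals and adds it in."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals, max_bins)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return base, stumps, history


xs = [i / 10 for i in range(50)]
ys = [x * x for x in xs]
_, _, exact = boost(xs, ys, rounds=20)
_, _, binned = boost(xs, ys, rounds=20, max_bins=8)
print(f"final MSE exact={exact[-1]:.4f} binned={binned[-1]:.4f}")
```

The binned variant scans fewer thresholds per round, trading some fit quality for less work, which is the general shape of the speed-versus-precision trade-off behind histogram-based splitting.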

    Outcome

    The student can explain gradient boosting and identify when LightGBM vs XGBoost is appropriate.

    Sub-units

    1. 4.1 Gradient Boosting in Plain Terms
  5. Module 5

    Data Quality and Feature Store

    The question

    A model that was 82% accurate at launch is now 67%. No code has changed. List five hypotheses — all involving data. How do you test each one? What would a Feature Store have prevented?
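
One way to test a drift hypothesis is to compare a feature's distribution at training time with its distribution in production. The sketch below computes a Population Stability Index (PSI) with the standard library. The bin count, the smoothing constant `eps`, and the sample data are assumptions chosen for illustration, not values from the course.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample.

    Bins are equal-width over the baseline range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


training_amounts = [i / 100 for i in range(1000)]        # baseline sample
production_amounts = [v + 3 for v in training_amounts]   # shifted sample
print(f"no drift: {psi(training_amounts, training_amounts):.4f}")
print(f"shifted:  {psi(training_amounts, production_amounts):.4f}")
```

A score near zero means the two samples fill the bins in the same proportions; larger values indicate a shift worth investigating. Running a check like this for each input feature helps separate hypotheses about changed inputs from hypotheses about changed labels or pipelines.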

    Outcome

    The student can describe data drift, training-serving skew, and five failure modes for production ML data.

    Sub-units

    1. 5.1 Final Essay: What Can Go Wrong with Production Data?