Led by Demingian Quality Simulacrum
ML data engineering on AWS — S3, Glue, Data Wrangler, SageMaker training jobs, Feature Store. Based on AWS documentation and industry MLOps practice.
The question
Parquet is columnar, so a query reads only the columns it needs. JSON is row-oriented text, so every record must be parsed in full. RecordIO-protobuf is the format most SageMaker built-in algorithms are optimized for. For a 100 GB transaction dataset powering monthly model training, ad-hoc queries, and real-time streaming: which format for which use case?
Outcome
The student can select data formats for ML workloads and explain the advantage of columnar formats.
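To make the columnar advantage concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not Parquet itself: the schema, the helper names, and the "one JSON blob per column" layout are all assumptions made for illustration. It counts how many bytes must be read to sum a single column under each layout.

```python
import json

def make_rows(n):
    """Generate n synthetic transaction records (hypothetical schema)."""
    return [
        {"txn_id": i, "customer": f"cust-{i % 50}",
         "merchant": f"shop-{i % 7}", "amount": 1.5 * (i % 100)}
        for i in range(n)
    ]

def to_row_layout(rows):
    """Row-oriented: one JSON document per line, all fields interleaved."""
    return "\n".join(json.dumps(r) for r in rows).encode()

def to_column_layout(rows):
    """Toy columnar layout: each column stored as its own contiguous blob."""
    return {name: json.dumps([r[name] for r in rows]).encode()
            for name in rows[0]}

def sum_amount_rows(blob):
    """Every record is parsed in full just to reach its 'amount' field."""
    total = sum(json.loads(line)["amount"]
                for line in blob.decode().split("\n"))
    return total, len(blob)

def sum_amount_columns(columns):
    """Only the 'amount' blob is read; the other columns are never touched."""
    blob = columns["amount"]
    return sum(json.loads(blob)), len(blob)

rows = make_rows(1000)
row_total, row_bytes = sum_amount_rows(to_row_layout(rows))
col_total, col_bytes = sum_amount_columns(to_column_layout(rows))

print(f"row layout:    total={row_total:.2f}, bytes read={row_bytes}")
print(f"column layout: total={col_total:.2f}, bytes read={col_bytes}")
```

Both layouts give the same total, but the columnar one reads a fraction of the bytes. Real Parquet adds per-column compression and encoding on top of this idea, which the sketch does not attempt to model.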
The question
AWS Glue for code-based ETL, Glue DataBrew for visual preparation, SageMaker Data Wrangler for ML-specific transformations. Raw clickstream JSON → SageMaker-ready Parquet: which AWS tool handles each stage?
Outcome
The student can trace data through the Glue/DataBrew/Data Wrangler pipeline.
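The three stages can be sketched in plain Python to show what kind of work each one does. This is an illustration only: it calls no AWS service, the sample events and field names are hypothetical, and the mapping of each function to Glue, DataBrew, or Data Wrangler is an assumption about where such a step would typically live.

```python
import json
from datetime import datetime

# Hypothetical raw clickstream lines, including a duplicate, a null id,
# and a malformed record.
RAW = [
    '{"user": {"id": "u1"}, "event": "click", "ts": "2024-03-01T09:15:00+00:00"}',
    '{"user": {"id": "u1"}, "event": "click", "ts": "2024-03-01T09:15:00+00:00"}',
    '{"user": {"id": "u2"}, "event": "purchase", "ts": "2024-03-01T21:40:00+00:00"}',
    '{"user": {"id": null}, "event": "click", "ts": "2024-03-01T10:00:00+00:00"}',
    'not json at all',
]

def extract_flatten(lines):
    """ETL stage (the kind of job Glue runs): parse, flatten, cast types."""
    out = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed input rather than fail the whole job
        out.append({
            "user_id": rec["user"]["id"],
            "event": rec["event"],
            "ts": datetime.fromisoformat(rec["ts"]),
        })
    return out

def clean(records):
    """Preparation stage (DataBrew-style): drop nulls and exact duplicates."""
    seen, out = set(), []
    for r in records:
        key = (r["user_id"], r["event"], r["ts"])
        if r["user_id"] is None or key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out

def featurize(records):
    """ML stage (Data Wrangler-style): derive model-ready numeric features."""
    return [
        {"user_id": r["user_id"],
         "hour_of_day": r["ts"].hour,
         "is_purchase": int(r["event"] == "purchase")}
        for r in records
    ]

features = featurize(clean(extract_flatten(RAW)))
for row in features:
    print(row)
```

In a real pipeline the final step would write the rows to S3 as Parquet; that write is left out here to keep the sketch dependency-free.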
The question
A SageMaker training job provisions instances, runs training, writes model artifacts to S3, and shuts the instances down. You are billed only for the seconds the instances are in use. What is the lifecycle, and when should you use a built-in algorithm rather than bring your own container?
Outcome
The student can describe the SageMaker training job lifecycle and select between built-in and custom algorithms.
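The lifecycle and the pay-for-what-you-use billing can be sketched as a small state walk plus a cost estimate. The phase list below is a simplified subset of the secondary status values a training job reports, and the hourly rate is a made-up number; check current pricing for any real estimate. The function names are hypothetical.

```python
# Simplified, ordered subset of training job phases (an assumption for
# illustration; real jobs report more fine-grained statuses).
PHASES = ["Starting", "Downloading", "Training", "Uploading", "Completed"]

def next_phase(current):
    """Advance one step through the lifecycle; 'Completed' is terminal."""
    idx = PHASES.index(current)
    return PHASES[min(idx + 1, len(PHASES) - 1)]

def walk_lifecycle():
    """Return the full sequence of phases a successful job passes through."""
    phase = PHASES[0]
    seen = [phase]
    while phase != "Completed":
        phase = next_phase(phase)
        seen.append(phase)
    return seen

def estimate_cost_usd(billable_seconds, instance_count, hourly_rate_usd):
    """Billable seconds x instance count x per-instance hourly rate."""
    return billable_seconds / 3600 * instance_count * hourly_rate_usd

print(" -> ".join(walk_lifecycle()))

# Hypothetical job: 30 billable minutes on 2 instances at $0.50/hour each.
print(f"estimated cost: ${estimate_cost_usd(1800, 2, 0.50):.2f}")
```

The point of the arithmetic is that cost scales with billable seconds and instance count, not with wall-clock days between monthly retrains, because the instances do not exist once the job finishes.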
The question
XGBoost builds trees sequentially, each one correcting the errors of the ensemble so far. LightGBM bins feature values into histograms to find splits faster on large datasets, and its exclusive feature bundling merges sparse features to reduce dimensionality. When does LightGBM beat XGBoost, and why do both usually beat random forests on tabular data?
Outcome
The student can explain gradient boosting and identify when LightGBM vs XGBoost is appropriate.
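The "each tree corrects the previous errors" idea can be shown with a bare-bones gradient boosting loop for squared error, using one-split decision stumps on a single feature. This is a teaching sketch, not how XGBoost or LightGBM are implemented; the data, learning rate, and function names are assumptions.

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def fit_stump(xs, residuals):
    """Find the single threshold on x that best fits the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # exclude max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, rounds=20, lr=0.3):
    """Sequentially add stumps, each fitted to the current residuals."""
    preds = [sum(ys) / len(ys)] * len(xs)  # start from the mean
    history = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + lr * stump(x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return preds, history

xs = list(range(20))
ys = [x * x for x in xs]  # a nonlinear target a single stump cannot fit
_, history = boost(xs, ys)
print(f"MSE after round 1:  {history[0]:.1f}")
print(f"MSE after round 20: {history[-1]:.1f}")
```

Note that `fit_stump` tries every distinct value as a threshold. Histogram-based methods replace that exhaustive scan with a scan over a small number of bins, which is where much of the speedup on large datasets comes from.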
The question
A model that was 82% accurate at launch is now 67%. No code has changed. List five hypotheses — all involving data. How do you test each one? What would a Feature Store have prevented?
Outcome
The student can describe data drift, training-serving skew, and five failure modes for production ML data.
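One way to test a drift hypothesis is to compare a feature's distribution at training time against its distribution in production. The sketch below uses the Population Stability Index (PSI) with equal-width bins; the bin count, the smoothing constant, and the sample data are assumptions, and the often-quoted threshold of 0.2 for "significant shift" is a rule of thumb rather than a standard.

```python
import math

def histogram_fractions(values, lo, width, bins, eps=1e-4):
    """Fraction of values per bin; empty bins are floored at eps to avoid log(0)."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(bins - 1, idx))] += 1  # clamp out-of-range values
    return [max(c / len(values), eps) for c in counts]

def psi(baseline, current, bins=10):
    """Population Stability Index: sum of (a - e) * ln(a / e) over bins."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline
    expected = histogram_fractions(baseline, lo, width, bins)
    actual = histogram_fractions(current, lo, width, bins)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

# Hypothetical feature values: uniform at training time, shifted upward in production.
training = [i / 100 for i in range(100)]
production = [0.5 + i / 200 for i in range(100)]

print(f"PSI, training vs itself:     {psi(training, training):.3f}")
print(f"PSI, training vs production: {psi(training, production):.3f}")
```

A check like this covers only one of the five hypotheses (input distribution drift). Training-serving skew needs a different test: computing the same feature through both the training and serving code paths and comparing the values directly.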