AWS Certified Machine Learning Engineer - Associate
Master every topic for the AWS MLA-C01 exam. Covers data preparation, feature engineering, model training, deployment, MLOps workflows, monitoring, and security for ML solutions on SageMaker and related AWS services.
Domain Breakdown
The MLA-C01 exam validates your ability to build, train, tune, deploy, and monitor ML models on AWS using Amazon SageMaker and related AI/ML services. It spans the full ML engineering lifecycle.
| Domain | Topic | Weight |
|---|---|---|
| D1 | Data Preparation for Machine Learning | 28% |
| D2 | ML Model Development | 26% |
| D3 | Deployment and Orchestration of ML Workflows | 22% |
| D4 | ML Solution Monitoring, Maintenance, and Security | 24% |
Data Preparation for Machine Learning
Covers data ingestion, exploratory analysis, feature engineering, transformation, and labeling. SageMaker Data Wrangler, Processing Jobs, Feature Store, and Ground Truth are key services.
Data Ingestion & Preparation
Transforming raw data into ML-ready features
📥 Data Sources & Ingestion
- S3: primary data lake for ML datasets (structured + unstructured)
- Athena: query structured S3 data for feature extraction
- Redshift: SQL-based feature retrieval at scale
- Kinesis: streaming data into SageMaker Feature Store
- AWS Glue: ETL for transforming and cataloging raw data
- Lake Formation: governed, secure data lake for ML
- SageMaker Data Wrangler: GUI-based data prep + 300+ transforms
🔧 SageMaker Data Wrangler
- Visual, no-code data preparation within SageMaker Studio
- 300+ built-in transforms: normalization, imputation, encoding
- Data quality & insights: detect anomalies, missing values
- Exports to: Processing Job, Training Job, Feature Store, Pipeline
- Supports: S3, Redshift, Athena, Salesforce, Snowflake connectors
- Auto-generates feature transformation code (Python/PySpark)
⚙️ SageMaker Processing Jobs
- Run custom Python / Spark scripts at scale on managed infrastructure
- Containers: ScikitLearn, Spark, MXNet, or custom Docker image
- Inputs/outputs from/to S3; supports distributed Spark processing
- Use for: feature engineering, model evaluation, data validation
- Instances: ml.m5, ml.c5, ml.p3 (GPU) as needed
- Managed infrastructure — no cluster setup required
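A Processing Job is created through the SageMaker `CreateProcessingJob` API; the sketch below shows the request shape as passed to boto3, with the job name, role ARN, image URI, and bucket paths as placeholders:

```python
# Sketch of a boto3 create_processing_job request for a feature-engineering
# script. Job name, role ARN, image URI, and buckets are placeholders.
processing_request = {
    "ProcessingJobName": "feature-engineering-demo",
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "AppSpecification": {
        "ImageUri": "<sklearn-processing-image-uri>",  # region-specific container
        "ContainerEntrypoint": ["python3", "/opt/ml/processing/input/code/prep.py"],
    },
    "ProcessingResources": {
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30,
        }
    },
    "ProcessingInputs": [{
        "InputName": "raw-data",
        "S3Input": {
            "S3Uri": "s3://my-bucket/raw/",           # placeholder bucket
            "LocalPath": "/opt/ml/processing/input",
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
        },
    }],
    "ProcessingOutputConfig": {
        "Outputs": [{
            "OutputName": "features",
            "S3Output": {
                "S3Uri": "s3://my-bucket/features/",
                "LocalPath": "/opt/ml/processing/output",
                "S3UploadMode": "EndOfJob",
            },
        }]
    },
}
# Real call: boto3.client("sagemaker").create_processing_job(**processing_request)
```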
🗄️ SageMaker Feature Store
- Centralized repo for ML features — online + offline stores
- Online store: low-latency (ms) real-time inference lookups
- Offline store: S3-backed for training and batch inference
- Feature groups: schema + record identifier + event time
- Ingestion: via SDK, Kinesis, or SageMaker Pipelines
- Prevents training/serving skew by sharing features
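Ingestion into a feature group goes through the `sagemaker-featurestore-runtime` `PutRecord` API; a minimal sketch, with a hypothetical feature group and feature values (the record identifier and `EventTime` features are required by the group's schema):

```python
# Sketch of writing one record to a feature group via PutRecord.
# Group name and feature values are hypothetical.
put_record_request = {
    "FeatureGroupName": "customer-features",
    "Record": [
        {"FeatureName": "customer_id", "ValueAsString": "C-1001"},        # record identifier
        {"FeatureName": "avg_basket_value", "ValueAsString": "42.50"},
        {"FeatureName": "EventTime", "ValueAsString": "2024-01-01T00:00:00Z"},
    ],
}
# Real call: boto3.client("sagemaker-featurestore-runtime").put_record(**put_record_request)
```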
🏷️ SageMaker Ground Truth
- Managed data labeling service with human/automated workflows
- Labeling workforce: Amazon Mechanical Turk, private team, vendor
- Auto-labeling with active learning (can reduce manual labeling effort by up to ~70%)
- Task types: image classification, bounding box, NER, 3D point cloud
- Output: labeled dataset in S3 with augmented manifest format
- Ground Truth Plus: fully managed end-to-end labeling project
📊 Feature Engineering Techniques
- Normalization/Standardization: scale numeric features (MinMax, Z-score)
- One-Hot Encoding: convert categorical to binary vectors
- Imputation: fill missing values (mean, median, model-based)
- Binning: discretize continuous variables into buckets
- Log transform: normalize skewed distributions
- TF-IDF, word embeddings: text feature extraction
- PCA: dimensionality reduction for high-dim features
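The scaling, imputation, and encoding techniques above can be sketched in a few lines of dependency-free Python (the sample values are illustrative):

```python
import math

# Illustrative data for each transform
values = [2.0, 4.0, 6.0, 8.0]

# Min-max normalization to [0, 1]
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization (population standard deviation)
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
zscore = [(v - mean) / std for v in values]

# Mean imputation (None marks a missing value)
raw = [1.0, None, 3.0]
observed = [v for v in raw if v is not None]
fill = sum(observed) / len(observed)
imputed = [fill if v is None else v for v in raw]   # [1.0, 2.0, 3.0]

# One-hot encoding of a categorical column
column = ["red", "blue", "red"]
categories = sorted(set(column))                     # ["blue", "red"]
onehot = [[1 if c == cat else 0 for cat in categories] for c in column]
```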
| Service | Use Case | Scale |
|---|---|---|
| Data Wrangler | Interactive GUI prep, EDA, 300+ transforms | Exploratory / medium |
| Processing Jobs | Custom scripts (Spark/Sklearn) in managed infra | Production / large |
| Glue ETL | Schema evolution, catalog management, serverless ETL | Enterprise / very large |
| Feature Store | Centralized feature registry, prevents train/serve skew | Shared feature platform |
| Ground Truth | Human + auto labeling with active learning | Any dataset size |
ML Model Development
Covers SageMaker built-in algorithms, custom training, hyperparameter tuning, distributed training strategies, experiment tracking, model evaluation, and AutoML with Autopilot.
Training & Model Development
From raw features to production-ready models
🤖 SageMaker Built-In Algorithms
- XGBoost: tabular data, classification/regression; CSV or libsvm
- Linear Learner: regression + classification, large-scale
- K-Means: unsupervised clustering
- PCA: dimensionality reduction
- Random Cut Forest (RCF): anomaly detection in time series
- BlazingText: Word2Vec + text classification (very fast)
- DeepAR: probabilistic time-series forecasting
- Object Detection / Image Classification: computer vision
- Seq2Seq: machine translation, summarization
- IP Insights: detects anomalous IP usage patterns
- Factorization Machines: recommendation systems (sparse data)
🏋️ Training Jobs
- SageMaker manages compute: provision, train, terminate instances
- Managed Spot Training: up to 90% savings; requires checkpointing
- Checkpointing: save model state to S3; resume after spot interruption
- Input modes: File (copy S3 to local), Pipe (stream from S3), FastFile
- File mode: best for small-medium datasets; Pipe: large datasets
- Training script + container image + hyperparameters + resource config
- Output artifacts → S3; CloudWatch logs per training job
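The spot-training rules above map onto a few fields of the `CreateTrainingJob` request: `MaxWaitTimeInSeconds` must be at least `MaxRuntimeInSeconds`, and a `CheckpointConfig` S3 path lets training resume after an interruption. A sketch with placeholder buckets:

```python
# Sketch of the managed-spot fields in a boto3 create_training_job request.
spot_training_config = {
    "EnableManagedSpotTraining": True,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600,   # cap on actual training time
        "MaxWaitTimeInSeconds": 7200,  # training plus time waiting for spot capacity
    },
    "CheckpointConfig": {
        "S3Uri": "s3://my-bucket/checkpoints/",   # placeholder bucket
        "LocalPath": "/opt/ml/checkpoints",
    },
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/artifacts/"},
}
```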
🎛️ Hyperparameter Tuning (HPO)
- Automatic Model Tuning (AMT): finds best hyperparameter combinations
- Strategies: Bayesian (learns from prior runs), Random, Grid, Hyperband
- Define: parameter ranges, objective metric, max training jobs
- Bayesian = best for continuous spaces; Grid = exhaustive discrete
- Warm start: continue tuning from previous job (2 types: IDENTICAL_DATA_AND_ALGORITHM, TRANSFER_LEARNING)
- Early stopping: reduces cost by stopping poor-performing jobs
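The tuning setup above (strategy, objective metric, ranges, limits, early stopping) corresponds to the `HyperParameterTuningJobConfig` section of a `CreateHyperParameterTuningJob` request; the metric name and parameter range below are illustrative XGBoost values:

```python
# Sketch of a Bayesian tuning configuration for boto3's
# create_hyper_parameter_tuning_job. Metric and ranges are illustrative.
tuning_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Minimize",
        "MetricName": "validation:logloss",
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 20,
        "MaxParallelTrainingJobs": 2,
    },
    "ParameterRanges": {
        "ContinuousParameterRanges": [
            {"Name": "eta", "MinValue": "0.01", "MaxValue": "0.3",
             "ScalingType": "Logarithmic"},   # log scale suits learning rates
        ],
    },
    "TrainingJobEarlyStoppingType": "Auto",   # stop poorly performing trials early
}
```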
⚡ Distributed Training
- Data parallelism: split dataset across GPUs; each GPU has full model copy
- Model parallelism: split model across GPUs; for very large models
- SageMaker Distributed Training Library: SDP + SMP
- SDP (data parallel): AllReduce gradient aggregation
- SMP (model parallel): automatic pipeline partitioning
- Use p3, p4, g4dn instances; multiple instances via instance_count
- Frameworks: TensorFlow, PyTorch, MXNet natively supported
📐 Model Evaluation Metrics
- Classification: Accuracy, Precision, Recall, F1, AUC-ROC, Log Loss
- Regression: RMSE, MAE, R-squared, MSE
- Ranking: NDCG, MAP, MRR
- Confusion matrix: TP, TN, FP, FN breakdown
- Cross-validation: k-fold for robust evaluation on small datasets
- Bias metrics: use SageMaker Clarify for fairness evaluation
- SageMaker Experiments: track metrics, parameters, artifacts per run
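The classification metrics above all derive from the four confusion-matrix counts; a worked example with illustrative counts:

```python
# Core classification metrics from confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 20, 30

accuracy  = (tp + tn) / (tp + tn + fp + fn)       # correct / total
precision = tp / (tp + fp)                        # of predicted positives, how many are right
recall    = tp / (tp + fn)                        # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
```

With these counts, accuracy is 0.7, precision 0.8, and recall 2/3; F1 lands between precision and recall, closer to the smaller of the two.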
🚀 SageMaker Autopilot (AutoML)
- Fully automated end-to-end ML pipeline for tabular data
- Auto explores: feature preprocessing, algorithm selection, HPO
- Modes: Auto, Ensembling, HPO
- Generates notebooks showing the best pipeline logic (explainable)
- Candidate definitions: configurable max candidates, max runtime
- Output: best model + leaderboard of all trials
- Supports: binary + multiclass classification, regression
| Algorithm | Task | Input Format |
|---|---|---|
| XGBoost | Classification / Regression | CSV, libsvm, Parquet |
| Linear Learner | Classification / Regression | CSV, RecordIO-protobuf |
| DeepAR | Time-series forecast | JSON Lines |
| Random Cut Forest | Anomaly detection | CSV, RecordIO-protobuf |
| BlazingText | Text classification / Word2Vec | Plain text (one sentence/line) |
| Factorization Machines | Recommendation (sparse) | RecordIO-protobuf |
| K-Means | Clustering | CSV, RecordIO-protobuf |
Deployment and Orchestration of ML Workflows
Covers inference endpoint types, A/B testing, batch transform, SageMaker Pipelines, Model Registry, MLOps with CI/CD, and integration with Step Functions and EventBridge.
Deployment & MLOps
Getting models to production and keeping them there
🌐 Real-Time Inference Endpoints
- Persistent HTTPS endpoint; invoke via `invoke_endpoint()`
- Latency: milliseconds; ideal for interactive applications
- Auto Scaling: scale based on InvocationsPerInstance metric
- Production Variants: A/B test by routing traffic % to multiple models
- Instance types: ml.c5, ml.m5 (CPU); ml.g4dn (GPU for deep learning)
- Elastic Inference: attach GPU fraction to CPU instance (cost-efficient)
- Data capture: log request/response payloads to S3 for monitoring
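Invocation goes through the `sagemaker-runtime` client rather than the `sagemaker` control-plane client; a sketch with a hypothetical endpoint name and CSV payload:

```python
# Sketch of an invoke_endpoint request to the sagemaker-runtime client.
# Endpoint name and payload are placeholders.
invoke_request = {
    "EndpointName": "my-xgboost-endpoint",
    "ContentType": "text/csv",
    "Body": "5.1,3.5,1.4,0.2",   # one feature row as CSV
}
# Real call:
#   resp = boto3.client("sagemaker-runtime").invoke_endpoint(**invoke_request)
#   prediction = resp["Body"].read()
```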
📦 Batch Transform
- Offline, asynchronous scoring of large datasets from S3
- No persistent endpoint; creates/destroys compute per job
- Input: S3 file(s); Output: S3 results with ".out" suffix
- Ideal for: nightly scoring, pre-computing predictions at scale
- Associating predictions: use input filter, output filter, join source
- Split type: line / RecordIO / TFRecord; batch strategy: MultiRecord
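The input filter, output filter, and join source live in the `DataProcessing` section of a `CreateTransformJob` request and use JSONPath expressions; a sketch that drops an ID column before inference and re-attaches it to the prediction (names and buckets are placeholders):

```python
# Sketch of a boto3 create_transform_job request that joins each input
# record with its prediction via DataProcessing filters.
transform_request = {
    "TransformJobName": "nightly-scoring",
    "ModelName": "my-registered-model",
    "TransformInput": {
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/batch-input/",
        }},
        "ContentType": "text/csv",
        "SplitType": "Line",                 # one record per line
    },
    "TransformOutput": {
        "S3OutputPath": "s3://my-bucket/batch-output/",  # results get a .out suffix
        "AssembleWith": "Line",
    },
    "DataProcessing": {
        "InputFilter": "$[1:]",    # send everything after the ID column to the model
        "JoinSource": "Input",     # join the prediction back onto the input record
        "OutputFilter": "$[0,-1]", # keep the ID and the prediction
    },
    "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
}
```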
⏱️ Asynchronous & Serverless Inference
- Async Inference: queue requests; ideal for large payloads (up to 1 GB)
- Async: scales to 0 when idle (cost-saving); SNS notification on completion
- Serverless Inference: pay-per-use; no instance management
- Serverless: cold start latency; best for intermittent/unpredictable traffic
- Specify memory size (1–6 GB) and max concurrency for serverless
🔀 Multi-Model & Multi-Container Endpoints
- Multi-Model Endpoints (MME): host thousands of models on ONE endpoint
- MME: models loaded on demand from S3; share instance resources
- MME best for: many similar small models (per-tenant, per-customer)
- Multi-Container Endpoints: sequential or direct invocation of containers
- Direct: invoke any container independently on same endpoint
- Serial: pipeline inference where output of one feeds the next
🔧 SageMaker Pipelines (MLOps)
- Directed Acyclic Graph (DAG) of ML workflow steps
- Steps: Processing, Training, Evaluation, Condition, Register, Transform
- Condition step: branch based on evaluation metric thresholds
- Cached steps: skip re-execution if inputs unchanged (saves cost)
- Integrated with Model Registry for approval-gated deployment
- Triggers: EventBridge, API, SageMaker Studio UI
📋 SageMaker Model Registry
- Central catalog of versioned, approved models
- Approval statuses: PendingManualApproval, Approved, Rejected
- Deploy only Approved models to production endpoints
- Stores: model artifacts, metadata, metrics, tags per version
- Cross-account model sharing via AWS RAM or S3 cross-account
- Integrates with Pipelines (Register Model step) + CodePipeline
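Approval-gated deployment hinges on flipping a model package version's status, done via the `UpdateModelPackage` API; a sketch with a placeholder ARN:

```python
# Sketch of approving a model package version in the Model Registry.
# The ARN is a placeholder.
approval_request = {
    "ModelPackageArn": "arn:aws:sagemaker:us-east-1:123456789012:model-package/my-group/1",
    "ModelApprovalStatus": "Approved",
    "ApprovalDescription": "Passed offline evaluation threshold",
}
# Real call: boto3.client("sagemaker").update_model_package(**approval_request)
```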
| Inference Type | Latency | Payload Limit | Best For |
|---|---|---|---|
| Real-time Endpoint | ms | 6 MB | Interactive, low-latency apps |
| Async Inference | seconds–minutes | 1 GB | Large payloads, video/audio |
| Batch Transform | minutes–hours | No limit (S3) | Offline batch scoring |
| Serverless Inference | ms (warm) / s (cold) | 6 MB | Infrequent / spiky traffic |
ML Solution Monitoring, Maintenance, and Security
Covers SageMaker Model Monitor, Clarify for bias and explainability, CloudWatch integration, model retraining triggers, VPC isolation, IAM policies, and encryption best practices for ML workloads.
Monitoring & Security
Keeping models trustworthy, compliant, and secure in production
📈 SageMaker Model Monitor
- Detects model quality degradation in production over time
- 4 monitor types: Data Quality, Model Quality, Bias Drift, Feature Attribution Drift
- Baseline: compute statistics from training data (run once)
- Schedule: hourly/daily monitoring jobs comparing live traffic to baseline
- Violations → CloudWatch metrics → SNS / Lambda alerts
- Requires data capture enabled on endpoint (logs to S3)
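The data-capture prerequisite is configured on the endpoint itself, via the `DataCaptureConfig` block of `CreateEndpointConfig`; a sketch with a placeholder bucket:

```python
# Sketch of the DataCaptureConfig passed in create_endpoint_config,
# the prerequisite for Model Monitor. Bucket is a placeholder.
data_capture_config = {
    "EnableCapture": True,
    "InitialSamplingPercentage": 100,   # capture every request/response
    "DestinationS3Uri": "s3://my-bucket/data-capture/",
    "CaptureOptions": [
        {"CaptureMode": "Input"},       # log request payloads
        {"CaptureMode": "Output"},      # log response payloads
    ],
}
```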
⚖️ SageMaker Clarify
- Detects bias and explains model predictions
- Pre-training bias: in the dataset before training (e.g., class imbalance)
- Post-training bias: in the trained model's predictions
- Bias metrics: Class Imbalance (CI), DPL, KL divergence, DPPL
- Explainability: SHAP values for per-feature attribution
- Runs as a Processing Job; outputs bias report + explainability report
- Integrated with Model Monitor for continuous bias drift detection
📊 CloudWatch for ML
- SageMaker endpoints publish metrics: Invocations, Latency, Errors
- Model Monitor violations → custom CloudWatch metrics
- Training jobs: log training/validation loss, custom metrics via regex
- CloudWatch Logs: all container stdout/stderr captured automatically
- Alarms → SNS → Lambda for automated retraining triggers
- Dashboards: build custom ML ops visibility dashboards
🔒 IAM for SageMaker
- Execution role: attached to SageMaker resource (Training, Endpoint, Pipeline)
- Must have: `s3:GetObject` / `s3:PutObject` on data buckets
- ECR access: `ecr:GetDownloadUrlForLayer` for custom containers
- SageMaker Role Manager: provides predefined ML personas (Data Scientist, MLOps Engineer)
- Condition keys: `sagemaker:NetworkIsolation`, `sagemaker:VolumeKmsKey`
- Resource-based policies not supported; use IAM roles only
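A least-privilege execution-role policy covering the S3 and ECR permissions above might look like the following sketch (bucket name is a placeholder; exact ECR actions needed depend on how the container is pulled):

```python
import json

# Sketch of a least-privilege execution-role policy for a training job:
# S3 data access scoped to one bucket plus ECR pull permissions for
# custom containers. Bucket name is a placeholder.
execution_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-ml-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
            ],
            "Resource": "*",
        },
    ],
}
policy_json = json.dumps(execution_role_policy, indent=2)
```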
🛡️ Network & VPC Security
- VPC mode: run training/processing jobs in customer VPC (private subnets)
- Requires: S3 VPC Gateway endpoint (free) + ECR VPC Interface endpoint
- Network isolation: training container has no internet access (supply all dependencies)
- SageMaker Studio: isolate per user profile in VPC with private endpoints
- Private SageMaker API: PrivateLink to avoid traffic over internet
- Security groups: control traffic to/from training containers
🔐 Encryption for ML Workloads
- Training data: SSE-KMS on S3 with CMK; specify `VolumeKmsKeyId` for EBS
- Model artifacts: S3 SSE-KMS; specify `OutputDataConfig.KmsKeyId`
- Inter-container encryption: enable for distributed training
- SageMaker Feature Store: KMS encryption for both stores
- SageMaker Notebooks / Studio: EBS volume encrypted with KMS
- CloudWatch logs: KMS log group encryption
| Security Control | Mechanism | Key Detail |
|---|---|---|
| Training data encryption | SSE-KMS on S3 + EBS KMS | Specify VolumeKmsKeyId in training config |
| Network isolation | VPC + NetworkIsolation=true | No internet from training container |
| Access control | IAM execution roles | Least-privilege per SageMaker resource type |
| Audit trail | CloudTrail data events | Log all SageMaker API calls + S3 access |
| Private endpoint | PrivateLink (VPC Interface) | Avoid SageMaker API traffic over internet |
100 Practice Questions
Test your MLA-C01 knowledge across all four domains