AWS Certified Machine Learning Engineer - Associate
Master every topic for the AWS MLA-C01 exam. Covers data preparation, feature engineering, model training, deployment, MLOps workflows, monitoring, and security for ML solutions on SageMaker and related AWS services.
Domain Breakdown
The MLA-C01 exam validates your ability to build, train, tune, deploy, and monitor ML models on AWS using Amazon SageMaker and related AI/ML services. It spans the full ML engineering lifecycle.
| Domain | Topic | Weight |
|---|---|---|
| D1 | Data Preparation for Machine Learning | 28% |
| D2 | ML Model Development | 26% |
| D3 | Deployment and Orchestration of ML Workflows | 22% |
| D4 | ML Solution Monitoring, Maintenance, and Security | 24% |
Data Preparation for Machine Learning
Covers data ingestion, exploratory analysis, feature engineering, transformation, and labeling. SageMaker Data Wrangler, Processing Jobs, Feature Store, and Ground Truth are key services.
Data Ingestion & Preparation
Transforming raw data into ML-ready features
📥 Data Sources & Ingestion
- S3: primary data lake for ML datasets (structured + unstructured)
- Athena: query structured S3 data for feature extraction
- Redshift: SQL-based feature retrieval at scale
- Kinesis: streaming data into SageMaker Feature Store
- AWS Glue: ETL for transforming and cataloging raw data
- Lake Formation: governed, secure data lake for ML
- SageMaker Data Wrangler: GUI-based data prep + 300+ transforms
🔧 SageMaker Data Wrangler
- Visual, no-code data preparation within SageMaker Studio
- 300+ built-in transforms: normalization, imputation, encoding
- Data quality & insights: detect anomalies, missing values
- Exports to: Processing Job, Training Job, Feature Store, Pipeline
- Supports: S3, Redshift, Athena, Salesforce, Snowflake connectors
- Auto-generates feature transformation code (Python/PySpark)
⚙️ SageMaker Processing Jobs
- Run custom Python / Spark scripts at scale on managed infrastructure
- Containers: ScikitLearn, Spark, MXNet, or custom Docker image
- Inputs/outputs from/to S3; supports distributed Spark processing
- Use for: feature engineering, model evaluation, data validation
- Instances: ml.m5, ml.c5, ml.p3 (GPU) as needed
- Managed infrastructure — no cluster setup required
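A Processing Job is created through the SageMaker `CreateProcessingJob` API; the sketch below shows the request shape as passed to boto3, with the job name, role ARN, image URI, and bucket paths as placeholders:

```python
# Sketch of a boto3 create_processing_job request for a feature-engineering
# script. Job name, role ARN, image URI, and buckets are placeholders.
processing_request = {
    "ProcessingJobName": "feature-engineering-demo",
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "AppSpecification": {
        "ImageUri": "<sklearn-processing-image-uri>",  # region-specific container
        "ContainerEntrypoint": ["python3", "/opt/ml/processing/input/code/prep.py"],
    },
    "ProcessingResources": {
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30,
        }
    },
    "ProcessingInputs": [{
        "InputName": "raw-data",
        "S3Input": {
            "S3Uri": "s3://my-bucket/raw/",           # placeholder bucket
            "LocalPath": "/opt/ml/processing/input",
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
        },
    }],
    "ProcessingOutputConfig": {
        "Outputs": [{
            "OutputName": "features",
            "S3Output": {
                "S3Uri": "s3://my-bucket/features/",
                "LocalPath": "/opt/ml/processing/output",
                "S3UploadMode": "EndOfJob",
            },
        }]
    },
}
# Real call: boto3.client("sagemaker").create_processing_job(**processing_request)
```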
🗄️ SageMaker Feature Store
- Centralized repo for ML features — online + offline stores
- Online store: low-latency (ms) real-time inference lookups
- Offline store: S3-backed for training and batch inference
- Feature groups: schema + record identifier + event time
- Ingestion: via SDK, Kinesis, or SageMaker Pipelines
- Prevents training/serving skew by sharing features
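Ingestion into a feature group goes through the `sagemaker-featurestore-runtime` `PutRecord` API; a minimal sketch, with a hypothetical feature group and feature values (the record identifier and `EventTime` features are required by the group's schema):

```python
# Sketch of writing one record to a feature group via PutRecord.
# Group name and feature values are hypothetical.
put_record_request = {
    "FeatureGroupName": "customer-features",
    "Record": [
        {"FeatureName": "customer_id", "ValueAsString": "C-1001"},        # record identifier
        {"FeatureName": "avg_basket_value", "ValueAsString": "42.50"},
        {"FeatureName": "EventTime", "ValueAsString": "2024-01-01T00:00:00Z"},
    ],
}
# Real call: boto3.client("sagemaker-featurestore-runtime").put_record(**put_record_request)
```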
🏷️ SageMaker Ground Truth
- Managed data labeling service with human/automated workflows
- Labeling workforce: Amazon Mechanical Turk, private team, vendor
- Auto-labeling with active learning (can reduce manual labeling effort by up to ~70%)
- Task types: image classification, bounding box, NER, 3D point cloud
- Output: labeled dataset in S3 with augmented manifest format
- Ground Truth Plus: fully managed end-to-end labeling project
📊 Feature Engineering Techniques
- Normalization/Standardization: scale numeric features (MinMax, Z-score)
- One-Hot Encoding: convert categorical to binary vectors
- Imputation: fill missing values (mean, median, model-based)
- Binning: discretize continuous variables into buckets
- Log transform: normalize skewed distributions
- TF-IDF, word embeddings: text feature extraction
- PCA: dimensionality reduction for high-dim features
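The scaling, imputation, and encoding techniques above can be sketched in a few lines of dependency-free Python (the sample values are illustrative):

```python
import math

# Illustrative data for each transform
values = [2.0, 4.0, 6.0, 8.0]

# Min-max normalization to [0, 1]
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization (population standard deviation)
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
zscore = [(v - mean) / std for v in values]

# Mean imputation (None marks a missing value)
raw = [1.0, None, 3.0]
observed = [v for v in raw if v is not None]
fill = sum(observed) / len(observed)
imputed = [fill if v is None else v for v in raw]   # [1.0, 2.0, 3.0]

# One-hot encoding of a categorical column
column = ["red", "blue", "red"]
categories = sorted(set(column))                     # ["blue", "red"]
onehot = [[1 if c == cat else 0 for cat in categories] for c in column]
```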
| Service | Use Case | Scale |
|---|---|---|
| Data Wrangler | Interactive GUI prep, EDA, 300+ transforms | Exploratory / medium |
| Processing Jobs | Custom scripts (Spark/Sklearn) in managed infra | Production / large |
| Glue ETL | Schema evolution, catalog management, serverless ETL | Enterprise / very large |
| Feature Store | Centralized feature registry, prevents train/serve skew | Shared feature platform |
| Ground Truth | Human + auto labeling with active learning | Any dataset size |
ML Model Development
Covers SageMaker built-in algorithms, custom training, hyperparameter tuning, distributed training strategies, experiment tracking, model evaluation, and AutoML with Autopilot.
Training & Model Development
From raw features to production-ready models
🤖 SageMaker Built-In Algorithms
- XGBoost: tabular data, classification/regression; CSV or libsvm
- Linear Learner: regression + classification, large-scale
- K-Means: unsupervised clustering
- PCA: dimensionality reduction
- Random Cut Forest (RCF): anomaly detection in time series
- BlazingText: Word2Vec + text classification (very fast)
- DeepAR: probabilistic time-series forecasting
- Object Detection / Image Classification: computer vision
- Seq2Seq: machine translation, summarization
- IP Insights: detects anomalous IP usage patterns
- Factorization Machines: recommendation systems (sparse data)
🏋️ Training Jobs
- SageMaker manages compute: provision, train, terminate instances
- Managed Spot Training: up to 90% savings; requires checkpointing
- Checkpointing: save model state to S3; resume after spot interruption
- Input modes: File (copy S3 to local), Pipe (stream from S3), FastFile
- File mode: best for small-medium datasets; Pipe: large datasets
- Training script + container image + hyperparameters + resource config
- Output artifacts → S3; CloudWatch logs per training job
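The spot-training rules above map onto a few fields of the `CreateTrainingJob` request: `MaxWaitTimeInSeconds` must be at least `MaxRuntimeInSeconds`, and a `CheckpointConfig` S3 path lets training resume after an interruption. A sketch with placeholder buckets:

```python
# Sketch of the managed-spot fields in a boto3 create_training_job request.
spot_training_config = {
    "EnableManagedSpotTraining": True,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600,   # cap on actual training time
        "MaxWaitTimeInSeconds": 7200,  # training plus time waiting for spot capacity
    },
    "CheckpointConfig": {
        "S3Uri": "s3://my-bucket/checkpoints/",   # placeholder bucket
        "LocalPath": "/opt/ml/checkpoints",
    },
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/artifacts/"},
}
```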
🎛️ Hyperparameter Tuning (HPO)
- Automatic Model Tuning (AMT): finds best hyperparameter combinations
- Strategies: Bayesian (learns from prior runs), Random, Grid, Hyperband
- Define: parameter ranges, objective metric, max training jobs
- Bayesian = best for continuous spaces; Grid = exhaustive discrete
- Warm start: continue tuning from previous job (2 types: IDENTICAL_DATA_AND_ALGORITHM, TRANSFER_LEARNING)
- Early stopping: reduces cost by stopping poor-performing jobs
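The tuning setup above (strategy, objective metric, ranges, limits, early stopping) corresponds to the `HyperParameterTuningJobConfig` section of a `CreateHyperParameterTuningJob` request; the metric name and parameter range below are illustrative XGBoost values:

```python
# Sketch of a Bayesian tuning configuration for boto3's
# create_hyper_parameter_tuning_job. Metric and ranges are illustrative.
tuning_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Minimize",
        "MetricName": "validation:logloss",
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 20,
        "MaxParallelTrainingJobs": 2,
    },
    "ParameterRanges": {
        "ContinuousParameterRanges": [
            {"Name": "eta", "MinValue": "0.01", "MaxValue": "0.3",
             "ScalingType": "Logarithmic"},   # log scale suits learning rates
        ],
    },
    "TrainingJobEarlyStoppingType": "Auto",   # stop poorly performing trials early
}
```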
⚡ Distributed Training
- Data parallelism: split dataset across GPUs; each GPU has full model copy
- Model parallelism: split model across GPUs; for very large models
- SageMaker Distributed Training Library: SDP + SMP
- SDP (data parallel): AllReduce gradient aggregation
- SMP (model parallel): automatic pipeline partitioning
- Use p3, p4, g4dn instances; multiple instances via instance_count
- Frameworks: TensorFlow, PyTorch, MXNet natively supported
📐 Model Evaluation Metrics
- Classification: Accuracy, Precision, Recall, F1, AUC-ROC, Log Loss
- Regression: RMSE, MAE, R-squared, MSE
- Ranking: NDCG, MAP, MRR
- Confusion matrix: TP, TN, FP, FN breakdown
- Cross-validation: k-fold for robust evaluation on small datasets
- Bias metrics: use SageMaker Clarify for fairness evaluation
- SageMaker Experiments: track metrics, parameters, artifacts per run
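The classification metrics above all derive from the four confusion-matrix counts; a worked example with illustrative counts:

```python
# Core classification metrics from confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 20, 30

accuracy  = (tp + tn) / (tp + tn + fp + fn)       # correct / total
precision = tp / (tp + fp)                        # of predicted positives, how many are right
recall    = tp / (tp + fn)                        # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
```

With these counts, accuracy is 0.7, precision 0.8, and recall 2/3; F1 lands between precision and recall, closer to the smaller of the two.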
🚀 SageMaker Autopilot (AutoML)
- Fully automated end-to-end ML pipeline for tabular data
- Auto explores: feature preprocessing, algorithm selection, HPO
- Modes: Auto, Ensembling, HPO
- Generates notebooks showing the best pipeline logic (explainable)
- Candidate definitions: configurable max candidates, max runtime
- Output: best model + leaderboard of all trials
- Supports: binary + multiclass classification, regression
| Algorithm | Task | Input Format |
|---|---|---|
| XGBoost | Classification / Regression | CSV, libsvm, Parquet |
| Linear Learner | Classification / Regression | CSV, RecordIO-protobuf |
| DeepAR | Time-series forecast | JSON Lines |
| Random Cut Forest | Anomaly detection | CSV, RecordIO-protobuf |
| BlazingText | Text classification / Word2Vec | Plain text (one sentence/line) |
| Factorization Machines | Recommendation (sparse) | RecordIO-protobuf |
| K-Means | Clustering | CSV, RecordIO-protobuf |
Deployment and Orchestration of ML Workflows
Covers inference endpoint types, A/B testing, batch transform, SageMaker Pipelines, Model Registry, MLOps with CI/CD, and integration with Step Functions and EventBridge.
Deployment & MLOps
Getting models to production and keeping them there
🌐 Real-Time Inference Endpoints
- Persistent HTTPS endpoint; invoke via `invoke_endpoint()`
- Latency: milliseconds; ideal for interactive applications
- Auto Scaling: scale based on InvocationsPerInstance metric
- Production Variants: A/B test by routing traffic % to multiple models
- Instance types: ml.c5, ml.m5 (CPU); ml.g4dn (GPU for deep learning)
- Elastic Inference: attach GPU fraction to CPU instance (cost-efficient)
- Data capture: log request/response payloads to S3 for monitoring
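Invocation goes through the `sagemaker-runtime` client rather than the `sagemaker` control-plane client; a sketch with a hypothetical endpoint name and CSV payload:

```python
# Sketch of an invoke_endpoint request to the sagemaker-runtime client.
# Endpoint name and payload are placeholders.
invoke_request = {
    "EndpointName": "my-xgboost-endpoint",
    "ContentType": "text/csv",
    "Body": "5.1,3.5,1.4,0.2",   # one feature row as CSV
}
# Real call:
#   resp = boto3.client("sagemaker-runtime").invoke_endpoint(**invoke_request)
#   prediction = resp["Body"].read()
```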
📦 Batch Transform
- Offline, asynchronous scoring of large datasets from S3
- No persistent endpoint; creates/destroys compute per job
- Input: S3 file(s); Output: S3 results with ".out" suffix
- Ideal for: nightly scoring, pre-computing predictions at scale
- Associating predictions: use input filter, output filter, join source
- Split type: line / RecordIO / TFRecord; batch strategy: MultiRecord
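The input filter, output filter, and join source live in the `DataProcessing` section of a `CreateTransformJob` request and use JSONPath expressions; a sketch that drops an ID column before inference and re-attaches it to the prediction (names and buckets are placeholders):

```python
# Sketch of a boto3 create_transform_job request that joins each input
# record with its prediction via DataProcessing filters.
transform_request = {
    "TransformJobName": "nightly-scoring",
    "ModelName": "my-registered-model",
    "TransformInput": {
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/batch-input/",
        }},
        "ContentType": "text/csv",
        "SplitType": "Line",                 # one record per line
    },
    "TransformOutput": {
        "S3OutputPath": "s3://my-bucket/batch-output/",  # results get a .out suffix
        "AssembleWith": "Line",
    },
    "DataProcessing": {
        "InputFilter": "$[1:]",    # send everything after the ID column to the model
        "JoinSource": "Input",     # join the prediction back onto the input record
        "OutputFilter": "$[0,-1]", # keep the ID and the prediction
    },
    "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
}
```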
⏱️ Asynchronous & Serverless Inference
- Async Inference: queue requests; ideal for large payloads (up to 1 GB)
- Async: scales to 0 when idle (cost-saving); SNS notification on completion
- Serverless Inference: pay-per-use; no instance management
- Serverless: cold start latency; best for intermittent/unpredictable traffic
- Specify memory size (1–6 GB) and max concurrency for serverless
🔀 Multi-Model & Multi-Container Endpoints
- Multi-Model Endpoints (MME): host thousands of models on ONE endpoint
- MME: models loaded on demand from S3; share instance resources
- MME best for: many similar small models (per-tenant, per-customer)
- Multi-Container Endpoints: sequential or direct invocation of containers
- Direct: invoke any container independently on same endpoint
- Serial: pipeline inference where output of one feeds the next
🔧 SageMaker Pipelines (MLOps)
- Directed Acyclic Graph (DAG) of ML workflow steps
- Steps: Processing, Training, Evaluation, Condition, Register, Transform
- Condition step: branch based on evaluation metric thresholds
- Cached steps: skip re-execution if inputs unchanged (saves cost)
- Integrated with Model Registry for approval-gated deployment
- Triggers: EventBridge, API, SageMaker Studio UI
📋 SageMaker Model Registry
- Central catalog of versioned, approved models
- Approval statuses: PendingManualApproval, Approved, Rejected
- Deploy only Approved models to production endpoints
- Stores: model artifacts, metadata, metrics, tags per version
- Cross-account model sharing via AWS RAM or S3 cross-account
- Integrates with Pipelines (Register Model step) + CodePipeline
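Approval-gated deployment hinges on flipping a model package version's status, done via the `UpdateModelPackage` API; a sketch with a placeholder ARN:

```python
# Sketch of approving a model package version in the Model Registry.
# The ARN is a placeholder.
approval_request = {
    "ModelPackageArn": "arn:aws:sagemaker:us-east-1:123456789012:model-package/my-group/1",
    "ModelApprovalStatus": "Approved",
    "ApprovalDescription": "Passed offline evaluation threshold",
}
# Real call: boto3.client("sagemaker").update_model_package(**approval_request)
```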
| Inference Type | Latency | Payload Limit | Best For |
|---|---|---|---|
| Real-time Endpoint | ms | 6 MB | Interactive, low-latency apps |
| Async Inference | seconds–minutes | 1 GB | Large payloads, video/audio |
| Batch Transform | minutes–hours | No limit (S3) | Offline batch scoring |
| Serverless Inference | ms (warm) / s (cold) | 6 MB | Infrequent / spiky traffic |
ML Solution Monitoring, Maintenance, and Security
Covers SageMaker Model Monitor, Clarify for bias and explainability, CloudWatch integration, model retraining triggers, VPC isolation, IAM policies, and encryption best practices for ML workloads.
Monitoring & Security
Keeping models trustworthy, compliant, and secure in production
📈 SageMaker Model Monitor
- Detects model quality degradation in production over time
- 4 monitor types: Data Quality, Model Quality, Bias Drift, Feature Attribution Drift
- Baseline: compute statistics from training data (run once)
- Schedule: hourly/daily monitoring jobs comparing live traffic to baseline
- Violations → CloudWatch metrics → SNS / Lambda alerts
- Requires data capture enabled on endpoint (logs to S3)
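The data-capture prerequisite is configured on the endpoint itself, via the `DataCaptureConfig` block of `CreateEndpointConfig`; a sketch with a placeholder bucket:

```python
# Sketch of the DataCaptureConfig passed in create_endpoint_config,
# the prerequisite for Model Monitor. Bucket is a placeholder.
data_capture_config = {
    "EnableCapture": True,
    "InitialSamplingPercentage": 100,   # capture every request/response
    "DestinationS3Uri": "s3://my-bucket/data-capture/",
    "CaptureOptions": [
        {"CaptureMode": "Input"},       # log request payloads
        {"CaptureMode": "Output"},      # log response payloads
    ],
}
```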
⚖️ SageMaker Clarify
- Detects bias and explains model predictions
- Pre-training bias: in the dataset before training (e.g., class imbalance)
- Post-training bias: in the trained model's predictions
- Bias metrics: Class Imbalance (CI), DPL, KL divergence, DPPL
- Explainability: SHAP values for per-feature attribution
- Runs as a Processing Job; outputs bias report + explainability report
- Integrated with Model Monitor for continuous bias drift detection
📊 CloudWatch for ML
- SageMaker endpoints publish metrics: Invocations, Latency, Errors
- Model Monitor violations → custom CloudWatch metrics
- Training jobs: log training/validation loss, custom metrics via regex
- CloudWatch Logs: all container stdout/stderr captured automatically
- Alarms → SNS → Lambda for automated retraining triggers
- Dashboards: build custom ML ops visibility dashboards
🔒 IAM for SageMaker
- Execution role: attached to SageMaker resource (Training, Endpoint, Pipeline)
- Must have: `s3:GetObject` / `s3:PutObject` on data buckets
- ECR access: `ecr:GetDownloadUrlForLayer` for custom containers
- SageMaker Role Manager: provides predefined ML personas (Data Scientist, MLOps Engineer)
- Condition keys: `sagemaker:NetworkIsolation`, `sagemaker:VolumeKmsKey`
- Resource-based policies not supported; use IAM roles only
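A least-privilege execution-role policy covering the S3 and ECR permissions above might look like the following sketch (bucket name is a placeholder; exact ECR actions needed depend on how the container is pulled):

```python
import json

# Sketch of a least-privilege execution-role policy for a training job:
# S3 data access scoped to one bucket plus ECR pull permissions for
# custom containers. Bucket name is a placeholder.
execution_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-ml-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
            ],
            "Resource": "*",
        },
    ],
}
policy_json = json.dumps(execution_role_policy, indent=2)
```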
🛡️ Network & VPC Security
- VPC mode: run training/processing jobs in customer VPC (private subnets)
- Requires: S3 VPC Gateway endpoint (free) + ECR VPC Interface endpoint
- Network isolation: training container has no internet access (supply all dependencies)
- SageMaker Studio: isolate per user profile in VPC with private endpoints
- Private SageMaker API: PrivateLink to avoid traffic over internet
- Security groups: control traffic to/from training containers
🔐 Encryption for ML Workloads
- Training data: SSE-KMS on S3 with CMK; specify `VolumeKmsKeyId` for EBS
- Model artifacts: S3 SSE-KMS; specify `OutputDataConfig.KmsKeyId`
- Inter-container encryption: enable for distributed training
- SageMaker Feature Store: KMS encryption for both stores
- SageMaker Notebooks / Studio: EBS volume encrypted with KMS
- CloudWatch logs: KMS log group encryption
| Security Control | Mechanism | Key Detail |
|---|---|---|
| Training data encryption | SSE-KMS on S3 + EBS KMS | Specify VolumeKmsKeyId in training config |
| Network isolation | VPC + NetworkIsolation=true | No internet from training container |
| Access control | IAM execution roles | Least-privilege per SageMaker resource type |
| Audit trail | CloudTrail data events | Log all SageMaker API calls + S3 access |
| Private endpoint | PrivateLink (VPC Interface) | Avoid SageMaker API traffic over internet |
100 Practice Questions
Test your MLA-C01 knowledge across all four domains