AWS Certified · Associate Tier

Machine Learning Engineer – Associate

Master every topic for the AWS MLA-C01 exam. Covers data preparation, feature engineering, model training, deployment, MLOps workflows, monitoring, and security for ML solutions on SageMaker and related AWS services.

65 scored questions · 170 minutes · 720 pass score · 4 domains

Domain Breakdown

The MLA-C01 exam validates your ability to build, train, tune, deploy, and monitor ML models on AWS using Amazon SageMaker and related AI/ML services. It spans the full ML engineering lifecycle.

Domain | Topic | Weight
D1 | Data Preparation for Machine Learning | 28%
D2 | ML Model Development | 26%
D3 | Deployment and Orchestration of ML Workflows | 22%
D4 | ML Solution Monitoring, Maintenance, and Security | 24%
🎯
Exam format: 65 scored + up to 15 unscored questions (80 total), 170 minutes. Passing score: 720/1000. No penalty for guessing. Multiple-choice and multiple-response formats. Scenario-based questions focus on choosing the right SageMaker feature or architecture pattern for given ML requirements.

Data Preparation for Machine Learning

Covers data ingestion, exploratory analysis, feature engineering, transformation, and labeling. SageMaker Data Wrangler, Processing Jobs, Feature Store, and Ground Truth are key services.

01

Data Ingestion & Preparation

Transforming raw data into ML-ready features

28%
OF EXAM

📥 Data Sources & Ingestion

  • S3: primary data lake for ML datasets (structured + unstructured)
  • Athena: query structured S3 data for feature extraction
  • Redshift: SQL-based feature retrieval at scale
  • Kinesis: streaming data into SageMaker Feature Store
  • AWS Glue: ETL for transforming and cataloging raw data
  • Lake Formation: governed, secure data lake for ML
  • SageMaker Data Wrangler: GUI-based data prep + 300+ transforms

🔧 SageMaker Data Wrangler

  • Visual, no-code data preparation within SageMaker Studio
  • 300+ built-in transforms: normalization, imputation, encoding
  • Data quality & insights: detect anomalies, missing values
  • Exports to: Processing Job, Training Job, Feature Store, Pipeline
  • Supports: S3, Redshift, Athena, Salesforce, and Snowflake connectors
  • Auto-generates feature transformation code (Python/PySpark)

⚙️ SageMaker Processing Jobs

  • Run custom Python / Spark scripts at scale on managed infrastructure
  • Containers: scikit-learn, Spark, MXNet, or a custom Docker image
  • Inputs/outputs from/to S3; supports distributed Spark processing
  • Use for: feature engineering, model evaluation, data validation
  • Instances: ml.m5, ml.c5, ml.p3 (GPU) as needed
  • Managed infrastructure — no cluster setup required

🗄️ SageMaker Feature Store

  • Centralized repo for ML features — online + offline stores
  • Online store: low-latency (ms) real-time inference lookups
  • Offline store: S3-backed for training and batch inference
  • Feature groups: schema + record identifier + event time
  • Ingestion: via SDK, Kinesis, or SageMaker Pipelines
  • Prevents training/serving skew by sharing features

🏷️ SageMaker Ground Truth

  • Managed data labeling service with human/automated workflows
  • Labeling workforce: Amazon Mechanical Turk, private team, vendor
  • Auto-labeling with active learning (can cut manual labeling effort by up to ~70%)
  • Task types: image classification, bounding box, NER, 3D point cloud
  • Output: labeled dataset in S3 with augmented manifest format
  • Ground Truth Plus: fully managed end-to-end labeling project

📊 Feature Engineering Techniques

  • Normalization/Standardization: scale numeric features (MinMax, Z-score)
  • One-Hot Encoding: convert categorical to binary vectors
  • Imputation: fill missing values (mean, median, model-based)
  • Binning: discretize continuous variables into buckets
  • Log transform: normalize skewed distributions
  • TF-IDF, word embeddings: text feature extraction
  • PCA: dimensionality reduction for high-dim features
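The numeric transforms above can be sketched in a few lines of plain Python. This is an illustrative sketch with made-up helper names; in practice you would reach for scikit-learn, Data Wrangler transforms, or a Processing Job.

```python
import math

def min_max_scale(xs):
    """Min-max normalization: rescale values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Standardization: zero mean, unit variance (population std)."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

def one_hot(value, categories):
    """One-hot encoding: categorical value -> binary vector."""
    return [1 if value == c else 0 for c in categories]

def mean_impute(xs):
    """Imputation: replace missing values (None) with the column mean."""
    observed = [x for x in xs if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in xs]

print(min_max_scale([10, 20, 30]))               # [0.0, 0.5, 1.0]
print(one_hot("red", ["red", "green", "blue"]))  # [1, 0, 0]
print(mean_impute([1.0, None, 3.0]))             # [1.0, 2.0, 3.0]
```

Log transforms (`math.log1p`) and TF-IDF follow the same pattern: a deterministic function from raw column to feature column.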
Key Concept
Training-Serving Skew
Training-serving skew occurs when features computed during training differ from features at inference time. SageMaker Feature Store solves this: the same feature pipeline writes to the offline store (training) and online store (inference), guaranteeing consistency. Always use the Feature Store when features are shared across training and real-time serving.
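The fix for skew is structural: route training and serving through one shared transform. A minimal sketch in plain Python (feature names and the `build_features` helper are illustrative, not a SageMaker API):

```python
import math

def build_features(raw):
    """Single source of truth for feature logic (names are illustrative)."""
    return {
        "amount_log": math.log1p(raw["amount"]),
        "is_weekend": 1 if raw["day_of_week"] in (5, 6) else 0,
    }

# Training time: applied to historical records before writing the offline
# store. Inference time: applied to the live request before the model call.
# Same function in both paths -> identical features -> no skew.
train_row = build_features({"amount": 100.0, "day_of_week": 6})
serve_row = build_features({"amount": 100.0, "day_of_week": 6})
assert train_row == serve_row
```

Feature Store operationalizes this idea: one ingestion pipeline writes the output of the shared transform to both stores.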
💡
Exam trap: Data Wrangler is for interactive, GUI-based exploration and small-to-medium datasets. For large-scale production feature transforms, use SageMaker Processing Jobs (with Spark or Scikit-learn) instead. Data Wrangler can export recipes to Processing Jobs for productionization.
Service | Use Case | Scale
Data Wrangler | Interactive GUI prep, EDA, 300+ transforms | Exploratory / medium
Processing Jobs | Custom scripts (Spark/Sklearn) in managed infra | Production / large
Glue ETL | Schema evolution, catalog management, serverless ETL | Enterprise / very large
Feature Store | Centralized feature registry, prevents train/serve skew | Shared feature platform
Ground Truth | Human + auto labeling with active learning | Any dataset size

ML Model Development

Covers SageMaker built-in algorithms, custom training, hyperparameter tuning, distributed training strategies, experiment tracking, model evaluation, and AutoML with Autopilot.

02

Training & Model Development

From raw features to production-ready models

26%
OF EXAM

🤖 SageMaker Built-In Algorithms

  • XGBoost: tabular data, classification/regression; CSV or libsvm
  • Linear Learner: regression + classification, large-scale
  • K-Means: unsupervised clustering
  • PCA: dimensionality reduction
  • Random Cut Forest (RCF): anomaly detection in time series
  • BlazingText: Word2Vec + text classification (very fast)
  • DeepAR: probabilistic time-series forecasting
  • Object Detection / Image Classification: computer vision
  • Seq2Seq: machine translation, summarization
  • IP Insights: detects anomalous IP usage patterns
  • Factorization Machines: recommendation systems (sparse data)

🏋️ Training Jobs

  • SageMaker manages compute: provision, train, terminate instances
  • Managed Spot Training: up to 90% savings; requires checkpointing
  • Checkpointing: save model state to S3; resume after spot interruption
  • Input modes: File (copy S3 to local), Pipe (stream from S3), FastFile
  • File mode: best for small-medium datasets; Pipe: large datasets
  • Training script + container image + hyperparameters + resource config
  • Output artifacts → S3; CloudWatch logs per training job
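The checkpoint/resume pattern that Managed Spot Training depends on can be sketched in plain Python (no SageMaker SDK; the training loop and weight update are fake). On SageMaker, the local checkpoint directory is synced to the S3 URI passed as `checkpoint_s3_uri`:

```python
import json
import os
import tempfile

# Checkpoint file standing in for the S3-synced checkpoint directory.
CKPT = os.path.join(tempfile.gettempdir(), "mla_demo_ckpt.json")

def save_checkpoint(epoch, weights):
    with open(CKPT, "w") as f:
        json.dump({"epoch": epoch, "weights": weights}, f)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)            # resume after a spot interruption
    return {"epoch": 0, "weights": [0.0]}  # first run: start fresh

def train(total_epochs):
    state = load_checkpoint()
    for epoch in range(state["epoch"], total_epochs):
        state["weights"] = [w + 0.1 for w in state["weights"]]  # fake update
        state["epoch"] = epoch + 1
        save_checkpoint(state["epoch"], state["weights"])
    return state

if os.path.exists(CKPT):
    os.remove(CKPT)        # clean slate for the demo
train(2)                   # pretend the spot instance is reclaimed here
final = train(5)           # restart resumes at epoch 2, not epoch 0
```

Without the `load_checkpoint` call, every spot interruption would restart training from scratch, erasing the cost savings.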

🎛️ Hyperparameter Tuning (HPO)

  • Automatic Model Tuning (AMT): finds best hyperparameter combinations
  • Strategies: Bayesian (learns from prior runs), Random, Grid, Hyperband
  • Define: parameter ranges, objective metric, max training jobs
  • Bayesian = best for continuous spaces; Grid = exhaustive discrete
  • Warm start: continue tuning from previous job (2 types: IDENTICAL_DATA_AND_ALGORITHM, TRANSFER_LEARNING)
  • Early stopping: reduces cost by stopping poor-performing jobs
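The Random strategy above is simple enough to sketch in pure Python. The toy `objective` stands in for a real training job's validation metric; everything here is illustrative, not the AMT API:

```python
import random

def objective(learning_rate):
    """Toy validation loss, minimized near lr = 0.1 (illustrative only)."""
    return (learning_rate - 0.1) ** 2

def random_search(n_trials, lo=0.001, hi=1.0, seed=0):
    rng = random.Random(seed)
    best_lr, best_loss = None, float("inf")
    for _ in range(n_trials):
        lr = rng.uniform(lo, hi)        # each sample = one "training job"
        loss = objective(lr)
        if loss < best_loss:            # keep the best result seen so far
            best_lr, best_loss = lr, loss
    return best_lr, best_loss

best_lr, best_loss = random_search(50)
```

Bayesian search differs in the sampling line: instead of `rng.uniform`, it fits a model of the objective from prior trials and samples where improvement is most likely, which is why it needs fewer jobs on continuous spaces.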

⚡ Distributed Training

  • Data parallelism: split dataset across GPUs; each GPU has full model copy
  • Model parallelism: split model across GPUs; for very large models
  • SageMaker Distributed Training Library: SDP + SMP
  • SDP (data parallel): AllReduce gradient aggregation
  • SMP (model parallel): automatic pipeline partitioning
  • Use p3, p4, g4dn instances; multiple instances via instance_count
  • Frameworks: TensorFlow, PyTorch, MXNet natively supported

📐 Model Evaluation Metrics

  • Classification: Accuracy, Precision, Recall, F1, AUC-ROC, Log Loss
  • Regression: RMSE, MAE, R-squared, MSE
  • Ranking: NDCG, MAP, MRR
  • Confusion matrix: TP, TN, FP, FN breakdown
  • Cross-validation: k-fold for robust evaluation on small datasets
  • Bias metrics: use SageMaker Clarify for fairness evaluation
  • SageMaker Experiments: track metrics, parameters, artifacts per run
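The core classification metrics all derive from the confusion matrix counts. A minimal sketch (the example counts are made up):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, accuracy from a binary confusion matrix."""
    precision = tp / (tp + fp)          # of predicted positives, how many real
    recall = tp / (tp + fn)             # of real positives, how many found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Imbalanced example: accuracy looks great, recall tells the real story.
m = classification_metrics(tp=80, fp=20, fn=40, tn=860)
# precision = 0.8, recall ~ 0.667, accuracy = 0.94
```

This is why exam scenarios with rare positives (fraud, anomalies) steer you toward recall or F1 rather than accuracy.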

🚀 SageMaker Autopilot (AutoML)

  • Fully automated end-to-end ML pipeline for tabular data
  • Auto explores: feature preprocessing, algorithm selection, HPO
  • Modes: Auto, Ensembling, HPO
  • Generates notebooks showing the best pipeline logic (explainable)
  • Candidate definitions: configurable max candidates, max runtime
  • Output: best model + leaderboard of all trials
  • Supports: binary + multiclass classification, regression
Key Decision Rule
Choosing a Built-In Algorithm
Tabular structured data → XGBoost or Linear Learner. Anomaly detection → Random Cut Forest. Time-series forecasting → DeepAR. Text classification (fast) → BlazingText. Recommendation with sparse data → Factorization Machines. Unsupervised clustering → K-Means. Computer vision → Object Detection / Image Classification / Semantic Segmentation built-ins.
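The decision rule above is essentially a lookup table; codifying it makes that explicit. The mapping mirrors this guide's guidance, and the task labels are illustrative:

```python
# Task -> SageMaker built-in algorithm, per the decision rule above.
BUILTIN_FOR_TASK = {
    "tabular": "XGBoost",
    "anomaly_detection": "Random Cut Forest",
    "forecasting": "DeepAR",
    "text_classification": "BlazingText",
    "sparse_recommendation": "Factorization Machines",
    "clustering": "K-Means",
}

def pick_builtin(task):
    """Return the built-in to reach for, or fall back to a custom container."""
    return BUILTIN_FOR_TASK.get(task, "consider a custom container")

print(pick_builtin("forecasting"))  # DeepAR
```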
Algorithm | Task | Input Format
XGBoost | Classification / Regression | CSV, libsvm, Parquet
Linear Learner | Classification / Regression | CSV, RecordIO-protobuf
DeepAR | Time-series forecast | JSON Lines
Random Cut Forest | Anomaly detection | CSV, RecordIO-protobuf
BlazingText | Text classification / Word2Vec | Plain text (one sentence/line)
Factorization Machines | Recommendation (sparse) | RecordIO-protobuf
K-Means | Clustering | CSV, RecordIO-protobuf

Deployment and Orchestration of ML Workflows

Covers inference endpoint types, A/B testing, batch transform, SageMaker Pipelines, Model Registry, MLOps with CI/CD, and integration with Step Functions and EventBridge.

03

Deployment & MLOps

Getting models to production and keeping them there

22%
OF EXAM

🌐 Real-Time Inference Endpoints

  • Persistent HTTPS endpoint; invoke via invoke_endpoint()
  • Latency: milliseconds; ideal for interactive applications
  • Auto Scaling: scale based on InvocationsPerInstance metric
  • Production Variants: A/B test by routing traffic % to multiple models
  • Instance types: ml.c5, ml.m5 (CPU); ml.g4dn (GPU for deep learning)
  • Elastic Inference: attach GPU fraction to CPU instance (cost-efficient)
  • Data capture: log request/response payloads to S3 for monitoring

📦 Batch Transform

  • Offline, asynchronous scoring of large datasets from S3
  • No persistent endpoint; creates/destroys compute per job
  • Input: S3 file(s); Output: S3 results with ".out" suffix
  • Ideal for: nightly scoring, pre-computing predictions at scale
  • Associating predictions: use input filter, output filter, join source
  • Split type: line / RecordIO / TFRecord; batch strategy: MultiRecord

⏱️ Asynchronous & Serverless Inference

  • Async Inference: queue requests; ideal for large payloads (up to 1 GB)
  • Async: scales to 0 when idle (cost-saving); SNS notification on completion
  • Serverless Inference: pay-per-use; no instance management
  • Serverless: cold start latency; best for intermittent/unpredictable traffic
  • Specify memory size (1–6 GB) and max concurrency for serverless

🔀 Multi-Model & Multi-Container Endpoints

  • Multi-Model Endpoints (MME): host thousands of models on ONE endpoint
  • MME: models loaded on demand from S3; share instance resources
  • MME best for: many similar small models (per-tenant, per-customer)
  • Multi-Container Endpoints: sequential or direct invocation of containers
  • Direct: invoke any container independently on same endpoint
  • Serial: pipeline inference where output of one feeds the next

🔧 SageMaker Pipelines (MLOps)

  • Directed Acyclic Graph (DAG) of ML workflow steps
  • Steps: Processing, Training, Evaluation, Condition, Register, Transform
  • Condition step: branch based on evaluation metric thresholds
  • Cached steps: skip re-execution if inputs unchanged (saves cost)
  • Integrated with Model Registry for approval-gated deployment
  • Triggers: EventBridge, API, SageMaker Studio UI

📋 SageMaker Model Registry

  • Central catalog of versioned, approved models
  • Approval status: Pending → Approved → Rejected
  • Deploy only Approved models to production endpoints
  • Stores: model artifacts, metadata, metrics, tags per version
  • Cross-account model sharing via AWS RAM or S3 cross-account
  • Integrates with Pipelines (Register Model step) + CodePipeline
MLOps Pattern
End-to-End MLOps with SageMaker
A fully automated MLOps pipeline: 1) Trigger via EventBridge (scheduled or data arrival). 2) SageMaker Pipelines runs Processing → Training → Evaluation → Condition (check accuracy threshold). 3) If condition passes, Register Model with Approved status in Model Registry. 4) Lambda or CodePipeline detects approval → deploys to SageMaker Endpoint. 5) Model Monitor runs on endpoint to detect drift.
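The approval gate in steps 2–3 of the pattern above can be sketched in plain Python. The function names and registry structure are illustrative, not SageMaker SDK calls:

```python
def condition_step(eval_metrics, min_accuracy=0.90):
    """Pipeline Condition step: pass only if the metric clears the bar."""
    return eval_metrics["accuracy"] >= min_accuracy

def register_model(eval_metrics, registry, min_accuracy=0.90):
    """Register step: record a new version with an approval status."""
    status = "Approved" if condition_step(eval_metrics, min_accuracy) else "Rejected"
    registry.append({"version": len(registry) + 1,
                     "status": status,
                     "metrics": eval_metrics})
    return status

registry = []
register_model({"accuracy": 0.93}, registry)   # passes gate -> Approved
register_model({"accuracy": 0.85}, registry)   # fails gate -> Rejected

# Deployment automation (Lambda/CodePipeline) only picks up Approved versions.
deployable = [m for m in registry if m["status"] == "Approved"]
```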
Inference Type | Latency | Payload Limit | Best For
Real-time Endpoint | ms | 6 MB | Interactive, low-latency apps
Async Inference | seconds–minutes | 1 GB | Large payloads, video/audio
Batch Transform | minutes–hours | No limit (S3) | Offline batch scoring
Serverless Inference | ms (warm) / s (cold) | 6 MB | Infrequent / spiky traffic
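The trade-offs in the table above can be captured as a small decision helper. The payload thresholds follow the documented limits; the routing logic itself is this guide's rule of thumb, not an AWS API:

```python
def choose_inference(payload_mb, needs_ms_latency, traffic):
    """Pick a SageMaker inference option.
    traffic: 'steady', 'spiky', or 'offline' (illustrative labels)."""
    if payload_mb > 6:
        # Real-time and serverless cap payloads at 6 MB.
        return "Batch Transform" if traffic == "offline" else "Async Inference"
    if traffic == "offline":
        return "Batch Transform"
    if needs_ms_latency and traffic == "steady":
        return "Real-time Endpoint"
    return "Serverless Inference"   # intermittent traffic, cold starts OK

print(choose_inference(500, False, "online"))  # Async Inference
print(choose_inference(1, True, "steady"))     # Real-time Endpoint
```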

ML Solution Monitoring, Maintenance, and Security

Covers SageMaker Model Monitor, Clarify for bias and explainability, CloudWatch integration, model retraining triggers, VPC isolation, IAM policies, and encryption best practices for ML workloads.

04

Monitoring & Security

Keeping models trustworthy, compliant, and secure in production

24%
OF EXAM

📈 SageMaker Model Monitor

  • Detects model quality degradation in production over time
  • 4 monitor types: Data Quality, Model Quality, Bias Drift, Feature Attribution Drift
  • Baseline: compute statistics from training data (run once)
  • Schedule: hourly/daily monitoring jobs comparing live traffic to baseline
  • Violations → CloudWatch metrics → SNS / Lambda alerts
  • Requires data capture enabled on endpoint (logs to S3)
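Conceptually, a Data Quality monitoring job compares live-traffic statistics against a baseline computed once from training data and emits a violation when drift exceeds a threshold. The real service computes far richer statistics; this pure-Python sketch only illustrates the mechanism:

```python
import math

def baseline_stats(xs):
    """One-time baseline from training data: mean and population std."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return {"mean": mean, "std": std}

def check_drift(baseline, live, max_sigmas=3.0):
    """Scheduled check: flag a violation if the live mean drifts more
    than max_sigmas baseline standard deviations (threshold is made up)."""
    live_mean = sum(live) / len(live)
    drift = abs(live_mean - baseline["mean"]) / baseline["std"]
    return {"drift_sigmas": drift, "violation": drift > max_sigmas}

base = baseline_stats([10, 11, 9, 10, 10, 11, 9, 10])
ok = check_drift(base, [10, 10, 11, 9])      # in line with baseline
bad = check_drift(base, [25, 26, 24, 25])    # violation -> CloudWatch/SNS
```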

⚖️ SageMaker Clarify

  • Detects bias and explains model predictions
  • Pre-training bias: in the dataset before training (e.g., class imbalance)
  • Post-training bias: in the trained model's predictions
  • Bias metrics: Class Imbalance (CI), DPL, KL divergence, DPPL
  • Explainability: SHAP values for per-feature attribution
  • Runs as a Processing Job; outputs bias report + explainability report
  • Integrated with Model Monitor for continuous bias drift detection
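The Class Imbalance (CI) pre-training metric has a simple closed form, CI = (n_a − n_d) / (n_a + n_d) over facet counts. A minimal sketch (the facet counts in the example are made up):

```python
def class_imbalance(n_advantaged, n_disadvantaged):
    """Clarify-style Class Imbalance (CI) over facet counts.
    Range [-1, 1]; 0 means the two facets are balanced."""
    return (n_advantaged - n_disadvantaged) / (n_advantaged + n_disadvantaged)

# A dataset with 900 samples of one facet and 100 of the other:
ci = class_imbalance(900, 100)   # 0.8 -> strongly imbalanced
```

A CI near ±1 before training is a signal to rebalance or reweight the dataset before any post-training bias metric is worth reading.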

📊 CloudWatch for ML

  • SageMaker endpoints publish metrics: Invocations, Latency, Errors
  • Model Monitor violations → custom CloudWatch metrics
  • Training jobs: log training/validation loss, custom metrics via regex
  • CloudWatch Logs: all container stdout/stderr captured automatically
  • Alarms → SNS → Lambda for automated retraining triggers
  • Dashboards: build custom ML ops visibility dashboards

🔒 IAM for SageMaker

  • Execution role: attached to SageMaker resource (Training, Endpoint, Pipeline)
  • Must have: s3:GetObject/PutObject on data buckets
  • ECR access: ecr:GetDownloadUrlForLayer for custom containers
  • SageMaker Role Manager: provides predefined ML personas (Data Scientist, MLOps Engineer)
  • Condition keys: sagemaker:NetworkIsolation, sagemaker:VolumeKmsKey
  • Resource-based policies not supported — use IAM roles only

🛡️ Network & VPC Security

  • VPC mode: run training/processing jobs in customer VPC (private subnets)
  • Requires: S3 VPC Gateway endpoint (free) + ECR VPC Interface endpoint
  • Network isolation: training container has no internet access (supply all dependencies)
  • SageMaker Studio: isolate per user profile in VPC with private endpoints
  • Private SageMaker API: PrivateLink to avoid traffic over internet
  • Security groups: control traffic to/from training containers

🔐 Encryption for ML Workloads

  • Training data: SSE-KMS on S3 with CMK; specify VolumeKmsKeyId for EBS
  • Model artifacts: S3 SSE-KMS; specify OutputDataConfig.KmsKeyId
  • Inter-container encryption: enable for distributed training
  • SageMaker Feature Store: KMS encryption for both stores
  • SageMaker Notebooks / Studio: EBS volume encrypted with KMS
  • CloudWatch logs: KMS log group encryption
Monitoring Decision Tree
Which Monitor Type to Use?
Data Quality Monitor: detect statistical drift in input features (schema violations, distribution changes). Model Quality Monitor: detect accuracy/F1 degradation vs. ground truth labels. Bias Drift Monitor: detect fairness metric changes (requires Clarify). Feature Attribution Drift: detect changes in SHAP values over time (requires Clarify). All require data capture enabled on the endpoint.
Security Control | Mechanism | Key Detail
Training data encryption | SSE-KMS on S3 + EBS KMS | Specify VolumeKmsKeyId in training config
Network isolation | VPC + NetworkIsolation=true | No internet from training container
Access control | IAM execution roles | Least-privilege per SageMaker resource type
Audit trail | CloudTrail data events | Log all SageMaker API calls + S3 access
Private endpoint | PrivateLink (VPC Interface) | Avoid SageMaker API traffic over internet

100 Practice Questions

Test your MLA-C01 knowledge across all four domains

100 Questions
D1 · 28 Questions
D2 · 26 Questions
D3 · 22 Questions
D4 · 24 Questions