Data Engineer
Associate
Master every topic for the AWS DEA-C01 exam. Covers data ingestion pipelines, transformation, storage, operations, and governance using Kinesis, Glue, Redshift, EMR, Athena, Lake Formation, and more.
Domain Breakdown
The DEA-C01 exam tests your ability to design, build, secure, and maintain data pipelines and data stores on AWS. The four domains span the full data engineering lifecycle.
| Domain | Topic | Weight |
|---|---|---|
| D1 | Data Ingestion and Transformation | 34% |
| D2 | Data Store Management | 26% |
| D3 | Data Operations and Support | 22% |
| D4 | Data Security and Governance | 18% |
Data Ingestion and Transformation
The largest domain. Covers streaming and batch ingestion, ETL with Glue, pipeline orchestration, and data format considerations. Kinesis and Glue are tested heavily.
Ingestion Services
Moving data from sources into the AWS data ecosystem at scale
🌊 Kinesis Data Streams (KDS)
- Real-time streaming, data retained 1–365 days
- Shards: 1 MB/s write, 2 MB/s read per shard
- Enhanced fan-out: 2 MB/s per consumer per shard, HTTP/2 push
- Ordering guaranteed per shard using partition key
- PutRecord / PutRecords (batch) write APIs
- Iterator types: TRIM_HORIZON (oldest), LATEST, AT_TIMESTAMP
- On-Demand vs Provisioned mode
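The per-shard limits above drive provisioned-mode sizing. A minimal pure-Python sketch of the capacity math (the 1,000 records/s per-shard write limit is the documented companion to the 1 MB/s limit; the traffic numbers in the example are made up):

```python
import math

def required_shards(write_mbps: float, records_per_sec: int, read_mbps: float,
                    consumers: int = 1, enhanced_fanout: bool = False) -> int:
    """Estimate the provisioned shard count for a Kinesis Data Stream.

    Per-shard limits: 1 MB/s or 1,000 records/s on writes; 2 MB/s on reads,
    shared across all consumers unless enhanced fan-out gives each consumer
    its own dedicated 2 MB/s pipe.
    """
    by_write_bytes = write_mbps / 1.0
    by_write_records = records_per_sec / 1000.0
    # Shared-throughput consumers compete for the same 2 MB/s per shard.
    effective_read = read_mbps if enhanced_fanout else read_mbps * consumers
    by_read = effective_read / 2.0
    return max(1, math.ceil(max(by_write_bytes, by_write_records, by_read)))

# 5 MB/s in, 3,000 records/s, three consumers each reading the full stream:
print(required_shards(5, 3000, 5, consumers=3))                         # 8
print(required_shards(5, 3000, 5, consumers=3, enhanced_fanout=True))   # 5
```

Note how enhanced fan-out removes the read side as the bottleneck, leaving the write throughput to determine the shard count.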
🚒 Kinesis Data Firehose (KDF)
- Fully managed, near-real-time (60s buffer minimum)
- Destinations: S3, Redshift, OpenSearch, Splunk, HTTP
- Built-in transformation via Lambda
- Auto-scaling, no shards to manage
- For Redshift: writes to S3 first, then COPY command
- Compression: GZIP, Snappy, ZIP on S3
- Cannot replay data — not a replay service
📊 Kinesis Data Analytics (KDA, now Amazon Managed Service for Apache Flink)
- Real-time SQL or Apache Flink on streaming data
- Input: KDS or KDF streams
- Output: KDS, KDF, Lambda
- Flink on KDA: stateful, exactly-once, Java/Scala/Python
- Use for: time-window aggregations, anomaly detection
- RANDOM_CUT_FOREST function for anomaly detection in SQL
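The time-window aggregations mentioned above can be illustrated without Flink. A minimal pure-Python sketch of a tumbling (fixed, non-overlapping) window count; the sensor event data is made up:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed, non-overlapping windows,
    the same kind of aggregation a Flink tumbling window performs."""
    counts = defaultdict(int)
    for ts, key in events:
        # Every event maps to exactly one window, aligned to window_seconds.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "sensor-a"), (30, "sensor-a"), (65, "sensor-a"), (70, "sensor-b")]
print(tumbling_window_counts(events, 60))
# {(0, 'sensor-a'): 2, (60, 'sensor-a'): 1, (60, 'sensor-b'): 1}
```

A sliding window would instead assign each event to multiple overlapping windows; tumbling windows are the simpler, more common exam scenario.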
🗃️ Amazon MSK (Managed Streaming for Apache Kafka)
- Fully managed Apache Kafka clusters
- Topics, partitions, producers, consumers — same Kafka API
- MSK Connect: Kafka Connect workers, fully managed
- MSK Serverless: no capacity provisioning
- Messages persist on broker storage (up to EBS limits)
- Encrypted at rest + in transit; VPC isolation
🔄 AWS DMS
- Database Migration Service — heterogeneous or homogeneous
- Source stays live during migration (CDC support)
- SCT: Schema Conversion Tool for different DB engines
- Endpoints: RDS, Aurora, S3, DynamoDB, Kinesis, Redshift
- Full Load + CDC = minimal downtime migration
- Replication instance runs on EC2 in your VPC
📤 AWS DataSync & Transfer Family
- DataSync: scheduled transfer between on-prem NFS/SMB and S3/EFS/FSx
- Agent required for on-prem; no agent for AWS-to-AWS
- Up to 10x faster than open-source tools
- Transfer Family: SFTP/FTP/FTPS endpoints for S3 or EFS
- Preserves file permissions and metadata (DataSync)
- Use DataSync for one-time or periodic data migrations
AWS Glue — Core ETL Engine
Serverless data integration service — central to the DEA-C01 exam
🕷️ Glue Data Catalog & Crawlers
- Metadata repository: databases, tables, schemas, partitions
- Crawlers: auto-discover schema from S3, RDS, Redshift, DynamoDB
- Catalog used by Athena, Redshift Spectrum, EMR, Lake Formation
- Partition discovery: crawl new partitions on schedule
- Custom classifiers for non-standard formats
- Connection types: JDBC, S3, Kinesis, Kafka, MongoDB
⚙️ Glue ETL Jobs
- Spark-based (PySpark/Scala) or Python shell jobs
- DynamicFrame: flexible schema, handles schema evolution
- vs DataFrame: DynamicFrame better for messy/evolving data
- Job bookmarks: track processed data, avoid re-processing
- Glue Studio: visual drag-and-drop ETL
- Worker types: Standard, G.1X, G.2X; G.025X for low-volume streaming jobs
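Job bookmarks are worth internalizing: the job persists a high-water mark and only processes data newer than it. A minimal pure-Python simulation of that behavior (the object keys and timestamps are made up; real bookmarks are managed by Glue internally):

```python
def new_files_since_bookmark(objects, bookmark):
    """Mimic a Glue job bookmark: process only objects written after the
    last committed high-water mark, then advance the mark."""
    pending = sorted(k for k, mtime in objects.items() if mtime > bookmark)
    new_mark = max([bookmark] + [objects[k] for k in pending])
    return pending, new_mark

objects = {"run1/a.json": 100, "run1/b.json": 110}
batch1, mark = new_files_since_bookmark(objects, 0)
print(batch1)        # first run processes everything

objects["run2/c.json"] = 150     # new data lands between runs
batch2, mark2 = new_files_since_bookmark(objects, mark)
print(batch2)        # second run sees only the new object
```

This is why disabling bookmarks (or resetting them) causes a full re-read, a common exam troubleshooting scenario.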
🔀 Glue Workflows & Triggers
- Workflows: chain crawlers, jobs, and triggers
- Trigger types: On-demand, Schedule (cron), Job completion, Conditional
- Conditional triggers: based on job success/failure
- Glue DataBrew: visual data prep, no code required
- Glue Streaming: reads from Kinesis/Kafka, processes with micro-batches
📂 Data Formats & Partitioning
- Columnar: Parquet, ORC — best for analytics (compression + query perf)
- Row: Avro, JSON, CSV — good for writes and streaming
- Hive-style partitioning: s3://bucket/year=2024/month=01/
- Partition projection in Athena: avoids needing a crawler for time-series data
- Convert to Parquet/ORC via Glue for Athena cost savings
- Snappy: fast compression; GZIP: higher compression ratio
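Hive-style paths and the partition pruning they enable can be sketched in a few lines of pure Python (the bucket and table names are made up; `prune` stands in for what Athena does when the WHERE clause filters on partition columns):

```python
from datetime import date

def partition_path(bucket, table, d):
    """Build a Hive-style key layout that Athena/Glue map to partition columns."""
    return f"s3://{bucket}/{table}/year={d:%Y}/month={d:%m}/day={d:%d}/"

def prune(paths, year, month):
    """Keep only the partitions a `WHERE year=... AND month=...` query scans."""
    return [p for p in paths
            if f"/year={year}/" in p and f"month={month:02d}/" in p]

paths = [partition_path("my-lake", "events", date(2024, m, 1)) for m in (1, 2, 3)]
print(prune(paths, 2024, 1))   # only the January partition is scanned
```

Pruning two of three monthly partitions cuts scanned bytes (and Athena cost) by roughly two thirds before any columnar-format savings apply.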
🌀 AppFlow & EventBridge Pipes
- AppFlow: no-code SaaS data integration (Salesforce, Zendesk, Slack → S3/Redshift)
- Scheduled or event-triggered flows
- Data mapping and transformation built-in
- EventBridge Pipes: point-to-point connections between event sources and targets
- Filtering and enrichment via Lambda/Step Functions
⚡ Lambda & SQS/SNS for Ingestion
- Lambda: event-driven micro-ETL, 15 min max runtime
- SQS trigger for Lambda: batch size up to 10,000 messages
- DLQ: failed messages after max retries for inspection
- SNS fan-out → multiple SQS queues for parallel processing
- S3 event notifications → Lambda/SQS/SNS on PUT/DELETE
- Lambda destination: on success/failure route to SQS, SNS, EventBridge
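The DLQ redrive behavior above can be simulated without AWS. A minimal pure-Python sketch in which a message moves to a dead-letter list after `max_receives` failed attempts, mirroring an SQS redrive policy's maxReceiveCount (the handler and message contents are made up):

```python
def process_with_redrive(messages, handler, max_receives=3):
    """Sketch of SQS redrive: a message that keeps failing is moved to the
    dead-letter queue after max_receives delivery attempts."""
    done, dlq = [], []
    for msg in messages:
        for attempt in range(1, max_receives + 1):
            try:
                handler(msg)
                done.append(msg)
                break
            except Exception:
                if attempt == max_receives:
                    dlq.append(msg)   # poison message: park it for inspection
    return done, dlq

def handler(msg):
    if "bad" in msg:
        raise ValueError("poison message")

done, dlq = process_with_redrive(["ok-1", "bad-2", "ok-3"], handler)
print(done, dlq)   # ['ok-1', 'ok-3'] ['bad-2']
```

Parking poison messages instead of retrying forever keeps the main queue draining, which is the point of the DLQ pattern.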
Choosing between the streaming services:
- KDF: simple load to S3/Redshift/OpenSearch, no custom consumer code needed, near-real-time is acceptable.
- MSK: Kafka ecosystem — existing Kafka code, need Kafka Connect, topic-based pub/sub with long retention.
Data Store Management
Covers selecting the right storage layer for data use cases — S3 as data lake foundation, Redshift for analytics DW, DynamoDB for high-throughput NoSQL, and Lake Formation for governance.
Storage Architecture
Matching data access patterns to the right AWS storage service
🪣 S3 as Data Lake Foundation
- 11 nines (99.999999999%) durability, 3+ AZs
- Optimal partitioning: /year/month/day/hour/ for time-series data
- S3 prefix performance: 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix
- Multi-part upload: recommended for files > 100 MB
- S3 Transfer Acceleration: CloudFront edge → faster uploads
- Object tagging: cost allocation, lifecycle rules, access control
- S3 Inventory: list all objects + metadata for audit/compliance
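Multipart upload sizing follows simple arithmetic from S3's hard limits (5 MiB to 5 GiB per part, at most 10,000 parts per upload). A pure-Python sketch of the planning step; the default part size mirrors the 100 MB guidance above:

```python
import math

def multipart_plan(size_bytes, part_size=100 * 1024**2):
    """Plan an S3 multipart upload: S3 caps an upload at 10,000 parts,
    so very large objects force a larger part size."""
    parts = math.ceil(size_bytes / part_size)
    if parts > 10_000:
        # Grow the part size until the object fits in 10,000 parts.
        part_size = math.ceil(size_bytes / 10_000)
        parts = math.ceil(size_bytes / part_size)
    return parts, part_size

print(multipart_plan(1 * 1024**3))   # 1 GiB object in 100 MiB parts: 11 parts
print(multipart_plan(5 * 1024**4))   # 5 TiB (the S3 max object size)
```

Multipart also lets parts upload in parallel and retry independently, which is why it is recommended well below the point where it becomes mandatory.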
🗂️ S3 Storage Classes
- Standard: frequent access, ms latency
- Standard-IA: infrequent, retrieval fee, min 30 days
- One Zone-IA: single AZ, 20% cheaper, for re-creatable data
- Glacier Instant: ms retrieval, min 90 days
- Glacier Flexible: minutes–hours, min 90 days
- Glacier Deep Archive: 12h retrieval, cheapest, min 180 days
- Intelligent-Tiering: auto-moves, no retrieval fee, monitoring fee
🏗️ Amazon Redshift Architecture
- Leader node: query planning, result aggregation
- Compute nodes: execute queries, store data slices
- Redshift Serverless: no cluster management, RPU-based pricing
- AQUA: distributed cache layer, hardware-accelerated queries
- RA3 nodes: managed storage, decouple compute from storage
- Snapshot backups to S3: automated (1–35 days) or manual
📋 Redshift Distribution & Sort Keys
- EVEN: default, round-robin, minimizes skew for no clear join key
- KEY: distribute rows by column value — co-locate join keys
- ALL: full copy on every node — for small dimension tables
- AUTO: Redshift decides based on table size
- Sort key: COMPOUND (multi-col, order matters) or INTERLEAVED (each col equal weight)
- VACUUM: reclaim space, re-sort rows after DELETEs/UPDATEs
- ANALYZE: update statistics for query planner
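KEY distribution is easiest to see as a hash placement rule. A pure-Python sketch of why matching join keys end up co-located (CRC32 stands in for Redshift's internal hash function, and the order rows are made up):

```python
import zlib

def key_distribute(rows, dist_key, nodes):
    """Place each row on hash(dist_key) % nodes, as KEY distribution does,
    so rows sharing a join key land on the same node."""
    placement = {}
    for row in rows:
        node = zlib.crc32(str(row[dist_key]).encode()) % nodes
        placement.setdefault(node, []).append(row)
    return placement

orders = [{"customer_id": 7, "total": 10},
          {"customer_id": 7, "total": 25}]
placed = key_distribute(orders, "customer_id", nodes=4)
# Both customer 7 rows hash to the same node, so a join on customer_id
# needs no network redistribution at query time.
print(placed)
```

If a table distributed on a skewed column piles most rows onto one node, you get the data-skew symptom the EVEN style avoids.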
📥 Redshift COPY & Spectrum
- COPY command: best way to bulk load — parallel from S3/DynamoDB/EMR
- COPY from S3 uses IAM role on cluster — no access keys
- Manifest file: load specific S3 files, avoid partial loads
- Redshift Spectrum: query S3 data directly without loading
- Spectrum requires Glue Data Catalog or Athena catalog
- Spectrum scales independently — query exabytes of S3 data
- WLM (Workload Management): query queues, priority, concurrency scaling
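The manifest file mentioned above has a small, fixed JSON shape. A sketch that builds one (the bucket and key names are made up; the `entries`/`url`/`mandatory` fields are the documented manifest format):

```python
import json

def copy_manifest(bucket, keys):
    """Build a Redshift COPY manifest that loads exactly these S3 objects;
    mandatory=True makes COPY fail if any listed file is missing."""
    return {"entries": [{"url": f"s3://{bucket}/{k}", "mandatory": True}
                        for k in keys]}

manifest = copy_manifest("etl-staging",
                         ["2024/01/part-0000.gz", "2024/01/part-0001.gz"])
print(json.dumps(manifest, indent=2))
```

Pointing COPY at a manifest instead of a prefix avoids loading half-written or unexpected files that happen to share the prefix.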
⚡ DynamoDB for Data Engineering
- Partition key (+ optional sort key) determines partition placement
- GSI: alternate partition key, eventual consistency only
- LSI: alternate sort key, same partition key, supports strong consistency
- DynamoDB Streams: capture item-level changes, 24h retention
- Streams → Lambda for real-time processing pipelines
- DynamoDB → S3 export (no RCU consumed) for analytics
- Adaptive capacity: auto-shifts capacity to hot partitions
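A standard remedy for a hot partition is write sharding: append a deterministic suffix so one hot logical key spreads across several physical partitions. A pure-Python sketch of the pattern (the key names are made up; CRC32 is just one deterministic way to pick a suffix):

```python
import zlib

def sharded_pk(hot_key, discriminator, shards=10):
    """Write-sharding pattern: derive a suffix from some attribute of the
    item so writes to one hot logical key fan out across `shards`
    physical partition keys."""
    suffix = zlib.crc32(str(discriminator).encode()) % shards
    return f"{hot_key}#{suffix}"

# All writes for 'game-42' would normally hit one partition; sharding
# spreads them over up to 10 partition keys.
keys = {sharded_pk("game-42", user_id) for user_id in range(1000)}
print(sorted(keys))
```

The trade-off: reads for the logical key now need to query all suffixes and merge, so the pattern suits write-heavy, aggregate-later workloads.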
🏛️ AWS Lake Formation
- Blueprints: pre-built templates to ingest from RDS, DynamoDB, and CloudTrail into the data lake
- LF-Tags: attribute-based access control — tag columns with a sensitivity level, grant access by tag
- Cross-account: share catalogs and tables across AWS accounts via Resource Access Manager
- Governed Tables: ACID transactions on S3, automatic compaction, time-travel queries
| Service | Use Case | Key Exam Point |
|---|---|---|
| S3 | Data lake, raw + processed storage | Partitioning strategy determines Athena query cost |
| Redshift | OLAP, BI, complex SQL analytics | Distribution key = co-locate join data |
| DynamoDB | Low-latency key-value, IoT, sessions | Hot partition = bad partition key choice |
| Aurora | OLTP, relational, high availability | Aurora S3 integration for SELECT INTO OUTFILE S3 |
| OpenSearch | Full-text search, log analytics | Successor to Elasticsearch; works with Kinesis |
| Timestream | Time-series data (IoT, metrics) | Built-in time-series functions, auto data tiering |
| Lake Formation | Data lake governance | Column/row level security beyond S3 policies |
Data Operations and Support
Covers running, monitoring, troubleshooting, and optimizing data pipelines. Heavy focus on EMR, Athena query optimization, and Step Functions for orchestration.
Pipeline Operations
Running, debugging, and optimizing production data workloads
🔥 Amazon EMR
- Managed Hadoop ecosystem: Spark, Hive, Presto, HBase, Flink
- Master + Core + Task nodes: task nodes are spot-friendly
- EMR on EC2: full control, HDFS or EMRFS (S3)
- EMR Serverless: no cluster management, pay per vCPU/memory
- EMR on EKS: submit Spark jobs to EKS cluster
- EMRFS: S3-backed HDFS replacement, consistent view
- Bootstrap actions: run scripts at cluster launch
🚀 Spark Performance Tuning
- Partitioning: repartition() vs coalesce() — coalesce avoids full shuffle
- Broadcast join: small table broadcast eliminates shuffle join
- Caching: cache() for repeatedly used DataFrames
- Speculation: retry slow tasks on another executor
- Driver OOM: increase spark.driver.memory
- Executor OOM: increase spark.executor.memory
- AQE (Adaptive Query Execution): auto-optimizes at runtime
🔍 Amazon Athena Optimization
- Serverless SQL on S3, pay per TB scanned ($5/TB)
- Use columnar formats (Parquet/ORC) to reduce scanned data
- Partition data → use partition pruning in WHERE clause
- Partition Projection: define partitions in table properties, no crawler
- Result caching: reuse query results for identical queries
- Federated queries: query RDS, DynamoDB, on-prem via Lambda connectors
- Athena CTAS: create table from SELECT → materialize results
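Athena's pricing model rewards every byte you avoid scanning. A pure-Python sketch of the billing math, using the $5/TB rate above and Athena's documented 10 MB per-query minimum (the scan sizes in the example are made up):

```python
def athena_cost_usd(bytes_scanned, price_per_tb=5.0):
    """Athena billing: $5 per TB scanned, with a 10 MB per-query minimum."""
    billable = max(bytes_scanned, 10 * 1024**2)
    return billable / 1024**4 * price_per_tb

csv_scan = athena_cost_usd(1 * 1024**4)        # full-scan a 1 TiB CSV table
parquet_scan = athena_cost_usd(50 * 1024**3)   # same data, Parquet + pruning
print(round(csv_scan, 2), round(parquet_scan, 2))   # 5.0 0.24
```

Converting to Parquet and partitioning compounds: columnar reads skip unneeded columns and partition pruning skips unneeded files, so a 95%+ reduction in scanned bytes (and cost) is realistic.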
📈 Step Functions for Data Pipelines
- Orchestrate multi-step workflows: Glue jobs, Lambda, EMR, Athena
- Standard: async, exactly-once, 1 year max, audit history
- Express: high-volume, at-least-once, 5 min max, cheaper
- Error handling: Catch + Retry in state machine definitions
- Map state: process array items in parallel
- Wait state: pause for external event or time delay
- EventBridge → Step Functions for event-driven pipeline triggers
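The Catch + Retry bullet maps to a small Amazon States Language fragment. A sketch that builds one for a Glue job task (the state names `NotifyFailure`/`RunCrawler` and the job name are hypothetical; the `Retry`/`Catch` fields and the `glue:startJobRun.sync` integration are the documented ASL forms):

```python
import json

def glue_task_with_retry(job_name, next_state):
    """ASL Task state running a Glue job synchronously, with
    exponential-backoff Retry and a Catch-all failure route."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 10,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,   # waits 10s, 20s, 40s before retries
        }],
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
        "Next": next_state,
    }

print(json.dumps(glue_task_with_retry("nightly-etl", "RunCrawler"), indent=2))
```

Retry handles transient faults inside the state; Catch routes exhausted or non-retryable errors to a cleanup/notification branch instead of failing the whole execution silently.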
📊 Monitoring Data Pipelines
- CloudWatch Metrics: Glue job run metrics, EMR cluster metrics
- CloudWatch Logs: Glue job logs, Spark driver/executor logs
- CloudTrail: API calls to Glue, S3, Redshift for audit
- Glue job metrics: numOutputRows, bytesWritten, executionTime
- EMR: Hadoop counters, YARN metrics, Spark UI
- Redshift: STL tables (query history, load errors), STV (real-time)
- AWS Glue Data Quality: DQ rules, anomaly detection on data
📉 QuickSight & Visualization
- Serverless BI, SPICE engine for in-memory acceleration
- SPICE: Super-fast Parallel In-memory Calculation Engine
- Connects to: Athena, Redshift, S3, RDS, Aurora, OpenSearch
- ML Insights: anomaly detection, forecasting built-in
- Row-level security (RLS) and column-level security (CLS)
- Paginated reports: for pixel-perfect operational reports
- Embedded analytics: embed dashboards in apps
Lambda architecture: two parallel pipelines, a batch layer for complete, accurate views plus a speed (streaming) layer for low-latency results, merged at a serving layer.
Kappa architecture: everything is streaming — simplifies by using KDS/MSK for all data. Reprocess by replaying from the stream. Fewer moving parts, but requires long stream retention.
For the exam: Lambda arch = two pipelines, Kappa arch = stream-only. Kinesis's 365-day retention enables Kappa on AWS.
| Tool | Best For | Key Exam Point |
|---|---|---|
| EMR | Large-scale Spark/Hive, HDFS, legacy Hadoop | Use task nodes (Spot) for cost savings |
| Glue ETL | Serverless ETL, schema flexibility | DynamicFrame handles schema evolution |
| Athena | Ad-hoc SQL on S3 | Parquet + partitioning = cost reduction |
| Step Functions | Multi-service workflow orchestration | Standard for long-running, Express for high-volume |
| MWAA | Managed Apache Airflow | DAG-based, complex schedules, Python operators |
| QuickSight | BI dashboards, embedded analytics | SPICE = in-memory; RLS for row-level security |
Data Security and Governance
Covers encryption, IAM roles for data services, Lake Formation fine-grained access, PII detection with Macie, and compliance-focused governance patterns.
Security & Governance
Protecting data at rest, in transit, and controlling access at scale
🔐 IAM for Data Services
- Glue jobs use IAM roles — role needs S3, CloudWatch, Glue permissions
- Redshift cluster uses IAM role for COPY/UNLOAD from S3
- EMR service role + EC2 instance profile for cluster resources
- Cross-account S3 access: bucket policy + IAM role in target account
- Resource-based policies: S3 bucket policy, KMS key policy
- Least-privilege: grant s3:GetObject per prefix, not bucket-wide
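The prefix-scoped least-privilege idea looks like this as a policy document. A sketch that generates one (the bucket and prefix names are made up; the actions, ARN formats, and the `s3:prefix` condition key are standard IAM/S3 syntax):

```python
import json

def prefix_read_policy(bucket, prefix):
    """Least-privilege S3 read: GetObject only under one prefix, plus a
    ListBucket statement scoped to that prefix via a condition key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": "s3:GetObject",
             "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*"},
            {"Effect": "Allow",
             "Action": "s3:ListBucket",
             "Resource": f"arn:aws:s3:::{bucket}",   # ListBucket targets the bucket ARN
             "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}}},
        ],
    }

print(json.dumps(prefix_read_policy("data-lake", "curated/sales"), indent=2))
```

A common trap: object actions go on the object ARN (`bucket/prefix/*`) while ListBucket goes on the bucket ARN, and mixing them up makes the policy silently ineffective.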
🔑 KMS Encryption for Data Services
- S3: SSE-S3 (AWS managed), SSE-KMS (CMK), SSE-C (customer key), CSE
- SSE-KMS: audit via CloudTrail; requires kms:GenerateDataKey
- S3 Bucket Key: reduces KMS API calls/cost for SSE-KMS
- Redshift: cluster-level encryption at rest with KMS CMK
- Kinesis: server-side encryption with KMS CMK
- DynamoDB: default encryption at rest (AWS owned key, KMS optional)
- Glue: encrypt Data Catalog metadata + job logs with KMS
🏛️ Lake Formation Security
- Centralized permissions replace S3 bucket policies for data lake
- Column-level security: restrict specific columns per IAM principal
- Row-level security: filter rows based on data filters
- Cell-level security: column + row filter combination
- LF-Tags (ABAC): tag resources + grant access by tag
- Governed Tables: transactional ACID writes on S3
- DataZone: data marketplace for sharing across accounts with governance
🔍 Macie, GuardDuty & Security Hub
- Macie: ML-powered PII/sensitive data detection in S3
- Macie custom data identifiers: regex patterns for org-specific data
- GuardDuty: threat detection — monitors S3 data events, CloudTrail
- GuardDuty S3 protection: detect anomalous API calls on S3
- Security Hub: aggregates Macie + GuardDuty + Config findings
- All findings → EventBridge for automated remediation
🌐 Network Security for Data
- S3 Gateway Endpoint: free, keeps S3 traffic private in VPC
- S3 Interface Endpoint (PrivateLink): private DNS for S3, useful for cross-account
- Redshift: VPC-only, security groups, VPC endpoints for access
- Glue Connections: VPC-based JDBC connections for RDS, on-prem
- EMR in private subnet: NAT GW or VPC endpoints for S3/CloudWatch
- MSK: VPC-based, SASL/SCRAM or TLS for client auth
📋 Data Governance & Compliance
- AWS Config: track configuration changes to data services
- CloudTrail: data events for S3 (API-level) — off by default, extra cost
- Redshift audit logging: user activity, connection, user logs to S3
- S3 Object Lock: WORM — compliance or governance mode
- Glue Data Quality: define expectations, fail pipelines on bad data
- AWS Audit Manager: evidence collection for compliance frameworks
- Tag-based cost allocation: activate tags in Billing console
Processing: Glue job with security config — encrypt shuffle data, CloudWatch logs, job bookmarks with CMK.
Storage: S3 SSE-KMS with bucket key enabled (reduces KMS costs). S3 Block Public Access on all accounts.
Analytics: Redshift encryption at rest (CMK). Redshift column-level grants. Athena result bucket encrypted.
Governance: Lake Formation column/row security + Macie for PII scanning + CloudTrail data events.
100 Practice Questions
Simulate the real DEA-C01 exam.