AWS Certified · Associate Tier

Data Engineer
Associate

Master every topic for the AWS DEA-C01 exam. Covers data ingestion pipelines, transformation, storage, operations, and governance using Kinesis, Glue, Redshift, EMR, Athena, Lake Formation, and more.

65
Questions
130
Minutes
720
Pass Score
4
Domains

Domain Breakdown

The DEA-C01 exam tests your ability to design, build, secure, and maintain data pipelines and data stores on AWS. The four domains span the full data engineering lifecycle.

Domain | Topic                             | Weight
D1     | Data Ingestion and Transformation | 34%
D2     | Data Store Management             | 26%
D3     | Data Operations and Support       | 22%
D4     | Data Security and Governance      | 18%
🎯
Exam format: 50 scored + 15 unscored questions (65 total), 130 minutes. Passing score: 720/1000. No penalty for guessing. Multiple-choice and multiple-response formats. Also includes scenario-based questions about architecture decisions for data pipelines.

Data Ingestion and Transformation

The largest domain. Covers streaming and batch ingestion, ETL with Glue, pipeline orchestration, and data format considerations. Kinesis and Glue are tested heavily.

01

Ingestion Services

Moving data from sources into the AWS data ecosystem at scale

34%
OF EXAM

🌊 Kinesis Data Streams (KDS)

  • Real-time streaming, data retained 1–365 days
  • Shards: 1 MB/s write, 2 MB/s read per shard
  • Enhanced fan-out: 2 MB/s per consumer per shard, HTTP/2 push
  • Ordering guaranteed per shard using partition key
  • PutRecord / PutRecords (batch) API
  • ShardIteratorType: TRIM_HORIZON (oldest), LATEST (newest), AT_TIMESTAMP
  • On-Demand vs Provisioned mode
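The PutRecords batch API caps each call at 500 records (and 5 MB total), so producers typically chunk their buffer before sending. A minimal sketch — the stream name, record shape, and helper names are illustrative, not from AWS docs:

```python
def chunk_records(records, max_batch=500):
    """Split records into PutRecords-sized batches (API limit: 500 records/call)."""
    return [records[i:i + max_batch] for i in range(0, len(records), max_batch)]

def send_to_stream(stream_name, records):
    """Send pre-chunked batches; records sharing a partition key land on the
    same shard, which is what preserves their ordering."""
    import boto3  # imported lazily so the chunking helper runs without AWS access
    client = boto3.client("kinesis")
    for batch in chunk_records(records):
        entries = [
            {"Data": r["data"], "PartitionKey": r["key"]}
            for r in batch
        ]
        resp = client.put_records(StreamName=stream_name, Records=entries)
        if resp["FailedRecordCount"]:
            # PutRecords is not all-or-nothing: inspect per-record results
            # and retry only the failed entries in a real producer
            pass
```

Note the partial-failure handling: unlike a single PutRecord, a PutRecords call can succeed for some records and throttle others.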

🚒 Kinesis Data Firehose (KDF)

  • Fully managed, near-real-time (60s buffer minimum)
  • Destinations: S3, Redshift, OpenSearch, Splunk, HTTP
  • Built-in transformation via Lambda
  • Auto-scaling, no shards to manage
  • For Redshift: writes to S3 first, then COPY command
  • Compression: GZIP, Snappy, ZIP on S3
  • Cannot replay data — not a replay service
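Firehose's built-in Lambda transformation uses a fixed record contract: each incoming record carries base64-encoded data and must come back with the same recordId plus a result of Ok, Dropped, or ProcessingFailed. A sketch of such a transform (the enrichment field is illustrative):

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose transformation Lambda: decode, enrich, and re-encode each record."""
    out = []
    for rec in event["records"]:
        try:
            payload = json.loads(base64.b64decode(rec["data"]))
            payload["processed"] = True            # illustrative enrichment
            data = json.dumps(payload) + "\n"      # newline-delimit for S3 delivery
            out.append({
                "recordId": rec["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data.encode()).decode(),
            })
        except (ValueError, KeyError):
            # Malformed payloads are returned unchanged so Firehose can
            # deliver them to the error prefix instead of dropping them
            out.append({"recordId": rec["recordId"],
                        "result": "ProcessingFailed",
                        "data": rec["data"]})
    return {"records": out}
```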

📊 Kinesis Data Analytics (KDA, now Amazon Managed Service for Apache Flink)

  • Real-time SQL or Apache Flink on streaming data
  • Input: KDS or KDF streams
  • Output: KDS, KDF, Lambda
  • Flink on KDA: stateful, exactly-once, Java/Scala/Python
  • Use for: time-window aggregations, anomaly detection
  • RANDOM_CUT_FOREST function for anomaly detection in SQL

🗃️ Amazon MSK (Managed Streaming for Apache Kafka)

  • Fully managed Apache Kafka clusters
  • Topics, partitions, producers, consumers — same Kafka API
  • MSK Connect: Kafka Connect workers, fully managed
  • MSK Serverless: no capacity provisioning
  • Messages persist on broker storage (up to EBS limits)
  • Encrypted at rest + in transit; VPC isolation

🔄 AWS DMS

  • Database Migration Service — heterogeneous or homogeneous
  • Source stays live during migration (CDC support)
  • SCT: Schema Conversion Tool for different DB engines
  • Endpoints: RDS, Aurora, S3, DynamoDB, Kinesis, Redshift
  • Full Load + CDC = minimal downtime migration
  • Replication instance runs on EC2 in your VPC

📤 AWS DataSync & Transfer Family

  • DataSync: scheduled transfer between on-prem NFS/SMB and S3/EFS/FSx
  • Agent required for on-prem; no agent for AWS-to-AWS
  • Up to 10x faster than open-source tools
  • Transfer Family: SFTP/FTP/FTPS endpoints for S3 or EFS
  • Preserves file permissions and metadata (DataSync)
  • Use DataSync for one-time or periodic data migrations
ETL

AWS Glue — Core ETL Engine

Serverless data integration service — central to the DEA-C01 exam

🕷️ Glue Data Catalog & Crawlers

  • Metadata repository: databases, tables, schemas, partitions
  • Crawlers: auto-discover schema from S3, RDS, Redshift, DynamoDB
  • Catalog used by Athena, Redshift Spectrum, EMR, Lake Formation
  • Partition discovery: crawl new partitions on schedule
  • Custom classifiers for non-standard formats
  • Connection types: JDBC, S3, Kinesis, Kafka, MongoDB

⚙️ Glue ETL Jobs

  • Spark-based (PySpark/Scala) or Python shell jobs
  • DynamicFrame: flexible schema, handles schema evolution
  • vs DataFrame: DynamicFrame better for messy/evolving data
  • Job bookmarks: track processed data, avoid re-processing
  • Glue Studio: visual drag-and-drop ETL
  • Worker types: Standard, G.1X, G.2X; G.025X for low-volume streaming jobs

🔀 Glue Workflows & Triggers

  • Workflows: chain crawlers, jobs, and triggers
  • Trigger types: On-demand, Schedule (cron), Job completion, Conditional
  • Conditional triggers: based on job success/failure
  • Glue DataBrew: visual data prep, no code required
  • Glue Streaming: reads from Kinesis/Kafka, processes with micro-batches

📂 Data Formats & Partitioning

  • Columnar: Parquet, ORC — best for analytics (compression + query perf)
  • Row: Avro, JSON, CSV — good for writes and streaming
  • Hive-style partitioning: s3://bucket/year=2024/month=01/
  • Partition projection in Athena: avoid crawler for time-series data
  • Convert to Parquet/ORC via Glue for Athena cost savings
  • Snappy: fast compression; GZIP: higher compression ratio
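Hive-style partitioning is just a key-naming convention, so writers usually generate it with a small helper. A sketch (prefix, timestamp, and filename are placeholders) showing how zero-padded `year=/month=/day=` segments are built so Athena and Glue can prune partitions instead of scanning the whole prefix:

```python
from datetime import datetime, timezone

def hive_partition_key(prefix, ts, filename):
    """Build a Hive-style S3 object key: prefix/year=YYYY/month=MM/day=DD/file.
    Zero-padding keeps partitions lexicographically sortable."""
    return (f"{prefix}/year={ts.year:04d}/month={ts.month:02d}/"
            f"day={ts.day:02d}/{filename}")

key = hive_partition_key(
    "raw/events",
    datetime(2024, 1, 5, tzinfo=timezone.utc),
    "part-0000.parquet",
)
```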

🌀 AppFlow & EventBridge Pipes

  • AppFlow: no-code SaaS data integration (Salesforce, Zendesk, Slack → S3/Redshift)
  • Scheduled or event-triggered flows
  • Data mapping and transformation built-in
  • EventBridge Pipes: point-to-point connections between event sources and targets
  • Filtering and enrichment via Lambda/Step Functions

⚡ Lambda & SQS/SNS for Ingestion

  • Lambda: event-driven micro-ETL, 15 min max runtime
  • SQS trigger for Lambda: batch size up to 10,000 messages
  • DLQ: failed messages after max retries for inspection
  • SNS fan-out → multiple SQS queues for parallel processing
  • S3 event notifications → Lambda/SQS/SNS on PUT/DELETE
  • Lambda destination: on success/failure route to SQS, SNS, EventBridge
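With an SQS trigger, a Lambda can report partial batch failures so SQS redelivers only the failed messages rather than the whole batch (this requires `ReportBatchItemFailures` on the event source mapping). A sketch — `process` stands in for hypothetical business logic:

```python
import json

def lambda_handler(event, context):
    """SQS-triggered Lambda using the partial-batch-failure contract:
    failed message IDs returned in batchItemFailures are redelivered."""
    failures = []
    for msg in event["Records"]:
        try:
            body = json.loads(msg["body"])
            process(body)  # hypothetical business logic
        except Exception:
            failures.append({"itemIdentifier": msg["messageId"]})
    return {"batchItemFailures": failures}

def process(body):
    # Placeholder: raise to simulate an unprocessable message
    if body.get("bad"):
        raise ValueError("unprocessable message")
```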
Key Comparison
KDS vs KDF vs MSK — Choose the Right Service
KDS: Custom consumers, replay needed, ordering matters — use for real-time dashboards, log processing.
KDF: Simple load to S3/Redshift/OpenSearch, no custom consumer code needed, near-real-time OK.
MSK: Kafka ecosystem, existing Kafka code, need Kafka Connect, topic-based pub/sub with long retention.
💡
Exam pattern: Questions often ask you to pick between Glue DynamicFrame and Spark DataFrame — DynamicFrame wins when schemas are inconsistent or evolving. DataFrame wins for pure performance with known schema. Also know that Glue job bookmarks prevent re-processing but must be enabled explicitly.

Data Store Management

Covers selecting the right storage layer for data use cases — S3 as data lake foundation, Redshift for analytics DW, DynamoDB for high-throughput NoSQL, and Lake Formation for governance.

02

Storage Architecture

Matching data access patterns to the right AWS storage service

26%
OF EXAM

🪣 S3 as Data Lake Foundation

  • 11 nines (99.999999999%) durability, 3+ AZs
  • Optimal partitioning: /year/month/day/hour/ for time-series
  • S3 prefix performance: 3,500 PUT/s, 5,500 GET/s per prefix
  • Multi-part upload: recommended for files > 100 MB
  • S3 Transfer Acceleration: CloudFront edge → faster uploads
  • Object tagging: cost allocation, lifecycle rules, access control
  • S3 Inventory: list all objects + metadata for audit/compliance

🗂️ S3 Storage Classes

  • Standard: frequent access, ms latency
  • Standard-IA: infrequent, retrieval fee, min 30 days
  • One Zone-IA: single AZ, ~20% cheaper than Standard-IA, for re-creatable data
  • Glacier Instant: ms retrieval, min 90 days
  • Glacier Flexible: minutes–hours, min 90 days
  • Glacier Deep Archive: 12h retrieval, cheapest, min 180 days
  • Intelligent-Tiering: auto-moves, no retrieval fee, monitoring fee
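The storage classes above are typically wired together with a lifecycle configuration. A sketch of the rule document boto3's `put_bucket_lifecycle_configuration` accepts — the prefix, day counts, and retention period are illustrative choices, not requirements:

```python
# Transition raw data down the storage-class ladder, then expire it.
lifecycle = {
    "Rules": [{
        "ID": "tier-raw-data",
        "Status": "Enabled",
        "Filter": {"Prefix": "raw/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},   # respects 30-day minimum
            {"Days": 120, "StorageClass": "GLACIER"},      # Glacier Flexible Retrieval
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
        ],
        "Expiration": {"Days": 2555},  # illustrative ~7-year retention
    }]
}

def apply(bucket):
    import boto3  # lazy so the rule document can be inspected without AWS access
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=lifecycle)
```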

🏗️ Amazon Redshift Architecture

  • Leader node: query planning, result aggregation
  • Compute nodes: execute queries, store data slices
  • Redshift Serverless: no cluster management, RPU-based pricing
  • AQUA (Advanced Query Accelerator): distributed cache layer, hardware-accelerated queries
  • RA3 nodes: managed storage, decouple compute from storage
  • Snapshot backups to S3: automated (1–35 days) or manual

📋 Redshift Distribution & Sort Keys

  • EVEN: default, round-robin, minimizes skew for no clear join key
  • KEY: distribute rows by column value — co-locate join keys
  • ALL: full copy on every node — for small dimension tables
  • AUTO: Redshift decides based on table size
  • Sort key: COMPOUND (multi-col, order matters) or INTERLEAVED (each col equal weight)
  • VACUUM: reclaim space, re-sort rows after DELETEs/UPDATEs
  • ANALYZE: update statistics for query planner
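These choices show up directly in table DDL. A sketch of a hypothetical fact table distributed on its join key and sorted on its filter column, submitted via the Redshift Data API (table, cluster, and column names are placeholders):

```python
# Distribute on the join key so matching rows co-locate on the same slice;
# a compound sort key on the date column enables range-restricted scans.
DDL = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (sale_date)
"""

def run(sql, cluster, database, db_user):
    import boto3  # lazy so the DDL can be reviewed without AWS access
    boto3.client("redshift-data").execute_statement(
        ClusterIdentifier=cluster, Database=database, DbUser=db_user, Sql=sql)
```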

📥 Redshift COPY & Spectrum

  • COPY command: best way to bulk load — parallel from S3/DynamoDB/EMR
  • COPY from S3 uses IAM role on cluster — no access keys
  • Manifest file: load specific S3 files, avoid partial loads
  • Redshift Spectrum: query S3 data directly without loading
  • Spectrum requires Glue Data Catalog or Athena catalog
  • Spectrum scales independently — query exabytes of S3 data
  • WLM (Workload Management): query queues, priority, concurrency scaling

⚡ DynamoDB for Data Engineering

  • Partition key (+ optional sort key) determines partition placement
  • GSI: alternate partition key, eventual or strong consistency
  • LSI: alternate sort key, same partition, strong consistency
  • DynamoDB Streams: capture item-level changes, 24h retention
  • Streams → Lambda for real-time processing pipelines
  • DynamoDB → S3 export (no RCU consumed) for analytics
  • Adaptive capacity: auto-shifts capacity to hot partitions
Data Lake Architecture
AWS Lake Formation — Data Lake Governance
Lake Formation sits on top of S3 + Glue Data Catalog. It provides fine-grained access control at the database, table, column, and row level — far more granular than S3 bucket policies alone.

Blueprints: pre-built templates to ingest from RDS, DynamoDB, CloudTrail into the data lake.
LF-Tags: attribute-based access control — tag columns with sensitivity level, grant access by tag.
Cross-account: share catalogs and tables across AWS accounts via Resource Access Manager.
Governed Tables: ACID transactions on S3, automatic compaction, time-travel queries.
Service        | Use Case                             | Key Exam Point
S3             | Data lake, raw + processed storage   | Partitioning strategy determines Athena query cost
Redshift       | OLAP, BI, complex SQL analytics      | Distribution key = co-locate join data
DynamoDB       | Low-latency key-value, IoT, sessions | Hot partition = bad partition key choice
Aurora         | OLTP, relational, high availability  | Aurora S3 integration for SELECT INTO OUTFILE S3
OpenSearch     | Full-text search, log analytics      | Successor to Elasticsearch; works with Kinesis
Timestream     | Time-series data (IoT, metrics)      | Built-in time-series functions, auto data tiering
Lake Formation | Data lake governance                 | Column/row level security beyond S3 policies

Data Operations and Support

Covers running, monitoring, troubleshooting, and optimizing data pipelines. Heavy focus on EMR, Athena query optimization, and Step Functions for orchestration.

03

Pipeline Operations

Running, debugging, and optimizing production data workloads

22%
OF EXAM

🔥 Amazon EMR

  • Managed Hadoop ecosystem: Spark, Hive, Presto, HBase, Flink
  • Master + Core + Task nodes: task nodes are spot-friendly
  • EMR on EC2: full control, HDFS or EMRFS (S3)
  • EMR Serverless: no cluster management, pay per vCPU/memory
  • EMR on EKS: submit Spark jobs to EKS cluster
  • EMRFS: S3-backed HDFS replacement; consistent view is legacy now that S3 is strongly consistent
  • Bootstrap actions: run scripts at cluster launch

🚀 Spark Performance Tuning

  • Partitioning: repartition() vs coalesce() — coalesce avoids full shuffle
  • Broadcast join: small table broadcast eliminates shuffle join
  • Caching: cache() for repeatedly used DataFrames
  • Speculation: retry slow tasks on another executor
  • Driver OOM: increase spark.driver.memory
  • Executor OOM: increase spark.executor.memory
  • AQE (Adaptive Query Execution): auto-optimizes at runtime

🔍 Amazon Athena Optimization

  • Serverless SQL on S3, pay per TB scanned ($5/TB)
  • Use columnar formats (Parquet/ORC) to reduce scanned data
  • Partition data → use partition pruning in WHERE clause
  • Partition Projection: define partitions in table properties, no crawler
  • Result caching: reuse query results for identical queries
  • Federated queries: query RDS, DynamoDB, on-prem via Lambda connectors
  • Athena CTAS: create table from SELECT → materialize results
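Several of these optimizations combine in one CTAS statement: materialize query results as partitioned, Snappy-compressed Parquet. A sketch — the database, table, column names, and bucket are placeholders:

```python
# Athena CTAS: partition columns must come last in the SELECT list.
CTAS = """
CREATE TABLE analytics.events_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://my-results-bucket/events_parquet/',
    partitioned_by = ARRAY['event_date']
) AS
SELECT event_id, user_id, payload, event_date
FROM raw.events_json
"""

def start(sql, workgroup="primary"):
    import boto3  # lazy so the SQL can be read without AWS access
    return boto3.client("athena").start_query_execution(
        QueryString=sql, WorkGroup=workgroup)["QueryExecutionId"]
```

Subsequent queries against `events_parquet` scan only the partitions and columns they touch, which is where the per-TB cost savings come from.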

📈 Step Functions for Data Pipelines

  • Orchestrate multi-step workflows: Glue jobs, Lambda, EMR, Athena
  • Standard: async, exactly-once, 1 year max, audit history
  • Express: high-volume, at-least-once, 5 min max, cheaper
  • Error handling: Catch + Retry in state machine definitions
  • Map state: process array items in parallel
  • Wait state: pause for external event or time delay
  • EventBridge → Step Functions for event-driven pipeline triggers
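A minimal Amazon States Language sketch tying these pieces together: a `.sync` Glue task (Step Functions waits for the job), a Retry on throttling, and a Catch routing all other errors to a notification state. Job name, topic ARN, and account ID are placeholders:

```python
import json

state_machine = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # .sync = wait for completion
            "Parameters": {"JobName": "nightly-etl"},
            "Retry": [{
                "ErrorEquals": ["Glue.ConcurrentRunsExceededException"],
                "IntervalSeconds": 60,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:alerts",
                "Message": "nightly-etl failed",
            },
            "End": True,
        },
    },
}
definition = json.dumps(state_machine)  # what create_state_machine would receive
```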

📊 Monitoring Data Pipelines

  • CloudWatch Metrics: Glue job run metrics, EMR cluster metrics
  • CloudWatch Logs: Glue job logs, Spark driver/executor logs
  • CloudTrail: API calls to Glue, S3, Redshift for audit
  • Glue job metrics: numOutputRows, bytesWritten, executionTime
  • EMR: Hadoop counters, YARN metrics, Spark UI
  • Redshift: STL tables (query history, load errors), STV (real-time)
  • AWS Glue Data Quality: DQ rules, anomaly detection on data

📉 QuickSight & Visualization

  • Serverless BI, SPICE engine for in-memory acceleration
  • SPICE: Super-fast Parallel In-memory Calculation Engine
  • Connects to: Athena, Redshift, S3, RDS, Aurora, OpenSearch
  • ML Insights: anomaly detection, forecasting built-in
  • Row-level security (RLS) and column-level security (CLS)
  • Paginated reports: for pixel-perfect operational reports
  • Embedded analytics: embed dashboards in apps
Architecture Pattern
Modern Data Pipeline: Lambda vs Kappa Architecture
Lambda architecture: separate batch layer (EMR/Glue) + speed layer (Kinesis/Flink) + serving layer (Redshift/Athena). Complex but handles both historical and real-time.

Kappa architecture: everything is streaming — simplifies by using KDS/MSK for all data. Reprocess by replaying from the stream. Fewer moving parts but requires long stream retention.

For the exam: Lambda arch = two pipelines, Kappa arch = stream-only. Kinesis 365-day retention enables Kappa on AWS.
Tool           | Best For                                    | Key Exam Point
EMR            | Large-scale Spark/Hive, HDFS, legacy Hadoop | Use task nodes (Spot) for cost savings
Glue ETL       | Serverless ETL, schema flexibility          | DynamicFrame handles schema evolution
Athena         | Ad-hoc SQL on S3                            | Parquet + partitioning = cost reduction
Step Functions | Multi-service workflow orchestration        | Standard for long-running, Express for high-volume
MWAA           | Managed Apache Airflow                      | DAG-based, complex schedules, Python operators
QuickSight     | BI dashboards, embedded analytics           | SPICE = in-memory; RLS for row-level security

Data Security and Governance

Covers encryption, IAM roles for data services, Lake Formation fine-grained access, PII detection with Macie, and compliance-focused governance patterns.

04

Security & Governance

Protecting data at rest, in transit, and controlling access at scale

18%
OF EXAM

🔐 IAM for Data Services

  • Glue jobs use IAM roles — role needs S3, CloudWatch, Glue permissions
  • Redshift cluster uses IAM role for COPY/UNLOAD from S3
  • EMR service role + EC2 instance profile for cluster resources
  • Cross-account S3 access: bucket policy + IAM role in target account
  • Resource-based policies: S3 bucket policy, KMS key policy
  • Least-privilege: s3:GetObject per prefix, not bucket-wide
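A sketch of what prefix-scoped least privilege looks like as a policy document (bucket name and prefix are placeholders). Note the asymmetry: `s3:GetObject` attaches to object ARNs, while `s3:ListBucket` attaches to the bucket ARN and is narrowed with an `s3:prefix` condition:

```python
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # object reads only under the curated/ prefix
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-data-lake/curated/*",
        },
        {   # listing is a bucket-level action, scoped by prefix condition
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::my-data-lake",
            "Condition": {"StringLike": {"s3:prefix": ["curated/*"]}},
        },
    ],
}
policy_json = json.dumps(policy)  # attach to the Glue/EMR/Redshift role
```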

🔑 KMS Encryption for Data Services

  • S3: SSE-S3 (AWS managed), SSE-KMS (CMK), SSE-C (customer key), CSE
  • SSE-KMS: audit via CloudTrail, requires kms:GenerateDataKey
  • S3 Bucket Key: reduce KMS API calls/cost for SSE-KMS
  • Redshift: cluster-level encryption at rest with KMS CMK
  • Kinesis: server-side encryption with KMS CMK
  • DynamoDB: default encryption at rest (AWS owned key, KMS optional)
  • Glue: encrypt Data Catalog metadata + job logs with KMS
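Default bucket encryption with SSE-KMS and S3 Bucket Keys is a single configuration document. A sketch of what `put_bucket_encryption` accepts — the KMS key ARN is a placeholder:

```python
# SSE-KMS as the bucket default; BucketKeyEnabled lets S3 derive a
# bucket-level key and cut per-object KMS API calls (and cost).
encryption = {
    "Rules": [{
        "ApplyServerSideEncryptionByDefault": {
            "SSEAlgorithm": "aws:kms",
            "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
        },
        "BucketKeyEnabled": True,
    }]
}

def apply(bucket):
    import boto3  # lazy so the configuration can be inspected without AWS access
    boto3.client("s3").put_bucket_encryption(
        Bucket=bucket, ServerSideEncryptionConfiguration=encryption)
```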

🏛️ Lake Formation Security

  • Centralized permissions replace S3 bucket policies for data lake
  • Column-level security: restrict specific columns per IAM principal
  • Row-level security: filter rows based on data filters
  • Cell-level security: column + row filter combination
  • LF-Tags (ABAC): tag resources + grant access by tag
  • Governed Tables: transactional ACID writes on S3
  • DataZone: data marketplace for sharing across accounts with governance

🔍 Macie, GuardDuty & Security Hub

  • Macie: ML-powered PII/sensitive data detection in S3
  • Macie custom data identifiers: regex patterns for org-specific data
  • GuardDuty: threat detection — monitors S3 data events, CloudTrail
  • GuardDuty S3 protection: detect anomalous API calls on S3
  • Security Hub: aggregates Macie + GuardDuty + Config findings
  • All findings → EventBridge for automated remediation

🌐 Network Security for Data

  • S3 Gateway Endpoint: free, keeps S3 traffic private in VPC
  • S3 Interface Endpoint (PrivateLink): private DNS for S3; enables access from on-prem over Direct Connect/VPN and across accounts
  • Redshift: VPC-only, security groups, VPC endpoints for access
  • Glue Connections: VPC-based JDBC connections for RDS, on-prem
  • EMR in private subnet: NAT GW or VPC endpoints for S3/CloudWatch
  • MSK: VPC-based, SASL/SCRAM or TLS for client auth

📋 Data Governance & Compliance

  • AWS Config: track configuration changes to data services
  • CloudTrail: data events for S3 (API-level) — off by default, extra cost
  • Redshift audit logging: user activity, connection, user logs to S3
  • S3 Object Lock: WORM — compliance or governance mode
  • Glue Data Quality: define expectations, fail pipelines on bad data
  • AWS Audit Manager: evidence collection for compliance frameworks
  • Tag-based cost allocation: activate tags in Billing console
Security Pattern
Encrypting a Full Data Pipeline
Ingestion: Kinesis SSE with CMK → encrypted in transit (TLS) and at rest.
Processing: Glue job with security config — encrypt shuffle data, CloudWatch logs, job bookmarks with CMK.
Storage: S3 SSE-KMS with bucket key enabled (reduces KMS costs). S3 Block Public Access on all accounts.
Analytics: Redshift encryption at rest (CMK). Redshift column-level grants. Athena result bucket encrypted.
Governance: Lake Formation column/row security + Macie for PII scanning + CloudTrail data events.

100 Practice Questions

Simulate the real DEA-C01 exam. Select your answer, then reveal the explanation.

⏱ Suggested: 130 minutes
📋 100 questions
🎯 Pass: 72%