Data Engineer
Associate
Master every topic for the AWS DEA-C01 exam. Covers data ingestion pipelines, transformation, storage, operations, and governance using Kinesis, Glue, Redshift, EMR, Athena, Lake Formation, and more.
Domain Breakdown
The DEA-C01 exam tests your ability to design, build, secure, and maintain data pipelines and data stores on AWS. The four domains span the full data engineering lifecycle.
| Domain | Topic | Weight |
|---|---|---|
| D1 | Data Ingestion and Transformation | 34% |
| D2 | Data Store Management | 26% |
| D3 | Data Operations and Support | 22% |
| D4 | Data Security and Governance | 18% |
Data Ingestion and Transformation
The largest domain. Covers streaming and batch ingestion, ETL with Glue, pipeline orchestration, and data format considerations. Kinesis and Glue are tested heavily.
Ingestion Services
Moving data from sources into the AWS data ecosystem at scale
🌊 Kinesis Data Streams (KDS)
- Real-time streaming, data retained 1–365 days
- Shards: 1 MB/s write, 2 MB/s read per shard
- Enhanced fan-out: 2 MB/s per consumer per shard, HTTP/2 push
- Ordering guaranteed per shard using partition key
- PutRecord / PutRecords (batch) write APIs
- Iterator types: TRIM_HORIZON (oldest), LATEST, AT_TIMESTAMP
- On-Demand vs Provisioned mode
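The per-shard limits above drive provisioned-mode sizing. A minimal pure-Python sketch of the capacity math (the 1,000 records/s per-shard write limit is the documented companion to the 1 MB/s limit; the traffic numbers in the example are made up):

```python
import math

def required_shards(write_mbps: float, records_per_sec: int, read_mbps: float,
                    consumers: int = 1, enhanced_fanout: bool = False) -> int:
    """Estimate the provisioned shard count for a Kinesis Data Stream.

    Per-shard limits: 1 MB/s or 1,000 records/s on writes; 2 MB/s on reads,
    shared across all consumers unless enhanced fan-out gives each consumer
    its own dedicated 2 MB/s pipe.
    """
    by_write_bytes = write_mbps / 1.0
    by_write_records = records_per_sec / 1000.0
    # Shared-throughput consumers compete for the same 2 MB/s per shard.
    effective_read = read_mbps if enhanced_fanout else read_mbps * consumers
    by_read = effective_read / 2.0
    return max(1, math.ceil(max(by_write_bytes, by_write_records, by_read)))

# 5 MB/s in, 3,000 records/s, three consumers each reading the full stream:
print(required_shards(5, 3000, 5, consumers=3))                         # 8
print(required_shards(5, 3000, 5, consumers=3, enhanced_fanout=True))   # 5
```

Note how enhanced fan-out removes the read side as the bottleneck, leaving the write throughput to determine the shard count.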
🚒 Kinesis Data Firehose (KDF)
- Fully managed, near-real-time (60s buffer minimum)
- Destinations: S3, Redshift, OpenSearch, Splunk, HTTP
- Built-in transformation via Lambda
- Auto-scaling, no shards to manage
- For Redshift: writes to S3 first, then COPY command
- Compression: GZIP, Snappy, ZIP on S3
- Cannot replay data — not a replay service
📊 Kinesis Data Analytics (KDA, now Amazon Managed Service for Apache Flink)
- Real-time SQL or Apache Flink on streaming data
- Input: KDS or KDF streams
- Output: KDS, KDF, Lambda
- Flink on KDA: stateful, exactly-once, Java/Scala/Python
- Use for: time-window aggregations, anomaly detection
- RANDOM_CUT_FOREST function for anomaly detection in SQL
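The time-window aggregations mentioned above can be illustrated without Flink. A minimal pure-Python sketch of a tumbling (fixed, non-overlapping) window count; the sensor event data is made up:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed, non-overlapping windows,
    the same kind of aggregation a Flink tumbling window performs."""
    counts = defaultdict(int)
    for ts, key in events:
        # Every event maps to exactly one window, aligned to window_seconds.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "sensor-a"), (30, "sensor-a"), (65, "sensor-a"), (70, "sensor-b")]
print(tumbling_window_counts(events, 60))
# {(0, 'sensor-a'): 2, (60, 'sensor-a'): 1, (60, 'sensor-b'): 1}
```

A sliding window would instead assign each event to multiple overlapping windows; tumbling windows are the simpler, more common exam scenario.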
🗃️ Amazon MSK (Managed Streaming for Apache Kafka)
- Fully managed Apache Kafka clusters
- Topics, partitions, producers, consumers — same Kafka API
- MSK Connect: Kafka Connect workers, fully managed
- MSK Serverless: no capacity provisioning
- Messages persist on broker storage (up to EBS limits)
- Encrypted at rest + in transit; VPC isolation
🔄 AWS DMS
- Database Migration Service — heterogeneous or homogeneous
- Source stays live during migration (CDC support)
- SCT: Schema Conversion Tool for different DB engines
- Endpoints: RDS, Aurora, S3, DynamoDB, Kinesis, Redshift
- Full Load + CDC = minimal downtime migration
- Replication instance runs on EC2 in your VPC
📤 AWS DataSync & Transfer Family
- DataSync: scheduled transfer between on-prem NFS/SMB and S3/EFS/FSx
- Agent required for on-prem; no agent for AWS-to-AWS
- Up to 10x faster than open-source tools
- Transfer Family: SFTP/FTP/FTPS endpoints for S3 or EFS
- Preserves file permissions and metadata (DataSync)
- Use DataSync for one-time or periodic data migrations
AWS Glue — Core ETL Engine
Serverless data integration service — central to the DEA-C01 exam
🕷️ Glue Data Catalog & Crawlers
- Metadata repository: databases, tables, schemas, partitions
- Crawlers: auto-discover schema from S3, RDS, Redshift, DynamoDB
- Catalog used by Athena, Redshift Spectrum, EMR, Lake Formation
- Partition discovery: crawl new partitions on schedule
- Custom classifiers for non-standard formats
- Connection types: JDBC, S3, Kinesis, Kafka, MongoDB
⚙️ Glue ETL Jobs
- Spark-based (PySpark/Scala) or Python shell jobs
- DynamicFrame: flexible schema, handles schema evolution
- vs DataFrame: DynamicFrame better for messy/evolving data
- Job bookmarks: track processed data, avoid re-processing
- Glue Studio: visual drag-and-drop ETL
- Worker types: Standard, G.1X, G.2X; G.025X for low-volume streaming jobs
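Job bookmarks are worth internalizing: the job persists a high-water mark and only processes data newer than it. A minimal pure-Python simulation of that behavior (the object keys and timestamps are made up; real bookmarks are managed by Glue internally):

```python
def new_files_since_bookmark(objects, bookmark):
    """Mimic a Glue job bookmark: process only objects written after the
    last committed high-water mark, then advance the mark."""
    pending = sorted(k for k, mtime in objects.items() if mtime > bookmark)
    new_mark = max([bookmark] + [objects[k] for k in pending])
    return pending, new_mark

objects = {"run1/a.json": 100, "run1/b.json": 110}
batch1, mark = new_files_since_bookmark(objects, 0)
print(batch1)        # first run processes everything

objects["run2/c.json"] = 150     # new data lands between runs
batch2, mark2 = new_files_since_bookmark(objects, mark)
print(batch2)        # second run sees only the new object
```

This is why disabling bookmarks (or resetting them) causes a full re-read, a common exam troubleshooting scenario.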
🔀 Glue Workflows & Triggers
- Workflows: chain crawlers, jobs, and triggers
- Trigger types: On-demand, Schedule (cron), Job completion, Conditional
- Conditional triggers: based on job success/failure
- Glue DataBrew: visual data prep, no code required
- Glue Streaming: reads from Kinesis/Kafka, processes with micro-batches
📂 Data Formats & Partitioning
- Columnar: Parquet, ORC — best for analytics (compression + query perf)
- Row: Avro, JSON, CSV — good for writes and streaming
- Hive-style partitioning: s3://bucket/year=2024/month=01/
- Partition projection in Athena: avoids needing a crawler for time-series data
- Convert to Parquet/ORC via Glue for Athena cost savings
- Snappy: fast compression; GZIP: higher compression ratio
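Hive-style paths and the partition pruning they enable can be sketched in a few lines of pure Python (the bucket and table names are made up; `prune` stands in for what Athena does when the WHERE clause filters on partition columns):

```python
from datetime import date

def partition_path(bucket, table, d):
    """Build a Hive-style key layout that Athena/Glue map to partition columns."""
    return f"s3://{bucket}/{table}/year={d:%Y}/month={d:%m}/day={d:%d}/"

def prune(paths, year, month):
    """Keep only the partitions a `WHERE year=... AND month=...` query scans."""
    return [p for p in paths
            if f"/year={year}/" in p and f"month={month:02d}/" in p]

paths = [partition_path("my-lake", "events", date(2024, m, 1)) for m in (1, 2, 3)]
print(prune(paths, 2024, 1))   # only the January partition is scanned
```

Pruning two of three monthly partitions cuts scanned bytes (and Athena cost) by roughly two thirds before any columnar-format savings apply.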
🌀 AppFlow & EventBridge Pipes
- AppFlow: no-code SaaS data integration (Salesforce, Zendesk, Slack → S3/Redshift)
- Scheduled or event-triggered flows
- Data mapping and transformation built-in
- EventBridge Pipes: point-to-point connections between event sources and targets
- Filtering and enrichment via Lambda/Step Functions
⚡ Lambda & SQS/SNS for Ingestion
- Lambda: event-driven micro-ETL, 15 min max runtime
- SQS trigger for Lambda: batch size up to 10,000 messages
- DLQ: failed messages after max retries for inspection
- SNS fan-out → multiple SQS queues for parallel processing
- S3 event notifications → Lambda/SQS/SNS on PUT/DELETE
- Lambda destination: on success/failure route to SQS, SNS, EventBridge
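The DLQ redrive behavior above can be simulated without AWS. A minimal pure-Python sketch in which a message moves to a dead-letter list after `max_receives` failed attempts, mirroring an SQS redrive policy's maxReceiveCount (the handler and message contents are made up):

```python
def process_with_redrive(messages, handler, max_receives=3):
    """Sketch of SQS redrive: a message that keeps failing is moved to the
    dead-letter queue after max_receives delivery attempts."""
    done, dlq = [], []
    for msg in messages:
        for attempt in range(1, max_receives + 1):
            try:
                handler(msg)
                done.append(msg)
                break
            except Exception:
                if attempt == max_receives:
                    dlq.append(msg)   # poison message: park it for inspection
    return done, dlq

def handler(msg):
    if "bad" in msg:
        raise ValueError("poison message")

done, dlq = process_with_redrive(["ok-1", "bad-2", "ok-3"], handler)
print(done, dlq)   # ['ok-1', 'ok-3'] ['bad-2']
```

Parking poison messages instead of retrying forever keeps the main queue draining, which is the point of the DLQ pattern.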
Choosing between the streaming services:
- KDF: simple load to S3/Redshift/OpenSearch, no custom consumer code needed, near-real-time is acceptable.
- MSK: Kafka ecosystem — existing Kafka code, need Kafka Connect, topic-based pub/sub with long retention.
Data Store Management
Covers selecting the right storage layer for data use cases — S3 as data lake foundation, Redshift for analytics DW, DynamoDB for high-throughput NoSQL, and Lake Formation for governance.
Storage Architecture
Matching data access patterns to the right AWS storage service
🪣 S3 as Data Lake Foundation
- 11 nines (99.999999999%) durability, 3+ AZs
- Optimal partitioning: /year/month/day/hour/ for time-series data
- S3 prefix performance: 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix
- Multi-part upload: recommended for files > 100 MB
- S3 Transfer Acceleration: CloudFront edge → faster uploads
- Object tagging: cost allocation, lifecycle rules, access control
- S3 Inventory: list all objects + metadata for audit/compliance
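Multipart upload sizing follows simple arithmetic from S3's hard limits (5 MiB to 5 GiB per part, at most 10,000 parts per upload). A pure-Python sketch of the planning step; the default part size mirrors the 100 MB guidance above:

```python
import math

def multipart_plan(size_bytes, part_size=100 * 1024**2):
    """Plan an S3 multipart upload: S3 caps an upload at 10,000 parts,
    so very large objects force a larger part size."""
    parts = math.ceil(size_bytes / part_size)
    if parts > 10_000:
        # Grow the part size until the object fits in 10,000 parts.
        part_size = math.ceil(size_bytes / 10_000)
        parts = math.ceil(size_bytes / part_size)
    return parts, part_size

print(multipart_plan(1 * 1024**3))   # 1 GiB object in 100 MiB parts: 11 parts
print(multipart_plan(5 * 1024**4))   # 5 TiB (the S3 max object size)
```

Multipart also lets parts upload in parallel and retry independently, which is why it is recommended well below the point where it becomes mandatory.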
🗂️ S3 Storage Classes
- Standard: frequent access, ms latency
- Standard-IA: infrequent, retrieval fee, min 30 days
- One Zone-IA: single AZ, 20% cheaper, for re-creatable data
- Glacier Instant: ms retrieval, min 90 days
- Glacier Flexible: minutes–hours, min 90 days
- Glacier Deep Archive: 12h retrieval, cheapest, min 180 days
- Intelligent-Tiering: auto-moves, no retrieval fee, monitoring fee
🏗️ Amazon Redshift Architecture
- Leader node: query planning, result aggregation
- Compute nodes: execute queries, store data slices
- Redshift Serverless: no cluster management, RPU-based pricing
- AQUA: distributed cache layer, hardware-accelerated queries
- RA3 nodes: managed storage, decouple compute from storage
- Snapshot backups to S3: automated (1–35 days) or manual
📋 Redshift Distribution & Sort Keys
- EVEN: default, round-robin, minimizes skew for no clear join key
- KEY: distribute rows by column value — co-locate join keys
- ALL: full copy on every node — for small dimension tables
- AUTO: Redshift decides based on table size
- Sort key: COMPOUND (multi-col, order matters) or INTERLEAVED (each col equal weight)
- VACUUM: reclaim space, re-sort rows after DELETEs/UPDATEs
- ANALYZE: update statistics for query planner
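KEY distribution is easiest to see as a hash placement rule. A pure-Python sketch of why matching join keys end up co-located (CRC32 stands in for Redshift's internal hash function, and the order rows are made up):

```python
import zlib

def key_distribute(rows, dist_key, nodes):
    """Place each row on hash(dist_key) % nodes, as KEY distribution does,
    so rows sharing a join key land on the same node."""
    placement = {}
    for row in rows:
        node = zlib.crc32(str(row[dist_key]).encode()) % nodes
        placement.setdefault(node, []).append(row)
    return placement

orders = [{"customer_id": 7, "total": 10},
          {"customer_id": 7, "total": 25}]
placed = key_distribute(orders, "customer_id", nodes=4)
# Both customer 7 rows hash to the same node, so a join on customer_id
# needs no network redistribution at query time.
print(placed)
```

If a table distributed on a skewed column piles most rows onto one node, you get the data-skew symptom the EVEN style avoids.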
📥 Redshift COPY & Spectrum
- COPY command: best way to bulk load — parallel from S3/DynamoDB/EMR
- COPY from S3 uses IAM role on cluster — no access keys
- Manifest file: load specific S3 files, avoid partial loads
- Redshift Spectrum: query S3 data directly without loading
- Spectrum requires Glue Data Catalog or Athena catalog
- Spectrum scales independently — query exabytes of S3 data
- WLM (Workload Management): query queues, priority, concurrency scaling
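The manifest file mentioned above has a small, fixed JSON shape. A sketch that builds one (the bucket and key names are made up; the `entries`/`url`/`mandatory` fields are the documented manifest format):

```python
import json

def copy_manifest(bucket, keys):
    """Build a Redshift COPY manifest that loads exactly these S3 objects;
    mandatory=True makes COPY fail if any listed file is missing."""
    return {"entries": [{"url": f"s3://{bucket}/{k}", "mandatory": True}
                        for k in keys]}

manifest = copy_manifest("etl-staging",
                         ["2024/01/part-0000.gz", "2024/01/part-0001.gz"])
print(json.dumps(manifest, indent=2))
```

Pointing COPY at a manifest instead of a prefix avoids loading half-written or unexpected files that happen to share the prefix.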
⚡ DynamoDB for Data Engineering
- Partition key (+ optional sort key) determines partition placement
- GSI: alternate partition key, eventual consistency only
- LSI: alternate sort key, same partition key, supports strong consistency
- DynamoDB Streams: capture item-level changes, 24h retention
- Streams → Lambda for real-time processing pipelines
- DynamoDB → S3 export (no RCU consumed) for analytics
- Adaptive capacity: auto-shifts capacity to hot partitions
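A standard remedy for a hot partition is write sharding: append a deterministic suffix so one hot logical key spreads across several physical partitions. A pure-Python sketch of the pattern (the key names are made up; CRC32 is just one deterministic way to pick a suffix):

```python
import zlib

def sharded_pk(hot_key, discriminator, shards=10):
    """Write-sharding pattern: derive a suffix from some attribute of the
    item so writes to one hot logical key fan out across `shards`
    physical partition keys."""
    suffix = zlib.crc32(str(discriminator).encode()) % shards
    return f"{hot_key}#{suffix}"

# All writes for 'game-42' would normally hit one partition; sharding
# spreads them over up to 10 partition keys.
keys = {sharded_pk("game-42", user_id) for user_id in range(1000)}
print(sorted(keys))
```

The trade-off: reads for the logical key now need to query all suffixes and merge, so the pattern suits write-heavy, aggregate-later workloads.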
🏛️ AWS Lake Formation
- Blueprints: pre-built templates to ingest from RDS, DynamoDB, and CloudTrail into the data lake
- LF-Tags: attribute-based access control — tag columns with a sensitivity level, grant access by tag
- Cross-account: share catalogs and tables across AWS accounts via Resource Access Manager
- Governed Tables: ACID transactions on S3, automatic compaction, time-travel queries
| Service | Use Case | Key Exam Point |
|---|---|---|
| S3 | Data lake, raw + processed storage | Partitioning strategy determines Athena query cost |
| Redshift | OLAP, BI, complex SQL analytics | Distribution key = co-locate join data |
| DynamoDB | Low-latency key-value, IoT, sessions | Hot partition = bad partition key choice |
| Aurora | OLTP, relational, high availability | Aurora S3 integration for SELECT INTO OUTFILE S3 |
| OpenSearch | Full-text search, log analytics | Successor to Elasticsearch; works with Kinesis |
| Timestream | Time-series data (IoT, metrics) | Built-in time-series functions, auto data tiering |
| Lake Formation | Data lake governance | Column/row level security beyond S3 policies |
Data Operations and Support
Covers running, monitoring, troubleshooting, and optimizing data pipelines. Heavy focus on EMR, Athena query optimization, and Step Functions for orchestration.
Pipeline Operations
Running, debugging, and optimizing production data workloads
🔥 Amazon EMR
- Managed Hadoop ecosystem: Spark, Hive, Presto, HBase, Flink
- Master + Core + Task nodes: task nodes are spot-friendly
- EMR on EC2: full control, HDFS or EMRFS (S3)
- EMR Serverless: no cluster management, pay per vCPU/memory
- EMR on EKS: submit Spark jobs to EKS cluster
- EMRFS: S3-backed HDFS replacement, consistent view
- Bootstrap actions: run scripts at cluster launch
🚀 Spark Performance Tuning
- Partitioning: repartition() vs coalesce() — coalesce avoids full shuffle
- Broadcast join: small table broadcast eliminates shuffle join
- Caching: cache() for repeatedly used DataFrames
- Speculation: retry slow tasks on another executor
- Driver OOM: increase spark.driver.memory
- Executor OOM: increase spark.executor.memory
- AQE (Adaptive Query Execution): auto-optimizes at runtime
🔍 Amazon Athena Optimization
- Serverless SQL on S3, pay per TB scanned ($5/TB)
- Use columnar formats (Parquet/ORC) to reduce scanned data
- Partition data → use partition pruning in WHERE clause
- Partition Projection: define partitions in table properties, no crawler
- Result caching: reuse query results for identical queries
- Federated queries: query RDS, DynamoDB, on-prem via Lambda connectors
- Athena CTAS: create table from SELECT → materialize results
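Athena's pricing model rewards every byte you avoid scanning. A pure-Python sketch of the billing math, using the $5/TB rate above and Athena's documented 10 MB per-query minimum (the scan sizes in the example are made up):

```python
def athena_cost_usd(bytes_scanned, price_per_tb=5.0):
    """Athena billing: $5 per TB scanned, with a 10 MB per-query minimum."""
    billable = max(bytes_scanned, 10 * 1024**2)
    return billable / 1024**4 * price_per_tb

csv_scan = athena_cost_usd(1 * 1024**4)        # full-scan a 1 TiB CSV table
parquet_scan = athena_cost_usd(50 * 1024**3)   # same data, Parquet + pruning
print(round(csv_scan, 2), round(parquet_scan, 2))   # 5.0 0.24
```

Converting to Parquet and partitioning compounds: columnar reads skip unneeded columns and partition pruning skips unneeded files, so a 95%+ reduction in scanned bytes (and cost) is realistic.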
📈 Step Functions for Data Pipelines
- Orchestrate multi-step workflows: Glue jobs, Lambda, EMR, Athena
- Standard: async, exactly-once, 1 year max, audit history
- Express: high-volume, at-least-once, 5 min max, cheaper
- Error handling: Catch + Retry in state machine definitions
- Map state: process array items in parallel
- Wait state: pause for external event or time delay
- EventBridge → Step Functions for event-driven pipeline triggers
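The Catch + Retry bullet maps to a small Amazon States Language fragment. A sketch that builds one for a Glue job task (the state names `NotifyFailure`/`RunCrawler` and the job name are hypothetical; the `Retry`/`Catch` fields and the `glue:startJobRun.sync` integration are the documented ASL forms):

```python
import json

def glue_task_with_retry(job_name, next_state):
    """ASL Task state running a Glue job synchronously, with
    exponential-backoff Retry and a Catch-all failure route."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 10,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,   # waits 10s, 20s, 40s before retries
        }],
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
        "Next": next_state,
    }

print(json.dumps(glue_task_with_retry("nightly-etl", "RunCrawler"), indent=2))
```

Retry handles transient faults inside the state; Catch routes exhausted or non-retryable errors to a cleanup/notification branch instead of failing the whole execution silently.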
📊 Monitoring Data Pipelines
- CloudWatch Metrics: Glue job run metrics, EMR cluster metrics
- CloudWatch Logs: Glue job logs, Spark driver/executor logs
- CloudTrail: API calls to Glue, S3, Redshift for audit
- Glue job metrics: numOutputRows, bytesWritten, executionTime
- EMR: Hadoop counters, YARN metrics, Spark UI
- Redshift: STL tables (query history, load errors), STV (real-time)
- AWS Glue Data Quality: DQ rules, anomaly detection on data
📉 QuickSight & Visualization
- Serverless BI, SPICE engine for in-memory acceleration
- SPICE: Super-fast Parallel In-memory Calculation Engine
- Connects to: Athena, Redshift, S3, RDS, Aurora, OpenSearch
- ML Insights: anomaly detection, forecasting built-in
- Row-level security (RLS) and column-level security (CLS)
- Paginated reports: for pixel-perfect operational reports
- Embedded analytics: embed dashboards in apps
Lambda architecture: two parallel pipelines, a batch layer for complete, accurate views plus a speed (streaming) layer for low-latency results, merged at a serving layer.
Kappa architecture: everything is streaming — simplifies by using KDS/MSK for all data. Reprocess by replaying from the stream. Fewer moving parts, but requires long stream retention.
For the exam: Lambda arch = two pipelines, Kappa arch = stream-only. Kinesis's 365-day retention enables Kappa on AWS.
| Tool | Best For | Key Exam Point |
|---|---|---|
| EMR | Large-scale Spark/Hive, HDFS, legacy Hadoop | Use task nodes (Spot) for cost savings |
| Glue ETL | Serverless ETL, schema flexibility | DynamicFrame handles schema evolution |
| Athena | Ad-hoc SQL on S3 | Parquet + partitioning = cost reduction |
| Step Functions | Multi-service workflow orchestration | Standard for long-running, Express for high-volume |
| MWAA | Managed Apache Airflow | DAG-based, complex schedules, Python operators |
| QuickSight | BI dashboards, embedded analytics | SPICE = in-memory; RLS for row-level security |
Data Security and Governance
Covers encryption, IAM roles for data services, Lake Formation fine-grained access, PII detection with Macie, and compliance-focused governance patterns.
Security & Governance
Protecting data at rest, in transit, and controlling access at scale
🔐 IAM for Data Services
- Glue jobs use IAM roles — role needs S3, CloudWatch, Glue permissions
- Redshift cluster uses IAM role for COPY/UNLOAD from S3
- EMR service role + EC2 instance profile for cluster resources
- Cross-account S3 access: bucket policy + IAM role in target account
- Resource-based policies: S3 bucket policy, KMS key policy
- Least-privilege: grant s3:GetObject per prefix, not bucket-wide
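The prefix-scoped least-privilege idea looks like this as a policy document. A sketch that generates one (the bucket and prefix names are made up; the actions, ARN formats, and the `s3:prefix` condition key are standard IAM/S3 syntax):

```python
import json

def prefix_read_policy(bucket, prefix):
    """Least-privilege S3 read: GetObject only under one prefix, plus a
    ListBucket statement scoped to that prefix via a condition key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": "s3:GetObject",
             "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*"},
            {"Effect": "Allow",
             "Action": "s3:ListBucket",
             "Resource": f"arn:aws:s3:::{bucket}",   # ListBucket targets the bucket ARN
             "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}}},
        ],
    }

print(json.dumps(prefix_read_policy("data-lake", "curated/sales"), indent=2))
```

A common trap: object actions go on the object ARN (`bucket/prefix/*`) while ListBucket goes on the bucket ARN, and mixing them up makes the policy silently ineffective.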
🔑 KMS Encryption for Data Services
- S3: SSE-S3 (AWS managed), SSE-KMS (CMK), SSE-C (customer key), CSE
- SSE-KMS: audit via CloudTrail; requires kms:GenerateDataKey
- S3 Bucket Key: reduces KMS API calls/cost for SSE-KMS
- Redshift: cluster-level encryption at rest with KMS CMK
- Kinesis: server-side encryption with KMS CMK
- DynamoDB: default encryption at rest (AWS owned key, KMS optional)
- Glue: encrypt Data Catalog metadata + job logs with KMS
🏛️ Lake Formation Security
- Centralized permissions replace S3 bucket policies for data lake
- Column-level security: restrict specific columns per IAM principal
- Row-level security: filter rows based on data filters
- Cell-level security: column + row filter combination
- LF-Tags (ABAC): tag resources + grant access by tag
- Governed Tables: transactional ACID writes on S3
- DataZone: data marketplace for sharing across accounts with governance
🔍 Macie, GuardDuty & Security Hub
- Macie: ML-powered PII/sensitive data detection in S3
- Macie custom data identifiers: regex patterns for org-specific data
- GuardDuty: threat detection — monitors S3 data events, CloudTrail
- GuardDuty S3 protection: detect anomalous API calls on S3
- Security Hub: aggregates Macie + GuardDuty + Config findings
- All findings → EventBridge for automated remediation
🌐 Network Security for Data
- S3 Gateway Endpoint: free, keeps S3 traffic private in VPC
- S3 Interface Endpoint (PrivateLink): private DNS for S3, useful for cross-account
- Redshift: VPC-only, security groups, VPC endpoints for access
- Glue Connections: VPC-based JDBC connections for RDS, on-prem
- EMR in private subnet: NAT GW or VPC endpoints for S3/CloudWatch
- MSK: VPC-based, SASL/SCRAM or TLS for client auth
📋 Data Governance & Compliance
- AWS Config: track configuration changes to data services
- CloudTrail: data events for S3 (API-level) — off by default, extra cost
- Redshift audit logging: user activity, connection, user logs to S3
- S3 Object Lock: WORM — compliance or governance mode
- Glue Data Quality: define expectations, fail pipelines on bad data
- AWS Audit Manager: evidence collection for compliance frameworks
- Tag-based cost allocation: activate tags in Billing console
Processing: Glue job with security config — encrypt shuffle data, CloudWatch logs, job bookmarks with CMK.
Storage: S3 SSE-KMS with bucket key enabled (reduces KMS costs). S3 Block Public Access on all accounts.
Analytics: Redshift encryption at rest (CMK). Redshift column-level grants. Athena result bucket encrypted.
Governance: Lake Formation column/row security + Macie for PII scanning + CloudTrail data events.
100 Practice Questions
Simulate the real DEA-C01 exam.