Data Engineering with AWS: Acquire the skills to design and build AWS-based data transformation pipelines like a pro, 2nd Edition, by Gareth Eagar
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damage caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Grosvenor House, 11 St Paul’s Square, Birmingham, B3 1RB, UK.
ISBN 978-1-80461-442-6
www.packt.com
Contributors
About the author
Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA.
Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services, and deep expertise around building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers.
Gareth also frequently speaks on data-related topics.
To my amazing wife and children, thank you for your patience and understanding as I spent countless hours writing the revised edition of this book. Your support for me taking on this project, and making the space and time for me to write, means so much to me.
A special thanks to Disha Umarwani, Praful Kava, and Natalie Rabinovich, who each contributed content for the first edition of this book. And many thanks to Amit Kalawat, Leonardo Gomez, and many others for helping to review content for this revised edition.
About the reviewers
Vaibhav Tyagi is a skilled and experienced cloud data engineer and architect with 10 years of experience. He has a deep understanding of AWS cloud services and is proficient in a variety of data engineering tools, including Spark, Hive, and Hadoop.
Throughout his career, he has worked for Teradata, Citigroup, NatWest, and Amazon, where he has, among other things, designed and implemented cloud-based pipelines, built complex cloud environments, and created data warehouses.
I would like to thank my wife and children who have been my biggest cheerleaders and put up with my long working hours. I am truly grateful for their love and support. And thank you to my friends who have also been a great source of support.
Gaurav Verma has 9 years of experience in the field, having worked at AWS, Skyscanner, Discovery Communications, and Tata Consultancy Services.
He excels in designing and delivering big data and analytics solutions on AWS. His expertise spans AWS services, Python, Scala, Spark, and more. He currently leads a team at Amazon, overseeing global analytics and the ML data platform. His career highlights include optimizing data pipelines, managing analytics projects, and extensive training in big data and data engineering technologies.
Learn more on Discord
To join the Discord community for this book – where you can share feedback, ask the author questions, and learn about new releases – follow the QR code below:
https://discord.gg/9s5mHNyECd
Databases and data warehouses • 22
Dealing with big, unstructured data • 23
Cloud-based solutions for big data analytics • 24
A deeper dive into data warehouse concepts and architecture
Dimensional modeling in data warehouses • 28
Understanding the role of data marts • 32
Distributed storage and massively parallel processing • 33
Columnar data storage and efficient data compression • 35
Feeding data into the warehouse – ETL and ELT pipelines • 37
Data lake logical architecture • 41
The storage layer and storage zones • 42
Catalog and search layers • 43
Ingestion layer • 43
The processing layer • 44
The consumption layer • 44
Data lake architecture summary • 44
Federated queries across database engines • 47
Accessing the CLI
Using AWS CloudShell to access the CLI • 49
Creating new Amazon S3 buckets • 51
AWS Database Migration Service (DMS) • 54
Amazon Kinesis for streaming data ingestion • 56
Amazon Kinesis Agent • 57
Amazon Kinesis Data Firehose • 58
Amazon Kinesis Data Streams • 59
Amazon Kinesis Data Analytics • 60
Amazon Kinesis Video Streams • 60
Amazon MSK for streaming data ingestion • 61
Amazon AppFlow for ingesting data from SaaS services • 62
AWS Transfer Family for ingestion using FTP/SFTP protocols • 63
AWS DataSync for ingesting from on-premises and multi-cloud storage services • 64
The AWS Snow family of devices for large data transfers • 64
AWS Glue for data ingestion • 66
An overview of AWS services for transforming data
AWS Lambda for light transformations • 67
AWS Glue for serverless data processing • 68
Serverless ETL processing • 68
AWS Glue DataBrew • 70
AWS Glue Data Catalog • 70
AWS Glue crawlers • 72
Amazon EMR for Hadoop ecosystem processing • 73
An overview of AWS services for orchestrating big data pipelines
AWS Glue workflows for orchestrating Glue components • 75
AWS Step Functions for complex workflows • 77
Amazon Managed Workflows for Apache Airflow (MWAA) • 79
An overview of AWS services for consuming data
Amazon Athena for SQL queries in the data lake • 81
Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures • 82
Overview of Amazon QuickSight for visualizing data • 85
Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket • 87
Creating a Lambda layer containing the AWS SDK for pandas library • 87
Creating an IAM policy and role for your Lambda function • 89
Creating a Lambda function • 91
Configuring our Lambda function to be triggered by an S3 upload • 96
Common data regulatory requirements • 104
Core data protection concepts • 105
Personally identifiable information (PII) • 105
Personal data • 105
Encryption • 106
Anonymized data • 106
Pseudonymized data/tokenization • 107
Authentication • 108
Authorization • 109
Putting these concepts together • 109
Data quality • 110
Data profiling • 111
Data lineage • 113
Business and technical data catalogs
Implementing a data catalog to avoid creating a data swamp
Business data catalogs • 115
Technical data catalogs • 117
AWS services that help with data
The AWS Glue/Lake Formation technical data catalog • 118
AWS Glue DataBrew for profiling datasets • 120
AWS Glue Data Quality • 121
AWS Key Management Service (KMS) for data encryption • 122
Amazon Macie for detecting PII data in Amazon S3 objects • 123
The AWS Glue Studio Detect PII transform for detecting PII data in datasets • 124
Amazon GuardDuty for detecting threats in an AWS account • 124
AWS Identity and Access Management (IAM) service • 124
Using AWS Lake Formation to manage data lake access • 128
Permissions management before Lake Formation • 128
Permissions management using AWS Lake Formation • 129
Hands-on –
Creating a new user with IAM permissions • 130
Transitioning to managing fine-grained permissions with AWS Lake Formation • 135
Activating Lake Formation permissions for a database and table • 136
Granting Lake Formation permissions • 138
Section 2: Architecting and Implementing Data
Whiteboarding
Conducting
Data standardization • 154
Data quality checks • 155
Data partitioning • 155
Data denormalization • 155
Data cataloging • 155
Whiteboarding data transformation • 155
Loading data into data marts
Hands-on – architecting a sample pipeline
Detailed notes from the project “Bright Light” whiteboarding meeting of GP Widgets, Inc • 161
Initial full loads from a table, and subsequent loads of new records • 178
Creating AWS Glue jobs with AWS Lake Formation • 179
Other ways to ingest data from a database • 179
Deciding on the best approach to ingesting from a database • 180
The size of the database • 180
Database load • 181
Data ingestion frequency • 181
Technical requirements and compatibility • 182
Ingesting streaming data
Amazon Kinesis versus Amazon Managed Streaming for Apache Kafka (MSK) • 183
Serverless services versus managed services • 183
Open-source flexibility versus proprietary software with strong AWS integration • 184
At-least-once messaging versus exactly-once messaging • 185
A single processing engine versus niche tools • 185
Deciding on a streaming ingestion tool • 185
Hands-on – ingesting data with AWS DMS
Deploying MySQL and an EC2 data loader via CloudFormation • 186
Creating an IAM policy and role for DMS • 189
Configuring DMS settings and performing a full load from MySQL to S3 • 192
Querying data with Amazon Athena • 195
Hands-on – ingesting streaming data
Configuring Kinesis Data Firehose for streaming delivery to Amazon S3 • 197
Configuring Amazon Kinesis Data Generator (KDG) • 199
Adding newly ingested data to the Glue Data Catalog • 202
Querying the data with Amazon Athena • 203
Summary
Cooking, baking, and data transformations • 206
Transformations as part of a pipeline • 207
Types of data transformation tools
Apache Spark • 208
Hadoop and MapReduce • 208
SQL • 209
GUI-based tools • 210
Common data preparation transformations
Protecting PII data • 211
Optimizing the file format • 212
Optimizing with data partitioning • 213
Data cleansing • 215
Common business use case transformations
Data denormalization • 217
Enriching data • 218
Pre-aggregating data • 219
Extracting metadata from unstructured data • 219
Working with Change Data Capture (CDC) data
Traditional approaches – data upserts and SQL views • 221
Modern approaches – Open Table Formats (OTFs) • 222
Apache Iceberg • 223
Apache Hudi • 224
Databricks Delta Lake • 224
Hands-on – joining datasets with AWS Glue Studio
Creating a new data lake zone – the curated zone • 225
Creating a new IAM role for the Glue job • 225
Configuring a denormalization transform using AWS Glue Studio • 227
Finalizing the denormalization transform job to write to S3 • 232
Creating a transform job to join streaming and film data using AWS Glue Studio • 234
Summary
A growing variety of data consumers • 243
How a data mesh helps data consumers • 244
Meeting the needs of business users with data visualization
AWS tools for business users • 245
A quick overview of Amazon QuickSight • 245
Meeting the needs of data analysts with structured reporting
AWS tools for data analysts • 248
Amazon Athena • 249
AWS Glue DataBrew • 249
Running Python or R in AWS • 250
Meeting the needs of data scientists and ML models
AWS tools used by data scientists to work with data • 252
SageMaker Ground Truth • 252
SageMaker Data Wrangler • 253
SageMaker Clarify • 253
Hands-on – creating data transformations with AWS Glue
Configuring new datasets for AWS Glue DataBrew • 255
Creating a new Glue DataBrew project • 256
Building your Glue DataBrew recipe • 257
Creating a Glue DataBrew job • 260
Summary
Extending analytics with data warehouses/data marts
Cold and warm data • 264
Cold data • 264
Warm data • 265
Amazon S3 storage classes • 265
Hot data • 269
What not to do – anti-patterns for a data warehouse
Using a data warehouse as a transactional datastore • 270
Using a data warehouse as a data lake • 270
Storing unstructured data • 271
Redshift architecture review and storage deep dive
Data distribution across slices • 272
Redshift Zone Maps and sorting data • 274
Designing a high-performance data warehouse
Provisioned versus Redshift Serverless clusters • 275
Selecting the optimal Redshift node type for provisioned clusters • 276
Selecting the optimal table distribution style and sort key • 277
Selecting the right data type for columns • 278
Character types • 278
Numeric types • 279
Datetime types • 280
Boolean type • 281
HLLSKETCH type • 281
SUPER type • 282
Selecting the optimal table type • 282
Local Redshift tables • 282
External tables for querying data in Amazon S3 with Redshift Spectrum • 283
Temporary staging tables for loading data into Redshift • 284
Data caching using Redshift materialized views • 285
Moving data between a data lake and Redshift
Optimizing data ingestion in Redshift • 286
Automating data loads from Amazon S3 into Redshift • 288
Exporting data from Redshift to the data lake • 288
Exploring advanced Redshift features
Data sharing between Redshift clusters • 290
Machine learning capabilities in Amazon Redshift • 291
Running Redshift clusters across multiple Availability Zones • 292
Redshift Dynamic Data Masking • 293
Zero-ETL between Amazon Aurora and Amazon Redshift • 293
Resizing a Redshift cluster • 294
Hands-on – deploying a Redshift Serverless cluster and running Redshift queries
Uploading our sample data to Amazon S3 • 294
IAM roles for Redshift • 295
Creating a Redshift cluster • 296
Querying data in the sample database • 298
Using Redshift Spectrum to directly query data in the data lake • 299
What is a data pipeline, and how do you orchestrate it? • 305
What is a directed acyclic graph? • 306
How do you trigger a data pipeline to run? • 307
Using manifest files as pipeline triggers • 307
How do you handle the failures of a step in your pipeline? • 308
Common reasons for failure in data pipelines • 308
Pipeline failure retry strategies • 309
Examining the options for orchestrating pipelines in AWS
AWS Data Pipeline (now in maintenance mode) • 310
AWS Glue workflows to orchestrate Glue resources • 310
Monitoring and error handling • 311
Triggering Glue workflows • 312
Apache Airflow as an open-source orchestration solution • 313
Core concepts for creating Apache Airflow pipelines • 313
AWS Step Functions for a serverless orchestration solution • 315
A sample Step Functions state machine • 315
Deciding on which data pipeline orchestration tool to use • 317
Hands-on – orchestrating a data pipeline using AWS Step Functions
Creating new Lambda functions • 319
Using a Lambda function to determine the file extension • 319
Using Lambda to randomly generate failures • 320
Creating an SNS topic and subscribing to an email address • 321
Creating a new Step Functions state machine • 322
Configuring our S3 bucket to send events to EventBridge • 327
Creating an EventBridge rule for triggering our Step Functions state machine • 327
Testing our event-driven data orchestration pipeline • 330
Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning
Common file format and layout optimizations
Transforming raw source files to optimized file formats • 340
Partitioning the dataset • 341
Other file-based optimizations • 343
Writing optimized SQL queries • 344
Selecting only the specific columns that you need • 345
Using approximate aggregate functions • 345
Reusing Athena query results • 346
Exploring advanced Athena functionality
Querying external data sources using Athena Federated Query • 347
Pre-built connectors and custom connectors • 349
Using Apache Spark in Amazon Athena • 350
Working with open table formats in Amazon Athena • 351
Provisioning capacity for queries • 352
Managing groups of users with Amazon Athena workgroups
Managing Athena costs with Athena workgroups • 353
Per query data usage control • 354
Athena workgroup data usage controls • 354
Implementing governance controls with Athena workgroups • 355
Hands-on – creating an Amazon Athena workgroup and configuring Athena settings
Benefits of data visualization • 370
Popular uses of data visualizations • 370
Trends over time • 370
Data over a geographic area • 372
Heat maps to represent the intersection of data • 373
Understanding Amazon QuickSight’s core concepts
Standard versus Enterprise edition • 374
SPICE – the in-memory storage and computation engine for QuickSight • 376
Managing SPICE capacity • 377
Ingesting and preparing data from a variety of sources
Preparing datasets in QuickSight versus performing ETL outside of QuickSight
Creating and sharing visuals with QuickSight analyses and dashboards • 381
Visual types in Amazon QuickSight • 383
AutoGraph for automatic graphing • 383
Line, geospatial, and heat maps • 383
Bar charts • 383
Key performance indicators • 384
Tables as visuals • 385
Custom visual types • 385
Other visual types • 386
Understanding QuickSight’s advanced features
Amazon QuickSight ML Insights • 386
Amazon QuickSight autonarratives • 387
ML-powered anomaly detection • 387
ML-powered forecasting • 388
Amazon QuickSight Q for natural language queries • 388
Generative BI dashboard authoring capabilities • 389
QuickSight Q Topics • 389
Fine-tuning your QuickSight Q Topics • 390
Amazon QuickSight embedded dashboards • 391
Embedding for registered QuickSight users • 391
Embedding for unauthenticated users • 392
Generating multi-page formatted reports • 393
Hands-on – creating a simple QuickSight visualization
Setting up a new QuickSight account and loading a dataset • 393
Creating a new analysis • 396
Publishing our visual as a dashboard • 401
Summary
Understanding the value of ML and AI for organizations
Specialized ML projects • 407
Medical clinical decision support platform • 407
Early detection of diseases • 408
Making sports safer • 408
Everyday use cases for ML and AI • 409
Forecasting • 409
Personalization • 409
Natural language processing • 410
Image recognition • 410
Exploring AWS services for ML
AWS ML services • 412
SageMaker in the ML preparation phase • 412
SageMaker in the ML build phase • 413
SageMaker in the ML training and tuning phase • 415
SageMaker in the ML deployment and management phase • 415
Exploring AWS services for AI
AI for unstructured speech and text • 417
Amazon Transcribe for converting speech into text • 417
Amazon Textract for extracting text from documents • 418
Amazon Comprehend for extracting insights from text • 420
AI for extracting metadata from images and video • 421
Amazon Rekognition • 421
AI for ML-powered forecasts • 423
Amazon Forecast • 423
AI for fraud detection and personalization • 424
Amazon Fraud Detector • 424
Amazon Personalize • 425
Building generative AI solutions on AWS
Understanding the foundations of generative AI technology • 425
Building on foundational models using Amazon SageMaker JumpStart • 426
Building on foundational models using Amazon Bedrock • 427
Common use cases for LLMs
Hands-on – reviewing reviews with Amazon Comprehend
Setting up a new Amazon SQS message queue • 428
Creating a Lambda function for calling Amazon Comprehend • 429
Adding Comprehend permissions for our IAM role • 432
Adding a Lambda function as a trigger for our SQS message queue • 433
Testing the solution with Amazon Comprehend • 434
Limitations of Hive-based data lakes • 441
High-level benefits of open table formats • 442
ACID transactions • 442
Record level updates • 443
Schema evolution • 443
Time travel • 443
Overview of how open table formats work • 444
Approaches used by table formats for updating tables • 445
COW approach to table updates • 446
MOR approach to table updates • 446
Choosing between COW and MOR • 447
An overview of Delta Lake, Apache Hudi, and Apache Iceberg
Deep dive into Delta Lake • 448
Advanced features available in Delta Lake • 448
Deep dive into Apache Hudi • 450
Hudi Primary Keys • 450
File groups • 451
Compaction • 451
Record level index • 452
Deep dive into Apache Iceberg • 452
Iceberg Metadata file • 453
The manifest list file • 454
The manifest file • 454
Putting it together • 454
Maintenance tasks for Iceberg tables • 456
AWS service integrations for building transactional data lakes
Open table format support in AWS Glue • 459
AWS Glue crawler support • 459
AWS Glue ETL engine support • 459
Open table support in AWS Lake Formation • 459
Open table support in Amazon EMR • 460
Open table support in Amazon Redshift • 461
Open table support in Amazon Athena • 461
Hands-on – Working with Apache Iceberg tables in Amazon Athena
Creating an Apache Iceberg table using Amazon Athena • 463
Adding data to our Iceberg table and running queries • 464
Modifying data in our Iceberg table and running queries • 466
Iceberg table maintenance tasks • 470
Optimizing the table layout • 470
Reducing disk space by deleting snapshots • 471
Domain-oriented, decentralized data ownership • 478
Data as a product • 479
Self-service data infrastructure as a platform • 480
Federated computational governance • 481
Data producers and consumers • 482
Challenges that a data mesh approach attempts to resolve
Bottlenecks with a centralized data team • 483
The “Analytics is not my problem” problem • 484
No organization-wide visibility into datasets that are available • 485
The organizational and technical challenges of building a data mesh
Changing the way that an organization approaches analytical data • 486
Changes for the centralized data & analytics team • 486
Changes for line of business teams • 487
Technical challenges for building a data mesh • 489
Integrating existing analytical tools • 490
Centralizing dataset metadata in a single catalog and building automation • 490
Compromising on integrations • 491
AWS services that help enable a data mesh approach
Querying data across AWS accounts • 492
Sharing data with AWS Lake Formation • 492
Amazon DataZone, a business data catalog with data mesh functionality • 493
DataZone concepts • 494
DataZone components • 496
A sample architecture for a data mesh on AWS
Architecture for a data mesh using AWS-native services • 497
Architecture for a data mesh using non-AWS analytic services • 499
Automating the sharing of data in Snowflake • 501
Using query federation instead of data sharing • 501
Hands-on – Setting up Amazon DataZone
Setting up AWS IAM Identity Center • 503
Enabling and configuring Amazon DataZone • 504
Adding a data source to our DataZone project • 506
Adding business metadata • 508
Creating a project for data analysis • 509
Searching the data catalog and subscribing to data • 510
Approving the subscription request • 511
Chapter 16: Building a Modern Data Platform on AWS 513
Goals
A flexible and agile platform • 514
A scalable platform • 515
A well-governed platform • 515
A secure platform • 516
An easy-to-use, self-serve platform • 516
Deciding whether to build or buy a data platform
Choosing to buy a data platform • 517
When to buy a data platform • 519
Choosing to build a data platform • 520
When to build a data platform • 520
A third way – implementing an open-source data platform • 521
The Serverless Data Lake Framework (SDLF) • 522
Core SDLF concepts • 523
DataOps as an approach to building data platforms
Automation and observability as a key for DataOps • 524
Automating infrastructure and code deployment • 525
Automating observability • 526
AWS services for implementing a DataOps approach • 527
AWS services for infrastructure deployment • 527
AWS code management and deployment services • 530
Hands-on – automated deployment of data platform components and data transformation code
Setting up a Cloud9 IDE environment • 532
Setting up our AWS CodeCommit repository • 534
Adding a Glue ETL script and CloudFormation template into our repository • 535
Automating deployment of our Glue code • 541
Automating the deployment of our Glue job • 541
Testing our CodePipeline • 544
A decade of data wrapped up for Spotify users • 551
Ingesting and processing streaming files at Netflix scale • 553
Enriching VPC Flow Logs with application information • 554
Working around Amazon SQS quota limits • 555
Imagining the future – a look at emerging trends
Increased adoption of a data mesh approach • 558
Requirement to work in a multi-cloud environment • 558
Migration to open table formats • 559
Managing costs with FinOps • 559
The merging of data warehouses and data lakes • 560
The application of generative AI to business intelligence and analytics • 561
The application of generative AI to building transformations • 562
Hands-on – cleaning up your AWS account
Reviewing AWS Billing to identify the resources being charged for • 564
Closing your AWS account • 567
Preface
We live in a world where the amount of data being generated is constantly increasing. While a few decades ago, an organization may have had a single database that could store everything they needed to track, today most organizations have tens, hundreds, or even thousands of databases, along with data warehouses, and perhaps a data lake. And these data stores are being fed from an increasing number of data sources (transaction data, web server log files, IoT and other sensors, and social media, to name just a few).
It is no surprise that we hear more and more companies talk about being data-driven in their decision making. But for an organization to be truly data-driven, it needs to master managing and drawing insights from these ever-increasing quantities and types of data. And to enable this, organizations need to employ people with specialized data skills.
Doing a search on LinkedIn for jobs related to data returns nearly 800,000 results (and that is just for the United States!). The job titles include roles such as data engineer, data scientist, and data architect.
This revised edition of the book includes updates to all chapters, covering new features and services from AWS, as well as three brand-new chapters. In these new chapters, we cover topics such as building transactional data lakes (using open table formats such as Apache Iceberg), implementing a data mesh approach on AWS, and using a DataOps approach to building a modern data platform.
While this book will not magically turn you into a data engineer, it has been designed to accelerate your journey toward data engineering on AWS. By the end of this book, you will not only have learned some of the core concepts around data engineering, but you will also have a good understanding of the wide variety of tools available in AWS for working with data. You will also have been through numerous hands-on exercises, and thus gained practical experience with things such as ingesting streaming data, transforming and optimizing data, building visualizations, and even drawing insights from data using AI.
Who this book is for
This book has been designed for two groups of people. First, those looking to get started with a career in data engineering, who want to learn core data engineering concepts. This book introduces many different aspects of data engineering, providing a comprehensive high-level understanding of, and practical hands-on experience with, different focus areas of data engineering. Second, this book is for those who already have an established career focused on data, but who are new to the cloud, and to AWS in particular. For these readers, this book provides a clear understanding of many of the different AWS services for working with data, and gives them hands-on experience with a variety of these services.
What this book covers
Each of the chapters in this book takes the approach of introducing important concepts or key AWS services, and then providing a hands-on exercise related to the topic of the chapter:
Chapter 1, An Introduction to Data Engineering, reviews the challenges of ever-increasing dataset volumes, and the role of the data engineer in working with data in the cloud.
Chapter 2, Data Management Architectures for Analytics, introduces foundational concepts and technologies related to big data processing.
Chapter 3, The AWS Data Engineer’s Toolkit, provides an introduction to a wide range of AWS services that are used for ingesting, processing, and consuming data, and orchestrating pipelines.
Chapter 4, Data Governance, Security, and Cataloging, covers the all-important topics of keeping data secure, ensuring good data governance, and the importance of cataloging your data.
Chapter 5, Architecting Data Engineering Pipelines, provides an approach for whiteboarding the high-level design of a data engineering pipeline.
Chapter 6, Ingesting Batch and Streaming Data, looks at the variety of data sources that we may need to ingest from, and examines AWS services for ingesting both batch and streaming data.
Chapter 7, Transforming Data to Optimize for Analytics, covers common transformations for optimizing datasets and for applying business logic.
Chapter 8, Identifying and Enabling Data Consumers, is about better understanding the different types of data consumers that a data engineer may work to prepare data for.
Chapter 9, A Deeper Dive into Data Marts and Amazon Redshift, focuses on the use of data warehouses as a data mart and looks at moving data between a data lake and data warehouse. This chapter also does a deep dive into Amazon Redshift, a cloud-based data warehouse.
Chapter 10, Orchestrating the Data Pipeline, looks at how various data engineering tasks and transformations can be put together in a data pipeline, and how these can be run and managed with pipeline orchestration tools such as AWS Step Functions.
Chapter 11, Ad Hoc Queries with Amazon Athena, does a deeper dive into the Amazon Athena service, which can be used to run SQL queries directly on data in the data lake, and beyond.
Chapter 12, Visualizing Data with Amazon QuickSight, discusses the importance of being able to craft visualizations of data, and how the Amazon QuickSight service enables this.
Chapter 13, Enabling Artificial Intelligence and Machine Learning, reviews how AI and ML are increasingly important for gaining new value from data, and introduces some of the AWS services for both ML and AI.
Chapter 14, Building Transactional Data Lakes, looks at new table formats (including Apache Iceberg, Apache Hudi, and Delta Lake) that bring traditional data warehousing type features to data lakes.
Chapter 15, Implementing a Data Mesh Strategy, discusses a recent trend, referred to as a data mesh, that provides a new way to approach analytical data management and data sharing within an organization.
Chapter 16, Building a Modern Data Platform on AWS, introduces important concepts, such as DataOps, which provides automation and observability when building a modern data platform.
Chapter 17, Wrapping Up the First Part of Your Learning Journey, concludes the book by looking at the bigger picture of data analytics, including real-world examples of data pipelines, and a review of emerging trends in the industry.
To get the most out of this book
Basic knowledge of computer systems and concepts, and how these are used within large organizations, is a helpful prerequisite for this book, but no data engineering-specific skills or knowledge are required. Familiarity with cloud computing fundamentals and core AWS services will also make it easier to follow along, especially with the hands-on exercises, but detailed step-by-step instructions are included for each task.