

Data Engineering with AWS

Second Edition

Acquire the skills to design and build AWS-based data transformation pipelines like a pro

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damage caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Senior Publishing Product Manager: Gebin George

Acquisition Editor – Peer Reviews: Tejas Mhasvekar

Project Editor: Namrata Katare

Content Development Editor: Elliot Dallow

Copy Editor: Safis Editing

Technical Editor: Srishty Bhardwaj

Proofreader: Safis Editing

Indexer: Rekha Nair

Presentation Designer: Pranit Padwal

Developer Relations Marketing Executive: Vignesh Raju

First published: December 2021

Second edition: October 2023

Production reference: 1261023

Published by Packt Publishing Ltd.

Grosvenor House 11 St Paul’s Square Birmingham B3 1RB, UK.

ISBN 978-1-80461-442-6

www.packt.com

Contributors

About the author

Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA.

Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services, and deep expertise around building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers.

Gareth also frequently speaks on data-related topics.

To my amazing wife and children, thank you for your patience and understanding as I spent countless hours writing the revised edition of this book. Your support for me taking on this project, and making the space and time for me to write, means so much to me.

A special thanks to Disha Umarwani, Praful Kava, and Natalie Rabinovich, who each contributed content for the first edition of this book. And many thanks to Amit Kalawat, Leonardo Gomez, and many others for helping to review content for this revised edition.

About the reviewers

Vaibhav Tyagi is a skilled and experienced cloud data engineer and architect with 10 years of experience. He has a deep understanding of AWS cloud services and is proficient in a variety of data engineering tools, including Spark, Hive, and Hadoop.

Throughout his career, he has worked for Teradata, Citigroup, NatWest, and Amazon, and has worked on, among other things, designing and implementing cloud-based pipelines, building complex cloud environments, and creating data warehouses.

I would like to thank my wife and children who have been my biggest cheerleaders and put up with my long working hours. I am truly grateful for their love and support. And thank you to my friends who have also been a great source of support.

Gaurav Verma has 9 years of experience in the field, having worked at AWS, Skyscanner, Discovery Communications, and Tata Consultancy Services.

He excels in designing and delivering big data and analytics solutions on AWS. His expertise spans AWS services, Python, Scala, Spark, and more. He currently leads a team at Amazon, overseeing global analytics and the ML data platform. His career highlights include optimizing data pipelines, managing analytics projects, and extensive training in big data and data engineering technologies.

Learn more on Discord

To join the Discord community for this book – where you can share feedback, ask questions to the author, and learn about new releases – follow the QR code below:

https://discord.gg/9s5mHNyECd

Databases and data warehouses • 22

Dealing with big, unstructured data • 23

Cloud-based solutions for big data analytics • 24

A deeper dive into data warehouse concepts and architecture

Dimensional modeling in data warehouses • 28

Understanding the role of data marts • 32

Distributed storage and massively parallel processing • 33

Columnar data storage and efficient data compression • 35

Feeding data into the warehouse – ETL and ELT pipelines • 37

Data lake logical architecture • 41

The storage layer and storage zones • 42

Catalog and search layers • 43

Ingestion layer • 43

The processing layer • 44

The consumption layer • 44

Data lake architecture summary • 44

Federated queries across database engines • 47

Accessing the CLI

Using AWS CloudShell to access the CLI • 49

Creating new Amazon S3 buckets • 51

AWS Database Migration Service (DMS) • 54

Amazon Kinesis for streaming data ingestion • 56

Amazon Kinesis Agent • 57

Amazon Kinesis Data Firehose • 58

Amazon Kinesis Data Streams • 59

Amazon Kinesis Data Analytics • 60

Amazon Kinesis Video Streams • 60

Amazon MSK for streaming data ingestion • 61

Amazon AppFlow for ingesting data from SaaS services • 62

AWS Transfer Family for ingestion using FTP/SFTP protocols • 63

AWS DataSync for ingesting from on-premises and multi-cloud storage services • 64

The AWS Snow family of devices for large data transfers • 64

AWS Glue for data ingestion • 66

An overview of AWS services for transforming data

AWS Lambda for light transformations • 67

AWS Glue for serverless data processing • 68

Serverless ETL processing • 68

AWS Glue DataBrew • 70

AWS Glue Data Catalog • 70

AWS Glue crawlers • 72

Amazon EMR for Hadoop ecosystem processing • 73

An overview of AWS services for orchestrating big data pipelines

AWS Glue workflows for orchestrating Glue components • 75

AWS Step Functions for complex workflows • 77

Amazon Managed Workflows for Apache Airflow (MWAA) • 79

An overview of AWS services for consuming data

Amazon Athena for SQL queries in the data lake • 81

Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures • 82

Overview of Amazon QuickSight for visualizing data • 85

Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket • 87

Creating a Lambda layer containing the AWS SDK for pandas library • 87

Creating an IAM policy and role for your Lambda function • 89

Creating a Lambda function • 91

Configuring our Lambda function to be triggered by an S3 upload • 96

Common data regulatory requirements • 104

Core data protection concepts • 105

Personally identifiable information (PII) • 105

Personal data • 105

Encryption • 106

Anonymized data • 106

Pseudonymized data/tokenization • 107

Authentication • 108

Authorization • 109

Putting these concepts together • 109

Data quality • 110

Data profiling • 111

Data lineage • 113

Business and technical data catalogs

Implementing a data catalog to avoid creating a data swamp

Business data catalogs • 115

Technical data catalogs • 117

AWS services that help with data governance

The AWS Glue/Lake Formation technical data catalog • 118

AWS Glue DataBrew for profiling datasets • 120

AWS Glue Data Quality • 121

AWS Key Management Service (KMS) for data encryption • 122

Amazon Macie for detecting PII data in Amazon S3 objects • 123

The AWS Glue Studio Detect PII transform for detecting PII data in datasets • 124

Amazon GuardDuty for detecting threats in an AWS account • 124

AWS Identity and Access Management (IAM) service • 124

Using AWS Lake Formation to manage data lake access • 128

Permissions management before Lake Formation • 128

Permissions management using AWS Lake Formation • 129

Hands-on –

Creating a new user with IAM permissions • 130

Transitioning to managing fine-grained permissions with AWS Lake Formation • 135

Activating Lake Formation permissions for a database and table • 136

Granting Lake Formation permissions • 138

Section 2: Architecting and Implementing Data

Whiteboarding

Conducting

Data standardization • 154

Data quality checks • 155

Data partitioning • 155

Data denormalization • 155

Data cataloging • 155

Whiteboarding data transformation • 155

Loading data into data marts

Hands-on – architecting a sample pipeline

Detailed notes from the project “Bright Light” whiteboarding meeting of GP Widgets, Inc • 161

Meeting notes • 162

Summary

Data variety • 170

Structured data • 171

Semi-structured data • 172

Unstructured data • 174

Data volume • 174

Data velocity • 175

Data veracity • 175

Data value • 176

Questions to ask • 176

Ingesting data from a relational database

AWS DMS • 177

AWS Glue • 178

Full one-off loads from one or more tables • 178

Initial full loads from a table, and subsequent loads of new records • 178

Creating AWS Glue jobs with AWS Lake Formation • 179

Other ways to ingest data from a database • 179

Deciding on the best approach to ingesting from a database • 180

The size of the database • 180

Database load • 181

Data ingestion frequency • 181

Technical requirements and compatibility • 182

Ingesting streaming data

Amazon Kinesis versus Amazon Managed Streaming for Kafka (MSK) • 183

Serverless services versus managed services • 183

Open-source flexibility versus proprietary software with strong AWS integration • 184

At-least-once messaging versus exactly once messaging • 185

A single processing engine versus niche tools • 185

Deciding on a streaming ingestion tool • 185

Hands-on – ingesting data with AWS DMS

Deploying MySQL and an EC2 data loader via CloudFormation • 186

Creating an IAM policy and role for DMS • 189

Configuring DMS settings and performing a full load from MySQL to S3 • 192

Querying data with Amazon Athena • 195

Hands-on – ingesting streaming data

Configuring Kinesis Data Firehose for streaming delivery to Amazon S3 • 197

Configuring Amazon Kinesis Data Generator (KDG) • 199

Adding newly ingested data to the Glue Data Catalog • 202

Querying the data with Amazon Athena • 203

Summary

Cooking, baking, and data transformations • 206

Transformations as part of a pipeline • 207

Types of data transformation tools

Apache Spark • 208

Hadoop and MapReduce • 208

SQL • 209

GUI-based tools • 210

Common data preparation transformations

Protecting PII data • 211

Optimizing the file format • 212

Optimizing with data partitioning • 213

Data cleansing • 215

Common business use case transformations

Data denormalization • 217

Enriching data • 218

Pre-aggregating data • 219

Extracting metadata from unstructured data • 219

Working with Change Data Capture (CDC) data

Traditional approaches – data upserts and SQL views • 221

Modern approaches – Open Table Formats (OTFs) • 222

Apache Iceberg • 223

Apache Hudi • 224

Databricks Delta Lake • 224

Hands-on – joining datasets with AWS Glue Studio

Creating a new data lake zone – the curated zone • 225

Creating a new IAM role for the Glue job • 225

Configuring a denormalization transform using AWS Glue Studio • 227

Finalizing the denormalization transform job to write to S3 • 232

Creating a transform job to join streaming and film data using AWS Glue Studio • 234

Summary

A growing variety of data consumers • 243

How a data mesh helps data consumers • 244

Meeting the needs of business users with data visualization

AWS tools for business users • 245

A quick overview of Amazon QuickSight • 245

Meeting the needs of data analysts with structured reporting

AWS tools for data analysts • 248

Amazon Athena • 249

AWS Glue DataBrew • 249

Running Python or R in AWS • 250

Meeting the needs of data scientists and ML models

AWS tools used by data scientists to work with data • 252

SageMaker Ground Truth • 252

SageMaker Data Wrangler • 253

SageMaker Clarify • 253

Hands-on – creating data transformations with AWS Glue

Configuring new datasets for AWS Glue DataBrew • 255

Creating a new Glue DataBrew project • 256

Building your Glue DataBrew recipe • 257

Creating a Glue DataBrew job • 260

Summary

Extending analytics with data warehouses/data marts

Cold and warm data • 264

Cold data • 264

Warm data • 265

Amazon S3 storage classes • 265

Hot data • 269

What not to do – anti-patterns for a data warehouse

Using a data warehouse as a transactional datastore • 270

Using a data warehouse as a data lake • 270

Storing unstructured data • 271

Redshift architecture review and storage deep dive

Data distribution across slices • 272

Redshift Zone Maps and sorting data • 274

Designing a high-performance data warehouse

Provisioned versus Redshift Serverless clusters • 275

Selecting the optimal Redshift node type for provisioned clusters • 276

Selecting the optimal table distribution style and sort key • 277

Selecting the right data type for columns • 278

Character types • 278

Numeric types • 279

Datetime types • 280

Boolean type • 281

HLLSKETCH type • 281

SUPER type • 282

Selecting the optimal table type • 282

Local Redshift tables • 282

External tables for querying data in Amazon S3 with Redshift Spectrum • 283

Temporary staging tables for loading data into Redshift • 284

Data caching using Redshift materialized views • 285

Moving data between a data lake and Redshift • 286

Optimizing data ingestion in Redshift • 286

Automating data loads from Amazon S3 into Redshift • 288

Exporting data from Redshift to the data lake • 288

Exploring advanced Redshift features

Data sharing between Redshift clusters • 290

Machine learning capabilities in Amazon Redshift • 291

Running Redshift clusters across multiple Availability Zones • 292

Redshift Dynamic Data Masking • 293

Zero-ETL between Amazon Aurora and Amazon Redshift • 293

Resizing a Redshift cluster • 294

Hands-on – deploying a Redshift Serverless cluster and running Redshift

Uploading our sample data to Amazon S3 • 294

IAM roles for Redshift • 295

Creating a Redshift cluster • 296

Querying data in the sample database • 298

Using Redshift Spectrum to directly query data in the data lake • 299

What is a data pipeline, and how do you orchestrate it? • 305

What is a directed acyclic graph? • 306

How do you trigger a data pipeline to run? • 307

Using manifest files as pipeline triggers • 307

How do you handle the failures of a step in your pipeline? • 308

Common reasons for failure in data pipelines • 308

Pipeline failure retry strategies • 309

Examining the options for orchestrating pipelines in AWS

AWS Data Pipeline (now in maintenance mode) • 310

AWS Glue workflows to orchestrate Glue resources • 310

Monitoring and error handling • 311

Triggering Glue workflows • 312

Apache Airflow as an open-source orchestration solution • 313

Core concepts for creating Apache Airflow pipelines • 313

AWS Step Functions for a serverless orchestration solution • 315

A sample Step Functions state machine • 315

Deciding on which data pipeline orchestration tool to use • 317

Hands-on – orchestrating a data pipeline using AWS Step Functions

Creating new Lambda functions • 319

Using a Lambda function to determine the file extension • 319

Using Lambda to randomly generate failures • 320

Creating an SNS topic and subscribing to an email address • 321

Creating a new Step Functions state machine • 322

Configuring our S3 bucket to send events to EventBridge • 327

Creating an EventBridge rule for triggering our Step Functions state machine • 327

Testing our event-driven data orchestration pipeline • 330

Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning

Common file format and layout optimizations

Transforming raw source files to optimized file formats • 340

Partitioning the dataset • 341

Other file-based optimizations • 343

Writing optimized SQL queries • 344

Selecting only the specific columns that you need • 345

Using approximate aggregate functions • 345

Reusing Athena query results • 346

Exploring advanced Athena functionality

Querying external data sources using Athena Federated Query • 347

Pre-built connectors and custom connectors • 349

Using Apache Spark in Amazon Athena • 350

Working with open table formats in Amazon Athena • 351

Provisioning capacity for queries • 352

Managing groups of users with Amazon Athena workgroups

Managing Athena costs with Athena workgroups • 353

Per query data usage control • 354

Athena workgroup data usage controls • 354

Implementing governance controls with Athena workgroups • 355

Hands-on – creating an Amazon Athena workgroup and configuring Athena settings

Benefits of data visualization • 370

Popular uses of data visualizations • 370

Trends over time • 370

Data over a geographic area • 372

Heat maps to represent the intersection of data • 373

Understanding Amazon QuickSight’s core concepts

Standard versus Enterprise edition • 374

SPICE – the in-memory storage and computation engine for QuickSight • 376

Managing SPICE capacity • 377

Ingesting and preparing data from a

Preparing datasets in QuickSight versus performing ETL outside of QuickSight •

Creating and sharing visuals with QuickSight analyses and dashboards • 381

Visual types in Amazon QuickSight • 383

AutoGraph for automatic graphing • 383

Line, geospatial, and heat maps • 383

Bar charts • 383

Key performance indicators • 384

Tables as visuals • 385

Custom visual types • 385

Other visual types • 386

Understanding QuickSight’s advanced features

Amazon QuickSight ML Insights • 386

Amazon QuickSight autonarratives • 387

ML-powered anomaly detection • 387

ML-powered forecasting • 388

Amazon QuickSight Q for natural language queries • 388

Generative BI dashboarding authoring capabilities • 389

QuickSight Q Topics • 389

Fine-tuning your QuickSight Q Topics • 390

Amazon QuickSight embedded dashboards • 391

Embedding for registered QuickSight users • 391

Embedding for unauthenticated users • 392

Generating multi-page formatted reports • 393

Hands-on – creating a simple QuickSight visualization

Setting up a new QuickSight account and loading a dataset • 393

Creating a new analysis • 396

Publishing our visual as a dashboard • 401

Summary

Understanding the value of ML and AI for organizations

Specialized ML projects • 407

Medical clinical decision support platform • 407

Early detection of diseases • 408

Making sports safer • 408

Everyday use cases for ML and AI • 409

Forecasting • 409

Personalization • 409

Natural language processing • 410

Image recognition • 410

Exploring AWS services for ML

AWS ML services • 412

SageMaker in the ML preparation phase • 412

SageMaker in the ML build phase • 413

SageMaker in the ML training and tuning phase • 415

SageMaker in the ML deployment and management phase • 415

Exploring AWS services for AI

AI for unstructured speech and text • 417

Amazon Transcribe for converting speech into text • 417

Amazon Textract for extracting text from documents • 418

Amazon Comprehend for extracting insights from text • 420

AI for extracting metadata from images and video • 421

Amazon Rekognition • 421

AI for ML-powered forecasts • 423

Amazon Forecast • 423

AI for fraud detection and personalization • 424

Amazon Fraud Detector • 424

Amazon Personalize • 425

Building generative AI solutions on AWS

Understanding the foundations of generative AI technology • 425

Building on foundational models using Amazon SageMaker JumpStart • 426

Building on foundational models using Amazon Bedrock • 427

Common use cases for LLMs

Hands-on – reviewing reviews with Amazon Comprehend

Setting up a new Amazon SQS message queue • 428

Creating a Lambda function for calling Amazon Comprehend • 429

Adding Comprehend permissions for our IAM role • 432

Adding a Lambda function as a trigger for our SQS message queue • 433

Testing the solution with Amazon Comprehend • 434

Limitations of Hive-based data lakes • 441

High-level benefits of open table formats • 442

ACID transactions • 442

Record level updates • 443

Schema evolution • 443

Time travel • 443

Overview of how open table formats work • 444

Approaches used by table formats for updating tables • 445

COW approach to table updates • 446

MOR approach to table updates • 446

Choosing between COW and MOR • 447

An overview of Delta Lake, Apache Hudi, and Apache Iceberg

Deep dive into Delta Lake • 448

Advanced features available in Delta Lake • 448

Deep dive into Apache Hudi • 450

Hudi Primary Keys • 450

File groups • 451

Compaction • 451

Record level index • 452

Deep dive into Apache Iceberg • 452

Iceberg Metadata file • 453

The manifest list file • 454

The manifest file • 454

Putting it together • 454

Maintenance tasks for Iceberg tables • 456

AWS service integrations for building transactional data lakes

Open table format support in AWS Glue • 459

AWS Glue crawler support • 459

AWS Glue ETL engine support • 459

Open table support in AWS Lake Formation • 459

Open table support in Amazon EMR • 460

Open table support in Amazon Redshift • 461

Open table support in Amazon Athena • 461

Hands-on – working with Apache Iceberg tables in Amazon Athena

Creating an Apache Iceberg table using Amazon Athena • 463

Adding data to our Iceberg table and running queries • 464

Modifying data in our Iceberg table and running queries • 466

Iceberg table maintenance tasks • 470

Optimizing the table layout • 470

Reducing disk space by deleting snapshots • 471

Domain-oriented, decentralized data ownership • 478

Data as a product • 479

Self-service data infrastructure as a platform • 480

Federated computational governance • 481

Data producers and consumers • 482

Challenges that a data mesh approach attempts to resolve

Bottlenecks with a centralized data team • 483

The “Analytics is not my problem” problem • 484

No organization-wide visibility into datasets that are available • 485

The organizational and technical challenges of building a data mesh

Changing the way that an organization approaches analytical data • 486

Changes for the centralized data & analytics team • 486

Changes for line of business teams • 487

Technical challenges for building a data mesh • 489

Integrating existing analytical tools • 490

Centralizing dataset metadata in a single catalog and building automation • 490

Compromising on integrations • 491

AWS services that help enable a data mesh approach

Querying data across AWS accounts • 492

Sharing data with AWS Lake Formation • 492

Amazon DataZone, a business data catalog with data mesh functionality • 493

DataZone concepts • 494

DataZone components • 496

A sample architecture for a data mesh on AWS

Architecture for a data mesh using AWS-native services • 497

Architecture for a data mesh using non-AWS analytic services • 499

Automating the sharing of data in Snowflake • 501

Using query federation instead of data sharing • 501

Hands-on – Setting up Amazon DataZone

Setting up AWS IAM Identity Center • 503

Enabling and configuring Amazon DataZone • 504

Adding a data source to our DataZone project • 506

Adding business metadata • 508

Creating a project for data analysis • 509

Searching the data catalog and subscribing to data • 510

Approving the subscription request • 511

Chapter 16: Building a Modern Data Platform on AWS 513

Goals

A flexible and agile platform • 514

A scalable platform • 515

A well-governed platform • 515

A secure platform • 516

An easy-to-use, self-serve platform • 516

Deciding whether to build or buy a data platform

Choosing to buy a data platform • 517

When to buy a data platform • 519

Choosing to build a data platform • 520

When to build a data platform • 520

A third way – implementing an open-source data platform • 521

The Serverless Data Lake Framework (SDLF) • 522

Core SDLF concepts • 523

DataOps as an approach to building data platforms

Automation and observability as a key for DataOps • 524

Automating infrastructure and code deployment • 525

Automating observability • 526

AWS services for implementing a DataOps approach • 527

AWS services for infrastructure deployment • 527

AWS code management and deployment services • 530

Hands-on – automated deployment of data platform components and data transformation

Setting up a Cloud9 IDE environment • 532

Setting up our AWS CodeCommit repository • 534

Adding a Glue ETL script and CloudFormation template into our repository • 535

Automating deployment of our Glue code • 541

Automating the deployment of our Glue job • 541

Testing our CodePipeline • 544

A decade of data wrapped up for Spotify users • 551

Ingesting and processing streaming files at Netflix scale • 553

Enriching VPC Flow Logs with application information • 554

Working around Amazon SQS quota limits • 555

Imagining the future – a look

Increased adoption of a data mesh approach • 558

Requirement to work in a multi-cloud environment • 558

Migration to open table formats • 559

Managing costs with FinOps • 559

The merging of data warehouses and data lakes • 560

The application of generative AI to business intelligence and analytics • 561

The application of generative AI to building transformations • 562

Hands-on –

Reviewing AWS Billing to identify the resources being charged for • 564

Closing your AWS account • 567

Preface

We live in a world where the amount of data being generated is constantly increasing. While a few decades ago, an organization may have had a single database that could store everything they needed to track, today most organizations have tens, hundreds, or even thousands of databases, along with data warehouses, and perhaps a data lake. And these data stores are being fed from an increasing number of data sources (transaction data, web server log files, IoT and other sensors, and social media, to name just a few).

It is no surprise that we hear more and more companies talk about being data-driven in their decision making. But for an organization to be truly data-driven, it needs to master managing, and drawing insights from, these ever-increasing quantities and types of data. And to enable this, organizations need to employ people with specialized data skills.

Doing a search on LinkedIn for jobs related to data returns nearly 800,000 results (and that is just for the United States!). The job titles include roles such as data engineer, data scientist, and data architect.

This revised edition of the book includes updates to all chapters, covering new features and services from AWS, as well as three brand-new chapters. In these new chapters, we cover topics such as building transactional data lakes (using open table formats such as Apache Iceberg), implementing a data mesh approach on AWS, and using a DataOps approach to building a modern data platform.

While this book will not magically turn you into a data engineer, it has been designed to accelerate your journey toward data engineering on AWS. By the end of this book, you will not only have learned some of the core concepts around data engineering, but you will also have a good understanding of the wide variety of tools available in AWS for working with data. You will also have been through numerous hands-on exercises, and thus gained practical experience with things such as ingesting streaming data, transforming and optimizing data, building visualizations, and even drawing insights from data using AI.

Who this book is for

This book has been designed for two groups of people: firstly, those looking to get started with a career in data engineering, who want to learn core data engineering concepts. This book introduces many different aspects of data engineering, providing a comprehensive high-level understanding of, and practical hands-on experience with, different focus areas of data engineering. Secondly, this book is for those who may already have an established career focused on data, but who are new to the cloud, and to AWS in particular. For these readers, this book provides a clear understanding of many of the different AWS services for working with data, and gives them hands-on experience with a variety of these services.

What this book covers

Each of the chapters in this book takes the approach of introducing important concepts or key AWS services, and then providing a hands-on exercise related to the topic of the chapter:

Chapter 1, An Introduction to Data Engineering, reviews the challenges of ever-increasing dataset volumes, and the role of the data engineer in working with data in the cloud.

Chapter 2, Data Management Architectures for Analytics, introduces foundational concepts and technologies related to big data processing.

Chapter 3, The AWS Data Engineer’s Toolkit, provides an introduction to a wide range of AWS services that are used for ingesting, processing, and consuming data, and orchestrating pipelines.

Chapter 4, Data Governance, Security, and Cataloging, covers the all-important topics of keeping data secure, ensuring good data governance, and the importance of cataloging your data.

Chapter 5, Architecting Data Engineering Pipelines, provides an approach for whiteboarding the high-level design of a data engineering pipeline.

Chapter 6, Ingesting Batch and Streaming Data, looks at the variety of data sources that we may need to ingest from, and examines AWS services for ingesting both batch and streaming data.

Chapter 7, Transforming Data to Optimize for Analytics, covers common transformations for optimizing datasets and for applying business logic.

Chapter 8, Identifying and Enabling Data Consumers, is about better understanding the different types of data consumers that a data engineer may work to prepare data for.

Chapter 9, A Deeper Dive into Data Marts and Amazon Redshift, focuses on the use of data warehouses as a data mart and looks at moving data between a data lake and data warehouse. This chapter also does a deep dive into Amazon Redshift, a cloud-based data warehouse.

Chapter 10, Orchestrating the Data Pipeline, looks at how various data engineering tasks and transformations can be put together in a data pipeline, and how these can be run and managed with pipeline orchestration tools such as AWS Step Functions.

Chapter 11, Ad Hoc Queries with Amazon Athena, does a deeper dive into the Amazon Athena service, which can be used to run SQL queries directly on data in the data lake, and beyond.

Chapter 12, Visualizing Data with Amazon QuickSight, discusses the importance of being able to craft visualizations of data, and how the Amazon QuickSight service enables this.

Chapter 13, Enabling Artificial Intelligence and Machine Learning, reviews how AI and ML are increasingly important for gaining new value from data, and introduces some of the AWS services for both ML and AI.

Chapter 14, Building Transactional Data Lakes, looks at new table formats (including Apache Iceberg, Apache Hudi, and Delta Lake) that bring traditional data warehousing type features to data lakes.

Chapter 15, Implementing a Data Mesh Strategy, discusses a recent trend, referred to as a data mesh, that provides a new way to approach analytical data management and data sharing within an organization.

Chapter 16, Building a Modern Data Platform on AWS, introduces important concepts, such as DataOps, which provides automation and observability when building a modern data platform.

Chapter 17, Wrapping Up the First Part of Your Learning Journey, concludes the book by looking at the bigger picture of data analytics, including real-world examples of data pipelines, and a review of emerging trends in the industry.

To get the most out of this book

Basic knowledge of computer systems and concepts, and how these are used within large organizations, is a helpful prerequisite for this book. However, no data engineering-specific skills or knowledge are required. Familiarity with cloud computing fundamentals and core AWS services will also make it easier to follow along, especially with the hands-on exercises, but detailed step-by-step instructions are included for each task.
