Data Engineering with AWS: learn how to design and build cloud-based data transformation pipelines using AWS
"Knowing how to architect and implement complex data pipelines is a highly sought-after skill. Data engineers are responsible for building these pipelines that ingest, transform, and join raw datasets - creating new value from the data in the process. Amazon Web Services (AWS) offers a range of...
Saved in:
Main author: | Eagar, Gareth
---|---
Format: | Book
Language: | English
Published: | Birmingham : Packt Publishing, 2021
Subjects: | Amazon Web Services; Cloud Computing; Big Data
Online access: | Table of contents
Summary: | "Knowing how to architect and implement complex data pipelines is a highly sought-after skill. Data engineers are responsible for building these pipelines that ingest, transform, and join raw datasets - creating new value from the data in the process. Amazon Web Services (AWS) offers a range of tools to simplify a data engineer's job, making it the preferred platform for performing data engineering tasks. This book will take you through the services and the skills you need to architect and implement data pipelines on AWS. You'll begin by reviewing important data engineering concepts and some of the core AWS services that form a part of the data engineer's toolkit. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how the transformed data is used by various data consumers. The book also teaches you about populating data marts and data warehouses along with how a data lakehouse fits into the picture. Later, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. In the final chapters, you'll understand how the power of machine learning and artificial intelligence can be used to draw new insights from data. By the end of this AWS book, you'll be able to carry out data engineering tasks and implement a data pipeline on AWS independently."--Amazon
Description: | Includes index
Description: | xvi, 461 pages : illustrations, diagrams ; 24 cm
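As an aside on the kind of pipeline the summary describes, the book's Chapter 3 hands-on exercise triggers an AWS Lambda function whenever a new file arrives in an S3 bucket. The sketch below illustrates that general pattern in Python with the AWS Data Wrangler library; the bucket names, handler, and the CSV-to-Parquet conversion are illustrative assumptions, not code taken from the book.

```python
import urllib.parse

import awswrangler as wr  # AWS Data Wrangler, deployed to Lambda as a layer

# Hypothetical target bucket for transformed data -- not taken from the book.
CURATED_BUCKET = "my-curated-zone-bucket"


def lambda_handler(event, context):
    """Triggered by an S3 ObjectCreated event; rewrites the new CSV file as Parquet."""
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications arrive URL-encoded.
        src_key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the newly arrived CSV from the raw/landing bucket...
        df = wr.s3.read_csv(path=f"s3://{src_bucket}/{src_key}")

        # ...and write it to a curated-zone prefix in columnar Parquet format.
        out_path = f"s3://{CURATED_BUCKET}/{src_key.rsplit('.', 1)[0]}/"
        wr.s3.to_parquet(df=df, path=out_path, dataset=True)

    return {"status": "ok", "files_processed": len(event["Records"])}
```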
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV048209762 | ||
003 | DE-604 | ||
005 | 20220629 | ||
007 | t| | ||
008 | 220510s2021 xx a||| |||| 00||| eng d | ||
015 | |a GBC1H3506 |2 dnb | ||
020 | |z 1800560419 |9 1800560419 | ||
020 | |z 9781800560413 |9 978-1-80056-041-3 | ||
035 | |a (OCoLC)1334021941 | ||
035 | |a (DE-599)BVBBV048209762 | ||
040 | |a DE-604 |b ger |e rda | ||
041 | 0 | |a eng | |
049 | |a DE-739 | ||
100 | 1 | |a Eagar, Gareth |d ca. 20./21. Jh. |e Verfasser |0 (DE-588)1261399382 |4 aut | |
245 | 1 | 0 | |a Data Engineering with AWS |b learn how to design and build cloud-based data transformation pipelines using AWS |c Gareth Eagar |
264 | 1 | |a Birmingham |b Packt Publishing |c 2021 | |
300 | |a xvi, 461 pages |b illustrationen, Diagramme |c 24 cm | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
500 | |a Includes index | ||
520 | |a "Knowing how to architect and implement complex data pipelines is a highly sought-after skill. Data engineers are responsible for building these pipelines that ingest, transform, and join raw datasets - creating new value from the data in the process. Amazon Web Services (AWS) offers a range of tools to simplify a data engineer's job, making it the preferred platform for performing data engineering tasks. This book will take you through the services and the skills you need to architect and implement data pipelines on AWS. You'll begin by reviewing important data engineering concepts and some of the core AWS services that form a part of the data engineer's toolkit. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how the transformed data is used by various data consumers. The book also teaches you about populating data marts and data warehouses along with how a data lakehouse fits into the picture. Later, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. In the final chapters, you'll understand how the power of machine learning and artificial intelligence can be used to draw new insights from data. By the end of this AWS book, you'll be able to carry out data engineering tasks and implement a data pipeline on AWS independently."--Amazon | ||
610 | 1 | 4 | |a Amazon Web Services (Firm) |
610 | 1 | 7 | |a Amazon Web Services (Firm) |2 fast |
650 | 4 | |a Cloud computing | |
650 | 4 | |a Big data | |
650 | 4 | |a Infonuagique | |
650 | 4 | |a Données volumineuses | |
650 | 7 | |a Big data |2 fast | |
650 | 7 | |a Cloud computing |2 fast | |
650 | 0 | 7 | |a Big Data |0 (DE-588)4802620-7 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Cloud Computing |0 (DE-588)7623494-0 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Amazon Web Services |0 (DE-588)1143985591 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Amazon Web Services |0 (DE-588)1143985591 |D s |
689 | 0 | 1 | |a Cloud Computing |0 (DE-588)7623494-0 |D s |
689 | 0 | 2 | |a Big Data |0 (DE-588)4802620-7 |D s |
689 | 0 | |5 DE-604 | |
776 | 0 | 8 | |i ebook version |
856 | 4 | 2 | |m Digitalisierung UB Passau - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=033590628&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
943 | 1 | |a oai:aleph.bib-bvb.de:BVB01-033590628 |
Record in the search index
_version_ | 1820882392617844736 |
---|---|
adam_text |
Table of Contents

Preface

Section 1: AWS Data Engineering Concepts and Trends

Chapter 1: An Introduction to Data Engineering
  Technical requirements
  The rise of big data as a corporate asset
  The challenges of ever-growing datasets
  Data engineers - the big data enablers
  Understanding the role of the data engineer
  Understanding the role of the data scientist
  Understanding the role of the data analyst
  Understanding other common data-related roles
  The benefits of the cloud when building big data analytic solutions
  Hands-on - creating and accessing your AWS account
  Creating a new AWS account
  Accessing your AWS account
  Summary

Chapter 2: Data Management Architectures for Analytics
  Technical requirements
  The evolution of data management for analytics
  Databases and data warehouses
  Dealing with big, unstructured data
  A lake on the cloud and a house on that lake
  Understanding data warehouses and data marts - fountains of truth
  Distributed storage and massively parallel processing
  Columnar data storage and efficient data compression
  Dimensional modeling in data warehouses
  Understanding the role of data marts
  Feeding data into the warehouse - ETL and ELT pipelines
  Building data lakes to tame the variety and volume of big data
  Data lake logical architecture
  Bringing together the best of both worlds with the lake house architecture
  Data lakehouse implementations
  Building a data lakehouse on AWS
  Hands-on - configuring the AWS Command Line Interface tool and creating an S3 bucket
  Installing and configuring the AWS CLI
  Creating a new Amazon S3 bucket
  Summary

Chapter 3: The AWS Data Engineer's Toolkit
  Technical requirements
  AWS services for ingesting data
  Overview of Amazon Database Migration Service (DMS)
  Overview of Amazon Kinesis for streaming data ingestion
  Overview of Amazon MSK for streaming data ingestion
  Overview of Amazon AppFlow for ingesting data from SaaS services
  Overview of Amazon Transfer Family for ingestion using FTP/SFTP protocols
  Overview of Amazon DataSync for ingesting from on-premises storage
  Overview of the AWS Snow family of devices for large data transfers
  AWS services for transforming data
  Overview of AWS Lambda for light transformations
  Overview of AWS Glue for serverless Spark processing
  Overview of Amazon EMR for Hadoop ecosystem processing
  AWS services for orchestrating big data pipelines
  Overview of AWS Glue workflows for orchestrating Glue components
  Overview of AWS Step Functions for complex workflows
  Overview of Amazon managed workflows for Apache Airflow
  AWS services for consuming data
  Overview of Amazon Athena for SQL queries in the data lake
  Overview of Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures
  Overview of Amazon QuickSight for visualizing data
  Hands-on - triggering an AWS Lambda function when a new file arrives in an S3 bucket
  Creating a Lambda layer containing the AWS Data Wrangler library
  Creating new Amazon S3 buckets
  Creating an IAM policy and role for your Lambda function
  Creating a Lambda function
  Configuring our Lambda function to be triggered by an S3 upload
  Summary

Chapter 4: Data Cataloging, Security, and Governance
  Technical requirements
  Getting data security and governance right
  Common data regulatory requirements
  Core data protection concepts
  Personal data
  Encryption
  Anonymized data
  Pseudonymized data/tokenization
  Authentication
  Authorization
  Putting these concepts together
  Cataloging your data to avoid the data swamp
  How to avoid the data swamp
  The AWS Glue/Lake Formation data catalog
  AWS services for data encryption and security monitoring
  AWS Key Management Service (KMS)
  Amazon Macie
  Amazon GuardDuty
  AWS services for managing identity and permissions
  AWS Identity and Access Management (IAM) service
  Using AWS Lake Formation to manage data lake access
  Hands-on - configuring Lake Formation permissions
  Creating a new user with IAM permissions
  Transitioning to managing fine-grained permissions with AWS Lake Formation
  Summary

Section 2: Architecting and Implementing Data Lakes and Data Lake Houses

Chapter 5: Architecting Data Engineering Pipelines
  Technical requirements
  Approaching the data pipeline architecture
  Architecting houses and architecting pipelines
  Whiteboarding as an information-gathering tool
  Conducting a whiteboarding session
  Identifying data consumers and understanding their requirements
  Identifying data sources and ingesting data
  Identifying data transformations and optimizations
  File format optimizations
  Data standardization
  Data quality checks
  Data partitioning
  Data denormalization
  Data cataloging
  Whiteboarding data transformation
  Loading data into data marts
  Wrapping up the whiteboarding session
  Hands-on - architecting a sample pipeline
  Detailed notes from the project "Bright Light" whiteboarding meeting of GP Widgets, Inc
  Summary

Chapter 6: Ingesting Batch and Streaming Data
  Technical requirements
  Understanding data sources
  Data variety
  Data volume
  Data velocity
  Data veracity
  Data value
  Questions to ask
  Ingesting data from a relational database
  AWS Database Migration Service (DMS)
  AWS Glue
  Other ways to ingest data from a database
  Deciding on the best approach for ingesting from a database
  Ingesting streaming data
  Amazon Kinesis versus Amazon Managed Streaming for Kafka (MSK)
  Hands-on - ingesting data with AWS DMS
  Creating a new MySQL database instance
  Loading the demo data using an Amazon EC2 instance
  Creating an IAM policy and role for DMS
  Configuring DMS settings and performing a full load from MySQL to S3
  Querying data with Amazon Athena
  Hands-on - ingesting streaming data
  Configuring Kinesis Data Firehose for streaming delivery to Amazon S3
  Configuring Amazon Kinesis Data Generator (KDG)
  Adding newly ingested data to the Glue Data Catalog
  Querying the data with Amazon Athena
  Summary

Chapter 7: Transforming Data to Optimize for Analytics
  Technical requirements
  Transformations - making raw data more valuable
  Cooking, baking, and data transformations
  Transformations as part of a pipeline
  Types of data transformation tools
  Apache Spark
  Hadoop and MapReduce
  SQL
  GUI-based tools
  Data preparation transformations
  Protecting PII data
  Optimizing the file format
  Optimizing with data partitioning
  Data cleansing
  Business use case transforms
  Data denormalization
  Enriching data
  Pre-aggregating data
  Extracting metadata from unstructured data
  Working with change data capture (CDC) data
  Traditional approaches - data upserts and SQL views
  Modern approaches - the transactional data lake
  Hands-on - joining datasets with AWS Glue Studio
  Creating a new data lake zone - the curated zone
  Creating a new IAM role for the Glue job
  Configuring a denormalization transform using AWS Glue Studio
  Finalizing the denormalization transform job to write to S3
  Create a transform job to join streaming and film data using AWS Glue Studio
  Summary

Chapter 8: Identifying and Enabling Data Consumers
  Technical requirements
  Understanding the impact of data democratization
  A growing variety of data consumers
  Meeting the needs of business users with data visualization
  AWS tools for business users
  Meeting the needs of data analysts with structured reporting
  AWS tools for data analysts
  Meeting the needs of data scientists and ML models
  AWS tools used by data scientists to work with data
  Hands-on - creating data transformations with AWS Glue DataBrew
  Configuring new datasets for AWS Glue DataBrew
  Creating a new Glue DataBrew project
  Building your Glue DataBrew recipe
  Creating a Glue DataBrew job
  Summary

Chapter 9: Loading Data into a Data Mart
  Technical requirements
  Extending analytics with data warehouses/data marts
  Cold data
  Warm data
  Hot data
  What not to do - anti-patterns for a data warehouse
  Using a data warehouse as a transactional datastore
  Using a data warehouse as a data lake
  Using data warehouses for real-time, record-level use cases
  Storing unstructured data
  Redshift architecture review and storage deep dive
  Data distribution across slices
  Redshift Zone Maps and sorting data
  Designing a high-performance data warehouse
  Selecting the optimal Redshift node type
  Selecting the optimal table distribution style and sort key
  Selecting the right data type for columns
  Selecting the optimal table type
  Moving data between a data lake and Redshift
  Optimizing data ingestion in Redshift
  Exporting data from Redshift to the data lake
  Hands-on - loading data into an Amazon Redshift cluster and running queries
  Uploading our sample data to Amazon S3
  IAM roles for Redshift
  Creating a Redshift cluster
  Creating external tables for querying data in S3
  Creating a schema for a local Redshift table
  Running complex SQL queries against our data
  Summary

Chapter 10: Orchestrating the Data Pipeline
  Technical requirements
  Understanding the core concepts for pipeline orchestration
  What is a data pipeline, and how do you orchestrate it?
  How do you trigger a data pipeline to run?
  How do you handle the failures of a step in your pipeline?
  Examining the options for orchestrating pipelines in AWS
  AWS Data Pipeline for managing ETL between data sources
  AWS Glue Workflows to orchestrate Glue resources
  Apache Airflow as an open source orchestration solution
  Pros and cons of using MWAA
  AWS Step Function for a serverless orchestration solution
  Pros and cons of using AWS Step Function
  Deciding on which data pipeline orchestration tool to use
  Hands-on - orchestrating a data pipeline using AWS Step Function
  Creating new Lambda functions
  Creating an SNS topic and subscribing to an email address
  Creating a new Step Function state machine
  Configuring AWS CloudTrail and Amazon EventBridge
  Summary

Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning

Chapter 11: Ad Hoc Queries with Amazon Athena
  Technical requirements
  Amazon Athena - in-place SQL analytics for the data lake
  Tips and tricks to optimize Amazon Athena queries
  Common file format and layout optimizations
  Writing optimized SQL queries
  Federating the queries of external data sources with Amazon Athena Query Federation
  Querying external data sources using Athena Federated Query
  Managing governance and costs with Amazon Athena Workgroups
  Athena Workgroups overview
  Enforcing settings for groups of users
  Enforcing data usage controls
  Hands-on - creating an Amazon Athena workgroup and configuring Athena settings
  Hands-on - switching Workgroups and running queries
  Summary

Chapter 12: Visualizing Data with Amazon QuickSight
  Technical requirements
  Representing data visually for maximum impact
  Benefits of data visualization
  Popular uses of data visualizations
  Understanding Amazon QuickSight's core concepts
  Standard versus enterprise edition
  SPICE - the in-memory storage and computation engine for QuickSight
  Ingesting and preparing data from a variety of sources
  Preparing datasets in QuickSight versus performing ETL outside of QuickSight
  Creating and sharing visuals with QuickSight analyses and dashboards
  Visual types in Amazon QuickSight
  Understanding QuickSight's advanced features - ML Insights and embedded dashboards
  Amazon QuickSight ML Insights
  Amazon QuickSight embedded dashboards
  Hands-on - creating a simple QuickSight visualization
  Setting up a new QuickSight account and loading a dataset
  Creating a new analysis
  Summary

Chapter 13: Enabling Artificial Intelligence and Machine Learning
  Technical requirements
  Understanding the value of ML and AI for organizations
  Specialized ML projects
  Everyday use cases for ML and AI
  Exploring AWS services for ML
  AWS ML services
  Exploring AWS services for AI
  AI for unstructured speech and text
  AI for extracting metadata from images and video
  AI for ML-powered forecasts
  AI for fraud detection and personalization
  Hands-on - reviewing reviews with Amazon Comprehend
  Setting up a new Amazon SQS message queue
  Creating a Lambda function for calling Amazon Comprehend
  Adding Comprehend permissions for our IAM role
  Adding a Lambda function as a trigger for our SQS message queue
  Testing the solution with Amazon Comprehend
  Summary
  Further reading

Chapter 14: Wrapping Up the First Part of Your Learning Journey
  Technical requirements
  Looking at the data analytics big picture
  Managing complex data environments with DataOps
  Examining examples of real-world data pipelines
  A decade of data wrapped up for Spotify users
  Ingesting and processing streaming files at Netflix scale
  Imagining the future - a look at emerging trends
  ACID transactions directly on data lake data
  More data and more streaming ingestion
  Multi-cloud
  Decentralized data engineering teams, data platforms, and a data mesh architecture
  Data and product thinking convergence
  Data and self-serve platform design convergence
  Implementations of the data mesh architecture
  Hands-on - cleaning up your AWS account
  Reviewing AWS Billing to identify the resources being charged for
  Closing your AWS account
  Summary

Other Books You May Enjoy

Index |
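Several of the hands-on sections listed above (Chapters 6 and 11) end with ad hoc SQL queries run through Amazon Athena. A rough Python sketch of that workflow using boto3 follows; the database, table, and results-bucket names are hypothetical and not taken from the book.

```python
import time

import boto3

athena = boto3.client("athena")

# Hypothetical database and query-results location -- not taken from the book.
DATABASE = "curated_zone"
OUTPUT_LOCATION = "s3://my-athena-results-bucket/queries/"


def run_athena_query(sql: str) -> list:
    """Submit an ad hoc SQL query to Athena and poll until it completes."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )["QueryExecutionId"]

    # Poll for completion; production code would add a timeout and backoff.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Rows come back as a list of {"Data": [{"VarCharValue": ...}, ...]} dicts.
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


if __name__ == "__main__":
    # Hypothetical "film" table, echoing the sample dataset the book's exercises join against.
    for row in run_athena_query("SELECT category, COUNT(*) AS films FROM film GROUP BY category"):
        print(row)
```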
any_adam_object | 1 |
any_adam_object_boolean | 1 |
author | Eagar, Gareth ca. 20./21. Jh |
author_GND | (DE-588)1261399382 |
author_facet | Eagar, Gareth ca. 20./21. Jh |
author_role | aut |
author_sort | Eagar, Gareth ca. 20./21. Jh |
author_variant | g e ge |
building | Verbundindex |
bvnumber | BV048209762 |
classification_rvk | SR 770 |
ctrlnum | (OCoLC)1334021941 (DE-599)BVBBV048209762 |
discipline | Informatik |
discipline_str_mv | Informatik |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>00000nam a2200000 c 4500</leader><controlfield tag="001">BV048209762</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20220629</controlfield><controlfield tag="007">t|</controlfield><controlfield tag="008">220510s2021 xx a||| |||| 00||| eng d</controlfield><datafield tag="015" ind1=" " ind2=" "><subfield code="a">GBC1H3506</subfield><subfield code="2">dnb</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="z">1800560419</subfield><subfield code="9">1800560419</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="z">9781800560413</subfield><subfield code="9">978-1-80056-041-3</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1334021941</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV048209762</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rda</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-739</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Eagar, Gareth</subfield><subfield code="d">ca. 20./21. Jh.</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1261399382</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Data Engineering with AWS</subfield><subfield code="b">learn how to design and build cloud-based data transformation pipelines using AWS</subfield><subfield code="c">Gareth Eagar</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Birmingham</subfield><subfield code="b">Packt Publishing</subfield><subfield code="c">2021</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">xvi, 461 pages</subfield><subfield code="b">illustrationen, Diagramme</subfield><subfield code="c">24 cm</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Includes index</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">"Knowing how to architect and implement complex data pipelines is a highly sought-after skill. Data engineers are responsible for building these pipelines that ingest, transform, and join raw datasets - creating new value from the data in the process. Amazon Web Services (AWS) offers a range of tools to simplify a data engineer's job, making it the preferred platform for performing data engineering tasks. This book will take you through the services and the skills you need to architect and implement data pipelines on AWS. You'll begin by reviewing important data engineering concepts and some of the core AWS services that form a part of the data engineer's toolkit. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how the transformed data is used by various data consumers. 
The book also teaches you about populating data marts and data warehouses along with how a data lakehouse fits into the picture. Later, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. In the final chapters, you'll understand how the power of machine learning and artificial intelligence can be used to draw new insights from data. By the end of this AWS book, you'll be able to carry out data engineering tasks and implement a data pipeline on AWS independently."--Amazon</subfield></datafield><datafield tag="610" ind1="1" ind2="4"><subfield code="a">Amazon Web Services (Firm)</subfield></datafield><datafield tag="610" ind1="1" ind2="7"><subfield code="a">Amazon Web Services (Firm)</subfield><subfield code="2">fast</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Cloud computing</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Big data</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Infonuagique</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Données volumineuses</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">Big data</subfield><subfield code="2">fast</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">Cloud computing</subfield><subfield code="2">fast</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Big Data</subfield><subfield code="0">(DE-588)4802620-7</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Cloud Computing</subfield><subfield code="0">(DE-588)7623494-0</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Amazon Web Services</subfield><subfield code="0">(DE-588)1143985591</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Amazon Web Services</subfield><subfield code="0">(DE-588)1143985591</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Cloud Computing</subfield><subfield code="0">(DE-588)7623494-0</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="2"><subfield code="a">Big Data</subfield><subfield code="0">(DE-588)4802620-7</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="776" ind1="0" ind2="8"><subfield code="i">ebook version</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Passau - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=033590628&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="943" ind1="1" ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-033590628</subfield></datafield></record></collection> |
id | DE-604.BV048209762 |
illustrated | Illustrated |
index_date | 2024-07-03T19:48:12Z |
indexdate | 2025-01-10T17:06:10Z |
institution | BVB |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-033590628 |
oclc_num | 1334021941 |
open_access_boolean | |
owner | DE-739 |
owner_facet | DE-739 |
physical | xvi, 461 pages illustrationen, Diagramme 24 cm |
publishDate | 2021 |
publishDateSearch | 2021 |
publishDateSort | 2021 |
publisher | Packt Publishing |
record_format | marc |
spelling | Eagar, Gareth ca. 20./21. Jh. Verfasser (DE-588)1261399382 aut Data Engineering with AWS learn how to design and build cloud-based data transformation pipelines using AWS Gareth Eagar Birmingham Packt Publishing 2021 xvi, 461 pages illustrationen, Diagramme 24 cm txt rdacontent n rdamedia nc rdacarrier Includes index "Knowing how to architect and implement complex data pipelines is a highly sought-after skill. Data engineers are responsible for building these pipelines that ingest, transform, and join raw datasets - creating new value from the data in the process. Amazon Web Services (AWS) offers a range of tools to simplify a data engineer's job, making it the preferred platform for performing data engineering tasks. This book will take you through the services and the skills you need to architect and implement data pipelines on AWS. You'll begin by reviewing important data engineering concepts and some of the core AWS services that form a part of the data engineer's toolkit. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how the transformed data is used by various data consumers. The book also teaches you about populating data marts and data warehouses along with how a data lakehouse fits into the picture. Later, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. In the final chapters, you'll understand how the power of machine learning and artificial intelligence can be used to draw new insights from data. By the end of this AWS book, you'll be able to carry out data engineering tasks and implement a data pipeline on AWS independently."--Amazon Amazon Web Services (Firm) Amazon Web Services (Firm) fast Cloud computing Big data Infonuagique Données volumineuses Big data fast Cloud computing fast Big Data (DE-588)4802620-7 gnd rswk-swf Cloud Computing (DE-588)7623494-0 gnd rswk-swf Amazon Web Services (DE-588)1143985591 gnd rswk-swf Amazon Web Services (DE-588)1143985591 s Cloud Computing (DE-588)7623494-0 s Big Data (DE-588)4802620-7 s DE-604 ebook version Digitalisierung UB Passau - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=033590628&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Eagar, Gareth ca. 20./21. Jh Data Engineering with AWS learn how to design and build cloud-based data transformation pipelines using AWS Amazon Web Services (Firm) Amazon Web Services (Firm) fast Cloud computing Big data Infonuagique Données volumineuses Big data fast Cloud computing fast Big Data (DE-588)4802620-7 gnd Cloud Computing (DE-588)7623494-0 gnd Amazon Web Services (DE-588)1143985591 gnd |
subject_GND | (DE-588)4802620-7 (DE-588)7623494-0 (DE-588)1143985591 |
title | Data Engineering with AWS learn how to design and build cloud-based data transformation pipelines using AWS |
title_auth | Data Engineering with AWS learn how to design and build cloud-based data transformation pipelines using AWS |
title_exact_search | Data Engineering with AWS learn how to design and build cloud-based data transformation pipelines using AWS |
title_exact_search_txtP | Data Engineering with AWS learn how to design and build cloud-based data transformation pipelines using AWS |
title_full | Data Engineering with AWS learn how to design and build cloud-based data transformation pipelines using AWS Gareth Eagar |
title_fullStr | Data Engineering with AWS learn how to design and build cloud-based data transformation pipelines using AWS Gareth Eagar |
title_full_unstemmed | Data Engineering with AWS learn how to design and build cloud-based data transformation pipelines using AWS Gareth Eagar |
title_short | Data Engineering with AWS |
title_sort | data engineering with aws learn how to design and build cloud based data transformation pipelines using aws |
title_sub | learn how to design and build cloud-based data transformation pipelines using AWS |
topic | Amazon Web Services (Firm) Amazon Web Services (Firm) fast Cloud computing Big data Infonuagique Données volumineuses Big data fast Cloud computing fast Big Data (DE-588)4802620-7 gnd Cloud Computing (DE-588)7623494-0 gnd Amazon Web Services (DE-588)1143985591 gnd |
topic_facet | Amazon Web Services (Firm) Cloud computing Big data Infonuagique Données volumineuses Big Data Cloud Computing Amazon Web Services |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=033590628&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT eagargareth dataengineeringwithawslearnhowtodesignandbuildcloudbaseddatatransformationpipelinesusingaws |