Hands-on big data analytics with PySpark :: analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs /
Main Authors: | Lai, Rudy; Potaczek, Bartłomiej |
---|---|
Format: | Electronic eBook |
Language: | English |
Published: | Birmingham, UK : Packt Publishing, 2019. |
Subjects: | SPARK (Computer program language); Application software -- Development; Big data; Electronic data processing; Python (Computer program language) |
Online Access: | DE-862 DE-863 |
Summary: | In this book, you'll learn to implement some practical and proven techniques to improve aspects of programming and administration in Apache Spark. Techniques are demonstrated using practical examples and best practices. You will also learn how to use Spark and its Python API to create performant analytics with large-scale data. |
Physical Description: | 1 online resource : illustrations |
ISBN: | 1838648836 9781838648831 |
Staff View
MARC
LEADER | 00000cam a2200000 i 4500 | ||
---|---|---|---|
001 | ZDB-4-EBA-on1100643398 | ||
003 | OCoLC | ||
005 | 20241004212047.0 | ||
006 | m o d | ||
007 | cr unu|||||||| | ||
008 | 190509s2019 enka o 000 0 eng d | ||
040 | |a UMI |b eng |e rda |e pn |c UMI |d TEFOD |d EBLCP |d MERUC |d UKMGB |d OCLCF |d YDX |d UKAHL |d OCLCQ |d N$T |d OCLCQ |d OCLCO |d NZAUC |d OCLCQ |d OCLCO |d OCLCL | ||
015 | |a GBB995016 |2 bnb | ||
016 | 7 | |a 019365492 |2 Uk | |
019 | |a 1091701284 |a 1096526626 | ||
020 | |a 1838648836 | ||
020 | |a 9781838648831 |q (electronic bk.) | ||
020 | |z 9781838644130 | ||
035 | |a (OCoLC)1100643398 |z (OCoLC)1091701284 |z (OCoLC)1096526626 | ||
037 | |a CL0501000047 |b Safari Books Online | ||
050 | 4 | |a QA76.73.S59 | |
082 | 7 | |a 004.2 |2 23 | |
049 | |a MAIN | ||
100 | 1 | |a Lai, Rudy, |e author. | |
245 | 1 | 0 | |a Hands-on big data analytics with PySpark : |b analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / |c Rudy Lai, Bartłomiej Potaczek. |
264 | 1 | |a Birmingham, UK : |b Packt Publishing, |c 2019. | |
300 | |a 1 online resource : |b illustrations | ||
336 | |a text |b txt |2 rdacontent | ||
337 | |a computer |b c |2 rdamedia | ||
338 | |a online resource |b cr |2 rdacarrier | ||
588 | 0 | |a Online resource; title from title page (Safari, viewed May 9, 2019). | |
505 | 0 | |a Cover; Title Page; Copyright and Credits; About Packt; Contributors; Table of Contents; Preface; Chapter 1: Pyspark and Setting up Your Development Environment; An overview of PySpark; Spark SQL; Setting up Spark on Windows and PySpark; Core concepts in Spark and PySpark; SparkContext; Spark shell; SparkConf; Summary; Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs; Loading data on to Spark RDDs; The UCI machine learning repository; Getting the data from the repository to Spark; Getting data into Spark; Parallelization with Spark RDDs; What is parallelization? | |
505 | 8 | |a Basics of RDD operation; Summary; Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks; Using Spark Notebooks for quick iteration of ideas; Sampling/filtering RDDs to pick out relevant data points; Splitting datasets and creating some new combinations; Summary; Chapter 4: Aggregating and Summarizing Data into Useful Reports; Calculating averages with map and reduce; Faster average computations with aggregate; Pivot tabling with key-value paired data points; Summary; Chapter 5: Powerful Exploratory Data Analysis with MLlib; Computing summary statistics with MLlib | |
505 | 8 | |a Using Pearson and Spearman correlations to discover correlations; The Pearson correlation; The Spearman correlation; Computing Pearson and Spearman correlations; Testing our hypotheses on large datasets; Summary; Chapter 6: Putting Structure on Your Big Data with SparkSQL; Manipulating DataFrames with Spark SQL schemas; Using Spark DSL to build queries; Summary; Chapter 7: Transformations and Actions; Using Spark transformations to defer computations to a later time; Avoiding transformations; Using the reduce and reduceByKey methods to calculate the results | |
505 | 8 | |a Performing actions that trigger computations; Reusing the same RDD for different actions; Summary; Chapter 8: Immutable Design; Delving into the Spark RDD's parent/child chain; Extending an RDD; Chaining a new RDD with the parent; Testing our custom RDD; Using RDD in an immutable way; Using DataFrame operations to transform; Immutability in the highly concurrent environment; Using the Dataset API in an immutable way; Summary; Chapter 9: Avoiding Shuffle and Reducing Operational Expenses; Detecting a shuffle in a process; Testing operations that cause a shuffle in Apache Spark | |
505 | 8 | |a Changing the design of jobs with wide dependencies; Using keyBy() operations to reduce shuffle; Using a custom partitioner to reduce shuffle; Summary; Chapter 10: Saving Data in the Correct Format; Saving data in plain text format; Leveraging JSON as a data format; Tabular formats -- CSV; Using Avro with Spark; Columnar formats -- Parquet; Summary; Chapter 11: Working with the Spark Key/Value API; Available actions on key/value pairs; Using aggregateByKey instead of groupBy(); Actions on key/value pairs; Available partitioners on key/value data; Implementing a custom partitioner; Summary | |
520 | |a In this book, you'll learn to implement some practical and proven techniques to improve aspects of programming and administration in Apache Spark. Techniques are demonstrated using practical examples and best practices. You will also learn how to use Spark and its Python API to create performant analytics with large-scale data. | ||
650 | 0 | |a SPARK (Computer program language) |0 http://id.loc.gov/authorities/subjects/sh2015001170 | |
650 | 0 | |a Application software |x Development. |0 http://id.loc.gov/authorities/subjects/sh95009362 | |
650 | 0 | |a Big data. |0 http://id.loc.gov/authorities/subjects/sh2012003227 | |
650 | 0 | |a Electronic data processing. |0 http://id.loc.gov/authorities/subjects/sh85042288 | |
650 | 0 | |a Python (Computer program language) |0 http://id.loc.gov/authorities/subjects/sh96008834 | |
650 | 6 | |a Logiciels d'application |x Développement. | |
650 | 6 | |a Données volumineuses. | |
650 | 6 | |a Python (Langage de programmation) | |
650 | 7 | |a Application software |x Development |2 fast | |
650 | 7 | |a Big data |2 fast | |
650 | 7 | |a Electronic data processing |2 fast | |
650 | 7 | |a Python (Computer program language) |2 fast | |
650 | 7 | |a SPARK (Computer program language) |2 fast | |
700 | 1 | |a Potaczek, Bartłomiej, |e author. | |
758 | |i has work: |a Hands-On Big Data Analytics with PySpark (Text) |1 https://id.oclc.org/worldcat/entity/E39PCXY3Xvkc3PMhcWcgRgjfjd |4 https://id.oclc.org/worldcat/ontology/hasWork | ||
776 | 0 | 8 | |i Print version: |a Lai, Rudy. |t Hands-On Big Data Analytics with Pyspark : Analyze Large Datasets and Discover Techniques for Testing, Immunizing, and Parallelizing Spark Jobs. |d Birmingham : Packt Publishing Ltd, ©2019 |z 9781838644130 |
966 | 4 | 0 | |l DE-862 |p ZDB-4-EBA |q FWS_PDA_EBA |u https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=2094759 |3 Volltext |
966 | 4 | 0 | |l DE-863 |p ZDB-4-EBA |q FWS_PDA_EBA |u https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=2094759 |3 Volltext |
938 | |a Askews and Holts Library Services |b ASKH |n BDZ0039952975 | ||
938 | |a ProQuest Ebook Central |b EBLB |n EBL5744445 | ||
938 | |a EBSCOhost |b EBSC |n 2094759 | ||
938 | |a YBP Library Services |b YANK |n 16142491 | ||
994 | |a 92 |b GEBAY | ||
912 | |a ZDB-4-EBA | ||
049 | |a DE-862 | ||
049 | |a DE-863 |
Record in the Search Index
DE-BY-FWS_katkey | ZDB-4-EBA-on1100643398 |
---|---|
_version_ | 1826942290017386496 |
adam_text | |
any_adam_object | |
author | Lai, Rudy Potaczek, Bartłomiej |
author_facet | Lai, Rudy Potaczek, Bartłomiej |
author_role | aut aut |
author_sort | Lai, Rudy |
author_variant | r l rl b p bp |
building | Verbundindex |
bvnumber | localFWS |
callnumber-first | Q - Science |
callnumber-label | QA76 |
callnumber-raw | QA76.73.S59 |
callnumber-search | QA76.73.S59 |
callnumber-sort | QA 276.73 S59 |
callnumber-subject | QA - Mathematics |
collection | ZDB-4-EBA |
contents | Cover; Title Page; Copyright and Credits; About Packt; Contributors; Table of Contents; Preface; Chapter 1: Pyspark and Setting up Your Development Environment; An overview of PySpark; Spark SQL; Setting up Spark on Windows and PySpark; Core concepts in Spark and PySpark; SparkContext; Spark shell; SparkConf; Summary; Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs; Loading data on to Spark RDDs; The UCI machine learning repository; Getting the data from the repository to Spark; Getting data into Spark; Parallelization with Spark RDDs; What is parallelization? Basics of RDD operationSummary; Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks; Using Spark Notebooks for quick iteration of ideas; Sampling/filtering RDDs to pick out relevant data points; Splitting datasets and creating some new combinations; Summary; Chapter 4: Aggregating and Summarizing Data into Useful Reports; Calculating averages with map and reduce; Faster average computations with aggregate; Pivot tabling with key-value paired data points; Summary; Chapter 5: Powerful Exploratory Data Analysis with MLlib; Computing summary statistics with MLlib Using Pearson and Spearman correlations to discover correlationsThe Pearson correlation; The Spearman correlation; Computing Pearson and Spearman correlations; Testing our hypotheses on large datasets; Summary; Chapter 6: Putting Structure on Your Big Data with SparkSQL; Manipulating DataFrames with Spark SQL schemas; Using Spark DSL to build queries; Summary; Chapter 7: Transformations and Actions; Using Spark transformations to defer computations to a later time; Avoiding transformations; Using the reduce and reduceByKey methods to calculate the results Performing actions that trigger computationsReusing the same rdd for different actions; Summary; Chapter 8: Immutable Design; Delving into the Spark RDD's parent/child chain; Extending an RDD; Chaining a new RDD with the parent; Testing our custom RDD; Using 
RDD in an immutable way; Using DataFrame operations to transform; Immutability in the highly concurrent environment; Using the Dataset API in an immutable way; Summary; Chapter 9: Avoiding Shuffle and Reducing Operational Expenses; Detecting a shuffle in a process; Testing operations that cause a shuffle in Apache Spark Changing the design of jobs with wide dependenciesUsing keyBy() operations to reduce shuffle; Using a custom partitioner to reduce shuffle; Summary; Chapter 10: Saving Data in the Correct Format; Saving data in plain text format; Leveraging JSON as a data format; Tabular formats -- CSV; Using Avro with Spark; Columnar formats -- Parquet; Summary; Chapter 11: Working with the Spark Key/Value API; Available actions on key/value pairs; Using aggregateByKey instead of groupBy(); Actions on key/value pairs; Available partitioners on key/value data; Implementing a custom partitioner; Summary |
ctrlnum | (OCoLC)1100643398 |
dewey-full | 004.2 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 004 - Computer science |
dewey-raw | 004.2 |
dewey-search | 004.2 |
dewey-sort | 14.2 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Informatik |
format | Electronic eBook |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>06338cam a2200673 i 4500</leader><controlfield tag="001">ZDB-4-EBA-on1100643398</controlfield><controlfield tag="003">OCoLC</controlfield><controlfield tag="005">20241004212047.0</controlfield><controlfield tag="006">m o d </controlfield><controlfield tag="007">cr unu||||||||</controlfield><controlfield tag="008">190509s2019 enka o 000 0 eng d</controlfield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">UMI</subfield><subfield code="b">eng</subfield><subfield code="e">rda</subfield><subfield code="e">pn</subfield><subfield code="c">UMI</subfield><subfield code="d">TEFOD</subfield><subfield code="d">EBLCP</subfield><subfield code="d">MERUC</subfield><subfield code="d">UKMGB</subfield><subfield code="d">OCLCF</subfield><subfield code="d">YDX</subfield><subfield code="d">UKAHL</subfield><subfield code="d">OCLCQ</subfield><subfield code="d">N$T</subfield><subfield code="d">OCLCQ</subfield><subfield code="d">OCLCO</subfield><subfield code="d">NZAUC</subfield><subfield code="d">OCLCQ</subfield><subfield code="d">OCLCO</subfield><subfield code="d">OCLCL</subfield></datafield><datafield tag="015" ind1=" " ind2=" "><subfield code="a">GBB995016</subfield><subfield code="2">bnb</subfield></datafield><datafield tag="016" ind1="7" ind2=" "><subfield code="a">019365492</subfield><subfield code="2">Uk</subfield></datafield><datafield tag="019" ind1=" " ind2=" "><subfield code="a">1091701284</subfield><subfield code="a">1096526626</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">1838648836</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781838648831</subfield><subfield code="q">(electronic bk.)</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="z">9781838644130</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield 
code="a">(OCoLC)1100643398</subfield><subfield code="z">(OCoLC)1091701284</subfield><subfield code="z">(OCoLC)1096526626</subfield></datafield><datafield tag="037" ind1=" " ind2=" "><subfield code="a">CL0501000047</subfield><subfield code="b">Safari Books Online</subfield></datafield><datafield tag="050" ind1=" " ind2="4"><subfield code="a">QA76.73.S59</subfield></datafield><datafield tag="082" ind1="7" ind2=" "><subfield code="a">004.2</subfield><subfield code="2">23</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">MAIN</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Lai, Rudy,</subfield><subfield code="e">author.</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Hands-on big data analytics with PySpark :</subfield><subfield code="b">analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs /</subfield><subfield code="c">Rudy Lai, Bartłomiej Potaczek.</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Birmingham, UK :</subfield><subfield code="b">Packt Publishing,</subfield><subfield code="c">2019.</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">1 online resource :</subfield><subfield code="b">illustrations</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">computer</subfield><subfield code="b">c</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">online resource</subfield><subfield code="b">cr</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="588" ind1="0" ind2=" "><subfield code="a">Online resource; title from title page (Safari, viewed May 9, 2019).</subfield></datafield><datafield tag="505" 
ind1="0" ind2=" "><subfield code="a">Cover; Title Page; Copyright and Credits; About Packt; Contributors; Table of Contents; Preface; Chapter 1: Pyspark and Setting up Your Development Environment; An overview of PySpark; Spark SQL; Setting up Spark on Windows and PySpark; Core concepts in Spark and PySpark; SparkContext; Spark shell; SparkConf; Summary; Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs; Loading data on to Spark RDDs; The UCI machine learning repository; Getting the data from the repository to Spark; Getting data into Spark; Parallelization with Spark RDDs; What is parallelization?</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">Basics of RDD operationSummary; Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks; Using Spark Notebooks for quick iteration of ideas; Sampling/filtering RDDs to pick out relevant data points; Splitting datasets and creating some new combinations; Summary; Chapter 4: Aggregating and Summarizing Data into Useful Reports; Calculating averages with map and reduce; Faster average computations with aggregate; Pivot tabling with key-value paired data points; Summary; Chapter 5: Powerful Exploratory Data Analysis with MLlib; Computing summary statistics with MLlib</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">Using Pearson and Spearman correlations to discover correlationsThe Pearson correlation; The Spearman correlation; Computing Pearson and Spearman correlations; Testing our hypotheses on large datasets; Summary; Chapter 6: Putting Structure on Your Big Data with SparkSQL; Manipulating DataFrames with Spark SQL schemas; Using Spark DSL to build queries; Summary; Chapter 7: Transformations and Actions; Using Spark transformations to defer computations to a later time; Avoiding transformations; Using the reduce and reduceByKey methods to calculate the results</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield 
code="a">Performing actions that trigger computationsReusing the same rdd for different actions; Summary; Chapter 8: Immutable Design; Delving into the Spark RDD's parent/child chain; Extending an RDD; Chaining a new RDD with the parent; Testing our custom RDD; Using RDD in an immutable way; Using DataFrame operations to transform; Immutability in the highly concurrent environment; Using the Dataset API in an immutable way; Summary; Chapter 9: Avoiding Shuffle and Reducing Operational Expenses; Detecting a shuffle in a process; Testing operations that cause a shuffle in Apache Spark</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">Changing the design of jobs with wide dependenciesUsing keyBy() operations to reduce shuffle; Using a custom partitioner to reduce shuffle; Summary; Chapter 10: Saving Data in the Correct Format; Saving data in plain text format; Leveraging JSON as a data format; Tabular formats -- CSV; Using Avro with Spark; Columnar formats -- Parquet; Summary; Chapter 11: Working with the Spark Key/Value API; Available actions on key/value pairs; Using aggregateByKey instead of groupBy(); Actions on key/value pairs; Available partitioners on key/value data; Implementing a custom partitioner; Summary</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">In this book, you'll learn to implement some practical and proven techniques to improve aspects of programming and administration in Apache Spark. Techniques are demonstrated using practical examples and best practices. 
You will also learn how to use Spark and its Python API to create performant analytics with large-scale data.</subfield></datafield><datafield tag="650" ind1=" " ind2="0"><subfield code="a">SPARK (Computer program language)</subfield><subfield code="0">http://id.loc.gov/authorities/subjects/sh2015001170</subfield></datafield><datafield tag="650" ind1=" " ind2="0"><subfield code="a">Application software</subfield><subfield code="x">Development.</subfield><subfield code="0">http://id.loc.gov/authorities/subjects/sh95009362</subfield></datafield><datafield tag="650" ind1=" " ind2="0"><subfield code="a">Big data.</subfield><subfield code="0">http://id.loc.gov/authorities/subjects/sh2012003227</subfield></datafield><datafield tag="650" ind1=" " ind2="0"><subfield code="a">Electronic data processing.</subfield><subfield code="0">http://id.loc.gov/authorities/subjects/sh85042288</subfield></datafield><datafield tag="650" ind1=" " ind2="0"><subfield code="a">Python (Computer program language)</subfield><subfield code="0">http://id.loc.gov/authorities/subjects/sh96008834</subfield></datafield><datafield tag="650" ind1=" " ind2="6"><subfield code="a">Logiciels d'application</subfield><subfield code="x">Développement.</subfield></datafield><datafield tag="650" ind1=" " ind2="6"><subfield code="a">Données volumineuses.</subfield></datafield><datafield tag="650" ind1=" " ind2="6"><subfield code="a">Python (Langage de programmation)</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">Application software</subfield><subfield code="x">Development</subfield><subfield code="2">fast</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">Big data</subfield><subfield code="2">fast</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">Electronic data processing</subfield><subfield code="2">fast</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">Python (Computer program 
language)</subfield><subfield code="2">fast</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">SPARK (Computer program language)</subfield><subfield code="2">fast</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Potaczek, Bartłomiej,</subfield><subfield code="e">author.</subfield></datafield><datafield tag="758" ind1=" " ind2=" "><subfield code="i">has work:</subfield><subfield code="a">Hands-On Big Data Analytics with PySpark (Text)</subfield><subfield code="1">https://id.oclc.org/worldcat/entity/E39PCXY3Xvkc3PMhcWcgRgjfjd</subfield><subfield code="4">https://id.oclc.org/worldcat/ontology/hasWork</subfield></datafield><datafield tag="776" ind1="0" ind2="8"><subfield code="i">Print version:</subfield><subfield code="a">Lai, Rudy.</subfield><subfield code="t">Hands-On Big Data Analytics with Pyspark : Analyze Large Datasets and Discover Techniques for Testing, Immunizing, and Parallelizing Spark Jobs.</subfield><subfield code="d">Birmingham : Packt Publishing Ltd, ©2019</subfield><subfield code="z">9781838644130</subfield></datafield><datafield tag="966" ind1="4" ind2="0"><subfield code="l">DE-862</subfield><subfield code="p">ZDB-4-EBA</subfield><subfield code="q">FWS_PDA_EBA</subfield><subfield code="u">https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=2094759</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="966" ind1="4" ind2="0"><subfield code="l">DE-863</subfield><subfield code="p">ZDB-4-EBA</subfield><subfield code="q">FWS_PDA_EBA</subfield><subfield code="u">https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=2094759</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="938" ind1=" " ind2=" "><subfield code="a">Askews and Holts Library Services</subfield><subfield code="b">ASKH</subfield><subfield code="n">BDZ0039952975</subfield></datafield><datafield tag="938" ind1=" " ind2=" "><subfield 
code="a">ProQuest Ebook Central</subfield><subfield code="b">EBLB</subfield><subfield code="n">EBL5744445</subfield></datafield><datafield tag="938" ind1=" " ind2=" "><subfield code="a">EBSCOhost</subfield><subfield code="b">EBSC</subfield><subfield code="n">2094759</subfield></datafield><datafield tag="938" ind1=" " ind2=" "><subfield code="a">YBP Library Services</subfield><subfield code="b">YANK</subfield><subfield code="n">16142491</subfield></datafield><datafield tag="994" ind1=" " ind2=" "><subfield code="a">92</subfield><subfield code="b">GEBAY</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">ZDB-4-EBA</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-862</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-863</subfield></datafield></record></collection> |
id | ZDB-4-EBA-on1100643398 |
illustrated | Illustrated |
indexdate | 2025-03-18T14:25:39Z |
institution | BVB |
isbn | 1838648836 9781838648831 |
language | English |
oclc_num | 1100643398 |
open_access_boolean | |
owner | MAIN DE-862 DE-BY-FWS DE-863 DE-BY-FWS |
owner_facet | MAIN DE-862 DE-BY-FWS DE-863 DE-BY-FWS |
physical | 1 online resource : illustrations |
psigel | ZDB-4-EBA FWS_PDA_EBA ZDB-4-EBA |
publishDate | 2019 |
publishDateSearch | 2019 |
publishDateSort | 2019 |
publisher | Packt Publishing, |
record_format | marc |
spelling | Lai, Rudy, author. Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / Rudy Lai, Bartłomiej Potaczek. Birmingham, UK : Packt Publishing, 2019. 1 online resource : illustrations text txt rdacontent computer c rdamedia online resource cr rdacarrier Online resource; title from title page (Safari, viewed May 9, 2019). Cover; Title Page; Copyright and Credits; About Packt; Contributors; Table of Contents; Preface; Chapter 1: Pyspark and Setting up Your Development Environment; An overview of PySpark; Spark SQL; Setting up Spark on Windows and PySpark; Core concepts in Spark and PySpark; SparkContext; Spark shell; SparkConf; Summary; Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs; Loading data on to Spark RDDs; The UCI machine learning repository; Getting the data from the repository to Spark; Getting data into Spark; Parallelization with Spark RDDs; What is parallelization? 
Basics of RDD operationSummary; Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks; Using Spark Notebooks for quick iteration of ideas; Sampling/filtering RDDs to pick out relevant data points; Splitting datasets and creating some new combinations; Summary; Chapter 4: Aggregating and Summarizing Data into Useful Reports; Calculating averages with map and reduce; Faster average computations with aggregate; Pivot tabling with key-value paired data points; Summary; Chapter 5: Powerful Exploratory Data Analysis with MLlib; Computing summary statistics with MLlib Using Pearson and Spearman correlations to discover correlationsThe Pearson correlation; The Spearman correlation; Computing Pearson and Spearman correlations; Testing our hypotheses on large datasets; Summary; Chapter 6: Putting Structure on Your Big Data with SparkSQL; Manipulating DataFrames with Spark SQL schemas; Using Spark DSL to build queries; Summary; Chapter 7: Transformations and Actions; Using Spark transformations to defer computations to a later time; Avoiding transformations; Using the reduce and reduceByKey methods to calculate the results Performing actions that trigger computationsReusing the same rdd for different actions; Summary; Chapter 8: Immutable Design; Delving into the Spark RDD's parent/child chain; Extending an RDD; Chaining a new RDD with the parent; Testing our custom RDD; Using RDD in an immutable way; Using DataFrame operations to transform; Immutability in the highly concurrent environment; Using the Dataset API in an immutable way; Summary; Chapter 9: Avoiding Shuffle and Reducing Operational Expenses; Detecting a shuffle in a process; Testing operations that cause a shuffle in Apache Spark Changing the design of jobs with wide dependenciesUsing keyBy() operations to reduce shuffle; Using a custom partitioner to reduce shuffle; Summary; Chapter 10: Saving Data in the Correct Format; Saving data in plain text format; Leveraging JSON as a data format; Tabular 
formats -- CSV; Using Avro with Spark; Columnar formats -- Parquet; Summary; Chapter 11: Working with the Spark Key/Value API; Available actions on key/value pairs; Using aggregateByKey instead of groupBy(); Actions on key/value pairs; Available partitioners on key/value data; Implementing a custom partitioner; Summary In this book, you'll learn to implement some practical and proven techniques to improve aspects of programming and administration in Apache Spark. Techniques are demonstrated using practical examples and best practices. You will also learn how to use Spark and its Python API to create performant analytics with large-scale data. SPARK (Computer program language) http://id.loc.gov/authorities/subjects/sh2015001170 Application software Development. http://id.loc.gov/authorities/subjects/sh95009362 Big data. http://id.loc.gov/authorities/subjects/sh2012003227 Electronic data processing. http://id.loc.gov/authorities/subjects/sh85042288 Python (Computer program language) http://id.loc.gov/authorities/subjects/sh96008834 Logiciels d'application Développement. Données volumineuses. Python (Langage de programmation) Application software Development fast Big data fast Electronic data processing fast Python (Computer program language) fast SPARK (Computer program language) fast Potaczek, Bartłomiej, author. has work: Hands-On Big Data Analytics with PySpark (Text) https://id.oclc.org/worldcat/entity/E39PCXY3Xvkc3PMhcWcgRgjfjd https://id.oclc.org/worldcat/ontology/hasWork Print version: Lai, Rudy. Hands-On Big Data Analytics with Pyspark : Analyze Large Datasets and Discover Techniques for Testing, Immunizing, and Parallelizing Spark Jobs. Birmingham : Packt Publishing Ltd, ©2019 9781838644130 |
spellingShingle | Lai, Rudy Potaczek, Bartłomiej Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / Cover; Title Page; Copyright and Credits; About Packt; Contributors; Table of Contents; Preface; Chapter 1: Pyspark and Setting up Your Development Environment; An overview of PySpark; Spark SQL; Setting up Spark on Windows and PySpark; Core concepts in Spark and PySpark; SparkContext; Spark shell; SparkConf; Summary; Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs; Loading data on to Spark RDDs; The UCI machine learning repository; Getting the data from the repository to Spark; Getting data into Spark; Parallelization with Spark RDDs; What is parallelization? Basics of RDD operationSummary; Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks; Using Spark Notebooks for quick iteration of ideas; Sampling/filtering RDDs to pick out relevant data points; Splitting datasets and creating some new combinations; Summary; Chapter 4: Aggregating and Summarizing Data into Useful Reports; Calculating averages with map and reduce; Faster average computations with aggregate; Pivot tabling with key-value paired data points; Summary; Chapter 5: Powerful Exploratory Data Analysis with MLlib; Computing summary statistics with MLlib Using Pearson and Spearman correlations to discover correlationsThe Pearson correlation; The Spearman correlation; Computing Pearson and Spearman correlations; Testing our hypotheses on large datasets; Summary; Chapter 6: Putting Structure on Your Big Data with SparkSQL; Manipulating DataFrames with Spark SQL schemas; Using Spark DSL to build queries; Summary; Chapter 7: Transformations and Actions; Using Spark transformations to defer computations to a later time; Avoiding transformations; Using the reduce and reduceByKey methods to calculate the results Performing actions that trigger computationsReusing the same rdd for 
different actions; Summary; Chapter 8: Immutable Design; Delving into the Spark RDD's parent/child chain; Extending an RDD; Chaining a new RDD with the parent; Testing our custom RDD; Using RDD in an immutable way; Using DataFrame operations to transform; Immutability in the highly concurrent environment; Using the Dataset API in an immutable way; Summary; Chapter 9: Avoiding Shuffle and Reducing Operational Expenses; Detecting a shuffle in a process; Testing operations that cause a shuffle in Apache Spark Changing the design of jobs with wide dependenciesUsing keyBy() operations to reduce shuffle; Using a custom partitioner to reduce shuffle; Summary; Chapter 10: Saving Data in the Correct Format; Saving data in plain text format; Leveraging JSON as a data format; Tabular formats -- CSV; Using Avro with Spark; Columnar formats -- Parquet; Summary; Chapter 11: Working with the Spark Key/Value API; Available actions on key/value pairs; Using aggregateByKey instead of groupBy(); Actions on key/value pairs; Available partitioners on key/value data; Implementing a custom partitioner; Summary SPARK (Computer program language) http://id.loc.gov/authorities/subjects/sh2015001170 Application software Development. http://id.loc.gov/authorities/subjects/sh95009362 Big data. http://id.loc.gov/authorities/subjects/sh2012003227 Electronic data processing. http://id.loc.gov/authorities/subjects/sh85042288 Python (Computer program language) http://id.loc.gov/authorities/subjects/sh96008834 Logiciels d'application Développement. Données volumineuses. Python (Langage de programmation) Application software Development fast Big data fast Electronic data processing fast Python (Computer program language) fast SPARK (Computer program language) fast |
subject_GND | http://id.loc.gov/authorities/subjects/sh2015001170 http://id.loc.gov/authorities/subjects/sh95009362 http://id.loc.gov/authorities/subjects/sh2012003227 http://id.loc.gov/authorities/subjects/sh85042288 http://id.loc.gov/authorities/subjects/sh96008834 |
title | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / |
title_auth | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / |
title_exact_search | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / |
title_full | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / Rudy Lai, Bartłomiej Potaczek. |
title_fullStr | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / Rudy Lai, Bartłomiej Potaczek. |
title_full_unstemmed | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / Rudy Lai, Bartłomiej Potaczek. |
title_short | Hands-on big data analytics with PySpark : |
title_sort | hands on big data analytics with pyspark analyze large datasets and discover techniques for testing immunizing and parallelizing spark jobs |
title_sub | analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / |
topic | SPARK (Computer program language) http://id.loc.gov/authorities/subjects/sh2015001170 Application software Development. http://id.loc.gov/authorities/subjects/sh95009362 Big data. http://id.loc.gov/authorities/subjects/sh2012003227 Electronic data processing. http://id.loc.gov/authorities/subjects/sh85042288 Python (Computer program language) http://id.loc.gov/authorities/subjects/sh96008834 Logiciels d'application Développement. Données volumineuses. Python (Langage de programmation) Application software Development fast Big data fast Electronic data processing fast Python (Computer program language) fast SPARK (Computer program language) fast |
topic_facet | SPARK (Computer program language) Application software Development. Big data. Electronic data processing. Python (Computer program language) Logiciels d'application Développement. Données volumineuses. Python (Langage de programmation) Application software Development Big data Electronic data processing |
work_keys_str_mv | AT lairudy handsonbigdataanalyticswithpysparkanalyzelargedatasetsanddiscovertechniquesfortestingimmunizingandparallelizingsparkjobs AT potaczekbartłomiej handsonbigdataanalyticswithpysparkanalyzelargedatasetsanddiscovertechniquesfortestingimmunizingandparallelizingsparkjobs |