Hands-on big data analytics with PySpark :: analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs /
Main Authors: | Lai, Rudy; Potaczek, Bartłomiej |
---|---|
Format: | Electronic eBook |
Language: | English |
Published: | Birmingham, UK : Packt Publishing, 2019. |
Subjects: | SPARK (Computer program language); Application software -- Development; Big data; Electronic data processing; Python (Computer program language) |
Online Access: | DE-862 DE-863 |
Summary: | In this book, you'll learn to implement some practical and proven techniques to improve aspects of programming and administration in Apache Spark. Techniques are demonstrated using practical examples and best practices. You will also learn how to use Spark and its Python API to create performant analytics with large-scale data. |
Physical Description: | 1 online resource : illustrations |
ISBN: | 1838648836 9781838648831 |
Staff View
MARC
LEADER | 00000cam a2200000 i 4500 | ||
---|---|---|---|
001 | ZDB-4-EBA-on1100643398 | ||
003 | OCoLC | ||
005 | 20241004212047.0 | ||
006 | m o d | ||
007 | cr unu|||||||| | ||
008 | 190509s2019 enka o 000 0 eng d | ||
040 | |a UMI |b eng |e rda |e pn |c UMI |d TEFOD |d EBLCP |d MERUC |d UKMGB |d OCLCF |d YDX |d UKAHL |d OCLCQ |d N$T |d OCLCQ |d OCLCO |d NZAUC |d OCLCQ |d OCLCO |d OCLCL | ||
015 | |a GBB995016 |2 bnb | ||
016 | 7 | |a 019365492 |2 Uk | |
019 | |a 1091701284 |a 1096526626 | ||
020 | |a 1838648836 | ||
020 | |a 9781838648831 |q (electronic bk.) | ||
020 | |z 9781838644130 | ||
035 | |a (OCoLC)1100643398 |z (OCoLC)1091701284 |z (OCoLC)1096526626 | ||
037 | |a CL0501000047 |b Safari Books Online | ||
050 | 4 | |a QA76.73.S59 | |
082 | 7 | |a 004.2 |2 23 | |
049 | |a MAIN | ||
100 | 1 | |a Lai, Rudy, |e author. | |
245 | 1 | 0 | |a Hands-on big data analytics with PySpark : |b analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / |c Rudy Lai, Bartłomiej Potaczek. |
264 | 1 | |a Birmingham, UK : |b Packt Publishing, |c 2019. | |
300 | |a 1 online resource : |b illustrations | ||
336 | |a text |b txt |2 rdacontent | ||
337 | |a computer |b c |2 rdamedia | ||
338 | |a online resource |b cr |2 rdacarrier | ||
588 | 0 | |a Online resource; title from title page (Safari, viewed May 9, 2019). | |
505 | 0 | |a Cover; Title Page; Copyright and Credits; About Packt; Contributors; Table of Contents; Preface; Chapter 1: Pyspark and Setting up Your Development Environment; An overview of PySpark; Spark SQL; Setting up Spark on Windows and PySpark; Core concepts in Spark and PySpark; SparkContext; Spark shell; SparkConf; Summary; Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs; Loading data on to Spark RDDs; The UCI machine learning repository; Getting the data from the repository to Spark; Getting data into Spark; Parallelization with Spark RDDs; What is parallelization? | |
505 | 8 | |a Basics of RDD operation; Summary; Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks; Using Spark Notebooks for quick iteration of ideas; Sampling/filtering RDDs to pick out relevant data points; Splitting datasets and creating some new combinations; Summary; Chapter 4: Aggregating and Summarizing Data into Useful Reports; Calculating averages with map and reduce; Faster average computations with aggregate; Pivot tabling with key-value paired data points; Summary; Chapter 5: Powerful Exploratory Data Analysis with MLlib; Computing summary statistics with MLlib | |
505 | 8 | |a Using Pearson and Spearman correlations to discover correlations; The Pearson correlation; The Spearman correlation; Computing Pearson and Spearman correlations; Testing our hypotheses on large datasets; Summary; Chapter 6: Putting Structure on Your Big Data with SparkSQL; Manipulating DataFrames with Spark SQL schemas; Using Spark DSL to build queries; Summary; Chapter 7: Transformations and Actions; Using Spark transformations to defer computations to a later time; Avoiding transformations; Using the reduce and reduceByKey methods to calculate the results | |
505 | 8 | |a Performing actions that trigger computations; Reusing the same RDD for different actions; Summary; Chapter 8: Immutable Design; Delving into the Spark RDD's parent/child chain; Extending an RDD; Chaining a new RDD with the parent; Testing our custom RDD; Using RDD in an immutable way; Using DataFrame operations to transform; Immutability in the highly concurrent environment; Using the Dataset API in an immutable way; Summary; Chapter 9: Avoiding Shuffle and Reducing Operational Expenses; Detecting a shuffle in a process; Testing operations that cause a shuffle in Apache Spark | |
505 | 8 | |a Changing the design of jobs with wide dependencies; Using keyBy() operations to reduce shuffle; Using a custom partitioner to reduce shuffle; Summary; Chapter 10: Saving Data in the Correct Format; Saving data in plain text format; Leveraging JSON as a data format; Tabular formats -- CSV; Using Avro with Spark; Columnar formats -- Parquet; Summary; Chapter 11: Working with the Spark Key/Value API; Available actions on key/value pairs; Using aggregateByKey instead of groupBy(); Actions on key/value pairs; Available partitioners on key/value data; Implementing a custom partitioner; Summary | |
520 | |a In this book, you'll learn to implement some practical and proven techniques to improve aspects of programming and administration in Apache Spark. Techniques are demonstrated using practical examples and best practices. You will also learn how to use Spark and its Python API to create performant analytics with large-scale data. | ||
650 | 0 | |a SPARK (Computer program language) |0 http://id.loc.gov/authorities/subjects/sh2015001170 | |
650 | 0 | |a Application software |x Development. |0 http://id.loc.gov/authorities/subjects/sh95009362 | |
650 | 0 | |a Big data. |0 http://id.loc.gov/authorities/subjects/sh2012003227 | |
650 | 0 | |a Electronic data processing. |0 http://id.loc.gov/authorities/subjects/sh85042288 | |
650 | 0 | |a Python (Computer program language) |0 http://id.loc.gov/authorities/subjects/sh96008834 | |
650 | 6 | |a Logiciels d'application |x Développement. | |
650 | 6 | |a Données volumineuses. | |
650 | 6 | |a Python (Langage de programmation) | |
650 | 7 | |a Application software |x Development |2 fast | |
650 | 7 | |a Big data |2 fast | |
650 | 7 | |a Electronic data processing |2 fast | |
650 | 7 | |a Python (Computer program language) |2 fast | |
650 | 7 | |a SPARK (Computer program language) |2 fast | |
700 | 1 | |a Potaczek, Bartłomiej, |e author. | |
758 | |i has work: |a Hands-On Big Data Analytics with PySpark (Text) |1 https://id.oclc.org/worldcat/entity/E39PCXY3Xvkc3PMhcWcgRgjfjd |4 https://id.oclc.org/worldcat/ontology/hasWork | ||
776 | 0 | 8 | |i Print version: |a Lai, Rudy. |t Hands-On Big Data Analytics with Pyspark : Analyze Large Datasets and Discover Techniques for Testing, Immunizing, and Parallelizing Spark Jobs. |d Birmingham : Packt Publishing Ltd, ©2019 |z 9781838644130 |
966 | 4 | 0 | |l DE-862 |p ZDB-4-EBA |q FWS_PDA_EBA |u https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=2094759 |3 Volltext |
966 | 4 | 0 | |l DE-863 |p ZDB-4-EBA |q FWS_PDA_EBA |u https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=2094759 |3 Volltext |
938 | |a Askews and Holts Library Services |b ASKH |n BDZ0039952975 | ||
938 | |a ProQuest Ebook Central |b EBLB |n EBL5744445 | ||
938 | |a EBSCOhost |b EBSC |n 2094759 | ||
938 | |a YBP Library Services |b YANK |n 16142491 | ||
994 | |a 92 |b GEBAY | ||
912 | |a ZDB-4-EBA | ||
049 | |a DE-862 | ||
049 | |a DE-863 |
Record in the Search Index
DE-BY-FWS_katkey | ZDB-4-EBA-on1100643398 |
---|---|
_version_ | 1826942290017386496 |
adam_text | |
any_adam_object | |
author | Lai, Rudy Potaczek, Bartłomiej |
author_facet | Lai, Rudy Potaczek, Bartłomiej |
author_role | aut aut |
author_sort | Lai, Rudy |
author_variant | r l rl b p bp |
building | Verbundindex |
bvnumber | localFWS |
callnumber-first | Q - Science |
callnumber-label | QA76 |
callnumber-raw | QA76.73.S59 |
callnumber-search | QA76.73.S59 |
callnumber-sort | QA 276.73 S59 |
callnumber-subject | QA - Mathematics |
collection | ZDB-4-EBA |
contents | Cover; Title Page; Copyright and Credits; About Packt; Contributors; Table of Contents; Preface; Chapter 1: Pyspark and Setting up Your Development Environment; An overview of PySpark; Spark SQL; Setting up Spark on Windows and PySpark; Core concepts in Spark and PySpark; SparkContext; Spark shell; SparkConf; Summary; Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs; Loading data on to Spark RDDs; The UCI machine learning repository; Getting the data from the repository to Spark; Getting data into Spark; Parallelization with Spark RDDs; What is parallelization? Basics of RDD operationSummary; Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks; Using Spark Notebooks for quick iteration of ideas; Sampling/filtering RDDs to pick out relevant data points; Splitting datasets and creating some new combinations; Summary; Chapter 4: Aggregating and Summarizing Data into Useful Reports; Calculating averages with map and reduce; Faster average computations with aggregate; Pivot tabling with key-value paired data points; Summary; Chapter 5: Powerful Exploratory Data Analysis with MLlib; Computing summary statistics with MLlib Using Pearson and Spearman correlations to discover correlationsThe Pearson correlation; The Spearman correlation; Computing Pearson and Spearman correlations; Testing our hypotheses on large datasets; Summary; Chapter 6: Putting Structure on Your Big Data with SparkSQL; Manipulating DataFrames with Spark SQL schemas; Using Spark DSL to build queries; Summary; Chapter 7: Transformations and Actions; Using Spark transformations to defer computations to a later time; Avoiding transformations; Using the reduce and reduceByKey methods to calculate the results Performing actions that trigger computationsReusing the same rdd for different actions; Summary; Chapter 8: Immutable Design; Delving into the Spark RDD's parent/child chain; Extending an RDD; Chaining a new RDD with the parent; Testing our custom RDD; Using 
RDD in an immutable way; Using DataFrame operations to transform; Immutability in the highly concurrent environment; Using the Dataset API in an immutable way; Summary; Chapter 9: Avoiding Shuffle and Reducing Operational Expenses; Detecting a shuffle in a process; Testing operations that cause a shuffle in Apache Spark Changing the design of jobs with wide dependenciesUsing keyBy() operations to reduce shuffle; Using a custom partitioner to reduce shuffle; Summary; Chapter 10: Saving Data in the Correct Format; Saving data in plain text format; Leveraging JSON as a data format; Tabular formats -- CSV; Using Avro with Spark; Columnar formats -- Parquet; Summary; Chapter 11: Working with the Spark Key/Value API; Available actions on key/value pairs; Using aggregateByKey instead of groupBy(); Actions on key/value pairs; Available partitioners on key/value data; Implementing a custom partitioner; Summary |
ctrlnum | (OCoLC)1100643398 |
dewey-full | 004.2 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 004 - Computer science |
dewey-raw | 004.2 |
dewey-search | 004.2 |
dewey-sort | 14.2 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Informatik |
format | Electronic eBook |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>06338cam a2200673 i 4500</leader><controlfield tag="001">ZDB-4-EBA-on1100643398</controlfield><controlfield tag="003">OCoLC</controlfield><controlfield tag="005">20241004212047.0</controlfield><controlfield tag="006">m o d </controlfield><controlfield tag="007">cr unu||||||||</controlfield><controlfield tag="008">190509s2019 enka o 000 0 eng d</controlfield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">UMI</subfield><subfield code="b">eng</subfield><subfield code="e">rda</subfield><subfield code="e">pn</subfield><subfield code="c">UMI</subfield><subfield code="d">TEFOD</subfield><subfield code="d">EBLCP</subfield><subfield code="d">MERUC</subfield><subfield code="d">UKMGB</subfield><subfield code="d">OCLCF</subfield><subfield code="d">YDX</subfield><subfield code="d">UKAHL</subfield><subfield code="d">OCLCQ</subfield><subfield code="d">N$T</subfield><subfield code="d">OCLCQ</subfield><subfield code="d">OCLCO</subfield><subfield code="d">NZAUC</subfield><subfield code="d">OCLCQ</subfield><subfield code="d">OCLCO</subfield><subfield code="d">OCLCL</subfield></datafield><datafield tag="015" ind1=" " ind2=" "><subfield code="a">GBB995016</subfield><subfield code="2">bnb</subfield></datafield><datafield tag="016" ind1="7" ind2=" "><subfield code="a">019365492</subfield><subfield code="2">Uk</subfield></datafield><datafield tag="019" ind1=" " ind2=" "><subfield code="a">1091701284</subfield><subfield code="a">1096526626</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">1838648836</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781838648831</subfield><subfield code="q">(electronic bk.)</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="z">9781838644130</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield 
code="a">(OCoLC)1100643398</subfield><subfield code="z">(OCoLC)1091701284</subfield><subfield code="z">(OCoLC)1096526626</subfield></datafield><datafield tag="037" ind1=" " ind2=" "><subfield code="a">CL0501000047</subfield><subfield code="b">Safari Books Online</subfield></datafield><datafield tag="050" ind1=" " ind2="4"><subfield code="a">QA76.73.S59</subfield></datafield><datafield tag="082" ind1="7" ind2=" "><subfield code="a">004.2</subfield><subfield code="2">23</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">MAIN</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Lai, Rudy,</subfield><subfield code="e">author.</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Hands-on big data analytics with PySpark :</subfield><subfield code="b">analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs /</subfield><subfield code="c">Rudy Lai, Bartłomiej Potaczek.</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Birmingham, UK :</subfield><subfield code="b">Packt Publishing,</subfield><subfield code="c">2019.</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">1 online resource :</subfield><subfield code="b">illustrations</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">computer</subfield><subfield code="b">c</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">online resource</subfield><subfield code="b">cr</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="588" ind1="0" ind2=" "><subfield code="a">Online resource; title from title page (Safari, viewed May 9, 2019).</subfield></datafield><datafield tag="505" 
ind1="0" ind2=" "><subfield code="a">Cover; Title Page; Copyright and Credits; About Packt; Contributors; Table of Contents; Preface; Chapter 1: Pyspark and Setting up Your Development Environment; An overview of PySpark; Spark SQL; Setting up Spark on Windows and PySpark; Core concepts in Spark and PySpark; SparkContext; Spark shell; SparkConf; Summary; Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs; Loading data on to Spark RDDs; The UCI machine learning repository; Getting the data from the repository to Spark; Getting data into Spark; Parallelization with Spark RDDs; What is parallelization?</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">Basics of RDD operationSummary; Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks; Using Spark Notebooks for quick iteration of ideas; Sampling/filtering RDDs to pick out relevant data points; Splitting datasets and creating some new combinations; Summary; Chapter 4: Aggregating and Summarizing Data into Useful Reports; Calculating averages with map and reduce; Faster average computations with aggregate; Pivot tabling with key-value paired data points; Summary; Chapter 5: Powerful Exploratory Data Analysis with MLlib; Computing summary statistics with MLlib</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">Using Pearson and Spearman correlations to discover correlationsThe Pearson correlation; The Spearman correlation; Computing Pearson and Spearman correlations; Testing our hypotheses on large datasets; Summary; Chapter 6: Putting Structure on Your Big Data with SparkSQL; Manipulating DataFrames with Spark SQL schemas; Using Spark DSL to build queries; Summary; Chapter 7: Transformations and Actions; Using Spark transformations to defer computations to a later time; Avoiding transformations; Using the reduce and reduceByKey methods to calculate the results</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield 
code="a">Performing actions that trigger computationsReusing the same rdd for different actions; Summary; Chapter 8: Immutable Design; Delving into the Spark RDD's parent/child chain; Extending an RDD; Chaining a new RDD with the parent; Testing our custom RDD; Using RDD in an immutable way; Using DataFrame operations to transform; Immutability in the highly concurrent environment; Using the Dataset API in an immutable way; Summary; Chapter 9: Avoiding Shuffle and Reducing Operational Expenses; Detecting a shuffle in a process; Testing operations that cause a shuffle in Apache Spark</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">Changing the design of jobs with wide dependenciesUsing keyBy() operations to reduce shuffle; Using a custom partitioner to reduce shuffle; Summary; Chapter 10: Saving Data in the Correct Format; Saving data in plain text format; Leveraging JSON as a data format; Tabular formats -- CSV; Using Avro with Spark; Columnar formats -- Parquet; Summary; Chapter 11: Working with the Spark Key/Value API; Available actions on key/value pairs; Using aggregateByKey instead of groupBy(); Actions on key/value pairs; Available partitioners on key/value data; Implementing a custom partitioner; Summary</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">In this book, you'll learn to implement some practical and proven techniques to improve aspects of programming and administration in Apache Spark. Techniques are demonstrated using practical examples and best practices. 
You will also learn how to use Spark and its Python API to create performant analytics with large-scale data.</subfield></datafield><datafield tag="650" ind1=" " ind2="0"><subfield code="a">SPARK (Computer program language)</subfield><subfield code="0">http://id.loc.gov/authorities/subjects/sh2015001170</subfield></datafield><datafield tag="650" ind1=" " ind2="0"><subfield code="a">Application software</subfield><subfield code="x">Development.</subfield><subfield code="0">http://id.loc.gov/authorities/subjects/sh95009362</subfield></datafield><datafield tag="650" ind1=" " ind2="0"><subfield code="a">Big data.</subfield><subfield code="0">http://id.loc.gov/authorities/subjects/sh2012003227</subfield></datafield><datafield tag="650" ind1=" " ind2="0"><subfield code="a">Electronic data processing.</subfield><subfield code="0">http://id.loc.gov/authorities/subjects/sh85042288</subfield></datafield><datafield tag="650" ind1=" " ind2="0"><subfield code="a">Python (Computer program language)</subfield><subfield code="0">http://id.loc.gov/authorities/subjects/sh96008834</subfield></datafield><datafield tag="650" ind1=" " ind2="6"><subfield code="a">Logiciels d'application</subfield><subfield code="x">Développement.</subfield></datafield><datafield tag="650" ind1=" " ind2="6"><subfield code="a">Données volumineuses.</subfield></datafield><datafield tag="650" ind1=" " ind2="6"><subfield code="a">Python (Langage de programmation)</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">Application software</subfield><subfield code="x">Development</subfield><subfield code="2">fast</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">Big data</subfield><subfield code="2">fast</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">Electronic data processing</subfield><subfield code="2">fast</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">Python (Computer program 
language)</subfield><subfield code="2">fast</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">SPARK (Computer program language)</subfield><subfield code="2">fast</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Potaczek, Bartłomiej,</subfield><subfield code="e">author.</subfield></datafield><datafield tag="758" ind1=" " ind2=" "><subfield code="i">has work:</subfield><subfield code="a">Hands-On Big Data Analytics with PySpark (Text)</subfield><subfield code="1">https://id.oclc.org/worldcat/entity/E39PCXY3Xvkc3PMhcWcgRgjfjd</subfield><subfield code="4">https://id.oclc.org/worldcat/ontology/hasWork</subfield></datafield><datafield tag="776" ind1="0" ind2="8"><subfield code="i">Print version:</subfield><subfield code="a">Lai, Rudy.</subfield><subfield code="t">Hands-On Big Data Analytics with Pyspark : Analyze Large Datasets and Discover Techniques for Testing, Immunizing, and Parallelizing Spark Jobs.</subfield><subfield code="d">Birmingham : Packt Publishing Ltd, ©2019</subfield><subfield code="z">9781838644130</subfield></datafield><datafield tag="966" ind1="4" ind2="0"><subfield code="l">DE-862</subfield><subfield code="p">ZDB-4-EBA</subfield><subfield code="q">FWS_PDA_EBA</subfield><subfield code="u">https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=2094759</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="966" ind1="4" ind2="0"><subfield code="l">DE-863</subfield><subfield code="p">ZDB-4-EBA</subfield><subfield code="q">FWS_PDA_EBA</subfield><subfield code="u">https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=2094759</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="938" ind1=" " ind2=" "><subfield code="a">Askews and Holts Library Services</subfield><subfield code="b">ASKH</subfield><subfield code="n">BDZ0039952975</subfield></datafield><datafield tag="938" ind1=" " ind2=" "><subfield 
code="a">ProQuest Ebook Central</subfield><subfield code="b">EBLB</subfield><subfield code="n">EBL5744445</subfield></datafield><datafield tag="938" ind1=" " ind2=" "><subfield code="a">EBSCOhost</subfield><subfield code="b">EBSC</subfield><subfield code="n">2094759</subfield></datafield><datafield tag="938" ind1=" " ind2=" "><subfield code="a">YBP Library Services</subfield><subfield code="b">YANK</subfield><subfield code="n">16142491</subfield></datafield><datafield tag="994" ind1=" " ind2=" "><subfield code="a">92</subfield><subfield code="b">GEBAY</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">ZDB-4-EBA</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-862</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-863</subfield></datafield></record></collection> |
id | ZDB-4-EBA-on1100643398 |
illustrated | Illustrated |
indexdate | 2025-03-18T14:25:39Z |
institution | BVB |
isbn | 1838648836 9781838648831 |
language | English |
oclc_num | 1100643398 |
open_access_boolean | |
owner | MAIN DE-862 DE-BY-FWS DE-863 DE-BY-FWS |
owner_facet | MAIN DE-862 DE-BY-FWS DE-863 DE-BY-FWS |
physical | 1 online resource : illustrations |
psigel | ZDB-4-EBA FWS_PDA_EBA ZDB-4-EBA |
publishDate | 2019 |
publishDateSearch | 2019 |
publishDateSort | 2019 |
publisher | Packt Publishing, |
record_format | marc |
spelling | Lai, Rudy, author. Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / Rudy Lai, Bartłomiej Potaczek. Birmingham, UK : Packt Publishing, 2019. 1 online resource : illustrations text txt rdacontent computer c rdamedia online resource cr rdacarrier Online resource; title from title page (Safari, viewed May 9, 2019). Cover; Title Page; Copyright and Credits; About Packt; Contributors; Table of Contents; Preface; Chapter 1: Pyspark and Setting up Your Development Environment; An overview of PySpark; Spark SQL; Setting up Spark on Windows and PySpark; Core concepts in Spark and PySpark; SparkContext; Spark shell; SparkConf; Summary; Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs; Loading data on to Spark RDDs; The UCI machine learning repository; Getting the data from the repository to Spark; Getting data into Spark; Parallelization with Spark RDDs; What is parallelization? 
Basics of RDD operationSummary; Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks; Using Spark Notebooks for quick iteration of ideas; Sampling/filtering RDDs to pick out relevant data points; Splitting datasets and creating some new combinations; Summary; Chapter 4: Aggregating and Summarizing Data into Useful Reports; Calculating averages with map and reduce; Faster average computations with aggregate; Pivot tabling with key-value paired data points; Summary; Chapter 5: Powerful Exploratory Data Analysis with MLlib; Computing summary statistics with MLlib Using Pearson and Spearman correlations to discover correlationsThe Pearson correlation; The Spearman correlation; Computing Pearson and Spearman correlations; Testing our hypotheses on large datasets; Summary; Chapter 6: Putting Structure on Your Big Data with SparkSQL; Manipulating DataFrames with Spark SQL schemas; Using Spark DSL to build queries; Summary; Chapter 7: Transformations and Actions; Using Spark transformations to defer computations to a later time; Avoiding transformations; Using the reduce and reduceByKey methods to calculate the results Performing actions that trigger computationsReusing the same rdd for different actions; Summary; Chapter 8: Immutable Design; Delving into the Spark RDD's parent/child chain; Extending an RDD; Chaining a new RDD with the parent; Testing our custom RDD; Using RDD in an immutable way; Using DataFrame operations to transform; Immutability in the highly concurrent environment; Using the Dataset API in an immutable way; Summary; Chapter 9: Avoiding Shuffle and Reducing Operational Expenses; Detecting a shuffle in a process; Testing operations that cause a shuffle in Apache Spark Changing the design of jobs with wide dependenciesUsing keyBy() operations to reduce shuffle; Using a custom partitioner to reduce shuffle; Summary; Chapter 10: Saving Data in the Correct Format; Saving data in plain text format; Leveraging JSON as a data format; Tabular 
formats -- CSV; Using Avro with Spark; Columnar formats -- Parquet; Summary; Chapter 11: Working with the Spark Key/Value API; Available actions on key/value pairs; Using aggregateByKey instead of groupBy(); Actions on key/value pairs; Available partitioners on key/value data; Implementing a custom partitioner; Summary In this book, you'll learn to implement some practical and proven techniques to improve aspects of programming and administration in Apache Spark. Techniques are demonstrated using practical examples and best practices. You will also learn how to use Spark and its Python API to create performant analytics with large-scale data. SPARK (Computer program language) http://id.loc.gov/authorities/subjects/sh2015001170 Application software Development. http://id.loc.gov/authorities/subjects/sh95009362 Big data. http://id.loc.gov/authorities/subjects/sh2012003227 Electronic data processing. http://id.loc.gov/authorities/subjects/sh85042288 Python (Computer program language) http://id.loc.gov/authorities/subjects/sh96008834 Logiciels d'application Développement. Données volumineuses. Python (Langage de programmation) Application software Development fast Big data fast Electronic data processing fast Python (Computer program language) fast SPARK (Computer program language) fast Potaczek, Bartłomiej, author. has work: Hands-On Big Data Analytics with PySpark (Text) https://id.oclc.org/worldcat/entity/E39PCXY3Xvkc3PMhcWcgRgjfjd https://id.oclc.org/worldcat/ontology/hasWork Print version: Lai, Rudy. Hands-On Big Data Analytics with Pyspark : Analyze Large Datasets and Discover Techniques for Testing, Immunizing, and Parallelizing Spark Jobs. Birmingham : Packt Publishing Ltd, ©2019 9781838644130 |
spellingShingle | Lai, Rudy Potaczek, Bartłomiej Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / Cover; Title Page; Copyright and Credits; About Packt; Contributors; Table of Contents; Preface; Chapter 1: Pyspark and Setting up Your Development Environment; An overview of PySpark; Spark SQL; Setting up Spark on Windows and PySpark; Core concepts in Spark and PySpark; SparkContext; Spark shell; SparkConf; Summary; Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs; Loading data on to Spark RDDs; The UCI machine learning repository; Getting the data from the repository to Spark; Getting data into Spark; Parallelization with Spark RDDs; What is parallelization? Basics of RDD operationSummary; Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks; Using Spark Notebooks for quick iteration of ideas; Sampling/filtering RDDs to pick out relevant data points; Splitting datasets and creating some new combinations; Summary; Chapter 4: Aggregating and Summarizing Data into Useful Reports; Calculating averages with map and reduce; Faster average computations with aggregate; Pivot tabling with key-value paired data points; Summary; Chapter 5: Powerful Exploratory Data Analysis with MLlib; Computing summary statistics with MLlib Using Pearson and Spearman correlations to discover correlationsThe Pearson correlation; The Spearman correlation; Computing Pearson and Spearman correlations; Testing our hypotheses on large datasets; Summary; Chapter 6: Putting Structure on Your Big Data with SparkSQL; Manipulating DataFrames with Spark SQL schemas; Using Spark DSL to build queries; Summary; Chapter 7: Transformations and Actions; Using Spark transformations to defer computations to a later time; Avoiding transformations; Using the reduce and reduceByKey methods to calculate the results Performing actions that trigger computationsReusing the same rdd for 
different actions; Summary; Chapter 8: Immutable Design; Delving into the Spark RDD's parent/child chain; Extending an RDD; Chaining a new RDD with the parent; Testing our custom RDD; Using RDD in an immutable way; Using DataFrame operations to transform; Immutability in the highly concurrent environment; Using the Dataset API in an immutable way; Summary; Chapter 9: Avoiding Shuffle and Reducing Operational Expenses; Detecting a shuffle in a process; Testing operations that cause a shuffle in Apache Spark Changing the design of jobs with wide dependenciesUsing keyBy() operations to reduce shuffle; Using a custom partitioner to reduce shuffle; Summary; Chapter 10: Saving Data in the Correct Format; Saving data in plain text format; Leveraging JSON as a data format; Tabular formats -- CSV; Using Avro with Spark; Columnar formats -- Parquet; Summary; Chapter 11: Working with the Spark Key/Value API; Available actions on key/value pairs; Using aggregateByKey instead of groupBy(); Actions on key/value pairs; Available partitioners on key/value data; Implementing a custom partitioner; Summary SPARK (Computer program language) http://id.loc.gov/authorities/subjects/sh2015001170 Application software Development. http://id.loc.gov/authorities/subjects/sh95009362 Big data. http://id.loc.gov/authorities/subjects/sh2012003227 Electronic data processing. http://id.loc.gov/authorities/subjects/sh85042288 Python (Computer program language) http://id.loc.gov/authorities/subjects/sh96008834 Logiciels d'application Développement. Données volumineuses. Python (Langage de programmation) Application software Development fast Big data fast Electronic data processing fast Python (Computer program language) fast SPARK (Computer program language) fast |
subject_GND | http://id.loc.gov/authorities/subjects/sh2015001170 http://id.loc.gov/authorities/subjects/sh95009362 http://id.loc.gov/authorities/subjects/sh2012003227 http://id.loc.gov/authorities/subjects/sh85042288 http://id.loc.gov/authorities/subjects/sh96008834 |
title | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / |
title_auth | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / |
title_exact_search | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / |
title_full | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / Rudy Lai, Bartłomiej Potaczek. |
title_fullStr | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / Rudy Lai, Bartłomiej Potaczek. |
title_full_unstemmed | Hands-on big data analytics with PySpark : analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / Rudy Lai, Bartłomiej Potaczek. |
title_short | Hands-on big data analytics with PySpark : |
title_sort | hands on big data analytics with pyspark analyze large datasets and discover techniques for testing immunizing and parallelizing spark jobs |
title_sub | analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs / |
topic | SPARK (Computer program language) http://id.loc.gov/authorities/subjects/sh2015001170 Application software Development. http://id.loc.gov/authorities/subjects/sh95009362 Big data. http://id.loc.gov/authorities/subjects/sh2012003227 Electronic data processing. http://id.loc.gov/authorities/subjects/sh85042288 Python (Computer program language) http://id.loc.gov/authorities/subjects/sh96008834 Logiciels d'application Développement. Données volumineuses. Python (Langage de programmation) Application software Development fast Big data fast Electronic data processing fast Python (Computer program language) fast SPARK (Computer program language) fast |
topic_facet | SPARK (Computer program language) Application software Development. Big data. Electronic data processing. Python (Computer program language) Logiciels d'application Développement. Données volumineuses. Python (Langage de programmation) Application software Development Big data Electronic data processing |
work_keys_str_mv | AT lairudy handsonbigdataanalyticswithpysparkanalyzelargedatasetsanddiscovertechniquesfortestingimmunizingandparallelizingsparkjobs AT potaczekbartłomiej handsonbigdataanalyticswithpysparkanalyzelargedatasetsanddiscovertechniquesfortestingimmunizingandparallelizingsparkjobs |