Advanced analytics with Spark: [patterns for learning from data at scale]
Format: | Book |
---|---|
Language: | English |
Published: |
Beijing [u.a.]
O'Reilly
2015
|
Edition: | 1. ed. |
Subjects: | |
Online access: | Content description, Table of contents |
Description: | XII, 260 pp., graphs |
ISBN: | 9781491912768 1491912766 |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV042625523 | ||
003 | DE-604 | ||
005 | 20160121 | ||
007 | t | ||
008 | 150618s2015 xxud||| |||| 00||| eng d | ||
015 | |a 15,N09 |2 dnb | ||
016 | 7 | |a 1067307931 |2 DE-101 | |
020 | |a 9781491912768 |c Pb. : EUR 41.00 (DE) (freier Pr.), EUR 42.20 (AT) (freier Pr.) |9 978-1-491-91276-8 | ||
020 | |a 1491912766 |9 1-4919-1276-6 | ||
035 | |a (OCoLC)904165350 | ||
035 | |a (DE-599)DNB1067307931 | ||
040 | |a DE-604 |b ger |e rakddb | ||
041 | 0 | |a eng | |
044 | |a xxu |c XD-US | ||
049 | |a DE-573 |a DE-11 |a DE-523 |a DE-M347 |a DE-92 |a DE-B768 |a DE-859 |a DE-19 | ||
082 | 0 | |a 004 | |
084 | |a ST 250 |0 (DE-625)143626: |2 rvk | ||
084 | |a ST 253 |0 (DE-625)143628: |2 rvk | ||
084 | |a 004 |2 sdnb | ||
245 | 1 | 0 | |a Advanced analytics with Spark |b [patterns for learning from data at scale] |c Sandy Ryza ... |
250 | |a 1. ed. | ||
264 | 1 | |a Beijing [u.a.] |b O'Reilly |c 2015 | |
300 | |a XII, 260 S. |b graph. Darst. | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
650 | 0 | 7 | |a Data Mining |0 (DE-588)4428654-5 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a SPARK |g Programmiersprache |0 (DE-588)4790233-4 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a SPARK |g Programmiersprache |0 (DE-588)4790233-4 |D s |
689 | 0 | 1 | |a Data Mining |0 (DE-588)4428654-5 |D s |
689 | 0 | |5 DE-604 | |
700 | 1 | |a Ryza, Sandy |e Sonstige |4 oth | |
856 | 4 | 2 | |m X:MVB |q text/html |u http://deposit.dnb.de/cgi-bin/dokserv?id=5158435&prov=M&dok_var=1&dok_ext=htm |3 Inhaltstext |
856 | 4 | 2 | |m HBZ Datenaustausch |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=028058177&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
943 | 1 | |a oai:aleph.bib-bvb.de:BVB01-028058177 |
Record in the search index
_version_ | 1809771710556143616 |
---|---|
adam_text |
Title: Advanced analytics with Spark
Author: Ryza, Sandy
Year: 2015
Table of Contents
Foreword
Preface
1. Analyzing Big Data
  The Challenges of Data Science
  Introducing Apache Spark
  About This Book
2. Introduction to Data Analysis with Scala and Spark
  Scala for Data Scientists
  The Spark Programming Model
  Record Linkage
  Getting Started: The Spark Shell and SparkContext
  Bringing Data from the Cluster to the Client
  Shipping Code from the Client to the Cluster
  Structuring Data with Tuples and Case Classes
  Aggregations
  Creating Histograms
  Summary Statistics for Continuous Variables
  Creating Reusable Code for Computing Summary Statistics
  Simple Variable Selection and Scoring
  Where to Go from Here
3. Recommending Music and the Audioscrobbler Data Set
  Data Set
  The Alternating Least Squares Recommender Algorithm
  Preparing the Data
  Building a First Model
  Spot Checking Recommendations
  Evaluating Recommendation Quality
  Computing AUC
  Hyperparameter Selection
  Making Recommendations
  Where to Go from Here
4. Predicting Forest Cover with Decision Trees
  Fast Forward to Regression
  Vectors and Features
  Training Examples
  Decision Trees and Forests
  Covtype Data Set
  Preparing the Data
  A First Decision Tree
  Decision Tree Hyperparameters
  Tuning Decision Trees
  Categorical Features Revisited
  Random Decision Forests
  Making Predictions
  Where to Go from Here
5. Anomaly Detection in Network Traffic with K-means Clustering
  Anomaly Detection
  K-means Clustering
  Network Intrusion
  KDD Cup 1999 Data Set
  A First Take on Clustering
  Choosing k
  Visualization in R
  Feature Normalization
  Categorical Variables
  Using Labels with Entropy
  Clustering in Action
  Where to Go from Here
6. Understanding Wikipedia with Latent Semantic Analysis
  The Term-Document Matrix
  Getting the Data
  Parsing and Preparing the Data
  Lemmatization
  Computing the TF-IDFs
  Singular Value Decomposition
  Finding Important Concepts
  Querying and Scoring with the Low-Dimensional Representation
  Term-Term Relevance
  Document-Document Relevance
  Term-Document Relevance
  Multiple-Term Queries
  Where to Go from Here
7. Analyzing Co-occurrence Networks with GraphX
  The MEDLINE Citation Index: A Network Analysis
  Getting the Data
  Parsing XML Documents with Scala’s XML Library
  Analyzing the MeSH Major Topics and Their Co-occurrences
  Constructing a Co-occurrence Network with GraphX
  Understanding the Structure of Networks
  Connected Components
  Degree Distribution
  Filtering Out Noisy Edges
  Processing EdgeTriplets
  Analyzing the Filtered Graph
  Small-World Networks
  Cliques and Clustering Coefficients
  Computing Average Path Length with Pregel
  Where to Go from Here
8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
  Getting the Data
  Working with Temporal and Geospatial Data in Spark
  Temporal Data with JodaTime and NScalaTime
  Geospatial Data with the Esri Geometry API and Spray
  Exploring the Esri Geometry API
  Intro to GeoJSON
  Preparing the New York City Taxi Trip Data
  Handling Invalid Records at Scale
  Geospatial Analysis
  Sessionization in Spark
  Building Sessions: Secondary Sorts in Spark
  Where to Go from Here
9. Estimating Financial Risk through Monte Carlo Simulation
  Terminology
  Methods for Calculating VaR
  Variance-Covariance
  Historical Simulation
  Monte Carlo Simulation
  Our Model
  Getting the Data
  Preprocessing
  Determining the Factor Weights
  Sampling
  The Multivariate Normal Distribution
  Running the Trials
  Visualizing the Distribution of Returns
  Evaluating Our Results
  Where to Go from Here
10. Analyzing Genomics Data and the BDG Project
  Decoupling Storage from Modeling
  Ingesting Genomics Data with the ADAM CLI
  Parquet Format and Columnar Storage
  Predicting Transcription Factor Binding Sites from ENCODE Data
  Querying Genotypes from the 1000 Genomes Project
  Where to Go from Here
11. Analyzing Neuroimaging Data with PySpark and Thunder
  Overview of PySpark
  PySpark Internals
  Overview and Installation of the Thunder Library
  Loading Data with Thunder
  Thunder Core Data Types
  Categorizing Neuron Types with Thunder
  Where to Go from Here
A. Deeper into Spark
B. Upcoming MLlib Pipelines API
Index |
any_adam_object | 1 |
building | Verbundindex |
bvnumber | BV042625523 |
classification_rvk | ST 250 ST 253 |
ctrlnum | (OCoLC)904165350 (DE-599)DNB1067307931 |
dewey-full | 004 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 004 - Computer science |
dewey-raw | 004 |
dewey-search | 004 |
dewey-sort | 14 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Informatik |
edition | 1. ed. |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>00000nam a2200000 c 4500</leader><controlfield tag="001">BV042625523</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20160121</controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">150618s2015 xxud||| |||| 00||| eng d</controlfield><datafield tag="015" ind1=" " ind2=" "><subfield code="a">15,N09</subfield><subfield code="2">dnb</subfield></datafield><datafield tag="016" ind1="7" ind2=" "><subfield code="a">1067307931</subfield><subfield code="2">DE-101</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781491912768</subfield><subfield code="c">Pb. : EUR 41.00 (DE) (freier Pr.), EUR 42.20 (AT) (freier Pr.)</subfield><subfield code="9">978-1-491-91276-8</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">1491912766</subfield><subfield code="9">1-4919-1276-6</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)904165350</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)DNB1067307931</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rakddb</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="044" ind1=" " ind2=" "><subfield code="a">xxu</subfield><subfield code="c">XD-US</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-573</subfield><subfield code="a">DE-11</subfield><subfield code="a">DE-523</subfield><subfield code="a">DE-M347</subfield><subfield code="a">DE-92</subfield><subfield code="a">DE-B768</subfield><subfield code="a">DE-859</subfield><subfield code="a">DE-19</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield 
code="a">004</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 250</subfield><subfield code="0">(DE-625)143626:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 253</subfield><subfield code="0">(DE-625)143628:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">004</subfield><subfield code="2">sdnb</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Advanced analytics with Spark</subfield><subfield code="b">[patterns for learning from data at scale]</subfield><subfield code="c">Sandy Ryza ...</subfield></datafield><datafield tag="250" ind1=" " ind2=" "><subfield code="a">1. ed.</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Beijing [u.a.]</subfield><subfield code="b">O'Reilly</subfield><subfield code="c">2015</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XII, 260 S.</subfield><subfield code="b">graph. 
Darst.</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">SPARK</subfield><subfield code="g">Programmiersprache</subfield><subfield code="0">(DE-588)4790233-4</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">SPARK</subfield><subfield code="g">Programmiersprache</subfield><subfield code="0">(DE-588)4790233-4</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Ryza, Sandy</subfield><subfield code="e">Sonstige</subfield><subfield code="4">oth</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">X:MVB</subfield><subfield code="q">text/html</subfield><subfield code="u">http://deposit.dnb.de/cgi-bin/dokserv?id=5158435&prov=M&dok_var=1&dok_ext=htm</subfield><subfield code="3">Inhaltstext</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">HBZ Datenaustausch</subfield><subfield code="q">application/pdf</subfield><subfield 
code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=028058177&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="943" ind1="1" ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-028058177</subfield></datafield></record></collection> |
id | DE-604.BV042625523 |
illustrated | Illustrated |
indexdate | 2024-09-10T01:46:37Z |
institution | BVB |
isbn | 9781491912768 1491912766 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-028058177 |
oclc_num | 904165350 |
open_access_boolean | |
owner | DE-573 DE-11 DE-523 DE-M347 DE-92 DE-B768 DE-859 DE-19 DE-BY-UBM |
owner_facet | DE-573 DE-11 DE-523 DE-M347 DE-92 DE-B768 DE-859 DE-19 DE-BY-UBM |
physical | XII, 260 S. graph. Darst. |
publishDate | 2015 |
publishDateSearch | 2015 |
publishDateSort | 2015 |
publisher | O'Reilly |
record_format | marc |
spelling | Advanced analytics with Spark [patterns for learning from data at scale] Sandy Ryza ... 1. ed. Beijing [u.a.] O'Reilly 2015 XII, 260 S. graph. Darst. txt rdacontent n rdamedia nc rdacarrier Data Mining (DE-588)4428654-5 gnd rswk-swf SPARK Programmiersprache (DE-588)4790233-4 gnd rswk-swf SPARK Programmiersprache (DE-588)4790233-4 s Data Mining (DE-588)4428654-5 s DE-604 Ryza, Sandy Sonstige oth X:MVB text/html http://deposit.dnb.de/cgi-bin/dokserv?id=5158435&prov=M&dok_var=1&dok_ext=htm Inhaltstext HBZ Datenaustausch application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=028058177&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Advanced analytics with Spark [patterns for learning from data at scale] Data Mining (DE-588)4428654-5 gnd SPARK Programmiersprache (DE-588)4790233-4 gnd |
subject_GND | (DE-588)4428654-5 (DE-588)4790233-4 |
title | Advanced analytics with Spark [patterns for learning from data at scale] |
title_auth | Advanced analytics with Spark [patterns for learning from data at scale] |
title_exact_search | Advanced analytics with Spark [patterns for learning from data at scale] |
title_full | Advanced analytics with Spark [patterns for learning from data at scale] Sandy Ryza ... |
title_fullStr | Advanced analytics with Spark [patterns for learning from data at scale] Sandy Ryza ... |
title_full_unstemmed | Advanced analytics with Spark [patterns for learning from data at scale] Sandy Ryza ... |
title_short | Advanced analytics with Spark |
title_sort | advanced analytics with spark patterns for learning from data at scale |
title_sub | [patterns for learning from data at scale] |
topic | Data Mining (DE-588)4428654-5 gnd SPARK Programmiersprache (DE-588)4790233-4 gnd |
topic_facet | Data Mining SPARK Programmiersprache |
url | http://deposit.dnb.de/cgi-bin/dokserv?id=5158435&prov=M&dok_var=1&dok_ext=htm http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=028058177&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT ryzasandy advancedanalyticswithsparkpatternsforlearningfromdataatscale |