Data just right: introduction to large-scale data & analytics
Main Author: | Manoochehri, Michael |
---|---|
Format: | Book |
Language: | English |
Published: | Upper Saddle River, NJ [etc.]: Addison-Wesley, 2014 |
Series: | Addison Wesley data and analytics series |
Subjects: | |
Online Access: | Table of contents |
Notes: | Includes bibliographical references and index |
Physical Description: | XXIII, 215 pp., illustrations, maps, 24 cm |
ISBN: | 9780321898654 0321898656 |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV041725480 | ||
003 | DE-604 | ||
005 | 20140408 | ||
007 | t | ||
008 | 140310s2014 xxubd|| |||| 00||| eng d | ||
010 | |a 013041476 | ||
020 | |a 9780321898654 |9 978-0-321-89865-4 | ||
020 | |a 0321898656 |9 0-321-89865-6 | ||
035 | |a (OCoLC)861789010 | ||
035 | |a (DE-599)BVBBV041725480 | ||
040 | |a DE-604 |b ger |e aacr | ||
041 | 0 | |a eng | |
044 | |a xxu |c US | ||
049 | |a DE-473 |a DE-19 |a DE-1049 | ||
050 | 0 | |a QA76.9.D26 | |
082 | 0 | |a 005.74/3 |2 23 | |
084 | |a ST 265 |0 (DE-625)143634: |2 rvk | ||
100 | 1 | |a Manoochehri, Michael |e Verfasser |0 (DE-588)1048748499 |4 aut | |
245 | 1 | 0 | |a Data just right |b introduction to large-scale data & analytics |c Michael Manoochehri |
264 | 1 | |a Upper Saddle River, NJ [u. a.] |b Addison-Wesley |c 2014 | |
300 | |a XXIII, 215 S. |b graph. Darst., Kt. |c 24 cm | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 0 | |a Addison Wesley data and analytics series | |
500 | |a Includes bibliographical references and index | ||
650 | 4 | |a Database design | |
650 | 4 | |a Big data | |
650 | 0 | 7 | |a Datenanalyse |0 (DE-588)4123037-1 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Big Data |0 (DE-588)4802620-7 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Big Data |0 (DE-588)4802620-7 |D s |
689 | 0 | 1 | |a Datenanalyse |0 (DE-588)4123037-1 |D s |
689 | 0 | |5 DE-604 | |
856 | 4 | 2 | |m Digitalisierung UB Bamberg - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027172353&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-027172353 |
Record in the search index
_version_ | 1804152005799706624 |
---|---|
adam_text | Contents
Foreword
xv
Preface
xvii
Acknowledgments
xxv
About the Author
xxvii
I Directives in the Big Data Era 1
1
Four Rules for Data Success
3
When Data Became a BIG Deal
3
Data and the Single Server
4
The Big Data Trade-Off
5
Build Solutions That Scale (Toward Infinity)
6
Build Systems That Can Share Data (On the
Internet)
7
Build Solutions, Not Infrastructure
8
Focus on Unlocking Value from Your Data
8
Anatomy of a Big Data Pipeline
9
The Ultimate Database
10
Summary
10
II Collecting and Sharing a Lot of Data
11
2
Hosting and Sharing Terabytes of Raw Data
13
Suffering from Files
14
The Challenges of Sharing Lots of Files
14
Storage: Infrastructure as a Service
15
The Network Is Slow
16
Choosing the Right Data Format
16
XML: Data, Describe Thyself
18
JSON: The Programmer's Choice
18
Character Encoding
19
File Transformations
21
Data in Motion: Data Serialization Formats
21
Apache Thrift and Protocol Buffers
22
Summary
23
3
Building a NoSQL-Based Web App to Collect
Crowd-Sourced Data
25
Relational Databases: Command and Control
28
The Relational Database ACID Test
28
Relational Databases versus the Internet
28
CAP Theorem and BASE
30
Nonrelational Database Models
31
Key-Value Database
32
Document Store
33
Leaning toward Write Performance: Redis
38
Sharding across Many Redis Instances
38
Automatic Partitioning with Twemproxy
39
Alternatives to Using Redis
40
NewSQL: The Return of Codd
41
Summary
42
4
Strategies for Dealing with Data Silos
43
A Warehouse Full of Jargon
43
The Problem in Practice
45
Planning for Data Compliance and Security
46
Enter the Data Warehouse
46
Data Warehousing's Magic Words: Extract, Transform,
and Load
48
Hadoop: The Elephant in the Warehouse
48
Data Silos Can Be Good
49
Concentrate on the Data Challenge, Not the
Technology
50
Empower Employees to Ask Their Own
Questions
50
Invest in Technology That Bridges Data Silos
51
Convergence: The End of the Data Silo
51
Will Luhn's Business Intelligence System Become
Reality?
52
Summary
53
III Asking Questions about Your Data
55
5
Using Hadoop, Hive, and Shark to Ask Questions
about Large
Datasets
57
What Is a Data Warehouse?
57
Apache Hive: Interactive Querying for Hadoop
60
Use Cases for Hive
60
Hive in Practice
61
Using Additional Data Sources with Hive
65
Shark: Queries at the Speed of RAM
65
Data Warehousing in the Cloud
66
Summary
67
6
Building a Data Dashboard with Google
BigQuery
69
Analytical Databases
69
Dremel: Spreading the Wealth
71
How Dremel and MapReduce Differ
72
BigQuery: Data Analytics as a Service
73
BigQuery's Query Language
74
Building a Custom Big Data Dashboard
75
Authorizing Access to the BigQuery API
76
Running a Query and Retrieving the Result
78
Caching Query Results
79
Adding Visualization
81
The Future of Analytical Query Engines
82
Summary
83
7
Visualization Strategies for Exploring Large
Datasets
85
Cautionary Tales: Translating Data into Narrative
86
Human Scale versus Machine Scale
89
Interactivity
89
Building Applications for Data Interactivity
90
Interactive Visualizations with R and ggplot2
90
matplotlib: 2-D Charts with Python
92
D3.js: Interactive Visualizations for the Web
92
Summary
96
IV Building Data Pipelines
97
8
Putting It Together: MapReduce Data
Pipelines
99
What Is a Data
Pipeline?
99
The Right Tool for the Job
100
Data Pipelines with Hadoop Streaming
101
MapReduce and Data Transformation
101
The Simplest Pipeline: stdin to stdout
102
A One-Step MapReduce Transformation
105
Extracting Relevant Information from Raw NVSS Data:
Map Phase
106
Counting Births per Month: The Reducer
Phase
107
Testing the MapReduce Pipeline Locally
108
Running Our MapReduce Job on a Hadoop
Cluster
109
Managing Complexity: Python MapReduce Frameworks for
Hadoop
110
Rewriting Our Hadoop Streaming Example Using
mrjob
110
Building a Multistep Pipeline
112
Running mrjob Scripts on Elastic MapReduce
113
Alternative Python-Based MapReduce
Frameworks
114
Summary
114
9
Building Data Transformation Workflows with Pig and
Cascading
117
Large-Scale Data Workflows in Practice
118
It's Complicated: Multistep MapReduce
Transformations
118
Apache Pig: Ixnay on the Omplexitycay
119
Running Pig Using the Interactive Grunt Shell
120
Filtering and Optimizing Data Workflows
121
Running a Pig Script in Batch Mode
122
Cascading: Building Robust Data-Workflow
Applications
122
Thinking in Terms of Sources and Sinks
123
Building a Cascading Application
124
Creating a Cascade: A Simple JOIN Example
125
Deploying a Cascading Application on a Hadoop
Cluster
127
When to Choose Pig versus Cascading
128
Summary
128
V Machine Learning for Large
Datasets
129
10
Building a Data Classification System with
Mahout
131
Can Machines Predict the Future?
132
Challenges of Machine Learning
132
Bayesian Classification
133
Clustering
134
Recommendation Engines
135
Apache Mahout: Scalable Machine Learning
136
Using Mahout to Classify Text
137
MLBase: Distributed Machine Learning
Framework
139
Summary
140
VI Statistical Analysis for Massive
Datasets
143
11
Using R with Large Datasets
145
Why Statistics Are Sexy
146
Limitations of R for Large Datasets
147
R Data Frames and Matrices
148
Strategies for Dealing with Large
Datasets
149
Large Matrix Manipulation: bigmemory and
biganalytics
150
ff: Working with Data Frames Larger than
Memory
151
biglm: Linear Regression for Large
Datasets
152
RHadoop: Accessing Apache Hadoop from
R
154
Summary
155
12
Building Analytics Workflows Using Python and
Pandas
157
The Snakes Are Loose in the Data Zoo
157
Choosing a Language for Statistical
Computation
158
Extending Existing Code
159
Tools and Testing
160
Python Libraries for Data Processing
160
NumPy
160
SciPy: Scientific Computing for Python
162
The Pandas Data Analysis Library
163
Building More Complex Workflows
167
Working with Bad or Missing Records
169
IPython: Completing the Scientific Computing Tool
Chain
170
Parallelizing IPython Using a Cluster
171
Summary
174
VII
Looking Ahead
177
13
When to Build, When to Buy, When to
Outsource
179
Overlapping Solutions
179
Understanding Your Data Problem
181
A Playbook for the Build versus Buy Problem
182
What Have You Already Invested In?
183
Starting Small
183
Planning for Scale
184
My Own Private Data Center
184
Understand the Costs of Open-Source
186
Everything as a Service
187
Summary
187
14
The Future: Trends in Data Technology
189
Hadoop: The Disruptor and the Disrupted
190
Everything in the Cloud
191
The Rise and Fall of the Data Scientist
193
Convergence: The Ultimate Database
195
Convergence of Cultures
196
Summary
197
Index
199
|
any_adam_object | 1 |
author | Manoochehri, Michael |
author_GND | (DE-588)1048748499 |
author_facet | Manoochehri, Michael |
author_role | aut |
author_sort | Manoochehri, Michael |
author_variant | m m mm |
building | Verbundindex |
bvnumber | BV041725480 |
callnumber-first | Q - Science |
callnumber-label | QA76 |
callnumber-raw | QA76.9.D26 |
callnumber-search | QA76.9.D26 |
callnumber-sort | QA 276.9 D26 |
callnumber-subject | QA - Mathematics |
classification_rvk | ST 265 |
ctrlnum | (OCoLC)861789010 (DE-599)BVBBV041725480 |
dewey-full | 005.74/3 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 005 - Computer programming, programs, data, security |
dewey-raw | 005.74/3 |
dewey-search | 005.74/3 |
dewey-sort | 15.74 13 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Informatik |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01714nam a2200445 c 4500</leader><controlfield tag="001">BV041725480</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20140408 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">140310s2014 xxubd|| |||| 00||| eng d</controlfield><datafield tag="010" ind1=" " ind2=" "><subfield code="a">013041476</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9780321898654</subfield><subfield code="9">978-0-321-89865-4</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">0321898656</subfield><subfield code="9">0-321-89865-6</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)861789010</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV041725480</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">aacr</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="044" ind1=" " ind2=" "><subfield code="a">xxu</subfield><subfield code="c">US</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-473</subfield><subfield code="a">DE-19</subfield><subfield code="a">DE-1049</subfield></datafield><datafield tag="050" ind1=" " ind2="0"><subfield code="a">QA76.9.D26</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">005.74/3</subfield><subfield code="2">23</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 265</subfield><subfield code="0">(DE-625)143634:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Manoochehri, Michael</subfield><subfield 
code="e">Verfasser</subfield><subfield code="0">(DE-588)1048748499</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Data just right</subfield><subfield code="b">introduction to large-scale data & analytics</subfield><subfield code="c">Michael Manoochehri</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Upper Saddle River, NJ [u. a.]</subfield><subfield code="b">Addison-Wesley</subfield><subfield code="c">2014</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XXIII, 215 S.</subfield><subfield code="b">graph. Darst., Kt.</subfield><subfield code="c">24 cm</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="0" ind2=" "><subfield code="a">Addison Wesley data and analytics series</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Includes bibliographical references and index</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Database design</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Big data</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Datenanalyse</subfield><subfield code="0">(DE-588)4123037-1</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Big Data</subfield><subfield code="0">(DE-588)4802620-7</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Big 
Data</subfield><subfield code="0">(DE-588)4802620-7</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Datenanalyse</subfield><subfield code="0">(DE-588)4123037-1</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Bamberg - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027172353&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-027172353</subfield></datafield></record></collection> |
id | DE-604.BV041725480 |
illustrated | Illustrated |
indexdate | 2024-07-10T01:03:50Z |
institution | BVB |
isbn | 9780321898654 0321898656 |
language | English |
lccn | 013041476 |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-027172353 |
oclc_num | 861789010 |
open_access_boolean | |
owner | DE-473 DE-BY-UBG DE-19 DE-BY-UBM DE-1049 |
owner_facet | DE-473 DE-BY-UBG DE-19 DE-BY-UBM DE-1049 |
physical | XXIII, 215 S. graph. Darst., Kt. 24 cm |
publishDate | 2014 |
publishDateSearch | 2014 |
publishDateSort | 2014 |
publisher | Addison-Wesley |
record_format | marc |
series2 | Addison Wesley data and analytics series |
spelling | Manoochehri, Michael Verfasser (DE-588)1048748499 aut Data just right introduction to large-scale data & analytics Michael Manoochehri Upper Saddle River, NJ [u. a.] Addison-Wesley 2014 XXIII, 215 S. graph. Darst., Kt. 24 cm txt rdacontent n rdamedia nc rdacarrier Addison Wesley data and analytics series Includes bibliographical references and index Database design Big data Datenanalyse (DE-588)4123037-1 gnd rswk-swf Big Data (DE-588)4802620-7 gnd rswk-swf Big Data (DE-588)4802620-7 s Datenanalyse (DE-588)4123037-1 s DE-604 Digitalisierung UB Bamberg - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027172353&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Manoochehri, Michael Data just right introduction to large-scale data & analytics Database design Big data Datenanalyse (DE-588)4123037-1 gnd Big Data (DE-588)4802620-7 gnd |
subject_GND | (DE-588)4123037-1 (DE-588)4802620-7 |
title | Data just right introduction to large-scale data & analytics |
title_auth | Data just right introduction to large-scale data & analytics |
title_exact_search | Data just right introduction to large-scale data & analytics |
title_full | Data just right introduction to large-scale data & analytics Michael Manoochehri |
title_fullStr | Data just right introduction to large-scale data & analytics Michael Manoochehri |
title_full_unstemmed | Data just right introduction to large-scale data & analytics Michael Manoochehri |
title_short | Data just right |
title_sort | data just right introduction to large scale data analytics |
title_sub | introduction to large-scale data & analytics |
topic | Database design Big data Datenanalyse (DE-588)4123037-1 gnd Big Data (DE-588)4802620-7 gnd |
topic_facet | Database design Big data Datenanalyse Big Data |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027172353&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT manoochehrimichael datajustrightintroductiontolargescaledataanalytics |