Doing Data Science: [straight talk from the frontline]
Main authors: | O'Neil, Cathy; Schutt, Rachel |
---|---|
Format: | Book |
Language: | English |
Published: | Beijing [et al.] : O'Reilly, 2014 |
Edition: | 1st ed. |
Subjects: | Big Data; Datenanalyse (data analysis) |
Online access: | Description text, Table of contents |
Note: | Also covers later, unchanged reprints |
Physical description: | XXV, 377 pp., illustrations, diagrams |
ISBN: | 9781449358655 1449358659 |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV041253517 | ||
003 | DE-604 | ||
005 | 20211018 | ||
007 | t | ||
008 | 130903s2014 ad|| |||| 00||| eng d | ||
015 | |a 13,N28 |2 dnb | ||
016 | 7 | |a 1036704041 |2 DE-101 | |
020 | |a 9781449358655 |c kart. : EUR 31.99 (DE) (freier Pr.), EUR 32.90 (AT) (freier Pr.), $ 54.99 (US), $ 72.99 (CAN) |9 978-1-449-35865-5 | ||
020 | |a 1449358659 |9 1-449-35865-9 | ||
024 | 3 | |a 9781449358655 | |
035 | |a (OCoLC)864555597 | ||
035 | |a (DE-599)DNB1036704041 | ||
040 | |a DE-604 |b ger |e rakddb | ||
041 | 0 | |a eng | |
049 | |a DE-29T |a DE-11 |a DE-945 |a DE-355 |a DE-20 |a DE-1043 |a DE-573 |a DE-B768 |a DE-860 | ||
082 | 0 | |a 005.7 |2 23/ger | |
082 | 0 | |a 519.53 |2 23/ger | |
084 | |a QH 232 |0 (DE-625)141547: |2 rvk | ||
084 | |a QP 345 |0 (DE-625)141866: |2 rvk | ||
084 | |a ST 530 |0 (DE-625)143679: |2 rvk | ||
084 | |a 004 |2 sdnb | ||
084 | |a 510 |2 sdnb | ||
100 | 1 | |a O'Neil, Cathy |d 1972- |e Verfasser |0 (DE-588)104532714X |4 aut | |
245 | 1 | 0 | |a Doing Data Science |b [straight talk from the frontline] |c Cathy O'Neil and Rachel Schutt |
250 | |a 1. ed. | ||
264 | 1 | |a Beijing [u.a.] |b O'Reilly |c 2014 | |
300 | |a XXV, 377 S. |b Ill., graph. Darst. | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
500 | |a Hier auch später erschienene, unveränderte Nachdrucke | ||
650 | 0 | 7 | |a Big Data |0 (DE-588)4802620-7 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Datenanalyse |0 (DE-588)4123037-1 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Big Data |0 (DE-588)4802620-7 |D s |
689 | 0 | 1 | |a Datenanalyse |0 (DE-588)4123037-1 |D s |
689 | 0 | |5 DE-604 | |
700 | 1 | |a Schutt, Rachel |d 1976- |e Verfasser |0 (DE-588)1045327425 |4 aut | |
856 | 4 | 2 | |m X:MVB |q text/html |u http://deposit.dnb.de/cgi-bin/dokserv?id=4376836&prov=M&dok_var=1&dok_ext=htm |3 Inhaltstext |
856 | 4 | 2 | |m HBZ Datenaustausch |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=026227509&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
943 | 1 | |a oai:aleph.bib-bvb.de:BVB01-026227509 |
Record in the search index
_version_ | 1806325740102221824 |
---|---|
adam_text |
Title: Doing data science
Author: O'Neil, Cathy
Year: 2014
Table of Contents
Preface. xiii
1. Introduction: What Is Data Science?. 1
Big Data and Data Science Hype 1
Getting Past the Hype 3
Why Now? 4
Datafication 5
The Current Landscape (with a Little History) 6
Data Science Jobs 10
A Data Science Profile 10
Thought Experiment: Meta-Definition 13
OK, So What Is a Data Scientist, Really? 14
In Academia 14
In Industry 15
2. Statistical Inference, Exploratory Data Analysis, and the Data Science
Process. 17
Statistical Thinking in the Age of Big Data 17
Statistical Inference 18
Populations and Samples 19
Populations and Samples of Big Data 21
Big Data Can Mean Big Assumptions 24
Modeling 26
Exploratory Data Analysis 34
Philosophy of Exploratory Data Analysis 36
Exercise: EDA 37
The Data Science Process 41
A Data Scientist's Role in This Process 43
Thought Experiment: How Would You Simulate Chaos? 44
Case Study: RealDirect 46
How Does RealDirect Make Money? 47
Exercise: RealDirect Data Strategy 48
3. Algorithms. 51
Machine Learning Algorithms 52
Three Basic Algorithms 54
Linear Regression 55
k-Nearest Neighbors (k-NN) 71
k-means 82
Exercise: Basic Machine Learning Algorithms 86
Solutions 86
Summing It All Up 91
Thought Experiment: Automated Statistician 92
4. Spam Filters, Naive Bayes, and Wrangling. 93
Thought Experiment: Learning by Example 93
Why Won't Linear Regression Work for Filtering Spam? 95
How About k-nearest Neighbors? 96
Naive Bayes 98
Bayes Law 98
A Spam Filter for Individual Words 99
A Spam Filter That Combines Words: Naive Bayes 101
Fancy It Up: Laplace Smoothing 103
Comparing Naive Bayes to k-NN 105
Sample Code in bash 105
Scraping the Web: APIs and Other Tools 106
Jake's Exercise: Naive Bayes for Article Classification 108
Sample R Code for Dealing with the NYT API 110
5. Logistic Regression. 113
Thought Experiments 114
Classifiers 115
Runtime 116
You 117
Interpretability 117
Scalability 117
M6D Logistic Regression Case Study 118
Click Models 118
The Underlying Math 120
Estimating α and β 122
Newton's Method 124
Stochastic Gradient Descent 124
Implementation 124
Evaluation 125
Media 6 Degrees Exercise 128
Sample R Code 129
6. Time Stamps and Financial Modeling. 135
Kyle Teague and GetGlue 135
Timestamps 137
Exploratory Data Analysis (EDA) 138
Metrics and New Variables or Features 142
What's Next? 142
Cathy O'Neil 144
Thought Experiment 144
Financial Modeling 145
In-Sample, Out-of-Sample, and Causality 146
Preparing Financial Data 148
Log Returns 149
Example: The S&P Index 151
Working out a Volatility Measurement 153
Exponential Downweighting 155
The Financial Modeling Feedback Loop 156
Why Regression? 158
Adding Priors 159
A Baby Model 159
Exercise: GetGlue and Timestamped Event Data 162
Exercise: Financial Data 164
7. Extracting Meaning from Data. 165
William Cukierski 165
Background: Data Science Competitions 166
Background: Crowdsourcing 167
The Kaggle Model 170
A Single Contestant 170
Their Customers 172
Thought Experiment: What Are the Ethical Implications of a
Robo-Grader? 174
Feature Selection 176
Example: User Retention 177
Filters 181
Wrappers 181
Embedded Methods: Decision Trees 184
Entropy 186
The Decision Tree Algorithm 187
Handling Continuous Variables in Decision Trees 188
Random Forests 190
User Retention: Interpretability Versus Predictive Power 192
David Huffaker: Google's Hybrid Approach to Social
Research 193
Moving from Descriptive to Predictive 194
Social at Google 196
Privacy 196
Thought Experiment: What Is the Best Way to Decrease
Concern and Increase Understanding and Control? 197
8. Recommendation Engines: Building a User-Facing Data Product
at Scale. 199
A Real-World Recommendation Engine 200
Nearest Neighbor Algorithm Review 202
Some Problems with Nearest Neighbors 202
Beyond Nearest Neighbor: Machine Learning
Classification 204
The Dimensionality Problem 206
Singular Value Decomposition (SVD) 207
Important Properties of SVD 208
Principal Component Analysis (PCA) 209
Alternating Least Squares 211
Fix V and Update U 212
Last Thoughts on These Algorithms 213
Thought Experiment: Filter Bubbles 213
Exercise: Build Your Own Recommendation System 214
Sample Code in Python 214
9. Data Visualization and Fraud Detection. 217
Data Visualization History 217
Gabriel Tarde 218
Mark's Thought Experiment 219
What Is Data Science, Redux? 220
Processing 221
Franco Moretti 221
A Sample of Data Visualization Projects 222
Mark's Data Visualization Projects 227
New York Times Lobby: Moveable Type 227
Project Cascade: Lives on a Screen 230
Cronkite Plaza 231
eBay Transactions and Books 232
Public Theater Shakespeare Machine 234
Goals of These Exhibits 235
Data Science and Risk 235
About Square 236
The Risk Challenge 237
The Trouble with Performance Estimation 240
Model Building Tips 244
Data Visualization at Square 248
Ian's Thought Experiment 249
Data Visualization for the Rest of Us 250
Data Visualization Exercise 251
10. Social Networks and Data Journalism. 253
Social Network Analysis at Morning Analytics 254
Case-Attribute Data versus Social Network Data 254
Social Network Analysis 255
Terminology from Social Networks 256
Centrality Measures 257
The Industry of Centrality Measures 258
Thought Experiment 259
Morningside Analytics 260
How Visualizations Help Us Find Schools of Fish 262
More Background on Social Network Analysis from a
Statistical Point of View 263
Representations of Networks and Eigenvalue Centrality 264
A First Example of Random Graphs: The Erdos-Renyi
Model 265
A Second Example of Random Graphs: The Exponential
Random Graph Model 266
Data Journalism 269
A Bit of History on Data Journalism 269
Writing Technical Journalism: Advice from an Expert 270
11. Causality. 273
Correlation Doesn't Imply Causation 274
Asking Causal Questions 274
Confounders: A Dating Example 275
OK Cupid's Attempt 276
The Gold Standard: Randomized Clinical Trials 279
A/B Tests 280
Second Best: Observational Studies 283
Simpson's Paradox 283
The Rubin Causal Model 285
Visualizing Causality 286
Definition: The Causal Effect 287
Three Pieces of Advice 289
12. Epidemiology. 291
Madigan's Background 291
Thought Experiment 292
Modern Academic Statistics 293
Medical Literature and Observational Studies 293
Stratification Does Not Solve the Confounder Problem 294
What Do People Do About Confounding Things in
Practice? 295
Is There a Better Way? 296
Research Experiment (Observational Medical Outcomes
Partnership) 298
Closing Thought Experiment 303
13. Lessons Learned from Data Competitions: Data Leakage and Model
Evaluation. 305
Claudia's Data Scientist Profile 306
The Life of a Chief Data Scientist 306
On Being a Female Data Scientist 307
Data Mining Competitions 307
How to Be a Good Modeler 309
Data Leakage 309
Market Predictions 310
Amazon Case Study: Big Spenders 310
A Jewelry Sampling Problem 311
IBM Customer Targeting 311
Breast Cancer Detection 312
Pneumonia Prediction 313
How to Avoid Leakage 315
Evaluating Models 315
Accuracy: Meh 317
Probabilities Matter, Not 0s and 1s 317
Choosing an Algorithm 320
A Final Example 321
Parting Thoughts 322
14. Data Engineering: MapReduce, Pregel, and Hadoop. 323
About David Crawshaw 324
Thought Experiment 325
MapReduce 326
Word Frequency Problem 327
Enter MapReduce 330
Other Examples of MapReduce 332
What Can't MapReduce Do? 333
Pregel 333
About Josh Wills 334
Thought Experiment 334
On Being a Data Scientist 334
Data Abundance Versus Data Scarcity 335
Designing Models 335
Economic Interlude: Hadoop 335
A Brief Introduction to Hadoop 336
Cloudera 337
Back to Josh: Workflow 337
So How to Get Started with Hadoop? 338
15. The Students Speak. 339
Process Thinking 339
Naive No Longer 341
Helping Hands 342
Your Mileage May Vary 344
Bridging Tunnels 347
Some of Our Work 347
16. Next-Generation Data Scientists, Hubris, and Ethics. 349
What Just Happened? 349
What Is Data Science (Again)? 350
What Are Next-Gen Data Scientists? 352
Being Problem Solvers 352
Cultivating Soft Skills 353
Being Question Askers 354
Being an Ethical Data Scientist 356
Career Advice 361
Index. 363 |
any_adam_object | 1 |
author | O'Neil, Cathy 1972- Schutt, Rachel 1976- |
author_GND | (DE-588)104532714X (DE-588)1045327425 |
author_facet | O'Neil, Cathy 1972- Schutt, Rachel 1976- |
author_role | aut aut |
author_sort | O'Neil, Cathy 1972- |
author_variant | c o co r s rs |
building | Verbundindex |
bvnumber | BV041253517 |
classification_rvk | QH 232 QP 345 ST 530 |
ctrlnum | (OCoLC)864555597 (DE-599)DNB1036704041 |
dewey-full | 005.7 519.53 |
dewey-hundreds | 000 - Computer science, information, general works 500 - Natural sciences and mathematics |
dewey-ones | 005 - Computer programming, programs, data, security 519 - Probabilities and applied mathematics |
dewey-raw | 005.7 519.53 |
dewey-search | 005.7 519.53 |
dewey-sort | 15.7 |
dewey-tens | 000 - Computer science, information, general works 510 - Mathematics |
discipline | Computer science, Mathematics, Economics |
edition | 1. ed. |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>00000nam a2200000 c 4500</leader><controlfield tag="001">BV041253517</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20211018</controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">130903s2014 ad|| |||| 00||| eng d</controlfield><datafield tag="015" ind1=" " ind2=" "><subfield code="a">13,N28</subfield><subfield code="2">dnb</subfield></datafield><datafield tag="016" ind1="7" ind2=" "><subfield code="a">1036704041</subfield><subfield code="2">DE-101</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781449358655</subfield><subfield code="c">kart. : EUR 31.99 (DE) (freier Pr.), EUR 32.90 (AT) (freier Pr.), $ 54.99 (US), $ 72.99 (CAN)</subfield><subfield code="9">978-1-449-35865-5</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">1449358659</subfield><subfield code="9">1-449-35865-9</subfield></datafield><datafield tag="024" ind1="3" ind2=" "><subfield code="a">9781449358655</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)864555597</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)DNB1036704041</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rakddb</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-29T</subfield><subfield code="a">DE-11</subfield><subfield code="a">DE-945</subfield><subfield code="a">DE-355</subfield><subfield code="a">DE-20</subfield><subfield code="a">DE-1043</subfield><subfield code="a">DE-573</subfield><subfield code="a">DE-B768</subfield><subfield code="a">DE-860</subfield></datafield><datafield tag="082" ind1="0" ind2=" 
"><subfield code="a">005.7</subfield><subfield code="2">23/ger</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">519.53</subfield><subfield code="2">23/ger</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">QH 232</subfield><subfield code="0">(DE-625)141547:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">QP 345</subfield><subfield code="0">(DE-625)141866:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 530</subfield><subfield code="0">(DE-625)143679:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">004</subfield><subfield code="2">sdnb</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">510</subfield><subfield code="2">sdnb</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">O'Neil, Cathy</subfield><subfield code="d">1972-</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)104532714X</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Doing Data Science</subfield><subfield code="b">[straight talk from the frontline]</subfield><subfield code="c">Cathy O'Neil and Rachel Schutt</subfield></datafield><datafield tag="250" ind1=" " ind2=" "><subfield code="a">1. ed.</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Beijing [u.a.]</subfield><subfield code="b">O'Reilly</subfield><subfield code="c">2014</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XXV, 377 S.</subfield><subfield code="b">Ill., graph. 
Darst.</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Hier auch später erschienene, unveränderte Nachdrucke</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Big Data</subfield><subfield code="0">(DE-588)4802620-7</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Datenanalyse</subfield><subfield code="0">(DE-588)4123037-1</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Big Data</subfield><subfield code="0">(DE-588)4802620-7</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Datenanalyse</subfield><subfield code="0">(DE-588)4123037-1</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Schutt, Rachel</subfield><subfield code="d">1976-</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1045327425</subfield><subfield code="4">aut</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">X:MVB</subfield><subfield code="q">text/html</subfield><subfield code="u">http://deposit.dnb.de/cgi-bin/dokserv?id=4376836&prov=M&dok_var=1&dok_ext=htm</subfield><subfield code="3">Inhaltstext</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">HBZ Datenaustausch</subfield><subfield 
code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=026227509&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="943" ind1="1" ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-026227509</subfield></datafield></record></collection> |
id | DE-604.BV041253517 |
illustrated | Illustrated |
indexdate | 2024-08-03T00:54:25Z |
institution | BVB |
isbn | 9781449358655 1449358659 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-026227509 |
oclc_num | 864555597 |
open_access_boolean | |
owner | DE-29T DE-11 DE-945 DE-355 DE-BY-UBR DE-20 DE-1043 DE-573 DE-B768 DE-860 |
owner_facet | DE-29T DE-11 DE-945 DE-355 DE-BY-UBR DE-20 DE-1043 DE-573 DE-B768 DE-860 |
physical | XXV, 377 S. Ill., graph. Darst. |
publishDate | 2014 |
publishDateSearch | 2014 |
publishDateSort | 2014 |
publisher | O'Reilly |
record_format | marc |
spelling | O'Neil, Cathy 1972- Verfasser (DE-588)104532714X aut Doing Data Science [straight talk from the frontline] Cathy O'Neil and Rachel Schutt 1. ed. Beijing [u.a.] O'Reilly 2014 XXV, 377 S. Ill., graph. Darst. txt rdacontent n rdamedia nc rdacarrier Hier auch später erschienene, unveränderte Nachdrucke Big Data (DE-588)4802620-7 gnd rswk-swf Datenanalyse (DE-588)4123037-1 gnd rswk-swf Big Data (DE-588)4802620-7 s Datenanalyse (DE-588)4123037-1 s DE-604 Schutt, Rachel 1976- Verfasser (DE-588)1045327425 aut X:MVB text/html http://deposit.dnb.de/cgi-bin/dokserv?id=4376836&prov=M&dok_var=1&dok_ext=htm Inhaltstext HBZ Datenaustausch application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=026227509&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | O'Neil, Cathy 1972- Schutt, Rachel 1976- Doing Data Science [straight talk from the frontline] Big Data (DE-588)4802620-7 gnd Datenanalyse (DE-588)4123037-1 gnd |
subject_GND | (DE-588)4802620-7 (DE-588)4123037-1 |
title | Doing Data Science [straight talk from the frontline] |
title_auth | Doing Data Science [straight talk from the frontline] |
title_exact_search | Doing Data Science [straight talk from the frontline] |
title_full | Doing Data Science [straight talk from the frontline] Cathy O'Neil and Rachel Schutt |
title_fullStr | Doing Data Science [straight talk from the frontline] Cathy O'Neil and Rachel Schutt |
title_full_unstemmed | Doing Data Science [straight talk from the frontline] Cathy O'Neil and Rachel Schutt |
title_short | Doing Data Science |
title_sort | doing data science straight talk from the frontline |
title_sub | [straight talk from the frontline] |
topic | Big Data (DE-588)4802620-7 gnd Datenanalyse (DE-588)4123037-1 gnd |
topic_facet | Big Data Datenanalyse |
url | http://deposit.dnb.de/cgi-bin/dokserv?id=4376836&prov=M&dok_var=1&dok_ext=htm http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=026227509&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT oneilcathy doingdatasciencestraighttalkfromthefrontline AT schuttrachel doingdatasciencestraighttalkfromthefrontline |