Big data and social science: data science methods and tools for research and practice
Other authors: | Foster, Ian; Ghani, Rayid; Jarmin, Ronald S.; Kreuter, Frauke; Lane, Julia |
---|---|
Format: | Book |
Language: | English |
Published: | Boca Raton ; London ; New York : CRC Press, 2021 |
Edition: | Second edition |
Series: | Chapman & Hall/CRC statistics in the social and behavioral sciences series |
Subjects: | Social sciences ; Big data |
Online access: | Table of contents |
Description: | Includes bibliographical references and index |
Physical description: | xx, 391 pages : illustrations, diagrams |
ISBN: | 9780367568597 9780367341879 |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV047209515 | ||
003 | DE-604 | ||
005 | 20230103 | ||
007 | t | ||
008 | 210323s2021 xxua||| |||| 00||| eng d | ||
020 | |a 9780367568597 |c pbk |9 978-0-367-56859-7 | ||
020 | |a 9780367341879 |c hbk |9 978-0-367-34187-9 | ||
035 | |a (OCoLC)1226406716 | ||
035 | |a (DE-599)BVBBV047209515 | ||
040 | |a DE-604 |b ger |e rda | ||
041 | 0 | |a eng | |
044 | |a xxu |c US | ||
049 | |a DE-19 |a DE-N2 |a DE-739 | ||
050 | 0 | |a H61.3 | |
082 | 0 | |a 300.285/6312 |2 23 | |
084 | |a MR 2200 |0 (DE-625)123489: |2 rvk | ||
245 | 1 | 0 | |a Big data and social science |b data science methods and tools for research and practice |c edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Mannheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research) |
250 | |a Second edition | ||
264 | 1 | |a Boca Raton ; London ; New York |b CRC Press |c 2021 | |
300 | |a xx, 391 Seiten |b Illustrationen, Diagramme | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 0 | |a Chapman & Hall/CRC statistics in the social and behavioral sciences series | |
500 | |a Includes bibliographical references and index | ||
650 | 4 | |a Datenverarbeitung | |
650 | 4 | |a Sozialwissenschaften | |
650 | 4 | |a Social sciences |x Data processing | |
650 | 4 | |a Social sciences |x Statistical methods | |
650 | 4 | |a Data mining | |
650 | 4 | |a Big data | |
650 | 0 | 7 | |a Big Data |0 (DE-588)4802620-7 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Sozialwissenschaften |0 (DE-588)4055916-6 |2 gnd |9 rswk-swf |
655 | 7 | |0 (DE-588)4143413-4 |a Aufsatzsammlung |2 gnd-content | |
689 | 0 | 0 | |a Sozialwissenschaften |0 (DE-588)4055916-6 |D s |
689 | 0 | 1 | |a Big Data |0 (DE-588)4802620-7 |D s |
689 | 0 | |5 DE-604 | |
700 | 1 | |a Foster, Ian |d 1959- |0 (DE-588)122888529 |4 edt | |
700 | 1 | |a Ghani, Rayid |0 (DE-588)1206736933 |4 edt | |
700 | 1 | |a Jarmin, Ronald S. |d 1964- |0 (DE-588)124661262 |4 edt | |
700 | 1 | |a Kreuter, Frauke |0 (DE-588)1033254037 |4 edt | |
700 | 1 | |a Lane, Julia |d 1956- |0 (DE-588)129556807 |4 edt | |
776 | 0 | 8 | |i Erscheint auch als |n Online-Ausgabe |z 9780429324383 |
856 | 4 | 2 | |m Digitalisierung UB Passau - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=032614344&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-032614344 |
Record in search index
_version_ | 1804182322553028608 |
---|---|
adam_text | 1 Introduction
1.1 Why this book?
1.2 Defining big data and its value
1.3 The importance of inference
1.3.1 Description
1.3.2 Causation
1.3.3 Prediction
1.4 The importance of understanding how data are generated
1.5 New tools for new data
1.6 The book’s “use case”
1.7 The structure of the book
1.7.1 Part I: Capture and curation
1.7.2 Part II: Modeling and analysis
1.7.3 Part III: Inference and ethics
1.8 Resources
Part I Capture and Curation
Cameron Neylon
2.1 Introduction
2.2 Scraping information from the web
2.2.1 Obtaining data from websites
2.2.1.1 Constructing the URL
2.2.1.2 Obtaining the contents of the page from the URL
2.2.1.3 Processing the HTML response
2.2.2 Programmatically iterating over the search results
2.2.3 Limits of scraping
2.3 Application programming interfaces
2.3.1 Relevant APIs and resources
2.3.2 RESTful APIs, returned data, and Python wrappers
2.4 Using an API
2.5 Another example: Using the ORCID API via a wrapper
2.6 Integrating data from multiple sources
2.7 Summary
3 Record Linkage
Joshua Tokle and Stefan Bender
3.1 Motivation
3.2 Introduction to record linkage
3.3 Preprocessing data for record linkage
3.4 Indexing and blocking
3.5 Matching
3.5.1 Rule-based approaches
3.5.2 Probabilistic record linkage
3.5.3 Machine learning approaches to record linkage
3.5.4 Disambiguating networks
3.6 Classification
3.6.1 Thresholds
3.6.2 One-to-one links
3.7 Record linkage and data protection
3.8 Summary
3.9 Resources
Ian Foster and Pascal Heus
4.1 Introduction
4.2 The DBMS: When and why
4.3 Relational DBMSs
4.3.1 Structured Query Language
4.3.2 Manipulating and querying data
4.3.3 Schema design and definition
4.3.4 Loading data
4.3.5 Transactions and crash recovery
4.3.6 Database optimizations
4.3.7 Caveats and challenges
4.3.7.1 Data cleaning
4.3.7.2 Missing values
4.3.7.3 Metadata for categorical variables
4.4 Linking DBMSs and other tools
4.5 NoSQL databases
4.5.1 Challenges of scale: The CAP theorem
4.5.2 NoSQL and key-value stores
4.5.3 Other NoSQL databases
4.6 Spatial databases
4.7 Which database to use?
4.7.1 Relational DBMSs
4.7.2 NoSQL DBMSs
4.8 Summary
4.9 Resources
Huy Vo and Claudio Silva
5.1 Introduction
5.2 MapReduce
5.3 Apache Hadoop MapReduce
5.3.1 The Hadoop Distributed File System
5.3.2 Hadoop setup: Bringing compute to the data
5.3.3 Hardware provisioning
5.3.4 Programming in Hadoop
5.3.5 Programming language support
5.3.6 Benefits and limitations of Hadoop
5.4 Other MapReduce Implementations
5.5 Apache Spark
5.6 Summary
5.7 Resources
Part II Modeling and Analysis
M. Adil Yalçın and Catherine Plaisant
6.1 Introduction
6.2 Developing effective visualizations
6.3 A data-by-tasks taxonomy
6.3.1 Multivariate data
6.3.2 Spatial data
6.3.3 Temporal data
6.3.4 Hierarchical data
6.3.5 Network data
6.3.6 Text data
6.4 Challenges
6.4.1 Scalability
6.4.2 Evaluation
6.4.3 Visual impairment
6.4.4 Visual literacy
6.5 Summary
6.6 Resources
7 Machine Learning
Rayid Ghani and Malte Schierholz
7.1 Introduction
7.2 What is machine learning?
7.3 Types of analysis
7.4 The machine learning process
7.5 Problem formulation: Mapping a problem to machine learning methods
7.5.1 Features
7.6 Methods
7.6.1 Unsupervised learning methods
7.6.1.1 Clustering
7.6.1.2 The k-means clustering
7.6.1.3 Expectation-maximization clustering
7.6.1.4 Mean shift clustering
7.6.1.5 Hierarchical clustering
7.6.1.6 Spectral clustering
7.6.1.7 Principal components analysis
7.6.1.8 Association rules
7.6.2 Supervised learning
7.6.2.1 Training a model
7.6.2.2 Using the model to score new data
7.6.2.3 The k-nearest neighbor
7.6.2.4 Support vector machines
7.6.2.5 Decision trees
7.6.2.6 Ensemble methods
7.6.2.7 Bagging
7.6.2.8 Boosting
7.6.2.9 Random forests
7.6.2.10 Stacking
7.6.2.11 Neural networks and deep learning
7.6.3 Binary vs. multiclass classification problems
7.6.4 Skewed or imbalanced classification problems
7.6.5 Model interpretability
7.6.5.1 Global vs. individual-level explanations
7.7 Evaluation
7.7.1 Methodology
7.7.1.1 In-sample evaluation
7.7.1.2 Out-of-sample and holdout set
7.7.1.3 Cross-validation
7.7.1.4 Temporal validation
7.7.2 Metrics
7.8 Practical tips
7.8.1 Avoiding leakage
7.8.2 Machine learning pipeline
7.9 How can social scientists benefit from machine learning?
7.10 Advanced topics
7.11 Summary
7.12 Resources
8 Text Analysis
Evgeny Klochikhin and Jordan Boyd-Graber
8.1 Understanding human-generated text
8.2 How is text data different than “structured” data?
8.3 What can we do with text data?
8.4 How to analyze text
8.4.1 Initial processing
8.4.1.1 Tokenization
8.4.1.2 Stop words
8.4.1.3 The N-grams
8.4.1.4 Stemming and lemmatization
8.4.2 Linguistic analysis
8.4.2.1 Part-of-speech tagging
8.4.2.2 Order matters
8.4.3 Turning text data into a matrix: How much is a word worth?
8.4.4 Analysis
8.4.4.1 Use case: Finding similar documents
8.4.4.2 Example: Measuring similarity between documents
8.4.4.3 Example code
8.4.4.4 Augmenting similarity calculations with external knowledge repositories
8.4.4.5 Evaluating “find similar” methods
8.4.4.6 The F score
8.4.4.7 Examples
8.4.4.8 Use case: Clustering
8.4.5 Topic modeling
8.4.5.1 Inferring “topics” from raw text
8.4.5.2 Applications of topic models
8.5 Word embeddings and deep learning
8.6 Text analysis tools
8.6.1 The natural language toolkit
8.6.2 Stanford CoreNLP
8.6.3 The MALLET
8.6.3.1 Spacy.io
8.6.3.2 Pytorch
8.7 Summary
8.8 Resources
9 Networks: The Basics
Jason Owen-Smith
9.1 Introduction
9.2 What are networks?
9.3 Structure for this chapter
9.4 Turning data into a network
9.4.1 Types of networks
9.4.2 Inducing one-mode networks from two-mode data
9.5 Network measures
9.5.1 Reachability
9.5.2 Whole-network measures
9.5.2.1 Components and reachability
9.5.2.2 Path length
9.5.2.3 Degree distribution
9.5.2.4 Clustering coefficient
9.5.2.5 Centrality measures
9.6 Case study: Comparing collaboration networks
9.7 Summary
9.8 Resources
Part III Inference and Ethics
Paul P. Biemer
10.1 Introduction
10.2 The total error paradigm
10.2.1 The traditional model
10.2.1.1 Types of errors
10.2.1.2 Column error
10.2.1.3 Cell errors
10.3 Example: Google Flu Trends
10.4 Errors in data analysis
10.4.1 Analysis errors despite accurate data
10.4.2 Noise accumulation
10.4.3 Spurious correlations
10.4.4 Incidental endogeneity
10.4.5 Analysis errors resulting from inaccurate data
10.4.5.1 Variable (uncorrelated) and correlated error in continuous variables
10.4.5.2 Extending variable and correlated error to categorical data
10.4.5.3 Errors when analyzing rare population groups
10.4.5.4 Errors in correlation analysis
10.4.5.5 Errors in regression analysis
10.5 Detecting and compensating for data errors
10.5.1 TablePlots
10.6 Summary
10.7 Resources
Kit T. Rodolfa, Pedro Saleiro, and Rayid Ghani
11.1 Introduction
11.2 Sources of bias
11.2.1 Sample bias
11.2.2 Label (outcome) bias
11.2.3 Machine learning pipeline bias
11.2.4 Application bias
11.2.5 Considering bias when deploying your model
11.3 Dealing with bias
11.3.1 Define bias
11.3.2 Definitions
11.3.3 Choosing bias metrics
11.3.4 Punitive example
11.3.4.1 Count of false positives
11.3.4.2 Group size-adjusted false positives
11.3.4.3 False discovery rate
11.3.4.4 False positive rate
11.3.4.5 Tradeoffs in metric choice
11.3.5 Assistive example
11.3.5.1 Count of false negatives
11.3.5.2 Group size-adjusted false negatives
11.3.5.3 False omission rate
11.3.5.4 False negative rate
11.3.6 Special case: Resource-constrained programs
11.4 Mitigating bias
11.4.1 Auditing model results
11.4.2 Model selection
11.4.3 Other options for mitigating bias
11.5 Further considerations
11.5.1 Compared to what?
11.5.2 Costs to both errors
11.5.3 What is the relevant population?
11.5.4 Continuous outcomes
11.5.5 Considerations for ongoing measurement
11.5.6 Equity in practice
11.5.7 Additional terms for metrics
11.6 Case studies
11.6.1 Recidivism predictions with COMPAS
11.6.2 Facial recognition
11.6.3 Facebook advertisement targeting
11.7 Aequitas: A toolkit for auditing bias and fairness in machine learning models
11.7.1 Aequitas in the larger context of the machine learning pipeline
12 Privacy and Confidentiality
Stefan Bender, Ron S. Jarmin, Frauke Kreuter, and Julia Lane
12.1 Introduction
12.2 Why is access important?
12.2.1 Validating the data-generating process
12.2.2 Replication
12.2.3 Building knowledge infrastructure
12.3 Providing access
12.3.1 Statistical disclosure control techniques
12.3.2 Research data centers
12.4 Non-tabular data
12.5 The new challenges
12.6 Legal and ethical framework
12.7 Summary
12.8 Resources
13 Workbooks
Brian Kim, Christoph Kern, Jonathan Scott Morgan, Clayton Hunter, and Avishek Kumar
13.1 Introduction
13.2 Notebooks
13.2.1 Databases
13.2.2 Dataset Exploration and Visualization
13.2.3 APIs
13.2.4 Record Linkage
13.2.5 Text Analysis
13.2.6 Networks
13.2.7 Machine Learning—Creating Labels
13.2.8 Machine Learning—Creating Features
13.2.9 Machine Learning—Model Training and Evaluation
13.2.10 Bias and Fairness
13.2.11 Errors and Inference
13.2.12 Additional workbooks
13.3 Resources
|
adam_txt |
1 Introduction
1.1 Why this book?
1.2 Defining big data and its value
1.3 The importance of inference
1.3.1 Description
1.3.2 Causation
1.3.3 Prediction
1.4 The importance of understanding how data are generated
1.5 New tools for new data
1.6 The book’s “use case”
1.7 The structure of the book
1.7.1 Part I: Capture and curation
1.7.2 Part II: Modeling and analysis
1.7.3 Part III: Inference and ethics
1.8 Resources
Part I Capture and Curation
2 Working with Web Data and APIs
Cameron Neylon
2.1 Introduction
2.2 Scraping information from the web
2.2.1 Obtaining data from websites
2.2.1.1 Constructing the URL
2.2.1.2 Obtaining the contents of the page from the URL
2.2.1.3 Processing the HTML response
2.2.2 Programmatically iterating over the search results
2.2.3 Limits of scraping
2.3 Application programming interfaces
2.3.1 Relevant APIs and resources
2.3.2 RESTful APIs, returned data, and Python wrappers
2.4 Using an API
2.5 Another example: Using the ORCID API via a wrapper
2.6 Integrating data from multiple sources
2.7 Summary
3 Record Linkage
Joshua Tokle and Stefan Bender
3.1 Motivation
3.2 Introduction to record linkage
3.3 Preprocessing data for record linkage
3.4 Indexing and blocking
3.5 Matching
3.5.1 Rule-based approaches
3.5.2 Probabilistic record linkage
3.5.3 Machine learning approaches to record linkage
3.5.4 Disambiguating networks
3.6 Classification
3.6.1 Thresholds
3.6.2 One-to-one links
3.7 Record linkage and data protection
3.8 Summary
3.9 Resources
4 Databases
Ian Foster and Pascal Heus
4.1 Introduction
4.2 The DBMS: When and why
4.3 Relational DBMSs
4.3.1 Structured Query Language
4.3.2 Manipulating and querying data
4.3.3 Schema design and definition
4.3.4 Loading data
4.3.5 Transactions and crash recovery
4.3.6 Database optimizations
4.3.7 Caveats and challenges
4.3.7.1 Data cleaning
4.3.7.2 Missing values
4.3.7.3 Metadata for categorical variables
4.4 Linking DBMSs and other tools
4.5 NoSQL databases
4.5.1 Challenges of scale: The CAP theorem
4.5.2 NoSQL and key-value stores
4.5.3 Other NoSQL databases
4.6 Spatial databases
4.7 Which database to use?
4.7.1 Relational DBMSs
4.7.2 NoSQL DBMSs
4.8 Summary
4.9 Resources
5 Scaling up through Parallel and Distributed Computing
Huy Vo and Claudio Silva
5.1 Introduction
5.2 MapReduce
5.3 Apache Hadoop MapReduce
5.3.1 The Hadoop Distributed File System
5.3.2 Hadoop setup: Bringing compute to the data
5.3.3 Hardware provisioning
5.3.4 Programming in Hadoop
5.3.5 Programming language support
5.3.6 Benefits and limitations of Hadoop
5.4 Other MapReduce implementations
5.5 Apache Spark
5.6 Summary
5.7 Resources
Part II Modeling and Analysis
6 Information Visualization
M. Adil Yalçın and Catherine Plaisant
6.1 Introduction
6.2 Developing effective visualizations
6.3 A data-by-tasks taxonomy
6.3.1 Multivariate data
6.3.2 Spatial data
6.3.3 Temporal data
6.3.4 Hierarchical data
6.3.5 Network data
6.3.6 Text data
6.4 Challenges
6.4.1 Scalability
6.4.2 Evaluation
6.4.3 Visual impairment
6.4.4 Visual literacy
6.5 Summary
6.6 Resources
7 Machine Learning
Rayid Ghani and Malte Schierholz
7.1 Introduction
7.2 What is machine learning?
7.3 Types of analysis
7.4 The machine learning process
7.5 Problem formulation: Mapping a problem to machine learning methods
7.5.1 Features
7.6 Methods
7.6.1 Unsupervised learning methods
7.6.1.1 Clustering
7.6.1.2 The k-means clustering
7.6.1.3 Expectation-maximization clustering
7.6.1.4 Mean shift clustering
7.6.1.5 Hierarchical clustering
7.6.1.6 Spectral clustering
7.6.1.7 Principal components analysis
7.6.1.8 Association rules
7.6.2 Supervised learning
7.6.2.1 Training a model
7.6.2.2 Using the model to score new data
7.6.2.3 The k-nearest neighbor
7.6.2.4 Support vector machines
7.6.2.5 Decision trees
7.6.2.6 Ensemble methods
7.6.2.7 Bagging
7.6.2.8 Boosting
7.6.2.9 Random forests
7.6.2.10 Stacking
7.6.2.11 Neural networks and deep learning
7.6.3 Binary vs. multiclass classification problems
7.6.4 Skewed or imbalanced classification problems
7.6.5 Model interpretability
7.6.5.1 Global vs. individual-level explanations
7.7 Evaluation
7.7.1 Methodology
7.7.1.1 In-sample evaluation
7.7.1.2 Out-of-sample and holdout set
7.7.1.3 Cross-validation
7.7.1.4 Temporal validation
7.7.2 Metrics
7.8 Practical tips
7.8.1 Avoiding leakage
7.8.2 Machine learning pipeline
7.9 How can social scientists benefit from machine learning?
7.10 Advanced topics
7.11 Summary
7.12 Resources
8 Text Analysis
Evgeny Klochikhin and Jordan Boyd-Graber
8.1 Understanding human-generated text
8.2 How is text data different than “structured” data?
8.3 What can we do with text data?
8.4 How to analyze text
8.4.1 Initial processing
8.4.1.1 Tokenization
8.4.1.2 Stop words
8.4.1.3 The N-grams
8.4.1.4 Stemming and lemmatization
8.4.2 Linguistic analysis
8.4.2.1 Part-of-speech tagging
8.4.2.2 Order matters
8.4.3 Turning text data into a matrix: How much is a word worth?
8.4.4 Analysis
8.4.4.1 Use case: Finding similar documents
8.4.4.2 Example: Measuring similarity between documents
8.4.4.3 Example code
8.4.4.4 Augmenting similarity calculations with external knowledge repositories
8.4.4.5 Evaluating “find similar” methods
8.4.4.6 The F score
8.4.4.7 Examples
8.4.4.8 Use case: Clustering
8.4.5 Topic modeling
8.4.5.1 Inferring “topics” from raw text
8.4.5.2 Applications of topic models
8.5 Word embeddings and deep learning
8.6 Text analysis tools
8.6.1 The natural language toolkit
8.6.2 Stanford CoreNLP
8.6.3 The MALLET
8.6.3.1 Spacy.io
8.6.3.2 Pytorch
8.7 Summary
8.8 Resources
9 Networks: The Basics
Jason Owen-Smith
9.1 Introduction
9.2 What are networks?
9.3 Structure for this chapter
9.4 Turning data into a network
9.4.1 Types of networks
9.4.2 Inducing one-mode networks from two-mode data
9.5 Network measures
9.5.1 Reachability
9.5.2 Whole-network measures
9.5.2.1 Components and reachability
9.5.2.2 Path length
9.5.2.3 Degree distribution
9.5.2.4 Clustering coefficient
9.5.2.5 Centrality measures
9.6 Case study: Comparing collaboration networks
9.7 Summary
9.8 Resources
Part III Inference and Ethics
10 Data Quality and Inference Errors
Paul P. Biemer
10.1 Introduction
10.2 The total error paradigm
10.2.1 The traditional model
10.2.1.1 Types of errors
10.2.1.2 Column error
10.2.1.3 Cell errors
10.3 Example: Google Flu Trends
10.4 Errors in data analysis
10.4.1 Analysis errors despite accurate data
10.4.2 Noise accumulation
10.4.3 Spurious correlations
10.4.4 Incidental endogeneity
10.4.5 Analysis errors resulting from inaccurate data
10.4.5.1 Variable (uncorrelated) and correlated error in continuous variables
10.4.5.2 Extending variable and correlated error to categorical data
10.4.5.3 Errors when analyzing rare population groups
10.4.5.4 Errors in correlation analysis
10.4.5.5 Errors in regression analysis
10.5 Detecting and compensating for data errors
10.5.1 TablePlots
10.6 Summary
10.7 Resources
11 Bias and Fairness
Kit T. Rodolfa, Pedro Saleiro, and Rayid Ghani
11.1 Introduction
11.2 Sources of bias
11.2.1 Sample bias
11.2.2 Label (outcome) bias
11.2.3 Machine learning pipeline bias
11.2.4 Application bias
11.2.5 Considering bias when deploying your model
11.3 Dealing with bias
11.3.1 Define bias
11.3.2 Definitions
11.3.3 Choosing bias metrics
11.3.4 Punitive example
11.3.4.1 Count of false positives
11.3.4.2 Group size-adjusted false positives
11.3.4.3 False discovery rate
11.3.4.4 False positive rate
11.3.4.5 Tradeoffs in metric choice
11.3.5 Assistive example
11.3.5.1 Count of false negatives
11.3.5.2 Group size-adjusted false negatives
11.3.5.3 False omission rate
11.3.5.4 False negative rate
11.3.6 Special case: Resource-constrained programs
11.4 Mitigating bias
11.4.1 Auditing model results
11.4.2 Model selection
11.4.3 Other options for mitigating bias
11.5 Further considerations
11.5.1 Compared to what?
11.5.2 Costs to both errors
11.5.3 What is the relevant population?
11.5.4 Continuous outcomes
11.5.5 Considerations for ongoing measurement
11.5.6 Equity in practice
11.5.7 Additional terms for metrics
11.6 Case studies
11.6.1 Recidivism predictions with COMPAS
11.6.2 Facial recognition
11.6.3 Facebook advertisement targeting
11.7 Aequitas: A toolkit for auditing bias and fairness in machine learning models
11.7.1 Aequitas in the larger context of the machine learning pipeline
12 Privacy and Confidentiality
Stefan Bender, Ron S. Jarmin, Frauke Kreuter, and Julia Lane
12.1 Introduction
12.2 Why is access important?
12.2.1 Validating the data-generating process
12.2.2 Replication
12.2.3 Building knowledge infrastructure
12.3 Providing access
12.3.1 Statistical disclosure control techniques
12.3.2 Research data centers
12.4 Non-tabular data
12.5 The new challenges
12.6 Legal and ethical framework
12.7 Summary
12.8 Resources
13 Workbooks
Brian Kim, Christoph Kern, Jonathan Scott Morgan, Clayton Hunter, and Avishek Kumar
13.1 Introduction
13.2 Notebooks
13.2.1 Databases
13.2.2 Dataset Exploration and Visualization
13.2.3 APIs
13.2.4 Record Linkage
13.2.5 Text Analysis
13.2.6 Networks
13.2.7 Machine Learning—Creating Labels
13.2.8 Machine Learning—Creating Features
13.2.9 Machine Learning—Model Training and Evaluation
13.2.10 Bias and Fairness
13.2.11 Errors and Inference
13.2.12 Additional workbooks
13.3 Resources |
any_adam_object | 1 |
any_adam_object_boolean | 1 |
author2 | Foster, Ian 1959- Ghani, Rayid Jarmin, Ronald S. 1964- Kreuter, Frauke Lane, Julia 1956- |
author2_role | edt edt edt edt edt |
author2_variant | i f if r g rg r s j rs rsj f k fk j l jl |
author_GND | (DE-588)122888529 (DE-588)1206736933 (DE-588)124661262 (DE-588)1033254037 (DE-588)129556807 |
author_facet | Foster, Ian 1959- Ghani, Rayid Jarmin, Ronald S. 1964- Kreuter, Frauke Lane, Julia 1956- |
building | Verbundindex |
bvnumber | BV047209515 |
callnumber-first | H - Social Science |
callnumber-label | H61 |
callnumber-raw | H61.3 |
callnumber-search | H61.3 |
callnumber-sort | H 261.3 |
callnumber-subject | H - Social Science |
classification_rvk | MR 2200 |
ctrlnum | (OCoLC)1226406716 (DE-599)BVBBV047209515 |
dewey-full | 300.285/6312 |
dewey-hundreds | 300 - Social sciences |
dewey-ones | 300 - Social sciences |
dewey-raw | 300.285/6312 |
dewey-search | 300.285/6312 |
dewey-sort | 3300.285 46312 |
dewey-tens | 300 - Social sciences |
discipline | Soziologie |
discipline_str_mv | Soziologie |
edition | Second edition |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>02617nam a2200565 c 4500</leader><controlfield tag="001">BV047209515</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20230103 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">210323s2021 xxua||| |||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9780367568597</subfield><subfield code="c">pbk</subfield><subfield code="9">978-0-367-56859-7</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9780367341879</subfield><subfield code="c">hbk</subfield><subfield code="9">978-0-367-34187-9</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1226406716</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV047209515</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rda</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="044" ind1=" " ind2=" "><subfield code="a">xxu</subfield><subfield code="c">US</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-19</subfield><subfield code="a">DE-N2</subfield><subfield code="a">DE-739</subfield></datafield><datafield tag="050" ind1=" " ind2="0"><subfield code="a">H61.3</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">300.285/6312</subfield><subfield code="2">23</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">MR 2200</subfield><subfield code="0">(DE-625)123489:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Big data and social science</subfield><subfield code="b">data science methods and tools for 
research and practice</subfield><subfield code="c">edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Manheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research)</subfield></datafield><datafield tag="250" ind1=" " ind2=" "><subfield code="a">Second edition</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Boca Raton ; London ; New York</subfield><subfield code="b">CRC Press</subfield><subfield code="c">2021</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">xx, 391 Seiten</subfield><subfield code="b">Illustrationen, Diagramme</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="0" ind2=" "><subfield code="a">Chapman & Hall/CRC statistics in the social and behavioral sciences series</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Includes bibliographical references and index</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Datenverarbeitung</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Sozialwissenschaften</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Social sciences</subfield><subfield code="x">Data processing</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Social sciences</subfield><subfield code="x">Statistical methods</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield 
code="a">Data mining</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Big data</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Big Data</subfield><subfield code="0">(DE-588)4802620-7</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Sozialwissenschaften</subfield><subfield code="0">(DE-588)4055916-6</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="655" ind1=" " ind2="7"><subfield code="0">(DE-588)4143413-4</subfield><subfield code="a">Aufsatzsammlung</subfield><subfield code="2">gnd-content</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Sozialwissenschaften</subfield><subfield code="0">(DE-588)4055916-6</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Big Data</subfield><subfield code="0">(DE-588)4802620-7</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Foster, Ian</subfield><subfield code="d">1959-</subfield><subfield code="0">(DE-588)122888529</subfield><subfield code="4">edt</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Ghani, Rayid</subfield><subfield code="0">(DE-588)1206736933</subfield><subfield code="4">edt</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Jarmin, Ronald S.</subfield><subfield code="d">1964-</subfield><subfield code="0">(DE-588)124661262</subfield><subfield code="4">edt</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Kreuter, Frauke</subfield><subfield code="0">(DE-588)1033254037</subfield><subfield code="4">edt</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield 
code="a">Lane, Julia</subfield><subfield code="d">1956-</subfield><subfield code="0">(DE-588)129556807</subfield><subfield code="4">edt</subfield></datafield><datafield tag="776" ind1="0" ind2="8"><subfield code="i">Erscheint auch als</subfield><subfield code="n">Online-Ausgabe</subfield><subfield code="z">9780429324383</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Passau - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=032614344&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-032614344</subfield></datafield></record></collection> |
genre | (DE-588)4143413-4 Aufsatzsammlung gnd-content |
genre_facet | Aufsatzsammlung |
id | DE-604.BV047209515 |
illustrated | Illustrated |
index_date | 2024-07-03T16:53:52Z |
indexdate | 2024-07-10T09:05:43Z |
institution | BVB |
isbn | 9780367568597 9780367341879 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-032614344 |
oclc_num | 1226406716 |
open_access_boolean | |
owner | DE-19 DE-BY-UBM DE-N2 DE-739 |
owner_facet | DE-19 DE-BY-UBM DE-N2 DE-739 |
physical | xx, 391 Seiten Illustrationen, Diagramme |
publishDate | 2021 |
publishDateSearch | 2021 |
publishDateSort | 2021 |
publisher | CRC Press |
record_format | marc |
series2 | Chapman & Hall/CRC statistics in the social and behavioral sciences series |
spelling | Big data and social science data science methods and tools for research and practice edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Manheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research) Second edition Boca Raton ; London ; New York CRC Press 2021 xx, 391 Seiten Illustrationen, Diagramme txt rdacontent n rdamedia nc rdacarrier Chapman & Hall/CRC statistics in the social and behavioral sciences series Includes bibliographical references and index Datenverarbeitung Sozialwissenschaften Social sciences Data processing Social sciences Statistical methods Data mining Big data Big Data (DE-588)4802620-7 gnd rswk-swf Sozialwissenschaften (DE-588)4055916-6 gnd rswk-swf (DE-588)4143413-4 Aufsatzsammlung gnd-content Sozialwissenschaften (DE-588)4055916-6 s Big Data (DE-588)4802620-7 s DE-604 Foster, Ian 1959- (DE-588)122888529 edt Ghani, Rayid (DE-588)1206736933 edt Jarmin, Ronald S. 1964- (DE-588)124661262 edt Kreuter, Frauke (DE-588)1033254037 edt Lane, Julia 1956- (DE-588)129556807 edt Erscheint auch als Online-Ausgabe 9780429324383 Digitalisierung UB Passau - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=032614344&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Big data and social science data science methods and tools for research and practice Datenverarbeitung Sozialwissenschaften Social sciences Data processing Social sciences Statistical methods Data mining Big data Big Data (DE-588)4802620-7 gnd Sozialwissenschaften (DE-588)4055916-6 gnd |
subject_GND | (DE-588)4802620-7 (DE-588)4055916-6 (DE-588)4143413-4 |
title | Big data and social science data science methods and tools for research and practice |
title_auth | Big data and social science data science methods and tools for research and practice |
title_exact_search | Big data and social science data science methods and tools for research and practice |
title_exact_search_txtP | Big data and social science data science methods and tools for research and practice |
title_full | Big data and social science data science methods and tools for research and practice edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Manheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research) |
title_fullStr | Big data and social science data science methods and tools for research and practice edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Manheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research) |
title_full_unstemmed | Big data and social science data science methods and tools for research and practice edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Manheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research) |
title_short | Big data and social science |
title_sort | big data and social science data science methods and tools for research and practice |
title_sub | data science methods and tools for research and practice |
topic | Datenverarbeitung Sozialwissenschaften Social sciences Data processing Social sciences Statistical methods Data mining Big data Big Data (DE-588)4802620-7 gnd Sozialwissenschaften (DE-588)4055916-6 gnd |
topic_facet | Datenverarbeitung Sozialwissenschaften Social sciences Data processing Social sciences Statistical methods Data mining Big data Big Data Aufsatzsammlung |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=032614344&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT fosterian bigdataandsocialsciencedatasciencemethodsandtoolsforresearchandpractice AT ghanirayid bigdataandsocialsciencedatasciencemethodsandtoolsforresearchandpractice AT jarminronalds bigdataandsocialsciencedatasciencemethodsandtoolsforresearchandpractice AT kreuterfrauke bigdataandsocialsciencedatasciencemethodsandtoolsforresearchandpractice AT lanejulia bigdataandsocialsciencedatasciencemethodsandtoolsforresearchandpractice |