Availability: Big data and social science

Big data and social science: data science methods and tools for research and practice

Bibliographic Details
Other Authors:	Foster, Ian, 1959- (Editor); Ghani, Rayid (Editor); Jarmin, Ronald S., 1964- (Editor); Kreuter, Frauke (Editor); Lane, Julia, 1956- (Editor)
Format:	Book
Language:	English
Published:	Boca Raton ; London ; New York : CRC Press, 2021
Edition:	Second edition
Series:	Chapman & Hall/CRC statistics in the social and behavioral sciences series
Subjects:	Datenverarbeitung (data processing); Sozialwissenschaften (social sciences); Social sciences > Data processing; Social sciences > Statistical methods; Data mining; Big data; Big Data; Aufsatzsammlung (collection of essays)
Online Access:	Table of contents
Description:	Includes bibliographical references and index
Physical Description:	xx, 391 pages : illustrations, diagrams
ISBN:	9780367568597 9780367341879

Internal Format

MARC


LEADER	00000nam a2200000 c 4500
001	BV047209515
003	DE-604
005	20230103
007	t
008	210323s2021 xxua||| |||| 00||| eng d
020			|a 9780367568597 |c pbk |9 978-0-367-56859-7
020			|a 9780367341879 |c hbk |9 978-0-367-34187-9
035			|a (OCoLC)1226406716
035			|a (DE-599)BVBBV047209515
040			|a DE-604 |b ger |e rda
041	0		|a eng
044			|a xxu |c US
049			|a DE-19 |a DE-N2 |a DE-739
050		0	|a H61.3
082	0		|a 300.285/6312 |2 23
084			|a MR 2200 |0 (DE-625)123489: |2 rvk
245	1	0	|a Big data and social science |b data science methods and tools for research and practice |c edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Mannheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research)
250			|a Second edition
264		1	|a Boca Raton ; London ; New York |b CRC Press |c 2021
300			|a xx, 391 Seiten |b Illustrationen, Diagramme
336			|b txt |2 rdacontent
337			|b n |2 rdamedia
338			|b nc |2 rdacarrier
490	0		|a Chapman & Hall/CRC statistics in the social and behavioral sciences series
500			|a Includes bibliographical references and index
650		4	|a Datenverarbeitung
650		4	|a Sozialwissenschaften
650		4	|a Social sciences |x Data processing
650		4	|a Social sciences |x Statistical methods
650		4	|a Data mining
650		4	|a Big data
650	0	7	|a Big Data |0 (DE-588)4802620-7 |2 gnd |9 rswk-swf
650	0	7	|a Sozialwissenschaften |0 (DE-588)4055916-6 |2 gnd |9 rswk-swf
655		7	|0 (DE-588)4143413-4 |a Aufsatzsammlung |2 gnd-content
689	0	0	|a Sozialwissenschaften |0 (DE-588)4055916-6 |D s
689	0	1	|a Big Data |0 (DE-588)4802620-7 |D s
689	0		|5 DE-604
700	1		|a Foster, Ian |d 1959- |0 (DE-588)122888529 |4 edt
700	1		|a Ghani, Rayid |0 (DE-588)1206736933 |4 edt
700	1		|a Jarmin, Ronald S. |d 1964- |0 (DE-588)124661262 |4 edt
700	1		|a Kreuter, Frauke |0 (DE-588)1033254037 |4 edt
700	1		|a Lane, Julia |d 1956- |0 (DE-588)129556807 |4 edt
776	0	8	|i Erscheint auch als |n Online-Ausgabe |z 9780429324383
856	4	2	|m Digitalisierung UB Passau - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=032614344&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis
999			|a oai:aleph.bib-bvb.de:BVB01-032614344

Record in the Search Index

_version_	1804182322553028608
adam_text	Chapter 1: Introduction
1.1 Why this book?
1.2 Defining big data and its value
1.3 The importance of inference
1.3.1 Description
1.3.2 Causation
1.3.3 Prediction
1.4 The importance of understanding how data are generated
1.5 New tools for new data
1.6 The book’s “use case”
1.7 The structure of the book
1.7.1 Part I: Capture and curation
1.7.2 Part II: Modeling and analysis
1.7.3 Part III: Inference and ethics
1.8 Resources

Part I: Capture and Curation

Chapter 2 (Cameron Neylon)
2.1 Introduction
2.2 Scraping information from the web
2.2.1 Obtaining data from websites
2.2.1.1 Constructing the URL
2.2.1.2 Obtaining the contents of the page from the URL
2.2.1.3 Processing the HTML response
2.2.2 Programmatically iterating over the search results
2.2.3 Limits of scraping
2.3 Application programming interfaces
2.3.1 Relevant APIs and resources
2.3.2 RESTful APIs, returned data, and Python wrappers
2.4 Using an API
2.5 Another example: Using the ORCID API via a wrapper
2.6 Integrating data from multiple sources
2.7 Summary

Chapter 3: Record Linkage (Joshua Tokle and Stefan Bender)
3.1 Motivation
3.2 Introduction to record linkage
3.3 Preprocessing data for record linkage
3.4 Indexing and blocking
3.5 Matching
3.5.1 Rule-based approaches
3.5.2 Probabilistic record linkage
3.5.3 Machine learning approaches to record linkage
3.5.4 Disambiguating networks
3.6 Classification
3.6.1 Thresholds
3.6.2 One-to-one links
3.7 Record linkage and data protection
3.8 Summary
3.9 Resources

Chapter 4 (Ian Foster and Pascal Heus)
4.1 Introduction
4.2 The DBMS: When and why
4.3 Relational DBMSs
4.3.1 Structured Query Language
4.3.2 Manipulating and querying data
4.3.3 Schema design and definition
4.3.4 Loading data
4.3.5 Transactions and crash recovery
4.3.6 Database optimizations
4.3.7 Caveats and challenges
4.3.7.1 Data cleaning
4.3.7.2 Missing values
4.3.7.3 Metadata for categorical variables
4.4 Linking DBMSs and other tools
4.5 NoSQL databases
4.5.1 Challenges of scale: The CAP theorem
4.5.2 NoSQL and key-value stores
4.5.3 Other NoSQL databases
4.6 Spatial databases
4.7 Which database to use?
4.7.1 Relational DBMSs
4.7.2 NoSQL DBMSs
4.8 Summary
4.9 Resources

Chapter 5 (Huy Vo and Claudio Silva)
5.1 Introduction
5.2 MapReduce
5.3 Apache Hadoop MapReduce
5.3.1 The Hadoop Distributed File System
5.3.2 Hadoop setup: Bringing compute to the data
5.3.3 Hardware provisioning
5.3.4 Programming in Hadoop
5.3.5 Programming language support
5.3.6 Benefits and limitations of Hadoop
5.4 Other MapReduce Implementations
5.5 Apache Spark
5.6 Summary
5.7 Resources

Part II: Modeling and Analysis

Chapter 6 (M. Adil Yalçın and Catherine Plaisant)
6.1 Introduction
6.2 Developing effective visualizations
6.3 A data-by-tasks taxonomy
6.3.1 Multivariate data
6.3.2 Spatial data
6.3.3 Temporal data
6.3.4 Hierarchical data
6.3.5 Network data
6.3.6 Text data
6.4 Challenges
6.4.1 Scalability
6.4.2 Evaluation
6.4.3 Visual impairment
6.4.4 Visual literacy
6.5 Summary
6.6 Resources

Chapter 7: Machine Learning (Rayid Ghani and Malte Schierholz)
7.1 Introduction
7.2 What is machine learning?
7.3 Types of analysis
7.4 The machine learning process
7.5 Problem formulation: Mapping a problem to machine learning methods
7.5.1 Features
7.6 Methods
7.6.1 Unsupervised learning methods
7.6.1.1 Clustering
7.6.1.2 The k-means clustering
7.6.1.3 Expectation-maximization clustering
7.6.1.4 Mean shift clustering
7.6.1.5 Hierarchical clustering
7.6.1.6 Spectral clustering
7.6.1.7 Principal components analysis
7.6.1.8 Association rules
7.6.2 Supervised learning
7.6.2.1 Training a model
7.6.2.2 Using the model to score new data
7.6.2.3 The k-nearest neighbor
7.6.2.4 Support vector machines
7.6.2.5 Decision trees
7.6.2.6 Ensemble methods
7.6.2.7 Bagging
7.6.2.8 Boosting
7.6.2.9 Random forests
7.6.2.10 Stacking
7.6.2.11 Neural networks and deep learning
7.6.3 Binary vs. multiclass classification problems
7.6.4 Skewed or imbalanced classification problems
7.6.5 Model interpretability
7.6.5.1 Global vs. individual-level explanations
7.7 Evaluation
7.7.1 Methodology
7.7.1.1 In-sample evaluation
7.7.1.2 Out-of-sample and holdout set
7.7.1.3 Cross-validation
7.7.1.4 Temporal validation
7.7.2 Metrics
7.8 Practical tips
7.8.1 Avoiding leakage
7.8.2 Machine learning pipeline
7.9 How can social scientists benefit from machine learning?
7.10 Advanced topics
7.11 Summary
7.12 Resources

Chapter 8: Text Analysis (Evgeny Klochikhin and Jordan Boyd-Graber)
8.1 Understanding human-generated text
8.2 How is text data different than “structured” data?
8.3 What can we do with text data?
8.4 How to analyze text
8.4.1 Initial processing
8.4.1.1 Tokenization
8.4.1.2 Stop words
8.4.1.3 The N-grams
8.4.1.4 Stemming and lemmatization
8.4.2 Linguistic analysis
8.4.2.1 Part-of-speech tagging
8.4.2.2 Order matters
8.4.3 Turning text data into a matrix: How much is a word worth?
8.4.4 Analysis
8.4.4.1 Use case: Finding similar documents
8.4.4.2 Example: Measuring similarity between documents
8.4.4.3 Example code
8.4.4.4 Augmenting similarity calculations with external knowledge repositories
8.4.4.5 Evaluating “find similar” methods
8.4.4.6 The F score
8.4.4.7 Examples
8.4.4.8 Use case: Clustering
8.4.5 Topic modeling
8.4.5.1 Inferring “topics” from raw text
8.4.5.2 Applications of topic models
8.5 Word embeddings and deep learning
8.6 Text analysis tools
8.6.1 The natural language toolkit
8.6.2 Stanford CoreNLP
8.6.3 The MALLET
8.6.3.1 Spacy.io
8.6.3.2 Pytorch
8.7 Summary
8.8 Resources

Chapter 9: Networks: The Basics (Jason Owen-Smith)
9.1 Introduction
9.2 What are networks?
9.3 Structure for this chapter
9.4 Turning data into a network
9.4.1 Types of networks
9.4.2 Inducing one-mode networks from two-mode data
9.5 Network measures
9.5.1 Reachability
9.5.2 Whole-network measures
9.5.2.1 Components and reachability
9.5.2.2 Path length
9.5.2.3 Degree distribution
9.5.2.4 Clustering coefficient
9.5.2.5 Centrality measures
9.6 Case study: Comparing collaboration networks
9.7 Summary
9.8 Resources

Part III: Inference and Ethics

Chapter 10 (Paul P. Biemer)
10.1 Introduction
10.2 The total error paradigm
10.2.1 The traditional model
10.2.1.1 Types of errors
10.2.1.2 Column error
10.2.1.3 Cell errors
10.3 Example: Google Flu Trends
10.4 Errors in data analysis
10.4.1 Analysis errors despite accurate data
10.4.2 Noise accumulation
10.4.3 Spurious correlations
10.4.4 Incidental endogeneity
10.4.5 Analysis errors resulting from inaccurate data
10.4.5.1 Variable (uncorrelated) and correlated error in continuous variables
10.4.5.2 Extending variable and correlated error to categorical data
10.4.5.3 Errors when analyzing rare population groups
10.4.5.4 Errors in correlation analysis
10.4.5.5 Errors in regression analysis
10.5 Detecting and compensating for data errors
10.5.1 TablePlots
10.6 Summary
10.7 Resources

Chapter 11 (Kit T. Rodolfa, Pedro Saleiro, and Rayid Ghani)
11.1 Introduction
11.2 Sources of bias
11.2.1 Sample bias
11.2.2 Label (outcome) bias
11.2.3 Machine learning pipeline bias
11.2.4 Application bias
11.2.5 Considering bias when deploying your model
11.3 Dealing with bias
11.3.1 Define bias
11.3.2 Definitions
11.3.3 Choosing bias metrics
11.3.4 Punitive example
11.3.4.1 Count of false positives
11.3.4.2 Group size-adjusted false positives
11.3.4.3 False discovery rate
11.3.4.4 False positive rate
11.3.4.5 Tradeoffs in metric choice
11.3.5 Assistive example
11.3.5.1 Count of false negatives
11.3.5.2 Group size-adjusted false negatives
11.3.5.3 False omission rate
11.3.5.4 False negative rate
11.3.6 Special case: Resource-constrained programs
11.4 Mitigating bias
11.4.1 Auditing model results
11.4.2 Model selection
11.4.3 Other options for mitigating bias
11.5 Further considerations
11.5.1 Compared to what?
11.5.2 Costs to both errors
11.5.3 What is the relevant population?
11.5.4 Continuous outcomes
11.5.5 Considerations for ongoing measurement
11.5.6 Equity in practice
11.5.7 Additional terms for metrics
11.6 Case studies
11.6.1 Recidivism predictions with COMPAS
11.6.2 Facial recognition
11.6.3 Facebook advertisement targeting
11.7 Aequitas: A toolkit for auditing bias and fairness in machine learning models
11.7.1 Aequitas in the larger context of the machine learning pipeline

Chapter 12: Privacy and Confidentiality (Stefan Bender, Ron S. Jarmin, Frauke Kreuter, and Julia Lane)
12.1 Introduction
12.2 Why is access important?
12.2.1 Validating the data-generating process
12.2.2 Replication
12.2.3 Building knowledge infrastructure
12.3 Providing access
12.3.1 Statistical disclosure control techniques
12.3.2 Research data centers
12.4 Non-tabular data
12.5 The new challenges
12.6 Legal and ethical framework
12.7 Summary
12.8 Resources

Chapter 13: Workbooks (Brian Kim, Christoph Kern, Jonathan Scott Morgan, Clayton Hunter, and Avishek Kumar)
13.1 Introduction
13.2 Notebooks
13.2.1 Databases
13.2.2 Dataset Exploration and Visualization
13.2.3 APIs
13.2.4 Record Linkage
13.2.5 Text Analysis
13.2.6 Networks
13.2.7 Machine Learning—Creating Labels
13.2.8 Machine Learning—Creating Features
13.2.9 Machine Learning—Model Training and Evaluation
13.2.10 Bias and Fairness
13.2.11 Errors and Inference
13.2.12 Additional workbooks
13.3 Resources
adam_txt
1 Introduction
  1.1 Why this book?
  1.2 Defining big data and its value
  1.3 The importance of inference
    1.3.1 Description
    1.3.2 Causation
    1.3.3 Prediction
  1.4 The importance of understanding how data are generated
  1.5 New tools for new data
  1.6 The book’s “use case”
  1.7 The structure of the book
    1.7.1 Part I: Capture and curation
    1.7.2 Part II: Modeling and analysis
    1.7.3 Part III: Inference and ethics
  1.8 Resources
Part I Capture and Curation
2 (Cameron Neylon)
  2.1 Introduction
  2.2 Scraping information from the web
    2.2.1 Obtaining data from websites
      2.2.1.1 Constructing the URL
      2.2.1.2 Obtaining the contents of the page from the URL
      2.2.1.3 Processing the HTML response
    2.2.2 Programmatically iterating over the search results
    2.2.3 Limits of scraping
  2.3 Application programming interfaces
    2.3.1 Relevant APIs and resources
    2.3.2 RESTful APIs, returned data, and Python wrappers
  2.4 Using an API
  2.5 Another example: Using the ORCID API via a wrapper
  2.6 Integrating data from multiple sources
  2.7 Summary
3 Record Linkage (Joshua Tokle and Stefan Bender)
  3.1 Motivation
  3.2 Introduction to record linkage
  3.3 Preprocessing data for record linkage
  3.4 Indexing and blocking
  3.5 Matching
    3.5.1 Rule-based approaches
    3.5.2 Probabilistic record linkage
    3.5.3 Machine learning approaches to record linkage
    3.5.4 Disambiguating networks
  3.6 Classification
    3.6.1 Thresholds
    3.6.2 One-to-one links
  3.7 Record linkage and data protection
  3.8 Summary
  3.9 Resources
4 (Ian Foster and Pascal Heus)
  4.1 Introduction
  4.2 The DBMS: When and why
  4.3 Relational DBMSs
    4.3.1 Structured Query Language
    4.3.2 Manipulating and querying data
    4.3.3 Schema design and definition
    4.3.4 Loading data
    4.3.5 Transactions and crash recovery
    4.3.6 Database optimizations
    4.3.7 Caveats and challenges
      4.3.7.1 Data cleaning
      4.3.7.2 Missing values
      4.3.7.3 Metadata for categorical variables
  4.4 Linking DBMSs and other tools
  4.5 NoSQL databases
    4.5.1 Challenges of scale: The CAP theorem
    4.5.2 NoSQL and key-value stores
    4.5.3 Other NoSQL databases
  4.6 Spatial databases
  4.7 Which database to use?
    4.7.1 Relational DBMSs
    4.7.2 NoSQL DBMSs
  4.8 Summary
  4.9 Resources
5 (Huy Vo and Claudio Silva)
  5.1 Introduction
  5.2 MapReduce
  5.3 Apache Hadoop MapReduce
    5.3.1 The Hadoop Distributed File System
    5.3.2 Hadoop setup: Bringing compute to the data
    5.3.3 Hardware provisioning
    5.3.4 Programming in Hadoop
    5.3.5 Programming language support
    5.3.6 Benefits and limitations of Hadoop
  5.4 Other MapReduce Implementations
  5.5 Apache Spark
  5.6 Summary
  5.7 Resources
Part II Modeling and Analysis
6 (M. Adil Yalçın and Catherine Plaisant)
  6.1 Introduction
  6.2 Developing effective visualizations
  6.3 A data-by-tasks taxonomy
    6.3.1 Multivariate data
    6.3.2 Spatial data
    6.3.3 Temporal data
    6.3.4 Hierarchical data
    6.3.5 Network data
    6.3.6 Text data
  6.4 Challenges
    6.4.1 Scalability
    6.4.2 Evaluation
    6.4.3 Visual impairment
    6.4.4 Visual literacy
  6.5 Summary
  6.6 Resources
7 Machine Learning (Rayid Ghani and Malte Schierholz)
  7.1 Introduction
  7.2 What is machine learning?
  7.3 Types of analysis
  7.4 The machine learning process
  7.5 Problem formulation: Mapping a problem to machine learning methods
    7.5.1 Features
  7.6 Methods
    7.6.1 Unsupervised learning methods
      7.6.1.1 Clustering
      7.6.1.2 The k-means clustering
      7.6.1.3 Expectation-maximization clustering
      7.6.1.4 Mean shift clustering
      7.6.1.5 Hierarchical clustering
      7.6.1.6 Spectral clustering
      7.6.1.7 Principal components analysis
      7.6.1.8 Association rules
    7.6.2 Supervised learning
      7.6.2.1 Training a model
      7.6.2.2 Using the model to score new data
      7.6.2.3 The k-nearest neighbor
      7.6.2.4 Support vector machines
      7.6.2.5 Decision trees
      7.6.2.6 Ensemble methods
      7.6.2.7 Bagging
      7.6.2.8 Boosting
      7.6.2.9 Random forests
      7.6.2.10 Stacking
      7.6.2.11 Neural networks and deep learning
    7.6.3 Binary vs. multiclass classification problems
    7.6.4 Skewed or imbalanced classification problems
    7.6.5 Model interpretability
      7.6.5.1 Global vs. individual-level explanations
  7.7 Evaluation
    7.7.1 Methodology
      7.7.1.1 In-sample evaluation
      7.7.1.2 Out-of-sample and holdout set
      7.7.1.3 Cross-validation
      7.7.1.4 Temporal validation
    7.7.2 Metrics
  7.8 Practical tips
    7.8.1 Avoiding leakage
    7.8.2 Machine learning pipeline
  7.9 How can social scientists benefit from machine learning?
  7.10 Advanced topics
  7.11 Summary
  7.12 Resources
8 Text Analysis (Evgeny Klochikhin and Jordan Boyd-Graber)
  8.1 Understanding human-generated text
  8.2 How is text data different than “structured” data?
  8.3 What can we do with text data?
  8.4 How to analyze text
    8.4.1 Initial processing
      8.4.1.1 Tokenization
      8.4.1.2 Stop words
      8.4.1.3 The N-grams
      8.4.1.4 Stemming and lemmatization
    8.4.2 Linguistic analysis
      8.4.2.1 Part-of-speech tagging
      8.4.2.2 Order matters
    8.4.3 Turning text data into a matrix: How much is a word worth?
    8.4.4 Analysis
      8.4.4.1 Use case: Finding similar documents
      8.4.4.2 Example: Measuring similarity between documents
      8.4.4.3 Example code
      8.4.4.4 Augmenting similarity calculations with external knowledge repositories
      8.4.4.5 Evaluating “find similar” methods
      8.4.4.6 The F score
      8.4.4.7 Examples
      8.4.4.8 Use case: Clustering
    8.4.5 Topic modeling
      8.4.5.1 Inferring “topics” from raw text
      8.4.5.2 Applications of topic models
  8.5 Word embeddings and deep learning
  8.6 Text analysis tools
    8.6.1 The natural language toolkit
    8.6.2 Stanford CoreNLP
    8.6.3 The MALLET
      8.6.3.1 Spacy.io
      8.6.3.2 Pytorch
  8.7 Summary
  8.8 Resources
9 Networks: The Basics (Jason Owen-Smith)
  9.1 Introduction
  9.2 What are networks?
  9.3 Structure for this chapter
  9.4 Turning data into a network
    9.4.1 Types of networks
    9.4.2 Inducing one-mode networks from two-mode data
  9.5 Network measures
    9.5.1 Reachability
    9.5.2 Whole-network measures
      9.5.2.1 Components and reachability
      9.5.2.2 Path length
      9.5.2.3 Degree distribution
      9.5.2.4 Clustering coefficient
      9.5.2.5 Centrality measures
  9.6 Case study: Comparing collaboration networks
  9.7 Summary
  9.8 Resources
Part III Inference and Ethics
10 (Paul P. Biemer)
  10.1 Introduction
  10.2 The total error paradigm
    10.2.1 The traditional model
      10.2.1.1 Types of errors
      10.2.1.2 Column error
      10.2.1.3 Cell errors
  10.3 Example: Google Flu Trends
  10.4 Errors in data analysis
    10.4.1 Analysis errors despite accurate data
    10.4.2 Noise accumulation
    10.4.3 Spurious correlations
    10.4.4 Incidental endogeneity
    10.4.5 Analysis errors resulting from inaccurate data
      10.4.5.1 Variable (uncorrelated) and correlated error in continuous variables
      10.4.5.2 Extending variable and correlated error to categorical data
      10.4.5.3 Errors when analyzing rare population groups
      10.4.5.4 Errors in correlation analysis
      10.4.5.5 Errors in regression analysis
  10.5 Detecting and compensating for data errors
    10.5.1 TablePlots
  10.6 Summary
  10.7 Resources
11 (Kit T. Rodolfa, Pedro Saleiro, and Rayid Ghani)
  11.1 Introduction
  11.2 Sources of bias
    11.2.1 Sample bias
    11.2.2 Label (outcome) bias
    11.2.3 Machine learning pipeline bias
    11.2.4 Application bias
    11.2.5 Considering bias when deploying your model
  11.3 Dealing with bias
    11.3.1 Define bias
    11.3.2 Definitions
    11.3.3 Choosing bias metrics
    11.3.4 Punitive example
      11.3.4.1 Count of false positives
      11.3.4.2 Group size-adjusted false positives
      11.3.4.3 False discovery rate
      11.3.4.4 False positive rate
      11.3.4.5 Tradeoffs in metric choice
    11.3.5 Assistive example
      11.3.5.1 Count of false negatives
      11.3.5.2 Group size-adjusted false negatives
      11.3.5.3 False omission rate
      11.3.5.4 False negative rate
    11.3.6 Special case: Resource-constrained programs
  11.4 Mitigating bias
    11.4.1 Auditing model results
    11.4.2 Model selection
    11.4.3 Other options for mitigating bias
  11.5 Further considerations
    11.5.1 Compared to what?
    11.5.2 Costs to both errors
    11.5.3 What is the relevant population?
    11.5.4 Continuous outcomes
    11.5.5 Considerations for ongoing measurement
    11.5.6 Equity in practice
    11.5.7 Additional terms for metrics
  11.6 Case studies
    11.6.1 Recidivism predictions with COMPAS
    11.6.2 Facial recognition
    11.6.3 Facebook advertisement targeting
  11.7 Aequitas: A toolkit for auditing bias and fairness in machine learning models
    11.7.1 Aequitas in the larger context of the machine learning pipeline
12 Privacy and Confidentiality (Stefan Bender, Ron S. Jarmin, Frauke Kreuter, and Julia Lane)
  12.1 Introduction
  12.2 Why is access important?
    12.2.1 Validating the data-generating process
    12.2.2 Replication
    12.2.3 Building knowledge Infrastructure
  12.3 Providing access
    12.3.1 Statistical disclosure control techniques
    12.3.2 Research data centers
  12.4 Non-tabular data
  12.5 The new challenges
  12.6 Legal and ethical framework
  12.7 Summary
  12.8 Resources
13 Workbooks (Brian Kim, Christoph Kern, Jonathan Scott Morgan, Clayton Hunter, and Avishek Kumar)
  13.1 Introduction
  13.2 Notebooks
    13.2.1 Databases
    13.2.2 Dataset Exploration and Visualization
    13.2.3 APIs
    13.2.4 Record Linkage
    13.2.5 Text Analysis
    13.2.6 Networks
    13.2.7 Machine Learning—Creating Labels
    13.2.8 Machine Learning—Creating Features
    13.2.9 Machine Learning—Model Training and Evaluation
    13.2.10 Bias and Fairness
    13.2.11 Errors and Inference
    13.2.12 Additional workbooks
  13.3 Resources
any_adam_object	1
any_adam_object_boolean	1
author2	Foster, Ian 1959- Ghani, Rayid Jarmin, Ronald S. 1964- Kreuter, Frauke Lane, Julia 1956-
author2_role	edt edt edt edt edt
author2_variant	i f if r g rg r s j rs rsj f k fk j l jl
author_GND	(DE-588)122888529 (DE-588)1206736933 (DE-588)124661262 (DE-588)1033254037 (DE-588)129556807
author_facet	Foster, Ian 1959- Ghani, Rayid Jarmin, Ronald S. 1964- Kreuter, Frauke Lane, Julia 1956-
building	Verbundindex
bvnumber	BV047209515
callnumber-first	H - Social Science
callnumber-label	H61
callnumber-raw	H61.3
callnumber-search	H61.3
callnumber-sort	H 261.3
callnumber-subject	H - Social Science
classification_rvk	MR 2200
ctrlnum	(OCoLC)1226406716 (DE-599)BVBBV047209515
dewey-full	300.285/6312
dewey-hundreds	300 - Social sciences
dewey-ones	300 - Social sciences
dewey-raw	300.285/6312
dewey-search	300.285/6312
dewey-sort	3300.285 46312
dewey-tens	300 - Social sciences
discipline	Soziologie
discipline_str_mv	Soziologie
edition	Second edition
format	Book
fullrecord	<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>02617nam a2200565 c 4500</leader><controlfield tag="001">BV047209515</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20230103 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">210323s2021 xxua\|\|\| \|\|\|\| 00\|\|\| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9780367568597</subfield><subfield code="c">pbk</subfield><subfield code="9">978-0-367-56859-7</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9780367341879</subfield><subfield code="c">hbk</subfield><subfield code="9">978-0-367-34187-9</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1226406716</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV047209515</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rda</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="044" ind1=" " ind2=" "><subfield code="a">xxu</subfield><subfield code="c">US</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-19</subfield><subfield code="a">DE-N2</subfield><subfield code="a">DE-739</subfield></datafield><datafield tag="050" ind1=" " ind2="0"><subfield code="a">H61.3</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">300.285/6312</subfield><subfield code="2">23</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">MR 2200</subfield><subfield code="0">(DE-625)123489:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Big data and social science</subfield><subfield code="b">data science methods and tools 
for research and practice</subfield><subfield code="c">edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Manheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research)</subfield></datafield><datafield tag="250" ind1=" " ind2=" "><subfield code="a">Second edition</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Boca Raton ; London ; New York</subfield><subfield code="b">CRC Press</subfield><subfield code="c">2021</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">xx, 391 Seiten</subfield><subfield code="b">Illustrationen, Diagramme</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="0" ind2=" "><subfield code="a">Chapman & Hall/CRC statistics in the social and behavioral sciences series</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Includes bibliographical references and index</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Datenverarbeitung</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Sozialwissenschaften</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Social sciences</subfield><subfield code="x">Data processing</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Social sciences</subfield><subfield code="x">Statistical methods</subfield></datafield><datafield tag="650" ind1=" " 
ind2="4"><subfield code="a">Data mining</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Big data</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Big Data</subfield><subfield code="0">(DE-588)4802620-7</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Sozialwissenschaften</subfield><subfield code="0">(DE-588)4055916-6</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="655" ind1=" " ind2="7"><subfield code="0">(DE-588)4143413-4</subfield><subfield code="a">Aufsatzsammlung</subfield><subfield code="2">gnd-content</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Sozialwissenschaften</subfield><subfield code="0">(DE-588)4055916-6</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Big Data</subfield><subfield code="0">(DE-588)4802620-7</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Foster, Ian</subfield><subfield code="d">1959-</subfield><subfield code="0">(DE-588)122888529</subfield><subfield code="4">edt</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Ghani, Rayid</subfield><subfield code="0">(DE-588)1206736933</subfield><subfield code="4">edt</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Jarmin, Ronald S.</subfield><subfield code="d">1964-</subfield><subfield code="0">(DE-588)124661262</subfield><subfield code="4">edt</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Kreuter, Frauke</subfield><subfield code="0">(DE-588)1033254037</subfield><subfield code="4">edt</subfield></datafield><datafield tag="700" ind1="1" 
ind2=" "><subfield code="a">Lane, Julia</subfield><subfield code="d">1956-</subfield><subfield code="0">(DE-588)129556807</subfield><subfield code="4">edt</subfield></datafield><datafield tag="776" ind1="0" ind2="8"><subfield code="i">Erscheint auch als</subfield><subfield code="n">Online-Ausgabe</subfield><subfield code="z">9780429324383</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Passau - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=032614344&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-032614344</subfield></datafield></record></collection>
genre	(DE-588)4143413-4 Aufsatzsammlung gnd-content
genre_facet	Aufsatzsammlung
id	DE-604.BV047209515
illustrated	Illustrated
index_date	2024-07-03T16:53:52Z
indexdate	2024-07-10T09:05:43Z
institution	BVB
isbn	9780367568597 9780367341879
language	English
oai_aleph_id	oai:aleph.bib-bvb.de:BVB01-032614344
oclc_num	1226406716
open_access_boolean
owner	DE-19 DE-BY-UBM DE-N2 DE-739
owner_facet	DE-19 DE-BY-UBM DE-N2 DE-739
physical	xx, 391 Seiten Illustrationen, Diagramme
publishDate	2021
publishDateSearch	2021
publishDateSort	2021
publisher	CRC Press
record_format	marc
series2	Chapman & Hall/CRC statistics in the social and behavioral sciences series
spelling	Big data and social science data science methods and tools for research and practice edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Manheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research) Second edition Boca Raton ; London ; New York CRC Press 2021 xx, 391 Seiten Illustrationen, Diagramme txt rdacontent n rdamedia nc rdacarrier Chapman & Hall/CRC statistics in the social and behavioral sciences series Includes bibliographical references and index Datenverarbeitung Sozialwissenschaften Social sciences Data processing Social sciences Statistical methods Data mining Big data Big Data (DE-588)4802620-7 gnd rswk-swf Sozialwissenschaften (DE-588)4055916-6 gnd rswk-swf (DE-588)4143413-4 Aufsatzsammlung gnd-content Sozialwissenschaften (DE-588)4055916-6 s Big Data (DE-588)4802620-7 s DE-604 Foster, Ian 1959- (DE-588)122888529 edt Ghani, Rayid (DE-588)1206736933 edt Jarmin, Ronald S. 1964- (DE-588)124661262 edt Kreuter, Frauke (DE-588)1033254037 edt Lane, Julia 1956- (DE-588)129556807 edt Erscheint auch als Online-Ausgabe 9780429324383 Digitalisierung UB Passau - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=032614344&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis
spellingShingle	Big data and social science data science methods and tools for research and practice Datenverarbeitung Sozialwissenschaften Social sciences Data processing Social sciences Statistical methods Data mining Big data Big Data (DE-588)4802620-7 gnd Sozialwissenschaften (DE-588)4055916-6 gnd
subject_GND	(DE-588)4802620-7 (DE-588)4055916-6 (DE-588)4143413-4
title	Big data and social science data science methods and tools for research and practice
title_auth	Big data and social science data science methods and tools for research and practice
title_exact_search	Big data and social science data science methods and tools for research and practice
title_exact_search_txtP	Big data and social science data science methods and tools for research and practice
title_full	Big data and social science data science methods and tools for research and practice edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Manheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research)
title_fullStr	Big data and social science data science methods and tools for research and practice edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Manheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research)
title_full_unstemmed	Big data and social science data science methods and tools for research and practice edited by Ian Foster (University of Chicago, Argonne National Laboratory), Rayid Ghani (University of Chicago), Ron S. Jarmin (U.S. Census Bureau), Frauke Kreuter (University of Maryland, University of Manheim, Institute for Employment Research), Julia Lane (New York University, American Institutes for Research)
title_short	Big data and social science
title_sort	big data and social science data science methods and tools for research and practice
title_sub	data science methods and tools for research and practice
topic	Datenverarbeitung Sozialwissenschaften Social sciences Data processing Social sciences Statistical methods Data mining Big data Big Data (DE-588)4802620-7 gnd Sozialwissenschaften (DE-588)4055916-6 gnd
topic_facet	Datenverarbeitung Sozialwissenschaften Social sciences Data processing Social sciences Statistical methods Data mining Big data Big Data Aufsatzsammlung
url	http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=032614344&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA
work_keys_str_mv	AT fosterian bigdataandsocialsciencedatasciencemethodsandtoolsforresearchandpractice AT ghanirayid bigdataandsocialsciencedatasciencemethodsandtoolsforresearchandpractice AT jarminronalds bigdataandsocialsciencedatasciencemethodsandtoolsforresearchandpractice AT kreuterfrauke bigdataandsocialsciencedatasciencemethodsandtoolsforresearchandpractice AT lanejulia bigdataandsocialsciencedatasciencemethodsandtoolsforresearchandpractice

Availability

No print copy is available.

Order via interlibrary loan. Note: not held by THWS! Table of contents

Similar entries