Text as data: a new framework for machine learning and the social sciences
Saved in:
Main authors: | Grimmer, Justin; Roberts, Margaret E.; Stewart, Brandon M. |
---|---|
Format: | Book |
Language: | English |
Published: | Princeton ; Oxford : Princeton University Press, [2022] |
Subjects: | Text Mining; Natürliche Sprache; Maschinelles Lernen; Computational social science |
Online access: | Table of contents |
Description: | xix, 336 pages, illustrations, diagrams |
ISBN: | 9780691207544 9780691207551 |
Internal format
MARC
LEADER 00000nam a2200000 c 4500
001    BV047879793
003    DE-604
005    20220930
007    t
008    220311s2022 a||| |||| 00||| eng d
020    |a 9780691207544 |c cloth |9 978-0-691-20754-4
020    |a 9780691207551 |c pbk |9 978-0-691-20755-1
035    |a (OCoLC)1312603230
035    |a (DE-599)BVBBV047879793
040    |a DE-604 |b ger |e rda
041 0  |a eng
049    |a DE-19 |a DE-355 |a DE-188 |a DE-521 |a DE-473 |a DE-29 |a DE-20 |a DE-739 |a DE-11
084    |a MR 2200 |0 (DE-625)123489: |2 rvk
084    |a ST 650 |0 (DE-625)143687: |2 rvk
084    |a ST 306 |0 (DE-625)143654: |2 rvk
100 1  |a Grimmer, Justin |e Verfasser |0 (DE-588)1048671410 |4 aut
245 10 |a Text as data |b a new framework for machine learning and the social sciences |c Justin Grimmer, Margaret E. Roberts, Brandon M. Stewart
264  1 |a Princeton ; Oxford |b Princeton University Press |c [2022]
264  4 |c © 2022
300    |a xix, 336 Seiten |b Illustrationen, Diagramme
336    |b txt |2 rdacontent
337    |b n |2 rdamedia
338    |b nc |2 rdacarrier
650 07 |a Computational social science |0 (DE-588)1249405939 |2 gnd |9 rswk-swf
650 07 |a Natürliche Sprache |0 (DE-588)4041354-8 |2 gnd |9 rswk-swf
650 07 |a Maschinelles Lernen |0 (DE-588)4193754-5 |2 gnd |9 rswk-swf
650 07 |a Text Mining |0 (DE-588)4728093-1 |2 gnd |9 rswk-swf
653  0 |a Text data mining
653  0 |a Social sciences / Data processing
653  0 |a Machine learning
689 00 |a Text Mining |0 (DE-588)4728093-1 |D s
689 01 |a Natürliche Sprache |0 (DE-588)4041354-8 |D s
689 02 |a Maschinelles Lernen |0 (DE-588)4193754-5 |D s
689 03 |a Computational social science |0 (DE-588)1249405939 |D s
689 0  |5 DE-604
700 1  |a Roberts, Margaret E. |e Verfasser |0 (DE-588)1141692562 |4 aut
700 1  |a Stewart, Brandon M. |e Verfasser |0 (DE-588)1262405440 |4 aut
776 08 |i Erscheint auch als |n Online-Ausgabe |z 978-0-691-20799-5
856 42 |m Digitalisierung UB Bamberg - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=033262161&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis
Record in the search index
_version_ | 1805077530418348032 |
adam_text |
Contents

Preface xvii
  Prerequisites and Notation xvii
  Uses for This Book xviii
  What This Book Is Not xix

PART I PRELIMINARIES 1

CHAPTER 1 Introduction 3
  1.1 How This Book Informs the Social Sciences 5
  1.2 How This Book Informs the Digital Humanities 8
  1.3 How This Book Informs Data Science in Industry and Government 9
  1.4 A Guide to This Book 10
  1.5 Conclusion 11

CHAPTER 2 Social Science Research and Text Analysis 13
  2.1 Discovery 15
  2.2 Measurement 16
  2.3 Inference 17
  2.4 Social Science as an Iterative and Cumulative Process 17
  2.5 An Agnostic Approach to Text Analysis 18
  2.6 Discovery, Measurement, and Causal Inference: How the Chinese Government Censors Social Media 20
  2.7 Six Principles of Text Analysis 22
    2.7.1 Social Science Theories and Substantive Knowledge are Essential for Research Design 22
    2.7.2 Text Analysis does not Replace Humans—It Augments Them 24
    2.7.3 Building, Refining, and Testing Social Science Theories Requires Iteration and Cumulation 26
    2.7.4 Text Analysis Methods Distill Generalizations from Language 28
    2.7.5 The Best Method Depends on the Task 29
    2.7.6 Validations are Essential and Depend on the Theory and the Task 30
  2.8 Conclusion: Text Data and Social Science 32

PART II SELECTION AND REPRESENTATION 33

CHAPTER 3 Principles of Selection and Representation 35
  3.1 Principle 1: Question-Specific Corpus Construction 35
  3.2 Principle 2: No Values-Free Corpus Construction 36
  3.3 Principle 3: No Right Way to Represent Text 37
  3.4 Principle 4: Validation 38
  3.5 State of the Union Addresses 38
  3.6 The Authorship of the Federalist Papers 39
  3.7 Conclusion 40

CHAPTER 4 Selecting Documents 41
  4.1 Populations and Quantities of Interest 42
  4.2 Four Types of Bias 43
    4.2.1 Resource Bias 43
    4.2.2 Incentive Bias 44
    4.2.3 Medium Bias 44
    4.2.4 Retrieval Bias 45
  4.3 Considerations of "Found Data" 46
  4.4 Conclusion 46

CHAPTER 5 Bag of Words 48
  5.1 The Bag of Words Model 48
  5.2 Choose the Unit of Analysis 49
  5.3 Tokenize 50
  5.4 Reduce Complexity 52
    5.4.1 Lowercase 52
    5.4.2 Remove Punctuation 52
    5.4.3 Remove Stop Words 53
    5.4.4 Create Equivalence Classes (Lemmatize/Stem) 54
    5.4.5 Filter by Frequency 55
  5.5 Construct Document-Feature Matrix 55
  5.6 Rethinking the Defaults 57
    5.6.1 Authorship of the Federalist Papers 57
    5.6.2 The Scale Argument against Preprocessing 58
  5.7 Conclusion 59

CHAPTER 6 The Multinomial Language Model 60
  6.1 Multinomial Distribution 61
  6.2 Basic Language Modeling 63
  6.3 Regularization and Smoothing 66
  6.4 The Dirichlet Distribution 66
  6.5 Conclusion 69

CHAPTER 7 The Vector Space Model and Similarity Metrics 70
  7.1 Similarity Metrics 70
  7.2 Distance Metrics 73
  7.3 tf-idf Weighting 75
  7.4 Conclusion 77

CHAPTER 8 Distributed Representations of Words 78
  8.1 Why Word Embeddings 79
  8.2 Estimating Word Embeddings 81
    8.2.1 The Self-Supervision Insight 81
    8.2.2 Design Choices in Word Embeddings 81
    8.2.3 Latent Semantic Analysis 82
    8.2.4 Neural Word Embeddings 82
    8.2.5 Pretrained Embeddings 84
    8.2.6 Rare Words 84
    8.2.7 An Illustration 85
  8.3 Aggregating Word Embeddings to the Document Level 86
  8.4 Validation 87
  8.5 Contextualized Word Embeddings 88
  8.6 Conclusion 89

CHAPTER 9 Representations from Language Sequences 90
  9.1 Text Reuse 90
  9.2 Parts of Speech Tagging 91
    9.2.1 Using Phrases to Improve Visualization 92
  9.3 Named-Entity Recognition 94
  9.4 Dependency Parsing 95
  9.5 Broader Information Extraction Tasks 96
  9.6 Conclusion 97

PART III DISCOVERY 99

CHAPTER 10 Principles of Discovery 103
  10.1 Principle 1: Context Relevance 103
  10.2 Principle 2: No Ground Truth 104
  10.3 Principle 3: Judge the Concept, Not the Method 105
  10.4 Principle 4: Separate Data Is Best 106
  10.5 Conceptualizing the US Congress 106
  10.6 Conclusion 109

CHAPTER 11 Discriminating Words 111
  11.1 Mutual Information 112
  11.2 Fightin' Words 115
  11.3 Fictitious Prediction Problems 117
    11.3.1 Standardized Test Statistics as Measures of Separation 118
    11.3.2 χ² Test Statistics 118
    11.3.3 Multinomial Inverse Regression 121
  11.4 Conclusion 121

CHAPTER 12 Clustering 123
  12.1 An Initial Example Using k-Means Clustering 124
  12.2 Representations for Clustering 127
  12.3 Approaches to Clustering 127
    12.3.1 Components of a Clustering Method 128
    12.3.2 Styles of Clustering Methods 130
    12.3.3 Probabilistic Clustering Models 132
    12.3.4 Algorithmic Clustering Models 134
    12.3.5 Connections between Probabilistic and Algorithmic Clustering 137
  12.4 Making Choices 137
    12.4.1 Model Selection 137
    12.4.2 Careful Reading 140
    12.4.3 Choosing the Number of Clusters 140
  12.5 The Human Side of Clustering 144
    12.5.1 Interpretation 144
    12.5.2 Interactive Clustering 144
  12.6 Conclusion 145

CHAPTER 13 Topic Models 147
  13.1 Latent Dirichlet Allocation 147
    13.1.1 Inference 149
    13.1.2 Example: Discovering Credit Claiming for Fire Grants in Congressional Press Releases 149
  13.2 Interpreting the Output of Topic Models 151
  13.3 Incorporating Structure into LDA 153
    13.3.1 Structure with Upstream, Known Prevalence Covariates 154
    13.3.2 Structure with Upstream, Known Content Covariates 154
    13.3.3 Structure with Downstream, Known Covariates 156
    13.3.4 Additional Sources of Structure 157
  13.4 Structural Topic Models 157
    13.4.1 Example: Discovering the Components of Radical Discourse 159
  13.5 Labeling Topic Models 159
  13.6 Conclusion 160

CHAPTER 14 Low-Dimensional Document Embeddings 162
  14.1 Principal Component Analysis 162
    14.1.1 Automated Methods for Labeling Principal Components 163
    14.1.2 Manual Methods for Labeling Principal Components 164
    14.1.3 Principal Component Analysis of Senate Press Releases 164
    14.1.4 Choosing the Number of Principal Components 165
  14.2 Classical Multidimensional Scaling 167
    14.2.1 Extensions of Classical MDS 168
    14.2.2 Applying Classical MDS to Senate Press Releases 168
  14.3 Conclusion 169

PART IV MEASUREMENT 171

CHAPTER 15 Principles of Measurement 173
  15.1 From Concept to Measurement 174
  15.2 What Makes a Good Measurement 174
    15.2.1 Principle 1: Measures should have Clear Goals 175
    15.2.2 Principle 2: Source Material should Always be Identified and Ideally Made Public 175
    15.2.3 Principle 3: The Coding Process should be Explainable and Reproducible 175
    15.2.4 Principle 4: The Measure should be Validated 175
    15.2.5 Principle 5: Limitations should be Explored, Documented and Communicated to the Audience 176
  15.3 Balancing Discovery and Measurement with Sample Splits 176

CHAPTER 16 Word Counting 178
  16.1 Keyword Counting 178
  16.2 Dictionary Methods 180
  16.3 Limitations and Validations of Dictionary Methods 181
    16.3.1 Moving Beyond Dictionaries: Wordscores 182
  16.4 Conclusion 183

CHAPTER 17 An Overview of Supervised Classification 184
  17.1 Example: Discursive Governance 185
  17.2 Create a Training Set 186
  17.3 Classify Documents with Supervised Learning 186
  17.4 Check Performance 187
  17.5 Using the Measure 187
  17.6 Conclusion 188

CHAPTER 18 Coding a Training Set 189
  18.1 Characteristics of a Good Training Set 190
  18.2 Hand Coding 190
    18.2.1 1: Decide on a Codebook 191
    18.2.2 2: Select Coders 191
    18.2.3 3: Select Documents to Code 191
    18.2.4 4: Manage Coders 192
    18.2.5 5: Check Reliability 192
    18.2.6 Managing Drift 192
    18.2.7 Example: Making the News 192
  18.3 Crowdsourcing 193
  18.4 Supervision with Found Data 195
  18.5 Conclusion 196

CHAPTER 19 Classifying Documents with Supervised Learning 197
  19.1 Naive Bayes 198
    19.1.1 The Assumptions in Naive Bayes are Almost Certainly Wrong 200
    19.1.2 Naive Bayes is a Generative Model 200
    19.1.3 Naive Bayes is a Linear Classifier 201
  19.2 Machine Learning 202
    19.2.1 Fixed Basis Functions 203
    19.2.2 Adaptive Basis Functions 205
    19.2.3 Quantification 206
    19.2.4 Concluding Thoughts on Supervised Learning with Random Samples 207
  19.3 Example: Estimating Jihad Scores 207
  19.4 Conclusion 210

CHAPTER 20 Checking Performance 211
  20.1 Validation with Gold-Standard Data 211
    20.1.1 Validation Set 212
    20.1.2 Cross-Validation 213
    20.1.3 The Importance of Gold-Standard Data 213
    20.1.4 Ongoing Evaluations 214
  20.2 Validation without Gold-Standard Data 214
    20.2.1 Surrogate Labels 214
    20.2.2 Partial Category Replication 215
    20.2.3 Nonexpert Human Evaluation 215
    20.2.4 Correspondence to External Information 215
  20.3 Example: Validating Jihad Scores 216
  20.4 Conclusion 217

CHAPTER 21 Repurposing Discovery Methods 219
  21.1 Unsupervised Methods Tend to Measure Subject Better than Subtleties 219
  21.2 Example: Scaling via Differential Word Rates 220
  21.3 A Workflow for Repurposing Unsupervised Methods for Measurement 221
    21.3.1 1: Split the Data 223
    21.3.2 2: Fit the Model 223
    21.3.3 3: Validate the Model 223
    21.3.4 4: Fit to the Test Data and Revalidate 225
  21.4 Concerns in Repurposing Unsupervised Methods for Measurement 225
    21.4.1 Concern 1: The Method Always Returns a Result 226
    21.4.2 Concern 2: Opaque Differences in Estimation Strategies 226
    21.4.3 Concern 3: Sensitivity to Unintuitive Hyperparameters 227
    21.4.4 Concern 4: Instability in Results 227
    21.4.5 Rethinking Stability 228
  21.5 Conclusion 229

PART V INFERENCE 231

CHAPTER 22 Principles of Inference 233
  22.1 Prediction 233
  22.2 Causal Inference 234
    22.2.1 Causal Inference Places Identification First 235
    22.2.2 Prediction Is about Outcomes That Will Happen, Causal Inference Is about Outcomes from Interventions 235
    22.2.3 Prediction and Causal Inference Require Different Validations 236
    22.2.4 Prediction and Causal Inference Use Features Differently 237
  22.3 Comparing Prediction and Causal Inference 238
  22.4 Partial and General Equilibrium in Prediction and Causal Inference 238
  22.5 Conclusion 240

CHAPTER 23 Prediction 241
  23.1 The Basic Task of Prediction 242
  23.2 Similarities and Differences between Prediction and Measurement 243
  23.3 Five Principles of Prediction 244
    23.3.1 Predictive Features do not have to Cause the Outcome 244
    23.3.2 Cross-Validation is not Always a Good Measure of Predictive Power 244
    23.3.3 It's Not Always Better to be More Accurate on Average 246
    23.3.4 There can be Practical Value in Interpreting Models for Prediction 247
    23.3.5 It can be Difficult to Apply Prediction to Policymaking 247
  23.4 Using Text as Data for Prediction: Examples 249
    23.4.1 Source Prediction 249
    23.4.2 Linguistic Prediction 253
    23.4.3 Social Forecasting 254
    23.4.4 Nowcasting 256
  23.5 Conclusion 257

CHAPTER 24 Causal Inference 259
  24.1 Introduction to Causal Inference 260
  24.2 Similarities and Differences between Prediction and Measurement, and Causal Inference 263
  24.3 Key Principles of Causal Inference with Text 263
    24.3.1 The Core Problems of Causal Inference Remain, even when Working with Text 263
    24.3.2 Our Conceptualization of the Treatment and Outcome Remains a Critical Component of Causal Inference with Text 264
    24.3.3 The Challenges of Making Causal Inferences with Text Underscore the Need for Sequential Science 264
  24.4 The Mapping Function 266
    24.4.1 Causal Inference with g 267
    24.4.2 Identification and Overfitting 268
  24.5 Workflows for Making Causal Inferences with Text 269
    24.5.1 Define g before Looking at the Documents 269
    24.5.2 Use a Train/Test Split 269
    24.5.3 Run Sequential Experiments 271
  24.6 Conclusion 271

CHAPTER 25 Text as Outcome 272
  25.1 An Experiment on Immigration 272
  25.2 The Effect of Presidential Public Appeals 275
  25.3 Conclusion 276

CHAPTER 26 Text as Treatment 277
  26.1 An Experiment Using Trump's Tweets 279
  26.2 A Candidate Biography Experiment 281
  26.3 Conclusion 284

CHAPTER 27 Text as Confounder 285
  27.1 Regression Adjustments for Text Confounders 287
  27.2 Matching Adjustments for Text 290
  27.3 Conclusion 292

PART VI CONCLUSION 295

CHAPTER 28 Conclusion 297
  28.1 How to Use Text as Data in the Social Sciences 298
    28.1.1 The Focus on Social Science Tasks 298
    28.1.2 Iterative and Sequential Nature of the Social Sciences 298
    28.1.3 Model Skepticism and the Application of Machine Learning to the Social Sciences 299
  28.2 Applying Our Principles beyond Text Data 299
  28.3 Avoiding the Cycle of Creation and Destruction in Social Science Methodology 300

Acknowledgments 303
Bibliography 307
Index 331 |
any_adam_object | 1 |
any_adam_object_boolean | 1 |
author | Grimmer, Justin Roberts, Margaret E. Stewart, Brandon M. |
author_GND | (DE-588)1048671410 (DE-588)1141692562 (DE-588)1262405440 |
author_facet | Grimmer, Justin Roberts, Margaret E. Stewart, Brandon M. |
author_role | aut aut aut |
author_sort | Grimmer, Justin |
author_variant | j g jg m e r me mer b m s bm bms |
building | Verbundindex |
bvnumber | BV047879793 |
classification_rvk | MR 2200 ST 650 ST 306 |
ctrlnum | (OCoLC)1312603230 (DE-599)BVBBV047879793 |
discipline | Informatik Soziologie |
discipline_str_mv | Informatik Soziologie |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>00000nam a2200000 c 4500</leader><controlfield tag="001">BV047879793</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20220930</controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">220311s2022 a||| |||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9780691207544</subfield><subfield code="c">cloth</subfield><subfield code="9">978-0-691-20754-4</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9780691207551</subfield><subfield code="c">pbk</subfield><subfield code="9">978-0-691-20755-1</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1312603230</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV047879793</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rda</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-19</subfield><subfield code="a">DE-355</subfield><subfield code="a">DE-188</subfield><subfield code="a">DE-521</subfield><subfield code="a">DE-473</subfield><subfield code="a">DE-29</subfield><subfield code="a">DE-20</subfield><subfield code="a">DE-739</subfield><subfield code="a">DE-11</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">MR 2200</subfield><subfield code="0">(DE-625)123489:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 650</subfield><subfield code="0">(DE-625)143687:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 306</subfield><subfield 
code="0">(DE-625)143654:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Grimmer, Justin</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1048671410</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Text as data</subfield><subfield code="b">a new framework for machine learning and the social sciences</subfield><subfield code="c">Justin Grimmer, Margaret E. Roberts, Brandon M. Stewart</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Princeton ; Oxford</subfield><subfield code="b">Princeton University Press</subfield><subfield code="c">[2022]</subfield></datafield><datafield tag="264" ind1=" " ind2="4"><subfield code="c">© 2022</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">xix, 336 Seiten</subfield><subfield code="b">Illustrationen, Diagramme</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Computational social science</subfield><subfield code="0">(DE-588)1249405939</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Natürliche Sprache</subfield><subfield code="0">(DE-588)4041354-8</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Maschinelles Lernen</subfield><subfield code="0">(DE-588)4193754-5</subfield><subfield code="2">gnd</subfield><subfield 
code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Text Mining</subfield><subfield code="0">(DE-588)4728093-1</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="653" ind1=" " ind2="0"><subfield code="a">Text data mining</subfield></datafield><datafield tag="653" ind1=" " ind2="0"><subfield code="a">Social sciences / Data processing</subfield></datafield><datafield tag="653" ind1=" " ind2="0"><subfield code="a">Machine learning</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Text Mining</subfield><subfield code="0">(DE-588)4728093-1</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Natürliche Sprache</subfield><subfield code="0">(DE-588)4041354-8</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="2"><subfield code="a">Maschinelles Lernen</subfield><subfield code="0">(DE-588)4193754-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="3"><subfield code="a">Computational social science</subfield><subfield code="0">(DE-588)1249405939</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Roberts, Margaret E.</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1141692562</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Stewart, Brandon M.</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1262405440</subfield><subfield code="4">aut</subfield></datafield><datafield tag="776" ind1="0" ind2="8"><subfield code="i">Erscheint auch als</subfield><subfield code="n">Online-Ausgabe</subfield><subfield code="z">978-0-691-20799-5</subfield></datafield><datafield 
tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Bamberg - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&amp;doc_library=BVB01&amp;local_base=BVB01&amp;doc_number=033262161&amp;sequence=000001&amp;line_number=0001&amp;func_code=DB_RECORDS&amp;service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield></record></collection> |
id | DE-604.BV047879793 |
illustrated | Illustrated |
index_date | 2024-07-03T19:22:21Z |
indexdate | 2024-07-20T06:14:39Z |
institution | BVB |
isbn | 9780691207544 9780691207551 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-033262161 |
oclc_num | 1312603230 |
open_access_boolean | |
owner | DE-19 DE-BY-UBM DE-355 DE-BY-UBR DE-188 DE-521 DE-473 DE-BY-UBG DE-29 DE-20 DE-739 DE-11 |
owner_facet | DE-19 DE-BY-UBM DE-355 DE-BY-UBR DE-188 DE-521 DE-473 DE-BY-UBG DE-29 DE-20 DE-739 DE-11 |
physical | xix, 336 Seiten Illustrationen, Diagramme |
publishDate | 2022 |
publishDateSearch | 2022 |
publishDateSort | 2022 |
publisher | Princeton University Press |
record_format | marc |
spelling | Grimmer, Justin Verfasser (DE-588)1048671410 aut Text as data a new framework for machine learning and the social sciences Justin Grimmer, Margaret E. Roberts, Brandon M. Stewart Princeton ; Oxford Princeton University Press [2022] © 2022 xix, 336 Seiten Illustrationen, Diagramme txt rdacontent n rdamedia nc rdacarrier Computational social science (DE-588)1249405939 gnd rswk-swf Natürliche Sprache (DE-588)4041354-8 gnd rswk-swf Maschinelles Lernen (DE-588)4193754-5 gnd rswk-swf Text Mining (DE-588)4728093-1 gnd rswk-swf Text data mining Social sciences / Data processing Machine learning Text Mining (DE-588)4728093-1 s Natürliche Sprache (DE-588)4041354-8 s Maschinelles Lernen (DE-588)4193754-5 s Computational social science (DE-588)1249405939 s DE-604 Roberts, Margaret E. Verfasser (DE-588)1141692562 aut Stewart, Brandon M. Verfasser (DE-588)1262405440 aut Erscheint auch als Online-Ausgabe 978-0-691-20799-5 Digitalisierung UB Bamberg - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=033262161&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Grimmer, Justin Roberts, Margaret E. Stewart, Brandon M. Text as data a new framework for machine learning and the social sciences Computational social science (DE-588)1249405939 gnd Natürliche Sprache (DE-588)4041354-8 gnd Maschinelles Lernen (DE-588)4193754-5 gnd Text Mining (DE-588)4728093-1 gnd |
subject_GND | (DE-588)1249405939 (DE-588)4041354-8 (DE-588)4193754-5 (DE-588)4728093-1 |
title | Text as data a new framework for machine learning and the social sciences |
title_auth | Text as data a new framework for machine learning and the social sciences |
title_exact_search | Text as data a new framework for machine learning and the social sciences |
title_exact_search_txtP | Text as data a new framework for machine learning and the social sciences |
title_full | Text as data a new framework for machine learning and the social sciences Justin Grimmer, Margaret E. Roberts, Brandon M. Stewart |
title_fullStr | Text as data a new framework for machine learning and the social sciences Justin Grimmer, Margaret E. Roberts, Brandon M. Stewart |
title_full_unstemmed | Text as data a new framework for machine learning and the social sciences Justin Grimmer, Margaret E. Roberts, Brandon M. Stewart |
title_short | Text as data |
title_sort | text as data a new framework for machine learning and the social sciences |
title_sub | a new framework for machine learning and the social sciences |
topic | Computational social science (DE-588)1249405939 gnd Natürliche Sprache (DE-588)4041354-8 gnd Maschinelles Lernen (DE-588)4193754-5 gnd Text Mining (DE-588)4728093-1 gnd |
topic_facet | Computational social science Natürliche Sprache Maschinelles Lernen Text Mining |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=033262161&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT grimmerjustin textasdataanewframeworkformachinelearningandthesocialsciences AT robertsmargarete textasdataanewframeworkformachinelearningandthesocialsciences AT stewartbrandonm textasdataanewframeworkformachinelearningandthesocialsciences |