Korea Labor Institute Electronic Library

Monograph

Text as data: a new framework for machine learning and the social sciences

Call number
006.31 TEX2022
Publication
New Jersey : Princeton University Press, 2022
Physical description
336 p.
Bibliography note
Includes bibliographical references and index
ISBN
9780691207544
Holdings
Available (1)
  • Registration number
    00009690
    Status / Due date
    Available for loan
    -
    Location / Call number (print)
    Korea Labor Institute
Table of contents
Preface
  Prerequisites and Notation
  Uses for This Book
  What This Book Is Not

PART I PRELIMINARIES

CHAPTER 1 Introduction
  1.1 How This Book Informs the Social Sciences
  1.2 How This Book Informs the Digital Humanities
  1.3 How This Book Informs Data Science in Industry and Government
  1.4 A Guide to This Book
  1.5 Conclusion

CHAPTER 2 Social Science Research and Text Analysis
  2.1 Discovery
  2.2 Measurement
  2.3 Inference
  2.4 Social Science as an Iterative and Cumulative Process
  2.5 An Agnostic Approach to Text Analysis
  2.6 Discovery, Measurement, and Causal Inference: How the Chinese Government Censors Social Media
  2.7 Six Principles of Text Analysis
    2.7.1 Social Science Theories and Substantive Knowledge are Essential for Research Design
    2.7.2 Text Analysis does not Replace Humans—It Augments Them
    2.7.3 Building, Refining, and Testing Social Science Theories Requires Iteration and Cumulation
    2.7.4 Text Analysis Methods Distill Generalizations from Language
    2.7.5 The Best Method Depends on the Task
    2.7.6 Validations are Essential and Depend on the Theory and the Task
  2.8 Conclusion: Text Data and Social Science

PART II SELECTION AND REPRESENTATION

CHAPTER 3 Principles of Selection and Representation
  3.1 Principle 1: Question-Specific Corpus Construction
  3.2 Principle 2: No Values-Free Corpus Construction
  3.3 Principle 3: No Right Way to Represent Text
  3.4 Principle 4: Validation
  3.5 State of the Union Addresses
  3.6 The Authorship of the Federalist Papers
  3.7 Conclusion

CHAPTER 4 Selecting Documents
  4.1 Populations and Quantities of Interest
  4.2 Four Types of Bias
    4.2.1 Resource Bias
    4.2.2 Incentive Bias
    4.2.3 Medium Bias
    4.2.4 Retrieval Bias
  4.3 Considerations of “Found Data”
  4.4 Conclusion

CHAPTER 5 Bag of Words
  5.1 The Bag of Words Model
  5.2 Choose the Unit of Analysis
  5.3 Tokenize
  5.4 Reduce Complexity
    5.4.1 Lowercase
    5.4.2 Remove Punctuation
    5.4.3 Remove Stop Words
    5.4.4 Create Equivalence Classes (Lemmatize/Stem)
    5.4.5 Filter by Frequency
  5.5 Construct Document-Feature Matrix
  5.6 Rethinking the Defaults
    5.6.1 Authorship of the Federalist Papers
    5.6.2 The Scale Argument against Preprocessing
  5.7 Conclusion

CHAPTER 6 The Multinomial Language Model
  6.1 Multinomial Distribution
  6.2 Basic Language Modeling
  6.3 Regularization and Smoothing
  6.4 The Dirichlet Distribution
  6.5 Conclusion

CHAPTER 7 The Vector Space Model and Similarity Metrics
  7.1 Similarity Metrics
  7.2 Distance Metrics
  7.3 tf-idf Weighting
  7.4 Conclusion

CHAPTER 8 Distributed Representations of Words
  8.1 Why Word Embeddings
  8.2 Estimating Word Embeddings
    8.2.1 The Self-Supervision Insight
    8.2.2 Design Choices in Word Embeddings
    8.2.3 Latent Semantic Analysis
    8.2.4 Neural Word Embeddings
    8.2.5 Pretrained Embeddings
    8.2.6 Rare Words
    8.2.7 An Illustration
  8.3 Aggregating Word Embeddings to the Document Level
  8.4 Validation
  8.5 Contextualized Word Embeddings
  8.6 Conclusion

CHAPTER 9 Representations from Language Sequences
  9.1 Text Reuse
  9.2 Parts of Speech Tagging
    9.2.1 Using Phrases to Improve Visualization
  9.3 Named-Entity Recognition
  9.4 Dependency Parsing
  9.5 Broader Information Extraction Tasks
  9.6 Conclusion

PART III DISCOVERY

CHAPTER 10 Principles of Discovery
  10.1 Principle 1: Context Relevance
  10.2 Principle 2: No Ground Truth
  10.3 Principle 3: Judge the Concept, Not the Method
  10.4 Principle 4: Separate Data Is Best
  10.5 Conceptualizing the US Congress
  10.6 Conclusion

CHAPTER 11 Discriminating Words
  11.1 Mutual Information
  11.2 Fightin’ Words
  11.3 Fictitious Prediction Problems
    11.3.1 Standardized Test Statistics as Measures of Separation
    11.3.2 χ² Test Statistics
    11.3.3 Multinomial Inverse Regression
  11.4 Conclusion

CHAPTER 12 Clustering
  12.1 An Initial Example Using k-Means Clustering
  12.2 Representations for Clustering
  12.3 Approaches to Clustering
    12.3.1 Components of a Clustering Method
    12.3.2 Styles of Clustering Methods
    12.3.3 Probabilistic Clustering Models
    12.3.4 Algorithmic Clustering Models
    12.3.5 Connections between Probabilistic and Algorithmic Clustering
  12.4 Making Choices
    12.4.1 Model Selection
    12.4.2 Careful Reading
    12.4.3 Choosing the Number of Clusters
  12.5 The Human Side of Clustering
    12.5.1 Interpretation
    12.5.2 Interactive Clustering
  12.6 Conclusion

CHAPTER 13 Topic Models
  13.1 Latent Dirichlet Allocation
    13.1.1 Inference
    13.1.2 Example: Discovering Credit Claiming for Fire Grants in Congressional Press Releases
  13.2 Interpreting the Output of Topic Models
  13.3 Incorporating Structure into LDA
    13.3.1 Structure with Upstream, Known Prevalence Covariates
    13.3.2 Structure with Upstream, Known Content Covariates
    13.3.3 Structure with Downstream, Known Covariates
    13.3.4 Additional Sources of Structure
  13.4 Structural Topic Models
    13.4.1 Example: Discovering the Components of Radical Discourse
  13.5 Labeling Topic Models
  13.6 Conclusion

CHAPTER 14 Low-Dimensional Document Embeddings
  14.1 Principal Component Analysis
    14.1.1 Automated Methods for Labeling Principal Components
    14.1.2 Manual Methods for Labeling Principal Components
    14.1.3 Principal Component Analysis of Senate Press Releases
    14.1.4 Choosing the Number of Principal Components
  14.2 Classical Multidimensional Scaling
    14.2.1 Extensions of Classical MDS
    14.2.2 Applying Classical MDS to Senate Press Releases
  14.3 Conclusion

PART IV MEASUREMENT

CHAPTER 15 Principles of Measurement
  15.1 From Concept to Measurement
  15.2 What Makes a Good Measurement
    15.2.1 Principle 1: Measures should have Clear Goals
    15.2.2 Principle 2: Source Material should Always be Identified and Ideally Made Public
    15.2.3 Principle 3: The Coding Process should be Explainable and Reproducible
    15.2.4 Principle 4: The Measure should be Validated
    15.2.5 Principle 5: Limitations should be Explored, Documented and Communicated to the Audience
  15.3 Balancing Discovery and Measurement with Sample Splits

CHAPTER 16 Word Counting
  16.1 Keyword Counting
  16.2 Dictionary Methods
  16.3 Limitations and Validations of Dictionary Methods
    16.3.1 Moving Beyond Dictionaries: Wordscores
  16.4 Conclusion

CHAPTER 17 An Overview of Supervised Classification
  17.1 Example: Discursive Governance
  17.2 Create a Training Set
  17.3 Classify Documents with Supervised Learning
  17.4 Check Performance
  17.5 Using the Measure
  17.6 Conclusion

CHAPTER 18 Coding a Training Set
  18.1 Characteristics of a Good Training Set
  18.2 Hand Coding
    18.2.1 1: Decide on a Codebook
    18.2.2 2: Select Coders
    18.2.3 3: Select Documents to Code
    18.2.4 4: Manage Coders
    18.2.5 5: Check Reliability
    18.2.6 Managing Drift
    18.2.7 Example: Making the News
  18.3 Crowdsourcing
  18.4 Supervision with Found Data
  18.5 Conclusion

CHAPTER 19 Classifying Documents with Supervised Learning
  19.1 Naive Bayes
    19.1.1 The Assumptions in Naive Bayes are Almost Certainly Wrong
    19.1.2 Naive Bayes is a Generative Model
    19.1.3 Naive Bayes is a Linear Classifier
  19.2 Machine Learning
    19.2.1 Fixed Basis Functions
    19.2.2 Adaptive Basis Functions
    19.2.3 Quantification
    19.2.4 Concluding Thoughts on Supervised Learning with Random Samples
  19.3 Example: Estimating Jihad Scores
  19.4 Conclusion

CHAPTER 20 Checking Performance
  20.1 Validation with Gold-Standard Data
    20.1.1 Validation Set
    20.1.2 Cross-Validation
    20.1.3 The Importance of Gold-Standard Data
    20.1.4 Ongoing Evaluations
  20.2 Validation without Gold-Standard Data
    20.2.1 Surrogate Labels
    20.2.2 Partial Category Replication
    20.2.3 Nonexpert Human Evaluation
    20.2.4 Correspondence to External Information
  20.3 Example: Validating Jihad Scores
  20.4 Conclusion

CHAPTER 21 Repurposing Discovery Methods
  21.1 Unsupervised Methods Tend to Measure Subject Better than Subtleties
  21.2 Example: Scaling via Differential Word Rates
  21.3 A Workflow for Repurposing Unsupervised Methods for Measurement
    21.3.1 1: Split the Data
    21.3.2 2: Fit the Model
    21.3.3 3: Validate the Model
    21.3.4 4: Fit to the Test Data and Revalidate
  21.4 Concerns in Repurposing Unsupervised Methods for Measurement
    21.4.1 Concern 1: The Method Always Returns a Result
    21.4.2 Concern 2: Opaque Differences in Estimation Strategies
    21.4.3 Concern 3: Sensitivity to Unintuitive Hyperparameters
    21.4.4 Concern 4: Instability in results
    21.4.5 Rethinking Stability
  21.5 Conclusion

PART V INFERENCE

CHAPTER 22 Principles of Inference
  22.1 Prediction
  22.2 Causal Inference
    22.2.1 Causal Inference Places Identification First
    22.2.2 Prediction Is about Outcomes That Will Happen, Causal Inference is about Outcomes from Interventions
    22.2.3 Prediction and Causal Inference Require Different Validations
    22.2.4 Prediction and Causal Inference Use Features Differently
  22.3 Comparing Prediction and Causal Inference
  22.4 Partial and General Equilibrium in Prediction and Causal Inference
  22.5 Conclusion

CHAPTER 23 Prediction
  23.1 The Basic Task of Prediction
  23.2 Similarities and Differences between Prediction and Measurement
  23.3 Five Principles of Prediction
    23.3.1 Predictive Features do not have to Cause the Outcome
    23.3.2 Cross-Validation is not Always a Good Measure of Predictive Power
    23.3.3 It’s Not Always Better to be More Accurate on Average
    23.3.4 There can be Practical Value in Interpreting Models for Prediction
    23.3.5 It can be Difficult to Apply Prediction to Policymaking
  23.4 Using Text as Data for Prediction: Examples
    23.4.1 Source Prediction
    23.4.2 Linguistic Prediction
    23.4.3 Social Forecasting
    23.4.4 Nowcasting
  23.5 Conclusion

CHAPTER 24 Causal Inference
  24.1 Introduction to Causal Inference
  24.2 Similarities and Differences between Prediction and Measurement, and Causal Inference
  24.3 Key Principles of Causal Inference with Text
    24.3.1 The Core Problems of Causal Inference Remain, even when Working with Text
    24.3.2 Our Conceptualization of the Treatment and Outcome Remains a Critical Component of Causal Inference with Text
    24.3.3 The Challenges of Making Causal Inferences with Text Underscore the Need for Sequential Science
  24.4 The Mapping Function
    24.4.1 Causal Inference with g
    24.4.2 Identification and Overfitting
  24.5 Workflows for Making Causal Inferences with Text
    24.5.1 Define g before Looking at the Documents
    24.5.2 Use a Train/Test Split
    24.5.3 Run Sequential Experiments
  24.6 Conclusion

CHAPTER 25 Text as Outcome
  25.1 An Experiment on Immigration
  25.2 The Effect of Presidential Public Appeals
  25.3 Conclusion

CHAPTER 26 Text as Treatment
  26.1 An Experiment Using Trump’s Tweets
  26.2 A Candidate Biography Experiment
  26.3 Conclusion

CHAPTER 27 Text as Confounder
  27.1 Regression Adjustments for Text Confounders
  27.2 Matching Adjustments for Text
  27.3 Conclusion

PART VI CONCLUSION

CHAPTER 28 Conclusion
  28.1 How to Use Text as Data in the Social Sciences
    28.1.1 The Focus on Social Science Tasks
    28.1.2 Iterative and Sequential Nature of the Social Sciences
    28.1.3 Model Skepticism and the Application of Machine Learning to the Social Sciences
  28.2 Applying Our Principles beyond Text Data
  28.3 Avoiding the Cycle of Creation and Destruction in Social Science Methodology

Acknowledgments
Bibliography
Index