Book Contents and Contributors


Chapter 1: Introduction

This section provides a brief overview of the goals and structure of the book.



I. Capture and Curation

Chapter 2: Working with Web Data and APIs

Cameron Neylon, Curtin University

This chapter will show you how to extract information from social media about the transmission of knowledge. The particular application will be to develop links to authors' articles on Twitter using PLOS articles and to pull information using an API. You will get link data from bookmarking services, citations from Crossref, links from Facebook, and information from news coverage. The running examples are drawn from Twitter. In keeping with the social science grounding that is a core feature of the book, the chapter discusses what can be captured, what is potentially reliable, and how to manage data quality issues.
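As a minimal sketch of the API pattern the chapter builds on, the snippet below constructs a request URL for Crossref's public works endpoint and extracts the citation count from a response. The payload here is a small invented example that mimics the shape of a Crossref response; no live request is made.

```python
import json

def crossref_works_url(doi):
    """Build the URL for Crossref's public works endpoint."""
    return "https://api.crossref.org/works/" + doi

def extract_citation_count(payload):
    """Pull the 'is-referenced-by-count' field from a Crossref-style response."""
    return payload["message"].get("is-referenced-by-count", 0)

# A tiny invented payload mimicking the Crossref response shape.
sample = json.loads('{"message": {"DOI": "10.1371/journal.pone.0000000", '
                    '"is-referenced-by-count": 42}}')

url = crossref_works_url(sample["message"]["DOI"])
count = extract_citation_count(sample)
```

In practice you would fetch the URL with an HTTP library and handle rate limits and errors; the parsing step is the same.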


Chapter 3: Record Linkage

Joshua Tokle, Amazon, Stefan Bender, Deutsche Bundesbank 

Big data differs from survey data in that it is typically necessary to combine data from multiple sources to get a complete picture of the activities of interest. Although computer scientists tend to simply “mash” data sets together, social scientists are rightfully concerned about issues of missing links, duplicative links, and erroneous links. This chapter provides an overview of traditional rule-based and probabilistic approaches, as well as the important contribution of machine learning to record linkage.
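The contrast between rule-based and probabilistic linkage can be sketched in a few lines. The toy scorer below, with invented records and weights, computes a weighted sum of field agreements in the spirit of the Fellegi-Sunter framework; a real system would use string-similarity comparators and estimated weights rather than exact matches and hand-picked numbers.

```python
def field_agreement(a, b):
    """Toy comparator: 1.0 on (case-insensitive) exact match, else 0.0.
    Real linkage systems use similarity measures such as Jaro-Winkler."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

def match_score(rec1, rec2, weights):
    """Weighted sum of field agreements across the comparison fields."""
    return sum(w * field_agreement(rec1[f], rec2[f]) for f, w in weights.items())

# Invented example records and hand-picked weights.
weights = {"last_name": 4.0, "first_name": 2.0, "birth_year": 3.0}
a = {"last_name": "Smith", "first_name": "Joshua", "birth_year": "1980"}
b = {"last_name": "smith", "first_name": "Josh",   "birth_year": "1980"}

score = match_score(a, b, weights)
is_link = score >= 6.0  # threshold separating links from non-links
```

A machine learning approach replaces the hand-picked weights and threshold with a classifier trained on labeled pairs.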


Chapter 4: Databases

Ian Foster, University of Chicago, Pascal Heus, Metadata Technology North America

Once the data have been collected and linked into different files, it is necessary to store and organize them. Social scientists are used to working with one analytical file, often in SAS, Stata, SPSS, or R. This chapter, which may be the most important chapter in the book, describes different approaches to storing data in ways that permit rapid and reliable exploration and analysis.
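As a small illustration of the shift from one analytical file to a queryable database, the sketch below uses Python's built-in sqlite3 module. The table and records are invented; the point is that SQL lets you aggregate and explore without loading everything into a single flat file.

```python
import sqlite3

# In-memory database for illustration; a real project would use a file
# or a client-server system such as PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grants (id INTEGER PRIMARY KEY, pi TEXT, amount REAL)")
conn.executemany("INSERT INTO grants (pi, amount) VALUES (?, ?)",
                 [("Foster", 50000.0), ("Heus", 75000.0), ("Foster", 25000.0)])
conn.commit()

# Aggregate in the database rather than in one analytical file.
rows = conn.execute(
    "SELECT pi, SUM(amount) FROM grants GROUP BY pi ORDER BY pi").fetchall()
conn.close()
```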


Chapter 5: Programming with Big Data

Huy Vo, City University of New York, Claudio Silva, NYU

Big data is sometimes defined as data that are too big to fit onto the analyst’s computer. This chapter provides an overview of clever programming techniques that facilitate the use of data (often using parallel computing). While the focus is on one of the most widely used big data programming paradigms and its most popular implementation, Apache Hadoop, the goal of the chapter is to provide a conceptual framework to the key challenges that the approach is designed to address.
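The MapReduce paradigm behind Hadoop can be mimicked in plain Python on the classic word-count example. This is only a conceptual sketch of the map, shuffle, and reduce phases; Hadoop runs each phase in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) pairs, as a Hadoop mapper would."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group values by key across all mapper outputs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big computing", "data science"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
```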


II. Modeling and Analysis

Chapter 6: Machine Learning

Rayid Ghani, University of Chicago, Malte Schierholz, University of Mannheim

This chapter introduces you to the value of machine learning in the social sciences, particularly focusing on the overall machine learning process as well as clustering and classification methods. You will get an overview of the machine learning pipeline and methods and how those methods are applied to solve social science problems. The goal is to give an intuitive explanation of the methods and to provide practical tips on using them in practice.
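Clustering, one of the two method families the chapter covers, can be illustrated with a toy one-dimensional k-means written from scratch (real work would use a library such as scikit-learn). The data points and starting centroids below are invented.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Toy one-dimensional k-means: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = {c: [] for c in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(pts) / len(pts) if pts else centroids[c]
                     for c, pts in clusters.items()]
    return sorted(centroids)

# Two obvious groups of invented measurements.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centers = kmeans_1d(points, centroids=[0.0, 5.0])
```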



Chapter 7: Text Analysis

Evgeny Klochikhin, American Institutes for Research, Jordan Boyd-Graber, University of Colorado

This chapter provides an overview of how social scientists can make use of one of the most exciting advances in big data—text analysis. Vast amounts of data that are stored in documents can now be analyzed and searched so that different types of information can be retrieved. Documents (and the underlying activities of the entities that generated the documents) can be categorized into topics or fields as well as summarized. In addition, machine translation can be used to compare documents in different languages.
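A building block for retrieval and summarization is term weighting. The sketch below computes TF-IDF from scratch on two invented one-line "documents": a term scores highly when it is frequent in a document but rare across the collection, which is why the shared word scores zero.

```python
import math

def tf_idf(docs):
    """Toy TF-IDF: term frequency times log inverse document frequency."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    scores = []
    for tokens in tokenized:
        tf = {t: tokens.count(t) / len(tokens) for t in set(tokens)}
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

docs = ["grant funding report", "grant network analysis"]
scores = tf_idf(docs)
```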


Chapter 8: Networks: The Basics

Jason Owen-Smith, University of Michigan

Social scientists are typically interested in describing the activities of individuals and organizations (such as households and firms) in a variety of economic and social contexts. The frame within which data have been collected will typically have been generated from tax or other programmatic sources. The new types of data permit new units of analysis—particularly network analysis—largely enabled by advances in mathematical graph theory. This chapter provides an overview of how social scientists can use network theory to generate measurable representations of patterns of relationships connecting entities. As the author points out, the value of the new framework is not only in constructing different right-hand-side variables but also in studying an entirely new unit of analysis, one that lies somewhere between the largely atomistic actors that occupy the markets of neoclassical theory and the tightly managed hierarchies that are the traditional object of inquiry of sociologists and organizational theorists.
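One of the simplest measurable representations the chapter discusses is degree centrality: a node's number of ties divided by the maximum possible number of neighbors. The sketch below computes it from an invented set of collaboration ties; a real analysis would use a library such as NetworkX.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Each node's degree divided by (n - 1), the maximum possible
    number of neighbors in an undirected network of n nodes."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    n = len(adj)
    return {node: len(neighbors) / (n - 1) for node, neighbors in adj.items()}

# Invented collaboration ties among four researchers.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
centrality = degree_centrality(edges)
```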


III. Inference and Ethics

Chapter 9: Information Visualization

M. Adil Yalçın and Catherine Plaisant, University of Maryland

This chapter will show you how to explore data and communicate results so that data can be turned into interpretable, actionable information. There are many ways of presenting statistical information that convey content in a rigorous manner. The goal of this chapter is to present an introductory overview of effective visualization techniques for a range of data types and tasks, and to explore the foundations and challenges of information visualization.


Chapter 10: Errors and Inference

Paul P. Biemer, University of North Carolina

This chapter deals with inference and the errors associated with big data. Social scientists know only too well the cost associated with bad data—we highlighted both the classic Literary Digest example and the more recent Google Flu Trends problems in Chapter 1. Although the consequences are well understood, the new types of data are so large and complex that their properties often cannot be studied in traditional ways. In addition, the data generating function is such that the data are often selective, incomplete, and erroneous. Without proper data hygiene, the errors can quickly compound. This chapter provides, for the first time, a systematic way to think about the error framework in a big data setting.


Chapter 11: Privacy and Confidentiality

Stefan Bender, Deutsche Bundesbank, Ron S. Jarmin, US Census Bureau, Frauke Kreuter, University of Maryland, Julia Lane, NYU

This chapter addresses the issue that sits at the core of any study of human beings—privacy and confidentiality. In a new field, like the one covered in this book, it is critical that many researchers have access to the data so that work can be replicated and built upon—that there be a scientific basis to data science. Yet the rules that social scientists have traditionally used for survey data, namely anonymity and informed consent, no longer apply when data are collected in the wild. This concluding chapter identifies the issues that must be addressed for responsible and ethical research to take place.


Chapter 12: Workbooks

Jonathan Scott Morgan, Michigan State University, Christina Jones and Ahmad Emad, American Institutes for Research

This final chapter provides an overview of the Python workbooks that accompany each chapter. These workbooks, implemented in Jupyter notebooks, combine explanatory text with code you can run. They walk through techniques and approaches selected from each chapter and provide thorough implementation details, enabling students and interested practitioners to get up to speed quickly and start using the technologies covered in the book. We hope you have a lot of fun with them.