Why would economists care about text?

Advances in machine learning have provided a wealth of opportunities in many areas of empirical research. Economists like Susan Athey and Guido Imbens at Stanford have been working on estimation and inference, exploring how the non-parametric machine learning toolbox may complement traditional econometric methods. I have for some time been very interested in how machine learning can be used to create or classify novel and interesting data for economics research.

Robert Shiller (2017) coined the term 'narrative economics': the idea that narratives in the news and on social media may influence our perceptions of the economy and how we act within it. Do trending news articles about rising unemployment rates affect your willingness to switch jobs? Will a barrage of tweets and other social media posts about climate change influence your willingness to pay for renewable energy?

Economists who want to answer these questions will need to develop tools to collect, classify, and analyze large amounts of text, from newspapers to Twitter. Early attempts at this have already received some mainstream recognition in the profession, e.g. research by Alice Wu (2019) on narratives of gender bias among economists on the 'Economics Job Market Rumors' forum, published in the Review of Economics and Statistics. In this and future blog posts I will cover some of the resources available to get started with this kind of research.

Text mining in The New York Times archives

The New York Times is one of the most influential daily newspapers in the US and has been in circulation since 1851. Its developer portal provides free API keys for searching articles in complete issues dating back to the 1980s; older issues can be accessed through the NYT historical archives. To get started, we visit developer.nytimes.com and register, log in, and follow the step-by-step instructions under the 'Get Started' tab. Upon completing the process, we will have access to our API key, which we can copy and paste into our R environment:

Setup

In [ ]:
library(httr)     # HTTP requests (e.g. GET()) against web APIs
library(jsonlite) # parse JSON responses into R objects
library(dplyr)    # data manipulation and the %>% pipe

The NYTimes API returns search results as JSON, so the jsonlite package is needed to parse JSON in R and work with the web API. The httr package lets us send GET() requests, which simply means retrieving a page from a given URL.

In [ ]:
Sys.setenv(NYT_KEY="*Your API key*")

This is where we enter the API key we copied earlier. It will allow us to search for articles within the entire NYTimes corpus. Let's run an example search! The coronavirus pandemic of 2019/2020 has become one of the most destructive public health crises in modern memory (I'm socially distancing at home as I write this), with far-reaching effects on the broader economy. Narratives about how various lockdown measures may harm jobs and employment have been a central part of the policy discourse in recent months.

In [ ]:
q_term <- "coronavirus+unemployment" 
begin_date <- "20200501"
end_date <- "20200601"

We search for articles published in May 2020 that mention both coronavirus and unemployment, drawing a connection between the two. We then plug the search conditions, together with the API key, into the Article Search API JSON query:

In [ ]:
baseurl <- paste0(
  "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=",
  q_term,
  "&begin_date=",
  begin_date,
  "&end_date=",
  end_date,
  "&facet_filter=true&api-key=",
  Sys.getenv("NYT_KEY"))
In [ ]:
query <- fromJSON(baseurl)
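
For completeness, the same request can also be made explicitly with httr's GET(), which is what that package was loaded for. Below is a minimal sketch using the same endpoint and the main parameters; the object names resp and query2 are my own, the result should be equivalent to query above, and note that httr URL-encodes the query for us, so we can pass a plain space instead of a +:

In [ ]:
resp <- GET(
  "https://api.nytimes.com/svc/search/v2/articlesearch.json",
  query = list(q = "coronavirus unemployment",
               begin_date = begin_date,
               end_date = end_date,
               `api-key` = Sys.getenv("NYT_KEY"))
)
status_code(resp)  # 200 means the request succeeded
query2 <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
query2$response$meta$hits  # total number of matching articles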

The NYTimes API only returns 10 results per page, so we have to write a little extra code to loop over the pages and combine them into a dataframe (Friedman 2018). Quite simply, the number of pages we need is the number of search results (hits) divided by 10, rounded up; and since the first page has index 0, we subtract 1 to get the index of the last page.

In [ ]:
maxPages <- ceiling(query$response$meta$hits[1] / 10) - 1  # index of the last page (pages start at 0)

pages <- list()
for(i in 0:maxPages){
  nytSearch <- fromJSON(paste0(baseurl, "&page=", i), flatten = TRUE) %>% data.frame() 
  message("Retrieving page ", i)
  pages[[i+1]] <- nytSearch 
  Sys.sleep(1) # pause between requests to stay within the API's rate limit
}
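
The loop only stores each page as an element of a list; to get the single dataframe we are after, the pages still have to be stacked together. A minimal sketch using dplyr's bind_rows() (the name articles is mine; jsonlite's rbind_pages() would work just as well):

In [ ]:
articles <- bind_rows(pages)  # stack all pages into one dataframe
dim(articles)                 # one row per article, one column per metadata field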

The results come with a number of metadata fields (headline, abstract, publication date, lead paragraph, source) that can be used for classification. Let's look at some of the article abstracts from the first page:

In [ ]:
query$response$docs$abstract
 [1] "Unemployment claims exceed 40 million since the start of the pandemic, with 2.1 million added last week, but a backlog may be leaving many uncounted."           
 [2] "In the latest challenge over their labor status, gig workers say the state is illegally failing to pay them jobless benefits in a timely way."                   
 [3] "Emergency programs have cushioned the shutdown’s impact on workers and businesses and lifted the economy, but may not outlast the coronavirus crisis."           
 [4] "Government assistance, mainly stimulus checks, provided an overall boost in April. With little immediate outlet, savings soared."                                
 [5] "Jerome H. Powell, the chair of the Federal Reserve, warned of the impacts that a second wave of the coronavirus could have on the economy."                      
 [6] "The latest on stock market and business news during the coronavirus outbreak."                                                                                   
 [7] "Billions are going to zillionaires under the guise of pandemic relief."                                                                                          
 [8] "Investigators see evidence of a sophisticated international attack they said could siphon hundreds of millions of dollars that were intended for the unemployed."
 [9] "How a government both sectarian and divisive learned (briefly) to become inclusive."                                                                             
[10] "The economy is reopening, but office buildings remain empty — and may stay that way, permanently."

And the dates when they were published:

In [ ]:
query$response$docs$pub_date
 [1] "2020-05-28T12:46:15+0000" "2020-05-26T14:18:08+0000" "2020-05-28T22:43:51+0000"
 [4] "2020-05-29T21:09:27+0000" "2020-05-29T18:07:55+0000" "2020-05-28T08:15:21+0000"
 [7] "2020-05-23T18:30:08+0000" "2020-05-16T15:10:57+0000" "2020-05-18T06:07:39+0000"
[10] "2020-05-24T11:00:09+0000"

Going forward

As you might imagine, this kind of data can be used in multiple ways. We might create an index of articles connecting coronavirus with unemployment to see whether news reporting actually reflects changes in the labor market over time. We might use sentiment analysis to classify articles as more or less severe and, combined with household surveys, study whether the media influence people's beliefs. These techniques can also potentially be used to better understand the narratives surrounding historical events such as the Great Recession, 9/11, or the Brexit campaign. I will try to keep updating this blog as my work progresses.
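
To make those two ideas a bit more concrete, here is a minimal sketch of a daily article-count index and a crude sentiment tally over the abstracts. It assumes the merged articles dataframe from above (with flattened column names such as response.docs.pub_date and response.docs.abstract) and uses the tidytext package with its bundled Bing lexicon, neither of which was loaded earlier:

In [ ]:
library(tidytext)  # tokenization and the Bing sentiment lexicon (extra package)

# Daily count of matching articles: a simple "narrative index"
index <- articles %>%
  mutate(day = as.Date(response.docs.pub_date)) %>%  # keep only the date part
  count(day, name = "n_articles")

# Crude positive/negative word tally over the abstracts
sentiment <- articles %>%
  select(abstract = response.docs.abstract) %>%
  unnest_tokens(word, abstract) %>%                  # one row per word
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)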

References

Shiller, Robert (2017) "Narrative Economics", American Economic Review, 107(4): 967–1004

Wu, Alice H. (2019) "Gender Bias in Rumors among Professionals: An Identity-Based Interpretation", Review of Economics and Statistics, pp. 1-40

Friedman, Alice (2018) "NY Times Article Search", http://rstudio-pubs-static.s3.amazonaws.com/433948_af9b18781efb453da189f7e8da449e1c.html (accessed 2020-06-08)