Empirical Research with Large Datasets
Some reflections on the 2022 BPLIM conference
I recently had the great privilege of attending and speaking at the 2022 edition of the Banco de Portugal Microdata Research Laboratory (BPLIM) conference.
It was a truly inspiring conference that dealt with a topic most dear to me: the challenges that come with using highly granular microdata to enable data-driven decision making and policy design. Big data, which I will loosely define as “data too big to fit into your retail-grade computer’s memory” (~20–30GB+), are becoming more and more common in economic research. However, in undergraduate economics education — and perhaps even at the graduate level — students typically learn very little about how to handle large amounts of data. The focus is (was?) almost exclusively on statistical and econometric modelling and less so on how to prepare the data. That said, over the past years I have observed an increased interest in “data science approaches” in economic research. In particular, the following features tend to gain more prominence:
- Open source programming languages
- Code version control systems
- Reproducible research
- Non-conventional datasets (web scraping etc.)
Open source programming languages
During my Ph.D. (which I started in 2014), Stata and Matlab were by far the most popular programming languages. It seems to me this is slowly changing towards open source alternatives, notably R and Python (and perhaps to a lesser extent Julia). I could not find any good data on this except a 2020 paper by Lars Vilhuber (who also spoke at BPLIM, see below), in which he plots the usage of different languages in AEA supplements over time. From around 2017 onwards one can see a slight upward trend for Python and R, although they are still dwarfed by Stata and Matlab. My guess would be that this trend continued after 2019:
(Source: Reproducibility and Replicability in Economics)
There has been enough said about the advantages of open source languages. For economists, the most important ones to me seem to be, first, the huge online support networks (e.g. StackOverflow) where one can easily find help and, second, the fact that many such languages provide better support for non-traditional datasets. Consider for example Stata: it has awesome features if you work with survey data, but you probably would not want to do web scraping with it. General purpose open source languages like Python or R typically have some (user-created) package that provides all, or at least most, of the functionality you need. There is surprisingly little guidance in the literature on which programming language to use, possibly because things change quickly and any guide would soon be outdated (though I found some older work by David Kendrick and Hans Amman and a more recent study that focuses on estimating DSGE models).
My overall recommendation would be: use SQL for pre-processing if you have access to a database, and do data cleaning, matching, merging, filtering, and reshaping in Python. For traditional econometric regressions, R seems best right now. This “stack” has the additional benefit of being used extensively in industry (unlike Stata/Matlab), which can be attractive for young researchers who are not fully committed to a career in academia.
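To make the Python part of that stack concrete, here is a minimal sketch (with made-up toy data) of the kind of cleaning, filtering, and merging step I have in mind, using pandas:

```python
import pandas as pd

# Hypothetical example: merge messy firm-level data with a country lookup table
firms = pd.DataFrame({
    "firm_id": [1, 2, 3],
    "country_code": ["PT", "DE", "PT"],
    "revenue": ["1,000", "2,500", None],  # messy string column
})
countries = pd.DataFrame({
    "country_code": ["PT", "DE"],
    "country": ["Portugal", "Germany"],
})

# cleaning: strip thousands separators and cast to numeric
firms["revenue"] = pd.to_numeric(
    firms["revenue"].str.replace(",", ""), errors="coerce"
)

# filtering and merging: keep complete observations, attach country names
merged = firms[firms["revenue"].notna()].merge(countries, on="country_code")
print(merged)
```

The same steps are of course expressible in SQL or R; the point is that one short, readable script covers the whole pre-processing chain.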
Code version control systems
Code version control systems like Git may seem unfamiliar to social scientists at first, but they offer significant benefits. The need for proper code version control grows as more co-authors work on a project. In policy-oriented work, where code, data pipelines, and analyses are executed repeatedly, version control is even more critical to ensure replicable results. However, it is only one part of the solution: if the underlying data changes, even well-tracked code may produce incorrect results, especially if you do not own the data, as in a corporate data warehouse. This is where data version control (e.g. DVC) comes in. I admit that the current offerings in this field are not entirely satisfactory. They work well for local data, but connecting to a Hadoop cluster presents challenges. Delta Lake data lakes appear to be a promising solution, but they are relatively new and not yet widely accessible to economists.
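As a stopgap until proper data version control is in place, one can at least fingerprint the input data so that silent upstream changes become detectable. A minimal sketch in Python (the chunked reading is deliberate, so even files larger than memory can be fingerprinted):

```python
import hashlib
import os
import tempfile

def file_fingerprint(path: str, algo: str = "sha256") -> str:
    """Hash a data file in fixed-size chunks so large files never
    have to fit into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical workflow: store the fingerprint alongside your results;
# if the upstream extract changes, the fingerprint changes with it.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"id,value\n1,42\n")
    path = tmp.name

print(file_fingerprint(path))
os.remove(path)
```

Committing such fingerprints next to your code is a crude but effective way to prove which snapshot of the data produced which result.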
Reproducible research
Related to the previous point, the replication crisis in the social sciences has spurred interest in tools that make empirical results more robust and reproducible. In economics, the biggest scourges are p-hacking, i.e. researchers selecting empirical specifications with “significant” results (p-values below 10%), and publication bias (favouring results where the null hypothesis can be rejected). Two papers that I can wholeheartedly recommend on the subject are Methods Matter and Star Wars: The Empirics Strike Back. To maintain the credibility of economics, new research methods using observational data (frequently the only data available) must be devised. Code version control systems can assist, but deeper incentive issues remain. The economics community must create incentives for reproducible work, including tenure prospects 😏.
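To see why p-hacking is so corrosive, consider a small simulation sketch: a hypothetical researcher tries 20 noise-only “specifications” per paper and reports whenever any of them looks significant. Even though the null hypothesis is true everywhere, far more than 5% of papers end up with a “result”:

```python
import random

random.seed(0)

def t_like_stat(sample):
    # crude standardized mean for a sample drawn under a null of mean 0
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean / (var / n) ** 0.5

# Each "paper" tries 20 independent noise specifications (n = 30 each)
# and reports if ANY of them clears the |t| > 1.96 bar.
papers, false_positives = 1000, 0
for _ in range(papers):
    if any(abs(t_like_stat([random.gauss(0, 1) for _ in range(30)])) > 1.96
           for _ in range(20)):
        false_positives += 1

print(false_positives / papers)  # far above the nominal 5% level
```

With 20 independent tries, the chance of at least one spurious hit is roughly 1 − 0.95²⁰ ≈ 64%, which is exactly what the simulation recovers.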
Non-conventional datasets
Non-conventional datasets, such as data obtained through web scraping, geo and satellite data, and social network data, have gained significant prominence in economic research over the past decade. One example is the “nowcasting paper” by Edward Glaeser and his co-authors, in which they use Yelp restaurant reviews to nowcast economic activity in a given area. Official statistics often appear only with a multi-year lag, so this provides value not only to academics but also to policy makers who require more up-to-date data. While I find many of these new papers interesting, one crucial problem tends to be linking non-conventional data to conventional datasets. In my opinion, the entire field should move towards an open-source culture in which derived datasets (e.g. match tables) are shared within the community. I understand that there may be incentive problems, but other fields, such as computer science, have made this work. Therefore, I believe that it is possible for our field to do the same.
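A match table of the kind I have in mind can be as simple as a mapping from raw scraped names to official identifiers. A toy sketch (all names and identifiers are made up) of building one via string normalization:

```python
import re

def normalize(name: str) -> str:
    """Canonicalize a business name for matching."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", "", name)                # drop punctuation
    name = re.sub(r"\b(inc|ltd|llc|co)\b", "", name)   # drop legal suffixes
    return " ".join(name.split())                      # collapse whitespace

# Hypothetical inputs: scraped names vs. an official business register
scraped = ["Joe's Pizza, Inc.", "ACME Ltd."]
register = {"joes pizza": "FIRM-001", "acme": "FIRM-002"}

# the reusable artefact: raw name -> official identifier (None if unmatched)
match_table = {raw: register.get(normalize(raw)) for raw in scraped}
print(match_table)
```

Real linkages need fuzzier matching and manual review, but once built, exactly this kind of table is what could be shared so the next team does not redo the work.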
The themes mentioned above were also covered in some detail at the BPLIM workshop, in particular how they come into play as the data under consideration gets bigger and bigger. In my own presentation, I tried to outline how a distributed computing solution like Spark can help speed up data preparation and analysis. You can find the presentation here, along with all the other presentations. I summarize the talks below so you can pick the ones that sound most interesting to you.
Lars Vilhuber: Reproducibility when data are confidential or big
Lars Vilhuber is the Executive Director of the Labor Dynamics Institute at Cornell University, Senior Research Associate in the Economics Department of Cornell University, and the American Economic Association’s Data Editor. His presentation focused on the additional challenges of creating reproducible research when confidential data are used. He presented some interesting ideas, ranging from hashes as a proof of work to the National Science Foundation’s initiatives in the field of reproducible research.
Ulrich Matter: Big Data analytics. A guide for economists making the transition to Big Data
Ulrich Matter is an Assistant Professor of Economics at the University of St. Gallen’s SEPS-HSG. He started with anecdotal stories of data processing gone wrong (like a panel regression taking 3.5 days because the author never investigated why it took so long. Spoiler alert: memory issues). He then provided some examples of how to work with R and SQL on big data. In my opinion, among the most useful suggestions were the R packages
pryr to help you understand memory consumption in your application:
```r
# packages
library(pryr)  # memory profiling

# Example: how does a line of code affect memory?
# initiate a vector with 1000 (pseudo-)random numbers
mem_change(thousand_numbers <- runif(1000))
#> 15.6 kB

# initiate a vector with 1M (pseudo-)random numbers
mem_change(a_million_numbers <- runif(1000^2))
#> 8 MB
```
bench to monitor computing time:
```r
# packages
library(bench)  # computing time profiling

# Example: compare speed of alternative implementations
mark(
  # apply-approach to compute square roots
  sqrts1 <- sapply(thousand_numbers, sqrt),
  # exploit vectorization to compute square roots
  sqrts2 <- sqrt(thousand_numbers)
)[, c(1, 4)]
#> # A tibble: 2 x 3
#>   expression                                `itr/sec` mem_alloc
#>   <bch:expr>                                    <dbl> <bch:byt>
#> 1 sqrts1 <- sapply(thousand_numbers, sqrt)     3333.    31.67KB
#> 2 sqrts2 <- sqrt(thousand_numbers)           261692.     7.86KB
```
Moreover, Ulrich presented some examples of how to work with SQL and sparklyr rather than loading all data into your R/Stata/Matlab session.
Frauke Kreuter: Large surveys and other continuous data streams in statistics production
Frauke Kreuter holds the Chair of Statistics and Data Science at LMU Munich, Germany, and at the University of Maryland, USA; she is Co-director of the Social Data Science Center (SoDa) and a faculty member in the Joint Program in Survey Methodology (JPSM). She presented two recent research projects. The first dealt with a survey that was essentially a live stream of Facebook users’ perceptions of Covid-19 (ranging from people’s perception of the riskiness of the disease to their attitudes towards vaccination). One slide of her presentation plots the correlation between self-reported Covid-19 symptoms on Facebook and confirmed cases:
I had previously been skeptical about the value of survey data in the era of big data. With so much data being automatically recorded and stored, do we still need surveys? Frauke effectively argued for their importance, although it must be acknowledged that big data surveys will be highly dependent on having an industry partner, such as Facebook, with access to the desired audience.
John Horton: Thoughts on Working with Corporate data
John Horton is an economist, serving as the Richard S. Leghorn Career Development Professor at the MIT Sloan School of Management and a Faculty Research Fellow at the NBER. The first part of his talk centered on establishing a productive project structure for research projects. The second part was particularly interesting, as he shared his insights on obtaining private sector cooperation in research projects and data sharing.
Mauricio Bravo: How to work efficiently with large datasets
Mauricio Bravo is a second-year Economics PhD student at Brown University, where he works on Health, Labor, and IO. He previously worked as a research assistant at Yale, NBER, and MIT. His presentation dealt with how to work efficiently with large datasets in Stata. I have worked quite a bit with Stata in the past, but I still learned a lot of new tricks. For example, using
```
. compress
  variable active was float now byte
  ...
  variable country was str80 now str32
  (29,993,145 bytes saved)
```
one can “compress” variables to their minimum required datatype without losing any information. I tried this on the last dataset I had worked with in Stata and was able to reduce its memory footprint of 103 MB by 29.9 MB.
Moreover, he is the author of the gtools suite for Stata. I have used it in the past, and now I got to meet the person who coded it up!
Miguel Portela and Nelson Areal: Open source tools for large data
Miguel Portela and Nelson Areal showcased various methods for managing large data with Arrow, DuckDB, and Spark using out-of-core computation, i.e. frameworks that efficiently manage which parts of the data are read into memory for a given operation. They not only had impressive slides but also demonstrated an example application in a Singularity container.
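The out-of-core idea is easy to demonstrate even with the standard library. Below I use sqlite3 as a stand-in for DuckDB (which exposes a very similar SQL interface from Python): the aggregation runs inside the database engine, so only the small result set ever enters the Python process:

```python
import os
import sqlite3

# Sketch of the out-of-core idea: the raw rows live on disk, and the
# GROUP BY runs inside the engine, not in Python's memory.
path = "demo.db"
con = sqlite3.connect(path)
con.execute("CREATE TABLE trades (country TEXT, amount REAL)")
con.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [("PT", 10.0), ("PT", 5.0), ("DE", 7.0)],
)
con.commit()

# Only the grouped totals cross into Python, however many rows exist on disk.
totals = dict(
    con.execute("SELECT country, SUM(amount) FROM trades GROUP BY country")
)
print(totals)

con.close()
os.remove(path)
```

With DuckDB the same query can additionally run straight over Parquet or Arrow files, which is what makes the combination the presenters showed so attractive for large research datasets.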
The conference was highly informative, and I warmly recommend taking a more in-depth look at the presentations.