Work in Progress

Theory and applied

Local Zipf’s Law: Emergence of Zipf’s Law for Firms

Abstract

There is extensive literature on the evolution of Zipf’s law for cities. However, this project addresses two theoretical gaps in the evidence for Zipf’s law. First, there is little dynamic information on the evolution of Zipf’s law for firms; most papers present a static snapshot. Second, it is unclear how lower and higher distribution orders interact in the actual context of markets and their industries. To investigate the first gap, I use historical data on near-population firm distribution from the 17th to 20th centuries for the UK and the 19th to 20th centuries for the Russian Empire. These data allow me to zoom in on the evolution of Zipf’s law down to the precise geocoded address. Regarding the second gap, I examine different orders of firm interactions—such as local markets, industries, and trade networks—to explain the general Zipf’s law.

A great discussion on why this is important is available on MathOverflow here.

Selection With Many Missing-At-Random Values

Abstract Exclusion, distributional moment imputations (e.g., mean, median), and inverse probability weighting are common approaches to tackling missing data in historical sources. In this project, I formalise a bias that these approaches introduce when using OLS. Such imputations generally lead to regression attenuation bias and tend to overestimate the average effect for recorded data. I propose a solution using a class of stable linear regressions that accommodate missing values, including RIGID, SLRM, and mixed models. I then compare these approaches with alternative methods, such as Random Forest and k-nearest neighbours, to evaluate how much variation in variables with high levels of missing-at-random values is captured under each method.

Post-Selection Inference in Data-Driven Variable Selection

Abstract Significant progress has been made in using LASSO for variable selection when working with many controls. However, this has unexpectedly led to a growing body of literature that uses LASSO for data-driven variable selection in high-dimensional datasets. I show that using standard LASSO for selection introduces two types of misspecification. First, it leads to shrunk confidence intervals. Second, subsequent analyses—such as OLS—fail to condition the selection process, making results unstable, as small changes in the penalty coefficient can produce different sets of selected variables. I summarise how the post-selection inference (PoSI) framework, with adjusted confidence intervals, addresses these issues. Finally, I illustrate how conditioning on selection with adjusted confidence intervals changes the empirical results in Xue and Michalopoulos (2021) in QJE.