In addition to doing my own research, I am interested in improving the productivity of researchers more generally by creating open-source datasets. The first (two) in this genre are at and are described below. They will be joined by three additional datasets, all of which are made possible by the generous support of the Alfred P. Sloan Foundation:
1. patent-paper pairs (with Ryan Shin)
2. author-inventor crosswalk (with Lee Fleming and Emma Scharfman)
3. age of privately-held patent assignees (with Mike Ewens)
Reliance on Science in Patenting
Citations from patents to other patents have frequently been employed in studies of innovation, but these citations have many limitations. By contrast, citations from patents to non-patent materials, especially scientific articles, promise to be more useful but are much more difficult to discern because they appear in patent documents as unstructured text. We present methods for automatically linking patents to scientific papers from 1800-2018 and share the results publicly. Moreover, we characterize the performance of our algorithms and present ROC curves so that researchers can select data according to their sensitivity to false positives vs. false negatives. Our hope is that publicly available patent citations to science will fuel research on innovation, knowledge diffusion, technology commercialization, and other topics. Download at
Journal Commercial Impact Factor
Journals are commonly ranked based on Impact Factor, calculated for year t as the number of times articles from years t-1 and t-2 were cited during year t, divided by the number of articles published during years t-1 and t-2. We introduce a complementary measure of commercial impact by counting citations from patents instead of from papers, using the data from Marx & Fuegi (2019). Download at
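To make the definition concrete, here is a minimal Python sketch of the calculation described above. It is an illustration only, not the code used to build the dataset: the function name, the tuple layouts, and the choice of which patent date counts as the "citing year" are all assumptions.

```python
from collections import Counter


def commercial_impact_factor(articles, patent_citations, year):
    """Illustrative sketch; input layouts are hypothetical, not the dataset schema.

    articles:         iterable of (article_id, journal, pub_year)
    patent_citations: iterable of (article_id, citing_year), one row per
                      patent-to-article citation
    Returns a dict mapping journal -> commercial impact factor for `year`.
    """
    window = {year - 1, year - 2}

    # Denominator: articles each journal published in years t-1 and t-2.
    journal_of = {}
    n_articles = Counter()
    for article_id, journal, pub_year in articles:
        if pub_year in window:
            journal_of[article_id] = journal
            n_articles[journal] += 1

    # Numerator: patent citations made during year t to those articles.
    n_cites = Counter()
    for article_id, citing_year in patent_citations:
        if citing_year == year and article_id in journal_of:
            n_cites[journal_of[article_id]] += 1

    return {journal: n_cites[journal] / n for journal, n in n_articles.items()}
```

Journals with no articles in the two-year window are omitted rather than assigned a value, which avoids dividing by zero.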
"Hubs" of commercial R&D
In Bikard & Marx (2019) we find that academic discoveries conducted in proximity to "hubs" of commercial R&D in the same field are much more likely to be built upon by firms (as measured by citations from their patents, as calculated in Marx & Fuegi (2019)). This dataset defines those hubs of commercial R&D, listing for each USPTO subclass the latitude and longitude coordinates that act as the centroids of hubs for those subclasses. Download at
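One way such centroids might be used is to measure how far a given location lies from the nearest hub in its field. The sketch below uses the standard haversine great-circle formula. The `hubs` mapping and the function names are hypothetical, and this is not necessarily the proximity measure used in the paper.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_hub_km(lat, lon, subclass, hubs):
    """Distance from (lat, lon) to the closest hub centroid for `subclass`.

    hubs: hypothetical dict mapping subclass -> list of (lat, lon) centroids.
    Returns None when the subclass has no listed hubs.
    """
    centroids = hubs.get(subclass, [])
    if not centroids:
        return None
    return min(haversine_km(lat, lon, h_lat, h_lon) for h_lat, h_lon in centroids)
```

A researcher could then compare the returned distance against whatever radius they consider "in proximity"; that cutoff is a modeling choice left to the user.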