Data for "Sampling Uncertainty of Research Topics"

Quotable link

Abstract

The replication package contains the data and analysis code for the study "Sampling Uncertainty of Research Topics", which examines the measurement of sampling uncertainty in latent topic models. The dataset comprises 19,059 abstracts submitted to the European Research Consortium for Informatics and Mathematics (ERCIM) and Computational and Financial Econometrics (CFE) conferences between 2007 and 2023. The abstracts have an average length of 158 words (including the title) and underwent standardized preprocessing: after special characters, numbers, and stopwords were removed and the text was lemmatized (using spaCy with the en_core_web_lg model), a vocabulary of 1,844 unique words remained.

**Contents of the ZIP Files**

*Data_Simulations.zip*

Contains the raw data, processed datasets, and all analysis results relevant to the main study. This includes:

- The original abstracts (before and after preprocessing),
- The generated bootstrap corpora (for resampling analyses),
- The estimated topic models (including document-specific topic weights),
- The computed model fit metrics (e.g., sBIC for determining the optimal number of topics),
- The Python and R code for data preprocessing, model estimation (including Structural Topic Modeling), and bootstrap analysis,
- The results for Figures 1–5 and Tables 1–2 of the paper (e.g., sBIC trends, topic weights, recall metrics).

*Algorithm_Confidence_Bands_Word_Clouds.zip*

Includes the MATLAB code for generating uncertainty-adjusted word clouds (Figure 6 in the paper). This algorithm visualizes the robustness of topic-word probabilities across bootstrap replications by generating confidence bands for the top words of each topic.

*Algorithm_Counts_Top_Flop.zip*

Contains the MATLAB code for calculating the top and flop words for selected topics (basis for Figure 7). The code identifies the most and least stable words within a topic across all bootstrap samples, enabling a qualitative assessment of topic stability.

*Algorithm_Topic_Time_Series.zip*

Includes the code for generating the topic time series (basis for Figure 8). This algorithm aggregates document-specific topic weights on an annual basis and computes confidence bands for the temporal evolution of topic prevalence.
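
The annual aggregation with confidence bands described for the topic time series can be illustrated with a minimal Python sketch. It is not the code shipped in the ZIP files. It assumes that each bootstrap replication is available as a list of (year, topic weight) pairs for a single topic and that the bands are pointwise percentile intervals across replications; the package may construct its bands differently. The names `annual_prevalence`, `percentile`, and `confidence_bands` are hypothetical.

```python
from collections import defaultdict


def annual_prevalence(docs):
    """Mean topic weight per year for one replication.

    docs: iterable of (year, weight) pairs for a single topic (assumed layout).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for year, weight in docs:
        sums[year] += weight
        counts[year] += 1
    return {year: sums[year] / counts[year] for year in sorted(sums)}


def percentile(sorted_values, q):
    """Linearly interpolated quantile of an ascending list, with q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def confidence_bands(replications, level=0.90):
    """Pointwise percentile bands per year across bootstrap replications.

    Assumption: percentile bands; other band constructions are possible.
    """
    per_rep = [annual_prevalence(rep) for rep in replications]
    alpha = (1 - level) / 2
    years = sorted({year for rep in per_rep for year in rep})
    bands = {}
    for year in years:
        # A year can be missing from a replication if no document was drawn.
        values = sorted(rep[year] for rep in per_rep if year in rep)
        bands[year] = (percentile(values, alpha), percentile(values, 1 - alpha))
    return bands


if __name__ == "__main__":
    # Toy input: three replications with made-up weights for one topic.
    toy = [
        [(2007, 0.2), (2007, 0.4), (2008, 0.1)],
        [(2007, 0.3), (2008, 0.3)],
        [(2007, 0.1), (2008, 0.2)],
    ]
    for year, (low, high) in confidence_bands(toy, level=0.5).items():
        print(year, round(low, 3), round(high, 3))
```

The toy weights in the usage block are invented for illustration and are not taken from the dataset. With real data, the number of replications would be far larger, and the band level would follow the choice made in the paper.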

Link to publications or other datasets

Description

Notes

Original publication in

Anthology

Collections

URI of original publication

Research data

Series

Citation