Data for "Sampling Uncertainty of Research Topics"

Naboka-Krell, Viktoriia

dc.contributor	Bystrov, Victor
dc.contributor	Staszewska-Bystrova, Anna
dc.contributor	Winker, Peter
dc.contributor.author	Naboka-Krell, Viktoriia
dc.date.accessioned	2026-06-23T09:35:16Z
dc.date.issued	2026-06
dc.description.abstract	The replication package contains the data and analysis code for the study "Sampling Uncertainty of Research Topics", which examines the measurement of sampling uncertainty in latent topic models. The dataset comprises 19,059 abstracts submitted to the European Research Consortium for Informatics and Mathematics (ERCIM) and Computational and Financial Econometrics (CFE) conferences between 2007 and 2023. The abstracts have an average length of 158 words (including title) and underwent standardized preprocessing: after removal of special characters, numbers, and stopwords, followed by lemmatization (using spaCy with the en_core_web_lg model), a vocabulary of 1,844 unique words remained.
Contents of the ZIP Files
Data_Simulations.zip: Contains the raw data, processed datasets, and all analysis results relevant to the main study. This includes:
- The original abstracts (before and after preprocessing),
- The generated bootstrap corpora (for resampling analyses),
- The estimated topic models (including document-specific topic weights),
- The computed model fit metrics (e.g., sBIC for determining the optimal number of topics),
- The Python and R code for data preprocessing, model estimation (including Structural Topic Modeling), and bootstrap analysis,
- The results for Figures 1–5 and Tables 1–2 of the paper (e.g., sBIC trends, topic weights, recall metrics).
Algorithm_Confidence_Bands_Word_Clouds.zip: Includes the MATLAB code for generating uncertainty-adjusted word clouds (Figure 6 in the paper). This algorithm visualizes the robustness of topic-word probabilities across bootstrap replications by generating confidence bands for the top words of each topic.
Algorithm_Counts_Top_Flop.zip: Contains the MATLAB code for calculating the top and flop words for selected topics (basis for Figure 7). The code identifies the most and least stable words within a topic across all bootstrap samples, enabling a qualitative assessment of topic stability.
Algorithm_Topic_Time_Series.zip: Includes the code for generating the topic time series (basis for Figure 8). This algorithm aggregates document-specific topic weights on an annual basis and computes confidence bands for the temporal evolution of topic prevalence.
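The topic time series step described above (annual aggregation of document-specific topic weights, then confidence bands across bootstrap replications) can be illustrated with a minimal Python sketch. This is not the code shipped in the package. The function names, the input layout (one list of years and one list of weights per bootstrap corpus), the use of the annual mean, and the pointwise percentile band are all assumptions made for illustration. The sketch also assumes that topics have already been matched across bootstrap replications.

```python
from collections import defaultdict
from statistics import mean


def yearly_topic_prevalence(years, weights):
    """Average the weight of one topic over all documents of each year."""
    by_year = defaultdict(list)
    for year, weight in zip(years, weights):
        by_year[year].append(weight)
    return {year: mean(ws) for year, ws in sorted(by_year.items())}


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 1]) of a sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def confidence_bands(replications, level=0.90):
    """Pointwise percentile bands over bootstrap replications.

    replications: list of (years, weights) pairs, one per bootstrap corpus
    (hypothetical layout). Returns {year: (lower, upper)}.
    """
    alpha = (1 - level) / 2
    series = [yearly_topic_prevalence(y, w) for y, w in replications]
    bands = {}
    for year in sorted(set().union(*series)):
        # Collect the annual prevalence of this year from every replication.
        vals = sorted(s[year] for s in series if year in s)
        bands[year] = (percentile(vals, alpha), percentile(vals, 1 - alpha))
    return bands


# Toy input: three bootstrap corpora with three documents each (made-up numbers).
toy = [
    ([2007, 2007, 2008], [0.2, 0.4, 0.1]),
    ([2007, 2008, 2008], [0.3, 0.1, 0.3]),
    ([2007, 2007, 2008], [0.1, 0.3, 0.2]),
]
bands = confidence_bands(toy, level=0.90)
```

Pointwise percentile bands are only one possible construction; the method used for Figure 8 may differ, so the package code remains the reference.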
dc.description.sponsorship	Other third-party funders
dc.identifier.uri	https://jlupub.ub.uni-giessen.de/handle/jlupub/21549
dc.identifier.uri	https://doi.org/10.22029/jlupub-20896
dc.language.iso	en
dc.rights	Attribution-NonCommercial-ShareAlike 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.subject	topic modeling
dc.subject	bootstrapping
dc.subject	sampling uncertainty
dc.subject.ddc	ddc:000
dc.title	Data for "Sampling Uncertainty of Research Topics"
dc.type	Collection
local.affiliation	FB 02 - Wirtschaftswissenschaften
local.embargo.notice	This analysis was carried out as part of a project funded by the Deutsch-Polnische Wissenschaftsstiftung. Title: Analiza porównawcza terminologii i tematów z niemieckiej i polskiej ekonomicznej literatury naukowej/Vergleichende Analyse von Terminologie und Themen in der deutschen und polnischen wirtschaftswissenschaftlichen Literatur (Comparative analysis of terminology and topics in German and Polish economic research literature). Project no.: 100-2024-00794