Data for "Sampling Uncertainty of Research Topics"

dc.contributor: Bystrov, Victor
dc.contributor: Staszewska-Bystrova, Anna
dc.contributor: Winker, Peter
dc.contributor.author: Naboka-Krell, Viktoriia
dc.date.accessioned: 2026-06-23T09:35:16Z
dc.date.issued: 2026-06
dc.description.abstract: The replication package contains the data and analysis code for the study "Sampling Uncertainty of Research Topics", which examines the measurement of sampling uncertainty in latent topic models. The dataset comprises 19,059 abstracts submitted to the European Research Consortium for Informatics and Mathematics (ERCIM) and Computational and Financial Econometrics (CFE) conferences between 2007 and 2023. The abstracts have an average length of 158 words (including title) and underwent standardized preprocessing: after removal of special characters, numbers, and stopwords, followed by lemmatization (using spaCy with the en_core_web_lg model), a vocabulary of 1,844 unique words remained.

**Contents of the ZIP Files**

*Data_Simulations.zip*
Contains the raw data, processed datasets, and all analysis results relevant to the main study. This includes:
- The original abstracts (before and after preprocessing),
- The generated bootstrap corpora (for resampling analyses),
- The estimated topic models (including document-specific topic weights),
- The computed model fit metrics (e.g., sBIC for determining the optimal number of topics),
- The Python and R code for data preprocessing, model estimation (including Structural Topic Modeling), and bootstrap analysis,
- The results for Figures 1–5 and Tables 1–2 of the paper (e.g., sBIC trends, topic weights, recall metrics).

*Algorithm_Confidence_Bands_Word_Clouds.zip*
Includes the MATLAB code for generating uncertainty-adjusted word clouds (Figure 6 in the paper). This algorithm visualizes the robustness of topic-word probabilities across bootstrap replications by generating confidence bands for the top words of each topic.

*Algorithm_Counts_Top_Flop.zip*
Contains the MATLAB code for calculating the top and flop words for selected topics (basis for Figure 7). The code identifies the most and least stable words within a topic across all bootstrap samples, enabling a qualitative assessment of topic stability.

*Algorithm_Topic_Time_Series.zip*
Includes the code for generating the topic time series (basis for Figure 8). This algorithm aggregates document-specific topic weights on an annual basis and computes confidence bands for the temporal evolution of topic prevalence.
dc.description.sponsorship: Other third-party funders
dc.identifier.uri: https://jlupub.ub.uni-giessen.de/handle/jlupub/21549
dc.identifier.uri: https://doi.org/10.22029/jlupub-20896
dc.language.iso: en
dc.rights: Attribution-NonCommercial-ShareAlike 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.subject: topic modeling
dc.subject: bootstrapping
dc.subject: sampling uncertainty
dc.subject.ddc: ddc:000
dc.title: Data for "Sampling Uncertainty of Research Topics"
dc.type: Collection
local.affiliation: FB 02 - Wirtschaftswissenschaften
local.embargo.notice: This analysis was carried out as part of the project funded by the German-Polish Science Foundation. Title: Analiza porównawcza terminologii i termatów z niemieckiej i polskiej ekonomicznej literatury naukowej/Vergleichende Analyse von Terminologie und Themen in der deutschen und polnischen wirtschaftswissenschaftlichen Literatur. Project No.: 100-2024-00794
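
The abstract describes the topic time series step as aggregating document-specific topic weights by year and computing confidence bands for topic prevalence. The following is a minimal Python sketch of that idea, not the code shipped in the package. It assumes that each bootstrap replication has already been reduced to a list of (year, weight) pairs for a single topic, and it uses pointwise percentile bands, which may differ from the band construction used in the paper. The function names `annual_means`, `percentile`, and `confidence_bands` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def annual_means(doc_weights):
    """Average one topic's document weights within each year.

    doc_weights: iterable of (year, weight) pairs from one replication.
    """
    by_year = defaultdict(list)
    for year, weight in doc_weights:
        by_year[year].append(weight)
    return {year: mean(ws) for year, ws in by_year.items()}


def percentile(sorted_values, q):
    """Linear-interpolation percentile of a sorted list, with q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def confidence_bands(replications, level=0.90):
    """Pointwise percentile bands of annual topic prevalence.

    replications: list of per-replication (year, weight) lists (assumed input).
    Returns {year: (lower, upper)} with years in ascending order.
    """
    alpha = (1 - level) / 2
    per_year = defaultdict(list)
    for doc_weights in replications:
        for year, value in annual_means(doc_weights).items():
            per_year[year].append(value)
    bands = {}
    for year, values in sorted(per_year.items()):
        values.sort()
        bands[year] = (percentile(values, alpha), percentile(values, 1 - alpha))
    return bands
```

Calling `confidence_bands(replications, level=0.90)` returns one lower and one upper bound per year. In practice the number of bootstrap replications should be large enough for the chosen percentiles to be meaningful.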

Files

Original bundle

Name:
Algorithm_Confindence_Bands_Word_Clouds.zip
Size:
543.43 MB
Format:
Unknown data format
Name:
Algorithm_Counts_Top_Flop.zip
Size:
212.43 MB
Format:
Unknown data format
Name:
Algorithm_Topic_Time_Series.zip
Size:
213.77 MB
Format:
Unknown data format

License bundle

Name:
license.txt
Size:
7.58 KB
Format:
Item-specific license agreed upon to submission