# Illiberal SCRIPTS — What to do next

This folder contains five CSV outputs produced by the R pipeline. The sections below describe what each file is for and suggest next steps.

## Files & purpose

1. **step03_freq_illiberal_terms.csv**: per *org × year* counts of dictionary hits for your illiberal term list (used for Fig. 1).
2. **step03_crosspollination_similarity.csv**: mean cosine similarity for each pair of regional organizations (ROs) within the same year (used for Fig. 2).
3. **step03_first_use_by_org.csv**: for each *lemma × org*, the first year the lemma appears (the building blocks for the diffusion analysis).
4. **step03_adoptions_with_origin.csv**: for each lemma and adopter org, the origin org and year, plus the computed adoption lag.
5. **step03_influence_edges_lag3.csv**: aggregated edges (origin → adopter) within a 3-year window, with weights equal to the number of terms (used for Fig. 3).
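How files 3 and 4 relate can be illustrated with a minimal Python sketch. It is not the R pipeline's code. The tuple layout `(lemma, org, first_year)`, the output keys, and the tie-breaking rule (alphabetical by org when two orgs share the earliest year) are all assumptions made for illustration, and the toy terms are invented.

```python
from collections import defaultdict

def adoptions_with_origin(first_use):
    """first_use: iterable of (lemma, org, first_year) tuples (assumed layout)."""
    by_lemma = defaultdict(list)
    for lemma, org, year in first_use:
        by_lemma[lemma].append((year, org))

    rows = []
    for lemma, uses in by_lemma.items():
        # Origin = earliest first use; ties fall to the alphabetically first org
        # (an assumption, the real pipeline may resolve ties differently).
        origin_year, origin_org = min(uses)
        for year, org in sorted(uses):
            rows.append({
                "lemma": lemma,
                "adopter": org,
                "origin": origin_org,
                "origin_year": origin_year,
                "lag": year - origin_year,  # lag in years; the origin itself gets 0
            })
    return rows

# Invented toy records, not taken from the real CSV.
sample = [
    ("term_a", "SCO", 2005),
    ("term_a", "CIS", 2007),
    ("term_a", "CSTO", 2009),
    ("term_b", "CIS", 2010),
]
rows = adoptions_with_origin(sample)
for row in rows:
    print(row)
```

This sketch keeps the origin's own row with a lag of 0; whether the pipeline does the same should be checked against the R script.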

## Quick data snapshot
- `freq_terms`: 60 rows; years 2002–2023; orgs CIS, CSTO, SCO; total hits 122,788.
- `similarity`: 108 rows; years 2002–2023; pairs CIS_CSTO, CIS_SCO, CSTO_CIS, CSTO_SCO, SCO_CIS, SCO_CSTO; average of the mean cosine similarity ≈ 0.0968.
- `first_use`: 6,537 rows; 3,200 unique terms; orgs CIS, CSTO, SCO; years 2002–2023.
- `adoptions`: 6,537 rows; 3,200 unique terms; lag mean ≈ 3.29, min 0, max 21.
- `edges`: 6 rows; routes (SCO, CSTO), (SCO, CIS), (CSTO, CIS), (CIS, SCO), (CSTO, SCO), (CIS, CSTO); total weight 1,096.

## Recommended next steps

### A) Substantive validation
- **Term list curation**: refine the `illib_terms` list in the R script. Remove overly generic items (e.g., *порядок*, "order") and add domain-specific phrases where the list has gaps.
- **False positives**: spot-check a few rows from `step03_first_use_by_org.csv` and `step03_adoptions_with_origin.csv` to confirm that the lemmas reflect your intended concepts.
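For the spot-check, a reproducible random sample is easier to revisit than hand-picked rows. The sketch below is one way to draw it in Python; the column names in the toy data (`lemma`, `org`, `first_year`) are assumptions and should be replaced with the headers actually present in the CSVs.

```python
import csv
import io
import random

def sample_rows(csv_file, k=20, seed=42):
    """Return up to k randomly chosen rows (as dicts) from an open CSV file."""
    rows = list(csv.DictReader(csv_file))
    rng = random.Random(seed)  # fixed seed so the same rows can be re-checked later
    return rng.sample(rows, min(k, len(rows)))

# Toy stand-in for step03_first_use_by_org.csv (invented rows, assumed headers).
toy = io.StringIO(
    "lemma,org,first_year\n"
    "term_a,SCO,2005\n"
    "term_b,CIS,2010\n"
    "term_c,CSTO,2012\n"
)
picked = sample_rows(toy, k=2)
for row in picked:
    print(row["lemma"], row["org"], row["first_year"])

# With the real file, something like:
# with open("step03_first_use_by_org.csv", newline="", encoding="utf-8") as f:
#     picked = sample_rows(f)
```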

### B) Sensitivity analysis
- **DFM trimming**: try different `min_termfreq` / `max_docfreq` thresholds and re‑run.
- **Lag window**: regenerate `influence_edges` with `lag_window = 2` and `lag_window = 4`, then compare the resulting networks against the 3-year baseline.
- **Dictionary vs. Topics**: complement the dictionary results with unsupervised topics (e.g., STM/LDA) as a robustness check.
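The lag-window comparison can also be prototyped directly from the adoption rows before re-running the full R pipeline. The following Python sketch makes several assumptions: the keys `origin`, `adopter`, and `lag` are hypothetical, and the window is taken to cover lags from 1 up to and including `lag_window`, with lag-0 rows excluded. The pipeline's actual boundary rule may differ.

```python
from collections import Counter

def influence_edges(adoptions, lag_window):
    """Count terms per (origin, adopter) route with 0 < lag <= lag_window."""
    edges = Counter()
    for row in adoptions:
        if row["origin"] != row["adopter"] and 0 < row["lag"] <= lag_window:
            edges[(row["origin"], row["adopter"])] += 1
    return edges

# Invented toy adoption rows, not real results.
toy = [
    {"origin": "SCO", "adopter": "CIS", "lag": 1},
    {"origin": "SCO", "adopter": "CIS", "lag": 3},
    {"origin": "SCO", "adopter": "CSTO", "lag": 4},
    {"origin": "CIS", "adopter": "SCO", "lag": 2},
    {"origin": "SCO", "adopter": "SCO", "lag": 0},  # origin's own row, ignored
]

by_window = {w: influence_edges(toy, w) for w in (2, 3, 4)}
for window, edges in by_window.items():
    print(window, dict(edges), "total weight:", sum(edges.values()))
```

Routes whose weight changes sharply between windows are the ones whose lead-follow interpretation depends most on the threshold.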

### C) Visualization & reporting
- **Figures**: already saved as PNGs. To export PDFs instead, change the file extension in the filename passed to `ggsave()`.
- **Network graph**: import `step03_influence_edges_lag*.csv` into Gephi, or use `igraph` in R to draw a directed, weighted network.
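Before drawing the graph, it can help to summarize each org's total outgoing and incoming edge weight, which indicates who mostly originates terms and who mostly adopts them. This is a minimal Python sketch; the `(origin, adopter, weight)` layout is an assumption about the edge file's columns, and the weights are invented.

```python
from collections import defaultdict

def node_strength(edges):
    """edges: iterable of (origin, adopter, weight) tuples (assumed layout).

    Returns {org: (out_strength, in_strength)}.
    """
    out_s, in_s = defaultdict(int), defaultdict(int)
    for origin, adopter, weight in edges:
        out_s[origin] += weight   # terms this org passed on
        in_s[adopter] += weight   # terms this org picked up
    orgs = sorted(set(out_s) | set(in_s))
    return {org: (out_s[org], in_s[org]) for org in orgs}

# Invented toy edges, not the real weights.
toy_edges = [("SCO", "CIS", 5), ("SCO", "CSTO", 3), ("CIS", "SCO", 2)]
strength = node_strength(toy_edges)
for org, (out_w, in_w) in strength.items():
    print(org, "out:", out_w, "in:", in_w)
```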

### D) Reproducibility/housekeeping
- **Lock versions**: run `renv::init()` in your R project to record package versions in `renv.lock`, and `renv::snapshot()` after installing or updating packages.
- **Save aligned DFM** (already done) so downstream scripts don’t need to re‑align docvars.
- **Document changes**: keep a short `CHANGELOG.md` for edits to the term list and thresholds.

## How to integrate into the paper
- **Methods**: one paragraph on corpus build + OCR, another on UDPipe lemmatization and DFM trimming, plus dictionary and cosine similarity logic.
- **Results**: use `step03_freq_illiberal_terms.csv` for the temporal trend narrative, `step03_crosspollination_similarity.csv` for pairwise annual proximity, and `step03_influence_edges_lag*.csv` to argue lead–follow dynamics.
- **Appendix**: include the full term list, trimming thresholds, and lag sensitivity tables.
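When writing up the cosine similarity logic for the Methods section, a small worked example can make the measure concrete. The Python sketch below computes cosine similarity between two term-count vectors stored as dictionaries. It illustrates the measure only; how the pipeline averages similarities into the annual pairwise mean is not shown here, and the toy counts are invented.

```python
import math

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two {term: count} dictionaries."""
    shared = set(counts_a) & set(counts_b)  # only shared terms contribute to the dot product
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

# Invented toy counts for two organizations in one year.
org_a = {"term_a": 1, "term_b": 1}
org_b = {"term_a": 1}
org_c = {"term_c": 4}

print(round(cosine_similarity(org_a, org_b), 4))
print(cosine_similarity(org_a, org_c))
```

A value near 1 means the two vocabularies are distributed almost identically, and 0 means they share no terms.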

— Generated by helper script on demand.