Economics Letters 237 (2024) 111653


Contents lists available at ScienceDirect

Economics Letters

journal homepage: www.elsevier.com/locate/ecolet

Construction and analysis of uncertainty indices based on multilingual text
representations
Viktoriia Naboka-Krell
Justus Liebig University Giessen, Licher street 64, Giessen, 35394, Hessen, Germany

ARTICLE INFO

Keywords:
Text-as-data
fastText embeddings
BERT
Economic policy uncertainty
Natural language processing

ABSTRACT

The work by Baker et al. (2016), who propose a dictionary-based method and estimate the level of economic
policy uncertainty (EPU) from the occurrence of specific terms in ten leading U.S. newspapers, is
among the first to recognize the potential of text data in economic research. Following this line of research,
this paper proposes automated approaches to the construction of EPU indices for different countries based on
newspaper texts. Multilingual fastText word embeddings, (S)BERT embeddings, and a novel multilingual
topic modeling approach are used to construct EPU indices for Germany, Russia, and Ukraine. It is shown that
the EPU indices constructed from multilingual word embeddings Granger-cause economic activity
in all of the considered countries.
1. Introduction

The work by Baker et al. (2016) is among the first to recognize
the potential of text data in economic research. Although dictionary-
based methods as in Baker et al. (2016) are widely used due to their
simplicity and interpretability, recent advances in natural language
processing (NLP) offer many further possibilities to gain insights from
text data. New approaches
include the use of topic models such as Latent Dirichlet Allocation
(LDA). Azqueta-Gavaldón (2017) proposes an LDA-based procedure to
build an uncertainty index that strongly resembles the index introduced
by Baker et al. (2016) (the BBD index). The proposed EPU index is
obtained by aggregating the time series of the identified EPU-related
topics.

Other approaches make use of word embeddings, for example, to
extend the EPU-related term set, as proposed by Ghirelli et al. (2019)
for the case of Spain. The authors show that an unexpected shock
in their modified EPU index leads to a significant decline in GDP,
private consumption, and investment. Algaba et al. (2020) follow this
approach and construct an EPU index for Belgium using GloVe word
embeddings (Pennington et al., 2014). The constructed index is shown
to correlate negatively (−0.62) with the Consumer Confidence
Indicator (CCI). Xie (2020) proposes a fully automated method
to build an uncertainty index. The author applies the Wasserstein Index
Generation model and uses word vectors to represent the analyzed text
units in a vector space.

These examples show that word embeddings might have a considerable
impact on future applications and methods in the economic literature.
In contrast to other methods, word embeddings are able to capture the
semantic and syntactic characteristics of words, which is useful
in numerous cases. The current work examines
word and text representations in the context of EPU measurement and
contributes to the growing area of text-as-data applications in economics,
particularly uncertainty measurement, in several ways. First,
it proposes several approaches to the construction of EPU indices in a
multilingual setting without any supervision. Second, it applies a novel
zero-shot topic modeling approach that allows a topic model to be trained in
one language and topic distributions to be predicted for documents in unseen
languages. Third, the resulting uncertainty indices are evaluated with
regard to their impact on economic activity in the selected countries.

E-mail address: Viktoriia.Naboka@wirtschaft.uni-giessen.de.
1 fastText is a free library for text classification and representation learning (https://fasttext.cc/).

2. Text representation techniques

Multilingual word embeddings are word vectors in multiple languages
that are embedded in a shared vector space. In such a space,
distances between vectors are interpretable across languages,
meaning that words with similar meanings lie close
to each other regardless of their language. Several approaches have
been proposed to train such multilingual word embeddings. One
widely used approach is the mapping-based approach, which relies
on so-called off-the-shelf lexicons. The freely available multilingual
fastText¹ word representations were also learned following the mapping-
based approach proposed by Joulin et al. (2018) (bilingual mapping)
and Grave et al. (2018) (multilingual mapping by defining a pivot
language).

Available online 11 March 2024
0165-1765/© 2024 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

https://doi.org/10.1016/j.econlet.2024.111653
Received 15 December 2023; Received in revised form 8 March 2024; Accepted 8 March 2024




A major breakthrough in the field
of language model learning was made with the publication of
the work by Devlin et al. (2019). The authors present BERT, a novel
approach to text representation that differs substantially from
earlier models. BERT stands for Bidirectional Encoder Representations
from Transformers and consists of a multi-layer Transformer encoder.
BERT became the state of the art in many NLP tasks. However, to
overcome capacity and runtime issues, Sentence BERT (SBERT), which
is fine-tuned for semantic similarity search, was introduced (Reimers
and Gurevych, 2019).

Probabilistic topic modeling is a further well-known
and widely used tool for extracting and analyzing the latent themes
in unstructured text data in different areas. Bianchi
et al. (2021) introduce a novel approach to topic modeling for the
multilingual setting. Multilingual contextualized topic modeling allows
a topic model to be trained in one language and topic distributions
to be inferred for documents in unseen languages, relying only on their SBERT
representations.

Further details on methods used in this paper are provided in Ap-
pendix A.

3. Data

For the empirical analysis, three datasets of news articles in three
different languages are used: DER SPIEGEL for Germany, Lenta.ru for
Russia, and UNIAN for Ukraine. The following preprocessing steps were
taken: removing punctuation, numbers, special characters, and stopwords,
and converting the text to lowercase. The final datasets contain 833,454 articles in the period
from January 2000 to September 2020 for Germany, 864,481 articles
in the period from September 1999 to September 2020 for Russia, and
785,750 articles in the period from January 2007 to September 2020
for Ukraine. Economic activity is measured by the industrial production
index, which is often used in the academic literature as a high-frequency
indicator of a country’s economic activity.
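
The preprocessing steps listed above can be illustrated with a minimal sketch. The function name `preprocess`, the tiny stopword set, and the example sentence are assumptions made for illustration only; the paper does not state which tools or stopword lists were actually used.

```python
import re

# Hypothetical, tiny stopword list for illustration; a real application would
# use a full language-specific list for German, Russian, or Ukrainian.
STOPWORDS = {"der", "die", "das", "und", "in", "im", "the", "and", "of"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, drop punctuation/numbers/special characters, remove stopwords."""
    text = text.lower()
    # Keep runs of letters only; this also covers umlauts and Cyrillic letters.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [tok for tok in tokens if tok not in stopwords]


example = "Die Unsicherheit in der Wirtschaftspolitik stieg 2008 um 12%!"
print(preprocess(example))
```

The order of the steps matters little here, but lowercasing before stopword removal ensures that capitalized stopwords at the beginning of a sentence are also removed.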

4. Construction of uncertainty indices

Overall, four different approaches are proposed to identify articles
related to uncertainty in economic policy. The first approach is a dictionary-
based one, which uses fastText multilingual word embeddings
to identify either three term sets referring to the three components
of the EPU concept (later referred to as 𝑑𝑖𝑐1) or one combined term
set that should describe the EPU concept as a whole (later referred to
as 𝑑𝑖𝑐2). While the former method searches for the nearest neighbors of
the three terms economic, policy, and uncertainty, the latter makes use
of the additive property of word vectors and searches for the nearest
neighbors of the compound word vector economic + policy + uncertainty.
The similarity of the word vectors is measured by cosine similarity.²
The number of relevant words is controlled automatically by
a threshold, namely the 99.99th percentile of all the cosine similarity
values between a certain word in one language (e.g. the English word ‘‘policy’’)
and all the word vectors available in the other language (e.g. German),
as shown in Fig. 1. The identified EPU-related terms are presented
in Appendix A.
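
The term selection behind 𝑑𝑖𝑐1 and 𝑑𝑖𝑐2 can be sketched as follows. This is a toy illustration under stated assumptions: the three-dimensional vectors, the helper names (`cosine`, `percentile_threshold`, `related_terms`) and the nearest-rank percentile rule are hypothetical and are not taken from the paper, which works with the pre-trained aligned fastText vectors.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def percentile_threshold(values, q):
    """Nearest-rank q-th percentile (q in percent) of a list of values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def related_terms(query_vec, vocab, q=99.99):
    """Words of the target language whose similarity reaches the q-th percentile."""
    sims = {word: cosine(query_vec, vec) for word, vec in vocab.items()}
    cutoff = percentile_threshold(list(sims.values()), q)
    return sorted(word for word, sim in sims.items() if sim >= cutoff)


# Toy vectors in a shared space (assumption; real vectors have 300 dimensions).
english = {
    "economic": [1.0, 0.1, 0.0],
    "policy": [0.1, 1.0, 0.0],
    "uncertainty": [0.0, 0.1, 1.0],
}
german = {
    "wirtschaft": [0.9, 0.2, 0.1],
    "politik": [0.2, 0.9, 0.1],
    "unsicherheit": [0.1, 0.2, 0.9],
    "fussball": [-0.5, -0.5, 0.1],
    "wirtschaftspolitik": [0.7, 0.7, 0.6],
}

# dic1: one term set per component of the EPU concept.
dic1 = {term: related_terms(vec, german) for term, vec in english.items()}

# dic2: one term set around the compound vector economic + policy + uncertainty.
compound = [sum(dims) for dims in zip(*english.values())]
dic2 = related_terms(compound, german)

print(dic1)
print(dic2)
```

With a vocabulary of only five words, the 99.99th percentile coincides with the maximum, so each term set contains a single word; with a realistic vocabulary of hundreds of thousands of words, the same rule retains a small set of nearest neighbors.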

2 Cosine similarity is defined as the cosine of the angle between two vectors.
The values range from −1 to 1. A cosine similarity value of 1 means that the
vectors are pointing in the same direction.

Fig. 1. Cosine Similarity Values between policy and all German words.

The second approach considers the articles as Bag-of-Words and
represents them as the sum of the constituent word vectors, which
results in aggregated document embeddings (later referred to as the 𝑎𝑟𝑡_𝑒𝑚𝑏
approach). The third approach applies Transformer-based text embeddings
to identify articles that relate to EPU (later referred to as
𝑎𝑟𝑡_𝑠𝑏𝑒𝑟𝑡).³ Finally, a novel language-agnostic topic modeling technique
called zero-shot cross-lingual topic modeling is applied. A
topic model was trained on German SBERT embeddings of articles,
and the topic distributions for Russian and Ukrainian articles
were then inferred from the SBERT embeddings of those articles. EPU-
related topics were identified using the embeddings of the most
frequent topic words. In the following, this approach is referred to
as 𝑀𝐶𝑇𝑀_{𝑘}_𝑇𝑜𝑝𝑖𝑐, where 𝑘 stands either for the number/label of a
topic that is identified as an EPU-related topic or for the designation
𝑐𝑜𝑚𝑏𝑖𝑛𝑒𝑑, if the average topic frequency of all the EPU-related topics
is used. Overall, 10 different EPU time series are provided for each
country.
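
The aggregated document embeddings of the 𝑎𝑟𝑡_𝑒𝑚𝑏 approach can be sketched as follows. The toy vectors, the article tokens, and in particular the scoring of articles by their cosine similarity to the compound vector economic + policy + uncertainty are assumptions made for illustration; the exact rule used to flag EPU-related articles is not given in this section.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if one of the vectors is the zero vector."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / norm


def document_embedding(tokens, word_vectors):
    """Sum the vectors of all in-vocabulary tokens (Bag-of-Words representation)."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is not None:  # out-of-vocabulary tokens are skipped
            total = [t + x for t, x in zip(total, vec)]
    return total


# Toy word vectors (assumption; real fastText vectors have 300 dimensions).
word_vectors = {
    "wirtschaft": [0.9, 0.2, 0.1],
    "politik": [0.2, 0.9, 0.1],
    "unsicherheit": [0.1, 0.2, 0.9],
    "fussball": [-0.5, -0.5, 0.1],
    "tor": [-0.4, -0.6, 0.0],
}
# Hypothetical compound vector economic + policy + uncertainty.
epu_vec = [1.1, 1.2, 1.0]

articles = {
    "a": ["wirtschaft", "politik", "unsicherheit", "unbekannt"],
    "b": ["fussball", "tor"],
}
scores = {
    name: cosine(document_embedding(tokens, word_vectors), epu_vec)
    for name, tokens in articles.items()
}
print(scores)
```

Because cosine similarity is invariant to the length of a vector, summing and averaging the word vectors lead to the same similarity scores.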

5. Results

5.1. EPU indices

This section presents the constructed indices. All the indices were
normalized to have a mean of 100 and a variance of 1.
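
A minimal sketch of this normalization is given below. The function name `normalize_index` and the raw values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify which variance estimator was used.

```python
from statistics import mean, pstdev


def normalize_index(series, target_mean=100.0, target_sd=1.0):
    """Rescale a series to the target mean and standard deviation."""
    mu = mean(series)
    sd = pstdev(series)  # population standard deviation (assumption)
    return [target_mean + target_sd * (x - mu) / sd for x in series]


# Hypothetical raw monthly shares of EPU-related articles.
raw = [0.02, 0.05, 0.03, 0.10, 0.04]
index = normalize_index(raw)
print(index)
```

A variance of 1 is equivalent to a standard deviation of 1, so the rescaled series fluctuates closely around 100.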

Fig. 2 shows the indices resulting from the 𝑑𝑖𝑐1 and 𝑑𝑖𝑐2 approaches.
In Germany, the peaks correspond to events such as the
September 11 attacks, the economic crisis in Germany, the global financial
crisis, and the coronavirus outbreak. The spikes of the EPU in Russia
between 2004 and 2005 as well as in the period from 2014 to 2018
could be explained by the Orange Revolution in neighboring Ukraine
and the Russia–Ukraine gas disputes, and by the Crimean crisis and
the War in Donbass, respectively. Surprisingly, both of the constructed
EPU indices for Ukraine show a downward trend. Some peaks can be
identified at the beginning of 2007 (political crisis in Ukraine), in 2008–09
(global financial crisis), in 2014 (beginning of the Crimean crisis and
the War in Donbass), and at the beginning of 2019 (presidential and
parliamentary elections). The coronavirus outbreak, in contrast, seems to
have caused only a relatively small increase in the uncertainty index. As
this approach is closely related to that proposed by Baker et al. (2016),
further analyses comparing the constructed indices with the available
indices by Baker et al. (2016) (BBD indices) have been carried out
(see Appendix A).

Figs. 3 and 4 show the 𝑎𝑟𝑡_𝑒𝑚𝑏 and 𝑎𝑟𝑡_𝑠𝑏𝑒𝑟𝑡 indices for all three
countries, respectively. There are some noticeable differences between
the two, for example, stronger interdependencies between Russia and
Ukraine according to the 𝑎𝑟𝑡_𝑠𝑏𝑒𝑟𝑡 approach.

3 To obtain the SBERT embeddings of the articles, the pre-trained
distiluse-base-multilingual-cased-v2 model was used. The Python implementation
of Sentence BERT (SBERT), namely Sentence-Transformers, was used to
load and apply the model.

Fig. 2. 𝑑𝑖𝑐1 and 𝑑𝑖𝑐2 EPU Indices.

Fig. 3. 𝑎𝑟𝑡_𝑒𝑚𝑏 EPU Indices.

Fig. 4. 𝑎𝑟𝑡_𝑠𝑏𝑒𝑟𝑡 EPU Indices.

Fig. 5. MCTM with 40 Topics: EPU Related Topics.

Finally, based on the results of the multilingual topic modeling,
five EPU-related topics are identified in the current application. These
are presented in Fig. 5. Based on a qualitative assessment of the
most common words of the topics, they were assigned the following
labels: government, stock market, political parties,
elections, and U.S. political leaders. The values in brackets
represent the cosine similarity to the EPU embedding. The corresponding
time series for each country as well as additional robustness
checks using U.S. data are presented in Appendix A.

5.2. VAR models

All the constructed EPU indices are tested within vector autoregressive (VAR) models
with regard to their impact on the economic activity of the countries.
Economic activity is measured by the industrial production index,
which is often used in the academic literature as a high-frequency indicator
of a country’s economic activity (Baker et al., 2016; Perić and Sorić,
2018; Čižmešija et al., 2017).⁴

For each country, 10 two-dimensional VAR models with seasonal
dummies were estimated, each including one of the constructed EPU
indices and the corresponding industrial production index.⁵

4 The data for the analyses come from the state statistics services of the
considered countries.

5 According to the performed stationarity tests, all the variables needed
to be transformed to become stationary. For this reason, the first log differences
of all the variables were calculated and used in all estimated Vector
Autoregressive (VAR) models.

Fig. 6. IRFs: 𝑑𝑖𝑐2 EPU Index → Industrial Production Index.

Granger causality
The first set of analyses is dedicated to Granger causality tests,
specifically to the null hypothesis ‘‘EPU does not Granger cause Industrial
Production Index’’. According to the Granger causality tests, the following
EPU indices led to significant results in at least one country: 𝑑𝑖𝑐1
(Ukraine), 𝑑𝑖𝑐2 (all countries), 𝑎𝑟𝑡_𝑒𝑚𝑏 (Germany, Ukraine), 𝑎𝑟𝑡_𝑠𝑏𝑒𝑟𝑡
(Germany), 𝑀𝐶𝑇𝑀_𝑠𝑡𝑜𝑐𝑘_𝑚𝑎𝑟𝑘𝑒𝑡_𝑇𝑜𝑝𝑖𝑐 (all countries).⁶ The results of the
Granger causality tests are summarized in Appendix A.
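
The logic of the first log differences and of a Granger causality test can be sketched in a simplified bivariate setting. The code below is an illustration under stated assumptions and not the estimation used in the paper: it uses simulated data, a single lag, no seasonal dummies, and it reports only the F statistic (restricted versus unrestricted OLS regression) without a 𝑝-value. All function and variable names are hypothetical.

```python
import math
import random


def log_diff(series):
    """First log differences of a strictly positive series."""
    return [math.log(b) - math.log(a) for a, b in zip(series, series[1:])]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rss(X, y):
    """Residual sum of squares of an OLS regression of y on the columns of X."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(X, y))


def granger_f(y, x, lags=1):
    """F statistic for H0: lags of x do not help to predict y beyond lags of y."""
    rows = range(lags, len(y))
    target = [y[t] for t in rows]
    restricted = [[1.0] + [y[t - l] for l in range(1, lags + 1)] for t in rows]
    unrestricted = [r + [x[t - l] for l in range(1, lags + 1)]
                    for r, t in zip(restricted, rows)]
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    dof = len(target) - (2 * lags + 1)
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


def levels(growth, start=100.0):
    """Turn a series of log growth rates into a level series."""
    out, cur = [start], start
    for g in growth:
        cur *= math.exp(g)
        out.append(cur)
    return out


# Simulated data: EPU growth lowers industrial production growth one month later.
rng = random.Random(0)
n = 200
epu_growth = [rng.gauss(0, 0.1) for _ in range(n)]
ip_growth = [0.0] + [-0.5 * epu_growth[t - 1] + rng.gauss(0, 0.02)
                     for t in range(1, n)]

d_epu = log_diff(levels(epu_growth))
d_ip = log_diff(levels(ip_growth))

f_epu_to_ip = granger_f(d_ip, d_epu)
f_ip_to_epu = granger_f(d_epu, d_ip)
print(f_epu_to_ip, f_ip_to_epu)
```

In this simulation the F statistic for the direction EPU → industrial production is large, while the statistic for the reverse direction is small, which is the pattern a Granger causality test is designed to detect. In practice, a library implementation that also handles lag selection, seasonal dummies, and 𝑝-values would be preferable.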

Impulse response functions
A closer look is taken at the 𝑑𝑖𝑐2 EPU index, as this index has
proved to be Granger causal to economic activity in all the considered
countries. Fig. 6 illustrates the responses of the industrial production
indices to a shock in the 𝑑𝑖𝑐2 EPU index in all considered countries. The
shaded areas represent the 95% confidence bands. Orthogonal
impulse responses are considered, meaning that contemporaneous
effects are allowed. It can be inferred from the figure that a one standard
deviation shock in EPU leads to a significant drop of 0.1 and 0.44
percentage points in the industrial production index after one month
in Germany and Russia, respectively. While in Germany the EPU shock has only a
short-term impact on industrial production, in Russia there
is also a significant long-term negative impact of the EPU shock on the
industrial production index (about 0.3 percentage points). The
pattern of the Impulse Response Function (IRF) in Ukraine is similar but
not significant over the entire period.
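
How orthogonal impulse responses arise from an estimated VAR can be sketched for the simplest case, a two-dimensional VAR(1). The coefficient matrix `A`, the residual covariance matrix `Sigma`, and the variable ordering (EPU first, industrial production second) are made-up assumptions for illustration; the paper does not report these quantities in this section.

```python
import math


def cholesky_2x2(s):
    """Lower-triangular P with P P' = s for a symmetric positive definite 2x2 matrix."""
    p11 = math.sqrt(s[0][0])
    p21 = s[1][0] / p11
    p22 = math.sqrt(s[1][1] - p21 ** 2)
    return [[p11, 0.0], [p21, p22]]


def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def orthogonal_irf(coef, sigma, horizon):
    """Orthogonalized responses of a VAR(1): Theta_h = A^h P for h = 0..horizon."""
    p = cholesky_2x2(sigma)
    power = [[1.0, 0.0], [0.0, 1.0]]  # A^0
    responses = []
    for _ in range(horizon + 1):
        responses.append(matmul(power, p))
        power = matmul(coef, power)
    return responses


# Variable order: [EPU, industrial production]; all numbers are made up.
A = [[0.5, 0.0], [-0.2, 0.3]]
Sigma = [[1.0, -0.3], [-0.3, 0.5]]

irf = orthogonal_irf(A, Sigma, horizon=12)
# Response of industrial production (row 1) to an EPU shock (column 0).
ip_response_to_epu = [theta[1][0] for theta in irf]
print(ip_response_to_epu)
```

With EPU ordered first, the Cholesky factor allows an EPU shock to affect industrial production within the same period, whereas a shock to industrial production affects EPU only with a lag; this is one way in which contemporaneous effects enter orthogonal impulse responses.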

6 Results were considered significant if 𝑝-value is smaller than 10%.

6. Conclusions

One of the key findings of the current work is that the dictionary-
based approach in combination with multilingual word embeddings
results in indices that Granger-cause economic activity in
all of the considered countries. Further, to the best of my knowledge,
the current paper is the first to apply the novel language-agnostic topic
modeling technique introduced by Bianchi et al. (2021) in an economic
context. One of the identified EPU-related topics has proved to Granger-
cause economic activity in all of the considered countries. This
is a promising finding, as the topic modeling approach by Bianchi
et al. (2021) makes it possible to predict topic distributions for texts in unseen
languages based only on SBERT embeddings, without retraining
the model. Finally, it has also been shown that a sudden shock in
the constructed EPU indices leads to significant short-term and/or long-term
declines in industrial production. This finding indicates that the
constructed EPU indices could be used as high-frequency indicators of
economic activity.

Data availability
Data will be made available on request.


Acknowledgments

I thank my supervisor Peter Winker and my colleagues for their
support and advice.

Funding

No funds, grants, or other support was received.

Appendix A. Supplementary data

Supplementary material related to this article can be found online
at https://doi.org/10.1016/j.econlet.2024.111653.

References

Algaba, A., Borms, S., Boudt, K., van Pelt, J., 2020. The economic policy uncertainty
index for Flanders, Wallonia and Belgium. SSRN Electron. J. BFW digitaal/RBF numérique 2020/6.
http://dx.doi.org/10.2139/ssrn.3580000, URL: https://ssrn.com/abstract=3580000.

Azqueta-Gavaldón, A., 2017. Developing news-based Economic Policy Uncertainty index
with unsupervised machine learning. Econom. Lett. 158, 47–50.
http://dx.doi.org/10.1016/j.econlet.2017.06.032,
URL: https://www.sciencedirect.com/science/article/pii/S0165176517302598.

Baker, S.R., Bloom, N., Davis, S.J., 2016. Measuring Economic Policy Uncertainty. Q.
J. Econ. 131 (4), 1593–1636. http://dx.doi.org/10.1093/qje/qjw024.

Bianchi, F., Terragni, S., Hovy, D., Nozza, D., Fersini, E., 2021. Cross-lingual
Contextualized Topic Models with Zero-shot Learning. CoRR abs/2004.07737,
arXiv:2004.07737.

Čižmešija, M., Lolić, I., Sorić, P., 2017. Economic policy uncertainty index and economic
activity: What causes what? Croatian Oper. Res. Rev. 8 (2), 563–575.

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. Association for Computational
Linguistics, Minneapolis, Minnesota, pp. 4171–4186.
http://dx.doi.org/10.18653/v1/N19-1423, URL: https://www.aclweb.org/anthology/N19-1423.

Ghirelli, C., Pérez, J.J., Urtasun, A., 2019. A new economic policy uncertainty index for
Spain. Econom. Lett. 182, 64–67, https://ideas.repec.org/p/bde/wpaper/1906.html.

Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T., 2018. Learning Word
Vectors for 157 Languages. In: Proceedings of the Eleventh International Conference
on Language Resources and Evaluation. LREC 2018, European Language Resources
Association (ELRA), Miyazaki, Japan, URL: https://www.aclweb.org/anthology/L18-1550.

Joulin, A., Bojanowski, P., Mikolov, T., Jégou, H., Grave, E., 2018. Loss in translation:
Learning Bilingual Word Mapping with a Retrieval Criterion. In: Proceedings of the
2018 Conference on Empirical Methods in Natural Language Processing. Association
for Computational Linguistics, Brussels, Belgium, pp. 2979–2984.
http://dx.doi.org/10.18653/v1/D18-1330, URL: https://www.aclweb.org/anthology/D18-1330.

Pennington, J., Socher, R., Manning, C., 2014. GloVe: Global Vectors for Word
Representation. In: Moschitti, A., Pang, B., Daelemans, W. (Eds.), Proceedings
of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP). Association for Computational Linguistics, Doha, Qatar, pp. 1532–1543.
http://dx.doi.org/10.3115/v1/D14-1162, https://aclanthology.org/D14-1162.

Perić, B.Š., Sorić, P., 2018. A note on the ‘‘Economic Policy Uncertainty Index’’. Soc.
Indic. Res. 137 (2), 505–526.

Reimers, N., Gurevych, I., 2019. Sentence-BERT: Sentence embeddings using siamese
BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing. EMNLP-IJCNLP, Association for Computational
Linguistics, Hong Kong, China, pp. 3982–3992.
http://dx.doi.org/10.18653/v1/D19-1410, URL: https://www.aclweb.org/anthology/D19-1410.

Xie, F., 2020. Wasserstein index generation model: Automatic generation of time-series
index with application to economic policy uncertainty. Econom. Lett. 186, 108874.
http://dx.doi.org/10.1016/j.econlet.2019.108874,
URL: https://www.sciencedirect.com/science/article/pii/S0165176519304410.
