Word Embedding-Based ISA Measures and Their Validity
Word Embedding-Based ISA Metrics
Word embeddings are trained on a large corpus of texts in an unsupervised manner, usually as a by-product of training a neural network to predict a word from the words surrounding it. This by-product can be used to quantify the semantic meanings of words based on the linguistic properties of distributional semantics. Word embeddings utilize co-occurrence statistics to place all unique words in a corpus in a multidimensional word embedding space. Each row in this space represents the word vector of one word. For example, words like fluid and water are more likely to occur in the same sentence as flow, while words like solid or steel are less likely to do so (Pennington et al., 2014). For words with similar semantic meanings (e.g., fluid and water in the above example), this similarity can be quantified by comparing their word vectors. By the same token, words with different meanings (e.g., water and steel) should have very different word vectors. Comparing word vectors can therefore tell us which words are more closely associated. In communication research, this comparability was originally used to enhance supervised and unsupervised machine learning models over the traditional “bag-of-words” model (e.g., Rudkowsky et al., 2018; Atteveldt et al., 2021). Several pretrained word vectors exist that researchers can use in an off-the-shelf manner.
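The vector comparison described above is typically operationalized as cosine similarity. The following minimal sketch uses invented three-dimensional toy vectors; real embeddings such as GloVe have hundreds of dimensions learned from co-occurrence statistics:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Invented toy vectors for illustration only.
vectors = {
    "water": [0.9, 0.8, 0.1],
    "fluid": [0.8, 0.9, 0.2],
    "steel": [0.1, 0.2, 0.9],
}

# Semantically similar words yield a higher cosine similarity.
sim_water_fluid = cosine(vectors["water"], vectors["fluid"])
sim_water_steel = cosine(vectors["water"], vectors["steel"])
assert sim_water_fluid > sim_water_steel
```

This ordering of similarities is the basic signal that all word embedding association metrics discussed below build on.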
However, this application has also been criticized because pretrained word embeddings might contain unwanted associations inherited from the texts in their training corpora, so that a downstream task would also be tainted by those associations. Pioneer works in this regard are Caliskan et al. (2017) and Garg et al. (2018). These studies derived several “word embedding association” metrics to quantify what the authors interpret as “biases” in pretrained word vectors. These metrics are also based on the comparability of word vectors: the relative similarity in the word embedding space of a target (e.g., nurse) with different attributes (e.g., male and female) can tell us how the word vectors—and by extension, their original training corpus—represent different target entities to be implicitly associated with other entities.
More recently, this approach has been extended as a computational communication research method to assess implicit stereotypical associations within text: Instead of using pretrained word vectors, word embeddings are trained on a large corpus of texts that one wants to study. Resulting word embedding association metrics are used to quantify ISAs in the corpora—which have in past research been interpreted as “stereotypes” (Kroon et al., 2021; Andrich et al., 2023; Andrich & Domahidi, 2022; Azzalini, 2025), “media bias” (Sales et al., 2019; Curto et al., 2022), or “implicit representations/associations” (Fu, 2023; Müller et al., 2023; Urman & Makhortykh, 2022). In this kind of application, word embedding-based metrics are not used to assess potential biases within a word embedding model with the aim to reduce these biases. Instead, they are used as a content analysis tool: Researchers calculate word embedding-based scores to make observations about the corpora they train the model on. However, while the body of applied studies employing word embedding-based ISA metrics is flourishing (also see, Durrheim et al., 2023), these applications still lack a demonstration of their semantic validity.
The Missing Semantic Validity of Using Word Embedding-Based ISA Metrics for Content Analysis
DiMaggio et al. (2013), Quinn et al. (2010), and Grimmer & Stewart (2013) conceptualize three different aspects of validation in automated text analysis: statistical, predictive, and semantic. These three aspects supplement each other. Statistical validity indicates whether results generated from a model agree with the model’s statistical assumptions. Using word embedding-based ISA metrics as an example, the stability of the measurement is one aspect of these statistical assumptions. It can be assessed using standardized tests (e.g., Spliethöver & Wachsmuth, 2021). Statistical validity per se is important but not sufficient as sole evidence of the validity of a measure because it evaluates only the internal coherence of a model.
Predictive validity, on the other hand, measures the “expected correspondence between a measure and exogenous events uninvolved in the measurement process” (Quinn et al., 2010, p. 222). Foundational works on word embedding-based ISA measures established the validity of the approach primarily in terms of predictive validity (for an overview see, Durrheim et al., 2023). For instance, Caliskan et al. (2017) show the correlation between WEAT-based scores derived from word vectors trained on Wikipedia and various web corpora (Pennington et al., 2014) with Implicit Association Test (IAT) scores obtained from some exogenous U.S. experiments conducted in the 1990s (Greenwald et al., 1998). Similarly, Garg et al. (2018) validate word embedding-based scores on gender biases in occupation within a Google News corpus by testing their correlation with the relative percentage of females in different occupations in the U.S. in the 1960s. For use cases in which word embedding-based ISA metrics are employed to detect biases within word embedding models and de-bias them for further application, this correspondence of these metrics to external data suffices.
In recent years, though, word embedding-based measures have been used to assess stereotypical associations within texts from specific periods and regions. For example, Kroon et al. (2021) aim to show that “representations of minorities in newspapers have become progressively remote from factual integration outcomes, and are therefore rather an artifact of news production processes than a true reflection of what is actually happening in society.” Their claim is that the associations measured within a word embedding model trained on news do not correspond to external data, but instead indicate a stereotypical portrayal within the news corpus. Word embedding-based scores are here used as a content analysis tool: The researchers’ goal in this kind of application is to make assertions about the content of news in the observed corpus, not about external society. To validate the application of word embedding-based scores for these kinds of applications, predictive validity does not suffice—in fact, where one expects a divergence between reporting and external societal phenomena, predictive validity cannot be used to assess the method’s validity at all.
Therefore, when using word embedding-based metrics as a content analysis tool, one needs to supplement statistical and predictive validity with semantic validity. Krippendorff (2018, p.323) defined semantic validity as “the degree to which the analytical categories of texts correspond to the meanings these texts have for particular readers or the roles they play within a chosen context.” An important distinction between predictive validity and semantic validity is therefore what the measure of interest corresponds to: The ground truth in the case of semantic validity is the human understanding of texts, rather than some exogenous cultural patterns. One can therefore consider semantic validity a form of criterion validity, where the criterion test—more commonly referred to as “gold standard” (Song et al., 2020; Atteveldt et al., 2021; Lind et al., 2017)—is the human understanding. Consequently, semantic validity as a category is much closer to the actual target construct that is supposed to be measured in text analysis, namely the meaning conveyed by text to human readers, than predictive validity.
Krippendorff’s works refer to manual content analysis with human coders. In this context, they suggest comparing human coders’ ratings of media content to those of individuals with expertise in the field under study (political professionals for political texts, legal experts for legal texts, etc.; Krippendorff, 1980). In the present study, instead of focusing on topical experts to produce the criterion values for semantic validity, we use a slightly different approach and examine a general media audience’s assessment. In manual content analysis, trained coders are typically not such experts, but rather resemble members of the general media audience. Therefore, the question whether their coding is in line with the general audience’s understanding of texts appears less pressing. For automated analysis, the case is different. Here, it is far from self-evident that the analysis routines produce a textual understanding that coincides with that of the general media audience.
However, this question is important when we consider the epistemological focus with which content analytical assessments are typically conducted. There are two inferential goals of media content analysis: (1) the “diagnostic approach” is interested in making inferences about media messages’ production circumstances from content analysis, (2) while the “prognostic approach” tries to infer predictions of a message’s potential processing and effects (Maurer & Reinemann, 2006, p. 13). The latter is especially important in the context of ISAs because of the harm that these kinds of messages might do to societal intergroup relations. Therefore, when establishing word embedding-based ISA assessment’s semantic validity, we are particularly interested in whether its substantial findings are in line with a general audience’s impression of the same content. This is a logical prerequisite to inferring predictions of media processing and effects from the content analytical results obtained by these measurements.
The Challenges of Semantic Validation—And How to Address Them
For common automated content analysis techniques such as sentiment detection or topic modeling, the units of analysis are typically articles or their sub-units such as sentences or paragraphs. Therefore, it is relatively easy to compare measures extracted for these units, for instance their sentiment scores or topic allocations, with assessments of the same categories made by human coders. It is also possible to test the semantic validity of the measure by studying a representative sample of articles in the corpus. Although the number of sampled articles does impact the validation outcome (Song et al., 2020), the approach is statistically valid (Krippendorff, 2018), has been recommended in the methods literature (Grimmer & Stewart, 2013; Atteveldt et al., 2021; Lind et al., 2017), and is widely used in computational communication research.
However, word embedding-based ISA measures represent the word associations in an entire corpus and are therefore aggregated text measurements at the corpus level. The resulting scores cannot be broken down into single texts, paragraphs, or even sentences as units of analysis. Consequently, semantic validation that would compare word embedding-based scores with human-generated data of the same texts has previously not been applied when attempting to validate the method’s suitability for assessing ISAs within text. Müller et al. (2023, p. 409) even explicitly pointed out that a validation using human coding is impractical. In their analysis of implicit stereotypical associations of ethnic and religious group names with emotions, the authors assert that “[a] proper human validation would need raters to read the entire corpus of 697,913 articles and point out what racial biases they have learned from the corpus.” For a similar method, Arendt & Karadas (2017, p. 13) defend the decision not to validate their measurement because “there is no real ‘gold standard’ of what the ‘true’ mediated associations are.” These assertions underscore the difficulty, but do not rule out the possibility of validating word embedding-based ISA measures. However, such an attempt cannot follow the same logic and approach as the semantic validation of text- or sentence-based measurements. For an entire corpus, it is impractical to use the so-called “gold standard” procedure of asking trained human coders to go over the texts under study and manually code ISAs as observational evidence of word embedding-based scores’ semantic validity.
A practical approach to overcome these issues, according to Müller et al. (2023, p. 409), is “to develop ways of validating word embedding bias methods using a well-defined causal conjecture.” Applying this logic, one can first propose a causal conjecture in the form of hypotheses and then test them empirically. A suitable causal conjecture for semantically validating word embedding-based ISA measures is that a sentence package with stronger ISAs for a specified entity (as measured by word embedding-based scores) causes human readers to perceive that the sentence package contains such associations to a higher extent than a control condition. This causal conjecture can be studied using an experimental research design (Imai et al., 2011). In fact, a similar approach (using experimental designs to establish semantic validity) was proposed and applied in the early days of computational text analysis to study the validity of topic models (Grimmer & King, 2011; Grimmer & Stewart, 2013).
In transferring the notion of semantic validation via survey experiments to the case of word embedding-based scores, we propose a routine which proceeds in three steps:
- As we cannot expect human participants to read a complete corpus, we extract different packages of sentences from a previous study’s corpus (Müller et al., 2023) that we expect to contain ISAs.
- Then, we ensure that word embedding-based scores capture the ISAs assumed to be present in those sentences. We inject the sentence packages into a version of the original corpus that is stripped of all other mentions of the target group to remove the original ISA from the corpus. This modified corpus is used for assessing the level of ISAs in sentence packages.
- Finally, we conduct an experimental study in which we let a large sample of human participants read scaled-down versions of the different sentence packages and afterwards ask them to answer a set of survey questions tailored to capture the ISAs conveyed by these sentences. If the results on these survey measures are in line with word embedding-based measures of ISAs, the causal conjecture established by the experimental design offers an argument for the semantic validity of using word embeddings to investigate ISAs in large text corpora.
By conducting this validation procedure in an experimental survey setting instead of hiring a limited number of trained coders (as in the typical gold standard validation routine), we make use of the law of large numbers. In contrast to the traditional coding approach in which each unit of analysis is judged by one coder (also the one applied in crowd coding, see, e.g., Atteveldt et al., 2021; Lind et al., 2017), the same sentence packages are evaluated by large groups of individuals. Instead of one data point per text unit, we thus gather a large number of data points on the same units. This accounts for the fact that implicit meanings such as ISAs may be perceived differently by various human readers, even by trained coders. Considering this variance, we assess the average meaning conveyed by different sentence packages to humans based on a large number of data points. Further, contrary to a typical coding scheme, the survey approach can account for the implicit nature of conveyed ISAs by using various text-dependent measures as indicators, not just one (single-item) assessment that is typically used to capture (quasi-)manifest textual meanings by human coders.
Study Design
We preregistered the hypotheses, survey, and analytical plan for this study before data collection. The code to reproduce both the stimulus generation process and the obtained survey data can be found on OSF.1 All Appendices are available in the same repository.
Case
We designed this study based on the Open Science materials shared by Müller et al. (2023). In that study, word embedding-based ISA metrics were applied to assess implicit stigmatization of ethnic and religious groups in German news reporting by measuring the co-occurrence of group labels with words implicitly charged with the positive emotion admiration or the negative emotion fear. Importantly, the original study did not rely on a pretrained word embedding model; instead, the model was trained on the study’s own text corpus using the GloVe algorithm.
For the present validation attempt, our goal was to create three artificial stimulus sentence packages—one consisting of sentences that implicitly associate a target group with fear, one implicitly associating the same group with admiration, and one that contains no implicit association of the target group with either of the two emotions (control condition). For the generation of sentence packages, we used a group (Italian people) that Müller et al. (2023) found to be portrayed in a balanced way on average within the corpus. We reasoned that an overall balanced group portrayal meant we would find enough sentences that implicitly contained either admiration or fear. Additionally, if we assume that the found associations within the original study are somewhat representative of stereotypes the wider population has internalized, it should be easier to construct an intuitively credible implicitly biased dataset based on this group. For example, using a group that is stereotypically associated with fear and constructing an artificial corpus where this group is implicitly associated with admiration might result in a failure of the validation attempt because existing stereotypes are less malleable and the sentences are therefore perceived as unrealistic or implausible by participants.
Following Müller et al. (2023); Urman & Makhortykh (2022), we used the normalized association score ($NAS$) to quantify ISAs. Suppose the cosine similarity score between word $w$ and word $x$ is $\cos(w, x)$. For a given word $w$ and an attribute wordset $A$, the mean cosine similarity is
$$\mu_A(w) = \frac{1}{|A|} \sum_{a \in A} \cos(w, a).$$
The mean of all cosine similarity scores of the union of $A$ and a second attribute wordset $B$ is
$$\mu_{A \cup B}(w) = \frac{1}{|A \cup B|} \sum_{x \in A \cup B} \cos(w, x),$$
and the standard deviation $\sigma_{A \cup B}(w)$ is computed over the same set of similarity scores. The $NAS$ of word $w$ is then
$$NAS(w, A, B) = \frac{\mu_A(w) - \mu_{A \cup B}(w)}{\sigma_{A \cup B}(w)}.$$
In Müller et al. (2023) and in the original implementation by Caliskan et al. (2017), the target is also a wordset rather than a single word; the final score is then obtained by aggregating the $NAS$ values of all words in the target wordset.
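As a hedged illustration of this scoring logic, the following sketch computes a normalized association score as the mean similarity of a target word to attribute wordset A, z-standardized against its similarities to all words in A ∪ B. The toy embeddings and wordsets are invented for illustration and are not taken from the original study:

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nas(w, A, B, emb):
    """Normalized association score of target word w: mean similarity
    to attribute wordset A, z-standardized against the similarities to
    all words in A union B (a sketch of the score described in the text)."""
    sims_a = [cosine(emb[w], emb[a]) for a in A]
    sims_ab = [cosine(emb[w], emb[x]) for x in set(A) | set(B)]
    return (mean(sims_a) - mean(sims_ab)) / stdev(sims_ab)

# Invented toy embeddings: the target "italian" sits closer to the
# admiration words than to the fear words, so the score should be positive.
emb = {
    "italian":    [0.9, 0.1],
    "admirable":  [0.8, 0.2],
    "impressive": [0.85, 0.15],
    "scary":      [0.1, 0.9],
    "threat":     [0.2, 0.8],
}
score = nas("italian", ["admirable", "impressive"], ["scary", "threat"], emb)
assert score > 0
```

Swapping the two attribute wordsets flips the sign of the score, which matches the later interpretation of positive values as implicit admiration and negative values as implicit fear.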
Preregistered Hypotheses
Our goal was to test whether the ISAs picked up by word embedding-based scores are conveyed to a human audience. Accounting for the implicit nature of the associations to be assessed, we attempted to establish this semantic validity based on a larger set of measures that may all be causally linked to implicit associations of the target group with admiration or fear in the stimulus sentence packages. We supposed that the more of the following preregistered hypotheses2 were supported, the stronger the evidence for the semantic validity of word embedding-based ISA measures was:
- H1: If word embedding-based measures indicate that a sentence package contains an ISA of a target group with fear, the representation of this target group will be perceived as more negative (compared to a sentence package with no ISA).
- H2: If word embedding-based measures indicate that a sentence package contains an ISA of a target group with fear, the representation of this target group will be perceived as more negative (compared to a sentence package with an ISA with admiration).
- H3: If word embedding-based measures indicate that a sentence package contains an ISA of a target group with admiration, the representation of this target group will be perceived as more positive (compared to a sentence package with no ISA).
- H4: If word embedding-based measures indicate that a sentence package contains an ISA of a target group with admiration, this target group will be perceived as (a) more admirable and as (b) less frightening (compared to a sentence package with no ISA).
- H5: If word embedding-based measures indicate that a sentence package contains an ISA of a target group with admiration, this target group will be perceived as (a) more admirable and as (b) less frightening (compared to a sentence package with an ISA with fear).
- H6: If word embedding-based measures indicate that a sentence package contains an ISA of a target group with fear, this target group will be perceived as (a) more frightening and as (b) less admirable (compared to a sentence package with no ISA).
Semantic Validation Procedure
As outlined above, establishing semantic validity of an automated content analytical measure means to assess whether its results are mirrored in human readers’ judgments of the same texts. In the case of word embedding-based metrics, which can only be calculated for large text corpora, it is nearly impossible to have human study participants read the full textual material that these measures are rating. We also usually do not know the sentences that introduce a specific association in the models. It is therefore not straightforward to arrive at a diminished corpus which could be processed by human readers. But by means of reverse engineering, we can attempt to reproduce those sentences within a corpus that probably contributed to the measured association expressed in scores derived from a word embedding model trained on said corpus. We can then extract those sentences from the corpus and have them rated by human study participants.
As previously outlined, our semantic validation routine followed this three-step process which we will describe in more detail in the following. The overall procedure for generating the sentence packages and testing their WEAT-based association scores within a large corpus is visualized in Figure 1. In addition to that, the table displayed in Appendix B offers a concise overview of the consecutive steps of text processing. We explain these individual steps in more detail in the following subsections.
Step 1: Identifying Potentially ISA-Inducing Sentence Packages
The idea behind implicit associations within word embeddings is that certain wordsets represent specific concepts. In the present study’s context, the attribute wordsets represent the emotions admiration and fear, operationalized via explicit emotion words, while the target wordset contains the group labels under study.
To validate this operationalization, in a first step, we generated packages of sentences that established such an implicit association between a group name and the two emotions fear and admiration from the original study’s corpus (Müller et al., 2023). For this purpose, two sentences are required: One that links a context word with an explicit emotion word (thereby, charging the context word with the respective emotion) and one that links a target group with a context word (thereby, implicitly charging the group label with the respective emotion). Such a sentence pair, for instance, looks like this:
- The Italian actor keeps personal matters private and rarely discusses family life.
- Audiences were moved by the actor’s admirable portrayal of a young musician.
These two sentences establish an implicit association between the target group label Italian and the concept admiration: the context word actor co-occurs with the explicit emotion word admirable in one sentence and with the group label in the other.
An example for the implicit association with fear is:
- The Italian from the Left Party blamed the government for the escalation.
- The mayor fears an escalation of violence in the region.
In this example, the context word escalation co-occurs with an explicit fear word as well as with the group label Italian. Importantly, to establish an implicit rather than an explicit link between the concepts, these linking sentences do not occur close to each other or even within the same documents of a text corpus, but are distributed across several texts within the corpus.
In stimulus sentence package construction, the overarching goal therefore was to look for context words (e.g., escalation in the above example) that co-occur both with explicit emotion words and with the target group label, and to extract the corresponding sentence pairs from the original corpus.
To identify context words, we selected terms that are (1) semantically close to the words in the respective emotion wordset and (2) clearly closer to that wordset than to the opposing emotion wordset. Subtracting a candidate word’s mean cosine similarity to the opposing emotion wordset from its mean similarity to the targeted emotion wordset yields a score that is high only for words distinctly associated with the targeted emotion. Given a word that scored high on this criterion, we retrieved the sentences from the original corpus in which it co-occurs with an explicit emotion word as well as the sentences in which it co-occurs with the group label. Using these sentences, we assembled candidate sentence pairs for each condition.
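The selection logic can be sketched as follows; the embeddings, wordsets, and ranking function below are invented for illustration and are not the original study’s implementation:

```python
import math
from statistics import mean

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def context_word_ranking(vocab, A, B, emb):
    """Rank candidate context words by how much closer they are to
    emotion wordset A than to opposing wordset B (difference of mean
    cosine similarities); attribute words themselves are excluded."""
    scores = {}
    for w in vocab:
        if w in A or w in B:
            continue
        scores[w] = (mean(cosine(emb[w], emb[a]) for a in A)
                     - mean(cosine(emb[w], emb[b]) for b in B))
    return sorted(scores, key=scores.get, reverse=True)

# Invented toy embeddings: "escalation" sits near the fear wordset,
# "talent" near the admiration wordset.
emb = {
    "fear": [0.1, 0.9], "threat": [0.2, 0.8],        # fear wordset
    "admiration": [0.9, 0.1], "praise": [0.8, 0.2],  # admiration wordset
    "escalation": [0.15, 0.85],
    "talent": [0.85, 0.15],
}
ranking = context_word_ranking(["escalation", "talent"],
                               ["fear", "threat"], ["admiration", "praise"], emb)
assert ranking[0] == "escalation"  # best fear context-word candidate
```

Top-ranked words would then serve as search terms for sentence pairs linking the emotion to the group label.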
We found that this method did not generate a sufficient number of sentence pairs associating the selected group label ‘Italian’ with the concept fear. Therefore, we additionally drew sentences for a similar group label (Spanish) and replaced that group name with Italian in all sentences. In this way, we created 347 sentence pairs for each condition. To make the final data set semantically meaningful for human readers, we manually removed duplicates, rephrased fragmentary and incoherent sentences, and arrived at one sentence-pair dummy corpus for each condition.
Step 2: Measuring ISAs of the Sentence Packages With WEAT-based Scores
Next, our goal was to assess whether the sentences extracted in Step 1 actually introduced the expected ISAs within the study corpus according to word embedding-based $NAS$ scores.
As word embedding models are hardly statistically robust if trained on very small corpora, such as our sentence packages, we had to re-introduce the sentence packages into the original study corpus first. For this purpose, before estimating $NAS$ scores, we stripped the original corpus of all other mentions of the target group and then injected the respective sentence package, resulting in one dummy corpus per condition.
With only about 300 mentions of the target group label remaining in each dummy corpus, single word embedding training runs can yield unstable association scores. We therefore trained repeated models per dummy corpus and inspected the resulting distributions of $NAS$ scores.
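The corpus manipulation underlying the dummy corpora can be sketched as follows; the group labels and sentences are invented placeholders, whereas the actual study used the full German news corpus:

```python
# Sketch of dummy-corpus construction (Step 2): remove every sentence
# mentioning the target group from the original corpus, then inject the
# stimulus sentence package so that only its ISAs remain measurable.
def build_dummy_corpus(corpus_sentences, sentence_package, group_labels):
    """Strip all sentences containing any group label, then append the
    injected sentence package."""
    stripped = [s for s in corpus_sentences
                if not any(label in s.lower() for label in group_labels)]
    return stripped + sentence_package

# Invented example sentences (German, as in the study corpus).
corpus = [
    "Die italienische Regierung stellte den Haushalt vor.",
    "Der Minister sprach über die Wirtschaftslage.",
]
package = ["Der Italiener von der Linkspartei machte die Regierung verantwortlich."]
dummy = build_dummy_corpus(corpus, package, ["italien"])
# Only the non-group sentence survives, followed by the injected package.
assert len(dummy) == 2
```

A word embedding model trained on such a dummy corpus can then only pick up associations of the group label introduced by the injected package.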
Figure 2:
Distribution of the NAS scores for the admiration, fear and control dummy corpora. Negative values denote more implicit fear, positive values more implicit admiration.
The median values diverge in the expected directions between all three dummy corpora: The admiration dummy corpus has a positive median $NAS$ score, the fear dummy corpus a negative one, and the control dummy corpus a median close to zero.
Step 3: Measuring ISAs of the Sentence Packages as Perceived By Humans
After testing whether our sentence packages generate the expected ISAs within the original corpus according to word embedding-based measures, we were ready to test whether the same sentences also generate ISAs that human readers pick up. To do so, we drew a random sample of sentences for each condition—featuring 25 sentence pairs (i.e., 50 sentences in total). This additional reduction step was performed to avoid wear-out effects and unit non-response provoked by an overly lengthy stimulus exposure. Our pretest showed that native German speakers read 25 sentence pairs (or 50 sentences) in less than 10 minutes.
To still account for the semantic variance within the sentence packages, we drew three random samples of 25 sentence pairs as a stimulus package for each condition, giving us three stimulus packages of admiration sentence pairs, three stimulus packages of fear sentence pairs, and three stimulus packages of neutral sentence pairs for the control condition. In total, nine different stimulus sentence packages were created (3 ISA conditions × 3 random samples).
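The sampling of stimulus packages can be sketched as follows; the pair identifiers are placeholders, and the fixed random seed is an added assumption to make the draw reproducible:

```python
import random

def draw_stimulus_packages(pairs_by_condition, n_pairs=25, n_samples=3, seed=42):
    """For each condition, draw n_samples independent random samples of
    n_pairs sentence pairs from the full sentence package."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    packages = {}
    for condition, pairs in pairs_by_condition.items():
        for i in range(n_samples):
            packages[(condition, i)] = rng.sample(pairs, n_pairs)
    return packages

# Hypothetical input: 347 sentence pairs per condition, as in Step 1.
pairs = {c: [f"{c}-pair-{j}" for j in range(347)]
         for c in ("admiration", "fear", "control")}
packages = draw_stimulus_packages(pairs)
assert len(packages) == 9                           # 3 conditions x 3 samples
assert all(len(p) == 25 for p in packages.values())  # 25 pairs per stimulus
```

Each of the nine resulting packages corresponds to one experimental stimulus of 50 sentences.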
Participants. To conduct the planned 3 (ISA condition) × 3 (sentence sample) between-subjects experiment, we recruited a large sample of native German-speaking participants.
Procedure. Upon arrival at the survey platform, participants were given detailed information about the study (without unmasking the actual research purpose) and actively consented to participation. Each participant was then randomly allocated to read one out of the nine stimuli. Randomization checks found no significant differences between the treatment groups regarding sociodemographics and political left-right orientation. Each stimulus contained 50 news sentences, which were displayed on five subsequent pages, presenting 10 news sentences each. All sentences were drawn in a random order from the respective stimulus sentence package during the experiment. To avoid priming effects, participants were instructed to attend to the language and tonality journalists use in the sentences in general and were told that they would be asked to rate this language on a number of dimensions after exposure. After stimulus exposure, the dependent variables were assessed. Finally, participants were thanked for their time and fully debriefed.
Survey Measures. Following stimulus exposure, we first asked participants to rate the language of the sentences they had read (e.g., concerning comprehensibility and complexity). This was included as a distraction task and to fulfill participants’ expectations that they would have to rate journalists’ language on multiple dimensions. Subsequently, we used one item to capture the perceived valence of the sentence packages. Similar to a feeling thermometer, we assessed the negativity/positivity participants felt the sentence package they read expressed towards the group of Italians, ranging from “very positive” to “very negative” on a 7-point scale (“How positive/negative is the portrayal of Italians in the sentences you have read?”). This item was used to test hypotheses H1, H2, and H3. To test hypotheses H4, H5, and H6, the next two single-item measures captured perceived admiration and perceived fear on a 7-point scale, ranging from “very much” to “very little” (“How much admiration/fear was expressed towards Italians in the sentences you have read?”). The original German versions of all measures are available in Appendix C in the OSF repository.
Statistical Analysis. Following our preregistered analysis plan, we applied Bayesian Analysis of Variance to test our hypotheses (Bürkner, 2017). We chose the Bayesian approach for its pragmatic advantages, such as yielding directly interpretable uncertainty statements about parameters and providing mild regularization of estimates via informative priors. We report both (1) the conditional effects and (2) the effect sizes for all hypothesis tests.
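For illustration, a standardized mean difference between two treatment groups can be computed as follows. This is a frequentist sketch only; the study’s actual estimates come from the Bayesian model, and the ratings below are invented:

```python
from statistics import mean, stdev

def standardized_mean_difference(group_a, group_b):
    """Cohen's d with a pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2 +
                  (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Invented 7-point perceived-negativity ratings for two conditions.
fear_ratings = [6, 5, 6, 7, 5, 6]
control_ratings = [4, 3, 4, 4, 3, 5]
d = standardized_mean_difference(fear_ratings, control_ratings)
assert d > 0  # fear condition rated more negative than control
```

The Bayesian model additionally yields a full posterior distribution for such a difference rather than a single point estimate.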
Results
Perceived Valence
Figure 3:
Conditional effect plots of perceived valence (top), perceived fear (center), and perceived admiration (bottom)
Figure 3 displays the conditional effect plots for perceived valence. The top subplot shows the conditional effect of perceived valence as a function of the ISA measured for the received stimulus treatment. Results reveal that participants in the fear condition perceived the representation of this group in the sentences they had read as much more negative than participants in the other two conditions. This supports H1 and H2. Moreover, participants in the admiration condition perceived a more positive representation of the target group than those in the control condition, supporting H3. The effect size
Perceived Admiration
The center subplot of Figure 3 shows the conditional effect of perceived admiration as a function of the ISA measured for the received stimulus treatment. Participants in the admiration condition perceived the representation of Italians as clearly more admirable than those in the control or fear conditions, supporting H4a and H5a. In contrast, participants in the fear condition found the target group presentation to convey less admiration than those in the control condition, supporting H6b. The effect size
Perceived Fear
The bottom subplot of Figure 3 shows the conditional effect of perceived fear as a function of ISA measured for the received stimulus treatment. Participants in the admiration condition did not perceive the representation of Italians as less frightening than individuals in the control condition. H4b is therefore not supported. However, participants in the admiration condition perceived a less frightening representation of Italians than participants in the fear condition, supporting H5b. Likewise, respondents in the fear condition perceived the representation of Italians as more frightening than those in the control condition. H6a is supported. The effect size
Discussion
The goal of this study was to test the semantic validity of word embedding-based measures of implicit stereotypical associations (ISAs) in large text corpora. We argued that when word embedding association measures are employed to draw conclusions about the content of these corpora, and not just as representations of exogenous cultural patterns, predictive validity does not suffice. It is necessary to establish the semantic validity of these measures to be able to make claims about these texts.
To achieve this goal, we employed an experimental survey approach. This somewhat unusual validation strategy was chosen to compensate for the fact that traditional gold standard coding seemed inapplicable for the aggregated content analytical method that was to be validated in this study. Word embedding-based ISA metrics analyze implicit associations within a whole corpus in order to draw conclusions about the content of that corpus. We deemed it impracticable for a limited number of trained human coders to make a reliable and thus reproducible generalizing judgment of ISAs within a whole corpus, particularly so since such associations are only implicitly present in the corpus.
We therefore developed and employed a validation method that draws on judgments from a large number of untrained human coders, assuming that, following the law of large numbers, occasional erroneous individual judgments would not dominate the results because the aggregate patterns would regress to the (non-erroneous) mean. We conducted the data collection as a survey using typical items designed to assess given media stimuli, instead of a traditional coding task following precise coding instructions. We reasoned that a relatively unreflected, more intuitive assessment of the stimulus sentence packages (likely to follow a heuristic processing routine within participants; Chaiken et al., 1989) would detect ISAs better than the traditional coding routine, in which reflected, thorough decision-making, and thus systematic processing, is the explicit goal.
For demonstration purposes, we used the openly available analysis corpus from a recent application of word embedding-based ISA measures for media content analysis (Müller et al., 2023), which investigated implicit stigmatization of ethnic and religious groups in journalistic discourse, focusing on the implicit association of group labels with the emotions fear and admiration. A number of methodological decisions in this validation attempt had to be tailored to this specific study to enable semantic validation at all. For instance, the original study (Müller et al., 2023) covered a large variety of ethnic groups. For our experimental semantic validation, selecting one group out of this larger variety was a crucial step to keep group-level background factors constant in the experiment. For this group, the original article corpus needed to provide a sufficient number of sentences associating it with both emotions, fear and admiration. For other semantic validation attempts of word embedding-based ISA measures, other factors might be more important when choosing the selection strategy for the sentence packages used, and the same will be true for many other methodological decisions made during the planning of the present study, depending on the design of the application that is to be validated.
For instance, for studies based on one-sided word embedding bias tests (Kroon et al., 2021), finding sentences that contain the opposite valence of the measured association would be more challenging, since our approach relies on both the inverted distance to one end of the spectrum and the distance to the other end to identify clear context words. Possible solutions would be either to rely solely on the distance to the measured end of the spectrum, or to artificially construct terms that represent the implicit opposite end of the measured dimension. Another interesting case would be the validation of gender stereotypes, as in Garg et al. (2018). In one of their analyses, for example, instead of measuring the association of different groups with two emotions, the authors measured the association of a large number of occupations with two genders. A validation study of this research that followed our procedure would therefore have to choose a set of target occupations and construct an artificial association with different genders. This would be an interesting robustness test for our validation procedure, as constructing artificial associations of gender with occupations can be counter-intuitive to readers who are used to the opposing stereotypes; for example, it would be interesting to see whether a set of stimulus sentences suffices to induce an association counter to the observed stereotype, such as an association of men with nurse. A larger variance of target words would probably be necessary to control for more and less salient stereotypes.
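To make the bipolar logic concrete: the relative association underlying such measures can be expressed as the difference between a target word's average cosine similarity to each of two attribute poles, and a one-sided test would keep only one of the two terms. The following is a minimal illustrative sketch with toy two-dimensional vectors, not the implementation used in the validated study; all function names are our own.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bipolar_isa(target, pole_a, pole_b):
    """Bipolar association score: positive values mean the target is
    closer to pole_a (e.g., fear), negative values closer to pole_b
    (e.g., admiration)."""
    sim_a = np.mean([cosine(target, v) for v in pole_a])
    sim_b = np.mean([cosine(target, v) for v in pole_b])
    return sim_a - sim_b

def one_sided_isa(target, pole_a):
    """One-sided variant: only the similarity to one attribute pole."""
    return float(np.mean([cosine(target, v) for v in pole_a]))
```

In a real application the toy vectors would be replaced by pretrained embeddings (e.g., GloVe vectors for the group label and the attribute word lists).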
Arguably, such further validation attempts are necessary. The present study constitutes just one successful validation of word embedding-based ISA measures in a specific application, namely Müller et al. (2023). The between-subjects survey experiment presented in this article largely supported the assumption that participants’ perception of the stimulus sentence packages was in line with the measured ISAs, thus representing first evidence for the semantic validity of the word embedding-based approach for investigating ISAs (Durrheim et al., 2023). The successful semantic validation of an exemplary case, however, does not warrant inferences about the general semantic validity of the method. The consensus in computational communication science is that there is no off-the-shelf method and that each method requires individual validation to show its (semantic) validity for one’s research data (Atteveldt & Peng, 2018; Baden et al., 2021). Therefore, this study can only be considered a first step in establishing the semantic validity of word embedding-based measures of ISAs. Further steps will necessarily have to follow, particularly considering Krippendorff’s (2018, p. 323) notion that semantic validity is context dependent. Yet, if multiple future validation studies come to conclusions similar to the present one, this could be interpreted as cumulative evidence for the general validity of word embedding-based ISA measures. But even then, all future applications would still require individual validation efforts.
In the specific context of the present validation, some findings call for more in-depth engagement. For instance, the results showed that, for perceived fear as a dependent variable, there was no difference between the admiration and the no-ISA stimuli. Conversely, perceived admiration of the target group was significantly lower in the fear stimulus condition than in the no-ISA condition. This pattern does not call into question the semantic validity of the tested word embedding-based ISA measure in principle. But it should prompt reconsideration of the decision to use fear and admiration as the end poles of an emotion continuum, as the preliminary work by Müller et al. (2023) did. There, it was argued that the two emotions are functionally equivalent in group enhancement and devaluation in media reporting. As the present validation has shown, they are indeed causally linked, but only partially. Implicit fear-inducing messages not only increase the perceived fear of the group but also reduce the positive emotion of admiration. Admiration-inducing messages, however, do not weaken the negative emotion of fear. The communicative hurdles for overcoming fear of ethnic groups appear to be higher than those for eliminating admiration. This substantive finding of the present validation study should stimulate further research from an intergroup communication perspective. It adds a crucial facet to the discussion of the original study’s results that goes beyond mere methodological validation.
Related to this, we observed another interesting pattern when constructing the sentence packages for the present validation study: a clearer difference could be generated between the control and the fear conditions than between the admiration and control conditions. The survey experiment confirmed this impression: participants evaluated the control sentence packages as closer to the admiration packages. This observation holds particularly for the overall perceived valence of the stimuli. One explanation could be that the negativity bias in reporting (Soroka & McAdams, 2015) leads to more cases of directly expressed fear in news reporting, while admiration is more dispersed and subtle. This could explain why the admiration sentences we found contained, at face value, less obvious traces of admiration, which both the word embedding ISA model and the participants’ responses seem to confirm. This could be taken as additional evidence challenging the decision to use fear and admiration as end poles of a two-dimensional emotion scale in Müller et al. (2023).
However, as the survey responses are largely in line with this pattern, they should still be interpreted as evidence for the overall semantic validity of the word embedding-based ISA measure tested in this study. The measure appears able to detect both more explicitly expressed associations (resulting in higher ISA metrics) and largely implicit associations (resulting in lower ISA metrics that are nevertheless distinct from zero) in line with human judgments. Thus, the present semantic validation study supports the general idea that word embedding-based measures can detect ISAs in texts as human readers would. At the same time, it underscores the importance of making an informed and reflected choice about which concepts to contrast in such a necessarily bipolar measure. More broadly, the observations reported here underscore the value of semantic validation, not only for ascertaining the validity of content-analytical measurements, but also for refining their conceptual underpinnings, and thus for substantive theorizing.
Limitations and Future Research
The present validation attempt, of course, has some limitations. First, there is the question of the scalability of its results: how representative are our constructed, relatively small sentence packages of ISA-containing sentences for naturally occurring corpora with far larger numbers of group mentions and greater noise, particularly within more “neutral” sentences? As our results indicate, we were able to construct very convincing sentence packages for the fear-association conditions, but the “no ISA” sentence packages, both during a face validity check and in the survey experiment, appear to contain a visible residue of ISA with emotional valence and higher semantic variance. One avenue for future research would be to find a way to scale the individual contribution of terms to the overall ISA model (both in terms of their prevalence and in terms of their effect on the resulting ISAs) to obtain a more fine-grained assessment of each sentence’s actual contribution. This would make it possible to vary the degree of implicit association within the stimulus in a linear fashion, rather than using the current three-level ordinal ISA scale for classifying the sentence packages.
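One conceivable way to quantify each term's contribution is a leave-one-out decomposition of a sentence-level association score. The sketch below assumes word vectors are available as numpy arrays and scores a sentence as the simple mean of its words' bipolar associations; both the function names and this mean-based sentence score are our own illustrative choices, not the validated measure itself.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def word_association(vec, pole_a, pole_b):
    """Bipolar association of a single word vector with two attribute poles."""
    return (np.mean([cosine(vec, v) for v in pole_a])
            - np.mean([cosine(vec, v) for v in pole_b]))

def term_contributions(sentence_vecs, pole_a, pole_b):
    """Score a sentence as the mean word-level association, then compute
    each word's leave-one-out contribution to that score."""
    full = np.mean([word_association(v, pole_a, pole_b)
                    for v in sentence_vecs])
    contribs = []
    for i in range(len(sentence_vecs)):
        rest = sentence_vecs[:i] + sentence_vecs[i + 1:]
        loo = np.mean([word_association(v, pole_a, pole_b) for v in rest])
        contribs.append(full - loo)  # positive: word pulls score up
    return full, contribs
```

Such per-term contributions could then be used to assemble stimulus sentences whose degree of implicit association varies continuously rather than ordinally.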
In this study, we validated word embedding-based ISA measures for just one group that, within the original reporting, was portrayed relatively neutrally with regard to the two tested emotions, fear and admiration. The selection of this group was based on the assumption that implicit stereotypes towards such a group are more malleable, making it easier to measure primed perceptions of that group after exposure to a few sentence pairs. A more complex setup would have to test whether the same semantic validation also succeeds for groups with, presumably, more established stereotypical associations with either of the two emotions. It may be that the experiment-based semantic validation routine presented in this study does not work for groups that are subject to strongly one-sided prejudices in public perception.
Another limitation could be seen in the participant sample used for the present study. Perceptions of the stereotype intensity conveyed by the same implicit associations plausibly vary with political orientation, lived experience (including being a target of stereotyping), and socio-economic and cultural–linguistic background. For example, Italian immigrants may read the affective implications differently from members of the ethnic majority in Germany. This raises a normative question: whose perceptions should count in semantic validation? The present experiment establishes semantic validity for a reference population that is somewhat skewed towards higher education, but diverse in terms of other socio-demographic factors and political attitudes (Leiner, 2016). Considering that typical gold-standard validation studies often rely exclusively on highly educated student assistants who are even less diverse in terms of political orientation and age than our sample, we deemed the dominance of highly educated individuals in the sample acceptable. However, one could argue that, if the goal is to assess the reception of a general audience, a truly representative sample is required as a next step. If the goal is to assess potential harm, it would be advisable to oversample targeted groups. Semantic validation studies employing an experimental approach should therefore carefully consider which kind of benchmark they are aiming for and what participant sample structure is required to achieve it.
Finally, it must be mentioned that the approach to measuring ISAs that we semantically validated in the present study uses so-called static word embeddings such as GloVe. However, newer models (e.g., BERT, ELMo, and GPT) can generate so-called contextual word embeddings. Methods such as WEAT have recently been extended by the original authors to cover contextual word embeddings (Guo & Caliskan, 2021), and there are new applications of contextual word embeddings in communication research (Thijs et al., 2024). The evidence of semantic validity presented in this article is certainly not directly transferable to ISAs measured using contextual word embeddings. Yet, its survey experimental approach can serve as the basis for validating ISAs found through contextual word embeddings, too. For this purpose, one might need to follow the multi-level approach of Guo & Caliskan (2021): 1) generate the same set of stimuli for various contexts, and then 2) combine the effect sizes from different contexts using a random-effects model.
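The random-effects combination in step 2 could, for instance, follow a standard DerSimonian–Laird estimator, a common choice in meta-analysis. The sketch below is a generic illustration of that estimator, not Guo & Caliskan's (2021) exact implementation.

```python
import numpy as np

def random_effects_mean(effects, variances):
    """Combine per-context effect sizes under a DerSimonian-Laird
    random-effects model; returns the pooled effect and its standard error."""
    e = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                   # fixed-effect weights
    mu_fe = np.sum(w * e) / np.sum(w)             # fixed-effect pooled mean
    q = np.sum(w * (e - mu_fe) ** 2)              # Cochran's Q (heterogeneity)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(e) - 1)) / c)       # between-context variance
    w_re = 1.0 / (v + tau2)                       # random-effects weights
    mu = np.sum(w_re * e) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    return mu, se
```

Here each effect would be one context's ISA effect size and each variance its sampling variance; with no heterogeneity across contexts, the estimate reduces to the inverse-variance-weighted mean.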
At the same time, newer word embedding methods could ease validation in terms of the required computational effort: when conducting the present analysis based on GloVe word embeddings, our main analysis took seven days to compute, while the robustness check took about five weeks to run on a university HPC cluster. This process could be sped up significantly with word embedding models optimized for running on graphics cards instead of CPUs, allowing a larger variance of parameters and combinations of contexts to be included in the analysis within sensible timescales.
Conclusion
Despite the aforementioned limitations, the present results are more than encouraging for the application of word embedding-based corpus-level metrics in the domain of computational communication analysis. While previous research had already tested the statistical and predictive validity of word embedding-based ISA detection methods (Durrheim et al., 2023), the present study complements the picture by offering first evidence for their semantic validity. This should be read as further consolidation of the assumption that word embedding models can capture and quantify actual implicit associations within text corpora as perceived by human readers. For the time being, a broader application of this method for measuring various kinds of associations within media (and other) texts does seem promising. However, researchers applying word embedding-based metrics in the future should, of course, be aware that the present semantic validation (even in conjunction with previous statistical and predictive validations of the method) may be limited in its transferability to other research domains.
Acknowledgments
This research was supported by the German Federal Ministry for Family Affairs, Senior Citizens, Women and Youth (BMFSFJ) through a grant to the “Research Association Discrimination and Racism” (FoDiRa) of the DeZIM-Research Community (German Center for Integration and Migration Research).